Commit Graph

4 Commits

Author SHA1 Message Date
Hongming Wang
c77a88c247 chore(security): pin Actions to SHAs + enable Dependabot auto-bumps
Supply-chain hardening for the CI pipeline. 23 workflow files
modified, 59 mutable-tag refs replaced with commit SHAs.

The risk

Every `uses:` reference in .github/workflows/*.yml pointed at a
mutable tag (e.g., `actions/checkout@v4`). A maintainer of an
action, or an attacker holding a compromised maintainer account,
can repoint that tag to malicious code, and our pipelines silently
pull it on the next run. The tj-actions/changed-files compromise of
March 2025 is the canonical example: after a maintainer credential
leak, the attacker repointed several `@v<N>` tags to a payload that
exfiltrated repository secrets. Repos that pinned to SHAs were
unaffected.

The fix

Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing
comment preserves human readability ("ah, this is v4"); the SHA
makes the reference immutable.
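
As a hedged illustration of the rewrite, a single step might change
as sketched here. The job name and layout are hypothetical, and
`<commit-sha>` is a placeholder for the full 40-character commit SHA
of the release being pinned; no real SHA is reproduced.

```yaml
# Hypothetical workflow fragment; job and step layout are illustrative.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # Before: mutable tag, which upstream can repoint at any time.
      # - uses: actions/checkout@v4

      # After: immutable commit reference. <commit-sha> is a placeholder for
      # the full 40-character SHA; the trailing comment records the release.
      - uses: actions/checkout@<commit-sha> # v4
```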

Actions covered (10 distinct):
  actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script}
  docker/{login-action,setup-buildx-action,build-push-action}
  github/codeql-action/{init,autobuild,analyze}
  dorny/paths-filter
  imjasonh/setup-crane
  pnpm/action-setup (already pinned in molecule-app, listed here for completeness)

Excluded:
  Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
    — internal org reusable workflow; we control its repo, threat model
    is different from third-party actions. Conventional to pin to @main
    rather than SHA for internal reusables.
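
For contrast, a caller of the excluded reusable workflow keeps its
branch reference. This is a minimal sketch; the job name is an
assumption, and only the `uses:` path comes from this commit.

```yaml
# Hypothetical caller job; only the `uses:` path is taken from the commit.
jobs:
  disable-auto-merge:
    # Reusable workflows are referenced at the job level, not as a step.
    # Left on @main because the org controls this repository.
    uses: Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
```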

The maintenance cost

SHA pinning means upstream fixes require manual SHA bumps. Without
automation, pinned SHAs go stale. So this PR also enables Dependabot
across four ecosystems:

  - github-actions (workflows)
  - gomod (workspace-server)
  - npm (canvas)
  - pip (workspace runtime requirements)

Weekly cadence. The supply-chain attack window is "minutes between
repoint and pull", so auto-bumps don't help with zero-days at any
cadence. The point is to pull in non-zero-day fixes without
operator effort.
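
A minimal sketch of what such a .github/dependabot.yml could look
like. The four ecosystems and the weekly interval come from this
commit; the `directory` values are assumptions inferred from the
component names and may not match the real repository layout.

```yaml
# Sketch only: directory paths below are assumed, not taken from the repo.
version: 2
updates:
  # Bumps the pinned SHAs in .github/workflows/*.yml.
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"

  # Go modules for workspace-server (path assumed).
  - package-ecosystem: "gomod"
    directory: "/workspace-server"
    schedule:
      interval: "weekly"

  # npm packages for canvas (path assumed).
  - package-ecosystem: "npm"
    directory: "/canvas"
    schedule:
      interval: "weekly"

  # pip requirements for the workspace runtime (path assumed).
  - package-ecosystem: "pip"
    directory: "/workspace"
    schedule:
      interval: "weekly"
```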

Aligns with user-stated principle: "long-term, robust, fully-
automated, eliminate human error."

Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern,
smaller surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:37:06 -07:00
Hongming Wang
f1792e1f7a fix(ci): stop sweep-cf-orphans noise — drop merge_group + soft-skip when secrets unset
The sweep-cf-orphans workflow shipped in #2088 was noisier than
intended in two ways. This PR fixes both. The issue was filed as an
Optional finding in my original review and now matters because the
noise is observably hitting the merge queue.

1) `merge_group: types: [checks_requested]` was firing the entire
   sweep job on every PR through the merge queue. The original intent
   ("future required-check support without a workflow edit") never
   materialized, and meanwhile every recent merge-queue eval (#2091,
   #2092, #2093, #2094, #2095, #2097) generated a red `Sweep CF
   orphans (merge_group)` run.

   Drop the trigger. Comment in the workflow explains the re-add path
   if/when the workflow IS wired as a required check (re-add the
   trigger AND gate the actual sweep step with
   `if: github.event_name != 'merge_group'` so merge-queue evals are
   no-op success).

2) The `Verify required secrets present` step exits 2 when the 6
   secrets aren't configured yet (the PR body's post-merge step,
   still pending). That turns the hourly schedule into an hourly red
   CI run for as long as the secrets stay unset.

   Convert to a soft skip: emit a `::warning::` listing the missing
   secrets and set a `skip=true` step output, then gate the sweep
   step with `if: steps.verify.outputs.skip != 'true'`. Workflow
   reports green and ops still sees the warning when they review
   recent runs.
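
The soft skip in item 2 could be sketched roughly as follows. The
step name, the `verify` id, the `skip` output, and the `if:` gate
come from this commit, and the six secret names come from the
original workflow commit. The loop body, the warning text, and the
script path are assumptions, not the actual implementation.

```yaml
# Sketch of the soft skip; loop body and script path are assumptions.
steps:
  - name: Verify required secrets present
    id: verify
    shell: bash
    env:
      CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }}
      CF_ZONE_ID: ${{ secrets.CF_ZONE_ID }}
      CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }}
      CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }}
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    run: |
      missing=""
      for name in CF_API_TOKEN CF_ZONE_ID CP_PROD_ADMIN_TOKEN \
                  CP_STAGING_ADMIN_TOKEN AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        # ${!name} is bash indirect expansion: the value of the variable
        # whose name is stored in $name. Unset secrets arrive as empty strings.
        if [ -z "${!name:-}" ]; then
          missing="$missing $name"
        fi
      done
      if [ -n "$missing" ]; then
        # Annotates the run with a warning but leaves the step green.
        echo "::warning::Skipping sweep; missing secrets:$missing"
        echo "skip=true" >> "$GITHUB_OUTPUT"
      fi

  - name: Sweep CF orphans
    if: steps.verify.outputs.skip != 'true'
    run: ./scripts/sweep-cf-orphans.sh   # hypothetical path
```

Note that this sketch soft-skips whenever any secret is missing. The
commit does not spell out how the real step tells "never configured"
apart from "later removed", so that distinction is not modeled here.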

Net effect:
- merge-queue evals stop generating spurious red runs
- the schedule reports green-with-warning until secrets land
- once secrets land, behavior is identical to today's (real sweep
  runs, hard-fails if a secret is later removed)
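
The re-add path from item 1 might look like this sketch if the
workflow is ever wired as a required check. The trigger and the
`if:` expression are quoted from this commit, and the :15 cron comes
from the original workflow commit. The job layout and script path
are assumptions, and checkout and secret verification are omitted.

```yaml
# Sketch of the re-add path; job layout and script path are assumptions.
on:
  schedule:
    - cron: "15 * * * *"
  workflow_dispatch:
  # Re-added only once the workflow is a required check.
  merge_group:
    types: [checks_requested]

jobs:
  sweep:
    runs-on: ubuntu-latest
    steps:
      - name: Sweep CF orphans
        # Merge-queue evaluations skip this step, so the job finishes
        # as a no-op success instead of running a real sweep.
        if: github.event_name != 'merge_group'
        run: ./scripts/sweep-cf-orphans.sh   # hypothetical path
```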

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 08:05:53 -07:00
rabbitblood
0ae6b201b4 refactor(ci): apply simplify findings on PR #2088
- Drop redundant 'aws --version' step. Script's own 'aws ec2
  describe-instances' fails just as loud with a more actionable
  error; the pre-check added ~1s with no signal value.
- timeout-minutes 10 → 3. Realistic worst case is ~2min (4 curls +
  1 aws + N×CF-DELETE each individually capped at 10s by the
  script's curl -m flag). 3 surfaces hangs within one cron tick
  instead of burning the full interval.
- Document the schedule-vs-dispatch dry-run asymmetry inline so
  the next reader doesn't need to trace input defaults.
- Add merge_group: types: [checks_requested] for queue parity with
  runtime-pin-compat.yml — cheap insurance if this ever becomes a
  required check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 04:18:24 -07:00
rabbitblood
3c18b76aa7 ops(cf): hourly sweep workflow for orphan Cloudflare DNS records (#239)
Closes Molecule-AI/molecule-controlplane#239.

The CF zone hit the 200-record quota from 2026-04-23 onward: every
E2E and canary run left a record on moleculesai.app, and no scheduled
job pruned them. Provisions started failing with code 81045
('Record quota exceeded').

The sweep-cf-orphans.sh script (PR #1978, with decision-function
unit tests added in #2079) already exists but no workflow fires it.
Adding it here as a parallel janitor to sweep-stale-e2e-orgs.yml:

- hourly schedule at :15 (offset from the e2e-orgs sweep at :00 so
  the two converge cleanly without racing the same CP admin endpoint)
- workflow_dispatch with dry_run input default true (ad-hoc verify
  without committing to deletes)
- workflow_dispatch with max_delete_pct input for major cleanups
  (the script's own MAX_DELETE_PCT defaults to 50% as a safety gate)
- concurrency group prevents schedule + manual-dispatch from racing
  the same zone
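
The trigger and concurrency setup in the list above could be
sketched as follows. The cron offset, the two input names, the
dry_run default, and MAX_DELETE_PCT come from this commit.
Everything else is an assumption: the concurrency group name, the
DRY_RUN variable, the input default of "50", the script path, and
the choice to have scheduled runs perform real deletes.

```yaml
# Sketch only: group name, DRY_RUN, and the script path are hypothetical.
name: Sweep CF orphans

on:
  schedule:
    # Hourly at :15, offset from the e2e-orgs sweep at :00.
    - cron: "15 * * * *"
  workflow_dispatch:
    inputs:
      dry_run:
        description: "Report only; do not delete records"
        type: boolean
        default: true
      max_delete_pct:
        description: "Override for the script's MAX_DELETE_PCT safety gate"
        type: string
        default: "50"

# One shared group, so a scheduled run and a manual dispatch never
# sweep the same zone at the same time; the later run waits.
concurrency:
  group: sweep-cf-orphans
  cancel-in-progress: false

jobs:
  sweep:
    runs-on: ubuntu-latest
    steps:
      - name: Sweep CF orphans
        env:
          # Manual dispatch honors the dry_run input (default true);
          # scheduled runs have no inputs and are assumed to delete for real.
          DRY_RUN: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run || false }}
          # Falls back to 50 on scheduled runs, where inputs are empty.
          MAX_DELETE_PCT: ${{ inputs.max_delete_pct || '50' }}
        run: ./scripts/sweep-cf-orphans.sh   # hypothetical path
```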

Why a separate workflow vs sweep-stale-e2e-orgs.yml:
- That workflow drives DELETE /cp/admin/tenants/:slug, assumes CP
  has the org row. Doesn't catch records left when CP itself never
  knew about the tenant (canary scratch, manual ops experiments)
  or when the CP-side cascade's CF-delete branch failed.
- sweep-cf-orphans.sh enumerates the CF zone directly + matches
  against live CP slugs + AWS EC2 names. Catches what the CP-driven
  sweep can't.

Required secrets (will need to be set on the repo): CF_API_TOKEN,
CF_ZONE_ID, CP_PROD_ADMIN_TOKEN, CP_STAGING_ADMIN_TOKEN,
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY. Pre-flight verify-secrets
step fails loud if any are missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 04:16:43 -07:00