molecule-core

Author	SHA1	Message	Date
github-actions[bot]	475a51adec	fix(ci): defer promote when E2E is racing with publish (review fix) Self-review caught a real correctness bug: the scenario where publish-workspace-server-image completes BEFORE E2E Staging SaaS for a runtime-touching SHA. Publish typically takes ~5-10min; E2E ~10-15min, so this ordering is the common case for runtime-path PRs. Previous gate logic: completed/success: proceed; completed/failure: abort; everything else (including in_progress): proceed ← BUG. If publish-trigger fires while E2E is still running, the gate returned "in_progress/none" and fell through the catch-all "proceed" branch. Result: :latest retagged on the publish signal alone. Then E2E ends red, but :latest was already wrongly advanced; the E2E-completion trigger's job-level if=conclusion==success filter just skips, never rolls back. Fix: explicit case for in_progress|queued|requested|waiting|pending that DEFERS: sets gate.proceed=false, writes a "deferred" summary, exits 0 (workflow run shows success, retag steps skipped). The E2E completion trigger then fires later and either promotes (green) or aborts (red), giving us correct ordering regardless of who finishes first. Subsequent steps now guarded by `if: steps.gate.outputs.proceed == 'true'` instead of relying on `exit 1` for skip semantics. Also added an explicit catch-all `*)` branch that aborts on unknown states (forward-compat: GitHub adds a new status, we surface it instead of silently promoting through it).	2026-04-28 16:59:58 -07:00
github-actions[bot]	f4f45f8561	fix(ci): auto-promote :latest also on publish-image, not just E2E Previously this workflow only triggered on E2E Staging SaaS completion, which is itself paths-filtered to runtime handlers (workspace-server/internal/handlers/{registry,workspace_provision,a2a_proxy}.go, middleware/, provisioner/). publish-workspace-server-image fires on a STRICTLY BROADER path set (workspace-server/, canvas/, manifest.json), so canvas-only or cmd-only or sweep-only PRs rebuilt the platform image without ever advancing :latest. Result observed 2026-04-28: zero runs of this workflow since merge despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main. Fix: add publish-workspace-server-image as a second trigger. Add an explicit gate inside the job that aborts when E2E Staging SaaS for the same SHA ended red. When E2E didn't fire (paths-filtered), proceed: auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API + CodeQL on staging) already validated this SHA before main moved. Concurrency group serializes promotes per-SHA so the publish+E2E both-fired race lands cleanly. Idempotent crane tag makes it safe regardless.	2026-04-28 16:53:30 -07:00
Hongming Wang	c77a88c247	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps Supply-chain hardening for the CI pipeline. 23 workflow files modified, 59 mutable-tag refs replaced with commit SHAs. The risk: every `uses:` reference in .github/workflows/*.yml was pinned to a mutable tag (e.g., `actions/checkout@v4`). A maintainer of an action, or a compromised maintainer account, can repoint that tag to malicious code, and our pipelines silently pull it on the next run. The tj-actions/changed-files compromise of March 2025 is the canonical example: maintainer credential leak, attacker repointed several `@v<N>` tags to a payload that exfiltrated repository secrets. Repos that pinned to SHAs were unaffected. The fix: replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing comment preserves human readability ("ah, this is v4"); the SHA makes the reference immutable. Actions covered (10 distinct): actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script} docker/{login-action,setup-buildx-action,build-push-action} github/codeql-action/{init,autobuild,analyze} dorny/paths-filter imjasonh/setup-crane pnpm/action-setup (already pinned in molecule-app, listed here for completeness). Excluded: Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main, an internal org reusable workflow; we control its repo, so the threat model is different from third-party actions. Conventional to pin to @main rather than SHA for internal reusables. The maintenance cost: SHA pinning means upstream fixes require manual SHA bumps. Without automation, pinned SHAs go stale. So this PR also enables Dependabot across four ecosystems: github-actions (workflows), gomod (workspace-server), npm (canvas), pip (workspace runtime requirements). Weekly cadence: the supply-chain attack window is "minutes between repoint and pull"; weekly auto-bumps don't help with zero-days regardless. The point is to pull in non-zero-day fixes without operator effort. Aligns with user-stated principle: "long-term, robust, fully-automated, eliminate human error." Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern, smaller surface). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:37:06 -07:00
Hongming Wang	9d4ab7b1a2	feat(ci): auto-promote-on-e2e (retag :latest on green E2E Staging SaaS) Closes the final gap in the SaaS pipeline. After auto-promote-staging fast-forwards main, publish-workspace-server-image builds new `:staging-<sha>` images, but `:latest` (what prod tenants pull) only moves on either a manual `promote-latest.yml` dispatch or a canary-verify retag (gated on Phase 2 fleet that doesn't exist). This workflow closes that gap by retagging `platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest` whenever E2E Staging SaaS passes for a `main` push. Uses crane (no Docker daemon needed). Verifies both images exist before retagging either, so a half-published state is impossible. Why trigger only on `main` (not staging): - `:latest` is what prod tenants pull. Only SHAs that have reached `main` (via auto-promote-staging) should advance `:latest`. - Triggering on staging would let a staging-only revert advance `:latest` to a SHA that never reaches `main`, breaking the invariant "production runs what's on `main`". Why a separate workflow rather than folding into e2e-staging-saas.yml: - Test concerns and release concerns separate. - Disabling promote during an incident is one workflow toggle, not an edit to the long E2E file. - When Phase 2 canary work eventually lands, the canary path can replace this trigger without touching the E2E workflow. Doc-aligned: per molecule-controlplane/docs/canary-tenants.md, "green staging E2E → :latest" is the recommended approach for the current scale (≤20 paying tenants); canary fleet is deferred until blast radius grows. Pipeline after this lands is fully self-healing: staging push → 4 gates green → auto-promote fast-forwards main → publish-workspace-server-image → E2E Staging SaaS → THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min (or redeploy-tenants-on-main fans out faster). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 13:58:41 -07:00
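
The gate behaviour described in 475a51adec and f4f45f8561 can be sketched as a small decision function. This is an illustrative Python model, not the workflow's actual shell step: the name `gate_decision`, the `PENDING` set, and the use of `None` to mean "no E2E run exists for this SHA" are assumptions made for the example. Treating every completed conclusion other than `success` as an abort is also an assumption; the commit only names `failure` explicitly.

```python
from typing import Optional

# States meaning the E2E run for this SHA has not finished yet
# (the list named in 475a51adec).
PENDING = {"in_progress", "queued", "requested", "waiting", "pending"}


def gate_decision(status: Optional[str], conclusion: Optional[str]) -> str:
    """Return 'proceed', 'defer', or 'abort' for the E2E run state of a SHA.

    status=None is a hypothetical encoding of "E2E never fired for this SHA"
    (paths-filtered), which f4f45f8561 says should proceed.
    """
    if status is None:
        return "proceed"
    if status == "completed":
        # Assumption: any completed conclusion other than success aborts.
        return "proceed" if conclusion == "success" else "abort"
    if status in PENDING:
        # Defer: leave :latest alone and let the E2E completion trigger decide.
        return "defer"
    # Unknown status: surface it instead of silently promoting through it.
    return "abort"


if __name__ == "__main__":
    for state in [(None, None), ("completed", "success"),
                  ("completed", "failure"), ("in_progress", None),
                  ("some_future_status", None)]:
        print(state, "->", gate_decision(*state))
```

The key difference from the earlier logic is that `in_progress` no longer falls through to `proceed`; only a finished green run or a missing run advances `:latest`.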

4 Commits
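
The `@v<N>` to `@<commit-sha> # v<N>` rewrite from c77a88c247 can be illustrated with a short Python sketch. It is not the tooling used in that commit (the log does not say how the edit was made); `pin_line`, `USES_RE`, and the SHA below are hypothetical, and the SHA is a placeholder rather than a real commit of `actions/checkout`.

```python
import re

# Matches a workflow `uses:` line that references a mutable version tag (vN...).
# Refs such as `@main` or an existing 40-hex SHA do not match and are left alone.
USES_RE = re.compile(r"^(\s*-?\s*uses:\s*)([\w./-]+)@(v[\w.]+)\s*$")


def pin_line(line: str, shas: dict) -> str:
    """Rewrite `uses: owner/action@vN` to `uses: owner/action@<sha> # vN`.

    `line` is a single line without its trailing newline. `shas` maps
    "owner/action@vN" to a commit SHA; lines with no entry are returned as-is.
    """
    m = USES_RE.match(line)
    if not m:
        return line
    prefix, action, tag = m.groups()
    sha = shas.get(f"{action}@{tag}")
    if sha is None:
        return line
    return f"{prefix}{action}@{sha} # {tag}"


if __name__ == "__main__":
    # Placeholder SHA for illustration only.
    shas = {"actions/checkout@v4": "0123456789abcdef0123456789abcdef01234567"}
    print(pin_line("      - uses: actions/checkout@v4", shas))
    print(pin_line("    uses: Molecule-AI/molecule-ci/.github/workflows/"
                   "disable-auto-merge-on-push.yml@main", shas))
```

Because the pattern only matches `v`-prefixed tags, the internal reusable workflow pinned to `@main` passes through unchanged, mirroring the exclusion described in the commit.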