Commit Graph

8 Commits

Author SHA1 Message Date
Hongming Wang
8df8487bbe fix(auto-promote): treat E2E completed/cancelled as defer, not failure
Bug: the case statement at line 189 grouped completed/failure |
completed/cancelled | completed/timed_out into the same
"abort + exit 1" branch. cancelled ≠ failure — when per-SHA
concurrency (memory: feedback_concurrency_group_per_sha) cancels an
older E2E run because a newer push landed, the workflow blocked the
whole auto-promote chain on a non-failure.

Caught 2026-05-05 02:03 on sha 31f9a5e: E2E got cancelled by
concurrency, auto-promote :latest aborted with exit 1, the next
auto-promote-staging cycle had to manually clean up.

Split: failure/timed_out keep the abort path. cancelled gets its
own clean-defer branch (same shape as in_progress) — proceed=false
without exit 1, with a step-summary explaining likely concurrency
supersession and pointing operators at manual dispatch if they
need that specific SHA promoted.
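The described split could look roughly like this (a minimal sketch: the
branch labels come from the message, but the function shape, summary
text, and variable handling are illustrative, not the workflow's literal
step):

```shell
# Sketch of the split gate: cancelled defers cleanly (same shape as
# in_progress), while failure/timed_out keep the hard abort.
gate() {
  case "$1" in
    completed/failure|completed/timed_out)
      echo "abort: E2E ended $1" >&2
      return 1 ;;                                  # keep the abort path
    completed/cancelled)
      # Likely superseded by per-SHA concurrency: defer without failing.
      echo "proceed=false" >>"$GITHUB_OUTPUT"
      echo "E2E cancelled (likely superseded by a newer push); dispatch manually to promote this SHA" >>"$GITHUB_STEP_SUMMARY"
      return 0 ;;
    completed/success)
      echo "proceed=true" >>"$GITHUB_OUTPUT"
      return 0 ;;
  esac
}
```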

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:26:29 -07:00
dependabot[bot]
6c6c6eb1e8
chore(deps): bump imjasonh/setup-crane from 0.4 to 0.5
Bumps [imjasonh/setup-crane](https://github.com/imjasonh/setup-crane) from 0.4 to 0.5.
- [Release notes](https://github.com/imjasonh/setup-crane/releases)
- [Commits](31b88efe9d...6da1ae0188)

---
updated-dependencies:
- dependency-name: imjasonh/setup-crane
  dependency-version: '0.5'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:23:13 +00:00
Hongming Wang
8efb2dae8d fix(ci): handle empty E2E lookup in auto-promote-on-e2e gate
When gh run list returns [] (no E2E run on the main SHA — the common
case for canvas-only / cmd-only / sweep-only changes whose paths
don't trigger E2E), jq's `.[0]` is null and the interpolation
`"\(null)/\(null // "none")"` produces "null/none". The case
statement has no `null/none)` branch, so it falls into `*)` →
exit 1 → auto-promote-on-e2e fails → `:latest` doesn't get retagged
to the new SHA → tenants on `redeploy-tenants-on-main` end up
pulling the OLD `:latest` digest.

Surfaced 2026-04-30 17:00Z as the first observable consequence of
PR #2389 (App-token dispatch fix). Every prior auto-promote-on-e2e
run was triggered by E2E completion (the "Upstream is E2E itself"
short-circuit at line 151 fired before reaching the gate). #2389
made publish-image's completion event correctly fire workflow_run
listeners — auto-promote-on-e2e is one of those listeners — and
hit the latent jq bug on the first publish-upstream run.

Fix: change `.[0]` to `(.[0] // {})` in the jq filter so the empty-
array case becomes `none/none` (the documented "E2E paths-filtered
out for this SHA — proceed" branch) instead of the unhandled
`null/none`. Also default `.status` for the same defensive reason.

Verified the three input shapes locally:
  []                                          → "none/none"  ✓
  [{status:completed,conclusion:success}]     → "completed/success"  ✓
  [{status:in_progress,conclusion:null}]      → "in_progress/none"  ✓

Outer `|| echo "none/none"` fallback retained as defense-in-depth
for non-zero gh exits (network / auth failures).
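The fixed filter can be reproduced standalone (the exact `gh run list`
invocation around it is not shown in the message, so only the jq part
is sketched here):

```shell
# Defaulting the first element and both fields maps every input shape to
# a handled case: [] becomes "none/none" instead of the unhandled
# "null/none".
FILTER='(.[0] // {}) | "\(.status // "none")/\(.conclusion // "none")"'

echo '[]' | jq -r "$FILTER"                                               # none/none
echo '[{"status":"completed","conclusion":"success"}]' | jq -r "$FILTER"  # completed/success
echo '[{"status":"in_progress","conclusion":null}]' | jq -r "$FILTER"     # in_progress/none
```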

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:07:52 -07:00
Hongming Wang
f7b9feb34f ci: ancestry-check on auto-promote :latest (#2244)
Two rapid main pushes whose E2Es complete out-of-order can promote
:latest backwards: SHA-A merges, SHA-B merges, SHA-B's E2E completes
first → :latest = staging-B → SHA-A's E2E completes → :latest = staging-A.
Now :latest is older than main's tip and stays wrong until the next
main push lands. The orphan-reconciler "next run corrects it" pattern
doesn't apply because there's no auto-corrective re-promote.

Detection: read the current :latest's `org.opencontainers.image.revision`
label (set by publish-workspace-server-image.yml at build time) and ask
the GitHub compare API how the candidate SHA relates to current. Branch
on `.status`:

  ahead     → retag (target newer)
  identical → retag is a no-op
  behind    → HARD FAIL (this is the race we're catching)
  diverged  → HARD FAIL (force-push or unusual history)
  error     → fail; manual dispatch can override
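The status-to-decision mapping can be sketched as follows (the compare
call itself needs gh auth, so it appears as a comment; function and
variable names are assumptions, not the workflow's literal step):

```shell
# In the workflow, STATUS would come from something like:
#   gh api "repos/$OWNER/$REPO/compare/${CURRENT_SHA}...${CANDIDATE_SHA}" --jq .status
ancestry_decision() {
  case "$1" in
    ahead)     echo retag ;;   # candidate is newer than current :latest
    identical) echo retag ;;   # idempotent no-op
    behind)    echo fail ;;    # the out-of-order E2E race being caught
    diverged)  echo fail ;;    # force-push or unusual history
    *)         echo fail ;;    # API error; manual dispatch can override
  esac
}
```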

Hard-fail rather than soft-skip per the approved design — silent-bypass
is the class we're moving away from per
feedback_schedule_vs_dispatch_secrets_hardening. Workflow goes red,
oncall sees it, operator decides whether to retry, force-promote, or
investigate. Manual dispatch skips the check (operator override),
matching the gate-step's existing semantics.

Backward-compat: when current :latest carries no revision label
(legacy image), skip-with-warning. All :latest images on main are
post-label as of 2026-04-29, so this branch becomes dead within 90 days
— TODO note in the step explains the cleanup.

No tests — the race is hypothetical at our scale (<1 occurrence/year
expected for a fleet of ≤20 paying tenants), and the only way to
exercise the new branches is to construct production-shape image
state. The new check lands behind the existing E2E gate-check, so
a regression in this step would surface as a failed promote (visible),
not a silent advance (invisible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:18:42 -07:00
github-actions[bot]
475a51adec fix(ci): defer promote when E2E is racing with publish (review fix)
Self-review caught a real correctness bug:
publish-workspace-server-image can complete BEFORE E2E Staging SaaS
for a runtime-touching SHA. Publish typically takes ~5-10 min and
E2E ~10-15 min, so this ordering is the common case for runtime-path
PRs.

Previous gate logic:
  - completed/success: proceed
  - completed/failure: abort
  - everything else (including in_progress): proceed   ← BUG

If publish-trigger fires while E2E is still running, the gate returned
"in_progress/none" and fell through the catch-all "proceed" branch.
Result: :latest retagged on the publish signal alone. Then E2E ends
red — but :latest was already wrongly advanced; the E2E-completion
trigger's job-level if=conclusion==success filter just skips, never
rolls back.

Fix: explicit case for in_progress|queued|requested|waiting|pending
that DEFERS — sets gate.proceed=false, writes a "deferred" summary,
exits 0 (workflow run shows success, retag steps skipped). The E2E
completion trigger then fires later and either promotes (green) or
aborts (red), giving us correct ordering regardless of who finishes
first.

Subsequent steps now guarded by `if: steps.gate.outputs.proceed ==
'true'` instead of relying on `exit 1` for skip semantics.

Also added an explicit catch-all `*)` branch that aborts on unknown
states (forward-compat: GitHub adds a new status, we surface it
instead of silently promoting through it).
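The resulting gate shape, sketched as a function (the state strings come
from the message; it returns instead of exiting so the logic stands
alone, and all names are assumed):

```shell
# 0 = step succeeds (proceed flag written to $GITHUB_OUTPUT), 1 = abort.
gate() {
  case "$1" in
    completed/success)
      echo "proceed=true" >>"$GITHUB_OUTPUT" ;;
    completed/*)
      echo "abort: E2E ended $1" >&2
      return 1 ;;
    in_progress/*|queued/*|requested/*|waiting/*|pending/*)
      # Defer: the E2E-completion trigger decides later.
      echo "proceed=false" >>"$GITHUB_OUTPUT" ;;
    *)
      # Forward-compat catch-all: surface unknown states, never
      # silently promote through them.
      echo "abort: unknown E2E state '$1'" >&2
      return 1 ;;
  esac
}
```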
2026-04-28 16:59:58 -07:00
github-actions[bot]
f4f45f8561 fix(ci): auto-promote :latest also on publish-image, not just E2E
Previously this workflow only triggered on E2E Staging SaaS completion,
which is itself paths-filtered to runtime handlers
(workspace-server/internal/handlers/{registry,workspace_provision,
a2a_proxy}.go, middleware/**, provisioner/**).
publish-workspace-server-image fires on a STRICTLY BROADER path set
canvas/**, manifest.json) — so canvas-only or cmd-only or sweep-only
PRs rebuilt the platform image without ever advancing :latest.

Result observed 2026-04-28: zero runs of this workflow since merge
despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main.

Fix: add publish-workspace-server-image as a second trigger. Add an
explicit gate inside the job that aborts when E2E Staging SaaS for the
same SHA ended red. When E2E didn't fire (paths-filtered), proceed —
auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API +
CodeQL on staging) already validated this SHA before main moved.

Concurrency group serializes promotes per-SHA so the publish+E2E both-
fired race lands cleanly. Idempotent crane tag makes it safe regardless.
2026-04-28 16:53:30 -07:00
Hongming Wang
c77a88c247 chore(security): pin Actions to SHAs + enable Dependabot auto-bumps
Supply-chain hardening for the CI pipeline. 23 workflow files
modified, 59 mutable-tag refs replaced with commit SHAs.

The risk

Every `uses:` reference in .github/workflows/*.yml was pinned to a
mutable tag (e.g., `actions/checkout@v4`). A maintainer of an
action — or a compromised maintainer account — can repoint that
tag to malicious code, and our pipelines silently pull it on the
next run. The tj-actions/changed-files compromise of March 2025 is
the canonical example: maintainer credential leak, attacker
repointed several `@v<N>` tags to a payload that exfiltrated
repository secrets. Repos that pinned to SHAs were unaffected.

The fix

Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing
comment preserves human readability ("ah, this is v4"); the SHA
makes the reference immutable.

Actions covered (10 distinct):
  actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script}
  docker/{login-action,setup-buildx-action,build-push-action}
  github/codeql-action/{init,autobuild,analyze}
  dorny/paths-filter
  imjasonh/setup-crane
  pnpm/action-setup (already pinned in molecule-app, listed here for completeness)

Excluded:
  Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
    — internal org reusable workflow; we control its repo, threat model
    is different from third-party actions. Conventional to pin to @main
    rather than SHA for internal reusables.

The maintenance cost

SHA pinning means upstream fixes require manual SHA bumps. Without
automation, pinned SHAs go stale. So this PR also enables Dependabot
across four ecosystems:

  - github-actions (workflows)
  - gomod (workspace-server)
  - npm (canvas)
  - pip (workspace runtime requirements)

Weekly cadence — the supply-chain attack window is "minutes between
repoint and pull"; weekly auto-bumps don't help with zero-days
regardless. The point is to pull in non-zero-day fixes without
operator effort.
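A minimal `.github/dependabot.yml` matching the four ecosystems and
weekly cadence described would look like this (directory paths are
assumptions based on the repo layout named above):

```yaml
version: 2
updates:
  - package-ecosystem: github-actions
    directory: /
    schedule: { interval: weekly }
  - package-ecosystem: gomod
    directory: /workspace-server
    schedule: { interval: weekly }
  - package-ecosystem: npm
    directory: /canvas
    schedule: { interval: weekly }
  - package-ecosystem: pip
    directory: /workspace
    schedule: { interval: weekly }
```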

Aligns with user-stated principle: "long-term, robust, fully-
automated, eliminate human error."

Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern,
smaller surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:37:06 -07:00
Hongming Wang
9d4ab7b1a2 feat(ci): auto-promote-on-e2e — retag :latest on green E2E Staging SaaS
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on a Phase 2 fleet that doesn't exist yet).

This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.
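The verify-then-retag ordering can be sketched like this (the image
names come from the message; the function wrapper, registry argument,
and variable names are assumptions):

```shell
# Check BOTH images before retagging EITHER, so a half-published
# :latest cannot occur.
promote_latest() {
  local sha="$1" registry="$2" img
  for img in platform platform-tenant; do
    crane digest "$registry/$img:staging-$sha" >/dev/null || {
      echo "missing $img:staging-$sha; refusing partial promote" >&2
      return 1
    }
  done
  for img in platform platform-tenant; do
    crane tag "$registry/$img:staging-$sha" latest
  done
}
```

In a dry run, `crane` can be shadowed by a shell function, which is the
easiest way to check the all-or-nothing property without a registry.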

Why trigger only on `main` (not staging):
  - `:latest` is what prod tenants pull. Only SHAs that have reached
    `main` (via auto-promote-staging) should advance `:latest`.
  - Triggering on staging would let a staging-only revert advance
    `:latest` to a SHA that never reaches `main`, breaking the
    invariant "production runs what's on `main`".

Why a separate workflow rather than folding into e2e-staging-saas.yml:
  - Test concerns and release concerns separate.
  - Disabling promote during an incident is one workflow toggle, not
    an edit to the long E2E file.
  - When Phase 2 canary work eventually lands, the canary path can
    replace this trigger without touching the E2E workflow.

Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.

Pipeline after this lands is fully self-healing:
  staging push → 4 gates green → auto-promote fast-forwards main
   → publish-workspace-server-image → E2E Staging SaaS
   → THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
                                    (or redeploy-tenants-on-main fans out faster)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:58:41 -07:00