Commit Graph

8 Commits

Author SHA1 Message Date
Hongming Wang
8df8487bbe fix(auto-promote): treat E2E completed/cancelled as defer, not failure
Bug: the case statement at line 189 grouped completed/failure |
completed/cancelled | completed/timed_out into the same
"abort + exit 1" branch. cancelled ≠ failure — when per-SHA
concurrency (memory: feedback_concurrency_group_per_sha) cancels an
older E2E run because a newer push landed, the workflow blocked the
whole auto-promote chain on a non-failure.

Caught 2026-05-05 02:03 on sha 31f9a5e: E2E got cancelled by
concurrency, auto-promote :latest aborted with exit 1, the next
auto-promote-staging cycle had to manually clean up.

Split: failure/timed_out keep the abort path. cancelled gets its
own clean-defer branch (same shape as in_progress) — proceed=false
without exit 1, with a step-summary explaining likely concurrency
supersession and pointing operators at manual dispatch if they
need that specific SHA promoted.
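The described split could look roughly like this (a minimal sketch: the
branch labels come from the message, but the function shape, summary
text, and variable handling are illustrative, not the workflow's literal
step):

```shell
# Sketch of the split gate: cancelled defers cleanly (same shape as
# in_progress), while failure/timed_out keep the hard abort.
gate() {
  case "$1" in
    completed/failure|completed/timed_out)
      echo "abort: E2E ended $1" >&2
      return 1 ;;                                  # keep the abort path
    completed/cancelled)
      # Likely superseded by per-SHA concurrency: defer without failing.
      echo "proceed=false" >>"$GITHUB_OUTPUT"
      echo "E2E cancelled (likely superseded by a newer push); dispatch manually to promote this SHA" >>"$GITHUB_STEP_SUMMARY"
      return 0 ;;
    completed/success)
      echo "proceed=true" >>"$GITHUB_OUTPUT"
      return 0 ;;
  esac
}
```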

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:26:29 -07:00
dependabot[bot]
6c6c6eb1e8
chore(deps): bump imjasonh/setup-crane from 0.4 to 0.5
Bumps [imjasonh/setup-crane](https://github.com/imjasonh/setup-crane) from 0.4 to 0.5.
- [Release notes](https://github.com/imjasonh/setup-crane/releases)
- [Commits](31b88efe9d...6da1ae0188)

---
updated-dependencies:
- dependency-name: imjasonh/setup-crane
  dependency-version: '0.5'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-02 19:23:13 +00:00
Hongming Wang
8efb2dae8d fix(ci): handle empty E2E lookup in auto-promote-on-e2e gate
When gh run list returns [] (no E2E run on the main SHA — the common
case for canvas-only / cmd-only / sweep-only changes whose paths
don't trigger E2E), jq's `.[0]` is null and the interpolation
`"\(null)/\(null // "none")"` produces "null/none". The case
statement has no `null/none)` branch, so it falls into `*)` →
exit 1 → auto-promote-on-e2e fails → `:latest` doesn't get retagged
to the new SHA → tenants on `redeploy-tenants-on-main` end up
pulling the OLD `:latest` digest.

Surfaced 2026-04-30 17:00Z as the first observable consequence of
PR #2389 (App-token dispatch fix). Every prior auto-promote-on-e2e
run was triggered by E2E completion (the "Upstream is E2E itself"
short-circuit at line 151 fired before reaching the gate). #2389
made publish-image's completion event correctly fire workflow_run
listeners — auto-promote-on-e2e is one of those listeners — and
hit the latent jq bug on the first publish-upstream run.

Fix: change `.[0]` to `(.[0] // {})` in the jq filter so the empty-
array case becomes `none/none` (the documented "E2E paths-filtered
out for this SHA — proceed" branch) instead of the unhandled
`null/none`. Also default `.status` for the same defensive reason.

Verified the three input shapes locally:
  []                                          → "none/none"  ✓
  [{status:completed,conclusion:success}]     → "completed/success"  ✓
  [{status:in_progress,conclusion:null}]      → "in_progress/none"  ✓

Outer `|| echo "none/none"` fallback retained as defense-in-depth
for non-zero gh exits (network / auth failures).
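The fixed filter can be reproduced standalone (the exact `gh run list`
invocation around it is not shown in the message, so only the jq part
is sketched here):

```shell
# Defaulting the first element and both fields maps every input shape to
# a handled case: [] becomes "none/none" instead of the unhandled
# "null/none".
FILTER='(.[0] // {}) | "\(.status // "none")/\(.conclusion // "none")"'

echo '[]' | jq -r "$FILTER"                                               # none/none
echo '[{"status":"completed","conclusion":"success"}]' | jq -r "$FILTER"  # completed/success
echo '[{"status":"in_progress","conclusion":null}]' | jq -r "$FILTER"     # in_progress/none
```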

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 10:07:52 -07:00
Hongming Wang
f7b9feb34f ci: ancestry-check on auto-promote :latest (#2244)
Two rapid main pushes whose E2Es complete out-of-order can promote
:latest backwards: SHA-A merges, SHA-B merges, SHA-B's E2E completes
first → :latest = staging-B → SHA-A's E2E completes → :latest = staging-A.
Now :latest is older than main's tip and stays wrong until the next
main push lands. The orphan-reconciler "next run corrects it" pattern
doesn't apply because there's no auto-corrective re-promote.

Detection: read the current :latest's `org.opencontainers.image.revision`
label (set by publish-workspace-server-image.yml at build time) and ask
the GitHub compare API how the candidate SHA relates to current. Branch
on `.status`:

  ahead     → retag (target newer)
  identical → retag is a no-op
  behind    → HARD FAIL (this is the race we're catching)
  diverged  → HARD FAIL (force-push or unusual history)
  error     → fail; manual dispatch can override
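The status-to-decision mapping can be sketched as follows (the compare
call itself needs gh auth, so it appears as a comment; function and
variable names are assumptions, not the workflow's literal step):

```shell
# In the workflow, STATUS would come from something like:
#   gh api "repos/$OWNER/$REPO/compare/${CURRENT_SHA}...${CANDIDATE_SHA}" --jq .status
ancestry_decision() {
  case "$1" in
    ahead)     echo retag ;;   # candidate is newer than current :latest
    identical) echo retag ;;   # idempotent no-op
    behind)    echo fail ;;    # the out-of-order E2E race being caught
    diverged)  echo fail ;;    # force-push or unusual history
    *)         echo fail ;;    # API error; manual dispatch can override
  esac
}
```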

Hard-fail rather than soft-skip per the approved design — silent-bypass
is the class we're moving away from per
feedback_schedule_vs_dispatch_secrets_hardening. Workflow goes red,
oncall sees it, operator decides whether to retry, force-promote, or
investigate. Manual dispatch skips the check (operator override),
matching the gate-step's existing semantics.

Backward-compat: when current :latest carries no revision label
(legacy image), skip-with-warning. All :latest images on main are
post-label as of 2026-04-29, so this branch becomes dead within 90 days
— TODO note in the step explains the cleanup.

No tests — the race is hypothetical at our scale (<1 occurrence/year
expected for a fleet of ≤20 paying tenants), and the only way to
exercise the new branches is to construct production-shape image
state. The new check lands behind the existing E2E gate-check, so
a regression in this step would surface as a failed promote (visible),
not a silent advance (invisible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:18:42 -07:00
github-actions[bot]
475a51adec fix(ci): defer promote when E2E is racing with publish (review fix)
Self-review caught a real correctness bug:
publish-workspace-server-image can complete BEFORE E2E Staging SaaS
for a runtime-touching SHA. Publish typically takes ~5-10 min and
E2E ~10-15 min, so this ordering is the common case for runtime-path
PRs.

Previous gate logic:
  - completed/success: proceed
  - completed/failure: abort
  - everything else (including in_progress): proceed   ← BUG

If publish-trigger fires while E2E is still running, the gate returned
"in_progress/none" and fell through the catch-all "proceed" branch.
Result: :latest retagged on the publish signal alone. Then E2E ends
red — but :latest was already wrongly advanced; the E2E-completion
trigger's job-level if=conclusion==success filter just skips, never
rolls back.

Fix: explicit case for in_progress|queued|requested|waiting|pending
that DEFERS — sets gate.proceed=false, writes a "deferred" summary,
exits 0 (workflow run shows success, retag steps skipped). The E2E
completion trigger then fires later and either promotes (green) or
aborts (red), giving us correct ordering regardless of who finishes
first.

Subsequent steps now guarded by `if: steps.gate.outputs.proceed ==
'true'` instead of relying on `exit 1` for skip semantics.

Also added an explicit catch-all `*)` branch that aborts on unknown
states (forward-compat: GitHub adds a new status, we surface it
instead of silently promoting through it).
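The resulting gate shape, sketched as a function (the state strings come
from the message; it returns instead of exiting so the logic stands
alone, and all names are assumed):

```shell
# 0 = step succeeds (proceed flag written to $GITHUB_OUTPUT), 1 = abort.
gate() {
  case "$1" in
    completed/success)
      echo "proceed=true" >>"$GITHUB_OUTPUT" ;;
    completed/*)
      echo "abort: E2E ended $1" >&2
      return 1 ;;
    in_progress/*|queued/*|requested/*|waiting/*|pending/*)
      # Defer: the E2E-completion trigger decides later.
      echo "proceed=false" >>"$GITHUB_OUTPUT" ;;
    *)
      # Forward-compat catch-all: surface unknown states, never
      # silently promote through them.
      echo "abort: unknown E2E state '$1'" >&2
      return 1 ;;
  esac
}
```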
2026-04-28 16:59:58 -07:00
github-actions[bot]
f4f45f8561 fix(ci): auto-promote :latest also on publish-image, not just E2E
Previously this workflow only triggered on E2E Staging SaaS completion,
which is itself paths-filtered to runtime handlers
(workspace-server/internal/handlers/{registry,workspace_provision,
a2a_proxy}.go, middleware/**, provisioner/**).
publish-workspace-server-image fires on a STRICTLY BROADER path set
canvas/**, manifest.json) — so canvas-only or cmd-only or sweep-only
PRs rebuilt the platform image without ever advancing :latest.

Result observed 2026-04-28: zero runs of this workflow since merge
despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main.

Fix: add publish-workspace-server-image as a second trigger. Add an
explicit gate inside the job that aborts when E2E Staging SaaS for the
same SHA ended red. When E2E didn't fire (paths-filtered), proceed —
auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API +
CodeQL on staging) already validated this SHA before main moved.

Concurrency group serializes promotes per-SHA so the publish+E2E both-
fired race lands cleanly. Idempotent crane tag makes it safe regardless.
2026-04-28 16:53:30 -07:00
Hongming Wang
c77a88c247 chore(security): pin Actions to SHAs + enable Dependabot auto-bumps
Supply-chain hardening for the CI pipeline. 23 workflow files
modified, 59 mutable-tag refs replaced with commit SHAs.

The risk

Every `uses:` reference in .github/workflows/*.yml was pinned to a
mutable tag (e.g., `actions/checkout@v4`). A maintainer of an
action — or a compromised maintainer account — can repoint that
tag to malicious code, and our pipelines silently pull it on the
next run. The tj-actions/changed-files compromise of March 2025 is
the canonical example: maintainer credential leak, attacker
repointed several `@v<N>` tags to a payload that exfiltrated
repository secrets. Repos that pinned to SHAs were unaffected.

The fix

Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing
comment preserves human readability ("ah, this is v4"); the SHA
makes the reference immutable.

Actions covered (10 distinct):
  actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script}
  docker/{login-action,setup-buildx-action,build-push-action}
  github/codeql-action/{init,autobuild,analyze}
  dorny/paths-filter
  imjasonh/setup-crane
  pnpm/action-setup (already pinned in molecule-app, listed here for completeness)

Excluded:
  Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
    — internal org reusable workflow; we control its repo, threat model
    is different from third-party actions. Conventional to pin to @main
    rather than SHA for internal reusables.

The maintenance cost

SHA pinning means upstream fixes require manual SHA bumps. Without
automation, pinned SHAs go stale. So this PR also enables Dependabot
across four ecosystems:

  - github-actions (workflows)
  - gomod (workspace-server)
  - npm (canvas)
  - pip (workspace runtime requirements)

Weekly cadence — the supply-chain attack window is "minutes between
repoint and pull"; weekly auto-bumps don't help with zero-days
regardless. The point is to pull in non-zero-day fixes without
operator effort.
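A minimal `.github/dependabot.yml` matching the four ecosystems and
weekly cadence described would look like this (directory paths are
assumptions based on the repo layout named above):

```yaml
version: 2
updates:
  - package-ecosystem: github-actions
    directory: /
    schedule: { interval: weekly }
  - package-ecosystem: gomod
    directory: /workspace-server
    schedule: { interval: weekly }
  - package-ecosystem: npm
    directory: /canvas
    schedule: { interval: weekly }
  - package-ecosystem: pip
    directory: /workspace
    schedule: { interval: weekly }
```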

Aligns with user-stated principle: "long-term, robust, fully-
automated, eliminate human error."

Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern,
smaller surface).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:37:06 -07:00
Hongming Wang
9d4ab7b1a2 feat(ci): auto-promote-on-e2e — retag :latest on green E2E Staging SaaS
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on a Phase 2 fleet that doesn't exist yet).

This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.
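The verify-then-retag ordering can be sketched like this (the image
names come from the message; the function wrapper, registry argument,
and variable names are assumptions):

```shell
# Check BOTH images before retagging EITHER, so a half-published
# :latest cannot occur.
promote_latest() {
  local sha="$1" registry="$2" img
  for img in platform platform-tenant; do
    crane digest "$registry/$img:staging-$sha" >/dev/null || {
      echo "missing $img:staging-$sha; refusing partial promote" >&2
      return 1
    }
  done
  for img in platform platform-tenant; do
    crane tag "$registry/$img:staging-$sha" latest
  done
}
```

In a dry run, `crane` can be shadowed by a shell function, which is the
easiest way to check the all-or-nothing property without a registry.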

Why trigger only on `main` (not staging):
  - `:latest` is what prod tenants pull. Only SHAs that have reached
    `main` (via auto-promote-staging) should advance `:latest`.
  - Triggering on staging would let a staging-only revert advance
    `:latest` to a SHA that never reaches `main`, breaking the
    invariant "production runs what's on `main`".

Why a separate workflow rather than folding into e2e-staging-saas.yml:
  - Test concerns and release concerns separate.
  - Disabling promote during an incident is one workflow toggle, not
    an edit to the long E2E file.
  - When Phase 2 canary work eventually lands, the canary path can
    replace this trigger without touching the E2E workflow.

Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.

Pipeline after this lands is fully self-healing:
  staging push → 4 gates green → auto-promote fast-forwards main
   → publish-workspace-server-image → E2E Staging SaaS
   → THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
                                    (or redeploy-tenants-on-main fans out faster)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 13:58:41 -07:00