Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.
Empirical motivation from today:
- #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
parse layer between sender + receiver. Visible only when a sender
exercises the full path. Now fixed by PR #2349, but a continuous
E2E would have surfaced it within 20 min of the regression.
- RFC #2312 chat upload — landed on the staging branch but never reached
staging tenants because publish-workspace-server-image was main-
only. Caught by manual dogfooding hours after deploy. Same pattern.
Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.
Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.
Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.
Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.
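A minimal sketch of the trigger/limit shape this describes (group name,
input wiring, and job layout are illustrative, not the merged file):

    on:
      schedule:
        - cron: '0,20,40 * * * *'        # 3x/hour, offset from the :15 and :45 sweeps
      workflow_dispatch:
        inputs:
          runtime:
            default: 'langgraph'          # pass 'hermes' for SDK-native coverage
    concurrency:
      group: e2e-staging-continuous       # one run at a time; a hung run can't overlap the next firing
      cancel-in-progress: false
    jobs:
      e2e:
        runs-on: ubuntu-latest
        timeout-minutes: 10               # stuck runs fail fast instead of blocking the next tick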
Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.
Refs:
- #2342 hard-gates Tier 2 item 2
- #2345 (A2A v0.2 fix that this gate would have caught earlier)
- #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and
staging redeploys now have explicit serialization:
group: redeploy-tenants-on-main (per-workflow, global)
group: redeploy-tenants-on-staging (per-workflow, global)
cancel-in-progress: false on both — aborting a half-rolled-out fleet
would leave tenants stuck on whatever image they happened to be on
when cancelled. Better to finish the in-flight rollout before starting
the next one.
Before this fix, the workflow relied on GitHub's implicit workflow_run
queueing, which is "probably fine" but not defensible — explicit > implicit for
load-bearing pipeline behavior. Picked up as a #2337 review nit
(architecture finding 1: concurrency asymmetry between the two
redeploy workflows).
No behavior change in the common case. The change matters only when
two main pushes land within seconds AND the first redeploy is still
mid-rollout — currently rare; will become more common once #2335
(staging-trigger publish) feeds main more frequently via auto-promote.
Two follow-ups from #2335 review (tracked in #2336):
1. Add `concurrency:` block to publish-workspace-server-image.yml so
two rapid staging pushes don't race the same :staging-latest retag.
Group is per-branch (`${{ github.ref }}`) so staging and main can
build in parallel — they produce different :staging-<sha> tags and
last-write-wins on :staging-latest is acceptable across branches.
`cancel-in-progress: false` keeps in-flight builds — partially-pushed
images would break canary-fleet pin consistency.
2. Add redeploy-tenants-on-staging.yml. After #2335, every staging push
produces a fresh :staging-latest, but existing tenants only pick it
up on next reprovision. This workflow mirrors redeploy-tenants-on-
main but for staging:
- workflow_run-gated to branches: [staging]
- target_tag default 'staging-latest' (vs 'latest' for prod)
- CP_URL default https://staging-api.moleculesai.app
- CP_STAGING_ADMIN_API_TOKEN repo secret (operator must set)
- canary_slug empty by default — staging is itself the canary; no
sub-canary needed inside it. Soak still applies if operator
specifies a tenant for blast-radius control.
Schedule-vs-dispatch hardening matches sweep-cf-orphans/sweep-cf-
tunnels: hard-fail on auto-trigger when secret missing so misconfig
doesn't silently leave staging tenants on stale code; soft-skip on
operator dispatch.
Operator action required after merge:
Add CP_STAGING_ADMIN_API_TOKEN repo secret. Pull value from staging-
CP's CP_ADMIN_API_TOKEN env in Railway controlplane / staging
environment. Until set, the auto-trigger will fail the workflow run
(visible as red CI), surfacing the misconfiguration. Workflow runs
only on staging publish-workspace-server-image success, so no extra
load while it sits unconfigured.
Verification:
- YAML lint clean on both workflows.
- Reviewed redeploy-tenants-on-main as template; differences are scoped
to staging-specific values (URL, tag, secret name) + harden-on-missing-
secret pattern.
Refs #2335, #2336.
Root cause: this workflow only triggered on `branches: [main]`, but
staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway).
:staging-latest was only retagged on main push, so:
staging-branch code → never built → never reaches staging tenants
staging-CP serves → "yesterday's main" indefinitely
When staging→main was wedged (path-filter parity bug, canvas teardown
race — both fixed earlier today), :staging-latest stopped updating
entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but
freshly-provisioned staging tenants kept failing chat upload because
they pulled a pre-RFC-#2312 image. Verified by tearing down a fresh
tenant and observing the legacy "workspace container not running"
error from the docker-exec code path that RFC #2312 deleted.
Pre-2026-04-24 there was a related-but-different incident: TENANT_IMAGE
was a static :staging-<sha> pin that drifted 10 days behind. This new
incident is "the dynamic pin still drifts when its update workflow
doesn't fire."
Fix: add `staging` to the branches trigger. Tag policy is unchanged
(:staging-<sha> + :staging-latest on every push). canary-verify.yml
still runs on main push (workflow_run-gated to `branches: [main]`),
preserving the canary-verified :latest promotion for prod tenants.
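The trigger after the fix, sketched (the paths list is illustrative of
the publish filter, not copied from the file):

    on:
      push:
        branches: [main, staging]   # 'staging' is the new entry; tag policy is unchanged
        paths:
          - 'workspace-server/**'
          - 'canvas/**'
          - 'manifest.json'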
Steady state after this:
- staging push → :staging-latest = staging-branch code → staging-CP
- main push → :staging-<sha> for canary, :staging-latest retag
(post-promote main code), and after canary green
→ :latest for prod tenants
What this does NOT change:
- canary-verify.yml flow (still main-only)
- redeploy-tenants-on-main.yml (still rolls prod fleet on main push)
- publish-canvas-image.yml (self-hosted standalone canvas; orthogonal)
- The :latest tag (canary-verified main, unchanged)
What this does fix:
- RFC #2312-class fixes that land on staging now actually reach
staging tenants without waiting for staging→main promote.
- The dogfooding observation "staging tenants seem to be running
yesterday's code" disappears as a class.
Drive-by: also fixed the typo in the path-filter list (was
`publish-platform-image.yml`, the actual file is
`publish-workspace-server-image.yml`).
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
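A sketch of rules 1-4 as the decide logic, shown inline in a step for
brevity (the real logic lives in the sweep script; variable and field
names are assumptions):

    - name: Decide keep/delete per tunnel
      run: |
        # decide <tunnel-name> <status> <connection-count>; KNOWN_SLUGS holds
        # prod + staging slugs (one per line), fetched earlier from both CPs.
        decide() {
          local name="$1" status="$2" conn_count="$3"
          case "$name" in tenant-*) ;; *) echo keep; return ;; esac     # rule 1: unknown shape
          if [ "$status" = "healthy" ] || [ "${conn_count:-0}" -gt 0 ]; then
            echo keep; return                                            # rule 2: live tunnel
          fi
          local slug="${name#tenant-}"
          grep -qxF "$slug" <<<"$KNOWN_SLUGS" && { echo keep; return; }  # rule 3: known org
          echo delete                                                    # rule 4: orphan
        }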
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to existing org-secrets):
- CF_API_TOKEN: must include account:cloudflare_tunnel:edit scope
  (separate from the zone:dns:edit scope used by sweep-cf-orphans —
  the same token works if its scope is broad, or a new token if
  narrowly scoped).
- CF_ACCOUNT_ID: the account that owns the tunnels (visible in the
  dash.cloudflare.com URL path).
- CP_PROD_ADMIN_TOKEN: reused from sweep-cf-orphans.
- CP_STAGING_ADMIN_TOKEN: reused from sweep-cf-orphans.
Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — the same pattern
we applied to DNS records while that root cause went unaddressed.
Setup wrote .playwright-staging-state.json at the END (step 7), only
after org create + provision-wait + TLS + workspace create + workspace-
online all succeeded. If setup crashed at steps 1-6, the org existed in
CP but the state file did not, so Playwright's globalTeardown bailed
out ("nothing to tear down") and the workflow safety-net pattern-swept
every e2e-canvas-<today>-* org to compensate. That sweep deleted
concurrent runs' live tenants — including their CF DNS records —
causing victims' next fetch to die with `getaddrinfo ENOTFOUND`.
Race observed 2026-04-30 on PR #2264 staging→main: three real-test
runs killed each other mid-test, blocking 68 commits of staging→main
promotion.
Fix: write the state file as setup's first action, right after slug
generation, before any CP call. Now:
- Crash before slug gen → no state file, no orphan to clean
- Crash during steps 1-6 → state file has slug; teardown deletes
it (DELETE 404s if org never created)
- Setup completes → state file has full state; teardown
deletes the slug
The workflow safety-net no longer pattern-sweeps; it reads the state
file and deletes only the recorded slug. Concurrent canvas-E2E runs no
longer poison each other.
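A sketch of the reshaped safety-net step (state-file name from this
commit; the CP endpoint path, env var wiring, and step shape are
assumptions):

    - name: Teardown recorded slug only
      if: always()
      env:
        CP_URL: https://staging-api.moleculesai.app
        CP_STAGING_ADMIN_API_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
      run: |
        if [ ! -f .playwright-staging-state.json ]; then
          echo "no state file — nothing was provisioned"; exit 0
        fi
        SLUG=$(jq -r '.slug' .playwright-staging-state.json)
        # DELETE is expected to 404 if setup crashed before the org was created.
        curl -sS -X DELETE \
          -H "Authorization: Bearer ${CP_STAGING_ADMIN_API_TOKEN}" \
          "${CP_URL}/admin/orgs/${SLUG}" || true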
Verified by:
- tsc --noEmit on staging-setup.ts + staging-teardown.ts
- YAML lint on e2e-staging-canvas.yml
- Code review: state file write moved to line 113 (post-makeSlug,
pre-CP) with the original line-249 write retained as a "promote
to full state" overwrite at the end
Acceptance criterion 3 of #2001 ("CI check that fails if TENANT_IMAGE
contains a SHA-shaped suffix") was deferred from PR #2168 because
querying Railway from a GitHub Actions runner needs RAILWAY_TOKEN
plumbed as a repo secret. The detection script + regression test in
#2168 cover detection; this is the automation-cadence layer.
Daily 13:00 UTC schedule (06:00 PT) + workflow_dispatch. Daily is the
right cadence for variables-tier config — Railway env var changes are
deliberate operator actions, low-frequency. Hourly would risk Railway
API rate-limit surprises.
Issue-on-failure pattern mirrors e2e-staging-sanity.yml — drift opens
a `railway-drift` priority-high issue (or comments on the open one),
and a subsequent clean run auto-closes it with a "drift resolved"
comment. No human-in-the-loop needed for the close.
Schedule-vs-dispatch secret hardening per
feedback_schedule_vs_dispatch_secrets_hardening:
- Schedule trigger HARD-FAILS on missing RAILWAY_AUDIT_TOKEN
(silent-success was the failure mode that bit us before)
- workflow_dispatch SOFT-SKIPS so an operator can dry-run the
workflow shape during initial token provisioning
Operator action required before this gate is live:
- Provision a Railway API token, read-only `variables` scope on the
molecule-platform project (id 7ccc8c68-61f4-42ab-9be5-586eeee11768)
- Store as repo secret RAILWAY_AUDIT_TOKEN
- Rotate per the standard 90-day schedule
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch protection treats matching-name check runs as a SET — any SKIPPED
member fails the required-check eval, even with SUCCESS siblings. The
two-jobs-sharing-name pattern (no-op + real-job) emits one SKIPPED + one
SUCCESS check run per workflow run; with multiple runs at the same SHA
(detect-changes triggers + auto-promote re-runs) the SET fills with
SKIPPED entries that block branch protection.
Verified live on PR #2264 (staging→main auto-promote): mergeStateStatus
stayed BLOCKED for 18+ hours despite APPROVED + MERGEABLE + all gates
green at the workflow level. `gh pr merge` returned "base branch policy
prohibits the merge"; `enqueuePullRequest` returned "No merge queue
found for branch 'main'". The check-runs API showed `E2E API Smoke
Test` and `Canvas tabs E2E` each had 2 SKIPPED + 2 SUCCESS at head SHA
66142c1e.
Fix: collapse no-op + real-job into ONE job with no job-level `if:`,
gating real work via per-step `if: needs.detect-changes.outputs.X ==
'true'`. The job always runs and emits exactly one SUCCESS check run
under the required-check name regardless of paths-filter outcome —
branch-protection-clean.
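A sketch of the collapsed shape (the filter output name, script path,
and step list are illustrative; the "Supersedes #2321 + #2322" entry
below has the per-check breakdown):

    jobs:
      e2e-api:
        name: E2E API Smoke Test        # exactly one check run under this name, always SUCCESS
        needs: detect-changes
        runs-on: ubuntu-latest          # no job-level `if:` — the job itself always runs
        steps:
          - uses: actions/checkout@v4
            if: needs.detect-changes.outputs.e2e == 'true'
          - name: Run the smoke test
            if: needs.detect-changes.outputs.e2e == 'true'
            run: ./scripts/run_e2e_api_smoke.sh   # placeholder for the real steps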
Same pattern as ci.yml's earlier conversion of Canvas/Platform/Python/
Shellcheck (PR #2322). Closes the parity-fix that should have been
applied to all four path-filtered required checks at once.
Two rapid main pushes whose E2Es complete out-of-order can promote
:latest backwards: SHA-A merges, SHA-B merges, SHA-B's E2E completes
first → :latest = staging-B → SHA-A's E2E completes → :latest = staging-A.
Now :latest is older than main's tip and stays wrong until the next
main push lands. The orphan-reconciler "next run corrects it" pattern
doesn't apply because there's no auto-corrective re-promote.
Detection: read the current :latest's `org.opencontainers.image.revision`
label (set by publish-workspace-server-image.yml at build time) and ask
the GitHub compare API how the candidate SHA relates to current. Branch
on `.status`:
ahead → retag (target newer)
identical → retag is a no-op
behind → HARD FAIL (this is the race we're catching)
diverged → HARD FAIL (force-push or unusual history)
error → fail; manual dispatch can override
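A sketch of the detection step (label name and compare semantics per
the description above; the image path, candidate-SHA wiring, and jq
paths are assumptions):

    - name: Refuse to move :latest backwards
      env:
        GH_TOKEN: ${{ github.token }}
        CANDIDATE_SHA: ${{ github.sha }}   # illustrative; the real step derives it from the triggering run
      run: |
        IMG=ghcr.io/molecule-ai/platform   # illustrative image path
        CURRENT=$(crane config "${IMG}:latest" \
          | jq -r '.config.Labels["org.opencontainers.image.revision"] // empty')
        if [ -z "$CURRENT" ]; then
          echo "::warning::current :latest has no revision label (legacy image) — skipping check"
          exit 0
        fi
        STATUS=$(gh api "repos/${GITHUB_REPOSITORY}/compare/${CURRENT}...${CANDIDATE_SHA}" --jq .status)
        case "$STATUS" in
          ahead|identical) echo "promote ok (candidate is ${STATUS})" ;;
          behind|diverged) echo "::error::candidate is ${STATUS} relative to current :latest"; exit 1 ;;
          *)               echo "::error::unexpected compare status '${STATUS}'"; exit 1 ;;
        esac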
Hard-fail rather than soft-skip per the approved design — silent-bypass
is the class we're moving away from per
feedback_schedule_vs_dispatch_secrets_hardening. Workflow goes red,
oncall sees it, operator decides whether to retry, force-promote, or
investigate. Manual dispatch skips the check (operator override),
matching the gate-step's existing semantics.
Backward-compat: when current :latest carries no revision label
(legacy image), skip-with-warning. All :latest images on main are
post-label as of 2026-04-29, so this branch becomes dead within 90 days
— TODO note in the step explains the cleanup.
No tests — the race is hypothetical at our scale (<1 occurrence/year
expected for a fleet of ≤20 paying tenants), and the only way to
exercise the new branches is to construct production-shape image
state. The new gate logic lands behind the existing E2E gate-check, so
a regression in this step would surface as a failed promote (visible),
not a silent advance (invisible).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supersedes #2321 + #2322. Applies the same shape uniformly across every
required check that uses a path filter: Canvas (Next.js), Platform (Go),
Python Lint & Test, Shellcheck (E2E scripts).
The bug + fix in one paragraph:
GitHub registers a check run for every job whose `name:` matches the
required-check context, regardless of whether the job actually executed.
A job-level `if:` that evaluates false produces a SKIPPED check run.
Branch protection's "required check" rule looks at the SET of check
runs with the matching context name on the latest commit and treats
any conclusion other than SUCCESS as not-passed — including SKIPPED.
Adding a sibling no-op job under the same `name:` (PR #2321 / #2322
attempt) doesn't help: branch protection still sees the SKIPPED
sibling and stays BLOCKED.
The shape that works: ONE job per required check name, no job-level
`if:`, all real work gated per-step. The job always runs and reports
SUCCESS regardless of which paths changed.
This patch:
* Canvas (Next.js): drops the `canvas-build-noop` shadow added in
#2321 (which didn't actually clear merge state — verified live on
PR #2314). Refactors `canvas-build` to always run; gates checkout/
setup-node/install/build/test on `if: needs.changes.outputs.canvas
== 'true'`. Coverage upload step also gated.
* Platform (Go): drops job-level `if:`. Gates checkout/setup-go/
download/build/vet/lint/test/coverage-report/threshold-check on
per-step `if:`.
* Python Lint & Test: drops job-level `if:`. Gates checkout/setup-
python/install/pytest on per-step `if:`.
* Shellcheck (E2E scripts): drops job-level `if:`. Gates checkout/
shellcheck-run on per-step `if:`.
Each refactored job adds a leading no-op echo step with a
`working-directory: .` override, so the always-running spin-up doesn't
fail when the job's default working-directory (workspace,
workspace-server, canvas) doesn't exist because checkout was skipped.
Why all four in one PR: the bug shape is identical across all four,
and a future PR that only touches workspace-server (passing platform
filter, missing canvas/python/scripts) would hit the same BLOCKED state
on whichever filter it missed. PR-A and PR-2321 merged because their
diffs happened to trigger every filter; PR-B (#2314) only missed
canvas. Fixing one at a time means re-living this debugging cycle three
more times.
Cost: ~10s of always-on CI runtime per PR per job (the ubuntu-latest
spin-up + the no-op echo). 40s aggregate, negligible vs. the manual-
merge cost when BLOCKED catches us.
Memory `feedback_branch_protection_check_name_parity` already updated
(2026-04-29) to mark the original two-jobs-sharing-name pattern as
DO NOT FOLLOW and document the working shape this PR uses.
Refs PR #2321 (the misguided fix-attempt that this supersedes).
Supersedes PR #2321's two-jobs-sharing-a-name approach, which didn't
actually clear branch-protection's required-check evaluation. Live
test on PR #2314: GraphQL `isRequired` confirmed BOTH check runs
under "Canvas (Next.js)" name (one SUCCESS via no-op, one SKIPPED via
real job) registered, and the SKIPPED one kept mergeStateStatus =
BLOCKED despite the SUCCESS sibling. Branch protection's "set of
matching contexts" semantic is stricter than what the durable feedback
memory documented — at least one passing isn't enough; SKIPPED
counts as not-passed regardless.
Real fix: ONE job that always runs (no job-level `if:`), with all
real work gated on the path filter via per-step `if:`. Produces
exactly one "Canvas (Next.js)" check run per commit, always SUCCEEDS,
regardless of which paths changed. Costs ~10s of always-on CI runtime
per PR — negligible vs. the manual-merge cost when the BLOCKED state
catches us.
This same anti-pattern probably affects Platform (Go) (`platform`
filter), Python Lint & Test (`python` filter), and Shellcheck (E2E
scripts) (`scripts` filter) — all required, all path-gated. PR-A and
PR-2321 merged because they happened to trigger every filter; PR-B
only missed canvas. File a follow-up issue to apply the same
single-job-conditional-steps pattern across those required jobs to
remove the latent merge-blocker.
Updates feedback memory: branch_protection_check_name_parity is wrong
about "two jobs sharing name + at-least-one-success works." Need to
correct the note.
PRs that don't touch canvas/** paths skip the Canvas (Next.js) job via
its `if: needs.changes.outputs.canvas == 'true'` guard. GitHub reports
SKIPPED for that conclusion. Branch protection on staging requires
Canvas (Next.js) — and treats SKIPPED as not-passed, blocking merge
on every workspace-server-only or migration-only PR.
This is the design pattern documented in feedback memory
"branch_protection_check_name_parity": split into a real job + a
no-op shadow that share the same `name:`. Exactly one runs per PR;
both report the same check context, and at least one always reports
SUCCESS, satisfying the required check.
The no-op job runs in a few seconds (single `echo` step) and produces
the right check context for any PR that has changes outside canvas/**.
Concrete blocker that prompted this: PR #2314 (RFC #2312 PR-B) sat
APPROVED + CI-green + UP-TO-DATE for half an hour with mergeStateStatus
BLOCKED, traced via the GraphQL `isRequired` field to a single
SKIPPED Canvas (Next.js) check. PRs #2319 (PR-F) and the rest of the
RFC #2312 stack would have hit the same wall.
Step 2 of #1815. Step 1 (instrumentation in canvas/vitest.config.ts)
already shipped — the inline comment there explicitly defers wiring
into CI to a follow-up because turning on a 70% threshold blind would
either fail CI immediately or paper over a real gap with an ad-hoc
exclude list.
This PR ships the observability half:
- Replaces `npx vitest run` with `npx vitest run --coverage` in the
canvas-build job. Coverage gets reported on every PR; no threshold
gate yet (vitest.config.ts intentionally doesn't set thresholds).
- Adds an artifact upload step for canvas/coverage/ (HTML + json-summary)
so reviewers can browse the coverage report from any PR. 7-day
retention; if-no-files-found=warn so a step skip doesn't fail.
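A sketch of the two touched steps (artifact name and action version are
illustrative):

    - name: Unit tests with coverage
      working-directory: canvas
      run: npx vitest run --coverage
    - name: Upload coverage report
      uses: actions/upload-artifact@v4
      with:
        name: canvas-coverage
        path: canvas/coverage/
        retention-days: 7
        if-no-files-found: warn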
Step 3 (thresholds + hard gate) is the natural follow-up — track in a
new sub-issue once we've seen ~5-10 PRs of baseline data and know
where current coverage sits. The issue body proposed lines:70 /
functions:70 / branches:65 / statements:70; that may need adjustment
once the baseline lands.
Closes the Step-2 portion of #1815. Step 3 stays open or gets a fresh
issue depending on your preference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a comment block at the top of auto-promote-staging.yml naming the
load-bearing one-time repo setting that the workflow depends on:
Settings → Actions → General → Workflow permissions
→ ✅ Allow GitHub Actions to create and approve pull requests
Without this toggle, every workflow_run fails with
"GitHub Actions is not permitted to create or approve pull requests
(createPullRequest)". Observed 2026-04-29 01:43 UTC blocking the
fcd87b9 promotion (PRs #2248 + #2249); manually bridged via PR #2252.
The setting is invisible to anyone reading the workflow file, but the
workflow cannot do its job without it. Documenting here so the next
time it gets toggled off (org admin change, repo migration, audit
cleanup) the failure mode points at the cause rather than another
round of "why is auto-promote broken."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the sweep-cf-orphans hardening (#2248) on publish-runtime's
TEMPLATE_DISPATCH_TOKEN gate. The previous behaviour was to print a
⚠️ warning ("skipping cascade — templates will pick up the new version
on their own next rebuild") and exit 0. That message is wrong: the 8
workspace-template repos only rebuild on this repository_dispatch
fanout. Without the dispatch they stay pinned to whatever runtime
version they last saw, and the gap is invisible until someone
notices a template several versions behind weeks later.
Behaviour after this PR:
- push (auto-trigger on workspace/runtime/** changes) → exit 1
- workflow_dispatch (manual operator) → exit 0
with a warning (operator already accepted state; let them rerun
after restoring the secret)
The token-missing path now also names the consequence concretely
("templates will NOT pick up the new version until this token is
restored") so future operators see the actionable line, not the
misleading "they'll catch up on their own" message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the soft-skip-with-warning behaviour for scheduled runs of the
hourly Cloudflare orphan sweeper with an explicit failure when the six
required secrets aren't set. Manual workflow_dispatch keeps the
soft-skip path so an operator can short-circuit a deliberate rerun
without redoing the secrets dance — they accepted the state when they
clicked the button.
Why: from some-date to 2026-04-28, all six secrets were unset on the
repo. Every hourly tick printed a yellow ⚠️ warning and exited 0,
which GitHub registers as "completed/success" — the sweeper was
indistinguishable from a healthy janitor with nothing to do. Cloudflare
orphans accumulated unobserved to 152/200 (~76% of the zone quota),
and only surfaced via a manual audit. The mechanism to catch this kind
of regression is to make the workflow loud: red runs prompt
investigation, green runs are presumed healthy.
Schedule/workflow_run/push paths now print three ::error:: lines
naming the missing secrets, the fix, and a one-line reference to this
incident, then exit 1.
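The shape of that gate, sketched (secret list abbreviated and the exact
error wording illustrative):

    - name: Require sweep secrets
      env:
        CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }}
        CF_ACCOUNT_ID: ${{ secrets.CF_ACCOUNT_ID }}
        CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }}
      run: |
        missing=()
        for s in CF_API_TOKEN CF_ACCOUNT_ID CP_PROD_ADMIN_TOKEN; do   # abbreviated list
          [ -n "${!s}" ] || missing+=("$s")
        done
        if [ "${#missing[@]}" -eq 0 ]; then exit 0; fi
        if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
          echo "::warning::missing ${missing[*]} — soft-skipping (manual dispatch)"
          exit 0
        fi
        echo "::error::missing secrets: ${missing[*]}"
        echo "::error::fix: add them under Settings → Secrets and variables → Actions"
        echo "::error::context: unset secrets previously let CF orphans accumulate to 152/200 silently"
        exit 1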
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the fix #2234 applied to auto-sync-main-to-staging.yml, in the
reverse direction. Both workflows now use the same merge-queue path
that humans use; no special-case bypass.
Why
Every tick of auto-promote-staging.yml since main's branch protection
went stricter has been failing with:
remote: error: GH006: Protected branch update failed for refs/heads/main.
remote: - Required status checks "Analyze (go)", "Analyze (javascript-typescript)",
"Analyze (python)", "Canvas (Next.js)", "Detect changes",
"E2E API Smoke Test", "Platform (Go)", "Python Lint & Test",
and "Shellcheck (E2E scripts)" were not set by the expected
GitHub apps.
remote: - Changes must be made through a pull request.
The previous version did `git merge --ff-only origin/staging &&
git push origin main` directly. That works against a permissive
branch — it doesn't work against a ruleset that requires checks
satisfied by the expected GitHub apps. Only PR merges through the
queue produce check runs from the right apps.
Result was that today's 12+ merges to staging never propagated to
main; the auto-promote ran every tick and failed every tick, while
operators had to keep opening manual `staging → main` bridges.
Fix
- Replace the direct git push step with a step that opens (or reuses)
a PR base=main head=staging and enables auto-merge. The merge queue
lands it once gates are green on the merge_group ref.
- The PR's head IS the staging branch (no per-SHA promote branch
needed) — the whole purpose is "advance main to staging's tip".
- Add `pull-requests: write` permission so the workflow can call
gh pr create + gh pr merge --auto.
- Drop the `git merge-base --is-ancestor` divergence check — the
merge queue itself enforces branch protection now, and rejects
the PR if main has diverged from staging history.
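A sketch of the new promote step (title/body text and the merge method
are illustrative):

    - name: Open or reuse the staging→main promote PR and enable auto-merge
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        PR=$(gh pr list --base main --head staging --state open --json number --jq '.[].number')
        if [ -z "$PR" ]; then
          PR=$(gh pr create --base main --head staging \
            --title "chore: promote staging → main" \
            --body "Automated promotion; lands via the merge queue once gates are green." \
            | grep -oE '[0-9]+$')
        fi
        gh pr merge "$PR" --auto --merge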
Loop safety preserved: when this PR's merge lands on main, it
triggers auto-sync-main-to-staging.yml which opens a sync PR back
to staging. That sync PR's eventual merge is by GITHUB_TOKEN (the
merge queue) which doesn't trigger downstream workflow_run events
— so auto-promote-staging.yml does NOT re-fire from its own merge
landing.
Refs: #2234 (the parallel fix for auto-sync-main-to-staging.yml),
task #142, multiple failing runs visible in
https://github.com/Molecule-AI/molecule-core/actions/workflows/auto-promote-staging.yml
Consolidates the remaining safe-to-merge dependabot PRs from the
2026-04-28 wave into one consumable PR. Replaces three earlier
single-bump PRs (#2245, #2230, #2231) which were closed in favor of
this single batch — same pattern as #2235.
GitHub Actions majors (SHA-pinned per org convention):
github/codeql-action v3 → v4.35.2 (#2228)
actions/setup-node v4 → v6.4.0 (#2218)
actions/upload-artifact v4 → v7.0.1 (#2216)
actions/setup-python v5 → v6.2.0 (#2214)
npm dev deps (canvas/, lockfile regenerated in node:22-bookworm
container so @emnapi/* and other Linux-only optional deps are
properly resolved — Mac-native `npm install` strips them, which
caused the earlier #2235 batch to drop these two):
@types/node ^22 → ^25.6 (#2231)
jsdom ^25 → ^29.1 (#2230)
Why each is safe
setup-node v4 → v6 / setup-python v5 → v6:
Every consumer call pins node-version / python-version
explicitly. v5 / v6 changed defaults but pinned consumers
are unaffected. Confirmed via grep across .github/workflows/
— all setup-node call sites pin '20' or '22', all
setup-python call sites pin '3.11'.
codeql-action v3 → v4.35.2:
Used as init/autobuild/analyze sub-actions in codeql.yml.
v4 bundles a newer CodeQL CLI; ubuntu-latest auto-updates
so functional behavior is unchanged. The deprecated
CODEQL_ACTION_CLEANUP_TRAP_CACHES env var (per v4.35.2
release notes) is undocumented and we don't set it.
upload-artifact v4 → v7.0.1:
v6 introduced Node.js 24 runtime requiring Actions Runner
>= 2.327.1. All upload-artifact users (codeql.yml,
e2e-staging-canvas.yml) run on `ubuntu-latest` (GitHub-
hosted), which auto-updates the runner agent. Self-hosted
runners are NOT used for these jobs.
@types/node 22 → 25 / jsdom 25 → 29:
Both are dev-only — @types/node is type definitions,
jsdom backs vitest's DOM environment. Tests pass:
79 files / 1154 tests in node:22-bookworm container.
Verified locally (Linux container so the lockfile reflects what
CI's `npm ci` will install):
- cd canvas && npm install --include=optional → 169 packages
- npm test → 1154/1154 pass
- npm ci → clean install succeeds
- npm run build → Next.js prerendering succeeds
Closes when this lands (the 3 individual auto-merge PRs from earlier
were closed):
#2228, #2218, #2216, #2214, #2231, #2230
NOT included (CI failing on dependabot's own run — major framework
bumps that need code-side migration tasks, not safe auto-bumps):
#2233 next 15 → 16
#2232 tailwindcss 3 → 4
#2226 typescript 5 → 6
Branch protection on `main` requires "E2E API Smoke Test" as a status
check. With Design B's no-op + e2e-api job split, when paths-filter
excludes a commit:
- e2e-api job (name="E2E API Smoke Test"): SKIPPED
- no-op job (name="no-op"): SUCCESS
Branch protection counts the skipped check-run as not-satisfied →
auto-promote-staging's `git push origin main` rejected with GH006.
Observed 2026-04-28 00:22 UTC: every gate green at the workflow level,
all_green=true in auto-promote-staging's gate-check, but the FF push
itself rejected with:
Required status checks "..., E2E API Smoke Test, ..." were not set
by the expected GitHub apps.
Fix: give the no-op job the same `name:` as the real one. Now both
register as check-runs named "E2E API Smoke Test" — exactly one runs
per workflow execution (mutex `if`), the other registers as skipped
with the same name. Branch protection sees at least one success,
requirement satisfied.
Same fix applied to e2e-staging-canvas.yml's no-op (name → "Canvas
tabs E2E") for symmetry, even though "Canvas tabs E2E" isn't currently
in main's required check list — kept consistent so the next time a
required-checks reshuffle pulls it in, it doesn't recreate this bug.
Note: Design B's intent was always "emit a result auto-promote can
read" — that intent was satisfied at the workflow-conclusion level
(success), but missed the per-check-run-name level. This PR closes
that second-order gap.
e2e-staging-canvas had a single global concurrency group:
concurrency:
group: e2e-staging-canvas
cancel-in-progress: false
That meant the entire repo shared one running + one pending slot. When a
staging push queued behind an in-flight run and a third entrant (a PR
run, a follow-on push) entered the group, the staging push got
cancelled. auto-promote-staging then saw `completed/cancelled` for a
required gate and refused to advance main.
Observed 2026-04-28 23:51-23:53: staging tip 3f99fede's e2e-staging-
canvas push run was cancelled within 2:20 of starting because a PR run
on a follow-on branch entered the group. Auto-promote-staging fired 8+
times after that, all skipped because canvas was still in the cancelled
state. The chain stayed stuck until the cancelled run was manually
re-dispatched.
e2e-api had a softer version of the same bug — `group: e2e-api-${{
github.ref }}`. Per-ref isolates push events from PR events, so this
specific scenario didn't hit it, but back-to-back pushes to staging at
SHA-A and SHA-B share refs/heads/staging and would still cancel SHA-A's
queued run when SHA-B enters.
Both workflows now use per-SHA grouping. The single-global-group's
original intent was to throttle parallel E2E provisions, but each E2E
run already isolates its state via fresh-org-per-run, and parallel
infrastructure cost at our scale (~$0.001/min × 10min × 2) is rounding
error compared to a stuck pipeline.
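The new grouping, sketched (same shape in both workflows; the group
name prefix is illustrative):

    concurrency:
      group: e2e-staging-canvas-${{ github.sha }}   # per-SHA: dedupes same-SHA double-triggers only
      cancel-in-progress: false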
Per-SHA still dedupes accidental double-triggers for the SAME SHA.
It does not cancel obsolete-PR-version runs on force-push — that wasted
CI is acceptable given the alternative is losing staging-tip data that
auto-promote-staging depends on.
Other gate workflows: ci.yml uses `cancel-in-progress: true` which is
correct for unit tests (intentional cancellation on supersede). codeql.yml
is per-ref like e2e-api was; same fix probably applies if the same
deadlock pattern is observed there, but no incident yet so deferring.
Self-review caught a real correctness bug: the scenario where publish-
workspace-server-image completes BEFORE E2E Staging SaaS for a runtime-
touching SHA. Publish typically takes ~5-10min; E2E ~10-15min, so this
ordering is the common case for runtime-path PRs.
Previous gate logic:
- completed/success: proceed
- completed/failure: abort
- everything else (including in_progress): proceed ← BUG
If publish-trigger fires while E2E is still running, the gate returned
"in_progress/none" and fell through the catch-all "proceed" branch.
Result: :latest retagged on the publish signal alone. Then E2E ends
red — but :latest was already wrongly advanced; the E2E-completion
trigger's job-level if=conclusion==success filter just skips, never
rolls back.
Fix: explicit case for in_progress|queued|requested|waiting|pending
that DEFERS — sets gate.proceed=false, writes a "deferred" summary,
exits 0 (workflow run shows success, retag steps skipped). The E2E
completion trigger then fires later and either promotes (green) or
aborts (red), giving us correct ordering regardless of who finishes
first.
Subsequent steps now guarded by `if: steps.gate.outputs.proceed ==
'true'` instead of relying on `exit 1` for skip semantics.
Also added an explicit catch-all `*)` branch that aborts on unknown
states (forward-compat: GitHub adds a new status, we surface it
instead of silently promoting through it).
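A sketch of the revised gate branching (E2E_STATUS / E2E_CONCLUSION are
assumed set by the check-runs lookup earlier in the step; the
did-not-fire path and summary wording are elided/illustrative):

    - id: gate
      run: |
        case "${E2E_STATUS}/${E2E_CONCLUSION}" in
          completed/success)
            echo "proceed=true" >> "$GITHUB_OUTPUT" ;;
          completed/*)
            echo "::error::E2E Staging SaaS ended ${E2E_CONCLUSION} for this SHA"; exit 1 ;;
          in_progress/*|queued/*|requested/*|waiting/*|pending/*)
            echo "proceed=false" >> "$GITHUB_OUTPUT"
            echo "Deferred: E2E still running; its completion trigger will promote or abort." \
              >> "$GITHUB_STEP_SUMMARY" ;;
          *)
            echo "::error::unknown E2E status '${E2E_STATUS}' — refusing to promote"; exit 1 ;;
        esac
    - name: Retag :latest
      if: steps.gate.outputs.proceed == 'true'
      run: |
        # …the existing crane retag commands, now guarded by the gate output
        true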
Previously this workflow only triggered on E2E Staging SaaS completion,
which is itself paths-filtered to runtime handlers
(workspace-server/internal/handlers/{registry,workspace_provision,
a2a_proxy}.go, middleware/**, provisioner/**). publish-workspace-server
-image fires on a STRICTLY BROADER path set (workspace-server/**,
canvas/**, manifest.json) — so canvas-only or cmd-only or sweep-only
PRs rebuilt the platform image without ever advancing :latest.
Result observed 2026-04-28: zero runs of this workflow since merge
despite eight main pushes. :latest sat ~7 hours / 9 PRs behind main.
Fix: add publish-workspace-server-image as a second trigger. Add an
explicit gate inside the job that aborts when E2E Staging SaaS for the
same SHA ended red. When E2E didn't fire (paths-filtered), proceed —
auto-promote-staging's pre-merge gates (CI + E2E Canvas + E2E API +
CodeQL on staging) already validated this SHA before main moved.
Concurrency group serializes promotes per-SHA so the publish+E2E both-
fired race lands cleanly. Idempotent crane tag makes it safe regardless.
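The resulting trigger block, sketched (display names assumed to match
the workflows' `name:` fields):

    on:
      workflow_run:
        workflows: ["E2E Staging SaaS", "publish-workspace-server-image"]
        types: [completed]
        branches: [main]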
Consolidates 11 of the 17 open Dependabot PRs (#2215, #2217, #2219-#2225,
#2227, #2229) into one PR. Every entry is a patch / minor / floor bump
where the impact surface is small and CI carries the proof.
Same pattern as the 2026-04-15 batch.
Go (workspace-server/go.mod + go.sum, regenerated via `go mod tidy`):
- golang.org/x/crypto 0.49.0 → 0.50.0 (#2225)
- github.com/golang-jwt/jwt/v5 5.2.2 → 5.3.1 (#2222)
- github.com/gin-contrib/cors 1.7.2 → 1.7.7 (#2220)
- github.com/docker/go-connections 0.6.0 → 0.7.0 (#2223)
- github.com/redis/go-redis/v9 9.7.3 → 9.19.0 (#2217)
Python floor bumps (workspace/requirements.txt; current pip-resolved
versions don't change unless they happen to be below the new floor):
- httpx >=0.27 → >=0.28.1 (#2221)
- uvicorn >=0.30 → >=0.46 (#2229)
- temporalio >=1.7 → >=1.26 (#2227)
- websockets >=12 → >=16 (#2224)
- opentelemetry-sdk >=1.24 → >=1.41.1 (#2219)
GitHub Actions (SHA-pinned per existing convention):
- dorny/paths-filter@d1c1ffe (v3) → @fbd0ab8 (v4.0.1) (#2215)
REMOVED from this batch (lockfile platform mismatch):
- #2231 @types/node ^22 → ^25.6 (npm install on macOS strips
Linux-only @emnapi/* entries from package-lock.json that CI's
`npm ci` then refuses; needs a Linux-side install to land cleanly)
- #2230 jsdom ^25 → ^29.1 (same)
NOT included in this batch (deferred to per-PR human review):
- #2228 github/codeql-action v3 → v4 (CodeQL CLI alignment risk)
- #2218 actions/setup-node v4 → v6 (default Node version drift)
- #2216 actions/upload-artifact v4 → v7 (3 major versions)
- #2214 actions/setup-python v5 → v6 (action major)
NOT merged (CI failing on dependabot's own PR):
- #2233 next 15 → 16
- #2232 tailwindcss 3 → 4
- #2226 typescript 5 → 6
Verified:
- workspace-server: `go mod tidy && go build ./... && go test ./...` — green
- workspace requirements.txt: floor bumps only
The molecule-core/staging branch is protected by ruleset 15500102
(name: staging-merge-queue) which blocks ALL direct pushes — no
bypass even for org admins or the GitHub Actions integration. The
prior version of this workflow attempted `git push origin staging`
and was rejected with GH013:
! [remote rejected] staging -> staging
(push declined due to repository rule violations)
- Changes must be made through a pull request.
- Changes must be made through the merge queue
This was a real architectural mismatch: auto-sync was bypassing
the same gates everyone else goes through to land on staging,
which is exactly what the ruleset is designed to prevent.
The fix matches the org convention: the workflow now opens a PR
(base=staging, head=auto-sync/main-<sha>) and enables auto-merge.
The merge queue picks it up, runs required gates against the
merged result, and lands it. Same path human PRs take through
staging — no special-snowflake bypass.
Trade-off acknowledged
- Slight PR churn: every main push that needs sync opens a tracked
PR. With concurrency: cancel-in-progress: false (existing) and
the merge queue's serial processing, this is bounded — PRs land
in order, no thundering herd.
- The previous direct-push approach worked on
molecule-controlplane (which has no merge_queue ruleset on
staging). That version of the workflow was correct for that
repo's protection model. Per-repo divergence is acceptable; the
invariant ("staging ⊇ main") is what matters, not how it's
enforced.
Loop safety preserved
GITHUB_TOKEN-authored merges (including the merge queue's land
of this PR) do NOT trigger downstream workflow runs. So the merge
to staging from this PR doesn't fire auto-promote-staging — same
as the direct-push version.
Idempotency
The branch name is derived from main's short sha
(`auto-sync/main-<sha>`) so workflow restarts on the same main
push reuse the existing branch + PR rather than opening duplicates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Supply-chain hardening for the CI pipeline. 23 workflow files
modified, 59 mutable-tag refs replaced with commit SHAs.
The risk
Every `uses:` reference in .github/workflows/*.yml was pinned to a
mutable tag (e.g., `actions/checkout@v4`). A maintainer of an
action — or a compromised maintainer account — can repoint that
tag to malicious code, and our pipelines silently pull it on the
next run. The tj-actions/changed-files compromise of March 2025 is
the canonical example: maintainer credential leak, attacker
repointed several `@v<N>` tags to a payload that exfiltrated
repository secrets. Repos that pinned to SHAs were unaffected.
The fix
Replace each `@v<N>` with `@<commit-sha> # v<N>`. The trailing
comment preserves human readability ("ah, this is v4"); the SHA
makes the reference immutable.
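The before/after shape for a single reference (the SHA below is a zero
placeholder, not checkout's actual v4 commit):

    # before — tag is mutable; upstream can repoint it
    - uses: actions/checkout@v4
    # after — immutable commit SHA; the trailing comment keeps it human-readable
    - uses: actions/checkout@0000000000000000000000000000000000000000  # v4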
Actions covered (10 distinct):
actions/{checkout,setup-go,setup-python,setup-node,upload-artifact,github-script}
docker/{login-action,setup-buildx-action,build-push-action}
github/codeql-action/{init,autobuild,analyze}
dorny/paths-filter
imjasonh/setup-crane
pnpm/action-setup (already pinned in molecule-app, listed here for completeness)
Excluded:
Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
— internal org reusable workflow; we control its repo, threat model
is different from third-party actions. Conventional to pin to @main
rather than SHA for internal reusables.
The maintenance cost
SHA pinning means upstream fixes require manual SHA bumps. Without
automation, pinned SHAs go stale. So this PR also enables Dependabot
across four ecosystems:
- github-actions (workflows)
- gomod (workspace-server)
- npm (canvas)
- pip (workspace runtime requirements)
Weekly cadence — the supply-chain attack window is "minutes between
repoint and pull"; weekly auto-bumps don't help with zero-days
regardless. The point is to pull in non-zero-day fixes without
operator effort.
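A sketch of the .github/dependabot.yml this enables (directories and
the weekly interval follow the list above; exact layout illustrative):

    version: 2
    updates:
      - package-ecosystem: "github-actions"
        directory: "/"
        schedule: { interval: "weekly" }
      - package-ecosystem: "gomod"
        directory: "/workspace-server"
        schedule: { interval: "weekly" }
      - package-ecosystem: "npm"
        directory: "/canvas"
        schedule: { interval: "weekly" }
      - package-ecosystem: "pip"
        directory: "/workspace"
        schedule: { interval: "weekly" }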
Aligns with user-stated principle: "long-term, robust, fully-
automated, eliminate human error."
Companion PR: Molecule-AI/molecule-controlplane#308 (same pattern,
smaller surface).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a lint that diffs the canonical SECRET_PATTERNS array in
.github/workflows/secret-scan.yml against every known public
consumer mirror, failing on any divergence.
Why: every side that scans for credentials carries its own copy of
the pattern list. They drift — most recently the workspace-runtime
pre-commit hook lagged the canonical by one pattern (sk-cp- /
MiniMax F1088 vector), so a developer's local pre-commit would let
a sk-cp- token through while the org-wide CI scan would refuse it.
Useless friction; automated detection closes the gap.
Implementation:
.github/scripts/lint_secret_pattern_drift.py — pure stdlib, fetches
each consumer's RAW file via urllib, extracts the
SECRET_PATTERNS=( ... ) array via anchored regex (the closing
`)` is anchored to the start of a line because pattern comments
like `# GitHub PAT (classic)` contain their own paren mid-line),
diffs against canonical, fails on missing or extra patterns.
Fetch failures are warnings, not errors — a consumer whose
branch was renamed shouldn't fail the lint until someone updates
the URL list.
.github/workflows/secret-pattern-drift.yml — daily 05:00 UTC cron
+ on-push gate (when canonical, the workflow, or the script
changes) + workflow_dispatch. Read-only token, 5-minute timeout.
Initial consumer set: workspace-runtime's bundled pre-commit hook
(the one that drifted on sk-cp-). molecule-controlplane's inlined
copy is private so this workflow can't read it; that's tracked
separately and the controlplane's own self-monitor is the gap.
Verified locally: lint detects drift correctly when the runtime
hook is missing sk-cp-, returns clean when aligned.
Refs: task #139.
Three small fixes from the self-review of #2209:
1. **Required: concurrency group.** Two pushes to main in quick
succession (manual UI merge then auto-promote-staging's ff-push,
or any back-to-back main pushes) would race two auto-sync runs
against the same staging branch — second `git push origin staging`
fails non-fast-forward, surfacing as a red CI alert for what should
be a no-op. Add `concurrency: { group: auto-sync-main-to-staging,
cancel-in-progress: false }` so the second run waits for the first
and sees its result.
2. **Hygiene: `git merge --abort` on conflict.** The conflict-error
path exits 1 with the work tree in a half-merged state. Doesn't
affect future runs (each gets a fresh checkout) but is an
unpleasant artifact for anyone who shells into the runner. Abort
first, then exit.
3. **Doc accuracy: "Loop safety" comment.** The original said the
chain terminates because "main is either a no-op or advances
further." That's true but understates the actual safety: GitHub
Actions explicitly does NOT trigger downstream workflow runs from
`GITHUB_TOKEN`-authored pushes. So the loop is impossible by
construction, not just by happy coincidence of ref state. Updated
the comment to reflect the actual mechanism.
Plus a step-name nit: "Fast-forward staging → main" reads as if main
is the target. Renamed to "Fast-forward staging to main" for
consistency with the workflow's name (main → staging).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Background
`auto-promote-staging.yml` advances main via `git merge --ff-only`
+ `git push origin main` — clean fast-forward, no merge commit. But
manual `staging → main` merges via the GitHub UI / API create a merge
commit on main that staging doesn't have. The next `staging → main`
PR then evaluates as "BEHIND" because staging is missing that merge
commit, requiring a manual `gh pr update-branch` round-trip.
This pattern bit twice on 2026-04-28 (PRs #2202 and #2205, both
manual bridges to land pipeline fixes themselves). Each needed
update-branch + re-CI before they could merge. Annoying and
avoidable.
What this workflow does
Triggered on every push to main (regardless of source: auto-promote,
UI merge, API merge, direct push):
1. Check whether main is already in staging's ancestry. If yes,
no-op — auto-promote-staging keeps them aligned via ff push,
and the no-op case is the steady state.
2. If not (manual merge commit on main, or direct main hotfix):
try `git merge --ff-only origin/main` first. Works when staging
hasn't diverged with its own commits.
3. If ff fails (staging has its own in-flight feature work):
`git merge --no-ff origin/main -m "chore: sync main → staging"`.
Absorbs main's tip while keeping staging's own history.
4. Push staging.
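A sketch of steps 1-4 as a checkout plus one run step (uses FETCH_HEAD
where the prose says origin/main; bot identity and commit message text
are illustrative):

    - uses: actions/checkout@v4
      with:
        ref: staging
        fetch-depth: 0
    - name: Sync main into staging
      run: |
        git config user.name  "github-actions[bot]"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git fetch origin main                  # FETCH_HEAD = main's tip
        if git merge-base --is-ancestor FETCH_HEAD HEAD; then
          echo "main already contained in staging — no-op"; exit 0      # step 1
        fi
        # step 2: fast-forward when staging has no commits of its own;
        # step 3: otherwise absorb main's tip with a merge commit
        git merge --ff-only FETCH_HEAD \
          || git merge --no-ff FETCH_HEAD -m "chore: sync main → staging"
        git push origin staging                                          # step 4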
Loop safety
Pushing the synced staging triggers auto-promote-staging.yml, which
checks gates on staging's new tip and, if green, ff-pushes staging
to main. Since staging now ⊇ main, the resulting push to main is
either a no-op (no ref change → no push event fires → auto-sync
doesn't re-trigger) or advances main further. In the latter case
auto-sync fires once more, sees main already in staging's ancestry,
no-ops. Bounded.
Conflict handling
If the merge step hits conflicts (staging and main diverged with
incompatible changes), the workflow fails with a clear summary
pointing to manual resolution. This shouldn't happen in practice —
staging is the integration branch; conflicts indicate a direct main
hotfix touching the same code as in-flight staging work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two latent bash bugs in the canonical secret-scan workflow caught
during the post-merge review of molecule-controlplane #301 (a
private consumer that inlined this workflow's logic and got both
fixes there). Same bugs apply here; fixing in canonical means every
public consumer (gh-identity, github-app-auth, the 8 workspace
template repos) inherits the fix on their next workflow_call.
Bug 1: `printf "$OFFENDING"` is a format-string sink.
OFFENDING is built from filenames: `${f} (matched: ${pattern})\n`.
When passed to printf as the first argument, `%` characters in a
filename are interpreted as conversion specifiers — corrupting the
error message or printing `%(missing)` artifacts. No filename in
the current tree triggers it, but a future test fixture, build
artifact, or contributor-supplied path could.
Fix: `printf '%b' "$OFFENDING"` interprets the literal `\n` we
appended without treating OFFENDING as a format string.
Bug 2: `for f in $CHANGED` word-splits on whitespace.
Filenames containing spaces would split into multiple tokens. The
self-exclude check (`[ "$f" = "$SELF" ] && continue`) and the diff
lookup would both operate on partial-path tokens. No filename in
the current tree has whitespace, but the failure would be silent
if one ever did.
Fix: `while IFS= read -r f; do ... done <<< "$CHANGED"` reads
whole lines as filenames. Added `[ -z "$f" ] && continue` to
match the original `for` loop's implicit empty-input skip.
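The corrected loop shape, sketched (CHANGED, SELF, and SECRET_PATTERNS
are assumed set earlier in the step; the real step scans the diff's
added lines, a plain file grep is shown to keep the sketch short):

    - name: Scan changed files for secret patterns
      run: |
        OFFENDING=""
        while IFS= read -r f; do                     # bug-2 fix: no word-splitting on spaces
          [ -z "$f" ] && continue
          [ "$f" = "$SELF" ] && continue             # self-exclude, now whole-path safe
          for pattern in "${SECRET_PATTERNS[@]}"; do
            if grep -Eq "$pattern" "$f"; then
              OFFENDING+="${f} (matched: ${pattern})\n"
            fi
          done
        done <<< "$CHANGED"
        if [ -n "$OFFENDING" ]; then
          printf '%b' "$OFFENDING"                   # bug-1 fix: '%b' expands the appended \n
          exit 1                                     # without treating OFFENDING as a format string
        fi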
Both fixes are mechanically straightforward (~16 lines net diff,
mostly comments documenting the why). No behavior change for
filenames in the current tree; strictly better for the edge cases.
The same fixes already shipped in molecule-controlplane via #301
which inlined a copy of this workflow. The runtime's bundled
pre-commit hook (molecule-ai-workspace-runtime:
molecule_runtime/scripts/pre-commit-checks.sh) likely has the same
bugs — flagged as a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the final gap in the SaaS pipeline. After auto-promote-staging
fast-forwards main, publish-workspace-server-image builds new
`:staging-<sha>` images, but `:latest` (what prod tenants pull) only
moves on either a manual `promote-latest.yml` dispatch or a canary-
verify retag (gated on Phase 2 fleet that doesn't exist).
This workflow closes that gap by retagging
`platform:staging-<sha>` + `platform-tenant:staging-<sha>` → `:latest`
whenever E2E Staging SaaS passes for a `main` push. Uses crane
(no Docker daemon needed). Verifies both images exist before retagging
either, so a half-published state is impossible.
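A sketch of the verify-then-retag step (image paths and the SHA
variable's source are illustrative; crane assumed already installed):

    - name: Retag :staging-<sha> → :latest (both images or neither)
      run: |
        SRC="staging-${SHA}"                         # SHA of the main push being promoted
        IMAGES="ghcr.io/molecule-ai/platform ghcr.io/molecule-ai/platform-tenant"  # illustrative
        for img in $IMAGES; do
          crane manifest "${img}:${SRC}" > /dev/null # verify existence before touching either tag
        done
        for img in $IMAGES; do
          crane tag "${img}:${SRC}" latest           # idempotent retag, no Docker daemon needed
        done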
Why trigger only on `main` (not staging):
- `:latest` is what prod tenants pull. Only SHAs that have reached
`main` (via auto-promote-staging) should advance `:latest`.
- Triggering on staging would let a staging-only revert advance
`:latest` to a SHA that never reaches `main`, breaking the
invariant "production runs what's on `main`".
Why a separate workflow rather than folding into e2e-staging-saas.yml:
- Test concerns and release concerns separate.
- Disabling promote during an incident is one workflow toggle, not
an edit to the long E2E file.
- When Phase 2 canary work eventually lands, the canary path can
replace this trigger without touching the E2E workflow.
Doc-aligned: per molecule-controlplane/docs/canary-tenants.md,
"green staging E2E → :latest" is the recommended approach for the
current scale (≤20 paying tenants); canary fleet is deferred until
blast radius grows.
Pipeline after this lands is fully self-healing:
staging push → 4 gates green → auto-promote fast-forwards main
→ publish-workspace-server-image → E2E Staging SaaS
→ THIS WORKFLOW retags :latest → tenant fleet auto-pulls in 5 min
(or redeploy-tenants-on-main fans out faster)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Observed 2026-04-28: auto-promote ran for staging head 96955f7b with
all gates actually green (verified via /commits/<sha>/check-runs API)
yet `check-all-gates-green` reported `CodeQL → missing/none` and
aborted. Same SHA was promotable; auto-promote couldn't see it.
Cause: `gh run list --workflow="CodeQL"` matched two workflows in
this repo:
- codeql.yml (explicit, scans both staging and main)
- codeql (GitHub UI-configured Code-quality default setup,
internal, scans default branch only)
gh CLI rejects ambiguous `--workflow=<name>` lookups and returns no
result → the gate fell through to `missing/none` and ALL_GREEN was
set false. Every staging push since both names existed has been
silently dead-locked.
Fix: switch GATES from display-name strings to workflow file paths.
File paths are the unique identifier for a workflow file in
.github/workflows/; display names are decoration and can collide.
The same `gh run list --workflow=<file.yml>` query that fails on
"CodeQL" succeeds on "codeql.yml" because the file path resolves
unambiguously.
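A sketch of the gate lookup after the switch (GATES entries are now
file names; the real gate also filters to the staging head SHA, elided
here):

    - id: gates
      name: check-all-gates-green
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        GATES=(ci.yml e2e-staging-canvas.yml e2e-api.yml codeql.yml)
        ALL_GREEN=true
        for wf in "${GATES[@]}"; do
          line=$(gh run list --workflow="$wf" --branch staging --limit 1 \
            --json status,conclusion --jq '.[] | "\(.status)/\(.conclusion)"')
          echo "$wf → ${line:-missing/none}"
          [ "$line" = "completed/success" ] || ALL_GREEN=false
        done
        echo "all_green=$ALL_GREEN" >> "$GITHUB_OUTPUT"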
No behavior change for the other three gates (CI, E2E Canvas, E2E
API Smoke) since their names didn't collide — they keep working,
they just identify by ci.yml / e2e-staging-canvas.yml / e2e-api.yml
now. The log line shape changes from `CI → completed/success` to
`ci.yml → completed/success` which is fine for ops grep.
When adding/removing a gate going forward: file paths only. Keep
branch-protection required-checks (check-run display names) in
sync as a separate manual step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-promote-staging.yml gate-check (line 99) treats "workflow
didn't run" as failure. Path-filtered triggers on E2E API Smoke Test
and E2E Staging Canvas meant a platform-only or test-only push to
staging — say, the prior PR #2201 which only touched
tests/e2e/test_staging_full_saas.sh — never triggered the canvas
workflow, and auto-promote saw `missing/none`, marked all_green=false,
and aborted. Same class for any push that doesn't touch the gate's
watched paths. Dead-lock by design, never noticed because the gate
was new.
Fix per Design B (always-run + fast-skip):
- Drop `paths:` from the push/pull_request triggers on both gate
workflows. The workflow now always fires on every staging+main
push/PR.
- Add a `detect-changes` job using `dorny/paths-filter@v3` that
decides whether to do real work, scoped to the same paths the
trigger filter used to watch.
- Real work job (e2e-api / playwright) gates on
`needs: detect-changes; if: needs.detect-changes.outputs.X == 'true'`.
- Add a sibling `no-op` job that runs when the filter output is
false, emitting `::notice::… no-op pass`. The workflow run's
conclusion is `success` either way — auto-promote sees green and
proceeds.
Manual `workflow_dispatch` and the weekly canvas `schedule` short-
circuit detect-changes to always-run — those triggers exist precisely
to exercise the suite and shouldn't be silently no-op'd.
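A sketch of the detect-changes job (filter name and watched paths are
illustrative; they mirror whatever the old trigger-level `paths:`
filter listed):

    detect-changes:
      runs-on: ubuntu-latest
      outputs:
        e2e: ${{ steps.filter.outputs.e2e }}
      steps:
        - uses: actions/checkout@v4
        - uses: dorny/paths-filter@v3
          id: filter
          with:
            filters: |
              e2e:
                - 'workspace-server/**'
                - 'tests/e2e/**'

The real-work job then declares `needs: detect-changes` and gates on
`needs.detect-changes.outputs.e2e == 'true'` (job-level here; the
entries above move that gate to per-step).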
Why this approach over making auto-promote-staging smarter:
The alternative (Design A, considered + rejected) was to teach
auto-promote-staging to read each gate's `paths:` filter and treat
"no run because filter excluded the commit" as conditional pass.
That couples auto-promote to other workflows' YAML schema and breaks
silently if a gate is renamed or its filter changes. Design B keeps
the auto-promote contract simple ("each gate emits success") and
makes each gate self-describing — adding a new gate doesn't require
touching auto-promote.
Cost: ~10-30s of runner overhead per gate per push for the no-op when
paths don't match. Negligible vs the alternative of dead-locked
auto-promote chains.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #132. Extends the cascade propagation probe (added in #2197
and clarified in #2198) with a content-integrity check.
The previous probe verified pip can RESOLVE the version we just
published (catches surface 1+2 propagation lag — metadata + simple
index). It did NOT verify pip can DOWNLOAD bytes that match what we
uploaded — leaving a window where a Fastly stale-content scenario
(rare but PyPI has had it: e.g. 2026-04-01 incident where a CDN node
served a previous version's wheel under the new version's URL for
~90s after upload) would pass the probe and ship corrupt builds to
all 8 receiver templates.
Two-stage check, both must pass before the cascade fans out:
(a) `pip install --no-cache-dir PACKAGE==VERSION` succeeds —
version is resolvable. (Existing, unchanged.)
(b) `pip download` of the same wheel + `sha256sum` matches the
hash captured pre-upload from `dist/*.whl`. (New.)
Captured BEFORE upload via a new `wheel_hash` step that exposes
`steps.wheel_hash.outputs.wheel_sha256`, bubbled up as
`needs.publish.outputs.wheel_sha256`, and consumed by the cascade
probe via the EXPECTED_SHA256 env var.
`pip download` is the right primitive: it writes the actual .whl
file (vs `pip install` which unpacks and discards), so we can
sha256sum it directly. Combined with --no-cache-dir + a wiped
/tmp/probe-dl per poll, every poll re-fetches from the live Fastly
edge — no local-cache mask.
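A sketch of the two pieces (step id and output name per the description
above; directory and PACKAGE/VERSION variables illustrative):

    # in the publish job (the job also re-exports steps.wheel_hash.outputs.wheel_sha256):
    - id: wheel_hash
      run: |
        # assumes exactly one wheel in dist/ per publish run
        echo "wheel_sha256=$(sha256sum dist/*.whl | awk '{print $1}')" >> "$GITHUB_OUTPUT"

    # in the cascade probe, per poll:
    - name: Verify the downloadable wheel matches what we uploaded
      env:
        EXPECTED_SHA256: ${{ needs.publish.outputs.wheel_sha256 }}
      run: |
        rm -rf /tmp/probe-dl && mkdir -p /tmp/probe-dl
        pip download --no-cache-dir --no-deps -d /tmp/probe-dl "${PACKAGE}==${VERSION}"
        ACTUAL=$(sha256sum /tmp/probe-dl/*.whl | awk '{print $1}')
        [ "$ACTUAL" = "$EXPECTED_SHA256" ] || { echo "::error::wheel hash mismatch"; exit 1; }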
Per-poll cost: ~3-5s pip install + ~3s pip download + 4s sleep.
30-poll budget = ~5-6 min wall on a slow runner (vs the previous
~4-5 min for resolve-only). Well within the cascade's tolerance for
a known-rare CDN issue, and the overwhelming-common case (Fastly
serves matching bytes immediately) exits on the first poll.
Verified locally: pip download of the current PyPI-latest
(molecule-ai-workspace-runtime 0.1.29) produced
sha256=7e782b2d50812257…, exactly matching PyPI's own metadata
endpoint. The mismatch path is exercised inline (different builds
of the same version produce different hashes by definition — the
build_runtime_package.py output is timestamp-deterministic only
within a single CI invocation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #134. The post-merge review of #2196 flagged that the combined
workflow's `paths:` filter (the union of both jobs' needs:
`workspace/**` + `scripts/build_runtime_package.py` + the workflow
itself) caused the `pypi-latest-install` job to fire on every
doc-only / adapter-only / unrelated workspace/ edit. The PyPI artifact
that job tests against can't change based on our workspace/ source —
only on actual PyPI publishes — so those runs add noise without
information.
Splits the previously-merged combined workflow:
runtime-pin-compat.yml (kept):
- PyPI-latest install + import smoke (was: pypi-latest-install)
- Narrow `paths:` filter — only fires when workspace/requirements.txt
or this workflow file changes
- Cron-driven daily for upstream-yank detection (unchanged)
runtime-prbuild-compat.yml (new):
- PR-built wheel + import smoke (was: local-build-install)
- Broad `paths:` filter — fires on any workspace/ source change,
scripts/build_runtime_package.py, or this workflow file
- No cron (workspace/ doesn't change between firings)
Behavior identical to before for content; only the trigger surface is
narrower per-job. Each workflow's name is its own status check, so
branch protection (which currently lists neither as required) can
gate them independently in future.
The prior comment in the combined file explicitly acknowledged the
asymmetry and proposed this split as a follow-up; this is that
follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cascade's PyPI-propagation gate polled `/pypi/<pkg>/<ver>/json`,
which is one of THREE surfaces pip touches when resolving an install:
1. /pypi/<pkg>/<ver>/json — metadata endpoint (the old check)
2. /simple/<pkg>/ — pip's primary download index
3. files.pythonhosted.org — CDN-fronted wheel binary
Each has its own cache. Any one of them can lag behind the others,
and the previous gate would let the cascade fire while (2) or (3)
still served the previous version. Downstream `pip install` in the
template repos then resolved to the OLD wheel, the docker layer
cache locked that stale resolution in, and subsequent rebuilds kept
shipping the old runtime — the "five times in one night" cache trap
referenced in the prior comment.
Replace the metadata-only poll with an actual `pip install
--no-cache-dir --force-reinstall --no-deps PACKAGE==VERSION` from
a fresh venv. If pip can resolve and install the exact version we
just published, every receiver template will too — pip itself is
the ground truth for what the receivers will see, no proxy guessing
about which surface is lagging.
- Venv created once outside the loop; only `pip install` runs in
the poll body.
- --no-cache-dir + --force-reinstall ensures every poll hits the
live PyPI surfaces (no local-cache mask).
- --no-deps keeps each poll fast — we only care about resolving
THIS package, not its dep tree.
- Loop budget: 30 attempts × 4s ≈ 2 min (vs prior 30 × 2s = 60s).
Generous vs typical PyPI propagation, surfaces real upstream
issues past the budget.
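A sketch of the probe's shape (PACKAGE/VERSION variables and the venv
path are illustrative):

    - name: Wait until pip can resolve the just-published version
      run: |
        python -m venv /tmp/probe-venv                    # created once, outside the loop
        for attempt in $(seq 1 30); do
          if /tmp/probe-venv/bin/pip install --no-cache-dir --force-reinstall --no-deps \
               "${PACKAGE}==${VERSION}"; then
            echo "resolvable after ${attempt} attempt(s)"; exit 0
          fi
          sleep 4
        done
        echo "::error::${PACKAGE}==${VERSION} still not resolvable after 30 attempts"
        exit 1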
Verified locally:
- Probing a non-existent version (0.1.999999) → pip exits 1, loop
retries.
- Probing the current PyPI-latest → pip exits 0, `pip show`
returns the version, loop succeeds.
Closes #130.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #128's chicken-and-egg. The original gate installed the
CURRENTLY-PUBLISHED molecule-ai-workspace-runtime from PyPI, then
overlaid workspace/requirements.txt, then smoke-imported. That
catches problems with the already-shipped artifact (the daily-cron
upstream-yank case), but it cannot catch problems introduced by the
PR itself: the imports it exercises are from the OLD wheel, not the
PR's source. A PR that adds `from a2a.utils.foo import bar` (where
`bar` is added in a2a-sdk 1.5 and the runtime currently pins 1.3)
slips through:
1. Pip resolves the existing PyPI wheel + a2a-sdk 1.3.
2. Smoke imports the OLD main.py — no reference to `bar` → green.
3. Merge → publish-runtime.yml ships a wheel WITH the new import.
4. Tenant images redeploy → all crash on first boot with
ImportError: cannot import name 'bar' from 'a2a.utils.foo'.
Splits the workflow into two jobs:
- pypi-latest-install (renamed from default-install): unchanged
behavior. Runs on the daily cron and on requirements.txt /
workflow edits. Catches upstream PyPI yanks + the
already-shipped artifact going stale.
- local-build-install (new): runs scripts/build_runtime_package.py
on the PR's workspace/, builds the wheel with python -m build
(mirroring publish-runtime.yml byte-for-byte), installs that
wheel, then runs the same smoke import. Tests the artifact
that WOULD be published if this PR merges.
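A sketch of the new job, assembled from the local verification commands
below (job/step names are illustrative; the exact interplay between
build_runtime_package.py and `python -m build` is simplified):

    local-build-install:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - uses: actions/setup-python@v5
          with:
            python-version: '3.11'
        - name: Build the wheel the same way publish-runtime.yml does
          run: |
            pip install build
            python scripts/build_runtime_package.py
            python -m build            # output location depends on the packaging script's layout
        - name: Install the PR-built wheel and smoke-import it
          run: |
            pip install dist/*.whl
            python -c "from molecule_runtime.main import main_sync"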
Path filter widened to workspace/** so any runtime-source change
triggers the local-build job. The pypi-latest job's filter is the
same union; its internal logic is unchanged so the daily-cron and
upstream-detection use cases continue to work.
Verified locally: built the wheel from current workspace/ source via
the same script + python -m build invocation, installed into a fresh
venv, imported from molecule_runtime.main import main_sync
successfully.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing wheel-smoke catches AgentCard kwarg-shape regressions
(state_transition_history, supported_protocols) but doesn't catch the
SDK-contract drift class that #2193 just fixed in production: the
a2a-sdk 1.x rename of /.well-known/agent.json →
/.well-known/agent-card.json, plus AGENT_CARD_WELL_KNOWN_PATH moving
to a2a.utils.constants. main.py's readiness probe hardcoded the old
literal and 404'd every attempt, silently dropping every workspace's
initial_prompt for ~weeks before a user reported it.
Two additions to the smoke block:
1. Mount alignment: build an AgentCard, call create_agent_card_routes(),
and assert AGENT_CARD_WELL_KNOWN_PATH is among the mounted paths.
Catches a future SDK release that decouples the constant value
from the route factory's mount path. The source-tree test
(workspace/tests/test_agent_card_well_known_path.py) catches the
main.py side; this catches the SDK side BEFORE PyPI upload.
2. Message helper smoke: import a2a.helpers.new_text_message and
instantiate one. The v0→v1 cheat sheet (memory:
reference_a2a_sdk_v0_to_v1_migration.md) flagged this as a real
migration find — main.py and a2a_executor.py call it in hot
paths, so an import break errors every reply before the message
even leaves the workspace.
Verified by running the equivalent Python inside
ghcr.io/molecule-ai/workspace-template-langgraph:latest:
✓ well-known mount alignment OK (/.well-known/agent-card.json)
✓ message helper import + call OK
Closes the structural-fix half of the #2193 finding from the code-
review-and-quality pass: "the wheel publish smoke didn't catch this.
This is the 7th a2a-sdk migration find of this kind. Task #131 is the
right root-cause fix."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>