Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail immediately (at 0s)
when the runner tries to resolve the cross-repo workflow / checkout.
Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.
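Shape of the change, for reference (workflow filename illustrative; only
the owner casing moves):

  # before: runner fails at 0s resolving the mixed-case owner
  uses: Molecule-AI/molecule-controlplane/.github/workflows/deploy.yml@main
  # after: canonical lowercase owner
  uses: molecule-ai/molecule-controlplane/.github/workflows/deploy.yml@main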
Refs: internal#46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of PR #2810 caught a regression: my mass-fix added
`2>/dev/null` to every curl invocation, suppressing stderr. The
original `|| echo "000"` shape only swallowed exit codes; stderr (the
connection errors, timeouts, and DNS failures that `-sS` still prints)
kept going to the runner log, so operators could see WHY a connection
failed.
After PR #2810 the next deploy failure would log only the bare
HTTP code with no context. That's exactly the kind of diagnostic
loss that makes outages take longer to triage.
Drop `2>/dev/null` from each curl line — keep it on the `cat`
fallback (which legitimately suppresses "no such file" when curl
crashed before -w ran). The `>tempfile` redirect alone captures
curl's stdout (where -w writes) without touching stderr.
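Resulting shape, roughly (URL and tempfile names illustrative):

  set +e   # a non-zero curl exit must not trip the outer pipeline
  curl -sS --fail-with-body "$URL" \
    -o "$BODY_FILE" \
    -w '%{http_code}' > "$CODE_FILE"   # -w output only; stderr still reaches the runner log
  set -e
  # cat keeps 2>/dev/null: a missing code file is expected when curl died before -w ran
  HTTP_CODE="$(cat "$CODE_FILE" 2>/dev/null || echo "000")"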
Same 8 files as #2810: redeploy-tenants-on-{main,staging},
sweep-stale-e2e-orgs, e2e-staging-{sanity,saas,external,canvas},
canary-staging.
Tests:
- All 8 files pass the lint
- YAML valid
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-05-04 redeploy-tenants-on-main run for sha 2b862f6 emitted
"HTTP 000000" and failed the deploy. Root cause: when curl exits non-
zero (connection reset → 56, --fail-with-body 4xx/5xx → 22), the
`-w '%{http_code}'` has already written a status to stdout; the inline
`|| echo "000"` then fires AND appends another "000" to the captured
substitution output. Result: HTTP_CODE="<actual><000>", which fails string
comparisons against "200" while looking superficially right.
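The buggy shape, for reference (URL illustrative):

  # -w prints a code even when curl fails; || echo then appends a second one
  HTTP_CODE=$(curl -sS --fail-with-body "$URL" -o /dev/null -w '%{http_code}' || echo "000")
  # e.g. connection reset: -w prints "000", curl exits 56, echo adds "000" -> "000000"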
Same class of bug the synth-E2E §7c gate hit twice (PRs #2779/#2783
+ #2797). See memory note feedback_curl_status_capture_pollution.md.
Mass fix in 8 workflows: route -w into a tempfile so curl's exit
code can't pollute stdout. Wrap with set +e/-e so the non-zero
curl exit doesn't trip the outer pipeline.
redeploy-tenants-on-main.yml (production-critical, caught the bug)
redeploy-tenants-on-staging.yml (sibling)
sweep-stale-e2e-orgs.yml (cleanup loop)
e2e-staging-sanity.yml (E2E safety-net teardown)
e2e-staging-saas.yml
e2e-staging-external.yml
e2e-staging-canvas.yml
canary-staging.yml
Plus a new lint workflow `lint-curl-status-capture.yml` that runs on
every PR/push touching `.github/workflows/**`. Multi-line aware:
collapses bash `\` continuations, then matches the buggy
$(curl ... -w '%{http_code}' ... || echo "000") subshell shape.
Distinguishes from the SAFE $(cat tempfile || echo "000") shape
(cat with missing file emits empty stdout, no pollution).
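Detection is roughly this (a sketch, not the lint's literal implementation):

  bad=0
  for f in .github/workflows/*.yml; do
    # join backslash continuations, then flag curl substitutions that pair
    # -w '%{http_code}' with an inline || echo fallback
    if sed -e ':a' -e '/\\$/N; s/\\\n//; ta' "$f" \
         | grep -nE '\$\(curl[^)]*%\{http_code\}[^)]*\|\| *echo'; then
      echo "curl status-capture pollution in $f"
      bad=1
    fi
  done
  exit "$bad"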
Verified:
- All 8 workflows pass the lint locally
- A known-bad injection is caught
- A known-safe cat-fallback passes through
- yaml.safe_load clean on all changed files
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The workflow_dispatch input default and the workflow_run env fallback
both pointed at 'hongmingwang', which doesn't match any current prod
tenant (slugs are: hongming, chloe-dong, reno-stars). CP silently
skipped the missing canary and put every tenant in batch-1 in parallel,
defeating the canary-first soak gate that exists to catch image-boot
regressions before they hit the whole fleet.
Concrete example from today's c0838d6 redeploy at 11:53Z (run 25278434388):
the dispatched body was `{"target_tag":"staging-c0838d6","canary_slug":"hongmingwang",...}`
and the CP response showed all 3 tenants in `"phase":"batch-1"` — no
soak, no canary. The deploy happened to be safe, but a broken image
would have hit hongming + chloe-dong + reno-stars simultaneously.
Fixed in three places: the runtime ordering comment, the
workflow_dispatch default, and the env fallback used by the
workflow_run trigger. Comment documents the rationale so the next
slug rename doesn't silently regress this again.
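The two config spots, roughly (layout illustrative; assumes 'hongming' is
the corrected canary slug):

  # workflow_dispatch input (manual runs)
  inputs:
    canary_slug:
      description: Prod tenant redeployed and soaked before the rest of the fleet
      default: hongming   # must name a real tenant slug or CP silently skips the canary

  # fallback for the workflow_run trigger, which carries no inputs
  env:
    CANARY_SLUG: ${{ inputs.canary_slug || 'hongming' }}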
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-trigger from publish-workspace-server-image now resolves
target_tag to the just-published `staging-<short_head_sha>` tag
instead of `:latest`. Bypasses the dead retag path that was leaving
prod tenants on a 4-day-old image.
The chain pre-fix:
publish-image → pushes :staging-<sha> + :staging-latest (NOT :latest)
canary-verify → soft-skips (CANARY_TENANT_URLS unset, fleet not stood up)
promote-latest → manual workflow_dispatch only, last run 2026-04-28
redeploy-main → pulls :latest → 2026-04-28 digest → all 3 tenants STALE
Today's incident:
e7375348 (main) → publish-image green → redeploy fired → tenants
pulled :latest (76c604fb digest from prior canary-verified state) →
hongming /buildinfo returned 76c604fb instead of e7375348 → verify
step correctly flagged 3/3 STALE → workflow failed.
Today's PRs (#2473 smoke wedge, #2487 panic recovery, #2496 sweeper
followups) shipped to GHCR as :staging-<sha> but never reached prod.
Fix:
- workflow_dispatch input default '' (was 'latest'); empty input
triggers auto-compute path
- new "Compute target tag" step resolves:
1. operator-supplied input → verbatim (rollback / pin)
2. else → staging-<short_head_sha> (auto)
- verify step's operator-pin detection now allows
staging-<short_head_sha> as a non-pin (verification still runs)
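Compute step, roughly (step layout and short-sha length illustrative):

  - name: Compute target tag
    id: tag
    run: |
      if [ -n "${{ inputs.target_tag }}" ]; then
        # operator pin (rollback or explicit tag): use verbatim
        echo "target_tag=${{ inputs.target_tag }}" >> "$GITHUB_OUTPUT"
      else
        # auto path: the image publish-workspace-server-image just pushed
        HEAD_SHA="${{ github.event.workflow_run.head_sha || github.sha }}"
        echo "target_tag=staging-${HEAD_SHA:0:7}" >> "$GITHUB_OUTPUT"
      fi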
When canary fleet is real, this workflow should chain on
canary-verify completion (workflow_run from canary-verify, gated on
promote-to-latest success) instead of publish-image — separate,
smaller PR. Today's fix unblocks prod deploys without that
prerequisite.
Companion: promote-latest.yml dispatched 2026-05-02 against
e7375348 to unstick existing prod tenants. This PR prevents
recurrence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of #2403 caught a regression: with a 1-tenant fleet (the
exact case the original #2402 fix targeted), the new floor would
re-introduce the flake. Trace:
TOTAL=1, UNREACHABLE=1, $((1/2))=0
if 1 -gt 0 → TRUE → exit 1
The 50%-rule only meaningfully distinguishes "real outage" from
"teardown race" when the fleet is large enough that "half down" is
statistically meaningful. With 1-3 tenants, canary-verify is the
actual gate (it runs against the canary first and aborts the rollout
if the canary fails to come up).
Gate the floor on TOTAL_VERIFIED >= 4. Truth table:
TOTAL  UNREACHABLE  RESULT
1      1            soft-warn (original e2e flake case)
4      2            soft-warn (exactly half)
4      3            hard-fail (75%, real outage)
10     6            hard-fail (60%, real outage)
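In shell terms (sketch; variable names illustrative):

  # with fewer than 4 verified tenants, canary-verify is the real gate: soft-warn only
  if [ "$TOTAL_VERIFIED" -ge 4 ] && [ "$UNREACHABLE_COUNT" -gt $((TOTAL_VERIFIED / 2)) ]; then
    echo "::error::${UNREACHABLE_COUNT}/${TOTAL_VERIFIED} tenants unreachable after redeploy"
    exit 1
  fi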
Mirrored across staging.yml + main.yml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Belt-and-suspenders sanity floor on top of the unreachable-soft-warn
introduced earlier in this PR. Addresses the residual gap noted in
review: if a new image crashes on startup, every tenant ends up
unreachable, and the soft-warn alone would let that ship as a green
deploy. Canary-verify catches it on the canary tenant first, but this
guard is a fallback for canary-skip dispatches and same-batch races.
Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above
the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per
batch) but below any plausible real-outage scenario.
Mirrored across staging.yml + main.yml for shape parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /buildinfo verify step (PR #2398) was treating "no /buildinfo response"
the same as "tenant returned wrong SHA" — both bumped MISMATCH_COUNT and
hard-failed the workflow. First post-merge run on staging caught a real
edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by
the E2E teardown trap between CP's healthz_ok snapshot and the verify step
running, so the verify step would dial a hostname that no longer resolves
and hard-fail on a benign condition.
The bug class we actually care about is STALE (tenant up + serving old
code, the #2395 root cause). UNREACHABLE post-redeploy is almost always a benign
teardown race; real "tenant up but unreachable" is caught by CP's own
healthz monitor + the alert pipeline, so double-counting it here was
making this workflow flaky on every staging push that overlapped E2E.
Wire:
- Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT.
- STALE → hard-fail the workflow (the bug class we're guarding).
- UNREACHABLE → ⚠️ soft-warn, don't fail. Reachable-mismatch still hard-fails.
- Job summary surfaces both lists separately so on-call can tell at a
glance which class fired.
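Per-tenant classification, roughly (host/variable names illustrative; git_sha
is the /buildinfo field the jq lookup reads):

  RESP="$(curl -sS --max-time 10 "https://${TENANT_HOST}/buildinfo" || true)"
  GOT_SHA="$(echo "$RESP" | jq -r '.git_sha // empty' 2>/dev/null || true)"
  if [ -z "$GOT_SHA" ]; then
    UNREACHABLE_COUNT=$((UNREACHABLE_COUNT + 1))   # no answer: likely teardown race, warn only
  elif [ "$GOT_SHA" != "$EXPECTED_SHA" ]; then
    STALE_COUNT=$((STALE_COUNT + 1))               # reachable but serving old code: hard-fail
  fi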
Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer
ephemeral tenants but identical asymmetry would be a gratuitous fork).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported
ssm_status=Success based on SSM RPC return code alone, while EC2 tenants
silently kept serving the previous :latest digest because docker compose up
without an explicit pull is a no-op when the local tag already exists.
Wire:
- new buildinfo package exposes GitSHA, set at link time via -ldflags from
the GIT_SHA build-arg (default "dev" so test runs without ldflags fail
closed against an unset deploy)
- router exposes GET /buildinfo returning {git_sha} — public, no auth,
cheap enough to curl from CI for every tenant
- both Dockerfiles thread GIT_SHA into the Go build
- publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
images
- redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
digest mismatch; staging treats both :latest and :staging-latest as
moving tags; verification is skipped only when an operator pinned a
specific tag via workflow_dispatch
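Roughly how GIT_SHA is threaded (Go import path and build target illustrative):

  # publish-workspace-server-image.yml build step
  docker build --build-arg GIT_SHA="${GITHUB_SHA}" -t "$IMAGE" .
  # Dockerfile
  ARG GIT_SHA=dev
  RUN go build -ldflags "-X example.com/workspace-server/buildinfo.GitSHA=${GIT_SHA}" -o /server .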
Tests:
- TestGitSHA_DefaultDevSentinel pins the dev default
- TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
workflow's jq lookup depends on
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parity with #2337's redeploy-tenants-on-staging.yml. Both prod and
staging redeploys now have explicit serialization:
group: redeploy-tenants-on-main (per-workflow, global)
group: redeploy-tenants-on-staging (per-workflow, global)
cancel-in-progress: false on both — aborting a half-rolled-out fleet
would leave tenants stuck on whatever image they happened to be on
when cancelled. Better to finish the in-flight rollout before starting
the next one.
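Concretely, using the group names above:

  concurrency:
    group: redeploy-tenants-on-main
    cancel-in-progress: false   # queue behind the in-flight rollout instead of aborting it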
Pre-fix this workflow relied on GitHub's implicit workflow_run queueing,
which is "probably fine" but not defensible — explicit > implicit for
load-bearing pipeline behavior. Picked up as a #2337 review nit
(architecture finding 1: concurrency asymmetry between the two
redeploy workflows).
No behavior change in the common case. The change matters only when
two main pushes land within seconds AND the first redeploy is still
mid-rollout — currently rare; will become more common once #2335
(staging-trigger publish) feeds main more frequently via auto-promote.
Closes the "main merged but prod tenants still on old image" gap.
## Trigger chain
main merge
└─> publish-workspace-server-image (builds + pushes :latest + :<sha>)
└─> redeploy-tenants-on-main (this workflow)
└─> POST https://api.moleculesai.app/cp/admin/tenants/redeploy-fleet
└─> Canary hongmingwang + 60s soak, then batches of 3
with SSM Run Command redeploying each tenant EC2
## Features
- Auto-fires on every successful publish-workspace-server-image run.
- Manual dispatch with optional target_tag (for rollback to an older
SHA), canary_slug override, batch_size, dry_run.
- 30s delay before calling CP so GHCR edge cache serves the new
:latest consistently to every tenant's docker pull.
- Skips when publish job failed (workflow_run fires on any completion).
- Job summary renders per-tenant results as a markdown table so ops
can see which tenant, if any, broke the chain.
- Exits non-zero on HTTP != 200 or ok=false so a broken rollout marks
the commit status red.
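Call shape, roughly (auth scheme assumed; target_tag/canary_slug per the CP
request body):

  CP_URL="${CP_URL:-https://api.moleculesai.app}"
  set +e
  curl -sS "${CP_URL}/cp/admin/tenants/redeploy-fleet" \
    -H "Authorization: Bearer ${CP_ADMIN_API_TOKEN}" \
    -H 'Content-Type: application/json' \
    -d "{\"target_tag\":\"${TARGET_TAG}\",\"canary_slug\":\"${CANARY_SLUG}\"}" \
    -o response.json -w '%{http_code}' > http_code
  set -e
  HTTP_CODE="$(cat http_code 2>/dev/null || echo "000")"
  # a broken rollout must mark the commit status red
  [ "$HTTP_CODE" = "200" ] && [ "$(jq -r '.ok' response.json)" = "true" ] || exit 1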
## Secrets + vars required
- secret CP_ADMIN_API_TOKEN — Railway prod molecule-platform / CP_ADMIN_API_TOKEN
Mirrored into this repo's secrets.
- var CP_URL (optional) — defaults to https://api.moleculesai.app
## Paired with
- Molecule-AI/molecule-controlplane branch feat/tenant-auto-redeploy
which adds the /cp/admin/tenants/redeploy-fleet endpoint + the SSM
orchestration. This workflow is a no-op until that lands on prod CP.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>