molecule-core/.github/workflows
Hongming Wang 24bfced630 ci(publish-image): also tag :staging-latest so CP auto-picks up new builds
Root cause of the 2026-04-24 all-day E2E failure chain: Railway staging
CP had TENANT_IMAGE pinned to :staging-a14cf86 — a static SHA that had
silently drifted 10+ days stale. Every new tenant (including every E2E
run's fresh tenant) was spawned with that stale image, which predated
applyRuntimeModelEnv. Without applyRuntimeModelEnv, HERMES_DEFAULT_MODEL
never reached the workspace EC2 user-data, so install.sh fell back to
nousresearch/hermes-4-70b → openrouter → 401 "Missing Authentication
header" in every A2A reply.

Four correct fixes shipped today all got shadowed by this single stale
pin:
  • template-hermes#19 (provider priority for openai/*)
  • template-hermes#20 (decouple prefix-strip from bridge guard)
  • molecule-controlplane#247 (force fresh /opt/adapter clone)
  • molecule-core#1987 (E2E pins HERMES_CUSTOM_* as workaround)

Fix: publish each main build under both :staging-<sha> AND :staging-latest.
Change Railway staging CP's TENANT_IMAGE env to :staging-latest (done via
`railway variables --set` as part of this incident). Future main builds
then auto-propagate to new tenant provisions without any human in the
loop.

Safety: :staging-latest is the "most recent main build" — NOT a
canary-verified promotion. That distinction is preserved:
  • Prod tenants still pull :latest (canary-verified, retagged by
    canary-verify.yml only after the canary fleet green-lights a digest)
  • Staging tenants now pull :staging-latest (every main build, pre-canary)

So staging becomes the canary: if a :staging-latest build regresses,
the staging canary fleet catches it before it can be promoted to :latest
for prod. This is what the canary design intended; the missing
:staging-latest tag was the hole.

Zero impact on image size / build time: Docker tags point at the same
digest, no duplicate push.

Follow-up: filed an issue tracking the need for CP's TENANT_IMAGE to
NEVER be pinned to a SHA in any environment — it must always float on a
named tag (:staging-latest for staging, :latest for prod).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:29:55 -07:00
..
auto-promote-staging.yml ci: canary-verify graceful-skip + draft auto-promote staging→main 2026-04-22 22:39:23 +00:00
block-internal-paths.yml chore: remove internal content + add hard CI gate (CEO directive 2026-04-23) 2026-04-23 16:58:28 -07:00
canary-staging.yml fix(e2e): CP DELETE /cp/admin/tenants body uses 'confirm', not 'confirm_token' 2026-04-21 04:50:28 -07:00
canary-verify.yml ci: canary-verify graceful-skip + draft auto-promote staging→main 2026-04-22 22:39:23 +00:00
ci.yml ci: add merge_group trigger to ci + codeql 2026-04-23 21:24:53 -07:00
codeql.yml ci: add merge_group trigger to ci + codeql 2026-04-23 21:24:53 -07:00
e2e-api.yml feat(ci): run E2E API smoke test on staging branch 2026-04-23 17:47:47 -07:00
e2e-staging-canvas.yml feat(ci): run E2E Staging Canvas on staging branch pushes 2026-04-23 17:47:51 -07:00
e2e-staging-saas.yml ci(e2e): wire MOLECULE_STAGING_OPENAI_KEY into workflow env 2026-04-21 11:24:59 -07:00
e2e-staging-sanity.yml fix(e2e): CP DELETE /cp/admin/tenants body uses 'confirm', not 'confirm_token' 2026-04-21 04:50:28 -07:00
promote-latest.yml perf(ci): move all public-repo workflows to ubuntu-latest 2026-04-22 12:56:49 -07:00
publish-canvas-image.yml perf(ci): move all public-repo workflows to ubuntu-latest 2026-04-22 12:56:49 -07:00
publish-workspace-server-image.yml ci(publish-image): also tag :staging-latest so CP auto-picks up new builds 2026-04-24 00:29:55 -07:00
retarget-main-to-staging.yml ci: auto-retarget bot PRs opened against main → staging (#1853) 2026-04-23 19:20:40 +00:00