From 24bfced630f7ca82d16e2b7ce0c32ebdac28d4c0 Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Fri, 24 Apr 2026 00:29:55 -0700
Subject: [PATCH] ci(publish-image): also tag :staging-latest so CP auto-picks up new builds
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Root cause of the 2026-04-24 all-day E2E failure chain: Railway staging
CP had TENANT_IMAGE pinned to :staging-a14cf86 — a static SHA that had
silently drifted 10+ days stale. Every new tenant (including every E2E
run's fresh tenant) was spawned with that stale image, which predated
applyRuntimeModelEnv. Without applyRuntimeModelEnv, HERMES_DEFAULT_MODEL
never reached the workspace EC2 user-data, so install.sh fell back to
nousresearch/hermes-4-70b → openrouter → 401 "Missing Authentication
header" in every A2A reply.

Four correct fixes shipped today all got shadowed by this single stale
pin:

  • template-hermes#19 (provider priority for openai/*)
  • template-hermes#20 (decouple prefix-strip from bridge guard)
  • molecule-controlplane#247 (force fresh /opt/adapter clone)
  • molecule-core#1987 (E2E pins HERMES_CUSTOM_* as workaround)

Fix: publish each main build under both :staging- AND
:staging-latest. Change Railway staging CP's TENANT_IMAGE env to
:staging-latest (done via `railway variables --set` as part of this
incident). Future main builds then auto-propagate to new tenant
provisions without any human in the loop.

Safety: :staging-latest is the "most recent main build" — NOT a
canary-verified promotion. That distinction is preserved:

  • Prod tenants still pull :latest (canary-verified, retagged by
    canary-verify.yml only after the canary fleet green-lights a digest)
  • Staging tenants now pull :staging-latest (every main build,
    pre-canary)

So staging becomes the canary: if a :staging-latest build regresses, the
staging canary fleet catches it before it can be promoted to :latest
for prod.
This is what the canary design intended; the missing :staging-latest tag
was the hole.

Zero impact on image size / build time: Docker tags point at the same
digest, no duplicate push.

Follow-up: filed an issue tracking the need for CP's TENANT_IMAGE to
NEVER be pinned to a SHA in any environment — it must always float on a
named tag (:staging-latest for staging, :latest for prod).

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../publish-workspace-server-image.yml | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/publish-workspace-server-image.yml b/.github/workflows/publish-workspace-server-image.yml
index df0c3098..c7f3127f 100644
--- a/.github/workflows/publish-workspace-server-image.yml
+++ b/.github/workflows/publish-workspace-server-image.yml
@@ -73,7 +73,20 @@ jobs:
       # - canary-verify.yml runs smoke tests against them
       # - On green → canary-verify retags :staging- → :latest
       # - On red → :latest stays on the prior good digest, prod is safe
-      - name: Build & push platform image to GHCR (staging- only)
+      # Every push of :staging- also retags the same digest as
+      # :staging-latest so staging CP (which pins TENANT_IMAGE at
+      # :staging-latest) picks up new builds automatically — no more manual
+      # Railway env-var edits. Prod's :latest retag still happens in
+      # canary-verify.yml after the canary fleet greenlights this digest;
+      # :staging-latest is strictly the "most recent main build," not a
+      # canary-verified promotion.
+      #
+      # Before this, TENANT_IMAGE on Railway staging was pinned to a static
+      # :staging- and drifted days behind (2026-04-24 incident:
+      # canary tenant ran :staging-a14cf86, 10 days stale, which lacked
+      # applyRuntimeModelEnv and caused every E2E to route hermes+openai
+      # through openrouter → 401). See issue filed with this PR.
+      - name: Build & push platform image to GHCR (staging- + staging-latest)
         uses: docker/build-push-action@v6
         with:
           context: .
@@ -82,6 +95,7 @@ jobs:
           push: true
           tags: |
             ${{ env.IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
+            ${{ env.IMAGE_NAME }}:staging-latest
           cache-from: type=gha
           cache-to: type=gha,mode=max
           labels: |
@@ -89,7 +103,7 @@ jobs:
             org.opencontainers.image.revision=${{ github.sha }}
             org.opencontainers.image.description=Molecule AI platform (Go API server) — pending canary verify
 
-      - name: Build & push tenant image to GHCR (staging- only)
+      - name: Build & push tenant image to GHCR (staging- + staging-latest)
         uses: docker/build-push-action@v6
         with:
           context: .
@@ -98,6 +112,7 @@ jobs:
           push: true
           tags: |
             ${{ env.TENANT_IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
+            ${{ env.TENANT_IMAGE_NAME }}:staging-latest
           cache-from: type=gha
           cache-to: type=gha,mode=max
           # Canvas uses same-origin fetches. The tenant Go platform
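
Reviewer note on the follow-up in the commit message (TENANT_IMAGE must float on a named tag, never a SHA pin): a rule like that could be checked mechanically before an env var is changed. The sketch below is a heuristic under stated assumptions, not part of this patch. The function name `is_sha_pinned`, the example registry path, and the assumed tag shape (an optional lowercase prefix such as `staging-` followed by 7 to 40 hex characters, as in the `:staging-a14cf86` pin from the incident) are all hypothetical.

```python
import re

# Hypothetical guard, not part of this patch.
# Assumption: a SHA-pinned tag ends in 7-40 lowercase hex characters,
# optionally preceded by a lowercase prefix such as "staging-".
_SHA_TAG = re.compile(r":(?:[a-z]+-)?[0-9a-f]{7,40}$")

# An explicit content digest ("@sha256:<64 hex>") is also a static pin.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_sha_pinned(image_ref: str) -> bool:
    """Return True if image_ref ends in a SHA-style tag or a sha256 digest.

    This is a heuristic: a named tag that happens to consist only of hex
    characters (for example ":deadbeef") would also be flagged.
    """
    return bool(_SHA_TAG.search(image_ref) or _DIGEST.search(image_ref))


if __name__ == "__main__":
    # Example references; the registry path is made up for illustration.
    for ref in (
        "ghcr.io/example/tenant:staging-a14cf86",
        "ghcr.io/example/tenant:staging-latest",
        "ghcr.io/example/tenant:latest",
    ):
        print(ref, "pinned" if is_sha_pinned(ref) else "floating")
```

A check of this kind would flag the stale `:staging-a14cf86` value while accepting `:staging-latest` and `:latest`. Where it would run (CI, a deploy script, or a CP startup check) is left open here.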