ci: trigger publish-workspace-server-image on staging push too

Root cause: this workflow only triggered on `branches: [main]`, but
staging-CP pins TENANT_IMAGE=:staging-latest (verified via Railway).
:staging-latest was only retagged on main push, so:

  staging-branch code → never built → never reaches staging tenants
  staging-CP serves → "yesterday's main" indefinitely

When staging→main was wedged (path-filter parity bug, canvas teardown
race; both fixed earlier today), :staging-latest stopped updating
entirely. RFC #2312 (chat upload HTTP-forward) landed on staging but
freshly-provisioned staging tenants kept failing chat upload because
they pulled a pre-RFC-#2312 image. Verified by tearing down a fresh
tenant and observing the legacy "workspace container not running"
error from the docker-exec code path that RFC #2312 deleted.

Pre-2026-04-24 there was a related-but-different incident:
TENANT_IMAGE was a static :staging-<sha> pin that drifted 10 days
behind. This new incident is "the dynamic pin still drifts when its
update workflow doesn't fire."

Fix: add `staging` to the branches trigger. Tag policy is unchanged
(:staging-<sha> + :staging-latest on every push). canary-verify.yml
still runs on main push (workflow_run-gated to `branches: [main]`),
preserving the canary-verified :latest promotion for prod tenants.

Steady state after this:
- staging push → :staging-latest = staging-branch code → staging-CP
- main push → :staging-<sha> for canary, :staging-latest retag
  (post-promote main code), and after canary green → :latest for
  prod tenants

What this does NOT change:
- canary-verify.yml flow (still main-only)
- redeploy-tenants-on-main.yml (still rolls prod fleet on main push)
- publish-canvas-image.yml (self-hosted standalone canvas; orthogonal)
- The :latest tag (canary-verified main, unchanged)

What this does fix:
- RFC #2312-class fixes that land on staging now actually reach
  staging tenants without waiting for staging→main promote.
- The dogfooding observation "staging tenants seem to be running
  yesterday's code" disappears as a class.

Drive-by: also fixed the typo in the path-filter list (was
`publish-platform-image.yml`, the actual file is
`publish-workspace-server-image.yml`).
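
The steady-state tag policy above can be sketched as a small shell helper. This is only an illustration of the policy as described in this message, not code from the workflow: the function names `tags_for_push` and `eligible_for_latest` are hypothetical, and the real tagging happens in the workflow's build-and-push step.

```shell
# Hypothetical sketch of the tag policy; these functions are not part
# of the workflow and exist only to illustrate the description above.

# Print the tags published for a push to the given branch at the given
# commit SHA. Staging and main share the same tag policy.
tags_for_push() {
  case "$1" in
    staging|main)
      # %.7s truncates the SHA to its first 7 characters.
      printf 'staging-%.7s\nstaging-latest\n' "$2"
      ;;
    *)
      # Other branches do not trigger the workflow; nothing is tagged.
      :
      ;;
  esac
}

# Succeed only if canary-verify.yml could later promote this build to
# :latest. That flow stays gated to main pushes.
eligible_for_latest() {
  [ "$1" = "main" ]
}
```

For example, `tags_for_push staging 2e1cef324b` prints `staging-2e1cef3` and `staging-latest`, while `eligible_for_latest staging` fails, which matches the intent that staging builds reach staging-CP but never prod tenants.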
2026-04-29 21:00:56 -07:00 · 2e1cef324b
commit 2e1cef324b
parent 86d9cb8b55
1 changed file with 55 additions and 26 deletions
--- a/.github/workflows/publish-workspace-server-image.yml
+++ b/.github/workflows/publish-workspace-server-image.yml
@@ -1,17 +1,43 @@
 name: publish-workspace-server-image

-# Builds and pushes Docker images to GHCR when staging is promoted to main.
-# PRs target staging (default branch). Only main push triggers production builds.
+# Builds and pushes Docker images to GHCR on staging or main pushes.
 # EC2 tenant instances pull the tenant image from GHCR.
+#
+# Branch / tag policy (see Compute tags step for the per-branch logic):
+#
+#   staging push  → builds image, tags :staging-<sha> + :staging-latest.
+#                   staging-CP pins TENANT_IMAGE=:staging-latest, so it
+#                   picks up staging-branch code automatically. This is
+#                   what makes staging-CP actually test staging-branch
+#                   code instead of "yesterday's main" — pre-fix, this
+#                   workflow only ran on main, so staging tenants
+#                   silently served stale code (the #2308 fix, RFC #2312,
+#                   landed on staging but never reached tenants because
+#                   staging→main was wedged on path-filter parity bugs).
+#
+#   main push     → builds image, tags :staging-<sha> + :staging-latest
+#                   (same as before). canary-verify.yml retags
+#                   :staging-<sha> → :latest after canary tenants
+#                   green-light the digest. The :staging-latest retag
+#                   on main push is intentional: when main lands AFTER a
+#                   staging push, staging-CP gets the post-promote code
+#                   (which equals what it had + any merge resolution),
+#                   so the canary-on-staging-CP step still runs against
+#                   the prod-bound digest.
+#
+# In the steady state both branches refresh :staging-latest; the
+# semantic is "most recent staging-or-main build of tenant code."
+# Drift between the two is bounded by the staging→main auto-promote
+# cadence and is corrected on the next staging push.

 on:
  push:
-    branches: [main]
+    branches: [staging, main]
    paths:
      - 'workspace-server/**'
      - 'canvas/**'
      - 'manifest.json'
-      - '.github/workflows/publish-platform-image.yml'
+      - '.github/workflows/publish-workspace-server-image.yml'
  workflow_dispatch:

 permissions:
@@ -63,29 +89,32 @@ jobs:
        run: |
          echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"

-      # Canary-gated release: we publish :staging-<sha> ONLY here. The
-      # :latest tag (which existing prod tenants auto-pull every 5 min)
-      # is promoted by .github/workflows/canary-verify.yml after the
-      # staging canary fleet green-lights this digest.
-      # That means:
-      #   - Every main merge produces a :staging-<sha> image
-      #   - Canary tenants (configured to pull :staging-<sha>) pick it up
-      #   - canary-verify.yml runs smoke tests against them
-      #   - On green → canary-verify retags :staging-<sha> → :latest
-      #   - On red → :latest stays on the prior good digest, prod is safe
-      # Every push of :staging-<sha> also retags the same digest as
-      # :staging-latest so staging CP (which pins TENANT_IMAGE at
-      # :staging-latest) picks up new builds automatically — no more manual
-      # Railway env-var edits. Prod's :latest retag still happens in
-      # canary-verify.yml after the canary fleet greenlights this digest;
-      # :staging-latest is strictly the "most recent main build," not a
-      # canary-verified promotion.
+      # Canary-gated release flow:
+      #   - This step always publishes :staging-<sha> + :staging-latest.
+      #   - On staging push, staging-CP picks up :staging-latest immediately
+      #     (its TENANT_IMAGE pin is :staging-latest) — so staging-branch
+      #     code reaches staging tenants without waiting for main.
+      #   - On main push, canary-verify.yml runs smoke tests against
+      #     canary tenants (which pin :staging-<sha>), and on green retags
+      #     :staging-<sha> → :latest. Prod tenants pull :latest.
+      #   - On red, :latest stays on the prior good digest — prod is safe.
      #
-      # Before this, TENANT_IMAGE on Railway staging was pinned to a static
-      # :staging-<sha> and drifted months behind (2026-04-24 incident:
-      # canary tenant ran :staging-a14cf86, 10 days stale, which lacked
-      # applyRuntimeModelEnv and caused every E2E to route hermes+openai
-      # through openrouter → 401). See issue filed with this PR.
+      # Why :staging-latest is retagged on main push too: when main lands
+      # after a staging promote, staging-CP gets the post-promote code so
+      # the canary-on-staging-CP step still runs against the prod-bound
+      # digest. In a healthy flow the post-promote main code == the
+      # current staging code, so this is effectively a no-op except for
+      # the canary fleet pin handoff.
+      #
+      # Pre-fix history: this workflow used to only trigger on main. That
+      # meant staging-CP served "yesterday's main" indefinitely whenever
+      # staging→main was wedged. The 2026-04-30 dogfooding session
+      # surfaced this when RFC #2312 (chat upload HTTP-forward) landed on
+      # staging but staging tenants kept failing chat upload because they
+      # were running pre-RFC code. Adding the staging trigger above closes
+      # that gap. Earlier 2026-04-24 incident: a static :staging-<sha> pin
+      # drifted 10 days behind staging — same class of bug, different
+      # mechanism.
      - name: Build & push platform image to GHCR (staging-<sha> + staging-latest)
        uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8 # v6
        with: