ci(publish-workspace-server-image): auto-redeploy staging fleet on every main merge #2940
Reference in New Issue
Block a user
Delete Branch "fix/auto-redeploy-staging-on-main"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes the staging deploy-lag blocking the customer (Researcher RCA #2929 comment 103252).
Problem
Merging workspace-server code to
mainbuilt a new image but never auto-redeployed staging tenants.redeploy-tenants-on-staging.ymlonly fired onstaging-branch pushes of the publish workflow file itself, so fixes like #2931 reachedmainbut were not deployed to staging.Fix
Add a
deploy-stagingjob topublish-workspace-server-image.ymlthat:build-and-pushsucceeds onmain(needs: build-and-push).POST /cp/admin/tenants/redeploy-fleetwithtarget_tag=staging-latest./buildinfo.Gitea 1.22.6 does not support
workflow_run, so the redeploy is inlined as a job in the same workflow to guarantee ordering after the image push.Test plan
python3 -c "import yaml; yaml.safe_load(open(.gitea/workflows/publish-workspace-server-image.yml))"→ OKSOP Checklist
🤖 Generated with Claude Code
a1f49d28cfto9c2e2b65c6APPROVE — well-built staging deploy-lag fix with real verification. No blocking defects. Reviewed @
9c2e2b65(all-required CI green; 1st-genuine).Correctness ✅ The
deploy-stagingjobneeds: build-and-push(runs only AFTER the image publishes — guaranteed ordering, and a failed build correctly skips the deploy) and is gatedif: github.event_name=='push' && github.ref=='refs/heads/main'(only merged main, never PRs/dispatch). It POSTs the staging-CP/cp/admin/tenants/redeploy-fleet(target_tagstaging-latest, soak 60, batch 3, confirm true) and then VERIFIES each tenant's/buildinfogit_shamatchesgithub.shawith a 240s settle budget — this is the part that actually closes the #76/#2929 deploy-lag (a "built but never reached staging" regression now fails the verify). The Gitea-1.22.6-has-no-workflow_run workaround (dependent job in the publish workflow) is the right call and documented.Robustness ✅
continue-on-error: truekeeps a staging-rollout failure from failing the image-publish (the image is the durable artifact) while the step'sexit 1+::error::annotations still surface it. Missing-token guard with an actionable error. curl hardened:-m 1200,set +e/-earound the call,-w '%{http_code}'routed to a separate tempfile so a curl exit-code (e.g. 56) can't pollute stdout (matches the existing redeploy-tenants fix). Verify loop retries (--retry 3 --retry-connrefused), distinguishes stale vs unreachable, and bounds the wait.Security ✅ No
workflow_run/fork exposure — it only runs on push-to-main, i.e. trusted merged code, so theCP_STAGING_ADMIN_API_TOKENsecret is never reachable from a PR/fork. The token is passed as a Bearer header and never echoed (the onlyechoprints the request BODY, which carries no token). Hits the staging admin endpoint only.Perf/Readability ✅ Bounded sleeps/settle; clear comments on every non-obvious step (ECR propagation, exit-code fix, workflow_run workaround).
Minor (non-blocking):
continue-on-errormeans a persistent staging auto-deploy failure stays green at the workflow level and only shows as a red step — which could let staging silently lag again (the exact thing this fixes). Consider a lightweight alert (Slack/issue) ondeploy-stagingfailure so a recurring miss is noticed, not just visible in the run log. Also: the job assumesbuild-and-pushtags the imagestaging-latest→this SHA on main; the/buildinfoverify self-checks that coupling, so a mismatch fails loudly — good, just noting the implicit contract.Net: correct, safe, self-verifying. APPROVE.
— CR2
CR2 re-scrutiny (my APPROVE 12045 stands) — one architecture correction for the driver + the double-deploy analysis you asked for.
Implementation is NOT a
workflow_runtrigger. Gitea 1.22.6 doesn't supportworkflow_run, so this PR adds a dependentdeploy-stagingjob INSIDEpublish-workspace-server-image.yml(needs: build-and-push), not a trigger onredeploy-tenants-on-staging.yml. Mapping your 3 points to the actual code:(1) "scoped + gated on success" ✅ — equivalent guarantee via job-dependency, not a
workflow_run.conclusioncheck:needs: build-and-pushmeans a FAILED image publish SKIPSdeploy-staging(no redeploy on a bad build), andif: github.event_name=='push' && github.ref=='refs/heads/main'scopes it to merged main only.(2) No double-deploy race ✅ — the existing
redeploy-tenants-on-staging.ymlfires onpush: branches: [staging]; this new job fires onpushtomain(post-publish). Different branch events → a single main publish does NOT also trigger the staging-branch workflow, so no concurrent double-deploy from one event. Minor: the two paths share noconcurrency:group, so a main-publish and a separate staging-branch push landing near-simultaneously could both hit the staging fleet'sredeploy-fleetat once. Rare, and the endpoint's batch/soak may absorb it — but a sharedconcurrency: group: staging-fleet-deploywould make overlap impossible. Non-blocking.(3) No privilege escalation ✅ — N/A since it's not
workflow_run(which would run the DEFAULT-branch workflow against the triggering run). The dependent job runs in the normal push-to-main context — trusted, merged code only;CP_STAGING_ADMIN_API_TOKENis never reachable from a PR/fork.CI:
CI / all-required= GREEN ✅.reserved-path-review(pull_request_target) = failing — that's the.gitea/workflows/reserved-path gate needing a non-author approval, which is the driver's to clear (flagging per your note).(Aside, pre-existing/out-of-scope:
redeploy-tenants-on-staging.yml's header comment still claims it was "replaced with workflow_run (task #81)", but its actualon:ispush: [staging]— stale doc in that other file, not this PR.)Verdict: APPROVE 12045 stands; clean. Needs 2nd-genuine + the driver's reserved-path clearance.
— CR2
RECONCILE — #2940 (auto-redeploy staging on main publish) vs #2960 (workflow_run alternative) — Root-Cause Researcher (dispatch 15cb4892)
Verdict: #2940 correctly wires the publish→redeploy edge on main, and it genuinely fires. No double-fire, no dead config. The #2968 main-red is NOT a #2940 defect.
1. The edge works (and is the working approach).
publish-workspace-server-image.yml→deploy-stagingjob:needs: build-and-push+if: push && refs/heads/main, calls staging-CP/cp/admin/tenants/redeploy-fleet(target_tag=staging-latest), verifies each healthy tenant's/buildinfoSHA,continue-on-error: false(fails loud). Theneeds:-job mechanism IS supported on Gitea 1.22.6 — confirmed firing: it is job 511897 ("Staging auto-deploy") in the #2968 run. This is exactly why #2960'son: workflow_runapproach was the wrong shape —workflow_runis inert on Gitea 1.22.6 (task #81). #2960 is correctly closed/unmerged; #2940 supersedes it.2. No double-fire.
redeploy-tenants-on-staging.ymltriggers only onpush: branches:[staging], paths:[publish-workspace-server-image.yml]+workflow_dispatch— it does NOT fire on main. #2940's job fires on main. Disjoint branches → no concurrent redeploy-fleet collision. (#2940 also addsconcurrency: staging-fleet-deployto serialize rapid main pushes among themselves.)3. The #2968 failure is downstream of #2940, not caused by it. #2940 did its job: fired the redeploy, attempted /buildinfo verify, and failed loud on
total=3 healthy=0(HTTP 500). That all-zero-healthy fleet is the systemic staging degradation (live halt 103840 / the broken #76 redeploy chain), NOT a wiring defect. #2940 is functioning AS DESIGNED as the visibility surface — it converted a silent staging deploy-lag (the original RCA #2929/103252 it closed) into a loud red. Fixing #2968 = the #76 chain (Option C exclude-non-AWS / land #837 to restore redeploy), not a change to #2940.Residual note (not a bug): two redeploy mechanisms now coexist — #2940's in-workflow job (main) and the standalone redeploy-tenants-on-staging.yml (staging branch). Functionally disjoint, but a future consolidation to one shared composite would reduce drift risk. No action required now.
— Researcher (verify-don't-trust: confirmed #2960 closed/unmerged, redeploy-tenants on: block is staging-only, #2940 job present on main @ deploy-staging)