From 26d5c5ba1f9b35358301924a42cd1c3a26891f9e Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Thu, 30 Apr 2026 00:03:31 -0700
Subject: [PATCH] fix(ci): close gaps in auto-promote dispatch tail (#2358 follow-up)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Independent review of #2358 surfaced three gaps that the original
self-review missed. All three would manifest only on the FIRST real
staging→main promotion through the new tail step, so they'd silently
re-introduce the deploy-chain bug #2357 was supposed to fix.

1. **Missing `actions: write` permission.** `gh workflow run` POSTs to
   `/repos/.../actions/workflows/.../dispatches`, which requires the
   actions:write scope on GITHUB_TOKEN. The job had only contents:write
   and pull-requests:write, so the dispatch call would 403 on every run
   and the publish chain would still not fire. This patch adds the
   scope.

2. **No workflow-level concurrency block.** When CI + E2E Staging
   Canvas + E2E API Smoke + CodeQL all complete within seconds of each
   other on a green staging push (the typical case), four separate
   workflow_run events fire and four parallel auto-promote runs all
   reach the dispatch tail. They poll the same PR, all observe the same
   mergedAt, and all call `gh workflow run`, producing 2-4× redundant
   publish builds racing for the same `:staging-latest` retag and 2-4×
   canary-verify chains. This patch adds `concurrency.group:
   auto-promote-staging, cancel-in-progress: false`.
   cancel-in-progress is false because killing a polling tail that's
   about to dispatch would re-introduce the original bug.

3. **PR closed-without-merge ties up a runner for 30 min.** If the
   merge queue rejects the PR (gates flip red post-approval), or an
   operator closes it manually, mergedAt stays null forever and the
   loop polls 60 × 30s, burning a runner slot. The step now also reads
   `state` in the same `gh pr view` call and exits early (with a
   warning, skipping the dispatch) when STATE=CLOSED.
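
For reviewers, the per-iteration decision behind gap 3 can be sketched
as a small shell function. This is an illustration only: the name
`poll_decision` and its three output labels are hypothetical and do not
appear in the workflow, which inlines the same checks inside its polling
loop after reading both fields from `gh pr view --json mergedAt,state`.

```shell
# Hypothetical helper (not part of the workflow): classify one poll result.
#   $1 = mergedAt as read from `gh pr view` (empty or "null" when unmerged)
#   $2 = state as read from `gh pr view` (OPEN, CLOSED, or MERGED)
poll_decision() {
  merged_at="$1"
  state="$2"
  if [ -n "$merged_at" ] && [ "$merged_at" != "null" ]; then
    echo "dispatch"   # merged: the publish workflow can be dispatched
  elif [ "$state" = "CLOSED" ]; then
    echo "abort"      # closed without merging: stop polling immediately
  else
    echo "wait"       # still open: sleep and poll again
  fi
}

poll_decision "2026-04-30T07:10:00Z" "MERGED"   # prints "dispatch"
poll_decision "" "CLOSED"                       # prints "abort"
poll_decision "" "OPEN"                         # prints "wait"
```

The merged check comes first so that a PR which has already merged is
never mistaken for a dead one.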

Verification on this PR is structural (workflow won't fire on a
staging→main promotion until this lands AND a subsequent staging push
triggers auto-promote). The actions:write fix in particular is
unverifiable until the next real run; the prior #2358 fix has the same
property, so we're stacking two unverifiable workflow edits. That's
intentional rather than risky: stage 1 (#2358) was load-bearing for the
deploy-chain restoration; stage 2 (this PR) hardens it before it
actually matters.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/auto-promote-staging.yml | 37 ++++++++++++++++++++--
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/auto-promote-staging.yml b/.github/workflows/auto-promote-staging.yml
index 8304398c..33c54e7e 100644
--- a/.github/workflows/auto-promote-staging.yml
+++ b/.github/workflows/auto-promote-staging.yml
@@ -76,6 +76,27 @@ on:
 permissions:
   contents: write
   pull-requests: write
+  # actions: write is needed by the post-merge dispatch tail step
+  # (#2358 / #2357) — `gh workflow run publish-workspace-server-image.yml`
+  # POSTs to /actions/workflows/.../dispatches which requires this scope.
+  # Without it the call 403s and the publish/canary/redeploy chain still
+  # doesn't run on staging→main promotions, undoing #2358.
+  actions: write
+
+# Serialize auto-promote runs. Multiple staging gate completions can land
+# in quick succession (CI + E2E + CodeQL all finish within seconds of
+# each other on a green PR) — without this, two parallel runs both:
+#   1. Open / re-use the same promote PR.
+#   2. Both call `gh pr merge --auto` (idempotent — fine).
+#   3. Both poll for the same mergedAt and both `gh workflow run` publish
+#      → 2× redundant publish builds racing for the same `:staging-latest`
+#      retag, and 2× canary-verify chains.
+# cancel-in-progress: false because we don't want a brand-new run to kill
+# a polling-tail that's about to dispatch — the polling tail's 30 min cap
+# is the right backstop, not workflow-level cancel.
+concurrency:
+  group: auto-promote-staging
+  cancel-in-progress: false
 
 jobs:
   check-all-gates-green:
@@ -271,19 +292,29 @@ jobs:
           PR_NUM: ${{ steps.promote_pr.outputs.promote_pr_num }}
         run: |
           # Poll for merge — max 30 min (60 × 30s). The merge queue
-          # typically lands within 5-10 min when gates are green.
+          # typically lands within 5-10 min when gates are green. Break
+          # early if the PR is closed without merging (operator action,
+          # gates flipped red post-approval, branch-protection rejection)
+          # so we don't tie up a runner for the full 30 min on a dead PR.
           MERGED=""
+          STATE=""
           for _ in $(seq 1 60); do
-            MERGED=$(gh pr view "$PR_NUM" --repo "$REPO" --json mergedAt --jq '.mergedAt // ""')
+            VIEW=$(gh pr view "$PR_NUM" --repo "$REPO" --json mergedAt,state)
+            MERGED=$(echo "$VIEW" | jq -r '.mergedAt // ""')
+            STATE=$(echo "$VIEW" | jq -r '.state // ""')
             if [ -n "$MERGED" ] && [ "$MERGED" != "null" ]; then
               echo "::notice::Promote PR #${PR_NUM} merged at ${MERGED}"
               break
             fi
+            if [ "$STATE" = "CLOSED" ]; then
+              echo "::warning::Promote PR #${PR_NUM} was closed without merging — skipping deploy dispatch."
+              exit 0
+            fi
             sleep 30
           done
 
           if [ -z "$MERGED" ] || [ "$MERGED" = "null" ]; then
-            echo "::warning::Promote PR #${PR_NUM} didn't merge within 30min — skipping deploy dispatch (manually run \`gh workflow run redeploy-tenants-on-main.yml\` once it lands)."
+            echo "::warning::Promote PR #${PR_NUM} didn't merge within 30min — skipping deploy dispatch (manually run \`gh workflow run publish-workspace-server-image.yml --ref main\` once it lands)."
             exit 0
           fi