From 74253513211442a08ef4f72a6f2febed10567e3c Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Sun, 26 Apr 2026 14:44:27 -0700
Subject: [PATCH] fix(ci): canary teardown safety-net slug pattern (was reversed)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[Molecule-Platform-Evolvement-Manager]

## What was broken

`canary-staging.yml`'s teardown safety-net step filtered candidate slugs
with `f'e2e-{today}-canary-'`. But `test_staging_full_saas.sh` emits
canary slugs as `e2e-canary-${date}-${RUN_ID_SUFFIX}` — date SECOND,
mode FIRST. Full-mode slugs are the other way around
(`e2e-${date}-${RUN_ID_SUFFIX}`), and the canary workflow seems to have
been copy-pasted from there without re-checking the slug generator.

Net effect: the safety-net step ran on every cancelled / failed canary,
hit the CP, got the org list, filtered to zero matches, and exited
cleanly. Every cancelled canary EC2 leaked until the once-an-hour
`sweep-stale-e2e-orgs.yml` cron eventually caught it (120-min default
age threshold means ≥1h leak in the worst case).

## Today's incident

Canary run 24966995140 cancelled at 21:03Z. EC2
`tenant-e2e-canary-20260426-canary-24966` still running 1h25m later,
manually terminated by the CEO. Three earlier cancellations today
(16:04Z, 19:26Z, 20:02Z) hit the same gap — visible as the hourly
canary failure pattern in #2090.

## Fix

- Filter prefix corrected to `e2e-canary-${today}-` (mode FIRST, date
  SECOND) to match the actual slug emitter.
- Added per-run scoping (`-canary-${GITHUB_RUN_ID}-` suffix) when
  GITHUB_RUN_ID is set, mirroring the e2e-staging-saas.yml safety net's
  per-run scoping that was added after the 2026-04-21 cross-run cleanup
  incident — guards against a queued canary's safety-net step deleting
  an in-flight different canary's slug while the queue's
  `cancel-in-progress: false` lets two reach the teardown step
  concurrently.
- Added a comment block tracing the bug + the prior incident so the
  next maintainer doesn't re-introduce the same mistake.

## Test plan

- [x] Manual trace: today's slug `e2e-canary-20260426-canary-24966...`
  now matches `e2e-canary-20260426-canary-24966` prefix
- [x] YAML parses
- [ ] Next canary cancellation cleans up automatically

## Companion PR

The PRIMARY symptom (TLS-timeout failures, not the leaked EC2) traces
to a separate bug in `molecule-controlplane`: tunnel/DNS creation
errors are logged-and-continued rather than failing provision. PR
coming separately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/canary-staging.yml | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/canary-staging.yml b/.github/workflows/canary-staging.yml
index 0c4bae19..fa88cd15 100644
--- a/.github/workflows/canary-staging.yml
+++ b/.github/workflows/canary-staging.yml
@@ -152,14 +152,34 @@ jobs:
           ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
         run: |
           set +e
+          # Slug prefix matches what test_staging_full_saas.sh emits
+          # in canary mode:
+          #   SLUG="e2e-canary-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
+          # Earlier this was `e2e-{today}-canary-` — that was the
+          # full-mode pattern (date FIRST, mode SECOND); canary slugs
+          # have mode FIRST, date SECOND. The mismatch silently
+          # never matched, leaving every cancelled-canary EC2 alive
+          # until the once-an-hour sweep eventually caught it
+          # (incident 2026-04-26 21:03Z: 1h25m EC2 leak before manual
+          # cleanup; same gap on three earlier cancellations today).
           orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \
             -H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \
             | python3 -c "
-          import json, sys
+          import json, sys, os
+          run_id = os.environ.get('GITHUB_RUN_ID', '')
           d = json.load(sys.stdin)
           today = __import__('datetime').date.today().strftime('%Y%m%d')
+          # Scope to slugs from THIS canary run when GITHUB_RUN_ID is
+          # available; the canary workflow sets E2E_RUN_ID='canary-\${run_id}'
+          # so the slug suffix is '-canary-\${run_id}-...'. Mirrors the
+          # full-mode safety net's per-run scoping (e2e-staging-saas.yml)
+          # added after the 2026-04-21 cross-run cleanup incident.
+          if run_id:
+              prefix = f'e2e-canary-{today}-canary-{run_id}'
+          else:
+              prefix = f'e2e-canary-{today}-'
           candidates = [o['slug'] for o in d.get('orgs', [])
-                        if o.get('slug','').startswith(f'e2e-{today}-canary-')
+                        if o.get('slug','').startswith(prefix)
                         and o.get('status') not in ('purged',)]
           print('\n'.join(candidates))
           " 2>/dev/null)
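
The prefix logic in the hunk above can be traced offline with a small Python sketch. This is an illustration under stated assumptions, not the workflow's code: the helper names `safety_net_prefix` and `candidates` are hypothetical, the sample org records (slug tails such as `-abc`, the `running` status, the run id `11111`) are made up, and `24966` stands in for the run id as it appears in the incident's slug. It compares the old reversed prefix with the corrected one, with and without per-run scoping.

```python
def safety_net_prefix(today: str, run_id: str = "") -> str:
    """Build the slug prefix the corrected filter uses (hypothetical helper)."""
    if run_id:
        # Per-run scoping: only slugs created by this canary run.
        return f"e2e-canary-{today}-canary-{run_id}"
    # Fallback when no run id is available: any canary slug from today.
    return f"e2e-canary-{today}-"


def candidates(orgs, prefix):
    """Return slugs that start with the prefix and are not already purged."""
    return [o["slug"] for o in orgs
            if o.get("slug", "").startswith(prefix)
            and o.get("status") not in ("purged",)]


today = "20260426"

# Made-up org records for illustration only.
orgs = [
    {"slug": "e2e-canary-20260426-canary-24966-abc", "status": "running"},
    {"slug": "e2e-canary-20260426-canary-11111-xyz", "status": "running"},
    {"slug": "e2e-canary-20260426-canary-24966-old", "status": "purged"},
    {"slug": "e2e-20260426-99999-full", "status": "running"},
]

# The old filter: date first, mode second. No canary slug starts this way.
old_prefix = f"e2e-{today}-canary-"

print("old filter:     ", candidates(orgs, old_prefix))
print("new, unscoped:  ", candidates(orgs, safety_net_prefix(today)))
print("new, run-scoped:", candidates(orgs, safety_net_prefix(today, "24966")))
```

With these sample records the old prefix selects nothing, the unscoped prefix selects both live canary slugs, and the run-scoped prefix selects only the slug belonging to the given run, which is the behaviour the per-run scoping is meant to provide.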