fix(e2e): safety-net teardown only sweeps this run's orgs

Previously the safety net matched every e2e-YYYYMMDD-* slug, which
stomped on parallel CI runs AND on manual dev probes against staging.
Incident 2026-04-21 15:02Z: this workflow's safety net deleted an
unrelated manual tenant 1s after it hit 'running', and the dev run
then timed out at 15min.

Scope to f'e2e-{today}-{GITHUB_RUN_ID}-' so each run only cleans its
own leftovers. Empty run_id (local invocation) keeps the old broader
behaviour so dev safety-nets still sweep.
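
A minimal sketch of the prefix rule above, pulled out into a standalone function. The name sweep_prefix and the sample run id are hypothetical; the workflow inlines this logic rather than calling a helper.

```python
import datetime


def sweep_prefix(run_id: str, today: datetime.date) -> str:
    """Return the slug prefix the safety net is allowed to sweep.

    Hypothetical helper for illustration only.
    """
    day = today.strftime('%Y%m%d')
    # In CI, GITHUB_RUN_ID is set, so only this run's orgs match.
    # Locally it is empty, so fall back to the broader same-day prefix.
    return f'e2e-{day}-{run_id}-' if run_id else f'e2e-{day}-'


day = datetime.date(2026, 4, 21)
ci_prefix = sweep_prefix('12345', day)   # made-up run id
local_prefix = sweep_prefix('', day)
```

The trailing '-' after the run id matters: without it, run 123 would also match slugs that belong to run 1234.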

Also fix: the previous filter read o.get('status'), but the admin API
response has no 'status' field, so the lookup always returned None and
the 'purged' check never excluded anything. Now reads instance_status
(the real field).
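
To illustrate the field fix: dict.get on a missing key returns None, and None is never in ('purged',), so the old check let every org through. The sketch below assumes org records carrying only slug and instance_status; the function name and sample records are made up, and only the field names come from this commit.

```python
def candidates(orgs, prefix, status_key):
    """Slugs under prefix whose status_key value is not 'purged'."""
    return [o['slug'] for o in orgs
            if o.get('slug', '').startswith(prefix)
            and o.get(status_key) not in ('purged',)]


# Hypothetical admin API records for illustration.
orgs = [
    {'slug': 'e2e-20260421-12345-a', 'instance_status': 'running'},
    {'slug': 'e2e-20260421-12345-b', 'instance_status': 'purged'},
]
prefix = 'e2e-20260421-12345-'

old = candidates(orgs, prefix, 'status')           # key absent: nothing filtered
new = candidates(orgs, prefix, 'instance_status')  # purged org is skipped
```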

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-21 08:16:12 -07:00
parent e9d111dbc6
commit 81c4c02547

@@ -128,9 +128,15 @@ jobs:
 run_id = os.environ.get('GITHUB_RUN_ID', '')
 d = json.load(sys.stdin)
 today = __import__('datetime').date.today().strftime('%Y%m%d')
+# ONLY sweep slugs from *this* CI run. Previously the filter was
+# f'e2e-{today}-' which stomped on parallel CI runs AND any manual
+# E2E probes a dev was running against staging (incident 2026-04-21
+# 15:02Z: this workflow's safety net deleted an unrelated manual
+# run's tenant 1s after it hit 'running').
+prefix = f'e2e-{today}-{run_id}-' if run_id else f'e2e-{today}-'
 candidates = [o['slug'] for o in d.get('orgs', [])
-if o.get('slug','').startswith(f'e2e-{today}-')
-and o.get('status') not in ('purged',)]
+if o.get('slug','').startswith(prefix)
+and o.get('instance_status') not in ('purged',)]
 print('\n'.join(candidates))
 " 2>/dev/null)
 for slug in $orgs; do