molecule-core


Hongming Wang 7425351321 fix(ci): canary teardown safety-net slug pattern (was reversed)

[Molecule-Platform-Evolvement-Manager]

## What was broken

`canary-staging.yml`'s teardown safety-net step filtered candidate slugs with `f'e2e-{today}-canary-'`. But `test_staging_full_saas.sh` emits canary slugs as `e2e-canary-${date}-${RUN_ID_SUFFIX}` (date SECOND, mode FIRST). Full-mode slugs are the other way around (`e2e-${date}-${RUN_ID_SUFFIX}`), and the canary workflow seems to have been copy-pasted from there without re-checking the slug generator.

Net effect: the safety-net step ran on every cancelled / failed canary, hit the CP, got the org list, filtered to zero matches, and exited cleanly. Every cancelled canary EC2 leaked until the once-an-hour `sweep-stale-e2e-orgs.yml` cron eventually caught it (120-min default age threshold means ≥1h leak in the worst case).

## Today's incident

Canary run 24966995140 cancelled at 21:03Z. EC2 `tenant-e2e-canary-20260426-canary-24966` still running 1h25m later, manually terminated by the CEO. Three earlier cancellations today (16:04Z, 19:26Z, 20:02Z) hit the same gap, visible as the hourly canary failure pattern in #2090.

## Fix

- Filter prefix corrected to `e2e-canary-${today}-` (mode FIRST, date SECOND) to match the actual slug emitter.
- Added per-run scoping (`-canary-${GITHUB_RUN_ID}-` suffix) when GITHUB_RUN_ID is set, mirroring the e2e-staging-saas.yml safety net's per-run scoping that was added after the 2026-04-21 cross-run cleanup incident. This guards against a queued canary's safety-net step deleting an in-flight different canary's slug while the queue's `cancel-in-progress: false` lets two reach the teardown step concurrently.
- Added a comment block tracing the bug + the prior incident so the next maintainer doesn't re-introduce the same mistake.

## Test plan

- [x] Manual trace: today's slug `e2e-canary-20260426-canary-24966...` now matches `e2e-canary-20260426-canary-24966` prefix
- [x] YAML parses
- [ ] Next canary cancellation cleans up automatically

## Companion PR

The PRIMARY symptom (TLS-timeout failures, not the leaked EC2) traces to a separate bug in `molecule-controlplane`: tunnel/DNS creation errors are logged-and-continued rather than failing provision. PR coming separately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-26 14:44:27 -07:00
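For illustration only, the slug-filter change described under "Fix" can be sketched in Python. This is not the code in `canary-staging.yml`. The function names, the sample slugs, and the exact way the per-run scope is joined onto the date prefix are all assumptions made for this example.

```python
from typing import Iterable, List, Optional


def old_candidates(slugs: Iterable[str], today: str) -> List[str]:
    """Reversed pattern quoted in the commit message: date first, mode second."""
    prefix = f"e2e-{today}-canary-"
    return [s for s in slugs if s.startswith(prefix)]


def canary_prefix(today: str, run_id: Optional[str] = None) -> str:
    """Corrected pattern: mode first, date second."""
    prefix = f"e2e-canary-{today}-"
    if run_id:
        # ASSUMPTION: the per-run scope is appended as "canary-<run id>-".
        # The real workflow may compose it differently.
        prefix += f"canary-{run_id}-"
    return prefix


def teardown_candidates(
    slugs: Iterable[str], today: str, run_id: Optional[str] = None
) -> List[str]:
    """Select only the slugs this safety-net step is allowed to delete."""
    prefix = canary_prefix(today, run_id)
    return [s for s in slugs if s.startswith(prefix)]


# Hypothetical org slugs, invented for this example.
SLUGS = [
    "e2e-canary-20260426-canary-111-abc",  # canary, run 111
    "e2e-canary-20260426-canary-222-def",  # canary, run 222
    "e2e-20260426-333-ghi",                # full-mode run
]

if __name__ == "__main__":
    print(old_candidates(SLUGS, "20260426"))
    print(teardown_candidates(SLUGS, "20260426"))
    print(teardown_candidates(SLUGS, "20260426", run_id="111"))
```

With these sample slugs the reversed prefix selects nothing, which is the silent no-op the commit message describes. The trailing `-` after the run ID in the scoped prefix keeps run `11` from matching slugs that belong to run `111`.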
auto-promote-staging.yml	ci: canary-verify graceful-skip + draft auto-promote staging→main	2026-04-22 22:39:23 +00:00
auto-tag-runtime.yml	feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth	2026-04-26 10:17:21 -07:00
block-internal-paths.yml	ci(block-paths): fetch PR base SHA to fix shallow-clone diff failure	2026-04-24 12:01:53 +00:00
canary-staging.yml	fix(ci): canary teardown safety-net slug pattern (was reversed)	2026-04-26 14:44:27 -07:00
canary-verify.yml	ci: canary-verify graceful-skip + draft auto-promote staging→main	2026-04-22 22:39:23 +00:00
check-merge-group-trigger.yml	ci: add linter that fails when required workflow lacks merge_group trigger	2026-04-24 00:33:05 -07:00
ci.yml	test(workspace): centralize pytest-cov config + 92% floor (closes #1817)	2026-04-26 06:21:22 -07:00
codeql.yml	ci: add merge_group trigger to ci + codeql	2026-04-23 21:24:53 -07:00
e2e-api.yml	feat(ci): run E2E API smoke test on staging branch	2026-04-23 17:47:47 -07:00
e2e-staging-canvas.yml	feat(ci): run E2E Staging Canvas on staging branch pushes	2026-04-23 17:47:51 -07:00
e2e-staging-saas.yml	fix(e2e): increase hermes workspace wait from 20 to 30 min	2026-04-24 17:11:37 +00:00
e2e-staging-sanity.yml	fix(e2e): CP DELETE /cp/admin/tenants body uses 'confirm', not 'confirm_token'	2026-04-21 04:50:28 -07:00
promote-latest.yml	perf(ci): move all public-repo workflows to ubuntu-latest	2026-04-22 12:56:49 -07:00
publish-canvas-image.yml	perf(ci): move all public-repo workflows to ubuntu-latest	2026-04-22 12:56:49 -07:00
publish-runtime.yml	fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113)	2026-04-26 13:14:47 -07:00
publish-workspace-server-image.yml	ci(publish-image): also tag :staging-latest so CP auto-picks up new builds	2026-04-24 00:29:55 -07:00
redeploy-tenants-on-main.yml	ci(redeploy): fire post-main tenant fleet redeploy via CP admin endpoint	2026-04-24 14:34:28 -07:00
retarget-main-to-staging.yml	ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)	2026-04-26 00:53:55 -07:00
runtime-pin-compat.yml	fix(ci): set WORKSPACE_ID for the runtime-pin smoke import	2026-04-26 01:59:56 -07:00
secret-scan.yml	fix(ci): handle merge_group + shallow-clone BASE in secret-scan	2026-04-26 14:08:19 -07:00
sweep-cf-orphans.yml	fix(ci): stop sweep-cf-orphans noise — drop merge_group + soft-skip when secrets unset	2026-04-26 08:05:53 -07:00
sweep-stale-e2e-orgs.yml	ci: hourly sweep of stale e2e-* orgs on staging	2026-04-24 23:07:57 -07:00
test-ops-scripts.yml	refactor(ops): apply simplify findings on #2027 PR	2026-04-26 00:28:15 -07:00