# fix(ci): canary teardown safety-net slug pattern (was reversed)

Hongming Wang [Molecule-Platform-Evolvement-Manager]

## What was broken

`canary-staging.yml`'s teardown safety-net step filtered candidate
slugs with `f'e2e-{today}-canary-'`. But `test_staging_full_saas.sh`
emits canary slugs as `e2e-canary-${date}-${RUN_ID_SUFFIX}` — date
SECOND, mode FIRST. Full-mode slugs are the other way around
(`e2e-${date}-${RUN_ID_SUFFIX}`), and the canary workflow seems to
have been copy-pasted from there without re-checking the slug
generator.
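The mismatch can be reproduced in a few lines of shell. This is a minimal sketch: the slug shape comes from `test_staging_full_saas.sh` as quoted above, while the date and run-suffix values are illustrative.

```shell
#!/bin/sh
# Minimal reproduction of the prefix mismatch. Slug shape is from
# test_staging_full_saas.sh; the concrete values are examples only.
set -eu

today="20260426"
run_suffix="24966"
slug="e2e-canary-${today}-${run_suffix}"   # emitter: mode FIRST, date SECOND

broken_prefix="e2e-${today}-canary-"       # the bug: date FIRST, mode SECOND
fixed_prefix="e2e-canary-${today}-"        # matches the actual emitter

case "$slug" in
  "$broken_prefix"*) echo "broken prefix: match" ;;
  *)                 echo "broken prefix: zero matches" ;;
esac
case "$slug" in
  "$fixed_prefix"*)  echo "fixed prefix: match" ;;
esac
```

Because the broken prefix never matches any real canary slug, the filter silently reduced every candidate list to empty.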

Net effect: the safety-net step ran on every cancelled / failed
canary, queried the CP, got the org list, filtered it down to zero
matches, and exited cleanly. Every cancelled canary's EC2 leaked
until the once-an-hour `sweep-stale-e2e-orgs.yml` cron eventually
caught it (the 120-min default age threshold plus the hourly
cadence means a leaked instance can run 2-3 h before the sweeper
reaps it).

## Today's incident

Canary run 24966995140 cancelled at 21:03Z. EC2
`tenant-e2e-canary-20260426-canary-24966` still running 1h25m
later, manually terminated by the CEO. Three earlier cancellations
today (16:04Z, 19:26Z, 20:02Z) hit the same gap — visible as the
hourly canary failure pattern in #2090.

## Fix

- Filter prefix corrected to `e2e-canary-${today}-` (mode FIRST,
  date SECOND) to match the actual slug emitter.
- Added per-run scoping (a `-canary-${GITHUB_RUN_ID}-` suffix) when
  `GITHUB_RUN_ID` is set, mirroring the per-run scoping the
  `e2e-staging-saas.yml` safety net gained after the 2026-04-21
  cross-run cleanup incident. The queue's `cancel-in-progress: false`
  setting can let two canaries reach the teardown step concurrently,
  so an unscoped filter would let a queued canary's safety net delete
  a different in-flight canary's slug.
- Added a comment block tracing the bug + the prior incident so
  the next maintainer doesn't re-introduce the same mistake.
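A sketch of what the corrected filter logic could look like. This is illustrative only: the real step reads slugs from the control plane, and the `candidates` list, `run_scope` value, and variable names here are stand-ins, not the workflow's actual tooling.

```shell
#!/bin/sh
# Illustrative sketch of the corrected teardown filter.
set -eu

today="20260426"             # example; the workflow derives this from the date
run_scope="canary-24966"     # hypothetical per-run suffix for scoping

# Stand-in for the CP org list: two canary slugs plus one
# full-mode-shaped slug that must NOT match.
candidates="e2e-canary-${today}-canary-24966
e2e-canary-${today}-canary-55555
e2e-${today}-canary-77777"

# Corrected prefix: mode FIRST, date SECOND.
matched="$(printf '%s\n' "$candidates" | grep "^e2e-canary-${today}-" || true)"
echo "prefix matches:"
echo "$matched"

# Per-run scoping: narrow to this run only, so a queued canary's
# safety net cannot touch a sibling run's slug.
scoped="$(printf '%s\n' "$matched" | grep -- "-${run_scope}\$" || true)"
echo "scoped match: $scoped"
```

The `|| true` guards keep the step from aborting under `set -e` when the filter legitimately finds zero candidates.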

## Test plan

- [x] Manual trace: today's slug `e2e-canary-20260426-canary-24966...`
      now matches the `e2e-canary-20260426-canary-24966` prefix
- [x] YAML parses
- [ ] Next canary cancellation cleans up automatically

## Companion PR

The PRIMARY symptom (TLS-timeout failures, not the leaked EC2)
traces to a separate bug in `molecule-controlplane`: tunnel/DNS
creation errors are logged-and-continued rather than failing
provision. PR coming separately.
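For context, the failure mode that companion PR targets looks roughly like the following. The `create_tunnel` function is a hypothetical stand-in, not the controlplane's real API; it simulates tunnel/DNS creation failing.

```shell
#!/bin/sh
# Hypothetical stand-in for the CP's tunnel/DNS creation step.
create_tunnel() { return 1; }   # simulate the creation failing

# Antipattern (current CP behavior per this PR): log and continue, so
# provisioning reports success and the failure only surfaces later as
# a TLS timeout in the canary.
if ! create_tunnel; then
  echo "WARN: tunnel creation failed, continuing" >&2
fi
echo "log-and-continue provision: ok (masks the failure)"

# Fix sketch: propagate the error so provision itself is marked failed
# at the point the tunnel/DNS step breaks.
if create_tunnel; then
  echo "fail-fast provision: ok"
else
  echo "fail-fast provision: failed"
fi
```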

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 14:44:27 -07:00