molecule-core/.github/workflows
Hongming Wang 3a36d732e4 fix(ci): sweep prior UTC day in e2e safety nets (midnight-rollover)
[Molecule-Platform-Evolvement-Manager]

## What was breaking

All three staging e2e workflows' "Teardown safety net" steps
filtered candidate slugs by `f'e2e-...-{today}-...'` where `today`
was computed at safety-net-step time via `datetime.date.today()`.

When a run crossed midnight UTC (start before 00:00, end after),
`today` became the NEXT day, but the slug it created carried the
PRIOR day's date. The filter never matched its own slug → leak.

## Today's incident

E2E Staging Canvas run [24970092066](
https://github.com/Molecule-AI/molecule-core/actions/runs/24970092066):
  - started 2026-04-26 23:45:59Z
  - created slug `e2e-canvas-20260426-1u8nz3` at 23:59Z
  - ended 2026-04-27 00:12:47Z (failure)
  - safety-net step ran with `today=20260427`
  - filter `e2e-canvas-20260427-` did not match `...20260426-1u8nz3`
  - tenant + child workspace EC2 both stayed up

Confirmed via CP staging logs: no DELETE for `1u8nz3` ever issued.
The Playwright globalTeardown didn't fire (test crashed mid-run);
the workflow safety-net was the last line and it missed.

## Fix

All three workflows now sweep BOTH today AND yesterday's UTC dates,
so a run that crosses midnight still matches its own slug:

```python
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)
dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d'))
prefixes = tuple(f'e2e-canvas-{d}-' for d in dates)  # (canvas variant)
```

Per-run-id scoping (saas + canary) is preserved — the prior-day
prefix still includes the run_id, so cross-midnight runs only sweep
their own slugs, not other in-flight runs from yesterday.

## Why two-day window vs. arbitrary lookback

A run can't legitimately last more than 24h on GitHub-hosted
runners (workflow `timeout-minutes` caps; canary=25, e2e-saas=45,
canvas=30). Two-day window is enough to cover any cross-midnight
run without widening the cross-run-cleanup blast radius further.
The `sweep-stale-e2e-orgs.yml` cron (with its 120-min age threshold)
remains the catch-all for anything older that drifts through.

## Test plan

- [x] Manual logic simulation: post-midnight slug matches yesterday's
      prefix; same-day still matches; 2-days-ago does NOT match;
      production tenant never matches
- [x] All three workflow YAMLs syntactically valid
- [ ] Next cross-midnight run cleans up its own slug

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 19:23:36 -07:00
..
auto-promote-staging.yml ci: canary-verify graceful-skip + draft auto-promote staging→main 2026-04-22 22:39:23 +00:00
auto-tag-runtime.yml feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth 2026-04-26 10:17:21 -07:00
block-internal-paths.yml ci(block-paths): fetch PR base SHA to fix shallow-clone diff failure 2026-04-24 12:01:53 +00:00
canary-staging.yml fix(ci): sweep prior UTC day in e2e safety nets (midnight-rollover) 2026-04-26 19:23:36 -07:00
canary-verify.yml ci: canary-verify graceful-skip + draft auto-promote staging→main 2026-04-22 22:39:23 +00:00
check-merge-group-trigger.yml ci: add linter that fails when required workflow lacks merge_group trigger 2026-04-24 00:33:05 -07:00
ci.yml test(workspace): centralize pytest-cov config + 92% floor (closes #1817) 2026-04-26 06:21:22 -07:00
codeql.yml ci: add merge_group trigger to ci + codeql 2026-04-23 21:24:53 -07:00
e2e-api.yml feat(ci): run E2E API smoke test on staging branch 2026-04-23 17:47:47 -07:00
e2e-staging-canvas.yml fix(ci): sweep prior UTC day in e2e safety nets (midnight-rollover) 2026-04-26 19:23:36 -07:00
e2e-staging-saas.yml fix(ci): sweep prior UTC day in e2e safety nets (midnight-rollover) 2026-04-26 19:23:36 -07:00
e2e-staging-sanity.yml fix(e2e): CP DELETE /cp/admin/tenants body uses 'confirm', not 'confirm_token' 2026-04-21 04:50:28 -07:00
promote-latest.yml perf(ci): move all public-repo workflows to ubuntu-latest 2026-04-22 12:56:49 -07:00
publish-canvas-image.yml perf(ci): move all public-repo workflows to ubuntu-latest 2026-04-22 12:56:49 -07:00
publish-runtime.yml fix(publish-runtime): use PyPI Trusted Publisher (OIDC) instead of PYPI_TOKEN (#2113) 2026-04-26 13:14:47 -07:00
publish-workspace-server-image.yml ci(publish-image): also tag :staging-latest so CP auto-picks up new builds 2026-04-24 00:29:55 -07:00
redeploy-tenants-on-main.yml ci(redeploy): fire post-main tenant fleet redeploy via CP admin endpoint 2026-04-24 14:34:28 -07:00
retarget-main-to-staging.yml ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884) 2026-04-26 00:53:55 -07:00
runtime-pin-compat.yml fix(ci): set WORKSPACE_ID for the runtime-pin smoke import 2026-04-26 01:59:56 -07:00
secret-scan.yml docs(ci): fix secret-scan reusable workflow self-doc — repo is molecule-core, ref is @staging 2026-04-26 15:44:31 -07:00
sweep-cf-orphans.yml fix(ci): stop sweep-cf-orphans noise — drop merge_group + soft-skip when secrets unset 2026-04-26 08:05:53 -07:00
sweep-stale-e2e-orgs.yml ci: hourly sweep of stale e2e-* orgs on staging 2026-04-24 23:07:57 -07:00
test-ops-scripts.yml refactor(ops): apply simplify findings on #2027 PR 2026-04-26 00:28:15 -07:00