Adds a janitor workflow that runs every hour and deletes any
e2e-prefixed staging org older than MAX_AGE_MINUTES (default 120).
Catches orgs left behind when per-test-run teardown didn't fire:
CI cancellation, runner crash, transient AWS error mid-cascade,
bash trap missed (signal 9), etc.
Why it exists despite per-run teardown:
- Per-run teardown is best-effort by definition. Any process death
after the test starts but before the trap fires leaves debris.
- GH Actions cancellation kills the runner with no grace period —
the workflow's `if: always()` step usually catches this but can
still fail on transient CP 5xx at the wrong moment.
- The CP cascade itself has best-effort branches today
(cascadeTerminateWorkspaces logs+continues on individual EC2
termination failures; DNS deletion same shape). Those need
cleanup-correctness work in the CP, but a safety net belongs in
CI either way — defense in depth.
Behaviour:
- Cron every hour. Manual workflow_dispatch with overrideable
max_age_minutes + dry_run inputs for one-off cleanups.
- Concurrency group prevents two sweeps fighting.
- SAFETY_CAP=50 — refuses to delete more than 50 orgs in a single
tick. If the CP admin endpoint goes weird and returns no
created_at (or returns no orgs at all), every e2e-* would look
stale; the cap catches the runaway-nuke case.
- DELETE is idempotent CP-side via org_purges.last_step, so a
half-deleted org from a prior sweep gets picked up cleanly on the
next tick.
- Per-org delete failures don't fail the workflow. Next hourly tick
retries. The workflow only fails loud at the safety-cap gate.
Tonight's specific motivation: ~10 canvas-tabs E2E retries in 2 hours
with various failure modes; each provisioned a fresh tenant + EC2 +
DNS + DB row. Some fraction leaked. Without this loop, ops has to
periodically run the manual sweep-cf-orphans.sh script. With it,
staging self-heals.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>