molecule-core/.github/workflows
Hongming Wang fe075ee1ba ci: hourly sweep of stale e2e-* orgs on staging
Adds a janitor workflow that runs every hour and deletes any
e2e-prefixed staging org older than MAX_AGE_MINUTES (default 120).
Catches orgs left behind when per-test-run teardown didn't fire:
CI cancellation, runner crash, transient AWS error mid-cascade,
bash trap missed (signal 9), etc.

Why it exists despite per-run teardown:
- Per-run teardown is best-effort by definition. Any process death
  after the test starts but before the trap fires leaves debris.
- GH Actions cancellation kills the runner with no grace period —
  the workflow's `if: always()` step usually catches this but can
  still fail on transient CP 5xx at the wrong moment.
- The CP cascade itself has best-effort branches today
  (cascadeTerminateWorkspaces logs+continues on individual EC2
  termination failures; DNS deletion same shape). Those need
  cleanup-correctness work in the CP, but a safety net belongs in
  CI either way — defense in depth.

Behaviour:
- Cron every hour. Manual workflow_dispatch with overrideable
  max_age_minutes + dry_run inputs for one-off cleanups.
- Concurrency group prevents two sweeps fighting.
- SAFETY_CAP=50 — refuses to delete more than 50 orgs in a single
  tick. If the CP admin endpoint goes weird and returns no
  created_at (or returns no orgs at all), every e2e-* would look
  stale; the cap catches the runaway-nuke case.
- DELETE is idempotent CP-side via org_purges.last_step, so a
  half-deleted org from a prior sweep gets picked up cleanly on the
  next tick.
- Per-org delete failures don't fail the workflow. Next hourly tick
  retries. The workflow only fails loud at the safety-cap gate.

Tonight's specific motivation: ~10 canvas-tabs E2E retries in 2 hours
with various failure modes; each provisioned a fresh tenant + EC2 +
DNS + DB row. Some fraction leaked. Without this loop, ops has to
periodically run the manual sweep-cf-orphans.sh script. With it,
staging self-heals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 23:07:57 -07:00
..
auto-promote-staging.yml ci: canary-verify graceful-skip + draft auto-promote staging→main 2026-04-22 22:39:23 +00:00
block-internal-paths.yml ci(block-paths): fetch PR base SHA to fix shallow-clone diff failure 2026-04-24 12:01:53 +00:00
canary-staging.yml fix(e2e): CP DELETE /cp/admin/tenants body uses 'confirm', not 'confirm_token' 2026-04-21 04:50:28 -07:00
canary-verify.yml ci: canary-verify graceful-skip + draft auto-promote staging→main 2026-04-22 22:39:23 +00:00
check-merge-group-trigger.yml ci: add linter that fails when required workflow lacks merge_group trigger 2026-04-24 00:33:05 -07:00
ci.yml ci: add merge_group trigger to ci + codeql 2026-04-23 21:24:53 -07:00
codeql.yml ci: add merge_group trigger to ci + codeql 2026-04-23 21:24:53 -07:00
e2e-api.yml feat(ci): run E2E API smoke test on staging branch 2026-04-23 17:47:47 -07:00
e2e-staging-canvas.yml feat(ci): run E2E Staging Canvas on staging branch pushes 2026-04-23 17:47:51 -07:00
e2e-staging-saas.yml fix(e2e): increase hermes workspace wait from 20 to 30 min 2026-04-24 17:11:37 +00:00
e2e-staging-sanity.yml fix(e2e): CP DELETE /cp/admin/tenants body uses 'confirm', not 'confirm_token' 2026-04-21 04:50:28 -07:00
promote-latest.yml perf(ci): move all public-repo workflows to ubuntu-latest 2026-04-22 12:56:49 -07:00
publish-canvas-image.yml perf(ci): move all public-repo workflows to ubuntu-latest 2026-04-22 12:56:49 -07:00
publish-workspace-server-image.yml ci(publish-image): also tag :staging-latest so CP auto-picks up new builds 2026-04-24 00:29:55 -07:00
retarget-main-to-staging.yml ci: auto-retarget bot PRs opened against main → staging (#1853) 2026-04-23 19:20:40 +00:00
sweep-stale-e2e-orgs.yml ci: hourly sweep of stale e2e-* orgs on staging 2026-04-24 23:07:57 -07:00