Two changes that close one of the leak classes from the
molecule-controlplane#420 vCPU audit:
1. sweep-stale-e2e-orgs.yml: cron */15 (was hourly), MAX_AGE_MINUTES
30 (was 120). E2E runs are 8-25 min wall clock; 30 min is safely
above the longest run while shrinking the worst-case leak window
from ~2h to ~45 min (15-min sweep cadence + 30-min threshold).
2. canary-staging.yml teardown: the per-slug DELETE used `>/dev/null
|| true`, which swallowed every failure. A 5xx or timeout from CP
looked identical to "successfully deleted" and the canary tenant
kept eating ~2 vCPU until the sweeper caught it. Now we capture
the response code and surface non-2xx as a workflow warning that
names the leaked slug.
The exit semantics stay unchanged — a single-canary cleanup miss
shouldn't fail the canary job itself when the actual smoke check
passed. The sweeper is the safety net for whatever slips past.
Caught during the molecule-controlplane#420 audit on 2026-05-03 —
three e2e canary tenant orphans had been running for 24-95 min, all
under the previous 120-min sweep threshold, so they went unnoticed until
manual cleanup. Same `|| true` pattern exists in
e2e-staging-{canvas,external,saas,sanity}.yml; out of scope for
this PR (mechanical port; tracking separately) but the sweeper
tightening covers all of them by reducing the safety-net latency.
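For context, the canary-staging.yml change described in (2) has roughly this shape — a sketch, not the exact diff; `CP_URL`, `SLUG`, and the defaults below are illustrative stand-ins for whatever the teardown step actually uses:

```shell
# Illustrative defaults only — the real workflow gets these from its env.
CP_URL="${CP_URL:-https://staging-api.moleculesai.app}"
SLUG="${SLUG:-e2e-canary-demo}"
ADMIN_TOKEN="${ADMIN_TOKEN:-}"

# Before: `curl -X DELETE ... >/dev/null || true` swallowed every failure.
# After: capture the status code; 000 marks curl-level failures (timeout,
# connect error), anything else is the CP's actual response.
http_code=$(curl -sS -o /tmp/del_resp -w "%{http_code}" --max-time 60 \
  -X DELETE "$CP_URL/cp/admin/tenants/$SLUG" \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"confirm\":\"$SLUG\"}") || http_code="000"

case "$http_code" in
  2??) echo "deleted: $SLUG" ;;
  *)   echo "::warning::canary teardown: DELETE $SLUG returned $http_code," \
            "tenant may be leaked until the sweeper reaps it" ;;
esac
# Deliberately no `exit 1` here — a cleanup miss alone shouldn't fail a
# canary whose smoke check passed; the sweeper is the backstop.
```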
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
186 lines
8.2 KiB
YAML
name: Sweep stale e2e-* orgs (staging)

# Janitor for staging tenants left behind when E2E cleanup didn't run:
# CI cancellations, runner crashes, transient AWS errors mid-cascade,
# bash trap missed (signal 9), etc. Without this loop, every failed
# teardown leaks an EC2 + DNS + DB row until manual ops cleanup —
# 2026-04-23 staging hit the 64 vCPU AWS quota from ~27 such orphans.
#
# Why not rely on per-test-run teardown:
# - Per-run teardown is best-effort by definition. Any process death
#   after the test starts but before the trap fires leaves debris.
# - GH Actions cancellation kills the runner without grace period.
#   The workflow's `if: always()` step usually catches this, but it
#   too can fail (CP transient 5xx, runner network issue at the
#   wrong moment).
# - Even when teardown runs, the CP cascade is best-effort in places
#   (cascadeTerminateWorkspaces logs+continues; DNS deletion same).
# - This sweep is the catch-all that converges staging back to clean
#   regardless of which specific path leaked.
#
# The PROPER fix is making CP cleanup transactional + verify-after-
# terminate (filed separately as cleanup-correctness work). This
# workflow is the safety net that catches everything else AND any
# future leak source we haven't yet identified.

on:
  schedule:
    # Every 15 min. E2E orgs are short-lived (~8-25 min wall clock from
    # create to teardown — canary is ~8 min, full SaaS ~25 min). The
    # previous hourly + 120-min stale threshold meant a leaked tenant
    # could keep an EC2 alive for up to 2 hours, eating ~2 vCPU per
    # leak. Tightening the cadence + threshold reduces the worst-case
    # leak window from 120 min to ~45 min (15-min sweep cadence + 30-min
    # threshold) without risk of catching in-progress runs (the longest
    # e2e run is the 25-min full SaaS run, still under the 30-min
    # threshold). See molecule-controlplane#420 for the leak-class
    # accounting that motivated this tightening.
    - cron: '*/15 * * * *'
  workflow_dispatch:
    inputs:
      max_age_minutes:
        description: "Delete e2e-* orgs older than N minutes (default 30)"
        required: false
        default: "30"
      dry_run:
        description: "Dry run only — list what would be deleted"
        required: false
        type: boolean
        default: false

# Don't let two sweeps fight. Cron + workflow_dispatch could overlap
# on a manual trigger; queue rather than parallel-delete.
concurrency:
  group: sweep-stale-e2e-orgs
  cancel-in-progress: false

permissions:
  contents: read

jobs:
  sweep:
    name: Sweep e2e orgs
    runs-on: ubuntu-latest
    timeout-minutes: 15
    env:
      MOLECULE_CP_URL: https://staging-api.moleculesai.app
      ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
      MAX_AGE_MINUTES: ${{ github.event.inputs.max_age_minutes || '30' }}
      DRY_RUN: ${{ github.event.inputs.dry_run || 'false' }}
      # Refuse to delete more than this many orgs in one tick. If the
      # CP DB is briefly empty (or the admin endpoint goes weird and
      # returns no created_at), every e2e- org would look stale.
      # Bailing protects against runaway nukes.
      SAFETY_CAP: 50

    steps:
      - name: Verify admin token present
        run: |
          if [ -z "$ADMIN_TOKEN" ]; then
            echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set"
            exit 2
          fi
          echo "Admin token present ✓"

      - name: Identify stale e2e orgs
        id: identify
        run: |
          set -euo pipefail
          # Fetch into a file so the python step reads it from disk —
          # cleaner than embedding $(curl ...) into a heredoc.
          curl -sS --fail-with-body --max-time 30 \
            "$MOLECULE_CP_URL/cp/admin/orgs?limit=500" \
            -H "Authorization: Bearer $ADMIN_TOKEN" \
            > orgs.json

          # Filter:
          # 1. slug starts with one of the ephemeral test prefixes:
          #    - 'e2e-'    — covers e2e-canary-*, e2e-canvas-*, etc.
          #    - 'rt-e2e-' — runtime-test harness fixtures (RFC #2251);
          #      missing this prefix left two such tenants orphaned 8h
          #      on staging (2026-05-03), then hard-failed
          #      redeploy-tenants-on-staging and broke the staging→main
          #      auto-promote chain. Kept in sync with the
          #      EPHEMERAL_PREFIX_RE regex in redeploy-tenants-on-staging.yml.
          # 2. created_at is older than MAX_AGE_MINUTES ago
          # Output one slug per line to a file the next step reads.
          python3 > stale_slugs.txt <<'PY'
          import json, os
          from datetime import datetime, timezone, timedelta
          EPHEMERAL_PREFIXES = ("e2e-", "rt-e2e-")
          with open("orgs.json") as f:
              data = json.load(f)
          max_age = int(os.environ["MAX_AGE_MINUTES"])
          cutoff = datetime.now(timezone.utc) - timedelta(minutes=max_age)
          for o in data.get("orgs", []):
              slug = o.get("slug", "")
              if not slug.startswith(EPHEMERAL_PREFIXES):
                  continue
              created = o.get("created_at")
              if not created:
                  # Defensively skip rows without created_at — better
                  # to leave one orphan than nuke a brand-new row
                  # whose timestamp didn't render.
                  continue
              # Python 3.11+ handles RFC3339 with Z directly via
              # fromisoformat; older runners need the trailing Z swap.
              created_dt = datetime.fromisoformat(created.replace("Z", "+00:00"))
              if created_dt < cutoff:
                  print(slug)
          PY

          count=$(wc -l < stale_slugs.txt | tr -d ' ')
          echo "Found $count stale e2e org(s) older than ${MAX_AGE_MINUTES}m"
          if [ "$count" -gt 0 ]; then
            echo "First 20:"
            head -20 stale_slugs.txt | sed 's/^/ /'
          fi
          echo "count=$count" >> "$GITHUB_OUTPUT"

      - name: Safety gate
        if: steps.identify.outputs.count != '0'
        run: |
          count="${{ steps.identify.outputs.count }}"
          if [ "$count" -gt "$SAFETY_CAP" ]; then
            echo "::error::Refusing to delete $count orgs in one sweep (cap=$SAFETY_CAP). Investigate manually — this usually means the CP admin API returned no created_at or returned a degraded result. Re-run with workflow_dispatch + max_age_minutes if intentional."
            exit 1
          fi
          echo "Within safety cap ($count ≤ $SAFETY_CAP) ✓"

      - name: Delete stale orgs
        if: steps.identify.outputs.count != '0' && env.DRY_RUN != 'true'
        run: |
          set -uo pipefail
          deleted=0
          failed=0
          while IFS= read -r slug; do
            [ -z "$slug" ] && continue
            # The DELETE handler requires {"confirm": "<slug>"} matching
            # the URL slug — fat-finger guard. Idempotent: re-issuing
            # picks up via org_purges.last_step.
            # 000 = curl-level failure (timeout, connect error).
            http_code=$(curl -sS -o /tmp/del_resp -w "%{http_code}" \
              --max-time 60 \
              -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
              -H "Authorization: Bearer $ADMIN_TOKEN" \
              -H "Content-Type: application/json" \
              -d "{\"confirm\":\"$slug\"}") || http_code="000"
            if [ "$http_code" = "200" ] || [ "$http_code" = "204" ]; then
              deleted=$((deleted+1))
              echo "  deleted: $slug"
            else
              failed=$((failed+1))
              echo "  FAILED ($http_code): $slug — $(head -c 200 /tmp/del_resp 2>/dev/null)"
            fi
          done < stale_slugs.txt
          echo ""
          echo "Sweep summary: deleted=$deleted failed=$failed"
          # Don't fail the workflow on per-org delete errors — the
          # sweeper is best-effort. The next 15-min tick re-attempts. We
          # only fail loud at the safety-cap gate above.

      - name: Dry-run summary
        if: env.DRY_RUN == 'true'
        run: |
          echo "DRY RUN — would have deleted ${{ steps.identify.outputs.count }} org(s). Re-run with dry_run=false to actually delete."
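The identify step's filter is small enough to exercise offline. The sketch below mirrors its logic (same prefixes, same cutoff rule, same skip-on-missing-timestamp behavior) against hand-made rows — the slugs and timestamps here are made up for illustration:

```python
from datetime import datetime, timezone, timedelta

EPHEMERAL_PREFIXES = ("e2e-", "rt-e2e-")

def stale_slugs(orgs, max_age_minutes, now):
    """Mirror of the workflow's filter: ephemeral prefix + older than cutoff."""
    cutoff = now - timedelta(minutes=max_age_minutes)
    out = []
    for o in orgs:
        slug = o.get("slug", "")
        if not slug.startswith(EPHEMERAL_PREFIXES):
            continue
        created = o.get("created_at")
        if not created:
            continue  # no timestamp: leave it for a later sweep, don't guess
        if datetime.fromisoformat(created.replace("Z", "+00:00")) < cutoff:
            out.append(slug)
    return out

now = datetime(2026, 5, 3, 12, 0, tzinfo=timezone.utc)
orgs = [
    {"slug": "e2e-canary-abc", "created_at": "2026-05-03T11:00:00Z"},  # 60 min old
    {"slug": "e2e-canvas-xyz", "created_at": "2026-05-03T11:50:00Z"},  # 10 min old
    {"slug": "prod-tenant",    "created_at": "2026-05-03T09:00:00Z"},  # wrong prefix
    {"slug": "rt-e2e-fixture"},                                        # no created_at
]
print(stale_slugs(orgs, 30, now))  # -> ['e2e-canary-abc']
```

Only the 60-min-old ephemeral org crosses the 30-min cutoff; the fresh one, the non-ephemeral slug, and the row without a timestamp are all left alone.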