[bug] Orphan-workspace leak: auto_reap_at stamped but never enforced (detector is dry-run-only) #2642

Closed
opened 2026-06-12 11:24:40 +00:00 by claude-ceo-assistant · 1 comment
Owner

Orphan-workspace leak — root cause + fix (2026-06-12)

ROOT CAUSE: the CP stamps every workspace/tenant EC2 with an auto_reap_at kill-deadline tag, but NOTHING enforced it. aws-orphan-ec2-detector.sh is DRY-RUN/emit-only (its header literally says "never terminates" — it only emits a Loki metric that pings a human cleanup workflow nobody ran). The intended enforcer, staging-tenant-reap (operator-config#96), is not running on the operator (no cron/timer/script). Result: 31 of 32 running boxes were OVERDUE (oldest 8 days, ~$1.5k/mo bleed).

RESOLVED: terminated 25 confirmed-dead e2e/test orphans (each overdue and more than 6h past its deadline). Held 3 ambiguous boxes (tenant-cp455, 2x tenant-cncrg-dbg): past deadline but possibly intentional debug tenants, so they need owner confirmation before termination.

PREVENT (operator stopgap, AWS-only): deployed /usr/local/bin/auto-reap-enforcer.sh + cron molecule-auto-reap-enforcer (every 2h, --apply). Enforces auto_reap_at with 6h grace, scoped to ws-tenant-(e2e|gcp-test|hz) names, skips no_reap/keep=true, honors /etc/molecule-bootstrap/auto-reap-enforcer.disabled, logs to /var/log/auto-reap-enforcer.log.
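
The stopgap's per-instance decision can be sketched as below. This is a minimal illustration of the rules listed above, not the contents of auto-reap-enforcer.sh. Several details are assumptions: that `auto_reap_at` holds an ISO 8601 timestamp, that the name scope is a prefix match, that `no_reap` and `keep` both opt out when set to `true`, and that the `Instance` type and `should_reap` function (both hypothetical names) stand in for whatever the script actually does.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
import re

GRACE = timedelta(hours=6)                                  # 6h grace after auto_reap_at
NAME_SCOPE = re.compile(r"^ws-tenant-(e2e|gcp-test|hz)")    # assumed to be a prefix match
OPT_OUT_TAGS = ("no_reap", "keep")                          # assumed: value "true" opts out


@dataclass
class Instance:
    """Hypothetical stand-in for one EC2 instance: its Name plus its tags."""
    name: str
    tags: dict = field(default_factory=dict)


def should_reap(inst: Instance, now: datetime, disabled: bool = False) -> bool:
    """Return True only when every guard from the issue text passes."""
    if disabled:                          # kill-switch file is present
        return False
    if not NAME_SCOPE.search(inst.name):  # outside the e2e/gcp-test/hz scope
        return False
    if any(str(inst.tags.get(t, "")).lower() == "true" for t in OPT_OUT_TAGS):
        return False
    raw = inst.tags.get("auto_reap_at")
    if not raw:                           # no deadline stamped: leave it alone
        return False
    try:
        deadline = datetime.fromisoformat(raw)  # assumed ISO 8601 tag format
    except ValueError:
        return False                      # unparseable deadline: fail safe
    if deadline.tzinfo is None:
        deadline = deadline.replace(tzinfo=timezone.utc)
    return now > deadline + GRACE
```

In a real run, `disabled` would come from checking whether /etc/molecule-bootstrap/auto-reap-enforcer.disabled exists. Every ambiguous case (missing tag, bad timestamp, out-of-scope name) resolves to "do not terminate", which matches the stopgap's narrow scoping.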

PROPER DURABLE FIX (this issue tracks it): the enforcement belongs in the CONTROL PLANE (it knows live-vs-orphaned + covers Hetzner/GCP too, which the operator AWS stopgap does not). Needs: (1) a CP reconciler that terminates its own workspaces past auto_reap_at across all providers; (2) restore/replace operator-config#96; (3) an e2e regression gate (ties to #2615). The dry-run detector stays as the observability backstop.

Author
Owner

Durable cross-provider fix merged: molecule-controlplane#748 (AutoReapEnforcer across AWS/Hetzner/GCP, 6h grace, no_reap/keep opt-out, live-row cross-check, suspended-org guard restored, e2e-gated via internal/staginge2e + internal/sweep tests). Also root-caused why operator-config#96 silently died (cron-glob mismatch + binary never installed). Operator AWS-only stopgap remains as belt-and-suspenders. Closing.
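
For readers without access to molecule-controlplane#748, the shape of a cross-provider reconciler with a live-row cross-check might look roughly like this. It is a sketch under stated assumptions, not the merged AutoReapEnforcer: the `Machine`, `InMemoryProvider`, and `reconcile` names are hypothetical, the live-row check is assumed to mean "skip any machine the control plane still has a live row for", and the suspended-org guard is not modeled because this issue does not say what it does.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Set

GRACE = timedelta(hours=6)  # same 6h grace as the operator stopgap


@dataclass(frozen=True)
class Machine:
    """Hypothetical provider-neutral view of one workspace machine."""
    provider: str
    machine_id: str
    auto_reap_at: Optional[datetime]
    opted_out: bool = False          # stands in for the no_reap/keep opt-out


@dataclass
class InMemoryProvider:
    """Hypothetical fake for AWS/Hetzner/GCP; records terminations only."""
    name: str
    machines: List[Machine]
    terminated: List[str] = field(default_factory=list)

    def list_machines(self) -> Iterable[Machine]:
        return list(self.machines)

    def terminate(self, machine_id: str) -> None:
        self.terminated.append(machine_id)


def reconcile(providers, live_ids: Set[str], now: datetime,
              apply: bool = False) -> List[Machine]:
    """Return overdue orphans; terminate them only when apply=True."""
    overdue = []
    for provider in providers:
        for m in provider.list_machines():
            if m.opted_out or m.auto_reap_at is None:
                continue
            if now <= m.auto_reap_at + GRACE:
                continue                     # not yet past deadline + grace
            if m.machine_id in live_ids:
                continue                     # assumed live-row cross-check
            if apply:
                provider.terminate(m.machine_id)
            overdue.append(m)
    return overdue
```

Defaulting `apply` to `False` keeps a dry-run mode available alongside the enforcing path.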


Reference: molecule-ai/molecule-core#2642