Code-quality + efficiency review of PR #2079:
- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
caller (was rebuilt on every record — 1k records × ~50-slug union per
call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
block in the .sh file, eliminating the .replace() comment-skip hack
in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
ci.yml gate behaviour).
- Fix don't apostrophe in inline comment that closed the python
heredoc's single-quote and broke bash -n.
All 25 tests still pass. bash -n clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#2027.
The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.
Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.
Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense
CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.
Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script's own help text documents \`MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh\`
as the way to relax the safety gate, but the in-script assignment on line 35
was unconditional and overwrote any env value — so the override never worked.
During today's staging tenant-provision recovery (CP #255 context), hit the
57%-delete threshold and needed the documented override to clear 64 orphan
records. The one-char change to \`\${MAX_DELETE_PCT:-50}\` honors the env
while keeping the 50% default when no caller overrides.
Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).
This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
tenant looks orphan" failure mode before it nukes production
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.
Usage:
CF_API_TOKEN=... CF_ZONE_ID=... \
CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
bash scripts/ops/sweep-cf-orphans.sh # dry-run
bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete
Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>