molecule-core

Author	SHA1	Message	Date
Hongming Wang	026f5e51d9	ops: add Railway SHA-pin drift audit script + regression test (#2001 ) #2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86` (10 days stale) silently no-op'd four upstream fixes on 2026-04-24. This adds the audit pattern as a re-runnable script so the broader class is observable on demand without new CI infrastructure. Audit results today (2026-04-27): controlplane / production: 54 vars audited, 0 drift-prone pins controlplane / staging: 52 vars audited, 0 drift-prone pins So the immediate audit deliverable is clean — TENANT_IMAGE is the only known violation and #2000 already fixed it. The script makes the ongoing audit a 5-second command instead of a manual one. Detection regex catches: * branch-SHA suffixes (`staging\|main\|prod\|production-<6+ hex>`) — the exact 2026-04-24 incident shape * version pins after `:` or `=` (`:v1.2.3`, `=v0.1.16`) — same drift class, just rendered differently Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api" out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and floating tags (`:staging-latest`, `:main`) pass through untouched. Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20 representative cases — 9 should-flag (covering all four branch prefixes + semver variants + middle-of-value matches) and 11 should-pass (the false-positive guards). Same regex inlined in both files so a future tweak that weakens detection fails the test in lockstep with weakening the audit. Both files shellcheck clean. CI gate (acceptance criterion's "regression: add a CI check") is deliberately scoped out — querying Railway from CI requires plumbing RAILWAY_TOKEN as a repo secret, which is multi-step setup. The re-runnable script + test cover the same surface today; the CI workflow is a small follow-up once the token is provisioned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 05:01:23 -07:00
rabbitblood	6494e9192b	refactor(ops): apply simplify findings on #2027 PR Code-quality + efficiency review of PR #2079: - Hoist all_slugs = prod_slugs \| staging_slugs out of decide() into the caller (was rebuilt on every record — 1k records × ~50-slug union per call). decide() signature now (r, all_slugs, ec2_names). - Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) + hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change mirrored in the bash heredoc. - Drop decorative # Rule N: comments (numbering was out of order, 3 before 2 — actively confusing). - Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE block in the .sh file, eliminating the .replace() comment-skip hack in TestParityWithBashScript. - Drop per-line .strip() in _slice_canonical (would mask a real indentation bug; both blocks already at column 0). - subTest() in TestPlatformCore loops so a single failure no longer short-circuits the rest of the items. - merge_group + concurrency on test-ops-scripts.yml (parity with ci.yml gate behaviour). - Fix don't apostrophe in inline comment that closed the python heredoc's single-quote and broke bash -n. All 25 tests still pass. bash -n clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:28:15 -07:00
rabbitblood	ba78a5c00d	test(ops): unit tests for sweep-cf-orphans decide() (#2027 ) Closes #2027. The CF orphan sweep deletes DNS records — a misclassification could nuke a live workspace's tunnel. The decision function had MAX_DELETE_PCT percentage gating but no automated test of category → action mapping. Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers. The shell script keeps its inline heredoc (so the operational path is untouched) but bracketed by the same markers. A parity test (TestParityWithBashScript) reads both files and asserts the bracketed blocks match line-for-line — drift fails CI loudly. Coverage (25 tests, 1 file, stdlib unittest only): - Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api - Rule 3 ws-: live (matches EC2 prefix) on prod + staging; orphan on prod + staging - Rule 4 e2e-: live + orphan on staging; orphan on prod - Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety - Rule 5 fallthrough: external domain + unrelated apex - Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification - Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold - Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest discover` against scripts/ops/ on every PR/push that touches the directory. Lightweight — no requirements file, stdlib only. Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` → 25 tests, all OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:22:30 -07:00
Hongming Wang	817b8b0307	fix(scripts): make MAX_DELETE_PCT actually honor env override The script's own help text documents \`MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh\` as the way to relax the safety gate, but the in-script assignment on line 35 was unconditional and overwrote any env value — so the override never worked. During today's staging tenant-provision recovery (CP #255 context), hit the 57%-delete threshold and needed the documented override to clear 64 orphan records. The one-char change to \`\${MAX_DELETE_PCT:-50}\` honors the env while keeping the 50% default when no caller overrides. Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:14:55 -07:00
Hongming Wang	0576e341b9	ops(#1976 ): add smart-sweep script for orphan Cloudflare DNS records (#1978 ) Replaces the "panic-button at >65 records" manual sweep that nukes every pattern-match unconditionally (would delete live workspaces along with orphans). This version: - Queries CP prod + staging /admin/orgs for live tenant slugs - Queries AWS EC2 describe-instances for live workspace Name tags - Only deletes CF records whose slug/ws-id has no live counterpart - Dry-run by default (--execute to actually delete) - Safety gate refuses to delete >50% of records (configurable via MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every tenant looks orphan" failure mode before it nukes production - Per-category accounting: orphan-ws / orphan-e2e-tenant / etc. Usage: CF_API_TOKEN=... CF_ZONE_ID=... \ CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \ bash scripts/ops/sweep-cf-orphans.sh # dry-run bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean their CF records — until that's fixed, this script is the maintenance path) Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>	2026-04-24 04:19:49 +00:00

5 Commits