molecule-core

Author	SHA1	Message	Date
Hongming Wang	41d5f9558f	ops: scripts/ops/check-prod-versions.sh — one-line "is each tenant on latest?" Iterates a list of tenant slugs (default canary set on production, operator-supplied on staging), curls each tenant's /buildinfo plus canvas's /api/buildinfo, compares to origin/main's HEAD SHA, prints a table with one of {current, stale, unreachable} per surface. Returns non-zero if any surface is stale, so it can be wired into a periodic alert later. Why this exists: every "is the fix live?" question used to be answered with a one-off curl + git rev-parse + manual diff. This script does that uniformly across every public surface (workspace tenants + canvas) and is parseable. The redeploy verifier (#2398) covers the deploy moment; this covers any-time-after. Reads EXPECTED_SHA from `gh api repos/Molecule-AI/molecule-core/commits/main` so it always reflects the actual upstream tip, not local working-copy state. Falls back to local origin/main with a WARN if `gh` isn't logged in — debugging is still useful even if the comparison may lag. Depends on: - #2409 (TenantGuard /buildinfo allowlist) — without it every tenant looks "unreachable" because the route 404s before the handler. Already merged on staging; will hit production after the next staging→main fast-forward + redeploy. - #2407 (canvas /api/buildinfo) — already on main + Vercel. Usage: ./scripts/ops/check-prod-versions.sh # production canary set TENANT_SLUGS="a b c" ./scripts/ops/check-prod-versions.sh # custom set ENV=staging TENANT_SLUGS="..." ./scripts/ops/check-prod-versions.sh Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 13:13:47 -07:00
Hongming Wang	b5df2126b9	fix(test): convert migration-collision tests from pytest to unittest (#2341) CI failure: the Ops scripts (unittest) job runs `python -m unittest discover` which doesn't have pytest installed. test_check_migration_collisions.py imported pytest unconditionally, failing module import: ImportError: Failed to import test module: test_check_migration_collisions Traceback (most recent call last): File ".../test_check_migration_collisions.py", line 12, in <module> import pytest ModuleNotFoundError: No module named 'pytest' The tests use no pytest-specific features (just bare assert + plain class). Sibling test_sweep_cf_decide.py in the same dir already uses unittest.TestCase. Convert this one to match: drop the pytest import, make TestMigrationFileRe inherit from unittest.TestCase. unittest.TestLoader.discover() requires TestCase subclasses for auto-discovery, so the fix is two lines (drop import, add base). Bare assert statements work fine inside TestCase methods. Verified: `python3 -m unittest scripts.ops.test_check_migration_collisions -v` runs all 9 tests, all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 01:47:27 -07:00
Hongming Wang	ea8ff626a9	ci: hard gate against migration version collisions (#2341) Two PRs targeting staging can each add a migration with the same numeric prefix (e.g. 044_.up.sql). Each passes CI independently. They collide at merge time. Worst case: second migration silently doesn't apply and prod schema drifts from what the code expects. Caught manually 2026-04-30 during PR #2276 rebase: 044_runtime_image_pins collided with 044_platform_inbound_secret from RFC #2312. This workflow makes that detection automatic at PR-open time. How it works: scripts/ops/check_migration_collisions.py runs on every PR that touches workspace-server/migrations/*. For each new/modified migration filename, extracts the numeric prefix and checks: 1. Does the base branch already have a DIFFERENT migration file with the same prefix? (PR branched off an old base, base advanced and another PR landed the same number — needs rebase.) 2. Is another OPEN PR (not this one) also adding a migration with the same prefix? (Race-window collision — both pass CI separately, would collide at merge time.) Either case → exit 1 with a clear ::error:: message naming the conflicting PR(s) so the author knows what to renumber. Implementation notes: - Uses git ls-tree (not working-tree walk) so it works against any base ref without checkout. - Uses gh pr diff --name-only per open PR, bounded by `gh pr list --limit 100`. ~30s worst case for a busy repo, <5s normally. - --diff-filter=AM picks up Added or Modified — renaming a migration in place is also flagged (intentional; renaming migrations isn't safe). - Same filename in both PR and base = no collision (PR is editing in-place, fine). Tests: scripts/ops/test_check_migration_collisions.py — 9 cases on the regex classifier (the load-bearing piece). End-to-end git/gh path is exercised by running the workflow against real PRs. Hard-gates Tier 1 item 1 (#2341). Cheapest, cleanest gate. Catches one specific class of merge-time foot-gun automatically. Refs hard-gates discussion 2026-04-30. Tier 1 of 4 (others tracked in #2342, #2343, #2344).	2026-04-29 21:42:42 -07:00
Hongming Wang	3a6d2f179d	feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate. CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans as a backstop) but does NOT delete the underlying Cloudflare Tunnel. Each E2E provision creates one Tunnel named `tenant-<slug>`; without cleanup these accumulate indefinitely on the account, consuming the tunnel quota and cluttering the dashboard. Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down state with zero replicas, weeks past their tenant's deletion. Same class of bug as the DNS-records leak that drove sweep-cf-orphans (controlplane#239). Parallel-shape to sweep-cf-orphans: - Same dry-run-by-default + --execute pattern - Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS sweep's 50% because tenant-shaped tunnels are orphans by design) - Same schedule/dispatch hardening (hard-fail on missing secrets when scheduled, soft-skip when dispatched) - Cron offset to :45 to avoid CF API bursts colliding with the DNS sweep at :15 Decision rules (in order): 1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep tunnels that might belong to platform infra). 2. Tunnel has active connections (status=healthy or non-empty connections array) → keep (defense-in-depth: don't kill a live tunnel even if CP forgot the org). 3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep. 4. Otherwise → delete (orphan). Verified by: - shell syntax check (bash -n) - YAML lint - Decide-logic offline smoke (7 cases, all pass) - End-to-end dry-run smoke with stubbed CP + CF APIs Required secrets (added to existing org-secrets): CF_API_TOKEN: must include account:cloudflare_tunnel:edit scope (separate from zone:dns:edit used by sweep-cf-orphans — same token if scope is broad, or a new token if narrowly scoped). CF_ACCOUNT_ID: account that owns the tunnels (visible in dash.cloudflare.com URL path). CP_PROD_ADMIN_TOKEN: reused from sweep-cf-orphans. CP_STAGING_ADMIN_TOKEN: reused from sweep-cf-orphans. Note: CP-side root cause (tenant-delete should cascade to tunnel delete) is in molecule-controlplane and worth fixing separately. This janitor is the operational backstop in the meantime — same pattern applied to DNS records when the same root cause was unaddressed.	2026-04-29 19:42:47 -07:00
Hongming Wang	026f5e51d9	ops: add Railway SHA-pin drift audit script + regression test (#2001) #2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86` (10 days stale) silently no-op'd four upstream fixes on 2026-04-24. This adds the audit pattern as a re-runnable script so the broader class is observable on demand without new CI infrastructure. Audit results today (2026-04-27): controlplane / production: 54 vars audited, 0 drift-prone pins controlplane / staging: 52 vars audited, 0 drift-prone pins So the immediate audit deliverable is clean — TENANT_IMAGE is the only known violation and #2000 already fixed it. The script makes the ongoing audit a 5-second command instead of a manual one. Detection regex catches: * branch-SHA suffixes (`staging|main|prod|production-<6+ hex>`) — the exact 2026-04-24 incident shape * version pins after `:` or `=` (`:v1.2.3`, `=v0.1.16`) — same drift class, just rendered differently Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api" out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and floating tags (`:staging-latest`, `:main`) pass through untouched. Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20 representative cases — 9 should-flag (covering all four branch prefixes + semver variants + middle-of-value matches) and 11 should-pass (the false-positive guards). Same regex inlined in both files so a future tweak that weakens detection fails the test in lockstep with weakening the audit. Both files shellcheck clean. CI gate (acceptance criterion's "regression: add a CI check") is deliberately scoped out — querying Railway from CI requires plumbing RAILWAY_TOKEN as a repo secret, which is multi-step setup. The re-runnable script + test cover the same surface today; the CI workflow is a small follow-up once the token is provisioned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 05:01:23 -07:00
rabbitblood	6494e9192b	refactor(ops): apply simplify findings on #2027 PR. Code-quality + efficiency review of PR #2079: - Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the caller (was rebuilt on every record — 1k records × ~50-slug union per call). decide() signature now (r, all_slugs, ec2_names). - Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) + hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change mirrored in the bash heredoc. - Drop decorative # Rule N: comments (numbering was out of order, 3 before 2 — actively confusing). - Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE block in the .sh file, eliminating the .replace() comment-skip hack in TestParityWithBashScript. - Drop per-line .strip() in _slice_canonical (would mask a real indentation bug; both blocks already at column 0). - subTest() in TestPlatformCore loops so a single failure no longer short-circuits the rest of the items. - merge_group + concurrency on test-ops-scripts.yml (parity with ci.yml gate behaviour). - Fix don't apostrophe in inline comment that closed the python heredoc's single-quote and broke bash -n. All 25 tests still pass. bash -n clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:28:15 -07:00
rabbitblood	ba78a5c00d	test(ops): unit tests for sweep-cf-orphans decide() (#2027) Closes #2027. The CF orphan sweep deletes DNS records — a misclassification could nuke a live workspace's tunnel. The decision function had MAX_DELETE_PCT percentage gating but no automated test of category → action mapping. Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers. The shell script keeps its inline heredoc (so the operational path is untouched) but bracketed by the same markers. A parity test (TestParityWithBashScript) reads both files and asserts the bracketed blocks match line-for-line — drift fails CI loudly. Coverage (25 tests, 1 file, stdlib unittest only): - Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api - Rule 3 ws-: live (matches EC2 prefix) on prod + staging; orphan on prod + staging - Rule 4 e2e-: live + orphan on staging; orphan on prod - Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety - Rule 5 fallthrough: external domain + unrelated apex - Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification - Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold - Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest discover` against scripts/ops/ on every PR/push that touches the directory. Lightweight — no requirements file, stdlib only. Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` → 25 tests, all OK. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 00:22:30 -07:00
Hongming Wang	817b8b0307	fix(scripts): make MAX_DELETE_PCT actually honor env override. The script's own help text documents `MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh` as the way to relax the safety gate, but the in-script assignment on line 35 was unconditional and overwrote any env value — so the override never worked. During today's staging tenant-provision recovery (CP #255 context), hit the 57%-delete threshold and needed the documented override to clear 64 orphan records. The one-char change to `${MAX_DELETE_PCT:-50}` honors the env while keeping the 50% default when no caller overrides. Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 14:14:55 -07:00
Hongming Wang	0576e341b9	ops(#1976): add smart-sweep script for orphan Cloudflare DNS records (#1978) Replaces the "panic-button at >65 records" manual sweep that nukes every pattern-match unconditionally (would delete live workspaces along with orphans). This version: - Queries CP prod + staging /admin/orgs for live tenant slugs - Queries AWS EC2 describe-instances for live workspace Name tags - Only deletes CF records whose slug/ws-id has no live counterpart - Dry-run by default (--execute to actually delete) - Safety gate refuses to delete >50% of records (configurable via MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every tenant looks orphan" failure mode before it nukes production - Per-category accounting: orphan-ws / orphan-e2e-tenant / etc. Usage: CF_API_TOKEN=... CF_ZONE_ID=... CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... bash scripts/ops/sweep-cf-orphans.sh # dry-run bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean their CF records — until that's fixed, this script is the maintenance path) Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>	2026-04-24 04:19:49 +00:00
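
The tunnel decision rules and the MAX_DELETE_PCT safety gate described in the sweep commits above can be sketched in Python. This is a minimal illustration under stated assumptions, not the repository's actual decide() code: the names `decide_tunnel`, `max_delete_pct`, and `gate_allows` are hypothetical, the tunnel dict shape (`name`, `status`, `connections`) is assumed, and allowing a sweep with zero total records is a guess at how the "zero-total no-divide" case behaves.

```python
import os
import re

# Assumed shape of a tenant tunnel name: "tenant-<slug>".
_TENANT_RE = re.compile(r"^tenant-(.+)$")


def decide_tunnel(tunnel, live_slugs):
    """Apply the four decision rules in order and return (action, reason)."""
    match = _TENANT_RE.match(tunnel.get("name", ""))
    if match is None:
        # Rule 1: unknown name shape, might be platform infra.
        return ("keep", "unknown")
    if tunnel.get("status") == "healthy" or tunnel.get("connections"):
        # Rule 2: never delete a tunnel with active connections.
        return ("keep", "active")
    if match.group(1) in live_slugs:
        # Rule 3: slug still exists on prod or staging.
        return ("keep", "live-tenant")
    # Rule 4: tenant-shaped, idle, and no live org.
    return ("delete", "orphan")


def max_delete_pct(env=None, default=50):
    """Mirror ${MAX_DELETE_PCT:-50}: unset or empty falls back to the default."""
    env = os.environ if env is None else env
    return int(env.get("MAX_DELETE_PCT") or default)


def gate_allows(delete_count, total, max_pct):
    """Refuse when more than max_pct percent of records would be deleted."""
    if total == 0:
        return True  # assumption: nothing to delete, and no division by zero
    # Integer comparison avoids floating-point rounding at the threshold.
    return delete_count * 100 <= max_pct * total


if __name__ == "__main__":
    idle = {"name": "tenant-e2e-canvas-1", "status": "down", "connections": []}
    print(decide_tunnel(idle, live_slugs={"acme"}))
    # The 64-of-111 case from the log: blocked at 50, allowed at 62.
    print(gate_allows(64, 111, 50), gate_allows(64, 111, 62))
```

Ordering the checks so that every "keep" rule runs before the single "delete" fallthrough means an unexpected input defaults to being kept, which matches the defense-in-depth intent stated in the log.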

9 Commits
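
The first check from the migration-collision gate (a base-branch migration that shares the PR's numeric prefix but is a different migration) can be sketched as follows. This is an illustrative sketch, not the contents of check_migration_collisions.py: `MIGRATION_RE`, `parse_migration`, and `find_base_collisions` are hypothetical names, the `.up.sql`/`.down.sql` filename pattern is assumed, and comparing by migration name (so an up/down pair counts as one migration) is an assumption layered on the log's "same filename = no collision" rule.

```python
import re

# Assumed filename shape: <digits>_<name>.up.sql or <digits>_<name>.down.sql
MIGRATION_RE = re.compile(r"^(\d+)_(.*?)\.(?:up|down)\.sql$")


def parse_migration(path):
    """Return (prefix, name) for a migration path, or None if it does not match."""
    match = MIGRATION_RE.match(path.rsplit("/", 1)[-1])
    return (match.group(1), match.group(2)) if match else None


def find_base_collisions(pr_files, base_files):
    """Map each PR migration to the differently named base migrations sharing its prefix."""
    names_by_prefix = {}
    for path in base_files:
        parsed = parse_migration(path)
        if parsed:
            names_by_prefix.setdefault(parsed[0], set()).add(parsed[1])

    collisions = {}
    for path in pr_files:
        parsed = parse_migration(path)
        if parsed is None:
            continue
        prefix, name = parsed
        # Editing a migration that already exists in base is not a collision.
        others = names_by_prefix.get(prefix, set()) - {name}
        if others:
            collisions[path] = sorted(others)
    return collisions


if __name__ == "__main__":
    base = ["workspace-server/migrations/044_platform_inbound_secret.up.sql"]
    pr = ["workspace-server/migrations/044_runtime_image_pins.up.sql"]
    print(find_base_collisions(pr, base))
```

In a real gate the two file lists would come from the sources named in the log (`git ls-tree` for the base ref and the PR's changed files); they are passed in as plain lists here so the classifier can be tested offline.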