Iterates a list of tenant slugs (default canary set on production,
operator-supplied on staging), curls each tenant's /buildinfo plus
canvas's /api/buildinfo, compares to origin/main's HEAD SHA, prints a
table with one of {current, stale, unreachable} per surface. Returns
non-zero if any surface is stale, so it can be wired into a periodic
alert later.
Why this exists: every "is the fix live?" question used to be
answered with a one-off curl + git rev-parse + manual diff. This
script does that uniformly across every public surface (workspace
tenants + canvas) and is parseable. The redeploy verifier (#2398)
covers the deploy moment; this covers any-time-after.
Reads EXPECTED_SHA from `gh api repos/Molecule-AI/molecule-core/
commits/main` so it always reflects the actual upstream tip, not
local working-copy state. Falls back to local origin/main with a
WARN if `gh` isn't logged in — debugging is still useful even if
the comparison may lag.
Depends on:
- #2409 (TenantGuard /buildinfo allowlist) — without it every
tenant looks "unreachable" because the route 404s before the
handler. Already merged on staging; will hit production after
the next staging→main fast-forward + redeploy.
- #2407 (canvas /api/buildinfo) — already on main + Vercel.
Usage:
./scripts/ops/check-prod-versions.sh # production canary set
TENANT_SLUGS="a b c" ./scripts/ops/check-prod-versions.sh # custom set
ENV=staging TENANT_SLUGS="..." ./scripts/ops/check-prod-versions.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failure: the Ops scripts (unittest) job runs `python -m unittest
discover` which doesn't have pytest installed. test_check_migration_
collisions.py imported pytest unconditionally, failing module import:
ImportError: Failed to import test module: test_check_migration_collisions
Traceback (most recent call last):
File ".../test_check_migration_collisions.py", line 12, in <module>
import pytest
ModuleNotFoundError: No module named 'pytest'
The tests use no pytest-specific features (just bare assert + plain
class). Sibling test_sweep_cf_decide.py in the same dir already uses
unittest.TestCase. Convert this one to match: drop the pytest import,
make TestMigrationFileRe inherit from unittest.TestCase.
unittest.TestLoader.discover() requires TestCase subclasses for
auto-discovery, so the fix is two lines (drop import, add base).
Bare assert statements work fine inside TestCase methods.
Verified: `python3 -m unittest scripts.ops.test_check_migration_collisions -v`
runs all 9 tests, all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two PRs targeting staging can each add a migration with the same
numeric prefix (e.g. 044_*.up.sql). Each passes CI independently.
They collide at merge time. Worst case: second migration silently
doesn't apply and prod schema drifts from what the code expects.
Caught manually 2026-04-30 during PR #2276 rebase: 044_runtime_image_pins
collided with 044_platform_inbound_secret from RFC #2312. This workflow
makes that detection automatic at PR-open time.
How it works:
scripts/ops/check_migration_collisions.py runs on every PR that
touches workspace-server/migrations/**. For each new/modified
migration filename, extracts the numeric prefix and checks:
1. Does the base branch already have a DIFFERENT migration file with
the same prefix? (PR branched off an old base, base advanced and
another PR landed the same number — needs rebase.)
2. Is another OPEN PR (not this one) also adding a migration with
the same prefix? (Race-window collision — both pass CI separately,
would collide at merge time.)
Either case → exit 1 with a clear ::error:: message naming the
conflicting PR(s) so the author knows what to renumber.
Implementation notes:
- Uses git ls-tree (not working-tree walk) so it works against any
base ref without checkout.
- Uses gh pr diff --name-only per open PR, bounded by `gh pr list
--limit 100`. ~30s worst case for a busy repo, <5s normally.
- --diff-filter=AM picks up Added or Modified — renaming a migration
in place is also flagged (intentional; renaming migrations isn't
safe).
- Same filename in both PR and base = no collision (PR is editing
in-place, fine).
Tests:
scripts/ops/test_check_migration_collisions.py — 9 cases on the
regex classifier (the load-bearing piece). End-to-end git/gh path
is exercised by running the workflow against real PRs.
Hard-gates Tier 1 item 1 (#2341). Cheapest, cleanest gate. Catches
one specific class of merge-time foot-gun automatically.
Refs hard-gates discussion 2026-04-30. Tier 1 of 4 (others tracked
in #2342, #2343, #2344).
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to existing org-secrets):
CF_API_TOKEN must include account:cloudflare_tunnel:edit
scope (separate from zone:dns:edit used by
sweep-cf-orphans — same token if scope is
broad, or a new token if narrowly scoped).
CF_ACCOUNT_ID account that owns the tunnels (visible in
dash.cloudflare.com URL path).
CP_PROD_ADMIN_TOKEN reused from sweep-cf-orphans.
CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.
Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — same pattern
applied to DNS records when the same root cause was unaddressed.
#2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86`
(10 days stale) silently no-op'd four upstream fixes on 2026-04-24.
This adds the audit pattern as a re-runnable script so the broader
class is observable on demand without new CI infrastructure.
Audit results today (2026-04-27):
controlplane / production: 54 vars audited, 0 drift-prone pins
controlplane / staging: 52 vars audited, 0 drift-prone pins
So the immediate audit deliverable is clean — TENANT_IMAGE is the only
known violation and #2000 already fixed it. The script makes the
ongoing audit a 5-second command instead of a manual one.
Detection regex catches:
* branch-SHA suffixes (`staging|main|prod|production-<6+ hex>`)
— the exact 2026-04-24 incident shape
* version pins after `:` or `=` (`:v1.2.3`, `=v0.1.16`)
— same drift class, just rendered differently
Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api"
out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and
floating tags (`:staging-latest`, `:main`) pass through untouched.
Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20
representative cases — 9 should-flag (covering all four branch
prefixes + semver variants + middle-of-value matches) and 11
should-pass (the false-positive guards). Same regex inlined in both
files so a future tweak that weakens detection fails the test in
lockstep with weakening the audit.
Both files shellcheck clean.
CI gate (acceptance criterion's "regression: add a CI check") is
deliberately scoped out — querying Railway from CI requires plumbing
RAILWAY_TOKEN as a repo secret, which is multi-step setup. The
re-runnable script + test cover the same surface today; the CI
workflow is a small follow-up once the token is provisioned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-quality + efficiency review of PR #2079:
- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into the
caller (was rebuilt on every record — 1k records × ~50-slug union per
call). decide() signature now (r, all_slugs, ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
hoist platform-core literal set (_PLATFORM_CORE_NAMES). Same change
mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (numbering was out of order, 3
before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
block in the .sh file, eliminating the .replace() comment-skip hack
in TestParityWithBashScript.
- Drop per-line .strip() in _slice_canonical (would mask a real
indentation bug; both blocks already at column 0).
- subTest() in TestPlatformCore loops so a single failure no longer
short-circuits the rest of the items.
- merge_group + concurrency on test-ops-scripts.yml (parity with
ci.yml gate behaviour).
- Fix don't apostrophe in inline comment that closed the python
heredoc's single-quote and broke bash -n.
All 25 tests still pass. bash -n clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes#2027.
The CF orphan sweep deletes DNS records — a misclassification could nuke
a live workspace's tunnel. The decision function had MAX_DELETE_PCT
percentage gating but no automated test of category → action mapping.
Approach: extract the decide() function to scripts/ops/sweep_cf_decide.py
as a verbatim copy bracketed by `# CANONICAL DECIDE BEGIN/END` markers.
The shell script keeps its inline heredoc (so the operational path is
untouched) but bracketed by the same markers. A parity test
(TestParityWithBashScript) reads both files and asserts the bracketed
blocks match line-for-line — drift fails CI loudly.
Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey, www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant); _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies as orphan, gate is the defense
CI: new .github/workflows/test-ops-scripts.yml runs `python -m unittest
discover` against scripts/ops/ on every PR/push that touches the
directory. Lightweight — no requirements file, stdlib only.
Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v` →
25 tests, all OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The script's own help text documents \`MAX_DELETE_PCT=62 ./sweep-cf-orphans.sh\`
as the way to relax the safety gate, but the in-script assignment on line 35
was unconditional and overwrote any env value — so the override never worked.
During today's staging tenant-provision recovery (CP #255 context), hit the
57%-delete threshold and needed the documented override to clear 64 orphan
records. The one-char change to \`\${MAX_DELETE_PCT:-50}\` honors the env
while keeping the 50% default when no caller overrides.
Ran with MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).
This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
tenant looks orphan" failure mode before it nukes production
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.
Usage:
CF_API_TOKEN=... CF_ZONE_ID=... \
CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
bash scripts/ops/sweep-cf-orphans.sh # dry-run
bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete
Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>