forked from molecule-ai/molecule-core
c5669aa304
13 Commits
e075557b19
fix(ci): replace gh pr CLI with Gitea v1 REST in workflows + scripts (#75 class A)
Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea (which exposes /api/v1 only — no GraphQL → 405,
no /api/v3 → 404). Class A covers `gh pr list / view / diff / comment`
shapes.
Affected:
- `.github/workflows/auto-tag-runtime.yml`
Replaced `gh pr list --search SHA --json number,labels` with a curl
to `/api/v1/repos/.../pulls?state=closed&sort=newest&limit=50` +
a jq filter on `merge_commit_sha == github.sha` (a sketch of this
shape appears below, after the list). Same end-to-end behaviour:
locate the merged PR for this push, read its labels, pick the bump
kind. A defensive `?.name // empty` jq guard handles unlabelled PRs
without erroring. The 50-PR window is comfortably larger than the
number of staging→main promotions that close within any reasonable
detection window.
- `scripts/check-stale-promote-pr.sh`
Rewrote `fetch_prs` and `post_comment` to call Gitea's REST API
directly. Gitea doesn't expose GitHub's compound `mergeStateStatus`
/ `reviewDecision` fields, so the new fetcher pulls
`/pulls?state=open&base=main` then for each PR pulls
`/pulls/{n}/reviews` and synthesizes the GitHub-shape JSON the rest
of the script (and the existing fixture-based unit tests) consume:
BLOCKED + REVIEW_REQUIRED ↔ mergeable=true AND 0 APPROVED reviews
DIRTY ↔ mergeable=false (alarm doesn't fire)
CLEAN + APPROVED ↔ mergeable=true AND ≥1 APPROVED review
Comment-posting moves to `POST /repos/.../issues/{n}/comments`
(Gitea treats PRs as issues for the comment surface, same as
GitHub's REST). All 23 fixture-driven unit tests still pass —
fixtures pass GitHub-shape JSON via PR_FIXTURE which short-circuits
the live fetch path.
- `scripts/ops/check_migration_collisions.py`
Replaced `gh pr list` + `gh pr diff` calls with stdlib `urllib`
against /api/v1. Helper `_gitea_get` centralizes auth + error
handling; uses GITEA_TOKEN env, falling back to GITHUB_TOKEN
(act_runner) and GH_TOKEN. Return shape from
`open_prs_with_migration_prefix` mimics the historical
`--json number,headRefName` so the call sites are unchanged. All 9
regex-classifier unit tests still pass; live integration test
against the production Gitea API returns 0 collisions for prefix=999
as expected.
curl invocation pattern is `curl --fail-with-body -sS` (NOT `-fsS` —
the two short-fail flags are mutually exclusive in modern curl;
caught by `curl: You must select either --fail or --fail-with-body,
not both` during local verification).
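A minimal sketch of the Class-A replacement shape, as used by the
auto-tag lookup (`GITEA_URL`/`OWNER`/`REPO` are illustrative
stand-ins for the values the workflows inline):

```bash
# Find the merged PR for this push, then read its labels.
pr_number="$(
  curl --fail-with-body -sS \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    "${GITEA_URL}/api/v1/repos/${OWNER}/${REPO}/pulls?state=closed&sort=newest&limit=50" |
  jq -r --arg sha "${GITHUB_SHA}" \
    'map(select(.merge_commit_sha == $sha)) | first | .number // empty'
)"

[ -n "$pr_number" ] &&
  curl --fail-with-body -sS \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    "${GITEA_URL}/api/v1/repos/${OWNER}/${REPO}/pulls/${pr_number}" |
  jq -r '.labels[]?.name // empty'
```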
Token model: workflows pass act_runner's GITHUB_TOKEN (per-run, repo
read scope) — same surface used by the auto-sync fix in PR #66 plus
the surrounding workflows. No new repo secrets required.
Verification: bash unit tests (23/23 pass), python unittest (9/9 pass),
live curl call against production Gitea returns 200 with the expected
shape, YAML / shell / Python syntax all validate.
Closes part of #75. Other classes (D — `gh api`; F — `gh run list`)
land in follow-up PRs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6f8f7932d2
feat(ops): add sweep-aws-secrets janitor — orphan tenant bootstrap secrets
CP's deprovision flow calls Secrets.DeleteSecret() (provisioner/ec2.go:806)
but only when the deprovision runs to completion. Crashed provisions and
incomplete teardowns leak the per-tenant `molecule/tenant/<org_id>/bootstrap`
secret. At ~$0.40/secret/month, ~45 leaked secrets surfaced as ~$19/month
on the AWS cost dashboard.
The tenant_resources audit table (mig 024) tracks four kinds today —
CloudflareTunnel, CloudflareDNS, EC2Instance, SecurityGroup — and the
existing reconciler doesn't catch Secrets Manager orphans. The proper fix
(KindSecretsManagerSecret + recorder hook + reconciler enumerator) is filed
as a follow-up controlplane issue. This sweeper is the immediate stopgap.
Parallel-shape to sweep-cf-tunnels.sh:
- Hourly schedule offset (:30, between sweep-cf-orphans :15 and
sweep-cf-tunnels :45) so the three janitors don't burst CP admin
at the same minute.
- 24h grace window — never deletes a secret younger than the
provisioning roundtrip, so an in-flight provision can't be racemurdered.
- MAX_DELETE_PCT=50 default (mirrors sweep-cf-orphans for durable
resources; tenant secrets should track 1:1 with live tenants).
- Same schedule-vs-dispatch hardening as the other janitors:
schedule → hard-fail on missing secrets, dispatch → soft-skip.
- 8-way xargs parallelism, dry-run by default, --execute to delete.
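A minimal sketch of the grace-window + dry-run shape described above
(the plan-file format and variable names are illustrative, not lifted
from the script):

```bash
GRACE_SECONDS=$(( 24 * 3600 ))   # 24h grace window
now="$(date +%s)"
MODE=dry-run
[ "${1:-}" = "--execute" ] && MODE=execute

# orphan_secrets.txt: "<secret-name> <created-epoch>" per line (assumed shape)
while read -r name created_ts; do
  if [ $(( now - created_ts )) -lt "$GRACE_SECONDS" ]; then
    echo "skip (inside grace window): $name"
    continue
  fi
  if [ "$MODE" = execute ]; then
    # default 30-day recovery window; whether the real script forces
    # immediate deletion is not confirmed by the commit
    aws secretsmanager delete-secret --secret-id "$name"
  else
    echo "would delete: $name"
  fi
done < orphan_secrets.txt
```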
Requires a dedicated AWS_JANITOR_* IAM principal — the prod molecule-cp
principal lacks secretsmanager:ListSecrets (it only has scoped
Get/Create/Update/Delete). The workflow's verify-secrets step will hard-fail
on the first scheduled run until those secrets are configured, surfacing
the missing setup loudly rather than silently no-op'ing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8bf29b7d0e
fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.
Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"` (sketched
after the notes below). With 8x parallelism the same 600+ list drains
in ~60s. Notes:
- We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
portable to BSD xargs (matters for local-runner testing on macOS).
- We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
by default; tab-separating id+name on argv risks mangling. The
name is kept in a side-channel id->name map ($NAME_MAP) and looked
up by the worker only on failure, for FAIL_LOG readability.
- Workers print exactly `OK` or `FAIL` on stdout; tally with
`grep -c '^OK$' / '^FAIL$'`.
- On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
"Failure detail (first 20):" — same diagnostic surface as before
but consolidated so we don't spam logs on a flaky CF API.
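A sketch of the fan-out shape (the worker body is trimmed; the real
script also consults $NAME_MAP on failure, and the Cloudflare tunnel
delete URL is assumed from CF's cfd_tunnel API):

```bash
export CF_API_TOKEN CF_ACCOUNT_ID
xargs -P "${SWEEP_CONCURRENCY:-8}" -L 1 -I {} bash -c '
  id="$1"
  if curl --fail-with-body -sS -X DELETE \
       -H "Authorization: Bearer ${CF_API_TOKEN}" \
       "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/cfd_tunnel/${id}" \
       >/dev/null 2>&1; then
    echo OK       # exactly OK/FAIL on stdout, per the tally contract
  else
    echo FAIL
  fi
' _ {} < "$DELETE_PLAN" > "$RESULT_LOG"

DELETED="$(grep -c '^OK$' "$RESULT_LOG" || true)"
FAILED="$(grep -c '^FAIL$' "$RESULT_LOG" || true)"
```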
Bug 2: the workflow's 5-min cap was meant as a hang detector but
turned out to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).
Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
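The consolidated trap, per the description above (tempfile names are
the ones the commit lists):

```bash
cleanup() {
  rm -rf "$PAGES_DIR" "$DELETE_PLAN" "$NAME_MAP" "$FAIL_LOG" "$RESULT_LOG"
}
trap cleanup EXIT   # single registration; nothing can silently clobber it
```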
Verification:
- bash -n scripts/ops/sweep-cf-tunnels.sh: clean
- shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
- python3 yaml.safe_load on the workflow: clean
- Synthetic 30-line delete plan with every 7th id sentinel'd to
return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
side-channel name lookup verified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a117a60eed
fix(sweep-cf-tunnels): buffer pages to disk to avoid argv ARG_MAX
The page-merge loop passed the entire accumulating tunnel JSON to
python3 -c via argv on every iteration. On a busy account (verified
2026-05-02: 672 tunnels, 14 pages on the Hongmingwangrabbit account)
this exceeds the GH Ubuntu runner's combined argv+envp limit (~128 KB)
and dies with `python3: Argument list too long` at exit 126 — the
workflow had been silently failing this way since the very first run
that hit a real account, masked earlier by a missing-CF_ACCOUNT_ID
secret check.

Fix: buffer each page response to a file under a temp dir, merge from
disk at the end. Also bumps the page cap from 20 to 40 (1000 → 2000
tunnel ceiling) so the existing soft-cap warning has headroom; the
disk-merge shape is O(n) in tunnel count rather than the previous
O(n^2), so the larger ceiling is cheap.

Verified locally against the live account (672 tunnels): the script
now runs cleanly to the existing MAX_DELETE_PCT safety gate, which
trips at 99% > 90% as designed and surfaces the actual orphan backlog
for operator-driven cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
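A minimal sketch of the disk-buffered pagination, assuming a jq merge
step (the real script merges with python3; file names are
illustrative):

```bash
PAGES_DIR="$(mktemp -d)"
page=1
while [ "$page" -le 40 ]; do    # raised page cap
  curl --fail-with-body -sS \
    -H "Authorization: Bearer ${CF_API_TOKEN}" \
    "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/cfd_tunnel?page=${page}&per_page=50" \
    > "${PAGES_DIR}/page-${page}.json"
  # stop when a page comes back empty
  [ "$(jq '.result | length' "${PAGES_DIR}/page-${page}.json")" -eq 0 ] && break
  page=$(( page + 1 ))
done

# One merge from disk at the end — O(n), and nothing rides on argv.
jq -s '[ .[].result[] ]' "${PAGES_DIR}"/page-*.json > tunnels.json
```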
41d5f9558f
ops: scripts/ops/check-prod-versions.sh — one-line "is each tenant on latest?"
Iterates a list of tenant slugs (default canary set on production,
operator-supplied on staging), curls each tenant's /buildinfo plus
canvas's /api/buildinfo, compares to origin/main's HEAD SHA, prints a
table with one of {current, stale, unreachable} per surface. Returns
non-zero if any surface is stale, so it can be wired into a periodic
alert later.
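A hedged sketch of the per-tenant half of that loop (the /buildinfo
response field and tenant URL shape are assumptions, not confirmed by
the script):

```bash
# fallback shape; the script prefers `gh api .../commits/main` (see below)
EXPECTED_SHA="$(git rev-parse origin/main)"
rc=0
for slug in ${TENANT_SLUGS}; do
  got="$(curl -fsS --max-time 10 "https://${slug}.moleculesai.app/buildinfo" \
         | jq -r '.sha // empty')" || got=""
  if [ -z "$got" ]; then
    state=unreachable
  elif [ "$got" = "$EXPECTED_SHA" ]; then
    state=current
  else
    state=stale; rc=1          # only staleness flips the exit code
  fi
  printf '%-28s %s\n' "$slug" "$state"
done
exit "$rc"
```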
Why this exists: every "is the fix live?" question used to be
answered with a one-off curl + git rev-parse + manual diff. This
script does that uniformly across every public surface (workspace
tenants + canvas) and is parseable. The redeploy verifier (#2398)
covers the deploy moment; this covers any-time-after.
Reads EXPECTED_SHA from `gh api repos/Molecule-AI/molecule-core/
commits/main` so it always reflects the actual upstream tip, not
local working-copy state. Falls back to local origin/main with a
WARN if `gh` isn't logged in — debugging is still useful even if
the comparison may lag.
Depends on:
- #2409 (TenantGuard /buildinfo allowlist) — without it every
tenant looks "unreachable" because the route 404s before the
handler. Already merged on staging; will hit production after
the next staging→main fast-forward + redeploy.
- #2407 (canvas /api/buildinfo) — already on main + Vercel.
Usage:
./scripts/ops/check-prod-versions.sh # production canary set
TENANT_SLUGS="a b c" ./scripts/ops/check-prod-versions.sh # custom set
ENV=staging TENANT_SLUGS="..." ./scripts/ops/check-prod-versions.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b5df2126b9
fix(test): convert migration-collision tests from pytest to unittest (#2341)
CI failure: the Ops scripts (unittest) job runs `python -m unittest
discover` in an environment without pytest installed, and
test_check_migration_collisions.py imported pytest unconditionally,
failing module import:
ImportError: Failed to import test module: test_check_migration_collisions
Traceback (most recent call last):
File ".../test_check_migration_collisions.py", line 12, in <module>
import pytest
ModuleNotFoundError: No module named 'pytest'
The tests use no pytest-specific features (just bare assert + plain
class). Sibling test_sweep_cf_decide.py in the same dir already uses
unittest.TestCase. Convert this one to match: drop the pytest import,
make TestMigrationFileRe inherit from unittest.TestCase.
unittest.TestLoader.discover() requires TestCase subclasses for
auto-discovery, so the fix is two lines (drop import, add base).
Bare assert statements work fine inside TestCase methods.
Verified: `python3 -m unittest scripts.ops.test_check_migration_collisions -v`
runs all 9 tests, all pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ea8ff626a9
ci: hard gate against migration version collisions (#2341)
Two PRs targeting staging can each add a migration with the same
numeric prefix (e.g. 044_*.up.sql). Each passes CI independently. They
collide at merge time. Worst case: the second migration silently
doesn't apply and the prod schema drifts from what the code expects.
Caught manually 2026-04-30 during the PR #2276 rebase:
044_runtime_image_pins collided with 044_platform_inbound_secret from
RFC #2312. This workflow makes that detection automatic at PR-open
time.

How it works: scripts/ops/check_migration_collisions.py runs on every
PR that touches workspace-server/migrations/**. For each new/modified
migration filename, it extracts the numeric prefix and checks:
1. Does the base branch already have a DIFFERENT migration file with
   the same prefix? (PR branched off an old base, the base advanced,
   and another PR landed the same number — needs rebase.)
2. Is another OPEN PR (not this one) also adding a migration with the
   same prefix? (Race-window collision — both pass CI separately,
   would collide at merge time.)
Either case → exit 1 with a clear ::error:: message naming the
conflicting PR(s) so the author knows what to renumber.

Implementation notes:
- Uses git ls-tree (not a working-tree walk) so it works against any
  base ref without a checkout.
- Uses gh pr diff --name-only per open PR, bounded by `gh pr list
  --limit 100`. ~30s worst case for a busy repo, <5s normally.
- --diff-filter=AM picks up Added or Modified — renaming a migration
  in place is also flagged (intentional; renaming migrations isn't
  safe).
- Same filename in both PR and base = no collision (the PR is editing
  it in place, which is fine).

Tests: scripts/ops/test_check_migration_collisions.py — 9 cases on the
regex classifier (the load-bearing piece). The end-to-end git/gh path
is exercised by running the workflow against real PRs.

Hard-gates Tier 1 item 1 (#2341). Cheapest, cleanest gate. Catches one
specific class of merge-time foot-gun automatically. Refs hard-gates
discussion 2026-04-30. Tier 1 of 4 (others tracked in #2342, #2343,
#2344).
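The gate itself is Python; a shell sketch of the base-branch half of
the check (the prefix regex and the base ref are illustrative):

```bash
base=origin/staging
while read -r f; do
  # numeric prefix of the migration filename, e.g. 044
  prefix="$(basename "$f" | grep -oE '^[0-9]+')" || continue
  # any DIFFERENT file on the base with the same prefix?
  conflict="$(git ls-tree -r --name-only "$base" -- workspace-server/migrations/ \
              | grep -E "(^|/)${prefix}_" | grep -vxF "$f" || true)"
  if [ -n "$conflict" ]; then
    echo "::error::migration prefix ${prefix} collides on ${base}: ${conflict}"
    exit 1
  fi
done < <(git diff --name-only --diff-filter=AM "$base"...HEAD -- workspace-server/migrations/)
```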
3a6d2f179d
feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
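A hedged sketch of those rules as a function (names and the live-slug
lookup are illustrative; the real rules live in the script):

```bash
decide() {
  local name="$1" status="$2" conn_count="$3"
  case "$name" in
    tenant-*) ;;                 # tenant-shaped: fall through to rules 2-4
    *) echo keep; return ;;      # rule 1: unknown shape — never sweep
  esac
  if [ "$status" = healthy ] || [ "$conn_count" -gt 0 ]; then
    echo keep; return            # rule 2: live connections, defense-in-depth
  fi
  local slug="${name#tenant-}"
  if grep -qxF "$slug" live_slugs.txt; then
    echo keep; return            # rule 3: slug live on prod or staging
  fi
  echo delete                    # rule 4: orphan
}
```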
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to existing org-secrets):
CF_API_TOKEN must include account:cloudflare_tunnel:edit
scope (separate from zone:dns:edit used by
sweep-cf-orphans — same token if scope is
broad, or a new token if narrowly scoped).
CF_ACCOUNT_ID account that owns the tunnels (visible in
dash.cloudflare.com URL path).
CP_PROD_ADMIN_TOKEN reused from sweep-cf-orphans.
CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.
Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — same pattern
applied to DNS records when the same root cause was unaddressed.
026f5e51d9
ops: add Railway SHA-pin drift audit script + regression test (#2001)
#2000 fixed one symptom — TENANT_IMAGE pinned to `staging-a14cf86`
(10 days stale) silently no-op'd four upstream fixes on 2026-04-24.
This adds the audit pattern as a re-runnable script so the broader
class is observable on demand without new CI infrastructure.

Audit results today (2026-04-27):
  controlplane / production: 54 vars audited, 0 drift-prone pins
  controlplane / staging:    52 vars audited, 0 drift-prone pins
So the immediate audit deliverable is clean — TENANT_IMAGE is the only
known violation and #2000 already fixed it. The script makes the
ongoing audit a 5-second command instead of a manual one.

Detection regex catches:
* branch-SHA suffixes (`staging|main|prod|production-<6+ hex>`) — the
  exact 2026-04-24 incident shape
* version pins after `:` or `=` (`:v1.2.3`, `=v0.1.16`) — same drift
  class, just rendered differently
Anchoring on `:` or `=` keeps prose like "version 1.2.3 of the api"
out of the false-positive set. UUIDs, ARNs, AMI IDs, secrets, and
floating tags (`:staging-latest`, `:main`) pass through untouched.

Regression test (tests/ops/test_audit_railway_sha_pins.sh) pins 20
representative cases — 9 should-flag (covering all four branch
prefixes + semver variants + middle-of-value matches) and 11
should-pass (the false-positive guards). The same regex is inlined in
both files so a future tweak that weakens detection fails the test in
lockstep with weakening the audit. Both files shellcheck clean.

CI gate (the acceptance criterion's "regression: add a CI check") is
deliberately scoped out — querying Railway from CI requires plumbing
RAILWAY_TOKEN as a repo secret, which is multi-step setup. The
re-runnable script + test cover the same surface today; the CI
workflow is a small follow-up once the token is provisioned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
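An illustrative approximation of the two detection shapes (the
script's exact regex may differ):

```bash
sha_pin='(staging|main|prod|production)-[0-9a-f]{6,}'
ver_pin='[:=]v[0-9]+\.[0-9]+\.[0-9]+'

printf '%s\n' \
  'TENANT_IMAGE=ghcr.io/x/tenant:staging-a14cf86' \
  'API_PIN=v1.2.3' \
  'NOTE=version 1.2.3 of the api' \
  'IMAGE=ghcr.io/x/tenant:staging-latest' |
grep -E "${sha_pin}|${ver_pin}"
# flags the first two lines; the prose and the floating tag pass through
```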
6494e9192b
refactor(ops): apply simplify findings on #2027 PR
Code-quality + efficiency review of PR #2079:
- Hoist all_slugs = prod_slugs | staging_slugs out of decide() into
  the caller (it was rebuilt on every record — 1k records × ~50-slug
  union per call). decide() signature is now (r, all_slugs,
  ec2_names).
- Compile regexes at module scope (_WS_RE, _E2E_RE, _TENANT_RE) +
  hoist the platform-core literal set (_PLATFORM_CORE_NAMES). Same
  change mirrored in the bash heredoc.
- Drop decorative # Rule N: comments (the numbering was out of order,
  3 before 2 — actively confusing).
- Move the "edits must mirror" reminder OUTSIDE the CANONICAL DECIDE
  block in the .sh file, eliminating the .replace() comment-skip hack
  in TestParityWithBashScript.
- Drop the per-line .strip() in _slice_canonical (it would mask a real
  indentation bug; both blocks are already at column 0).
- Use subTest() in the TestPlatformCore loops so a single failure no
  longer short-circuits the rest of the items.
- Add merge_group + concurrency to test-ops-scripts.yml (parity with
  ci.yml gate behaviour).
- Fix a "don't" apostrophe in an inline comment that closed the python
  heredoc's single quote and broke bash -n.
All 25 tests still pass. bash -n clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ba78a5c00d
test(ops): unit tests for sweep-cf-orphans decide() (#2027)
Closes #2027. The CF orphan sweep deletes DNS records — a
misclassification could nuke a live workspace's tunnel. The decision
function had MAX_DELETE_PCT percentage gating but no automated test of
the category → action mapping.

Approach: extract the decide() function to
scripts/ops/sweep_cf_decide.py as a verbatim copy bracketed by
`# CANONICAL DECIDE BEGIN/END` markers. The shell script keeps its
inline heredoc (so the operational path is untouched) but bracketed by
the same markers. A parity test (TestParityWithBashScript) reads both
files and asserts the bracketed blocks match line-for-line — drift
fails CI loudly.

Coverage (25 tests, 1 file, stdlib unittest only):
- Rule 1 platform-core: apex, _vercel, _domainkey,
  www/api/app/doc/send/status/staging-api
- Rule 3 ws-*: live (matches EC2 prefix) on prod + staging; orphan on
  prod + staging
- Rule 4 e2e-*: live + orphan on staging; orphan on prod
- Rule 2 generic tenant: live prod + staging; unknown subdomain
  kept-for-safety
- Rule 5 fallthrough: external domain + unrelated apex
- Rule priority: api.moleculesai.app stays platform-core (not tenant);
  _vercel stays verification
- Safety gate: under/at/over default 50% threshold; zero-total
  no-divide; custom threshold
- Empty live-sets: documents that decide() alone classifies as orphan,
  the gate is the defense

CI: new .github/workflows/test-ops-scripts.yml runs `python -m
unittest discover` against scripts/ops/ on every PR/push that touches
the directory. Lightweight — no requirements file, stdlib only.

Local: `cd scripts/ops && python -m unittest test_sweep_cf_decide -v`
→ 25 tests, all OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
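The parity check is a Python unittest; the same comparison expressed
from the shell would look roughly like this (file names from the
commit):

```bash
# print the bracketed canonical block from a file
extract() {
  sed -n '/# CANONICAL DECIDE BEGIN/,/# CANONICAL DECIDE END/p' "$1"
}
diff <(extract scripts/ops/sweep_cf_decide.py) \
     <(extract scripts/ops/sweep-cf-orphans.sh) \
  && echo "parity OK"
```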
817b8b0307
fix(scripts): make MAX_DELETE_PCT actually honor env override
The script's own help text documents `MAX_DELETE_PCT=62
./sweep-cf-orphans.sh` as the way to relax the safety gate, but the
in-script assignment on line 35 was unconditional and overwrote any
env value — so the override never worked. During today's staging
tenant-provision recovery (CP #255 context), hit the 57%-delete
threshold and needed the documented override to clear 64 orphan
records.

The one-line change to `${MAX_DELETE_PCT:-50}` honors the env while
keeping the 50% default when no caller overrides. Ran with
MAX_DELETE_PCT=62 after the fix — deleted 64 records, CF zone 111→47.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
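The fix, in isolation (the exact "before" form of line 35 is a
reconstruction from the commit's description):

```bash
# before — unconditional, clobbers any caller-supplied value:
MAX_DELETE_PCT=50

# after — env override honored, 50 stays the default:
MAX_DELETE_PCT="${MAX_DELETE_PCT:-50}"
```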
0576e341b9
ops(#1976): add smart-sweep script for orphan Cloudflare DNS records (#1978)
Replaces the "panic-button at >65 records" manual sweep that nukes
every pattern-match unconditionally (would delete live workspaces
along with orphans).
This version:
- Queries CP prod + staging /admin/orgs for live tenant slugs
- Queries AWS EC2 describe-instances for live workspace Name tags
- Only deletes CF records whose slug/ws-id has no live counterpart
- Dry-run by default (--execute to actually delete)
- Safety gate refuses to delete >50% of records (configurable via
MAX_DELETE_PCT env var) — catches the "API returned zero orgs, every
tenant looks orphan" failure mode before it nukes production
- Per-category accounting: orphan-ws / orphan-e2e-tenant / etc.
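A minimal sketch of that safety gate (file names are illustrative):

```bash
MAX_DELETE_PCT="${MAX_DELETE_PCT:-50}"
total="$(wc -l < all_records.txt)"
doomed="$(wc -l < delete_plan.txt)"
# refuse when the delete plan exceeds the threshold share of the zone —
# catches the "API returned zero orgs" failure mode
if [ "$total" -gt 0 ] && [ $(( doomed * 100 / total )) -gt "$MAX_DELETE_PCT" ]; then
  echo "refusing: would delete $(( doomed * 100 / total ))% (> ${MAX_DELETE_PCT}%) of ${total} records" >&2
  exit 1
fi
```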
Usage:
CF_API_TOKEN=... CF_ZONE_ID=... \
CP_PROD_ADMIN_TOKEN=... CP_STAGING_ADMIN_TOKEN=... \
bash scripts/ops/sweep-cf-orphans.sh # dry-run
bash scripts/ops/sweep-cf-orphans.sh --execute # actually delete
Ref: #1976 (root-cause: tenant.Delete + workspace.Delete don't clean
their CF records — until that's fixed, this script is the maintenance
path)
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>