docs/update-remote-agent-tutorial-sdk-api
4 Commits
5373b5e7f6 |
fix(ci): extend class-E rename to scripts/ops/sweep-*.sh (chained-defect from #430 review)
Some checks failed:
- Block internal-flavored paths / Block forbidden paths (pull_request): successful in 18s
- Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request): successful in 13s
- CI / Detect changes (pull_request): successful in 50s
- Secret scan / Scan diff for credential-shaped strings (pull_request): successful in 20s
- sop-tier-check / tier-check (pull_request): successful in 19s
- E2E Staging Canvas (Playwright) / detect-changes (pull_request): successful in 55s
- Handlers Postgres Integration / detect-changes (pull_request): successful in 50s
- E2E API Smoke Test / detect-changes (pull_request): successful in 59s
- Runtime PR-Built Compatibility / detect-changes (pull_request): successful in 41s
- Ops Scripts Tests / Ops scripts (unittest) (pull_request): successful in 55s
- CI / Platform (Go) (pull_request): successful in 9s
- CI / Canvas (Next.js) (pull_request): successful in 10s
- CI / Python Lint & Test (pull_request): successful in 9s
- CI / Shellcheck (E2E scripts) (pull_request): successful in 23s
- E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request): successful in 13s
- Handlers Postgres Integration / Handlers Postgres Integration (pull_request): successful in 11s
- Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request): successful in 12s
- CI / Canvas Deploy Reminder (pull_request): skipped
- audit-force-merge / audit (pull_request): successful in 23s
- E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request): FAILING after 4m53s
- E2E API Smoke Test / E2E API Smoke Test (pull_request): FAILING after 5m15s
The core-devops lens review (review 1075) caught the chained defect: the three
sweep workflows shell out to `bash scripts/ops/sweep-{aws-secrets,cf-orphans,cf-tunnels}.sh`,
and those scripts still consume the OLD env-var names — `need CP_PROD_ADMIN_TOKEN`,
`need CP_STAGING_ADMIN_TOKEN`, and `Bearer $CP_PROD_ADMIN_TOKEN` /
`Bearer $CP_STAGING_ADMIN_TOKEN` in the CP-admin curl calls. The workflow-
level presence-check loop (renamed in the first commit) would pass, then
the shell script would `exit 1` at the `need CP_PROD_ADMIN_TOKEN` line.
Classic `feedback_chained_defects_in_never_tested_workflows` — the YAML-
surface rename looked complete; the actual consumer is one layer deeper.
This commit completes the rename in the scripts:
- `CP_PROD_ADMIN_TOKEN` -> `CP_ADMIN_API_TOKEN`
- `CP_STAGING_ADMIN_TOKEN` -> `CP_STAGING_ADMIN_API_TOKEN`
(six occurrences per script — comments, `need` checks, `Bearer $...`
curl headers — in all three scripts). The .gitea/workflows/sweep-*.yml files (first
commit) export `CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}` etc.,
so the scripts now read `$CP_ADMIN_API_TOKEN` — consistent end-to-end.
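The failing shape is easy to picture: each sweep script guards its required env vars with a `need` check before the CP-admin curl calls, so a workflow-level rename alone leaves the script exiting at the first check. A minimal runnable sketch of that pattern post-rename (the `need` body and the demo token values are illustrative, not the scripts' actual code):

```shell
#!/usr/bin/env bash
set -euo pipefail

# need: exit 1 with a clear message if a required env var is unset or
# empty. (Illustrative helper; the real scripts' version may differ.)
need() {
  if [ -z "${!1:-}" ]; then
    echo "need $1" >&2
    exit 1
  fi
}

# Demo values so the sketch runs standalone; in CI these come from the
# workflow's env block, e.g.
#   CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}
: "${CP_ADMIN_API_TOKEN:=demo-prod-token}"
: "${CP_STAGING_ADMIN_API_TOKEN:=demo-staging-token}"

# Post-rename: the script-side checks use the NEW names end-to-end,
# matching what the workflow exports.
need CP_ADMIN_API_TOKEN
need CP_STAGING_ADMIN_API_TOKEN

# ...later the CP-admin calls consume the same names, e.g.
# curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" ...
echo "presence checks passed"
```

If the workflow still exported the old names, the script would die at the first `need` line exactly as the review predicted.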
Per core-devops's other (non-blocking) note: `workflow_dispatch` each
sweep in dry-run both after this lands and after the #425 class-A PUT,
to confirm the path beyond the presence check actually works (a
`MINIMAX_TOKEN`-grade shape match isn't enough — exercise the real
CP-admin call).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8bf29b7d0e |
fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.
Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:
- We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
portable to BSD xargs (matters for local-runner testing on macOS).
- We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
by default; tab-separating id+name on argv risks mangling. The
name is kept in a side-channel id->name map ($NAME_MAP) and looked
up by the worker only on failure, for FAIL_LOG readability.
- Workers print exactly `OK` or `FAIL` on stdout; tally with
`grep -c '^OK$' / '^FAIL$'`.
- On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
"Failure detail (first 20):" — same diagnostic surface as before
but consolidated so we don't spam logs on a flaky CF API.
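The fan-out shape described in the notes above can be sketched as follows (the CF DELETE is stubbed with `true` so the sketch runs standalone; the tunnel ids and worker body are illustrative, not the script's actual code):

```shell
#!/usr/bin/env bash
set -euo pipefail

SWEEP_CONCURRENCY="${SWEEP_CONCURRENCY:-8}"
DELETE_PLAN="$(mktemp)"
RESULT_LOG="$(mktemp)"
trap 'rm -f "$DELETE_PLAN" "$RESULT_LOG"' EXIT

# One tunnel id per line; argv carries ONLY the id, so xargs' default
# whitespace tokenization cannot mangle a tab-separated id+name pair.
printf '%s\n' tun-001 tun-002 tun-003 tun-004 > "$DELETE_PLAN"

# Worker: in the real script this is a curl -X DELETE against the CF
# API; stubbed here. Prints exactly OK or FAIL per id.
worker='
  id="$1"
  if true; then   # stub for: curl -fsS -X DELETE .../tunnels/"$id"
    echo OK
  else
    echo FAIL
  fi
'

# Fan out: stdin (<) rather than GNU-only `xargs -a FILE` keeps this
# portable to BSD xargs (macOS local runners); -I implies one input
# line per worker invocation, -P caps the parallelism.
xargs -P "$SWEEP_CONCURRENCY" -I {} bash -c "$worker" _ {} \
  < "$DELETE_PLAN" > "$RESULT_LOG"

# Tally: grep -c prints 0 (and exits nonzero) when nothing matches,
# hence the `|| true` under set -e.
DELETED=$(grep -c '^OK$' "$RESULT_LOG" || true)
FAILED=$(grep -c '^FAIL$' "$RESULT_LOG" || true)
echo "DELETED=$DELETED FAILED=$FAILED"
```

With the stub every delete succeeds, so the tally prints DELETED=4 FAILED=0; swapping the stub for the real curl call is the only change the execute branch needs.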
Bug 2: the workflow's 5-min cap was meant as a hangs detector but turned
out to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).
Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.
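The single-registration trap pattern looks like this (tempfile names taken from the commit; the surrounding script body is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

PAGES_DIR="$(mktemp -d)"
DELETE_PLAN="$(mktemp)"
NAME_MAP="$(mktemp)"
FAIL_LOG="$(mktemp)"
RESULT_LOG="$(mktemp)"

# A second `trap '...' EXIT` silently REPLACES the first registration,
# so all cleanup lives in one function registered exactly once.
cleanup() {
  rm -rf "$PAGES_DIR"
  rm -f "$DELETE_PLAN" "$NAME_MAP" "$FAIL_LOG" "$RESULT_LOG"
}
trap cleanup EXIT

echo "working in $PAGES_DIR"
```

Any later code that needs extra teardown extends `cleanup()` instead of registering a new trap.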
Verification:
- bash -n scripts/ops/sweep-cf-tunnels.sh: clean
- shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
- python3 yaml.safe_load on the workflow: clean
- Synthetic 30-line delete plan with every 7th id sentinel'd to
return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
side-channel name lookup verified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a117a60eed |
fix(sweep-cf-tunnels): buffer pages to disk to avoid argv ARG_MAX
The page-merge loop passed the entire accumulating tunnel JSON to
python3 -c via argv on every iteration. On a busy account (verified
2026-05-02: 672 tunnels, 14 pages on the Hongmingwangrabbit account)
this exceeds the GH Ubuntu runner's combined argv+envp limit (~128 KB)
and dies with `python3: Argument list too long` at exit 126 — the
workflow had been silently failing this way since the very first run
that hit a real account, masked earlier by a missing-CF_ACCOUNT_ID
secret check.
Fix: buffer each page response to a file under a temp dir and merge
from disk at the end. Also bumps the page cap from 20 to 40 (1000 →
2000 tunnel ceiling) so the existing soft-cap warning has headroom;
the disk-merge shape is O(n) in tunnel count rather than the previous
O(n^2), so the larger ceiling is cheap.
Verified locally against the live account (672 tunnels): the script
now runs cleanly to the existing MAX_DELETE_PCT safety gate, which
trips at 99% > 90% as designed and surfaces the actual orphan backlog
for operator-driven cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
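The buffer-to-disk shape can be sketched like this (the CF fetch is replaced by generated stub pages so it runs standalone; the file layout and the merge snippet are illustrative, not the commit's actual code):

```shell
#!/usr/bin/env bash
set -euo pipefail

PAGES_DIR="$(mktemp -d)"
trap 'rm -rf "$PAGES_DIR"' EXIT

# Each page goes to its own file instead of accumulating in a shell
# variable that gets re-passed on argv every iteration (which is what
# blew past ARG_MAX on the 14-page account). Stub pages stand in for
# the real paginated CF API responses.
for page in 1 2 3; do
  printf '{"result":[{"id":"tun-%d-a"},{"id":"tun-%d-b"}]}' \
    "$page" "$page" > "$PAGES_DIR/page-$page.json"
done

# Merge from disk at the end: argv carries only the directory path,
# regardless of how many tunnels exist, and the merge is O(n).
TOTAL=$(python3 - "$PAGES_DIR" <<'PY'
import json, pathlib, sys

tunnels = []
for p in sorted(pathlib.Path(sys.argv[1]).glob("page-*.json")):
    tunnels.extend(json.loads(p.read_text())["result"])
print(len(tunnels))
PY
)
echo "TOTAL=$TOTAL"
```

Three stub pages of two tunnels each merge to TOTAL=6; the real script feeds the merged list into the existing decide/delete stages.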
3a6d2f179d |
feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.
Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).
Parallel-shape to sweep-cf-orphans:
- Same dry-run-by-default + --execute pattern
- Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
sweep's 50% because tenant-shaped tunnels are orphans by design)
- Same schedule/dispatch hardening (hard-fail on missing secrets
when scheduled, soft-skip when dispatched)
- Cron offset to :45 to avoid CF API bursts colliding with the DNS
sweep at :15
Decision rules (in order):
1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
tunnels that might belong to platform infra).
2. Tunnel has active connections (status=healthy or non-empty
connections array) → keep (defense-in-depth: don't kill a live
tunnel even if CP forgot the org).
3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
4. Otherwise → delete (orphan).
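The four rules compose into one decide function applied per tunnel; a minimal sketch (the slug list, statuses, and tunnel names are made-up inputs; the real script derives them from live CF JSON and the CP prod/staging slug lists):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Known-live slugs, space-delimited with sentinel spaces at both ends
# so substring matching in the case pattern below is exact per slug.
LIVE_SLUGS=" acme-prod canvas-demo "

# decide NAME STATUS -> prints keep or delete, applying the rules in order.
decide() {
  local name="$1" status="$2" slug
  case "$name" in
    tenant-*) slug="${name#tenant-}" ;;
    *) echo keep; return ;;            # 1. not tenant-shaped: never sweep
  esac
  if [ "$status" = "healthy" ]; then
    echo keep; return                  # 2. live connections: never kill
  fi
  case "$LIVE_SLUGS" in
    *" $slug "*) echo keep ;;          # 3. tenant still exists in CP
    *)           echo delete ;;        # 4. otherwise: orphan
  esac
}

decide "cloudflared-infra-gw" down     # rule 1: keep
decide "tenant-acme-prod"     down     # rule 3: keep
decide "tenant-e2e-canvas-7"  healthy  # rule 2: keep
decide "tenant-e2e-canvas-7"  down     # rule 4: delete
```

The ordering matters: the unknown-name and live-connection guards run before the slug lookup, so a flaky CP slug fetch can never widen the delete set to non-tenant or still-connected tunnels.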
Verified by:
- shell syntax check (bash -n)
- YAML lint
- Decide-logic offline smoke (7 cases, all pass)
- End-to-end dry-run smoke with stubbed CP + CF APIs
Required secrets (added to the existing org secrets):
- CF_API_TOKEN: must include the account:cloudflare_tunnel:edit scope
  (separate from the zone:dns:edit scope used by sweep-cf-orphans —
  the same token works if broadly scoped, or mint a new one if
  narrowly scoped).
- CF_ACCOUNT_ID: the account that owns the tunnels (visible in the
  dash.cloudflare.com URL path).
- CP_PROD_ADMIN_TOKEN: reused from sweep-cf-orphans.
- CP_STAGING_ADMIN_TOKEN: reused from sweep-cf-orphans.
Note: the CP-side root cause (tenant delete should cascade to tunnel
delete) lives in molecule-controlplane and is worth fixing separately.
This janitor is the operational backstop in the meantime — the same
pattern we applied to DNS records while that root cause went
unaddressed.