molecule-core/scripts/ops
Hongming Wang 8bf29b7d0e fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.

Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:

  - We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
    portable to BSD xargs (matters for local-runner testing on macOS).
  - We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
    by default; tab-separating id+name on argv risks mangling. The
    name is kept in a side-channel id->name map ($NAME_MAP) and looked
    up by the worker only on failure, for FAIL_LOG readability.
  - Workers print exactly `OK` or `FAIL` on stdout; tally with
    `grep -c '^OK$' / '^FAIL$'`.
  - On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
    "Failure detail (first 20):" — same diagnostic surface as before
    but consolidated so we don't spam logs on a flaky CF API.

Bug 2: workflow's 5-min cap was set as a hangs-detector but turned out
to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).

Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.

Verification:
  - bash -n scripts/ops/sweep-cf-tunnels.sh: clean
  - shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
  - python3 yaml.safe_load on the workflow: clean
  - Synthetic 30-line delete plan with every 7th id sentinel'd to
    return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
    side-channel name lookup verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:35:46 -07:00
..
audit-railway-sha-pins.sh ops: add Railway SHA-pin drift audit script + regression test (#2001) 2026-04-27 05:01:23 -07:00
check_migration_collisions.py ci: hard gate against migration version collisions (#2341) 2026-04-29 21:42:42 -07:00
check-prod-versions.sh ops: scripts/ops/check-prod-versions.sh — one-line "is each tenant on latest?" 2026-04-30 13:13:47 -07:00
sweep_cf_decide.py refactor(ops): apply simplify findings on #2027 PR 2026-04-26 00:28:15 -07:00
sweep-cf-orphans.sh refactor(ops): apply simplify findings on #2027 PR 2026-04-26 00:28:15 -07:00
sweep-cf-tunnels.sh fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout 2026-05-02 02:35:46 -07:00
test_check_migration_collisions.py fix(test): convert migration-collision tests from pytest to unittest (#2341) 2026-04-30 01:47:27 -07:00
test_sweep_cf_decide.py refactor(ops): apply simplify findings on #2027 PR 2026-04-26 00:28:15 -07:00