Commit Graph

3 Commits

Author SHA1 Message Date
Hongming Wang
8bf29b7d0e fix(sweep-cf-tunnels): parallelize deletes + raise workflow timeout
The hourly Sweep stale Cloudflare Tunnels job got cancelled mid-cleanup
on 2026-05-02 (run 25248788312, killed at 5min after deleting 424/672
stale tunnels). A second manual dispatch finished the remaining 254
fine, so the immediate backlog cleared, but two underlying bugs would
re-trip on the next big cleanup.

Bug 1: serial delete loop. The execute branch was a `while read; do
curl -X DELETE; done` pipeline at ~0.7s/tunnel — fine for the
steady-state cleanup of a handful, but a 600+ backlog needs ~7-8min.
This commit fans out to $SWEEP_CONCURRENCY (default 8) workers via
`xargs -P 8 -L 1 -I {} bash -c '...' _ {} < "$DELETE_PLAN"`. With 8x
parallelism the same 600+ list drains in ~60s. Notes:

  - We use stdin (`<`) not GNU's `xargs -a FILE` so the script stays
    portable to BSD xargs (matters for local-runner testing on macOS).
  - We pass ONLY the tunnel id on argv. xargs tokenizes on whitespace
    by default; tab-separating id+name on argv risks mangling. The
    name is kept in a side-channel id->name map ($NAME_MAP) and looked
    up by the worker only on failure, for FAIL_LOG readability.
  - Workers print exactly `OK` or `FAIL` on stdout; tally with
    `grep -c '^OK$' / '^FAIL$'`.
  - On non-zero FAILED, log the first 20 lines of $FAIL_LOG as
    "Failure detail (first 20):" — same diagnostic surface as before
    but consolidated so we don't spam logs on a flaky CF API.

Bug 2: workflow's 5-min cap was set as a hangs-detector but turned out
to be a real-job-too-slow detector. Raised to 30 min — generous
headroom for the ~60s steady-state run while still surfacing genuine
hangs (and in line with the sweep-cf-orphans companion job).

Bug 3 (drive-by): the existing trap was `trap 'rm -rf "$PAGES_DIR"'
EXIT`, which would have been silently overwritten by any later trap
registration. Replaced with a single `cleanup()` function that wipes
PAGES_DIR + all four new tempfiles (DELETE_PLAN, NAME_MAP, FAIL_LOG,
RESULT_LOG), called once via `trap cleanup EXIT`.

Verification:
  - bash -n scripts/ops/sweep-cf-tunnels.sh: clean
  - shellcheck -S warning scripts/ops/sweep-cf-tunnels.sh: clean
  - python3 yaml.safe_load on the workflow: clean
  - Synthetic 30-line delete plan with every 7th id sentinel'd to
    return {"success":false}: TEST PASS, DELETED=26 FAILED=4, FAIL_LOG
    side-channel name lookup verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:35:46 -07:00
Hongming Wang
a117a60eed fix(sweep-cf-tunnels): buffer pages to disk to avoid argv ARG_MAX
The page-merge loop passed the entire accumulating tunnel JSON to
python3 -c via argv on every iteration. On a busy account (verified
2026-05-02: 672 tunnels, 14 pages on Hongmingwangrabbit account) this
exceeds the GH Ubuntu runner's combined argv+envp limit (~128 KB) and
dies with `python3: Argument list too long` at exit 126 — the workflow
has been silently failing this way since the very first run that hit a
real account, masked earlier by a missing-CF_ACCOUNT_ID secret check.

Buffer each page response to a file under a temp dir, merge from disk
at the end. Also bumps the page cap from 20 to 40 (1000 → 2000 tunnel
ceiling) so the existing soft-cap warning has headroom; the disk-merge
shape is O(n) in tunnel count rather than the previous O(n^2) so the
larger ceiling is cheap.

Verified locally against the live account (672 tunnels): script now
runs cleanly to the existing MAX_DELETE_PCT safety gate, which trips
at 99% > 90% as designed and surfaces the actual orphan backlog for
operator-driven cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 00:42:25 -07:00
Hongming Wang
3a6d2f179d feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate
CP's tenant-delete cascade removes the DNS record (with sweep-cf-orphans
as a backstop) but does NOT delete the underlying Cloudflare Tunnel.
Each E2E provision creates one Tunnel named `tenant-<slug>`; without
cleanup these accumulate indefinitely on the account, consuming the
tunnel quota and cluttering the dashboard.

Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in Down
state with zero replicas, weeks past their tenant's deletion. Same
class of bug as the DNS-records leak that drove sweep-cf-orphans
(controlplane#239).

Parallel-shape to sweep-cf-orphans:
  - Same dry-run-by-default + --execute pattern
  - Same MAX_DELETE_PCT safety gate (default 90% — higher than DNS
    sweep's 50% because tenant-shaped tunnels are orphans by design)
  - Same schedule/dispatch hardening (hard-fail on missing secrets
    when scheduled, soft-skip when dispatched)
  - Cron offset to :45 to avoid CF API bursts colliding with the DNS
    sweep at :15

Decision rules (in order):
  1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
     tunnels that might belong to platform infra).
  2. Tunnel has active connections (status=healthy or non-empty
     connections array) → keep (defense-in-depth: don't kill a live
     tunnel even if CP forgot the org).
  3. Slug ∈ {prod_slugs ∪ staging_slugs} → keep.
  4. Otherwise → delete (orphan).

Verified by:
  - shell syntax check (bash -n)
  - YAML lint
  - Decide-logic offline smoke (7 cases, all pass)
  - End-to-end dry-run smoke with stubbed CP + CF APIs

Required secrets (added to existing org-secrets):
  CF_API_TOKEN          must include account:cloudflare_tunnel:edit
                        scope (separate from zone:dns:edit used by
                        sweep-cf-orphans — same token if scope is
                        broad, or a new token if narrowly scoped).
  CF_ACCOUNT_ID         account that owns the tunnels (visible in
                        dash.cloudflare.com URL path).
  CP_PROD_ADMIN_TOKEN   reused from sweep-cf-orphans.
  CP_STAGING_ADMIN_TOKEN reused from sweep-cf-orphans.

Note: CP-side root cause (tenant-delete should cascade to tunnel
delete) is in molecule-controlplane and worth fixing separately. This
janitor is the operational backstop in the meantime — same pattern
applied to DNS records when the same root cause was unaddressed.
2026-04-29 19:42:47 -07:00