Files
molecule-core/scripts/ops/sweep-cf-tunnels.sh
Molecule AI Dev Engineer A (Kimi) 1424fb0d55
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
CI / Python Lint & Test (pull_request) Successful in 3s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 3s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 11s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 9s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 11s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 10s
E2E Chat / detect-changes (pull_request) Successful in 14s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 5s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
CI / Platform (Go) (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 19s
gate-check-v3 / gate-check (pull_request_target) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 15s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 1s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1s
qa-review / approved (pull_request_target) Failing after 21s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 24s
CI / Canvas (Next.js) (pull_request) Successful in 18s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
CI / Canvas Deploy Status (pull_request) Has been skipped
security-review / approved (pull_request_target) Failing after 22s
CI / all-required (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 58s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m4s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m17s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m14s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m29s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m24s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) [info tier:low] acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, l
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-tier-check / tier-check (pull_request_target) Failing after 5s
sop-checklist / all-items-acked (pull_request_target) Successful in 6s
qa-review / approved (pull_request_review) Has been skipped
security-review / approved (pull_request_review) Has been skipped
sop-tier-check / tier-check (pull_request_review) Successful in 5s
fix(ci): add CLOUDFLARE_* secret fallback to sweep-cf workflows (internal#805)
The sweep-cf-orphans and sweep-cf-tunnels workflows reference CI-scoped
secret names (CF_API_TOKEN, CF_ZONE_ID, CF_ACCOUNT_ID) while the
operator-host canonical names are CLOUDFLARE_API_TOKEN,
CLOUDFLARE_ZONE_ID, CLOUDFLARE_ACCOUNT_ID. When the CI-scoped duplicates
are missing from the secret store, the scheduled sweeps hard-fail even
though the canonical names are present.

Changes:
- Workflow YAML: `secrets.CF_API_TOKEN || secrets.CLOUDFLARE_API_TOKEN`
  (same pattern for ZONE_ID and ACCOUNT_ID).
- Scripts: add env-var fallback so direct local invocation also works.
- Comments and error messages updated to mention both naming conventions.

Closes internal#805

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-05 14:55:58 +00:00

340 lines
13 KiB
Bash
Executable File
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
#!/usr/bin/env bash
# sweep-cf-tunnels.sh — safe, targeted sweep of Cloudflare Tunnels
# whose corresponding tenant no longer exists.
#
# Why this exists: CP's tenant-delete cascade removes the DNS record
# (caught by sweep-cf-orphans.sh as a backstop) but does NOT delete
# the underlying Cloudflare Tunnel. Each E2E provision creates one
# Tunnel named `tenant-<slug>`; without cleanup these accumulate
# indefinitely on the account, consuming the account's tunnel quota
# and cluttering the Cloudflare dashboard.
#
# Observed 2026-04-30: dozens of `tenant-e2e-canvas-*` tunnels in
# Down state with zero replicas, weeks past their tenant's deletion.
#
# This script is a parallel-shape janitor to sweep-cf-orphans.sh:
# 1. Query CP admin API to enumerate live org slugs (prod + staging)
# 2. Enumerate Cloudflare Tunnels via the account-scoped API
# 3. For each tunnel matching `tenant-<slug>`, check if <slug>
# appears in the live set
# 4. Skip tunnels with active connections (defense-in-depth — never
# delete a healthy tunnel even if CP claims the org is gone)
# 5. Only delete tunnels with NO live counterpart AND NO active
# connections
#
# Dry-run by default; must pass --execute to actually delete.
#
# Env vars required:
# CF_API_TOKEN — Cloudflare token with
# account:cloudflare_tunnel:edit scope.
# (Same secret as sweep-cf-orphans, but the
# token must include the tunnel scope.)
# (falls back to CLOUDFLARE_API_TOKEN if CF_API_TOKEN is unset;
# the workflow YAML maps both secret names into CF_API_TOKEN)
# CF_ACCOUNT_ID — the account that owns the tunnels (visible
# in dash.cloudflare.com URL path)
# (falls back to CLOUDFLARE_ACCOUNT_ID if CF_ACCOUNT_ID is unset)
# CP_ADMIN_API_TOKEN — CP admin bearer for api.moleculesai.app
# CP_STAGING_ADMIN_API_TOKEN — CP admin bearer for staging-api.moleculesai.app
#
# Exit codes:
# 0 — dry-run completed or sweep executed successfully
# 1 — missing required env, API failure, or unexpected state
# 2 — safety check failed (would delete >MAX_DELETE_PCT% of
# tenant-shaped tunnels; refusing)
set -euo pipefail
DRY_RUN=1
# Tenant tunnels are short-lived by design — most of them at any
# given moment are orphans from finished E2E runs. The default is
# tuned higher than sweep-cf-orphans (50%) to reflect that the
# steady-state for tenant-* tunnels is mostly-orphan, not mostly-live.
MAX_DELETE_PCT="${MAX_DELETE_PCT:-90}"
for arg in "$@"; do
case "$arg" in
--execute|--no-dry-run) DRY_RUN=0 ;;
--help|-h)
grep '^#' "$0" | head -45 | sed 's/^# \{0,1\}//'
exit 0
;;
*)
echo "unknown arg: $arg (use --help)" >&2
exit 1
;;
esac
done
need() {
local var="$1"
if [ -z "${!var:-}" ]; then
echo "ERROR: $var is required" >&2
exit 1
fi
}
# Fallback: operator-host canonical names → CI-scoped names.
# The workflow YAML already maps both, but direct script invocation
# (e.g. local ops) may only have the canonical names set.
CF_API_TOKEN="${CF_API_TOKEN:-${CLOUDFLARE_API_TOKEN:-}}"
CF_ACCOUNT_ID="${CF_ACCOUNT_ID:-${CLOUDFLARE_ACCOUNT_ID:-}}"
need CF_API_TOKEN
need CF_ACCOUNT_ID
need CP_ADMIN_API_TOKEN
need CP_STAGING_ADMIN_API_TOKEN
log() { echo "[$(date -u +%H:%M:%S)] $*"; }
# --- Gather live sets ------------------------------------------------------
log "Fetching CP prod org slugs..."
PROD_SLUGS=$(curl -sS -m 15 -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
"https://api.moleculesai.app/cp/admin/orgs?limit=500" \
| python3 -c "import json,sys; print(' '.join(o['slug'] for o in json.load(sys.stdin).get('orgs',[])))")
log " prod orgs: $(echo "$PROD_SLUGS" | wc -w | tr -d ' ')"
log "Fetching CP staging org slugs..."
STAGING_SLUGS=$(curl -sS -m 15 -H "Authorization: Bearer $CP_STAGING_ADMIN_API_TOKEN" \
"https://staging-api.moleculesai.app/cp/admin/orgs?limit=500" \
| python3 -c "import json,sys; print(' '.join(o['slug'] for o in json.load(sys.stdin).get('orgs',[])))")
log " staging orgs: $(echo "$STAGING_SLUGS" | wc -w | tr -d ' ')"
log "Fetching Cloudflare tunnels..."
# The cfd_tunnel list endpoint is paginated; per_page max is 50.
# Walk all pages so we don't silently miss orphans on busy accounts.
#
# Pages are buffered to a temp dir and merged at the end. The earlier
# shape passed the accumulating JSON on argv every iteration, which on
# a busy account (700+ tunnels = 14+ pages) blows past Linux ARG_MAX
# (~128 KB combined argv+envp on the GH Ubuntu runner) and dies with
# `python3: Argument list too long`. Disk-buffering also makes the
# accumulator O(n) instead of O(n^2).
PAGES_DIR=$(mktemp -d -t cf-tunnels-XXXXXX)
# Single cleanup() covering all tempfiles created downstream
# ($DELETE_PLAN, $NAME_MAP, $FAIL_LOG, $RESULT_LOG). One trap call so a
# later `trap '...' EXIT` doesn't silently overwrite an earlier one.
DELETE_PLAN=""
NAME_MAP=""
FAIL_LOG=""
RESULT_LOG=""
cleanup() {
rm -rf "$PAGES_DIR"
[ -n "$DELETE_PLAN" ] && rm -f "$DELETE_PLAN"
[ -n "$NAME_MAP" ] && rm -f "$NAME_MAP"
[ -n "$FAIL_LOG" ] && rm -f "$FAIL_LOG"
[ -n "$RESULT_LOG" ] && rm -f "$RESULT_LOG"
return 0
}
trap cleanup EXIT
PAGE=1
while :; do
page_file="$PAGES_DIR/page-$(printf '%05d' "$PAGE").json"
curl -sS -m 15 -H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel?per_page=50&page=$PAGE&is_deleted=false" \
> "$page_file"
page_count=$(python3 -c "import json,sys; print(len(json.load(open(sys.argv[1])).get('result') or []))" "$page_file")
if [ "$page_count" = "0" ]; then rm -f "$page_file"; break; fi
PAGE=$((PAGE + 1))
if [ "$PAGE" -gt 40 ]; then
log "::warning::stopping pagination at page 40 (2000 tunnels) — re-run if more"
break
fi
done
TUNNEL_JSON=$(python3 -c '
import glob, json, os, sys
acc = {"result": []}
for f in sorted(glob.glob(os.path.join(sys.argv[1], "page-*.json"))):
with open(f) as fh:
acc["result"].extend(json.load(fh).get("result") or [])
print(json.dumps(acc))
' "$PAGES_DIR")
TOTAL_TUNNELS=$(echo "$TUNNEL_JSON" | python3 -c "import json,sys; print(len(json.load(sys.stdin)['result']))")
log " total tunnels: $TOTAL_TUNNELS"
# --- Compute orphans -------------------------------------------------------
#
# Rules (in order):
# 1. Name doesn't match `tenant-<slug>` → keep (unknown — never sweep
# arbitrary tunnels that might belong to platform infra).
# 2. Tunnel has active connections (status=healthy or non-empty
# connections array) → keep (defense-in-depth: don't kill a live
# tunnel even if CP forgot the org).
# 3. Slug ∈ {prod_slugs staging_slugs} → keep (live tenant).
# 4. Otherwise → delete (orphan).
export PROD_SLUGS STAGING_SLUGS
DECISIONS=$(echo "$TUNNEL_JSON" | python3 -c '
import json, os, re, sys
prod_slugs = set(os.environ["PROD_SLUGS"].split())
staging_slugs = set(os.environ["STAGING_SLUGS"].split())
all_slugs = prod_slugs | staging_slugs
_TENANT_RE = re.compile(r"^tenant-(.+)$")
def decide(t, all_slugs):
name = t.get("name", "")
tid = t.get("id", "")
status = t.get("status", "")
conns = t.get("connections") or []
m = _TENANT_RE.match(name)
if not m:
return ("keep", "not-a-tenant-tunnel", tid, name, status)
slug = m.group(1)
# Defense-in-depth: never delete a tunnel with live connectors.
# The CF tunnel "status" field is one of inactive/degraded/healthy/down.
# "down" with empty connections is the orphan state we sweep.
if status == "healthy" or len(conns) > 0:
return ("keep", "active-connections", tid, name, status)
if slug in all_slugs:
return ("keep", "live-tenant", tid, name, status)
return ("delete", "orphan-tenant", tid, name, status)
d = json.loads(sys.stdin.read())
for t in d.get("result", []):
action, reason, tid, name, status = decide(t, all_slugs)
print(json.dumps({"action": action, "reason": reason, "id": tid, "name": name, "status": status}))
')
# --- Summarize + safety gate ----------------------------------------------
DELETE_COUNT=$(printf '%s' "$DECISIONS" | python3 -c "import json,sys; print(sum(1 for l in sys.stdin if json.loads(l)['action']=='delete'))")
KEEP_COUNT=$((TOTAL_TUNNELS - DELETE_COUNT))
TENANT_TUNNELS=$(printf '%s' "$DECISIONS" | python3 -c "
import json, sys
n = sum(1 for l in sys.stdin if json.loads(l)['reason'] != 'not-a-tenant-tunnel')
print(n)
")
log ""
log "== Sweep plan =="
log " total tunnels: $TOTAL_TUNNELS"
log " tenant-shaped tunnels: $TENANT_TUNNELS"
log " would delete: $DELETE_COUNT"
log " would keep: $KEEP_COUNT"
log ""
# Per-reason breakdown of deletes
printf '%s' "$DECISIONS" | python3 -c "
import json,sys,collections
c = collections.Counter()
for l in sys.stdin:
d = json.loads(l)
if d['action'] == 'delete':
c[d['reason']] += 1
for reason, n in c.most_common():
print(f' delete/{reason}: {n}')
"
# Safety gate operates against the tenant-shaped subset (the reasonable
# "all of these could conceivably be ours" denominator), not the total.
# A miscount of platform-infra tunnels shouldn't relax the gate.
if [ "$TENANT_TUNNELS" -gt 0 ]; then
PCT=$(( DELETE_COUNT * 100 / TENANT_TUNNELS ))
if [ "$PCT" -gt "$MAX_DELETE_PCT" ]; then
log ""
log "SAFETY: would delete $PCT% of tenant-shaped tunnels (threshold $MAX_DELETE_PCT%) — refusing."
log " If this is expected (e.g. major cleanup after incident), rerun with"
log " MAX_DELETE_PCT=$((PCT+5)) $0 $*"
exit 2
fi
fi
if [ "$DRY_RUN" = "1" ]; then
log ""
log "Dry run complete. Pass --execute to actually delete $DELETE_COUNT tunnels."
log ""
log "First 20 tunnels that would be deleted:"
printf '%s' "$DECISIONS" | python3 -c "
import json, sys
shown = 0
for l in sys.stdin:
d = json.loads(l)
if d['action'] == 'delete':
print(f\" {d['reason']:25s} {d['name']:40s} status={d['status']}\")
shown += 1
if shown >= 20: break
"
exit 0
fi
# --- Execute deletes -------------------------------------------------------
#
# Parallel delete loop. Was a serial `curl -X DELETE` while-loop;
# at ~0.7s/tunnel that meant 672 stale tunnels needed ~7-8 min, which
# tripped the workflow's 5-min timeout-minutes (run 25248788312,
# cancelled at 424/672). Fan out to $SWEEP_CONCURRENCY workers via
# xargs so a 600+ backlog drains in ~60s.
#
# Design notes:
# - Materialize the (id, name) plan to a tempfile for stdin'ing into
# xargs. xargs `-a FILE` is GNU-only; piping/`<` is portable to
# macOS/BSD xargs (matters for local testing).
# - Pass ONLY the id on argv. xargs tokenizes on whitespace by
# default; tab-separating id+name on argv risks mangling. We keep
# the name in a side-channel id→name map ($NAME_MAP) for failure
# log readability, and the worker also writes failure detail to
# $FAIL_LOG (`FAIL <name> <id>`) for grep-ability.
# - Workers print exactly `OK` or `FAIL` on stdout (one line per
# invocation); we tally with `grep -c '^OK$' / '^FAIL$'`.
CONCURRENCY="${SWEEP_CONCURRENCY:-8}"
DELETE_PLAN=$(mktemp -t cf-tunnels-plan-XXXXXX)
NAME_MAP=$(mktemp -t cf-tunnels-names-XXXXXX)
FAIL_LOG=$(mktemp -t cf-tunnels-fail-XXXXXX)
RESULT_LOG=$(mktemp -t cf-tunnels-result-XXXXXX)
# Build delete plan (just ids, one per line) and the side-channel
# id→name map (tab-separated).
printf '%s' "$DECISIONS" | python3 -c '
import json, os, sys
plan_path = sys.argv[1]
map_path = sys.argv[2]
with open(plan_path, "w") as plan, open(map_path, "w") as nmap:
for line in sys.stdin:
d = json.loads(line)
if d.get("action") != "delete":
continue
tid = d["id"]
name = d.get("name", "")
plan.write(tid + "\n")
nmap.write(tid + "\t" + name + "\n")
' "$DELETE_PLAN" "$NAME_MAP"
log ""
log "Executing $DELETE_COUNT deletions ($CONCURRENCY-way parallel)..."
export CF_API_TOKEN CF_ACCOUNT_ID NAME_MAP FAIL_LOG
# shellcheck disable=SC2016
xargs -P "$CONCURRENCY" -L 1 -I {} bash -c '
tid="$1"
resp=$(curl -sS -m 10 -X DELETE \
-H "Authorization: Bearer $CF_API_TOKEN" \
"https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel/$tid")
if printf "%s" "$resp" | grep -q "\"success\":true"; then
echo OK
else
name=$(awk -F"\t" -v id="$tid" "\$1==id {print \$2; exit}" "$NAME_MAP")
echo FAIL
echo "FAIL $name $tid" >> "$FAIL_LOG"
fi
' _ {} < "$DELETE_PLAN" > "$RESULT_LOG"
DELETED=$(grep -c '^OK$' "$RESULT_LOG" || true)
FAILED=$(grep -c '^FAIL$' "$RESULT_LOG" || true)
log ""
log "Done. deleted=$DELETED failed=$FAILED"
if [ "$FAILED" -ne 0 ]; then
log "Failure detail (first 20):"
head -20 "$FAIL_LOG" | while IFS= read -r fl; do log " $fl"; done
fi
[ "$FAILED" -eq 0 ]