Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 18s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 13s
CI / Detect changes (pull_request) Successful in 50s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 20s
sop-tier-check / tier-check (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 55s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 50s
E2E API Smoke Test / detect-changes (pull_request) Successful in 59s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 41s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 55s
CI / Platform (Go) (pull_request) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Python Lint & Test (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 23s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 13s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 12s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
audit-force-merge / audit (pull_request) Successful in 23s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 4m53s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 5m15s
core-devops lens review (review 1075) caught the chained defect: the 3
sweep workflows shell out to `bash scripts/ops/sweep-{aws-secrets,cf-orphans,cf-tunnels}.sh`,
and those scripts still consume the OLD env-var names — `need CP_PROD_ADMIN_TOKEN`,
`need CP_STAGING_ADMIN_TOKEN`, and `Bearer $CP_PROD_ADMIN_TOKEN` /
`Bearer $CP_STAGING_ADMIN_TOKEN` in the CP-admin curl calls. The workflow-
level presence-check loop (renamed in the first commit) would pass, then
the shell script would `exit 1` at the `need CP_PROD_ADMIN_TOKEN` line.
Classic `feedback_chained_defects_in_never_tested_workflows` — the YAML-
surface rename looked complete; the actual consumer is one layer deeper.
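
The failure mode can be reproduced in isolation. The sketch below is illustrative only and is not code from the repo: it copies the shape of the scripts' `need` helper (returning instead of exiting, so both outcomes show in one run), exports placeholder values under the new names the way the renamed workflow would, and then checks both name sets. `check_names` and the placeholder values are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical reproduction of the chained defect; not code from the repo.

# Same shape as the `need` helper in the sweep scripts, but it returns
# instead of exiting so that both outcomes can be shown in one run.
need() {
  local var="$1"
  if [ -z "${!var:-}" ]; then
    echo "ERROR: $var is required" >&2
    return 1
  fi
}

# Simulate what the renamed workflow exports (placeholder values).
export CP_ADMIN_API_TOKEN="placeholder-prod"
export CP_STAGING_ADMIN_API_TOKEN="placeholder-staging"
unset CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN

# Report each name as ok/missing; return non-zero if any is missing.
check_names() {
  local name status=0
  for name in "$@"; do
    if need "$name" 2>/dev/null; then
      echo "ok      $name"
    else
      echo "missing $name"
      status=1
    fi
  done
  return "$status"
}

echo "new names (read by the scripts after this commit):"
check_names CP_ADMIN_API_TOKEN CP_STAGING_ADMIN_API_TOKEN || true

echo "old names (read by the scripts before this commit):"
check_names CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN || true
```

With only the new names exported, the second call reports both old names as missing, which is the point at which the pre-fix scripts would have exited.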
This commit completes the rename in the scripts:
- `CP_PROD_ADMIN_TOKEN` -> `CP_ADMIN_API_TOKEN`
- `CP_STAGING_ADMIN_TOKEN` -> `CP_STAGING_ADMIN_API_TOKEN`
(6 occurrences total per script — comments, `need` checks, `Bearer $...`
curl headers — across all 3). The .gitea/workflows/sweep-*.yml files (first
commit) export `CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}` etc.,
so the scripts now read `$CP_ADMIN_API_TOKEN` — consistent end-to-end.
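
One way to confirm that no consumer still reads an old name is a recursive search. This is a minimal sketch, assuming the consumers live under a single directory such as `scripts/ops`; `find_stale_names` is a hypothetical helper, not something the repo provides.

```bash
#!/usr/bin/env bash
# Illustrative only: list files that still mention a pre-rename variable name.
# find_stale_names is a hypothetical helper, not part of the repo.

find_stale_names() {
  local dir="$1"
  # -r recurse, -l print file names only, -E extended regex.
  # Neither old name is a substring of either new name, so files that
  # use only the new names are not reported.
  grep -rlE 'CP_PROD_ADMIN_TOKEN|CP_STAGING_ADMIN_TOKEN' "$dir" | sort
}

# Example: run against the directory that holds the sweep scripts.
# find_stale_names scripts/ops
```

An empty result means no file under the directory mentions the old names, comments included.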
Per core-devops's other (non-blocking) note: `workflow_dispatch` each
sweep in dry-run after this lands + after the #425 class-A PUT, to confirm
the path beyond the presence-check actually works (the `MINIMAX_TOKEN`-grade
shape-match isn't enough — exercise the real CP-admin call).
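
The difference between a presence check and exercising the real call can be sketched as below. This is an assumption-laden illustration, not the repo's tooling: `classify_status` and `probe_cp_admin` are hypothetical helpers, the endpoint path is the one the sweep script already queries, and the reading of 401/403 as a rejected token is an assumption about how the CP admin API responds.

```bash
#!/usr/bin/env bash
# Illustrative only: classify_status and probe_cp_admin are hypothetical helpers.

# Map an HTTP status code to a coarse verdict.
classify_status() {
  case "$1" in
    2??)     echo "ok" ;;
    401|403) echo "auth-rejected" ;;   # assumed meaning: token not accepted
    000)     echo "unreachable" ;;     # curl prints 000 when no response arrived
    *)       echo "unexpected" ;;
  esac
}

# Make one real request with the token and classify the result.
probe_cp_admin() {
  local base_url="$1" token="$2" code
  # -o /dev/null discards the body; -w prints only the status code.
  code=$(curl -sS -m 15 -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $token" \
    "$base_url/cp/admin/orgs?limit=500" 2>/dev/null) || true
  classify_status "${code:-000}"
}

# Example (needs network access and a real token):
# probe_cp_admin "https://api.moleculesai.app" "$CP_ADMIN_API_TOKEN"
```

A non-empty but invalid token passes a presence check and would show up here as something other than `ok`.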
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
265 lines
9.9 KiB
Bash
Executable File
#!/usr/bin/env bash
# sweep-cf-orphans.sh — safe, targeted sweep of Cloudflare DNS records whose
# corresponding workspace/tenant no longer exists.
#
# Why this exists: tenant.Delete + workspace.Delete don't currently clean
# their CF records — see #1976. Until that lands, records accumulate at
# ~10/hour under normal E2E cadence. The old "sweep when >65" approach
# (deletes every record matching a pattern, regardless of liveness) was a
# panic button that would nuke live workspaces too.
#
# This script is the do-it-right version:
#   1. Query CP admin API to enumerate live org slugs
#   2. Query AWS EC2 to enumerate live workspace Name tags
#   3. For each CF record matching the sweep patterns, check if the
#      corresponding slug / ws-id appears in the live sets
#   4. Only delete records with NO live counterpart
#
# Dry-run by default; must pass --execute to actually delete.
#
# Env vars required:
#   CF_API_TOKEN — Cloudflare token with zone:dns:edit
#   CF_ZONE_ID — the zone (moleculesai.app)
#   CP_ADMIN_API_TOKEN — CP admin bearer for api.moleculesai.app
#   CP_STAGING_ADMIN_API_TOKEN — CP admin bearer for staging-api.moleculesai.app
#   AWS_* — standard AWS creds (default region us-east-2)
#
# Exit codes:
#   0 — dry-run completed or sweep executed successfully
#   1 — missing required env, API failure, or unexpected state
#   2 — safety check failed (would delete >50% of records; refusing)

set -euo pipefail

DRY_RUN=1
MAX_DELETE_PCT="${MAX_DELETE_PCT:-50}" # refuse to delete more than this pct of records in one run; caller can override via env
REGION="${AWS_DEFAULT_REGION:-us-east-2}"

for arg in "$@"; do
  case "$arg" in
    --execute|--no-dry-run) DRY_RUN=0 ;;
    --help|-h)
      grep '^#' "$0" | head -35 | sed 's/^# \{0,1\}//'
      exit 0
      ;;
    *)
      echo "unknown arg: $arg (use --help)" >&2
      exit 1
      ;;
  esac
done

need() {
  local var="$1"
  if [ -z "${!var:-}" ]; then
    echo "ERROR: $var is required" >&2
    exit 1
  fi
}
need CF_API_TOKEN
need CF_ZONE_ID
need CP_ADMIN_API_TOKEN
need CP_STAGING_ADMIN_API_TOKEN

log() { echo "[$(date -u +%H:%M:%S)] $*"; }

# --- Gather live sets ------------------------------------------------------

log "Fetching CP prod org slugs..."
PROD_SLUGS=$(curl -sS -m 15 -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  "https://api.moleculesai.app/cp/admin/orgs?limit=500" \
  | python3 -c "import json,sys; print(' '.join(o['slug'] for o in json.load(sys.stdin).get('orgs',[])))")
log " prod orgs: $(echo "$PROD_SLUGS" | wc -w | tr -d ' ')"

log "Fetching CP staging org slugs..."
STAGING_SLUGS=$(curl -sS -m 15 -H "Authorization: Bearer $CP_STAGING_ADMIN_API_TOKEN" \
  "https://staging-api.moleculesai.app/cp/admin/orgs?limit=500" \
  | python3 -c "import json,sys; print(' '.join(o['slug'] for o in json.load(sys.stdin).get('orgs',[])))")
log " staging orgs: $(echo "$STAGING_SLUGS" | wc -w | tr -d ' ')"

log "Fetching live EC2 Name tags (region=$REGION)..."
# Use JSON output + python — AWS CLI's --query with nested filters has
# surprising flattening behavior that dropped tags silently on first attempt.
EC2_NAMES=$(aws ec2 describe-instances --region "$REGION" \
  --filters "Name=instance-state-name,Values=running,pending" \
  --output json 2>/dev/null | python3 -c '
import json, sys
out = []
for r in json.load(sys.stdin).get("Reservations", []):
    for inst in r.get("Instances", []):
        for t in inst.get("Tags", []):
            if t.get("Key") == "Name" and t.get("Value"):
                out.append(t["Value"])
print(" ".join(out))
')
log " live EC2s: $(echo "$EC2_NAMES" | wc -w | tr -d ' ')"

log "Fetching Cloudflare DNS records..."
CF_JSON=$(curl -sS -m 15 -H "Authorization: Bearer $CF_API_TOKEN" \
  "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records?per_page=500")
TOTAL_CF=$(echo "$CF_JSON" | python3 -c "import json,sys; print(len(json.load(sys.stdin)['result']))")
log " CF records: $TOTAL_CF"

# --- Compute orphans -------------------------------------------------------

# We emit NDJSON so downstream can pipe into jq etc. Each line is one decision.
# Fields: action=keep|delete, reason, id, name, type.
#
# Rules (in order of priority — first match wins; the order matters because
# the tenant pattern in rule 4 also matches ws-* and e2e-* names):
#   1. Platform-core (api, app, doc, apex, www, _vercel, _domainkey, _railway-verify,
#      send, status, MX root) → always keep.
#   2. ws-<id8>-<rest>.moleculesai.app / ws-<id8>-<rest>.staging.moleculesai.app
#      → keep if ws-<id8>-<rest>* matches any live EC2 Name (prefix match), else delete.
#   3. e2e-<slug>.staging.moleculesai.app (or canary/canvas variants)
#      → keep if the full e2e-<slug> label ∈ {prod_slugs ∪ staging_slugs}, else delete.
#   4. Tenant subdomain `<slug>.moleculesai.app` or `<slug>.staging.moleculesai.app`
#      → always keep: as a live tenant if <slug> ∈ {prod_slugs ∪ staging_slugs},
#        otherwise as an unknown subdomain kept for safety (ad-hoc records).
#   5. Anything else → keep (we only sweep patterns we understand).

export PROD_SLUGS STAGING_SLUGS EC2_NAMES TOTAL_CF
# Edits inside the CANONICAL DECIDE block below must mirror
# scripts/ops/sweep_cf_decide.py — the parity test in
# test_sweep_cf_decide.py asserts they match byte-for-byte.
DECISIONS=$(echo "$CF_JSON" | python3 -c '
import json, os, re, sys
d = json.load(sys.stdin)
prod_slugs = set(os.environ["PROD_SLUGS"].split())
staging_slugs = set(os.environ["STAGING_SLUGS"].split())
all_slugs = prod_slugs | staging_slugs
ec2_names = set(n for n in os.environ["EC2_NAMES"].split() if n)

_PLATFORM_CORE_NAMES = {
    "api.moleculesai.app", "app.moleculesai.app", "doc.moleculesai.app",
    "send.moleculesai.app", "status.moleculesai.app", "www.moleculesai.app",
    "staging-api.moleculesai.app",
}
_WS_RE = re.compile(r"^(ws-[a-f0-9]{8}-[a-f0-9]+)(?:\.staging)?\.moleculesai\.app$")
_E2E_RE = re.compile(r"^(e2e-[^.]+)(?:\.staging)?\.moleculesai\.app$")
_TENANT_RE = re.compile(r"^([a-z0-9][a-z0-9-]*)(?:\.staging)?\.moleculesai\.app$")

# CANONICAL DECIDE BEGIN
def decide(r, all_slugs, ec2_names):
    n = r["name"]
    rid = r["id"]
    typ = r["type"]

    if n == "moleculesai.app":
        return ("keep", "apex", rid, n, typ)
    if n.startswith("_") or n.endswith("._domainkey.moleculesai.app"):
        return ("keep", "verification/key", rid, n, typ)
    if n in _PLATFORM_CORE_NAMES:
        return ("keep", "platform-core", rid, n, typ)

    m = _WS_RE.match(n)
    if m:
        prefix = m.group(1)
        # Live EC2 names share the ws-<hex8>-<rest> shape with the DNS subdomain.
        for ename in ec2_names:
            if ename.startswith(prefix):
                return ("keep", "live-ec2", rid, n, typ)
        return ("delete", "orphan-ws", rid, n, typ)

    m = _E2E_RE.match(n)
    if m:
        slug = m.group(1)
        if slug in all_slugs:
            return ("keep", "live-e2e-tenant", rid, n, typ)
        return ("delete", "orphan-e2e-tenant", rid, n, typ)

    m = _TENANT_RE.match(n)
    if m:
        slug = m.group(1)
        if slug in all_slugs:
            return ("keep", "live-tenant", rid, n, typ)
        # KEEP unknown tenant-shaped names — avoid false-positive nukes on
        # ad-hoc records (e.g. hermes-final-*) that do not match a known slug.
        return ("keep", "unknown-subdomain-kept-for-safety", rid, n, typ)

    return ("keep", "not-a-pattern-we-sweep", rid, n, typ)
# CANONICAL DECIDE END

for r in d["result"]:
    action, reason, rid, name, typ = decide(r, all_slugs, ec2_names)
    print(json.dumps({"action": action, "reason": reason, "id": rid, "name": name, "type": typ}))
')

# --- Summarize + safety gate ----------------------------------------------

DELETE_COUNT=$(echo "$DECISIONS" | python3 -c "import json,sys; print(sum(1 for l in sys.stdin if json.loads(l)['action']=='delete'))")
KEEP_COUNT=$((TOTAL_CF - DELETE_COUNT))

log ""
log "== Sweep plan =="
log " total CF records: $TOTAL_CF"
log " would delete: $DELETE_COUNT"
log " would keep: $KEEP_COUNT"
log ""

# Per-reason breakdown of deletes
echo "$DECISIONS" | python3 -c "
import json,sys,collections
c = collections.Counter()
for l in sys.stdin:
    d = json.loads(l)
    if d['action'] == 'delete':
        c[d['reason']] += 1
for reason, n in c.most_common():
    print(f' delete/{reason}: {n}')
"

# Safety gate: refuse to delete more than MAX_DELETE_PCT of records. If we
# hit this, something is wrong — probably CP admin API returned no orgs,
# making every tenant look orphan. Bail before nuking production.
if [ "$TOTAL_CF" -gt 0 ]; then
  PCT=$(( DELETE_COUNT * 100 / TOTAL_CF ))
  if [ "$PCT" -gt "$MAX_DELETE_PCT" ]; then
    log ""
    log "SAFETY: would delete $PCT% of records (threshold $MAX_DELETE_PCT%) — refusing."
    log " If this is expected (e.g. major cleanup after incident), rerun with"
    log " MAX_DELETE_PCT=$((PCT+5)) $0 $*"
    exit 2
  fi
fi

if [ "$DRY_RUN" = "1" ]; then
  log ""
  log "Dry run complete. Pass --execute to actually delete $DELETE_COUNT records."
  log ""
  log "First 20 records that would be deleted:"
  # Read every decision before slicing: counting deletes (not input lines)
  # keeps the preview populated when deletes sit late in the list, and
  # draining stdin avoids a SIGPIPE on the echo under pipefail.
  echo "$DECISIONS" | python3 -c "
import json, sys
deletes = [d for d in map(json.loads, sys.stdin) if d['action'] == 'delete']
for d in deletes[:20]:
    print(f\" {d['reason']:25s} {d['name']}\")
"
  exit 0
fi

# --- Execute deletes -------------------------------------------------------

log ""
log "Executing $DELETE_COUNT deletions..."
DELETED=0
FAILED=0
while IFS= read -r line; do
  action=$(echo "$line" | python3 -c "import json,sys; print(json.loads(sys.stdin.read())['action'])")
  [ "$action" = "delete" ] || continue
  rid=$(echo "$line" | python3 -c "import json,sys; print(json.loads(sys.stdin.read())['id'])")
  name=$(echo "$line" | python3 -c "import json,sys; print(json.loads(sys.stdin.read())['name'])")
  if curl -sS -m 10 -X DELETE \
      -H "Authorization: Bearer $CF_API_TOKEN" \
      "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records/$rid" \
      | grep -q '"success":true'; then
    DELETED=$((DELETED+1))
  else
    FAILED=$((FAILED+1))
    log " FAILED: $name ($rid)"
  fi
done <<< "$DECISIONS"

log ""
log "Done. deleted=$DELETED failed=$FAILED"
[ "$FAILED" -eq 0 ]
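
The safety gate's percentage uses integer arithmetic, so the fraction is truncated before it is compared with the threshold. The sketch below isolates that calculation; `delete_pct` and `gate_allows` are hypothetical helper names that mirror the script's `PCT` computation and its `-gt "$MAX_DELETE_PCT"` test, and are not functions the script defines.

```bash
#!/usr/bin/env bash
# Illustrative only: delete_pct and gate_allows are hypothetical helpers
# that mirror the arithmetic of the safety gate in sweep-cf-orphans.sh.

# Integer percentage, truncated toward zero (same expression as PCT).
delete_pct() {
  local delete_count="$1" total="$2"
  echo $(( delete_count * 100 / total ))
}

# Succeeds when the sweep would be allowed to proceed.
gate_allows() {
  local delete_count="$1" total="$2" max_pct="${3:-50}"
  # The script skips the gate entirely when there are no records.
  [ "$total" -gt 0 ] || return 0
  [ "$(delete_pct "$delete_count" "$total")" -le "$max_pct" ]
}

delete_pct 101 200   # prints 50 (50.5% truncated), so the default gate allows it
delete_pct 102 200   # prints 51, so the default gate refuses
```

Because of the truncation, a run that would delete just over half of the records can still pass a threshold of 50; a stricter bound needs a lower `MAX_DELETE_PCT`.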