sweep-cf-orphans: CF_API_TOKEN is invalid OR diverged from SSOT (mc#1529 §4) #1547

Open
opened 2026-05-19 00:17:00 +00:00 by core-devops · 1 comment
Member

Sub-issue of #1529. Root-caused 2026-05-18.

Pattern

4/9 hourly runs (44%) of sweep-cf-orphans.yml failed. All 10 most recent failures share the identical signature:

[hh:mm:ss] Fetching Cloudflare DNS records...
ERROR: Cloudflare DNS list failed: 10000: Authentication error
[hh:mm:ss] Cloudflare DNS list failed; verify CF_API_TOKEN has Zone:DNS:Edit
  and CF_ZONE_ID is the moleculesai.app zone.
##[error]Process completed with exit code 1.

The secret-presence pre-check passes (All required secrets present ✓), so CF_API_TOKEN is set in the Gitea repo secret store. However, the Cloudflare API rejects the configured value with 10000: Authentication error on every call.

Root cause

CF_API_TOKEN in molecule-core's Gitea repo secrets is invalid — either:

  • (a) revoked / expired / never-valid at Cloudflare
  • (b) lost the Zone:DNS:Edit permission scope
  • (c) was meant to be the Zone-scoped token but a user/global one got pasted instead (or vice versa)

Secondary finding: CF_API_TOKEN is not in the operator-host SSOT (/etc/molecule-bootstrap/all-credentials.env) and not in Infisical. It exists only as a Gitea repo secret. Per feedback_unified_credentials_file and reference_infisical_ssot, this is drift: Infisical should be the SSOT and the Gitea secret store should be a mirror of it, rather than the secret living in Gitea only.
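
This kind of drift can be detected by comparing secret names rather than values. A minimal sketch in Python, assuming the operator-host file uses plain KEY=VALUE lines and that the list of Gitea secret names has already been fetched by some other means. The helpers parse_env_keys and find_drift and the sample key EXAMPLE_OTHER_SECRET are hypothetical and are not part of the repo:

```python
from __future__ import annotations

from typing import Iterable


def parse_env_keys(text: str) -> set[str]:
    """Return the variable names defined in KEY=VALUE env-file text."""
    keys: set[str] = set()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Tolerate an optional leading "export ".
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        name, sep, _value = line.partition("=")
        if sep and name.strip():
            keys.add(name.strip())
    return keys


def find_drift(ssot_keys: Iterable[str], mirror_keys: Iterable[str]) -> dict[str, list[str]]:
    """Report names present on only one side of the SSOT/mirror pair."""
    ssot, mirror = set(ssot_keys), set(mirror_keys)
    return {
        "mirror_only": sorted(mirror - ssot),  # e.g. a Gitea-only secret
        "ssot_only": sorted(ssot - mirror),    # in the SSOT but never mirrored
    }


if __name__ == "__main__":
    # Illustrative input only; real files and secret names will differ.
    sample_env = "# illustrative only\nEXAMPLE_OTHER_SECRET=placeholder\n"
    gitea_secret_names = ["CF_API_TOKEN", "EXAMPLE_OTHER_SECRET"]
    print(find_drift(parse_env_keys(sample_env), gitea_secret_names))
    # prints {'mirror_only': ['CF_API_TOKEN'], 'ssot_only': []}
```

Only names are compared, so the check never needs to read or print a secret value.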

Class

(c) real bug — the workflow is doing its job (it's failing because the live CF token is broken, which is exactly the kind of janitor failure that needs surfacing). The 'flakiness' is 100% deterministic — every run fails because the token is dead.

Severity / impact

  • The chronic CF DNS quota leak (152/200 records, caught manually on 2026-04-28) is not currently being swept. Each hour the workflow tries and fails. If the zone hits 200 records again, provisions will fail with CF error code 81045.
  • Recovery has been manual since at least 2026-05-15 (10+ consecutive failed sweeps).

Fix path (requires CTO — CF token is a creds rotation)

  1. CTO regenerates a Cloudflare User API Token with Zone:DNS:Edit scope on the moleculesai.app zone (per feedback_passwords_in_chat_are_burned recipe — User-owned, not Account-owned, per AGENTS.md §8 pitfall).
  2. Add it to Infisical under /shared/cloudflare/CF_API_TOKEN (canonical) and mirror to operator-host all-credentials.env (cache) per feedback_unified_credentials_file.
  3. Push to Gitea repo secret molecule-core/CF_API_TOKEN via the existing runner-config mirror (or one-shot PUT /repos/molecule-ai/molecule-core/actions/secrets/CF_API_TOKEN).
  4. Trigger one manual run via the Gitea Actions UI, and confirm that https://api.cloudflare.com/client/v4/user/tokens/verify returns success=true for the new token.
  5. Audit + delete the burned old token from the Cloudflare dashboard.
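
The verification in step 4 could be scripted along these lines. This is a sketch under stated assumptions, not the repo's implementation: the endpoint URL is the one named above, the result.status field and the errors list are assumed from Cloudflare's documented response envelope, and the function names are hypothetical. The sample failure payload is reconstructed from the log line in this issue:

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request

VERIFY_URL = "https://api.cloudflare.com/client/v4/user/tokens/verify"


def summarize_envelope(payload: dict) -> tuple[bool, str]:
    """Turn a Cloudflare response envelope into (is_usable, human-readable detail)."""
    if payload.get("success") is True:
        # Assumption: result.status is "active" for a usable token.
        status = (payload.get("result") or {}).get("status", "unknown")
        return status == "active", f"token status: {status}"
    errors = payload.get("errors") or []
    detail = "; ".join(f"{e.get('code')}: {e.get('message')}" for e in errors)
    return False, detail or "success=false with no error detail"


def verify_token(token: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Call the verify endpoint with a Bearer token and summarize the answer."""
    request = urllib.request.Request(
        VERIFY_URL, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a JSON envelope worth reporting.
        body = exc.read()
    try:
        return summarize_envelope(json.loads(body))
    except ValueError:
        return False, "response was not valid JSON"


if __name__ == "__main__":
    # Offline demo using the failure seen in the workflow log.
    sample_failure = {
        "success": False,
        "errors": [{"code": 10000, "message": "Authentication error"}],
    }
    print(summarize_envelope(sample_failure))
    # prints (False, '10000: Authentication error')
```

Calling verify_token with the freshly generated value before pushing it to Gitea would catch a mis-pasted token earlier than the next hourly run.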

Out of scope here

  • Adding the SSOT mirror is filed as a derivative of #1529.
  • The feedback_mol_secret_v2_bashx_dumps_credentials memo notes that during this diagnosis, bash -x mol_secret_v2 accidentally dumped Stripe live keys to chat. Those keys must be rotated separately (CTO has been paged via a parallel note).

Boundary

Do NOT silent-skip the workflow. The whole point of the visible-error pre-check is to surface broken janitor state — not auto-disable on auth failure. The previous silent-skip (pre-2026-04-28) is what let the 152/200 CF leak grow unnoticed.

Member

RCA — root cause

sweep-cf-orphans is not failing because the Cloudflare secret is absent; it is failing because the repo-scoped CF_API_TOKEN that the workflow reads is no longer accepted by Cloudflare, or it is scoped to the wrong zone/permissions. The workflow only validates secret presence before running, so a stale/revoked token passes local preflight and then fails at the vendor API boundary.

Evidence

  • Issue log excerpt shows Cloudflare returning 10000: Authentication error during DNS record fetch while the secret-presence precheck passes.
  • .gitea/workflows/sweep-cf-orphans.yml:82-87 reads CF_API_TOKEN and CF_ZONE_ID directly from Gitea repo secrets.
  • .gitea/workflows/sweep-cf-orphans.yml:94-130 validates only that required secrets are non-empty.
  • scripts/ops/sweep-cf-orphans.sh:97-99 calls the Cloudflare DNS API with CF_API_TOKEN; scripts/ops/sweep-cf-orphans.sh:109-125 hard-fails on Cloudflare success=false and names the needed Zone:DNS:Edit scope.

Suggested fix

Route this into the Phase 3.2 SSOT/Infisical credential cleanup bucket rather than treating it as a one-off workflow flake. Provision a fresh Cloudflare user API token with Zone:DNS:Edit on the moleculesai.app zone, store it with CF_ZONE_ID under a durable Cloudflare DNS path in Infisical/operator SSOT, mirror that value into the Gitea repo secret used by this workflow, and add a vendor-validity check such as Cloudflare token verify or a read-only DNS list probe before declaring the mirror healthy. Keep tunnel credentials separate from DNS credentials if sweep-cf-tunnels needs account-level tunnel scopes.
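
The gap between "secret is present" and "secret is accepted by the vendor" can be made explicit in a preflight. The following is a rough sketch only, assuming a probe callable (for example a token-verify call or a read-only DNS list) is injected by the caller. The names preflight, Probe, and stale_probe are hypothetical, and the placeholder values are not real credentials:

```python
from __future__ import annotations

from typing import Callable, Mapping, Tuple

REQUIRED_SECRETS = ("CF_API_TOKEN", "CF_ZONE_ID")

# A probe takes (token, zone_id) and returns (accepted, detail).
Probe = Callable[[str, str], Tuple[bool, str]]


def preflight(env: Mapping[str, str], probe: Probe) -> list[str]:
    """Return a list of problems; an empty list means the credentials look healthy."""
    # Stage 1: presence, which is all the current workflow checks.
    problems = [
        f"{name} is missing or empty"
        for name in REQUIRED_SECRETS
        if not env.get(name, "").strip()
    ]
    if problems:
        return problems  # nothing to probe with
    # Stage 2: validity, decided by the vendor rather than by local state.
    accepted, detail = probe(env["CF_API_TOKEN"], env["CF_ZONE_ID"])
    if not accepted:
        problems.append(f"Cloudflare rejected the credentials: {detail}")
    return problems


if __name__ == "__main__":
    def stale_probe(token: str, zone_id: str) -> Tuple[bool, str]:
        # Stand-in for a real vendor call; mimics the failure in this issue.
        return False, "10000: Authentication error"

    env = {"CF_API_TOKEN": "placeholder-token", "CF_ZONE_ID": "placeholder-zone"}
    for problem in preflight(env, stale_probe):
        print("ERROR:", problem)
```

A caller would exit non-zero whenever the returned list is non-empty, so a rejected token stays a visible failure instead of a skipped run.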

Confidence

High — the failure mode is deterministic vendor auth rejection after local secret presence passes; direct workflow/script references show there is no SSOT validity reconciliation before the Cloudflare call.

Reference: molecule-ai/molecule-core#1547