RCA: CF orphan sweep fails zone preflight due to token location/policy #2690

Open
opened 2026-06-13 01:20:29 +00:00 by agent-researcher · 2 comments
Member

MECHANISM: molecule-core scheduled CF orphan sweep is failing during Cloudflare zone preflight, before any record gather or delete. The workflow maps `CF_API_TOKEN`/`CF_ZONE_ID` from Actions secrets and hard-fails scheduled runs when the janitor cannot run (`.gitea/workflows/sweep-cf-orphans.yml:96-160`). The script first verifies the token, then calls `GET /client/v4/zones/$CF_ZONE_ID`; that zone check fails and exits non-zero (`scripts/ops/sweep-cf-orphans.sh:86-148`). Because token verify succeeds but zone lookup fails, the likely failure class is Cloudflare token zone/read policy or source-location restriction, not missing secrets.

EVIDENCE: Run `355783`, job `482500`, commit `58c82215b3343ec09f5e830951531f4fd4c219b7` failed in `Sweep CF orphans`. Log excerpt: `CF token active ✓`; then `Cannot use the access token from location`; then `zone ... unreachable or token lacks Zone:Read`. The workflow is a schedule run, so line `.gitea/workflows/sweep-cf-orphans.yml:112` intentionally treats this as red rather than a soft skip.

RECOMMENDED FIX SHAPE: Update the Cloudflare credential/policy used by molecule-core Actions for `CF_API_TOKEN`/`CLOUDFLARE_API_TOKEN` and `CF_ZONE_ID`/`CLOUDFLARE_ZONE_ID`, not product code. Responsible surfaces are the repo Actions secrets and `scripts/ops/sweep-cf-orphans.sh` preflight. Ensure the token has Zone:Read/DNS permissions for the `moleculesai.app` zone and is valid from the runner egress location, or provision a CI-scoped token without that location restriction.
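ILLUSTRATION: A minimal sketch of the preflight reasoning described in this comment, not the code in `scripts/ops/sweep-cf-orphans.sh`. It assumes both responses use the standard Cloudflare v4 envelope (`success` plus an `errors` list of `code`/`message` objects). The function name `classify_preflight` and the returned labels are hypothetical; the matched message text is the one from the log excerpt.

```python
from typing import Any

# Message text copied from the log excerpt in this issue.
LOCATION_MESSAGE = "Cannot use the access token from location"


def classify_preflight(verify: dict[str, Any], zone: dict[str, Any]) -> str:
    """Map the token-verify and zone-lookup responses to a failure class.

    The labels returned here are hypothetical names for this sketch.
    """
    # Step 1: the token itself must verify.
    if not verify.get("success"):
        return "token-invalid"
    # Step 2: the zone lookup must succeed with that token.
    if zone.get("success"):
        return "ok"
    # Token is active but the zone call failed: inspect the error messages.
    messages = [str(err.get("message", "")) for err in zone.get("errors") or []]
    if any(LOCATION_MESSAGE in message for message in messages):
        return "token-location-restricted"
    return "zone-unreachable-or-zone-read-missing"
```

With an active token and a zone response carrying the location message, this returns `token-location-restricted`, which matches the failure class proposed above and rules out missing secrets.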

Author
Member

MECHANISM: The hourly Cloudflare orphan sweep is failing before any gather/delete decision because the configured CF token is active but cannot access the zone from this runner location. `.gitea/workflows/sweep-cf-orphans.yml:96-159` correctly verifies required secrets, then runs `scripts/ops/sweep-cf-orphans.sh --execute`; the script's preflight token check succeeds, but the zone lookup fails with Cloudflare error 9109. That means the janitor never reaches the CP/AWS/record classification logic, so this is an operator-token/location-policy issue, not a sweep deletion bug.

EVIDENCE: molecule-core scheduled run `356747`, job `484276`, head `094da1609d60cd0f830d53d8838547cb88ef0627`. Logs show all required secrets present, `CF token active`, then `zone lookup returned success=false: 9109`. The specific Cloudflare message is `Cannot use the access token from location: 2a01:4f8:222:dc3::2`. The workflow comments at `.gitea/workflows/sweep-cf-orphans.yml:96-136` intentionally hard-fail scheduled runs when the janitor cannot operate, so the red is expected fail-loud behavior.

RECOMMENDED FIX SHAPE: Update the Cloudflare token/policy used by molecule-core Actions so the runner egress location can read the `moleculesai.app` zone, or run the sweep from an allowed egress. Keep the workflow fail-loud; do not soften it to green/skip, because `.gitea/workflows/sweep-cf-orphans.yml:101-108` documents the prior silent skip that hid active DNS leaks. Responsible surface: Cloudflare token/location policy plus the repo/org secrets `CF_API_TOKEN`/`CF_ZONE_ID` (or canonical `CLOUDFLARE_*`).
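ILLUSTRATION: A rough sketch of the fail-loud rule this comment asks to keep, not the workflow's actual step. The function name `preflight_exit_code` is hypothetical, and the event name `schedule` assumes the Actions-style `event_name` value. The thread only confirms that scheduled runs go red when the janitor cannot operate; treating other triggers as a soft skip is an assumption made here for contrast.

```python
def preflight_exit_code(event_name: str, preflight_ok: bool) -> int:
    """Return the exit code a sweep step could use after preflight."""
    if preflight_ok:
        return 0
    # Confirmed by the thread: scheduled runs must fail loudly (non-zero).
    if event_name == "schedule":
        return 1
    # Assumption for this sketch only: non-scheduled triggers soft-skip.
    return 0
```

The point of the design is that a scheduled janitor that cannot reach Cloudflare must never report green, since a silent skip is what previously hid active DNS leaks.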

Author
Member

MECHANISM: The scheduled Cloudflare DNS orphan sweep is still blocked before any gather/delete logic runs. `.gitea/workflows/sweep-cf-orphans.yml:141-159` invokes `scripts/ops/sweep-cf-orphans.sh --execute`; the script verifies the token first, then calls `/client/v4/zones/$CF_ZONE_ID` at `scripts/ops/sweep-cf-orphans.sh:118-147`. Token verification succeeds, but the zone lookup returns Cloudflare `success=false`, so the preflight exits 1 and the janitor never reaches the orphan decision/delete path. This is owner Cloudflare credential/location policy, not a molecule-core code regression.

EVIDENCE: main scheduled run 362866 / job 495110 on 9595757a failed in `Sweep CF orphans`. Log excerpt: `CF token active ✓`; then `9109: Cannot use the access token from location`; then `CF preflight FAILED`. Current main 03e323e3 has required status aggregate success, and the current-head push E2E lanes I checked are green: E2E Chat 362881, Peer Visibility 362882, Staging Canvas 362883, Harness Replays 362885, Local Provision 362888. The other recent reds are staging-smoke / continuous-synth HTTP 400 org-create failures already tracked under #2737.

RECOMMENDED FIX SHAPE: Owner/infra should rotate or re-scope the Cloudflare token used by `sweep-cf-orphans.yml` so it permits Zone:Read/DNS operations from the Gitea runner egress location, or route this scheduled janitor through an allowed egress. Do not patch the repo sweep logic for this symptom: the preflight is correctly failing closed rather than silently skipping a DNS-leak cleanup.
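ILLUSTRATION: When re-scoping the token, owner/infra may want to confirm that the runner egress address reported earlier in this thread (`2a01:4f8:222:dc3::2`) falls inside whatever client IP ranges the new token allows. This is a small sketch of that check, assuming the restriction is an IP-range filter on the token. The helper name `egress_allowed` and the CIDR ranges in the usage note are hypothetical; the real allowlist is not stated in this issue.

```python
import ipaddress


def egress_allowed(egress_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if egress_ip falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(egress_ip)
    for cidr in allowed_cidrs:
        # strict=False tolerates ranges written with host bits set.
        net = ipaddress.ip_network(cidr, strict=False)
        # Compare only within the same IP family (IPv4 vs IPv6).
        if addr.version == net.version and addr in net:
            return True
    return False
```

For example, `egress_allowed("2a01:4f8:222:dc3::2", ["203.0.113.0/24"])` is false because the allowlist is IPv4-only, while adding a hypothetical range such as `2a01:4f8:222:dc3::/64` makes it true. An IPv4-only allowlist paired with an IPv6 runner egress is one plausible way to end up with the location error seen here.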

Reference: molecule-ai/molecule-core#2690