ci: synthetic-check cron for AUTO_SYNC_TOKEN rotation drift detection (post-#66 hostile-self-review #3) #72

Closed
opened 2026-05-07 22:20:17 +00:00 by claude-ceo-assistant · 2 comments

Context

PR #66 fixed auto-sync main→staging by replacing the broken gh pr create (Gitea 405 on GraphQL) with a direct git push from the devops-engineer persona's AUTO_SYNC_TOKEN. The hostile self-review of that PR flagged weakest-spot #3:

Token rotation silently breaks auto-sync. If AUTO_SYNC_TOKEN is rotated without updating the repo secret, every push to main fails red on the auto-sync push step. The workflow surfaces the failure mode in the step summary (failure mode B in the header), but there's no proactive monitoring. Detection latency: rotation is only caught when the next main push triggers auto-sync.

In the worst case (slow main-push cadence), the gap between rotation and detection could be many hours. During that window, every commit to main fails to propagate to staging; auto-promote-staging.yml then sees a divergent staging that isn't a superset of main, and the "staging is a superset of main" invariant is silently broken.

What this issue tracks

Add a low-frequency cron-triggered synthetic check that fires the auto-sync auth surface (or a cheap variant) and emits a clear red signal if AUTO_SYNC_TOKEN has drifted out of validity.

Investigation findings

What AUTO_SYNC_TOKEN does today

Used in two workflows (grep finds no others):

  • auto-sync-main-to-staging.yml — PR #66's direct push from devops-engineer persona
  • publish-workspace-server-image.yml — oauth2:<token> basic-auth for cloning manifest deps in CI

Failure modes from the auto-sync header:

  • A: staging conflicts with main → already detected immediately on merge, no synthetic check needed
  • B: token rotated / wrong scope → THIS IS WHAT THIS ISSUE ADDRESSES
  • C: branch protection no longer whitelists devops-engineer → already monitored by branch-protection-drift.yml (daily cron)
  • D: concurrent push race → handled by workflow concurrency group

So: synthetic check focuses on B.

Decision: Option B (read-only verify), rejecting A and C

Option A — full auto-sync on schedule: REJECTED. At a 6h cadence that is 4 runs per day, so 4 synthetic merge commits per day on staging when main hasn't advanced. That's pure history clutter. Worse: if main has advanced, the scheduled run races the real push: trigger.

Option B — token-validity probe (pick this): cron-triggered workflow that does:

  1. GET /api/v1/user against Gitea with the token → validates auth + identity (expects username == devops-engineer)
  2. GET /api/v1/repos/molecule-ai/molecule-core with the token → validates read:repository scope on this repo
  3. git ls-remote https://oauth2:$AUTO_SYNC_TOKEN@git.moleculesai.app/molecule-ai/molecule-core staging → validates the exact HTTPS auth path used by the actions/checkout step in the real workflow
  4. (Optional) git push --dry-run origin staging from a noop synthetic branch — exercises the connection + ref-negotiation path, but does NOT exercise pre-receive hook (so does not validate authz; Option C does, but authz is already covered by branch-protection-drift)

Pros: cheap (~3 HTTPS calls, ~5s wall-clock), zero side-effects on staging, no branch noise.
Cons: doesn't validate the protected-branch push whitelist authz on its own. (Branch-protection-drift.yml is the canonical gate for that, daily.)
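
Probes 1 and 2 above could be sketched in inline shell roughly as follows. This is a minimal sketch, not the actual workflow: it assumes curl and jq are available on the runner, that the token is sent as an `Authorization: token ...` header, and that the identity is read from the `login` field of the response. The function names (`classify_user_probe`, `fetch_user`, `repo_status`) and the `GITEA_URL` variable are hypothetical. The decision logic is kept separate from the network calls so it can be exercised without a live Gitea.

```shell
#!/usr/bin/env bash
# Sketch of probes 1 and 2 (read-only). All names here are illustrative.

GITEA_URL="${GITEA_URL:-https://git.moleculesai.app}"   # host taken from the issue text
EXPECTED_PERSONA="${EXPECTED_PERSONA:-devops-engineer}"

# Pure decision logic: map (HTTP status, resolved login) to a verdict.
classify_user_probe() {
  local status="$1" login="$2" expected="$3"
  if [ "$status" != "200" ]; then
    echo "RED: GET /api/v1/user returned HTTP $status (token rotated or revoked?)"
    return 1
  fi
  if [ "$login" != "$expected" ]; then
    echo "RED: token resolves to '$login', expected '$expected'"
    return 1
  fi
  echo "GREEN: token valid for '$login'"
}

# Probe 1 network call: prints "<http_status> <login>".
fetch_user() {
  local body status
  body="$(mktemp)"
  status="$(curl -sS -o "$body" -w '%{http_code}' \
    -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
    "${GITEA_URL}/api/v1/user")"
  echo "$status $(jq -r '.login // empty' "$body" 2>/dev/null)"
  rm -f "$body"
}

# Probe 2 network call: prints the HTTP status of the repo read.
repo_status() {
  curl -sS -o /dev/null -w '%{http_code}' \
    -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
    "${GITEA_URL}/api/v1/repos/molecule-ai/molecule-core"
}

# Usage inside a workflow step (not executed here):
#   read -r status login < <(fetch_user)
#   classify_user_probe "$status" "$login" "$EXPECTED_PERSONA" || exit 1
#   [ "$(repo_status)" = "200" ] || { echo "RED: repo read failed"; exit 1; }
```

Keeping the verdict logic in a pure function also makes the mutation test cheap: a junk token or wrong persona can be simulated by passing a 401 or a different login directly.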

Option C — push to dedicated auto-sync-canary branch: REJECTED. Tests authz too, but: (a) branch noise on Gitea, (b) requires adding the canary branch to staging's push_whitelist or a new protection — more YAML drift, (c) authz validation is already done daily by branch-protection-drift.yml. Don't duplicate.

Prior art

  • Cloudflare API tokens: have a dedicated /user/tokens/verify endpoint specifically for canary scripts to validate "is this token still good" with no side effects. Gitea's equivalent is GET /api/v1/user.
  • AWS Secrets Manager rotation Lambda: includes a testSecret step that calls the target service with the new credential before promoting AWSPENDING to AWSCURRENT. Same shape — auth probe before commit.
  • HashiCorp Vault: secret_id periodic health checks via vault token lookup to detect renewal failures.

The canonical pattern is: a dedicated read-only auth-validity endpoint + cron canary. Option B applies that pattern verbatim.

Cadence

6h is the brief's suggestion. Justification:

  • Cost: 4 runs/day, each ~5s wall-clock and ~3 HTTPS calls: trivial (Gitea is self-hosted, no API quota)
  • Detection latency: average 3h between rotation and red signal, max 6h. Acceptable for an event that happens at most a few times per quarter.
  • Alternative 1h: 24 runs/day (6× the runs), 6× shorter latency. Marginal benefit for our rotation cadence.
  • Alternative daily: 1 run/day (4× cheaper), 4× longer latency. A rotation could go unflagged by the canary for up to 24h, which is barely better than the status quo (we'd still notice on the first main push).

6h dominates daily, and the marginal benefit of 1h doesn't justify 6× the noise in the actions feed.
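
The cadence trade-off can be checked with a few lines of shell arithmetic. This sketch assumes a rotation is equally likely at any moment within a cron interval, so the mean detection latency is half the interval; `cadence_row` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# For a cron interval of N hours: runs/day = 24/N, worst-case latency = N hours,
# mean latency = N/2 hours (reported in minutes to stay in integer arithmetic).
cadence_row() {
  local interval_h="$1"
  local runs=$(( 24 / interval_h ))
  local mean_min=$(( interval_h * 60 / 2 ))
  echo "interval=${interval_h}h runs/day=${runs} max_latency=${interval_h}h mean_latency=${mean_min}min"
}

for h in 1 6 24; do
  cadence_row "$h"
done
# interval=1h runs/day=24 max_latency=1h mean_latency=30min
# interval=6h runs/day=4 max_latency=6h mean_latency=180min
# interval=24h runs/day=1 max_latency=24h mean_latency=720min
```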

Token-scoping security

No new token. Reuses secrets.AUTO_SYNC_TOKEN (read scope is sufficient — Option B does not push). The synthetic check has the same blast-radius profile as the workflow it's monitoring.

Surfaces affected

  • New: .github/workflows/auto-sync-canary.yml
  • No code changes
  • No script changes (curl + git suffice in inline shell)
  • Runbook: brief inline in the workflow header (matches the auto-sync workflow's own header convention)

Plan (Phase 2 design → Phase 3 implement → Phase 4 verify)

  • One new workflow file, one PR
  • Header comment in PR #66 shape: what / why / failure modes / runbook
  • Mutation test: temporarily set token to junk in a fork branch, confirm RED with actionable message
  • No follow-up issues unless a richer alerting surface (Slack/Discord webhook) is requested — current proposal: red workflow status only, operator polls Gitea actions feed (which is the same surface used by auto-promote-stale-alarm.yml).

Coordination

  • 3 sister agents in flight (provisioner #194, retarget bundle #195+#196, sweep agent #197). New file, no overlap with their existing-workflow edits.
  • core/main is 20/20 GREEN post-#66.
Author
Owner

Implementation in PR #77 (fix/issue-72-auto-sync-token-canary-v2 → main). Phase 3 done; Phase 4 verification pending merge + manual trigger + mutation test.
Author
Owner

Phase 4 verification update

Local probe verification (since Gitea 1.22.6 doesn't expose REST workflow_dispatch)

Ran all three probes against live Gitea using a real token, then mutated each.

Probe 1 — GET /api/v1/user:

  • Valid token: HTTP 200, username == claude-ceo-assistant (when run with my token; will be devops-engineer in production).
  • Junk token: HTTP 401, error message: Token rotation suspected: GET /api/v1/user returned HTTP 401 ... Likely cause: AUTO_SYNC_TOKEN has been rotated/revoked on Gitea but the repo Actions secret was not updated. Runbook: see header comment of this workflow file.
  • Wrong-persona token (valid claude-ceo-assistant token but EXPECTED_PERSONA=devops-engineer): HTTP 200, then persona check fails with: Token resolves to user 'claude-ceo-assistant', expected 'devops-engineer'. AUTO_SYNC_TOKEN must be the devops-engineer persona PAT (not founder PAT, not another persona).

Probe 2 — GET /api/v1/repos/molecule-ai/molecule-core: HTTP 200 with valid token (read scope confirmed).

Probe 3 — original git ls-remote refs/heads/staging: REJECTED on review. Discovered Gitea falls back to anonymous read on public repos, so ls-remote succeeded even with a junk token. False-green — the worst possible canary failure mode. Rewrote to use git push --dry-run of current staging SHA back to staging:

  • Valid token: Everything up-to-date, exit 0.
  • Junk token: fatal: Authentication failed for ..., exit 128. Error message: Token rotation suspected: git push --dry-run against staging failed via the AUTO_SYNC_TOKEN HTTPS auth path (exit 128). This is the EXACT auth path that actions/checkout + git push use in auto-sync-main-to-staging.yml. Likely cause: AUTO_SYNC_TOKEN was rotated/revoked on Gitea but the repo Actions secret was not updated.

Because git push requires a local repo, the workflow now does git init in a tempdir (~50ms, ~1KB) instead of actions/checkout (which would clone hundreds of MB).
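
The rewritten probe 3 might look something like the sketch below. It is not the code from PR #77. It assumes git accepts a full 40-hex SHA as the push source even though the object is absent from the empty local repo, which seems consistent with the `Everything up-to-date` result reported above but should be verified against the runner's git version. The function name `probe_push_dry_run` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of probe 3: exercise the authenticated push path without changing the branch.
probe_push_dry_run() {
  local remote_url="$1" branch="$2" sha="$3"
  local workdir rc=0
  workdir="$(mktemp -d)"
  # A throwaway empty repo is enough; no checkout of the real tree is needed.
  git init -q "$workdir" >/dev/null 2>&1
  # Push the branch's current SHA back onto itself. --dry-run updates nothing,
  # but the push handshake still has to authenticate.
  GIT_TERMINAL_PROMPT=0 git -C "$workdir" push --dry-run \
    "$remote_url" "${sha}:refs/heads/${branch}" >/dev/null 2>&1 || rc=$?
  rm -rf "$workdir"
  if [ "$rc" -ne 0 ]; then
    # Deliberately omit the URL from the message: it embeds the token.
    echo "RED: git push --dry-run against ${branch} failed (exit ${rc})"
    return 1
  fi
  echo "GREEN: push auth path OK for ${branch}"
}

# Usage inside a workflow step (not executed here):
#   url="https://oauth2:${AUTO_SYNC_TOKEN}@git.moleculesai.app/molecule-ai/molecule-core"
#   sha="$(git ls-remote "$url" refs/heads/staging | cut -f1)"
#   probe_push_dry_run "$url" staging "$sha" || exit 1
```

Reading the SHA with ls-remote is acceptable here even though that read can succeed anonymously, because the SHA is only an input; the pass/fail signal comes from the push handshake.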

CI status on PR #77

  • 22/24 required checks: GREEN (after fix on 0cef033a)
  • 1 known infra issue: pr-guards / disable-auto-merge-on-push — depends on molecule-ai/molecule-ci reusable workflow that appears to be unavailable on Gitea. Pre-existing org-wide; not introduced by this PR. Deferred.
  • 1 currently pending: another CI cycle running on the latest commit e4e1bf40 (post-self-review comment update); will settle GREEN/22-of-23 (excluding pr-guards).

Hostile self-review weakest-3

  1. First-6h dark window: schedule trigger doesn't fire until ~6h post-merge (cron at :17 every 6h). The workflow_dispatch trigger lets an operator run a manual probe immediately after merge; will do as a final verification step. Not a code defect.
  2. EXPECTED_PERSONA hardcode coupling: persona rename requires updating both auto-sync-main-to-staging.yml and this canary's env var. Addressed by adding an inline comment pointing the next editor at both files (commit e4e1bf40).
  3. Probe 3 race window: theoretical — staging deleted between ls-remote and push --dry-run. --dry-run semantics specifically don't transmit, so even in the race no actual ref-create happens. And branch protection prevents staging deletion. Documented for completeness.

Outstanding follow-ups

None within scope. Possible future enhancements (out of scope for this issue):

  • Slack/Discord webhook on RED instead of relying on operator polling Gitea actions feed
  • Extend canary to validate the full v2 scope contract (write:repository, read:user, etc.) — currently only validates the read paths the canary itself uses
Reference: molecule-ai/molecule-core#72