name: Auto-sync canary — AUTO_SYNC_TOKEN rotation drift

# Synthetic health check for the AUTO_SYNC_TOKEN secret consumed by
# auto-sync-main-to-staging.yml (PR #66) and publish-workspace-server-image.yml.
#
# ============================================================
# Why this workflow exists
# ============================================================
#
# PR #66 fixed auto-sync (replaced GitHub-era `gh pr create` — which
# 405s on Gitea's GraphQL endpoint — with a direct git push from the
# `devops-engineer` persona's `AUTO_SYNC_TOKEN`). Hostile self-review
# weakest spot #3 of that PR:
#
#   "Token rotation silently breaks auto-sync. If AUTO_SYNC_TOKEN is
#   rotated without updating the repo secret, every push to main
#   fails red on the auto-sync push step. The workflow surfaces the
#   failure mode in the step summary (failure mode B in the header),
#   but there's no proactive monitoring."
#
# Detection latency under the status quo: rotation is only caught on
# the next push to `main`. During quiet periods (no main push for
# many hours) the staging-superset-of-main invariant silently breaks.
#
# This workflow closes the gap: every 6 hours, it fires the auth
# surface that auto-sync depends on and emits a red workflow status
# if AUTO_SYNC_TOKEN has drifted out of validity.
#
# ============================================================
# What this checks (Option B — read-only verify)
# ============================================================
#
# 1. `GET /api/v1/user` against Gitea with the token → validates the
#    token authenticates AND resolves to `devops-engineer` (catches
#    the case where the token was regenerated under a different
#    persona by mistake).
# 2. `GET /api/v1/repos/molecule-ai/molecule-core` with the token →
#    validates the token has `read:repository` scope on this repo
#    (the v2 scope contract — see saved memory
#    `reference_persona_token_v2_scope`).
# 3.
#    `git push --dry-run` of the current staging SHA back to
#    `refs/heads/staging` via `https://oauth2:$TOKEN@host/...`
#    → validates the EXACT HTTPS basic-auth path that
#    `actions/checkout` + `git push origin staging` use inside
#    auto-sync-main-to-staging.yml. NOP by construction (push the
#    current tip to itself = "Everything up-to-date"); auth is
#    checked at the smart-protocol handshake BEFORE the empty-diff
#    computation, so bad token → exit 128 with "Authentication
#    failed". `git ls-remote` is NOT used as the auth check because
#    Gitea falls back to anonymous read on public repos and would
#    silently green-light a rotated token; it is used only to
#    discover the current staging SHA.
#
# Each step exits non-zero with an actionable error message if it
# fails. The workflow status itself is the operator-facing surface.
#
# ============================================================
# What this does NOT check (intentional)
# ============================================================
#
# - **Branch-protection authz** (failure mode C in auto-sync header):
#   would require an actual write to staging. Already monitored by
#   `branch-protection-drift.yml` daily. Don't duplicate.
# - **Conflict resolution** (failure mode A): a real conflict is
#   data-driven, not auth-driven; can't synthesise it without polluting
#   staging. Already surfaces immediately on the next main push.
# - **Concurrency** (failure mode D): handled by workflow concurrency
#   group on auto-sync, not a credential issue.
#
# ============================================================
# Why Option B (read-only) and not the alternatives
# ============================================================
#
# Considered + rejected (see issue #72 for full write-up):
#
# - **Option A — full auto-sync on schedule**: every run creates a
#   no-op merge commit on staging when main hasn't advanced. 4 noise
#   commits/day. And races the real `push:` trigger when main has
#   advanced. Rejected.
#
# - **Option C — push to dedicated `auto-sync-canary` branch**: would
#   exercise authz too, but adds branch noise on Gitea AND requires
#   maintaining a second branch protection (or expanding staging's
#   whitelist to a junk branch). Authz already covered by
#   `branch-protection-drift.yml`. Rejected.
#
# Prior art for the chosen Option B shape:
# - Cloudflare's `/user/tokens/verify` endpoint (read-only auth
#   probe explicitly designed for credential canaries).
# - AWS Secrets Manager rotation Lambda's `testSecret` step (auth
#   probe before promoting AWSPENDING → AWSCURRENT).
# - HashiCorp Vault's `vault token lookup` for renewal canaries.
#
# ============================================================
# Operator runbook — what to do when this workflow goes RED
# ============================================================
#
# 1. **Identify which step failed**:
#    - Step "Verify token authenticates as devops-engineer" red →
#      token is invalid OR resolves to wrong persona.
#    - Step "Verify token has repo read scope" red → token valid but
#      stripped of `read:repository` scope (or repo perms changed).
#    - Step "Verify git HTTPS auth path via no-op dry-run push to
#      staging" red → token rotated/revoked OR Gitea git-HTTPS
#      surface is broken (rare). Auth check happens on the
#      smart-protocol handshake, separate from the API path.
#
# 2. **Re-issue the token** on the operator host:
#    ```
#    ssh root@5.78.80.188 'docker exec --user git molecule-gitea-1 \
#      gitea admin user generate-access-token \
#      --username devops-engineer \
#      --token-name persona-devops-engineer-vN \
#      --scopes "read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc"'
#    ```
#    Update `/etc/molecule-bootstrap/agent-secrets.env` in place
#    (per `feedback_unified_credentials_file`). The previous token
#    file lands at `.bak.`.
#
# 3.
#    **Update the repo Actions secret** at:
#      Settings → Secrets and variables → Actions → AUTO_SYNC_TOKEN
#    Paste the new token. (Don't echo it in chat — but per
#    `feedback_passwords_in_chat_are_burned`, a paste in a 1:1
#    Claude session is within trust boundary.)
#
# 4. **Re-run this canary** via workflow_dispatch. Confirm GREEN.
#
# 5. **Backfill any missed main → staging syncs** by re-running
#    `auto-sync-main-to-staging.yml` from its workflow_dispatch
#    surface, OR by pushing an empty commit to main (if you'd
#    rather force a real trigger).
#
# ============================================================
# Security notes
# ============================================================
#
# - Token usage: read-only (`GET /api/v1/user`, `GET /api/v1/repos/...`,
#   `git ls-remote`, `git push --dry-run`). No write paths: the
#   dry-run push sends no objects and updates no refs. Same
#   blast-radius profile as `actions/checkout` on a public repo.
# - The token NEVER appears in logs: every `curl` passes it through an
#   env variable in the Authorization header, never as an inline
#   literal; the git remote URL builds the `oauth2:$TOKEN@host` form
#   into a single shell variable that's not echoed, and git error
#   output is passed through `sed` to redact it before printing.
#   GitHub Actions secret-masking covers anything that does slip
#   through.
# - No new token introduced — same `AUTO_SYNC_TOKEN` the workflow
#   under monitor uses. Per least-privilege we deliberately do NOT
#   broaden scope for the canary.

on:
  schedule:
    # Every 6 hours at :17 (offsets the cron herd at :00). Justification
    # from issue #72: cheap to run (~5s wall-clock, no quota), 3h average
    # detection latency, 6h max. 1h would be 6× the runs (24/day vs 4/day)
    # for marginal benefit; daily would be 4× longer latency and worse
    # than status quo on a quiet-main day.
    - cron: '17 */6 * * *'
  workflow_dispatch:

# No concurrency group needed — the canary is read-only and idempotent.
# Two parallel runs (e.g. operator dispatch during a scheduled tick) are
# harmless: same result, doubled HTTPS calls, no shared state.
permissions:
  contents: read

jobs:
  verify-token:
    name: Verify AUTO_SYNC_TOKEN validity
    runs-on: ubuntu-latest
    # 2 min surfaces hangs (Gitea API stall, DNS issue) within one
    # cron interval. Realistic worst case is ~10s: 2 curls + 1 git
    # ls-remote + 1 git push --dry-run, each capped by the explicit
    # timeouts below.
    timeout-minutes: 2
    env:
      # Pinned in env so individual steps can read it without
      # repeating the secret reference. GitHub masks the value in
      # logs automatically.
      AUTO_SYNC_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
      # MUST stay in sync with auto-sync-main-to-staging.yml's
      # `git config user.name "devops-engineer"` line. Renaming the
      # devops-engineer persona requires updating both files (and
      # the staging branch protection's `push_whitelist_usernames`).
      EXPECTED_PERSONA: devops-engineer
      GITEA_HOST: git.moleculesai.app
      REPO_PATH: molecule-ai/molecule-core
    steps:
      - name: Verify AUTO_SYNC_TOKEN secret is configured
        # Schedule-vs-dispatch behaviour split, per
        # `feedback_schedule_vs_dispatch_secrets_hardening`:
        #
        #   - schedule: hard-fail when the secret is missing. The
        #     whole point of the canary is to surface drift;
        #     soft-skipping on missing-secret would make the canary
        #     itself drift-invisible (sweep-cf-orphans #2088 lesson).
        #   - workflow_dispatch: hard-fail too — there's no scenario
        #     where an operator wants this canary to silently no-op.
        #     The workflow has no other ad-hoc utility; if you ran
        #     it, you wanted the answer.
        run: |
          if [ -z "${AUTO_SYNC_TOKEN}" ]; then
            echo "::error::AUTO_SYNC_TOKEN secret is not set on this repo." >&2
            echo "::error::Set it at Settings → Secrets and variables → Actions." >&2
            echo "::error::Without it, auto-sync-main-to-staging.yml will fail every push to main." >&2
            exit 1
          fi
          echo "AUTO_SYNC_TOKEN is configured (value masked)."

      - name: Verify token authenticates as ${{ env.EXPECTED_PERSONA }}
        # Calls Gitea's `/api/v1/user` — the canonical
        # auth-probe-with-no-side-effects endpoint (mirrors
        # Cloudflare's /user/tokens/verify).
        #
        # Failure surfaces:
        #   - HTTP 401: token invalid (rotated, revoked, or never
        #     correctly registered).
        #   - HTTP 200 but username != devops-engineer: token was
        #     regenerated under the wrong persona — this would let
        #     auth pass but commit attribution would be wrong, and
        #     branch-protection authz would fail because only
        #     `devops-engineer` is whitelisted.
        run: |
          set -euo pipefail
          response_file="$(mktemp)"
          code_file="$(mktemp)"
          # `--max-time 30`: full call ceiling. `--connect-timeout 10`:
          # DNS + TCP. `-w "%{http_code}"` routed to a tempfile so curl's
          # exit code can't pollute the captured status — see
          # feedback_curl_status_capture_pollution + the
          # `lint-curl-status-capture.yml` gate that rejects the unsafe
          # `$(curl ... || echo "000")` shape.
          set +e
          curl -sS -o "$response_file" \
            --max-time 30 --connect-timeout 10 \
            -w "%{http_code}" \
            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
            -H "Accept: application/json" \
            "https://${GITEA_HOST}/api/v1/user" >"$code_file" 2>/dev/null
          set -e
          status=$(cat "$code_file" 2>/dev/null || true)
          [ -z "$status" ] && status="000"
          if [ "$status" != "200" ]; then
            echo "::error::Token rotation suspected: GET /api/v1/user returned HTTP $status (expected 200)." >&2
            echo "::error::Likely cause: AUTO_SYNC_TOKEN has been rotated/revoked on Gitea but the repo Actions secret was not updated." >&2
            echo "::error::Runbook: see header comment of this workflow file." >&2
            # Print response body but redact anything that looks like a token.
            sed -E 's/[A-Fa-f0-9]{32,}//g' "$response_file" >&2 || true
            exit 1
          fi
          username=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('login',''))" "$response_file")
          if [ "$username" != "${EXPECTED_PERSONA}" ]; then
            echo "::error::Token resolves to user '$username', expected '${EXPECTED_PERSONA}'." >&2
            echo "::error::AUTO_SYNC_TOKEN must be the devops-engineer persona PAT (not founder PAT, not another persona)." \
              >&2
            echo "::error::Auto-sync push will fail because only 'devops-engineer' is whitelisted on staging branch protection." >&2
            exit 1
          fi
          echo "Token authenticates as: $username ✓"

      - name: Verify token has repo read scope
        # `GET /api/v1/repos/{owner}/{repo}` requires `read:repository`
        # on the persona's v2 scope contract. If the scope was
        # narrowed/dropped on rotation we catch it here, before the
        # next main push reveals it via a checkout failure.
        run: |
          set -euo pipefail
          response_file="$(mktemp)"
          code_file="$(mktemp)"
          # See first probe step for the rationale on the tempfile-routed
          # `-w "%{http_code}"` pattern — the unsafe `|| echo "000"` shape
          # is rejected by lint-curl-status-capture.yml.
          set +e
          curl -sS -o "$response_file" \
            --max-time 30 --connect-timeout 10 \
            -w "%{http_code}" \
            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
            -H "Accept: application/json" \
            "https://${GITEA_HOST}/api/v1/repos/${REPO_PATH}" >"$code_file" 2>/dev/null
          set -e
          status=$(cat "$code_file" 2>/dev/null || true)
          [ -z "$status" ] && status="000"
          if [ "$status" != "200" ]; then
            echo "::error::Token lacks read:repository scope on ${REPO_PATH}: HTTP $status." >&2
            echo "::error::Auto-sync's actions/checkout step will fail with this token." >&2
            echo "::error::Re-issue with v2 scope contract: read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc" >&2
            sed -E 's/[A-Fa-f0-9]{32,}//g' "$response_file" >&2 || true
            exit 1
          fi
          echo "Token has read:repository on ${REPO_PATH} ✓"

      - name: Verify git HTTPS auth path via no-op dry-run push to staging
        # Final probe: exercise the EXACT auth path that
        # `actions/checkout` + `git push origin staging` use in
        # auto-sync-main-to-staging.yml. Gitea's API and git-HTTPS
        # surfaces share the token-lookup code path internally but
        # the wire-level error shapes differ — historically (#173)
        # the API path was healthy while git-HTTPS rejected, so
        # checking only the API would have given false-green.
        #
        # IMPORTANT: `git ls-remote` on a public repo (which
        # molecule-core is) succeeds even with a junk token because
        # Gitea falls back to anonymous-read. `ls-remote` therefore
        # CANNOT validate auth on this surface. We use
        # `git push --dry-run` instead — push is auth-gated even on
        # public repos.
        #
        # NOP shape: read the current staging SHA via authenticated
        # ls-remote (the SHA itself is public; auth is incidental
        # here, used only to colocate the discovery in one step), then
        # `git push --dry-run ${staging_sha}:refs/heads/staging`.
        # Pushing the current tip back to itself is "Everything
        # up-to-date" with exit 0 when auth succeeds. With a bad token
        # Gitea returns HTTP 401 in the smart-protocol handshake and
        # git exits 128 with "Authentication failed".
        #
        # The dry-run never reaches Gitea's pre-receive hook (which
        # is where branch-protection authz runs), so this probe does
        # not validate failure mode C. That's intentional —
        # branch-protection-drift.yml owns authz monitoring; this
        # canary owns auth.
        env:
          # Don't hang waiting for password prompt if auth fails on a
          # terminal-attached run. (In Actions there's no terminal,
          # but the env-var hardens against an interactive runner
          # config.)
          GIT_TERMINAL_PROMPT: "0"
        run: |
          set -euo pipefail
          # Token is in $AUTO_SYNC_TOKEN (job-level env). Compose the
          # URL as a local var that's never echoed.
          url="https://oauth2:${AUTO_SYNC_TOKEN}@${GITEA_HOST}/${REPO_PATH}"
          # Step a: read current staging SHA. ~1KB; auth-gated only
          # on private repos but always works on public — used here
          # only to discover the SHA, not to validate auth.
          staging_ref=$(timeout 30s git ls-remote --refs "$url" refs/heads/staging 2>&1) || {
            redacted=$(echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:@|g")
            echo "::error::ls-remote against staging failed (network/DNS issue):" >&2
            echo "$redacted" >&2
            exit 1
          }
          if ! \
            echo "$staging_ref" | grep -qE '^[0-9a-f]{40}[[:space:]]+refs/heads/staging$'; then
            echo "::error::ls-remote returned unexpected shape:" >&2
            echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:@|g" >&2
            exit 1
          fi
          staging_sha=$(echo "$staging_ref" | awk '{print $1}')
          # Step b: spin up an ephemeral local repo. `git push` always
          # requires a local repo even when pushing a remote SHA that
          # isn't in the local object DB (the protocol negotiates and
          # discovers we don't need to send any objects). We don't use
          # `actions/checkout` for this — it would clone the whole
          # repo (~hundreds of MB) for what's essentially `git init`.
          tmp_repo="$(mktemp -d)"
          trap 'rm -rf "$tmp_repo"' EXIT
          git -C "$tmp_repo" init -q
          # Author config required for any git operation; values are
          # arbitrary because nothing gets committed here.
          git -C "$tmp_repo" config user.email canary@auto-sync.local
          git -C "$tmp_repo" config user.name auto-sync-canary
          # Step c: dry-run push the current staging SHA back to
          # staging. NOP by construction — the remote tip equals the
          # SHA we're pushing, so "Everything up-to-date" is the
          # success path.
          #
          # Authentication is checked at the smart-protocol handshake,
          # BEFORE the dry-run can compute an empty diff. Bad token
          # → "Authentication failed", exit 128. Good token → exit 0.
          set +e
          push_out=$(timeout 30s git -C "$tmp_repo" push --dry-run "$url" "${staging_sha}:refs/heads/staging" 2>&1)
          push_rc=$?
          set -e
          if [ "$push_rc" -ne 0 ]; then
            redacted=$(echo "$push_out" | sed -E "s|oauth2:[^@]+@|oauth2:@|g")
            echo "::error::Token rotation suspected: git push --dry-run against staging failed via the AUTO_SYNC_TOKEN HTTPS auth path (exit $push_rc)." >&2
            echo "::error::This is the EXACT auth path that actions/checkout + git push use in auto-sync-main-to-staging.yml." >&2
            echo "::error::Likely cause: AUTO_SYNC_TOKEN was rotated/revoked on Gitea but the repo Actions secret was not updated. Runbook: see header." \
              >&2
            echo "$redacted" >&2
            exit 1
          fi
          echo "git HTTPS auth path: NOP push --dry-run to staging → ${staging_sha:0:8} ✓"

      - name: Summarise canary result
        # Everything passed — surface a green summary. (Failures
        # already wrote ::error:: lines and exited above; if we got
        # here, all three probes passed.)
        run: |
          {
            echo "## Auto-sync canary: GREEN"
            echo ""
            echo "AUTO_SYNC_TOKEN is healthy:"
            echo "- Authenticates as \`${EXPECTED_PERSONA}\` ✓"
            echo "- Has \`read:repository\` scope on \`${REPO_PATH}\` ✓"
            echo "- Git HTTPS auth path: no-op dry-run push to \`refs/heads/staging\` succeeds ✓"
            echo ""
            echo "Auto-sync main → staging will succeed on the next push to main."
            echo "If this canary ever goes RED, see the runbook in this workflow's header."
          } >> "$GITHUB_STEP_SUMMARY"