Hostile-self-review weakest-spot #2: if the devops-engineer persona is ever renamed, the canary will go red even if everything else is fine. Add an inline comment pointing the next editor at both files that must update together (auto-sync-main-to-staging.yml's git config + this canary's EXPECTED_PERSONA + the staging branch protection's push_whitelist_usernames). No behaviour change — comment-only.
405 lines
20 KiB
YAML
name: Auto-sync canary — AUTO_SYNC_TOKEN rotation drift

# Synthetic health check for the AUTO_SYNC_TOKEN secret consumed by
# auto-sync-main-to-staging.yml (PR #66) and publish-workspace-server-image.yml.
#
# ============================================================
# Why this workflow exists
# ============================================================
#
# PR #66 fixed auto-sync (replaced GitHub-era `gh pr create` — which
# 405s on Gitea's GraphQL endpoint — with a direct git push from the
# `devops-engineer` persona's `AUTO_SYNC_TOKEN`). Hostile self-review
# weakest spot #3 of that PR:
#
#   "Token rotation silently breaks auto-sync. If AUTO_SYNC_TOKEN is
#   rotated without updating the repo secret, every push to main
#   fails red on the auto-sync push step. The workflow surfaces the
#   failure mode in the step summary (failure mode B in the header),
#   but there's no proactive monitoring."
#
# Detection latency under the status quo: rotation is only caught on
# the next push to `main`. During quiet periods (no main push for
# many hours) the staging-superset-of-main invariant silently breaks.
#
# This workflow closes the gap: every 6 hours, it fires the auth
# surface that auto-sync depends on and emits a red workflow status
# if AUTO_SYNC_TOKEN has drifted out of validity.
#
# ============================================================
# What this checks (Option B — read-only verify)
# ============================================================
#
# 1. `GET /api/v1/user` against Gitea with the token → validates the
#    token authenticates AND resolves to `devops-engineer` (catches
#    the case where the token was regenerated under a different
#    persona by mistake).
# 2. `GET /api/v1/repos/molecule-ai/molecule-core` with the token →
#    validates the token has `read:repository` scope on this repo
#    (the v2 scope contract — see saved memory
#    `reference_persona_token_v2_scope`).
# 3. `git push --dry-run` of the current staging SHA back to
#    `refs/heads/staging` via `https://oauth2:<token>@<gitea>/...`
#    → validates the EXACT HTTPS basic-auth path that
#    `actions/checkout` + `git push origin staging` use inside
#    auto-sync-main-to-staging.yml. NOP by construction (push the
#    current tip to itself = "Everything up-to-date"); auth is
#    checked at the smart-protocol handshake BEFORE the empty-diff
#    computation, so bad token → exit 128 with "Authentication
#    failed". `git ls-remote` is NOT used as the auth probe because
#    Gitea falls back to anonymous read on public repos and would
#    silently green-light a rotated token; it is used only to
#    discover the current staging SHA.
#
# Each step exits non-zero with an actionable error message if it
# fails. The workflow status itself is the operator-facing surface.
#
# ============================================================
# What this does NOT check (intentional)
# ============================================================
#
# - **Branch-protection authz** (failure mode C in auto-sync header):
#   would require an actual write to staging. Already monitored by
#   `branch-protection-drift.yml` daily. Don't duplicate.
# - **Conflict resolution** (failure mode A): a real conflict is data-
#   driven, not auth-driven; can't synthesise it without polluting
#   staging. Already surfaces immediately on the next main push.
# - **Concurrency** (failure mode D): handled by workflow concurrency
#   group on auto-sync, not a credential issue.
#
# ============================================================
# Why Option B (read-only) and not the alternatives
# ============================================================
#
# Considered + rejected (see issue #72 for full write-up):
#
# - **Option A — full auto-sync on schedule**: every run creates a
#   no-op merge commit on staging when main hasn't advanced. 4 noise
#   commits/day. And races the real `push:` trigger when main has
#   advanced. Rejected.
#
# - **Option C — push to dedicated `auto-sync-canary` branch**: would
#   exercise authz too, but adds branch noise on Gitea AND requires
#   maintaining a second branch protection (or expanding staging's
#   whitelist to a junk branch). Authz already covered by
#   `branch-protection-drift.yml`. Rejected.
#
# Prior art for the chosen Option B shape:
# - Cloudflare's `/user/tokens/verify` endpoint (read-only auth
#   probe explicitly designed for credential canaries).
# - AWS Secrets Manager rotation Lambda's `testSecret` step (auth
#   probe before promoting AWSPENDING → AWSCURRENT).
# - HashiCorp Vault's `vault token lookup` for renewal canaries.
#
# ============================================================
# Operator runbook — what to do when this workflow goes RED
# ============================================================
#
# 1. **Identify which step failed**:
#    - Step "Verify token authenticates as devops-engineer" red →
#      token is invalid OR resolves to wrong persona.
#    - Step "Verify token has repo read scope" red → token valid but
#      stripped of `read:repository` scope (or repo perms changed).
#    - Step "Verify git HTTPS auth path via no-op dry-run push to
#      staging" red → token rotated/revoked OR Gitea git-HTTPS
#      surface is broken (rare). Auth check happens on the
#      smart-protocol handshake, separate from the API path.
#
# 2. **Re-issue the token** on the operator host:
#    ```
#    ssh root@5.78.80.188 'docker exec --user git molecule-gitea-1 \
#      gitea admin user generate-access-token \
#        --username devops-engineer \
#        --token-name persona-devops-engineer-vN \
#        --scopes "read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc"'
#    ```
#    Update `/etc/molecule-bootstrap/agent-secrets.env` in place
#    (per `feedback_unified_credentials_file`). The previous token
#    file lands at `.bak.<date>`.
#
# 3. **Update the repo Actions secret** at:
#    Settings → Secrets and variables → Actions → AUTO_SYNC_TOKEN
#    Paste the new token. (Don't echo it in chat — but per
#    `feedback_passwords_in_chat_are_burned`, a paste in a 1:1
#    Claude session is within trust boundary.)
#
# 4. **Re-run this canary** via workflow_dispatch. Confirm GREEN.
#
# 5. **Backfill any missed main → staging syncs** by re-running
#    `auto-sync-main-to-staging.yml` from its workflow_dispatch
#    surface, OR by pushing an empty commit to main (if you'd
#    rather force a real trigger).
#
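# Optional pre-check between runbook steps 2 and 3 (a sketch, not part
# of the workflow): the freshly issued token can be probed from any
# shell with the same `GET /api/v1/user` call the first probe step
# makes. `NEW_TOKEN` is a hypothetical variable name holding the new
# token; the host is the GITEA_HOST value used by this workflow.
# ```
# curl -sS --max-time 30 --connect-timeout 10 \
#   -H "Authorization: token ${NEW_TOKEN}" \
#   -H "Accept: application/json" \
#   "https://git.moleculesai.app/api/v1/user" \
#   | python3 -c "import json,sys; print(json.load(sys.stdin).get('login',''))"
# ```
# This should print `devops-engineer`. Any other output suggests the
# token was issued under the wrong persona or does not authenticate.
#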
# ============================================================
# Security notes
# ============================================================
#
# - Token usage: read-only (`GET /api/v1/user`, `GET /api/v1/repos/...`,
#   `git ls-remote`, `git push --dry-run`). No write paths: the
#   dry-run push updates no refs. Same blast-radius profile as
#   `actions/checkout` on a public repo.
# - The token NEVER appears in logs: every `curl` uses a header
#   variable, never inline; the git URL (shared by `ls-remote` and
#   `push --dry-run`) builds the `oauth2:$TOKEN@host` form into a
#   single shell variable that's not echoed, and git output is
#   redacted before it is printed. GitHub Actions secret-masking
#   covers anything that does slip through.
# - No new token introduced — same `AUTO_SYNC_TOKEN` the workflow
#   under monitor uses. Per least-privilege we deliberately do NOT
#   broaden scope for the canary.

on:
  schedule:
    # Every 6 hours at :17 (offsets the cron herd at :00). Justification
    # from issue #72: cheap to run (~5s wall-clock, no quota), 3h average
    # detection latency, 6h max. 1h would be 6× the runs (24/day vs
    # 4/day) for marginal benefit; daily would be 4× longer latency and
    # worse than status quo on a quiet-main day.
    - cron: '17 */6 * * *'
  workflow_dispatch:

# No concurrency group needed — the canary is read-only and idempotent.
# Two parallel runs (e.g. operator dispatch during a scheduled tick) are
# harmless: same result, doubled HTTPS calls, no shared state.

permissions:
  contents: read

jobs:
  verify-token:
    name: Verify AUTO_SYNC_TOKEN validity
    runs-on: ubuntu-latest
    # 2 min surfaces hangs (Gitea API stall, DNS issue) within one
    # cron interval. Realistic worst case is ~10s: 2 curls + 1 git
    # ls-remote + 1 git push --dry-run, each capped by the explicit
    # timeouts below.
    timeout-minutes: 2

    env:
      # Pinned in env so individual steps can read it without
      # repeating the secret reference. GitHub masks the value in
      # logs automatically.
      AUTO_SYNC_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
      # MUST stay in sync with auto-sync-main-to-staging.yml's
      # `git config user.name "devops-engineer"` line. Renaming the
      # devops-engineer persona requires updating both files (and
      # the staging branch protection's `push_whitelist_usernames`).
      EXPECTED_PERSONA: devops-engineer
      GITEA_HOST: git.moleculesai.app
      REPO_PATH: molecule-ai/molecule-core

    steps:
      - name: Verify AUTO_SYNC_TOKEN secret is configured
        # Schedule-vs-dispatch behaviour split, per
        # `feedback_schedule_vs_dispatch_secrets_hardening`:
        #
        # - schedule: hard-fail when the secret is missing. The
        #   whole point of the canary is to surface drift; soft-
        #   skipping on missing-secret would make the canary
        #   itself drift-invisible (sweep-cf-orphans #2088 lesson).
        # - workflow_dispatch: hard-fail too — there's no scenario
        #   where an operator wants this canary to silently no-op.
        #   The workflow has no other ad-hoc utility; if you ran
        #   it, you wanted the answer.
        run: |
          if [ -z "${AUTO_SYNC_TOKEN}" ]; then
            echo "::error::AUTO_SYNC_TOKEN secret is not set on this repo." >&2
            echo "::error::Set it at Settings → Secrets and variables → Actions." >&2
            echo "::error::Without it, auto-sync-main-to-staging.yml will fail every push to main." >&2
            exit 1
          fi
          echo "AUTO_SYNC_TOKEN is configured (value masked)."

      - name: Verify token authenticates as ${{ env.EXPECTED_PERSONA }}
        # Calls Gitea's `/api/v1/user` — the canonical
        # auth-probe-with-no-side-effects endpoint (mirrors
        # Cloudflare's /user/tokens/verify).
        #
        # Failure surfaces:
        # - HTTP 401: token invalid (rotated, revoked, or never
        #   correctly registered).
        # - HTTP 200 but username != devops-engineer: token was
        #   regenerated under the wrong persona — this would let
        #   auth pass but commit attribution would be wrong, and
        #   branch-protection authz would fail because only
        #   `devops-engineer` is whitelisted.
        run: |
          set -euo pipefail
          response_file="$(mktemp)"
          code_file="$(mktemp)"
          # `--max-time 30`: full call ceiling. `--connect-timeout 10`:
          # DNS + TCP. `-w "%{http_code}"` routed to a tempfile so curl's
          # exit code can't pollute the captured status — see
          # feedback_curl_status_capture_pollution + the
          # `lint-curl-status-capture.yml` gate that rejects the unsafe
          # `$(curl ... || echo "000")` shape.
          set +e
          curl -sS -o "$response_file" \
            --max-time 30 --connect-timeout 10 \
            -w "%{http_code}" \
            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
            -H "Accept: application/json" \
            "https://${GITEA_HOST}/api/v1/user" >"$code_file" 2>/dev/null
          set -e
          status=$(cat "$code_file" 2>/dev/null || true)
          [ -z "$status" ] && status="000"

          if [ "$status" != "200" ]; then
            echo "::error::Token rotation suspected: GET /api/v1/user returned HTTP $status (expected 200)." >&2
            echo "::error::Likely cause: AUTO_SYNC_TOKEN has been rotated/revoked on Gitea but the repo Actions secret was not updated." >&2
            echo "::error::Runbook: see header comment of this workflow file." >&2
            # Print response body but redact anything that looks like a token.
            sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
            exit 1
          fi

          username=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('login',''))" "$response_file")
          if [ "$username" != "${EXPECTED_PERSONA}" ]; then
            echo "::error::Token resolves to user '$username', expected '${EXPECTED_PERSONA}'." >&2
            echo "::error::AUTO_SYNC_TOKEN must be the devops-engineer persona PAT (not founder PAT, not another persona)." >&2
            echo "::error::Auto-sync push will fail because only 'devops-engineer' is whitelisted on staging branch protection." >&2
            exit 1
          fi
          echo "Token authenticates as: $username ✓"

      - name: Verify token has repo read scope
        # `GET /api/v1/repos/<owner>/<repo>` requires `read:repository`
        # on the persona's v2 scope contract. If the scope was
        # narrowed/dropped on rotation we catch it here, before the
        # next main push reveals it via a checkout failure.
        run: |
          set -euo pipefail
          response_file="$(mktemp)"
          code_file="$(mktemp)"
          # See first probe step for the rationale on the tempfile-routed
          # `-w "%{http_code}"` pattern — the unsafe `|| echo "000"` shape
          # is rejected by lint-curl-status-capture.yml.
          set +e
          curl -sS -o "$response_file" \
            --max-time 30 --connect-timeout 10 \
            -w "%{http_code}" \
            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
            -H "Accept: application/json" \
            "https://${GITEA_HOST}/api/v1/repos/${REPO_PATH}" >"$code_file" 2>/dev/null
          set -e
          status=$(cat "$code_file" 2>/dev/null || true)
          [ -z "$status" ] && status="000"

          if [ "$status" != "200" ]; then
            echo "::error::Token lacks read:repository scope on ${REPO_PATH}: HTTP $status." >&2
            echo "::error::Auto-sync's actions/checkout step will fail with this token." >&2
            echo "::error::Re-issue with v2 scope contract: read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc" >&2
            sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
            exit 1
          fi
          echo "Token has read:repository on ${REPO_PATH} ✓"

      - name: Verify git HTTPS auth path via no-op dry-run push to staging
        # Final probe: exercise the EXACT auth path that
        # `actions/checkout` + `git push origin staging` use in
        # auto-sync-main-to-staging.yml. Gitea's API and git-HTTPS
        # surfaces share the token-lookup code path internally but
        # the wire-level error shapes differ — historically (#173)
        # the API path was healthy while git-HTTPS rejected, so
        # checking only the API would have given false-green.
        #
        # IMPORTANT: `git ls-remote` on a public repo (which
        # molecule-core is) succeeds even with a junk token because
        # Gitea falls back to anonymous-read. `ls-remote` therefore
        # CANNOT validate auth on this surface. We use
        # `git push --dry-run` instead — push is auth-gated even on
        # public repos.
        #
        # NOP shape: read the current staging SHA via authenticated
        # ls-remote (the SHA itself is public; auth is incidental
        # here, used only to colocate the discovery in one step), then
        # `git push --dry-run <SHA>:refs/heads/staging`. Pushing the
        # current tip back to itself is "Everything up-to-date" with
        # exit 0 when auth succeeds. With a bad token Gitea returns
        # HTTP 401 in the smart-protocol handshake and git exits 128
        # with "Authentication failed".
        #
        # The dry-run never reaches Gitea's pre-receive hook (which
        # is where branch-protection authz runs), so this probe does
        # not validate failure mode C. That's intentional —
        # branch-protection-drift.yml owns authz monitoring; this
        # canary owns auth.
        env:
          # Don't hang waiting for password prompt if auth fails on a
          # terminal-attached run. (In Actions there's no terminal,
          # but the env-var hardens against an interactive runner
          # config.)
          GIT_TERMINAL_PROMPT: "0"
        run: |
          set -euo pipefail
          # Token is in $AUTO_SYNC_TOKEN (job-level env). Compose the
          # URL as a local var that's never echoed.
          url="https://oauth2:${AUTO_SYNC_TOKEN}@${GITEA_HOST}/${REPO_PATH}"

          # Step a: read current staging SHA. ~1KB; auth-gated only
          # on private repos but always works on public — used here
          # only to discover the SHA, not to validate auth.
          staging_ref=$(timeout 30s git ls-remote --refs "$url" refs/heads/staging 2>&1) || {
            redacted=$(echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
            echo "::error::ls-remote against staging failed (network/DNS issue):" >&2
            echo "$redacted" >&2
            exit 1
          }
          if ! echo "$staging_ref" | grep -qE '^[0-9a-f]{40}[[:space:]]+refs/heads/staging$'; then
            echo "::error::ls-remote returned unexpected shape:" >&2
            echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g" >&2
            exit 1
          fi
          staging_sha=$(echo "$staging_ref" | awk '{print $1}')

          # Step b: spin up an ephemeral local repo. `git push` always
          # requires a local repo even when pushing a remote SHA that
          # isn't in the local object DB (the protocol negotiates and
          # discovers we don't need to send any objects). We don't use
          # `actions/checkout` for this — it would clone the whole
          # repo (~hundreds of MB) for what's essentially `git init`.
          tmp_repo="$(mktemp -d)"
          trap 'rm -rf "$tmp_repo"' EXIT
          git -C "$tmp_repo" init -q
          # An author identity is not strictly needed for a dry-run
          # push; it is set defensively, and the values are arbitrary
          # because nothing gets committed here.
          git -C "$tmp_repo" config user.email canary@auto-sync.local
          git -C "$tmp_repo" config user.name auto-sync-canary

          # Step c: dry-run push the current staging SHA back to
          # staging. NOP by construction — the remote tip equals the
          # SHA we're pushing, so "Everything up-to-date" is the
          # success path.
          #
          # Authentication is checked at the smart-protocol handshake,
          # BEFORE the dry-run can compute an empty diff. Bad token
          # → "Authentication failed", exit 128. Good token → exit 0.
          set +e
          push_out=$(timeout 30s git -C "$tmp_repo" push --dry-run "$url" "${staging_sha}:refs/heads/staging" 2>&1)
          push_rc=$?
          set -e

          if [ "$push_rc" -ne 0 ]; then
            redacted=$(echo "$push_out" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
            echo "::error::Token rotation suspected: git push --dry-run against staging failed via the AUTO_SYNC_TOKEN HTTPS auth path (exit $push_rc)." >&2
            echo "::error::This is the EXACT auth path that actions/checkout + git push use in auto-sync-main-to-staging.yml." >&2
            echo "::error::Likely cause: AUTO_SYNC_TOKEN was rotated/revoked on Gitea but the repo Actions secret was not updated. Runbook: see header." >&2
            echo "$redacted" >&2
            exit 1
          fi

          echo "git HTTPS auth path: NOP push --dry-run to staging → ${staging_sha:0:8} ✓"

      - name: Summarise canary result
        # Everything passed — surface a green summary. (Failures
        # already wrote ::error:: lines and exited above; if we got
        # here, all three probes passed.)
        run: |
          {
            echo "## Auto-sync canary: GREEN"
            echo ""
            echo "AUTO_SYNC_TOKEN is healthy:"
            echo "- Authenticates as \`${EXPECTED_PERSONA}\` ✓"
            echo "- Has \`read:repository\` scope on \`${REPO_PATH}\` ✓"
            echo "- Git HTTPS auth path: no-op dry-run push to \`refs/heads/staging\` succeeds ✓"
            echo ""
            echo "Auto-sync main → staging will succeed on the next push to main."
            echo "If this canary ever goes RED, see the runbook in this workflow's header."
          } >> "$GITHUB_STEP_SUMMARY"