Merge remote-tracking branch 'origin/main' into feat/canvas-activity-tab-ws-subscribe
Some checks failed
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 7s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 7s
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 7s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 13s
CI / Detect changes (pull_request) Successful in 16s
E2E API Smoke Test / detect-changes (pull_request) Successful in 15s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 15s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 15s
Harness Replays / detect-changes (pull_request) Successful in 16s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 16s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 14s
CI / Platform (Go) (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 9s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 14s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 10s
Harness Replays / Harness Replays (pull_request) Failing after 1m32s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6m27s
CI / Canvas (Next.js) (pull_request) Failing after 10m9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
commit e909417224

.github/workflows/auto-sync-main-to-staging.yml (vendored, 328 lines changed)
@@ -3,85 +3,138 @@ name: Auto-sync main → staging
# Reflects every push to `main` back onto `staging` so the
# staging-as-superset-of-main invariant holds.
#
# Background:
# ============================================================
# What this workflow does
# ============================================================
#
# `auto-promote-staging.yml` advances main via `git merge --ff-only`
# + `git push origin main` — that's a clean fast-forward, no merge
# commit. But manual merges of `staging → main` PRs through the
# GitHub UI / API create a merge commit on main that staging
# doesn't have. The next `staging → main` PR then evaluates as
# "BEHIND" because staging is missing that merge commit, requiring
# a manual `gh pr update-branch` round-trip.
# On every push to `main`:
#   1. Checks if staging already contains main → no-op.
#   2. Fetches both branches, merges main into staging in the
#      runner workspace (fast-forward if possible, else
#      `--no-ff` merge commit).
#   3. Pushes staging directly to origin via the
#      `devops-engineer` persona's `AUTO_SYNC_TOKEN`.
#
# This happened twice on 2026-04-28 (PRs #2202, #2205, both manual
# bridges). Each time the bridge needed update-branch + a re-CI
# round before merging. Operationally annoying and avoidable.
# Authoritative path: a single `git push origin staging` from
# inside this workflow is the SSOT for advancing staging after
# a main push. No PR, no merge queue, no human approval —
# staging is mechanically maintained as a superset of main.
#
# Architecture:
#   `auto-promote-staging.yml` is the reverse-direction
#   counterpart (staging → main, gated on green CI). Together
#   they keep the staging-superset-of-main invariant tight.
#
# This repo's `staging` branch is protected by a `merge_queue`
# ruleset (id 15500102) that blocks ALL direct pushes — no bypass
# even for org admins or the GitHub Actions integration. Direct
# `git push origin staging` returns GH013. So instead of pushing
# directly, this workflow:
# ============================================================
# Why direct push (and not "open a PR")
# ============================================================
#
#   1. Checks if main is already in staging's ancestry → no-op.
#   2. Creates an `auto-sync/main-<sha>` branch from staging.
#   3. Tries `git merge --ff-only origin/main` → if staging hasn't
#      diverged this is a clean ff.
#   4. Otherwise `git merge --no-ff origin/main` to absorb main's
#      tip while keeping staging's history.
#   5. Pushes the auto-sync branch.
#   6. Opens a PR (base=staging, head=auto-sync/main-<sha>) and
#      enables auto-merge so the merge queue lands it.
# Pre-2026-05-06 the canonical SCM was GitHub.com, where:
#   - The `staging` branch had a `merge_queue` ruleset that
#     blocked ALL direct pushes (no bypass even for org
#     admins or the GitHub Actions integration).
#   - Therefore this workflow opened a PR via `gh pr create`
#     and let auto-merge land it through the queue.
#
# This mirrors the path human PRs take through staging — same
# rules, same gates, no special-case bypass.
# Post-2026-05-06 the canonical SCM is Gitea
# (`git.moleculesai.app/molecule-ai/molecule-core`). Gitea:
#   - Has no `merge_queue` concept.
#   - Allows direct push to protected branches via per-user
#     `push_whitelist_usernames` on the branch protection.
#   - Does not expose a GraphQL endpoint, so `gh pr create`
#     returns `HTTP 405 Method Not Allowed
#     (https://git.moleculesai.app/api/graphql)` — the
#     pre-suspension architecture cannot work on Gitea.
#
# Loop safety:
#   The molecule-ai/molecule-core staging branch protection
#   (verified via `GET /api/v1/repos/.../branch_protections`)
#   whitelists `devops-engineer` for direct push. So the
#   correct Gitea-shape architecture is: authenticate as
#   `devops-engineer`, merge locally, push staging directly.
#
# `GITHUB_TOKEN`-authored merges (including the merge queue's land
# of the auto-sync PR) do NOT trigger downstream workflow runs
# (GitHub Actions safety). So when the auto-sync PR lands on
# staging, `auto-promote-staging.yml` is NOT triggered by that
# push. The next developer push to staging triggers auto-promote
# normally. No loop possible.
# This is structurally simpler than the GitHub-era PR dance
# and removes the dependence on `gh` CLI / GraphQL entirely.
#
# Concurrency:
# ============================================================
# Identity + token (anti-bot-ring per saved-memory
# `feedback_per_agent_gitea_identity_default`)
# ============================================================
#
# Two pushes to main in quick succession (e.g., manual UI merge
# immediately followed by auto-promote-staging's ff-merge) could
# otherwise open two overlapping auto-sync PRs. The concurrency
# group serializes runs; the second waits for the first to exit.
# (The first run exits after opening + auto-merge-queueing the PR,
# not after the merge actually completes — so multiple PRs can be
# open simultaneously, but the merge queue handles them serially.)
# This workflow uses `secrets.AUTO_SYNC_TOKEN`, which is a
# personal access token issued to the `devops-engineer`
# persona on Gitea — NOT the founder PAT. The bot-ring
# fingerprint that triggered the GitHub org suspension on
# 2026-05-06 was characterised by founder PAT acting as CI
# at machine speed; per-persona identities split the
# attribution honestly.
#
# Token scope on Gitea: repo write. Push target restricted
# to `staging` (this workflow is the only writer; main is
# untouched). Compromise blast radius: bounded to staging
# branch + this repo's read surface.
#
# Commits are authored by the persona email
# `devops-engineer@agents.moleculesai.app` so commit history
# reflects which automation produced the merge.
#
# ============================================================
# Failure modes & operational notes
# ============================================================
#
# A — staging has commits main doesn't, and the merge
#     conflicts:
#   - The `--no-ff` merge step exits non-zero. Workflow
#     fails red. Operator (devops-engineer or human)
#     resolves manually:
#       git fetch origin
#       git checkout staging
#       git merge --no-ff origin/main
#       # resolve conflicts
#       git push origin staging
#   - Step summary surfaces the conflict so the failed run
#     is self-explanatory.
#
# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
#   - `git push` step exits non-zero with `HTTP 401` /
#     `403`. Step summary surfaces the failed push.
#   - Re-issue the token from `~/.molecule-ai/personas/`
#     on the operator host and update the repo Actions
#     secret. Re-run the workflow.
#
# C — staging branch protection no longer whitelists
#     `devops-engineer`:
#   - `git push` exits non-zero with a Gitea protected-
#     branch rejection. Step summary surfaces it.
#   - Re-add `devops-engineer` to
#     `push_whitelist_usernames` on the staging
#     protection (Settings → Branches → staging).
#
# D — concurrent push to main while a sync is in flight:
#   - The `concurrency` group below serialises runs.
#     The second waits for the first; if main advances
#     again while we're syncing, the second run picks
#     up the new tip on its own fetch.
#
# ============================================================
# Loop safety
# ============================================================
#
# The push to staging from this workflow does NOT itself
# fire a `push: branches: [main]` event (different branch),
# so there's no risk of self-recursion. `auto-promote-staging.yml`
# fires on `workflow_run` of CI etc. — it sees the new
# staging tip on its next gate-completion event, NOT on this
# push directly. No loop.

on:
  push:
    branches: [main]
  # workflow_dispatch lets:
  #   1. Operators manually backfill a missed sync (e.g. after a manual
  #      UI merge that the runner missed).
  #   2. auto-promote-staging.yml's polling tail explicitly invoke us
  #      after the promote PR lands. This is load-bearing: when the
  #      merge queue lands a promote-PR merge, the resulting push to
  #      `main` is "by GITHUB_TOKEN", and per GitHub's no-recursion
  #      rule (https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
  #      that push event does NOT fire any downstream workflows. The
  #      `on: push` trigger above is silently dead for the very pattern
  #      we exist to handle. Verified empirically 2026-05-02 against
  #      SHA 76c604fb (PR #2437 staging→main): only ONE workflow fired
  #      (publish-workspace-server-image, dispatched explicitly by
  #      auto-promote's polling tail with an App token). Every other
  #      `on: push: branches: [main]` workflow — including this one —
  #      was suppressed. Until the underlying merge call moves to an
  #      App token, an explicit dispatch is the only reliable path.
  # workflow_dispatch lets operators manually backfill a
  # missed sync (e.g. if AUTO_SYNC_TOKEN was rotated and a
  # main push slipped through while the secret was stale).
  workflow_dispatch:

permissions:
  contents: write
  pull-requests: write

concurrency:
  group: auto-sync-main-to-staging
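The no-op guard the header describes (step 1) is a single `git merge-base --is-ancestor` test. A minimal sketch against a throwaway local repo; the repo, branch names, and commits here are synthetic, not the real molecule-core refs:

```shell
set -eu
# Throwaway repo to demonstrate the ancestry check.
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m base
git branch staging                 # staging starts at main's tip
# The guard: exits 0 iff main is already an ancestor of staging.
if git merge-base --is-ancestor main staging; then
  echo "needs_sync=false"          # no-op, as the workflow does
fi
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m advance
if ! git merge-base --is-ancestor main staging; then
  echo "needs_sync=true"           # main moved ahead; sync required
fi
```

The same test drives the real workflow's `needs_sync` output, just with `origin/main` and the checked-out staging `HEAD` instead of local branch names.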
@@ -89,26 +142,25 @@ concurrency:

jobs:
  sync-staging:
    # ubuntu-latest matches every other workflow in this repo. The
    # earlier `[self-hosted, macos, arm64]` was a copy-paste artefact
    # from the molecule-controlplane repo (which IS private and uses a
    # Mac runner) — molecule-core has no Mac runner registered, so the
    # job sat unassigned whenever the trigger fired. Verified 2026-05-02:
    # this is the ONLY workflow in molecule-core/.github/workflows/ with
    # a non-ubuntu runs-on.
    runs-on: ubuntu-latest
    steps:
      - name: Checkout staging
      - name: Checkout staging (with devops-engineer push token)
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          fetch-depth: 0
          ref: staging
          token: ${{ secrets.GITHUB_TOKEN }}
          # AUTO_SYNC_TOKEN authenticates as the
          # `devops-engineer` Gitea persona — the only
          # identity whitelisted for direct push to
          # staging. See header comment for context.
          token: ${{ secrets.AUTO_SYNC_TOKEN }}

      - name: Configure git author
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          # Per-persona identity, NOT founder PAT.
          # `feedback_per_agent_gitea_identity_default`.
          git config user.name "devops-engineer"
          git config user.email "devops-engineer@agents.moleculesai.app"

      - name: Check if staging already contains main
        id: check
@@ -118,7 +170,7 @@ jobs:
          if git merge-base --is-ancestor origin/main HEAD; then
            echo "needs_sync=false" >> "$GITHUB_OUTPUT"
            {
              echo "## ✅ No-op"
              echo "## No-op"
              echo
              echo "staging already contains \`origin/main\` ($(git rev-parse --short=8 origin/main))."
            } >> "$GITHUB_STEP_SUMMARY"
@@ -126,112 +178,78 @@ jobs:
            echo "needs_sync=true" >> "$GITHUB_OUTPUT"
            MAIN_SHORT=$(git rev-parse --short=8 origin/main)
            echo "main_short=${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
            echo "branch=auto-sync/main-${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
            echo "::notice::staging is missing main's tip (${MAIN_SHORT}) — opening sync PR"
            echo "::notice::staging is missing main's tip (${MAIN_SHORT}) — merging in-runner and pushing"
          fi

      - name: Create auto-sync branch + merge main
      - name: Merge main into staging (in-runner)
        if: steps.check.outputs.needs_sync == 'true'
        id: prep
        id: merge
        run: |
          set -euo pipefail
          BRANCH="${{ steps.check.outputs.branch }}"

          # If a previous auto-sync run already opened a branch for the
          # same main sha, prefer reusing it (idempotent behavior on
          # workflow restart). Force-update from latest staging anyway
          # so it absorbs any staging-side commits that landed since.
          git checkout -B "$BRANCH"

          # Already on staging from checkout. Try fast-forward
          # first (cleanest history); fall back to merge commit
          # if staging has commits main doesn't.
          if git merge --ff-only origin/main; then
            echo "did_ff=true" >> "$GITHUB_OUTPUT"
            echo "::notice::Fast-forwarded ${BRANCH} to origin/main"
            echo "::notice::Fast-forwarded staging to origin/main"
          else
            echo "did_ff=false" >> "$GITHUB_OUTPUT"
            if ! git merge --no-ff origin/main -m "chore: sync main → staging (auto)"; then
            if ! git merge --no-ff origin/main \
              -m "chore: sync main → staging (auto, ${{ steps.check.outputs.main_short }})"; then
              # Hygiene: leave the work tree clean before failing.
              git merge --abort || true
              {
                echo "## ❌ Conflict"
                echo "## Conflict"
                echo
                echo "Auto-merge \`main → staging\` failed with conflicts."
                echo "A human needs to resolve manually."
                echo "A human (or devops-engineer persona) needs to resolve manually:"
                echo
                echo '```'
                echo "git fetch origin"
                echo "git checkout staging"
                echo "git merge --no-ff origin/main"
                echo "# resolve conflicts"
                echo "git push origin staging"
                echo '```'
              } >> "$GITHUB_STEP_SUMMARY"
              exit 1
            fi
          fi

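The ff-only-then-merge-commit fallback in the step above can be exercised locally. A sketch with a deliberately diverged throwaway repo (all names and messages synthetic):

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
git init -q -b main
g() { git -c user.name=t -c user.email=t@t "$@"; }
g commit -q --allow-empty -m base
git checkout -q -b staging
g commit -q --allow-empty -m staging-only   # staging diverges from main
git checkout -q main
g commit -q --allow-empty -m main-only
git checkout -q staging
# Fast-forward refuses because staging has a commit main doesn't...
if git merge --ff-only main 2>/dev/null; then
  echo "did_ff=true"
else
  echo "did_ff=false"
  # ...so absorb main's tip with an explicit merge commit instead.
  g merge -q --no-ff -m "chore: sync main -> staging (auto)" main
fi
git merge-base --is-ancestor main staging && echo "synced"
```

After the fallback, staging contains main's tip (the ancestry check passes) while keeping its own history, which is exactly the post-condition the workflow pushes.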
      - name: Push auto-sync branch
      - name: Push staging to origin
        if: steps.check.outputs.needs_sync == 'true'
        run: |
          set -euo pipefail
          # Force-with-lease so a concurrent auto-sync run can't
          # silently clobber an in-flight branch we just updated. If a
          # different writer touched the branch, we abort and the next
          # run picks up the latest state.
          git push --force-with-lease origin "${{ steps.check.outputs.branch }}"

      - name: Open auto-sync PR + enable auto-merge
        if: steps.check.outputs.needs_sync == 'true'
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          BRANCH: ${{ steps.check.outputs.branch }}
          MAIN_SHORT: ${{ steps.check.outputs.main_short }}
          DID_FF: ${{ steps.prep.outputs.did_ff }}
        run: |
          set -euo pipefail

          # Find existing PR for this branch (idempotent on workflow
          # restart) before creating a new one.
          PR_NUM=$(gh pr list --head "$BRANCH" --base staging --state open --json number --jq '.[0].number // ""')

          if [ -z "$PR_NUM" ]; then
            # Body lives in a temp file to keep the multi-line content
            # out of the YAML block scalar (un-indented newlines inside
            # an inline shell string break YAML parsing).
            BODY_FILE=$(mktemp)
            if [ "$DID_FF" = "true" ]; then
              TITLE="chore: sync main → staging (auto, ff to ${MAIN_SHORT})"
              cat > "$BODY_FILE" <<EOFBODY
          Automated fast-forward of \`staging\` to \`origin/main\` (\`${MAIN_SHORT}\`). Staging has no in-flight commits that diverge from main. Merge queue lands this; no human action needed.

          This PR is auto-generated by \`.github/workflows/auto-sync-main-to-staging.yml\` on every push to \`main\`. It exists because this repo's \`staging\` branch has a \`merge_queue\` ruleset that blocks direct pushes — even from the GitHub Actions integration.
          EOFBODY
            else
              TITLE="chore: sync main → staging (auto, merge ${MAIN_SHORT})"
              cat > "$BODY_FILE" <<EOFBODY
          Automated merge of \`origin/main\` (\`${MAIN_SHORT}\`) into \`staging\`. Staging has commits main doesn't, so this is a non-ff merge that absorbs main's tip. Merge queue lands this.

          This PR is auto-generated by \`.github/workflows/auto-sync-main-to-staging.yml\` on every push to \`main\`.
          EOFBODY
            fi

            # gh pr create prints the URL on stdout; extract the PR number.
            PR_URL=$(gh pr create \
              --base staging \
              --head "$BRANCH" \
              --title "$TITLE" \
              --body-file "$BODY_FILE")
            PR_NUM=$(echo "$PR_URL" | grep -oE '[0-9]+$' | tail -1)
            rm -f "$BODY_FILE"
            echo "::notice::Opened PR #${PR_NUM}"
          else
            echo "::notice::Re-using existing PR #${PR_NUM} for ${BRANCH}"
          fi

          # Enable auto-merge — the merge queue picks it up once
          # required gates are green. Use --merge for merge commits
          # (matches the rest of this repo's PR convention).
          if ! gh pr merge "$PR_NUM" --auto --merge 2>&1; then
            echo "::warning::Failed to enable auto-merge on PR #${PR_NUM} — operator may need to merge manually."
          # Direct push to staging. devops-engineer persona is
          # whitelisted for direct push on the staging branch
          # protection (Settings → Branches → staging).
          #
          # No --force / --force-with-lease: a fast-forward or
          # legitimate merge commit on top of current staging
          # is the only thing we'd ever push. If origin/staging
          # advanced under us (concurrent merge), the push
          # legitimately rejects and the next run picks up the
          # new state.
          if ! git push origin staging; then
            {
              echo "## Push rejected"
              echo
              echo "Direct push to \`staging\` failed. Likely causes:"
              echo "- \`AUTO_SYNC_TOKEN\` rotated / wrong scope (HTTP 401/403)"
              echo "- \`devops-engineer\` no longer in"
              echo "  \`push_whitelist_usernames\` on the staging"
              echo "  branch protection (HTTP 422)"
              echo "- staging advanced concurrently — re-running this"
              echo "  workflow on the new main tip will pick it up"
            } >> "$GITHUB_STEP_SUMMARY"
            exit 1
          fi

          {
            echo "## ✅ Auto-sync PR opened"
            echo "## Auto-sync succeeded"
            echo
            echo "- Branch: \`$BRANCH\`"
            echo "- PR: #$PR_NUM"
            echo "- Strategy: $([ "$DID_FF" = "true" ] && echo "ff" || echo "merge commit")"
            echo
            echo "Merge queue lands the PR once required gates are green; no human action needed unless gates fail."
            echo "- staging advanced to: \`$(git rev-parse --short=8 HEAD)\`"
            echo "- main tip: \`${{ steps.check.outputs.main_short }}\`"
            echo "- Strategy: $([ "${{ steps.merge.outputs.did_ff }}" = "true" ] && echo "fast-forward" || echo "merge commit")"
            echo "- Pushed by: \`devops-engineer\` (per-agent persona, anti-bot-ring)"
          } >> "$GITHUB_STEP_SUMMARY"

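The "no --force" reasoning in the push step relies on git's default non-fast-forward rejection. A sketch with two throwaway clones of a local bare remote (all repo and writer names synthetic) shows the rejection the comment describes:

```shell
set -eu
tmp=$(mktemp -d); cd "$tmp"
git init -q -b staging seed
git -C seed -c user.name=t -c user.email=t@t commit -q --allow-empty -m base
git clone -q --bare seed origin.git
git clone -q origin.git a
git clone -q origin.git b
gc() { git -C "$1" -c user.name=t -c user.email=t@t commit -q --allow-empty -m "$2"; }
# Writer A legitimately advances staging on the remote...
gc a from-a
git -C a push -q origin staging
# ...so writer B's now-diverged push rejects; without --force
# nothing is clobbered, and B must reconcile first.
gc b from-b
if git -C b push -q origin staging 2>/dev/null; then
  echo "push accepted"
else
  echo "push rejected"
fi
```

This is the safety property the workflow leans on: a concurrent advance of `origin/staging` makes the plain push fail red rather than overwrite anything.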
.github/workflows/branch-protection-drift.yml (vendored, 30 lines changed)
@@ -19,6 +19,7 @@ on:
    branches: [staging, main]
    paths:
      - 'tools/branch-protection/**'
      - '.github/workflows/**'
      - '.github/workflows/branch-protection-drift.yml'

permissions:
@@ -79,3 +80,32 @@ jobs:
        # Repo-admin scope, needed for /branches/:b/protection.
        GH_TOKEN: ${{ secrets.GH_TOKEN_FOR_ADMIN_API }}
      run: bash tools/branch-protection/drift_check.sh

      # Self-test the parity script before running it on the real
      # workflows — pins the script's classification logic against
      # synthetic safe/unsafe/missing/unsafe-mix/matrix fixtures so a
      # regression in the script can't false-pass on the production
      # workflow audit. Cheap (~0.5s); always runs.
      - name: Self-test check-name parity script
        run: bash tools/branch-protection/test_check_name_parity.sh

      # Check-name parity gate (#144 / saved memory
      # feedback_branch_protection_check_name_parity).
      #
      # drift_check.sh asserts the live branch protection matches what
      # apply.sh would set; check_name_parity.sh closes the orthogonal
      # gap: it asserts every required check name in apply.sh maps to a
      # workflow job whose "always emits this status" shape is intact.
      #
      # The two checks fail in different scenarios:
      #
      #   - drift_check fails → live state was rewritten out-of-band
      #     (UI click, manual PATCH).
      #   - check_name_parity fails → an apply.sh required name has no
      #     emitter, OR the emitting workflow has a top-level paths:
      #     filter without per-step if-gates (the silent-block shape).
      #
      # Cheap (~1s); runs without the admin token because it only reads
      # apply.sh + .github/workflows/ from the checkout.
      - name: Run check-name parity gate
        run: bash tools/branch-protection/check_name_parity.sh

.github/workflows/harness-replays.yml (vendored, 49 lines changed)
@@ -98,6 +98,55 @@ jobs:
      # github-app-auth sibling-checkout removed 2026-05-07 (#157):
      # the plugin was dropped + Dockerfile.tenant no longer COPYs it.

      # Pre-clone manifest deps before docker compose builds the tenant
      # image (Task #173 followup — same pattern as
      # publish-workspace-server-image.yml's "Pre-clone manifest deps"
      # step).
      #
      # Why pre-clone here too: tests/harness/compose.yml builds tenant-alpha
      # and tenant-beta from workspace-server/Dockerfile.tenant with
      # context=../.. (repo root). That Dockerfile expects
      # .tenant-bundle-deps/{workspace-configs-templates,org-templates,plugins}
      # to be present at build context root (post-#173 it COPYs from there
      # instead of running an in-image clone — the in-image clone failed
      # with "could not read Username for https://git.moleculesai.app"
      # because there's no auth path inside the build sandbox).
      #
      # Without this step harness-replays fails before any replay runs,
      # with `failed to calculate checksum of ref ...
      # "/.tenant-bundle-deps/plugins": not found`. Caught by run #892
      # (main, 2026-05-07T20:28:53Z) and run #964 (staging — same
      # symptom, different root cause: staging still has the in-image
      # clone path, hits the auth error directly).
      #
      # Token shape matches publish-workspace-server-image.yml: AUTO_SYNC_TOKEN
      # is the devops-engineer persona PAT, NOT the founder PAT (per
      # `feedback_per_agent_gitea_identity_default`). clone-manifest.sh
      # embeds it as basic-auth for the duration of the clones and strips
      # .git directories — the token never enters the resulting image.
      - name: Pre-clone manifest deps
        if: needs.detect-changes.outputs.run == 'true'
        env:
          MOLECULE_GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
        run: |
          set -euo pipefail
          if [ -z "${MOLECULE_GITEA_TOKEN}" ]; then
            echo "::error::AUTO_SYNC_TOKEN secret is empty — register the devops-engineer persona PAT in repo Actions secrets"
            exit 1
          fi
          mkdir -p .tenant-bundle-deps
          bash scripts/clone-manifest.sh \
            manifest.json \
            .tenant-bundle-deps/workspace-configs-templates \
            .tenant-bundle-deps/org-templates \
            .tenant-bundle-deps/plugins
          # Sanity-check counts so a silent partial clone fails fast
          # instead of producing a half-empty image.
          ws_count=$(find .tenant-bundle-deps/workspace-configs-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
          org_count=$(find .tenant-bundle-deps/org-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
          plugins_count=$(find .tenant-bundle-deps/plugins -mindepth 1 -maxdepth 1 -type d | wc -l)
          echo "Cloned: ws=$ws_count org=$org_count plugins=$plugins_count"

      - name: Install Python deps for replays
        # peer-discovery-404 (and future replays) eval Python against the
        # running tenant — importing workspace/a2a_client.py pulls in

docs/engineering/ratelimit-observability.md (new file, 147 lines)
@@ -0,0 +1,147 @@
# Rate-limit observability runbook

> Companion to issue #64 ("RATE_LIMIT default re-tune analysis"). After
> #60 deployed the per-tenant `keyFor` keying, the right RATE_LIMIT
> default became data-dependent. This runbook documents the metrics +
> queries an operator should run to confirm whether the current 600
> req/min/key default is correct, too tight, or too loose.

## What's already exposed

The workspace-server's existing Prometheus middleware
(`workspace-server/internal/metrics/metrics.go`) tracks every request
on every path:

```
molecule_http_requests_total{method, path, status} counter
molecule_http_request_duration_seconds_total{method,path,status} counter
```

Path is the matched route pattern (`/workspaces/:id/activity` etc), so
high-cardinality workspace UUIDs do not explode the label space.

The rate limiter middleware (#60, `workspace-server/internal/middleware/ratelimit.go`)
also stamps every response with `X-RateLimit-Limit`, `X-RateLimit-Remaining`,
and `X-RateLimit-Reset`. Operators with browser-side or proxy-side
header capture can read per-request bucket state directly.

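As a sketch of what that capture looks like: the response below is synthetic (the values are made up), but the `X-RateLimit-*` header names are the ones the middleware stamps, and a one-line `awk` pulls out the bucket state:

```shell
# Synthetic captured response; only the header names come from the
# middleware described above, the values are illustrative.
resp='HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 600
X-RateLimit-Remaining: 597
X-RateLimit-Reset: 1767225600'
# Pull out the bucket's view of this session.
printf '%s\n' "$resp" | awk -F': ' '/^X-RateLimit-/ { print $1 "=" $2 }'
```

In the runbook's terms, `X-RateLimit-Remaining` reaching 0 is the moment that session starts receiving 429s.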
No new instrumentation is needed for #64's acceptance criteria. The
|
||||
metric surface is sufficient — this runbook just collects the queries.
|
||||
|
||||
## Queries to run after #60 deploys
|
||||
|
||||
### 1. Is the bucket actually firing 429s?
|
||||
|
||||
```promql
|
||||
sum(rate(molecule_http_requests_total{status="429"}[5m]))
|
||||
```
|
||||
|
||||
If this is zero on a given tenant, the bucket isn't being hit. If it's
|
||||
sustained > 1/min, dig in.
|
||||
|
||||
### 2. Which routes attract 429s?
|
||||
|
||||
```promql
|
||||
topk(
|
||||
10,
|
||||
sum by (path) (
|
||||
rate(molecule_http_requests_total{status="429"}[5m])
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
Expected shape post-#60:
|
||||
- `/workspaces/:id/activity` should be near zero — the canvas no longer
|
||||
polls it on a 30s/60s/5s cadence (PRs #69 / #71 / #76).
|
||||
- Probe / health / heartbeat paths should be ~0 (those routes have a
|
||||
separate IP-fallback bucket).
|
||||
|
||||
If `/workspaces/:id/activity` 429s persist post-PRs-69/71/76 deploy, the
|
||||
canvas isn't running the WS-subscriber path — investigate WS health
|
||||
on that tenant.
|
||||
|
||||
### 3. Per-bucket-key inference (no direct exposure today)
|
||||
|
||||
The bucket map itself is in-memory only; we deliberately do **not**
|
||||
expose `org:<uuid>` ↔ remaining-tokens because that map can include
|
||||
SHA-256 hashes of bearer tokens. A tenant that wants per-key visibility
|
||||
should rely on response headers (`X-RateLimit-Remaining` on every
|
||||
response from a given session is the bucket's view of that session).
|
||||
|
||||
If you genuinely need server-side per-bucket counts for triage,
|
||||
file a follow-up — the proper shape is a `/internal/ratelimit-stats`
|
||||
endpoint that emits **counts per key prefix only** (e.g. `org:`, `tok:`,
|
||||
`ip:`), never the key payloads. Don't roll that ad-hoc; it's a security
|
||||
review surface.
|
||||
|
||||
## Decision tree for the re-tune

After 14 days of production traffic on a tenant, look at the queries
above and walk this tree:

```
Q1: Is the 429 rate sustained > 0.1/sec on any tenant?
├─ NO  → The 600 default has comfortable headroom. Either keep it,
│        or lower it carefully (300) ONLY if you have a documented
│        reason (e.g. a misbehaving client we want to throttle harder).
│        Default to "no change" — see #64 for the math.
└─ YES → Q2.

Q2: Is the 429 rate concentrated on ONE tenant or spread across many?
├─ ONE tenant   → Operator override: set RATE_LIMIT=1200 or 1800 on that
│                 tenant's box. Document it in the tenant's ops note. The
│                 default does not need to change.
└─ MANY tenants → Q3.

Q3: Are the 429s on a route that polls (e.g. /activity or /peers)?
├─ YES → Confirm PRs #69, #71, #76 have actually deployed to those
│        tenants. If they have and 429s persist, the canvas may have
│        a regression — do not raise RATE_LIMIT. File a canvas issue.
└─ NO  → 429s on mutating routes mean genuine load. Raise the default
         to 1200 in `workspace-server/internal/router/router.go:54`.
         The same PR should attach: the metric chart, the time window,
         and a paragraph explaining what changed in our traffic shape.
```
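The Q1/Q2 branches above reduce to two numeric inputs. A throwaway sketch, not repo tooling — the function name and output tags are invented; the 0.1/sec threshold and the recommendations mirror the tree:

```sh
# Throwaway sketch of Q1/Q2 — not repo tooling. Inputs: the sustained
# 429 rate (req/sec, from query 1) and how many tenants show it.
recommend() {
  rate429="$1"
  tenants="$2"
  # Q1: sustained > 0.1/sec?
  if awk -v r="$rate429" 'BEGIN { exit !(r <= 0.1) }'; then
    echo "keep-default-600"     # Q1 = NO
  elif [ "$tenants" -eq 1 ]; then
    echo "operator-override"    # Q2 = ONE tenant
  else
    echo "walk-Q3"              # Q2 = MANY tenants
  fi
}

recommend 0.05 1   # → keep-default-600
recommend 0.5  1   # → operator-override
recommend 0.5  4   # → walk-Q3
```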

## Alert rule template (drop-in for Prometheus)

```yaml
# Sustained 429s — this is the SLO trip-wire. If it fires, walk the
# decision tree above. NB: the issue #64 acceptance criterion is "two
# weeks of metrics"; this alert is the inverse — it tells you something
# changed before the two weeks are up.
groups:
  - name: workspace-server-ratelimit
    rules:
      - alert: WorkspaceServerRateLimit429Sustained
        expr: |
          sum by (instance) (
            rate(molecule_http_requests_total{status="429"}[10m])
          ) > 0.1
        for: 30m
        labels:
          severity: warning
          owner: workspace-server
        annotations:
          summary: "{{ $labels.instance }} sustained 429s — see ratelimit-observability runbook"
          runbook: "https://git.moleculesai.app/molecule-ai/molecule-core/blob/main/docs/engineering/ratelimit-observability.md"
```

Threshold rationale: 0.1 req/s = 6/min sustained over 10 min. Below
that, a 429 is almost certainly a transient burst that the canvas's
retry-once handler at `canvas/src/lib/api.ts:55` already absorbs. The
30m `for:` keeps the alert from chattering on a brief blip.
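Before committing a change to the rule, it can be linted locally with `promtool`, which ships with the Prometheus release. A sketch — the scratch path is illustrative, and the check is skipped when `promtool` isn't on `PATH`:

```sh
# Write the rule to a scratch path and lint it (path is illustrative).
RULES=/tmp/workspace-server-ratelimit.yml
cat > "$RULES" <<'EOF'
groups:
  - name: workspace-server-ratelimit
    rules:
      - alert: WorkspaceServerRateLimit429Sustained
        expr: |
          sum by (instance) (
            rate(molecule_http_requests_total{status="429"}[10m])
          ) > 0.1
        for: 30m
EOF
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES"
fi
```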

## Companion probe script

For one-off triage when an operator can reproduce the problem in their
own browser, `scripts/edge-429-probe.sh` (#62) reproduces a canvas-sized
burst against a tenant subdomain and dumps each 429's response shape so
the operator can distinguish workspace-server bucket overflow from
CF/Vercel edge rate-limiting without dashboard access.

```sh
./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
```

The script's report header explains how to read the output.
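Once a report exists, the origin split can be counted with plain grep. A sketch against two invented records in the script's line format (a real report puts these under each `## wave N` heading):

```sh
# Invented sample records, shaped like the probe script's data lines:
cat > /tmp/edge-sample.txt <<'EOF'
  label=activity-1-3 status=429 server= cf_ray= x_vercel= content_type=application/json x_ratelimit_limit=600
  label=layout-1-7 status=429 server=cloudflare cf_ray=8abc12 x_vercel= content_type=text/html x_ratelimit_limit=
EOF
# Bucket-overflow 429s carry a JSON body + X-RateLimit-* headers;
# Cloudflare-edge 429s carry HTML + a cf-ray id. Count each class:
grep -c 'status=429 .*content_type=application/json' /tmp/edge-sample.txt   # → 1
grep -c 'status=429 .*cf_ray=[A-Za-z0-9]' /tmp/edge-sample.txt              # → 1
```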
scripts/edge-429-probe.sh
Executable file
@@ -0,0 +1,155 @@
#!/usr/bin/env bash
# edge-429-probe.sh — capture 429 origin (workspace-server vs CF/Vercel edge)
# during a simulated canvas-burst against a tenant subdomain.
#
# Issue molecule-core#62. The post-#60 verification step asks an
# operator with CF/Vercel dashboard access to confirm whether the
# layout-chunk 429s observed in DevTools were:
#   (a) workspace-server bucket overflow (closes once #60 deploys), or
#   (b) actual edge-layer rate-limiting (CF or Vercel).
#
# This script doesn't need dashboard access. It reproduces the burst
# pattern locally and dumps every 429's response shape so the operator
# can distinguish (a) from (b) by inspection: workspace-server emits a
# JSON body, CF emits HTML, Vercel emits a different HTML. Headers tell
# the same story (cf-ray vs x-vercel-*).
#
# Usage:
#   ./scripts/edge-429-probe.sh <tenant-host> [--burst N] [--waves N] [--pause SECS] [--out file]
#
# Example:
#   ./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
#
# The script is read-only against the target — it only issues GETs to
# public-by-design endpoints. No mutating requests, no credential use.
#
# Requires curl >= 7.84.0 for the %header{name} --write-out variables.

set -euo pipefail

# ── Help / usage handling first, before positional capture ────────────────────
case "${1:-}" in
  -h|--help|"")
    sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
    exit 0
    ;;
esac

HOST="$1"; shift
BURST=80
WAVES=3
WAVE_PAUSE=2
OUT=""

while [ "${1:-}" != "" ]; do
  case "$1" in
    --burst) BURST="$2"; shift 2 ;;
    --waves) WAVES="$2"; shift 2 ;;
    --pause) WAVE_PAUSE="$2"; shift 2 ;;
    --out)   OUT="$2"; shift 2 ;;
    -h|--help)
      sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
      exit 0
      ;;
    *) echo "unknown arg: $1" >&2; exit 2 ;;
  esac
done

# ── Endpoint discovery ────────────────────────────────────────────────────────
echo "→ Discovering a layout-chunk URL from canvas root..." >&2
ROOT_BODY=$(curl -fsSL --max-time 10 "https://${HOST}/" 2>/dev/null || true)
LAYOUT_PATH=$(echo "$ROOT_BODY" \
  | grep -oE '/_next/static/chunks/layout-[A-Za-z0-9_-]+\.js' \
  | head -1 || true)
if [ -z "$LAYOUT_PATH" ]; then
  LAYOUT_PATH="/_next/static/chunks/layout-probe-not-found.js"
  echo "   (no layout chunk discovered — using sentinel path; 404 on this is expected)" >&2
else
  echo "   layout chunk: $LAYOUT_PATH" >&2
fi

# Probe URL: a generic activity endpoint. The rate-limiter middleware
# runs BEFORE workspace-id validation, so unauth/invalid-id requests
# still hit the bucket.
ACTIVITY_PATH="/workspaces/00000000-0000-0000-0000-000000000000/activity?probe=edge-429"

# ── Fire one wave of curls, one record per request ────────────────────────────
# Backgrounded curls inside a plain function rather than `export -f` +
# xargs, because some shells lose exported functions across the xargs
# boundary. Each output line is a parseable record; failed curls emit a
# curl_err record so request volume is preserved.
TMP_RESULTS="$(mktemp -t edge-429-probe.XXXXXX)"
trap 'rm -f "$TMP_RESULTS"' EXIT

run_burst() {
  # $1 = path; $2 = label; $3 = wave_id
  local path="$1" label="$2" wave="$3"
  local i
  for i in $(seq 1 "$BURST"); do
    {
      out=$(curl -sS --max-time 10 -o /dev/null \
        -w 'status=%{http_code} size=%{size_download} time=%{time_total} server=%header{server} cf_ray=%header{cf-ray} x_vercel=%header{x-vercel-id} retry_after=%header{retry-after} content_type=%header{content-type} x_ratelimit_limit=%header{x-ratelimit-limit} x_ratelimit_remaining=%header{x-ratelimit-remaining} x_ratelimit_reset=%header{x-ratelimit-reset}\n' \
        "https://${HOST}${path}" 2>/dev/null) || out="status=curl_err"
      printf 'label=%s-%s-%s %s\n' "$label" "$wave" "$i" "$out" >> "$TMP_RESULTS"
    } &
  done
  wait
}

emit() {
  if [ -n "$OUT" ]; then
    printf '%s\n' "$*" >> "$OUT"
  else
    printf '%s\n' "$*"
  fi
}

if [ -n "$OUT" ]; then : > "$OUT"; fi

emit "# edge-429-probe report"
emit "# host=$HOST burst=$BURST waves=$WAVES pause=${WAVE_PAUSE}s"
emit "# layout_path=$LAYOUT_PATH"
emit "# activity_path=$ACTIVITY_PATH"
emit "# generated=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
emit ""

for wave in $(seq 1 "$WAVES"); do
  emit "## wave $wave"
  : > "$TMP_RESULTS"
  run_burst "$LAYOUT_PATH" "layout" "$wave"
  run_burst "$ACTIVITY_PATH" "activity" "$wave"
  while read -r line; do
    emit "  $line"
  done < "$TMP_RESULTS"
  if [ "$wave" -lt "$WAVES" ]; then
    sleep "$WAVE_PAUSE"
  fi
done

emit ""
emit "## summary — how to read the report"
emit "# status=429 + content_type starts with application/json + x_ratelimit_limit set"
emit "#   => workspace-server bucket overflow. Closes when #60 deploys."
emit "# status=429 + cf_ray set + content_type=text/html"
emit "#   => Cloudflare WAF / rate-limit. Audit dashboard rules per #62."
emit "# status=429 + x_vercel set + content_type=text/html"
emit "#   => Vercel edge / Bot Fight Mode. Audit Vercel project per #62."
emit "# status=429 with no server/cf_ray/x_vercel"
emit "#   => corporate proxy or VPN. Not actionable in this repo."

if [ -n "$OUT" ]; then
  echo "→ Report written to $OUT" >&2
  # Match only data lines (begin with two-space indent + "label="),
  # not the summary's reference text, which also mentions "status=429".
  # grep -c outputs "0" and exits 1 when there are zero matches; `|| true`
  # masks the exit status so set -e doesn't trip, without losing the count.
  total=$(grep -c '^  label=' "$OUT" 2>/dev/null || true)
  total429=$(grep -c '^  label=.*status=429' "$OUT" 2>/dev/null || true)
  total=${total:-0}
  total429=${total429:-0}
  echo "→ Totals: ${total429} of ${total} requests returned 429" >&2
  if [ "${total429}" -gt 0 ]; then
    echo "→ Per-label 429 counts:" >&2
    grep '^  label=.*status=429' "$OUT" \
      | sed -E 's/^  label=([^-]+).*/  \1/' \
      | sort | uniq -c >&2
  fi
fi
tools/branch-protection/check_name_parity.sh
Executable file
@@ -0,0 +1,252 @@
#!/usr/bin/env bash
# tools/branch-protection/check_name_parity.sh — assert every required-
# check name listed in apply.sh maps to a workflow job whose "always
# emits this status" shape is intact.
#
# Closes #144 / encodes the saved memory
# feedback_branch_protection_check_name_parity:
#
#   "Path filters (e.g., detect-changes → conditional skip) silently
#    break branch protection because no job emits the protected
#    sentinel status when path-filter returns false."
#
# Two safe shapes for a required-check job:
#
#   1. Single-job-with-per-step-if (path-filter case):
#      The workflow has NO top-level `paths:` filter; the always-running
#      job has steps gated on `if: needs.<gate>.outputs.<flag> == 'true'`
#      so the no-op step alone fires when paths exclude the commit.
#      Used by ci.yml's Platform/Canvas/Python/Shellcheck and by
#      e2e-api.yml / e2e-staging-canvas.yml / runtime-prbuild-compat.yml.
#
#   2. Aggregator-with-needs+always() (matrix-refactor case):
#      An aggregator job named after the protected check `needs:` the
#      matrix children + uses `if: always()` + checks each child's
#      result. (Not currently in this repo but supported.)
#
# Unsafe shape this script catches:
#   - Workflow has top-level `paths:` filter AND the protected check
#     name is on a single job. When paths-filter excludes a commit, the
#     workflow doesn't fire — branch protection waits forever.
#
# Exit codes:
#   0 — every required check name has at least one safe-shape match
#   1 — a required name has no match OR matches an unsafe shape
#   2 — script-internal error (apply.sh missing, awk failure, etc.)

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
WORKFLOWS_DIR="$REPO_ROOT/.github/workflows"
APPLY_SH="$SCRIPT_DIR/apply.sh"

if [[ ! -f "$APPLY_SH" ]]; then
  echo "check_name_parity: missing apply.sh at $APPLY_SH" >&2
  exit 2
fi
if [[ ! -d "$WORKFLOWS_DIR" ]]; then
  echo "check_name_parity: missing .github/workflows at $WORKFLOWS_DIR" >&2
  exit 2
fi

# ─── Extract the union of required check names from apply.sh ──────────────────
# apply.sh has STAGING_CHECKS and MAIN_CHECKS heredocs; union them so
# we audit any name that gates EITHER branch. Filters out blank lines
# and the heredoc end marker. Sorted + de-duplicated so the audit
# output is stable.
#
# Captures the heredoc end-marker dynamically from the `<<'MARKER'`
# token on the opening line — the token can be `EOF` (production
# apply.sh), `EOF2` (test fixtures with nested heredocs), or any other
# bash-legal identifier. Without dynamic extraction, test fixtures
# with nested heredocs would either skip capture (wrong end marker)
# or capture the inner end marker as a stray check name.
#
# Two-step approach to keep the awk portable across BSD awk (macOS) and
# gawk (Linux): grep finds the heredoc-opening lines, sed extracts the
# marker, then awk does the capture. Pure-awk attempts hit BSD-vs-GNU
# regex/variable-init differences that regress silently — this shape
# stays in POSIX-portable territory.
extract_heredoc_block() {
  local file="$1"
  local marker="$2"
  awk -v marker="$marker" '
    $0 ~ "<<.?" marker { capture=1; next }
    $0 == marker && capture { capture=0; next }
    capture && NF { print }
  ' "$file"
}

# Find every heredoc-end marker used in apply.sh (typically just EOF
# in the production script, but EOF2 / TAG / ABC are all valid in
# fixtures or future expansions). Each marker maps to one or more
# heredoc blocks; we union all of them.
markers=$(grep -E "<<['\"]?[A-Za-z0-9_]+['\"]?[[:space:]]*\\|\\|" "$APPLY_SH" \
  | sed -E "s/.*<<['\"]?([A-Za-z0-9_]+)['\"]?.*/\\1/" \
  | sort -u)

required_names=""
while IFS= read -r marker; do
  [[ -z "$marker" ]] && continue
  block=$(extract_heredoc_block "$APPLY_SH" "$marker")
  if [[ -n "$block" ]]; then
    required_names+="$block"$'\n'
  fi
done <<< "$markers"

required_names=$(printf '%s' "$required_names" | sort -u | sed '/^$/d')

if [[ -z "$required_names" ]]; then
  echo "check_name_parity: failed to extract required check names from apply.sh" >&2
  exit 2
fi

# ─── For each required name, find the workflow file that owns it ──────────────
# A workflow "owns" a name if any `name:` line in the file equals the
# required name. We look at job-level names AND the workflow-level
# `name:` (the latter prefixes "Analyze" jobs in codeql.yml).
#
# Then we check whether the owning workflow has a top-level `paths:`
# filter. The unsafe shape is:
#   - top-level paths: filter present
#   - AND the named job is gated only at the workflow level (no per-
#     step `if:` gates)
#
# Distinguishing "no `paths:` filter" from "paths: filter + per-step
# gating" requires parsing the YAML semantics. We do it heuristically:
#
#   - "no top-level paths:"  → safe by construction (workflow always
#                              fires)
#   - "paths: present"       → check whether the matching job has at
#                              least one `if: needs.<x>.outputs` step
#                              gate, and name the failure mode
#                              accordingly below.
#
# Heuristic so it stays a portable bash + awk + grep tool — full YAML
# parsing would need yq, which isn't a dependency. The known unsafe
# shape (workflow-level paths: AND no per-step if-gates) is what we're
# trying to catch.

failed=0
declare -a unsafe_findings=()

while IFS= read -r name; do
  [[ -z "$name" ]] && continue
  # Find every workflow file that contains a job with `name: <name>` or
  # whose top-level workflow `name:` plus matrix substitution would
  # produce <name>. Need to be careful about quoting — YAML allows
  # `name: Foo`, `name: "Foo"`, `name: 'Foo'`. Strip quotes.
  matches=()
  while IFS= read -r f; do
    # Look for an exact `name:` match (anywhere in the file). The
    # workflow-level name line is at column 0; job-level names are
    # indented. Either is acceptable for parity — what matters is
    # whether the EMITTED check-run name is the one we required.
    # Strip surrounding quotes/whitespace before comparing.
    if awk -v want="$name" '
      /^[[:space:]]*name:[[:space:]]*/ {
        line = $0
        sub(/^[[:space:]]*name:[[:space:]]*/, "", line)
        # Strip surrounding " or '\''
        gsub(/^["\047]|["\047]$/, "", line)
        # Strip trailing whitespace + comment
        sub(/[[:space:]]*#.*$/, "", line)
        sub(/[[:space:]]+$/, "", line)
        if (line == want) found = 1
      }
      END { exit !found }
    ' "$f"; then
      matches+=("$f")
    fi
  done < <(find "$WORKFLOWS_DIR" -name '*.yml' -o -name '*.yaml')

  if [[ ${#matches[@]} -eq 0 ]]; then
    # Special case — Analyze (go/javascript-typescript/python) is
    # generated by codeql.yml's matrix expansion of `Analyze (${{
    # matrix.language }})`. Don't flag those as missing if codeql.yml
    # exists with the expected base name.
    case "$name" in
      "Analyze (go)"|"Analyze (javascript-typescript)"|"Analyze (python)")
        # shellcheck disable=SC2016
        # The literal `${{ matrix.language }}` is the GHA template
        # syntax we're searching FOR — not a shell expansion. SC2016
        # would have us add quotes that defeat the search.
        if [[ -f "$WORKFLOWS_DIR/codeql.yml" ]] && \
           grep -q 'name: Analyze (${{[[:space:]]*matrix.language[[:space:]]*}})' "$WORKFLOWS_DIR/codeql.yml"; then
          matches=("$WORKFLOWS_DIR/codeql.yml")
        fi
        ;;
    esac
  fi

  if [[ ${#matches[@]} -eq 0 ]]; then
    unsafe_findings+=("MISSING: required check name '$name' has no matching workflow job")
    failed=1
    continue
  fi

  # For each owning workflow, classify safe vs unsafe.
  for f in "${matches[@]}"; do
    rel="${f#"$REPO_ROOT"/}"
    # Heuristic: does the workflow have a top-level `paths:` filter?
    # Top-level here means under the `on:` key, not under jobs.<x>.if.
    # Workflow-level paths filters appear at indent depth 4 (under
    # `push:` or `pull_request:`). A job-level `if:` paths check doesn't
    # block the workflow from firing.
    has_top_paths=0
    if awk '
      # Track whether we are inside the `on:` block. The `on:` block
      # starts at column 0 (`on:` key) and ends when the next column-0
      # key appears.
      /^on:[[:space:]]*$/ { in_on = 1; next }
      /^[a-zA-Z]/ && in_on { in_on = 0 }
      in_on && /^[[:space:]]+paths:[[:space:]]*$/ { print "yes"; exit }
      in_on && /^[[:space:]]+paths:[[:space:]]*\[/ { print "yes"; exit }
    ' "$f" | grep -q yes; then
      has_top_paths=1
    fi

    if [[ "$has_top_paths" -eq 0 ]]; then
      # Safe: workflow always fires. If there are inner per-step if-
      # gates (single-job-with-per-step-if pattern), the no-op step
      # produces SUCCESS for the protected name — branch-protection-clean.
      continue
    fi

    # Unsafe candidate — has top-level paths:. Look for any `if:`
    # referencing a paths-filter / detect-changes output inside the
    # file, so we can name the failure mode precisely: with top-level
    # paths: present, BOTH shapes below are unsafe, but they get
    # different remediation advice.
    #
    # The regex is intentionally anchored loosely — actual workflow
    # YAML writes per-step if-gates as `- if: needs.X.outputs.Y`
    # (with the `-` step-marker between the leading spaces and the
    # `if`). Anchoring on `^[[:space:]]+if:` would miss those.
    if grep -qE "if:[[:space:]]+needs\.[a-zA-Z_-]+\.outputs\." "$f"; then
      # Per-step if-gates exist. Combined with top-level paths: this
      # is a buggy mix (the workflow might still skip entirely
      # when paths exclude). Flag as unsafe — the safe pattern omits
      # the top-level paths: filter altogether and gates per-step.
      unsafe_findings+=("UNSAFE-MIX: $rel has top-level paths: AND per-step if-gates — when paths exclude the commit, the workflow doesn't fire and the required check '$name' is silently absent. Drop the top-level paths: filter; keep the per-step if-gates.")
      failed=1
    else
      # Top-level paths: with no per-step if-gates: the canonical
      # check-name parity bug.
      unsafe_findings+=("UNSAFE-PATH-FILTER: $rel has top-level paths: filter and no per-step if-gates. When paths exclude the commit, no job emits the required check '$name' — branch protection waits forever. Either drop the paths: filter and add per-step if-gates against a detect-changes output, or add an aggregator-with-needs+always() job that emits '$name'.")
      failed=1
    fi
  done
done <<< "$required_names"

if [[ "$failed" -eq 0 ]]; then
  echo "check_name_parity: OK — every required check name maps to a safe workflow shape."
  exit 0
fi

echo "check_name_parity: FOUND ${#unsafe_findings[@]} issue(s):" >&2
for finding in "${unsafe_findings[@]}"; do
  echo "  - $finding" >&2
done
exit 1
tools/branch-protection/test_check_name_parity.sh
Executable file
@@ -0,0 +1,285 @@
#!/usr/bin/env bash
# tools/branch-protection/test_check_name_parity.sh — unit tests for
# check_name_parity.sh.
#
# Builds a synthetic apply.sh + workflow files in a tmpdir for each case,
# invokes the script with REPO_ROOT pointing at the tmpdir, and asserts
# on exit code + stderr. Per feedback_assert_exact_not_substring we
# pin the EXACT exit code AND a substring of the stderr that names the
# offending workflow + name combo — so a "false pass that prints the
# wrong message" still fails the test.
#
# Run locally: bash tools/branch-protection/test_check_name_parity.sh
# Run in CI:   same — added to ci.yml's shellcheck job's "E2E bash unit
# tests" step alongside test_model_slug.sh.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
SCRIPT_UNDER_TEST="$SCRIPT_DIR/check_name_parity.sh"

if [[ ! -x "$SCRIPT_UNDER_TEST" ]]; then
  echo "test_check_name_parity: script under test missing or not executable: $SCRIPT_UNDER_TEST" >&2
  exit 2
fi

PASSED=0
FAILED=0

# Tracks the active tmpdir for the running case so the trap can clean
# up even when assertions abort the case mid-flight.
TMPDIR_FOR_CASE=""
trap '[[ -n "$TMPDIR_FOR_CASE" && -d "$TMPDIR_FOR_CASE" ]] && rm -rf "$TMPDIR_FOR_CASE"' EXIT

# Build a synthetic repo at $1 with apply.sh listing $2 (one name per
# line) as both the staging and main required sets, then let the test
# case write whatever .github/workflows/* files it needs.
make_fake_repo() {
  local root="$1"
  local checks="$2"
  mkdir -p "$root/tools/branch-protection"
  mkdir -p "$root/.github/workflows"
  cat > "$root/tools/branch-protection/apply.sh" <<EOF
#!/usr/bin/env bash
# Stub apply.sh — only the heredoc-shaped check lists matter for the
# parity script. Other functions intentionally absent.

read -r -d '' STAGING_CHECKS <<'EOF2' || true
$checks
EOF2

read -r -d '' MAIN_CHECKS <<'EOF2' || true
$checks
EOF2
EOF
  chmod +x "$root/tools/branch-protection/apply.sh"
  # Place the script-under-test alongside its sibling apply.sh so the
  # script's REPO_ROOT walk finds the synthetic .github/workflows/.
  cp "$SCRIPT_UNDER_TEST" "$root/tools/branch-protection/check_name_parity.sh"
}

run_case() {
  local desc="$1"
  local checks="$2"
  local workflow_yaml="$3"   # contents to write
  local workflow_filename="$4"
  local expected_exit="$5"
  local expected_stderr_substring="$6"
  TMPDIR_FOR_CASE=$(mktemp -d)
  make_fake_repo "$TMPDIR_FOR_CASE" "$checks"
  printf '%s' "$workflow_yaml" > "$TMPDIR_FOR_CASE/.github/workflows/$workflow_filename"
  local stderr_file
  stderr_file=$(mktemp)
  local actual_exit=0
  bash "$TMPDIR_FOR_CASE/tools/branch-protection/check_name_parity.sh" 2>"$stderr_file" >/dev/null || actual_exit=$?
  local stderr_content
  stderr_content=$(cat "$stderr_file")
  rm "$stderr_file"
  if [[ "$actual_exit" -ne "$expected_exit" ]]; then
    echo "FAIL: $desc"
    echo "  expected exit: $expected_exit, got: $actual_exit"
    echo "  stderr: $stderr_content"
    FAILED=$((FAILED+1))
    rm -rf "$TMPDIR_FOR_CASE"; TMPDIR_FOR_CASE=""
    return
  fi
  # Empty expected substring → no assertion on stderr (used for the
  # passing cases, where stderr should be empty / not interesting).
  if [[ -n "$expected_stderr_substring" ]]; then
    if ! grep -qF "$expected_stderr_substring" <<< "$stderr_content"; then
      echo "FAIL: $desc"
      echo "  expected stderr to contain: '$expected_stderr_substring'"
      echo "  actual stderr: $stderr_content"
      FAILED=$((FAILED+1))
      rm -rf "$TMPDIR_FOR_CASE"; TMPDIR_FOR_CASE=""
      return
    fi
  fi
  echo "PASS: $desc"
  PASSED=$((PASSED+1))
  rm -rf "$TMPDIR_FOR_CASE"; TMPDIR_FOR_CASE=""
}

# Case 1: safe workflow — no top-level paths: filter, single job
# emitting the required name. Should exit 0.
run_case "safe: no paths filter, job emits required name" \
  "Foo Build" \
  "$(cat <<'EOF'
name: Foo

on:
  push:
    branches: [main]
  pull_request:

jobs:
  foo:
    name: Foo Build
    runs-on: ubuntu-latest
    steps:
      - run: echo ok
EOF
)" \
  "foo.yml" \
  0 \
  ""

# Case 2: unsafe — top-level paths: filter AND no per-step if-gates.
# This is the silent-block shape from the saved memory.
run_case "unsafe: top-level paths: filter without per-step if-gates" \
  "Bar Build" \
  "$(cat <<'EOF'
name: Bar

on:
  push:
    branches: [main]
    paths:
      - 'bar/**'
  pull_request:
    paths:
      - 'bar/**'

jobs:
  bar:
    name: Bar Build
    runs-on: ubuntu-latest
    steps:
      - run: echo ok
EOF
)" \
  "bar.yml" \
  1 \
  "UNSAFE-PATH-FILTER"

# Case 3: required name has no emitter at all.
run_case "missing: required name not in any workflow" \
  "Nonexistent Job" \
  "$(cat <<'EOF'
name: Other

on:
  pull_request:

jobs:
  other:
    name: Other Job
    runs-on: ubuntu-latest
    steps:
      - run: echo ok
EOF
)" \
  "other.yml" \
  1 \
  "MISSING: required check name 'Nonexistent Job'"

# Case 4: safe — top-level paths: filter is absent BUT per-step if-
# gates are present (single-job-with-per-step-if pattern, what
# ci.yml + e2e-api.yml use). Should exit 0.
run_case "safe: per-step if-gates without top-level paths" \
  "Baz Build" \
  "$(cat <<'EOF'
name: Baz

on:
  push:
    branches: [main]
  pull_request:

jobs:
  changes:
    name: Detect changes
    runs-on: ubuntu-latest
    outputs:
      baz: ${{ steps.check.outputs.baz }}
    steps:
      - id: check
        run: echo "baz=true" >> "$GITHUB_OUTPUT"

  baz:
    needs: changes
    name: Baz Build
    runs-on: ubuntu-latest
    steps:
      - if: needs.changes.outputs.baz != 'true'
        run: echo no-op
      - if: needs.changes.outputs.baz == 'true'
        run: echo real work
EOF
)" \
  "baz.yml" \
  0 \
  ""

# Case 5: unsafe-mix — top-level paths: AND per-step if-gates. The
# script flags this distinctly because the workflow may STILL skip
# entirely when paths exclude the commit (the per-step gates only
# matter if the workflow actually fires).
run_case "unsafe-mix: top-level paths: AND per-step if-gates" \
  "Qux Build" \
  "$(cat <<'EOF'
name: Qux

on:
  push:
    branches: [main]
    paths:
      - 'qux/**'
  pull_request:
    paths:
      - 'qux/**'

jobs:
  changes:
    name: Detect changes
    runs-on: ubuntu-latest
    outputs:
      qux: ${{ steps.check.outputs.qux }}
    steps:
      - id: check
        run: echo "qux=true" >> "$GITHUB_OUTPUT"

  qux:
    needs: changes
    name: Qux Build
    runs-on: ubuntu-latest
    steps:
      - if: needs.changes.outputs.qux == 'true'
        run: echo build
EOF
)" \
  "qux.yml" \
  1 \
  "UNSAFE-MIX"

# Case 6: codeql.yml matrix — required names like "Analyze (go)" are
# generated by `Analyze (${{ matrix.language }})`. Script must
# special-case match this pattern.
run_case "matrix: codeql Analyze (go) is recognised via matrix expansion" \
  "$(printf 'Analyze (go)\nAnalyze (javascript-typescript)\nAnalyze (python)')" \
  "$(cat <<'EOF'
name: CodeQL

on:
  pull_request:

jobs:
  analyze:
    name: Analyze (${{ matrix.language }})
    runs-on: ubuntu-latest
    strategy:
      matrix:
        language: [go, javascript-typescript, python]
    steps:
      - run: echo analyse
EOF
)" \
  "codeql.yml" \
  0 \
  ""

echo ""
echo "================================================"
echo "test_check_name_parity: $PASSED passed, $FAILED failed"
echo "================================================"
exit "$FAILED"