From 0082568448e757283ef7ea602ba1b58037b04fb3 Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Wed, 22 Apr 2026 14:40:28 -0700
Subject: [PATCH] =?UTF-8?q?ci:=20canary-verify=20graceful-skip=20+=20draft?=
 =?UTF-8?q?=20auto-promote=20staging=E2=86=92main?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two related workflow hygiene changes:

## (1) canary-verify: graceful-skip when canary secrets absent

Before: canary-verify hit `scripts/canary-smoke.sh` which exited non-zero
when CANARY_TENANT_URLS was empty. Every main publish ran → canary-verify
failed → red check on main CI signal (7/7 in past 24h). Noise, no value.

After: smoke step detects the missing-secrets case, writes a warning to
the step summary, sets an output `smoke_ran=false`, and exits 0. The
workflow completes green without pretending to have tested anything.

Gated downstream: `promote-to-latest` now requires BOTH
`needs.canary-smoke.result == success` AND
`needs.canary-smoke.outputs.smoke_ran == true`. A skip does NOT
auto-promote — manual `promote-latest.yml` remains the release gate
while Phase 2 canary is absent (see
molecule-controlplane/docs/canary-tenants.md for the fleet stand-up plan
+ decision framework).

When the canary fleet is stood up and secrets populated: delete the
early-exit branch + the smoke_ran gate. The workflow goes back to its
original "smoke gates promotion" semantics.

## (2) auto-promote-staging.yml — draft

New workflow that fires after CI / E2E Staging Canvas / E2E API / CodeQL
complete on the staging branch, checks that ALL four are green on the
same SHA, and fast-forwards `main` to that SHA.

Shipped disabled: the promote step is gated behind repo variable
`AUTO_PROMOTE_ENABLED=true`. Until that's set, the workflow dry-runs and
logs what it would have done. Toggle via Settings → Variables when
staging CI has been reliably green for a few days.
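As a rough illustration of the two rules above (every gate must be `completed/success`, and nothing is pushed unless the variable or `force=true` is set), here is a minimal shell sketch. The helper names `all_gates_green` and `should_push` are hypothetical and do not appear in the workflow, which performs the same checks inline in its `run:` steps.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this patch): succeeds only when every
# argument, a "status/conclusion" string for one gate, is exactly
# "completed/success".
all_gates_green() {
  # No gate results at all is treated as "not green".
  [ "$#" -gt 0 ] || return 1
  for result in "$@"; do
    # Pending, failed, skipped, cancelled, or missing results abort.
    [ "$result" = "completed/success" ] || return 1
  done
  return 0
}

# Hypothetical helper: mirrors the rollout gate. $1 is the value of
# AUTO_PROMOTE_ENABLED, $2 is the workflow_dispatch `force` input.
should_push() {
  [ "${1:-}" = "true" ] || [ "${2:-false}" = "true" ]
}
```

For example, `all_gates_green "completed/success" "in_progress/none"` returns non-zero, so the promotion would be skipped, while `should_push "" "true"` succeeds because the manual override is set.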
Safety:
- workflow_run events only fire on push to staging (PRs into staging
  don't promote).
- Every required gate must be `completed/success` on the same head_sha.
  Pending / failed / skipped / cancelled → abort.
- `--ff-only` push. Refuses to advance main if it has diverged from
  staging history (someone landed a direct-to-main commit that's not on
  staging). Human resolves the fork.
- `workflow_dispatch` with `force=true` lets us test the flow end-to-end
  before flipping the variable on.

Motivation: molecule-core#1496 has been open with 1172 commits
divergence between staging and main. Today that trapped PR #1526
(dynamic canvas runtime dropdown) on staging while prod users hit the
hardcoded-dropdown bug. Auto-promote retires the bulk staging→main PR
pattern once the staging CI it depends on is reliable.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/auto-promote-staging.yml | 182 +++++++++++++++++++++
 .github/workflows/canary-verify.yml        |  34 +++-
 2 files changed, 214 insertions(+), 2 deletions(-)
 create mode 100644 .github/workflows/auto-promote-staging.yml

diff --git a/.github/workflows/auto-promote-staging.yml b/.github/workflows/auto-promote-staging.yml
new file mode 100644
index 00000000..c3427787
--- /dev/null
+++ b/.github/workflows/auto-promote-staging.yml
@@ -0,0 +1,182 @@
+name: Auto-promote staging → main
+
+# Fires after any of the staging-branch quality gates complete. When ALL
+# required gates are green on the same staging SHA, fast-forwards `main`
+# to that SHA automatically — closing the gap that historically let
+# features sit on staging for weeks waiting for a bulk promotion PR
+# (see molecule-core#1496 for the 1172-commit example).
+#
+# Safety model:
+#   - Runs ONLY on workflow_run events for the staging branch.
+#   - Requires EVERY named gate workflow to have the same head_sha and
+#     all be `conclusion == success`. If any of them is red, skipped,
+#     cancelled, or pending, we abort (stay on the current main).
+#   - Uses --ff-only: refuses to advance main if main has diverged from
+#     the staging history (e.g. a hotfix landed directly on main). In
+#     that case a human resolves the fork.
+#   - Writes a commit summary so the promote shows up in git log as a
+#     deliberate act, not a stealth move.
+#
+# **Initial rollout:** ship this file but leave the `enabled` input set
+# such that nothing auto-promotes until staging CI has been reliably
+# green for a few days. Toggle via repo variable `AUTO_PROMOTE_ENABLED`.
+
+on:
+  workflow_run:
+    workflows:
+      - CI
+      - E2E Staging Canvas (Playwright)
+      - E2E API Smoke Test
+      - CodeQL
+    types: [completed]
+  workflow_dispatch:
+    inputs:
+      force:
+        description: "Force promote even when AUTO_PROMOTE_ENABLED is unset (manual override)"
+        required: false
+        default: "false"
+
+permissions:
+  contents: write
+
+jobs:
+  check-all-gates-green:
+    # Only consider staging pushes. PRs into staging don't promote.
+    if: >
+      (github.event_name == 'workflow_run' &&
+       github.event.workflow_run.head_branch == 'staging' &&
+       github.event.workflow_run.event == 'push')
+      || github.event_name == 'workflow_dispatch'
+    runs-on: ubuntu-latest
+    outputs:
+      all_green: ${{ steps.gates.outputs.all_green }}
+      head_sha: ${{ steps.gates.outputs.head_sha }}
+    steps:
+      - name: Check all required gates on this SHA
+        id: gates
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
+          REPO: ${{ github.repository }}
+        run: |
+          set -euo pipefail
+
+          # Required gate workflow names. Must match the `name:` field
+          # in the respective .github/workflows/*.yml files.
+          GATES=(
+            "CI"
+            "E2E Staging Canvas (Playwright)"
+            "E2E API Smoke Test"
+            "CodeQL"
+          )
+
+          echo "head_sha=${HEAD_SHA}" >> "$GITHUB_OUTPUT"
+          echo "Checking gates on SHA ${HEAD_SHA}"
+
+          ALL_GREEN=true
+          for gate in "${GATES[@]}"; do
+            # Query the most recent run of this workflow on this SHA.
+            # event=push to avoid picking up PR runs. branch=staging to
+            # guard against someone dispatching the gate on a non-staging
+            # branch at the same SHA.
+            RESULT=$(gh run list \
+              --repo "$REPO" \
+              --workflow "$gate" \
+              --branch staging \
+              --event push \
+              --commit "$HEAD_SHA" \
+              --limit 1 \
+              --json status,conclusion \
+              --jq '.[0] | "\(.status)/\(.conclusion // "none")"' \
+              2>/dev/null || echo "missing/none")
+
+            echo " $gate → $RESULT"
+
+            # Only completed/success counts. completed/failure or
+            # in_progress/anything or no record at all = abort.
+            if [ "$RESULT" != "completed/success" ]; then
+              ALL_GREEN=false
+            fi
+          done
+
+          echo "all_green=${ALL_GREEN}" >> "$GITHUB_OUTPUT"
+          if [ "$ALL_GREEN" != "true" ]; then
+            echo "::notice::auto-promote: not all gates are green on ${HEAD_SHA} — staying on current main"
+          fi
+
+  promote:
+    needs: check-all-gates-green
+    if: needs.check-all-gates-green.outputs.all_green == 'true'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check rollout gate
+        env:
+          AUTO_PROMOTE_ENABLED: ${{ vars.AUTO_PROMOTE_ENABLED }}
+          FORCE_INPUT: ${{ github.event.inputs.force }}
+        run: |
+          set -eu
+          # Repo variable AUTO_PROMOTE_ENABLED=true flips this on. While
+          # it's unset, the workflow dry-runs (logs what it would have
+          # done) but doesn't actually push to main. Set the variable in
+          # Settings → Secrets and variables → Actions → Variables.
+          if [ "${AUTO_PROMOTE_ENABLED:-}" != "true" ] && [ "${FORCE_INPUT:-false}" != "true" ]; then
+            {
+              echo "## ⏸ Auto-promote disabled"
+              echo
+              echo "Repo variable \`AUTO_PROMOTE_ENABLED\` is not set to \`true\`."
+              echo "All gates are green on staging; would have promoted to \`main\`."
+              echo
+              echo "To enable: Settings → Secrets and variables → Actions → Variables → \`AUTO_PROMOTE_ENABLED=true\`."
+              echo "To test once manually: workflow_dispatch with \`force=true\`."
+            } >> "$GITHUB_STEP_SUMMARY"
+            echo "::notice::auto-promote disabled — dry run only"
+            exit 0
+          fi
+
+      - name: Checkout main
+        if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
+        uses: actions/checkout@v4
+        with:
+          ref: main
+          fetch-depth: 0
+          token: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Fast-forward main → staging HEAD
+        if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
+        env:
+          TARGET_SHA: ${{ needs.check-all-gates-green.outputs.head_sha }}
+        run: |
+          set -eu
+          git config user.name "github-actions[bot]"
+          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+
+          git fetch origin staging
+          git fetch origin main
+
+          # Refuse to advance main if it's diverged from staging history.
+          # Someone landed a commit directly on main that's not on
+          # staging → human needs to decide how to reconcile.
+          if ! git merge-base --is-ancestor "$(git rev-parse origin/main)" "$TARGET_SHA"; then
+            {
+              echo "## ❌ Auto-promote refused — main has diverged"
+              echo
+              echo "\`main\` (\`$(git rev-parse --short origin/main)\`) is not an ancestor of staging (\`${TARGET_SHA:0:7}\`)."
+              echo "Someone committed directly to main or the histories forked."
+              echo
+              echo "Resolve manually: merge main into staging, get CI green on the merged commit,"
+              echo "then the auto-promote will succeed on the next run."
+            } >> "$GITHUB_STEP_SUMMARY"
+            exit 1
+          fi
+
+          # Fast-forward main to the target SHA.
+          git checkout main
+          git merge --ff-only "$TARGET_SHA"
+          git push origin main
+
+          {
+            echo "## ✅ Auto-promoted main → ${TARGET_SHA:0:7}"
+            echo
+            echo "All gate workflows green on staging at this SHA."
+            echo "\`main\` fast-forwarded to match."
+          } >> "$GITHUB_STEP_SUMMARY"
diff --git a/.github/workflows/canary-verify.yml b/.github/workflows/canary-verify.yml
index 36c88610..6e560969 100644
--- a/.github/workflows/canary-verify.yml
+++ b/.github/workflows/canary-verify.yml
@@ -37,6 +37,7 @@ jobs:
     runs-on: ubuntu-latest
     outputs:
       sha: ${{ steps.compute.outputs.sha }}
+      smoke_ran: ${{ steps.smoke.outputs.ran }}
     steps:
       - name: Checkout
         uses: actions/checkout@v4
@@ -85,12 +86,38 @@ jobs:
           echo "Timeout after ${MAX_WAIT}s — proceeding anyway (smoke suite will validate)"
 
       - name: Run canary smoke suite
+        id: smoke
+        # Graceful-skip when no canary fleet is configured (Phase 2 not yet
+        # stood up — see molecule-controlplane/docs/canary-tenants.md).
+        # Sets `ran=false` on skip so promote-to-latest stays off (we don't
+        # want every main merge auto-promoting without gating). Manual
+        # promote-latest.yml is the release gate while canary is absent.
+        # Once the fleet is real: delete the early-exit branch.
         env:
           CANARY_TENANT_URLS: ${{ secrets.CANARY_TENANT_URLS }}
           CANARY_ADMIN_TOKENS: ${{ secrets.CANARY_ADMIN_TOKENS }}
           CANARY_CP_BASE_URL: https://staging-api.moleculesai.app
           CANARY_CP_SHARED_SECRET: ${{ secrets.CANARY_CP_SHARED_SECRET }}
-        run: bash scripts/canary-smoke.sh
+        run: |
+          set -euo pipefail
+          if [ -z "${CANARY_TENANT_URLS:-}" ] \
+            || [ -z "${CANARY_ADMIN_TOKENS:-}" ] \
+            || [ -z "${CANARY_CP_SHARED_SECRET:-}" ]; then
+            {
+              echo "## ⚠️ canary-verify skipped"
+              echo
+              echo "One or more canary secrets are unset (\`CANARY_TENANT_URLS\`, \`CANARY_ADMIN_TOKENS\`, \`CANARY_CP_SHARED_SECRET\`)."
+              echo "Phase 2 canary fleet has not been stood up yet —"
+              echo "see [canary-tenants.md](https://github.com/Molecule-AI/molecule-controlplane/blob/main/docs/canary-tenants.md)."
+              echo
+              echo "**Skipped — promote-to-latest will NOT auto-fire.** Dispatch \`promote-latest.yml\` manually when ready."
+            } >> "$GITHUB_STEP_SUMMARY"
+            echo "ran=false" >> "$GITHUB_OUTPUT"
+            echo "::notice::canary-verify: skipped — no canary fleet configured"
+            exit 0
+          fi
+          bash scripts/canary-smoke.sh
+          echo "ran=true" >> "$GITHUB_OUTPUT"
 
       - name: Summary on failure
         if: ${{ failure() }}
@@ -109,8 +136,11 @@ jobs:
     # On green, retag :staging- → :latest for BOTH images.
     # crane is a lightweight registry client (no Docker daemon needed on
     # the runner) that can retag remotely with a single API call each.
+    # Gated on smoke_ran=true — without a real canary fleet the smoke
+    # step no-ops with success, and we don't want that to silently
+    # auto-promote every main merge.
     needs: canary-smoke
-    if: ${{ needs.canary-smoke.result == 'success' }}
+    if: ${{ needs.canary-smoke.result == 'success' && needs.canary-smoke.outputs.smoke_ran == 'true' }}
     runs-on: ubuntu-latest
     steps:
       - uses: imjasonh/setup-crane@v0.4