feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975)

Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".

## Detection logic — `scripts/check-stale-promote-pr.sh`

Reads open PRs `base=main head=staging` and alarms on:
  - `mergeStateStatus == BLOCKED`
  - `reviewDecision == REVIEW_REQUIRED`
  - createdAt older than `STALE_HOURS` (default 4h)

Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.

Output:
  - `::warning` per stale PR (visible in workflow summary + Actions UI)
  - PR comment (idempotent via marker-string detection; one alarm
    per PR, never re-spammed)
  - Exit code = count of stale PRs (capped at 125)

Logic in a script (not inline workflow YAML) so it's:
  - **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
    every branch with stubbed fixture JSON + frozen clock. 23 tests
    covering: empty list, single stale, just-under-threshold, wrong
    reviewDecision, wrong mergeStateStatus, mixed list (only matching
    PRs alarm), custom threshold via --stale-hours, exit-code-counts-
    matching-PRs, --help, unknown arg → 64, missing repo → 2.
  - **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
    works from any shell with `gh` + `jq`.
  - **SSOT** — one detector, the workflow YAML is just schedule +
    invocation surface. Future sibling workflows that need the same
    check call the same script.

## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`

Triggers:
  - cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
  - workflow_dispatch with `stale_hours` + `post_comment` overrides

Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).

Permissions: `contents: read` + `pull-requests: write` (post comments).

Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.

## Why "alarm + comment" not "auto-approve"

Considered options in issue #2975:
  1. Slack/email alert — picked.
  2. Bot-account auto-approve via molecule-ops — circumvents the
     human-review gate that branch protection encodes.
  3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
     change; out of scope for a workflow PR.

The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.

## Why this won't false-positive on legitimate slow reviews

Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted) — adding noise stops at 1 comment regardless of
how long the PR sits.

## Test plan

- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
      confirm it correctly reports zero stale PRs
This commit is contained in:
Hongming Wang 2026-05-05 17:55:27 -07:00
parent 9dd29882e2
commit caf19e8980
3 changed files with 556 additions and 0 deletions

View File

@ -0,0 +1,83 @@
name: auto-promote-stale-alarm
# Hourly cron + on-demand alarm for the silent-block failure mode that
# motivated issue #2975:
# - The auto-promote-staging.yml workflow opened a PR + armed
# auto-merge, but main's branch protection requires a human review
# (reviewDecision=REVIEW_REQUIRED). The PR sat BLOCKED with no
# surface-up-the-stack for 12+ hours, holding 25 commits hostage
# including the Memory v2 redesign and a reno-stars data-loss fix.
#
# This workflow runs `scripts/check-stale-promote-pr.sh` against the
# repo's open auto-promote PRs (base=main head=staging). When a PR has
# been BLOCKED on REVIEW_REQUIRED for >4h, it:
# 1. Emits a workflow-level warning (visible in run summary + the
# Actions UI feed).
# 2. Posts a comment on the PR (idempotent — one alarm per PR).
#
# The detection logic lives in scripts/check-stale-promote-pr.sh so
# it's unit-testable with stubbed `gh` (see test-check-stale-promote-pr.sh).
# This file is the schedule + invocation surface only — SSOT for the
# detector itself.
on:
schedule:
# Hourly. Cheap (one `gh pr list` + jq), and 1h granularity is
# plenty for a 4h staleness threshold — operators see the alarm
# within at most 1h of crossing the threshold.
- cron: "27 * * * *" # at :27 to dodge the cron herd at :00
workflow_dispatch:
inputs:
stale_hours:
description: "Hours after which a BLOCKED+REVIEW_REQUIRED PR is stale (default 4)"
required: false
default: "4"
post_comment:
description: "Post a comment on stale PRs (default true)"
required: false
default: "true"
permissions:
contents: read
pull-requests: write # post comments on stale PRs
# Serialize so the on-demand and scheduled runs don't double-comment
# the same PR. cancel-in-progress=false because the script is idempotent
# (existing comment marker prevents dupes), but a scheduled run firing
# while a manual one runs would just re-list the same PR set.
concurrency:
group: auto-promote-stale-alarm
cancel-in-progress: false
jobs:
scan:
runs-on: ubuntu-latest
steps:
- name: Checkout (need scripts/ only)
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
sparse-checkout: |
scripts/check-stale-promote-pr.sh
sparse-checkout-cone-mode: false
- name: Run stale-PR detector
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITHUB_REPOSITORY: ${{ github.repository }}
STALE_HOURS: ${{ inputs.stale_hours || '4' }}
POST_COMMENT: ${{ inputs.post_comment || 'true' }}
run: |
# The script's exit code reflects the count of stale PRs.
# We don't want a stale finding to fail the workflow run —
# the warning + comment are the signal, the green/red is
# noise. So convert any non-zero exit to a workflow notice
# and exit 0.
set +e
bash scripts/check-stale-promote-pr.sh
rc=$?
set -e
if [ "$rc" -ne 0 ]; then
echo "::notice::Stale PR detector found $rc PR(s) needing attention. See warnings above + comments on the PRs."
fi
# Always succeed — operator-facing surface is the warning,
# not the workflow status.
exit 0

216
scripts/check-stale-promote-pr.sh Executable file
View File

@ -0,0 +1,216 @@
#!/usr/bin/env bash
# scripts/check-stale-promote-pr.sh
#
# Scan open auto-promote PRs (base=main head=staging) for the
# silent-block failure mode that motivated issue #2975:
# - PR sat for hours with mergeStateStatus=BLOCKED
# - reviewDecision=REVIEW_REQUIRED (auto-merge armed but waiting
# on a human approval that never comes)
#
# When found, emit:
# - GitHub Actions notice/warning lines (workflow summary surface)
# - Optionally post a comment on the PR (--comment)
#
# Exit code is the count of stale PRs found, capped at 125 so callers
# can detect "alarm fired" via `if ! check-stale-promote-pr.sh; then …`.
# Exit 0 = clean, exit ≥1 = at least N stale PRs need attention.
#
# Used by .github/workflows/auto-promote-stale-alarm.yml. Logic lives
# here (not inline in the workflow YAML) so we can:
# - Unit-test it with a stubbed `gh` (see test-check-stale-promote-pr.sh)
# - Run it ad-hoc by an operator: `scripts/check-stale-promote-pr.sh`
# - Reuse the same surface in any sibling workflow that needs the same
# check (SSOT — one detector, many callers).
#
# Requires: `gh` CLI, `jq`. `GH_TOKEN` env in the workflow context.
set -euo pipefail
# -----------------------------------------------------------------------------
# Inputs
# -----------------------------------------------------------------------------
# Threshold beyond which a BLOCKED+REVIEW_REQUIRED promote PR is "stale"
# enough to alarm. 4 hours is the floor: most legitimate gates clear
# inside an hour, so 4× headroom is plenty for slow CI without false-
# alarming. Override via env for tests + edge ops.
STALE_HOURS="${STALE_HOURS:-4}"
# Repo defaults to the current `gh` context. Tests pass --repo explicitly.
REPO="${GITHUB_REPOSITORY:-}"
# Whether to post a comment to the PR. Off by default to avoid noise on
# manual ad-hoc runs; the cron workflow turns it on.
POST_COMMENT="${POST_COMMENT:-false}"
# Where to read the open-PR JSON from. Empty = call `gh` live. Tests
# point this at a fixture file.
PR_FIXTURE="${PR_FIXTURE:-}"
# Where to read "now" from. Empty = real clock. Tests freeze time so
# the staleness math is deterministic.
NOW_OVERRIDE="${NOW_OVERRIDE:-}"
while [ $# -gt 0 ]; do
case "$1" in
--repo) REPO="$2"; shift 2 ;;
--comment) POST_COMMENT="true"; shift ;;
--no-comment) POST_COMMENT="false"; shift ;;
--fixture) PR_FIXTURE="$2"; shift 2 ;;
--stale-hours) STALE_HOURS="$2"; shift 2 ;;
-h|--help)
sed -n '1,/^set /p' "$0" | grep '^# ' | sed 's/^# //'
exit 0
;;
*) echo "unknown arg: $1" >&2; exit 64 ;;
esac
done
if [ -z "$REPO" ] && [ -z "$PR_FIXTURE" ]; then
echo "::error::REPO env (or GITHUB_REPOSITORY) required when no fixture given" >&2
exit 2
fi
# -----------------------------------------------------------------------------
# Clock helpers — split out so tests can freeze time
# -----------------------------------------------------------------------------
now_epoch() {
if [ -n "$NOW_OVERRIDE" ]; then
printf '%s\n' "$NOW_OVERRIDE"
else
date -u +%s
fi
}
# Parse RFC3339 timestamps the way GitHub emits them (e.g.
# "2026-05-05T23:15:00Z"). gnu-date uses -d, bsd-date uses -j -f. Cover
# both because the workflow runs on ubuntu-latest (gnu) but operators
# may run this script on macOS (bsd).
to_epoch() {
local ts="$1"
# gnu-date path first.
if date -u -d "$ts" +%s 2>/dev/null; then
return 0
fi
# bsd-date fallback — strip optional fractional seconds before %S.
local ts_clean="${ts%%.*}"
ts_clean="${ts_clean%Z}Z"
date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$ts_clean" +%s 2>/dev/null || {
echo "::error::cannot parse timestamp: $ts" >&2
return 1
}
}
# -----------------------------------------------------------------------------
# Fetch open auto-promote PRs
# -----------------------------------------------------------------------------
fetch_prs() {
if [ -n "$PR_FIXTURE" ]; then
cat "$PR_FIXTURE"
return 0
fi
gh pr list --repo "$REPO" \
--base main --head staging --state open \
--json number,title,createdAt,mergeStateStatus,reviewDecision,url
}
# -----------------------------------------------------------------------------
# Stale detection
# -----------------------------------------------------------------------------
# Read PR list from stdin, emit one TSV line per stale PR:
# <num>\t<age_hours>\t<url>\t<title>
# Caller decides what to do (warn, comment, escalate).
detect_stale() {
local now_ts
now_ts="$(now_epoch)"
local stale_seconds=$((STALE_HOURS * 3600))
jq -r '.[] | [.number, .createdAt, .mergeStateStatus, .reviewDecision, .url, .title] | @tsv' \
| while IFS=$'\t' read -r num created_at merge_state review_decision url title; do
# Only alarm on the specific failure mode: BLOCKED + REVIEW_REQUIRED.
# Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are the
# author's signal-to-fix; this script targets the silent
# "no human reviewed yet" wedge specifically.
[ "$merge_state" = "BLOCKED" ] || continue
[ "$review_decision" = "REVIEW_REQUIRED" ] || continue
local created_ts
created_ts="$(to_epoch "$created_at")" || continue
local age=$((now_ts - created_ts))
if [ "$age" -ge "$stale_seconds" ]; then
local age_h=$((age / 3600))
printf '%s\t%d\t%s\t%s\n' "$num" "$age_h" "$url" "$title"
fi
done
}
# -----------------------------------------------------------------------------
# Reporting
# -----------------------------------------------------------------------------
# Comment body — kept short; the issue body has the full design.
comment_body() {
local age_h="$1"
cat <<EOF
⚠️ This auto-promote PR has been BLOCKED on \`REVIEW_REQUIRED\` for **${age_h}h**.
Auto-merge is armed, but main's branch protection requires 1 review and no human has approved. Until someone reviews, the staging→main promote chain is wedged and downstream consumers (canvas builds, tenant redeploys) won't see new code.
**Action**: a human reviewer on \`@Molecule-AI/maintainers\` should approve this PR (or mark it as not ready and close).
Detected by \`scripts/check-stale-promote-pr.sh\` per issue #2975.
EOF
}
post_comment() {
local pr_num="$1"
local age_h="$2"
if [ "$POST_COMMENT" != "true" ]; then
return 0
fi
# Idempotency: only one alarm comment per PR. Look for the marker
# string in existing comments before posting a new one.
local existing
existing="$(gh pr view "$pr_num" --repo "$REPO" --json comments \
--jq '.comments[] | select(.body | test("scripts/check-stale-promote-pr.sh per issue #2975")) | .databaseId' \
| head -n1)"
if [ -n "$existing" ]; then
echo "::notice::PR #$pr_num already has a stale-alarm comment ($existing) — not re-posting"
return 0
fi
comment_body "$age_h" | gh pr comment "$pr_num" --repo "$REPO" --body-file -
echo "::notice::Posted stale-alarm comment on PR #$pr_num (age=${age_h}h)"
}
# -----------------------------------------------------------------------------
# Main
# -----------------------------------------------------------------------------
stale_count=0
while IFS=$'\t' read -r num age_h url title; do
[ -n "$num" ] || continue
stale_count=$((stale_count + 1))
echo "::warning title=Stale auto-promote PR::PR #$num — BLOCKED on REVIEW_REQUIRED for ${age_h}h. $url"
{
echo "## ⚠️ Stale auto-promote PR detected"
echo
echo "- PR: #$num — \`$title\`"
echo "- Age: ${age_h}h"
echo "- State: BLOCKED on REVIEW_REQUIRED"
echo "- URL: $url"
echo
echo "Auto-merge is armed but waiting on a human review. See issue #2975."
} >> "${GITHUB_STEP_SUMMARY:-/dev/null}"
post_comment "$num" "$age_h"
done < <(fetch_prs | detect_stale)
if [ "$stale_count" -eq 0 ]; then
echo "::notice::No stale auto-promote PRs detected (threshold: ${STALE_HOURS}h)"
fi
# Cap exit code so we don't accidentally break shells that interpret
# >125 as signal-style. 1..N maps to "1..N stale PRs".
exit $(( stale_count > 125 ? 125 : stale_count ))

View File

@ -0,0 +1,257 @@
#!/usr/bin/env bash
# scripts/test-check-stale-promote-pr.sh
#
# Exhaustive bash unit tests for check-stale-promote-pr.sh.
# Goal: 100% branch coverage on the detector logic.
#
# Each case writes a fixture JSON, freezes the clock with NOW_OVERRIDE,
# runs the script with --fixture + --no-comment (so we don't try to
# actually call `gh pr comment`), and asserts on stdout/exit code.
#
# Run: bash scripts/test-check-stale-promote-pr.sh
# Expected: "All N tests passed" + exit 0.
set -euo pipefail
SCRIPT="$(cd "$(dirname "$0")" && pwd)/check-stale-promote-pr.sh"
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT
PASS=0
FAIL=0
# ─────────────────────────────────────────────────────────────────────────────
# Helpers
# ─────────────────────────────────────────────────────────────────────────────
# Frozen "now" — 2026-05-06T05:00:00Z. Compute dynamically so the
# tests stay correct regardless of platform-specific date semantics
# (gnu vs bsd) and any author math errors on the epoch.
if FROZEN_NOW="$(date -u -d '2026-05-06T05:00:00Z' +%s 2>/dev/null)"; then
: # gnu-date worked
elif FROZEN_NOW="$(date -u -j -f '%Y-%m-%dT%H:%M:%SZ' '2026-05-06T05:00:00Z' +%s 2>/dev/null)"; then
: # bsd-date worked
else
echo "FATAL: cannot compute FROZEN_NOW on this platform" >&2
exit 1
fi
run_script() {
# Args: <fixture-file>
# Returns stdout + exit code via a known marker.
local fixture="$1"
shift
set +e
NOW_OVERRIDE="$FROZEN_NOW" \
POST_COMMENT="false" \
bash "$SCRIPT" --fixture "$fixture" "$@" 2>&1
local rc=$?
set -e
echo "EXIT_CODE=$rc"
}
assert_pass() {
local name="$1"
local got="$2"
local want_pattern="$3"
if printf '%s' "$got" | grep -qE "$want_pattern"; then
PASS=$((PASS + 1))
printf ' ✓ %s\n' "$name"
else
FAIL=$((FAIL + 1))
printf ' ✗ %s\n want pattern: %s\n got:\n%s\n' "$name" "$want_pattern" "$got"
fi
}
assert_no_match() {
local name="$1"
local got="$2"
local bad_pattern="$3"
if printf '%s' "$got" | grep -qE "$bad_pattern"; then
FAIL=$((FAIL + 1))
printf ' ✗ %s\n bad pattern matched: %s\n got:\n%s\n' "$name" "$bad_pattern" "$got"
else
PASS=$((PASS + 1))
printf ' ✓ %s\n' "$name"
fi
}
# ─────────────────────────────────────────────────────────────────────────────
# Test cases
# ─────────────────────────────────────────────────────────────────────────────
echo "1. Empty PR list — clean exit"
echo '[]' > "$TMP/empty.json"
got=$(run_script "$TMP/empty.json")
assert_pass "empty-no-warning" "$got" "No stale auto-promote PRs detected"
assert_pass "empty-exit-zero" "$got" "EXIT_CODE=0"
echo
echo "2. Single PR, BLOCKED+REVIEW_REQUIRED, 5h old — fires alarm"
cat > "$TMP/stale1.json" <<EOF
[{
"number": 2963,
"title": "staging → main",
"createdAt": "2026-05-06T00:00:00Z",
"mergeStateStatus": "BLOCKED",
"reviewDecision": "REVIEW_REQUIRED",
"url": "https://github.com/test/test/pull/2963"
}]
EOF
got=$(run_script "$TMP/stale1.json")
assert_pass "stale1-warning" "$got" "Stale auto-promote PR"
assert_pass "stale1-pr-number" "$got" "PR #2963"
assert_pass "stale1-age" "$got" "for 5h"
assert_pass "stale1-exit-1" "$got" "EXIT_CODE=1"
echo
echo "3. Same PR but only 3h old — under threshold, NO alarm"
cat > "$TMP/young.json" <<EOF
[{
"number": 100,
"title": "fresh promote",
"createdAt": "2026-05-06T02:00:00Z",
"mergeStateStatus": "BLOCKED",
"reviewDecision": "REVIEW_REQUIRED",
"url": "https://github.com/test/test/pull/100"
}]
EOF
got=$(run_script "$TMP/young.json")
assert_pass "young-no-alarm" "$got" "No stale auto-promote PRs"
assert_pass "young-exit-zero" "$got" "EXIT_CODE=0"
assert_no_match "young-no-warning" "$got" "Stale auto-promote PR"
echo
echo "4. PR is BLOCKED but for the wrong reason (DIRTY, not REVIEW_REQUIRED)"
cat > "$TMP/dirty.json" <<EOF
[{
"number": 200,
"title": "needs rebase",
"createdAt": "2026-05-06T00:00:00Z",
"mergeStateStatus": "BLOCKED",
"reviewDecision": "APPROVED",
"url": "https://github.com/test/test/pull/200"
}]
EOF
got=$(run_script "$TMP/dirty.json")
assert_pass "dirty-no-alarm" "$got" "No stale auto-promote PRs"
assert_pass "dirty-exit-zero" "$got" "EXIT_CODE=0"
echo
echo "5. PR is APPROVED but mergeStateStatus is CLEAN — NOT alarming"
cat > "$TMP/clean.json" <<EOF
[{
"number": 300,
"title": "all green",
"createdAt": "2026-05-06T00:00:00Z",
"mergeStateStatus": "CLEAN",
"reviewDecision": "APPROVED",
"url": "https://github.com/test/test/pull/300"
}]
EOF
got=$(run_script "$TMP/clean.json")
assert_pass "clean-no-alarm" "$got" "No stale auto-promote PRs"
echo
echo "6. Multiple PRs — only the BLOCKED+REVIEW_REQUIRED+old one alarms"
cat > "$TMP/mixed.json" <<EOF
[
{
"number": 100,
"title": "fresh",
"createdAt": "2026-05-06T04:00:00Z",
"mergeStateStatus": "BLOCKED",
"reviewDecision": "REVIEW_REQUIRED",
"url": "https://x/100"
},
{
"number": 200,
"title": "stale + alarming",
"createdAt": "2026-05-05T20:00:00Z",
"mergeStateStatus": "BLOCKED",
"reviewDecision": "REVIEW_REQUIRED",
"url": "https://x/200"
},
{
"number": 300,
"title": "approved + clean",
"createdAt": "2026-05-05T20:00:00Z",
"mergeStateStatus": "CLEAN",
"reviewDecision": "APPROVED",
"url": "https://x/300"
}
]
EOF
got=$(run_script "$TMP/mixed.json")
assert_pass "mixed-only-200" "$got" "PR #200"
assert_no_match "mixed-not-100" "$got" "PR #100"
assert_no_match "mixed-not-300" "$got" "PR #300"
assert_pass "mixed-exit-1" "$got" "EXIT_CODE=1"
echo
echo "7. Custom STALE_HOURS via --stale-hours overrides threshold"
got=$(run_script "$TMP/young.json" --stale-hours 1)
assert_pass "custom-threshold-fires" "$got" "PR #100"
assert_pass "custom-threshold-exit-1" "$got" "EXIT_CODE=1"
echo
echo "8. Two stale PRs — exit code reflects count"
cat > "$TMP/two-stale.json" <<EOF
[
{
"number": 200,
"title": "stale-A",
"createdAt": "2026-05-05T20:00:00Z",
"mergeStateStatus": "BLOCKED",
"reviewDecision": "REVIEW_REQUIRED",
"url": "https://x/200"
},
{
"number": 201,
"title": "stale-B",
"createdAt": "2026-05-05T19:00:00Z",
"mergeStateStatus": "BLOCKED",
"reviewDecision": "REVIEW_REQUIRED",
"url": "https://x/201"
}
]
EOF
got=$(run_script "$TMP/two-stale.json")
assert_pass "two-stale-exit-2" "$got" "EXIT_CODE=2"
echo
echo "9. Help text is shown for --help"
set +e
help_out=$(bash "$SCRIPT" --help 2>&1)
help_rc=$?
set -e
assert_pass "help-exits-zero" "EXIT_CODE=$help_rc" "EXIT_CODE=0"
assert_pass "help-mentions-issue" "$help_out" "issue #2975"
echo
echo "10. Unknown arg exits 64 (EX_USAGE)"
set +e
bad_out=$(bash "$SCRIPT" --bogus 2>&1)
bad_rc=$?
set -e
assert_pass "unknown-arg-rc" "EXIT_CODE=$bad_rc" "EXIT_CODE=64"
echo
echo "11. Missing repo + missing fixture exits 2"
set +e
out=$(REPO="" bash "$SCRIPT" 2>&1)
rc=$?
set -e
assert_pass "no-repo-exit-2" "EXIT_CODE=$rc" "EXIT_CODE=2"
# ─────────────────────────────────────────────────────────────────────────────
# Summary
# ─────────────────────────────────────────────────────────────────────────────
echo
echo "─────────────────────────────────────────────"
echo "Tests: $PASS passed, $FAIL failed"
if [ "$FAIL" -gt 0 ]; then
exit 1
fi
echo "All tests passed."