molecule-core/scripts/check-stale-promote-pr.sh
Hongming Wang caf19e8980 feat(ops): hourly alarm for auto-promote PR stuck on REVIEW_REQUIRED (#2975)
Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".

## Detection logic — `scripts/check-stale-promote-pr.sh`

Reads open PRs `base=main head=staging` and alarms on:
  - `mergeStateStatus == BLOCKED`
  - `reviewDecision == REVIEW_REQUIRED`
  - createdAt older than `STALE_HOURS` (default 4h)

Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.

Output:
  - `::warning` per stale PR (visible in workflow summary + Actions UI)
  - PR comment (idempotent via marker-string detection; one alarm
    per PR, never re-spammed)
  - Exit code = count of stale PRs (capped at 125)

Logic in a script (not inline workflow YAML) so it's:
  - **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
    every branch with stubbed fixture JSON + frozen clock. 23 tests
    covering: empty list, single stale, just-under-threshold, wrong
    reviewDecision, wrong mergeStateStatus, mixed list (only matching
    PRs alarm), custom threshold via --stale-hours, exit-code-counts-
    matching-PRs, --help, unknown arg → 64, missing repo → 2.
  - **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
    works from any shell with `gh` + `jq`.
  - **SSOT** — one detector, the workflow YAML is just schedule +
    invocation surface. Future sibling workflows that need the same
    check call the same script.

## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`

Triggers:
  - cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
  - workflow_dispatch with `stale_hours` + `post_comment` overrides

Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).

Permissions: `contents: read` + `pull-requests: write` (post comments).

Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.

## Why "alarm + comment" not "auto-approve"

Considered options in issue #2975:
  1. Slack/email alert — picked.
  2. Bot-account auto-approve via molecule-ops — circumvents the
     human-review gate that branch protection encodes.
  3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
     change; out of scope for a workflow PR.

The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.

## Why this won't false-positive on legitimate slow reviews

Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted) — adding noise stops at 1 comment regardless of
how long the PR sits.

## Test plan

- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
      confirm it correctly reports zero stale PRs
2026-05-05 17:55:27 -07:00

217 lines
7.9 KiB
Bash
Executable File
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

#!/usr/bin/env bash
# scripts/check-stale-promote-pr.sh
#
# Scan open auto-promote PRs (base=main head=staging) for the
# silent-block failure mode that motivated issue #2975:
# - PR sat for hours with mergeStateStatus=BLOCKED
# - reviewDecision=REVIEW_REQUIRED (auto-merge armed but waiting
# on a human approval that never comes)
#
# When found, emit:
# - GitHub Actions notice/warning lines (workflow summary surface)
# - Optionally post a comment on the PR (--comment)
#
# Exit code is the count of stale PRs found, capped at 125 so callers
# can detect "alarm fired" via `if ! check-stale-promote-pr.sh; then …`.
# Exit 0 = clean, exit ≥1 = at least N stale PRs need attention.
#
# Used by .github/workflows/auto-promote-stale-alarm.yml. Logic lives
# here (not inline in the workflow YAML) so we can:
# - Unit-test it with a stubbed `gh` (see test-check-stale-promote-pr.sh)
# - Run it ad-hoc by an operator: `scripts/check-stale-promote-pr.sh`
# - Reuse the same surface in any sibling workflow that needs the same
# check (SSOT — one detector, many callers).
#
# Requires: `gh` CLI, `jq`. `GH_TOKEN` env in the workflow context.
set -euo pipefail
# -----------------------------------------------------------------------------
# Inputs
# -----------------------------------------------------------------------------
# Threshold beyond which a BLOCKED+REVIEW_REQUIRED promote PR is "stale"
# enough to alarm. 4 hours is the floor: most legitimate gates clear
# inside an hour, so 4× headroom is plenty for slow CI without false-
# alarming. Override via env for tests + edge ops.
STALE_HOURS="${STALE_HOURS:-4}"
# Repo defaults to the current `gh` context. Tests pass --repo explicitly.
REPO="${GITHUB_REPOSITORY:-}"
# Whether to post a comment to the PR. Off by default to avoid noise on
# manual ad-hoc runs; the cron workflow turns it on.
POST_COMMENT="${POST_COMMENT:-false}"
# Where to read the open-PR JSON from. Empty = call `gh` live. Tests
# point this at a fixture file.
PR_FIXTURE="${PR_FIXTURE:-}"
# Where to read "now" from. Empty = real clock. Tests freeze time so
# the staleness math is deterministic.
NOW_OVERRIDE="${NOW_OVERRIDE:-}"
while [ $# -gt 0 ]; do
case "$1" in
--repo) REPO="$2"; shift 2 ;;
--comment) POST_COMMENT="true"; shift ;;
--no-comment) POST_COMMENT="false"; shift ;;
--fixture) PR_FIXTURE="$2"; shift 2 ;;
--stale-hours) STALE_HOURS="$2"; shift 2 ;;
-h|--help)
sed -n '1,/^set /p' "$0" | grep '^# ' | sed 's/^# //'
exit 0
;;
*) echo "unknown arg: $1" >&2; exit 64 ;;
esac
done
if [ -z "$REPO" ] && [ -z "$PR_FIXTURE" ]; then
echo "::error::REPO env (or GITHUB_REPOSITORY) required when no fixture given" >&2
exit 2
fi
# -----------------------------------------------------------------------------
# Clock helpers — split out so tests can freeze time
# -----------------------------------------------------------------------------
now_epoch() {
if [ -n "$NOW_OVERRIDE" ]; then
printf '%s\n' "$NOW_OVERRIDE"
else
date -u +%s
fi
}
# Parse RFC3339 timestamps the way GitHub emits them (e.g.
# "2026-05-05T23:15:00Z"). gnu-date uses -d, bsd-date uses -j -f. Cover
# both because the workflow runs on ubuntu-latest (gnu) but operators
# may run this script on macOS (bsd).
to_epoch() {
local ts="$1"
# gnu-date path first.
if date -u -d "$ts" +%s 2>/dev/null; then
return 0
fi
# bsd-date fallback — strip optional fractional seconds before %S.
local ts_clean="${ts%%.*}"
ts_clean="${ts_clean%Z}Z"
date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$ts_clean" +%s 2>/dev/null || {
echo "::error::cannot parse timestamp: $ts" >&2
return 1
}
}
# -----------------------------------------------------------------------------
# Fetch open auto-promote PRs
# -----------------------------------------------------------------------------
fetch_prs() {
if [ -n "$PR_FIXTURE" ]; then
cat "$PR_FIXTURE"
return 0
fi
gh pr list --repo "$REPO" \
--base main --head staging --state open \
--json number,title,createdAt,mergeStateStatus,reviewDecision,url
}
# -----------------------------------------------------------------------------
# Stale detection
# -----------------------------------------------------------------------------
# Read PR list from stdin, emit one TSV line per stale PR:
# <num>\t<age_hours>\t<url>\t<title>
# Caller decides what to do (warn, comment, escalate).
detect_stale() {
local now_ts
now_ts="$(now_epoch)"
local stale_seconds=$((STALE_HOURS * 3600))
jq -r '.[] | [.number, .createdAt, .mergeStateStatus, .reviewDecision, .url, .title] | @tsv' \
| while IFS=$'\t' read -r num created_at merge_state review_decision url title; do
# Only alarm on the specific failure mode: BLOCKED + REVIEW_REQUIRED.
# Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are the
# author's signal-to-fix; this script targets the silent
# "no human reviewed yet" wedge specifically.
[ "$merge_state" = "BLOCKED" ] || continue
[ "$review_decision" = "REVIEW_REQUIRED" ] || continue
local created_ts
created_ts="$(to_epoch "$created_at")" || continue
local age=$((now_ts - created_ts))
if [ "$age" -ge "$stale_seconds" ]; then
local age_h=$((age / 3600))
printf '%s\t%d\t%s\t%s\n' "$num" "$age_h" "$url" "$title"
fi
done
}
# -----------------------------------------------------------------------------
# Reporting
# -----------------------------------------------------------------------------
# Comment body — kept short; the issue body has the full design.
comment_body() {
local age_h="$1"
cat <<EOF
⚠️ This auto-promote PR has been BLOCKED on \`REVIEW_REQUIRED\` for **${age_h}h**.
Auto-merge is armed, but main's branch protection requires 1 review and no human has approved. Until someone reviews, the staging→main promote chain is wedged and downstream consumers (canvas builds, tenant redeploys) won't see new code.
**Action**: a human reviewer on \`@Molecule-AI/maintainers\` should approve this PR (or mark it as not ready and close).
Detected by \`scripts/check-stale-promote-pr.sh\` per issue #2975.
EOF
}
post_comment() {
local pr_num="$1"
local age_h="$2"
if [ "$POST_COMMENT" != "true" ]; then
return 0
fi
# Idempotency: only one alarm comment per PR. Look for the marker
# string in existing comments before posting a new one.
local existing
existing="$(gh pr view "$pr_num" --repo "$REPO" --json comments \
--jq '.comments[] | select(.body | test("scripts/check-stale-promote-pr.sh per issue #2975")) | .databaseId' \
| head -n1)"
if [ -n "$existing" ]; then
echo "::notice::PR #$pr_num already has a stale-alarm comment ($existing) — not re-posting"
return 0
fi
comment_body "$age_h" | gh pr comment "$pr_num" --repo "$REPO" --body-file -
echo "::notice::Posted stale-alarm comment on PR #$pr_num (age=${age_h}h)"
}
# -----------------------------------------------------------------------------
# Main
# -----------------------------------------------------------------------------
stale_count=0
while IFS=$'\t' read -r num age_h url title; do
[ -n "$num" ] || continue
stale_count=$((stale_count + 1))
echo "::warning title=Stale auto-promote PR::PR #$num — BLOCKED on REVIEW_REQUIRED for ${age_h}h. $url"
{
echo "## ⚠️ Stale auto-promote PR detected"
echo
echo "- PR: #$num — \`$title\`"
echo "- Age: ${age_h}h"
echo "- State: BLOCKED on REVIEW_REQUIRED"
echo "- URL: $url"
echo
echo "Auto-merge is armed but waiting on a human review. See issue #2975."
} >> "${GITHUB_STEP_SUMMARY:-/dev/null}"
post_comment "$num" "$age_h"
done < <(fetch_prs | detect_stale)
if [ "$stale_count" -eq 0 ]; then
echo "::notice::No stale auto-promote PRs detected (threshold: ${STALE_HOURS}h)"
fi
# Cap exit code so we don't accidentally break shells that interpret
# >125 as signal-style. 1..N maps to "1..N stale PRs".
exit $(( stale_count > 125 ? 125 : stale_count ))