Closes the silent-block failure mode that left 25 commits — including
the Memory v2 redesign and the reno-stars data-loss fix — wedged on
staging for 12+ hours behind a single missing review. The auto-promote
workflow opened the PR + armed auto-merge, but main's branch protection
required a human review and nobody noticed until a user reported
"still seeing old memory tab".
## Detection logic — `scripts/check-stale-promote-pr.sh`
Reads open PRs `base=main head=staging` and alarms on:
- `mergeStateStatus == BLOCKED`
- `reviewDecision == REVIEW_REQUIRED`
- createdAt older than `STALE_HOURS` (default 4h)
Other BLOCKED reasons (DIRTY, BEHIND, failed checks) are NOT alarmed —
those are the author's signal-to-fix. This script targets the specific
"no human reviewed yet" wedge.
Output:
- `::warning` per stale PR (visible in workflow summary + Actions UI)
- PR comment (idempotent via marker-string detection; one alarm
per PR, never re-spammed)
- Exit code = count of stale PRs (capped at 125)
Logic in a script (not inline workflow YAML) so it's:
- **Unit-testable** — tests/test-check-stale-promote-pr.sh exercises
every branch with stubbed fixture JSON + frozen clock. 23 tests
covering: empty list, single stale, just-under-threshold, wrong
reviewDecision, wrong mergeStateStatus, mixed list (only matching
PRs alarm), custom threshold via --stale-hours, exit-code-counts-
matching-PRs, --help, unknown arg → 64, missing repo → 2.
- **Operator-runnable ad-hoc** — `scripts/check-stale-promote-pr.sh`
works from any shell with `gh` + `jq`.
- **SSOT** — one detector, the workflow YAML is just schedule +
invocation surface. Future sibling workflows that need the same
check call the same script.
## Workflow — `.github/workflows/auto-promote-stale-alarm.yml`
Triggers:
- cron `27 * * * *` (hourly, off-the-hour to dodge cron herd)
- workflow_dispatch with `stale_hours` + `post_comment` overrides
Concurrency: `auto-promote-stale-alarm` group, cancel-in-progress=false
(idempotent script; no benefit to cancelling a running scan).
Permissions: `contents: read` + `pull-requests: write` (post comments).
Sparse checkout — only fetches `scripts/check-stale-promote-pr.sh`.
No node_modules, no go modules, no slow setup steps. Workflow runs
in <30s on a clean repo.
## Why "alarm + comment" not "auto-approve"
Considered options in issue #2975:
1. Slack/email alert — picked.
2. Bot-account auto-approve via molecule-ops — circumvents the
human-review gate that branch protection encodes.
3. Trusted-promote bypass via CODEOWNERS — needs Org Admin config
change; out of scope for a workflow PR.
The comment-on-PR pattern picks (1) without external dependencies
(no Slack token, no email config). Subscribers get notified via
GitHub's existing PR notification delivery; the warning shows up in
the Actions feed.
## Why this won't false-positive on legitimate slow reviews
Threshold is 4h. Most legitimate gates clear in <1h, so 4× headroom
is plenty for slow CI. The comment is idempotent (one alarm per PR,
never re-posted) — adding noise stops at 1 comment regardless of
how long the PR sits.
## Test plan
- [x] `bash scripts/test-check-stale-promote-pr.sh` — 23/23 pass
- [x] `python3 -c 'yaml.safe_load(...)'` clean
- [x] `bash -n` clean on both scripts
- [ ] Live verification: dispatch the workflow once main has caught up,
confirm it correctly reports zero stale PRs
84 lines
3.3 KiB
YAML
84 lines
3.3 KiB
YAML
name: auto-promote-stale-alarm
|
|
|
|
# Hourly cron + on-demand alarm for the silent-block failure mode that
|
|
# motivated issue #2975:
|
|
# - The auto-promote-staging.yml workflow opened a PR + armed
|
|
# auto-merge, but main's branch protection requires a human review
|
|
# (reviewDecision=REVIEW_REQUIRED). The PR sat BLOCKED with no
|
|
# surface-up-the-stack for 12+ hours, holding 25 commits hostage
|
|
# including the Memory v2 redesign and a reno-stars data-loss fix.
|
|
#
|
|
# This workflow runs `scripts/check-stale-promote-pr.sh` against the
|
|
# repo's open auto-promote PRs (base=main head=staging). When a PR has
|
|
# been BLOCKED on REVIEW_REQUIRED for >4h, it:
|
|
# 1. Emits a workflow-level warning (visible in run summary + the
|
|
# Actions UI feed).
|
|
# 2. Posts a comment on the PR (idempotent — one alarm per PR).
|
|
#
|
|
# The detection logic lives in scripts/check-stale-promote-pr.sh so
|
|
# it's unit-testable with stubbed `gh` (see test-check-stale-promote-pr.sh).
|
|
# This file is the schedule + invocation surface only — SSOT for the
|
|
# detector itself.
|
|
|
|
on:
|
|
schedule:
|
|
# Hourly. Cheap (one `gh pr list` + jq), and 1h granularity is
|
|
# plenty for a 4h staleness threshold — operators see the alarm
|
|
# within at most 1h of crossing the threshold.
|
|
- cron: "27 * * * *" # at :27 to dodge the cron herd at :00
|
|
workflow_dispatch:
|
|
inputs:
|
|
stale_hours:
|
|
description: "Hours after which a BLOCKED+REVIEW_REQUIRED PR is stale (default 4)"
|
|
required: false
|
|
default: "4"
|
|
post_comment:
|
|
description: "Post a comment on stale PRs (default true)"
|
|
required: false
|
|
default: "true"
|
|
|
|
permissions:
|
|
contents: read
|
|
pull-requests: write # post comments on stale PRs
|
|
|
|
# Serialize so the on-demand and scheduled runs don't double-comment
|
|
# the same PR. cancel-in-progress=false because the script is idempotent
|
|
# (existing comment marker prevents dupes), but a scheduled run firing
|
|
# while a manual one runs would just re-list the same PR set.
|
|
concurrency:
|
|
group: auto-promote-stale-alarm
|
|
cancel-in-progress: false
|
|
|
|
jobs:
|
|
scan:
|
|
runs-on: ubuntu-latest
|
|
steps:
|
|
- name: Checkout (need scripts/ only)
|
|
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
|
|
with:
|
|
sparse-checkout: |
|
|
scripts/check-stale-promote-pr.sh
|
|
sparse-checkout-cone-mode: false
|
|
- name: Run stale-PR detector
|
|
env:
|
|
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
|
GITHUB_REPOSITORY: ${{ github.repository }}
|
|
STALE_HOURS: ${{ inputs.stale_hours || '4' }}
|
|
POST_COMMENT: ${{ inputs.post_comment || 'true' }}
|
|
run: |
|
|
# The script's exit code reflects the count of stale PRs.
|
|
# We don't want a stale finding to fail the workflow run —
|
|
# the warning + comment are the signal, the green/red is
|
|
# noise. So convert any non-zero exit to a workflow notice
|
|
# and exit 0.
|
|
set +e
|
|
bash scripts/check-stale-promote-pr.sh
|
|
rc=$?
|
|
set -e
|
|
if [ "$rc" -ne 0 ]; then
|
|
echo "::notice::Stale PR detector found $rc PR(s) needing attention. See warnings above + comments on the PRs."
|
|
fi
|
|
# Always succeed — operator-facing surface is the warning,
|
|
# not the workflow status.
|
|
exit 0
|