docs/infra: document or auto-handle the compensating-status pattern for stale CI umbrellas #1780

Open
opened 2026-05-24 05:25:51 +00:00 by hongming · 3 comments
Owner

Summary

Multiple recent merges used the "compensating-status" pattern: when CI / all-required reports failure due to a propagation/timing race despite the underlying CI sub-jobs being green, POST a corrected success status via the Gitea API to unblock the merge gate.

This is now load-bearing recovery. Used twice in CTO-bypass session 2026-05-24:

  • #1737: all 5 sub-jobs success, umbrella stale; compensating-status posted → merged d5941906
  • #1759: 4/5 sub-jobs success (5th was inherited-from-main failure — see #1778); compensating-status posted with honest description → merged 220a04b1

The pattern parallels what status-reaper.yml does for default-branch (push) status drift, but applied to PR umbrellas instead of main-branch contexts.
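As a rough illustration of the "POST a corrected success status" step, the sketch below builds the request without sending it. It assumes Gitea's commit-status endpoint (`POST /api/v1/repos/{owner}/{repo}/statuses/{sha}`) and token-header authentication. The function name, the host, and the default context string are illustrative assumptions, not taken from the runbook.

```python
import json
import urllib.request


def build_compensating_status(base_url, owner, repo, sha, token, description,
                              context="CI / all-required (pull_request)"):
    """Build (but do not send) a POST request for a compensating success status.

    The context default is an assumption; it must match the exact context
    name the merge gate checks, or the new status will not unblock anything.
    """
    url = f"{base_url.rstrip('/')}/api/v1/repos/{owner}/{repo}/statuses/{sha}"
    payload = {
        "state": "success",
        "context": context,
        # Keep the description honest: say that this is a compensating status.
        "description": description,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending would then be `urllib.request.urlopen(req)` against the PR head SHA, and only after the sub-job results have been verified by hand.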

Proposed actions

  1. Document the pattern in internal/runbooks/dev-sop.md (or wherever recovery playbooks live) so future operators know the legitimate use-case and the API call to make.

  2. Auto-recovery option: extend status-reaper or write a sibling umbrella-reaper.yml that periodically scans for PRs where umbrella=failure AND all sub-jobs=success AND age > 5min, then posts compensating success. Same gate logic, automated.

  3. Resolved-by-prevention: if #1779 (ci-meta move) eliminates the dispatch deadlock, the propagation race might also vanish in practice (umbrella completes before sub-jobs are cancelled). Reconsider this issue after #1779 ships.
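The gate condition in option 2 (umbrella=failure AND all sub-jobs=success AND age > 5min) can be sketched as a pure predicate. This is a minimal sketch under assumptions: "age" is taken to mean time since the umbrella status was last updated, and the function and parameter names are hypothetical rather than anything in status-reaper.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the proposal: only act on umbrellas older than 5 minutes.
MIN_AGE = timedelta(minutes=5)


def is_stale_umbrella(umbrella_state, sub_job_states, umbrella_updated_at, now):
    """Return True only if the umbrella looks stale rather than truly failed.

    sub_job_states maps sub-job name -> state string (assumed shape).
    """
    if umbrella_state != "failure":
        return False
    # An empty mapping means we know nothing, so never compensate on it.
    if not sub_job_states:
        return False
    if any(state != "success" for state in sub_job_states.values()):
        return False
    return now - umbrella_updated_at > MIN_AGE
```

The age check is there so the reaper does not race the umbrella job itself while statuses are still propagating.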

Acceptance

  • dev-sop has an "umbrella stale, sub-jobs green" recovery playbook OR
  • umbrella-reaper auto-handles the pattern OR
  • #1779 demonstrably eliminates the need

Discovered during

Same session as #1778 and #1779.

Author
Owner

Docs sub-task landed in commit 2e027df8 (PR #1782): docs/runbooks/ci-umbrella-stale-compensating-status.md. Covers when-to-use, when-NOT, diagnose, recover, why-it-happens, and prevent.

Leaving this issue open for the auto-recovery sub-task (build umbrella-reaper.yml or extend status-reaper.yml to handle PR umbrellas). My recommendation is to defer that work until we have data on whether #1779's ci-meta fix actually eliminates the propagation race under normal load. If we go ~2 weeks without needing the compensating-status runbook, auto-recovery is unnecessary engineering. If we hit the pattern again within a week, build the reaper.

Tracking signal: every time the runbook gets exercised, post a brief comment here. We close this when one of:

  • 2 weeks pass with no exercise (close as obsolete, runner-pool fix sufficed)
  • Pattern recurs 3+ times → green-light the reaper build
Member

RCA: root cause

This issue is partly resolved on the documentation path, but not on the automation path. The repo now has a dedicated compensating-status runbook for stale PR umbrellas; status-reaper.py remains scoped to default-branch push/status-shadow repair and does not implement the requested “PR umbrella failed while all required sub-jobs are green” auto-reaper.

Evidence

  • docs/runbooks/ci-umbrella-stale-compensating-status.md:1 - the runbook exists and documents stale CI / all-required (pull_request) recovery.
  • docs/runbooks/ci-umbrella-stale-compensating-status.md:37 - recovery is manual: verify all required sub-jobs succeeded, then POST a compensating success status.
  • docs/runbooks/ci-umbrella-stale-compensating-status.md:63 - the runbook explicitly points back to #1780 for umbrella-reaper.yml automation if the pattern remains frequent.
  • .gitea/scripts/status-reaper.py:534 - current automation only compensates pull-request-shadow statuses when the matching push context succeeded on the same SHA, which is a narrower condition than PR sub-job umbrella recovery.

Suggested fix

Close the documentation acceptance item if maintainers consider the runbook sufficient. If automation is still desired, implement a separate umbrella-reaper.yml that queries action job rows for the PR head, requires all five umbrella sub-jobs to be success, refuses compensation if any required job failed/missing, and posts a success status with an explicit compensating description. Do not fold this into the existing default-branch status reaper without a separate guardrail/test set; its safety proof is different.

Confidence

High: the runbook and status-reaper code directly show current coverage and the remaining automation gap.
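The refusal rule in the suggested fix (compensate only when every required sub-job is success, refuse on any failed or missing job) might look like the sketch below. The job names in `REQUIRED_JOBS` are placeholders, since the real five sub-job names are not listed in this issue, and `evaluate_umbrella` is a hypothetical helper rather than existing reaper code.

```python
# Placeholder names: the actual five umbrella sub-jobs are not given here.
REQUIRED_JOBS = ("lint", "unit", "integration", "build", "e2e")


def evaluate_umbrella(required_jobs, observed):
    """Decide whether compensation is allowed and produce a description.

    observed maps job name -> state for the PR head (assumed shape).
    Returns (allowed, description).
    """
    missing = sorted(j for j in required_jobs if j not in observed)
    not_green = sorted(
        j for j in required_jobs if j in observed and observed[j] != "success"
    )
    if missing or not_green:
        reasons = []
        if missing:
            reasons.append("missing: " + ", ".join(missing))
        if not_green:
            reasons.append("not success: " + ", ".join(not_green))
        return False, "refused (" + "; ".join(reasons) + ")"
    return True, (
        f"compensating status: all {len(required_jobs)} required sub-jobs "
        "succeeded; umbrella was stale"
    )
```

Under this rule a case like #1759 (4/5 sub-jobs green) would be refused and left to a human operator, which keeps the automated path strictly narrower than the manual one.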

Member

Auto-handling implementation in PR #1964.


Reference: molecule-ai/molecule-core#1780