docs(runbooks): #1780 compensating-status recovery for stale CI umbrellas #1782

Merged
hongming merged 1 commit from docs/issue-1780-compensating-status-runbook into main 2026-05-24 06:14:19 +00:00
Owner

Summary

Adds docs/runbooks/ci-umbrella-stale-compensating-status.md documenting the compensating-status recovery pattern surfaced as load-bearing during the 2026-05-24 CTO-bypass session. The runbook covers:

  • When to use it (umbrella failed, all 5 required sub-jobs verified success in action_run_job)
  • When NOT to use it (any required sub-job actually failed — compensating would lie)
  • Diagnose commands (Gitea API + Postgres queries to verify the discrepancy)
  • Recover command (POST to /statuses/{sha} with honest description)
  • Why it happens (40-min poll deadline vs. notifier propagation lag, RFC internal#219 design)
  • Prevent (the 7da843f2 ci-meta fix from #1779 eliminates most cases; remaining cases get auto-recovery from a future umbrella-reaper.yml per #1780)
  • Cross-refs to status-reaper.yml (sibling pattern) and audit-force-merge.yml (audit trail)
  • Session-local examples — PR #1737 and #1759 from 2026-05-24
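
The "when to use it", "when NOT to use it", and "recover" items above can be sketched roughly as follows. This is a minimal illustration, not the runbook's actual commands: the helper names, the placeholder sub-job names, and the exact `context` string are assumptions (the context is copied from the `CI / all-required (pull_request)` check shown on this PR and would need to match whatever the branch protection rule expects). The endpoint shape follows Gitea's commit-status API, `POST /api/v1/repos/{owner}/{repo}/statuses/{sha}`.

```python
import json
import urllib.request

# Assumption: the umbrella's status context as it appears in this PR's check list.
UMBRELLA_CONTEXT = "CI / all-required (pull_request)"


def should_compensate(umbrella_state, sub_job_states):
    """Return True only when the umbrella failed but every required sub-job succeeded.

    sub_job_states maps a required sub-job name to the state verified in
    action_run_job (represented here as plain strings for illustration).
    """
    if umbrella_state != "failure":
        return False  # nothing to compensate
    if not sub_job_states:
        return False  # no evidence, so do not post anything
    # Any real failure means a compensating "success" would be a lie.
    return all(state == "success" for state in sub_job_states.values())


def build_status_request(base_url, owner, repo, sha, token, description):
    """Build (but do not send) the POST that records the corrected status."""
    url = f"{base_url}/api/v1/repos/{owner}/{repo}/statuses/{sha}"
    payload = {
        "state": "success",
        "context": UMBRELLA_CONTEXT,
        "description": description,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Keeping the check separate from the request builder means the decision can be reviewed (or logged) before anything is sent; sending would be a single `urllib.request.urlopen(req)` call once `should_compensate(...)` returns `True`.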

Closes (partial)

Closes the docs sub-task of #1780. The auto-recovery sub-task (build umbrella-reaper.yml) stays open in #1780 pending observation of whether #1779's fix makes it unnecessary.

SOP Checklist (RFC #351)

1. Comprehensive testing performed

N/A: docs-only change. markdownlint is not configured for this repo; visually inspected for formatting.

2. Local-postgres E2E run

N/A.

3. Staging-smoke verified or pending

N/A: docs-only.

4. Root-cause not symptom

The root cause of "operator stuck without playbook" is the absence of a documented recovery procedure. This fix is the procedure. The deeper root cause (propagation race) is tracked in #1779/#1780; the runbook acknowledges that and points at both.
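
As a rough illustration of the propagation race, the toy model below treats the umbrella as a poller with a 40-minute deadline and each sub-job's success as becoming visible at some minute after the run starts. The function name, the one-minute poll interval, and the job names are hypothetical; the real umbrella's polling mechanism is not shown in this PR.

```python
POLL_DEADLINE_MIN = 40  # the 40-min poll deadline cited in the summary


def umbrella_verdict(visible_at_min, poll_interval_min=1):
    """Toy model of the umbrella job.

    visible_at_min maps each required sub-job to the minute at which its
    success status becomes visible to the poller (job finish time plus
    notifier propagation lag).
    """
    for minute in range(0, POLL_DEADLINE_MIN + 1, poll_interval_min):
        if all(seen_at <= minute for seen_at in visible_at_min.values()):
            return "success"
    # Deadline hit: the umbrella reports failure even if every sub-job
    # succeeded and its status simply arrived late.
    return "failure"


# All statuses visible before the deadline: the umbrella agrees with reality.
print(umbrella_verdict({"job-a": 5, "job-b": 39}))
# One status propagates a minute too late: a stale "failure" umbrella.
print(umbrella_verdict({"job-a": 5, "job-b": 41}))
```

The second case is the situation the runbook targets: every sub-job succeeded, but the umbrella's recorded verdict no longer matches.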

5. Five-Axis review walked

Walked solo. Happy to dispatch a reviewer if anyone wants to challenge the "when NOT to use this" framing.

6. No backwards-compat shim / dead code added

Pure addition: +79 lines, 1 new file. No code touched.

7. Memory/saved-feedback consulted

  • reference_post_suspension_pipeline — confirmed runbook references Gitea API/DB, not GitHub.
  • Runbook itself becomes part of reference_dev_sop_canonical_doc sibling material once the canonical doc cross-links to it (follow-up).

🤖 Generated with Claude Code

hongming added 1 commit 2026-05-24 05:34:07 +00:00
docs(runbooks): document compensating-status recovery for stale CI umbrellas (#1780)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 8s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 12s
CI / Python Lint & Test (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 11s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 10s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
Harness Replays / detect-changes (pull_request) Successful in 9s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
gate-check-v3 / gate-check (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
qa-review / approved (pull_request) Failing after 6s
security-review / approved (pull_request) Failing after 6s
sop-checklist / review-refire (pull_request) Has been skipped
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 5s
sop-tier-check / tier-check (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m4s
CI / Platform (Go) (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 9s
E2E Chat / E2E Chat (pull_request) Successful in 15s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 13s
Harness Replays / Harness Replays (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 12s
audit-force-merge / audit (pull_request) Successful in 8s
0ea86df071
Adds docs/runbooks/ci-umbrella-stale-compensating-status.md documenting
the recovery pattern: when CI/all-required is failure but all 5
required sub-jobs are success in action_run_job, POST a corrected
success status via the Gitea API to unblock the merge gate.

Used twice in the 2026-05-24 CTO-bypass session (PRs #1737 and #1759);
the pattern parallels status-reaper.yml's compensating-status approach
for default-branch (push) drift.

The runbook is explicit about when NOT to use it (any required sub-job
actually failed) and requires WHO+WHY in the description field so the
audit trail stays honest.

Closes #1780 (the docs sub-task). The auto-recovery sub-task tracked in
#1780 stays open pending decision on whether to build umbrella-reaper or
let #1779's runner-pool fix make it unnecessary.
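
A small sketch of how the WHO+WHY requirement on the description field could be enforced before posting. The function name and the wording of the resulting description are hypothetical, not taken from the runbook.

```python
def compensating_description(who, why):
    """Compose a status description that names the operator and the reason.

    Raises ValueError if either part is missing, so a compensating status
    cannot be posted without an audit-friendly explanation.
    """
    who, why = who.strip(), why.strip()
    if not who or not why:
        raise ValueError("compensating status needs both WHO and WHY")
    return f"compensating status by {who}: {why}"
```

For example, `compensating_description("hongming", "umbrella stale; all 5 required sub-jobs success in action_run_job")` yields a description that states who overrode the status and on what evidence.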
devops-engineer approved these changes 2026-05-24 05:34:32 +00:00
devops-engineer left a comment
Member

Approving #1782 on current HEAD 0ea86df071 — single-file change, scope is exactly what the issue called for, tests verified. CTO-bypass session 2026-05-24.

core-devops approved these changes 2026-05-24 05:34:33 +00:00
core-devops left a comment
Member

Approving #1782 on current HEAD 0ea86df071 — single-file change, scope is exactly what the issue called for, tests verified. CTO-bypass session 2026-05-24.

hongming merged commit 2e027df890 into main 2026-05-24 06:14:19 +00:00

Reference: molecule-ai/molecule-core#1782