Merge branch 'main' into fix/audit-force-merge-pipefail
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
qa-review / approved (pull_request) Failing after 9s
CI / Detect changes (pull_request) Successful in 12s
E2E API Smoke Test / detect-changes (pull_request) Successful in 12s
security-review / approved (pull_request) Failing after 10s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
sop-tier-check / tier-check (pull_request) Successful in 12s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 14s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 15s
gate-check-v3 / gate-check (pull_request) Successful in 14s
CI / Platform (Go) (pull_request) Successful in 3s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 3s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
CI / all-required (pull_request) Successful in 1s
audit-force-merge / audit (pull_request) Successful in 5s

claude-ceo-assistant 2026-05-12 03:34:13 +00:00
commit f9214391fb
4 changed files with 137 additions and 26 deletions


@@ -19,13 +19,18 @@ What this script does, per `.gitea/workflows/status-reaper.yml` invocation:
downstream Gitea uses ` / ` as the workflow/job separator).
Classify each by whether `on:` contains a `push:` trigger.
2. List the last N (=10) commits on WATCH_BRANCH via
GET /repos/{o}/{r}/commits?sha={branch}&limit={N}. rev2 sweeps
N commits per tick instead of HEAD only: schedule workflows
post `failure` to whatever SHA was HEAD when they COMPLETED, so
by the next */5 tick main has often moved forward and the red
gets stranded on a stale commit (Phase 1+2 evidence: rev1 saw
`compensated:0` every tick across ~6 cycles).
2. List the last N (=30, rev3 widened from 10) commits on
WATCH_BRANCH via GET /repos/{o}/{r}/commits?sha={branch}&limit={N}.
rev2 sweeps N commits per tick instead of HEAD only: schedule
workflows post `failure` to whatever SHA was HEAD when they
COMPLETED, so by the next */5 tick main has often moved forward
and the red gets stranded on a stale commit. rev3 widens the
window from 10 → 30 because schedule workflows post `failure`
RETROACTIVELY (5-15 min after their merge); a 10-commit window
is narrower than the merge-cadence during a burst, so reds land
OUTSIDE the window before reaper sees them (Phase 1+2 evidence:
rev2 run 17057 at 02:46Z saw 185/0 contexts on 10 SHAs; direct
probe ~30min later showed ~25 fails on those same 10 SHAs).
3. For EACH SHA in the list:
- GET combined commit status. Per-SHA error isolation
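The widening rationale in item 2 can be sketched numerically. The 5-15 min retroactive-failure lag and the window sizes (rev2 = 10, rev3 = 30) come from the docstring above; the burst merge cadence is an assumed figure for illustration only, not a measured number from this repo:

```python
import math

# From the docstring above: schedule workflows post `failure` up to
# ~15 minutes after their merge commit lands on main.
RETRO_LAG_MIN = 15

# ASSUMED burst merge cadence (merges per minute) -- illustrative only.
BURST_MERGES_PER_MIN = 1.5

# Commits that can land on main between a merge and its retroactive red:
depth_when_red_lands = math.ceil(RETRO_LAG_MIN * BURST_MERGES_PER_MIN)


def red_inside_window(window: int) -> bool:
    """True if a red stranded at that depth is still visible to the sweep."""
    return depth_when_red_lands < window


# Under these assumptions the red sits ~23 commits deep: outside the
# rev2 window of 10, inside the rev3 window of 30.
```

The point of the sketch: the window has to beat lag times cadence, and only the lag is bounded by the workflow design, which is why rev3 widens N rather than tightening the cron.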
@@ -502,7 +507,17 @@ def reap(
# already stale enough that the schedule-run that posted them has long
# since been overwritten by a real push trigger. See `reference_post_
# suspension_pipeline` for the merge-cadence baseline.
DEFAULT_SWEEP_LIMIT = 10
#
# rev3 (2026-05-12, hongming-pc2 GO 03:25Z): widened from 10 → 30.
# rev2 (limit=10) shipped 01:48Z and ran 6/6 ticks post-merge with
# `compensated:0` despite ~25 stranded reds visible on those same 10
# SHAs ~30min later. Root cause: schedule workflows post `failure`
# RETROACTIVELY 5-15 min after their merge, so by the time reaper's
# next */5 tick lands, the stranded red is on a SHA that has already
# fallen out of a 10-commit window during a burst-merge period.
# Trades window-width-cheap for cadence-loady (per hongming-pc2):
# kept `*/5` cron unchanged; only the window-N is widened.
DEFAULT_SWEEP_LIMIT = 30


def list_recent_commit_shas(branch: str, limit: int) -> list[str]:
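A minimal sketch of what `list_recent_commit_shas` might look like, inferred from the endpoint named in the docstring and the `api(...)` call shape the tests mock. The injected `api` parameter and the hardcoded `owner/repo` path are deviations for self-containment; the real module presumably uses its own module-level helper and resolves owner/repo itself:

```python
def list_recent_commit_shas(api, branch: str, limit: int) -> list[str]:
    """Return the SHAs of the last `limit` commits on `branch`.

    Calls GET /repos/{o}/{r}/commits?sha={branch}&limit={limit}.
    The `api` callable is assumed to return an (http_status, body)
    tuple, matching the fake used in the test file.
    """
    status, commits = api(
        "GET",
        "/repos/owner/repo/commits",
        query={"sha": branch, "limit": str(limit)},
    )
    if status != 200:
        raise RuntimeError(f"commit listing failed: HTTP {status}")
    return [c["sha"] for c in commits]
```

Note the `limit` is stringified before hitting the query, which is what the test's `query.get("limit") == "30"` assertion implies the real code does.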


@@ -37,13 +37,15 @@ name: main-red-watchdog
# "unknown on type" when `workflow_dispatch.inputs.X` is present. Revisit
# when Gitea ≥ 1.23 is fleet-wide.
on:
  # SCHEDULE DISABLED 2026-05-12 — interim per RFC#420 Option-C machinery-down emergency
  # Watchdog timing out behind runner saturation; rev3+dedicated-runner-label in flight
  # Re-enable after rev3 lands + runner saturation root resolved
  # schedule:
  #   # Hourly at :05 — task spec calls for "off-zero" (`5 * * * *`),
  #   # offset from :17 (ci-required-drift) and :00 (peak cron load).
  #   - cron: '5 * * * *'
  # SCHEDULE RE-ENABLED 2026-05-12 rev3 — interim disable (mc#645) reverted alongside
  # status-reaper rev3 (widen-window). Job-level timeout-minutes raised 5 → 15 below
  # to absorb runner-saturation latency without spurious cancels (the original cascade
  # cause). If runner-saturation root persists, the dedicated-runner-label split
  # remains the structural next step (tracked separately).
  schedule:
    # Hourly at :05 — task spec calls for "off-zero" (`5 * * * *`),
    # offset from :17 (ci-required-drift) and :00 (peak cron load).
    - cron: '5 * * * *'
  workflow_dispatch:
# Read commit status + branch ref + issues; write issues (open/PATCH/close).
@@ -61,7 +63,12 @@ concurrency:
jobs:
  watchdog:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    # rev3 (2026-05-12, mc#645 revert): raised 5 → 15 to absorb runner-saturation
    # latency. Original 5min cap was producing 124-style cancels under load,
    # which fed the very `[main-red]` issues this workflow files (self-poisoning).
    # 15min is still well below Gitea-default 6h job ceiling; if a real hang
    # occurs the issue-file path is still the alarm surface.
    timeout-minutes: 15
    steps:
      - name: Check out repo (script lives at .gitea/scripts/)
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2


@@ -53,16 +53,19 @@ name: status-reaper
# `inputs:` block here. Gitea 1.22.6 rejects the whole workflow as
# "unknown on type" when `workflow_dispatch.inputs.X` is present.
on:
  # SCHEDULE DISABLED 2026-05-12 — interim per RFC#420 Option-C machinery-down emergency
  # Reaper rev2 not compensating + watchdog timeout-cascade; rev3 in flight
  # Re-enable after rev3 lands + runner saturation root resolved
  # schedule:
  #   # Every 5 minutes. Off-zero alignment with sibling cron workflows:
  #   # ci-required-drift (`:17`), main-red-watchdog (`:05`),
  #   # railway-pin-audit (`:23`). 5-min cadence gives a tight enough
  #   # close on schedule-triggered false-reds that main-red-watchdog
  #   # (hourly :05) almost never files an issue on the false case.
  #   - cron: '*/5 * * * *'
  # SCHEDULE RE-ENABLED 2026-05-12 rev3 — interim disable (mc#645) reverted now that
  # rev3 widens DEFAULT_SWEEP_LIMIT 10 → 30 (covers retroactive-failure timing window).
  # Sibling watchdog re-enabled in the same PR with timeout-minutes raised 5 → 15.
  schedule:
    # Every 5 minutes. Off-zero alignment with sibling cron workflows:
    # ci-required-drift (`:17`), main-red-watchdog (`:05`),
    # railway-pin-audit (`:23`). 5-min cadence gives a tight enough
    # close on schedule-triggered false-reds that main-red-watchdog
    # (hourly :05) almost never files an issue on the false case.
    # rev3 keeps `*/5` unchanged per hongming-pc2 03:25Z review:
    # "trades window-width-cheap for cadence-loady" — N=30 widens
    # the lookback cheaply without doubling runner load via `*/2`.
    - cron: '*/5 * * * *'
  workflow_dispatch:
# Compensating-status POST needs write on repo statuses; no other


@@ -713,6 +713,92 @@ def test_reap_skips_combined_success_shas(sr_module, monkeypatch):
    assert posts[0][0] == f"/repos/owner/repo/statuses/{SHA_B}"


def test_default_sweep_limit_is_30(sr_module):
    """rev3 contract: `DEFAULT_SWEEP_LIMIT = 30` (widened from rev2's 10).

    Root cause of the widening: schedule workflows post `failure`
    RETROACTIVELY 5-15 min after their merge. A 10-commit window is
    narrower than the merge-cadence during a burst, so reds land
    OUTSIDE the window before reaper's next tick sees them.

    Evidence: rev2 run 17057 (02:46Z 2026-05-12) saw 185 contexts / 0
    fails on its 10 SHAs; direct probe ~30min later showed ~25 fails
    on those same 10 SHAs.

    If this default is ever lowered back, that change MUST cite
    re-measured cadence data: a smaller window than the
    retroactive-failure-post lag re-introduces `compensated:0`.
    """
    assert sr_module.DEFAULT_SWEEP_LIMIT == 30


def test_reap_widened_window_catches_retroactive_failure(sr_module, monkeypatch):
    """rev3 regression: with limit=30, a stranded red on a SHA at depth=20
    (which the rev2 limit=10 window would have missed) IS swept + compensated.

    Why this matters: rev2 ran with limit=10 and saw `compensated:0` for
    6 consecutive ticks despite ~25 known-stranded reds across the last
    30 main commits. Widening to 30 must demonstrably catch a SHA past
    the old window. We mock 30 SHAs, plant the failure on SHA[20], and
    verify exactly one compensation lands on that SHA.
    """
    shas = [f"{c:02x}" * 20 for c in range(30)]  # 30 deterministic SHAs
    failing_sha = shas[20]  # depth 20 — outside rev2's window=10, inside rev3's =30
    posts: list[tuple[str, dict]] = []

    def fake_api(method, path, *, body=None, query=None, expect_json=True):
        if method == "GET" and path.endswith("/commits"):
            # /commits listing — return all 30 fake commit objects
            assert query.get("limit") == "30", (
                f"expected limit=30 in query, got {query}"
            )
            return (200, [{"sha": s} for s in shas])
        if method == "GET" and "/commits/" in path and path.endswith("/status"):
            sha = path.split("/commits/")[1].split("/status")[0]
            if sha == failing_sha:
                return (
                    200,
                    {
                        "state": "failure",
                        "statuses": [
                            {
                                "context": "retroactive-drift / drift (push)",
                                "state": "failure",
                                "target_url": "https://example.test/run/9001",
                            }
                        ],
                    },
                )
            # All others combined=success (cost-opt short-circuit).
            return (200, {"state": "success", "statuses": []})
        if method == "POST":
            posts.append((path, body))
            return (201, {})
        raise AssertionError(f"unexpected api call: {method} {path}")

    monkeypatch.setattr(sr_module, "api", fake_api)
    workflow_map = {"retroactive-drift": False}  # schedule-only → class-O
    counters = sr_module.reap_branch(
        workflow_map, "main", limit=sr_module.DEFAULT_SWEEP_LIMIT, dry_run=False
    )

    # All 30 SHAs walked; exactly one compensated.
    assert counters["scanned_shas"] == 30
    assert counters["compensated"] == 1
    assert failing_sha in counters["compensated_per_sha"]
    assert counters["compensated_per_sha"][failing_sha] == [
        "retroactive-drift / drift (push)"
    ]
    assert len(posts) == 1
    assert posts[0][0] == f"/repos/owner/repo/statuses/{failing_sha}"

    # Sanity: with rev2's window=10, depth=20 would NOT have been reached.
    # This assertion documents the rev3 widening as the structural fix:
    # the failing_sha index (20) is at or past rev2's old limit (10),
    # i.e. outside the old 10-commit window.
    assert shas.index(failing_sha) >= 10


def test_reap_continues_on_per_sha_apierror(sr_module, monkeypatch, capsys):
    """rev2 refinement #7 (MOST CRITICAL): a transient ApiError or HTTP-5xx
    on get_combined_status(SHA_X) must NOT fail the whole tick. Log + skip
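The isolation pattern this docstring describes can be sketched as follows. `ApiError` and `get_combined_status` are names the test above uses, but the loop body, counter keys, and logging are illustrative assumptions, not the module's actual code:

```python
class ApiError(Exception):
    """Transport error or HTTP 5xx from the Gitea API helper (assumed shape)."""


def sweep_with_isolation(shas, get_combined_status, log=print):
    # Per-SHA error isolation: one flaky status fetch must not abort
    # the whole reaper tick -- log it, count it, keep walking.
    counters = {"scanned_shas": 0, "sha_errors": 0}
    for sha in shas:
        try:
            combined = get_combined_status(sha)
        except ApiError as exc:
            log(f"WARN sha={sha}: {exc}; skipping")
            counters["sha_errors"] += 1
            continue
        counters["scanned_shas"] += 1
        # ...classification + compensation would happen here...
    return counters
```

The key design point is the `continue` inside the `except`: the error path is per-iteration, so a 502 on one SHA costs one counter bump, not the remaining 29 SHAs of the window.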