From 98323734eaa4aa99180a65f28812c0f7df1f9263 Mon Sep 17 00:00:00 2001 From: core-devops Date: Mon, 11 May 2026 18:41:39 -0700 Subject: [PATCH] feat(ci): status-reaper rev2 sweeps last 10 main commits (closes stranded-status gap) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit rev1 (PR #618, merged 4db64bcb) only inspected the CURRENT main HEAD per tick. Schedule workflows post `failure` to whatever SHA was HEAD when the run COMPLETED, which by the next */5 tick is usually a stale commit because main has already moved forward via merges. Result: rev1 was running successfully but with `compensated:0` on every tick across ~6 cycles (orchestrator + hongming-pc2 Phase 1+2 evidence 23:46Z / 23:59Z / 00:02Z); reds stranded on stale commits. rev2 sweeps the last 10 main commits per tick: - New `list_recent_commit_shas(branch, limit)` wraps GET /repos/{o}/{r}/commits?sha={branch}&limit={limit}. Vendor-truth probe 2026-05-11 confirms Gitea 1.22.6 returns a JSON list of commit objects with `sha` keys (per `feedback_smoke_test_vendor_truth_not_shape_match`). - New `reap_branch()` orchestrates the sweep: - For each SHA: GET combined status with PER-SHA ERROR ISOLATION (refinement #7) — ApiError on one stale SHA logs `::warning::` and continues to the next. Different from the single-HEAD pre-rev2 path where fail-loud was correct; the sweep is best-effort across historical commits. - When `combined.state == "success"`: skip the per-context loop entirely (refinement #2, cost optimization, common case). - Otherwise delegate to the existing per-SHA `reap()` worker (logic UNCHANGED — `_has_push_trigger` / `parse_push_context` / `scan_workflows` not touched per refinement #6). 
- Aggregated counters preserve all rev1 fields PLUS: - `scanned_shas`: how many SHAs we actually iterated (always 10 in normal operation; fewer if the commits API returns fewer) - `compensated_per_sha`: {<sha>: [<context>, ...]} for the SHAs that actually got at least one compensation - `reap()` now also returns `compensated_contexts` so `reap_branch()` can build `compensated_per_sha` without re-deriving it from the POST stream. Backwards-compatible — all existing test assertions check specific counter keys, none enforce a closed dict shape. - `main()` switches from `get_head_sha` + `get_combined_status` + `reap` to a single `reap_branch()` call. Adds `--limit` CLI flag for ops-driven sweep-width tuning (default 10). Design choices (refinements 1-4): - N=10: covers the burst-merge window between */5 ticks; older reds falling off is acceptable (the schedule run that posted them has long since been overwritten by a real push trigger). - Skip combined=success early: most commits in the window will be green; short-circuit before the per-context loop saves work. - No de-dup needed (refinement #4): each workflow run posts to exactly one SHA, so two different SHAs in the sweep cannot have the same (SHA, context) pair eligible for compensation. Test suite: 37 + 3 = 40/40 cases pass. - New: test_reap_sweeps_n_shas_smoke (mock 3 SHAs, verify each GET'd) - New: test_reap_skips_combined_success_shas (verify the combined=success short-circuit; only the 1 failure SHA is iterated) - New: test_reap_continues_on_per_sha_apierror (per-SHA error isolation contract — ApiError on SHA[0] logged + skipped + SHA[1] processes) - All 37 existing rev1 tests pass unchanged (per-SHA worker logic + the helpers it consumes are untouched). 
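The shape validation that `list_recent_commit_shas` applies to the commits-listing body (hard halt on a wrong top-level shape, quiet skip of malformed entries) can be sketched as a pure helper. Sketch only — `extract_shas` is a hypothetical name for illustration; the real helper also issues the GET and raises ApiError rather than ValueError:

```python
def extract_shas(body: object, *, min_len: int = 7) -> list[str]:
    """Pull usable commit SHAs out of a commits-listing JSON body.

    Wrong top-level shape is a hard halt: without the commit list the
    sweep cannot proceed. Individual malformed entries are skipped.
    """
    if not isinstance(body, list):
        raise ValueError(f"commits listing not a JSON array (got {type(body).__name__})")
    shas = [
        entry["sha"]
        for entry in body
        if isinstance(entry, dict)
        and isinstance(entry.get("sha"), str)
        and len(entry["sha"]) >= min_len
    ]
    if not shas:
        raise ValueError("commits listing returned no usable SHAs")
    return shas
```

The asymmetry is deliberate: list-level failures abort, entry-level oddities degrade gracefully — mirroring the fail-loud-vs-best-effort split described above.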
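The counter aggregation described above — scalar counters summed across SHAs, `compensated_per_sha` recorded only for SHAs with at least one compensation — can be sketched as follows (`aggregate_counters` and its input shape are hypothetical; in the patch the equivalent logic lives inline in `reap_branch`):

```python
from typing import Any

SCALAR_KEYS = (
    "compensated", "preserved_real_push", "preserved_unknown",
    "preserved_non_failure", "preserved_non_push_suffix", "preserved_unparseable",
)

def aggregate_counters(per_sha: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Fold per-SHA reap() counters into the tick-level summary shape."""
    out: dict[str, Any] = {key: 0 for key in SCALAR_KEYS}
    out["scanned_shas"] = len(per_sha)
    out["compensated_per_sha"] = {}
    for sha, counters in per_sha.items():
        for key in SCALAR_KEYS:
            out[key] += counters.get(key, 0)
        contexts = counters.get("compensated_contexts") or []
        if contexts:  # omit no-op SHAs to keep the summary readable
            out["compensated_per_sha"][sha] = list(contexts)
    return out
```

This is why `reap()` grew `compensated_contexts`: the fold needs the per-SHA context list handed back, rather than re-derived from the POST stream.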
Live dry-run smoke against git.moleculesai.app: scanned 41 workflows; push-triggered=18, class-O candidates=23 summary: {"branch":"main","compensated":0,"compensated_per_sha":{}, "dry_run":true,"limit":10,"preserved_non_failure":196, ...,"scanned_shas":10} Cross-link: - internal#327 (sibling publish-runtime-bot) - task #90 (orchestrator brief), task #46 (hongming-pc2 brief) - PR #618 (parent rev1, merge 4db64bcb) - `reference_post_suspension_pipeline` - `feedback_no_shared_persona_token_use` (commit author = core-devops, not hongming-pc2) - `feedback_strict_root_only_after_class_a` (root cause, not symptom) - `feedback_brief_hypothesis_vs_evidence` (evidence: compensated:0 across 6 cycles) Removal path: drop this workflow when Gitea >= 1.24 ships with a real fix for the hardcoded-suffix bug. Audit issue (filed alongside rev1) tracks the deletion as a follow-up sweep. --- .gitea/scripts/status-reaper.py | 199 ++++++++++++++++++++++++++++---- tests/test_status_reaper.py | 172 +++++++++++++++++++++++++++ 2 files changed, 351 insertions(+), 20 deletions(-) diff --git a/.gitea/scripts/status-reaper.py b/.gitea/scripts/status-reaper.py index dea13812..41cc8c1f 100644 --- a/.gitea/scripts/status-reaper.py +++ b/.gitea/scripts/status-reaper.py @@ -19,18 +19,34 @@ What this script does, per `.gitea/workflows/status-reaper.yml` invocation: downstream — Gitea uses ` / ` as the workflow/job separator). Classify each by whether `on:` contains a `push:` trigger. - 2. GET combined status for HEAD of WATCH_BRANCH. + 2. List the last N (=10) commits on WATCH_BRANCH via + GET /repos/{o}/{r}/commits?sha={branch}&limit={N}. rev2 sweeps + N commits per tick instead of HEAD only — schedule workflows + post `failure` to whatever SHA was HEAD when they COMPLETED, so + by the next */5 tick main has often moved forward and the red + gets stranded on a stale commit (Phase 1+2 evidence: rev1 saw + `compensated:0` every tick across ~6 cycles). - 3. 
For each per-context status entry where: - state == "failure" AND context.endswith(" (push)") - Parse context as `<workflow_name> / <job_name> (push)`. Look up - workflow_name in the trigger map: - - missing → log ::notice:: and skip (conservative). - - has_push_trigger=True → preserve (would mask real signal). - - has_push_trigger=False → POST a compensating - `state=success` status to /statuses/{sha} with the same - context (Gitea de-dups by context) and a description that - documents the workaround + this script's path. + 3. For EACH SHA in the list: + - GET combined commit status. Per-SHA error isolation + (refinement #7): if this call raises ApiError or any 5xx, + LOG `::warning::` + continue to the next SHA. Different from + the single-HEAD pre-rev2 path where fail-loud was correct; + the sweep is best-effort across historical commits, so one + transient blip on a stale SHA must not strand reds on the + OTHER stale SHAs. + - If combined.state == "success": skip — cost optimization + (refinement #2), common case (most commits are green). + - Otherwise iterate per-context entries. For each entry where: + state == "failure" AND context.endswith(" (push)") + Parse context as `<workflow_name> / <job_name> (push)`. + Look up workflow_name in the trigger map: + - missing → log ::notice:: and skip (conservative). + - has_push_trigger=True → preserve (real defect signal). + - has_push_trigger=False → POST a compensating + `state=success` status to /statuses/{sha} with the same + context (Gitea de-dups by context) and a description + documenting the workaround + this script's path. 4. Exit 0. Re-running is idempotent — Gitea's commit-status table stores the LATEST state-per-context, so the success POST sticks @@ -401,21 +417,29 @@ def reap( sha: str, *, dry_run: bool = False, -) -> dict[str, int]: +) -> dict[str, Any]: """Walk `combined.statuses[]` and compensate where appropriate. + Per-SHA worker. The multi-SHA orchestrator (`reap_branch`) calls + this once per stale main commit each tick. 
+ Returns counters for observability: {compensated, preserved_real_push, preserved_unknown, preserved_non_failure, preserved_non_push_suffix, - preserved_unparseable} + preserved_unparseable, + compensated_contexts: [<context>, ...]} + + `compensated_contexts` is rev2-added so `reap_branch` can build + `compensated_per_sha` without re-deriving it from the POST stream. """ - counters = { + counters: dict[str, Any] = { "compensated": 0, "preserved_real_push": 0, "preserved_unknown": 0, "preserved_non_failure": 0, "preserved_non_push_suffix": 0, "preserved_unparseable": 0, + "compensated_contexts": [], } statuses = combined.get("statuses") or [] @@ -464,10 +488,136 @@ def reap( sha, context, s.get("target_url"), dry_run=dry_run ) counters["compensated"] += 1 + counters["compensated_contexts"].append(context) return counters +# -------------------------------------------------------------------------- +# rev2: multi-SHA sweep over the last N commits on WATCH_BRANCH +# -------------------------------------------------------------------------- +# How many main commits to sweep per tick. Sized to cover a burst-merge +# window where multiple PRs land in the 5-min interval between reaper +# ticks. Older reds falling off the window is acceptable — they were +# already stale enough that the schedule-run that posted them has long +# since been overwritten by a real push trigger. See `reference_post_ +# suspension_pipeline` for the merge-cadence baseline. +DEFAULT_SWEEP_LIMIT = 10 + + +def list_recent_commit_shas(branch: str, limit: int) -> list[str]: + """List the most recent `limit` commit SHAs on `branch`, newest + first. + + Wraps GET /repos/{o}/{r}/commits?sha={branch}&limit={limit}. Gitea + 1.22.6 returns a JSON list of commit objects each with a `sha` key + (verified via vendor-truth probe 2026-05-11 against + git.moleculesai.app — `feedback_smoke_test_vendor_truth_not_shape_match`). + + Raises ApiError on non-2xx OR on unexpected response shape. 
This is + a HARD halt — without the commit list the sweep can't proceed. (The + per-SHA error isolation downstream is a different concern: tolerating + a transient 5xx on ONE commit's status is best-effort; losing the + commit list itself means we don't even know which commits to try.) + """ + _, body = api( + "GET", + f"/repos/{OWNER}/{NAME}/commits", + query={"sha": branch, "limit": str(limit)}, + ) + if not isinstance(body, list): + raise ApiError( + f"commits listing for {branch} not a JSON array " + f"(got {type(body).__name__})" + ) + shas: list[str] = [] + for entry in body: + if not isinstance(entry, dict): + continue + sha = entry.get("sha") + if isinstance(sha, str) and len(sha) >= 7: + shas.append(sha) + if not shas: + raise ApiError( + f"commits listing for {branch} returned no usable SHAs" + ) + return shas + + +def reap_branch( + workflow_trigger_map: dict[str, bool], + branch: str, + *, + limit: int = DEFAULT_SWEEP_LIMIT, + dry_run: bool = False, +) -> dict[str, Any]: + """Sweep the last `limit` commits on `branch`, applying `reap()` + to each (with per-SHA error isolation). + + Returns aggregated counters PLUS rev2 observability fields: + - scanned_shas: how many SHAs we actually iterated + - compensated_per_sha: {<sha>: [<context>, ...]} — only + SHAs that actually got at least one compensation are included + """ + shas = list_recent_commit_shas(branch, limit) + + aggregate: dict[str, Any] = { + "scanned_shas": 0, + "compensated": 0, + "preserved_real_push": 0, + "preserved_unknown": 0, + "preserved_non_failure": 0, + "preserved_non_push_suffix": 0, + "preserved_unparseable": 0, + "compensated_per_sha": {}, + } + + for sha in shas: + aggregate["scanned_shas"] += 1 + + # Per-SHA error isolation (refinement #7). One transient blip + # on a historical commit must NOT abort the whole tick — the + # OTHER stale SHAs may still hold strandable reds. 
+ try: + combined = get_combined_status(sha) + except ApiError as e: + print( + f"::warning::get_combined_status({sha[:10]}) failed; " + f"skipping this SHA: {e}" + ) + continue + + # Cost optimization (refinement #2): the common case is a green + # commit. Skip the per-context loop entirely when combined is + # already success — saves a tight loop over ~20 statuses per SHA + # on green commits, the dominant majority. + if combined.get("state") == "success": + continue + + per_sha = reap( + workflow_trigger_map, combined, sha, dry_run=dry_run + ) + + # Aggregate scalar counters. + for key in ( + "compensated", + "preserved_real_push", + "preserved_unknown", + "preserved_non_failure", + "preserved_non_push_suffix", + "preserved_unparseable", + ): + aggregate[key] += per_sha[key] + + # Record per-SHA compensated contexts (only when non-empty — + # keep the summary readable when most SHAs are no-ops). + contexts = per_sha.get("compensated_contexts") or [] + if contexts: + aggregate["compensated_per_sha"][sha] = list(contexts) + + return aggregate + + def main() -> int: parser = argparse.ArgumentParser(description=__doc__) parser.add_argument( @@ -475,6 +625,15 @@ def main() -> int: action="store_true", help="Skip the compensating POST; print what would be done.", ) + parser.add_argument( + "--limit", + type=int, + default=DEFAULT_SWEEP_LIMIT, + help=( + "How many recent commits on WATCH_BRANCH to sweep per tick " + f"(default: {DEFAULT_SWEEP_LIMIT})." + ), + ) args = parser.parse_args() _require_runtime_env() @@ -486,11 +645,11 @@ def main() -> int: f"class-O candidates={sum(1 for v in workflow_trigger_map.values() if not v)}" ) - sha = get_head_sha(WATCH_BRANCH) - combined = get_combined_status(sha) - - counters = reap( - workflow_trigger_map, combined, sha, dry_run=args.dry_run + counters = reap_branch( + workflow_trigger_map, + WATCH_BRANCH, + limit=args.limit, + dry_run=args.dry_run, ) # Observability: print one JSON line summarising the tick. 
Loki @@ -499,9 +658,9 @@ def main() -> int: "status-reaper summary: " + json.dumps( { - "sha": sha, "branch": WATCH_BRANCH, "dry_run": args.dry_run, + "limit": args.limit, **counters, }, sort_keys=True, diff --git a/tests/test_status_reaper.py b/tests/test_status_reaper.py index 5e444779..72dc9690 100644 --- a/tests/test_status_reaper.py +++ b/tests/test_status_reaper.py @@ -601,3 +601,175 @@ def test_scan_workflows_missing_dir_returns_empty(sr_module, tmp_path, capsys): assert out == {} captured = capsys.readouterr() assert "::warning::workflows dir not found" in captured.out + + +# -------------------------------------------------------------------------- +# rev2: multi-SHA sweep — `reap_branch()` walks last N main commits +# -------------------------------------------------------------------------- +# Phase 1+2 evidence (orchestrator + hongming-pc2): rev1 sees `compensated:0` +# every tick because the schedule workflow posts `failure` to whatever SHA +# was HEAD when it COMPLETED. By the next */5 tick, main has often moved +# forward, so the single-HEAD reaper misses the stranded red. rev2 sweeps +# the last 10 commits each tick. See `reference_post_suspension_pipeline` +# and parent rev1 PR #618 for context. + +SHA_A = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" +SHA_B = "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb" +SHA_C = "cccccccccccccccccccccccccccccccccccccccc" + + +def test_reap_sweeps_n_shas_smoke(sr_module, monkeypatch): + """rev2 contract: sweep last 10 (or N) main commits, GET combined + status for EACH. Smoke: with 3 stub SHAs, each is GET'd exactly once. 
+ """ + gets: list[str] = [] + posts: list[tuple[str, dict]] = [] + + def fake_api(method, path, *, body=None, query=None, expect_json=True): + if method == "GET" and path.endswith("/commits"): + # commits listing — return 3 fake commit objects + return (200, [{"sha": SHA_A}, {"sha": SHA_B}, {"sha": SHA_C}]) + if method == "GET" and "/commits/" in path and path.endswith("/status"): + sha = path.split("/commits/")[1].split("/status")[0] + gets.append(sha) + # All combined=success → cost-optimization short-circuit + return (200, {"state": "success", "statuses": []}) + if method == "POST": + posts.append((path, body)) + return (201, {}) + raise AssertionError(f"unexpected api call: {method} {path}") + + monkeypatch.setattr(sr_module, "api", fake_api) + + workflow_map = {"x": False} + counters = sr_module.reap_branch( + workflow_map, "main", limit=10, dry_run=False + ) + + # Each of the 3 SHAs returned by /commits should be GET'd once. + assert gets == [SHA_A, SHA_B, SHA_C] + # No POST (everything was combined=success). + assert posts == [] + # Counters reflect what we saw. + assert counters["scanned_shas"] == 3 + assert counters["compensated"] == 0 + assert counters["compensated_per_sha"] == {} + + +def test_reap_skips_combined_success_shas(sr_module, monkeypatch): + """rev2 cost-optimization (refinement #2): when combined==success for + a SHA, do NOT iterate per-context statuses; move on to next SHA. + + Mock 2 SHAs with combined=success + 1 with combined=failure → only + the failure-SHA's statuses get the per-context loop applied. 
+ """ + per_context_iterated_for: list[str] = [] + posts: list[tuple[str, dict]] = [] + + failure_statuses = [ + { + "context": "drift / drift (push)", + "state": "failure", + "target_url": "https://example.test/run/42", + } + ] + + def fake_api(method, path, *, body=None, query=None, expect_json=True): + if method == "GET" and path.endswith("/commits"): + return (200, [{"sha": SHA_A}, {"sha": SHA_B}, {"sha": SHA_C}]) + if method == "GET" and "/commits/" in path and path.endswith("/status"): + sha = path.split("/commits/")[1].split("/status")[0] + if sha == SHA_B: + # Mark this SHA as the failure one — return per-context + # statuses that would compensate if iterated. + return (200, {"state": "failure", "statuses": failure_statuses}) + # Others are combined=success — must short-circuit. + return (200, {"state": "success", "statuses": failure_statuses}) + if method == "POST": + # If a POST hits a non-failure SHA, the short-circuit failed. + posts.append((path, body)) + return (201, {}) + raise AssertionError(f"unexpected api call: {method} {path}") + + monkeypatch.setattr(sr_module, "api", fake_api) + + # Workflow trigger map: `drift` is schedule-only (compensable). + workflow_map = {"drift": False} + counters = sr_module.reap_branch( + workflow_map, "main", limit=10, dry_run=False + ) + + # Only SHA_B (the combined=failure one) should be compensated. + assert counters["compensated"] == 1 + assert counters["scanned_shas"] == 3 + assert SHA_B in counters["compensated_per_sha"] + assert counters["compensated_per_sha"][SHA_B] == ["drift / drift (push)"] + # SHA_A and SHA_C must NOT appear in compensated_per_sha — their + # per-context loop was skipped via the combined=success short-circuit. + assert SHA_A not in counters["compensated_per_sha"] + assert SHA_C not in counters["compensated_per_sha"] + # Exactly one POST: the compensation on SHA_B. 
+ assert len(posts) == 1 + assert posts[0][0] == f"/repos/owner/repo/statuses/{SHA_B}" + + +def test_reap_continues_on_per_sha_apierror(sr_module, monkeypatch, capsys): + """rev2 refinement #7 (MOST CRITICAL): a transient ApiError or HTTP-5xx + on get_combined_status(SHA_X) must NOT fail the whole tick. Log + skip + SHA_X, continue with SHA_Y. + + Different from the single-HEAD path (where fail-loud is correct): the + sweep is best-effort across historical commits, so one transient blip + on a stale SHA should not strand reds on the OTHER stale SHAs. + """ + posts: list[tuple[str, dict]] = [] + + def fake_api(method, path, *, body=None, query=None, expect_json=True): + if method == "GET" and path.endswith("/commits"): + return (200, [{"sha": SHA_A}, {"sha": SHA_B}]) + if method == "GET" and "/commits/" in path and path.endswith("/status"): + sha = path.split("/commits/")[1].split("/status")[0] + if sha == SHA_A: + raise sr_module.ApiError( + f"GET /repos/owner/repo/commits/{SHA_A}/status " + f"-> HTTP 502: bad gateway" + ) + # SHA_B returns normally with a failure to compensate. + return ( + 200, + { + "state": "failure", + "statuses": [ + { + "context": "drift / drift (push)", + "state": "failure", + } + ], + }, + ) + if method == "POST": + posts.append((path, body)) + return (201, {}) + raise AssertionError(f"unexpected api call: {method} {path}") + + monkeypatch.setattr(sr_module, "api", fake_api) + + workflow_map = {"drift": False} + # Must NOT raise — per-SHA error isolation contract. + counters = sr_module.reap_branch( + workflow_map, "main", limit=10, dry_run=False + ) + + # SHA_A was logged + skipped. SHA_B processed normally. + assert counters["scanned_shas"] == 2 + assert counters["compensated"] == 1 + assert SHA_B in counters["compensated_per_sha"] + assert SHA_A not in counters["compensated_per_sha"] + # Compensation POST landed on SHA_B only. 
+ assert len(posts) == 1 + assert posts[0][0] == f"/repos/owner/repo/statuses/{SHA_B}" + # The ApiError must be logged so a human auditing tick output can see + # WHICH SHA blipped and WHY. + captured = capsys.readouterr() + assert "::warning::" in captured.out or "::notice::" in captured.out + assert SHA_A[:10] in captured.out -- 2.45.2