From 8c343e3ac47d6e555ec9c1417142bf87c78a89c6 Mon Sep 17 00:00:00 2001
From: Molecule AI Infra-Runtime-BE
Date: Tue, 12 May 2026 03:26:36 +0000
Subject: [PATCH 1/3] fix(gitea): add || true guards to jq pipelines in
 audit-force-merge.sh

Same root cause as sop-tier-check.sh (commit a1e8f46): when GITEA_TOKEN
is empty or the API returns a non-JSON error page, the jq pipeline
exits 1, triggering set -e and aborting before the SOP_FAIL_OPEN
fallback can run.

Added || true to every jq pipeline in the script:

- MERGE_SHA, MERGED_BY, TITLE, BASE_BRANCH, HEAD_SHA extractions
  (lines 52-56): guard against malformed/empty PR JSON
- process substitution in the status-check while loop (line 78):
  guard against an empty/invalid STATUS response
- FAILED_JSON construction (line 100): guard against an empty
  FAILED_CHECKS array producing empty-pipeline jq failures

Co-Authored-By: Claude Opus 4.7
---
 .gitea/scripts/audit-force-merge.sh | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/.gitea/scripts/audit-force-merge.sh b/.gitea/scripts/audit-force-merge.sh
index d2c34fe3..be665d45 100755
--- a/.gitea/scripts/audit-force-merge.sh
+++ b/.gitea/scripts/audit-force-merge.sh
@@ -49,11 +49,11 @@ if [ "$MERGED" != "true" ]; then
   exit 0
 fi
 
-MERGE_SHA=$(echo "$PR" | jq -r '.merge_commit_sha // empty')
-MERGED_BY=$(echo "$PR" | jq -r '.merged_by.login // "unknown"')
-TITLE=$(echo "$PR" | jq -r '.title // ""')
-BASE_BRANCH=$(echo "$PR" | jq -r '.base.ref // "main"')
-HEAD_SHA=$(echo "$PR" | jq -r '.head.sha // empty')
+MERGE_SHA=$(echo "$PR" | jq -r '.merge_commit_sha // empty') || true
+MERGED_BY=$(echo "$PR" | jq -r '.merged_by.login // "unknown"') || true
+TITLE=$(echo "$PR" | jq -r '.title // ""') || true
+BASE_BRANCH=$(echo "$PR" | jq -r '.base.ref // "main"') || true
+HEAD_SHA=$(echo "$PR" | jq -r '.head.sha // empty') || true
 
 if [ -z "$MERGE_SHA" ]; then
   echo "::warning::PR #${PR_NUMBER} merged=true but no merge_commit_sha — cannot evaluate force-merge."
@@ -75,7 +75,7 @@ STATUS=$(curl -sS -H "$AUTH" \
 declare -A CHECK_STATE
 while IFS=$'\t' read -r ctx state; do
   [ -n "$ctx" ] && CHECK_STATE[$ctx]="$state"
-done < <(echo "$STATUS" | jq -r '.statuses // [] | .[] | "\(.context)\t\(.status)"')
+done < <(echo "$STATUS" | jq -r '.statuses // [] | .[] | "\(.context)\t\(.status)"') || true
 
 # 4. For each required check, was it green at merge? YAML block scalars
 # (`|`) leave a trailing newline; skip blank/whitespace-only lines.
@@ -97,7 +97,7 @@ fi
 
 # 5. Emit structured audit event.
 NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
-FAILED_JSON=$(printf '%s\n' "${FAILED_CHECKS[@]}" | jq -R . | jq -s .)
+FAILED_JSON=$(printf '%s\n' "${FAILED_CHECKS[@]}" | jq -R . | jq -s .) || true
 
 # Print as a single-line JSON so Vector's parse_json transform can pick
 # it up cleanly from docker_logs.

From fae62ac8c15e23df7939ed3a4b0222d537c2549c Mon Sep 17 00:00:00 2001
From: core-devops
Date: Mon, 11 May 2026 20:29:06 -0700
Subject: [PATCH 2/3] fix(ci): status-reaper rev3 widens window 10->30 +
 raises watchdog timeout + re-enables both crons
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 1+2 evidence (rev2 PR#633, merged 01:48Z): 6/6 ticks post-merge
with `compensated:0` despite ~25 known-stranded reds visible across
those same 10 SHAs on direct probe ~30min later.
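
(Probe shape, for reproducibility — a sketch only: the status endpoint
is the same one the verification plan below uses; the GITEA_API base-URL
variable and the exact jq count filter are assumptions, not a transcript
of the probe that was run.)

  AUTH="Authorization: token ${GITEA_TOKEN}"
  for sha in $(git rev-list -n 10 origin/main); do
    # Combined commit status per SHA; count failure contexts.
    fails=$(curl -sS -H "$AUTH" \
      "${GITEA_API}/repos/molecule-ai/molecule-core/commits/${sha}/status" \
      | jq -r '[.statuses // [] | .[] | select(.status == "failure")] | length') || true
    echo "${sha:0:8} failure-contexts=${fails}"
  done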
Reaper run 17057 at 02:46Z explicitly logged:

  scanned 42 workflows; push-triggered=19, class-O candidates=23
  status-reaper summary: {compensated:0, preserved_non_failure:185, scanned_shas:10, limit:10}

Root cause: schedule workflows post `failure` to commit-status
RETROACTIVELY 5-15 min after their merge. By the time reaper's next */5
tick lands, the stranded red is on a SHA that has already fallen
OUTSIDE a 10-commit window during a burst-merge period. The reaper
algorithm is correct; the lookback window is too narrow vs. the
retroactive-failure-post lag.

Three-in-one fix (atomic per hongming-pc2 GO 03:25Z):

1. `.gitea/scripts/status-reaper.py`
   DEFAULT_SWEEP_LIMIT 10 -> 30. Per hongming-pc2, this "trades
   window-width-cheap for cadence-loady": widening the window is cheap,
   so the `*/5` cron is kept unchanged (avoiding `*/2`, which would
   double runner load).

2. `.gitea/workflows/status-reaper.yml`
   Restore the schedule cron block (revert mc#645 comment-out for THIS
   workflow only). Cron stays `*/5 * * * *`.

3. `.gitea/workflows/main-red-watchdog.yml`
   Restore the schedule cron block (revert mc#645 comment-out) AND
   raise the job-level `timeout-minutes: 5 -> 15`. The original 5min
   cap was producing cancels under runner-saturation latency, which fed
   the very `[main-red]` issues this workflow files (self-poisoning).

Accompanying tests in `tests/test_status_reaper.py`:

+ test_default_sweep_limit_is_30 (contract pin)
+ test_reap_widened_window_catches_retroactive_failure: mocks 30 SHAs,
  plants the failing context on SHA[20] (depth strictly past rev2's
  window=10), asserts the compensation POST lands on that SHA.

Existing tests retain explicit `limit=10` overrides and remain
unchanged. Suite: 42/42 passed (was 40 + 2 new).

Verification plan (post-merge, 10-15 min after merge / 2-3 cron ticks):

- DB: SELECT id, status FROM action_run WHERE
  workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 5
  -> all status=1
- Log via web UI: /molecule-ai/molecule-core/actions/runs/{run_id}/jobs/0/logs
  -> summary line should now show compensated > 0 with
  compensated_per_sha populated
- Direct probe: pick a SHA in the last 30 main commits with class-O
  fails, GET /repos/molecule-ai/molecule-core/commits/{sha}/status
  -> compensated contexts now show state=success with description
  starting 'Compensated by status-reaper'

If rev3 STILL shows compensated:0 after the window-widening, the
diagnosis is wrong and a DIFFERENT bug needs to be uncovered (per
hongming-pc2 caveat 03:25Z). Re-enabling the crons IS the diagnosis
verification.

Cross-links:

- PR#618 (rev1, drop-concurrency, merge 4db64bcb)
- PR#633 (rev2, sweep-recent-commits, merge e7965a0f)
- PR#645 (interim disable, merge 4c54b590) — the disable this patch
  reverts (re-enables both crons)
- task #90 (orch rev3 tracker) / task #46 (hongming-pc2 tracker)
- feedback_brief_hypothesis_vs_evidence (empirical evidence above)
- feedback_strict_root_only_after_class_a (3-in-one root fix vs. longer
  patching chain)

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .gitea/scripts/status-reaper.py        | 31 +++++++++---
 .gitea/workflows/main-red-watchdog.yml | 23 ++++++---
 .gitea/workflows/status-reaper.yml     | 23 ++++++---
 tests/test_status_reaper.py            | 86 ++++++++++++++++++++++++++
 4 files changed, 137 insertions(+), 26 deletions(-)

diff --git a/.gitea/scripts/status-reaper.py b/.gitea/scripts/status-reaper.py
index 41cc8c1f..5e1f895f 100644
--- a/.gitea/scripts/status-reaper.py
+++ b/.gitea/scripts/status-reaper.py
@@ -19,13 +19,18 @@ What this script does, per `.gitea/workflows/status-reaper.yml` invocation:
      downstream — Gitea uses ` / ` as the workflow/job separator).
      Classify each by whether `on:` contains a `push:` trigger.
 
-  2. List the last N (=10) commits on WATCH_BRANCH via
-     GET /repos/{o}/{r}/commits?sha={branch}&limit={N}. rev2 sweeps
-     N commits per tick instead of HEAD only — schedule workflows
-     post `failure` to whatever SHA was HEAD when they COMPLETED, so
-     by the next */5 tick main has often moved forward and the red
-     gets stranded on a stale commit (Phase 1+2 evidence: rev1 saw
-     `compensated:0` every tick across ~6 cycles).
+  2. List the last N (=30, rev3 — widened from 10) commits on
+     WATCH_BRANCH via GET /repos/{o}/{r}/commits?sha={branch}&limit={N}.
+     rev2 sweeps N commits per tick instead of HEAD only — schedule
+     workflows post `failure` to whatever SHA was HEAD when they
+     COMPLETED, so by the next */5 tick main has often moved forward
+     and the red gets stranded on a stale commit. rev3 widens the
+     window from 10 → 30 because schedule workflows post `failure`
+     RETROACTIVELY (5-15 min after their merge); a 10-commit window
+     is narrower than the merge-cadence during a burst, so reds land
+     OUTSIDE the window before reaper sees them (Phase 1+2 evidence:
+     rev2 run 17057 at 02:46Z saw 185 non-failure contexts / 0 fails
+     on 10 SHAs; a direct probe ~30min later showed ~25 fails there).
 
   3. For EACH SHA in the list:
      - GET combined commit status. Per-SHA error isolation
@@ -502,7 +507,17 @@ def reap(
 # already stale enough that the schedule-run that posted them has long
 # since been overwritten by a real push trigger. See `reference_post_
 # suspension_pipeline` for the merge-cadence baseline.
-DEFAULT_SWEEP_LIMIT = 10
+#
+# rev3 (2026-05-12, hongming-pc2 GO 03:25Z): widened from 10 → 30.
+# rev2 (limit=10) shipped 01:48Z and ran 6/6 ticks post-merge with
+# `compensated:0` despite ~25 stranded reds visible on those same 10
+# SHAs ~30min later. Root cause: schedule workflows post `failure`
+# RETROACTIVELY 5-15 min after their merge, so by the time reaper's
+# next */5 tick lands, the stranded red is on a SHA that has already
+# fallen out of a 10-commit window during a burst-merge period.
+# Per hongming-pc2, this "trades window-width-cheap for cadence-loady":
+# the `*/5` cron is kept unchanged; only the window N is widened.
+DEFAULT_SWEEP_LIMIT = 30
 
 
 def list_recent_commit_shas(branch: str, limit: int) -> list[str]:
diff --git a/.gitea/workflows/main-red-watchdog.yml b/.gitea/workflows/main-red-watchdog.yml
index f3f62be7..4370a15d 100644
--- a/.gitea/workflows/main-red-watchdog.yml
+++ b/.gitea/workflows/main-red-watchdog.yml
@@ -37,13 +37,15 @@
 #   "unknown on type" when `workflow_dispatch.inputs.X` is present. Revisit
 #   when Gitea ≥ 1.23 is fleet-wide.
 on:
-  # SCHEDULE DISABLED 2026-05-12 — interim per RFC#420 Option-C machinery-down emergency
-  # Watchdog timing out behind runner saturation; rev3+dedicated-runner-label in flight
-  # Re-enable after rev3 lands + runner saturation root resolved
-  # schedule:
-  #   # Hourly at :05 — task spec calls for "off-zero" (`5 * * * *`),
-  #   # offset from :17 (ci-required-drift) and :00 (peak cron load).
-  #   - cron: '5 * * * *'
+  # SCHEDULE RE-ENABLED 2026-05-12 rev3 — interim disable (mc#645) reverted alongside
+  # status-reaper rev3 (widen-window). Job-level timeout-minutes raised 5 → 15 below
+  # to absorb runner-saturation latency without spurious cancels (the original cascade
+  # cause). If runner-saturation root persists, the dedicated-runner-label split
+  # remains the structural next step (tracked separately).
+ schedule: + # Hourly at :05 — task spec calls for "off-zero" (`5 * * * *`), + # offset from :17 (ci-required-drift) and :00 (peak cron load). + - cron: '5 * * * *' workflow_dispatch: # Read commit status + branch ref + issues; write issues (open/PATCH/close). @@ -61,7 +63,12 @@ concurrency: jobs: watchdog: runs-on: ubuntu-latest - timeout-minutes: 5 + # rev3 (2026-05-12, mc#645 revert): raised 5 → 15 to absorb runner-saturation + # latency. Original 5min cap was producing 124-style cancels under load, + # which fed the very `[main-red]` issues this workflow files (self-poisoning). + # 15min is still well below Gitea-default 6h job ceiling; if a real hang + # occurs the issue-file path is still the alarm surface. + timeout-minutes: 15 steps: - name: Check out repo (script lives at .gitea/scripts/) uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 diff --git a/.gitea/workflows/status-reaper.yml b/.gitea/workflows/status-reaper.yml index f6d0289d..c904ce5c 100644 --- a/.gitea/workflows/status-reaper.yml +++ b/.gitea/workflows/status-reaper.yml @@ -53,16 +53,19 @@ name: status-reaper # `inputs:` block here. Gitea 1.22.6 rejects the whole workflow as # "unknown on type" when `workflow_dispatch.inputs.X` is present. on: - # SCHEDULE DISABLED 2026-05-12 — interim per RFC#420 Option-C machinery-down emergency - # Reaper rev2 not compensating + watchdog timeout-cascade; rev3 in flight - # Re-enable after rev3 lands + runner saturation root resolved - # schedule: - # # Every 5 minutes. Off-zero alignment with sibling cron workflows: - # # ci-required-drift (`:17`), main-red-watchdog (`:05`), - # # railway-pin-audit (`:23`). 5-min cadence gives a tight enough - # # close on schedule-triggered false-reds that main-red-watchdog - # # (hourly :05) almost never files an issue on the false case. - # - cron: '*/5 * * * *' + # SCHEDULE RE-ENABLED 2026-05-12 rev3 — interim disable (mc#645) reverted now that + # rev3 widens DEFAULT_SWEEP_LIMIT 10 → 30 (covers retroactive-failure timing window). + # Sibling watchdog re-enabled in the same PR with timeout-minutes raised 5 → 15. + schedule: + # Every 5 minutes. Off-zero alignment with sibling cron workflows: + # ci-required-drift (`:17`), main-red-watchdog (`:05`), + # railway-pin-audit (`:23`). 5-min cadence gives a tight enough + # close on schedule-triggered false-reds that main-red-watchdog + # (hourly :05) almost never files an issue on the false case. + # rev3 keeps `*/5` unchanged per hongming-pc2 03:25Z review: + # "trades window-width-cheap for cadence-loady" — N=30 widens + # the lookback cheaply without doubling runner load via `*/2`. + - cron: '*/5 * * * *' workflow_dispatch: # Compensating-status POST needs write on repo statuses; no other diff --git a/tests/test_status_reaper.py b/tests/test_status_reaper.py index 72dc9690..fda532c7 100644 --- a/tests/test_status_reaper.py +++ b/tests/test_status_reaper.py @@ -713,6 +713,92 @@ def test_reap_skips_combined_success_shas(sr_module, monkeypatch): assert posts[0][0] == f"/repos/owner/repo/statuses/{SHA_B}" +def test_default_sweep_limit_is_30(sr_module): + """rev3 contract: `DEFAULT_SWEEP_LIMIT = 30` (widened from rev2's 10). + + Root cause of the widening: schedule workflows post `failure` + RETROACTIVELY 5-15 min after their merge. A 10-commit window is + narrower than the merge-cadence during a burst, so reds land + OUTSIDE the window before reaper's next tick sees them. 
+
+    Evidence: rev2 run 17057 (02:46Z 2026-05-12) saw 185 contexts / 0
+    fails on its 10 SHAs; direct probe ~30min later showed ~25 fails
+    on those same 10 SHAs.
+
+    If this default is ever lowered back, that change MUST cite
+    re-measured cadence data — a smaller window than the
+    retroactive-failure-post lag re-introduces compensated:0.
+    """
+    assert sr_module.DEFAULT_SWEEP_LIMIT == 30
+
+
+def test_reap_widened_window_catches_retroactive_failure(sr_module, monkeypatch):
+    """rev3 regression: with limit=30, a stranded red on a SHA at depth=20
+    (which the rev2 limit=10 window would have missed) IS swept + compensated.
+
+    Why this matters: rev2 ran with limit=10 and saw `compensated:0` for
+    6 consecutive ticks despite ~25 known-stranded reds across the last
+    30 main commits. Widening to 30 must demonstrably catch a SHA past
+    the old window. We mock 30 SHAs, plant the failure on SHA[20], and
+    verify exactly one compensation lands on that SHA.
+    """
+    shas = [f"{c:02x}" * 20 for c in range(30)]  # 30 deterministic 40-char SHAs
+    failing_sha = shas[20]  # depth 20 — outside rev2's window=10, inside rev3's =30
+
+    posts: list[tuple[str, dict]] = []
+
+    def fake_api(method, path, *, body=None, query=None, expect_json=True):
+        if method == "GET" and path.endswith("/commits"):
+            # /commits listing — return all 30 fake commit objects
+            assert query.get("limit") == "30", (
+                f"expected limit=30 in query, got {query}"
+            )
+            return (200, [{"sha": s} for s in shas])
+        if method == "GET" and "/commits/" in path and path.endswith("/status"):
+            sha = path.split("/commits/")[1].split("/status")[0]
+            if sha == failing_sha:
+                return (
+                    200,
+                    {
+                        "state": "failure",
+                        "statuses": [
+                            {
+                                "context": "retroactive-drift / drift (push)",
+                                "state": "failure",
+                                "target_url": "https://example.test/run/9001",
+                            }
+                        ],
+                    },
+                )
+            # All others combined=success (cost-opt short-circuit).
+            return (200, {"state": "success", "statuses": []})
+        if method == "POST":
+            posts.append((path, body))
+            return (201, {})
+        raise AssertionError(f"unexpected api call: {method} {path}")
+
+    monkeypatch.setattr(sr_module, "api", fake_api)
+
+    workflow_map = {"retroactive-drift": False}  # schedule-only → class-O
+    counters = sr_module.reap_branch(
+        workflow_map, "main", limit=sr_module.DEFAULT_SWEEP_LIMIT, dry_run=False
+    )
+
+    # All 30 SHAs walked; exactly one compensated.
+    assert counters["scanned_shas"] == 30
+    assert counters["compensated"] == 1
+    assert failing_sha in counters["compensated_per_sha"]
+    assert counters["compensated_per_sha"][failing_sha] == [
+        "retroactive-drift / drift (push)"
+    ]
+    assert len(posts) == 1
+    assert posts[0][0] == f"/repos/owner/repo/statuses/{failing_sha}"
+    # Sanity: with rev2's window=10, depth=20 would NOT have been reached.
+    # This assertion documents the rev3 widening as the structural fix:
+    # failing_sha sits at index 20, at or beyond rev2's old limit (10),
+    # i.e. outside a 10-commit window (indices 0-9) — matching the `>=`.
+    assert shas.index(failing_sha) >= 10
+
+
 def test_reap_continues_on_per_sha_apierror(sr_module, monkeypatch, capsys):
     """rev2 refinement #7 (MOST CRITICAL): a transient ApiError or
     HTTP-5xx on get_combined_status(SHA_X) must NOT fail the whole tick.
     Log + skip

From 1c9255125e7b53c4859769a1e4289c1dd8ed75cd Mon Sep 17 00:00:00 2001
From: Molecule AI Core-DevOps
Date: Tue, 12 May 2026 03:37:52 +0000
Subject: [PATCH 3/3] fix(ci): make go vet hard-failing in weekly-platform-go

---
 .gitea/workflows/weekly-platform-go.yml | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/.gitea/workflows/weekly-platform-go.yml b/.gitea/workflows/weekly-platform-go.yml
index ef133d3b..09ba7d8e 100644
--- a/.gitea/workflows/weekly-platform-go.yml
+++ b/.gitea/workflows/weekly-platform-go.yml
@@ -53,9 +53,20 @@
       - name: Build
         run: go build ./cmd/server
 
+      # `go vet` is NOT `|| true`-guarded: surfacing latent vet errors on main is
+      # the whole point of this workflow (issue #567 — the motivating case was a
+      # `go vet` error in org_external.go that sat undetected on main for weeks).
+      # A vet error here fails the step → fails the job → shows red on the weekly
+      # commit. Per Gitea quirk #10 (job-level continue-on-error is ignored), that
+      # red surfaces on main — which is the intended signal, not a regression.
       - name: go vet
-        run: go vet ./... || true
+        run: go vet ./...
 
+      # golangci-lint stays `|| true`-guarded: lint is noisier (more false
+      # positives than vet) and golangci-lint may not be pre-installed on every
+      # runner image — a `|| true` here keeps a missing-binary or lint-noise case
+      # from masking the vet/test signal above. Tighten to match ci.yml's lint
+      # gate if/when ci.yml's lint step becomes hard-failing.
       - name: golangci-lint
         run: golangci-lint run --timeout 3m ./... || true
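
(Sketch for the "tighten to match ci.yml" note above — hedged, and not
part of the applied hunk: if the only blocker for a hard-failing lint
gate is the possibly-missing binary on older runner images, a guard of
this shape would hard-fail real findings while still tolerating an
absent golangci-lint.)

      - name: golangci-lint
        run: |
          # Soft-skip only the missing-binary case; real findings fail the job.
          if command -v golangci-lint >/dev/null 2>&1; then
            golangci-lint run --timeout 3m ./...
          else
            echo "::warning::golangci-lint not on this runner image; skipping"
          fi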