From 8c343e3ac47d6e555ec9c1417142bf87c78a89c6 Mon Sep 17 00:00:00 2001
From: Molecule AI Infra-Runtime-BE
Date: Tue, 12 May 2026 03:26:36 +0000
Subject: [PATCH 1/3] fix(gitea): add || true guards to jq pipelines in
 audit-force-merge.sh

Same root cause as sop-tier-check.sh (commit a1e8f46): when GITEA_TOKEN
is empty or the API returns a non-JSON error page, the jq pipeline
exits 1, triggering set -e and aborting before the SOP_FAIL_OPEN
fallback can run.

Added || true to every jq pipeline in the script:

- MERGE_SHA, MERGED_BY, TITLE, BASE_BRANCH, HEAD_SHA extractions
  (lines 52-56): guard against malformed/empty PR JSON
- process substitution in the status-check while loop (line 78):
  guard against an empty/invalid STATUS response
- FAILED_JSON construction (line 100): guard against an empty
  FAILED_CHECKS array producing empty-pipeline jq failures

Co-Authored-By: Claude Opus 4.7
---
 .gitea/scripts/audit-force-merge.sh | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/.gitea/scripts/audit-force-merge.sh b/.gitea/scripts/audit-force-merge.sh
index d2c34fe3..be665d45 100755
--- a/.gitea/scripts/audit-force-merge.sh
+++ b/.gitea/scripts/audit-force-merge.sh
@@ -49,11 +49,11 @@ if [ "$MERGED" != "true" ]; then
   exit 0
 fi
 
-MERGE_SHA=$(echo "$PR" | jq -r '.merge_commit_sha // empty')
-MERGED_BY=$(echo "$PR" | jq -r '.merged_by.login // "unknown"')
-TITLE=$(echo "$PR" | jq -r '.title // ""')
-BASE_BRANCH=$(echo "$PR" | jq -r '.base.ref // "main"')
-HEAD_SHA=$(echo "$PR" | jq -r '.head.sha // empty')
+MERGE_SHA=$(echo "$PR" | jq -r '.merge_commit_sha // empty') || true
+MERGED_BY=$(echo "$PR" | jq -r '.merged_by.login // "unknown"') || true
+TITLE=$(echo "$PR" | jq -r '.title // ""') || true
+BASE_BRANCH=$(echo "$PR" | jq -r '.base.ref // "main"') || true
+HEAD_SHA=$(echo "$PR" | jq -r '.head.sha // empty') || true
 
 if [ -z "$MERGE_SHA" ]; then
   echo "::warning::PR #${PR_NUMBER} merged=true but no merge_commit_sha — cannot evaluate force-merge."
@@ -75,7 +75,7 @@ STATUS=$(curl -sS -H "$AUTH" \
 declare -A CHECK_STATE
 while IFS=$'\t' read -r ctx state; do
   [ -n "$ctx" ] && CHECK_STATE[$ctx]="$state"
-done < <(echo "$STATUS" | jq -r '.statuses // [] | .[] | "\(.context)\t\(.status)"')
+done < <(echo "$STATUS" | jq -r '.statuses // [] | .[] | "\(.context)\t\(.status)"') || true
 
 # 4. For each required check, was it green at merge? YAML block scalars
 # (`|`) leave a trailing newline; skip blank/whitespace-only lines.
@@ -97,7 +97,7 @@ fi
 
 # 5. Emit structured audit event.
 NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
-FAILED_JSON=$(printf '%s\n' "${FAILED_CHECKS[@]}" | jq -R . | jq -s .)
+FAILED_JSON=$(printf '%s\n' "${FAILED_CHECKS[@]}" | jq -R . | jq -s .) || true
 
 # Print as a single-line JSON so Vector's parse_json transform can pick
 # it up cleanly from docker_logs.

From fae62ac8c15e23df7939ed3a4b0222d537c2549c Mon Sep 17 00:00:00 2001
From: core-devops
Date: Mon, 11 May 2026 20:29:06 -0700
Subject: [PATCH 2/3] fix(ci): status-reaper rev3 widens window 10->30 +
 raises watchdog timeout + re-enables both crons
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 1+2 evidence (rev2 PR#633, merged 01:48Z): 6/6 ticks post-merge
with `compensated:0` despite ~25 known-stranded reds visible across
those same 10 SHAs on direct probe ~30min later.
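
(Probe shape, for reproducibility — a sketch only: the status endpoint
is the same one the verification plan below uses; the GITEA_API base-URL
variable and the exact jq count filter are assumptions, not a transcript
of the probe that was run.)

  AUTH="Authorization: token ${GITEA_TOKEN}"
  for sha in $(git rev-list -n 10 origin/main); do
    # Combined commit status per SHA; count failure contexts.
    fails=$(curl -sS -H "$AUTH" \
      "${GITEA_API}/repos/molecule-ai/molecule-core/commits/${sha}/status" \
      | jq -r '[.statuses // [] | .[] | select(.status == "failure")] | length') || true
    echo "${sha:0:8} failure-contexts=${fails}"
  done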
Reaper run 17057 at 02:46Z explicitly logged:

  scanned 42 workflows; push-triggered=19, class-O candidates=23
  status-reaper summary: {compensated:0, preserved_non_failure:185, scanned_shas:10, limit:10}

Root cause: schedule workflows post `failure` to commit-status
RETROACTIVELY 5-15 min after their merge. By the time reaper's next */5
tick lands, the stranded red is on a SHA that has already fallen
OUTSIDE a 10-commit window during a burst-merge period. The reaper
algorithm is correct; the lookback window is too narrow vs. the
retroactive-failure-post lag.

Three-in-one fix (atomic per hongming-pc2 GO 03:25Z):

1. `.gitea/scripts/status-reaper.py`
   DEFAULT_SWEEP_LIMIT 10 -> 30. Per hongming-pc2, this "trades
   window-width-cheap for cadence-loady": widening the window is cheap,
   so the `*/5` cron is kept unchanged (avoiding `*/2`, which would
   double runner load).

2. `.gitea/workflows/status-reaper.yml`
   Restore the schedule cron block (revert mc#645 comment-out for THIS
   workflow only). Cron stays `*/5 * * * *`.

3. `.gitea/workflows/main-red-watchdog.yml`
   Restore the schedule cron block (revert mc#645 comment-out) AND
   raise the job-level `timeout-minutes: 5 -> 15`. The original 5min
   cap was producing cancels under runner-saturation latency, which fed
   the very `[main-red]` issues this workflow files (self-poisoning).

Accompanying tests in `tests/test_status_reaper.py`:

+ test_default_sweep_limit_is_30 (contract pin)
+ test_reap_widened_window_catches_retroactive_failure: mocks 30 SHAs,
  plants the failing context on SHA[20] (depth strictly past rev2's
  window=10), asserts the compensation POST lands on that SHA.

Existing tests retain explicit `limit=10` overrides and remain
unchanged. Suite: 42/42 passed (was 40 + 2 new).

Verification plan (post-merge, 10-15 min after merge / 2-3 cron ticks):

- DB: SELECT id, status FROM action_run WHERE
  workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 5
  -> all status=1
- Log via web UI: /molecule-ai/molecule-core/actions/runs/{run_id}/jobs/0/logs
  -> summary line should now show compensated > 0 with
  compensated_per_sha populated
- Direct probe: pick a SHA in the last 30 main commits with class-O
  fails, GET /repos/molecule-ai/molecule-core/commits/{sha}/status
  -> compensated contexts now show state=success with description
  starting 'Compensated by status-reaper'

If rev3 STILL shows compensated:0 after the window-widening, the
diagnosis is wrong and a DIFFERENT bug needs to be uncovered (per
hongming-pc2 caveat 03:25Z). Re-enabling the crons IS the diagnosis
verification.

Cross-links:

- PR#618 (rev1, drop-concurrency, merge 4db64bcb)
- PR#633 (rev2, sweep-recent-commits, merge e7965a0f)
- PR#645 (interim disable, merge 4c54b590) — the disable this patch
  reverts (re-enables both crons)
- task #90 (orch rev3 tracker) / task #46 (hongming-pc2 tracker)
- feedback_brief_hypothesis_vs_evidence (empirical evidence above)
- feedback_strict_root_only_after_class_a (3-in-one root fix vs. longer
  patching chain)

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .gitea/scripts/status-reaper.py        | 31 +++++++++---
 .gitea/workflows/main-red-watchdog.yml | 23 ++++++---
 .gitea/workflows/status-reaper.yml     | 23 ++++++---
 tests/test_status_reaper.py            | 86 ++++++++++++++++++++++++++
 4 files changed, 137 insertions(+), 26 deletions(-)

diff --git a/.gitea/scripts/status-reaper.py b/.gitea/scripts/status-reaper.py
index 41cc8c1f..5e1f895f 100644
--- a/.gitea/scripts/status-reaper.py
+++ b/.gitea/scripts/status-reaper.py
@@ -19,13 +19,18 @@ What this script does, per `.gitea/workflows/status-reaper.yml` invocation:
      downstream — Gitea uses ` / ` as the workflow/job separator).
      Classify each by whether `on:` contains a `push:` trigger.
 
-  2. List the last N (=10) commits on WATCH_BRANCH via
-     GET /repos/{o}/{r}/commits?sha={branch}&limit={N}. rev2 sweeps
-     N commits per tick instead of HEAD only — schedule workflows
-     post `failure` to whatever SHA was HEAD when they COMPLETED, so
-     by the next */5 tick main has often moved forward and the red
-     gets stranded on a stale commit (Phase 1+2 evidence: rev1 saw
-     `compensated:0` every tick across ~6 cycles).
+  2. List the last N (=30, rev3 — widened from 10) commits on
+     WATCH_BRANCH via GET /repos/{o}/{r}/commits?sha={branch}&limit={N}.
+     rev2 sweeps N commits per tick instead of HEAD only — schedule
+     workflows post `failure` to whatever SHA was HEAD when they
+     COMPLETED, so by the next */5 tick main has often moved forward
+     and the red gets stranded on a stale commit. rev3 widens the
+     window from 10 → 30 because schedule workflows post `failure`
+     RETROACTIVELY (5-15 min after their merge); a 10-commit window
+     is narrower than the merge-cadence during a burst, so reds land
+     OUTSIDE the window before reaper sees them (Phase 1+2 evidence:
+     rev2 run 17057 at 02:46Z saw 185 non-failure contexts / 0 fails
+     on 10 SHAs; a direct probe ~30min later showed ~25 fails there).
 
   3. For EACH SHA in the list:
      - GET combined commit status. Per-SHA error isolation
@@ -502,7 +507,17 @@ def reap(
 # already stale enough that the schedule-run that posted them has long
 # since been overwritten by a real push trigger. See `reference_post_
 # suspension_pipeline` for the merge-cadence baseline.
-DEFAULT_SWEEP_LIMIT = 10
+#
+# rev3 (2026-05-12, hongming-pc2 GO 03:25Z): widened from 10 → 30.
+# rev2 (limit=10) shipped 01:48Z and ran 6/6 ticks post-merge with
+# `compensated:0` despite ~25 stranded reds visible on those same 10
+# SHAs ~30min later. Root cause: schedule workflows post `failure`
+# RETROACTIVELY 5-15 min after their merge, so by the time reaper's
+# next */5 tick lands, the stranded red is on a SHA that has already
+# fallen out of a 10-commit window during a burst-merge period.
+# Per hongming-pc2, this "trades window-width-cheap for cadence-loady":
+# the `*/5` cron is kept unchanged; only the window N is widened.
+DEFAULT_SWEEP_LIMIT = 30
 
 
 def list_recent_commit_shas(branch: str, limit: int) -> list[str]:
diff --git a/.gitea/workflows/main-red-watchdog.yml b/.gitea/workflows/main-red-watchdog.yml
index f3f62be7..4370a15d 100644
--- a/.gitea/workflows/main-red-watchdog.yml
+++ b/.gitea/workflows/main-red-watchdog.yml
@@ -37,13 +37,15 @@
 #   "unknown on type" when `workflow_dispatch.inputs.X` is present. Revisit
 #   when Gitea ≥ 1.23 is fleet-wide.
 on:
-  # SCHEDULE DISABLED 2026-05-12 — interim per RFC#420 Option-C machinery-down emergency
-  # Watchdog timing out behind runner saturation; rev3+dedicated-runner-label in flight
-  # Re-enable after rev3 lands + runner saturation root resolved
-  # schedule:
-  #   # Hourly at :05 — task spec calls for "off-zero" (`5 * * * *`),
-  #   # offset from :17 (ci-required-drift) and :00 (peak cron load).
-  #   - cron: '5 * * * *'
+  # SCHEDULE RE-ENABLED 2026-05-12 rev3 — interim disable (mc#645) reverted alongside
+  # status-reaper rev3 (widen-window). Job-level timeout-minutes raised 5 → 15 below
+  # to absorb runner-saturation latency without spurious cancels (the original cascade
+  # cause). If runner-saturation root persists, the dedicated-runner-label split
+  # remains the structural next step (tracked separately).
+ schedule: + # Hourly at :05 — task spec calls for "off-zero" (`5 * * * *`), + # offset from :17 (ci-required-drift) and :00 (peak cron load). + - cron: '5 * * * *' workflow_dispatch: # Read commit status + branch ref + issues; write issues (open/PATCH/close). @@ -61,7 +63,12 @@ concurrency: jobs: watchdog: runs-on: ubuntu-latest - timeout-minutes: 5 + # rev3 (2026-05-12, mc#645 revert): raised 5 → 15 to absorb runner-saturation + # latency. Original 5min cap was producing 124-style cancels under load, + # which fed the very `[main-red]` issues this workflow files (self-poisoning). + # 15min is still well below Gitea-default 6h job ceiling; if a real hang + # occurs the issue-file path is still the alarm surface. + timeout-minutes: 15 steps: - name: Check out repo (script lives at .gitea/scripts/) uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 diff --git a/.gitea/workflows/status-reaper.yml b/.gitea/workflows/status-reaper.yml index f6d0289d..c904ce5c 100644 --- a/.gitea/workflows/status-reaper.yml +++ b/.gitea/workflows/status-reaper.yml @@ -53,16 +53,19 @@ name: status-reaper # `inputs:` block here. Gitea 1.22.6 rejects the whole workflow as # "unknown on type" when `workflow_dispatch.inputs.X` is present. on: - # SCHEDULE DISABLED 2026-05-12 — interim per RFC#420 Option-C machinery-down emergency - # Reaper rev2 not compensating + watchdog timeout-cascade; rev3 in flight - # Re-enable after rev3 lands + runner saturation root resolved - # schedule: - # # Every 5 minutes. Off-zero alignment with sibling cron workflows: - # # ci-required-drift (`:17`), main-red-watchdog (`:05`), - # # railway-pin-audit (`:23`). 5-min cadence gives a tight enough - # # close on schedule-triggered false-reds that main-red-watchdog - # # (hourly :05) almost never files an issue on the false case. - # - cron: '*/5 * * * *' + # SCHEDULE RE-ENABLED 2026-05-12 rev3 — interim disable (mc#645) reverted now that + # rev3 widens DEFAULT_SWEEP_LIMIT 10 → 30 (covers retroactive-failure timing window). + # Sibling watchdog re-enabled in the same PR with timeout-minutes raised 5 → 15. + schedule: + # Every 5 minutes. Off-zero alignment with sibling cron workflows: + # ci-required-drift (`:17`), main-red-watchdog (`:05`), + # railway-pin-audit (`:23`). 5-min cadence gives a tight enough + # close on schedule-triggered false-reds that main-red-watchdog + # (hourly :05) almost never files an issue on the false case. + # rev3 keeps `*/5` unchanged per hongming-pc2 03:25Z review: + # "trades window-width-cheap for cadence-loady" — N=30 widens + # the lookback cheaply without doubling runner load via `*/2`. + - cron: '*/5 * * * *' workflow_dispatch: # Compensating-status POST needs write on repo statuses; no other diff --git a/tests/test_status_reaper.py b/tests/test_status_reaper.py index 72dc9690..fda532c7 100644 --- a/tests/test_status_reaper.py +++ b/tests/test_status_reaper.py @@ -713,6 +713,92 @@ def test_reap_skips_combined_success_shas(sr_module, monkeypatch): assert posts[0][0] == f"/repos/owner/repo/statuses/{SHA_B}" +def test_default_sweep_limit_is_30(sr_module): + """rev3 contract: `DEFAULT_SWEEP_LIMIT = 30` (widened from rev2's 10). + + Root cause of the widening: schedule workflows post `failure` + RETROACTIVELY 5-15 min after their merge. A 10-commit window is + narrower than the merge-cadence during a burst, so reds land + OUTSIDE the window before reaper's next tick sees them. 
+
+    Evidence: rev2 run 17057 (02:46Z 2026-05-12) saw 185 contexts / 0
+    fails on its 10 SHAs; direct probe ~30min later showed ~25 fails
+    on those same 10 SHAs.
+
+    If this default is ever lowered back, that change MUST cite
+    re-measured cadence data — a smaller window than the
+    retroactive-failure-post lag re-introduces compensated:0.
+    """
+    assert sr_module.DEFAULT_SWEEP_LIMIT == 30
+
+
+def test_reap_widened_window_catches_retroactive_failure(sr_module, monkeypatch):
+    """rev3 regression: with limit=30, a stranded red on a SHA at depth=20
+    (which the rev2 limit=10 window would have missed) IS swept + compensated.
+
+    Why this matters: rev2 ran with limit=10 and saw `compensated:0` for
+    6 consecutive ticks despite ~25 known-stranded reds across the last
+    30 main commits. Widening to 30 must demonstrably catch a SHA past
+    the old window. We mock 30 SHAs, plant the failure on SHA[20], and
+    verify exactly one compensation lands on that SHA.
+    """
+    shas = [f"{c:02x}" * 20 for c in range(30)]  # 30 deterministic 40-char SHAs
+    failing_sha = shas[20]  # depth 20 — outside rev2's window=10, inside rev3's =30
+
+    posts: list[tuple[str, dict]] = []
+
+    def fake_api(method, path, *, body=None, query=None, expect_json=True):
+        if method == "GET" and path.endswith("/commits"):
+            # /commits listing — return all 30 fake commit objects
+            assert query.get("limit") == "30", (
+                f"expected limit=30 in query, got {query}"
+            )
+            return (200, [{"sha": s} for s in shas])
+        if method == "GET" and "/commits/" in path and path.endswith("/status"):
+            sha = path.split("/commits/")[1].split("/status")[0]
+            if sha == failing_sha:
+                return (
+                    200,
+                    {
+                        "state": "failure",
+                        "statuses": [
+                            {
+                                "context": "retroactive-drift / drift (push)",
+                                "state": "failure",
+                                "target_url": "https://example.test/run/9001",
+                            }
+                        ],
+                    },
+                )
+            # All others combined=success (cost-opt short-circuit).
+            return (200, {"state": "success", "statuses": []})
+        if method == "POST":
+            posts.append((path, body))
+            return (201, {})
+        raise AssertionError(f"unexpected api call: {method} {path}")
+
+    monkeypatch.setattr(sr_module, "api", fake_api)
+
+    workflow_map = {"retroactive-drift": False}  # schedule-only → class-O
+    counters = sr_module.reap_branch(
+        workflow_map, "main", limit=sr_module.DEFAULT_SWEEP_LIMIT, dry_run=False
+    )
+
+    # All 30 SHAs walked; exactly one compensated.
+    assert counters["scanned_shas"] == 30
+    assert counters["compensated"] == 1
+    assert failing_sha in counters["compensated_per_sha"]
+    assert counters["compensated_per_sha"][failing_sha] == [
+        "retroactive-drift / drift (push)"
+    ]
+    assert len(posts) == 1
+    assert posts[0][0] == f"/repos/owner/repo/statuses/{failing_sha}"
+    # Sanity: with rev2's window=10, depth=20 would NOT have been reached.
+    # This assertion documents the rev3 widening as the structural fix:
+    # failing_sha sits at index 20, at or beyond rev2's old limit (10),
+    # i.e. outside a 10-commit window (indices 0-9) — matching the `>=`.
+    assert shas.index(failing_sha) >= 10
+
+
 def test_reap_continues_on_per_sha_apierror(sr_module, monkeypatch, capsys):
     """rev2 refinement #7 (MOST CRITICAL): a transient ApiError or
     HTTP-5xx on get_combined_status(SHA_X) must NOT fail the whole tick.
     Log + skip

From 1c9255125e7b53c4859769a1e4289c1dd8ed75cd Mon Sep 17 00:00:00 2001
From: Molecule AI Core-DevOps
Date: Tue, 12 May 2026 03:37:52 +0000
Subject: [PATCH 3/3] fix(ci): make go vet hard-failing in weekly-platform-go

---
 .gitea/workflows/weekly-platform-go.yml | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/.gitea/workflows/weekly-platform-go.yml b/.gitea/workflows/weekly-platform-go.yml
index ef133d3b..09ba7d8e 100644
--- a/.gitea/workflows/weekly-platform-go.yml
+++ b/.gitea/workflows/weekly-platform-go.yml
@@ -53,9 +53,20 @@
       - name: Build
         run: go build ./cmd/server
 
+      # `go vet` is NOT `|| true`-guarded: surfacing latent vet errors on main is
+      # the whole point of this workflow (issue #567 — the motivating case was a
+      # `go vet` error in org_external.go that sat undetected on main for weeks).
+      # A vet error here fails the step → fails the job → shows red on the weekly
+      # commit. Per Gitea quirk #10 (job-level continue-on-error is ignored), that
+      # red surfaces on main — which is the intended signal, not a regression.
       - name: go vet
-        run: go vet ./... || true
+        run: go vet ./...
 
+      # golangci-lint stays `|| true`-guarded: lint is noisier (more false
+      # positives than vet) and golangci-lint may not be pre-installed on every
+      # runner image — a `|| true` here keeps a missing-binary or lint-noise case
+      # from masking the vet/test signal above. Tighten to match ci.yml's lint
+      # gate if/when ci.yml's lint step becomes hard-failing.
       - name: golangci-lint
         run: golangci-lint run --timeout 3m ./... || true
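
(Sketch for the "tighten to match ci.yml" note above — hedged, and not
part of the applied hunk: if the only blocker for a hard-failing lint
gate is the possibly-missing binary on older runner images, a guard of
this shape would hard-fail real findings while still tolerating an
absent golangci-lint.)

      - name: golangci-lint
        run: |
          # Soft-skip only the missing-binary case; real findings fail the job.
          if command -v golangci-lint >/dev/null 2>&1; then
            golangci-lint run --timeout 3m ./...
          else
            echo "::warning::golangci-lint not on this runner image; skipping"
          fi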