molecule-core/.gitea/workflows
core-devops fae62ac8c1
fix(ci): status-reaper rev3 widens window 10->30 + raises watchdog timeout + re-enables both crons
Phase 1+2 evidence (rev2 PR#633, merged 01:48Z): 6/6 post-merge ticks
reported `compensated:0` despite ~25 known-stranded reds visible across
those same 10 SHAs on a direct probe ~30min later. Reaper run 17057 at
02:46Z explicitly logged:

    scanned 42 workflows; push-triggered=19, class-O candidates=23
    status-reaper summary: {compensated:0, preserved_non_failure:185,
      scanned_shas:10, limit:10}

Root cause: schedule workflows post `failure` to commit-status
RETROACTIVELY, 5-15 min after the merge. By the time the reaper's next
*/5 tick lands, the stranded red sits on a SHA that has already fallen
OUTSIDE the 10-commit window during a burst-merge period. The reaper
algorithm is correct; the lookback window is too narrow for the
retroactive-failure-post lag.
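
The window-vs-lag arithmetic can be sketched as below. This is only an
illustration: the merge rate is a hypothetical burst figure, not a
measured one, and the helper names are not taken from
`status-reaper.py`. Only the 15 min lag and the 10/30 limits come from
this message.

```python
import math


def commits_behind_head(merges_per_hour: float, lag_minutes: float) -> int:
    """Estimate how many newer commits land on main during the lag."""
    return math.ceil(merges_per_hour * lag_minutes / 60)


def inside_window(depth: int, limit: int) -> bool:
    """Depth 0 is HEAD; a sweep of `limit` commits covers depths 0..limit-1."""
    return depth < limit


# Hypothetical burst: 60 merges/hour, failure posted 15 min after the merge.
depth = commits_behind_head(merges_per_hour=60, lag_minutes=15)
print(depth, inside_window(depth, 10), inside_window(depth, 30))
```

Under these assumed numbers the stranded SHA sits 15 commits behind
HEAD, so a 10-commit sweep never reaches it while a 30-commit sweep
does.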

Three-in-one fix (atomic per hongming-pc2 GO 03:25Z), with tests as item 4:

1. `.gitea/scripts/status-reaper.py`
   DEFAULT_SWEEP_LIMIT 10 -> 30. Widening the window is cheap, whereas
   raising the cadence adds runner load, so the `*/5` cron stays
   unchanged (`*/2` would more than double runner load).

2. `.gitea/workflows/status-reaper.yml`
   Restore schedule cron block (revert mc#645 comment-out for THIS
   workflow only). Cron stays `*/5 * * * *`.

3. `.gitea/workflows/main-red-watchdog.yml`
   Restore schedule cron block (revert mc#645 comment-out) AND raise
   job-level `timeout-minutes: 5 -> 15`. Original 5min cap was
   producing cancels under runner-saturation latency, which fed the
   very `[main-red]` issues this workflow files (self-poisoning).

4. `tests/test_status_reaper.py`
   + test_default_sweep_limit_is_30 (contract pin)
   + test_reap_widened_window_catches_retroactive_failure: mocks 30
     SHAs, plants the failing context on SHA[20] (depth strictly past
     rev2's window=10), asserts the compensation POST lands on that
     SHA. Existing tests retain explicit `limit=10` overrides and
     remain unchanged. Suite: 42/42 passed (was 40 + 2 new).
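
The shape of the new window test can be sketched as follows. The `reap`
function here is a toy stand-in written for this example, not the code
in `status-reaper.py`: it treats every `failure` as compensable, whereas
the real reaper restricts itself to class-O candidates. The context
names and the callback signatures are likewise assumptions.

```python
from typing import Callable, Dict, List, Tuple

DEFAULT_SWEEP_LIMIT = 30  # value named in this commit; the rest is illustrative


def reap(shas: List[str],
         get_statuses: Callable[[str], Dict[str, str]],
         post_status: Callable[[str, str, str], None],
         limit: int = DEFAULT_SWEEP_LIMIT) -> Dict[str, int]:
    """Toy sweep: compensate failing contexts on the newest `limit` SHAs."""
    summary = {"compensated": 0, "preserved_non_failure": 0,
               "scanned_shas": 0, "limit": limit}
    for sha in shas[:limit]:
        summary["scanned_shas"] += 1
        for context, state in get_statuses(sha).items():
            if state == "failure":
                post_status(sha, context, "success")
                summary["compensated"] += 1
            else:
                summary["preserved_non_failure"] += 1
    return summary


def test_reap_widened_window_catches_retroactive_failure() -> None:
    shas = [f"sha{i:02d}" for i in range(30)]  # newest first
    statuses = {sha: {"CI / all-required": "success"} for sha in shas}
    statuses[shas[20]]["nightly / scheduled-job"] = "failure"  # hypothetical name

    posts: List[Tuple[str, str, str]] = []
    summary = reap(shas, statuses.__getitem__, lambda *a: posts.append(a))

    # The compensation POST lands on the SHA at depth 20.
    assert posts == [(shas[20], "nightly / scheduled-job", "success")]
    assert summary["compensated"] == 1

    # With rev2's window of 10 the same failure is never reached.
    posts.clear()
    narrow = reap(shas, statuses.__getitem__, lambda *a: posts.append(a), limit=10)
    assert narrow["compensated"] == 0 and posts == []


test_reap_widened_window_catches_retroactive_failure()
```

Planting the failure at depth 20 keeps the test sensitive to the limit:
it passes at 30 and would fail if the default regressed to 10.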

Verification plan (post-merge, 10-15 min after merge / 2-3 cron ticks):
  - DB: SELECT id, status FROM action_run WHERE workflow_id=
    'status-reaper.yml' ORDER BY id DESC LIMIT 5 -> all status=1
  - Log via web UI:
    /molecule-ai/molecule-core/actions/runs/<index>/jobs/0/logs ->
    summary line should now show compensated > 0 with
    compensated_per_sha populated
  - Direct probe: pick a SHA in the last 30 main commits with class-O
    fails, GET /repos/molecule-ai/molecule-core/commits/{sha}/status
    -> compensated contexts now show state=success with description
    starting 'Compensated by status-reaper'
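
The direct-probe check can be sketched as a small filter over the
already-fetched JSON. The payload shape is an assumption: a top-level
`statuses` list whose entries carry `context`, `description`, and the
state under either a `state` or a `status` key. Verify this against the
actual response before relying on it. The sample payload is invented.

```python
from typing import Any, Dict, List

PREFIX = "Compensated by status-reaper"


def compensated_contexts(combined: Dict[str, Any]) -> List[str]:
    """Return the context names the reaper appears to have compensated."""
    found = []
    for entry in combined.get("statuses") or []:
        # Accept either key name, since the exact field is assumed here.
        state = entry.get("state") or entry.get("status")
        description = entry.get("description") or ""
        if state == "success" and description.startswith(PREFIX):
            found.append(entry.get("context", ""))
    return found


# Invented sample payload for illustration only.
sample = {
    "statuses": [
        {"context": "CI / all-required", "status": "success",
         "description": "Successful in 3s"},
        {"context": "nightly / scheduled-job", "status": "success",
         "description": "Compensated by status-reaper (hypothetical detail)"},
        {"context": "sweep / other-job", "status": "failure",
         "description": "Failing after 17s"},
    ]
}
print(compensated_contexts(sample))
```

An empty result on a SHA known to carry class-O failures would point
the same way as `compensated:0` in the log.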

If rev3 STILL shows compensated:0 after the window-widening, the
diagnosis is wrong and a DIFFERENT bug needs to be uncovered (per
hongming-pc2 caveat 03:25Z). Re-enabling the crons IS the diagnosis
verification.
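
A minimal sketch for checking that condition against a downloaded log,
assuming the summary keeps the `status-reaper summary: {compensated:N,
...}` form quoted above. The rev3 line layout, including where
`compensated_per_sha` appears, is not shown in this message, so the
second sample line below is hypothetical.

```python
import re
from typing import Optional

# \b keeps `compensated_per_sha:` from matching as `compensated:`.
SUMMARY_RE = re.compile(r"status-reaper summary:.*?\bcompensated:\s*(\d+)")


def compensated_count(log_text: str) -> Optional[int]:
    """Return the compensated count from the first summary line, if any."""
    match = SUMMARY_RE.search(log_text)
    return int(match.group(1)) if match else None


rev2_line = ("status-reaper summary: {compensated:0, "
             "preserved_non_failure:185, scanned_shas:10, limit:10}")
# Hypothetical rev3 line; the field order is an assumption.
rev3_line = "status-reaper summary: {scanned_shas:30, compensated:4, limit:30}"

print(compensated_count(rev2_line), compensated_count(rev3_line))
```

A `None` result means no summary line was found at all, which is a
different failure from a count of zero and should be triaged separately.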

Cross-links:
  - PR#618 (rev1, drop-concurrency, merge 4db64bcb)
  - PR#633 (rev2, sweep-recent-commits, merge e7965a0f)
  - PR#645 (interim disable, merge 4c54b590); this commit reverts that disable
  - task #90 (orch rev3 tracker) / task #46 (hongming-pc2 tracker)
  - feedback_brief_hypothesis_vs_evidence (empirical evidence above)
  - feedback_strict_root_only_after_class_a (3-in-one root fix vs.
    longer patching chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 20:29:06 -07:00
audit-force-merge.yml
block-internal-paths.yml
cascade-list-drift-gate.yml
check-migration-collisions.yml
ci-required-drift.yml fix(ci): ci-required-drift handles 403/404 on protection endpoint gracefully 2026-05-12 03:13:37 +00:00
ci.yml feat(ci): add per-package diagnostic step to platform-build job 2026-05-12 00:07:45 +00:00
continuous-synth-e2e.yml
e2e-api.yml
e2e-staging-canvas.yml
e2e-staging-external.yml
e2e-staging-saas.yml fix(ci): restore pull_request trigger + pr-validate to e2e-staging-saas 2026-05-11 18:14:50 +00:00
e2e-staging-sanity.yml
gate-check-v3.yml fix(sre): add explicit 15s timeout to gate-check-v3 HTTP calls (closes #603) 2026-05-11 23:36:21 +00:00
handlers-postgres-integration.yml
harness-replays.yml fix(ci): strip JSON5 comments from manifest.json before clone-manifest.sh 2026-05-11 22:19:55 +00:00
lint-curl-status-capture.yml
main-red-watchdog.yml fix(ci): status-reaper rev3 widens window 10->30 + raises watchdog timeout + re-enables both crons 2026-05-11 20:29:06 -07:00
publish-canvas-image.yml revert(ci): restore ubuntu-latest runner for publish workflows 2026-05-12 00:02:03 +00:00
publish-runtime-autobump.yml fix(ci): publish-runtime-autobump bump-and-tag condition is always-skipped 2026-05-11 20:41:57 +00:00
publish-runtime.yml
publish-workspace-server-image.yml revert(ci): restore ubuntu-latest runner for publish workflows 2026-05-12 00:02:03 +00:00
qa-review.yml fix(ci)(security): stop token appearing in curl argv (#541) 2026-05-11 19:30:22 +00:00
railway-pin-audit.yml
redeploy-tenants-on-main.yml
redeploy-tenants-on-staging.yml
review-check-tests.yml fix(ci): add jq install to review-check-tests workflow + fix /tmp/jq hardcode 2026-05-12 01:24:24 +00:00
runtime-pin-compat.yml
runtime-prbuild-compat.yml
secret-pattern-drift.yml
secret-scan.yml
security-review.yml fix(ci)(security): stop token appearing in curl argv (#541) 2026-05-11 19:30:22 +00:00
sop-tier-check.yml
sop-tier-refire.yml
staging-smoke.yml
staging-verify.yml
status-reaper.yml fix(ci): status-reaper rev3 widens window 10->30 + raises watchdog timeout + re-enables both crons 2026-05-11 20:29:06 -07:00
sweep-aws-secrets.yml fix(ci): reconcile sweep workflow secrets — use confirmed-existing names (#482) 2026-05-11 14:07:53 +00:00
sweep-cf-orphans.yml fix(ci): reconcile sweep workflow secrets — use confirmed-existing names (#482) 2026-05-11 14:07:53 +00:00
sweep-cf-tunnels.yml fix(ci): reconcile sweep workflow secrets — use confirmed-existing names (#482) 2026-05-11 14:07:53 +00:00
sweep-stale-e2e-orgs.yml fix(ci): sweep-stale-e2e-orgs reference + drop continue-on-error (closes EC2 leak) (#461) 2026-05-11 12:05:36 +00:00
test-ops-scripts.yml
weekly-platform-go.yml feat(ci): add weekly Platform-Go latent-error surface workflow 2026-05-11 23:49:59 +00:00