f9214391fb
3 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
fae62ac8c1 |
fix(ci): status-reaper rev3 widens window 10->30 + raises watchdog timeout + re-enables both crons
Phase 1+2 evidence (rev2 PR#633, merged 01:48Z): 6/6 ticks post-merge
with `compensated:0` despite ~25 known-stranded reds visible across
those same 10 SHAs on direct probe ~30min later. Reaper run 17057 at
02:46Z explicitly logged:
scanned 42 workflows; push-triggered=19, class-O candidates=23
status-reaper summary: {compensated:0, preserved_non_failure:185,
scanned_shas:10, limit:10}
Root cause: schedule workflows post `failure` to commit-status
RETROACTIVELY 5-15 min after their merge. By the time reaper's next
*/5 tick lands, the stranded red is on a SHA that has already fallen
OUTSIDE a 10-commit window during a burst-merge period. Reaper
algorithm is correct; the lookback window is too narrow vs. the
retroactive-failure-post lag.
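The window-versus-lag argument can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the merge rate are hypothetical and do not come from the reaper script; only the 5-15 min lag, the `*/5` tick and the window sizes 10 and 30 are taken from the message above.

```python
import math


def depth_when_visible(merges_per_min: float, lag_min: float, tick_min: float = 5.0) -> int:
    """Worst-case count of newer commits on main by the time the reaper can see the red.

    The failure posts `lag_min` after the merge, and the reaper may then wait up to
    one full tick (`tick_min`) before its next run.
    """
    return math.ceil(merges_per_min * (lag_min + tick_min))


def in_window(depth: int, limit: int) -> bool:
    # Depth 0 is HEAD, so a window of `limit` commits covers depths 0..limit-1.
    return depth < limit


# Hypothetical burst: 0.75 merges per minute, failure posted 15 min after merge.
depth = depth_when_visible(merges_per_min=0.75, lag_min=15)
print(depth, in_window(depth, 10), in_window(depth, 30))
```

Under that assumed burst rate the stranded SHA sits at depth 15, which is outside a 10-commit window and inside a 30-commit one.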
Three-in-one fix plus tests (atomic per hongming-pc2 GO 03:25Z):
1. `.gitea/scripts/status-reaper.py`
DEFAULT_SWEEP_LIMIT 10 -> 30. Widening the window is cheap, whereas
raising the cadence is load-heavy, so the `*/5` cron stays unchanged
(avoiding `*/2`, which would double runner load).
2. `.gitea/workflows/status-reaper.yml`
Restore schedule cron block (revert mc#645 comment-out for THIS
workflow only). Cron stays `*/5 * * * *`.
3. `.gitea/workflows/main-red-watchdog.yml`
Restore schedule cron block (revert mc#645 comment-out) AND raise
job-level `timeout-minutes: 5 -> 15`. Original 5min cap was
producing cancels under runner-saturation latency, which fed the
very `[main-red]` issues this workflow files (self-poisoning).
4. `tests/test_status_reaper.py`
+ test_default_sweep_limit_is_30 (contract pin)
+ test_reap_widened_window_catches_retroactive_failure: mocks 30
SHAs, plants the failing context on SHA[20] (depth strictly past
rev2's window=10), asserts the compensation POST lands on that
SHA. Existing tests retain explicit `limit=10` overrides and
remain unchanged. Suite: 42/42 passed (was 40 + 2 new).
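A minimal sketch of the widened sweep and of what the new regression test exercises. This is not the contents of `status-reaper.py`: `sweep`, `get_failed_contexts` and `compensate` are hypothetical names, and the summary keys are borrowed from the log line and verification plan in this message.

```python
from typing import Callable, Dict, List

DEFAULT_SWEEP_LIMIT = 30  # value stated in the commit message (was 10 in rev2)


def sweep(
    shas: List[str],
    get_failed_contexts: Callable[[str], List[str]],
    compensate: Callable[[str, str], None],
    limit: int = DEFAULT_SWEEP_LIMIT,
) -> Dict[str, object]:
    """Visit the newest `limit` SHAs and compensate every failing context found."""
    per_sha: Dict[str, int] = {}
    summary: Dict[str, object] = {
        "compensated": 0,
        "scanned_shas": 0,
        "limit": limit,
        "compensated_per_sha": per_sha,
    }
    for sha in shas[:limit]:  # shas[0] is assumed to be main HEAD
        summary["scanned_shas"] += 1
        for context in get_failed_contexts(sha):
            compensate(sha, context)
            summary["compensated"] += 1
            per_sha[sha] = per_sha.get(sha, 0) + 1
    return summary


# Mirror of the test scenario: 30 SHAs, one stranded red planted at index 20.
shas = [f"sha{i:02d}" for i in range(30)]
failing = {shas[20]: ["sweep-cf-tunnels / sweep (push)"]}  # job name is made up
posted = []
result = sweep(shas, lambda s: failing.get(s, []), lambda s, c: posted.append((s, c)))
print(result["compensated"], posted)
```

Passing `limit=10` to the same call leaves `posted` empty, which is the rev2 behaviour the new test guards against.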
Verification plan (post-merge, 10-15 min after merge / 2-3 cron ticks):
- DB: SELECT id, status FROM action_run WHERE workflow_id=
'status-reaper.yml' ORDER BY id DESC LIMIT 5 -> all status=1
- Log via web UI:
/molecule-ai/molecule-core/actions/runs/<index>/jobs/0/logs ->
summary line should now show compensated > 0 with
compensated_per_sha populated
- Direct probe: pick a SHA in the last 30 main commits with class-O
fails, GET /repos/molecule-ai/molecule-core/commits/{sha}/status
-> compensated contexts now show state=success with description
starting 'Compensated by status-reaper'
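The direct probe can be scripted once the combined-status JSON has been fetched. The helper below is a hypothetical sketch that only filters an already-decoded payload; the exact field carrying the per-context state is an assumption, so both `status` and `state` are checked.

```python
from typing import Any, Dict, List

COMPENSATION_PREFIX = "Compensated by status-reaper"


def compensated_contexts(combined: Dict[str, Any]) -> List[str]:
    """Return the contexts that the reaper appears to have flipped to success."""
    found = []
    for entry in combined.get("statuses") or []:
        # Assumption: the per-context state lives under "status" or "state".
        state = entry.get("status") or entry.get("state")
        description = entry.get("description") or ""
        if state == "success" and description.startswith(COMPENSATION_PREFIX):
            found.append(entry.get("context"))
    return found


# Made-up payload for illustration; real responses carry more fields.
sample = {
    "statuses": [
        {"context": "sweep-cf-tunnels / sweep (push)", "status": "success",
         "description": "Compensated by status-reaper: schedule-triggered run"},
        {"context": "CI / all-required (pull_request)", "status": "success",
         "description": "Successful in 3s"},
    ]
}
print(compensated_contexts(sample))
```

An empty list for a SHA known to carry class-O failures would match the `compensated:0` symptom described above.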
If rev3 STILL shows compensated:0 after the window-widening, the
diagnosis is wrong and a DIFFERENT bug needs to be uncovered (per
hongming-pc2 caveat 03:25Z). Re-enabling the crons IS the diagnosis
verification.
Cross-links:
- PR#618 (rev1, drop-concurrency, merge
|
||
| 98323734ea |
feat(ci): status-reaper rev2 sweeps last 10 main commits (closes stranded-status gap)
rev1 (PR #618, merged |
|||
| afaf0a1e54 |
feat(ci): status-reaper compensates Gitea hardcoded-(push)-suffix on schedule-triggered operational workflow failures
Root cause (verified via runs 14525 + 14526):
Gitea 1.22.6 emits commit-status context as
<workflow_name> / <job_name> (push)
for ANY workflow run on the default-branch HEAD, REGARDLESS of the
trigger event. Schedule- and workflow_dispatch-triggered runs
therefore paint main red via a fake-push status. No upstream fix
in 1.23-1.26.1 (sibling a6f20db1 research; internal#80 RFC).
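The context shape described above can be taken apart as follows. This is a sketch under the assumption that the workflow name never contains '/', which the design below enforces; `parse_push_context` is a hypothetical name and the job names in the example are made up.

```python
from typing import Optional, Tuple

PUSH_SUFFIX = " (push)"


def parse_push_context(context: str) -> Optional[Tuple[str, str]]:
    """Split '<workflow_name> / <job_name> (push)' into (workflow, job).

    Returns None for any context that does not end in ' (push)' or lacks the
    ' / ' separator, so '(pull_request)' contexts are never matched.
    """
    if not context.endswith(PUSH_SUFFIX):
        return None
    body = context[: -len(PUSH_SUFFIX)]
    # Split on the first ' / ' only: a job name may contain one, a workflow name may not.
    workflow, sep, job = body.partition(" / ")
    if not sep or not workflow or not job:
        return None
    return workflow, job


print(parse_push_context("sweep-cf-tunnels / sweep (push)"))
print(parse_push_context("Secret scan / Scan diff for credential-shaped strings (pull_request)"))
```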
Design — Option B (b2 cron-based compensating-status POST):
workflow_run is NOT supported on Gitea 1.22.6 (verified via
modules/actions/workflows.go enumeration); cron is the only
event-shaped option that fires reliably.
Every 5min, .gitea/workflows/status-reaper.yml runs a stdlib +
PyYAML scanner that:
1. Walks .gitea/workflows/*.yml. Resolves each workflow_id from
top-level 'name:' (else filename stem). Fails LOUD on
name-collision OR '/' in name (would break ' / ' context
parsing downstream). Classifies each by 'push:' trigger
presence (str / list / dict on: shapes all handled).
2. Reads main HEAD's combined commit status.
3. For each failure-state context ending ' (push)':
- parses '<workflow_name> / <job_name> (push)';
- skips if workflow not in scan map (conservative);
- preserves if workflow has push: trigger (real defect);
- else POSTs state=success with the same context to
/repos/{o}/{r}/statuses/{sha}, with a description that
documents the workaround.
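The classification and the per-context decision in the steps above can be sketched as below. The helper names are hypothetical and operate on an already-parsed workflow document rather than on files, so the sketch does not depend on PyYAML.

```python
from typing import Any, Dict


def extract_on(doc: Dict[Any, Any]) -> Any:
    # PyYAML follows YAML 1.1 and loads a bare `on:` key as the boolean True,
    # so both spellings of the key are looked up.
    return doc.get("on", doc.get(True))


def has_push_trigger(on_value: Any) -> bool:
    """Handle the three `on:` shapes: 'push', [push, ...] and {push: ...}."""
    if isinstance(on_value, str):
        return on_value == "push"
    if isinstance(on_value, (list, dict)):
        return "push" in on_value
    return False


def decide(workflow: str, push_triggered: Dict[str, bool]) -> str:
    """Return 'skip', 'preserve' or 'compensate' for a failing ' (push)' context."""
    if workflow not in push_triggered:
        return "skip"  # unknown workflow: stay conservative
    return "preserve" if push_triggered[workflow] else "compensate"


scan_map = {
    "publish-workspace-server-image": has_push_trigger({"push": {"branches": ["main"]}}),
    "sweep-cf-tunnels": has_push_trigger({"schedule": [{"cron": "*/5 * * * *"}]}),
}
for name in ("publish-workspace-server-image", "sweep-cf-tunnels", "unknown-workflow"):
    print(name, decide(name, scan_map))
```

The trigger bodies used for the two workflows are invented for the example; only their push versus non-push classification comes from the message.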
Safety:
- Only failure-state contexts whose suffix is ' (push)' are
compensated. Branch_protections required checks on main (Secret
scan, sop-tier-check) have ' (pull_request)' suffix — UNREACHABLE
from this code path. Verified 2026-05-11 + test
test_reap_required_check_pull_request_suffix_never_touched.
- publish-workspace-server-image has a real push: trigger →
PRESERVED. mc#576's docker-socket failure stays visible as
intended. Explicit test fixture.
- api() raises ApiError on non-2xx + JSON-decode failure per
feedback_api_helper_must_raise_not_return_dict. Pre-fix
'soft-fail' would silently paint main green via omission.
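The raise-instead-of-return contract for the API helper might look like the sketch below. It covers only the response-handling half and is not the script's actual `api()`; `decode_response` is a hypothetical name, and treating an empty body as an error is a choice made for this illustration.

```python
import json
from typing import Any


class ApiError(RuntimeError):
    """Raised for any response the caller must not silently treat as data."""


def decode_response(status: int, body: bytes) -> Any:
    """Return the decoded JSON body, or raise ApiError on non-2xx or bad JSON."""
    if not 200 <= status < 300:
        raise ApiError(f"unexpected HTTP status {status}")
    try:
        return json.loads(body)
    except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
        raise ApiError("response body is not valid JSON") from exc


try:
    decode_response(502, b"<html>Bad Gateway</html>")
except ApiError as err:
    print("raised:", err)
```

Raising here means a failed status read aborts the run, rather than yielding an empty result that would look like "nothing to compensate".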
Persona:
claude-status-reaper (Gitea uid 94, write:repository) — provisioned
2026-05-11 21:39Z by sub-agent aefaac1b. Token under
secrets.STATUS_REAPER_TOKEN (no other write surface touched).
Acceptance (post-merge verify, Step-5):
Trigger one class-O workflow via workflow_dispatch (e.g.
sweep-cf-tunnels). Observe reaper compensate the resulting
(push)-suffix failure on the next 5-min tick. Real
push-triggered failures (publish-workspace-server-image) MUST
still turn main red.
Removal path:
Drop this workflow + script + tests when Gitea is upgraded to
>= 1.24 with a fix for the hardcoded-suffix bug, OR when an
upstream patch lands (internal#80 RFC). Tracked in
post-merge audit issue.
Cross-links:
- sibling internal#327 (publish-runtime-bot)
- sibling internal#328 (mc-drift-bot)
- sibling internal#329 (Gitea dispatcher race)
- sibling internal#330 (disk-GC cron Gitea-class bug)
- upstream internal#80 (Gitea hardcoded-suffix RFC)
- mc#576 (preserved by design — real push-trigger failure)
- sub-agent aefaac1b (provisioning sibling)
- sub-agent a6f20db1 (Option A research — no upstream fix)
Tests: 37 pytest cases pass (incl. hongming-pc 22:08Z review's 3
design checks: name-collision fail-loud, '/' in name lint, name vs
filename fallback).
|