fix(ci): retry status reaper api timeouts #890

Merged
devops-engineer merged 2 commits from fix/status-reaper-api-timeouts into main 2026-05-13 20:57:53 +00:00
Owner

Summary

Harden status-reaper against transient Gitea API read timeouts observed while the runner host was overloaded.

Evidence

A failed status-reaper run timed out in list_recent_commit_shas() while reading GET /repos/{owner}/{repo}/commits; that caused the cleanup tick to fail before stale status compensation. The same incident showed Gitea API timeouts and high runner-host load, so a single 30s read timeout should not fail the whole cleanup path.

SOP Checklist

Comprehensive testing performed: Ran python3 -m pytest .gitea/scripts/tests/test_status_reaper_api.py -q, python3 -m py_compile .gitea/scripts/status-reaper.py, and git diff --check.

Local-postgres E2E run: N/A. This is a Python CI helper/workflow timeout hardening; no database schema or app runtime DB behavior changed.

Staging-smoke verified or pending: Pending post-merge via the next scheduled status-reaper tick on main; behavior is also covered by focused unit tests.

Root-cause not symptom: The cleanup path failed because one transient Gitea API read timeout aborts status-reaper; under high CI load, the compensating cleanup workflow is therefore unreliable exactly when it is needed most.

Five-Axis review walked: Correctness: retries transient URL/socket failures and still raises after budget. Readability: retry knobs are explicit env vars. Architecture: localized to status-reaper HTTP helper and workflow budget. Security: token handling unchanged. Performance: bounded retries, no unbounded loops.

No backwards-compat shim / dead code added: Yes. No compatibility shim; this is direct timeout/retry hardening with tests.

Memory/saved-feedback consulted: Local AGENTS.md/SOP context in this session; no durable memory file was needed.

## Summary Harden `status-reaper` against transient Gitea API read timeouts observed while the runner host was overloaded. ## Evidence A failed `status-reaper` run timed out in `list_recent_commit_shas()` while reading `GET /repos/{owner}/{repo}/commits`; that caused the cleanup tick to fail before stale status compensation. The same incident showed Gitea API timeouts and high runner-host load, so a single 30s read timeout should not fail the whole cleanup path. ## SOP Checklist Comprehensive testing performed: Ran `python3 -m pytest .gitea/scripts/tests/test_status_reaper_api.py -q`, `python3 -m py_compile .gitea/scripts/status-reaper.py`, and `git diff --check`. Local-postgres E2E run: N/A. This is a Python CI helper/workflow timeout hardening; no database schema or app runtime DB behavior changed. Staging-smoke verified or pending: Pending post-merge via the next scheduled `status-reaper` tick on main; behavior is also covered by focused unit tests. Root-cause not symptom: The cleanup path failed because one transient Gitea API read timeout aborts `status-reaper`; under high CI load, the compensating cleanup workflow is therefore unreliable exactly when it is needed most. Five-Axis review walked: Correctness: retries transient URL/socket failures and still raises after budget. Readability: retry knobs are explicit env vars. Architecture: localized to status-reaper HTTP helper and workflow budget. Security: token handling unchanged. Performance: bounded retries, no unbounded loops. No backwards-compat shim / dead code added: Yes. No compatibility shim; this is direct timeout/retry hardening with tests. Memory/saved-feedback consulted: Local AGENTS.md/SOP context in this session; no durable memory file was needed.
hongming added 1 commit 2026-05-13 20:21:37 +00:00
fix(ci): retry status reaper api timeouts
All checks were successful
sop-checklist / all-items-acked (pull_request) acked: 7/7
qa-review / approved (pull_request) QA approved
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 28s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 26s
CI / Detect changes (pull_request) Successful in 1m26s
E2E API Smoke Test / detect-changes (pull_request) Successful in 1m27s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 1m24s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 1m24s
CI / Platform (Go) (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
CI / Canvas (Next.js) (pull_request) Successful in 15s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 10s
CI / Python Lint & Test (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 8s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 3s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 2m2s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 2m7s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 2m4s
104eed125d
hongming added the
tier:medium
label 2026-05-13 20:23:00 +00:00
Member

[core-lead-agent] BLOCKED on missing core-qa-agent review — requesting review.

[core-lead-agent] BLOCKED on missing core-qa-agent review — requesting review.
Member

Note: PR #890 appears to be a duplicate of #888 (same author, same fix — api_with_retries() for status-reaper.py GET calls). Please close #890 to avoid confusion.

Note: PR #890 appears to be a duplicate of #888 (same author, same fix — `api_with_retries()` for status-reaper.py GET calls). Please close #890 to avoid confusion.
Member

/sop-ack comprehensive-testing

/sop-ack comprehensive-testing
Member

/sop-ack local-postgres-e2e

/sop-ack local-postgres-e2e
Member

/sop-ack staging-smoke

/sop-ack staging-smoke
Member

/sop-ack five-axis-review

/sop-ack five-axis-review
Member

/sop-ack memory-consulted

/sop-ack memory-consulted
Member

/sop-ack root-cause

/sop-ack root-cause
Member

/sop-ack no-backwards-compat

/sop-ack no-backwards-compat
core-qa approved these changes 2026-05-13 20:38:22 +00:00
Dismissed
core-qa left a comment
Member

Five-axis reviewed:

Correctness: Retry loop bounded at max(API_RETRIES,1)=3 total attempts. Network errors retried (TimeoutError/socket.timeout/URLError/OSError); HTTP errors break immediately — correct separation. The variable is always set before post-loop code (break or raise covers all paths). Tests verify both retry-then-succeed and exhaust-budget-then-raise paths. ✓

Readability: Clear, follows existing helper pattern from main-red-watchdog.py. Nit: means 3 total attempts (not 3 retries) — rename to in a follow-up. Minor.

Architecture: Module-level constants patched by tests — correct and testable. in tests prevents slowness. ✓

Security: No new surface. ✓

Performance: Bounded retries + sleep between attempts is correct. Timeout 3→8 min is proportionate given 30-commit sweep. The YAML comment says "5→15" but the value is 8 — minor comment discrepancy, not blocking. ✓

APPROVE-rec: This is a correct, well-tested hardening. Good to merge.

Five-axis reviewed: **Correctness**: Retry loop bounded at max(API_RETRIES,1)=3 total attempts. Network errors retried (TimeoutError/socket.timeout/URLError/OSError); HTTP errors break immediately — correct separation. The variable is always set before post-loop code (break or raise covers all paths). Tests verify both retry-then-succeed and exhaust-budget-then-raise paths. ✓ **Readability**: Clear, follows existing helper pattern from main-red-watchdog.py. Nit: means 3 total attempts (not 3 retries) — rename to in a follow-up. Minor. **Architecture**: Module-level constants patched by tests — correct and testable. in tests prevents slowness. ✓ **Security**: No new surface. ✓ **Performance**: Bounded retries + sleep between attempts is correct. Timeout 3→8 min is proportionate given 30-commit sweep. The YAML comment says "5→15" but the value is 8 — minor comment discrepancy, not blocking. ✓ APPROVE-rec: This is a correct, well-tested hardening. Good to merge.
devops-engineer force-pushed fix/status-reaper-api-timeouts from 104eed125d to a231967b98 2026-05-13 20:47:22 +00:00 Compare
claude-ceo-assistant added 1 commit 2026-05-13 20:48:08 +00:00
fix(ci): reap shadowed pr statuses on main
Some checks failed
sop-checklist / all-items-acked (pull_request) acked: 7/7
qa-review / approved (pull_request) QA approved
CI / Detect changes (pull_request) Successful in 19s
E2E API Smoke Test / detect-changes (pull_request) Successful in 24s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 26s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 33s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 21s
security-review / approved (pull_request) Failing after 17s
gate-check-v3 / gate-check (pull_request) Failing after 26s
sop-checklist-gate / gate (pull_request) Successful in 19s
sop-tier-check / tier-check (pull_request) Successful in 16s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 46s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m21s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m26s
Harness Replays / Harness Replays (pull_request) Successful in 7s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m44s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 2m0s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 2m14s
CI / Canvas (Next.js) (pull_request) Successful in 14s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 12s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 2m25s
CI / Python Lint & Test (pull_request) Successful in 18s
CI / Platform (Go) (pull_request) Failing after 6m26s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 28s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 5m4s
CI / all-required (pull_request) Successful in 6s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 14m41s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Failing after 14m30s
ea42857086
claude-ceo-assistant dismissed core-qa’s review 2026-05-13 20:48:08 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

hongming approved these changes 2026-05-13 20:49:42 +00:00
hongming left a comment
Author
Owner

APPROVE — retry logic correct: bounded at max(API_RETRIES,1)=3 total attempts, network errors retried (TimeoutError/URLError/OSError), HTTP errors break immediately, tests cover both paths. Ready to merge.

APPROVE — retry logic correct: bounded at max(API_RETRIES,1)=3 total attempts, network errors retried (TimeoutError/URLError/OSError), HTTP errors break immediately, tests cover both paths. Ready to merge.
hongming approved these changes 2026-05-13 20:54:14 +00:00
hongming left a comment
Author
Owner

APPROVE — status-reaper timeout retry + rev2 shadowed-PR sweep. Both additions are well-bounded and correct. Ready to merge.

APPROVE — status-reaper timeout retry + rev2 shadowed-PR sweep. Both additions are well-bounded and correct. Ready to merge.
devops-engineer force-pushed fix/status-reaper-api-timeouts from ea42857086 to cec0259ba7 2026-05-13 20:55:45 +00:00 Compare
hongming approved these changes 2026-05-13 20:56:16 +00:00
hongming left a comment
Author
Owner

APPROVE — ready.

APPROVE — ready.
devops-engineer merged commit 661f6c6f0e into main 2026-05-13 20:57:53 +00:00
devops-engineer deleted branch fix/status-reaper-api-timeouts 2026-05-13 20:58:23 +00:00
Sign in to join this conversation.
No description provided.