fix(ci): retry status reaper api timeouts #890

Merged
devops-engineer merged 2 commits from fix/status-reaper-api-timeouts into main 2026-05-13 20:57:53 +00:00
Owner

Summary

Harden status-reaper against transient Gitea API read timeouts observed while the runner host was overloaded.

Evidence

A failed status-reaper run timed out in list_recent_commit_shas() while reading GET /repos/{owner}/{repo}/commits; that caused the cleanup tick to fail before stale status compensation. The same incident showed Gitea API timeouts and high runner-host load, so a single 30s read timeout should not fail the whole cleanup path.

SOP Checklist

Comprehensive testing performed: Ran python3 -m pytest .gitea/scripts/tests/test_status_reaper_api.py -q, python3 -m py_compile .gitea/scripts/status-reaper.py, and git diff --check.

Local-postgres E2E run: N/A. This is a Python CI helper/workflow timeout hardening; no database schema or app runtime DB behavior changed.

Staging-smoke verified or pending: Pending post-merge via the next scheduled status-reaper tick on main; behavior is also covered by focused unit tests.

Root-cause not symptom: The cleanup path failed because one transient Gitea API read timeout aborts status-reaper; under high CI load, the compensating cleanup workflow is therefore unreliable exactly when it is needed most.

Five-Axis review walked: Correctness: retries transient URL/socket failures and still raises after budget. Readability: retry knobs are explicit env vars. Architecture: localized to status-reaper HTTP helper and workflow budget. Security: token handling unchanged. Performance: bounded retries, no unbounded loops.

No backwards-compat shim / dead code added: Yes. No compatibility shim; this is direct timeout/retry hardening with tests.

Memory/saved-feedback consulted: Local AGENTS.md/SOP context in this session; no durable memory file was needed.
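
For readers who have not opened the diff, the retry behaviour described in the checklist could look roughly like the sketch below. It is not the code from this PR: the function name `fetch_with_retries`, the env var names `REAPER_API_RETRIES`, `REAPER_API_RETRY_SLEEP`, and `REAPER_API_TIMEOUT`, and their defaults are assumptions made for the example. Only the overall shape (bounded attempts, transient network errors retried, HTTP errors not retried, still raising once the budget is spent) is taken from the description and reviews.

```python
import os
import socket
import time
import urllib.error
import urllib.request

# Hypothetical knob names and defaults; the PR's actual env vars may differ.
API_RETRIES = int(os.environ.get("REAPER_API_RETRIES", "3"))
API_RETRY_SLEEP = float(os.environ.get("REAPER_API_RETRY_SLEEP", "2"))
API_TIMEOUT = float(os.environ.get("REAPER_API_TIMEOUT", "30"))


def fetch_with_retries(url, opener=urllib.request.urlopen, sleep=time.sleep):
    """GET url, retrying transient network failures up to a fixed budget."""
    attempts = max(API_RETRIES, 1)  # always try at least once
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=API_TIMEOUT) as resp:
                return resp.read()
        except urllib.error.HTTPError:
            # The server answered, so this is not a transient network fault.
            # HTTPError subclasses URLError, hence it is caught first.
            raise
        except (TimeoutError, socket.timeout, urllib.error.URLError, OSError):
            if attempt == attempts:
                raise  # budget exhausted: surface the last error
            sleep(API_RETRY_SLEEP)
```

Passing `opener` and `sleep` as parameters is a choice made for this sketch so that it can be exercised without a network or real delays; the PR itself may structure this differently.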

hongming added the tier:medium label 2026-05-13 20:23:00 +00:00
Member

[core-lead-agent] BLOCKED on missing core-qa-agent review — requesting review.

Member

Note: PR #890 appears to be a duplicate of #888 (same author, same fix — api_with_retries() for status-reaper.py GET calls). Please close #890 to avoid confusion.

Member

/sop-ack comprehensive-testing

Member

/sop-ack local-postgres-e2e

Member

/sop-ack staging-smoke

Member

/sop-ack five-axis-review

Member

/sop-ack memory-consulted

Member

/sop-ack root-cause

Member

/sop-ack no-backwards-compat

core-qa approved these changes 2026-05-13 20:38:22 +00:00
Dismissed
core-qa left a comment
Member

Five-axis reviewed:

Correctness: Retry loop bounded at max(API_RETRIES,1)=3 total attempts. Network errors retried (TimeoutError/socket.timeout/URLError/OSError); HTTP errors break immediately — correct separation. The variable is always set before post-loop code (break or raise covers all paths). Tests verify both retry-then-succeed and exhaust-budget-then-raise paths. ✓

Readability: Clear, follows existing helper pattern from main-red-watchdog.py. Nit: means 3 total attempts (not 3 retries) — rename to in a follow-up. Minor.

Architecture: Module-level constants patched by tests — correct and testable. in tests prevents slowness. ✓

Security: No new surface. ✓

Performance: Bounded retries + sleep between attempts is correct. Timeout 3→8 min is proportionate given 30-commit sweep. The YAML comment says "5→15" but the value is 8 — minor comment discrepancy, not blocking. ✓

APPROVE-rec: This is a correct, well-tested hardening. Good to merge.
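
As a rough illustration of the two test paths mentioned under Correctness (retry-then-succeed and exhaust-budget-then-raise), a test along the following lines would cover them. Everything here is a stand-in: `reaper`, `_api`, and the constant values are hypothetical, and patching `time.sleep` is an assumption about how the real tests avoid slowness, since the review text does not name what is patched.

```python
import time
import types
import unittest
from unittest import mock

# Stand-in for the script under test; names and values are hypothetical.
reaper = types.SimpleNamespace(API_RETRIES=3, API_RETRY_SLEEP=2.0)


def _api(call):
    """Run call(), retrying OSError-family failures within the budget."""
    attempts = max(reaper.API_RETRIES, 1)
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except OSError:  # TimeoutError and URLError are OSError subclasses
            if attempt == attempts:
                raise
            time.sleep(reaper.API_RETRY_SLEEP)


reaper.api = _api


class RetryTests(unittest.TestCase):
    def test_retry_then_succeed(self):
        call = mock.Mock(side_effect=[TimeoutError("slow"), "ok"])
        # Replace time.sleep so the test does not actually wait.
        with mock.patch.object(time, "sleep") as fake_sleep:
            self.assertEqual(reaper.api(call), "ok")
        self.assertEqual(call.call_count, 2)
        fake_sleep.assert_called_once_with(2.0)

    def test_exhaust_budget_then_raise(self):
        call = mock.Mock(side_effect=TimeoutError("slow"))
        # Patch the module-level constant to shrink the budget for this test.
        with mock.patch.object(reaper, "API_RETRIES", 2), \
                mock.patch.object(time, "sleep"):
            with self.assertRaises(TimeoutError):
                reaper.api(call)
        self.assertEqual(call.call_count, 2)
```

Reading the constants at call time rather than binding them as default arguments is what lets `mock.patch.object` change the budget per test in this sketch.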

devops-engineer force-pushed fix/status-reaper-api-timeouts from 104eed125d to a231967b98 2026-05-13 20:47:22 +00:00 Compare
claude-ceo-assistant dismissed core-qa's review 2026-05-13 20:48:08 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

hongming approved these changes 2026-05-13 20:49:42 +00:00
hongming left a comment
Author
Owner

APPROVE — retry logic correct: bounded at max(API_RETRIES,1)=3 total attempts, network errors retried (TimeoutError/URLError/OSError), HTTP errors break immediately, tests cover both paths. Ready to merge.

hongming approved these changes 2026-05-13 20:54:14 +00:00
hongming left a comment
Author
Owner

APPROVE — status-reaper timeout retry + rev2 shadowed-PR sweep. Both additions are well-bounded and correct. Ready to merge.

devops-engineer force-pushed fix/status-reaper-api-timeouts from ea42857086 to cec0259ba7 2026-05-13 20:55:45 +00:00 Compare
hongming approved these changes 2026-05-13 20:56:16 +00:00
hongming left a comment
Author
Owner

APPROVE — ready.

devops-engineer merged commit 661f6c6f0e into main 2026-05-13 20:57:53 +00:00
devops-engineer deleted branch fix/status-reaper-api-timeouts 2026-05-13 20:58:23 +00:00

Reference: molecule-ai/molecule-core#890