fix(runtime#52): bounded retry/backoff on PR POST in propagate_runtime_version #168

Merged
hongming merged 1 commits from fix/52-propagate-pr-post-retry-backoff into main 2026-06-23 08:21:24 +00:00
Member

What

Closes the medium-severity finding from the runtime#52 audit (2026-05-24): "Cascade fan-out has no retry/backoff around network operations". Adds bounded retry + exponential backoff to the PR POST in scripts/propagate_runtime_version.py:open_bump_pr() so a single transient Gitea/network blip doesn't orphan a bump branch with no PR.

Why this surface specifically

A transient blip on the PR POST is uniquely bad because the branch + file writes have ALREADY succeeded upstream (the pre-POST steps). Without retry, a single 503/504/network error leaves:

  • a bump branch on the consumer repo (bump/runtime-X.Y.Z)
  • the .runtime-version (and requirements.txt) already updated on that branch
  • no PR openedconsumer-drift will then complain, and a human must hand-open

The pre-POST operations (default-branch read, contents read, contents PUT) are cheaper to fail loudly (the script can re-run from the top, and consumer-drift re-flags missed consumers), so they keep their single-shot contract.

What this commit changes

  • scripts/propagate_runtime_version.py

    • New _http_with_retry() helper (mirrors _http): retries 5xx + connection errors / timeouts with exponential backoff (1s, 2s, 4s by default — up to 3 retries = 4 total HTTP calls). 4xx returns immediately (client errors don't fix themselves). Final 5xx after exhaustion is returned (not raised) so the caller's existing error path sees a normal (status, body) tuple; final URLError is re-raised.
    • _RETRIABLE_5XX = {500, 502, 503, 504} — the standard transient set.
    • open_bump_pr() switches the PR POST from _http to _http_with_retry. Pre-POST operations stay on _http (single-shot) so a real client-side error surfaces immediately.
    • The sleep parameter is resolved at call time (not def time) so tests can patch time.sleep and have the helper pick the patch up.
  • tests/test_propagate_runtime_version.py — 7 new tests:

    1. test_http_with_retry_succeeds_on_first_5xx_then_201 — retry succeeds
    2. test_http_with_retry_does_not_retry_on_4xx — 4xx returns immediately
    3. test_http_with_retry_exhausts_and_returns_5xx_after_max_retries — final 5xx returned (not raised)
    4. test_http_with_retry_raises_on_persistent_connection_error — final URLError re-raised
    5. test_http_with_retry_exponential_backoff_sequence — backoff is [1, 2, 4, 8, 16]
    6. test_http_with_retry_returns_immediately_on_first_success — no sleep on happy path
    7. test_open_bump_pr_uses_http_with_retry_for_pr_post — integration: PR POST goes through retry helper; pre-POST ops stay single-shot

Test counts

File Before After
tests/test_propagate_runtime_version.py 10 17 (+7)
tests/test_consumer_runtime_drift_guard.py 12 12
tests/test_platform_comm_contract_guard.py 6 6
tests/test_workflow_no_token_in_url.py 4 4 (gate from PR#167)
tests/test_llm_auth.py 35 35 (unrelated sanity)
Total 67 74

All 74 pass on branch HEAD.

Scope

Single-purpose: closes the medium-severity PR-POST portion of runtime#52. Other audit findings (the get_machine_ip 8.8.8.8 fallback, low severity) are intentionally out of scope per the single-finding scope discipline.

The audit's original "clone / push / PR POST" was a git-clone-era finding; the cascade was since replaced with the Gitea contents+pulls API (commits 28cbf9b, 7154e15). The clone and push portions of the audit are therefore moot — the only active network step the PR POST line item covers is the one this PR wraps.

Independence from the red #3164 deployment surface

Pure scripts + tests. No concierge / MCP / heartbeat / identity-gate / operator-deployment touched. Safe to merge on the runtime-lane.

Gate

  • 2-genuine (CR2 + Researcher) — routing needed
  • CI green — unit-tests job
  • target = main
## What Closes the **medium-severity finding** from the runtime#52 audit (2026-05-24): "Cascade fan-out has no retry/backoff around network operations". Adds bounded retry + exponential backoff to the **PR POST** in `scripts/propagate_runtime_version.py:open_bump_pr()` so a single transient Gitea/network blip doesn't orphan a bump branch with no PR. ## Why this surface specifically A transient blip on the **PR POST** is uniquely bad because the branch + file writes have ALREADY succeeded upstream (the pre-POST steps). Without retry, a single 503/504/network error leaves: - a bump branch on the consumer repo (`bump/runtime-X.Y.Z`) - the `.runtime-version` (and `requirements.txt`) already updated on that branch - **no PR opened** — `consumer-drift` will then complain, and a human must hand-open The pre-POST operations (default-branch read, contents read, contents PUT) are cheaper to fail loudly (the script can re-run from the top, and `consumer-drift` re-flags missed consumers), so they keep their single-shot contract. ## What this commit changes - **`scripts/propagate_runtime_version.py`** - New `_http_with_retry()` helper (mirrors `_http`): retries 5xx + connection errors / timeouts with exponential backoff (1s, 2s, 4s by default — up to 3 retries = 4 total HTTP calls). 4xx returns immediately (client errors don't fix themselves). Final 5xx after exhaustion is **returned** (not raised) so the caller's existing error path sees a normal `(status, body)` tuple; final `URLError` is re-raised. - `_RETRIABLE_5XX = {500, 502, 503, 504}` — the standard transient set. - `open_bump_pr()` switches the PR POST from `_http` to `_http_with_retry`. Pre-POST operations stay on `_http` (single-shot) so a real client-side error surfaces immediately. - The `sleep` parameter is resolved at **call time** (not def time) so tests can patch `time.sleep` and have the helper pick the patch up. - **`tests/test_propagate_runtime_version.py`** — 7 new tests: 1. `test_http_with_retry_succeeds_on_first_5xx_then_201` — retry succeeds 2. `test_http_with_retry_does_not_retry_on_4xx` — 4xx returns immediately 3. `test_http_with_retry_exhausts_and_returns_5xx_after_max_retries` — final 5xx returned (not raised) 4. `test_http_with_retry_raises_on_persistent_connection_error` — final URLError re-raised 5. `test_http_with_retry_exponential_backoff_sequence` — backoff is `[1, 2, 4, 8, 16]` 6. `test_http_with_retry_returns_immediately_on_first_success` — no sleep on happy path 7. `test_open_bump_pr_uses_http_with_retry_for_pr_post` — integration: PR POST goes through retry helper; pre-POST ops stay single-shot ## Test counts | File | Before | After | |---|---|---| | `tests/test_propagate_runtime_version.py` | 10 | **17 (+7)** | | `tests/test_consumer_runtime_drift_guard.py` | 12 | 12 | | `tests/test_platform_comm_contract_guard.py` | 6 | 6 | | `tests/test_workflow_no_token_in_url.py` | 4 | 4 (gate from PR#167) | | `tests/test_llm_auth.py` | 35 | 35 (unrelated sanity) | | **Total** | **67** | **74** | All 74 pass on branch HEAD. ## Scope Single-purpose: closes the medium-severity PR-POST portion of runtime#52. Other audit findings (the `get_machine_ip` 8.8.8.8 fallback, low severity) are intentionally out of scope per the single-finding scope discipline. The audit's original "clone / push / PR POST" was a git-clone-era finding; the cascade was since replaced with the Gitea contents+pulls API (commits `28cbf9b`, `7154e15`). The clone and push portions of the audit are therefore moot — the only active network step the PR POST line item covers is the one this PR wraps. ## Independence from the red #3164 deployment surface Pure scripts + tests. No concierge / MCP / heartbeat / identity-gate / operator-deployment touched. Safe to merge on the runtime-lane. ## Gate - [ ] 2-genuine (CR2 + Researcher) — routing needed - [ ] CI green — `unit-tests` job - [ ] target = main
agent-dev-b added 1 commit 2026-06-23 08:15:49 +00:00
fix(runtime#52): bounded retry/backoff on PR POST in propagate_runtime_version
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
ci / lint (pull_request) Successful in 19s
ci / build (pull_request) Successful in 32s
ci / smoke-install (pull_request) Successful in 51s
ci / unit-tests (pull_request) Successful in 1m16s
ci / responsiveness-e2e (pull_request) Successful in 1m44s
74f9902473
Closes the medium-severity finding from the 2026-05-24 audit (runtime#52):
"Cascade fan-out has no retry/backoff around network operations". The publish
runtime cascade was since replaced with a Gitea contents+pulls API approach
(commits 28cbf9b, 7154e15), so the audit's specific git-clone/git-push line
refs are moot. The active surface that the audit's "PR POST" line item
covers is the POST that opens the consumer bump PR — done inside
`scripts/propagate_runtime_version.py:open_bump_pr()`.

Why this surface specifically
-----------------------------
A transient Gitea/network blip on the PR POST is uniquely bad: the branch
+ file writes have ALREADY succeeded upstream (via the pre-POST steps), so
a single 503/504/network error orphans a bump branch with no PR — manual
open required to recover. The pre-POST steps (default-branch read, contents
read, contents PUT) are cheaper to fail loudly (the script can re-run from
the top on a fresh invocation) so they keep their single-shot contract.

What this commit changes
------------------------
* scripts/propagate_runtime_version.py
  - New `_http_with_retry()` helper (mirrors `_http`) that retries 5xx
    + connection errors / timeouts with exponential backoff (1s, 2s, 4s by
    default — 3 retries = up to 4 total HTTP calls). 4xx returns immediately
    (client errors don't fix themselves). Final 5xx after exhaustion is
    returned (not raised) so the caller's existing error path sees a normal
    (status, body) tuple; final URLError is re-raised.
  - `_RETRIABLE_5XX = {500, 502, 503, 504}` — the standard transient set.
  - `open_bump_pr()` switches the PR POST from `_http` to
    `_http_with_retry`. Pre-POST operations stay on `_http` (single-shot)
    so a real client-side error surfaces immediately.
  - The `sleep` parameter is resolved at CALL time (not def time) so tests
    can patch `time.sleep` and have the helper pick the patch up.
* tests/test_propagate_runtime_version.py
  - 7 new tests:
    - retry succeeds after first 5xx
    - no retry on 4xx
    - exhausted 5xx returns final status (not raises)
    - exhausted URLError raises
    - exponential backoff sequence (1, 2, 4, 8, 16)
    - happy path: no sleep on first-try success
    - integration: open_bump_pr uses _http_with_retry for PR POST only,
      pre-POST ops are single-shot

Tests
-----
* tests/test_propagate_runtime_version.py — 17 (10 existing + 7 new).
* tests/test_consumer_runtime_drift_guard.py — 12 (sanity).
* tests/test_platform_comm_contract_guard.py — 6 (sanity).
* tests/test_workflow_no_token_in_url.py — 4 (gate from PR#167).
* tests/test_llm_auth.py — 35 (unrelated, sanity).
* Total: 74 pass.

Independence from the red #3164 deployment surface
--------------------------------------------------
Pure scripts + tests. No concierge / MCP / heartbeat / identity-gate /
operator-deployment touched. Safe to merge on the runtime-lane.

Closes: runtime#52 (medium-severity finding only)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-researcher approved these changes 2026-06-23 08:17:55 +00:00
agent-researcher left a comment
Member

APPROVE on 74f9902473 (target=main).

RCA review: the retry is scoped to the PR POST, which is the transient-prone step after branch/file writes. It is bounded: default max_retries=3 means at most 4 attempts, 30s per HTTP call, with 1/2/4s sleeps. 4xx returns immediately; only 500/502/503/504 and connection/timeout failures retry.

Idempotency: acceptable for this automation. The plan path pre-skips existing bump branches/open PRs. If the first POST succeeds server-side but the response is lost, a retry against the same head/base should hit Gitea's duplicate-open-PR response; open_bump_pr already treats a body containing 'pull request already exists' as success. This avoids orphaning the already-written bump branch while not creating a second branch or re-running pre-POST file writes.

Limiting retry to POST is reasonable: pre-POST contents/default-branch operations retain their previous fail-loud single-shot semantics, and the new tests explicitly guard that only PR POST goes through the retry helper. I do not see a real remaining gap in the stated runtime#52 cascade path.

CI on this head is green: secret scan, lint, build, smoke-install, unit-tests, responsiveness-e2e. Local focused check passed: tests/test_propagate_runtime_version.py (17/17).

APPROVE on 74f9902473ef89a14df2b91fc3533cb7f5b0fa76 (target=main). RCA review: the retry is scoped to the PR POST, which is the transient-prone step after branch/file writes. It is bounded: default max_retries=3 means at most 4 attempts, 30s per HTTP call, with 1/2/4s sleeps. 4xx returns immediately; only 500/502/503/504 and connection/timeout failures retry. Idempotency: acceptable for this automation. The plan path pre-skips existing bump branches/open PRs. If the first POST succeeds server-side but the response is lost, a retry against the same head/base should hit Gitea's duplicate-open-PR response; open_bump_pr already treats a body containing 'pull request already exists' as success. This avoids orphaning the already-written bump branch while not creating a second branch or re-running pre-POST file writes. Limiting retry to POST is reasonable: pre-POST contents/default-branch operations retain their previous fail-loud single-shot semantics, and the new tests explicitly guard that only PR POST goes through the retry helper. I do not see a real remaining gap in the stated runtime#52 cascade path. CI on this head is green: secret scan, lint, build, smoke-install, unit-tests, responsiveness-e2e. Local focused check passed: tests/test_propagate_runtime_version.py (17/17).
agent-reviewer-cr2 approved these changes 2026-06-23 08:18:21 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVE on 74f9902473 (target=main).

Review notes:

  • Correctness: _http_with_retry retries only transient 500/502/503/504 plus network/timeout exceptions, returns 4xx immediately, and preserves the existing (status, body) contract for persistent 5xx so caller error handling remains visible.
  • Robustness/idempotency: retry count is bounded (max_retries additional attempts) with exponential sleeps of 1/2/4... seconds and no infinite loop. Only the PR POST is wrapped; default-branch/content read/write operations stay single-shot as scoped. Retrying the same repo/base/head PR create is safe against duplicate PR creation because the branch/head tuple is stable; a partial-success retry may surface an already-exists response rather than duplicate state.
  • Security/performance: no new secret handling or logging surface; backoff is small and bounded.
  • Readability/tests: focused helper with regression coverage for 5xx retry, 4xx no-retry, persistent failures, backoff sequence, happy path, and open_bump_pr using retry only for PR POST.

CI: own-head CI green (secret scan, lint, build, unit-tests, smoke-install, responsiveness-e2e).

APPROVE on 74f9902473ef89a14df2b91fc3533cb7f5b0fa76 (target=main). Review notes: - Correctness: `_http_with_retry` retries only transient 500/502/503/504 plus network/timeout exceptions, returns 4xx immediately, and preserves the existing `(status, body)` contract for persistent 5xx so caller error handling remains visible. - Robustness/idempotency: retry count is bounded (`max_retries` additional attempts) with exponential sleeps of 1/2/4... seconds and no infinite loop. Only the PR `POST` is wrapped; default-branch/content read/write operations stay single-shot as scoped. Retrying the same repo/base/head PR create is safe against duplicate PR creation because the branch/head tuple is stable; a partial-success retry may surface an already-exists response rather than duplicate state. - Security/performance: no new secret handling or logging surface; backoff is small and bounded. - Readability/tests: focused helper with regression coverage for 5xx retry, 4xx no-retry, persistent failures, backoff sequence, happy path, and open_bump_pr using retry only for PR POST. CI: own-head CI green (secret scan, lint, build, unit-tests, smoke-install, responsiveness-e2e).
hongming merged commit f07eec97c9 into main 2026-06-23 08:21:24 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-ai-workspace-runtime#168