fix(runtime#52): bounded retry/backoff on PR POST in propagate_runtime_version #168
What
Closes the medium-severity finding from the runtime#52 audit (2026-05-24): "Cascade fan-out has no retry/backoff around network operations". Adds bounded retry + exponential backoff to the PR POST in scripts/propagate_runtime_version.py:open_bump_pr() so a single transient Gitea/network blip doesn't orphan a bump branch with no PR.
Why this surface specifically
A transient blip on the PR POST is uniquely bad because the branch + file writes have ALREADY succeeded upstream (the pre-POST steps). Without retry, a single 503/504/network error leaves:
- bump/runtime-X.Y.Z).
- runtime-version (and requirements.txt) already updated on that branch
- consumer-drift will then complain, and a human must hand-open
The pre-POST operations (default-branch read, contents read, contents PUT) are cheaper to fail loudly (the script can re-run from the top, and consumer-drift re-flags missed consumers), so they keep their single-shot contract.
What this commit changes
scripts/propagate_runtime_version.py
- _http_with_retry() helper (mirrors _http): retries 5xx + connection errors / timeouts with exponential backoff (1s, 2s, 4s by default; up to 3 retries = 4 total HTTP calls). 4xx returns immediately (client errors don't fix themselves). Final 5xx after exhaustion is returned (not raised) so the caller's existing error path sees a normal (status, body) tuple; final URLError is re-raised.
- _RETRIABLE_5XX = {500, 502, 503, 504}: the standard transient set.
- open_bump_pr() switches the PR POST from _http to _http_with_retry. Pre-POST operations stay on _http (single-shot) so a real client-side error surfaces immediately.
- sleep parameter is resolved at call time (not def time) so tests can patch time.sleep and have the helper pick the patch up.
tests/test_propagate_runtime_version.py: 7 new tests
- test_http_with_retry_succeeds_on_first_5xx_then_201: retry succeeds
- test_http_with_retry_does_not_retry_on_4xx: 4xx returns immediately
- test_http_with_retry_exhausts_and_returns_5xx_after_max_retries: final 5xx returned (not raised)
- test_http_with_retry_raises_on_persistent_connection_error: final URLError re-raised
- test_http_with_retry_exponential_backoff_sequence: backoff is [1, 2, 4, 8, 16]
- test_http_with_retry_returns_immediately_on_first_success: no sleep on happy path
- test_open_bump_pr_uses_http_with_retry_for_pr_post: integration: PR POST goes through retry helper; pre-POST ops stay single-shot
Test counts
- tests/test_propagate_runtime_version.py
- tests/test_consumer_runtime_drift_guard.py
- tests/test_platform_comm_contract_guard.py
- tests/test_workflow_no_token_in_url.py
- tests/test_llm_auth.py
All 74 pass on branch HEAD.
Scope
Single-purpose: closes the medium-severity PR-POST portion of runtime#52. Other audit findings (the get_machine_ip 8.8.8.8 fallback, low severity) are intentionally out of scope per the single-finding scope discipline.
The audit's original "clone / push / PR POST" was a git-clone-era finding; the cascade was since replaced with the Gitea contents+pulls API (commits 28cbf9b, 7154e15). The clone and push portions of the audit are therefore moot; the only active network step the PR POST line item covers is the one this PR wraps.
Independence from the red #3164 deployment surface
Pure scripts + tests. No concierge / MCP / heartbeat / identity-gate / operator-deployment touched. Safe to merge on the runtime-lane.
Gate
unit-tests job
APPROVE on 74f9902473 (target=main).
RCA review: the retry is scoped to the PR POST, which is the transient-prone step after branch/file writes. It is bounded: default max_retries=3 means at most 4 attempts, 30s per HTTP call, with 1/2/4s sleeps. 4xx returns immediately; only 500/502/503/504 and connection/timeout failures retry.
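
As a rough check on the bound stated above, the worst-case wall-clock time can be worked out directly. The sketch assumes every attempt consumes its full 30s timeout and that sleeps happen only between attempts; the function name is hypothetical.

```python
def worst_case_seconds(max_retries=3, timeout=30.0, base_delay=1.0):
    """Upper bound on time spent in one retried call, under the stated assumptions."""
    attempts = max_retries + 1                                   # 4 HTTP calls
    sleeps = [base_delay * 2 ** i for i in range(max_retries)]   # 1s, 2s, 4s
    return attempts * timeout + sum(sleeps)


print(worst_case_seconds())  # 127.0 (4 * 30s + 7s of backoff)
```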
Idempotency: acceptable for this automation. The plan path pre-skips existing bump branches/open PRs. If the first POST succeeds server-side but the response is lost, a retry against the same head/base should hit Gitea's duplicate-open-PR response; open_bump_pr already treats a body containing 'pull request already exists' as success. This avoids orphaning the already-written bump branch while not creating a second branch or re-running pre-POST file writes.
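
The idempotency argument above can be illustrated with a small classifier. This is only a sketch: the function name is hypothetical, the plain substring match is an assumption about how open_bump_pr detects the duplicate response, and no particular HTTP status for that response is assumed.

```python
def pr_post_succeeded(status: int, body: str) -> bool:
    """Treat a fresh 2xx or a duplicate-open-PR response as success."""
    if 200 <= status < 300:
        return True
    # A retry after a lost response may find the PR already open; that still
    # means the bump branch has its PR, so it is not an error.
    return "pull request already exists" in body


# First POST succeeded server-side but the response was lost; the retry
# then sees the duplicate message and is still counted as success.
print(pr_post_succeeded(409, "pull request already exists for these targets"))  # True
print(pr_post_succeeded(503, "upstream unavailable"))  # False
```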
Limiting retry to POST is reasonable: pre-POST contents/default-branch operations retain their previous fail-loud single-shot semantics, and the new tests explicitly guard that only PR POST goes through the retry helper. I do not see a real remaining gap in the stated runtime#52 cascade path.
CI on this head is green: secret scan, lint, build, smoke-install, unit-tests, responsiveness-e2e. Local focused check passed: tests/test_propagate_runtime_version.py (17/17).
APPROVE on 74f9902473 (target=main).
Review notes:
_http_with_retry retries only transient 500/502/503/504 plus network/timeout exceptions, returns 4xx immediately, and preserves the existing (status, body) contract for persistent 5xx so caller error handling remains visible.
max_retries additional attempts) with exponential sleeps of 1/2/4... seconds and no infinite loop. Only the PR POST is wrapped; default-branch/content read/write operations stay single-shot as scoped. Retrying the same repo/base/head PR create is safe against duplicate PR creation because the branch/head tuple is stable; a partial-success retry may surface an already-exists response rather than duplicate state.
CI: own-head CI green (secret scan, lint, build, unit-tests, smoke-install, responsiveness-e2e).