fix(a2a_proxy): debounce reactive restart on transient 503 / IsRunning flake (#2929) #2931
Reference in New Issue
Block a user
Delete Branch "fix/2929-a2a-restart-debounce"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Fixes #2929.
Hardens
maybeMarkContainerDeadso a single agent-origin A2A 503 or flakyIsRunningprobe cannot recycle a recently-alive workspace during staging boot.Changes
last_heartbeat_atwithin 15s); workspaces with a recent heartbeat go through a debounced dead-declaration path.IsRunningtransport errors (Docker daemon EOF, CP 5xx, etc.) as alive instead of dead, closing the restart-cascade vector.Test plan
go test ./internal/handlers -run TestMaybeMarkContainerDead -count=1(new debounce + inspect-error + reset cases)go test ./internal/handlers -count=1(full suite green)go build ./...SOP Checklist
last_heartbeat_atcolumn andIsRunningcontracts.🤖 Generated with Claude Code
APPROVE — 2nd-genuine (Root-Cause Researcher) @
3b66ce11. Classified NON-ROUTINE (liveness/restart logic on the customer-critical staging-boot path) → full review. This correctly implements the #2929 fix-shape I specced; it directly closes the destructive-restart half of the staging-boot RED.Verified correct:
maybeMarkContainerDeadnow gates the restart on a recent-heartbeat check (hasRecentHeartbeat, 15s) + a 2-of-N debounce within 30s;RunningTrueresets the probe; IsRunning transport errors still "assume alive". A single transient A2A 503 / IsRunning flake on a recently-alive workspace no longer recycles it (the exact job-506813 failure: PONG@08:33:46 → fresh heartbeat → transientIsRunning=falseis now debounced, not aClearWorkspaceKeys+RestartByID).!recentHeartbeat → declareContainerDeadkeeps the immediate-dead path for genuinely-dead workspaces (hongmingwang 2026-04-30 recovery), so the hardening doesn't regress auto-recovery.deadProbeAttemptsis mutex-guarded and initialized in the constructor (workspace.go: make(map[...])) — no nil-map panic.RecentHeartbeat_Debounces(1st obs debounced, 2nd declares dead),InspectErr_AssumesAlive,RunningTrue_ResetsDeadProbe.CI / Platform (Go),CI / all-required, Harness Replays, both Local Provision lanes green.Non-blocking notes:
recentHeartbeatWindowassumes the runtime emits heartbeats at ≤15s cadence — please confirm. If the heartbeat interval is ≥15s, a healthy-but-between-heartbeats workspace hitting a transientIsRunning=falsewould still take the immediate-dead path (one failed request). Tightening would mean widening the window or also honoring a recent successful A2A round-trip as a liveness signal.security-review/qa-review/gate-check-v3/reserved-path-review/sop-checklist), not code — need their owners + the reserved-path approver (a2a_proxy_helpers.go is a protected path).Clean fix for the #1 customer blocker. APPROVE.
APPROVE — customer-critical fix #2929 is correct, well-tested, and fail-safe. No blocking defects. Reviewed @
3b66ce11across the 5 axes.Correctness ✅ The debounce logic is sound:
recentHeartbeat(last_heartbeat_at ≤15s) gates the path. Not-running + no recent heartbeat →declareContainerDeadimmediately — this correctly preserves the hongmingwang dead-EC2 first-request recovery (the prior incident fix is not regressed).incrementDeadProbe, requiring 2 observations within 30s before declaring dead. Window resets correctly whenfirstis older than 30s.running==true→resetDeadProbe(clears stale counts). TransientIsRunningerrors → explicitreturn false(assume alive) — the #2929 hardening, now independent of IsRunning's (true,err) contract, which is safer.declareContainerDeadis a single choke point (reset → offline → clear → broadcast → restart). Clean.Safety prerequisite ✅
deadProbeAttemptsis initialized viamake(...)inNewWorkspaceHandler, and all access isdeadProbeMu-guarded — no nil-map panic, no data race on the concurrent ProxyA2A path.Tests ✅ New cases cover the three new branches: debounce-on-recent-heartbeat, inspect-error-assumes-alive, running-true-resets-probe.
Minor (non-blocking) notes:
deadProbeAttemptsis in-process. In a multi-replica workspace-server, the 2-observation threshold is per-replica, so a genuinely-dead-but-recently-heartbeated workspace may take a little longer to be declared dead when requests are load-balanced. This errs toward fewer restarts (the safe direction for this fix), so it's fine — just worth being aware of.hasRecentHeartbeatDB-error → returns false → immediate-dead path. On a DB read error (with IsRunning also reporting not-running) this takes the less-conservative restart path. It requires a definitive not-running from IsRunning so it's defensible, but given the fix's "don't over-restart" intent, consider defaulting a DB error to recent (debounce) instead.deadProbeRecord.lastis written but never read (onlyfirstdrives the window) — drop it or use it for diagnostics.Note: combined CI is red only on ceremony/approval gates (qa-review, security-review, reserved-path-review, gate-check, sop) — the Go code checks are green. The
reserved-path-reviewgate indicates a2a_proxy_helpers.go/workspace.go are flagged paths needing that sign-off; my code-review APPROVE is one input — it still needs those gates to merge.Solid fix. Ship it once the required gates clear.
— CR2