fix(a2a): classify proxyA2AError into busy_retryable / delivered / upstream_dead (#3056) #3060
Reference in New Issue
Block a user
Delete Branch "fix/3056-proxy-a2a-error-classification"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #3056.
Problem
The 2026-06-19 a2a RCA found that
proxyA2AError.Error()returned the same opaque"proxy a2a error"string for THREE distinct failure modes that monitoring/PM could not tell apart:A single-threaded busy spike (transient backpressure under a fleet burst) read as a total outage, and delegations that the agent had already completed (delivered shape) got logged as
[failed].Fix
Add a structured
Classificationfield toproxyA2AErrorand set it at every construction site:busy:truea2a_proxy_helpers.go:172busy_retryabledelivery_confirmed=trueAND non-empty body (re-read)a2a_proxy.go(post-CR2)deliveredmaybeMarkContainerDead+dead==true(post-Do reactive)a2a_proxy.go:881-892upstream_deadhandleA2ADispatchError+dead==true(pre-Do reactive)a2a_proxy_helpers.go:79-86(post-Researcher)upstream_deadpreflightContainerHealth+ container-not-running (proactive)a2a_proxy_helpers.go:506-513(post-Researcher)upstream_dead""(backward-compat)Error()suffixes the classification as[busy_retryable]/[delivered]/[upstream_dead]for log readability; theClassificationfield is the structured consumption surface for PM/monitoring.executeDelegationlogs the classification alongside the error so existing log scrapers can grep it without a schema change.Strictness fix (post-CR2)
The original
deliveredpredicate wasresp.StatusCode >= 200 && resp.StatusCode < 400(any 2xx OR 3xx with body-read error). CR2 review 12458 caught that this is broader thanexecuteDelegation.isDeliveryConfirmedSuccess, which gates on200 <= status < 300ANDlen(respBody) > 0. A 2xx-with-empty-body or a 3xx-with-body-read-error would log as[delivered]even though the delegation would not be promoted to success — recreating the false-classification problem in a narrower shape. The helper now takes(status int, bodyNonEmpty bool)and returnsdeliveredonly when both hold, restoring exact alignment withisDeliveryConfirmedSuccess.Upstream_dead coverage (post-Researcher)
Researcher review 12457 caught two upstream_dead sites that the original PR missed. Both are now classified, making the upstream_dead bucket exhaustive across the three code paths that produce a dead-origin error (reactive post-Do, reactive pre-Do, proactive preflight).
Test plan
TestProxyA2AError_Classification_SuffixesMessage: pins theError()shape for all four cases (three classifications + empty default)TestProxyA2AError_NilSafe: nil receiver must return empty stringTestClassificationFromDeliveryConfirmed+TestClassificationFromDeliveryConfirmed_Strict2xxAndNonEmpty: pins the strict 2xx+body predicateTestIsUpstreamBusyError_DoesNotSetClassification: the predicate does NOT mutate the fieldTestUpstreamDead_ConstructionSites: pins BOTH the reactive and proactive upstream_dead sitesFull handler test suite passes (38s);
go vetclean. Files: 3 changed, +160/-26 on top of the original PR.SOP checklist
go build ./...clean;go test ./internal/handlers/...green (38s);go vet ./internal/...clean. Both 12458 + 12457 follow-up blockers tested.Classificationfield is the new API; existingError()string output is preserved (just gains a[classification]suffix for log readability), and theproxyA2AError.Response["error"]shape is untouched.isDeliveryConfirmedSuccess, not a vibes-based check).REQUEST_CHANGES
5-axis review on head
9ce92f5e.Correctness: blocker. The new classifications cover the delivered read-error path (
a2a_proxy.go:772-792), busy retryable fallback after enqueue failure (a2a_proxy_helpers.go:172-185), and one upstream-dead HTTP-status path (a2a_proxy.go:881-892). But two restart-triggered upstream-dead paths still return unclassifiedproxyA2AErrors:handleA2ADispatchErrorwhenmaybeMarkContainerDeadreturnsdead=true(workspace-server/internal/handlers/a2a_proxy_helpers.go:79-86), and proactivepreflightContainerHealthwhen the container is not running and restart is triggered (a2a_proxy_helpers.go:491-513). Both are genuine upstream-dead/restart-triggering failures under this PR's stated contract, so monitoring would still see opaqueproxy a2a error/unclassified failures for real dead-origin cases.Robustness: the delivered success gate still runs before transient retry, which is correct, and empty 2xx bodies remain failures. However the test suite does not pin the two missed upstream-dead construction sites, so this gap can regress silently.
Security: no new auth or trust-boundary concern found; SSRF/access-control paths remain unclassified as ordinary failures, which is appropriate.
Performance: no material overhead from adding a string field/log suffix.
Readability/operability: the classification field and log suffix are clear, but the implementation does not yet match the stated three-bucket taxonomy. Also SOP status: the PR body I can read does not contain the required 7-item SOP checklist, so I cannot post
/sop-ack <slug>items as satisfied; please add the checklist and, after fixing the missed upstream_dead paths/tests, request fresh SOP acks.5-axis review on head
9ce92f5e:Requesting changes for a correctness/observability bug in the delivered classification path.
proxyA2ARequestsetsdeliveryConfirmed := resp.StatusCode >= 200 && resp.StatusCode < 400and then uses that to assignClassification: "delivered"on any body-read error. That is broader than the actual delegation success predicate:isDeliveryConfirmedSuccessonly promotes the delegation whenproxyErr != nil,200 <= status < 300, andlen(respBody) > 0.That means a 2xx response where the body read fails before any bytes are received, or a 3xx response with a body-read error, can be logged/reported as
[delivered]even thoughexecuteDelegationwill not treat it as completed. Since the RCA semantics saydeliveredmeans the work completed and monitoring/PM should not count it as a failure, this can recreate the same false classification problem in a narrower shape.Please align the classification with the completed/success condition, e.g. only classify as
deliveredwhen the status is 2xx and the recovered response body is non-empty, or otherwise make the classification name/semantics distinct from the success path. Add construction-path coverage for the zero-body 2xx read-error and 3xx read-error cases so the classification cannot drift fromisDeliveryConfirmedSuccessagain.Other axes: busy_retryable and upstream_dead construction sites look scoped to the intended paths; no security or performance issue found in the classification field/logging itself; readability is acceptable, though comments should be tightened once the delivered predicate is corrected.
SOP note: the PR body is missing the required 7-item SOP checklist, so I did not post any
/sop-ack <slug>comments. Please add the checklist before requesting SOP acks.5-axis re-review on head
66ed91ef:APPROVED.
Correctness: my prior 12458 finding is resolved. The delivered classification now uses
classificationFromDeliveryConfirmed(status, bodyNonEmpty)and only returnsdeliveredfor200 <= status < 300with a non-empty recovered body, matchingisDeliveryConfirmedSuccess. 2xx with no recovered body and 3xx read-error cases no longer under-count failures as delivered. Researcher's upstream_dead concern is also addressed across the post-Do reactive, pre-Do reactive, and proactive preflight dead-container construction sites.Robustness: busy_retryable remains scoped to the busy/enqueue-fallback path; upstream_dead is reserved for restart-triggering dead-origin paths. The delegation logging is nil-safe because the classification log is only reached after
isDeliveryConfirmedSuccessconfirmsproxyErr != nilor inside theproxyErr != nilfailure branch.Security: no new input/trust boundary; this is structured error classification/logging. Performance: no material runtime cost. Readability: the helper documents the exact alignment with the delegation success predicate and the tests pin the important negative cases.
CI visible: Platform Go and other code checks are successful on this head; review/SOP gates are still external approval/ack gates.
APPROVED
Fresh re-review on head
66ed91ef.5-axis review:
Classification: "upstream_dead":handleA2ADispatchErrordead=true path inworkspace-server/internal/handlers/a2a_proxy_helpers.go:83-92, and proactivepreflightContainerHealthnot-running/restart path ina2a_proxy_helpers.go:513-528. The existing HTTP-status upstream-dead path remains classified, busy retryable remains limited to the fallback 503 busy shape, and delivered classification is now tightened to matchisDeliveryConfirmedSuccess: 2xx plus non-empty body only.TestUpstreamDead_ConstructionSitesis a static shape test rather than invoking the constructors, but the construction sites themselves now visibly carry the field.Classificationfrom log-readableError()suffixes.Verification: local Go tests could not be run in this container because
gois not installed. I reviewed the diff and construction sites directly; PR reports Platform-Go pending./sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted