fix(a2a): avoid false failure on busy queue fallback #1751

Merged
hongming merged 1 commit from fix/codex-scheduled-a2a-timeout into main 2026-05-23 23:29:01 +00:00
Owner

Summary

  • raise the A2A response-header budget from 180s to 5m so scheduled Codex turns get the same first-response budget as the scheduler fire timeout
  • when an upstream busy/timeout attempt is successfully durably queued, log the A2A receive as queued/ok instead of recording a false failure row
  • add a regression test for the live Researcher shape: response-header timeout -> queue success -> 202 + activity status ok

Live evidence

  • Root-Cause Researcher was changed from poll to push; the scheduler then reached the runtime instead of queueing for poll
  • the push attempt hit `timeout awaiting response headers` at 180002ms, then a2a_queue item `79496d0a-b00b-4e8c-aa55-39fc10a5b18b` completed on heartbeat drain

Verification

  • `go test ./internal/handlers -run 'TestHandleA2ADispatchError_BusyEnqueueLogsQueuedNotFailure|TestHandleA2ADispatchError_ContextDeadline|TestA2AClientResponseHeaderTimeout|TestHandleA2ADispatchError_NativeSession_NowEnqueues|TestHandleA2ADispatchError_NoNativeSession_StillEnqueues'`
  • `go test ./internal/handlers`
hongming added 1 commit 2026-05-23 23:21:37 +00:00
fix(a2a): avoid false failure on busy queue fallback
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 9s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 16s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Chat / detect-changes (pull_request) Successful in 19s
E2E API Smoke Test / detect-changes (pull_request) Successful in 20s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 8s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
gate-check-v3 / gate-check (pull_request) Successful in 8s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 35s
sop-checklist / review-refire (pull_request) Has been skipped
qa-review / approved (pull_request) Failing after 9s
security-review / approved (pull_request) Failing after 6s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 6s
sop-tier-check / tier-check (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
E2E Chat / E2E Chat (pull_request) Successful in 23s
Harness Replays / Harness Replays (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m51s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m26s
CI / Platform (Go) (pull_request) Successful in 5m48s
CI / all-required (pull_request) Successful in 6m45s
audit-force-merge / audit (pull_request) Successful in 8s
691d341fbb
agent-reviewer approved these changes 2026-05-23 23:27:27 +00:00
agent-reviewer left a comment
Member

5-axis review on 691d341:

Correctness: APPROVED. Busy/timeout dispatches that successfully enqueue now return the existing 202 queued response and record the activity as queued/ok rather than logging a false failure, while real container-dead/enqueue-failed/other errors still log failures. The response-header budget now matches the scheduler's longer first-response window.
Robustness: The new regression test covers the live shape: timeout/busy error, durable enqueue, 202 response, and status=ok activity logging. Existing enqueue failure fallback remains intact.
Security: No auth or secret surface changed.
Performance: Timeout budget increase can hold a connection longer, but aligns with scheduled turn expectations; no hot-loop or N+1 concern.
Readability: The new helper names and comments make the queued-vs-failed distinction clear. Functional CI is green; red statuses are review gates.

hongming merged commit 4d32736e25 into main 2026-05-23 23:29:01 +00:00
Reference: molecule-ai/molecule-core#1751