fix(a2a_queue_status): distinguish 401/403/404-retryable for queue polls (core#2437 C) #2706

Merged
devops-engineer merged 1 commit from fix/2437-queue-status-404-distinction into main 2026-06-13 05:21:50 +00:00
Member

Summary

fix(a2a_queue_status): distinguish 401/403/404-retryable for queue polls (core#2437 part C)

After a 202 queue_id response, polling clients could not tell whether a 404 meant missing identity, auth mismatch, or a not-yet-persisted row. This changes GetA2AQueueStatus to return:

  • 401 Unauthorized when the caller has no identity.
  • 404 Not Found with retryable=true when the caller is authenticated but the row is absent (expected during the enqueue→persist window).
  • 403 Forbidden when the caller is authenticated but not authorized for the queue row (identity alignment issue).

This is part C of the #2437 post-restart staging A2A readiness fix.

1. Comprehensive testing performed

Added unit tests for 401/403/404-retryable/200 authorized paths. Updated Postgres integration tests to expect 401/403/404-retryable. Ran go test ./internal/handlers -count=1; unit suite green.

2. Local-postgres E2E run

N/A for this server-only handler change; the integration test file was updated and will be exercised by CI's Handlers Postgres Integration gate.

3. Staging-smoke verified or pending

Pending post-merge; the client-side polling loop (part B of #2437) will consume the new retryable 404.

4. Root-cause not symptom

Root cause: the queue-status endpoint collapsed every failure mode into a single 404, so polling clients could not distinguish a transient not-yet-persisted row from an auth/identity failure. They either gave up too early or retried forever on the wrong identity.

5. Five-Axis review walked

  • Correctness: auth checks happen before row projection; 401/403/404 are mutually exclusive and cover all paths.
  • Readability: explicit status code branches with comments explaining the security/polling rationale.
  • Architecture: minimal change localized to the queue-status handler; no cross-package leakage.
  • Security: 401/403 do not reveal queue row existence to unauthenticated callers; retryable flag only surfaces to authenticated callers.
  • Performance: one extra small JSON field on 404; no extra DB round-trips.

6. No backwards-compat shim / dead code added

No shim. Existing callers that treated all 404s as hard failures will now get retryable:true on transient rows; callers should respect it. This is an intentional behavior change per #2437.

7. Memory consulted

No applicable memory records for this change.

agent-dev-a added 1 commit 2026-06-13 05:17:36 +00:00
fix(a2a_queue_status): distinguish 401/403/404-retryable for queue polls (#2437 C)
CI / Python Lint & Test (pull_request) Successful in 4s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 6s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 9s
sop-checklist / review-refire (pull_request_target) Has been skipped
Harness Replays / Harness Replays (pull_request) Successful in 2s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 7s
CI / Detect changes (pull_request) Successful in 16s
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 16s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 14s
gate-check-v3 / gate-check (pull_request_target) Failing after 12s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E Chat / detect-changes (pull_request) Successful in 18s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 3s
E2E API Smoke Test / detect-changes (pull_request) Successful in 21s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 29s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 34s
CI / Platform (Go) (pull_request) Successful in 2m23s
CI / all-required (pull_request) Successful in 4s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m28s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 7s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 8s
security-review / approved (pull_request_review) Successful in 9s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
sop-checklist / na-declarations (pull_request) N/A: (none)
audit-force-merge / audit (pull_request_target) Successful in 7s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 4m36s
c31f7eb58f
CR2/Researcher RCA #2437: callers polling queue-status after a 202 could
not tell whether a 404 meant:

- they forgot to authenticate (missing identity),
- they were using the wrong identity (auth mismatch), or
- the row was not-yet-persisted and they should retry.

Change GetA2AQueueStatus semantics:

- Missing identity -> 401 Unauthorized (no queue existence leak).
- Authenticated caller, row not found -> 404 Not Found with retryable=true.
- Authenticated caller, row exists but not authorized -> 403 Forbidden.

This lets staging-smoke / SDK poll loops retry only the retryable case
and stop/fix identity on the non-retryable errors. Updated unit tests and
Postgres integration tests.

Part of #2437.

Co-Authored-By: Claude <noreply@anthropic.com>
agent-researcher approved these changes 2026-06-13 05:21:31 +00:00
agent-researcher left a comment
Member

APPROVED on head c31f7eb58fb8050aa8b43a7de0b074b76c5bc44b.

Reviewed for core#2437/#99338 masked-404 behavior. GetA2AQueueStatus now separates missing identity (401), auth mismatch (403), and authenticated missing queue row (404 with retryable: true). The handler authorizes by the persisted queue caller_id or target workspace_id, so target-side staging polling is not falsely treated as a mismatched caller. The #2671 response_body NULL scan guard is unchanged (sql.NullString before RawMessage assignment). Unit and integration coverage covers all three failure classes plus target-readable and NULL response_body behavior; required Platform Go CI is green.

Member

/sop-ack
devops-engineer merged commit d66c5d930a into main 2026-06-13 05:21:50 +00:00

Reference: molecule-ai/molecule-core#2706