molecule-core/workspace-server
Hongming Wang 6faea202b9
fix(a2a-queue): nil-safe drain + 202-requeue handling (followup to #1893) (#1896)
* fix(a2a-queue): nil-safe error extraction in DrainQueueForWorkspace + handle 202-requeue

The drain path called proxyErr.Response["error"].(string) without a comma-
ok assertion. When proxyErr.Response had no "error" key (which happens in
the 202-Accepted-queued branch I added in the same PR — that response is
{"queued": true, "queue_id": ..., "queue_depth": ...}), the type assertion
panicked and killed the platform process.

The platform was down 25 minutes today before this was diagnosed. Fleet
went from 30 real outputs/15min → 0 events.

Two fixes here:

1. Treat 202 Accepted from the inner proxyA2ARequest as "re-queued"
   (target was busy AGAIN). Mark THIS attempt completed; the new queue
   row will be drained on the next heartbeat tick. Don't propagate as
   failure.

2. Defensive type-assertion when reading the error string. Falls back to
   http.StatusText, then a generic "unknown drain dispatch error" so the
   queue still gets a non-empty error_detail for ops debugging.

Now the drain path can never panic on a malformed proxy response.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(a2a-queue): return (202, body, nil) so callers see queued-as-success

Cycle 53 found callers logging 45× 'delegation failed: proxy a2a error'
even though the queue's drain stats showed 48 completions in the same
window. Investigation: my busy-error path returned

  return http.StatusAccepted, nil, &proxyA2AError{Status: 202, Response: ...}

The non-nil proxyA2AError is the failure signal. Even with status=202,
callers' `if proxyErr != nil` branch fires and logs the request as
failed. The 202 status was meaningless — the response body was nil too,
so the caller never even saw the queue_id/depth metadata.

Fix: return success-shape so callers do NOT enter the error branch:

  respBody, _ := json.Marshal(gin.H{"queued": true, "queue_id": qid, ...})
  return http.StatusAccepted, respBody, nil

Net effect: queue continues to absorb busy-errors (working since #1893),
AND callers correctly record the dispatch as queued-success rather than
failed. Closes the cycle 53 misclassification that was making the queue
look ineffective on activity_logs counts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
2026-04-23 22:55:43 +00:00
..
cmd/server fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247) 2026-04-21 03:18:21 +00:00
internal fix(a2a-queue): nil-safe drain + 202-requeue handling (followup to #1893) (#1896) 2026-04-23 22:55:43 +00:00
migrations feat(a2a): queue-on-busy — Phase 1 of priority queue (#1870) 2026-04-23 14:09:29 -07:00
pkg/provisionhook fix(docker): fix plugin go.mod replace for TokenProvider interface (#960) 2026-04-20 13:42:53 -07:00
.ci-force chore: force Platform(Go) CI run on main — validate go vet clean 2026-04-21 15:43:19 +00:00
.gitignore feat(ws-server): pull env from CP on startup 2026-04-19 02:41:15 -07:00
Dockerfile chore: extract ContextMenu Zustand fix + a2a_proxy local-docker SSRF bypass + workspace-server Dockerfile GID entrypoint 2026-04-22 20:00:16 -07:00
Dockerfile.tenant feat(terminal): remote path via aws ec2-instance-connect + pty 2026-04-21 18:13:29 -07:00
entrypoint-tenant.sh fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155) 2026-04-20 23:51:33 +00:00
go.mod Merge main into staging - resolving 1,388 commit divergence for PR #1573 2026-04-22 13:54:53 +00:00
go.sum feat(terminal): remote path via aws ec2-instance-connect + pty 2026-04-21 18:13:29 -07:00