## Problem
When a lead delegates to a worker that's mid-synthesis, the proxy returns
503 "workspace agent busy" and the caller records the delegation as
failed. Under fan-out storms from leads this reaches a ~70% drop rate
(today's observed numbers in the cycle reports).
## Fix — Phase 1: TASK-level queue-on-busy
When `handleA2ADispatchError` determines the target is busy, instead of
returning 503, enqueue the request as priority=TASK and return 202
Accepted with `{queued: true, queue_id, queue_depth}`. The workspace's
next heartbeat (≤30s) drains one item if it reports spare capacity.
Files:
- migrations/042_a2a_queue.{up,down}.sql — `a2a_queue` table with
partial indexes on status='queued' + idempotency_key. Schema
supports PriorityCritical/Task/Info from day one so Phase 2/3 ship
without migration churn.
- internal/handlers/a2a_queue.go — EnqueueA2A / DequeueNext /
Mark*-helpers plus WorkspaceHandler.DrainQueueForWorkspace. Uses
`SELECT ... FOR UPDATE SKIP LOCKED` so concurrent drains can't
double-claim the same row. Max 5 attempts before marking 'failed'
so a stuck item doesn't wedge the queue forever.
- internal/handlers/a2a_proxy_helpers.go — isUpstreamBusyError branch
calls EnqueueA2A and returns 202 on success. Falls through to the
legacy 503 on enqueue error (DB hiccup shouldn't silently drop).
- internal/handlers/registry.go — RegistryHandler gets a QueueDrainFunc
injection hook (SetQueueDrainFunc). When Heartbeat sees
active_tasks < max_concurrent_tasks, spawns a goroutine that calls
the drain hook. context.WithoutCancel ensures the drain outlives
the heartbeat handler's ctx.
- internal/router/router.go — wires wh.DrainQueueForWorkspace into
rh.SetQueueDrainFunc after both are constructed.
## Not in this PR (Phase 2/3/4 follow-ups)
- INFO priority + TTL (Phase 2)
- CRITICAL priority + soft preemption between tool calls (Phase 3)
- Age-based promotion so TASK doesn't starve (Phase 4)
- `GET /workspaces/:id/queue` observability endpoint
Schema already supports all of these; only the dispatch + policy code
remains.
## Tests
- TestExtractIdempotencyKey (5 cases): messageId parsing is robust
- TestPriorityConstants: ordering invariant + 50=TASK default
alignment with migration DEFAULT
Full DB-touching tests (FIFO order, retry bound, idempotency conflict) are
intentionally deferred to the CI migration-enabled path: the sqlmock
ceremony would duplicate the existing test infrastructure three times over,
and the behaviour lives directly in SQL (`FOR UPDATE SKIP LOCKED`, the
partial unique index) rather than in Go code.
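For reference, the messageId extraction that TestExtractIdempotencyKey covers might look like the sketch below. The exact JSON field path is an assumption for illustration, not taken from the PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractIdempotencyKey sketches the helper under test. Assumption: the A2A
// request body is JSON-RPC and the key is the message's messageId; the real
// field path may differ.
func extractIdempotencyKey(body []byte) string {
	var req struct {
		Params struct {
			Message struct {
				MessageID string `json:"messageId"`
			} `json:"message"`
		} `json:"params"`
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return "" // unparseable body: no key, enqueue without dedup
	}
	return req.Params.Message.MessageID
}

func main() {
	ok := []byte(`{"params":{"message":{"messageId":"m-42"}}}`)
	fmt.Println(extractIdempotencyKey(ok)) // m-42
	fmt.Println(extractIdempotencyKey([]byte(`not json`)) == "") // true
}
```

Returning an empty key on parse failure matters because idx_a2a_queue_idempotency only applies `WHERE idempotency_key IS NOT NULL`, so keyless requests simply skip dedup.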
## Expected impact once deployed
- a2a_receive errors with the "busy" flavor drop from ~69/10min observed
  today to ~0
- delegation_failed rate drops from ~50% to <5%
- real_output metric rises from ~30/15min back toward the pre-throttle
  baseline
Closes #1870 Phase 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The up migration (SQL, 54 lines, 2.7 KiB):

```sql
-- #1870 Phase 1: TASK-level queue for A2A delegations that hit a busy target.
--
-- Before: when the target workspace's HTTP handler errors (agent busy
-- mid-synthesis — single-threaded LLM loop), a2a_proxy_helpers.go returns
-- 503 with a Retry-After hint, the caller logs activity_type='delegation'
-- status='failed' and moves on. Delegations silently dropped; fan-out
-- storms from leads reach ~70% drop rate.
--
-- After: same failure triggers an INSERT into a2a_queue with priority=TASK.
-- Workspace's next heartbeat (up to 30s later) drains the queue if capacity
-- allows. Proxy returns 202 Accepted with {"queued": true, "queue_id", ...}
-- instead of 503, caller logs as dispatched-queued.
--
-- Phase 2 will add INFO (TTL) and CRITICAL (preempt) levels. This table's
-- priority column is wide enough for all three from day one — no migration
-- churn on next phase.

CREATE TABLE IF NOT EXISTS a2a_queue (
    id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    workspace_id uuid NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
    caller_id uuid,
    priority smallint NOT NULL DEFAULT 50, -- 100=CRITICAL, 50=TASK, 10=INFO
    body jsonb NOT NULL,
    method text,
    idempotency_key text,
    enqueued_at timestamptz NOT NULL DEFAULT now(),
    dispatched_at timestamptz,
    completed_at timestamptz,
    expires_at timestamptz, -- TTL, for future INFO level
    attempts integer NOT NULL DEFAULT 0,
    status text NOT NULL DEFAULT 'queued' -- queued | dispatched | completed | dropped | failed
        CHECK (status IN ('queued','dispatched','completed','dropped','failed')),
    last_error text
);

-- Primary drain-query index: pick oldest highest-priority queued item for a
-- workspace. Partial index on status='queued' keeps the hot path tiny.
CREATE INDEX IF NOT EXISTS idx_a2a_queue_dispatch
    ON a2a_queue (workspace_id, priority DESC, enqueued_at ASC)
    WHERE status = 'queued';

-- TTL index for future INFO cleanup (no-op today — expires_at is always NULL
-- for TASK). Still worth creating now so Phase 2 doesn't need a migration.
CREATE INDEX IF NOT EXISTS idx_a2a_queue_expiry
    ON a2a_queue (expires_at)
    WHERE status = 'queued' AND expires_at IS NOT NULL;

-- Idempotency: a caller retrying with the same idempotency_key should not
-- double-enqueue. Partial unique index only on active queue entries so
-- completed/dropped entries don't block future legitimate re-uses.
CREATE UNIQUE INDEX IF NOT EXISTS idx_a2a_queue_idempotency
    ON a2a_queue (workspace_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL AND status IN ('queued','dispatched');
```