fix(scheduler): enqueue cron ticks on busy agents instead of dropping them #2446
Reference in New Issue
Block a user
Delete Branch "fix/scheduler-enqueue-cron-on-busy"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
When a workspace agent is busy (
active_tasks >= max_concurrent_tasks), durable A2A dispatches ENQUEUE into thea2a_queuetable and get auto-picked-up when the agent frees — this works. But CRON/scheduled ticks did NOT:fireSchedulepolled every 10s for up to 2 min and then calledrecordSkipped(...), dropping the tick. So scheduled work bounced while A2A work buffered. On perpetually-busy workspaces (leaders kept busy by the Orchestrator pulse delegation chain) this dropped ~30% of scheduled fires.Fix
In
workspace-server/internal/scheduler/scheduler.gofireSchedule, the busy branch now ENQUEUES the cron message into the durablea2a_queuevia the existingEnqueueA2Apath (the same one A2A uses), instead of skipping. It uses the samea2aBodythe fire path builds (channel-context-injected prompt), method"message/send", priorityPriorityTask. The heartbeat drain dispatches it serially when the agent frees.max_concurrent_tasksand its default are unchanged. This is purely about buffering ticks instead of dropping them.Idempotency key = schedule_id (the whole point)
idempotencyKey = sched.ID(the schedule id), NOT a random uuid/messageId. Thea2a_queuepartial-unique indexidx_a2a_queue_idempotencydedups on(workspace_id, idempotency_key)forstatus IN ('queued','dispatched'). So a busy agent buffers at most ONE pending tick per schedule — the latest — instead of stacking a stale backlog of one-tick-per-poll.EnqueueA2A'sON CONFLICT DO NOTHINGkeeps the already-queued row and drops the obsolete newer tick at the DB. We hold the next tick, not a pile of obsolete ones.Design calls
expiresAt = next scheduled fire(ComputeNextRun). A buffered tick stuck past its own next cron slot expires rather than firing stale. NULL TTL on an unparseable cron (best-effort, never blocks the tick).handlersimportsscheduler, soschedulercannot importhandlers. Rather than a new import, the scheduler's existingA2AProxyinterface (held ass.proxy, satisfied by*WorkspaceHandler) is extended with anEnqueueA2Amethod that delegates to package-levelhandlers.EnqueueA2A.priorityTaskis a local const mirroringhandlers.PriorityTaskfor the same reason.EnqueueA2Aerrors, fall back torecordSkippedso liveness still advances and the operator sees the error.recordQueued(mirrorsrecordSkipped,last_status='queued').Tests
fireSchedulebusy-path unit test asserts enqueue-not-fire-not-skip withidempotency_key=schedule_id, methodmessage/send, priorityPriorityTask.A2AProxytest doubles (5) gain theEnqueueA2Amethod.go build ./...,go vet, andgo test ./internal/scheduler/...+ handlers tests all pass.Keeps execution serial; fixes scheduled-tick starvation on busy agents.
🤖 Generated with Claude Code
When a workspace agent is busy (active_tasks >= max_concurrent_tasks), A2A dispatches already buffer durably into the a2a_queue table and get picked up when the agent frees. Scheduled/cron ticks did NOT: fireSchedule polled every 10s for up to 2 min and then called recordSkipped(), dropping the tick. On perpetually-busy workspaces (e.g. leaders kept busy by the Orchestrator pulse delegation chain) this dropped ~30% of scheduled fires while A2A work buffered. Now, on busy, fireSchedule ENQUEUES the cron message into the durable a2a_queue via EnqueueA2A (the same path A2A uses) with the SAME a2aBody the fire path builds, method "message/send", priority PriorityTask. The heartbeat drain then dispatches it serially when the agent frees. Execution stays one-at-a-time; max_concurrent_tasks is unchanged — this is purely about buffering ticks. Idempotency key = schedule_id (NOT a random uuid / messageId). The a2a_queue partial-unique index idx_a2a_queue_idempotency dedups on (workspace_id, idempotency_key) for status IN ('queued','dispatched'), so a busy agent buffers AT MOST ONE pending tick per schedule — the latest — instead of stacking a stale backlog of one-tick-per-poll. We hold the next tick, not a pile of obsolete ones. Enqueue happens immediately on busy (the 2-min poll-wait is removed): durable buffering makes the wait pointless and the wait blocked a scheduler goroutine. Buffered ticks get expiresAt = next scheduled fire so a tick stuck past its own next cron slot expires rather than firing stale. If EnqueueA2A errors we fall back to recordSkipped so liveness still advances and the operator sees it. Seam: handlers imports scheduler, so scheduler cannot import handlers (cycle). The scheduler's existing A2AProxy interface (held as s.proxy, satisfied by *WorkspaceHandler) is extended with an EnqueueA2A method that delegates to the package-level handlers.EnqueueA2A — no new import, no cycle. priorityTask is a local const mirroring handlers.PriorityTask for the same reason. Adds recordQueued (mirrors recordSkipped, last_status='queued') and a fireSchedule busy-path unit test asserting enqueue-not-fire-not-skip with idempotency_key=schedule_id. All test proxy doubles gain the EnqueueA2A method. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>REQUEST_CHANGES on current head
2e69e48a4e.QA found a correctness/robustness blocker in the busy-enqueue design. scheduler.go sets expiresAt to the next cron fire, then calls EnqueueA2A with idempotencyKey=sched.ID (lines 421-445). Existing EnqueueA2A conflicts on active rows by workspace_id/idempotency_key with status IN ('queued','dispatched') and then returns that existing row (a2a_queue.go lines 139-154), but the conflict/lookup does not exclude expired rows. DequeueNext excludes expired rows (a2a_queue.go lines 190-192), so once a queued schedule tick expires while still status='queued', later cron ticks for the same schedule can conflict with and log against a row that the drain will not dispatch. That is a lost/starved tick path until unrelated stale cleanup changes status. The tests only assert that sched.ID is passed as the key; they do not cover an existing expired queued row blocking the next tick. Please make idempotency/expiry semantics fail-closed for cron ticks and add a regression test for expired existing schedule_id rows.
Content-security also is not clean for the stated gate: added comments expose concrete internal paths/mechanics such as internal/handlers/a2a_queue.go, idx_a2a_queue_idempotency, table/index/status details, and the Orchestrator pulse drop narrative in scheduler.go lines 421-435. These should be generalized while preserving the invariant.
CI note: latest-row CI/all-required is green, but per-check E2E Staging SaaS rows are failing and Local Provision is failing/advisory. No merge performed.
SECURITY APPROVE: current head verified. Scheduler change enqueues the same A2A message body it would otherwise dispatch immediately when a workspace is busy; no secret/credential material is added to queue metadata or logs, schedule_id is used only as a DB idempotency key, method is constant message/send, no shell/SQL string interpolation or ACL/privilege change, and no content-security markers requiring a block.
SECURITY APPROVE: current head verified. Scheduler change enqueues the same A2A message body it would otherwise dispatch immediately when a workspace is busy; no secret/credential material is added to queue metadata or logs, schedule_id is used only as a DB idempotency key, method is constant message/send, no shell/SQL string interpolation or ACL/privilege change, and no content-security markers requiring a block.
New commits pushed, approval review dismissed automatically according to repository settings
APPROVED on current head
fb76309d84.5-axis review: the scheduler no longer blocks the tick goroutine or drops a busy-workspace cron tick after the old wait window; it builds the same A2A body as the fire path, buffers it via EnqueueA2A with schedule-scoped idempotency, advances schedule liveness with last_status='queued', and records an activity-log entry. The queue fix covers the prior starvation case by retiring expired queued rows for the same key before insert, with regression coverage for expired-row replacement, no-expired-row normal enqueue, no-key skip, and scheduler busy-enqueue behavior.
Robustness/readability: fallback on enqueue failure records a skip rather than silently claiming delivery, TTL is bounded by the next cron slot, and the handler wrapper avoids the import cycle without duplicating queue logic. Scope is limited to scheduler/A2A queue code and tests. Content-security pass found no secrets, credentials, tokens, hostnames, provisioning details, or sensitive env/image markers added; remaining identifiers are local code/DB domain terms needed to explain and test the invariant.
Live status check: CI/all-required, Platform, Canvas, E2E API Smoke, Handlers PG, secret scan, and trusted SOP checklist are green. Review gates are expected to update from this review; gate-check-v3 and Local Provision/Staging lifecycle rows are non-required/advisory in this lane. No merge performed.
Security-team-21 review on current head
fb76309d. Checked the cron durable-enqueue changes, expired-row idempotency regression fix, added tests, and scheduler/A2A queue surfaces. Expired queued rows for the same key are retired before insert and tests cover expired/no-expired/no-key paths; no secret/token exposure, no credential mechanics, no dangerous shell ops, no injection path found. Content-security clean: no concrete internal path/image marker/provisioning detail added. Local go test could not be run in this container because go is unavailable; live required CI is green.Security-team-21 review on current head
fb76309d. Checked the cron durable-enqueue changes, expired-row idempotency regression fix, added tests, and scheduler/A2A queue surfaces. Expired queued rows for the same key are retired before insert and tests cover expired/no-expired/no-key paths; no secret/token exposure, no credential mechanics, no dangerous shell ops, no injection path found. Content-security clean: no concrete internal path/image marker/provisioning detail added. Local go test could not be run in this container because go is unavailable; live required CI is green.APPROVED on current head
fb76309d84.Fresh CR3 re-check after RC-9853: the stale RC was on old head
2e69e48. On current head, the expired-row starvation concern is resolved by the durable enqueue path and stale/expired queue-row handling, and the content-security comments/prose have been generalized. Current-head governance reviews are clean (CR2 9865 + Researcher 9867), required branch-protection contexts are green, and I found no remaining QA blocker. This fresh approval supersedes my stale RC-9853; no implementation changes made by CR3.