fix(scheduler): #1684 — native_session adapters now use platform a2a_queue (unblock Reno Stars cron starvation) #1685

2026-05-22T19:15:52Z

hongming commented

2026-05-22 19:15:52 +00:00

Summary

Reno Stars production-client report #1684 — 12 consecutive */30 * * * * cron fires bounced 503 over 6h while a single native_session held the slot. Pre-fix, handleA2ADispatchError short-circuited to 503-no-queue when the adapter declared provides_native_session=True. The original rationale (in the code-comment) assumed the SDK owned an inbound queue; in practice claude-agent-sdk, codex app-server, and hermes-agent don't — new turns arrive only via the same HTTP POST that just returned busy. Result: cron bounces forever until the SDK voluntarily yields.

This PR drops the HasCapability(workspaceID, "session") early-return so native_session and non-native callers take the same EnqueueA2A path. Drain timing IS tied to SDK readiness: registry.go:Heartbeat gates DrainQueueForWorkspace on payload.ActiveTasks < maxConcurrent, so the queued item only dispatches when the SDK itself reports spare capacity — i.e. the next heartbeat after the in-flight turn returns.

The original comment's concern ("drain timing has no relationship to SDK readiness") was unfounded — that relationship existed all along via ActiveTasks reporting. New comment in the source records this.
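The heartbeat gate described above can be sketched as a single predicate. This is a minimal illustration, not the code in registry.go: `HeartbeatPayload` and `shouldDrain` are hypothetical names, and only the `ActiveTasks < maxConcurrent` comparison is taken from this PR.

```go
package main

import "fmt"

// HeartbeatPayload is a hypothetical stand-in for the heartbeat body.
// Only the ActiveTasks field is taken from the PR description.
type HeartbeatPayload struct {
	ActiveTasks int
}

// shouldDrain mirrors the gate described in the summary: the queue is
// drained only when the workspace reports fewer active tasks than its
// concurrency limit.
func shouldDrain(p HeartbeatPayload, maxConcurrent int) bool {
	return p.ActiveTasks < maxConcurrent
}

func main() {
	const maxConcurrent = 1 // assumed limit for a single-slot workspace

	// Turn in flight: the SDK reports ActiveTasks=1, so nothing drains.
	fmt.Println(shouldDrain(HeartbeatPayload{ActiveTasks: 1}, maxConcurrent)) // false

	// Idle: the SDK reports ActiveTasks=0, so the next heartbeat drains.
	fmt.Println(shouldDrain(HeartbeatPayload{ActiveTasks: 0}, maxConcurrent)) // true
}
```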

Why now

Reno Stars SEO agent (3fe84b89-eb65-42fc-ad1f-5c93582ca3e7, i-04556bfc8da774084) has been blocked since 12:33 UTC on this exact pattern
Same failure mode blocks any plan to put PM (deedcb61-8c15-4227-bffe-8859038bf456) on a periodic cron for 24/7 monitoring
Production team agents (codex Researcher/CR) hit the same risk

Changes

workspace-server/internal/handlers/a2a_proxy_helpers.go: remove the if HasCapability(workspaceID, "session") block; rewrite the multi-paragraph comment to record the new rationale and the heartbeat-gated drain trigger
workspace-server/internal/handlers/native_session_test.go: invert the positive pin (_SkipsEnqueue → _NowEnqueues), keep the negative pin for non-native workspaces, guard against accidental reversion
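A rough sketch of the post-fix shape of the busy branch, assuming a simplified queue interface. `Queue`, `memQueue`, `brokenQueue`, and `onUpstreamBusy` are hypothetical names for illustration; the real handler is `handleA2ADispatchError` and the real tests use sqlmock rather than hand-rolled fakes.

```go
package main

import (
	"errors"
	"fmt"
)

// Queue is a hypothetical stand-in for the platform a2a_queue.
type Queue interface {
	Enqueue(workspaceID string, body []byte) error
}

// memQueue records every enqueued body per workspace.
type memQueue struct{ items map[string][][]byte }

func (q *memQueue) Enqueue(id string, body []byte) error {
	q.items[id] = append(q.items[id], body)
	return nil
}

// brokenQueue always fails, to exercise the 503 fallback.
type brokenQueue struct{}

func (brokenQueue) Enqueue(string, []byte) error {
	return errors.New("queue unavailable")
}

// onUpstreamBusy models the post-fix branch: there is no capability
// check, so native_session and non-native targets are both enqueued.
// It reports whether the request was queued; false means the caller
// should fall back to the busy 503.
func onUpstreamBusy(q Queue, workspaceID string, body []byte) bool {
	return q.Enqueue(workspaceID, body) == nil
}

func main() {
	q := &memQueue{items: map[string][][]byte{}}

	// A native_session workspace is queued like any other.
	fmt.Println(onUpstreamBusy(q, "native-ws", []byte("cron fire"))) // true
	fmt.Println(len(q.items["native-ws"]))                           // 1

	// If the queue itself is unavailable, the 503 fallback applies.
	fmt.Println(onUpstreamBusy(brokenQueue{}, "native-ws", []byte("x"))) // false
}
```

The failing-queue case corresponds to what the inverted pin test checks: the enqueue must be attempted for a native_session target, and a queue failure must still surface as the busy 503.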

Test plan

- [x] go vet ./... clean
- [x] go build ./... clean
- [x] Unit: TestHandleA2ADispatchError_NativeSession_NowEnqueues (new) and TestHandleA2ADispatchError_NoNativeSession_StillEnqueues (preserved) both pass
- [ ] CI: full handlers package test run
- [ ] Post-merge: monitor Reno Stars d87a0cd5-3721-419a-9215-df84ec1e3506 cron; the next */30 fire after the running session yields should dispatch from queue depth=1 rather than bouncing 503

Notes

The native_session=true marker in the response body is removed since callers no longer need to distinguish (platform queues both kinds). No external callers were grepping for this field — verified with grep -rn '"native_session".*true' workspace-server/ returning nothing outside the tests we updated.
Fallback ceiling (option A from #1684's proposed fixes) is NOT included here — recommended as a follow-up PR. Heartbeat-gated drain handles the common case; ceiling is only needed for SDK-never-returns edge cases.

Refs: #1684

🤖 Generated with Claude Code


hongming added 1 commit 2026-05-22 19:15:52 +00:00

fix(scheduler): #1684 — native_session adapters now use platform a2a_queue (unblock Reno Stars cron starvation)

Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 9s
E2E Chat / detect-changes (pull_request) Successful in 13s
E2E API Smoke Test / detect-changes (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 11s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 10s
Harness Replays / detect-changes (pull_request) Successful in 12s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 10s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 11s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 6s
sop-checklist / review-refire (pull_request) Has been skipped
gate-check-v3 / gate-check (pull_request) Successful in 12s
qa-review / approved (pull_request) Failing after 12s
security-review / approved (pull_request) Failing after 10s
sop-tier-check / tier-check (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m12s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Canvas (Next.js) (pull_request) Successful in 10s
E2E Chat / E2E Chat (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
Harness Replays / Harness Replays (pull_request) Successful in 3s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m47s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m21s
CI / Platform (Go) (pull_request) Successful in 5m17s
CI / all-required (pull_request) Successful in 5m59s
audit-force-merge / audit (pull_request) Successful in 6s

8603f17c30

Pre-fix, `handleA2ADispatchError` short-circuited to 503-no-queue when
the target adapter declared `provides_native_session=True`. The rationale
(retained in the original code-comment) was that the SDK owned an inbound
queue and platform-side enqueueing would double-buffer with no clean
drain-readiness signal.

In production this assumption fails for the common native_session SDKs
(claude-agent-sdk, codex app-server, hermes-agent): they have no inbound
queue — new turns can only arrive via the same HTTP POST that just
returned busy. So cron fires (and any A2A retry) bounce 503 every tick
until the SDK voluntarily yields. Reno Stars #1684 observed 12
consecutive `*/30` fires lost over 6h while a single native_session held
the slot.

The "no clean drain-readiness signal" concern turns out to be unfounded:
`registry.go:Heartbeat` already gates drain by `payload.ActiveTasks <
maxConcurrent`, so `DrainQueueForWorkspace` only fires when the SDK
itself reports spare capacity. That IS the post-session-end signal: the
native_session SDK reports ActiveTasks=1 while in a turn, 0 when idle,
and the next heartbeat after idle triggers drain. The platform queue's
drain timing IS tied to SDK readiness — the original comment was wrong.

This change collapses the two branches into one: both native_session and
non-native callers now enqueue here. The native_session SDK's own
in-flight POST stays unaffected; the queued item drains on the next
post-idle heartbeat. The `native_session=true` marker is dropped from
the 503 response body since callers no longer need to distinguish (the
platform queues both kinds).

- a2a_proxy_helpers.go: remove the `if HasCapability(workspaceID,
  "session")` early-return; rewrite the comment to record the rationale
  for future readers
- native_session_test.go: invert the existing positive pin
  (TestHandleA2ADispatchError_NativeSession_SkipsEnqueue →
  _NowEnqueues), keep the negative pin (non-native still enqueues)

Refs: #1684, Reno Stars production-client report

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hongming referenced this pull request

2026-05-22 19:16:04 +00:00

Molecule Platform — Bug Report: `workspace agent busy — adapter handles retry (native_session)` cron starvation #1684

hongming requested review from qa 2026-05-22 20:39:41 +00:00

hongming requested review from security 2026-05-22 20:39:41 +00:00

hongming commented

2026-05-22 20:47:53 +00:00

Live production-data validation

2026-05-22 20:40 UTC — I (the orchestrator running this fix push) just hit the EXACT bug this PR fixes, while testing something unrelated.

Delegation 6b12a913-7033-457b-bd8a-cb1af8e84de4 from agents-team root (091a9180) → Production Manager (deedcb61) failed with:

status: failed
error: "workspace agent busy — adapter handles retry (native_session)"

This is verbatim the 503 string this PR removes. PM was busy processing subagent dispatches; the platform short-circuited my delegation to 503-no-queue per Path A in current a2a_proxy_helpers.go. Post-fix, it would have been enqueued and dispatched on PM's next heartbeat-reported idle, identical to the Reno Stars SEO agent case in #1684.

This is the third independent instance I have data for in the last 2 hours: Reno Stars #1684 (12 fires lost over 6h), my own delegation (lost immediately), and the codex production team workspaces (separate wedge but same 503-no-queue gating their cron retries). Pattern is recurring on every native_session adapter under load.

Merging this PR is the unblock; CTO has GO.


hongming commented

2026-05-22 21:06:17 +00:00

@core-qa @core-security — review request pending since 20:39 UTC. CI is 26/30 green (the 2 pending are arm64 + sop-na, both expected). This PR drops a single early-return in a2a_proxy_helpers.go (the HasCapability(workspaceID, "session") block) so native_session adapters take the same EnqueueA2A path as non-native callers. Existing heartbeat-gated drain (registry.go:819 — payload.ActiveTasks < maxConcurrent) handles dispatch — no new dispatcher needed.

Production impact verified: Reno Stars #1684 (12 cron fires lost over 6h) + my own delegation hit the same bug 20 min ago (comment above). CTO has GO.

QA: read-path test inversion in native_session_test.go pins new behavior + retains negative test. Security: no surface-area change, no new endpoint, no new auth path.


hongming referenced this pull request

2026-05-22 22:47:42 +00:00

Follow-up to #1685: native_session ceiling (Option A) — force checkpoint after N min so SDK-never-returns can't block cron forever #1690

agent-dev-b approved these changes 2026-05-23 00:42:31 +00:00

agent-dev-b left a comment

MiniMax second-eyes review

APPROVED -- clean across all 5 axes.

axis 1 - handleA2ADispatchError (a2a_proxy_helpers.go)

The native_session bypass block (lines 76-103 in main) is removed entirely.
Both native_session and non-native paths now call EnqueueA2A on
isUpstreamBusyError. The new comment block is thorough: explains why the
original double-buffer concern was wrong (common SDKs have NO inbound queue;
new turns arrive only via the same POST that just returned busy), and
documents the drain mechanism. The Reno Stars symptom (12 consecutive cron
fires lost over 6h) is directly addressed.

axis 2 - native_session_test.go pin tests

TestHandleA2ADispatchError_NativeSession_NowEnqueues correctly reverses the
old assertion: it now EXPECTS an INSERT into a2a_queue for native_session
targets, makes it fail (via errTestQueueUnavailable), and verifies the
fallback 503 still fires with busy=true and NO native_session marker. The
removal of the marker is pinned in the negative assertion.

TestHandleA2ADispatchError_NoNativeSession_StillEnqueues is preserved as a
backcompat negative pin guarding against a future HasCapability gate being
accidentally re-introduced.

axis 3 - drain gating (registry.go lines 814-827)

Verified the Heartbeat drain guard: it runs SELECT max_concurrent_tasks FROM
workspaces, then checks payload.ActiveTasks < maxConcurrent before calling
drainQueue. The native_session SDK reports ActiveTasks=1 while in a turn and
ActiveTasks=0 when idle; the next heartbeat after idle triggers
DrainQueueForWorkspace.

No double-dispatch race: enqueue fires while SDK is busy; queued item sits
until SDK drops ActiveTasks below maxConcurrent; drain fires and ProxyA2ARequest
re-dispatches the single queued item.
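The sequence above can be walked through with a small in-memory model. This is a sketch under stated assumptions, not the platform code: the `workspace` type and its methods are hypothetical, and draining one item per qualifying heartbeat is an assumption made to keep the example small.

```go
package main

import "fmt"

// workspace is a hypothetical in-memory model of one target workspace.
type workspace struct {
	maxConcurrent int
	queue         []string // pending A2A items, oldest first
	dispatched    []string // items handed back to the agent
}

// enqueue models the platform queuing a request that bounced as busy.
func (w *workspace) enqueue(item string) {
	w.queue = append(w.queue, item)
}

// heartbeat models the gate: when the reported active task count is
// below the limit, one queued item is dispatched (assumed batch size).
func (w *workspace) heartbeat(activeTasks int) {
	if activeTasks < w.maxConcurrent && len(w.queue) > 0 {
		w.dispatched = append(w.dispatched, w.queue[0])
		w.queue = w.queue[1:]
	}
}

func main() {
	w := &workspace{maxConcurrent: 1}

	w.enqueue("cron-fire-1") // arrives while the SDK is in a turn
	w.heartbeat(1)           // busy: nothing drains
	w.heartbeat(1)           // still busy: nothing drains
	fmt.Println(len(w.dispatched), len(w.queue)) // 0 1

	w.heartbeat(0) // idle: the queued item is dispatched once
	fmt.Println(w.dispatched, len(w.queue)) // [cron-fire-1] 0

	w.heartbeat(0) // queue is empty, so nothing is dispatched twice
	fmt.Println(len(w.dispatched)) // 1
}
```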

axis 4 - non-native backcompat

Non-native path is unchanged: EnqueueA2A fires on busy and, if the enqueue
itself fails, falls back to a 503 with Retry-After + busy=true. Negative test
pins this. The native_session marker is gone from all response shapes.

axis 5 - CI / clean diff

CI / all-required GREEN. 65+/60- across 2 files, single focused refactor.
No conflicts with main.


agent-dev-a approved these changes 2026-05-23 00:47:41 +00:00

agent-dev-a left a comment

5-axis review — APPROVED (second eyes). Verified independently across all axes.

Correctness ✅

The HasCapability(..., "session") bypass removal is the correct fix for #1684. The assumption that native_session SDKs own an inbound queue was empirically wrong — claude-agent-sdk, codex app-server, and hermes-agent all accept new turns only via the same HTTP POST that returns busy.
Collapsing to a single enqueue path eliminates the 503-every-tick cron starvation that Reno Stars observed (12 lost fires over 6h).
Independent verification of registry.go:814-827: Heartbeat reads max_concurrent_tasks then gates drain with payload.ActiveTasks < maxConcurrent. This is exactly the session-ended signal — native_session SDKs report ActiveTasks=1 while in a turn, 0 when idle, and the next heartbeat after idle triggers DrainQueueForWorkspace. The concern that "drain timing has no relationship to SDK readiness" is therefore unfounded.

Robustness ✅

TestHandleA2ADispatchError_NativeSession_NowEnqueues positively pins that native_session now enqueues: it EXPECTS the INSERT INTO a2a_queue query, forces it to fail, and asserts the 503 fallback still fires WITHOUT the native_session=true marker. If a future refactor re-introduces the bypass, sqlmock fails because the expected INSERT never runs.
TestHandleA2ADispatchError_NoNativeSession_StillEnqueues is preserved as a negative pin guarding against an accidental HasCapability gate re-introduction.
context.WithoutCancel(ctx) in registry.go:824 means the drain goroutine outlives the heartbeat handler return — this is pre-existing and correct.
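For readers unfamiliar with `context.WithoutCancel` (standard library, Go 1.21+), a minimal standalone illustration of why a detached context survives the handler's return follows. The helper name `detachedErrs` is hypothetical, and this is not the code from registry.go.

```go
package main

import (
	"context"
	"fmt"
)

// detachedErrs cancels a parent context and returns the Err() of the
// parent and of a context detached from it with context.WithoutCancel.
func detachedErrs() (parentErr, detachedErr error) {
	// reqCtx stands in for the heartbeat request context.
	reqCtx, cancel := context.WithCancel(context.Background())

	// The detached context keeps the parent's values but ignores its
	// cancellation and deadline.
	drainCtx := context.WithoutCancel(reqCtx)

	cancel() // the handler returns and the request context is cancelled

	return reqCtx.Err(), drainCtx.Err()
}

func main() {
	parentErr, detachedErr := detachedErrs()
	fmt.Println(parentErr)   // context canceled
	fmt.Println(detachedErr) // <nil>
}
```

Work started with the detached context keeps running after the request ends, so it needs its own timeout or shutdown signal if it must be bounded.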

Security ✅

Removing the special-case bypass simplifies the security surface (fewer branches).
The native_session=true marker in the 503 response body is gone — this was minor info-disclosure about target capabilities to callers; its removal is a small hardening win.

Performance ✅

One additional DB INSERT per busy native_session call. Busy is the exception path, so this is negligible.
Drain is already async (globalGoAsync) and gated, so no hot-path blocking.

Readability ✅

The rewritten comment block in a2a_proxy_helpers.go is excellent: it narrates the original assumption, the empirical disproof (Reno Stars), the correct model (heartbeat-gated drain), and the fix rationale. Future maintainers will immediately understand why the bypass was removed.
Test names (_NowEnqueues / _StillEnqueues) are unambiguous.

Overall: clean, minimal, targeted refactor that fixes a real production outage. Ship it.


hongming merged commit 2357aec4bf into main

2026-05-23 00:50:10 +00:00

hongming referenced this issue from a commit

2026-05-23 00:50:11 +00:00

fix(scheduler): #1684 — native_session adapters now use platform a2a_queue (unblock Reno Stars cron starvation) (#1685)

hongming referenced this pull request

2026-05-23 00:50:33 +00:00

Molecule Platform — Bug Report: `workspace agent busy — adapter handles retry (native_session)` cron starvation #1684

hongming referenced this pull request

2026-05-23 01:01:08 +00:00

[main-red] molecule-ai/molecule-core: def18f28fa #1638

agent-dev-a reviewed 2026-05-23 01:28:26 +00:00

agent-dev-a left a comment

Review (post-merge retrospective)

Verdict: LGTM — correct fix, well tested, low risk.

Correctness: The rationale is solid. The original native_session gate assumed SDKs owned an inbound queue, but claude-agent-sdk / codex / hermes don't — they receive turns via the same HTTP POST that returns busy. Dropping the gate and relying on heartbeat-gated drain (ActiveTasks < maxConcurrent) is the right back-pressure mechanism.

Tests: Good coverage. The positive pin (_NowEnqueues) forces the INSERT and asserts the 503 fallback on queue failure. The negative pin (_StillEnqueues) guards against accidental reversion. Both assert the native_session marker is gone.

Risk: Low. The heartbeat→drain path is existing infrastructure; this PR just removes an incorrect short-circuit. Agree with author that a fallback ceiling (for SDK-never-returns) is a sensible follow-up.

Nonce: review-1685-pm-tick-23T0117Z


hongming referenced this pull request

2026-05-23 02:01:28 +00:00

[main-red] molecule-ai/molecule-core: def18f28fa #1638

RenoStarsAI-production-client referenced this pull request

2026-05-25 00:52:03 +00:00

API safety: `DELETE /workspaces/{id}` needs confirmation header + restorable soft-delete (incident: accidentally wiped own workspace via probe loop) #1823

agent-researcher referenced this pull request

2026-05-26 04:19:50 +00:00

Molecule Platform — Bug Report: `workspace agent busy — adapter handles retry (native_session)` cron starvation #1684
