Follow-up to #1685: native_session ceiling (Option A) — force checkpoint after N min so SDK-never-returns can't block cron forever #1690

Open
opened 2026-05-22 22:47:42 +00:00 by hongming · 1 comment
Owner

Why

PR #1685 (Option D from #1684's proposed fixes) makes native_session adapters queue at the platform level + drain on heartbeat-reported idle. That handles the common case: SDK eventually finishes its turn → reports idle → queue drains.

But if the SDK is genuinely stuck (deadlock, infinite loop, model API hung indefinitely, never returns from its current turn), the in-flight POST never completes, drain trigger never fires, queue grows forever.

This RFC adds Option A as a safety ceiling: force-checkpoint the active session after N min so a stuck SDK can't permanently block cron.

Proposal

A. native_session_ceiling_seconds per-adapter config

Default 1800s (30 min). Per-adapter override in runtime_config. Set to 0 to disable (back to current behavior).

B. Implementation in workspace-server/internal/scheduler/scheduler.go

On schedule fire: if target adapter has provides_native_session=True AND the in-flight POST has been running > native_session_ceiling_seconds, send a POST /<workspace_id>/control/checkpoint request that the SDK adapter handles by:

  1. Saving current state
  2. Returning from the in-flight POST cleanly (so drain can fire)
  3. Resuming on next turn from the saved checkpoint

Claude Agent SDK has cancel() / checkpoint() primitives that map to this. Codex CLI's app-server has turn/cancel. Hermes has a similar primitive.

C. Adapter-side handler

Workspace adapters that declare provides_native_session=True must also implement /control/checkpoint (returns 200 when the in-flight turn is cancelled cleanly, 409 if not steerable).

Acceptance

  • Native_session adapters that take > ceiling seconds get force-checkpointed instead of permanently blocking cron
  • Reno Stars scenario (#1684) would have force-checkpointed after 30 min instead of locking up for 6 hours
  • Setting ceiling=0 disables (parity with pre-PR#1685 behavior)
  • Documented in adapter-spec

Related

  • PR #1685 (Option D — the immediate enqueue fix)
  • #1684 (Reno Stars production report)

Owner

TBD. Touches scheduler + adapter-spec + workspace templates. Estimated 2-3d.

## Why PR #1685 (Option D from #1684's proposed fixes) makes native_session adapters queue at the platform level + drain on heartbeat-reported idle. That handles the common case: SDK eventually finishes its turn → reports idle → queue drains. But if the SDK is genuinely stuck (deadlock, infinite loop, model API hung indefinitely, never returns from its current turn), the in-flight POST never completes, drain trigger never fires, queue grows forever. This RFC adds Option A as a safety ceiling: force-checkpoint the active session after N min so a stuck SDK can't permanently block cron. ## Proposal ### A. `native_session_ceiling_seconds` per-adapter config Default 1800s (30 min). Per-adapter override in `runtime_config`. Set to 0 to disable (back to current behavior). ### B. Implementation in `workspace-server/internal/scheduler/scheduler.go` On schedule fire: if target adapter has `provides_native_session=True` AND the in-flight POST has been running > `native_session_ceiling_seconds`, send a `POST /<workspace_id>/control/checkpoint` request that the SDK adapter handles by: 1. Saving current state 2. Returning from the in-flight POST cleanly (so drain can fire) 3. Resuming on next turn from the saved checkpoint Claude Agent SDK has `cancel()` / `checkpoint()` primitives that map to this. Codex CLI's `app-server` has `turn/cancel`. Hermes has a similar primitive. ### C. Adapter-side handler Workspace adapters that declare `provides_native_session=True` must also implement `/control/checkpoint` (returns 200 when the in-flight turn is cancelled cleanly, 409 if not steerable). ## Acceptance - Native_session adapters that take > ceiling seconds get force-checkpointed instead of permanently blocking cron - Reno Stars scenario (#1684) would have force-checkpointed after 30 min instead of locking up for 6 hours - Setting ceiling=0 disables (parity with pre-PR#1685 behavior) - Documented in adapter-spec ## Related - PR #1685 (Option D — the immediate enqueue fix) - #1684 (Reno Stars production report) ## Owner TBD. Touches scheduler + adapter-spec + workspace templates. Estimated 2-3d.
Member

RCA — root cause

The current scheduler has a bounded platform-side fire timeout, but no native-session checkpoint/cancel ceiling. If a native-session SDK remains busy forever, the scheduler waits briefly, then records skipped fires; queued A2A drains only when the workspace later reports capacity, which never happens for a genuinely wedged SDK turn.

Evidence

  • workspace-server/internal/scheduler/scheduler.go:27 — platform fire timeout is 5 minutes, but it only bounds the outbound fire attempt.
  • workspace-server/internal/scheduler/scheduler.go:365 — busy workspaces enter a capacity deferral branch.
  • workspace-server/internal/scheduler/scheduler.go:370 — the deferral polls 12 times at 10s, about 2 minutes total.
  • workspace-server/internal/scheduler/scheduler.go:383 — after that, the schedule is recorded skipped if active_tasks >= maxConcurrent.
  • workspace-server/internal/handlers/a2a_proxy_helpers.go:78 — native-session busy work is queued and relies on heartbeat/drain when the SDK reports idle; no forced checkpoint path is present.
  • workspace-server/internal/handlers/workspace_restart.go:783 — a pre-restart drain signal exists, showing control signaling is possible, but it is tied to restarts rather than scheduler ceilings.

Suggested fix

Responsible repo is molecule-core: add a scheduler-owned native-session ceiling in workspace-server/internal/scheduler plus an adapter contract endpoint such as /control/checkpoint or cancel. The scheduler should distinguish ordinary short busy periods from a turn exceeding the configured ceiling, then request checkpoint/cancel and surface a specific activity/error if the adapter returns 409 or times out. Templates should implement the endpoint using their native primitive: Codex turn/interrupt/cancel, Claude SDK cancel/checkpoint, Hermes equivalent.

Confidence

High — current scheduler and A2A queue code show the exact missing forced-yield mechanism described by the RFC.

## RCA — root cause The current scheduler has a bounded platform-side fire timeout, but no native-session checkpoint/cancel ceiling. If a native-session SDK remains busy forever, the scheduler waits briefly, then records skipped fires; queued A2A drains only when the workspace later reports capacity, which never happens for a genuinely wedged SDK turn. ## Evidence - `workspace-server/internal/scheduler/scheduler.go:27` — platform fire timeout is 5 minutes, but it only bounds the outbound fire attempt. - `workspace-server/internal/scheduler/scheduler.go:365` — busy workspaces enter a capacity deferral branch. - `workspace-server/internal/scheduler/scheduler.go:370` — the deferral polls 12 times at 10s, about 2 minutes total. - `workspace-server/internal/scheduler/scheduler.go:383` — after that, the schedule is recorded skipped if `active_tasks >= maxConcurrent`. - `workspace-server/internal/handlers/a2a_proxy_helpers.go:78` — native-session busy work is queued and relies on heartbeat/drain when the SDK reports idle; no forced checkpoint path is present. - `workspace-server/internal/handlers/workspace_restart.go:783` — a pre-restart drain signal exists, showing control signaling is possible, but it is tied to restarts rather than scheduler ceilings. ## Suggested fix Responsible repo is `molecule-core`: add a scheduler-owned native-session ceiling in `workspace-server/internal/scheduler` plus an adapter contract endpoint such as `/control/checkpoint` or cancel. The scheduler should distinguish ordinary short busy periods from a turn exceeding the configured ceiling, then request checkpoint/cancel and surface a specific activity/error if the adapter returns 409 or times out. Templates should implement the endpoint using their native primitive: Codex `turn/interrupt`/cancel, Claude SDK cancel/checkpoint, Hermes equivalent. ## Confidence High — current scheduler and A2A queue code show the exact missing forced-yield mechanism described by the RFC.
Sign in to join this conversation.
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1690