scheduler: workspace_schedules.last_status='ok' lies when adapter SDK throws internally (HTTP 200 != work done) #1696

Closed
opened 2026-05-23 00:56:07 +00:00 by hongming · 0 comments
Owner

Problem

workspace_schedules.last_status is set to 'ok' whenever the scheduled A2A POST to the workspace returns HTTP 200. Some runtime adapters always return 200, even when the inner LLM call throws. For example, claude-code-sdk raises Exception: Claude Code returned an error result: success on Max-plan rate-limit / window saturation, yet the adapter still answers 200. For those runtimes the platform records a success for what was actually a no-op.

Observed 2026-05-23 00:34Z – 00:50Z: pm-autonomous-tick (claude-code runtime on Max plan) fired 3 times. All 3 logged Scheduler: ... completed (HTTP 200), all 3 marked last_status='ok' in DB. ALL 3 actually surfaced Agent error (Exception) — see workspace logs for details in PM's chat tab. The schedule looked healthy in operator views; PM did zero work.

Why this matters

  • Operators can't see whether scheduled tasks are doing real work — last_status='ok' is a false-positive
  • consecutive_empty_runs only counts visibly-empty responses (per migration 032), not internal-errors-masked-as-200
  • Schedule-driven workloads on rate-limited runtimes silently no-op for hours until someone manually inspects logs

Proposed fix

The scheduler should inspect the response body, not just the HTTP status. The adapter contract should include a top-level field such as result_kind: 'ok'|'sdk_error'|'rate_limited'|'quota_exhausted' that scheduler.go::scheduleFireOne can parse. For the claude-code-sdk adapter specifically: when the SDK raises Exception: Claude Code returned an error result: success (subtype success with is_error=true), the adapter should emit result_kind: 'rate_limited' in the response body, and the scheduler should map that to last_status='rate_limited' (or 'sdk_error'), not 'ok'.
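A minimal sketch of the scheduler-side mapping, to make the proposal concrete. Only the result_kind field name and its four values come from this issue; the adapterResponse struct, the lastStatusFor helper, the 'error' status for non-200 responses, and the fallbacks for missing or unknown values are assumptions for illustration, not existing code in scheduler.go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// adapterResponse models only the proposed top-level field. The field name
// comes from this issue; the struct itself is hypothetical.
type adapterResponse struct {
	ResultKind string `json:"result_kind"`
}

// lastStatusFor derives workspace_schedules.last_status from the HTTP status
// and the response body. Hypothetical helper, not the real scheduleFireOne.
func lastStatusFor(httpStatus int, body []byte) string {
	if httpStatus != http.StatusOK {
		// Assumption: non-200 responses map to a generic "error" status.
		return "error"
	}
	var resp adapterResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		// Assumption: an unparseable body means a legacy adapter, so the
		// current HTTP-only behaviour is kept.
		return "ok"
	}
	switch resp.ResultKind {
	case "", "ok":
		// A missing field is treated like a legacy adapter (assumption).
		return "ok"
	case "rate_limited", "quota_exhausted", "sdk_error":
		return resp.ResultKind
	default:
		// Unknown kinds are surfaced as failures rather than hidden as "ok".
		return "sdk_error"
	}
}

func main() {
	body := []byte(`{"result_kind":"rate_limited"}`)
	fmt.Println(lastStatusFor(http.StatusOK, body)) // prints "rate_limited"
}
```

Treating a missing field as 'ok' keeps adapters that have not adopted the contract working unchanged. The cost is that such adapters can still mask errors until they are updated.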

Second-order: when a schedule hits rate_limited N times in a row, the scheduler should auto-disable it and emit an activity_logs entry so the failure surfaces in the workspace chat (it is currently silent).
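The auto-disable rule could look roughly like the sketch below. Everything here is hypothetical: scheduleState is an in-memory stand-in for one workspace_schedules row, ConsecutiveRateLimited is not an existing column, the threshold of 3 is an arbitrary choice for N, and the returned string stands in for whatever would be written to activity_logs.

```go
package main

import "fmt"

// scheduleState is a hypothetical in-memory stand-in for one
// workspace_schedules row; ConsecutiveRateLimited is not an existing column.
type scheduleState struct {
	Name                   string
	Enabled                bool
	ConsecutiveRateLimited int
}

// maxConsecutiveRateLimited is the "N" from the proposal; 3 is arbitrary.
const maxConsecutiveRateLimited = 3

// recordRun updates the streak after one scheduled run. It returns the
// message to log when the schedule gets auto-disabled, and "" otherwise.
func recordRun(s *scheduleState, lastStatus string) string {
	if lastStatus != "rate_limited" {
		// Any other outcome breaks the streak.
		s.ConsecutiveRateLimited = 0
		return ""
	}
	s.ConsecutiveRateLimited++
	if s.Enabled && s.ConsecutiveRateLimited >= maxConsecutiveRateLimited {
		s.Enabled = false
		return fmt.Sprintf("schedule %q auto-disabled after %d consecutive rate_limited runs",
			s.Name, s.ConsecutiveRateLimited)
	}
	return ""
}

func main() {
	s := &scheduleState{Name: "pm-autonomous-tick", Enabled: true}
	for i := 0; i < 3; i++ {
		if msg := recordRun(s, "rate_limited"); msg != "" {
			fmt.Println(msg)
		}
	}
}
```

Resetting the counter on any non-rate_limited status means only an unbroken streak disables the schedule, so a single saturated window does not take it offline.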

Workaround applied 2026-05-23 00:54Z

  • UPDATE workspace_schedules SET enabled=false WHERE name='pm-autonomous-tick' (PM no longer hit while CC Max plan window is saturated)
  • Kimi + MiniMax cron bumped to */5 * * * * (was */10 staggered) — they have independent token-plan capacity and absorb PM's scheduled workload

Related

  • reference_claude_code_prod_chat_blocked_oauth_org_not_allowed_not_image (Max plan window saturation = 429 surfacing as SDK success exception)
  • migration 032 consecutive_empty_runs counter — same family of issue, but only catches empty responses not SDK errors
Reference: molecule-ai/molecule-core#1696