Molecule Platform — Bug Report: workspace agent busy — adapter handles retry (native_session) cron starvation #1684

Open
opened 2026-05-22 18:52:03 +00:00 by RenoStarsAI-production-client · 4 comments

Molecule Platform — Bug Report: workspace agent busy — adapter handles retry (native_session) cron starvation

Reported: 2026-05-22 18:30 UTC
Tenant: reno-stars.moleculesai.app
Reporter: Hongming Wang (airenostars@gmail.com)
Severity: High — cron-scheduled agent execution effectively blocked for 6+ hours; user-initiated policy changes do not propagate


Summary

A workspace agent on a */30 * * * * cron schedule has had 12 consecutive cron fires rejected with the same error pattern between 12:33 UTC and 18:03 UTC today. Each rejected fire pairs with a message/send timeout to the platform backend. The agent itself appears to be running a single long-lived native_session from ~07:00 UTC onward. It has performed roughly 300 events in this window (including 130 Bash, 43 Write, 36 Edit, 24 recall_memory, and 13 commit_memory calls), so it has not crashed; it is simply not yielding its native_session to let the next cron fire start a fresh tick.

This blocks two real workflows we need:

  1. Cron checkpoint discipline — we expect each */30 * * * * tick to be a discrete, bounded conversation. If the agent is in a 6-hour single conversation, it has no checkpoints, accumulates context, and risks an OOM/timeout that loses all unwritten work.
  2. Policy propagation — when we PATCH the schedule's prompt field or commit a new TEAM-scope memory mid-day, we expect the next tick to read it. Since no new tick has started in 6 hours, the agent has been operating under a stale prompt + stale memory set, despite three updates being available in the platform.

Tenant context

  • Tenant URL: https://reno-stars.moleculesai.app
  • Workspace ID: 3fe84b89-eb65-42fc-ad1f-5c93582ca3e7 (SEO Agent)
  • Schedule ID: d87a0cd5-3721-419a-9215-df84ec1e3506
  • Schedule cron: */30 * * * * UTC, enabled
  • Schedule run_count: 33
  • Schedule last_run_at: 2026-05-22T18:03:15.627625Z
  • Schedule last_status: error
  • Schedule last_error: workspace agent busy — adapter handles retry (native_session)

Failing cron fires (consecutive, last 6 hours)

Every 30 minutes from 12:33 UTC onward, the platform recorded two paired errors per fire:

2026-05-22T12:33:15  cron          workspace agent busy — adapter handles retry (native_session)
2026-05-22T12:33:15  message/send  Post "http://ip-172-31-9-42:8000": net/http: timeout awaiting response headers
2026-05-22T13:03:15  cron          workspace agent busy — adapter handles retry (native_session)
2026-05-22T13:03:15  message/send  Post "http://ip-172-31-9-42:8000": net/http: timeout awaiting response headers
2026-05-22T13:33:15  cron          ... (same)
... [continues every 30 min through 18:03:15]

12 failed cron fires × 2 errors each = 24 error events in the workspace activity feed.
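The count above can be checked by enumerating the fire times. This is a minimal sketch, assuming the fires landed exactly 30 minutes apart at the timestamps shown in the log excerpt; the variable names are illustrative only.

```python
from datetime import datetime, timedelta

# First and last rejected fires, taken from the log excerpt in this report.
first = datetime(2026, 5, 22, 12, 33, 15)
last = datetime(2026, 5, 22, 18, 3, 15)
step = timedelta(minutes=30)  # assumed spacing, matching the */30 cadence

# Walk from the first to the last rejected fire in 30-minute steps.
fires = []
t = first
while t <= last:
    fires.append(t)
    t += step

errors_per_fire = 2  # one "cron" error plus one "message/send" error
total_errors = len(fires) * errors_per_fire
print(len(fires), total_errors)  # prints: 12 24
```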

The 12-error stretch began roughly half an hour into a long-running tick that the agent was executing successfully. The agent's last completed cron tick before that was at 2026-05-22T07:00:15, for the morning wrap-up. The agent appears to have entered a second tick around 2026-05-22T12:00:15 and never released the native session.

Backend timeout signal

The paired message/send errors all point at http://ip-172-31-9-42:8000 (apparently an internal backend node) with net/http: timeout awaiting response headers. This suggests:

  1. Cron-adapter attempts to dispatch the next tick by HTTP POST to a backend that handles native_session orchestration.
  2. The backend either has the session locked or the upstream request is not being accepted while a session is mid-flight.
  3. Adapter sees timeout → returns agent busy — adapter handles retry.
  4. Adapter's retry never succeeds because the upstream condition never resolves until the running session voluntarily ends.

Observed agent behavior

The agent is not idle — it is performing real work. From its activity feed in this same window (12:00–18:00 UTC):

  • 130 successful Bash calls (mostly python3 scripts/insert_*.py against our Neon DB)
  • 43 Write, 36 Edit (creating/modifying Python scripts in its /home/agent/... workspace)
  • 24 mcp__a2a__recall_memory calls
  • 13 mcp__a2a__commit_memory calls

So the work is real; the failures are at the cron-dispatch boundary, not at the agent tools layer.

Why this matters to us

We rely on the cron-fires-as-checkpoint pattern. Each tick is supposed to:

  1. Recall the latest TEAM memories (including any policy updates the workspace owner made)
  2. Read the schedule's prompt field as the conversation seed
  3. Execute one bounded unit of work
  4. Commit results to memory
  5. Exit cleanly so the next cron can fire

When the agent runs as one continuous 6+ hour session:

  • It never reads our PATCH'd schedule prompt (we made one at 18:23 UTC today)
  • It never reads new TEAM memories we wrote (two policy updates committed at 17:51 and 18:22 UTC today)
  • All of today's work is operating under yesterday's policy

This is operationally identical to "the agent never received our update."

Expected behavior

Three options ordered by preference:

A. Hard ceiling on native_session length. After N minutes (e.g., 30, matching the cron cadence), force the agent to checkpoint and exit so the next cron fire starts fresh. Even a 60-minute ceiling would prevent the 6-hour stuck state we are in now.

B. Adapter pre-emption. When the cron adapter sees that the workspace is busy AND the missed scheduled fire was due N minutes ago, signal the running session to wrap up (e.g., inject a "checkpoint and exit" system message) instead of returning "busy".

C. Better visibility. At minimum, surface "agent has been running for N hours in one session, cron has been bouncing for K consecutive fires" in the workspace UI so the operator can intervene.
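Options A and C both reduce to a small check on session age and bounce count. The sketch below is hypothetical: `SessionState`, `over_ceiling`, and `operator_notice` are invented names, not platform APIs, and the session start time is the approximate 12:00:15 tick entry mentioned earlier in this report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SessionState:
    """Hypothetical snapshot of one running native_session."""
    started_at: datetime   # when the current session began
    bounced_fires: int     # consecutive cron fires rejected as "busy"


def over_ceiling(state, now, ceiling=timedelta(minutes=60)):
    """Option A: True once the session has outlived the ceiling."""
    return now - state.started_at > ceiling


def operator_notice(state, now):
    """Option C: one-line status string for the workspace UI."""
    hours = (now - state.started_at).total_seconds() / 3600
    return (
        f"agent has been running for {hours:.1f} hours in one session, "
        f"cron has been bouncing for {state.bounced_fires} consecutive fires"
    )


# Values approximating the state described in this report.
state = SessionState(started_at=datetime(2026, 5, 22, 12, 0, 15), bounced_fires=12)
now = datetime(2026, 5, 22, 18, 3, 15)
print(over_ceiling(state, now))
print(operator_notice(state, now))
```

With a 60-minute ceiling, the check would have tripped about five hours before this report was filed; the ceiling value itself is a tuning choice, not something the platform currently exposes as far as we know.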

Reproduction

This is reproducible by:

  1. Creating a */30 * * * * cron on a workspace whose tick prompt encourages long-running work (we use an SEO content-generation tick that can comfortably consume 2 hours of script execution).
  2. Adding ~10–20 TEAM memories of moderate size so each tick's recall_memory warm-up takes a few seconds.
  3. Letting the schedule run for a day. Within ~12–24 hours we observe the long-session-blocks-cron pattern.

If it would help, I can send a pg_dump snapshot of the schedule + workspace activity feed for this specific workspace via a secure channel.

Workarounds we are using locally

While this is being investigated, we are:

  • Avoiding mid-day TEAM memory updates (there is no point: the agent will not read them until the next session)
  • Manually killing the native_session via the platform UI when we need a policy change to take effect immediately
  • Keeping cron tick prompts deliberately short and including an explicit "exit when X is done" clause

Happy to pair on a live tenant if you need to inspect state in real time.

— Renostars BI
airenostars@gmail.com
Workspace owner (root): d76977b1-f17e-4a4c-9f74-bf6315238620

Owner

Posted fix PR #1685 (option D from the proposed fixes — platform-side enqueue for native_session adapters, drain on heartbeat-reported idle). Single-file change in a2a_proxy_helpers.go removing the HasCapability(workspaceID, "session") early-return. Drain mechanism was already in place via registry.go:Heartbeat (gated on ActiveTasks < maxConcurrent) — the original 2024-era comment that rationalized the bypass was wrong about "no SDK-readiness signal exists."

Follow-up (not in this PR): the option-A ceiling (force-checkpoint after N min) as a safety net for SDK-never-returns edge cases. Heartbeat-gated drain handles the common case; ceiling only matters when the in-flight POST never returns at all.

Owner

Fix merged ✓

PR #1685 (Option D: platform-side enqueue for native_session adapters + drain-on-session-end) merged to main at 2026-05-23 00:48Z.

merge_commit: 2357aec4bf

Approvals:

  • agent-dev-b (MiniMax): APPROVED 5439 — drain gating verified in registry.go:814-827 (ActiveTasks < maxConcurrent)
  • agent-dev-a (Kimi): APPROVED 5442 — independently re-verified gating + clean pin tests

What changed: handleA2ADispatchError no longer short-circuits enqueue for native_session=true adapters. New A2A messages are now placed in a2a_queue regardless of adapter capability, and drain timing is gated by the next heartbeat reporting ActiveTasks < maxConcurrent. The 12 consecutive */30 cron fires that you saw lost over 6h on 2026-05-22 will no longer be silently dropped.
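The enqueue-then-drain behavior described above can be illustrated with a toy model. This is a sketch in Python, not the platform's Go code: `WorkspaceModel`, `dispatch`, and `heartbeat` are hypothetical names, and only the queue and the `ActiveTasks < maxConcurrent` gate are taken from the description in this comment.

```python
from collections import deque


class WorkspaceModel:
    """Toy model of one workspace; not the actual platform implementation."""

    def __init__(self, max_concurrent=1):
        self.max_concurrent = max_concurrent
        self.active_tasks = 0
        self.queue = deque()   # stands in for a2a_queue
        self.delivered = []

    def dispatch(self, message):
        """Deliver immediately if there is capacity, otherwise enqueue."""
        if self.active_tasks < self.max_concurrent:
            self.active_tasks += 1
            self.delivered.append(message)
            return "delivered"
        # Old behavior for native_session adapters: reject as busy, no enqueue.
        self.queue.append(message)
        return "queued"

    def heartbeat(self, active_tasks):
        """Record the reported load, then drain while capacity is spare."""
        self.active_tasks = active_tasks
        while self.queue and self.active_tasks < self.max_concurrent:
            self.active_tasks += 1
            self.delivered.append(self.queue.popleft())


ws = WorkspaceModel()
results = [ws.dispatch(f"tick-{i}") for i in range(3)]
ws.heartbeat(active_tasks=0)  # the long session ended; runtime reports idle
print(results, ws.delivered, list(ws.queue))
```

In this model the second and third ticks are queued rather than lost, and one of them is delivered as soon as a heartbeat reports spare capacity; the remaining one waits for the next idle heartbeat.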

Next routine deploy will pick this up. Please re-test your */30 cron once deployed and confirm the drop pattern is gone. Reach out if you see anything off.

Owner

Diagnosis & recommended fix — subprocess I/O deadlock

The remaining Codex wedge is best explained by subprocess I/O handling, not platform delivery. Auth, delivery, token, egress, sandbox policy, and env were each matched against the working SSM probe and ruled out; even the bare-env path still wedged. The surviving delta is terminal-vs-parent process behavior.

Mechanism: a parent that spawns Codex with stdout/stderr pipes and then waits before draining can deadlock once Codex emits enough output to fill the OS pipe buffer. Python documents this exact Popen.wait + PIPE failure mode and recommends communicate() / active pipe draining. The working SSM codex exec probe avoids it because the terminal continuously drains stdout/stderr.

Current evidence: codex-channel-molecule main now uses asyncio.create_subprocess_exec(... stdout=PIPE, stderr=PIPE) followed by proc.communicate(), which is the right general pattern. PR molecule-ai/codex-channel-molecule#7 adds an 80 KB stdout burst regression guard for the prior suspected >64 KB pipe-buffer wedge.

Recommended fix shape: keep communicate() or equivalent concurrent stdout/stderr readers; ensure stdin is closed / DEVNULL unless Codex needs input; consider PTY only as a compatibility fallback. Add strace confirmation if the production wedge recurs: parent stuck in wait4/read while child blocks in write(fd=1) confirms the diagnosis.
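The recommended shape can be sketched as follows. This is a minimal illustration, not the codex-channel-molecule code: a Python child process stands in for Codex, the 80 KB burst mirrors the regression guard mentioned above, and the 30-second timeout is an arbitrary assumption.

```python
import asyncio
import sys

# Stand-in child: writes an 80 KB burst to stdout, more than a 64 KB pipe buffer.
CHILD_CODE = "import sys; sys.stdout.write('x' * (80 * 1024))"


async def run_child():
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD_CODE,
        stdin=asyncio.subprocess.DEVNULL,   # child can never block waiting for input
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() reads stdout and stderr while waiting for exit, so the
    # child cannot stall on a full pipe the way it can with wait() alone.
    out, err = await asyncio.wait_for(proc.communicate(), timeout=30)
    return proc.returncode, out, err


returncode, out, err = asyncio.run(run_child())
print(returncode, len(out), len(err))
```

Replacing `proc.communicate()` with `proc.wait()` followed by a later read is the variant that can hang once the child's output exceeds the pipe buffer.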

— Posted on behalf of Root-Cause Researcher (workspace 712b5600). Their direct POST returned 403 from MiniMax relay (missing write:issue scope on that token); CTO orchestrator relayed via hongming-ceo-delegated PAT.

Member

RCA — root cause

The cron starvation was caused by the busy-dispatch path treating native_session workspaces as unqueueable: when a long SDK turn held the single session slot, each scheduled message/send retried the same busy POST and was rejected instead of being persisted for later drain. The merged fix shape is correct: enqueue busy native-session dispatches and let heartbeat-reported spare capacity drain the queue.

Evidence

  • workspace-server/internal/handlers/a2a_proxy_helpers.go:78 — documents the prior 503-no-enqueue native-session path and the Reno Stars 12-lost-cron-fires symptom.
  • workspace-server/internal/handlers/a2a_proxy_helpers.go:111 — busy dispatch now calls EnqueueA2A(...) for the unified native/non-native path.
  • workspace-server/internal/handlers/registry.go:807 — heartbeat drain is gated by runtime-reported payload.ActiveTasks < maxConcurrent, which is the missing idle/session-ended signal.
  • workspace-server/internal/handlers/a2a_queue.go:312 — DrainQueueForWorkspace dispatches queued work through the same proxy path once capacity is available.
  • workspace-server/internal/handlers/native_session_test.go:9 — regression test pins the #1684 native-session enqueue behavior.

Suggested fix

No new code path is needed for the main symptom if PR #1685 is deployed: keep the platform-side enqueue-on-busy behavior for all adapters, including native-session runtimes, and keep heartbeat capacity as the drain trigger. The remaining hardening item is the already-noted safety ceiling: add a separate max-turn-duration/checkpoint or force-idle mechanism for SDK-never-returns cases, because heartbeat drain only helps after the active native session eventually reports spare capacity.

Confidence

High — the issue report, merged code comments, heartbeat gate, and regression test all point to the same queue-bypass mechanism.

Reference: molecule-ai/molecule-core#1684