workspace with failed register stays 'online' via heartbeat-backfill while canvas chat delivery starves silently #2530

Closed
opened 2026-06-10 11:36:32 +00:00 by core-devops · 1 comment
Member

**Live (agents-team concierge, 2026-06-10):** after a container re-create lost the saved workspace auth token, every boot logged `Register: HTTP 401 ... proceeding (heartbeat backfill is the recovery path)`. The workspace stayed **online** (heartbeats pass) and the canvas showed a green dot, but every canvas chat message sat in `queued for poll` forever (3 user messages, 30+ min). The user-visible symptom: agent looks healthy, never replies.

Two asks:

  1. **Status truth**: a workspace whose register persistently 401s should surface as `degraded` (same posture as `runtime_wedge`), not online; the canvas should hint a restart/credential repair.
  2. **Queued-chat delivery**: poll-mode queued messages only flushed when a turn was already active (push-channel injection); an idle agent never drained its queue even when healthy. Either the runtime needs an idle inbox poll, or the tenant-side heartbeat drain needs to handle the platform-agent's delivery mode. Related observation: heartbeat drain repeatedly logged `re-queued (target still busy)` even with `active_tasks=0`.
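The first ask could be sketched roughly as below. This is only an illustration of the proposed status rule, not the platform's code: `WorkspaceHealth`, `derive_status`, the `offline` status, and the failure threshold are all hypothetical names and values; only `online` and `degraded` come from this issue.

```python
from dataclasses import dataclass

# Hypothetical threshold: how many consecutive 401s count as "persistent".
PERSISTENT_REGISTER_FAILURES = 3


@dataclass
class WorkspaceHealth:
    """Hypothetical health snapshot for one workspace."""
    heartbeat_ok: bool
    consecutive_register_401s: int


def derive_status(health: WorkspaceHealth) -> str:
    """Return the status the canvas should show for this workspace."""
    if not health.heartbeat_ok:
        # "offline" is an assumed name for the no-heartbeat state.
        return "offline"
    if health.consecutive_register_401s >= PERSISTENT_REGISTER_FAILURES:
        # Heartbeats pass, but register keeps failing: do not report green.
        return "degraded"
    return "online"
```

With a rule like this, the concierge in the incident (heartbeats passing, register returning 401 on every boot) would have shown as `degraded` instead of a green dot.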

Recovery used live: delete stale `workspace_auth_tokens` rows (register then re-bootstraps) or full reprovision.

🤖 Generated with Claude Code

Author
Member

**Real root cause found (supersedes the poll-cadence theory):** the concierge workspace row carried `delivery_mode=poll`, a stale leftover from the old bespoke-era registration, and `resolveDeliveryMode` rule 2 ("existing row wins") perpetuates it on every re-register, so the proxy queued every canvas message forever (an idle poll-mode workspace has NO drain path). Healthy claude-code workspaces are `push`.
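To illustrate why the stale value never clears, here is a minimal model of rule 2 as described above. It is not the real `resolveDeliveryMode`: the function name `resolve_delivery_mode`, its signature, and the `push` fallback for new rows are assumptions, and the other rules are not modelled.

```python
from typing import Optional


def resolve_delivery_mode(existing: Optional[str], default: str = "push") -> str:
    """Simplified model: only rule 2 ("existing row wins") is represented."""
    if existing is not None:
        # Rule 2: whatever the row already carries is kept as-is.
        return existing
    # Assumed fallback for a brand-new row; the real rules are not shown here.
    return default


# Simulate three re-registers of a row that starts with a stale "poll".
mode = "poll"
for _ in range(3):
    mode = resolve_delivery_mode(mode)
```

In this model `mode` is still `poll` after any number of re-registers, which matches the observed behaviour: nothing in the register path ever has a chance to correct the row.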

Live fix applied: `UPDATE workspaces SET delivery_mode='push'` for the concierge; canvas chat now dispatches normally.

**Durable asks:** (1) `installPlatformAgent` / the #2508 seed-repair should reset `delivery_mode` along with status/runtime/tier when adopting a pre-existing row; (2) the original ask stands: an idle poll-mode workspace with queued messages needs SOME drain path (or poll-mode rows for runtimes that are push-capable should be self-healed at register, since rule 2 currently makes stale poll sticky forever).
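Both durable asks might look something like the sketch below. Everything here is hypothetical: `adopt_existing_row`, `resolve_with_self_heal`, the dict-based row shape, the `PUSH_CAPABLE_RUNTIMES` set, and the `poll` fallback for unknown runtimes are illustrative assumptions. The only facts taken from this issue are the field names and that healthy claude-code workspaces are `push`.

```python
from typing import Optional

# Assumption: claude-code is push-capable (the issue says healthy
# claude-code workspaces are push); other members are unknown.
PUSH_CAPABLE_RUNTIMES = {"claude-code"}

# Ask (1): fields reset when adopting a pre-existing row,
# now including delivery_mode.
ADOPTION_RESET_FIELDS = ("status", "runtime", "tier", "delivery_mode")


def adopt_existing_row(row: dict, seed: dict) -> dict:
    """Return a copy of `row` with the adoption fields taken from `seed`."""
    repaired = dict(row)
    for field in ADOPTION_RESET_FIELDS:
        repaired[field] = seed[field]
    return repaired


def resolve_with_self_heal(existing: Optional[str], runtime: str) -> str:
    """Ask (2), alternative form: heal stale poll rows at register time."""
    push_capable = runtime in PUSH_CAPABLE_RUNTIMES
    if existing == "poll" and push_capable:
        # Self-heal: a push-capable runtime should not stay on stale poll.
        return "push"
    if existing is not None:
        # Otherwise rule 2 still applies: existing row wins.
        return existing
    # Assumed default for new rows.
    return "push" if push_capable else "poll"
```

The self-heal check runs before rule 2 on purpose, so a stale `poll` on a push-capable runtime is corrected on the next register instead of being preserved.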

Reference: molecule-ai/molecule-core#2530