feat(workspace): preserve in-flight A2A messages across container restart #125

claude-ceo-assistant · 2026-05-08T16:04:38Z

Problem

When restartFunc fires (plugin install with cold-restart-required diff, settings change, manual restart, etc.), in-flight A2A messages can be dropped rather than buffered through the restart window.

Root cause: the claude-code adapter declares RuntimeCapabilities.provides_native_session=True, which tells the platform to skip a2a_queue buffering and dispatch directly to the SDK session. But the SDK session lives in-process, so it does not survive docker restart. Net result: the SDK is supposed to own continuity, but a container restart kills the process before continuity can engage.
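To make the failure mode concrete, here is a minimal sketch of the routing decision described above. Only the provides_native_session flag and the a2a_queue name come from this issue; the Router class, its fields, and simulate() are hypothetical stand-ins, not the platform's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RuntimeCapabilities:
    # Only the flag named in this issue; any other real fields are omitted.
    provides_native_session: bool = False


@dataclass
class Router:
    """Hypothetical stand-in for the platform's A2A dispatch decision."""
    caps: RuntimeCapabilities
    session_alive: Callable[[], bool]
    a2a_queue: List[str] = field(default_factory=list)
    delivered: List[str] = field(default_factory=list)
    dropped: List[str] = field(default_factory=list)

    def route(self, msg: str) -> None:
        if not self.caps.provides_native_session:
            self.a2a_queue.append(msg)   # buffered by the platform
        elif self.session_alive():
            self.delivered.append(msg)   # direct dispatch to the in-process session
        else:
            self.dropped.append(msg)     # no buffer and no live session: lost


def simulate(native: bool) -> Router:
    state = {"up": True}
    router = Router(RuntimeCapabilities(native), lambda: state["up"])
    router.route("before-restart")
    state["up"] = False                  # container restart kills the session
    router.route("during-restart")
    return router
```

With native=True the second message lands in dropped; with native=False both messages stay in a2a_queue, which is the behaviour the restart window currently lacks.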

Why this matters

Reno-Stars iteration: most plugin updates are SKILL-content-only (covered by core#112 hot-reload classifier — no restart) but cold-restart-required updates (hooks/settings/plugin.yaml/new files) still drop ~5-10s of A2A traffic. Same gap exists for any other restart path (canvas restart button, secret change auto-restart, EC2 reprovision).

Proposed approach (3 options, increasing scope)

1. Pre-restart drain (small): the platform issues a restart_pending signal to the workspace via the existing message channel; the adapter has 2-3s to finish or park in-flight work; only then does the platform fire docker restart. Requires adding the signal handler in adapter.py.
2. Tier-1 buffering window (medium): temporarily apply provides_native_session=False semantics for the duration of the restart window. The platform's a2a_queue holds messages until the new session re-registers. Requires a platform-side state machine plus adapter coordination on the resume handshake.
3. SDK session persistence (large): write ClaudeSDKClient state to disk before SIGTERM and restore it from disk on next boot. Touches SDK internals; may need an upstream change in claude-agent-sdk. Highest fidelity, biggest blast radius.

My recommendation: start with option 1. Cheapest; meaningful improvement; doesn't preclude option 2 layering on later. Option 3 is probably never worth it.
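A rough sketch of what the adapter side of option 1 might look like, assuming the adapter runs on asyncio. The DrainGate class, its method names, and the shape of the ack payload are all hypothetical; the issue only specifies the restart_pending signal and a short drain window.

```python
import asyncio


class DrainGate:
    """Hypothetical adapter-side helper: tracks in-flight work, drains on restart_pending."""

    def __init__(self) -> None:
        self._accepting = True
        self._in_flight = set()

    def submit(self, coro) -> bool:
        """Start work unless a restart is pending. Returns False if refused."""
        if not self._accepting:
            coro.close()  # never scheduled; close to avoid a "never awaited" warning
            return False
        task = asyncio.create_task(coro)
        self._in_flight.add(task)
        task.add_done_callback(self._in_flight.discard)
        return True

    async def on_restart_pending(self, cap_seconds: float = 3.0) -> dict:
        """Stop accepting work, wait up to cap_seconds for in-flight work, then ack."""
        self._accepting = False
        pending = set(self._in_flight)
        done, unfinished = set(), set()
        if pending:  # asyncio.wait rejects an empty set
            done, unfinished = await asyncio.wait(pending, timeout=cap_seconds)
        # Assumed ack shape; the real payload is not defined in this issue.
        return {"type": "restart_ack", "completed": len(done), "unfinished": len(unfinished)}


async def demo() -> dict:
    gate = DrainGate()

    async def handle(seconds: float) -> None:
        await asyncio.sleep(seconds)  # stands in for a tool call or A2A reply

    gate.submit(handle(0.01))
    gate.submit(handle(0.02))
    ack = await gate.on_restart_pending(cap_seconds=1.0)
    refused = not gate.submit(handle(0.01))
    return {**ack, "refused_after_pending": refused}
```

Work that is still running when the cap expires is only counted here, not parked; how to park it is left open, as it is in the proposal.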

Acceptance criteria (option 1 scope)

Adapter handles restart_pending signal: pause new tool calls, wait for in-flight to complete (with cap, e.g. 5s), then ack.
Platform waits for ack (or 5s timeout) before issuing docker restart.
Pre-restart wait time (from restart_pending to ack or timeout) is logged.
Test: send 5 A2A messages, fire restart mid-stream, verify all 5 arrive (either pre-restart or post-restart, never lost).
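The platform-side criteria above could be sketched roughly as follows, again assuming asyncio. The restart_with_drain function and its three callables (send_signal, wait_for_ack, do_restart) are hypothetical placeholders for whatever the existing message channel and restartFunc actually expose.

```python
import asyncio
import logging
import time

log = logging.getLogger("restart")


async def restart_with_drain(send_signal, wait_for_ack, do_restart, timeout: float = 5.0) -> dict:
    """Hypothetical flow: signal the adapter, wait for ack or timeout, log, then restart."""
    started = time.monotonic()
    await send_signal("restart_pending")
    try:
        await asyncio.wait_for(wait_for_ack(), timeout=timeout)
        acked = True
    except asyncio.TimeoutError:
        acked = False  # adapter did not ack in time; restart anyway
    waited = time.monotonic() - started
    log.info("pre-restart drain: acked=%s waited=%.3fs", acked, waited)
    await do_restart()
    return {"acked": acked, "waited": waited}


async def demo():
    events = []
    ack = asyncio.Event()

    async def send_and_ack(name: str) -> None:
        events.append(name)
        # Simulated adapter acks shortly after receiving the signal.
        asyncio.get_running_loop().call_later(0.01, ack.set)

    async def send_only(name: str) -> None:
        events.append(name)

    async def never_acks() -> None:
        await asyncio.sleep(3600)

    async def do_restart() -> None:
        events.append("docker restart")

    fast = await restart_with_drain(send_and_ack, ack.wait, do_restart, timeout=1.0)
    slow = await restart_with_drain(send_only, never_acks, do_restart, timeout=0.05)
    return events, fast, slow
```

In both runs the restart fires only after the signal has been sent and the wait has resolved, either by ack or by timeout, so an unresponsive adapter cannot block the restart indefinitely.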

Out of scope

Session-state persistence (option 3 — separate issue if/when needed).
Multi-restart fan-out coordination (separate concern).

Refs

core#112 — hot-reload classifier (eliminates the restart entirely for SKILL-content-only)
core#114 — atomic install (the restart that this issue addresses)
Saved memory: feedback_pending_flag_for_concurrent_writes (similar pattern: pause + drain on conflicting writes)
adapter.py RuntimeCapabilities.provides_native_session=True declaration

Reference: molecule-ai/molecule-core#125