feat(workspace): preserve in-flight A2A messages across container restart #125
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
When
restartFuncfires (plugin install with cold-restart-required diff, settings change, manual restart, etc.), in-flight A2A messages can be dropped rather than buffered through the restart window.Root cause: claude-code adapter declares
RuntimeCapabilities.provides_native_session=True, which tells the platform to skip a2a_queue buffering and dispatch directly to the SDK session. But the SDK session is in-process — it doesn't survivedocker restart. Net result: the SDK was supposed to own continuity, but a container restart kills the process before continuity can engage.Why this matters
Reno-Stars iteration: most plugin updates are SKILL-content-only (covered by core#112 hot-reload classifier — no restart) but cold-restart-required updates (hooks/settings/plugin.yaml/new files) still drop ~5-10s of A2A traffic. Same gap exists for any other restart path (canvas restart button, secret change auto-restart, EC2 reprovision).
Proposed approach (3 options, increasing scope)
Pre-restart drain (small) — platform issues a
restart_pendingsignal to the workspace via existing message channel; adapter has 2-3s to finish or park in-flight work; THEN platform firesdocker restart. Requires adding the signal handler in adapter.py.Tier-1 buffering window (medium) — temporarily flip
provides_native_session=Falsesemantics for the restart-window duration. Platform's a2a_queue holds messages until new session re-registers. Requires platform-side state machine + adapter coordination on the resume-handshake.SDK session persistence (large) — write
ClaudeSDKClientstate to disk before SIGTERM, restore from disk on next boot. Touches SDK internals; may need claude-agent-sdk upstream change. Highest fidelity, biggest blast radius.My recommendation: start with option 1. Cheapest; meaningful improvement; doesn't preclude option 2 layering on later. Option 3 is probably never worth it.
Acceptance criteria (option 1 scope)
restart_pendingsignal: pause new tool calls, wait for in-flight to complete (with cap, e.g. 5s), then ack.docker restart.Out of scope
Refs
feedback_pending_flag_for_concurrent_writes(similar pattern: pause + drain on conflicting writes)RuntimeCapabilities.provides_native_session=Truedeclaration