feat(workspace-server): pre-restart A2A drain signal (core#125) #207
Reference: molecule-ai/molecule-core#207
Summary
- Targets with `provides_native_session=True` bypass `a2a_queue` buffering; messages go directly to the SDK session. When `stopForRestart` fires, the container dies and all in-flight requests are dropped with no recovery path.
- Before `stopForRestart`, the platform sends a `POST /signals/restart_pending` JSON-RPC signal to the workspace agent. The agent drains in-flight work and acknowledges. The platform proceeds with the stop whether or not the signal succeeds (fire-and-forget with a 10s timeout, so the restart degrades gracefully).
Changes
`internal/handlers/restart_signals.go` (new)
- `gracefulPreRestart(ctx, workspaceID)`: Sends the pre-restart signal via HTTP POST to the workspace agent URL. Runs in a detached goroutine with its own 10s timeout. Logs the acknowledgment, or proceeds on failure.
- `resolveAgentURLForRestartSignal(ctx, workspaceID)`: Resolves the workspace agent URL from the Redis cache, falling back to the DB.
- `rewriteForDocker(agentURL, workspaceID)`: Rewrites `127.0.0.1:port` URLs to Docker-DNS form (`ws-<id>:8000`) when the platform runs inside Docker.
`internal/handlers/workspace_restart.go`
- `runRestartCycle`: Calls `gracefulPreRestart` before `stopForRestart`. All restart paths (HTTP `/restart`, programmatic `RestartByID`) flow through `runRestartCycle`, so the fix covers both.
`internal/handlers/restart_signals_test.go` (new)
Phase 2 note
Phase 2 (workspace SDK side) requires the workspace adapter to implement the `/signals/restart_pending` endpoint: receive the JSON-RPC signal, pause new tool calls, wait `drain_seconds`, then return 200. Until then, the platform proceeds without graceful drain (same as pre-fix behaviour).
Test plan
- `TestRewriteForDocker_NonDockerHostUrlUnchanged`
- `TestRewriteForDocker_LocalhostUrlUnchanged_NoProvisioner`
- `TestRewriteForDocker_LocalhostUrlRewritten`
- `TestResolveAgentURLForRestartSignal_CacheHit`
- `TestResolveAgentURLForRestartSignal_CacheMiss`
- `TestResolveAgentURLForRestartSignal_DBError`
- `TestGracefulPreRestart_Success`
- `TestGracefulPreRestart_NotImplemented`
- `TestGracefulPreRestart_ConnectionRefused`
- `TestGracefulPreRestart_URLResolutionError`
- `go test -race ./workspace-server/internal/handlers/...`
🤖 Generated with Claude Code
[core-lead-agent] LGTM. Closes core#125 (in-flight A2A messages lost on native_session container restart). New restart_signals.go (155 lines) + 330-line test file + 12-line wiring in workspace_restart.go. Pre-restart drain signal pattern. tier:medium.
[core-lead-agent] Re-approving.
[core-lead-agent] Re-approving.