fix(workspace-server): SSOT-route container check + 422 on external runtimes (closes #10) #12
Closes #10. Phase 1 analysis at #10 (issuecomment-315). Two coupled fixes — SSOT consolidation + runtime-aware guard — applied per the full canonical Dev SOP.
## What changed

### 1. SSOT for "is this workspace's container running"
`findRunningContainer` in `internal/handlers/plugins.go` carried its own copy of the `cli.ContainerInspect` logic. It collapsed three distinct outcomes (running / definitively stopped / transient daemon error) into a single `""` return, the same as the bare-name return from a stopped container. Healthsweep's `Provisioner.IsRunning` handled the same input correctly (defensive: transient errors stay online). Two implementations, drift-prone.

New:
`provisioner.RunningContainerName(ctx, cli, workspaceID) (string, error)` is the canonical implementation with three documented outcomes:

- `("ws-<id>", nil)` → running, exec away
- `("", nil)` → genuinely not running (NotFound, Exited, …)
- `("", err)` → transient daemon error; caller picks policy

Both consumers now route through it:
- `Provisioner.IsRunning` (healthsweep): wraps `RunningContainerName` and preserves the prior `(true, err)`-on-transient contract.
- `PluginsHandler.findRunningContainer`: same, plus a distinct log line on the transient-error path so operators can triage daemon flake vs stopped container without re-reading the source.

### 2. Runtime-aware Install/Uninstall
`runtime='external'` workspaces (Phase 30: remote agents that pull plugins via `GET /workspaces/:id/plugins/:name/download`) have no local container by design. Without a guard, POST `/plugins` falls through to `findRunningContainer`, finds nothing, and returns the misleading 503 "workspace container not running." On this Mac that's 5 persona-agent workspaces stuck in that state right now.

New: at the start of Install and Uninstall,
`h.isExternalRuntime(id)` checks the runtime via the already-wired `runtimeLookup`. If `external`, return 422 plus a hint pointing at the download endpoint. Container-backed runtimes (claude-code, langgraph) fall through unchanged.

| runtime | behavior |
| --- | --- |
| `external` | 422 with `"hint": "external workspaces pull plugins via GET /workspaces/:id/plugins/:name/download"` |
| `claude-code` / `langgraph` / etc. | unchanged; `findRunningContainer` still gates on a real container |

## Why these two together
Fix (1) prevents future drift if the inspect logic ever needs to change; fix (2) eliminates the persistent split-state for external workspaces today. The two fixes are independent (either alone would be valuable), but bundled they close both the architectural smell and the user-visible bug. Reviewable in one sitting (~450 lines including tests).
## Tests (all PASS)
Plus full pre-existing test suites for handlers, provisioner, registry: all green.
## Mutation tests (verified, not just claimed)
Mutation 1: delete the runtime guard from `Install`. Expected: `TestPluginInstall_ExternalRuntime_Returns422` fails. ✓ FAILED as expected (the test caught the regression). Restored → PASS.
Mutation 2b: revert `findRunningContainer` to the parallel-impl shape (raw `h.docker.ContainerInspect`). Expected: the AST gate fails. ✓ FAILED as expected. Restored → PASS.
The AST gate's failure message points at the relevant issue and tells the next reviewer how to extend the contract instead of bypassing it.
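For readers unfamiliar with this kind of gate, here is a minimal sketch of a behavior-based check built on the standard `go/ast` package. `callsFunc` is a hypothetical helper, not the actual gate code; it shows the core idea of asserting "does this function route through `RunningContainerName`?" rather than diffing text.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// callsFunc reports whether function fn's body contains a call to a
// function or method named callee, anywhere in the body.
func callsFunc(src, fn, callee string) bool {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return false
	}
	found := false
	for _, decl := range file.Decls {
		fd, ok := decl.(*ast.FuncDecl)
		if !ok || fd.Name.Name != fn || fd.Body == nil {
			continue
		}
		ast.Inspect(fd.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			switch f := call.Fun.(type) {
			case *ast.Ident: // bare call: callee(...)
				if f.Name == callee {
					found = true
				}
			case *ast.SelectorExpr: // qualified call: pkg.callee(...)
				if f.Sel.Name == callee {
					found = true
				}
			}
			return true
		})
	}
	return found
}

func main() {
	good := "package h\nfunc findRunningContainer() { provisioner.RunningContainerName() }"
	bad := "package h\nfunc findRunningContainer() { cli.ContainerInspect() }"
	fmt.Println(callsFunc(good, "findRunningContainer", "RunningContainerName")) // true
	fmt.Println(callsFunc(bad, "findRunningContainer", "RunningContainerName"))  // false
}
```

Because the check is structural, refactors that keep routing through the canonical helper pass, while a re-introduced parallel implementation fails.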
## Anti-drift coverage
The two AST gates pin the SSOT invariant by behavior (does this function call `RunningContainerName`?), not by a name allowlist (per saved memory `feedback_behavior_based_ast_gates`). A future PR that adds a third consumer of the inspect logic doesn't break this gate; only a regression that re-introduces the parallel implementation does.

## Manual reproduction (this Mac)
Pre-fix, against the old workspace-server running on `:8080`: 5 workspaces sitting in the exact split-state from #10, `status=online` (correct, heartbeat-driven), no local container, every plugin install would 503. Post-fix, those return 422 with the corrective hint.

## Phase 2: Security check (the 4 questions)
`wsAuth` still gates Install/Uninstall; the 422 fires post-auth.

## Versioning + backwards compatibility
`provisioner.RunningContainerName` is additive (a new exported function). Existing callers of `Provisioner.IsRunning` see preserved behavior (verified by the AST gate plus the `Provisioner.IsRunning` tests in the existing suite).

## Documentation
- `~/.molecule-ai/handbook.md` §5 Troubleshooting gets a new entry, "Plugin install returns 503 'workspace container not running' but workspace shows online," covering both flavors (external runtime, container-backed race); committed locally on the operator host.
- `RunningContainerName`'s godoc documents the three outcomes and the SSOT motivation.
- `isExternalRuntime`'s godoc documents the fail-open policy.

## Rollout / rollback
`git revert` the merge OR drop the runtime guard.

## Out of scope (parked as separate issues)
- `if (status === 422)` branch against the real response.
- `ListInstalled` soft-fails to `[]` on a missing container; `Install`/`Uninstall` hard-fail. Bigger design conversation about which endpoints should be tolerant; not blocking #10.

## Hostile self-review: three weakest spots
No live-runtime test of the actual 422 response against a real running workspace-server. The 7 unit tests + AST gates exercise the handler in isolation; they don't verify the wired-up server returns 422 for an actual external workspace. Mitigated by: the 5 external workspaces in this DB will produce the symptom on the next operator-driven Install attempt; the manual repro above confirmed the pre-fix behavior; the unit tests pin the post-fix behavior at the same handler the router wires.
The `isExternalRuntime` guard fails open on lookup error: if Postgres flakes for a few seconds, plugin install on an external workspace would briefly fall through and 503 (today's behavior) instead of 422. That's the conservative choice (don't deny work on an infrastructure flake), but it does mean the bug temporarily resurfaces during DB issues. Acceptable trade-off; documented in the godoc.

The handbook update lives on the operator host, not in this repo: `docs/operations/` doesn't exist in molecule-core. If the operator-host file drifts (e.g., a box rebuild without restoring `~/.molecule-ai/`), the in-repo cross-reference goes stale. Same caveat as my prior PR (#8); longer-term we should add a `docs/operations/` mirror and a sync job.

🤖 Generated with Claude Code
Hongming-approved (chat 2026-05-07 'knock out core#8 + core#12'). Closes core#10 plugin install split-state (status=online but container not running). SSOT-route container check.