molecule-core/workspace-server
Hongming Wang f088090b27 fix(restart): coalesce concurrent restart requests via pending flag
The naive mutex-with-TryLock pattern in RestartByID was silently dropping
the second of two close-together restart requests. SetSecret and SetModel
both fire `go restartFunc(...)` from their HTTP handlers, and both DB
writes commit before either restart goroutine reaches loadWorkspaceSecrets.
If the second goroutine arrives while the first holds the per-workspace
mutex, TryLock returns false and the second is logged-and-dropped:

    Auto-restart: skipping <id> — restart already in progress

The first goroutine's loadWorkspaceSecrets ran before the second write
committed, so the new container boots without that env var. Surfaced
during the RFC #2251 V1.0 measurement as hermes returning "No LLM
provider configured" when MODEL_PROVIDER landed after the API-key write
and lost its restart to the mutex (HERMES_DEFAULT_MODEL absent →
start.sh fell back to nousresearch/hermes-4-70b → derived
provider=openrouter → no OPENROUTER_API_KEY → request-time error).
The same race hits any back-to-back secret/model save flow including
the canvas's "set MiniMax key + pick model" UX.

Fix: pending-flag / coalescing pattern. Any restart request that arrives
while one is in flight sets `pending=true` and returns. The in-flight
runner, on completion, checks the flag and runs another cycle. This
collapses N concurrent requests into at most 2 sequential cycles (the
current one + one more that picks up everyone who arrived during it),
while guaranteeing the final container always sees the latest secrets.

Concrete contract:

  - 1 request, no concurrency:        1 cycle
  - N concurrent requests during 1 in-flight cycle: 2 cycles total
  - N sequential requests (no overlap):              N cycles
  - Per-workspace state — different workspaces never serialize

Coalescing is extracted into `coalesceRestart(workspaceID, cycle func())`
so the gate logic is testable without the full WorkspaceHandler / DB /
provisioner stack. RestartByID now wraps that with the production cycle
function. runRestartCycle calls provisionWorkspace SYNCHRONOUSLY (drops
the historical `go`) so the loop's pending-flag check happens AFTER the
new container is up — without that, the next cycle's Stop call would
race the previous cycle's still-spawning provision goroutine.
sendRestartContext stays async; it's a one-way notification.

Tests in workspace_restart_coalesce_test.go cover all five contract
points + race-detector clean over 10 iterations:

  - Single call → 1 cycle
  - 5 concurrent during in-flight → exactly 2 cycles total
  - 3 sequential → 3 cycles
  - Pending-during-cycle picked up (targeted bug repro)
  - State cleared after drain (running flag reset)
  - Per-workspace isolation (no cross-workspace serialization)

Refs: molecule-core#2256 (V1.0 gate measurement); root cause for the
"No LLM provider configured" symptom seen during hermes/MiniMax repro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:31:56 -07:00
..
cmd/server feat(runtime): native_scheduler skip — primitive #3 of 6 2026-04-26 22:47:00 -07:00
internal fix(restart): coalesce concurrent restart requests via pending flag 2026-04-28 23:31:56 -07:00
migrations chore: second-pass review polish — symmetry + clearer test fixtures 2026-04-25 08:48:30 -07:00
pkg/provisionhook feat(#1957): wire gh-identity plugin into workspace-server 2026-04-24 15:01:41 +00:00
.ci-force chore: force Platform(Go) CI run on main — validate go vet clean 2026-04-21 15:43:19 +00:00
.gitignore feat(ws-server): pull env from CP on startup 2026-04-19 02:41:15 -07:00
.golangci.yaml chore(workspace-server): add golangci.yaml disabling errcheck 2026-04-24 07:16:54 +00:00
Dockerfile chore: extract ContextMenu Zustand fix + a2a_proxy local-docker SSRF bypass + workspace-server Dockerfile GID entrypoint 2026-04-22 20:00:16 -07:00
Dockerfile.tenant feat(terminal): remote path via aws ec2-instance-connect + pty 2026-04-21 18:13:29 -07:00
entrypoint-tenant.sh fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155) 2026-04-20 23:51:33 +00:00
go.mod chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave 2026-04-28 16:25:46 -07:00
go.sum chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave 2026-04-28 16:25:46 -07:00