molecule-core/workspace-server/internal
Hongming Wang f088090b27 fix(restart): coalesce concurrent restart requests via pending flag
The naive mutex-with-TryLock pattern in RestartByID was silently dropping
the second of two close-together restart requests. SetSecret and SetModel
both fire `go restartFunc(...)` from their HTTP handlers, and both DB
writes commit before either restart goroutine reaches loadWorkspaceSecrets.
If the second goroutine arrives while the first holds the per-workspace
mutex, TryLock returns false and the second is logged-and-dropped:

    Auto-restart: skipping <id> — restart already in progress

The first goroutine's loadWorkspaceSecrets ran before the second write
committed, so the new container boots without that env var. Surfaced
during the RFC #2251 V1.0 measurement as hermes returning "No LLM
provider configured" when MODEL_PROVIDER landed after the API-key write
and lost its restart to the mutex (HERMES_DEFAULT_MODEL absent →
start.sh fell back to nousresearch/hermes-4-70b → derived
provider=openrouter → no OPENROUTER_API_KEY → request-time error).
The same race hits any back-to-back secret/model save flow including
the canvas's "set MiniMax key + pick model" UX.

Fix: pending-flag / coalescing pattern. Any restart request that arrives
while one is in flight sets `pending=true` and returns. The in-flight
runner, on completion, checks the flag and runs another cycle. This
collapses N concurrent requests into at most 2 sequential cycles (the
current one + one more that picks up everyone who arrived during it),
while guaranteeing the final container always sees the latest secrets.

Concrete contract:

  - 1 request, no concurrency:        1 cycle
  - N concurrent requests during 1 in-flight cycle: 2 cycles total
  - N sequential requests (no overlap):              N cycles
  - Per-workspace state — different workspaces never serialize

Coalescing is extracted into `coalesceRestart(workspaceID, cycle func())`
so the gate logic is testable without the full WorkspaceHandler / DB /
provisioner stack. RestartByID now wraps that with the production cycle
function. runRestartCycle calls provisionWorkspace SYNCHRONOUSLY (drops
the historical `go`) so the loop's pending-flag check happens AFTER the
new container is up — without that, the next cycle's Stop call would
race the previous cycle's still-spawning provision goroutine.
sendRestartContext stays async; it's a one-way notification.

Tests in workspace_restart_coalesce_test.go cover all five contract
points + race-detector clean over 10 iterations:

  - Single call → 1 cycle
  - 5 concurrent during in-flight → exactly 2 cycles total
  - 3 sequential → 3 cycles
  - Pending-during-cycle picked up (targeted bug repro)
  - State cleared after drain (running flag reset)
  - Per-workspace isolation (no cross-workspace serialization)

Refs: molecule-core#2256 (V1.0 gate measurement); root cause for the
"No LLM provider configured" symptom seen during hermes/MiniMax repro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:31:56 -07:00
..
artifacts
bundle
channels feat(channels): first-class Lark/Feishu support via schema-driven config 2026-04-24 11:51:15 -07:00
crypto
db
envx
events test(handlers): introduce events.EventEmitter interface (#1814 partial) 2026-04-26 09:05:52 -07:00
handlers fix(restart): coalesce concurrent restart requests via pending flag 2026-04-28 23:31:56 -07:00
imagewatch feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114) 2026-04-26 13:36:26 -07:00
metrics
middleware merge: resolve staging conflicts (a2a_proxy + workspace_crud) 2026-04-26 10:43:22 -07:00
models feat(runtime): adapter-declared idle_timeout_override end-to-end 2026-04-26 22:38:01 -07:00
orgtoken fix: F1085 rm scope concat + GH#756 ValidateToken terminal guard + CI test fixes 2026-04-24 07:16:54 +00:00
plugins
provisioner fix(provisioner): treat "removal already in progress" as no-op success 2026-04-27 13:25:32 -07:00
registry fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart 2026-04-27 17:28:50 -07:00
router merge: resolve staging conflicts (a2a_proxy + workspace_crud) 2026-04-26 10:43:22 -07:00
scheduler feat(runtime): native_scheduler skip — primitive #3 of 6 2026-04-26 22:47:00 -07:00
supervised
ws
wsauth