forked from molecule-ai/molecule-core
The naive mutex-with-TryLock pattern in RestartByID was silently dropping
the second of two close-together restart requests. SetSecret and SetModel
both fire `go restartFunc(...)` from their HTTP handlers, and both DB
writes commit before either restart goroutine reaches loadWorkspaceSecrets.
If the second goroutine arrives while the first holds the per-workspace
mutex, TryLock returns false and the second is logged-and-dropped:
Auto-restart: skipping <id> — restart already in progress
The first goroutine's loadWorkspaceSecrets ran before the second write
committed, so the new container boots without that env var. Surfaced
during the RFC #2251 V1.0 measurement as hermes returning "No LLM
provider configured" when MODEL_PROVIDER landed after the API-key write
and lost its restart to the mutex (HERMES_DEFAULT_MODEL absent →
start.sh fell back to nousresearch/hermes-4-70b → derived
provider=openrouter → no OPENROUTER_API_KEY → request-time error).
The same race hits any back-to-back secret/model save flow including
the canvas's "set MiniMax key + pick model" UX.
Fix: pending-flag / coalescing pattern. Any restart request that arrives
while one is in flight sets `pending=true` and returns. The in-flight
runner, on completion, checks the flag and runs another cycle. This
collapses N concurrent requests into at most 2 sequential cycles (the
current one + one more that picks up everyone who arrived during it),
while guaranteeing the final container always sees the latest secrets.
Concrete contract:
- 1 request, no concurrency: 1 cycle
- N concurrent requests during 1 in-flight cycle: 2 cycles total
- N sequential requests (no overlap): N cycles
- Per-workspace state — different workspaces never serialize
Coalescing is extracted into `coalesceRestart(workspaceID, cycle func())`
so the gate logic is testable without the full WorkspaceHandler / DB /
provisioner stack. RestartByID now wraps that with the production cycle
function. runRestartCycle calls provisionWorkspace SYNCHRONOUSLY (drops
the historical `go`) so the loop's pending-flag check happens AFTER the
new container is up — without that, the next cycle's Stop call would
race the previous cycle's still-spawning provision goroutine.
sendRestartContext stays async; it's a one-way notification.
Tests in workspace_restart_coalesce_test.go cover all five contract
points + race-detector clean over 10 iterations:
- Single call → 1 cycle
- 5 concurrent during in-flight → exactly 2 cycles total
- 3 sequential → 3 cycles
- Pending-during-cycle picked up (targeted bug repro)
- State cleared after drain (running flag reset)
- Per-workspace isolation (no cross-workspace serialization)
Refs: molecule-core#2256 (V1.0 gate measurement); root cause for the
"No LLM provider configured" symptom seen during hermes/MiniMax repro.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| cmd/server | ||
| internal | ||
| migrations | ||
| pkg/provisionhook | ||
| .ci-force | ||
| .gitignore | ||
| .golangci.yaml | ||
| Dockerfile | ||
| Dockerfile.tenant | ||
| entrypoint-tenant.sh | ||
| go.mod | ||
| go.sum | ||