molecule-core/workspace-server/internal
Hongming Wang 46daae1ffb fix(provision): entry log + panic recovery on workspace provision goroutines
Issue #2486: 7 claude-code workspaces stuck in provisioning produced
NONE of the four documented exit-path log lines in
provisionWorkspaceCP — neither prepare-failed, nor start-failed, nor
persist-instance-id-failed, nor success. Operators couldn't tell
whether the goroutine ran at all.

Add an entry log at the top of provisionWorkspaceOpts +
provisionWorkspaceCP so a missing entry distinguishes "goroutine
never started" from "started but exited via an unlogged path."

Add logProvisionPanic at the same defer site so a panic inside
either provisioner doesn't (a) crash the whole workspace-server
process, taking every other tenant workspace with it, and (b)
silently leave the row in `provisioning` until the 10-min sweeper
fires. The recover persists status='failed' with a sanitized
panic-class message via a fresh 10s context (the goroutine's own
ctx may have been the one panicking).

Tests pin three contracts:
- no-op when no panic (otherwise every successful provision
  emits a spurious log line)
- recovers + persists failed status on panic, with stack trace
- defense-in-depth: if the persist itself fails, log it instead
  of leaving the operator with a recovered-panic log but no row

Regression-injected by neutering the recover() body — all three
tests fail until the recover + UPDATE path is restored.

This is observability + resilience only, not a root-cause fix
for #2486. The actual silent-drop class still needs reproduction
once the tenant is on a build that includes this entry log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:14:20 -07:00
..
artifacts
buildinfo feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy 2026-04-30 10:55:08 -07:00
bundle refactor(workspace-status): typed constants + AST-based drift gate 2026-04-30 10:41:41 -07:00
channels
crypto
db refactor(workspace-status): catch missed literal in workspace_bootstrap.go + add literal-drift gate 2026-04-30 10:51:01 -07:00
envx
events test(handlers): introduce events.EventEmitter interface (#1814 partial) 2026-04-26 09:05:52 -07:00
handlers fix(provision): entry log + panic recovery on workspace provision goroutines 2026-05-01 19:14:20 -07:00
imagewatch feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114) 2026-04-26 13:36:26 -07:00
metrics
middleware fix(tenant-guard): allowlist /buildinfo so redeploy verifier can reach it 2026-04-30 12:54:51 -07:00
models refactor(workspace-status): typed constants + AST-based drift gate 2026-04-30 10:41:41 -07:00
orgtoken
plugins
provisioner refactor(handlers): widen WorkspaceHandler.provisioner to LocalProvisionerAPI interface (#2369) 2026-04-30 09:18:16 -07:00
registry refactor(workspace-status): typed constants + AST-based drift gate 2026-04-30 10:41:41 -07:00
router feat(workspace-server): PUT /provider endpoint for explicit LLM provider (#196) 2026-04-30 22:25:48 -07:00
scheduler feat(runtime): native_scheduler skip — primitive #3 of 6 2026-04-26 22:47:00 -07:00
supervised
ws
wsauth refactor(wsauth): extract lookupTokenByHash to dedup auth predicate across 3 callers 2026-04-30 03:11:38 -07:00