molecule-core/workspace-server/internal
Hongming Wang 4eb09e2146 feat(platform,workspace): SDK-wedge detection + workspace_status ENUM
Heartbeat lies. The asyncio task that POSTs /registry/heartbeat lives
in its own process slot, so a workspace whose claude_agent_sdk has
wedged on `Control request timeout: initialize` keeps reporting
"online" — every chat send hangs the full 5-min platform deadline
even though the runtime is dead in the water. This commit teaches
the workspace to admit it's wedged and the platform to honor that
admission by flipping status → degraded.

Five layers, all in one commit because they share a contract:

1. Migration 043 — convert workspaces.status from free-form TEXT to
   a real `workspace_status` Postgres ENUM with the 6 values
   production code actually writes (provisioning, online, offline,
   degraded, failed, removed). Locks the value set; future typo
   writes error at the DB instead of silently storing rogue strings.
   Down migration reverts to TEXT and drops the type.

2. workspace-server/internal/models — `HeartbeatPayload` gains a
   `runtime_state string` field. Empty = healthy. Currently the only
   non-empty value the handler honors is "wedged"; future symptoms
   can extend without another migration.

3. workspace-server/internal/handlers/registry.go — `evaluateStatus`
   gains a wedge branch BEFORE the existing error_rate >= 0.5 path:
   if `RuntimeState=="wedged"` and currently online, flip to
   degraded and broadcast WORKSPACE_DEGRADED with the wedge sample
   error. Recovery (`degraded → online`) now requires BOTH
   error_rate < 0.1 AND runtime_state cleared, so a workspace still
   reporting wedged stays degraded even when its error count
   happens to be 0 (the wedge captures a runtime state, not an
   error count).

4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason`
   flag set when execute()'s catch block sees an error matching
   `_WEDGE_ERROR_PATTERNS` (currently just "control request
   timeout"). Sticky for the process lifetime; the SDK's internal
   client-process state is corrupted on this error and only a
   workspace restart (= new Python process = fresh module state)
   clears it. Helpers `is_wedged()` / `wedge_reason()` /
   `_reset_sdk_wedge_for_test()` exposed.

5. workspace/heartbeat.py — heartbeat body now layers on
   `_runtime_state_payload()` for both the happy path and the
   401-retry path. Lazy-imports claude_sdk_executor so non-Claude
   runtimes (where the module may not even be importable) keep
   working unchanged.

Canvas required no changes — `STATUS_CONFIG.degraded` was already
defined in design-tokens.ts (amber dot, "Degraded" label) and
WorkspaceNode.tsx already renders `lastSampleError` underneath the
status pill when status === "degraded". The existing wiring just
never fired because nothing was writing degraded in this code path.

Tests:
- 3 Go handler tests for the new transitions (online → degraded on
  wedged, degraded stays put while still wedged, degraded → online
  after wedge clears)
- 5 Python wedge-detector tests (default clean, mark sets flag,
  sticky-first-wins, execute() flips on Control request timeout,
  execute() does NOT flip on unrelated errors)
- Migration smoke-tested against the local dev DB (3 existing rows,
  all enum-compatible; migration applied cleanly, post-state has
  the column as workspace_status type and the index preserved)

Verified: 79 Python tests pass; full Go test suite passes; migration
applies clean on a real DB; reverse migration restores the column to
TEXT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:59:15 -07:00
..
artifacts chore: sync staging to main — 1188 commits, 5 conflicts resolved (#1743) 2026-04-23 18:30:18 +00:00
bundle fix(platform): unblock SaaS workspace registration end-to-end 2026-04-21 03:06:46 -07:00
channels feat(channels): first-class Lark/Feishu support via schema-driven config 2026-04-24 11:51:15 -07:00
crypto chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
db test: schema_migrations tracking — 4 cases (first boot, re-boot, mixed, down.sql filter) 2026-04-18 11:52:27 -07:00
envx chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
events chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
handlers feat(platform,workspace): SDK-wedge detection + workspace_status ENUM 2026-04-25 00:59:15 -07:00
metrics chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
middleware fix(test): rename duplicate TestCanvasOrBearer_WrongOrigin test at line 946 — resolves Platform(Go) CI compile error on PR #2040 2026-04-24 18:04:13 +00:00
models feat(platform,workspace): SDK-wedge detection + workspace_status ENUM 2026-04-25 00:59:15 -07:00
orgtoken fix: F1085 rm scope concat + GH#756 ValidateToken terminal guard + CI test fixes 2026-04-24 07:16:54 +00:00
plugins chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
provisioner Merge branch 'staging' into test/2026-04-23-regression-suite 2026-04-24 01:55:06 +00:00
registry fix(sweeper): emit WORKSPACE_PROVISION_FAILED so canvas updates UI 2026-04-20 20:38:41 -07:00
router merge(staging): resolve conflicts + fix 7 test regressions on top of #2061 2026-04-24 13:50:39 -07:00
scheduler fix(scheduler): prevent wedge on invalid UTF-8 + unbounded DB ops (#2026) 2026-04-24 11:00:47 -07:00
supervised chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
ws chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
wsauth chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00