molecule-core/workspace
Hongming Wang 4eb09e2146 feat(platform,workspace): SDK-wedge detection + workspace_status ENUM
Heartbeat lies. The asyncio task that POSTs /registry/heartbeat lives
in its own process slot, so a workspace whose claude_agent_sdk has
wedged on `Control request timeout: initialize` keeps reporting
"online" — every chat send hangs the full 5-min platform deadline
even though the runtime is dead in the water. This commit teaches
the workspace to admit it's wedged and the platform to honor that
admission by flipping status → degraded.

Five layers, all in one commit because they share a contract:

1. Migration 043 — convert workspaces.status from free-form TEXT to
   a real `workspace_status` Postgres ENUM with the 6 values
   production code actually writes (provisioning, online, offline,
   degraded, failed, removed). Locks the value set; a mistyped status
   in a future write now errors at the DB instead of silently storing
   a rogue string.
   Down migration reverts to TEXT and drops the type.

2. workspace-server/internal/models — `HeartbeatPayload` gains a
   `runtime_state string` field. Empty = healthy. Currently the only
   non-empty value the handler honors is "wedged"; future symptoms
   can extend without another migration.

3. workspace-server/internal/handlers/registry.go — `evaluateStatus`
   gains a wedge branch BEFORE the existing error_rate >= 0.5 path:
   if `RuntimeState=="wedged"` and currently online, flip to
   degraded and broadcast WORKSPACE_DEGRADED with the wedge sample
   error. Recovery (`degraded → online`) now requires BOTH
   error_rate < 0.1 AND runtime_state cleared, so a workspace still
   reporting wedged stays degraded even when its error count
   happens to be 0 (the wedge captures a runtime state, not an
   error count).

4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason`
   flag set when execute()'s catch block sees an error matching
   `_WEDGE_ERROR_PATTERNS` (currently just "control request
   timeout"). Sticky for the process lifetime; the SDK's internal
   client-process state is corrupted on this error and only a
   workspace restart (= new Python process = fresh module state)
   clears it. Helpers `is_wedged()` / `wedge_reason()` /
   `_reset_sdk_wedge_for_test()` exposed.

5. workspace/heartbeat.py — heartbeat body now layers on
   `_runtime_state_payload()` for both the happy path and the
   401-retry path. Lazy-imports claude_sdk_executor so non-Claude
   runtimes (where the module may not even be importable) keep
   working unchanged.

Canvas required no changes — `STATUS_CONFIG.degraded` was already
defined in design-tokens.ts (amber dot, "Degraded" label) and
WorkspaceNode.tsx already renders `lastSampleError` underneath the
status pill when status === "degraded". The existing wiring just
never fired because nothing was writing degraded in this code path.

Tests:
- 3 Go handler tests for the new transitions (online → degraded on
  wedged, degraded stays put while still wedged, degraded → online
  after wedge clears)
- 5 Python wedge-detector tests (default clean, mark sets flag,
  sticky-first-wins, execute() flips on Control request timeout,
  execute() does NOT flip on unrelated errors)
- Migration smoke-tested against the local dev DB (3 existing rows,
  all enum-compatible; migration applied cleanly, post-state has
  the column as workspace_status type and the index preserved)
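
The three handler transitions above can be modelled as a small pure
function. This is a Python model of the decision described for the Go
`evaluateStatus`, not a port of it: the function name and signature are
illustrative, it assumes the existing error_rate >= 0.5 path also flips
online to degraded, and it assumes "cleared" means an empty
`runtime_state`.

```python
def evaluate_status(current: str, error_rate: float, runtime_state: str) -> str:
    """Return the next workspace status for one heartbeat."""
    wedged = runtime_state == "wedged"

    if current == "online":
        # Wedge branch is checked before the error-rate branch.
        if wedged:
            return "degraded"
        if error_rate >= 0.5:
            return "degraded"
    elif current == "degraded":
        # Recovery needs BOTH a low error rate AND a cleared runtime state.
        if error_rate < 0.1 and runtime_state == "":
            return "online"

    # Every other combination leaves the status unchanged.
    return current
```

A wedged workspace with an error rate of 0 therefore stays degraded,
because the wedge describes a runtime state and not an error count.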

Verified: 79 Python tests pass; full Go test suite passes; migration
applies cleanly on a real DB; reverse migration restores the column to
TEXT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:59:15 -07:00
adapters feat(workspace): migrate a2a-sdk from 0.3.x to 1.0.0 (KI-009) (#1974) 2026-04-24 04:43:17 +00:00
builtin_tools fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID 2026-04-24 19:54:43 -07:00
lib feat(workspace): pre-stop serialization for pause/resume (closes #1386) 2026-04-21 12:40:44 +00:00
molecule_audit chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
plugins_registry feat(plugin): implement MCPServerAdaptor (issue #847) 2026-04-24 01:42:13 +00:00
policies chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
scripts ci(gh-wrapper): translate --assignee @me → --label team:<role> 2026-04-24 00:34:21 -07:00
skill_loader chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
tests feat(platform,workspace): SDK-wedge detection + workspace_status ENUM 2026-04-25 00:59:15 -07:00
a2a_cli.py fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347) 2026-04-21 08:11:44 +00:00
a2a_client.py fix(a2a): review-driven hardening — prefix-anchored type check, error_detail cap, shared hint module 2026-04-24 23:47:44 -07:00
a2a_executor.py fix(a2a_executor): remove shadowing local Part import that broke streaming 2026-04-24 14:21:04 -07:00
a2a_mcp_server.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
a2a_tools.py fix(a2a): review-driven hardening — prefix-anchored type check, error_detail cap, shared hint module 2026-04-24 23:47:44 -07:00
adapter_base.py feat: platform instructions system with global/team/workspace scope 2026-04-22 15:17:14 -07:00
agent.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
agents_md.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
build-all.sh fix: update workspace script comments for workspace-template → workspace rename 2026-04-18 01:48:05 -07:00
claude_sdk_executor.py feat(platform,workspace): SDK-wedge detection + workspace_status ENUM 2026-04-25 00:59:15 -07:00
cli_executor.py fix(executors): move set_current_task inside try so active_tasks always decrements (#2026) 2026-04-24 18:03:12 +00:00
config.py fix(compliance): flip default mode to owasp_agentic (detect-only) 2026-04-24 11:52:09 -07:00
consolidation.py fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347) 2026-04-21 08:11:44 +00:00
coordinator.py fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347) 2026-04-21 08:11:44 +00:00
Dockerfile feat(workspace): 45-min gh-token refresh daemon + credential helper cache 2026-04-22 19:52:46 -07:00
entrypoint.sh fix(workspace): credential helper security hardening (#1797) 2026-04-23 18:14:55 +00:00
events.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
executor_helpers.py feat(canvas+platform): chat attachments, model selection, deploy/delete UX 2026-04-24 13:27:51 -07:00
heartbeat.py feat(platform,workspace): SDK-wedge detection + workspace_status ENUM 2026-04-25 00:59:15 -07:00
hermes_executor.py feat(workspace): migrate a2a-sdk from 0.3.x to 1.0.0 (KI-009) (#1974) 2026-04-24 04:43:17 +00:00
initial_prompt.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
main.py fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID 2026-04-24 19:54:43 -07:00
molecule_ai_status.py fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347) 2026-04-21 08:11:44 +00:00
platform_auth.py fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID 2026-04-24 19:54:43 -07:00
plugins.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
preflight.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
prompt.py fix(review): address code review blockers on tool-trace + instructions 2026-04-22 16:18:06 -07:00
pytest.ini chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
rebuild-runtime-images.sh fix: update workspace script comments for workspace-template → workspace rename 2026-04-18 01:48:05 -07:00
requirements.txt feat(workspace): migrate a2a-sdk from 0.3.x to 1.0.0 (KI-009) (#1974) 2026-04-24 04:43:17 +00:00
shared_runtime.py fix: CWE-78 rm scope, go vet failures, delegation idempotency 2026-04-21 18:22:30 +00:00
transcript_auth.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
watcher.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00