Commit Graph

4 Commits

Hongming Wang
4eb09e2146 feat(platform,workspace): SDK-wedge detection + workspace_status ENUM
Heartbeat lies. The asyncio task that POSTs /registry/heartbeat lives
in its own process slot, so a workspace whose claude_agent_sdk has
wedged on `Control request timeout: initialize` keeps reporting
"online" — every chat send hangs for the full 5-min platform
even though the runtime is dead in the water. This commit teaches
the workspace to admit it's wedged and the platform to honor that
admission by flipping status → degraded.

Five layers, all in one commit because they share a contract:

1. Migration 043 — convert workspaces.status from free-form TEXT to
   a real `workspace_status` Postgres ENUM with the 6 values
   production code actually writes (provisioning, online, offline,
   degraded, failed, removed). Locks the value set; future typo
   writes error at the DB instead of silently storing rogue strings.
   Down migration reverts to TEXT and drops the type.

2. workspace-server/internal/models — `HeartbeatPayload` gains a
   `runtime_state string` field. Empty = healthy. Currently the only
   non-empty value the handler honors is "wedged"; future symptoms
   can extend without another migration.

3. workspace-server/internal/handlers/registry.go — `evaluateStatus`
   gains a wedge branch BEFORE the existing error_rate >= 0.5 path:
   if `RuntimeState=="wedged"` and currently online, flip to
   degraded and broadcast WORKSPACE_DEGRADED with the wedge sample
   error. Recovery (`degraded → online`) now requires BOTH
   error_rate < 0.1 AND runtime_state cleared, so a workspace still
   reporting wedged stays degraded even when its error count
   happens to be 0 (the wedge captures a runtime state, not an
   error count).

4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason`
   flag set when execute()'s catch block sees an error matching
   `_WEDGE_ERROR_PATTERNS` (currently just "control request
   timeout"). Sticky for the process lifetime; the SDK's internal
   client-process state is corrupted on this error and only a
   workspace restart (= new Python process = fresh module state)
   clears it. Helpers `is_wedged()` / `wedge_reason()` /
   `_reset_sdk_wedge_for_test()` exposed.

5. workspace/heartbeat.py — heartbeat body now layers on
   `_runtime_state_payload()` for both the happy path and the
   401-retry path. Lazy-imports claude_sdk_executor so non-Claude
   runtimes (where the module may not even be importable) keep
   working unchanged.

Canvas required no changes — `STATUS_CONFIG.degraded` was already
defined in design-tokens.ts (amber dot, "Degraded" label) and
WorkspaceNode.tsx already renders `lastSampleError` underneath the
status pill when status === "degraded". The existing wiring just
never fired because nothing was writing degraded in this code path.

Tests:
- 3 Go handler tests for the new transitions (online → degraded on
  wedged, degraded stays put while still wedged, degraded → online
  after wedge clears)
- 5 Python wedge-detector tests (default clean, mark sets flag,
  sticky-first-wins, execute() flips on Control request timeout,
  execute() does NOT flip on unrelated errors)
- Migration smoke-tested against the local dev DB (3 existing rows,
  all enum-compatible; migration applied cleanly, post-state has
  the column as workspace_status type and the index preserved)

Verified: 79 Python tests pass; full Go test suite passes; migration
applies clean on a real DB; reverse migration restores the column to
TEXT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 00:59:15 -07:00
Hongming Wang
65b531acf6 fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID
Workspace runtime fired four classes of A2A request to the platform
without the X-Workspace-ID header that identifies the source
workspace: heartbeat self-messages, initial_prompt, idle-loop fires,
and peer-to-peer A2A from runtime tools. The platform's a2a_receive
logger keys source_id off that header — without it, every such row
was written with source_id=NULL, which the canvas's My Chat tab
treats as ?source=canvas (i.e. "user typed this"), so the
internal triggers rendered as if the human user had sent them. The
"Delegation results are ready..." heartbeat trigger was visible to
end users in the chat history; delegate_task A2A calls between agents
were misclassified the same way.

Centralise the header construction in a new platform_auth helper
self_source_headers(workspace_id) that returns auth_headers() PLUS
{X-Workspace-ID: <id>}. Apply it to:

  - heartbeat.py self-message (refactored from inline header dict)
  - main.py initial_prompt POST
  - main.py idle_prompt POST
  - a2a_client.py send_a2a_message (peer A2A from runtime)
  - builtin_tools/a2a_tools.py delegate_task (was missing ALL headers)
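The helper itself is small. A hypothetical sketch of `self_source_headers` — `auth_headers()` is stubbed here for illustration; the real one returns the platform auth token headers:

```python
# Stand-in for the real token-bearing headers from platform_auth.
def auth_headers() -> dict:
    return {"Authorization": "Bearer <token>"}

def self_source_headers(workspace_id: str) -> dict:
    """auth_headers() PLUS the X-Workspace-ID tag identifying the
    source workspace, so a2a_receive logs a non-NULL source_id."""
    return {**auth_headers(), "X-Workspace-ID": workspace_id}
```

Every self-originated POST then passes `headers=self_source_headers(workspace_id)` instead of building its own dict.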

Tests:
  - test_heartbeat.py asserts the X-Workspace-ID header is set on
    the self-message POST.
  - test_a2a_tools_module.py asserts the same on delegate_task POSTs;
    FakeClient.post mocks updated to accept the headers kwarg.

Production effect lands the moment workspace containers are rebuilt
with this code; existing rows in activity_logs keep their NULL
source_id (legacy data). The canvas-side filter (#follow-up)
covers the historical-rows case until backfill.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:54:43 -07:00
b5e2142c46 fix(#1877): close token-rotation race on restart — Option A+Option B combined
Platform side (Option B):
- provisioner.go: add WriteAuthTokenToVolume() — writes .auth_token to
  the Docker named volume BEFORE ContainerStart using a throwaway alpine
  container, eliminating the race window where a restarted container could
  read a stale token before WriteFilesToContainer writes the new one.
- workspace_provision.go: call WriteAuthTokenToVolume() in issueAndInjectToken
  as a best-effort pre-write before the container starts.

Runtime side (Option A):
- heartbeat.py: on HTTPStatusError 401 from /registry/heartbeat, call
  refresh_cache() to force re-read of /configs/.auth_token from disk,
  then retry the heartbeat once. Fall through to normal failure tracking
  if the retry also fails.
- platform_auth.py: add refresh_cache() which discards the in-process
  _cached_token and calls get_token() to re-read from disk.

Together these eliminate the >1 consecutive 401 window described in
issue #1877. Pre-write (B) is the primary fix; runtime retry (A) is the
self-healing fallback for any residual race.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-23 17:47:18 -07:00
Hongming Wang
479a027e4b chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00