When a target workspace's adapter has declared
provides_native_session=True (claude-code SDK's streaming session,
hermes-agent's in-container event log), the SDK owns its own queue/
session state. Adding the platform's a2a_queue layer on top would
double-buffer the same in-flight state — and worse, the platform
queue's drain timing has no relationship to the SDK's actual readiness,
so the queued request might dispatch while the SDK is STILL busy.
Behavior change: in handleA2ADispatchError, when isUpstreamBusyError(err)
fires and the target declared native_session, return 503 + Retry-After
directly without enqueueing. The caller's adapter handles retry on
its own schedule, and the SDK's own queue absorbs the request when
ready. Response body carries native_session=true so callers can
distinguish this from queue-failure 503s.
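A minimal sketch of the new branch, assuming illustrative names (the
"session" capability key, the wrapper helper, and the Retry-After value
are not spelled out in this message):

    // Called from handleA2ADispatchError after logA2AFailure and the broadcast;
    // returns true when the skip-enqueue fast path handled the request.
    func respondBusyNativeSession(c *gin.Context, workspaceID string, err error) bool {
        if !isUpstreamBusyError(err) ||
            !runtimeOverrides.HasCapability(workspaceID, "session") {
            return false // fall through to the existing EnqueueA2A / legacy-503 path
        }
        c.Header("Retry-After", "30") // hypothetical backoff hint
        c.JSON(http.StatusServiceUnavailable, gin.H{
            "error":          "workspace agent busy",
            "native_session": true, // lets callers tell this apart from queue-failure 503s
        })
        return true
    }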
Observability is preserved: logA2AFailure still runs ahead of the new
branch; the broadcaster still fires; the activity_logs row records the
busy event just like the platform-fallback path.
This is the consumer that validates the template-side declarations
already shipped in:
- molecule-ai-workspace-template-claude-code PR #12
- molecule-ai-workspace-template-hermes PR #25
Once those merge + image tags bump, claude-code + hermes workspaces'
busy 503s skip the platform queue end-to-end, completing validation of
capability primitive #5.
Tests (2 new):
- NativeSession_SkipsEnqueue: cache pre-populated, deliberate
sqlmock with NO INSERT INTO a2a_queue expected — implicit
regression cover (sqlmock fails on unexpected queries). Asserts
503 + Retry-After + native_session=true marker in body.
- NoNativeSession_StillEnqueues: negative pin — empty cache, same
  busy error → falls through to EnqueueA2A (which fails in this
  test, so the handler falls back to the legacy 503 without the
  native_session marker).
Verification:
- All Go handlers tests pass (2 new + existing)
- go build + go vet clean
See project memory `project_runtime_native_pluggable.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an adapter declares provides_native_status_mgmt=True (because its
SDK reports its own ready/degraded/failed state explicitly), the
platform's error-rate-based status inference fights the adapter's own
state machine. This PR gates the inference branches on the capability
flag — adapter-driven transitions become authoritative.
Components:
- registry.go evaluateStatus: gate the two inferred-status branches
(online → degraded when error_rate ≥ 0.5; degraded → online when
error_rate < 0.1 and runtime_state is empty) behind a check of
runtimeOverrides.HasCapability("status_mgmt").
- The wedged-branch (RuntimeState == "wedged" → degraded) is NOT
gated. That path is the adapter's OWN self-report, not platform
inference, and stays active under native_status_mgmt — adapters
can still drive transitions via runtime_state.
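A condensed sketch of the resulting decision order (the function and
variable names here are illustrative; the real evaluateStatus body
carries more state):

    func inferStatus(workspaceID, currentStatus, runtimeState string, errorRate float64) string {
        // Adapter self-report is never gated: wedged always drives the transition.
        if runtimeState == "wedged" {
            return "degraded"
        }
        // Platform inference runs only when the adapter did NOT declare
        // native status management.
        if !runtimeOverrides.HasCapability(workspaceID, "status_mgmt") {
            if currentStatus == "online" && errorRate >= 0.5 {
                return "degraded"
            }
            if currentStatus == "degraded" && errorRate < 0.1 && runtimeState == "" {
                return "online"
            }
        }
        return currentStatus
    }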
Python side: no change. The capability map is already serialized via
RuntimeCapabilities.to_dict() in PR #2137 and sent in the heartbeat's
runtime_metadata block via PR #2139. An adapter setting
RuntimeCapabilities(provides_native_status_mgmt=True) automatically
flows through.
Tests (3 new):
- SkipsDegradeInference: error_rate=0.8 + currentStatus=online + native
flag set → degrade UPDATE does NOT fire (sqlmock fails on unexpected
query, which is the regression cover)
- SkipsRecovery: error_rate=0.05 + currentStatus=degraded + native →
recovery UPDATE does NOT fire
- WedgedStillRespected: runtime_state="wedged" + native → wedged
branch DOES fire (adapter self-report stays active)
Verification:
- All Go handlers tests pass (3 new + existing)
- 1308/1308 Python pytest pass (unchanged — Python side unmodified)
- go build + go vet clean
Stacked on #2140 (already merged via cascade); branch is current with
staging since #2139 and #2140 merged.
See project memory `project_runtime_native_pluggable.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an adapter declares provides_native_scheduler=True (because its
SDK has built-in cron / Temporal-style workflows), the platform's
polling loop must skip firing schedules for that workspace — otherwise
the schedule fires twice (once natively, once via platform). The
native skip preserves observability (next_run_at still advances, the
schedule row stays in the DB, last_run_at would still update) while
moving the FIRE responsibility to the SDK.
Stacked on PR #2139 (idle_timeout_override end-to-end). The
RuntimeMetadata heartbeat block already carries the capability map;
this PR teaches the platform how to read and act on the scheduler bit.
Components:
- handlers/runtime_overrides.go: extended the cache to store
capability flags alongside idle timeout. Two heartbeat fields are
independent — SetIdleTimeout / SetCapabilities each update one
without stomping the other. Defensive copy on SetCapabilities so
a caller mutating its map after the call doesn't retroactively
change cached declarations. Empty entries dropped to avoid stale
husks.
- handlers/runtime_overrides.go: new HasCapability(workspaceID, name)
+ ProvidesNativeScheduler(workspaceID) — the latter is the
package-level adapter the scheduler imports (avoids a
handlers/scheduler import cycle).
- handlers/registry.go: heartbeat handler now calls SetCapabilities
in addition to SetIdleTimeout.
- scheduler/scheduler.go: NativeSchedulerCheck function-pointer DI
  (mirrors the existing QueueDrainFunc pattern). New() leaves the
  field nil so existing callers preserve today's "always fire"
  behavior. SetNativeSchedulerCheck wires the production check.
  tick() drops workspaces declaring native ownership before the
  goroutine fan-out and advances next_run_at so we don't tight-loop
  on the same row.
- cmd/server/main.go: wires handlers.ProvidesNativeScheduler into
the cron scheduler at server boot.
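A compressed sketch of the DI seam and the tick-time skip (type and
method names follow the bullets above; the Schedule struct, fan-out,
and the shouldFire helper name are illustrative):

    type Scheduler struct {
        // nil preserves today's "always fire" behavior for existing callers of New().
        nativeSchedulerCheck func(workspaceID string) bool
    }

    func (s *Scheduler) SetNativeSchedulerCheck(fn func(string) bool) {
        s.nativeSchedulerCheck = fn
    }

    func (s *Scheduler) shouldFire(workspaceID string) bool {
        if s.nativeSchedulerCheck == nil {
            return true
        }
        // Native ownership declared: skip firing; the caller still advances
        // next_run_at so the same row isn't re-picked every tick.
        return !s.nativeSchedulerCheck(workspaceID)
    }

    // cmd/server/main.go wiring:
    //   sched.SetNativeSchedulerCheck(handlers.ProvidesNativeScheduler)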
Tests:
Go (7 new):
- SetCapabilitiesAndHas (round-trip)
- per-workspace isolation (ws-a's declaration doesn't leak to ws-b)
- nil/empty map clears (adapter dropping the flag restores fallback)
- SetCapabilities is a defensive copy (caller mutation can't
retroactively flip cached value)
- SetIdleTimeout preserves capabilities and vice-versa (two-field
independence)
- empty entry deleted (no stale husks)
- ProvidesNativeScheduler reads the same singleton heartbeat writes
- SetNativeSchedulerCheck wires the function (scheduler-side)
- nil-check safety contract for tick
Python: no change needed — the heartbeat already serializes the
full capability map via _runtime_metadata_payload (PR #2139). An
adapter setting RuntimeCapabilities(provides_native_scheduler=True)
automatically flows through.
Verification:
- 1308 / 1308 Python pytest pass (unchanged)
- All Go handlers + scheduler tests pass
- go build + go vet clean
See project memory `project_runtime_native_pluggable.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Capability primitive #2 (task #117). The first cross-cutting capability
where the adapter actually displaces platform behavior — claude-code's
streaming session can legitimately go silent for 8+ minutes during
synthesis + slow tool calls; the platform's hardcoded 5min idle timer
in a2a_proxy.go cancels it mid-flight (the bug PR #2128 patched at
the env-var layer). This PR fixes it at the right layer: the adapter
declares "I need 600s" and the platform's dispatch path honors it.
Wire shape (Python → Go):
POST /registry/heartbeat
{
"workspace_id": "...",
...
"runtime_metadata": {
"capabilities": {"heartbeat": false, "scheduler": false, ...},
"idle_timeout_seconds": 600 // optional, omitted = use default
}
}
Default behavior preserved: any adapter that doesn't override
BaseAdapter.idle_timeout_override() (returns None by default) sends
no idle_timeout_seconds field; the Go side falls through to
idleTimeoutDuration (env A2A_IDLE_TIMEOUT_SECONDS, default 5min).
Existing langgraph / crewai / deepagents workspaces are unaffected.
Components:
Python:
- adapter_base.py: idle_timeout_override() method on BaseAdapter
returning None (the platform-default sentinel).
- heartbeat.py: _runtime_metadata_payload() lazy-imports the active
adapter and assembles the capability + override block. Try/except
swallows ANY error so heartbeat never breaks because of capability
discovery — observability outranks capability accuracy.
Go:
- models.HeartbeatPayload.RuntimeMetadata (pointer so absent =
"old runtime, didn't say"; explicit zero-cap = "new runtime,
declared no native ownership").
- handlers.runtimeOverrides: in-memory sync.Map cache keyed by
workspaceID. Populated by the heartbeat handler, consulted on
every dispatchA2A. Reset on platform restart (worst-case 30s of
platform-default behavior — acceptable; nothing about overrides
is correctness-critical).
- a2a_proxy.dispatchA2A: looks up the override before
  applyIdleTimeout; falls through to the global default when absent.
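A rough sketch of the override cache and the dispatch-time lookup,
assuming simplified shapes (the lookup helper name is illustrative;
the real cache later also carries the capability flags from the PRs
above):

    var runtimeOverrides sync.Map // workspaceID -> time.Duration

    func SetIdleTimeout(workspaceID string, d time.Duration) {
        if workspaceID == "" {
            return
        }
        if d <= 0 {
            runtimeOverrides.Delete(workspaceID) // zero/negative clears the override
            return
        }
        runtimeOverrides.Store(workspaceID, d)
    }

    // Consulted by dispatchA2A before applyIdleTimeout.
    func idleTimeoutFor(workspaceID string) time.Duration {
        if v, ok := runtimeOverrides.Load(workspaceID); ok {
            return v.(time.Duration)
        }
        // Fall through to the process-wide default
        // (env A2A_IDLE_TIMEOUT_SECONDS, 5 min when unset).
        return 5 * time.Minute
    }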
Tests:
Python (17, all new):
- RuntimeCapabilities dataclass shape (frozen, defaults, wire keys)
- BaseAdapter.capabilities() default + override + sibling isolation
- idle_timeout_override default, positive override, dropped-override
- Heartbeat metadata producer: default adapter emits all-False,
native adapter emits flag + override, missing ADAPTER_MODULE
returns {} (graceful), zero/negative override is omitted from
wire, exception inside adapter swallowed
Go (6, all new):
- SetIdleTimeout + IdleTimeout round-trip
- Zero/negative duration clears the override
- Empty workspace_id ignored
- Replacement (heartbeat overwrites prior value)
- Reset clears entire cache
- Concurrent reads + writes (sync.Map invariant)
Verification:
- 1308 / 1308 workspace pytest pass (was 1300, +8)
- All Go handlers tests pass (6 new + existing)
- go vet clean
See project memory `project_runtime_native_pluggable.md` for the
architecture principle this implements.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
## What this fixes
Closes one of the three skipped tests in workspace_provision_test.go
that #1814's interface refactor unblocked but never gave a body:
`TestProvisionWorkspace_NoInternalErrorsInBroadcast`.
The interface blocker (`captureBroadcaster` couldn't substitute for
`*events.Broadcaster`) was already fixed when `events.EventEmitter`
was extracted; this PR ships the test body that the prior refactor
made possible. The test was effectively unverified regression cover
for issue #1206 (internal error leak in WORKSPACE_PROVISION_FAILED
broadcasts) until now.
## What the test pins
Drives the **earliest** failure path in `provisionWorkspace` — the
global-secrets decrypt failure — so the setup needs only:
- one `global_secrets` mock row (with `encryption_version=99` to
force `crypto.DecryptVersioned` to error with a string that
includes the literal version number)
- one `UPDATE workspaces SET status = 'failed'` expectation
- a `captureBroadcaster` (already in the test file) injected via
`NewWorkspaceHandler`
Asserts the captured `WORKSPACE_PROVISION_FAILED` payload:
1. carries only the safe canned `"failed to decrypt global secret"` message
2. does NOT contain `"version=99"`, `"platform upgrade required"`,
or the global_secret row's `key` value (`FAKE_KEY`) — the three
leak markers a regression that interpolates `err.Error()` into
the broadcast would surface
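A sketch of the leak-marker assertions only (the helper name is
illustrative; the sqlmock + captureBroadcaster setup is elided):

    func assertNoInternalLeak(t *testing.T, payload string) {
        t.Helper()
        if !strings.Contains(payload, "failed to decrypt global secret") {
            t.Fatalf("expected the safe canned message, got: %q", payload)
        }
        for _, marker := range []string{"version=99", "platform upgrade required", "FAKE_KEY"} {
            if strings.Contains(payload, marker) {
                t.Fatalf("broadcast leaked internal detail %q in %q", marker, payload)
            }
        }
    }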
## Why not use containsUnsafeString
The test file already has a `containsUnsafeString` helper with
`"secret"` and `"token"` in its prohibition list. Those substrings
match the legitimate redacted message (`"failed to decrypt global
secret"`) — appropriate in user-facing copy, NOT a leak. Using the
broad helper would either fail the test against the source's own
correct message OR require loosening the helper for everyone else.
Per-test explicit leak markers keep the assertion precise without
weakening shared infrastructure.
## What's still skipped (out of scope for this PR)
- `TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast` — same
shape but blocked on a different refactor: `provisionWorkspaceCP`
routes through `*provisioner.CPProvisioner` (concrete pointer,
no interface), so the test would need either an interface
extraction or a real CPProvisioner with a mocked HTTP server.
Larger scope; deferred.
- `TestResolveAndStage_NoInternalErrorsInHTTPErr` — different
blocker (`mockPluginsSources` vs `*plugins.Registry` type
mismatch). Needs a SourceResolver-side interface refactor.
Both still carry their `t.Skip` notes documenting the remaining
work.
## Test plan
- [x] New test passes
- [x] Full handlers package suite still green (`go test ./internal/handlers/`)
- [x] No changes to production code — pure test addition
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two Critical bugs caught in code review of the agent→user attachments PR:
1. **Empty-URI attachments slipped past validation.** Gin's
go-playground/validator does NOT iterate slice elements without
`dive` — verified zero `dive` usage anywhere in workspace-server —
so the inner `binding:"required"` tags on NotifyAttachment.URI/Name
were never enforced. `attachments: [{"uri":"","name":""}]` would
pass validation, broadcast empty-URI chips that render blank in
canvas, AND persist them in activity_logs for every page reload to
   re-render. Added explicit per-element validation in Notify (returns
   400 with `attachment[i]: uri and name are required`) plus
   defence-in-depth in the canvas filter (rejects empty strings, not
   just non-strings); see the sketch after this list. A 3-case
   regression test pins the rejection.
2. **Hardcoded application/octet-stream stripped real mime types.**
`_upload_chat_files` always passed octet-stream as the multipart
Content-Type. chat_files.go:Upload reads `fh.Header.Get("Content-Type")`
FIRST and only falls back to extension-sniffing when the header is
empty, so every agent-attached file lost its real type forever —
broke the canvas's MIME-based icon/preview logic. Now sniff via
`mimetypes.guess_type(path)` and only fall back to octet-stream
when sniffing returns None.
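A minimal sketch of the per-element check from fix 1, assuming `req`
is the bound Notify payload (field names follow NotifyAttachment; the
surrounding handler is elided):

    for i, att := range req.Attachments {
        if att.URI == "" || att.Name == "" {
            c.JSON(http.StatusBadRequest, gin.H{
                "error": fmt.Sprintf("attachment[%d]: uri and name are required", i),
            })
            return
        }
    }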
Plus three Required nits:
- `sqlmockArgMatcher` was misleading — the closure always returned
true after capture, identical to `sqlmock.AnyArg()` semantics, but
named like a custom matcher. Renamed to `sqlmockCaptureArg(*string)`
so the intent (capture for post-call inspection, not validate via
driver-callback) is unambiguous.
- Test asserted notify call by `await_args_list[1]` index — fragile
to any future _upload_chat_files refactor that adds a pre-flight
POST. Now filter call list by URL suffix `/notify` and assert
exactly one match.
- Added `TestNotify_RejectsAttachmentWithEmptyURIOrName` (3 cases)
covering empty-uri, empty-name, both-empty so the Critical fix
stays defended.
Deferred to follow-up:
- ORDER BY tiebreaker for same-millisecond notifies — pre-existing
risk, not regression.
- Streaming multipart upload — bounded by the platform's 50MB total
cap so RAM ceiling is fixed; switch to streaming if cap rises.
- Symlink rejection — agent UID can already read whatever its
filesystem perms allow via the shell tool; rejecting symlinks
doesn't materially shrink the attack surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap where the Director would say "ZIP is ready at /tmp/foo.zip"
in plain text instead of attaching a download chip — the runtime literally
had no API for outbound file attachments. The canvas + platform's
chat-uploads infrastructure already supported the inbound (user → agent)
direction (commit 94d9331c); this PR wires the outbound side.
End-to-end shape:
agent: send_message_to_user("Done!", attachments=["/tmp/build.zip"])
↓ runtime
POST /workspaces/<self>/chat/uploads (multipart)
↓ platform
/workspace/.molecule/chat-uploads/<uuid>-build.zip
→ returns {uri: workspace:/...build.zip, name, mimeType, size}
↓ runtime
POST /workspaces/<self>/notify
{message: "Done!", attachments: [{uri, name, mimeType, size}]}
↓ platform
Broadcasts AGENT_MESSAGE with attachments + persists to activity_logs
with response_body = {result: "Done!", parts: [{kind:file, file:{...}}]}
↓ canvas
WS push: canvas-events.ts adds attachments to agentMessages queue
Reload: ChatTab.loadMessagesFromDB → extractFilesFromTask sees parts[]
Either path → ChatTab renders download chip via existing path
Files changed:
workspace-server/internal/handlers/activity.go
- NotifyAttachment struct {URI, Name, MimeType, Size}
- Notify body accepts attachments[], broadcasts in payload,
persists as response_body.parts[].kind="file"
canvas/src/store/canvas-events.ts
- AGENT_MESSAGE handler reads payload.attachments, type-validates
each entry, attaches to agentMessages queue
- Skips empty events (was: skipped only when content empty)
workspace/a2a_tools.py
- tool_send_message_to_user(message, attachments=[paths])
- New _upload_chat_files helper: opens each path, multipart POSTs
to /chat/uploads, returns the platform's metadata
- Fail-fast on missing file / upload error — never sends a notify
with a half-rendered attachment chip
workspace/a2a_mcp_server.py
- inputSchema declares attachments param so claude-code SDK
surfaces it to the model
- Defensive filter on the dispatch path (drops non-string entries
if the model sends a malformed payload)
Tests:
- 4 new Python: success path, missing file, upload 5xx, no-attach
backwards compat
- 1 new Go: Notify-with-attachments persists parts[] in
response_body so chat reload reconstructs the chip
Why /tmp paths work even though they're outside the canvas's allowed
roots: the runtime tool reads the bytes locally and re-uploads through
/chat/uploads, which lands the file under /workspace (an allowed root).
The agent can specify any readable path.
Does NOT include: agent → agent file transfer. Different design problem
(cross-workspace download auth: peer would need a credential to call
sender's /chat/download). Tracked as a follow-up under task #114.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review-feedback follow-up. Pre-fix, A2A_IDLE_TIMEOUT_SECONDS=foo or =-30
fell back to the default with zero log signal — operator sets the wrong
value, sees "no effect," wastes hours debugging "why is my override not
working." Now bad-input cases log a clear message naming the variable,
the bad value, and the default applied.
Refactor: extract parseIdleTimeoutEnv(string) → time.Duration so the
parse logic is unit-testable. defaultIdleTimeoutDuration is a const so
tests reference it without re-deriving the value.
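A sketch of the extracted helper, assuming the parse accepts integer
seconds only (exact log wording is illustrative):

    const defaultIdleTimeoutDuration = 5 * time.Minute

    func parseIdleTimeoutEnv(raw string) time.Duration {
        if raw == "" {
            return defaultIdleTimeoutDuration
        }
        secs, err := strconv.Atoi(raw)
        if err != nil || secs <= 0 {
            log.Printf("invalid A2A_IDLE_TIMEOUT_SECONDS=%q; using default %s",
                raw, defaultIdleTimeoutDuration)
            return defaultIdleTimeoutDuration
        }
        return time.Duration(secs) * time.Second
    }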
8 new unit tests cover empty / valid / negative / zero / non-numeric /
float / trailing-units inputs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs caused the "context canceled" wave on 2026-04-26
(15+ failed user/agent A2A calls in 1hr across 6 workspaces, including
the user's "send it in the chat" message that the director never
received):
1. **a2a_proxy.go:applyIdleTimeout cancels the dispatch after 60s of
   broadcaster silence** for the workspace. The timer resets on any
   SSE event for that workspace and fires cancel() if none arrives
   in time.
2. **registry.go:Heartbeat broadcast was conditional** —
`if payload.CurrentTask != prevTask`. The runtime POSTs
/registry/heartbeat every 30s, but if current_task hasn't changed
the handler emits ZERO broadcasts. evaluateStatus only broadcasts
on online/degraded transitions — also no-op when steady.
Net: a claude-code agent on a long packaging step or slow tool call
keeps the same current_task for >60s → no broadcasts → idle timer
fires → in-flight request cancelled mid-flight with the "context
canceled" error the user sees in the activity log.
Fix:
(a) Heartbeat handler always emits a `WORKSPACE_HEARTBEAT` BroadcastOnly
event (no DB write — same path as TASK_UPDATED). At the existing 30s
runtime cadence this resets the idle timer twice per minute.
Cost is one in-memory channel send per active SSE subscriber + one
WS hub fan-out per heartbeat — far below any noise floor.
(b) idleTimeoutDuration default bumped 60s → 5min as a safety net for
any future regression where the heartbeat path goes silent (e.g.
runtime crashed mid-request before its next heartbeat). Made
env-overridable via A2A_IDLE_TIMEOUT_SECONDS for ops who want to
tune (canary tests fail-fast, prod tenants with slow plugins want
longer). Either fix alone closes today's gap; both together is
defence in depth.
The runtime side already POSTs /registry/heartbeat every 30s via
workspace/heartbeat.py — no runtime change needed.
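A sketch of fix (a) in the heartbeat handler, assuming a
BroadcastOnly(eventType, payload) shape (the payload fields shown are
illustrative):

    // Unconditional, before the existing current_task change detection.
    h.broadcaster.BroadcastOnly("WORKSPACE_HEARTBEAT", map[string]any{
        "workspace_id": payload.WorkspaceID,
        "current_task": payload.CurrentTask,
    })
    // No DB write; the event exists purely to reset applyIdleTimeout's
    // silence timer on every 30s heartbeat.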
Test: TestHeartbeatHandler_AlwaysBroadcastsHeartbeat pins the property
that an SSE subscriber observes a WORKSPACE_HEARTBEAT broadcast on a
same-task heartbeat (the regression scenario). All 16 existing handler
tests still pass.
Doesn't fix: task #102 (single SDK session bottleneck) — peers will
still queue when busy. But this PR ensures the queue/wait flow
actually completes instead of being killed by the idle timer
mid-wait.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.
Fix shape:
1. Provisioner stamps every ws-* container + volume with
`molecule.platform.managed=true`. Without a label, the sweeper
would have to assume any unlabeled ws-* container might belong
to a sibling platform stack on a shared Docker daemon.
2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
counterpart to the existing name-filter.
3. Sweeper splits sweepOnce into two independent passes:
- sweepRemovedRows (unchanged behavior; status='removed' only)
- sweepLabeledOrphansWithoutRows (new; labeled containers whose
workspace_id has no row in the table at all)
Each pass has its own short-circuit so an empty result or transient
error in one doesn't block the other — load-bearing because the
wiped-DB pass exists precisely for cases where the removed-row
pass finds nothing.
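A sketch of the split described in item 3 (pass names follow the
message; the sweeper's receiver type and error handling are
compressed):

    func (s *Sweeper) sweepOnce(ctx context.Context) {
        // Independent short-circuits: a transient error or empty result in
        // one pass never blocks the other.
        if err := s.sweepRemovedRows(ctx); err != nil {
            log.Printf("orphan sweep (removed rows): %v", err)
        }
        if err := s.sweepLabeledOrphansWithoutRows(ctx); err != nil {
            log.Printf("orphan sweep (labeled, no DB row): %v", err)
        }
    }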
Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)
Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.
Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.
Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in goroutine that polls GHCR every 5 minutes for digest
changes on each workspace-template-*:latest tag and invokes the same
refresh logic /admin/workspace-images/refresh exposes. With this, the
chain from "merge runtime PR" to "containers running new code" is fully
hands-off — no operator step between auto-tag → publish-runtime →
cascade → template image rebuild → host pull + recreate.
Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already
pulls every release should leave it off (would be redundant work);
self-hosters get true zero-touch.
Why a refactor of admin_workspace_images.go is in this PR:
The HTTP handler held all the refresh logic inline. To share it with
the new watcher without HTTP loopback, extracted WorkspaceImageService
with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape.
HTTP handler is now a thin wrapper; behavior is preserved (same JSON
response, same 500-on-list-failure, same per-runtime soft-fail).
Watcher design notes:
- Last-observed digest tracked in memory (not persisted). On boot the
first observation per runtime is seed-only — no spurious refresh
fires on every restart.
- On Refresh error, the seen digest rolls back so the next tick retries.
Without this rollback a transient Docker glitch would convince the
watcher the work was done.
- Per-runtime fetch errors don't block other runtimes (one template's
brief 500 doesn't pause the others).
- digestFetcher injection seam in tick() lets unit tests cover all
bookkeeping branches without standing up an httptest GHCR server.
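A sketch of the per-runtime bookkeeping in tick(), with the
digestFetcher seam named above; the struct, method, and Refresh
argument shapes are otherwise assumptions:

    type digestFetcher func(ctx context.Context, runtime string) (string, error)

    type imageWatcher struct {
        fetch digestFetcher
        seen  map[string]string // runtime -> last observed digest
        svc   *WorkspaceImageService
    }

    func (w *imageWatcher) tickOne(ctx context.Context, runtime string) {
        digest, err := w.fetch(ctx, runtime)
        if err != nil {
            return // per-runtime fetch failure: skip this runtime, don't block the others
        }
        prev, known := w.seen[runtime]
        w.seen[runtime] = digest
        if !known || prev == digest {
            return // first observation after boot is seed-only; unchanged digest is a no-op
        }
        if _, err := w.svc.Refresh(ctx, []string{runtime}, true); err != nil {
            w.seen[runtime] = prev // roll back so the next tick retries the refresh
        }
    }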
Verified live: probed GHCR's /token + manifest HEAD against
workspace-template-claude-code; got HTTP 200 + a real
Docker-Content-Digest. Same calls the watcher makes.
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on PR #2105 caught 7 Create-handler tests still mocking the
pre-#1408 10-arg INSERT signature. With the column now wired
unconditionally into the INSERT, every WithArgs that pinned
budget_limit as the 10th arg needed an 11th slot for the resolved
max_concurrent_tasks value.
Files:
- workspace_test.go: 6 tests (DBInsertError, DefaultsApplied,
WithSecrets_Persists, TemplateDefaultsMissingRuntimeAndModel,
TemplateDefaultsLegacyTopLevelModel, CallerModelOverridesTemplateDefault)
- workspace_budget_test.go: 1 test (Budget_Create_WithLimit)
All resolved values are the schema-default mirror, so the test
expectations reference the same models.DefaultMaxConcurrentTasks
const that the handler writes. New imports added to both files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of the wire-up commit:
- New const models.DefaultMaxConcurrentTasks = 1; handlers and tests
reference the symbol so the schema-default mirror lives in one place.
- Strip 5 multi-line comments that narrated what the code does.
- Drop the duplicate field-rationale on OrgWorkspace; the one on
CreateWorkspacePayload is canonical.
- Drop test-side positional comments that would silently lie if columns
get reordered.
Pure cleanup; no behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of #1408 (active_tasks counter). Runtime increment/decrement,
schema column (037), and scheduler enforcement (scheduler.go:312)
already shipped — but the write path from template config.yaml +
direct API was missing, so every workspace silently fell through to
the schema default of 1. Leaders that set max_concurrent_tasks: 3 in
their org template were getting 1 anyway, defeating the entire
feature for the use case it was built for (cron-vs-A2A contention on
PM/lead workspaces).
- OrgWorkspace gains MaxConcurrentTasks (yaml + json tags)
- CreateWorkspacePayload gains MaxConcurrentTasks (json tag)
- Both INSERTs now write the column unconditionally; 0/omitted
payload value falls back to 1 (schema default mirror) so the wire
stays single-shape — no forked column list / goto.
- Existing Create-handler test mocks updated to expect the 11th arg.
- New TestWorkspaceCreate_MaxConcurrentTasksOverride locks the
payload→DB propagation for the leader case (value=3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):
- a2a_proxy.go: keep the branch's idle-timeout signature
(workspaceID parameter + comment) AND apply staging's #1483 SSRF
defense-in-depth check at the top of dispatchA2A. Type-assert
h.broadcaster (now an EventEmitter interface per staging) back to
*Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
to no-op when the assertion fails (test-mock case).
- a2a_proxy_test.go: keep both new test suites — branch's
TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
the staging test's dispatchA2A call to pass the workspaceID arg
introduced by the branch's signature change.
- workspace_crud.go: combine both Delete-cleanup intents:
* Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
hang-up doesn't cancel mid-Docker-call (the container-leak fix)
* Branch's stopAndRemove helper that skips RemoveVolume when Stop
fails (orphan sweeper handles)
* Staging's #1843 stopErrs aggregation so Stop failures bubble up
as 500 to the client (the EC2 orphan-instance prevention)
Both concerns satisfied: cleanup runs to completion past canvas
hangup AND failed Stop calls surface to caller.
Build clean, all platform tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
PR #2103 widened the SSRF saasMode branch to also relax RFC-1918 + ULA
under MOLECULE_ENV=development (so the docker-compose dev pattern stops
rejecting workspace registrations on 172.18.x.x bridge IPs). The
existing TestIsSafeURL_DevMode_StillBlocksOtherRanges covered the
security floor (metadata / TEST-NET / CGNAT stay blocked), but no
test asserted the positive side — that 10.x / 172.x / 192.168.x / fd00::
ARE now allowed under dev mode.
Without this test, a future refactor that quietly drops the
`|| devModeAllowsLoopback()` from isPrivateOrMetadataIP wouldn't trip
any assertion, and the docker-compose dev loop would silently re-break.
Adds TestIsSafeURL_DevMode_AllowsRFC1918 — table of 4 URLs covering
the three RFC-1918 IPv4 ranges + IPv6 ULA fd00::/8. Sets
MOLECULE_DEPLOY_MODE=self-hosted explicitly so the test exercises the
devMode branch, not a SaaS-mode pass.
Closes the Optional finding I left on PR #2103.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, a freshly published runtime
image sat in the registry while containers kept the old image until they
naturally cycled.
Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
ContainerList returns the resolved digest in .Image, not the human tag.
Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
same Apple-Silicon-needs-amd64 platform as the provisioner — single
source of truth for the multi-arch override.
Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Pre-fix, POST /workspaces/:id/notify (the side-channel agents use to push
interim updates and follow-up results) only broadcast via WebSocket — no
DB write. When the user refreshed the page, the chat-history loader
(which queries activity_logs) couldn't restore those messages and they
vanished from the chat.
Hits the most common path: when the platform's POST /a2a times out (idle),
the runtime keeps working and eventually pushes its reply via
send_message_to_user. The reply rendered live but disappeared on reload.
Fix: also INSERT an activity_logs row with shape the existing loader
already understands (type=a2a_receive, source_id=NULL, response_body=
{result: text}). Persistence is best-effort — a DB hiccup doesn't block
the WebSocket push (which the user is already seeing).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The docker-compose dev pattern puts platform and workspace containers on
the same docker bridge network (172.18.0.0/16, RFC-1918). The runtime
registers via its docker-internal hostname which DNS-resolves to a
172.18.x.x IP. The SSRF defence's isPrivateOrMetadataIP rejected those,
so every workspace POST through the platform proxy returned
'workspace URL is not publicly routable' — breaking the entire docker-
compose dev loop.
Fix: in isPrivateOrMetadataIP, treat MOLECULE_ENV=development the same
as SaaS mode for RFC-1918 relaxation. Both share the 'trusted intra-
network routing' property — SaaS is sibling EC2s in the same VPC, dev
is sibling containers on the same docker bridge. Always-blocked
categories (metadata link-local, TEST-NET, CGNAT) stay blocked.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
When proxyA2A returns 202+{queued:true} (target busy → enqueued for drain
on next heartbeat), executeDelegation previously treated it as a successful
completion and ran extractResponseText on the queued JSON. The result was
'Delegation completed (workspace agent busy — request queued, will dispatch...)'
landing in activity_logs.summary, which the LLM then echoed to the user
chat as garbage.
Two fixes:
1. delegation.go: detect queued shape via new isQueuedProxyResponse helper,
write status='queued' with clean summary 'Delegation queued — target at
capacity', store delegation_id in response_body so the drain can stitch
back later. Also embed delegation_id in params.message.metadata + use it
as messageId so the proxy's idempotency-key path keys off the same id.
2. a2a_queue.go: when DrainQueueForWorkspace successfully drains a queued
item, extract delegation_id from the body's metadata and UPDATE the
originating delegate_result row (queued → completed with real
response_body). Broadcast DELEGATION_COMPLETE so the canvas chat feed
flips the queued line to completed in real time.
Closes the loop so check_task_status reflects ground truth instead of
perpetual 'queued' even after the queued request eventually drained.
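A sketch of the queued-shape detection from fix 1 (the helper's exact
signature is assumed):

    func isQueuedProxyResponse(status int, body []byte) bool {
        if status != http.StatusAccepted { // the 202 + {queued:true} contract
            return false
        }
        var resp struct {
            Queued bool `json:"queued"`
        }
        if err := json.Unmarshal(body, &resp); err != nil {
            return false
        }
        return resp.Queued
    }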
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The 3 skipped tests in workspace_provision_test.go (#1206 regression
tests) were blocked because captureBroadcaster's struct-embed wouldn't
type-check against WorkspaceHandler.broadcaster's concrete
*events.Broadcaster field. This PR fixes the interface blocker for
the 2 broadcaster-related tests; the 3rd (plugins.Registry resolver)
is a separate blocker tracked elsewhere.
Changes:
- internal/events/broadcaster.go: define `EventEmitter` interface with
RecordAndBroadcast + BroadcastOnly. *Broadcaster satisfies it via
its existing methods (compile-time assertion guards future drift).
SubscribeSSE / Subscribe stay off the interface because only sse.go
+ cmd/server/main.go call them, and both still hold the concrete
*Broadcaster.
- internal/handlers/workspace.go: WorkspaceHandler.broadcaster type
changes from *events.Broadcaster to events.EventEmitter.
NewWorkspaceHandler signature updated to match. Production callers
unchanged — they pass *events.Broadcaster, which the interface
accepts.
- internal/handlers/activity.go: LogActivity takes events.EventEmitter
for the same reason — tests passing a stub no longer need to
construct the full broadcaster.
- internal/handlers/workspace_provision_test.go: captureBroadcaster
drops the struct embed (no more zero-value Broadcaster underlying
the SSE+hub fields), implements RecordAndBroadcast directly, and
adds a no-op BroadcastOnly to satisfy the interface. Skip messages
on the 2 empty broadcaster-blocked tests updated to reflect the
new "interface unblocked, test body still needed" state.
Verified `go build ./...`, `go test ./internal/handlers/`, and
`go vet ./...` all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1483 flagged that dispatchA2A() doesn't call isSafeURL internally —
the guard exists only at the caller level (resolveAgentURL at
a2a_proxy.go:424). The primary call path through proxyA2ARequest
is safe today, but if any future code path ever calls dispatchA2A
directly without going through resolveAgentURL, the SSRF check
would be silently bypassed.
This adds the one-line defense-in-depth guard the issue prescribed:
if err := isSafeURL(agentURL); err != nil {
return nil, nil, &proxyDispatchBuildError{err: err}
}
Wrapping as *proxyDispatchBuildError preserves the existing caller
error-classification path — the same shape that maps to 500 elsewhere.
Adds TestDispatchA2A_RejectsUnsafeURL pinning the contract:
re-enables SSRF for the test (setupTestDB disables it for normal
unit tests), passes a metadata IP, asserts the build error returns
and cancel is nil so no resource is leaked.
The 4 existing dispatchA2A unit tests use setupTestDB → SSRF
disabled, so they continue passing unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1484 flagged that discoverHostPeer() and writeExternalWorkspaceURL()
return URLs sourced from the workspaces table without an isSafeURL
check. Workspace runtimes register their own URLs via /registry/register
— a misbehaving / compromised runtime could register a metadata-IP URL.
Today both functions are gated by Phase 30.6 bearer-required Discover,
so exposure is theoretical. The fix makes them safe regardless of
upstream auth shape.
Changes:
- discoverHostPeer: isSafeURL on resolved URL before responding;
503 + log on rejection.
- writeExternalWorkspaceURL: same guard applied to the post-rewrite
outURL (so a host.docker.internal rewrite is checked AND a
metadata-IP that survived the rewrite untouched is rejected).
- 3 new regression tests:
* RejectsMetadataIPURL on host-peer path (169.254.169.254 → 503)
* AcceptsPublicURL on host-peer path (8.8.8.8 → 200; positive
counterpart so the rejection test can't pass via universal-fail)
* RejectsMetadataIPURL on external-workspace path
setupTestDB already disables SSRF checks via setSSRFCheckForTest,
so the 16+ existing discovery tests remain untouched. Only the new
tests opt in to enabled SSRF.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extract walkTemplateConfigs(configsDir, fn) shared helper. Both
templates.List and loadRuntimeProvisionTimeouts walked configsDir
+ parsed config.yaml — same boilerplate twice. Now centralised so
a future template-discovery rule (subdir naming, README sentinel,
etc.) lands in one place.
- templates.List uses the walker — net -10 lines.
- loadRuntimeProvisionTimeouts uses the walker — net -10 lines.
- Document runtimeProvisionTimeoutsCache as 'NOT SAFE for
package-level reuse' so a future change doesn't accidentally
promote it to a singleton (sync.Once can't be reset → tests
would lock out other fixtures).
Skipped (review finding): atomic.Pointer[map[string]int] for
future hot-reload. The doc comment already documents the
limitation; adding the primitive now would buy a not-yet-built
feature at the cost of more code today (YAGNI).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of #2054 — workspace-server reads runtime-level
provision_timeout_seconds from template config.yaml manifests and
includes provision_timeout_ms in the workspace List/Get response.
Phase 1 (canvas, #2092) already plumbs the field through socket →
node-data → ProvisioningTimeout's resolver, so the moment a
template declares the field the per-runtime banner threshold
adjusts without a canvas release.
Implementation:
- templates.go: parse runtime_config.provision_timeout_seconds in
the templateSummary marshaller. The /templates API now surfaces
the field too — useful for ops dashboards and future tooling.
- runtime_provision_timeouts.go (new): loadRuntimeProvisionTimeouts
scans configsDir, parses every immediate subdir's config.yaml,
returns runtime → seconds. Multiple templates with the same
runtime: max wins (so a slow template's threshold doesn't get
cut by a fast template's). Bad/empty inputs are silently
skipped — workspace-server starts cleanly with no templates.
- runtimeProvisionTimeoutsCache: sync.Once-backed lazy cache.
First workspace API request after process start pays the read
cost (~few KB across ~50 templates); every subsequent request is
a map lookup. Cache lifetime = process lifetime; invalidates on
workspace-server restart, which is the normal template-change
cadence.
- WorkspaceHandler gets a provisionTimeouts field (zero-value struct
is valid — the cache lazy-inits on first get()).
- addProvisionTimeoutMs decorates the response map with
provision_timeout_ms (seconds × 1000) when the runtime has a
declared timeout. Absent = no key in the response, canvas falls
through to its runtime-profile default. Wired into both List
(per-row decoration in the loop) and Get.
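Two sketches of the load-bearing bits, the max-wins merge and the
response decoration (mergeProvisionTimeout is an illustrative name;
addProvisionTimeoutMs is named above but its signature here is an
assumption, and the config.yaml parsing is elided):

    func mergeProvisionTimeout(timeouts map[string]int, runtime string, seconds int) {
        if runtime == "" || seconds <= 0 {
            return // bad/empty inputs are silently skipped
        }
        if seconds > timeouts[runtime] {
            timeouts[runtime] = seconds // same runtime in two templates: max wins
        }
    }

    func addProvisionTimeoutMs(row map[string]any, timeouts map[string]int, runtime string) {
        if secs, ok := timeouts[runtime]; ok {
            row["provision_timeout_ms"] = secs * 1000
        }
        // Absent: no key in the response, and the canvas falls back to its
        // runtime-profile default.
    }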
Tests (5 new in runtime_provision_timeouts_test.go):
- happy path: hermes declares 720, claude-code doesn't, only
hermes appears in the map
- max-on-duplicate: same runtime in two templates → max wins
- skip-bad-inputs: missing runtime, zero timeout, malformed yaml,
loose top-level files all silently ignored
- missing-dir: returns empty map, no crash
- cache: lazy-init on first get; subsequent gets hit cache even
after underlying file changes (sync.Once contract); unknown
runtime returns zero
Phase 3 (separate template-repo PR): template-hermes config.yaml
declares provision_timeout_seconds: 720 under runtime_config.
canvas RUNTIME_PROFILES.hermes becomes redundant + removable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing tokens_test.go skips every test when db.DB is nil, so CI
ran with 0% coverage on tokens.go's List/Create/Revoke. This file adds
sqlmock-driven tests that exercise the SQL paths directly without
needing a live Postgres, lifting coverage on all 4 functions to 100%
and module-level handler coverage from 60.3% → 61.1%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both backends panicked when called on a zero-valued or nil receiver:
Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop,
IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown
paths can call these speculatively where the receiver isn't fully
wired — the panic crashed the goroutine instead of the caller seeing
a clean error.
Three changes:
1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods
   (see the sketch after this list).
- Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at
the top.
- CPProvisioner.Stop: guard p == nil up top, then httpClient nil
AFTER resolveInstanceID + empty-instance check (the empty
instance_id path doesn't need HTTP and stays a no-op success
even on zero-valued receivers — preserved historical contract
from TestIsRunning_EmptyInstanceIDReturnsFalse).
- CPProvisioner.IsRunning: same shape — empty instance_id stays
(false, nil); httpClient-nil with non-empty instance_id returns
ErrNoBackend.
2. Flip the t.Skip on TestDockerBackend_Contract +
TestCPProvisionerBackend_Contract — both contract tests run now
that the panics are gone. Skipped scenarios were the regression
guard for this fix.
3. Add TestZeroValuedBackends_NoPanic — explicit assertion that
zero-valued and nil receivers return cleanly (no panic). Docker
backend always returns ErrNoBackend on zero-valued; CPProvisioner
may return (false, nil) when the DB-lookup layer absorbs the case
(no instance to query → no HTTP needed). Both are acceptable per
the issue's contract — the gate is no-panic.
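A sketch of change 1 for the Docker backend (the parameter list is
assumed; IsRunning follows the same shape, and the CPProvisioner
ordering caveats are covered in the bullets above):

    var ErrNoBackend = errors.New("provisioner: no backend configured")

    func (p *Provisioner) Stop(ctx context.Context, workspaceID string) error {
        if p == nil || p.cli == nil {
            return ErrNoBackend // clean error instead of a nil-dereference panic
        }
        // ... existing stop + remove logic unchanged ...
        return nil
    }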
Tests:
- 6 sub-cases across the new TestZeroValuedBackends_NoPanic
- TestDockerBackend_Contract + TestCPProvisionerBackend_Contract
now run their 2 scenarios (4 sub-cases each)
- All existing provisioner tests still green
- go build ./... + go vet ./... + go test ./... clean
Closes drift-risk #6 in docs/architecture/backends.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: workspace-server's provision-timeout sweep was hardcoded
at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245)
correctly gives hermes 25 min for cold-boot (hermes installs
include apt + uv + Python venv + Node + hermes-agent — 13–25 min on
slow apt mirrors is normal). The two timeout systems disagreed:
the watcher would happily wait 25 min, but the workspace-server's
10-min sweep killed healthy hermes boots mid-install at 10 min and
marked them failed.
Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z
created a hermes workspace, EC2 cloud-init was visibly making
progress on apt-installs (libcjson1, libmbedcrypto7t64) when the
sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The
test threw "Workspace failed: " (empty error from sql.NullString
serialization) and CI failed on a healthy boot.
Fix: provisioningTimeoutFor(runtime) — same shape as the CP's
bootstrapTimeoutFn:
- hermes: 30 min (watcher's 25 min + 5 min slack)
- others: 10 min (unchanged — claude-code/langgraph/etc. boot
in <5 min, 10 min is plenty)
PROVISION_TIMEOUT_SECONDS env override still works (applies to all
runtimes — operators who care about the runtime distinction
shouldn't use the override anyway).
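A sketch of the resolution order (the env-parsing helper name is an
assumption):

    func provisioningTimeoutFor(runtime string) time.Duration {
        if override, ok := provisionTimeoutFromEnv(); ok {
            return override // PROVISION_TIMEOUT_SECONDS applies to every runtime
        }
        if runtime == "hermes" {
            return 30 * time.Minute // CP watcher's 25 min + 5 min slack
        }
        return 10 * time.Minute // claude-code / langgraph / etc. boot well inside this
    }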
Sweep query change: pulls (id, runtime, age_sec) per row instead
of pre-filtering by age in SQL. Per-row Go evaluation picks the
correct timeout. Slightly more rows scanned but bounded by the
status='provisioning' partial index — workspaces in flight, not
historical.
Tests:
- TestProvisioningTimeout_RuntimeAware — locks in the per-runtime
mapping
- TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at
11 min must NOT be flipped
- TestSweepStuckProvisioning_HermesPastDeadline — hermes at
31 min IS flipped, payload includes runtime
- Existing tests updated for the new query shape
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... all green
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Delete`'s call to `h.provisioner.Stop()` was silently swallowing
errors — and on the SaaS/EC2 backend, Stop() is the call that
terminates the EC2 via the control plane. When Stop returned an
error (CP transient 5xx, network blip), the workspace was marked
'removed' in the DB but the EC2 stayed running with no row to
track it. The "14 orphan workspace EC2s on a 0-customer account"
incident in #1843 (40 vCPU on a 64 vCPU AWS limit) traced to this
silent-leak path.
This change aggregates Stop errors across both descendant and
self-stop calls and surfaces them as 500 to the client, matching
the loud-fail pattern from CP #262 (DeprovisionInstance) and the
DNS cleanup propagation (#269).
Idempotency:
- The DB row is already 'removed' before Stop runs (intentional,
per #73 — guards against register/heartbeat resurrection).
- `resolveInstanceID` reads instance_id without a status filter,
so a retry can replay Stop with the same instance_id.
- CP's TerminateInstance is idempotent on already-terminated EC2s.
- So a retry-after-500 either re-attempts the terminate (succeeds)
or finds the instance already gone (also succeeds).
Behaviour change at the API layer:
- Before: 200 `{"status":"removed","cascade_deleted":N}` regardless
  of Stop outcome.
- After: 500 `{"error":"...","removed_count":N,"stop_failures":K}`
on Stop failure; 200 on success.
RemoveVolume errors stay log-and-continue — those are local
/var/data cleanup, not infra-leak class.
Test debt acknowledged: the WorkspaceHandler's `provisioner` field
is the concrete `*provisioner.Provisioner` type, not an interface.
Adding a regression test for the new error-propagation path
requires either a refactor (introduce a Provisioner interface) or
a docker-backed integration test. Filing the refactor as a
follow-up; the change here is small and mirrors a proven pattern
(CP #262 + #269 both ship without exhaustive new test coverage
for the same reason).
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... green across the whole module (existing TestDelete
cases unchanged behaviour for happy path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-quality + efficiency review of PR #2081:
- Drop comma-ok on map type-asserts in filterPeersByQuery —
queryPeerMaps writes name/role unconditionally as string, so the
silent-empty-string fallback was cargo-culted defense that would
HIDE a real upstream shape change in tests rather than surface it.
Plain p["name"].(string) panics on violation, caught by tests.
- Trim filterPeersByQuery doc from 5 lines to 1 — function is 15
lines and self-evident.
- Refactor 6 separate Test functions into one table-driven
TestPeers_QFilter with 6 sub-tests. Net ~80 lines saved + naming
becomes readable subtest names instead of TestPeers_Q_Foo_Bar.
- Set-based peer-id comparison (peerIDSet) replaces fragile
peers[0]["id"] == "ws-alpha" asserts that would silently mask a
future sort/order regression on the production code.
- Fix the broken TestPeers_Q_NoMatches assertion: re-encoding an
unmarshalled []map collapses both null and [] to [], so the
previous json.Marshal(peers) == "[]" check was tautological. Move
the [] vs null distinction to a dedicated test
(TestPeers_Q_NoMatches_RawBodyIsArrayNotNull) that inspects the
recorder body BEFORE unmarshal.
runPeersWithQuery now returns both parsed peers and raw body so the
nil-guard test can use the bytes directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Peers handler at workspace-server/internal/handlers/discovery.go
ignored the ?q= query param entirely — every caller got the full peer
list regardless of what they searched for. The handler exposes peer
identities + URLs, so leaking the unfiltered set on a "filtered"
endpoint is an info-disclosure bug (CWE-862).
Fix: read c.Query("q") and post-filter the in-memory peers slice by
case-insensitive substring match against name OR role. Filtering is
done in Go after the existing 3 SQL reads — keeps the SQL bytes
identical to the no-filter path (no injection vector, no DB-driver
collation surprises) at a small cost. The peer set is bounded by a
single workspace's parent + children + siblings (typically <50
rows), so the in-memory pass is negligible.
Empty / whitespace-only q is a no-op — preserves the no-filter
allocation profile.
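A sketch of the post-filter (the map value types are assumed to be
plain strings, per the follow-up review note above):

    func filterPeersByQuery(peers []map[string]any, q string) []map[string]any {
        q = strings.ToLower(strings.TrimSpace(q))
        if q == "" {
            return peers // empty / whitespace-only q keeps the no-filter path
        }
        out := make([]map[string]any, 0, len(peers)) // non-nil: marshals as [], not null
        for _, p := range peers {
            name := strings.ToLower(p["name"].(string))
            role := strings.ToLower(p["role"].(string))
            if strings.Contains(name, q) || strings.Contains(role, q) {
                out = append(out, p)
            }
        }
        return out
    }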
Tests (6 new in discovery_test.go):
- TestPeers_NoQ_ReturnsAll — regression baseline (3 peers, no filter)
- TestPeers_Q_FiltersByName — q=alpha → ws-alpha only
- TestPeers_Q_CaseInsensitive — q=ALPHA → ws-alpha (locks in ToLower)
- TestPeers_Q_FiltersByRole — q=design → ws-beta (role-side match)
- TestPeers_Q_NoMatches — empty array, JSON [] not null
- TestPeers_Q_WhitespaceOnly — q=' ' treated as no-filter
Helpers peersFilterFixture + runPeersWithQuery + peerNames keep each
test scoped to the q-behaviour, not re-declaring SQL expectations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the canvas threw a generic "API GET /workspaces: 500
{auth check failed}" error when local Postgres + Redis were both
down. Two problems:
1. The error code (500) and message ("auth check failed") said
nothing useful. The actual condition was "platform can't reach
its datastore to validate your token" — a Service Unavailable
class, not Internal Server Error.
2. The canvas had no way to distinguish infra-down from a real
auth bug, so it rendered the raw API string in the same
generic-error overlay it uses for everything.
Fix in two layers:
Server (wsauth_middleware.go):
- New abortAuthLookupError helper centralises all three sites
that previously returned `500 {"error":"auth check failed"}`
when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error.
- Now returns 503 + structured body
`{"error": "...", "code": "platform_unavailable"}`. 503 is
the correct semantic ("retry shortly, infra is unavailable")
and the code field is the contract the canvas reads.
- Body deliberately excludes the underlying DB error string —
production hostnames / connection-string fragments must not
leak into a user-visible error toast.
Canvas (api.ts):
- New PlatformUnavailableError class. api.ts inspects 503
responses for the platform_unavailable code and throws the
typed error instead of the generic "API GET /…: 503 …"
message. Generic 503s (upstream-busy, etc.) keep the legacy
path so existing busy-retry UX isn't disrupted.
Canvas (page.tsx):
- New PlatformDownDiagnostic component renders when the
initial hydration catches PlatformUnavailableError.
Surfaces the actual condition with operator-actionable
copy ("brew services start postgresql@14 / redis") +
pointer to the platform log + a Reload button.
Tests:
- Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable
pins the response shape (status, code field, no DB-error leak)
- Canvas: 5 tests for PlatformUnavailableError classification —
typed throw on 503+code match, generic-Error fallback for
503-without-code (upstream busy), 500 stays generic, non-JSON
body falls back to generic.
1015 canvas tests + full Go middleware suite pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convergence-pass review noted the comment at orphan_sweeper.go:171
still describes the pre-cb126014 contract ("Stop returns nil even
when container is gone, but a future change could surface real
errors"). The future is now — Stop does surface real errors today.
Tightened the comment to match the live contract:
isContainerNotFound is treated as success, anything else returns
the wrapped Docker error, sweeper retries on the next cycle.
Pure comment change, no behavior diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review caught a critical issue with 12c49183: the headline "skip
RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop`
unconditionally `return nil`'d after logging the underlying
ContainerRemove error, so the new `if err := h.provisioner.Stop(...);
err != nil { skip volume }` guard in workspace_crud.go AND the same
guard in the orphan sweeper could never fire. RemoveVolume always
ran, predictably failing with "volume in use" when Stop hadn't
actually killed the container — which is the exact production bug
the commit claimed to fix.
Now Stop:
- returns nil on successful remove (no change)
- returns nil when the container is already gone (uses the existing
isContainerNotFound helper — that's the cleanup post-condition,
not a failure)
- returns the wrapped Docker error otherwise (daemon timeout, ctx
cancellation, socket EOF — anything that means the container
might still be alive)
Audited every Provisioner.Stop caller in the tree (team.go,
workspace_restart.go ×4, workspace.go) — all of them already
discard the return value, so the widened error surface is purely
opt-in for the new cleanup paths and breaks no existing behaviour.
Other review-driven fixes in this commit:
- workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from
the request ctx too. RecordAndBroadcast does INSERT INTO
structure_events + Redis Publish; if the canvas hangs up, a
request-ctx-bound INSERT can be cancelled mid-write and the
WORKSPACE_REMOVED event never lands, leaving other WS clients
ignorant of the cascade.
- orphan_sweeper.go: added isLikelyWorkspaceID guard before turning
Docker container prefixes into SQL LIKE patterns. The Docker
name filter is a SUBSTRING match (not prefix), so non-workspace
containers like `my-ws-tool` slip through; the in-loop HasPrefix
in provisioner trims most, but the in-sweeper alphabet check
(hex + dashes only) is the second line of defence and also
blocks SQL LIKE wildcards (`_`, `%`) from reaching the query.
Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes
and TestIsLikelyWorkspaceID with 10 alphabet cases.
- provisioner.go: comment added to ListWorkspaceContainerIDPrefixes
flagging the substring/HasPrefix relationship as load-bearing.
Verified: full Go test suite passes; all 8 sweeper tests pass
(2 new for the LIKE-pattern guard); existing dispatch / delete /
provisioner tests unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.
Two compounding bugs in handlers/workspace_crud.go's delete cascade:
1. The cleanup loop used `c.Request.Context()` for the Docker
stop/remove calls. When the canvas's `api.del` resolved on the
platform's 200, gin cancelled the request ctx — and any in-flight
Docker call cancelled with `context canceled`, leaving the
container alive. Old logs:
"Delete descendant <id> volume removal warning:
... context canceled"
2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
ran unconditionally afterward. When Stop didn't actually kill the
container (transient daemon error, ctx cancellation as in #1), the
volume removal would predictably fail with "volume in use" and
the container kept running with the volume mounted. Old logs:
"Delete descendant <id> volume removal warning:
Error response from daemon: remove ... volume is in use"
Fix layered in two parts:
- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
+ a 30s bounded timeout. Stop's error is now checked and on
failure we skip RemoveVolume entirely (the orphan sweeper below
catches what we deferred). The shape of this guard is sketched after
this list.
- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
initial run on boot). Lists running ws-* containers via Docker name
filter, intersects with DB rows where status='removed', stops +
removes volumes for the leaks. Defence in depth — even a brand-new
Stop failure mode heals on the next sweep instead of leaking
forever.
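Roughly, the detach-and-gate shape is the following (function and
interface names are illustrative stand-ins for the workspace_crud.go
code, not its real signatures):
    package handlers

    import (
        "context"
        "log"
        "time"
    )

    // workspaceProvisioner is a stand-in for the methods the cleanup loop uses.
    type workspaceProvisioner interface {
        Stop(ctx context.Context, workspaceID string) error
        RemoveVolume(ctx context.Context, workspaceID string) error
    }

    func cleanupWorkspace(reqCtx context.Context, p workspaceProvisioner, wsID string) {
        // Detach from the request lifetime so a canvas hang-up cannot cancel
        // in-flight Docker calls, but still bound the whole cleanup to 30s.
        ctx, cancel := context.WithTimeout(context.WithoutCancel(reqCtx), 30*time.Second)
        defer cancel()

        if err := p.Stop(ctx, wsID); err != nil {
            // The container may still be running; removing the volume now
            // would just fail with "volume in use". Leave it for the sweeper.
            log.Printf("delete %s: stop failed, deferring volume removal: %v", wsID, err)
            return
        }
        if err := p.RemoveVolume(ctx, wsID); err != nil {
            log.Printf("delete %s: volume removal warning: %v", wsID, err)
        }
    }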
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.
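The seam looks roughly like this; method names follow the wording above,
and the exact signatures in orphan_sweeper.go may differ:
    package registry

    import "context"

    // OrphanReaper is the narrow slice of the Provisioner the sweeper
    // depends on; unit tests satisfy it with a fake, not a Docker daemon.
    type OrphanReaper interface {
        // ListWorkspaceContainerIDPrefixes lists running ws-* containers
        // and returns the workspace-ID fragments trimmed from their names.
        ListWorkspaceContainerIDPrefixes(ctx context.Context) ([]string, error)
        Stop(ctx context.Context, workspaceID string) error
        RemoveVolume(ctx context.Context, workspaceID string) error
    }

    // sweepOnce (outline): list running ws-* prefixes, keep only likely
    // workspace IDs, intersect with rows WHERE status = 'removed', then
    // Stop + RemoveVolume each leak; per-container errors are skipped and
    // retried on the next cycle.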
main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.
6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.
Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-2 review of the wedge/idle/progress bundle came back Approve
with 4 optional polish items. All taken:
1. Migration 043 down file gained `SET LOCAL lock_timeout = '5s'`
matching the up file. A rollback under the same load that
motivated the up-file guard would otherwise stall writers.
2. _clear_sdk_wedge_on_success now gates on actual stream content
(result_text or assistant_chunks). A degenerate "iterator
returned without raising but emitted nothing" case (possible
from a partial stream or stub SDK) no longer falsely advertises
recovery — only a real successful query (≥1 ResultMessage or
AssistantMessage TextBlock) clears the wedge.
3. isUpstreamBusyError dropped the redundant
`strings.Contains(msg, "context deadline exceeded")` fallback.
*url.Error.Unwrap propagates the typed sentinel since Go 1.13;
errors.Is(err, context.DeadlineExceeded) catches the real
net/http shape. The substring was a foot-gun (would also match
user-content with that phrase). Test fixture updated to use
`fmt.Errorf("Post: %w", context.DeadlineExceeded)` which
reflects what net/http actually returns (see the sketch after this
list).
4. TestIsUpstreamBusyError added a context.Canceled case (both
typed and wrapped via %w) — pins the new applyIdleTimeout
classification.
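The timeout/cancel half of the classification now comes down to typed
checks, roughly as below; the helper and fixture names are illustrative,
and the busy-marker matching that the real isUpstreamBusyError also
performs is omitted:
    package handlers

    import (
        "context"
        "errors"
        "fmt"
    )

    // isTimeoutOrCancel is a hypothetical extraction of the typed checks;
    // the real isUpstreamBusyError folds them into its broader logic.
    func isTimeoutOrCancel(err error) bool {
        return errors.Is(err, context.DeadlineExceeded) ||
            errors.Is(err, context.Canceled)
    }

    // The updated fixture wraps the sentinel the way the net/http error
    // chain does, so errors.Is finds it through the wrapping.
    func fixtureDeadlineErr() error {
        return fmt.Errorf("Post: %w", context.DeadlineExceeded)
    }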
No critical/required findings on second pass; reviewer verdict was
Approve. Items above are polish for symmetry and test clarity.
1010 canvas + 64 Python + full Go suites pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle review of pieces 1/2/3 surfaced two critical issues plus a
handful of required + optional fixes. All addressed.
Critical:
1. Migration 043 was missing 'paused' and 'hibernated' from the
workspace_status enum. Both are real production statuses written
by workspace_restart.go (lines 283 and 406), introduced by
migration 029_workspace_hibernation. The original `USING
status::workspace_status` cast would have errored mid-transaction
on any production DB containing those values. Added both. Also
added `SET LOCAL lock_timeout = '5s'` so the migration aborts
instead of stalling the workspace fleet behind a slow SELECT.
2. The chat activity-feed window kept only 8 lines, and a single
multi-tool turn (Read 5 files + Grep + Bash + Edit + delegate)
easily flushed older context before the user could read it.
Extracted appendActivityLine to chat/activityLog.ts with a
20-line window AND consecutive-duplicate collapse (same tool
on the same target twice in a row is noise, not new progress).
5 unit tests pin the behavior.
Required:
3. The SDK wedge flag was sticky-only — a single transient
Control-request-timeout from a flaky network blip locked the
workspace into degraded for the whole process lifetime, even
when the next query() would have succeeded. Added
_clear_sdk_wedge_on_success(), called from _run_query's success
path. The next heartbeat after a working query reports
runtime_state empty and the platform recovers the workspace to
online without a manual restart. New regression test.
4. _report_tool_use now sets target_id = WORKSPACE_ID for self-
actions, matching the convention other self-logged activity
rows use. DB consumers joining on target_id see a well-defined
value instead of NULL.
Optional taken:
5. Tightened _WEDGE_ERROR_PATTERNS from "control request timeout"
to "control request timeout: initialize" — suffix-anchored so a
future SDK error on an in-flight tool-call control message
doesn't get misclassified as the unrecoverable post-init wedge.
6. Dropped the redundant "context canceled" substring fallback in
isUpstreamBusyError. errors.Is(err, context.Canceled) is the
typed check; the substring would also match healthy client-side
aborts, which we don't want classified as upstream-busy.
Verified: 1010 canvas tests + 64 Python tests + full Go suite pass;
migration applies cleanly on dev DB with all 8 enum values; reverse
migration restores TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous canvas-default 5-min absolute deadline pre-empted any
chat that legitimately ran longer (multi-turn tool use, large
synthesis tasks) and made every wedged-SDK call burn 5 full minutes
before the user saw anything. Replaced with a per-dispatch idle
timeout: cancel the request only when the broadcaster has been
silent for `idleTimeoutDuration` (60s). Any progress event for the
workspace — agent_log tool-use rows, task_update, a2a_send,
a2a_receive — resets the clock.
Mechanics:
- new applyIdleTimeout helper subscribes to events.Broadcaster's
per-workspace SSE channel, drains its messages, resets a
time.Timer on each one, cancels the wrapped ctx when the timer
fires. The cleanup goroutine and subscription live only until the
returned cancel func is called; a sketch follows this list.
- dispatchA2A now takes workspaceID as a parameter, applies the
idle timeout always (canvas + agent), and combines its cancel
with the existing 30-min agent-to-agent ceiling cancel into one
func the caller defers.
- Canvas dispatches no longer have an absolute ceiling at all —
the idle timer is the only "give up" signal. A healthy chat
reporting tool-use telemetry every few seconds runs forever;
a wedged runtime fails in 60s instead of 5 min.
- isUpstreamBusyError now also recognises context.Canceled (the
error class our idle cancel produces, distinct from
DeadlineExceeded). Same 503-busy retry semantics.
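A sketch of the helper, assuming a simple per-workspace subscription API
(the real events.Broadcaster signature and event type will differ):
    package handlers

    import (
        "context"
        "time"
    )

    // subscriber is a stand-in for events.Broadcaster's per-workspace channel API.
    type subscriber interface {
        Subscribe(workspaceID string) (events <-chan struct{}, unsubscribe func())
    }

    // applyIdleTimeout cancels the returned ctx (with context.Canceled) after
    // `idle` of broadcaster silence; any workspace event resets the clock.
    func applyIdleTimeout(
        ctx context.Context, b subscriber, workspaceID string, idle time.Duration,
    ) (context.Context, context.CancelFunc) {
        if b == nil {
            return ctx, func() {} // defensive no-op when no broadcaster is wired
        }
        ctx, cancel := context.WithCancel(ctx)
        events, unsubscribe := b.Subscribe(workspaceID)
        timer := time.NewTimer(idle)

        go func() {
            defer unsubscribe()
            defer timer.Stop()
            for {
                select {
                case <-timer.C:
                    cancel() // a full window of silence: give up
                    return
                case _, ok := <-events:
                    if !ok {
                        return
                    }
                    if !timer.Stop() { // drain a concurrent fire before Reset
                        select {
                        case <-timer.C:
                        default:
                        }
                    }
                    timer.Reset(idle)
                case <-ctx.Done():
                    return
                }
            }
        }()
        return ctx, cancel
    }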
Tests:
- TestApplyIdleTimeout_FiresOnSilence — 60ms idle, no events,
ctx cancels with context.Canceled.
- TestApplyIdleTimeout_ResetsOnEvent — event mid-window extends
the deadline; ctx alive past original deadline, then cancels
on the second silence window.
- TestApplyIdleTimeout_NilBroadcasterDegradesGracefully — defensive
no-op for paths that don't wire a broadcaster.
- 3 existing dispatchA2A tests updated for the new workspaceID
param + the always-non-nil cancel return shape.
This pairs with Piece 1's per-tool-use telemetry (166c7f77): the
broadcaster events that reset the idle timer ARE the agent_log
rows the workspace started emitting per tool call. So the same
event stream feeds both the chat progress feed AND the proxy's
deadline.
Full Go test suite passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeat lies. The asyncio task that POSTs /registry/heartbeat lives
in its own process slot, so a workspace whose claude_agent_sdk has
wedged on `Control request timeout: initialize` keeps reporting
"online" — every chat send hangs the full 5-min platform deadline
even though the runtime is dead in the water. This commit teaches
the workspace to admit it's wedged and the platform to honor that
admission by flipping status → degraded.
Five layers, all in one commit because they share a contract:
1. Migration 043 — convert workspaces.status from free-form TEXT to
a real `workspace_status` Postgres ENUM with the 6 values
production code actually writes (provisioning, online, offline,
degraded, failed, removed). Locks the value set; future typo
writes error at the DB instead of silently storing rogue strings.
Down migration reverts to TEXT and drops the type.
2. workspace-server/internal/models — `HeartbeatPayload` gains a
`runtime_state string` field. Empty = healthy. Currently the only
non-empty value the handler honors is "wedged"; future symptoms
can extend without another migration.
3. workspace-server/internal/handlers/registry.go — `evaluateStatus`
gains a wedge branch BEFORE the existing error_rate >= 0.5 path:
if `RuntimeState=="wedged"` and currently online, flip to
degraded and broadcast WORKSPACE_DEGRADED with the wedge sample
error. Recovery (`degraded → online`) now requires BOTH
error_rate < 0.1 AND runtime_state cleared, so a workspace still
reporting wedged stays degraded even when its error count
happens to be 0 (the wedge captures a runtime state, not an
error count). A sketch of this branch follows the list.
4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason`
flag set when execute()'s catch block sees an error matching
`_WEDGE_ERROR_PATTERNS` (currently just "control request
timeout"). Sticky for the process lifetime; the SDK's internal
client-process state is corrupted on this error and only a
workspace restart (= new Python process = fresh module state)
clears it. Helpers `is_wedged()` / `wedge_reason()` /
`_reset_sdk_wedge_for_test()` exposed.
5. workspace/heartbeat.py — heartbeat body now layers on
`_runtime_state_payload()` for both the happy path and the
401-retry path. Lazy-imports claude_sdk_executor so non-Claude
runtimes (where the module may not even be importable) keep
working unchanged.
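A rough sketch of the contract between layers 2 and 3; the payload field
follows the wording above, while evaluateStatus's real signature, JSON
tag details, and DB side effects are elided:
    package handlers

    // Layer 2: the heartbeat body gains runtime_state; empty means healthy.
    type HeartbeatPayload struct {
        // ... existing fields ...
        RuntimeState string `json:"runtime_state,omitempty"` // only "wedged" is honored today
    }

    // Layer 3 (sketch): the wedge branch runs before the error-rate inference.
    func evaluateStatus(current string, hb HeartbeatPayload, errorRate float64) string {
        if hb.RuntimeState == "wedged" && current == "online" {
            return "degraded" // adapter self-report; broadcast WORKSPACE_DEGRADED
        }
        if current == "online" && errorRate >= 0.5 {
            return "degraded"
        }
        // Recovery requires BOTH a quiet error rate AND a cleared runtime_state,
        // so a workspace still reporting wedged stays degraded at error_rate 0.
        if current == "degraded" && errorRate < 0.1 && hb.RuntimeState == "" {
            return "online"
        }
        return current
    }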
Canvas required no changes — `STATUS_CONFIG.degraded` was already
defined in design-tokens.ts (amber dot, "Degraded" label) and
WorkspaceNode.tsx already renders `lastSampleError` underneath the
status pill when status === "degraded". The existing wiring just
never fired because nothing was writing degraded in this code path.
Tests:
- 3 Go handler tests for the new transitions (online → degraded on
wedged, degraded stays put while still wedged, degraded → online
after wedge clears)
- 5 Python wedge-detector tests (default clean, mark sets flag,
sticky-first-wins, execute() flips on Control request timeout,
execute() does NOT flip on unrelated errors)
- Migration smoke-tested against the local dev DB (3 existing rows,
all enum-compatible; migration applied cleanly, post-state has
the column as workspace_status type and the index preserved)
Verified: 79 Python tests pass; full Go test suite passes; migration
applies clean on a real DB; reverse migration restores the column to
TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>