Closes the gap where the Director would say "ZIP is ready at /tmp/foo.zip"
in plain text instead of attaching a download chip — the runtime literally
had no API for outbound file attachments. The canvas + platform's
chat-uploads infrastructure already supported the inbound (user → agent)
direction (commit 94d9331c); this PR wires the outbound side.
End-to-end shape:
agent: send_message_to_user("Done!", attachments=["/tmp/build.zip"])
↓ runtime
POST /workspaces/<self>/chat/uploads (multipart)
↓ platform
/workspace/.molecule/chat-uploads/<uuid>-build.zip
→ returns {uri: workspace:/...build.zip, name, mimeType, size}
↓ runtime
POST /workspaces/<self>/notify
{message: "Done!", attachments: [{uri, name, mimeType, size}]}
↓ platform
Broadcasts AGENT_MESSAGE with attachments + persists to activity_logs
with response_body = {result: "Done!", parts: [{kind:file, file:{...}}]}
↓ canvas
WS push: canvas-events.ts adds attachments to agentMessages queue
Reload: ChatTab.loadMessagesFromDB → extractFilesFromTask sees parts[]
Either path → ChatTab renders download chip via existing path
Files changed:
workspace-server/internal/handlers/activity.go
- NotifyAttachment struct {URI, Name, MimeType, Size}
- Notify body accepts attachments[], broadcasts in payload,
persists as response_body.parts[].kind="file"
canvas/src/store/canvas-events.ts
- AGENT_MESSAGE handler reads payload.attachments, type-validates
each entry, attaches to agentMessages queue
- Skips empty events (was: skipped only when content empty)
workspace/a2a_tools.py
- tool_send_message_to_user(message, attachments=[paths])
- New _upload_chat_files helper: opens each path, multipart POSTs
to /chat/uploads, returns the platform's metadata
- Fail-fast on missing file / upload error — never sends a notify
with a half-rendered attachment chip
workspace/a2a_mcp_server.py
- inputSchema declares attachments param so claude-code SDK
surfaces it to the model
- Defensive filter on the dispatch path (drops non-string entries
if the model sends a malformed payload)
Tests:
- 4 new Python: success path, missing file, upload 5xx, no-attach
backwards compat
- 1 new Go: Notify-with-attachments persists parts[] in
response_body so chat reload reconstructs the chip
Why /tmp paths work even though they're outside the canvas's allowed
roots: the runtime tool reads the bytes locally and re-uploads through
/chat/uploads, which lands the file under /workspace (an allowed root).
The agent can specify any readable path.
Does NOT include: agent → agent file transfer. Different design problem
(cross-workspace download auth: peer would need a credential to call
sender's /chat/download). Tracked as a follow-up under task #114.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review-feedback follow-up. Pre-fix, A2A_IDLE_TIMEOUT_SECONDS=foo or =-30
fell back to the default with zero log signal — operator sets the wrong
value, sees "no effect," wastes hours debugging "why is my override not
working." Now bad-input cases log a clear message naming the variable,
the bad value, and the default applied.
Refactor: extract parseIdleTimeoutEnv(string) → time.Duration so the
parse logic is unit-testable. defaultIdleTimeoutDuration is a const so
tests reference it without re-deriving the value.
8 new unit tests cover empty / valid / negative / zero / non-numeric /
float / trailing-units inputs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs caused the "context canceled" wave on 2026-04-26
(15+ failed user/agent A2A calls in 1hr across 6 workspaces, including
the user's "send it in the chat" message that the director never
received):
1. **a2a_proxy.go:applyIdleTimeout cancels the dispatch after 60s of
broadcaster silence** for the workspace. Resets on any SSE event
for the workspace, fires cancel() if no event arrives in time.
2. **registry.go:Heartbeat broadcast was conditional** —
`if payload.CurrentTask != prevTask`. The runtime POSTs
/registry/heartbeat every 30s, but if current_task hasn't changed
the handler emits ZERO broadcasts. evaluateStatus only broadcasts
on online/degraded transitions — also no-op when steady.
Net: a claude-code agent on a long packaging step or slow tool call
keeps the same current_task for >60s → no broadcasts → idle timer
fires → in-flight request cancelled mid-flight with the "context
canceled" error the user sees in the activity log.
Fix:
(a) Heartbeat handler always emits a `WORKSPACE_HEARTBEAT` BroadcastOnly
event (no DB write — same path as TASK_UPDATED). At the existing 30s
runtime cadence this resets the idle timer twice per minute.
Cost is one in-memory channel send per active SSE subscriber + one
WS hub fan-out per heartbeat — far below any noise floor.
(b) idleTimeoutDuration default bumped 60s → 5min as a safety net for
any future regression where the heartbeat path goes silent (e.g.
runtime crashed mid-request before its next heartbeat). Made
env-overridable via A2A_IDLE_TIMEOUT_SECONDS for ops who want to
tune (canary tests fail-fast, prod tenants with slow plugins want
longer). Either fix alone closes today's gap; both together is
defence in depth.
The runtime side already POSTs /registry/heartbeat every 30s via
workspace/heartbeat.py — no runtime change needed.
Test: TestHeartbeatHandler_AlwaysBroadcastsHeartbeat pins the property
that an SSE subscriber observes a WORKSPACE_HEARTBEAT broadcast on a
same-task heartbeat (the regression scenario). All 16 existing handler
tests still pass.
Doesn't fix: task #102 (single SDK session bottleneck) — peers will
still queue when busy. But this PR ensures the queue/wait flow
actually completes instead of being killed by the idle timer
mid-wait.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.
Fix shape:
1. Provisioner stamps every ws-* container + volume with
`molecule.platform.managed=true`. Without a label, the sweeper
would have to assume any unlabeled ws-* container might belong
to a sibling platform stack on a shared Docker daemon.
2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
counterpart to the existing name-filter.
3. Sweeper splits sweepOnce into two independent passes:
- sweepRemovedRows (unchanged behavior; status='removed' only)
- sweepLabeledOrphansWithoutRows (new; labeled containers whose
workspace_id has no row in the table at all)
Each pass has its own short-circuit so an empty result or transient
error in one doesn't block the other — load-bearing because the
wiped-DB pass exists precisely for cases where the removed-row
pass finds nothing.
Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)
Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.
Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.
Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in goroutine that polls GHCR every 5 minutes for digest
changes on each workspace-template-*:latest tag and invokes the same
refresh logic /admin/workspace-images/refresh exposes. With this, the
chain from "merge runtime PR" to "containers running new code" is fully
hands-off — no operator step between auto-tag → publish-runtime →
cascade → template image rebuild → host pull + recreate.
Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already
pulls every release should leave it off (would be redundant work);
self-hosters get true zero-touch.
Why a refactor of admin_workspace_images.go is in this PR:
The HTTP handler held all the refresh logic inline. To share it with
the new watcher without HTTP loopback, extracted WorkspaceImageService
with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape.
HTTP handler is now a thin wrapper; behavior is preserved (same JSON
response, same 500-on-list-failure, same per-runtime soft-fail).
Watcher design notes:
- Last-observed digest tracked in memory (not persisted). On boot the
first observation per runtime is seed-only — no spurious refresh
fires on every restart.
- On Refresh error, the seen digest rolls back so the next tick retries.
Without this rollback a transient Docker glitch would convince the
watcher the work was done.
- Per-runtime fetch errors don't block other runtimes (one template's
brief 500 doesn't pause the others).
- digestFetcher injection seam in tick() lets unit tests cover all
bookkeeping branches without standing up an httptest GHCR server.
Verified live: probed GHCR's /token + manifest HEAD against
workspace-template-claude-code; got HTTP 200 + a real
Docker-Content-Digest. Same calls the watcher makes.
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on PR #2105 caught 7 Create-handler tests still mocking the
pre-#1408 10-arg INSERT signature. With the column now wired
unconditionally into the INSERT, every WithArgs that pinned
budget_limit as the 10th arg needed a 11th slot for the resolved
max_concurrent_tasks value.
Files:
- workspace_test.go: 6 tests (DBInsertError, DefaultsApplied,
WithSecrets_Persists, TemplateDefaultsMissingRuntimeAndModel,
TemplateDefaultsLegacyTopLevelModel, CallerModelOverridesTemplateDefault)
- workspace_budget_test.go: 1 test (Budget_Create_WithLimit)
All resolved values are the schema-default mirror, so the test
expectation reads as the same models.DefaultMaxConcurrentTasks
const that the handler writes. New imports added to both files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of the wire-up commit:
- New const models.DefaultMaxConcurrentTasks = 1; handlers and tests
reference the symbol so the schema-default mirror lives in one place.
- Strip 5 multi-line comments that narrated what the code does.
- Drop the duplicate field-rationale on OrgWorkspace; the one on
CreateWorkspacePayload is canonical.
- Drop test-side positional comments that would silently lie if columns
get reordered.
Pure cleanup; no behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of #1408 (active_tasks counter). Runtime increment/decrement,
schema column (037), and scheduler enforcement (scheduler.go:312)
already shipped — but the write path from template config.yaml +
direct API was missing, so every workspace silently fell through to
the schema default of 1. Leaders that set max_concurrent_tasks: 3 in
their org template were getting 1 anyway, defeating the entire
feature for the use case it was built for (cron-vs-A2A contention on
PM/lead workspaces).
- OrgWorkspace gains MaxConcurrentTasks (yaml + json tags)
- CreateWorkspacePayload gains MaxConcurrentTasks (json tag)
- Both INSERTs now write the column unconditionally; 0/omitted
payload value falls back to 1 (schema default mirror) so the wire
stays single-shape — no forked column list / goto.
- Existing Create-handler test mocks updated to expect the 11th arg.
- New TestWorkspaceCreate_MaxConcurrentTasksOverride locks the
payload→DB propagation for the leader case (value=3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):
- a2a_proxy.go: keep the branch's idle-timeout signature
(workspaceID parameter + comment) AND apply staging's #1483 SSRF
defense-in-depth check at the top of dispatchA2A. Type-assert
h.broadcaster (now an EventEmitter interface per staging) back to
*Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
to no-op when the assertion fails (test-mock case).
- a2a_proxy_test.go: keep both new test suites — branch's
TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
the staging test's dispatchA2A call to pass the workspaceID arg
introduced by the branch's signature change.
- workspace_crud.go: combine both Delete-cleanup intents:
* Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
hang-up doesn't cancel mid-Docker-call (the container-leak fix)
* Branch's stopAndRemove helper that skips RemoveVolume when Stop
fails (orphan sweeper handles)
* Staging's #1843 stopErrs aggregation so Stop failures bubble up
as 500 to the client (the EC2 orphan-instance prevention)
Both concerns satisfied: cleanup runs to completion past canvas
hangup AND failed Stop calls surface to caller.
Build clean, all platform tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
PR #2103 widened the SSRF saasMode branch to also relax RFC-1918 + ULA
under MOLECULE_ENV=development (so the docker-compose dev pattern stops
rejecting workspace registrations on 172.18.x.x bridge IPs). The
existing TestIsSafeURL_DevMode_StillBlocksOtherRanges covered the
security floor (metadata / TEST-NET / CGNAT stay blocked), but no
test asserted the positive side — that 10.x / 172.x / 192.168.x / fd00::
ARE now allowed under dev mode.
Without this test, a future refactor that quietly drops the
`|| devModeAllowsLoopback()` from isPrivateOrMetadataIP wouldn't trip
any assertion, and the docker-compose dev loop would silently re-break.
Adds TestIsSafeURL_DevMode_AllowsRFC1918 — table of 4 URLs covering
the three RFC-1918 IPv4 ranges + IPv6 ULA fd00::/8. Sets
MOLECULE_DEPLOY_MODE=self-hosted explicitly so the test exercises the
devMode branch, not a SaaS-mode pass.
Closes the Optional finding I left on PR #2103.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.
Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
ContainerList returns the resolved digest in .Image, not the human tag.
Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
same Apple-Silicon-needs-amd64 platform as the provisioner — single
source of truth for the multi-arch override.
Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Pre-fix, POST /workspaces/:id/notify (the side-channel agents use to push
interim updates and follow-up results) only broadcast via WebSocket — no
DB write. When the user refreshed the page, the chat-history loader
(which queries activity_logs) couldn't restore those messages and they
vanished from the chat.
Hits the most common path: when the platform's POST /a2a times out (idle),
the runtime keeps working and eventually pushes its reply via
send_message_to_user. The reply rendered live but disappeared on reload.
Fix: also INSERT an activity_logs row with shape the existing loader
already understands (type=a2a_receive, source_id=NULL, response_body=
{result: text}). Persistence is best-effort — a DB hiccup doesn't block
the WebSocket push (which the user is already seeing).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The docker-compose dev pattern puts platform and workspace containers on
the same docker bridge network (172.18.0.0/16, RFC-1918). The runtime
registers via its docker-internal hostname which DNS-resolves to a
172.18.x.x IP. The SSRF defence's isPrivateOrMetadataIP rejected those,
so every workspace POST through the platform proxy returned
'workspace URL is not publicly routable' — breaking the entire docker-
compose dev loop.
Fix: in isPrivateOrMetadataIP, treat MOLECULE_ENV=development the same
as SaaS mode for RFC-1918 relaxation. Both share the 'trusted intra-
network routing' property — SaaS is sibling EC2s in the same VPC, dev
is sibling containers on the same docker bridge. Always-blocked
categories (metadata link-local, TEST-NET, CGNAT) stay blocked.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
When proxyA2A returns 202+{queued:true} (target busy → enqueued for drain
on next heartbeat), executeDelegation previously treated it as a successful
completion and ran extractResponseText on the queued JSON. The result was
'Delegation completed (workspace agent busy — request queued, will dispatch...)'
landing in activity_logs.summary, which the LLM then echoed to the user
chat as garbage.
Two fixes:
1. delegation.go: detect queued shape via new isQueuedProxyResponse helper,
write status='queued' with clean summary 'Delegation queued — target at
capacity', store delegation_id in response_body so the drain can stitch
back later. Also embed delegation_id in params.message.metadata + use it
as messageId so the proxy's idempotency-key path keys off the same id.
2. a2a_queue.go: when DrainQueueForWorkspace successfully drains a queued
item, extract delegation_id from the body's metadata and UPDATE the
originating delegate_result row (queued → completed with real
response_body). Broadcast DELEGATION_COMPLETE so the canvas chat feed
flips the queued line to completed in real time.
Closes the loop so check_task_status reflects ground truth instead of
perpetual 'queued' even after the queued request eventually drained.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The 3 skipped tests in workspace_provision_test.go (#1206 regression
tests) were blocked because captureBroadcaster's struct-embed wouldn't
type-check against WorkspaceHandler.broadcaster's concrete
*events.Broadcaster field. This PR fixes the interface blocker for
the 2 broadcaster-related tests; the 3rd (plugins.Registry resolver)
is a separate blocker tracked elsewhere.
Changes:
- internal/events/broadcaster.go: define `EventEmitter` interface with
RecordAndBroadcast + BroadcastOnly. *Broadcaster satisfies it via
its existing methods (compile-time assertion guards future drift).
SubscribeSSE / Subscribe stay off the interface because only sse.go
+ cmd/server/main.go call them, and both still hold the concrete
*Broadcaster.
- internal/handlers/workspace.go: WorkspaceHandler.broadcaster type
changes from *events.Broadcaster to events.EventEmitter.
NewWorkspaceHandler signature updated to match. Production callers
unchanged — they pass *events.Broadcaster, which the interface
accepts.
- internal/handlers/activity.go: LogActivity takes events.EventEmitter
for the same reason — tests passing a stub no longer need to
construct the full broadcaster.
- internal/handlers/workspace_provision_test.go: captureBroadcaster
drops the struct embed (no more zero-value Broadcaster underlying
the SSE+hub fields), implements RecordAndBroadcast directly, and
adds a no-op BroadcastOnly to satisfy the interface. Skip messages
on the 2 empty broadcaster-blocked tests updated to reflect the
new "interface unblocked, test body still needed" state.
Verified `go build ./...`, `go test ./internal/handlers/`, and
`go vet ./...` all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1483 flagged that dispatchA2A() doesn't call isSafeURL internally —
the guard exists only at the caller level (resolveAgentURL at
a2a_proxy.go:424). The primary call path through proxyA2ARequest
is safe today, but if any future code path ever calls dispatchA2A
directly without going through resolveAgentURL, the SSRF check
would be silently bypassed.
This adds the one-line defense-in-depth guard the issue prescribed:
if err := isSafeURL(agentURL); err != nil {
return nil, nil, &proxyDispatchBuildError{err: err}
}
Wrapping as *proxyDispatchBuildError preserves the existing caller
error-classification path — the same shape that maps to 500 elsewhere.
Adds TestDispatchA2A_RejectsUnsafeURL pinning the contract:
re-enables SSRF for the test (setupTestDB disables it for normal
unit tests), passes a metadata IP, asserts the build error returns
and cancel is nil so no resource is leaked.
The 4 existing dispatchA2A unit tests use setupTestDB → SSRF
disabled, so they continue passing unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1484 flagged that discoverHostPeer() and writeExternalWorkspaceURL()
return URLs sourced from the workspaces table without an isSafeURL
check. Workspace runtimes register their own URLs via /registry/register
— a misbehaving / compromised runtime could register a metadata-IP URL.
Today both functions are gated by Phase 30.6 bearer-required Discover,
so exposure is theoretical. The fix makes them safe regardless of
upstream auth shape.
Changes:
- discoverHostPeer: isSafeURL on resolved URL before responding;
503 + log on rejection.
- writeExternalWorkspaceURL: same guard applied to the post-rewrite
outURL (so a host.docker.internal rewrite is checked AND a
metadata-IP that survived the rewrite untouched is rejected).
- 3 new regression tests:
* RejectsMetadataIPURL on host-peer path (169.254.169.254 → 503)
* AcceptsPublicURL on host-peer path (8.8.8.8 → 200; positive
counterpart so the rejection test can't pass via universal-fail)
* RejectsMetadataIPURL on external-workspace path
setupTestDB already disables SSRF checks via setSSRFCheckForTest,
so the 16+ existing discovery tests remain untouched. Only the new
tests opt in to enabled SSRF.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extract walkTemplateConfigs(configsDir, fn) shared helper. Both
templates.List and loadRuntimeProvisionTimeouts walked configsDir
+ parsed config.yaml — same boilerplate twice. Now centralised so
a future template-discovery rule (subdir naming, README sentinel,
etc.) lands in one place.
- templates.List uses the walker — net -10 lines.
- loadRuntimeProvisionTimeouts uses the walker — net -10 lines.
- Document runtimeProvisionTimeoutsCache as 'NOT SAFE for
package-level reuse' so a future change doesn't accidentally
promote it to a singleton (sync.Once can't be reset → tests
would lock out other fixtures).
Skipped (review finding): atomic.Pointer[map[string]int] for
future hot-reload. The doc comment already documents the
limitation; YAGNI-promoting the primitive now would buy a
not-yet-built feature at the cost of more code today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of #2054 — workspace-server reads runtime-level
provision_timeout_seconds from template config.yaml manifests and
includes provision_timeout_ms in the workspace List/Get response.
Phase 1 (canvas, #2092) already plumbs the field through socket →
node-data → ProvisioningTimeout's resolver, so the moment a
template declares the field the per-runtime banner threshold
adjusts without a canvas release.
Implementation:
- templates.go: parse runtime_config.provision_timeout_seconds in
the templateSummary marshaller. The /templates API now surfaces
the field too — useful for ops dashboards and future tooling.
- runtime_provision_timeouts.go (new): loadRuntimeProvisionTimeouts
scans configsDir, parses every immediate subdir's config.yaml,
returns runtime → seconds. Multiple templates with the same
runtime: max wins (so a slow template's threshold doesn't get
cut by a fast template's). Bad/empty inputs are silently
skipped — workspace-server starts cleanly with no templates.
- runtimeProvisionTimeoutsCache: sync.Once-backed lazy cache.
First workspace API request after process start pays the read
cost (~few KB across ~50 templates); every subsequent request is
a map lookup. Cache lifetime = process lifetime; invalidates on
workspace-server restart, which is the normal template-change
cadence.
- WorkspaceHandler gets a provisionTimeouts field (zero-value struct
is valid — the cache lazy-inits on first get()).
- addProvisionTimeoutMs decorates the response map with
provision_timeout_ms (seconds × 1000) when the runtime has a
declared timeout. Absent = no key in the response, canvas falls
through to its runtime-profile default. Wired into both List
(per-row decoration in the loop) and Get.
Tests (5 new in runtime_provision_timeouts_test.go):
- happy path: hermes declares 720, claude-code doesn't, only
hermes appears in the map
- max-on-duplicate: same runtime in two templates → max wins
- skip-bad-inputs: missing runtime, zero timeout, malformed yaml,
loose top-level files all silently ignored
- missing-dir: returns empty map, no crash
- cache: lazy-init on first get; subsequent gets hit cache even
after underlying file changes (sync.Once contract); unknown
runtime returns zero
Phase 3 (separate template-repo PR): template-hermes config.yaml
declares provision_timeout_seconds: 720 under runtime_config.
canvas RUNTIME_PROFILES.hermes becomes redundant + removable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing tokens_test.go skips every test when db.DB is nil, so CI
ran with 0% coverage on tokens.go's List/Create/Revoke. This file adds
sqlmock-driven tests that exercise the SQL paths directly without
needing a live Postgres, lifting coverage on all 4 functions to 100%
and module-level handler coverage from 60.3% → 61.1%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both backends panicked when called on a zero-valued or nil receiver:
Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop,
IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown
paths can call these speculatively where the receiver isn't fully
wired — the panic crashed the goroutine instead of the caller seeing
a clean error.
Three changes:
1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods.
- Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at
the top.
- CPProvisioner.Stop: guard p == nil up top, then httpClient nil
AFTER resolveInstanceID + empty-instance check (the empty
instance_id path doesn't need HTTP and stays a no-op success
even on zero-valued receivers — preserved historical contract
from TestIsRunning_EmptyInstanceIDReturnsFalse).
- CPProvisioner.IsRunning: same shape — empty instance_id stays
(false, nil); httpClient-nil with non-empty instance_id returns
ErrNoBackend.
2. Flip the t.Skip on TestDockerBackend_Contract +
TestCPProvisionerBackend_Contract — both contract tests run now
that the panics are gone. Skipped scenarios were the regression
guard for this fix.
3. Add TestZeroValuedBackends_NoPanic — explicit assertion that
zero-valued and nil receivers return cleanly (no panic). Docker
backend always returns ErrNoBackend on zero-valued; CPProvisioner
may return (false, nil) when the DB-lookup layer absorbs the case
(no instance to query → no HTTP needed). Both are acceptable per
the issue's contract — the gate is no-panic.
Tests:
- 6 sub-cases across the new TestZeroValuedBackends_NoPanic
- TestDockerBackend_Contract + TestCPProvisionerBackend_Contract
now run their 2 scenarios (4 sub-cases each)
- All existing provisioner tests still green
- go build ./... + go vet ./... + go test ./... clean
Closes drift-risk #6 in docs/architecture/backends.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: workspace-server's provision-timeout sweep was hardcoded
at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245)
correctly gives hermes 25 min for cold-boot (hermes installs
include apt + uv + Python venv + Node + hermes-agent — 13–25 min on
slow apt mirrors is normal). The two timeout systems disagreed:
the watcher would happily wait 25 min, but the workspace-server's
10-min sweep killed healthy hermes boots mid-install at 10 min and
marked them failed.
Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z
created a hermes workspace, EC2 cloud-init was visibly making
progress on apt-installs (libcjson1, libmbedcrypto7t64) when the
sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The
test threw "Workspace failed: " (empty error from sql.NullString
serialization) and CI failed on a healthy boot.
Fix: provisioningTimeoutFor(runtime) — same shape as the CP's
bootstrapTimeoutFn:
- hermes: 30 min (watcher's 25 min + 5 min slack)
- others: 10 min (unchanged — claude-code/langgraph/etc. boot
in <5 min, 10 min is plenty)
PROVISION_TIMEOUT_SECONDS env override still works (applies to all
runtimes — operators who care about the runtime distinction
shouldn't use the override anyway).
Sweep query change: pulls (id, runtime, age_sec) per row instead
of pre-filtering by age in SQL. Per-row Go evaluation picks the
correct timeout. Slightly more rows scanned but bounded by the
status='provisioning' partial index — workspaces in flight, not
historical.
Tests:
- TestProvisioningTimeout_RuntimeAware — locks in the per-runtime
mapping
- TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at
11 min must NOT be flipped
- TestSweepStuckProvisioning_HermesPastDeadline — hermes at
31 min IS flipped, payload includes runtime
- Existing tests updated for the new query shape
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... all green
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
\`Delete\`'s call to \`h.provisioner.Stop()\` was silently swallowing
errors — and on the SaaS/EC2 backend, Stop() is the call that
terminates the EC2 via the control plane. When Stop returned an
error (CP transient 5xx, network blip), the workspace was marked
'removed' in the DB but the EC2 stayed running with no row to
track it. The "14 orphan workspace EC2s on a 0-customer account"
incident in #1843 (40 vCPU on a 64 vCPU AWS limit) traced to this
silent-leak path.
This change aggregates Stop errors across both descendant and
self-stop calls and surfaces them as 500 to the client, matching
the loud-fail pattern from CP #262 (DeprovisionInstance) and the
DNS cleanup propagation (#269).
Idempotency:
- The DB row is already 'removed' before Stop runs (intentional,
per #73 — guards against register/heartbeat resurrection).
- \`resolveInstanceID\` reads instance_id without a status filter,
so a retry can replay Stop with the same instance_id.
- CP's TerminateInstance is idempotent on already-terminated EC2s.
- So a retry-after-500 either re-attempts the terminate (succeeds)
or finds the instance already gone (also succeeds).
Behaviour change at the API layer:
- Before: 200 \`{"status":"removed","cascade_deleted":N}\` regardless
of Stop outcome.
- After: 500 \`{"error":"...","removed_count":N,"stop_failures":K}\`
on Stop failure; 200 on success.
RemoveVolume errors stay log-and-continue — those are local
/var/data cleanup, not infra-leak class.
Test debt acknowledged: the WorkspaceHandler's \`provisioner\` field
is the concrete \`*provisioner.Provisioner\` type, not an interface.
Adding a regression test for the new error-propagation path
requires either a refactor (introduce a Provisioner interface) or
a docker-backed integration test. Filing the refactor as a
follow-up; the change here is small and mirrors a proven pattern
(CP #262 + #269 both ship without exhaustive new test coverage
for the same reason).
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... green across the whole module (existing TestDelete
cases unchanged behaviour for happy path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-quality + efficiency review of PR #2081:
- Drop comma-ok on map type-asserts in filterPeersByQuery —
queryPeerMaps writes name/role unconditionally as string, so the
silent-empty-string fallback was cargo-culted defense that would
HIDE a real upstream shape change in tests rather than surface it.
Plain p["name"].(string) panics on violation, caught by tests.
- Trim filterPeersByQuery doc from 5 lines to 1 — function is 15
lines and self-evident.
- Refactor 6 separate Test functions into one table-driven
TestPeers_QFilter with 6 sub-tests. Net ~80 lines saved + naming
becomes readable subtest names instead of TestPeers_Q_Foo_Bar.
- Set-based peer-id comparison (peerIDSet) replaces fragile
peers[0]["id"] == "ws-alpha" asserts that would silently mask a
future sort/order regression on the production code.
- Fix the broken TestPeers_Q_NoMatches assertion: re-encoding an
unmarshalled []map collapses both null and [] to [], so the
previous json.Marshal(peers) == "[]" check was tautological. Move
the [] vs null distinction to a dedicated test
(TestPeers_Q_NoMatches_RawBodyIsArrayNotNull) that inspects the
recorder body BEFORE unmarshal.
runPeersWithQuery now returns both parsed peers and raw body so the
nil-guard test can use the bytes directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Peers handler at workspace-server/internal/handlers/discovery.go
ignored the ?q= query param entirely — every caller got the full peer
list regardless of what they searched for. The handler exposes peer
identities + URLs, so leaking the unfiltered set on a "filtered"
endpoint is an info-disclosure bug (CWE-862).
Fix: read c.Query("q") and post-filter the in-memory peers slice by
case-insensitive substring match against name OR role. Filtering is
done in Go after the existing 3 SQL reads — keeps the SQL bytes
identical to the no-filter path (no injection vector, no DB-driver
collation surprises) at a small cost. The peer set is bounded by a
single workspace's parent + children + siblings (typically <50
rows), so the in-memory pass is negligible.
Empty / whitespace-only q is a no-op — preserves the no-filter
allocation profile.
Tests (6 new in discovery_test.go):
- TestPeers_NoQ_ReturnsAll — regression baseline (3 peers, no filter)
- TestPeers_Q_FiltersByName — q=alpha → ws-alpha only
- TestPeers_Q_CaseInsensitive — q=ALPHA → ws-alpha (locks in ToLower)
- TestPeers_Q_FiltersByRole — q=design → ws-beta (role-side match)
- TestPeers_Q_NoMatches — empty array, JSON [] not null
- TestPeers_Q_WhitespaceOnly — q=' ' treated as no-filter
Helpers peersFilterFixture + runPeersWithQuery + peerNames keep each
test scoped to the q-behaviour, not re-declaring SQL expectations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the canvas threw a generic "API GET /workspaces: 500
{auth check failed}" error when local Postgres + Redis were both
down. Two problems:
1. The error code (500) and message ("auth check failed") said
nothing useful. The actual condition was "platform can't reach
its datastore to validate your token" — a Service Unavailable
class, not Internal Server Error.
2. The canvas had no way to distinguish infra-down from a real
auth bug, so it rendered the raw API string in the same
generic-error overlay it uses for everything.
Fix in two layers:
Server (wsauth_middleware.go):
- New abortAuthLookupError helper centralises all three sites
that previously returned `500 {"error":"auth check failed"}`
when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error.
- Now returns 503 + structured body
`{"error": "...", "code": "platform_unavailable"}`. 503 is
the correct semantic ("retry shortly, infra is unavailable")
and the code field is the contract the canvas reads.
- Body deliberately excludes the underlying DB error string —
production hostnames / connection-string fragments must not
leak into a user-visible error toast.
Canvas (api.ts):
- New PlatformUnavailableError class. api.ts inspects 503
responses for the platform_unavailable code and throws the
typed error instead of the generic "API GET /…: 503 …"
message. Generic 503s (upstream-busy, etc.) keep the legacy
path so existing busy-retry UX isn't disrupted.
Canvas (page.tsx):
- New PlatformDownDiagnostic component renders when the
initial hydration catches PlatformUnavailableError.
Surfaces the actual condition with operator-actionable
copy ("brew services start postgresql@14 / redis") +
pointer to the platform log + a Reload button.
Tests:
- Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable
pins the response shape (status, code field, no DB-error leak)
- Canvas: 5 tests for PlatformUnavailableError classification —
typed throw on 503+code match, generic-Error fallback for
503-without-code (upstream busy), 500 stays generic, non-JSON
body falls back to generic.
1015 canvas tests + full Go middleware suite pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convergence-pass review noted the comment at orphan_sweeper.go:171
still describes the pre-cb126014 contract ("Stop returns nil even
when container is gone, but a future change could surface real
errors"). The future is now — Stop does surface real errors today.
Tightened the comment to match the live contract:
isContainerNotFound is treated as success, anything else returns
the wrapped Docker error, sweeper retries on the next cycle.
Pure comment change, no behavior diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review caught a critical issue with 12c49183: the headline "skip
RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop`
unconditionally `return nil`'d after logging the underlying
ContainerRemove error, so the new `if err := h.provisioner.Stop(...);
err != nil { skip volume }` guard in workspace_crud.go AND the same
guard in the orphan sweeper could never fire. RemoveVolume always
ran, predictably failing with "volume in use" when Stop hadn't
actually killed the container — which is the exact production bug
the commit claimed to fix.
Now Stop:
- returns nil on successful remove (no change)
- returns nil when the container is already gone (uses the existing
isContainerNotFound helper — that's the cleanup post-condition,
not a failure)
- returns the wrapped Docker error otherwise (daemon timeout, ctx
cancellation, socket EOF — anything that means the container
might still be alive)
Audited every Provisioner.Stop caller in the tree (team.go,
workspace_restart.go ×4, workspace.go) — all of them already
discard the return value, so the widened error surface is purely
opt-in for the new cleanup paths and breaks no existing behaviour.
Other review-driven fixes in this commit:
- workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from
the request ctx too. RecordAndBroadcast does INSERT INTO
structure_events + Redis Publish; if the canvas hangs up, a
request-ctx-bound INSERT can be cancelled mid-write and the
WORKSPACE_REMOVED event never lands, leaving other WS clients
ignorant of the cascade.
- orphan_sweeper.go: added isLikelyWorkspaceID guard before turning
Docker container prefixes into SQL LIKE patterns. The Docker
name filter is a SUBSTRING match (not prefix), so non-workspace
containers like `my-ws-tool` slip through; the in-loop HasPrefix
in provisioner trims most, but the in-sweeper alphabet check
(hex + dashes only) is the second line of defence and also
blocks SQL LIKE wildcards (`_`, `%`) from reaching the query.
Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes
and TestIsLikelyWorkspaceID with 10 alphabet cases.
- provisioner.go: comment added to ListWorkspaceContainerIDPrefixes
flagging the substring/HasPrefix relationship as load-bearing.
Verified: full Go test suite passes; all 8 sweeper tests pass
(2 new for the LIKE-pattern guard); existing dispatch / delete /
provisioner tests unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: deleting workspaces from the canvas marked DB rows
status='removed' but left Docker containers running indefinitely.
After a session of org imports + cancellations, we counted 10
running ws-* containers all backed by 'removed' DB rows, eating
~1100% CPU on the Docker VM.
Two compounding bugs in handlers/workspace_crud.go's delete cascade:
1. The cleanup loop used `c.Request.Context()` for the Docker
stop/remove calls. When the canvas's `api.del` resolved on the
platform's 200, gin cancelled the request ctx — and any in-flight
Docker call cancelled with `context canceled`, leaving the
container alive. Old logs:
"Delete descendant <id> volume removal warning:
... context canceled"
2. `provisioner.Stop`'s error return was discarded and `RemoveVolume`
ran unconditionally afterward. When Stop didn't actually kill the
container (transient daemon error, ctx cancellation as in #1), the
volume removal would predictably fail with "volume in use" and
the container kept running with the volume mounted. Old logs:
"Delete descendant <id> volume removal warning:
Error response from daemon: remove ... volume is in use"
Fix layered in two parts:
- workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)`
+ a 30s bounded timeout. Stop's error is now checked and on
failure we skip RemoveVolume entirely (the orphan sweeper below
catches what we deferred).
- New registry/orphan_sweeper.go: periodic reconcile pass (every 60s,
initial run on boot). Lists running ws-* containers via Docker name
filter, intersects with DB rows where status='removed', stops +
removes volumes for the leaks. Defence in depth — even a brand-new
Stop failure mode heals on the next sweep instead of leaking
forever.
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper
that wraps ContainerList with the `name=ws-` filter; the sweeper
takes an OrphanReaper interface (matches the ContainerChecker
pattern in healthsweep.go) so unit tests don't need a real Docker
daemon.
main.go wires the sweeper alongside the existing liveness +
health-sweep + provisioning-timeout monitors, all under
supervised.RunWithRecover so a panic restarts the goroutine.
6 new sweeper tests cover the reconcile path, the
no-running-containers short-circuit, the daemon-error skip, the
Stop-failure-leaves-volume invariant (the same trap that motivated
this fix), the volume-remove-error-is-non-fatal continuation,
and the nil-reaper no-op.
Verified: full Go test suite passes; manually purged the 10 leaked
containers + their orphan volumes from the dev host with `docker
rm -f` + `docker volume rm` (one-off cleanup; the sweeper would
have caught them on the next cycle once deployed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-2 review of the wedge/idle/progress bundle came back Approve
with 4 optional polish items. All taken:
1. Migration 043 down file gained `SET LOCAL lock_timeout = '5s'`
matching the up file. A rollback under the same load that
motivated the up-file guard would otherwise stall writers.
2. _clear_sdk_wedge_on_success now gates on actual stream content
(result_text or assistant_chunks). A degenerate "iterator
returned without raising but emitted nothing" case (possible
from a partial stream or stub SDK) no longer falsely advertises
recovery — only a real successful query (≥1 ResultMessage or
AssistantMessage TextBlock) clears the wedge.
3. isUpstreamBusyError dropped the redundant
`strings.Contains(msg, "context deadline exceeded")` fallback.
*url.Error.Unwrap propagates the typed sentinel since Go 1.13;
errors.Is(err, context.DeadlineExceeded) catches the real
net/http shape. The substring was a foot-gun (would also match
user-content with that phrase). Test fixture updated to use
`fmt.Errorf("Post: %w", context.DeadlineExceeded)` which
reflects what net/http actually returns.
4. TestIsUpstreamBusyError added a context.Canceled case (both
typed and wrapped via %w) — pins the new applyIdleTimeout
classification.
No critical/required findings on second pass; reviewer verdict was
Approve. Items above are polish for symmetry and test clarity.
1010 canvas + 64 Python + full Go suites pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle review of pieces 1/2/3 surfaced two critical issues plus a
handful of required + optional fixes. All addressed.
Critical:
1. Migration 043 was missing 'paused' and 'hibernated' from the
workspace_status enum. Both are real production statuses written
by workspace_restart.go (lines 283 and 406), introduced by
migration 029_workspace_hibernation. The original `USING
status::workspace_status` cast would have errored mid-transaction
on any production DB containing those values. Added both. Also
added `SET LOCAL lock_timeout = '5s'` so the migration aborts
instead of stalling the workspace fleet behind a slow SELECT.
2. The chat activity-feed window kept only 8 lines, and a single
multi-tool turn (Read 5 files + Grep + Bash + Edit + delegate)
easily flushed older context before the user could read it.
Extracted appendActivityLine to chat/activityLog.ts with a
20-line window AND consecutive-duplicate collapse (same tool
on the same target twice in a row is noise, not new progress).
5 unit tests pin the behavior.
Required:
3. The SDK wedge flag was sticky-only — a single transient
Control-request-timeout from a flaky network blip locked the
workspace into degraded for the whole process lifetime, even
when the next query() would have succeeded. Added
_clear_sdk_wedge_on_success(), called from _run_query's success
path. The next heartbeat after a working query reports
runtime_state empty and the platform recovers the workspace to
online without a manual restart. New regression test.
4. _report_tool_use now sets target_id = WORKSPACE_ID for self-
actions, matching the convention other self-logged activity
rows use. DB consumers joining on target_id see a well-defined
value instead of NULL.
Optional taken:
5. Tightened _WEDGE_ERROR_PATTERNS from "control request timeout"
to "control request timeout: initialize" — suffix-anchored so a
future SDK error on an in-flight tool-call control message
doesn't get misclassified as the unrecoverable post-init wedge.
6. Dropped the redundant "context canceled" substring fallback in
isUpstreamBusyError. errors.Is(err, context.Canceled) is the
typed check; the substring would also match healthy client-side
aborts, which we don't want classified as upstream-busy.
Verified: 1010 canvas tests + 64 Python tests + full Go suite pass;
migration applies cleanly on dev DB with all 8 enum values; reverse
migration restores TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous canvas-default 5-min absolute deadline pre-empted any
chat that legitimately ran longer (multi-turn tool use, large
synthesis tasks) and made every wedged-SDK call burn 5 full minutes
before the user saw anything. Replaced with a per-dispatch idle
timeout: cancel the request only when the broadcaster has been
silent for `idleTimeoutDuration` (60s). Any progress event for the
workspace — agent_log tool-use rows, task_update, a2a_send,
a2a_receive — resets the clock.
Mechanics:
- new applyIdleTimeout helper subscribes to events.Broadcaster's
per-workspace SSE channel, drains its messages, resets a
time.Timer on each one, cancels the wrapped ctx when the timer
fires. Cleanup goroutine + subscription lives only as long as
the returned cancel func is uncalled.
- dispatchA2A now takes workspaceID as a parameter, applies the
idle timeout always (canvas + agent), and combines its cancel
with the existing 30-min agent-to-agent ceiling cancel into one
func the caller defers.
- Canvas dispatches no longer have an absolute ceiling at all —
the idle timer is the only "give up" signal. A healthy chat
reporting tool-use telemetry every few seconds runs forever;
a wedged runtime fails in 60s instead of 5 min.
- isUpstreamBusyError now also recognises context.Canceled (the
error class our idle cancel produces, distinct from
DeadlineExceeded). Same 503-busy retry semantics.
Tests:
- TestApplyIdleTimeout_FiresOnSilence — 60ms idle, no events,
ctx cancels with context.Canceled.
- TestApplyIdleTimeout_ResetsOnEvent — event mid-window extends
the deadline; ctx alive past original deadline, then cancels
on the second silence window.
- TestApplyIdleTimeout_NilBroadcasterDegradesGracefully — defensive
no-op for paths that don't wire a broadcaster.
- 3 existing dispatchA2A tests updated for the new workspaceID
param + the always-non-nil cancel return shape.
This pairs with Piece 1's per-tool-use telemetry (166c7f77): the
broadcaster events that reset the idle timer ARE the agent_log
rows the workspace started emitting per tool call. So the same
event stream feeds both the chat progress feed AND the proxy's
deadline.
Full Go test suite passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeat lies. The asyncio task that POSTs /registry/heartbeat lives
in its own process slot, so a workspace whose claude_agent_sdk has
wedged on `Control request timeout: initialize` keeps reporting
"online" — every chat send hangs the full 5-min platform deadline
even though the runtime is dead in the water. This commit teaches
the workspace to admit it's wedged and the platform to honor that
admission by flipping status → degraded.
Five layers, all in one commit because they share a contract:
1. Migration 043 — convert workspaces.status from free-form TEXT to
a real `workspace_status` Postgres ENUM with the 6 values
production code actually writes (provisioning, online, offline,
degraded, failed, removed). Locks the value set; future typo
writes error at the DB instead of silently storing rogue strings.
Down migration reverts to TEXT and drops the type.
2. workspace-server/internal/models — `HeartbeatPayload` gains a
`runtime_state string` field. Empty = healthy. Currently the only
non-empty value the handler honors is "wedged"; future symptoms
can extend without another migration.
3. workspace-server/internal/handlers/registry.go — `evaluateStatus`
gains a wedge branch BEFORE the existing error_rate >= 0.5 path:
if `RuntimeState=="wedged"` and currently online, flip to
degraded and broadcast WORKSPACE_DEGRADED with the wedge sample
error. Recovery (`degraded → online`) now requires BOTH
error_rate < 0.1 AND runtime_state cleared, so a workspace still
reporting wedged stays degraded even when its error count
happens to be 0 (the wedge captures a runtime state, not an
error count).
4. workspace/claude_sdk_executor.py — module-level `_sdk_wedged_reason`
flag set when execute()'s catch block sees an error matching
`_WEDGE_ERROR_PATTERNS` (currently just "control request
timeout"). Sticky for the process lifetime; the SDK's internal
client-process state is corrupted on this error and only a
workspace restart (= new Python process = fresh module state)
clears it. Helpers `is_wedged()` / `wedge_reason()` /
`_reset_sdk_wedge_for_test()` exposed.
5. workspace/heartbeat.py — heartbeat body now layers on
`_runtime_state_payload()` for both the happy path and the
401-retry path. Lazy-imports claude_sdk_executor so non-Claude
runtimes (where the module may not even be importable) keep
working unchanged.
Canvas required no changes — `STATUS_CONFIG.degraded` was already
defined in design-tokens.ts (amber dot, "Degraded" label) and
WorkspaceNode.tsx already renders `lastSampleError` underneath the
status pill when status === "degraded". The existing wiring just
never fired because nothing was writing degraded in this code path.
Tests:
- 3 Go handler tests for the new transitions (online → degraded on
wedged, degraded stays put while still wedged, degraded → online
after wedge clears)
- 5 Python wedge-detector tests (default clean, mark sets flag,
sticky-first-wins, execute() flips on Control request timeout,
execute() does NOT flip on unrelated errors)
- Migration smoke-tested against the local dev DB (3 existing rows,
all enum-compatible; migration applied cleanly, post-state has
the column as workspace_status type and the index preserved)
Verified: 79 Python tests pass; full Go test suite passes; migration
applies clean on a real DB; reverse migration restores the column to
TEXT.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review surfaced two required documentation fixes and
one growth-correctness gap. All addressed here.
Auto-fit gate (useCanvasViewport):
The previous "subtree-grew-by-count" check missed the delete-then-add
case: subtree of 6 → delete one → 5 → a different child arrives → 6
again. A length-only comparison reads no growth and the fit is
skipped, leaving the new node off-screen. Switched to an id-set
membership snapshot so any brand-new id forces the fit even when the
count is unchanged.
The gate logic is now extracted as a pure exported function
`shouldFitGrowing(currentIds, prevIds, userPannedAt, lastAutoFitAt)`
so the regression-prone decision can be unit-tested in isolation
without standing up React Flow + DOM event refs. 8 cases cover:
first-fit, empty-prior, brand-new id, status-update with user pan,
no-pan-ever, pan-before-last-fit, delete-then-add same length, and
shrink-only with user pan.
Parser parity (dotenv.go + next.config.ts):
Existing-env semantics were undocumented in both parsers. Both now
explicitly note that an explicitly-set empty string (`KEY=` from the
parent shell) counts as "set" — the file value does NOT backfill —
matching the Go (os.LookupEnv) and Node (`process.env[k] !==
undefined`) primitives.
`export ` prefix uses a literal space; `export\tFOO=bar` is
intentionally rejected. Added the same comment in both parsers
to lock in this parity invariant since the commit message claims
"if one parser changes, the other has to."
Skipped (per analysis):
- Drag-pan respect for left-click drag-pan during deploy. The
growth-check safety net means any pan gets overridden on the
next arrival anyway, which is the desired behavior for the
"watch the org deploy" use case. After deploy completes, no
more fit-deploying-org events fire so drag-pan works freely.
- Map cleanup for lastFitSubtreeIdsRef. Per-tab session, UUID
keys, tiny entries — not worth the cleanup hook.
993 canvas tests pass (8 new); Go dotenv tests pass; tsc clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The repo's own .env contains lines like
CONFIGS_DIR= # Path to workspace-configs-templates/...
where the value is empty + an inline comment. The pre-fix parser:
1. v = " # Path to ..."
2. TrimLeft → "# Path to ..."
3. Inline-comment loop looked for " #" or "\t#" — neither matches
because the leading whitespace is gone.
4. Returned the comment text as the value.
Result: os.Setenv("CONFIGS_DIR", "# Path to ...") clobbered the auto-
discovery fallback. The TemplatesHandler then opened the comment as
a directory, ReadDir errored silently, and GET /templates returned
[]. Canvas's Templates panel showed "No templates found in
workspace-configs-templates/" even though 8 valid templates existed
on disk.
Fix: strip leading whitespace from the value FIRST, then run a
position-aware comment scan that treats `#` as a comment marker iff
it's at the start of the (trimmed) value or preceded by whitespace.
A bare `#` mid-value (e.g. `KEY=token#fragment`) still survives.
Quoted-value handling moved above the comment scan so
`KEY="value # not"` keeps the `#` as part of the value — pulled the
quote-detection into the same TrimLeft-then-check shape as the bare
path. The unterminated-quote case still falls through to bare-value
handling.
Three regression tests added covering the exact .env line that
broke (`CONFIGS_DIR= # ...`), spaces-only with comment, and tab-
only with comment.
Verified end-to-end: GET /templates now returns all 8 templates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent code review surfaced three required fixes and one cheap
optional one. All addressed here.
dotenv parser:
- `export FOO=bar` was parsed as key `"export FOO"` (with embedded
space) and silently os.Setenv'd, so a developer pasting from a
direnv `.envrc` would get junk vars. Now strips the prefix.
- Quoted values weren't unwrapped: `FOO="hello world"` produced value
`"hello world"` with literal quotes. Now strips one matched pair of
surrounding `"` or `'`. Inside a quoted value `#` is part of the
value, not a comment marker (matches godotenv convention).
- UTF-8 BOM at file start (Windows editors) would have produced a
first key like U+FEFF + "FOO". Now stripped via TrimPrefix.
dotenv loader:
- findDotEnv()'s upward walk would happily pick up `~/.env` or a
sibling-repo `.env` if the binary was run from `~/Documents/other-
project/`. Real foot-gun on shared dev boxes. Now gated on a
monorepo sentinel: the candidate directory must contain
`workspace-server/go.mod`. Falls through to "no .env found" (=
pre-fix behavior) when the sentinel is absent.
socket fallback poll:
- startFallbackPoll() previously fired only on onclose, so the very
first connect attempt — when onclose hasn't fired yet because we
never had a successful onopen — left the canvas with no HTTP poll
for the duration of the failing handshake (Chrome can hold a
SYN-SENT WebSocket open ~75s before giving up). Now also called at
the top of connect(); the timer-already-running guard makes it a
no-op when one cycle later onclose calls it again.
Test coverage added: export prefix, single+double quoted values, hash
inside quotes preserved, unterminated quote falls back to bare value,
CRLF stripping locked in, BOM stripping, and a sentinel-rejection
regression test that creates a temp .env with no workspace-server
sibling and asserts findDotEnv refuses to load it.
Verified: 985 canvas tests + 30 dotenv subtests + 4 dotenv integration
tests all pass; tsc clean; rebuilt platform from monorepo root with
stripped env still loads .env (49 vars) and /workspaces returns 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local dev runs (`/tmp/molecule-server` after `go build`) used to 401 on
/workspaces the moment the DB had any workspace token in it: the binary
inherited a bare shell env with no MOLECULE_ENV, so AdminAuth's dev
fail-open branch (gated on MOLECULE_ENV=development) didn't fire.
The repo's .env already has MOLECULE_ENV=development plus DATABASE_URL,
REDIS_URL, ADMIN_TOKEN=, etc. Until now you had to `set -a && source
.env` in the launching shell — a paper cut, but worse, it's a paper
cut in EVERY automated dev workflow (IDE run configs, integration
test harnesses, the smoke-test loop in this branch's manual testing).
Fix: cmd/server now walks upward from CWD looking for a .env (capped
at 6 levels) and merges KEY=VALUE pairs into os.Environ before any
other code reads env. Already-set vars win over file values, so
docker run -e / CI exports / `KEY=val ./binary` still dominate — only
unset keys get filled in.
Why no godotenv dep: the format we use is plain KEY=VALUE with `#`
comments, no interpolation, no quoting (verified against the live
.env: 49 kv lines, zero references to ${...} or `export`). A 30-line
parser is auditable and avoids supply-chain surface.
Why it's safe in production: Dockerfile doesn't COPY .env into the
image and .env is gitignored, so prod containers have no .env on
disk to load — the function's findDotEnv() loop finds nothing and
returns silently. If an operator deliberately drops one in, the
existing-env-wins rule means container-injected env still dominates.
Verified by booting `env -i HOME=$HOME PATH=$PATH /tmp/molecule-server`
from the repo root with a stripped env: log shows
".env: /Users/.../molecule-core/.env — loaded 49, 0 already set" and
/workspaces returns 200 instead of 401.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces the UX A/B Lab org template — a 7-agent cell for rapid
landing-page variant generation. The template is also the first
consumer of the new any_of env schema (ANTHROPIC_API_KEY OR
CLAUDE_CODE_OAUTH_TOKEN), so it doubles as an end-to-end fixture
for that feature.
Canvas tree (all claude-code / sonnet):
Design Director
├── UX Researcher
├── Visual Designer
├── React Engineer
├── Deploy Engineer
├── A11y + SEO Auditor ← WCAG AA + canonical/noindex gate
└── Perf Auditor ← Core Web Vitals gate
Template files live in their own standalone repo
(Molecule-AI/molecule-ai-org-template-ux-ab-lab, to be published);
this change adds the manifest.json entry so fresh clones + CI
populate the template via scripts/clone-manifest.sh.
Tests:
- TestOrgTemplate_ClaudeAnyOfAuthPreflight — parses the exact
required_env / recommended_env shape the template ships with
via inline YAML (not on-disk, since org-templates/ is
gitignored in this monorepo) and verifies either member
alternative satisfies the preflight.
SEO safety built into the auditor's system prompt:
- One canonical variant; all others canonicalise to it.
- noindex, follow on non-canonical variants.
- Sitemap contains only the canonical URL.
- No robots.txt disallow (blocked pages can't emit canonical).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the org-import env preflight so a template can declare an
alternative: satisfy ANY one member to pass. Motivated by the
Claude-family node case where either ANTHROPIC_API_KEY or
CLAUDE_CODE_OAUTH_TOKEN unlocks the agent — forcing both was wrong.
Server (workspace-server):
- New EnvRequirement union type with custom YAML + JSON
(un)marshaling. Accepts scalar (strict) or {any_of: [...]} in
both on-disk org.yaml and inline POST /org/import bodies.
- collectOrgEnv now returns []EnvRequirement. Dedups groups by
sorted-member signature. "Strict wins" pruning drops any-of
groups that mention a name already declared strictly (same
tier and cross-tier).
- Import preflight uses EnvRequirement.IsSatisfied — scalar =
exact match, group = any member present.
- Empty any_of: [] rejected at parse time (never-satisfiable).
- 14 handler tests (6 updated for the union shape, 8 new
covering any-of satisfaction, dedup, strict-dominates-group,
cross-tier pruning, invalid-member filtering, YAML round-trip,
and empty-any-of rejection).
Canvas:
- EnvRequirement = string | {any_of: string[]} with envReqMembers,
envReqSatisfied, envReqKey helpers.
- OrgImportPreflightModal renders strict rows and any-of groups
via a new AnyOfEnvGroup sub-component: "Configure any one"
banner, per-member input, ✓-satisfied indicator, and dimmed
siblings once any member is configured so the user can still
switch providers.
- TemplatePalette.OrgTemplate.required_env / recommended_env
retyped to EnvRequirement[]; passthrough to the modal
unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Problem
Two issues the external-workspace path was silently dropping:
1. `knownRuntimes` was a hardcoded Go map that drifted from
manifest.json — e.g. `gemini-cli` was in manifest but missing
from the Go allowlist, so any workspace provisioning with
runtime=gemini-cli got silently coerced to langgraph.
2. No end-to-end "bring your own compute" story. The canvas UI
had no way to pick runtime=external; the partial backend code
required the operator to already have a URL ready (chicken-and-
egg with the agent that doesn't exist yet), and no workspace_auth
_token was minted so the external agent couldn't authenticate its
register call.
## Change
### Runtime registry driven by manifest.json
- New `runtime_registry.go` reads `manifest.json` at service init.
Each `workspace_templates[].name` becomes a runtime identifier
(with the `-default` suffix stripped so `claude-code-default`
and `claude-code` resolve to the same runtime).
- `external` is always injected (no template repo exists for it).
- Falls back to a static map on manifest load failure so tests /
dev containers keep working.
- 5 new tests including a real-manifest sanity check.
### First-class external workspace flow
When `POST /workspaces` is called with `runtime: "external"` AND
no URL supplied:
1. Workspace row inserted with `status='awaiting_agent'`
(distinct from `provisioning` so canvas doesn't trip its
provisioning-timeout UX).
2. A workspace_auth_token is minted via `wsauth.IssueToken`.
3. Response body includes a `connection` object with:
- `workspace_id`, `platform_url`, `auth_token`
- `registry_endpoint`, `heartbeat_endpoint`
- `curl_register_template` — zero-dep one-shot register snippet
- `python_snippet` — full SDK setup w/ heartbeat loop,
paired with molecule-sdk-python PR #13's A2AServer
4. The platform URL is resolved from `EXTERNAL_PLATFORM_URL` env
(ops-configurable per tenant) or falls back to request headers.
The legacy `payload.External` + `payload.URL` path is preserved —
org-import and other callers that already have a URL still work.
### Canvas UI
- New "External agent (bring your own compute)" checkbox in
CreateWorkspaceDialog.
- When checked, template/model/hermes-provider fields are hidden
and the POST body includes `runtime: "external"`.
- New `ExternalConnectModal` component: shown once after create,
renders Python / curl / raw-fields tabs with copy-to-clipboard
buttons. Stays mounted as a sibling of the create dialog so the
token survives the create dialog unmount.
- `auth_token` is interpolated into the snippet client-side so the
copied block is truly ready to run — operator only has to fill
in their agent's public URL.
## Tests
- Go: 5 new runtime_registry tests (happy path, -default strip,
external always injected, missing file, malformed JSON, real
manifest sanity). All existing handler tests still pass.
- TypeScript: no type errors on my files; pre-existing
canvas-batch-partial-failure type drift is on main already and
tracked on the #2061 branch.
## Follow-ups (filed separately)
- Cut molecule-sdk-python v0.y to PyPI so the snippet can use
`pip install molecule-ai-sdk` instead of `git+main`.
- Add a `runtime: string` field per template in manifest.json so
one template can declare its runtime explicitly (instead of
deriving it from name conventions). Unblocks N-templates-per-
runtime (e.g. hermes-minimax, hermes-anthropic both runtime=hermes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on #2061. Three internally-cohesive sub-features; easiest to
read in order.
## 1. Org-level env preflight
Server
- `OrgTemplate` + `OrgWorkspace` gain `required_env: string[]` and
`recommended_env: string[]` YAML fields.
- `GET /org/templates` walks the tree and returns the tree-union
(deduped, sorted) of both. `collectOrgEnv` dedup prefers required
when the same key is declared at both tiers.
- `POST /org/import` preflights against `global_secrets` WHERE
`octet_length(encrypted_value) > 0` (empty-value rows used to be
counted as "configured" and the per-container preflight still
failed at start time). 412 Precondition Failed + `missing_env`
list when required keys are absent. `force=true` bypasses with
an audit log line. DB lookup failure now returns 500 (was:
silent fall-through that defeated the guard). Env-var NAMES
validated against `^[A-Z][A-Z0-9_]{0,127}$` so a malicious
template can't ship pathological names into the UI or DB.
Canvas
- New `OrgImportPreflightModal`: red "Required" section (blocking)
and yellow "Recommended" section (non-blocking, import stays
enabled, shows live missing-count next to the Import button).
- Per-key password input → `PUT /settings/secrets` → strike-through
on save. Functional `setDrafts` throughout (no stale-closure
clobbers on rapid successive saves). `useEffect` seed keyed on a
sorted-join string signature so a parent re-render with a new
array identity doesn't clobber typed inputs.
- `TemplatePalette.handleImport` branches: zero env declarations →
straight to import; any declarations → fetch configured global
secret keys, open the modal.
Tests (Go): `TestCollectOrgEnv_*` (5) cover union-across-levels,
required-wins-over-recommended (including same-struct), dedup,
empty, invalid-name rejection.
## 2. EmptyState parity with TemplatePalette
The "Deploy your first agent" grid used to call `POST /workspaces`
with no preflight while the sidebar palette ran
`checkDeploySecrets` + `MissingKeysModal` first. Same template
deployed two different ways → first-run users saw containers boot
in `failed` state without guidance. Now both surfaces share one
preflight + modal handshake.
EmptyState's previous `interface Template` dropped `runtime`,
`models`, and `required_env` — silently discarding exactly the
fields the preflight needs. `Template` now lives in
`deploy-preflight.ts` and is imported from there by both surfaces.
## 3. useTemplateDeploy hook
With the preflight + modal wiring now duplicated across
EmptyState + TemplatePalette + (going forward) any third surface,
extracted the pattern into `canvas/src/hooks/useTemplateDeploy.tsx`:
const { deploy, deploying, error, modal } = useTemplateDeploy({
canvasCoords: ..., // optional, default random
onDeployed: (id) => ...,
});
Closes three drift surfaces that the duplication had created:
- `resolveRuntime` id→runtime fallback table (moved to
`deploy-preflight.ts`). EmptyState had a narrower fallback that
would have silently disagreed with the palette on any future id
needing a non-identity mapping.
- `checkDeploySecrets` call signature. One owner.
- `MissingKeysModal` JSX wiring. One owner.
Narrow try/catch around `checkDeploySecrets` so a preflight network
failure clears `deploying` and surfaces via `setError` instead of
stranding the button forever. `modal: ReactNode` (not a
`renderModal()` function) — the previous memoization bought
nothing since consumers called it inline every render. Named
`MissingKeysInfo` interface for the state shape.
## 4. Viewport auto-fit user-pan gate fix
During org deploy the canvas was meant to pan+zoom to follow each
arriving workspace (`molecule:fit-deploying-org` event → debounced
fitView). In practice the fit stayed stuck on wherever the first
fit landed.
Root cause: React Flow v12 fires `onMoveEnd` with a truthy `event`
at the END of a programmatic `fitView` animation. The original
"respect-user-pan" gate stamped `userPannedAtRef` in `onMoveEnd`,
so our own fit completing looked like a user pan, and every
subsequent auto-fit short-circuited for the rest of the deploy.
Fix: stop trusting `onMoveEnd` for user-intent detection. Register
explicit `wheel` + `pointerdown` listeners on `document` with
capture phase and `target.closest('.react-flow__pane')` filter.
Capture-phase immunity to `stopPropagation`; pane-filter rejects
toolbar / modal / side-panel clicks (the old `window` fallback
caught those). `onMoveEnd` simplified to only drive the debounced
viewport save.
Also: fit event dispatched on root arrivals (not just children),
so the canvas centers on the just-landed root immediately instead
of waiting ~2s for the first child. Animation 600ms → 400ms so
successive per-arrival fits don't pile up visually. End-state fit
stays at 1200ms — intentional asymmetry ("settling" vs
"tracking"), documented in code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Merge origin/staging into fix/canvas-multilevel-layout-ux. 18 files
auto-merged (mostly canvas/tabs/chat and workspace-server handlers
the earlier DIRTY marker was stale relative to current staging).
- Fix 7 test failures surfaced by the merge:
1. Canvas.pan-to-node.test.tsx — mockGetIntersectingNodes was
inferred as vi.fn(() => never[]); mockReturnValueOnce of a node
object failed type check. Explicit return-type annotation.
2. Canvas.pan-to-node.test.tsx + Canvas.a11y.test.tsx — Canvas.tsx
reads deletingIds.size (new multilevel-layout state). Both mock
stores lacked deletingIds; added new Set<string>() to each.
3. canvas-batch-partial-failure.test.ts — makeWS() built a wire-
format WorkspaceData (snake_case, with x/y/uptime_seconds). The
store's node.data is now WorkspaceNodeData (camelCase, no wire-
only fields). Rewrote makeWS to produce WorkspaceNodeData and
updated 5 call-site casts. No assertions changed.
4. ConfigTab.hermes.test.tsx — two tests pinned pre-#2061 behavior
that the PR intentionally inverts:
a. "shows hermes-specific info banner" — RUNTIMES_WITH_OWN_CONFIG
now contains only {"external"}, so the banner is no longer
shown for hermes. Inverted assertion: now pins ABSENCE of
the banner, with a comment noting the inversion.
b. "config.yaml runtime wins over DB" — priority reversed:
DB is now authoritative so the tier-on-node badge matches
the form. Inverted scenario: DB=hermes + yaml=crewai →
form shows hermes. Switched test's DB runtime off langgraph
because the dropdown collapses langgraph into an empty-
valued "default" option that would hide the win signal.
- No production code changed — this commit is staging merge + test
realignment only. 953/953 canvas tests pass. tsc --noEmit clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>