Closes the LOW-severity dependabot alert on workspace-server's go-redis
pin. Upstream advisory GHSA-92cp-5422-2mw7: "go-redis allows potential
out-of-order responses when CLIENT SETINFO times out" — fixed in 9.7.3.
Patch bump within the v9.7 line; semver guarantees no API change.
Full workspace-server test suite passes (18/18 packages clean).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the medium-severity dependabot alert #7 on workspace-server's
docker pin: "Moby firewalld reload makes published container ports
accessible from remote hosts" — fixed in v28.3.3, pulling v28.5.2
(latest in the v28 line).
Patch+minor bump within the v28 train; no client-API breaks
(workspace-server only uses docker.Client for container exec /
inspect, all stable since v20+).
Verification: full workspace-server test suite passes (18/18 packages
clean). Build clean.
Out of scope:
- Alerts #10 and #11 (the AuthZ bypass + plugin-priv off-by-one)
require v29.3.1, which is not yet published to the Go module
proxy (latest published is v28.5.2). They'll close in a follow-up
PR once v29 lands as a Go module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the HIGH-severity dependabot alert on workspace-server's jwt-go
pin. Upstream advisory GHSA-mh63-6h87-95cp / CVE-2025-30204:
"jwt-go allows excessive memory allocation during header parsing" —
fixed in v5.2.2.
Patch bump within the v5.x line; semver guarantees no API change. Full
workspace-server test suite passes (\`go test ./...\` clean across all
18 packages).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the second of two skipped tests in workspace_provision_test.go
that were blocked on interface refactors. The Broadcaster + CP
provisioner halves landed in earlier #1814 cycles; this is the
plugin-source-registry half.
Refactor:
- Add handlers.pluginSources interface with the 3 methods handler
code actually calls (Register, Resolve, Schemes)
- Compile-time assertion `var _ pluginSources = (*plugins.Registry)(nil)`
catches future method-signature drift at build time
- PluginsHandler.sources narrowed from *plugins.Registry to the
interface; production wiring (NewPluginsHandler, WithSourceResolver)
still passes *plugins.Registry — satisfies the interface
Production fix (#1206 leak):
- resolveAndStage's Fetch-failure path was interpolating err.Error()
into the HTTP response body via `failed to fetch plugin from %s: %v`.
Resolver errors routinely contain rate-limit text, github request
IDs, raw HTTP body fragments, and (for local resolvers) file system
paths — none has any business landing in a user's browser.
- Body now carries just `failed to fetch plugin from <scheme>`; the
status code already differentiates the failure shape (404 not
found, 504 timeout, 502 generic). Full err detail stays in the
server-side log line one statement above.
Test:
- 6 sub-tests covering every error path inside resolveAndStage:
empty source, invalid format, unknown scheme, local
path-traversal, unpinned github (PLUGIN_ALLOW_UNPINNED unset),
Fetch failure with a leaky synthetic error
- The Fetch-failure case plants 5 realistic leak markers in the
resolver's error string (rate limit text, x-github-request-id,
auth_token, ghp_-prefixed token, /etc/passwd path); the assertion
fails if ANY appears in the response body
- Table-driven so a future error path added to resolveAndStage gets
one new row, not a copy-paste of the assertion logic
Verification:
- 6/6 sub-tests pass
- Full workspace-server test suite passes (interface refactor is
non-breaking; production caller paths unchanged)
- go build ./... clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skipped test exists to assert that provisionWorkspaceCP never
leaks err.Error() in WORKSPACE_PROVISION_FAILED broadcasts (regression
guard for #1206). Writing the test body required substituting a
failing CPProvisioner — but the handler's `cpProv` field was the
concrete *CPProvisioner type, so a mock had nowhere to plug in.
Refactor:
- Add provisioner.CPProvisionerAPI interface with the 3 methods
handlers actually call (Start, Stop, GetConsoleOutput)
- Compile-time assertion `var _ CPProvisionerAPI = (*CPProvisioner)(nil)`
catches future method-signature drift at build time
- WorkspaceHandler.cpProv narrowed to the interface; SetCPProvisioner
accepts the interface (production caller passes *CPProvisioner
from NewCPProvisioner unchanged)
Test:
- stubFailingCPProv whose Start returns a deliberately leaky error
(machine_type=t3.large, ami=…, vpc=…, raw HTTP body fragment)
- Drive provisionWorkspaceCP via the cpProv.Start failure path
- Assert broadcast["error"] == "provisioning failed" (canned)
- Assert no leak markers (machine type, AMI, VPC, subnet, HTTP
body, raw error head) in any broadcast string value
- Stop/GetConsoleOutput on the stub panic — flags a future
regression that reaches into them on this path
Verification:
- Full workspace-server test suite passes (interface refactor
is non-breaking; production caller path unchanged)
- go build ./... clean
- The other skipped test in this file (TestResolveAndStage_…)
is a separate plugins.Registry refactor and remains skipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a target workspace's adapter has declared
provides_native_session=True (claude-code SDK's streaming session,
hermes-agent's in-container event log), the SDK owns its own queue/
session state. Adding the platform's a2a_queue layer on top would
double-buffer the same in-flight state — and worse, the platform
queue's drain timing has no relationship to the SDK's actual readiness,
so the queued request might dispatch while the SDK is STILL busy.
Behavior change: in handleA2ADispatchError, when isUpstreamBusyError(err)
fires and the target declared native_session, return 503 + Retry-After
directly without enqueueing. The caller's adapter handles retry on
its own schedule, and the SDK's own queue absorbs the request when
ready. Response body carries native_session=true so callers can
distinguish this from queue-failure 503s.
Observability is preserved: logA2AFailure still runs above; the
broadcaster still fires; the activity_logs row records the busy event
just like the platform-fallback path.
This is the consumer that validates the template-side declarations
already shipped in:
- molecule-ai-workspace-template-claude-code PR #12
- molecule-ai-workspace-template-hermes PR #25
Once those merge + image tags bump, claude-code + hermes workspaces'
busy 503s skip the platform queue end-to-end. End-to-end validation
of capability primitive #5.
Tests (2 new):
- NativeSession_SkipsEnqueue: cache pre-populated, deliberate
sqlmock with NO INSERT INTO a2a_queue expected — implicit
regression cover (sqlmock fails on unexpected queries). Asserts
503 + Retry-After + native_session=true marker in body.
- NoNativeSession_StillEnqueues: negative pin — empty cache, same
busy error → falls through to EnqueueA2A (which fails in this
test, falls through to legacy 503 without native_session marker).
Verification:
- All Go handlers tests pass (2 new + existing)
- go build + go vet clean
See project memory `project_runtime_native_pluggable.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an adapter declares provides_native_status_mgmt=True (because its
SDK reports its own ready/degraded/failed state explicitly), the
platform's error-rate-based status inference fights the adapter's own
state machine. This PR gates the inference branches on the capability
flag — adapter-driven transitions become authoritative.
Components:
- registry.go evaluateStatus: gate the two inferred-status branches
(online → degraded when error_rate ≥ 0.5; degraded → online when
error_rate < 0.1 and runtime_state is empty) behind a check of
runtimeOverrides.HasCapability("status_mgmt").
- The wedged-branch (RuntimeState == "wedged" → degraded) is NOT
gated. That path is the adapter's OWN self-report, not platform
inference, and stays active under native_status_mgmt — adapters
can still drive transitions via runtime_state.
Python side: no change. The capability map is already serialized via
RuntimeCapabilities.to_dict() in PR #2137 and sent in the heartbeat's
runtime_metadata block via PR #2139. An adapter setting
RuntimeCapabilities(provides_native_status_mgmt=True) automatically
flows through.
Tests (3 new):
- SkipsDegradeInference: error_rate=0.8 + currentStatus=online + native
flag set → degrade UPDATE does NOT fire (sqlmock fails on unexpected
query, which is the regression cover)
- SkipsRecovery: error_rate=0.05 + currentStatus=degraded + native →
recovery UPDATE does NOT fire
- WedgedStillRespected: runtime_state="wedged" + native → wedged
branch DOES fire (adapter self-report stays active)
Verification:
- All Go handlers tests pass (3 new + existing)
- 1308/1308 Python pytest pass (unchanged — Python side unmodified)
- go build + go vet clean
Stacked on #2140 (already merged via cascade); branch is current with
staging since #2139 and #2140 merged.
See project memory `project_runtime_native_pluggable.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an adapter declares provides_native_scheduler=True (because its
SDK has built-in cron / Temporal-style workflows), the platform's
polling loop must skip firing schedules for that workspace — otherwise
the schedule fires twice (once natively, once via platform). The
native skip preserves observability (next_run_at still advances, the
schedule row stays in the DB, last_run_at would still update) while
moving the FIRE responsibility to the SDK.
Stacked on PR #2139 (idle_timeout_override end-to-end). The
RuntimeMetadata heartbeat block already carries the capability map;
this PR teaches the platform how to read and act on the scheduler bit.
Components:
- handlers/runtime_overrides.go: extended the cache to store
capability flags alongside idle timeout. Two heartbeat fields are
independent — SetIdleTimeout / SetCapabilities each update one
without stomping the other. Defensive copy on SetCapabilities so
a caller mutating its map after the call doesn't retroactively
change cached declarations. Empty entries dropped to avoid stale
husks.
- handlers/runtime_overrides.go: new HasCapability(workspaceID, name)
+ ProvidesNativeScheduler(workspaceID) — the latter is the
package-level adapter the scheduler imports (avoids a
handlers/scheduler import cycle).
- handlers/registry.go: heartbeat handler now calls SetCapabilities
in addition to SetIdleTimeout.
- scheduler/scheduler.go: NativeSchedulerCheck function-pointer DI
(mirrors the existing QueueDrainFunc pattern). New() leaves the
field nil so existing callers preserve today's "always fire"
behavior. SetNativeSchedulerCheck wires production. tick() drops
workspaces declaring native ownership before goroutine fan-out;
advances next_run_at so we don't tight-loop on the same row.
- cmd/server/main.go: wires handlers.ProvidesNativeScheduler into
the cron scheduler at server boot.
Tests:
Go (7 new):
- SetCapabilitiesAndHas (round-trip)
- per-workspace isolation (ws-a's declaration doesn't leak to ws-b)
- nil/empty map clears (adapter dropping the flag restores fallback)
- SetCapabilities is a defensive copy (caller mutation can't
retroactively flip cached value)
- SetIdleTimeout preserves capabilities and vice-versa (two-field
independence)
- empty entry deleted (no stale husks)
- ProvidesNativeScheduler reads the same singleton heartbeat writes
- SetNativeSchedulerCheck wires the function (scheduler-side)
- nil-check safety contract for tick
Python: no change needed — the heartbeat already serializes the
full capability map via _runtime_metadata_payload (PR #2139). An
adapter setting RuntimeCapabilities(provides_native_scheduler=True)
automatically flows through.
Verification:
- 1308 / 1308 Python pytest pass (unchanged)
- All Go handlers + scheduler tests pass
- go build + go vet clean
See project memory `project_runtime_native_pluggable.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Capability primitive #2 (task #117). The first cross-cutting capability
where the adapter actually displaces platform behavior — claude-code's
streaming session can legitimately go silent for 8+ minutes during
synthesis + slow tool calls; the platform's hardcoded 5min idle timer
in a2a_proxy.go cancels it mid-flight (the bug PR #2128 patched at
the env-var layer). This PR fixes it at the right layer: the adapter
declares "I need 600s" and the platform's dispatch path honors it.
Wire shape (Python → Go):
POST /registry/heartbeat
{
"workspace_id": "...",
...
"runtime_metadata": {
"capabilities": {"heartbeat": false, "scheduler": false, ...},
"idle_timeout_seconds": 600 // optional, omitted = use default
}
}
Default behavior preserved: any adapter that doesn't override
BaseAdapter.idle_timeout_override() (returns None by default) sends
no idle_timeout_seconds field; the Go side falls through to
idleTimeoutDuration (env A2A_IDLE_TIMEOUT_SECONDS, default 5min).
Existing langgraph / crewai / deepagents workspaces are unaffected.
Components:
Python:
- adapter_base.py: idle_timeout_override() method on BaseAdapter
returning None (the platform-default sentinel).
- heartbeat.py: _runtime_metadata_payload() lazy-imports the active
adapter and assembles the capability + override block. Try/except
swallows ANY error so heartbeat never breaks because of capability
discovery — observability outranks capability accuracy.
Go:
- models.HeartbeatPayload.RuntimeMetadata (pointer so absent =
"old runtime, didn't say"; explicit zero-cap = "new runtime,
declared no native ownership").
- handlers.runtimeOverrides: in-memory sync.Map cache keyed by
workspaceID. Populated by the heartbeat handler, consulted on
every dispatchA2A. Reset on platform restart (worst-case 30s of
platform-default behavior — acceptable; nothing about overrides
is correctness-critical).
- a2a_proxy.dispatchA2A: looks up the override before applyIdle
Timeout; falls through to global default when absent.
Tests:
Python (17, all new):
- RuntimeCapabilities dataclass shape (frozen, defaults, wire keys)
- BaseAdapter.capabilities() default + override + sibling isolation
- idle_timeout_override default, positive override, dropped-override
- Heartbeat metadata producer: default adapter emits all-False,
native adapter emits flag + override, missing ADAPTER_MODULE
returns {} (graceful), zero/negative override is omitted from
wire, exception inside adapter swallowed
Go (6, all new):
- SetIdleTimeout + IdleTimeout round-trip
- Zero/negative duration clears the override
- Empty workspace_id ignored
- Replacement (heartbeat overwrites prior value)
- Reset clears entire cache
- Concurrent reads + writes (sync.Map invariant)
Verification:
- 1308 / 1308 workspace pytest pass (was 1300, +8)
- All Go handlers tests pass (6 new + existing)
- go vet clean
See project memory `project_runtime_native_pluggable.md` for the
architecture principle this implements.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Molecule-Platform-Evolvement-Manager]
## What this fixes
Closes one of the three skipped tests in workspace_provision_test.go
that #1814's interface refactor enabled but never had a body written:
`TestProvisionWorkspace_NoInternalErrorsInBroadcast`.
The interface blocker (`captureBroadcaster` couldn't substitute for
`*events.Broadcaster`) was already fixed when `events.EventEmitter`
was extracted; this PR ships the test body that the prior refactor
made possible. The test was effectively unverified regression cover
for issue #1206 (internal error leak in WORKSPACE_PROVISION_FAILED
broadcasts) until now.
## What the test pins
Drives the **earliest** failure path in `provisionWorkspace` — the
global-secrets decrypt failure — so the setup needs only:
- one `global_secrets` mock row (with `encryption_version=99` to
force `crypto.DecryptVersioned` to error with a string that
includes the literal version number)
- one `UPDATE workspaces SET status = 'failed'` expectation
- a `captureBroadcaster` (already in the test file) injected via
`NewWorkspaceHandler`
Asserts the captured `WORKSPACE_PROVISION_FAILED` payload:
1. carries the safe canned `"failed to decrypt global secret"` only
2. does NOT contain `"version=99"`, `"platform upgrade required"`,
or the global_secret row's `key` value (`FAKE_KEY`) — the three
leak markers a regression that interpolates `err.Error()` into
the broadcast would surface
## Why not use containsUnsafeString
The test file already has a `containsUnsafeString` helper with
`"secret"` and `"token"` in its prohibition list. Those substrings
match the legitimate redacted message (`"failed to decrypt global
secret"`) — appropriate in user-facing copy, NOT a leak. Using the
broad helper would either fail the test against the source's own
correct message OR require loosening the helper for everyone else.
Per-test explicit leak markers keep the assertion precise without
weakening shared infrastructure.
## What's still skipped (out of scope for this PR)
- `TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast` — same
shape but blocked on a different refactor: `provisionWorkspaceCP`
routes through `*provisioner.CPProvisioner` (concrete pointer,
no interface), so the test would need either an interface
extraction or a real CPProvisioner with a mocked HTTP server.
Larger scope; deferred.
- `TestResolveAndStage_NoInternalErrorsInHTTPErr` — different
blocker (`mockPluginsSources` vs `*plugins.Registry` type
mismatch). Needs a SourceResolver-side interface refactor.
Both still carry their `t.Skip` notes documenting the remaining
work.
## Test plan
- [x] New test passes
- [x] Full handlers package suite still green (`go test ./internal/handlers/`)
- [x] No changes to production code — pure test addition
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two Critical bugs caught in code review of the agent→user attachments PR:
1. **Empty-URI attachments slipped past validation.** Gin's
go-playground/validator does NOT iterate slice elements without
`dive` — verified zero `dive` usage anywhere in workspace-server —
so the inner `binding:"required"` tags on NotifyAttachment.URI/Name
were never enforced. `attachments: [{"uri":"","name":""}]` would
pass validation, broadcast empty-URI chips that render blank in
canvas, AND persist them in activity_logs for every page reload to
re-render. Added explicit per-element validation in Notify (returns
400 with `attachment[i]: uri and name are required`) plus
defence-in-depth in the canvas filter (rejects empty strings, not
just non-strings).
3-case regression test pins the rejection.
2. **Hardcoded application/octet-stream stripped real mime types.**
`_upload_chat_files` always passed octet-stream as the multipart
Content-Type. chat_files.go:Upload reads `fh.Header.Get("Content-Type")`
FIRST and only falls back to extension-sniffing when the header is
empty, so every agent-attached file lost its real type forever —
broke the canvas's MIME-based icon/preview logic. Now sniff via
`mimetypes.guess_type(path)` and only fall back to octet-stream
when sniffing returns None.
Plus three Required nits:
- `sqlmockArgMatcher` was misleading — the closure always returned
true after capture, identical to `sqlmock.AnyArg()` semantics, but
named like a custom matcher. Renamed to `sqlmockCaptureArg(*string)`
so the intent (capture for post-call inspection, not validate via
driver-callback) is unambiguous.
- Test asserted notify call by `await_args_list[1]` index — fragile
to any future _upload_chat_files refactor that adds a pre-flight
POST. Now filter call list by URL suffix `/notify` and assert
exactly one match.
- Added `TestNotify_RejectsAttachmentWithEmptyURIOrName` (3 cases)
covering empty-uri, empty-name, both-empty so the Critical fix
stays defended.
Deferred to follow-up:
- ORDER BY tiebreaker for same-millisecond notifies — pre-existing
risk, not regression.
- Streaming multipart upload — bounded by the platform's 50MB total
cap so RAM ceiling is fixed; switch to streaming if cap rises.
- Symlink rejection — agent UID can already read whatever its
filesystem perms allow via the shell tool; rejecting symlinks
doesn't materially shrink the attack surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap where the Director would say "ZIP is ready at /tmp/foo.zip"
in plain text instead of attaching a download chip — the runtime literally
had no API for outbound file attachments. The canvas + platform's
chat-uploads infrastructure already supported the inbound (user → agent)
direction (commit 94d9331c); this PR wires the outbound side.
End-to-end shape:
agent: send_message_to_user("Done!", attachments=["/tmp/build.zip"])
↓ runtime
POST /workspaces/<self>/chat/uploads (multipart)
↓ platform
/workspace/.molecule/chat-uploads/<uuid>-build.zip
→ returns {uri: workspace:/...build.zip, name, mimeType, size}
↓ runtime
POST /workspaces/<self>/notify
{message: "Done!", attachments: [{uri, name, mimeType, size}]}
↓ platform
Broadcasts AGENT_MESSAGE with attachments + persists to activity_logs
with response_body = {result: "Done!", parts: [{kind:file, file:{...}}]}
↓ canvas
WS push: canvas-events.ts adds attachments to agentMessages queue
Reload: ChatTab.loadMessagesFromDB → extractFilesFromTask sees parts[]
Either path → ChatTab renders download chip via existing path
Files changed:
workspace-server/internal/handlers/activity.go
- NotifyAttachment struct {URI, Name, MimeType, Size}
- Notify body accepts attachments[], broadcasts in payload,
persists as response_body.parts[].kind="file"
canvas/src/store/canvas-events.ts
- AGENT_MESSAGE handler reads payload.attachments, type-validates
each entry, attaches to agentMessages queue
- Skips empty events (was: skipped only when content empty)
workspace/a2a_tools.py
- tool_send_message_to_user(message, attachments=[paths])
- New _upload_chat_files helper: opens each path, multipart POSTs
to /chat/uploads, returns the platform's metadata
- Fail-fast on missing file / upload error — never sends a notify
with a half-rendered attachment chip
workspace/a2a_mcp_server.py
- inputSchema declares attachments param so claude-code SDK
surfaces it to the model
- Defensive filter on the dispatch path (drops non-string entries
if the model sends a malformed payload)
Tests:
- 4 new Python: success path, missing file, upload 5xx, no-attach
backwards compat
- 1 new Go: Notify-with-attachments persists parts[] in
response_body so chat reload reconstructs the chip
Why /tmp paths work even though they're outside the canvas's allowed
roots: the runtime tool reads the bytes locally and re-uploads through
/chat/uploads, which lands the file under /workspace (an allowed root).
The agent can specify any readable path.
Does NOT include: agent → agent file transfer. Different design problem
(cross-workspace download auth: peer would need a credential to call
sender's /chat/download). Tracked as a follow-up under task #114.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review-feedback follow-up. Pre-fix, A2A_IDLE_TIMEOUT_SECONDS=foo or =-30
fell back to the default with zero log signal — operator sets the wrong
value, sees "no effect," wastes hours debugging "why is my override not
working." Now bad-input cases log a clear message naming the variable,
the bad value, and the default applied.
Refactor: extract parseIdleTimeoutEnv(string) → time.Duration so the
parse logic is unit-testable. defaultIdleTimeoutDuration is a const so
tests reference it without re-deriving the value.
8 new unit tests cover empty / valid / negative / zero / non-numeric /
float / trailing-units inputs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two compounding bugs caused the "context canceled" wave on 2026-04-26
(15+ failed user/agent A2A calls in 1hr across 6 workspaces, including
the user's "send it in the chat" message that the director never
received):
1. **a2a_proxy.go:applyIdleTimeout cancels the dispatch after 60s of
broadcaster silence** for the workspace. Resets on any SSE event
for the workspace, fires cancel() if no event arrives in time.
2. **registry.go:Heartbeat broadcast was conditional** —
`if payload.CurrentTask != prevTask`. The runtime POSTs
/registry/heartbeat every 30s, but if current_task hasn't changed
the handler emits ZERO broadcasts. evaluateStatus only broadcasts
on online/degraded transitions — also no-op when steady.
Net: a claude-code agent on a long packaging step or slow tool call
keeps the same current_task for >60s → no broadcasts → idle timer
fires → in-flight request cancelled mid-flight with the "context
canceled" error the user sees in the activity log.
Fix:
(a) Heartbeat handler always emits a `WORKSPACE_HEARTBEAT` BroadcastOnly
event (no DB write — same path as TASK_UPDATED). At the existing 30s
runtime cadence this resets the idle timer twice per minute.
Cost is one in-memory channel send per active SSE subscriber + one
WS hub fan-out per heartbeat — far below any noise floor.
(b) idleTimeoutDuration default bumped 60s → 5min as a safety net for
any future regression where the heartbeat path goes silent (e.g.
runtime crashed mid-request before its next heartbeat). Made
env-overridable via A2A_IDLE_TIMEOUT_SECONDS for ops who want to
tune (canary tests fail-fast, prod tenants with slow plugins want
longer). Either fix alone closes today's gap; both together is
defence in depth.
The runtime side already POSTs /registry/heartbeat every 30s via
workspace/heartbeat.py — no runtime change needed.
Test: TestHeartbeatHandler_AlwaysBroadcastsHeartbeat pins the property
that an SSE subscriber observes a WORKSPACE_HEARTBEAT broadcast on a
same-task heartbeat (the regression scenario). All 16 existing handler
tests still pass.
Doesn't fix: task #102 (single SDK session bottleneck) — peers will
still queue when busy. But this PR ensures the queue/wait flow
actually completes instead of being killed by the idle timer
mid-wait.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing sweeper only reaps ws-* containers whose workspace row
has status='removed'. That misses the entire wiped-DB case: an
operator does `docker compose down -v` (kills the postgres volume),
the previous platform's ws-* containers keep running, the new
platform boots into an empty workspaces table — first pass finds
zero candidates and those containers leak forever. Symptom users
hit today: 7 ws-* containers from 11h ago, no rows in DB, no
visibility in Canvas, eating CPU + memory.
Fix shape:
1. Provisioner stamps every ws-* container + volume with
`molecule.platform.managed=true`. Without a label, the sweeper
would have to assume any unlabeled ws-* container might belong
to a sibling platform stack on a shared Docker daemon.
2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter
counterpart to the existing name-filter.
3. Sweeper splits sweepOnce into two independent passes:
- sweepRemovedRows (unchanged behavior; status='removed' only)
- sweepLabeledOrphansWithoutRows (new; labeled containers whose
workspace_id has no row in the table at all)
Each pass has its own short-circuit so an empty result or transient
error in one doesn't block the other — load-bearing because the
wiped-DB pass exists precisely for cases where the removed-row
pass finds nothing.
Safe under multi-platform-on-shared-daemon: only containers carrying
our label get reaped, sibling stacks' containers are invisible to this
pass. (For now the label is a constant string; a future per-instance
UUID layer can refine "ours" further if a real shared-daemon scenario
emerges.)
Migration: existing platforms running pre-PR builds have UNLABELED
ws-* containers. After this lands they continue to NOT be reaped by
the new path (no label = invisible). They'll only be cleaned via
manual intervention or once the operator recreates them — same as
today. No regression.
Tests cover all five branches of the new pass: happy-path reap,
no-reap when row exists, mixed reap-some-keep-some, Docker error
short-circuits cleanly, non-UUID prefixes get filtered before the
SQL query.
Pairs with PR #2122 (script-level fix). Together they close the
orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users
(handled by the script) AND `docker compose down -v` users (handled
by the runtime).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in goroutine that polls GHCR every 5 minutes for digest
changes on each workspace-template-*:latest tag and invokes the same
refresh logic /admin/workspace-images/refresh exposes. With this, the
chain from "merge runtime PR" to "containers running new code" is fully
hands-off — no operator step between auto-tag → publish-runtime →
cascade → template image rebuild → host pull + recreate.
Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already
pulls every release should leave it off (would be redundant work);
self-hosters get true zero-touch.
Why a refactor of admin_workspace_images.go is in this PR:
The HTTP handler held all the refresh logic inline. To share it with
the new watcher without HTTP loopback, extracted WorkspaceImageService
with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape.
HTTP handler is now a thin wrapper; behavior is preserved (same JSON
response, same 500-on-list-failure, same per-runtime soft-fail).
Watcher design notes:
- Last-observed digest tracked in memory (not persisted). On boot the
first observation per runtime is seed-only — no spurious refresh
fires on every restart.
- On Refresh error, the seen digest rolls back so the next tick retries.
Without this rollback a transient Docker glitch would convince the
watcher the work was done.
- Per-runtime fetch errors don't block other runtimes (one template's
brief 500 doesn't pause the others).
- digestFetcher injection seam in tick() lets unit tests cover all
bookkeeping branches without standing up an httptest GHCR server.
Verified live: probed GHCR's /token + manifest HEAD against
workspace-template-claude-code; got HTTP 200 + a real
Docker-Content-Digest. Same calls the watcher makes.
Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI on PR #2105 caught 7 Create-handler tests still mocking the
pre-#1408 10-arg INSERT signature. With the column now wired
unconditionally into the INSERT, every WithArgs that pinned
budget_limit as the 10th arg needed a 11th slot for the resolved
max_concurrent_tasks value.
Files:
- workspace_test.go: 6 tests (DBInsertError, DefaultsApplied,
WithSecrets_Persists, TemplateDefaultsMissingRuntimeAndModel,
TemplateDefaultsLegacyTopLevelModel, CallerModelOverridesTemplateDefault)
- workspace_budget_test.go: 1 test (Budget_Create_WithLimit)
All resolved values are the schema-default mirror, so the test
expectation reads as the same models.DefaultMaxConcurrentTasks
const that the handler writes. New imports added to both files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Simplify pass on top of the wire-up commit:
- New const models.DefaultMaxConcurrentTasks = 1; handlers and tests
reference the symbol so the schema-default mirror lives in one place.
- Strip 5 multi-line comments that narrated what the code does.
- Drop the duplicate field-rationale on OrgWorkspace; the one on
CreateWorkspacePayload is canonical.
- Drop test-side positional comments that would silently lie if columns
get reordered.
Pure cleanup; no behaviour change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of #1408 (active_tasks counter). Runtime increment/decrement,
schema column (037), and scheduler enforcement (scheduler.go:312)
already shipped — but the write path from template config.yaml +
direct API was missing, so every workspace silently fell through to
the schema default of 1. Leaders that set max_concurrent_tasks: 3 in
their org template were getting 1 anyway, defeating the entire
feature for the use case it was built for (cron-vs-A2A contention on
PM/lead workspaces).
- OrgWorkspace gains MaxConcurrentTasks (yaml + json tags)
- CreateWorkspacePayload gains MaxConcurrentTasks (json tag)
- Both INSERTs now write the column unconditionally; 0/omitted
payload value falls back to 1 (schema default mirror) so the wire
stays single-shape — no forked column list / goto.
- Existing Create-handler test mocks updated to expect the 11th arg.
- New TestWorkspaceCreate_MaxConcurrentTasksOverride locks the
payload→DB propagation for the leader case (value=3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three files conflicted with staging changes that landed while this PR
sat open. Resolved each by combining both intents (not picking one side):
- a2a_proxy.go: keep the branch's idle-timeout signature
(workspaceID parameter + comment) AND apply staging's #1483 SSRF
defense-in-depth check at the top of dispatchA2A. Type-assert
h.broadcaster (now an EventEmitter interface per staging) back to
*Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through
to no-op when the assertion fails (test-mock case).
- a2a_proxy_test.go: keep both new test suites — branch's
TestApplyIdleTimeout_* (3 cases for the idle-timeout helper) AND
staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated
the staging test's dispatchA2A call to pass the workspaceID arg
introduced by the branch's signature change.
- workspace_crud.go: combine both Delete-cleanup intents:
* Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas
hang-up doesn't cancel mid-Docker-call (the container-leak fix)
* Branch's stopAndRemove helper that skips RemoveVolume when Stop
fails (orphan sweeper handles)
* Staging's #1843 stopErrs aggregation so Stop failures bubble up
as 500 to the client (the EC2 orphan-instance prevention)
Both concerns satisfied: cleanup runs to completion past canvas
hangup AND failed Stop calls surface to caller.
Build clean, all platform tests pass.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
PR #2103 widened the SSRF saasMode branch to also relax RFC-1918 + ULA
under MOLECULE_ENV=development (so the docker-compose dev pattern stops
rejecting workspace registrations on 172.18.x.x bridge IPs). The
existing TestIsSafeURL_DevMode_StillBlocksOtherRanges covered the
security floor (metadata / TEST-NET / CGNAT stay blocked), but no
test asserted the positive side — that 10.x / 172.x / 192.168.x / fd00::
ARE now allowed under dev mode.
Without this test, a future refactor that quietly drops the
`|| devModeAllowsLoopback()` from isPrivateOrMetadataIP wouldn't trip
any assertion, and the docker-compose dev loop would silently re-break.
Adds TestIsSafeURL_DevMode_AllowsRFC1918 — table of 4 URLs covering
the three RFC-1918 IPv4 ranges + IPv6 ULA fd00::/8. Sets
MOLECULE_DEPLOY_MODE=self-hosted explicitly so the test exercises the
devMode branch, not a SaaS-mode pass.
Closes the Optional finding I left on PR #2103.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The production-side end of the runtime CD chain. Operators (or the post-
publish CI workflow) hit this after a runtime release to pull the latest
workspace-template-* images from GHCR and recreate any running ws-* containers
so they adopt the new image. Without this, freshly-published runtime sat in
the registry but containers kept the old image until naturally cycled.
Implementation notes:
- Uses Docker SDK ImagePull rather than shelling out to docker CLI — the
alpine platform container has no docker CLI installed.
- ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64-
encoded JSON payload Docker engine expects in PullOptions.RegistryAuth.
Both empty → public/cached images only; both set → private GHCR pulls.
- Container matching uses ContainerInspect (NOT ContainerList) because
ContainerList returns the resolved digest in .Image, not the human tag.
Inspect surfaces .Config.Image which is what we need.
- Provisioner.DefaultImagePlatform() exported so admin handler picks the
same Apple-Silicon-needs-amd64 platform as the provisioner — single
source of truth for the multi-arch override.
Local-dev companion: scripts/refresh-workspace-images.sh runs on the
host and inherits the host's docker keychain auth — alternate path for
when GHCR_USER/TOKEN aren't set in the platform env.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Pre-fix, POST /workspaces/:id/notify (the side-channel agents use to push
interim updates and follow-up results) only broadcast via WebSocket — no
DB write. When the user refreshed the page, the chat-history loader
(which queries activity_logs) couldn't restore those messages and they
vanished from the chat.
Hits the most common path: when the platform's POST /a2a times out (idle),
the runtime keeps working and eventually pushes its reply via
send_message_to_user. The reply rendered live but disappeared on reload.
Fix: also INSERT an activity_logs row with shape the existing loader
already understands (type=a2a_receive, source_id=NULL, response_body=
{result: text}). Persistence is best-effort — a DB hiccup doesn't block
the WebSocket push (which the user is already seeing).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The docker-compose dev pattern puts platform and workspace containers on
the same docker bridge network (172.18.0.0/16, RFC-1918). The runtime
registers via its docker-internal hostname which DNS-resolves to a
172.18.x.x IP. The SSRF defence's isPrivateOrMetadataIP rejected those,
so every workspace POST through the platform proxy returned
'workspace URL is not publicly routable' — breaking the entire docker-
compose dev loop.
Fix: in isPrivateOrMetadataIP, treat MOLECULE_ENV=development the same
as SaaS mode for RFC-1918 relaxation. Both share the 'trusted intra-
network routing' property — SaaS is sibling EC2s in the same VPC, dev
is sibling containers on the same docker bridge. Always-blocked
categories (metadata link-local, TEST-NET, CGNAT) stay blocked.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
When proxyA2A returns 202+{queued:true} (target busy → enqueued for drain
on next heartbeat), executeDelegation previously treated it as a successful
completion and ran extractResponseText on the queued JSON. The result was
'Delegation completed (workspace agent busy — request queued, will dispatch...)'
landing in activity_logs.summary, which the LLM then echoed to the user
chat as garbage.
Two fixes:
1. delegation.go: detect queued shape via new isQueuedProxyResponse helper,
write status='queued' with clean summary 'Delegation queued — target at
capacity', store delegation_id in response_body so the drain can stitch
back later. Also embed delegation_id in params.message.metadata + use it
as messageId so the proxy's idempotency-key path keys off the same id.
2. a2a_queue.go: when DrainQueueForWorkspace successfully drains a queued
item, extract delegation_id from the body's metadata and UPDATE the
originating delegate_result row (queued → completed with real
response_body). Broadcast DELEGATION_COMPLETE so the canvas chat feed
flips the queued line to completed in real time.
Closes the loop so check_task_status reflects ground truth instead of
perpetual 'queued' even after the queued request eventually drained.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
The 3 skipped tests in workspace_provision_test.go (#1206 regression
tests) were blocked because captureBroadcaster's struct-embed wouldn't
type-check against WorkspaceHandler.broadcaster's concrete
*events.Broadcaster field. This PR fixes the interface blocker for
the 2 broadcaster-related tests; the 3rd (plugins.Registry resolver)
is a separate blocker tracked elsewhere.
Changes:
- internal/events/broadcaster.go: define `EventEmitter` interface with
RecordAndBroadcast + BroadcastOnly. *Broadcaster satisfies it via
its existing methods (compile-time assertion guards future drift).
SubscribeSSE / Subscribe stay off the interface because only sse.go
+ cmd/server/main.go call them, and both still hold the concrete
*Broadcaster.
- internal/handlers/workspace.go: WorkspaceHandler.broadcaster type
changes from *events.Broadcaster to events.EventEmitter.
NewWorkspaceHandler signature updated to match. Production callers
unchanged — they pass *events.Broadcaster, which the interface
accepts.
- internal/handlers/activity.go: LogActivity takes events.EventEmitter
for the same reason — tests passing a stub no longer need to
construct the full broadcaster.
- internal/handlers/workspace_provision_test.go: captureBroadcaster
drops the struct embed (no more zero-value Broadcaster underlying
the SSE+hub fields), implements RecordAndBroadcast directly, and
adds a no-op BroadcastOnly to satisfy the interface. Skip messages
on the 2 empty broadcaster-blocked tests updated to reflect the
new "interface unblocked, test body still needed" state.
Verified `go build ./...`, `go test ./internal/handlers/`, and
`go vet ./...` all clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1483 flagged that dispatchA2A() doesn't call isSafeURL internally —
the guard exists only at the caller level (resolveAgentURL at
a2a_proxy.go:424). The primary call path through proxyA2ARequest
is safe today, but if any future code path ever calls dispatchA2A
directly without going through resolveAgentURL, the SSRF check
would be silently bypassed.
This adds the one-line defense-in-depth guard the issue prescribed:
if err := isSafeURL(agentURL); err != nil {
return nil, nil, &proxyDispatchBuildError{err: err}
}
Wrapping as *proxyDispatchBuildError preserves the existing caller
error-classification path — the same shape that maps to 500 elsewhere.
Adds TestDispatchA2A_RejectsUnsafeURL pinning the contract:
re-enables SSRF for the test (setupTestDB disables it for normal
unit tests), passes a metadata IP, asserts the build error returns
and cancel is nil so no resource is leaked.
The 4 existing dispatchA2A unit tests use setupTestDB → SSRF
disabled, so they continue passing unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#1484 flagged that discoverHostPeer() and writeExternalWorkspaceURL()
return URLs sourced from the workspaces table without an isSafeURL
check. Workspace runtimes register their own URLs via /registry/register
— a misbehaving / compromised runtime could register a metadata-IP URL.
Today both functions are gated by Phase 30.6 bearer-required Discover,
so exposure is theoretical. The fix makes them safe regardless of
upstream auth shape.
Changes:
- discoverHostPeer: isSafeURL on resolved URL before responding;
503 + log on rejection.
- writeExternalWorkspaceURL: same guard applied to the post-rewrite
outURL (so a host.docker.internal rewrite is checked AND a
metadata-IP that survived the rewrite untouched is rejected).
- 3 new regression tests:
* RejectsMetadataIPURL on host-peer path (169.254.169.254 → 503)
* AcceptsPublicURL on host-peer path (8.8.8.8 → 200; positive
counterpart so the rejection test can't pass via universal-fail)
* RejectsMetadataIPURL on external-workspace path
setupTestDB already disables SSRF checks via setSSRFCheckForTest,
so the 16+ existing discovery tests remain untouched. Only the new
tests opt in to enabled SSRF.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Extract walkTemplateConfigs(configsDir, fn) shared helper. Both
templates.List and loadRuntimeProvisionTimeouts walked configsDir
+ parsed config.yaml — same boilerplate twice. Now centralised so
a future template-discovery rule (subdir naming, README sentinel,
etc.) lands in one place.
- templates.List uses the walker — net -10 lines.
- loadRuntimeProvisionTimeouts uses the walker — net -10 lines.
- Document runtimeProvisionTimeoutsCache as 'NOT SAFE for
package-level reuse' so a future change doesn't accidentally
promote it to a singleton (sync.Once can't be reset → tests
would lock out other fixtures).
Skipped (review finding): atomic.Pointer[map[string]int] for
future hot-reload. The doc comment already documents the
limitation; YAGNI-promoting the primitive now would buy a
not-yet-built feature at the cost of more code today.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2 of #2054 — workspace-server reads runtime-level
provision_timeout_seconds from template config.yaml manifests and
includes provision_timeout_ms in the workspace List/Get response.
Phase 1 (canvas, #2092) already plumbs the field through socket →
node-data → ProvisioningTimeout's resolver, so the moment a
template declares the field the per-runtime banner threshold
adjusts without a canvas release.
Implementation:
- templates.go: parse runtime_config.provision_timeout_seconds in
the templateSummary marshaller. The /templates API now surfaces
the field too — useful for ops dashboards and future tooling.
- runtime_provision_timeouts.go (new): loadRuntimeProvisionTimeouts
scans configsDir, parses every immediate subdir's config.yaml,
returns runtime → seconds. Multiple templates with the same
runtime: max wins (so a slow template's threshold doesn't get
cut by a fast template's). Bad/empty inputs are silently
skipped — workspace-server starts cleanly with no templates.
- runtimeProvisionTimeoutsCache: sync.Once-backed lazy cache.
First workspace API request after process start pays the read
cost (~few KB across ~50 templates); every subsequent request is
a map lookup. Cache lifetime = process lifetime; invalidates on
workspace-server restart, which is the normal template-change
cadence.
- WorkspaceHandler gets a provisionTimeouts field (zero-value struct
is valid — the cache lazy-inits on first get()).
- addProvisionTimeoutMs decorates the response map with
provision_timeout_ms (seconds × 1000) when the runtime has a
declared timeout. Absent = no key in the response, canvas falls
through to its runtime-profile default. Wired into both List
(per-row decoration in the loop) and Get.
Tests (5 new in runtime_provision_timeouts_test.go):
- happy path: hermes declares 720, claude-code doesn't, only
hermes appears in the map
- max-on-duplicate: same runtime in two templates → max wins
- skip-bad-inputs: missing runtime, zero timeout, malformed yaml,
loose top-level files all silently ignored
- missing-dir: returns empty map, no crash
- cache: lazy-init on first get; subsequent gets hit cache even
after underlying file changes (sync.Once contract); unknown
runtime returns zero
Phase 3 (separate template-repo PR): template-hermes config.yaml
declares provision_timeout_seconds: 720 under runtime_config.
canvas RUNTIME_PROFILES.hermes becomes redundant + removable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing tokens_test.go skips every test when db.DB is nil, so CI
ran with 0% coverage on tokens.go's List/Create/Revoke. This file adds
sqlmock-driven tests that exercise the SQL paths directly without
needing a live Postgres, lifting coverage on all 4 functions to 100%
and module-level handler coverage from 60.3% → 61.1%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both backends panicked when called on a zero-valued or nil receiver:
Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop,
IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown
paths can call these speculatively where the receiver isn't fully
wired — the panic crashed the goroutine instead of the caller seeing
a clean error.
Three changes:
1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods.
- Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at
the top.
- CPProvisioner.Stop: guard p == nil up top, then httpClient nil
AFTER resolveInstanceID + empty-instance check (the empty
instance_id path doesn't need HTTP and stays a no-op success
even on zero-valued receivers — preserved historical contract
from TestIsRunning_EmptyInstanceIDReturnsFalse).
- CPProvisioner.IsRunning: same shape — empty instance_id stays
(false, nil); httpClient-nil with non-empty instance_id returns
ErrNoBackend.
2. Flip the t.Skip on TestDockerBackend_Contract +
TestCPProvisionerBackend_Contract — both contract tests run now
that the panics are gone. Skipped scenarios were the regression
guard for this fix.
3. Add TestZeroValuedBackends_NoPanic — explicit assertion that
zero-valued and nil receivers return cleanly (no panic). Docker
backend always returns ErrNoBackend on zero-valued; CPProvisioner
may return (false, nil) when the DB-lookup layer absorbs the case
(no instance to query → no HTTP needed). Both are acceptable per
the issue's contract — the gate is no-panic.
Tests:
- 6 sub-cases across the new TestZeroValuedBackends_NoPanic
- TestDockerBackend_Contract + TestCPProvisionerBackend_Contract
now run their 2 scenarios (4 sub-cases each)
- All existing provisioner tests still green
- go build ./... + go vet ./... + go test ./... clean
Closes drift-risk #6 in docs/architecture/backends.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: workspace-server's provision-timeout sweep was hardcoded
at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245)
correctly gives hermes 25 min for cold-boot (hermes installs
include apt + uv + Python venv + Node + hermes-agent — 13–25 min on
slow apt mirrors is normal). The two timeout systems disagreed:
the watcher would happily wait 25 min, but the workspace-server's
10-min sweep killed healthy hermes boots mid-install at 10 min and
marked them failed.
Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z
created a hermes workspace, EC2 cloud-init was visibly making
progress on apt-installs (libcjson1, libmbedcrypto7t64) when the
sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The
test threw "Workspace failed: " (empty error from sql.NullString
serialization) and CI failed on a healthy boot.
Fix: provisioningTimeoutFor(runtime) — same shape as the CP's
bootstrapTimeoutFn:
- hermes: 30 min (watcher's 25 min + 5 min slack)
- others: 10 min (unchanged — claude-code/langgraph/etc. boot
in <5 min, 10 min is plenty)
PROVISION_TIMEOUT_SECONDS env override still works (applies to all
runtimes — operators who care about the runtime distinction
shouldn't use the override anyway).
Sweep query change: pulls (id, runtime, age_sec) per row instead
of pre-filtering by age in SQL. Per-row Go evaluation picks the
correct timeout. Slightly more rows scanned but bounded by the
status='provisioning' partial index — workspaces in flight, not
historical.
Tests:
- TestProvisioningTimeout_RuntimeAware — locks in the per-runtime
mapping
- TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at
11 min must NOT be flipped
- TestSweepStuckProvisioning_HermesPastDeadline — hermes at
31 min IS flipped, payload includes runtime
- Existing tests updated for the new query shape
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... all green
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
\`Delete\`'s call to \`h.provisioner.Stop()\` was silently swallowing
errors — and on the SaaS/EC2 backend, Stop() is the call that
terminates the EC2 via the control plane. When Stop returned an
error (CP transient 5xx, network blip), the workspace was marked
'removed' in the DB but the EC2 stayed running with no row to
track it. The "14 orphan workspace EC2s on a 0-customer account"
incident in #1843 (40 vCPU on a 64 vCPU AWS limit) traced to this
silent-leak path.
This change aggregates Stop errors across both descendant and
self-stop calls and surfaces them as 500 to the client, matching
the loud-fail pattern from CP #262 (DeprovisionInstance) and the
DNS cleanup propagation (#269).
Idempotency:
- The DB row is already 'removed' before Stop runs (intentional,
per #73 — guards against register/heartbeat resurrection).
- \`resolveInstanceID\` reads instance_id without a status filter,
so a retry can replay Stop with the same instance_id.
- CP's TerminateInstance is idempotent on already-terminated EC2s.
- So a retry-after-500 either re-attempts the terminate (succeeds)
or finds the instance already gone (also succeeds).
Behaviour change at the API layer:
- Before: 200 \`{"status":"removed","cascade_deleted":N}\` regardless
of Stop outcome.
- After: 500 \`{"error":"...","removed_count":N,"stop_failures":K}\`
on Stop failure; 200 on success.
RemoveVolume errors stay log-and-continue — those are local
/var/data cleanup, not infra-leak class.
Test debt acknowledged: the WorkspaceHandler's \`provisioner\` field
is the concrete \`*provisioner.Provisioner\` type, not an interface.
Adding a regression test for the new error-propagation path
requires either a refactor (introduce a Provisioner interface) or
a docker-backed integration test. Filing the refactor as a
follow-up; the change here is small and mirrors a proven pattern
(CP #262 + #269 both ship without exhaustive new test coverage
for the same reason).
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... green across the whole module (existing TestDelete
cases unchanged behaviour for happy path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-quality + efficiency review of PR #2081:
- Drop comma-ok on map type-asserts in filterPeersByQuery —
queryPeerMaps writes name/role unconditionally as string, so the
silent-empty-string fallback was cargo-culted defense that would
HIDE a real upstream shape change in tests rather than surface it.
Plain p["name"].(string) panics on violation, caught by tests.
- Trim filterPeersByQuery doc from 5 lines to 1 — function is 15
lines and self-evident.
- Refactor 6 separate Test functions into one table-driven
TestPeers_QFilter with 6 sub-tests. Net ~80 lines saved + naming
becomes readable subtest names instead of TestPeers_Q_Foo_Bar.
- Set-based peer-id comparison (peerIDSet) replaces fragile
peers[0]["id"] == "ws-alpha" asserts that would silently mask a
future sort/order regression on the production code.
- Fix the broken TestPeers_Q_NoMatches assertion: re-encoding an
unmarshalled []map collapses both null and [] to [], so the
previous json.Marshal(peers) == "[]" check was tautological. Move
the [] vs null distinction to a dedicated test
(TestPeers_Q_NoMatches_RawBodyIsArrayNotNull) that inspects the
recorder body BEFORE unmarshal.
runPeersWithQuery now returns both parsed peers and raw body so the
nil-guard test can use the bytes directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Peers handler at workspace-server/internal/handlers/discovery.go
ignored the ?q= query param entirely — every caller got the full peer
list regardless of what they searched for. The handler exposes peer
identities + URLs, so leaking the unfiltered set on a "filtered"
endpoint is an info-disclosure bug (CWE-862).
Fix: read c.Query("q") and post-filter the in-memory peers slice by
case-insensitive substring match against name OR role. Filtering is
done in Go after the existing 3 SQL reads — keeps the SQL bytes
identical to the no-filter path (no injection vector, no DB-driver
collation surprises) at a small cost. The peer set is bounded by a
single workspace's parent + children + siblings (typically <50
rows), so the in-memory pass is negligible.
Empty / whitespace-only q is a no-op — preserves the no-filter
allocation profile.
Tests (6 new in discovery_test.go):
- TestPeers_NoQ_ReturnsAll — regression baseline (3 peers, no filter)
- TestPeers_Q_FiltersByName — q=alpha → ws-alpha only
- TestPeers_Q_CaseInsensitive — q=ALPHA → ws-alpha (locks in ToLower)
- TestPeers_Q_FiltersByRole — q=design → ws-beta (role-side match)
- TestPeers_Q_NoMatches — empty array, JSON [] not null
- TestPeers_Q_WhitespaceOnly — q=' ' treated as no-filter
Helpers peersFilterFixture + runPeersWithQuery + peerNames keep each
test scoped to the q-behaviour, not re-declaring SQL expectations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the canvas threw a generic "API GET /workspaces: 500
{auth check failed}" error when local Postgres + Redis were both
down. Two problems:
1. The error code (500) and message ("auth check failed") said
nothing useful. The actual condition was "platform can't reach
its datastore to validate your token" — a Service Unavailable
class, not Internal Server Error.
2. The canvas had no way to distinguish infra-down from a real
auth bug, so it rendered the raw API string in the same
generic-error overlay it uses for everything.
Fix in two layers:
Server (wsauth_middleware.go):
- New abortAuthLookupError helper centralises all three sites
that previously returned `500 {"error":"auth check failed"}`
when HasAnyLiveTokenGlobal or orgtoken.Validate hit a DB error.
- Now returns 503 + structured body
`{"error": "...", "code": "platform_unavailable"}`. 503 is
the correct semantic ("retry shortly, infra is unavailable")
and the code field is the contract the canvas reads.
- Body deliberately excludes the underlying DB error string —
production hostnames / connection-string fragments must not
leak into a user-visible error toast.
Canvas (api.ts):
- New PlatformUnavailableError class. api.ts inspects 503
responses for the platform_unavailable code and throws the
typed error instead of the generic "API GET /…: 503 …"
message. Generic 503s (upstream-busy, etc.) keep the legacy
path so existing busy-retry UX isn't disrupted.
Canvas (page.tsx):
- New PlatformDownDiagnostic component renders when the
initial hydration catches PlatformUnavailableError.
Surfaces the actual condition with operator-actionable
copy ("brew services start postgresql@14 / redis") +
pointer to the platform log + a Reload button.
Tests:
- Go: TestAdminAuth_DatastoreError_Returns503PlatformUnavailable
pins the response shape (status, code field, no DB-error leak)
- Canvas: 5 tests for PlatformUnavailableError classification —
typed throw on 503+code match, generic-Error fallback for
503-without-code (upstream busy), 500 stays generic, non-JSON
body falls back to generic.
1015 canvas tests + full Go middleware suite pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convergence-pass review noted the comment at orphan_sweeper.go:171
still describes the pre-cb126014 contract ("Stop returns nil even
when container is gone, but a future change could surface real
errors"). The future is now — Stop does surface real errors today.
Tightened the comment to match the live contract:
isContainerNotFound is treated as success, anything else returns
the wrapped Docker error, sweeper retries on the next cycle.
Pure comment change, no behavior diff.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>