SSOT pass — replace 4 bare `h.provisioner == nil && h.cpProv == nil`
checks with `!h.HasProvisioner()`. When a third backend lands (k8s,
containerd, whatever), HasProvisioner picks up the new field in one
place; each bare both-nil check would have to be hunted down and
updated.
Sites:
- a2a_proxy_helpers.go:166 — maybeMarkContainerDead skip-no-backend
- workspace_restart.go:118 — Restart endpoint guard
- workspace_restart.go:363 — RestartByID coalescer guard
- workspace_restart.go:660 — Resume endpoint guard
Adds TestNoBareBothNilCheck (source-level) so the antipattern can't
slip back in.
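For illustration, a minimal sketch of the shape such a source-level
gate can take (walk root and exact pattern are assumptions, not the
shipped test; uses io/fs, os, path/filepath, regexp, strings, testing):

    func TestNoBareBothNilCheck(t *testing.T) {
        // Any direct both-nil check must route through HasProvisioner().
        bare := regexp.MustCompile(`h\.provisioner == nil && h\.cpProv == nil`)
        err := filepath.WalkDir(".", func(path string, d fs.DirEntry, walkErr error) error {
            if walkErr != nil || d.IsDir() || !strings.HasSuffix(path, ".go") ||
                strings.HasSuffix(path, "_test.go") {
                return walkErr
            }
            src, readErr := os.ReadFile(path)
            if readErr != nil {
                return readErr
            }
            if bare.Match(src) {
                t.Errorf("%s: bare both-nil check — use !h.HasProvisioner()", path)
            }
            return nil
        })
        if err != nil {
            t.Fatal(err)
        }
    }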
Out of scope but discovered during the audit (filed separately):
- team.go:207 — team-collapse Stop is Docker-only, leaks EC2 on SaaS
- workspace_crud.go:423 — workspace delete cleanup is Docker-only,
leaks EC2 on SaaS
Both need a StopWorkspaceAuto mirror of provisionWorkspaceAuto. Same
class of bug as today's org-import incident, different verb (stop vs
provision).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that close the silent-drop bug class:
1. Add WorkspaceHandler.HasProvisioner() and use it as the org-import
gate. Pre-fix, org_import.go:178 read `h.provisioner != nil` (Docker-
only) — on SaaS tenants where cpProv is wired but Docker is nil, the
entire 220-line provisioning prep block was skipped. The Auto call
PR #2798 added at line 395 was unreachable on SaaS.
Repro: 2026-05-05 01:14 — hongming prod tenant, 7-workspace org
import, every workspace sat in 'provisioning' for 10 min until the
sweeper marked it failed with the misleading "container started but
never called /registry/register".
2. provisionWorkspaceAuto self-marks-failed on the no-backend path.
Defense in depth: even if a future caller bypasses HasProvisioner
gating or ignores the bool return (TeamHandler pre-#2367 did exactly
this), the workspace ends in a clean failed state with an actionable
error message instead of lingering until the 10-min sweep.
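A minimal sketch of the two pieces, with field and helper names assumed
from the prose above rather than copied from the diff:

    func (h *WorkspaceHandler) HasProvisioner() bool {
        return h.provisioner != nil || h.cpProv != nil // one place to extend per backend
    }

    // Inside provisionWorkspaceAuto, the defense-in-depth branch:
    if !h.HasProvisioner() {
        // markWorkspaceFailed is a stand-in name: clean failed state with an
        // actionable message instead of the 10-min sweeper timeout.
        h.markWorkspaceFailed(ws.ID,
            "no provisioning backend wired (neither Docker nor control plane)")
        return false
    }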
Auto becomes the single source of truth for "start a workspace" —
routing AND the no-backend failure path. Create's redundant
if-not-Auto-then-mark-failed block collapses; only the workspace_config
UPSERT is kept, a Create-specific UI concern for rendering
runtime/model on the Config tab.
Tests:
- TestProvisionWorkspaceAuto_NoBackendMarksFailed pins the new contract
- TestHasProvisioner_TrueOnCPOnly catches the SaaS-only blind spot
- TestHasProvisioner_TrueOnDockerOnly preserves self-hosted shape
- TestHasProvisioner_FalseWhenNeitherWired pins the gate-out path
- TestOrgImportGate_UsesHasProvisionerNotBareField source-pins the gate
(verified: FAILS against the buggy `h.provisioner != nil` shape, PASSES
with `h.workspace.HasProvisioner()`)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the pattern hermes-channel-molecule uses (line 256). Drops
the broken `pip install codex-channel-molecule` instruction, which
would 404.
PyPI publish workflow is a separate piece of work — until then,
git+https install is the path operators get.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex tab in the External Connect modal had an "outbound-tools-only
first cut" caveat — operators got the MCP wiring for codex calling
platform tools, but there was no documented inbound path. Canvas
messages couldn't wake an idle codex session.
That gap is now filled by codex-channel-molecule
(github.com/Molecule-AI/codex-channel-molecule), shipped today as the
codex counterpart to hermes-channel-molecule. The daemon long-polls
the platform inbox, runs `codex exec --resume <session>` per inbound
message, captures the assistant reply, routes it back via
send_message_to_user / delegate_task, and acks the inbox row.
Per-thread session continuity is persisted to disk, so daemon restarts
don't lose conversation context.
This commit:
- Updates externalCodexTemplate to include `pip install
codex-channel-molecule` (step 1) and a foreground `nohup
codex-channel-molecule` invocation (step 3) using the same env-var
contract as the MCP server (WORKSPACE_ID + PLATFORM_URL +
MOLECULE_WORKSPACE_TOKEN).
- Adds a "Canvas messages don't wake codex" common-issues entry to the
TAB_HELP codex section pointing at the bridge daemon log.
- Updates the doc comment to record the upstream deprecation path:
when openai/codex#17543 lands, the bridge becomes redundant and the
wired MCP server delivers push natively.
Verified: TestExternalTemplates_NoMoleculeOrgIDPlaceholder still
passes (no MOLECULE_ORG_ID re-introduction); full handlers suite
green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex / openclaw / hermes-channel snippets each instructed operators
to set `MOLECULE_ORG_ID = "<your org id>"`. The molecule_runtime MCP
subprocess these snippets spawn never reads MOLECULE_ORG_ID — that
env var is consumed only by workspace-server's TenantGuard
middleware, server-side, on the tenant box itself (set by the control
plane via user-data on provision).
External operator → tenant calls pass TenantGuard via the
isSameOriginCanvas path (Origin matches Host), with auth via Bearer
token + X-Workspace-ID. The universal_mcp snippet — which calls into
the same molecule_runtime — has always (correctly) omitted
MOLECULE_ORG_ID; this brings codex / openclaw / hermes-channel into
line.
Symptom that caught it: an external codex CLI session, after pasting
the codex-tab snippet, surfaced "MOLECULE_ORG_ID is still set to
'<your org id>'" as an unresolved blocker — agent reasonably treated
the placeholder as required setup. Operator has no value to fill.
Pinned with a structural test
(TestExternalTemplates_NoMoleculeOrgIDPlaceholder) so the placeholder
can't drift back across all six external-tab templates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MEMORY_V2_CUTOVER=true gates the admin export/import path on the v2
plugin, but the cutoverActive() check in admin_memories.go silently
returns false when the plugin isn't wired:
    func (h *AdminMemoriesHandler) cutoverActive() bool {
        if os.Getenv(envMemoryV2Cutover) != "true" {
            return false
        }
        return h.plugin != nil && h.resolver != nil
    }
Two operator misconfigs hit the silent-fallback path:
1. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL unset
→ wiring.Build returns nil → handler stays on legacy SQL path
→ operator sees no error, assumes cutover is live, but every
request still writes the legacy table.
2. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL set, but plugin
unreachable at boot
→ wiring.Build still returns the bundle (intentional — circuit
breaker handles ongoing unavailability), but every cutover
write quietly falls back via the breaker.
→ only signal: legacy table keeps growing.
Both are exactly the "structurally invisible until prod" failure
mode; the only real-world detection today is "notice the legacy
table is still being written to," which no operator will check.
Add loud, distinctive WARN log lines at Build() time for both
shapes. Boot logs are operator-visible, so a half-config is
immediately obvious without needing dashboards.
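A sketch of the intended shape of those warnings (exact wording and
log call are assumptions):

    // In wiring.Build:
    if os.Getenv("MEMORY_V2_CUTOVER") == "true" && os.Getenv("MEMORY_PLUGIN_URL") == "" {
        log.Printf("WARN: MEMORY_V2_CUTOVER=true but MEMORY_PLUGIN_URL is unset; " +
            "cutover is NOT active — all requests stay on the legacy SQL path")
    }
    // And after a failed boot-time health probe:
    if probeErr != nil && os.Getenv("MEMORY_V2_CUTOVER") == "true" {
        log.Printf("WARN: memory v2 plugin unreachable at boot (%v); cutover writes "+
            "will quietly fall back to legacy via the circuit breaker", probeErr)
    }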
Tests:
* 4 new (cutover+no-URL → warn, neither set → silent, cutover+probe-
fail → loud warn, probe-fail-without-cutover → quiet generic)
* 6 existing (still pass; pin no-warning-on-happy-path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Org-import called h.workspace.provisionWorkspace directly — same silent-
drop bug that bit TeamHandler.Expand on 2026-05-04 (see the
workspace.go:121-125 comment + #2486). Symptom on SaaS: every
claude-code workspace
sat in "provisioning" until the 600s sweeper marked it failed with
"container started but never called /registry/register" — because no
container ever existed; the goroutine returned silently when the Docker
provisioner field was nil.
User reproduced 2026-05-04 ~22:30Z importing a 7-workspace template on
the hongming prod tenant. Tenant CP logs (queried live via SSM) showed
ZERO "Provisioner: goroutine entered" or "CPProvisioner: goroutine
entered" lines for any of the 7 failed workspace UUIDs in the 60min
window — confirming the goroutine never ran past line 384 of
org_import.go because provisionWorkspace returned early in SaaS mode.
The fix is one line: replace h.workspace.provisionWorkspace with
h.workspace.provisionWorkspaceAuto. Auto is the single source of
truth for backend selection (workspace.go:130) — picks CP-mode when
h.cpProv is wired, Docker-mode when h.provisioner is wired, returns
false when neither.
ALSO adds a generic source-level gate
(TestNoCallSiteCallsDirectProvisionerExceptAuto) so the next future
caller can't repeat the pattern. Walks every non-test .go file in
handlers/ and fails if any direct call to provisionWorkspace( or
provisionWorkspaceCP( appears outside the dispatcher's own definition
file.
The gate currently allows workspace_restart.go, which has its own
manual if-h.cpProv-else dispatch — functionally equivalent to Auto and
not the bug class, but architectural duplication nonetheless; a
follow-up is filed for proper de-dup.
Test plan:
- TestOrgImport_UsesAutoNotDirectDockerPath: pin the org_import.go
call site
- TestNoCallSiteCallsDirectProvisionerExceptAuto: generic gate against
future drift
- TestTeamExpand_UsesAutoNotDirectDockerPath (existing): symmetric for
team.go
All 3 + the rest of the handler suite pass.
Closes #2486
Pairs with: PR #2794 (configurable provision concurrency), which made
it possible to bisect concurrency-vs-routing as the cause.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Org-import was hard-capped at 3 concurrent workspace provisions (#1084),
calibrated for Docker-mode workspaces where each provision was a
docker-run. Now that workspaces are EC2 instances, AWS RunInstances
parallelises happily and the artificial cap of 3 makes a 7-workspace
org-import take 3-4× longer than necessary (3 batches × ~70s/provision
≈ 4 min wall time when AWS could absorb all 7 in parallel for ~70s).
This PR makes the cap configurable via MOLECULE_PROVISION_CONCURRENCY:
unset → 3 (Docker-mode default, unchanged)
"0" → effectively unlimited (SaaS / EC2 backend; AWS rate-limit
+ vCPU quota are the real backpressure)
N>0 → exactly N
N<0 → fall back to default 3 + warning log
garbage → fall back to default 3 + warning log
The "0 = unlimited" mapping is the user-facing convention requested for
SaaS deployments — operators don't have to pick an arbitrary large
number. Internally the implementation passes 1<<20 so the channel-based
semaphore effectively never blocks, without infinite-buffer risk.
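A sketch of the resolution logic (function and constant names are
illustrative):

    const defaultProvisionConcurrency = 3

    func provisionConcurrency() int {
        raw := strings.TrimSpace(os.Getenv("MOLECULE_PROVISION_CONCURRENCY"))
        if raw == "" {
            return defaultProvisionConcurrency // unset → Docker-mode default
        }
        n, err := strconv.Atoi(raw)
        switch {
        case err != nil:
            log.Printf("WARN: MOLECULE_PROVISION_CONCURRENCY=%q is not an integer; using %d",
                raw, defaultProvisionConcurrency)
            return defaultProvisionConcurrency
        case n < 0:
            log.Printf("WARN: MOLECULE_PROVISION_CONCURRENCY=%d is negative; using %d",
                n, defaultProvisionConcurrency)
            return defaultProvisionConcurrency
        case n == 0:
            return 1 << 20 // "unlimited": the semaphore never blocks in practice
        default:
            return n
        }
    }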
Test coverage (org_provision_concurrency_test.go, 6 cases / 15 subtests):
- unset → default
- "0" → large unlimited cap
- positive integer exact (1, 5, 10, 50)
- negative → default + warning
- non-numeric → default + warning
- whitespace-trimmed (" 7 " → 7)
Boot-time log line confirms the resolved cap so an operator can verify
their env is being honored without re-deploying.
Does NOT address the separate 600s "never registered" timeout the user
also reported during org-import — that's filed as molecule-core#2793
for proper investigation (parallel-provision contention, network
routing, register-retry budget, or container-start failure are all
candidates and need live SSM capture to bisect).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Parent → child knowledge sharing previously lived behind a `shared_context`
list in config.yaml: at boot, every child workspace HTTP-fetched its parent's
listed files via GET /workspaces/:id/shared-context and prepended them as
a "## Parent Context" block. That paid the full transfer cost on every
boot regardless of whether the agent needed it, single-parent SPOF, no team
or org scope, and broken if the parent was unreachable.
Replace with memory v2's team:<id> namespace: agents call recall_memory
on demand. For large blob-shaped artefacts see RFC #2789 (platform-owned
shared file storage).
Removed:
- workspace/coordinator.py: get_parent_context()
- workspace/prompt.py: parent_context arg + injection block
- workspace/adapter_base.py: import + call + arg pass
- workspace/config.py: shared_context field + parser entry
- workspace-server/internal/handlers/templates.go: SharedContext handler
- workspace-server/internal/router/router.go: GET /shared-context route
- canvas/src/components/tabs/ConfigTab.tsx: Shared Context tag input
- canvas/src/components/tabs/config/form-inputs.tsx: schema field + default
- canvas/src/components/tabs/config/yaml-utils.ts: serializer entry
- 6 tests pinning the removed behavior; 5 doc references
Added regression gates so any reintroduction is loud:
- workspace/tests/test_prompt.py: build_system_prompt must NOT emit
"## Parent Context"
- workspace/tests/test_config.py: legacy YAML key loads cleanly but
shared_context attr must NOT exist on WorkspaceConfig
- tests/e2e/test_staging_full_saas.sh §9d: GET /shared-context must NOT
return 200 against a live tenant
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix WriteFile (templates.go:436) had an `instance_id != ""` branch
that dispatched to writeFileViaEIC (SSH through EC2 Instance Connect),
but ReadFile (templates.go:362) skipped that branch entirely. ReadFile
always tried `findContainer` (which only works for local-Docker
workspaces, not SaaS EC2-per-workspace ones) and fell through to
`resolveTemplateDir` (which returns the seed template, not the
persisted workspace state).
Net effect on production: every Canvas Config tab open against a
SaaS workspace returned 404 "No config.yaml found" because GET
couldn't see what PUT had written. Visible to users after PR #2781
("show-misconfigured-state") surfaced the 404 as an error UX.
Caught by the synth-E2E 7c gate's GET-back assertion, but
misdiagnosed as a "test bug" and the GET assertion was dropped in
PR #2783 (rather than fixed at the source). This PR closes the loop:
1. New `readFileViaEIC` helper in template_files_eic.go that mirrors
writeFileViaEIC's SSH-via-EIC dance and runs `sudo -n cat <path>`.
Returns os.ErrNotExist on missing file (cat exits 1 with empty
stdout under `2>/dev/null`) so the handler maps it cleanly to 404.
2. ReadFile dispatch now mirrors WriteFile's (see the sketch after
   this list): when `instance_id` is non-empty, use readFileViaEIC;
   otherwise fall through to the local-Docker / template-dir path.
3. ReadFile's DB query expanded to also select instance_id + runtime
(was just name). Three sqlmock-based tests updated to match the
new column shape; the existing local-Docker fallback path stays
green by passing instance_id="" in the mock rows.
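A sketch of that dispatch (handler plumbing abbreviated; names assumed):

    if instanceID != "" { // SaaS EC2-per-workspace: read over EIC
        data, err := h.readFileViaEIC(instanceID, remotePath)
        switch {
        case errors.Is(err, os.ErrNotExist):
            http.Error(w, "No config.yaml found", http.StatusNotFound)
        case err != nil:
            http.Error(w, err.Error(), http.StatusInternalServerError)
        default:
            w.Write(data)
        }
        return
    }
    // instanceID == "" → local-Docker / template-dir fallback, unchanged.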
Follow-up (separate PR): the synth-E2E 7c gate should restore the
GET-back marker assertion now that the read/write paths are unified.
That'll also catch any future Files API regression in the round-trip.
This PR doesn't touch the gate to keep the scope tight.
Verification:
- go build ./... clean
- full handlers test suite green (0.4s for ReadFile subset; 5.8s
full)
- The 3 ReadFile sqlmock tests still cover the local-Docker fallback
(instance_id=""); SaaS EIC dispatch is covered by the upcoming
re-enabled synth-E2E 7c GET assertion (deferred to follow-up)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the user-visible 500 ("install: cannot create directory
'/opt/configs': Permission denied") on PUT
/workspaces/<id>/files/config.yaml:
1. Path map fall-through. claude-code wasn't in workspaceFilePathPrefix,
so resolveWorkspaceFilePath returned the default `/opt/configs/...`.
That directory doesn't exist on the workspace EC2 — cloud-init in
provisioner/userdata_containerized.go runs `mkdir -p /configs` only.
Even if the SSH write had succeeded at /opt/configs, the docker
container's bind-mount is host:/configs → container:/configs,
so the file would have been invisible to the runtime.
2. /configs ownership. cloud-init runs as root, so /configs is
root-owned. The SSH-as-ubuntu install command can't write into it
without sudo. Hermes wasn't affected because its base path
(/home/ubuntu/.hermes) is ubuntu-owned.
Two-line fix (path map sketched after this list):
- Add `claude-code: /configs` to the runtime → base-path map and flip
the default fall-through from `/opt/configs` to `/configs`. Leave the
pre-existing langgraph/external entries pointing at /opt/configs
pending a migration audit (no user report on those today, and
flipping them would silently relocate any files those runtimes
already wrote).
- Prefix the remote install command with `sudo -n` so the write
succeeds under the standard EC2 ubuntu/passwordless-sudo posture.
`-n` (non-interactive) ensures clean failure if that ever changes,
rather than a hang waiting for a password prompt.
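A sketch of the resulting map + lookup (variable names assumed):

    // runtime → base path on the workspace host; cloud-init only creates /configs.
    var workspaceFilePathPrefix = map[string]string{
        "claude-code": "/configs",     // new: matches the host→container bind-mount
        "langgraph":   "/opt/configs", // pending migration audit
        "external":    "/opt/configs", // pending migration audit
    }

    func workspaceBasePath(runtime string) string {
        if p, ok := workspaceFilePathPrefix[strings.ToLower(runtime)]; ok {
            return p
        }
        return "/configs" // default fall-through flipped from /opt/configs
    }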
Tests:
- TestResolveWorkspaceFilePath_KnownRuntimes adds claude-code +
CLAUDE-CODE coverage and updates the empty/unknown default cases
to expect /configs. The langgraph/external rows stay green
(unchanged values), confirming the scope of the rename.
Verification:
- go build ./... clean
- go test ./internal/handlers/ green
- The user-reported bug
(PUT /workspaces/57fb7043-79a0-4a53-ae4a-efb39deb457f/files/config.yaml
→ 500 EACCES on /opt/configs) is the failure mode this fix addresses
on both axes (path + sudo).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of #2755 found two tests that didn't actually exercise the
production code path:
- TestNamespaceCleanupFn_NamespaceFormat asserted
"workspace:" + "abc-123" == "workspace:abc-123" — a compile-time
invariant, not runtime behavior. Provided no protection if the closure
in Bundle.NamespaceCleanupFn ever stopped using that prefix.
- TestNamespaceCleanupFn_FailureLogsButReturns built a *parallel*
cleanup closure inline with errors.New, then invoked the parallel
closure. The production closure was never exercised. A regression
in NamespaceCleanupFn (e.g. forgetting the deferred recover, calling
the plugin without nil-check) would still pass this test.
Replaced both with real integration (first test sketched below):
- TestNamespaceCleanupFn_HitsPluginAtCorrectNamespace spins up
httptest.Server, points MEMORY_PLUGIN_URL at it, calls Build(),
invokes the production closure, and asserts the server actually
saw DELETE /v1/namespaces/workspace:abc-123.
- TestNamespaceCleanupFn_PluginErrorDoesNotPanic exercises the
failure path for real: server returns 500 on DELETE, closure must
log and return without propagating. defer-recover is belt-and-
suspenders since production calls this from a for-loop in
workspace_crud.go that has no recover.
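The first replacement test, sketched (assumes Build tolerates a nil DB
when only the cleanup closure is exercised):

    func TestNamespaceCleanupFn_HitsPluginAtCorrectNamespace(t *testing.T) {
        var gotMethod, gotPath string
        srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            gotMethod, gotPath = r.Method, r.URL.Path
            w.WriteHeader(http.StatusNoContent)
        }))
        defer srv.Close()
        t.Setenv("MEMORY_PLUGIN_URL", srv.URL)

        bundle := Build(nil) // the production constructor — no parallel closure
        if bundle == nil || bundle.NamespaceCleanupFn == nil {
            t.Fatal("expected a wired bundle with a cleanup fn")
        }
        bundle.NamespaceCleanupFn("abc-123")

        if gotMethod != http.MethodDelete || gotPath != "/v1/namespaces/workspace:abc-123" {
            t.Fatalf("got %s %s; want DELETE /v1/namespaces/workspace:abc-123",
                gotMethod, gotPath)
        }
    }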
Couldn't ship with #2755 because the merge queue locks the branch
once enqueued. Following up now that #2755 is merged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Caught during continued review: the entire v2 plugin system shipped
in PRs #2729-#2742 + #2744-#2751 was never actually invoked because
main.go and router.go never constructed the plugin client/resolver or
attached the WithMemoryV2 / WithNamespaceCleanup hooks.
Operators setting MEMORY_PLUGIN_URL=... saw zero behavior change
because nothing read it. Every fixup we shipped (idempotency, verify
mode, expires_at validation, audit JSON, namespace cleanup, O(N)
export, boot E2E) was also dormant for the same reason.
Root cause: when a multi-handler feature lands across many PRs, none
of them are individually responsible for wiring main.go — and the
master-task-tracking issue didn't gate-check that the wiring landed.
Add main.go integration to every multi-handler RFC checklist.
What ships (Build sketched after this list):
* internal/memory/wiring/wiring.go: new package that constructs the
plugin client + resolver from MEMORY_PLUGIN_URL once. Returns nil
when unset (preserves zero-config legacy behavior). Probes
/v1/health at boot but doesn't fail-closed — the MCP layer's
circuit breaker handles ongoing unavailability.
* internal/memory/wiring/wiring_test.go: 6 tests covering the
nil/non-nil bundle paths + the namespace-cleanup closure
contract (nil-safe, format-stable, failure-tolerant).
* cmd/server/main.go: imports memwiring, calls Build(db.DB) once
after WorkspaceHandler creation, attaches WithNamespaceCleanup,
threads the bundle through router.Setup.
* internal/router/router.go: Setup signature gains *memwiring.Bundle
param. Inside, attaches WithMemoryV2 to AdminMemoriesHandler and
MCPHandler when the bundle is non-nil.
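Condensed sketch of Build (package aliases and method names are
assumptions):

    func Build(db *sql.DB) *Bundle {
        url := os.Getenv("MEMORY_PLUGIN_URL")
        if url == "" {
            return nil // zero-config: legacy behavior preserved everywhere
        }
        client := memclient.New(url)
        // Boot-time probe is advisory only — the MCP circuit breaker owns
        // ongoing unavailability.
        if err := client.Health(context.Background()); err != nil {
            log.Printf("WARN: memory plugin at %s unreachable at boot: %v", url, err)
        }
        return &Bundle{
            Plugin:   client,
            Resolver: namespace.NewResolver(db),
            NamespaceCleanupFn: func(id string) {
                if err := client.DeleteNamespace("workspace:" + id); err != nil {
                    log.Printf("namespace cleanup for workspace:%s failed: %v", id, err)
                }
            },
        }
    }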
After this, the v2 plugin is reachable end-to-end:
Operator sets MEMORY_PLUGIN_URL → main.Build instantiates client +
resolver → WorkspaceHandler gets cleanup hook → router wires
AdminMemoriesHandler + MCPHandler with WithMemoryV2 → MCP tool
calls (commit_memory_v2, search_memory, etc.) actually do
something → admin export/import respects MEMORY_V2_CUTOVER.
Prerequisite for #292 (staging verification) — without this, the
operator runbook's step 2 (set MEMORY_PLUGIN_URL, observe behavior)
silently no-ops.
Verified: all 9 affected test packages still green
(memory/{client,contract,e2e,namespace,pgplugin,wiring}, handlers,
router, plus the build).
ReadableNamespaces(rootID) returns {workspace:rootID, team:rootID,
org:rootID} — the workspace: namespace it surfaces is the root's only.
The I3 batching change resolved namespaces once per root, which silently
dropped every child workspace's private memories from admin export
(workspace:childID never reached the plugin search).
Keep the per-root batching win for team:/org:/custom: namespaces;
inject each member's workspace:<id> + owner mapping explicitly so
coverage matches the legacy per-workspace iteration.
Cost stays at 1 SQL + N_roots resolver + 1 plugin search.
Test changes:
- New TestExport_IncludesEveryMembersPrivateNamespace uses a
per-workspace resolver stub (mirrors real behaviour) and asserts
every member's workspace:<id> reaches the plugin search AND that
children's private memories appear in the response with correct
owner attribution. Verified to FAIL on the pre-fix code.
- TestExport_BatchesPluginCallsByRoot updated to expect 5 namespaces
(3 workspace + team + org) instead of 3 — it had pinned the buggy
3-namespace behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to boot_e2e_test.go (just merged). Documents:
- When the E2E suite runs (build tag + env var)
- Local run with docker postgres
- CI integration example (label-gated workflow step)
- What each test pins
- Explicit gap list (migration drift, recovery, TTL)
Self-review #293. PR-11's E2E test uses sqlmock + httptest —
integration, not E2E. This adds the actual real-subprocess test:
build the binary with `go build`, start it pointing at real postgres,
drive HTTP via the real client.
What in-process tests miss that this catches:
- Binary build / boot-path panics (env var typos, mixed-key
interface bugs that only surface when start() runs)
- Wire encoding bugs that sqlmock smooths over (the pq.Array
regression from PR-3 development would have been caught here)
- HTTP+TCP-socket edge cases
- Real upsert behavior under postgres ON CONFLICT (C1 fix)
Build-tag gated so default CI doesn't require docker:
go test -tags memory_plugin_e2e -v ./cmd/memory-plugin-postgres/
Tests skip silently when MEMORY_PLUGIN_E2E_DB is unset.
Three tests:
1. TestE2E_BootAndHealth — capabilities advertised correctly
2. TestE2E_FullCommitSearchForgetRoundTrip — full agent flow
3. TestE2E_IdempotencyKey — C1 upsert against real postgres
Plus E2E.md operator runbook with docker quickstart + CI integration
example + explicit statement of what's still uncovered (migration
drift, recovery scenarios, TTL eviction over real time).
Self-review #291. When a workspace is hard-purged, its
`workspace:<id>` namespace stays in the plugin storage. Over time
deleted workspaces accumulate as orphan namespaces.
Fix: optional namespaceCleanupFn hook on WorkspaceHandler. The
purge path (workspace_crud.go ~line 520) iterates each purged id
and calls the hook best-effort. main.go wires the hook to
plugin.DeleteNamespace when MEMORY_PLUGIN_URL is set; operators
who haven't enabled the plugin keep the no-op default.
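A sketch of the hook shape (names assumed):

    type namespaceCleanupFn func(workspaceID string)

    // WithNamespaceCleanup attaches the hook; the default stays nil (no-op).
    func WithNamespaceCleanup(fn namespaceCleanupFn) func(*WorkspaceHandler) {
        return func(h *WorkspaceHandler) { h.namespaceCleanup = fn }
    }

    // In the purge path, per purged id, best-effort:
    if h.namespaceCleanup != nil {
        h.namespaceCleanup(id) // failures are logged by the hook, never fatal
    }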
Why a hook (not direct plugin import):
* Keeps WorkspaceHandler decoupled from the memory contract
package (easier to test, smaller blast radius if the contract
bumps)
* Tests inject a captureCleanupHook stub without standing up a
real plugin client
* Production wiring stays a one-liner in main.go
What gets cleaned up:
* `workspace:<id>` for each purged workspace
* NOT `team:<root>` / `org:<root>` — those may still be
referenced by other workspaces under the same root, so dropping
them on a single workspace's purge would orphan team/org data
for the survivors. Operator can purge those manually after
confirming the entire root is gone.
What stays untouched:
* Soft-removed workspaces (status='removed', no ?purge=true). The
grace window is by design — the data should still be there if
the operator unremoves.
Tests:
* TestWithNamespaceCleanup_DefaultIsNil pins the safe default
* TestWithNamespaceCleanup_NilStaysNil pins the explicit-nil case
* TestWithNamespaceCleanup_AttachesFn pins the wiring
* TestPurge_CallsCleanupHookPerID exercises the per-id loop body
* TestPurge_NilHookIsSkipped pins the nil guard
A full end-to-end Delete-handler test requires mocking broadcaster
+ provisioner + descendant SQL chain, which is out-of-scope for a
single fixup. Integration coverage for the wired path lives in
PR-11's E2E swap test (#293 follow-up).
Self-review #289. The previous exportViaPlugin ran one resolver CTE
walk + one plugin search PER WORKSPACE. For a 1000-workspace tenant
that's 1000× of each, mostly redundant — workspaces sharing a
team/org root see identical readable namespaces.
New strategy:
1. Single SQL pass returns each workspace + its computed root_id
via a recursive CTE (loadWorkspacesWithRoots).
2. Group by root → unique tree count is typically << workspace
count.
3. Resolver runs ONCE per root (any member sees the same readable
list).
4. Build the union of all root namespaces; single plugin.Search
call.
5. Map each memory back to a workspace_name via pickOwnerForNamespace
(workspace:<id> → matching member; team:* / org:* / custom:* →
canonical first member of root group).
Net call cost: 1 SQL + N_roots resolver + 1 plugin call (vs
N_workspaces × resolver + N_workspaces × plugin in the old code).
Tests:
* TestExport_BatchesPluginCallsByRoot pins the new behavior
explicitly: 3 workspaces under 1 root → exactly 1 plugin search
(was 3 with the old code).
* TestPickOwnerForNamespace covers all five attribution cases:
workspace:<id> match, workspace:<id> no-match-fallback, team:*,
org:*, custom:* → first-member-of-root-group; plus empty-members
fallback.
* All 9 existing TestExport_* / TestImport_* / TestPickOwner /
TestNamespaceKindFromLegacyScope / TestSkipImport / etc. tests
remain green (verified with -run "Export").
The legacy DB path (when MEMORY_V2_CUTOVER unset) is unchanged.
Self-review: missed deliverable from PR-7's task spec. Operators had
no way to confirm a -apply produced equivalent search results to the
legacy agent_memories direct queries; this PR ships that.
Usage:
memory-backfill -verify # 50-workspace random sample
memory-backfill -verify -verify-sample=200 # bigger sample
memory-backfill -verify -workspace=<uuid> # one specific workspace
Algorithm (step 3's compare sketched after this list):
1. Pick N random workspaces (or use -workspace if specified)
2. For each: query agent_memories direct, query plugin search via
the workspace's readable namespace list
3. Multiset-compare contents: every legacy row must have a matching
plugin row. Plugin having MORE rows is OK (team-shared content
may be visible from sibling workspaces).
4. Print mismatches with content excerpt; non-zero mismatches/errors
yields a non-zero exit so CI can gate cutover.
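Step 3's compare, sketched as a multiset-subset check on content strings:

    // Every legacy row must be matched by a plugin row; plugin extras are OK.
    func legacySubsetOfPlugin(legacy, plugin []string) (missing []string) {
        counts := make(map[string]int, len(plugin))
        for _, c := range plugin {
            counts[c]++
        }
        for _, c := range legacy {
            if counts[c] == 0 {
                missing = append(missing, c) // reported with a content excerpt
                continue
            }
            counts[c]--
        }
        return missing
    }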
SQL:
- Sampling uses ORDER BY random() LIMIT N (TABLESAMPLE has surprising
distribution at small populations).
- Filters out status='removed' workspaces.
Test coverage:
* pickWorkspaceSample: single-ws short-circuit, random sampling,
query error, scan error
* queryLegacyMemories: happy path, error path
* verifyParity:
- all match → 1 match, 0 mismatch
- missing-from-plugin → 1 mismatch with content excerpt
- plugin-extra rows → 1 match (legacy is subset of plugin)
- legacy query error → 1 error counter
- resolver error → 1 error counter
- plugin search error → 1 error counter
- no readable namespaces + empty legacy → match
- no readable namespaces + non-empty legacy → mismatch
- pickSample error → propagated up
* CLI: -verify+-apply rejected as mutually exclusive; -verify alone
is a valid mode
Note: namespaceResolverAdapter bridges *namespace.Resolver to the
verify package's verifyResolver interface so verify.go has zero
dependency on the namespace package — keeps test stubs minimal.
Two small Important-severity findings from self-review, bundled because
both are <20-line changes touching the same file.
I1: expires_at silent drop
- mcp_tools_memory_v2.go:130 had `if t, err := ...; err == nil { ... }`
which dropped malformed timestamps without telling the agent.
Agent passes `expires_at: "tomorrow"`, gets a 200, and the memory
has no TTL.
- Now returns a clear error: "invalid expires_at: must be RFC3339"
- Test renamed: TestCommitMemoryV2_BadExpiresIsIgnored (which
codified the bug) → TestCommitMemoryV2_BadExpiresReturnsError
(which pins the fix).
I4: audit log JSON via Sprintf-%q
- auditOrgWrite was building activity_logs.metadata via fmt.Sprintf
with %q. Go-quoted strings happen to coincide with JSON-quoted
for ASCII (and today's values are pure ASCII: UUID + hex digest)
so the bug was latent.
- Replaced with json.Marshal of map[string]string. Same wire shape
today, but won't silently produce invalid JSON if metadata grows
to include arbitrary content snippets.
- New test TestAuditOrgWrite_MetadataIsValidJSON uses a custom
sqlmock.Argument matcher (jsonValidMatcher) that fails the test
if the metadata column isn't parseable JSON. The test runs
auditOrgWrite with a content string containing quotes,
backslashes, and a control byte — values where %q would diverge
from JSON-quote.
Both pre-existing tests (TestCommitMemoryV2_AuditsOrgWrites etc.)
remain green.
Self-review (post-merge) flagged that the backfill claimed to be
idempotent on re-run but actually duplicates every row because the
plugin's INSERT uses gen_random_uuid() and ignores any id passed in.
Fix is contract-level: extend MemoryWrite with an optional `id`
idempotency key. When supplied, the plugin MUST treat the write as
upsert keyed on this id; when omitted, the plugin generates a fresh
UUID (production agent commits keep working unchanged).
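What the contract means for the built-in plugin's store, sketched
(SQL abbreviated to the relevant columns):

    const insertMemory = `
        INSERT INTO memory_records (id, namespace, content)
        VALUES (gen_random_uuid(), $1, $2)
        RETURNING id`

    const upsertMemory = `
        INSERT INTO memory_records (id, namespace, content)
        VALUES ($1, $2, $3)
        ON CONFLICT (id) DO UPDATE
            SET namespace = EXCLUDED.namespace,
                content   = EXCLUDED.content
        RETURNING id`

    // CommitMemory picks upsertMemory iff body.ID is set.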
Changes:
* docs/api-protocol/memory-plugin-v1.yaml: add id field with
description that flags it as idempotency key
* internal/memory/contract/contract.go: add ID to MemoryWrite struct,
update memory_write_minimal golden vector
* internal/memory/pgplugin/store.go: split CommitMemory into two
paths — upsert when body.ID set (INSERT ... ON CONFLICT (id) DO
UPDATE), plain INSERT otherwise
* cmd/memory-backfill/main.go: pass agent_memories.id to MemoryWrite,
fix the false comment about 409 deduplication
New tests:
* pgplugin: TestCommitMemory_WithIDUpserts pins the upsert SQL is
used when id is set; TestCommitMemory_UpsertScanError covers the
error branch
* backfill: TestBackfill_PassesSourceUUIDAsIdempotencyKey pins the
forwarding behavior; TestBackfill_RerunIsIdempotent simulates a
retry and asserts both runs pass the same uuid (plugin upsert is
what makes this safe)
Why this matters: operators retrying a failed backfill (which they
will — networks fail, transactions abort) would otherwise create N
duplicates per memory. The duplicates aren't visible until search
results show obvious dupes — debugging that under prod load is bad.
Production agent commits are unaffected: they leave id empty, the
plugin generates a fresh UUID via gen_random_uuid(), zero behavior
change for the hot path.
Final implementation PR. Builds on PR-1..10 (all merged or queued).
Proves the central design property of the plugin contract: ANY
plugin satisfying the v1 OpenAPI spec works as a drop-in replacement
for the built-in postgres plugin. If this test fails after a refactor,
the contract has drifted in a way that breaks ecosystem plugins.
What ships:
* internal/memory/e2e/swap_test.go — five E2E tests against a
deliberately minimal "flat-memory" stub plugin (~50 LOC, single
map, zero capabilities)
* MCPHandler.Dispatch — small exported wrapper around dispatch so
out-of-package E2E tests can drive tools by name without
duplicating the whole MCP RPC stack
E2E coverage:
* TestE2E_FlatPluginRoundTrip: full lifecycle
- list_writable_namespaces returns 3 entries
- commit_memory_v2 writes through plugin
- search_memory finds it back
- commit_summary writes a summary
- forget_memory deletes
- search after forget excludes the deleted memory
* TestE2E_LegacyShimRoutesThroughFlatPlugin: PR-6 shim wired up
- Legacy commit_memory(scope=LOCAL) ends up in plugin storage
- Legacy recall_memory finds it back through plugin search
- Response shapes preserved (scope:LOCAL stays scope:LOCAL)
* TestE2E_OrgMemoriesDelimiterWrap: prompt-injection mitigation
- Org-namespace memory committed
- Audit INSERT into activity_logs verified
- Search returns content with [MEMORY id=... scope=ORG ns=...]
prefix applied
* TestE2E_StubPluginCapabilitiesAreEmpty: capability negotiation
- Stub plugin reports zero capabilities
- Client.SupportsCapability returns false for FTS, embedding
- Confirms graceful degradation when plugin doesn't support a
feature
* TestE2E_PluginUnreachable_AgentSeesClearError: failure surface
- Plugin URL pointing at bogus port
- commit_memory_v2 returns informative error
- No nil-pointer dereference; error message is actionable
The flat plugin is intentionally minimal — it has no namespaces table
distinct from memory records, no FTS, no semantic search, no TTL. The
test proves operators can drop in a 50-line plugin and the agent
behavior is identical (modulo capability-gated features).
Builds on merged PR-1..7. Adds the operator-controlled cutover flag
that flips admin export/import from the legacy direct-DB path to the
v2 plugin path.
Activation: MEMORY_V2_CUTOVER=true AND the v2 plugin is wired via
WithMemoryV2. Both must be true to take the new path; either being
false falls through to the existing legacy SQL code unchanged.
What ships:
* AdminMemoriesHandler gains plugin + resolver fields, wired via
WithMemoryV2 (production) / withMemoryV2APIs (tests)
* Export: enumerates workspaces, asks resolver for each one's
readable namespaces, searches each via plugin, deduplicates by
memory id, applies SAFE-T1201 redaction on emitted content
(F1084 parity). Returns the legacy memoryExportEntry shape so
existing tooling keeps working.
* Import: scope→namespace translation mirrors PR-6 shim. Uses
UpsertNamespace + CommitMemory; runs SAFE-T1201 redaction
BEFORE the plugin sees the content (F1085 parity).
* Helpers: legacyScopeFromNamespace + namespaceKindFromLegacyScope
(lifted out so admin_memories doesn't depend on MCP handler
helpers). skipImport typed error.
Operational rollout (cutover sequencing):
1. Today: MEMORY_V2_CUTOVER unset → legacy DB path.
2. After PR-7 backfill applied + smoke verified: operator sets
MEMORY_V2_CUTOVER=true.
3. From that point, admin export/import operate on plugin
storage; legacy agent_memories table is read-only for the
~60-day grace window before PR-9 drops it.
Coverage on new paths:
* cutoverActive: 100%
* WithMemoryV2 / withMemoryV2APIs: 100%
* importViaPlugin: 100%
* exportViaPlugin: 97.2% (one defensive scan-error branch in the
workspace-list loop)
* scopeToWritableNamespaceForImport: 76.9% (resolver-error and
no-matching-kind branches exercised end-to-end via Import)
* legacyScopeFromNamespace + namespaceKindFromLegacyScope: 100%
Edge cases pinned:
* Cutover flag matrix (env unset/true/false × wired/unwired)
* Export deduplicates memories shared across team (one row per id)
* Export tolerates per-workspace failures (resolver / plugin) and
keeps going on the rest
* Export returns 500 only when the top-level workspace query fails
* Empty readable namespaces → empty export (no panic)
* Export redacts secrets in plugin path
* Import: unknown workspace skipped, unknown scope skipped,
plugin upsert/commit errors counted as errors
* Import redacts secrets BEFORE plugin sees content
* Legacy export/import path unchanged when cutover flag unset
Builds on merged PR-1..6. Operator runs this once at cutover to copy
agent_memories rows into the v2 plugin's storage.
Usage:
memory-backfill -dry-run # count + diff, no writes
memory-backfill -apply # actually copy
memory-backfill -apply -limit=10000 # cap rows per run
memory-backfill -apply -workspace=<uuid> # one workspace only
Required env: DATABASE_URL + MEMORY_PLUGIN_URL.
Translation matches the PR-6 legacy shim (sketched below):
LOCAL → workspace:<workspace_id>
TEAM → team:<root_id> (resolved via the same namespace.Resolver
the runtime uses)
GLOBAL → org:<root_id>
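Sketched, assuming the root id has already been resolved:

    func mapScopeToNamespace(scope, workspaceID, rootID string) (string, error) {
        switch scope {
        case "LOCAL":
            return "workspace:" + workspaceID, nil
        case "TEAM":
            return "team:" + rootID, nil
        case "GLOBAL":
            return "org:" + rootID, nil
        default:
            return "", fmt.Errorf("unknown legacy scope %q", scope) // → skipped row
        }
    }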
Idempotent: each row is keyed by its UUID; re-running the backfill
does not duplicate writes (plugin handles deduplication).
What ships:
* cmd/memory-backfill/main.go: CLI entry, run() driver,
backfill() workhorse, mapScopeToNamespace + namespaceKindFromString
helpers
* main_test.go: 100% on the functional logic (mapScopeToNamespace,
namespaceKindFromString, backfill(), all CLI validation paths)
Coverage: 80.2% of statements. The 19.8% gap is main()'s body
(log.Fatalf — not unit-testable) and run()'s real-DB integration
(sql.Open + db.PingContext + new client/resolver — requires a live
postgres). Integration coverage for this path lives in PR-11
(E2E plugin-swap test).
Edge cases pinned (in functional logic):
* Every legacy scope → namespace mapping
* Unknown scope → skip with diagnostic, increment skipped counter
* Resolver error → propagate, abort run
* No-matching-kind in writable list → skip with error message
* Plugin UpsertNamespace error → increment errors, continue
* Plugin CommitMemory error → increment errors, continue
* Query error → propagate, abort
* Scan error → increment errors, continue
* Mid-iteration row error → propagate, abort
* Workspace filter passes through to SQL WHERE clause
* Dry-run mode never calls plugin
* CLI: rejects both/neither modes, missing env vars, bad flags
Builds on merged PR-1..5. Adds the bridge that lets legacy
commit_memory / recall_memory tools route through the v2 plugin path
when MEMORY_PLUGIN_URL is wired, otherwise fall through to the
existing DB-backed code unchanged.
What ships:
* handlers/mcp_tools_memory_legacy_shim.go — translation helpers:
scopeToWritableNamespace, scopeToReadableNamespaces,
commitMemoryLegacyShim, recallMemoryLegacyShim,
namespaceKindToLegacyScope
* handlers/mcp_tools.go — toolCommitMemory + toolRecallMemory now
delegate to the shim when memv2 is wired
Translation:
commit: LOCAL → workspace:<self>
TEAM → team:<root> (resolver picks at runtime)
empty → defaults to LOCAL (preserves legacy default)
GLOBAL → still rejected at MCP bridge (C3 preserved)
recall: LOCAL → search restricted to workspace:<self>
TEAM → workspace:<self> + team:<root>
empty → all readable (matches v2 default behavior)
GLOBAL → blocked at MCP bridge (C3 preserved)
Response shapes are preserved exactly:
commit: {"id":"...","scope":"LOCAL"|"TEAM"} — agents see no diff
recall: [{"id":"...","content":"...","scope":"LOCAL"|...,"created_at":"..."}, ...]
org-namespace memories get the same [MEMORY id=... scope=ORG ns=...]
prefix as v2 search; legacy scope label comes back as "GLOBAL"
Operational rollout:
* Today: MEMORY_PLUGIN_URL unset on most operators → legacy DB path
* After PR-7 backfill: operators set MEMORY_PLUGIN_URL → all writes
flow through plugin transparently
* After PR-8 cutover: dual-write removed, plugin is the only path
* After PR-9 (~60 days later): legacy tool entries dropped entirely
Coverage: 100% on every helper, 100% on recallMemoryLegacyShim,
94.7% on commitMemoryLegacyShim. The 1 uncovered line is a defensive
guard against a v2-response-parse error that's unreachable when the
v2 tool is operating correctly (it always returns valid JSON).
Edge cases pinned:
* scope translation for every legacy value + invalid scope
* resolver error propagation
* plugin error propagation
* GLOBAL still blocked
* default-scope fallback (LOCAL)
* empty content rejected
* No-op when v2 unwired (legacy SQL path exercised via sqlmock)
* org-namespace memory wrap on recall + GLOBAL scope label round-trip
* No-results returns "No memories found." (legacy message preserved)
Builds on PR-1, PR-2, PR-3, PR-4 (all merged). Adds the agent-facing
v2 surface for the memory plugin contract.
What ships (all in handlers/mcp_tools_memory_v2.go, no edits to
the legacy commit_memory / recall_memory paths):
commit_memory_v2 — write to a namespace; default workspace:self
search_memory — search across namespaces; default = all readable
commit_summary — kind=summary, 30-day default TTL, runtime-overridable
list_writable_namespaces — discover what you can write to
list_readable_namespaces — discover what you can read from
forget_memory — delete by id, only in namespaces you can write to
Workspace-server is the security perimeter — every check the plugin
mustn't be trusted with runs here:
* SAFE-T1201 redactSecrets BEFORE every plugin write
* Server-side ACL re-validation: CanWrite + IntersectReadable run
on EVERY request, never trusting client-supplied namespaces (a
canvas re-parent between list_writable and commit would otherwise
let a stale namespace slip through)
* org:* writes audited to activity_logs (SHA256, not plaintext) —
matches memories.go:201-221 so the schema stays uniform
* Audit failure does NOT block the write (logged + continue) —
failing closed would deny org-scope writes whenever activity_logs
is unhappy
* org:* memories get the [MEMORY id=... scope=ORG ns=...]: prefix
on read — preserves the prompt-injection mitigation from
memories.go:455-461
Coexistence design: legacy commit_memory + recall_memory still wired
to their old code paths in mcp_tools.go. PR-6 will alias them to
delegate to these v2 implementations. PR-9 (60 days post-cutover)
removes the legacy entries.
Wiring:
* MCPHandler gains a memv2 field (nil-safe; tools return a clear
error when MEMORY_PLUGIN_URL is unset rather than crashing)
* WithMemoryV2(plugin, resolver) is the production wiring API
main.go calls at boot
* withMemoryV2APIs(plugin, resolver) is the test-injectable variant
against the memoryPluginAPI / namespaceResolverAPI interfaces
Coverage: 100.0% on every new function in mcp_tools_memory_v2.go.
Edge cases pinned:
* empty/whitespace content → reject before plugin
* plugin unconfigured → clear error, no crash
* ACL violation → clear error
* resolver error → wrapped error
* plugin error → wrapped error
* malformed expires_at → silently ignored (no exception)
* org write audit failure → logged, write proceeds
* search namespace intersection drops foreign entries
* search with all-foreign namespaces → empty result, plugin not called
* search org memories get delimiter wrap, workspace memories do not
* forget with explicit + default namespace
* forget cross-scope rejected
* pickStr / pickStringSlice handle missing keys, wrong types, mixed slices
* wrapOrgDelimiter format is exact-match
* dispatch wires all 6 tools (no "unknown tool" error)
Builds on merged PR-1 (#2729), independent of PR-2/PR-4.
Implements every endpoint of the v1 plugin contract behind an HTTP
server (cmd/memory-plugin-postgres/) backed by postgres. Operators
run this binary next to workspace-server; it's the default
implementation MEMORY_PLUGIN_URL points at.
What ships:
- cmd/memory-plugin-postgres/main.go: boot, signal-driven shutdown,
boot-time migrations, configurable LISTEN/DATABASE/MIGRATION_DIR
- cmd/memory-plugin-postgres/migrations/001_memory_v2.up.sql:
memory_namespaces (PK on name, kind CHECK, expires_at, metadata)
memory_records (FK to namespaces with CASCADE, kind+source CHECK,
pgvector embedding, FTS tsvector, ivfflat partial
index on embedding, partial index on expires_at)
- internal/memory/pgplugin/store.go: storage layer using lib/pq
- internal/memory/pgplugin/handlers.go: HTTP layer (no router dep —
a switch on URL.Path keeps the binary's dep surface tiny)
- 100% statement coverage on store.go + handlers.go
Schema notes:
- These tables live next to the plugin binary, NOT in workspace-
server/migrations/. When operators swap the plugin, these tables
become orphaned (operator drops manually). Documented in PR-10.
- Search supports semantic (pgvector cosine) → FTS (>=2 char query)
→ ILIKE (1-char query) → recent-listing (no query), with a TTL
filter applied uniformly across all paths.
- DELETE on namespace cascades to memory_records (FK ON DELETE
CASCADE) — a deleted namespace immediately frees its memories.
Coverage corner cases pinned:
- Health: ok, degraded (db ping fails), no-ping fn
- Every CRUD endpoint: happy path, bad name, bad JSON, bad body,
not-found, store errors, exec/scan/marshal errors
- Search: FTS, semantic, short-query (ILIKE), no-query (recent),
kinds filter, store errors, scan errors, mid-iteration row error
- Routing edge cases: unknown path, empty namespace, unknown sub,
method-not-allowed, GET on /v1/health (allowed), POST on /v1/health
(404), GET on /v1/search (404)
- Helper internals: marshalMetadata (nil/happy/unmarshalable),
nullTime (nil/non-nil), vectorString (empty/format),
nullVectorString (empty/non-empty), scanNamespace +
scanMemory metadata-decode errors
No callers in workspace-server yet; integration starts in PR-5
(MCP handlers wire the plugin client through to MCP tools).
Stacked on PR-1 (#2729). Computes the readable/writable namespace lists
for a workspace from the live workspaces tree at request time. No
precomputed columns, no migrations — re-parenting on canvas takes
effect immediately on the next memory call.
What ships:
- workspace-server/internal/memory/namespace/resolver.go
- walkChain: recursive CTE, walks parent_id chain to root, capped
at depth 50 to defend against malformed/cyclic data (see the sketch
after this list)
- derive: maps a chain to (workspace, team, org) namespace strings
- ReadableNamespaces / WritableNamespaces: the public API
- CanWrite + IntersectReadable: server-side ACL helpers MCP
handlers (PR-5) will call before talking to the plugin
- resolver_test.go: 100% statement coverage
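The walkChain query, sketched (table/column names assumed):

    const walkChainSQL = `
        WITH RECURSIVE chain AS (
            SELECT id, parent_id, 0 AS depth
            FROM workspaces WHERE id = $1
          UNION ALL
            SELECT w.id, w.parent_id, c.depth + 1
            FROM workspaces w
            JOIN chain c ON w.id = c.parent_id
            WHERE c.depth < 50 -- cap defends against malformed/cyclic data
        )
        SELECT id, parent_id FROM chain`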
Design choices worth flagging:
- Today's tree is depth-1 (root + children). The recursive CTE
handles arbitrary depth so we don't have to revisit the resolver
when the tree deepens.
- GLOBAL→org write restriction (memories.go:167-174) is preserved
by gating the org namespace's Writable flag on parent_id IS NULL.
- Removed-status workspaces are NOT filtered from the chain walk —
matches today's TEAM behavior (memories.go:367-372 filters on
read, not on tree walk).
- IntersectReadable with empty `requested` returns ALL readable
namespaces (default-search-everything semantic from the discovery
tools spec).
This package has zero callers in this PR; integration starts in PR-5.
Builds on PR-1 (#2729). Implements every endpoint in the OpenAPI spec
plus two operational concerns the agent never sees:
1. Capability negotiation. Boot/Refresh probes /v1/health and
captures the plugin's capability list. MCP handlers (PR-5) ask
SupportsCapability before exposing capability-gated features —
e.g., agents can only request semantic search when "embedding"
is reported.
2. Circuit breaker. Three consecutive failures open the breaker for
60 seconds; while open, calls fail fast with ErrBreakerOpen.
Picked these constants because:
- 3 failures: long enough to skip transient blips, short enough
to react before all in-flight handlers stack on the timeout
- 60s cooldown: long enough to back off a flapping plugin,
short enough that recovery is felt within a single session
4xx responses do NOT count toward the breaker (those are client
bugs, not plugin health issues); 5xx + transport errors do.
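The breaker state machine, sketched (field names assumed; constants as
described above):

    type breaker struct {
        mu       sync.Mutex
        failures int
        openedAt time.Time
    }

    func (b *breaker) allow() error {
        b.mu.Lock()
        defer b.mu.Unlock()
        if b.failures >= 3 && time.Since(b.openedAt) < 60*time.Second {
            return ErrBreakerOpen // fail fast while open
        }
        return nil // closed, or cooldown elapsed (half-open probe allowed)
    }

    func (b *breaker) record(status int, err error) {
        b.mu.Lock()
        defer b.mu.Unlock()
        if err == nil && status < 500 {
            b.failures = 0 // success or 4xx: client bug, not plugin health
            return
        }
        b.failures++
        if b.failures == 3 {
            b.openedAt = time.Now()
        }
    }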
What ships:
- workspace-server/internal/memory/client/client.go
- client_test.go: 100% statement coverage
Coverage corner cases pinned:
- env-var success branches in New (parseDurationEnv applied)
- json.Marshal error (via channel in Propagation)
- http.NewRequestWithContext error (via unbalanced bracket in BaseURL)
- 204 NoContent on endpoint that normally has a body
- 4xx vs 5xx breaker behavior (4xx must NOT trip)
- breaker cooldown elapsed → reset on next success
- all 6 public endpoints fail-fast when breaker is open
This package has no callers in this PR; integration starts in PR-5.
First of 11 PRs implementing the memory-system plugin refactor (RFC #2728).
This PR is pure additive scaffolding — no behavior change, no integration
yet. It defines the wire shape between workspace-server and a memory
plugin so PR-2 (HTTP client) and PR-3 (built-in postgres plugin) can be
built against a single source of truth.
What ships:
- docs/api-protocol/memory-plugin-v1.yaml: OpenAPI 3.0.3 spec covering
/v1/health, namespace upsert/patch/delete, memory commit, search,
forget. Auth-free (private network only); workspace-server is the
only sanctioned client and the security perimeter.
- workspace-server/internal/memory/contract: typed Go bindings with
Validate() methods on every wire object so both client (PR-2) and
server (PR-3) self-check at the boundary.
- Round-trip JSON tests for every type (catch asymmetric tag bugs).
- 5 golden vector files under testdata/ pinning the exact wire shape;
update via UPDATE_GOLDENS=1.
Coverage: 100% of statements in contract.go.
The validation rules encode design decisions worth flagging in review:
- SearchRequest with empty Namespaces is REJECTED at plugin level —
workspace-server is required to intersect the readable set
server-side; an empty list reaching the plugin is a bug.
- NamespacePatch with no fields is REJECTED — empty patches are
pointless round-trips.
- MemoryWrite with whitespace-only Content is REJECTED — zero-info
memories pollute search results.
No code yet calls into this package; integration starts in PR-2.
User-reported 2026-05-04: deploying a team org-template ("Design
Director" + 6 sub-agents) on a SaaS tenant produced 7-of-7
WORKSPACE_PROVISION_FAILED with the misleading message
"container started but never called /registry/register". Diagnose
returned "docker client not configured on this workspace-server" and
the workspace rows had no instance_id.
Root cause: TeamHandler.Expand hardcoded h.wh.provisionWorkspace —
the Docker leg of WorkspaceHandler. WorkspaceHandler.Create branched
on h.cpProv to pick CP-managed EC2 (SaaS) vs local Docker
(self-hosted), but Expand never used that branch. On SaaS the docker
goroutine ran but had no socket, so children silently sat in
"provisioning" until the 600s sweeper marked them failed.
Architectural principle (user): templates own
runtime/config/prompts/files/plugins; the platform owns where it
runs. Backend selection belongs in one helper.
Fix (dispatcher sketched after this list):
- Extract WorkspaceHandler.provisionWorkspaceAuto: picks CP when
cpProv is set, Docker when only provisioner is set, returns false
when neither (caller marks failed).
- WorkspaceHandler.Create routes through Auto.
- TeamHandler.Expand routes through Auto.
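The dispatcher, sketched (exact signature assumed):

    func (h *WorkspaceHandler) provisionWorkspaceAuto(ws *Workspace) bool {
        switch {
        case h.cpProv != nil:
            go h.provisionWorkspaceCP(ws) // SaaS: CP-managed EC2
            return true
        case h.provisioner != nil:
            go h.provisionWorkspace(ws) // self-hosted: local Docker
            return true
        default:
            return false // no backend wired — caller marks the workspace failed
        }
    }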
Tests pin three invariants:
- TestProvisionWorkspaceAuto_NoBackendReturnsFalse — Auto signals
fall-through correctly so the caller can persist + mark-failed.
- TestProvisionWorkspaceAuto_RoutesToCPWhenSet — when cpProv is
wired, Start lands on CP (the user-visible regression target).
Discipline-verified: removing the cpProv branch fails this.
- TestTeamExpand_UsesAutoNotDirectDockerPath — source-level guard
against future refactors reintroducing the hardcoded Docker call.
Discipline-verified: reverting team.go fails this with a clear
message naming the bug class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live-probed user's tenant: three of three external-runtime workspaces
register with delivery_mode = NULL, not "poll". The earlier narrow
poll-only check fell through to the misleading 503 for the actually-
observed shape.
Invariant we want: URL empty + not-exactly-"push" → no dispatch path
will ever exist → 422. Only push-mode with empty URL is genuinely
transient (mid-boot, restart in progress) → 503.
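The branch, sketched (names assumed; delivery_mode's NULL scans to an
empty string here):

    if url == "" {
        if deliveryMode == "push" {
            // Genuinely transient: the URL arrives on the next heartbeat.
            http.Error(w, "workspace url not registered yet",
                http.StatusServiceUnavailable)
            return
        }
        // poll / NULL / anything else: no dispatch path will ever exist.
        http.Error(w, "workspace has no callback URL; re-register in push mode "+
            "with a public URL", http.StatusUnprocessableEntity)
        return
    }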
Added TestChatUpload_NullModeEmptyURL using the user's actual workspace
ID. Existing TestChatUpload_NoURL switched to explicit "push" mode
(was relying on default — unsafe given the new branching).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External-runtime workspaces that register in poll mode have no callback
URL by design — the platform never dispatches to them, so chat upload
(HTTP-forward by design) can't proceed. Returning 503 + "workspace url
not registered yet" was misleading: the "yet" implied transient state,
but the URL would never arrive.
Caught externally on 2026-05-04: user uploading an image to an external
"mac laptop" runtime workspace saw the 503 and assumed they should
retry. The workspace's poll mode meant retrying would never help.
Fix: include delivery_mode in the workspace lookup. When URL is empty:
- poll mode → 422 + "re-register in push mode with a public URL"
(Unprocessable Entity — this request can't succeed against this
workspace's configuration; no retry will help)
- push mode → 503 + "not registered yet" (genuine transient state —
retry after next heartbeat is correct)
Test: TestChatUpload_PollModeEmptyURL pins the new 422 path; existing
TestChatUpload_NoURL strengthened to assert the "not registered yet"
substring stays on the push branch (it would have silently passed if
the new 422 path had clobbered both branches).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User report: handing the modal's Claude Code channel snippet to an
agent fails immediately with two errors that the snippet doesn't tell
the operator how to resolve:
plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · plugin not installed
plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · not on the approved channels allowlist
Root cause: the snippet's `claude --channels plugin:...` line assumes
the plugin is pre-installed AND that the channel is on Anthropic's
default allowlist. Both assumptions are wrong for a custom Molecule
plugin in a public repo.
Two changes:
1. Rewrite externalChannelTemplate (Go) with full setup chain:
- Bun prereq check (channel plugins are Bun scripts)
- `/plugin marketplace add Molecule-AI/molecule-mcp-claude-channel`
+ `/plugin install molecule@molecule-mcp-claude-channel` BEFORE the
launch — otherwise "plugin not installed"
- `--dangerously-load-development-channels` flag on launch — required
for non-Anthropic-allowlisted channels, otherwise "not on approved
channels allowlist"
- Common-errors block at the bottom mapping each error string to
which numbered step recovers it
- Team/Enterprise managed-settings caveat (the dev-channels flag is
blocked there; admin must use channelsEnabled + allowedChannelPlugins)
Plugin install info verified by reading `Molecule-AI/molecule-mcp-claude-channel`
plugin.json (`name: "molecule"`) and the Claude Code channels +
plugin-discovery docs at code.claude.com/docs/en/{channels,discover-plugins}.
2. Add per-tab HelpBlock to the modal (canvas):
- Collapsible <details> below each snippet, closed by default so the
snippet stays the visual focus
- "Where to install" link (PyPI for runtime, claude.com for Claude
Code, github.com/openai/codex for Codex, NousResearch/hermes-agent
for Hermes)
- "Documentation" link (docs.molecule.ai/docs/guides/*; hostname
confirmed by existing blog post canonical metadata; paths map
1:1 to docs/guides/*.md files in this repo)
- "Common errors" list with concrete recovery steps for each tab
(e.g. Codex tab calls out the codex≥0.57 requirement and TOML
duplicate-table parse error; OpenClaw calls out the :18789 port
conflict check)
URL discipline: every URL is either (a) verified against a file path
in this repo's docs/, (b) the canonical repo of an existing snippet
reference, or (c) a well-known third-party canonical URL. No guessed
URLs — broken links would defeat the purpose of "more comprehensive
instructions."
Verification:
- `go build ./...` clean in workspace-server
- `go test ./internal/handlers/...` passes (4.3s)
- Bash syntax check on test_staging_full_saas.sh (no edits there) clean
- TS brace/paren/bracket counts balanced; no full tsc run because the
worktree's node_modules isn't installed — counterpart Canvas tabs E2E
on the PR will exercise the full type-check + render path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of the modal-tab additions caught footguns in the new
hermes/codex/openclaw snippets. Ship the fixes before merge.
Critical 1 — Hermes `cat >> ~/.hermes/config.yaml` corrupts existing
configs. Most existing hermes installs have a top-level gateway:
block; appending creates a duplicate, which YAML rejects. Replaced
the auto-append with explicit instructions: 'under your existing
gateway: block, add a plugin_platforms entry'.
Critical 2 — Codex `cat >> ~/.codex/config.toml` corrupts on
re-run. TOML rejects duplicate [mcp_servers.molecule] tables; a
second run breaks codex parse. Replaced auto-append with commented
config block + explicit 'open ~/.codex/config.toml in your editor
and paste'. Canvas-side token stamping still hits the literal in
the comment so the operator's clipboard has the real token already
substituted.
Required 3 — OpenClaw `onboard --non-interactive` missing
provider/model defaults. Added explicit --provider + --model
placeholders in a commented form so operators see what's needed
without a stub default applying silently.
Required 4 — OpenClaw gateway started with bare '&' dies on
terminal close. Switched to nohup + log file + disown, with a note
that systemd is the right answer for production.
Optional 5 + 6 (env_vars cleanup, tests) deferred — env_vars stripped
to keep the in-tree-vs-external surface narrow; tests for the new
response fields can land separately when external_connection.go is
next touched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The External Connect modal had tabs for Python SDK / curl / Claude Code
channel / Universal MCP. Operators using hermes / codex / openclaw as
their external runtime had no copy-paste path; they pieced together
WORKSPACE_ID + PLATFORM_URL + auth_token into config files by reading
the docs.
Adds three runtime-specific snippets stamped server-side:
- **Hermes** — installs molecule-ai-workspace-runtime + the
hermes-channel-molecule plugin, exports the 4 env vars, and writes
the gateway.plugin_platforms.molecule block into ~/.hermes/config.yaml.
Same long-poll-based push semantics the Claude Code channel tab
delivers (push parity with the in-tree template-hermes adapter).
- **Codex** — wires the molecule_runtime A2A MCP server into
~/.codex/config.toml ([mcp_servers.molecule] block with env_vars
passthrough + literal env values). Outbound tools only — codex's
MCP client doesn't route arbitrary notifications/* (verified by
reading codex-rs/codex-mcp/src/connection_manager.rs); push parity
on external codex would need a separate bridge daemon, tracked
as future work. Snippet calls this out so operators know to pair
with Python SDK if they need inbound delivery.
- **OpenClaw** — installs openclaw + onboards, wires the molecule
MCP server via openclaw mcp set, starts the gateway on loopback.
Same outbound-tools-only caveat as codex; the in-tree template-
openclaw adapter implements the full sessions.steer push path,
but an external setup would need the same bridge daemon to translate
platform inbox events into sessions.steer calls. Future work.
Default open tab changed from "Claude Code" to "Universal MCP".
Universal MCP is runtime-agnostic and works as a starting point for
any operator regardless of their downstream agent runtime; runtime-
specific tabs are still one click away. Pre-2026-05-03 the modal
defaulted to Claude Code, so operators using non-Claude runtimes
opened to a tab they had to skip past.
Tab order also reorganized:
Universal MCP → Python SDK → Claude Code → Hermes → Codex → OpenClaw → curl → Fields
Each runtime-specific tab is gated on the platform supplying the
snippet (older platform builds without the field don't show empty
tabs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat_history query
WHERE workspace_id = $1
AND activity_type = 'a2a_receive'
AND (source_id = $2 OR target_id = $2)
ORDER BY created_at DESC
forces a workspace-scoped seq-scan-and-filter at every call —
idx_activity_ws_type_time covers workspace_id+type prefix but the
(source OR target) clause then walks every workspace row. Demo
workspaces (≤50 rows) don't notice; production workspaces accumulate
thousands over months and chat_history latency grows linearly.
Adds two partial btree indexes: (workspace_id, source_id) WHERE
source_id IS NOT NULL and (workspace_id, target_id) WHERE target_id
IS NOT NULL. Postgres BitmapOrs them
into a workspace-scoped BitmapAnd against the existing index, dropping
chat_history from O(workspace_rows) to O(peer_a2a_rows).
Partial WHERE NOT NULL because most activity rows (heartbeats,
agent_log, memory_write, etc.) carry NULL source_id/target_id and
shouldn't bloat the index.
Anti-pattern caveat (per the issue): a single compound (a, b) index
can't serve 'a OR b' — Postgres only uses compound for prefix match.
Two separate indexes + BitmapOr is the right shape.
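A minimal sketch of the index pair as embedded migration SQL; the
table and index names are assumptions taken from the description
above, not copied from the actual migration file.

    package db

    // Hypothetical names; only the (cols) + partial WHERE shape is
    // load-bearing.
    const activityPeerIndexesSQL = `
    CREATE INDEX IF NOT EXISTS idx_activity_ws_source
        ON activity (workspace_id, source_id)
        WHERE source_id IS NOT NULL;
    CREATE INDEX IF NOT EXISTS idx_activity_ws_target
        ON activity (workspace_id, target_id)
        WHERE target_id IS NOT NULL;`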
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layer 1 of the runtime-rollout plan. Decouples publish from promotion by
giving operators a `runtime_image_pins` table the provisioner consults at
container-create time. No row = legacy `:latest` behavior; row present =
provisioner pulls `<base>@sha256:<digest>`. One bad publish no longer
breaks every workspace simultaneously.
Mechanics:
- Migration 047: `runtime_image_pins` (template_name PK + sha256 digest +
audit columns) and `workspaces.runtime_image_digest` (nullable, with
partial index) for "show me workspaces still on the old digest" queries.
- `resolveRuntimeImage` (handlers/runtime_image_pin.go): looks up the
pin, returns `<base>@sha256:<digest>` on hit, "" on miss/error so the
provisioner falls through to the legacy tag map. Availability over
pinning — any DB error logs and returns "" rather than blocking the
provision. `WORKSPACE_IMAGE_LOCAL_OVERRIDE=1` short-circuits the
lookup so devs rebuilding template images locally see their fresh
build.
- `WorkspaceConfig.Image` carries the resolved value into the
provisioner. `selectImage` honors it ahead of the runtime→tag map and
falls back to DefaultImage on unknown runtime.
- The existing `imageTagIsMoving` predicate (#215) already returns false
on `@sha256:` form, so digest pins skip the force-pull path naturally.
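A hedged sketch of the lookup's shape. The signature, table, and the
availability-over-pinning rule come from the bullets above; the SQL
and column names are assumptions.

    package handlers

    import (
        "database/sql"
        "log"
        "os"
    )

    func resolveRuntimeImage(db *sql.DB, templateName, baseImage string) string {
        if os.Getenv("WORKSPACE_IMAGE_LOCAL_OVERRIDE") == "1" {
            return "" // dev rebuilding template images locally
        }
        var digest string
        err := db.QueryRow(
            `SELECT digest FROM runtime_image_pins WHERE template_name = $1`,
            templateName).Scan(&digest)
        switch {
        case err == sql.ErrNoRows:
            return "" // no pin row: legacy tag-map behavior
        case err != nil:
            // availability over pinning: log and fall through
            log.Printf("runtime image pin lookup for %s failed, falling back to tag: %v",
                templateName, err)
            return ""
        }
        return baseImage + "@sha256:" + digest
    }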
Tests:
- Handler-side (sqlmock): no-pin/db-error/with-pin/empty/unknown/local-
override paths cover every branch of `resolveRuntimeImage`.
- Provisioner-side: `selectImage` table covers explicit-image preference,
runtime-map fallback, unknown-runtime → default, empty-config →
default. Plus a struct-literal compile-time pin on `Image` so a future
refactor can't silently drop the field.
Layer 2 (per-ring routing via `workspaces.runtime_image_digest`) and the
admin promote/rollback endpoint ride on top of this and ship separately.
The Docker-mode orphan sweeper was incorrectly targeting external runtime
workspaces, revoking their auth tokens ~6 minutes after creation (one
sweep cycle past the 5-min grace).
External workspaces have NO local container by design — their agent runs
off-host. The "no live container" predicate the sweep uses to detect
wiped-volume orphans matches every external workspace unconditionally,
which was killing the only auth credential the off-host agent has.
Reproducer: create runtime=external workspace, paste the auth token into
molecule-mcp / curl, wait 5 minutes. Next request returns
`HTTP 401 — token may be revoked`. Platform log shows
`Orphan sweeper: revoking stale tokens for workspace <id> (no live
container; volume likely wiped)`.
Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The
existing test regexes (third-pass query expectations + the shared
expectStaleTokenSweepNoOp helper) are tightened to require the new
predicate, so a regression that drops it fails CI immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeats fire every 60s per workspace and were the dominant caller
of ReadPlatformInboundSecret — one DB SELECT each, purely to redeliver
the same value. For an N-workspace fleet that's N SELECTs/minute of
pure overhead, growing linearly with the fleet (#189).
This adds a sync.Map cache keyed by workspaceID with a 5-minute TTL:
- **Read-through**: cache miss → DB SELECT → populate → return.
- **Write-through**: every IssuePlatformInboundSecret call refreshes
the cache with the new value before returning, so the lazy-heal mint
path (readOrLazyHealInboundSecret) doesn't see a stale read of the
value it just wrote.
- **TTL eviction**: 5 minutes — generous enough that the heartbeat
hot path hits cache for ~5 reads in a row before re-validating, short
enough that an out-of-band rotation (operator running
`UPDATE workspaces SET platform_inbound_secret=...` directly)
propagates within minutes without requiring a redeploy.
- **Absence not cached**: ErrNoInboundSecret skips the cache write so
the lazy-heal recovery contract for the column-NULL case
(readOrLazyHealInboundSecret in workspace_provision_shared.go) keeps
working.
Memory footprint is bounded by the active workspace fleet (~200 bytes
per entry); deleted workspaces leave dead entries until process restart,
acceptable given workspace-deletion is operator-rare.
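A compressed sketch of the cache mechanics, assuming a readDB seam
for the underlying SELECT; names here are illustrative, not the real
wsauth identifiers.

    package wsauth

    import (
        "sync"
        "time"
    )

    type cachedSecret struct {
        value   string
        expires time.Time
    }

    var (
        secretCache sync.Map   // workspaceID -> cachedSecret
        cacheNow    = time.Now // swappable for a deterministic TTL test
        cacheTTL    = 5 * time.Minute
    )

    // Read-through: miss or expired -> DB -> populate -> return.
    // Absence (any error) deliberately skips the cache write.
    func readSecretCached(workspaceID string, readDB func(string) (string, error)) (string, error) {
        if v, ok := secretCache.Load(workspaceID); ok {
            if c := v.(cachedSecret); cacheNow().Before(c.expires) {
                return c.value, nil
            }
        }
        secret, err := readDB(workspaceID)
        if err != nil {
            return "", err // keeps the lazy-heal contract for NULL columns
        }
        secretCache.Store(workspaceID, cachedSecret{secret, cacheNow().Add(cacheTTL)})
        return secret, nil
    }

    // Write-through: Issue refreshes the cache before returning, so the
    // lazy-heal mint path never reads a value staler than its own write.
    func cacheOnIssue(workspaceID, secret string) {
        secretCache.Store(workspaceID, cachedSecret{secret, cacheNow().Add(cacheTTL)})
    }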
Why in-process instead of Redis: workspace-server runs as a single
Railway service today (per memory project_controlplane_ownership);
adding Redis for this single column read would be over-engineering.
The cache is a self-contained, Redis-free upgrade that keeps the same
semantic surface (read returns the latest secret) while collapsing
the heartbeat read storm. If the deployment ever fans out across
replicas, an operator-side rotation propagates per-replica TTL-bounded
without needing a shared write log.
Tests: 5 new cases covering cache hit within TTL, refresh after TTL
(simulating an operator rotation via SQL), write-through on Issue,
absence-not-cached, and Reset clearing all entries. The setupMock
helper in wsauth and setupTestDB helper in handlers both call
ResetInboundSecretCacheForTesting() at start + cleanup so write-through
state from one test doesn't shadow SELECT expectations in the next.
SetInboundSecretCacheNowForTesting() exposes a deterministic clock
override so the TTL test doesn't sleep.
Task: #189.
Previously Start() only pulled when the image was missing locally
(imgErr != nil). Once a tenant's Docker daemon had `:latest` cached,
it stuck on that snapshot forever even after publish-runtime pushed
a newer image with the same tag — the same image-cache class that
sibling task #232 closed on the controlplane redeploy path.
Now Start() additionally re-pulls when the tag is "moving"
(`:latest`, no tag, `:staging`, `:main`, `:dev`, `:edge`, `:nightly`,
`:rolling`). Pinned tags (semver, sha-prefixed, date-stamped, build-id)
and digest-pinned references (`@sha256:...`) skip the pull because
their contents are by definition immutable.
The classifier (imageTagIsMoving) is deliberately conservative on the
"moving" side — only the well-known moving tags trip it. Misclassifying
a pinned tag as moving wastes bandwidth on every provision; misclassifying
moving as pinned silently bricks the fleet on stale snapshots, which
is exactly the bug class this fix closes.
Edge cases handled:
- Registry hostname with port (`localhost:5000/foo`) — the `:5000` is
not mistaken for a tag.
- Digest pinning (`image@sha256:...`) — never re-pulled even if a
moving-looking tag is also present.
- Legacy local-build tags (`workspace-template:hermes`) — treated as
pinned (no registry to move from).
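A sketch of the classifier's decision order under the rules above;
the helper's internals are assumptions, only the classifications are
the commit's.

    package provisioner

    import "strings"

    var movingTags = map[string]bool{
        "latest": true, "staging": true, "main": true, "dev": true,
        "edge": true, "nightly": true, "rolling": true,
    }

    func imageTagIsMoving(ref string) bool {
        if strings.Contains(ref, "@sha256:") {
            return false // digest-pinned: immutable, never re-pull
        }
        slash := strings.IndexByte(ref, '/')
        colon := strings.LastIndexByte(ref, ':')
        if colon < 0 || colon < slash {
            // no tag at all (a registry host:port before the first '/'
            // doesn't count) means an implicit :latest
            return true
        }
        // unknown tags default to pinned: conservative on the moving side
        return movingTags[ref[colon+1:]]
    }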
Test coverage: 22 cases across all classifier shapes. No changes to
the pull-failure path (still best-effort, ContainerCreate still
surfaces the actionable "image not found" error if the pull failed
and the cache is also empty).
Task: #215. Companion to #232.
The drift gate's monorepoRoot walk-up looked for workspace-configs-templates/
which is gitignored locally and doesn't exist in this repo at all (the
canonical script lives in molecule-ai-workspace-template-hermes). Test
failed on CI from day one with "could not find monorepo root".
Two layered fixes in one PR:
1. Vendor upstream derive-provider.sh as testdata/ + drop monorepoRoot.
The vendored copy has a header pointing operators at the upstream
source and a one-line cp command for refresh. Test now reads two
files (vendored shell + workspace_provision.go) via package-relative
paths — Go test sets cwd to the package dir, so this is hermetic
without any walk-up gymnastics.
2. Update the case-statement regex to match upstream's renamed variable
(${_HERMES_MODEL} since v0.12.0, the resolved value of
HERMES_INFERENCE_MODEL with a HERMES_DEFAULT_MODEL legacy fallback).
Regex now accepts either spelling so a future rename fails loudly
on the parser-sanity check rather than silently returning empty.
Vendoring upstream surfaced real drift the gate was designed to catch:
upstream v0.12.0 added 12 provider prefixes that deriveProviderFromModelSlug
didn't handle (xai/grok, bedrock/aws, tencent/tencent-tokenhub, gmi,
qwen-oauth, lmstudio/lm-studio, minimax-oauth, alibaba-coding-plan,
google-gemini-cli, openai-codex, copilot-acp, copilot). Without these,
Save+Restart on a workspace using one of those prefixes would persist
LLM_PROVIDER="" and the next boot would fall back to derive-provider.sh's
runtime *=auto branch — losing the user's explicit choice on every restart.
Added all 12 case clauses + 16 new table-driven test cases (covering
both canonical and aliased forms). Drift gate now passes; future
upstream additions will fail loudly with a "DRIFT: ..." message
pointing the engineer at the missing case.
Task: #242
PR #2535 added a Go port of derive-provider.sh
(deriveProviderFromModelSlug) so workspace-server can persist
LLM_PROVIDER into workspace_secrets at provision time. This created
two sources of truth — if a future PR adds a provider prefix to one
without the other, the platform's persisted LLM_PROVIDER silently
disagrees with what the container's derive-provider.sh produces at
boot, with no test going red.
This adds a hermetic drift gate that:
1. Parses workspace-configs-templates/hermes/scripts/derive-provider.sh
with regex (handling both single-line `pat/*) PROVIDER="x" ;;`
clauses and multi-line conditional clauses) to build a
map[prefix]provider.
2. Walks workspace_provision.go's AST with go/ast, finds
deriveProviderFromModelSlug, and extracts every case-clause
prefix → return-string-literal pair.
3. Cross-checks both directions and accepts only the two documented
divergences (nousresearch/* and openai/* both → "openrouter" at
provision time because derive-provider.sh's runtime-env checks
aren't loaded yet) via a hardcoded acceptedDivergences map.
4. Fails with an actionable message that names both files and
suggests the exact fix (add the case OR add to divergence list
with a comment).
Pattern: behavior-based AST gate from PR #2367 / memory feedback —
pin the invariant by what the function maps, not by what it's named.
Stdlib-only (go/ast, go/parser, go/token, regexp); no network, no DB,
no docker — reads two monorepo files in-process.
A second sanity-check test pins anchor prefixes the regex must find,
so a future shell-syntax change can't silently produce an empty map
and trivially pass the main gate.
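For flavor, a compressed sketch of the AST half (step 2). It assumes
a plain string-switch shape with literal returns; the real gate also
handles the multi-line conditional clauses mentioned above.

    package handlers_test

    import (
        "go/ast"
        "go/parser"
        "go/token"
        "strconv"
        "testing"
    )

    func extractGoCases(t *testing.T) map[string]string {
        t.Helper()
        fset := token.NewFileSet()
        f, err := parser.ParseFile(fset, "workspace_provision.go", nil, 0)
        if err != nil {
            t.Fatalf("parse workspace_provision.go: %v", err)
        }
        got := map[string]string{}
        ast.Inspect(f, func(n ast.Node) bool {
            fn, ok := n.(*ast.FuncDecl)
            if !ok || fn.Name.Name != "deriveProviderFromModelSlug" {
                return true
            }
            ast.Inspect(fn.Body, func(n ast.Node) bool {
                cc, ok := n.(*ast.CaseClause)
                if !ok || len(cc.Body) == 0 {
                    return true
                }
                ret, ok := cc.Body[len(cc.Body)-1].(*ast.ReturnStmt)
                if !ok || len(ret.Results) == 0 {
                    return true
                }
                rl, ok := ret.Results[0].(*ast.BasicLit)
                if !ok {
                    return true
                }
                provider, _ := strconv.Unquote(rl.Value)
                for _, e := range cc.List { // nil List = default clause
                    if lit, ok := e.(*ast.BasicLit); ok {
                        prefix, _ := strconv.Unquote(lit.Value)
                        got[prefix] = provider
                    }
                }
                return true
            })
            return false
        })
        return got
    }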
Closes task #242.
Three follow-ups from PR #2543's multi-model code review (audit #253).
1. **Log silent yaml.Unmarshal errors (#256).** When a malformed
config.yaml made `yaml.Unmarshal(data, &raw)` fail, the affected
template silently disappeared from /templates with no trace —
operator could not distinguish "excluded due to parse error" from
"never existed." That widened a real foot-gun once PR #2543 added
structured top-level `providers:` (a string-shaped top-level
`providers:` decoded into `[]providerRegistryEntry` would fail and
drop the whole entry). Now logs `templates list: skip <id>:
yaml.Unmarshal: <err>` and continues with the rest.
2. **Coexistence test (#257 part 1).** PR #2543 covered the structured
registry and slug list in isolation. claude-code-default in
production ships BOTH: top-level `providers:` (structured registry,
2 entries) AND `runtime_config.providers:` (slug list, 3 entries).
New `TestTemplatesList_BothProviderShapesCoexist` mirrors that
layout, asserts both shapes surface independently with no
cross-talk (e.g. a slug-only entry like `anthropic-api` does NOT
synthesize a stub in the structured registry), and pins the JSON
wire-shape for both fields side-by-side.
3. **`base_url: null` decoding assertion (#257 part 3).** Adds an
explicit `got[0].BaseURL == ""` check in the existing
`TestTemplatesList_SurfacesProviderRegistry` test, locking in the
`string` (not `*string`) type. A future change to `*string` would
surface as JSON `null` and break canvas's "no base_url = use
provider defaults" branch — caught loudly by this assertion.
Tests: 11 TestTemplatesList_* now green, including the new
MalformedYAMLLogsAndSkips and BothProviderShapesCoexist.
The remaining piece of #257 — renaming `Providers []string` JSON tag
to `provider_slugs` — requires coordinated canvas updates across 4
files and is intentionally deferred to a separate PR (no canvas
churn while user is mid-test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the contract drift caught by audit #253. Task #235 ("Server:
enrich /templates payload with structured providers") was marked
completed, but `templates.go` only ever emitted the
`runtime_config.providers []string` slug list — the structured
ProviderEntry shape (auth_env, model_prefixes, model_aliases, base_url)
the description promised was never plumbed.
Templates ship the structured registry under a TOP-LEVEL `providers:`
block (claude-code carries 6+ entries today; hermes still uses the
slug list). Both shapes coexist and are independent — surface them as
two separate fields:
- `providers` → existing []string slug list (unchanged)
- `provider_registry` → new []providerRegistryEntry (structured)
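A sketch of the structured entry. The JSON field names are the ones
this commit names (auth_env, model_prefixes, model_aliases, base_url);
the Go field types are assumptions, except base_url, which a follow-up
deliberately pins as string rather than *string.

    package handlers

    type providerRegistryEntry struct {
        AuthEnv       string            `json:"auth_env,omitempty"`
        ModelPrefixes []string          `json:"model_prefixes,omitempty"`
        ModelAliases  map[string]string `json:"model_aliases,omitempty"`
        BaseURL       string            `json:"base_url,omitempty"` // "" = use provider defaults
    }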
The canvas's ProviderModelSelector comment block already anticipates
this ("Templates that ship explicit vendor metadata (future) should
override the heuristic."). With this field in place, the canvas can
optionally drop its prefix-inference fallback for templates that ship
an explicit registry — separate PR. Today's change is purely additive
on the server side; no canvas change required.
Tests:
- TestTemplatesList_SurfacesProviderRegistry: order preservation +
field plumbing on a claude-code-shaped fixture (oauth + minimax)
+ JSON wire-shape gate to catch struct-tag renames.
- TestTemplatesList_OmitsProviderRegistryWhenAbsent: omitempty so
legacy templates (hermes, langgraph) don't emit `null` and break
Array.isArray on the canvas side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug B fix, server-side complement to molecule-runtime PR #2538.
The runtime PR taught `workspace/config.py` to honour
`MODEL_PROVIDER` over `runtime_config.model` from the template's
verbatim YAML. This PR is the upstream half: workspace-server's
`applyRuntimeModelEnv` now sets `MODEL=<picked>` for **every**
runtime, not just hermes (which got `HERMES_DEFAULT_MODEL` already).
Pre-fix: applyRuntimeModelEnv's per-runtime switch only emitted
HERMES_DEFAULT_MODEL for hermes; every other runtime got nothing,
so the adapter read its template's default model from
/configs/config.yaml. Surfaced 2026-05-02 — picking MiniMax-M2 in
canvas → workspace booted with model=sonnet (claude-code template
default) and demanded CLAUDE_CODE_OAUTH_TOKEN.
Post-fix: MODEL is set unconditionally before the per-runtime switch.
HERMES_DEFAULT_MODEL stays for backwards compat. Adapters opt in by
reading os.environ["MODEL"] in their executor (claude-code adapter
already does this since the same Bug B fix; see
workspace-configs-templates/claude-code-default/adapter.py).
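The post-fix shape, sketched; parameter names are assumptions, and
the real function also walks the secret-fallback chain described
above.

    package handlers

    func applyRuntimeModelEnv(runtime, model string, envVars map[string]string) {
        if model == "" {
            model = envVars["MODEL_PROVIDER"] // secret-fallback path
        }
        if model == "" {
            return // nothing picked anywhere: adapter keeps its template default
        }
        envVars["MODEL"] = model // universal, set before the per-runtime switch
        switch runtime {
        case "hermes":
            envVars["HERMES_DEFAULT_MODEL"] = model // kept for backwards compat
        }
    }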
Tests
=====
- `TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes`:
table-driven across claude-code/hermes/langgraph/crewai + empty-model
fallback + MODEL_PROVIDER-secret-fallback path. Adding a new
runtime = adding a row, not writing a new test.
- All 6 sub-cases pass + existing
`TestWorkspaceCreate_FirstDeploy_UnknownModel_OnlyMintModelProvider`
pin still green.
Why now
=======
This was authored alongside the runtime PR but stashed (not committed)
during a session-handoff cleanup. The molecule-runtime side shipped at
SHA 16ac895a and is live on PyPI as molecule-ai-workspace-runtime
0.1.84, but until the workspace-server side ships, the canvas-picked
MODEL env never reaches non-hermes adapters.
Caught by the systematic stash audit triggered by the user's
discovery that ProviderModelSelector had been similarly stashed.
Closes the workspace-server side of #246. Builds on merged #2538.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deletes the 5 unsupported workspace_templates from manifest.json
(langgraph, crewai, autogen, deepagents, gemini-cli). The runtime
matrix is now claude-code / hermes / openclaw / codex — the four
templates with shipping images, working A2A integration, and active
CI publish-image cascades.
Mirrors the prune in:
- workspace-server/internal/handlers/runtime_registry.go
(fallbackRuntimes for dev/test contexts that boot without the
manifest mounted)
- workspace-server/internal/handlers/workspace_provision.go
(sanitizeRuntime: empty/unknown → "claude-code", was "langgraph";
removes the langgraph/deepagents-specific runtime_config skip
branch — they're no longer supported, so the block is dead)
- tests for both: rename TestEnsureDefaultConfig_LangGraph →
_Hermes, TestEnsureDefaultConfig_EmptyRuntimeDefaultsToLangGraph
→ _ClaudeCode, drop TestEnsureDefaultConfig_DeepAgents,
update TestSanitizeRuntime_Allowlist + the two
TestResolveRestartTemplate_* cases that pinned langgraph-default
as the safe-default name
Why this is safe: production reads manifest.json at boot and uses it
as the authoritative allowlist; the 5 removed runtimes have not
shipped working images for ≥1 release cycle. Any provision request
naming one will now coerce to claude-code (with a log line) instead
of returning a runtime that has no functioning template repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the canvas POSTs /workspaces with {model: "minimax/MiniMax-M2.7"},
the model slug was never written to workspace_secrets. The workspace
booted hermes once with HERMES_DEFAULT_MODEL set from payload.Model, but
on every subsequent restart applyRuntimeModelEnv's fallback chain found
nothing in envVars["MODEL_PROVIDER"] (because nothing wrote it) and
hermes silently fell through to the template default
(nousresearch/hermes-4-70b) — wrong provider keys → hermes gateway
401'd → /health poll failed → molecule-runtime never registered →
"container started but never called /registry/register".
Worse, LLM_PROVIDER was never written either (the canvas doesn't send
provider), so CP user-data wrote no provider: field to
/configs/config.yaml and derive-provider.sh fell through to PROVIDER=auto
on every custom-prefix slug.
Fix: after the workspace row commits, persist MODEL_PROVIDER (verbatim
slug) and LLM_PROVIDER (derived from slug prefix) to workspace_secrets.
LLM_PROVIDER is gating-only — derive-provider.sh remains the runtime
source of truth and can override at boot. Reuses extracted
setModelSecret / setProviderSecret helpers (refactored out of SetModel /
SetProvider gin handlers) so SQL stays in one place.
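The shared upsert those helpers presumably wrap, sketched; table and
column names are assumptions.

    package handlers

    import (
        "context"
        "database/sql"
    )

    // Called once per key after the workspace row commits:
    // MODEL_PROVIDER = verbatim slug, LLM_PROVIDER = derived prefix.
    func upsertWorkspaceSecret(ctx context.Context, db *sql.DB, workspaceID, key, value string) error {
        _, err := db.ExecContext(ctx, `
            INSERT INTO workspace_secrets (workspace_id, key, value)
            VALUES ($1, $2, $3)
            ON CONFLICT (workspace_id, key) DO UPDATE SET value = EXCLUDED.value`,
            workspaceID, key, value)
        return err
    }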
Symptom: failed-workspace 95ed3ff2 (2026-05-02).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both restart paths (interactive Restart handler + auto-restart's
stopForRestart) used to log-and-continue on cpProv.Stop failure. After
PR #2500 made CPProvisioner.Stop surface CP non-2xx as an error, those
paths became the actual leak generator: every transient CP/AWS hiccup =
one orphan EC2 alongside the freshly provisioned one. The 13 zombie
workspace EC2s on demo-prep staging traced to this exact path.
Adds cpStopWithRetry helper with bounded exponential backoff (3 attempts,
1s/2s/4s). Different policy from workspace_crud.go's Delete handler:
Delete returns 500 to the client on Stop failure (loud-fail-and-block —
user asked to destroy, silent leak unacceptable), whereas Restart's
contract is "make the workspace alive again" — refusing to reprovision
strands the user with a dead workspace. So this helper retries to absorb
transient failures, then on exhaustion emits a structured `LEAK-SUSPECT`
log line for the (forthcoming) CP-side workspace orphan reconciler to
correlate. Caller proceeds to reprovision regardless.
ctx-cancel exits the retry early without sleeping the backoff (matters
during shutdown drain); the cancel path emits a distinct log line and
deliberately does NOT emit LEAK-SUSPECT — operator-cancel and
retry-exhaustion are different signals and conflating them would noise
up the orphan-reconciler queue with workspaces we never had a chance to
retry.
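A sketch of the retry shape. The exact attempt/sleep interleaving is
an assumption; the commit's stated policy is 3 attempts with 1s/2s/4s
exponential backoff.

    package handlers

    import (
        "context"
        "log"
        "time"
    )

    func cpStopWithRetry(ctx context.Context, stop func(context.Context) error, workspaceID string) {
        const attempts = 3
        backoff := time.Second
        for i := 1; i <= attempts; i++ {
            err := stop(ctx)
            if err == nil {
                return
            }
            if i == attempts {
                // retry exhaustion: structured line for the (forthcoming)
                // CP-side orphan reconciler; caller reprovisions anyway
                log.Printf("LEAK-SUSPECT workspace=%s: cp stop failed after %d attempts: %v",
                    workspaceID, attempts, err)
                return
            }
            log.Printf("cp stop attempt %d/%d failed for %s: %v", i, attempts, workspaceID, err)
            select {
            case <-ctx.Done():
                // operator-cancel is a different signal: deliberately no LEAK-SUSPECT
                log.Printf("cp stop retry for %s canceled: %v", workspaceID, ctx.Err())
                return
            case <-time.After(backoff):
                backoff *= 2
            }
        }
    }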
Tests: 5 behavior tests covering every branch (no-op, first-try success,
eventual success, exhaustion, ctx-cancel) + 1 AST gate that pins the
helper-only invariant (any future inline `h.cpProv.Stop(...)` in
workspace_restart.go fires the gate, mutation-tested).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
http.Client.Do only errors on transport failure — a CP 5xx (AWS
hiccup, missing IAM, transient outage) was silently treated as
success. Workspace row then flipped to status='removed' and the EC2
stayed alive forever with no DB pointer (the "orphan EC2 on a
0-customer account" scenario flagged in workspace_crud.go #1843).
Found while triaging 13 zombie workspace EC2s on demo-prep staging.
Adds a status-code check that returns an error tagged with the
workspace ID + status + bounded body excerpt, so the existing
loud-fail path in workspace_crud.go's Delete handler can populate
stop_failures and surface a 500. Body read is io.LimitReader-capped
at 512 bytes to keep error logs sane during a CP outage.
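The guard's shape, sketched; the function name and error wording are
assumptions, the 2xx range and 512-byte cap are the commit's.

    package provisioner

    import (
        "fmt"
        "io"
        "net/http"
    )

    func checkCPResponse(resp *http.Response, workspaceID string) error {
        if resp.StatusCode >= 200 && resp.StatusCode < 300 {
            return nil // 200/202/204 all count as success
        }
        // cap the excerpt so error logs stay sane during a CP outage
        body, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
        return fmt.Errorf("cp stop workspace %s: status %d: %s",
            workspaceID, resp.StatusCode, body)
    }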
Tests: 4 new (5xx surfaces, 4xx surfaces, 2xx variants 200/202/204
all succeed, long body is truncated). Test-first verified — the
first three fail on the buggy code and all four pass on the fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from PR #2494's review:
1. Two new sweep tests exercise the lookup path through
sweepStuckProvisioning end-to-end:
- ManifestOverrideSparesRow: claude-code 11min old, manifest=20min
→ no UPDATE, no broadcast (sparing works through the sweeper)
- ManifestOverrideStillFlipsPastDeadline: claude-code 21min old,
manifest=20min → flipped + payload.timeout_secs=1200
Closes the gap that the unit-test on provisioningTimeoutFor alone
left open: a future refactor could drop the lookup arg from the
sweeper's call and only the unit test caught it. Verified by
regression-injecting `lookup→nil` in sweepStuckProvisioning — both
new tests fail, the old ones still pass.
2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime
instead of calling provisionTimeouts.get directly. Single accessor
path for the same data — the canvas response and the sweeper now
resolve identically by construction.
No production behavior change; tests + accessor cleanup only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real wiring gap discovered while investigating issue #2486 cluster of
prod claude-code workspaces failed at exactly 10m. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.
Resolution order pinned by new TestProvisioningTimeout_ManifestOverride:
1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
2. Template manifest lookup (per-runtime, beats hermes default too)
3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
4. DefaultProvisioningTimeout (10 min)
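The chain, sketched; the lookup seam is the RuntimeTimeoutLookup
described under Wiring below, the constants and plumbing are
assumptions.

    package registry

    import (
        "os"
        "strconv"
        "time"
    )

    const (
        hermesProvisioningTimeout  = 30 * time.Minute // CP bootstrap 25m + 5m slack
        DefaultProvisioningTimeout = 10 * time.Minute
    )

    func provisioningTimeoutFor(runtime string, lookup func(string) (int, bool)) time.Duration {
        if s := os.Getenv("PROVISION_TIMEOUT_SECONDS"); s != "" {
            if n, err := strconv.Atoi(s); err == nil && n > 0 {
                return time.Duration(n) * time.Second // 1. ops-debug override
            }
        }
        if lookup != nil {
            if secs, ok := lookup(runtime); ok && secs > 0 {
                return time.Duration(secs) * time.Second // 2. template manifest
            }
        }
        if runtime == "hermes" {
            return hermesProvisioningTimeout // 3. hermes default
        }
        return DefaultProvisioningTimeout // 4. floor
    }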
Wiring:
- registry: new RuntimeTimeoutLookup function type, threaded through
StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
pre-existing provisioningTimeoutFor.
- handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
lookup as a method so main.go can pass it without breaking the
handlers→registry import direction.
- cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
the sweep boot.
Verified:
- go test -race ./... passes (every workspace-server package).
- Regression-injected the lookup arm: 3 manifest-override subcases
fail with the actual-vs-expected gap, confirming the new test is
load-bearing.
- The original two timeout tests (env-override, hermes default) keep
passing — `lookup=nil` argument preserves their semantics.
Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review nit on PR #2491: the previous message ("a goroutine reached
cpProv.Start but never broadcast its failure") could mislead an
operator if Assertions 2 and 4 both fire — Assertion 4 also catches
"goroutine exited via an earlier path before reaching Start." Spell
both modes out and cross-reference Assertion 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge follow-up to PR #2487 review feedback:
1. guardAgainstReraise(fn) helper around every panic-test exercise. The
original RecoversAndMarksFailed had its own outer recover() to detect
re-raise; NoOpWhenNoPanic and PersistFailureLogged didn't. If a future
regression makes logProvisionPanic re-raise, those two would have
crashed the test process (taking sibling tests down) instead of
reporting a clean failure. Now all three use the shared guard.
2. Concurrent repro now asserts bcast.count == 7 — the new
concurrentSafeBroadcaster's count field was added in the race fix
but not actually consumed. Cross-checks the existing recorder-set
assertion from a different angle: a goroutine could in principle
reach cpProv.Start (recorder hits) but then lose its
WORKSPACE_PROVISION_FAILED broadcast on the failure path. Pinning
both rules out that silent-drop variant for the canvas-broadcast
contract specifically.
3. Comment on captureLog noting log.SetOutput is process-global and
incompatible with t.Parallel() — preempts a future footgun if
someone parallelizes the panic suite.
Verified: all four tests pass under -race; full handlers + db packages
green under -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI Platform (Go) ran with -race and the concurrent test tripped the
detector: captureBroadcaster (the sequential-test stub) writes
lastData unguarded, and the 7 fan-out goroutines calling
markProvisionFailed all write to that stub concurrently. The local
non-race run had hidden it.
Introduce concurrentSafeBroadcaster (mutex-counted) for this single
fan-out test. Sequential tests keep using captureBroadcaster — the
fix is local to the test that creates the goroutines.
Verified ./internal/handlers passes with -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes addressing review of the issue #2486 observability PR:
1. CI failure: original inline UPDATE in logProvisionPanic used a hard-coded
`status='failed'` literal, which trips workspace_status_enum_drift_test
(the post-PR-#2396 gate that requires every status write to flow through
models.Status* via parameterized $N). Refactor to call
h.markProvisionFailed which uses StatusFailed parameterized.
2. Canvas-broadcast gap (review finding): inline UPDATE skipped
RecordAndBroadcast, so panic recovery marked the row failed in DB but
the canvas spinner stayed on "provisioning" until the next poll.
markProvisionFailed fires WORKSPACE_PROVISION_FAILED, so canvas now
flips to a failure card immediately.
3. Critical test bug (review finding): `defer log.SetOutput(log.Writer())`
in three test sites evaluated log.Writer() at defer-fire time AFTER the
SetOutput swap — restoring the buffer to itself, never restoring
os.Stderr. Subsequent tests in the package were running with the panic
tests' captured buffer as their writer. Extracted captureLog(t) helper
that captures `prev` BEFORE the swap and uses t.Cleanup.
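The corrected helper's shape, as a sketch; the prev-before-swap
ordering and t.Cleanup are the commit's contract, the rest is
assumed.

    package handlers_test

    import (
        "bytes"
        "log"
        "testing"
    )

    func captureLog(t *testing.T) *bytes.Buffer {
        t.Helper()
        var buf bytes.Buffer
        prev := log.Writer() // captured BEFORE the swap: the whole fix
        log.SetOutput(&buf)
        t.Cleanup(func() { log.SetOutput(prev) }) // restores os.Stderr, not the buffer
        return &buf
    }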
Plus: softened the "goroutine never started" comment in the concurrent
repro harness — the harness atomic-counts BEFORE the entry log fires, so
"never started" was misleading; the real failure mode is "entry log
renamed/removed or writer hijacked."
Verified: full handlers suite passes; drift gate passes (Platform Go CI
failure root-caused). Regression-injected the recover body again — both
panic tests still fail as expected, confirming the contract is gated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Goal: a deterministic, in-process reproduction of the prod incident
where 7 simultaneous claude-code provisions on the hongming tenant
produced ZERO log lines from any of the four documented exit paths.
Approach: stub CPProvisioner that records every Start() call,
sqlmock for the prepare flow, fire 7 goroutines concurrently against
provisionWorkspaceCP, then assert:
1. Entry log fired exactly 7 times (one per goroutine).
2. Stub Start() recorded all 7 distinct workspace IDs.
3. Each goroutine's entry log names its own workspace ID.
Result on staging head as of 2026-05-02: PASSES — meaning the
silent-drop class isn't reproducible against current head with stub
CP. Tenant hongming runs sha 76c604fb (725 commits behind staging),
so the bug is most likely already fixed upstream — hongming needs
a redeploy.
The test stays as a regression gate: any future refactor that
re-introduces silent goroutine swallow in the CP provision path
(rate-limit drop, channel-send-without-receiver, panic without
recover, etc.) trips it.
A safeWriter wraps the captured log buffer because raw
bytes.Buffer.Write isn't safe for concurrent goroutines — without
serialization the 7 entry-log lines interleave at byte boundaries
and the strings.Count assertion gets unreliable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #2486: 7 claude-code workspaces stuck in provisioning produced
NONE of the four documented exit-path log lines in
provisionWorkspaceCP — neither prepare-failed, nor start-failed, nor
persist-instance-id-failed, nor success. Operators couldn't tell
whether the goroutine ran at all.
Add an entry log at the top of provisionWorkspaceOpts +
provisionWorkspaceCP so a missing entry distinguishes "goroutine
never started" from "started but exited via an unlogged path."
Add logProvisionPanic at the same defer site so a panic inside
either provisioner doesn't (a) crash the whole workspace-server
process, taking every other tenant workspace with it, and (b)
silently leave the row in `provisioning` until the 10-min sweeper
fires. The recover persists status='failed' with a sanitized
panic-class message via a fresh 10s context (the goroutine's own
ctx may have been the one panicking).
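A sketch of the defer site; the helper name is the commit's, the
mark-failed seam and message wording are assumptions. recover() works
here because the helper itself is the deferred function.

    package handlers

    import (
        "context"
        "log"
        "runtime/debug"
        "time"
    )

    // use: defer logProvisionPanic(workspaceID, h.markProvisionFailed)
    func logProvisionPanic(workspaceID string, markFailed func(context.Context, string, string) error) {
        r := recover()
        if r == nil {
            return // no-op on success: no spurious log line
        }
        log.Printf("provision panic workspace=%s: %v\n%s", workspaceID, r, debug.Stack())
        // fresh context: the goroutine's own ctx may be what panicked
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := markFailed(ctx, workspaceID, "provisioner panicked"); err != nil {
            log.Printf("provision panic persist failed workspace=%s: %v", workspaceID, err)
        }
    }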
Tests pin three contracts:
- no-op when no panic (otherwise every successful provision
emits a spurious log line)
- recovers + persists failed status on panic, with stack trace
- defense-in-depth: if the persist itself fails, log it instead
of leaving the operator with a recovered-panic log but no row
Regression-injected by neutering the recover() body — all three
tests fail until the recover + UPDATE path is restored.
This is observability + resilience only, not a root-cause fix
for #2486. The actual silent-drop class still needs reproduction
once the tenant is on a build that includes this entry log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-side chat_history MCP tool advertises a `before_ts`
parameter for backward paging through long histories, and the docs
describe it as the canonical pagination knob — but the server
silently ignored it until now. Without this fix, an agent passing
before_ts to chat_history would always get the most-recent N rows
and pagination would be broken end-to-end.
Add `before_ts` query param parsed as RFC3339 at the trust boundary
and translated into a `created_at < $X` clause on the existing
builder. Mirrors the strict-inequality shape since_id uses for
forward paging (`created_at > cursorTime`) so paging across both
directions has consistent semantics.
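The trust-boundary parse, sketched as a standalone helper; the real
handler wires the result into the query builder as `created_at < $X`.

    package handlers

    import (
        "errors"
        "time"
    )

    // Empty means "param absent: add no clause"; a malformed value is a
    // 400 to the caller rather than a silently ignored filter.
    func parseBeforeTS(raw string) (time.Time, error) {
        if raw == "" {
            return time.Time{}, nil
        }
        ts, err := time.Parse(time.RFC3339, raw)
        if err != nil {
            return time.Time{}, errors.New("before_ts must be RFC3339")
        }
        return ts, nil
    }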
Tests: 3 new branches (positive filter, composition with peer_id
into the canonical chat_history paging shape, RFC3339 rejection
across 4 malformed inputs including URL-encoded SQL injection).
Mutation-verified pre-commit; existing 9 activity tests still pass.
Reported by self-review on PR #2474.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaces the conversation history with one specific peer for the
wheel-side chat_history MCP tool. The filter joins
(source_id = $X OR target_id = $X) so both inbound (peer was sender)
and outbound (peer was recipient) turns appear in the same view,
ordered by created_at, and composes with existing type/source/
since_secs/since_id/limit filters.
Validates peer_id as a UUID at the trust boundary so a malformed
caller can't smuggle SQL fragments via the parameter — the args are
bound but the explicit rejection gives the wheel a cleaner 400
signal than an empty list, and defends against any future code path
that might interpolate the value into a URL or another query.
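A stdlib-only sketch of the UUID-shape gate; the real validator may
differ.

    package handlers

    import "regexp"

    var uuidShape = regexp.MustCompile(
        `^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

    // validPeerID rejects anything that isn't UUID-shaped, so a
    // malformed value never reaches a query or URL even though the
    // query args are already bound.
    func validPeerID(peerID string) bool {
        return uuidShape.MatchString(peerID)
    }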
Tests: 3 new branches (positive filter, composition with
type+source, UUID-shape rejection across 5 malformed inputs).
Mutation-verified: reverting activity.go fails all peer_id tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Option B PR-5. Canvas Config tab now exposes a Provider override input
that's adapter-driven from each runtime's template — no hardcoded
provider list in the canvas. PUT /workspaces/:id/provider on Save
when dirty; auto-restart suppression to avoid double-restart with
the model handler's own restart.
The dropdown's suggestion list comes from /templates →
runtime_config.providers (the field added in
molecule-ai-workspace-template-hermes PR #31). For templates that
haven't migrated to the explicit providers list yet, suggestions
derive from model[].id slug prefixes — still adapter-driven, just
inferred. This keeps existing templates working while platform team
migrates them one at a time.
workspace-server changes:
- Add Providers []string field to templateSummary JSON
- Parse runtime_config.providers in /templates handler
- 2 new tests pin the surfacing + omitempty behavior
canvas changes:
- Remove hardcoded PROVIDER_SUGGESTIONS constant
- Add provider/originalProvider state + PUT-on-save logic
- Add deriveProvidersFromModels() fallback helper
- Wire RuntimeOption.providers from /templates response
- 8 new tests pin the behavior end-to-end
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of PUT /model. Stores the provider slug as the LLM_PROVIDER
workspace secret so the canvas can update model + provider
independently — a user might keep the same model alias and switch
providers (route through a different gateway), or vice versa.
Forcing both into one endpoint imposes a single Save+Restart per
change; two endpoints let canvas update each as the user picks.
Plumbs through the existing chain: secret-load → envVars → CP
req.Env → user-data env exports → /configs/config.yaml (after
controlplane PR #364 lands the heredoc append).
Tests: 5 new cases mirroring SetModel/GetModel exactly — default
empty response, DB error, upsert with restart trigger, empty-clears,
invalid-UUID rejection.
Part of: Option B PR-2 (#196) — workspace-server plumbs LLM_PROVIDER
Stack: PR-1 schema (#2441 merged)
PR-2 (this) ws-server endpoint
PR-3 (#364 open) CP user-data persistence
PR-4 (pending) hermes adapter consume
PR-5 (pending) canvas Provider dropdown
#2429 review finding. The 410-Gone path issues a follow-up
`SELECT updated_at` after detecting status='removed'. If that query
fails (workspace row deleted between the two queries, transient DB
error, etc.), `removedAt` stays as Go's zero time and the JSON body
emits `"removed_at": "0001-01-01T00:00:00Z"` — a misleading timestamp
the client has to know to ignore.
Now we branch on `removedAt.IsZero()` and emit `null` for the failed
path. The actionable signal (the 410 + hint) is unchanged; only the
timestamp shape gets cleaner.
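The branch, sketched as a body builder. Field names are the commit's;
the hint text is elided here exactly as it is above.

    package handlers

    import (
        "time"

        "github.com/gin-gonic/gin"
    )

    func removedBody(workspaceID string, removedAt time.Time) gin.H {
        var ts interface{} // nil marshals to JSON null
        if !removedAt.IsZero() {
            ts = removedAt.UTC().Format(time.RFC3339)
        }
        return gin.H{
            "error":      "workspace removed",
            "id":         workspaceID,
            "removed_at": ts, // null when the follow-up SELECT failed
            "hint":       "Regenerate ...",
        }
    }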
Pinned by `TestWorkspaceGet_RemovedReturns410WithNullRemovedAtOnTimestampFetchFailure`,
which simulates the row vanishing via `sqlmock`'s `WillReturnError(sql.ErrNoRows)`.
The original `_RemovedReturns410` test now also asserts that the
happy-path timestamp is a non-null value (was just checking the key
existed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defense-in-depth at the endpoint level. Previously, GET /workspaces/:id
returned 200 OK with `status:"removed"` in the body for deleted
workspaces — silent-fail UX hit on the hongmingwang tenant 2026-04-30:
the channel bridge / molecule-mcp wheel had a dead workspace_id + token
in .env, get_workspace_info returned 200 → caller assumed everything
was fine, then every subsequent /registry/* call 401'd because tokens
were revoked, and operators had no idea their workspace was gone.
#2425 fixed the steady-state heartbeat path (escalate to ERROR after
3 consecutive 401s). This change is the startup-time defense — fail
loud when the operator first probes the workspace instead of waiting
for the heartbeat to sour.
The 410 body includes:
{error: "workspace removed", id, removed_at, hint: "Regenerate ..."}
Audit-trail consumers that need the body shape of a removed workspace
(admin views, "show me deleted workspaces" tooling) opt into the
legacy 200 + body via ?include_removed=true. Without this opt-in path
the audit trail becomes invisible at the API layer.
Two new tests pinned:
- TestWorkspaceGet_RemovedReturns410
- TestWorkspaceGet_RemovedWithIncludeQueryReturns200
Follow-ups in separate PRs:
- Update workspace/a2a_client.py get_workspace_info to surface
"removed" specifically rather than collapsing into "not found"
- Update channel bridge getWorkspaceInfo (server.ts) to detect 410
→ log clear "workspace was deleted, re-onboard" error
- Audit canvas/* + admin tooling consumers that may rely on the
legacy 200 + status:"removed" shape; switch them to the
?include_removed=true opt-in if needed
- Update docs (runtime-mcp.mdx Troubleshooting + external-agents.mdx
lifecycle table)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes and a nit cleanup from /code-review-and-quality on PR #2445:
1. **KI-005 hierarchy check parity with /terminal**
HandleConnect runs the KI-005 cross-workspace guard before dispatch
(terminal.go:85-106): when X-Workspace-ID is set and != :id, validate
the bearer's workspace binding then call canCommunicateCheck. Without
this, an org-level token holder in tenant Foo can probe any
workspace's diagnostic state by guessing the UUID — same enumeration
vector KI-005 closed for /terminal in #1609. Per-workspace bearer
tokens are URL-bound by WorkspaceAuth, so the gap is org tokens
within the same tenant.
Fix: copy the same gate into HandleDiagnose, before the
instance_id SELECT.
Test: TestHandleDiagnose_KI005_RejectsCrossWorkspace stubs
canCommunicateCheck=false and confirms 403 fires before the DB
lookup (sqlmock's ExpectationsWereMet pins that we never reached
the SELECT COALESCE). Mirrors the existing
TestTerminalConnect_KI005_RejectsUnauthorizedCrossWorkspace.
2. **Race-free tunnel stderr capture (syncBuf)**
strings.Builder isn't goroutine-safe. os/exec spawns a background
goroutine that copies the subprocess's stderr fd to cmd.Stderr's
Write, so reading the buffer's String() from the request goroutine
on wait-for-port timeout while the tunnel may still be writing is
a data race that `go test -race` flags. Worst-case impact in
production is a garbled Detail string (not a crash), but the fix
is small.
Fix: wrap bytes.Buffer in a sync.Mutex (syncBuf type; sketched after
this list). Same io.Writer interface, no API changes elsewhere.
3. **Nit cleanup**
- read-pubkey failure now reports as its own step name instead of
a duplicated "ssh-keygen" entry — disambiguates two different
failure modes that previously shared a name.
- Replaced the hand-rolled numToString int-to-string helper with
strconv.Itoa in the test (there was no import-saving reason for it).
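The syncBuf wrapper from fix 2, sketched; the type name and the
mutex-around-bytes.Buffer shape are the commit's.

    package handlers

    import (
        "bytes"
        "sync"
    )

    type syncBuf struct {
        mu  sync.Mutex
        buf bytes.Buffer
    }

    func (b *syncBuf) Write(p []byte) (int, error) {
        b.mu.Lock()
        defer b.mu.Unlock()
        return b.buf.Write(p)
    }

    func (b *syncBuf) String() string {
        b.mu.Lock()
        defer b.mu.Unlock()
        return b.buf.String()
    }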
Suite: 4 diagnose tests pass with -race; full handlers suite passes
in 3.95s. go vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /workspaces/:id/terminal/diagnose runs the same per-stage pipeline as
/terminal (ssh-keygen → EIC send-key → tunnel → ssh) but non-interactively
and returns JSON. Each stage reports {name, ok, duration_ms, error,
detail}, plus a top-level first_failure naming the broken stage.
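The per-stage wire shape implied here, sketched as Go structs; the
struct names are assumptions, the JSON field names are the ones just
listed.

    package handlers

    type diagnoseStep struct {
        Name       string `json:"name"`
        OK         bool   `json:"ok"`
        DurationMS int64  `json:"duration_ms"`
        Error      string `json:"error,omitempty"`
        Detail     string `json:"detail,omitempty"`
    }

    type diagnoseResponse struct {
        Steps        []diagnoseStep `json:"steps"`
        FirstFailure string         `json:"first_failure,omitempty"`
    }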
Why: when the canvas terminal silently disconnects ("Session ended" with
no error frame — the user-reported failure mode on hongmingwang's hermes
workspace), there is no remote-readable signal of WHICH stage failed.
The ssh client's stderr lives only in the workspace-server's stdout on
the tenant CP EC2 — invisible without shell access. /terminal can't
expose stderr cleanly because it has already upgraded to WebSocket
binary frames by the time ssh runs. /terminal/diagnose stays pure
HTTP/JSON, so the same auth (WorkspaceAuth + ADMIN_TOKEN fallback) gives
operators a one-call probe that splits "IAM broke" (send-ssh-public-key
fails) from "tunnel/SG broke" (wait-for-port fails) from "sshd auth
broke" (ssh-probe gets Permission denied) from "shell broke" (probe
exits non-zero with stderr).
Stages mirrored from handleRemoteConnect in terminal.go:
1. ssh-keygen ephemeral session keypair
2. send-ssh-public-key AWS EIC API push, IAM-gated
3. pick-free-port local port for the tunnel
4. open-tunnel aws ec2-instance-connect open-tunnel start
5. wait-for-port the tunnel actually listens (folds tunnel
stderr into Detail when it doesn't)
6. ssh-probe non-interactive `ssh ... 'echo MARKER'` that
confirms auth + bash + the marker round-trip
(CombinedOutput captures stderr verbatim —
this is the whole reason the endpoint exists)
Local Docker workspaces (no instance_id) get a smaller probe:
container-found + container-running. Same response shape so callers
don't need to branch.
Tests stub sendSSHPublicKey / openTunnelCmd / sshProbeCmd via the
existing package-level vars (same pattern as TestSSHCommandCmd_*) so
the test suite stays hermetic — no AWS, no network. The three new
tests pin: (a) routing to remote on instance_id present,
(b) routing to local on empty instance_id, (c) the operationally
critical case — full success through wait-for-port then a probe
failure surfaces ssh stderr in the ssh-probe step's Error/Detail
with first_failure="ssh-probe".
Auth: rides on existing WorkspaceAuth middleware. Operators with the
tenant ADMIN_TOKEN (fetched via /cp/admin/orgs/:slug/admin-token) can
probe any workspace without per-workspace token; same admin path as
the canvas dashboard reads workspace activity.
Response always returns HTTP 200 (success or step failure are both in
the JSON body) so callers don't need to branch on status code — the
endpoint either reports a first_failure or doesn't.
Resolves task #200, supports task #193 (workspace EC2 sshd
unresponsive — without this endpoint we couldn't pin the failure
stage from outside the tenant CP EC2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-existing TestSSHCommandCmd_BuildsArgv asserts the literal argv
slice. Adding `-o ConnectTimeout=10` shifted the slice — this commit
tracks the snapshot to match. The new behavior-based
TestSSHCommandCmd_ConnectTimeoutPresent (added in the prior commit)
keeps the invariant pinned without depending on argv ordering, so
future tweaks land in only one place even if more options are added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the workspace EC2's sshd is unresponsive (mid-restart, SG drop,
AMI without ec2-instance-connect), the canvas's xterm shows the user's
typed bytes echoed back by the workspace-server's *local* PTY (cooked +
echo mode before ssh sets it raw post-handshake) and then closes
silently when Cloudflare's idle WebSocket timer fires (~100s) — with no
"Connection refused" or "Permission denied" output ever reaching the
user. This is what hongmingwang's hermes terminal looked like 2026-04-30
right after the heartbeat-fix redeploy: status="online" but the shell
appeared dead.
Caught reproducibly by holding a fresh /workspaces/<id>/terminal
WebSocket open for 60s — server sent zero frames except the local-PTY
echo of one keystroke typed at t=8s. ssh was hung at handshake; bash
never saw the byte.
Fix: add `-o ConnectTimeout=10` to ssh args. Now the failure surfaces
as a real ssh error message in the terminal within 10s, instead of
masquerading as a silently dead shell over the next ~100s. Doesn't
diagnose *why* sshd isn't responding (separate investigation), but
it does mean the user gets actionable feedback within seconds.
Behavior-based regression test asserts `-o ConnectTimeout=N` is in the
ssh argv — pins presence, not the literal value, so operators can tune
without breaking the gate. Verified to FAIL on pre-fix code (matched
the literal arg pair) and PASS on fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix Restart called provisioner.Stop / cpProv.Stop synchronously before
returning the HTTP response. CPProvisioner.Stop is DELETE /cp/workspaces/:id
→ CP → AWS EC2 terminate, which can exceed the canvas's 15s HTTP timeout,
especially right after a platform-wide redeploy when every tenant queues a
CP request at once. The user sees a misleading "signal timed out" red banner
on Save & Restart even though the async re-provision goroutine continues
and the workspace ends up online.
Caught 2026-04-30 on hongmingwang hermes workspace 32993ee7-…cb9d75d112a5
right after the heartbeat-fix platform redeploy at 02:11Z. The workspace
came back online correctly; only the canvas response timed out.
Fix moves Stop into the same goroutine as provisionWorkspaceCP /
provisionWorkspaceOpts. The handler now responds in <500ms (DB lookup +
status UPDATE only). Stop and provision keep their existing ordering
inside the goroutine. Uses context.Background() to detach from the request
lifecycle so an aborted client connection doesn't cancel the in-flight
Stop/provision pair.
Pinned by a behavior-based AST gate (workspace_restart_async_test.go):
the test parses workspace_restart.go and walks the Restart function body,
flagging any <recv>.{provisioner,cpProv}.Stop call that isn't nested in a
*ast.FuncLit. Same family as callsProvisionStart in
workspace_provision_shared_test.go. Verified the gate fails on the
pre-fix shape (flags lines 151 and 153 — the original sync Stop calls).
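The gate's core walk, sketched: it flags any Stop selector call not
nested in a func literal. The real test additionally checks the
receiver chain (provisioner/cpProv), which is omitted here.

    package handlers_test

    import "go/ast"

    func flagSyncStops(restartBody *ast.BlockStmt) []*ast.CallExpr {
        var flagged []*ast.CallExpr
        ast.Inspect(restartBody, func(n ast.Node) bool {
            if _, ok := n.(*ast.FuncLit); ok {
                return false // inside a goroutine/closure: async, exempt
            }
            if call, ok := n.(*ast.CallExpr); ok {
                if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "Stop" {
                    flagged = append(flagged, call) // sync Stop on the request path
                }
            }
            return true
        })
        return flagged
    }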
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeat now echoes the workspace's platform_inbound_secret on every
beat (mirroring /registry/register), and the molecule-mcp client
persists it to /configs/.platform_inbound_secret on receipt.
Symptom (2026-04-30, hongmingwang tenant): chat upload returned 503
"workspace will pick it up on its next heartbeat" and then 401 on
retry — permanent until workspace restart. The 503 message was a lie:
heartbeat used to discard the platform_inbound_secret entirely; only
register delivered it, and register fires once at startup.
Server (Go):
- Heartbeat handler reuses readOrLazyHealInboundSecret (the same
helper chat_files + register use), so heartbeat-time recovery
covers the rotate / mid-life NULL-column case the existing
register-time heal can't reach.
- Failure is non-fatal: liveness contract trumps secret delivery,
chat_files retries lazy-heal on its own next request.
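The server-side echo, sketched with the heal helper injected as a
seam; readOrLazyHealInboundSecret is the real helper named above, but
its signature here is an assumption.

    package handlers

    import (
        "context"
        "database/sql"
    )

    // Non-fatal by design: on heal failure the field is omitted and the
    // next beat (or chat_files itself) retries.
    func attachInboundSecret(ctx context.Context, db *sql.DB, workspaceID string,
        resp map[string]interface{},
        heal func(context.Context, *sql.DB, string) (string, error)) {
        secret, err := heal(ctx, db, workspaceID)
        if err != nil || secret == "" {
            return
        }
        resp["platform_inbound_secret"] = secret
    }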
Client (Python):
- _persist_inbound_secret_from_heartbeat parses the heartbeat 200
response and persists via platform_inbound_auth.save_inbound_secret.
- All exceptions swallowed — heartbeat liveness > secret persistence;
next tick (≤20s) retries.
Tests:
- Server: pin secret-present, lazy-heal-mint-on-NULL, and heal-
failure-omits-field branches.
- Client: pin persist-on-200, skip-on-empty, skip-on-non-dict-body,
skip-on-401, swallow-save-OSError.
The canvas tab snippet for the Universal MCP path was written before
this PR added the built-in register + heartbeat thread. Earlier wording
described it as "outbound-only — pair with the Claude Code or Python SDK
tab for heartbeat + inbound messages" — that's stale. molecule-mcp now
handles register + heartbeat itself; the only thing it doesn't yet do is
inbound A2A delivery.
Updated:
- externalUniversalMcpTemplate header comment + body — describes
standalone behavior, points operators at SDK/channel only when they
need INBOUND (not heartbeat).
- Drops the now-redundant curl-register step from the snippet — the
binary registers itself on startup.
- Canvas modal label likewise updated.
No runtime / behavior change; pure docs polish so a copy-pasting
operator's mental model matches what the binary actually does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two paired fixes that together let an external operator run a single
process (molecule-mcp) and see their workspace come up online in the
canvas — the bug surfaced live when status stuck at "awaiting_agent /
OFFLINE" despite an active MCP server.
Platform side (workspace-server/internal/handlers/registry.go):
Heartbeat handler already auto-recovers offline → online and
provisioning → online, but NOT awaiting_agent → online. Healthsweep
flips stale-heartbeat external workspaces TO awaiting_agent, and
with no recovery path the workspace stays "OFFLINE — Restart" in the
canvas forever. Add the symmetric branch: if currentStatus ==
"awaiting_agent" and a heartbeat arrives, flip to online + broadcast
WORKSPACE_ONLINE. Mirrors the existing offline/provisioning patterns
exactly. Test: TestHeartbeatHandler_AwaitingAgentToOnline asserts
the SQL UPDATE fires with the awaiting_agent guard clause.
Wheel side (workspace/mcp_cli.py):
molecule-mcp was outbound-only — operators had to run a separate
SDK process to register + heartbeat. Now mcp_cli.main():
1. Calls /registry/register at startup (idempotent upsert flips
status awaiting_agent → online via the existing register path).
2. Spawns a daemon thread that POSTs /registry/heartbeat every
20s. 20s is comfortably under the healthsweep stale window so
a single missed beat doesn't cause status churn.
3. Runs the MCP stdio loop in the foreground.
Both calls set Origin: ${PLATFORM_URL} so the SaaS edge WAF accepts
them. Threaded heartbeat (not asyncio) chosen because it doesn't
need to share an event loop with the MCP stdio server — daemon=True
cleanly dies when the operator's runtime exits.
MOLECULE_MCP_DISABLE_HEARTBEAT=1 escape hatch lets in-container
callers (which have heartbeat.py running already) reuse the entry
point without double-heartbeating. Default is enabled.
End-to-end verification (live, against
hongmingwang.moleculesai.app, workspace 8dad3e29-...):
pre-fix: status=awaiting_agent → canvas shows OFFLINE forever
post-fix: ran `molecule-mcp` for 5s standalone → canvas state:
status=online runtime=external agent=molecule-mcp-8dad3e29
Test coverage: 7 new mcp_cli tests (register-at-startup, heartbeat-
thread-spawned, disable-env-skips-both, env-and-file token resolution,
register payload shape, heartbeat endpoint + headers); 1 new platform
test (awaiting_agent → online recovery). Full workspace + handlers
suites green: 1355 Python, full Go handlers passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Connect your external agent" dialog already covered Claude Code,
Python SDK, curl, and raw fields. This adds a Universal MCP tab that
documents the new `molecule-mcp` console script — the runtime-
agnostic baseline shipped by PR #2413's workspace-runtime changes.
Surface area:
- New `externalUniversalMcpTemplate` constant in workspace-server.
Three-step snippet: pip install runtime → one-shot register via curl
→ wire molecule-mcp into agent's MCP config (Claude Code example,
notes that hermes/codex/etc. take the same env-var contract).
- Workspace create response now includes `universal_mcp_snippet`
alongside the existing curl/python/channel snippets.
- Canvas modal renders the tab when `universal_mcp_snippet` is
present; backward-compatible with older platform builds (tab hides
when empty).
Origin/WAF coverage (the user explicitly asked for this):
- The runtime wheel handles Origin automatically (this PR's earlier
commit on platform_auth.auth_headers).
- The curl tab now sets `Origin: {{PLATFORM_URL}}` preemptively
with an explanatory comment; `/registry/register` is currently
WAF-allowed without it but adding now keeps the snippet working
if WAF rules expand. The comment also explains why
`/workspaces/*` paths return empty 404 without Origin — the
exact failure mode I hit while smoke-testing this PR live.
- The MCP snippet's footer notes that the wheel auto-handles
Origin so operators don't think about it.
End-to-end verification (against live tenant
hongmingwang.moleculesai.app, freshly registered workspace):
- get_workspace_info → full JSON
- list_peers → "Claude Code Agent (ID: 97ac32e9..., status: online)"
- recall_memory → "No memories found."
all returned by the molecule-mcp binary speaking MCP stdio to
this Claude Code session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST /workspaces/:id/restart on a runtime=external workspace ran the full
re-provision pipeline (Stop → provisionWorkspace*), which calls
issueAndInjectToken → RevokeAllForWorkspace. For external workspaces
(operator-driven, no container/EC2) that silently destroyed the operator's
local bearer token on every "Restart" click in the canvas — the local
poller would then 401-spam against /activity until the operator manually
regenerated from the Tokens tab.
The auto-restart path (runRestartCycle, line 436) already short-circuits
runtime=external. This patch mirrors that for the manual handler so the
two paths agree, and surfaces a 200 OK with a clear message so the
canvas can tell the operator the fix is on their side rather than
silently no-op'ing.
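Sketch of the short-circuit (variable and response copy illustrative;
the real message just has to tell the canvas the fix is operator-side):

    // Fires before any DB write, token revoke, or provision dispatch.
    if runtime == "external" {
        c.JSON(http.StatusOK, gin.H{
            "status":  "noop",
            "message": "external runtime: restart the agent on the operator side",
        })
        return
    }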
Test coverage: TestRestartHandler_ExternalRuntimeNoOps asserts the
short-circuit fires *before* any DB write or provision call. sqlmock's
"unexpected query" failure mode would catch a regression that
re-introduced the token revoke or the status=provisioning UPDATE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /buildinfo route added in #2398 to verify each tenant runs the
published SHA was 404'd by TenantGuard on every production tenant —
the allowlist had /health, /metrics, /registry/register,
/registry/heartbeat, but not /buildinfo. The redeploy workflows
curl /buildinfo from a CI runner with no X-Molecule-Org-Id header,
TenantGuard 404'd them, gin's NoRoute proxied to canvas, canvas
returned its HTML 404 page, jq read empty git_sha, and the verifier
silently soft-warned every tenant as "unreachable" — which the
workflow doesn't fail on.
Confirmed externally:
curl https://hongmingwang.moleculesai.app/buildinfo
→ HTTP 404 + Content-Type: text/html (Next.js "404: This page
could not be found.") even though /health on the same host
returns {"status":"ok"} from gin.
The buildinfo package's own doc already declares /buildinfo public
by design ("Public is intentional: it's a build identifier, not
operational state. The same string is already published as
org.opencontainers.image.revision on the container image, so no new
info is exposed.") — the allowlist just missed it.
Pin the alignment in tenant_guard_test.go:
TestTenantGuard_AllowlistBypassesCheck now asserts /buildinfo
returns 200 without an org header alongside /health and /metrics,
so a future allowlist edit can't silently regress the verifier
again.
Closes the silent-success failure mode: stale tenants will now
show up as STALE (hard-fail) rather than UNREACHABLE (soft-warn).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on
`prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker
provisioner is never initialized — only `cpProv`. So the goroutine
never started, and `sweepStaleRemoteWorkspaces` (which transitions
runtime='external' workspaces from 'online' to 'awaiting_agent' when
their last_heartbeat_at goes stale) never ran.
Net effect on production: every external-runtime workspace on SaaS
that lost its agent stayed 'online' indefinitely instead of falling
back to 'awaiting_agent' (re-registrable). The drift gate (#2388)
caught the migration side and #2382 fixed the SQL writes, but this
orchestration-side gate slipped through both because there was no
SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent
transition.
Caught by #2392 (live staging external-runtime regression E2E)
failing at step 6 — 180s with no heartbeat, expected
status=awaiting_agent, got online.
Fix: drop the `if prov != nil` gate. `StartHealthSweep` already
handles nil checker correctly (healthsweep.go:50-71): the Docker
sweep is gated inside the loop, the remote sweep always runs. Test
coverage already exists at TestStartHealthSweep_NilCheckerRunsRemoteSweep.
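For reference, the shape healthsweep.go already has per the cited
lines — a hedged sketch, names assumed:

    func StartHealthSweep(ctx context.Context, checker ContainerChecker, interval time.Duration) {
        go func() {
            t := time.NewTicker(interval)
            defer t.Stop()
            for {
                select {
                case <-ctx.Done():
                    return
                case <-t.C:
                    if checker != nil { // Docker sweep gated inside the loop
                        sweepOnlineWorkspaces(ctx, checker)
                    }
                    sweepStaleRemoteWorkspaces(ctx) // remote sweep always runs
                }
            }
        }()
    }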
After this lands and tenants redeploy, #2392 step 6 passes and the
regression coverage closes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap that let issue #2395 ship: redeploy-fleet workflows reported
ssm_status=Success based on SSM RPC return code alone, while EC2 tenants
silently kept serving the previous :latest digest because docker compose up
without an explicit pull is a no-op when the local tag already exists.
Wire:
- new buildinfo package exposes GitSHA, set at link time via -ldflags from
the GIT_SHA build-arg (default "dev" so test runs without ldflags fail
closed against an unset deploy)
- router exposes GET /buildinfo returning {git_sha} — public, no auth,
cheap enough to curl from CI for every tenant (sketched below)
- both Dockerfiles thread GIT_SHA into the Go build
- publish-workspace-server-image.yml passes GIT_SHA=github.sha for both
images
- redeploy-tenants-on-main.yml + redeploy-tenants-on-staging.yml curl each
tenant's /buildinfo after the redeploy SSM RPC and fail the workflow on
digest mismatch; staging treats both :latest and :staging-latest as
moving tags; verification is skipped only when an operator pinned a
specific tag via workflow_dispatch
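Sketch of the first two wire items (module path in the ldflags flag is
illustrative):

    // internal/buildinfo/buildinfo.go
    package buildinfo

    // GitSHA is stamped at link time:
    //   go build -ldflags "-X <module>/internal/buildinfo.GitSHA=${GIT_SHA}"
    // The "dev" default fails closed: a binary built without ldflags can
    // never match a published digest during post-redeploy verification.
    var GitSHA = "dev"

    // router wiring: public, no auth, cheap to curl from CI per tenant.
    r.GET("/buildinfo", func(c *gin.Context) {
        c.JSON(http.StatusOK, gin.H{"git_sha": buildinfo.GitSHA})
    })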
Tests:
- TestGitSHA_DefaultDevSentinel pins the dev default
- TestBuildInfoEndpoint_ReturnsGitSHA pins the wire shape that the
workflow's jq lookup depends on
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related fixes after self-review of #2396:
1. workspace_bootstrap.go:62 — `SET status = 'failed'` was missed in the
initial sweep. Now parameterized as $3 with models.StatusFailed.
Test fixed with the additional WithArgs sentinel.
2. Drift gate now scans production .go AST for hard-coded
`UPDATE workspaces … SET status = '<literal>'` and fails with
file:line. This catches the kind of miss the first commit just
fixed — the original migration-vs-codebase axis only verified
AllWorkspaceStatuses ⊆ enum, not "no raw literals in writes."
Verified the gate fires: dropped a synthetic 'failed' literal into
internal/handlers/_drift_sanity.go and confirmed the gate flagged
"internal/handlers/_drift_sanity.go:6 → SET status = 'failed'".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals
from production status writes. Adds models.WorkspaceStatus typed alias and
models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET
status = ... now passes a parameterized $N typed value rather than a
hard-coded SQL literal.
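Sketch of the new models surface (const set abridged to the statuses
named in this log):

    type WorkspaceStatus string

    const (
        StatusProvisioning  WorkspaceStatus = "provisioning"
        StatusOnline        WorkspaceStatus = "online"
        StatusOffline       WorkspaceStatus = "offline"
        StatusAwaitingAgent WorkspaceStatus = "awaiting_agent"
        StatusHibernating   WorkspaceStatus = "hibernating"
        StatusFailed        WorkspaceStatus = "failed"
        // ...remaining statuses elided
    )

    // AllWorkspaceStatuses is the canonical slice the drift gate checks
    // against the migration enum.
    var AllWorkspaceStatuses = []WorkspaceStatus{
        StatusProvisioning, StatusOnline, StatusOffline,
        StatusAwaitingAgent, StatusHibernating, StatusFailed,
    }

A write site then passes the typed value as a bind parameter (e.g.
models.StatusFailed as $3) instead of baking the literal into the SQL
string.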
Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum
type was missing 'awaiting_agent' + 'hibernating' for ~5 days because
sqlmock regex matching cannot enforce live enum constraints. The drift
gate is now a proper Go AST + SQL parser (no regex), asserting the
codebase ⊆ migration enum and every const appears in the canonical
slice. With status as a parameterized typed value, future enum mismatches
fail at the SQL layer in tests, not silently in prod.
Test coverage: full suite passes with -race; drift gate green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric with the existing CPProvisionerAPI interface. Closes the
asymmetry where the SaaS provisioner field was an interface (mockable
in tests) but the Docker provisioner field was a concrete pointer
(not).
## Changes
- New ``provisioner.LocalProvisionerAPI`` interface — the 7 methods
WorkspaceHandler / TeamHandler call on h.provisioner today: Start,
Stop, IsRunning, ExecRead, RemoveVolume, VolumeHasFile,
WriteAuthTokenToVolume. Compile-time assertion confirms *Provisioner
satisfies it. Mirror of cp_provisioner.go's CPProvisionerAPI block.
- ``WorkspaceHandler.provisioner`` and ``TeamHandler.provisioner``
re-typed from ``*provisioner.Provisioner`` to
``provisioner.LocalProvisionerAPI``. Constructor parameter type is
unchanged — the assignment widens to the interface, so the 200+
callers of ``NewWorkspaceHandler`` / ``NewTeamHandler`` are
unaffected.
- Constructors gain a ``if p != nil`` guard before assigning to the
interface field. Without this, ``NewWorkspaceHandler(..., nil, ...)``
(the test fixture pattern across 200+ tests) yields a typed-nil
interface value where ``h.provisioner != nil`` evaluates *true*,
and the SaaS-vs-Docker fork incorrectly routes nil-fixture tests
into the Docker code path (sketched below). Documented inline with
reference to the Go FAQ.
- Hardened the 5 Provisioner methods that lacked nil-receiver guards
(Start, ExecRead, WriteAuthTokenToVolume, RemoveVolume,
VolumeHasFile) — return ErrNoBackend on nil receiver instead of
panicking on p.cli dereference. Symmetric with Stop/IsRunning
(already hardened in #1813). Defensive cleanup so a future caller
that bypasses the constructor's nil-elision still degrades
cleanly.
- Extended TestZeroValuedBackends_NoPanic with 5 new sub-tests
covering the newly-hardened nil-receiver paths. Defense-in-depth:
a future refactor that drops one of the nil-checks fails red here
before reaching production.
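The typed-nil guard from the constructor bullet, sketched (constructor
body abridged):

    // Assigning a nil *provisioner.Provisioner straight into the
    // LocalProvisionerAPI field stores a typed nil — an interface holding
    // (type=*Provisioner, value=nil) — so `h.provisioner != nil` evaluates
    // true (see the Go FAQ on nil errors). The guard keeps the field a
    // true nil interface for the nil-fixture tests.
    func NewWorkspaceHandler(p *provisioner.Provisioner /* ... */) *WorkspaceHandler {
        h := &WorkspaceHandler{ /* ... */ }
        if p != nil {
            h.provisioner = p
        }
        return h
    }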
## Why now
- Provisioner orchestration has been touched in #2366 / #2368 — the
interface symmetry is the natural follow-up captured in #2369.
- Future work (CP fleet redeploy endpoint, multi-backend
provisioners) wants this in place. Memory note
``project_provisioner_abstraction.md`` calls out pluggable
backends as a north-star.
- Memory note ``feedback_long_term_robust_automated.md`` —
compile-time gates + ErrNoBackend symmetry > runtime panics.
## Verification
- ``go build ./...`` clean.
- ``go test ./...`` clean — 1300+ tests pass, including the
previously-flaky Create-with-nil-provisioner paths that now
exercise the constructor's nil-elision correctly.
- ``go test ./internal/provisioner/ -run TestZeroValuedBackends_NoPanic
-v`` — all 11 nil-receiver subtests green (was 6, +5 for the
newly-hardened methods).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration 043 (2026-04-25) introduced the workspace_status enum but
omitted two values application code had been writing for days, so every
UPDATE that tried to write either value failed silently in production:
'awaiting_agent' (since 2026-04-24, commit 1e8b5e01):
- handlers/workspace.go:333 — external workspace pre-register
- handlers/registry.go (via PR #2382) — liveness offline transition
- registry/healthsweep.go (via PR #2382) — heartbeat-staleness sweep
'hibernating' (since hibernation feature shipped):
- handlers/workspace_restart.go:271 — DB-level claim before stop
All four sites swallowed the enum-cast error. User-visible impact:
external workspaces never transition to a stale state when their agent
disconnects (canvas shows them stuck on 'online'/'degraded' indefinitely),
new external workspaces never advance past 'provisioning', and idle
workspaces never auto-hibernate (resources held forever).
PR #2382 didn't *cause* this — it inherited the gap and added two more
silent-fail paths on top. The pre-existing two had been broken for five
days and went unnoticed because:
1. sqlmock matches SQL by regex, not against the live enum constraint.
Every test passed despite the prod-only failure.
2. The handlers either drop the Exec error entirely (workspace.go:333)
or log+continue without an alert (the other three).
Fix in three pieces:
1. migrations/046_*.up.sql — ALTER TYPE workspace_status ADD VALUE
'awaiting_agent', 'hibernating'. IF NOT EXISTS makes it idempotent
across re-runs (RunMigrations re-applies until schema_migrations
records the file). ALTER TYPE ADD VALUE doesn't take a heavy lock
and commits immediately, safe under live traffic.
2. migrations/046_*.down.sql — full rename → recreate → cast → drop
recipe. Postgres has no DROP VALUE so this is the only honest
rollback. Pre-flights existing rows to compatible values
(awaiting_agent → offline, hibernating → hibernated) before the
type swap.
3. internal/db/workspace_status_enum_drift_test.go — static gate that
parses every UPDATE/INSERT against `workspaces` in workspace-server/
internal/, extracts every status literal, and asserts each is in
the enum union (CREATE TYPE + every ALTER TYPE ADD VALUE). The gate
runs in unit tests, no DB required, and would have caught both
omissions on the day they shipped. Pattern matches
feedback_behavior_based_ast_gates and feedback_mock_at_drifting_layer.
Verification:
- go test ./internal/db/ -count=1 -race ✓
- go vet ./... ✓
- Drift gate flips red if I delete either ADD VALUE from the migration
(validated via local mutation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "do request → check err → defer close → forward headers → set
status → io.Copy → log mid-stream errors" tail was duplicated between
Upload and Download. Each handler had ~12 lines that differed only in:
- the op label in log messages ("upload" vs "download")
- the set of response headers to forward verbatim
(Upload: Content-Type only; Download: Content-Type +
Content-Length + Content-Disposition)
Hoist into ChatFilesHandler.streamWorkspaceResponse(c, op,
workspaceID, forwardURL, req, forwardHeaders). Each call site
reduces to one line. Future changes — request-id forwarding,
observability metric, response-size cap, bytes-streamed log —
go in ONE place rather than two.
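Hedged sketch of the hoisted tail (field names like h.client assumed;
the signature is the one from the call sites above):

    func (h *ChatFilesHandler) streamWorkspaceResponse(c *gin.Context, op, workspaceID,
        forwardURL string, req *http.Request, forwardHeaders []string) {
        resp, err := h.client.Do(req)
        if err != nil {
            log.Printf("chat_files %s: forward to %s failed: %v", op, forwardURL, err)
            c.JSON(http.StatusBadGateway, gin.H{"error": "workspace unreachable"})
            return
        }
        defer resp.Body.Close()
        for _, name := range forwardHeaders { // per-op header set, see above
            if v := resp.Header.Get(name); v != "" {
                c.Header(name, v)
            }
        }
        c.Status(resp.StatusCode)
        if _, err := io.Copy(c.Writer, resp.Body); err != nil {
            log.Printf("chat_files %s: mid-stream copy error for %s: %v", op, workspaceID, err)
        }
    }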
Same drift-prevention rationale as resolveWorkspaceForwardCreds
(#2372) and readOrLazyHealInboundSecret (#2376), applied to the
response-streaming layer of the same handlers.
Behavior preserved: existing TestChatUpload_* and TestChatDownload_*
integration tests (8 across both handlers) all pass unchanged. The
log message format is consistent across both handlers now (single
"chat_files {op}: ..." string template) — operators can grep one
prefix for both features instead of separate prefixes per handler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paired molecule-core change for the molecule-cli `molecule connect`
RFC (https://github.com/Molecule-AI/molecule-cli/issues/10).
After this PR an `external`-runtime workspace's full lifecycle
matches the operator-driven model: it boots in awaiting_agent, the
CLI connects in poll mode without operator-side flag tuning, the
heartbeat-loss path lands back on awaiting_agent (re-registrable)
instead of the terminal-feeling 'offline'.
Two changes in workspace-server:
1) `resolveDeliveryMode` (registry.go) now reads `runtime` alongside
`delivery_mode`. Resolution order (both changes sketched after this
list):
a. payload.delivery_mode if non-empty (operator override)
b. row's existing delivery_mode if non-empty (preserves prior
registration)
c. **NEW:** "poll" if row.runtime = "external" — external
operators run on laptops without public HTTPS; push-mode
would hard-fail at validateAgentURL anyway. (`molecule connect`
registers without --mode and expects this default.)
d. "push" otherwise (historical default for platform-managed
runtimes — langgraph, hermes, claude-code, etc.)
2) Heartbeat-loss for external workspaces lands them in
`awaiting_agent` instead of `offline`. Two code paths:
- `liveness.go` — Redis TTL expiration. Uses a CASE expression
so the conditional is one UPDATE (no extra round-trip for
non-external runtimes, no TOCTOU between runtime read and
status write).
- `healthsweep.go::sweepStaleRemoteWorkspaces` — DB-side
last_heartbeat_at age scan. This sweep is already external-
only by query filter, so the UPDATE just hard-codes the new
status.
The Docker-side `sweepOnlineWorkspaces` keeps `offline` —
recovery there is "restart the container", not "re-register from
the operator's box".
Why awaiting_agent over offline for external:
- Matches the status the workspace was created in (workspace.go:333).
- The CLI re-registers on every invocation; awaiting_agent → online
is the natural transition. offline is a terminal-feeling status
that implies operator intervention is needed.
- An operator who closed their laptop overnight should see
awaiting_agent in canvas, not 'offline (something is wrong)'.
Test plan:
- Existing: 9 `resolveDeliveryMode` test sites updated to the new
query shape. Sqlmock now reads `delivery_mode, runtime` columns.
- New: TestRegister_ExternalRuntime_DefaultsToPoll asserts the
external→poll branch. TestRegister_NonExternalRuntime_StillDefaultsToPush
guards against the new branch overshooting (langgraph keeps push).
- Liveness: regex updated to match the CASE expression.
- Healthsweep: `TestSweepStaleRemoteWorkspaces_MarksStaleAwaitingAgent`
(renamed for grep-ability), Docker-side sweepOnlineWorkspaces test
unchanged (verified to still match `'offline'`).
- Full handlers + registry suite green under -race (12.873s + 2.264s).
No migration needed — `status` is a free-form text column; both
'offline' and 'awaiting_agent' are existing values used elsewhere
(workspace.go uses awaiting_agent on initial external creation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The helper landed in #2376 and is exercised via chat_files + registry
integration tests. Those tests conflate the helper's behavior with the
caller's response shape — a future refactor that broke the (secret,
healed, err) contract subtly (e.g. returning healed=true on a
read-success path, or swallowing a mint error) might still pass them.
Adds 4 direct sub-tests pinning each branch of the contract:
- secret already present → (s, false, nil)
- secret missing, mint succeeds → (minted, true, nil)
- secret missing, mint fails → ("", false, err)
- read fails (non-NoInboundSecret) → ("", false, err)
Each sub-case asserts the return tuple shape AND mock.ExpectationsWereMet
(for the success path) so a future helper change that skips a DB op
trips the gate immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lazy-heal-on-miss pattern landed in two places this session:
PR #2372 (chat_files.go::resolveWorkspaceForwardCreds — Upload + Download)
and PR #2375 (registry.go::Register). Both implementations did the same
thing:
read → if ErrNoInboundSecret then mint inline → return outcome
Different response-shape requirements but the same core mechanic. Three
sites' worth of drift potential: any future heal-time condition we add
(audit log, alert, secret rotation, observability) had to be applied to
each site, with partial application silently re-opening the gap.
Fix: extract readOrLazyHealInboundSecret in workspace_provision_shared.go
returning (secret, healed, err). Each caller maps the outcome to its
response shape:
- chat_files: healed=true → 503 with retry hint; err != nil → 503 with
RFC-#2312 reprovision hint
- registry: healed=true|false + err==nil → include in response;
err != nil → omit field (workspace can retry on next register)
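Hedged sketch of the extracted helper (signatures assumed; the real
read/mint calls carry the handler's DB plumbing):

    func readOrLazyHealInboundSecret(ctx context.Context, workspaceID string) (secret string, healed bool, err error) {
        secret, err = wsauth.ReadPlatformInboundSecret(ctx, workspaceID)
        if err == nil {
            return secret, false, nil // already present
        }
        if !errors.Is(err, wsauth.ErrNoInboundSecret) {
            return "", false, err // read failure, not a missing-secret miss
        }
        secret, err = wsauth.IssuePlatformInboundSecret(ctx, workspaceID) // mint inline
        if err != nil {
            return "", false, err // mint failed: caller picks the response shape
        }
        return secret, true, nil
    }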
Net effect:
- Single source of truth for the read+heal mechanic
- Response-shape decisions stay in callers (they DO differ per feature)
- Future heal-time conditions go in one place
- Behavior preserved: existing TestRegister_NoInboundSecret_LazyHeals,
TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField,
TestChatUpload_NoInboundSecret_LazyHeal*,
TestChatDownload_NoInboundSecret_LazyHeal* all pass unchanged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: a legacy SaaS workspace with NULL platform_inbound_secret
needed two round-trips before chat upload worked:
1. Workspace registers → response missing platform_inbound_secret
2. User attempts chat upload → chat_files lazy-heals platform-side
(RFC #2312 backfill) → 503 + retry-after
3. Workspace heartbeats → register response now includes the
freshly-minted secret → workspace writes /configs/.platform_inbound_secret
4. User retries chat upload → workspace bearer matches → 200
The platform-side lazy-heal in chat_files.go (#2366) closes the
existing-workspace gap, but the user-visible round-trip dance is
still ugly.
Fix: lazy-heal at register time too. When ReadPlatformInboundSecret
returns ErrNoInboundSecret, mint inline and include the freshly-
minted secret in the register response. Collapses the dance to a
single round-trip:
1. Workspace registers → response includes lazy-healed secret
2. User attempts chat upload → workspace bearer matches → 200
Failure model: best-effort. Mint failure logs and falls through to
omitting the field (workspace will retry on next register call).
The 200 response status is preserved — register success doesn't
hinge on the inbound-secret heal.
Tests:
- TestRegister_NoInboundSecret_LazyHeals: pins the success branch.
Mocks the UPDATE explicitly + asserts ExpectationsWereMet, so a
regression that skipped the mint would fail loudly. Replaces
the prior TestRegister_NoInboundSecret_OmitsField which
"passed" on this branch only because sqlmock-unmatched-UPDATE
coincidentally drove the omit-field error path.
- TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField:
pins the failure branch — explicit UPDATE error → 200 + field
absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ValidateToken, WorkspaceFromToken, and ValidateAnyToken each duplicated
the same JOIN+WHERE auth predicate:
FROM workspace_auth_tokens t
JOIN workspaces w ON w.id = t.workspace_id
WHERE t.token_hash = $1
AND t.revoked_at IS NULL
AND w.status != 'removed'
Same drift class as the SaaS provision-mint bug fixed in #2366. A
future safety addition (e.g. exclude paused workspaces from auth) had
to be applied to all three queries; a partial application would
silently re-open one auth path while closing the others.
Fix: hoist the predicate into lookupTokenByHash, which projects
(id, workspace_id) — the union of fields any caller needs. Each
public function picks what it uses:
- ValidateToken — needs both (compares workspaceID, updates last_used_at by id)
- WorkspaceFromToken — needs workspace_id
- ValidateAnyToken — needs id
The trivial perf cost of selecting one extra column per call is worth
the single-source-of-truth guarantee for the auth predicate.
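Sketch of the hoisted lookup (signature assumed; the predicate is
verbatim from above):

    func lookupTokenByHash(ctx context.Context, db *sql.DB, tokenHash string) (tokenID, workspaceID string, err error) {
        err = db.QueryRowContext(ctx, `
            SELECT t.id, t.workspace_id
              FROM workspace_auth_tokens t
              JOIN workspaces w ON w.id = t.workspace_id
             WHERE t.token_hash = $1
               AND t.revoked_at IS NULL
               AND w.status != 'removed'`,
            tokenHash).Scan(&tokenID, &workspaceID)
        return tokenID, workspaceID, err
    }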
Test mock updates: two upstream test files (a2a_proxy_test, middleware
wsauth_middleware_test{,_canvasorbearer_test}) had hand-typed regex
matchers and row shapes pinned to the per-function SELECT projection.
Updated to the unified shape; behavior is unchanged.
All wsauth + middleware + handlers + full-module tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The admin test-token endpoint has a critical security check at
admin_test_token.go:64-72 — the IDOR fix from #112 that requires an
explicit ADMIN_TOKEN bearer when the env var is set. Pre-fix, the
route accepted ANY bearer that matched a live org token, allowing
cross-org test-token minting (and therefore cross-org workspace
authentication). The current code uses subtle.ConstantTimeCompare
against ADMIN_TOKEN.
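The gate, roughly (header parsing abridged):

    adminToken := os.Getenv("ADMIN_TOKEN")
    if adminToken != "" { // gate active only when the env var is set
        bearer := strings.TrimPrefix(c.GetHeader("Authorization"), "Bearer ")
        // Constant-time compare: no timing oracle on the token bytes.
        if subtle.ConstantTimeCompare([]byte(bearer), []byte(adminToken)) != 1 {
            c.AbortWithStatus(http.StatusUnauthorized)
            return
        }
    }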
Test coverage was zero. The existing tests exercised the
ADMIN_TOKEN-unset path (local dev / CI) but never set ADMIN_TOKEN.
A regression that:
- removed the os.Getenv("ADMIN_TOKEN") check
- inverted the comparison
- replaced ConstantTimeCompare with bytes.Equal (timing leak)
- re-introduced the AdminAuth fallback that allows org tokens
would not fail any test, and the breakage would re-open the IDOR
that #112 closed.
Adds four tests covering the gate matrix:
- ADMIN_TOKEN set + no Authorization header → 401
- ADMIN_TOKEN set + wrong Authorization → 401
- ADMIN_TOKEN set + correct Authorization → 200
- ADMIN_TOKEN unset + no Authorization → 200 (gate bypassed safely)
The 4-row matrix pins the gate's full truth table: any regression in
either dimension (gate enabled/disabled, header correct/wrong) trips
exactly one test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 50-line "resolve URL + read inbound secret + lazy-heal on miss"
block was duplicated nearly verbatim between Upload and Download
handlers. Drift-prone — same class of risk as the original SaaS
provision drift fixed in #2366. A future change like:
- secret rotation (re-mint when the row's older than X)
- per-feature audit logging
- additional fail-closed conditions
would have to be applied to both handlers, and a partial application
that healed Upload but skipped Download would surface only at runtime.
Fix: hoist the shared logic into resolveWorkspaceForwardCreds. The
function takes an op label ("upload"/"download") used in log messages
+ the 503 RFC-#2312 detail copy so operators can still distinguish
which feature ran. Both handlers reduce to:
wsURL, secret, ok := resolveWorkspaceForwardCreds(c, ctx, workspaceID, "upload")
if !ok { return }
Net -20 lines (helper amortizes the 50-line block across both call
sites). Existing test coverage (TestChatUpload_NoInboundSecret_*,
TestChatDownload_NoInboundSecret_* from PR #2370) covers all four
branches of the shared helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2367 moved PARENT_ID env injection from inline TeamHandler.Expand
into the shared prepareProvisionContext (sourced from
payload.ParentID). The test was missing — a regression that:
- dropped the injection
- inverted the nil-check
- leaked an empty PARENT_ID="" into env
would not fail any existing test, but workspace/coordinator.py reads
PARENT_ID on startup to track parent-child relationship, so the
breakage would surface only at runtime.
Adds TestPrepareProvisionContext_ParentIDInjection with three
sub-cases:
- nil ParentID → no PARENT_ID env
- empty-string ParentID → no PARENT_ID env (don't pollute)
- set ParentID → PARENT_ID env equals value
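The injection those sub-cases pin, as a sketch (field names assumed):

    // nil → no PARENT_ID; "" → no PARENT_ID (don't pollute); set → injected.
    if payload.ParentID != nil && *payload.ParentID != "" {
        env["PARENT_ID"] = *payload.ParentID
    }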
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-04-30 lazy-heal fix in chat_files.go (PR #2366) ATTEMPTS to
mint platform_inbound_secret on miss so legacy workspaces self-heal
without requiring destructive reprovision. The pre-existing
TestChatUpload_NoInboundSecret + TestChatDownload_NoInboundSecret
tests asserted the 503 response shape but did NOT pin that the mint
UPDATE actually fires — they happened to exercise the mint-failure
branch (sqlmock unmatched UPDATE = error = "Failed to mint" code path
returns 503 with "RFC #2312" detail, which still passed the original
assertions).
This means a regression that:
- skipped the lazy-heal mint entirely
- inverted the success/failure response branches
- moved the mint to a different code path
would not fail those tests.
Fix:
- TestChatUpload_NoInboundSecret_LazyHeal: mock the UPDATE
successfully; assert sqlmock.ExpectationsWereMet (mint MUST run)
+ body contains "retry" + "30" (success branch).
- TestChatUpload_NoInboundSecret_LazyHealFailure: mock the UPDATE
to fail; assert body contains "Reprovision" (failure branch).
- Same pair for the Download handler — independent code path means
independent test.
Pins both branches of both handlers (4 tests) so future drift trips
the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2367.
TeamHandler.Expand provisioned child workspaces by directly calling
h.provisioner.Start, skipping mintWorkspaceSecrets and every other
preflight (secrets load, env mutators, identity injection, missing-env,
empty-config-volume auto-recover). Children shipped with NULL
platform_inbound_secret + never-issued auth_token — same drift class as
the SaaS bug just fixed in PR #2366, found while exercising a stronger
gate against this package.
Fix:
- TeamHandler now holds *WorkspaceHandler. Expand delegates each child
provision to wh.provisionWorkspace, picking up the shared
prepare/mint/preflight pipeline automatically. Future provision-time
steps go in ONE place and team-expand inherits them.
- prepareProvisionContext gains PARENT_ID env injection sourced from
payload.ParentID (which Expand now populates). This preserves the
signal workspace/coordinator.py reads on startup, without threading
env through provisioner.WorkspaceConfig manually.
- NewTeamHandler signature gains *WorkspaceHandler; router passes it.
Gate upgrade:
- TestProvisionFunctions_AllCallMintWorkspaceSecrets is now
behavior-based: it walks every FuncDecl in the package and flags any
function that calls h.provisioner.Start or h.cpProv.Start without
also calling mintWorkspaceSecrets. Drift-resistant by construction —
a future provision function with any name still trips the gate.
- Replaces the name-list version from PR #2366. The name list missed
Expand precisely because Expand wasn't named provision*; the
behavior-based detector caught it spontaneously when prototyped.
Tests: full workspace-server module green; gate previously verified to
fire red on Expand pre-fix and on deliberate mintWorkspaceSecrets
removal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of 2026-04-30 silent-503 chat-upload bug: provisionWorkspaceCP
(SaaS) skipped issueAndInjectInboundSecret while provisionWorkspaceOpts
(Docker) called it. Every prod SaaS workspace provisioned with NULL
platform_inbound_secret → upload returned 503 with the v2-enrollment
message on every attempt.
Structural fix:
- Extract prepareProvisionContext (secrets load, env mutators, preflight,
cfg build), mintWorkspaceSecrets (auth_token + platform_inbound_secret),
markProvisionFailed (broadcast + DB update) into workspace_provision_shared.go
- Refactor both provision modes to call the shared helpers
- Add provisionAbort struct so the missing-env failure class can carry its
structured "missing" payload through the shared abort path
- Unify last_sample_error: previously the decrypt-fail path skipped it while
others set it; users now see every failure class in the UI
Drift prevention:
- AST gate TestProvisionFunctions_AllCallMintWorkspaceSecrets asserts every
function in the provisionFunctions set calls mintWorkspaceSecrets at least
once (same shape as the audit-coverage gate from #335). New provision paths
must either call mint or be added to provisionExemptFunctions with a
one-line justification
- Behavioral test TestMintWorkspaceSecrets_PersistsInboundSecretInSaaSMode
pins the contract: SaaS mode MUST persist platform_inbound_secret to the DB
column even though it skips file injection
Existing-workspace recovery (chat_files.go lazy-heal):
- Upload + Download handlers detect NULL platform_inbound_secret and call
IssuePlatformInboundSecret inline, returning 503 with retry_after_seconds=30
- Self-heals workspaces that were provisioned before this fix without
requiring destructive reprovision
Tests: full handlers + workspace-server module green; AST gate verified to
fire red on deliberate violation (commented-out mint call surfaces the
exact function name + actionable remediation message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review on PR #2362 caught: the dead-agent classifier at
a2a_proxy.go included 502/503/504/524 but missed the rest of the CF
origin-failure family (521/522/523), which are MORE indicative of a
dead EC2 than 524:
- 521 "Web server is down" — CF can't open TCP to origin (most direct
dead-EC2 signal; fires when the workspace EC2 has been terminated
and CF still has the CNAME pointing at it).
- 522 "Connection timed out" — TCP didn't complete in ~15s (typical
of SG/NACL flap or agent process hung on accept).
- 523 "Origin is unreachable" — CF can't route to origin (DNS gone,
network path broken).
Pre-fix any of these would propagate as-is to the canvas and the user
would see a 5xx without the reactive auto-restart firing — exactly
the SaaS-blind class of failure PR #2362 was meant to close.
Refactor: extracted isUpstreamDeadStatus(int) helper so the matrix is
in one place, with TestIsUpstreamDeadStatus locking in 18 status
codes (7 dead, 11 not-dead including 520 and 525 which look CF-shaped
but indicate different failures).
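The helper's shape (stdlib constants where they exist; 521-524 have no
net/http names):

    func isUpstreamDeadStatus(code int) bool {
        switch code {
        case http.StatusBadGateway,        // 502
            http.StatusServiceUnavailable, // 503
            http.StatusGatewayTimeout,     // 504
            521, 522, 523,                 // CF origin-failure family (this commit)
            524:                           // CF origin timed out (#2362 follow-up)
            return true
        }
        return false
    }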
Also tightened TestStopForRestart_NoProvisioner_NoOp per the same
review: now uses sqlmock.ExpectationsWereMet to assert the dispatcher
doesn't touch the DB on the both-nil path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backward-compat replay gate for the A2A JSON-RPC protocol surface.
Every PR that touches normalizeA2APayload OR bumps the a-2-a-sdk
version pin runs every shape in testdata/a2a_corpus/ through the
current code and asserts:
valid/ — every shape MUST parse without error and produce a
canonical v0.3 payload (params.message.parts list).
invalid/ — every shape MUST be rejected with the documented
status code and error substring.
What this prevents
The 2026-04-29 v0.2 → v0.3 silent-drop bug (PR #2349) shipped
because the SDK bump PR didn't replay v0.2-shaped inputs against
the new code; the shape-mismatch surfaced only in production when
the receiver's Pydantic validator silently rejected inbound
messages.
This gate would have caught it pre-merge. Hand-verified: reverting
the v0.2 string→parts shim in normalizeA2APayload fails 3 of the
v0.2 corpus entries with the exact rejection class the production
bug exhibited.
Corpus contents (14 entries)
valid/ (11):
v0_2_string_content — basic v0.2 (the broken case)
v0_2_string_content_no_message_id — v0.2 + auto-fill messageId
v0_2_list_content — v0.2 with content as Part list
v0_3_parts_text_only — canonical v0.3
v0_3_parts_multi_text — multi-Part list
v0_3_parts_with_file — multimodal (text + file)
v0_3_parts_with_context — contextId for multi-turn
v0_3_streaming_method — message/stream variant
v0_3_unicode_text — emoji + multi-script
v0_3_long_text — 10KB text Part
no_jsonrpc_envelope — bare params/method without
outer envelope (legacy senders)
invalid/ (3):
no_content_or_parts — message has neither field
content_is_integer — wrong type for v0.2 content
content_is_bool — wrong type, kept separate from int so
the failure msg identifies which type-class regressed
Plus 4 inline malformed-JSON cases (truncated, not-JSON, empty,
whitespace) that can't be expressed as JSON corpus entries.
Coverage tests
The gate has 5 test functions:
1. TestA2ACorpus_ValidShapesParse — replay valid/ corpus,
assert no error + canonical v0.3 output (parts list non-empty,
messageId non-empty, content field deleted).
2. TestA2ACorpus_InvalidShapesRejected — replay invalid/ corpus,
assert rejection matches recorded status + error substring.
3. TestA2ACorpus_MalformedJSONRejected — inline cases for
non-parseable bodies.
4. TestA2ACorpus_HasMinimumCoverage — at least one v0.2 +
one v0.3 entry exists (loses neither side of the bridge).
5. TestA2ACorpus_EveryEntryHasMetadata — _comment/_added/_source
on every entry per the README policy; _expect_error and
_expect_status on invalid entries.
Documentation
testdata/a2a_corpus/README.md describes the corpus contract:
- When to add entries (new SDK shape, new production-observed
shape).
- When NOT to add (test scaffolding, hypothetical futures).
- Removal policy (breaking change, deprecation window required).
Verification
- All 24 corpus subtests pass on current main.
- Hand-test: revert the v0.2 compat shim → 3 v0.2 entries fail
the gate with the exact rejection class the production bug
exhibited. Confirmed.
- Whole-module go test ./... green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses code-review C1 (test goroutine race) and I2 (CF 524) on PR #2362.
C1: TestRunRestartCycle_SaaSPath_DispatchesViaCPProv invoked runRestartCycle
end-to-end, which spawns `go h.sendRestartContext(...)`. That goroutine
outlived the test, then read db.DB while the next test's setupTestDB wrote
to it — DATA RACE under -race, cascading 30+ failures across the handlers
suite. Refactored: extracted `stopForRestart(ctx, id)` from runRestartCycle
as a pure dispatcher, and rewrote the SaaS-path test to call it directly
(no async goroutine spawned). Added a no-provisioner no-op guard test.
I2: Cloudflare 524 ("origin timed out") now triggers maybeMarkContainerDead
alongside 502/503/504. Same upstream signal — origin agent unresponsive.
Verified `go test -race -count=1 ./internal/handlers/...` green locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review of #2362 caught a Critical gap: the previous commit
fixed the Stop dispatch in runRestartCycle but left the provisionWorkspace
dispatch unconditionally Docker-only. So on SaaS the auto-restart cycle
would Stop the EC2 successfully (good), then NPE inside provisionWorkspace's
`h.provisioner.VolumeHasFile` call. coalesceRestart's recover()-without-
re-raise (a deliberate platform-stability safeguard) silently swallowed
the panic, leaving the workspace permanently stuck in status='provisioning'
because the UPDATE on workspace_restart.go:450 had already run.
Net pre-amendment effect on SaaS: dead agent → structured 503 (good) →
workspace flipped to 'offline' (good) → cpProv.Stop succeeded (good) →
provisionWorkspace NPE swallowed (bad) → workspace permanently
'provisioning' until manual canvas restart. The headline claim of #2362
("SaaS auto-restart now works") was false on the path it shipped.
Fix: dispatch the reprovision call the same way every other call site
in the package does (workspace.go:431-433, workspace_restart.go:197+596) —
branch on `h.cpProv != nil` and call provisionWorkspaceCP for SaaS,
provisionWorkspace for Docker.
Tests:
- New TestRunRestartCycle_SaaSPath_DispatchesViaCPProv asserts cpProv.Stop
is called when the SaaS path runs (would have caught the NPE if
provisionWorkspace had been called instead).
- fakeCPProv updated: methods record calls and return nil/empty by
default rather than panicking. The previous "panic on unexpected call"
pattern was unsafe — the panic fires on the async restart goroutine
spawned by maybeMarkContainerDead AFTER the test assertions ran, so
the test passed by accident even though the production path was
broken (which is exactly how the Critical bug landed).
- Existing tests still pass (full handlers + provisioner suites green).
Branch-count audit refresh:
runRestartCycle dispatch decisions:
1. h.provisioner != nil → provisioner.Stop + provisionWorkspace ✓ (existing tests)
2. h.cpProv != nil → cpProv.Stop + provisionWorkspaceCP ✓ (NEW test)
3. both nil → coalesceRestart never called (RestartByID gate) ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Class-of-bugs fix surfaced on 2026-04-30, when canvas chat to a dead
workspace on hongmingwang.moleculesai.app returned a generic Cloudflare
502 page. Three independent gaps in the reactive-health path together
leak dead-agent failures to canvas with no auto-recovery.
## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants
`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker
provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner)
and leave `h.provisioner` nil — so the function early-returned false
on every call and dead EC2 agents never triggered the offline-flip /
broadcast / restart cascade.
Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id)
(bool, error)` (already implemented on `*CPProvisioner`; just needs
to surface on the interface). `maybeMarkContainerDead` now branches:
local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses
`h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status`
endpoint to read the EC2 state.
## Bug 2 — RestartByID short-circuits on `h.provisioner == nil`
Same shape as Bug 1: the auto-restart cascade triggered by
`maybeMarkContainerDead` calls `RestartByID` which short-circuited
when the local Docker provisioner was missing. So even if Bug 1 were
fixed, the workspace-offline state would never recover.
Fix: change the gate to `h.provisioner == nil && h.cpProv == nil`
and update `runRestartCycle` to branch on which provisioner is
wired for the Stop call. (The HTTP `Restart` handler already does
this branching correctly — we're just bringing the auto-restart path
to parity.)
## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare
When the agent's tunnel returns 5xx (the "tunnel up but no origin"
shape — agent process dead but cloudflared connection still healthy),
`dispatchA2A` completes successfully at the transport layer, carrying
a 5xx response.
`handleA2ADispatchError`'s reactive-health path doesn't run because
that path is only triggered on transport-level errors. The pre-fix
code propagated the 502 status to canvas; Cloudflare in front of the
platform then masked the 502 with its own opaque "error code: 502"
page, hiding any structured response and any Retry-After hint.
Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run
`maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms
the agent is dead → return a structured 503 with restarting=true +
Retry-After (CF doesn't mask 503s the same way). If running, propagate
the original status (don't recycle a healthy agent on a transient
hiccup — it might have legitimately returned 502).
## Drive-by — a2aClient transport timeouts
a2aClient was `&http.Client{}` with no Transport timeouts. When a
workspace's EC2 black-holes TCP connects (instance terminated mid-flight,
SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS —
long enough for Cloudflare's ~100s edge timeout to fire first and
surface a generic 502. Added DialContext (10s connect), TLSHandshake
(10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY
unset — that would pre-empt slow-cold-start flows (Claude Code OAuth
first-token, multi-minute agent synthesis). Long-tail body streaming
is still governed by per-request context deadline.
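The transport, with this commit's values:

    a2aClient = &http.Client{
        // Client.Timeout deliberately unset — per-request context deadlines
        // govern long-tail body streaming (slow cold starts stay alive).
        Transport: &http.Transport{
            DialContext: (&net.Dialer{
                Timeout: 10 * time.Second, // TCP connect
            }).DialContext,
            TLSHandshakeTimeout:   10 * time.Second,
            ResponseHeaderTimeout: 60 * time.Second,
        },
    }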
## Tests
- `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) →
marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) →
no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server
returns 502 + cpProv reports dead → caller gets 503 with restarting=
true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream
502 but cpProv reports running → propagates 502 (existing behavior;
safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` /
`TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.
## Impact
Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no
auto-recovery, manual restart from canvas required.
Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with
restarting=true + Retry-After to canvas, workspace flipped to offline,
auto-restart cycle triggered. Canvas can show a user-actionable
"agent is restarting, please wait" message instead of a generic 502.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Telegram getUpdates / Slack RTM shape: poll-mode workspaces pass the id
of the last activity_logs row they consumed, and the server returns rows
strictly after it in chronological (ASC) order. Existing callers that don't
pass since_id keep DESC + most-recent-N — backwards-compatible.
Cursor lookup is scoped by workspace_id so a caller cannot enumerate or
peek at another workspace's events by passing a UUID belonging to a
different workspace. Cross-workspace and pruned cursors both return
410 Gone — no information leak (caller cannot distinguish "row never
existed" from "row exists but you can't see it").
since_id + since_secs both apply (AND). When since_id is set the order
flips to ASC because polling consumers need recorded-order; the recent-
feed shape (no since_id) keeps DESC.
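Sketch of the cursor flow (columns and the query-assembly variables are
assumptions):

    if sinceID != "" {
        var cursorTime time.Time
        err := db.QueryRowContext(ctx,
            `SELECT created_at FROM activity_logs
              WHERE id = $1 AND workspace_id = $2`, // scoped: no cross-workspace peek
            sinceID, workspaceID).Scan(&cursorTime)
        if errors.Is(err, sql.ErrNoRows) {
            // pruned and cross-workspace cursors share this path: 410, no leak
            c.AbortWithStatusJSON(http.StatusGone, gin.H{"error": "cursor gone"})
            return
        }
        orderBy = "created_at ASC" // recorded order for polling consumers
        filters = append(filters, "created_at > $2")
        args = append(args, cursorTime)
    }
    // no since_id: the recent-feed shape keeps ORDER BY created_at DESC.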
Tests:
- TestActivityHandler_SinceID_ReturnsNewerASC — cursor lookup → main
query with cursorTime + ASC ordering.
- TestActivityHandler_SinceID_CursorNotFound_410 — pruned/unknown cursor.
- TestActivityHandler_SinceID_CrossWorkspaceCursor_410 — UUID belongs to
another workspace, scoped lookup hides it (same 410 path, no leak).
- TestActivityHandler_SinceID_CombinedWithSinceSecs — placeholder index
arithmetic with both filters.
Stacked on #2353 (PR 2: poll-mode short-circuit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip SSRF/dispatch and queue to activity_logs for delivery_mode=poll
workspaces. The polling agent (e.g. molecule-mcp-claude-channel on an
operator's laptop) consumes via GET /activity?since_id= in PR 3 — no
public URL needed.
Order: budget -> normalize -> lookupDeliveryMode short-circuit ->
resolveAgentURL. Normalizing before the short-circuit keeps the
JSON-RPC method name on the activity_logs row so the polling agent
can dispatch correctly.
Fail-closed-to-push: any DB error reading delivery_mode defaults to
push (loud + recoverable) rather than poll (silent drop).
Tests:
- TestProxyA2A_PollMode_ShortCircuits_NoSSRF_NoDispatch — core invariant:
no resolveAgentURL, no Do(), records to activity_logs, returns 200
{status:"queued",delivery_mode:"poll",method:"message/send"}.
- TestProxyA2A_PushMode_NoShortCircuit — push path unaffected; the agent
server actually receives the request.
- TestProxyA2A_PollMode_FailsClosedToPush — DB error on mode lookup
must NOT silently queue; falls through to the push path.
Stacked on #2348 (PR 1: schema + register flow).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hard gate #4: codified module boundaries as Go tests, so a new
contributor (or AI agent) can't silently land an import that crosses
a layer.
Boundaries enforced (one architecture_test.go per package):
- wsauth has no internal/* deps — auth leaf, must be unit-testable in
isolation
- models has no internal/* deps — pure-types leaf, reverse dep would
create cycles since most packages depend on models
- db has no internal/* deps — DB layer below business logic, must be
testable with sqlmock without spinning up handlers/provisioner
- provisioner does not import handlers or router — unidirectional
layering: handlers wires provisioner into HTTP routes; the reverse
is a cycle
Each test parses .go files in its package via go/parser (no x/tools
dep needed) and asserts forbidden import paths don't appear. Failure
messages name the rule, the offending file, and explain WHY the
boundary exists so the diff reviewer learns the rule.
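Sketch of one boundary test (failure message illustrative):

    package wsauth

    import (
        "go/parser"
        "go/token"
        "strings"
        "testing"
    )

    func TestWsauthHasNoInternalDeps(t *testing.T) {
        fset := token.NewFileSet()
        pkgs, err := parser.ParseDir(fset, ".", nil, parser.ImportsOnly)
        if err != nil {
            t.Fatal(err)
        }
        for _, pkg := range pkgs {
            for filename, file := range pkg.Files {
                for _, imp := range file.Imports {
                    path := strings.Trim(imp.Path.Value, `"`)
                    if strings.Contains(path, "/internal/") {
                        t.Errorf("%s imports %s — wsauth is an auth leaf; it must stay unit-testable in isolation", filename, path)
                    }
                }
            }
        }
    }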
Note: the original issue's first two proposed boundaries
(provisioner-no-DB, handlers-no-docker) don't match the codebase
today — provisioner already imports db (PR #2276 runtime-image
lookup) and handlers hold *docker.Client directly (terminal,
plugins, bundle, templates). I picked the four boundaries that
actually hold; the first two are aspirational and would need a
refactor before they could be codified.
Hand-tested by injecting a deliberate wsauth -> orgtoken violation:
the gate fires red with the rule message before merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2345.
## Symptom
Design Director silently dropped A2A briefs whose sender used the
v0.2 message format (`params.message.content` string) instead of v0.3
(`params.message.parts` part-list). The downstream a2a-sdk's v0.3
Pydantic validator rejected with "params.message.parts — Field
required" but the rejection only landed in tenant-side logs; the
sender saw HTTP 200/202 and assumed delivery.
UX Researcher therefore never received the kickoff. Multi-agent
pipeline silently idle.
## Fix
Convert at the proxy edge in normalizeA2APayload. Two cases handled,
one explicitly rejected:
v0.2 string content → wrap as [{kind: text, text: <content>}]
(the canonical v0.2 case from the dogfooding
report)
v0.2 list content → preserve list as parts (some older clients
put a list under `content`; treat as "client
meant parts, used wrong field name")
v0.3 parts present → no-op (hot path for normal traffic)
Neither present → return HTTP 400 with structured JSON-RPC
error pointing at the missing field
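The matrix above, sketched (map-based plumbing assumed; the real
function also auto-fills messageId and unwraps the JSON-RPC envelope):

    func normalizeMessage(msg map[string]any) error {
        if _, ok := msg["parts"]; ok {
            return nil // v0.3 hot path: no-op
        }
        switch content := msg["content"].(type) {
        case string: // v0.2 string content → single text Part
            msg["parts"] = []any{map[string]any{"kind": "text", "text": content}}
        case []any: // v0.2 list content → client meant parts
            msg["parts"] = content
        case nil: // neither field → loud structured 400 at the caller
            return fmt.Errorf("message has neither content nor parts")
        default: // integer/bool/etc — name the type so the test can pin it
            return fmt.Errorf("unsupported content type %T", content)
        }
        delete(msg, "content")
        return nil
    }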
Why at the proxy edge: every workspace gets the compat for free
without each one bumping a2a-sdk separately. The SDK's own compat
adapter is strict about `parts` and rejects v0.2 senders.
Why reject loud on missing-both: pre-fix the SDK's Pydantic
rejection was post-handler-dispatch and invisible to the original
sender. Now misshapen payloads return a structured 400 to the actual
caller — kills the entire silent-drop class for this payload-shape
category.
## Tests
7 new cases on normalizeA2APayload (#2345) + 1 fixture update on the
existing _MissingMethodReturnsEmpty test:
TestNormalizeA2APayload_ConvertsV02StringContentToParts
TestNormalizeA2APayload_ConvertsV02ListContentToParts
TestNormalizeA2APayload_PreservesV03Parts (hot path)
TestNormalizeA2APayload_RejectsMessageWithNeitherContentNorParts
TestNormalizeA2APayload_RejectsContentWithUnsupportedType
TestNormalizeA2APayload_NoMessageNoCheck (e.g. tasks/list bypasses)
All 11 normalizeA2APayload tests pass + full handler suite (no
regressions).
## Refs
Hard-gates discussion: this is exactly the class of failure
(silent-drop on schema mismatch) that #2342 (continuous synthetic
E2E) would catch automatically. Tier 2 RFC item from #2345 (caller
gets structured JSON-RPC error on parse failure) is delivered above
via the loud-reject path.
Adds workspaces.delivery_mode ('push', the default, or 'poll') and lets the register
handler accept poll-mode workspaces with no URL. This is the foundation
for the unified poll/push delivery design in #2339 — Telegram-getUpdates
shape for external runtimes that have no public URL.
What this PR does:
- Migration 045: NOT NULL TEXT column, default 'push', CHECK constraint
on the two valid values.
- models.Workspace + RegisterPayload + CreateWorkspacePayload gain a
DeliveryMode field. RegisterPayload.URL drops the `binding:"required"`
tag — the handler now enforces it conditionally on the resolved mode.
- Register handler: validates explicit delivery_mode if set; resolves
effective mode (payload value, else stored row value, else push) AFTER
the C18 token check; validates URL only when effective mode is push;
persists delivery_mode in the upsert; returns it in the response;
skips URL caching when payload.URL is empty.
- CreateWorkspace handler: persists delivery_mode (defaults to push) in
the same INSERT, validates it before any side effects.
What this PR does NOT do (intentional, follow-up PRs):
- PR 2: short-circuit ProxyA2A for poll-mode workspaces (skip SSRF +
dispatch, log a2a_receive activity, return 200).
- PR 3: since_id cursor on GET /activity for lossless polling.
- Plugin v0.2 in molecule-mcp-claude-channel: cursor persistence + a
register helper that creates poll-mode workspaces.
Backwards compatibility: every existing workspace stays push-mode (schema
default) with identical behavior. New tests:
TestRegister_PollMode_AcceptsEmptyURL,
TestRegister_PushMode_RejectsEmptyURL,
TestRegister_InvalidDeliveryMode,
TestRegister_PollMode_PreservesExistingValue. All existing register +
create tests updated to expect the new delivery_mode column in the
INSERT args.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The header comment claimed:
"file upload (HTTP-forward) + download (Docker-exec)"
and:
"Download still uses the v1 docker-cp path; migrating it lives in
the next PR in this stack"
Both wrong now. RFC #2312 PR-D landed the Download HTTP-forward path:
chat_files.go:336 builds an http.NewRequestWithContext to
${wsURL}/internal/file/read?path=<abs>, with the response streamed
back to the caller. The workspace-side Starlette handler is at
workspace/internal_file_read.py, mounted at workspace/main.py:440.
Update the header to reflect actual code: both upload + download are
HTTP-forward, share the same per-workspace platform_inbound_secret
auth, and work uniformly on local Docker and SaaS EC2.
Pure docs change — no behavior, no build/test impact.
Closes the observability gap surfaced in #2329 item 5: callers received
queue_id in the 202 enqueue response but had no public lookup. The only
existing observability path was check_task_status (delegation-flavored
A2A only — joins via request_body->>'delegation_id'). Cross-workspace
peer-direct A2A had no observability after enqueue.
This PR ships RFC #2331's Tier 1: minimum viable observability + caller-
specified TTL. No schema migration — expires_at column already exists
(migration 042); only DequeueNext was honoring it, with no caller path
to populate it.
Two changes:
1. extractExpiresInSeconds(body) — new helper mirroring
extractIdempotencyKey/extractDelegationIDFromBody. Pulls
params.expires_in_seconds from the JSON-RPC body. Zero (the unset
default) preserves today's infinite-TTL semantics. EnqueueA2A grew
an expiresAt *time.Time parameter; the proxy callsite computes
*time.Time from the extracted seconds and threads it through to
the INSERT. (Helper sketched after this list.)
2. GET /workspaces/:id/a2a/queue/:queue_id — new public handler.
Auth: caller's workspace token must match queue.caller_id OR
queue.workspace_id, OR be an org-level token. 404 (not 403) on
auth failure to avoid leaking queue_id existence. Response
includes status/attempts/last_error/timestamps/expires_at; embeds
response_body via LEFT JOIN against activity_logs when status=
completed for delegation-flavored items.
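Sketch of change 1's helper (JSON shapes assumed from the test matrix):

    func extractExpiresInSeconds(body []byte) int {
        var envelope struct {
            Params struct {
                ExpiresInSeconds int `json:"expires_in_seconds"`
            } `json:"params"`
        }
        // Invalid JSON, a wrong-typed field, or an absent field all resolve
        // to 0 (unset), preserving today's infinite-TTL semantics.
        if err := json.Unmarshal(body, &envelope); err != nil {
            return 0
        }
        if envelope.Params.ExpiresInSeconds < 0 {
            return 0
        }
        return envelope.Params.ExpiresInSeconds
    }

    // Proxy callsite: thread the computed *time.Time through to EnqueueA2A.
    var expiresAt *time.Time
    if secs := extractExpiresInSeconds(body); secs > 0 {
        t := time.Now().Add(time.Duration(secs) * time.Second)
        expiresAt = &t
    }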
What this does NOT change:
- Drain semantics (heartbeat-driven dispatch).
- Native-session bypass (claude-agent-sdk, hermes still skip queue).
- Schema (column already exists).
- MCP tools (delegate_task_async / check_task_status keep their
contract; this is a parallel queue-id surface).
Tests:
- 7 cases on extractExpiresInSeconds covering absent/positive/
zero/negative/invalid-JSON/wrong-type/empty-params.
- go vet + go build clean.
- Full handlers test suite passes (no regressions from the
EnqueueA2A signature change — only one production caller).
Tier 2 (cross-workspace stitch + webhook callback) and Tier 3
(controllerized lifecycle) deferred per RFC #2331.
External callers (third-party SDKs, the channel plugin) authenticate
purely via bearer and frequently don't set the X-Workspace-ID header.
Without this, activity_logs.source_id ends up NULL — breaking the
peer_id signal on notifications, the "Agent Comms by peer" canvas tab,
and any analytics that breaks down inbound A2A by sender.
The bearer is the authoritative caller identity per the wsauth contract
(it's what proves who you are); the header is a display/routing hint
that must agree with it. So we derive callerID from the bearer's owning
workspace whenever the header is absent. The existing validateCallerToken
guard fires after this and enforces token-to-callerID binding the same
way it always has.
Org-token requests are skipped — those grant org-wide access and don't
bind to a single workspace, so the canvas-class semantics (callerID="")
are preserved. Bearer-resolution failures (revoked, removed workspace)
fall through to canvas-class as well, never 401.
New wsauth.WorkspaceFromToken exposes the bearer→workspace lookup as a
modular interface; mirrors ValidateAnyToken's defense-in-depth JOIN on
workspaces.status != 'removed'.
Tests: 4 unit tests on WorkspaceFromToken + 3 integration tests on
ProxyA2A covering the three observable paths (bearer-derived,
org-token skipped, derive-failure fallthrough).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors PR-C's Upload migration: replaces the docker-cp tar-stream
extraction with a streaming HTTP GET to the workspace's own
/internal/file/read endpoint. Closes the SaaS gap for downloads —
without this PR, GET /workspaces/:id/chat/download still returns 503
on Railway-hosted SaaS even after A+B+C+F land.
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → PR-F #2319 → this PR.
Why a single broad /internal/file/read instead of /internal/chat/download:
Today's chat_files.go::Download already accepts paths under any of the
four allowed roots {/configs, /workspace, /home, /plugins} — it's not
strictly chat. Future PRs (template export, etc.) will reuse this
endpoint via the same forward pattern; reusing avoids three near-
identical handlers (one per domain) with duplicated path-safety logic.
Path safety is duplicated on platform + workspace sides — defense in
depth via two parallel checks, not "trust the workspace."
Changes:
* workspace/internal_file_read.py — Starlette handler. Validates path
(must be absolute, under allowed roots, no traversal, canonicalises
cleanly). lstat (not stat) so a symlink at the path doesn't redirect
the read. Streams via FileResponse (no buffering). Mirrors Go's
contentDispositionAttachment for Content-Disposition header.
* workspace/main.py — registers GET /internal/file/read alongside the
POST /internal/chat/uploads/ingest from PR-B.
* scripts/build_runtime_package.py — adds internal_file_read to
TOP_LEVEL_MODULES so the publish-runtime cascade rewrites its
imports correctly. Also includes the PR-B additions
(internal_chat_uploads, platform_inbound_auth) since this branch
was rooted before PR-B's drift-gate fix; merge-clean alphabetic
additions.
* workspace-server/internal/handlers/chat_files.go — Download
rewritten as streaming HTTP GET forward. Resolves workspace URL +
platform_inbound_secret (same shape as Upload), builds GET request
with path query param, propagates response headers (Content-Type /
Content-Length / Content-Disposition) + body. Drops archive/tar
+ mime imports (no longer needed). Drops Docker-exec branch entirely
— Download is now uniform across self-hosted Docker and SaaS EC2.
* workspace-server/internal/handlers/chat_files_test.go — replaces
TestChatDownload_DockerUnavailable (stale post-rewrite) with 4
new tests:
- TestChatDownload_WorkspaceNotInDB → 404 on missing row
- TestChatDownload_NoInboundSecret → 503 on NULL column
(with RFC #2312 detail in body)
- TestChatDownload_ForwardsToWorkspace_HappyPath → forward shape
(auth header, GET method, /internal/file/read path) + headers
propagated + body byte-for-byte
- TestChatDownload_404FromWorkspacePropagated → 404 from
workspace propagates (NOT remapped to 500)
Existing TestChatDownload_InvalidPath path-safety tests preserved.
* workspace/tests/test_internal_file_read.py — 21 tests covering
_validate_path matrix (absolute, allowed roots, traversal, double-
slash, exact-match-on-root), 401 on missing/wrong/no-secret-file
bearer, 400 on missing path/outside-root/traversal, 404 on missing
file, happy-path streaming with correct Content-Type +
Content-Disposition, special-char escaping in Content-Disposition,
symlink-redirect-rejection (lstat-not-stat protection).
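The lstat point from the internal_file_read.py bullet, rendered in Go
for illustration (the shipped handler is Python; os.Lstat is the Go
analogue of the same check):

    // Lstat reports on the link itself; Stat would follow it, letting a
    // symlink planted under an allowed root redirect the read anywhere.
    func refuseSymlink(path string) error {
        fi, err := os.Lstat(path)
        if err != nil {
            return err
        }
        if fi.Mode()&os.ModeSymlink != 0 {
            return fmt.Errorf("refusing symlink %q: target may escape allowed roots", path)
        }
        return nil
    }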
Test results:
* go test ./internal/handlers/ ./internal/wsauth/ — green
* pytest workspace/tests/ — 1292 passed (was 1272 before PR-D)
Refs #2312 (parent RFC), #2308 (chat upload+download 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the SaaS-side gap that PR-A acknowledged but didn't fix: SaaS
workspaces have no persistent /configs volume, so the platform_inbound_secret
that PR-A's provisioner wrote at workspace creation never reaches the
runtime. Without this, even after the entire RFC #2312 stack lands,
SaaS chat upload would 401 (workspace fails-closed when /configs/.platform_inbound_secret
is missing).
Solution: return the secret in the /registry/register response body
on every register call. The runtime extracts it and persists to
/configs/.platform_inbound_secret at mode 0600. Idempotent — Docker-
mode workspaces also receive it and overwrite the value the provisioner
already wrote (same value until rotation).
Why on every register, not just first-register:
* SaaS containers can be restarted (deploys, drains, EBS detach/
re-attach) — /configs is rebuilt empty on each fresh start.
* The auth_token is "issue once" because re-issuing rotates and
invalidates the previous one. The inbound secret has no rotation
flow yet (#2318) so re-sending the same value is harmless.
* Eliminates the bootstrap window where a restarted SaaS workspace
has no inbound secret on disk and would 401 every platform call.
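Shape of the register-side change (the response field name is an
assumption; ReadPlatformInboundSecret is named in the Changes list
below):

    type registerResponse struct {
        AuthToken             string `json:"auth_token"`
        PlatformInboundSecret string `json:"platform_inbound_secret,omitempty"`
    }

    // Inside Register, after auth_token issuance:
    if secret, err := wsauth.ReadPlatformInboundSecret(db, wsID); err == nil && secret != "" {
        resp.PlatformInboundSecret = secret // NULL column: field omitted, still 200
    }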
Changes:
* workspace-server/internal/handlers/registry.go — Register handler
reads workspaces.platform_inbound_secret via wsauth.ReadPlatformInboundSecret
and includes it in the response body. Legacy workspaces (NULL
column) get a successful registration with the field omitted.
* workspace-server/internal/handlers/registry_test.go — two new tests:
- TestRegister_ReturnsPlatformInboundSecret_RFC2312_PRF: secret
present in DB → secret in response, alongside auth_token.
- TestRegister_NoInboundSecret_OmitsField: NULL column → field
omitted, registration still 200.
* workspace/platform_inbound_auth.py — adds save_inbound_secret(secret).
Atomic write via tmp + os.replace, mode 0600 from os.open(O_CREAT,
0o600) so a concurrent reader never sees the 0644 default (sketched
after this list). Resets the
in-process cache after write so the next get_inbound_secret() returns
the freshly-written value (rotation-safe when it lands).
* workspace/main.py — register-response handler extracts
platform_inbound_secret alongside auth_token and persists via
save_inbound_secret. Mirrors the existing save_token pattern.
* workspace/tests/test_platform_inbound_auth.py — 6 new tests for
save_inbound_secret: writes file, mode 0600, overwrite-existing,
cache invalidation after save, empty-input no-op, parent-dir creation
for fresh installs.
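save_inbound_secret itself is Python; the same atomic-write idea in Go,
for illustration only (names are mine):

    func saveInboundSecret(path, secret string) error {
        tmp := path + ".tmp"
        // 0600 at creation time, so there is no window where a concurrent
        // reader sees default-0644 permissions.
        f, err := os.OpenFile(tmp, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
        if err != nil {
            return err
        }
        if _, err := f.WriteString(secret); err != nil {
            f.Close()
            return err
        }
        if err := f.Close(); err != nil {
            return err
        }
        return os.Rename(tmp, path) // atomic replace, like Python's os.replace
    }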
Test results:
* go test ./internal/handlers/ ./internal/wsauth/ — all green
* pytest workspace/tests/ — 1272 passed (was 1266 before this PR)
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review found the original draft of this PR added forward-time
validateAgentURL() as defense-in-depth — a paranoia layer on top of the
existing register-time gate. The validator unconditionally blocks
loopback (127.0.0.1/8), which makes httptest-based proxy tests
impossible without an env-var hatch I'd rather not add to a security-
critical path on first pass, so the validator is dropped from this PR.
Trust note kept inline pointing at the upstream gate + tracking issue
so the gap is explicit, not invisible.
Refs #2312.
Closes the SaaS upload gap (#2308) with the unified architecture from
RFC #2312: same code path on local Docker and SaaS, no Docker socket
dependency, no `dockerCli == nil` cliff. Stacked on PR-A (#2313) +
PR-B (#2314).
Before:
Upload → findContainer (nil in SaaS) → 503
After:
Upload → resolve workspaces.url + platform_inbound_secret
→ stream multipart to <url>/internal/chat/uploads/ingest
→ forward response back unchanged
Same call site whether the workspace runs on local docker-compose
("http://ws-<id>:8000") or SaaS EC2 ("https://<id>.<tenant>...").
The bug behind #2308 cannot exist by construction.
Why streaming, not parse-then-re-encode:
* No 50 MB intermediate buffer on the platform
* Per-file size + path-safety enforcement is the workspace's job
(see workspace/internal_chat_uploads.py, PR-B)
* Workspace's error responses (413 with offending filename, 400 on
missing files field, etc.) propagate through unchanged
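A sketch of the streaming forward (wiring names are assumptions, not
the shipped code):

    req, err := http.NewRequestWithContext(r.Context(), http.MethodPost,
        wsURL+"/internal/chat/uploads/ingest", r.Body) // body streamed, never buffered
    if err != nil {
        http.Error(w, "building forward request failed", http.StatusInternalServerError)
        return
    }
    req.Header.Set("Authorization", "Bearer "+secret)
    req.Header.Set("Content-Type", r.Header.Get("Content-Type")) // preserves the multipart boundary
    resp, err := h.httpClient.Do(req)
    if err != nil {
        http.Error(w, "workspace unreachable", http.StatusBadGateway) // the 502 path
        return
    }
    defer resp.Body.Close()
    w.WriteHeader(resp.StatusCode) // 413/400 from the workspace propagate unchanged
    io.Copy(w, resp.Body)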
Changes:
* workspace-server/internal/handlers/chat_files.go — Upload rewritten
as a streaming HTTP proxy. Drops sanitizeFilename, copyFlatToContainer,
and the entire docker-exec path. ChatFilesHandler gains an httpClient
(broken out for test injection). Download stays docker-exec for now;
follow-up PR will migrate it to the same shape.
* workspace-server/internal/handlers/chat_files_external_test.go —
deleted. Pinned the wrong-headed runtime=external 422 gate from
#2309 (already reverted in #2311). Superseded by the proxy tests.
* workspace-server/internal/handlers/chat_files_test.go — replaced
sanitize-filename tests (now in workspace/tests/test_internal_chat_uploads.py)
with sqlmock + httptest proxy tests:
- 400 invalid workspace id
- 404 workspace row missing
- 503 platform_inbound_secret NULL (with RFC #2312 detail)
- 503 workspaces.url empty
- happy-path forward (asserts auth header, content-type forwarded,
body streamed, response propagated back)
- 413 from workspace propagated unchanged (NOT remapped to 500)
- 502 on workspace unreachable (connect refused)
Existing Download + ContentDisposition tests preserved.
* tests/e2e/test_chat_upload_e2e.sh — single-script-everywhere E2E.
Takes BASE as env (default http://localhost:8080). Creates a
workspace, waits for online, mints a test token, uploads a fixture,
reads it back via /chat/download, asserts content matches +
bearer-required. Same script runs against staging tenants (set
BASE=https://<id>.<tenant>.staging.moleculesai.app).
Test plan:
* go build ./... — green
* go test ./internal/handlers/ ./internal/wsauth/ — green (full suite)
* tests/e2e/test_chat_upload_e2e.sh against local docker-compose
after PR-A + PR-B + this PR all merge — TODO before merge
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for the HTTP-forward architecture that replaces Docker-exec
in chat upload + 5 follow-on handlers. This PR is intentionally scoped
to schema + token mint + provisioner wiring; no caller reads the secret
yet so behavior is unchanged.
Why a second per-workspace bearer (not reuse the existing
workspace_auth_tokens row):
workspace_auth_tokens workspaces.platform_inbound_secret
───────────────────── ─────────────────────────────────
workspace → platform platform → workspace
hash stored, plaintext gone plaintext stored (platform reads back)
workspace presents bearer platform presents bearer
platform validates by hash workspace validates by file compare
Distinct roles, distinct rotation lifecycle, distinct audit signal —
splitting later would require a fleet-wide rolling rotation, so paying
the schema cost up front.
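A sketch of the mint (exact signature assumed; the stored value is
plaintext because the platform must read it back to present as bearer):

    import (
        "crypto/rand"
        "database/sql"
        "encoding/hex"
    )

    func IssuePlatformInboundSecret(db *sql.DB, wsID string) (string, error) {
        buf := make([]byte, 32)
        if _, err := rand.Read(buf); err != nil {
            return "", err
        }
        secret := hex.EncodeToString(buf)
        _, err := db.Exec(
            `UPDATE workspaces SET platform_inbound_secret = $1 WHERE id = $2`,
            secret, wsID)
        return secret, err
    }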
Changes:
* migration 044: ADD COLUMN workspaces.platform_inbound_secret TEXT
* wsauth.IssuePlatformInboundSecret + ReadPlatformInboundSecret
* issueAndInjectInboundSecret hook in workspace_provision: mints
on every workspace create / re-provision; Docker mode writes
plaintext to /configs/.platform_inbound_secret alongside .auth_token,
SaaS mode persists to DB only (workspace will receive via
/registry/register response in a follow-up PR)
* 8 unit tests against sqlmock — covers happy path, rotation, NULL
column, empty string, missing workspace row, empty workspaceID
PR-B (next) wires up workspace-side `/internal/chat/uploads/ingest`
that validates the bearer against /configs/.platform_inbound_secret.
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2309 added an early-return that 422'd uploads to external workspaces
with "file upload not supported." Both halves of that diagnosis were wrong:
1. External workspaces SHOULD support uploads — gating with 422
locks out intended functionality and passes the gap off as design.
2. The 503 the user actually hit was on an INTERNAL workspace, not
an external one. The runtime check never even ran.
Real root cause (separate fix incoming):
- findContainer(...) requires a non-nil h.docker.
- In SaaS (MOLECULE_ORG_ID set), main.go selects the CP provisioner
instead of the local Docker provisioner — dockerCli is nil.
- findContainer short-circuits to "" → 503 "container not running"
on every workspace, internal or external, on Railway-hosted
SaaS where workspaces actually live on EC2.
This PR strips the misleading gate so #2308 can be re-investigated
against the real symptom. The proper fix routes the multipart upload
over HTTP to the workspace's URL when dockerCli is nil — tracked
as a follow-up.
Refs #2308.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: pasting a screenshot into the canvas chat for a runtime="external"
workspace returned `503 {"error":"workspace container not running"}` —
accurate from the upload handler's POV (no container exists for external
workspaces) but misleading because it implies the container has crashed.
Fix: detect runtime="external" via DB lookup BEFORE the container-find
step and return 422 with:
- error: "file upload not supported for external workspaces"
- detail: explains why + points at admin/secrets workaround +
references issue #2308 for the v0.2 native-support roadmap
- runtime: "external" (machine-readable for clients)
Why 422 not 200/501:
- 422 = Unprocessable Entity — the request is well-formed but the
workspace's runtime can't accept it. Standard REST semantics.
- 200 with empty result would lie; 501 implies the API itself is
unimplemented (it's not — works for non-external workspaces); 503
was the misleading status this PR fixes.
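The early-return contract, roughly (body strings paraphrased from the
list above; names illustrative, and note this gate was later reverted
in #2311):

    if runtime == "external" { // DB lookup happens before any container-find
        w.Header().Set("Content-Type", "application/json")
        w.WriteHeader(http.StatusUnprocessableEntity)
        json.NewEncoder(w).Encode(map[string]string{
            "error":   "file upload not supported for external workspaces",
            "detail":  "see issue #2308; admin/secrets is the workaround",
            "runtime": "external", // machine-readable for clients
        })
        return
    }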
Verified via live E2E against localhost:
- Created `runtime=external,external=true` workspace
- Posted multipart to /workspaces/:id/chat/uploads
- Got 422 with the expected structured body
Unit test (`chat_files_external_test.go`) pins the contract via sqlmock
+ httptest. Notable: the handler is constructed with `templates: nil`
to prove the runtime check happens BEFORE any docker plumbing — if a
future change moves the check below findContainer, the test crashes
on nil-deref instead of silently regressing.
Out of scope (for v0.2 follow-up):
- Native external-workspace file ingest via artifacts table or the
channel-plugin's inbox/ pattern. Requires separate design pass.
Closes #2308
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third snippet alongside externalCurlTemplate / externalPythonTemplate
in workspace-server/internal/handlers/external_connection.go: the new
externalChannelTemplate guides operators through installing the Claude Code
channel plugin (Molecule-AI/molecule-mcp-claude-channel — scaffolded today)
and dropping the .env config for it.
Wires the new snippet into the external-workspace POST /workspaces response
under key `claude_code_channel_snippet`, alongside the existing
`curl_register_template` and `python_snippet`. Canvas's "external workspace
created" modal can render it as a third tab.
CONTRIBUTING.md gains a short "External integrations" section pointing at
the three peer repos (workspace-runtime, sdk-python, mcp-claude-channel)
so contributors know where related runtime artifacts live and to consider
downstream impact when changing the A2A wire shape.
The plugin itself is scaffolded at commit d07363c on the new repo's main
branch; v0.1 is polling-based via the /activity?since_secs= filter shipped
in PR #2300. README + roadmap details there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness runner (scripts/measure-coordinator-task-bounds-runner.sh)
calls `/workspaces/:id/activity?since_secs=$A2A_TIMEOUT` to scope a
trace to a specific test window. The query param was silently
ignored — `ActivityHandler.List` accepted only `type`, `source`, and
`limit`, so the runner got the most-recent-100 events regardless of
how long ago they happened. Works for fresh-tenant tests where
activity_logs is ~empty pre-run, breaks on busy tenants and on tests
that exceed 100 events.
Adds `since_secs` parsing with three behaviors:
- Valid positive int → `AND created_at >= NOW() - make_interval(secs => $N)`
on the SQL. Parameterised; values bound via lib/pq, not interpolated.
`make_interval(secs => $N)` is required — the `INTERVAL '$N seconds'`
literal form rejects placeholder substitution inside the string.
- Above 30 days (2_592_000s) → silently clamped to the cap. Defends
against a paranoid client triggering a multi-month full-table scan
via `since_secs=999999999`.
- Negative, zero, or non-integer → 400 with a structured error, NOT
silently dropped. Silent drop is exactly the bug this is fixing
— a typoed param shouldn't be lost as most-recent-100.
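Roughly, the parse/clamp/bind flow (variable and helper names here are
illustrative, including writeBadRequest):

    const maxSinceSecs = 2_592_000 // 30-day cap
    if raw := r.URL.Query().Get("since_secs"); raw != "" {
        n, err := strconv.Atoi(raw)
        if err != nil || n <= 0 {
            writeBadRequest(w, "since_secs must be a positive integer") // hypothetical helper
            return
        }
        if n > maxSinceSecs {
            n = maxSinceSecs // silent clamp, not an error
        }
        // Bound via a placeholder; the INTERVAL '$N seconds' literal form
        // can't take a placeholder inside the string.
        query += fmt.Sprintf(" AND created_at >= NOW() - make_interval(secs => $%d)", len(args)+1)
        args = append(args, n)
    }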
Tests cover all four paths: accepted (with arg-binding assertion via
sqlmock.WithArgs), clamped at 30 days, invalid rejected (5 sub-cases),
and omitted (verifies no extra clause / arg leak via strict WithArgs
count).
RFC #2251 §V1.0 step 6 (platform-side-transition audit) also depends
on this for time-window filtering of activity_logs.
Closes #2268
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-workspace `restartState` entries (introduced under the name
`restartMu` pre-#2266, renamed to `restartStates` in #2266) are
created via `LoadOrStore` in `workspace_restart.go` but never
deleted. On a long-running platform process serving many short-lived
workspaces (E2E tests, transient sandbox tenants), the sync.Map grows
monotonically — ~16 bytes per workspace ever created.
Fix: call `restartStates.Delete(wsID)` after stopAndRemove +
ClearWorkspaceKeys for each cascaded descendant and the parent. Mirrors
the existing per-ID cleanup loop. `sync.Map.Delete` is safe on absent
keys, so the delete is a no-op for workspaces that were never restarted
(no LoadOrStore call).
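The whole change, roughly (loop shape illustrative; stopAndRemove and
ClearWorkspaceKeys are the existing calls named above):

    for _, id := range append(descendants, parentID) {
        stopAndRemove(id)
        ClearWorkspaceKeys(id)
        restartStates.Delete(id) // safe on absent keys
    }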
This is a pre-existing leak — #2266 did not introduce it; just renamed
the holder. Filing as a separate commit to keep the change minimal and
reviewable.
Closes #2269
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-#2290 `force: true` flag on POST /org/import skipped the
required-env preflight, letting orgs import without their declared
required keys (e.g. ANTHROPIC_API_KEY). The ux-ab-lab incident: that
import path was used, the org shipped without ANTHROPIC_API_KEY in
global_secrets, and every workspace 401'd on the first LLM call.
Per #2290 picks (C/remove/both):
- Q1=C: template-derived required_env (no schema change — already
the existing aggregation via collectOrgEnv).
- Q2=remove: drop the bypass entirely. The seed/dev-org flow that
legitimately needs to skip becomes a separate dry-run-import path
with its own audit trail, not a permission bypass.
- Q3=block-at-import-only: provision-time drift logging is a
follow-up; for this PR, blocking at import is the gate.
Surface change:
- Force field removed from POST /org/import request body.
- 412 \"suggestion\" text drops the \"or pass force=true\" guidance.
- Legacy callers sending {\"force\": true} are silently tolerated
(Go's json.Unmarshal drops unknown fields), so no client-side
breakage; the bypass effect is just gone.
Audited callers in this repo:
- canvas/src/components/TemplatePalette.tsx — never sends force.
- scripts/post-rebuild-setup.sh — never sends force.
- Only external tooling sent force=true. Those callers must now set
the global secret via POST /settings/secrets before importing.
Adds TestOrgImport_ForceFieldRemoved as a structural pin: if a future
change re-adds Force to the body struct, the test fails and forces an
explicit reckoning with the #2290 rationale.
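One way to express that pin (the struct name here is a guess):

    func TestOrgImport_ForceFieldRemoved(t *testing.T) {
        // orgImportRequest is hypothetical; substitute the real body struct.
        if _, found := reflect.TypeOf(orgImportRequest{}).FieldByName("Force"); found {
            t.Fatal("Force re-added to the org-import body; revisit #2290 before reinstating the bypass")
        }
    }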
Closes #2290
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review caught a regression I introduced in #2266: if cycle() panics
(e.g. a future provisionWorkspace nil-deref or any runtime error from
the DB / Docker / encryption stacks it touches), the loop never reaches
`state.running = false`. The flag stays true forever, the early-return
guard at the top of coalesceRestart fires for every subsequent call,
and that workspace is permanently locked out of restarts until the
platform process restarts.
The pre-fix code had exposure too (an unrecovered panic in the restart
goroutine crashed the whole process; deferred unlocks do run during a
panic, but nothing recovered it), and my pending-flag version added a
quieter failure mode on top: the guard is sticky, not ephemeral.
Fix: defer the state-clear so it always runs on exit, including panic.
Recover (and DON'T re-raise) so the panic doesn't propagate to the
goroutine boundary and crash the whole platform process — RestartByID
is always called via `go h.RestartByID(...)` from HTTP handlers, and
an unrecovered goroutine panic in Go terminates the program. Crashing
the platform for every tenant because one workspace's cycle panicked
is the wrong availability tradeoff. The panic message + full stack
trace via runtime/debug.Stack() are still logged for debuggability.
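The shape of the fix (state and locking names assumed):

    defer func() {
        if p := recover(); p != nil {
            // Log and suppress: re-raising here would propagate to the
            // goroutine boundary and take down the whole process.
            log.Printf("restart cycle panic for %s: %v\n%s", wsID, p, debug.Stack())
        }
        state.mu.Lock()
        state.running = false // always cleared, panic or not
        state.mu.Unlock()
    }()
    cycle()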
Regression test in TestCoalesceRestart_PanicInCycleClearsState:
1. First call's cycle panics. coalesceRestart's defer must swallow
the panic — assert no panic propagates out (would crash the
platform process from a goroutine in production).
2. Second call must run a fresh cycle (proves running was cleared).
All 7 tests pass with -race -count=10.
Surfaced via /code-review-and-quality self-review of #2266; the
re-raise-after-recover anti-pattern (originally argued as "don't
mask bugs") came up in the comprehensive review and was corrected
to log-with-stack-and-suppress for availability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>