Self-review (post-merge) flagged that the backfill claimed to be
idempotent on re-run but actually duplicates every row because the
plugin's INSERT uses gen_random_uuid() and ignores any id passed in.
Fix is contract-level: extend MemoryWrite with an optional `id`
idempotency key. When supplied, the plugin MUST treat the write as
upsert keyed on this id; when omitted, the plugin generates a fresh
UUID (production agent commits keep working unchanged).
Changes:
* docs/api-protocol/memory-plugin-v1.yaml: add id field with
description that flags it as idempotency key
* internal/memory/contract/contract.go: add ID to MemoryWrite struct,
update memory_write_minimal golden vector
* internal/memory/pgplugin/store.go: split CommitMemory into two
paths — upsert when body.ID set (INSERT ... ON CONFLICT (id) DO
UPDATE), plain INSERT otherwise
* cmd/memory-backfill/main.go: pass agent_memories.id to MemoryWrite,
fix the false comment about 409 deduplication
New tests:
* pgplugin: TestCommitMemory_WithIDUpserts pins the upsert SQL is
used when id is set; TestCommitMemory_UpsertScanError covers the
error branch
* backfill: TestBackfill_PassesSourceUUIDAsIdempotencyKey pins the
forwarding behavior; TestBackfill_RerunIsIdempotent simulates a
retry and asserts both runs pass the same uuid (plugin upsert is
what makes this safe)
Why this matters: operators retrying a failed backfill (which they
will — networks fail, transactions abort) would otherwise create N
duplicates per memory. The duplicates aren't visible until search
results show obvious dupes — debugging that under prod load is bad.
Production agent commits are unaffected: they leave id empty, the
plugin generates a fresh UUID via gen_random_uuid(), zero behavior
change for the hot path.
CI's pytest harness pre-sets WORKSPACE_ID=test in the env before
test collection, so a2a_client's module-level WORKSPACE_ID
(captured at import time, line 24) holds "test" — but the local
fixture's monkeypatch.setenv("WORKSPACE_ID", ...) only affects the
ENV value seen on later os.environ reads, NOT the already-bound
module attribute.
Assert against a2a_client.WORKSPACE_ID directly so the test is
portable across local + CI runs without monkey-patching the module
itself (which a future test reload might undo).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR-2 of the multi-workspace external-agent stack. PR-1 (#2739)
landed per-workspace auth + heartbeat + inbox. This PR threads
``source_workspace_id`` through the A2A client + tool surface so an
agent registered against multiple workspaces can list peers across
all of them and delegate from a specific source.
Changes
-------
* ``a2a_client``: ``discover_peer``, ``send_a2a_message``,
``get_peers_with_diagnostic``, and ``enrich_peer_metadata`` now
accept ``source_workspace_id``. Routing uses it for both the
X-Workspace-ID header and (transitively, via ``auth_headers(src)``)
the bearer token. Defaults to module-level WORKSPACE_ID for
back-compat.
* ``a2a_client._peer_to_source``: a new lock-free cache mapping each
discovered peer back to the source workspace whose registry
surfaced it. ``tool_list_peers`` populates the cache on every call;
``tool_delegate_task`` consults it for auto-routing.
* ``a2a_tools.tool_list_peers(source_workspace_id=None)``: when
multiple workspaces are registered (MOLECULE_WORKSPACES) and no
explicit source is passed, aggregates peers across every
registered workspace and tags each entry with ``via: <src[:8]>``.
Single-workspace mode is unchanged — no ``via:`` annotation, same
output shape.
* ``a2a_tools.tool_delegate_task`` and ``tool_delegate_task_async``
resolve source via ``source_workspace_id arg → _peer_to_source[target]
→ WORKSPACE_ID``. Agents almost never need to specify ``source_*``
explicitly — call ``list_peers`` first and the cache handles the
rest.
* ``tool_delegate_task_async`` idempotency key now includes the
source workspace, so the same task delegated from two registered
workspaces produces two distinct delegations (the right behavior
— one per tenant audit trail).
* ``platform_auth.list_registered_workspaces()``: new helper for the
tool layer to enumerate the multi-ws registry. Lock-free reads
matched by the existing single-writer-per-workspace contract from
PR-1.
* ``platform_auth.self_source_headers``: now passes ``workspace_id``
through to ``auth_headers`` — without this, a multi-workspace POST
source-tagged with ``X-Workspace-ID=ws_b`` was authenticating
with ws_a's token (or no token if MOLECULE_WORKSPACE_TOKEN unset).
Latent PR-1 bug exposed by the new tool surface.
* ``a2a_mcp_server`` tool dispatch passes ``source_workspace_id``
from the tool call arguments.
* ``platform_tools.registry``: add ``source_workspace_id`` to the
delegate_task, delegate_task_async, check_task_status, list_peers
input schemas with copy explaining when to use it (rarely — the
cache handles it).
Tests (15 new, all passing)
---------------------------
``test_a2a_multi_workspace.py``:
* TestDiscoverPeerSourceRouting (3): src arg drives header+token,
fallback to module ws when omitted, invalid target short-circuits
before any HTTP attempt.
* TestSendA2AMessageSourceRouting (1): X-Workspace-ID source header
+ Authorization bearer both come from the source arg via the
patched self_source_headers chain.
* TestGetPeersSourceRouting (1): URL path AND headers use the
source workspace id.
* TestToolListPeersAggregation (4): aggregates across multiple
registered workspaces, tags origin, leaves single-workspace path
unchanged, explicit src arg overrides aggregation, diagnostic
joining when every workspace returns empty.
* TestToolDelegateTaskAutoRouting (3): cache-driven auto-route,
explicit override beats cache, single-workspace fallback to
module WORKSPACE_ID.
* TestListRegisteredWorkspaces (3): registry enumeration helper.
Plus ``tests/snapshots/a2a_instructions_mcp.txt`` regenerated to
absorb the new ``source_workspace_id`` schema entries.
Back-compat
-----------
Every change defaults ``source_workspace_id=None``; legacy
single-workspace operators (no MOLECULE_WORKSPACES) see identical
behavior — same URLs, same headers, same tool output. The 24
PR-1 tests + 125 existing A2A tests all still pass.
Out of scope (PR-3)
-------------------
Memory namespacing per registered workspace lands after the new
memory system v2 PR (#2740) settles in production.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Final implementation PR. Builds on PR-1..10 (all merged or queued).
Proves the central design property of the plugin contract: ANY
plugin satisfying the v1 OpenAPI spec works as a drop-in replacement
for the built-in postgres plugin. If this test fails after a refactor,
the contract has drifted in a way that breaks ecosystem plugins.
What ships:
* internal/memory/e2e/swap_test.go — five E2E tests against a
deliberately minimal "flat-memory" stub plugin (~50 LOC, single
map, zero capabilities)
* MCPHandler.Dispatch — small exported wrapper around dispatch so
out-of-package E2E tests can drive tools by name without
duplicating the whole MCP RPC stack
E2E coverage:
* TestE2E_FlatPluginRoundTrip: full lifecycle
- list_writable_namespaces returns 3 entries
- commit_memory_v2 writes through plugin
- search_memory finds it back
- commit_summary writes a summary
- forget_memory deletes
- search after forget excludes the deleted memory
* TestE2E_LegacyShimRoutesThroughFlatPlugin: PR-6 shim wired up
- Legacy commit_memory(scope=LOCAL) ends up in plugin storage
- Legacy recall_memory finds it back through plugin search
- Response shapes preserved (scope:LOCAL stays scope:LOCAL)
* TestE2E_OrgMemoriesDelimiterWrap: prompt-injection mitigation
- Org-namespace memory committed
- Audit INSERT into activity_logs verified
- Search returns content with [MEMORY id=... scope=ORG ns=...]
prefix applied
* TestE2E_StubPluginCapabilitiesAreEmpty: capability negotiation
- Stub plugin reports zero capabilities
- Client.SupportsCapability returns false for FTS, embedding
- Confirms graceful degradation when plugin doesn't support a
feature
* TestE2E_PluginUnreachable_AgentSeesClearError: failure surface
- Plugin URL pointing at bogus port
- commit_memory_v2 returns informative error
- No nil-pointer dereference; error message is actionable
The flat plugin is intentionally minimal — it has no namespaces table
distinct from memory records, no FTS, no semantic search, no TTL. The
test proves operators can drop in a 50-line plugin and the agent
behavior is identical (modulo capability-gated features).
Builds on merged PR-1..7 (PR-8 in queue). Pure docs; no code.
What ships:
* docs/memory-plugins/README.md — contract overview, capability
negotiation, deployment models, replacement workflow
* docs/memory-plugins/testing-your-plugin.md — using the contract
test harness to validate wire compatibility, what the harness
DOES NOT cover (capability accuracy, TTL eviction, concurrency)
* docs/memory-plugins/pinecone-example/README.md — worked example
of a Pinecone-backed plugin: capability mapping (only embedding,
no FTS), wire mapping (memory → vector + metadata), production-
hardening checklist
Documentation strategy:
* Lead with what workspace-server takes care of (security perimeter,
redaction, ACL, GLOBAL audit, prompt-injection wrap) so plugin
authors don't reimplement those layers
* Show three deployment models (same machine / separate container /
self-managed) so operators see their topology
* Capability table makes it explicit what each capability gates so
a plugin that supports only one (e.g. semantic search) is still
a useful plugin
* Pinecone example is honest: shows the skeleton, the wire mapping,
and explicitly calls out what's MISSING from the sketch (batch
commits, TTL janitor, circuit breaker, metrics)
Resolves three github-code-quality threads blocking PR-2739 merge:
- workspace/tests/test_mcp_cli_multi_workspace.py: remove unused
`import os` and `from unittest.mock import patch` (left over from
an earlier test draft that mocked at the os.environ layer).
- workspace/mcp_cli.py:523: replace bare `pass` in the
register_workspace_token ImportError handler with a debug log line +
one-line comment explaining the silent-degrade contract (older
installs that don't yet ship the helper fall back to the legacy
single-token path; single-workspace operators see no behavior
change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on merged PR-1..7. Adds the operator-controlled cutover flag
that flips admin export/import from the legacy direct-DB path to the
v2 plugin path.
Activation: MEMORY_V2_CUTOVER=true AND the v2 plugin is wired via
WithMemoryV2. Both must be true to take the new path; either being
false falls through to the existing legacy SQL code unchanged.
What ships:
* AdminMemoriesHandler gains plugin + resolver fields, wired via
WithMemoryV2 (production) / withMemoryV2APIs (tests)
* Export: enumerates workspaces, asks resolver for each one's
readable namespaces, searches each via plugin, deduplicates by
memory id, applies SAFE-T1201 redaction on emitted content
(F1084 parity). Returns the legacy memoryExportEntry shape so
existing tooling keeps working.
* Import: scope→namespace translation mirrors PR-6 shim. Uses
UpsertNamespace + CommitMemory; runs SAFE-T1201 redaction
BEFORE the plugin sees the content (F1085 parity).
* Helpers: legacyScopeFromNamespace + namespaceKindFromLegacyScope
(lifted out so admin_memories doesn't depend on MCP handler
helpers). skipImport typed error.
Operational rollout (cutover sequencing):
1. Today: MEMORY_V2_CUTOVER unset → legacy DB path.
2. After PR-7 backfill applied + smoke verified: operator sets
MEMORY_V2_CUTOVER=true.
3. From that point, admin export/import operate on plugin
storage; legacy agent_memories table is read-only for the
~60-day grace window before PR-9 drops it.
Coverage on new paths:
* cutoverActive: 100%
* WithMemoryV2 / withMemoryV2APIs: 100%
* importViaPlugin: 100%
* exportViaPlugin: 97.2% (one defensive scan-error branch in the
workspace-list loop)
* scopeToWritableNamespaceForImport: 76.9% (resolver-error and
no-matching-kind branches exercised end-to-end via Import)
* legacyScopeFromNamespace + namespaceKindFromLegacyScope: 100%
Edge cases pinned:
* Cutover flag matrix (env unset/true/false × wired/unwired)
* Export deduplicates memories shared across team (one row per id)
* Export tolerates per-workspace failures (resolver / plugin) and
keeps going on the rest
* Export returns 500 only when the top-level workspace query fails
* Empty readable namespaces → empty export (no panic)
* Export redacts secrets in plugin path
* Import: unknown workspace skipped, unknown scope skipped,
plugin upsert/commit errors counted as errors
* Import redacts secrets BEFORE plugin sees content
* Legacy export/import path unchanged when cutover flag unset
PR-1's auth_headers added an optional workspace_id parameter for
multi-workspace token routing; the signature drift gate
(test_platform_auth_signature_matches_snapshot) caught the change as
expected. Snapshot regenerated to capture the new shape — diff is
visible in the PR for reviewers + template repos that depend on this
surface.
Behavior unchanged: auth_headers() with no arg still routes through
the legacy resolution path (back-compat exact); the workspace_id arg
is opt-in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External MCP agents (e.g. Claude Code installed on a company PC) can
now register against MULTIPLE workspaces from a single process — the
agent participates as a peer in workspace A (company) AND workspace B
(personal) simultaneously, with one merged inbox tagged so replies
route to the correct tenant.
Use case (verbatim from operator): "I have this computer AI thats in
company's PC, he is going to be put in company's workspace, but
personally, I want to register it to my own workspace as well, so
that I can talk to it and asking him to do work."
## What changed
**Wire format** — new env var:
MOLECULE_WORKSPACES='[
{"id":"<company-wsid>","token":"<company-tok>"},
{"id":"<personal-wsid>","token":"<personal-tok>"}
]'
When set, mcp_cli iterates the array and spawns one (register +
heartbeat + inbox poller) trio per workspace. Single-workspace mode
(WORKSPACE_ID + MOLECULE_WORKSPACE_TOKEN) is unchanged — every
existing operator's setup keeps working bit-for-bit.
**Per-workspace token registry** (platform_auth.py):
register_workspace_token(wsid, tok) — populated by mcp_cli once
per workspace before any thread spawns; thread-safe registration
+ lock-free reads on the hot path. auth_headers(workspace_id=...)
routes to the per-workspace token; auth_headers() with no arg
uses the legacy resolution path unchanged (back-compat).
**Per-workspace inbox cursors** (inbox.py):
InboxState now supports cursor_paths={wsid: Path,...}. Each poller
advances its own cursor — one workspace's slow poll can't stall
another, and a 410 only resets the affected workspace's cursor.
Single-workspace constructor (cursor_path=Path(...)) still works
exactly as before via __post_init__ promotion to the empty-string
key. Cursor filenames disambiguated by workspace_id[:8] when
multi-workspace; single-workspace keeps the legacy filename so
upgrade doesn't invalidate on-disk state.
**Arrival workspace tagging** (inbox.py):
InboxMessage.arrival_workspace_id — tells the agent which OF ITS
workspaces the inbound message arrived on. Set by the poller from
the cursor key. to_dict() omits the field when empty so single-
workspace consumers see no shape change.
**Reply routing** (a2a_tools.py + a2a_mcp_server.py + registry.py):
send_message_to_user(workspace_id=...) — optional override that
selects which workspace's /notify endpoint to POST to (and which
token authenticates). Multi-workspace agents pass the inbound
message's arrival_workspace_id; single-workspace agents omit it
and route to the only registered workspace via the legacy URL.
## Out of scope (future PRs)
- PR-2: cross-workspace delegation auto-routing — when an agent
receives a request from personal-ws "delegate to ops-bot" and
ops-bot lives in company-ws, the agent should auto-pick its
company-ws identity for the outbound delegate_task. Today the
agent must pass via_workspace explicitly (or fall through to
primary workspace).
- PR-3: memory namespacing — commit_memory() still writes to the
primary workspace's memory regardless of inbound context. Will
revisit when the new memory system (PR #2733 just landed) settles.
## Tests
workspace/tests/test_mcp_cli_multi_workspace.py — 24 new tests:
* MOLECULE_WORKSPACES JSON parsing (valid + 6 error shapes)
* Token registry register / lookup / rotation / clear
* auth_headers routing by workspace_id with legacy fallback
* Per-workspace cursor save/load/reset isolation
* arrival_workspace_id present-when-set, omitted-when-empty
* default_cursor_path namespacing
All 110 pre-existing tests in test_mcp_cli.py / test_inbox.py /
test_platform_auth.py still pass — back-compat is mechanical.
Refs: project memory entry "External agent multi-workspace
registration", design questions answered 2026-05-04 by user
(JSON env var; explicit memory writes deferred to PR-3).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on merged PR-1..6. Operator runs this once at cutover to copy
agent_memories rows into the v2 plugin's storage.
Usage:
memory-backfill -dry-run # count + diff, no writes
memory-backfill -apply # actually copy
memory-backfill -apply -limit=10000 # cap rows per run
memory-backfill -apply -workspace=<uuid> # one workspace only
Required env: DATABASE_URL + MEMORY_PLUGIN_URL.
Translation matches the PR-6 legacy shim:
LOCAL → workspace:<workspace_id>
TEAM → team:<root_id> (resolved via the same namespace.Resolver
the runtime uses)
GLOBAL → org:<root_id>
Idempotent: each row is keyed by its UUID; re-running the backfill
does not duplicate writes (plugin handles deduplication).
What ships:
* cmd/memory-backfill/main.go: CLI entry, run() driver,
backfill() workhorse, mapScopeToNamespace + namespaceKindFromString
helpers
* main_test.go: 100% on the functional logic (mapScopeToNamespace,
namespaceKindFromString, backfill(), all CLI validation paths)
Coverage: 80.2% of statements. The 19.8% gap is main()'s body
(log.Fatalf — not unit-testable) and run()'s real-DB integration
(sql.Open + db.PingContext + new client/resolver — requires a live
postgres). Integration coverage for this path lives in PR-11
(E2E plugin-swap test).
Edge cases pinned (in functional logic):
* Every legacy scope → namespace mapping
* Unknown scope → skip with diagnostic, increment skipped counter
* Resolver error → propagate, abort run
* No-matching-kind in writable list → skip with error message
* Plugin UpsertNamespace error → increment errors, continue
* Plugin CommitMemory error → increment errors, continue
* Query error → propagate, abort
* Scan error → increment errors, continue
* Mid-iteration row error → propagate, abort
* Workspace filter passes through to SQL WHERE clause
* Dry-run mode never calls plugin
* CLI: rejects both/neither modes, missing env vars, bad flags
Builds on merged PR-1..5. Adds the bridge that lets legacy
commit_memory / recall_memory tools route through the v2 plugin path
when MEMORY_PLUGIN_URL is wired, otherwise fall through to the
existing DB-backed code unchanged.
What ships:
* handlers/mcp_tools_memory_legacy_shim.go — translation helpers:
scopeToWritableNamespace, scopeToReadableNamespaces,
commitMemoryLegacyShim, recallMemoryLegacyShim,
namespaceKindToLegacyScope
* handlers/mcp_tools.go — toolCommitMemory + toolRecallMemory now
delegate to the shim when memv2 is wired
Translation:
commit: LOCAL → workspace:<self>
TEAM → team:<root> (resolver picks at runtime)
empty → defaults to LOCAL (preserves legacy default)
GLOBAL → still rejected at MCP bridge (C3 preserved)
recall: LOCAL → search restricted to workspace:<self>
TEAM → workspace:<self> + team:<root>
empty → all readable (matches v2 default behavior)
GLOBAL → blocked at MCP bridge (C3 preserved)
Response shapes are preserved exactly:
commit: {"id":"...","scope":"LOCAL"|"TEAM"} — agents see no diff
recall: [{"id":"...","content":"...","scope":"LOCAL"|...,"created_at":"..."}, ...]
org-namespace memories get the same [MEMORY id=... scope=ORG ns=...]
prefix as v2 search; legacy scope label comes back as "GLOBAL"
Operational rollout:
* Today: MEMORY_PLUGIN_URL unset on most operators → legacy DB path
* After PR-7 backfill: operators set MEMORY_PLUGIN_URL → all writes
flow through plugin transparently
* After PR-8 cutover: dual-write removed, plugin is the only path
* After PR-9 (~60 days later): legacy tool entries dropped entirely
Coverage: 100% on every helper, 100% on recallMemoryLegacyShim,
94.7% on commitMemoryLegacyShim. The 1 uncovered line is a defensive
guard against a v2-response-parse error that's unreachable when the
v2 tool is operating correctly (it always returns valid JSON).
Edge cases pinned:
* scope translation for every legacy value + invalid scope
* resolver error propagation
* plugin error propagation
* GLOBAL still blocked
* default-scope fallback (LOCAL)
* empty content rejected
* No-op when v2 unwired (legacy SQL path exercised via sqlmock)
* org-namespace memory wrap on recall + GLOBAL scope label round-trip
* No-results returns "No memories found." (legacy message preserved)
Builds on PR-1, PR-2, PR-3, PR-4 (all merged). Adds the agent-facing
v2 surface for the memory plugin contract.
What ships (all in handlers/mcp_tools_memory_v2.go, no edits to
the legacy commit_memory / recall_memory paths):
commit_memory_v2 — write to a namespace; default workspace:self
search_memory — search across namespaces; default = all readable
commit_summary — kind=summary, 30-day default TTL, runtime-overridable
list_writable_namespaces — discover what you can write to
list_readable_namespaces — discover what you can read from
forget_memory — delete by id, only in namespaces you can write to
Workspace-server is the security perimeter — every layer the plugin
mustn't be trusted with runs here:
* SAFE-T1201 redactSecrets BEFORE every plugin write
* Server-side ACL re-validation: CanWrite + IntersectReadable run
on EVERY request, never trusting client-supplied namespaces (a
canvas re-parent between list_writable and commit would otherwise
let a stale namespace slip through)
* org:* writes audited to activity_logs (SHA256, not plaintext) —
matches memories.go:201-221 so the schema stays uniform
* Audit failure does NOT block the write (logged + continue) —
failing closed would deny org-scope writes whenever activity_logs
is unhappy
* org:* memories get the [MEMORY id=... scope=ORG ns=...]: prefix
on read — preserves the prompt-injection mitigation from
memories.go:455-461
Coexistence design: legacy commit_memory + recall_memory still wired
to their old code paths in mcp_tools.go. PR-6 will alias them to
delegate to these v2 implementations. PR-9 (60 days post-cutover)
removes the legacy entries.
Wiring:
* MCPHandler gains an memv2 field (nil-safe; tools return a clear
error when MEMORY_PLUGIN_URL is unset rather than crashing)
* WithMemoryV2(plugin, resolver) is the production wiring API
main.go calls at boot
* withMemoryV2APIs(plugin, resolver) is the test-injectable variant
against the memoryPluginAPI / namespaceResolverAPI interfaces
Coverage: 100.0% on every new function in mcp_tools_memory_v2.go.
Edge cases pinned:
* empty/whitespace content → reject before plugin
* plugin unconfigured → clear error, no crash
* ACL violation → clear error
* resolver error → wrapped error
* plugin error → wrapped error
* malformed expires_at → silently ignored (no exception)
* org write audit failure → logged, write proceeds
* search namespace intersection drops foreign entries
* search with all-foreign namespaces → empty result, plugin not called
* search org memories get delimiter wrap, workspace memories do not
* forget with explicit + default namespace
* forget cross-scope rejected
* pickStr / pickStringSlice handle missing keys, wrong types, mixed slices
* wrapOrgDelimiter format is exact-match
* dispatch wires all 6 tools (no "unknown tool" error)
Builds on merged PR-1 (#2729), independent of PR-2/PR-4.
Implements every endpoint of the v1 plugin contract behind an HTTP
server (cmd/memory-plugin-postgres/) backed by postgres. Operators
run this binary next to workspace-server; it's the default
implementation MEMORY_PLUGIN_URL points at.
What ships:
- cmd/memory-plugin-postgres/main.go: boot, signal-driven shutdown,
boot-time migrations, configurable LISTEN/DATABASE/MIGRATION_DIR
- cmd/memory-plugin-postgres/migrations/001_memory_v2.up.sql:
memory_namespaces (PK on name, kind CHECK, expires_at, metadata)
memory_records (FK to namespaces with CASCADE, kind+source CHECK,
pgvector embedding, FTS tsvector, ivfflat partial
index on embedding, partial index on expires_at)
- internal/memory/pgplugin/store.go: storage layer using lib/pq
- internal/memory/pgplugin/handlers.go: HTTP layer (no router dep —
a switch on URL.Path keeps the binary's dep surface tiny)
- 100% statement coverage on store.go + handlers.go
Schema notes:
- These tables live next to the plugin binary, NOT in workspace-
server/migrations/. When operators swap the plugin, these tables
become orphaned (operator drops manually). Documented in PR-10.
- Search supports semantic (pgvector cosine) → FTS (>=2 char query)
→ ILIKE (1-char query) → recent-listing (no query), with a TTL
filter applied uniformly across all paths.
- DELETE on namespace cascades to memory_records (FK ON DELETE
CASCADE) — a deleted namespace immediately frees its memories.
Coverage corner cases pinned:
- Health: ok, degraded (db ping fails), no-ping fn
- Every CRUD endpoint: happy path, bad name, bad JSON, bad body,
not-found, store errors, exec/scan/marshal errors
- Search: FTS, semantic, short-query (ILIKE), no-query (recent),
kinds filter, store errors, scan errors, mid-iteration row error
- Routing edge cases: unknown path, empty namespace, unknown sub,
method-not-allowed, GET on /v1/health (allowed), POST on /v1/health
(404), GET on /v1/search (404)
- Helper internals: marshalMetadata (nil/happy/unmarshalable),
nullTime (nil/non-nil), vectorString (empty/format),
nullVectorString (empty/non-empty), scanNamespace +
scanMemory metadata-decode errors
No callers in workspace-server yet; integration starts in PR-5
(MCP handlers wire the plugin client through to MCP tools).
Stacked on PR-1 (#2729). Computes the readable/writable namespace lists
for a workspace from the live workspaces tree at request time. No
precomputed columns, no migrations — re-parenting on canvas takes
effect immediately on the next memory call.
What ships:
- workspace-server/internal/memory/namespace/resolver.go
- walkChain: recursive CTE, walks parent_id chain to root, capped
at depth 50 to defend against malformed/cyclic data
- derive: maps a chain to (workspace, team, org) namespace strings
- ReadableNamespaces / WritableNamespaces: the public API
- CanWrite + IntersectReadable: server-side ACL helpers MCP
handlers (PR-5) will call before talking to the plugin
- resolver_test.go: 100% statement coverage
Design choices worth flagging:
- Today's tree is depth-1 (root + children). The recursive CTE
handles arbitrary depth so we don't have to revisit the resolver
when the tree deepens.
- GLOBAL→org write restriction (memories.go:167-174) is preserved
by gating the org namespace's Writable flag on parent_id IS NULL.
- Removed-status workspaces are NOT filtered from the chain walk —
matches today's TEAM behavior (memories.go:367-372 filters on
read, not on tree walk).
- IntersectReadable with empty `requested` returns ALL readable
namespaces (default-search-everything semantic from the discovery
tools spec).
This package has zero callers in this PR; integration starts in PR-5.
Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 /
25322499952) were all blown by the workflow timeout despite the
underlying tenant boot completing successfully (PR molecule-controlplane#455
fix verified — boot events all reach `boot_script_finished/ok`).
Why the budget was wrong:
The tenant user-data install phase runs apt-get update + install of
docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on
every tenant boot — none of it is pre-baked into the tenant AMI
(EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04). Empirical
fetch_secrets/ok timing across today's canaries:
51s debug-mm-1777888039 (09:47Z)
82s 25319625186 (12:42Z)
143s 25320942822 (13:11Z)
625s 25322499952 (13:43Z)
Same EC2_AMI, same instance type (t3.small), same user-data install
sequence — variance is entirely apt-mirror tail latency. A 12-min job
budget leaves only ~2 min for the workspace on slow-apt days; the
workspace itself needs ~3.5 min for claude-code cold boot, so the
budget is structurally too tight whenever apt is slow.
20 min absorbs even the 10+ min boot worst-case and still leaves the
workspace its full ~7 min budget. Cap stays well under the runner's
6-hour ubuntu-latest job ceiling.
Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot
phase is no-ops on cached pkgs (will file controlplane#TBD as
follow-up — packer/install-base.sh today only bakes the WORKSPACE thin
AMI, not the tenant AMI; tenants always boot from raw Ubuntu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on PR-1 (#2729). Implements every endpoint in the OpenAPI spec
plus two operational concerns the agent never sees:
1. Capability negotiation. Boot/Refresh probes /v1/health and
captures the plugin's capability list. MCP handlers (PR-5) ask
SupportsCapability before exposing capability-gated features —
e.g., agents can only request semantic search when "embedding"
is reported.
2. Circuit breaker. Three consecutive failures open the breaker for
60 seconds; while open, calls fail fast with ErrBreakerOpen.
Picked these constants because:
- 3 failures: long enough to skip transient blips, short enough
to react before all in-flight handlers stack on the timeout
- 60s cooldown: long enough to back off a flapping plugin,
short enough that recovery is felt within a single session
4xx responses do NOT count toward the breaker (those are client
bugs, not plugin health issues); 5xx + transport errors do.
What ships:
- workspace-server/internal/memory/client/client.go
- client_test.go: 100% statement coverage
Coverage corner cases pinned:
- env-var success branches in New (parseDurationEnv applied)
- json.Marshal error (via channel in Propagation)
- http.NewRequestWithContext error (via unbalanced bracket in BaseURL)
- 204 NoContent on endpoint that normally has a body
- 4xx vs 5xx breaker behavior (4xx must NOT trip)
- breaker cooldown elapsed → reset on next success
- all 6 public endpoints fail-fast when breaker is open
This package has no callers in this PR; integration starts in PR-5.
First of 11 PRs implementing the memory-system plugin refactor (RFC #2728).
This PR is pure additive scaffolding — no behavior change, no integration
yet. It defines the wire shape between workspace-server and a memory
plugin so PR-2 (HTTP client) and PR-3 (built-in postgres plugin) can be
built against a single source of truth.
What ships:
- docs/api-protocol/memory-plugin-v1.yaml: OpenAPI 3.0.3 spec covering
/v1/health, namespace upsert/patch/delete, memory commit, search,
forget. Auth-free (private network only); workspace-server is the
only sanctioned client and the security perimeter.
- workspace-server/internal/memory/contract: typed Go bindings with
Validate() methods on every wire object so both client (PR-2) and
server (PR-3) self-check at the boundary.
- Round-trip JSON tests for every type (catch asymmetric tag bugs).
- 5 golden vector files under testdata/ pinning the exact wire shape;
update via UPDATE_GOLDENS=1.
Coverage: 100% of statements in contract.go.
The validation rules encode design decisions worth flagging in review:
- SearchRequest with empty Namespaces is REJECTED at plugin level —
workspace-server is required to intersect the readable set
server-side; an empty list reaching the plugin is a bug.
- NamespacePatch with no fields is REJECTED — empty patches are
pointless round-trips.
- MemoryWrite with whitespace-only Content is REJECTED — zero-info
memories pollute search results.
No code yet calls into this package; integration starts in PR-2.
Change cron from '10,30,50' (3 fires/hour) to '2,12,22,32,42,52'
(6 fires/hour). All new slots are 1-3 min away from any other
cron, avoiding both the cf-sweep collisions (:15, :45) and the
:30 heavy slot (canary-staging /30, sweep-aws-secrets,
sweep-stale-e2e-orgs every :15).
Why: empirically 2026-05-04 the canary fired only once per hour
on the 10,30,50 schedule (see #2726). Bumping fires-per-hour
gives more chances to land a survived fire under GH's load-
related drop ratio, and keeping all slots in clean lanes
minimizes the per-fire drop probability.
At empirically-observed ~67% drop ratio, 6 attempts/hour yields
~2 effective fires = ~30 min cadence; closer to the 20-min
target than the current shape and provides a real degradation
alarm if drops get worse.
Cost: ~$0.50/day → ~$1/day. Negligible.
Closes#2726.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User-reported 2026-05-04: deploying a team org-template ("Design
Director" + 6 sub-agents) on a SaaS tenant produced 7-of-7
WORKSPACE_PROVISION_FAILED with the misleading message
"container started but never called /registry/register". Diagnose
returned "docker client not configured on this workspace-server" and
the workspace rows had no instance_id.
Root cause: TeamHandler.Expand hardcoded h.wh.provisionWorkspace —
the Docker leg of WorkspaceHandler. WorkspaceHandler.Create branched
on h.cpProv to pick CP-managed EC2 (SaaS) vs local Docker
(self-hosted), but Expand never used that branch. On SaaS the docker
goroutine ran but had no socket, so children silently sat in
"provisioning" until the 600s sweeper marked them failed.
Architectural principle (user): templates own
runtime/config/prompts/files/plugins; the platform owns where it
runs. Backend selection belongs in one helper.
Fix:
- Extract WorkspaceHandler.provisionWorkspaceAuto: picks CP when
cpProv is set, Docker when only provisioner is set, returns false
when neither (caller marks failed).
- WorkspaceHandler.Create routes through Auto.
- TeamHandler.Expand routes through Auto.
Tests pin three invariants:
- TestProvisionWorkspaceAuto_NoBackendReturnsFalse — Auto signals
fall-through correctly so the caller can persist + mark-failed.
- TestProvisionWorkspaceAuto_RoutesToCPWhenSet — when cpProv is
wired, Start lands on CP (the user-visible regression target).
Discipline-verified: removing the cpProv branch fails this.
- TestTeamExpand_UsesAutoNotDirectDockerPath — source-level guard
against future refactors reintroducing the hardcoded Docker call.
Discipline-verified: reverting team.go fails this with a clear
message naming the bug class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review on PR #2723 caught a coverage gap: the existing
"visibility gate" describe block actually tested cadence (10s/30s
timing), not the gate itself. If a refactor dropped the
`if (!visible) return` line, the cadence test would still pass
because the effect would still fire every 30s — the regression would
silently ship.
New test renders with comms-returning mock so the panel renders, clicks
the close button, advances 60s, asserts no further fetches occur.
Discipline-verified: removed `if (!visible) return` from the source,
test fails as expected. Restored, test passes.
Same failure mode as PR #434 (test asserted broken behavior) — pin
what you claim to fix, not the easy substring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User report 2026-05-04: 8+ workspace tenant (Design Director + 6 sub-agents
+ 3 standalones) saw sustained 429s in canvas console hitting
/workspaces/<id>/activity?limit=5. Server-side rate limit is 600 req/min/IP.
Three compounding issues in CommunicationOverlay:
1. Polled regardless of visibility — collapsed panel still hammered the API
2. 10s cadence — 6 req every 10s = 36 req/min from this overlay alone
3. Fan-out cap of 6 workspaces — scaled linearly with workspace count
Fix:
- Gate setInterval on `visible` (effect re-runs when collapsed/expanded)
- Cadence 10s → 30s
- Fan-out cap 6 → 3
Combined: ~36 req/min worst case → 6 req/min worst case (6x reduction),
0 req/min when collapsed.
Tests:
- Fan-out cap: 6 online nodes mounted → exactly 3 fetches (was 6)
- Offline gate: offline workspace never polled
- Cadence: timer at 10s = no new fetch; timer at 30s = next batch fires
Each test would fail if the corresponding dial regressed.
Follow-up (out of scope): structurally right fix is to consume the
WORKSPACE_ACTIVITY WS broadcast instead of polling per-workspace. Server
already publishes the events; canvas just isn't subscribing yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live-probed user's tenant: three of three external-runtime workspaces
register with delivery_mode = NULL, not "poll". The earlier narrow
poll-only check fell through to the misleading 503 for the actually-
observed shape.
Invariant we want: URL empty + not-exactly-"push" → no dispatch path
will ever exist → 422. Only push-mode with empty URL is genuinely
transient (mid-boot, restart in progress) → 503.
Added TestChatUpload_NullModeEmptyURL using the user's actual workspace
ID. Existing TestChatUpload_NoURL switched to explicit "push" mode
(was relying on default — unsafe given the new branching).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External-runtime workspaces that register in poll mode have no callback
URL by design — the platform never dispatches to them, so chat upload
(HTTP-forward by design) can't proceed. Returning 503 + "workspace url
not registered yet" was misleading: the "yet" implied transient state,
but the URL would never arrive.
Caught externally on 2026-05-04: user uploading an image to an external
"mac laptop" runtime workspace saw the 503 and assumed they should
retry. The workspace's poll mode meant retrying would never help.
Fix: include delivery_mode in the workspace lookup. When URL is empty:
- poll mode → 422 + "re-register in push mode with a public URL"
(Unprocessable Entity — this request can't succeed against this
workspace's configuration; no retry will help)
- push mode → 503 + "not registered yet" (genuine transient state —
retry after next heartbeat is correct)
Test: TestChatUpload_PollModeEmptyURL pins the new 422 path; existing
TestChatUpload_NoURL strengthened to assert the "not registered yet"
substring stays on the push branch (it would have silently passed if
the new 422 path had clobbered both branches).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>