RFC: Memory SSOT consolidation — v2 plugin only, freeze v1, drop K/V #1733

Open
opened 2026-05-23 19:55:16 +00:00 by hongming · 2 comments

Summary

Make the v2 memory-plugin contract the single source of truth for agent memory across the platform. Remove the silent v1 SQL fallback in workspace-server, freeze writes to agent_memories, and deprecate the K/V workspace_memory surface into v2 (kind=kv).

Phase-1 evidence

What exists today

  • v1: direct SQL on agent_memories (migration 008_agent_memories.sql, FTS in 017, pgvector in 031). Used when MEMORY_PLUGIN_URL is unset on workspace-server.
  • v2: HTTP plugin contract per RFC #2728. Spec at docs/api-protocol/memory-plugin-v1.yaml. Default impl at workspace-server/cmd/memory-plugin-postgres/. Owns its own memory_records table.
  • Legacy shim at workspace-server/internal/handlers/mcp_tools_memory_legacy_shim.go:
    • When memoryV2Available() == nil → translates commit_memory / recall_memory scopes to v2 namespaces and forwards to the plugin.
    • When v2 not wired → falls back to direct SQL on agent_memories (see mcp_tools.go:362 and :405).
  • Circuit breaker in workspace-server/internal/memory/client/client.go: after 3 consecutive plugin failures, opens for 60s and routes calls back to v1 SQL.
  • K/V at workspace_memory (migration 006, versioning in 023): orthogonal to v1/v2; handlers in memory.go:30/62/106 go straight to Postgres regardless of v2 wiring.
  • session-search at activity.go:623: UNION over activity_logs + agent_memories. Reads v1 directly, never v2.
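
The breaker behavior described in the list above (three consecutive failures, then a 60 s open window) can be sketched as a small state machine. This is a minimal illustration, not the code in client.go: the type, method, and helper names are hypothetical, and the sketch covers only the open/close logic, not the routing to v1 SQL that happens while the breaker is open.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Breaker is a hypothetical consecutive-failure circuit breaker.
type Breaker struct {
	threshold int           // consecutive failures before opening (3 in the text)
	cooldown  time.Duration // how long the breaker stays open (60 s in the text)
	failures  int
	open      bool
	openedAt  time.Time
}

func NewBreaker() *Breaker {
	return &Breaker{threshold: 3, cooldown: 60 * time.Second}
}

// Allow reports whether a plugin call may be attempted at time now.
// An open breaker closes again once the cooldown has elapsed.
func (b *Breaker) Allow(now time.Time) bool {
	if b.open && now.Sub(b.openedAt) >= b.cooldown {
		b.open, b.failures = false, 0
	}
	return !b.open
}

// Record notes the outcome of a plugin call made at time now.
func (b *Breaker) Record(now time.Time, err error) {
	if err == nil {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.open, b.openedAt = true, now
	}
}

// allowedAfter is a demo helper: it records n consecutive failures at t=0
// and reports whether a call is allowed the given number of seconds later.
func allowedAfter(n, seconds int) bool {
	b := NewBreaker()
	t0 := time.Unix(0, 0)
	for i := 0; i < n; i++ {
		b.Record(t0, errors.New("plugin unreachable"))
	}
	return b.Allow(t0.Add(time.Duration(seconds) * time.Second))
}

func main() {
	fmt.Println(allowedAfter(2, 0))  // true: below threshold
	fmt.Println(allowedAfter(3, 59)) // false: breaker still open
	fmt.Println(allowedAfter(3, 60)) // true: cooldown elapsed
}
```

Passing the clock in as an argument keeps the sketch deterministic; the real client may read the time differently.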

Production state — the real surprise

Verified via Railway GraphQL on 2026-05-23, project molecule-platform (7ccc8c68-61f4-42ab-9be5-586eeee11768), service controlplane (ae76f064-6e59-4013-aa6c-1207ecf8c291):

| Environment | MEMORY_PLUGIN_URL | MEMORY_V2_CUTOVER | AWARENESS_* |
|---|---|---|---|
| production (env 59227671-…) | unset | unset | unset |
| staging (env 639539ec-…) | unset | unset | unset |

Implication: the v2 cutover documented in docs/architecture/memory.md and the architecture doc's "frozen post-2026-05-05" claim about agent_memories are aspirational. Every commit_memory call from every agent today silently falls through the shim into direct SQL on agent_memories. The plugin sidecar that the provisioner injects on each workspace EC2 (-e MEMORY_PLUGIN_URL='http://localhost:9100' in /opt/cp-tmp/internal/provisioner/ec2.go) is running but unreachable from the controlplane — workspace-server has no plugin URL configured, so it never opens a connection to any sidecar.

Architectural ambiguity this exposes

The per-workspace localhost:9100 sidecar is reachable only from inside the EC2 it runs on. workspace-server runs on Railway, far from any sidecar loopback. RFC #2728 says "workspace-server is the only sanctioned client" of the plugin — but the deployment topology is currently impossible: a centralized client cannot reach N decentralized sidecars over loopback. Either:

  • (A) Centralize the plugin: one plugin instance reachable over the private network, MEMORY_PLUGIN_URL set once on controlplane. Fits RFC #2728 wording.
  • (B) Per-workspace plugin: workspace-server resolves MEMORY_PLUGIN_URL per workspace at request time (e.g. via the workspace's tunnel or a service-discovery field on workspaces). Closer to current provisioner behavior but requires a routing layer that does not exist today.

This RFC needs to settle (A) vs (B) before any code change. My instinct is (A) — single SSOT, single audit boundary — but I want the platform team's input.

Proposal

Sequencing

  1. Decide (A) vs (B) in this thread.
  2. Deploy the chosen v2 topology to staging, set MEMORY_PLUGIN_URL on the staging controlplane, run dual-read shadow for 24h (read from both agent_memories and v2, alert on divergence).
  3. One-shot migration: backfill agent_memories rows into v2 namespaces via workspace-server/cmd/memory-backfill/.
  4. PR-A — kill the fallback (area:memory, tier:high-risk):
    • Remove the SQL-fallback branches in commitMemoryLegacyShim / recallMemoryLegacyShim.
    • Remove the circuit-breaker → v1 path in internal/memory/client/client.go. Open breaker returns 503, never silently dual-writes.
    • memoryV2Available() becomes a boot invariant — missing MEMORY_PLUGIN_URL fails workspace-server startup. (Aligns with feedback_no_single_source_of_truth.)
    • Delete TestCommitMemory_FallbackToLegacy / TestRecallMemory_FallbackToLegacy; add startup-invariant test.
    • Stage A (local boot with/without env), Stage B (staging tenant commit + read), Stage C (agent-driven real task).
  5. PR-B — deprecate K/V (area:memory, tier:medium-risk):
    • Migrate any code still calling memory_set / memory_get to commit_memory_v2 with kind="kv" (and expires_at for TTL).
    • Drop K/V MCP tools from ~/.molecule-ai/mcp-server/src/tools/memory.ts.
    • Drop /workspaces/:id/memory* endpoints and their handlers (memory.go).
    • Schema: drop workspace_memory table and migration 023's version column work.
  6. PR-C — session-search reads v2 (area:memory, tier:medium-risk):
    • Change activity.go:623 UNION to query the v2 plugin (or a v2-backed view) instead of agent_memories.
  7. PR-D — freeze + drop agent_memories:
    • After 30 days of green prod metrics on v2, schedule a destructive migration to drop agent_memories and its 008/017/031 columns. Coordinate with #769 (RecallMemory_GlobalScope security work) — confirm it's resolved or moved to v2.
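
For PR-B, the translation from a legacy K/V write to a v2 record with kind="kv" might look roughly like the sketch below. Only kind and expires_at come from the proposal; the record struct, the "workspace:" namespace format, and the function names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// MemoryRecord is a hypothetical shape for a v2 write. Only Kind and
// ExpiresAt are named in the proposal; the other fields are assumptions.
type MemoryRecord struct {
	Namespace string
	Kind      string
	Key       string
	Content   string
	ExpiresAt *time.Time // nil means no TTL
}

// kvToRecord maps a legacy memory_set call onto a v2 record with kind="kv".
// A ttl of zero or less is treated as "no expiry".
func kvToRecord(workspaceID, key, value string, ttl time.Duration, now time.Time) MemoryRecord {
	rec := MemoryRecord{
		Namespace: "workspace:" + workspaceID, // assumed namespace format
		Kind:      "kv",
		Key:       key,
		Content:   value,
	}
	if ttl > 0 {
		exp := now.Add(ttl)
		rec.ExpiresAt = &exp
	}
	return rec
}

// describe is a demo helper that renders a record built with a fixed clock.
func describe(ttlSeconds int) string {
	now := time.Date(2026, 5, 23, 0, 0, 0, 0, time.UTC)
	rec := kvToRecord("ws-1", "theme", "dark", time.Duration(ttlSeconds)*time.Second, now)
	if rec.ExpiresAt == nil {
		return rec.Kind + " " + rec.Namespace + " no-expiry"
	}
	return rec.Kind + " " + rec.Namespace + " " + rec.ExpiresAt.Format(time.RFC3339)
}

func main() {
	fmt.Println(describe(0))    // kv workspace:ws-1 no-expiry
	fmt.Println(describe(3600)) // kv workspace:ws-1 2026-05-23T01:00:00Z
}
```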

Required-approvals

  • Tier high-risk for PR-A and PR-D (mutate the agent-write path / drop a populated table). 2 non-author approvals per reference_merge_gate_model_changed_2026_05_18.
  • Staging E2E Stage A/B/C mandatory (dev-sop §SOP-6) for PR-A.

Risks

  • Dual-store window: between turning on v2 and dropping v1, reads must dual-source or v1 rows become inaccessible. Backfill + shadow-read mitigates.
  • Per-workspace store isolation: if we pick (A), one central plugin sees memories from every workspace; namespace ACL in internal/memory/namespace/resolver.go is the only enforcement boundary. Needs a security review pass (consider looping in #769's reviewer).
  • pgvector: migration 031 added it to agent_memories. If we want semantic search on v2, the plugin's table needs an equivalent column + index, and the backfill needs to recompute embeddings (or copy them).
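
The isolation risk above rests on the namespace ACL being a single, strict check. A minimal sketch of that kind of check follows, assuming a caller identity made of a workspace ID and a team ID and namespaces written as "workspace:<id>" and "team:<id>". None of these names or formats are taken from resolver.go, and the GLOBAL scope is left out because its semantics under a central plugin are not spelled out in this thread.

```go
package main

import "fmt"

// Caller identifies who is asking. Field names are assumptions.
type Caller struct {
	WorkspaceID string
	TeamID      string
}

// readableNamespaces lists the namespaces a caller may read.
// The "workspace:" and "team:" formats are illustrative only.
func readableNamespaces(c Caller) []string {
	return []string{"workspace:" + c.WorkspaceID, "team:" + c.TeamID}
}

// canRead is the single enforcement point: a request is allowed only if
// the requested namespace is in the caller's readable set.
func canRead(c Caller, namespace string) bool {
	for _, ns := range readableNamespaces(c) {
		if ns == namespace {
			return true
		}
	}
	return false
}

func main() {
	c := Caller{WorkspaceID: "ws-1", TeamID: "team-a"}
	fmt.Println(canRead(c, "workspace:ws-1")) // true
	fmt.Println(canRead(c, "workspace:ws-2")) // false
	fmt.Println(canRead(c, "team:team-b"))    // false
}
```

With a central plugin, a deny-by-default check of this shape is what a security review would want to see exercised for every read and write path.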

Open questions

  1. (A) centralized vs (B) per-workspace plugin — see "Architectural ambiguity" above. Cannot proceed without an answer.
  2. Does MemoryInspectorPanel.tsx (which already calls /v2/memories) currently show empty in production? If yes, that's confirmation v2 reads are dead and no user has noticed — supports the "drop fallback, fail loud" stance.
  3. Is the per-workspace sidecar in ec2.go actively used by the agent's own in-process MCP client (i.e. agents talk to localhost:9100 directly, bypassing workspace-server)? If yes, two parallel memory paths exist and we need to unify them.

References

  • docs/architecture/memory.md (to be rewritten in PR-A follow-up)
  • docs/memory-plugins/README.md (plugin contract)
  • workspace-server/internal/handlers/mcp_tools_memory_legacy_shim.go
  • workspace-server/internal/memory/{client,namespace,wiring}/
  • Related: #769 (offsec-001 RecallMemory scope), #1706 (OpenAPI/client drift)
  • Related cleanup: separate RFC on Awareness namespace removal (to be filed)
Author
Owner

Decision locked: Option A — centralized memory plugin

CTO chose A on 2026-05-23.

Rationale:

  1. SSOT is the goal of this entire cleanup. B would trade SSOT back for deployment-level isolation we already enforce at the ACL layer (internal/memory/namespace/resolver.go).
  2. TEAM and GLOBAL HMA scopes are first-class — they're inherently single-database queries. B turns them into a fan-out problem with no good answer.
  3. Per-EC2 isolation is defense-in-depth, not the primary boundary. The namespace ACL is.
  4. The sidecars provisioner injects today are already orphaned (controlplane doesn't reach them). A makes that reality canonical instead of trying to make the orphan canonical.

Escape hatch preserved: a future tenant demanding data residency can run their own plugin via a per-workspace routing override. Opt-in. Not the default.
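
The escape hatch amounts to a lookup with a default. A minimal sketch, assuming a hypothetical Router type that holds the central plugin URL plus an opt-in override map keyed by workspace ID (the override URL below is a made-up example):

```go
package main

import "fmt"

// Router resolves the plugin URL for a workspace. Central is the default
// (Option A); Overrides holds opt-in per-workspace URLs. Both names are
// hypothetical.
type Router struct {
	Central   string
	Overrides map[string]string
}

// PluginURL returns the override for the workspace if one is set,
// otherwise the central plugin URL.
func (r Router) PluginURL(workspaceID string) string {
	if u, ok := r.Overrides[workspaceID]; ok && u != "" {
		return u
	}
	return r.Central
}

// resolve is a demo helper with one tenant opted in to its own plugin.
func resolve(workspaceID string) string {
	r := Router{
		Central:   "http://memory-plugin.railway.internal:9100",
		Overrides: map[string]string{"ws-eu": "https://memory.tenant.example:9100"},
	}
	return r.PluginURL(workspaceID)
}

func main() {
	fmt.Println(resolve("ws-1"))  // http://memory-plugin.railway.internal:9100
	fmt.Println(resolve("ws-eu")) // https://memory.tenant.example:9100
}
```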


Phase 1 — Deployment investigation

Verified 2026-05-23:

  • Railway molecule-platform project has one service (controlplane) and zero database plugins attached. Adding a second service for the memory plugin is straightforward — Railway gives internal DNS at <service>.railway.internal.
  • Operator host (5.78.80.188) runs Infisical, Gitea, runners, an IAM Postgres, and ephemeral CI containers. Memory plugin could live here but every controlplane→plugin call would cross Railway↔Hetzner, adding ~30-80 ms of network latency on every agent memory read/write.
  • Plugin binary (workspace-server/cmd/memory-plugin-postgres/main.go) is a single Go binary with embedded migrations. Required env: MEMORY_PLUGIN_DATABASE_URL, optional MEMORY_PLUGIN_LISTEN_ADDR. Default bind is 127.0.0.1:9100 (loopback only — needs :9100 override when not on localhost). Requires pgvector extension on its target DB.

Phase 2 — Staged execution plan

This is multi-PR work. Each stage is a separate merge gate per dev-sop §SOP-6.

Phase A0 — Stand up centralized plugin (infra, no code change to controlplane)

  1. Pick database hosting (see open question below).
  2. Provision/identify the DB, ensure pgvector is enabled.
  3. Add a memory-plugin Railway service in molecule-platform:
    • Build from workspace-server/cmd/memory-plugin-postgres/
    • Set MEMORY_PLUGIN_DATABASE_URL (from step 2), MEMORY_PLUGIN_LISTEN_ADDR=:9100
    • Internal DNS becomes memory-plugin.railway.internal:9100
  4. Boot to staging first, verify migrations apply and /v1/health returns OK with capabilities ["embedding","fts","ttl","pin","propagation"].
  5. Ship to prod with same shape.

No controlplane behavior changes in this phase. The plugin runs but nothing calls it.
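
The health verification in step 4 can be automated with a small check like the one below. The JSON response shape (a "status" string and a "capabilities" array) is an assumption; only the endpoint path and the five capability names come from the plan above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// healthResponse is an assumed shape for the /v1/health body.
type healthResponse struct {
	Status       string   `json:"status"`
	Capabilities []string `json:"capabilities"`
}

// requiredCapabilities is the list quoted in step 4.
var requiredCapabilities = []string{"embedding", "fts", "ttl", "pin", "propagation"}

// missingCapabilities decodes a health body and returns the required
// capabilities it does not advertise, in the order listed above.
func missingCapabilities(body []byte) ([]string, error) {
	var h healthResponse
	if err := json.Unmarshal(body, &h); err != nil {
		return nil, fmt.Errorf("decode health response: %w", err)
	}
	have := make(map[string]bool, len(h.Capabilities))
	for _, c := range h.Capabilities {
		have[c] = true
	}
	var missing []string
	for _, c := range requiredCapabilities {
		if !have[c] {
			missing = append(missing, c)
		}
	}
	return missing, nil
}

func main() {
	body := []byte(`{"status":"ok","capabilities":["embedding","fts","ttl"]}`)
	missing, err := missingCapabilities(body)
	fmt.Println(missing, err) // [pin propagation] <nil>
}
```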

Phase A1 — Wire controlplane → plugin (additive, non-breaking)

  1. Set MEMORY_PLUGIN_URL=http://memory-plugin.railway.internal:9100 on staging controlplane.
  2. Verify POST /workspaces/:id/v2/memories (which existing MemoryInspectorPanel.tsx already calls) starts returning real data.
  3. Stand up a dual-write shadow for 24-48 h: every commit_memory call writes to both agent_memories (v1) and the plugin (v2), and a sweep job alerts on row-count divergence.
  4. Lift MEMORY_PLUGIN_URL to prod once shadow is clean.

Single small PR — adds the env var, adds the dual-write code path, adds the divergence alert. Reversible by unsetting the env var.
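
A rough sketch of the dual-write path and the divergence signal, using in-memory stand-ins for both stores. The Store interface, the DualWriter type, and the choice to keep v1 authoritative during the shadow window (a v2 failure is swallowed and left for the sweep to detect) are assumptions, not decisions recorded in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Store is a hypothetical minimal view of a memory backend.
type Store interface {
	Commit(content string) error
	Count() int
}

// memStore is an in-memory stand-in; failOn makes the n-th commit fail
// (0 means never fail).
type memStore struct {
	rows   []string
	calls  int
	failOn int
}

func (s *memStore) Commit(content string) error {
	s.calls++
	if s.calls == s.failOn {
		return errors.New("store unavailable")
	}
	s.rows = append(s.rows, content)
	return nil
}

func (s *memStore) Count() int { return len(s.rows) }

// DualWriter writes to v1 first and mirrors the write to v2.
type DualWriter struct{ V1, V2 Store }

func (d DualWriter) Commit(content string) error {
	if err := d.V1.Commit(content); err != nil {
		return err
	}
	_ = d.V2.Commit(content) // shadow write: a failure shows up as divergence
	return nil
}

// Divergence is v1 rows minus v2 rows; the sweep alerts when it is non-zero.
func (d DualWriter) Divergence() int { return d.V1.Count() - d.V2.Count() }

// shadowRun is a demo helper: it commits the given number of writes with
// v2 failing on the v2FailOn-th one and returns the resulting divergence.
func shadowRun(writes, v2FailOn int) int {
	d := DualWriter{V1: &memStore{}, V2: &memStore{failOn: v2FailOn}}
	for i := 0; i < writes; i++ {
		_ = d.Commit(fmt.Sprintf("memory-%d", i))
	}
	return d.Divergence()
}

func main() {
	fmt.Println(shadowRun(5, 0)) // 0
	fmt.Println(shadowRun(5, 3)) // 1
}
```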

Phase A2 — Backfill v1 → v2

  1. Run workspace-server/cmd/memory-backfill/ (already exists in the repo) on staging.
  2. Verify count(*) FROM agent_memories ≈ count(*) FROM memory_records WHERE source = 'agent' (within shadow-window tolerance).
  3. Repeat on prod once verified.

Ops task, no merge. Just an operator-run binary with shadow validation after.
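
The "within shadow-window tolerance" comparison in step 2 can be made explicit. The sketch below assumes an absolute row tolerance chosen by the operator; the RFC does not fix a value, and the counts shown are made-up examples.

```go
package main

import "fmt"

// withinTolerance reports whether two row counts differ by at most
// tolerance rows. The tolerance absorbs writes that land during the
// shadow window; its value is an operator choice, not taken from the RFC.
func withinTolerance(v1Count, v2Count, tolerance int) bool {
	diff := v1Count - v2Count
	if diff < 0 {
		diff = -diff
	}
	return diff <= tolerance
}

func main() {
	// Counts would come from the two count(*) queries in step 2.
	fmt.Println(withinTolerance(10000, 9998, 5)) // true
	fmt.Println(withinTolerance(10000, 9900, 5)) // false
}
```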

Phase A3 — The actual cutover PR (the v1-removal one originally drafted)

  1. Delete the SQL-fallback branches in commitMemoryLegacyShim / recallMemoryLegacyShim (workspace-server/internal/handlers/mcp_tools_memory_legacy_shim.go).
  2. Remove the circuit-breaker → v1 fallback in internal/memory/client/client.go. Open breaker returns 503, never silently routes to v1.
  3. memoryV2Available() becomes a startup invariant — missing MEMORY_PLUGIN_URL fails workspace-server boot. (Aligns with feedback_no_single_source_of_truth.)
  4. Delete TestCommitMemory_FallbackToLegacy / TestRecallMemory_FallbackToLegacy.
  5. Remove the per-workspace sidecar injection in ec2.go (the -e MEMORY_PLUGIN_URL='http://localhost:9100' for tenant EC2s) — sidecars are dead under Option A.

tier:high-risk. Stage A (boot with/without env), Stage B (staging tenant agent commits memory → appears only in plugin DB, never in agent_memories), Stage C (real-task agent run with screenshots).
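
The startup invariant in step 3 could be as small as the check below. This is a sketch under assumptions: the function name and the rule that the URL must be http(s) with a host are illustrative, and a real boot path would feed it os.Getenv("MEMORY_PLUGIN_URL") and exit non-zero on error.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// errPluginURLMissing is a hypothetical sentinel for the unset case.
var errPluginURLMissing = errors.New("MEMORY_PLUGIN_URL is required")

// validatePluginURL takes the raw environment value and returns the parsed
// URL, or an error that should abort startup.
func validatePluginURL(raw string) (*url.URL, error) {
	if raw == "" {
		return nil, errPluginURLMissing
	}
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("MEMORY_PLUGIN_URL: %w", err)
	}
	if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
		return nil, fmt.Errorf("MEMORY_PLUGIN_URL %q: want http(s)://host[:port]", raw)
	}
	return u, nil
}

func main() {
	for _, raw := range []string{"", "localhost:9100", "http://localhost:9100"} {
		_, err := validatePluginURL(raw)
		fmt.Printf("%q ok=%v\n", raw, err == nil)
	}
}
```

Stage A (boot with and without the env var) maps directly onto the first and last inputs in the loop.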

Phase A4 — Drop agent_memories (separate PR, +30 days post-cutover)

  1. Wait 30 days of green prod metrics on v2.
  2. Down migration removes agent_memories + migrations 008/017/031 columns.
  3. Coordinate with #769 (RecallMemory security work) — confirm resolved or moved to v2.

Open question (gating Phase A0)

Where should the memory plugin's database live?

  • (a) New Neon project — clean separation, own backup/billing line. ~10-min provision via Neon API.
  • (b) Reuse the existing controlplane Neon DB, isolate via a dedicated schema — fastest path, one fewer thing to manage, but mixes concerns. The plugin's vector(1536) type + ivfflat index would live next to controlplane's tables in the same DB.
  • (c) Railway-hosted Postgres — single billing surface, but Railway's Postgres add-ons have lower SLA than Neon and less flexible scaling.

CTO answer needed here before Phase A0 begins. My recommendation: (a) for prod long-term, (b) for staging if we want a one-day spike — the schema isolation is cheap to undo later.


Sequencing summary

A0 (infra) ──► A1 (wire+shadow) ──► A2 (backfill) ──► A3 (cutover PR) ──► A4 (drop v1 table, +30d)

Nothing in this thread blocks #1735 (awareness removal — PR #1737 open) or #1734 (canvas Memory tab fix, once #1733 lands).

Author
Owner

Architecture verified — major correction to the original plan

CTO clarified the tenancy model (2026-05-23): the platform provides billing+infra only; each tenant runs their own molecule-core deployment, with one tenant EC2 hosting workspace-server + Postgres + Redis + (intended) memory-plugin sidecar for that org. Each workspace today runs as a separate dedicated EC2 phoning home to the tenant EC2 at :8080.

I read molecule-controlplane to verify. My original plan in this RFC (deploy a centralized plugin in the platform Railway project) aimed at the wrong target: the plugin belongs on the tenant EC2, and CP almost wires it correctly. The one missing piece is the step that starts the plugin container.

What CP actually does today

  • One tenant EC2 per org — enforced by org_instances.org_id PRIMARY KEY (CP migrations/002_org_instances.up.sql:9). Provisioned by provisionTenant (internal/handlers/orgs.go:595).
  • One dedicated workspace EC2 per workspace — provisioned by ProvisionWorkspace (internal/provisioner/ec2.go:2475). Phones home to the tenant EC2 at MOLECULE_PLATFORM_URL=<tenant_ip>:8080. Tagged with WorkspaceID, OrgID. Single-tenant security invariant called out in internal/provisioner/userdata_t4_privileged_test.go:21.
  • Tenant EC2 user-data (buildTenantUserDataSM, ec2.go:1917-2300) launches:
    • Postgres (pgvector/pgvector:pg16) container @ 127.0.0.1:5432
    • Redis @ 127.0.0.1:6379
    • molecule-tenant container (workspace-server binary) with env (line 2232-2233):
      -e MEMORY_V2_CUTOVER='true' \
      -e MEMORY_PLUGIN_URL='http://localhost:9100' \
      
    • cloudflared systemd unit
  • No memory-plugin-postgres container is started anywhere in CP user-data. The env var is set, but 127.0.0.1:9100 is unreachable. Every tenant workspace-server today tries v2 → connection refused → silently falls back to v1 SQL on agent_memories (via the shim) or trips the circuit breaker.

Where the user's intended architecture differs

User's desired shape: workspaces are Docker containers on the tenant EC2, not separate EC2s. All workspaces share the tenant EC2's memory plugin via localhost:9100. HMA TEAM/GLOBAL scopes work because all of an org's workspaces share one DB.

Current shape:

  • ✓ Tenant EC2 already shared by all of an org's workspaces (workspace-server, DB, Redis).
  • ✗ Workspaces themselves are separate EC2s, not containers on the tenant EC2.
  • ✗ Memory plugin sidecar doesn't actually run.

The workspace-placement change (per-workspace EC2 → docker-on-tenant-EC2) is a major architectural pivot with security implications: today's per-workspace EC2 model relies on hardware isolation for tiers like T4 (privileged docker-run inside a dedicated EC2). Packing all of an org's workspaces onto one tenant EC2 means their containers share a kernel and potentially privileged capabilities. That's a separate large RFC.

But the memory plugin migration does NOT depend on that pivot. v2 can be made live for every existing tenant today by just starting the missing sidecar on the tenant EC2.

Revised sequencing

A0 — Start the missing memory-plugin sidecar on tenant EC2 (CP-side, small)

One PR in molecule-controlplane:

  1. Add a start_memory_plugin stage to buildTenantUserDataSM (internal/provisioner/ec2.go), placed before start_platform:
    • docker run -d --name molecule-memory-plugin --network host --restart unless-stopped <image> ...
    • Image: pre-built memory-plugin-postgres binary (the workspace-server repo's cmd/memory-plugin-postgres/ Dockerfile, or a multi-stage build alongside platform-tenant).
    • Env: MEMORY_PLUGIN_DATABASE_URL pointing at the tenant EC2's Postgres (same instance, dedicated memory_plugin schema OR a separate DB on the same Postgres cluster — TBD by reviewer).
    • MEMORY_PLUGIN_LISTEN_ADDR=127.0.0.1:9100.
  2. Wait-for-health gate before start_platform (the plugin must be up first or workspace-server boot breaks once we add the startup invariant in A1).
  3. Add a CP tenant_resources audit row when the plugin container launches.

This is reversible (env var can be unset on workspace-server, plugin container can be removed) and adds no behavioral change to existing tenants — the env var is already set on platform-tenant; the plugin just becomes reachable.
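
The wait-for-health gate in step 2 is a bounded retry loop. The real gate would live in the tenant user-data script; the Go sketch below only illustrates the logic, with a probe function standing in for an HTTP GET against the plugin's health endpoint. All names are hypothetical, and attempts is assumed to be at least 1.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitHealthy calls probe until it succeeds or attempts are exhausted,
// sleeping interval between tries. attempts is assumed to be >= 1.
func waitHealthy(probe func() error, attempts int, interval time.Duration) error {
	var last error
	for i := 0; i < attempts; i++ {
		if last = probe(); last == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("plugin not healthy after %d attempts: %w", attempts, last)
}

// flakyProbe returns a probe that fails its first n calls. It stands in
// for a real health request against the plugin.
func flakyProbe(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("connection refused")
		}
		return nil
	}
}

// gateOpens is a demo helper: does the gate open within the attempt budget?
func gateOpens(failFirst, attempts int) bool {
	return waitHealthy(flakyProbe(failFirst), attempts, time.Millisecond) == nil
}

func main() {
	fmt.Println(gateOpens(2, 5)) // true: healthy on the third probe
	fmt.Println(gateOpens(5, 3)) // false: budget exhausted first
}
```

Failing the gate loudly matters once A1 lands, because workspace-server would otherwise boot into a startup failure.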

A1 — Tenant workspace-server: kill v1 fallback, fail-fast on missing plugin (core-side)

One PR in molecule-core, gated on A0 rolling to every tenant EC2 (live restart cycles them):

  1. Delete the SQL-fallback branches in commitMemoryLegacyShim / recallMemoryLegacyShim.
  2. Remove the circuit-breaker → v1 path in internal/memory/client/client.go. Open breaker returns 503.
  3. memoryV2Available() becomes a startup invariant — missing MEMORY_PLUGIN_URL fails workspace-server boot.
  4. Delete the two *_FallbackToLegacy tests.
  5. Update docs.

tier:medium-risk. Stage A (local boot with/without env), Stage B (one staging tenant — confirm agent commit_memory lands in plugin's memory_records only), Stage C (real-task agent run).
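
Step 2 says an open breaker returns 503 instead of falling back. One way that mapping could look at the HTTP layer is sketched below; the sentinel error, the handler wrapper, the Retry-After value, and the 502 for other failures are all assumptions rather than the behavior of the existing handlers.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ErrBreakerOpen is a hypothetical sentinel returned by the plugin client
// while its circuit breaker is open.
var ErrBreakerOpen = errors.New("memory plugin circuit breaker open")

// commitHandler wraps a commit function. When the breaker is open it
// answers 503 and does not fall back to any other store.
func commitHandler(commit func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		err := commit()
		switch {
		case err == nil:
			w.WriteHeader(http.StatusNoContent)
		case errors.Is(err, ErrBreakerOpen):
			w.Header().Set("Retry-After", "60") // assumed: matches the 60 s open window
			http.Error(w, "memory plugin unavailable", http.StatusServiceUnavailable)
		default:
			http.Error(w, "memory commit failed", http.StatusBadGateway)
		}
	}
}

// statusFor runs the handler against an in-memory recorder and returns
// the HTTP status produced for the given commit error.
func statusFor(err error) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/commit", nil)
	commitHandler(func() error { return err })(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(nil))            // 204
	fmt.Println(statusFor(ErrBreakerOpen)) // 503
}
```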

A2 — Backfill agent_memories → plugin (ops task)

Run workspace-server/cmd/memory-backfill/ per tenant EC2. Each tenant has its own DB so each tenant has its own backfill. Could be scripted as a CP-orchestrated sweep.

A3 — Drop agent_memories table (+30 days)

Down migration on the workspace-server side. Coordinate with #769.

A_workspace_placement — separate RFC (org-level workspace container hosting)

The user's "workspaces as containers on tenant EC2" proposal is filed as its own RFC. It affects:

  • CP's ProvisionWorkspace (today RunInstance → user-data) becomes "exec on the existing tenant EC2's Docker daemon".
  • Security: workspaces lose hardware isolation from each other; T4 would need rootless Docker or per-workspace VMs (Firecracker-style).
  • Existing tenants need a migration path (drain workspace EC2s, recreate as containers).
  • The memory work above is unaffected — A0-A3 land independently and benefit immediately.

Open questions for A0

  1. Plugin DB location: same Postgres instance as platform-tenant (separate schema), or a sibling container with its own postgres image? (Same instance is cheaper; separate is cleaner ownership per RFC #2728 comment "operator drops them when confident they don't want to switch back".)
  2. Plugin image build: extend the existing Dockerfile in molecule-core to multi-stage and produce a second binary, or new Dockerfile.memory-plugin? The platform-tenant image already has the Go runtime — could ship both binaries in one image (cheaper pull, single update unit) or keep them separate (independent versioning).
  3. Existing tenant rollout: tenants on running EC2s won't pick up the new user-data without a redeploy. Do we (a) force-recycle every tenant EC2, (b) ship a small helper that adds the missing container on-the-fly via SSM, or (c) wait for natural restart cycles?

Sequencing summary

A0 (CP user-data: start plugin sidecar) ──► A1 (kill v1 fallback in core) ──► A2 (backfill) ──► A3 (drop v1 table, +30d)
                                                                                                    │
A_workspace_placement RFC (separate; doesn't gate the above) ──────────────────────────────────────┘

#1735 (awareness removal — PR #1737) is independent and still moves forward separately.
#1734 (Memory tab fix) is gated on A1.

Looking for CTO sign-off on this sequencing before starting A0 in molecule-controlplane.

Reference: molecule-ai/molecule-core#1733