refactor(memory): #1733 A1 — kill v1 SQL fallback, v2 plugin is the only backend #1747

Merged
hongming merged 2 commits from chore/issue-1733-a1-kill-v1-fallback into main 2026-05-24 02:45:57 +00:00

2 Commits

Author SHA1 Message Date
hongming c2b6e8bdb5 review(memory): #1747 — OFFSEC test depth + legacy redaction + cutover env deprecation
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 10s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 34s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 52s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m4s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Failing after 1m9s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m2s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m15s
review-check-tests / review-check.sh regression tests (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
qa-review / approved (pull_request) Failing after 4s
security-review / approved (pull_request) Failing after 3s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m23s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m7s
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m38s
E2E Chat / E2E Chat (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Harness Replays / Harness Replays (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m33s
CI / Platform (Go) (pull_request) Successful in 5m42s
CI / all-required (pull_request) Successful in 26m9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
gate-check-v3 / gate-check (pull_request) Waiting to run
sop-checklist / all-items-acked (pull_request) Waiting to run
sop-checklist / review-refire (pull_request) Waiting to run
sop-tier-check / tier-check (pull_request) Waiting to run
audit-force-merge / audit (pull_request) Successful in 41s
Three reviewer-flagged fixes on PR #1747:

1. OFFSEC-001 scrub tests no longer vacuous (review finding C2).
   `TestMCPHandler_CommitMemory_RequestErrors_ScrubsInternalError`
   and its recall sibling triggered via `scope=GLOBAL`, but post-A1
   the handler short-circuits in `memoryV2Available()` BEFORE the
   GLOBAL block runs — so the leaked-tokens loop (`GLOBAL`, `scope`,
   `permitted`, `bridge`, `LOCAL`, `TEAM`) was scanning an error
   string that doesn't contain any of them ("memory plugin is not
   configured"). The test passed even if mcp.go:dispatchRPC's scrub
   was deleted entirely.

   Fix: wire `withMemoryV2APIs(&stubMemoryPlugin{}, rootNamespace
   Resolver())` so the request reaches `scopeToWritableNamespace` /
   `scopeToReadableNamespaces` where the actual GLOBAL block lives.
   Renamed `CommitMemory_RequestErrors_ScrubsInternalError` →
   `CommitMemory_GlobalScope_ScrubsInternalError` (and the recall
   sibling) to match what the test actually pins now.

   Verified by `go test -v` showing the internal log line
   `mcp: tool call failed workspace=ws-1 tool=commit_memory:
   GLOBAL scope is not permitted via the MCP bridge — use LOCAL or
   TEAM` — the actual leaky string is now reaching the scrub layer,
   so the negative-token assertions are real proof.

2. New `TestMCPHandler_CommitMemory_LegacyName_RedactionAtPlugin`
   (review finding N6). The deleted SQL-path version of this test
   pinned the SAFE-T1201 redaction contract against `agent_memories`
   INSERT args. Post-A1 the legacy `commit_memory` MCP tool routes
   through the shim → `toolCommitMemoryV2` → plugin; the plugin path
   redacts at mcp_tools_memory_v2.go:122. This test calls the
   legacy tool via JSON-RPC envelope with three secret-shaped
   patterns (ANTHROPIC_API_KEY=, Bearer token, sk-* prefix), uses
   a stub plugin that captures the `MemoryWrite.Content` it
   receives, and asserts the captured content is the redacted
   string — not the raw secret. Catches a regression where someone
   inlines the shim and skips redaction.

3. `MEMORY_V2_CUTOVER` deprecated in `entrypoint-tenant.sh` (review
   finding N1). The workspace-server binary stopped reading the env
   in PR #1747's main commit (`wiring.go` cleanup), but the
   entrypoint still keyed off it OR'd with `MEMORY_PLUGIN_URL`.
   Surfaced as a transition warning: when only `MEMORY_V2_CUTOVER`
   is set (no `MEMORY_PLUGIN_URL`), the entrypoint logs a loud
   deprecation notice and spawns the sidecar anyway so old CP
   user-data templates keep working through the rollout. Once
   CP user-data drops the var (separate CP PR), the
   `MEMORY_V2_CUTOVER` branch can be deleted from the entrypoint.

Also filed:
- **#1755** — tracking issue for migrating `seedInitialMemories`
  through the v2 plugin before Phase A3 drops `agent_memories`
  (review finding C3). Required before A3 can land or workspace
  creation breaks.

Refs: PR #1747 review (CTO 2026-05-23 dispatched subagent), findings
C2 (OFFSEC scrub depth), N1 (cutover env drift), N6 (lost legacy-
name redaction coverage). C3 spun out to #1755.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 17:09:18 -07:00
hongming 2aa6e8adf3 refactor(memory): #1733 A1 — kill v1 SQL fallback, v2 plugin is the only backend
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
sop-checklist / review-refire (pull_request) Waiting to run
sop-tier-check / tier-check (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 5s
CI / Python Lint & Test (pull_request) Successful in 3s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 44s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 34s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m3s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m14s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m27s
review-check-tests / review-check.sh regression tests (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m15s
gate-check-v3 / gate-check (pull_request) Successful in 4s
qa-review / approved (pull_request) Failing after 4s
security-review / approved (pull_request) Failing after 3s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m17s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m10s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m31s
CI / Platform (Go) (pull_request) Successful in 4m6s
E2E Chat / E2E Chat (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
CI / all-required (pull_request) Successful in 23m38s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m12s
Harness Replays / Harness Replays (pull_request) Successful in 2s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CTO decision 2026-05-23: same Postgres, separate schema, v2 plugin is
SSOT for tenant memory. A0 (PR #1742) isolates plugin tables under the
`memory_plugin` schema. A1 (this PR) removes the silent fallback paths
that route to `agent_memories` when MEMORY_PLUGIN_URL is unset, so a
half-configured deployment fails loudly on the first memory call rather
than quietly writing to a frozen table no one reads.

## Surfaces removed

1. **`toolCommitMemory` legacy SQL branch** (`internal/handlers/mcp_tools.go`).
   Was: route through v2 shim when `memoryV2Available() == nil`, else
   INSERT directly into `agent_memories`. Now: error
   `memory plugin is not configured (set MEMORY_PLUGIN_URL)` when v2 is
   not wired; route through the shim otherwise. ~30 LOC of inline SQL +
   scope validation gone — the shim already enforces the same scope
   rules (it just talks to the plugin).

2. **`toolRecallMemory` legacy SQL branch** (same file). Was: three
   scope-specific SELECTs on `agent_memories` when v2 unwired. Now:
   same error semantics as commit. ~60 LOC gone.

3. **`AdminMemoriesHandler.Export` legacy SQL branch**
   (`internal/handlers/admin_memories.go`). Was: when
   `cutoverActive() == false` (which checked both `MEMORY_V2_CUTOVER`
   *and* v2 wiring), run a direct `SELECT FROM agent_memories JOIN
   workspaces`. Now: respond `503 Service Unavailable` with hint
   `memory plugin is not configured (set MEMORY_PLUGIN_URL)`. Plugin
   path unchanged.

4. **`AdminMemoriesHandler.Import` legacy SQL branch** (same file).
   Was: per-entry workspace lookup + dedup check + INSERT into
   `agent_memories`. Now: same 503 when plugin not wired.

5. **`cutoverActive()` + `envMemoryV2Cutover` constant** removed —
   replaced with `memoryV2Wired()` which only checks `h.plugin != nil
   && h.resolver != nil`. The `MEMORY_V2_CUTOVER` env var is no longer
   read inside the platform binary (CP user-data still inspects it
   before spawning the sidecar, but workspace-server does not branch
   on it). Tenant operators upgrading to this build cannot accidentally
   stay in v1 by unsetting `MEMORY_V2_CUTOVER`.

6. **Boot-time guard simplification** in
   `internal/memory/wiring/wiring.go`. The `cutover` boolean and its
   two loud-WARN log lines are gone (the half-configured-cutover
   failure mode they guarded against is structurally impossible now).
   The one remaining boot log when `MEMORY_PLUGIN_URL` is unset
   explicitly tells the operator that v2 MCP tools will return errors
   until the env is set.

## Surfaces deliberately NOT removed in this PR

- **`MemoriesHandler` REST endpoints** (`POST/GET/DELETE/PATCH
  /workspaces/:id/memories` in `memories.go`). These are the
  canvas-side surface and still write to `agent_memories`. Their
  cutover is the #1734 work (canvas Memory tab fix); coupling that
  here would make this PR much bigger and would conflict with the tab
  rewrite already drafted in #1734.

- **`seedInitialMemories`** (`workspace_provision.go`) still writes to
  `agent_memories` during workspace creation. Its v2 migration is
  deferred to Phase A3 (when we drop the table entirely) — early
  migration would force every test that exercises workspace creation
  to wire a v2 stub.

- **REST `/admin/memories/export` and `/admin/memories/import`
  request-decode path** — the `400 Bad Request` reject on malformed
  JSON runs before any plugin check and is preserved by the surviving
  `TestAdminMemories_Import_InvalidJSON`.

## Tests rewritten

- `TestCutoverActive` → `TestMemoryV2Wired` (the env-flag dimension is
  gone; only wiring matters now).
- `TestExport_LegacyPathWhenCutoverInactive` → `TestExport_503When
  PluginNotWired` + new sibling `TestImport_503WhenPluginNotWired`.
- `TestToolCommitMemory_FallsThroughToLegacyWhenV2Unwired` →
  `TestToolCommitMemory_ErrorsWhenV2Unwired` (asserts on the
  error-message hint at `MEMORY_PLUGIN_URL`).
- `TestToolRecallMemory_FallsThroughToLegacyWhenV2Unwired` →
  `TestToolRecallMemory_ErrorsWhenV2Unwired` (same).
- `TestBuild_WarnsWhenCutoverWithoutPluginURL`,
  `TestBuild_LoudWarnWhenCutoverAndProbeFails`,
  `TestBuild_NoWarnWhenNeitherSet`, `TestBuild_QuietProbeFailWhenCutoverOff`
  → consolidated to `TestBuild_LogsWhenURLUnset` +
  `TestBuild_LogsWhenProbeFails`.

## Tests deleted (legacy SQL coverage now redundant)

- `TestAdminMemories_Export_{Success,Empty,QueryError,RedactsSecrets}`,
  `TestAdminMemories_Import_{Success,WorkspaceNotFound_SkipsEntry,
  DuplicateSkipped,RedactsSecretsBeforeDedup,PreservesCreatedAt}` —
  every assertion was about the deleted SQL path. The plugin path is
  fully covered by 16 tests in `admin_memories_cutover_test.go`
  (listed in the file header for traceability).
- `TestMCPHandler_CommitMemory_LocalScope_Success`,
  `TestMCPHandler_CommitMemory_SecretInContent_IsRedactedBeforeInsert`,
  `TestMCPHandler_CommitMemory_CleanContent_PassesThrough`,
  `TestMCPHandler_RecallMemory_LocalScope_Empty` — same reason; the v2
  shim path is covered by `TestToolCommitMemory_RoutesThroughV2WhenWired`,
  `TestToolRecallMemory_RoutesThroughV2WhenWired`, and every test in
  `mcp_tools_memory_v2_test.go`.
- `TestMCPHandler_CommitMemory_GlobalScope_Blocked_ScrubsInternalError`
  and its `RecallMemory_` sibling are KEPT (renamed to drop the
  `GlobalScope_Blocked_` qualifier) because they validate the OFFSEC-001
  JSON-RPC scrub layer in `mcp.go dispatchRPC`, which is orthogonal to
  the memory backend.

Net diff: −752 LOC across 9 files.

## Stage gates

**Stage A — local build + test sweep**:
- `go vet ./...` clean.
- `go test -count=1 ./internal/handlers/... ./internal/memory/...
  ./cmd/memory-plugin-postgres/...` all green (~25 s).
- Manual: confirmed `memoryV2Available()` still returns its hint
  message ("set MEMORY_PLUGIN_URL"), and that the JSON-RPC scrub
  replaces it with the OFFSEC-001 constant.

**Stage B / C — deferred to staging deploy after gating PR lands**.

## Gating

**DO NOT merge until PR #1742 (A0 schema isolation) has been merged
and rolled out to every tenant's image cache.** A tenant booting this
build without a reachable memory plugin sidecar will return errors
on every `commit_memory` / `recall_memory` / admin export+import call.
That's the intended behavior — but it should only happen after the
sidecar is reliably running on every tenant EC2.

The existing tenant entrypoint (`entrypoint-tenant.sh:35-66`,
since 2026-05-14) already aborts boot when `MEMORY_V2_CUTOVER=true`
and the plugin doesn't come up in 30s, so SaaS tenants are protected
by the CP boot gate — but only IF they're running the post-2026-05-14
image. Audit `org_instances.fly_machine_id` boot logs before
flipping.

## Risks

- Self-hosted operators who haven't set `MEMORY_PLUGIN_URL` will get
  errors instead of working `commit_memory`. The error message tells
  them exactly what to set. Acceptable trade-off for SSOT.
- Tenants currently running on a pre-2026-05-14 tenant image
  (entrypoint doesn't spawn sidecar) will start failing memory calls
  on first restart after the new core image rolls out. Confirm CP
  has recycled every tenant onto the current image before merging.
- `seedInitialMemories` still writes to `agent_memories`; rows there
  remain readable to any operator query bypassing the API. They'll
  go away with the table in Phase A3.

Refs: #1733 (memory SSOT consolidation), #1742 (Phase A0 — schema
isolation, gates this PR), #1734 (Memory tab UI fix, unblocked once
this lands).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 15:06:52 -07:00