refactor(memory): #1733 A1 — kill v1 SQL fallback, v2 plugin is the only backend #1747
Merged
hongming
merged 2 commits from 2026-05-24 02:45:57 +00:00
chore/issue-1733-a1-kill-v1-fallback into main
2 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
c2b6e8bdb5 |
review(memory): #1747 — OFFSEC test depth + legacy redaction + cutover env deprecation
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 10s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 34s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 52s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 4s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m4s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Failing after 1m9s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m2s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m15s
review-check-tests / review-check.sh regression tests (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
qa-review / approved (pull_request) Failing after 4s
security-review / approved (pull_request) Failing after 3s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m23s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m7s
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m38s
E2E Chat / E2E Chat (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Harness Replays / Harness Replays (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m33s
CI / Platform (Go) (pull_request) Successful in 5m42s
CI / all-required (pull_request) Successful in 26m9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
gate-check-v3 / gate-check (pull_request) Waiting to run
sop-checklist / all-items-acked (pull_request) Waiting to run
sop-checklist / review-refire (pull_request) Waiting to run
sop-tier-check / tier-check (pull_request) Waiting to run
audit-force-merge / audit (pull_request) Successful in 41s
Three reviewer-flagged fixes on PR #1747: 1. OFFSEC-001 scrub tests no longer vacuous (review finding C2). `TestMCPHandler_CommitMemory_RequestErrors_ScrubsInternalError` and its recall sibling triggered via `scope=GLOBAL`, but post-A1 the handler short-circuits in `memoryV2Available()` BEFORE the GLOBAL block runs — so the leaked-tokens loop (`GLOBAL`, `scope`, `permitted`, `bridge`, `LOCAL`, `TEAM`) was scanning an error string that doesn't contain any of them ("memory plugin is not configured"). The test passed even if mcp.go:dispatchRPC's scrub was deleted entirely. Fix: wire `withMemoryV2APIs(&stubMemoryPlugin{}, rootNamespace Resolver())` so the request reaches `scopeToWritableNamespace` / `scopeToReadableNamespaces` where the actual GLOBAL block lives. Renamed `CommitMemory_RequestErrors_ScrubsInternalError` → `CommitMemory_GlobalScope_ScrubsInternalError` (and the recall sibling) to match what the test actually pins now. Verified by `go test -v` showing the internal log line `mcp: tool call failed workspace=ws-1 tool=commit_memory: GLOBAL scope is not permitted via the MCP bridge — use LOCAL or TEAM` — the actual leaky string is now reaching the scrub layer, so the negative-token assertions are real proof. 2. New `TestMCPHandler_CommitMemory_LegacyName_RedactionAtPlugin` (review finding N6). The deleted SQL-path version of this test pinned the SAFE-T1201 redaction contract against `agent_memories` INSERT args. Post-A1 the legacy `commit_memory` MCP tool routes through the shim → `toolCommitMemoryV2` → plugin; the plugin path redacts at mcp_tools_memory_v2.go:122. This test calls the legacy tool via JSON-RPC envelope with three secret-shaped patterns (ANTHROPIC_API_KEY=, Bearer token, sk-* prefix), uses a stub plugin that captures the `MemoryWrite.Content` it receives, and asserts the captured content is the redacted string — not the raw secret. Catches a regression where someone inlines the shim and skips redaction. 3. `MEMORY_V2_CUTOVER` deprecated in `entrypoint-tenant.sh` (review finding N1). The workspace-server binary stopped reading the env in PR #1747's main commit (`wiring.go` cleanup), but the entrypoint still keyed off it OR'd with `MEMORY_PLUGIN_URL`. Surfaced as a transition warning: when only `MEMORY_V2_CUTOVER` is set (no `MEMORY_PLUGIN_URL`), the entrypoint logs a loud deprecation notice and spawns the sidecar anyway so old CP user-data templates keep working through the rollout. Once CP user-data drops the var (separate CP PR), the `MEMORY_V2_CUTOVER` branch can be deleted from the entrypoint. Also filed: - **#1755** — tracking issue for migrating `seedInitialMemories` through the v2 plugin before Phase A3 drops `agent_memories` (review finding C3). Required before A3 can land or workspace creation breaks. Refs: PR #1747 review (CTO 2026-05-23 dispatched subagent), findings C2 (OFFSEC scrub depth), N1 (cutover env drift), N6 (lost legacy- name redaction coverage). C3 spun out to #1755. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
2aa6e8adf3 |
refactor(memory): #1733 A1 — kill v1 SQL fallback, v2 plugin is the only backend
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
sop-checklist / review-refire (pull_request) Waiting to run
sop-tier-check / tier-check (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 5s
CI / Python Lint & Test (pull_request) Successful in 3s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 44s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 34s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 3s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m3s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m14s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m27s
review-check-tests / review-check.sh regression tests (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m15s
gate-check-v3 / gate-check (pull_request) Successful in 4s
qa-review / approved (pull_request) Failing after 4s
security-review / approved (pull_request) Failing after 3s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m17s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m10s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m31s
CI / Platform (Go) (pull_request) Successful in 4m6s
E2E Chat / E2E Chat (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
CI / all-required (pull_request) Successful in 23m38s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m12s
Harness Replays / Harness Replays (pull_request) Successful in 2s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CTO decision 2026-05-23: same Postgres, separate schema, v2 plugin is SSOT for tenant memory. A0 (PR #1742) isolates plugin tables under the `memory_plugin` schema. A1 (this PR) removes the silent fallback paths that route to `agent_memories` when MEMORY_PLUGIN_URL is unset, so a half-configured deployment fails loudly on the first memory call rather than quietly writing to a frozen table no one reads. ## Surfaces removed 1. **`toolCommitMemory` legacy SQL branch** (`internal/handlers/mcp_tools.go`). Was: route through v2 shim when `memoryV2Available() == nil`, else INSERT directly into `agent_memories`. Now: error `memory plugin is not configured (set MEMORY_PLUGIN_URL)` when v2 is not wired; route through the shim otherwise. ~30 LOC of inline SQL + scope validation gone — the shim already enforces the same scope rules (it just talks to the plugin). 2. **`toolRecallMemory` legacy SQL branch** (same file). Was: three scope-specific SELECTs on `agent_memories` when v2 unwired. Now: same error semantics as commit. ~60 LOC gone. 3. **`AdminMemoriesHandler.Export` legacy SQL branch** (`internal/handlers/admin_memories.go`). Was: when `cutoverActive() == false` (which checked both `MEMORY_V2_CUTOVER` *and* v2 wiring), run a direct `SELECT FROM agent_memories JOIN workspaces`. Now: respond `503 Service Unavailable` with hint `memory plugin is not configured (set MEMORY_PLUGIN_URL)`. Plugin path unchanged. 4. **`AdminMemoriesHandler.Import` legacy SQL branch** (same file). Was: per-entry workspace lookup + dedup check + INSERT into `agent_memories`. Now: same 503 when plugin not wired. 5. **`cutoverActive()` + `envMemoryV2Cutover` constant** removed — replaced with `memoryV2Wired()` which only checks `h.plugin != nil && h.resolver != nil`. The `MEMORY_V2_CUTOVER` env var is no longer read inside the platform binary (CP user-data still inspects it before spawning the sidecar, but workspace-server does not branch on it). Tenant operators upgrading to this build cannot accidentally stay in v1 by unsetting `MEMORY_V2_CUTOVER`. 6. **Boot-time guard simplification** in `internal/memory/wiring/wiring.go`. The `cutover` boolean and its two loud-WARN log lines are gone (the half-configured-cutover failure mode they guarded against is structurally impossible now). The one remaining boot log when `MEMORY_PLUGIN_URL` is unset explicitly tells the operator that v2 MCP tools will return errors until the env is set. ## Surfaces deliberately NOT removed in this PR - **`MemoriesHandler` REST endpoints** (`POST/GET/DELETE/PATCH /workspaces/:id/memories` in `memories.go`). These are the canvas-side surface and still write to `agent_memories`. Their cutover is the #1734 work (canvas Memory tab fix); coupling that here would make this PR much bigger and would conflict with the tab rewrite already drafted in #1734. - **`seedInitialMemories`** (`workspace_provision.go`) still writes to `agent_memories` during workspace creation. Its v2 migration is deferred to Phase A3 (when we drop the table entirely) — early migration would force every test that exercises workspace creation to wire a v2 stub. - **REST `/admin/memories/export` and `/admin/memories/import` request-decode path** — the `400 Bad Request` reject on malformed JSON runs before any plugin check and is preserved by the surviving `TestAdminMemories_Import_InvalidJSON`. ## Tests rewritten - `TestCutoverActive` → `TestMemoryV2Wired` (the env-flag dimension is gone; only wiring matters now). - `TestExport_LegacyPathWhenCutoverInactive` → `TestExport_503When PluginNotWired` + new sibling `TestImport_503WhenPluginNotWired`. - `TestToolCommitMemory_FallsThroughToLegacyWhenV2Unwired` → `TestToolCommitMemory_ErrorsWhenV2Unwired` (asserts on the error-message hint at `MEMORY_PLUGIN_URL`). - `TestToolRecallMemory_FallsThroughToLegacyWhenV2Unwired` → `TestToolRecallMemory_ErrorsWhenV2Unwired` (same). - `TestBuild_WarnsWhenCutoverWithoutPluginURL`, `TestBuild_LoudWarnWhenCutoverAndProbeFails`, `TestBuild_NoWarnWhenNeitherSet`, `TestBuild_QuietProbeFailWhenCutoverOff` → consolidated to `TestBuild_LogsWhenURLUnset` + `TestBuild_LogsWhenProbeFails`. ## Tests deleted (legacy SQL coverage now redundant) - `TestAdminMemories_Export_{Success,Empty,QueryError,RedactsSecrets}`, `TestAdminMemories_Import_{Success,WorkspaceNotFound_SkipsEntry, DuplicateSkipped,RedactsSecretsBeforeDedup,PreservesCreatedAt}` — every assertion was about the deleted SQL path. The plugin path is fully covered by 16 tests in `admin_memories_cutover_test.go` (listed in the file header for traceability). - `TestMCPHandler_CommitMemory_LocalScope_Success`, `TestMCPHandler_CommitMemory_SecretInContent_IsRedactedBeforeInsert`, `TestMCPHandler_CommitMemory_CleanContent_PassesThrough`, `TestMCPHandler_RecallMemory_LocalScope_Empty` — same reason; the v2 shim path is covered by `TestToolCommitMemory_RoutesThroughV2WhenWired`, `TestToolRecallMemory_RoutesThroughV2WhenWired`, and every test in `mcp_tools_memory_v2_test.go`. - `TestMCPHandler_CommitMemory_GlobalScope_Blocked_ScrubsInternalError` and its `RecallMemory_` sibling are KEPT (renamed to drop the `GlobalScope_Blocked_` qualifier) because they validate the OFFSEC-001 JSON-RPC scrub layer in `mcp.go dispatchRPC`, which is orthogonal to the memory backend. Net diff: −752 LOC across 9 files. ## Stage gates **Stage A — local build + test sweep**: - `go vet ./...` clean. - `go test -count=1 ./internal/handlers/... ./internal/memory/... ./cmd/memory-plugin-postgres/...` all green (~25 s). - Manual: confirmed `memoryV2Available()` still returns its hint message ("set MEMORY_PLUGIN_URL"), and that the JSON-RPC scrub replaces it with the OFFSEC-001 constant. **Stage B / C — deferred to staging deploy after gating PR lands**. ## Gating **DO NOT merge until PR #1742 (A0 schema isolation) has been merged and rolled out to every tenant's image cache.** A tenant booting this build without a reachable memory plugin sidecar will return errors on every `commit_memory` / `recall_memory` / admin export+import call. That's the intended behavior — but it should only happen after the sidecar is reliably running on every tenant EC2. The existing tenant entrypoint (`entrypoint-tenant.sh:35-66`, since 2026-05-14) already aborts boot when `MEMORY_V2_CUTOVER=true` and the plugin doesn't come up in 30s, so SaaS tenants are protected by the CP boot gate — but only IF they're running the post-2026-05-14 image. Audit `org_instances.fly_machine_id` boot logs before flipping. ## Risks - Self-hosted operators who haven't set `MEMORY_PLUGIN_URL` will get errors instead of working `commit_memory`. The error message tells them exactly what to set. Acceptable trade-off for SSOT. - Tenants currently running on a pre-2026-05-14 tenant image (entrypoint doesn't spawn sidecar) will start failing memory calls on first restart after the new core image rolls out. Confirm CP has recycled every tenant onto the current image before merging. - `seedInitialMemories` still writes to `agent_memories`; rows there remain readable to any operator query bypassing the API. They'll go away with the table in Phase A3. Refs: #1733 (memory SSOT consolidation), #1742 (Phase A0 — schema isolation, gates this PR), #1734 (Memory tab UI fix, unblocked once this lands). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |