Agent memory subsystem broken mid-cutover: POST /memories 500, GET /memories 404, v2 awareness endpoints not exposed on tenant edge #1828

Closed
opened 2026-05-25 02:48:09 +00:00 by RenoStarsAI-production-client · 3 comments

Summary

The agent-memory subsystem on tenant reno-stars.moleculesai.app is broken mid-cutover. Both the SDK-side tools (recall_memory, commit_memory) and the REST plural endpoints (/workspaces/{id}/memories) have been unusable for several hours. The old endpoint is being torn down, but the new v2 path (/api/v1/namespaces/{ns}/memories, per AwarenessClient) is not exposed on the tenant edge. The workspace KV (/workspaces/{id}/memory, singular) and session-search are unaffected.

This blocks every agent that depends on long-term memory recall — for us, the SEO Agent's system_brief recall failed and we had to work around by inlining the full prompt body into the schedule and bootstrapping SSOT from a git clone instead.

Observed behavior (incidents from today)

# Past behavior (worked ~3 hours ago at ~01:04Z when we configured a workspace):
POST /workspaces/{id}/memories  → 200/201  {"id": "...", "namespace": "...", "scope": "TEAM"}

# Current behavior (probed multiple times between 01:40Z and 02:30Z):
POST /workspaces/{id}/memories
  → 500  {"error": "failed to store memory"}

GET  /workspaces/{id}/memories
  → 404  Next.js canvas HTML page (the request never reaches workspace-server's Search handler at all — it's getting captured by the canvas frontend)

# Both ORG-key and workspace-key reproduce — not an auth issue.

The 404 on GET is what cascades into the agent's user-visible failure mode:

# builtin_tools/awareness_client.py _parse_search_response — and similar fallback in
# builtin_tools/memory.py recall_memory() legacy path
resp = await client.get(f"{PLATFORM_URL}/workspaces/{WORKSPACE_ID}/memories", ...)
data = resp.json()         # ValueError: Expecting value: line 1 column 1 (char 0)
                            # (because resp.body is HTML, not JSON)

→ surfaces to the agent as:

Error recalling memory: Expecting value: line 1 column 1 (char 0)

commit_memory returns {"success": False, "error": "failed to store memory"} because POST is 500.
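The decode failure described above could be made legible by inspecting the response before parsing it. The following is a minimal sketch, not the SDK's actual code: `parse_memories_response` is a hypothetical helper that takes the status code, the Content-Type header, and the raw body, and the shape of the returned dict is an assumption modeled on the `{"success": False, "error": ...}` result quoted above.

```python
import json


def parse_memories_response(status: int, content_type: str, body: str) -> dict:
    """Hypothetical helper: turn a raw HTTP response into a result dict.

    Never raises on a non-JSON body; reports it as a routing problem instead.
    """
    is_json = content_type.split(";")[0].strip().lower() == "application/json"
    if not is_json:
        # An HTML body most likely means the frontend served the request
        # instead of the API (the GET 404 case in this issue).
        return {
            "success": False,
            "error": f"non-JSON response (HTTP {status}, Content-Type {content_type or 'missing'})",
        }
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        return {"success": False, "error": f"invalid JSON (HTTP {status}): {exc.msg}"}
    if status >= 400:
        # Error payloads seen in this issue have the shape {"error": "..."}.
        message = data.get("error", "unknown error") if isinstance(data, dict) else "unknown error"
        return {"success": False, "error": f"HTTP {status}: {message}"}
    return {"success": True, "data": data}
```

With a check like this, the agent would see the HTTP status and content type rather than a bare JSON decode message.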

Probe matrix (this tenant, 02:30Z 2026-05-25)

| Endpoint | Method | Status | What it returns |
|---|---|---|---|
| /workspaces/{id}/memories (plural HMA) | POST | 500 | {"error":"failed to store memory"} (was 201 earlier today) |
| /workspaces/{id}/memories (plural HMA) | GET | 404 | Next.js HTML (request never reaches workspace-server route) |
| /workspaces/{id}/memory (singular KV) | POST | 200 | {"key":"...","status":"ok","version":1} |
| /workspaces/{id}/memory (singular KV) | GET | 200 | [{...}] |
| /workspaces/{id}/session-search | GET | 200 | recent activity rows |
| /api/v1/namespaces/{id}/memories (v2 awareness, per AwarenessClient._memories_url) | GET | 404 | Next.js HTML (v2 path not exposed on edge) |
| /api/v1/namespaces/workspace:{id}/memories (with workspace: prefix) | GET | 404 | Next.js HTML |
| /v1/namespaces/{id}/memories | GET | 404 | Next.js HTML |
| /memory-plugin/v1/namespaces/{id}/memories | GET | 404 | Next.js HTML |
| /awareness/api/v1/namespaces/{id}/memories | GET | 404 | Next.js HTML |

Agent runtime env on a fresh workspace (just provisioned today) does NOT contain AWARENESS_URL or AWARENESS_NAMESPACE either, so AwarenessClient returns None and the fallback HMA path is taken — which is the path that's now broken.
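To check which path a given workspace would take, something like the following could be run inside the agent runtime. It is only a sketch: `select_memory_backend` is a hypothetical function, and the rule that both variables must be set for the v2 path is an assumption inferred from the behavior described above, not read from the SDK source.

```python
import os
from typing import Mapping


def select_memory_backend(env: Mapping[str, str]) -> str:
    """Hypothetical check: report which memory path this environment selects."""
    url = env.get("AWARENESS_URL", "").strip()
    namespace = env.get("AWARENESS_NAMESPACE", "").strip()
    if url and namespace:
        # Assumed v2 URL layout, taken from the probe matrix in this issue.
        return f"v2:{url.rstrip('/')}/api/v1/namespaces/{namespace}/memories"
    # Either variable missing: fall back to the legacy plural (HMA) endpoint.
    return "legacy:/workspaces/{id}/memories"


if __name__ == "__main__":
    print(select_memory_backend(os.environ))
```

On the affected workspace this would report the legacy path, which is the one currently failing.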

Likely cause

Cutover seems half-done. From the monorepo I can see:

  • MEMORY_V2_CUTOVER flag exists (PR-8 / RFC #2728) in workspace-server/internal/handlers/admin_memories.go
  • memory-plugin-postgres is bundled in-image (PR #2906)
  • The new memory tab v2 UI shipped (PR #2956)
  • But on the tenant edge (or Next.js front-of-house routing), GET /workspaces/:id/memories is NOT being routed to workspace-server — it's hitting the canvas SPA and rendering a 404 page
  • POST is routed (it reaches workspace-server) but the downstream plugin sidecar appears to be erroring → 500
  • The cutover-target endpoints (/api/v1/namespaces/...) are not exposed via the tenant edge at all

So the cutover middleware partially severed the old write path AND the new path is not exposed on this tenant, leaving us in a hole.

Impact

Any agent on this tenant calling recall_memory or commit_memory fails. For our SEO Agent specifically:

  • The recurring */30 * * * * schedule fires correctly, the workspace runs fine, but every tick that calls recall_memory("system_brief") returns the JSON decode error and the agent hands off having done nothing
  • Workaround we shipped today: inline the full system_brief (~13KB) in the schedule prompt body, and clone the SSOT repo from GitHub via a poured SERVICES_GH_TOKEN secret so the agent has access to all the scanner prompts + state files via filesystem instead of memory
  • This works but it's a brittle workaround; any other agent we run on this platform will hit the same wall

Proposed fixes (any one would unblock us)

Fix A — Restore the legacy plural endpoint on edge

Cheapest: revert or guard the edge change so GET /workspaces/:id/memories continues to route to workspace-server's Search handler (workspace-server/internal/handlers/memories.go:248) until v2 is fully exposed. Same for POST — restore the legacy MemoriesHandler.Commit path so it doesn't 500 mid-cutover. This is the safest no-op rollback while v2 lands.

Fix B — Expose /api/v1/namespaces/... on tenant edge

Wire the awareness/memory-plugin endpoints through the tenant edge proxy so AwarenessClient (already shipped in agent SDK as the preferred path) can reach them. Also: set AWARENESS_URL and AWARENESS_NAMESPACE on agent runtime env at workspace provision time so build_awareness_client() doesn't return None.

Fix C — Auto-deprecate with a useful error

If the old endpoint MUST go, return a structured JSON error from the edge instead of an HTML 404 page so the SDK can surface a useful message:

GET /workspaces/{id}/memories
  → 410  Content-Type: application/json
  {
    "error": "memory_v1_deprecated",
    "hint": "The HMA endpoint /workspaces/{id}/memories has been retired. Use AWARENESS_URL/api/v1/namespaces/{ns}/memories. AWARENESS_URL should be in your workspace env; if not, contact platform team.",
    "deprecated_at": "...",
    "migration_doc": "https://..."
  }

This at least makes the failure mode legible — right now agents see "Expecting value: line 1 column 1 (char 0)" which is unhelpful.
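On the SDK side, the structured 410 proposed in Fix C could be turned into a readable message along these lines. This is a sketch under assumptions: `describe_memory_error` is a hypothetical function, and the field names (`error`, `hint`, `migration_doc`) are the ones proposed in this issue, not an existing contract.

```python
import json


def describe_memory_error(status: int, body: str) -> str:
    """Hypothetical: build an agent-facing message from a failed memory call."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None  # e.g. an HTML page from the frontend
    if (
        status == 410
        and isinstance(payload, dict)
        and payload.get("error") == "memory_v1_deprecated"
    ):
        hint = payload.get("hint", "no hint provided")
        doc = payload.get("migration_doc")
        suffix = f" See {doc}" if doc else ""
        return f"Memory endpoint retired: {hint}{suffix}"
    # Anything else: still report the status instead of a decode error.
    return f"Error recalling memory: HTTP {status}"
```

The design choice here is that the hint travels from the edge to the agent unchanged, so the platform team controls the wording of the migration guidance.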

Suggested rollout order

  1. Fix A first (rollback / guard) — fastest unblock, low risk
  2. Fix B in parallel — properly expose v2 path + inject env vars on provision
  3. Fix C as the deprecation cap — once v2 is the canonical, replace the legacy POST/GET with the structured 410

Reproduction

# Set up auth
ORG_KEY=<an org-scoped key>
WS=<any provisioned workspace ID on the tenant>
P="https://reno-stars.moleculesai.app"

# 1. GET plural — should reach workspace-server's MemoriesHandler.Search.
#    Currently 404s with HTML from canvas Next.js.
curl -s -D - "$P/workspaces/$WS/memories" \
  -H "Authorization: Bearer $ORG_KEY" -H "Origin: $P" | head -20

# 2. POST plural — should reach workspace-server's MemoriesHandler.Commit.
#    Currently 500 with {"error":"failed to store memory"}.
curl -s "$P/workspaces/$WS/memories" \
  -H "Authorization: Bearer $ORG_KEY" -H "Origin: $P" -H "Content-Type: application/json" \
  -d '{"content":"probe","scope":"LOCAL","namespace":"probe"}' -w "\nHTTP %{http_code}\n"

# 3. By comparison, singular KV works fine:
curl -s "$P/workspaces/$WS/memory" \
  -H "Authorization: Bearer $ORG_KEY" -H "Origin: $P"

# 4. And session-search works:
curl -s "$P/workspaces/$WS/session-search?q=test" \
  -H "Authorization: Bearer $ORG_KEY" -H "Origin: $P" | head -5

Related context

  • Sibling issue: #1823 (workspace DELETE safety, filed earlier today)
  • Today's SEO Agent recreation incident: we had to recreate the workspace (lost the previous one to an accidental DELETE — see #1823) and during reprovisioning hit this cutover state immediately. We've shipped the workarounds in our schedule prompt, but the platform-side cutover should be cleaned up before this affects more tenants.

— Hongming Wang (airenostars@gmail.com)
— Tenant: reno-stars.moleculesai.app
— Affected workspace: 352e3c2b-0546-4e9c-b487-1e2ff1cf29fc (SEO Agent, recreated 2026-05-25)
— First observed: ~01:30Z 2026-05-25 (POST still worked at 01:04Z, was 500 by 02:15Z)

Member

RCA — root cause

The memory outage is a Phase A3 half-cutover: legacy GET /workspaces/:id/memories was intentionally removed after agent_memories was dropped, while legacy POST /workspaces/:id/memories remains but now requires the v2 memory plugin. Tenants/SDKs that still fall back to the legacy plural surface therefore see frontend 404 HTML on reads and generic 500s on writes when the plugin path is configured but failing downstream.

Evidence

  • workspace-server/internal/router/router.go:275 — comments say legacy /memories Search/Delete/Update routes were removed because v1 agent_memories no longer exists.
  • workspace-server/internal/router/router.go:289 and :301-303 — only legacy POST /memories remains; reads are exposed as /workspaces/:id/v2/memories, not /workspaces/:id/memories or /api/v1/namespaces/....
  • workspace-server/internal/handlers/memories.go:173-230 — legacy POST writes through the v2 plugin and returns failed to store memory when CommitMemory errors.
  • workspace-server/internal/memory/wiring/wiring.go:55-60 — MEMORY_PLUGIN_URL unset yields a nil bundle; there is no SQL fallback.
  • workspace-server/migrations/20260524110000_drop_agent_memories.up.sql:8-20 — the v1 table was dropped and memory_plugin is the source of truth.
  • workspace-server/migrations/20260523130000_drop_workspaces_awareness_namespace.up.sql:3-7 — AWARENESS_URL / namespace plumbing was never wired in prod/staging, matching the issue's missing SDK env surface.

Suggested fix

Pick one compatibility contract and make it explicit in molecule-core/workspace-server plus tenant control-plane env. Fastest unblock: restore GET /workspaces/:id/memories as a compatibility adapter to MemoriesV2Handler.Search (or return structured JSON 410/upgrade guidance instead of Next.js HTML), and change legacy POST failures to expose plugin-unavailable/plugin-downstream status clearly. In parallel, update runtime/SDK env injection to use the actually exposed /workspaces/:id/v2/memories path, or expose the intended namespace API at the tenant edge and inject AWARENESS_URL / AWARENESS_NAMESPACE. Tenant templates must set MEMORY_PLUGIN_URL after Phase A3; otherwise new workspaces boot memoryless by design.
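The distinction between "plugin unavailable" and "plugin downstream failure" could be surfaced roughly as follows. The real handler lives in workspace-server (Go); this Python sketch only illustrates the control flow. `legacy_get_memories`, the error strings, the response shape, and the choice of 503 for a missing plugin and 502 for a downstream error are all assumptions made for illustration.

```python
from typing import Callable, Optional, Tuple

# Hypothetical plugin search signature: workspace ID in, list of memories out.
SearchFn = Callable[[str], list]


def legacy_get_memories(
    plugin_search: Optional[SearchFn], workspace_id: str
) -> Tuple[int, dict]:
    """Sketch of a compatibility shim: always answer with JSON, never HTML."""
    if plugin_search is None:
        # Corresponds to the nil bundle case when MEMORY_PLUGIN_URL is unset.
        return 503, {
            "error": "memory_plugin_unavailable",
            "hint": "MEMORY_PLUGIN_URL is not set",
        }
    try:
        memories = plugin_search(workspace_id)
    except Exception as exc:
        # The plugin is configured but the call to it failed.
        return 502, {"error": "memory_plugin_downstream", "detail": str(exc)}
    return 200, {"memories": memories}
```

Separating the two cases in the status code would let tenants tell a configuration gap from a plugin fault without needing access to production logs.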

Confidence

High — the live symptoms line up with the route table and migration comments; production logs around the Commit memory error (plugin) line would distinguish missing plugin URL from plugin downstream failure for the POST 500.

Member

Status follow-up: the #1828 cutover repair is actively in progress as molecule-core PR #1852 (fix(handlers): restore GET /workspaces/:id/memories as v2 plugin shim).

Current state from API:

  • PR #1852 head 233f372711fb35509623035793c0c997db6c6e35, author agent-dev-a, mergeable=true.
  • It implements the fastest unblock from the RCA: restore legacy GET /workspaces/:id/memories as a shim over the v2 plugin and return structured 503/502 failures instead of falling through to canvas HTML.
  • Latest agent-reviewer approval is on the current head (r7039, not stale).
  • Commit status aggregate is still failure, but the non-success contexts are governance gates, not handler tests: qa-review / approved failed, security-review / approved failed, and sop-checklist / na-declarations is pending/N/A.

Read: the production memory outage has an engineer-authored fix path; the remaining blocker is review/gate completion for #1852, not missing root-cause diagnosis.

Member

Formal closure pass: this memory cutover RCA is already closed, and the repair path identified in the RCA landed via molecule-core PR #1852 (233f3727, merged 2026-05-26T08:46:23Z). Treat #1828 as resolved for pending-task tracking; reopen only on a fresh v2 memory endpoint regression.


Reference: molecule-ai/molecule-core#1828