canvas chat messages stop appearing in target workspace activity_logs #1675

Open
opened 2026-05-22 06:01:04 +00:00 by cp-be · 2 comments
Member

Symptom: User typed canvas messages to CEO Assistant (30ba7f0b) ~05:33Z 2026-05-22. None reached bound CC session via channel plugin. Earlier sends from same user (4e623e5f at 02:43:50Z) did arrive. Regression appeared between those.

Root cause: target activity_logs has NO row for recent canvas sends. Most recent inbound at 30ba7f0b is 4e623e5f at 02:43:50Z. Bot polls correctly + advances cursor. Cannot deliver what table doesn't contain. Peer-agent A2A (PM relay) still works fine — regression is canvas-specific.

Missing rows all from source_id 344a2623 (canvas-user identity per RFC#637). Earlier rows from same source landed.

Suspects:

  1. logA2AReceiveQueued (a2a_proxy_helpers.go:567) skipped for canvas-user callers
  2. Canvas FE switched to new chat endpoint that doesn't write activity_logs
  3. proxyA2ARequest returns 4xx early for canvas callers, silently

Repro:

  1. SaaS tenant with poll-mode workspace.
  2. Send canvas message.
  3. GET /workspaces/:id/activity.
  4. Expected: a2a_receive|message/send row with source_id=canvas-user-ws.
  5. Actual: no row.

Reno-Stars / 30ba7f0b reproduces empirically.

Test coverage required (CTO 2026-05-22 all-bugs-need-test directive):

  • E2E: POST /workspaces/:id/a2a with canvas-user UUID callerID, assert activity row within 5s.
  • Unit sqlmock: pin logA2AReceiveQueued INSERT synchronous before 200.
  • Use production /workspaces/:id/a2a route per feedback_no_dev_only_routes_in_e2e.

Impact:

  • Canvas chat silently broken for poll-mode workspaces (Hongming's entire tenant). User types, sees bubble, CC session never receives.
  • chat-history on reopen broken (reads activity_logs).
  • Peer A2A unaffected.

Related: internal#471 logA2AReceiveQueued sync; internal#1347 push-mode sibling; RFC#637 canvas-user identity; feedback_no_dev_only_routes_in_e2e.

Author
Member

Update — empirical diagnosis revision:

Corrected understanding after running diagnostic POST and inspecting workspace records:

  1. Workspace 344a2623 is NOT a canvas-user identity workspace — it's hongming-pc (parent_id=30ba7f0b), runtime=external. A child peer workspace, not a user-identity row. The earlier activity rows with source_id=344a2623 were peer A2A from hongming-pc → CEO Assistant (which worked properly through hierarchy).

  2. POST /workspaces/30ba7f0b/a2a does write activity_logs correctly — I posted a diagnostic message/send to that endpoint with the CEO Assistant workspace bearer. Activity row 6b900b20 (method=message/send, src=30ba7f0b, ts=2026-05-22T06:28:06) landed within 2s.

  3. Bot cursor advanced past my diagnostic POST — current cursor = 6b900b20. Bot polled, saw the row, advanced. But the synthetic <channel> turn never reached the bound CC session.

  4. Canvas FE uses NEXT_PUBLIC_ADMIN_TOKEN (org-level) — confirmed in canvas/src/lib/api.ts. callerID stays empty → hierarchy check bypassed in proxyA2ARequest. So the original hypothesis (RFC#637 broke hierarchy check) is wrong.
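Point 4 can be illustrated with a rough sketch of the gating it describes: an empty callerID skips the hierarchy check, while a non-empty one must pass it. Everything here is an assumption made for illustration. `authorizeCaller` is a hypothetical name, and the parent/child rule inside `canCommunicate` is a guess, not the real CanCommunicate logic.

```go
package main

import "fmt"

// canCommunicate is a HYPOTHETICAL stand-in for the real hierarchy check.
// Assumed rule: a workspace may talk to itself, its parent, or its child.
func canCommunicate(parents map[string]string, callerID, targetID string) bool {
	return callerID == targetID ||
		parents[callerID] == targetID ||
		parents[targetID] == callerID
}

// authorizeCaller sketches the behaviour described above: callers without a
// workspace identity (empty callerID, e.g. an org-level token) bypass the
// hierarchy check; everyone else must pass it.
func authorizeCaller(parents map[string]string, callerID, targetID string) error {
	if callerID == "" {
		return nil // no workspace identity: hierarchy check skipped
	}
	if !canCommunicate(parents, callerID, targetID) {
		return fmt.Errorf("caller %s may not message %s", callerID, targetID)
	}
	return nil
}

func main() {
	// hongming-pc (344a2623) is a child of CEO Assistant (30ba7f0b).
	parents := map[string]string{"344a2623": "30ba7f0b"}
	for _, caller := range []string{"", "344a2623", "canvas-user-uuid"} {
		fmt.Printf("caller=%q err=%v\n", caller, authorizeCaller(parents, caller, "30ba7f0b"))
	}
}
```

Under these assumptions only the third caller is rejected, which matches the observation that the canvas FE path (empty callerID) never hits the rejection.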

Revised root cause hypothesis:

Bot emitNotification → CC MCP host delivery is silently dropping for SOME rows. Either (a) self-call source_id (workspace_id == source_id) is filtered by CC's notification handler, OR (b) the bot's mcp.notification(...) Promise rejects with a swallowed error.

Next debug steps:

  • Capture bot stderr during a known-good POST + a missing-from-CC POST. The bot writes `failed to deliver notification for ${act.id}` on rejection.
  • Add a test that calls emitNotification with self-call shape + peer-call shape + asserts both reach the MCP transport.

The PR#1676 sqlmock test is technically valid (CanCommunicate DOES reject canvas-user-shaped X-Workspace-ID callers) but it's not THE bug the user is observing — that scenario doesn't happen in the canvas FE's actual code path.

Closing PR#1676 + re-doing as a delivery-path test once I instrument the bot's stderr.

Generated with Claude Code

Member

RCA — root cause

Canvas chat loss for poll-mode workspaces was caused by the poll-mode A2A ingest path acknowledging the message before its activity_logs row was durable. The user saw a synthetic queued/accepted response, but the activity_logs row, the only delivery and history source for poll-mode agents, could still be pending in a detached write; a process restart, deploy, OOM, or request cancellation in that window made the message disappear from both agent polling and chat history.
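The acknowledged-before-durable window can be reproduced in miniature. The sketch below is a toy model, not the handler code: `store`, `ackDetached`, and `ackSync` are hypothetical names, and the never-closed `gate` channel stands in for a process that dies before the detached write gets to run.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy in-memory stand-in for activity_logs.
type store struct {
	mu   sync.Mutex
	rows []string
}

func (s *store) insert(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows = append(s.rows, id)
}

func (s *store) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.rows)
}

// ackDetached models the pre-fix shape: the write is handed to a goroutine
// and the caller is acknowledged immediately. The write only happens once
// gate is closed, which models scheduling and database latency.
func ackDetached(s *store, id string, gate <-chan struct{}) string {
	go func() {
		<-gate
		s.insert(id)
	}()
	return "queued"
}

// ackSync models the fixed shape: the row exists before the acknowledgement.
func ackSync(s *store, id string) string {
	s.insert(id)
	return "queued"
}

func main() {
	s := &store{}
	gate := make(chan struct{}) // never closed: the "process" dies before the write runs

	fmt.Println(ackDetached(s, "msg-1", gate), "rows:", s.count()) // queued rows: 0
	fmt.Println(ackSync(s, "msg-2"), "rows:", s.count())           // queued rows: 1
}
```

In the detached case the caller holds a "queued" acknowledgement for a message that no poller will ever see, which is the symptom reported in this issue.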

Evidence

  • workspace-server/internal/handlers/a2a_proxy.go:427 — poll-mode short-circuits push dispatch and calls logA2AReceiveQueued(...) before returning the synthetic queued envelope.
  • workspace-server/internal/handlers/a2a_proxy_helpers.go:624 — logA2AReceiveQueued is the durable poll-mode inbound write path.
  • workspace-server/internal/handlers/a2a_proxy_helpers.go:650 — current fix uses context.WithoutCancel plus a synchronous LogActivity call before returning.
  • workspace-server/internal/handlers/a2a_poll_ingest_persist_test.go:19 — regression test documents the pre-fix detached h.goAsync(...) insert race.
  • workspace-server/internal/handlers/a2a_poll_ingest_persist_test.go:60 — test pins the required contract: activity row committed before queued 200.

Suggested fix

Keep the current fix shape in workspace-server/internal/handlers/a2a_proxy_helpers.go: poll-mode inbound message/send must synchronously persist a2a_receive to activity_logs on a context.WithoutCancel context before ProxyA2A returns queued/200. Do not reintroduce goAsync for this write. The linked test should stay with the production /workspaces/:id/a2a route so future refactors that move canvas chat or poll-mode delivery cannot silently lose the durability barrier.

Confidence

High — the issue symptom, poll-mode short-circuit, durable row contract, and regression test all describe the same acknowledged-before-durable race.

Reference: molecule-ai/molecule-core#1675