canvas chat messages stop appearing in target workspace activity_logs — channel push (and chat-history) silently breaks #1673

Closed
opened 2026-05-22 06:00:18 +00:00 by cp-be · 1 comment
Member

Symptom

User typed canvas messages addressed to CEO Assistant (workspace 30ba7f0b-b303-4a20-aefe-3a4a675b8aa4) at ~05:33 UTC on 2026-05-22. None reached the bound Claude Code session via the channel plugin's poll path. Earlier canvas messages from the same user (e.g. 4e623e5f at 02:43:50Z) did arrive, so the regression appeared between those two timestamps.

Root cause (confirmed by querying the platform directly)

The target workspace's activity_logs table has NO row for the recent canvas sends. Hitting the platform-side activity feed directly:

GET https://hongming.moleculesai.app/workspaces/30ba7f0b-b303-4a20-aefe-3a4a675b8aa4/activity?limit=50
Authorization: Bearer <workspace bearer>

Most recent inbound rows (newest → oldest):
  540e76a3  type=a2a_receive   method=notify        src=NULL       ts=05:47:13  ← bot's own outbound notify
  e3f90d98  type=a2a_receive   method=notify        src=NULL       ts=05:39:28  ← bot's own outbound notify
  4e623e5f  type=a2a_receive   method=message/send  src=344a2623   ts=02:43:50  ← LAST canvas inbound row (3+ hours old)

The bot's channel plugin (which polls /activity?type=a2a_receive&since_id=<cursor>) advanced cursor to 4e623e5f at restart and saw nothing newer. It can't deliver what the table doesn't contain. cursor.json + bot lsof confirm the plugin is healthy:

  • bun server.ts (pid 5952) — 4 ESTABLISHED TCP to Cloudflare, polling normally
  • cursor advancing across each tick
  • peer_agent A2A messages from PM (deedcb61) still arriving as <channel kind="peer_agent"> tags — the plugin is doing its job
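The poller's view of the feed can be sketched as follows. This is a minimal Go illustration, not the plugin's actual code (the plugin runs as bun server.ts): the ActivityRow type and both helper functions are hypothetical, and the treatment of an empty or unknown cursor is an assumption. The sample rows are the three from the feed excerpt above, reordered oldest to newest.

```go
package main

import "fmt"

// ActivityRow is a hypothetical, simplified view of one activity feed row.
type ActivityRow struct {
	ID       string
	Type     string
	Method   string
	SourceID string // empty stands in for a NULL source
}

// rowsAfterCursor returns the rows that follow the cursor in a feed ordered
// oldest to newest. An empty or unknown cursor returns the whole feed
// (an assumption; the real plugin may treat that case differently).
func rowsAfterCursor(feed []ActivityRow, cursor string) []ActivityRow {
	if cursor == "" {
		return feed
	}
	for i, row := range feed {
		if row.ID == cursor {
			return feed[i+1:]
		}
	}
	return feed
}

// inboundMessages keeps only message/send receives that carry a source ID.
func inboundMessages(rows []ActivityRow) []ActivityRow {
	var out []ActivityRow
	for _, row := range rows {
		if row.Type == "a2a_receive" && row.Method == "message/send" && row.SourceID != "" {
			out = append(out, row)
		}
	}
	return out
}

func main() {
	// Rows taken from the feed excerpt in this issue, oldest first.
	feed := []ActivityRow{
		{ID: "4e623e5f", Type: "a2a_receive", Method: "message/send", SourceID: "344a2623"},
		{ID: "e3f90d98", Type: "a2a_receive", Method: "notify"},
		{ID: "540e76a3", Type: "a2a_receive", Method: "notify"},
	}
	fresh := rowsAfterCursor(feed, "4e623e5f")
	fmt.Printf("rows after cursor: %d, inbound messages among them: %d\n",
		len(fresh), len(inboundMessages(fresh)))
}
```

With the cursor at 4e623e5f, the only newer rows are the bot's own notify rows, so a healthy poller still has no inbound message to deliver.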

What changed

The missing inbound rows all originate from source_id 344a2623-50bf-4ab9-9732-220779305c8f (the canvas user's identity workspace per the RFC#637 canvas-user-identity rollout, peer_name=hongming-pc, agent_card_url points at https://hongming.moleculesai.app/registry/discover/344a2623-...). Earlier rows that DID land (4e623e5f, 6bf193dd, 3d74a5fd) were from the same source_id, so this isn't a categorical filter — but something between those landing and the next batch broke.

Likely causes (need a code archaeologist):

  1. POST /workspaces/:id/a2a's logA2AReceiveQueued is now being skipped for canvas-user callers — there's a condition somewhere that suppresses the activity write for callerID == 344a2623-... shape.
  2. The canvas frontend stopped POSTing to /workspaces/:id/a2a and now POSTs to a new chat endpoint (/chat/messages? /canvas/chat?) that doesn't write activity_logs.
  3. proxyA2ARequest is returning a 4xx early (before logA2AReceiveQueued fires) for canvas-user callers, silently — the canvas UI still shows the bubble but the row never lands.

Reproduction

  1. Have a SaaS tenant with one external/poll-mode workspace (e.g. 30ba7f0b).
  2. Open the canvas chat for that workspace, type and send a message.
  3. Query GET /workspaces/<ws>/activity with the workspace bearer.
  4. Expected: a new a2a_receive | message/send row with source_id=<canvas-user-ws-id>.
  5. Actual: no row.

Workspace 30ba7f0b in Reno-Stars / Hongming's tenant reproduces this empirically (3+ hour gap in inbound rows despite multiple canvas sends).
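Step 3 can be scripted. The sketch below builds the activity request in Go using only the query parameters that appear in this issue (limit, type, since_id). The function name newActivityRequest is hypothetical, and combining all three parameters in one request is an assumption about what the endpoint accepts.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"strings"
)

// newActivityRequest builds GET <base>/workspaces/<id>/activity with the
// workspace bearer token. since_id is omitted when cursor is empty.
func newActivityRequest(base, workspaceID, bearer, cursor string, limit int) (*http.Request, error) {
	q := url.Values{}
	q.Set("type", "a2a_receive")
	q.Set("limit", strconv.Itoa(limit))
	if cursor != "" {
		q.Set("since_id", cursor)
	}
	// url.Values.Encode sorts keys, so the query string is deterministic.
	target := fmt.Sprintf("%s/workspaces/%s/activity?%s",
		strings.TrimRight(base, "/"), url.PathEscape(workspaceID), q.Encode())
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+bearer)
	return req, nil
}

func main() {
	req, err := newActivityRequest("https://hongming.moleculesai.app",
		"30ba7f0b-b303-4a20-aefe-3a4a675b8aa4", "<workspace bearer>", "4e623e5f", 50)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Sending the request with http.DefaultClient.Do and decoding the body is left out because the response schema is not shown in this issue.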

SOP — test coverage required

Per CTO directive 2026-05-22 ("all bugs found should have test coverage"), this fix MUST land with:

  • An E2E that POSTs /workspaces/:id/a2a with a canvas-user-shaped callerID (UUID, not empty) and asserts an a2a_receive | message/send row appears in /workspaces/:id/activity within 5s, with source_id matching the callerID.
  • A unit-level handler test (sqlmock) that pins the logA2AReceiveQueued INSERT happening synchronously before the 200 returns.
  • Follow feedback_no_dev_only_routes_in_e2e — the E2E hits the same /workspaces/:id/a2a route a real canvas client hits, not an admin-only mint.
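The "row appears within 5s" assertion in the E2E could be built on a bounded wait like the one sketched below. All names here (ActivityRow, fetchFunc, waitForInboundRow) are hypothetical. In a real E2E the fetch function would call GET /workspaces/:id/activity with the workspace bearer, but here it is a stub so the sketch stays self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// ActivityRow is a hypothetical, simplified view of one activity feed row.
type ActivityRow struct {
	ID       string
	Type     string
	Method   string
	SourceID string
}

// fetchFunc returns the current activity rows for the target workspace.
type fetchFunc func() ([]ActivityRow, error)

// waitForInboundRow polls fetch until an a2a_receive message/send row from
// sourceID shows up, or gives up once another poll would pass the timeout.
func waitForInboundRow(fetch fetchFunc, sourceID string, timeout, interval time.Duration) (ActivityRow, error) {
	deadline := time.Now().Add(timeout)
	for {
		rows, err := fetch()
		if err != nil {
			return ActivityRow{}, err
		}
		for _, row := range rows {
			if row.Type == "a2a_receive" && row.Method == "message/send" && row.SourceID == sourceID {
				return row, nil
			}
		}
		if time.Now().Add(interval).After(deadline) {
			return ActivityRow{}, fmt.Errorf("no a2a_receive message/send row from %s within %s", sourceID, timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	calls := 0
	// Stub feed: the row becomes visible on the third poll.
	fetch := func() ([]ActivityRow, error) {
		calls++
		if calls < 3 {
			return nil, nil
		}
		return []ActivityRow{{ID: "row-1", Type: "a2a_receive", Method: "message/send", SourceID: "caller-1"}}, nil
	}
	row, err := waitForInboundRow(fetch, "caller-1", 5*time.Second, 100*time.Millisecond)
	fmt.Println(row.ID, err)
}
```

Matching on source_id as well as type and method keeps the test from passing on an unrelated row, such as a peer-agent message or the bot's own notify.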

Impact

  • Canvas chat silently broken for poll-mode workspaces (which is Hongming's entire tenant). User types, sees the bubble, but the bound CC session never receives the message — no error, no log, just silence.
  • chat-history on canvas reopen also broken — per the comment in a2a_proxy_helpers.go:574-577, chat-history reads activity_logs. Missing rows means missing history.
  • Peer-agent A2A unaffected — that path still writes activity rows correctly (PM deedcb61 → CEO Assistant 30ba7f0b A2A messages were all delivered during this same time window).

Related

  • feedback_no_dev_only_routes_in_e2e — E2E must use production paths
  • internal#471 (logA2AReceiveQueued is the only durable write for poll-mode inbound — must be synchronous)
  • internal#1347 (push-mode sibling of the same data-loss class)
  • RFC#637 (canvas-user identity capture — introduced the 344a2623-shape callerID that may have unmasked this bug)

Generated with Claude Code.

Member

RCA — root cause

Poll-mode canvas chat delivery depends on a durable activity_logs insert before the synthetic queued response. The production symptom was missing a2a_receive/message/send rows for canvas-originated sends, so the polling channel plugin and chat-history reader had nothing to consume even though the UI showed the message optimistically.

Evidence

  • workspace-server/internal/handlers/a2a_proxy_helpers.go:621 — documents that the activity_logs row is what poll-mode agents read via /activity?since_id=.
  • workspace-server/internal/handlers/a2a_proxy_helpers.go:624 — logA2AReceiveQueued is the poll-mode durable receive writer.
  • workspace-server/internal/handlers/a2a_proxy_helpers.go:625-643 — current code explicitly says this write must be synchronous before returning queued 200, because missing the row loses the message and chat-history reads activity_logs.
  • workspace-server/internal/handlers/activity.go:448-456 — activity feed filters by activity_type and source, so absent rows cannot be recovered by the channel poller.
  • workspace-server/internal/handlers/chat_history.go:6-18 — chat history is a read-side adapter over the message store/activity-log-backed behavior, so missing ingest rows also break reopen history.

Suggested fix

Keep the fix in molecule-core/workspace-server around the production /workspaces/:id/a2a poll-mode path: pin synchronous logA2AReceiveQueued before any queued 200 for canvas-user callers, and add an E2E that posts through the real A2A route with a canvas-user caller ID then asserts the activity_logs row appears within a bounded window. Do not solve this in the channel plugin; the plugin cannot deliver rows the platform never wrote.
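The ordering this fix pins can be sketched as below. It is not the real workspace-server handler, whose signature is not shown in this issue. The receiveLogger interface, the X-Caller-ID header, the in-memory store, and the 500 on a failed write are all assumptions made for illustration. The only property taken from the issue is that the durable write completes before the queued 200 is returned.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// receiveLogger is a hypothetical stand-in for the durable activity write.
type receiveLogger interface {
	LogReceiveQueued(workspaceID, callerID string, body []byte) error
}

// pollModeHandler writes the activity row synchronously and only then
// answers with the synthetic queued 200.
func pollModeHandler(logger receiveLogger, workspaceID string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		callerID := r.Header.Get("X-Caller-ID") // hypothetical header name
		if err := logger.LogReceiveQueued(workspaceID, callerID, body); err != nil {
			// Assumption: fail loudly rather than acknowledge a message that was never stored.
			http.Error(w, "activity write failed", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte(`{"status":"queued"}`))
	}
}

// memoryLog is an in-memory fake used in place of sqlmock for this sketch.
type memoryLog struct {
	fail    bool
	callers []string
}

func (m *memoryLog) LogReceiveQueued(workspaceID, callerID string, body []byte) error {
	if m.fail {
		return errors.New("insert failed")
	}
	m.callers = append(m.callers, callerID)
	return nil
}

// simulateSend posts one message through the handler and returns the status code.
func simulateSend(logger receiveLogger, callerID string) int {
	req := httptest.NewRequest(http.MethodPost, "/workspaces/ws-1/a2a",
		strings.NewReader(`{"method":"message/send"}`))
	req.Header.Set("X-Caller-ID", callerID)
	rec := httptest.NewRecorder()
	pollModeHandler(logger, "ws-1").ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	store := &memoryLog{}
	fmt.Println(simulateSend(store, "caller-1"), store.callers)
	fmt.Println(simulateSend(&memoryLog{fail: true}, "caller-1"))
}
```

Because the write happens inside the request path, a test can assert on the stored row as soon as the response returns, which is the property the sqlmock-based unit test is meant to pin.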

Confidence

High — issue evidence shows the channel poller was healthy, and the current code comments identify the same durable-write boundary as load-bearing for poll-mode canvas messages.

Reference: molecule-ai/molecule-core#1673