fix(workspace-server): persist push-mode chat round-trip synchronously — E2E Chat reload flake is a real data-loss race #2195
Reference in New Issue
Block a user
Delete Branch "fix/e2e-chat-mobile-history-reload-flake"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Root cause: REAL product persistence race (not test fragility)
The intermittent
E2E Chat / E2E Chatred —chat-mobile.spec.ts:56 › MobileChat › history persists across reload,expect(getByText('Mobile persistence')).toBeVisible()timing out after reload — is a genuine product data-loss race, root cause (b) from the triage (a real persistence race, not just a fragile fixed timeout).Evidence
Echo: Mobile persistenceto render (the/a2aHTTP 200), thenpage.reload()s and expects the transcript to come back fromGET /chat-history./chat-history(messagestore.PostgresMessageStore) readsactivity_logsrows withactivity_type='a2a_receive'—request_bodyholds the user message,response_bodyholds the agent reply. That single row is the only durable record of a chat round-trip.logA2ASuccess(a2a_proxy_helpers.go) inside a detachedgoAsyncgoroutine. ButProxyA2Aflushes the HTTP 200 (c.Data(...)) the instantproxyA2ARequestreturns — i.e. before the goroutine's INSERT commits.Echo:(test's first wait passes) →page.reload()→GET /chat-historyreadsactivity_logs→ the async INSERT hasn't committed yet → row missing → assertion fails. The::error::Canvas did not start in 30sseen earlier in the run is a separate, additive cold-start flake (see below).internal#470/#1347/RFC#2945) that was already fixed for poll mode by makinglogA2AReceiveQueued/persistUserMessageAtIngestsynchronous. The code comments there explicitly call out "the row must be durable before the queued 200 is returned." The push-mode sibling was left async — that's the bug. Outside CI, the same window loses a real user's message on a reload / workspace-server restart / deploy / OOM between the 200 and the goroutine commit.The fix
Product (primary):
workspace-serverlogA2ASuccessnow writes thea2a_receiverow synchronously (inline, nogoAsync) beforeproxyA2ARequestreturns and the 200 is flushed — mirroring the poll-mode discipline. Keptcontext.WithoutCancel(a chat-exit client disconnect must not abort the write) and best-effort semantics (LogActivityalready logs+swallows INSERT errors, so a DB hiccup never fails the user's send). The non-criticallast_outbound_atupdate and the WS broadcast stay async.Test determinism: after
reload(), the spec now deterministically waits for theGET /chat-history2xx that rehydrates the transcript before asserting visibility, instead of racing a fixed 5s render timeout against an in-flight fetch. No fixed-timeout guessing.Canvas did not start in 30s:e2e-chat.ymlreadiness now probes the real/?m=chatroute for a 2xx (Turbopack compiles routes lazily — a barecurl /can 200 before the page the tests load has compiled) and raises the cold-start budget 30s→120s (jobtimeout-minutesis 15, so bounded).Verification done
go build ./internal/handlers/✓,go vet✓internal/handlerssuite ✓ andinternal/messagestoresuite ✓ (sqlmock — no DB needed; the affected A2A/Delegation/Proxy/Persist/ChatHistory/CrossTenant/Budget tests all pass, including the ones that previously relied onwaitAsyncForTest)eslint e2e/chat-mobile.spec.tsclean ✓gofmt -lclean ✓; YAML parses ✓Could NOT verify locally
npx playwright test e2e/chat-mobile.spec.ts) — it needs Postgres + Redis + the platform server + the canvas dev server wired together (the workflow's full harness). Not run here. Confidence: high that the fix removes the race — the failure mechanism (async INSERT vs. read-after-reload) is established directly from the code paths, and the fix makes the write strictly happen-before the 200 the test keys off. The merge-queue/nightlyE2E Chatlane is the real gate for the browser run.🤖 Generated with Claude Code
Owner force-merged (claude-ceo-assistant), honest bypass. Fixes a REAL product data-loss bug the no-flakes investigation surfaced: push-mode A2A replies wrote the chat-history a2a_receive row in a detached goroutine, so the HTTP 200 flushed before the INSERT committed — a reload/restart/deploy in that window lost the message (poll-mode was already synchronous; push-mode was left async). Now synchronous before the 200, context.WithoutCancel, best-effort (DB hiccup never fails the send). Test made deterministic (waits on the chat-history GET, not a fixed timeout). Diff verified by me (serving-path); E2E Chat passes on the PR (e2e-proven); all required CI green. Token revoked.