rfc(canvas): poll-fan-out reduction — convert overlays to ACTIVITY_LOGGED subscribers (P3) #61
Context
Parked follow-up from PR #60 (issue #59). The 429 storm root cause is closed — workspace-server now keys rate-limit buckets per-tenant via `keyFor`, not per-IP. With a 600 req/min per-tenant bucket, the canvas's polling fan-out is comfortably under budget.

This issue tracks the efficiency opportunity that remains: multiple canvas overlays poll `/workspaces/:id/activity` independently for the same workspaces. Filing as P3 (efficiency, not correctness) so the work doesn't get treated as urgent — but documenting the analysis now while it's fresh.

Current fan-out math
Per-cycle traffic to `/workspaces/:id/activity` for a user with N visible workspaces, an active tab, and A2A edges enabled:

| Consumer | Source | Query |
| --- | --- | --- |
| `A2ATopologyOverlay` | `canvas/src/components/A2ATopologyOverlay.tsx:210` | `?type=delegation&limit=500&source=agent` |
| `CommunicationOverlay` | `canvas/src/components/CommunicationOverlay.tsx:112` | `?limit=5` |
| `ActivityTab` | `canvas/src/components/tabs/ActivityTab.tsx:71` | `?type=<filter>` (when a filter is selected) |
| `ChatTab` (initial) | `canvas/src/components/tabs/ChatTab.tsx:164` | `?type=a2a_receive&source=canvas&limit=10` |

For N=6, ActivityTab open, A2A edges on: ~40 req/min steady-state to `/activity`. Plus heartbeats, hydration, page state. Well under 600/min/key with PR #60's keying.

Why this still matters (P3 reasoning)
Three angles:
- Server load: each poll is a real DB query (`activity_logs`) with a `workspace_id` filter. At 6 workspaces × 4 consumers × ~12 polls/min/consumer = ~280 DB queries/min per tenant for activity alone. At fleet scale this is real RDS CPU.
- Update latency: a new agent message → `activity_logs` insert → `ACTIVITY_LOGGED` WS broadcast (already implemented). Polling consumers see the new row up to a full cadence later (5–60s). WS-subscribed consumers see it within ~10ms.
- Wasted cycles when nothing changes: in steady-state idle workspaces, every consumer's poll returns the same N rows it returned last cycle. Pure overhead — no DB row changed, no UI update fires.
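For reference, the pattern the non-subscribed consumers share today is a fixed-cadence poll loop. A simplified sketch, not the actual component code; the real implementations live at the paths in the fan-out table, and the cadence and query differ per consumer:

```ts
// Illustrative version of the current polling pattern. Every tick issues a
// real activity_logs query even when nothing changed, and a new row only
// becomes visible up to cadenceMs after ACTIVITY_LOGGED already announced it.
import { useEffect, useState } from 'react';

export function usePolledActivity(workspaceId: string, query: string, cadenceMs: number) {
  const [rows, setRows] = useState<unknown[]>([]);

  useEffect(() => {
    let cancelled = false;
    const poll = async () => {
      const res = await fetch(`/workspaces/${workspaceId}/activity${query}`);
      if (!cancelled) setRows(await res.json());
    };
    poll(); // initial fetch on mount
    const timer = setInterval(poll, cadenceMs); // roughly 5_000 to 60_000 ms per consumer
    return () => {
      cancelled = true;
      clearInterval(timer);
    };
  }, [workspaceId, query, cadenceMs]);

  return rows;
}
```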
Prior art surveyed
`socket.ts`, as `ChatTab` already does today.

What applies to us: `ACTIVITY_LOGGED` is already broadcast over WS (`workspace-server/internal/events/types.go:46`). Two consumers (`ChatTab`, `AgentCommsPanel`) subscribe via `useSocketEvent`. The remaining three (`A2ATopologyOverlay`, `CommunicationOverlay`, `ActivityTab`) don't, and that's the inconsistency.

What doesn't:
Proposed approach
Stage 1 (small, low-risk): convert `CommunicationOverlay` to subscribe to `ACTIVITY_LOGGED` filtered by `activity_type IN ('a2a_send', 'a2a_receive', 'task_update')`, with HTTP fallback only when the WS connection is unhealthy (sketched below). Cleanest convert — already capped at 3 workspaces, already visibility-gated, already has dedup logic. Drops 6 req/min from the steady state.

Stage 2 (moderate): do the same for `A2ATopologyOverlay`, filtered by `activity_type='delegation'`. Drops another 6 req/min worst case. Slightly more complex because it consumes a 500-row windowed query (graph history), so the WS path needs to maintain a bounded ring buffer.

Stage 3 (largest): convert `ActivityTab`. Highest fan-out (12 req/min for one workspace when active), but needs careful pagination + filter UX preservation. Fall-through scope.
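A minimal sketch of the Stage 1 shape, assuming a `useSocketEvent(eventName, handler)` hook, a `useSocketHealthy()` health wrapper, and an `api.get` client roughly like the existing consumers use. The hook signatures, the `ActivityEntry` fields, and the fallback condition are illustrative assumptions, not the project's pinned shapes:

```ts
// Sketch only: the hook/client signatures and ActivityEntry fields below are
// assumptions for illustration, not the canvas's actual pinned shapes.
import { useCallback, useState } from 'react';

interface ActivityEntry {
  workspace_id: string;
  activity_type: string; // e.g. 'a2a_send' | 'a2a_receive' | 'task_update'
  // ...remaining fields as pinned by socket-events.test.ts
}

// Assumed shapes for the existing plumbing.
declare function useSocketEvent<T>(name: 'ACTIVITY_LOGGED', handler: (event: T) => void): void;
declare function useSocketHealthy(): boolean; // hypothetical wrapper over socket.ts health state
declare const api: { get<T>(url: string): Promise<T> };

const COMM_TYPES = new Set(['a2a_send', 'a2a_receive', 'task_update']);

export function useCommActivity(workspaceIds: string[]) {
  const [entries, setEntries] = useState<ActivityEntry[]>([]);
  const socketHealthy = useSocketHealthy();

  // Primary path: push. Only entries matching the overlay's filter are kept,
  // so idle workspaces generate no client or server work.
  useSocketEvent<ActivityEntry>('ACTIVITY_LOGGED', (event) => {
    if (!workspaceIds.includes(event.workspace_id)) return;
    if (!COMM_TYPES.has(event.activity_type)) return;
    setEntries((prev) => dedupe([event, ...prev]).slice(0, 5)); // same cap as ?limit=5
  });

  // Fallback path: hit the HTTP endpoint only while the WS connection is unhealthy.
  const refetch = useCallback(async () => {
    if (socketHealthy) return;
    const pages = await Promise.all(
      workspaceIds.map((id) => api.get<ActivityEntry[]>(`/workspaces/${id}/activity?limit=5`)),
    );
    setEntries(dedupe(pages.flat()).slice(0, 5));
  }, [socketHealthy, workspaceIds]);

  return { entries, refetch };
}

// Stand-in for the overlay's existing dedup logic.
function dedupe(xs: ActivityEntry[]): ActivityEntry[] {
  return xs;
}
```

The filter set mirrors the `activity_type IN ('a2a_send', 'a2a_receive', 'task_update')` condition above, and the 5-entry cap mirrors the overlay's current `?limit=5` query.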
SSOT decision

Each consumer keeps its own subscription + state-store, but they share:

- the `useSocketEvent` hook that filters `ACTIVITY_LOGGED` events
- the `api.get<ActivityEntry[]>('/workspaces/:id/activity?...')` HTTP-fallback shape
- the `socket.ts` reconnect/health-check machinery for WS degradation detection

No new abstractions. The `useSocketEvent` hook already exists; this issue is "use it consistently in three more places".
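The third bullet is the piece most likely to need a thin wrapper. A hedged sketch of how `socket.ts` health could be surfaced as a hook, matching the `useSocketHealthy()` assumed in the Stage 1 sketch; the `socket` object's `getHealth`/`subscribe` methods are assumptions, not the real `socket.ts` API:

```ts
// Hypothetical wrapper over socket.ts's existing health/reconnect state.
// Only the idea of reusing socket.ts for degradation detection is the point;
// the getHealth/subscribe shapes are assumed for illustration.
import { useEffect, useState } from 'react';

type SocketHealth = { connected: boolean; lastHeartbeatMs: number };

declare const socket: {
  getHealth(): SocketHealth;
  subscribe(listener: () => void): () => void; // returns an unsubscribe function
};

export function useSocketHealthy(staleAfterMs = 30_000): boolean {
  const [healthy, setHealthy] = useState(true);

  useEffect(() => {
    const update = () => {
      const h = socket.getHealth();
      setHealthy(h.connected && Date.now() - h.lastHeartbeatMs < staleAfterMs);
    };
    update(); // evaluate once on mount
    return socket.subscribe(update); // re-evaluate whenever socket.ts reports a change
  }, [staleAfterMs]);

  return healthy;
}
```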
Alternatives rejected

A. Server-side aggregation endpoint: one `/workspaces/:id/canvas-bundle` returning topology + comm + activity in one shot. Rejected because the three consumers have wildly different staleness tolerances (60s vs 30s vs 5s) and different filters (delegation vs all-types vs filtered), so a single aggregate is either over-fetching for some consumers or stale for others.

B. Shared poll hook with response sharing in client memory: one `useWorkspaceActivity(wsId)` hook that all three consumers call and that dedupes the request. Rejected because the consumers' filters don't overlap meaningfully — `?type=delegation` vs `?limit=5` vs `?type=<dynamic>` would each need their own cache key, ending up close to parity with the current state (see the sketch after these alternatives).

C. Reduce cadences: drop ActivityTab from 5s → 15s, CommunicationOverlay from 30s → 60s. Rejected because that hurts perceived freshness for active-tab use cases and doesn't address the structural overlap.

D. Status quo: do nothing. Rejected because: see "Why this still matters" above. Filed at P3 to reflect the priority, though.
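To make the option B rejection concrete, a rough sketch of the cache such a shared hook would sit on. The function and cache-key scheme are hypothetical; only the query strings come from the fan-out table above:

```ts
// Sketch of rejected option B: a shared in-memory cache that a
// useWorkspaceActivity(wsId) hook would wrap. Because each consumer's query
// string differs, the cache key has to include the full query, so consumers
// never actually share responses; roughly parity with independent polling.
type CacheKey = `${string}|${string}`; // `${workspaceId}|${queryString}`

const activityCache = new Map<CacheKey, { fetchedAt: number; rows: unknown[] }>();

export async function fetchWorkspaceActivity(
  workspaceId: string,
  query: string, // e.g. '?type=delegation&limit=500&source=agent'
  maxAgeMs: number, // per-consumer staleness tolerance (5s / 30s / 60s)
): Promise<unknown[]> {
  const key: CacheKey = `${workspaceId}|${query}`;
  const hit = activityCache.get(key);
  if (hit && Date.now() - hit.fetchedAt < maxAgeMs) return hit.rows;

  // Distinct keys in practice:
  //   'ws-1|?type=delegation&limit=500&source=agent'  (A2ATopologyOverlay)
  //   'ws-1|?limit=5'                                  (CommunicationOverlay)
  //   'ws-1|?type=<dynamic filter>'                    (ActivityTab)
  // so a cache hit only ever helps the consumer that populated it.
  const res = await fetch(`/workspaces/${workspaceId}/activity${query}`);
  const rows = (await res.json()) as unknown[];
  activityCache.set(key, { fetchedAt: Date.now(), rows });
  return rows;
}
```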
Security check
`socket.ts`.

Versioning + backwards compat
- The `/workspaces/:id/activity` HTTP endpoint stays — needed for fallback + initial bootstrap on page load.
- The `ACTIVITY_LOGGED` WS event shape is already pinned by `socket-events.test.ts`; no shape change planned.
Acceptance criteria (per stage)

Stage 1 (`CommunicationOverlay`), with a test sketch below:

- subscribes to `ACTIVITY_LOGGED`, filtered by `activity_type`
- HTTP fallback only when `socket.ts` reports unhealthy

Stages 2–3: similar shape, separate PRs.
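A rough idea of how the Stage 1 filter criterion could be pinned in a unit test. The `isCommActivity` helper is an assumed extraction from the overlay's subscription handler, not existing code; only the vitest imports are real:

```ts
// Hypothetical unit test for the Stage 1 filter predicate.
import { describe, expect, it } from 'vitest';

const COMM_TYPES = new Set(['a2a_send', 'a2a_receive', 'task_update']);

// Assumed helper, extracted from the overlay's ACTIVITY_LOGGED handler.
function isCommActivity(entry: { activity_type: string }): boolean {
  return COMM_TYPES.has(entry.activity_type);
}

describe('CommunicationOverlay ACTIVITY_LOGGED filter', () => {
  it('keeps a2a_send / a2a_receive / task_update entries', () => {
    for (const activity_type of ['a2a_send', 'a2a_receive', 'task_update']) {
      expect(isCommActivity({ activity_type })).toBe(true);
    }
  });

  it('drops other activity types (e.g. delegation)', () => {
    expect(isCommActivity({ activity_type: 'delegation' })).toBe(false);
  });
});
```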
Out of scope
Severity
P3. PR #60 closed the bug; this is efficiency only. No timeline pressure.