rfc(canvas): poll-fan-out reduction — convert overlays to ACTIVITY_LOGGED subscribers (P3) #61

Closed
opened 2026-05-07 21:59:39 +00:00 by claude-ceo-assistant · 0 comments

Context

Parked follow-up from PR #60 (issue #59). The 429 storm root cause is closed — workspace-server now keys rate-limit buckets per-tenant via keyFor, not per-IP. With 600 req/min per tenant bucket, the canvas's polling fan-out is comfortably under budget.

This issue tracks the efficiency opportunity that remains: multiple canvas overlays poll /workspaces/:id/activity independently for the same workspaces. Filing as P3 (efficiency, not correctness) so the work doesn't get treated as urgent — but documenting the analysis now while it's fresh.

Current fan-out math

Per-cycle traffic to /workspaces/:id/activity for a user with N visible workspaces, one active tab, and A2A edges enabled:

| Consumer | File | Cadence | Per-workspace | Filter | Per-cycle cost |
|---|---|---|---|---|---|
| A2ATopologyOverlay | canvas/src/components/A2ATopologyOverlay.tsx:210 | 60s | yes | ?type=delegation&limit=500&source=agent | N |
| CommunicationOverlay | canvas/src/components/CommunicationOverlay.tsx:112 | 30s | first 3 only (already capped) | ?limit=5 | min(N,3) |
| ActivityTab | canvas/src/components/tabs/ActivityTab.tsx:71 | 5s | active workspace only | ?type=<filter> (when filter selected) | 1 (active only) |
| ChatTab initial | canvas/src/components/tabs/ChatTab.tsx:164 | once on mount, then on scroll | active only | ?type=a2a_receive&source=canvas&limit=10 | 1 (one-shot) |

For N=6, ActivityTab open, A2A edges on: ~40 req/min steady-state to /activity. Plus heartbeats, hydration, page state. Well under 600/min/key with PR #60's keying.

Why this still matters (P3 reasoning)

Three angles:

  1. Server load: each poll is a real DB query (activity_logs) with a workspace_id filter. In the worst case of every consumer polling every workspace at ActivityTab's cadence, that is 6 workspaces × 4 consumers × ~12 polls/min ≈ 280 DB queries/min per tenant for activity alone; even the actual ~40 req/min steady-state is real RDS CPU at fleet scale.

  2. Update latency: a new agent message → activity_logs insert → ACTIVITY_LOGGED WS broadcast (already implemented). Polling consumers see the new row up to cadence seconds later (5-60s). WS-subscribed consumers see it within ~10ms.

  3. Wasted cycles when nothing changes: in steady-state idle workspaces, every consumer's poll returns the same N rows it returned last cycle. Pure overhead — no DB row changed, no UI update fires.

Prior art surveyed

  • GitHub canvas (Projects v2): subscribes to a SubscriptionsAPI for board updates, falls back to polling at 30s when WS is unhealthy. Same shape we already have for workspace state in socket.ts.
  • Linear: full WS-driven, no polling. Designed for it from day 1; complete view-store reconciliation on every event.
  • Slack: WS-driven for messages, REST for backfill/scrollback. Latest-N pagination matches what ChatTab does today.
  • Stripe Dashboard: hybrid — polling for some panels, WS for others, decided per-resource based on update frequency and consistency requirements.

What applies to us:

  • We already broadcast ACTIVITY_LOGGED over WS (workspace-server/internal/events/types.go:46). Two consumers (ChatTab, AgentCommsPanel) subscribe via useSocketEvent. The remaining three (A2ATopologyOverlay, CommunicationOverlay, ActivityTab) don't, and that's the inconsistency.
  • WS-first + HTTP fallback is the correct shape (matches Slack/GitHub/Stripe). Linear's "always WS" is too strong because we lose backfill/scrollback semantics.

What doesn't:

  • Linear's full view-store reconciliation isn't worth the complexity for our scale; per-component subscription is enough.

Proposed approach

Stage 1 (small, low-risk): convert CommunicationOverlay to subscribe to ACTIVITY_LOGGED filtered by activity_type IN ('a2a_send', 'a2a_receive', 'task_update'), with HTTP fallback only when the WS connection is unhealthy. Cleanest conversion — already capped at 3 workspaces, already visibility-gated, already has dedup logic. Drops 6 req/min from the steady-state.
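A minimal sketch of the Stage 1 shape, assuming a useSocketEvent(event, handler) signature and a useSocketHealth() helper exported from socket.ts; the hook names, import paths, payload fields, and the 50-entry cap are assumptions, not the current API:

```typescript
// Hypothetical sketch: hook signatures, import paths, and payload field names
// are assumptions, not the real canvas API.
import { useCallback, useEffect, useState } from 'react';
import { api } from '../lib/api'; // assumed client location
import { useSocketEvent, useSocketHealth } from '../lib/socket'; // assumed exports

export interface ActivityEntry {
  id: string;
  workspace_id: string;
  activity_type: string;
  created_at: string;
}

const COMM_TYPES = new Set(['a2a_send', 'a2a_receive', 'task_update']);

export function useCommunicationActivity(workspaceIds: string[]) {
  const [entries, setEntries] = useState<ActivityEntry[]>([]);
  const healthy = useSocketHealth(); // assumed: true while the WS connection is up

  // WS-first path: apply matching ACTIVITY_LOGGED events as they arrive.
  useSocketEvent('ACTIVITY_LOGGED', (entry: ActivityEntry) => {
    if (!workspaceIds.includes(entry.workspace_id)) return;
    if (!COMM_TYPES.has(entry.activity_type)) return;
    setEntries((prev) =>
      prev.some((e) => e.id === entry.id) ? prev : [entry, ...prev].slice(0, 50), // cap is illustrative
    );
  });

  // Bootstrap over HTTP on mount; keep the existing 30s poll only while unhealthy.
  const poll = useCallback(async () => {
    const pages = await Promise.all(
      workspaceIds.map((id) =>
        api.get<ActivityEntry[]>(`/workspaces/${id}/activity?limit=5`),
      ),
    );
    setEntries(pages.flat());
  }, [workspaceIds]);

  useEffect(() => {
    void poll(); // initial bootstrap always goes over HTTP
    if (healthy) return;
    const timer = setInterval(() => void poll(), 30_000);
    return () => clearInterval(timer);
  }, [healthy, poll]);

  return entries;
}
```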

Stage 2 (moderate): do the same for A2ATopologyOverlay, filtered by activity_type='delegation'. Drops another 6 req/min worst case. Slightly more complex because it consumes a 500-row windowed query (graph history), so the WS path needs to maintain a bounded ring buffer.
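For the bounded buffer the Stage 2 WS path would maintain, a minimal sketch; the class name is illustrative and the 500-entry capacity mirrors the current ?limit=500 window:

```typescript
// Hypothetical sketch of the bounded history buffer the Stage 2 WS path
// would maintain; capacity mirrors the current ?limit=500 window.
class BoundedActivityBuffer<T extends { id: string }> {
  private items: T[] = [];

  constructor(private readonly capacity = 500) {}

  // Seed from the initial HTTP bootstrap (assuming the API returns newest-first).
  seed(initial: T[]): void {
    this.items = initial.slice(0, this.capacity);
  }

  // Push a WS-delivered event; dedup by id and evict the oldest past capacity.
  push(entry: T): void {
    if (this.items.some((e) => e.id === entry.id)) return;
    this.items.unshift(entry);
    if (this.items.length > this.capacity) this.items.pop();
  }

  snapshot(): readonly T[] {
    return this.items;
  }
}
```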

Stage 3 (largest): convert ActivityTab. Highest fan-out (12 req/min for one workspace when active), but needs careful pagination + filter UX preservation. Follow-up scope.

SSOT decision

Each consumer keeps its own subscription + state-store, but they share:

  • The single useSocketEvent hook that filters ACTIVITY_LOGGED events
  • The single api.get<ActivityEntry[]>('/workspaces/:id/activity?...') HTTP-fallback shape
  • The single socket.ts reconnect/health-check machinery for WS degradation detection

No new abstractions. The useSocketEvent hook already exists; this issue is "use it consistently in three more places".
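To make the shared fallback shape concrete, a sketch under the assumption of a small shared helper; fetchActivity, ActivityQuery, and the import path are hypothetical names, and only the /workspaces/:id/activity endpoint and api.get exist today:

```typescript
// Illustrative only: fetchActivity and ActivityQuery are hypothetical names for
// the shared fallback shape; only the /workspaces/:id/activity endpoint and
// api.get exist today.
import { api } from '../lib/api'; // assumed client location

export interface ActivityEntry {
  id: string;
  workspace_id: string;
  activity_type: string;
  created_at: string;
}

export interface ActivityQuery {
  type?: string;   // e.g. 'delegation', 'a2a_receive', or a user-selected filter
  source?: string; // e.g. 'agent' or 'canvas'
  limit?: number;
}

export function fetchActivity(workspaceId: string, query: ActivityQuery = {}) {
  const params = new URLSearchParams();
  if (query.type) params.set('type', query.type);
  if (query.source) params.set('source', query.source);
  if (query.limit) params.set('limit', String(query.limit));
  const qs = params.toString();
  return api.get<ActivityEntry[]>(
    `/workspaces/${workspaceId}/activity${qs ? `?${qs}` : ''}`,
  );
}
```

Each consumer would call this with its existing query params; event filtering stays client-side, per consumer.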

Alternatives rejected

A. Server-side aggregation endpoint: one /workspaces/:id/canvas-bundle returning topology+comm+activity in one shot. Rejected because the three consumers have wildly different staleness tolerances (60s vs 30s vs 5s) and different filters (delegation vs all-types vs filtered), so a single aggregate is either over-fetching for some consumers or stale for others.

B. Shared poll hook with response sharing in client memory: one useWorkspaceActivity(wsId) hook that all three consumers call, dedupes the request. Rejected because the consumers' filters don't overlap meaningfully — ?type=delegation vs ?limit=5 vs ?type=<dynamic> would each need their own cache key, ending up close to parity with the current state.

C. Reduce cadences: drop ActivityTab from 5s → 15s, CommunicationOverlay from 30s → 60s. Rejected because that hurts perceived freshness for active-tab use cases and doesn't address the structural overlap.

D. Status quo: do nothing. Rejected for the reasons in "Why this still matters" above; filed at P3 to reflect the priority.

Security check

  • Untrusted input? No new input handling. Same WS auth chain as today.
  • Auth/sessions/permissions? No change. WS subscription uses the same per-workspace bearer that polling uses.
  • Data collection / logs? No new logging. WS path already logs subscribe/unsubscribe at socket.ts.
  • Access boundary changes? No.

Versioning + backwards compat

  • No API surface change. /workspaces/:id/activity HTTP endpoint stays — needed for fallback + initial bootstrap on page load.
  • ACTIVITY_LOGGED WS event shape already pinned by socket-events.test.ts; no shape change planned.

Acceptance criteria (per stage)

Stage 1 (CommunicationOverlay):

  • Subscribe to ACTIVITY_LOGGED, filter by activity_type
  • Initial bootstrap via existing HTTP path (preserved); HTTP fallback when socket.ts reports unhealthy
  • Update cadence drops to "as events arrive" + bootstrap on mount
  • Test: WS push → state update without HTTP call (mocked api.get)
  • Test: WS unhealthy → HTTP fallback fires at existing 30s cadence
  • Test: visibility-gating still active

Stages 2–3: similar shape, separate PRs.
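For reference, a sketch of what the first Stage 1 test could look like, assuming Vitest, @testing-library/react, and the hypothetical useCommunicationActivity hook sketched under Stage 1; the mocked module paths and export names are illustrative:

```typescript
// Illustrative test sketch (Vitest + @testing-library/react assumed);
// module paths, hook name, and mocked exports are hypothetical, not existing files.
import { describe, expect, it, vi } from 'vitest';
import { act, renderHook, waitFor } from '@testing-library/react';
import { useCommunicationActivity } from '../hooks/useCommunicationActivity';

// vi.hoisted so the mock factories below can safely reference these values.
const mocks = vi.hoisted(() => ({
  apiGet: vi.fn().mockResolvedValue([]),
  wsHandler: undefined as ((entry: any) => void) | undefined,
}));

vi.mock('../lib/api', () => ({ api: { get: mocks.apiGet } }));
vi.mock('../lib/socket', () => ({
  useSocketEvent: (_event: string, handler: (entry: any) => void) => {
    mocks.wsHandler = handler; // capture so the test can simulate a WS push
  },
  useSocketHealth: () => true, // healthy socket → no fallback polling interval
}));

describe('CommunicationOverlay WS path (Stage 1)', () => {
  it('applies a pushed ACTIVITY_LOGGED event without extra HTTP calls', async () => {
    const { result } = renderHook(() => useCommunicationActivity(['ws-1']));
    await waitFor(() => expect(mocks.apiGet).toHaveBeenCalledTimes(1)); // bootstrap only

    act(() => {
      mocks.wsHandler?.({
        id: 'a1', workspace_id: 'ws-1', activity_type: 'a2a_send', created_at: '',
      });
    });

    expect(result.current.some((e) => e.id === 'a1')).toBe(true);
    expect(mocks.apiGet).toHaveBeenCalledTimes(1); // the WS push did not trigger a poll
  });
});
```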

Out of scope

  • WebSocket protocol changes
  • Activity event schema changes
  • Cross-workspace event aggregation (already handled by per-workspace subscriptions)

Severity

P3. PR #60 closed the bug; this is efficiency only. No timeline pressure.
