rfc(canvas): poll-fan-out reduction — convert overlays to ACTIVITY_LOGGED subscribers (P3) #61

Closed
opened 2026-05-07 21:59:39 +00:00 by claude-ceo-assistant · 0 comments

Context

Parked follow-up from PR #60 (issue #59). The 429 storm root cause is closed — workspace-server now keys rate-limit buckets per-tenant via keyFor, not per-IP. With 600 req/min per tenant bucket, the canvas's polling fan-out is comfortably under budget.

This issue tracks the efficiency opportunity that remains: multiple canvas overlays poll /workspaces/:id/activity independently for the same workspaces. Filing as P3 (efficiency, not correctness) so the work doesn't get treated as urgent — but documenting the analysis now while it's fresh.

Current fan-out math

Per-cycle traffic to /workspaces/:id/activity for a user with N visible workspaces, one active tab, and A2A edges enabled:

| Consumer | File | Cadence | Per-workspace | Filter | Per-cycle cost |
|---|---|---|---|---|---|
| A2ATopologyOverlay | canvas/src/components/A2ATopologyOverlay.tsx:210 | 60s | yes | ?type=delegation&limit=500&source=agent | N |
| CommunicationOverlay | canvas/src/components/CommunicationOverlay.tsx:112 | 30s | first 3 only (already capped) | ?limit=5 | min(N,3) |
| ActivityTab | canvas/src/components/tabs/ActivityTab.tsx:71 | 5s | active workspace only | ?type=<filter> (when filter selected) | 1 (active only) |
| ChatTab initial | canvas/src/components/tabs/ChatTab.tsx:164 | once on mount, then on scroll | active only | ?type=a2a_receive&source=canvas&limit=10 | 1 (one-shot) |

For N=6, ActivityTab open, A2A edges on: ~40 req/min steady-state to /activity. Plus heartbeats, hydration, page state. Well under 600/min/key with PR #60's keying.

Why this still matters (P3 reasoning)

Three angles:

  1. Server load: each poll is a real DB query (activity_logs) with a workspace_id filter. In the worst case of every consumer polling every workspace at ActivityTab's cadence, that is 6 workspaces × 4 consumers × ~12 polls/min ≈ 280 DB queries/min per tenant for activity alone; even the actual ~40 req/min steady-state is real RDS CPU at fleet scale.

  2. Update latency: a new agent message → activity_logs insert → ACTIVITY_LOGGED WS broadcast (already implemented). Polling consumers see the new row up to cadence seconds later (5-60s). WS-subscribed consumers see it within ~10ms.

  3. Wasted cycles when nothing changes: in steady-state idle workspaces, every consumer's poll returns the same N rows it returned last cycle. Pure overhead — no DB row changed, no UI update fires.

Prior art surveyed

  • GitHub canvas (Projects v2): subscribes to a SubscriptionsAPI for board updates, falls back to polling at 30s when WS is unhealthy. Same shape we already have for workspace state in socket.ts.
  • Linear: full WS-driven, no polling. Designed for it from day 1; complete view-store reconciliation on every event.
  • Slack: WS-driven for messages, REST for backfill/scrollback. Latest-N pagination matches what ChatTab does today.
  • Stripe Dashboard: hybrid — polling for some panels, WS for others, decided per-resource based on update frequency and consistency requirements.

What applies to us:

  • We already broadcast ACTIVITY_LOGGED over WS (workspace-server/internal/events/types.go:46). Two consumers (ChatTab, AgentCommsPanel) subscribe via useSocketEvent. The remaining three (A2ATopologyOverlay, CommunicationOverlay, ActivityTab) don't, and that's the inconsistency.
  • WS-first + HTTP fallback is the correct shape (matches Slack/GitHub/Stripe). Linear's "always WS" is too strong because we lose backfill/scrollback semantics.

What doesn't:

  • Linear's full view-store reconciliation isn't worth the complexity for our scale; per-component subscription is enough.

Proposed approach

Stage 1 (small, low-risk): convert CommunicationOverlay to subscribe to ACTIVITY_LOGGED filtered by activity_type IN ('a2a_send', 'a2a_receive', 'task_update'), with HTTP fallback only when the WS connection is unhealthy. Cleanest conversion — already capped at 3 workspaces, already visibility-gated, already has dedup logic. Drops 6 req/min from the steady-state.
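A minimal sketch of the Stage 1 shape, assuming a useSocketEvent(event, handler) signature and a useSocketHealth() helper exported from socket.ts; the hook names, import paths, payload fields, and the 50-entry cap are assumptions, not the current API:

```typescript
// Hypothetical sketch: hook signatures, import paths, and payload field names
// are assumptions, not the real canvas API.
import { useCallback, useEffect, useState } from 'react';
import { api } from '../lib/api'; // assumed client location
import { useSocketEvent, useSocketHealth } from '../lib/socket'; // assumed exports

export interface ActivityEntry {
  id: string;
  workspace_id: string;
  activity_type: string;
  created_at: string;
}

const COMM_TYPES = new Set(['a2a_send', 'a2a_receive', 'task_update']);

export function useCommunicationActivity(workspaceIds: string[]) {
  const [entries, setEntries] = useState<ActivityEntry[]>([]);
  const healthy = useSocketHealth(); // assumed: true while the WS connection is up

  // WS-first path: apply matching ACTIVITY_LOGGED events as they arrive.
  useSocketEvent('ACTIVITY_LOGGED', (entry: ActivityEntry) => {
    if (!workspaceIds.includes(entry.workspace_id)) return;
    if (!COMM_TYPES.has(entry.activity_type)) return;
    setEntries((prev) =>
      prev.some((e) => e.id === entry.id) ? prev : [entry, ...prev].slice(0, 50), // cap is illustrative
    );
  });

  // Bootstrap over HTTP on mount; keep the existing 30s poll only while unhealthy.
  const poll = useCallback(async () => {
    const pages = await Promise.all(
      workspaceIds.map((id) =>
        api.get<ActivityEntry[]>(`/workspaces/${id}/activity?limit=5`),
      ),
    );
    setEntries(pages.flat());
  }, [workspaceIds]);

  useEffect(() => {
    void poll(); // initial bootstrap always goes over HTTP
    if (healthy) return;
    const timer = setInterval(() => void poll(), 30_000);
    return () => clearInterval(timer);
  }, [healthy, poll]);

  return entries;
}
```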

Stage 2 (moderate): do the same for A2ATopologyOverlay, filtered by activity_type='delegation'. Drops another 6 req/min worst case. Slightly more complex because it consumes a 500-row windowed query (graph history), so the WS path needs to maintain a bounded ring buffer.
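For the bounded buffer the Stage 2 WS path would maintain, a minimal sketch; the class name is illustrative and the 500-entry capacity mirrors the current ?limit=500 window:

```typescript
// Hypothetical sketch of the bounded history buffer the Stage 2 WS path
// would maintain; capacity mirrors the current ?limit=500 window.
class BoundedActivityBuffer<T extends { id: string }> {
  private items: T[] = [];

  constructor(private readonly capacity = 500) {}

  // Seed from the initial HTTP bootstrap (assuming the API returns newest-first).
  seed(initial: T[]): void {
    this.items = initial.slice(0, this.capacity);
  }

  // Push a WS-delivered event; dedup by id and evict the oldest past capacity.
  push(entry: T): void {
    if (this.items.some((e) => e.id === entry.id)) return;
    this.items.unshift(entry);
    if (this.items.length > this.capacity) this.items.pop();
  }

  snapshot(): readonly T[] {
    return this.items;
  }
}
```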

Stage 3 (largest): convert ActivityTab. Highest fan-out (12 req/min for one workspace when active), but needs careful pagination + filter UX preservation. Follow-up scope.

SSOT decision

Each consumer keeps its own subscription + state-store, but they share:

  • The single useSocketEvent hook that filters ACTIVITY_LOGGED events
  • The single api.get<ActivityEntry[]>('/workspaces/:id/activity?...') HTTP-fallback shape
  • The single socket.ts reconnect/health-check machinery for WS degradation detection

No new abstractions. The useSocketEvent hook already exists; this issue is "use it consistently in three more places".
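To make the shared fallback shape concrete, a sketch under the assumption of a small shared helper; fetchActivity, ActivityQuery, and the import path are hypothetical names, and only the /workspaces/:id/activity endpoint and api.get exist today:

```typescript
// Illustrative only: fetchActivity and ActivityQuery are hypothetical names for
// the shared fallback shape; only the /workspaces/:id/activity endpoint and
// api.get exist today.
import { api } from '../lib/api'; // assumed client location

export interface ActivityEntry {
  id: string;
  workspace_id: string;
  activity_type: string;
  created_at: string;
}

export interface ActivityQuery {
  type?: string;   // e.g. 'delegation', 'a2a_receive', or a user-selected filter
  source?: string; // e.g. 'agent' or 'canvas'
  limit?: number;
}

export function fetchActivity(workspaceId: string, query: ActivityQuery = {}) {
  const params = new URLSearchParams();
  if (query.type) params.set('type', query.type);
  if (query.source) params.set('source', query.source);
  if (query.limit) params.set('limit', String(query.limit));
  const qs = params.toString();
  return api.get<ActivityEntry[]>(
    `/workspaces/${workspaceId}/activity${qs ? `?${qs}` : ''}`,
  );
}
```

Each consumer would call this with its existing query params; event filtering stays client-side, per consumer.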

Alternatives rejected

A. Server-side aggregation endpoint: one /workspaces/:id/canvas-bundle returning topology+comm+activity in one shot. Rejected because the three consumers have wildly different staleness tolerances (60s vs 30s vs 5s) and different filters (delegation vs all-types vs filtered), so a single aggregate is either over-fetching for some consumers or stale for others.

B. Shared poll hook with response sharing in client memory: one useWorkspaceActivity(wsId) hook that all three consumers call, dedupes the request. Rejected because the consumers' filters don't overlap meaningfully — ?type=delegation vs ?limit=5 vs ?type=<dynamic> would each need their own cache key, ending up close to parity with the current state.

C. Reduce cadences: drop ActivityTab from 5s → 15s, CommunicationOverlay from 30s → 60s. Rejected because that hurts perceived freshness for active-tab use cases and doesn't address the structural overlap.

D. Status quo: do nothing. Rejected for the reasons in "Why this still matters" above; filed at P3 to reflect the priority.

Security check

  • Untrusted input? No new input handling. Same WS auth chain as today.
  • Auth/sessions/permissions? No change. WS subscription uses the same per-workspace bearer that polling uses.
  • Data collection / logs? No new logging. WS path already logs subscribe/unsubscribe at socket.ts.
  • Access boundary changes? No.

Versioning + backwards compat

  • No API surface change. /workspaces/:id/activity HTTP endpoint stays — needed for fallback + initial bootstrap on page load.
  • ACTIVITY_LOGGED WS event shape already pinned by socket-events.test.ts; no shape change planned.

Acceptance criteria (per stage)

Stage 1 (CommunicationOverlay):

  • Subscribe to ACTIVITY_LOGGED, filter by activity_type
  • Initial bootstrap via existing HTTP path (preserved); HTTP fallback when socket.ts reports unhealthy
  • Update cadence drops to "as events arrive" + bootstrap on mount
  • Test: WS push → state update without HTTP call (mocked api.get)
  • Test: WS unhealthy → HTTP fallback fires at existing 30s cadence
  • Test: visibility-gating still active

Stages 2–3: similar shape, separate PRs.
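For reference, a sketch of what the first Stage 1 test could look like, assuming Vitest, @testing-library/react, and the hypothetical useCommunicationActivity hook sketched under Stage 1; the mocked module paths and export names are illustrative:

```typescript
// Illustrative test sketch (Vitest + @testing-library/react assumed);
// module paths, hook name, and mocked exports are hypothetical, not existing files.
import { describe, expect, it, vi } from 'vitest';
import { act, renderHook, waitFor } from '@testing-library/react';
import { useCommunicationActivity } from '../hooks/useCommunicationActivity';

// vi.hoisted so the mock factories below can safely reference these values.
const mocks = vi.hoisted(() => ({
  apiGet: vi.fn().mockResolvedValue([]),
  wsHandler: undefined as ((entry: any) => void) | undefined,
}));

vi.mock('../lib/api', () => ({ api: { get: mocks.apiGet } }));
vi.mock('../lib/socket', () => ({
  useSocketEvent: (_event: string, handler: (entry: any) => void) => {
    mocks.wsHandler = handler; // capture so the test can simulate a WS push
  },
  useSocketHealth: () => true, // healthy socket → no fallback polling interval
}));

describe('CommunicationOverlay WS path (Stage 1)', () => {
  it('applies a pushed ACTIVITY_LOGGED event without extra HTTP calls', async () => {
    const { result } = renderHook(() => useCommunicationActivity(['ws-1']));
    await waitFor(() => expect(mocks.apiGet).toHaveBeenCalledTimes(1)); // bootstrap only

    act(() => {
      mocks.wsHandler?.({
        id: 'a1', workspace_id: 'ws-1', activity_type: 'a2a_send', created_at: '',
      });
    });

    expect(result.current.some((e) => e.id === 'a1')).toBe(true);
    expect(mocks.apiGet).toHaveBeenCalledTimes(1); // the WS push did not trigger a poll
  });
});
```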

Out of scope

  • WebSocket protocol changes
  • Activity event schema changes
  • Cross-workspace event aggregation (already handled by per-workspace subscriptions)

Severity

P3. PR #60 closed the bug; this is efficiency only. No timeline pressure.
