bug(workspace-server): rate-limit per-IP keying collapses all canvas traffic to one bucket post-#179 — saw 429 storm on hongming.moleculesai.app today #59

Closed
opened 2026-05-07 21:44:34 +00:00 by claude-ceo-assistant · 0 comments

Symptom

/workspaces/:id/activity returns 429 with {"error":"rate limit exceeded","retry_after":13} from the canvas. Surfaced today on hongming.moleculesai.app while opening a freshly-spawned Claude Code Agent workspace (c7244ed9-f623-4cba-8873-020e5c9fe104) — workspace was STATUS=online but chat-history load failed and the side panel showed the limiter envelope. Multiple browser-side 429s on the layout-*.js static asset are visible in DevTools at the same moment (separate concern, parked below).

Root cause

workspace-server/internal/middleware/ratelimit.go keys the bucket on c.ClientIP(). internal/router/router.go:32-38 calls r.SetTrustedProxies(nil) (issue #179, anti-spoofing). With trusted-proxies disabled, c.ClientIP() returns the TCP RemoteAddr — i.e. the IP of whatever proxy is in front (Caddy on a per-tenant EC2, Vercel edge / CF / CP on the SaaS plane).

Practical consequences in production:

  1. Per-tenant EC2 (Caddy fronts canvas + workspace-server): every browser tab from every user behind that one Caddy collapses into one bucket. A power user with the canvas open on multiple tabs + multiple workspaces visible can burn the 600 req/min budget in <60s.
  2. SaaS plane: every tenant routed through the same upstream proxy IP shares one bucket, breaking per-tenant fairness entirely.

The 600 req/min default in router.go:54 was sized assuming c.ClientIP() resolved to the user's real IP. After #179 closed the spoofing hole the assumption silently became wrong — the limit became deployment-shape-sensitive.

Why the canvas saturates the bucket

Multiple poll consumers each fan out per visible workspace:

| Consumer | Cadence | Per visible workspace |
|---|---|---|
| A2ATopologyOverlay | 60s | yes (?type=delegation&limit=500) |
| CommunicationOverlay | 30s | yes (?limit=5) |
| ActivityTab | 5s | active workspace only |
| ChatTab initial load | on mount | active workspace, then on-scroll older pages |
| Heartbeat / registry | 30s | per workspace (registry surface) |
For 6 visible workspaces + 1 active tab + 1 user: ~45 req/min sustained, ~80 req/min in burst. With 2 browser tabs that doubles. With all canvas traffic collapsed to one IP at the proxy, 600/min is plausible to overrun — exactly what the screenshot shows.
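
For reference, a rough back-of-envelope under the cadences above (assuming steady polling and no retries): topology overlay 6 workspaces every 60s ≈ 6 req/min, communication overlay 6 every 30s ≈ 12 req/min, activity tab 1 every 5s = 12 req/min, heartbeat/registry 6 every 30s ≈ 12 req/min, totalling roughly 42 req/min before chat-history pagination, which is consistent with the ~45 req/min sustained figure.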

Why this didn't surface before

  • Single-user local Docker has the dev-mode fail-open (isDevModeFailOpen() at ratelimit.go:67-71) — bucket is bypassed.
  • E2E + harness replays use admin token + few concurrent endpoints, never the canvas's real fan-out.
  • Issue #105 added X-RateLimit-* headers but didn't change keying.
  • Issue #179 closed the spoofing hole without revisiting the keying assumption.

Proposed fix

Option A (preferred): tenant-aware keying that mirrors molecule-controlplane/internal/middleware/ratelimit.go's priority list

```go
// keyFor returns the bucket identifier for this request. Priority:
//   1. X-Molecule-Org-Id header — when present (CP-routed SaaS traffic),
//      isolates tenants from each other regardless of upstream proxy IP
//   2. SHA-256 of Authorization Bearer token — when present (workspace
//      bearer, ADMIN_TOKEN, org-scoped API token) — distinguishes
//      sessions on a single per-tenant box where org-id header is absent
//   3. ClientIP() — anonymous probes, /health scrapers, registry boot
//      signals
func (rl *RateLimiter) keyFor(c *gin.Context) string { ... }
```

This is the same pattern MCPRateLimiter already uses (token-hash keying) and the same priority order CP's limiter uses. Two consumers, zero parallel impl.
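
A minimal sketch of what the helper body could look like under those assumptions. The two header names come from this issue; the key prefixes, import set, and exact hashing details are illustrative, and the RateLimiter receiver is the existing type in ratelimit.go:

```go
import (
    "crypto/sha256"
    "encoding/hex"
    "strings"

    "github.com/gin-gonic/gin"
)

// keyFor derives the bucket identifier in the priority order above.
// Sketch only: prefixes and helper shape are illustrative.
func (rl *RateLimiter) keyFor(c *gin.Context) string {
    // 1. CP-routed SaaS traffic: isolate tenants regardless of upstream proxy IP.
    if org := c.GetHeader("X-Molecule-Org-Id"); org != "" {
        return "org:" + org
    }
    // 2. Bearer token present: hash it so the in-memory bucket map never
    //    holds raw credentials (same idea as MCPRateLimiter's tokenKey).
    if auth := c.GetHeader("Authorization"); strings.HasPrefix(auth, "Bearer ") {
        sum := sha256.Sum256([]byte(strings.TrimPrefix(auth, "Bearer ")))
        return "tok:" + hex.EncodeToString(sum[:])
    }
    // 3. Anonymous probes, /health scrapers, registry boot signals:
    //    fall back to ClientIP (the fronting proxy's address post-#179).
    return "ip:" + c.ClientIP()
}
```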

Option B (rejected): trust local-proxy IPs for X-Forwarded-For
Adds an env-var allowlist of proxy IPs, calls r.SetTrustedProxies(allowlist). Restores user-IP keying. Rejected because:

  • Allowlist drift = silent regression (every Caddy/CP IP rotation needs ops follow-up)
  • Doesn't help when CP's egress IP NATs many tenants
  • Re-opens partial spoofing surface (any peer behind the trusted proxy can spoof XFF)

Option C (rejected): bump the default to 1800/min
Trivial, but symptom-only — punts on the keying problem and just buys time. Multi-tenant collapse still exists; multi-tab single-user still saturates within seconds.

Option D (rejected): per-route bucket exemption for /activity GETs
Endpoint-list maintenance burden + invalidates the global "are we under attack" signal the limiter encodes.

SSOT decision

Option A. The bucket-key derivation logic gets a single named helper keyFor(*gin.Context) string that all bucket-store mutations route through. Mirrors CP's helper, mirrors MCPRateLimiter's tokenKey pattern. Pinned by an AST gate (test suite below).

Anti-drift gate

Same shape as #36/#10's gates: walk (*RateLimiter).Middleware's AST and assert it calls rl.keyFor at least once and does not carry a direct c.ClientIP() call (= "the parallel-impl drift this PR fixes"). When a future PR renames keyFor or re-introduces direct IP keying, the gate fires.
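
A hedged sketch of what such a gate could look like using go/parser and go/ast; the test name, source file name, and receiver handling are assumptions about the real suite:

```go
import (
    "go/ast"
    "go/parser"
    "go/token"
    "testing"
)

// TestMiddlewareKeysViaKeyFor walks (*RateLimiter).Middleware and fails if it
// stops calling rl.keyFor or starts calling c.ClientIP() directly.
// File and identifier names are assumptions about the real package layout.
func TestMiddlewareKeysViaKeyFor(t *testing.T) {
    fset := token.NewFileSet()
    file, err := parser.ParseFile(fset, "ratelimit.go", nil, 0)
    if err != nil {
        t.Fatalf("parse ratelimit.go: %v", err)
    }
    var callsKeyFor, callsClientIP bool
    for _, decl := range file.Decls {
        fn, ok := decl.(*ast.FuncDecl)
        if !ok || fn.Recv == nil || fn.Name.Name != "Middleware" {
            continue
        }
        ast.Inspect(fn.Body, func(n ast.Node) bool {
            call, ok := n.(*ast.CallExpr)
            if !ok {
                return true
            }
            if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
                switch sel.Sel.Name {
                case "keyFor":
                    callsKeyFor = true
                case "ClientIP":
                    callsClientIP = true
                }
            }
            return true
        })
    }
    if !callsKeyFor {
        t.Fatal("Middleware no longer routes bucket keys through keyFor (SSOT drift)")
    }
    if callsClientIP {
        t.Fatal("Middleware calls ClientIP directly, re-introducing per-proxy-IP keying")
    }
}
```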

Security check

  • Untrusted input? Yes — X-Molecule-Org-Id and Authorization headers are caller-supplied. Mitigation: header values are bucket keys, not auth grants. A spoofed X-Molecule-Org-Id lets the caller pick which bucket to drain — they still have to actually exhaust it. TenantGuard + WorkspaceAuth still gate every protected route, so spoofed values can't access data.
  • Auth/sessions/permissions? No change.
  • Data collection / logs? No new logging at decision points; bucket keys are kept in-memory only. Token values are SHA-256 hashed (mirrors MCPRateLimiter) so the in-memory map can't be a token dump.
  • Access-boundary changes? Slightly tighter — distinct callers with distinct tokens get distinct buckets, which is the intent.

Versioning + backwards compat

  • 429 response body unchanged ({"error":"rate limit exceeded","retry_after":N}).
  • X-RateLimit-* headers unchanged.
  • RATE_LIMIT env var semantics unchanged; existing deployments can keep their current RATE_LIMIT value.
  • No schema, API version, or migration impact.

Acceptance criteria

  • PR opens, links this issue
  • RateLimiter carries a keyFor(c) helper covering org-id → bearer-hash → IP
  • Middleware() uses keyFor for both bucket lookup and 429 response
  • Three structured outcomes (org-id, bearer-hash, IP) covered with explicit unit tests (see the sketch after this list)
  • AST gate test pinning the SSOT routing
  • Mutation test: stripping the org-id branch makes the multi-tenant test fail
  • No regression in existing ratelimit_test.go
  • mcp_ratelimit.go remains untouched (it already has its own SSOT — bearer-hash via tokenKey)
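
For the three structured outcomes and the mutation check, a sketch along these lines could work; the zero-value constructor, request path, and token values are made up for illustration, and keyFor is the proposed helper, not existing API:

```go
import (
    "net/http"
    "net/http/httptest"
    "testing"

    "github.com/gin-gonic/gin"
)

// TestKeyForPriority sketches the three keying outcomes. Stripping the
// org-id branch from keyFor should make case 1 fail (the mutation check).
func TestKeyForPriority(t *testing.T) {
    rl := &RateLimiter{}

    mkCtx := func(headers map[string]string) *gin.Context {
        c, _ := gin.CreateTestContext(httptest.NewRecorder())
        c.Request = httptest.NewRequest(http.MethodGet, "/workspaces/x/activity", nil)
        for k, v := range headers {
            c.Request.Header.Set(k, v)
        }
        return c
    }

    // 1. Org header wins even when a bearer token is also present,
    //    so two tenants behind one proxy never share a bucket.
    withBoth := rl.keyFor(mkCtx(map[string]string{
        "X-Molecule-Org-Id": "org-a",
        "Authorization":     "Bearer tok-1",
    }))
    orgOnly := rl.keyFor(mkCtx(map[string]string{"X-Molecule-Org-Id": "org-a"}))
    if withBoth != orgOnly {
        t.Fatal("same org id must map to the same bucket")
    }

    // 2. Distinct bearer tokens on one box get distinct buckets.
    k1 := rl.keyFor(mkCtx(map[string]string{"Authorization": "Bearer tok-1"}))
    k2 := rl.keyFor(mkCtx(map[string]string{"Authorization": "Bearer tok-2"}))
    if k1 == k2 {
        t.Fatal("distinct tokens must not share a bucket")
    }

    // 3. Anonymous requests fall back to IP keying and stay separate
    //    from org- and token-keyed buckets.
    if anon := rl.keyFor(mkCtx(nil)); anon == k1 || anon == orgOnly {
        t.Fatal("anonymous traffic must not collide with keyed traffic")
    }
}
```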

Out of scope (parked as follow-ups)

  • Canvas poll-fan-out reduction — multiple consumers fetching /activity for the same workspace at independent cadences could be deduped via a single shared poll + pub/sub fan-out. Separate canvas PR; would let us lower the default limit again without breaking power users.
  • Vercel/CF edge 429s on layout-*.js — DevTools shows 4× 429 on the static layout chunk; that's at the edge layer (outside this server). Likely an edge anti-DDoS rule reacting to the same retry storm. Should clear once the workspace-server 429s stop firing (no more retry storm). Worth a CF rule audit if it persists.
  • EC2 reconciler timing post-CP#20 — orthogonal issue tracked under #36.
  • RATE_LIMIT default re-tune — once keying is fixed, the default can be lowered (per-bucket budget per session is much smaller than per-IP across a tenant). Defer to follow-up PR with traffic data.

Relationship to recent work

  • #36 / #12 / #10 (a2a-proxy + plugins SSOT routing through provisioner.RunningContainerName) — same SSOT-helper-with-AST-gate shape, different surface.
  • #179 (anti-spoofing trust nil) — interaction silently invalidated the original keying. This PR makes the keying not depend on c.ClientIP() returning the user IP, so #179's trust-nil stance can stay.
  • #105 (X-RateLimit-* headers) — unchanged.