bug(workspace-server): rate-limit per-IP keying collapses all canvas traffic to one bucket post-#179 — saw 429 storm on hongming.moleculesai.app today #59
## Symptom
`/workspaces/:id/activity` returns 429 with `{"error":"rate limit exceeded","retry_after":13}` from the canvas. Surfaced today on hongming.moleculesai.app while opening a freshly spawned Claude Code Agent workspace (`c7244ed9-f623-4cba-8873-020e5c9fe104`) — the workspace was `STATUS=online`, but the chat-history load failed and the side panel showed the limiter envelope. Multiple browser-side 429s on the `layout-*.js` static asset are visible in DevTools at the same moment (separate concern, parked below).
## Root cause

`workspace-server/internal/middleware/ratelimit.go` keys the bucket on `c.ClientIP()`. `internal/router/router.go:32-38` calls `r.SetTrustedProxies(nil)` (issue #179, anti-spoofing). With trusted proxies disabled, `c.ClientIP()` returns the TCP `RemoteAddr` — i.e. the IP of whatever proxy sits in front (Caddy on a per-tenant EC2; Vercel edge / CF / CP on the SaaS plane).

Practical consequences in production:
- The 600 req/min default in `router.go:54` was sized assuming `c.ClientIP()` resolved to the user's real IP. After #179 closed the spoofing hole, that assumption silently became wrong — the limit became deployment-shape-sensitive.
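The keying behavior is easy to reproduce outside the limiter. A minimal sketch, assuming only gin (the `/whoami` route and everything else here is illustrative, not repo code):

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.New()
	// #179's anti-spoofing stance: no proxy is trusted, so X-Forwarded-For
	// is ignored and c.ClientIP() falls back to the TCP RemoteAddr.
	_ = r.SetTrustedProxies(nil)

	r.GET("/whoami", func(c *gin.Context) {
		// Behind Caddy / Vercel / CF this is the proxy's IP, not the
		// user's — so a ClientIP-keyed limiter gives every user behind
		// the same proxy one shared bucket.
		c.String(http.StatusOK, c.ClientIP())
	})
	r.Run(":8080")
}
```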
## Why the canvas saturates the bucket

Multiple poll consumers each fan out per visible workspace:

- `A2ATopologyOverlay` (`?type=delegation&limit=500`)
- `CommunicationOverlay` (`?limit=5`)
- `ActivityTab`
- `ChatTab` initial load

For 6 visible workspaces + 1 active tab + 1 user: ~45 req/min sustained, ~80 req/min in burst. With 2 browser tabs that doubles. With all canvas traffic collapsed onto one IP at the proxy, 600/min is plausible to overrun — exactly what the screenshot shows.
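A back-of-envelope for those totals — the cadences below are illustrative assumptions, not measured values; only the shape (overlay pollers × visible workspaces, plus active-tab traffic) comes from this issue:

```go
package main

import "fmt"

func main() {
	const (
		visibleWorkspaces = 6
		topologyPerMin    = 4 // assumed A2ATopologyOverlay cadence (~15s)
		commsPerMin       = 3 // assumed CommunicationOverlay cadence (~20s)
		activeTabPerMin   = 3 // assumed ActivityTab cadence for the open tab
	)
	perWorkspace := topologyPerMin + commsPerMin
	sustained := visibleWorkspaces*perWorkspace + activeTabPerMin
	fmt.Printf("~%d req/min sustained per browser tab\n", sustained) // 45
	fmt.Printf("~%d req/min with a second tab open\n", 2*sustained)  // 90
}
```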
## Why this didn't surface before
- In dev mode the limiter fails open (`isDevModeFailOpen()` at `ratelimit.go:67-71`) — the bucket is bypassed entirely.
- An earlier change added the `X-RateLimit-*` headers but didn't change the keying.

## Proposed fix
### Option A (preferred): tenant-aware keying that mirrors `molecule-controlplane/internal/middleware/ratelimit.go`'s priority list

This is the same pattern `MCPRateLimiter` already uses (token-hash keying) and the same priority order CP's limiter uses. Two consumers, zero parallel impls. A sketch of the helper follows below.
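A minimal sketch of the proposed helper, assuming gin and the header names cited in this issue; `keyFor`, the `org:`/`tok:`/`ip:` prefixes, and the stub `RateLimiter` shape are illustrative, not shipped code:

```go
package middleware

import (
	"crypto/sha256"
	"encoding/hex"

	"github.com/gin-gonic/gin"
)

// RateLimiter stands in for the real struct in ratelimit.go.
type RateLimiter struct{}

// keyFor derives the bucket key with the CP-style priority list:
// org-id → bearer-hash → IP.
func (rl *RateLimiter) keyFor(c *gin.Context) string {
	// 1. Tenant header — all of a tenant's traffic shares one bucket.
	if org := c.GetHeader("X-Molecule-Org-Id"); org != "" {
		return "org:" + org
	}
	// 2. Bearer token, hashed (mirroring MCPRateLimiter's tokenKey idea)
	// so the in-memory bucket map can never be a token dump.
	if auth := c.GetHeader("Authorization"); auth != "" {
		sum := sha256.Sum256([]byte(auth))
		return "tok:" + hex.EncodeToString(sum[:8])
	}
	// 3. Fall back to the IP (the pre-fix behavior) for unauthenticated routes.
	return "ip:" + c.ClientIP()
}
```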
### Option B (rejected): trust local-proxy IPs for X-Forwarded-For

Adds an env-var allowlist of proxy IPs and calls `r.SetTrustedProxies(allowlist)`, restoring user-IP keying — sketched below.
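What Option B would have looked like — a minimal sketch; `TRUSTED_PROXY_IPS` is an illustrative env-var name, not an existing one:

```go
package main

import (
	"log"
	"os"
	"strings"

	"github.com/gin-gonic/gin"
)

func main() {
	r := gin.New()
	// Option B: trust an operator-supplied proxy allowlist so gin resolves
	// c.ClientIP() from X-Forwarded-For again.
	allow := strings.Split(os.Getenv("TRUSTED_PROXY_IPS"), ",")
	if err := r.SetTrustedProxies(allow); err != nil {
		log.Fatalf("invalid TRUSTED_PROXY_IPS: %v", err)
	}
	r.Run(":8080")
}
```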
Rejected because it walks back #179's trust-nil stance (see "Relationship to recent work" below) and makes correctness depend on a per-deployment proxy allowlist (per-tenant Caddy vs. the SaaS edge).
### Option C (rejected): bump the default to 1800/min

Trivial, but symptom-only — it punts on the keying problem and just buys time. The multi-tenant collapse still exists, and a multi-tab single user still saturates the bucket within seconds.
### Option D (rejected): per-route bucket exemption for `/activity` GETs
Adds an endpoint-list maintenance burden and invalidates the global "are we under attack?" signal the limiter encodes.
## SSOT decision
Option A. The bucket-key derivation logic gets a single named helper, `keyFor(*gin.Context) string`, that all bucket-store mutations route through. It mirrors CP's helper and `MCPRateLimiter`'s `tokenKey` pattern, and is pinned by an AST gate (test suite below).

## Anti-drift gate
Same shape as #36/#10's gates: walk `(*RateLimiter).Middleware`'s AST and assert it calls `rl.keyFor` at least once and carries no direct `c.ClientIP()` call (i.e. exactly the parallel-impl drift this PR fixes). When a future PR renames `keyFor` or re-introduces direct IP keying, the gate fires.
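A sketch of that gate, assuming the middleware lives in `ratelimit.go` beside the test; the method and helper names follow the issue text, everything else is illustrative:

```go
package middleware

import (
	"go/ast"
	"go/parser"
	"go/token"
	"testing"
)

// TestMiddlewareKeyingSSOT walks (*RateLimiter).Middleware's AST and fails
// if it stops routing through rl.keyFor or regrows a direct c.ClientIP().
func TestMiddlewareKeyingSSOT(t *testing.T) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "ratelimit.go", nil, 0)
	if err != nil {
		t.Fatal(err)
	}
	var callsKeyFor, callsClientIP bool
	for _, d := range f.Decls {
		fn, ok := d.(*ast.FuncDecl)
		if !ok || fn.Recv == nil || fn.Name.Name != "Middleware" {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
				switch sel.Sel.Name {
				case "keyFor":
					callsKeyFor = true
				case "ClientIP":
					callsClientIP = true
				}
			}
			return true
		})
	}
	if !callsKeyFor {
		t.Error("Middleware no longer calls rl.keyFor — SSOT drift")
	}
	if callsClientIP {
		t.Error("Middleware carries a direct c.ClientIP() call — parallel keying impl")
	}
}
```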
## Security check

- `X-Molecule-Org-Id` and `Authorization` headers are caller-supplied. Mitigation: header values are bucket keys, not auth grants. A spoofed `X-Molecule-Org-Id` lets the caller pick which bucket to drain — they still have to actually exhaust it. TenantGuard + WorkspaceAuth still gate every protected route, so spoofed values can't access data.
- Bearer tokens are hashed before keying (as in `MCPRateLimiter`) so the in-memory map can't be a token dump.
{"error":"rate limit exceeded","retry_after":N}).X-RateLimit-*headers unchanged.RATE_LIMITenv var unchanged in semantics; a newRATE_LIMITdeployment can keep its current value.Acceptance criteria
## Acceptance criteria

- `RateLimiter` carries a `keyFor(c)` helper covering org-id → bearer-hash → IP
- `Middleware()` uses `keyFor` for both the bucket lookup and the 429 response
- Anti-drift AST gate lands in `ratelimit_test.go`
- `mcp_ratelimit.go` remains untouched (it already has its own SSOT — bearer-hash via `tokenKey`)
## Out of scope (parked as follow-ups)

- Canvas poll dedup — the consumers polling `/activity` for the same workspace at independent cadences could be deduped via a single shared poll + pub/sub fan-out. Separate canvas PR; it would let us lower the default limit again without breaking power users.
- `layout-*.js` 429s — DevTools shows 4× 429 on the static layout chunk; that's at the edge layer (outside this server), likely an edge anti-DDoS rule reacting to the same retry storm. Closes once the workspace-server 429 stops firing (no more retry storm). Worth a CF rule audit if it persists.
- `RATE_LIMIT` default re-tune — once keying is fixed, the default can be lowered (the per-bucket budget per session is much smaller than per-IP across a tenant). Defer to a follow-up PR with traffic data.
## Relationship to recent work

- The `provisioner.RunningContainerName` helper — same SSOT-helper-with-AST-gate shape, different surface.
- #179 — the fix does not depend on `c.ClientIP()` returning the user IP, so #179's trust-nil stance can stay.