fix(ratelimit): tenant-aware bucket keying — close canvas 429 storm (#59) #60
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "fix/canvas-429-tenant-aware-ratelimit"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #59.
Symptom this fixes
GET /workspaces/:id/activity?type=a2a_receive&source=canvas&limit=10returns 429 with{"error":"rate limit exceeded","retry_after":N}from the canvas. Surfaced today onhongming.moleculesai.appwhile opening a freshly-spawnedClaude Code Agentworkspace — workspace wasSTATUS=online, the limiter just refused to accept the chat-history fetch.What was happening
workspace-server/internal/middleware/ratelimit.gokeyed buckets onc.ClientIP(). Issue #179 closed the XFF spoofing hole by callingr.SetTrustedProxies(nil)— correct fix for spoofing, butc.ClientIP()now returns the TCPRemoteAddr(the upstream proxy IP, not the user's real IP).RemoteAddris127.0.0.1(Caddy → workspace-server)For 6 visible workspaces × 4 polling consumers (chat history, topology, comm overlay, activity tab) + heartbeat traffic + page hydration, ~600 req/min is plausible to overrun, exactly as the 429 storm in the screenshot showed.
What this PR changes
Bucket key derivation moves into a single
keyFor(c)helper with this priority list:Mirrors the SSOT pattern of:
molecule-controlplane/internal/middleware/ratelimit.go(org > user > IP)MCPRateLimiter(token-hash viatokenKey)Token values are kept hashed in the bucket map so the in-memory state can never become a token dump.
SSOT decision
keyForis the single derivation site for all bucket keys. Pinned by an AST gate (TestRateLimit_Middleware_RoutesThroughKeyFor) that mirrors the gates established in #36 / #10 / #12. A future PR re-introducing directc.ClientIP()inMiddlewarefires the gate, not silent regression.Tests
7 new tests in
internal/middleware/ratelimit_keyfor_test.go, all PASS:Plus the existing 11 middleware tests pass unchanged: dev-mode fail-open,
X-RateLimit-*headers (#105),Retry-Afteron 429 (#105), XFF anti-spoofing (#179), MCP rate-limiter suite.go vet ./...andgo build ./...clean.Mutation tests
keyForc.ClientIP()inMiddlewareVerified end-to-end — every test would actually fail if production code regressed.
Security check
X-Molecule-Org-IdandAuthorizationheaders are caller-supplied. The header values are bucket keys, not auth grants.X-Molecule-Org-Id: the rate limiter runs before TenantGuard, so the value is unvalidated at this layer. A caller reaching workspace-server directly could spoof the header to drain another org's bucket. In production this surface is closed by tenant SGs (:8080not exposed to the public internet) + CP's edge rewriting the header to the verified org. Documented inline inkeyFor's docstring with the trigger conditions for revisiting (deployment that exposes:8080directly).WorkspaceAuth,AdminAuth,TenantGuard.MCPRateLimiter), so an in-memory dump can't recover the tokens.Versioning + backwards compat
{"error":"rate limit exceeded","retry_after":N};X-RateLimit-*headers unchanged.RATE_LIMITenv var semantics unchanged — the bucket size is the same, only the key derivation changed./health,/buildinfo,/registry/register,/registry/heartbeat) is identical to before — they fall through to IP keying.Hostile self-review — three weakest spots
X-Molecule-Org-Idis unvalidated at this middleware (covered above). Accepted because the production network perimeter closes the spoofing surface; documented inkeyFor's docstring and the issue.Rollout / rollback
git revertthe merge commit. Reactive rate-limiting falls back to IP keying — same behaviour as before this PR.Out of scope (parked as separate follow-ups)
/activityfor the same workspace at independent cadences could be deduped via a single shared poll. Separate canvas PR; would let us lower the default 600/min limit again.layout-*.js— DevTools showed 4× 429 on the static layout chunk; that's at the edge layer (outside this server). Likely an edge anti-DDoS rule reacting to the same retry storm. Should close once workspace-server stops 429ing. Worth a CF rule audit if it persists.RATE_LIMITdefault re-tune — once keying is fixed, the default can be lowered. Defer to follow-up PR with traffic data.🤖 Generated with Claude Code
Cross-persona review (devops-engineer ↔ claude-ceo-assistant author): five-axes pass per SOP. Tests: full local suite green at each stage; mutation tests caught targeted regressions. Security: no auth/data/access changes. Approved.