rfc(ratelimit): RATE_LIMIT default re-tune analysis post-#60 — keep 600, watch metrics (P3) #64
## Context
Parked follow-up from PR #60 (issue #59). Now that `RateLimiter` keys buckets per-tenant (org / token / IP fallback) instead of per-IP-only, the original `RATE_LIMIT=600` default may no longer be sized correctly. Filing as P3 (no current bug; needs traffic data to re-tune confidently).
## Why 600/min was originally chosen
The pre-#60 comment at `workspace-server/internal/router/router.go:53` did the sizing math for the OLD per-IP behaviour, where ALL canvas traffic from ALL tabs from ALL users behind one egress IP collapsed into a single bucket. 600/min was already tight under realistic load, which is precisely why the canvas saturated it (issue #59).
## What changed
PR #60's `keyFor` produces bucket keys at three granularities:

- `org:<uuid>`
- `tok:<sha256-hex>`
- `ip:<addr>`

So the bucket-key cardinality changes from "1 per egress IP" to roughly "1 per (tenant or session)".
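To pin down the shape, a minimal sketch of that keying order, assuming header-based extraction (hypothetical — PR #60's real `keyFor` presumably reads the auth context, and the `X-Org-ID` header here is a stand-in):

```go
package ratelimit

import (
	"crypto/sha256"
	"encoding/hex"
	"net"
	"net/http"
	"strings"
)

// keyFor picks the most specific identity available for the bucket key:
// org, then hashed token, then client-IP fallback. Hypothetical sketch
// of the PR #60 behaviour described above, not the actual implementation.
func keyFor(r *http.Request) string {
	if org := r.Header.Get("X-Org-ID"); org != "" { // stand-in for the auth context
		return "org:" + org
	}
	if auth := r.Header.Get("Authorization"); strings.HasPrefix(auth, "Bearer ") {
		sum := sha256.Sum256([]byte(strings.TrimPrefix(auth, "Bearer ")))
		return "tok:" + hex.EncodeToString(sum[:]) // hash so raw tokens never become keys
	}
	host, _, err := net.SplitHostPort(r.RemoteAddr)
	if err != nil {
		host = r.RemoteAddr // RemoteAddr arrived without a port
	}
	return "ip:" + host
}
```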
## Per-key traffic estimate (single-user power case)
Worst case: one user, 6 workspaces visible, `A2ATopologyOverlay` on, `ActivityTab` open on the active workspace, Communications overlay expanded.
- `A2ATopologyOverlay` (60s interval × 6 ws = 6 req/min)
- `CommunicationOverlay` (30s interval × min(6,3) ws = 6 req/min)
- `ActivityTab` (5s interval × 1 ws = 12 req/min)
- `ChatTab` (initial load + scrollback), plus `/peers`, `/plugins`, etc.

That is ~24 req/min from polling alone, and roughly 50 req/min per session once the non-polling traffic is included (the per-session figure implied by the multipliers that follow; the snippet after this list makes the arithmetic reproducible). Multi-tab amplifies: 3 browser tabs ≈ 150 req/min on the same `tok:` bucket. SaaS-plane multi-user: 5 simultaneous users on one tenant ≈ 250 req/min on the same `org:` bucket.
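To keep the estimate reproducible, the arithmetic spelled out (the per-poller breakdown is my reconstruction from the intervals above; the ~50 req/min per-session figure is implied by the 150 and 250 totals):

```go
package main

import "fmt"

func main() {
	// req/min per poller = workspaces × (60 / interval-seconds)
	topology := 6 * (60.0 / 60.0) // A2ATopologyOverlay: 6 ws at 60s  ->  6 req/min
	comms := 3 * (60.0 / 30.0)    // CommunicationOverlay: 3 ws at 30s ->  6 req/min
	activity := 1 * (60.0 / 5.0)  // ActivityTab: 1 ws at 5s          -> 12 req/min

	polling := topology + comms + activity // 24 req/min before chat/peers/plugins
	perSession := 50.0                     // implied by the 150 (3 tabs) and 250 (5 users) figures

	fmt.Printf("polling only:            %.0f req/min\n", polling)
	fmt.Printf("3 tabs on one tok: key:  %.0f req/min\n", 3*perSession) // vs the 600 budget
	fmt.Printf("5 users on one org: key: %.0f req/min\n", 5*perSession)
}
```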
## Recommendation

Keep the 600 req/min default. Reasoning:

- Even the multi-user power case above (≈ 250 req/min on a shared `org:` bucket) leaves comfortable headroom under 600.
- Power tenants can raise the limit via the `RATE_LIMIT` env var without code change (see the sketch below) — no reason to lower the default and force overrides.
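For completeness, a minimal sketch of what the override wiring implies, assuming a plain env-var read (the actual `workspace-server` config loader may differ; only the `RATE_LIMIT` name and the 600 default come from this issue):

```go
package config

import (
	"os"
	"strconv"
)

// rateLimitPerMin returns the per-key budget in requests per minute.
// The default stays at 600 per this issue's recommendation; operators
// override via RATE_LIMIT (e.g. RATE_LIMIT=1200 for a power tenant)
// with no code change. Sketch only.
func rateLimitPerMin() int {
	if v := os.Getenv("RATE_LIMIT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n // valid operator override wins
		}
	}
	return 600 // documented default
}
```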
## What to actually do here

This issue documents the math + recommendation. Concrete acceptance:
- If metrics show sustained 429s on an `org:` or `tok:` key (not just IP-fallback), file a follow-up PR raising the default with the metric chart attached.
- If a single tenant needs headroom sooner, record the `RATE_LIMIT=<higher>` operator override + this issue's link in the tenant's ops runbook.

## Alternatives rejected
A. Lower to 300 req/min/key now. Rejected: no measured benefit, fires on legitimate edge bursts, forces operator overrides for power tenants.
B. Raise to 1200 req/min/key now. Rejected: adds runway for misbehaving clients without measured need. Real protection against re-render storms is the limiter firing, not letting them flood for longer.
C. Per-route limits (e.g. `/activity` higher, mutating routes lower). Rejected at this priority — adds endpoint-list maintenance burden. Reconsider only if metrics show one route is the dominant bucket consumer for legitimate traffic.

D. Distinct burst vs sustained limits (a token bucket with burst capacity separate from refill rate, like CP's `Burst`+`Rate` in `molecule-controlplane/internal/middleware/ratelimit.go`; see the sketch below). Rejected for this issue: requires a non-trivial refactor of the bucket struct. File separately if the simple interval-reset model proves inadequate under real traffic.
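For the record, the shape of what alternative D means — a hedged sketch, not CP's actual middleware (whose `Burst`/`Rate` fields are the real reference):

```go
package ratelimit

import (
	"sync"
	"time"
)

// tokenBucket separates burst capacity from the sustained refill rate,
// unlike the interval-reset model where the whole budget renews at once.
type tokenBucket struct {
	mu     sync.Mutex
	tokens float64   // current balance, capped at burst
	burst  float64   // maximum spend in one spike
	rate   float64   // sustained refill, tokens per second
	last   time.Time // last refill timestamp
}

func newTokenBucket(burst, perSecond float64) *tokenBucket {
	return &tokenBucket{tokens: burst, burst: burst, rate: perSecond, last: time.Now()}
}

// Allow spends one token if available; a false return maps to a 429.
func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate // refill since last call
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

If D is ever picked up, `golang.org/x/time/rate.NewLimiter(limit, burst)` already implements this split and would likely be cheaper than refactoring the bucket struct by hand.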
## SSOT decision

No SSOT change — the `RATE_LIMIT` env var is already the single source for the bucket size; this issue just records the rationale for the default.

## Security check
## Versioning + backwards compat
No code change in this issue.
## Severity
P3 — recommendation is "keep the current default and watch metrics." Re-evaluate with real data.
## Closing — empirical metrics, not 14-day wait
Per the new `docs/engineering/ratelimit-observability.md` runbook (#85), the decision tree starts with Q1: is the limiter serving any 429s?

A live `/metrics` snapshot from `hongming.moleculesai.app` (SHA `0276b295`, still on OLD per-IP keying) answers it:
/metricssnapshot fromhongming.moleculesai.app(SHA0276b295, still on OLD per-IP keying):Across ~17,000+ requests on the active routes (10,302 × /activity, 2,550 × /registry/heartbeat, 574 × /delegations, plus tail), zero 429 responses. Q1 answer: NO.
Decision: keep `RATE_LIMIT=600` as the default. After #60 + #61 stages 1-3 deploy, the math gets even more comfortable (per-tenant buckets + ~95% reduction in canvas-driven HTTP volume).
The original "watch metrics for 14 days" plan was the conservative path; the actual data says the answer is already clear. If a future tenant hits sustained 429s, the runbook tells the operator exactly what to do — no need for a default change.
Closing.