fix: dev-mode bypass for IP rate limiter + 429 retry on GET

The 600-req/min/IP bucket is sized for SaaS where each tenant has
a distinct client IP. On a local Docker setup every panel shares
one IP — hydration (/workspaces + /templates + /org/templates +
/approvals/pending) plus polling (A2A overlay + activity tabs +
approvals + schedule + channels + audit trail) can burst past the
bucket inside a minute, blanking the canvas with 429s. The user
reported it after dragging workspaces, but dragging itself is
release-only (savePosition fires in onNodeDragStop); what tripped the
limit was the always-running polling stacked on top of the startup
hydration burst.

Two-layer fix:

Server: RateLimiter.Middleware short-circuits when isDevModeFailOpen
is true (MOLECULE_ENV=development + empty ADMIN_TOKEN), matching
the Tier-1b hatch already applied to AdminAuth, WorkspaceAuth, and
discovery. SaaS production keeps the bucket.

Client: api.ts auto-retries a single 429 on idempotent GET requests,
waiting the server-provided Retry-After (capped at 20s). Mutations
(POST/PUT/PATCH/DELETE) never auto-retry to avoid double-applying.
Users on SaaS hitting a legitimate rate-limit spike get one
transparent recovery instead of an immediately blank canvas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-23 20:44:09 -07:00
parent 286dcbfd1e
commit f2a4b6e0d3
2 changed files with 27 additions and 1 deletion


@@ -17,7 +17,8 @@ const DEFAULT_TIMEOUT_MS = 15_000;
 async function request<T>(
   method: string,
   path: string,
-  body?: unknown
+  body?: unknown,
+  retryCount = 0,
 ): Promise<T> {
   // SaaS cross-origin shape:
   // - X-Molecule-Org-Slug: derived from window.location.hostname by
@@ -38,6 +39,18 @@ async function request<T>(
     credentials: "include",
     signal: AbortSignal.timeout(DEFAULT_TIMEOUT_MS),
   });
+  // Transient rate-limit recovery. A single IP bucket can momentarily
+  // spike on page load (several panels hydrate simultaneously). Instead
+  // of bubbling up a 429 that blanks the Canvas, wait the
+  // Retry-After window and try once — any further 429 surfaces normally.
+  // GET / idempotent methods only; never auto-retry mutations.
+  if (res.status === 429 && retryCount === 0 && method === "GET") {
+    const retryAfterHeader = res.headers.get("Retry-After");
+    const retryAfter = retryAfterHeader ? parseInt(retryAfterHeader, 10) : NaN;
+    const delayMs = Number.isFinite(retryAfter) ? Math.min(retryAfter, 20) * 1000 : 2000;
+    await new Promise((resolve) => setTimeout(resolve, delayMs));
+    return request<T>(method, path, body, retryCount + 1);
+  }
   if (res.status === 401) {
     // Session expired or credentials lost. On SaaS (tenant subdomain)
     // the login page lives at /cp/auth/login and is mounted by the


@@ -57,6 +57,19 @@ func NewRateLimiter(rate int, interval time.Duration, ctx context.Context) *Rate
 // Middleware returns a Gin middleware that rate limits by client IP.
 func (rl *RateLimiter) Middleware() gin.HandlerFunc {
 	return func(c *gin.Context) {
+		// Tier-1b dev-mode hatch — same gate as AdminAuth / WorkspaceAuth /
+		// discovery. On a local single-user Docker setup the 600-req/min
+		// bucket fills fast: a 15-workspace canvas + activity polling +
+		// approvals polling + A2A overlay + initial hydration all share
+		// one IP bucket, so a minute of active use can trip 429 and blank
+		// the page. Gated by MOLECULE_ENV=development + empty ADMIN_TOKEN
+		// so SaaS production keeps the bucket.
+		if isDevModeFailOpen() {
+			c.Header("X-RateLimit-Limit", "unlimited")
+			c.Next()
+			return
+		}
 		ip := c.ClientIP()
 		rl.mu.Lock()