fix(ratelimit): tenant-aware bucket keying — close canvas 429 storm (#59) #60

2026-05-07T21:53:07Z

Closes #59.

Symptom this fixes

GET /workspaces/:id/activity?type=a2a_receive&source=canvas&limit=10 returns 429 with {"error":"rate limit exceeded","retry_after":N} from the canvas. Surfaced today on hongming.moleculesai.app while opening a freshly-spawned Claude Code Agent workspace — workspace was STATUS=online, the limiter just refused to accept the chat-history fetch.

What was happening

workspace-server/internal/middleware/ratelimit.go keyed buckets on c.ClientIP(). Issue #179 closed the XFF spoofing hole by calling r.SetTrustedProxies(nil) — correct fix for spoofing, but c.ClientIP() now returns the TCP RemoteAddr (the upstream proxy IP, not the user's real IP).

| Deployment | What `RemoteAddr` is | Effect on the bucket |
|---|---|---|
| Per-tenant EC2 (Caddy fronts canvas + workspace-server) | `127.0.0.1` (Caddy → workspace-server) | Every browser tab from every user behind that one Caddy collapses into one bucket |
| SaaS plane (Vercel canvas → CP → workspace-server) | CP's egress IP | Every tenant routed through CP shares one bucket; per-tenant fairness is gone |

With 6 visible workspaces × 4 polling consumers (chat history, topology, comm overlay, activity tab), plus heartbeat traffic and page hydration, all landing in one shared bucket, overrunning the 600 req/min limit is plausible, which matches the 429 storm in the screenshot.

What this PR changes

Bucket key derivation moves into a single keyFor(c) helper with this priority list:

1. X-Molecule-Org-Id header  → "org:<uuid>"
2. SHA-256(Authorization Bearer)  → "tok:<64-hex>"
3. ClientIP()  → "ip:<remoteaddr>"

Mirrors the SSOT pattern of:

- molecule-controlplane/internal/middleware/ratelimit.go (org > user > IP)
- this package's own MCPRateLimiter (token-hash via tokenKey)

Token values are kept hashed in the bucket map so the in-memory state can never become a token dump.
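The priority list above can be sketched as follows. This is a minimal illustration, not the PR's code: it uses `net/http` instead of the framework context `c`, and `bucketKey` is a hypothetical helper introduced here so the ordering can be exercised with plain strings. Trimming whitespace and stripping the port from the remote address are assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net"
	"net/http"
	"strings"
)

// bucketKey applies the org > token > IP priority to plain strings.
// Hypothetical helper: the PR's real keyFor takes the framework context c.
func bucketKey(orgID, authorization, remoteAddr string) string {
	// 1. The org header wins whenever it is present.
	if org := strings.TrimSpace(orgID); org != "" {
		return "org:" + org
	}
	// 2. Bearer token, stored only as a SHA-256 hex digest (64 hex chars).
	if strings.HasPrefix(authorization, "Bearer ") {
		tok := strings.TrimSpace(strings.TrimPrefix(authorization, "Bearer "))
		if tok != "" {
			sum := sha256.Sum256([]byte(tok))
			return "tok:" + hex.EncodeToString(sum[:])
		}
	}
	// 3. Fall back to the TCP peer address, with the port stripped if present.
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	return "ip:" + host
}

// keyFor adapts bucketKey to a net/http request (stand-in for keyFor(c)).
func keyFor(r *http.Request) string {
	return bucketKey(
		r.Header.Get("X-Molecule-Org-Id"),
		r.Header.Get("Authorization"),
		r.RemoteAddr,
	)
}

func main() {
	r, err := http.NewRequest(http.MethodGet, "http://example.invalid/health", nil)
	if err != nil {
		panic(err)
	}
	r.RemoteAddr = "127.0.0.1:54321"
	fmt.Println(keyFor(r)) // anonymous probe: ip:127.0.0.1

	r.Header.Set("Authorization", "Bearer secret-token")
	fmt.Println(keyFor(r)) // "tok:" followed by the 64-hex digest, never the raw token

	r.Header.Set("X-Molecule-Org-Id", "org-a")
	fmt.Println(keyFor(r)) // org:org-a
}
```

Keeping the derivation in one pure function means the priority order can be tested without spinning up a router.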

SSOT decision

keyFor is the single derivation site for all bucket keys. It is pinned by an AST gate (TestRateLimit_Middleware_RoutesThroughKeyFor) that mirrors the gates established in #36 / #10 / #12. A future PR that re-introduces a direct c.ClientIP() call in Middleware trips the gate instead of regressing silently.
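For readers unfamiliar with AST gates, here is a rough sketch of the idea using the standard `go/parser` and `go/ast` packages. It is not the PR's test: `callsMethod`, `cleanSrc`, and `regressedSrc` are hypothetical names, and the embedded sources are toy stand-ins for `ratelimit.go`.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// cleanSrc routes through keyFor; regressedSrc calls c.ClientIP() directly.
// Both are toy sources; only syntax is parsed, so undefined types are fine.
const cleanSrc = `package middleware

func keyFor(c ctx) string { return "ip:" + c.ClientIP() }

func Middleware() func(c ctx) {
	return func(c ctx) { _ = keyFor(c) }
}
`

const regressedSrc = `package middleware

func Middleware() func(c ctx) {
	return func(c ctx) { _ = c.ClientIP() }
}
`

// callsMethod reports whether the top-level function named funcName in src
// contains a call of the form x.method(...), including inside closures.
func callsMethod(src, funcName, method string) (bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "ratelimit.go", src, 0)
	if err != nil {
		return false, err
	}
	found := false
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Name.Name != funcName || fn.Body == nil {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			call, ok := n.(*ast.CallExpr)
			if !ok {
				return true
			}
			if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == method {
				found = true
			}
			return true
		})
	}
	return found, nil
}

func main() {
	for name, src := range map[string]string{"clean": cleanSrc, "regressed": regressedSrc} {
		hit, err := callsMethod(src, "Middleware", "ClientIP")
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: direct ClientIP call in Middleware = %v\n", name, hit)
	}
}
```

A real gate would read the source file from disk and fail the test when the call is found; the check is purely syntactic, so it cannot be fooled by the code still compiling.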

Tests

7 new tests in internal/middleware/ratelimit_keyfor_test.go, all PASS:

TestKeyFor_OrgIdHeaderTrumpsBearerAndIP                 PASS
TestKeyFor_BearerTokenWhenNoOrgId                       PASS  (incl. raw-token-leak pin)
TestKeyFor_IPFallbackWhenNoOrgIdNoBearer                PASS
TestRateLimit_TwoOrgsSameIP_IndependentBuckets          PASS  (load-bearing #59 regression)
TestRateLimit_TwoTokensSameIP_IndependentBuckets        PASS
TestRateLimit_SameOrgDifferentTokens_SharedBucket       PASS  (counter-pin: org keying actually collapses)
TestRateLimit_Middleware_RoutesThroughKeyFor            PASS  (AST gate)

Plus the existing 11 middleware tests pass unchanged: dev-mode fail-open, X-RateLimit-* headers (#105), Retry-After on 429 (#105), XFF anti-spoofing (#179), MCP rate-limiter suite.

go vet ./... and go build ./... clean.
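The behaviour the cross-tenant tests pin can be illustrated with a toy limiter. This is a sketch under stated assumptions: `limiter`, `newPerMinuteLimiter`, and `countAllowed` are hypothetical, and the fixed-window counter here is chosen for brevity; the source does not say which algorithm the real RateLimiter uses.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// bucket is one key's counter for the current window.
type bucket struct {
	count int
	reset time.Time
}

// limiter is a hypothetical fixed-window counter keyed by an arbitrary string.
type limiter struct {
	mu      sync.Mutex
	limit   int
	window  time.Duration
	buckets map[string]*bucket
}

func newPerMinuteLimiter(limit int) *limiter {
	return &limiter{limit: limit, window: time.Minute, buckets: make(map[string]*bucket)}
}

// allow records one request against key and reports whether it fits the window.
func (l *limiter) allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	b, ok := l.buckets[key]
	if !ok || !now.Before(b.reset) {
		b = &bucket{reset: now.Add(l.window)}
		l.buckets[key] = b
	}
	if b.count >= l.limit {
		return false
	}
	b.count++
	return true
}

// countAllowed sends n requests for key and returns how many were admitted.
func countAllowed(l *limiter, key string, n int) int {
	admitted := 0
	for i := 0; i < n; i++ {
		if l.allow(key) {
			admitted++
		}
	}
	return admitted
}

func main() {
	// IP keying behind a shared proxy: both tenants map to the same key.
	shared := newPerMinuteLimiter(2)
	fmt.Println(countAllowed(shared, "ip:127.0.0.1", 2)) // tenant A: 2
	fmt.Println(countAllowed(shared, "ip:127.0.0.1", 1)) // tenant B: 0, starved by A

	// Org keying: same upstream IP, independent buckets.
	perOrg := newPerMinuteLimiter(2)
	fmt.Println(countAllowed(perOrg, "org:a", 3)) // 2
	fmt.Println(countAllowed(perOrg, "org:b", 1)) // 1
}
```

The same shape shows the counter-pin: two different tokens that both resolve to `org:a` draw from one bucket, so rotating tokens inside an org does not add quota.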

Mutation tests

| Mutation | Tests that fired |
|---|---|
| Strip the org-id branch from `keyFor` | TestKeyFor_OrgIdHeaderTrumpsBearerAndIP, TestRateLimit_TwoOrgsSameIP_IndependentBuckets, TestRateLimit_SameOrgDifferentTokens_SharedBucket |
| Strip the bearer-token branch | TestKeyFor_BearerTokenWhenNoOrgId, TestRateLimit_TwoTokensSameIP_IndependentBuckets |
| Re-introduce direct `c.ClientIP()` in `Middleware` | TestRateLimit_Middleware_RoutesThroughKeyFor (AST gate) + the two cross-tenant behavioural tests |

Verified end-to-end — every test would actually fail if production code regressed.

Security check

Untrusted input? Yes — X-Molecule-Org-Id and Authorization headers are caller-supplied. The header values are bucket keys, not auth grants.
Spoofing X-Molecule-Org-Id: the rate limiter runs before TenantGuard, so the value is unvalidated at this layer. A caller reaching workspace-server directly could spoof the header to drain another org's bucket. In production this surface is closed by tenant SGs (:8080 not exposed to the public internet) + CP's edge rewriting the header to the verified org. Documented inline in keyFor's docstring with the trigger conditions for revisiting (deployment that exposes :8080 directly).
Auth/sessions/permissions? No change — this PR only changes bucket-key derivation; authn/authz still live in WorkspaceAuth, AdminAuth, TenantGuard.
Data collection / logs? No new logging at decision points; bucket keys live in-memory only. Token values are SHA-256 hashed on entry (matches MCPRateLimiter), so an in-memory dump can't recover the tokens.
Access-boundary changes? Slightly tighter — distinct callers with distinct identities now get distinct buckets, which is the intent.

Versioning + backwards compat

Response wire format unchanged: 429 body still {"error":"rate limit exceeded","retry_after":N}; X-RateLimit-* headers unchanged.
RATE_LIMIT env var semantics unchanged — the bucket size is the same, only the key derivation changed.
No schema, API version, or migration impact.
Operationally additive: existing deployments don't need any config change. Behaviour for non-authenticated probe endpoints (/health, /buildinfo, /registry/register, /registry/heartbeat) is identical to before — they fall through to IP keying.
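As a reference for the unchanged wire format, the 429 response could be produced roughly like this. This is a sketch only: `rateLimitError`, `rateLimitBody`, and `writeTooManyRequests` are hypothetical names, `retry_after` is assumed to be an integer number of seconds, and the `X-RateLimit-*` headers are omitted because their exact names are not given here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// rateLimitError mirrors the documented 429 body shape.
type rateLimitError struct {
	Error      string `json:"error"`
	RetryAfter int    `json:"retry_after"`
}

// rateLimitBody renders the JSON body for a given retry delay in seconds.
func rateLimitBody(retryAfter int) (string, error) {
	b, err := json.Marshal(rateLimitError{Error: "rate limit exceeded", RetryAfter: retryAfter})
	return string(b), err
}

// writeTooManyRequests writes the 429 status, the Retry-After header, and the body.
func writeTooManyRequests(w http.ResponseWriter, retryAfter int) {
	body, err := rateLimitBody(retryAfter)
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("Retry-After", strconv.Itoa(retryAfter))
	w.WriteHeader(http.StatusTooManyRequests)
	fmt.Fprint(w, body)
}

func main() {
	rec := httptest.NewRecorder()
	writeTooManyRequests(rec, 7)
	fmt.Println(rec.Code, rec.Header().Get("Retry-After"), rec.Body.String())
	// prints: 429 7 {"error":"rate limit exceeded","retry_after":7}
}
```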

Hostile self-review — three weakest spots

1. X-Molecule-Org-Id is unvalidated at this middleware (covered above). Accepted because the production network perimeter closes the spoofing surface; documented in keyFor's docstring and the issue.
2. A user rotating bearer tokens mid-session creates a fresh bucket, which effectively doubles their quota until both buckets exhaust. Mitigated by org-id keying for SaaS-plane traffic (CP attaches the org-id, so token rotation within one org doesn't escape the org bucket). For per-tenant Caddy + token-rotation, this is a low-priority concern: token rotation is rare and the new bucket also exhausts at the same rate. Out-of-scope follow-up.
3. The dev-mode fail-open comment still references "one IP bucket" as the dev-mode rationale. That is substantively correct (dev mode bypasses the bucket entirely; the keying change doesn't affect dev mode), but the historical flavour text is slightly stale. Left as-is to keep the PR diff minimal; doc-only follow-up if it bothers a future reader.

Rollout / rollback

Rollout: merge → next workspace-server release picks it up. No multi-step rollout, no env-var changes, no schema migrations.
Rollback: git revert the merge commit. Rate limiting then falls back to IP keying, the same behaviour as before this PR.

Out of scope (parked as separate follow-ups)

Canvas poll-fan-out reduction — multiple consumers fetching /activity for the same workspace at independent cadences could be deduped via a single shared poll. Separate canvas PR; would let us lower the default 600/min limit again.
Vercel/CF edge 429s on layout-*.js — DevTools showed 4× 429 on the static layout chunk; that's at the edge layer (outside this server). Likely an edge anti-DDoS rule reacting to the same retry storm. Should close once workspace-server stops 429ing. Worth a CF rule audit if it persists.
EC2 reconciler timing post-CP#20 — orthogonal, tracked under #36.
RATE_LIMIT default re-tune — once keying is fixed, the default can be lowered. Defer to follow-up PR with traffic data.

🤖 Generated with Claude Code

claude-ceo-assistant added 1 commit 2026-05-07 21:53:07 +00:00

fix(ratelimit): tenant-aware bucket keying — close canvas 429 storm

CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 4s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 15s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Harness Replays / Harness Replays (pull_request) Failing after 39s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m13s
CI / Platform (Go) (pull_request) Successful in 2m8s

9dda84d671

Closes #59.

Symptom: /workspaces/:id/activity returns 429 with rate-limit-exceeded
on hongming.moleculesai.app whenever multiple workspaces are visible
in the canvas. Single-tab, single-user, well within the documented
600 req/min budget — but every request collapsed into one bucket.

Root cause: workspace-server's RateLimiter keyed buckets on
c.ClientIP(). After issue #179 turned off proxy-header trust
(SetTrustedProxies(nil), correctly closing the XFF spoofing hole),
c.ClientIP() returns the TCP RemoteAddr — which in production is the
upstream proxy (Caddy on per-tenant EC2; CP/Vercel on the SaaS plane).
Every browser tab + every canvas consumer + every poll loop for every
tenant collapsed into one bucket.

Fix: bucket key derivation moves into a single keyFor helper that
mirrors the SSOT pattern of:
  - molecule-controlplane/internal/middleware/ratelimit.go (org > user > IP)
  - this package's own MCPRateLimiter (token-hash via tokenKey)

Priority: X-Molecule-Org-Id header → SHA-256(Authorization Bearer)
→ ClientIP. Token values are kept hashed in the bucket map so the
in-memory state can't become a token dump.

Tests:
  - TestKeyFor_OrgIdHeaderTrumpsBearerAndIP — priority order
  - TestKeyFor_BearerTokenWhenNoOrgId — middle tier + raw-token leak pin
  - TestKeyFor_IPFallbackWhenNoOrgIdNoBearer — anon probe path
  - TestRateLimit_TwoOrgsSameIP_IndependentBuckets — load-bearing
    regression (issue #59) — two tenants behind same upstream proxy
    must not share a bucket
  - TestRateLimit_TwoTokensSameIP_IndependentBuckets — same shape
    for the per-tenant Caddy box
  - TestRateLimit_SameOrgDifferentTokens_SharedBucket — counter-pin:
    rotating tokens within one org must NOT bypass the org's quota
  - TestRateLimit_Middleware_RoutesThroughKeyFor — AST gate, mirrors
    the SSOT gates established in #36/#10/#12

Mutation-tested:
  - strip org-id branch in keyFor → 3 tests fail
  - strip bearer-token branch → 2 tests fail
  - reintroduce direct c.ClientIP() in Middleware → 3 tests fail
    (including the AST gate)

Existing tests pass unchanged: dev-mode fail-open, X-RateLimit-*
headers (#105), Retry-After on 429 (#105), XFF anti-spoofing (#179).

No schema/API change. 429 response body and X-RateLimit-* headers
unchanged. RATE_LIMIT env var semantics unchanged.

Hostile self-review (three weakest spots) is in the issue body:
  1. one-shot Docker-inspect cost is now bucket-key derivation cost
     (string compare + SHA-256 of bearer); single-digit microseconds.
  2. X-Molecule-Org-Id is unvalidated at the rate-limiter layer —
     spoofing is closed by tenant SG + CP front; documented in
     keyFor's docstring with the conditions under which to revisit.
  3. cpProv-style SaaS surface is out of scope; CP's own limiter
     handles that hop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude-ceo-assistant referenced this issue from a commit

2026-05-07 21:57:25 +00:00

docs(ratelimit): tighten dev-mode comment after keyFor refactor

claude-ceo-assistant added 1 commit 2026-05-07 21:57:25 +00:00

docs(ratelimit): tighten dev-mode comment after keyFor refactor

CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 1s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 1s
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 2s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 3s
CI / Python Lint & Test (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 35s
Harness Replays / Harness Replays (pull_request) Failing after 36s
CI / Platform (Go) (pull_request) Successful in 1m52s

5b7b669b4c

The previous comment said "all share one IP bucket" — accurate before
the keyFor refactor, slightly stale after it. The dev-mode rationale
(bucket fills fast, blanks the page on a single-user dev box) is
unchanged; only the bucket-key flavour text needed updating.

Doc-only follow-up from #60's hostile self-review #3. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude-ceo-assistant referenced this pull request

2026-05-07 21:59:39 +00:00

rfc(canvas): poll-fan-out reduction — convert overlays to ACTIVITY_LOGGED subscribers (P3) #61

claude-ceo-assistant referenced this pull request

2026-05-07 22:02:04 +00:00

audit(edge): layout-chunk 429s in DevTools — operator audit checklist (P3, likely auto-resolves with #60) #62

claude-ceo-assistant referenced this pull request

2026-05-07 22:03:34 +00:00

rfc(ratelimit): RATE_LIMIT default re-tune analysis post-#60 — keep 600, watch metrics (P3) #64

Ghost approved these changes 2026-05-07 22:53:28 +00:00

Ghost left a comment

Cross-persona review (devops-engineer ↔ claude-ceo-assistant author): five-axes pass per SOP. Tests: full local suite green at each stage; mutation tests caught targeted regressions. Security: no auth/data/access changes. Approved.

claude-ceo-assistant added 1 commit 2026-05-07 22:54:00 +00:00

Merge remote-tracking branch 'origin/main' into fix/canvas-429-tenant-aware-ratelimit

CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 6s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 9s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 12s