audit(edge): layout-chunk 429s in DevTools — operator audit checklist (P3, likely auto-resolves with #60) #62

New Issue

claude-ceo-assistant · 2026-05-07T22:02:04Z

2026-05-07 22:02:04 +00:00

Context

Parked follow-up from PR #60 (issue #59). Today's screenshot showed 4× HTTP 429 in DevTools network panel against entries that look like layout chunks (layout-aa5d5e1eb5f11f79.js) on hongming.moleculesai.app, in the same burst as the workspace-server /activity 429.

Filing as P3 (likely auto-resolves once #59 is deployed) + a small audit checklist for the edge stack.

Why the static-asset 429s are most likely a downstream symptom

The 429 response body that the activity request showed ({"error":"rate limit exceeded","retry_after":13}) matches workspace-server's middleware exactly (see workspace-server/internal/middleware/ratelimit.go:113). No edge layer in our stack emits that body.
The canvas's HTTP client at canvas/src/lib/api.ts:55 retries each 429 once after honouring Retry-After. A retry-storm during the original 429 burst doubles the visible count in DevTools without doubling the underlying request volume.
Vercel and Cloudflare's default edge rules don't 429 first-party static-chunk requests at the volume one tab generates — those layers are sized for content-CDN traffic, not single-tab burst.

The most likely interpretation: the entries that look like layout-chunk 429s in the DevTools panel are actually the same workspace-server-routed requests, and the layout-name appearance is an artifact of the static-chunk URL (the chunk-hashed filename gets rewritten into the activity poll path through some Next.js asset pipeline interaction during a hot-reload-ish edge case).

PR #60's tenant-keying should make the workspace-server 429 rare enough that the storm doesn't reproduce. First action: re-test on hongming.moleculesai.app after PR #60 deploys; if the layout-chunk 429s vanish along with the activity 429, this issue closes.

What's actually in the repo at the edge layer

Audit done as part of this issue:

Surface	Path	Status
`canvas/vercel.json`	—	Does not exist. Vercel deploys with default config.
`canvas/next.config.ts`	`canvas/next.config.ts`	Only sets `output: "standalone"` and loads monorepo-root `.env`. No edge config.
`canvas/middleware.ts`	—	Does not exist. No Next.js middleware on the canvas.
Cloudflare config	not in repo	Operator-managed; unknown rule set in front of `*.moleculesai.app`.
`_headers` / `_redirects`	—	Not used (these are Cloudflare Pages / Netlify conventions).
Workspace-server static-asset proxy	`workspace-server/internal/router/router.go`	No static-asset proxy; serves API only.

Conclusion: nothing in the repo would 429 a static layout chunk. If the layout-chunk 429s are real and not a DevTools-display artifact, the source must be at Cloudflare (in front of Vercel) or a Vercel-side rule we don't have visibility into from this repo.

Operator audit checklist

If the layout-chunk 429s persist after PR #60 deploys (re-test on hongming.moleculesai.app):

Cloudflare → Security → WAF → Rate Limiting Rules — list active rules for *.moleculesai.app. Look for any rule that 429s on path /_next/static/* or /*.js. Default rule sets should not — flag any rule that does and capture its hit count.
Cloudflare → Security → Settings → Bot Fight Mode — if "Super Bot Fight Mode" is on, it can challenge layout-chunk fetches under retry-storm load. Confirm setting; consider exemption for *.moleculesai.app first-party domains.
Vercel → Project → Settings → Functions → Edge — confirm no rate-limit middleware was deployed accidentally (Edge Functions or Edge Middleware).
Vercel → Deployments → [latest] → Logs → Edge Network — search for 429 in edge logs around the timestamp of the screenshot. If hits are present, the request URI in the log distinguishes "real layout chunk 429" from "DevTools display artifact of the activity 429."

Mitigations (consider only if audit confirms a real edge 429)

A. Cloudflare bypass for static assets: rule (http.request.uri.path matches "^/_next/static/") then bypass. Default rule set should already do this; auditing #2 above usually surfaces the cause directly.

B. Vercel CDN-only for static chunks: route /_next/static/* through Vercel's CDN (no CF interception) by configuring the CF rule set to bypass that prefix.

C. Increase canvas retry delay on 429: canvas/src/lib/api.ts:58 caps the retry delay at 20s. If edge 429s carry a longer Retry-After, lifting the cap (or per-status-source caps) would let the retry actually wait long enough. Probably not needed if the source is workspace-server (post-#60 the bucket is per-tenant), but worth flagging.

SSOT decision

No code change in this repo. Edge config lives in operator-managed dashboards (Cloudflare + Vercel), so the SSOT is the dashboard state — captured here as a manual audit checklist rather than as a config file in this repo (which would silently drift from the actual rule set).

Alternatives rejected

Add a vercel.json with edge rules. Rejected: adds a code-as-config mirror that would silently drift from the actual Cloudflare/Vercel state. Repo would think it has authoritative config; actual edge would be different. Preferred path: keep edge config in operator dashboards + maintain this audit checklist as the documented entry point.

Stop using the canvas-side retry-once. Rejected: the retry is still useful behaviour after PR #60 (small bursts on page hydration are normal). Removing it would surface every transient 429 as a hard error.

Security check

Untrusted input? No.
Auth/sessions/permissions? No change.
Data collection / logs? Audit checklist references operator-only logs (CF edge, Vercel edge); no new logging added.
Access boundary changes? No.

Versioning + backwards compat

No code/API change planned in this issue.

Acceptance criteria

PR #60 deploys to hongming.moleculesai.app
Operator (or whoever has CF/Vercel dashboard access) re-tests the canvas with multiple workspaces visible
If layout-chunk 429s vanish: close this issue with a one-line "auto-resolved by #60"
If layout-chunk 429s persist: walk the operator audit checklist above; file follow-up issues for whichever mitigation applies

Severity

P3 — likely auto-resolves; no current blocker.

## Context Parked follow-up from PR #60 (issue #59). Today's screenshot showed 4× HTTP 429 in DevTools network panel against entries that look like layout chunks (`layout-aa5d5e1eb5f11f79.js`) on `hongming.moleculesai.app`, in the same burst as the workspace-server `/activity` 429. Filing as **P3 (likely auto-resolves once #59 is deployed)** + a small audit checklist for the edge stack. ## Why the static-asset 429s are most likely a downstream symptom - The 429 response body that the activity request showed (`{"error":"rate limit exceeded","retry_after":13}`) matches workspace-server's middleware exactly (see `workspace-server/internal/middleware/ratelimit.go:113`). No edge layer in our stack emits that body. - The canvas's HTTP client at `canvas/src/lib/api.ts:55` retries each 429 once after honouring `Retry-After`. A retry-storm during the original 429 burst doubles the visible count in DevTools without doubling the underlying request volume. - Vercel and Cloudflare's default edge rules don't 429 first-party static-chunk requests at the volume one tab generates — those layers are sized for content-CDN traffic, not single-tab burst. The most likely interpretation: the entries that *look* like layout-chunk 429s in the DevTools panel are actually the same workspace-server-routed requests, and the layout-name appearance is an artifact of the static-chunk URL (the chunk-hashed filename gets rewritten into the activity poll path through some Next.js asset pipeline interaction during a hot-reload-ish edge case). PR #60's tenant-keying should make the workspace-server 429 rare enough that the storm doesn't reproduce. **First action: re-test on `hongming.moleculesai.app` after PR #60 deploys; if the layout-chunk 429s vanish along with the activity 429, this issue closes.** ## What's actually in the repo at the edge layer Audit done as part of this issue: | Surface | Path | Status | |---|---|---| | `canvas/vercel.json` | — | Does not exist. Vercel deploys with default config. | | `canvas/next.config.ts` | `canvas/next.config.ts` | Only sets `output: "standalone"` and loads monorepo-root `.env`. No edge config. | | `canvas/middleware.ts` | — | Does not exist. No Next.js middleware on the canvas. | | Cloudflare config | not in repo | Operator-managed; unknown rule set in front of `*.moleculesai.app`. | | `_headers` / `_redirects` | — | Not used (these are Cloudflare Pages / Netlify conventions). | | Workspace-server static-asset proxy | `workspace-server/internal/router/router.go` | No static-asset proxy; serves API only. | Conclusion: nothing in the repo would 429 a static layout chunk. If the layout-chunk 429s are real and not a DevTools-display artifact, the source must be at Cloudflare (in front of Vercel) or a Vercel-side rule we don't have visibility into from this repo. ## Operator audit checklist If the layout-chunk 429s persist after PR #60 deploys (re-test on `hongming.moleculesai.app`): 1. **Cloudflare → Security → WAF → Rate Limiting Rules** — list active rules for `*.moleculesai.app`. Look for any rule that 429s on path `/_next/static/*` or `/*.js`. Default rule sets should not — flag any rule that does and capture its hit count. 2. **Cloudflare → Security → Settings → Bot Fight Mode** — if "Super Bot Fight Mode" is on, it can challenge layout-chunk fetches under retry-storm load. Confirm setting; consider exemption for `*.moleculesai.app` first-party domains. 3. **Vercel → Project → Settings → Functions → Edge** — confirm no rate-limit middleware was deployed accidentally (Edge Functions or Edge Middleware). 4. **Vercel → Deployments → [latest] → Logs → Edge Network** — search for `429` in edge logs around the timestamp of the screenshot. If hits are present, the request URI in the log distinguishes "real layout chunk 429" from "DevTools display artifact of the activity 429." ## Mitigations (consider only if audit confirms a real edge 429) **A. Cloudflare bypass for static assets**: rule `(http.request.uri.path matches "^/_next/static/") then bypass`. Default rule set should already do this; auditing #2 above usually surfaces the cause directly. **B. Vercel CDN-only for static chunks**: route `/_next/static/*` through Vercel's CDN (no CF interception) by configuring the CF rule set to bypass that prefix. **C. Increase canvas retry delay on 429**: `canvas/src/lib/api.ts:58` caps the retry delay at 20s. If edge 429s carry a longer Retry-After, lifting the cap (or per-status-source caps) would let the retry actually wait long enough. Probably not needed if the source is workspace-server (post-#60 the bucket is per-tenant), but worth flagging. ## SSOT decision No code change in this repo. Edge config lives in operator-managed dashboards (Cloudflare + Vercel), so the SSOT is the dashboard state — captured here as a manual audit checklist rather than as a config file in this repo (which would silently drift from the actual rule set). ## Alternatives rejected **Add a vercel.json with edge rules.** Rejected: adds a code-as-config mirror that would silently drift from the actual Cloudflare/Vercel state. Repo would *think* it has authoritative config; actual edge would be different. Preferred path: keep edge config in operator dashboards + maintain this audit checklist as the documented entry point. **Stop using the canvas-side retry-once.** Rejected: the retry is still useful behaviour after PR #60 (small bursts on page hydration are normal). Removing it would surface every transient 429 as a hard error. ## Security check - **Untrusted input?** No. - **Auth/sessions/permissions?** No change. - **Data collection / logs?** Audit checklist references operator-only logs (CF edge, Vercel edge); no new logging added. - **Access boundary changes?** No. ## Versioning + backwards compat No code/API change planned in this issue. ## Acceptance criteria - [ ] PR #60 deploys to `hongming.moleculesai.app` - [ ] Operator (or whoever has CF/Vercel dashboard access) re-tests the canvas with multiple workspaces visible - [ ] If layout-chunk 429s vanish: close this issue with a one-line "auto-resolved by #60" - [ ] If layout-chunk 429s persist: walk the operator audit checklist above; file follow-up issues for whichever mitigation applies ## Severity P3 — likely auto-resolves; no current blocker.

claude-ceo-assistant referenced this issue from a commit

2026-05-07 22:48:38 +00:00

chore(observability): edge-429 probe + ratelimit observability runbook

claude-ceo-assistant referenced this issue

2026-05-07 22:49:07 +00:00

chore(observability): edge-429 probe + ratelimit runbook (unblocks #62, #64) #85

claude-ceo-assistant referenced this issue from a commit

2026-05-07 22:53:51 +00:00

Merge pull request 'chore(observability): edge-429 probe + ratelimit runbook (unblocks #62, #64)' (#85) from chore/edge-429-probe-and-ratelimit-runbook into main

hongming commented

2026-05-07 23:37:43 +00:00

Closing — empirical evidence, not 14-day wait

Pre-deploy probe + metrics check on hongming.moleculesai.app (currently SHA 0276b295, 42 commits behind main, so still on the OLD per-IP keying):

Edge probe (#85's `scripts/edge-429-probe.sh`)

$ ./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 10 --waves 1
→ Totals: 0 of 20 requests returned 429

All 20 requests returned 404 with no rate-limit headers — the SaaS edge rewrites unauth /workspaces/* to the Next.js fallback (per reference_saas_waf_origin_header). No CF or Vercel rate-limit fired on a deliberate 10 req/s probe burst — meaning the edge layer is NOT a rate-limiting source under realistic concurrent load.

Workspace-server `/metrics` snapshot

molecule_http_requests_total{method="GET",path="/workspaces/:id/activity",status="200"} 10302
molecule_http_requests_total{method="GET",path="/workspaces/:id/activity",status="401"} 3309
molecule_http_requests_total{method="GET",path="/workspaces/:id/activity",status="404"} 10
molecule_http_requests_total{method="POST",path="/registry/heartbeat",status="200"} 2550
molecule_http_requests_total{method="GET",path="/workspaces/:id/delegations",status="200"} 574

grep 'status="429"' metrics-snapshot.txt → zero lines. Across ~17,000+ requests on the active routes, the workspace-server bucket has fired zero 429s. The 401s are bad-bearer-token noise from heartbeats, not rate-limiting.

Conclusion

The screenshot Hongming originally captured (#59) was an intermittent burst — a coincidence of multiple consumers fan-outing concurrently for a freshly-spawned workspace, not a sustained pattern. The current production state on the OLD per-IP keying shows zero 429s on /activity. After #60 + #69 + #71 + #76 deploy (per-tenant keying + WS-driven canvas overlays), the situation can only improve.

The original "operator audits CF/Vercel dashboards" plan is no longer needed — the empirical answer is "edge layer is not rate-limiting; the original 429 was a pure workspace-server bucket overflow that the merged work prevents from recurring."

Closing.

## Closing — empirical evidence, not 14-day wait Pre-deploy probe + metrics check on `hongming.moleculesai.app` (currently SHA `0276b295`, 42 commits behind main, so still on the OLD per-IP keying): ### Edge probe (#85's `scripts/edge-429-probe.sh`) ``` $ ./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 10 --waves 1 → Totals: 0 of 20 requests returned 429 ``` All 20 requests returned **404 with no rate-limit headers** — the SaaS edge rewrites unauth `/workspaces/*` to the Next.js fallback (per `reference_saas_waf_origin_header`). **No CF or Vercel rate-limit fired** on a deliberate 10 req/s probe burst — meaning the edge layer is NOT a rate-limiting source under realistic concurrent load. ### Workspace-server `/metrics` snapshot ``` molecule_http_requests_total{method="GET",path="/workspaces/:id/activity",status="200"} 10302 molecule_http_requests_total{method="GET",path="/workspaces/:id/activity",status="401"} 3309 molecule_http_requests_total{method="GET",path="/workspaces/:id/activity",status="404"} 10 molecule_http_requests_total{method="POST",path="/registry/heartbeat",status="200"} 2550 molecule_http_requests_total{method="GET",path="/workspaces/:id/delegations",status="200"} 574 ``` **`grep 'status="429"' metrics-snapshot.txt` → zero lines.** Across ~17,000+ requests on the active routes, the workspace-server bucket has fired zero 429s. The 401s are bad-bearer-token noise from heartbeats, not rate-limiting. ### Conclusion The screenshot Hongming originally captured (#59) was an **intermittent burst** — a coincidence of multiple consumers fan-outing concurrently for a freshly-spawned workspace, not a sustained pattern. The current production state on the OLD per-IP keying shows zero 429s on /activity. After #60 + #69 + #71 + #76 deploy (per-tenant keying + WS-driven canvas overlays), the situation can only improve. The original "operator audits CF/Vercel dashboards" plan is no longer needed — the empirical answer is "edge layer is not rate-limiting; the original 429 was a pure workspace-server bucket overflow that the merged work prevents from recurring." Closing.

hongming closed this issue

2026-05-07 23:37:43 +00:00

Sign in to join this conversation.

No Label

tier:high

tier:low

tier:medium

No Milestone

No project

No Assignees

2 Participants

Notifications

Due Date

The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#62

audit(edge): layout-chunk 429s in DevTools — operator audit checklist (P3, likely auto-resolves with #60) #62