All checks were successful
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 28s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 35s
branch-protection drift check / Branch protection drift (pull_request) Successful in 36s
CI / Detect changes (pull_request) Successful in 21s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 8s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 17s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 20s
Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s
Harness Replays / detect-changes (pull_request) Successful in 23s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 24s
CI / Platform (Go) (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
CI / Python Lint & Test (pull_request) Successful in 17s
CI / Canvas (Next.js) (pull_request) Successful in 24s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 14s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Shellcheck (E2E scripts) (pull_request) Successful in 29s
Harness Replays / Harness Replays (pull_request) Successful in 9s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 16s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 10s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 13s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m1s
Two artifacts that unblock the parked follow-ups from #59:

1. `scripts/edge-429-probe.sh` (closes the "operator-blocked" status of #62). An operator without CF/Vercel dashboard access can reproduce a canvas-sized burst against a tenant subdomain and read each 429's response shape — workspace-server bucket overflow (JSON body + `X-RateLimit-*` headers) is distinguishable from CF (`cf-ray`) and Vercel (`x-vercel-id`) by inspecting the report. Read-only, parallel via background subshells (no GNU parallel dependency), no credential use. Smoke-tested against example.com end-to-end.
2. `docs/engineering/ratelimit-observability.md` (closes the "metric-blocked" status of #64). The existing `molecule_http_requests_total{path,status}` counter + `X-RateLimit-*` response headers already cover #64's acceptance criterion ("watch metrics for two weeks"). The runbook collects the PromQL queries, a decision tree for the re-tune (keep / per-tenant override / change default), an alert rule template, and a hard "do not roll ad-hoc per-bucket-key exposure" note (the in-memory map includes SHA-256 hashes of bearer tokens — exposing it is a security review surface; file a follow-up if needed).

Neither artifact changes runtime behaviour. Pure operational tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
148 lines
5.7 KiB
Markdown
# Rate-limit observability runbook

> Companion to issue #64 ("RATE_LIMIT default re-tune analysis"). After
> #60 deployed the per-tenant `keyFor` keying, the right RATE_LIMIT
> default became data-dependent. This runbook documents the metrics +
> queries an operator should run to confirm whether the current 600
> req/min/key default is correct, too tight, or too loose.

## What's already exposed

The workspace-server's existing Prometheus middleware
(`workspace-server/internal/metrics/metrics.go`) tracks every request
on every path:

```
molecule_http_requests_total{method, path, status} counter
molecule_http_request_duration_seconds_total{method, path, status} counter
```

Path is the matched route pattern (`/workspaces/:id/activity`, etc.), so
high-cardinality workspace UUIDs do not explode the label space.

The rate limiter middleware (#60, `workspace-server/internal/middleware/ratelimit.go`)
also stamps every response with `X-RateLimit-Limit`, `X-RateLimit-Remaining`,
and `X-RateLimit-Reset`. Operators with browser-side or proxy-side
header capture can read per-request bucket state directly.

No new instrumentation is needed for #64's acceptance criteria. The
metric surface is sufficient — this runbook just collects the queries.

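Because every response carries the bucket headers, a quick look needs no tooling beyond `curl` and `grep`. A minimal sketch, assuming you capture headers with `curl -sD - -o /dev/null <url>`; `ratelimit_headers` is a hypothetical helper, not something the repo ships:

```sh
# ratelimit_headers: keep only the X-RateLimit-* lines of an HTTP header
# dump (case-insensitive; headers arrive lower-cased over HTTP/2).
ratelimit_headers() {
  grep -i '^x-ratelimit-'
}

# Example against a canned dump. In practice, pipe
#   curl -sD - -o /dev/null "https://<tenant>/workspaces/<id>"
# into the function instead (tenant URL is deployment-specific).
printf 'HTTP/2 200\nx-ratelimit-limit: 600\nx-ratelimit-remaining: 597\nx-ratelimit-reset: 12\n' \
  | ratelimit_headers
```

Watching `x-ratelimit-remaining` across a session's responses is the bucket's own view of that session.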
## Queries to run after #60 deploys

### 1. Is the bucket actually firing 429s?

```promql
sum(rate(molecule_http_requests_total{status="429"}[5m]))
```

If this is zero on a given tenant, the bucket isn't being hit. If it's
sustained > 1/min, dig in.

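The queries in this runbook can be run without a dashboard via the Prometheus HTTP API (`GET /api/v1/query`). A hedged sketch; `prom_instant_query` and the `http://prometheus:9090` base URL are assumptions, not something the repo ships:

```sh
# prom_instant_query: run one instant PromQL query against the
# Prometheus HTTP API. Usage: prom_instant_query <base-url> <promql>
prom_instant_query() {
  curl -sG "$1/api/v1/query" --data-urlencode "query=$2"
}

# e.g. (base URL is deployment-specific):
# prom_instant_query http://prometheus:9090 \
#   'sum(rate(molecule_http_requests_total{status="429"}[5m]))'
```

The response is JSON; the value sits at `.data.result[].value[1]` if you want to pipe it through `jq`.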
### 2. Which routes attract 429s?

```promql
topk(
  10,
  sum by (path) (
    rate(molecule_http_requests_total{status="429"}[5m])
  )
)
```

Expected shape post-#60:

- `/workspaces/:id/activity` should be near zero — the canvas no longer
  polls it on a 30s/60s/5s cadence (PRs #69 / #71 / #76).
- Probe / health / heartbeat paths should be ~0 (those routes have a
  separate IP-fallback bucket).

If `/workspaces/:id/activity` 429s persist after PRs #69/#71/#76 deploy, the
canvas isn't running the WS-subscriber path — investigate WS health
on that tenant.

### 3. Per-bucket-key inference (no direct exposure today)

The bucket map itself is in-memory only; we deliberately do **not**
expose `org:<uuid>` ↔ remaining-tokens because that map can include
SHA-256 hashes of bearer tokens. A tenant that wants per-key visibility
should rely on response headers (`X-RateLimit-Remaining` on every
response from a given session is the bucket's view of that session).

If you genuinely need server-side per-bucket counts for triage,
file a follow-up — the proper shape is a `/internal/ratelimit-stats`
endpoint that emits **counts per key prefix only** (e.g. `org:`, `tok:`,
`ip:`), never the key payloads. Don't roll that ad-hoc; it's a security
review surface.

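To make "counts per key prefix only" concrete, here is a minimal sketch of the aggregation such an endpoint would perform. `count_key_prefixes` is illustrative only (a real implementation would live server-side), and the keys below are fabricated examples, never real payloads:

```sh
# count_key_prefixes: read one bucket key per line (e.g. "org:<uuid>",
# "tok:<sha256>", "ip:<addr>") and emit only per-prefix counts.
# The key payloads themselves never appear in the output.
count_key_prefixes() {
  awk -F: '{ n[$1]++ } END { for (p in n) printf "%s: %d\n", p, n[p] }' | sort
}

# Fabricated example keys:
printf 'org:1111\norg:2222\ntok:deadbeef\nip:10.0.0.9\n' | count_key_prefixes
# -> ip: 1
#    org: 2
#    tok: 1
```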
## Decision tree for the re-tune

After 14 days of production traffic on a tenant, look at the queries
above and walk this tree:

```
Q1: Is the 429 rate sustained > 0.1/sec on any tenant?
├─ NO  → The 600 default has comfortable headroom. Either keep it,
│        or lower it carefully (300) ONLY if you have a documented
│        reason (e.g. a misbehaving client we want to throttle harder).
│        Default to "no change" — see #64 for the math.
└─ YES → Q2.

Q2: Is the 429 rate concentrated on ONE tenant or spread across many?
├─ ONE tenant   → Operator override: set RATE_LIMIT=1200 or 1800 on that
│                 tenant's box. Document it in the tenant's ops note. The
│                 default does not need to change.
└─ MANY tenants → Q3.

Q3: Are the 429s on a route that polls (e.g. /activity or /peers)?
├─ YES → Confirm PRs #69, #71, #76 have actually deployed to those
│        tenants. If they have and 429s persist, the canvas may have
│        a regression — do not raise RATE_LIMIT. File a canvas issue.
└─ NO  → 429s on mutating routes mean genuine load. Raise the default
         to 1200 in `workspace-server/internal/router/router.go:54`.
         The same PR should attach: the metric chart, the time window,
         and a paragraph explaining what changed in our traffic shape.
```

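Q2's one-vs-many split can be read straight from the counter, assuming one workspace-server instance per tenant (the same assumption the alert rule's `sum by (instance)` makes):

```promql
sum by (instance) (
  rate(molecule_http_requests_total{status="429"}[1h])
)
```

One dominant `instance` series points at the per-tenant-override branch; a broad spread sends you to Q3.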
## Alert rule template (drop-in for Prometheus)

```yaml
# Sustained 429s — this is the SLO trip-wire. If it fires, walk the
# decision tree above. NB: the issue #64 acceptance criterion is "two
# weeks of metrics"; this alert is the inverse — it tells you something
# changed before the two weeks are up.
groups:
  - name: workspace-server-ratelimit
    rules:
      - alert: WorkspaceServerRateLimit429Sustained
        expr: |
          sum by (instance) (
            rate(molecule_http_requests_total{status="429"}[10m])
          ) > 0.1
        for: 30m
        labels:
          severity: warning
          owner: workspace-server
        annotations:
          summary: "{{ $labels.instance }} sustained 429s — see ratelimit-observability runbook"
          runbook: "https://git.moleculesai.app/molecule-ai/molecule-core/blob/main/docs/engineering/ratelimit-observability.md"
```

Threshold rationale: 0.1 req/s = 6/min sustained over 10 min. Below
that, a 429 is almost certainly a transient burst that the canvas's
retry-once handler at `canvas/src/lib/api.ts:55` already absorbs. The
30m `for:` keeps the alert from chattering on a brief blip.

## Companion probe script

For one-off triage when an operator can reproduce the problem in their
own browser, `scripts/edge-429-probe.sh` (#62) reproduces a canvas-sized
burst against a tenant subdomain and dumps each 429's response
shape so the operator can distinguish workspace-server bucket overflow
from CF/Vercel edge rate-limiting without dashboard access.

```sh
./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
```

The script's report header explains how to read the output.