diff --git a/docs/engineering/ratelimit-observability.md b/docs/engineering/ratelimit-observability.md
new file mode 100644
index 00000000..9e886137
--- /dev/null
+++ b/docs/engineering/ratelimit-observability.md
@@ -0,0 +1,147 @@
+# Rate-limit observability runbook
+
+> Companion to issue #64 ("RATE_LIMIT default re-tune analysis"). After
+> #60 deployed the per-tenant `keyFor` keying, the right RATE_LIMIT
+> default became data-dependent. This runbook documents the metrics
+> queries an operator should run to confirm whether the current 600
+> req/min/key default is correct, too tight, or too loose.
+
+## What's already exposed
+
+The workspace-server's existing Prometheus middleware
+(`workspace-server/internal/metrics/metrics.go`) tracks every request
+on every path:
+
+```
+molecule_http_requests_total{method, path, status}                counter
+molecule_http_request_duration_seconds_total{method, path, status} counter
+```
+
+`path` is the matched route pattern (`/workspaces/:id/activity`, etc.), so
+high-cardinality workspace UUIDs do not explode the label space.
+
+The rate limiter middleware (#60, `workspace-server/internal/middleware/ratelimit.go`)
+also stamps every response with `X-RateLimit-Limit`, `X-RateLimit-Remaining`,
+and `X-RateLimit-Reset`. Operators with browser-side or proxy-side
+header capture can read per-request bucket state directly.
+
+No new instrumentation is needed for #64's acceptance criteria. The
+metric surface is sufficient — this runbook just collects the queries.
+
+## Queries to run after #60 deploys
+
+### 1. Is the bucket actually firing 429s?
+
+```promql
+sum by (instance) (rate(molecule_http_requests_total{status="429"}[5m]))
+```
+
+If this is zero for a tenant's instance, the bucket isn't being hit
+there. If it's sustained above 1/min, dig in.
+
+### 2. Which routes attract 429s?
+
+```promql
+topk(
+  10,
+  sum by (path) (
+    rate(molecule_http_requests_total{status="429"}[5m])
+  )
+)
+```
+
+Expected shape post-#60:
+
+- `/workspaces/:id/activity` should be near zero — the canvas no longer
+  polls it on a 30s/60s/5s cadence (PRs #69 / #71 / #76).
+- Probe / health / heartbeat paths should be ~0 (those routes have a
+  separate IP-fallback bucket).
+
+If `/workspaces/:id/activity` 429s persist after PRs #69/#71/#76 have
+deployed, the canvas isn't running the WS-subscriber path — investigate
+WS health on that tenant.
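+
+A raw 429 rate can mislead on a route that also carries heavy
+legitimate traffic, so it is worth pairing query 2 with the *fraction*
+of each route's requests that get 429'd. A sketch against the same
+metric and labels as above (nothing new is assumed beyond them):
+
+```promql
+sum by (path) (rate(molecule_http_requests_total{status="429"}[5m]))
+  /
+sum by (path) (rate(molecule_http_requests_total[5m]))
+```
+
+A route emitting 0.5 req/s of 429s while serving 200 req/s is a very
+different problem from one where half of all requests bounce.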
+
+### 3. Per-bucket-key inference (no direct exposure today)
+
+The bucket map itself is in-memory only; we deliberately do **not**
+expose the key ↔ remaining-tokens map, because keys can include
+SHA-256 hashes of bearer tokens. A tenant that wants per-key visibility
+should rely on response headers (`X-RateLimit-Remaining` on every
+response from a given session is the bucket's view of that session).
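+
+Concretely, a single request against any route is enough to read the
+bucket's view of the session it keys to. A sketch, where `TENANT_HOST`
+and `WS_ID` are placeholders for a real tenant subdomain and workspace
+id:
+
+```sh
+# -D - dumps the received response headers to stdout; the body is discarded.
+curl -sS -o /dev/null -D - \
+  "https://${TENANT_HOST}/workspaces/${WS_ID}/activity" \
+  | grep -i '^x-ratelimit-'
+```
+
+Two runs in quick succession should show `X-RateLimit-Remaining` drop
+by one if both requests key to the same bucket.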
+
+If you genuinely need server-side per-bucket counts for triage,
+file a follow-up — the proper shape is a `/internal/ratelimit-stats`
+endpoint that emits **counts per key prefix only** (e.g. `org:`, `tok:`,
+`ip:`), never the key payloads. Don't roll that ad-hoc; it's a security
+review surface.
+
+## Decision tree for the re-tune
+
+After 14 days of production traffic on a tenant, look at the queries
+above and walk this tree:
+
+```
+Q1: Is the 429 rate sustained > 0.1/sec on any tenant?
+ ├─ NO  → The 600 default has comfortable headroom. Either keep it,
+ │        or lower it carefully (300) ONLY if you have a documented
+ │        reason (e.g. a misbehaving client we want to throttle harder).
+ │        Default to "no change" — see #64 for the math.
+ └─ YES → Q2.
+
+Q2: Is the 429 rate concentrated on ONE tenant or spread across many?
+ ├─ ONE tenant   → Operator override: set RATE_LIMIT=1200 or 1800 on that
+ │                 tenant's box. Document it in the tenant's ops note. The
+ │                 default does not need to change.
+ └─ MANY tenants → Q3.
+
+Q3: Are the 429s on a route that polls (e.g. /activity, /peers)?
+ ├─ YES → Confirm PRs #69, #71, #76 have actually deployed to those
+ │        tenants. If they have and 429s persist, the canvas may have
+ │        a regression — do not raise RATE_LIMIT. File a canvas issue.
+ └─ NO  → 429s on mutating routes mean genuine load. Raise the default
+          to 1200 in `workspace-server/internal/router/router.go:54`.
+          The same PR should attach: the metric chart, the time window,
+          and a paragraph explaining what changed in our traffic shape.
+```
+
+## Alert rule template (drop-in for Prometheus)
+
+```yaml
+# Sustained 429s — this alert is the SLO trip-wire. If it fires, walk
+# the decision tree above. NB: issue #64's acceptance criterion is "two
+# weeks of metrics"; this alert is the inverse — it tells you something
+# changed before the two weeks are up.
+groups:
+  - name: workspace-server-ratelimit
+    rules:
+      - alert: WorkspaceServerRateLimit429Sustained
+        expr: |
+          sum by (instance) (
+            rate(molecule_http_requests_total{status="429"}[10m])
+          ) > 0.1
+        for: 30m
+        labels:
+          severity: warning
+          owner: workspace-server
+        annotations:
+          summary: "{{ $labels.instance }} sustained 429s — see ratelimit-observability runbook"
+          runbook: "https://git.moleculesai.app/molecule-ai/molecule-core/blob/main/docs/engineering/ratelimit-observability.md"
+```
+
+Threshold rationale: 0.1 req/s = 6/min sustained over the 10m window.
+Below that, a 429 is almost certainly a transient burst that the
+canvas's retry-once handler at `canvas/src/lib/api.ts:55` already
+absorbs. The 30m `for:` keeps the alert from chattering on a brief
+blip.
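+
+Before committing the rule, it is worth a syntax pass with `promtool`
+(it ships with Prometheus; the filename here is simply whatever you
+save the drop-in as):
+
+```sh
+promtool check rules workspace-server-ratelimit.yml
+```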
+
+## Companion probe script
+
+For one-off triage when an operator can reproduce the problem in their
+own browser, `scripts/edge-429-probe.sh` (#62) reproduces a canvas-
+sized burst against a tenant subdomain and dumps each 429's response
+shape so the operator can distinguish workspace-server bucket overflow
+from CF/Vercel edge rate-limiting without dashboard access.
+
+```sh
+./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
+```
+
+The script's report header explains how to read the output.
diff --git a/scripts/edge-429-probe.sh b/scripts/edge-429-probe.sh
new file mode 100755
index 00000000..a7db80c2
--- /dev/null
+++ b/scripts/edge-429-probe.sh
@@ -0,0 +1,155 @@
+#!/usr/bin/env bash
+# edge-429-probe.sh — capture 429 origin (workspace-server vs CF/Vercel edge)
+# during a simulated canvas-burst against a tenant subdomain.
+#
+# Issue molecule-core#62. The post-#60 verification step asks an
+# operator with CF/Vercel dashboard access to confirm whether the
+# layout-chunk 429s observed in DevTools were:
+#   (a) workspace-server bucket overflow (closes once #60 deploys), or
+#   (b) actual edge-layer rate-limiting (CF or Vercel).
+#
+# This script doesn't need dashboard access. It reproduces the burst
+# pattern locally and dumps every 429's response shape so the operator
+# can distinguish (a) from (b) by inspection: workspace-server emits a
+# JSON body, CF emits HTML, and Vercel emits a different HTML. Headers
+# tell the same story (cf-ray vs x-vercel-*).
+#
+# Usage:
+#   ./scripts/edge-429-probe.sh <host> [--burst N] [--waves N] [--pause SECS] [--out FILE]
+#
+# Example:
+#   ./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
+#
+# Requires curl >= 7.84.0 for the %header{name} --write-out variables.
+#
+# The script is read-only against the target — it only issues GETs to
+# public-by-design endpoints. No mutating requests, no credential use.
+
+set -euo pipefail
+
+# ── Help / usage handling first, before positional capture ──────────────────
+case "${1:-}" in
+  -h|--help|"")
+    sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
+    exit 0
+    ;;
+esac
+
+HOST="$1"; shift
+BURST=80
+WAVES=3
+WAVE_PAUSE=2
+OUT=""
+
+while [ "${1:-}" != "" ]; do
+  case "$1" in
+    --burst) BURST="$2"; shift 2 ;;
+    --waves) WAVES="$2"; shift 2 ;;
+    --pause) WAVE_PAUSE="$2"; shift 2 ;;
+    --out)   OUT="$2"; shift 2 ;;
+    -h|--help)
+      sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
+      exit 0
+      ;;
+    *) echo "unknown arg: $1" >&2; exit 2 ;;
+  esac
+done
+
+# ── Endpoint discovery ───────────────────────────────────────────────────────
+echo "→ Discovering a layout-chunk URL from canvas root..." >&2
+ROOT_BODY=$(curl -fsSL --max-time 10 "https://${HOST}/" 2>/dev/null || true)
+LAYOUT_PATH=$(echo "$ROOT_BODY" \
+  | grep -oE '/_next/static/chunks/layout-[A-Za-z0-9_-]+\.js' \
+  | head -1 || true)
+if [ -z "$LAYOUT_PATH" ]; then
+  LAYOUT_PATH="/_next/static/chunks/layout-probe-not-found.js"
+  echo "  (no layout chunk discovered — using sentinel path; 404 on this is expected)" >&2
+else
+  echo "  layout chunk: $LAYOUT_PATH" >&2
+fi
+
+# Probe URL: a generic activity endpoint. The rate-limiter middleware
+# runs BEFORE workspace-id validation, so unauth/invalid-id requests
+# still hit the bucket.
+ACTIVITY_PATH="/workspaces/00000000-0000-0000-0000-000000000000/activity?probe=edge-429"
+
+# ── Fire one burst: one single-line key=value record per request ─────────────
+# Each request runs as a backgrounded curl reaped with `wait`. A plain
+# shell function is enough here because nothing is handed to xargs, so
+# the function-export pitfalls (some shells lose `export -f` across
+# xargs) never come up. Each output line is a parseable record; failed
+# curls emit a curl_err record so request volume is preserved.
+TMP_RESULTS="$(mktemp -t edge-429-probe.XXXXXX)"
+trap 'rm -f "$TMP_RESULTS"' EXIT
+
+run_burst() {
+  # $1 = path; $2 = label; $3 = wave_id
+  local path="$1" label="$2" wave="$3"
+  local i
+  for i in $(seq 1 "$BURST"); do
+    {
+      # %header{name} needs curl >= 7.84.0; older curls print it literally.
+      out=$(curl -sS --max-time 10 -o /dev/null \
+        -w 'status=%{http_code} size=%{size_download} time=%{time_total} server=%header{server} cf_ray=%header{cf-ray} x_vercel=%header{x-vercel-id} retry_after=%header{retry-after} content_type=%header{content-type} x_ratelimit_limit=%header{x-ratelimit-limit} x_ratelimit_remaining=%header{x-ratelimit-remaining} x_ratelimit_reset=%header{x-ratelimit-reset}\n' \
+        "https://${HOST}${path}" 2>/dev/null) || out="status=curl_err"
+      printf 'label=%s-%s-%s %s\n' "$label" "$wave" "$i" "$out" >> "$TMP_RESULTS"
+    } &
+  done
+  wait
+}
+
+emit() {
+  if [ -n "$OUT" ]; then
+    printf '%s\n' "$*" >> "$OUT"
+  else
+    printf '%s\n' "$*"
+  fi
+}
+
+if [ -n "$OUT" ]; then : > "$OUT"; fi
+
+emit "# edge-429-probe report"
+emit "# host=$HOST burst=$BURST waves=$WAVES pause=${WAVE_PAUSE}s"
+emit "# layout_path=$LAYOUT_PATH"
+emit "# activity_path=$ACTIVITY_PATH"
+emit "# generated=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+emit ""
+
+for wave in $(seq 1 "$WAVES"); do
+  emit "## wave $wave"
+  : > "$TMP_RESULTS"
+  run_burst "$LAYOUT_PATH" "layout" "$wave"
+  run_burst "$ACTIVITY_PATH" "activity" "$wave"
+  while read -r line; do
+    emit "  $line"
+  done < "$TMP_RESULTS"
+  if [ "$wave" -lt "$WAVES" ]; then
+    sleep "$WAVE_PAUSE"
+  fi
+done
+
+emit ""
+emit "## summary — how to read the report"
+emit "# status=429 + content_type starts with application/json + x_ratelimit_limit set"
+emit "#   => workspace-server bucket overflow. Closes when #60 deploys."
+emit "# status=429 + cf_ray set + content_type=text/html"
+emit "#   => Cloudflare WAF / rate-limit / Bot Fight Mode. Audit dashboard rules per #62."
+emit "# status=429 + x_vercel set + content_type=text/html"
+emit "#   => Vercel edge rate-limiting / firewall. Audit Vercel project per #62."
+emit "# status=429 with no server/cf_ray/x_vercel"
+emit "#   => corporate proxy or VPN. Not actionable in this repo."
+
+if [ -n "$OUT" ]; then
+  echo "→ Report written to $OUT" >&2
+  # Match only data lines (two-space indent + "label="), not the
+  # summary's reference text, which also mentions "status=429".
+  # grep -c prints "0" but exits 1 when nothing matches; `|| true`
+  # masks the exit status so set -e doesn't trip, without losing the
+  # count.
+  total=$(grep -c '^  label=' "$OUT" 2>/dev/null || true)
+  total429=$(grep -c '^  label=.*status=429' "$OUT" 2>/dev/null || true)
+  total=${total:-0}
+  total429=${total429:-0}
+  echo "→ Totals: ${total429} of ${total} requests returned 429" >&2
+  if [ "${total429}" -gt 0 ]; then
+    echo "→ Per-label 429 counts:" >&2
+    grep '^  label=.*status=429' "$OUT" \
+      | sed -E 's/^  label=([^-]+).*/  \1/' \
+      | sort | uniq -c >&2
+  fi
+fi