Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest
Some checks failed
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 5s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 6s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 6s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 12s
CI / Detect changes (pull_request) Successful in 16s
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 8s
branch-protection drift check / Branch protection drift (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 15s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 14s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 18s
CI / Platform (Go) (pull_request) Successful in 10s
CI / Canvas (Next.js) (pull_request) Successful in 11s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 11s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 12s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped

claude-ceo-assistant 2026-05-07 22:54:31 +00:00
commit ca644134f2
2 changed files with 302 additions and 0 deletions

docs/engineering/ratelimit-observability.md (new file, 147 lines)

@@ -0,0 +1,147 @@
# Rate-limit observability runbook
> Companion to issue #64 ("RATE_LIMIT default re-tune analysis"). After
> #60 deployed the per-tenant `keyFor` keying, the right RATE_LIMIT
> default became data-dependent. This runbook documents the metrics +
> queries an operator should run to confirm whether the current 600
> req/min/key default is correct, too tight, or too loose.
## What's already exposed
The workspace-server's existing Prometheus middleware
(`workspace-server/internal/metrics/metrics.go`) tracks every request
on every path:
```
molecule_http_requests_total{method, path, status} counter
molecule_http_request_duration_seconds_total{method,path,status} counter
```
The `path` label is the matched route pattern (`/workspaces/:id/activity`,
etc.), so high-cardinality workspace UUIDs do not explode the label space.
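For orientation, here is a minimal sketch of how a counter with this label
shape is typically wired. The handler, port, and route-pattern plumbing
below are placeholders, not the real `metrics.go`:
```go
// Hypothetical sketch, not the actual workspace-server middleware:
// a request counter labeled with the *registered* route pattern rather
// than the raw URL, so workspace UUIDs never become label values.
package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "molecule_http_requests_total",
		Help: "Requests by method, matched route pattern, and status.",
	},
	[]string{"method", "path", "status"},
)

// statusRecorder captures the status code the wrapped handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.status = code
	s.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler with a fixed pattern string; the label set
// stays small no matter which :id values requests carry.
func instrument(pattern string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		httpRequests.WithLabelValues(r.Method, pattern, strconv.Itoa(rec.status)).Inc()
	})
}

func main() {
	prometheus.MustRegister(httpRequests)
	http.Handle("/metrics", promhttp.Handler())
	http.Handle("/workspaces/", instrument("/workspaces/:id/activity",
		http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
			w.WriteHeader(http.StatusOK)
		})))
	_ = http.ListenAndServe(":8080", nil)
}
```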
The rate limiter middleware (#60, `workspace-server/internal/middleware/ratelimit.go`)
also stamps every response with `X-RateLimit-Limit`, `X-RateLimit-Remaining`,
and `X-RateLimit-Reset`. Operators with browser-side or proxy-side
header capture can read per-request bucket state directly.
No new instrumentation is needed for #64's acceptance criteria. The
metric surface is sufficient — this runbook just collects the queries.
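If you need a quick spot check of those headers without browser or proxy
capture, a hypothetical one-shot client like the following works. The
tenant host comes from the command line; the path is simply any route
behind the limiter:
```go
// Hypothetical spot check: issue one GET against a tenant host and print
// the bucket headers the rate limiter stamps on every response.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: ratelimit-headers <tenant-host>")
	}
	url := "https://" + os.Args[1] +
		"/workspaces/00000000-0000-0000-0000-000000000000/activity"
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
	for _, h := range []string{"X-RateLimit-Limit", "X-RateLimit-Remaining", "X-RateLimit-Reset"} {
		fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
	}
}
```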
## Queries to run after #60 deploys
### 1. Is the bucket actually firing 429s?
```promql
sum(rate(molecule_http_requests_total{status="429"}[5m]))
```
If this is zero on a given tenant, the bucket isn't being hit. If it's
sustained > 1/min, dig in.
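Operators who would rather script this check than open the Prometheus UI
can run the same expression through the standard `/api/v1/query` HTTP API.
A hypothetical helper (the Prometheus base URL is a placeholder for
whatever the tenant box scrapes into):
```go
// Hypothetical helper: evaluate the 429-rate query via the Prometheus
// HTTP API and print the current value.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	promURL := "http://localhost:9090" // placeholder
	query := `sum(rate(molecule_http_requests_total{status="429"}[5m]))`

	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []struct {
				Value [2]interface{} `json:"value"` // [unix_ts, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatal(err)
	}
	if len(body.Data.Result) == 0 {
		fmt.Println("429 rate: 0 (no samples)")
		return
	}
	fmt.Printf("429 rate: %v req/s\n", body.Data.Result[0].Value[1])
}
```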
### 2. Which routes attract 429s?
```promql
topk(
  10,
  sum by (path) (
    rate(molecule_http_requests_total{status="429"}[5m])
  )
)
```
Expected shape post-#60:
- `/workspaces/:id/activity` should be near zero — the canvas no longer
polls it on a 30s/60s/5s cadence (PRs #69 / #71 / #76).
- Probe / health / heartbeat paths should be ~0 (those routes have a
separate IP-fallback bucket).
If `/workspaces/:id/activity` 429s persist post-PRs-69/71/76 deploy, the
canvas isn't running the WS-subscriber path — investigate WS health
on that tenant.
### 3. Per-bucket-key inference (no direct exposure today)
The bucket map itself is in-memory only; we deliberately do **not**
expose `org:<uuid>` ↔ remaining-tokens because that map can include
SHA-256 hashes of bearer tokens. A tenant that wants per-key visibility
should rely on response headers (`X-RateLimit-Remaining` on every
response from a given session is the bucket's view of that session).
If you genuinely need server-side per-bucket counts for triage,
file a follow-up — the proper shape is a `/internal/ratelimit-stats`
endpoint that emits **counts per key prefix only** (e.g. `org:`, `tok:`,
`ip:`), never the key payloads. Don't roll that ad-hoc; it's a security
review surface.
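For reference, the aggregation such an endpoint needs is small. A
hypothetical sketch of the prefix-only counting follows; the endpoint
wiring, its name, and the security review all still belong to that
follow-up:
```go
// Hypothetical sketch of the prefix-only aggregation a future
// ratelimit-stats endpoint could emit: bucket keys like "org:<uuid>",
// "tok:<sha256>", or "ip:<addr>" are collapsed to their prefix so no
// key payload ever leaves the process.
package main

import (
	"fmt"
	"strings"
)

func prefixCounts(bucketKeys []string) map[string]int {
	counts := map[string]int{}
	for _, k := range bucketKeys {
		prefix := "other"
		if i := strings.IndexByte(k, ':'); i >= 0 {
			prefix = k[:i+1] // keep "org:", "tok:", "ip:" only
		}
		counts[prefix]++
	}
	return counts
}

func main() {
	// Fake keys, illustrative only.
	keys := []string{"org:9b2e", "org:41aa", "tok:5f0c", "ip:203.0.113.7"}
	fmt.Println(prefixCounts(keys)) // map[ip::1 org::2 tok::1]
}
```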
## Decision tree for the re-tune
After 14 days of production traffic on a tenant, look at the queries
above and walk this tree:
```
Q1: Is the 429 rate sustained > 0.1/sec on any tenant?
├─ NO  → The 600 default has comfortable headroom. Either keep it,
│        or lower it carefully (300) ONLY if you have a documented
│        reason (e.g. a misbehaving client we want to throttle harder).
│        Default to "no change" — see #64 for the math.
└─ YES → Q2.

Q2: Is the 429 rate concentrated on ONE tenant or spread across many?
├─ ONE tenant  → Operator override: set RATE_LIMIT=1200 or 1800 on that
│                tenant's box. Document in the tenant's ops note. The
│                default does not need to change.
└─ MANY tenants → Q3.

Q3: Are the 429s on a route that polls (e.g. /activity / /peers)?
├─ YES → Confirm PRs #69, #71, #76 have actually deployed to those
│        tenants. If they have and 429s persist, the canvas may have
│        a regression — do not raise RATE_LIMIT. File a canvas issue.
└─ NO  → 429s on mutating routes mean genuine load. Raise the default
         to 1200 in `workspace-server/internal/router/router.go:54`.
         Same PR should attach: the metric chart, the time window,
         and a paragraph explaining what changed in our traffic shape.
```
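The Q2/Q3 branches assume the usual default-plus-override split: a
compiled-in default (the value the Q3 branch would raise) that a
per-tenant `RATE_LIMIT` environment variable overrides. A hypothetical
sketch of that pattern; the real wiring lives at `router.go:54` and may
differ:
```go
// Hypothetical sketch of the default-vs-override split the decision tree
// assumes. The actual workspace-server wiring may differ.
package main

import (
	"fmt"
	"os"
	"strconv"
)

const defaultRateLimit = 600 // req/min/key, the value #64 is re-tuning

// rateLimitPerMinute returns the per-tenant operator override when
// RATE_LIMIT is set (e.g. RATE_LIMIT=1200), else the compiled-in default.
func rateLimitPerMinute() int {
	if v := os.Getenv("RATE_LIMIT"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return defaultRateLimit
}

func main() {
	fmt.Println("effective limit:", rateLimitPerMinute())
}
```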
## Alert rule template (drop-in for Prometheus)
```yaml
# Sustained 429s: this rule is the SLO trip-wire. If it fires, walk the
# decision tree above. NB: the issue #64 acceptance criterion is "two
# weeks of metrics"; this alert is the inverse — it tells you something
# changed before the two weeks are up.
groups:
  - name: workspace-server-ratelimit
    rules:
      - alert: WorkspaceServerRateLimit429Sustained
        expr: |
          sum by (instance) (
            rate(molecule_http_requests_total{status="429"}[10m])
          ) > 0.1
        for: 30m
        labels:
          severity: warning
          owner: workspace-server
        annotations:
          summary: "{{ $labels.instance }} sustained 429s — see ratelimit-observability runbook"
          runbook: "https://git.moleculesai.app/molecule-ai/molecule-core/blob/main/docs/engineering/ratelimit-observability.md"
```
Threshold rationale: 0.1 req/s = 6/min sustained over 10min. Below
that, a 429 is almost certainly a transient burst that the canvas's
retry-once handler at `canvas/src/lib/api.ts:55` already absorbs. The
30m `for:` keeps the alert from chattering on a brief blip.
## Companion probe script
For one-off triage when an operator can reproduce the problem in their
own browser, `scripts/edge-429-probe.sh` (#62) reproduces a canvas-sized
burst against a tenant subdomain and dumps each 429's response
shape so the operator can distinguish workspace-server bucket overflow
from CF/Vercel edge rate-limiting without dashboard access.
```sh
./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
```
The script's report header explains how to read the output.

scripts/edge-429-probe.sh (new executable file, 155 lines)

@@ -0,0 +1,155 @@
#!/usr/bin/env bash
# edge-429-probe.sh — capture 429 origin (workspace-server vs CF/Vercel edge)
# during a simulated canvas-burst against a tenant subdomain.
#
# Issue molecule-core#62. The post-#60 verification step asks an
# operator with CF/Vercel dashboard access to confirm whether the
# layout-chunk 429s observed in DevTools were:
# (a) workspace-server bucket overflow (closes once #60 deploys), or
# (b) actual edge-layer rate-limiting (CF or Vercel).
#
# This script doesn't need dashboard access. It reproduces the burst
# pattern locally and dumps every 429's response shape so the operator
# can distinguish (a) from (b) by inspection: workspace-server emits a
# JSON body, CF emits HTML, Vercel emits a different HTML. Headers tell
# the same story (cf-ray vs x-vercel-*).
#
# Usage:
# ./scripts/edge-429-probe.sh <tenant-host> [--burst N] [--waves N] [--pause SECS] [--out file]
#
# Example:
# ./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
#
# The script is read-only against the target — it only issues GETs to
# public-by-design endpoints. No mutating requests, no credential use.
set -euo pipefail
# ── Help / usage handling first, before positional capture ────────────────────
case "${1:-}" in
-h|--help|"")
sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
exit 0
;;
esac
HOST="$1"; shift
BURST=80
WAVES=3
WAVE_PAUSE=2
OUT=""
while [ "${1:-}" != "" ]; do
case "$1" in
--burst) BURST="$2"; shift 2 ;;
--waves) WAVES="$2"; shift 2 ;;
--pause) WAVE_PAUSE="$2"; shift 2 ;;
--out) OUT="$2"; shift 2 ;;
-h|--help)
sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
exit 0
;;
*) echo "unknown arg: $1" >&2; exit 2 ;;
esac
done
# ── Endpoint discovery ────────────────────────────────────────────────────────
echo "→ Discovering a layout-chunk URL from canvas root..." >&2
ROOT_BODY=$(curl -fsSL --max-time 10 "https://${HOST}/" 2>/dev/null || true)
LAYOUT_PATH=$(echo "$ROOT_BODY" \
  | grep -oE '/_next/static/chunks/layout-[A-Za-z0-9_-]+\.js' \
  | head -1 || true)
if [ -z "$LAYOUT_PATH" ]; then
  LAYOUT_PATH="/_next/static/chunks/layout-probe-not-found.js"
  echo " (no layout chunk discovered — using sentinel path; 404 on this is expected)" >&2
else
  echo " layout chunk: $LAYOUT_PATH" >&2
fi
# Probe URL: a generic activity endpoint. The rate-limiter middleware
# runs BEFORE workspace-id validation, so unauth/invalid-id requests
# still hit the bucket.
ACTIVITY_PATH="/workspaces/00000000-0000-0000-0000-000000000000/activity?probe=edge-429"
# ── Fire a burst of curls, one single-line status record per request ─────────
# run_burst backgrounds one curl per request and appends its record to a
# temp file; `wait` collects the wave. Each output line is a parseable
# record; failed curls emit a curl_err record so request volume is preserved.
TMP_RESULTS="$(mktemp -t edge-429-probe.XXXXXX)"
trap 'rm -f "$TMP_RESULTS"' EXIT
run_burst() {
  # $1 = path; $2 = label; $3 = wave_id
  local path="$1" label="$2" wave="$3"
  local i
  for i in $(seq 1 "$BURST"); do
    {
      out=$(curl -sS --max-time 10 -o /dev/null \
        -w 'status=%{http_code} size=%{size_download} time=%{time_total} server=%{header.server} cf_ray=%{header.cf-ray} x_vercel=%{header.x-vercel-id} retry_after=%{header.retry-after} content_type=%{header.content-type} x_ratelimit_limit=%{header.x-ratelimit-limit} x_ratelimit_remaining=%{header.x-ratelimit-remaining} x_ratelimit_reset=%{header.x-ratelimit-reset}\n' \
        "https://${HOST}${path}" 2>/dev/null) || out="status=curl_err"
      printf 'label=%s-%s-%s %s\n' "$label" "$wave" "$i" "$out" >> "$TMP_RESULTS"
    } &
  done
  wait
}
emit() {
  if [ -n "$OUT" ]; then
    printf '%s\n' "$*" >> "$OUT"
  else
    printf '%s\n' "$*"
  fi
}
if [ -n "$OUT" ]; then : > "$OUT"; fi
emit "# edge-429-probe report"
emit "# host=$HOST burst=$BURST waves=$WAVES pause=${WAVE_PAUSE}s"
emit "# layout_path=$LAYOUT_PATH"
emit "# activity_path=$ACTIVITY_PATH"
emit "# generated=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
emit ""
for wave in $(seq 1 "$WAVES"); do
  emit "## wave $wave"
  : > "$TMP_RESULTS"
  run_burst "$LAYOUT_PATH" "layout" "$wave"
  run_burst "$ACTIVITY_PATH" "activity" "$wave"
  while read -r line; do
    emit " $line"
  done < "$TMP_RESULTS"
  if [ "$wave" -lt "$WAVES" ]; then
    sleep "$WAVE_PAUSE"
  fi
done
emit ""
emit "## summary — how to read the report"
emit "# status=429 + content_type starts with application/json + x_ratelimit_limit set"
emit "# => workspace-server bucket overflow. Closes when #60 deploys."
emit "# status=429 + cf_ray set + content_type=text/html"
emit "# => Cloudflare WAF / rate-limit. Audit dashboard rules per #62."
emit "# status=429 + x_vercel set + content_type=text/html"
emit "# => Vercel edge / Bot Fight Mode. Audit Vercel project per #62."
emit "# status=429 with no server/cf_ray/x_vercel"
emit "# => corporate proxy or VPN. Not actionable in this repo."
if [ -n "$OUT" ]; then
echo "→ Report written to $OUT" >&2
# Match only data lines (begin with two-space indent + "label="),
# not the summary's reference text which also mentions "status=429".
# grep -c outputs "0" + exits 1 when zero matches; `|| true` masks
# the exit status so set -e doesn't trip without losing the count.
total=$(grep -c '^ label=' "$OUT" 2>/dev/null || true)
total429=$(grep -c '^ label=.*status=429' "$OUT" 2>/dev/null || true)
total=${total:-0}
total429=${total429:-0}
echo "→ Totals: ${total429} of ${total} requests returned 429" >&2
if [ "${total429}" -gt 0 ]; then
echo "→ Per-label 429 counts:" >&2
grep '^ label=.*status=429' "$OUT" \
| sed -E 's/^ label=([^-]+).*/ \1/' \
| sort | uniq -c >&2
fi
fi