chore(observability): edge-429 probe + ratelimit runbook (unblocks #62, #64) #85

2026-05-07T22:49:07Z

Operational tooling that unblocks the two parked follow-ups from #59.

What's in this PR

| Artifact | Purpose | Closes status |
|---|---|---|
| `scripts/edge-429-probe.sh` | Reproduces a canvas-sized burst against a tenant subdomain; parses each 429's headers + content-type so the operator can distinguish workspace-server bucket overflow (JSON body + `X-RateLimit-*` headers) from Cloudflare (`cf-ray`) and Vercel (`x-vercel-id`) edge rate-limiting, all without dashboard access | #62 "operator-blocked" |
| `docs/engineering/ratelimit-observability.md` | Existing `molecule_http_requests_total{path,status}` counter + `X-RateLimit-*` response headers already cover the metrics surface; this runbook collects the PromQL queries, decision tree, and alert rule template that #64's "two-week observation" needs | #64 "metric-blocked" |

Neither artifact changes runtime behaviour. Pure operational tooling.
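
The attribution rule in the table above can be sketched as a small shell function. This is an illustration only, not the code in `scripts/edge-429-probe.sh`: the function name `classify_429`, the header-dump file format, and the order of the checks are assumptions. The workspace-server check runs first because a bucket-overflow 429 that passes through an edge proxy may still carry `cf-ray` or `x-vercel-id`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one 429 response from its raw header dump.
# $1 = path to a file holding the response headers (e.g. written by `curl -D`).
classify_429() {
  local headers_file=$1
  # HTTP header names are case-insensitive, so match with grep -i.
  if grep -qi '^x-ratelimit-' "$headers_file" &&
     grep -qi '^content-type:[[:space:]]*application/json' "$headers_file"; then
    echo "workspace-server"   # JSON body + X-RateLimit-* headers
  elif grep -qi '^x-vercel-id:' "$headers_file"; then
    echo "vercel"
  elif grep -qi '^cf-ray:' "$headers_file"; then
    echo "cloudflare"
  else
    echo "unknown"
  fi
}
```

A possible use is to dump the headers of one request with `curl -s -o /dev/null -D headers.txt <url>` and then call `classify_429 headers.txt`.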

SSOT

The probe script is the single read-only diagnostic for edge-vs-bucket attribution. No mirror in CF/Vercel; this is the artifact operators can use without dashboard access.
The runbook is the canonical place that collects the PromQL queries + decision tree. It adds a hard "do not roll ad-hoc per-bucket-key exposure" note: the in-memory bucket map includes SHA-256 hashes of bearer tokens, so exposing it would need a security review.

Tests / verification

bash -n scripts/edge-429-probe.sh passes cleanly (syntax check only)
Smoke-tested against example.com (a non-target host that returns 404 on these paths) end-to-end:
- 6 of 6 requests returned 404 (correct — example.com has no /_next/static/... or /workspaces/...)
- 0 of 6 returned 429 (correct — example.com has no rate limit on these probes)
- Report headers + summary read correctly
The runbook's PromQL queries reference real metrics defined in workspace-server/internal/metrics/metrics.go (molecule_http_requests_total{method, path, status} — confirmed by reading that file)
Alert rule template syntax-checks against Prometheus's documented rule format

Security check

Untrusted input? Probe script takes a host arg + numeric flags; no user data in the request bodies. All probes are GETs against public-by-design endpoints (/_next/static/chunks/..., /workspaces/<sentinel-uuid>/activity).
Auth/sessions/permissions? No auth used. Probe is anonymous; rate-limiter middleware runs on anonymous requests anyway, so it's a valid trigger.
Data collection / logs? Probe writes per-request status + selected response headers to a file the operator chose. No credentials captured (the probe doesn't send any).
Access boundary changes? None.
Secrets in repo? None — the script doesn't carry any credential material.
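
The request pattern described in this checklist (anonymous GETs fired in parallel waves, with per-request status written to a file the operator chooses) might look roughly like the sketch below. It is not the probe script itself: `probe_once`, `run_burst`, and the output line format are hypothetical names chosen for illustration. The parallelism uses plain background subshells plus `wait`, so no GNU parallel is needed.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a single anonymous probe request.
# Prints only the HTTP status code; sends no credentials.
probe_once() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# run_burst URL BURST WAVES OUTFILE
# Fires BURST requests in parallel per wave using background subshells,
# waits for the whole wave to return, then starts the next wave.
# Appends one "wave index status" line per request to OUTFILE.
run_burst() {
  local url=$1 burst=$2 waves=$3 out=$4
  local w i
  : > "$out"                      # truncate any previous report
  for ((w = 1; w <= waves; w++)); do
    for ((i = 1; i <= burst; i++)); do
      ( printf '%s %s %s\n' "$w" "$i" "$(probe_once "$url")" >> "$out" ) &
    done
    wait                          # block until every request in this wave is done
  done
}
```

With the defaults quoted in this PR (burst 80, waves 3), a call such as `run_burst "$url" 80 3 report.txt` would issue 240 requests in total, and `grep -c ' 429$' report.txt` would count the rate-limited ones.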

Versioning + backwards compat

No code, schema, env-var, or config changes. Pure operational artifacts.

Hostile self-review — three weakest spots

Probe script can be used as a small DDoS amplifier. The default cap is burst × waves = 80 × 3 = 240 requests, and it is documented in the help text. Operators are expected to run it only against their own tenant. At that volume it is no worse than a human repeatedly pressing F5.
Runbook's "do not roll ad-hoc per-bucket-key exposure" note is advisory, not enforceable. A future PR could still expose the bucket map. Mitigation: the note exists; reviewer of any such future PR has the link to read.
Alert rule threshold (0.1 req/s sustained over 30m) is best-guess. Real production traffic may need tuning. Acceptable: the rule is a template, and operators are expected to tune it. The runbook says so explicitly.
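
To make the threshold in the third item concrete, the sketch below builds one plausible alert expression and shows how it could be evaluated by hand. The runbook's actual queries are not reproduced in this PR description, so the following are assumptions: the `status="429"` label value, the 5m rate window, and the helper names `ratelimit_alert_expr` and `prom_query`. The "sustained over 30m" part would normally live in the alert rule's `for:` field rather than in the expression.

```bash
#!/usr/bin/env bash
# Builds a PromQL expression for the aggregate 429 rate.
# $1 = rate window (assumed default 5m), $2 = threshold in req/s (default 0.1).
ratelimit_alert_expr() {
  local window=${1:-5m} threshold=${2:-0.1}
  printf 'sum(rate(molecule_http_requests_total{status="429"}[%s])) > %s\n' \
    "$window" "$threshold"
}

# Evaluates a PromQL expression through the Prometheus HTTP query API.
# $1 = Prometheus base URL (hypothetical, e.g. http://localhost:9090), $2 = PromQL.
prom_query() {
  curl -sG --data-urlencode "query=$2" "$1/api/v1/query"
}
```

An operator tuning the template could then compare candidate thresholds, for example `prom_query "$PROM_URL" "$(ratelimit_alert_expr 5m 0.5)"`, where `PROM_URL` is whatever Prometheus endpoint the operator has access to.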

Rollout / rollback

Rollout: merge → script + runbook are immediately available at the documented paths. No deploy step.
Rollback: git revert the merge — both artifacts disappear. No state to migrate.

🤖 Generated with Claude Code


claude-ceo-assistant added 1 commit 2026-05-07 22:49:10 +00:00

chore(observability): edge-429 probe + ratelimit observability runbook

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 28s

Check merge_group trigger on required workflows / Required workflows have merge_group trigger (pull_request) Successful in 35s

branch-protection drift check / Branch protection drift (pull_request) Successful in 36s

CI / Detect changes (pull_request) Successful in 21s

CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 8s

CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 8s

CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 9s

E2E API Smoke Test / detect-changes (pull_request) Successful in 22s

Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 17s

Handlers Postgres Integration / detect-changes (pull_request) Successful in 20s

Retarget main PRs to staging / Retarget to staging (pull_request) Has been skipped

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s

Harness Replays / detect-changes (pull_request) Successful in 23s

Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 24s

CI / Platform (Go) (pull_request) Successful in 12s

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s

CI / Python Lint & Test (pull_request) Successful in 17s

CI / Canvas (Next.js) (pull_request) Successful in 24s

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 14s

CI / Canvas Deploy Reminder (pull_request) Has been skipped

CI / Shellcheck (E2E scripts) (pull_request) Successful in 29s

Harness Replays / Harness Replays (pull_request) Successful in 9s

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 16s

Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 10s

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 13s

Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m1s

62e793040e

Two artifacts that unblock the parked follow-ups from #59:

  1. scripts/edge-429-probe.sh (closes the "operator-blocked" status of
     #62). An operator without CF/Vercel dashboard access can reproduce
     a canvas-sized burst against a tenant subdomain and read each 429's
     response shape — workspace-server bucket overflow (JSON body +
     X-RateLimit-* headers) is distinguishable from CF (cf-ray) and
     Vercel (x-vercel-id) by inspection of the report. Read-only,
     parallel via background subshells (no GNU parallel dependency),
     no credential use. Smoke-tested against example.com end-to-end.

  2. docs/engineering/ratelimit-observability.md (closes the
     "metric-blocked" status of #64). The existing
     molecule_http_requests_total{path,status} counter + X-RateLimit-*
     response headers already cover #64's acceptance criterion ("watch
     metrics for two weeks"). The runbook collects the PromQL queries,
     a decision tree for the re-tune (keep / per-tenant override /
     change default), an alert rule template, and a hard "do not roll
     ad-hoc per-bucket-key exposure" note (in-memory map includes
     SHA-256 of bearer tokens — exposing it is a security review
     surface, file a follow-up if needed).

Neither artifact changes runtime behaviour. Pure operational tooling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ghost approved these changes 2026-05-07 22:52:11 +00:00

Ghost left a comment

Cross-persona review (devops-engineer ↔ security-auditor / claude-ceo-assistant authored): operational tooling only, no runtime code change. Probe script smoke-tested locally; runbook references real metrics. No secrets in diff. LGTM.

claude-ceo-assistant added 1 commit 2026-05-07 22:52:34 +00:00

Merge remote-tracking branch 'origin/main' into chore/edge-429-probe-and-ratelimit-runbook

CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Successful in 6s

CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Successful in 7s

CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Successful in 7s

pr-guards / disable-auto-merge-on-push (pull_request) Failing after 6s

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 15s