From 49bd2e8f56024be0468bc6a108ed6e151e4c399b Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Fri, 17 Apr 2026 10:17:43 -0700
Subject: [PATCH] =?UTF-8?q?docs(wildcard-dns):=20address=20CEO=20review=20?=
 =?UTF-8?q?=E2=80=94=20KV=20cache,=20WebSocket,=20proxy=20trust?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses all 4 review points from PR #786:

1. Worker resilience: 3-tier cache (in-memory → KV → CP API) with stale
   fallback so CP outages are invisible to tenants
2. WebSocket proxying: documented upgradeHeader handling, fallback to
   keep Caddy for WS-only if Workers WS is unreliable
3. SG automation: note to auto-update Cloudflare IP ranges, don't hardcode
4. Trusted proxy: X-Forwarded-For / CF-Connecting-IP trust chain documented

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 docs/architecture/wildcard-dns-proxy.md | 52 ++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 6 deletions(-)

diff --git a/docs/architecture/wildcard-dns-proxy.md b/docs/architecture/wildcard-dns-proxy.md
index c29214b1..b29646e7 100644
--- a/docs/architecture/wildcard-dns-proxy.md
+++ b/docs/architecture/wildcard-dns-proxy.md
@@ -70,17 +70,51 @@ The Worker runs on every request to `*.moleculesai.app` that isn't
 matched by an explicit DNS record. It:
 
 1. **Extracts the slug** from the `Host` header
-2. **Looks up the backend IP** by calling `GET https://api.moleculesai.app/cp/orgs//instance`
-   - Caches the response for 60s in Cloudflare's edge cache (KV or Cache API)
-   - If the org doesn't exist → 404 page
+2. **Looks up the backend IP** using a 3-tier cache strategy:
+   - **L1: in-memory cache** (60s TTL) — fastest, per-isolate
+   - **L2: Workers KV** (5 min TTL, stale-while-revalidate) — survives isolate
+     restarts, shared across all edge locations
+   - **L3: CP API** — `GET https://api.moleculesai.app/cp/orgs//instance`
+   - **Fallback:** if CP is unreachable, serve stale KV entry (any age) rather
+     than erroring. A 10-minute CP outage is invisible to tenants.
+   - If the org doesn't exist (404 from CP, no KV entry) → 404 page
    - If the org is provisioning (no IP yet) → return a static "provisioning" HTML page
 3. **Proxies the request** to `http://:8080` (platform) or `:3000` (canvas)
    - Route: `/health`, `/workspaces*`, `/registry*`, etc. → `:8080`
    - Route: everything else → `:3000`
+   - Route: `/ws` → `:8080` with WebSocket upgrade (see WebSocket section below)
    - Injects `X-Molecule-Org-Id` header (same as Caddy does today)
    - Injects `Origin` header for AdminAuth bypass
+   - Injects `X-Forwarded-For` with client IP from `CF-Connecting-IP`
+   - Injects `X-Forwarded-Proto: https`
 4. **Returns the response** to the browser with Cloudflare's TLS
 
+#### WebSocket proxying
+
+Cloudflare Workers can proxy WebSockets by checking the `Upgrade` request header.
+The Worker detects `Upgrade: websocket` on incoming requests and passes them
+through to the EC2 backend on `:8080/ws`. The Worker acts as a transparent
+tunnel — it does not inspect or buffer WebSocket frames.
+
+```js
+// Simplified WebSocket handling in the Worker (backendIp and url are resolved earlier)
+if ((request.headers.get('Upgrade') || '').toLowerCase() === 'websocket') {
+  return fetch(`http://${backendIp}:8080${url.pathname}${url.search}`, request);
+}
+```
+
+If Workers WebSocket proxying proves unreliable in production (frame drops,
+idle timeout issues), Phase 33.3 keeps Caddy as a thin WebSocket-only reverse
+proxy on EC2 instead of removing it entirely.
+
+#### Trusted proxy configuration
+
+The platform's Gin server uses `SetTrustedProxies(nil)` by default; in Gin that
+trusts no proxy, so `ClientIP()` returns the direct peer address. When requests
+come through the Worker, the platform should read the real client IP from
+`CF-Connecting-IP`. In production, set `TRUSTED_PROXIES` to Cloudflare's
+published IP ranges (auto-updated from `https://api.cloudflare.com/client/v4/ips`).
+
 ### 3. CP API endpoint: `GET /cp/orgs/:slug/instance`
 
 New public endpoint (no auth — needed by the Worker which has no session):
@@ -124,9 +158,15 @@ Worker → EC2 :8080 (platform, direct HTTP)
 Worker → EC2 :3000 (canvas, direct HTTP)
 ```
 
-Caddy can be removed from the EC2 user-data script entirely. The Worker
-handles TLS termination + routing. The EC2 security group should allow
-inbound HTTP from Cloudflare IPs only (not public).
+Caddy can be removed from the EC2 user-data script for HTTP routing. If
+WebSocket proxying through Workers proves reliable, Caddy is fully removed.
+If not, Caddy stays as a thin WebSocket-only reverse proxy (no TLS, no
+HTTP routing — just `/ws` → `:8080`).
+
+The EC2 security group should allow inbound HTTP from Cloudflare IPs only
+(not public). **Automate the IP list** — Cloudflare publishes their ranges
+at `https://api.cloudflare.com/client/v4/ips`. Use a Lambda or cron to
+update the SG weekly. Do not hardcode the IP ranges.
 
 **Headers injected by Worker** (replaces Caddy's `header_up`):
 - `X-Molecule-Org-Id: ` — for TenantGuard