From d36b612bbf598130f5ac0b37b36798f8aa1dfb20 Mon Sep 17 00:00:00 2001 From: Hongming Wang Date: Fri, 17 Apr 2026 10:02:32 -0700 Subject: [PATCH 1/2] docs: wildcard DNS + Cloudflare Worker proxy architecture MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds Phase 33 plan and architecture doc for replacing per-tenant DNS records with a wildcard DNS + Cloudflare Worker proxy pattern. Eliminates: DNS propagation delays, NXDOMAIN caching, per-instance Let's Encrypt, Caddy on EC2. Same pattern used by Vercel, Railway, Fly.io, WordPress, n8n. 4-phase migration: deploy Worker → stop creating DNS records → remove Caddy from EC2 → cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) --- CLAUDE.md | 6 + PLAN.md | 47 ++++++ docs/architecture/wildcard-dns-proxy.md | 192 ++++++++++++++++++++++++ 3 files changed, 245 insertions(+) create mode 100644 docs/architecture/wildcard-dns-proxy.md diff --git a/CLAUDE.md b/CLAUDE.md index aedab50f..6a70029d 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -28,6 +28,12 @@ secrets` on `molecule-cp`), the correct rotation order, and danger cases — notably `SECRETS_ENCRYPTION_KEY`, which cannot be rotated without a data migration until Phase H lands KMS envelope encryption. +For tenant subdomain routing architecture (why `*.moleculesai.app` uses a +Cloudflare Worker instead of per-tenant DNS records), read +**`docs/architecture/wildcard-dns-proxy.md`**. This eliminates DNS +propagation delays and NXDOMAIN caching that previously caused "site can't +be reached" errors for new orgs. + When handling a GDPR erasure request (user asks "delete my org and all my data"), read **`docs/runbooks/gdpr-erasure.md`** first. It explains the 4-step cascade in `molecule-controlplane` (Stripe → Redis → Infra → DB diff --git a/PLAN.md b/PLAN.md index 158e132a..10c37359 100644 --- a/PLAN.md +++ b/PLAN.md @@ -575,6 +575,53 @@ self-hosted per-customer). Ordered by dependency + ROI. 
--- +## Phase 33: Wildcard DNS + Cloudflare Worker Proxy + +> **Goal:** Eliminate DNS propagation delays and NXDOMAIN caching for tenant +> subdomains. Many SaaS platforms (Vercel, Railway, Fly.io) use this pattern: +> wildcard DNS + edge proxy routing by hostname. +> +> **Docs:** `docs/architecture/wildcard-dns-proxy.md` + +### Phase 33.1 — Worker + wildcard DNS (no tenant changes) + +- [ ] Create Cloudflare Worker that extracts slug from hostname, looks up + backend IP from CP API, proxies request to EC2 +- [ ] Add `GET /cp/orgs/:slug/instance` endpoint to CP (public, rate-limited) +- [ ] Add `*.moleculesai.app` wildcard DNS record (proxied, orange cloud) +- [ ] Worker serves static "provisioning" splash page when tenant not ready +- [ ] Deploy Worker via `wrangler deploy` + GitHub Actions +- [ ] Verify Worker routing works for existing tenants alongside old A records + +### Phase 33.2 — Stop per-tenant DNS records + +- [ ] Remove Cloudflare A record creation from `ec2.go` provisioner +- [ ] Remove Cloudflare DNS cleanup from deprovision/purge cascade +- [ ] Existing A records coexist harmlessly (explicit wins over wildcard) + +### Phase 33.3 — Remove Caddy from EC2 + +- [ ] Worker handles TLS termination — EC2 runs plain HTTP only +- [ ] Remove Caddy install + Caddyfile from EC2 user-data script +- [ ] EC2 security group: allow inbound HTTP from Cloudflare IPs only +- [ ] ~30s faster cold start (no apt-get caddy, no Let's Encrypt) + +### Phase 33.4 — Cleanup + +- [ ] Delete old per-tenant A records from Cloudflare +- [ ] Remove `cloudflareapi/` package from CP (Worker replaces it) +- [ ] Update `docs/runbooks/saas-secrets.md` with Worker secrets + +### Success criteria for Phase 33 + +- New org subdomain resolves instantly (zero DNS wait) +- No NXDOMAIN caching — user never sees "site can't be reached" +- Provisioning splash page shown while EC2 boots (auto-refreshes) +- Cold start ~30s faster (no Caddy/Let's Encrypt) +- Cost: Cloudflare Worker free tier or $5/mo + 
+--- + ## Infra footnote — Temporal `docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web diff --git a/docs/architecture/wildcard-dns-proxy.md b/docs/architecture/wildcard-dns-proxy.md new file mode 100644 index 00000000..c29214b1 --- /dev/null +++ b/docs/architecture/wildcard-dns-proxy.md @@ -0,0 +1,192 @@ +# Wildcard DNS + Cloudflare Worker Proxy + +> **Status:** Planned — replaces per-tenant DNS record creation. +> +> **Problem:** When a user creates an org, we create an EC2 instance and a +> Cloudflare A record pointing `.moleculesai.app` to the instance IP. +> This causes 3-5 min of DNS propagation + NXDOMAIN caching by ISPs, meaning +> users see "site can't be reached" for minutes after creating their org. +> +> **Solution:** Every SaaS (Vercel, Railway, Fly.io, WordPress, n8n) uses the +> same pattern: wildcard DNS + a reverse proxy that routes by hostname. + +--- + +## Architecture + +``` +Browser → https://acme.moleculesai.app + ↓ + *.moleculesai.app DNS → Cloudflare (proxied, orange cloud) + ↓ + Cloudflare Worker (edge, ~50ms) + 1. Extract slug from hostname + 2. Lookup backend IP from CP API (cached 60s) + 3. If no backend → return "provisioning" splash page + 4. Proxy request to EC2 instance + ↓ + EC2 tenant (platform :8080, canvas :3000) +``` + +## Why this fixes the DNS problem + +| Before (per-tenant DNS) | After (wildcard + proxy) | +|--------------------------|--------------------------| +| Create A record per org | Wildcard `*.moleculesai.app` exists once, forever | +| 3-5 min DNS propagation | Zero — wildcard already resolves | +| NXDOMAIN cached by ISP for hours | Never happens — domain always resolves | +| Let's Encrypt cert per EC2 (~30s) | Cloudflare handles TLS (wildcard or per-host, free) | +| Caddy on each EC2 for HTTPS | Caddy only needed for local reverse proxy (HTTP, no TLS) | +| DNS cleanup on org delete | No DNS records to clean up | + +## Components + +### 1. 
Cloudflare DNS (one-time setup) + +Add a single wildcard record in the Cloudflare dashboard: + +``` +Type: A +Name: * +Content: 0.0.0.0 (placeholder — Worker intercepts before it reaches this) +Proxy: ON (orange cloud — routes through Cloudflare) +TTL: Auto +``` + +The `0.0.0.0` content doesn't matter because the Worker intercepts every +request before Cloudflare would try to connect to the origin. The orange +cloud (proxy ON) is required for Workers to fire on the route. + +Also keep the explicit records for non-tenant subdomains: +- `api.moleculesai.app` → Railway (control plane) +- `app.moleculesai.app` → Vercel (customer dashboard) +- `moleculesai.app` → Vercel (landing page) + +These explicit records take priority over the wildcard. + +### 2. Cloudflare Worker (~50 lines) + +The Worker runs on every request to `*.moleculesai.app` that isn't matched +by an explicit DNS record. It: + +1. **Extracts the slug** from the `Host` header +2. **Looks up the backend IP** by calling `GET https://api.moleculesai.app/cp/orgs//instance` + - Caches the response for 60s in Cloudflare's edge cache (KV or Cache API) + - If the org doesn't exist → 404 page + - If the org is provisioning (no IP yet) → return a static "provisioning" HTML page +3. **Proxies the request** to `http://:8080` (platform) or `:3000` (canvas) + - Route: `/health`, `/workspaces*`, `/registry*`, etc. → `:8080` + - Route: everything else → `:3000` + - Injects `X-Molecule-Org-Id` header (same as Caddy does today) + - Injects `Origin` header for AdminAuth bypass +4. **Returns the response** to the browser with Cloudflare's TLS + +### 3. 
CP API endpoint: `GET /cp/orgs/:slug/instance` + +New public endpoint (no auth — needed by the Worker which has no session): + +```json +// GET /cp/orgs/acme/instance +// 200 when running: +{ + "slug": "acme", + "status": "running", + "ip": "18.220.182.88", + "region": "us-east-2" +} + +// 200 when provisioning: +{ + "slug": "acme", + "status": "provisioning", + "ip": null +} + +// 404 when org doesn't exist +``` + +**Security note:** This endpoint exposes the EC2 IP for a given slug. This is +equivalent to what DNS already exposes (A record → IP). No secrets are leaked. +The endpoint should be rate-limited to prevent enumeration. + +### 4. EC2 tenant changes + +With Cloudflare handling TLS, the EC2 instance no longer needs Caddy for HTTPS: + +**Before:** +``` +Caddy (:443, auto Let's Encrypt) → platform (:8080) / canvas (:3000) +``` + +**After:** +``` +Worker → EC2 :8080 (platform, direct HTTP) +Worker → EC2 :3000 (canvas, direct HTTP) +``` + +Caddy can be removed from the EC2 user-data script entirely. The Worker +handles TLS termination + routing. The EC2 security group should allow +inbound HTTP from Cloudflare IPs only (not public). + +**Headers injected by Worker** (replaces Caddy's `header_up`): +- `X-Molecule-Org-Id: ` — for TenantGuard +- `Origin: https://.moleculesai.app` — for AdminAuth +- `X-Forwarded-For: ` — for rate limiting +- `X-Forwarded-Proto: https` — so the platform knows the original scheme + +### 5. Provisioning splash page + +When the Worker detects `status: "provisioning"`, it returns a static HTML +page with: +- The Molecule AI logo +- "Setting up your workspace..." +- A progress animation +- Auto-refresh every 5s (meta refresh or JS fetch) + +This replaces the molecule-app provisioning page for direct subdomain visits. +The molecule-app provisioning page at `app.moleculesai.app/orgs/:slug/provisioning` +continues to work as the primary flow (redirect after org creation). + +## Migration plan + +1. 
**Phase 1: Deploy Worker + wildcard DNS** (no tenant changes) + - Worker proxies to existing EC2 instances (Caddy still running) + - Both paths work: direct DNS (old A records) + Worker proxy (new) + - Verify Worker routing works for existing tenants + +2. **Phase 2: Stop creating per-tenant DNS records** + - Update CP provisioner to skip Cloudflare A record creation + - Remove Cloudflare DNS cleanup from deprovision + - Existing A records coexist with wildcard (explicit wins) + +3. **Phase 3: Remove Caddy from EC2 user-data** + - Worker handles TLS + routing + - EC2 runs platform on :8080 and canvas on :3000 (plain HTTP) + - Simpler boot script, ~30s faster cold start + +4. **Phase 4: Clean up old A records** + - Delete per-tenant A records (wildcard handles everything) + - Remove Cloudflare client from CP provisioner + +## Cost + +- Cloudflare Worker: free tier = 100k requests/day. Paid = $5/mo for 10M. +- Wildcard DNS: free (Cloudflare). +- Savings: no more per-instance Let's Encrypt, no Caddy install time. 
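The slug extraction and port routing that the Worker section describes can be sketched as two pure functions. This is a minimal sketch, not the planned Worker source: the helper names `extractSlug` and `pickPort`, the reserved-subdomain set, and the exact path lists are assumptions for illustration (the doc's route list ends in "etc.", so the real list is presumably longer).

```javascript
// Hypothetical helpers for the Worker; names and lists are illustrative assumptions.
const BASE_DOMAIN = "moleculesai.app";
// Non-tenant subdomains that have explicit DNS records in the doc.
const RESERVED = new Set(["api", "app"]);
// Paths served by the platform on :8080 (assumed subset of the real list).
const PLATFORM_EXACT = ["/health", "/ws"];
const PLATFORM_PREFIXES = ["/workspaces", "/registry"];

// Returns the tenant slug for "<slug>.moleculesai.app", or null otherwise.
function extractSlug(hostname) {
  const host = hostname.toLowerCase().split(":")[0]; // drop any port
  const suffix = "." + BASE_DOMAIN;
  if (!host.endsWith(suffix)) return null; // apex or foreign host
  const slug = host.slice(0, -suffix.length);
  // Reject empty, nested (a.b.moleculesai.app), and reserved names.
  if (slug === "" || slug.includes(".") || RESERVED.has(slug)) return null;
  return slug;
}

// Chooses the backend port: platform routes go to 8080, the rest to canvas on 3000.
function pickPort(pathname) {
  const isPlatform =
    PLATFORM_EXACT.includes(pathname) ||
    PLATFORM_PREFIXES.some((prefix) => pathname.startsWith(prefix));
  return isPlatform ? 8080 : 3000;
}
```

Keeping these as pure functions means they can be unit-tested without a Workers runtime; a `fetch` handler would call them with the hostname and pathname taken from `new URL(request.url)`.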
+ +## Files to change + +| File | Change | +|------|--------| +| `molecule-controlplane/internal/provisioner/ec2.go` | Remove Cloudflare DNS creation, remove Caddy from user-data | +| `molecule-controlplane/internal/cloudflareapi/dns.go` | Eventually removable (Worker replaces it) | +| `molecule-controlplane/internal/handlers/orgs.go` | Add `GET /cp/orgs/:slug/instance` endpoint | +| New: `infra/cloudflare-worker/` | Worker source + wrangler.toml | +| `docs/runbooks/saas-secrets.md` | Add Worker secrets (CF account ID, API token) | +| `.github/workflows/deploy-worker.yml` | CI/CD for Worker deploys | + +## References + +- [Cloudflare Workers docs](https://developers.cloudflare.com/workers/) +- [Vercel's routing architecture](https://vercel.com/docs/edge-network/overview) — same pattern +- [Railway custom domains](https://docs.railway.app/guides/public-networking#custom-domains) — same pattern From 8c02d2d8780218fe94225ef78813fe45db237855 Mon Sep 17 00:00:00 2001 From: Hongming Wang Date: Fri, 17 Apr 2026 10:17:43 -0700 Subject: [PATCH 2/2] =?UTF-8?q?docs(wildcard-dns):=20address=20CEO=20revie?= =?UTF-8?q?w=20=E2=80=94=20KV=20cache,=20WebSocket,=20proxy=20trust?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses all 4 review points from PR #786: 1. Worker resilience: 3-tier cache (in-memory → KV → CP API) with stale fallback so CP outages are invisible to tenants 2. WebSocket proxying: documented upgradeHeader handling, fallback to keep Caddy for WS-only if Workers WS is unreliable 3. SG automation: note to auto-update Cloudflare IP ranges, don't hardcode 4. 
Trusted proxy: X-Forwarded-For / CF-Connecting-IP trust chain documented Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/architecture/wildcard-dns-proxy.md | 52 ++++++++++++++++++++++--- 1 file changed, 46 insertions(+), 6 deletions(-) diff --git a/docs/architecture/wildcard-dns-proxy.md b/docs/architecture/wildcard-dns-proxy.md index c29214b1..b29646e7 100644 --- a/docs/architecture/wildcard-dns-proxy.md +++ b/docs/architecture/wildcard-dns-proxy.md @@ -70,17 +70,51 @@ The Worker runs on every request to `*.moleculesai.app` that isn't matched by an explicit DNS record. It: 1. **Extracts the slug** from the `Host` header -2. **Looks up the backend IP** by calling `GET https://api.moleculesai.app/cp/orgs//instance` - - Caches the response for 60s in Cloudflare's edge cache (KV or Cache API) - - If the org doesn't exist → 404 page +2. **Looks up the backend IP** using a 3-tier cache strategy: + - **L1: in-memory cache** (60s TTL) — fastest, per-isolate + - **L2: Workers KV** (5 min TTL, stale-while-revalidate) — survives isolate + restarts, shared across all edge locations + - **L3: CP API** — `GET https://api.moleculesai.app/cp/orgs//instance` + - **Fallback:** if CP is unreachable, serve stale KV entry (any age) rather + than erroring. A 10-minute CP outage is invisible to tenants. + - If the org doesn't exist (404 from CP, no KV entry) → 404 page - If the org is provisioning (no IP yet) → return a static "provisioning" HTML page 3. **Proxies the request** to `http://:8080` (platform) or `:3000` (canvas) - Route: `/health`, `/workspaces*`, `/registry*`, etc. → `:8080` - Route: everything else → `:3000` + - Route: `/ws` → `:8080` with WebSocket upgrade (see WebSocket section below) - Injects `X-Molecule-Org-Id` header (same as Caddy does today) - Injects `Origin` header for AdminAuth bypass + - Injects `X-Forwarded-For` with client IP from `CF-Connecting-IP` + - Injects `X-Forwarded-Proto: https` 4. 
**Returns the response** to the browser with Cloudflare's TLS +#### WebSocket proxying + +Cloudflare Workers support WebSocket proxying via a check on the `Upgrade` header. +The Worker detects `Upgrade: websocket` on incoming requests and passes them +through to the EC2 backend on `:8080/ws`. The Worker acts as a transparent +tunnel — it does not inspect or buffer WebSocket frames. + +```js +// Simplified WebSocket handling in the Worker (header value is case-insensitive) +if ((request.headers.get('Upgrade') || '').toLowerCase() === 'websocket') { + return fetch(`http://${backendIp}:8080${url.pathname}${url.search}`, request); +} +``` + +If Workers WebSocket proxying proves unreliable in production (frame drops, +idle timeout issues), Phase 33.3 keeps Caddy as a thin WebSocket-only reverse +proxy on EC2 instead of removing it entirely. + +#### Trusted proxy configuration + +The platform's Gin server trusts all proxies by default (Gin's stock +behavior; note that `SetTrustedProxies(nil)` disables proxy trust rather than +enabling it). When requests come through the Worker instead of directly, the +platform should trust `CF-Connecting-IP` for the real client IP. In +production, set `TRUSTED_PROXIES` to Cloudflare's published IP ranges +(auto-updated from `https://api.cloudflare.com/client/v4/ips`). + ### 3. CP API endpoint: `GET /cp/orgs/:slug/instance` New public endpoint (no auth — needed by the Worker which has no session): @@ -124,9 +158,15 @@ Worker → EC2 :8080 (platform, direct HTTP) Worker → EC2 :3000 (canvas, direct HTTP) ``` -Caddy can be removed from the EC2 user-data script entirely. The Worker -handles TLS termination + routing. The EC2 security group should allow -inbound HTTP from Cloudflare IPs only (not public). +Caddy can be removed from the EC2 user-data script for HTTP routing. If +WebSocket proxying through Workers proves reliable, Caddy is fully removed. +If not, Caddy stays as a thin WebSocket-only reverse proxy (no TLS, no +HTTP routing — just `/ws` → `:8080`). + +The EC2 security group should allow inbound HTTP from Cloudflare IPs only +(not public). 
**Automate the IP list** — Cloudflare publishes its ranges +at `https://api.cloudflare.com/client/v4/ips`. Use a Lambda or cron job to +update the SG weekly. Do not hardcode the IP ranges. **Headers injected by Worker** (replaces Caddy's `header_up`): - `X-Molecule-Org-Id: ` — for TenantGuard
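The 3-tier lookup with stale fallback described in the Worker section can be sketched as follows. This is a hedged illustration rather than the planned implementation. It assumes a KV binding that exposes `get(key)` and `put(key, value)`, and a hypothetical `fetchFromCp(slug)` helper that resolves to the instance JSON, resolves to `null` on a 404, and throws when the control plane is unreachable. It stores a timestamp next to each value instead of relying on KV expiry so that stale entries stay available, caches only `running` instances (an assumption, so that a provisioning tenant is re-checked on every request), and omits background revalidation for brevity.

```javascript
const L1_TTL_MS = 60 * 1000;     // in-memory freshness window (60s in the doc)
const L2_TTL_MS = 5 * 60 * 1000; // KV freshness window (5 min in the doc)

// createLookup is a hypothetical factory; kv, fetchFromCp and now are injected
// so the logic can be exercised outside the Workers runtime.
function createLookup({ kv, fetchFromCp, now = Date.now }) {
  const mem = new Map(); // L1: per-isolate cache

  return async function lookup(slug) {
    const t = now();

    // L1: in-memory hit that is still fresh.
    const hit = mem.get(slug);
    if (hit && t - hit.storedAt < L1_TTL_MS) return hit.value;

    // L2: KV entry, used directly only while it is fresh.
    const raw = await kv.get(slug);
    const kvEntry = raw ? JSON.parse(raw) : null;
    if (kvEntry && t - kvEntry.storedAt < L2_TTL_MS) {
      mem.set(slug, { value: kvEntry.value, storedAt: t });
      return kvEntry.value;
    }

    // L3: control plane API.
    try {
      const value = await fetchFromCp(slug);
      if (value === null) return null; // org does not exist
      if (value.status === "running") {
        const entry = { value, storedAt: t };
        mem.set(slug, entry);
        await kv.put(slug, JSON.stringify(entry));
      }
      return value;
    } catch (err) {
      // Fallback: the control plane is unreachable, so serve stale KV data of any age.
      if (kvEntry) return kvEntry.value;
      throw err;
    }
  };
}
```

In a Worker, `kv` would be the KV namespace binding from `env`, and `fetchFromCp` would wrap a `fetch` to the `/cp/orgs/:slug/instance` endpoint.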