Merge pull request #786 from Molecule-AI/docs/wildcard-dns-proxy

docs: wildcard DNS + Cloudflare Worker proxy architecture (Phase 33)
2026-04-17 17:21:13 +00:00 · 2026-04-17 17:21:13 +00:00 · a41a2ba663
commit a41a2ba663
parent f5a8caefe1 8c02d2d878
3 changed files with 285 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@ -28,6 +28,12 @@ secrets` on `molecule-cp`), the correct rotation order, and danger cases —
 notably `SECRETS_ENCRYPTION_KEY`, which cannot be rotated without a data
 migration until Phase H lands KMS envelope encryption.

+For tenant subdomain routing architecture (why `*.moleculesai.app` uses a
+Cloudflare Worker instead of per-tenant DNS records), read
+**`docs/architecture/wildcard-dns-proxy.md`**. This eliminates DNS
+propagation delays and NXDOMAIN caching that previously caused "site can't
+be reached" errors for new orgs.
+
 When handling a GDPR erasure request (user asks "delete my org and all
 my data"), read **`docs/runbooks/gdpr-erasure.md`** first. It explains the
 4-step cascade in `molecule-controlplane` (Stripe → Redis → Infra → DB
--- a/PLAN.md
+++ b/PLAN.md
@ -575,6 +575,53 @@ self-hosted per-customer). Ordered by dependency + ROI.

 ---

+## Phase 33: Wildcard DNS + Cloudflare Worker Proxy
+
+> **Goal:** Eliminate DNS propagation delays and NXDOMAIN caching for tenant
+> subdomains. Every SaaS (Vercel, Railway, Fly.io) uses this pattern —
+> wildcard DNS + edge proxy routing by hostname.
+>
+> **Docs:** `docs/architecture/wildcard-dns-proxy.md`
+
+### Phase 33.1 — Worker + wildcard DNS (no tenant changes)
+
+- [ ] Create Cloudflare Worker that extracts slug from hostname, looks up
+  backend IP from CP API, proxies request to EC2
+- [ ] Add `GET /cp/orgs/:slug/instance` endpoint to CP (public, rate-limited)
+- [ ] Add `*.moleculesai.app` wildcard DNS record (proxied, orange cloud)
+- [ ] Worker serves static "provisioning" splash page when tenant not ready
+- [ ] Deploy Worker via `wrangler deploy` + GitHub Actions
+- [ ] Verify Worker routing works for existing tenants alongside old A records
+
+### Phase 33.2 — Stop per-tenant DNS records
+
+- [ ] Remove Cloudflare A record creation from `ec2.go` provisioner
+- [ ] Remove Cloudflare DNS cleanup from deprovision/purge cascade
+- [ ] Existing A records coexist harmlessly (explicit wins over wildcard)
+
+### Phase 33.3 — Remove Caddy from EC2
+
+- [ ] Worker handles TLS termination — EC2 runs plain HTTP only
+- [ ] Remove Caddy install + Caddyfile from EC2 user-data script
+- [ ] EC2 security group: allow inbound HTTP from Cloudflare IPs only
+- [ ] ~30s faster cold start (no apt-get caddy, no Let's Encrypt)
+
+### Phase 33.4 — Cleanup
+
+- [ ] Delete old per-tenant A records from Cloudflare
+- [ ] Remove `cloudflareapi/` package from CP (Worker replaces it)
+- [ ] Update `docs/runbooks/saas-secrets.md` with Worker secrets
+
+### Success criteria for Phase 33
+
+- New org subdomain resolves instantly (zero DNS wait)
+- No NXDOMAIN caching — user never sees "site can't be reached"
+- Provisioning splash page shown while EC2 boots (auto-refreshes)
+- Cold start ~30s faster (no Caddy/Let's Encrypt)
+- Cost: Cloudflare Worker free tier or $5/mo
+
+---
+
 ## Infra footnote — Temporal

 `docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web
--- a/docs/architecture/wildcard-dns-proxy.md
+++ b/docs/architecture/wildcard-dns-proxy.md
@ -0,0 +1,232 @@
+# Wildcard DNS + Cloudflare Worker Proxy
+
+> **Status:** Planned — replaces per-tenant DNS record creation.
+>
+> **Problem:** When a user creates an org, we create an EC2 instance and a
+> Cloudflare A record pointing `<slug>.moleculesai.app` to the instance IP.
+> This causes 3-5 min of DNS propagation + NXDOMAIN caching by ISPs, meaning
+> users see "site can't be reached" for minutes after creating their org.
+>
+> **Solution:** Every SaaS (Vercel, Railway, Fly.io, WordPress, n8n) uses the
+> same pattern: wildcard DNS + a reverse proxy that routes by hostname.
+
+---
+
+## Architecture
+
+```
+Browser → https://acme.moleculesai.app
+          ↓
+   *.moleculesai.app DNS → Cloudflare (proxied, orange cloud)
+          ↓
+   Cloudflare Worker (edge, ~50ms)
+     1. Extract slug from hostname
+     2. Lookup backend IP from CP API (cached 60s)
+     3. If no backend → return "provisioning" splash page
+     4. Proxy request to EC2 instance
+          ↓
+   EC2 tenant (platform :8080, canvas :3000)
+```
+
+## Why this fixes the DNS problem
+
+| Before (per-tenant DNS) | After (wildcard + proxy) |
+|--------------------------|--------------------------|
+| Create A record per org | Wildcard `*.moleculesai.app` exists once, forever |
+| 3-5 min DNS propagation | Zero — wildcard already resolves |
+| NXDOMAIN cached by ISP for hours | Never happens — domain always resolves |
+| Let's Encrypt cert per EC2 (~30s) | Cloudflare handles TLS (wildcard or per-host, free) |
+| Caddy on each EC2 for HTTPS | Caddy only needed for local reverse proxy (HTTP, no TLS) |
+| DNS cleanup on org delete | No DNS records to clean up |
+
+## Components
+
+### 1. Cloudflare DNS (one-time setup)
+
+Add a single wildcard record in the Cloudflare dashboard:
+
+```
+Type: A
+Name: *
+Content: 0.0.0.0 (placeholder — Worker intercepts before it reaches this)
+Proxy: ON (orange cloud — routes through Cloudflare)
+TTL: Auto
+```
+
+The `0.0.0.0` content doesn't matter because the Worker intercepts every
+request before Cloudflare would try to connect to the origin. The orange
+cloud (proxy ON) is required for Workers to fire on the route.
+
+Also keep the explicit records for non-tenant subdomains:
+- `api.moleculesai.app` → Railway (control plane)
+- `app.moleculesai.app` → Vercel (customer dashboard)
+- `moleculesai.app` → Vercel (landing page)
+
+These explicit records take priority over the wildcard.
+
+### 2. Cloudflare Worker (~50 lines)
+
+The Worker runs on every request to `*.moleculesai.app` that isn't matched
+by an explicit DNS record. It:
+
+1. **Extracts the slug** from the `Host` header
+2. **Looks up the backend IP** using a 3-tier cache strategy:
+   - **L1: in-memory cache** (60s TTL) — fastest, per-isolate
+   - **L2: Workers KV** (5 min TTL, stale-while-revalidate) — survives isolate
+     restarts, shared across all edge locations
+   - **L3: CP API** — `GET https://api.moleculesai.app/cp/orgs/<slug>/instance`
+   - **Fallback:** if CP is unreachable, serve stale KV entry (any age) rather
+     than erroring. A 10-minute CP outage is invisible to tenants.
+   - If the org doesn't exist (404 from CP, no KV entry) → 404 page
+   - If the org is provisioning (no IP yet) → return a static "provisioning" HTML page
+3. **Proxies the request** to `http://<ec2-ip>:8080` (platform) or `:3000` (canvas)
+   - Route: `/health`, `/workspaces*`, `/registry*`, etc. → `:8080`
+   - Route: everything else → `:3000`
+   - Route: `/ws` → `:8080` with WebSocket upgrade (see WebSocket section below)
+   - Injects `X-Molecule-Org-Id` header (same as Caddy does today)
+   - Injects `Origin` header for AdminAuth bypass
+   - Injects `X-Forwarded-For` with client IP from `CF-Connecting-IP`
+   - Injects `X-Forwarded-Proto: https`
+4. **Returns the response** to the browser with Cloudflare's TLS
+
+#### WebSocket proxying
+
+Cloudflare Workers support WebSocket proxying via the `upgradeHeader` check.
+The Worker detects `Upgrade: websocket` on incoming requests and passes them
+through to the EC2 backend on `:8080/ws`. The Worker acts as a transparent
+tunnel — it does not inspect or buffer WebSocket frames.
+
+```js
+// Simplified WebSocket handling in the Worker
+if (request.headers.get('Upgrade') === 'websocket') {
+  return fetch(`http://${backendIp}:8080${url.pathname}`, request);
+}
+```
+
+If Workers WebSocket proxying proves unreliable in production (frame drops,
+idle timeout issues), Phase 33.3 keeps Caddy as a thin WSocket-only reverse
+proxy on EC2 instead of removing it entirely.
+
+#### Trusted proxy configuration
+
+The platform's Gin server uses `SetTrustedProxies(nil)` (trust all) by
+default. When requests come through the Worker instead of directly, the
+platform should trust `CF-Connecting-IP` for the real client IP. In
+production, set `TRUSTED_PROXIES` to Cloudflare's published IP ranges
+(auto-updated from `https://api.cloudflare.com/client/v4/ips`).
+
+### 3. CP API endpoint: `GET /cp/orgs/:slug/instance`
+
+New public endpoint (no auth — needed by the Worker which has no session):
+
+```json
+// GET /cp/orgs/acme/instance
+// 200 when running:
+{
+  "slug": "acme",
+  "status": "running",
+  "ip": "18.220.182.88",
+  "region": "us-east-2"
+}
+
+// 200 when provisioning:
+{
+  "slug": "acme",
+  "status": "provisioning",
+  "ip": null
+}
+
+// 404 when org doesn't exist
+```
+
+**Security note:** This endpoint exposes the EC2 IP for a given slug. This is
+equivalent to what DNS already exposes (A record → IP). No secrets are leaked.
+The endpoint should be rate-limited to prevent enumeration.
+
+### 4. EC2 tenant changes
+
+With Cloudflare handling TLS, the EC2 instance no longer needs Caddy for HTTPS:
+
+**Before:**
+```
+Caddy (:443, auto Let's Encrypt) → platform (:8080) / canvas (:3000)
+```
+
+**After:**
+```
+Worker → EC2 :8080 (platform, direct HTTP)
+Worker → EC2 :3000 (canvas, direct HTTP)
+```
+
+Caddy can be removed from the EC2 user-data script for HTTP routing. If
+WebSocket proxying through Workers proves reliable, Caddy is fully removed.
+If not, Caddy stays as a thin WebSocket-only reverse proxy (no TLS, no
+HTTP routing — just `/ws` → `:8080`).
+
+The EC2 security group should allow inbound HTTP from Cloudflare IPs only
+(not public). **Automate the IP list** — Cloudflare publishes their ranges
+at `https://api.cloudflare.com/client/v4/ips`. Use a Lambda or cron to
+update the SG weekly. Do not hardcode the IP ranges.
+
+**Headers injected by Worker** (replaces Caddy's `header_up`):
+- `X-Molecule-Org-Id: <org-id>` — for TenantGuard
+- `Origin: https://<slug>.moleculesai.app` — for AdminAuth
+- `X-Forwarded-For: <client-ip>` — for rate limiting
+- `X-Forwarded-Proto: https` — so the platform knows the original scheme
+
+### 5. Provisioning splash page
+
+When the Worker detects `status: "provisioning"`, it returns a static HTML
+page with:
+- The Molecule AI logo
+- "Setting up your workspace..."
+- A progress animation
+- Auto-refresh every 5s (meta refresh or JS fetch)
+
+This replaces the molecule-app provisioning page for direct subdomain visits.
+The molecule-app provisioning page at `app.moleculesai.app/orgs/:slug/provisioning`
+continues to work as the primary flow (redirect after org creation).
+
+## Migration plan
+
+1. **Phase 1: Deploy Worker + wildcard DNS** (no tenant changes)
+   - Worker proxies to existing EC2 instances (Caddy still running)
+   - Both paths work: direct DNS (old A records) + Worker proxy (new)
+   - Verify Worker routing works for existing tenants
+
+2. **Phase 2: Stop creating per-tenant DNS records**
+   - Update CP provisioner to skip Cloudflare A record creation
+   - Remove Cloudflare DNS cleanup from deprovision
+   - Existing A records coexist with wildcard (explicit wins)
+
+3. **Phase 3: Remove Caddy from EC2 user-data**
+   - Worker handles TLS + routing
+   - EC2 runs platform on :8080 and canvas on :3000 (plain HTTP)
+   - Simpler boot script, ~30s faster cold start
+
+4. **Phase 4: Clean up old A records**
+   - Delete per-tenant A records (wildcard handles everything)
+   - Remove Cloudflare client from CP provisioner
+
+## Cost
+
+- Cloudflare Worker: free tier = 100k requests/day. Paid = $5/mo for 10M.
+- Wildcard DNS: free (Cloudflare).
+- Savings: no more per-instance Let's Encrypt, no Caddy install time.
+
+## Files to change
+
+| File | Change |
+|------|--------|
+| `molecule-controlplane/internal/provisioner/ec2.go` | Remove Cloudflare DNS creation, remove Caddy from user-data |
+| `molecule-controlplane/internal/cloudflareapi/dns.go` | Eventually removable (Worker replaces it) |
+| `molecule-controlplane/internal/handlers/orgs.go` | Add `GET /cp/orgs/:slug/instance` endpoint |
+| New: `infra/cloudflare-worker/` | Worker source + wrangler.toml |
+| `docs/runbooks/saas-secrets.md` | Add Worker secrets (CF account ID, API token) |
+| `.github/workflows/deploy-worker.yml` | CI/CD for Worker deploys |
+
+## References
+
+- [Cloudflare Workers docs](https://developers.cloudflare.com/workers/)
+- [Vercel's routing architecture](https://vercel.com/docs/edge-network/overview) — same pattern
+- [Railway custom domains](https://docs.railway.app/guides/public-networking#custom-domains) — same pattern