From aee4359dddde57dd7f9a102c07306e1cb99c2777 Mon Sep 17 00:00:00 2001
From: "molecule-ai[bot]" <276602405+molecule-ai[bot]@users.noreply.github.com>
Date: Tue, 21 Apr 2026 07:49:52 +0000
Subject: [PATCH] docs: add docs/architecture/wildcard-dns-proxy.md

---
 .../docs/architecture/wildcard-dns-proxy.md   | 232 ++++++++++++++++++
 1 file changed, 232 insertions(+)
 create mode 100644 content/docs/architecture/wildcard-dns-proxy.md
diff --git a/content/docs/architecture/wildcard-dns-proxy.md b/content/docs/architecture/wildcard-dns-proxy.md
new file mode 100644
index 0000000..123baa7
--- /dev/null
+++ b/content/docs/architecture/wildcard-dns-proxy.md
@@ -0,0 +1,232 @@
+# Wildcard DNS + Cloudflare Worker Proxy
+
+> **Status:** Planned — replaces per-tenant DNS record creation.
+>
+> **Problem:** When a user creates an org, we create an EC2 instance and a
+> Cloudflare A record pointing `<slug>.moleculesai.app` to the instance IP.
+> This causes 3-5 min of DNS propagation + NXDOMAIN caching by ISPs, meaning
+> users see "site can't be reached" for minutes after creating their org.
+>
+> **Solution:** Multi-tenant SaaS platforms (Vercel, Railway, Fly.io, WordPress,
+> n8n) commonly use the same pattern: wildcard DNS plus a reverse proxy that routes by hostname.
+
+---
+
+## Architecture
+
+```
+Browser → https://acme.moleculesai.app
+          ↓
+   *.moleculesai.app DNS → Cloudflare (proxied, orange cloud)
+          ↓
+   Cloudflare Worker (edge, ~50ms)
+     1. Extract slug from hostname
+     2. Lookup backend IP from CP API (cached 60s)
+     3. If no backend → return "provisioning" splash page
+     4. Proxy request to EC2 instance
+          ↓
+   EC2 tenant (platform :8080, canvas :3000)
+```
+
+## Why this fixes the DNS problem
+
+| Before (per-tenant DNS) | After (wildcard + proxy) |
+|--------------------------|--------------------------|
+| Create A record per org | Wildcard `*.moleculesai.app` exists once, forever |
+| 3-5 min DNS propagation | Zero — wildcard already resolves |
+| NXDOMAIN cached by ISP for hours | Never happens — domain always resolves |
+| Let's Encrypt cert per EC2 (~30s) | Cloudflare handles TLS (wildcard or per-host, free) |
+| Caddy on each EC2 for HTTPS | Caddy removed, or kept only as a local WebSocket reverse proxy (HTTP, no TLS) |
+| DNS cleanup on org delete | No DNS records to clean up |
+
+## Components
+
+### 1. Cloudflare DNS (one-time setup)
+
+Add a single wildcard record in the Cloudflare dashboard:
+
+```
+Type: A
+Name: *
+Content: 0.0.0.0 (placeholder — Worker intercepts before it reaches this)
+Proxy: ON (orange cloud — routes through Cloudflare)
+TTL: Auto
+```
+
+The `0.0.0.0` content doesn't matter because the Worker intercepts every
+request before Cloudflare would try to connect to the origin. The orange
+cloud (proxy ON) is required for Workers to fire on the route.
+
+Also keep the explicit records for non-tenant subdomains:
+- `api.moleculesai.app` → Railway (control plane)
+- `app.moleculesai.app` → Vercel (customer dashboard)
+- `moleculesai.app` → Vercel (landing page)
+
+These explicit records take priority over the wildcard.
+
+### 2. Cloudflare Worker (~50 lines)
+
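+The Worker runs on requests that match its route (`*.moleculesai.app/*`). Worker
+routes match on hostname, not on DNS records, so the explicit subdomains above
+(`api`, `app`) still need to be excluded from the route or passed through if they
+are proxied through Cloudflare. For tenant hostnames, the Worker: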
+
+1. **Extracts the slug** from the `Host` header
+2. **Looks up the backend IP** using a 3-tier cache, tried in order:
+   - **L1: in-memory cache** (60s TTL): fastest, per-isolate
+   - **L2: Workers KV** (5 min freshness window, stale-while-revalidate): survives
+     isolate restarts, shared across edge locations (eventually consistent)
+   - **L3: CP API**: `GET https://api.moleculesai.app/cp/orgs/<slug>/instance`
+   - **Fallback:** if CP is unreachable, serve the stale KV entry (any age) rather
+     than erroring, so a 10-minute CP outage is invisible to tenants already in KV.
+     This requires KV entries to be written without a hard expiry.
+   - If the org doesn't exist (404 from CP, no KV entry) → 404 page
+   - If the org is provisioning (no IP yet) → return a static "provisioning" HTML page
+3. **Proxies the request** to `http://<ec2-ip>:8080` (platform) or `:3000` (canvas)
+   - Route: `/health`, `/workspaces*`, `/registry*`, etc. → `:8080`
+   - Route: `/ws` → `:8080` with WebSocket upgrade (see WebSocket section below)
+   - Route: everything else → `:3000`
+   - Injects `X-Molecule-Org-Id` header (same as Caddy does today)
+   - Injects `Origin` header for AdminAuth bypass
+   - Injects `X-Forwarded-For` with client IP from `CF-Connecting-IP`
+   - Injects `X-Forwarded-Proto: https`
+4. **Returns the response** to the browser with Cloudflare's TLS
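+
+Steps 1 and 3 can be sketched as two small pure functions. This is a minimal
+sketch, not the Worker's actual source: `extractSlug`, `backendPort`, and the
+`RESERVED` set are hypothetical names, and the prefix list is partial because the
+full set of platform routes is not spelled out here. It also assumes prefixes
+match on whole path segments.
+
+```js
+// Hypothetical helpers for slug extraction and port selection.
+const BASE_DOMAIN = 'moleculesai.app';
+// Subdomains that have their own explicit DNS records and are not tenants.
+const RESERVED = new Set(['api', 'app']);
+// Partial list of path prefixes served by the platform on :8080 (assumption).
+const PLATFORM_PREFIXES = ['/health', '/workspaces', '/registry', '/ws'];
+
+// Returns the tenant slug for a hostname, or null if it is not a tenant host.
+function extractSlug(hostname) {
+  const host = hostname.toLowerCase();
+  const suffix = `.${BASE_DOMAIN}`;
+  if (!host.endsWith(suffix)) return null;
+  const slug = host.slice(0, -suffix.length);
+  // Reject nested subdomains, malformed labels, and reserved names.
+  if (!/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(slug)) return null;
+  return RESERVED.has(slug) ? null : slug;
+}
+
+// Picks the backend port: platform paths go to 8080, everything else to 3000.
+function backendPort(pathname) {
+  const isPlatform = PLATFORM_PREFIXES.some(
+    (prefix) => pathname === prefix || pathname.startsWith(`${prefix}/`)
+  );
+  return isPlatform ? 8080 : 3000;
+}
+```
+
+A hostname for which `extractSlug` returns `null` would get the 404 page without
+any CP lookup, which also keeps junk hostnames out of the cache.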
+
+#### WebSocket proxying
+
+Cloudflare Workers can proxy WebSockets: when the Worker sees `Upgrade: websocket`
+on an incoming request, it forwards the request with `fetch()` to the EC2 backend
+on `:8080/ws` and returns the resulting response. The Worker acts as a transparent
+tunnel and does not inspect or buffer WebSocket frames.
+
+```js
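+// Simplified WebSocket handling in the Worker.
+// `url` is `new URL(request.url)`; `backendIp` comes from the slug lookup.
+const upgrade = request.headers.get('Upgrade') || '';
+if (upgrade.toLowerCase() === 'websocket') {
+  // Passing the original request forwards its method and headers, including Upgrade.
+  return fetch(`http://${backendIp}:8080${url.pathname}${url.search}`, request);
+}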
+```
+
+If Workers WebSocket proxying proves unreliable in production (frame drops,
+idle timeout issues), Phase 33.3 keeps Caddy as a thin WebSocket-only reverse
+proxy on EC2 instead of removing it entirely.
+
+#### Trusted proxy configuration
+
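+The platform's Gin server calls `SetTrustedProxies(nil)` by default. In Gin this
+disables trust in forwarded headers (trust-all is what Gin does when the call is
+omitted), so `ClientIP()` would report the Cloudflare edge address. When requests
+come through the Worker instead of directly, the platform should take the real
+client IP from `CF-Connecting-IP`, for example via Gin's
+`TrustedPlatform = gin.PlatformCloudflare`. In production, set `TRUSTED_PROXIES`
+to Cloudflare's published IP ranges (auto-updated from
+`https://api.cloudflare.com/client/v4/ips`).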
+
+### 3. CP API endpoint: `GET /cp/orgs/:slug/instance`
+
+New public endpoint (no auth — needed by the Worker which has no session):
+
+```json
+// GET /cp/orgs/acme/instance
+// 200 when running:
+{
+  "slug": "acme",
+  "status": "running",
+  "ip": "<EC2_IP>",
+  "region": "us-east-2"
+}
+
+// 200 when provisioning:
+{
+  "slug": "acme",
+  "status": "provisioning",
+  "ip": null
+}
+
+// 404 when org doesn't exist
+```
+
+**Security note:** This endpoint exposes the EC2 IP for a given slug, which is
+what the per-tenant A record already exposes today. No secrets are leaked.
+The endpoint should be rate-limited to prevent slug enumeration.
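+
+How the Worker might turn these responses into a routing decision, including the
+stale-KV fallback described earlier, can be sketched as follows. This is an
+illustration under stated assumptions: `resolveBackend`, `fetchInstance`, and the
+`kind` values are hypothetical, `kv` is assumed to expose async `get(key)` and
+`put(key, value)` like a Workers KV namespace, and a 404 from CP is treated as
+authoritative. The in-memory tier is omitted for brevity.
+
+```js
+// Sketch only: dependencies are injected so the decision logic can be tested.
+// `fetchInstance(slug)` is expected to resolve to { status, body } or to throw.
+async function resolveBackend(slug, { fetchInstance, kv }) {
+  let res = null;
+  try {
+    res = await fetchInstance(slug);
+  } catch (err) {
+    res = null; // network failure: treat CP as unreachable
+  }
+
+  if (res && res.status === 200) {
+    if (res.body.status === 'running' && res.body.ip) {
+      // Store without a hard expiry so the entry can still serve as a stale fallback.
+      await kv.put(slug, JSON.stringify({ ip: res.body.ip, storedAt: Date.now() }));
+      return { kind: 'running', ip: res.body.ip };
+    }
+    return { kind: 'provisioning' };
+  }
+  if (res && res.status === 404) return { kind: 'not_found' };
+
+  // CP unreachable or returning an error: fall back to KV, whatever the entry's age.
+  const stale = await kv.get(slug);
+  if (stale) return { kind: 'running', ip: JSON.parse(stale).ip, stale: true };
+  return { kind: 'unavailable' };
+}
+```
+
+Keeping `storedAt` inside the value, rather than relying on a KV expiry, is one
+way to track the freshness window while leaving old entries readable during a
+CP outage.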
+
+### 4. EC2 tenant changes
+
+With Cloudflare handling TLS, the EC2 instance no longer needs Caddy for HTTPS:
+
+**Before:**
+```
+Caddy (:443, auto Let's Encrypt) → platform (:8080) / canvas (:3000)
+```
+
+**After:**
+```
+Worker → EC2 :8080 (platform, direct HTTP)
+Worker → EC2 :3000 (canvas, direct HTTP)
+```
+
+Caddy can be removed from the EC2 user-data script for HTTP routing. If
+WebSocket proxying through Workers proves reliable, Caddy is fully removed.
+If not, Caddy stays as a thin WebSocket-only reverse proxy (no TLS, no
+HTTP routing — just `/ws` → `:8080`).
+
+The EC2 security group should allow inbound HTTP from Cloudflare IPs only
+(not public). **Automate the IP list** — Cloudflare publishes their ranges
+at `https://api.cloudflare.com/client/v4/ips`. Use a Lambda or cron to
+update the SG weekly. Do not hardcode the IP ranges.
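+
+The core of such an updater is parsing the published ranges and computing what
+to change. The sketch below covers only that part and leaves the AWS calls out.
+It assumes the response shape `{ success, result: { ipv4_cidrs, ipv6_cidrs } }`;
+`extractCloudflareCidrs` and `diffCidrs` are hypothetical helper names.
+
+```js
+// Pulls the CIDR lists out of a parsed /client/v4/ips response (shape assumed).
+function extractCloudflareCidrs(payload) {
+  if (!payload || payload.success !== true || !payload.result) {
+    throw new Error('unexpected Cloudflare /ips response');
+  }
+  const { ipv4_cidrs: v4 = [], ipv6_cidrs: v6 = [] } = payload.result;
+  return { v4: [...v4].sort(), v6: [...v6].sort() };
+}
+
+// Returns which CIDRs to add and remove so the security group matches `desired`.
+function diffCidrs(current, desired) {
+  const cur = new Set(current);
+  const want = new Set(desired);
+  return {
+    add: desired.filter((cidr) => !cur.has(cidr)),
+    remove: current.filter((cidr) => !want.has(cidr)),
+  };
+}
+```
+
+Failing loudly on an unexpected payload matters here: silently applying an empty
+list would remove every ingress rule and take all tenants offline.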
+
+**Headers injected by Worker** (replaces Caddy's `header_up`):
+- `X-Molecule-Org-Id: <org-id>` — for TenantGuard
+- `Origin: https://<slug>.moleculesai.app` — for AdminAuth
+- `X-Forwarded-For: <client-ip>` — for rate limiting
+- `X-Forwarded-Proto: https` — so the platform knows the original scheme
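+
+A minimal sketch of the header injection, assuming the standard `Headers` API
+available in Workers. `buildUpstreamHeaders` is a hypothetical helper, and where
+`orgId` comes from is an assumption, since the instance response shown above
+does not include it.
+
+```js
+// Builds the headers sent to the EC2 backend from the incoming request headers.
+function buildUpstreamHeaders(incoming, { slug, orgId }) {
+  const headers = new Headers(incoming); // copy, so the original is untouched
+  // set() overwrites any client-supplied value, so these cannot be spoofed.
+  headers.set('X-Molecule-Org-Id', orgId);
+  headers.set('Origin', `https://${slug}.moleculesai.app`);
+  headers.set('X-Forwarded-Proto', 'https');
+
+  const clientIp = incoming.get('CF-Connecting-IP');
+  if (clientIp) {
+    headers.set('X-Forwarded-For', clientIp);
+  } else {
+    headers.delete('X-Forwarded-For'); // never pass through an unverified value
+  }
+  return headers;
+}
+```
+
+Overwriting rather than appending `X-Forwarded-For` is a deliberate choice in
+this sketch: the backend rate-limits on that header, so a client-supplied value
+should not survive the hop.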
+
+### 5. Provisioning splash page
+
+When the Worker detects `status: "provisioning"`, it returns a static HTML
+page with:
+- The Molecule AI logo
+- "Setting up your workspace..."
+- A progress animation
+- Auto-refresh every 5s (meta refresh or JS fetch)
+
+This replaces the molecule-app provisioning page for direct subdomain visits.
+The molecule-app provisioning page at `app.moleculesai.app/orgs/:slug/provisioning`
+continues to work as the primary flow (redirect after org creation).
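+
+A bare-bones version of the splash response could look like the sketch below. It
+uses the meta-refresh option and omits the logo and animation. The function
+names, the 503 status, and the `Retry-After` and `Cache-Control` headers are
+assumptions rather than decisions recorded in this design.
+
+```js
+// Builds the splash page HTML. Characters outside [a-zA-Z0-9-] are stripped
+// from the slug before it is interpolated into the markup.
+function provisioningHtml(slug) {
+  const safe = String(slug).replace(/[^a-zA-Z0-9-]/g, '');
+  return `<!doctype html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="refresh" content="5">
+  <title>Setting up ${safe}</title>
+</head>
+<body>
+  <h1>Setting up your workspace...</h1>
+</body>
+</html>`;
+}
+
+// Wraps the HTML in a Response that tells browsers and caches not to keep it.
+function provisioningResponse(slug) {
+  return new Response(provisioningHtml(slug), {
+    status: 503, // assumption: signals "not ready yet"
+    headers: {
+      'Content-Type': 'text/html; charset=utf-8',
+      'Cache-Control': 'no-store', // never let a cache pin the splash page
+      'Retry-After': '5',
+    },
+  });
+}
+```
+
+Whatever status is chosen, the page should not be cacheable, or the original
+problem of users seeing a stale failure state would return in a different form.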
+
+## Migration plan
+
+1. **Phase 1: Deploy Worker + wildcard DNS** (no tenant changes)
+   - Worker proxies to existing EC2 instances (Caddy still running)
+   - Both paths work: direct DNS (old A records) + Worker proxy (new)
+   - Verify Worker routing works for existing tenants
+
+2. **Phase 2: Stop creating per-tenant DNS records**
+   - Update CP provisioner to skip Cloudflare A record creation
+   - Remove Cloudflare DNS cleanup from deprovision
+   - Existing A records coexist with wildcard (explicit wins)
+
+3. **Phase 3: Remove Caddy from EC2 user-data**
+   - Worker handles TLS + routing
+   - EC2 runs platform on :8080 and canvas on :3000 (plain HTTP)
+   - Simpler boot script, ~30s faster cold start
+
+4. **Phase 4: Clean up old A records**
+   - Delete per-tenant A records (wildcard handles everything)
+   - Remove Cloudflare client from CP provisioner
+
+## Cost
+
+- Cloudflare Worker: the free tier covers 100k requests/day; the paid plan is $5/mo and includes 10M requests/month.
+- Wildcard DNS: free (Cloudflare).
+- Savings: no more per-instance Let's Encrypt, no Caddy install time.
+
+## Files to change
+
+| File | Change |
+|------|--------|
+| `the private control-plane repo/internal/provisioner/ec2.go` | Remove Cloudflare DNS creation, remove Caddy from user-data |
+| `the private control-plane repo/internal/cloudflareapi/dns.go` | Eventually removable (Worker replaces it) |
+| `the private control-plane repo/internal/handlers/orgs.go` | Add `GET /cp/orgs/:slug/instance` endpoint |
+| New: `Molecule-AI/molecule-tenant-proxy` (separate repo) | Worker source + wrangler.toml |
+| `docs/runbooks/saas-secrets.md` | Add Worker secrets (CF account ID, API token) |
+| `.github/workflows/deploy-worker.yml` | CI/CD for Worker deploys |
+
+## References
+
+- [Cloudflare Workers docs](https://developers.cloudflare.com/workers/)
+- [Vercel's routing architecture](https://vercel.com/docs/edge-network/overview) — same pattern
+- [Railway custom domains](https://docs.railway.app/guides/public-networking#custom-domains) — same pattern