Merge pull request #786 from Molecule-AI/docs/wildcard-dns-proxy

docs: wildcard DNS + Cloudflare Worker proxy architecture (Phase 33)
molecule-ai[bot] 2026-04-17 17:21:13 +00:00 committed by GitHub
commit a41a2ba663
3 changed files with 285 additions and 0 deletions

@@ -28,6 +28,12 @@ secrets` on `molecule-cp`), the correct rotation order, and danger cases —
notably `SECRETS_ENCRYPTION_KEY`, which cannot be rotated without a data
migration until Phase H lands KMS envelope encryption.
For tenant subdomain routing architecture (why `*.moleculesai.app` uses a
Cloudflare Worker instead of per-tenant DNS records), read
**`docs/architecture/wildcard-dns-proxy.md`**. This eliminates DNS
propagation delays and NXDOMAIN caching that previously caused "site can't
be reached" errors for new orgs.
When handling a GDPR erasure request (user asks "delete my org and all
my data"), read **`docs/runbooks/gdpr-erasure.md`** first. It explains the
4-step cascade in `molecule-controlplane` (Stripe → Redis → Infra → DB

PLAN.md

@@ -575,6 +575,53 @@ self-hosted per-customer). Ordered by dependency + ROI.
---
## Phase 33: Wildcard DNS + Cloudflare Worker Proxy
> **Goal:** Eliminate DNS propagation delays and NXDOMAIN caching for tenant
> subdomains. Many SaaS platforms (Vercel, Railway, Fly.io) use this pattern:
> wildcard DNS plus an edge proxy that routes by hostname.
>
> **Docs:** `docs/architecture/wildcard-dns-proxy.md`
### Phase 33.1 — Worker + wildcard DNS (no tenant changes)
- [ ] Create Cloudflare Worker that extracts slug from hostname, looks up
backend IP from CP API, proxies request to EC2
- [ ] Add `GET /cp/orgs/:slug/instance` endpoint to CP (public, rate-limited)
- [ ] Add `*.moleculesai.app` wildcard DNS record (proxied, orange cloud)
- [ ] Worker serves static "provisioning" splash page when tenant not ready
- [ ] Deploy Worker via `wrangler deploy` + GitHub Actions
- [ ] Verify Worker routing works for existing tenants alongside old A records
### Phase 33.2 — Stop per-tenant DNS records
- [ ] Remove Cloudflare A record creation from `ec2.go` provisioner
- [ ] Remove Cloudflare DNS cleanup from deprovision/purge cascade
- [ ] Existing A records coexist harmlessly (explicit wins over wildcard)
### Phase 33.3 — Remove Caddy from EC2
- [ ] Worker handles TLS termination — EC2 runs plain HTTP only
- [ ] Remove Caddy install + Caddyfile from EC2 user-data script
- [ ] EC2 security group: allow inbound HTTP from Cloudflare IPs only
- [ ] ~30s faster cold start (no apt-get caddy, no Let's Encrypt)
### Phase 33.4 — Cleanup
- [ ] Delete old per-tenant A records from Cloudflare
- [ ] Remove `cloudflareapi/` package from CP (Worker replaces it)
- [ ] Update `docs/runbooks/saas-secrets.md` with Worker secrets
### Success criteria for Phase 33
- New org subdomain resolves instantly (zero DNS wait)
- No NXDOMAIN caching — user never sees "site can't be reached"
- Provisioning splash page shown while EC2 boots (auto-refreshes)
- Cold start ~30s faster (no Caddy/Let's Encrypt)
- Cost: Cloudflare Worker free tier or $5/mo
---
## Infra footnote — Temporal
`docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web

@@ -0,0 +1,232 @@
# Wildcard DNS + Cloudflare Worker Proxy
> **Status:** Planned — replaces per-tenant DNS record creation.
>
> **Problem:** When a user creates an org, we create an EC2 instance and a
> Cloudflare A record pointing `<slug>.moleculesai.app` to the instance IP.
> This causes 3-5 min of DNS propagation + NXDOMAIN caching by ISPs, meaning
> users see "site can't be reached" for minutes after creating their org.
>
> **Solution:** Many SaaS platforms (Vercel, Railway, Fly.io, WordPress, n8n) use the
> same pattern: wildcard DNS + a reverse proxy that routes by hostname.
---
## Architecture
```
Browser → https://acme.moleculesai.app
*.moleculesai.app DNS → Cloudflare (proxied, orange cloud)
Cloudflare Worker (edge, ~50ms)
1. Extract slug from hostname
2. Lookup backend IP from CP API (cached 60s)
3. If no backend → return "provisioning" splash page
4. Proxy request to EC2 instance
EC2 tenant (platform :8080, canvas :3000)
```
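Step 1 of the diagram (extracting the slug from the hostname) could look roughly like the sketch below. The base domain comes from this document; the helper name `extractSlug`, the reserved-label list, and the slug character rules are assumptions for illustration, not the shipped Worker code.

```javascript
// Hypothetical helper. BASE_DOMAIN is from this document; RESERVED and the
// allowed slug characters are assumptions made for this sketch.
const BASE_DOMAIN = 'moleculesai.app';
const RESERVED = new Set(['api', 'app', 'www']);

function extractSlug(hostname) {
  // Normalize case and drop a trailing dot (fully qualified form).
  const host = hostname.toLowerCase().replace(/\.$/, '');
  const suffix = '.' + BASE_DOMAIN;
  if (!host.endsWith(suffix)) return null; // other zone, or the apex itself
  const slug = host.slice(0, -suffix.length);
  if (slug.includes('.')) return null; // assume slugs are a single DNS label
  if (!/^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(slug)) return null;
  return RESERVED.has(slug) ? null : slug;
}
```

In a Worker this would typically be called as `extractSlug(new URL(request.url).hostname)`, with a `null` result mapped to the 404 page.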
## Why this fixes the DNS problem
| Before (per-tenant DNS) | After (wildcard + proxy) |
|--------------------------|--------------------------|
| Create A record per org | Wildcard `*.moleculesai.app` exists once, forever |
| 3-5 min DNS propagation | Zero — wildcard already resolves |
| NXDOMAIN cached by ISP for hours | Never happens — domain always resolves |
| Let's Encrypt cert per EC2 (~30s) | Cloudflare handles TLS (wildcard or per-host, free) |
| Caddy on each EC2 for HTTPS | Caddy only needed for local reverse proxy (HTTP, no TLS) |
| DNS cleanup on org delete | No DNS records to clean up |
## Components
### 1. Cloudflare DNS (one-time setup)
Add a single wildcard record in the Cloudflare dashboard:
```
Type: A
Name: *
Content: 0.0.0.0 (placeholder — Worker intercepts before it reaches this)
Proxy: ON (orange cloud — routes through Cloudflare)
TTL: Auto
```
The `0.0.0.0` content doesn't matter because the Worker intercepts every
request before Cloudflare would try to connect to the origin. The orange
cloud (proxy ON) is required for Workers to fire on the route.
Also keep the explicit records for non-tenant subdomains:
- `api.moleculesai.app` → Railway (control plane)
- `app.moleculesai.app` → Vercel (customer dashboard)
- `moleculesai.app` → Vercel (landing page)
These explicit records take priority over the wildcard.
### 2. Cloudflare Worker (~50 lines)
The Worker runs on every request to `*.moleculesai.app` that isn't matched
by an explicit DNS record. It:
1. **Extracts the slug** from the `Host` header
2. **Looks up the backend IP** using a 3-tier cache strategy:
- **L1: in-memory cache** (60s TTL) — fastest, per-isolate
- **L2: Workers KV** (5 min TTL, stale-while-revalidate) — survives isolate
restarts, shared across all edge locations
- **L3: CP API**: `GET https://api.moleculesai.app/cp/orgs/<slug>/instance`
- **Fallback:** if CP is unreachable, serve stale KV entry (any age) rather
than erroring. A 10-minute CP outage is invisible to tenants.
- If the org doesn't exist (404 from CP, no KV entry) → 404 page
- If the org is provisioning (no IP yet) → return a static "provisioning" HTML page
3. **Proxies the request** to `http://<ec2-ip>:8080` (platform) or `:3000` (canvas)
- Route: `/health`, `/workspaces*`, `/registry*`, etc. → `:8080`
- Route: everything else → `:3000`
- Route: `/ws` → `:8080` with WebSocket upgrade (see WebSocket section below)
- Injects `X-Molecule-Org-Id` header (same as Caddy does today)
- Injects `Origin` header for AdminAuth bypass
- Injects `X-Forwarded-For` with client IP from `CF-Connecting-IP`
- Injects `X-Forwarded-Proto: https`
4. **Returns the response** to the browser with Cloudflare's TLS
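The 3-tier lookup in step 2 can be sketched as follows. This is a minimal illustration, not the actual Worker: `kv` and `fetchFromCp` are injected stand-ins (assumptions) so the logic runs outside the Workers runtime, the cache entry shape `{ value, storedAt }` is invented for the example, and background revalidation of stale KV entries is omitted.

```javascript
// Sketch of the L1 -> L2 -> L3 lookup with stale fallback.
// `kv` (get/put) and `fetchFromCp` are hypothetical injected dependencies.
const L1_TTL_MS = 60_000;     // in-memory, per isolate
const L2_TTL_MS = 5 * 60_000; // KV freshness window
const l1 = new Map();         // slug -> { value, storedAt }

async function lookupInstance(slug, { kv, fetchFromCp, now = Date.now }) {
  const t = now();

  // L1: per-isolate memory.
  const hit = l1.get(slug);
  if (hit && t - hit.storedAt < L1_TTL_MS) return hit.value;

  // L2: KV entry, used directly while it is still fresh.
  const cached = await kv.get(slug); // { value, storedAt } or null
  if (cached && t - cached.storedAt < L2_TTL_MS) {
    l1.set(slug, { value: cached.value, storedAt: t });
    return cached.value;
  }

  // L3: control plane. `null` means the CP answered 404.
  try {
    const value = await fetchFromCp(slug);
    if (value === null) return null;
    const entry = { value, storedAt: t };
    l1.set(slug, entry);
    await kv.put(slug, entry);
    return value;
  } catch (err) {
    // CP unreachable: serve the stale KV entry, whatever its age.
    if (cached) return cached.value;
    throw err;
  }
}
```

Passing the dependencies in keeps the tier ordering testable with plain in-memory fakes; in a real Worker `kv` would wrap a Workers KV binding and `fetchFromCp` would wrap the `GET /cp/orgs/<slug>/instance` call.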
#### WebSocket proxying
Cloudflare Workers can proxy WebSockets: when an incoming request carries
`Upgrade: websocket`, passing it to `fetch()` forwards the upgrade to the
backend. The Worker detects the header and passes the request through to the
EC2 backend on `:8080/ws`. The Worker acts as a transparent tunnel: it does
not inspect or buffer WebSocket frames.
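```js
// Simplified WebSocket handling in the Worker.
// `backendIp` is the address resolved by the lookup step.
const url = new URL(request.url);
const upgrade = request.headers.get('Upgrade') || '';
if (upgrade.toLowerCase() === 'websocket') { // header value is case-insensitive
  return fetch(`http://${backendIp}:8080${url.pathname}${url.search}`, request);
}
```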
If Workers WebSocket proxying proves unreliable in production (frame drops,
idle timeout issues), Phase 33.3 keeps Caddy as a thin WebSocket-only reverse
proxy on EC2 instead of removing it entirely.
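Putting the routing rules together (`/ws` and the platform prefixes to `:8080`, everything else to `:3000`), the port selection might be sketched like this. The prefix list contains only the paths named in this document (the real list is longer, per the "etc."), and matching on path-segment boundaries is an assumption about what `/workspaces*` is meant to cover. The function names are hypothetical.

```javascript
// Prefixes named in this document; the real list is longer ("etc.").
const PLATFORM_PREFIXES = ['/health', '/workspaces', '/registry', '/ws'];
const PLATFORM_PORT = 8080;
const CANVAS_PORT = 3000;

// Assumption: a prefix matches the exact path or any path below it.
function matchesPrefix(pathname, prefix) {
  return pathname === prefix || pathname.startsWith(prefix + '/');
}

function pickPort(pathname) {
  const isPlatform = PLATFORM_PREFIXES.some((p) => matchesPrefix(pathname, p));
  return isPlatform ? PLATFORM_PORT : CANVAS_PORT;
}

// Build the upstream URL, preserving path and query string.
function upstreamUrl(requestUrl, backendIp) {
  const url = new URL(requestUrl);
  return `http://${backendIp}:${pickPort(url.pathname)}${url.pathname}${url.search}`;
}
```

For example, `upstreamUrl('https://acme.moleculesai.app/workspaces/1?x=2', '18.220.182.88')` yields `http://18.220.182.88:8080/workspaces/1?x=2`, while a request for `/` goes to port 3000.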
#### Trusted proxy configuration
The platform's Gin server calls `SetTrustedProxies(nil)` by default. In Gin
this disables proxy-header trust entirely, so `ClientIP()` returns the TCP
peer address. When requests come through the Worker instead of directly, that
peer is a Cloudflare edge address, so the platform should read the real client
IP from `CF-Connecting-IP`. In production, set `TRUSTED_PROXIES` to
Cloudflare's published IP ranges (auto-updated from
`https://api.cloudflare.com/client/v4/ips`).
### 3. CP API endpoint: `GET /cp/orgs/:slug/instance`
New public endpoint (no auth — needed by the Worker which has no session):
```json
// GET /cp/orgs/acme/instance
// 200 when running:
{
"slug": "acme",
"status": "running",
"ip": "18.220.182.88",
"region": "us-east-2"
}
// 200 when provisioning:
{
"slug": "acme",
"status": "provisioning",
"ip": null
}
// 404 when org doesn't exist
```
**Security note:** This endpoint exposes the EC2 IP for a given slug. This is
equivalent to what DNS already exposes (A record → IP). No secrets are leaked.
The endpoint should be rate-limited to prevent enumeration.
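On the Worker side, the three response shapes above map onto three outcomes. A minimal sketch of that mapping follows; the function name and the `action` labels are hypothetical, and treating any 200 response without an IP as "show the splash page" is an assumption.

```javascript
// Map a CP response (HTTP status + parsed JSON body) to a Worker action.
// Action names are illustrative, not part of the CP API.
function classifyInstance(httpStatus, body) {
  if (httpStatus === 404) return { action: 'not_found' };
  if (httpStatus !== 200 || !body) return { action: 'error' };
  if (body.status === 'running' && typeof body.ip === 'string' && body.ip !== '') {
    return { action: 'proxy', ip: body.ip };
  }
  // Assumption: "provisioning", or any 200 without an IP, gets the splash page.
  return { action: 'splash' };
}
```

The `error` branch is where the stale-cache fallback described for the Worker would apply, since it covers CP failures rather than a definitive "org does not exist".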
### 4. EC2 tenant changes
With Cloudflare handling TLS, the EC2 instance no longer needs Caddy for HTTPS:
**Before:**
```
Caddy (:443, auto Let's Encrypt) → platform (:8080) / canvas (:3000)
```
**After:**
```
Worker → EC2 :8080 (platform, direct HTTP)
Worker → EC2 :3000 (canvas, direct HTTP)
```
Caddy can be removed from the EC2 user-data script for HTTP routing. If
WebSocket proxying through Workers proves reliable, Caddy is fully removed.
If not, Caddy stays as a thin WebSocket-only reverse proxy (no TLS, no
HTTP routing, just `/ws` → `:8080`).
The EC2 security group should allow inbound HTTP from Cloudflare IPs only
(not public). **Automate the IP list** — Cloudflare publishes their ranges
at `https://api.cloudflare.com/client/v4/ips`. Use a Lambda or cron to
update the SG weekly. Do not hardcode the IP ranges.
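The weekly update reduces to a set difference between the published ranges and what the security group currently allows. The sketch below covers only that planning step. It assumes the endpoint's JSON carries the IPv4 ranges under `result.ipv4_cidrs` (verify against the live response before relying on it), and the function name is hypothetical; applying the plan through the AWS API is left out.

```javascript
// Compute which CIDRs to add to / remove from the security group.
// Assumption: the Cloudflare response has the shape { result: { ipv4_cidrs: [...] } }.
function planSecurityGroupUpdate(cloudflareResponse, currentCidrs) {
  const published = new Set(cloudflareResponse.result.ipv4_cidrs);
  if (published.size === 0) {
    // Refuse to plan a change that would revoke every rule.
    throw new Error('empty Cloudflare IP list; refusing to update');
  }
  const current = new Set(currentCidrs);
  return {
    toAuthorize: [...published].filter((c) => !current.has(c)).sort(),
    toRevoke: [...current].filter((c) => !published.has(c)).sort(),
  };
}
```

Authorizing new ranges before revoking old ones avoids a window in which legitimate Cloudflare traffic is blocked.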
**Headers injected by Worker** (replaces Caddy's `header_up`):
- `X-Molecule-Org-Id: <org-id>` — for TenantGuard
- `Origin: https://<slug>.moleculesai.app` — for AdminAuth
- `X-Forwarded-For: <client-ip>` — for rate limiting
- `X-Forwarded-Proto: https` — so the platform knows the original scheme
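The header injection listed above could be sketched as below, using the standard `Headers` class. The function name is hypothetical, and where `orgId` comes from is an assumption: the instance endpoint example in this document does not return an org ID, so the Worker would need it from some other source.

```javascript
// Build the headers sent to the EC2 backend.
// `orgId` is assumed to be available to the Worker; its source is not
// specified in this document.
function buildUpstreamHeaders(incoming, { slug, orgId }) {
  const headers = new Headers(incoming); // copy; leave the original untouched
  const clientIp = incoming.get('CF-Connecting-IP');

  // set() replaces any client-supplied value, so these cannot be spoofed.
  headers.set('X-Molecule-Org-Id', orgId);
  headers.set('Origin', `https://${slug}.moleculesai.app`);
  headers.set('X-Forwarded-Proto', 'https');
  if (clientIp) {
    headers.set('X-Forwarded-For', clientIp);
  } else {
    headers.delete('X-Forwarded-For');
  }
  return headers;
}
```

Using `set()` rather than `append()` is the important design choice here: a browser that sends its own `X-Molecule-Org-Id` or `X-Forwarded-For` has those values overwritten before the request reaches the tenant.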
### 5. Provisioning splash page
When the Worker detects `status: "provisioning"`, it returns a static HTML
page with:
- The Molecule AI logo
- "Setting up your workspace..."
- A progress animation
- Auto-refresh every 5s (meta refresh or JS fetch)
This replaces the molecule-app provisioning page for direct subdomain visits.
The molecule-app provisioning page at `app.moleculesai.app/orgs/:slug/provisioning`
continues to work as the primary flow (redirect after org creation).
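A bare-bones version of the splash page, using the meta-refresh option, might look like the sketch below. The function names and markup are illustrative only; the logo and progress animation are omitted.

```javascript
// Escape text before interpolating it into HTML.
function escapeHtml(text) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(text).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Render the static provisioning page with a meta refresh.
function renderSplash(slug, refreshSeconds = 5) {
  return `<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta http-equiv="refresh" content="${refreshSeconds}">
  <title>Setting up ${escapeHtml(slug)}</title>
</head>
<body>
  <h1>Setting up your workspace...</h1>
</body>
</html>`;
}
```

In the Worker this string would be returned in a `Response` with `Content-Type: text/html; charset=utf-8`; sending `Cache-Control: no-store` as well is a reasonable precaution so the refresh is never answered from a cache.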
## Migration plan
1. **Phase 1: Deploy Worker + wildcard DNS** (no tenant changes)
- Worker proxies to existing EC2 instances (Caddy still running)
- Both paths work: direct DNS (old A records) + Worker proxy (new)
- Verify Worker routing works for existing tenants
2. **Phase 2: Stop creating per-tenant DNS records**
- Update CP provisioner to skip Cloudflare A record creation
- Remove Cloudflare DNS cleanup from deprovision
- Existing A records coexist with wildcard (explicit wins)
3. **Phase 3: Remove Caddy from EC2 user-data**
- Worker handles TLS + routing
- EC2 runs platform on :8080 and canvas on :3000 (plain HTTP)
- Simpler boot script, ~30s faster cold start
4. **Phase 4: Clean up old A records**
- Delete per-tenant A records (wildcard handles everything)
- Remove Cloudflare client from CP provisioner
## Cost
- Cloudflare Worker: free tier = 100k requests/day. Paid plan = $5/mo, which includes 10M requests/month.
- Wildcard DNS: free (Cloudflare).
- Savings: no more per-instance Let's Encrypt, no Caddy install time.
## Files to change
| File | Change |
|------|--------|
| `molecule-controlplane/internal/provisioner/ec2.go` | Remove Cloudflare DNS creation, remove Caddy from user-data |
| `molecule-controlplane/internal/cloudflareapi/dns.go` | Eventually removable (Worker replaces it) |
| `molecule-controlplane/internal/handlers/orgs.go` | Add `GET /cp/orgs/:slug/instance` endpoint |
| New: `infra/cloudflare-worker/` | Worker source + wrangler.toml |
| `docs/runbooks/saas-secrets.md` | Add Worker secrets (CF account ID, API token) |
| `.github/workflows/deploy-worker.yml` | CI/CD for Worker deploys |
## References
- [Cloudflare Workers docs](https://developers.cloudflare.com/workers/)
- [Vercel's routing architecture](https://vercel.com/docs/edge-network/overview) — same pattern
- [Railway custom domains](https://docs.railway.app/guides/public-networking#custom-domains) — same pattern