- Remove compiled workspace-server/server binary from git - Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs - Fix CI workflow path filters (workspace-template → workspace) - Replace real EC2 IP and personal slug in test_saas_tenant.sh - Scrub molecule-controlplane references in docs - Fix stale workspace-template/ paths in provisioner, handlers, tests - Clean tracked Python cache files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8.9 KiB
Wildcard DNS + Cloudflare Worker Proxy
Status: Planned — replaces per-tenant DNS record creation.
Problem: When a user creates an org, we create an EC2 instance and a Cloudflare A record pointing
<slug>.moleculesai.appto the instance IP. This causes 3-5 min of DNS propagation + NXDOMAIN caching by ISPs, meaning users see "site can't be reached" for minutes after creating their org.Solution: Every SaaS (Vercel, Railway, Fly.io, WordPress, n8n) uses the same pattern: wildcard DNS + a reverse proxy that routes by hostname.
Architecture
Browser → https://acme.moleculesai.app
↓
*.moleculesai.app DNS → Cloudflare (proxied, orange cloud)
↓
Cloudflare Worker (edge, ~50ms)
1. Extract slug from hostname
2. Lookup backend IP from CP API (cached 60s)
3. If no backend → return "provisioning" splash page
4. Proxy request to EC2 instance
↓
EC2 tenant (platform :8080, canvas :3000)
Why this fixes the DNS problem
| Before (per-tenant DNS) | After (wildcard + proxy) |
|---|---|
| Create A record per org | Wildcard *.moleculesai.app exists once, forever |
| 3-5 min DNS propagation | Zero — wildcard already resolves |
| NXDOMAIN cached by ISP for hours | Never happens — domain always resolves |
| Let's Encrypt cert per EC2 (~30s) | Cloudflare handles TLS (wildcard or per-host, free) |
| Caddy on each EC2 for HTTPS | Caddy only needed for local reverse proxy (HTTP, no TLS) |
| DNS cleanup on org delete | No DNS records to clean up |
Components
1. Cloudflare DNS (one-time setup)
Add a single wildcard record in the Cloudflare dashboard:
Type: A
Name: *
Content: 0.0.0.0 (placeholder — Worker intercepts before it reaches this)
Proxy: ON (orange cloud — routes through Cloudflare)
TTL: Auto
The 0.0.0.0 content doesn't matter because the Worker intercepts every
request before Cloudflare would try to connect to the origin. The orange
cloud (proxy ON) is required for Workers to fire on the route.
Also keep the explicit records for non-tenant subdomains:
api.moleculesai.app→ Railway (control plane)app.moleculesai.app→ Vercel (customer dashboard)moleculesai.app→ Vercel (landing page)
These explicit records take priority over the wildcard.
2. Cloudflare Worker (~50 lines)
The Worker runs on every request to *.moleculesai.app that isn't matched
by an explicit DNS record. It:
- Extracts the slug from the
Hostheader - Looks up the backend IP using a 3-tier cache strategy:
- L1: in-memory cache (60s TTL) — fastest, per-isolate
- L2: Workers KV (5 min TTL, stale-while-revalidate) — survives isolate restarts, shared across all edge locations
- L3: CP API —
GET https://api.moleculesai.app/cp/orgs/<slug>/instance - Fallback: if CP is unreachable, serve stale KV entry (any age) rather than erroring. A 10-minute CP outage is invisible to tenants.
- If the org doesn't exist (404 from CP, no KV entry) → 404 page
- If the org is provisioning (no IP yet) → return a static "provisioning" HTML page
- Proxies the request to
http://<ec2-ip>:8080(platform) or:3000(canvas)- Route:
/health,/workspaces*,/registry*, etc. →:8080 - Route: everything else →
:3000 - Route:
/ws→:8080with WebSocket upgrade (see WebSocket section below) - Injects
X-Molecule-Org-Idheader (same as Caddy does today) - Injects
Originheader for AdminAuth bypass - Injects
X-Forwarded-Forwith client IP fromCF-Connecting-IP - Injects
X-Forwarded-Proto: https
- Route:
- Returns the response to the browser with Cloudflare's TLS
WebSocket proxying
Cloudflare Workers support WebSocket proxying via the upgradeHeader check.
The Worker detects Upgrade: websocket on incoming requests and passes them
through to the EC2 backend on :8080/ws. The Worker acts as a transparent
tunnel — it does not inspect or buffer WebSocket frames.
// Simplified WebSocket handling in the Worker
if (request.headers.get('Upgrade') === 'websocket') {
return fetch(`http://${backendIp}:8080${url.pathname}`, request);
}
If Workers WebSocket proxying proves unreliable in production (frame drops, idle timeout issues), Phase 33.3 keeps Caddy as a thin WSocket-only reverse proxy on EC2 instead of removing it entirely.
Trusted proxy configuration
The platform's Gin server uses SetTrustedProxies(nil) (trust all) by
default. When requests come through the Worker instead of directly, the
platform should trust CF-Connecting-IP for the real client IP. In
production, set TRUSTED_PROXIES to Cloudflare's published IP ranges
(auto-updated from https://api.cloudflare.com/client/v4/ips).
3. CP API endpoint: GET /cp/orgs/:slug/instance
New public endpoint (no auth — needed by the Worker which has no session):
// GET /cp/orgs/acme/instance
// 200 when running:
{
"slug": "acme",
"status": "running",
"ip": "<EC2_IP>",
"region": "us-east-2"
}
// 200 when provisioning:
{
"slug": "acme",
"status": "provisioning",
"ip": null
}
// 404 when org doesn't exist
Security note: This endpoint exposes the EC2 IP for a given slug. This is equivalent to what DNS already exposes (A record → IP). No secrets are leaked. The endpoint should be rate-limited to prevent enumeration.
4. EC2 tenant changes
With Cloudflare handling TLS, the EC2 instance no longer needs Caddy for HTTPS:
Before:
Caddy (:443, auto Let's Encrypt) → platform (:8080) / canvas (:3000)
After:
Worker → EC2 :8080 (platform, direct HTTP)
Worker → EC2 :3000 (canvas, direct HTTP)
Caddy can be removed from the EC2 user-data script for HTTP routing. If
WebSocket proxying through Workers proves reliable, Caddy is fully removed.
If not, Caddy stays as a thin WebSocket-only reverse proxy (no TLS, no
HTTP routing — just /ws → :8080).
The EC2 security group should allow inbound HTTP from Cloudflare IPs only
(not public). Automate the IP list — Cloudflare publishes their ranges
at https://api.cloudflare.com/client/v4/ips. Use a Lambda or cron to
update the SG weekly. Do not hardcode the IP ranges.
Headers injected by Worker (replaces Caddy's header_up):
X-Molecule-Org-Id: <org-id>— for TenantGuardOrigin: https://<slug>.moleculesai.app— for AdminAuthX-Forwarded-For: <client-ip>— for rate limitingX-Forwarded-Proto: https— so the platform knows the original scheme
5. Provisioning splash page
When the Worker detects status: "provisioning", it returns a static HTML
page with:
- The Molecule AI logo
- "Setting up your workspace..."
- A progress animation
- Auto-refresh every 5s (meta refresh or JS fetch)
This replaces the molecule-app provisioning page for direct subdomain visits.
The molecule-app provisioning page at app.moleculesai.app/orgs/:slug/provisioning
continues to work as the primary flow (redirect after org creation).
Migration plan
-
Phase 1: Deploy Worker + wildcard DNS (no tenant changes)
- Worker proxies to existing EC2 instances (Caddy still running)
- Both paths work: direct DNS (old A records) + Worker proxy (new)
- Verify Worker routing works for existing tenants
-
Phase 2: Stop creating per-tenant DNS records
- Update CP provisioner to skip Cloudflare A record creation
- Remove Cloudflare DNS cleanup from deprovision
- Existing A records coexist with wildcard (explicit wins)
-
Phase 3: Remove Caddy from EC2 user-data
- Worker handles TLS + routing
- EC2 runs platform on :8080 and canvas on :3000 (plain HTTP)
- Simpler boot script, ~30s faster cold start
-
Phase 4: Clean up old A records
- Delete per-tenant A records (wildcard handles everything)
- Remove Cloudflare client from CP provisioner
Cost
- Cloudflare Worker: free tier = 100k requests/day. Paid = $5/mo for 10M.
- Wildcard DNS: free (Cloudflare).
- Savings: no more per-instance Let's Encrypt, no Caddy install time.
Files to change
| File | Change |
|---|---|
the private control-plane repo/internal/provisioner/ec2.go |
Remove Cloudflare DNS creation, remove Caddy from user-data |
the private control-plane repo/internal/cloudflareapi/dns.go |
Eventually removable (Worker replaces it) |
the private control-plane repo/internal/handlers/orgs.go |
Add GET /cp/orgs/:slug/instance endpoint |
New: Molecule-AI/molecule-tenant-proxy (separate repo) |
Worker source + wrangler.toml |
docs/runbooks/saas-secrets.md |
Add Worker secrets (CF account ID, API token) |
.github/workflows/deploy-worker.yml |
CI/CD for Worker deploys |
References
- Cloudflare Workers docs
- Vercel's routing architecture — same pattern
- Railway custom domains — same pattern