docs: staging environment design + Phase 36 plan

Full staging environment that mirrors production. Every infra change
ships to staging before promotion to production. Gates Phase 33 (Tunnel)
and Phase 35 (security hardening).

Components: Railway staging env, Neon branch, staging DNS, tagged
Docker images, promotion workflow, automated smoke tests.

Also marks Phase 33 as migrating from Worker to Cloudflare Tunnel
(issue #933), prerequisite: staging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-17 20:37:11 -07:00
parent cb122c98e5
commit 2dbb59cb35
2 changed files with 267 additions and 5 deletions

PLAN.md (58 lines changed)

@@ -575,13 +575,61 @@ self-hosted per-customer). Ordered by dependency + ROI.
 ---
-## Phase 33: Wildcard DNS + Cloudflare Worker Proxy
+## Phase 36: Full Staging Environment — GATES ALL INFRA CHANGES
-> **Goal:** Eliminate DNS propagation delays and NXDOMAIN caching for tenant
-> subdomains. Every SaaS (Vercel, Railway, Fly.io) uses this pattern —
-> wildcard DNS + edge proxy routing by hostname.
+> **Goal:** Stop merging untested infra changes to production. Every change
+> ships to staging first, gets verified, then promotes to production.
 >
-> **Docs:** `docs/architecture/wildcard-dns-proxy.md`
+> **Why now:** The 2026-04-17 session broke CI twice and caused hours of
+> edge cache issues because there was no staging to catch regressions.
+> This gates Phase 33 (Tunnel migration) and Phase 35 (security hardening).
+>
+> **Docs:** `docs/architecture/staging-environment.md`
+### Phase 36.1 — Railway + Neon staging
+- [ ] Create Railway `staging` environment with staging-specific vars
+- [ ] Create Neon staging branch from main
+- [ ] Add `staging.api.moleculesai.app` CNAME to Railway staging
+- [ ] Verify CP deploys and boots on staging
+### Phase 36.2 — Image + deploy pipeline
+- [ ] Publish workflow pushes `:staging` tag (not `:latest`) on main merge
+- [ ] Add `promote-to-production.yml` workflow (manual trigger)
+- [ ] Promotion: retag `:staging` → `:latest`, deploy CP to production
+- [ ] Production tenants auto-update via Option B cron
+### Phase 36.3 — Staging DNS + Vercel
+- [ ] `*.staging.moleculesai.app` for staging tenant subdomains
+- [ ] `staging.app.moleculesai.app` for Vercel staging preview
+- [ ] Staging Cloudflare Tunnel (or Worker) for tenant routing
+### Phase 36.4 — Automated verification
+- [ ] Post-deploy staging smoke test (run `test_saas_tenant.sh`)
+- [ ] Block promotion if smoke test fails
+- [ ] Slack/GitHub notification on staging deploy + promotion
+### Success criteria for Phase 36
+- No infra change reaches production without passing staging first
+- Staging mirrors production (same services, same auth, separate data)
+- Promotion is a single manual action (button click or CLI command)
+- Staging cleanup is automated (terminate test EC2s after verification)
+---
+## Phase 33: Tenant Subdomain Routing — MIGRATING TO CLOUDFLARE TUNNEL
+> **Original:** Wildcard DNS + Cloudflare Worker (implemented 2026-04-17).
+> **Replacing with:** Cloudflare Tunnel per tenant (issue #933).
+> Worker approach caused edge cache poisoning + security gaps (ADMIN_TOKEN
+> in plaintext, unencrypted HTTP). Tunnel eliminates all of these.
+> **Docs:** `docs/architecture/wildcard-dns-proxy.md` (original),
+> issue #933 (tunnel migration plan).
+> **Prerequisite:** Phase 36 (staging) — test tunnel on staging first.
 ### Phase 33.1 — Worker + wildcard DNS (no tenant changes)


@@ -0,0 +1,214 @@
# Staging Environment Design
> **Status:** Planned — gates all future infra changes (Tunnel migration,
> security fixes, etc.)
>
> **Problem:** We merge directly to main and auto-deploy to production.
> The 2026-04-17 session broke CI twice and caused hours of Cloudflare edge cache
> issues because there was no staging to test infra changes first.
>
> **Goal:** Full staging environment that mirrors production. Every change
> ships to staging first, gets verified, then promotes to production.
---
## Architecture
```
                 staging                        production
                 ───────                        ──────────
Git branch:      main (auto-deploy)             main (manual promote)
                 or staging branch
CP (Railway):    staging service                production service
                 staging.api.moleculesai.app    api.moleculesai.app
Tenant EC2s:     staging EC2 instances          production EC2 instances
                 *.staging.moleculesai.app      *.moleculesai.app
App (Vercel):    staging.app.moleculesai.app    app.moleculesai.app
                 (Vercel preview)               (Vercel production)
DB (Neon):       staging branch                 main branch
                 (or separate project)
Docker images:   platform-tenant:staging        platform-tenant:latest
                 (GHCR)                         (GHCR)
Cloudflare:      *.staging.moleculesai.app      *.moleculesai.app
                 (separate tunnel/worker)       (tunnel per tenant)
```
## Deploy flow
```
Developer pushes to PR branch
→ CI runs (tests, build, lint)
→ PR merged to main
→ Auto-deploy to STAGING
→ Staging smoke tests (automated)
→ Manual verification if needed
→ Promote to PRODUCTION (manual trigger or approval)
```
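
The gate between "Auto-deploy to STAGING" and "Promote to PRODUCTION" could be sketched roughly as follows. This is an illustration under assumptions, not the project's actual tooling: the `/health` path, the function names `wait_for_url` and `staging_gate`, and the `GATE_ATTEMPTS` / `GATE_DELAY` variables are all hypothetical. Only the smoke test path and the staging API hostname come from this document.

```bash
# Sketch of a post-deploy staging gate. The /health path, function names and
# retry defaults are assumptions for illustration only.

# Poll a URL until curl succeeds or the attempts run out.
# Usage: wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_url() {
  local url="$1" attempts="${2:-30}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failures, -sS: quiet but still print errors
    if curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "ready after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s): $url" >&2
  return 1
}

# Run the smoke test only once staging is reachable. Any failure returns
# non-zero, so a CI job that calls this function blocks promotion.
staging_gate() {
  local base_url="${BASE_URL:-https://staging.api.moleculesai.app}"
  wait_for_url "$base_url/health" "${GATE_ATTEMPTS:-30}" "${GATE_DELAY:-10}" || return 1
  TENANT_SLUG="${TENANT_SLUG:-smoke-test}" BASE_URL="$base_url" \
    bash tests/e2e/test_saas_tenant.sh
}
```

Polling before the smoke test keeps a slow Railway boot from being reported as a test failure, while a deploy that never comes up still fails the gate.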
## Components
### 1. Railway: two environments
Railway supports multiple environments per project. Create a `staging`
environment alongside `production`:
```bash
railway environment new staging
railway variables --environment staging --set "DATABASE_URL=<staging-neon>"
railway variables --environment staging --set "MOLECULE_ENV=staging"
# ... all other vars with staging-specific values
```
**Deploy trigger:**
- `staging`: auto-deploy on push to main
- `production`: manual promote via `railway up --environment production`
or GitHub Actions workflow_dispatch
**Domains:**
- staging: `staging.api.moleculesai.app` (Railway custom domain)
- production: `api.moleculesai.app` (unchanged)
### 2. Neon: branch per environment
Neon supports database branches (like git branches):
```bash
# Create staging branch from main
neon branch create --project-id <id> --name staging --parent main
```
- Staging DB has same schema, separate data
- Can reset staging by re-branching from main
- Production data never touched by staging tests
### 3. Vercel: preview deployments
Vercel already supports this natively:
- Push to main → deploys to `app.moleculesai.app` (production)
- Push to `staging` branch → deploys to preview URL
**Or** use Vercel environments:
- `staging.app.moleculesai.app` → staging deployment
- `app.moleculesai.app` → production deployment
### 4. GHCR: tagged images
```
platform-tenant:staging — built on every push to main
platform-tenant:latest — promoted from staging after verification
platform-tenant:sha-xxxxx — immutable, pinned to specific commit
```
**Publish workflow change:**
```yaml
# Current: pushes :latest on every main merge
# New: pushes :staging on every main merge
# pushes :latest only on manual promote
```
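
One way the publish step could derive its tags is sketched below. The function name `build_tags` and the 7-character short SHA are assumptions (the document only shows `sha-xxxxx` as a placeholder); the registry path is the one used in the promotion workflow.

```bash
# Sketch: compute the GHCR tags for one main-branch build.
# build_tags and the 7-character short SHA are hypothetical choices.

IMAGE="ghcr.io/molecule-ai/platform-tenant"

# Print one tag per line for the given commit SHA.
# :latest is deliberately absent: only promotion may move it.
build_tags() {
  local sha="$1"
  # Accept only lowercase hex characters ...
  case "$sha" in
    "" | *[!0-9a-f]*)
      echo "build_tags: '$sha' is not a git SHA" >&2
      return 1 ;;
  esac
  # ... and only 7 to 40 of them.
  if [ "${#sha}" -lt 7 ] || [ "${#sha}" -gt 40 ]; then
    echo "build_tags: '$sha' is not a git SHA" >&2
    return 1
  fi
  echo "${IMAGE}:staging"
  echo "${IMAGE}:sha-$(printf '%s' "$sha" | cut -c1-7)"
}
```

Keeping `:latest` out of this function means a main merge cannot move the production tag by accident; only the promotion path does that.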
### 5. Cloudflare: staging subdomain
- Option A (simple): `*.staging.moleculesai.app` with its own tunnel/worker
- Option B (full): separate Cloudflare zone for staging (overkill)

Recommend Option A:
- Add `staging.moleculesai.app` DNS records
- Staging tenants get `slug.staging.moleculesai.app` subdomains
- Production tenants get `slug.moleculesai.app` (unchanged)
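
The hostname scheme under Option A can be sketched as a small helper. `tenant_host` is a hypothetical function, not something in the repo, and the reserved-slug list is an assumption drawn from the hostnames in this document: a production tenant whose slug is `staging`, `api`, or `app` would resolve to a platform hostname instead of a tenant.

```bash
# Sketch: derive a tenant hostname from its slug and environment.
# tenant_host is a hypothetical helper; the reserved list is an assumption.

# Usage: tenant_host SLUG [staging|production]  (defaults to $MOLECULE_ENV)
tenant_host() {
  local slug="$1" env="${2:-${MOLECULE_ENV:-production}}"
  case "$slug" in
    # A slug must be one DNS label: lowercase letters, digits, inner hyphens.
    "" | *[!a-z0-9-]* | -* | *-)
      echo "tenant_host: invalid slug '$slug'" >&2
      return 1 ;;
    # These would collide with staging.*, api.* and app.moleculesai.app.
    staging | api | app)
      echo "tenant_host: slug '$slug' collides with a platform hostname" >&2
      return 1 ;;
  esac
  if [ "${#slug}" -gt 63 ]; then
    echo "tenant_host: slug longer than 63 characters" >&2
    return 1
  fi
  case "$env" in
    staging)    echo "$slug.staging.moleculesai.app" ;;
    production) echo "$slug.moleculesai.app" ;;
    *) echo "tenant_host: unknown environment '$env'" >&2; return 1 ;;
  esac
}
```

For example, `tenant_host acme staging` prints `acme.staging.moleculesai.app`, and `tenant_host acme production` prints `acme.moleculesai.app`.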
### 6. EC2: staging tag
Staging EC2 instances tagged with `Environment=staging`:
- Separate from production instances in AWS console
- Can use different AMI, instance type, security group
- Easy to identify and clean up
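
Cleanup by tag could look like the sketch below. It assumes the AWS CLI is installed and has credentials for the right account and region; the function names and the `CONFIRM` variable are hypothetical.

```bash
# Sketch: find and terminate staging tenant instances by tag.
# Function names and the CONFIRM variable are assumptions.

# Print the IDs of pending or running instances tagged Environment=staging.
staging_instance_ids() {
  aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=staging" \
              "Name=instance-state-name,Values=pending,running" \
    --query "Reservations[].Instances[].InstanceId" \
    --output text
}

# Terminate them. Without CONFIRM=1 the request carries --dry-run, so AWS
# only checks permissions (the CLI reports the outcome as a DryRunOperation
# error) and nothing is terminated.
cleanup_staging_instances() {
  local ids dry_run="--dry-run"
  ids="$(staging_instance_ids)" || return 1
  if [ -z "$ids" ]; then
    echo "no staging instances to clean up"
    return 0
  fi
  if [ "${CONFIRM:-0}" = "1" ]; then dry_run=""; fi
  # $ids and $dry_run are unquoted on purpose so they split into words.
  aws ec2 terminate-instances --instance-ids $ids $dry_run
}
```

Defaulting to a dry run is a deliberate choice here: a mistyped tag filter or the wrong AWS profile costs nothing until `CONFIRM=1` is set.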
## Environment variables
| Variable | Staging | Production |
|----------|---------|------------|
| `MOLECULE_ENV` | `staging` | `production` |
| `DATABASE_URL` | Neon staging branch | Neon main branch |
| `TENANT_IMAGE` | `platform-tenant:staging` | `platform-tenant:latest` |
| `APP_DOMAIN` | `staging.moleculesai.app` | `moleculesai.app` |
| `CORS_ORIGINS` | `https://staging.app.moleculesai.app` | `https://app.moleculesai.app` |
| `ADMIN_TOKEN` | per-tenant (same mechanism) | per-tenant |
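
A boot-time guard against mixing the two columns might look like this sketch. `check_env` is a hypothetical helper; the expected values mirror the table above, and `DATABASE_URL` is left unchecked because its format is not specified here.

```bash
# Sketch: fail fast when staging and production settings are mixed.
# check_env is a hypothetical helper; expected values mirror the table.

check_env() {
  local env="${MOLECULE_ENV:-}" image="${TENANT_IMAGE:-}" domain="${APP_DOMAIN:-}"
  local want_tag want_domain status=0
  case "$env" in
    staging)    want_tag="staging"; want_domain="staging.moleculesai.app" ;;
    production) want_tag="latest";  want_domain="moleculesai.app" ;;
    *)
      echo "MOLECULE_ENV must be 'staging' or 'production', got '$env'" >&2
      return 1 ;;
  esac
  # ${image##*:} strips everything up to the last colon, leaving the tag.
  if [ "${image##*:}" != "$want_tag" ]; then
    echo "TENANT_IMAGE '$image' should use the :$want_tag tag in $env" >&2
    status=1
  fi
  if [ "$domain" != "$want_domain" ]; then
    echo "APP_DOMAIN '$domain' should be '$want_domain' in $env" >&2
    status=1
  fi
  return "$status"
}
```

Calling `check_env || exit 1` at the top of a deploy script would stop a staging control plane from launching tenants on the `:latest` image, or the reverse.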
## Promotion workflow
### Automated (CI/CD)
```yaml
# .github/workflows/promote-to-production.yml
name: Promote to Production
on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "promote" to confirm'
        required: true
jobs:
  promote:
    if: github.event.inputs.confirm == 'promote'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # needed to push :latest to GHCR
    steps:
      - uses: actions/checkout@v4
      # 1. Run staging smoke tests one more time
      - run: bash tests/e2e/test_saas_tenant.sh
        env:
          TENANT_SLUG: smoke-test
          BASE_URL: https://staging.api.moleculesai.app
      # 2. Tag Docker image
      - run: |
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
          docker pull ghcr.io/molecule-ai/platform-tenant:staging
          docker tag ghcr.io/molecule-ai/platform-tenant:staging \
            ghcr.io/molecule-ai/platform-tenant:latest
          docker push ghcr.io/molecule-ai/platform-tenant:latest
      # 3. Deploy CP to production
      #    (assumes a RAILWAY_TOKEN repository secret; the name is a placeholder)
      - run: |
          npm install -g @railway/cli
          railway up --environment production
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
      # 4. Production tenants auto-update within 5 min (Option B cron)
```
### Manual (for now)
Until the automated workflow is built:
1. Verify on staging (`staging.api.moleculesai.app`)
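2. `docker pull ghcr.io/molecule-ai/platform-tenant:staging && docker tag ghcr.io/molecule-ai/platform-tenant:staging ghcr.io/molecule-ai/platform-tenant:latest && docker push ghcr.io/molecule-ai/platform-tenant:latest`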
3. `railway up --environment production`
4. Monitor production health
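
Until the workflow exists, the manual steps could be wrapped in one script so they always run in the same order. This is a sketch under assumptions: `promote` and the `SMOKE_TEST` variable are hypothetical names, Docker is assumed to be logged in to GHCR, and the Railway CLI is assumed to be authenticated. The commands themselves are the ones listed in this document.

```bash
# Sketch of the manual promotion as one function.
# promote() and SMOKE_TEST are hypothetical; the commands come from this doc.

IMAGE="ghcr.io/molecule-ai/platform-tenant"

promote() {
  local smoke="${SMOKE_TEST:-tests/e2e/test_saas_tenant.sh}"

  # 1. Verify staging. A failing smoke test stops here, before any retag.
  if ! TENANT_SLUG=smoke-test BASE_URL=https://staging.api.moleculesai.app bash "$smoke"; then
    echo "smoke test failed; not promoting" >&2
    return 1
  fi

  # 2. Retag the verified staging image as :latest.
  docker pull "${IMAGE}:staging" || return 1
  docker tag "${IMAGE}:staging" "${IMAGE}:latest" || return 1
  docker push "${IMAGE}:latest" || return 1

  # 3. Deploy the control plane to production.
  railway up --environment production || return 1

  echo "promoted ${IMAGE}:staging to :latest"
}
```

Each step returns early on failure, so a failed push never reaches `railway up`. Step 4 (monitoring production health) stays manual in this sketch.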
## What this prevents
- CI breakage from untested path filters (the 2026-04-17 dorny/paths-filter issue)
- Cloudflare edge cache poisoning (test DNS changes on staging subdomain)
- Workspace boot script regressions (test on staging EC2 first)
- DB migration failures (test on Neon staging branch)
- Auth/security regressions (staging has same auth stack)
## Implementation order
1. **Railway staging environment** — create + configure vars (~30 min)
2. **Neon staging branch** — create from main (~5 min)
3. **Staging DNS**: `staging.api.moleculesai.app` CNAME to Railway (~5 min)
4. **Publish workflow** — push `:staging` tag instead of `:latest` (~15 min)
5. **Promotion workflow** — manual trigger to promote staging → production (~30 min)
6. **Vercel staging** — configure preview deployment URL (~15 min)
7. **Staging smoke test** — automated test after staging deploy (~30 min)
**Total:** ~2 hours 10 minutes (130 min) for the full staging pipeline.
## Cost
- Railway staging: ~$5/mo (same as production, but can be smaller)
- Neon staging branch: free (included in plan)
- EC2 staging instances: only when testing (terminate after)
- Vercel: free (preview deployments included)
- Cloudflare: free (same zone, additional records)