forked from molecule-ai/molecule-core
docs: staging environment design + Phase 36 plan
Full staging environment that mirrors production. Every infra change ships to staging first before promotion. Gates Phase 33 (Tunnel) and Phase 35 (security hardening).

Components: Railway staging env, Neon branch, staging DNS, tagged Docker images, promotion workflow, automated smoke tests.

Also marks Phase 33 as migrating from Worker to Cloudflare Tunnel (issue #933), prerequisite: staging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent cb122c98e5
commit 2dbb59cb35
PLAN.md (58 lines changed)
@@ -575,13 +575,61 @@ self-hosted per-customer). Ordered by dependency + ROI.
 
 ---
 
-## Phase 33: Wildcard DNS + Cloudflare Worker Proxy
+## Phase 36: Full Staging Environment (GATES ALL INFRA CHANGES)
 
-> **Goal:** Eliminate DNS propagation delays and NXDOMAIN caching for tenant
-> subdomains. Every SaaS (Vercel, Railway, Fly.io) uses this pattern:
-> wildcard DNS + edge proxy routing by hostname.
+> **Goal:** Stop merging untested infra changes to production. Every change
+> ships to staging first, gets verified, then promotes to production.
 >
-> **Docs:** `docs/architecture/wildcard-dns-proxy.md`
+> **Why now:** The 2026-04-17 session broke CI twice and caused hours of
+> edge cache issues because there was no staging to catch regressions.
+> This gates Phase 33 (Tunnel migration) and Phase 35 (security hardening).
+>
+> **Docs:** `docs/architecture/staging-environment.md`
+
+### Phase 36.1: Railway + Neon staging
+
+- [ ] Create Railway `staging` environment with staging-specific vars
+- [ ] Create Neon staging branch from main
+- [ ] Add `staging.api.moleculesai.app` CNAME to Railway staging
+- [ ] Verify CP deploys and boots on staging
+
+### Phase 36.2: Image + deploy pipeline
+
+- [ ] Publish workflow pushes `:staging` tag (not `:latest`) on main merge
+- [ ] Add `promote-to-production.yml` workflow (manual trigger)
+- [ ] Promotion: retag `:staging` → `:latest`, deploy CP to production
+- [ ] Production tenants auto-update via Option B cron
+
+### Phase 36.3: Staging DNS + Vercel
+
+- [ ] `*.staging.moleculesai.app` for staging tenant subdomains
+- [ ] `staging.app.moleculesai.app` for Vercel staging preview
+- [ ] Staging Cloudflare Tunnel (or Worker) for tenant routing
+
+### Phase 36.4: Automated verification
+
+- [ ] Post-deploy staging smoke test (run `test_saas_tenant.sh`)
+- [ ] Block promotion if smoke test fails
+- [ ] Slack/GitHub notification on staging deploy + promotion
+
+### Success criteria for Phase 36
+
+- No infra change reaches production without passing staging first
+- Staging mirrors production (same services, same auth, separate data)
+- Promotion is a single manual action (button click or CLI command)
+- Staging cleanup is automated (terminate test EC2s after verification)
+
+---
+
+## Phase 33: Tenant Subdomain Routing (MIGRATING TO CLOUDFLARE TUNNEL)
+
+> **Original:** Wildcard DNS + Cloudflare Worker (implemented 2026-04-17).
+> **Replacing with:** Cloudflare Tunnel per tenant (issue #933).
+> Worker approach caused edge cache poisoning + security gaps (ADMIN_TOKEN
+> in plaintext, unencrypted HTTP). Tunnel eliminates all of these.
+> **Docs:** `docs/architecture/wildcard-dns-proxy.md` (original),
+> issue #933 (tunnel migration plan).
+> **Prerequisite:** Phase 36 (staging); test tunnel on staging first.
 
 ### Phase 33.1: Worker + wildcard DNS (no tenant changes)
 
docs/architecture/staging-environment.md (new file, 214 lines)
@@ -0,0 +1,214 @@
# Staging Environment Design

> **Status:** Planned. Gates all future infra changes (Tunnel migration,
> security fixes, etc.)
>
> **Problem:** We merge directly to main and auto-deploy to production.
> The 2026-04-17 session broke CI twice and caused hours of Cloudflare edge
> cache issues because there was no staging to test infra changes first.
>
> **Goal:** Full staging environment that mirrors production. Every change
> ships to staging first, gets verified, then promotes to production.

---

## Architecture

```
                 staging                         production
                 ───────                         ──────────
Git branch:      main (auto-deploy)              main (manual promote)
                 or staging branch

CP (Railway):    staging service                 production service
                 staging.api.moleculesai.app     api.moleculesai.app

Tenant EC2s:     staging EC2 instances           production EC2 instances
                 *.staging.moleculesai.app       *.moleculesai.app

App (Vercel):    staging.app.moleculesai.app     app.moleculesai.app
                 (Vercel preview)                (Vercel production)

DB (Neon):       staging branch                  main branch
                 (or separate project)

Docker images:   platform-tenant:staging         platform-tenant:latest
                 (GHCR)                          (GHCR)

Cloudflare:      *.staging.moleculesai.app       *.moleculesai.app
                 (separate tunnel/worker)        (tunnel per tenant)
```

## Deploy flow

```
Developer pushes to PR branch
  → CI runs (tests, build, lint)
  → PR merged to main
  → Auto-deploy to STAGING
  → Staging smoke tests (automated)
  → Manual verification if needed
  → Promote to PRODUCTION (manual trigger or approval)
```

## Components

### 1. Railway: two environments

Railway supports multiple environments per project. Create a `staging`
environment alongside `production`:

```bash
railway environment create staging
railway variables --environment staging --set "DATABASE_URL=<staging-neon>"
railway variables --environment staging --set "MOLECULE_ENV=staging"
# ... all other vars with staging-specific values
```

**Deploy trigger:**
- `staging`: auto-deploy on push to main
- `production`: manual promote via `railway up --environment production`
  or GitHub Actions workflow_dispatch

**Domains:**
- staging: `staging.api.moleculesai.app` (Railway custom domain)
- production: `api.moleculesai.app` (unchanged)

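The "verify CP deploys and boots on staging" check can be scripted once the staging domain exists. The following is a minimal sketch rather than the project's actual tooling: `wait_for_healthy` is a hypothetical helper, and the `/health` path in the usage note is an assumption, since this document does not name a health endpoint.

```bash
#!/usr/bin/env bash
# Sketch: poll a URL until it answers with a non-error HTTP status, or give up.

# wait_for_healthy URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_healthy() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP >= 400 as failure; -sS: quiet but still print errors;
    # -o /dev/null: discard the body; --max-time: bound each attempt
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $i attempt(s): $url"
      return 0
    fi
    # Wait before retrying, but not after the final attempt
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "not healthy after $attempts attempts: $url" >&2
  return 1
}
```

A post-deploy step might call it as `wait_for_healthy https://staging.api.moleculesai.app/health 20 5` (hypothetical path) and fail the job on a non-zero exit.
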
### 2. Neon: branch per environment

Neon supports database branches (like git branches):

```bash
# Create staging branch from main
neon branch create --project-id <id> --name staging --parent main
```

- Staging DB has same schema, separate data
- Can reset staging by re-branching from main
- Production data never touched by staging tests

### 3. Vercel: preview deployments

Vercel already supports this natively:
- Push to main → deploys to `app.moleculesai.app` (production)
- Push to `staging` branch → deploys to preview URL

**Or** use Vercel environments:
- `staging.app.moleculesai.app` → staging deployment
- `app.moleculesai.app` → production deployment

### 4. GHCR: tagged images

```
platform-tenant:staging     built on every push to main
platform-tenant:latest      promoted from staging after verification
platform-tenant:sha-xxxxx   immutable, pinned to specific commit
```

**Publish workflow change:**
```yaml
# Current: pushes :latest on every main merge
# New:     pushes :staging on every main merge
#          pushes :latest only on manual promote
```

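The tag scheme above can be expressed as a small helper that a publish job might call on each main merge. This is a sketch under assumptions: `image_tags` is a hypothetical function, and the 7-character short SHA is an illustrative choice (the scheme above only shows `sha-xxxxx`). Note that it never emits `:latest`, which is reserved for promotion.

```bash
#!/usr/bin/env bash
# Sketch: compute the tags to push when a commit lands on main.

# image_tags REPO COMMIT_SHA -> one tag per line
image_tags() {
  local repo="$1" sha="$2" short
  # Keep the first 7 characters of the commit SHA (assumed length)
  short="$(printf '%.7s' "$sha")"
  printf '%s:staging\n' "$repo"          # moving tag, tracks main
  printf '%s:sha-%s\n' "$repo" "$short"  # immutable tag, pinned to the commit
}
```

A publish step could loop over the output, for example `image_tags ghcr.io/molecule-ai/platform-tenant "$GITHUB_SHA" | while read -r tag; do docker push "$tag"; done`, after tagging the built image with each name.
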
### 5. Cloudflare: staging subdomain

- Option A (simple): `*.staging.moleculesai.app` with its own tunnel/worker
- Option B (full): separate Cloudflare zone for staging (overkill)

Recommend Option A:
- Add `staging.moleculesai.app` DNS records
- Staging tenants get `slug.staging.moleculesai.app` subdomains
- Production tenants get `slug.moleculesai.app` (unchanged)

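Under Option A the hostname for a tenant depends only on its slug and the environment. The sketch below illustrates that mapping; `tenant_host` is a hypothetical helper, and the reserved-slug list is an assumption added here because a production tenant named `staging`, `api`, or `app` would collide with hostnames this design already uses. The document does not say whether slugs are validated elsewhere.

```bash
#!/usr/bin/env bash
# Sketch: map (slug, environment) to the tenant hostname under Option A.

# tenant_host SLUG ENV -> hostname, or non-zero exit on invalid input
tenant_host() {
  local slug="$1" env="$2"
  # Assumed reserved names: they clash with staging/api/app hostnames
  case "$slug" in
    staging|api|app)
      echo "reserved slug: $slug" >&2
      return 1
      ;;
  esac
  case "$env" in
    staging)    echo "${slug}.staging.moleculesai.app" ;;
    production) echo "${slug}.moleculesai.app" ;;
    *)
      echo "unknown environment: $env" >&2
      return 1
      ;;
  esac
}
```

For example, `tenant_host acme staging` prints `acme.staging.moleculesai.app`, while `tenant_host staging production` is rejected.
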
### 6. EC2: staging tag

Staging EC2 instances tagged with `Environment=staging`:
- Separate from production instances in AWS console
- Can use different AMI, instance type, security group
- Easy to identify and clean up

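Because staging instances carry the `Environment=staging` tag, cleanup can be driven by a tag filter. The following is a sketch, not the project's cleanup job: the function names are hypothetical, it assumes the AWS CLI is installed with credentials and a region configured, and it defaults to a dry run so nothing is terminated by accident.

```bash
#!/usr/bin/env bash
# Sketch: find and terminate running EC2 instances tagged Environment=staging.

# Print the IDs of running staging instances (whitespace-separated)
list_staging_instances() {
  aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=staging" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[].Instances[].InstanceId" \
    --output text
}

# Terminate them; set DRY_RUN=0 to actually call terminate-instances
cleanup_staging_instances() {
  local ids
  # xargs with no command joins tab/newline-separated IDs with single spaces
  ids="$(list_staging_instances | xargs)"
  if [ -z "$ids" ]; then
    echo "no running staging instances"
    return 0
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would terminate: $ids"
  else
    # Unquoted on purpose: each ID must be passed as a separate argument
    aws ec2 terminate-instances --instance-ids $ids
  fi
}
```

Running `cleanup_staging_instances` first shows what would be removed; `DRY_RUN=0 cleanup_staging_instances` performs the termination.
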
## Environment variables

| Variable | Staging | Production |
|----------|---------|------------|
| `MOLECULE_ENV` | `staging` | `production` |
| `DATABASE_URL` | Neon staging branch | Neon main branch |
| `TENANT_IMAGE` | `platform-tenant:staging` | `platform-tenant:latest` |
| `APP_DOMAIN` | `staging.moleculesai.app` | `moleculesai.app` |
| `CORS_ORIGINS` | `https://staging.app.moleculesai.app` | `https://app.moleculesai.app` |
| `ADMIN_TOKEN` | per-tenant (same mechanism) | per-tenant |

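Mixing values across the two columns (for example, a staging CP that provisions tenants from `:latest`) is easy to do by hand. A startup or CI guard could catch it. The sketch below is illustrative only: `check_env` is a hypothetical helper, and it checks just the three variables whose expected values are spelled out literally in the table.

```bash
#!/usr/bin/env bash
# Sketch: fail fast when staging and production settings are mixed.

check_env() {
  local want_tag want_domain
  case "${MOLECULE_ENV:-}" in
    staging)    want_tag="staging"; want_domain="staging.moleculesai.app" ;;
    production) want_tag="latest";  want_domain="moleculesai.app" ;;
    *)
      echo "MOLECULE_ENV must be 'staging' or 'production'" >&2
      return 1
      ;;
  esac
  # TENANT_IMAGE must end in the tag that belongs to this environment
  case "${TENANT_IMAGE:-}" in
    *":${want_tag}") ;;
    *)
      echo "TENANT_IMAGE must use the :${want_tag} tag in ${MOLECULE_ENV}" >&2
      return 1
      ;;
  esac
  if [ "${APP_DOMAIN:-}" != "$want_domain" ]; then
    echo "APP_DOMAIN must be ${want_domain} in ${MOLECULE_ENV}" >&2
    return 1
  fi
  echo "env ok: ${MOLECULE_ENV}"
}
```

Calling `check_env` early in a deploy script turns a misconfigured environment into an immediate, readable failure.
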
## Promotion workflow

### Automated (CI/CD)

```yaml
# .github/workflows/promote-to-production.yml
name: Promote to Production
on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "promote" to confirm'
        required: true

jobs:
  promote:
    if: github.event.inputs.confirm == 'promote'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # needed to push the retagged image to GHCR
    steps:
      # The smoke test script lives in the repo, so check it out first
      - uses: actions/checkout@v4

      # 1. Run staging smoke tests one more time
      - run: bash tests/e2e/test_saas_tenant.sh
        env:
          TENANT_SLUG: smoke-test
          BASE_URL: https://staging.api.moleculesai.app

      # 2. Tag Docker image
      # Assumes GITHUB_TOKEN is allowed to write this GHCR package
      - run: |
          echo "${{ secrets.GITHUB_TOKEN }}" | docker login ghcr.io -u "${{ github.actor }}" --password-stdin
          docker pull ghcr.io/molecule-ai/platform-tenant:staging
          docker tag ghcr.io/molecule-ai/platform-tenant:staging \
            ghcr.io/molecule-ai/platform-tenant:latest
          docker push ghcr.io/molecule-ai/platform-tenant:latest

      # 3. Deploy CP to production
      # Assumes a RAILWAY_TOKEN repository secret has been configured
      - run: |
          npm install -g @railway/cli
          railway up --environment production
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}

      # 4. Production tenants auto-update within 5 min (Option B cron)
```

### Manual (for now)

Until the automated workflow is built:
1. Verify on staging (`staging.api.moleculesai.app`)
2. `docker tag ghcr.io/molecule-ai/platform-tenant:staging ghcr.io/molecule-ai/platform-tenant:latest && docker push ghcr.io/molecule-ai/platform-tenant:latest`
3. `railway up --environment production`
4. Monitor production health

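Until the workflow exists, the manual steps can be wrapped in a script so they always run in the same order and stop at the first failure. This is a sketch under assumptions: `run` and `promote` are hypothetical helpers, the script is meant to be run from the repository root with `docker` and `railway` already authenticated, and it defaults to a dry run that only prints each command.

```bash
#!/usr/bin/env bash
# Sketch: the manual promotion steps as one script, dry-run by default.

IMAGE="ghcr.io/molecule-ai/platform-tenant"

# run CMD...: print the command in dry-run mode, otherwise execute it
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

promote() {
  # 1. Verify on staging: the smoke test must pass before anything is retagged
  run env TENANT_SLUG=smoke-test BASE_URL=https://staging.api.moleculesai.app \
    bash tests/e2e/test_saas_tenant.sh || return 1
  # 2. Retag :staging as :latest and push it
  run docker pull "$IMAGE:staging" || return 1
  run docker tag "$IMAGE:staging" "$IMAGE:latest" || return 1
  run docker push "$IMAGE:latest" || return 1
  # 3. Deploy the CP to production
  run railway up --environment production
}
```

`promote` on its own prints the five commands it would run; `DRY_RUN=0 promote` executes them. Monitoring production health (step 4) stays a manual task.
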
## What this prevents

- CI breakage from untested path filters (the 2026-04-17 dorny/paths-filter issue)
- Cloudflare edge cache poisoning (test DNS changes on staging subdomain)
- Workspace boot script regressions (test on staging EC2 first)
- DB migration failures (test on Neon staging branch)
- Auth/security regressions (staging has same auth stack)

## Implementation order

1. **Railway staging environment**: create + configure vars (~30 min)
2. **Neon staging branch**: create from main (~5 min)
3. **Staging DNS**: `staging.api.moleculesai.app` CNAME to Railway (~5 min)
4. **Publish workflow**: push `:staging` tag instead of `:latest` (~15 min)
5. **Promotion workflow**: manual trigger to promote staging → production (~30 min)
6. **Vercel staging**: configure preview deployment URL (~15 min)
7. **Staging smoke test**: automated test after staging deploy (~30 min)

**Total:** ~2 to 2.5 hours for the full staging pipeline (the step estimates above sum to 130 minutes).

## Cost

- Railway staging: ~$5/mo (same as production, but can be smaller)
- Neon staging branch: free (included in plan)
- EC2 staging instances: only when testing (terminate after)
- Vercel: free (preview deployments included)
- Cloudflare: free (same zone, additional records)