Closes the canary loop with an escape hatch (the rollback script) and
a single place to read about the whole flow.
scripts/rollback-latest.sh <sha>
uses crane to retag :latest ← :staging-<sha> for BOTH the platform
and tenant images. Pre-checks that the target tag exists and verifies
the :latest digest after the move, so a bad ops typo doesn't silently
promote the wrong thing. Prod tenants auto-update to the rolled-back
digest within their 5-min cycle. Exit codes: 0 = both retagged,
1 = registry/tag error, 2 = usage error.
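The retag flow, roughly, as a minimal sketch (the registry path and
image names below are placeholders, not the real repos):

    #!/usr/bin/env bash
    # Sketch only: retag :latest to the staging image and verify the
    # digest actually moved. Registry and image names are hypothetical.
    set -euo pipefail
    if [ "$#" -ne 1 ]; then
      echo "usage: $0 <sha>" >&2
      exit 2                                   # 2 = usage error
    fi
    SHA="$1"
    REGISTRY="ghcr.io/example"                 # placeholder registry/org
    for IMAGE in platform tenant; do           # both images get retagged
      SRC="$REGISTRY/$IMAGE:staging-$SHA"
      WANT="$(crane digest "$SRC")" || exit 1  # pre-check: tag must exist
      crane tag "$SRC" latest                  # point :latest at it
      GOT="$(crane digest "$REGISTRY/$IMAGE:latest")"
      if [ "$GOT" != "$WANT" ]; then
        echo "digest mismatch for $IMAGE" >&2
        exit 1                                 # 1 = registry/tag error
      fi
    done
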
docs/architecture/canary-release.md
The one-page map of the pipeline: how PR → main → staging-<sha> →
canary smoke → :latest promotion works end-to-end, how to add a
canary tenant, how to roll back, and what this gate explicitly does
NOT catch (prod-only data, config drift, cross-tenant bugs).
No code changes in the CP or workspace-server — this PR is shell
+ docs only, so it's safe to land independently of the other Phase
{1,1.5,2,3} PRs still in review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the 10-PR staging→main cutover: what shipped, the three new
Railway prod env vars (PROVISION_SHARED_SECRET / EC2_VPC_ID /
CP_BASE_URL), and the sharp edge for existing tenants — their
containers pre-date PR #53, so they still need MOLECULE_CP_SHARED_SECRET
added manually (or a re-provision) before the new CPProvisioner's
outbound bearer works.
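A quick way to find tenants that still need it (the container name is
a placeholder for whatever the workspace container is called):

    # Succeeds silently if the secret is set; warns if it is missing.
    docker exec workspace-server \
      printenv MOLECULE_CP_SHARED_SECRET >/dev/null \
      || echo "missing: recreate the container with it, or re-provision"
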
Also includes a post-deploy verification checklist and rollback plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md was a 44KB catch-all mixing architecture docs (useful for
everyone) with agent operating instructions (internal). Split:
- docs/architecture/overview.md — system architecture, component
descriptions, 13 key patterns (import cycles, health detection,
communication rules, WebSocket flow, lifecycle, etc.)
- docs/api-reference.md — full REST API route table + database schema
- CLAUDE.md → gitignored (stays local for agent tooling)
All internal PR/issue references stripped from the new docs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove compiled workspace-server/server binary from git
- Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs
- Fix CI workflow path filters (workspace-template → workspace)
- Replace real EC2 IP and personal slug in test_saas_tenant.sh
- Scrub molecule-controlplane references in docs
- Fix stale workspace-template/ paths in provisioner, handlers, tests
- Clean tracked Python cache files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Security:
- Replace hardcoded Cloudflare account/zone/KV IDs in wrangler.toml
with placeholders; add wrangler.toml to .gitignore, ship .example
- Replace real EC2 IPs in docs with <EC2_IP> placeholders
- Redact partial CF API token prefix in retrospective
- Parameterize Langfuse dev credentials in docker-compose.infra.yml
- Replace Neon project ID in runbook with <neon-project-id>
Community:
- Add CONTRIBUTING.md (build, test, branch conventions, CI info)
- Add CODE_OF_CONDUCT.md (Contributor Covenant 2.1)
Cleanup:
- Replace personal runner username/machine name in CI + PLAN.md
- Replace personal tenant URL in MCP setup guide
- Replace personal author field in bundle-system doc
- Replace personal login in webhook test fixture
- Rewrite cryptominer incident reference as generic security remediation
- Remove private repo commit hashes from PLAN.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full staging environment that mirrors production. Every infra change
ships to staging before promotion. Gates Phase 33 (Tunnel) and
Phase 35 (security hardening).
Components: Railway staging env, Neon branch, staging DNS, tagged
Docker images, promotion workflow, automated smoke tests.
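The smoke tests can stay as simple as probing the staging deployment
after each promotion; a minimal sketch with a placeholder hostname and
endpoints:

    #!/usr/bin/env bash
    # Minimal staging smoke test: any HTTP error response fails the run.
    set -euo pipefail
    BASE="https://staging.example.com"       # placeholder staging host
    for path in /healthz /api/health; do     # placeholder endpoints
      curl -fsS --max-time 10 "$BASE$path" >/dev/null
      echo "ok: $path"
    done
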
Also marks Phase 33 as migrating from Worker to Cloudflare Tunnel
(issue #933), with staging as its prerequisite.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents three upgrade strategies for keeping tenant EC2 instances
current with platform-tenant:latest:
- Option A: Rolling restart via CP admin endpoint (coordinated)
- Option B: Sidecar auto-updater cron (implemented, 5-min interval;
  sketched below)
- Option C: Blue-green via Worker (zero downtime, future)
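The Option B updater reduces to something like this on the instance
(the install dir, service name, and compose usage are placeholders for
however the tenant container is actually run):

    #!/usr/bin/env bash
    # Sketch of the Option B updater: pull :latest and let compose
    # recreate the container only when the image digest changed.
    set -euo pipefail
    cd /opt/workspace                 # placeholder install dir
    docker compose pull workspace     # fetch platform-tenant:latest
    docker compose up -d workspace    # no-op unless the image changed

    # Placeholder crontab entry for the 5-minute cycle:
    # */5 * * * * /opt/workspace/auto-update.sh >>/var/log/auto-update.log 2>&1
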
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses all 4 review points from PR #786:
1. Worker resilience: 3-tier cache (in-memory → KV → CP API) with stale
fallback so CP outages are invisible to tenants
2. WebSocket proxying: documented upgradeHeader handling, fallback to
keep Caddy for WS-only if Workers WS is unreliable
3. SG automation: note to auto-update Cloudflare IP ranges, don't hardcode
4. Trusted proxy: X-Forwarded-For / CF-Connecting-IP trust chain documented
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds Phase 33 plan and architecture doc for replacing per-tenant DNS
records with a wildcard DNS + Cloudflare Worker proxy pattern.
Eliminates: DNS propagation delays, NXDOMAIN caching, per-instance
Let's Encrypt, Caddy on EC2. Same pattern used by Vercel, Railway,
Fly.io, WordPress, n8n.
4-phase migration: deploy Worker → stop creating DNS records →
remove Caddy from EC2 → cleanup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>