# Tenant Image Upgrade Strategies

> **Status:** Option B (sidecar auto-updater) implemented. Options A and C
> documented for future use.

## Problem

When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant
instances keep running the old image. New orgs get the latest image at boot,
but existing tenants fall behind — missing bug fixes, security patches, and
new features.

## Option A: Rolling restart on publish (coordinated)

The publish workflow calls a CP admin endpoint after pushing the image.
The CP iterates over all running tenants and restarts them one by one.

```
publish-platform-image succeeds
  → POST https://api.moleculesai.app/cp/admin/rolling-upgrade
  → CP queries org_instances WHERE status = 'running'
  → For each tenant (staggered, 30s apart):
      1. AWS SSM Run Command: docker pull + docker restart
      2. Wait for /health 200
      3. Update org_instances.updated_at
      4. If health fails after 60s, rollback (docker run old image)
  → Return summary: {upgraded: N, failed: M, skipped: K}
```
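
A minimal sketch of the per-tenant step, assuming the CP shells out to the AWS CLI; the instance ID, GHCR owner, and run flags are illustrative placeholders, not values from this doc:

```bash
# Hedged sketch: upgrade one tenant via SSM Run Command.
# INSTANCE_ID and IMAGE are placeholders; substitute real values.
INSTANCE_ID="i-0123456789abcdef0"
IMAGE="ghcr.io/<owner>/platform-tenant:latest"

# Ask the instance to pull the new image and restart the container.
COMMAND_ID=$(aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --comment "rolling upgrade to latest tenant image" \
  --parameters "{\"commands\":[
    \"docker pull $IMAGE\",
    \"docker stop molecule-tenant && docker rm molecule-tenant\",
    \"docker run -d --name molecule-tenant -p 8080:8080 $IMAGE\"
  ]}" \
  --query 'Command.CommandId' --output text)

# Block until the command finishes; a non-zero exit means it failed.
aws ssm wait command-executed --command-id "$COMMAND_ID" --instance-id "$INSTANCE_ID"
# The flow's health check (step 2) would poll the tenant's /health from here.
```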

### Pros
- Immediate, coordinated upgrades across all tenants
- CP has full visibility into upgrade status
- Can implement canary (upgrade 1 tenant first, verify, then rest)
- Rollback capability per tenant

### Cons
- Requires AWS SSM agent on EC2 instances (not installed yet)
- Alternatively requires SSH access from Railway → EC2 (network/key management)
- Brief downtime per tenant during restart (~10-30s)
- Blast radius: a bad image can take down all tenants before canary catches it

### Implementation effort
- Add SSM agent to EC2 user-data script
- Add `POST /cp/admin/rolling-upgrade` handler
- Add upgrade step to publish workflow
- Add rollback logic
- ~2-3 days

### When to use
- Urgent security patches that can't wait 5 min
- Breaking changes that need coordinated rollout
- When you want canary/staged deployment

---

## Option B: Sidecar auto-updater (implemented)

A cron job on each EC2 checks GHCR for a new image digest every 5 minutes.
If the digest changed, it pulls the new image and restarts the container.

```bash
# Runs every 5 min on each EC2 (added to user-data)
*/5 * * * * /usr/local/bin/molecule-auto-update.sh
```

The update script (sketched below):
1. `docker pull platform-tenant:latest`
2. Compare digest with the running container's image digest
3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...`
4. Wait for `/health` 200
5. Log result to `/var/log/molecule-auto-update.log`
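
A hedged sketch of what `molecule-auto-update.sh` could look like; the GHCR owner and run flags are assumptions, and the digest comparison here uses local image IDs rather than registry digests:

```bash
#!/usr/bin/env bash
# Sketch of the auto-updater, under the assumptions stated above.
set -euo pipefail

IMAGE="ghcr.io/<owner>/platform-tenant:latest"   # placeholder owner
CONTAINER="molecule-tenant"
LOG="/var/log/molecule-auto-update.log"

# 1. Pull the latest image.
docker pull "$IMAGE" >/dev/null

# 2. Compare the running container's image ID with the freshly pulled one.
running=$(docker inspect --format '{{.Image}}' "$CONTAINER")
latest=$(docker inspect --format '{{.Id}}' "$IMAGE")

if [ "$running" = "$latest" ]; then
  echo "$(date -Is) up to date" >> "$LOG"
  exit 0
fi

# 3. Replace the container (run flags are illustrative).
docker stop "$CONTAINER" && docker rm "$CONTAINER"
docker run -d --name "$CONTAINER" -p 8080:8080 "$IMAGE"

# 4. Wait up to 60s for /health to return 200.
for _ in $(seq 1 12); do
  if curl -fsS "http://localhost:8080/health" >/dev/null; then
    echo "$(date -Is) upgraded to $latest" >> "$LOG"   # 5. Log result
    exit 0
  fi
  sleep 5
done

echo "$(date -Is) upgrade failed health check" >> "$LOG"
exit 1
```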

### Pros
- Zero CP involvement — fully autonomous per tenant
- Tenants upgrade within 5 min of any publish
- No SSH/SSM infrastructure needed
- Each tenant upgrades independently (natural canary)
- Simple to implement (2 lines in user-data + a small script)

### Cons
- Up to 5 min delay between publish and tenant upgrade
- Brief downtime during restart (~10-30s)
- No centralized visibility into upgrade status
- Can't selectively hold back specific tenants
- All tenants track `latest` — no pinned versions

### When to use
- Default for all tenants
- Works well for early-stage SaaS with frequent deploys

---

## Option C: Blue-green via Worker (zero downtime)

Each EC2 runs two container slots: `blue` (current) and `green` (new).
The Cloudflare Worker routes traffic to whichever is healthy.

```
EC2 instance:
  molecule-tenant-blue  → :8080 (current, serving traffic)
  molecule-tenant-green → :8081 (new, starting up)

Upgrade flow:
  1. Pull new image
  2. Start green on :8081
  3. Health check green: GET :8081/health
  4. If healthy: update Worker routing (KV: slug → port 8081)
  5. Stop blue
  6. Next upgrade: blue becomes the new slot

Worker routing:
  KV key: "example-org" → {"ip": "<EC2_IP>", "port": 8081}
  (port defaults to 8080 when not in KV)
```
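
A hedged sketch of the traffic switch in step 4, assuming the updater writes routing through the Cloudflare KV REST API; the account/namespace IDs, token variable, and key name are all placeholders:

```bash
# Sketch: flip Worker routing from blue (:8080) to green (:8081)
# after green passes its health check. All IDs below are placeholders.
# CF_API_TOKEN is assumed to be exported in the environment.
CF_ACCOUNT_ID="<account-id>"
CF_NAMESPACE_ID="<kv-namespace-id>"
ORG_SLUG="example-org"
EC2_IP="<EC2_IP>"

# Step 3: health-check green before any traffic switches.
curl -fsS "http://localhost:8081/health" >/dev/null || exit 1

# Step 4: write the new port into KV so the Worker routes to green.
curl -fsS -X PUT \
  "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/storage/kv/namespaces/$CF_NAMESPACE_ID/values/$ORG_SLUG" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -d "{\"ip\": \"$EC2_IP\", \"port\": 8081}"

# Step 5: retire blue once traffic has switched.
docker stop molecule-tenant-blue && docker rm molecule-tenant-blue
```

Instant rollback is the same PUT with port 8080, pointing traffic back at the old slot.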

### Pros
- Zero downtime — traffic switches atomically after health check
- Instant rollback — just switch back to the old slot
- Worker already exists — just add port to the routing lookup
- Health-verified before any traffic switches

### Cons
- Double memory usage during transition (~512MB extra per tenant)
- More complex user-data script (manage two containers)
- Worker needs port-aware routing (KV schema change)
- Need to track which slot is active per tenant

### Implementation effort
- Update user-data to manage blue/green containers
- Update Worker to read port from KV
- Add blue/green state tracking to CP (org_instances.active_slot)
- Update auto-updater script for blue-green swap
- ~3-5 days

### When to use
- When tenants have SLAs requiring zero downtime
- Production deployments with paying customers
- After Option B proves the auto-update pattern works

---

## Migration path

```
Now:    Option B (auto-updater, 5 min delay, brief downtime)
  ↓
Growth: Option A (add SSM for urgent patches, keep B as default)
  ↓
Scale:  Option C (zero-downtime for premium/enterprise tenants)
```