# Tenant Image Upgrade Strategies
> **Status:** Option B (sidecar auto-updater) implemented. Options A and C
> documented for future use.
## Problem
When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant
instances keep running the old image. New orgs get the latest image at boot,
but existing tenants fall behind — missing bug fixes, security patches, and
new features.
## Option A: Rolling restart on publish (coordinated)
The publish workflow calls a CP admin endpoint after pushing the image.
The CP iterates all running tenants and restarts them one by one.
```
publish-platform-image succeeds
→ POST https://api.moleculesai.app/cp/admin/rolling-upgrade
→ CP queries org_instances WHERE status = 'running'
→ For each tenant (staggered, 30s apart):
1. AWS SSM Run Command: docker pull + recreate container (stop/rm/run)
2. Wait for /health 200
3. Update org_instances.updated_at
4. If health fails after 60s, rollback (docker run old image)
→ Return summary: {upgraded: N, failed: M, skipped: K}
```
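Step 1 of this loop maps to one SSM `send-command` call per tenant. A minimal
sketch, assuming the SSM agent is already in place; the instance ID, GHCR org,
and `docker run` flags are placeholders, and the container is recreated rather
than just restarted so it actually picks up the new image:
```bash
# Hypothetical per-tenant upgrade command issued by the CP (step 1 above).
# Instance ID, image path (ORG), and run flags are illustrative only.
aws ssm send-command \
  --instance-ids "i-0123456789abcdef0" \
  --document-name "AWS-RunShellScript" \
  --comment "platform-tenant rolling upgrade" \
  --parameters '{"commands": [
    "docker pull ghcr.io/ORG/platform-tenant:latest",
    "docker stop molecule-tenant && docker rm molecule-tenant",
    "docker run -d --name molecule-tenant -p 8080:8080 --restart unless-stopped ghcr.io/ORG/platform-tenant:latest"
  ]}'
```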
### Pros
- Immediate, coordinated upgrades across all tenants
- CP has full visibility into upgrade status
- Can implement canary (upgrade 1 tenant first, verify, then rest)
- Rollback capability per tenant
### Cons
- Requires AWS SSM agent on EC2 instances (not installed yet)
- Alternatively requires SSH access from Railway → EC2 (network/key management)
- Brief downtime per tenant during restart (~10-30s)
- Blast radius: a bad image can take down all tenants before canary catches it
### Implementation effort
- Add SSM agent to EC2 user-data script (sketch after this list)
- Add `POST /cp/admin/rolling-upgrade` handler
- Add upgrade step to publish workflow
- Add rollback logic
- ~2-3 days
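The SSM-agent prerequisite is mostly a user-data addition plus an IAM change.
A sketch assuming Amazon Linux 2 AMIs, where the agent ships preinstalled and
only needs enabling; other distros need an install step first:
```bash
# Hypothetical user-data addition for Option A (assumes Amazon Linux 2,
# where amazon-ssm-agent is preinstalled). Ubuntu AMIs would need
# `snap install amazon-ssm-agent --classic` first. The instance profile
# must also grant the AmazonSSMManagedInstanceCore managed policy so
# Run Command can reach the instance.
systemctl enable --now amazon-ssm-agent
```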
### When to use
- Urgent security patches that can't wait 5 min
- Breaking changes that need coordinated rollout
- When you want canary/staged deployment
---
## Option B: Sidecar auto-updater (implemented)
A cron job on each EC2 instance checks GHCR for a new image digest every 5
minutes. If the digest has changed, it pulls the new image and recreates the
container.
```bash
# Runs every 5 min on each EC2 (added to user-data)
*/5 * * * * /usr/local/bin/molecule-auto-update.sh
```
The update script (sketched in full after this list):
1. `docker pull platform-tenant:latest`
2. Compare digest with running container's image digest
3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...`
4. Wait for `/health` 200
5. Log result to `/var/log/molecule-auto-update.log`
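A minimal sketch of that script, assuming a placeholder GHCR org, the
container name and log path above, and an app that listens on 8080 and serves
`/health`:
```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/molecule-auto-update.sh (Option B). Image path,
# port mapping, and health URL are assumptions, not production values.
set -euo pipefail

IMAGE="ghcr.io/ORG/platform-tenant:latest"   # ORG is a placeholder
CONTAINER="molecule-tenant"
HEALTH_URL="http://localhost:8080/health"
LOG="/var/log/molecule-auto-update.log"

log() { echo "$(date -Is) $*" >> "$LOG"; }

# 1. Pull the latest tag (a no-op when nothing new was published).
docker pull "$IMAGE" >/dev/null

# 2. Compare the image the container was created from with whatever the
#    tag now points to. Equal IDs mean there is nothing to do.
running=$(docker inspect --format '{{.Image}}' "$CONTAINER")
latest=$(docker inspect --format '{{.Id}}' "$IMAGE")
if [ "$running" = "$latest" ]; then
  exit 0
fi

# 3. Recreate the container from the new image.
log "new image detected, recreating $CONTAINER"
docker stop "$CONTAINER" && docker rm "$CONTAINER"
docker run -d --name "$CONTAINER" -p 8080:8080 --restart unless-stopped "$IMAGE"

# 4 + 5. Wait up to 60s for /health to return 200, then log the outcome.
for _ in $(seq 1 12); do
  if curl -fsS "$HEALTH_URL" >/dev/null; then
    log "upgrade ok: $latest"
    exit 0
  fi
  sleep 5
done
log "upgrade FAILED health check: $latest"
exit 1
```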
### Pros
- Zero CP involvement — fully autonomous per tenant
- Tenants upgrade within 5 min of any publish
- No SSH/SSM infrastructure needed
- Each tenant upgrades independently (natural canary)
- Simple to implement (2 lines in user-data + a small script)
### Cons
- Up to 5 min delay between publish and tenant upgrade
- Brief downtime during restart (~10-30s)
- No centralized visibility into upgrade status
- Can't selectively hold back specific tenants
- All tenants track `latest` — no pinned versions
### When to use
- Default for all tenants
- Works well for early-stage SaaS with frequent deploys
---
## Option C: Blue-green via Worker (zero downtime)
Each EC2 runs two container slots: `blue` (current) and `green` (new).
The Cloudflare Worker routes traffic to whichever is healthy.
```
EC2 instance:
molecule-tenant-blue → :8080 (current, serving traffic)
molecule-tenant-green → :8081 (new, starting up)
Upgrade flow:
1. Pull new image
2. Start green on :8081
3. Health check green: GET :8081/health
4. If healthy: update Worker routing (KV: slug → port 8081)
5. Stop blue
6. Next upgrade: blue becomes the new slot
Worker routing:
KV key: "hongming2" → {"ip": "3.144.193.40", "port": 8081}
(port defaults to 8080 when not in KV)
```
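A sketch of the swap for one tenant, following the five steps above. The slug
and IP come from the example entry; `CF_ACCOUNT_ID`, `CF_KV_NAMESPACE_ID`,
`CF_API_TOKEN`, the GHCR org, and the in-container port are assumptions, and
writing the KV entry through the Cloudflare REST API is just one way to do
step 4:
```bash
#!/usr/bin/env bash
# Sketch of a blue → green swap on one tenant (steps 1-5 above).
# Assumes the app listens on 8080 inside the container and that the
# CF_* environment variables are already set.
set -euo pipefail

IMAGE="ghcr.io/ORG/platform-tenant:latest"   # ORG is a placeholder
SLUG="hongming2"
TENANT_IP="3.144.193.40"

# 1-2. Pull the new image and start the green slot next to blue.
docker pull "$IMAGE"
docker run -d --name molecule-tenant-green -p 8081:8080 "$IMAGE"

# 3. Give green up to 60s to come up healthy; otherwise abort and keep blue.
healthy=0
for _ in $(seq 1 12); do
  if curl -fsS "http://localhost:8081/health" >/dev/null; then
    healthy=1
    break
  fi
  sleep 5
done
if [ "$healthy" -ne 1 ]; then
  docker rm -f molecule-tenant-green
  exit 1
fi

# 4. Repoint the Worker at green by rewriting the tenant's routing entry
#    in KV (Cloudflare KV REST API shown; the CP could do this instead).
curl -fsS -X PUT \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/storage/kv/namespaces/${CF_KV_NAMESPACE_ID}/values/${SLUG}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  --data "{\"ip\": \"${TENANT_IP}\", \"port\": 8081}"

# 5. Retire blue; on the next upgrade the slots trade roles.
docker stop molecule-tenant-blue && docker rm molecule-tenant-blue
```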
### Pros
- Zero downtime — traffic switches atomically after health check
- Instant rollback — just switch back to the old slot
- Worker already exists — just add port to the routing lookup
- Health-verified before any traffic switches
### Cons
- Double memory usage during transition (~512MB extra per tenant)
- More complex user-data script (manage two containers)
- Worker needs port-aware routing (KV schema change)
- Need to track which slot is active per tenant
### Implementation effort
- Update user-data to manage blue/green containers
- Update Worker to read port from KV
- Add blue/green state tracking to CP (org_instances.active_slot)
- Update auto-updater script for blue-green swap
- ~3-5 days
### When to use
- When tenants have SLAs requiring zero downtime
- Production deployments with paying customers
- After Option B proves the auto-update pattern works
---
## Migration path
```
Now: Option B (auto-updater, 5 min delay, brief downtime)
Growth: Option A (add SSM for urgent patches, keep B as default)
Scale: Option C (zero-downtime for premium/enterprise tenants)
```