# Tenant Image Upgrade Strategies

> **Status:** Option B (sidecar auto-updater) implemented. Options A and C
> documented for future use.

## Problem

When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant
instances keep running the old image. New orgs get the latest image at boot,
but existing tenants fall behind — missing bug fixes, security patches, and
new features.

## Option A: Rolling restart on publish (coordinated)

The publish workflow calls a CP admin endpoint after pushing the image.
The CP iterates over all running tenants and restarts them one by one.

```
publish-platform-image succeeds
  → POST https://api.moleculesai.app/cp/admin/rolling-upgrade
  → CP queries org_instances WHERE status = 'running'
  → For each tenant (staggered, 30s apart):
       1. AWS SSM Run Command: docker pull + docker restart
       2. Wait for /health 200
       3. Update org_instances.updated_at
       4. If health fails after 60s, rollback (docker run old image)
  → Return summary: {upgraded: N, failed: M, skipped: K}
```
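
The loop above can be sketched in Python. This is a sketch under assumptions, not the CP implementation: `run_command` stands in for an AWS SSM Run Command call and `check_health` for polling the tenant's `/health` endpoint; both are injected so the control flow is testable.

```python
import time

def rolling_upgrade(tenants, run_command, check_health,
                    stagger_s=30, health_timeout_s=60, poll_s=5,
                    sleep=time.sleep, now=time.monotonic):
    """Restart tenants one by one; roll back any that fail health checks."""
    upgraded, failed, skipped = 0, 0, 0
    for i, tenant in enumerate(tenants):
        if tenant.get("status") != "running":
            skipped += 1
            continue
        if i > 0:
            sleep(stagger_s)                 # staggered, 30s apart
        run_command(tenant, "docker pull + docker restart")
        deadline = now() + health_timeout_s
        healthy = False
        while now() < deadline:
            if check_health(tenant):         # /health returned 200
                healthy = True
                break
            sleep(poll_s)
        if healthy:
            upgraded += 1                    # CP would also bump updated_at here
        else:
            run_command(tenant, "rollback: docker run old image")
            failed += 1
    return {"upgraded": upgraded, "failed": failed, "skipped": skipped}
```

A canary rollout falls out of the same shape: run the loop over one tenant first, check the summary, then run it over the rest.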

### Pros

- Immediate, coordinated upgrades across all tenants
- CP has full visibility into upgrade status
- Can implement canary (upgrade 1 tenant first, verify, then rest)
- Rollback capability per tenant

### Cons

- Requires AWS SSM agent on EC2 instances (not installed yet)
- Alternatively requires SSH access from Railway → EC2 (network/key management)
- Brief downtime per tenant during restart (~10-30s)
- Blast radius: a bad image can take down all tenants before canary catches it

### Implementation effort

- Add SSM agent to EC2 user-data script
- Add `POST /cp/admin/rolling-upgrade` handler
- Add upgrade step to publish workflow
- Add rollback logic
- ~2-3 days

### When to use

- Urgent security patches that can't wait 5 min
- Breaking changes that need coordinated rollout
- When you want canary/staged deployment

---
## Option B: Sidecar auto-updater (implemented)

A cron job on each EC2 checks GHCR for a new image digest every 5 minutes.
If the digest changed, it pulls the new image and restarts the container.

```bash
# Runs every 5 min on each EC2 (added to user-data)
*/5 * * * * /usr/local/bin/molecule-auto-update.sh
```

The update script:

1. `docker pull platform-tenant:latest`
2. Compare digest with running container's image digest
3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...`
4. Wait for `/health` 200
5. Log result to `/var/log/molecule-auto-update.log`

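The script's core decision, restart only when the digest actually changed, can be sketched in Python for clarity (the real script is bash). `run` is an injected command runner; comparing `docker image inspect --format '{{.Id}}'` on the tag against `docker inspect --format '{{.Image}}'` on the container is one plausible way to detect a change, offered here as an assumption rather than what the script necessarily does.

```python
def auto_update(run, image="platform-tenant:latest", container="molecule-tenant"):
    """Pull the tag; restart the container only if it points at a new image."""
    run(f"docker pull {image}")
    # ID of the image the tag now points at vs. the one the container runs
    latest_id = run(f"docker image inspect --format '{{{{.Id}}}}' {image}")
    running_id = run(f"docker inspect --format '{{{{.Image}}}}' {container}")
    if latest_id == running_id:
        return "up-to-date"                  # nothing to do until next cron tick
    run(f"docker stop {container}")
    run(f"docker rm {container}")
    run(f"docker run ... {image}")           # run flags elided, as in the doc
    return "restarted"
```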

### Pros

- Zero CP involvement — fully autonomous per tenant
- Tenants upgrade within 5 min of any publish
- No SSH/SSM infrastructure needed
- Each tenant upgrades independently (natural canary)
- Simple to implement (2 lines in user-data + a small script)

### Cons

- Up to 5 min delay between publish and tenant upgrade
- Brief downtime during restart (~10-30s)
- No centralized visibility into upgrade status
- Can't selectively hold back specific tenants
- All tenants track `latest` — no pinned versions

### When to use

- Default for all tenants
- Works well for early-stage SaaS with frequent deploys

---
## Option C: Blue-green via Worker (zero downtime)

Each EC2 runs two container slots: `blue` (current) and `green` (new).
The Cloudflare Worker routes traffic to whichever is healthy.

```
EC2 instance:
  molecule-tenant-blue  → :8080 (current, serving traffic)
  molecule-tenant-green → :8081 (new, starting up)

Upgrade flow:
  1. Pull new image
  2. Start green on :8081
  3. Health check green: GET :8081/health
  4. If healthy: update Worker routing (KV: slug → port 8081)
  5. Stop blue
  6. Next upgrade: blue becomes the new slot

Worker routing:
  KV key: "hongming2" → {"ip": "3.144.193.40", "port": 8081}
  (port defaults to 8080 when not in KV)
```
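
The upgrade flow above can be sketched as follows. All names are illustrative: `run` and `check_health` are injected as in the other sketches, and `kv_put` stands in for whatever writes the Worker's KV routing entry.

```python
SLOTS = {"blue": 8080, "green": 8081}

def blue_green_upgrade(slug, active_slot, run, check_health, kv_put):
    """Start the new image in the inactive slot; flip traffic only if healthy."""
    target = "green" if active_slot == "blue" else "blue"
    port = SLOTS[target]
    run("docker pull platform-tenant:latest")
    run(f"docker run ... --name molecule-tenant-{target} -p {port}:8080 ...")
    if not check_health(port):               # GET :<port>/health
        run(f"docker rm -f molecule-tenant-{target}")
        return active_slot                   # traffic never moved: instant rollback
    kv_put(slug, port)                       # Worker now routes slug → new port
    run(f"docker rm -f molecule-tenant-{active_slot}")
    return target                            # this slot is active for next upgrade
```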

### Pros

- Zero downtime — traffic switches atomically after health check
- Instant rollback — just switch back to the old slot
- Worker already exists — just add port to the routing lookup
- Health-verified before any traffic switches

### Cons

- Double memory usage during transition (~512MB extra per tenant)
- More complex user-data script (manage two containers)
- Worker needs port-aware routing (KV schema change)
- Need to track which slot is active per tenant

### Implementation effort

- Update user-data to manage blue/green containers
- Update Worker to read port from KV
- Add blue/green state tracking to CP (`org_instances.active_slot`)
- Update auto-updater script for blue-green swap
- ~3-5 days

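The port-aware lookup the Worker needs is small; sketched here as logic only (the actual Worker is JavaScript, and `kv_get` stands in for a KV namespace read):

```python
import json

def resolve_origin(slug, kv_get, default_port=8080):
    """Map a tenant slug to its EC2 origin; port defaults to 8080 when absent."""
    raw = kv_get(slug)              # e.g. '{"ip": "3.144.193.40", "port": 8081}'
    if raw is None:
        return None                 # unknown tenant
    entry = json.loads(raw)
    return f'http://{entry["ip"]}:{entry.get("port", default_port)}'
```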

### When to use

- When tenants have SLAs requiring zero downtime
- Production deployments with paying customers
- After Option B proves the auto-update pattern works

---

## Migration path

```
Now:    Option B (auto-updater, 5 min delay, brief downtime)
          ↓
Growth: Option A (add SSM for urgent patches, keep B as default)
          ↓
Scale:  Option C (zero-downtime for premium/enterprise tenants)
```