diff --git a/content/docs/architecture/tenant-image-upgrades.md b/content/docs/architecture/tenant-image-upgrades.md
new file mode 100644
index 0000000..40ae03c
--- /dev/null
+++ b/content/docs/architecture/tenant-image-upgrades.md
@@ -0,0 +1,150 @@
# Tenant Image Upgrade Strategies

> **Status:** Option B (sidecar auto-updater) implemented. Options A and C
> documented for future use.

## Problem

When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant
instances keep running the old image. New orgs get the latest image at boot,
but existing tenants fall behind — missing bug fixes, security patches, and
new features.

## Option A: Rolling restart on publish (coordinated)

The publish workflow calls a CP admin endpoint after pushing the image.
The CP iterates over all running tenants and restarts them one by one.

```
publish-platform-image succeeds
  → POST https://api.moleculesai.app/cp/admin/rolling-upgrade
  → CP queries org_instances WHERE status = 'running'
  → For each tenant (staggered, 30s apart):
      1. AWS SSM Run Command: docker pull + recreate container (stop/rm/run)
      2. Wait for /health 200
      3. Update org_instances.updated_at
      4. If health fails after 60s, rollback (docker run old image)
  → Return summary: {upgraded: N, failed: M, skipped: K}
```

### Pros
- Immediate, coordinated upgrades across all tenants
- CP has full visibility into upgrade status
- Can implement canary (upgrade 1 tenant first, verify, then rest)
- Rollback capability per tenant

### Cons
- Requires AWS SSM agent on EC2 instances (not installed yet)
- Alternatively requires SSH access from Railway → EC2 (network/key management)
- Brief downtime per tenant during restart (~10-30s)
- Blast radius: a bad image can take down all tenants before canary catches it

### Implementation effort
- Add SSM agent to EC2 user-data script
- Add `POST /cp/admin/rolling-upgrade` handler
- Add upgrade step to publish workflow
- Add rollback logic
- ~2-3 days

### When to use
- Urgent security patches that can't wait 5 min
- Breaking changes that need coordinated rollout
- When you want canary/staged deployment

---

## Option B: Sidecar auto-updater (implemented)

A cron job on each EC2 checks GHCR for a new image digest every 5 minutes.
If the digest has changed, it pulls the new image and recreates the container.

```bash
# Runs every 5 min on each EC2 (added to user-data)
*/5 * * * * /usr/local/bin/molecule-auto-update.sh
```

The update script:
1. `docker pull platform-tenant:latest`
2. Compare digest with the running container's image digest
3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...`
4. Wait for `/health` 200
5. Log result to `/var/log/molecule-auto-update.log`
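A minimal sketch of those five steps as a script. The port mapping, `docker run` flags, and log format below are placeholders; the real values live in the EC2 user-data script.

```bash
#!/usr/bin/env bash
# /usr/local/bin/molecule-auto-update.sh (sketch)
set -euo pipefail

IMAGE="platform-tenant:latest"
CONTAINER="molecule-tenant"
LOG="/var/log/molecule-auto-update.log"

log() { echo "$(date -Is) $*" >> "$LOG"; }

# 1. Pull the latest published image.
docker pull "$IMAGE" > /dev/null

# 2. Compare the pulled image ID with the running container's image ID.
latest_id=$(docker inspect --format '{{.Id}}' "$IMAGE")
running_id=$(docker inspect --format '{{.Image}}' "$CONTAINER" 2>/dev/null || echo "none")

if [ "$latest_id" = "$running_id" ]; then
  exit 0  # Already up to date; stay quiet to avoid log spam every 5 min.
fi

# 3. Recreate the container on the new image (flags here are placeholders;
#    the real ones come from the user-data script).
log "new image detected, recreating container ($running_id -> $latest_id)"
docker stop "$CONTAINER" 2>/dev/null || true
docker rm "$CONTAINER" 2>/dev/null || true
docker run -d --name "$CONTAINER" --restart unless-stopped -p 8080:8080 "$IMAGE"

# 4. Wait up to 60s for /health to return 200.
for _ in $(seq 1 12); do
  if curl -fsS http://localhost:8080/health > /dev/null; then
    log "upgrade OK ($latest_id)"
    exit 0
  fi
  sleep 5
done

# 5. Health never came back; leave a trace for manual follow-up.
log "upgrade FAILED: /health not responding on new container"
exit 1
```

Comparing local image IDs after the pull keeps the whole check to plain `docker` commands, and the cron cadence above bounds how stale a tenant can get.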
### Pros
- Zero CP involvement — fully autonomous per tenant
- Tenants upgrade within 5 min of any publish
- No SSH/SSM infrastructure needed
- Each tenant upgrades independently (natural canary)
- Simple to implement (2 lines in user-data + a small script)

### Cons
- Up to 5 min delay between publish and tenant upgrade
- Brief downtime during restart (~10-30s)
- No centralized visibility into upgrade status
- Can't selectively hold back specific tenants
- All tenants track `latest` — no pinned versions

### When to use
- Default for all tenants
- Works well for early-stage SaaS with frequent deploys

---

## Option C: Blue-green via Worker (zero downtime)

Each EC2 runs two container slots: `blue` (current) and `green` (new).
The Cloudflare Worker routes traffic to whichever is healthy.

```
EC2 instance:
  molecule-tenant-blue  → :8080  (current, serving traffic)
  molecule-tenant-green → :8081  (new, starting up)

Upgrade flow:
  1. Pull new image
  2. Start green on :8081
  3. Health check green: GET :8081/health
  4. If healthy: update Worker routing (KV: slug → port 8081)
  5. Stop blue
  6. Next upgrade: blue becomes the new slot

Worker routing:
  KV key: "example-org" → {"ip": "", "port": 8081}
  (port defaults to 8080 when not in KV)
```

### Pros
- Zero downtime — traffic switches atomically after health check
- Instant rollback — just switch back to the old slot
- Worker already exists — just add port to the routing lookup
- Health-verified before any traffic switches

### Cons
- Double memory usage during transition (~512MB extra per tenant)
- More complex user-data script (manage two containers)
- Worker needs port-aware routing (KV schema change)
- Need to track which slot is active per tenant

### Implementation effort
- Update user-data to manage blue/green containers
- Update Worker to read port from KV
- Add blue/green state tracking to CP (org_instances.active_slot)
- Update auto-updater script for blue-green swap
- ~3-5 days

### When to use
- When tenants have SLAs requiring zero downtime
- Production deployments with paying customers
- After Option B proves the auto-update pattern works

---

## Migration path

```
Now:     Option B (auto-updater, 5 min delay, brief downtime)
   ↓
Growth:  Option A (add SSM for urgent patches, keep B as default)
   ↓
Scale:   Option C (zero-downtime for premium/enterprise tenants)
```
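When the Scale step becomes relevant, the per-instance swap for Option C can stay a small script. A minimal sketch, assuming the app listens on 8080 inside the container, the Worker reads the per-slug `{"ip", "port"}` value from a KV namespace written through the Cloudflare API, and the active port is tracked in a local state file; `CF_ACCOUNT_ID`, `CF_KV_NAMESPACE_ID`, `CF_API_TOKEN`, `INSTANCE_IP`, the state file, and the `docker run` flags are all placeholders, not existing config.

```bash
#!/usr/bin/env bash
# Sketch of the Option C blue-green swap on a single EC2 instance.
set -euo pipefail

IMAGE="platform-tenant:latest"
ORG_SLUG="example-org"
STATE_FILE="/etc/molecule/active_port"

ACTIVE_PORT=$(cat "$STATE_FILE" 2>/dev/null || echo 8080)
if [ "$ACTIVE_PORT" = "8080" ]; then
  NEW_PORT=8081; NEW_SLOT="green"; OLD_SLOT="blue"
else
  NEW_PORT=8080; NEW_SLOT="blue"; OLD_SLOT="green"
fi

# 1-2. Pull the new image and start it in the idle slot.
docker pull "$IMAGE"
docker rm -f "molecule-tenant-$NEW_SLOT" 2>/dev/null || true
docker run -d --name "molecule-tenant-$NEW_SLOT" -p "$NEW_PORT:8080" "$IMAGE"

# 3. Health-check the new slot before any traffic moves.
healthy=0
for _ in $(seq 1 12); do
  if curl -fsS "http://localhost:$NEW_PORT/health" > /dev/null; then
    healthy=1
    break
  fi
  sleep 5
done
if [ "$healthy" != "1" ]; then
  echo "$NEW_SLOT slot unhealthy, leaving $OLD_SLOT serving traffic" >&2
  exit 1
fi

# 4. Point the Worker at the new slot via a KV write.
curl -fsS -X PUT \
  "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/storage/kv/namespaces/$CF_KV_NAMESPACE_ID/values/$ORG_SLUG" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -d "{\"ip\": \"$INSTANCE_IP\", \"port\": $NEW_PORT}"

# 5. Stop the old slot and record which port is live.
docker stop "molecule-tenant-$OLD_SLOT" 2>/dev/null || true
echo "$NEW_PORT" > "$STATE_FILE"
```

Rollback before the old slot is stopped is the same KV write pointed back at the previous port.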