# Tenant Image Upgrade Strategies
> **Status:** Option B (sidecar auto-updater) implemented. Options A and C
> documented for future use.
## Problem
When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant
instances keep running the old image. New orgs get the latest image at boot,
but existing tenants fall behind — missing bug fixes, security patches, and
new features.
## Option A: Rolling restart on publish (coordinated)
The publish workflow calls a CP admin endpoint after pushing the image.
The CP iterates all running tenants and restarts them one by one.
```
publish-platform-image succeeds
→ POST https://api.moleculesai.app/cp/admin/rolling-upgrade
→ CP queries org_instances WHERE status = 'running'
→ For each tenant (staggered, 30s apart):
1. AWS SSM Run Command: docker pull + recreate container (stop/rm/run)
2. Wait for /health 200
3. Update org_instances.updated_at
4. If health fails after 60s, rollback (docker run old image)
→ Return summary: {upgraded: N, failed: M, skipped: K}
```
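Step 1 of this loop maps to one SSM `send-command` call per tenant. A minimal
sketch, assuming the SSM agent is already in place; the instance ID, GHCR org,
and `docker run` flags are placeholders, and the container is recreated rather
than just restarted so it actually picks up the new image:
```bash
# Hypothetical per-tenant upgrade command issued by the CP (step 1 above).
# Instance ID, image path (ORG), and run flags are illustrative only.
aws ssm send-command \
  --instance-ids "i-0123456789abcdef0" \
  --document-name "AWS-RunShellScript" \
  --comment "platform-tenant rolling upgrade" \
  --parameters '{"commands": [
    "docker pull ghcr.io/ORG/platform-tenant:latest",
    "docker stop molecule-tenant && docker rm molecule-tenant",
    "docker run -d --name molecule-tenant -p 8080:8080 --restart unless-stopped ghcr.io/ORG/platform-tenant:latest"
  ]}'
```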
### Pros
- Immediate, coordinated upgrades across all tenants
- CP has full visibility into upgrade status
- Can implement canary (upgrade 1 tenant first, verify, then rest)
- Rollback capability per tenant
### Cons
- Requires AWS SSM agent on EC2 instances (not installed yet)
- Alternatively requires SSH access from Railway → EC2 (network/key management)
- Brief downtime per tenant during restart (~10-30s)
- Blast radius: a bad image can take down all tenants before canary catches it
### Implementation effort
- Add SSM agent to EC2 user-data script (sketch after this list)
- Add `POST /cp/admin/rolling-upgrade` handler
- Add upgrade step to publish workflow
- Add rollback logic
- ~2-3 days
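The SSM-agent prerequisite is mostly a user-data addition plus an IAM change.
A sketch assuming Amazon Linux 2 AMIs, where the agent ships preinstalled and
only needs enabling; other distros need an install step first:
```bash
# Hypothetical user-data addition for Option A (assumes Amazon Linux 2,
# where amazon-ssm-agent is preinstalled). Ubuntu AMIs would need
# `snap install amazon-ssm-agent --classic` first. The instance profile
# must also grant the AmazonSSMManagedInstanceCore managed policy so
# Run Command can reach the instance.
systemctl enable --now amazon-ssm-agent
```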
### When to use
- Urgent security patches that can't wait 5 min
- Breaking changes that need coordinated rollout
- When you want canary/staged deployment
---
## Option B: Sidecar auto-updater (implemented)
A cron job on each EC2 instance checks GHCR for a new image digest every 5
minutes. If the digest has changed, it pulls the new image and recreates the
container.
```bash
# Runs every 5 min on each EC2 (added to user-data)
*/5 * * * * /usr/local/bin/molecule-auto-update.sh
```
The update script (sketched in full after this list):
1. `docker pull platform-tenant:latest`
2. Compare digest with running container's image digest
3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...`
4. Wait for `/health` 200
5. Log result to `/var/log/molecule-auto-update.log`
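A minimal sketch of that script, assuming a placeholder GHCR org, the
container name and log path above, and an app that listens on 8080 and serves
`/health`:
```bash
#!/usr/bin/env bash
# Sketch of /usr/local/bin/molecule-auto-update.sh (Option B). Image path,
# port mapping, and health URL are assumptions, not production values.
set -euo pipefail

IMAGE="ghcr.io/ORG/platform-tenant:latest"   # ORG is a placeholder
CONTAINER="molecule-tenant"
HEALTH_URL="http://localhost:8080/health"
LOG="/var/log/molecule-auto-update.log"

log() { echo "$(date -Is) $*" >> "$LOG"; }

# 1. Pull the latest tag (a no-op when nothing new was published).
docker pull "$IMAGE" >/dev/null

# 2. Compare the image the container was created from with whatever the
#    tag now points to. Equal IDs mean there is nothing to do.
running=$(docker inspect --format '{{.Image}}' "$CONTAINER")
latest=$(docker inspect --format '{{.Id}}' "$IMAGE")
if [ "$running" = "$latest" ]; then
  exit 0
fi

# 3. Recreate the container from the new image.
log "new image detected, recreating $CONTAINER"
docker stop "$CONTAINER" && docker rm "$CONTAINER"
docker run -d --name "$CONTAINER" -p 8080:8080 --restart unless-stopped "$IMAGE"

# 4 + 5. Wait up to 60s for /health to return 200, then log the outcome.
for _ in $(seq 1 12); do
  if curl -fsS "$HEALTH_URL" >/dev/null; then
    log "upgrade ok: $latest"
    exit 0
  fi
  sleep 5
done
log "upgrade FAILED health check: $latest"
exit 1
```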
### Pros
- Zero CP involvement — fully autonomous per tenant
- Tenants upgrade within 5 min of any publish
- No SSH/SSM infrastructure needed
- Each tenant upgrades independently (natural canary)
- Simple to implement (2 lines in user-data + a small script)
### Cons
- Up to 5 min delay between publish and tenant upgrade
- Brief downtime during restart (~10-30s)
- No centralized visibility into upgrade status
- Can't selectively hold back specific tenants
- All tenants track `latest` — no pinned versions
### When to use
- Default for all tenants
- Works well for early-stage SaaS with frequent deploys
---
## Option C: Blue-green via Worker (zero downtime)
Each EC2 runs two container slots: `blue` (current) and `green` (new).
The Cloudflare Worker routes traffic to whichever is healthy.
```
EC2 instance:
molecule-tenant-blue → :8080 (current, serving traffic)
molecule-tenant-green → :8081 (new, starting up)
Upgrade flow:
1. Pull new image
2. Start green on :8081
3. Health check green: GET :8081/health
4. If healthy: update Worker routing (KV: slug → port 8081)
5. Stop blue
6. Next upgrade: blue becomes the new slot
Worker routing:
KV key: "hongming2" → {"ip": "3.144.193.40", "port": 8081}
(port defaults to 8080 when not in KV)
```
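A sketch of the swap for one tenant, following the five steps above. The slug
and IP come from the example entry; `CF_ACCOUNT_ID`, `CF_KV_NAMESPACE_ID`,
`CF_API_TOKEN`, the GHCR org, and the in-container port are assumptions, and
writing the KV entry through the Cloudflare REST API is just one way to do
step 4:
```bash
#!/usr/bin/env bash
# Sketch of a blue → green swap on one tenant (steps 1-5 above).
# Assumes the app listens on 8080 inside the container and that the
# CF_* environment variables are already set.
set -euo pipefail

IMAGE="ghcr.io/ORG/platform-tenant:latest"   # ORG is a placeholder
SLUG="hongming2"
TENANT_IP="3.144.193.40"

# 1-2. Pull the new image and start the green slot next to blue.
docker pull "$IMAGE"
docker run -d --name molecule-tenant-green -p 8081:8080 "$IMAGE"

# 3. Give green up to 60s to come up healthy; otherwise abort and keep blue.
healthy=0
for _ in $(seq 1 12); do
  if curl -fsS "http://localhost:8081/health" >/dev/null; then
    healthy=1
    break
  fi
  sleep 5
done
if [ "$healthy" -ne 1 ]; then
  docker rm -f molecule-tenant-green
  exit 1
fi

# 4. Repoint the Worker at green by rewriting the tenant's routing entry
#    in KV (Cloudflare KV REST API shown; the CP could do this instead).
curl -fsS -X PUT \
  "https://api.cloudflare.com/client/v4/accounts/${CF_ACCOUNT_ID}/storage/kv/namespaces/${CF_KV_NAMESPACE_ID}/values/${SLUG}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  --data "{\"ip\": \"${TENANT_IP}\", \"port\": 8081}"

# 5. Retire blue; on the next upgrade the slots trade roles.
docker stop molecule-tenant-blue && docker rm molecule-tenant-blue
```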
### Pros
- Zero downtime — traffic switches atomically after health check
- Instant rollback — just switch back to the old slot
- Worker already exists — just add port to the routing lookup
- Health-verified before any traffic switches
### Cons
- Double memory usage during transition (~512MB extra per tenant)
- More complex user-data script (manage two containers)
- Worker needs port-aware routing (KV schema change)
- Need to track which slot is active per tenant
### Implementation effort
- Update user-data to manage blue/green containers
- Update Worker to read port from KV
- Add blue/green state tracking to CP (org_instances.active_slot)
- Update auto-updater script for blue-green swap
- ~3-5 days
### When to use
- When tenants have SLAs requiring zero downtime
- Production deployments with paying customers
- After Option B proves the auto-update pattern works
---
## Migration path
```
Now: Option B (auto-updater, 5 min delay, brief downtime)
Growth: Option A (add SSM for urgent patches, keep B as default)
Scale: Option C (zero-downtime for premium/enterprise tenants)
```