Tenant Image Upgrade Strategies

Status: Option B (sidecar auto-updater) implemented. Options A and C documented for future use.

Problem

When we push a new platform-tenant:latest to GHCR, existing EC2 tenant instances keep running the old image. New orgs get the latest image at boot, but existing tenants fall behind — missing bug fixes, security patches, and new features.

Option A: Rolling restart on publish (coordinated)

After pushing the image, the publish workflow calls a CP admin endpoint. The CP then iterates over all running tenants and restarts them one by one.

publish-platform-image succeeds
  → POST https://api.moleculesai.app/cp/admin/rolling-upgrade
    → CP queries org_instances WHERE status = 'running'
    → For each tenant (staggered, 30s apart):
      1. AWS SSM Run Command: docker pull, then recreate the container (stop/rm/run — a plain docker restart would keep the old image)
      2. Wait for /health 200
      3. Update org_instances.updated_at
      4. If health fails after 60s, rollback (docker run old image)
    → Return summary: {upgraded: N, failed: M, skipped: K}
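
A minimal sketch of the per-tenant step, assuming the SSM agent is installed and the CP shells out to the AWS CLI. The instance ID, ghcr.io image path, and health URL are illustrative, not settled choices:

# Hypothetical per-tenant upgrade via SSM Run Command (step 1 of the flow)
aws ssm send-command \
  --instance-ids "i-0123456789abcdef0" \
  --document-name "AWS-RunShellScript" \
  --comment "molecule rolling upgrade" \
  --parameters 'commands=["docker pull ghcr.io/<org>/platform-tenant:latest","docker rm -f molecule-tenant","docker run -d --name molecule-tenant -p 8080:8080 ghcr.io/<org>/platform-tenant:latest"]'

# Step 2: poll /health for up to 60s; if it never returns 200,
# step 4's rollback would re-run the previous image
for i in $(seq 1 12); do
  curl -fsS "http://<EC2_IP>:8080/health" >/dev/null && break
  sleep 5
done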

Pros

  • Immediate, coordinated upgrades across all tenants
  • CP has full visibility into upgrade status
  • Can implement canary (upgrade 1 tenant first, verify, then rest)
  • Rollback capability per tenant

Cons

  • Requires AWS SSM agent on EC2 instances (not installed yet)
  • Alternatively requires SSH access from Railway → EC2 (network/key management)
  • Brief downtime per tenant during restart (~10-30s)
  • Blast radius: a bad image can take down all tenants before canary catches it

Implementation effort

  • Add SSM agent to EC2 user-data script
  • Add POST /cp/admin/rolling-upgrade handler
  • Add upgrade step to publish workflow
  • Add rollback logic
  • ~2-3 days

When to use

  • Urgent security patches that can't wait 5 min
  • Breaking changes that need coordinated rollout
  • When you want canary/staged deployment

Option B: Sidecar auto-updater (implemented)

A cron job on each EC2 checks GHCR for a new image digest every 5 minutes. If the digest changed, it pulls the new image and restarts the container.

# Runs every 5 min on each EC2 (added to user-data)
*/5 * * * * /usr/local/bin/molecule-auto-update.sh

The update script:

  1. docker pull platform-tenant:latest
  2. Compare digest with running container's image digest
  3. If different: docker stop molecule-tenant && docker rm molecule-tenant && docker run ...
  4. Wait for /health 200
  5. Log result to /var/log/molecule-auto-update.log
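
One possible shape for that script, sketched under the assumption that the container listens on :8080 and the image lives under ghcr.io (paths and run flags are illustrative, not the shipped script):

#!/usr/bin/env bash
# Sketch of /usr/local/bin/molecule-auto-update.sh
set -euo pipefail

IMAGE="ghcr.io/<org>/platform-tenant:latest"
NAME="molecule-tenant"
LOG="/var/log/molecule-auto-update.log"

docker pull "$IMAGE" >/dev/null

# Steps 1-2: compare the pulled image ID against the running container's
new_id=$(docker inspect -f '{{.Id}}' "$IMAGE")
cur_id=$(docker inspect -f '{{.Image}}' "$NAME" 2>/dev/null || echo "none")
[ "$new_id" = "$cur_id" ] && exit 0   # already up to date

# Step 3: recreate the container on the new image
docker rm -f "$NAME" 2>/dev/null || true
docker run -d --name "$NAME" --restart unless-stopped -p 8080:8080 "$IMAGE"

# Steps 4-5: wait for /health, then log the outcome
for i in $(seq 1 12); do
  if curl -fsS "http://localhost:8080/health" >/dev/null; then
    echo "$(date -Is) upgraded to $new_id" >> "$LOG"; exit 0
  fi
  sleep 5
done
echo "$(date -Is) upgrade to $new_id failed health check" >> "$LOG"; exit 1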

Pros

  • Zero CP involvement — fully autonomous per tenant
  • Tenants upgrade within 5 min of any publish
  • No SSH/SSM infrastructure needed
  • Each tenant upgrades independently (natural canary)
  • Simple to implement (2 lines in user-data + a small script)

Cons

  • Up to 5 min delay between publish and tenant upgrade
  • Brief downtime during restart (~10-30s)
  • No centralized visibility into upgrade status
  • Can't selectively hold back specific tenants
  • All tenants track latest — no pinned versions

When to use

  • Default for all tenants
  • Works well for early-stage SaaS with frequent deploys

Option C: Blue-green via Worker (zero downtime)

Each EC2 runs two container slots: blue (current) and green (new). The Cloudflare Worker routes traffic to whichever slot KV marks as active, and a slot only becomes active after passing a health check.

EC2 instance:
  molecule-tenant-blue  → :8080 (current, serving traffic)
  molecule-tenant-green → :8081 (new, starting up)

Upgrade flow:
  1. Pull new image
  2. Start green on :8081
  3. Health check green: GET :8081/health
  4. If healthy: update Worker routing (KV: slug → port 8081)
  5. Stop blue
  6. On the next upgrade the roles swap: green serves traffic, blue receives the new image

Worker routing:
  KV key: "example-org" → {"ip": "<EC2_IP>", "port": 8081}
  (port defaults to 8080 when not in KV)
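
A sketch of one blue→green swap under the assumptions above. The ghcr.io path, container ports, and the wrangler KV write are illustrative; the KV flip could equally be done by the CP through the Cloudflare API rather than from the instance:

#!/usr/bin/env bash
# Hypothetical blue→green swap on one EC2 instance
set -euo pipefail

IMAGE="ghcr.io/<org>/platform-tenant:latest"
docker pull "$IMAGE" >/dev/null

# Steps 1-2: start green on :8081 while blue keeps serving on :8080
docker run -d --name molecule-tenant-green -p 8081:8080 "$IMAGE"

# Step 3: health check green before any traffic moves
healthy=""
for i in $(seq 1 12); do
  curl -fsS "http://localhost:8081/health" >/dev/null && healthy=1 && break
  sleep 5
done
if [ -z "$healthy" ]; then
  docker rm -f molecule-tenant-green   # abort; blue never stopped serving
  exit 1
fi

# Step 4: point the Worker at the green slot
wrangler kv key put --namespace-id "<KV_NAMESPACE_ID>" \
  "example-org" '{"ip": "<EC2_IP>", "port": 8081}'

# Step 5: retire blue; the next upgrade runs the same flow with roles swapped
docker rm -f molecule-tenant-blue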

Pros

  • Zero downtime — traffic switches atomically after health check
  • Instant rollback — just switch back to the old slot
  • Worker already exists — just add port to the routing lookup
  • Health-verified before any traffic switches

Cons

  • Double memory usage during transition (~512MB extra per tenant)
  • More complex user-data script (manage two containers)
  • Worker needs port-aware routing (KV schema change)
  • Need to track which slot is active per tenant

Implementation effort

  • Update user-data to manage blue/green containers
  • Update Worker to read port from KV
  • Add blue/green state tracking to CP (org_instances.active_slot)
  • Update auto-updater script for blue-green swap
  • ~3-5 days

When to use

  • When tenants have SLAs requiring zero downtime
  • Production deployments with paying customers
  • After Option B proves the auto-update pattern works

Migration path

Now:     Option B (auto-updater, 5 min delay, brief downtime)
         ↓
Growth:  Option A (add SSM for urgent patches, keep B as default)
         ↓
Scale:   Option C (zero-downtime for premium/enterprise tenants)