Files
molecule-core/scripts/demo-day-runbook.md
hongming f820780036
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
CI / Canvas Deploy Reminder (pull_request) Blocked by required conditions
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 9s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 5s
CI / Detect changes (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 3s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 10s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 12s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 48s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 4s
Harness Replays / detect-changes (pull_request) Successful in 5s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 39s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 4s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m8s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m10s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 53s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m23s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m23s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
gate-check-v3 / gate-check (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request) Has been skipped
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request) Successful in 3s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m8s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m11s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Successful in 5m40s
qa-review / approved (pull_request) Refired via /qa-recheck by codex-local
security-review / approved (pull_request) Refired via /security-recheck by codex-local
CI / Shellcheck (E2E scripts) (pull_request) Successful in 33s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m19s
E2E Chat / E2E Chat (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Harness Replays / Harness Replays (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 10m5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m3s
CI / Canvas (Next.js) (pull_request) Successful in 9m45s
CI / all-required (pull_request) Successful in 31m8s
audit-force-merge / audit (pull_request) Successful in 14s
chore: restrict maintained workspace runtimes
2026-05-24 19:48:00 -07:00

10 KiB

Demo-day runbook

Pre-, during-, and post-demo operational procedures for the molecule production stack. Updated 2026-05-01 ahead of the funding-demo on ~2026-05-06.

The whole stack:

Vercel canvas (app.moleculesai.app)
  → Railway controlplane (api.moleculesai.app)
    → CloudFront/Cloudflare per-tenant edge (<slug>.moleculesai.app)
      → EC2 tenant instance running platform container
        → Docker workspaces pulled from
          ghcr.io/molecule-ai/workspace-template-<runtime>:latest

Every layer has its own deploy/rollback story. This runbook indexes them in the order an operator would touch them during an incident.

Pre-demo (T-48h to T-1h)

1. Freeze the runtime + template image cascade

A merge to molecule-core/staging that touches workspace/** triggers publish-runtime.yml → PyPI bump → repository_dispatch → 8 template repos rebuild and re-tag :latest. A merge to any template repo's main triggers the same final re-tag directly. Either path means a new workspace provision during the demo pulls whatever :latest resolved to seconds earlier.

Capture current good digests + disable both cascade vectors:

# Dry-run first — verifies digests can be fetched and tooling is set up
scripts/demo-freeze.sh

# Apply
scripts/demo-freeze.sh --execute

The script writes two receipts to scripts/demo-freeze-snapshots/:

  • digests-<TS>.txt — current :latest digest per template (rollback target if needed)
  • disabled-workflows-<TS>.txt — workflow paths to re-enable post-demo

Verify the freeze landed:

gh workflow list -R Molecule-AI/molecule-core | grep publish-runtime
# expect: status = disabled_manually

If a critical fix MUST ship during the freeze window:

  1. gh workflow enable publish-runtime.yml -R Molecule-AI/molecule-core
  2. Merge the fix
  3. Watch the cascade through to GHCR:latest manually
  4. Smoke-verify against a staging tenant (scripts/api-smoke.sh or manual canvas walkthrough)
  5. gh workflow disable publish-runtime.yml -R Molecule-AI/molecule-core to re-freeze

Don't auto-promote during the freeze — the value of the freeze is that nothing happens automatically.

2. Confirm production CP is on the expected SHA

gh run list -R Molecule-AI/molecule-controlplane --branch main --limit 5
# Last `ci` run should be SUCCESS with the SHA you intend to demo on

Railway auto-deploys from main. Spot-check api.moleculesai.app:

curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  https://api.moleculesai.app/cp/admin/orgs?limit=1
# Expect: 200 + a JSON {"orgs": [...]}

3. Confirm production canvas (Vercel) is on main

Vercel auto-deploys main. Verify in the Vercel dashboard the most recent prod deploy ran from the expected commit SHA.

4. Pre-warm the demo tenant

Cold-start times on workspace-template images:

Runtime Cold-start (first boot)
claude-code ~30-60s
openclaw ~1-2 min
claude-code ~1 min
hermes ~7 min (large image)

If the demo will use hermes, provision the demo workspace at least 10 min before. The cold-start clock starts when the workspace is created, not when it's used.

During demo — emergency rollback levers

Lever A: Platform-image rollback (canvas/CP layer regression)

If the canvas or platform container shipped a regression, retag :latest to a prior staging SHA without rebuilding:

# Find a known-good SHA from staging history
gh run list -R Molecule-AI/molecule-core --workflow=publish-canvas-image.yml --limit 5

# Roll both platform + tenant images
GITHUB_TOKEN=$(gh auth token) scripts/rollback-latest.sh <good-sha>

rollback-latest.sh retags both ghcr.io/molecule-ai/platform:latest and ghcr.io/molecule-ai/platform-tenant:latest. Existing tenants auto-pull :latest every 5 min — rollback propagates without manual restart.

Lever B: Workspace-template image rollback

If a specific runtime template (claude-code, hermes, etc.) shipped a broken :latest:

# Get the demo's snapshotted-good digest from the freeze receipt
grep claude-code scripts/demo-freeze-snapshots/digests-<TS>.txt

# Retag :latest back to the snapshotted digest using crane
crane auth login ghcr.io -u "$(gh api user --jq .login)" \
  --password-stdin <<< "$(gh auth token)"
crane tag \
  ghcr.io/molecule-ai/workspace-template-claude-code@sha256:<digest> \
  latest

The next workspace provision pulls the rolled-back image. Existing workspaces are unaffected (their image is already loaded into Docker).

Lever C: Wedged demo tenant — redeploy

If the demo tenant's EC2 instance is wedged (boot succeeded but app not responding, or a stuck workspace), the controlplane has an admin redeploy endpoint:

# AWS-side: forces a fresh EC2 launch with current image. ~3 min.
curl -fsS -X POST \
  -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  https://api.moleculesai.app/cp/admin/orgs/<slug>/redeploy

WARNING per memory: this triggers real EC2 + SSM actions on production. Double-check <slug> against the demo tenant's slug before pressing return. The /redeploy endpoint is idempotent on the EC2 side but WILL drop active SSH sessions.

Lever D: Specific bad workspace — delete

If a single workspace inside the demo tenant is misbehaving (e.g. hermes wedged on cold-start, claude-code returning the generic "Agent error (Exception)" message), kill it:

# Get the demo tenant's per-tenant ADMIN_TOKEN
TENANT_ADMIN=$(curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  https://api.moleculesai.app/cp/admin/orgs/<slug>/admin-token \
  | jq -r .admin_token)

ORG_ID=$(curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  https://api.moleculesai.app/cp/admin/orgs?limit=20 \
  | jq -r '.orgs[] | select(.slug=="<slug>") | .id')

# Delete the bad workspace
curl -fsS -X DELETE \
  -H "Origin: https://<slug>.moleculesai.app" \
  -H "Authorization: Bearer $TENANT_ADMIN" \
  -H "X-Molecule-Org-Id: $ORG_ID" \
  https://<slug>.moleculesai.app/workspaces/<workspace-id>

Then re-provision a fresh workspace from the canvas. Faster than debugging the wedged one.

Lever E: Railway production rollback (CP regression)

If the last Railway deploy of CP introduced a regression that lever A can't fix (e.g. a logic bug, not a container issue):

  1. Open Railway dashboard → molecule-platform → controlplane → Deployments
  2. Find the previous-known-good deployment
  3. Click Rollback to this deployment

Manual step — no CLI equivalent built. Takes ~30s to redeploy from the prior image. Note: rollback restores the prior code AND prior env var snapshot; don't expect any env var changes made since to persist.

Lever F: Vercel production rollback (canvas regression)

If the canvas ships a regression:

  1. Open Vercel dashboard → molecule-app → Deployments
  2. Find the previous prod deployment
  3. Promote to Production

Same pattern as Railway — fast revert, no rebuild.

Tenant-level read-only diagnostics (not actions)

Useful during a "is this working?" moment without touching anything:

# Tenant infra state
curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  "https://api.moleculesai.app/cp/admin/orgs?limit=20" \
  | jq '.orgs[] | select(.slug=="<slug>")'

# Tenant boot events (debug a stuck provision)
curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  "https://api.moleculesai.app/cp/admin/tenants/<slug>/boot-events?limit=50" \
  | jq

# Workspace activity (debug an unresponsive agent)
curl -fsS \
  -H "Origin: https://<slug>.moleculesai.app" \
  -H "Authorization: Bearer $TENANT_ADMIN" \
  -H "X-Molecule-Org-Id: $ORG_ID" \
  "https://<slug>.moleculesai.app/workspaces/<workspace-id>/activity?limit=20" \
  | jq

Post-demo (T+30m to T+24h)

1. Thaw the cascades

# Find the freeze receipt
ls scripts/demo-freeze-snapshots/

# Thaw — pass the timestamp suffix
scripts/demo-thaw.sh 20260506-180000

The next merge to molecule-core/staging (workspace/**) or any template repo's main will resume the auto-rebuild cascade.

2. Audit what was held back

If any merges queued during the freeze:

gh pr list -R Molecule-AI/molecule-core --base staging --state merged \
  --search "merged:>=$(date -u -v-7d +%Y-%m-%d)"

Verify each merge's CI is green and dispatch the runtime cascade once to ensure all templates rebuild against the post-freeze HEAD.

3. File a post-mortem if anything fired

If any rollback lever was used during the demo, file a brief doc:

  • Which lever (A through F)
  • Which SHA was rolled back FROM and TO
  • Did the rollback fully resolve the issue or was a follow-up needed
  • Whether the underlying regression should have been caught by CI

Common issues + first-line fix

Symptom First lever to try
Workspace boots but agent always errors Lever D (delete + reprovision)
Whole tenant unreachable Lever C (redeploy)
Canvas crashes on load Lever F (Vercel rollback)
Login broken / API errors Lever E (Railway rollback)
Specific runtime broken across tenants Lever B (template image rollback)
Platform container regression Lever A (rollback-latest.sh)
Mid-demo stray PR auto-published a bad image Lever B + investigate why freeze didn't catch it

Auth fingerprint (rotate post-demo)

The freeze + rollback procedures assume:

  • CP_ADMIN_API_TOKEN available via railway variables --kv --environment production
  • gh auth token returns a working PAT with workflow:write + write:packages
  • crane installed (brew install crane)

After the demo, rotate CP_ADMIN_API_TOKEN (it's the keys-to-the-kingdom token for production) — it likely got copy-pasted into shells during the demo.

# Generate a new admin token
NEW_TOKEN=$(openssl rand -hex 32)

# Update Railway production env var (and optionally staging)
railway variables --set CP_ADMIN_API_TOKEN="$NEW_TOKEN" --environment production

# Restart CP service to pick up the change
# (Railway auto-restarts on env var change)

# Verify
curl -fsS -H "Authorization: Bearer $NEW_TOKEN" \
  https://api.moleculesai.app/cp/admin/orgs?limit=1