Merge pull request #2456 from Molecule-AI/ops/demo-day-freeze-runbook

ops: demo-day freeze + rollback runbook
Hongming Wang 2026-05-01 19:06:51 +00:00 committed by GitHub
commit 0b809cfa62
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
5 changed files with 650 additions and 0 deletions

scripts/demo-day-runbook.md Normal file

@ -0,0 +1,306 @@
# Demo-day runbook
Pre-, during-, and post-demo operational procedures for the molecule
production stack. Updated 2026-05-01 ahead of the funding-demo on
~2026-05-06.
The whole stack:
```
Vercel canvas (app.moleculesai.app)
  → Railway controlplane (api.moleculesai.app)
    → CloudFront/Cloudflare per-tenant edge (<slug>.moleculesai.app)
      → EC2 tenant instance running platform container
        → Docker workspaces pulled from
          ghcr.io/molecule-ai/workspace-template-<runtime>:latest
```
Every layer has its own deploy/rollback story. This runbook indexes
them in the order an operator would touch them during an incident.
## Pre-demo (T-48h to T-1h)
### 1. Freeze the runtime + template image cascade
A merge to `molecule-core/staging` that touches `workspace/**` triggers
`publish-runtime.yml` → PyPI bump → repository_dispatch → 8 template
repos rebuild and re-tag `:latest`. A merge to any template repo's
`main` triggers the same final re-tag directly. Either path means a
new workspace provision during the demo pulls whatever `:latest`
resolved to seconds earlier.
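To judge whether a pending merge would set off the first cascade path, the trigger condition can be approximated locally. This is a minimal sketch: `touches_workspace` is a hypothetical helper, and a plain prefix match stands in for the workflow's `workspace/**` path filter, whose exact definition lives in `publish-runtime.yml`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if any path read from stdin sits under
# workspace/. A prefix match approximates the `workspace/**` path filter.
touches_workspace() {
  local path
  while IFS= read -r path; do
    case "$path" in
      workspace/*) return 0 ;;
    esac
  done
  return 1
}

# Example (not executed here): check the files changed by the latest commit.
#   git diff --name-only HEAD~1 HEAD | touches_workspace \
#     && echo "this change would trigger the runtime cascade"
```

Paths such as `workspace-server/...` do not match, because the pattern requires the `workspace/` directory itself.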
Capture current good digests + disable both cascade vectors:
```bash
# Dry-run first — verifies digests can be fetched and tooling is set up
scripts/demo-freeze.sh
# Apply
scripts/demo-freeze.sh --execute
```
The script writes two receipts to `scripts/demo-freeze-snapshots/`:
- `digests-<TS>.txt` — current `:latest` digest per template (rollback target if needed)
- `disabled-workflows-<TS>.txt` — workflow paths to re-enable post-demo
Verify the freeze landed:
```bash
gh workflow list --all -R Molecule-AI/molecule-core | grep publish-runtime
# expect: state = disabled_manually (--all is required; disabled workflows are hidden by default)
```
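The `disabled-workflows-<TS>.txt` receipt can also be checked for completeness. A rough sketch, assuming the receipt keeps one `<repo>: <workflow>` line per disabled workflow and that a full freeze covers nine workflows (`publish-runtime.yml` plus one `publish-image.yml` per template repo). `count_disabled` and `freeze_is_complete` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count "<repo>: <workflow>" lines in a freeze receipt.
count_disabled() {
  grep -c ': ' "$1" || true   # grep -c exits 1 on zero matches; keep going
}

# Hypothetical helper: succeed only if the receipt lists the expected number
# of workflows. Default 9 = 1 (molecule-core) + 8 (template repos).
freeze_is_complete() {
  local receipt="$1" expected="${2:-9}"
  [ -f "$receipt" ] || return 1
  [ "$(count_disabled "$receipt")" -eq "$expected" ]
}

# Example (not executed here):
#   freeze_is_complete scripts/demo-freeze-snapshots/disabled-workflows-<TS>.txt \
#     || echo "freeze incomplete: re-run demo-freeze.sh --execute"
```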
If a critical fix MUST ship during the freeze window:
1. `gh workflow enable publish-runtime.yml -R Molecule-AI/molecule-core`
2. Merge the fix
3. Watch the cascade through to the GHCR `:latest` re-tag manually
4. Smoke-verify against a staging tenant (`scripts/api-smoke.sh` or
manual canvas walkthrough)
5. `gh workflow disable publish-runtime.yml -R Molecule-AI/molecule-core` to re-freeze
Don't auto-promote during the freeze — the value of the freeze is that
nothing happens automatically.
### 2. Confirm production CP is on the expected SHA
```bash
gh run list -R Molecule-AI/molecule-controlplane --branch main --limit 5
# Last `ci` run should be SUCCESS with the SHA you intend to demo on
```
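The SHA comparison can be done mechanically rather than by eye. A minimal sketch, assuming the intended SHA may be abbreviated; `sha_matches` is a hypothetical helper, and the `gh run list --json headSha` call in the comment assumes a `gh` version with JSON output for runs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the full SHA starts with the expected
# (possibly abbreviated) SHA. An empty expected value never matches.
sha_matches() {
  local expected="$1" actual="$2"
  [ -n "$expected" ] && [ "${actual#"$expected"}" != "$actual" ]
}

# Example (not executed here):
#   actual=$(gh run list -R Molecule-AI/molecule-controlplane --branch main \
#     --limit 1 --json headSha --jq '.[0].headSha')
#   sha_matches "<expected-sha>" "$actual" || echo "CP is NOT on the expected SHA"
```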
Railway auto-deploys from main. Spot-check `api.moleculesai.app`:
```bash
curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  "https://api.moleculesai.app/cp/admin/orgs?limit=1"
# Expect: 200 + a JSON {"orgs": [...]}
```
### 3. Confirm production canvas (Vercel) is on main
Vercel auto-deploys `main`. Verify in the Vercel dashboard the most
recent prod deploy ran from the expected commit SHA.
### 4. Pre-warm the demo tenant
Cold-start times on workspace-template images:
| Runtime | Cold-start (first boot) |
|---|---|
| claude-code | ~30-60s |
| openclaw | ~1-2 min |
| langgraph | ~1 min |
| hermes | **~7 min** (large image) |
If the demo will use `hermes`, provision the demo workspace at least
10 min before. The cold-start clock starts when the workspace is
created, not when it's used.
## During demo — emergency rollback levers
### Lever A: Platform-image rollback (canvas/CP layer regression)
If the canvas or platform container shipped a regression, retag
`:latest` to a prior staging SHA without rebuilding:
```bash
# Find a known-good SHA from staging history
gh run list -R Molecule-AI/molecule-core --workflow=publish-canvas-image.yml --limit 5
# Roll both platform + tenant images
GITHUB_TOKEN=$(gh auth token) scripts/rollback-latest.sh <good-sha>
```
`rollback-latest.sh` retags both `ghcr.io/molecule-ai/platform:latest`
and `ghcr.io/molecule-ai/platform-tenant:latest`. Existing tenants
auto-pull `:latest` every 5 min — rollback propagates without manual
restart.
### Lever B: Workspace-template image rollback
If a specific runtime template (claude-code, hermes, etc.) shipped a
broken `:latest`:
```bash
# Get the demo's snapshotted-good digest from the freeze receipt
grep claude-code scripts/demo-freeze-snapshots/digests-<TS>.txt
# Retag :latest back to the snapshotted digest using crane
crane auth login ghcr.io -u "$(gh api user --jq .login)" \
--password-stdin <<< "$(gh auth token)"
crane tag \
ghcr.io/molecule-ai/workspace-template-claude-code@sha256:<digest> \
latest
```
The next workspace provision pulls the rolled-back image. Existing
workspaces are unaffected (their image is already loaded into Docker).
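The receipt lookup can be scripted so the digest is never copied by hand. A minimal sketch, assuming the receipt keeps the `<runtime>: <digest>` line format written by `demo-freeze.sh` and that the stored digest already includes its `sha256:` prefix; `digest_for` is a hypothetical helper, not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the snapshotted digest for one runtime from a
# "<runtime>: <digest>" receipt. Exits non-zero if the runtime is absent.
digest_for() {
  local runtime="$1" receipt="$2"
  awk -v rt="$runtime" -F': ' \
    '$1 == rt { print $2; found = 1; exit } END { exit !found }' "$receipt"
}

# Example (not executed here):
#   digest=$(digest_for claude-code scripts/demo-freeze-snapshots/digests-<TS>.txt) \
#     && crane tag "ghcr.io/molecule-ai/workspace-template-claude-code@${digest}" latest
```

Because the lookup fails loudly when a runtime is missing from the receipt, the `crane tag` call is skipped rather than run with an empty digest.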
### Lever C: Wedged demo tenant — redeploy
If the demo tenant's EC2 instance is wedged (boot succeeded but app
not responding, or a stuck workspace), the controlplane has an admin
redeploy endpoint:
```bash
# AWS-side: forces a fresh EC2 launch with current image. ~3 min.
curl -fsS -X POST \
-H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
https://api.moleculesai.app/cp/admin/orgs/<slug>/redeploy
```
WARNING: this triggers real EC2 + SSM actions on production.
Double-check `<slug>` against the demo tenant's slug before pressing
return. The `/redeploy` endpoint is idempotent on the EC2 side but
WILL drop active SSH sessions.
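One way to enforce the double-check is a retype-to-confirm guard in front of the call. This is only a sketch: `confirm_slug` is a hypothetical helper and `SLUG` a hypothetical variable, neither of which exists in the repo.

```bash
#!/usr/bin/env bash
# Hypothetical guard: the operator must retype the slug exactly before the
# destructive call is allowed to proceed. Reads one line from stdin.
confirm_slug() {
  local expected="$1" typed
  printf 'Type the tenant slug to confirm redeploy of "%s": ' "$expected" >&2
  IFS= read -r typed || return 1
  [ "$typed" = "$expected" ]
}

# Example (not executed here):
#   SLUG="<slug>"
#   confirm_slug "$SLUG" && curl -fsS -X POST \
#     -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
#     "https://api.moleculesai.app/cp/admin/orgs/${SLUG}/redeploy"
```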
### Lever D: Specific bad workspace — delete
If a single workspace inside the demo tenant is misbehaving (e.g.
hermes wedged on cold-start, claude-code returning the generic
"Agent error (Exception)" message), kill it:
```bash
# Get the demo tenant's per-tenant ADMIN_TOKEN
TENANT_ADMIN=$(curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  "https://api.moleculesai.app/cp/admin/orgs/<slug>/admin-token" \
  | jq -r .admin_token)
# URLs are quoted so `?` is not treated as a glob (zsh) and placeholders stay literal
ORG_ID=$(curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
  "https://api.moleculesai.app/cp/admin/orgs?limit=20" \
  | jq -r '.orgs[] | select(.slug=="<slug>") | .id')
# Delete the bad workspace
curl -fsS -X DELETE \
  -H "Origin: https://<slug>.moleculesai.app" \
  -H "Authorization: Bearer $TENANT_ADMIN" \
  -H "X-Molecule-Org-Id: $ORG_ID" \
  "https://<slug>.moleculesai.app/workspaces/<workspace-id>"
```
Then re-provision a fresh workspace from the canvas. Faster than
debugging the wedged one.
### Lever E: Railway production rollback (CP regression)
If the last Railway deploy of CP introduced a regression that lever A
can't fix (e.g. a logic bug, not a container issue):
1. Open Railway dashboard → molecule-platform → controlplane → Deployments
2. Find the previous-known-good deployment
3. Click **Rollback to this deployment**
Manual step — no CLI equivalent built. Takes ~30s to redeploy from
the prior image. Note: rollback restores the prior code AND prior env
var snapshot; don't expect any env var changes made since to persist.
### Lever F: Vercel production rollback (canvas regression)
If the canvas ships a regression:
1. Open Vercel dashboard → molecule-app → Deployments
2. Find the previous prod deployment
3. **Promote to Production**
Same pattern as Railway — fast revert, no rebuild.
## Tenant-level read-only diagnostics (not actions)
Useful during a "is this working?" moment without touching anything:
```bash
# Tenant infra state
curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
"https://api.moleculesai.app/cp/admin/orgs?limit=20" \
| jq '.orgs[] | select(.slug=="<slug>")'
# Tenant boot events (debug a stuck provision)
curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
"https://api.moleculesai.app/cp/admin/tenants/<slug>/boot-events?limit=50" \
| jq
# Workspace activity (debug an unresponsive agent)
curl -fsS \
-H "Origin: https://<slug>.moleculesai.app" \
-H "Authorization: Bearer $TENANT_ADMIN" \
-H "X-Molecule-Org-Id: $ORG_ID" \
"https://<slug>.moleculesai.app/workspaces/<workspace-id>/activity?limit=20" \
| jq
```
## Post-demo (T+30m to T+24h)
### 1. Thaw the cascades
```bash
# Find the freeze receipt
ls scripts/demo-freeze-snapshots/
# Thaw — pass the timestamp suffix
scripts/demo-thaw.sh 20260506-180000
```
The next merge to `molecule-core/staging` (workspace/**) or any
template repo's `main` will resume the auto-rebuild cascade.
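Picking the timestamp out of the `ls` output can be automated. A small sketch, assuming receipts follow the `disabled-workflows-YYYYMMDD-HHMMSS.txt` naming that `demo-freeze.sh` uses, so the newest one sorts last lexicographically; `latest_freeze_ts` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest freeze timestamp found in a snapshot
# directory, or return non-zero if no receipt exists.
latest_freeze_ts() {
  local dir="$1" f newest=""
  for f in "$dir"/disabled-workflows-*.txt; do
    [ -e "$f" ] || continue               # glob matched nothing
    f="${f##*/disabled-workflows-}"       # strip directory and prefix
    f="${f%.txt}"                         # strip extension
    if [ -z "$newest" ] || [ "$f" \> "$newest" ]; then
      newest="$f"
    fi
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example (not executed here):
#   scripts/demo-thaw.sh "$(latest_freeze_ts scripts/demo-freeze-snapshots)" --dry-run
```

Running the thaw with `--dry-run` first shows which workflows the selected receipt would re-enable.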
### 2. Audit what was held back
If any merges queued during the freeze:
```bash
# `date -v-7d` is BSD/macOS syntax; on GNU/Linux use: date -u -d '7 days ago' +%Y-%m-%d
gh pr list -R Molecule-AI/molecule-core --base staging --state merged \
  --search "merged:>=$(date -u -v-7d +%Y-%m-%d)"
```
Verify each merge's CI is green and dispatch the runtime cascade once
to ensure all templates rebuild against the post-freeze HEAD.
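The date arithmetic in the audit query depends on which `date` implementation is installed. A portable sketch, trying the BSD/macOS form first and falling back to the GNU form; `days_ago` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the UTC date N days ago as YYYY-MM-DD.
# BSD/macOS date understands -v; GNU date rejects it and uses -d instead.
days_ago() {
  local n="$1"
  date -u -v-"${n}"d +%Y-%m-%d 2>/dev/null \
    || date -u -d "${n} days ago" +%Y-%m-%d
}

# Example (not executed here):
#   gh pr list -R Molecule-AI/molecule-core --base staging --state merged \
#     --search "merged:>=$(days_ago 7)"
```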
### 3. File a post-mortem if anything fired
If any rollback lever was used during the demo, file a brief doc:
- Which lever (A through F)
- Which SHA was rolled back FROM and TO
- Did the rollback fully resolve the issue or was a follow-up needed
- Whether the underlying regression should have been caught by CI
## Common issues + first-line fix
| Symptom | First lever to try |
|---|---|
| Workspace boots but agent always errors | Lever D (delete + reprovision) |
| Whole tenant unreachable | Lever C (redeploy) |
| Canvas crashes on load | Lever F (Vercel rollback) |
| Login broken / API errors | Lever E (Railway rollback) |
| Specific runtime broken across tenants | Lever B (template image rollback) |
| Platform container regression | Lever A (rollback-latest.sh) |
| Mid-demo stray PR auto-published a bad image | Lever B + investigate why freeze didn't catch it |
## Auth fingerprint (rotate post-demo)
The freeze + rollback procedures assume:
- `CP_ADMIN_API_TOKEN` available via `railway variables --kv --environment production`
- `gh auth token` returns a working PAT with `workflow:write` + `write:packages`
- `crane` installed (`brew install crane`)
After the demo, **rotate** `CP_ADMIN_API_TOKEN` (it's the keys-to-the-kingdom
token for production) — it likely got copy-pasted into shells during
the demo.
```bash
# Generate a new admin token
NEW_TOKEN=$(openssl rand -hex 32)
# Update Railway production env var (and optionally staging)
railway variables --set CP_ADMIN_API_TOKEN="$NEW_TOKEN" --environment production
# Restart CP service to pick up the change
# (Railway auto-restarts on env var change)
# Verify
curl -fsS -H "Authorization: Bearer $NEW_TOKEN" \
  "https://api.moleculesai.app/cp/admin/orgs?limit=1"
```
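Before the new value is written to Railway, its shape can be sanity-checked. `openssl rand -hex 32` emits 64 lowercase hex characters, so a truncated paste or an empty variable is easy to detect. A minimal sketch; `is_hex_token` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only for a 64-character lowercase hex string,
# which is what `openssl rand -hex 32` produces.
is_hex_token() {
  [[ "$1" =~ ^[0-9a-f]{64}$ ]]
}

# Example (not executed here):
#   NEW_TOKEN=$(openssl rand -hex 32)
#   is_hex_token "$NEW_TOKEN" || { echo "token generation failed" >&2; exit 1; }
```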


@ -0,0 +1,6 @@
# Generated by scripts/demo-freeze.sh — receipts are operational state,
# not source. Tracked .gitignore + .gitkeep keep the directory itself
# in version control so the freeze script's output dir always exists.
*
!.gitignore
!.gitkeep


scripts/demo-freeze.sh Executable file

@ -0,0 +1,214 @@
#!/usr/bin/env bash
# demo-freeze.sh — disable the runtime + template image publish cascades
# during a demo-prep window so a stray staging merge can't auto-rebuild
# `:latest` for the 8 workspace-template images mid-demo.
#
# Demo prep typically runs T-48h to T+1h. During that window:
#
# PATH 1: any merge to molecule-core/staging that touches workspace/**
# → publish-runtime.yml fires
# → PyPI auto-bumps molecule-ai-workspace-runtime patch version
# → repository_dispatch fans out to 8 workspace-template-* repos
# → each template repo rebuilds and re-tags
# ghcr.io/molecule-ai/workspace-template-<runtime>:latest
#
# PATH 2: any merge to a workspace-template-* repo's main branch
# → that repo's publish-image.yml fires
# → ghcr.io/molecule-ai/workspace-template-<runtime>:latest
# gets re-tagged
#
# provisioner.go:296 RuntimeImages[runtime] reads `:latest` at every
# workspace boot. A new workspace provision during demo pulls whatever
# `:latest` resolved to seconds earlier — so a bad merge minutes
# before the demo can break a tenant the funder is about to see.
#
# This script captures the current good `:latest` digests for all 8
# templates and disables both cascade vectors. The complementary
# demo-thaw.sh re-enables them.
#
# Usage:
# scripts/demo-freeze.sh # dry run — print what would happen
# scripts/demo-freeze.sh --execute # actually disable workflows + snapshot
#
# Prereqs:
# - gh CLI authenticated with workflow:write scope on Molecule-AI org
# - curl + jq (for digest snapshot via GHCR anonymous registry API)
#
# Output:
# <snapshot dir>/digests-YYYYMMDD-HHMMSS.txt
# One line per template: "<runtime>: <digest>"
# <snapshot dir>/disabled-workflows-YYYYMMDD-HHMMSS.txt
# One line per disabled workflow: "<repo>: <workflow>"
#
# Exit codes:
# 0 — freeze complete (or dry-run successful)
# 1 — pre-flight failure (missing tooling, missing auth, etc.)
# 2 — partial freeze (some workflows did not disable cleanly; see log)
set -euo pipefail
usage() {
cat <<'USAGE'
demo-freeze.sh — disable the runtime + template image publish cascades
during a demo-prep window.
Captures current :latest digests for all 8 workspace-template-* images
and disables the workflows that would otherwise re-tag them.
Usage:
scripts/demo-freeze.sh # dry run — print what would happen
scripts/demo-freeze.sh --execute # actually disable workflows + snapshot
See the comment block at the top of this script for the full procedure.
USAGE
}
EXECUTE=0
case "${1:-}" in
  --execute)
    EXECUTE=1
    ;;
  --help|-h)
    usage
    exit 0
    ;;
  "")
    ;;
  *)
    echo "unknown arg: $1" >&2
    usage >&2
    exit 2
    ;;
esac
# Templates and their GHCR repository slugs. Source of truth for the
# runtime → image map is workspace-server/internal/provisioner/provisioner.go
# RuntimeImages — keep this list in sync if a runtime is added.
TEMPLATES=(
"claude-code"
"hermes"
"openclaw"
"langgraph"
"deepagents"
"crewai"
"autogen"
"gemini-cli"
)
# Pre-flight: required tooling.
need() {
  command -v "$1" >/dev/null || { echo "ERROR: missing required tool: $1" >&2; exit 1; }
}
need gh
need curl
need jq
# Pre-flight: gh auth. Snapshot via anonymous GHCR token works without
# org auth, but workflow disable needs an authenticated gh.
if ! gh auth status >/dev/null 2>&1; then
echo "ERROR: gh not authenticated. Run 'gh auth login' first." >&2
exit 1
fi
# Snapshot location relative to this script. Keeping it under scripts/
# rather than a temp dir means freeze receipts are easy to find again
# during the actual demo.
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
SNAPSHOT_DIR="${SCRIPT_DIR}/demo-freeze-snapshots"
mkdir -p "$SNAPSHOT_DIR"
TS="$(date -u +%Y%m%d-%H%M%S)"
DIGESTS_FILE="${SNAPSHOT_DIR}/digests-${TS}.txt"
WORKFLOWS_FILE="${SNAPSHOT_DIR}/disabled-workflows-${TS}.txt"
if [ $EXECUTE -eq 0 ]; then
echo "=== DRY RUN (no changes will be made; pass --execute to apply) ==="
else
echo "=== EXECUTING FREEZE — workflows will be disabled ==="
fi
echo "Snapshot timestamp: $TS"
echo "Digest log: $DIGESTS_FILE"
echo "Workflow log: $WORKFLOWS_FILE"
echo
# Step 1: capture current :latest digest for each template.
echo "→ Capturing current :latest digests"
for tpl in "${TEMPLATES[@]}"; do
  token=$(curl -fsS "https://ghcr.io/token?scope=repository:molecule-ai/workspace-template-${tpl}:pull" | jq -r .token 2>/dev/null || true)
  if [ -z "$token" ] || [ "$token" = "null" ]; then
    echo " WARN: token fetch failed for $tpl, skipping digest capture"
    continue
  fi
  # `|| true` stops `set -e` + pipefail from aborting the whole script when
  # curl or grep fails, so the empty-digest check below can warn and continue.
  digest=$(curl -fsSI \
      -H "Authorization: Bearer $token" \
      -H "Accept: application/vnd.oci.image.index.v1+json" \
      -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
      "https://ghcr.io/v2/molecule-ai/workspace-template-${tpl}/manifests/latest" 2>/dev/null \
    | grep -i 'docker-content-digest' \
    | awk '{print $2}' \
    | tr -d '\r' || true)
  if [ -z "$digest" ]; then
    echo " WARN: digest fetch failed for $tpl"
    continue
  fi
  echo " $tpl: $digest"
  if [ $EXECUTE -eq 1 ]; then
    echo "$tpl: $digest" >> "$DIGESTS_FILE"
  fi
done
echo
# Step 2: disable publish-runtime.yml in molecule-core (PATH 1 source).
# PARTIAL_FAIL is initialised before this step so that a failure here also
# produces exit code 2 instead of a silent exit 0.
PARTIAL_FAIL=0
echo "→ Disabling publish-runtime.yml in molecule-core (kills runtime → 8-template cascade)"
if [ $EXECUTE -eq 1 ]; then
  if gh workflow disable publish-runtime.yml -R Molecule-AI/molecule-core 2>/tmp/freeze.err; then
    echo " OK molecule-core/publish-runtime.yml disabled"
    echo "Molecule-AI/molecule-core: publish-runtime.yml" >> "$WORKFLOWS_FILE"
  else
    echo " FAIL molecule-core/publish-runtime.yml: $(cat /tmp/freeze.err)" >&2
    PARTIAL_FAIL=1
  fi
else
  echo " (dry-run) would disable: gh workflow disable publish-runtime.yml -R Molecule-AI/molecule-core"
fi
echo
# Step 3: disable publish-image.yml in each of the 8 template repos (PATH 2 sources).
echo "→ Disabling publish-image.yml in each workspace-template-* repo"
for tpl in "${TEMPLATES[@]}"; do
repo="Molecule-AI/molecule-ai-workspace-template-${tpl}"
if [ $EXECUTE -eq 1 ]; then
if gh workflow disable publish-image.yml -R "$repo" 2>/tmp/freeze.err; then
echo " OK $repo/publish-image.yml disabled"
echo "${repo}: publish-image.yml" >> "$WORKFLOWS_FILE"
else
echo " FAIL $repo/publish-image.yml: $(cat /tmp/freeze.err)" >&2
PARTIAL_FAIL=1
fi
else
echo " (dry-run) would disable: gh workflow disable publish-image.yml -R $repo"
fi
done
echo
if [ $EXECUTE -eq 0 ]; then
echo "=== DRY RUN COMPLETE ==="
echo "Re-run with --execute to apply the freeze."
exit 0
fi
echo "=== FREEZE COMPLETE ==="
echo "Receipts: $DIGESTS_FILE"
echo " $WORKFLOWS_FILE"
echo
echo "Next steps:"
echo " - Verify by running: gh workflow list --all -R Molecule-AI/molecule-core | grep publish-runtime"
echo " Status should be 'disabled_manually'."
echo " - Demo proceeds; new workspaces pull the snapshotted :latest digests."
echo " - Post-demo, run: scripts/demo-thaw.sh ${TS}"
echo " to re-enable every workflow this freeze disabled."
echo
if [ $PARTIAL_FAIL -ne 0 ]; then
echo "WARNING: one or more workflows did not disable cleanly. Re-run after fixing." >&2
exit 2
fi
exit 0

scripts/demo-thaw.sh Executable file

@ -0,0 +1,124 @@
#!/usr/bin/env bash
# demo-thaw.sh — re-enable workflows that demo-freeze.sh disabled.
#
# Usage:
# scripts/demo-thaw.sh <freeze-timestamp>
# scripts/demo-thaw.sh 20260503-180000
#
# Reads disabled-workflows-<ts>.txt produced by demo-freeze.sh and
# runs `gh workflow enable` for each entry. Idempotent — re-enabling
# an already-enabled workflow is a no-op.
#
# Defaults to executing (the inverse of freeze, which defaults to
# dry-run). Pass --dry-run to print without executing.
#
# Prereqs:
# - gh CLI authenticated with workflow:write scope on Molecule-AI org
#
# Exit codes:
# 0 — all workflows re-enabled
# 1 — pre-flight failure (missing receipt file, missing tooling)
# 2 — partial thaw (some workflows did not enable; check output)
set -euo pipefail
usage() {
cat <<'USAGE'
demo-thaw.sh — re-enable workflows that demo-freeze.sh disabled.
Usage:
scripts/demo-thaw.sh <freeze-timestamp> # apply
scripts/demo-thaw.sh <freeze-timestamp> --dry-run # print without applying
<freeze-timestamp> is the YYYYMMDD-HHMMSS suffix on
scripts/demo-freeze-snapshots/disabled-workflows-*.txt produced by
demo-freeze.sh.
USAGE
}
DRY_RUN=0
TS=""
for arg in "$@"; do
  case "$arg" in
    --dry-run)
      DRY_RUN=1
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      if [ -z "$TS" ]; then
        TS="$arg"
      else
        echo "unknown arg: $arg" >&2
        usage >&2
        exit 2
      fi
      ;;
  esac
done
if [ -z "$TS" ]; then
  echo "usage: $0 <freeze-timestamp> [--dry-run]" >&2
  echo " e.g. $0 20260503-180000" >&2
  echo " <freeze-timestamp> is the YYYYMMDD-HHMMSS suffix on demo-freeze-snapshots/disabled-workflows-*.txt" >&2
  exit 2
fi
command -v gh >/dev/null || { echo "ERROR: gh CLI required" >&2; exit 1; }
if ! gh auth status >/dev/null 2>&1; then
echo "ERROR: gh not authenticated. Run 'gh auth login' first." >&2
exit 1
fi
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
WORKFLOWS_FILE="${SCRIPT_DIR}/demo-freeze-snapshots/disabled-workflows-${TS}.txt"
if [ ! -f "$WORKFLOWS_FILE" ]; then
echo "ERROR: receipt not found: $WORKFLOWS_FILE" >&2
echo "Available receipts:" >&2
ls "${SCRIPT_DIR}/demo-freeze-snapshots/" 2>/dev/null | grep '^disabled-workflows-' >&2 || echo " (none)" >&2
exit 1
fi
if [ $DRY_RUN -eq 1 ]; then
echo "=== DRY RUN (no changes will be made) ==="
else
echo "=== THAWING — re-enabling workflows ==="
fi
echo "Reading: $WORKFLOWS_FILE"
echo
PARTIAL_FAIL=0
while IFS=': ' read -r repo workflow; do
  [ -z "$repo" ] && continue
  if [ $DRY_RUN -eq 1 ]; then
    echo " (dry-run) would enable: gh workflow enable $workflow -R $repo"
  else
    if gh workflow enable "$workflow" -R "$repo" 2>/tmp/thaw.err; then
      echo " OK $repo/$workflow re-enabled"
    else
      echo " FAIL $repo/$workflow: $(cat /tmp/thaw.err)" >&2
      PARTIAL_FAIL=1
    fi
  fi
done < "$WORKFLOWS_FILE"
echo
if [ $DRY_RUN -eq 1 ]; then
echo "=== DRY RUN COMPLETE ==="
echo "Re-run without --dry-run to apply."
exit 0
fi
echo "=== THAW COMPLETE ==="
echo "Cascades restored. Next workspace/** push to molecule-core/staging will"
echo "auto-publish the runtime wheel and fan out to template rebuilds as normal."
if [ $PARTIAL_FAIL -ne 0 ]; then
echo
echo "WARNING: one or more workflows did not re-enable cleanly. Re-run or enable manually:" >&2
echo " gh workflow list --all -R <repo>" >&2
exit 2
fi
exit 0