diff --git a/scripts/demo-day-runbook.md b/scripts/demo-day-runbook.md
new file mode 100644
index 00000000..ff4847ce
--- /dev/null
+++ b/scripts/demo-day-runbook.md
@@ -0,0 +1,306 @@
+# Demo-day runbook
+
+Pre-, during-, and post-demo operational procedures for the molecule
+production stack. Updated 2026-05-01 ahead of the funding-demo on
+~2026-05-06.
+
+The whole stack:
+
+```
+Vercel canvas (app.moleculesai.app)
+  → Railway controlplane (api.moleculesai.app)
+    → CloudFront/Cloudflare per-tenant edge (.moleculesai.app)
+      → EC2 tenant instance running platform container
+        → Docker workspaces pulled from
+          ghcr.io/molecule-ai/workspace-template-:latest
+```
+
+Every layer has its own deploy/rollback story. This runbook indexes
+them in the order an operator would touch them during an incident.
+
+## Pre-demo (T-48h to T-1h)
+
+### 1. Freeze the runtime + template image cascade
+
+A merge to `molecule-core/staging` that touches `workspace/**` triggers
+`publish-runtime.yml` → PyPI bump → repository_dispatch → 8 template
+repos rebuild and re-tag `:latest`. A merge to any template repo's
+`main` triggers the same final re-tag directly. Either path means a
+new workspace provision during the demo pulls whatever `:latest`
+resolved to seconds earlier.
+
+Capture current good digests + disable both cascade vectors:
+
+```bash
+# Dry-run first — verifies digests can be fetched and tooling is set up
+scripts/demo-freeze.sh
+
+# Apply
+scripts/demo-freeze.sh --execute
+```
+
+The script writes two receipts to `scripts/demo-freeze-snapshots/`:
+
+- `digests-.txt` — current `:latest` digest per template (rollback target if needed)
+- `disabled-workflows-.txt` — workflow paths to re-enable post-demo
+
+Verify the freeze landed:
+
+```bash
+gh workflow list --all -R Molecule-AI/molecule-core | grep publish-runtime
+# expect: status = disabled_manually (--all is required; disabled workflows are hidden by default)
+```
+
+If a critical fix MUST ship during the freeze window:
+
+1.
`gh workflow enable publish-runtime.yml -R Molecule-AI/molecule-core`
+2. Merge the fix
+3. Watch the cascade through to GHCR:latest manually
+4. Smoke-verify against a staging tenant (`scripts/api-smoke.sh` or
+   manual canvas walkthrough)
+5. `gh workflow disable publish-runtime.yml -R Molecule-AI/molecule-core` to re-freeze
+
+Don't auto-promote during the freeze — the value of the freeze is that
+nothing happens automatically.
+
+### 2. Confirm production CP is on the expected SHA
+
+```bash
+gh run list -R Molecule-AI/molecule-controlplane --branch main --limit 5
+# Last `ci` run should be SUCCESS with the SHA you intend to demo on
+```
+
+Railway auto-deploys from main. Spot-check `api.moleculesai.app`:
+
+```bash
+curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
+  "https://api.moleculesai.app/cp/admin/orgs?limit=1"
+# Expect: 200 + a JSON {"orgs": [...]}
+```
+
+### 3. Confirm production canvas (Vercel) is on main
+
+Vercel auto-deploys `main`. Verify in the Vercel dashboard the most
+recent prod deploy ran from the expected commit SHA.
+
+### 4. Pre-warm the demo tenant
+
+Cold-start times on workspace-template images:
+
+| Runtime | Cold-start (first boot) |
+|---|---|
+| claude-code | ~30-60s |
+| openclaw | ~1-2 min |
+| langgraph | ~1 min |
+| hermes | **~7 min** (large image) |
+
+If the demo will use `hermes`, provision the demo workspace at least
+10 min before. The cold-start clock starts when the workspace is
+created, not when it's used.
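Possible addition to the pre-demo checks: step 1 verifies only `publish-runtime.yml`, while the freeze also disables `publish-image.yml` in each template repo. Below is a minimal sketch that walks a `disabled-workflows-*.txt` receipt and confirms every entry is currently disabled. The function names `workflow_state` and `verify_freeze` are hypothetical (they are not part of this PR), and the sketch assumes a `gh` version whose `workflow list` supports `--all`, `--json path,state`, and `--jq`.

```shell
# Hypothetical helper, not part of this PR: confirm that every workflow
# listed in a freeze receipt is currently disabled.

# Print the state of one workflow file in one repo (e.g. "disabled_manually").
# Assumes `gh workflow list` supports --all, --json and --jq.
workflow_state() {
  local repo=$1 workflow=$2
  gh workflow list --all -R "$repo" --json path,state \
    --jq ".[] | select(.path | endswith(\"/$workflow\")) | .state"
}

# Read "repo: workflow" lines from the receipt file passed as $1.
# Returns non-zero if any listed workflow is not disabled.
verify_freeze() {
  local receipt=$1 repo workflow state rc=0
  while IFS=': ' read -r repo workflow; do
    [ -z "$repo" ] && continue
    state=$(workflow_state "$repo" "$workflow" || true)
    if [ "$state" = "disabled_manually" ]; then
      echo "OK   $repo/$workflow"
    else
      echo "LIVE $repo/$workflow (state: ${state:-unknown})"
      rc=1
    fi
  done < "$receipt"
  return "$rc"
}
```

Usage would look like `verify_freeze scripts/demo-freeze-snapshots/disabled-workflows-20260506-180000.txt`; any `LIVE` line means the freeze did not land for that workflow.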
+
+## During demo — emergency rollback levers
+
+### Lever A: Platform-image rollback (canvas/CP layer regression)
+
+If the canvas or platform container shipped a regression, retag
+`:latest` to a prior staging SHA without rebuilding:
+
+```bash
+# Find a known-good SHA from staging history
+gh run list -R Molecule-AI/molecule-core --workflow=publish-canvas-image.yml --limit 5
+
+# Roll both platform + tenant images
+GITHUB_TOKEN=$(gh auth token) scripts/rollback-latest.sh
+```
+
+`rollback-latest.sh` retags both `ghcr.io/molecule-ai/platform:latest`
+and `ghcr.io/molecule-ai/platform-tenant:latest`. Existing tenants
+auto-pull `:latest` every 5 min — rollback propagates without manual
+restart.
+
+### Lever B: Workspace-template image rollback
+
+If a specific runtime template (claude-code, hermes, etc.) shipped a
+broken `:latest`:
+
+```bash
+# Get the demo's snapshotted-good digest from the freeze receipt
+grep claude-code scripts/demo-freeze-snapshots/digests-.txt
+
+# Retag :latest back to the snapshotted digest using crane
+crane auth login ghcr.io -u "$(gh api user --jq .login)" \
+  --password-stdin <<< "$(gh auth token)"
+crane tag \
+  ghcr.io/molecule-ai/workspace-template-claude-code@sha256: \
+  latest
+```
+
+The next workspace provision pulls the rolled-back image. Existing
+workspaces are unaffected (their image is already loaded into Docker).
+
+### Lever C: Wedged demo tenant — redeploy
+
+If the demo tenant's EC2 instance is wedged (boot succeeded but app
+not responding, or a stuck workspace), the controlplane has an admin
+redeploy endpoint:
+
+```bash
+# AWS-side: forces a fresh EC2 launch with current image. ~3 min.
+curl -fsS -X POST \
+  -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
+  https://api.moleculesai.app/cp/admin/orgs//redeploy
+```
+
+WARNING: this triggers real EC2 + SSM actions on production.
+Double-check `` against the demo tenant's slug before pressing
+return.
The `/redeploy` endpoint is idempotent on the EC2 side but
+WILL drop active SSH sessions.
+
+### Lever D: Specific bad workspace — delete
+
+If a single workspace inside the demo tenant is misbehaving (e.g.
+hermes wedged on cold-start, claude-code returning the generic
+"Agent error (Exception)" message), kill it:
+
+```bash
+# Get the demo tenant's per-tenant ADMIN_TOKEN
+TENANT_ADMIN=$(curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
+  https://api.moleculesai.app/cp/admin/orgs//admin-token \
+  | jq -r .admin_token)
+
+ORG_ID=$(curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
+  "https://api.moleculesai.app/cp/admin/orgs?limit=20" \
+  | jq -r '.orgs[] | select(.slug=="") | .id')
+
+# Delete the bad workspace
+curl -fsS -X DELETE \
+  -H "Origin: https://.moleculesai.app" \
+  -H "Authorization: Bearer $TENANT_ADMIN" \
+  -H "X-Molecule-Org-Id: $ORG_ID" \
+  https://.moleculesai.app/workspaces/
+```
+
+Then re-provision a fresh workspace from the canvas. Faster than
+debugging the wedged one.
+
+### Lever E: Railway production rollback (CP regression)
+
+If the last Railway deploy of CP introduced a regression that lever A
+can't fix (e.g. a logic bug, not a container issue):
+
+1. Open Railway dashboard → molecule-platform → controlplane → Deployments
+2. Find the previous-known-good deployment
+3. Click **Rollback to this deployment**
+
+Manual step — no CLI equivalent built. Takes ~30s to redeploy from
+the prior image. Note: rollback restores the prior code AND prior env
+var snapshot; don't expect any env var changes made since to persist.
+
+### Lever F: Vercel production rollback (canvas regression)
+
+If the canvas ships a regression:
+
+1. Open Vercel dashboard → molecule-app → Deployments
+2. Find the previous prod deployment
+3. **Promote to Production**
+
+Same pattern as Railway — fast revert, no rebuild.
+
+## Tenant-level read-only diagnostics (not actions)
+
+Useful during a "is this working?"
moment without touching anything:
+
+```bash
+# Tenant infra state
+curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
+  "https://api.moleculesai.app/cp/admin/orgs?limit=20" \
+  | jq '.orgs[] | select(.slug=="")'
+
+# Tenant boot events (debug a stuck provision)
+curl -fsS -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
+  "https://api.moleculesai.app/cp/admin/tenants//boot-events?limit=50" \
+  | jq .
+
+# Workspace activity (debug an unresponsive agent)
+curl -fsS \
+  -H "Origin: https://.moleculesai.app" \
+  -H "Authorization: Bearer $TENANT_ADMIN" \
+  -H "X-Molecule-Org-Id: $ORG_ID" \
+  "https://.moleculesai.app/workspaces//activity?limit=20" \
+  | jq .
+```
+
+## Post-demo (T+30m to T+24h)
+
+### 1. Thaw the cascades
+
+```bash
+# Find the freeze receipt
+ls scripts/demo-freeze-snapshots/
+
+# Thaw — pass the timestamp suffix
+scripts/demo-thaw.sh 20260506-180000
+```
+
+The next merge to `molecule-core/staging` (workspace/**) or any
+template repo's `main` will resume the auto-rebuild cascade.
+
+### 2. Audit what was held back
+
+If any PRs were merged during the freeze, list them:
+
+```bash
+gh pr list -R Molecule-AI/molecule-core --base staging --state merged \
+  --search "merged:>=$(date -u -v-7d +%Y-%m-%d)"
+```
+
+Verify each merge's CI is green and dispatch the runtime cascade once
+to ensure all templates rebuild against the post-freeze HEAD.
+
+### 3.
File a post-mortem if anything fired
+
+If any rollback lever was used during the demo, file a brief doc:
+
+- Which lever (A through F)
+- Which SHA was rolled back FROM and TO
+- Did the rollback fully resolve the issue or was a follow-up needed
+- Whether the underlying regression should have been caught by CI
+
+## Common issues + first-line fix
+
+| Symptom | First lever to try |
+|---|---|
+| Workspace boots but agent always errors | Lever D (delete + reprovision) |
+| Whole tenant unreachable | Lever C (redeploy) |
+| Canvas crashes on load | Lever F (Vercel rollback) |
+| Login broken / API errors | Lever E (Railway rollback) |
+| Specific runtime broken across tenants | Lever B (template image rollback) |
+| Platform container regression | Lever A (rollback-latest.sh) |
+| Mid-demo stray PR auto-published a bad image | Lever B + investigate why freeze didn't catch it |
+
+## Auth fingerprint (rotate post-demo)
+
+The freeze + rollback procedures assume:
+
+- `CP_ADMIN_API_TOKEN` available via `railway variables --kv --environment production`
+- `gh auth token` returns a working PAT with `workflow:write` + `write:packages`
+- `crane` installed (`brew install crane`)
+
+After the demo, **rotate** `CP_ADMIN_API_TOKEN` (it's the keys-to-the-kingdom
+token for production) — it likely got copy-pasted into shells during
+the demo.
+
+```bash
+# Generate a new admin token
+NEW_TOKEN=$(openssl rand -hex 32)
+
+# Update Railway production env var (and optionally staging)
+railway variables --set CP_ADMIN_API_TOKEN="$NEW_TOKEN" --environment production
+
+# Restart CP service to pick up the change
+# (Railway auto-restarts on env var change)
+
+# Verify
+curl -fsS -H "Authorization: Bearer $NEW_TOKEN" \
+  "https://api.moleculesai.app/cp/admin/orgs?limit=1"
+```
diff --git a/scripts/demo-freeze-snapshots/.gitignore b/scripts/demo-freeze-snapshots/.gitignore
new file mode 100644
index 00000000..50692299
--- /dev/null
+++ b/scripts/demo-freeze-snapshots/.gitignore
@@ -0,0 +1,6 @@
+# Generated by scripts/demo-freeze.sh — receipts are operational state,
+# not source. Tracked .gitignore + .gitkeep keep the directory itself
+# in version control so the freeze script's output dir always exists.
+*
+!.gitignore
+!.gitkeep
diff --git a/scripts/demo-freeze-snapshots/.gitkeep b/scripts/demo-freeze-snapshots/.gitkeep
new file mode 100644
index 00000000..e69de29b
diff --git a/scripts/demo-freeze.sh b/scripts/demo-freeze.sh
new file mode 100755
index 00000000..be7b176b
--- /dev/null
+++ b/scripts/demo-freeze.sh
@@ -0,0 +1,214 @@
+#!/usr/bin/env bash
+# demo-freeze.sh — disable the runtime + template image publish cascades
+# during a demo-prep window so a stray staging merge can't auto-rebuild
+# `:latest` for the 8 workspace-template images mid-demo.
+#
+# Demo prep typically runs T-48h to T+1h.
During that window:
+#
+#   PATH 1: any merge to molecule-core/staging that touches workspace/**
+#     → publish-runtime.yml fires
+#     → PyPI auto-bumps molecule-ai-workspace-runtime patch version
+#     → repository_dispatch fans out to 8 workspace-template-* repos
+#     → each template repo rebuilds and re-tags
+#       ghcr.io/molecule-ai/workspace-template-:latest
+#
+#   PATH 2: any merge to a workspace-template-* repo's main branch
+#     → that repo's publish-image.yml fires
+#     → ghcr.io/molecule-ai/workspace-template-:latest
+#       gets re-tagged
+#
+# provisioner.go:296 RuntimeImages[runtime] reads `:latest` at every
+# workspace boot. A new workspace provision during demo pulls whatever
+# `:latest` resolved to seconds earlier — so a bad merge minutes
+# before the demo can break a tenant the funder is about to see.
+#
+# This script captures the current good `:latest` digests for all 8
+# templates and disables both cascade vectors. The complementary
+# demo-thaw.sh re-enables them.
+#
+# Usage:
+#   scripts/demo-freeze.sh # dry run — print what would happen
+#   scripts/demo-freeze.sh --execute # actually disable workflows + snapshot
+#
+# Prereqs:
+#   - gh CLI authenticated with workflow:write scope on Molecule-AI org
+#   - curl + jq (for digest snapshot via GHCR anonymous registry API)
+#
+# Output:
+#   /digests-YYYYMMDD-HHMMSS.txt
+#     One line per template: ": "
+#   /disabled-workflows-YYYYMMDD-HHMMSS.txt
+#     One line per disabled workflow: ": "
+#
+# Exit codes:
+#   0 — freeze complete (or dry-run successful)
+#   1 — pre-flight failure (missing tooling, missing auth, etc.)
+#   2 — partial freeze (some workflows did not disable cleanly; see log)
+
+set -euo pipefail
+
+usage() {
+  cat <<'USAGE'
+demo-freeze.sh — disable the runtime + template image publish cascades
+during a demo-prep window.
+
+Captures current :latest digests for all 8 workspace-template-* images
+and disables the workflows that would otherwise re-tag them.
+
+Usage:
+  scripts/demo-freeze.sh # dry run — print what would happen
+  scripts/demo-freeze.sh --execute # actually disable workflows + snapshot
+
+See the comment block at the top of this script for the full procedure.
+USAGE
+}
+
+EXECUTE=0
+case "${1:-}" in
+  --execute)
+    EXECUTE=1
+    ;;
+  --help|-h)
+    usage
+    exit 0
+    ;;
+  "")
+    ;;
+  *)
+    echo "unknown arg: $1" >&2
+    usage >&2
+    exit 2
+    ;;
+esac
+
+# Templates and their GHCR repository slugs. Source of truth for the
+# runtime → image map is workspace-server/internal/provisioner/provisioner.go
+# RuntimeImages — keep this list in sync if a runtime is added.
+TEMPLATES=(
+  "claude-code"
+  "hermes"
+  "openclaw"
+  "langgraph"
+  "deepagents"
+  "crewai"
+  "autogen"
+  "gemini-cli"
+)
+
+# Pre-flight: required tooling.
+need() {
+  command -v "$1" >/dev/null || { echo "ERROR: missing required tool: $1" >&2; exit 1; }
+}
+need gh
+need curl
+need jq
+
+# Pre-flight: gh auth. Snapshot via anonymous GHCR token works without
+# org auth, but workflow disable needs an authenticated gh.
+if ! gh auth status >/dev/null 2>&1; then
+  echo "ERROR: gh not authenticated. Run 'gh auth login' first." >&2
+  exit 1
+fi
+
+# Snapshot location relative to this script. Keeping it under scripts/
+# rather than a temp dir means freeze receipts are easy to find again
+# during the actual demo.
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+SNAPSHOT_DIR="${SCRIPT_DIR}/demo-freeze-snapshots"
+mkdir -p "$SNAPSHOT_DIR"
+TS="$(date -u +%Y%m%d-%H%M%S)"
+DIGESTS_FILE="${SNAPSHOT_DIR}/digests-${TS}.txt"
+WORKFLOWS_FILE="${SNAPSHOT_DIR}/disabled-workflows-${TS}.txt"
+
+if [ $EXECUTE -eq 0 ]; then
+  echo "=== DRY RUN (no changes will be made; pass --execute to apply) ==="
+else
+  echo "=== EXECUTING FREEZE — workflows will be disabled ==="
+fi
+echo "Snapshot timestamp: $TS"
+echo "Digest log: $DIGESTS_FILE"
+echo "Workflow log: $WORKFLOWS_FILE"
+echo
+
+# Step 1: capture current :latest digest for each template.
+echo "→ Capturing current :latest digests"
+for tpl in "${TEMPLATES[@]}"; do
+  token=$(curl -fsS "https://ghcr.io/token?scope=repository:molecule-ai/workspace-template-${tpl}:pull" | jq -r .token 2>/dev/null || true)
+  if [ -z "$token" ] || [ "$token" = "null" ]; then
+    echo " WARN: token fetch failed for $tpl — skipping digest capture"
+    continue
+  fi
+  digest=$(curl -fsSI \
+    -H "Authorization: Bearer $token" \
+    -H "Accept: application/vnd.oci.image.index.v1+json" \
+    -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \
+    "https://ghcr.io/v2/molecule-ai/workspace-template-${tpl}/manifests/latest" 2>/dev/null \
+    | grep -i 'docker-content-digest' \
+    | awk '{print $2}' \
+    | tr -d '\r' || true)  # under set -e + pipefail a failed curl/grep would otherwise abort here, before the WARN below
+  if [ -z "$digest" ]; then
+    echo " WARN: digest fetch failed for $tpl"
+    continue
+  fi
+  echo " $tpl: $digest"
+  if [ $EXECUTE -eq 1 ]; then
+    echo "$tpl: $digest" >> "$DIGESTS_FILE"
+  fi
+done
+echo
+
+# Step 2: disable publish-runtime.yml in molecule-core (PATH 1 source).
+echo "→ Disabling publish-runtime.yml in molecule-core (kills runtime → 8-template cascade)"
+if [ $EXECUTE -eq 1 ]; then
+  if gh workflow disable publish-runtime.yml -R Molecule-AI/molecule-core 2>/tmp/freeze.err; then
+    echo " OK molecule-core/publish-runtime.yml disabled"
+    echo "Molecule-AI/molecule-core: publish-runtime.yml" >> "$WORKFLOWS_FILE"
+  else
+    echo " FAIL molecule-core/publish-runtime.yml: $(cat /tmp/freeze.err)" >&2  # not reflected in the exit code: PARTIAL_FAIL is only set in Step 3
+  fi
+else
+  echo " (dry-run) would disable: gh workflow disable publish-runtime.yml -R Molecule-AI/molecule-core"
+fi
+echo
+
+# Step 3: disable publish-image.yml in each of the 8 template repos (PATH 2 sources).
+echo "→ Disabling publish-image.yml in each workspace-template-* repo"
+PARTIAL_FAIL=0
+for tpl in "${TEMPLATES[@]}"; do
+  repo="Molecule-AI/molecule-ai-workspace-template-${tpl}"
+  if [ $EXECUTE -eq 1 ]; then
+    if gh workflow disable publish-image.yml -R "$repo" 2>/tmp/freeze.err; then
+      echo " OK $repo/publish-image.yml disabled"
+      echo "${repo}: publish-image.yml" >> "$WORKFLOWS_FILE"
+    else
+      echo " FAIL $repo/publish-image.yml: $(cat /tmp/freeze.err)" >&2
+      PARTIAL_FAIL=1
+    fi
+  else
+    echo " (dry-run) would disable: gh workflow disable publish-image.yml -R $repo"
+  fi
+done
+echo
+
+if [ $EXECUTE -eq 0 ]; then
+  echo "=== DRY RUN COMPLETE ==="
+  echo "Re-run with --execute to apply the freeze."
+  exit 0
+fi
+
+echo "=== FREEZE COMPLETE ==="
+echo "Receipts: $DIGESTS_FILE"
+echo " $WORKFLOWS_FILE"
+echo
+echo "Next steps:"
+echo " - Verify by running: gh workflow list --all -R Molecule-AI/molecule-core | grep publish-runtime"
+echo " Status should be 'disabled_manually'."
+echo " - Demo proceeds; new workspaces pull the snapshotted :latest digests."
+echo " - Post-demo, run: scripts/demo-thaw.sh ${TS}"
+echo " to re-enable every workflow this freeze disabled."
+echo
+if [ $PARTIAL_FAIL -ne 0 ]; then
+  echo "WARNING: one or more workflows did not disable cleanly. Re-run after fixing." >&2
+  exit 2
+fi
+exit 0
diff --git a/scripts/demo-thaw.sh b/scripts/demo-thaw.sh
new file mode 100755
index 00000000..35469c6e
--- /dev/null
+++ b/scripts/demo-thaw.sh
@@ -0,0 +1,124 @@
+#!/usr/bin/env bash
+# demo-thaw.sh — re-enable workflows that demo-freeze.sh disabled.
+#
+# Usage:
+#   scripts/demo-thaw.sh
+#   scripts/demo-thaw.sh 20260503-180000
+#
+# Reads disabled-workflows-.txt produced by demo-freeze.sh and
+# runs `gh workflow enable` for each entry. Idempotent — re-enabling
+# an already-enabled workflow is a no-op.
+#
+# Defaults to executing (the inverse of freeze, which defaults to
+# dry-run). Pass --dry-run to print without executing.
+#
+# Prereqs:
+#   - gh CLI authenticated with workflow:write scope on Molecule-AI org
+#
+# Exit codes:
+#   0 — all workflows re-enabled
+#   1 — pre-flight failure (missing receipt file, missing tooling)
+#   2 — partial thaw (some workflows did not enable; check output)
+
+set -euo pipefail
+
+usage() {
+  cat <<'USAGE'
+demo-thaw.sh — re-enable workflows that demo-freeze.sh disabled.
+
+Usage:
+  scripts/demo-thaw.sh # apply
+  scripts/demo-thaw.sh --dry-run # print without applying
+
+ts is the YYYYMMDD-HHMMSS suffix on
+scripts/demo-freeze-snapshots/disabled-workflows-*.txt produced by
+demo-freeze.sh.
+USAGE
+}
+
+DRY_RUN=0
+TS=""
+for arg in "$@"; do
+  case "$arg" in
+    --dry-run)
+      DRY_RUN=1
+      ;;
+    --help|-h)
+      usage
+      exit 0
+      ;;
+    *)
+      if [ -z "$TS" ]; then
+        TS="$arg"
+      else
+        echo "unknown arg: $arg" >&2
+        usage >&2
+        exit 2
+      fi
+      ;;
+  esac
+done
+
+if [ -z "$TS" ]; then
+  echo "usage: $0 [--dry-run]" >&2
+  echo " e.g. $0 20260503-180000" >&2
+  echo " ts is the YYYYMMDD-HHMMSS suffix on demo-freeze-snapshots/disabled-workflows-*.txt" >&2
+  exit 2
+fi
+
+command -v gh >/dev/null || { echo "ERROR: gh CLI required" >&2; exit 1; }
+if ! gh auth status >/dev/null 2>&1; then
+  echo "ERROR: gh not authenticated. Run 'gh auth login' first." >&2
+  exit 1
+fi
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+WORKFLOWS_FILE="${SCRIPT_DIR}/demo-freeze-snapshots/disabled-workflows-${TS}.txt"
+
+if [ !
-f "$WORKFLOWS_FILE" ]; then
+  echo "ERROR: receipt not found: $WORKFLOWS_FILE" >&2
+  echo "Available receipts:" >&2
+  ls "${SCRIPT_DIR}/demo-freeze-snapshots/" 2>/dev/null | grep '^disabled-workflows-' >&2 || echo " (none)" >&2
+  exit 1
+fi
+
+if [ $DRY_RUN -eq 1 ]; then
+  echo "=== DRY RUN (no changes will be made) ==="
+else
+  echo "=== THAWING — re-enabling workflows ==="
+fi
+echo "Reading: $WORKFLOWS_FILE"
+echo
+
+PARTIAL_FAIL=0
+while IFS=': ' read -r repo workflow; do
+  [ -z "$repo" ] && continue
+  if [ $DRY_RUN -eq 1 ]; then
+    echo " (dry-run) would enable: gh workflow enable $workflow -R $repo"
+  else
+    if gh workflow enable "$workflow" -R "$repo" 2>/tmp/thaw.err; then
+      echo " OK $repo/$workflow re-enabled"
+    else
+      echo " FAIL $repo/$workflow: $(cat /tmp/thaw.err)" >&2
+      PARTIAL_FAIL=1
+    fi
+  fi
+done < "$WORKFLOWS_FILE"
+
+echo
+if [ $DRY_RUN -eq 1 ]; then
+  echo "=== DRY RUN COMPLETE ==="
+  echo "Re-run without --dry-run to apply."
+  exit 0
+fi
+
+echo "=== THAW COMPLETE ==="
+echo "Cascades restored. Next workspace/** push to molecule-core/staging will"
+echo "auto-publish the runtime wheel and fan out to template rebuilds as normal."
+if [ $PARTIAL_FAIL -ne 0 ]; then
+  echo
+  echo "WARNING: one or more workflows did not re-enable cleanly. Re-run or enable manually:" >&2
+  echo " gh workflow list --all -R " >&2
+  exit 2
+fi
+exit 0
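Possible follow-up, not part of this PR: `demo-thaw.sh` needs the operator to copy the timestamp suffix from a receipt filename by hand. Below is a small sketch of deriving the newest suffix automatically. The name `latest_receipt_ts` is hypothetical, and the sketch assumes receipts keep the `disabled-workflows-YYYYMMDD-HHMMSS.txt` naming, so that the shell's sorted glob order matches chronological order.

```shell
# Hypothetical helper, not part of this PR: print the timestamp suffix of
# the newest disabled-workflows receipt in the directory passed as $1.
# Relies on fixed-width YYYYMMDD-HHMMSS names: glob expansion is sorted,
# so the last match is the most recent receipt.
latest_receipt_ts() {
  local dir=$1 f newest=""
  for f in "$dir"/disabled-workflows-*.txt; do
    [ -e "$f" ] || continue          # glob matched nothing
    newest=${f##*/disabled-workflows-}
    newest=${newest%.txt}
  done
  [ -n "$newest" ] || return 1       # no receipts found
  printf '%s\n' "$newest"
}
```

With this in place, a thaw could be invoked as `scripts/demo-thaw.sh "$(latest_receipt_ts scripts/demo-freeze-snapshots)"`. If several freezes were run, it is still worth confirming that the newest receipt is the one to thaw.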