fix(scripts): nuke-and-rebuild self-bootstraps templates; add E2E test
Two paper cuts the fix addresses:
1. nuke-and-rebuild.sh wipes the compose stack but never re-populates
workspace-configs-templates/, org-templates/, or plugins/. Those dirs
are .gitignored — the curated set lives in manifest.json as external
repos cloned via clone-manifest.sh (idempotent). Without that step,
a fresh checkout or a post-deletion run leaves the dirs empty, which
silently hides the entire template palette in Canvas + falls back to
bare default workspace provisioning. Symptom: "Deploy your first
agent" shows zero templates.
2. The ws-* container reap was already in the script (good), but it only
fires when the script runs. Anyone running `docker compose down -v`
directly still leaves orphan ws-* containers behind. That caveat is now
documented explicitly in the script comment so future readers understand
why those lines are critical.
The fix is just `bash clone-manifest.sh` added to the script.
clone-manifest.sh is idempotent — populated dirs short-circuit, so a
re-nuke on a healthy machine pays only a few stat calls.
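For context, the idempotency pattern that makes the re-run cheap can be sketched like this. This is a hedged illustration only — the manifest schema (`{"dirs": {"<dir>": "<repo-url>"}}`) and the `clone_from_manifest` name are assumptions for the sketch, not clone-manifest.sh's real interface:

```shell
#!/usr/bin/env bash
# Illustrative sketch only — not the real clone-manifest.sh. The manifest
# shape and the function name are assumptions; the point is the
# short-circuit that makes re-runs nearly free.
set -euo pipefail

clone_from_manifest() {  # $1 = path to a manifest.json
  jq -r '.dirs | to_entries[] | "\(.key) \(.value)"' "$1" |
  while read -r dir repo; do
    # Fast path: a populated dir costs a couple of stat calls, no network.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
      echo "skip $dir (already populated)"
      continue
    fi
    git clone --depth 1 "$repo" "$dir"
  done
}

# Demo against a throwaway fixture: the pre-populated dir is skipped, so
# no clone (and no network access) ever happens.
demo="$(mktemp -d)"
cd "$demo"
printf '%s\n' '{"dirs":{"plugins":"https://example.invalid/x.git"}}' > manifest.json
mkdir plugins && touch plugins/seed
clone_from_manifest manifest.json
```

The demo prints the skip line and leaves the seeded dir untouched, which is exactly the "fast no-op" behavior the re-nuke path relies on.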
scripts/test-nuke-and-rebuild.sh exercises the canonical workflow
end-to-end:
- plants a fake orphan ws-* container, then asserts it gets reaped
- renames the manifest dirs to simulate a fresh checkout, then
asserts they get repopulated
- waits for /health and asserts the platform sees the same template
count on disk as via /configs in the container (catches bind-mount
drift)
- asserts the image-auto-refresh watcher (PR #2114) starts, since
that's load-bearing for the CD chain users now rely on
The test pre-flights ports 5432/6379/8080 and exits 0 with a SKIP
message if a non-target compose project is holding them — common when
parallel monorepo checkouts coexist on one Docker daemon.
scripts/ is intentionally outside CI shellcheck per ci.yml comment, but
both files pass `shellcheck --severity=warning` anyway.
This defers, but does not solve, the runtime root cause for orphan ws-*
containers after a plain `docker compose down -v`: the platform's
orphan-sweeper only reaps containers whose workspace row says
status='removed', so a wiped DB → no row → the sweeper ignores them.
The proper fix needs container labels keyed to a per-platform-instance
UUID so the sweeper can confidently reap "containers I provisioned that
aren't in my DB anymore" without nuking a sibling platform's containers
on a shared daemon. Tracked as task #109's follow-up; out of scope for
this PR.
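The proposed label-keyed sweep boils down to a set difference. A minimal sketch (the `platform.instance` label name and the fixture lists are assumptions about the follow-up design; the real sweeper would feed the two lists from `docker ps --filter label=...` and the workspaces table):

```shell
#!/usr/bin/env bash
# Sketch of the proposed sweep: reap only containers carrying this
# platform instance's UUID label that the DB no longer tracks. Label
# name and fixtures are assumptions, not the shipped design.
set -euo pipefail

# Real inputs would be:
#   docker ps -a --filter "label=platform.instance=$UUID" --format '{{.Names}}'
#   SELECT container_name FROM workspaces;
labeled="ws-aaa
ws-bbb
ws-ccc"                 # provisioned by this instance (label match)
tracked="ws-bbb"        # still has a workspace row in the DB

mine="$(mktemp)"; db="$(mktemp)"
printf '%s\n' "$labeled" | sort > "$mine"
printf '%s\n' "$tracked" | sort > "$db"

# Reap set = labeled minus tracked (comm -23 = lines only in file 1).
# A sibling platform's containers never carry this instance's label,
# so they can never land in the reap set — that's the safety property.
to_reap="$(comm -23 "$mine" "$db")"
rm -f "$mine" "$db"

echo "$to_reap"
# The real sweep would then be:
#   printf '%s\n' $to_reap | xargs -r docker rm -f
```

With these fixtures the reap set is ws-aaa and ws-ccc; ws-bbb survives because its workspace row still exists.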
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit 44d0444aae (parent 909cbe8b3a)
@@ -1,9 +1,25 @@
 #!/bin/bash
-# Full nuke + rebuild — one command to reset everything
-# Usage: bash scripts/nuke-and-rebuild.sh
+# Full nuke + rebuild — one command to reset everything.
+#
+# What "everything" means:
+# 1. The compose stack (containers + named volumes + network).
+# 2. Dynamically-spawned ws-* workspace containers + their volumes.
+#    These are NOT in docker-compose.yml — the provisioner creates them
+#    at workspace-create time, so `compose down -v` leaves them behind.
+#    Without this step, a fresh DB plus old ws-* containers = ghost
+#    containers Canvas can't see, eating CPU + memory.
+# 3. Repopulating the manifest-managed dirs (workspace-configs-templates/,
+#    org-templates/, plugins/). These are .gitignored — fresh checkouts
+#    and post-deletion runs leave them empty, which silently hides the
+#    entire template palette in Canvas. clone-manifest.sh is idempotent,
+#    so re-running with already-populated dirs is a fast no-op.
+#
+# Usage:
+#   bash scripts/nuke-and-rebuild.sh
 set -euo pipefail
 
 ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 
 echo "=== NUKE ==="
 docker compose -f "$ROOT/docker-compose.yml" down -v 2>/dev/null || true
 docker ps -a --format "{{.Names}}" | grep "^ws-" | xargs -r docker rm -f 2>/dev/null || true
@@ -11,6 +27,23 @@ docker volume ls --format "{{.Name}}" | grep "^ws-" | xargs -r docker volume rm
 docker network rm molecule-monorepo-net 2>/dev/null || true
 echo " cleaned"
 
+echo "=== POPULATE MANIFEST DIRS ==="
+# Idempotent: clone-manifest.sh skips dirs that already have content, so a
+# re-nuke after templates are populated is a fast no-op (a few stat calls).
+# Skip with a clear warning if jq is missing — installing it is a one-time
+# step documented in the README quickstart.
+if command -v jq >/dev/null 2>&1; then
+  bash "$ROOT/scripts/clone-manifest.sh" \
+    "$ROOT/manifest.json" \
+    "$ROOT/workspace-configs-templates" \
+    "$ROOT/org-templates" \
+    "$ROOT/plugins" 2>&1 | tail -3
+else
+  echo " WARNING: jq not installed — skipping template/plugin clone."
+  echo " Install (brew install jq) and rerun, or Canvas's template"
+  echo " palette will be empty and provisioning falls back to defaults."
+fi
+
 echo "=== REBUILD ==="
 docker compose -f "$ROOT/docker-compose.yml" up -d --build
 echo " platform + canvas up"
152  scripts/test-nuke-and-rebuild.sh  Executable file
@@ -0,0 +1,152 @@
+#!/usr/bin/env bash
+# E2E test: scripts/nuke-and-rebuild.sh self-bootstraps a clean dev stack.
+#
+# What this asserts (and why each one matters):
+# 1. After nuke+rebuild, workspace-configs-templates/ is populated.
+#    Regression target: someone deletes the manifest-clone step and
+#    Canvas silently shows zero templates.
+# 2. After nuke+rebuild, no orphan ws-* containers survive on the
+#    Docker daemon. Regression target: someone removes the ws-*
+#    reaping lines from the script and old containers haunt every
+#    future stack with a wiped DB.
+# 3. Platform serves /health 200. Regression target: env wiring drift
+#    or a Dockerfile change that breaks platform startup.
+# 4. Platform exposes the templates it sees on disk. Regression target:
+#    bind-mount drift between docker-compose.yml and the platform
+#    config (CONFIGS_HOST_DIR / CONFIGS_DIR misalignment).
+# 5. The image-auto-refresh watcher (PR #2114) starts. Regression
+#    target: someone defaults IMAGE_AUTO_REFRESH back to false in
+#    compose, breaking the runtime CD chain users now rely on.
+#
+# Usage:
+#   bash scripts/test-nuke-and-rebuild.sh
+#
+# Cost: ~3-6 min on a warm cache (plugin clones are the slow part on
+# a cold cache, ~30-60s).
+#
+# Caveats:
+# - Requires Docker daemon + jq + curl on PATH.
+# - Spawns a fake `ws-deadbeeftest` container with a sleep-forever
+#   command so we have a known orphan to assert against. Cleanup
+#   runs in a trap.
+# - Does NOT test the runtime CD propagation end-to-end (that's
+#   issue #2118). Scope here is the local nuke+rebuild loop only.
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+PLATFORM="${PLATFORM:-http://localhost:8080}"
+PASS=0
+FAIL=0
+FAKE_WS="ws-deadbeeftest"
+
+require() {
+  command -v "$1" >/dev/null 2>&1 || { echo "missing dependency: $1"; exit 2; }
+}
+require docker
+require jq
+require curl
+
+cleanup() {
+  docker rm -f "$FAKE_WS" >/dev/null 2>&1 || true
+}
+trap cleanup EXIT
+
+# Pre-flight: if another compose project already holds the ports we need,
+# bail with a clear message rather than letting the rebuild step fail
+# halfway through with a confusing "port already allocated" error. This
+# happens routinely when a parallel monorepo checkout has its stack up.
+PROJECT="$(basename "$ROOT")"
+for port in 5432 6379 8080; do
+  HOLDER=$(docker ps --filter "publish=$port" --format '{{.Names}}' | head -1)
+  if [ -n "$HOLDER" ] && [[ "$HOLDER" != "${PROJECT}-"* ]]; then
+    echo "SKIP: port $port held by container '$HOLDER' from a different compose project."
+    echo " This test rebuilds the '$PROJECT' stack, which would conflict."
+    echo " Stop the other stack first (in its own checkout):"
+    echo "   docker compose down -v"
+    exit 0
+  fi
+done
+
+check() {
+  local label="$1" cond="$2"
+  if eval "$cond"; then
+    echo "PASS: $label"
+    PASS=$((PASS + 1))
+  else
+    echo "FAIL: $label"
+    echo " cond: $cond"
+    FAIL=$((FAIL + 1))
+  fi
+}
+
+echo "=== Setup: plant a fake orphan ws-* container ==="
+# alpine because it's already on most Docker hosts; sleep so Docker treats
+# it as a long-running container worth listing in `docker ps`.
+docker run -d --name "$FAKE_WS" --rm=false alpine sleep 3600 >/dev/null
+docker ps --filter name="^${FAKE_WS}$" --format '{{.Names}}' | grep -q "^${FAKE_WS}$" || {
+  echo "FAIL: setup — fake orphan container did not start"
+  exit 2
+}
+echo " planted $FAKE_WS"
+
+echo ""
+echo "=== Setup: wipe the manifest-managed dirs to simulate a fresh checkout ==="
+# Don't actually delete — rename to a sentinel, restore on exit. Avoids
+# unrecoverable damage if the test crashes after the rename and the
+# operator Ctrl-Cs the trap.
+for d in workspace-configs-templates org-templates plugins; do
+  if [ -d "$ROOT/$d" ]; then
+    mv "$ROOT/$d" "$ROOT/${d}.testbak"
+  fi
+done
+restore_dirs() {
+  for d in workspace-configs-templates org-templates plugins; do
+    if [ -d "$ROOT/${d}.testbak" ] && [ ! -d "$ROOT/$d" ]; then
+      mv "$ROOT/${d}.testbak" "$ROOT/$d"
+    fi
+  done
+}
+trap 'cleanup; restore_dirs' EXIT
+
+echo ""
+echo "=== Run nuke-and-rebuild.sh (this is what we're testing) ==="
+bash "$ROOT/scripts/nuke-and-rebuild.sh" >/tmp/nuke.log 2>&1 || {
+  echo "FAIL: nuke-and-rebuild.sh exited non-zero. Tail of log:"
+  tail -30 /tmp/nuke.log
+  exit 2
+}
+echo " ran (full log: /tmp/nuke.log)"
+
+echo ""
+echo "=== Assertions ==="
+
+check "templates dir populated (8 entries expected)" \
+  "[ \"\$(ls $ROOT/workspace-configs-templates 2>/dev/null | wc -l | tr -d ' ')\" -ge 8 ]"
+
+check "fake orphan ws-* container reaped" \
+  "! docker ps -a --filter name=^${FAKE_WS}\$ --format '{{.Names}}' | grep -q ."
+
+# Wait for platform health (compose startup + migrations can take a beat).
+echo " waiting for platform /health..."
+for _ in $(seq 1 30); do
+  if curl -sf "$PLATFORM/health" >/dev/null 2>&1; then break; fi
+  sleep 2
+done
+
+check "platform /health returns 200" \
+  "[ \"\$(curl -s -o /dev/null -w '%{http_code}' $PLATFORM/health)\" = '200' ]"
+
+# Compare templates the platform sees vs. what's on disk. If the bind
+# mount is broken, on-disk count won't match in-container count.
+DISK_COUNT=$(find "$ROOT/workspace-configs-templates" -mindepth 1 -maxdepth 1 2>/dev/null | wc -l | tr -d ' ')
+PLATFORM_COUNT=$(docker exec molecule-monorepo-platform-1 sh -c 'find /configs -mindepth 1 -maxdepth 1 2>/dev/null | wc -l' | tr -d ' ' || echo 0)
+check "platform sees same template count as disk ($DISK_COUNT)" \
+  "[ \"$PLATFORM_COUNT\" = \"$DISK_COUNT\" ]"
+
+# IMAGE_AUTO_REFRESH watcher should log its startup line (PR #2114).
+check "image-auto-refresh watcher started" \
+  "docker logs molecule-monorepo-platform-1 2>&1 | grep -q 'image-auto-refresh: started'"
+
+echo ""
+echo "=== Result: $PASS passed, $FAIL failed ==="
+[ $FAIL -eq 0 ]