feat(tests): add production-shape local harness (Phase 1)

The harness brings up the SaaS tenant topology on localhost using the
SAME workspace-server/Dockerfile.tenant image that ships to production.
Tests run against http://harness-tenant.localhost:8080 and exercise the
same code path a real tenant takes:

  client
    → cf-proxy   (nginx; CF tunnel + LB header rewrites)
    → tenant     (Dockerfile.tenant — combined platform + canvas)
    → cp-stub    (minimal Go CP stand-in for /cp/* paths)
    → postgres + redis

Why this exists: bugs that survive `go run ./cmd/server` and ship to
prod almost always live in env-gated middleware (TenantGuard, /cp/*
proxy, canvas proxy), header rewrites, or the strict-auth / live-token
mode. The harness activates ALL of them locally so #2395 + #2397-class
bugs can be reproduced before deploy.

Phase 1 surface:
  - cp-stub/main.go: minimal CP stand-in. /cp/auth/me, redeploy-fleet,
    /__stub/{peers,mode,state} for replay scripts. Catch-all returns
    501 with a clear message when a new CP route appears.
  - cf-proxy/nginx.conf: rewrites Host to <slug>.localhost, injects
    X-Forwarded-*, disables buffering to mirror CF tunnel streaming
    semantics.
  - compose.yml: one service per topology layer; tenant builds from
    the actual production Dockerfile.tenant.
  - up.sh / down.sh / seed.sh: lifecycle scripts.
  - replays/peer-discovery-404.sh: reproduces #2397 + asserts the
    diagnostic helper from PR #2399 surfaces "404" + "registered".
  - replays/buildinfo-stale-image.sh: reproduces #2395 + asserts
    /buildinfo wire shape + GIT_SHA injection from PR #2398.
  - README.md: topology, quickstart, what the harness does NOT cover.

Phases 2-3 (separate PRs):
  - Phase 2: convert tests/e2e/test_api.sh to target the harness URL
    instead of localhost; make harness-based replays a required CI gate.
  - Phase 3: config-coherence lint that diffs harness env list against
    production CP's env list, fails CI on drift.

Verification:
  - cp-stub builds (go build ./...).
  - cp-stub responds to all stubbed endpoints (smoke-tested locally).
  - compose.yml passes `docker compose config --quiet`.
  - All shell scripts pass `bash -n` syntax check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-30 11:22:46 -07:00
parent c06e2fec5e
commit f13d2b2b7b
11 changed files with 772 additions and 0 deletions

tests/harness/README.md Normal file
@ -0,0 +1,110 @@
# Production-shape local harness
The harness brings up the SaaS tenant topology on localhost using the
same `Dockerfile.tenant` image that ships to production. Tests run
against `http://harness-tenant.localhost:8080` and exercise the
SAME code path a real tenant takes — including TenantGuard middleware,
the `/cp/*` reverse proxy, the canvas reverse proxy, and a
Cloudflare-tunnel-shape header rewrite layer.
## Why this exists
Local `go run ./cmd/server` skips:
- `TenantGuard` middleware (no `MOLECULE_ORG_ID` env)
- `/cp/*` reverse proxy mount (no `CP_UPSTREAM_URL` env)
- `CANVAS_PROXY_URL` (canvas runs separately on `:3000`)
- Header rewrites that production's CF tunnel + LB perform
- Strict-auth mode (no live `ADMIN_TOKEN`)
Bugs that survive `go run` and ship to production almost always live
in one of those layers. The harness activates ALL of them.
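A quick way to confirm those layers are actually live once the harness is up (a sketch; the exact rejection status from TenantGuard, and whether `/cp/*` also requires the admin token, are assumptions):
```bash
# Through cf-proxy, with tenant headers: /cp/* should proxy to cp-stub's user record.
curl -sS http://harness-tenant.localhost:8080/cp/auth/me \
  -H 'Authorization: Bearer harness-admin-token' \
  -H 'X-Molecule-Org-Id: harness-org'

# Without X-Molecule-Org-Id: TenantGuard should reject with some 4xx,
# which `go run ./cmd/server` (middleware not mounted) would never do.
curl -sS -o /dev/null -w '%{http_code}\n' http://harness-tenant.localhost:8080/cp/auth/me
```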
## Topology
```
client
  ↓
cf-proxy    nginx, mirrors CF tunnel header rewrites
  ↓  (Host:harness-tenant.localhost, X-Forwarded-*)
tenant      workspace-server/Dockerfile.tenant — same image as prod
  ↓  (CP_UPSTREAM_URL=http://cp-stub:9090, /cp/* proxied)
cp-stub     minimal Go service, mocks CP wire surface

postgres    same version as production
redis       same version as production
```
## Quickstart
```bash
cd tests/harness
./up.sh # builds + starts all services
./seed.sh # registers the alpha + beta sample workspaces
./replays/peer-discovery-404.sh
./replays/buildinfo-stale-image.sh
./down.sh # tear down + remove volumes
```
First-time setup needs an `/etc/hosts` entry so `harness-tenant.localhost`
resolves to the local cf-proxy:
```bash
echo "127.0.0.1 harness-tenant.localhost" | sudo tee -a /etc/hosts
```
(macOS resolves `*.localhost` automatically in some setups; Linux
typically does not.)
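Once `./up.sh` completes, a single request verifies resolution, cf-proxy, and the tenant in one shot:
```bash
curl -sS -o /dev/null -w '%{http_code}\n' http://harness-tenant.localhost:8080/health
# 200                              → name resolution + cf-proxy + tenant all healthy
# curl: (6) Could not resolve host → add the /etc/hosts entry above
```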
## Replay scripts
Each replay script reproduces a real bug class against the harness so
fixes can be verified locally before deploy. The bar for adding a
replay is "this bug shipped to production despite local E2E being
green" — the script becomes the regression gate that closes that gap.
| Replay | Closes | What it proves |
|--------|--------|----------------|
| `peer-discovery-404.sh` | #2397 | tool_list_peers surfaces the actual reason instead of "may be isolated" |
| `buildinfo-stale-image.sh` | #2395 | GIT_SHA reaches the binary; verify-step comparison logic works |
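Replays read `BASE` (defaulting to the harness URL), so a single replay can be re-run against an already-running harness:
```bash
cd tests/harness
BASE=http://harness-tenant.localhost:8080 ./replays/peer-discovery-404.sh
```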
To add a new replay:
1. Drop a script under `replays/` named after the issue.
2. The script's purpose: reproduce the production failure mode against
the harness, then assert the fix is present. PASS criterion is the
post-fix behavior.
3. Wire it into the `tests/harness/run-all-replays.sh` runner (TODO,
Phase 2).
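A minimal skeleton for a new replay (the issue number, route, and PASS criterion below are placeholders, not real endpoints):
```bash
#!/usr/bin/env bash
# replays/<issue>-<short-name>.sh: placeholder skeleton
set -euo pipefail
BASE="${BASE:-http://harness-tenant.localhost:8080}"

# 1. Reproduce the production failure mode against the harness.
CODE=$(curl -sS -o /tmp/replay-out.json -w '%{http_code}' "$BASE/some/route")

# 2. Assert the post-fix behavior; that is the PASS criterion.
if [ "$CODE" != "200" ]; then
  echo "[replay] FAIL: expected 200 after the fix, got $CODE"
  exit 1
fi
echo "[replay] PASS"
```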
## Extending the cp-stub
`cp-stub/main.go` serves the minimum surface for the existing replays
plus a catch-all that returns 501 + a clear message when the tenant
asks for a route the stub doesn't implement. To add a new CP route:
1. Add a `mux.HandleFunc` in `cp-stub/main.go` for the path.
2. Return the same wire shape the real CP returns. The contract is
"wire compatibility with the staging CP at the time of writing" —
document it with a comment pointing at the real CP handler.
3. Add a replay script that exercises the path.
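The catch-all keeps the gap visible from the client side; an unimplemented route answers 501 with a pointer back to this file (the auth headers shown are assumptions about what TenantGuard and strict-auth require):
```bash
curl -sS http://harness-tenant.localhost:8080/cp/some/new/route \
  -H 'Authorization: Bearer harness-admin-token' \
  -H 'X-Molecule-Org-Id: harness-org'
# {"error":"cp-stub: handler not implemented for GET /cp/some/new/route",
#  "hint":"add a handler in tests/harness/cp-stub/main.go for the scenario you're testing"}
```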
## What the harness does NOT cover
- Real TLS / cert handling (CF terminates TLS in production; harness is
HTTP-only).
- Cloudflare API edge cases (rate limits, DNS propagation timing).
- Real EC2 / SSM / EBS behavior (image-cache replay simulates the
outcome but not the AWS API surface).
- Cross-region or multi-AZ topology.
- Real production data scale.
These are intentional Phase 1 limits. If a bug class hits one of these
gaps, escalate to staging E2E rather than expanding the harness past
its mandate of "exercise the tenant binary in production-shape topology."
## Roadmap
- **Phase 1 (this PR):** harness + cp-stub + cf-proxy + 2 replays.
- **Phase 2:** convert `tests/e2e/test_api.sh` to run against the
harness instead of localhost. Make harness-based E2E a required CI
check.
- **Phase 3:** config-coherence lint that diffs harness env list
against production CP's env list, fails CI on drift.

tests/harness/cf-proxy/nginx.conf Normal file
@ -0,0 +1,68 @@
# cf-proxy: Cloudflare-tunnel-shape reverse proxy for the local harness.
#
# Production path: agent → CF tunnel → AWS LB → tenant container.
# This config replays the same header rewrites the CF tunnel does so
# the tenant sees the same Host + X-Forwarded-* it would in production.
#
# The tenant's TenantGuard middleware activates on MOLECULE_ORG_ID; the
# canvas's same-origin fetches use the Host header for cookie scoping.
# Both behave correctly in production because CF rewrites Host to the
# tenant subdomain; this proxy reproduces that locally.
#
# How tests reach it:
#   curl --resolve 'harness-tenant.localhost:8080:127.0.0.1' \
#        http://harness-tenant.localhost:8080/health
# or via /etc/hosts (./up.sh prints a hint if the entry looks missing).
worker_processes 1;
events { worker_connections 256; }

http {
    # Map the wildcard <slug>.localhost to the tenant container. The
    # tenant container itself doesn't care which slug routed to it;
    # what matters is that the Host header it sees matches what
    # production's CF tunnel sets, so cookie/CORS/TenantGuard logic
    # exercises the same code path.
    server {
        listen 8080;
        server_name *.localhost localhost;

        # Cap upload at 50MB to mirror the staging tenant nginx limit;
        # chat upload tests will fail closed if the platform handler
        # ever silently expands its limit (catches the failure mode
        # opposite of the chat-files lazy-heal incident).
        client_max_body_size 50m;

        location / {
            proxy_pass http://tenant:8080;

            # Header parity with CF tunnel + AWS LB. Production CF sets
            # X-Forwarded-Proto=https; we keep http here because TLS
            # termination in compose is unnecessary for testing the
            # tenant logic: TLS is a CF concern, not a tenant bug
            # surface. If TLS-specific bugs ever bite, add cert-manager
            # + listen 8443 ssl here.
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Host $host;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Streamable HTTP / SSE / WebSocket: the tenant exposes /ws
            # and /events/stream + MCP /mcp/stream. Disabling buffering
            # reproduces CF tunnel's pass-through streaming semantics
            # (CF tunnel = no buffering by default; nginx default IS
            # buffering, which would mask issue #2397-class streaming
            # bugs by accumulating output until the client disconnects).
            proxy_buffering off;
            proxy_request_buffering off;
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Read timeout: CF tunnel default is 100s. Setting this to
            # the same value catches "long agent run finishes after the
            # proxy already closed the upstream" failure mode.
            proxy_read_timeout 100s;
        }
    }
}

tests/harness/compose.yml Normal file
@ -0,0 +1,128 @@
# Production-shape harness for local E2E.
#
# Reproduces the SaaS tenant topology on localhost using the SAME
# images that ship to production:
#
# client → cf-proxy (nginx, mimics CF tunnel headers)
# → tenant (workspace-server/Dockerfile.tenant — combined platform + canvas)
# → cp-stub (control-plane stand-in) for /cp/* and CP-callback paths
# → postgres + redis (same versions as production)
#
# Why this matters: the workspace-server binary IS identical between
# local and production. The bugs that survive local E2E are topology
# bugs — env-gated middleware (TenantGuard, CP proxy, Canvas proxy),
# auth state, header rewrites, real production image. This harness
# activates ALL of them.
#
# Quickstart:
# cd tests/harness && ./up.sh
# ./seed.sh
# ./replays/peer-discovery-404.sh # reproduces issue #2397
#
# Env config:
# GIT_SHA — passed to the tenant build for /buildinfo verification.
# Defaults to "harness" so /buildinfo distinguishes the
# harness build from any cached image.
# CP_STUB_PEERS_MODE — peers failure mode for replay scripts.
# "" / "404" / "401" / "500" / "timeout".
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: harness
      POSTGRES_PASSWORD: harness
      POSTGRES_DB: molecule
    networks: [harness-net]
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U harness"]
      interval: 2s
      timeout: 5s
      retries: 10

  redis:
    image: redis:7-alpine
    networks: [harness-net]
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 2s
      timeout: 5s
      retries: 10

  cp-stub:
    build:
      context: ./cp-stub
    environment:
      PORT: "9090"
      CP_STUB_PEERS_MODE: "${CP_STUB_PEERS_MODE:-}"
    networks: [harness-net]
    healthcheck:
      test: ["CMD-SHELL", "wget -q -O- http://localhost:9090/healthz || exit 1"]
      interval: 2s
      timeout: 5s
      retries: 10

  # The actual production tenant image — same Dockerfile.tenant CI publishes.
  # This is the load-bearing part of the harness: every bug class that hides
  # behind "but it works locally" is reproducible HERE, against this image,
  # not against `go run ./cmd/server`.
  tenant:
    build:
      context: ../..
      dockerfile: workspace-server/Dockerfile.tenant
      args:
        GIT_SHA: "${GIT_SHA:-harness}"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
      cp-stub:
        condition: service_healthy
    environment:
      DATABASE_URL: "postgres://harness:harness@postgres:5432/molecule?sslmode=disable"
      REDIS_URL: "redis://redis:6379"
      PORT: "8080"
      PLATFORM_URL: "http://tenant:8080"
      MOLECULE_ENV: "production"
      # ADMIN_TOKEN flips the platform into strict-auth mode (matches
      # production's CP-minted token configuration). Seeded value lets
      # E2E scripts authenticate without going through CP.
      ADMIN_TOKEN: "harness-admin-token"
      # MOLECULE_ORG_ID — activates TenantGuard middleware. Every request
      # must carry X-Molecule-Org-Id matching this value. Replays bugs
      # that only fire in SaaS mode.
      MOLECULE_ORG_ID: "harness-org"
      # CP_UPSTREAM_URL — activates the /cp/* reverse proxy mount in
      # router.go. Without this set, /cp/* would 404 and the canvas
      # bootstrap would silently drift from production behavior.
      CP_UPSTREAM_URL: "http://cp-stub:9090"
      RATE_LIMIT: "1000"
      # Canvas auto-proxy — entrypoint-tenant.sh exports CANVAS_PROXY_URL
      # by default; keeping it explicit here makes the topology readable.
      CANVAS_PROXY_URL: "http://localhost:3000"
    networks: [harness-net]
    healthcheck:
      test: ["CMD-SHELL", "wget -q -O- http://localhost:8080/health || exit 1"]
      interval: 5s
      timeout: 5s
      retries: 20

  # Cloudflare-tunnel-shape proxy — strips the :8080 suffix, rewrites
  # Host to the tenant subdomain, injects X-Forwarded-*. Tests target
  # http://harness-tenant.localhost:8080 and exercise the production
  # routing layer.
  cf-proxy:
    image: nginx:1.27-alpine
    depends_on:
      tenant:
        condition: service_healthy
    volumes:
      - ./cf-proxy/nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "8080:8080"
    networks: [harness-net]

networks:
  harness-net:
    name: molecule-harness-net

tests/harness/cp-stub/Dockerfile Normal file
@ -0,0 +1,14 @@
# cp-stub — minimal CP stand-in for the local production-shape harness.
# See main.go for the rationale. Self-contained build, no module deps.
FROM golang:1.25-alpine AS builder
WORKDIR /src
COPY go.mod ./
COPY main.go ./
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /cp-stub .
FROM alpine:3.20
RUN apk add --no-cache ca-certificates
COPY --from=builder /cp-stub /cp-stub
EXPOSE 9090
ENTRYPOINT ["/cp-stub"]

tests/harness/cp-stub/go.mod Normal file
@ -0,0 +1,3 @@
module github.com/Molecule-AI/molecule-monorepo/tests/harness/cp-stub
go 1.25

tests/harness/cp-stub/main.go Normal file
@ -0,0 +1,157 @@
// cp-stub — minimal control-plane stand-in for the local production-shape harness.
//
// In production, the tenant Go server reverse-proxies /cp/* to the SaaS
// control-plane (molecule-controlplane). This stub plays that role on
// localhost so we can exercise the SAME code path the tenant takes in
// production — `if cpURL := os.Getenv("CP_UPSTREAM_URL"); cpURL != ""`
// in workspace-server/internal/router/router.go fires, the proxy mount
// activates, and tests exercise the real tenant→CP wire.
//
// This is NOT a CP reimplementation. It serves the minimum surface to:
// 1. Boot the tenant image without /cp/* breaking the canvas bootstrap.
// 2. Replay specific bug classes (e.g. /cp/* returns 404, returns 5xx,
// returns malformed JSON) by toggling env vars.
//
// Scope is bounded by what the tenant + canvas actually call. Add new
// handlers as new replay scenarios demand them. Drift from real CP is
// tolerated because each handler is named for the exact path it serves —
// when the real CP changes, the failing scenario tells us where to look.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
    "strings"
    "sync/atomic"
)
// peersFailureMode controls /registry/<id>/peers responses for replay scripts.
// Empty (default) → 200 with the rolling peer list set via /__stub/peers.
// "404" → 404 (workspace not registered) — replay #2397.
// "401" → 401 (auth failure) — replay #2397.
// "500" → 500 (platform error) — replay #2397.
// "timeout" → hang for 60s — replay #2397 network branch.
//
// Set via env var CP_STUB_PEERS_MODE at startup, or POST /__stub/mode at runtime.
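//
// Example (illustrative; cp-stub publishes no host port, so this is only
// reachable from inside the compose network):
//   curl -X POST 'http://cp-stub:9090/__stub/mode?peers=404'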
var (
    peersFailureMode   atomic.Value // string
    peersList          atomic.Value // []map[string]any
    redeployFleetCalls atomic.Int64
)

func init() {
    peersFailureMode.Store(strings.ToLower(os.Getenv("CP_STUB_PEERS_MODE")))
    peersList.Store([]map[string]any{})
}
func main() {
    mux := http.NewServeMux()

    // /cp/auth/me — canvas calls this on bootstrap; minimal user record
    // keeps the canvas from redirecting to login during local E2E.
    mux.HandleFunc("/cp/auth/me", func(w http.ResponseWriter, r *http.Request) {
        writeJSON(w, 200, map[string]any{
            "id":     "harness-user",
            "email":  "harness@local",
            "org_id": "harness-org",
            "roles":  []string{"admin"},
        })
    })

    // /cp/admin/tenants/redeploy-fleet — exercised by the
    // redeploy-tenants-on-{staging,main} workflow's local replay. Returns
    // the same shape the real CP returns so the verify-fleet logic in CI
    // can be tested without spinning up a real EC2 fleet.
    mux.HandleFunc("/cp/admin/tenants/redeploy-fleet", func(w http.ResponseWriter, r *http.Request) {
        redeployFleetCalls.Add(1)
        writeJSON(w, 200, map[string]any{
            "ok": true,
            "results": []map[string]any{
                {
                    "slug":          "harness-tenant",
                    "phase":         "redeploy",
                    "ssm_status":    "Success",
                    "ssm_exit_code": 0,
                    "healthz_ok":    true,
                },
            },
        })
    })

    // __stub/peers — set the rolling peer list returned via tenant's
    // /registry/<id>/peers proxy. Used by replay scripts to seed the
    // scenario before invoking tool_list_peers from a workspace.
    mux.HandleFunc("/__stub/peers", func(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodPost {
            http.Error(w, "POST required", 405)
            return
        }
        var body []map[string]any
        if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
            http.Error(w, "bad JSON: "+err.Error(), 400)
            return
        }
        peersList.Store(body)
        writeJSON(w, 200, map[string]any{"ok": true, "count": len(body)})
    })

    // __stub/mode — toggle peersFailureMode at runtime for replay scripts.
    mux.HandleFunc("/__stub/mode", func(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodPost {
            http.Error(w, "POST required", 405)
            return
        }
        mode := strings.ToLower(r.URL.Query().Get("peers"))
        peersFailureMode.Store(mode)
        writeJSON(w, 200, map[string]any{"ok": true, "peers_mode": mode})
    })

    // __stub/state — expose stub state (counters, current mode) so replay
    // scripts can assert the tenant actually called us.
    mux.HandleFunc("/__stub/state", func(w http.ResponseWriter, r *http.Request) {
        writeJSON(w, 200, map[string]any{
            "peers_mode":           peersFailureMode.Load(),
            "redeploy_fleet_calls": redeployFleetCalls.Load(),
        })
    })

    // Catch-all for any /cp/* the tenant proxies. Keeps the harness from
    // crashing the canvas when a new CP route is added — surfaces a clear
    // "stub doesn't implement X" error instead of opaque 502 from the
    // reverse proxy.
    mux.HandleFunc("/cp/", func(w http.ResponseWriter, r *http.Request) {
        writeJSON(w, 501, map[string]any{
            "error": "cp-stub: handler not implemented for " + r.Method + " " + r.URL.Path,
            "hint":  "add a handler in tests/harness/cp-stub/main.go for the scenario you're testing",
        })
    })

    // /healthz — readiness probe for compose's depends_on.
    mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        writeJSON(w, 200, map[string]any{"status": "ok"})
    })

    addr := ":" + envOr("PORT", "9090")
    log.Printf("cp-stub listening on %s", addr)
    if err := http.ListenAndServe(addr, mux); err != nil {
        log.Fatal(err)
    }
}
func writeJSON(w http.ResponseWriter, code int, body any) {
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(code)
    if err := json.NewEncoder(w).Encode(body); err != nil {
        fmt.Fprintf(os.Stderr, "cp-stub: write json: %v\n", err)
    }
}

func envOr(k, def string) string {
    if v := os.Getenv(k); v != "" {
        return v
    }
    return def
}

tests/harness/down.sh Executable file
@ -0,0 +1,6 @@
#!/usr/bin/env bash
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$HERE"
docker compose -f compose.yml down -v --remove-orphans
echo "[harness] down + volumes removed."

tests/harness/replays/buildinfo-stale-image.sh Executable file
@ -0,0 +1,75 @@
#!/usr/bin/env bash
# Replay for issue #2395 — local proof that the /buildinfo verify gate
# closes the SaaS deploy-chain blindness.
#
# Prior behavior: redeploy-fleet returned ssm_status=Success based on
# the SSM RPC return code alone. EC2 tenants kept serving the cached
# :latest digest because `docker compose up -d` is a no-op when the
# tag hasn't been invalidated. ssm_status=Success was lying.
#
# This replay simulates that condition locally:
#   1. Curl /buildinfo and assert the wire shape the workflow's jq lookup
#      expects (a non-empty git_sha).
#   2. Assert the harness build threaded GIT_SHA through (default "harness",
#      or whatever GIT_SHA was exported at build time) rather than the "dev"
#      fallback; i.e. the new code actually shipped in the image.
#   3. Negative test: compare the actual SHA against a deliberately-wrong
#      expected SHA and assert the workflow's mismatch-detection logic
#      flags it.
#
# This proves the verify-step's jq lookup + comparison logic works
# against the SAME Dockerfile.tenant production builds. If the
# /buildinfo route ever stops being wired through, this replay
# catches it before it reaches a production tenant.
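#
# Assumed /buildinfo wire shape (only git_sha is asserted below):
#   {"git_sha":"<sha the image was built with, or 'harness'>", ...}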
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
HARNESS_ROOT="$(dirname "$HERE")"
BASE="${BASE:-http://harness-tenant.localhost:8080}"
# 1. Confirm /buildinfo wire shape — same shape the workflow's jq lookup expects.
echo "[replay] curl $BASE/buildinfo ..."
BUILD_JSON=$(curl -sS "$BASE/buildinfo")
echo "[replay] $BUILD_JSON"
ACTUAL_SHA=$(echo "$BUILD_JSON" | jq -r '.git_sha // ""')
if [ -z "$ACTUAL_SHA" ]; then
echo "[replay] FAIL: /buildinfo response missing git_sha field — workflow's jq lookup would null"
exit 1
fi
echo "[replay] git_sha=$ACTUAL_SHA"
# 2. Assert the harness build threaded GIT_SHA through. If we got "dev",
# the Dockerfile arg / ldflags wiring is broken — same regression
# class that made #2395 invisible until production.
EXPECTED_FROM_HARNESS="${HARNESS_GIT_SHA:-harness}"
if [ "$ACTUAL_SHA" = "dev" ]; then
echo "[replay] FAIL: /buildinfo returned 'dev' — Dockerfile.tenant ARG GIT_SHA isn't reaching the binary"
echo "[replay] This regresses #2395 by silencing the deploy-verify gate."
exit 1
fi
if [ "$ACTUAL_SHA" != "$EXPECTED_FROM_HARNESS" ]; then
echo "[replay] WARN: /buildinfo returned '$ACTUAL_SHA' but harness was built with GIT_SHA='$EXPECTED_FROM_HARNESS'"
echo "[replay] Image may be cached from a previous run. Run ./up.sh --rebuild to force a fresh build."
fi
# 3. Negative test — replay the workflow's mismatch detection by
# comparing the actual SHA to a deliberately-wrong expected SHA.
WRONG_EXPECTED="0000000000000000000000000000000000000000"
if [ "$ACTUAL_SHA" = "$WRONG_EXPECTED" ]; then
echo "[replay] FAIL: /buildinfo returned all-zero SHA — wiring inverted"
exit 1
fi
# 4. Replay the workflow's exact comparison logic so a regression in
# the verify step's bash gets caught here.
MISMATCH_DETECTED=0
if [ "$ACTUAL_SHA" != "$WRONG_EXPECTED" ]; then
MISMATCH_DETECTED=1
fi
if [ "$MISMATCH_DETECTED" != "1" ]; then
echo "[replay] FAIL: workflow comparison logic would not flag a real mismatch"
exit 1
fi
echo ""
echo "[replay] PASS: /buildinfo wire shape, GIT_SHA injection, and mismatch detection all work in"
echo " production-shape topology. The redeploy-fleet verify-step covers what it claims to."

tests/harness/replays/peer-discovery-404.sh Executable file
@ -0,0 +1,107 @@
#!/usr/bin/env bash
# Replay for issue #2397 — local proof that the peer-discovery
# diagnostic surfacing fix actually works.
#
# Prior behavior: tool_list_peers returned "No peers available (this
# workspace may be isolated)" regardless of WHY peers were empty.
# Five distinct conditions collapsed to one ambiguous message.
#
# This replay seeds the cp-stub to return 404 from /registry/<id>/peers
# (simulating a workspace whose registration was wiped), then calls
# the workspace's tool_list_peers via MCP. After the fix in #2399, the
# response should mention "404" + "registered" — proving the diagnostic
# reaches the agent in production-shape topology, not just unit tests.
#
# Pre-fix baseline: this script's PASS criterion is the new diagnostic
# string. If we ever regress to "may be isolated", the replay fails
# and CI catches it before the agent + user are blind to the cause.
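#
# Illustrative example of an acceptable diagnostic (exact wording belongs to
# the runtime helper; the assertions below only require "404" and "regist"):
#   "peers lookup returned 404: workspace not registered with the platform"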
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
HARNESS_ROOT="$(dirname "$HERE")"
cd "$HARNESS_ROOT"
if [ ! -f .seed.env ]; then
  echo "[replay] no .seed.env — running ./seed.sh first..."
  ./seed.sh
fi
# shellcheck source=/dev/null
source .seed.env
BASE="${BASE:-http://harness-tenant.localhost:8080}"
ADMIN="harness-admin-token"
ORG="harness-org"
# 1. Force the 404 path on the tenant. The workspace runtime's get_peers
#    calls /registry/:id/peers ON THE TENANT (the platform's /registry
#    endpoints aren't proxied through cp-stub), and for a registered
#    workspace that simply DB-resolves and returns []. To hit the 404
#    branch we need a workspace whose ID never registered, so ask the
#    tenant for peers of a non-existent id: its discovery handler
#    returns 404 when the workspace doesn't exist.
ROGUE_ID="$(uuidgen | tr '[:upper:]' '[:lower:]')"
echo "[replay] querying /registry/$ROGUE_ID/peers (workspace doesn't exist)..."
HTTP_CODE=$(curl -sS -o /tmp/peer-replay.json -w '%{http_code}' \
  -H "Authorization: Bearer $ADMIN" \
  -H "X-Molecule-Org-Id: $ORG" \
  -H "X-Workspace-ID: $ROGUE_ID" \
  "$BASE/registry/$ROGUE_ID/peers")
echo "[replay] tenant responded HTTP $HTTP_CODE"

# 2. The Python diagnostic helper get_peers_with_diagnostic must convert
#    that 404 into an actionable string. First assert the wire shape that
#    feeds it (the 404 itself); the helper is invoked against it in the
#    next step.
if [ "$HTTP_CODE" != "404" ]; then
  echo "[replay] FAIL: expected 404 from /registry/<unregistered>/peers, got $HTTP_CODE"
  cat /tmp/peer-replay.json
  exit 1
fi
# 3. Verify that running the runtime's diagnostic helper against this
# response surfaces the actionable string. We call the helper as a
# one-shot Python eval, mirroring how the runtime would consume it.
echo "[replay] invoking workspace runtime diagnostic helper against the 404..."
WORKSPACE_PATH="$(cd "$HARNESS_ROOT/../../workspace" && pwd)"
DIAGNOSTIC=$(WORKSPACE_ID="$ROGUE_ID" PLATFORM_URL="$BASE" \
  PYTHONPATH="$WORKSPACE_PATH" \
  python3 -c "
import asyncio, sys
sys.path.insert(0, '$WORKSPACE_PATH')
import a2a_client

async def main():
    peers, diag = await a2a_client.get_peers_with_diagnostic()
    print(repr(diag))

asyncio.run(main())
")
echo "[replay] diagnostic from helper: $DIAGNOSTIC"
# 4. Assert the diagnostic contains "404" + "register" — the actionable
# parts of the message. If we regress to None or "may be isolated",
# fail the replay.
if ! echo "$DIAGNOSTIC" | grep -q "404"; then
echo "[replay] FAIL: diagnostic missing '404' — regressed to swallow-the-status-code"
exit 1
fi
if ! echo "$DIAGNOSTIC" | grep -qi "regist"; then
echo "[replay] FAIL: diagnostic missing 'register' guidance — regressed to opaque message"
exit 1
fi
if echo "$DIAGNOSTIC" | grep -qi "may be isolated"; then
echo "[replay] FAIL: diagnostic still says 'may be isolated' — fix didn't reach this code path"
exit 1
fi
echo ""
echo "[replay] PASS: peer-discovery 404 surfaces actionable diagnostic in production-shape topology."

tests/harness/seed.sh Executable file
@ -0,0 +1,65 @@
#!/usr/bin/env bash
# Seed the harness with two registered workspaces so peer-discovery
# replay scripts have something to discover.
#
# - "alpha" parent (tier 0)
# - "beta" child of alpha (tier 1)
#
# Both register via the platform's /registry/register endpoint, which
# is what real workspaces do at boot. The platform then has them in its
# DB; tool_list_peers from inside alpha can resolve beta as a peer.
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$HERE"
BASE="${BASE:-http://harness-tenant.localhost:8080}"
ADMIN="harness-admin-token"
ORG="harness-org"
curl_admin() {
  curl -sS -H "Authorization: Bearer $ADMIN" \
    -H "X-Molecule-Org-Id: $ORG" \
    -H "Content-Type: application/json" "$@"
}

echo "[seed] confirming tenant is reachable via cf-proxy..."
HEALTH=$(curl -sS "$BASE/health" || echo "")
if [ -z "$HEALTH" ]; then
  echo "[seed] FAILED: $BASE/health unreachable. Did ./up.sh complete? Did you add"
  echo "       127.0.0.1 harness-tenant.localhost to /etc/hosts?"
  exit 1
fi
echo "[seed] $HEALTH"
echo "[seed] confirming /buildinfo returns the harness GIT_SHA..."
BUILD=$(curl -sS "$BASE/buildinfo" || echo "")
echo "[seed] $BUILD"
# Mint a fresh workspace ID for the parent. (The platform's
# /admin/workspaces/:id/test-token can mint a per-workspace bearer; the
# replay scripts here authenticate with the seeded ADMIN_TOKEN instead.)
echo "[seed] creating workspace 'alpha' (parent)..."
ALPHA_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')
curl_admin -X POST "$BASE/workspaces" \
  -d "{\"id\":\"$ALPHA_ID\",\"name\":\"alpha\",\"tier\":0,\"runtime\":\"langgraph\"}" \
  >/dev/null
echo "[seed] alpha id=$ALPHA_ID"

echo "[seed] creating workspace 'beta' (child of alpha)..."
BETA_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')
curl_admin -X POST "$BASE/workspaces" \
  -d "{\"id\":\"$BETA_ID\",\"name\":\"beta\",\"tier\":1,\"parent_id\":\"$ALPHA_ID\",\"runtime\":\"langgraph\"}" \
  >/dev/null
echo "[seed] beta id=$BETA_ID"

# Stash IDs so replay scripts pick them up.
{
  echo "ALPHA_ID=$ALPHA_ID"
  echo "BETA_ID=$BETA_ID"
} > "$HERE/.seed.env"
echo ""
echo "[seed] done. IDs persisted to tests/harness/.seed.env"
echo "[seed] ALPHA_ID=$ALPHA_ID"
echo "[seed] BETA_ID=$BETA_ID"

tests/harness/up.sh Executable file
@ -0,0 +1,39 @@
#!/usr/bin/env bash
# Bring the production-shape harness up.
#
# Usage: ./up.sh [--rebuild]
#
# Always operates in tests/harness/ regardless of where it's invoked
# from — test scripts under tests/harness/replays/ source it via the
# absolute path, so cd-ing first prevents compose-context surprises.
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "$HERE"
REBUILD=false
for arg in "$@"; do
  case "$arg" in
    --rebuild) REBUILD=true ;;
  esac
done

if [ "$REBUILD" = true ]; then
  docker compose -f compose.yml build --no-cache tenant cp-stub
fi
echo "[harness] starting cp-stub + postgres + redis + tenant + cf-proxy ..."
docker compose -f compose.yml up -d --wait
echo "[harness] /etc/hosts entry for harness-tenant.localhost..."
if ! grep -q '^127\.0\.0\.1[[:space:]]\+harness-tenant\.localhost' /etc/hosts; then
echo " (skip — your /etc/hosts may not resolve *.localhost. If tests fail with"
echo " 'getaddrinfo' errors, add: 127.0.0.1 harness-tenant.localhost)"
fi
echo ""
echo "[harness] up. Tenant: http://harness-tenant.localhost:8080/health"
echo " http://harness-tenant.localhost:8080/buildinfo"
echo " cp-stub: http://localhost (internal-only via compose net)"
echo ""
echo "Next: ./seed.sh # mint admin token + register sample workspaces"