Brings the local harness from "single tenant covering the request path" to "two tenants covering both the request path AND the per-tenant isolation boundary" — the same shape production runs (one EC2 + one Postgres + one MOLECULE_ORG_ID per tenant).

Why this matters: the four prior replays exercise the SaaS request path against one tenant. They cannot prove that TenantGuard rejects a misrouted request (the production CF tunnel + AWS LB are the failure surface), nor that two tenants doing legitimate work in parallel keep their `activity_logs` / `workspaces` / connection-pool state partitioned. Both are real bug classes — TenantGuard allowlist drift shipped #2398, and the lib/pq prepared-statement cache collision is documented as an org-wide hazard.

What changed:

1. compose.yml — split into two tenants: tenant-alpha + postgres-alpha + tenant-beta + postgres-beta + the shared cp-stub, redis, and cf-proxy. Each tenant gets a distinct ADMIN_TOKEN + MOLECULE_ORG_ID and its own Postgres database. cf-proxy depends on both tenants becoming healthy.

2. cf-proxy/nginx.conf — Host-header → tenant routing. `map $host $tenant_upstream` resolves the right backend per request. Required `resolver 127.0.0.11 valid=30s ipv6=off;` because nginx needs an explicit DNS resolver to use a variable in `proxy_pass` (literal hostnames resolve once at startup; variables resolve per request — without the resolver, nginx fails closed with 502). `server_name` lists both tenants plus the legacy alias so unknown Host headers don't silently route to a default and mask routing bugs.

3. _curl.sh — per-tenant + cross-tenant-negative helpers. `curl_alpha_admin` / `curl_beta_admin` set the right Host + Authorization + X-Molecule-Org-Id triple. `curl_alpha_creds_at_beta` / `curl_beta_creds_at_alpha` exist precisely to make WRONG requests (replays use them to assert TenantGuard rejects). `psql_exec_alpha` / `psql_exec_beta` shell out to per-tenant Postgres. Legacy aliases (`curl_admin`, `psql_exec`) keep the four pre-Phase-2 replays working without edits.

4. seed.sh — registers parent+child workspaces in BOTH tenants. Captures server-generated IDs via `jq -r '.id'` (POST /workspaces ignores body.id, so the older client-side mint silently desynced from the workspaces table and broke FK-dependent replays). Stashes `ALPHA_PARENT_ID` / `ALPHA_CHILD_ID` / `BETA_PARENT_ID` / `BETA_CHILD_ID` to .seed.env, plus legacy `ALPHA_ID` / `BETA_ID` aliases for backwards compat with chat-history / channel-envelope.

5. New replays.
   - tenant-isolation.sh (13 assertions) — TenantGuard 404s any request whose X-Molecule-Org-Id doesn't match the container's MOLECULE_ORG_ID. Asserts the 404 body contains zero tenant/org/forbidden/denied keywords (the existence of a tenant must not be probeable from the outside). Covers cross-tenant routing misconfiguration, allowlist drift, and the missing-org-header case.
   - per-tenant-independence.sh (12 assertions) — both tenants seed activity_logs in parallel with distinct row counts (3 vs 5), and each tenant's history endpoint must return exactly its own counts. Then a concurrent INSERT race (10 rows per tenant in parallel via `&` + wait) catches shared-pool corruption, prepared-statement cache poisoning, and redis cross-keyspace bleed.

6. Bug fix: SECRETS_ENCRYPTION_KEY validation in down.sh + dump-logs. `docker compose down -v` validates the entire compose file even though it doesn't read the env. up.sh generates a per-run key into its own shell — down.sh runs in a fresh shell that wouldn't see it, so without a placeholder, `compose down` exited non-zero before removing volumes and workspaces silently leaked into the next ./up.sh + seed.sh boot. Caught when tenant-isolation.sh F1/F2 saw 3× duplicate alpha-parent rows accumulated across three prior runs. Same fix applied to the workflow's dump-logs step.

7. requirements.txt — pin molecule-ai-workspace-runtime>=0.1.78. channel-envelope-trust-boundary.sh imports from `molecule_runtime.*` (the wheel-rewritten path), so it catches the failure mode where the wheel build silently strips a fix that unit tests on local source still pass. CI was failing this replay because the wheel wasn't installed — caught in the staging push run from #2492.

8. .github/workflows/harness-replays.yml — Phase 2 plumbing.
   - Removed the /etc/hosts step (the Host-header path eliminated the need; scripts already source _curl.sh).
   - Updated dump-logs to reference the new service names (tenant-alpha + tenant-beta + postgres-alpha + postgres-beta).
   - Added a SECRETS_ENCRYPTION_KEY placeholder env on the dump step.

Verified: ./run-all-replays.sh from a clean state — 6/6 passed (buildinfo-stale-image, channel-envelope-trust-boundary, chat-history, peer-discovery-404, per-tenant-independence, tenant-isolation).

Roadmap section updated: Phase 2 marked shipped; Phase 3 promoted to "replace cp-stub with real molecule-controlplane Docker build + env coherence lint."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
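As a rough illustration of the helper shape item 3 describes, here is a minimal sketch of two `_curl.sh` helpers. The env var names (`ALPHA_ADMIN_TOKEN`, `ALPHA_ORG_ID`) and `BASE_URL` are assumptions; only the Host + Authorization + X-Molecule-Org-Id triple comes from the description.

```shell
#!/usr/bin/env bash
# Sketch of the _curl.sh per-tenant helpers. Variable names are
# illustrative assumptions; the real _curl.sh may differ.
BASE_URL="${BASE_URL:-http://localhost:8080}"

# Correct triple for tenant alpha: Host + Authorization + org id.
curl_alpha_admin() {
  local path="$1"; shift
  curl -sS \
    -H "Host: harness-tenant-alpha.localhost" \
    -H "Authorization: Bearer ${ALPHA_ADMIN_TOKEN}" \
    -H "X-Molecule-Org-Id: ${ALPHA_ORG_ID}" \
    "$@" "${BASE_URL}${path}"
}

# Deliberately WRONG pairing: alpha's credentials sent to beta's Host.
# Replays call this to assert TenantGuard answers 404, not 200.
curl_alpha_creds_at_beta() {
  local path="$1"; shift
  curl -sS \
    -H "Host: harness-tenant-beta.localhost" \
    -H "Authorization: Bearer ${ALPHA_ADMIN_TOKEN}" \
    -H "X-Molecule-Org-Id: ${ALPHA_ORG_ID}" \
    "$@" "${BASE_URL}${path}"
}
```

The negative helpers keep the "wrong" request construction in one audited place, so a replay cannot accidentally build a half-wrong request that passes for the wrong reason.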
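The seed.sh ID-capture pattern from item 4 can be sketched as follows. `capture_id` is a hypothetical helper name; the `{"id": ...}` response shape and the `jq -r '.id'` extraction come from the description.

```shell
# Sketch of the seed.sh ID-capture step. POST /workspaces ignores
# body.id, so the only trustworthy id is the one the server returns.
capture_id() {
  printf '%s' "$1" | jq -r '.id'
}

# In seed.sh this wraps the real request (curl_alpha_admin per _curl.sh):
#   resp=$(curl_alpha_admin /workspaces -X POST -d '{"name":"alpha-parent"}')
#   ALPHA_PARENT_ID=$(capture_id "$resp")
#   echo "ALPHA_PARENT_ID=${ALPHA_PARENT_ID}" >> .seed.env
#   echo "ALPHA_ID=${ALPHA_PARENT_ID}"        >> .seed.env   # legacy alias
```

Capturing the server-side id (rather than minting one client-side) is what keeps FK-dependent replays in sync with the workspaces table.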
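The 404-opacity assertion from tenant-isolation.sh (item 5) might look like the sketch below. The helper name is hypothetical; the keyword list (tenant/org/forbidden/denied) is taken from the description.

```shell
# Sketch of the tenant-isolation.sh opacity check: a cross-tenant 404
# body must not leak that the target tenant exists.
assert_opaque_404() {
  local body="$1" kw
  for kw in tenant org forbidden denied; do
    if printf '%s' "$body" | grep -qi "$kw"; then
      echo "FAIL: 404 body leaks '${kw}'"
      return 1
    fi
  done
  echo "PASS: 404 body is opaque"
}
```

A body like `{"error":"tenant mismatch"}` fails the check even though the status code is right, which is exactly the probing surface the replay guards.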
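The concurrent INSERT race from per-tenant-independence.sh (item 5) follows the `&` + `wait` shape below. The psql helpers are stubbed with temp files here so the sketch runs standalone; in the harness they shell into each tenant's Postgres, and the table/column names are assumptions.

```shell
# Sketch of the concurrent-INSERT race: both tenants write 10 rows
# in parallel, then the replay asserts each tenant sees only its own.
ALPHA_LOG="$(mktemp)"; BETA_LOG="$(mktemp)"
psql_exec_alpha() { echo "$1" >> "$ALPHA_LOG"; }   # stub for the sketch
psql_exec_beta()  { echo "$1" >> "$BETA_LOG"; }    # stub for the sketch

insert_rows() {
  local exec_fn="$1" tenant="$2" i
  for i in $(seq 1 10); do
    "$exec_fn" "INSERT INTO activity_logs (note) VALUES ('${tenant}-row-${i}');"
  done
}

insert_rows psql_exec_alpha alpha &   # both tenants write simultaneously
insert_rows psql_exec_beta  beta  &
wait                                  # join both writers before asserting

echo "alpha=$(wc -l < "$ALPHA_LOG") beta=$(wc -l < "$BETA_LOG")"
```

Running the writers truly in parallel (not sequentially) is the point: shared-pool corruption and prepared-statement cache poisoning only surface under interleaved traffic.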
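The down.sh fix from item 6 reduces to one parameter-expansion line; the placeholder value below is illustrative.

```shell
# Sketch of the down.sh fix: `docker compose down -v` validates the
# whole compose file, so SECRETS_ENCRYPTION_KEY must be set even though
# the down path never reads it. up.sh's per-run key lives in a different
# shell, so fall back to a placeholder when the var is absent.
: "${SECRETS_ENCRYPTION_KEY:=placeholder-only-for-compose-validation}"
export SECRETS_ENCRYPTION_KEY

# docker compose down -v   # validates cleanly; volumes actually removed
```

The `: "${VAR:=default}"` idiom assigns only when the variable is unset or empty, so a key already exported by the caller wins over the placeholder.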
98 lines
4.5 KiB
Nginx Configuration File
# cf-proxy — Cloudflare-tunnel-shape reverse proxy for the local harness.
#
# Production path: agent → CF tunnel → AWS LB → tenant container.
# This config replays the same header rewrites the CF tunnel does so
# the tenant sees the same Host + X-Forwarded-* it would in production.
#
# Multi-tenant: nginx routes by Host header to the right tenant
# container — exactly the same way the production CF tunnel does
# (URL is the public CF endpoint, Host carries the tenant identity).
#
# How tests reach it (no /etc/hosts required):
#   curl -H 'Host: harness-tenant-alpha.localhost' http://localhost:8080/health
#   curl -H 'Host: harness-tenant-beta.localhost'  http://localhost:8080/health
#
# Backwards-compat: harness-tenant.localhost (no -alpha/-beta suffix) maps
# to alpha for legacy single-tenant replays.

worker_processes 1;
events { worker_connections 256; }

http {
    # Docker's embedded DNS at 127.0.0.11. Required because the
    # `proxy_pass http://$tenant_upstream:8080` below uses a variable —
    # nginx needs an explicit resolver to do per-request DNS lookups
    # (literal hostnames are resolved once at startup, variables are
    # resolved per-request). Without this, nginx fails closed with
    # "no resolver defined" + 502.
    #
    # `valid=30s` caps cache life so a tenant container restart picks
    # up a new IP within 30 seconds. ipv6=off skips AAAA lookups that
    # Docker DNS doesn't always serve cleanly.
    resolver 127.0.0.11 valid=30s ipv6=off;

    # Host → tenant map: the routing decision lives in one place so the
    # single server block below can keep its header rewrites + buffering
    # settings shared between alpha and beta, preventing drift as the
    # harness grows. `default` covers bare `localhost` for legacy
    # single-tenant replays; unrecognised Hosts never fall through to it
    # because the catch-all server below rejects them first.
    map $host $tenant_upstream {
        default                         tenant-alpha;
        harness-tenant.localhost        tenant-alpha;
        harness-tenant-alpha.localhost  tenant-alpha;
        harness-tenant-beta.localhost   tenant-beta;
    }

    # WebSocket upgrades: forward `Connection: upgrade` only when the
    # client actually requested one; otherwise send an empty Connection
    # header so upstream keepalive keeps working.
    map $http_upgrade $connection_upgrade {
        default upgrade;
        ''      "";
    }

    # Reject Host headers we don't recognise — without this, an
    # unknown Host would silently route to the default tenant and
    # mask cross-tenant routing bugs in test output. 421 Misdirected
    # Request is the closest status to "this proxy doesn't serve
    # that hostname".
    server {
        listen 8080 default_server;
        server_name _;
        return 421;
    }

    server {
        listen 8080;
        server_name harness-tenant.localhost
                    harness-tenant-alpha.localhost
                    harness-tenant-beta.localhost
                    localhost;

        # Cap upload at 50MB to mirror the staging tenant nginx limit;
        # chat upload tests will fail closed if the platform handler
        # ever silently expands its limit (catches the failure mode
        # opposite of the chat-files lazy-heal incident).
        client_max_body_size 50m;

        location / {
            # The map above resolves $tenant_upstream to the right
            # container based on the Host header — production CF tunnel
            # behavior in one line.
            proxy_pass http://$tenant_upstream:8080;

            # Header parity with CF tunnel + AWS LB. Production CF sets
            # X-Forwarded-Proto=https; we keep http here because TLS
            # termination in compose is unnecessary for testing the
            # tenant logic — TLS is a CF concern, not a tenant bug
            # surface. If TLS-specific bugs ever bite, add cert-manager
            # + listen 8443 ssl here.
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Host $host;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Streamable HTTP / SSE / WebSocket — the tenant exposes /ws
            # and /events/stream + MCP /mcp/stream. Disabling buffering
            # reproduces CF tunnel's pass-through streaming semantics
            # (CF tunnel = no buffering by default; nginx default IS
            # buffering, which would mask issue #2397-class streaming
            # bugs by accumulating output until the client disconnects).
            proxy_buffering off;
            proxy_request_buffering off;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection $connection_upgrade;

            # Read timeout — CF tunnel default is 100s. Setting this to
            # the same value catches "long agent run finishes after the
            # proxy already closed the upstream" failure mode.
            proxy_read_timeout 100s;
        }
    }
}