5c989fef2f
CTO 2026-05-19 directive on forensic a99ab0a1 (reno-stars >50MB
upload that surfaced "signal timed out" when the real cause was
file-size + a fixed 60s client timeout):

  "if its file size issue, should have error that instead saying
  timeout which is wrong"

Bundles the cap raise + the wrong-reason fix in ONE PR because the
two are coupled: bumping the server alone would still leak the
fixed-60s timeout for legitimate slow uploads; fixing the client
alone would 413 every >50MB attempt.

Server (push-mode, EC2 workspace):
  - workspace-server/internal/handlers/chat_files.go:
      chatUploadMaxBytes 50→100 MB
      httpClient.Timeout 120→1200 s (matches the new slow-uplink budget)
  - workspace/internal_chat_uploads.py:
      CHAT_UPLOAD_MAX_BYTES 50→100 MB
      CHAT_UPLOAD_MAX_FILE_BYTES 25→100 MB (aligned with total so a
      single legitimate large file succeeds end-to-end)

Canvas:
  - canvas/src/components/tabs/chat/uploads.ts:
      MAX_UPLOAD_BYTES 100 MB constant + FileTooLargeError class
      pre-flight gate: file-size violation throws BEFORE any fetch,
      with the actionable "File too large (got X MB) — limit is 100MB"
      computeUploadTimeoutMs: 60s floor + 100 KB/s scaled deadline
      (was a fixed 60s, the root cause of the forensic)
  - canvas/src/components/tabs/chat/hooks/useChatSend.ts:
      mapUploadErrorToReason: routes each cause to ITS OWN message
      (FileTooLargeError | TimeoutError | server-Error | fallback)
      no conflation between file-size and connection-too-slow
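
The Canvas changes above can be sketched roughly as follows. This is an illustrative reconstruction, not the code in uploads.ts or useChatSend.ts: only the names MAX_UPLOAD_BYTES, FileTooLargeError, computeUploadTimeoutMs and mapUploadErrorToReason come from this message. The helper assertWithinUploadCap, the binary-MB units, the inclusive cap, the max(floor, size / 100 KB/s) formula, and the timeout and fallback wording are all assumptions.

```typescript
// Hypothetical sketch; function bodies are assumptions, not the PR's code.
const MAX_UPLOAD_BYTES = 100 * 1024 * 1024;   // 100 MB cap (binary MB assumed)
const UPLOAD_TIMEOUT_FLOOR_MS = 60_000;       // 60s floor
const MIN_UPLINK_BYTES_PER_SEC = 100 * 1024;  // 100 KB/s slow-uplink budget (assumed units)

class FileTooLargeError extends Error {
  readonly sizeBytes: number;

  constructor(sizeBytes: number) {
    const gotMb = (sizeBytes / (1024 * 1024)).toFixed(1);
    super(`File too large (got ${gotMb} MB) — limit is 100MB`);
    this.name = "FileTooLargeError";
    this.sizeBytes = sizeBytes;
    // Keeps `instanceof` working when compiled to older JS targets.
    Object.setPrototypeOf(this, FileTooLargeError.prototype);
  }
}

// Pre-flight gate: throws before any network request is made.
// Assumption: a file exactly at the cap is allowed.
function assertWithinUploadCap(sizeBytes: number): void {
  if (sizeBytes > MAX_UPLOAD_BYTES) {
    throw new FileTooLargeError(sizeBytes);
  }
}

// Deadline grows with file size but never drops below the floor.
function computeUploadTimeoutMs(sizeBytes: number): number {
  const scaledMs = Math.ceil((sizeBytes / MIN_UPLINK_BYTES_PER_SEC) * 1000);
  return Math.max(UPLOAD_TIMEOUT_FLOOR_MS, scaledMs);
}

// Each cause gets its own message; size and slowness are never conflated.
function mapUploadErrorToReason(err: unknown): string {
  if (err instanceof FileTooLargeError) return err.message;
  if (err instanceof Error && err.name === "TimeoutError") {
    return "Upload timed out: the connection was too slow for this file.";
  }
  if (err instanceof Error) return err.message; // e.g. a server 413 body
  return "Upload failed for an unknown reason.";
}
```

Under these assumptions a file at the cap gets a deadline of 1024 s, which stays below the 1200 s httpClient.Timeout on the server side, so the client deadline fires first and the user sees the timeout reason rather than a dropped connection.
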

Tests:
  - workspace-server chat_files_test.go: pins 100 MB constant,
    asserts sub-cap forwards + over-cap non-2xx
  - canvas uploads.cap.test.ts (10 cases): pre-flight gate, exact-cap
    edge, scaled-timeout curve, server-413 propagation, AbortSignal
    shape; explicit negative on "TimeoutError ≠ FileTooLargeError"
  - canvas useChatSend.errorReason.test.ts (5 cases): per-cause
    message contract, explicit negatives that guard against the
    wrong-reason conflation

Test harness mirror:
  - tests/harness/cf-proxy/nginx.conf: client_max_body_size 50m→100m
    (this is the harness mirror; the production CF / nginx tier is
    out-of-repo. If prod still caps at 50m, this mirror passes while
    prod 413s; surface to ops.)

Follow-up (SSOT, NOT in this PR):
  The 100 MB constant now lives in THREE mirror sites (canvas TS +
  workspace Python + platform Go). Per feedback_no_single_source_of_truth,
  the proper fix is exposing the cap via GET /uploads/limits so the
  client fetches the live value. Filing as a separate issue.

References:
  - task #295 (internal tracker; CTO-authorized this work)
  - forensic a99ab0a1 (reno-stars 2026-05-19)
  - feedback_surface_actionable_failure_reason_to_user (CTO 2026-05-17)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
102 lines
4.7 KiB
Nginx Configuration File
# cf-proxy — Cloudflare-tunnel-shape reverse proxy for the local harness.
#
# Production path: agent → CF tunnel → AWS LB → tenant container.
# This config replays the same header rewrites the CF tunnel does so
# the tenant sees the same Host + X-Forwarded-* it would in production.
#
# Multi-tenant: nginx routes by Host header to the right tenant
# container — exactly the same way the production CF tunnel does
# (URL is the public CF endpoint, Host carries the tenant identity).
#
# How tests reach it (no /etc/hosts required):
#   curl -H 'Host: harness-tenant-alpha.localhost' http://localhost:8080/health
#   curl -H 'Host: harness-tenant-beta.localhost' http://localhost:8080/health
#
# Backwards-compat: harness-tenant.localhost (no -alpha/-beta suffix) maps
# to alpha for legacy single-tenant replays.

worker_processes 1;
events { worker_connections 256; }

http {
    # Docker's embedded DNS at 127.0.0.11. Required because the
    # `proxy_pass http://$tenant_upstream:8080` below uses a variable —
    # nginx needs an explicit resolver to do per-request DNS lookups
    # (literal hostnames are resolved once at startup, variables are
    # resolved per-request). Without this, nginx fails closed with
    # "no resolver defined" + 502.
    #
    # `valid=30s` caps cache life so a tenant container restart picks
    # up a new IP within 30 seconds. ipv6=off skips AAAA lookups that
    # Docker DNS doesn't always serve cleanly.
    resolver 127.0.0.11 valid=30s ipv6=off;

    # Reusable proxy block so each tenant server only carries the
    # upstream-pointer + its identity-specific tweaks. Keeping the
    # header rewrites + buffering settings centralised prevents drift
    # between alpha and beta as the harness grows.
    map $host $tenant_upstream {
        default tenant-alpha;
        harness-tenant.localhost tenant-alpha;
        harness-tenant-alpha.localhost tenant-alpha;
        harness-tenant-beta.localhost tenant-beta;
    }

    server {
        listen 8080 default_server;

        # Reject Host headers we don't recognise — without this, an
        # unknown Host would silently route to the default tenant and
        # mask cross-tenant routing bugs in test output.
        server_name harness-tenant.localhost
                    harness-tenant-alpha.localhost
                    harness-tenant-beta.localhost
                    localhost;

        # Cap upload at 100MB to mirror the staging tenant nginx limit;
        # chat upload tests will fail closed if the platform handler
        # ever silently expands its limit (catches the failure mode
        # opposite of the chat-files lazy-heal incident). Bumped from
        # 50m to 100m in lockstep with chat_files.go chatUploadMaxBytes
        # (CTO 2026-05-19 directive on forensic a99ab0a1). If the
        # production CF / nginx tier still caps at 50m, this mirror
        # will pass while prod 413s — surface to ops if seen.
        client_max_body_size 100m;

        location / {
            # The map above resolves $tenant_upstream to the right
            # container based on the Host header — production CF tunnel
            # behavior in one line.
            proxy_pass http://$tenant_upstream:8080;

            # Header parity with CF tunnel + AWS LB. Production CF sets
            # X-Forwarded-Proto=https; we keep http here because TLS
            # termination in compose is unnecessary for testing the
            # tenant logic — TLS is a CF concern, not a tenant bug
            # surface. If TLS-specific bugs ever bite, add cert-manager
            # + listen 8443 ssl here.
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Host $host;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Streamable HTTP / SSE / WebSocket — the tenant exposes /ws
            # and /events/stream + MCP /mcp/stream. Disabling buffering
            # reproduces CF tunnel's pass-through streaming semantics
            # (CF tunnel = no buffering by default; nginx default IS
            # buffering, which would mask issue #2397-class streaming
            # bugs by accumulating output until the client disconnects).
            proxy_buffering off;
            proxy_request_buffering off;
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Read timeout — CF tunnel default is 100s. Setting this to
            # the same value catches "long agent run finishes after the
            # proxy already closed the upstream" failure mode.
            proxy_read_timeout 100s;
        }
    }
}