fix(uploads): bump poll-mode size_bytes CHECK to 100MB to match push-mode (mc#1588)
CI / Canvas Deploy Reminder (pull_request) Blocked by required conditions
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 11s
CI / Detect changes (pull_request) Successful in 21s
Check migration collisions / Migration version collision check (pull_request) Successful in 23s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 44s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
E2E Chat / detect-changes (pull_request) Successful in 17s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 16s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 11s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 13s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 9s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m6s
publish-runtime-autobump / pr-validate (pull_request) Successful in 38s
publish-runtime-autobump / bump-and-tag (pull_request) Has been skipped
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
gate-check-v3 / gate-check (pull_request) Successful in 4s
qa-review / approved (pull_request) Successful in 4s
security-review / approved (pull_request) Successful in 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 6s
sop-checklist / review-refire (pull_request) Has been skipped
sop-tier-check / tier-check (pull_request) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 5m27s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m15s
CI / Canvas (Next.js) (pull_request) Successful in 6m47s
CI / Python Lint & Test (pull_request) Successful in 7m32s
CI / all-required (pull_request) Successful in 7m5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 20s
Harness Replays / Harness Replays (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 1m41s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m58s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2m19s
audit-force-merge / audit (pull_request) Successful in 4s
E2E Chat / E2E Chat (pull_request) Failing after 9m40s

CTO directive 2026-05-19 task #295: poll-mode (laptop-runtime workspaces)
had a 25 MB CHECK constraint on `pending_uploads.size_bytes` that
diverged from the push-mode 100 MB cap mc#1588 is landing for the
SaaS EC2 tenants (reno-stars forensic a99ab0a1). This PR closes the
poll-mode gap so an external runtime workspace pulling chat-attached
files via the queue accepts the same per-file ceiling as a push-mode
tenant.

Surfaces touched (mirror sites for the per-file cap):

  1. workspace-server/migrations/20260519200000_pending_uploads_bump_size_cap.{up,down}.sql
     — NEW pair. DROP IF EXISTS the auto-named
     `pending_uploads_size_bytes_check` constraint and re-add at
     104857600. DOWN restores 26214400 (operator must drain any
     25–100 MB rows first; documented in down-migration header).

  2. workspace-server/internal/pendinguploads/storage.go
     — MaxFileBytes 25→100 MB. Pre-DB guard so an oversize Put
     short-circuits before round-tripping Postgres.

  3. workspace-server/internal/handlers/chat_files.go (comment only)
     — header comment on uploadPollMode updated from "Per-file cap:
     25 MB" to "100 MB" with the cross-reference to mc#1588.

  4. workspace/inbox_uploads.py (workspace-side puller for poll-mode)
     — MAX_FILE_BYTES 25→100 MB. DEFAULT_FETCH_TIMEOUT 60→240 s so a
     legitimate 100 MB transfer over a ~5 Mbps consumer link
     (~160 s wire time) doesn't hit the wrong-reason timeout failure
     the canvas side of mc#1588 fixed for forensic a99ab0a1.

  5. workspace-server/internal/handlers/pending_uploads_integration_test.go
     — TestIntegration_PendingUploads_SizeCap_100MB pins both the
     pre-DB guard (Put returns ErrTooLarge at MaxFileBytes+1) AND
     the raw DB CHECK (direct INSERT at 100MB+1 returns SQLSTATE
     23514). Skips in -short mode (allocates ~100 MB).
     TestIntegration_PendingUploads_SizeCap_DBConstraintName pins
     the constraint name + clause via pg_get_constraintdef so a
     future schema-rename can't silently regress to 25 MB through
     a no-op DROP IF EXISTS.

  6. workspace-server/internal/handlers/chat_files_poll_test.go
     — TestPollUpload_PerFileCapPreStorage_413 and
     TestPollUpload_AtomicRollbackOnSecondFileTooLarge gain a
     structural-precondition skip: when MaxFileBytes >=
     chatUploadMaxBytes (which is the post-mc#1588 + this PR state
     for poll-mode: per-file == body == 100 MB), the body
     MaxBytesReader 400s before the per-file 413 branch is
     reachable. Atomicity is still pinned by the integration test
     against real Postgres. Re-enable when chatUploadMaxBytes is
     raised above the per-file cap (RFC for GET /uploads/limits
     follow-up will reshape this layering).

SSOT note (per feedback_no_single_source_of_truth):

  After this PR the 100 MB constant lives in FIVE mirror sites —
  canvas TS (mc#1588) + workspace-server Go (chat_files.go via
  mc#1588 + this PR's storage.go) + workspace Python ingest
  (internal_chat_uploads.py via mc#1588) + workspace Python pull
  (inbox_uploads.py via this PR) + nginx harness mirror (mc#1588)
  + this DB CHECK. The proper fix is the GET /uploads/limits
  endpoint per CTO follow-up (RFC filed by a0d62036, referenced
  in mc#1588 description). Until that lands, every cap change
  needs a coupled bump across all five surfaces.

Out of scope:

  - mc#1588's push-mode bumps (chatUploadMaxBytes, the Python
    ingest CHAT_UPLOAD_MAX_FILE_BYTES, canvas MAX_UPLOAD_BYTES,
    nginx client_max_body_size). Those land via mc#1588; this PR
    touches a disjoint file set.

  - The GET /uploads/limits SSOT endpoint. Separate RFC.

References:
  - mc#1588 (push-mode 50→100 MB; canvas + workspace-server +
    workspace Python + nginx harness)
  - task #295 (internal tracker; CTO-authorized)
  - forensic a99ab0a1 (reno-stars 2026-05-19; the root failure
    motivating mc#1588 and this follow-up)
  - feedback_no_single_source_of_truth (the SSOT discipline this
    bump duplicates against, pending the /uploads/limits fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-19 20:42:10 -07:00
committed by hongming-pc2
parent 349d3a5ca7
commit 59180319ce
6 changed files with 250 additions and 20 deletions
@@ -572,8 +572,25 @@ func TestPollUpload_PerFileCapPreStorage_413(t *testing.T) {
// Pin the early-reject branch (fh.Size > MaxFileBytes) BEFORE we
// read the part into memory. Without this, an oversize file
// would hit the storage layer's belt-and-suspenders check, which
// works but burns ~25 MB of memory + DB round-trip first. Send
// 25 MB + 1 byte → 413 with the file size in the response.
// works but burns ~100 MB of memory + DB round-trip first.
//
// Structural precondition: pendinguploads.MaxFileBytes must be
// strictly LESS than chatUploadMaxBytes (the body cap on
// http.MaxBytesReader at chat_files.go:290) — otherwise a single
// oversize part never reaches our handler; the body reader 400s
// first. As of 2026-05-19 with the poll-mode bump to 100 MB and
// the push-mode bump in mc#1588 also to 100 MB, per-file == body
// for single-file uploads and this branch is only reachable in
// the multi-file scenario (see TestPollUpload_AtomicRollbackOn
// SecondFileTooLarge for that path). When chatUploadMaxBytes is
// next bumped above MaxFileBytes (e.g., RFC for the SSOT
// GET /uploads/limits endpoint reshapes the layering) this test
// can run again.
if pendinguploads.MaxFileBytes >= chatUploadMaxBytes {
t.Skipf("per-file cap %d >= body cap %d; the body MaxBytesReader 400s before the per-file 413 branch is reachable. Re-enable when body cap > per-file cap.",
pendinguploads.MaxFileBytes, chatUploadMaxBytes)
}
mock := setupTestDB(t)
setupTestRedis(t)
@@ -584,7 +601,7 @@ func TestPollUpload_PerFileCapPreStorage_413(t *testing.T) {
h := NewChatFilesHandler(NewTemplatesHandler(t.TempDir(), nil, nil)).
WithPendingUploads(store, nil)
// 25 MB + 1 byte. Single file, large enough to trip the early
// MaxFileBytes + 1. Single file, large enough to trip the early
// size check.
oversize := make([]byte, pendinguploads.MaxFileBytes+1)
body, ct := pollUploadFixture(t, map[string][]byte{"big.bin": oversize})
@@ -655,6 +672,25 @@ func TestPollUpload_SanitizesFilenameInResponse(t *testing.T) {
// real atomicity guarantee is the integration test in
// pending_uploads_integration_test.go.
func TestPollUpload_AtomicRollbackOnSecondFileTooLarge(t *testing.T) {
// Same structural precondition as
// TestPollUpload_PerFileCapPreStorage_413: pendinguploads.MaxFileBytes
// < chatUploadMaxBytes. The two-file fixture below sends
// MaxFileBytes+1 worth of bytes; when per-file == body cap, that
// total trips the body reader first (400) before our handler can
// reach the per-file 413 branch.
//
// Per-call atomicity is still pinned at the storage layer in
// TestIntegration_PendingUploads_PutBatch_RollsBackOnAnyError
// (workspace-server/internal/handlers/pending_uploads_integration_test.go)
// — that test exercises the same all-or-nothing guarantee against
// a real Postgres without depending on the body-cap arithmetic
// here. Re-enable this handler-level test when body cap exceeds
// per-file cap again.
if pendinguploads.MaxFileBytes >= chatUploadMaxBytes {
t.Skipf("per-file cap %d >= body cap %d; the body MaxBytesReader 400s before the per-file 413 branch is reachable. Storage-level atomicity covered by integration test. Re-enable when body cap > per-file cap.",
pendinguploads.MaxFileBytes, chatUploadMaxBytes)
}
mock := setupTestDB(t)
setupTestRedis(t)
@@ -669,3 +669,101 @@ func TestIntegration_PendingUploads_GetIgnoresExpiredAndAcked(t *testing.T) {
t.Errorf("Get after expiry: got %v, want ErrNotFound", err)
}
}
// TestIntegration_PendingUploads_SizeCap_100MB pins the post-mc#1588
// poll-mode cap to 100 MB at the DB CHECK level. The MaxFileBytes-1 row
// must INSERT cleanly (no CHECK violation) and the +1 row must be
// rejected pre-DB by Put with ErrTooLarge. We also verify the raw DB
// CHECK by attempting an INSERT that bypasses Put's pre-DB guard — the
// constraint itself must enforce 104857600.
//
// Regression target: migration 20260519200000_pending_uploads_bump_size_cap
// — if a future migration regresses the cap (e.g., re-applies the
// 25 MB ceiling) this test 413s.
func TestIntegration_PendingUploads_SizeCap_100MB(t *testing.T) {
if testing.Short() {
t.Skip("skipping 100MB cap integration test in -short mode (allocates ~100MB)")
}
conn := integrationDB_PendingUploads(t)
store := pendinguploads.NewPostgres(conn)
ctx := context.Background()
wsID := uuid.New()
// 50 MB row — previously would have hit the 25 MB CHECK; must now
// succeed end-to-end (Put → INSERT → row visible via Get).
fiftyMB := make([]byte, 50*1024*1024)
for i := range fiftyMB {
fiftyMB[i] = 0x42 // non-zero so bytea round-trip is observable
}
fid, err := store.Put(ctx, wsID, fiftyMB, "halfcap.bin", "application/octet-stream")
if err != nil {
t.Fatalf("Put 50MB: want nil err (cap is 100MB after mc#1588 follow-up), got %v", err)
}
rec, err := store.Get(ctx, fid)
if err != nil {
t.Fatalf("Get 50MB row: %v", err)
}
if rec.SizeBytes != int64(len(fiftyMB)) {
t.Errorf("size_bytes mismatch: got %d, want %d", rec.SizeBytes, len(fiftyMB))
}
// Exact-cap row — 100 MB exact must succeed (the CHECK is `<=`).
atCap := make([]byte, pendinguploads.MaxFileBytes)
if _, err := store.Put(ctx, wsID, atCap, "atcap.bin", "application/octet-stream"); err != nil {
t.Errorf("Put MaxFileBytes (at-cap): want nil err, got %v", err)
}
// Over-cap row — must be rejected by Put pre-DB.
overCap := make([]byte, pendinguploads.MaxFileBytes+1)
if _, err := store.Put(ctx, wsID, overCap, "overcap.bin", "application/octet-stream"); err != pendinguploads.ErrTooLarge {
t.Errorf("Put MaxFileBytes+1: want ErrTooLarge, got %v", err)
}
// Raw DB CHECK enforcement — bypass Put's pre-DB guard with a direct
// INSERT of a 101-MB row. The CHECK must reject with a Postgres
// integrity-violation (SQLSTATE 23514). This catches the case where a
// future code path inserts straight into the table without going
// through Put (e.g., a background importer).
overByOne := pendinguploads.MaxFileBytes + 1
_, dbErr := conn.ExecContext(ctx, `
INSERT INTO pending_uploads (workspace_id, content, size_bytes, filename, mimetype)
VALUES ($1, $2, $3, $4, $5)
`, wsID, []byte("dummy"), overByOne, "raw-overcap.bin", "application/octet-stream")
if dbErr == nil {
t.Errorf("raw INSERT of %d bytes: want CHECK violation, got nil", overByOne)
} else if !strings.Contains(dbErr.Error(), "pending_uploads_size_bytes_check") &&
!strings.Contains(dbErr.Error(), "23514") {
t.Errorf("raw INSERT: want size_bytes_check violation (SQLSTATE 23514), got: %v", dbErr)
}
}
// TestIntegration_PendingUploads_SizeCap_DBConstraintName pins the
// expected CHECK constraint name. Migration 20260519200000 DROP IF
// EXISTSes by the auto-generated name `pending_uploads_size_bytes_check`
// and re-adds the constraint under the same name. If a future schema
// edit renames the constraint, the next bump migration will silently
// no-op the DROP and leave the old ceiling in place — this test catches
// that drift.
func TestIntegration_PendingUploads_SizeCap_DBConstraintName(t *testing.T) {
conn := integrationDB_PendingUploads(t)
ctx := context.Background()
var checkClause string
err := conn.QueryRowContext(ctx, `
SELECT pg_get_constraintdef(c.oid)
FROM pg_constraint c
JOIN pg_class t ON t.oid = c.conrelid
WHERE t.relname = 'pending_uploads'
AND c.conname = 'pending_uploads_size_bytes_check'
`).Scan(&checkClause)
if err != nil {
t.Fatalf("lookup CHECK clause: %v", err)
}
if !strings.Contains(checkClause, "104857600") {
t.Errorf("size_bytes CHECK clause = %q; want it to contain 104857600 (100 MB)", checkClause)
}
if strings.Contains(checkClause, "26214400") {
t.Errorf("size_bytes CHECK clause = %q; must NOT contain 26214400 (the pre-bump 25 MB ceiling)", checkClause)
}
}
@@ -41,10 +41,17 @@ import (
)
// Per-file size cap. Mirrors workspace-side ingest_handler
// (workspace/internal_chat_uploads.py:198). Pinned at the DB level via
// the size_bytes CHECK constraint; this Go-side constant exists so the
// Put implementation can reject before round-tripping to Postgres.
const MaxFileBytes = 25 * 1024 * 1024
// (workspace/internal_chat_uploads.py:CHAT_UPLOAD_MAX_FILE_BYTES) and the
// push-mode chat upload cap (chat_files.go:chatUploadMaxBytes). Pinned at
// the DB level via the size_bytes CHECK constraint (currently
// 104857600 per migration 20260519200000_pending_uploads_bump_size_cap);
// this Go-side constant exists so the Put implementation can reject
// before round-tripping to Postgres.
//
// Kept consistent with push-mode (mc#1588) per CTO directive 2026-05-19.
// SSOT follow-up: GET /uploads/limits will let every surface read the
// live cap rather than each pinning its own copy.
const MaxFileBytes = 100 * 1024 * 1024
// ErrNotFound is returned by Get / MarkFetched / Ack when the row is
// absent. Callers turn this into HTTP 404. Treat acked + expired rows
@@ -0,0 +1,24 @@
-- 20260519200000_pending_uploads_bump_size_cap.down.sql
--
-- Restores the pre-bump 25 MB (26214400) CHECK on
-- pending_uploads.size_bytes.
--
-- DESTRUCTIVE on existing data: any row with size_bytes > 26214400
-- will cause the ALTER TABLE … ADD CONSTRAINT below to fail (Postgres
-- validates the new CHECK against every existing row before applying
-- it). That is the correct behaviour — silently dropping such rows
-- would lose evidence-of-receipt for files the workspace might still
-- pull. Operator running this rollback MUST first decide what to do
-- with any 25100 MB rows (sweep them out via the normal expires_at
-- path, or accept failed rollback + investigate).
--
-- In practice this DOWN is for migration-tool symmetry only; the
-- pending_uploads table has a 24h hard TTL via expires_at, so waiting
-- one cycle drains any oversized rows naturally.
ALTER TABLE pending_uploads
DROP CONSTRAINT IF EXISTS pending_uploads_size_bytes_check;
ALTER TABLE pending_uploads
ADD CONSTRAINT pending_uploads_size_bytes_check
CHECK (size_bytes > 0 AND size_bytes <= 26214400);
@@ -0,0 +1,56 @@
-- 20260519200000_pending_uploads_bump_size_cap.up.sql
--
-- Bumps the pending_uploads.size_bytes CHECK from 25 MB (26214400) to
-- 100 MB (104857600) so poll-mode (laptop-runtime) workspaces accept the
-- same per-file cap as push-mode (SaaS EC2 tenants).
--
-- Why this is a separate PR from mc#1588:
-- mc#1588 bumped push-mode 50→100 MB across canvas TS +
-- workspace-server Go (chat_files.go cap) + workspace Python ingest
-- (internal_chat_uploads.py) + nginx harness mirror. The poll-mode
-- staging path is a different surface — the DB CHECK on
-- pending_uploads + the Go staging constant
-- (pendinguploads.MaxFileBytes) + the workspace-side puller
-- (workspace/inbox_uploads.py MAX_FILE_BYTES). Bundling both in one
-- PR would have crossed two RFCs' worth of review surface and
-- slowed the urgent push-mode fix that landed for reno-stars.
--
-- Why 100 MB (not higher / lower):
-- Aligns with mc#1588 (chat_files.go chatUploadMaxBytes,
-- internal_chat_uploads.py CHAT_UPLOAD_MAX_FILE_BYTES). The CTO
-- directive 2026-05-19 was a one-line "make it consistent" — no
-- evidence of a load-bearing reason the laptop path needs to stay
-- smaller (the original 25 MB comment in inbox_uploads.py cited
-- "~5 Mbps × 40s" bandwidth math, but that's descriptive, not a
-- hard cap; the fetch timeout is independently tunable).
--
-- bump to 100MB to match push-mode cap per mc#1588; CANVAS_MIRROR:
-- canvas/src/components/tabs/chat/uploads.ts MAX_UPLOAD_BYTES
--
-- SSOT note: 100 MB is now duplicated FIVE times (canvas TS +
-- workspace-server Go + workspace Python ingest + workspace Python
-- pull + nginx harness + this DB CHECK). The proper fix is a
-- GET /uploads/limits endpoint so each surface reads a single source
-- — tracked separately per CTO follow-up (RFC referenced in
-- mc#1588 description).
--
-- Constraint name: the original CREATE TABLE wrote the CHECK inline
-- without an explicit name. Postgres auto-names it
-- `pending_uploads_size_bytes_check`. We DROP IF EXISTS that name
-- then re-add the constraint at the new ceiling. DROP IF EXISTS is
-- idempotent — re-running this migration after a partial failure
-- doesn't error.
--
-- Forward-only safety: ALTER TABLE … ADD CONSTRAINT … CHECK takes
-- an ACCESS EXCLUSIVE lock briefly while it validates every existing
-- row. At current row counts (<1k rows steady-state, 24h TTL) this
-- is sub-second. If pending_uploads ever grows into the millions,
-- switch to ADD CONSTRAINT … NOT VALID + VALIDATE CONSTRAINT in two
-- steps to keep the lock short.
ALTER TABLE pending_uploads
DROP CONSTRAINT IF EXISTS pending_uploads_size_bytes_check;
ALTER TABLE pending_uploads
ADD CONSTRAINT pending_uploads_size_bytes_check
CHECK (size_bytes > 0 AND size_bytes <= 104857600);
+22 -13
View File
@@ -57,26 +57,35 @@ logger = logging.getLogger(__name__)
# doesn't ship the in-container HTTP server).
CHAT_UPLOAD_DIR = "/workspace/.molecule/chat-uploads"
# Per-file safety net. The platform enforces 25 MB on the staging side,
# but a buggy or hostile platform response shouldn't be able to fill the
# workspace's disk — refuse to write more than this even if the response
# claims a larger Content-Length.
MAX_FILE_BYTES = 25 * 1024 * 1024
# Per-file safety net. The platform enforces 100 MB on the staging side
# (workspace-server migration 20260519200000_pending_uploads_bump_size_cap
# + pendinguploads.MaxFileBytes — bumped from 25 MB per CTO directive
# 2026-05-19 to match push-mode mc#1588), but a buggy or hostile
# platform response shouldn't be able to fill the workspace's disk —
# refuse to write more than this even if the response claims a larger
# Content-Length.
MAX_FILE_BYTES = 100 * 1024 * 1024
# Network deadline for the GET. Tuned for a 25 MB transfer over a
# reasonable consumer link (~5 Mbps gives ~40s for the full payload),
# plus headroom for TLS + platform auth. Aligned with inbox poller's
# 10s default for /activity calls — both are user-perceived latency.
DEFAULT_FETCH_TIMEOUT = 60.0
# Network deadline for the GET. Tuned for a 100 MB transfer over a
# reasonable consumer link (~5 Mbps gives ~160s for the full payload),
# plus headroom for TLS + platform auth. Scaled up from the original
# 60s (sized for 25 MB) when the per-file cap moved to 100 MB — a fixed
# 60s would fire BEFORE a legitimate slow uplink finished streaming, the
# same wrong-reason failure mc#1588 fixed on the canvas side (forensic
# a99ab0a1 reno-stars). Aligned with platform httpClient.Timeout (1200s
# in chat_files.go after mc#1588) — laptop pull side gets a smaller
# value because it's downstream of a fully-staged row, not a live
# multipart parse.
DEFAULT_FETCH_TIMEOUT = 240.0
# Concurrency cap for ``BatchFetcher``. Four workers is enough headroom
# for the realistic "user dragged 3-4 files into chat at once" case
# while bounding the platform's per-workspace fan-out. The cap matters
# because the platform's /content endpoint reads bytea from Postgres in
# a single round-trip per request — N workers = N concurrent DB reads
# of up to 25 MB each, so a higher cap could pressure platform memory
# without much UX win (network bandwidth is the bottleneck once the
# bytes are buffered).
# of up to 100 MB each (post-mc#1588 cap), so a higher cap could pressure
# platform memory without much UX win (network bandwidth is the
# bottleneck once the bytes are buffered).
DEFAULT_BATCH_FETCH_WORKERS = 4
# Upper bound on how long ``BatchFetcher.wait_all`` blocks the inbox