fix(uploads): bump poll-mode size_bytes CHECK to 100MB to match push-mode (mc#1588) #1589
Reference in New Issue
Block a user
Delete Branch "fix/poll-mode-pending-uploads-100mb-mc1588"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
CTO directive 2026-05-19 task #295: poll-mode (laptop-runtime workspaces) had a 25 MB CHECK constraint on
pending_uploads.size_bytesthat diverged from the push-mode 100 MB cap mc#1588 is landing for SaaS EC2 tenants (reno-stars forensic a99ab0a1). This PR closes the poll-mode gap so external runtime workspaces pulling chat-attached files via the queue accept the same per-file ceiling as push-mode.20260519200000_pending_uploads_bump_size_cap.{up,down}.sql(104857600 ↔ 26214400, idempotent DROP IF EXISTS / re-add)pendinguploads.MaxFileBytes25→100 MB (mirrors the DB CHECK; pre-DB guard)workspace/inbox_uploads.pyMAX_FILE_BYTES 25→100 MB, DEFAULT_FETCH_TIMEOUT 60→240 s (laptop pull side; the longer timeout matches the new transfer ceiling and avoids the wrong-reason timeout failure mc#1588 fixed canvas-side for a99ab0a1)t.Skipbecause post-bumpper_file_cap == body_cap(chatUploadMaxBytes), so the body MaxBytesReader 400s before the per-file 413 branch is reachable in single-file uploads. Storage-level atomicity remains pinned by the integration test against real Postgres.Coupling with mc#1588
mc#1588 is open, not yet merged. This PR touches a disjoint file set from mc#1588 — no merge conflict expected. Suggested merge order: mc#1588 first (it owns chatUploadMaxBytes + canvas + nginx), then this PR. If this PR lands first, the poll-mode DB CHECK is at 100 MB while the body cap is still at 50 MB — clients hit the body cap (400) before the per-file cap (413), so user-visible behavior is unchanged; landing both in either order is safe.
SSOT note
After this PR + mc#1588 the 100 MB constant lives in FIVE mirror sites: canvas TS + workspace-server Go (chat_files.go body cap via mc#1588 + storage.go per-file cap via this PR) + workspace Python ingest (mc#1588) + workspace Python pull (this PR) + nginx harness (mc#1588) + this DB CHECK. The proper fix is
GET /uploads/limitsso each surface reads a single source — RFC filed by a0d62036, referenced in mc#1588's description. Perfeedback_no_single_source_of_truth, every cap change between now and that RFC landing must touch all five surfaces in lockstep.Test plan
go test ./workspace-server/... -count=1 -short— all 30+ packages greenpython3 -m pytest workspace/tests/test_inbox_uploads.py --no-cov— 81/81 passinggo build ./...— clean compileworkspace-server/migrations/**change → triggers automatically)SELECT pg_get_constraintdef(c.oid) FROM pg_constraint c JOIN pg_class t ON t.oid=c.conrelid WHERE t.relname='pending_uploads' AND c.conname='pending_uploads_size_bytes_check'returnsCHECK ((size_bytes > 0) AND (size_bytes <= 104857600))Reviewers
Per
feedback_route_approvals_to_team_personas_not_orchestrator_sub_agents:core-devops— migration / CI lens (forward-only ALTER, ACCESS EXCLUSIVE lock posture)core-qa— test discipline lens (the two new skips have a re-enable condition; the new integration tests pin DB-level invariants beyond what unit tests catch)References
feedback_no_single_source_of_truth— the SSOT discipline this bump duplicates against, pending the/uploads/limitsfixCo-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Independent non-author review (reviewer = core-qa of engineers team, author = core-be).
Verified against current head
eeaffa275b0d.CI / all-required (pull_request)= success per BPstatus_check_contexts; non-required E2E flakes are unrelated chronic-red class.Test-coverage angle (my lane). Read the full +281/-9 diff. Three load-bearing pins land here:
TestIntegration_PendingUploads_SizeCap_100MB(new,pending_uploads_integration_test.go:672). Four assertions in one test, each catches a distinct regression class:size_bytesround-trip pinned at exactlylen(fiftyMB)). Catches a regression that puts the cap back at 25 MB.MaxFileBytesexact, 104857600) Put succeeds. Pins the CHECK clause as<=not<— quietly off-by-one in a future migration would fail this.MaxFileBytes+1) Put returnspendinguploads.ErrTooLarge(pre-DB guard fires). Confirms the Go-side reject happens before the network round-trip.INSERT INTO pending_uploads ... size_bytes = MaxFileBytes+1must fail with SQLSTATE 23514 + the constraint name in the error message. This is the load-bearing belt-and-suspenders: it catches a future code path (background importer, manual SQL, ORM bypass) that inserts straight to the table without going through Put. Without this assertion, the Go-side guard could rot silently while the DB CHECK is the only true gate.TestIntegration_PendingUploads_SizeCap_DBConstraintName(new,pending_uploads_integration_test.go:752). Usespg_get_constraintdef(c.oid)on the constraint namedpending_uploads_size_bytes_checkand asserts the returned CHECK clause:104857600(positive — post-bump ceiling present)26214400(negative — pre-bump ceiling absent)This is the test that pins the schema/migration contract — if a future schema edit renames the constraint, the next bump migration's
DROP IF EXISTS pending_uploads_size_bytes_checksilently no-ops, leaving the old ceiling in place. Without this pin, a future cap bump could ship with a working test suite that doesn't actually reach the DB. The pg_get_constraintdef approach is exactly the right depth — querying the catalog directly, not the constraint definition's text representation.Skipped tests (
TestPollUpload_PerFileCapPreStorage_413andTestPollUpload_AtomicRollbackOnSecondFileTooLarge): the diff doesn't delete them — it adds an explicitt.Skipfgated onpendinguploads.MaxFileBytes >= chatUploadMaxByteswith a comment explaining exactly when the test becomes reachable again (when body cap exceeds per-file cap, presumably via the GET /uploads/limits SSOT follow-up). This is the right call — preserves the test's intent + structural precondition, makes the skip visible in test output, and the comment cites the integration-level coverage that picks up the slack (TestIntegration_PendingUploads_PutBatch_RollsBackOnAnyError). Future-readers will know precisely why these are skipped and what to do.On
-shortmode handling:TestIntegration_PendingUploads_SizeCap_100MBcorrectly skips undertesting.Short()because allocating 50 MB + 100 MB + 100 MB ≈ 250 MB of test fixtures isn't appropriate for the quick-feedback loop. Pre-existing convention in this test file follows the same pattern — consistent.Test isolation: the new tests use
integrationDB_PendingUploads(t)which sets up per-test DB state; no shared mutable state between tests. Won't flake under-pparallelism the way mock-driven tests can. Consistent withfeedback_no_such_thing_as_flakes— every assertion is grounded.No structural concerns. No REQUEST_CHANGES. LGTM, approving.
(Cross-ref: sibling PR #1588's push-mode 100MB raise was approved + merged with the parallel size-cap test pattern; this PR completes the symmetry for poll-mode.)
Independent non-author review (reviewer = core-security of engineers team, author = core-be).
Verified against current head
eeaffa275b0d.CI / all-required (pull_request)= success. Security angle (my lane) is the resource-exhaustion + DB-layer-defense surface this change touches.Cap raise 25 → 100 MB — resource analysis:
DEFAULT_BATCH_FETCH_WORKERS = 4(unchanged) × 100 MB = 400 MB peak per workspace pulling concurrently. Previously 4 × 25 = 100 MB. 4× headroom growth, bounded; not unbounded. The diff comment onDEFAULT_BATCH_FETCH_WORKERScorrectly updated the math reasoning to reflect the new 100 MB-per-worker ceiling (inbox_uploads.py:81). Defensible.chat_files.goenforces body cap viahttp.MaxBytesReader(per the mc#1588 lockstep raise to 100 MB referenced in the inline comment). Body cap and per-file cap are now equal; a multi-file upload can fit ≤100 MB total across N files but NOT a single 100 MB + small file (body reader 400s first). This is documented in the comment header on the skipped tests — explicit, not silent.CHECK size_bytes <= 104857600— Postgres TOAST handles up to 1 GB per row by default; 100 MB is well within. No vacuum/replication concern at current row counts (24h TTL, <1k rows steady-state per migration comment).No new ingress surface: the cap raise doesn't introduce a new upload endpoint, doesn't relax auth/CSRF/origin checks, doesn't change the workspace-owner-isolation invariant on
pending_uploads.workspace_id. ThePutAPI + the existing handler-side path are unchanged in shape; only the constant is bigger.DB-layer defense intact:
pending_uploads_size_bytes_checkconstraint re-applied at 104857600.Putdoes pre-DB rejection at the Go layer (MaxFileBytes+1 → ErrTooLarge) AND the CHECK enforces server-side. Defense in depth preserved.TestIntegration_PendingUploads_SizeCap_100MBincludes the raw INSERT bypass case — directly insertssize_bytes = MaxFileBytes+1and asserts SQLSTATE 23514 with the constraint name in the error. This is the load-bearing assertion against a future code path that inserts without going throughPut(background importer, hand-written SQL, ORM bypass). Without this, the Go-side guard could rot silently while the DB CHECK becomes the only true gate — and any constraint-name drift would silently kill the gate. The pin defends both directions.Migration safety (UP):
ALTER TABLE … DROP CONSTRAINT IF EXISTS … ADD CONSTRAINT … CHECK …is atomic within a single transaction. The DROP IF EXISTS is correctly idempotent (re-running this migration after a partial failure doesn't error). ACCESS EXCLUSIVE lock window noted in the migration comment, justified at current scale, with the NOT VALID + VALIDATE CONSTRAINT escape hatch documented for future millions-of-rows scenarios. Forward-only safety good.Migration safety (DOWN):
size_bytes > 26214400will cause theADD CONSTRAINTre-validate to fail. The migration comment correctly notes this is the right behavior — silent row-drop would lose evidence-of-receipt for files the workspace might still pull. Lists two operator paths: drain via the natural 24h TTL, or accept failed rollback + investigate. Both correct, neither silent.Workspace-side safety net (
inbox_uploads.py:MAX_FILE_BYTES = 100 MB):DEFAULT_FETCH_TIMEOUT = 240.0(was 60.0) — scaled with the cap, prevents wrong-reason timeouts on slow legitimate uplinks. Reasoning grounded in the actual transfer math (~5 Mbps × ~160s for 100 MB + auth/TLS headroom). The diff comment cites the wrong-reason failure mc#1588 fixed on the canvas side. Good defensive choice.No new attack surface, no secret/auth/CSRF changes, no path traversal vector. The cap raise itself is the entire change.
SSOT five-pin note: the migration comment acknowledges 100 MB is now duplicated across canvas TS + workspace-server Go + workspace Python ingest + workspace Python pull + nginx + DB CHECK. This is a tracked technical debt (the
GET /uploads/limitsfollow-up endpoint), not a security issue — but worth noting that future drift between these copies is the kind of failure that re-introduces the original "raised everywhere except poll-mode" gap. A drift catcher would close the loop cleanly.No REQUEST_CHANGES. LGTM, approving.
Independent non-author review (reviewer = core-devops of engineers team, author = core-be).
Verified against current head
eeaffa275b0d.CI / all-required (pull_request)= success on the BPstatus_check_contextsrequired gate.Operational angle (my lane): migration safety, ordering, rollout, observability.
Migration filename + ordering:
20260519200000_pending_uploads_bump_size_cap.{up,down}.sql— timestamp2026-05-19 20:00:00(yyyymmddHHMMSS). Confirmed chronologically AFTER mc#1588's push-mode raise migrations and BEFORE today's wall clock — proper forward ordering. Migration-tool will pick it up on the nextupdeploy. No collisions with other in-flight migrations in the repo (per repo-tree inspection — this is the only20260519*inworkspace-server/migrations/).UP path is right:
DROP CONSTRAINT IF EXISTSis idempotent — a partial-failure rerun doesn't error. Correct shape for a non-trivial DDL migration.size_bytes > 0invariant (no zero-byte rows) while widening the ceiling. Pre-existing positive constraint is preserved — not silently relaxed.<1k rows steady-state, 24h TTLper the comment) the lock window is sub-second. The header also documents the NOT VALID + VALIDATE CONSTRAINT escape hatch for when this table grows into the millions — that's the right future-proofing note for an ops-pager runbook, not load-bearing today.DOWN path is honest about consequences:
Documents the destructive-on-existing-data behavior: any row with
size_bytes > 26214400will cause the re-ADD to fail. This is the correct behavior — silent row-drop would lose evidence-of-receipt for files the workspace might still pull.The DOWN comment lists two operator paths:
expires_atto drain 25–100 MB rowsBoth are correct, neither is silent. This is exactly the right runbook for a destructive-DOWN — surface the cost, don't hide it.
Header acknowledges this is for migration-tool symmetry only; in practice the table self-drains. Honest framing.
No need for a NOT NULL/DEFAULT dance (cap raise doesn't change column nullability or default). No data migration step required — the column type is unchanged.
No replication / standby concern: ALTER TABLE DDL replicates cleanly; no logical-replication corner case (Postgres' logical decoding doesn't care about CHECK constraints, only column shape).
Rollout posture for the deploy that lands this:
MaxFileBytes = 100 * 1024 * 1024; old pods that linger briefly still enforce the 25 MB Go-side cap (reject atPut) but the DB widening doesn't affect them. No skew window where ≤100 MB requests can land in a transient bad state.workspace/inbox_uploads.py) MAX_FILE_BYTES bump 25 → 100 MB + DEFAULT_FETCH_TIMEOUT 60 → 240s is in the same PR. Customer workspaces pulling pending uploads pick up both at the next image rebuild + restart. The fetch-timeout scale-up is the right operational hygiene — a 60s timeout against a 100 MB payload over a slow link would fire before the legitimate transfer completes (sized for ~5 Mbps × 160s + headroom, well-documented in the comment).chat_files.gobody cap is unchanged on this side (mc#1588 already raised that). The poll-mode side of the chat upload path doesn't go through the same nginx ingress paths as push-mode.Observability angle: the existing INFO log line in
chat_files.go("every persisted file logs an INFO line with workspace_id, mimetype, sha256, sanitized_filename, size_bytes") gives operators per-file size visibility — Grafana / Loki query onsize_bytes >= 26214400 AND size_bytes <= 104857600post-deploy will show the previously-rejected file-sizes now succeeding. That's the cheap empirical signal the rollout actually worked. No new metric required.SSOT note (the migration comment acknowledges 100 MB is now duplicated across canvas TS + workspace-server Go + workspace Python ingest + workspace Python pull + nginx + DB CHECK): tracked as the
GET /uploads/limitsfollow-up. From an ops angle this is the kind of thing where a future drift catcher (compare-against-a-checked-in-SSOT cron) would catch the next bump going stale on one of the five surfaces, before it pages someone. Not blocking this PR — it's the load-bearing fix for a real upload that's stuck right now.No REQUEST_CHANGES. LGTM, approving.
/qa-recheck
/security-recheck
eeaffa275bto59180319ceRe-approval after rebase on main (head now
59180319).Scope unchanged from my prior APPROVED review on
eeaffa275b. The only diff between the two heads is a single doc-comment hunk inworkspace-server/internal/handlers/chat_files.gothat's been dropped — the prior PR was rewriting the limits-summary block to "Total body cap: 50 MB → bumped to 100 MB by mc#1588", which collided with main once mc#1588 landed (main now has the simpler "Total body cap: 100 MB" inline). Taking main's hunk during rebase keeps the docstring crisp.Substantive surface preserved unchanged (the test-coverage angle I reviewed previously is identical):
TestIntegration_PendingUploads_SizeCap_100MB— four assertions: 50 MB succeeds, at-capMaxFileBytessucceeds,+1rejected byPutasErrTooLarge, raw INSERT bypass rejected by CHECK with SQLSTATE 23514.TestIntegration_PendingUploads_SizeCap_DBConstraintName—pg_get_constraintdefpin contains104857600, must not contain26214400.PerFileCapPreStorage_413,AtomicRollbackOnSecondFileTooLarge) — explicitt.Skipfgated onMaxFileBytes >= chatUploadMaxByteswith re-enable condition documented.-shortmode handling preserved.Prior five-axis review applies. LGTM, re-approving.
Re-approval after rebase on main (head now
59180319).Scope unchanged from my prior APPROVED review on
eeaffa275b. The rebase drops a single doc-comment hunk inchat_files.gomade redundant by mc#1588 landing the simpler "Total body cap: 100 MB" docstring first. Substantive surface is identical.Security analysis from the prior review still holds (no new surface introduced by the rebase):
http.MaxBytesReaderbody cap.Putpre-DB guard + DB CHECK + raw-INSERT-bypass integration test pin).inbox_uploads.pyMAX_FILE_BYTES safety net raised in lockstep;DEFAULT_FETCH_TIMEOUTscaled to match the transfer math.No security regression. LGTM, re-approving.
Re-approval after rebase on main (head now
59180319).Scope unchanged from my prior APPROVED review on
eeaffa275b. Rebase drops only a redundantchat_files.godoc-comment hunk superseded by mc#1588's simpler version landing first in main. Substantive surface — migration up/down, storage.go constant, workspace puller python — unchanged.Operational analysis from prior review preserved (no rollout-shape change):
20260519200000_pending_uploads_bump_size_cap.{up,down}.sqlordering still chronologically clean (post-mc#1588's push-mode migrations, pre-any-newer migration in the directory).DROP CONSTRAINT IF EXISTS … ADD CONSTRAINT … CHECK (size_bytes > 0 AND size_bytes <= 104857600)— partial-failure rerun doesn't error. Single-transaction atomic.inbox_uploads.pyMAX_FILE_BYTES + DEFAULT_FETCH_TIMEOUT (60→240s) in lockstep — observability via existing INFO chat_file log unchanged.Rollout posture for the deploy that lands this is the same as prior review (migration applies before app rollout; no skew window). LGTM, re-approving.