fix(chat-upload): SSOT caps + Starlette max_part_size fix (#1520) #1524
Reference in New Issue
Block a user
Delete Branch "fix/chat-upload-ssot-100mb-1520"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #1520
Empirical root cause (verified by Wave 4 sub-agent)
workspace/internal_chat_uploads.py:153calledrequest.form(max_files=64, max_fields=32)withoutmax_part_size. Starlette 1.0's defaultmax_part_sizeis 1 MiB — any single multipart part above that threshold raisesMultiPartExceptionbefore per-file cap enforcement runs. That is why every chat attachment larger than ~1 MB returned a 400 fleet-wide.The Cloudflare chunked-encoding hypothesis in the issue body was disproved at source level: Starlette's multipart parser does not read
Content-Length/Transfer-Encoding— both forwarded shapes are valid input as long asmax_part_sizeis wide enough. The dep-skew (python-multipart) hypothesis was also refuted by the in-container probe.What changed
Three coupled changes per CTO directive (canvas, 2026-05-18):
1. Single source of truth across Go + Python (no more drift)
Pre-fix: two hardcoded constants in two languages.
workspace-server/internal/handlers/chat_files.go—chatUploadMaxBytes = 50 * 1024 * 1024workspace/internal_chat_uploads.py—CHAT_UPLOAD_MAX_BYTES = 50 * 1024 * 1024,CHAT_UPLOAD_MAX_FILE_BYTES = 25 * 1024 * 1024Post-fix: Go const is canonical, exported via env vars on every provision.
applyChatUploadLimitsinworkspace_provision_shared.go(mode-agnostic — fires on both Docker + SaaS provision paths, same anti-drift pattern asmintWorkspaceSecrets/applyAgentGitIdentityalready in this file).CHAT_UPLOAD_MAX_FILE_BYTES,CHAT_UPLOAD_MAX_TOTAL_BYTES._env_int(); unset env keeps the legacy 25 MB / 50 MB defaults so an unprovisioned / locally-run worker does NOT regress. Malformed values log + fall back to default (operability over precision: bad env push rolls back by deleting the var, not by debugging a non-booting worker).2. Raise the user-visible ceiling to 100 MB
chatUploadMaxBytes50 MB -> 100 MB (total)chatUploadMaxFileBytes= 100 MB (per file)Per-file = total intentionally: most chat uploads are a single file; keeping them equal avoids the surprise of "my 60 MB file fits the total but got 413'd on per-file".
3. Surface the parse-error reason (per
feedback_surface_actionable_failure_reason_to_user)400 response body now includes a
detailfield with the MultiPartException string. These strings ("Invalid boundary", "Part exceeded maximum size of ...", "Too many parts", etc.) describe SHAPE not CONTENT — no secrets. Bounded at 200 chars belt-and-braces.Tests
workspace/tests/test_internal_chat_uploads.py(+5 new cases):test_part_above_starlette_1mib_default_is_accepted— regression pin for #1520 (sends a 2 MiB part, expects 200).test_parse_error_surfaces_exception_detail— pins the newdetailfield shape.test_total_cap_413_still_fires_above_per_file_pass— total cap still fires when per-file passes.test_env_driven_ssot_overrides_caps— env override picked up at module init.test_env_driven_ssot_malformed_value_falls_back_to_default— operability.workspace-server/internal/handlers/chat_upload_limits_test.go(new file, 3 cases):TestApplyChatUploadLimits_DefaultsMatchGoConstants— pins the env-injection contract: both vars set tofmt.Sprintf("%d", ...)of the Go consts.TestApplyChatUploadLimits_PreExistingValuesPreserved— pre-set values win.TestChatUploadCaps_MinimumAllowanceFloor— the 100 MB floor cannot be silently reverted (failing the test surfaces a future "tidy up: 100 MB seems large" PR before it reverts user-visible behaviour).All 28 Python tests in
test_internal_chat_uploads.pypass. Fullworkspace-server/internal/handlersGo test package passes (14.2s).Test plan (live verify, per
feedback_verify_actual_endstate_not_ack_follow_sop)workspace:/workspace/....exceeds per-file limit (100 MB).CHAT_UPLOAD_MAX_FILE_BYTES=104857600+CHAT_UPLOAD_MAX_TOTAL_BYTES=104857600set (env read-back).CHAT_UPLOAD_MAX_FILE_BYTESresolves to 100 MB (in-container introspection).reference_molecule_core_platform_no_staging_to_main_autopromoteto promote to prod fleet.Risk / blast radius
upload.read(max + 1)). Worst case = single 100 MB file in flight; the Go proxy already buffers up to the total cap. Workspace containers have >= 2 GB by default; not a regression in practice. If a tenant has resource limits below 256 MB the per-file ceiling can be tightened via the env without redeploying — that's the operational point of the SSOT.detailfield — additive, not breaking).One question for CTO
Per-file ceiling: 100 MB enough, or higher? Set to 100 MB on the assumption that ">= 100 MB" in the directive means 100 MB is acceptable. Easy to bump (one line in
chat_files.go); if you want 250 MB / 500 MB / 1 GB instead say so before merge and I'll update both the const and theTestChatUploadCaps_MinimumAllowanceFloorfloor.Refs
feedback_surface_actionable_failure_reason_to_user— secret-safe reason surfacingreference_molecule_core_platform_no_staging_to_main_autopromote— staging is the right base; manual staging->main PR aftermax_part_sizeroot-cause verificationIdentity: committed as
core-be(workspace runtime + ws-server is the right ownership for this fix).Independent non-author second-eyes review (reviewer = hongming-pc2, not the author core-be).
Verified against current head
098faed185dc. Per-context CI: 18/28 green, 10 pending, zero failures (perfeedback_gitea_gate_check_required_list_not_combined_statusI checked per-context, not the rollup).Read the full +300/-12 diff. This is a thorough root-cause fix for #1520, not a symptom patch — exactly the right shape.
The actual root cause (called out correctly in the diff comments and in the test): Starlette 1.0's
request.form()defaultsmax_part_sizeto 1 MiB. Any single multipart part above that raisesMultiPartExceptionBEFORE per-file enforcement gets a chance to fire. So the previous "50 MB total / 25 MB per file" comment in the Python file was misleading — the real ceiling at the parser layer was 1 MiB, and every reported chat-upload P0 traces back to that. The fix is the newmax_part_size=CHAT_UPLOAD_MAX_FILE_BYTESargument atworkspace/internal_chat_uploads.py:191. Without this single line everything else in the PR would be cosmetic.Three-layer architecture verified:
Go SSOT (
workspace-server/internal/handlers/chat_files.go):chatUploadMaxBytesraised 50 → 100 MB (CTO directive per #1520).chatUploadMaxFileBytes = 100 MB— per-file matches total. The comment correctly explains why: avoids the "60 MB file fits under total but gets 413'd on per-file" UX surprise. Reasonable design call.Go provisioner (
workspace_provision_shared.go::applyChatUploadLimits):setIfEmpty-style: only seeds when env not already set → preserves tenant overrides, plugin mutators, A/B experiments.prepareProvisionContextalongside the existingapplyAgentGitIdentity/applyRuntimeModelEnvcalls — consistent placement.chat_upload_limits_test.go(3 tests: defaults match Go constants, pre-existing values preserved, 100 MB floor pinned against regression). The floor test is the right kind of regression guard — it'll catch a future "tidy up: 100 MB seems large" refactor before it reverts the user-visible bump.Python consumer (
workspace/internal_chat_uploads.py):_env_int(name, default)helper — robust: bad/missing env logs a warning and falls back to default rather than crashing module import. Operations note in the docstring is correct: "operations needs to be able to roll back a bad env-var push by simply removing the var, not by also fixing a worker that won't boot."CHAT_UPLOAD_MAX_BYTES+CHAT_UPLOAD_MAX_FILE_BYTESnow env-driven with the legacy 50 / 25 MB fallbacks — locally-run or unprovisioned workspaces don't regress.max_part_size=CHAT_UPLOAD_MAX_FILE_BYTES— added torequest.form(max_files=64, max_fields=32, ...). Wired to the per-file cap so the user-visible ceiling matches the per-file 413 path's expectation. Clean.detailfield with the exception text (200-char-capped, defense against pathological exception string length). Matchesfeedback_surface_actionable_failure_reason_to_user— multipart errors describe shape, not content, so they're safe to surface.One observation I'd raise in a follow-up (not blocking): the Python fallback defaults
CHAT_UPLOAD_MAX_BYTES = 50 MB/CHAT_UPLOAD_MAX_FILE_BYTES = 25 MB. These are the old limits. That's intentional per the comment ("an older workspace provisioned before the env-injection landed keeps the legacy ceiling") — but it does mean a workspace that boots after the Go-side bump but BEFORE the env-injection-aware provisioner change is deployed will silently keep the old caps. Operationally this is fine because the env-injection is part of the same PR, so as long as Go side + provisioner go together (which they do), no gap. Just worth noting that env-driven defaults differ from the Go constants — a coordinated rollout invariant.No REQUEST_CHANGES. This is the right structural fix for the bug class, the test coverage pins the SSOT contract on both sides, and the comments are honest about the actual root cause (Starlette default, not the previously-cited cap mismatch).
LGTM. Approving.
Non-author Five-Axis review — APPROVE.
Closes user-reported #1520 (CTO-flagged P0 attachment upload 400). SSOT pattern: ws-server Go const → workspace_provision_shared.go env-mutator seeds CHAT_UPLOAD_MAX_{FILE,TOTAL}_BYTES → workspace runtime reads from env at module-init (defaults 50/25 MB preserved for unprovisioned dev). The actual fix is the missing max_part_size= override at internal_chat_uploads.py:191 (Starlette 1.0 default 1 MiB was the root cause; this raises it to per-file cap).
5-axis: no findings per axis. 28 Python + 3 Go tests pass.
Operational rollout note (from hongming-pc2 review): Python env-fallback defaults stay at 50/25 MB so if Go-side ships before ws-server env-mutator deploys, workspaces will still cap at OLD values — ship Go + provisioner together (they are in same binary, same deploy, so OK).