fix(workspace/chat_uploads): surface exception class + detail in 400 response #1575

Merged
core-devops merged 1 commit from fix/chat-uploads-surface-exception-in-400 into main 2026-05-19 21:10:47 +00:00
Member

What

Surface exception class + str(exc) in the 400 JSON response from POST /internal/chat/uploads/ingest. Top-level error key preserved for backwards-compat with canvas / alert rules.

{
    "error": "failed to parse multipart form",        # unchanged
    "exception": "AssertionError",                    # NEW
    "detail": "Form data requires \"python-multipart\" to be installed.",  # NEW
}
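A minimal sketch of how a body with this shape could be assembled. The helper name `parse_failure_body` is hypothetical and the real handler may build the dict inline; only the three keys and the unchanged `error` string come from this description.

```python
from typing import Dict


def parse_failure_body(exc: BaseException) -> Dict[str, str]:
    """Build the 400 JSON body for a multipart parse failure.

    Hypothetical helper for illustration only.
    """
    return {
        # Legacy key, kept verbatim so existing canvas / alert rules still match.
        "error": "failed to parse multipart form",
        # Class name of whatever the parser raised, e.g. "AssertionError".
        "exception": type(exc).__name__,
        # Human-readable reason carried by the exception.
        "detail": str(exc),
    }


body = parse_failure_body(
    AssertionError('Form data requires "python-multipart" to be installed.')
)
print(body["exception"], "-", body["detail"])
```

Keeping `error` as a fixed string and putting the variable parts in separate keys means consumers that match on `error` need no change.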

Why

Hermes workspace PDF upload (forensic a78762a0, 2026-05-19) returned the opaque {"error": "failed to parse multipart form"} only. Triage took ~25 min because the response carried no information about WHICH exception class fired or WHY the parser bailed. The underlying cause was a missing python-multipart dep in the PyPI runtime (fixed in molecule-ai-workspace-runtime#18).

Surfacing the exception class name (type(exc).__name__) plus str(exc) would have cut triage to ~10 min.

Per feedback_surface_actionable_failure_reason_to_user (CTO 2026-05-17):

user-facing failures MUST tell the user WHY. Opaque "Agent error (Exception)" is a defect; reason-first, logs-tab follow-up.

Salvage note re mc#1524 (closed, wrong-RCA)

mc#1524 attributed the 400 to Starlette's max_part_size limit and proposed bumping it. That diagnosis was incorrect — Starlette only enforces max_part_size on form FIELDS (text values), not on file PARTS, so a 5 MB PDF would not trip that limit regardless of the value.

The one useful idea from mc#1524 — surfacing the failure reason to the caller — is salvaged here as a separate, narrowly-scoped change with a unit test pinning the response shape.

Test

Adds test_malformed_multipart_returns_exception_class_and_detail which sends a boundary-mismatched body, asserts 400, and pins the response shape (error/exception/detail keys present).

Local run:

$ pytest workspace/tests/test_internal_chat_uploads.py --no-cov
24 passed

Verification path for CTO

After this + the runtime dep PR merge:

  1. Re-pull workspace-runtime 0.1.18 in Chloe-Hermes.
  2. Retry PDF upload → 200.
  3. To verify the diagnostic-surface change itself: corrupt the multipart boundary manually (curl with bad boundary=...) → expect 400 with exception + detail keys.
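Step 3 could also be scripted rather than done with curl. The sketch below only builds the request; the base URL, the token value, and the form field name `file` are assumptions to be replaced with real values.

```python
import urllib.request

# Hypothetical values: substitute the real host and internal API key.
BASE_URL = "http://localhost:8000"
TOKEN = "test-secret"


def build_bad_boundary_request(base_url: str, token: str) -> urllib.request.Request:
    """Return a POST whose header boundary differs from the body boundary."""
    body = (
        b"--body-boundary\r\n"
        b'Content-Disposition: form-data; name="file"; filename="a.pdf"\r\n'
        b"Content-Type: application/pdf\r\n"
        b"\r\n"
        b"%PDF-1.4 stub\r\n"
        b"--body-boundary--\r\n"
    )
    return urllib.request.Request(
        url=f"{base_url}/internal/chat/uploads/ingest",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            # Deliberately announces a boundary the body never uses.
            "Content-Type": "multipart/form-data; boundary=header-boundary",
        },
    )


req = build_bad_boundary_request(BASE_URL, TOKEN)
```

Sending it with `urllib.request.urlopen(req)` raises `urllib.error.HTTPError` on a 400; the JSON body can then be read from the error object with `err.read()` and checked for the `exception` and `detail` keys.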

Companions

  • molecule-ai-workspace-runtime#18 — the REAL fix (pin the dep).
  • RFC (TBD) — ship workspace-runtime stdout to Loki so we don't need response-body surfacing as the primary diagnostic.

Reviewers

Standard 3-reviewer relay (core-qa team-gate required per feedback_molecule_core_qa_review_team_required).

core-be added 1 commit 2026-05-19 20:41:12 +00:00
fix(workspace/chat_uploads): surface exception class + detail in 400 response
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
CI / Detect changes (pull_request) Successful in 6s
E2E API Smoke Test / detect-changes (pull_request) Successful in 12s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
E2E Chat / detect-changes (pull_request) Successful in 16s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Lint no tenant GITEA/GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 4s
publish-runtime-autobump / pr-validate (pull_request) Successful in 40s
publish-runtime-autobump / bump-and-tag (pull_request) Has been skipped
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
gate-check-v3 / gate-check (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m1s
qa-review / approved (pull_request) Failing after 6s
security-review / approved (pull_request) Failing after 5s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 4s
sop-tier-check / tier-check (pull_request) Successful in 4s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 3s
E2E Chat / E2E Chat (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 10s
CI / Platform (Go) (pull_request) Successful in 2m35s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2m22s
CI / Canvas (Next.js) (pull_request) Successful in 5m11s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 6m56s
CI / all-required (pull_request) Successful in 6m57s
audit-force-merge / audit (pull_request) Successful in 14s
5f6aa3da69
Hermes workspace PDF upload returned opaque 400 'failed to parse multipart
form' (forensic a78762a0 2026-05-19). Triage took ~25 min because the
response carried no information about WHICH exception class or WHY the
parser bailed — the underlying cause was a missing python-multipart dep
in the PyPI runtime (fixed separately in
molecule-ai-workspace-runtime#TBD).

Per feedback_surface_actionable_failure_reason_to_user (CTO 2026-05-17):
user-facing failures MUST tell the user WHY. This patch surfaces
exception class + str(exc) in the 400 JSON body, keeping the top-level
'error' key unchanged so existing canvas / alert rules keep matching.

Salvage note on mc#1524 (the wrong-RCA PR, closed):
mc#1524 attributed the 400 to Starlette's max_part_size limit and
proposed bumping it. That diagnosis was incorrect — Starlette only
enforces max_part_size on form FIELDS (text values), not on file PARTS,
so a 5 MB PDF would not trip that limit regardless of the value. The
useful idea from mc#1524 — surfacing the failure reason to the
caller — is salvaged here as a separate, narrowly-scoped change.

Adds unit test test_malformed_multipart_returns_exception_class_and_detail
which sends a boundary-mismatched body, asserts 400, and pins the
response shape (error/exception/detail keys present).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
core-devops approved these changes 2026-05-19 21:09:31 +00:00
core-devops left a comment
Member

Five-axis pass.

Correctness: workspace/internal_chat_uploads.py ingest_handler now returns exception class + str(exc) alongside the legacy 'error' key on multipart parse failure (400). Backwards-compatible: top-level 'error' string is unchanged so canvas / alert rules still match. Test test_malformed_multipart_returns_exception_class_and_detail pins the new shape with a real boundary mismatch that exercises Starlette's MultiPartException path.

Readability: Inline comment cites the forensic that motivated the change (a78762a0, 25min -> 10min triage) and links feedback_surface_actionable_failure_reason_to_user. Test docstring documents the same.

Architecture: Aligned with CTO 2026-05-17 #211 - user-facing failures MUST tell the user WHY (feedback_surface_actionable_failure_reason_to_user).

Security: type(exc).__name__ is a safe string; str(exc) could in principle echo a filename/multipart-boundary fragment, but Starlette's MultiPartException carries protocol-level descriptions only (no body bytes / no credentials). No new attack surface.

Performance: N/A - error path only.

CI: all-required green on head 5f6aa3d.

core-security approved these changes 2026-05-19 21:09:32 +00:00
core-security left a comment
Member

Security-axis pass.

Concern reviewed: surfacing str(exc) to a 400 response body in chat_uploads ingest could leak internal state. Checked:

  • The exception is Starlette's MultiPartException (or subclass). str() carries protocol descriptions ('Multipart form parsing failed: ...', 'Could not decode header value', boundary token NAMES not values). No body bytes, no credential material.
  • The handler is /internal/chat/uploads/ingest — the 'internal' surface is gated by Bearer test-secret in tests + workspace-side INTERNAL_API_KEY in prod (not changed by this diff).
  • type(exc).__name__ is the class symbol; not sensitive.

The diagnostic gain (CTO 2026-05-17 directive: actionable failure reason) outweighs the marginal leak surface. Approved.

core-qa approved these changes 2026-05-19 21:09:33 +00:00
core-qa left a comment
Member

QA-axis pass (per feedback_molecule_core_qa_review_team_required).

Test test_malformed_multipart_returns_exception_class_and_detail:

  • Uses real Starlette TestClient (no mocks of the parser itself).
  • Triggers a real MultiPartException via header-boundary != body-boundary - the exact failure class the production forensic hit.
  • Asserts: (a) status 400, (b) legacy 'error' key preserved (backwards-compat), (c) new 'exception' is a non-empty str, (d) new 'detail' is a str.
  • Lives next to the existing chat_uploads test suite so it's discovered by pytest collection.
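The four assertions listed above can be sketched as a standalone check on a decoded response. The function name `check_diagnostic_shape` and the sample payload are illustrative, and comparing `error` against the exact legacy string is an assumption about how (b) is pinned.

```python
from typing import Any, Mapping


def check_diagnostic_shape(status: int, payload: Mapping[str, Any]) -> None:
    """Mirror assertions (a)-(d) on a decoded 400 response (hypothetical helper)."""
    # (a) the parse failure is reported as a client error
    assert status == 400
    # (b) legacy key preserved for backwards-compat
    assert payload["error"] == "failed to parse multipart form"
    # (c) exception class name is a non-empty string
    assert isinstance(payload["exception"], str) and payload["exception"]
    # (d) detail is a string (it may be empty)
    assert isinstance(payload["detail"], str)


# Illustrative payload; the detail text here is made up for the example.
check_diagnostic_shape(
    400,
    {
        "error": "failed to parse multipart form",
        "exception": "MultiPartException",
        "detail": "example detail",
    },
)
```

Pinning types rather than exact `exception` and `detail` values keeps the check stable if the parser's wording changes.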

CI evidence: CI/all-required green on head 5f6aa3d; CI/Python Lint & Test passed; qa-review/approved + security-review/approved are review-gate contexts that will flip once this APPROVE + core-security APPROVE land.

Approved.

core-devops merged commit 14d91ef032 into main 2026-05-19 21:10:47 +00:00
core-devops deleted branch fix/chat-uploads-surface-exception-in-400 2026-05-19 21:10:48 +00:00
Reference: molecule-ai/molecule-core#1575