The deadline contract was incomplete: wait_all logged the timeout but
close() then called executor.shutdown(wait=True), which blocked on
the leaked workers — undoing the user-facing timeout. The inbox poll
loop would stall indefinitely on a hung /content fetch instead of
returning to chat-message processing.
Fix: wait_all now flips self._timed_out and cancels queued (not-yet-
started) futures; close() reads that flag and switches to
shutdown(wait=False, cancel_futures=True) on the timeout path.
Currently-running workers can't be interrupted under Python's
threading model, but they're now detached daemons whose blocking
httpx call no longer gates the next poll.
The healthy path (no timeout) keeps the existing drain-and-wait so a
still-queued ack POST isn't dropped mid-write.
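A minimal sketch of the contract (only wait_all, close, _timed_out,
and the shutdown flags come from this commit; the attribute names and
everything else are assumptions):

    import logging
    from concurrent.futures import ThreadPoolExecutor, wait

    log = logging.getLogger(__name__)

    class BatchFetcher:  # sketch: deadline-contract methods only
        def __init__(self, max_workers=4):
            self._executor = ThreadPoolExecutor(max_workers=max_workers)
            self._futures = []
            self._timed_out = False

        def wait_all(self, timeout=None):
            _done, not_done = wait(self._futures, timeout=timeout)
            if not_done:
                log.warning("timed out; %d fetches busy", len(not_done))
                self._timed_out = True
                for fut in not_done:
                    fut.cancel()  # only drops futures not yet started
                return False
            return True

        def close(self):
            if self._timed_out:
                # Timeout path: don't block on hung workers.
                self._executor.shutdown(wait=False, cancel_futures=True)
            else:
                # Healthy path: drain so a queued ack POST completes.
                self._executor.shutdown(wait=True)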
Two new tests pin both legs of the contract end-to-end (the first is
sketched after the list):
- close-after-timeout-doesn't-block: hung worker, wait_all(0.05s)
fires the timeout, close() returns in <1s instead of waiting ~5s
for the worker to come back.
- close-without-timeout-still-drains: 2 slow workers, wait_all
completes cleanly, close() drains both ack POSTs.
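A hypothetical shape of the first test; the private-attribute access
and wait_all's boolean return are assumptions:

    import threading
    import time

    def test_close_after_timeout_does_not_block():
        release = threading.Event()
        fetcher = BatchFetcher(max_workers=1)
        # stand-in for a hung /content fetch; the real test stubs httpx
        fetcher._futures.append(fetcher._executor.submit(release.wait))
        assert fetcher.wait_all(timeout=0.05) is False  # timeout fires
        start = time.monotonic()
        fetcher.close()                 # must not wait for the worker
        assert time.monotonic() - start < 1.0
        release.set()                   # let the detached worker exit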
Resolves the BatchFetcher timeout-cancellation finding from the
post-merge five-axis review of Phase 5b.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the two remaining findings from the Phase 1-4 retrospective
review (the Python-side counterparts to Phase 5a):
1. Important — inbox_uploads.fetch_and_stage blocked the inbox poll
loop synchronously per row. A user dragging 4 files into chat at
once would stall the poller for 4× the per-fetch latency before the
chat message reached the agent. Add BatchFetcher (sketched after
this list): a thread-pool wrapper (default 4 workers) that submits
fetches concurrently and exposes wait_all() as the barrier the inbox
loop calls before processing the chat-message row that references
the uploads.
The drain barrier is the correctness invariant: rewrite_request_body
must observe a populated URI cache when it walks the chat-message
row's parts. _poll_once now drains the BatchFetcher inline before
the first non-upload row, and again at end-of-batch to cover the
case where a batch contains only upload rows: the chat message that
references them arrives in a later poll, and the end-of-batch drain
closes the race between that future poll and the current fetches.
2. Nit — fetch_and_stage created two httpx.Client instances per row
(one for GET /content, one for POST /ack). Refactor so a single
client serves both calls. When called from BatchFetcher, the
batch-shared client serves every row's GET + ack, so later fetches
reuse the pooled TCP+TLS connection instead of repeating the
handshake.
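A sketch of the wrapper under assumptions: the client= kwarg name and
attribute names are guesses; fetch_and_stage, the 4-worker default,
and the shared-client reuse come from this commit.

    from concurrent.futures import ThreadPoolExecutor, wait

    import httpx

    from workspace.inbox_uploads import fetch_and_stage

    class BatchFetcher:
        def __init__(self, max_workers=4):
            self._executor = ThreadPoolExecutor(max_workers=max_workers)
            self._client = httpx.Client()  # shared: every GET + ack
            self._futures = []

        def submit(self, row):
            self._futures.append(
                self._executor.submit(
                    fetch_and_stage, row, client=self._client
                )
            )

        def wait_all(self, timeout=None):
            wait(self._futures, timeout=timeout)  # inbox-loop barrier

        def close(self):
            self._executor.shutdown(wait=True)
            self._client.close()

_poll_once holds one of these per batch and calls wait_all() both
before the first non-upload row and once more at end-of-batch.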
Comprehensive tests:
- 13 new inbox_uploads tests:
- fetch_and_stage with supplied client: zero httpx.Client
constructions, GET+POST through the same client, caller's client
not closed (lifecycle owned by caller).
- fetch_and_stage without supplied client: exactly one
httpx.Client constructed (was 2 pre-fix), closed on the way out.
- BatchFetcher: 3 rows × 120ms each complete in parallel in <250ms
(vs. ~360ms serial; test shape sketched after this list), URI cache
hot when wait_all returns, per-row failure isolation, single-client
reuse across all submits, idempotent close, submit-after-close
raises, owned-vs-supplied client lifecycle, no-op wait_all on empty
batch, graceful httpx-missing degradation.
- 3 new inbox tests:
- poll_once drains uploads before processing the chat-message row
(in-place mutation of row['request_body'] proves the URI was
rewritten BEFORE message_from_activity returned).
- poll_once with only upload rows still drains at end-of-batch.
- poll_once with no upload rows never constructs a BatchFetcher
(zero overhead on the no-upload happy path).
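A hypothetical shape of the timing test; the fixture and stubbing are
invented, only the 120ms/250ms/~360ms numbers come from the list
above:

    import time

    def test_three_rows_complete_in_parallel(stub_slow_fetch):
        fetcher = BatchFetcher()               # default 4 workers
        start = time.monotonic()
        for row in stub_slow_fetch.rows(3):    # each fetch sleeps 120ms
            fetcher.submit(row)
        fetcher.wait_all()
        assert time.monotonic() - start < 0.25  # ~0.36s if serial
        fetcher.close()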
133 total inbox + inbox_uploads tests pass; 0 regressions.
Closes the chat-upload poll-mode-perf gap end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspace-side fetcher for the platform-staged chat uploads written
by Phase 1. Stacked atop feat/poll-mode-chat-upload-phase1.
Wire shape — the platform writes one activity_logs row per uploaded
file with `activity_type=a2a_receive`, `method=chat_upload_receive`,
and a `request_body={file_id, name, mimeType, size, uri}` carrying
the synthetic `platform-pending:<wsid>/<fid>` URI.
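For illustration only (every field value below is invented):

    {
      "file_id": "f_123",
      "name": "report.pdf",
      "mimeType": "application/pdf",
      "size": 48213,
      "uri": "platform-pending:ws_42/f_123"
    }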
Workspace-side flow (new module workspace/inbox_uploads.py; sketched
after the list):
1. Fetch via GET /workspaces/:id/pending-uploads/:file_id/content
2. Stage to /workspace/.molecule/chat-uploads/<32-hex>-<sanitized>
(same on-disk shape as internal_chat_uploads.py — agent-side
URI resolvers see no contract change)
3. POST /workspaces/:id/pending-uploads/:file_id/ack
4. Cache `platform-pending: → workspace:` so the eventual chat
message that REFERENCES the upload (separate, later activity row)
gets URI-rewritten before the agent sees it.
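A sketch of the flow under assumptions: the endpoints, staging-path
shape, and cache step come from the list above; the client/uri_cache
kwargs, the token_hex prefix derivation, the sanitize_filename
import, the workspace: URI form, and the pre-configured base_url are
all guesses.

    import secrets

    import httpx

    from workspace.internal_chat_uploads import sanitize_filename

    def fetch_and_stage(row, client=None, uri_cache=None):
        body = row["request_body"]
        ws_id, file_id = (
            body["uri"].removeprefix("platform-pending:").split("/", 1)
        )
        owned = client is None
        client = client or httpx.Client()  # assumed base_url set
        try:
            resp = client.get(
                f"/workspaces/{ws_id}/pending-uploads/{file_id}/content"
            )
            resp.raise_for_status()
            dest = (
                "/workspace/.molecule/chat-uploads/"
                f"{secrets.token_hex(16)}-{sanitize_filename(body['name'])}"
            )
            with open(dest, "wb") as f:
                f.write(resp.content)
            client.post(
                f"/workspaces/{ws_id}/pending-uploads/{file_id}/ack"
            )
            if uri_cache is not None:
                uri_cache[body["uri"]] = f"workspace:{dest}"  # step 4
        finally:
            if owned:
                client.close()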
Inbox poller extension (workspace/inbox.py; discriminator and rewrite
sketched after the list):
- is_chat_upload_row(row) discriminator on `method`
- upload-receive rows trigger fetch_and_stage and are NOT enqueued
as InboxMessages (they're side-effect rows, not chat messages)
- cursor advances past them regardless of fetch outcome — a
permanent /content failure must not stall the cursor and block
real chat traffic
- message_from_activity calls rewrite_request_body to swap
platform-pending: URIs to local workspace: URIs in subsequent
chat messages' file parts. Cache miss leaves the URI untouched
so the agent surfaces an unresolvable URI rather than the inbox
silently dropping the part.
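A sketch of the discriminator and the rewrite; the parts/file shape
of request_body is an assumption, while the method value and the
cache-miss behavior come from the bullets above.

    def is_chat_upload_row(row) -> bool:
        return row.get("method") == "chat_upload_receive"

    def rewrite_request_body(body, uri_cache):
        for part in body.get("parts", []):
            file_ref = part.get("file") or {}
            uri = file_ref.get("uri", "")
            if uri.startswith("platform-pending:"):
                # Cache miss leaves the URI untouched: the agent then
                # surfaces an unresolvable URI instead of the inbox
                # silently dropping the part.
                file_ref["uri"] = uri_cache.get(uri, uri)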
Filename sanitization mirrors workspace-server/internal/handlers
/chat_files.go::SanitizeFilename and workspace/internal_chat_uploads
.py::sanitize_filename — pinned by the existing parity test suites.
Coverage: 100% on inbox_uploads.py; the inbox.py extension is fully
covered by three new tests in test_inbox.py (skip-from-queue,
cursor-advance-past-broken-fetch, URI-rewrite ordering).