molecule-core/workspace/internal_file_read.py
Hongming Wang e955597a98 feat(chat_files): rewrite Download as HTTP-forward (RFC #2312, PR-D)
Mirrors PR-C's Upload migration: replaces the docker-cp tar-stream
extraction with a streaming HTTP GET to the workspace's own
/internal/file/read endpoint. Closes the SaaS gap for downloads —
without this PR, GET /workspaces/:id/chat/download still returns 503
on Railway-hosted SaaS even after A+B+C+F land.
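
Seen from the caller's side, the forward is a plain authenticated GET. The sketch below is a hypothetical Python stand-in for the platform-side Go code, not the actual handler: `build_read_request`, the base URL, and the secret value are illustrative assumptions. Only the /internal/file/read path, the `path` query parameter, and the Bearer scheme come from this commit.

```python
import urllib.parse
import urllib.request


def build_read_request(workspace_url: str, secret: str, path: str) -> urllib.request.Request:
    """Build (but do not send) a GET for the workspace's file-read endpoint.

    Hypothetical helper for illustration; the real forward lives in
    chat_files.go and is written in Go.
    """
    # urlencode percent-encodes the path so it travels as one query value.
    query = urllib.parse.urlencode({"path": path})
    url = f"{workspace_url.rstrip('/')}/internal/file/read?{query}"
    req = urllib.request.Request(url, method="GET")
    req.add_header("Authorization", f"Bearer {secret}")
    return req


# Assumed values, for illustration only.
req = build_read_request("http://workspace.example:8000", "s3cret", "/workspace/report.pdf")
print(req.get_method(), req.full_url)
```

Nothing is sent over the network here; the point is only the request shape that the workspace endpoint expects.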

Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → PR-F #2319 → this PR.

Why a single broad /internal/file/read instead of /internal/chat/download:

  Today's chat_files.go::Download already accepts paths under any of the
  four allowed roots {/configs, /workspace, /home, /plugins} — it's not
  strictly chat. Future PRs (template export, etc.) will reuse this
  endpoint via the same forward pattern; reusing avoids three near-
  identical handlers (one per domain) with duplicated path-safety logic.

Path safety is duplicated on platform + workspace sides — defence in
depth via two parallel checks, not "trust the workspace."
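
To make the shared contract concrete, here is a minimal sketch of the rules as this commit states them (absolute, under an allowed root, canonical, no traversal), run against a few sample paths. `check_path` and the sample inputs are illustrative assumptions; the real checks live in chat_files.go and workspace/internal_file_read.py.

```python
import posixpath

ALLOWED_ROOTS = ("/configs", "/workspace", "/home", "/plugins")


def check_path(path: str) -> bool:
    """Return True when `path` satisfies the path-safety rules (illustrative)."""
    if not path or not posixpath.isabs(path):
        return False
    # Exact match on a root, or root followed by "/" (blocks look-alike prefixes).
    if not any(path == root or path.startswith(root + "/") for root in ALLOWED_ROOTS):
        return False
    # Must already be canonical and contain no traversal marker.
    return posixpath.normpath(path) == path and ".." not in path


samples = [
    "/workspace/notes.txt",      # under an allowed root
    "/workspace",                # exact match on a root
    "relative/notes.txt",        # not absolute
    "/etc/passwd",               # outside the allowed roots
    "/workspace/../etc/passwd",  # traversal
    "/workspace//notes.txt",     # double slash, not canonical
    "/workspaces/notes.txt",     # prefix look-alike, not under /workspace
]
for p in samples:
    print(f"{p!r}: {'accepted' if check_path(p) else 'rejected'}")
```

Only the first two samples pass; every other one trips a different rule, which is why the same matrix is worth running on both sides of the forward.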

Changes:
  * workspace/internal_file_read.py — Starlette handler. Validates path
    (must be absolute, under allowed roots, no traversal, canonicalises
    cleanly). lstat (not stat) so a symlink at the path doesn't redirect
    the read. Streams via FileResponse (no buffering). Mirrors Go's
    contentDispositionAttachment for Content-Disposition header.
  * workspace/main.py — registers GET /internal/file/read alongside the
    POST /internal/chat/uploads/ingest from PR-B.
  * scripts/build_runtime_package.py — adds internal_file_read to
    TOP_LEVEL_MODULES so the publish-runtime cascade rewrites its
    imports correctly. Also includes the PR-B additions
    (internal_chat_uploads, platform_inbound_auth) since this branch
    was rooted before PR-B's drift-gate fix; merge-clean alphabetic
    additions.
  * workspace-server/internal/handlers/chat_files.go — Download
    rewritten as streaming HTTP GET forward. Resolves workspace URL +
    platform_inbound_secret (same shape as Upload), builds GET request
    with path query param, propagates response headers (Content-Type /
    Content-Length / Content-Disposition) + body. Drops archive/tar
    + mime imports (no longer needed). Drops Docker-exec branch entirely
    — Download is now uniform across self-hosted Docker and SaaS EC2.
  * workspace-server/internal/handlers/chat_files_test.go — replaces
    TestChatDownload_DockerUnavailable (stale post-rewrite) with 4
    new tests:
      - TestChatDownload_WorkspaceNotInDB → 404 on missing row
      - TestChatDownload_NoInboundSecret → 503 on NULL column
        (with RFC #2312 detail in body)
      - TestChatDownload_ForwardsToWorkspace_HappyPath → forward shape
        (auth header, GET method, /internal/file/read path) + headers
        propagated + body byte-for-byte
      - TestChatDownload_404FromWorkspacePropagated → 404 from
        workspace propagates (NOT remapped to 500)
    Existing TestChatDownload_InvalidPath path-safety tests preserved.
  * workspace/tests/test_internal_file_read.py — 21 tests covering
    _validate_path matrix (absolute, allowed roots, traversal, double-
    slash, exact-match-on-root), 401 on missing/wrong/no-secret-file
    bearer, 400 on missing path/outside-root/traversal, 404 on missing
    file, happy-path streaming with correct Content-Type +
    Content-Disposition, special-char escaping in Content-Disposition,
    symlink-redirect-rejection (lstat-not-stat protection).
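
The lstat-not-stat point in the list above is easy to reproduce in isolation. The following is a standalone sketch, not the handler's code: it creates a temporary regular file and a symlink to it, then compares what `os.stat` and `os.lstat` report. The helper name `is_regular_file` and the file names are illustrative, and the sketch assumes a platform where `os.symlink` is permitted.

```python
import os
import stat
import tempfile


def is_regular_file(path: str, follow_symlinks: bool) -> bool:
    """Report whether `path` is a regular file, optionally following symlinks."""
    st = os.stat(path) if follow_symlinks else os.lstat(path)
    return stat.S_ISREG(st.st_mode)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "secret.txt")
    link = os.path.join(tmp, "innocent.txt")
    with open(target, "w") as fh:
        fh.write("data")
    os.symlink(target, link)

    followed = is_regular_file(link, follow_symlinks=True)   # stat inspects the target
    literal = is_regular_file(link, follow_symlinks=False)   # lstat inspects the link itself
    target_literal = is_regular_file(target, follow_symlinks=False)

print("stat says regular file:", followed)
print("lstat says regular file:", literal)
```

With `stat` the symlink is indistinguishable from the file it points at, whereas `lstat` reports the link itself, so an S_ISREG check on the `lstat` result rejects it.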

Test results:
  * go test ./internal/handlers/ ./internal/wsauth/ — green
  * pytest workspace/tests/ — 1292 passed (was 1272 before PR-D)

Refs #2312 (parent RFC), #2308 (chat upload+download 503 incident).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:19:02 -07:00

"""GET /internal/file/read?path=<abs path> — workspace-side file read sink.

Companion to /internal/chat/uploads/ingest (RFC #2312 PR-B). Replaces the
docker-cp tar-stream extraction the platform-side workspace-server used
in chat_files.go::Download. Same path-safety contract as the legacy Go
handler:

  * absolute path required
  * must canonicalise to itself (no `..` segments, no double-slashes)
  * must land under one of {/configs, /workspace, /home, /plugins}
  * must be a regular file (not a directory, symlink, device, etc.)

Why a single broad "/internal/file/read" instead of a chat-specific path:

  Today's chat_files.go::Download already accepts paths under any of the
  four allowed roots, so it is not strictly chat. Future PR-G/H will migrate
  /files/* template-config reads to the same forward pattern; reusing
  the same endpoint avoids three near-identical handlers (one per domain)
  with duplicated path-safety logic.

Auth: Bearer <platform_inbound_secret>; fail-closed when missing.

Response shape (matches Go contract for byte-for-byte compatibility):

  Content-Type: <guessed from the file extension, else application/octet-stream>
  Content-Length: <stat size>
  Content-Disposition: attachment; filename="<basename>"; filename*=UTF-8''<encoded>
  body: raw file bytes (binary-safe, no JSON wrapping)
"""

from __future__ import annotations

import logging
import mimetypes
import os
import urllib.parse
from pathlib import Path

from starlette.requests import Request
from starlette.responses import FileResponse, JSONResponse

from platform_inbound_auth import get_inbound_secret, inbound_authorized

logger = logging.getLogger(__name__)

# Mirror chat_files.go's allowedRoots set. A request whose `path` doesn't
# fall under one of these (by exact match or prefix with trailing slash)
# is rejected at the gate, regardless of how many `..` segments
# canonicalised away.
_ALLOWED_ROOTS = ("/configs", "/workspace", "/home", "/plugins")


def _content_disposition_attachment(name: str) -> str:
    """Mirror chat_files.go::contentDispositionAttachment.

    Quotes, CR, and LF stripped/escaped per RFC 6266 / RFC 5987.
    Drop control chars, escape backslash and double-quote in the
    quoted-string. Emit percent-encoded filename* so non-ASCII names
    survive in clients that prefer the modern form.
    """
    safe_q: list[str] = []
    for ch in name:
        if ch in ("\r", "\n"):
            continue  # would terminate the header
        if ch in ('"', "\\"):
            safe_q.append("\\")
            safe_q.append(ch)
            continue
        if ord(ch) < 0x20 or ord(ch) == 0x7f:
            continue  # other control chars
        safe_q.append(ch)
    ascii_safe = "".join(safe_q)
    encoded = urllib.parse.quote(name, safe="")  # full RFC 3986 unreserved-only
    return f'attachment; filename="{ascii_safe}"; filename*=UTF-8\'\'{encoded}'


def _validate_path(path: str) -> tuple[bool, str]:
    """Return (ok, error_msg). Mirrors Go's chat_files.go::Download
    validation in the same order so error shapes stay identical."""
    if not path:
        return False, "path query required"
    if not os.path.isabs(path):
        return False, "path must be absolute"
    rooted = False
    for root in _ALLOWED_ROOTS:
        if path == root or path.startswith(root + "/"):
            rooted = True
            break
    if not rooted:
        return False, "path must be under /configs, /workspace, /home, or /plugins"
    # Reject anything that canonicalises differently or contains a
    # traversal segment. Defence-in-depth on top of the prefix check.
    if os.path.normpath(path) != path or ".." in path:
        return False, "invalid path"
    return True, ""


async def file_read_handler(request: Request):
    """GET /internal/file/read: Starlette route handler."""
    if not inbound_authorized(get_inbound_secret(), request.headers.get("Authorization", "")):
        return JSONResponse({"error": "unauthorized"}, status_code=401)

    path = request.query_params.get("path", "")
    ok, err = _validate_path(path)
    if not ok:
        return JSONResponse({"error": err}, status_code=400)

    # lstat (not stat) so a symlink at the path doesn't pretend to be the
    # file it points at. We want to know "is this LITERALLY a regular
    # file at the validated path." A symlink could redirect to /etc/*
    # or another mount.
    try:
        st = os.lstat(path)
    except FileNotFoundError:
        return JSONResponse({"error": "file not found"}, status_code=404)
    except OSError as exc:
        logger.warning("internal_file_read: lstat %s failed: %s", path, exc)
        return JSONResponse({"error": "stat failed"}, status_code=500)

    import stat as _stat

    if not _stat.S_ISREG(st.st_mode):
        return JSONResponse({"error": "path is not a regular file"}, status_code=400)

    name = os.path.basename(path)
    mime_type, _ = mimetypes.guess_type(name)
    if not mime_type:
        mime_type = "application/octet-stream"

    return FileResponse(
        path,
        media_type=mime_type,
        headers={
            "Content-Disposition": _content_disposition_attachment(name),
        },
    )