Commit Graph

16 Commits

Author SHA1 Message Date
Hongming Wang
b759548822 fix(chat-uploads): activity rows commit atomically with PutBatch
Closes #149.

uploadPollMode for poll-mode chat uploads previously committed N
pending_uploads rows in one Tx (PutBatch), then wrote N activity_logs
rows individually outside any Tx. A per-row failure on activity row K
left rows 1..K-1 committed and pending_uploads orphaned until the 24h
TTL — not data-loss because the platform's fetcher handled the
half-state cleanly, but the user never saw file K in the canvas and
the inconsistency surfaced as an "uploaded but invisible" complaint
class.

Thread one Tx through PutBatchTx + N × LogActivityTx + Commit so all
or none commit. Broadcasts are deferred until after Commit — emitting
an ACTIVITY_LOGGED event for a row that ends up rolled back would
paint a ghost message into the canvas's optimistic UI. A new
LogActivityTx returns a commitHook the caller invokes post-Commit;
the existing fire-and-forget LogActivity is unchanged for the 4 other
production callers (a2a_proxy_helpers + activity.go report path).

Storage interface gains PutBatchTx; PostgresStorage.PutBatch is
refactored to share the validation + insert path. inMemStorage and
fakeSweepStorage delegate or no-op for PutBatchTx (the in-mem fake
can't model Tx state — DB-level atomicity is verified by the existing
real-Postgres integration test for PutBatch + the new unit test
asserting the Go handler calls Rollback on activity-insert failure).

Tests:
- TestPollUpload_AtomicRollbackOnActivityInsertFailure pins the new
  contract via sqlmock — second activity insert errors → Rollback
  expected, Commit must NOT be called.
- TestLogActivityTx_DefersBroadcastUntilCommitHook +
  _InsertError_NoHook_NoBroadcast + _NilTx_Errors cover the new API.
- TestPutBatchTx_HappyPath / _EmptyItems / _ValidationFails /
  _PerRowErrorPropagates cover Tx-aware storage layer.
- 7 existing TestPollUpload_* tests updated to mock Begin + Commit
  (or Begin + Rollback for failure paths) since the handler now
  opens a Tx around PutBatch + activity inserts.

All workspace-server tests pass; integration tag also clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:34:28 -07:00
Hongming Wang
9991057ad1 feat(poll-upload): phase 5a — atomic batch insert + acked-index + mime hardening
Resolves four of six findings from the retrospective code review of Phases
1–4 (poll-mode chat upload). Bundled because every change is in the
platform's pending_uploads layer or the multi-file handler that reads it.

Findings resolved:

1. Important — Sweep query lacked an index for the acked-retention OR-arm.
   The Phase 1 partial indexes are both `WHERE acked_at IS NULL`, so the
   `(acked_at IS NOT NULL AND acked_at < retention)` half of the WHERE
   clause seq-scanned the table on every cycle. Add a complementary
   partial index on `acked_at WHERE acked_at IS NOT NULL` so both arms
   of the disjunction are index-covered. Disjoint from the existing two
   indexes (no row matches both predicates), so write amplification is
   bounded to ~one index entry per terminal-state row.

2. Important — uploadPollMode partial-failure left orphans. The previous
   per-file Put loop committed rows 1..K-1 and then errored on row K with
   no compensation, so a client retry would double-insert the survivors.
   Refactor the handler into three explicit phases (pre-validate +
   read-into-memory, single atomic PutBatch, per-file activity row) and
   add Storage.PutBatch with all-or-nothing transaction semantics.

3. FYI — pendinguploads.StartSweeperWithInterval was exported only for
   tests. Move it to lower-case startSweeperWithInterval and expose the
   test seam through pendinguploads/export_test.go (Go convention; the
   shim file is stripped from the production binary at build time).

4. Nit — multipart Content-Type was passed verbatim into pending_uploads
   rows and re-served on /content. Add safeMimetype which strips
   parameters, rejects CR/LF/control bytes, and coerces malformed shapes
   to application/octet-stream. The eventual GET /content response can no
   longer be header-split via a crafted Content-Type on the multipart.

Comprehensive tests:

- 10 PutBatch unit tests (sqlmock): happy path, empty input, all four
  pre-validation rejection paths, BeginTx error, per-row error +
  Rollback (no Commit), first-row error, Commit error.
- 4 new PutBatch integration tests (real Postgres): all-rows-commit
  happy path with COUNT(*) verification, atomic-rollback no-leak via
  a NUL-byte filename that lib/pq rejects mid-batch, oversize
  short-circuit no-Tx, idx_pending_uploads_acked existence + partial
  predicate via pg_indexes (planner-shape-independent).
- 3 new chat_files_poll tests: atomic rollback on second-file oversize,
  atomic rollback on PutBatch error, mimetype CRLF/NUL/parameter
  sanitization (8 sub-cases).

The two remaining review findings (inbox_uploads.fetch_and_stage blocks
the poll loop synchronously; two httpx Clients per row) are Python-side
and ship in Phase 5b once this lands on staging.

Test-only export pattern via export_test.go, atomic pre-validation
discipline (validate before Tx), and behavior-based (not name-based)
test assertions follow the standing project conventions.
2026-05-05 11:10:13 -07:00
Hongming Wang
43bf94a07c fix(chat-uploads): align poll-mode activity rows with inbox poll filter
The workspace inbox poller filters
`GET /workspaces/:id/activity?type=a2a_receive` — writing rows with
`activity_type=chat_upload_receive` would be silently invisible to it.

Switch the poll-mode upload-staging handler to write
`activity_type=a2a_receive` with `method=chat_upload_receive` as the
discriminator. Same shape as A2A's `tasks/send` vs `message/send` method
split; the workspace-side handler (Phase 2) routes by `method`, not
activity_type.

Pinned with `TestPollUpload_ActivityRowDiscriminator` — sqlmock
WithArgs on positions 2 (activity_type) and 5 (method) so a refactor
that flips activity_type back to a custom value gets a red test
instead of a runtime "poller saw nothing" silent break.
2026-05-05 04:29:07 -07:00
Hongming Wang
86fdaad111 feat(rfc): poll-mode chat upload — phase 1 platform staging layer
External-runtime workspaces (registered via molecule connect, behind
NAT, no public callback URL) currently see HTTP 422 "workspace has no
callback URL" on every chat file upload. The only escape is to wrap the
laptop in ngrok / Cloudflare tunnel + re-register push-mode — a tax
that shouldn't exist for a one-line use case.

This phase introduces the platform-side staging layer that lets
canvas → external workspace uploads ride the same poll loop the inbox
already uses for text messages.

Architecture (mirrors inbox poll, SSOT principle):
  Canvas POST /chat/uploads (multipart)
      ↓ delivery_mode=poll
  Platform: chat_files.uploadPollMode
      ↓ pendinguploads.Storage.Put + LogActivity(chat_upload_receive)
  Workspace's existing inbox poller picks up the activity row (Phase 2)
  Workspace fetches: GET /workspaces/:id/pending-uploads/:fid/content
  Workspace acks:    POST /workspaces/:id/pending-uploads/:fid/ack

Pieces in this PR:
  * Migration 20260505100000 — pending_uploads table; partial indexes
    on unacked + expires_at for the workspace fetch + Phase 3 sweep
    hot paths. No FK to workspaces (audit retention), 24h hard TTL.
  * internal/pendinguploads — Storage interface + Postgres impl. Bytes
    inline (bytea) today; the interface lets a future PR replace with
    S3 (RFC #2789) by swapping one constructor. 100% test coverage on
    the Postgres impl via sqlmock-pinned SQL.
  * handlers.PendingUploadsHandler — GET /content + POST /ack endpoints.
    wsAuth-gated; cross-workspace bleed protection via per-row
    workspace_id check (token leak from A can't read B's pending bytes).
    Handler tests pin happy path + every 4xx/5xx mapping including
    cross-workspace + race-with-sweep.
  * chat_files.go — Upload poll-mode branch behind WithPendingUploads
    builder. Push-mode unchanged (regression-tested). Multipart parse
    + per-file sanitize + storage.Put + activity_logs row per file.
  * SanitizeFilename — Go mirror of workspace/internal_chat_uploads.py
    sanitize_filename. Tests pin parity case-by-case so canvas-emitted
    URIs stay identical regardless of which path handles the upload.
  * Comprehensive logging — every state transition (staged, fetch,
    ack, error) emits a structured log line with workspace_id +
    file_id + size + sanitized name. Phase 3 metrics will hook these.

The pendinguploads.Storage wiring is opt-in (WithPendingUploads on
ChatFilesHandler) so a binary deployed without the migration keeps the
pre-existing 422 behavior — no boot-order coupling between code roll
and schema roll.

Phase 2 (separate PR): workspace inbox extension — inbox_uploads.py
fetches via the GET endpoint, writes to /workspace/.molecule/chat-
uploads/, acks, and rewrites the URI from platform-pending: → workspace:
so the agent's existing send-attachments path needs no changes.
Phase 3: GC sweep + dashboards. Phase 4: poll-mode E2E on staging.

Tests:
  * 100% coverage on pendinguploads (sqlmock-pinned SQL drift gate).
  * Functional 100% on new handler code (uncovered branches are
    documented defensive duplicates: uuid re-parse, multipart Open
    error, Writer.Write fail — none reproducible in unit tests).
  * Push-mode + NULL delivery_mode regression tests pin no behavior
    change for existing workspaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:22:24 -07:00
Hongming Wang
bcea8ac822 Broaden empty-URL 422 to cover NULL delivery_mode (production reality)
Live-probed user's tenant: three of three external-runtime workspaces
register with delivery_mode = NULL, not "poll". The earlier narrow
poll-only check fell through to the misleading 503 for the actually-
observed shape.

Invariant we want: URL empty + not-exactly-"push" → no dispatch path
will ever exist → 422. Only push-mode with empty URL is genuinely
transient (mid-boot, restart in progress) → 503.

Added TestChatUpload_NullModeEmptyURL using the user's actual workspace
ID. Existing TestChatUpload_NoURL switched to explicit "push" mode
(was relying on default — unsafe given the new branching).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 02:42:46 -07:00
Hongming Wang
87ae691e67 Distinguish poll-mode workspace from transient empty-URL on chat upload
External-runtime workspaces that register in poll mode have no callback
URL by design — the platform never dispatches to them, so chat upload
(HTTP-forward by design) can't proceed. Returning 503 + "workspace url
not registered yet" was misleading: the "yet" implied transient state,
but the URL would never arrive.

Caught externally on 2026-05-04: user uploading an image to an external
"mac laptop" runtime workspace saw the 503 and assumed they should
retry. The workspace's poll mode meant retrying would never help.

Fix: include delivery_mode in the workspace lookup. When URL is empty:
- poll mode → 422 + "re-register in push mode with a public URL"
  (Unprocessable Entity — this request can't succeed against this
  workspace's configuration; no retry will help)
- push mode → 503 + "not registered yet" (genuine transient state —
  retry after next heartbeat is correct)

Test: TestChatUpload_PollModeEmptyURL pins the new 422 path; existing
TestChatUpload_NoURL strengthened to assert the "not registered yet"
substring stays on the push branch (it would have silently passed if
the new 422 path had clobbered both branches).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 02:42:46 -07:00
Hongming Wang
830e4aa548 refactor(chat_files): extract streamWorkspaceResponse helper for Upload+Download
The "do request → check err → defer close → forward headers → set
status → io.Copy → log mid-stream errors" tail was duplicated between
Upload and Download. Each handler had ~12 lines that differed only in:

  - the op label in log messages ("upload" vs "download")
  - the set of response headers to forward verbatim
    (Upload: Content-Type only; Download: Content-Type +
    Content-Length + Content-Disposition)

Hoist into ChatFilesHandler.streamWorkspaceResponse(c, op,
workspaceID, forwardURL, req, forwardHeaders). Each call site
reduces to one line. Future changes — request-id forwarding,
observability metric, response-size cap, bytes-streamed log —
go in ONE place rather than two.

Same drift-prevention rationale as resolveWorkspaceForwardCreds
(#2372) and readOrLazyHealInboundSecret (#2376), applied to the
response-streaming layer of the same handlers.

Behavior preserved: existing TestChatUpload_* and TestChatDownload_*
integration tests (8 across both handlers) all pass unchanged. The
log message format is consistent across both handlers now (single
"chat_files {op}: ..." string template) — operators can grep one
prefix for both features instead of separate prefixes per handler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 08:27:45 -07:00
Hongming Wang
30a569c742 refactor: extract readOrLazyHealInboundSecret to dedup chat_files + registry
The lazy-heal-on-miss pattern landed in two places this session:
PR #2372 (chat_files.go::resolveWorkspaceForwardCreds — Upload + Download)
and PR #2375 (registry.go::Register). Both implementations did the same
thing:

  read → if ErrNoInboundSecret then mint inline → return outcome

Different response-shape requirements but the same core mechanic. Three
sites' worth of drift potential: any future heal-time condition we add
(audit log, alert, secret rotation, observability) had to be applied to
each site, with partial application silently re-opening the gap.

Fix: extract readOrLazyHealInboundSecret in workspace_provision_shared.go
returning (secret, healed, err). Each caller maps the outcome to its
response shape:

  - chat_files: healed=true → 503 with retry hint; err != nil → 503 with
    RFC-#2312 reprovision hint
  - registry:   healed=true|false + err==nil → include in response;
    err != nil → omit field (workspace can retry on next register)

Net effect:

  - Single source of truth for the read+heal mechanic
  - Response-shape decisions stay in callers (they DO differ per feature)
  - Future heal-time conditions go in one place
  - Behavior preserved: existing TestRegister_NoInboundSecret_LazyHeals,
    TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField,
    TestChatUpload_NoInboundSecret_LazyHeal*,
    TestChatDownload_NoInboundSecret_LazyHeal* all pass unchanged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 04:11:43 -07:00
Hongming Wang
501a42d753 refactor(chat_files): extract resolveWorkspaceForwardCreds shared by Upload+Download
The 50-line "resolve URL + read inbound secret + lazy-heal on miss"
block was duplicated nearly verbatim between Upload and Download
handlers. Drift-prone — same class of risk as the original SaaS
provision drift fixed in #2366. A future change like:

  - secret rotation (re-mint when the row's older than X)
  - per-feature audit logging
  - additional fail-closed conditions

would have to be applied to both handlers, and a partial application
that healed Upload but skipped Download would surface only at runtime.

Fix: hoist the shared logic into resolveWorkspaceForwardCreds. The
function takes an op label ("upload"/"download") used in log messages
+ the 503 RFC-#2312 detail copy so operators can still distinguish
which feature ran. Both handlers reduce to:

    wsURL, secret, ok := resolveWorkspaceForwardCreds(c, ctx, workspaceID, "upload")
    if !ok { return }

Net -20 lines (helper amortizes the 50-line block across both call
sites). Existing test coverage (TestChatUpload_NoInboundSecret_*,
TestChatDownload_NoInboundSecret_* from PR #2370) covers all four
branches of the shared helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:51:53 -07:00
Hongming Wang
3f8286ea47 fix(provision): share Docker+SaaS prepare path so both mint workspace secrets (RFC #2312)
Root cause of 2026-04-30 silent-503 chat-upload bug: provisionWorkspaceCP
(SaaS) skipped issueAndInjectInboundSecret while provisionWorkspaceOpts
(Docker) called it. Every prod SaaS workspace provisioned with NULL
platform_inbound_secret → upload returned 503 with the v2-enrollment
message on every attempt.

Structural fix:
- Extract prepareProvisionContext (secrets load, env mutators, preflight,
  cfg build), mintWorkspaceSecrets (auth_token + platform_inbound_secret),
  markProvisionFailed (broadcast + DB update) into workspace_provision_shared.go
- Refactor both provision modes to call the shared helpers
- Add provisionAbort struct so the missing-env failure class can carry its
  structured "missing" payload through the shared abort path
- Unify last_sample_error: previously the decrypt-fail path skipped it while
  others set it; users now see every failure class in the UI

Drift prevention:
- AST gate TestProvisionFunctions_AllCallMintWorkspaceSecrets asserts every
  function in the provisionFunctions set calls mintWorkspaceSecrets at least
  once (same shape as the audit-coverage gate from #335). New provision paths
  must either call mint or be added to provisionExemptFunctions with a
  one-line justification
- Behavioral test TestMintWorkspaceSecrets_PersistsInboundSecretInSaaSMode
  pins the contract: SaaS mode MUST persist platform_inbound_secret to the DB
  column even though it skips file injection

Existing-workspace recovery (chat_files.go lazy-heal):
- Upload + Download handlers detect NULL platform_inbound_secret and call
  IssuePlatformInboundSecret inline, returning 503 with retry_after_seconds=30
- Self-heals workspaces that were provisioned before this fix without
  requiring destructive reprovision

Tests: full handlers + workspace-server module green; AST gate verified to
fire red on deliberate violation (commented-out mint call surfaces the
exact function name + actionable remediation message).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:18:08 -07:00
Hongming Wang
82f73b1fa3 docs(chat_files): update header — Download is HTTP-forward, not docker-cp
The header comment claimed:
  "file upload (HTTP-forward) + download (Docker-exec)"
and:
  "Download still uses the v1 docker-cp path; migrating it lives in
   the next PR in this stack"

Both wrong now. RFC #2312 PR-D landed the Download HTTP-forward path:
chat_files.go:336 builds an http.NewRequestWithContext to
${wsURL}/internal/file/read?path=<abs>, with the response streamed
back to the caller. The workspace-side Starlette handler is at
workspace/internal_file_read.py, mounted at workspace/main.py:440.

Update the header to reflect actual code: both upload + download are
HTTP-forward, share the same per-workspace platform_inbound_secret
auth, and work uniformly on local Docker and SaaS EC2.

Pure docs change — no behavior, no build/test impact.
2026-04-29 20:28:58 -07:00
Hongming Wang
e955597a98 feat(chat_files): rewrite Download as HTTP-forward (RFC #2312, PR-D)
Mirrors PR-C's Upload migration: replaces the docker-cp tar-stream
extraction with a streaming HTTP GET to the workspace's own
/internal/file/read endpoint. Closes the SaaS gap for downloads —
without this PR, GET /workspaces/:id/chat/download still returns 503
on Railway-hosted SaaS even after A+B+C+F land.

Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → PR-F #2319 → this PR.

Why a single broad /internal/file/read instead of /internal/chat/download:

  Today's chat_files.go::Download already accepts paths under any of the
  four allowed roots {/configs, /workspace, /home, /plugins} — it's not
  strictly chat. Future PRs (template export, etc.) will reuse this
  endpoint via the same forward pattern; reusing avoids three near-
  identical handlers (one per domain) with duplicated path-safety logic.

Path safety is duplicated on platform + workspace sides — defence in
depth via two parallel checks, not "trust the workspace."

Changes:
  * workspace/internal_file_read.py — Starlette handler. Validates path
    (must be absolute, under allowed roots, no traversal, canonicalises
    cleanly). lstat (not stat) so a symlink at the path doesn't redirect
    the read. Streams via FileResponse (no buffering). Mirrors Go's
    contentDispositionAttachment for Content-Disposition header.
  * workspace/main.py — registers GET /internal/file/read alongside the
    POST /internal/chat/uploads/ingest from PR-B.
  * scripts/build_runtime_package.py — adds internal_file_read to
    TOP_LEVEL_MODULES so the publish-runtime cascade rewrites its
    imports correctly. Also includes the PR-B additions
    (internal_chat_uploads, platform_inbound_auth) since this branch
    was rooted before PR-B's drift-gate fix; merge-clean alphabetic
    additions.
  * workspace-server/internal/handlers/chat_files.go — Download
    rewritten as streaming HTTP GET forward. Resolves workspace URL +
    platform_inbound_secret (same shape as Upload), builds GET request
    with path query param, propagates response headers (Content-Type /
    Content-Length / Content-Disposition) + body. Drops archive/tar
    + mime imports (no longer needed). Drops Docker-exec branch entirely
    — Download is now uniform across self-hosted Docker and SaaS EC2.
  * workspace-server/internal/handlers/chat_files_test.go — replaces
    TestChatDownload_DockerUnavailable (stale post-rewrite) with 4
    new tests:
      - TestChatDownload_WorkspaceNotInDB → 404 on missing row
      - TestChatDownload_NoInboundSecret → 503 on NULL column
        (with RFC #2312 detail in body)
      - TestChatDownload_ForwardsToWorkspace_HappyPath → forward shape
        (auth header, GET method, /internal/file/read path) + headers
        propagated + body byte-for-byte
      - TestChatDownload_404FromWorkspacePropagated → 404 from
        workspace propagates (NOT remapped to 500)
    Existing TestChatDownload_InvalidPath path-safety tests preserved.
  * workspace/tests/test_internal_file_read.py — 21 tests covering
    _validate_path matrix (absolute, allowed roots, traversal, double-
    slash, exact-match-on-root), 401 on missing/wrong/no-secret-file
    bearer, 400 on missing path/outside-root/traversal, 404 on missing
    file, happy-path streaming with correct Content-Type +
    Content-Disposition, special-char escaping in Content-Disposition,
    symlink-redirect-rejection (lstat-not-stat protection).

Test results:
  * go test ./internal/handlers/ ./internal/wsauth/ — green
  * pytest workspace/tests/ — 1292 passed (was 1272 before PR-D)

Refs #2312 (parent RFC), #2308 (chat upload+download 503 incident).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:19:02 -07:00
Hongming Wang
c02cb0e1b6 review: defer forward-time URL re-validation to follow-up (#2316)
Self-review found the original draft of this PR added forward-time
validateAgentURL() as defense-in-depth — paranoia layer on top of the
existing register-time gate. The validator unconditionally blocks
loopback (127.0.0.1/8), which makes httptest-based proxy tests
impossible without an env-var hatch I'd rather not add to a security-
critical path on first pass.

Trust note kept inline pointing at the upstream gate + tracking issue
so the gap is explicit, not invisible.

Refs #2312.
2026-04-29 14:33:41 -07:00
Hongming Wang
e632a31347 feat(chat_files): rewrite Upload as HTTP-forward to workspace (RFC #2312, PR-C)
Closes the SaaS upload gap (#2308) with the unified architecture from
RFC #2312: same code path on local Docker and SaaS, no Docker socket
dependency, no `dockerCli == nil` cliff. Stacked on PR-A (#2313) +
PR-B (#2314).

Before:
  Upload → findContainer (nil in SaaS) → 503

After:
  Upload → resolve workspaces.url + platform_inbound_secret
        → stream multipart to <url>/internal/chat/uploads/ingest
        → forward response back unchanged

Same call site whether the workspace runs on local docker-compose
("http://ws-<id>:8000") or SaaS EC2 ("https://<id>.<tenant>...").
The bug behind #2308 cannot exist by construction.

Why streaming, not parse-then-re-encode:
  * No 50 MB intermediate buffer on the platform
  * Per-file size + path-safety enforcement is the workspace's job
    (see workspace/internal_chat_uploads.py, PR-B)
  * Workspace's error responses (413 with offending filename, 400 on
    missing files field, etc.) propagate through unchanged

Changes:
  * workspace-server/internal/handlers/chat_files.go — Upload rewritten
    as a streaming HTTP proxy. Drops sanitizeFilename, copyFlatToContainer,
    and the entire docker-exec path. ChatFilesHandler gains an httpClient
    (broken out for test injection). Download stays docker-exec for now;
    follow-up PR will migrate it to the same shape.
  * workspace-server/internal/handlers/chat_files_external_test.go —
    deleted. Pinned the wrong-headed runtime=external 422 gate from
    #2309 (already reverted in #2311). Superseded by the proxy tests.
  * workspace-server/internal/handlers/chat_files_test.go — replaced
    sanitize-filename tests (now in workspace/tests/test_internal_chat_uploads.py)
    with sqlmock + httptest proxy tests:
      - 400 invalid workspace id
      - 404 workspace row missing
      - 503 platform_inbound_secret NULL (with RFC #2312 detail)
      - 503 workspaces.url empty
      - happy-path forward (asserts auth header, content-type forwarded,
        body streamed, response propagated back)
      - 413 from workspace propagated unchanged (NOT remapped to 500)
      - 502 on workspace unreachable (connect refused)
    Existing Download + ContentDisposition tests preserved.
  * tests/e2e/test_chat_upload_e2e.sh — single-script-everywhere E2E.
    Takes BASE as env (default http://localhost:8080). Creates a
    workspace, waits for online, mints a test token, uploads a fixture,
    reads it back via /chat/download, asserts content matches +
    bearer-required. Same script runs against staging tenants (set
    BASE=https://<id>.<tenant>.staging.moleculesai.app).

Test plan:
  * go build ./... — green
  * go test ./internal/handlers/ ./internal/wsauth/ — green (full suite)
  * tests/e2e/test_chat_upload_e2e.sh against local docker-compose
    after PR-A + PR-B + this PR all merge — TODO before merge

Refs #2312 (parent RFC), #2308 (chat upload 503 incident).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:26:37 -07:00
Hongming Wang
4a6095ee1a fix(chat_files): return 422 with structured detail for external workspaces (closes #2308)
Symptom: pasting a screenshot into the canvas chat for a runtime="external"
workspace returned `503 {"error":"workspace container not running"}` —
accurate from the upload handler's POV (no container exists for external
workspaces) but misleading because it implies the container has crashed.

Fix: detect runtime="external" via DB lookup BEFORE the container-find
step and return 422 with:
  - error: "file upload not supported for external workspaces"
  - detail: explains why + points at admin/secrets workaround +
    references issue #2308 for the v0.2 native-support roadmap
  - runtime: "external" (machine-readable for clients)

Why 422 not 200/501:
- 422 = Unprocessable Entity — the request is well-formed but the
  workspace's runtime can't accept it. Standard REST semantics.
- 200 with empty result would lie; 501 implies the API itself is
  unimplemented (it's not — works for non-external workspaces); 503
  was the misleading status this PR fixes.

Verified via live E2E against localhost:
- Created `runtime=external,external=true` workspace
- Posted multipart to /workspaces/:id/chat/uploads
- Got 422 with the expected structured body

Unit test (`chat_files_external_test.go`) pins the contract via sqlmock
+ httptest. Notable: the handler is constructed with `templates: nil`
to prove the runtime check happens BEFORE any docker plumbing — if a
future change moves the check below findContainer, the test crashes
on nil-deref instead of silently regressing.

Out of scope (for v0.2 follow-up):
- Native external-workspace file ingest via artifacts table or the
  channel-plugin's inbox/ pattern. Requires separate design pass.

Closes #2308

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:37:49 -07:00
Hongming Wang
94d9331c76 feat(canvas+platform): chat attachments, model selection, deploy/delete UX
Session's accumulated UX work across frontend and platform. Reviewable
in four logical sections — diff is large but internally cohesive
(each section fixes a gap the next one depends on).

## Chat attachments — user ↔ agent file round trip

- New POST /workspaces/:id/chat/uploads (multipart, 50 MB total /
  25 MB per file, UUID-prefixed storage under
  /workspace/.molecule/chat-uploads/).
- New GET /workspaces/:id/chat/download with RFC 6266 filename
  escaping and binary-safe io.CopyN streaming.
- Canvas: drag-and-drop onto chat pane, pending-file pills,
  per-message attachment chips with fetch+blob download (anchor
  navigation can't carry auth headers).
- A2A flow carries FileParts end-to-end; hermes template executor
  now consumes attachments via platform helpers.

## Platform attachment helpers (workspace/executor_helpers.py)

Every runtime's executor routes through the same helpers so future
runtimes inherit attachment awareness for free:
- extract_attached_files — resolve workspace:/file:///bare URIs,
  reject traversal, skip non-existent.
- build_user_content_with_files — manifest for non-image files,
  multi-modal list (text + image_url) for images. Respects
  MOLECULE_DISABLE_IMAGE_INLINING for providers whose vision
  adapter hangs on base64 payloads (MiniMax M2.7).
- collect_outbound_files — scans agent reply for /workspace/...
  paths, stages each into chat-uploads/ (download endpoint
  whitelist), emits as FileParts in the A2A response.
- ensure_workspace_writable — called at molecule-runtime startup
  so non-root agents can write /workspace without each template
  having to chmod in its Dockerfile.

Hermes template executor + langgraph (a2a_executor.py) + claude-code
(claude_sdk_executor.py) all adopt the helpers.

## Model selection & related platform fixes

- PUT /workspaces/:id/model — was 404'ing, so canvas "Save"
  silently lost the model choice. Stores into workspace_secrets
  (MODEL_PROVIDER), auto-restarts via RestartByID.
- applyRuntimeModelEnv falls back to envVars["MODEL_PROVIDER"]
  so Restart propagates the stored model to HERMES_DEFAULT_MODEL
  without needing the caller to rehydrate payload.Model.
- ConfigTab Tier dropdown now reads from workspaces row, not the
  (stale) config.yaml — fixes "badge shows T3, form shows T2".

## ChatTab & WebSocket UX fixes

- Send button no longer locks after a dropped TASK_COMPLETE —
  `sending` no longer initializes from data.currentTask.
- A2A POST timeout 15 s → 120 s. LLM turns routinely exceed 15 s;
  the previous default aborted fetches while the server was still
  replying, producing "agent may be unreachable" on success.
- socket.ts: disposed flag + reconnectTimer cancellation + handler
  detachment fix zombie-WebSocket in React StrictMode.
- Hermes Config tab: RUNTIMES_WITH_OWN_CONFIG drops 'hermes' —
  the adaptor's purpose IS the form, banner was contradictory.
- workspace_provision.go auto-recovery: try <runtime>-default AND
  bare <runtime> for template path (hermes lives at the bare name).

## Org deploy/delete animation (theme-ready CSS)

- styles/theme-tokens.css — design tokens (durations, easings,
  colors). Light theme overrides by setting only the deltas.
- styles/org-deploy.css — animation classes + keyframes, every
  value references a token. prefers-reduced-motion respected.
- Canvas projects node.draggable=false onto locked workspaces
  (deploying children AND actively-deleting ids) — RF's
  authoritative drag lock; useDragHandlers retains a belt-and-
  braces check.
- Organ cancel button (red pulse pill on root during deploy)
  cascades via existing DELETE /workspaces/:id?confirm=true.
- Auto fit-view after each arrival, debounced 500 ms so rapid
  sibling arrivals coalesce into one fit (previous per-event
  fit made the viewport lurch continuously).
- Auto-fit respects user-pan — onMoveEnd stamps a user-pan
  timestamp only when event !== null (ignores programmatic
  fitView) so auto-fits don't self-cancel.
- deletingIds store slice + useOrgDeployState merge gives the
  delete flow the same dim + non-draggable treatment as deploy.
- Platform-level classNames.ts shared by canvas-events +
  useCanvasViewport (DRY'd 3 copies of split/filter/join).

## Server payload change

- org_import.go WORKSPACE_PROVISIONING broadcast now includes
  parent_id + parent-RELATIVE x/y (slotX/slotY) so the canvas
  renders the child at the right parent-nested slot without doing
  any absolute-position walk. createWorkspaceTree signature gains
  relX, relY alongside absX, absY; both call sites updated.

## Tests

- workspace/tests/test_executor_helpers.py — 11 new cases
  covering URI resolution (including traversal rejection),
  attached-file extraction (both Part shapes), manifest-only
  vs multi-modal content, large-image skip, outbound staging,
  dedup, and ensure_workspace_writable (chmod 777 + non-root
  tolerance).
- workspace-server chat_files_test.go — upload validation,
  Content-Disposition escaping, filename sanitisation.
- workspace-server secrets_test.go — SetModel upsert, empty
  clears, invalid UUID rejection.
- tests/e2e/test_chat_attachments_e2e.sh — round-trip against
  a live hermes workspace.
- tests/e2e/test_chat_attachments_multiruntime_e2e.sh — static
  plumbing check + round-trip across hermes/langgraph/claude-code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 13:27:51 -07:00