Make the Docker provisioner field symmetric with the existing
CPProvisionerAPI interface. This closes the asymmetry where the SaaS
provisioner field was an interface (mockable in tests) but the Docker
provisioner field was a concrete pointer (not mockable).
## Changes
- New ``provisioner.LocalProvisionerAPI`` interface — the 7 methods
WorkspaceHandler / TeamHandler call on h.provisioner today: Start,
Stop, IsRunning, ExecRead, RemoveVolume, VolumeHasFile,
WriteAuthTokenToVolume. Compile-time assertion confirms *Provisioner
satisfies it. Mirror of cp_provisioner.go's CPProvisionerAPI block.
- ``WorkspaceHandler.provisioner`` and ``TeamHandler.provisioner``
re-typed from ``*provisioner.Provisioner`` to
``provisioner.LocalProvisionerAPI``. Constructor parameter type is
unchanged — the assignment widens to the interface, so the 200+
callers of ``NewWorkspaceHandler`` / ``NewTeamHandler`` are
unaffected.
- Constructors gain an ``if p != nil`` guard before assigning to the
  interface field. Without this, ``NewWorkspaceHandler(..., nil, ...)``
  (the test fixture pattern across 200+ tests) yields a typed-nil
  interface value where ``h.provisioner != nil`` evaluates *true*,
  and the SaaS-vs-Docker fork incorrectly routes nil-fixture tests
  into the Docker code path. Documented inline with reference to
  the Go FAQ; sketched after this list.
- Hardened the 5 Provisioner methods that lacked nil-receiver guards
(Start, ExecRead, WriteAuthTokenToVolume, RemoveVolume,
VolumeHasFile) — return ErrNoBackend on nil receiver instead of
panicking on p.cli dereference. Symmetric with Stop/IsRunning
(already hardened in #1813). Defensive cleanup so a future caller
that bypasses the constructor's nil-elision still degrades
cleanly.
- Extended TestZeroValuedBackends_NoPanic with 5 new sub-tests
covering the newly-hardened nil-receiver paths. Defense-in-depth:
a future refactor that drops one of the nil-checks fails red here
before reaching production.
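A minimal sketch of the two mechanics referenced above — the
compile-time assertion (provisioner package) and the constructor's
nil-elision guard (handlers package); surrounding code is condensed,
not the shipped bodies:

    // provisioner package — mirror of the CPProvisionerAPI block: the
    // build breaks if *Provisioner stops satisfying the new interface.
    var _ LocalProvisionerAPI = (*Provisioner)(nil)

    // handlers package — nil-elision guard. Assigning a nil *Provisioner
    // straight into the interface field stores a typed nil, so
    // h.provisioner != nil evaluates true (Go FAQ) and the SaaS-vs-Docker
    // fork mis-routes nil-fixture tests.
    func NewWorkspaceHandler(p *provisioner.Provisioner /*, ... */) *WorkspaceHandler {
        h := &WorkspaceHandler{ /* ... */ }
        if p != nil {
            h.provisioner = p
        }
        return h
    }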
## Why now
- Provisioner orchestration has been touched in #2366 / #2368 — the
interface symmetry is the natural follow-up captured in #2369.
- Future work (CP fleet redeploy endpoint, multi-backend
provisioners) wants this in place. Memory note
``project_provisioner_abstraction.md`` calls out pluggable
backends as a north-star.
- Memory note ``feedback_long_term_robust_automated.md`` —
compile-time gates + ErrNoBackend symmetry > runtime panics.
## Verification
- ``go build ./...`` clean.
- ``go test ./...`` clean — 1300+ tests pass, including the
previously-flaky Create-with-nil-provisioner paths that now
exercise the constructor's nil-elision correctly.
- ``go test ./internal/provisioner/ -run TestZeroValuedBackends_NoPanic
-v`` — all 11 nil-receiver subtests green (was 6, +5 for the
newly-hardened methods).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration 043 (2026-04-25) introduced the workspace_status enum but
omitted two values application code had been writing for days, so every
UPDATE that tried to write either value failed silently in production:
'awaiting_agent' (since 2026-04-24, commit 1e8b5e01):
- handlers/workspace.go:333 — external workspace pre-register
- handlers/registry.go (via PR #2382) — liveness offline transition
- registry/healthsweep.go (via PR #2382) — heartbeat-staleness sweep
'hibernating' (since hibernation feature shipped):
- handlers/workspace_restart.go:271 — DB-level claim before stop
All four sites swallowed the enum-cast error. User-visible impact:
external workspaces never transition to a stale state when their agent
disconnects (canvas shows them stuck on 'online'/'degraded' indefinitely),
new external workspaces never advance past 'provisioning', and idle
workspaces never auto-hibernate (resources held forever).
PR #2382 didn't *cause* this — it inherited the gap and added two more
silent-fail paths on top. The pre-existing two had been broken for five
days and went unnoticed because:
1. sqlmock matches SQL by regex, not against the live enum constraint.
Every test passed despite the prod-only failure.
2. The handlers either drop the Exec error entirely (workspace.go:333)
or log+continue without an alert (the other three).
Fix in three pieces:
1. migrations/046_*.up.sql — ALTER TYPE workspace_status ADD VALUE
'awaiting_agent', 'hibernating'. IF NOT EXISTS makes it idempotent
across re-runs (RunMigrations re-applies until schema_migrations
records the file). ALTER TYPE ADD VALUE doesn't take a heavy lock
and commits immediately, safe under live traffic.
2. migrations/046_*.down.sql — full rename → recreate → cast → drop
recipe. Postgres has no DROP VALUE so this is the only honest
rollback. Pre-flights existing rows to compatible values
(awaiting_agent → offline, hibernating → hibernated) before the
type swap.
3. internal/db/workspace_status_enum_drift_test.go — static gate that
parses every UPDATE/INSERT against `workspaces` in workspace-server/
internal/, extracts every status literal, and asserts each is in
the enum union (CREATE TYPE + every ALTER TYPE ADD VALUE). The gate
runs in unit tests, no DB required, and would have caught both
omissions on the day they shipped. Pattern matches
feedback_behavior_based_ast_gates and feedback_mock_at_drifting_layer.
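A condensed sketch of the gate's mechanic — regex-based here for
brevity where the shipped gate parses statements; readMigrations and
readGoSources are hypothetical stand-ins for the file walking:

    func TestWorkspaceStatusEnumDrift(t *testing.T) {
        // Enum union: the CREATE TYPE list plus every ALTER TYPE ADD VALUE.
        enum := map[string]bool{}
        addVal := regexp.MustCompile(`ADD VALUE(?: IF NOT EXISTS)? '(\w+)'`)
        for _, sql := range readMigrations(t) { // hypothetical helper
            for _, m := range addVal.FindAllStringSubmatch(sql, -1) {
                enum[m[1]] = true
            }
            // ...same extraction over the CREATE TYPE value list...
        }
        // Every status literal the application writes must be a member.
        lit := regexp.MustCompile(`status\s*=\s*'(\w+)'`)
        for file, src := range readGoSources(t) { // hypothetical helper
            for _, m := range lit.FindAllStringSubmatch(src, -1) {
                if !enum[m[1]] {
                    t.Errorf("%s writes status %q, missing from enum", file, m[1])
                }
            }
        }
    }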
Verification:
- go test ./internal/db/ -count=1 -race ✓
- go vet ./... ✓
- Drift gate flips red if I delete either ADD VALUE from the migration
(validated via local mutation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "do request → check err → defer close → forward headers → set
status → io.Copy → log mid-stream errors" tail was duplicated between
Upload and Download. Each handler had ~12 lines that differed only in:
- the op label in log messages ("upload" vs "download")
- the set of response headers to forward verbatim
(Upload: Content-Type only; Download: Content-Type +
Content-Length + Content-Disposition)
Hoist into ChatFilesHandler.streamWorkspaceResponse(c, op,
workspaceID, forwardURL, req, forwardHeaders). Each call site
reduces to one line. Future changes — request-id forwarding,
observability metric, response-size cap, bytes-streamed log —
go in ONE place rather than two.
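A sketch of the hoisted tail under the signature named above —
gin-style context assumed, body condensed from the duplicated ~12
lines:

    func (h *ChatFilesHandler) streamWorkspaceResponse(c *gin.Context, op, workspaceID, forwardURL string, req *http.Request, forwardHeaders []string) {
        resp, err := h.httpClient.Do(req) // do request
        if err != nil {
            log.Printf("chat_files %s: forward to %s failed: %v", op, forwardURL, err)
            c.JSON(http.StatusBadGateway, gin.H{"error": "workspace unreachable"})
            return
        }
        defer resp.Body.Close()
        for _, name := range forwardHeaders { // Upload: Content-Type; Download: + Length/Disposition
            if v := resp.Header.Get(name); v != "" {
                c.Header(name, v)
            }
        }
        c.Status(resp.StatusCode)
        if _, err := io.Copy(c.Writer, resp.Body); err != nil {
            log.Printf("chat_files %s: mid-stream error for %s: %v", op, workspaceID, err)
        }
    }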
Same drift-prevention rationale as resolveWorkspaceForwardCreds
(#2372) and readOrLazyHealInboundSecret (#2376), applied to the
response-streaming layer of the same handlers.
Behavior preserved: existing TestChatUpload_* and TestChatDownload_*
integration tests (8 across both handlers) all pass unchanged. The
log message format is consistent across both handlers now (single
"chat_files {op}: ..." string template) — operators can grep one
prefix for both features instead of separate prefixes per handler.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Paired molecule-core change for the molecule-cli `molecule connect`
RFC (https://github.com/Molecule-AI/molecule-cli/issues/10).
After this PR an `external`-runtime workspace's full lifecycle
matches the operator-driven model: it boots in awaiting_agent, the
CLI connects in poll mode without operator-side flag tuning, the
heartbeat-loss path lands back on awaiting_agent (re-registrable)
instead of the terminal-feeling 'offline'.
Two changes in workspace-server:
1) `resolveDeliveryMode` (registry.go) now reads `runtime` alongside
`delivery_mode`. Resolution order:
a. payload.delivery_mode if non-empty (operator override)
b. row's existing delivery_mode if non-empty (preserves prior
registration)
c. **NEW:** "poll" if row.runtime = "external" — external
operators run on laptops without public HTTPS; push-mode
would hard-fail at validateAgentURL anyway. (`molecule connect`
registers without --mode and expects this default.)
d. "push" otherwise (historical default for platform-managed
runtimes — langgraph, hermes, claude-code, etc.)
2) Heartbeat-loss for external workspaces lands them in
`awaiting_agent` instead of `offline`. Two code paths:
- `liveness.go` — Redis TTL expiration. Uses a CASE expression
so the conditional is one UPDATE (no extra round-trip for
non-external runtimes, no TOCTOU between runtime read and
status write).
- `healthsweep.go::sweepStaleRemoteWorkspaces` — DB-side
last_heartbeat_at age scan. This sweep is already external-
only by query filter, so the UPDATE just hard-codes the new
status.
The Docker-side `sweepOnlineWorkspaces` keeps `offline` —
recovery there is "restart the container", not "re-register from
the operator's box".
Why awaiting_agent over offline for external:
- Matches the status the workspace was created in (workspace.go:333).
- The CLI re-registers on every invocation; awaiting_agent → online
is the natural transition. offline is a terminal-feeling status
that implies operator intervention is needed.
- An operator who closed their laptop overnight should see
awaiting_agent in canvas, not 'offline (something is wrong)'.
Test plan:
- Existing: 9 `resolveDeliveryMode` test sites updated to the new
query shape. Sqlmock now reads `delivery_mode, runtime` columns.
- New: TestRegister_ExternalRuntime_DefaultsToPoll asserts the
external→poll branch. TestRegister_NonExternalRuntime_StillDefaultsToPush
guards against the new branch overshooting (langgraph keeps push).
- Liveness: regex updated to match the CASE expression.
- Healthsweep: `TestSweepStaleRemoteWorkspaces_MarksStaleAwaitingAgent`
(renamed for grep-ability), Docker-side sweepOnlineWorkspaces test
unchanged (verified to still match `'offline'`).
- Full handlers + registry suite green under -race (12.873s + 2.264s).
No migration needed — `status` is a free-form text column; both
'offline' and 'awaiting_agent' are existing values used elsewhere
(workspace.go uses awaiting_agent on initial external creation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The helper landed in #2376 and is exercised via chat_files + registry
integration tests. Those tests conflate the helper's behavior with the
caller's response shape — a future refactor that broke the (secret,
healed, err) contract subtly (e.g. returning healed=true on a
read-success path, or swallowing a mint error) might still pass them.
Adds 4 direct sub-tests pinning each branch of the contract:
- secret already present → (s, false, nil)
- secret missing, mint succeeds → (minted, true, nil)
- secret missing, mint fails → ("", false, err)
- read fails (non-NoInboundSecret) → ("", false, err)
Each sub-case asserts the return tuple shape AND mock.ExpectationsWereMet
(for the success path) so a future helper change that skips a DB op
trips the gate immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lazy-heal-on-miss pattern landed in two places this session:
PR #2372 (chat_files.go::resolveWorkspaceForwardCreds — Upload + Download)
and PR #2375 (registry.go::Register). Both implementations did the same
thing:
read → if ErrNoInboundSecret then mint inline → return outcome
Different response-shape requirements but the same core mechanic. Three
sites' worth of drift potential: any future heal-time condition we add
(audit log, alert, secret rotation, observability) had to be applied to
each site, with partial application silently re-opening the gap.
Fix: extract readOrLazyHealInboundSecret in workspace_provision_shared.go
returning (secret, healed, err). Each caller maps the outcome to its
response shape:
- chat_files: healed=true → 503 with retry hint; err != nil → 503 with
RFC-#2312 reprovision hint
- registry: healed=true|false + err==nil → include in response;
err != nil → omit field (workspace can retry on next register)
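A sketch of the extracted mechanic, assuming the wsauth read/mint pair
from earlier in this stack (exact signatures may differ):

    func readOrLazyHealInboundSecret(ctx context.Context, workspaceID string) (string, bool, error) {
        s, err := wsauth.ReadPlatformInboundSecret(ctx, workspaceID)
        if err == nil {
            return s, false, nil // secret already present
        }
        if !errors.Is(err, wsauth.ErrNoInboundSecret) {
            return "", false, err // read failure, not a miss
        }
        minted, err := wsauth.IssuePlatformInboundSecret(ctx, workspaceID)
        if err != nil {
            return "", false, err // mint failure
        }
        return minted, true, nil // healed on miss
    }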
Net effect:
- Single source of truth for the read+heal mechanic
- Response-shape decisions stay in callers (they DO differ per feature)
- Future heal-time conditions go in one place
- Behavior preserved: existing TestRegister_NoInboundSecret_LazyHeals,
TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField,
TestChatUpload_NoInboundSecret_LazyHeal*,
TestChatDownload_NoInboundSecret_LazyHeal* all pass unchanged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix: a legacy SaaS workspace with NULL platform_inbound_secret
needed two round-trips before chat upload worked:
1. Workspace registers → response missing platform_inbound_secret
2. User attempts chat upload → chat_files lazy-heals platform-side
(RFC #2312 backfill) → 503 + retry-after
3. Workspace heartbeats → register response now includes the
freshly-minted secret → workspace writes /configs/.platform_inbound_secret
4. User retries chat upload → workspace bearer matches → 200
The platform-side lazy-heal in chat_files.go (#2366) closes the
existing-workspace gap, but the user-visible round-trip dance is
still ugly.
Fix: lazy-heal at register time too. When ReadPlatformInboundSecret
returns ErrNoInboundSecret, mint inline and include the freshly-
minted secret in the register response. Collapses the dance to a
single round-trip:
1. Workspace registers → response includes lazy-healed secret
2. User attempts chat upload → workspace bearer matches → 200
Failure model: best-effort. Mint failure logs and falls through to
omitting the field (workspace will retry on next register call).
The 200 response status is preserved — register success doesn't
hinge on the inbound-secret heal.
Tests:
- TestRegister_NoInboundSecret_LazyHeals: pins the success branch.
Mocks the UPDATE explicitly + asserts ExpectationsWereMet, so a
regression that skipped the mint would fail loudly. Replaces
the prior TestRegister_NoInboundSecret_OmitsField which
"passed" on this branch only because sqlmock-unmatched-UPDATE
coincidentally drove the omit-field error path.
- TestRegister_NoInboundSecret_LazyHealMintFailureOmitsField:
pins the failure branch — explicit UPDATE error → 200 + field
absent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ValidateToken, WorkspaceFromToken, and ValidateAnyToken each duplicated
the same JOIN+WHERE auth predicate:
FROM workspace_auth_tokens t
JOIN workspaces w ON w.id = t.workspace_id
WHERE t.token_hash = $1
AND t.revoked_at IS NULL
AND w.status != 'removed'
Same drift class as the SaaS provision-mint bug fixed in #2366. A
future safety addition (e.g. exclude paused workspaces from auth) had
to be applied to all three queries; a partial application would
silently re-open one auth path while closing the others.
Fix: hoist the predicate into lookupTokenByHash, which projects
(id, workspace_id) — the union of fields any caller needs. Each
public function picks what it uses:
- ValidateToken — needs both (compares workspaceID, updates last_used_at by id)
- WorkspaceFromToken — needs workspace_id
- ValidateAnyToken — needs id
The trivial perf cost of selecting one extra column per call is worth
the single-source-of-truth guarantee for the auth predicate.
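A sketch of the hoisted lookup (projection per this commit; query text
condensed from the block above):

    func lookupTokenByHash(ctx context.Context, tokenHash string) (tokenID, workspaceID string, err error) {
        err = db.DB.QueryRowContext(ctx, `
            SELECT t.id, t.workspace_id
            FROM workspace_auth_tokens t
            JOIN workspaces w ON w.id = t.workspace_id
            WHERE t.token_hash = $1
              AND t.revoked_at IS NULL
              AND w.status != 'removed'`, tokenHash).Scan(&tokenID, &workspaceID)
        return
    }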
Test mock updates: three upstream test files (a2a_proxy_test, middleware
wsauth_middleware_test{,_canvasorbearer_test}) had hand-typed regex
matchers and row shapes pinned to the per-function SELECT projection.
Updated to the unified shape; behavior is unchanged.
All wsauth + middleware + handlers + full-module tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The admin test-token endpoint has a critical security check at
admin_test_token.go:64-72 — the IDOR fix from #112 that requires an
explicit ADMIN_TOKEN bearer when the env var is set. Pre-fix, the
route accepted ANY bearer that matched a live org token, allowing
cross-org test-token minting (and therefore cross-org workspace
authentication). The current code uses subtle.ConstantTimeCompare
against ADMIN_TOKEN.
Test coverage of this gate was zero. The existing tests exercised the
ADMIN_TOKEN-unset path (local dev / CI) but never set ADMIN_TOKEN.
A regression that:
- removed the os.Getenv("ADMIN_TOKEN") check
- inverted the comparison
- replaced ConstantTimeCompare with bytes.Equal (timing leak)
- re-introduced the AdminAuth fallback that allows org tokens
would not fail any test, and the breakage would re-open the IDOR
that #112 closed.
Adds four tests covering the gate matrix:
- ADMIN_TOKEN set + no Authorization header → 401
- ADMIN_TOKEN set + wrong Authorization → 401
- ADMIN_TOKEN set + correct Authorization → 200
- ADMIN_TOKEN unset + no Authorization → 200 (gate bypassed safely)
The 4-row matrix pins the gate's full truth table: any regression in
either dimension (gate enabled/disabled, header correct/wrong) trips
exactly one test.
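The gate under test, sketched (gin-style handler wiring assumed; the
constant-time compare is per admin_test_token.go):

    // Enforced only when ADMIN_TOKEN is set; unset (local dev / CI)
    // bypasses the gate, matching the fourth matrix row.
    if admin := os.Getenv("ADMIN_TOKEN"); admin != "" {
        bearer := strings.TrimPrefix(c.GetHeader("Authorization"), "Bearer ")
        if subtle.ConstantTimeCompare([]byte(bearer), []byte(admin)) != 1 {
            c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "unauthorized"})
            return
        }
    }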
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 50-line "resolve URL + read inbound secret + lazy-heal on miss"
block was duplicated nearly verbatim between Upload and Download
handlers. Drift-prone — same class of risk as the original SaaS
provision drift fixed in #2366. A future change like:
- secret rotation (re-mint when the row's older than X)
- per-feature audit logging
- additional fail-closed conditions
would have to be applied to both handlers, and a partial application
that healed Upload but skipped Download would surface only at runtime.
Fix: hoist the shared logic into resolveWorkspaceForwardCreds. The
function takes an op label ("upload"/"download") used in log messages
+ the 503 RFC-#2312 detail copy so operators can still distinguish
which feature ran. Both handlers reduce to:
wsURL, secret, ok := resolveWorkspaceForwardCreds(c, ctx, workspaceID, "upload")
if !ok { return }
Net -20 lines (helper amortizes the 50-line block across both call
sites). Existing test coverage (TestChatUpload_NoInboundSecret_*,
TestChatDownload_NoInboundSecret_* from PR #2370) covers all four
branches of the shared helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#2367 moved PARENT_ID env injection from inline TeamHandler.Expand
into the shared prepareProvisionContext (sourced from
payload.ParentID). The test was missing — a regression that:
- dropped the injection
- inverted the nil-check
- leaked an empty PARENT_ID="" into env
would not fail any existing test, but workspace/coordinator.py reads
PARENT_ID on startup to track parent-child relationship, so the
breakage would surface only at runtime.
Adds TestPrepareProvisionContext_ParentIDInjection with three
sub-cases:
- nil ParentID → no PARENT_ID env
- empty-string ParentID → no PARENT_ID env (don't pollute)
- set ParentID → PARENT_ID env equals value
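The injection being pinned, sketched (env-map shape illustrative):

    // Inside prepareProvisionContext: inject only when ParentID is set
    // AND non-empty — an empty PARENT_ID="" must not leak into env.
    if payload.ParentID != nil && *payload.ParentID != "" {
        cfg.Env["PARENT_ID"] = *payload.ParentID
    }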
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2026-04-30 lazy-heal fix in chat_files.go (PR #2366) ATTEMPTS to
mint platform_inbound_secret on miss so legacy workspaces self-heal
without requiring destructive reprovision. The pre-existing
TestChatUpload_NoInboundSecret + TestChatDownload_NoInboundSecret
tests asserted the 503 response shape but did NOT pin that the mint
UPDATE actually fires — they happened to exercise the mint-failure
branch (sqlmock unmatched UPDATE = error = "Failed to mint" code path
returns 503 with "RFC #2312" detail, which still passed the original
assertions).
This means a regression that:
- skipped the lazy-heal mint entirely
- inverted the success/failure response branches
- moved the mint to a different code path
would not fail those tests.
Fix:
- TestChatUpload_NoInboundSecret_LazyHeal: mock the UPDATE
successfully; assert sqlmock.ExpectationsWereMet (mint MUST run)
+ body contains "retry" + "30" (success branch).
- TestChatUpload_NoInboundSecret_LazyHealFailure: mock the UPDATE
to fail; assert body contains "Reprovision" (failure branch).
- Same pair for the Download handler — independent code path means
independent test.
Pins both branches of both handlers (4 tests) so future drift trips
the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2367.
TeamHandler.Expand provisioned child workspaces by directly calling
h.provisioner.Start, skipping mintWorkspaceSecrets and every other
preflight (secrets load, env mutators, identity injection, missing-env,
empty-config-volume auto-recover). Children shipped with NULL
platform_inbound_secret + never-issued auth_token — same drift class as
the SaaS bug just fixed in PR #2366, found while exercising a stronger
gate against this package.
Fix:
- TeamHandler now holds *WorkspaceHandler. Expand delegates each child
provision to wh.provisionWorkspace, picking up the shared
prepare/mint/preflight pipeline automatically. Future provision-time
steps go in ONE place and team-expand inherits them.
- prepareProvisionContext gains PARENT_ID env injection sourced from
payload.ParentID (which Expand now populates). This preserves the
signal workspace/coordinator.py reads on startup, without threading
env through provisioner.WorkspaceConfig manually.
- NewTeamHandler signature gains *WorkspaceHandler; router passes it.
Gate upgrade:
- TestProvisionFunctions_AllCallMintWorkspaceSecrets is now
behavior-based: it walks every FuncDecl in the package and flags any
function that calls h.provisioner.Start or h.cpProv.Start without
also calling mintWorkspaceSecrets. Drift-resistant by construction —
a future provision function with any name still trips the gate.
- Replaces the name-list version from PR #2366. The name list missed
Expand precisely because Expand wasn't named provision*; the
behavior-based detector caught it spontaneously when prototyped.
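The detector's core, sketched with go/ast (the shipped gate adds the
exempt list and the remediation message):

    // callsIdent reports whether fn's body contains a call whose final
    // identifier is name — matching both mintWorkspaceSecrets(...) and
    // h.provisioner.Start(...).
    func callsIdent(fn *ast.FuncDecl, name string) bool {
        found := false
        ast.Inspect(fn, func(n ast.Node) bool {
            call, ok := n.(*ast.CallExpr)
            if !ok {
                return true
            }
            switch f := call.Fun.(type) {
            case *ast.Ident:
                found = found || f.Name == name
            case *ast.SelectorExpr:
                found = found || f.Sel.Name == name
            }
            return true
        })
        return found
    }
    // Per FuncDecl: callsIdent(fn, "Start") without
    // callsIdent(fn, "mintWorkspaceSecrets") → flag, unless exempted.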
Tests: full workspace-server module green; gate previously verified to
fire red on Expand pre-fix and on deliberate mintWorkspaceSecrets
removal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of 2026-04-30 silent-503 chat-upload bug: provisionWorkspaceCP
(SaaS) skipped issueAndInjectInboundSecret while provisionWorkspaceOpts
(Docker) called it. Every prod SaaS workspace provisioned with NULL
platform_inbound_secret → upload returned 503 with the v2-enrollment
message on every attempt.
Structural fix:
- Extract prepareProvisionContext (secrets load, env mutators, preflight,
cfg build), mintWorkspaceSecrets (auth_token + platform_inbound_secret),
markProvisionFailed (broadcast + DB update) into workspace_provision_shared.go
- Refactor both provision modes to call the shared helpers
- Add provisionAbort struct so the missing-env failure class can carry its
structured "missing" payload through the shared abort path
- Unify last_sample_error: previously the decrypt-fail path skipped it while
others set it; users now see every failure class in the UI
Drift prevention:
- AST gate TestProvisionFunctions_AllCallMintWorkspaceSecrets asserts every
function in the provisionFunctions set calls mintWorkspaceSecrets at least
once (same shape as the audit-coverage gate from #335). New provision paths
must either call mint or be added to provisionExemptFunctions with a
one-line justification
- Behavioral test TestMintWorkspaceSecrets_PersistsInboundSecretInSaaSMode
pins the contract: SaaS mode MUST persist platform_inbound_secret to the DB
column even though it skips file injection
Existing-workspace recovery (chat_files.go lazy-heal):
- Upload + Download handlers detect NULL platform_inbound_secret and call
IssuePlatformInboundSecret inline, returning 503 with retry_after_seconds=30
- Self-heals workspaces that were provisioned before this fix without
requiring destructive reprovision
Tests: full handlers + workspace-server module green; AST gate verified to
fire red on deliberate violation (commented-out mint call surfaces the
exact function name + actionable remediation message).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review on PR #2362 caught: the dead-agent classifier at
a2a_proxy.go included 502/503/504/524 but missed the rest of the CF
origin-failure family (521/522/523), which are MORE indicative of a
dead EC2 than 524:
- 521 "Web server is down" — CF can't open TCP to origin (most direct
dead-EC2 signal; fires when the workspace EC2 has been terminated
and CF still has the CNAME pointing at it).
- 522 "Connection timed out" — TCP didn't complete in ~15s (typical
of SG/NACL flap or agent process hung on accept).
- 523 "Origin is unreachable" — CF can't route to origin (DNS gone,
network path broken).
Pre-fix any of these would propagate as-is to the canvas and the user
would see a 5xx without the reactive auto-restart firing — exactly
the SaaS-blind class of failure PR #2362 was meant to close.
Refactor: extracted isUpstreamDeadStatus(int) helper so the matrix is
in one place, with TestIsUpstreamDeadStatus locking in 18 status
codes (7 dead, 11 not-dead including 520 and 525 which look CF-shaped
but indicate different failures).
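The extracted matrix, sketched from the codes listed above:

    func isUpstreamDeadStatus(code int) bool {
        switch code {
        case http.StatusBadGateway, // 502
            http.StatusServiceUnavailable, // 503
            http.StatusGatewayTimeout, // 504
            521, 522, 523, 524: // CF origin-failure family (no stdlib names)
            return true
        }
        return false // incl. 520/525: CF-shaped but different failures
    }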
Also tightened TestStopForRestart_NoProvisioner_NoOp per the same
review: now uses sqlmock.ExpectationsWereMet to assert the dispatcher
doesn't touch the DB on the both-nil path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Backward-compat replay gate for the A2A JSON-RPC protocol surface.
Every PR that touches normalizeA2APayload OR bumps the a2a-sdk
version pin runs every shape in testdata/a2a_corpus/ through the
current code and asserts:
valid/ — every shape MUST parse without error and produce a
canonical v0.3 payload (params.message.parts list).
invalid/ — every shape MUST be rejected with the documented
status code and error substring.
What this prevents
The 2026-04-29 v0.2 → v0.3 silent-drop bug (PR #2349) shipped
because the SDK bump PR didn't replay v0.2-shaped inputs against
the new code; the shape-mismatch surfaced only in production when
the receiver's Pydantic validator silently rejected inbound
messages.
This gate would have caught it pre-merge. Hand-verified: reverting
the v0.2 string→parts shim in normalizeA2APayload fails 3 of the
v0.2 corpus entries with the exact rejection class the production
bug exhibited.
Corpus contents (14 entries)
valid/ (11):
v0_2_string_content — basic v0.2 (the broken case)
v0_2_string_content_no_message_id — v0.2 + auto-fill messageId
v0_2_list_content — v0.2 with content as Part list
v0_3_parts_text_only — canonical v0.3
v0_3_parts_multi_text — multi-Part list
v0_3_parts_with_file — multimodal (text + file)
v0_3_parts_with_context — contextId for multi-turn
v0_3_streaming_method — message/stream variant
v0_3_unicode_text — emoji + multi-script
v0_3_long_text — 10KB text Part
no_jsonrpc_envelope — bare params/method without
outer envelope (legacy senders)
invalid/ (3):
no_content_or_parts — message has neither field
content_is_integer — wrong type for v0.2 content
content_is_bool — wrong type, separate from int
so the failure msg identifies
which type-class regressed
Plus 4 inline malformed-JSON cases (truncated, not-JSON, empty,
whitespace) that can't be expressed as JSON corpus entries.
Coverage tests
The gate has 5 test functions:
1. TestA2ACorpus_ValidShapesParse — replay valid/ corpus,
assert no error + canonical v0.3 output (parts list non-empty,
messageId non-empty, content field deleted).
2. TestA2ACorpus_InvalidShapesRejected — replay invalid/ corpus,
assert rejection matches recorded status + error substring.
3. TestA2ACorpus_MalformedJSONRejected — inline cases for
non-parseable bodies.
4. TestA2ACorpus_HasMinimumCoverage — at least one v0.2 +
   one v0.3 entry exists (neither side of the bridge gets lost).
5. TestA2ACorpus_EveryEntryHasMetadata — _comment/_added/_source
on every entry per the README policy; _expect_error and
_expect_status on invalid entries.
Documentation
testdata/a2a_corpus/README.md describes the corpus contract:
- When to add entries (new SDK shape, new production-observed
shape).
- When NOT to add (test scaffolding, hypothetical futures).
- Removal policy (breaking change, deprecation window required).
Verification
- All 24 corpus subtests pass on current main.
- Hand-test: revert the v0.2 compat shim → 3 v0.2 entries fail
the gate with the exact rejection class the production bug
exhibited. Confirmed.
- Whole-module go test ./... green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses code-review C1 (test goroutine race) and I2 (CF 524) on PR #2362.
C1: TestRunRestartCycle_SaaSPath_DispatchesViaCPProv invoked runRestartCycle
end-to-end, which spawns `go h.sendRestartContext(...)`. That goroutine
outlived the test, then read db.DB while the next test's setupTestDB wrote
to it — DATA RACE under -race, cascading 30+ failures across the handlers
suite. Refactored: extracted `stopForRestart(ctx, id)` from runRestartCycle
as a pure dispatcher, and rewrote the SaaS-path test to call it directly
(no async goroutine spawned). Added a no-provisioner no-op guard test.
I2: Cloudflare 524 ("origin timed out") now triggers maybeMarkContainerDead
alongside 502/503/504. Same upstream signal — origin agent unresponsive.
Verified `go test -race -count=1 ./internal/handlers/...` green locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent review of #2362 caught a Critical gap: the previous commit
fixed the Stop dispatch in runRestartCycle but left the provisionWorkspace
dispatch unconditionally Docker-only. So on SaaS the auto-restart cycle
would Stop the EC2 successfully (good), then NPE inside provisionWorkspace's
`h.provisioner.VolumeHasFile` call. coalesceRestart's recover()-without-
re-raise (a deliberate platform-stability safeguard) silently swallowed
the panic, leaving the workspace permanently stuck in status='provisioning'
because the UPDATE on workspace_restart.go:450 had already run.
Net pre-amendment effect on SaaS: dead agent → structured 503 (good) →
workspace flipped to 'offline' (good) → cpProv.Stop succeeded (good) →
provisionWorkspace NPE swallowed (bad) → workspace permanently
'provisioning' until manual canvas restart. The headline claim of #2362
("SaaS auto-restart now works") was false on the path it shipped.
Fix: dispatch the reprovision call the same way every other call site
in the package does (workspace.go:431-433, workspace_restart.go:197+596) —
branch on `h.cpProv != nil` and call provisionWorkspaceCP for SaaS,
provisionWorkspace for Docker.
Tests:
- New TestRunRestartCycle_SaaSPath_DispatchesViaCPProv asserts cpProv.Stop
is called when the SaaS path runs (would have caught the NPE if
provisionWorkspace had been called instead).
- fakeCPProv updated: methods record calls and return nil/empty by
default rather than panicking. The previous "panic on unexpected call"
pattern was unsafe — the panic fires on the async restart goroutine
spawned by maybeMarkContainerDead AFTER the test assertions ran, so
the test passed by accident even though the production path was
broken (which is exactly how the Critical bug landed).
- Existing tests still pass (full handlers + provisioner suites green).
Branch-count audit refresh:
runRestartCycle dispatch decisions:
1. h.provisioner != nil → provisioner.Stop + provisionWorkspace ✓ (existing tests)
2. h.cpProv != nil → cpProv.Stop + provisionWorkspaceCP ✓ (NEW test)
3. both nil → coalesceRestart never called (RestartByID gate) ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Class-of-bugs fix, surfaced when hongmingwang.moleculesai.app's canvas
chat to a dead workspace returned a generic Cloudflare 502 page on
2026-04-30. Three independent gaps in the reactive-health path together
leak dead-agent failures to canvas with no auto-recovery.
## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants
`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker
provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner)
and leave `h.provisioner` nil — so the function early-returned false
on every call and dead EC2 agents never triggered the offline-flip /
broadcast / restart cascade.
Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id)
(bool, error)` (already implemented on `*CPProvisioner`; just needs
to surface on the interface). `maybeMarkContainerDead` now branches:
local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses
`h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status`
endpoint to read the EC2 state.
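The dispatch, sketched (helper name hypothetical; the interface method
is per this commit):

    // maybeMarkContainerDead's liveness probe now branches on which
    // provisioner is wired instead of assuming local Docker.
    func (h *WorkspaceHandler) agentIsRunning(ctx context.Context, id string) (bool, error) {
        switch {
        case h.provisioner != nil: // local-Docker path
            return h.provisioner.IsRunning(ctx, id)
        case h.cpProv != nil: // SaaS path: CP /cp/workspaces/:id/status
            return h.cpProv.IsRunning(ctx, id)
        }
        return false, nil // neither wired: nothing to probe
    }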
## Bug 2 — RestartByID short-circuits on `h.provisioner == nil`
Same shape as Bug 1: the auto-restart cascade triggered by
`maybeMarkContainerDead` calls `RestartByID` which short-circuited
when the local Docker provisioner was missing. So even if Bug 1 were
fixed, the workspace-offline state would never recover.
Fix: change the gate to `h.provisioner == nil && h.cpProv == nil`
and update `runRestartCycle` to branch on which provisioner is
wired for the Stop call. (The HTTP `Restart` handler already does
this branching correctly — we're just bringing the auto-restart path
to parity.)
## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare
When the agent's tunnel returns 5xx (the "tunnel up but no origin"
shape — agent process dead but cloudflared connection still healthy),
`dispatchA2A` returns successfully at the HTTP layer with a 5xx body.
`handleA2ADispatchError`'s reactive-health path doesn't run because
that path is only triggered on transport-level errors. The pre-fix
code propagated the 502 status to canvas; Cloudflare in front of the
platform then masked the 502 with its own opaque "error code: 502"
page, hiding any structured response and any Retry-After hint.
Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run
`maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms
the agent is dead → return a structured 503 with restarting=true +
Retry-After (CF doesn't mask 503s the same way). If running, propagate
the original status (don't recycle a healthy agent on a transient
hiccup — it might have legitimately returned 502).
## Drive-by — a2aClient transport timeouts
a2aClient was `&http.Client{}` with no Transport timeouts. When a
workspace's EC2 black-holes TCP connects (instance terminated mid-flight,
SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS —
long enough for Cloudflare's ~100s edge timeout to fire first and
surface a generic 502. Added DialContext (10s connect), TLSHandshake
(10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY
unset — that would pre-empt slow-cold-start flows (Claude Code OAuth
first-token, multi-minute agent synthesis). Long-tail body streaming
is still governed by per-request context deadline.
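The client construction implied above, sketched:

    a2aClient := &http.Client{
        Transport: &http.Transport{
            DialContext: (&net.Dialer{
                Timeout: 10 * time.Second, // TCP connect: fail fast on black-holed EC2s
            }).DialContext,
            TLSHandshakeTimeout:   10 * time.Second,
            ResponseHeaderTimeout: 60 * time.Second, // first response byte
        },
        // Timeout deliberately unset: a whole-request deadline would
        // pre-empt slow cold-start flows; bodies are bounded by the
        // per-request context instead.
    }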
## Tests
- `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) →
marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) →
no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server
returns 502 + cpProv reports dead → caller gets 503 with restarting=
true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream
502 but cpProv reports running → propagates 502 (existing behavior;
safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` /
`TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.
## Impact
Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no
auto-recovery, manual restart from canvas required.
Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with
restarting=true + Retry-After to canvas, workspace flipped to offline,
auto-restart cycle triggered. Canvas can show a user-actionable
"agent is restarting, please wait" message instead of a generic 502.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Telegram getUpdates / Slack RTM shape: poll-mode workspaces pass the id
of the last activity_logs row they consumed; the server returns rows
strictly after it in chronological (ASC) order. Existing callers that don't
pass since_id keep DESC + most-recent-N — backwards-compatible.
Cursor lookup is scoped by workspace_id so a caller cannot enumerate or
peek at another workspace's events by passing a UUID belonging to a
different workspace. Cross-workspace and pruned cursors both return
410 Gone — no information leak (caller cannot distinguish "row never
existed" from "row exists but you can't see it").
since_id + since_secs both apply (AND). When since_id is set the order
flips to ASC because polling consumers need recorded-order; the recent-
feed shape (no since_id) keeps DESC.
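The two-step shape, sketched as query constants (column lists
illustrative; the shipped queries also apply since_secs and a LIMIT):

    // Step 1 — scoped cursor lookup. Cross-workspace and pruned ids both
    // match no row, collapsing into the same 410 with no information leak.
    const cursorLookup = `
        SELECT created_at FROM activity_logs
        WHERE id = $1 AND workspace_id = $2`

    // Step 2 — rows strictly after the cursor, in recorded order.
    const pageAfterCursor = `
        SELECT id, created_at, event FROM activity_logs
        WHERE workspace_id = $1 AND created_at > $2
        ORDER BY created_at ASC`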
Tests:
- TestActivityHandler_SinceID_ReturnsNewerASC — cursor lookup → main
query with cursorTime + ASC ordering.
- TestActivityHandler_SinceID_CursorNotFound_410 — pruned/unknown cursor.
- TestActivityHandler_SinceID_CrossWorkspaceCursor_410 — UUID belongs to
another workspace, scoped lookup hides it (same 410 path, no leak).
- TestActivityHandler_SinceID_CombinedWithSinceSecs — placeholder index
arithmetic with both filters.
Stacked on #2353 (PR 2: poll-mode short-circuit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip SSRF/dispatch and queue to activity_logs for delivery_mode=poll
workspaces. The polling agent (e.g. molecule-mcp-claude-channel on an
operator's laptop) consumes via GET /activity?since_id= in PR 3 — no
public URL needed.
Order: budget -> normalize -> lookupDeliveryMode short-circuit ->
resolveAgentURL. Normalizing before the short-circuit keeps the
JSON-RPC method name on the activity_logs row so the polling agent
can dispatch correctly.
Fail-closed-to-push: any DB error reading delivery_mode defaults to
push (loud + recoverable) rather than poll (silent drop).
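The ordering, sketched (helper names and signatures condensed from this
commit; response shape per the first test below):

    func (h *Handler) proxyA2A(c *gin.Context, workspaceID string, body []byte) {
        ctx := c.Request.Context()
        normalized := normalizeA2APayload(body) // before the short-circuit,
                                                // so the method lands on the row
        mode, err := lookupDeliveryMode(ctx, workspaceID)
        if err != nil {
            mode = "push" // fail closed to push: loud + recoverable
        }
        if mode == "poll" {
            recordActivityLog(ctx, workspaceID, normalized) // GET /activity?since_id= consumes
            c.JSON(http.StatusOK, gin.H{"status": "queued", "delivery_mode": "poll"})
            return
        }
        // push path: resolveAgentURL's SSRF checks + dispatch run only here
        dispatch(ctx, resolveAgentURL(ctx, workspaceID), normalized)
    }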
Tests:
- TestProxyA2A_PollMode_ShortCircuits_NoSSRF_NoDispatch — core invariant:
no resolveAgentURL, no Do(), records to activity_logs, returns 200
{status:"queued",delivery_mode:"poll",method:"message/send"}.
- TestProxyA2A_PushMode_NoShortCircuit — push path unaffected; the agent
server actually receives the request.
- TestProxyA2A_PollMode_FailsClosedToPush — DB error on mode lookup
must NOT silently queue; falls through to the push path.
Stacked on #2348 (PR 1: schema + register flow).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hard gate #4: codified module boundaries as Go tests, so a new
contributor (or AI agent) can't silently land an import that crosses
a layer.
Boundaries enforced (one architecture_test.go per package):
- wsauth has no internal/* deps — auth leaf, must be unit-testable in
isolation
- models has no internal/* deps — pure-types leaf, reverse dep would
create cycles since most packages depend on models
- db has no internal/* deps — DB layer below business logic, must be
testable with sqlmock without spinning up handlers/provisioner
- provisioner does not import handlers or router — unidirectional
layering: handlers wires provisioner into HTTP routes; the reverse
is a cycle
Each test parses .go files in its package via go/parser (no x/tools
dep needed) and asserts forbidden import paths don't appear. Failure
messages name the rule, the offending file, and explain WHY the
boundary exists so the diff reviewer learns the rule.
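One boundary test's shape, sketched (rule text and import-path matching
condensed; the shipped tests name the rule per package):

    func TestWsauthHasNoInternalDeps(t *testing.T) {
        fset := token.NewFileSet()
        pkgs, err := parser.ParseDir(fset, ".", nil, parser.ImportsOnly)
        if err != nil {
            t.Fatal(err)
        }
        for _, pkg := range pkgs {
            for file, f := range pkg.Files {
                for _, imp := range f.Imports {
                    path := strings.Trim(imp.Path.Value, `"`)
                    if strings.Contains(path, "/internal/") { // forbidden dep
                        t.Errorf("%s imports %s — wsauth is an auth leaf and must stay unit-testable in isolation", file, path)
                    }
                }
            }
        }
    }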
Note: the original issue's first two proposed boundaries
(provisioner-no-DB, handlers-no-docker) don't match the codebase
today — provisioner already imports db (PR #2276 runtime-image
lookup) and handlers hold *docker.Client directly (terminal,
plugins, bundle, templates). I picked the four boundaries that
actually hold; the first two are aspirational and would need a
refactor before they could be codified.
Hand-tested by injecting a deliberate wsauth -> orgtoken violation:
the gate fires red with the rule message before merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #2345.
## Symptom
Design Director silently dropped A2A briefs whose sender used the
v0.2 message format (`params.message.content` string) instead of v0.3
(`params.message.parts` part-list). The downstream a2a-sdk's v0.3
Pydantic validator rejected with "params.message.parts — Field
required" but the rejection only landed in tenant-side logs; the
sender saw HTTP 200/202 and assumed delivery.
UX Researcher therefore never received the kickoff. Multi-agent
pipeline silently idle.
## Fix
Convert at the proxy edge in normalizeA2APayload. Two cases handled,
one explicitly rejected:
v0.2 string content → wrap as [{kind: text, text: <content>}]
(the canonical v0.2 case from the dogfooding
report)
v0.2 list content → preserve list as parts (some older clients
put a list under `content`; treat as "client
meant parts, used wrong field name")
v0.3 parts present → no-op (hot path for normal traffic)
Neither present → return HTTP 400 with structured JSON-RPC
error pointing at the missing field
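The conversion, sketched over a decoded JSON map (jsonRPCError is a
hypothetical stand-in for the structured-400 writer):

    func normalizeMessageShape(msg map[string]any) error {
        switch content := msg["content"].(type) {
        case string: // v0.2 canonical: wrap as a single text part
            msg["parts"] = []any{map[string]any{"kind": "text", "text": content}}
            delete(msg, "content")
        case []any: // older clients: part list under the wrong field name
            msg["parts"] = content
            delete(msg, "content")
        case nil: // no content field — v0.3 parts must already be there
            if _, ok := msg["parts"]; !ok {
                return jsonRPCError("params.message requires content or parts")
            }
        default: // int, bool, ...: unsupported v0.2 content type
            return jsonRPCError("unsupported type for params.message.content")
        }
        return nil
    }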
Why at the proxy edge: every workspace gets the compat for free
without each one bumping a2a-sdk separately. The SDK's own compat
adapter is strict about `parts` and rejects v0.2 senders.
Why reject loud on missing-both: pre-fix the SDK's Pydantic
rejection was post-handler-dispatch and invisible to the original
sender. Now misshapen payloads return a structured 400 to the actual
caller — kills the entire silent-drop class for this payload-shape
category.
## Tests
7 new cases on normalizeA2APayload (#2345) + 1 fixture update on the
existing _MissingMethodReturnsEmpty test:
TestNormalizeA2APayload_ConvertsV02StringContentToParts
TestNormalizeA2APayload_ConvertsV02ListContentToParts
TestNormalizeA2APayload_PreservesV03Parts (hot path)
TestNormalizeA2APayload_RejectsMessageWithNeitherContentNorParts
TestNormalizeA2APayload_RejectsContentWithUnsupportedType
TestNormalizeA2APayload_NoMessageNoCheck (e.g. tasks/list bypasses)
All 11 normalizeA2APayload tests pass + full handler suite (no
regressions).
## Refs
Hard-gates discussion: this is exactly the class of failure
(silent-drop on schema mismatch) that #2342 (continuous synthetic
E2E) would catch automatically. Tier 2 RFC item from #2345 (caller
gets structured JSON-RPC error on parse failure) is delivered above
via the loud-reject path.
Adds workspaces.delivery_mode (push, default | poll) and lets the register
handler accept poll-mode workspaces with no URL. This is the foundation
for the unified poll/push delivery design in #2339 — Telegram-getUpdates
shape for external runtimes that have no public URL.
What this PR does:
- Migration 045: NOT NULL TEXT column, default 'push', CHECK constraint
on the two valid values.
- models.Workspace + RegisterPayload + CreateWorkspacePayload gain a
DeliveryMode field. RegisterPayload.URL drops the `binding:"required"`
tag — the handler now enforces it conditionally on the resolved mode.
- Register handler: validates explicit delivery_mode if set; resolves
effective mode (payload value, else stored row value, else push) AFTER
the C18 token check; validates URL only when effective mode is push;
persists delivery_mode in the upsert; returns it in the response;
skips URL caching when payload.URL is empty.
- CreateWorkspace handler: persists delivery_mode (defaults to push) in
the same INSERT, validates it before any side effects.
What this PR does NOT do (intentional, follow-up PRs):
- PR 2: short-circuit ProxyA2A for poll-mode workspaces (skip SSRF +
dispatch, log a2a_receive activity, return 200).
- PR 3: since_id cursor on GET /activity for lossless polling.
- Plugin v0.2 in molecule-mcp-claude-channel: cursor persistence + a
register helper that creates poll-mode workspaces.
Backwards compatibility: every existing workspace stays push-mode (schema
default) with identical behavior. New tests:
TestRegister_PollMode_AcceptsEmptyURL,
TestRegister_PushMode_RejectsEmptyURL,
TestRegister_InvalidDeliveryMode,
TestRegister_PollMode_PreservesExistingValue. All existing register +
create tests updated to expect the new delivery_mode column in the
INSERT args.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The header comment claimed:
"file upload (HTTP-forward) + download (Docker-exec)"
and:
"Download still uses the v1 docker-cp path; migrating it lives in
the next PR in this stack"
Both wrong now. RFC #2312 PR-D landed the Download HTTP-forward path:
chat_files.go:336 builds an http.NewRequestWithContext to
${wsURL}/internal/file/read?path=<abs>, with the response streamed
back to the caller. The workspace-side Starlette handler is at
workspace/internal_file_read.py, mounted at workspace/main.py:440.
Update the header to reflect actual code: both upload + download are
HTTP-forward, share the same per-workspace platform_inbound_secret
auth, and work uniformly on local Docker and SaaS EC2.
Pure docs change — no behavior, no build/test impact.
Closes the observability gap surfaced in #2329 item 5: callers received
queue_id in the 202 enqueue response but had no public lookup. The only
existing observability path was check_task_status (delegation-flavored
A2A only — joins via request_body->>'delegation_id'). Cross-workspace
peer-direct A2A had no observability after enqueue.
This PR ships RFC #2331's Tier 1: minimum viable observability + caller-
specified TTL. No schema migration — expires_at column already exists
(migration 042); only DequeueNext was honoring it, with no caller path
to populate it.
Two changes:
1. extractExpiresInSeconds(body) — new helper mirroring
   extractIdempotencyKey/extractDelegationIDFromBody. Pulls
   params.expires_in_seconds from the JSON-RPC body. Zero (the unset
   default) preserves today's infinite-TTL semantics. EnqueueA2A grew
   an expiresAt *time.Time parameter; the proxy callsite computes
   *time.Time from the extracted seconds and threads it through to
   the INSERT. (Sketched after this list.)
2. GET /workspaces/:id/a2a/queue/:queue_id — new public handler.
Auth: caller's workspace token must match queue.caller_id OR
queue.workspace_id, OR be an org-level token. 404 (not 403) on
auth failure to avoid leaking queue_id existence. Response
includes status/attempts/last_error/timestamps/expires_at; embeds
response_body via LEFT JOIN against activity_logs when status=
completed for delegation-flavored items.
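A sketch of the helper from change 1 (parsing condensed; the shipped
version mirrors the extractIdempotencyKey shape):

    func extractExpiresInSeconds(body []byte) int64 {
        var req struct {
            Params struct {
                ExpiresInSeconds int64 `json:"expires_in_seconds"`
            } `json:"params"`
        }
        // Absent field, empty params, wrong type, invalid JSON, and
        // non-positive values all collapse to 0 — the unset default
        // that preserves today's infinite-TTL semantics.
        if err := json.Unmarshal(body, &req); err != nil || req.Params.ExpiresInSeconds <= 0 {
            return 0
        }
        return req.Params.ExpiresInSeconds
    }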
What this does NOT change:
- Drain semantics (heartbeat-driven dispatch).
- Native-session bypass (claude-agent-sdk, hermes still skip queue).
- Schema (column already exists).
- MCP tools (delegate_task_async / check_task_status keep their
contract; this is a parallel queue-id surface).
Tests:
- 7 cases on extractExpiresInSeconds covering absent/positive/
zero/negative/invalid-JSON/wrong-type/empty-params.
- go vet + go build clean.
- Full handlers test suite passes (no regressions from the
EnqueueA2A signature change — only one production caller).
Tier 2 (cross-workspace stitch + webhook callback) and Tier 3
(controllerized lifecycle) deferred per RFC #2331.
External callers (third-party SDKs, the channel plugin) authenticate
purely via bearer and frequently don't set the X-Workspace-ID header.
Without this, activity_logs.source_id ends up NULL — breaking the
peer_id signal on notifications, the "Agent Comms by peer" canvas tab,
and any analytics that breaks down inbound A2A by sender.
The bearer is the authoritative caller identity per the wsauth contract
(it's what proves who you are); the header is a display/routing hint
that must agree with it. So we derive callerID from the bearer's owning
workspace whenever the header is absent. The existing validateCallerToken
guard fires after this and enforces token-to-callerID binding the same
way it always has.
Org-token requests are skipped — those grant org-wide access and don't
bind to a single workspace, so the canvas-class semantics (callerID="")
are preserved. Bearer-resolution failures (revoked, removed workspace)
fall through to canvas-class as well, never 401.
New wsauth.WorkspaceFromToken exposes the bearer→workspace lookup as a
modular interface; mirrors ValidateAnyToken's defense-in-depth JOIN on
workspaces.status != 'removed'.
Tests: 4 unit tests on WorkspaceFromToken + 3 integration tests on
ProxyA2A covering the three observable paths (bearer-derived,
org-token skipped, derive-failure fallthrough).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors PR-C's Upload migration: replaces the docker-cp tar-stream
extraction with a streaming HTTP GET to the workspace's own
/internal/file/read endpoint. Closes the SaaS gap for downloads —
without this PR, GET /workspaces/:id/chat/download still returns 503
on Railway-hosted SaaS even after A+B+C+F land.
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → PR-F #2319 → this PR.
Why a single broad /internal/file/read instead of /internal/chat/download:
Today's chat_files.go::Download already accepts paths under any of the
four allowed roots {/configs, /workspace, /home, /plugins} — it's not
strictly chat. Future PRs (template export, etc.) will reuse this
endpoint via the same forward pattern; reusing avoids three near-
identical handlers (one per domain) with duplicated path-safety logic.
Path safety is duplicated on platform + workspace sides — defense in
depth via two parallel checks, not "trust the workspace."
Changes:
* workspace/internal_file_read.py — Starlette handler. Validates path
(must be absolute, under allowed roots, no traversal, canonicalises
cleanly). lstat (not stat) so a symlink at the path doesn't redirect
the read. Streams via FileResponse (no buffering). Mirrors Go's
contentDispositionAttachment for Content-Disposition header.
* workspace/main.py — registers GET /internal/file/read alongside the
POST /internal/chat/uploads/ingest from PR-B.
* scripts/build_runtime_package.py — adds internal_file_read to
TOP_LEVEL_MODULES so the publish-runtime cascade rewrites its
imports correctly. Also includes the PR-B additions
(internal_chat_uploads, platform_inbound_auth) since this branch
was rooted before PR-B's drift-gate fix; merge-clean alphabetic
additions.
* workspace-server/internal/handlers/chat_files.go — Download
rewritten as streaming HTTP GET forward. Resolves workspace URL +
platform_inbound_secret (same shape as Upload), builds GET request
with path query param, propagates response headers (Content-Type /
Content-Length / Content-Disposition) + body. Drops archive/tar
+ mime imports (no longer needed). Drops Docker-exec branch entirely
— Download is now uniform across self-hosted Docker and SaaS EC2.
* workspace-server/internal/handlers/chat_files_test.go — replaces
TestChatDownload_DockerUnavailable (stale post-rewrite) with 4
new tests:
- TestChatDownload_WorkspaceNotInDB → 404 on missing row
- TestChatDownload_NoInboundSecret → 503 on NULL column
(with RFC #2312 detail in body)
- TestChatDownload_ForwardsToWorkspace_HappyPath → forward shape
(auth header, GET method, /internal/file/read path) + headers
propagated + body byte-for-byte
- TestChatDownload_404FromWorkspacePropagated → 404 from
workspace propagates (NOT remapped to 500)
Existing TestChatDownload_InvalidPath path-safety tests preserved.
* workspace/tests/test_internal_file_read.py — 21 tests covering
_validate_path matrix (absolute, allowed roots, traversal, double-
slash, exact-match-on-root), 401 on missing/wrong/no-secret-file
bearer, 400 on missing path/outside-root/traversal, 404 on missing
file, happy-path streaming with correct Content-Type +
Content-Disposition, special-char escaping in Content-Disposition,
symlink-redirect-rejection (lstat-not-stat protection).
Test results:
* go test ./internal/handlers/ ./internal/wsauth/ — green
* pytest workspace/tests/ — 1292 passed (was 1272 before PR-D)
Refs #2312 (parent RFC), #2308 (chat upload+download 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the SaaS-side gap that PR-A acknowledged but didn't fix: SaaS
workspaces have no persistent /configs volume, so the platform_inbound_secret
that PR-A's provisioner wrote at workspace creation never reaches the
runtime. Without this, even after the entire RFC #2312 stack lands,
SaaS chat upload would 401 (workspace fails-closed when /configs/.platform_inbound_secret
is missing).
Solution: return the secret in the /registry/register response body
on every register call. The runtime extracts it and persists to
/configs/.platform_inbound_secret at mode 0600. Idempotent — Docker-
mode workspaces also receive it and overwrite the value the provisioner
already wrote (same value until rotation).
Why on every register, not just first-register:
* SaaS containers can be restarted (deploys, drains, EBS detach/
re-attach) — /configs is rebuilt empty on each fresh start.
* The auth_token is "issue once" because re-issuing rotates and
invalidates the previous one. The inbound secret has no rotation
flow yet (#2318) so re-sending the same value is harmless.
* Eliminates the bootstrap window where a restarted SaaS workspace
has no inbound secret on disk and would 401 every platform call.
Changes:
* workspace-server/internal/handlers/registry.go — Register handler
reads workspaces.platform_inbound_secret via wsauth.ReadPlatformInboundSecret
and includes it in the response body. Legacy workspaces (NULL
column) get a successful registration with the field omitted.
* workspace-server/internal/handlers/registry_test.go — two new tests:
- TestRegister_ReturnsPlatformInboundSecret_RFC2312_PRF: secret
present in DB → secret in response, alongside auth_token.
- TestRegister_NoInboundSecret_OmitsField: NULL column → field
omitted, registration still 200.
* workspace/platform_inbound_auth.py — adds save_inbound_secret(secret).
Atomic write via tmp + os.replace, mode 0600 from os.open(O_CREAT,
0o600) so a concurrent reader never sees 0644-default. Resets the
in-process cache after write so the next get_inbound_secret() returns
the freshly-written value (rotation-safe when it lands).
* workspace/main.py — register-response handler extracts
platform_inbound_secret alongside auth_token and persists via
save_inbound_secret. Mirrors the existing save_token pattern.
* workspace/tests/test_platform_inbound_auth.py — 6 new tests for
save_inbound_secret: writes file, mode 0600, overwrite-existing,
cache invalidation after save, empty-input no-op, parent-dir creation
for fresh installs.
Test results:
* go test ./internal/handlers/ ./internal/wsauth/ — all green
* pytest workspace/tests/ — 1272 passed (was 1266 before this PR)
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review found the original draft of this PR added forward-time
validateAgentURL() as defense-in-depth — a paranoia layer on top of the
existing register-time gate. The validator unconditionally blocks
loopback (127.0.0.1/8), which makes httptest-based proxy tests
impossible without an env-var hatch I'd rather not add to a security-
critical path on first pass — so the forward-time check is dropped.
Trust note kept inline pointing at the upstream gate + tracking issue
so the gap is explicit, not invisible.
Refs #2312.
Closes the SaaS upload gap (#2308) with the unified architecture from
RFC #2312: same code path on local Docker and SaaS, no Docker socket
dependency, no `dockerCli == nil` cliff. Stacked on PR-A (#2313) +
PR-B (#2314).
Before:
Upload → findContainer (nil in SaaS) → 503
After:
Upload → resolve workspaces.url + platform_inbound_secret
→ stream multipart to <url>/internal/chat/uploads/ingest
→ forward response back unchanged
Same call site whether the workspace runs on local docker-compose
("http://ws-<id>:8000") or SaaS EC2 ("https://<id>.<tenant>...").
The bug behind #2308 cannot exist by construction.
Why streaming, not parse-then-re-encode:
* No 50 MB intermediate buffer on the platform
* Per-file size + path-safety enforcement is the workspace's job
(see workspace/internal_chat_uploads.py, PR-B)
* Workspace's error responses (413 with offending filename, 400 on
missing files field, etc.) propagate through unchanged
Changes:
* workspace-server/internal/handlers/chat_files.go — Upload rewritten
as a streaming HTTP proxy. Drops sanitizeFilename, copyFlatToContainer,
and the entire docker-exec path. ChatFilesHandler gains an httpClient
(broken out for test injection). Download stays docker-exec for now;
follow-up PR will migrate it to the same shape.
* workspace-server/internal/handlers/chat_files_external_test.go —
deleted. Pinned the wrong-headed runtime=external 422 gate from
#2309 (already reverted in #2311). Superseded by the proxy tests.
* workspace-server/internal/handlers/chat_files_test.go — replaced
sanitize-filename tests (now in workspace/tests/test_internal_chat_uploads.py)
with sqlmock + httptest proxy tests:
- 400 invalid workspace id
- 404 workspace row missing
- 503 platform_inbound_secret NULL (with RFC #2312 detail)
- 503 workspaces.url empty
- happy-path forward (asserts auth header, content-type forwarded,
body streamed, response propagated back)
- 413 from workspace propagated unchanged (NOT remapped to 500)
- 502 on workspace unreachable (connect refused)
Existing Download + ContentDisposition tests preserved.
* tests/e2e/test_chat_upload_e2e.sh — single-script-everywhere E2E.
Takes BASE as env (default http://localhost:8080). Creates a
workspace, waits for online, mints a test token, uploads a fixture,
reads it back via /chat/download, asserts content matches +
bearer-required. Same script runs against staging tenants (set
BASE=https://<id>.<tenant>.staging.moleculesai.app).
Test plan:
* go build ./... — green
* go test ./internal/handlers/ ./internal/wsauth/ — green (full suite)
* tests/e2e/test_chat_upload_e2e.sh against local docker-compose
after PR-A + PR-B + this PR all merge — TODO before merge
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Foundation for the HTTP-forward architecture that replaces Docker-exec
in chat upload + 5 follow-on handlers. This PR is intentionally scoped
to schema + token mint + provisioner wiring; no caller reads the secret
yet so behavior is unchanged.
Why a second per-workspace bearer (not reuse the existing
workspace_auth_tokens row):
workspace_auth_tokens workspaces.platform_inbound_secret
───────────────────── ─────────────────────────────────
workspace → platform platform → workspace
hash stored, plaintext gone plaintext stored (platform reads back)
workspace presents bearer platform presents bearer
platform validates by hash workspace validates by file compare
Distinct roles, distinct rotation lifecycle, distinct audit signal —
splitting later would require a fleet-wide rolling rotation, so paying
the schema cost up front.
Changes:
* migration 044: ADD COLUMN workspaces.platform_inbound_secret TEXT
* wsauth.IssuePlatformInboundSecret + ReadPlatformInboundSecret
* issueAndInjectInboundSecret hook in workspace_provision: mints
on every workspace create / re-provision; Docker mode writes
plaintext to /configs/.platform_inbound_secret alongside .auth_token,
SaaS mode persists to DB only (workspace will receive via
/registry/register response in a follow-up PR)
* 8 unit tests against sqlmock — covers happy path, rotation, NULL
column, empty string, missing workspace row, empty workspaceID
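A sketch of the mint + Docker-mode file write, under stated assumptions
(32-byte secret, helper names are illustrative; the real
IssuePlatformInboundSecret also persists the plaintext to the
workspaces row):

```go
package wsauth

import (
	"crypto/rand"
	"encoding/hex"
	"os"
	"path/filepath"
)

// mintInboundSecret generates the per-workspace bearer the platform
// presents to the workspace.
func mintInboundSecret() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// writeSecretToVolume is the Docker-mode path: plaintext, owner-only,
// alongside .auth_token on the workspace's config volume.
func writeSecretToVolume(configDir, secret string) error {
	if err := os.MkdirAll(configDir, 0o700); err != nil {
		return err
	}
	return os.WriteFile(
		filepath.Join(configDir, ".platform_inbound_secret"),
		[]byte(secret), 0o600)
}
```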
PR-B (next) wires up workspace-side `/internal/chat/uploads/ingest`
that validates the bearer against /configs/.platform_inbound_secret.
Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #2309 added an early-return that 422'd uploads to external workspaces
with "file upload not supported." Both halves of that diagnosis were wrong:
1. External workspaces SHOULD support uploads — gating with 422
locks off intended functionality and presents the gap as intentional
design.
2. The 503 the user actually hit was on an INTERNAL workspace, not
an external one. The runtime check never even ran.
Real root cause (separate fix incoming; sketched below):
- findContainer(...) requires a non-nil h.docker.
- In SaaS (MOLECULE_ORG_ID set), main.go selects the CP provisioner
instead of the local Docker provisioner — dockerCli is nil.
- findContainer short-circuits to "" → 503 "container not running"
on every workspace, internal or external, on Railway-hosted
SaaS where workspaces actually live on EC2.
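A simplified reconstruction of that path (type and signature are
assumptions from the description above):

```go
package handlers

// dockerClient stands in for the real Docker client type; the point
// is only the nil check below.
type dockerClient struct{}

type ChatFilesHandler struct {
	docker *dockerClient // nil when MOLECULE_ORG_ID selects the CP provisioner
}

func (h *ChatFilesHandler) findContainer(wsID string) string {
	if h.docker == nil {
		return "" // SaaS short-circuit: caller maps "" to 503 "container not running"
	}
	// local docker-compose mode would inspect Docker for ws-<id> here
	return "ws-" + wsID
}
```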
This PR strips the misleading gate so #2308 can be re-investigated
against the real symptom. The proper fix routes the multipart upload
over HTTP to the workspace's URL when dockerCli is nil — tracked
as a follow-up.
Refs #2308.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: pasting a screenshot into the canvas chat for a runtime="external"
workspace returned `503 {"error":"workspace container not running"}` —
accurate from the upload handler's POV (no container exists for external
workspaces) but misleading because it implies the container has crashed.
Fix: detect runtime="external" via DB lookup BEFORE the container-find
step (see the sketch after the rationale below) and return 422 with:
- error: "file upload not supported for external workspaces"
- detail: explains why + points at admin/secrets workaround +
references issue #2308 for the v0.2 native-support roadmap
- runtime: "external" (machine-readable for clients)
Why 422 not 200/501:
- 422 = Unprocessable Entity — the request is well-formed but the
workspace's runtime can't accept it. Standard REST semantics.
- 200 with empty result would lie; 501 implies the API itself is
unimplemented (it's not — works for non-external workspaces); 503
was the misleading status this PR fixes.
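For reference, a sketch of the gate's shape (since reverted in #2311);
the query and the Go 1.22-style r.PathValue routing are assumptions,
the JSON fields mirror the list above:

```go
package handlers

import (
	"database/sql"
	"encoding/json"
	"net/http"
)

type ChatFilesHandler struct {
	db *sql.DB
	// templates, docker, etc. omitted: the check runs before any of them
}

func (h *ChatFilesHandler) Upload(w http.ResponseWriter, r *http.Request) {
	wsID := r.PathValue("id") // Go 1.22 routing assumed

	var runtime string
	if err := h.db.QueryRowContext(r.Context(),
		`SELECT runtime FROM workspaces WHERE id = $1`, wsID).Scan(&runtime); err != nil {
		http.Error(w, `{"error":"workspace not found"}`, http.StatusNotFound)
		return
	}
	if runtime == "external" {
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusUnprocessableEntity) // 422, not the misleading 503
		json.NewEncoder(w).Encode(map[string]string{
			"error":   "file upload not supported for external workspaces",
			"detail":  "use the admin/secrets workaround; native support tracked in #2308",
			"runtime": "external", // machine-readable for clients
		})
		return
	}
	// ... container-find + docker-exec upload path for non-external workspaces ...
}
```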
Verified via live E2E against localhost:
- Created `runtime=external,external=true` workspace
- Posted multipart to /workspaces/:id/chat/uploads
- Got 422 with the expected structured body
Unit test (`chat_files_external_test.go`) pins the contract via sqlmock
+ httptest. Notable: the handler is constructed with `templates: nil`
to prove the runtime check happens BEFORE any docker plumbing — if a
future change moves the check below findContainer, the test crashes
on nil-deref instead of silently regressing.
Out of scope (for v0.2 follow-up):
- Native external-workspace file ingest via artifacts table or the
channel-plugin's inbox/ pattern. Requires separate design pass.
Closes #2308
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third snippet alongside externalCurlTemplate / externalPythonTemplate
in workspace-server/internal/handlers/external_connection.go: the new
externalChannelTemplate guides operators through installing the Claude Code
channel plugin (Molecule-AI/molecule-mcp-claude-channel — scaffolded today)
and dropping the .env config for it.
Wires the new snippet into the external-workspace POST /workspaces response
under key `claude_code_channel_snippet`, alongside the existing
`curl_register_template` and `python_snippet`. Canvas's "external workspace
created" modal can render it as a third tab.
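Illustrative response shape (the JSON keys are the ones named above;
the struct itself is an assumption):

```go
package handlers

type externalWorkspaceCreatedResponse struct {
	CurlRegisterTemplate     string `json:"curl_register_template"`
	PythonSnippet            string `json:"python_snippet"`
	ClaudeCodeChannelSnippet string `json:"claude_code_channel_snippet"` // new third tab
}
```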
CONTRIBUTING.md gains a short "External integrations" section pointing at
the three peer repos (workspace-runtime, sdk-python, mcp-claude-channel)
so contributors know where related runtime artifacts live and to consider
downstream impact when changing the A2A wire shape.
The plugin itself is scaffolded at commit d07363c on the new repo's main
branch; v0.1 is polling-based via the /activity?since_secs= filter shipped
in PR #2300. See that repo's README + roadmap for details.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harness runner (scripts/measure-coordinator-task-bounds-runner.sh)
calls `/workspaces/:id/activity?since_secs=$A2A_TIMEOUT` to scope a
trace to a specific test window. The query param was silently
ignored — `ActivityHandler.List` accepted only `type`, `source`, and
`limit`, so the runner got the most-recent-100 events regardless of
how long ago they happened. That works for fresh-tenant tests where
activity_logs is ~empty pre-run, but breaks on busy tenants and on tests
that exceed 100 events.
Adds `since_secs` parsing with three behaviors (sketched after this list):
- Valid positive int → `AND created_at >= NOW() - make_interval(secs => $N)`
on the SQL. Parameterised; values bound via lib/pq, not interpolated.
`make_interval(secs => $N)` is required — the `INTERVAL '$N seconds'`
literal form rejects placeholder substitution inside the string.
- Above 30 days (2_592_000s) → silently clamped to the cap. Defends
against a paranoid client triggering a multi-month full-table scan
via `since_secs=999999999`.
- Negative, zero, or non-integer → 400 with a structured error, NOT
silently dropped. Silent drop is exactly the bug this is fixing
— a typoed param shouldn't be lost as most-recent-100.
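A sketch of the parse + clamp under illustrative names (the committed
parser lives in ActivityHandler.List):

```go
package handlers

import (
	"fmt"
	"strconv"
)

const maxSinceSecs = 2_592_000 // 30-day clamp ceiling

// parseSinceSecs implements the three behaviors above.
func parseSinceSecs(raw string) (secs int64, present bool, err error) {
	if raw == "" {
		return 0, false, nil // omitted: no extra SQL clause, no extra arg
	}
	n, perr := strconv.ParseInt(raw, 10, 64)
	if perr != nil || n <= 0 {
		// the 400 path: reject instead of silently dropping the param
		return 0, false, fmt.Errorf("since_secs must be a positive integer")
	}
	if n > maxSinceSecs {
		n = maxSinceSecs // silent clamp: blocks multi-month full-table scans
	}
	return n, true, nil
}

// When present, the clause is appended with a bound placeholder.
// make_interval(secs => $N) is used because the INTERVAL '$N seconds'
// literal form cannot take a placeholder inside the string:
//
//	query += fmt.Sprintf(" AND created_at >= NOW() - make_interval(secs => $%d)", len(args)+1)
//	args = append(args, secs)
```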
Tests cover all four paths: accepted (with arg-binding assertion via
sqlmock.WithArgs), clamped at 30 days, invalid rejected (5 sub-cases),
and omitted (verifies no extra clause / arg leak via strict WithArgs
count).
RFC #2251 §V1.0 step 6 (platform-side-transition audit) also depends
on this for time-window filtering of activity_logs.
Closes #2268
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-workspace `restartState` entries (introduced under the name
`restartMu` pre-#2266, renamed to `restartStates` in #2266) are
created via `LoadOrStore` in `workspace_restart.go` but never
deleted. On a long-running platform process serving many short-lived
workspaces (E2E tests, transient sandbox tenants), the sync.Map grows
monotonically — ~16 bytes per workspace ever created.
Fix: call `restartStates.Delete(wsID)` after stopAndRemove +
ClearWorkspaceKeys for each cascaded descendant and the parent. Mirrors
the existing per-ID cleanup loop. `sync.Map.Delete` is safe on absent
keys, so deletes for workspaces that were never restarted (no
LoadOrStore call) are no-ops.
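Shape of the change, with assumed surrounding names:

```go
package handlers

import "sync"

var restartStates sync.Map // wsID -> per-workspace restart state

// cleanupRestartState sketches the delete added by this fix; the
// surrounding calls are the existing per-ID cleanup loop.
func cleanupRestartState(wsID string) {
	// stopAndRemove + ClearWorkspaceKeys run first (existing code)
	restartStates.Delete(wsID) // safe on absent keys: never-restarted IDs no-op
}
```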
This is a pre-existing leak — #2266 did not introduce it, just renamed
the holder. Filed as a separate commit to keep the change minimal and
reviewable.
Closes #2269
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>