Commit Graph

449 Commits

Hongming Wang
8e508a7a2f fix(a2a): cover CF 521/522/523 in dead-origin status set
Independent review on PR #2362 caught: the dead-agent classifier at
a2a_proxy.go included 502/503/504/524 but missed the rest of the CF
origin-failure family (521/522/523), which are MORE indicative of a
dead EC2 than 524:

- 521 "Web server is down" — CF can't open TCP to origin (most direct
  dead-EC2 signal; fires when the workspace EC2 has been terminated
  and CF still has the CNAME pointing at it).
- 522 "Connection timed out" — TCP didn't complete in ~15s (typical
  of SG/NACL flap or agent process hung on accept).
- 523 "Origin is unreachable" — CF can't route to origin (DNS gone,
  network path broken).

Pre-fix, any of these would propagate as-is to the canvas and the user
would see a 5xx without the reactive auto-restart firing — exactly
the SaaS-blind class of failure PR #2362 was meant to close.

Refactor: extracted isUpstreamDeadStatus(int) helper so the matrix is
in one place, with TestIsUpstreamDeadStatus locking in 18 status
codes (7 dead, 11 not-dead including 520 and 525 which look CF-shaped
but indicate different failures).
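
A minimal sketch of that helper, assuming only the status matrix
described above (the name and signature come from this message;
everything else is illustrative):

```go
// Sketch: the seven "dead origin" codes; 520 and 525 look CF-shaped
// but indicate different failures, so they deliberately return false.
func isUpstreamDeadStatus(status int) bool {
	switch status {
	case 502, 503, 504, // generic gateway failures
		521, // CF: web server is down (can't open TCP to origin)
		522, // CF: connection timed out
		523, // CF: origin is unreachable
		524: // CF: origin timed out
		return true
	default:
		return false
	}
}
```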

Also tightened TestStopForRestart_NoProvisioner_NoOp per the same
review: now uses sqlmock.ExpectationsWereMet to assert the dispatcher
doesn't touch the DB on the both-nil path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:39:04 -07:00
Hongming Wang
747c12e582 test(a2a): protocol-shape replay corpus gate (#2345 follow-up)
Backward-compat replay gate for the A2A JSON-RPC protocol surface.
Every PR that touches normalizeA2APayload OR bumps the a2a-sdk
version pin runs every shape in testdata/a2a_corpus/ through the
current code and asserts:

  valid/   — every shape MUST parse without error and produce a
             canonical v0.3 payload (params.message.parts list).

  invalid/ — every shape MUST be rejected with the documented
             status code and error substring.

What this prevents

The 2026-04-29 v0.2 → v0.3 silent-drop bug (PR #2349) shipped
because the SDK bump PR didn't replay v0.2-shaped inputs against
the new code; the shape-mismatch surfaced only in production when
the receiver's Pydantic validator silently rejected inbound
messages.

This gate would have caught it pre-merge. Hand-verified: reverting
the v0.2 string→parts shim in normalizeA2APayload fails 3 of the
v0.2 corpus entries with the exact rejection class the production
bug exhibited.

Corpus contents (14 entries)

valid/ (11):
  v0_2_string_content              — basic v0.2 (the broken case)
  v0_2_string_content_no_message_id — v0.2 + auto-fill messageId
  v0_2_list_content                — v0.2 with content as Part list
  v0_3_parts_text_only             — canonical v0.3
  v0_3_parts_multi_text            — multi-Part list
  v0_3_parts_with_file             — multimodal (text + file)
  v0_3_parts_with_context          — contextId for multi-turn
  v0_3_streaming_method            — message/stream variant
  v0_3_unicode_text                — emoji + multi-script
  v0_3_long_text                   — 10KB text Part
  no_jsonrpc_envelope              — bare params/method without
                                     outer envelope (legacy senders)

invalid/ (3):
  no_content_or_parts              — message has neither field
  content_is_integer               — wrong type for v0.2 content
  content_is_bool                  — wrong type, separate from int
                                     so the failure msg identifies
                                     which type-class regressed

Plus 4 inline malformed-JSON cases (truncated, not-JSON, empty,
whitespace) that can't be expressed as JSON corpus entries.

Coverage tests

The gate has 5 test functions:

1. TestA2ACorpus_ValidShapesParse        — replay valid/ corpus,
   assert no error + canonical v0.3 output (parts list non-empty,
   messageId non-empty, content field deleted).
2. TestA2ACorpus_InvalidShapesRejected   — replay invalid/ corpus,
   assert rejection matches recorded status + error substring.
3. TestA2ACorpus_MalformedJSONRejected   — inline cases for
   non-parseable bodies.
4. TestA2ACorpus_HasMinimumCoverage      — at least one v0.2 +
   one v0.3 entry exists (so neither side of the bridge is lost).
5. TestA2ACorpus_EveryEntryHasMetadata   — _comment/_added/_source
   on every entry per the README policy; _expect_error and
   _expect_status on invalid entries.

Documentation

testdata/a2a_corpus/README.md describes the corpus contract:
  - When to add entries (new SDK shape, new production-observed
    shape).
  - When NOT to add (test scaffolding, hypothetical futures).
  - Removal policy (breaking change, deprecation window required).

Verification

- All 24 corpus subtests pass on current main.
- Hand-test: revert the v0.2 compat shim → 3 v0.2 entries fail
  the gate with the exact rejection class the production bug
  exhibited. Confirmed.
- Whole-module go test ./... green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 01:26:02 -07:00
Hongming Wang
a27cf8f39f fix(restart): extract stopForRestart helper + add 524 to dead-agent list
Addresses code-review C1 (test goroutine race) and I2 (CF 524) on PR #2362.

C1: TestRunRestartCycle_SaaSPath_DispatchesViaCPProv invoked runRestartCycle
end-to-end, which spawns `go h.sendRestartContext(...)`. That goroutine
outlived the test, then read db.DB while the next test's setupTestDB wrote
to it — DATA RACE under -race, cascading 30+ failures across the handlers
suite. Refactored: extracted `stopForRestart(ctx, id)` from runRestartCycle
as a pure dispatcher, and rewrote the SaaS-path test to call it directly
(no async goroutine spawned). Added a no-provisioner no-op guard test.

I2: Cloudflare 524 ("origin timed out") now triggers maybeMarkContainerDead
alongside 502/503/504. Same upstream signal — origin agent unresponsive.

Verified `go test -race -count=1 ./internal/handlers/...` green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:58:22 -07:00
Hongming Wang
28b4e38002 fix(restart): branch provisionWorkspace dispatch on cpProv (PR #2362 amendment)
Independent review of #2362 caught a Critical gap: the previous commit
fixed the Stop dispatch in runRestartCycle but left the provisionWorkspace
dispatch unconditionally Docker-only. So on SaaS the auto-restart cycle
would Stop the EC2 successfully (good), then NPE inside provisionWorkspace's
`h.provisioner.VolumeHasFile` call. coalesceRestart's recover()-without-
re-raise (a deliberate platform-stability safeguard) silently swallowed
the panic, leaving the workspace permanently stuck in status='provisioning'
because the UPDATE on workspace_restart.go:450 had already run.

Net pre-amendment effect on SaaS: dead agent → structured 503 (good) →
workspace flipped to 'offline' (good) → cpProv.Stop succeeded (good) →
provisionWorkspace NPE swallowed (bad) → workspace permanently
'provisioning' until manual canvas restart. The headline claim of #2362
("SaaS auto-restart now works") was false on the path it shipped.

Fix: dispatch the reprovision call the same way every other call site
in the package does (workspace.go:431-433, workspace_restart.go:197+596) —
branch on `h.cpProv != nil` and call provisionWorkspaceCP for SaaS,
provisionWorkspace for Docker.
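
The dispatch shape, sketched (field and function names come from this
message; the surrounding plumbing is assumed):

```go
// After the Stop dispatch in runRestartCycle (sketch; signatures assumed).
if h.cpProv != nil {
	h.provisionWorkspaceCP(ctx, workspaceID) // SaaS: CP-backed EC2 reprovision
} else if h.provisioner != nil {
	h.provisionWorkspace(ctx, workspaceID) // local Docker reprovision
}
```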

Tests:

- New TestRunRestartCycle_SaaSPath_DispatchesViaCPProv asserts cpProv.Stop
  is called when the SaaS path runs (would have caught the NPE if
  provisionWorkspace had been called instead).

- fakeCPProv updated: methods record calls and return nil/empty by
  default rather than panicking. The previous "panic on unexpected call"
  pattern was unsafe — the panic fires on the async restart goroutine
  spawned by maybeMarkContainerDead AFTER the test assertions ran, so
  the test passed by accident even though the production path was
  broken (which is exactly how the Critical bug landed).

- Existing tests still pass (full handlers + provisioner suites green).

Branch-count audit refresh:

  runRestartCycle dispatch decisions:
    1. h.provisioner != nil → provisioner.Stop + provisionWorkspace ✓ (existing tests)
    2. h.cpProv != nil       → cpProv.Stop + provisionWorkspaceCP   ✓ (NEW test)
    3. both nil              → coalesceRestart never called (RestartByID gate) ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:35:51 -07:00
Hongming Wang
9f35788aee fix(a2a): detect dead EC2 agents on upstream 5xx + reactive auto-restart for SaaS
Class-of-bugs fix surfaced by hongmingwang.moleculesai.app's canvas chat
to a dead workspace returning a generic Cloudflare 502 page on
2026-04-30. Three independent gaps in the reactive-health path that
together leak dead-agent failures to canvas with no auto-recovery.

## Bug 1 — maybeMarkContainerDead is a no-op for SaaS tenants

`maybeMarkContainerDead` only consulted `h.provisioner` (local Docker
provisioner). SaaS tenants set `h.cpProv` (CP-backed EC2 provisioner)
and leave `h.provisioner` nil — so the function early-returned false
on every call and dead EC2 agents never triggered the offline-flip /
broadcast / restart cascade.

Fix: extend `CPProvisionerAPI` interface with `IsRunning(ctx, id)
(bool, error)` (already implemented on `*CPProvisioner`; just needs
to surface on the interface). `maybeMarkContainerDead` now branches:
local-Docker path uses `h.provisioner.IsRunning`; SaaS path uses
`h.cpProv.IsRunning` which calls the CP's `/cp/workspaces/:id/status`
endpoint to read the EC2 state.
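
A sketch of the branch, with method names taken from this message and
the surrounding handler plumbing assumed:

```go
// maybeMarkContainerDead, post-fix shape (sketch only).
func (h *WorkspaceHandler) maybeMarkContainerDead(ctx context.Context, id string) bool {
	var running bool
	var err error
	switch {
	case h.provisioner != nil: // local Docker path
		running, err = h.provisioner.IsRunning(ctx, id)
	case h.cpProv != nil: // SaaS path: CP reads the EC2 state for us
		running, err = h.cpProv.IsRunning(ctx, id)
	default:
		return false // neither provisioner wired: nothing to check
	}
	if err != nil || running {
		return false // can't tell, or the agent is healthy: don't recycle it
	}
	// Confirmed dead: the existing offline-flip / broadcast / restart cascade runs here.
	return true
}
```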

## Bug 2 — RestartByID short-circuits on `h.provisioner == nil`

Same shape as Bug 1: the auto-restart cascade triggered by
`maybeMarkContainerDead` calls `RestartByID` which short-circuited
when the local Docker provisioner was missing. So even if Bug 1 were
fixed, the workspace-offline state would never recover.

Fix: change the gate to `h.provisioner == nil && h.cpProv == nil`
and update `runRestartCycle` to branch on which provisioner is
wired for the Stop call. (The HTTP `Restart` handler already does
this branching correctly — we're just bringing the auto-restart path
to parity.)

## Bug 3 — upstream 502/503/504 propagated as-is, masked by Cloudflare

When the agent's tunnel returns 5xx (the "tunnel up but no origin"
shape — agent process dead but cloudflared connection still healthy),
`dispatchA2A` returns successfully at the HTTP layer with a 5xx body.
`handleA2ADispatchError`'s reactive-health path doesn't run because
that path is only triggered on transport-level errors. The pre-fix
code propagated the 502 status to canvas; Cloudflare in front of the
platform then masked the 502 with its own opaque "error code: 502"
page, hiding any structured response and any Retry-After hint.

Fix: in `proxyA2ARequest`, when the upstream returns 502/503/504, run
`maybeMarkContainerDead` BEFORE propagating. If IsRunning confirms
the agent is dead → return a structured 503 with restarting=true +
Retry-After (CF doesn't mask 503s the same way). If running, propagate
the original status (don't recycle a healthy agent on a transient
hiccup — it might have legitimately returned 502).
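
Sketched (assuming the handlers are gin-based; the response shape and
Retry-After value follow the tests below):

```go
// In proxyA2ARequest, before propagating the upstream response (sketch).
switch resp.StatusCode {
case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
	if h.maybeMarkContainerDead(c.Request.Context(), workspaceID) {
		// Dead agent: structured 503 the canvas can act on (CF doesn't mask 503s
		// the way it masks 502s).
		c.Header("Retry-After", "15")
		c.JSON(http.StatusServiceUnavailable, gin.H{
			"error":      "workspace agent is restarting",
			"restarting": true,
		})
		return
	}
	// Agent reports running: fall through and propagate the original status.
}
```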

## Drive-by — a2aClient transport timeouts

a2aClient was `&http.Client{}` with no Transport timeouts. When a
workspace's EC2 black-holes TCP connects (instance terminated mid-flight,
SG flipped, NACL bug), the OS default is 75s on Linux / 21s on macOS —
long enough for Cloudflare's ~100s edge timeout to fire first and
surface a generic 502. Added DialContext (10s connect), TLSHandshake
(10s), and ResponseHeaderTimeout (60s). Client.Timeout DELIBERATELY
unset — that would pre-empt slow-cold-start flows (Claude Code OAuth
first-token, multi-minute agent synthesis). Long-tail body streaming
is still governed by per-request context deadline.
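
The transport, sketched from the values above:

```go
package handlers

import (
	"net"
	"net/http"
	"time"
)

// Sketch of the a2aClient settings described above.
var a2aClient = &http.Client{
	// Timeout deliberately unset: a whole-request deadline would pre-empt
	// slow cold-start flows; per-request context deadlines still apply.
	Transport: &http.Transport{
		DialContext: (&net.Dialer{
			Timeout: 10 * time.Second, // TCP connect: fail fast on black-holed SYNs
		}).DialContext,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 60 * time.Second, // first response header, not the full body
	},
}
```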

## Tests

- `TestMaybeMarkContainerDead_CPOnly_NotRunning` — IsRunning(false) →
  marks workspace offline, returns true.
- `TestMaybeMarkContainerDead_CPOnly_Running` — IsRunning(true) →
  no offline-flip, returns false (don't recycle a healthy agent).
- `TestProxyA2A_Upstream502_TriggersContainerDeadCheck` — agent server
  returns 502 + cpProv reports dead → caller gets 503 with restarting=
  true and Retry-After: 15.
- `TestProxyA2A_Upstream502_AliveAgent_PropagatesAsIs` — same upstream
  502 but cpProv reports running → propagates 502 (existing behavior;
  safety check that prevents over-eager recycling).
- Existing `TestMaybeMarkContainerDead_NilProvisioner` /
  `TestMaybeMarkContainerDead_ExternalRuntime` still pass.
- Full handlers + provisioner test suites pass.

## Impact

Pre-fix: dead EC2 agent on a SaaS tenant → CF-masked 502 to canvas, no
auto-recovery, manual restart from canvas required.

Post-fix: dead EC2 agent on a SaaS tenant → structured 503 with
restarting=true + Retry-After to canvas, workspace flipped to offline,
auto-restart cycle triggered. Canvas can show a user-actionable
"agent is restarting, please wait" message instead of a generic 502.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 00:28:22 -07:00
Hongming Wang
a81b0e1e3d feat(activity): since_id cursor on GET /activity (#2339 PR 3)
Telegram getUpdates / Slack RTM shape: poll-mode workspaces pass the id
of the last activity_logs row they consumed; the server returns rows
strictly after it in chronological (ASC) order. Existing callers that don't
pass since_id keep DESC + most-recent-N — backwards-compatible.

Cursor lookup is scoped by workspace_id so a caller cannot enumerate or
peek at another workspace's events by passing a UUID belonging to a
different workspace. Cross-workspace and pruned cursors both return
410 Gone — no information leak (caller cannot distinguish "row never
existed" from "row exists but you can't see it").

since_id + since_secs both apply (AND). When since_id is set the order
flips to ASC because polling consumers need recorded-order; the recent-
feed shape (no since_id) keeps DESC.
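
Roughly, the query-shape switch (column names from this message, query
text and placeholder positions illustrative):

```go
// Sketch: cursor present flips to ASC-after-cursor; absent keeps recent-N DESC.
query := `SELECT id, created_at FROM activity_logs WHERE workspace_id = $1`
args := []any{workspaceID}
if sinceID != "" {
	// cursorCreatedAt comes from a lookup scoped by workspace_id, so a foreign
	// UUID resolves to 410 Gone instead of leaking another workspace's rows.
	query += ` AND created_at > $2 ORDER BY created_at ASC`
	args = append(args, cursorCreatedAt)
} else {
	query += ` ORDER BY created_at DESC LIMIT $2`
	args = append(args, limit)
}
```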

Tests:
- TestActivityHandler_SinceID_ReturnsNewerASC — cursor lookup → main
  query with cursorTime + ASC ordering.
- TestActivityHandler_SinceID_CursorNotFound_410 — pruned/unknown cursor.
- TestActivityHandler_SinceID_CrossWorkspaceCursor_410 — UUID belongs to
  another workspace, scoped lookup hides it (same 410 path, no leak).
- TestActivityHandler_SinceID_CombinedWithSinceSecs — placeholder index
  arithmetic with both filters.

Stacked on #2353 (PR 2: poll-mode short-circuit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:51:52 -07:00
Hongming Wang
706a388806
Merge pull request #2353 from Molecule-AI/auto/issue-2339-pr2-poll-shortcircuit-v2
feat(a2a): poll-mode short-circuit in ProxyA2A (#2339 PR 2)
2026-04-30 05:29:03 +00:00
Hongming Wang
91a1d5377d feat(a2a): poll-mode short-circuit in ProxyA2A (#2339 PR 2)
Skip SSRF/dispatch and queue to activity_logs for delivery_mode=poll
workspaces. The polling agent (e.g. molecule-mcp-claude-channel on an
operator's laptop) consumes via GET /activity?since_id= in PR 3 — no
public URL needed.

Order: budget -> normalize -> lookupDeliveryMode short-circuit ->
resolveAgentURL. Normalizing before the short-circuit keeps the
JSON-RPC method name on the activity_logs row so the polling agent
can dispatch correctly.

Fail-closed-to-push: any DB error reading delivery_mode defaults to
push (loud + recoverable) rather than poll (silent drop).
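
The short-circuit, sketched (helper names partly assumed; the response
body matches the test below):

```go
// After budget + normalize, before resolveAgentURL (sketch only).
mode, err := h.lookupDeliveryMode(ctx, workspaceID)
if err != nil {
	mode = "push" // fail closed to push: loud and recoverable, never a silent queue
}
if mode == "poll" {
	// Record the normalized JSON-RPC call (method name preserved) so the
	// polling agent can dispatch it via GET /activity?since_id= (PR 3).
	h.recordA2AReceiveActivity(ctx, workspaceID, normalized) // assumed helper
	c.JSON(http.StatusOK, gin.H{
		"status": "queued", "delivery_mode": "poll", "method": method,
	})
	return
}
// push mode: fall through to resolveAgentURL + dispatch as before.
```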

Tests:
- TestProxyA2A_PollMode_ShortCircuits_NoSSRF_NoDispatch — core invariant:
  no resolveAgentURL, no Do(), records to activity_logs, returns 200
  {status:"queued",delivery_mode:"poll",method:"message/send"}.
- TestProxyA2A_PushMode_NoShortCircuit — push path unaffected; the agent
  server actually receives the request.
- TestProxyA2A_PollMode_FailsClosedToPush — DB error on mode lookup
  must NOT silently queue; falls through to the push path.

Stacked on #2348 (PR 1: schema + register flow).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:22:28 -07:00
Hongming Wang
3da2392f95
Merge pull request #2348 from Molecule-AI/auto/issue-2339-pr1-delivery-mode
feat(workspaces): delivery_mode column + poll-mode register flow (#2339 PR 1)
2026-04-30 05:18:03 +00:00
Hongming Wang
68f18424f5 test(arch): codify 4 module boundaries as architecture tests (#2344)
Hard gate #4: codified module boundaries as Go tests, so a new
contributor (or AI agent) can't silently land an import that crosses
a layer.

Boundaries enforced (one architecture_test.go per package):

- wsauth has no internal/* deps — auth leaf, must be unit-testable in
  isolation
- models has no internal/* deps — pure-types leaf, reverse dep would
  create cycles since most packages depend on models
- db has no internal/* deps — DB layer below business logic, must be
  testable with sqlmock without spinning up handlers/provisioner
- provisioner does not import handlers or router — unidirectional
  layering: handlers wires provisioner into HTTP routes; the reverse
  is a cycle

Each test parses .go files in its package via go/parser (no x/tools
dep needed) and asserts forbidden import paths don't appear. Failure
messages name the rule, the offending file, and explain WHY the
boundary exists so the diff reviewer learns the rule.
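
One of the four gates, sketched (rule text and import-path matching
illustrative; the real tests live in one architecture_test.go per
package):

```go
package wsauth

import (
	"go/parser"
	"go/token"
	"strings"
	"testing"
)

// wsauth is an auth leaf: it must not import internal/* so it stays
// unit-testable in isolation (sketch of the described gate).
func TestArchitecture_WsauthHasNoInternalDeps(t *testing.T) {
	fset := token.NewFileSet()
	pkgs, err := parser.ParseDir(fset, ".", nil, parser.ImportsOnly)
	if err != nil {
		t.Fatal(err)
	}
	for _, pkg := range pkgs {
		for filename, file := range pkg.Files {
			for _, imp := range file.Imports {
				path := strings.Trim(imp.Path.Value, `"`)
				if strings.Contains(path, "/internal/") {
					t.Errorf("%s imports %s: wsauth must stay a dependency-free auth leaf", filename, path)
				}
			}
		}
	}
}
```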

Note: the original issue's first two proposed boundaries
(provisioner-no-DB, handlers-no-docker) don't match the codebase
today — provisioner already imports db (PR #2276 runtime-image
lookup) and handlers hold *docker.Client directly (terminal,
plugins, bundle, templates). I picked the four boundaries that
actually hold; the first two are aspirational and would need a
refactor before they could be codified.

Hand-tested by injecting a deliberate wsauth -> orgtoken violation:
the gate fires red with the rule message before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 22:12:58 -07:00
Hongming Wang
140fc5fb10 fix(a2a): v0.2 → v0.3 compat shim at proxy edge (#2345)
Closes #2345.

## Symptom

Design Director silently dropped A2A briefs whose sender used the
v0.2 message format (`params.message.content` string) instead of v0.3
(`params.message.parts` part-list). The downstream a2a-sdk's v0.3
Pydantic validator rejected with "params.message.parts — Field
required" but the rejection only landed in tenant-side logs; the
sender saw HTTP 200/202 and assumed delivery.

UX Researcher therefore never received the kickoff. Multi-agent
pipeline silently idle.

## Fix

Convert at the proxy edge in normalizeA2APayload. Two cases handled,
one explicitly rejected:

  v0.2 string content   → wrap as [{kind: text, text: <content>}]
                          (the canonical v0.2 case from the dogfooding
                          report)
  v0.2 list content     → preserve list as parts (some older clients
                          put a list under `content`; treat as "client
                          meant parts, used wrong field name")
  v0.3 parts present    → no-op (hot path for normal traffic)
  Neither present       → return HTTP 400 with structured JSON-RPC
                          error pointing at the missing field
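
A sketch of those cases, assuming the decoded body is a
map[string]any (real types and the error value are illustrative):

```go
// Inside normalizeA2APayload, after decoding params.message (sketch).
msg, _ := params["message"].(map[string]any)
if msg != nil {
	if _, hasParts := msg["parts"]; !hasParts {
		switch content := msg["content"].(type) {
		case string: // v0.2 string content: wrap as a single text part
			msg["parts"] = []any{map[string]any{"kind": "text", "text": content}}
		case []any: // v0.2 list content: client meant parts, used the wrong field name
			msg["parts"] = content
		default: // neither parts nor usable content (absent, int, bool, ...): loud 400
			return nil, errNeitherContentNorParts // illustrative error value
		}
		delete(msg, "content")
	}
}
```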

Why at the proxy edge: every workspace gets the compat for free
without each one bumping a2a-sdk separately. The SDK's own compat
adapter is strict about `parts` and rejects v0.2 senders.

Why reject loud on missing-both: pre-fix the SDK's Pydantic
rejection was post-handler-dispatch and invisible to the original
sender. Now misshapen payloads return a structured 400 to the actual
caller — kills the entire silent-drop class for this payload-shape
category.

## Tests

7 new cases on normalizeA2APayload (#2345) + 1 fixture update on the
existing _MissingMethodReturnsEmpty test:

  TestNormalizeA2APayload_ConvertsV02StringContentToParts
  TestNormalizeA2APayload_ConvertsV02ListContentToParts
  TestNormalizeA2APayload_PreservesV03Parts (hot path)
  TestNormalizeA2APayload_RejectsMessageWithNeitherContentNorParts
  TestNormalizeA2APayload_RejectsContentWithUnsupportedType
  TestNormalizeA2APayload_NoMessageNoCheck (e.g. tasks/list bypasses)

All 11 normalizeA2APayload tests pass + full handler suite (no
regressions).

## Refs

Hard-gates discussion: this is exactly the class of failure
(silent-drop on schema mismatch) that #2342 (continuous synthetic
E2E) would catch automatically. Tier 2 RFC item from #2345 (caller
gets structured JSON-RPC error on parse failure) is delivered above
via the loud-reject path.
2026-04-29 22:01:41 -07:00
Hongming Wang
d5b00d6ac1 feat(workspaces): delivery_mode column + poll-mode register flow (#2339 PR 1)
Adds workspaces.delivery_mode (push, default | poll) and lets the register
handler accept poll-mode workspaces with no URL. This is the foundation
for the unified poll/push delivery design in #2339 — Telegram-getUpdates
shape for external runtimes that have no public URL.

What this PR does:

  - Migration 045: NOT NULL TEXT column, default 'push', CHECK constraint
    on the two valid values.
  - models.Workspace + RegisterPayload + CreateWorkspacePayload gain a
    DeliveryMode field. RegisterPayload.URL drops the `binding:"required"`
    tag — the handler now enforces it conditionally on the resolved mode.
  - Register handler: validates explicit delivery_mode if set; resolves
    effective mode (payload value, else stored row value, else push) AFTER
    the C18 token check; validates URL only when effective mode is push;
    persists delivery_mode in the upsert; returns it in the response;
    skips URL caching when payload.URL is empty.
  - CreateWorkspace handler: persists delivery_mode (defaults to push) in
    the same INSERT, validates it before any side effects.

What this PR does NOT do (intentional, follow-up PRs):

  - PR 2: short-circuit ProxyA2A for poll-mode workspaces (skip SSRF +
    dispatch, log a2a_receive activity, return 200).
  - PR 3: since_id cursor on GET /activity for lossless polling.
  - Plugin v0.2 in molecule-mcp-claude-channel: cursor persistence + a
    register helper that creates poll-mode workspaces.

Backwards compatibility: every existing workspace stays push-mode (schema
default) with identical behavior. New tests:
TestRegister_PollMode_AcceptsEmptyURL,
TestRegister_PushMode_RejectsEmptyURL,
TestRegister_InvalidDeliveryMode,
TestRegister_PollMode_PreservesExistingValue. All existing register +
create tests updated to expect the new delivery_mode column in the
INSERT args.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 21:47:14 -07:00
Hongming Wang
86d9cb8b55
Merge pull request #2334 from Molecule-AI/auto/chat-files-comment-update
docs(chat_files): update header — Download is HTTP-forward, not docker-cp
2026-04-30 03:32:06 +00:00
Hongming Wang
82f73b1fa3 docs(chat_files): update header — Download is HTTP-forward, not docker-cp
The header comment claimed:
  "file upload (HTTP-forward) + download (Docker-exec)"
and:
  "Download still uses the v1 docker-cp path; migrating it lives in
   the next PR in this stack"

Both wrong now. RFC #2312 PR-D landed the Download HTTP-forward path:
chat_files.go:336 builds an http.NewRequestWithContext to
${wsURL}/internal/file/read?path=<abs>, with the response streamed
back to the caller. The workspace-side Starlette handler is at
workspace/internal_file_read.py, mounted at workspace/main.py:440.

Update the header to reflect actual code: both upload + download are
HTTP-forward, share the same per-workspace platform_inbound_secret
auth, and work uniformly on local Docker and SaaS EC2.

Pure docs change — no behavior, no build/test impact.
2026-04-29 20:28:58 -07:00
Hongming Wang
b6d223cd0a feat(a2a): per-queue-id status endpoint + per-message TTL (RFC #2331 Tier 1)
Closes the observability gap surfaced in #2329 item 5: callers received
queue_id in the 202 enqueue response but had no public lookup. The only
existing observability path was check_task_status (delegation-flavored
A2A only — joins via request_body->>'delegation_id'). Cross-workspace
peer-direct A2A had no observability after enqueue.

This PR ships RFC #2331's Tier 1: minimum viable observability + caller-
specified TTL. No schema migration — expires_at column already exists
(migration 042); only DequeueNext was honoring it, with no caller path
to populate it.

Two changes:

1. extractExpiresInSeconds(body) — new helper mirroring
   extractIdempotencyKey/extractDelegationIDFromBody. Pulls
   params.expires_in_seconds from the JSON-RPC body. Zero (the unset
   default) preserves today's infinite-TTL semantics. EnqueueA2A grew
   an expiresAt *time.Time parameter; the proxy callsite computes
   *time.Time from the extracted seconds and threads it through to
   the INSERT (see the sketch after this list).

2. GET /workspaces/:id/a2a/queue/:queue_id — new public handler.
   Auth: caller's workspace token must match queue.caller_id OR
   queue.workspace_id, OR be an org-level token. 404 (not 403) on
   auth failure to avoid leaking queue_id existence. Response
   includes status/attempts/last_error/timestamps/expires_at; embeds
   response_body via LEFT JOIN against activity_logs when status=
   completed for delegation-flavored items.

What this does NOT change:
  - Drain semantics (heartbeat-driven dispatch).
  - Native-session bypass (claude-agent-sdk, hermes still skip queue).
  - Schema (column already exists).
  - MCP tools (delegate_task_async / check_task_status keep their
    contract; this is a parallel queue-id surface).

Tests:
  - 7 cases on extractExpiresInSeconds covering absent/positive/
    zero/negative/invalid-JSON/wrong-type/empty-params.
  - go vet + go build clean.
  - Full handlers test suite passes (no regressions from the
    EnqueueA2A signature change — only one production caller).

Tier 2 (cross-workspace stitch + webhook callback) and Tier 3
(controllerized lifecycle) deferred per RFC #2331.
2026-04-29 20:21:17 -07:00
Hongming Wang
0b1d4f294b
Merge pull request #2304 from Molecule-AI/docs/molecule-channel-plugin-pointer
docs: surface molecule-mcp-claude-channel plugin in external-workspace flow + CONTRIBUTING
2026-04-30 00:45:51 +00:00
Hongming Wang
5d34abd5b5 Merge remote-tracking branch 'origin/staging' into auto/issue-2312-pr-f-saas-secret-delivery
# Conflicts:
#	scripts/build_runtime_package.py
2026-04-29 16:46:23 -07:00
Hongming Wang
5806feadcc
Merge pull request #2314 from Molecule-AI/auto/issue-2312-pr-b-workspace-ingest
feat(workspace): /internal/chat/uploads/ingest endpoint (RFC #2312, PR-B)
2026-04-29 23:40:19 +00:00
Hongming Wang
ca6fc55c8b fix(a2a_proxy): derive callerID from bearer when X-Workspace-ID absent (#2306)
External callers (third-party SDKs, the channel plugin) authenticate
purely via bearer and frequently don't set the X-Workspace-ID header.
Without this, activity_logs.source_id ends up NULL — breaking the
peer_id signal on notifications, the "Agent Comms by peer" canvas tab,
and any analytics that breaks down inbound A2A by sender.

The bearer is the authoritative caller identity per the wsauth contract
(it's what proves who you are); the header is a display/routing hint
that must agree with it. So we derive callerID from the bearer's owning
workspace whenever the header is absent. The existing validateCallerToken
guard fires after this and enforces token-to-callerID binding the same
way it always has.

Org-token requests are skipped — those grant org-wide access and don't
bind to a single workspace, so the canvas-class semantics (callerID="")
are preserved. Bearer-resolution failures (revoked, removed workspace)
fall through to canvas-class as well, never 401.

New wsauth.WorkspaceFromToken exposes the bearer→workspace lookup as a
modular interface; mirrors ValidateAnyToken's defense-in-depth JOIN on
workspaces.status != 'removed'.

Tests: 4 unit tests on WorkspaceFromToken + 3 integration tests on
ProxyA2A covering the three observable paths (bearer-derived,
org-token skipped, derive-failure fallthrough).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 16:05:56 -07:00
Hongming Wang
e8943dffd7
Merge pull request #2313 from Molecule-AI/auto/issue-2312-chat-upload-http-forward
feat(wsauth): platform→workspace inbound secret (RFC #2312, PR-A)
2026-04-29 22:29:43 +00:00
Hongming Wang
e955597a98 feat(chat_files): rewrite Download as HTTP-forward (RFC #2312, PR-D)
Mirrors PR-C's Upload migration: replaces the docker-cp tar-stream
extraction with a streaming HTTP GET to the workspace's own
/internal/file/read endpoint. Closes the SaaS gap for downloads —
without this PR, GET /workspaces/:id/chat/download still returns 503
on Railway-hosted SaaS even after A+B+C+F land.

Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → PR-F #2319 → this PR.

Why a single broad /internal/file/read instead of /internal/chat/download:

  Today's chat_files.go::Download already accepts paths under any of the
  four allowed roots {/configs, /workspace, /home, /plugins} — it's not
  strictly chat. Future PRs (template export, etc.) will reuse this
  endpoint via the same forward pattern; reusing avoids three near-
  identical handlers (one per domain) with duplicated path-safety logic.

Path safety is duplicated on platform + workspace sides — defence in
depth via two parallel checks, not "trust the workspace."

Changes:
  * workspace/internal_file_read.py — Starlette handler. Validates path
    (must be absolute, under allowed roots, no traversal, canonicalises
    cleanly). lstat (not stat) so a symlink at the path doesn't redirect
    the read. Streams via FileResponse (no buffering). Mirrors Go's
    contentDispositionAttachment for Content-Disposition header.
  * workspace/main.py — registers GET /internal/file/read alongside the
    POST /internal/chat/uploads/ingest from PR-B.
  * scripts/build_runtime_package.py — adds internal_file_read to
    TOP_LEVEL_MODULES so the publish-runtime cascade rewrites its
    imports correctly. Also includes the PR-B additions
    (internal_chat_uploads, platform_inbound_auth) since this branch
    was rooted before PR-B's drift-gate fix; merge-clean alphabetic
    additions.
  * workspace-server/internal/handlers/chat_files.go — Download
    rewritten as streaming HTTP GET forward. Resolves workspace URL +
    platform_inbound_secret (same shape as Upload), builds GET request
    with path query param, propagates response headers (Content-Type /
    Content-Length / Content-Disposition) + body. Drops archive/tar
    + mime imports (no longer needed). Drops Docker-exec branch entirely
    — Download is now uniform across self-hosted Docker and SaaS EC2.
  * workspace-server/internal/handlers/chat_files_test.go — replaces
    TestChatDownload_DockerUnavailable (stale post-rewrite) with 4
    new tests:
      - TestChatDownload_WorkspaceNotInDB → 404 on missing row
      - TestChatDownload_NoInboundSecret → 503 on NULL column
        (with RFC #2312 detail in body)
      - TestChatDownload_ForwardsToWorkspace_HappyPath → forward shape
        (auth header, GET method, /internal/file/read path) + headers
        propagated + body byte-for-byte
      - TestChatDownload_404FromWorkspacePropagated → 404 from
        workspace propagates (NOT remapped to 500)
    Existing TestChatDownload_InvalidPath path-safety tests preserved.
  * workspace/tests/test_internal_file_read.py — 21 tests covering
    _validate_path matrix (absolute, allowed roots, traversal, double-
    slash, exact-match-on-root), 401 on missing/wrong/no-secret-file
    bearer, 400 on missing path/outside-root/traversal, 404 on missing
    file, happy-path streaming with correct Content-Type +
    Content-Disposition, special-char escaping in Content-Disposition,
    symlink-redirect-rejection (lstat-not-stat protection).

Test results:
  * go test ./internal/handlers/ ./internal/wsauth/ — green
  * pytest workspace/tests/ — 1292 passed (was 1272 before PR-D)

Refs #2312 (parent RFC), #2308 (chat upload+download 503 incident).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:19:02 -07:00
Hongming Wang
055e447355 feat(saas): deliver platform_inbound_secret via /registry/register (RFC #2312, PR-F)
Closes the SaaS-side gap that PR-A acknowledged but didn't fix: SaaS
workspaces have no persistent /configs volume, so the platform_inbound_secret
that PR-A's provisioner wrote at workspace creation never reaches the
runtime. Without this, even after the entire RFC #2312 stack lands,
SaaS chat upload would 401 (workspace fails-closed when /configs/.platform_inbound_secret
is missing).

Solution: return the secret in the /registry/register response body
on every register call. The runtime extracts it and persists to
/configs/.platform_inbound_secret at mode 0600. Idempotent — Docker-
mode workspaces also receive it and overwrite the value the provisioner
already wrote (same value until rotation).

Why on every register, not just first-register:
  * SaaS containers can be restarted (deploys, drains, EBS detach/
    re-attach) — /configs is rebuilt empty on each fresh start.
  * The auth_token is "issue once" because re-issuing rotates and
    invalidates the previous one. The inbound secret has no rotation
    flow yet (#2318) so re-sending the same value is harmless.
  * Eliminates the bootstrap window where a restarted SaaS workspace
    has no inbound secret on disk and would 401 every platform call.

Changes:
  * workspace-server/internal/handlers/registry.go — Register handler
    reads workspaces.platform_inbound_secret via wsauth.ReadPlatformInboundSecret
    and includes it in the response body. Legacy workspaces (NULL
    column) get a successful registration with the field omitted.
  * workspace-server/internal/handlers/registry_test.go — two new tests:
      - TestRegister_ReturnsPlatformInboundSecret_RFC2312_PRF: secret
        present in DB → secret in response, alongside auth_token.
      - TestRegister_NoInboundSecret_OmitsField: NULL column → field
        omitted, registration still 200.
  * workspace/platform_inbound_auth.py — adds save_inbound_secret(secret).
    Atomic write via tmp + os.replace, mode 0600 from os.open(O_CREAT,
    0o600) so a concurrent reader never sees 0644-default. Resets the
    in-process cache after write so the next get_inbound_secret() returns
    the freshly-written value (rotation-safe when it lands).
  * workspace/main.py — register-response handler extracts
    platform_inbound_secret alongside auth_token and persists via
    save_inbound_secret. Mirrors the existing save_token pattern.
  * workspace/tests/test_platform_inbound_auth.py — 6 new tests for
    save_inbound_secret: writes file, mode 0600, overwrite-existing,
    cache invalidation after save, empty-input no-op, parent-dir creation
    for fresh installs.

Test results:
  * go test ./internal/handlers/ ./internal/wsauth/ — all green
  * pytest workspace/tests/ — 1272 passed (was 1266 before this PR)

Refs #2312 (parent RFC), #2308 (chat upload 503 incident).
Stacks: PR-A #2313 → PR-B #2314 → PR-C #2315 → this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:12:34 -07:00
Hongming Wang
c02cb0e1b6 review: defer forward-time URL re-validation to follow-up (#2316)
Self-review found the original draft of this PR added forward-time
validateAgentURL() as defense-in-depth — paranoia layer on top of the
existing register-time gate. The validator unconditionally blocks
loopback (127.0.0.1/8), which makes httptest-based proxy tests
impossible without an env-var hatch I'd rather not add to a security-
critical path on first pass.

Trust note kept inline pointing at the upstream gate + tracking issue
so the gap is explicit, not invisible.

Refs #2312.
2026-04-29 14:33:41 -07:00
Hongming Wang
e632a31347 feat(chat_files): rewrite Upload as HTTP-forward to workspace (RFC #2312, PR-C)
Closes the SaaS upload gap (#2308) with the unified architecture from
RFC #2312: same code path on local Docker and SaaS, no Docker socket
dependency, no `dockerCli == nil` cliff. Stacked on PR-A (#2313) +
PR-B (#2314).

Before:
  Upload → findContainer (nil in SaaS) → 503

After:
  Upload → resolve workspaces.url + platform_inbound_secret
        → stream multipart to <url>/internal/chat/uploads/ingest
        → forward response back unchanged

Same call site whether the workspace runs on local docker-compose
("http://ws-<id>:8000") or SaaS EC2 ("https://<id>.<tenant>...").
The bug behind #2308 cannot exist by construction.
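
The forward, sketched (assuming gin handlers; URL + secret resolution
and error mapping elided):

```go
// Upload, post-rewrite shape (sketch only).
req, err := http.NewRequestWithContext(ctx, http.MethodPost,
	wsURL+"/internal/chat/uploads/ingest", c.Request.Body) // stream the multipart, no buffering
if err != nil {
	c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
	return
}
req.Header.Set("Authorization", "Bearer "+inboundSecret)
req.Header.Set("Content-Type", c.GetHeader("Content-Type")) // keeps the multipart boundary intact
resp, err := h.httpClient.Do(req)
if err != nil {
	c.JSON(http.StatusBadGateway, gin.H{"error": "workspace unreachable"})
	return
}
defer resp.Body.Close()
// Propagate status + body unchanged: 413/400 from the workspace pass straight through.
c.DataFromReader(resp.StatusCode, resp.ContentLength, resp.Header.Get("Content-Type"), resp.Body, nil)
```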

Why streaming, not parse-then-re-encode:
  * No 50 MB intermediate buffer on the platform
  * Per-file size + path-safety enforcement is the workspace's job
    (see workspace/internal_chat_uploads.py, PR-B)
  * Workspace's error responses (413 with offending filename, 400 on
    missing files field, etc.) propagate through unchanged

Changes:
  * workspace-server/internal/handlers/chat_files.go — Upload rewritten
    as a streaming HTTP proxy. Drops sanitizeFilename, copyFlatToContainer,
    and the entire docker-exec path. ChatFilesHandler gains an httpClient
    (broken out for test injection). Download stays docker-exec for now;
    follow-up PR will migrate it to the same shape.
  * workspace-server/internal/handlers/chat_files_external_test.go —
    deleted. Pinned the wrong-headed runtime=external 422 gate from
    #2309 (already reverted in #2311). Superseded by the proxy tests.
  * workspace-server/internal/handlers/chat_files_test.go — replaced
    sanitize-filename tests (now in workspace/tests/test_internal_chat_uploads.py)
    with sqlmock + httptest proxy tests:
      - 400 invalid workspace id
      - 404 workspace row missing
      - 503 platform_inbound_secret NULL (with RFC #2312 detail)
      - 503 workspaces.url empty
      - happy-path forward (asserts auth header, content-type forwarded,
        body streamed, response propagated back)
      - 413 from workspace propagated unchanged (NOT remapped to 500)
      - 502 on workspace unreachable (connect refused)
    Existing Download + ContentDisposition tests preserved.
  * tests/e2e/test_chat_upload_e2e.sh — single-script-everywhere E2E.
    Takes BASE as env (default http://localhost:8080). Creates a
    workspace, waits for online, mints a test token, uploads a fixture,
    reads it back via /chat/download, asserts content matches +
    bearer-required. Same script runs against staging tenants (set
    BASE=https://<id>.<tenant>.staging.moleculesai.app).

Test plan:
  * go build ./... — green
  * go test ./internal/handlers/ ./internal/wsauth/ — green (full suite)
  * tests/e2e/test_chat_upload_e2e.sh against local docker-compose
    after PR-A + PR-B + this PR all merge — TODO before merge

Refs #2312 (parent RFC), #2308 (chat upload 503 incident).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:26:37 -07:00
Hongming Wang
1c9cea980d feat(wsauth): platform→workspace inbound secret (RFC #2312, PR-A)
Foundation for the HTTP-forward architecture that replaces Docker-exec
in chat upload + 5 follow-on handlers. This PR is intentionally scoped
to schema + token mint + provisioner wiring; no caller reads the secret
yet so behavior is unchanged.

Why a second per-workspace bearer (not reuse the existing
workspace_auth_tokens row):

  workspace_auth_tokens         workspaces.platform_inbound_secret
  ─────────────────────         ─────────────────────────────────
  workspace → platform          platform → workspace
  hash stored, plaintext gone   plaintext stored (platform reads back)
  workspace presents bearer     platform presents bearer
  platform validates by hash    workspace validates by file compare

Distinct roles, distinct rotation lifecycle, distinct audit signal —
splitting later would require a fleet-wide rolling rotation, so paying
the schema cost up front.

Changes:
  * migration 044: ADD COLUMN workspaces.platform_inbound_secret TEXT
  * wsauth.IssuePlatformInboundSecret + ReadPlatformInboundSecret
  * issueAndInjectInboundSecret hook in workspace_provision: mints
    on every workspace create / re-provision; Docker mode writes
    plaintext to /configs/.platform_inbound_secret alongside .auth_token,
    SaaS mode persists to DB only (workspace will receive via
    /registry/register response in a follow-up PR)
  * 8 unit tests against sqlmock — covers happy path, rotation, NULL
    column, empty string, missing workspace row, empty workspaceID

PR-B (next) wires up workspace-side `/internal/chat/uploads/ingest`
that validates the bearer against /configs/.platform_inbound_secret.

Refs #2312 (parent RFC), #2308 (chat upload 503 incident).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:09:33 -07:00
Hongming Wang
51e48a267a revert(chat_files): drop the wrong external-runtime gate (#2308)
PR #2309 added an early-return that 422'd uploads to external workspaces
with "file upload not supported." Both halves of that diagnosis were wrong:

  1. External workspaces SHOULD support uploads — gating with 422
     locks off intended functionality and labels it as design.
  2. The 503 the user actually hit was on an INTERNAL workspace, not
     an external one. The runtime check never even ran.

Real root cause (separate fix incoming):
  - findContainer(...) requires a non-nil h.docker.
  - In SaaS (MOLECULE_ORG_ID set), main.go selects the CP provisioner
    instead of the local Docker provisioner — dockerCli is nil.
  - findContainer short-circuits to "" → 503 "container not running"
    on every workspace, internal or external, on Railway-hosted
    SaaS where workspaces actually live on EC2.

This PR strips the misleading gate so #2308 can be re-investigated
against the real symptom. The proper fix routes the multipart upload
over HTTP to the workspace's URL when dockerCli is nil — tracked
as a follow-up.

Refs #2308.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:52:23 -07:00
Hongming Wang
4a6095ee1a fix(chat_files): return 422 with structured detail for external workspaces (closes #2308)
Symptom: pasting a screenshot into the canvas chat for a runtime="external"
workspace returned `503 {"error":"workspace container not running"}` —
accurate from the upload handler's POV (no container exists for external
workspaces) but misleading because it implies the container has crashed.

Fix: detect runtime="external" via DB lookup BEFORE the container-find
step and return 422 with:
  - error: "file upload not supported for external workspaces"
  - detail: explains why + points at admin/secrets workaround +
    references issue #2308 for the v0.2 native-support roadmap
  - runtime: "external" (machine-readable for clients)

Why 422 not 200/501:
- 422 = Unprocessable Entity — the request is well-formed but the
  workspace's runtime can't accept it. Standard REST semantics.
- 200 with empty result would lie; 501 implies the API itself is
  unimplemented (it's not — works for non-external workspaces); 503
  was the misleading status this PR fixes.

Verified via live E2E against localhost:
- Created `runtime=external,external=true` workspace
- Posted multipart to /workspaces/:id/chat/uploads
- Got 422 with the expected structured body

Unit test (`chat_files_external_test.go`) pins the contract via sqlmock
+ httptest. Notable: the handler is constructed with `templates: nil`
to prove the runtime check happens BEFORE any docker plumbing — if a
future change moves the check below findContainer, the test crashes
on nil-deref instead of silently regressing.

Out of scope (for v0.2 follow-up):
- Native external-workspace file ingest via artifacts table or the
  channel-plugin's inbox/ pattern. Requires separate design pass.

Closes #2308

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:37:49 -07:00
Hongming Wang
34d467fe8a docs: surface molecule-mcp-claude-channel plugin in external-workspace creation + CONTRIBUTING
Adds a third snippet alongside externalCurlTemplate / externalPythonTemplate
in workspace-server/internal/handlers/external_connection.go: the new
externalChannelTemplate guides operators through installing the Claude Code
channel plugin (Molecule-AI/molecule-mcp-claude-channel — scaffolded today)
and dropping the .env config for it.

Wires the new snippet into the external-workspace POST /workspaces response
under key `claude_code_channel_snippet`, alongside the existing
`curl_register_template` and `python_snippet`. Canvas's "external workspace
created" modal can render it as a third tab.

CONTRIBUTING.md gains a short "External integrations" section pointing at
the three peer repos (workspace-runtime, sdk-python, mcp-claude-channel)
so contributors know where related runtime artifacts live and to consider
downstream impact when changing the A2A wire shape.

The plugin itself is scaffolded at commit d07363c on the new repo's main
branch; v0.1 is polling-based via the /activity?since_secs= filter shipped
in PR #2300. README + roadmap details there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:33:31 -07:00
Hongming Wang
949b1b97a5
Merge pull request #2300 from Molecule-AI/auto/issues-2269-2268-restartstates-leak-and-since-secs
fix(workspace_crud) + feat(activity): restartStates leak (#2269) + since_secs param (#2268)
2026-04-29 16:22:34 +00:00
Hongming Wang
9559118678 feat(activity): accept ?since_secs= for time-window filtering (#2268)
The harness runner (scripts/measure-coordinator-task-bounds-runner.sh)
calls `/workspaces/:id/activity?since_secs=$A2A_TIMEOUT` to scope a
trace to a specific test window. The query param was silently
ignored — `ActivityHandler.List` accepted only `type`, `source`, and
`limit`, so the runner got the most-recent-100 events regardless of
how long ago they happened. Works for fresh-tenant tests where
activity_logs is ~empty pre-run, breaks on busy tenants and on tests
that exceed 100 events.

Adds `since_secs` parsing with three behaviors:

- Valid positive int → `AND created_at >= NOW() - make_interval(secs => $N)`
  on the SQL. Parameterised; values bound via lib/pq, not interpolated.
  `make_interval(secs => $N)` is required — the `INTERVAL '$N seconds'`
  literal form rejects placeholder substitution inside the string (see
  the sketch after this list).
- Above 30 days (2_592_000s) → silently clamped to the cap. Defends
  against a paranoid client triggering a multi-month full-table scan
  via `since_secs=999999999`.
- Negative, zero, or non-integer → 400 with a structured error, NOT
  silently dropped. Silent drop is exactly the bug this is fixing
  — a typoed param shouldn't be lost as most-recent-100.
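
The accepted/clamped/rejected handling, sketched (placeholder index
shown as $2 for illustration; the real position depends on the other
filters in play):

```go
if raw := c.Query("since_secs"); raw != "" {
	secs, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || secs <= 0 {
		// Loud rejection: silent drop is exactly the bug being fixed.
		c.JSON(http.StatusBadRequest, gin.H{"error": "since_secs must be a positive integer"})
		return
	}
	const maxSinceSecs = 2_592_000 // 30 days: clamp rather than allow a multi-month scan
	if secs > maxSinceSecs {
		secs = maxSinceSecs
	}
	// make_interval accepts a bound parameter; INTERVAL '$N seconds' would not.
	query += ` AND created_at >= NOW() - make_interval(secs => $2)`
	args = append(args, secs)
}
```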

Tests cover all four paths: accepted (with arg-binding assertion via
sqlmock.WithArgs), clamped at 30 days, invalid rejected (5 sub-cases),
and omitted (verifies no extra clause / arg leak via strict WithArgs
count).

RFC #2251 §V1.0 step 6 (platform-side-transition audit) also depends
on this for time-window filtering of activity_logs.

Closes #2268

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:53:52 -07:00
Hongming Wang
f75599eba9 fix(workspace_crud): drop restartStates entries on workspace delete (#2269)
Per-workspace `restartState` entries (introduced under the name
`restartMu` pre-#2266, renamed to `restartStates` in #2266) are
created via `LoadOrStore` in `workspace_restart.go` but never
deleted. On a long-running platform process serving many short-lived
workspaces (E2E tests, transient sandbox tenants), the sync.Map grows
monotonically — ~16 bytes per workspace ever created.

Fix: call `restartStates.Delete(wsID)` after stopAndRemove +
ClearWorkspaceKeys for each cascaded descendant and the parent. Mirrors
the existing per-ID cleanup loop. `sync.Map.Delete` is safe on absent
keys, so workspaces that were never restarted (no LoadOrStore call)
are no-op.

This is a pre-existing leak — #2266 did not introduce it; just renamed
the holder. Filing as a separate commit to keep the change minimal and
reviewable.

Closes #2269

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 05:53:34 -07:00
Hongming Wang
80c612d987 fix(org-import): remove force=true bypass of required-env preflight
The pre-#2290 `force: true` flag on POST /org/import skipped the
required-env preflight, letting orgs import without their declared
required keys (e.g. ANTHROPIC_API_KEY). The ux-ab-lab incident: that
import path was used, the org shipped without ANTHROPIC_API_KEY in
global_secrets, and every workspace 401'd on the first LLM call.

Per #2290 picks (C/remove/both):
- Q1=C: template-derived required_env (no schema change — already
  the existing aggregation via collectOrgEnv).
- Q2=remove: drop the bypass entirely. The seed/dev-org flow that
  legitimately needs to skip becomes a separate dry-run-import path
  with its own audit trail, not a permission bypass.
- Q3=block-at-import-only: provision-time drift logging is a
  follow-up; for this PR, blocking at import is the gate.

Surface change:
- Force field removed from POST /org/import request body.
- 412 "suggestion" text drops the "or pass force=true" guidance.
- Legacy callers sending {"force": true} are silently tolerated
  (Go's json.Unmarshal drops unknown fields), so no client-side
  breakage; the bypass effect is just gone.

Audited callers in this repo:
- canvas/src/components/TemplatePalette.tsx — never sends force.
- scripts/post-rebuild-setup.sh — never sends force.
- Only external tooling sent force=true. Those callers must now set
  the global secret via POST /settings/secrets before importing.

Adds TestOrgImport_ForceFieldRemoved as a structural pin: if a future
change re-adds Force to the body struct, the test fails and forces an
explicit reckoning with the #2290 rationale.

Closes #2290

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 03:23:23 -07:00
Hongming Wang
bdfa45572e fix(restart): clear running flag on panic in cycle()
Self-review caught a regression I introduced in #2266: if cycle() panics
(e.g. a future provisionWorkspace nil-deref or any runtime error from
the DB / Docker / encryption stacks it touches), the loop never reaches
`state.running = false`. The flag stays true forever, the early-return
guard at the top of coalesceRestart fires for every subsequent call,
and that workspace is permanently locked out of restarts until the
platform process restarts.

The pre-fix code had similar exposure (panic killed the goroutine
before defer wsMu.Unlock() ran in some Go versions), but my pending-
flag version made it worse: the guard is sticky, not ephemeral.

Fix: defer the state-clear so it always runs on exit, including panic.
Recover (and DON'T re-raise) so the panic doesn't propagate to the
goroutine boundary and crash the whole platform process — RestartByID
is always called via `go h.RestartByID(...)` from HTTP handlers, and
an unrecovered goroutine panic in Go terminates the program. Crashing
the platform for every tenant because one workspace's cycle panicked
is the wrong availability tradeoff. The panic message + full stack
trace via runtime/debug.Stack() are still logged for debuggability.
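
The deferred clear, sketched (state bookkeeping and locking around it
elided):

```go
// Registered in coalesceRestart before the cycle runs (sketch only).
defer func() {
	if r := recover(); r != nil {
		// Log and suppress: this runs on a goroutine, so re-raising would
		// terminate the whole platform process for every tenant.
		log.Printf("restart cycle panicked for %s: %v\n%s", workspaceID, r, debug.Stack())
	}
	state.running = false // runs on every exit path, including panic
}()
cycle()
```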

Regression test in TestCoalesceRestart_PanicInCycleClearsState:

  1. First call's cycle panics. coalesceRestart's defer must swallow
     the panic — assert no panic propagates out (would crash the
     platform process from a goroutine in production).
  2. Second call must run a fresh cycle (proves running was cleared).

All 7 tests pass with -race -count=10.

Surfaced via /code-review-and-quality self-review of #2266; the
re-raise-after-recover anti-pattern (originally argued as "don't
mask bugs") came up in the comprehensive review and was corrected
to log-with-stack-and-suppress for availability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 00:00:12 -07:00
Hongming Wang
f088090b27 fix(restart): coalesce concurrent restart requests via pending flag
The naive mutex-with-TryLock pattern in RestartByID was silently dropping
the second of two close-together restart requests. SetSecret and SetModel
both fire `go restartFunc(...)` from their HTTP handlers, and both DB
writes commit before either restart goroutine reaches loadWorkspaceSecrets.
If the second goroutine arrives while the first holds the per-workspace
mutex, TryLock returns false and the second is logged-and-dropped:

    Auto-restart: skipping <id> — restart already in progress

The first goroutine's loadWorkspaceSecrets ran before the second write
committed, so the new container boots without that env var. Surfaced
during the RFC #2251 V1.0 measurement as hermes returning "No LLM
provider configured" when MODEL_PROVIDER landed after the API-key write
and lost its restart to the mutex (HERMES_DEFAULT_MODEL absent →
start.sh fell back to nousresearch/hermes-4-70b → derived
provider=openrouter → no OPENROUTER_API_KEY → request-time error).
The same race hits any back-to-back secret/model save flow including
the canvas's "set MiniMax key + pick model" UX.

Fix: pending-flag / coalescing pattern. Any restart request that arrives
while one is in flight sets `pending=true` and returns. The in-flight
runner, on completion, checks the flag and runs another cycle. This
collapses N concurrent requests into at most 2 sequential cycles (the
current one + one more that picks up everyone who arrived during it),
while guaranteeing the final container always sees the latest secrets.

Concrete contract:

  - 1 request, no concurrency:        1 cycle
  - N concurrent requests during 1 in-flight cycle: 2 cycles total
  - N sequential requests (no overlap):              N cycles
  - Per-workspace state — different workspaces never serialize

Coalescing is extracted into `coalesceRestart(workspaceID, cycle func())`
so the gate logic is testable without the full WorkspaceHandler / DB /
provisioner stack. RestartByID now wraps that with the production cycle
function. runRestartCycle calls provisionWorkspace SYNCHRONOUSLY (drops
the historical `go`) so the loop's pending-flag check happens AFTER the
new container is up — without that, the next cycle's Stop call would
race the previous cycle's still-spawning provision goroutine.
sendRestartContext stays async; it's a one-way notification.
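
The gate itself, sketched as a standalone unit (the real
coalesceRestart also manages the per-workspace restartStates map and
takes the workspace id; this is the single-workspace core):

```go
package handlers

import "sync"

// restartState holds the coalescing flags for one workspace (sketch).
type restartState struct {
	mu      sync.Mutex
	running bool
	pending bool
}

// coalesce collapses N overlapping requests into at most two sequential cycles.
func (s *restartState) coalesce(cycle func()) {
	s.mu.Lock()
	if s.running {
		s.pending = true // a cycle is in flight: ask it to run once more, then return
		s.mu.Unlock()
		return
	}
	s.running = true
	s.mu.Unlock()

	for {
		cycle() // synchronous: the pending check below happens after the new container is up

		s.mu.Lock()
		if !s.pending {
			s.running = false // drained: the next request starts a fresh cycle
			s.mu.Unlock()
			return
		}
		s.pending = false // pick up everyone who arrived during the last cycle
		s.mu.Unlock()
	}
}
```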

Tests in workspace_restart_coalesce_test.go cover all five contract
points + race-detector clean over 10 iterations:

  - Single call → 1 cycle
  - 5 concurrent during in-flight → exactly 2 cycles total
  - 3 sequential → 3 cycles
  - Pending-during-cycle picked up (targeted bug repro)
  - State cleared after drain (running flag reset)
  - Per-workspace isolation (no cross-workspace serialization)

Refs: molecule-core#2256 (V1.0 gate measurement); root cause for the
"No LLM provider configured" symptom seen during hermes/MiniMax repro.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 23:31:56 -07:00
Hongming Wang
0cdbc2c4f6 chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave
Consolidates 11 of the 17 open Dependabot PRs (#2215, #2217, #2219-#2225,
#2227, #2229) into one PR. Every entry is a patch / minor / floor bump
where the impact surface is small and CI carries the proof.

Same pattern as the 2026-04-15 batch.

Go (workspace-server/go.mod + go.sum, regenerated via `go mod tidy`):
  - golang.org/x/crypto                    0.49.0  → 0.50.0   (#2225)
  - github.com/golang-jwt/jwt/v5           5.2.2   → 5.3.1    (#2222)
  - github.com/gin-contrib/cors            1.7.2   → 1.7.7    (#2220)
  - github.com/docker/go-connections       0.6.0   → 0.7.0    (#2223)
  - github.com/redis/go-redis/v9           9.7.3   → 9.19.0   (#2217)

Python floor bumps (workspace/requirements.txt; current pip-resolved
versions don't change unless they happen to be below the new floor):
  - httpx                                  >=0.27  → >=0.28.1 (#2221)
  - uvicorn                                >=0.30  → >=0.46   (#2229)
  - temporalio                             >=1.7   → >=1.26   (#2227)
  - websockets                             >=12    → >=16     (#2224)
  - opentelemetry-sdk                      >=1.24  → >=1.41.1 (#2219)

GitHub Actions (SHA-pinned per existing convention):
  - dorny/paths-filter@d1c1ffe (v3) → @fbd0ab8 (v4.0.1)        (#2215)

REMOVED from this batch (lockfile platform mismatch):
  - #2231 @types/node ^22 → ^25.6   (npm install on macOS strips
    Linux-only @emnapi/* entries from package-lock.json that CI's
    `npm ci` then refuses; needs a Linux-side install to land cleanly)
  - #2230 jsdom ^25 → ^29.1          (same)

NOT included in this batch (deferred to per-PR human review):
  - #2228 github/codeql-action     v3 → v4   (CodeQL CLI alignment risk)
  - #2218 actions/setup-node       v4 → v6   (default Node version drift)
  - #2216 actions/upload-artifact  v4 → v7   (3 major versions)
  - #2214 actions/setup-python     v5 → v6   (action major)

NOT merged (CI failing on dependabot's own PR):
  - #2233 next 15 → 16
  - #2232 tailwindcss 3 → 4
  - #2226 typescript 5 → 6

Verified:
  - workspace-server: `go mod tidy && go build ./... && go test ./...` — green
  - workspace requirements.txt: floor bumps only
2026-04-28 16:25:46 -07:00
Hongming Wang
317196463a fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) Stops the workspace container synchronously
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:

  1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
     entry → workspace becomes a stale-token candidate.
  2. issueAndInjectToken runs in the goroutine: revokes old tokens,
     issues a fresh one, writes it to /configs/.auth_token.
  3. If the sweeper's predicate-only UPDATE
     `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
     IssueToken commits but is racing the SELECT-then-UPDATE window,
     it revokes the freshly-issued token alongside the old ones.
  4. Container starts with a now-revoked token → 401 forever.

The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.
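
Illustrative shape of the per-workspace revoke with the staleness
bound carried over (column names follow the existing schema; the
parameter order is an assumption):

  // Idempotent against tokens issued inside the grace window: a fresh
  // token cannot satisfy the "< now() - grace" bound.
  const revokeStaleTokensSQL = `
      UPDATE workspace_auth_tokens
         SET revoked_at = now()
       WHERE workspace_id = $1
         AND revoked_at IS NULL
         AND COALESCE(last_used_at, created_at)
               < now() - make_interval(secs => $2)`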

Also addresses other findings from the same review:

  - Add `status NOT IN ('removed', 'provisioning')` to the SELECT
    (R2 + first-line C1 defence). 'provisioning' is set synchronously
    in workspace_restart.go before the async re-provision begins, so
    it's a reliable in-flight signal that narrows the candidate set.

  - Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
    that helper revokes EVERY live token unconditionally; the sweeper
    needs "every STALE live token" which is a different (safer)
    operation. Inline the UPDATE so we own the predicate end-to-end.
    Drop the wsauth import (no longer needed in this package).

  - Tighten expectStaleTokenSweepNoOp regex to anchor at start and
    require the status filter, so a future query whose first line
    coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
    silently absorb the helper's expectation (R3).

  - Defensive `if reaper == nil { return }` at top of
    sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
    already short-circuits on nil, a future refactor that wires this
    pass directly without checking would otherwise mass-revoke in
    CP/SaaS mode (F2).

  - Comment in the function explaining why an empty `likes` list is
    intentionally NOT a short-circuit (asymmetry with the first two
    passes is the whole point — "no containers running" is the
    load-bearing case).

  - Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
    asserts the UPDATE shape (predicate present, grace bound). A
    real-Postgres integration test would prove the race resolution
    end-to-end; this catches the regression where someone simplifies
    the UPDATE back to predicate-only.

  - Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.

Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).
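
The tightened expectation is roughly (exact pattern is an assumption;
the point is the leading anchor plus the required status filter):

  mock.ExpectQuery(`^SELECT DISTINCT t\.workspace_id[\s\S]*`+
      `status NOT IN \('removed', 'provisioning'\)`).
      WillReturnRows(sqlmock.NewRows([]string{"workspace_id"}))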

Full Go suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:28:50 -07:00
Hongming Wang
3332e6878b fix(orphan-sweeper): revoke stale tokens for workspaces with no live container
Heals the user-reported "auth token conflict after volume wipe" failure
mode. When an operator nukes a workspace's /configs volume outside the
platform's restart endpoint (common via `docker compose down -v` or
manual cleanup scripts), the DB still holds live workspace_auth_tokens
for that workspace while the recreated container has an empty
/configs/.auth_token. Subsequent /registry/register calls 401 forever:
requireWorkspaceToken sees live tokens, container has no token to
present, and the workspace is permanently wedged until an operator
manually revokes via SQL.

The platform's restart endpoint already handles this correctly via
wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change
adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer —
as the safety net for the equivalent action taken outside the API.

Detection criterion: workspace has at least one live (non-revoked)
token whose most-recent activity (COALESCE(last_used_at, created_at))
is older than staleTokenGrace (5 minutes), AND no live Docker
container's name prefix matches the workspace ID.
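
A rough shape of the candidate query under those criteria (aliases
and the exact join are assumptions; the "no live container" half of
the check happens in Go against the Docker listing, not in SQL):

  const staleTokenCandidatesSQL = `
      SELECT DISTINCT t.workspace_id
        FROM workspace_auth_tokens t
        JOIN workspaces w ON w.id = t.workspace_id
       WHERE t.revoked_at IS NULL
         AND w.status != 'removed'
         AND COALESCE(t.last_used_at, t.created_at)
               < now() - make_interval(secs => $1)`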

Safety filters that bound the revoke radius:

  1. Only runs in single-tenant Docker mode. The orphan sweeper is
     wired only when prov != nil in cmd/server/main.go — CP/SaaS mode
     never gets here, so an empty container list cannot be confused
     with "no Docker at all" (which would otherwise revoke every
     workspace's tokens in production SaaS).

  2. staleTokenGrace = 5min skips tokens issued/used in the last
     5 minutes. Bounds the race with mid-provisioning (token issued
     moments before docker run completes) and brief restart windows
     — a healthy workspace touches last_used_at every 30s heartbeat,
     so 5min is 10× the heartbeat interval.

  3. The query joins workspaces.status != 'removed' so deleted
     workspaces are not revoked here (handled at delete time by the
     explicit RevokeAllForWorkspace call).

  4. make_interval(secs => $2) avoids a time.Duration.String() →
     "5m0s" mismatch with Postgres interval grammar that I caught
     during implementation.

  5. Each revocation logs the workspace ID so operators can correlate
     "workspace just lost auth" with this sweeper, not blame a
     network blip.

Failure mode: revoke fails (transient DB error). Loop bails to avoid
log spam; next 60s cycle retries. Worst case a workspace stays
401-blocked an extra minute.

Tests: 5 new tests covering the headline scenario, the safety gate
(workspace with container is NOT revoked), revoke-failure-bails-loop,
query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11
existing sweepOnce tests updated to register the new third-pass query
expectation via a small `expectStaleTokenSweepNoOp` helper that keeps
their existing assertions readable.

Full Go test suite green: registry, wsauth, handlers, and all other
packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:20:08 -07:00
Hongming Wang
c91c09dc55 fix(activity): include request/response bodies in ACTIVITY_LOGGED broadcast
Canvas Agent Comms bubbles for outbound delegation showed only
"Delegating to <peer>" boilerplate during the live update window —
the actual task text only surfaced after a refresh re-fetched the row
from /workspaces/:id/activity. Symptom flagged today during a fresh
delegation manual test where the bubble said "Delegating to Perf
Auditor" instead of the user's "audit moleculesai.app for
performance" prompt.

Root cause: LogActivity's broadcast payload at activity.go:510-518
deliberately omitted request_body and response_body, so the canvas's
live-update path (AgentCommsPanel.tsx:271-289) saw `p.request_body =
undefined` and toCommMessage fell back to the
`Delegating to ${peerName}` template string. The DB row stored the
real task / reply, which is why GET-on-mount worked.

Fix: include both bodies in the broadcast as json.RawMessage values
(no re-marshal cost — they were already encoded for the DB insert
above). Same pattern as tool_trace, which has been included since #1814.
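
Roughly, the delta at the broadcast site (variable names are
assumptions; only the request_body / response_body keys and the
json.RawMessage reuse come from this change):

  // Reuse the bytes already encoded for the DB insert; json.RawMessage
  // embeds them without a second marshal.
  payload["request_body"] = json.RawMessage(requestJSON)
  if len(responseJSON) > 0 {
      payload["response_body"] = json.RawMessage(responseJSON) // nil stays omitted
  }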

Each side is bounded by the workspace-side caller's own caps: the
runtime's report_activity helper caps error_detail at 4096 chars and
summary at 256; request/response are constrained by the runtime's
own limits — typical delegate_task payload is hundreds of chars to a
few KB. If a much-larger broadcast becomes a concern later, a soft
cap can be added at this site without breaking the contract.

Regression tests pin the broadcast shape:
- request_body present → canvas renders the actual task text
- response_body present → canvas renders the actual reply text
- response_body nil → omitted from payload (no empty-bubble flicker)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 13:38:23 -07:00
Hongming Wang
92d99d96fe fix(provisioner): treat "removal already in progress" as no-op success
Cascade-deleting a 7-workspace org returned 500 with

  "workspace marked removed, but 2 stop call(s) failed — please retry:
   stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response
   from daemon: removal of container ws-eeb99b5d-607 is already in
   progress"

even though the DB-side post-condition succeeded (removed_count=7) and
the containers WERE removed shortly after. The fanout fired Stop() on
every workspace concurrently and the orphan sweeper happened to reap
two of them at the same instant, so Docker rejected the second
ContainerRemove with "removal already in progress" — a race-condition
ack, not a real failure. Retrying just races the same in-flight
removal.

The post-condition we care about (the container WILL be gone) is
identical to a successful removal, so Stop() should treat it the
same way it already treats "No such container" — a no-op return nil
that lets the caller proceed with volume cleanup. Real daemon
failures (timeout, EOF, ctx cancel) still surface as errors.

Two pieces:

  - New isRemovalInProgress() predicate using the same string-match
    approach as isContainerNotFound (docker/docker has no typed
    errdef for this; the CLI itself relies on the message).

  - Stop() now treats the predicate as success, with a log line
    distinct from the not-found path so debugging can tell which
    race fired.

Both substrings ("removal of container" + "already in progress") must
match — "already in progress" alone would false-positive on unrelated
operations like image pulls. Truth table pinned in 7 new test cases.
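
The predicate in sketch form (mirrors the isContainerNotFound
string-match approach; the lowercasing is an assumption, the two
substrings are the ones above):

  func isRemovalInProgress(err error) bool {
      if err == nil {
          return false
      }
      msg := strings.ToLower(err.Error())
      // Both substrings must be present; "already in progress" alone
      // would false-positive on unrelated daemon operations.
      return strings.Contains(msg, "removal of container") &&
          strings.Contains(msg, "already in progress")
  }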

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 13:25:32 -07:00
Hongming Wang
fa5e0f5e4c deps(redis): bump go-redis/v9 v9.7.0 → v9.7.3 (GHSA-92cp-5422-2mw7)
Closes the LOW-severity dependabot alert on workspace-server's go-redis
pin. Upstream advisory GHSA-92cp-5422-2mw7: "go-redis allows potential
out-of-order responses when CLIENT SETINFO times out" — fixed in 9.7.3.

Patch bump within the v9.7 line; semver guarantees no API change.
Full workspace-server test suite passes (18/18 packages clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 06:54:13 -07:00
Hongming Wang
394dda2a4a deps(docker): bump docker/docker v28.2.2 → v28.5.2 (GHSA-x4rx-4gw3-53p4)
Closes the medium-severity dependabot alert #7 on workspace-server's
docker pin: "Moby firewalld reload makes published container ports
accessible from remote hosts" — fixed in v28.3.3, pulling v28.5.2
(latest in the v28 line).

Patch+minor bump within the v28 train; no client-API breaks
(workspace-server only uses docker.Client for container exec /
inspect, all stable since v20+).

Verification: full workspace-server test suite passes (18/18 packages
clean). Build clean.

Out of scope:
  - Alerts #10 and #11 (the AuthZ bypass + plugin-priv off-by-one)
    require v29.3.1, which is not yet published to the Go module
    proxy (latest published is v28.5.2). They'll close in a follow-up
    PR once v29 lands as a Go module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 06:26:53 -07:00
Hongming Wang
86bdfa3b47 deps(jwt): bump golang-jwt/jwt/v5 v5.2.1 → v5.2.2 (CVE-2025-30204)
Closes the HIGH-severity dependabot alert on workspace-server's jwt-go
pin. Upstream advisory GHSA-mh63-6h87-95cp / CVE-2025-30204:
"jwt-go allows excessive memory allocation during header parsing" —
fixed in v5.2.2.

Patch bump within the v5.x line; semver guarantees no API change. Full
workspace-server test suite passes (`go test ./...` clean across all
18 packages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 05:31:58 -07:00
Hongming Wang
7cf77f274a
Merge pull request #2166 from Molecule-AI/test/unblock-resolveandstage-test
test(plugins): unblock TestResolveAndStage_NoInternalErrorsInHTTPErr (#1814)
2026-04-27 11:36:15 +00:00
Hongming Wang
a0154ea0b4 test(plugins): unblock TestResolveAndStage_NoInternalErrorsInHTTPErr (#1814)
Closes the second of two skipped tests in workspace_provision_test.go
that were blocked on interface refactors. The Broadcaster + CP
provisioner halves landed in earlier #1814 cycles; this is the
plugin-source-registry half.

Refactor:
  - Add handlers.pluginSources interface with the 3 methods handler
    code actually calls (Register, Resolve, Schemes); see the sketch
    after this list
  - Compile-time assertion `var _ pluginSources = (*plugins.Registry)(nil)`
    catches future method-signature drift at build time
  - PluginsHandler.sources narrowed from *plugins.Registry to the
    interface; production wiring (NewPluginsHandler, WithSourceResolver)
    still passes *plugins.Registry — satisfies the interface
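
Sketch of the narrowing pattern (method signatures and the resolver
types are assumptions; only the three method names and the assertion
line come from this change):

  type pluginSources interface {
      Register(scheme string, r plugins.Resolver) error
      Resolve(source string) (*plugins.Resolved, error)
      Schemes() []string
  }

  // Build-time proof that the concrete registry still satisfies the
  // narrowed interface; signature drift fails compilation, not review.
  var _ pluginSources = (*plugins.Registry)(nil)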

Production fix (#1206 leak):
  - resolveAndStage's Fetch-failure path was interpolating err.Error()
    into the HTTP response body via `failed to fetch plugin from %s: %v`.
    Resolver errors routinely contain rate-limit text, github request
    IDs, raw HTTP body fragments, and (for local resolvers) file system
    paths — none has any business landing in a user's browser.
  - Body now carries just `failed to fetch plugin from <scheme>`; the
    status code already differentiates the failure shape (404 not
    found, 504 timeout, 502 generic). Full err detail stays in the
    server-side log line one statement above.

Test:
  - 6 sub-tests covering every error path inside resolveAndStage:
    empty source, invalid format, unknown scheme, local
    path-traversal, unpinned github (PLUGIN_ALLOW_UNPINNED unset),
    Fetch failure with a leaky synthetic error
  - The Fetch-failure case plants 5 realistic leak markers in the
    resolver's error string (rate limit text, x-github-request-id,
    auth_token, ghp_-prefixed token, /etc/passwd path); the assertion
    fails if ANY appears in the response body
  - Table-driven so a future error path added to resolveAndStage gets
    one new row, not a copy-paste of the assertion logic

Verification:
  - 6/6 sub-tests pass
  - Full workspace-server test suite passes (interface refactor is
    non-breaking; production caller paths unchanged)
  - go build ./... clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 04:00:39 -07:00
Hongming Wang
e15d1182cd test(provisioner): unblock TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast (#1814)
The skipped test exists to assert that provisionWorkspaceCP never
leaks err.Error() in WORKSPACE_PROVISION_FAILED broadcasts (regression
guard for #1206). Writing the test body required substituting a
failing CPProvisioner — but the handler's `cpProv` field was the
concrete *CPProvisioner type, so a mock had nowhere to plug in.

Refactor:
  - Add provisioner.CPProvisionerAPI interface with the 3 methods
    handlers actually call (Start, Stop, GetConsoleOutput)
  - Compile-time assertion `var _ CPProvisionerAPI = (*CPProvisioner)(nil)`
    catches future method-signature drift at build time
  - WorkspaceHandler.cpProv narrowed to the interface; SetCPProvisioner
    accepts the interface (production caller passes *CPProvisioner
    from NewCPProvisioner unchanged)

Test:
  - stubFailingCPProv whose Start returns a deliberately leaky error
    (machine_type=t3.large, ami=…, vpc=…, raw HTTP body fragment)
  - Drive provisionWorkspaceCP via the cpProv.Start failure path
  - Assert broadcast["error"] == "provisioning failed" (canned)
  - Assert no leak markers (machine type, AMI, VPC, subnet, HTTP
    body, raw error head) in any broadcast string value (see the
    sketch after this list)
  - Stop/GetConsoleOutput on the stub panic, flagging any future
    regression that reaches into them on this path
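
The no-leak assertion in sketch form (marker strings and the
broadcast capture are illustrative, not the real fixtures):

  leakMarkers := []string{"t3.large", "ami-", "vpc-", "subnet-", "<html"}
  for _, b := range capturedBroadcasts {
      for key, val := range b {
          s, ok := val.(string)
          if !ok {
              continue
          }
          for _, m := range leakMarkers {
              if strings.Contains(s, m) {
                  t.Fatalf("leak marker %q surfaced in broadcast field %q", m, key)
              }
          }
      }
  }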

Verification:
  - Full workspace-server test suite passes (interface refactor
    is non-breaking; production caller path unchanged)
  - go build ./... clean
  - The other skipped test in this file (TestResolveAndStage_…)
    is a separate plugins.Registry refactor and remains skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 03:28:25 -07:00
34b92c33b7
Merge pull request #2144 from Molecule-AI/feat/native-session-skip-queue
feat(runtime): native_session skips a2a_queue — primitive #5 of 6
2026-04-27 06:40:09 +00:00
Hongming Wang
ae64fe340a feat(runtime): native_session skips a2a_queue enqueue — primitive #5 of 6
When a target workspace's adapter has declared
provides_native_session=True (claude-code SDK's streaming session,
hermes-agent's in-container event log), the SDK owns its own queue/
session state. Adding the platform's a2a_queue layer on top would
double-buffer the same in-flight state — and worse, the platform
queue's drain timing has no relationship to the SDK's actual readiness,
so the queued request might dispatch while the SDK is STILL busy.

Behavior change: in handleA2ADispatchError, when isUpstreamBusyError(err)
fires and the target declared native_session, return 503 + Retry-After
directly without enqueueing. The caller's adapter handles retry on
its own schedule, and the SDK's own queue absorbs the request when
ready. Response body carries native_session=true so callers can
distinguish this from queue-failure 503s.
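
Sketch of the new branch (variable names, the capability accessor,
and the Retry-After value are assumptions; isUpstreamBusyError, the
503 shape, and the native_session marker come from this change):

  if isUpstreamBusyError(err) && targetCaps.NativeSession {
      logA2AFailure(workspaceID, err) // observability path unchanged
      c.Header("Retry-After", "5")
      c.JSON(http.StatusServiceUnavailable, gin.H{
          "error":          "upstream agent busy",
          "native_session": true, // distinguishes this from queue-failure 503s
      })
      return
  }
  // otherwise fall through to the existing EnqueueA2A platform-queue path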

Observability is preserved: logA2AFailure still runs above; the
broadcaster still fires; the activity_logs row records the busy event
just like the platform-fallback path.

This is the consumer that validates the template-side declarations
already shipped in:
  - molecule-ai-workspace-template-claude-code PR #12
  - molecule-ai-workspace-template-hermes PR #25
Once those merge + image tags bump, claude-code + hermes workspaces'
busy 503s skip the platform queue end-to-end. End-to-end validation
of capability primitive #5.

Tests (2 new):
  - NativeSession_SkipsEnqueue: cache pre-populated, deliberate
    sqlmock with NO INSERT INTO a2a_queue expected — implicit
    regression cover (sqlmock fails on unexpected queries). Asserts
    503 + Retry-After + native_session=true marker in body.
  - NoNativeSession_StillEnqueues: negative pin — empty cache, same
    busy error → falls through to EnqueueA2A (which fails in this
    test, falls through to legacy 503 without native_session marker).

Verification:
  - All Go handlers tests pass (2 new + existing)
  - go build + go vet clean

See project memory `project_runtime_native_pluggable.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:34:04 -07:00
Hongming Wang
186f25c261
Merge pull request #2141 from Molecule-AI/feat/native-status-mgmt-skip
feat(runtime): native_status_mgmt skip — primitive #4 of 6
2026-04-27 06:30:59 +00:00
Hongming Wang
b4b406c074 feat(runtime): native_status_mgmt skip — primitive #4 of 6
When an adapter declares provides_native_status_mgmt=True (because its
SDK reports its own ready/degraded/failed state explicitly), the
platform's error-rate-based status inference fights the adapter's own
state machine. This PR gates the inference branches on the capability
flag — adapter-driven transitions become authoritative.

Components:

  - registry.go evaluateStatus: gate the two inferred-status branches
    (online → degraded when error_rate ≥ 0.5; degraded → online when
    error_rate < 0.1 and runtime_state is empty) behind a check of
    runtimeOverrides.HasCapability("status_mgmt"); see the sketch
    after this list.

  - The wedged-branch (RuntimeState == "wedged" → degraded) is NOT
    gated. That path is the adapter's OWN self-report, not platform
    inference, and stays active under native_status_mgmt — adapters
    can still drive transitions via runtime_state.
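
In sketch form (surrounding variable names are assumptions; the
thresholds and the capability key come from this change):

  if runtimeState == "wedged" {
      return "degraded" // adapter self-report: never gated
  }
  if !runtimeOverrides.HasCapability("status_mgmt") {
      if current == "online" && errorRate >= 0.5 {
          return "degraded" // platform inference, only without the flag
      }
      if current == "degraded" && errorRate < 0.1 && runtimeState == "" {
          return "online"
      }
  }
  return current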

Python side: no change. The capability map is already serialized via
RuntimeCapabilities.to_dict() in PR #2137 and sent in the heartbeat's
runtime_metadata block via PR #2139. An adapter setting
RuntimeCapabilities(provides_native_status_mgmt=True) automatically
flows through.

Tests (3 new):
  - SkipsDegradeInference: error_rate=0.8 + currentStatus=online + native
    flag set → degrade UPDATE does NOT fire (sqlmock fails on unexpected
    query, which is the regression cover)
  - SkipsRecovery: error_rate=0.05 + currentStatus=degraded + native →
    recovery UPDATE does NOT fire
  - WedgedStillRespected: runtime_state="wedged" + native → wedged
    branch DOES fire (adapter self-report stays active)

Verification:
  - All Go handlers tests pass (3 new + existing)
  - 1308/1308 Python pytest pass (unchanged — Python side unmodified)
  - go build + go vet clean

Stacked on #2140 (already merged via cascade); branch is current with
staging since #2139 and #2140 merged.

See project memory `project_runtime_native_pluggable.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:13:13 -07:00
Hongming Wang
0473522cc5
Merge branch 'staging' into feat/idle-timeout-adapter-override 2026-04-26 22:52:42 -07:00