molecule-core

Author	SHA1	Message	Date
Hongming Wang	c02cb0e1b6	review: defer forward-time URL re-validation to follow-up (#2316) Self-review found that the original draft of this PR added forward-time validateAgentURL() as defense-in-depth: a paranoia layer on top of the existing register-time gate. The validator unconditionally blocks loopback (127.0.0.0/8), which makes httptest-based proxy tests impossible without an env-var escape hatch I'd rather not add to a security-critical path on first pass. A trust note is kept inline, pointing at the upstream gate and the tracking issue, so the gap is explicit, not invisible. Refs #2312.	2026-04-29 14:33:41 -07:00
Hongming Wang	e632a31347	feat(chat_files): rewrite Upload as HTTP-forward to workspace (RFC #2312 , PR-C) Closes the SaaS upload gap (#2308) with the unified architecture from RFC #2312: same code path on local Docker and SaaS, no Docker socket dependency, no `dockerCli == nil` cliff. Stacked on PR-A (#2313) + PR-B (#2314). Before: Upload → findContainer (nil in SaaS) → 503 After: Upload → resolve workspaces.url + platform_inbound_secret → stream multipart to <url>/internal/chat/uploads/ingest → forward response back unchanged Same call site whether the workspace runs on local docker-compose ("http://ws-<id>:8000") or SaaS EC2 ("https://<id>.<tenant>..."). The bug behind #2308 cannot exist by construction. Why streaming, not parse-then-re-encode: * No 50 MB intermediate buffer on the platform * Per-file size + path-safety enforcement is the workspace's job (see workspace/internal_chat_uploads.py, PR-B) * Workspace's error responses (413 with offending filename, 400 on missing files field, etc.) propagate through unchanged Changes: * workspace-server/internal/handlers/chat_files.go — Upload rewritten as a streaming HTTP proxy. Drops sanitizeFilename, copyFlatToContainer, and the entire docker-exec path. ChatFilesHandler gains an httpClient (broken out for test injection). Download stays docker-exec for now; follow-up PR will migrate it to the same shape. * workspace-server/internal/handlers/chat_files_external_test.go — deleted. Pinned the wrong-headed runtime=external 422 gate from #2309 (already reverted in #2311). Superseded by the proxy tests. 
* workspace-server/internal/handlers/chat_files_test.go — replaced sanitize-filename tests (now in workspace/tests/test_internal_chat_uploads.py) with sqlmock + httptest proxy tests: - 400 invalid workspace id - 404 workspace row missing - 503 platform_inbound_secret NULL (with RFC #2312 detail) - 503 workspaces.url empty - happy-path forward (asserts auth header, content-type forwarded, body streamed, response propagated back) - 413 from workspace propagated unchanged (NOT remapped to 500) - 502 on workspace unreachable (connect refused) Existing Download + ContentDisposition tests preserved. * tests/e2e/test_chat_upload_e2e.sh — single-script-everywhere E2E. Takes BASE as env (default http://localhost:8080). Creates a workspace, waits for online, mints a test token, uploads a fixture, reads it back via /chat/download, asserts content matches + bearer-required. Same script runs against staging tenants (set BASE=https://<id>.<tenant>.staging.moleculesai.app). Test plan: * go build ./... — green * go test ./internal/handlers/ ./internal/wsauth/ — green (full suite) * tests/e2e/test_chat_upload_e2e.sh against local docker-compose after PR-A + PR-B + this PR all merge — TODO before merge Refs #2312 (parent RFC), #2308 (chat upload 503 incident). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 14:26:37 -07:00
Hongming Wang	1c9cea980d	feat(wsauth): platform→workspace inbound secret (RFC #2312, PR-A) Foundation for the HTTP-forward architecture that replaces Docker-exec in chat upload + 5 follow-on handlers. This PR is intentionally scoped to schema + token mint + provisioner wiring; no caller reads the secret yet, so behavior is unchanged. Why a second per-workspace bearer (not reuse the existing workspace_auth_tokens row): * workspace_auth_tokens: direction workspace → platform; hash stored, plaintext gone; workspace presents bearer; platform validates by hash. * workspaces.platform_inbound_secret: direction platform → workspace; plaintext stored (platform reads back); platform presents bearer; workspace validates by file compare. Distinct roles, distinct rotation lifecycle, distinct audit signal. Splitting later would require a fleet-wide rolling rotation, so this PR pays the schema cost up front. Changes: * migration 044: ADD COLUMN workspaces.platform_inbound_secret TEXT * wsauth.IssuePlatformInboundSecret + ReadPlatformInboundSecret * issueAndInjectInboundSecret hook in workspace_provision: mints on every workspace create / re-provision; Docker mode writes plaintext to /configs/.platform_inbound_secret alongside .auth_token, SaaS mode persists to DB only (workspace will receive via /registry/register response in a follow-up PR) * 8 unit tests against sqlmock: covers happy path, rotation, NULL column, empty string, missing workspace row, empty workspaceID PR-B (next) wires up workspace-side `/internal/chat/uploads/ingest` that validates the bearer against /configs/.platform_inbound_secret. Refs #2312 (parent RFC), #2308 (chat upload 503 incident). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 14:09:33 -07:00
Hongming Wang	4a6095ee1a	fix(chat_files): return 422 with structured detail for external workspaces (closes #2308 ) Symptom: pasting a screenshot into the canvas chat for a runtime="external" workspace returned `503 {"error":"workspace container not running"}` — accurate from the upload handler's POV (no container exists for external workspaces) but misleading because it implies the container has crashed. Fix: detect runtime="external" via DB lookup BEFORE the container-find step and return 422 with: - error: "file upload not supported for external workspaces" - detail: explains why + points at admin/secrets workaround + references issue #2308 for the v0.2 native-support roadmap - runtime: "external" (machine-readable for clients) Why 422 not 200/501: - 422 = Unprocessable Entity — the request is well-formed but the workspace's runtime can't accept it. Standard REST semantics. - 200 with empty result would lie; 501 implies the API itself is unimplemented (it's not — works for non-external workspaces); 503 was the misleading status this PR fixes. Verified via live E2E against localhost: - Created `runtime=external,external=true` workspace - Posted multipart to /workspaces/:id/chat/uploads - Got 422 with the expected structured body Unit test (`chat_files_external_test.go`) pins the contract via sqlmock + httptest. Notable: the handler is constructed with `templates: nil` to prove the runtime check happens BEFORE any docker plumbing — if a future change moves the check below findContainer, the test crashes on nil-deref instead of silently regressing. Out of scope (for v0.2 follow-up): - Native external-workspace file ingest via artifacts table or the channel-plugin's inbox/ pattern. Requires separate design pass. Closes #2308 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:37:49 -07:00
Hongming Wang	949b1b97a5	Merge pull request #2300 from Molecule-AI/auto/issues-2269-2268-restartstates-leak-and-since-secs fix(workspace_crud) + feat(activity): restartStates leak (#2269) + since_secs param (#2268)	2026-04-29 16:22:34 +00:00
Hongming Wang	9559118678	feat(activity): accept ?since_secs= for time-window filtering (#2268) The harness runner (scripts/measure-coordinator-task-bounds-runner.sh) calls `/workspaces/:id/activity?since_secs=$A2A_TIMEOUT` to scope a trace to a specific test window. The query param was silently ignored: `ActivityHandler.List` accepted only `type`, `source`, and `limit`, so the runner got the most recent 100 events regardless of how long ago they happened. That works for fresh-tenant tests where activity_logs is nearly empty pre-run, but breaks on busy tenants and on tests that exceed 100 events. Adds `since_secs` parsing with three behaviors: - Valid positive int → `AND created_at >= NOW() - make_interval(secs => $N)` on the SQL. Parameterised; values bound via lib/pq, not interpolated. `make_interval(secs => $N)` is required because the `INTERVAL '$N seconds'` literal form rejects placeholder substitution inside the string. - Above 30 days (2_592_000s) → silently clamped to the cap. Defends against a careless or hostile client triggering a multi-month full-table scan via `since_secs=999999999`. - Negative, zero, or non-integer → 400 with a structured error, NOT silently dropped. Silent drop is exactly the bug this is fixing: a typoed param shouldn't degrade to most-recent-100. Tests cover all four paths: accepted (with arg-binding assertion via sqlmock.WithArgs), clamped at 30 days, invalid rejected (5 sub-cases), and omitted (verifies no extra clause / arg leak via strict WithArgs count). RFC #2251 §V1.0 step 6 (platform-side-transition audit) also depends on this for time-window filtering of activity_logs. Closes #2268 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:53:52 -07:00
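The three `since_secs` behaviors described in commit 9559118678 can be sketched as a small parsing helper. This is a minimal sketch, not the handler's actual code: `parseSinceSecs`, `maxSinceSecs`, and `errInvalidSinceSecs` are hypothetical names, and the real handler returns a structured 400 body whose shape is not shown in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// maxSinceSecs is the 30-day cap named in the commit message (2_592_000 s).
const maxSinceSecs = 30 * 24 * 60 * 60

// errInvalidSinceSecs is a hypothetical sentinel; a handler would map it to
// a 400 response instead of silently ignoring the parameter.
var errInvalidSinceSecs = errors.New("since_secs must be a positive integer")

// parseSinceSecs is a hypothetical helper. ok=false with a nil error means
// the parameter was omitted, so no time-window clause should be added.
func parseSinceSecs(raw string) (secs int64, ok bool, err error) {
	if raw == "" {
		return 0, false, nil
	}
	n, perr := strconv.ParseInt(raw, 10, 64)
	if perr != nil || n <= 0 {
		// Negative, zero, or non-integer input is rejected, not dropped.
		return 0, false, errInvalidSinceSecs
	}
	if n > maxSinceSecs {
		// Oversized windows are clamped to the cap rather than rejected.
		n = maxSinceSecs
	}
	return n, true, nil
}

func main() {
	for _, raw := range []string{"", "60", "999999999", "-5", "abc"} {
		secs, ok, err := parseSinceSecs(raw)
		fmt.Printf("%q -> secs=%d ok=%v err=%v\n", raw, secs, ok, err)
	}
}
```

The returned seconds value would then be bound as the placeholder in `make_interval(secs => $N)`, so the value is never interpolated into the SQL text.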
Hongming Wang	f75599eba9	fix(workspace_crud): drop restartStates entries on workspace delete (#2269) Per-workspace `restartState` entries (introduced under the name `restartMu` pre-#2266, renamed to `restartStates` in #2266) are created via `LoadOrStore` in `workspace_restart.go` but never deleted. On a long-running platform process serving many short-lived workspaces (E2E tests, transient sandbox tenants), the sync.Map grows monotonically, at ~16 bytes per workspace ever created. Fix: call `restartStates.Delete(wsID)` after stopAndRemove + ClearWorkspaceKeys for each cascaded descendant and the parent. Mirrors the existing per-ID cleanup loop. `sync.Map.Delete` is safe on absent keys, so for workspaces that were never restarted (no LoadOrStore call) the call is a no-op. This is a pre-existing leak: #2266 did not introduce it, it just renamed the holder. Filing as a separate commit to keep the change minimal and reviewable. Closes #2269 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 05:53:34 -07:00
Hongming Wang	80c612d987	fix(org-import): remove force=true bypass of required-env preflight The pre-#2290 `force: true` flag on POST /org/import skipped the required-env preflight, letting orgs import without their declared required keys (e.g. ANTHROPIC_API_KEY). The ux-ab-lab incident: that import path was used, the org shipped without ANTHROPIC_API_KEY in global_secrets, and every workspace 401'd on the first LLM call. Per #2290 picks (C/remove/both): - Q1=C: template-derived required_env (no schema change; already the existing aggregation via collectOrgEnv). - Q2=remove: drop the bypass entirely. The seed/dev-org flow that legitimately needs to skip becomes a separate dry-run-import path with its own audit trail, not a permission bypass. - Q3=block-at-import-only: provision-time drift logging is a follow-up; for this PR, blocking at import is the gate. Surface change: - Force field removed from POST /org/import request body. - 412 "suggestion" text drops the "or pass force=true" guidance. - Legacy callers sending {"force": true} are silently tolerated (Go's json.Unmarshal drops unknown fields), so no client-side breakage; the bypass effect is just gone. Audited callers in this repo: - canvas/src/components/TemplatePalette.tsx: never sends force. - scripts/post-rebuild-setup.sh: never sends force. - Only external tooling sent force=true. Those callers must now set the global secret via POST /settings/secrets before importing. Adds TestOrgImport_ForceFieldRemoved as a structural pin: if a future change re-adds Force to the body struct, the test fails and forces an explicit reckoning with the #2290 rationale. Closes #2290 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 03:23:23 -07:00
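The "legacy callers are silently tolerated" claim in commit 80c612d987 rests on the default behavior of `encoding/json`: fields in the payload with no matching struct field are ignored. A minimal sketch follows, assuming a hypothetical `importRequest` struct with a single illustrative `Template` field; the real request body is not shown in the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// importRequest is a hypothetical stand-in for the POST /org/import body
// after the Force field was removed. The Template field is illustrative only.
type importRequest struct {
	Template string `json:"template"`
}

// decodeImport decodes a request body with json.Unmarshal, which ignores
// JSON keys that have no corresponding struct field.
func decodeImport(body []byte) (importRequest, error) {
	var req importRequest
	err := json.Unmarshal(body, &req)
	return req, err
}

func main() {
	// A legacy caller still sending "force": true decodes without error;
	// the flag simply has nowhere to land.
	req, err := decodeImport([]byte(`{"template":"demo","force":true}`))
	fmt.Println(req.Template, err)
}
```

A handler that wanted to reject such callers instead would need to opt in explicitly, for example with `json.Decoder.DisallowUnknownFields`.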
Hongming Wang	bdfa45572e	fix(restart): clear running flag on panic in cycle() Self-review caught a regression I introduced in #2266: if cycle() panics (e.g. a future provisionWorkspace nil-deref or any runtime error from the DB / Docker / encryption stacks it touches), the loop never reaches `state.running = false`. The flag stays true forever, the early-return guard at the top of coalesceRestart fires for every subsequent call, and that workspace is permanently locked out of restarts until the platform process restarts. The pre-fix code had similar exposure (panic killed the goroutine before defer wsMu.Unlock() ran in some Go versions), but my pending-flag version made it worse: the guard is sticky, not ephemeral. Fix: defer the state-clear so it always runs on exit, including panic. Recover (and DON'T re-raise) so the panic doesn't propagate to the goroutine boundary and crash the whole platform process. RestartByID is always called via `go h.RestartByID(...)` from HTTP handlers, and an unrecovered goroutine panic in Go terminates the program. Crashing the platform for every tenant because one workspace's cycle panicked is the wrong availability tradeoff. The panic message + full stack trace via runtime/debug.Stack() are still logged for debuggability. Regression test in TestCoalesceRestart_PanicInCycleClearsState: 1. First call's cycle panics. coalesceRestart's defer must swallow the panic; assert no panic propagates out (would crash the platform process from a goroutine in production). 2. Second call must run a fresh cycle (proves running was cleared). All 7 tests pass with -race -count=10. Surfaced via /code-review-and-quality self-review of #2266; the re-raise-after-recover anti-pattern (originally argued as "don't mask bugs") came up in the comprehensive review and was corrected to log-with-stack-and-suppress for availability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 00:00:12 -07:00
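The defer-and-recover fix in commit bdfa45572e can be illustrated with a stripped-down guard. This is a sketch under stated assumptions, not the repository's code: `runGuarded` and this single-flag `restartState` are hypothetical simplifications that omit the pending-flag loop.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
	"sync"
)

// restartState is a hypothetical, simplified per-workspace state.
type restartState struct {
	mu      sync.Mutex
	running bool
}

// runGuarded runs cycle unless one is already in flight. The deferred
// function clears the running flag on every exit path and, after logging the
// panic value with a stack trace, suppresses the panic instead of re-raising.
func runGuarded(state *restartState, cycle func()) (started bool) {
	state.mu.Lock()
	if state.running {
		state.mu.Unlock()
		return false
	}
	state.running = true
	state.mu.Unlock()

	started = true
	defer func() {
		if r := recover(); r != nil {
			log.Printf("restart cycle panicked: %v\n%s", r, debug.Stack())
		}
		state.mu.Lock()
		state.running = false
		state.mu.Unlock()
	}()

	cycle()
	return started
}

func main() {
	state := &restartState{}

	// First cycle panics; the guard logs it and clears the flag.
	runGuarded(state, func() { panic("simulated nil-deref in provisioning") })

	// Second call is not locked out, because running was reset.
	ran := false
	runGuarded(state, func() { ran = true })
	fmt.Println("second cycle ran:", ran)
}
```

Clearing the flag inside the deferred function, not after `cycle()` returns, is what keeps a panicking cycle from leaving the guard stuck.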
Hongming Wang	f088090b27	fix(restart): coalesce concurrent restart requests via pending flag The naive mutex-with-TryLock pattern in RestartByID was silently dropping the second of two close-together restart requests. SetSecret and SetModel both fire `go restartFunc(...)` from their HTTP handlers, and both DB writes commit before either restart goroutine reaches loadWorkspaceSecrets. If the second goroutine arrives while the first holds the per-workspace mutex, TryLock returns false and the second is logged-and-dropped: Auto-restart: skipping <id> — restart already in progress The first goroutine's loadWorkspaceSecrets ran before the second write committed, so the new container boots without that env var. Surfaced during the RFC #2251 V1.0 measurement as hermes returning "No LLM provider configured" when MODEL_PROVIDER landed after the API-key write and lost its restart to the mutex (HERMES_DEFAULT_MODEL absent → start.sh fell back to nousresearch/hermes-4-70b → derived provider=openrouter → no OPENROUTER_API_KEY → request-time error). The same race hits any back-to-back secret/model save flow including the canvas's "set MiniMax key + pick model" UX. Fix: pending-flag / coalescing pattern. Any restart request that arrives while one is in flight sets `pending=true` and returns. The in-flight runner, on completion, checks the flag and runs another cycle. This collapses N concurrent requests into at most 2 sequential cycles (the current one + one more that picks up everyone who arrived during it), while guaranteeing the final container always sees the latest secrets. Concrete contract: - 1 request, no concurrency: 1 cycle - N concurrent requests during 1 in-flight cycle: 2 cycles total - N sequential requests (no overlap): N cycles - Per-workspace state — different workspaces never serialize Coalescing is extracted into `coalesceRestart(workspaceID, cycle func())` so the gate logic is testable without the full WorkspaceHandler / DB / provisioner stack. 
RestartByID now wraps that with the production cycle function. runRestartCycle calls provisionWorkspace SYNCHRONOUSLY (drops the historical `go`) so the loop's pending-flag check happens AFTER the new container is up — without that, the next cycle's Stop call would race the previous cycle's still-spawning provision goroutine. sendRestartContext stays async; it's a one-way notification. Tests in workspace_restart_coalesce_test.go cover all five contract points + race-detector clean over 10 iterations: - Single call → 1 cycle - 5 concurrent during in-flight → exactly 2 cycles total - 3 sequential → 3 cycles - Pending-during-cycle picked up (targeted bug repro) - State cleared after drain (running flag reset) - Per-workspace isolation (no cross-workspace serialization) Refs: molecule-core#2256 (V1.0 gate measurement); root cause for the "No LLM provider configured" symptom seen during hermes/MiniMax repro. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 23:31:56 -07:00
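The pending-flag contract from commit f088090b27 (N requests arriving during one in-flight cycle collapse into exactly one follow-up cycle) can be sketched as below. The names `coalesceRestart`, `restartStates`, `running`, and `pending` come from the commit messages; the struct layout and locking details are assumptions, and the later panic-recovery fix is left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// restartState holds the two flags named in the commit message. The exact
// layout in the repository is not shown; this one is an assumption.
type restartState struct {
	mu      sync.Mutex
	running bool
	pending bool
}

// restartStates maps workspace ID -> *restartState, so different workspaces
// never serialize against each other.
var restartStates sync.Map

// coalesceRestart runs cycle now if nothing is in flight. Otherwise it only
// marks the in-flight runner as pending and returns immediately.
func coalesceRestart(workspaceID string, cycle func()) {
	v, _ := restartStates.LoadOrStore(workspaceID, &restartState{})
	state := v.(*restartState)

	state.mu.Lock()
	if state.running {
		state.pending = true
		state.mu.Unlock()
		return
	}
	state.running = true
	state.mu.Unlock()

	for {
		cycle() // the lock is not held while the cycle runs

		state.mu.Lock()
		if !state.pending {
			state.running = false
			state.mu.Unlock()
			return
		}
		// Someone asked for a restart mid-cycle: run one more.
		state.pending = false
		state.mu.Unlock()
	}
}

func main() {
	count := 0
	var cycle func()
	cycle = func() {
		count++
		if count == 1 {
			// Simulate 5 requests arriving while the first cycle is in flight.
			for i := 0; i < 5; i++ {
				coalesceRestart("ws-demo", cycle)
			}
		}
	}
	coalesceRestart("ws-demo", cycle)
	fmt.Println("cycles run:", count)
}
```

The demo issues the overlapping requests from inside the first cycle, so it needs no goroutines or timing assumptions to show the collapse.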
Hongming Wang	317196463a	fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart Independent code review caught a real bug in the previous commit's stale-token revoke pass. The platform's restart endpoint (workspace_restart.go:104) Stops the workspace container synchronously then dispatches re-provisioning to a goroutine (line 173). For a workspace that's been idle past the 5-minute grace window — extremely common: user comes back to a long-idle workspace and clicks Restart — this opens a race window: 1. Container stopped → ListWorkspaceContainerIDPrefixes returns no entry → workspace becomes a stale-token candidate. 2. issueAndInjectToken runs in the goroutine: revokes old tokens, issues a fresh one, writes it to /configs/.auth_token. 3. If the sweeper's predicate-only UPDATE `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER IssueToken commits but is racing the SELECT-then-UPDATE window, it revokes the freshly-issued token alongside the old ones. 4. Container starts with a now-revoked token → 401 forever. The fix carries the SAME staleness predicate from the SELECT into the per-workspace UPDATE: a token created within the grace window can't match `< now() - grace` and is automatically excluded. The operation is now idempotent against fresh inserts. Also addresses other findings from the same review: - Add `status NOT IN ('removed', 'provisioning')` to the SELECT (R2 + first-line C1 defence). 'provisioning' is set synchronously in workspace_restart.go before the async re-provision begins, so it's a reliable in-flight signal that narrows the candidate set. - Stop calling wsauth.RevokeAllForWorkspace from the sweeper — that helper revokes EVERY live token unconditionally; the sweeper needs "every STALE live token" which is a different (safer) operation. Inline the UPDATE so we own the predicate end-to-end. Drop the wsauth import (no longer needed in this package). 
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and require the status filter, so a future query whose first line coincidentally starts with "SELECT DISTINCT t.workspace_id" can't silently absorb the helper's expectation (R3). - Defensive `if reaper == nil { return }` at top of sweepStaleTokensWithoutContainer — even though StartOrphanSweeper already short-circuits on nil, a future refactor that wires this pass directly without checking would otherwise mass-revoke in CP/SaaS mode (F2). - Comment in the function explaining why empty likes is intentionally NOT a short-circuit (asymmetry with the first two passes is the whole point — "no containers running" is the load-bearing case). - Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that asserts the UPDATE shape (predicate present, grace bound). A real-Postgres integration test would prove the race resolution end-to-end; this catches the regression where someone simplifies the UPDATE back to predicate-only. - Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard. Existing tests updated to match the new query/UPDATE shape with tight regexes that pin all the safety guards (status filter, staleness predicate in both SELECT and UPDATE). Full Go suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:28:50 -07:00
Hongming Wang	3332e6878b	fix(orphan-sweeper): revoke stale tokens for workspaces with no live container Heals the user-reported "auth token conflict after volume wipe" failure mode. When an operator nukes a workspace's /configs volume outside the platform's restart endpoint (common via `docker compose down -v` or manual cleanup scripts), the DB still holds live workspace_auth_tokens for that workspace while the recreated container has an empty /configs/.auth_token. Subsequent /registry/register calls 401 forever: requireWorkspaceToken sees live tokens, container has no token to present, and the workspace is permanently wedged until an operator manually revokes via SQL. The platform's restart endpoint already handles this correctly via wsauth.RevokeAllForWorkspace inside issueAndInjectToken. This change adds a third orphan-sweeper pass — sweepStaleTokensWithoutContainer — as the safety net for the equivalent action taken outside the API. Detection criterion: workspace has at least one live (non-revoked) token whose most-recent activity (COALESCE(last_used_at, created_at)) is older than staleTokenGrace (5 minutes), AND no live Docker container's name prefix matches the workspace ID. Safety filters that bound the revoke radius: 1. Only runs in single-tenant Docker mode. The orphan sweeper is wired only when prov != nil in cmd/server/main.go — CP/SaaS mode never gets here, so an empty container list cannot be confused with "no Docker at all" (which would otherwise revoke every workspace's tokens in production SaaS). 2. staleTokenGrace = 5min skips tokens issued/used in the last 5 minutes. Bounds the race with mid-provisioning (token issued moments before docker run completes) and brief restart windows — a healthy workspace touches last_used_at every 30s heartbeat, so 5min is 10× the heartbeat interval. 3. The query joins workspaces.status != 'removed' so deleted workspaces are not revoked here (handled at delete time by the explicit RevokeAllForWorkspace call). 4. 
make_interval(secs => $2) avoids a time.Duration.String() → "5m0s" mismatch with Postgres interval grammar that I caught during implementation. 5. Each revocation logs the workspace ID so operators can correlate "workspace just lost auth" with this sweeper, not blame a network blip. Failure mode: revoke fails (transient DB error). Loop bails to avoid log spam; next 60s cycle retries. Worst case a workspace stays 401-blocked an extra minute. Tests: 5 new tests covering the headline scenario, the safety gate (workspace with container is NOT revoked), revoke-failure-bails-loop, query-error-non-fatal, and Docker-list-failure-skips-cycle. All 11 existing sweepOnce tests updated to register the new third-pass query expectation via a small `expectStaleTokenSweepNoOp` helper that keeps their existing assertions readable. Full Go test suite green: registry, wsauth, handlers, and all other packages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:20:08 -07:00
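The `time.Duration` mismatch mentioned in item 4 of commit 3332e6878b comes from Go's own duration formatting, which is not meant as a database interval literal. A minimal sketch, assuming a hypothetical `graceSeconds` helper; only `staleTokenGrace` and its 5-minute value come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// staleTokenGrace is the 5-minute grace window named in the commit message.
const staleTokenGrace = 5 * time.Minute

// graceSeconds is a hypothetical helper: it converts a duration to a plain
// number of seconds, suitable for binding to make_interval(secs => $2).
func graceSeconds(d time.Duration) float64 {
	return d.Seconds()
}

func main() {
	// Go's String() form is Go-specific and should not be handed to the
	// database as an interval literal.
	fmt.Println(staleTokenGrace.String())

	// A numeric seconds value has no such formatting ambiguity.
	fmt.Println(graceSeconds(staleTokenGrace))
}
```

Binding a number also keeps the query fully parameterised, in line with the `make_interval(secs => $N)` usage described for `since_secs` elsewhere in this log.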
Hongming Wang	c91c09dc55	fix(activity): include request/response bodies in ACTIVITY_LOGGED broadcast Canvas Agent Comms bubbles for outbound delegation showed only "Delegating to <peer>" boilerplate during the live update window — the actual task text only surfaced after a refresh re-fetched the row from /workspaces/:id/activity. Symptom flagged today during a fresh delegation manual test where the bubble said "Delegating to Perf Auditor" instead of the user's "audit moleculesai.app for performance" prompt. Root cause: LogActivity's broadcast payload at activity.go:510-518 deliberately omitted request_body and response_body, so the canvas's live-update path (AgentCommsPanel.tsx:271-289) saw `p.request_body = undefined` and toCommMessage fell back to the `Delegating to ${peerName}` template string. The DB row stored the real task / reply, which is why GET-on-mount worked. Fix: include both bodies in the broadcast as json.RawMessage values (no re-marshal cost — they were already encoded for the DB insert above). Same pattern as tool_trace, which has been included since #1814. Each side is bounded by the workspace-side caller's own caps: the runtime's report_activity helper caps error_detail at 4096 chars and summary at 256; request/response are constrained by the runtime's own limits — typical delegate_task payload is hundreds of chars to a few KB. If a much-larger broadcast becomes a concern later, a soft cap can be added at this site without breaking the contract. Two regression tests pin the broadcast shape: - request_body present → canvas renders the actual task text - response_body present → canvas renders the actual reply text - response_body nil → omitted from payload (no empty-bubble flicker) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 13:38:23 -07:00
Hongming Wang	92d99d96fe	fix(provisioner): treat "removal already in progress" as no-op success Cascade-deleting a 7-workspace org returned 500 with "workspace marked removed, but 2 stop call(s) failed — please retry: stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response from daemon: removal of container ws-eeb99b5d-607 is already in progress" even though the DB-side post-condition succeeded (removed_count=7) and the containers WERE removed shortly after. The fanout fired Stop() on every workspace concurrently and the orphan sweeper happened to reap two of them at the same instant, so Docker rejected the second ContainerRemove with "removal already in progress" — a race-condition ack, not a real failure. Retrying just races the same in-flight removal. The post-condition we care about (the container WILL be gone) is identical to a successful removal, so Stop() should treat it the same way it already treats "No such container" — a no-op return nil that lets the caller proceed with volume cleanup. Real daemon failures (timeout, EOF, ctx cancel) still surface as errors. Two pieces: - New isRemovalInProgress() predicate using the same string-match approach as isContainerNotFound (docker/docker has no typed errdef for this; the CLI itself relies on the message). - Stop() now treats the predicate as success, with a log line distinct from the not-found path so debugging can tell which race fired. Both substrings ("removal of container" + "already in progress") must match — "already in progress" alone would false-positive on unrelated operations like image pulls. Truth table pinned in 7 new test cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 13:25:32 -07:00
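The two-substring rule from commit 92d99d96fe can be sketched as follows. `isRemovalInProgress` is named in the commit message, but its body is not shown there, so this is an assumed implementation; `matchesRemovalInProgress` is a hypothetical helper split out so the string logic can be exercised directly.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// matchesRemovalInProgress requires BOTH substrings, so that an unrelated
// "already in progress" message (for example from an image pull) is not
// treated as a successful removal.
func matchesRemovalInProgress(msg string) bool {
	return strings.Contains(msg, "removal of container") &&
		strings.Contains(msg, "already in progress")
}

// isRemovalInProgress reports whether err looks like Docker's
// "removal already in progress" race acknowledgement.
func isRemovalInProgress(err error) bool {
	if err == nil {
		return false
	}
	return matchesRemovalInProgress(err.Error())
}

func main() {
	race := errors.New("Error response from daemon: removal of container ws-eeb99b5d-607 is already in progress")
	other := errors.New("pull of image is already in progress")
	fmt.Println(isRemovalInProgress(race), isRemovalInProgress(other))
}
```

A caller such as `Stop()` would treat a `true` result like the existing not-found case and return nil, while timeouts, EOF, and context cancellation still surface as errors.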
Hongming Wang	7cf77f274a	Merge pull request #2166 from Molecule-AI/test/unblock-resolveandstage-test test(plugins): unblock TestResolveAndStage_NoInternalErrorsInHTTPErr (#1814)	2026-04-27 11:36:15 +00:00
Hongming Wang	a0154ea0b4	test(plugins): unblock TestResolveAndStage_NoInternalErrorsInHTTPErr (#1814) Closes the second of two skipped tests in workspace_provision_test.go that were blocked on interface refactors. The Broadcaster + CP provisioner halves landed in earlier #1814 cycles; this is the plugin-source-registry half. Refactor: - Add handlers.pluginSources interface with the 3 methods handler code actually calls (Register, Resolve, Schemes) - Compile-time assertion `var _ pluginSources = (*plugins.Registry)(nil)` catches future method-signature drift at build time - PluginsHandler.sources narrowed from plugins.Registry to the interface; production wiring (NewPluginsHandler, WithSourceResolver) still passes *plugins.Registry, which satisfies the interface Production fix (#1206 leak): - resolveAndStage's Fetch-failure path was interpolating err.Error() into the HTTP response body via `failed to fetch plugin from %s: %v`. Resolver errors routinely contain rate-limit text, github request IDs, raw HTTP body fragments, and (for local resolvers) file system paths, none of which has any business landing in a user's browser. - Body now carries just `failed to fetch plugin from <scheme>`; the status code already differentiates the failure shape (404 not found, 504 timeout, 502 generic). Full err detail stays in the server-side log line one statement above. 
Test: - 6 sub-tests covering every error path inside resolveAndStage: empty source, invalid format, unknown scheme, local path-traversal, unpinned github (PLUGIN_ALLOW_UNPINNED unset), Fetch failure with a leaky synthetic error - The Fetch-failure case plants 5 realistic leak markers in the resolver's error string (rate limit text, x-github-request-id, auth_token, ghp_-prefixed token, /etc/passwd path); the assertion fails if ANY appears in the response body - Table-driven so a future error path added to resolveAndStage gets one new row, not a copy-paste of the assertion logic Verification: - 6/6 sub-tests pass - Full workspace-server test suite passes (interface refactor is non-breaking; production caller paths unchanged) - go build ./... clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 04:00:39 -07:00
Hongming Wang	e15d1182cd	test(provisioner): unblock TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast (#1814) The skipped test exists to assert that provisionWorkspaceCP never leaks err.Error() in WORKSPACE_PROVISION_FAILED broadcasts (regression guard for #1206). Writing the test body required substituting a failing CPProvisioner, but the handler's `cpProv` field was the concrete CPProvisioner type, so a mock had nowhere to plug in. Refactor: - Add provisioner.CPProvisionerAPI interface with the 3 methods handlers actually call (Start, Stop, GetConsoleOutput) - Compile-time assertion `var _ CPProvisionerAPI = (*CPProvisioner)(nil)` catches future method-signature drift at build time - WorkspaceHandler.cpProv narrowed to the interface; SetCPProvisioner accepts the interface (production caller passes *CPProvisioner from NewCPProvisioner unchanged) Test: - stubFailingCPProv whose Start returns a deliberately leaky error (machine_type=t3.large, ami=…, vpc=…, raw HTTP body fragment) - Drive provisionWorkspaceCP via the cpProv.Start failure path - Assert broadcast["error"] == "provisioning failed" (canned) - Assert no leak markers (machine type, AMI, VPC, subnet, HTTP body, raw error head) in any broadcast string value - Stop/GetConsoleOutput on the stub panic, which flags a future regression that reaches into them on this path Verification: - Full workspace-server test suite passes (interface refactor is non-breaking; production caller path unchanged) - go build ./... clean - The other skipped test in this file (TestResolveAndStage_…) is a separate plugins.Registry refactor and remains skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 03:28:25 -07:00
hongmingwang-moleculeai	34b92c33b7	Merge pull request #2144 from Molecule-AI/feat/native-session-skip-queue feat(runtime): native_session skips a2a_queue — primitive #5 of 6	2026-04-27 06:40:09 +00:00
Hongming Wang	ae64fe340a	feat(runtime): native_session skips a2a_queue enqueue — primitive #5 of 6 When a target workspace's adapter has declared provides_native_session=True (claude-code SDK's streaming session, hermes-agent's in-container event log), the SDK owns its own queue/ session state. Adding the platform's a2a_queue layer on top would double-buffer the same in-flight state — and worse, the platform queue's drain timing has no relationship to the SDK's actual readiness, so the queued request might dispatch while the SDK is STILL busy. Behavior change: in handleA2ADispatchError, when isUpstreamBusyError(err) fires and the target declared native_session, return 503 + Retry-After directly without enqueueing. The caller's adapter handles retry on its own schedule, and the SDK's own queue absorbs the request when ready. Response body carries native_session=true so callers can distinguish this from queue-failure 503s. Observability is preserved: logA2AFailure still runs above; the broadcaster still fires; the activity_logs row records the busy event just like the platform-fallback path. This is the consumer that validates the template-side declarations already shipped in: - molecule-ai-workspace-template-claude-code PR #12 - molecule-ai-workspace-template-hermes PR #25 Once those merge + image tags bump, claude-code + hermes workspaces' busy 503s skip the platform queue end-to-end. End-to-end validation of capability primitive #5. Tests (2 new): - NativeSession_SkipsEnqueue: cache pre-populated, deliberate sqlmock with NO INSERT INTO a2a_queue expected — implicit regression cover (sqlmock fails on unexpected queries). Asserts 503 + Retry-After + native_session=true marker in body. - NoNativeSession_StillEnqueues: negative pin — empty cache, same busy error → falls through to EnqueueA2A (which fails in this test, falls through to legacy 503 without native_session marker). 
Verification: - All Go handlers tests pass (2 new + existing) - go build + go vet clean See project memory `project_runtime_native_pluggable.md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 23:34:04 -07:00
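A minimal sketch of the busy response described in the native_session commit above, assuming a hypothetical helper `writeNativeSessionBusy`, a hypothetical body shape, and a caller-supplied Retry-After value. Only the 503 status, the Retry-After header, and the `native_session=true` marker come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// busyResponse is a hypothetical body shape; only the native_session
// marker is taken from the commit message.
type busyResponse struct {
	Error         string `json:"error"`
	NativeSession bool   `json:"native_session"`
}

// writeNativeSessionBusy (hypothetical name) answers a busy upstream with
// 503 + Retry-After and does not enqueue anything. Headers are set before
// WriteHeader so they are actually sent.
func writeNativeSessionBusy(w http.ResponseWriter, retryAfterSeconds int) {
	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("Retry-After", strconv.Itoa(retryAfterSeconds))
	w.WriteHeader(http.StatusServiceUnavailable)
	_ = json.NewEncoder(w).Encode(busyResponse{Error: "workspace busy", NativeSession: true})
}

// respondBusy drives the helper through an in-memory recorder and returns
// the status code, the Retry-After header, and the decoded marker.
func respondBusy(retryAfterSeconds int) (int, string, bool) {
	rec := httptest.NewRecorder()
	writeNativeSessionBusy(rec, retryAfterSeconds)
	var body busyResponse
	_ = json.Unmarshal(rec.Body.Bytes(), &body)
	return rec.Code, rec.Header().Get("Retry-After"), body.NativeSession
}

func main() {
	code, retry, native := respondBusy(30) // 30s is an arbitrary illustration value
	fmt.Println(code, retry, native)
}
```

A caller can branch on the marker to tell this 503 apart from a queue-failure 503 and retry on its own schedule.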
Hongming Wang	186f25c261	Merge pull request #2141 from Molecule-AI/feat/native-status-mgmt-skip feat(runtime): native_status_mgmt skip — primitive #4 of 6	2026-04-27 06:30:59 +00:00
Hongming Wang	b4b406c074	feat(runtime): native_status_mgmt skip — primitive #4 of 6 When an adapter declares provides_native_status_mgmt=True (because its SDK reports its own ready/degraded/failed state explicitly), the platform's error-rate-based status inference fights the adapter's own state machine. This PR gates the inference branches on the capability flag — adapter-driven transitions become authoritative. Components: - registry.go evaluateStatus: gate the two inferred-status branches (online → degraded when error_rate ≥ 0.5; degraded → online when error_rate < 0.1 and runtime_state is empty) behind a check of runtimeOverrides.HasCapability("status_mgmt"). - The wedged-branch (RuntimeState == "wedged" → degraded) is NOT gated. That path is the adapter's OWN self-report, not platform inference, and stays active under native_status_mgmt — adapters can still drive transitions via runtime_state. Python side: no change. The capability map is already serialized via RuntimeCapabilities.to_dict() in PR #2137 and sent in the heartbeat's runtime_metadata block via PR #2139. An adapter setting RuntimeCapabilities(provides_native_status_mgmt=True) automatically flows through. Tests (3 new): - SkipsDegradeInference: error_rate=0.8 + currentStatus=online + native flag set → degrade UPDATE does NOT fire (sqlmock fails on unexpected query, which is the regression cover) - SkipsRecovery: error_rate=0.05 + currentStatus=degraded + native → recovery UPDATE does NOT fire - WedgedStillRespected: runtime_state="wedged" + native → wedged branch DOES fire (adapter self-report stays active) Verification: - All Go handlers tests pass (3 new + existing) - 1308/1308 Python pytest pass (unchanged — Python side unmodified) - go build + go vet clean Stacked on #2140 (already merged via cascade); branch is current with staging since #2139 and #2140 merged. See project memory `project_runtime_native_pluggable.md`. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 23:13:13 -07:00
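The gating rules in the native_status_mgmt commit above can be sketched as a pure decision function. This is an illustration under assumptions, not the code in registry.go: `nextStatus` and its parameters are hypothetical, and the order of the checks is assumed. The thresholds (0.5 and 0.1) and the ungated wedged branch are taken from the commit message.

```go
package main

import "fmt"

const (
	degradeThreshold = 0.5 // online -> degraded when error_rate >= 0.5
	recoverThreshold = 0.1 // degraded -> online when error_rate < 0.1
)

// nextStatus (hypothetical) returns the status a workspace should move to.
func nextStatus(current string, errorRate float64, runtimeState string, nativeStatusMgmt bool) string {
	// Adapter self-report: never gated by the capability flag.
	if runtimeState == "wedged" {
		return "degraded"
	}
	// The adapter owns its state machine: skip platform inference entirely.
	if nativeStatusMgmt {
		return current
	}
	switch {
	case current == "online" && errorRate >= degradeThreshold:
		return "degraded"
	case current == "degraded" && errorRate < recoverThreshold && runtimeState == "":
		return "online"
	}
	return current
}

func main() {
	fmt.Println(nextStatus("online", 0.8, "", true))         // inference skipped
	fmt.Println(nextStatus("online", 0.8, "", false))        // platform inference applies
	fmt.Println(nextStatus("online", 0.0, "wedged", true))   // self-report still honored
	fmt.Println(nextStatus("degraded", 0.05, "", false))     // platform recovery applies
}
```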
Hongming Wang	0473522cc5	Merge branch 'staging' into feat/idle-timeout-adapter-override	2026-04-26 22:52:42 -07:00
Hongming Wang	c0a5d842b4	feat(runtime): native_scheduler skip — primitive #3 of 6 When an adapter declares provides_native_scheduler=True (because its SDK has built-in cron / Temporal-style workflows), the platform's polling loop must skip firing schedules for that workspace — otherwise the schedule fires twice (once natively, once via platform). The native skip preserves observability (next_run_at still advances, the schedule row stays in the DB, last_run_at would still update) while moving the FIRE responsibility to the SDK. Stacked on PR #2139 (idle_timeout_override end-to-end). The RuntimeMetadata heartbeat block already carries the capability map; this PR teaches the platform how to read and act on the scheduler bit. Components: - handlers/runtime_overrides.go: extended the cache to store capability flags alongside idle timeout. Two heartbeat fields are independent — SetIdleTimeout / SetCapabilities each update one without stomping the other. Defensive copy on SetCapabilities so a caller mutating its map after the call doesn't retroactively change cached declarations. Empty entries dropped to avoid stale husks. - handlers/runtime_overrides.go: new HasCapability(workspaceID, name) + ProvidesNativeScheduler(workspaceID) — the latter is the package-level adapter the scheduler imports (avoids a handlers/scheduler import cycle). - handlers/registry.go: heartbeat handler now calls SetCapabilities in addition to SetIdleTimeout. - scheduler/scheduler.go: NativeSchedulerCheck function-pointer DI (mirrors the existing QueueDrainFunc pattern). New() leaves the field nil so existing callers preserve today's "always fire" behavior. SetNativeSchedulerCheck wires production. tick() drops workspaces declaring native ownership before goroutine fan-out; advances next_run_at so we don't tight-loop on the same row. - cmd/server/main.go: wires handlers.ProvidesNativeScheduler into the cron scheduler at server boot. 
Tests: Go (7 new): - SetCapabilitiesAndHas (round-trip) - per-workspace isolation (ws-a's declaration doesn't leak to ws-b) - nil/empty map clears (adapter dropping the flag restores fallback) - SetCapabilities is a defensive copy (caller mutation can't retroactively flip cached value) - SetIdleTimeout preserves capabilities and vice-versa (two-field independence) - empty entry deleted (no stale husks) - ProvidesNativeScheduler reads the same singleton heartbeat writes - SetNativeSchedulerCheck wires the function (scheduler-side) - nil-check safety contract for tick Python: no change needed — the heartbeat already serializes the full capability map via _runtime_metadata_payload (PR #2139). An adapter setting RuntimeCapabilities(provides_native_scheduler=True) automatically flows through. Verification: - 1308 / 1308 Python pytest pass (unchanged) - All Go handlers + scheduler tests pass - go build + go vet clean See project memory `project_runtime_native_pluggable.md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:47:00 -07:00
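The cache properties listed in the native_scheduler commit above (defensive copy, two-field independence, empty entries dropped) can be sketched as follows. This is an assumption-laden illustration: it uses a mutex-guarded map rather than the sync.Map the earlier commit mentions, and `overrideCache` and its internals are hypothetical names. Only the method names SetCapabilities, SetIdleTimeout, and HasCapability come from the commit messages.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// override holds the two independent heartbeat-derived fields.
type override struct {
	idleTimeout  time.Duration
	capabilities map[string]bool
}

// overrideCache is a hypothetical stand-in for the in-memory cache.
type overrideCache struct {
	mu      sync.Mutex
	entries map[string]override
}

func newOverrideCache() *overrideCache {
	return &overrideCache{entries: map[string]override{}}
}

// store drops entries that carry no information, so no stale husks remain.
func (c *overrideCache) store(id string, e override) {
	if e.idleTimeout == 0 && e.capabilities == nil {
		delete(c.entries, id)
		return
	}
	c.entries[id] = e
}

// SetCapabilities copies the map so later caller mutations cannot change
// the cached declaration. A nil or empty map clears the capabilities only.
func (c *overrideCache) SetCapabilities(id string, caps map[string]bool) {
	if id == "" {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	e := c.entries[id]
	e.capabilities = nil
	if len(caps) > 0 {
		e.capabilities = make(map[string]bool, len(caps))
		for k, v := range caps {
			e.capabilities[k] = v
		}
	}
	c.store(id, e)
}

// SetIdleTimeout updates only the timeout; zero or negative clears it.
func (c *overrideCache) SetIdleTimeout(id string, d time.Duration) {
	if id == "" {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	e := c.entries[id]
	e.idleTimeout = 0
	if d > 0 {
		e.idleTimeout = d
	}
	c.store(id, e)
}

// HasCapability reports whether the workspace declared the named capability.
func (c *overrideCache) HasCapability(id, name string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.entries[id].capabilities[name] // reading a nil map yields false
}

func main() {
	c := newOverrideCache()
	caps := map[string]bool{"scheduler": true}
	c.SetCapabilities("ws-a", caps)
	caps["scheduler"] = false // caller mutation after the call
	c.SetIdleTimeout("ws-a", 600*time.Second)
	fmt.Println(c.HasCapability("ws-a", "scheduler"), c.HasCapability("ws-b", "scheduler"))
}
```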
Hongming Wang	0d3058585b	feat(runtime): adapter-declared idle_timeout_override end-to-end Capability primitive #2 (task #117). The first cross-cutting capability where the adapter actually displaces platform behavior — claude-code's streaming session can legitimately go silent for 8+ minutes during synthesis + slow tool calls; the platform's hardcoded 5min idle timer in a2a_proxy.go cancels it mid-flight (the bug PR #2128 patched at the env-var layer). This PR fixes it at the right layer: the adapter declares "I need 600s" and the platform's dispatch path honors it. Wire shape (Python → Go): POST /registry/heartbeat { "workspace_id": "...", ... "runtime_metadata": { "capabilities": {"heartbeat": false, "scheduler": false, ...}, "idle_timeout_seconds": 600 // optional, omitted = use default } } Default behavior preserved: any adapter that doesn't override BaseAdapter.idle_timeout_override() (returns None by default) sends no idle_timeout_seconds field; the Go side falls through to idleTimeoutDuration (env A2A_IDLE_TIMEOUT_SECONDS, default 5min). Existing langgraph / crewai / deepagents workspaces are unaffected. Components: Python: - adapter_base.py: idle_timeout_override() method on BaseAdapter returning None (the platform-default sentinel). - heartbeat.py: _runtime_metadata_payload() lazy-imports the active adapter and assembles the capability + override block. Try/except swallows ANY error so heartbeat never breaks because of capability discovery — observability outranks capability accuracy. Go: - models.HeartbeatPayload.RuntimeMetadata (pointer so absent = "old runtime, didn't say"; explicit zero-cap = "new runtime, declared no native ownership"). - handlers.runtimeOverrides: in-memory sync.Map cache keyed by workspaceID. Populated by the heartbeat handler, consulted on every dispatchA2A. Reset on platform restart (worst-case 30s of platform-default behavior — acceptable; nothing about overrides is correctness-critical). 
- a2a_proxy.dispatchA2A: looks up the override before applyIdleTimeout; falls through to global default when absent. Tests: Python (17, all new): - RuntimeCapabilities dataclass shape (frozen, defaults, wire keys) - BaseAdapter.capabilities() default + override + sibling isolation - idle_timeout_override default, positive override, dropped-override - Heartbeat metadata producer: default adapter emits all-False, native adapter emits flag + override, missing ADAPTER_MODULE returns {} (graceful), zero/negative override is omitted from wire, exception inside adapter swallowed Go (6, all new): - SetIdleTimeout + IdleTimeout round-trip - Zero/negative duration clears the override - Empty workspace_id ignored - Replacement (heartbeat overwrites prior value) - Reset clears entire cache - Concurrent reads + writes (sync.Map invariant) Verification: - 1308 / 1308 workspace pytest pass (was 1300, +8) - All Go handlers tests pass (6 new + existing) - go vet clean See project memory `project_runtime_native_pluggable.md` for the architecture principle this implements. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:38:01 -07:00
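A rough sketch of decoding the heartbeat wire shape from the idle_timeout_override commit above and resolving the effective timeout. The struct and function names here are assumptions for illustration; the JSON keys, the pointer-means-absent convention, and the 5-minute default are taken from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// RuntimeMetadata mirrors the runtime_metadata block of the heartbeat.
type RuntimeMetadata struct {
	Capabilities       map[string]bool `json:"capabilities"`
	IdleTimeoutSeconds int             `json:"idle_timeout_seconds,omitempty"`
}

// HeartbeatPayload keeps RuntimeMetadata as a pointer so an absent block
// ("old runtime, didn't say") differs from an explicit all-false block.
type HeartbeatPayload struct {
	WorkspaceID     string           `json:"workspace_id"`
	RuntimeMetadata *RuntimeMetadata `json:"runtime_metadata,omitempty"`
}

const defaultIdleTimeout = 5 * time.Minute

// idleTimeoutFor (hypothetical) returns the adapter override when one was
// declared, and the platform default otherwise.
func idleTimeoutFor(raw []byte) (time.Duration, error) {
	var p HeartbeatPayload
	if err := json.Unmarshal(raw, &p); err != nil {
		return 0, err
	}
	if p.RuntimeMetadata == nil || p.RuntimeMetadata.IdleTimeoutSeconds <= 0 {
		return defaultIdleTimeout, nil
	}
	return time.Duration(p.RuntimeMetadata.IdleTimeoutSeconds) * time.Second, nil
}

func main() {
	withOverride := []byte(`{"workspace_id":"ws-1","runtime_metadata":{"capabilities":{"heartbeat":false,"scheduler":false},"idle_timeout_seconds":600}}`)
	oldRuntime := []byte(`{"workspace_id":"ws-2"}`)
	for _, raw := range [][]byte{withOverride, oldRuntime} {
		d, err := idleTimeoutFor(raw)
		fmt.Println(d, err)
	}
}
```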
Hongming Wang	e25b8a508e	test(provisioning): pin no-internal-errors-in-broadcast for global-secret decrypt path (#1814) [Molecule-Platform-Evolvement-Manager] ## What this fixes Closes one of the three skipped tests in workspace_provision_test.go that #1814's interface refactor enabled but never had a body written: `TestProvisionWorkspace_NoInternalErrorsInBroadcast`. The interface blocker (`captureBroadcaster` couldn't substitute for `events.Broadcaster`) was already fixed when `events.EventEmitter` was extracted; this PR ships the test body that the prior refactor made possible. The test was effectively unverified regression cover for issue #1206 (internal error leak in WORKSPACE_PROVISION_FAILED broadcasts) until now. ## What the test pins Drives the earliest failure path in `provisionWorkspace` — the global-secrets decrypt failure — so the setup needs only: - one `global_secrets` mock row (with `encryption_version=99` to force `crypto.DecryptVersioned` to error with a string that includes the literal version number) - one `UPDATE workspaces SET status = 'failed'` expectation - a `captureBroadcaster` (already in the test file) injected via `NewWorkspaceHandler` Asserts the captured `WORKSPACE_PROVISION_FAILED` payload: 1. carries the safe canned `"failed to decrypt global secret"` only 2. does NOT contain `"version=99"`, `"platform upgrade required"`, or the global_secret row's `key` value (`FAKE_KEY`) — the three leak markers a regression that interpolates `err.Error()` into the broadcast would surface ## Why not use containsUnsafeString The test file already has a `containsUnsafeString` helper with `"secret"` and `"token"` in its prohibition list. Those substrings match the legitimate redacted message (`"failed to decrypt global secret"`) — appropriate in user-facing copy, NOT a leak. Using the broad helper would either fail the test against the source's own correct message OR require loosening the helper for everyone else. 
Per-test explicit leak markers keep the assertion precise without weakening shared infrastructure. ## What's still skipped (out of scope for this PR) - `TestProvisionWorkspaceCP_NoInternalErrorsInBroadcast` — same shape but blocked on a different refactor: `provisionWorkspaceCP` routes through `provisioner.CPProvisioner` (concrete pointer, no interface), so the test would need either an interface extraction or a real CPProvisioner with a mocked HTTP server. Larger scope; deferred. - `TestResolveAndStage_NoInternalErrorsInHTTPErr` — different blocker (`mockPluginsSources` vs `plugins.Registry` type mismatch). Needs a SourceResolver-side interface refactor. Both still carry their `t.Skip` notes documenting the remaining work. ## Test plan - [x] New test passes - [x] Full handlers package suite still green (`go test ./internal/handlers/`) - [x] No changes to production code — pure test addition 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:31:30 -07:00
Hongming Wang	6eaacf175b	fix(notify): review-flagged Critical + Required findings on PR #2130 Two Critical bugs caught in code review of the agent→user attachments PR: 1. Empty-URI attachments slipped past validation. Gin's go-playground/validator does NOT iterate slice elements without `dive` — verified zero `dive` usage anywhere in workspace-server — so the inner `binding:"required"` tags on NotifyAttachment.URI/Name were never enforced. `attachments: [{"uri":"","name":""}]` would pass validation, broadcast empty-URI chips that render blank in canvas, AND persist them in activity_logs for every page reload to re-render. Added explicit per-element validation in Notify (returns 400 with `attachment[i]: uri and name are required`) plus defence-in-depth in the canvas filter (rejects empty strings, not just non-strings). 3-case regression test pins the rejection. 2. Hardcoded application/octet-stream stripped real mime types. `_upload_chat_files` always passed octet-stream as the multipart Content-Type. chat_files.go:Upload reads `fh.Header.Get("Content-Type")` FIRST and only falls back to extension-sniffing when the header is empty, so every agent-attached file lost its real type forever — broke the canvas's MIME-based icon/preview logic. Now sniff via `mimetypes.guess_type(path)` and only fall back to octet-stream when sniffing returns None. Plus three Required nits: - `sqlmockArgMatcher` was misleading — the closure always returned true after capture, identical to `sqlmock.AnyArg()` semantics, but named like a custom matcher. Renamed to `sqlmockCaptureArg(*string)` so the intent (capture for post-call inspection, not validate via driver-callback) is unambiguous. - Test asserted notify call by `await_args_list[1]` index — fragile to any future _upload_chat_files refactor that adds a pre-flight POST. Now filter call list by URL suffix `/notify` and assert exactly one match. 
- Added `TestNotify_RejectsAttachmentWithEmptyURIOrName` (3 cases) covering empty-uri, empty-name, both-empty so the Critical fix stays defended. Deferred to follow-up: - ORDER BY tiebreaker for same-millisecond notifies — pre-existing risk, not regression. - Streaming multipart upload — bounded by the platform's 50MB total cap so RAM ceiling is fixed; switch to streaming if cap rises. - Symlink rejection — agent UID can already read whatever its filesystem perms allow via the shell tool; rejecting symlinks doesn't materially shrink the attack surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 19:47:31 -07:00
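The per-element check described in Critical finding 1 of the review commit above could look roughly like this. It is a sketch, not the handler's actual code: `validateAttachments` is a hypothetical name, and the field set and error text follow the commit messages. The point is that without `dive`, `binding:"required"` tags on the element type are not applied to slice elements, so the loop enforces them explicitly.

```go
package main

import "fmt"

// NotifyAttachment follows the {URI, Name, MimeType, Size} shape named in
// the commit message. The binding tags are omitted because, as that message
// notes, they are not enforced on slice elements without `dive`.
type NotifyAttachment struct {
	URI      string `json:"uri"`
	Name     string `json:"name"`
	MimeType string `json:"mimeType"`
	Size     int64  `json:"size"`
}

// validateAttachments (hypothetical) rejects the first element with an
// empty uri or name; a handler would map the error to HTTP 400.
func validateAttachments(atts []NotifyAttachment) error {
	for i, a := range atts {
		if a.URI == "" || a.Name == "" {
			return fmt.Errorf("attachment[%d]: uri and name are required", i)
		}
	}
	return nil
}

func main() {
	good := []NotifyAttachment{{URI: "workspace:/build.zip", Name: "build.zip"}}
	bad := []NotifyAttachment{{URI: "", Name: ""}}
	fmt.Println(validateAttachments(good))
	fmt.Println(validateAttachments(bad))
}
```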
Hongming Wang	d028fe19ff	feat(notify): agent → user file attachments via send_message_to_user Closes the gap where the Director would say "ZIP is ready at /tmp/foo.zip" in plain text instead of attaching a download chip — the runtime literally had no API for outbound file attachments. The canvas + platform's chat-uploads infrastructure already supported the inbound (user → agent) direction (commit `94d9331c`); this PR wires the outbound side. End-to-end shape: agent: send_message_to_user("Done!", attachments=["/tmp/build.zip"]) ↓ runtime POST /workspaces/<self>/chat/uploads (multipart) ↓ platform /workspace/.molecule/chat-uploads/<uuid>-build.zip → returns {uri: workspace:/...build.zip, name, mimeType, size} ↓ runtime POST /workspaces/<self>/notify {message: "Done!", attachments: [{uri, name, mimeType, size}]} ↓ platform Broadcasts AGENT_MESSAGE with attachments + persists to activity_logs with response_body = {result: "Done!", parts: [{kind:file, file:{...}}]} ↓ canvas WS push: canvas-events.ts adds attachments to agentMessages queue Reload: ChatTab.loadMessagesFromDB → extractFilesFromTask sees parts[] Either path → ChatTab renders download chip via existing path Files changed: workspace-server/internal/handlers/activity.go - NotifyAttachment struct {URI, Name, MimeType, Size} - Notify body accepts attachments[], broadcasts in payload, persists as response_body.parts[].kind="file" canvas/src/store/canvas-events.ts - AGENT_MESSAGE handler reads payload.attachments, type-validates each entry, attaches to agentMessages queue - Skips empty events (was: skipped only when content empty) workspace/a2a_tools.py - tool_send_message_to_user(message, attachments=[paths]) - New _upload_chat_files helper: opens each path, multipart POSTs to /chat/uploads, returns the platform's metadata - Fail-fast on missing file / upload error — never sends a notify with a half-rendered attachment chip workspace/a2a_mcp_server.py - inputSchema declares attachments param so claude-code SDK 
surfaces it to the model - Defensive filter on the dispatch path (drops non-string entries if the model sends a malformed payload) Tests: - 4 new Python: success path, missing file, upload 5xx, no-attach backwards compat - 1 new Go: Notify-with-attachments persists parts[] in response_body so chat reload reconstructs the chip Why /tmp paths work even though they're outside the canvas's allowed roots: the runtime tool reads the bytes locally and re-uploads through /chat/uploads, which lands the file under /workspace (an allowed root). The agent can specify any readable path. Does NOT include: agent → agent file transfer. Different design problem (cross-workspace download auth: peer would need a credential to call sender's /chat/download). Tracked as a follow-up under task #114. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 19:35:58 -07:00
hongmingwang-moleculeai	a5e099d644	Merge branch 'staging' into feat/external-runtime-first-class	2026-04-26 16:34:17 -07:00
Hongming Wang	00f78c6252	fix(a2a-proxy): log when A2A_IDLE_TIMEOUT_SECONDS is invalid Review-feedback follow-up. Pre-fix, A2A_IDLE_TIMEOUT_SECONDS=foo or =-30 fell back to the default with zero log signal — operator sets the wrong value, sees "no effect," wastes hours debugging "why is my override not working." Now bad-input cases log a clear message naming the variable, the bad value, and the default applied. Refactor: extract parseIdleTimeoutEnv(string) → time.Duration so the parse logic is unit-testable. defaultIdleTimeoutDuration is a const so tests reference it without re-deriving the value. 8 new unit tests cover empty / valid / negative / zero / non-numeric / float / trailing-units inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 15:57:00 -07:00
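A minimal sketch of the extracted parser from the commit above, assuming integer-seconds input and `log.Printf` for the warning. The function name, the constant name, and the 5-minute default come from the commit messages; the exact log wording and the choice of `strconv.Atoi` are assumptions.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"time"
)

const defaultIdleTimeoutDuration = 5 * time.Minute

// parseIdleTimeoutEnv converts the raw A2A_IDLE_TIMEOUT_SECONDS value into a
// duration. Empty input silently uses the default; any other invalid input
// (non-numeric, float, trailing units, zero, negative) logs and uses the default.
func parseIdleTimeoutEnv(raw string) time.Duration {
	if raw == "" {
		return defaultIdleTimeoutDuration
	}
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		log.Printf("A2A_IDLE_TIMEOUT_SECONDS=%q is invalid; using default %s", raw, defaultIdleTimeoutDuration)
		return defaultIdleTimeoutDuration
	}
	return time.Duration(secs) * time.Second
}

func main() {
	for _, v := range []string{"", "600", "-30", "0", "foo", "1.5", "30s"} {
		fmt.Printf("%q -> %s\n", v, parseIdleTimeoutEnv(v))
	}
}
```

Taking the raw string as a parameter, rather than reading the environment inside the function, is what makes each input case testable without mutating process state.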
Hongming Wang	d552c43b94	fix(a2a-proxy): close 60s context-canceled gap on long silent runs Two compounding bugs caused the "context canceled" wave on 2026-04-26 (15+ failed user/agent A2A calls in 1hr across 6 workspaces, including the user's "send it in the chat" message that the director never received): 1. a2a_proxy.go:applyIdleTimeout cancels the dispatch after 60s of broadcaster silence for the workspace. Resets on any SSE event for the workspace, fires cancel() if no event arrives in time. 2. registry.go:Heartbeat broadcast was conditional — `if payload.CurrentTask != prevTask`. The runtime POSTs /registry/heartbeat every 30s, but if current_task hasn't changed the handler emits ZERO broadcasts. evaluateStatus only broadcasts on online/degraded transitions — also no-op when steady. Net: a claude-code agent on a long packaging step or slow tool call keeps the same current_task for >60s → no broadcasts → idle timer fires → in-flight request cancelled mid-flight with the "context canceled" error the user sees in the activity log. Fix: (a) Heartbeat handler always emits a `WORKSPACE_HEARTBEAT` BroadcastOnly event (no DB write — same path as TASK_UPDATED). At the existing 30s runtime cadence this resets the idle timer twice per minute. Cost is one in-memory channel send per active SSE subscriber + one WS hub fan-out per heartbeat — far below any noise floor. (b) idleTimeoutDuration default bumped 60s → 5min as a safety net for any future regression where the heartbeat path goes silent (e.g. runtime crashed mid-request before its next heartbeat). Made env-overridable via A2A_IDLE_TIMEOUT_SECONDS for ops who want to tune (canary tests fail-fast, prod tenants with slow plugins want longer). Either fix alone closes today's gap; both together is defence in depth. The runtime side already POSTs /registry/heartbeat every 30s via workspace/heartbeat.py — no runtime change needed. 
Test: TestHeartbeatHandler_AlwaysBroadcastsHeartbeat pins the property that an SSE subscriber observes a WORKSPACE_HEARTBEAT broadcast on a same-task heartbeat (the regression scenario). All 16 existing handler tests still pass. Doesn't fix: task #102 (single SDK session bottleneck) — peers will still queue when busy. But this PR ensures the queue/wait flow actually completes instead of being killed by the idle timer mid-wait. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 15:45:44 -07:00
Hongming Wang	4915d1d59e	fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB) The existing sweeper only reaps ws-* containers whose workspace row has status='removed'. That misses the entire wiped-DB case: an operator does `docker compose down -v` (kills the postgres volume), the previous platform's ws-* containers keep running, the new platform boots into an empty workspaces table — first pass finds zero candidates and those containers leak forever. Symptom users hit today: 7 ws-* containers from 11h ago, no rows in DB, no visibility in Canvas, eating CPU + memory. Fix shape: 1. Provisioner stamps every ws-* container + volume with `molecule.platform.managed=true`. Without a label, the sweeper would have to assume any unlabeled ws-* container might belong to a sibling platform stack on a shared Docker daemon. 2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter counterpart to the existing name-filter. 3. Sweeper splits sweepOnce into two independent passes: - sweepRemovedRows (unchanged behavior; status='removed' only) - sweepLabeledOrphansWithoutRows (new; labeled containers whose workspace_id has no row in the table at all) Each pass has its own short-circuit so an empty result or transient error in one doesn't block the other — load-bearing because the wiped-DB pass exists precisely for cases where the removed-row pass finds nothing. Safe under multi-platform-on-shared-daemon: only containers carrying our label get reaped, sibling stacks' containers are invisible to this pass. (For now the label is a constant string; a future per-instance UUID layer can refine "ours" further if a real shared-daemon scenario emerges.) Migration: existing platforms running pre-PR builds have UNLABELED ws-* containers. After this lands they continue to NOT be reaped by the new path (no label = invisible). They'll only be cleaned via manual intervention or once the operator recreates them — same as today. No regression. 
Tests cover all five branches of the new pass: happy-path reap, no-reap when row exists, mixed reap-some-keep-some, Docker error short-circuits cleanly, non-UUID prefixes get filtered before the SQL query. Pairs with PR #2122 (script-level fix). Together they close the orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users (handled by the script) AND `docker compose down -v` users (handled by the runtime). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:33:41 -07:00
Hongming Wang	9375e3d4ee	feat(workspace-server): GHCR digest watcher closes runtime CD chain (#2114 ) Adds an opt-in goroutine that polls GHCR every 5 minutes for digest changes on each workspace-template-*:latest tag and invokes the same refresh logic /admin/workspace-images/refresh exposes. With this, the chain from "merge runtime PR" to "containers running new code" is fully hands-off — no operator step between auto-tag → publish-runtime → cascade → template image rebuild → host pull + recreate. Opt-in via IMAGE_AUTO_REFRESH=true. SaaS deploys whose pipeline already pulls every release should leave it off (would be redundant work); self-hosters get true zero-touch. Why a refactor of admin_workspace_images.go is in this PR: The HTTP handler held all the refresh logic inline. To share it with the new watcher without HTTP loopback, extracted WorkspaceImageService with a Refresh(ctx, runtimes, recreate) (RefreshResult, error) shape. HTTP handler is now a thin wrapper; behavior is preserved (same JSON response, same 500-on-list-failure, same per-runtime soft-fail). Watcher design notes: - Last-observed digest tracked in memory (not persisted). On boot the first observation per runtime is seed-only — no spurious refresh fires on every restart. - On Refresh error, the seen digest rolls back so the next tick retries. Without this rollback a transient Docker glitch would convince the watcher the work was done. - Per-runtime fetch errors don't block other runtimes (one template's brief 500 doesn't pause the others). - digestFetcher injection seam in tick() lets unit tests cover all bookkeeping branches without standing up an httptest GHCR server. Verified live: probed GHCR's /token + manifest HEAD against workspace-template-claude-code; got HTTP 200 + a real Docker-Content-Digest. Same calls the watcher makes. Co-authored-by: Hongming Wang <hongmingwangalt@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 13:36:26 -07:00
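The watcher bookkeeping in the commit above (seed-only first observation, rollback on refresh error, per-runtime soft-fail) can be sketched with injected functions. All type and function names below are hypothetical, and the real watcher's polling loop and GHCR calls are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errRefreshFailed is a stand-in error used only in this illustration.
var errRefreshFailed = errors.New("refresh failed")

// digestWatcher tracks the last-observed digest per runtime in memory.
type digestWatcher struct {
	seen    map[string]string
	fetch   func(runtime string) (string, error) // injection seam for tests
	refresh func(runtime string) error
}

func newDigestWatcher(fetch func(string) (string, error), refresh func(string) error) *digestWatcher {
	return &digestWatcher{seen: map[string]string{}, fetch: fetch, refresh: refresh}
}

// tick checks each runtime once and returns those refreshed successfully.
func (w *digestWatcher) tick(runtimes []string) []string {
	var refreshed []string
	for _, rt := range runtimes {
		digest, err := w.fetch(rt)
		if err != nil {
			continue // one runtime's fetch error does not block the others
		}
		prev, known := w.seen[rt]
		w.seen[rt] = digest
		if !known || digest == prev {
			continue // first observation is seed-only; unchanged digest is a no-op
		}
		if err := w.refresh(rt); err != nil {
			w.seen[rt] = prev // roll back so the next tick retries
			continue
		}
		refreshed = append(refreshed, rt)
	}
	return refreshed
}

func main() {
	digests := map[string]string{"claude-code": "sha256:aaa"}
	w := newDigestWatcher(
		func(rt string) (string, error) { return digests[rt], nil },
		func(rt string) error { return nil },
	)
	fmt.Println(w.tick([]string{"claude-code"})) // seed only
	digests["claude-code"] = "sha256:bbb"
	fmt.Println(w.tick([]string{"claude-code"})) // digest changed
}
```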
rabbitblood	ca9a034bbe	test(handlers): add 11th INSERT arg (max_concurrent_tasks) to remaining Create-handler mocks CI on PR #2105 caught 7 Create-handler tests still mocking the pre-#1408 10-arg INSERT signature. With the column now wired unconditionally into the INSERT, every WithArgs that pinned budget_limit as the 10th arg needed an 11th slot for the resolved max_concurrent_tasks value. Files: - workspace_test.go: 6 tests (DBInsertError, DefaultsApplied, WithSecrets_Persists, TemplateDefaultsMissingRuntimeAndModel, TemplateDefaultsLegacyTopLevelModel, CallerModelOverridesTemplateDefault) - workspace_budget_test.go: 1 test (Budget_Create_WithLimit) All resolved values are the schema-default mirror, so the test expectation reads as the same models.DefaultMaxConcurrentTasks const that the handler writes. New imports added to both files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 11:14:02 -07:00
rabbitblood	4e6f6bf0f3	merge: sync staging into feat/wire-max-concurrent-from-template-1408	2026-04-26 11:11:30 -07:00
rabbitblood	4bcfc64e25	chore(simplify): drop verbose comments + introduce DefaultMaxConcurrentTasks const Simplify pass on top of the wire-up commit: - New const models.DefaultMaxConcurrentTasks = 1; handlers and tests reference the symbol so the schema-default mirror lives in one place. - Strip 5 multi-line comments that narrated what the code does. - Drop the duplicate field-rationale on OrgWorkspace; the one on CreateWorkspacePayload is canonical. - Drop test-side positional comments that would silently lie if columns get reordered. Pure cleanup; no behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 11:07:00 -07:00
rabbitblood	ad5295cd8a	feat(workspaces): wire max_concurrent_tasks from template config.yaml (#1408) Phase 4 of #1408 (active_tasks counter). Runtime increment/decrement, schema column (037), and scheduler enforcement (scheduler.go:312) already shipped — but the write path from template config.yaml + direct API was missing, so every workspace silently fell through to the schema default of 1. Leaders that set max_concurrent_tasks: 3 in their org template were getting 1 anyway, defeating the entire feature for the use case it was built for (cron-vs-A2A contention on PM/lead workspaces). - OrgWorkspace gains MaxConcurrentTasks (yaml + json tags) - CreateWorkspacePayload gains MaxConcurrentTasks (json tag) - Both INSERTs now write the column unconditionally; 0/omitted payload value falls back to 1 (schema default mirror) so the wire stays single-shape — no forked column list / goto. - Existing Create-handler test mocks updated to expect the 11th arg. - New TestWorkspaceCreate_MaxConcurrentTasksOverride locks the payload→DB propagation for the leader case (value=3). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 11:03:01 -07:00
Hongming Wang	3b09bcc589	Merge branch 'staging' into fix/canvas-multilevel-layout-ux	2026-04-26 10:44:02 -07:00
Hongming Wang	d0f198b24f	merge: resolve staging conflicts (a2a_proxy + workspace_crud) Three files conflicted with staging changes that landed while this PR sat open. Resolved each by combining both intents (not picking one side): - a2a_proxy.go: keep the branch's idle-timeout signature (workspaceID parameter + comment) AND apply staging's #1483 SSRF defense-in-depth check at the top of dispatchA2A. Type-assert h.broadcaster (now an EventEmitter interface per staging) back to Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through to no-op when the assertion fails (test-mock case). - a2a_proxy_test.go: keep both new test suites — branch's TestApplyIdleTimeout_ (3 cases for the idle-timeout helper) AND staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated the staging test's dispatchA2A call to pass the workspaceID arg introduced by the branch's signature change. - workspace_crud.go: combine both Delete-cleanup intents: * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas hang-up doesn't cancel mid-Docker-call (the container-leak fix) * Branch's stopAndRemove helper that skips RemoveVolume when Stop fails (orphan sweeper handles) * Staging's #1843 stopErrs aggregation so Stop failures bubble up as 500 to the client (the EC2 orphan-instance prevention) Both concerns satisfied: cleanup runs to completion past canvas hangup AND failed Stop calls surface to caller. Build clean, all platform tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:43:22 -07:00
Hongming Wang	78afa0f544	Merge branch 'staging' into feat/external-runtime-first-class	2026-04-26 10:40:15 -07:00
Hongming Wang	762d3b8b2c	test(ssrf): pin dev-mode RFC-1918 allow contract (follow-up to #2103) PR #2103 widened the SSRF saasMode branch to also relax RFC-1918 + ULA under MOLECULE_ENV=development (so the docker-compose dev pattern stops rejecting workspace registrations on 172.18.x.x bridge IPs). The existing TestIsSafeURL_DevMode_StillBlocksOtherRanges covered the security floor (metadata / TEST-NET / CGNAT stay blocked), but no test asserted the positive side — that 10.x / 172.x / 192.168.x / fd00:: ARE now allowed under dev mode. Without this test, a future refactor that quietly drops the `|| devModeAllowsLoopback()` from isPrivateOrMetadataIP wouldn't trip any assertion, and the docker-compose dev loop would silently re-break. Adds TestIsSafeURL_DevMode_AllowsRFC1918 — table of 4 URLs covering the three RFC-1918 IPv4 ranges + IPv6 ULA fd00::/8. Sets MOLECULE_DEPLOY_MODE=self-hosted explicitly so the test exercises the devMode branch, not a SaaS-mode pass. Closes the Optional finding I left on PR #2103. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 10:32:33 -07:00
Hongming Wang	0de67cd379	feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth The production-side end of the runtime CD chain. Operators (or the post- publish CI workflow) hit this after a runtime release to pull the latest workspace-template-* images from GHCR and recreate any running ws-* containers so they adopt the new image. Without this, freshly-published runtime sat in the registry but containers kept the old image until naturally cycled. Implementation notes: - Uses Docker SDK ImagePull rather than shelling out to docker CLI — the alpine platform container has no docker CLI installed. - ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64- encoded JSON payload Docker engine expects in PullOptions.RegistryAuth. Both empty → public/cached images only; both set → private GHCR pulls. - Container matching uses ContainerInspect (NOT ContainerList) because ContainerList returns the resolved digest in .Image, not the human tag. Inspect surfaces .Config.Image which is what we need. - Provisioner.DefaultImagePlatform() exported so admin handler picks the same Apple-Silicon-needs-amd64 platform as the provisioner — single source of truth for the multi-arch override. Local-dev companion: scripts/refresh-workspace-images.sh runs on the host and inherits the host's docker keychain auth — alternate path for when GHCR_USER/TOKEN aren't set in the platform env. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:17:21 -07:00
Hongming Wang	09972486e8	fix(platform/notify): persist agent send_message_to_user pushes Pre-fix, POST /workspaces/:id/notify (the side-channel agents use to push interim updates and follow-up results) only broadcast via WebSocket — no DB write. When the user refreshed the page, the chat-history loader (which queries activity_logs) couldn't restore those messages and they vanished from the chat. Hits the most common path: when the platform's POST /a2a times out (idle), the runtime keeps working and eventually pushes its reply via send_message_to_user. The reply rendered live but disappeared on reload. Fix: also INSERT an activity_logs row with shape the existing loader already understands (type=a2a_receive, source_id=NULL, response_body= {result: text}). Persistence is best-effort — a DB hiccup doesn't block the WebSocket push (which the user is already seeing). 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:14:47 -07:00
Hongming Wang	7ed50824b6	fix(platform/ssrf): allow RFC-1918 in MOLECULE_ENV=development The docker-compose dev pattern puts platform and workspace containers on the same docker bridge network (172.18.0.0/16, RFC-1918). The runtime registers via its docker-internal hostname which DNS-resolves to a 172.18.x.x IP. The SSRF defence's isPrivateOrMetadataIP rejected those, so every workspace POST through the platform proxy returned 'workspace URL is not publicly routable' — breaking the entire docker- compose dev loop. Fix: in isPrivateOrMetadataIP, treat MOLECULE_ENV=development the same as SaaS mode for RFC-1918 relaxation. Both share the 'trusted intra- network routing' property — SaaS is sibling EC2s in the same VPC, dev is sibling containers on the same docker bridge. Always-blocked categories (metadata link-local, TEST-NET, CGNAT) stay blocked. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:14:47 -07:00
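The relaxation described in the commit above can be illustrated with a small sketch. It is not the repository's isPrivateOrMetadataIP: the function names, the exact prefix lists, and the single `trustedNetwork` flag (standing in for "SaaS mode or MOLECULE_ENV=development") are assumptions, and loopback handling is deliberately left out.

```go
package main

import (
	"fmt"
	"net/netip"
)

var (
	// Ranges that stay blocked in every mode.
	alwaysBlocked = []netip.Prefix{
		netip.MustParsePrefix("169.254.0.0/16"),  // link-local, includes cloud metadata
		netip.MustParsePrefix("100.64.0.0/10"),   // CGNAT
		netip.MustParsePrefix("192.0.2.0/24"),    // TEST-NET-1
		netip.MustParsePrefix("198.51.100.0/24"), // TEST-NET-2
		netip.MustParsePrefix("203.0.113.0/24"),  // TEST-NET-3
	}
	// Ranges that are allowed only on a trusted intra-network.
	privateRanges = []netip.Prefix{
		netip.MustParsePrefix("10.0.0.0/8"),
		netip.MustParsePrefix("172.16.0.0/12"),
		netip.MustParsePrefix("192.168.0.0/16"),
		netip.MustParsePrefix("fc00::/7"), // IPv6 ULA, covers fd00::/8
	}
)

func inAny(prefixes []netip.Prefix, addr netip.Addr) bool {
	for _, p := range prefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

// isBlocked reports whether ip should be rejected. Unparseable input is
// rejected (fail closed). trustedNetwork is a hypothetical flag.
func isBlocked(ip string, trustedNetwork bool) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return true
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	if inAny(alwaysBlocked, addr) {
		return true
	}
	if inAny(privateRanges, addr) {
		return !trustedNetwork
	}
	return false
}

func main() {
	fmt.Println(isBlocked("172.18.0.5", false))     // true: docker bridge IP, strict mode
	fmt.Println(isBlocked("172.18.0.5", true))      // false: relaxed on a trusted network
	fmt.Println(isBlocked("169.254.169.254", true)) // true: metadata stays blocked
}
```

Checking the always-blocked list first keeps the security floor independent of the mode flag, which matches the intent stated in the commit.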
Hongming Wang	d97d7d4768	fix(platform/delegation): classify queued response + stitch drain result back When proxyA2A returns 202+{queued:true} (target busy → enqueued for drain on next heartbeat), executeDelegation previously treated it as a successful completion and ran extractResponseText on the queued JSON. The result was 'Delegation completed (workspace agent busy — request queued, will dispatch...)' landing in activity_logs.summary, which the LLM then echoed to the user chat as garbage. Two fixes: 1. delegation.go: detect queued shape via new isQueuedProxyResponse helper, write status='queued' with clean summary 'Delegation queued — target at capacity', store delegation_id in response_body so the drain can stitch back later. Also embed delegation_id in params.message.metadata + use it as messageId so the proxy's idempotency-key path keys off the same id. 2. a2a_queue.go: when DrainQueueForWorkspace successfully drains a queued item, extract delegation_id from the body's metadata and UPDATE the originating delegate_result row (queued → completed with real response_body). Broadcast DELEGATION_COMPLETE so the canvas chat feed flips the queued line to completed in real time. Closes the loop so check_task_status reflects ground truth instead of perpetual 'queued' even after the queued request eventually drained. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:14:19 -07:00
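A minimal sketch of the queued-response check described in the commit above. The name `looksQueued` is a hypothetical stand-in for isQueuedProxyResponse, and the assumption is that the queued shape is exactly HTTP 202 plus a JSON body with `"queued": true`; the real helper may inspect more fields.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// looksQueued reports whether a proxy response means "enqueued for later
// drain" rather than "completed". Hypothetical helper for illustration.
func looksQueued(status int, body []byte) bool {
	if status != http.StatusAccepted {
		return false
	}
	var probe struct {
		Queued bool `json:"queued"`
	}
	if err := json.Unmarshal(body, &probe); err != nil {
		return false // not JSON, so not the queued shape
	}
	return probe.Queued
}

func main() {
	fmt.Println(looksQueued(202, []byte(`{"queued":true}`))) // true
	fmt.Println(looksQueued(200, []byte(`{"queued":true}`))) // false
	fmt.Println(looksQueued(202, []byte(`{"result":"ok"}`))) // false
}
```

With a check like this in place, the caller can record a queued status and skip text extraction instead of summarizing the queue notice as if it were the agent's reply.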
Hongming Wang	7d48f24fef	test(handlers): introduce events.EventEmitter interface (#1814 partial) The 3 skipped tests in workspace_provision_test.go (#1206 regression tests) were blocked because captureBroadcaster's struct-embed wouldn't type-check against WorkspaceHandler.broadcaster's concrete events.Broadcaster field. This PR fixes the interface blocker for the 2 broadcaster-related tests; the 3rd (plugins.Registry resolver) is a separate blocker tracked elsewhere. Changes: - internal/events/broadcaster.go: define `EventEmitter` interface with RecordAndBroadcast + BroadcastOnly. Broadcaster satisfies it via its existing methods (compile-time assertion guards future drift). SubscribeSSE / Subscribe stay off the interface because only sse.go + cmd/server/main.go call them, and both still hold the concrete Broadcaster. - internal/handlers/workspace.go: WorkspaceHandler.broadcaster type changes from events.Broadcaster to events.EventEmitter. NewWorkspaceHandler signature updated to match. Production callers unchanged — they pass *events.Broadcaster, which the interface accepts. - internal/handlers/activity.go: LogActivity takes events.EventEmitter for the same reason — tests passing a stub no longer need to construct the full broadcaster. - internal/handlers/workspace_provision_test.go: captureBroadcaster drops the struct embed (no more zero-value Broadcaster underlying the SSE+hub fields), implements RecordAndBroadcast directly, and adds a no-op BroadcastOnly to satisfy the interface. Skip messages on the 2 empty broadcaster-blocked tests updated to reflect the new "interface unblocked, test body still needed" state. Verified `go build ./...`, `go test ./internal/handlers/`, and `go vet ./...` all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 09:05:52 -07:00
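The interface extraction described in the commit above follows a common Go pattern, sketched here under assumptions. Only the method names RecordAndBroadcast and BroadcastOnly come from the commit; the parameter lists, the `handler` type, the `captureEmitter` stub, and the event name are hypothetical.

```go
package main

import "fmt"

// EventEmitter is the narrow interface a handler depends on.
// Method signatures are assumed for illustration.
type EventEmitter interface {
	RecordAndBroadcast(eventType string, payload any) error
	BroadcastOnly(eventType string, payload any)
}

// Broadcaster stands in for the concrete production type.
type Broadcaster struct{}

func (b *Broadcaster) RecordAndBroadcast(eventType string, payload any) error { return nil }
func (b *Broadcaster) BroadcastOnly(eventType string, payload any)            {}

// Compile-time assertion: the build fails if Broadcaster drifts away
// from the interface.
var _ EventEmitter = (*Broadcaster)(nil)

// captureEmitter is a test stub that records event types without
// constructing a full broadcaster.
type captureEmitter struct{ events []string }

func (c *captureEmitter) RecordAndBroadcast(eventType string, _ any) error {
	c.events = append(c.events, eventType)
	return nil
}

func (c *captureEmitter) BroadcastOnly(string, any) {} // no-op to satisfy the interface

// handler holds the interface, so production code and tests can both
// supply an implementation.
type handler struct{ emitter EventEmitter }

func (h *handler) provision(id string) error {
	return h.emitter.RecordAndBroadcast("WORKSPACE_PROVISIONING", map[string]string{"id": id})
}

func main() {
	stub := &captureEmitter{}
	h := &handler{emitter: stub}
	if err := h.provision("ws-1"); err != nil {
		fmt.Println("provision failed:", err)
		return
	}
	fmt.Println(stub.events)
}
```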
Hongming Wang	fd891a147e	fix(a2a): isSafeURL guard inside dispatchA2A (closes #1483) #1483 flagged that dispatchA2A() doesn't call isSafeURL internally — the guard exists only at the caller level (resolveAgentURL at a2a_proxy.go:424). The primary call path through proxyA2ARequest is safe today, but if any future code path ever calls dispatchA2A directly without going through resolveAgentURL, the SSRF check would be silently bypassed. This adds the one-line defense-in-depth guard the issue prescribed: if err := isSafeURL(agentURL); err != nil { return nil, nil, &proxyDispatchBuildError{err: err} } Wrapping as *proxyDispatchBuildError preserves the existing caller error-classification path — the same shape that maps to 500 elsewhere. Adds TestDispatchA2A_RejectsUnsafeURL pinning the contract: re-enables SSRF for the test (setupTestDB disables it for normal unit tests), passes a metadata IP, asserts the build error returns and cancel is nil so no resource is leaked. The 4 existing dispatchA2A unit tests use setupTestDB → SSRF disabled, so they continue passing unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 07:18:58 -07:00
Hongming Wang	a8c9644618	Merge pull request #2094 from Molecule-AI/feat/server-side-provision-timeout-2054-phase2 feat(workspace-server): surface provision_timeout_ms in workspace API (#2054 phase 2)	2026-04-26 13:53:18 +00:00
Hongming Wang	2b76f7dfcb	fix(discovery): isSafeURL guard on registered URLs (closes #1484) #1484 flagged that discoverHostPeer() and writeExternalWorkspaceURL() return URLs sourced from the workspaces table without an isSafeURL check. Workspace runtimes register their own URLs via /registry/register — a misbehaving / compromised runtime could register a metadata-IP URL. Today both functions are gated by Phase 30.6 bearer-required Discover, so exposure is theoretical. The fix makes them safe regardless of upstream auth shape. Changes: - discoverHostPeer: isSafeURL on resolved URL before responding; 503 + log on rejection. - writeExternalWorkspaceURL: same guard applied to the post-rewrite outURL (so a host.docker.internal rewrite is checked AND a metadata-IP that survived the rewrite untouched is rejected). - 3 new regression tests: * RejectsMetadataIPURL on host-peer path (169.254.169.254 → 503) * AcceptsPublicURL on host-peer path (8.8.8.8 → 200; positive counterpart so the rejection test can't pass via universal-fail) * RejectsMetadataIPURL on external-workspace path setupTestDB already disables SSRF checks via setSSRFCheckForTest, so the 16+ existing discovery tests remain untouched. Only the new tests opt in to enabled SSRF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 06:50:36 -07:00
rabbitblood	f1ad012024	refactor(handlers): apply simplify findings on PR #2094 - Extract walkTemplateConfigs(configsDir, fn) shared helper. Both templates.List and loadRuntimeProvisionTimeouts walked configsDir + parsed config.yaml — same boilerplate twice. Now centralised so a future template-discovery rule (subdir naming, README sentinel, etc.) lands in one place. - templates.List uses the walker — net -10 lines. - loadRuntimeProvisionTimeouts uses the walker — net -10 lines. - Document runtimeProvisionTimeoutsCache as 'NOT SAFE for package-level reuse' so a future change doesn't accidentally promote it to a singleton (sync.Once can't be reset → tests would lock out other fixtures). Skipped (review finding): atomic.Pointer[map[string]int] for future hot-reload. The doc comment already documents the limitation; YAGNI-promoting the primitive now would buy a not-yet-built feature at the cost of more code today. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 06:40:15 -07:00
rabbitblood	27396d992c	feat(workspace-server): surface provision_timeout_ms in workspace API (#2054 phase 2) Phase 2 of #2054 — workspace-server reads runtime-level provision_timeout_seconds from template config.yaml manifests and includes provision_timeout_ms in the workspace List/Get response. Phase 1 (canvas, #2092) already plumbs the field through socket → node-data → ProvisioningTimeout's resolver, so the moment a template declares the field the per-runtime banner threshold adjusts without a canvas release. Implementation: - templates.go: parse runtime_config.provision_timeout_seconds in the templateSummary marshaller. The /templates API now surfaces the field too — useful for ops dashboards and future tooling. - runtime_provision_timeouts.go (new): loadRuntimeProvisionTimeouts scans configsDir, parses every immediate subdir's config.yaml, returns runtime → seconds. Multiple templates with the same runtime: max wins (so a slow template's threshold doesn't get cut by a fast template's). Bad/empty inputs are silently skipped — workspace-server starts cleanly with no templates. - runtimeProvisionTimeoutsCache: sync.Once-backed lazy cache. First workspace API request after process start pays the read cost (~few KB across ~50 templates); every subsequent request is a map lookup. Cache lifetime = process lifetime; invalidates on workspace-server restart, which is the normal template-change cadence. - WorkspaceHandler gets a provisionTimeouts field (zero-value struct is valid — the cache lazy-inits on first get()). - addProvisionTimeoutMs decorates the response map with provision_timeout_ms (seconds × 1000) when the runtime has a declared timeout. Absent = no key in the response, canvas falls through to its runtime-profile default. Wired into both List (per-row decoration in the loop) and Get. 
Tests (5 new in runtime_provision_timeouts_test.go): - happy path: hermes declares 720, claude-code doesn't, only hermes appears in the map - max-on-duplicate: same runtime in two templates → max wins - skip-bad-inputs: missing runtime, zero timeout, malformed yaml, loose top-level files all silently ignored - missing-dir: returns empty map, no crash - cache: lazy-init on first get; subsequent gets hit cache even after underlying file changes (sync.Once contract); unknown runtime returns zero Phase 3 (separate template-repo PR): template-hermes config.yaml declares provision_timeout_seconds: 720 under runtime_config. canvas RUNTIME_PROFILES.hermes becomes redundant + removable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 06:37:45 -07:00
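The max-on-duplicate merge, the sync.Once-backed cache, and the seconds-to-milliseconds decoration described in the commit above can be sketched as follows. This is an illustration under assumptions, not the repository's code: `templateTimeout`, `mergeTimeouts`, and `timeoutCache` are hypothetical names, and the loader is injected as a function instead of reading config.yaml files from disk.

```go
package main

import (
	"fmt"
	"sync"
)

// templateTimeout is one parsed template manifest entry (hypothetical type).
type templateTimeout struct {
	Runtime string
	Seconds int
}

// mergeTimeouts builds runtime -> seconds. Entries with no runtime or a
// non-positive timeout are skipped; on duplicates the larger value wins.
func mergeTimeouts(entries []templateTimeout) map[string]int {
	out := make(map[string]int)
	for _, e := range entries {
		if e.Runtime == "" || e.Seconds <= 0 {
			continue
		}
		if e.Seconds > out[e.Runtime] {
			out[e.Runtime] = e.Seconds
		}
	}
	return out
}

// timeoutCache loads the map once, on first use. The zero value is usable:
// with no loader every lookup returns 0.
type timeoutCache struct {
	once      sync.Once
	load      func() map[string]int
	byRuntime map[string]int
}

func (c *timeoutCache) get(runtime string) int {
	c.once.Do(func() {
		if c.load != nil {
			c.byRuntime = c.load()
		}
	})
	return c.byRuntime[runtime] // reading a nil map yields 0
}

// addProvisionTimeoutMs sets provision_timeout_ms only when the runtime
// declares a timeout; otherwise the key stays absent.
func addProvisionTimeoutMs(resp map[string]any, runtime string, c *timeoutCache) {
	if secs := c.get(runtime); secs > 0 {
		resp["provision_timeout_ms"] = secs * 1000
	}
}

func main() {
	cache := &timeoutCache{load: func() map[string]int {
		return mergeTimeouts([]templateTimeout{
			{Runtime: "hermes", Seconds: 600},
			{Runtime: "hermes", Seconds: 720},
			{Runtime: "claude-code", Seconds: 0},
		})
	}}
	resp := map[string]any{"runtime": "hermes"}
	addProvisionTimeoutMs(resp, "hermes", cache)
	fmt.Println(resp["provision_timeout_ms"]) // 720000
}
```

Because sync.Once cannot be reset, a cache like this reflects template changes only after a process restart, which is the trade-off the commit message calls out.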


403 Commits