Commit Graph

660 Commits

Hongming Wang
da3cb4c098 fix(workspace-server): close TOCTOU race on workspaces(parent_id, name) (#2872 Critical 1)
## Bug

`/org/import` had no per-tenant mutex, advisory lock, or DB-level
uniqueness on (parent_id, name). The pattern was lookup-then-insert:

    existingID, existing, err := h.lookupExistingChild(...)  // SELECT
    if existing { return /* skip */ }
    db.DB.ExecContext(ctx, `INSERT INTO workspaces ...`)     // INSERT

Two concurrent admin POSTs (rapid double-click in canvas, retry-after-
timeout, two operators on the same template) both saw "not found" in
the SELECT and both INSERT'd the same (parent_id, name).

Captured impact: tenant-hongming accumulated 72 stale child workspaces
in 4 days from repeated org-template spawns of the same template
(see #2857 phase 4 sweeper for the cleanup; #2872 for the prevention RFC).

## Fix

Two-layer fix — DB-level backstop AND application-level happy path:

1. **Migration** `20260506000000_workspaces_unique_parent_name.up.sql`

   ```sql
   CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS workspaces_parent_name_uniq
     ON workspaces (
       COALESCE(parent_id, '00000000-0000-0000-0000-000000000000'::uuid),
       name
     )
     WHERE status != 'removed';
   ```

   * COALESCE(parent_id, sentinel) collapses NULLs so root workspaces
     also collide pairwise.
   * `WHERE status != 'removed'` lets a tombstoned row be replaced
     by a same-named re-import (preserves existing org-import semantics).
   * CONCURRENTLY avoids ACCESS EXCLUSIVE on production tenants under
     live traffic; IF NOT EXISTS makes the migration resumable.
   * Down migration drops CONCURRENTLY symmetrically.

2. **`org_import.go` swap**

   Replace lookup-then-insert with `INSERT ... ON CONFLICT DO NOTHING
   RETURNING id`. On the skip path (RETURNING returns 0 rows →
   sql.ErrNoRows), re-select the existing id to recurse children:

       INSERT INTO workspaces (...) VALUES (...)
       ON CONFLICT (COALESCE(parent_id, ...), name)
       WHERE status != 'removed'
       DO NOTHING
       RETURNING id;

   The ON CONFLICT target predicate matches the partial-index predicate
   exactly — required for Postgres to consider the index applicable.

   Existing `lookupExistingChild` helper kept (still used on the skip
   path); semantics unchanged.

## Test coverage

* AST gate refreshed to assert the workspaces INSERT contains the
  ON CONFLICT pattern (`onConflictDoNothingRE`) instead of the now-obsolete
  "lookup-before-insert" ordering. Per behavior-based gating
  (memory: feedback_behavior_based_ast_gates.md), the new gate pins
  the actual TOCTOU-resolution behavior.
* Companion `TestGate_FailsWhenInsertOmitsOnConflict` proves the gate
  catches the bug shape on synthetic source.
* All existing `lookupExistingChild` unit tests (no-rows, found,
  nil-parent, DB error, wrapped no-rows) still pass — helper is
  unchanged and still load-bearing on the skip path.
* Live Postgres E2E coverage runs via the existing
  "Handlers Postgres Integration" CI job, which applies migrations
  to a real PG and exercises the INSERT path.

## Why ship the migration + swap together (not stacked)

The migration alone provides a DB-level backstop, but without the
handler swap a UNIQUE-violation surfaces as a 500 to the user. The
handler swap alone has no enforceable target until the migration
applies. Shipped together they give graceful skip + atomic backstop.

Migration is CONCURRENTLY + IF NOT EXISTS, safe to apply even on
tenants where the sweeper (#2860) hasn't run yet — the index just
declines to build until conflicting rows are reconciled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:43:49 -07:00
Hongming Wang
b759548822 fix(chat-uploads): activity rows commit atomically with PutBatch
Closes #149.

uploadPollMode for poll-mode chat uploads previously committed N
pending_uploads rows in one Tx (PutBatch), then wrote N activity_logs
rows individually outside any Tx. A per-row failure on activity row K
left rows 1..K-1 committed and pending_uploads orphaned until the 24h
TTL — not data-loss because the platform's fetcher handled the
half-state cleanly, but the user never saw file K in the canvas and
the inconsistency surfaced as an "uploaded but invisible" complaint
class.

Thread one Tx through PutBatchTx + N × LogActivityTx + Commit so all
or none commit. Broadcasts are deferred until after Commit — emitting
an ACTIVITY_LOGGED event for a row that ends up rolled back would
paint a ghost message into the canvas's optimistic UI. A new
LogActivityTx returns a commitHook the caller invokes post-Commit;
the existing fire-and-forget LogActivity is unchanged for the 4 other
production callers (a2a_proxy_helpers + activity.go report path).

Storage interface gains PutBatchTx; PostgresStorage.PutBatch is
refactored to share the validation + insert path. inMemStorage and
fakeSweepStorage delegate or no-op for PutBatchTx (the in-mem fake
can't model Tx state — DB-level atomicity is verified by the existing
real-Postgres integration test for PutBatch + the new unit test
asserting the Go handler calls Rollback on activity-insert failure).
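
The deferred-broadcast shape can be sketched as follows (names and types are illustrative, not the shipped LogActivityTx API): hooks accumulate inside the Tx scope and fire only after Commit, so a rolled-back row can never paint a ghost message.

```go
package main

import "fmt"

// commitHook defers a side effect (here, a WS broadcast) until after Commit.
type commitHook func()

// txScope collects hooks during a transaction; they fire only on Commit.
type txScope struct{ hooks []commitHook }

func (t *txScope) add(h commitHook) { t.hooks = append(t.hooks, h) }
func (t *txScope) rollback()        { t.hooks = nil } // drop deferred broadcasts
func (t *txScope) commit() {
	for _, h := range t.hooks {
		h() // broadcasts fire strictly after the DB commit
	}
}

func main() {
	var broadcasts []string
	tx := &txScope{}
	tx.add(func() { broadcasts = append(broadcasts, "ACTIVITY_LOGGED") })
	fmt.Println(len(broadcasts)) // nothing emitted before commit
	tx.commit()
	fmt.Println(broadcasts)
}
```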

Tests:
- TestPollUpload_AtomicRollbackOnActivityInsertFailure pins the new
  contract via sqlmock — second activity insert errors → Rollback
  expected, Commit must NOT be called.
- TestLogActivityTx_DefersBroadcastUntilCommitHook +
  _InsertError_NoHook_NoBroadcast + _NilTx_Errors cover the new API.
- TestPutBatchTx_HappyPath / _EmptyItems / _ValidationFails /
  _PerRowErrorPropagates cover Tx-aware storage layer.
- 7 existing TestPollUpload_* tests updated to mock Begin + Commit
  (or Begin + Rollback for failure paths) since the handler now
  opens a Tx around PutBatch + activity inserts.

All workspace-server tests pass; integration tag also clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:34:28 -07:00
Hongming Wang
19df43e3da
Merge pull request #2993 from Molecule-AI/rfc-2945-pr-b-1-migrate-bare-event-strings
refactor(events): migrate 18 producers to typed EventType constants (RFC #2945 PR-B-1)
2026-05-06 03:45:47 +00:00
Hongming Wang
f39b595a9c fix(workspace files API): EIC parity for ListFiles + DeleteFile (closes #2999 PR-A)
## User-visible bug

Canvas Files tab returns "0 files / No config files yet" for every
SaaS workspace, every root (/configs, /home, /workspace, /plugins).
Reported by user (canvas screenshot, hongming.moleculesai.app,
Hongming Personal Brand Agent — claude-code, T4, online).

## Root cause

`ListFiles` (templates.go) was missing the SSH-via-EIC branch that
ReadFile (PR #2785) and WriteFile (PR #1702) already have. On SaaS,
dockerCli is nil → findContainer returns "" → falls through to
host-side resolveTemplateDir which only matches baked-in template
names. For a user-named workspace it matches nothing, so the handler
silently returns []fileEntry{}.

DeleteFile had the same gap — right-click delete (introduced in PR-C
of this issue) would have silently no-op'd even once the ListFiles
fix landed.

## Fix

1. Extracted shared EIC plumbing into `withEICTunnel` (closure-based,
   single SSOT for keypair → key push → tunnel → port-wait → cleanup).
   Refactored writeFileViaEIC + readFileViaEIC to use it. Added
   listFilesViaEIC + deleteFileViaEIC on the same scaffold. The
   `LogLevel=ERROR` shim from PR #2822 now lives in one
   `eicSSHSession.sshArgs()` helper instead of being duplicated per
   helper — the next time we need to tweak ssh options, one place.

2. Factored remote shell strings into pure functions
   (buildInstallShell / buildCatShell / buildRmShell / buildFindShell
   + parseFindOutput) so the wire shape can be pinned without booting
   a real EIC tunnel.

3. Refactored `resolveWorkspaceFilePath(runtime, root, relPath)` to
   honor `?root=`. New rule: `/configs` (or empty / unrecognized) →
   runtime managed-config dir via workspaceFilePathPrefix (preserves
   the v1 ReadFile/WriteFile behaviour where canvas's Config tab
   GETs/PUTs config.yaml without specifying a root and lands in the
   right per-runtime dir); `/home`, `/workspace`, `/plugins` →
   literal absolute path on the EC2 host. List/Read/Write/Delete now
   agree on what file a tree row points to — pre-fix List would say
   "/home contents" but Read/Write would route to /configs.

4. ListFiles + DeleteFile dispatch on instance_id != "" → EIC helper.
   Errors from the EIC path produce 500 (not silent fall-through to
   local-Docker, which would mask the failure as "0 files" — the
   exact user-visible symptom).

5. Added ?root= validation gate to WriteFile + DeleteFile so an
   out-of-allowlist root is rejected before the resolver runs.
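
The closure shape of item 1 can be sketched like this (all names here are assumptions, not the shipped `withEICTunnel`): setup runs once, the callback gets a ready session, and teardown is guaranteed whether the callback succeeds or fails.

```go
package main

import "fmt"

// session stands in for the real EIC SSH session.
type session struct{ open bool }

// withTunnel sets up the tunnel once (keypair, key push, port wait elided)
// and guarantees teardown runs whether fn succeeds or fails — one single
// source of truth for setup/cleanup that list/read/write/delete can share.
func withTunnel(fn func(*session) error) error {
	s := &session{open: true}         // setup
	defer func() { s.open = false }() // cleanup always runs
	return fn(s)
}

func main() {
	err := withTunnel(func(s *session) error {
		fmt.Println("session open:", s.open)
		return nil
	})
	fmt.Println("err:", err)
}
```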

## Test coverage

- TestResolveWorkspaceFilePath_RuntimeIndirection — pins the
  /configs → runtime prefix translation per-runtime (hermes,
  claude-code, langgraph, external, unknown). Catches the regression
  where a future edit accidentally drops the runtime indirection.

- TestResolveWorkspaceFilePath_LiteralRoots — pins /home,
  /workspace, /plugins as literal pass-through regardless of
  runtime. Catches the symmetric regression where the literal roots
  start getting rewritten to the runtime prefix (which would mean
  the FilesTab "/home" selector silently routes to /configs on
  hermes).

- TestResolveWorkspaceRootPath — directory-only translation used
  by listFilesViaEIC, same indirection rules.

- TestSSHArgs_HardenedFlags — pins the centralised ssh option set
  (LogLevel=ERROR + hardening). Catches drift in the
  one-place-where-ssh-flags-live.

- TestEicSSHSessionSingleSourceForSSHFlags — behaviour-based AST
  gate (per memory). Counts s.sshArgs() callers (must be ≥4 —
  list/read/write/delete) and asserts LogLevel=ERROR appears
  exactly once in the source. Fires if anyone copy-pastes a raw
  ssh args slice instead of going through the helper.

- TestBuildInstallShell / TestBuildCatShell / TestBuildRmShell /
  TestBuildFindShell — pure-function tests pinning the remote
  command shape. Catches regression like "rm -f silently becomes
  rm -rf" or "find loses node_modules pruning" without needing a
  real EC2.

- TestBuildFindShell_DepthForwarding — catches a regression where
  the helper hard-codes a depth instead of using the caller's value.

- TestParseFindOutput / TestParseFindOutput_EmptyInput — pin the
  TYPE|SIZE|REL parser. Empty-input case explicitly returns []
  not nil so the JSON wire shape stays a list.

- TestListFiles_EICDispatch_Success / Error — sqlmock-driven
  handler test. Verifies instance_id != "" routes to listFilesViaEIC
  and surfaces errors as 500 (does NOT silently fall through to
  local-Docker, which is the exact regression-mode of the original
  bug).

- TestListFiles_EICBranch_NotTakenForSelfHosted — back-compat
  guard: instance_id == "" must NOT enter the EIC branch (would
  break self-hosted operators).

- TestDeleteFile_EICDispatch_Success / Error — same shape for
  DeleteFile.

- TestListFiles_RootValidation / TestDeleteFile_RootValidation —
  ?root=/etc must 400 before any DB query or EIC call.
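
The TYPE|SIZE|REL parser pinned above can be sketched as a pure function (field names assumed); the empty-input case deliberately returns an empty slice rather than nil so the JSON wire shape stays a list:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fileEntry mirrors the handler's per-row shape (field names assumed).
type fileEntry struct {
	Type string // "f" file, "d" directory
	Size int64
	Rel  string // path relative to the requested root
}

// parseFindOutput parses "TYPE|SIZE|REL" lines emitted by the remote find
// shell. Malformed lines are skipped rather than failing the whole listing.
func parseFindOutput(out string) []fileEntry {
	entries := []fileEntry{} // [] not nil: JSON wire shape stays a list
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		parts := strings.SplitN(line, "|", 3)
		if len(parts) != 3 {
			continue
		}
		size, err := strconv.ParseInt(parts[1], 10, 64)
		if err != nil {
			continue
		}
		entries = append(entries, fileEntry{parts[0], size, parts[2]})
	}
	return entries
}

func main() {
	fmt.Println(parseFindOutput("f|42|configs/config.yaml\nd|0|plugins"))
	fmt.Println(parseFindOutput("") != nil) // empty input: [] not nil
}
```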

## Verification

- `go build ./...` clean
- `go test ./...` clean (full workspace-server suite)
- Will be live-verified against staging on hongming.moleculesai.app
  after merge: open Files tab → expect populated /home + /configs +
  /workspace listings (not "0 files"); right-click delete on
  /configs/old.yaml → expect file removed on the EC2 host.

## Three weakest spots (hostile self-review)

1. The LogLevel=ERROR drift gate counts source occurrences. A
   future refactor that intentionally moves the literal somewhere
   else (e.g. into a constant) would trigger a false positive. The
   gate's failure message points to the load-bearing constraint
   (must appear in sshArgs); operator can adjust.

2. `eicFileWriteTimeout` constant kept as an alias for back-compat
   with prior tests. Documented as intentional + safe to remove on
   the next pass.

3. The resolver tests pin the runtime → prefix map values
   (`/home/ubuntu/.hermes`, `/configs`, etc.). A future runtime
   addition that ships a new prefix needs the test updated. This
   is intentional — silent prefix changes orphan saved files, so a
   test failure on map edit IS the right signal.

## Follow-up (RFC #2312 subtask 2)

Long-term the right fix is to drop EIC entirely and HTTP-forward to
the workspace's own URL (RFC #2312). That's a substantially larger
refactor across 5 surfaces (chat upload, files, templates, plugins,
terminal) and out of scope for this bug-fix PR. Tracked separately
under that RFC.

Refs #2999.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 20:18:05 -07:00
Hongming Wang
64e58fb390 test(memory-v2-e2e): update expectChainQueryRoot for new name column
PR #2990 root cause: the resolver SQL added `name` to the SELECT for
DisplayName plumbing, but the e2e test's sqlmock fixture
(expectChainQueryRoot at swap_test.go:216) still scripts the
3-column shape. Three e2e tests fail with:

    sql: expected 3 destination arguments in Scan, not 4

Fix: bump the fixture to 4 columns (id, name, parent_id, depth) and
pass an empty name. The e2e tests don't assert on label rendering —
they pin the namespace string flow ("workspace:root-1" etc), which
is unchanged. Empty name is fine: ReadableNamespaces still emits the
correct namespace strings; only DisplayName is empty.

Caught by CI's Platform (Go) check on PR #2990 — would have been a
silent missed-coverage case in the resolver_test.go run because that
package doesn't import the e2e package.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-05 19:10:18 -07:00
Hongming Wang
9ceda9d81f refactor(events): migrate 18 files to typed EventType constants (RFC #2945 PR-B-1)
Mechanical migration of bare event-name strings in BroadcastOnly /
RecordAndBroadcast call sites to the typed constants from
internal/events/types.go (RFC #2945 PR-B). Wire format unchanged
(both shapes serialize to identical WSMessage.Event literals); pinned
by TestAllEventTypes_IsSnapshot in #2965.

Migrated (18 files, scope: handlers/, scheduler/, registry/, bundle/,
channels/):
- handlers/{approvals,a2a_proxy_helpers,a2a_queue,activity,agent,
  delegation,external_rotate,org_import,registry,workspace,
  workspace_bootstrap,workspace_crud,workspace_provision_shared,
  workspace_restart}.go
- channels/manager.go (caught by hostile-reviewer pass — initial
  scope missed channels/, found via grep on the post-migration tree)
- scheduler/scheduler.go
- registry/provisiontimeout.go
- bundle/importer.go

Hostile self-review (3 weakest spots, addressed)
------------------------------------------------

1. Missed call sites — initial scope omitted channels/. Post-migration
   `grep -rEn 'BroadcastOnly\([^,]+,[^,]*"[A-Z_]+"|RecordAndBroadcast\([^,]+,[^,]*"[A-Z_]+"' internal/`
   found 2 stragglers in channels/manager.go. Migrated. Final grep
   on the same pattern returns only the docstring example in
   types.go (intentional).

2. gofmt drift — auto-import injection produced non-canonical import
   ordering. `gofmt -w` applied ONLY to the 18 modified files (NOT
   the whole tree, to avoid sweeping unrelated pre-existing drift
   into this PR's diff). Three pre-existing un-gofmt'd files in
   handlers/ (a2a_proxy.go, a2a_proxy_test.go, a2a_queue_test.go)
   left as-is — they're unchanged by this PR and their drift
   predates it.

3. Wire format — paranoia check: do the constants serialize to the
   exact strings consumers (canvas TS, hermes plugin, anything
   parsing WSMessage.Event) expect? Yes. Pinned by the snapshot
   test. The migration is name-only; not a single character of
   wire output changes.

Verified
- go build ./... clean
- go vet ./internal/... clean
- gofmt -l on the 5 migrated package dirs: only pre-existing files
- Full tests: handlers/, channels/, scheduler/, registry/, events/,
  bundle/ all green (5 ok, 0 fail)

PR-B-2 (canvas TS mirror + cross-language parity gate) remains as
the final piece of RFC #2945 PR-B. Tracked separately so this PR
stays mechanical + reviewable.

Refs RFC #2945, PR #2965 (PR-B types).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 19:05:03 -07:00
Hongming Wang
b6310d7ebf fix(memory-v2): namespace dropdown labels use display names not UUID prefixes (#2988)
User feedback on the v2 Memory tab redesign: on a root workspace, the
namespace dropdown showed three indistinguishable entries:
  Workspace (30ba7f0b)
  Team (30ba7f0b) (team)
  Org (30ba7f0b-b303-4a20-aefe-3a4a675b8aa4) (org)

For a root workspace, the resolver collapses workspace==team==org IDs
(resolver.go:113-122 derive() degenerate case). The previous
shortID(8)-truncated UUID label scheme made all three look identical
even though the three concepts (private / team-shared / org-wide)
remain semantically distinct.

## Backend — Resolver returns DisplayName

  - SQL chain query now SELECTs workspaces.name (COALESCE → "" on NULL)
  - chainNode carries .name through walk
  - deriveNames() computes the display name for each namespace,
    mirroring derive():
      workspace: self.name
      team:      parent.name (or self.name if root — degenerate)
      org:       chain[end].name (root of tree)
  - Namespace struct gets a new DisplayName field, omitempty wire-shape

## Backend — Handler renders label from DisplayName when present

  - memories_v2.go:namespaceLabelWithName(name, kind, displayName) is
    the new SSOT label generator. Falls back to the UUID-prefix shape
    when displayName is empty so callers without name plumbing keep
    working unchanged.
  - namespacesToViews now plumbs Namespace.DisplayName into the label.
  - Old namespaceLabel(name, kind) is preserved as a thin wrapper
    around namespaceLabelWithName(_, _, "") for back-compat.
  - Custom namespaces ignore displayName by design — operator-defined
    suffixes ARE the chosen label; a name override would surprise.

## Frontend — drop redundant `(kind)` suffix

  Pre-fix: "Team (mac laptop) (team)" — kind shown twice.
  Post-fix: "Team (mac laptop)" — the prefix already conveys the kind.

## Test coverage

Resolver (3 new tests):
  - DisplayName_Root: workspace name propagates to all 3 namespaces
  - DisplayName_Child: workspace=self.name, team=parent.name, org=root.name
  - DisplayName_EmptyOnNULL: COALESCE → "" → empty fallback

Handler (3 new tests):
  - NamespaceLabelWithName_PrefersDisplayName: workspace/team/org/custom paths
  - NamespaceLabelWithName_FallsBackToUUIDPrefix: empty displayName → legacy shape
  - NamespacesToViews_PassesDisplayNameThrough: full integration on root case

Canvas: existing 30 tests still pass; suffix drop is rendering-only.

memories_v2.go function coverage: **14/14 = 100%**
- namespaceLabelWithName: 100%
- namespacesToViews: 100%
- (all 11 pre-existing functions stay at 100%)

## SSOT

The "what is this namespace called" question now has one source of
truth: namespace.Resolver.ReadableNamespaces sets DisplayName from the
canonical workspace.name column. The handler is a renderer; the
canvas is a consumer. No name-lookup logic duplicated across the
three layers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-05 18:46:50 -07:00
Hongming Wang
f1dc721eeb
Merge pull request #2964 from Molecule-AI/fix/delegation-ledger-utf8-truncate-2962
fix(delegation_ledger): rune-safe preview truncation (#2962)
2026-05-05 23:34:57 +00:00
Hongming Wang
a5903af459 fix(delegation_ledger): rune-safe preview truncation (#2962)
The previous byte-slice form `s[:previewCap]` could split a multi-byte
codepoint at byte 4096, producing invalid UTF-8. Postgres JSONB rejects
the row → ledger insert silently fails → audit gap on dashboards while
activity_logs continues to record the event.

Walk the string by rune index and stop at the last boundary that fits
inside the cap. ASCII-only strings still hit the cap exactly; CJK/emoji
strings stop slightly under, never over.

Mirrors the truncatePreviewRunes fix shipped for agent_message_writer
in #2959. Follow-up: deduplicate into a shared helper once both have
landed.

Tests: 2 regression tests using utf8.ValidString — one with an all-3-byte
rune string just over the cap, one with a single multi-byte rune sitting
exactly on the boundary. Verified on the previous byte-slice impl: both
new tests would fail (invalid UTF-8 + truncation past cap by 1 byte).
2026-05-05 16:19:51 -07:00
Hongming Wang
5b78bea10d feat(events): typed EventType registry — single source of truth for WS event names (RFC #2945 PR-B)
Pre-RFC-#2945, every BroadcastOnly / RecordAndBroadcast call site
passed a bare string literal:

  h.broadcaster.BroadcastOnly(workspaceID, "AGENT_MESSAGE", payload)

29 producers (Go, ~30 call sites in handlers/, scheduler/, registry/,
bundle/) and ~30 canvas consumers (TS store + listeners) duplicated
the same string with no shared definition. A producer renaming an
event silently broke every consumer — same drift class that produced
the reno-stars data-loss regression on the persistence side. PR-A
fixed the persistence-side SSOT (AgentMessageWriter); PR-B fixes the
event-name SSOT.

What this PR ships

  internal/events/types.go
    - EventType typed string + 29 named constants covering the full
      taxonomy (chat / lifecycle / agent assignment / delegation /
      task / approval / auth).
    - Grouped semantically; new constants must be added here AND
      mirrored in canvas/src/lib/ws-events.ts (parity gate landing
      in PR-B-2 follow-up).
    - AllEventTypes slice — authoritative list for the snapshot
      test + the cross-language parity gate.

  internal/events/types_test.go (3 tests)
    - TestAllEventTypes_IsSnapshot: pins the canonical list. Adding
      a new constant without updating AllEventTypes (or vice versa)
      fails with a one-line diff.
    - TestEventType_NoEmptyConstants: catches accidentally-empty
      values (typo in types.go: const X EventType = ...).
    - TestEventType_AllUppercaseSnakeCase: pins the wire format that
      canvas TS switch statements assume (no kebab-case, no mixed
      case, no leading/trailing/double underscores).
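
The wire-format constraint the last test pins could be expressed as a regexp like this (a sketch only — the shipped test may check differently): one or more A-Z runs joined by single underscores.

```go
package main

import (
	"fmt"
	"regexp"
)

// upperSnakeRE sketches the shape TestEventType_AllUppercaseSnakeCase
// pins: no kebab-case, no mixed case, no edge or double underscores.
var upperSnakeRE = regexp.MustCompile(`^[A-Z]+(?:_[A-Z]+)*$`)

func main() {
	for _, name := range []string{"AGENT_MESSAGE", "agent-message", "_TASK", "A__B"} {
		fmt.Println(name, upperSnakeRE.MatchString(name))
	}
}
```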

  agent_message_writer.go (single migration)
    - Demonstrates the constant-usage shape:
        events.EventAgentMessage  →  "AGENT_MESSAGE"
    - Other ~30 call sites stay on bare strings for now (keeping this
      PR narrow); the migration happens in PR-B-1 follow-up. Both
      shapes (constant + bare string) co-exist on the wire — the
      typed version is just the recommended path for new code.
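
The registry shape can be sketched as follows (EventAgentMessage matches the migrated call site; the second constant is a made-up illustration, not part of the real taxonomy):

```go
package main

import "fmt"

// EventType mirrors the typed-string shape described above.
type EventType string

const (
	EventAgentMessage EventType = "AGENT_MESSAGE"
	EventExampleOnly  EventType = "EXAMPLE_ONLY" // hypothetical, for illustration
)

// AllEventTypes is the authoritative list the snapshot test pins: adding a
// constant without updating this slice (or vice versa) fails the test.
var AllEventTypes = []EventType{EventAgentMessage, EventExampleOnly}

func main() {
	// Both shapes serialize identically on the wire.
	fmt.Println(string(EventAgentMessage) == "AGENT_MESSAGE")
}
```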

Why ship this in stages

  1. PR-B (this): types + tests + first migration → MERGEABLE NOW,
     low risk.
  2. PR-B-1 (follow-up): migrate the remaining ~30 call sites to
     constants. Mechanical, low-risk.
  3. PR-B-2 (follow-up): canvas/src/lib/ws-events.ts mirror + cross-
     language parity gate. Touches both repos.

Per memory feedback_oss_design_philosophy.md (every refactor toward
OSS plugin shape) — this surface is now plugin-safe: external
implementations can import the events package and get the same
named taxonomy without copying strings.

Verified
- go vet ./internal/events/ clean
- go build ./... clean
- TestAllEventTypes_IsSnapshot + TestEventType_* all pass
- TestAgentMessageWriter_* (the only call site touched) still green

Refs RFC #2945, PR #2949 (PR-A SSOT), PR #2944 (reno-stars).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:25:38 -07:00
Hongming Wang
07d09f3696
Merge pull request #2959 from Molecule-AI/rfc-2945-pr-a-followup-utf8-and-db-errors
fix(handlers): UTF-8-safe preview truncation + distinguish DB errors from not-found (PR-A followup)
2026-05-05 16:19:29 -07:00
Hongming Wang
feef80423b
Merge pull request #2958 from Molecule-AI/fix/external-connect-templates-mcp-command
fix(external-connect): use molecule-mcp wrapper in Codex/OpenClaw templates (#2957)
2026-05-05 16:18:23 -07:00
Hongming Wang
1e01083e55 fix(handlers): UTF-8-safe preview truncation + distinguish DB errors from not-found (RFC #2945 PR-A followup)
Self-review of PR #2949 surfaced two pre-existing defects that the
SSOT consolidation inherited from the original /notify handler. Both
are addressable in a small follow-up; shipping them as a separate PR
keeps the consolidation and the bug-fix individually reviewable.

Critical: byte-slice preview truncation produces invalid UTF-8
-------------------------------------------------------------

Pre-fix:

    if len(preview) > 80 {
        preview = preview[:80] + "…"
    }

`len()` returns BYTES; `preview[:80]` slices on a byte boundary. For
agent-authored chat in CJK / emoji / accented characters, byte 80
lands mid-codepoint → invalid UTF-8 → Postgres JSONB rejects → INSERT
fails → activity_log row never written → message vanishes from chat
history on the next reload. The persistence-failure log fires but
operators have to grep to find it, and the user-visible regression
mode is identical to reno-stars.

Fix: extract `truncatePreviewRunes(s, maxRunes)` that walks the rune
boundary using `for i := range s` (Go's range over string yields rune
start indices). Cap at 80 RUNES not bytes — UI-friendly count, not
storage count.

Important: workspace-lookup error path swallows real DB errors
--------------------------------------------------------------

Pre-fix:

    if err := w.db.QueryRowContext(...).Scan(&wsName); err != nil {
        return ErrWorkspaceNotFound
    }

Conflates `sql.ErrNoRows` (legit not-found → caller 404) with real
DB errors (connection drop, query timeout, pool exhaustion → caller
should 503). During a Postgres outage every notify call surfaced as
"workspace not found" — masking the actual incident in alerting and
making the symptom indistinguishable from "you typed a bad workspace
ID".

Fix: distinguish via `errors.Is(err, sql.ErrNoRows)` and wrap
non-not-found errors with `fmt.Errorf("agent_message: workspace
lookup: %w", err)`. Callers' existing fallback path (return 500 /
return error wrapped) handles the new shape correctly without any
changes — verified by running existing TestNotify_* and
TestMCPHandler_SendMessage_* tests.

Tests added (3 new, 11 total writer tests)
------------------------------------------

- TestTruncatePreviewRunes_RuneBoundary: 8-case table — ASCII, CJK,
  exactly-at-max, emoji prefix. Asserts both correct visible output
  AND `utf8.ValidString` on every result so the bug shape (invalid
  UTF-8) can't recur.

- TestAgentMessageWriter_Send_NonASCIIMessagePersists: end-to-end
  with a 200-rune CJK message (exceeds the 80-rune cap, would have
  hit the byte-slice bug). Pins the INSERT summary contains valid
  UTF-8 with exactly 80-rune body + ellipsis.

- TestAgentMessageWriter_Send_DBErrorOnLookupReturnsWrapped: pins the
  DB-outage path returns a wrapped non-ErrWorkspaceNotFound error so
  alerting can distinguish 404 from 503. Verified via mock
  ExpectQuery returning a transient error.

Verified
--------

- `go vet ./internal/handlers/` clean
- `go build ./...` clean
- All 14 writer + caller tests pass (8 original + 3 new + AST gate +
  TestNotify_* + TestMCPHandler_SendMessage_* sibling tests)

Per memory feedback_assert_exact_not_substring.md: every new test
asserts boundary behavior directly (UTF-8 validity, exact rune count,
errors.Is comparison) rather than substring-match in stringified
output.

Refs RFC #2945, PR #2949, PR #2944.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:10:58 -07:00
Hongming Wang
eab36e217e fix(external-connect): use molecule-mcp wrapper in Codex/OpenClaw templates (#2957)
The External Connect modal's Codex and OpenClaw tabs were rendering
this MCP server config:

  command = "python3"
  args = ["-m", "molecule_runtime.a2a_mcp_server"]

That spawns the bare MCP dispatcher with no presence wiring. The
``molecule-mcp`` console-script wrapper (mcp_cli.main) is what calls
``POST /registry/register`` at startup and runs the 20s heartbeat
thread alongside the MCP stdio loop. Without the wrapper, the canvas
flips the workspace back to ``awaiting_agent`` (OFFLINE) within
60-90s — even while tools work — because nothing is heartbeating.

Operator-side this looks like: the workspace is registered and tools
work fine when invoked, but the canvas shows "offline" / "Restart"
CTA, peer agents see the workspace as awaiting_agent in list_peers
output, and inbound A2A delivery silently fails the readiness check.
A new external-Codex operator (#2957) hit this and spent debugging
time on what should have been a copy-paste install.

Fix: switch both Codex and OpenClaw templates to
``command = "molecule-mcp"`` / ``args = []``, matching the universal
MCP template that already handles this correctly. Inline comment in
each template explains the wrapper-vs-bare-module tradeoff so a
future template author doesn't regress to the shorter form.

Hermes-channel intentionally still spawns the bare module — the
hermes plugin owns the platform plugin path and runs its own
register_platform/heartbeat code in-process; double-heartbeating
would race. Universal/Codex/OpenClaw all need the wrapper.

Regression gate: TestExternalMcpTemplates_UseMoleculeMcpWrapper
asserts the three templates that must use the wrapper actually do,
and explicitly fails on the old ``-m molecule_runtime.a2a_mcp_server``
shape. Verified the test FAILS on pre-fix source by stashing only
external_connection.go and re-running.

Source: molecule-core#2957 issue 1 (item 4 of the report — the
``(codex returned empty output)`` / opaque-canvas-error / stale-
session items live in codex-channel-molecule and are tracked
separately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 16:06:02 -07:00
Hongming Wang
decec9b9a1
Merge pull request #2956 from Molecule-AI/feat/memory-tab-v2-redesign
feat(memory): redesign Memory tab for v2 plugin
2026-05-05 22:56:55 +00:00
Hongming Wang
f0f4d0e761 feat(memory): redesign Memory tab for v2 plugin
Replaces the v1 LOCAL/TEAM/GLOBAL tab trio (mapped to the deprecated
shared_context model) with a v2 plugin-driven UI. Without this,
canvas Memory tab was reading the frozen agent_memories table while
all post-cutover agent writes went to the plugin's memory_records —
the tab silently displayed stale data.

## Backend (workspace-server)

New routes under wsAuth, all behind the existing per-tenant token:

  GET    /workspaces/:id/v2/namespaces      → readable + writable lists
  GET    /workspaces/:id/v2/memories        → plugin search proxy
  DELETE /workspaces/:id/v2/memories/:mid   → plugin forget proxy

memories_v2.go — slim handler:
  - Server-side ACL: every search request is intersected with the
    resolver's readable-namespaces set (canvas-supplied namespace
    that the workspace can't read returns [] not 403, matches v1
    existence-non-inferring shape).
  - Returns 503 with "set MEMORY_PLUGIN_URL" hint when plugin
    isn't wired (canvas surfaces a banner).
  - Maps plugin not_found → 404, other plugin errors → 502.
  - View shaping: NamespaceView.label rendered server-side
    ("Workspace (abc-1234)", "Team (t-99)", "Org (acme)", custom)
    so canvas doesn't parse namespace names. MemoryView surfaces
    pin/expires_at/score/source_workspace_id from Propagation.
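
The server-side ACL intersection could look like this (function name assumed): requested namespaces are filtered against the resolver's readable set, and an unreadable namespace yields empty results rather than a 403, so existence can't be inferred.

```go
package main

import "fmt"

// intersectReadable keeps only namespaces the workspace can read.
func intersectReadable(requested []string, readable map[string]bool) []string {
	out := []string{} // empty list, not nil: JSON wire shape stays []
	for _, ns := range requested {
		if readable[ns] {
			out = append(out, ns)
		}
	}
	return out
}

func main() {
	readable := map[string]bool{"workspace:ws-1": true, "team:t-1": true}
	fmt.Println(intersectReadable([]string{"workspace:ws-1", "org:other"}, readable))
}
```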

memories_v2_test.go — 100% line + 100% function coverage:
  - 503 path on every endpoint when unwired
  - Namespaces success + readable/writable error paths
  - Search: empty intersection, full-path query/kind/limit
    propagation, namespace=/no-namespace branches, propagation
    map missing/wrong-type, intersect error, plugin error
  - Forget: success, plugin not_found→404, other plugin
    errors→502, missing memoryId→400
  - Helpers: namespaceLabel for all 4 kinds + truncation,
    parseLimit edge cases (default/0/negative/over-cap/non-num),
    memoryToView field round-trip, indexOfColon, shortID

## Frontend (canvas)

MemoryInspectorPanel rewritten for v2:
  - Drop LOCAL/TEAM/GLOBAL trio. Namespace dropdown driven by
    GET /v2/namespaces.readable, "All namespaces" default.
  - New per-row badges: kind (F/S/C), source (agent/runtime/user),
    pin (📌), TTL countdown (12h / "expired"), score% on
    semantic search, source-workspace ⇡ws-pee for propagated.
  - Drop Edit button — v2 plugin contract has no PATCH; the
    model is forget + recommit. Forget stays.
  - Plugin-unavailable banner with operator hint when /v2/*
    returns 503.
  - Bug fix surfaced by test: rollback-on-failed-delete order
    of operations (loadEntries() called setError(null) AFTER
    we set the failure message, wiping it). Reload first, then
    set the error.

MemoryEditorDialog deleted — Add was POST /memories which v2
doesn't support from canvas (writes go via MCP). The legacy
Edit-flow tests go with it.

## Test results

Backend: `go test ./internal/handlers/` — all pass
Backend coverage on memories_v2.go: 100% lines, 100% functions
Canvas: `vitest run` — 91 files, 1273 tests pass (26 new)
Canvas coverage on MemoryInspectorPanel.tsx: 100% lines,
  100% functions, 96.7% statements, 84.7% branches
  (uncovered branches are defensive `?? fallback` for
   contract-impossible kind/source values)

## Migration note

The legacy v1 GET/POST/PATCH/DELETE on /workspaces/:id/memories
remains in place for the back-compat MCP shim (mcp_tools_memory_v2's
legacy routing) and admin export/import. PR-9 (#283) drops
agent_memories along with the v1 endpoints once the cutover
verification window closes.
2026-05-05 15:53:28 -07:00
Hongming Wang
d99b3f2aec refactor(handlers): consolidate Notify + MCP send_message_to_user through AgentMessageWriter (RFC #2945 PR-A)
Pre-RFC-#2945 the broadcast + activity_log INSERT for "agent → user
chat" was duplicated across two handlers — activity.go's Notify (HTTP
/notify) and mcp_tools.go's toolSendMessageToUser (MCP tools/call).
The duplication is exactly what produced the reno-stars production
data-loss regression (PR #2944): the persistence-half fix landed for
one handler and silently lagged for the other for months, dropping
every long-form external-agent message on reload.

PR #2944 added the missing INSERT to mcp_tools.go and a forward-
looking AST gate. This PR removes the duplication at the source.

What changes
------------

NEW: workspace-server/internal/handlers/agent_message_writer.go
- AgentMessageWriter struct + NewAgentMessageWriter ctor.
- Send(ctx, workspaceID, message, attachments) error: workspace
  lookup → broadcast WS AGENT_MESSAGE → INSERT activity_logs.
- ErrWorkspaceNotFound for the lookup-miss path so callers can
  return 404 / JSON-RPC error cleanly.
- Best-effort persistence: INSERT failure logs only, returns nil so
  the broadcast success isn't undone (matches previous behavior in
  both call sites — pinned by test).
- Takes events.EventEmitter (interface) so tests can substitute a
  capturing fake without nil-panicking inside hub.Broadcast.
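A rough sketch of the writer's shape under the contract above. The interface names, the preview helper, and the 80-char constant are illustrative; the real agent_message_writer.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrWorkspaceNotFound mirrors the sentinel described above so HTTP and
// JSON-RPC callers can map a lookup miss to 404 / a protocol error.
var ErrWorkspaceNotFound = errors.New("workspace not found")

// EventEmitter is the injectable seam: tests substitute a capturing fake.
type EventEmitter interface {
	BroadcastOnly(workspaceID, event string, payload any)
}

// Store abstracts the workspace lookup and the activity_logs INSERT.
type Store interface {
	WorkspaceName(id string) (string, error)
	InsertActivityLog(workspaceID, summary, body string) error
}

type AgentMessageWriter struct {
	emitter EventEmitter
	store   Store
}

// Send: lookup → broadcast → best-effort INSERT.
func (w *AgentMessageWriter) Send(workspaceID, message string) error {
	if _, err := w.store.WorkspaceName(workspaceID); err != nil {
		return ErrWorkspaceNotFound // lookup miss: no broadcast, no INSERT
	}
	w.emitter.BroadcastOnly(workspaceID, "AGENT_MESSAGE", map[string]string{"message": message})
	// Best-effort persistence: the broadcast already reached live clients,
	// so an INSERT failure is logged but does not fail the call.
	if err := w.store.InsertActivityLog(workspaceID, preview(message), message); err != nil {
		log.Printf("agent message persist failed (ws=%s): %v", workspaceID, err)
	}
	return nil
}

// preview truncates the summary to roughly 80 chars plus an ellipsis.
func preview(s string) string {
	if len(s) <= 80 {
		return s
	}
	return s[:80] + "…"
}

func main() { fmt.Println(preview("short message")) }
```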

UPDATED: activity.go:Notify
- Replaced ~75 lines of inline broadcast+INSERT with a 12-line
  call to AgentMessageWriter.Send.
- Attachment shape conversion (NotifyAttachment → AgentMessageAttachment)
  is local to the HTTP handler; the writer's API doesn't import the
  HTTP-binding-tagged type.

UPDATED: mcp_tools.go:toolSendMessageToUser
- Replaced ~40 lines (the post-#2944 broadcast+INSERT pair) with a
  6-line call to the writer.
- Attachments is nil today because the MCP tool args don't expose
  attachments yet. When the schema adds it, build the slice and
  pass through; the writer half is ready.

Tests
-----

agent_message_writer_test.go (7 tests, comprehensive):
- TestAgentMessageWriter_Send_Success_NoAttachments — happy path,
  pins JSON `{"result":"hi"}`.
- TestAgentMessageWriter_Send_Success_WithAttachments — pins file
  parts shape (kind=file, file.{uri,name,mimeType,size}). Uses a
  jsonMatcher that decodes + asserts via predicate (tolerant of
  map key ordering, exact on shape).
- TestAgentMessageWriter_Send_WorkspaceNotFound — pins
  ErrWorkspaceNotFound + asserts NO broadcast NO INSERT.
- TestAgentMessageWriter_Send_DBInsertFailureStillReturnsNil — pins
  best-effort persistence contract.
- TestAgentMessageWriter_Send_PreviewTruncation — pins ≤80-char
  preview + ellipsis (Ryan's onboarding-friction report would have
  bloated activity_logs.summary by 2KB without this).
- TestAgentMessageWriter_Send_BroadcastsAgentMessageEvent — pins WS
  event name + payload shape via capturingEmitter.
- TestAgentMessageWriter_Send_OmitsAttachmentsKeyWhenEmpty — pins
  the "no key when nil" wire contract.

The existing AST gate from #2944
(TestAgentMessageBroadcastsArePersisted) still holds: any future
function emitting AGENT_MESSAGE without an INSERT fails the test.
With the writer in place that's now redundant — both producers go
through it — but the gate is cheap to keep as defense-in-depth.

Verified: go vet clean; all writer + caller tests pass; existing
TestNotify_* + TestMCPHandler_SendMessage_* + the AST gate all green.

Refs RFC #2945, PR #2944.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 15:29:42 -07:00
Hongming Wang
899c53550d test(mcp): comprehensive coverage for send_message_to_user persistence + AST gate (reno-stars followup)
Per user request: audit all similar tools + write comprehensive tests
including E2E for the persistence-of-AGENT_MESSAGE-broadcasts contract.

Audit (all BroadcastOnly call sites in workspace-server/internal/):

  | Site | Event | Persisted? | Notes |
  |---|---|:---:|---|
  | a2a_proxy_helpers.go:275 | A2A_RESPONSE | ✓ | LogActivity above |
  | activity.go:486 (Notify) | AGENT_MESSAGE | ✓ | INSERT line 535 |
  | activity.go:701 (LogActivity) | ACTIVITY_LOGGED | ✓ | self-emits inside DB write |
  | mcp_tools.go:341 (toolSendMessageToUser) | AGENT_MESSAGE | ✓ | NEW (this PR) |
  | registry.go:575 | TASK_UPDATED | N/A | transient progress, not chat |
  | registry.go:596 | WORKSPACE_HEARTBEAT | N/A | infra ping, not chat |

Only one chat-bearing broadcast was missing persistence (the just-
fixed mcp bridge path). No other regressions found.

Tests added (4 new, total 5 send_message_to_user tests):

1. TestAgentMessageBroadcastsArePersisted — AST gate that walks every
   non-test .go in the package, finds funcs that BroadcastOnly with
   "AGENT_MESSAGE", asserts each ALSO contains an
   "INSERT INTO activity_logs". Forward-looking regression block:
   any future chat tool that broadcasts without persisting fails the
   test with a clear file:func diagnostic. Mutation-tested locally:
   removing the INSERT block from toolSendMessageToUser reliably
   produces the expected failure.

2. TestMCPHandler_SendMessageToUser_DBErrorLogsAndStill200s — pins
   the "best-effort persistence" contract. DB INSERT failures must
   NOT abort the tool response (the WS broadcast already succeeded;
   retrying would double-render in the live chat). Matches /notify.

3. TestMCPHandler_SendMessageToUser_ResponseBodyShape — pins the
   exact `{"result": "<message>"}` JSON shape stored in
   response_body. The canvas hydrater (extractResponseText in
   historyHydration.ts) reads body.result; any drift here silently
   breaks chat history without failing the INSERT. Per memory
   feedback_assert_exact_not_substring.md, asserts the literal JSON
   shape, not a substring.

4. TestMCPHandler_SendMessageToUser_PersistsToActivityLog (existing,
   from previous commit) — pins INSERT shape with regex on
   'a2a_receive' + 'notify' literals.

5. TestMCPHandler_SendMessageToUser_Blocked_WhenEnvNotSet (existing)
   — env-gate aborts before DB.
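A stripped-down sketch of the AST-gate idea in test 1, using go/ast to flag functions that mention AGENT_MESSAGE without a paired INSERT. This version matches string literals only; the real gate is assumed to be more precise about call shapes.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// findUnpersistedBroadcasts walks every function in src: if a body contains
// an "AGENT_MESSAGE" string literal, it must ALSO contain an
// "INSERT INTO activity_logs" literal. Returns offending function names.
func findUnpersistedBroadcasts(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil, err
	}
	var bad []string
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue
		}
		broadcasts, persists := false, false
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if lit, ok := n.(*ast.BasicLit); ok && lit.Kind == token.STRING {
				if strings.Contains(lit.Value, "AGENT_MESSAGE") {
					broadcasts = true
				}
				if strings.Contains(lit.Value, "INSERT INTO activity_logs") {
					persists = true
				}
			}
			return true
		})
		if broadcasts && !persists {
			bad = append(bad, fn.Name.Name)
		}
	}
	return bad, nil
}

func main() {
	src := `package h
func good() { _ = "AGENT_MESSAGE"; _ = "INSERT INTO activity_logs ..." }
func leaky() { _ = "AGENT_MESSAGE" }`
	bad, _ := findUnpersistedBroadcasts(src)
	fmt.Println(bad) // only the function missing the INSERT is reported
}
```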

Test fixture cleanup: newMCPHandler now uses newTestBroadcaster (real
ws.Hub) instead of events.NewBroadcaster(nil) — the latter nil-panics
inside hub.Broadcast on the AGENT_MESSAGE path. Same broadcaster
shape every other handler test uses.

E2E note: the AST gate is the strongest forward-looking guarantee.
A real-DB integration test would add value for CI but is largely
duplicative of the sqlmock contract tests above (sqlmock pins SQL
shape with much faster feedback). Left as a future enhancement when
the handlers Postgres-integration suite extends MCP coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:52:32 -07:00
Hongming Wang
cdfc9f743f fix(mcp): persist send_message_to_user pushes to activity_log (reno-stars data loss)
Reported on production tenant reno-stars: an external claude-code agent
(CEO Ryan PC workspace) sent a long-form message via send_message_to_user;
the user saw it live in the chat panel but it vanished after a refresh.
Confirmed via direct production query — the message is NOT in
activity_logs at all (only short test pings around it are persisted).

Root cause: there are TWO server-side handlers for send_message_to_user:

  1. HTTP `/workspaces/:id/notify` (activity.go:Notify) — broadcasts WS
     AND inserts a row into activity_logs. This is the path the
     in-container runtime's tool_send_message_to_user calls.

  2. MCP-bridge `tools/call name=send_message_to_user`
     (mcp_tools.go:toolSendMessageToUser) — broadcasts WS only,
     **never persisted**. This is the path EXTERNAL agents using
     molecule-mcp's send_message_to_user tool route through.

The persistence fix landed for path 1 months ago but was never mirrored
on path 2. External agents — exactly the case in reno-stars/CEO Ryan PC
— have been silently losing every long-form notification on reload.

Fix: mirror the activity.go INSERT shape inside toolSendMessageToUser:

  INSERT INTO activity_logs
    (workspace_id, activity_type, method, summary, response_body, status)
  VALUES ($1, 'a2a_receive', 'notify', $2, $3::jsonb, 'ok')

Same wire shape as /notify so the canvas's chat-history hydration
(`type=a2a_receive&source=canvas`) treats both writers identically.
Errors are log-only — broadcast already succeeded, persistence failure
shouldn't block the tool response (matches /notify behavior; downside
is the same data-loss-on-DB-error risk, surfaced via log.Printf).

Tests
-----

- `TestMCPHandler_SendMessageToUser_PersistsToActivityLog` — pins both
  the workspace-name lookup AND the INSERT shape. Regex-matches
  `'a2a_receive'` + `'notify'` literals so a future refactor that
  changes activity_type or method breaks the test loud, not silently
  re-introducing the data-loss bug.
- Updated newMCPHandler to use newTestBroadcaster() (real ws.Hub) —
  events.NewBroadcaster(nil) crashes inside hub.Broadcast in the
  send_message_to_user path. Same shape every other handler test uses.

Verified `go test ./internal/handlers/ -run TestMCPHandler_SendMessage`
green; full vet clean.

Refs reno-stars production incident 2026-05-05.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 14:47:48 -07:00
Hongming Wang
f3782662bd refactor(external-connect): embed help in agent paste, fix wrong docs hostname
Two related fixes to the Connect-External-Agent flow that the user
flagged: the "Need help?" disclosure block in the modal is for the
operator's eyes only — but the agent reading the pasted snippet has
no access to that context. And the docs URL was pointing at a
hostname that doesn't resolve.

User-visible problems:
1. The agent doesn't see the install link, docs link, or the common-
   error/check pairs that the human pasted. When the agent fails to
   register or hits ConnectionRefused, it can't self-diagnose because
   the troubleshooting context lives in a separate UI block.
2. https://docs.molecule.ai → DNS NXDOMAIN. Every "Documentation"
   link in the modal was a dead link.

## Fixes

### Move help INTO the snippet (not a separate human-only UI block)
Each of the 7 server-rendered templates in
`workspace-server/internal/handlers/external_connection.go` now
appends a `# Need help?` section with: install link, correct docs
link, and the top common errors as `# • symptom — check` pairs.

Templates updated: curl / channel (Claude Code) / mcp (Universal MCP) /
python / hermes / codex / openclaw. Agents reading the paste now have
the same diagnostic context the human did.

### Drop the duplicated UI block in the canvas modal
`canvas/src/components/ExternalConnectModal.tsx`:
- Removed the `TAB_HELP` per-tab metadata constant (152 lines).
- Removed the `HelpBlock` component (62 lines).
- Removed the `<HelpBlock help={TAB_HELP[tab]} />` render call.

The snippet is now the single source of truth for tab-level help.

### Fix the wrong docs hostname
The actual docs site is `doc.moleculesai.app` (singular `doc`,
`.app` not `.ai`), confirmed by:
- `package.json` description in `Molecule-AI/docs` repo →
  "Molecule AI documentation site — doc.moleculesai.app"
- HTTP HEAD on the new URL → 200 for both
  `/docs/guides/mcp-server-setup` and
  `/docs/guides/external-agent-registration`
- HTTP HEAD on old `docs.molecule.ai` → 000 (NXDOMAIN)

All template docs URLs now point at `doc.moleculesai.app`.

## Verification
- `go build ./...` clean
- `go test ./internal/handlers/... -count=1` green
- `pnpm test` → 1291/1291 pass (unchanged)
- `tsc --noEmit` clean
- 219 LOC removed (canvas duplicate UI), 69 LOC added (snippet help)
- Net `-150 LOC` while gaining the agent-readable help

## Out of scope (deferred, captured in followups)
- One blog post still has `canonical: "https://docs.molecule.ai/blog/..."`
  in `src/app/blog/2026-04-20-chrome-devtools-mcp/page.mdx` — separate
  blog-content fix.
- Comment in `theme-provider.tsx` references `docs.moleculesai.app`
  (with `s`) — comment-only, not a runtime URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:51:35 -07:00
Hongming Wang
cb70d3d437 docs: callout Python>=3.11 requirement on Universal MCP install snippet
User-reported friction: pip install molecule-ai-workspace-runtime on a
3.10 interpreter fails with "Could not find a version that satisfies the
requirement (from versions: none)" — pip's requires_python filter
silently drops the only available artifact before attempting install,
so the error doesn't mention Python at all. Operators see
"package missing", file a bug, and chase a phantom CDN/visibility
issue.

Two changes mirror the requirement at the two operator-touch surfaces:

1. workspace-server/internal/handlers/external_connection.go:
   the externalUniversalMcpTemplate snippet (rendered into the
   canvas Connect-External-Agent modal) now leads with a brief
   "Requires Python >= 3.11" block + diagnostic + upgrade paths.

2. docs/workspace-runtime-package.md: same callout at the top of
   the doc, before the Overview, so anyone landing here from search
   gets the answer immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:44:25 -07:00
Hongming Wang
423d58d42c fix(org-import): polish — wrap-safe ErrNoRows, bounded lookup, godoc
Three small hardening passes from #2872's optional/important findings,
batched into one polish PR:

1. errors.Is(err, sql.ErrNoRows) instead of err == sql.ErrNoRows.
   The bare equality breaks if any future caller wraps the error via
   fmt.Errorf("…: %w", err) — the no-rows happy path would fall
   through to the "real DB error" branch and abort the import.
   errors.Is unwraps. New test
   TestLookupExistingChild_WrappedNoRows_TreatedAsNotFound pins the
   fix; verified the test fails on the old `==` shape (build break
   on unused-import + assertion failure once import dropped).

2. Bounded 5s timeout on lookupExistingChild instead of
   context.Background().
   The createWorkspaceTree call site runs in goroutines spawned from
   the /org/import handler, so plumbing the request context here
   would cascade-cancel into provisionWorkspaceAuto and abort
   in-flight EC2 provisioning if the client disconnected mid-import
   — that's the wrong tradeoff. A short bounded timeout protects the
   per-row SELECT against a wedged DB without taking the
   drop-everything-on-disconnect behaviour. The lookup is a single
   ~10ms query; 5s leaves 500x headroom for transient slow paths.

3. Godoc clarifications on the skip-path block.
   - /org/import is ADDITIVE-ONLY, never destructive. Children
     present in the existing tree but absent from the new template
     are preserved (no DELETE on diff).
   - Skip-path does NOT propagate updates to existing nodes — a
     re-import that adds an initial_memory or schedule to an
     existing workspace is silently dropped. Document the limitation
     so future operators know to delete-and-re-import or reach for
     a future /org/sync route.

Verification:
  - go build ./... → clean
  - go test ./internal/handlers/... → all passing (TestLookup* +
    TestCreateWorkspaceTree* + TestClass1* + TestGate*)
  - 4 lookup tests + 1 new wrap-safety test → 5/5 PASS
  - Full handlers suite → green

Refs molecule-core#2872 (Optional findings — wrap-safety + ctx, godoc
clarifications for additive-only + skip-path-update-limitation)

Out of scope (deferred):
  - PR-D partial unique index migration + ON CONFLICT — sequenced
    after Phase 4 cleanup verified clean per #2872 plan
  - PR-E full createWorkspaceTree integration test for partial-match
    — needs heavier sqlmock scaffolding for downstream
    workspaces_audit/canvas_layouts/secrets/channels INSERTs;
    follow-up

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:20:54 -07:00
Hongming Wang
60afcd43c9 test(handlers): generic Class 1 leak AST gate (#2867 PR-A)
Adds class1_ast_gate_test.go — a per-package AST walk that fails the
build if any handler function INSERTs INTO workspaces inside a range
loop body without one of three escape hatches:

  1. A call to a registered preflight helper (lookupExistingChild today;
     extend preflightCallNames as new helpers are introduced).
  2. An ON CONFLICT clause in the same SQL literal (idempotent UPSERT,
     like registry.go).
  3. An explicit `// class1-gate: idempotent-by-design` comment in the
     function body (deliberately awkward — forces a code-review beat).

Why this is broader than the existing
TestCreateWorkspaceTree_CallsLookupBeforeInsert gate in
org_import_idempotency_test.go: that one is hard-coded to one function
in one file. This one walks every non-test .go file in the handlers
package and applies a structural rule independent of file/function
names. A future handler written from scratch in a new file would not
have been covered before — now it is.

Detection mechanism (per AST):
  - Collect spans (Lbrace..Rbrace) of every RangeStmt body in each
    function. Position-based instead of stack-based — ast.Inspect's
    nil-callback ordering doesn't give per-node pop semantics, so a
    naive push/pop stack silently miscounts. Position spans are
    deterministic.
  - Walk every BasicLit, regex-match `^\s*INSERT INTO workspaces\(`
    (tightened from bytes.Index "INSERT INTO workspaces" so
    workspaces_audit literals don't false-positive — same regex used
    by the existing createWorkspaceTree gate).
  - For each match: record insertLine, hasONCONFLICT, and the
    innermost enclosing RangeStmt line (or 0 if not inside any range).
  - Fail the function if INSERT is inside a range AND no preflight
    AND no ON CONFLICT AND no allowlist annotation.
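The position-span mechanism can be sketched as follows. This is a simplified sketch: one regex, no preflight or allowlist handling, and it reports matching INSERT literals that fall inside any range body.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"regexp"
	"strings"
)

var insertRe = regexp.MustCompile(`^\s*INSERT INTO workspaces\s*\(`)

// insideRangeInserts records the Lbrace..Rbrace span of every range body,
// then checks whether an INSERT-INTO-workspaces string literal's position
// falls inside one. Position spans are deterministic, unlike a push/pop
// stack over ast.Inspect's nil-callback ordering.
func insideRangeInserts(src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil, err
	}
	type span struct{ lo, hi token.Pos }
	var spans []span
	ast.Inspect(file, func(n ast.Node) bool {
		if r, ok := n.(*ast.RangeStmt); ok && r.Body != nil {
			spans = append(spans, span{r.Body.Lbrace, r.Body.Rbrace})
		}
		return true
	})
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		lit, ok := n.(*ast.BasicLit)
		if !ok || lit.Kind != token.STRING {
			return true
		}
		if !insertRe.MatchString(strings.Trim(lit.Value, "`\"")) {
			return true // workspaces_audit etc. don't false-positive
		}
		for _, s := range spans {
			if lit.Pos() > s.lo && lit.Pos() < s.hi {
				lines = append(lines, fset.Position(lit.Pos()).Line)
				break
			}
		}
		return true
	})
	return lines, nil
}

func main() {
	src := "package h\nfunc f(cs []string) {\n\tfor _, c := range cs {\n\t\t_ = c\n\t\t_ = `INSERT INTO workspaces (id) VALUES ($1)`\n\t}\n}\n"
	lines, _ := insideRangeInserts(src)
	fmt.Println(lines) // line numbers of in-range INSERTs
}
```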

Self-tests (per `feedback_assert_exact_not_substring.md` —
verify gate fails on the bug shape before merging):
  - TestClass1_GateFiresOnSyntheticBuggySource: synthetic source
    where INSERT is inside `for _, child := range children` body
    must trigger the gate's three guards (enclosingRangeLine!=0,
    hasONCONFLICT=false, no preflight call).
  - TestClass1_GateAllowsONCONFLICT: synthetic INSERT...ON CONFLICT
    must NOT trigger the gate (idempotent UPSERT case).
  - TestClass1_GateAllowsAllowlistAnnotation: function with
    `// class1-gate: idempotent-by-design` must be skipped.
  - TestClass1_NoUnpreflightedInsertInsideRange: production sweep
    over every handler .go file. Currently passes because
    org_import.go preflights, registry.go ON-CONFLICTs, and
    workspace.go's Create has no INSERT inside a range body.

Verification:
  - go test ./internal/handlers/... -run TestClass1_ -count=1
    → 4/4 PASS
  - go test ./internal/handlers/... -count=1 → suite green
    (no pre-existing test broken by the new file)

Refs molecule-core#2867 (PR-A Class 1 generic AST gate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 13:01:34 -07:00
Hongming Wang
83454e5efd feat(workspace-server): structured logging at provisioning boundaries
Adds internal/provlog with a single Event(name, fields) helper that
emits JSON-tagged single-line records to the standard logger. Five
boundary sites instrumented for #2867:

  provision.start         — workspace_dispatchers.go (sync + async)
  provision.skip_existing — org_import.go idempotency hit
  provision.ec2_started   — cp_provisioner.go after RunInstances
  provision.ec2_stopped   — cp_provisioner.go after TerminateInstances ack
  restart.pre_stop        — workspace_restart.go before Stop dispatch

These pair with the existing human-prose log.Printf lines (kept). The
new records are grep+jq friendly so a future log-aggregation pipeline
can reconstruct per-workspace provision timelines without parsing the
operator messages — this is the "and debug loggers so it dont happen
again" half of the leak-prevention work.

Tests:
  - provlog: emits evt-prefixed JSON, nil-tolerant, marshal-error
    fallback preserves event boundary, single-line output pinned.
  - handlers: provlog_emit_test.go pins three call-site contracts:
    provisionWorkspaceAutoSync emits provision.start with sync=true,
    stopForRestart emits restart.pre_stop with backend=cp on SaaS,
    and backend=none when both backends are nil.

Field taxonomy is convenience for ops, not contract — payload can grow
additively without breaking callers. Behavior gate is the event name +
boundary location, per feedback_behavior_based_ast_gates.md.

Refs #2867 (PR-D structured logging at provisioning boundaries)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 12:30:11 -07:00
Hongming Wang
eec4ea2e7d chore: delete TeamHandler.Collapse + docs cleanup (closes #2864)
Multi-model retrospective review of #2856 (Phase 1 Expand removal)
flagged that TeamHandler.Collapse is unreachable from the canvas UI:
the "Collapse Team" button calls PATCH /workspaces/:id { collapsed }
(visual flag toggle on canvas_layouts), NOT POST /workspaces/:id/collapse.
The destructive POST route — which stops EC2s, marks children removed,
and deletes layouts — has zero UI callers (verified via grep across
canvas/, scripts/, and the MCP tool registry; only docs referenced it).

Two semantically different operations had been sharing the word
"Collapse":

- Visual collapse (canvas) → PATCH { collapsed: true }. Hides
  children visually. Reversible. UI-only.
- Destructive collapse (POST /collapse) → Stops + marks removed.
  Irreversible. No caller.

Deleting the destructive one + its supporting machinery:

- workspace-server/internal/handlers/team.go (entirely)
- workspace-server/internal/handlers/team_test.go (entirely)
- POST /collapse route + teamh init in router.go
- findTemplateDirByName helper (zero non-test callers after Expand
  was deleted in #2856; package-private so no out-of-package consumers)
- NewTeamHandler constructor (no callers after route removed)

Plus stale doc references (the most dangerous was the MCP wrapper
mapping in mcp-server-setup.md — anyone generating MCP tool wrappers
from that table was wiring a 404):

- docs/agent-runtime/team-expansion.md (deleted entirely — whole
  guide taught the deleted flow)
- docs/api-reference.md (dropped two team.go rows)
- docs/api-protocol/platform-api.md (dropped /expand + /collapse
  rows)
- docs/architecture/molecule-technical-doc.md (dropped /expand +
  /collapse rows)
- docs/guides/mcp-server-setup.md (dropped expand_team +
  collapse_team MCP wrapper mappings)
- docs/glossary.md (dropped "(org template expand_team)"
  parenthetical)
- docs/frontend/canvas.md (dropped broken link to deleted
  team-expansion.md)

Kept: docs/architecture/backends.md mention of "TeamHandler.Expand
(#2367) bypassed routing on Start" — correct historical context for
the AST gate's existence, no live route reference.

Visual-collapse path unaffected:
  canvas/src/components/ContextMenu.tsx:227 → api.patch — unchanged
  canvas/src/components/WorkspaceNode.tsx:128 → api.patch — unchanged

go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.3s, no regression).

Net: -388/+10 = ~378 lines removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:59:43 -07:00
Hongming Wang
9f551319d2 feat(saas): close 4th default-tier site + lift org_import asymmetry + tests (#2910)
Multi-model retrospective review of #2901 found three Critical gaps:

1. (#2910 PR-B) template_import.go:79 wrote `tier: 3` hardcoded into
   generated config.yaml. On SaaS this defeated the T4 default at the
   create-handler layer — a config-less template import landed at T3
   regardless of POST /workspaces' computed default. The 4th
   default-tier site #2901 missed.

2. (#2910 PR-A) #2901 claimed `go test ... all green` but added zero
   new tests. Existing structural-pin tests caught dispatch-layer
   drift but said nothing about tier-default drift. A future refactor
   that flips DefaultTier() to always return 3 would ship green.

3. (#2910 PR-E) org_import.go fallback returned T2 on self-hosted
   while workspace.go returned T3. Internally consistent ("bulk vs
   interactive defaults") but undocumented same-name-different-value
   drift.

Fix:

- TemplatesHandler.NewTemplatesHandler now takes `wh *WorkspaceHandler`
  (nil-tolerant for read-only callers). Import + ReplaceFiles compute
  tier via h.wh.DefaultTier() and pass it to generateDefaultConfig.
  generateDefaultConfig gets a `tier int` parameter (bounds-checked,
  invalid input falls back to T3).

- org_import.go fallback lifts to h.workspace.DefaultTier() — single
  source of truth shared with Create + Templates so a future
  tier-default change sweeps every entry point at once.

- New saas_default_tier_test.go pinning:
    TestIsSaaS_TrueWhenCPProvWired
    TestIsSaaS_FalseWhenOnlyDocker
    TestDefaultTier_SaaS_IsT4
    TestDefaultTier_SelfHosted_IsT3
    TestGenerateDefaultConfig_RespectsTierParam
    TestGenerateDefaultConfig_SelfHostedTierT3
    TestGenerateDefaultConfig_OutOfRangeFallsBackToT3

- Existing template_import_test.go tests + chat_files_test.go +
  security_regression_test.go updated to thread the new tier param /
  wh constructor arg through their NewTemplatesHandler calls. Their
  pre-#2910 assertion of `tier: 3` is preserved (now passes because
  the test caller passes `3` explicitly), so no regression.
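The lifted default could look roughly like this. Assumptions for illustration: tiers run 1..4, SaaS is detected by control-plane provisioner wiring, and the field/method names are hypothetical, not the real handler API.

```go
package main

import "fmt"

type WorkspaceHandler struct {
	cpProv any // non-nil when the EC2 control-plane provisioner is wired
}

// IsSaaS mirrors the structural check described above (PR-C would replace
// this with an explicit deployment-mode signal).
func (h *WorkspaceHandler) IsSaaS() bool { return h.cpProv != nil }

// DefaultTier is the single source of truth shared by Create, Templates,
// and the org_import fallback: change it once, every entry point follows.
func (h *WorkspaceHandler) DefaultTier() int {
	if h.IsSaaS() {
		return 4 // SaaS default T4
	}
	return 3 // self-hosted default T3
}

// clampTier mirrors generateDefaultConfig's bounds check: out-of-range
// input falls back to T3.
func clampTier(t int) int {
	if t < 1 || t > 4 {
		return 3
	}
	return t
}

func main() {
	saas := &WorkspaceHandler{cpProv: struct{}{}}
	selfHosted := &WorkspaceHandler{}
	fmt.Println(saas.DefaultTier(), selfHosted.DefaultTier(), clampTier(99))
	// → 4 3 3
}
```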

go vet ./... clean. go test ./internal/handlers/ -count 1 — all
green (4.2s).

Deferred to separate follow-ups (per #2910 plan):
- PR-C: MOLECULE_DEPLOYMENT_MODE explicit deployment-mode signal
  (closes the IsSaaS()=cpProv!=nil structural fragility)
- PR-D: Host iptables IMDS block + IMDSv2 hop-limit (paired with
  molecule-controlplane EC2-IAM-scope audit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:38:22 -07:00
Hongming Wang
5334d60de4
Merge pull request #2898 from Molecule-AI/2867-workspaces-insert-allowlist
test(handlers): allowlist INSERT INTO workspaces sites (#2867 class 1)
2026-05-05 18:18:19 +00:00
Hongming Wang
27db090d3d
Merge pull request #2907 from Molecule-AI/feat/poll-mode-chat-upload-phase5a
feat(poll-upload): phase 5a — atomic batch insert + acked-index + mime hardening
2026-05-05 11:16:56 -07:00
Hongming Wang
0f25f6de97 test(handlers): allowlist INSERT INTO workspaces sites — close bulk-create regression class (#2867 class 1)
Adds TestINSERTworkspacesAllowlist: walks every non-test .go in this
package, finds funcs containing an `INSERT INTO workspaces (` SQL
literal, and pins the result against an explicit allowlist with the
safety mechanism named per entry.

New entries fail the build until a reviewer adds them — forcing the
question "what makes this INSERT idempotent?" at PR-review time, not
after the next bulk-create leak (the shape that produced 72 stale
child workspaces in tenant-hongming over 4 days).

Pairs with TestCreateWorkspaceTree_CallsLookupBeforeInsert (the
behavior pin for the one bulk path today). Together:
- this test catches "did a new function start inserting?"
- that test catches "did the existing bulk path drop its idempotency check?"

Both fire immediately when drift happens.

Current allowlist (3 entries):
- org_import.go:createWorkspaceTree → lookup-then-insert via
  lookupExistingChild (#2868 phase 3, also pinned by the sibling AST
  gate from #2895)
- registry.go:Register → ON CONFLICT (id) DO UPDATE (idempotent by
  primary key — external workspace upsert)
- workspace.go:Create → single-workspace POST /workspaces, server-
  generated UUID, no iteration

Verified via mutation: dropping a synthetic tempBulkLeakTest with an
unsafe loop+INSERT into the package fails the gate with a clear
diagnostic pointing at the file + function. Restoring the tree
returns the gate to green.

Memory: feedback_assert_exact_not_substring.md (verify tightened test
FAILS on bug shape) — mutation proof done locally.

RFC #2867 class 1. Class 2 (Prometheus gauge for ec2_instance
duplicates) + class 3 (structured logging on workspace create) are
follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:15:16 -07:00
Hongming Wang
9991057ad1 feat(poll-upload): phase 5a — atomic batch insert + acked-index + mime hardening
Resolves four of six findings from the retrospective code review of Phases
1–4 (poll-mode chat upload). Bundled because every change is in the
platform's pending_uploads layer or the multi-file handler that reads it.

Findings resolved:

1. Important — Sweep query lacked an index for the acked-retention OR-arm.
   The Phase 1 partial indexes are both `WHERE acked_at IS NULL`, so the
   `(acked_at IS NOT NULL AND acked_at < retention)` half of the WHERE
   clause seq-scanned the table on every cycle. Add a complementary
   partial index on `acked_at WHERE acked_at IS NOT NULL` so both arms
   of the disjunction are index-covered. Disjoint from the existing two
   indexes (no row matches both predicates), so write amplification is
   bounded to ~one index entry per terminal-state row.

2. Important — uploadPollMode partial-failure left orphans. The previous
   per-file Put loop committed rows 1..K-1 and then errored on row K with
   no compensation, so a client retry would double-insert the survivors.
   Refactor the handler into three explicit phases (pre-validate +
   read-into-memory, single atomic PutBatch, per-file activity row) and
   add Storage.PutBatch with all-or-nothing transaction semantics.

3. FYI — pendinguploads.StartSweeperWithInterval was exported only for
   tests. Move it to lower-case startSweeperWithInterval and expose the
   test seam through pendinguploads/export_test.go (Go convention; the
   shim file is stripped from the production binary at build time).

4. Nit — multipart Content-Type was passed verbatim into pending_uploads
   rows and re-served on /content. Add safeMimetype which strips
   parameters, rejects CR/LF/control bytes, and coerces malformed shapes
   to application/octet-stream. The eventual GET /content response can no
   longer be header-split via a crafted Content-Type on the multipart.

Comprehensive tests:

- 10 PutBatch unit tests (sqlmock): happy path, empty input, all four
  pre-validation rejection paths, BeginTx error, per-row error +
  Rollback (no Commit), first-row error, Commit error.
- 4 new PutBatch integration tests (real Postgres): all-rows-commit
  happy path with COUNT(*) verification, atomic-rollback no-leak via
  a NUL-byte filename that lib/pq rejects mid-batch, oversize
  short-circuit no-Tx, idx_pending_uploads_acked existence + partial
  predicate via pg_indexes (planner-shape-independent).
- 3 new chat_files_poll tests: atomic rollback on second-file oversize,
  atomic rollback on PutBatch error, mimetype CRLF/NUL/parameter
  sanitization (8 sub-cases).

The two remaining review findings (inbox_uploads.fetch_and_stage blocks
the poll loop synchronously; two httpx Clients per row) are Python-side
and ship in Phase 5b once this lands on staging.

Test-only export pattern via export_test.go, atomic pre-validation
discipline (validate before Tx), and behavior-based (not name-based)
test assertions follow the standing project conventions.
2026-05-05 11:10:13 -07:00
Hongming Wang
c79ba05ed5 test(pendinguploads): close cycleDone-vs-metric-record race in sweeper tests
TestStartSweeper_RecordsMetricsOnError flaked on every CI rerun under
race detection: `error counter delta = 0, want 1`. Root cause is a
race between two goroutines, not a bug in the production sweeper.

The fake `fakeSweepStorage.Sweep` signals `cycleDone` from inside its
deferred return — that happens BEFORE Sweep's return value is
received by `sweepOnce`, which is what triggers the metric increment.
On slow CI hosts the test goroutine wins the read after `waitForCycle`
unblocks and BEFORE StartSweeper's goroutine has called
`metrics.PendingUploadsSweepError`, so the asserted delta is 0 even
though the metric WILL be 1 a few ms later.

Adds a polling assert helper, `waitForMetricDelta`, that closes the
race deterministically without timing-based sleeps:

- TestStartSweeper_RecordsMetricsOnError uses waitForMetricDelta to
  wait for the error counter to settle at 1.
- TestStartSweeper_RecordsMetricsOnSuccess uses it on the success
  counters (acked, expired) so the error-stayed-zero assertion
  reads after StartSweeper has fully processed the cycle.
- waitForCycle keeps its current shape but documents the caveat in
  its comment so future tests don't repeat the assumption.

Verified: `go test ./internal/pendinguploads/ -race -count 5` passes
all 9 tests across 5 iterations cleanly.

Per memory feedback_question_test_when_unexpected.md: the
"delta=0, want=1" failure looked like a real production bug at first
glance, but instrumented inspection showed the metric DOES increment,
just AFTER the test's read. The fix is the test's wait shape, not
the sweeper.

Unblocks every PR currently broken by this flake (#2898 hit it on
two consecutive CI runs; staging-merged PRs from earlier today
(#2877/#2881/#2885/#2886) introduced the test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:46:17 -07:00
Hongming Wang
7644e82f2f feat(saas): default new workspaces to T4 on SaaS, T3 self-hosted
User reported every SaaS workspace defaults to T2 (Standard). Four
call sites quietly disagreed on the default:

- canvas CreateWorkspaceDialog (line 126): isSaaS ? 4 : 3   ← only correct one
- canvas EmptyState "Create blank":      tier: 2            ← hardcoded
- workspace.go POST /workspaces:         tier = 3           ← not SaaS-aware
- org_import.go createWorkspaceTree:     tier = 2 (fallback)← not SaaS-aware

So a user clicking "+ New Workspace" via the dialog got T4 on SaaS,
but a user clicking "Create blank" on the empty canvas got T2, and an
agent POSTing /workspaces directly got T3. Same tenant, three different
tiers depending on entry point.

Fix:

1. WorkspaceHandler.IsSaaS() and DefaultTier() helpers (workspace_dispatchers.go).
   IsSaaS() := h.cpProv != nil — single source of truth for "are we
   SaaS" across the file. DefaultTier() returns 4 on SaaS, 3 on
   self-hosted. SaaS rationale: each workspace runs on its own sibling
   EC2 so the per-workspace tier boundary is a Docker resource limit
   on the only container present — no neighbour to protect from. T4
   matches the boundary.

2. workspace.go now defaults tier via h.DefaultTier() instead of
   hardcoded T3.

3. org_import.go fallback (when neither ws.tier nor defaults.tier set)
   becomes SaaS-aware: T4 on SaaS, T2 on self-hosted (preserve the
   existing safe-shared-Docker-daemon default for self-hosted org
   imports).

4. canvas EmptyState "Create blank" stops sending tier:2 in the body
   and lets the backend pick — single source of truth in the backend.
   Eliminates the third disagreement.

Test plan:
- go vet ./... clean
- go test ./internal/handlers/ -count 1 — all green (4.3s)
- npx tsc --noEmit on canvas — clean
- Staging E2E (after deploy): create a fresh workspace via canvas
  empty-state on hongming.moleculesai.app, confirm tier=4 on the
  workspace details panel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 10:30:22 -07:00
Hongming Wang
48d1945269 test(org-import): tighten AST gate to discriminate workspaces vs lookalikes (#2872 Imp-1)
The previous TestCreateWorkspaceTree_CallsLookupBeforeInsert used
bytes.Index("INSERT INTO workspaces"), which prefix-matches
INSERT INTO workspaces_audit, INSERT INTO workspace_secrets, and
INSERT INTO workspace_channels. RFC #2872 cited this as a silent
false-pass mode: a future refactor that adds an audit-table INSERT
literal earlier in source than the real workspaces INSERT would
make the gate point at the wrong target.

Replaces the byte-search with a go/ast walk + a regex that requires
`\s*\(` after `workspaces` — distinguishes the real target from
prefix lookalikes.

Adds three discriminating tests:
- TestWorkspacesInsertRE_RejectsLookalikes — pins the regex against
  9 SQL shapes (real, raw-string-literal, audit-shadow, workspace_*
  prefixes, canvas_layouts, UPDATE/SELECT, comments).
- TestGate_FailsWhenLookupAfterInsert — synthesizes Go source where
  the lookup is positioned AFTER the workspaces INSERT, asserts the
  helper returns lookupPos > insertPos (which the production gate
  flags via t.Errorf). Proves the gate isn't vestigial.
- TestGate_IgnoresAuditTableShadow — synthesizes source with an
  audit-table INSERT BEFORE the lookup + real INSERT, asserts the
  tightened regex correctly walks past the shadow and finds the
  real INSERT.

Also extracts findLookupAndWorkspacesInsertPos as a helper so the
gate logic can be exercised against synthetic source, not only
against the real org_import.go.

Memory: feedback_assert_exact_not_substring.md (verify tightened
test FAILS on old code) — TestGate_FailsWhenLookupAfterInsert is
the failing-on-bug-shape proof.

Closes the silent-false-pass mode of #2872 Important-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 08:32:56 -07:00
Hongming Wang
e50799bc29 test(rfc): poll-mode chat upload — phase 4 real-Postgres integration
Phase 4 closes out the rollout — strict-sqlmock unit tests pin which
SQL fires, but they cannot detect bugs that depend on the actual row
state after the SQL runs. Real-Postgres integration tests catch:

  - the Sweep CTE depends on Postgres' make_interval function and
    the table's CHECK constraints; sqlmock would happily accept a
    hand-written SQL literal that Postgres rejects at runtime.
  - the partial idx_pending_uploads_unacked index only catches a
    wrong WHERE predicate at real-query-plan time.
  - subtle predicate drift (e.g. a WHERE clause that filters by
    acked_at IS NOT NULL but uses BETWEEN incorrectly).

Test cases:

  - PutGetAckRoundTrip: the full happy path — Put, Get, MarkFetched,
    Ack, idempotent re-Ack, Get-after-Ack returns ErrNotFound.
  - Sweep_DeletesAckedAfterRetention: row not eligible at retention=1h
    immediately after Ack; deleted at retention=0.
  - Sweep_DeletesExpiredUnacked: backdated expires_at exercises the
    unacked-and-expired branch of the WHERE clause.
  - Sweep_DeletesBothCategoriesInOneCycle: three rows (acked, expired,
    fresh); a single Sweep deletes the first two and leaves the third.
  - PutEnforcesSizeCap: ErrTooLarge above MaxFileBytes.
  - GetIgnoresExpiredAndAcked: Get's filter predicate matches the
    expected row state in the table.

Run path:
  - locally via the file-header docker incantation.
  - CI runs on every PR/push that touches handlers/** OR migrations/**
    (.github/workflows/handlers-postgres-integration.yml).
2026-05-05 05:04:41 -07:00
Hongming Wang
a327d207da feat(rfc): poll-mode chat upload — phase 3 GC sweep + observability
Phase 3 of the poll-mode chat upload rollout. Stacks atop Phase 2.

The platform's pending_uploads table grows once-per-uploaded-file with
no built-in cleanup. Phase 1's hard TTL (expires_at default 24h) makes
expired rows un-fetchable but doesn't actually delete them; Phase 1's
ack stamps acked_at but leaves the row indefinitely. Without a sweep
the table grows unbounded across normal traffic.

This PR adds:

- `Storage.Sweep(ctx, ackRetention)` — a single round-trip CTE that
  deletes acked rows past their retention window plus unacked rows
  past expires_at. Returns `(acked, expired)` deletion counts so
  Phase 3 dashboards can spot the stuck-fetch pattern (high expired,
  low acked) vs healthy churn.
- `pendinguploads.StartSweeper(ctx, storage, ackRetention)` —
  background goroutine that calls Sweep every 5 minutes (default).
  Runs once immediately on startup so a platform restart cleans up
  any rows that became eligible while we were down.
- Prometheus counters `molecule_pending_uploads_swept_total` with
  `outcome={acked,expired,error}` labels. Wired into the existing
  `/metrics` endpoint.
- Wired from cmd/server/main.go via supervised.RunWithRecover —
  one transient panic doesn't take the platform down with it.

Defaults:
  - SweepInterval = 5m (matches the dashboard refresh cadence)
  - DefaultAckRetention = 1h (gives the workspace at-least-once retry
    headroom in case it processed but failed to write the file before
    crashing)

Test coverage: 100% on storage_test.go (extended with sweepSQL pin +
six Sweep test cases including negative-retention clamp + zero-retention
immediate-delete + DB error wrapping) and sweeper_test.go (ticker-driven
+ ctx-cancel + nil-storage + transient-error-doesn't-crash + metric
counter assertions).

Closes the third of four phases tracked on the parent RFC; phase 4 is
the staging E2E test.
2026-05-05 05:00:13 -07:00
Hongming Wang
c778b62202 feat(metrics): add molecule_phantom_busy_resets_total counter (#2865)
Closes #2865 (split-B of the #2669 root-cause stack).

The phantom-busy sweep in workspace-server/internal/scheduler/scheduler.go
already logs each row reset, but no aggregate metric surfaces "how often
is this firing." A regression that causes high reset rates (e.g.
controlplane#481's missing env vars, or future drift in the workspace
runtime's task-lifecycle accounting) only surfaces when users complain.

Fix: counter exposed at /metrics as molecule_phantom_busy_resets_total,
incremented from sweepPhantomBusy after each row whose active_tasks
was reset. Same shape as existing molecule_websocket_connections_active.

Operator-side dashboard: alert when daily phantom-busy reset count
> 0.5% of active workspaces. Today's steady-state is near-zero; any
increase is a regression signal.

Tests:
  - TestTrackPhantomBusyReset_IncrementsCounter
  - TestTrackPhantomBusyReset_RaceFreeUnderConcurrentWrites (50×200
    concurrent writes; tests atomic invariant)
  - TestHandler_ExposesPhantomBusyResetsCounter (asserts HELP + TYPE
    + value lines in Prometheus text format)
  - TestHandler_PhantomBusyResetsZeroByDefault (fresh-process 0
    contract — prevents a future refactor from accidentally dropping
    the metric from /metrics)

Race-detector clean. Vet clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:45:24 -07:00
Hongming Wang
243f9bc2b1
Merge pull request #2877 from Molecule-AI/feat/poll-mode-chat-upload-phase1
feat(rfc): poll-mode chat upload — phase 1 platform staging layer
2026-05-05 11:32:10 +00:00
Hongming Wang
43bf94a07c fix(chat-uploads): align poll-mode activity rows with inbox poll filter
The workspace inbox poller filters
`GET /workspaces/:id/activity?type=a2a_receive` — writing rows with
`activity_type=chat_upload_receive` would be silently invisible to it.

Switch the poll-mode upload-staging handler to write
`activity_type=a2a_receive` with `method=chat_upload_receive` as the
discriminator. Same shape as A2A's `tasks/send` vs `message/send` method
split; the workspace-side handler (Phase 2) routes by `method`, not
activity_type.

Pinned with `TestPollUpload_ActivityRowDiscriminator` — sqlmock
WithArgs on positions 2 (activity_type) and 5 (method) so a refactor
that flips activity_type back to a custom value gets a red test
instead of a runtime "poller saw nothing" silent break.
2026-05-05 04:29:07 -07:00
Hongming Wang
86fdaad111 feat(rfc): poll-mode chat upload — phase 1 platform staging layer
External-runtime workspaces (registered via molecule connect, behind
NAT, no public callback URL) currently see HTTP 422 "workspace has no
callback URL" on every chat file upload. The only escape is to wrap the
laptop in ngrok / Cloudflare tunnel + re-register push-mode — a tax
that shouldn't exist for a one-line use case.

This phase introduces the platform-side staging layer that lets
canvas → external workspace uploads ride the same poll loop the inbox
already uses for text messages.

Architecture (mirrors inbox poll, SSOT principle):
  Canvas POST /chat/uploads (multipart)
      ↓ delivery_mode=poll
  Platform: chat_files.uploadPollMode
      ↓ pendinguploads.Storage.Put + LogActivity(chat_upload_receive)
  Workspace's existing inbox poller picks up the activity row (Phase 2)
  Workspace fetches: GET /workspaces/:id/pending-uploads/:fid/content
  Workspace acks:    POST /workspaces/:id/pending-uploads/:fid/ack

Pieces in this PR:
  * Migration 20260505100000 — pending_uploads table; partial indexes
    on unacked + expires_at for the workspace fetch + Phase 3 sweep
    hot paths. No FK to workspaces (audit retention), 24h hard TTL.
  * internal/pendinguploads — Storage interface + Postgres impl. Bytes
    inline (bytea) today; the interface lets a future PR replace with
    S3 (RFC #2789) by swapping one constructor. 100% test coverage on
    the Postgres impl via sqlmock-pinned SQL.
  * handlers.PendingUploadsHandler — GET /content + POST /ack endpoints.
    wsAuth-gated; cross-workspace bleed protection via per-row
    workspace_id check (token leak from A can't read B's pending bytes).
    Handler tests pin happy path + every 4xx/5xx mapping including
    cross-workspace + race-with-sweep.
  * chat_files.go — Upload poll-mode branch behind WithPendingUploads
    builder. Push-mode unchanged (regression-tested). Multipart parse
    + per-file sanitize + storage.Put + activity_logs row per file.
  * SanitizeFilename — Go mirror of workspace/internal_chat_uploads.py
    sanitize_filename. Tests pin parity case-by-case so canvas-emitted
    URIs stay identical regardless of which path handles the upload.
  * Comprehensive logging — every state transition (staged, fetch,
    ack, error) emits a structured log line with workspace_id +
    file_id + size + sanitized name. Phase 3 metrics will hook these.

The pendinguploads.Storage wiring is opt-in (WithPendingUploads on
ChatFilesHandler) so a binary deployed without the migration keeps the
pre-existing 422 behavior — no boot-order coupling between code roll
and schema roll.

Phase 2 (separate PR): workspace inbox extension — inbox_uploads.py
fetches via the GET endpoint, writes to /workspace/.molecule/chat-
uploads/, acks, and rewrites the URI from platform-pending: → workspace:
so the agent's existing send-attachments path needs no changes.
Phase 3: GC sweep + dashboards. Phase 4: poll-mode E2E on staging.

Tests:
  * 100% coverage on pendinguploads (sqlmock-pinned SQL drift gate).
  * Functional 100% on new handler code (uncovered branches are
    documented defensive duplicates: uuid re-parse, multipart Open
    error, Writer.Write fail — none reproducible in unit tests).
  * Push-mode + NULL delivery_mode regression tests pin no behavior
    change for existing workspaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:22:24 -07:00
Hongming Wang
26e2e97006 refactor(models): consolidate per-runtime model defaults to SSOT (RFC #2873 iter 1)
Two call sites — workspace_provision.go:537 and org_import.go:54 —
duplicated the same `if runtime == "claude-code"` branch deciding
the default model when the operator/agent didn't supply one. They
were copy-pasted; nothing prevented them from drifting silently.

Extract to `models.DefaultModel(runtime string) string`. Both call
sites now route through the helper. New runtimes need one entry
in DefaultModel + one assertion in TestDefaultModel — pre-fix it
required two source edits + an audit.

Foundation for the future `RuntimeConfig` interface (RFC #2873 +
task #231): once we add `ProvisioningTimeout()`, `CapabilitiesSupported()`
etc., the helper expands to per-runtime structs and `DefaultModel`
becomes one method on the interface.

## Coverage

15 unit tests pinning the exact contract:
  - claude-code → "sonnet"
  - 9 other known runtimes → universal default
  - empty + unknown → universal default (matches pre-refactor fallthrough)
  - case-sensitivity preserved (CLAUDE-CODE → universal default)

Plus invariant test: `DefaultModel` never returns "" — protects
against a future "return early on unknown" regression that would
silently break workspace creation.

## Verification

  - go build ./... clean
  - 15 model unit tests pass
  - existing handler tests untouched (no behavior change at call sites)
  - identical output to pre-refactor for every input

First iteration of the OSS-shape refactor program. Each PR meets all
7 bars (plugin/abstract/modular/SSOT/coverage/cleanup/file-split).

Refs RFC #2873.
2026-05-05 04:12:37 -07:00
Hongming Wang
1e8d7ae17c
Merge pull request #2869 from Molecule-AI/test/rfc2829-tighten-sweeper-assertions
test(delegations): tighten integration-test assertions + integrationDB doc
2026-05-05 10:42:31 +00:00
Hongming Wang
fcdf79774d test(delegations): tighten integration-test assertions + integrationDB doc (#321)
Three small follow-ups from #2866 self-review:

1. TestIntegration_Sweeper_StaleHeartbeatIsMarkedStuck — assert
   strings.Contains(errDet, "no heartbeat for") instead of != "".
   The original "non-empty" check passes for any error_detail value;
   if a future regression swaps the message format, the test wouldn't
   catch it. Pin the production format string explicitly.

2. TestIntegration_Sweeper_DeadlineExceededIsMarkedFailed — drop the
   redundant `last_heartbeat = now()` write. The sweeper checks
   deadline FIRST (the stronger statement) and short-circuits before
   evaluating heartbeat staleness, so the heartbeat field is irrelevant
   for that test path.

3. integrationDB doc comment now warns explicitly that the helper is
   NOT t.Parallel()-safe — it hot-swaps the package-level mdb.DB and
   restores via t.Cleanup. If a future contributor adds t.Parallel()
   to one of these tests they race on the global. Comment makes the
   constraint discoverable instead of a debugging surprise.

All 7 integration tests still pass against real Postgres locally.
2026-05-05 03:39:22 -07:00
Hongming Wang
d6337a1ae9 feat(org-import): make createWorkspaceTree idempotent (Phase 3 of #2857)
OrgHandler.Import was non-idempotent — every call INSERTed a fresh row
for every workspace in the tree, regardless of whether matching
workspaces already existed. Calling /org/import twice with the same
template duplicated the entire tree.

This was the bigger leak source than TeamHandler.Expand (deleted in
PR #2856). tenant-hongming accumulated 72 distinct child workspaces
in 4 days entirely from repeated org-template spawns of the same
template — the (tier × runtime) matrix in the audit data was the
template's static shape, multiplied by spawn count.

Fix: route through a new lookupExistingChild helper before INSERT.
Skip-if-exists semantics by default:
- Match on (parent_id, name) using `IS NOT DISTINCT FROM` so NULL
  parents (root workspaces) are included.
- Ignore status='removed' rows so collapsed teams or deleted
  workspaces don't block re-import.
- Recursion still runs on the existing id so partial-match templates
  (parent exists, some children missing) backfill correctly instead
  of either no-op'ing the whole subtree or duplicating the existing
  children.
- Result entries for skipped nodes carry skipped:true so callers
  (canvas Import preflight modal) can surface "5 of 7 already
  existed, 2 created."

The recursion that walked ws.Children is extracted into
recurseChildrenForImport so both the create-path and the skip-path
share one implementation — no duplicated grid math, no two paths to
keep in sync.

Note: replace_if_exists semantics (re-roll: stop+delete old, create
new) are deferred. Skip-if-exists alone closes the leak; re-roll is
a later UX decision for the canvas Import preflight modal.

Tests:
- 4 sqlmock cases on lookupExistingChild: not-found, found,
  nil-parent (the IS NOT DISTINCT FROM NULL trick), DB-error
  propagates (must fail fast — silent fallback to INSERT is the
  failure mode the helper exists to prevent).
- 1 source-level AST gate (per memory feedback_behavior_based_ast_gates.md):
  pins that h.lookupExistingChild( appears BEFORE INSERT INTO workspaces
  in org_import.go. If a future refactor reintroduces the un-checked
  INSERT, the gate fails. Verified load-bearing by removing the call —
  build fails (helper symbol gone).

go vet ./... clean. go test ./internal/handlers/ -count 1 — all green
(4.2s, no regression on existing OrgImport / Provision / Team tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 03:37:49 -07:00
Hongming Wang
3d2a50e2a2 test(delegations): extend integration suite with sweeper coverage (3 tests)
Real-Postgres tests for the RFC #2829 PR-3 sweeper. Validates:

  - Deadline-exceeded rows are marked failed with the expected
    error_detail
  - Stale-heartbeat in-flight rows are marked stuck (uses
    DELEGATION_STUCK_THRESHOLD_S env override for deterministic
    timing)
  - Healthy rows (fresh heartbeat + future deadline) are not touched
    — no false-positive against well-behaved delegations

These extend the gate added in the previous commit so the workflow
catches sweeper regressions, not just ledger-write ones. All 7
integration tests now pass; CI workflow runs them all.
2026-05-05 03:20:19 -07:00
Hongming Wang
c661ea4cd3
Merge pull request #2861 from Molecule-AI/fix/rfc2829-result-preview-ordering-and-integration-gate
fix(delegations): preserve result_preview + add real-Postgres integration gate
2026-05-05 09:51:30 +00:00
Hongming Wang
4c9f12258d fix(delegations): preserve result_preview through completion + add real-Postgres integration gate
Two-part PR:

## Fix: result_preview was lost on completion

Self-review of #2854 caught a real bug. SetStatus has a same-status
replay no-op; the order of calls in `executeDelegation` completion
+ `UpdateStatus` completed branch clobbered the preview field:

  1. updateDelegationStatus(completed, "")    fires
  2. inner recordLedgerStatus(completed, "", "")
       → SetStatus transitions dispatched → completed with preview=""
  3. outer recordLedgerStatus(completed, "", responseText)
       → SetStatus reads current=completed, status=completed
       → SAME-STATUS NO-OP, never writes responseText → preview lost

Confirmed against real Postgres (see integration test). Strict-sqlmock
unit tests passed because they pin SQL shape, not row state.

Fix: call the WITH-PREVIEW recordLedgerStatus FIRST, then
updateDelegationStatus. The inner call becomes the no-op (correctly
preserves the row written by the outer call).

Same gap fixed in UpdateStatus handler — body.ResponsePreview was
never landing in the ledger because updateDelegationStatus's nested
SetStatus(completed, "", "") fired first.

## Gate: real-Postgres integration tests + CI workflow

The unit-test-only workflow that shipped #2854 was the root cause.
Adding two layers of defense:

1. workspace-server/internal/handlers/delegation_ledger_integration_test.go
   — `//go:build integration` tag, requires INTEGRATION_DB_URL env var.
   4 tests:
     * ResultPreviewPreservedThroughCompletion (regression gate for the
       bug above — fires the production call sequence in fixed order
       and asserts row.result_preview matches)
     * ResultPreviewBuggyOrderIsLost (DIAGNOSTIC: confirms the
       same-status no-op contract works as designed; if SetStatus's
       semantics ever change, this test fires)
     * FailedTransitionCapturesErrorDetail (failure-path symmetry)
     * FullLifecycle_QueuedToDispatchedToCompleted (forward-only +
       happy path)

2. .github/workflows/handlers-postgres-integration.yml
   — required check on staging branch protection. Spins postgres:15
   service container, applies the delegations migration, runs
   `go test -tags=integration` against the live DB. The workflow always
   runs, with per-step gating on a path filter (handlers/wsauth/
   migrations), so the required-check name is still satisfied on PRs
   that don't touch relevant code.

Local dev workflow (file header documents this):

  docker run --rm -d --name pg -e POSTGRES_PASSWORD=test -p 55432:5432 postgres:15-alpine
  psql ... < workspace-server/migrations/049_delegations.up.sql
  INTEGRATION_DB_URL="postgres://postgres:test@localhost:55432/molecule?sslmode=disable" \
    go test -tags=integration ./internal/handlers/ -run "^TestIntegration_"

## Why this matters

Per memory `feedback_mandatory_local_e2e_before_ship`: backend PRs
MUST verify against real Postgres before claiming done. sqlmock pins
SQL shape; only a real DB can verify row state. The workflow makes
this gate mandatory rather than optional.
2026-05-05 02:47:52 -07:00
Hongming Wang
d890fd9a3f
Merge pull request #2856 from Molecule-AI/chore/remove-team-expand-handler
chore(workspace-server): remove TeamHandler.Expand bulk-create handler
2026-05-05 09:42:51 +00:00
Hongming Wang
ec1f21922c chore(workspace-server): remove TeamHandler.Expand bulk-create handler
Every workspace can have children via the regular CreateWorkspace flow
with parent_id set, so a separate handler that bulk-creates from
config.yaml's sub_workspaces (and was non-idempotent — calling it twice
duplicated the team) earned its way out. "Team" is just the state of
having children; expanding/collapsing is purely a canvas-side visual
action that toggles the `collapsed` column via PATCH.

The non-idempotency directly caused tenant-hongming's vCPU starvation:
72 distinct child workspaces accumulated in 4 days, ~14 leaked EC2s
(50 of 64 vCPU consumed by stale teams), every Canvas tabs E2E retry
flaking on RunInstances VcpuLimitExceeded.

What stays:
- TeamHandler.Collapse — still useful; stops + removes children via
  StopWorkspaceAuto. Reachable from the canvas Collapse Team button.
  (Note: that button currently calls PATCH /workspaces/:id, not the
  Collapse endpoint — that's a separate reachability question for
  later.)
- findTemplateDirByName helper — kept in team.go pending a relocate
  decision; no in-package consumers after Expand.
- The four other paths that create child workspaces continue to work
  unchanged: regular POST /workspaces with parent_id, OrgHandler.Import
  (recursive tree), Bundle import, scripts.

What goes:
- POST /workspaces/:id/expand route (router.go)
- TeamHandler.Expand method (team.go: ~130 lines)
- 4 TestTeamExpand_* sqlmock tests (team_test.go)
- TestTeamExpand_UsesAutoNotDirectDockerPath AST gate
  (workspace_provision_auto_test.go) — pinned a code path that no
  longer exists; the generic TestNoCallSiteCallsDirectProvisionerExceptAuto
  gate still covers the architectural intent for any future caller.

Follow-up PRs:
- canvas/ContextMenu.tsx: drop the "Expand to Team" right-click button
  + handleExpand callback; users create children via the regular
  + New Workspace dialog with the parent picker (already supported)
- OrgHandler.Import idempotency (skip-if-exists OR replace_if_exists)
  — same bug class as the deleted Expand, but on the bulk-tree path
- One-off cleanup script for tenant-hongming's 72 stale workspaces

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:39:13 -07:00
Hongming Wang
ca61213578
Merge pull request #2853 from Molecule-AI/refactor/split-workspace-dispatchers-1777970000
refactor(handlers): extract dispatchers from workspace.go (#2800 partial)
2026-05-05 09:30:55 +00:00
Hongming Wang
44bb35a926 feat(delegations): wire ledger Insert+SetStatus from production code paths (RFC #2829 #318)
PR-1 shipped the `delegations` table + `DelegationLedger` helper. PR-3
wired the sweeper. PR-4 wired the dashboard. But no PR ever wired
`ledger.Insert` from a production code path — the table stayed empty,
the sweeper had nothing to sweep, the dashboard had nothing to show.

This PR closes that gap. Behind feature flag `DELEGATION_LEDGER_WRITE=1`
(default off), the legacy activity_logs writes are mirrored to the
durable ledger:

  - insertDelegationRow → ledger.Insert (queued)
  - updateDelegationStatus → ledger.SetStatus on every status transition
  - executeDelegation completion path → ledger.SetStatus(completed,
    result_preview) for the result preview that activity_logs already
    stores in response_body
  - Record handler → ledger.Insert + ledger.SetStatus(dispatched) so
    agent-initiated delegations land in the same table

## Why a flag

The legacy flow has ~30 strict-sqlmock tests pinning exactly which SQL
statements fire per handler. Adding ledger writes always-on would
force adding ExpectExec stanzas to each. Flag-off keeps all 30 green
without churn; flag-on lets operators populate the table in staging
to feed the sweeper + dashboard once the agent-side cutover (RFC #2829
PR-5) has proven the round-trip end-to-end.

Default off → byte-identical to pre-#318 behavior.

## Status vocabulary mapping

activity_logs uses a freer status vocabulary than the ledger's CHECK
constraint allows. updateDelegationStatus is called with values like
"received" that the ledger doesn't accept; the wiring filters via a
switch to only forward known-good values, skipping anything else.

Record's first activity_logs row is `dispatched` but the ledger's
Insert path requires `queued` as initial state. Insert as queued first;
the very next SetStatus(..., dispatched) promotes it on the same row.

## Coverage

8 wiring tests (delegation_ledger_writes_test.go):

  - flag off → no SQL fired (rollout safety contract)
  - flag on → INSERT + UPDATE fire as expected
  - flag rejects loose truthy values (true/yes/0/on/TRUE) — only "1"
    is the on signal, matching PR-2 + PR-5 conventions
  - terminal-state replay swallows ErrInvalidTransition (legacy is
    authoritative; ledger replay error is not a delegation failure)

All 30 existing delegation_test.go tests still pass — flag default off
keeps the strict-sqlmock surface unchanged.

Refs RFC #2829.
2026-05-05 02:26:06 -07:00
Hongming Wang
024ef260db refactor(handlers): extract dispatchers from workspace.go (#2800 partial)
workspace.go was 950 lines after the dispatcher work in PRs #2811 +
#2824 + #2843 + #2846 + #2847 + #2848 + #2850. This extracts the 6
SoT dispatcher helpers into a new workspace_dispatchers.go so the
file is the architectural unit it deserves to be (one place for
"how do we route a workspace lifecycle verb to a backend?").

Moved (no body changes — pure cut + paste with imports):
  - HasProvisioner               (gate accessor)
  - provisionWorkspaceAuto       (async provision)
  - provisionWorkspaceAutoSync   (sync provision, runRestartCycle's path)
  - StopWorkspaceAuto            (stop dispatcher)
  - RestartWorkspaceAuto         (restart wrapper)
  - RestartWorkspaceAutoOpts     (restart with resetClaudeSession)

workspace.go shrinks from 950 → 735 lines and now holds:
  - WorkspaceHandler struct + constructor
  - SetCPProvisioner / SetEnvMutators
  - Create / List / Get / scanWorkspaceRow
  - HTTP handler glue

workspace_dispatchers.go is 255 lines and holds the dispatcher trio +
sync variant + gate accessor + a header docblock summarizing the
history (PRs that added each helper) and the source-level pin tests
that gate against drift.

Source-level pin tests updated:
  - TestNoCallSiteCallsDirectProvisionerExceptAuto: workspace_dispatchers.go
    added to allowlist (the dispatcher IS the place that calls per-backend
    bodies directly).
  - TestNoCallSiteCallsBareStop: same.
  - TestNoBareBothNilCheck / TestOrgImportGate_UsesHasProvisionerNotBareField:
    no change — they were source-pinning specific files, not all callers.

Build clean, vet clean, full test suite passes (1742 / 0 in workspace,
all Go test packages green).

Out of scope (#2800 has more):
  - workspace_provision.go (869 lines) split into Docker + CP halves —
    files would still be 400+ each, marginal value. Defer until a
    third backend lands and the symmetry breaks.
  - Splitting Create / List / Get into per-handler files — they're
    short and tightly coupled to the struct; keep co-located.

Closes #2800 partial. Filing a follow-up issue if/when workspace.go
or workspace_provision.go grows past 800 lines again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 02:24:49 -07:00
Hongming Wang
c85783fbee docs(workspace): point recovery hint at /external/rotate (not the never-shipped /tokens)
Self-review of #2852: the inline comment on the IssueToken-failed branch
still referenced POST /workspaces/:id/tokens, which never shipped. The
recovery path that did ship in #2852 is POST /workspaces/:id/external/rotate.
Update the hint so the next operator who hits this failure mode finds
the right endpoint.
2026-05-05 01:58:43 -07:00
Hongming Wang
b375252dc8 feat(external): credential rotation + re-show instruction modal (#319)
External workspaces (runtime=external) lose their workspace_auth_token
the moment the create modal closes — the token is unrecoverable from
any later DB read. Operators who lost their copy or want to respond to
a suspected leak had no recovery path short of recreating the workspace
(which also breaks cross-workspace delegation links + memory namespace).

This PR adds two endpoints + a Config-tab section that surfaces them:

  POST /workspaces/:id/external/rotate
    Revokes any prior live tokens, mints a fresh one, returns the same
    ExternalConnectionInfo payload Create returns. Old credentials stop
    working immediately — the previously-paired agent will fail auth on
    its next heartbeat (~20s).

  GET /workspaces/:id/external/connection
    Returns the connect block with auth_token="". For the operator who
    just needs to re-find PLATFORM_URL / WORKSPACE_ID / one of the
    snippets without invalidating the live agent.

Both reject runtime ≠ external with 400 + a hint pointing at /restart
for non-external runtimes (which mints AND injects into the container).

## Why a flag isn't needed

The endpoints are purely additive — Create's behavior is unchanged.
Existing external workspaces don't see anything different until an
operator clicks the new buttons.

## DRY refactor

Extracted BuildExternalConnectionPayload() in external_connection.go
as the single source of truth for the connect payload shape. Create,
Rotate, and GetExternalConnection all call it. Add a snippet once →
all three endpoints emit it. Trims trailing slash on platform_url so
no double-slash sneaks into registry_endpoint.
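The normalization described above can be sketched as follows; the struct fields and the `/registry` path are illustrative assumptions, since the real payload shape lives in external_connection.go:

```go
package main

import (
	"fmt"
	"strings"
)

// externalConnectionInfo and the registry path are assumptions for
// illustration; the real single source of truth is
// BuildExternalConnectionPayload in external_connection.go.
type externalConnectionInfo struct {
	PlatformURL      string `json:"platform_url"`
	WorkspaceID      string `json:"workspace_id"`
	AuthToken        string `json:"auth_token"`
	RegistryEndpoint string `json:"registry_endpoint"`
}

func buildConnectionPayload(platformURL, wsID, token string) externalConnectionInfo {
	// Trim the trailing slash so no double-slash sneaks into registry_endpoint.
	base := strings.TrimRight(platformURL, "/")
	return externalConnectionInfo{
		PlatformURL:      base,
		WorkspaceID:      wsID,
		AuthToken:        token, // "" on the GET /external/connection path
		RegistryEndpoint: base + "/registry",
	}
}

func main() {
	fmt.Printf("%+v\n", buildConnectionPayload("https://platform.example/", "ws-1", ""))
}
```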

## Canvas

ExternalConnectionSection mounts in ConfigTab when runtime=external.
Two buttons:
  - "Show connection info" (cosmetic) — fetches GET /external/connection
  - "Rotate credentials" (destructive) — confirm dialog explains the
    impact, then POST /external/rotate

Both reuse the existing ExternalConnectModal so operators don't learn
a second snippet UX.

## Coverage

10 Go tests:
  - Rotate happy path (revoke + mint order, payload shape, broadcast event)
  - Rotate refuses non-external runtimes (400 with restart hint)
  - Rotate 404 on unknown workspace + 400 on empty id
  - GetExternalConnection happy path (auth_token="", same payload shape)
  - GetExternalConnection refuses non-external + 404 on unknown
  - BuildExternalConnectionPayload — placeholder substitution + trailing
    slash trimming + blank-token contract

6 canvas tests:
  - both action buttons render
  - "Show" calls GET /external/connection and opens modal
  - "Rotate" opens confirm dialog before firing POST
  - Cancel dismisses without rotating
  - Confirm POSTs and opens modal with returned token
  - API failures surface as visible error chips

Migration: existing external workspaces gain new abilities; no data
migration. The DRY refactor preserves byte-identical Create response
shape (8 ConfigTab tests + all existing handler tests still pass).

Closes #319.
2026-05-05 01:55:27 -07:00
Hongming Wang
61b7755c3c feat(handlers): migrate Pause loop to StopWorkspaceAuto — #2799 Phase 3
Last open #2799 site. Pause's per-workspace stop call now routes
through StopWorkspaceAuto, removing the final inline if-cpProv-else
(actually if-h.provisioner) dispatch from workspace_restart.go's
restart/pause/resume code paths.

Before 2026-05-05 the Pause loop was:

  if h.provisioner != nil {
      h.provisioner.Stop(ctx, ws.id)
  }

Same drift class as #2813 (team-collapse leak) + #2814 (workspace
delete leak) — Docker-only stop silently no-ops on SaaS, leaving
the EC2 running while the workspace row gets marked paused. Orphan
sweeper would catch it eventually but the leak window is real.

Pause-specific bookkeeping (mark paused, clear workspace keys,
broadcast WORKSPACE_PAUSED) stays inline in the handler; only the
"stop the running workload" step delegates. StopWorkspaceAuto's
no-backend → no-op semantics match the pre-fix behavior on
misconfigured deployments (the bookkeeping still runs).

One new source-level pin:
  TestPauseHandler_UsesStopWorkspaceAuto — gates regression to the
  inline dispatch shape.

This closes #2799 Phase 3. After this PR + #2847 (Phase 2 PR-B) land,
workspace_restart.go has no remaining inline if-cpProv-else dispatch
in any user-facing code path. The remaining direct backend calls
inside the file are in stopForRestart and cpStopWithRetry — both
internal helpers that ARE the dispatcher's underlying primitives,
not new bypasses.

Note: scope was originally tagged "Phase 3 needs PauseWorkspaceAuto
verb" in the audit on PR #2843. On closer reading Pause's stop step
is identical to Stop — only the bookkeeping is Pause-specific. Reusing
StopWorkspaceAuto avoids unnecessary surface and keeps the dispatcher
trio (provision/stop/restart) tight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 00:00:16 -07:00
Hongming Wang
9a772bf946 feat(handlers): provisionWorkspaceAutoSync + Site 4 migration — #2799 Phase 2 PR-B
runRestartCycle's auto-restart cycle (Site 4 from PR #2843's audit)
needs synchronous provision dispatch — the outer pending-flag loop
in RestartByID relies on returning when the new container is up so
the next restart cycle doesn't race the in-flight provision goroutine
on its Stop call.

Phase 1's provisionWorkspaceAuto wraps each per-backend body in
`go func() {...}()` — wrong shape for runRestartCycle's needs. This
PR introduces provisionWorkspaceAutoSync as a behavioral mirror that
runs in the current goroutine instead.

Two helpers, kept identical except for the wrapper:

  provisionWorkspaceAuto:     spawns goroutine, returns immediately
  provisionWorkspaceAutoSync: blocks until per-backend body returns

Same backend-selection (CP first, Docker second) + no-backend
mark-failed fallback. When one grows a new arm (third backend, retry
semantics), the other should too — pinned in the docstring.
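The wrapper-only difference between the pair can be sketched as below; the shared-body signature is hypothetical, since the real helpers each inline the backend-selection body:

```go
package main

// provisionAuto sketches the pair's contract (signature hypothetical):
// the backend-selection body is shared, and only the wrapper differs.
// sync runs the body in the current goroutine; async spawns one.
func provisionAuto(body func(), sync bool) {
	if sync {
		body() // provisionWorkspaceAutoSync: blocks until the body returns
		return
	}
	go body() // provisionWorkspaceAuto: fire-and-return
}

func main() {
	done := make(chan struct{})
	provisionAuto(func() { close(done) }, false)
	<-done
}
```

runRestartCycle needs the blocking variant so RestartByID's pending-flag loop only resumes once the new container is up.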

Site 4 (runRestartCycle) was the only call site that needs sync today.
Migrating it removes the last bare if-cpProv-else dispatch in the
restart code path's provision half.

Three new tests:
  - TestProvisionWorkspaceAutoSync_RoutesToCPWhenSet
  - TestProvisionWorkspaceAutoSync_NoBackendMarksFailed
  - TestRunRestartCycle_UsesProvisionWorkspaceAutoSync (source-level pin)

Out of scope (last open #2799 site):
  Phase 3 — Site 5 (Pause loop). PAUSE doesn't reprovision; needs a
  new PauseWorkspaceAuto verb. After this PR lands, Pause is the only
  inline if-cpProv-else dispatch left in workspace_restart.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:44:54 -07:00
Hongming Wang
0a90d7ae1a
Merge pull request #2846 from Molecule-AI/feat/2799-phase2-restart-resume-1777958000
feat(handlers): migrate Restart + Resume handlers to dispatchers — #2799 Phase 2 PR-A
2026-05-05 05:15:29 +00:00
Hongming Wang
5b7f4d260b feat(handlers): migrate Restart + Resume handlers to dispatchers — #2799 Phase 2 PR-A
Sites 1+2 (Restart HTTP handler goroutine) and Site 3 (Resume HTTP
handler goroutine) now route through RestartWorkspaceAutoOpts /
provisionWorkspaceAuto instead of inlining the if-cpProv-else dispatch.

Three changes:

1. **RestartWorkspaceAutoOpts** — new variant of RestartWorkspaceAuto
   that carries the resetClaudeSession Docker-only flag (issue #12).
   The bare RestartWorkspaceAuto still exists as a wrapper that calls
   Opts with false. CP path silently ignores the flag (each EC2 boots
   fresh — no session state to clear). Mirrors the Provision pair
   (provisionWorkspace / provisionWorkspaceOpts).

2. **Restart handler (Site 1+2)** — the inline goroutine
   `if h.provisioner != nil { Stop } else if h.cpProv != nil { ... }`
   collapses to `RestartWorkspaceAutoOpts(...)`. Pre-fix the dispatch
   was Docker-FIRST ordering (a different drift class from the
   silent-drop bugs PRs #2811/#2824 closed); the dispatcher enforces
   CP-FIRST.

3. **Resume handler (Site 3)** — Resume is provision-only (workspace
   is paused, no live container), so it routes through
   provisionWorkspaceAuto, not RestartWorkspaceAuto. Inline
   if-cpProv-else dispatch removed.

Two new source-level pins:
  - TestRestartHandler_UsesRestartWorkspaceAuto
  - TestResumeHandler_UsesProvisionWorkspaceAuto

These prevent regression to the inline dispatch pattern.

Out of scope (tracked under #2799):
  - Site 4 (runRestartCycle) — synchronous coordination model needs
    a different shape than the fire-and-return dispatchers. PR-B.
  - Site 5 (Pause loop) — PAUSE doesn't reprovision, needs a new
    PauseWorkspaceAuto verb. Phase 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:09:12 -07:00
Hongming Wang
7993693cf1 feat(delegations): wire RFC #2829 sweeper + admin routes into platform server
Activates the server-side foundation that PRs #2832, #2836, #2837
shipped without wiring (each PR landed dead code on purpose so the
review surface stayed tight).

## What this PR wires up

  1. router.go — registers the RFC #2829 PR-4 admin endpoints behind
     AdminAuth:
       GET /admin/delegations[?status=...&limit=N]
       GET /admin/delegations/stats

  2. cmd/server/main.go — starts the RFC #2829 PR-3 stuck-task
     sweeper as a supervised goroutine alongside the existing
     scheduler + hibernation-monitor + image-auto-refresh:
       go supervised.RunWithRecover(ctx, "delegation-sweeper",
                                    delegSweeper.Start)
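The supervision wrapper's key property, panic isolation, can be sketched as below; this is an assumption about supervised.RunWithRecover's shape (the real helper also takes a ctx and may log/restart differently):

```go
package main

import "log"

// runWithRecover sketches the supervised-goroutine wrapper (assumed
// shape): a panic in the task is recovered and logged so it cannot
// take down the server alongside the scheduler and other monitors.
func runWithRecover(name string, fn func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("%s: recovered from panic: %v", name, r)
		}
	}()
	fn()
}

func main() {
	runWithRecover("delegation-sweeper", func() { panic("boom") })
	log.Println("server keeps running")
}
```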

## What this PR does NOT do

  - PR-2's DELEGATION_RESULT_INBOX_PUSH flag stays default off — flip
    happens via env config in a follow-up after staging burn-in.
  - PR-5's DELEGATION_SYNC_VIA_INBOX flag stays default off — same
    reason. The two flags are independent; either can be flipped in
    isolation.
  - Canvas operator panel UI: this PR exposes the JSON contract; the
    canvas panel consumes it in a separate canvas PR.

## Coverage

2 new router gate tests in admin_delegations_route_test.go:

  - List endpoint requires AdminAuth (unauthenticated → 401)
  - Stats endpoint requires AdminAuth (unauthenticated → 401)

Pattern mirrors admin_test_token_route_test.go (the IDOR-fix gate
for PR #112). Catches a future router refactor that silently drops
AdminAuth — operator dashboard data exposes caller_id, callee_id, and
task_preview, none of which should reach unauthenticated callers.

Sweeper boots as a no-op until at least one delegation row exists,
so this PR is safe to land before PR-5's agent-side cutover sees
production traffic.

Refs RFC #2829.
2026-05-04 22:00:59 -07:00
Hongming Wang
789d705866
Merge pull request #2843 from Molecule-AI/fix/restart-dispatcher-rework-1777956000
feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1 (re-do of #2835)
2026-05-05 04:48:52 +00:00
Hongming Wang
cb820acbd6 fix(test): pre-register sqlmock for panic-recovered Docker test goroutine 2026-05-04 21:44:31 -07:00
Hongming Wang
82e7059e0e
Merge pull request #2842 from Molecule-AI/fix/codex-template-bump-cli-pin
fix(external-templates): unpin codex CLI from stale ^0.57
2026-05-05 04:34:14 +00:00
Hongming Wang
4f67fe59fb feat(handlers): RestartWorkspaceAuto dispatcher — #2799 Phase 1
Closes the third silent-drop-on-SaaS class for the restart verb. Two
of the three dispatchers were already in place (provisionWorkspaceAuto
PR #2811, StopWorkspaceAuto PR #2824); this completes the trio.

PR #2835 was an earlier attempt at this work (delivered by a peer
agent) that I had to send back for four critical bugs — stop-leg
dispatch order inverted, no-backend nil-deref, empty payload (dispatcher
unusable by callers), forcing-function tests red-from-day-1. This
re-do takes the audit + classification from that work but rebuilds
the implementation against the existing dispatcher convention.

Phase 1 scope:

  - RestartWorkspaceAuto in workspace.go — symmetric mirror of
    provisionWorkspaceAuto + StopWorkspaceAuto. CP-first dispatch
    order. cpStopWithRetry on the SaaS leg (Restart's "make it alive
    again" contract justifies the retry that StopWorkspaceAuto's
    delete-time contract does not). Three-arm shape including a
    no-backend mark-failed defense-in-depth.
  - Three new pin tests covering the routing surface:
    TestRestartWorkspaceAuto_RoutesToCPWhenSet,
    TestRestartWorkspaceAuto_RoutesToDockerWhenOnlyDocker,
    TestRestartWorkspaceAuto_NoBackendMarksFailed.

Phase 2/3 (deferred, file as follow-up issue):

  - workspace_restart.go's manual dispatch sites (Restart handler
    goroutine, Resume handler goroutine, runRestartCycle's inline
    Stop, Pause loop). Each site has async-context reasoning beyond
    a fire-and-return dispatcher and needs per-site review.
  - Pause specifically needs a different verb (PauseWorkspaceAuto)
    since Pause doesn't reprovision.

Why no callers migrated in this PR: the existing call sites in
workspace_restart.go all build their `payload` from a synchronous
DB read first; rewiring them needs care to preserve that ordering
plus the resetClaudeSession + template path resolution that lives
in the HTTP handler context. Splitting the dispatcher introduction
from the migration keeps each PR small and reviewable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:30:36 -07:00
Hongming Wang
410275e5af fix(external-templates): unpin codex CLI from stale ^0.57
For a 0.x release line, `^0.57` only allows 0.57.x — codex CLI is now
at 0.128 with breaking API changes in between (notably `exec --resume
<sid>` becoming the `exec resume <sid>` subcommand). Operators following
the snippet today either get a six-month-old codex with the legacy
resume flag, OR install latest manually and discover that, until
codex-channel-molecule 0.1.2, the bridge daemon couldn't drive it.

codex-channel-molecule 0.1.2 (just published) handles the new subcommand
shape, so operators are best served by always getting the latest codex
that the bridge daemon was last validated against. Bump to `@latest`.

If a future codex CLI breaks the daemon's invocation again, we ship a
new bridge-daemon release rather than asking operators to manage a pin
themselves.

Test: go test ./internal/handlers/ -run TestExternalTemplates -count=1 → green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:27:45 -07:00
Hongming Wang
1557743ef9
Merge branch 'staging' into feat/rfc2829-pr2-result-push-and-sync-cutover 2026-05-04 21:25:33 -07:00
Hongming Wang
b0bcd97781
Merge pull request #2839 from Molecule-AI/fix/status-failed-must-set-error-1777954000
fix(bundle): markFailed sets last_sample_error + AST drift gate (resolves #2632 root cause)
2026-05-05 04:12:38 +00:00
Hongming Wang
56149f8a24 fix(bundle): markFailed sets last_sample_error + AST gate
Closes the bug class surfaced by Canvas E2E #2632: a workspace ends up
status='failed' with last_sample_error=NULL, and operators (or the
E2E poll loop) see the useless "Workspace failed: (no last_sample_error)"
with no triage signal.

Two pieces:

1. **bundle/importer.go markFailed** — the UPDATE was setting only
   status, leaving last_sample_error NULL. Same incident class as the
   silent-drop bugs in PRs #2811 + #2824, different code path.
   markProvisionFailed in workspace_provision_shared.go has set the
   message column for a long time; this writer drifted the convention.
   Fix: include last_sample_error in the SET clause + the broadcast.

2. **AST drift gate** (db/workspace_status_failed_message_drift_test.go)
   — Go AST walk that finds every db.DB.{Exec,Query,QueryRow}Context
   call whose argument list binds models.StatusFailed and asserts the
   SQL literal contains last_sample_error. Catches the next caller
   that drifts the same convention. Verified to FAIL against the bug
   shape (reverted importer.go temporarily — gate flagged the exact
   line) and PASS against the fix.

Why an AST gate vs a regex: pre-fix attempt with a regex over UPDATE
statements flagged status='online' / status='hibernating' / status=
'removed' UPDATEs as false positives. Walking the AST and only
flagging calls that pass the StatusFailed constant eliminates that.
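A reduced sketch of the gate's idea follows. This is not the shipped test: the real gate walks the package on disk and matches the db.DB Exec/Query/QueryRow methods, while here any call expression in a source string is inspected:

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strings"
)

// countDriftingCalls: for every call that binds the StatusFailed
// constant, require a string literal in the same call to mention
// last_sample_error. Returns the number of drifting calls.
func countDriftingCalls(src string) int {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		panic(err)
	}
	drifts := 0
	ast.Inspect(f, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		bindsFailed, sqlHasMessage := false, false
		for _, arg := range call.Args {
			switch a := arg.(type) {
			case *ast.SelectorExpr:
				if a.Sel.Name == "StatusFailed" {
					bindsFailed = true
				}
			case *ast.BasicLit:
				if a.Kind == token.STRING && strings.Contains(a.Value, "last_sample_error") {
					sqlHasMessage = true
				}
			}
		}
		if bindsFailed && !sqlHasMessage {
			drifts++
		}
		return true
	})
	return drifts
}

func main() {
	bad := "package p\nfunc f() { db.ExecContext(ctx, \"UPDATE workspaces SET status=$1\", models.StatusFailed) }"
	fmt.Println(countDriftingCalls(bad))
}
```

Keying on the StatusFailed constant rather than the SQL text is what avoids the status='online' / status='removed' false positives the regex attempt hit.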

Out of scope (filed separately if needed):
- The Canvas E2E that surfaced the missing message (#2632) is now a
  required check on staging via PR #2827. Once this fix lands the
  next staging push should re-run #2632's failing case and produce
  a meaningful last_sample_error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:08:08 -07:00
Hongming Wang
0134353a48
Merge pull request #2838 from Molecule-AI/feat/memories-update-endpoint
feat(memories): PATCH /workspaces/:id/memories/:id endpoint for edits
2026-05-05 04:06:01 +00:00
Hongming Wang
aca7d99152
Merge pull request #2837 from Molecule-AI/feat/rfc2829-pr4-operator-dashboard
feat(delegations): operator dashboard endpoint over the durable ledger (RFC #2829 PR-4)
2026-05-05 04:01:46 +00:00
Hongming Wang
aec0fb35d2 feat(memories): PATCH /workspaces/:id/memories/:id endpoint for edits
Pre-fix the only writes to agent_memories were Commit (POST) and
Delete (DELETE). Editing an entry meant delete + recreate, losing the
original id and created_at, and (the user-visible reason for filing
this) leaving the canvas Memory tab without an Edit button at all.

Adds PATCH that accepts content, namespace, or both — at
least one required (empty body 400s; silently no-op'ing would let a
buggy client think it succeeded). The full Commit security pipeline
is re-run on content edits:
  - redactSecrets on every scope (#1201 SAFE-T)
  - GLOBAL [MEMORY → [_MEMORY delimiter escape (#807 SAFE-T)
  - GLOBAL audit log row mirroring Commit's #767 forensic pattern
  - re-embed via the configured EmbeddingFunc (skipping would leave
    the row's vector pointing at the OLD content, silently breaking
    semantic search)

Cross-scope edits (LOCAL→GLOBAL) intentionally NOT supported — that's
delete + recreate so the GLOBAL access-control gate (only root
workspaces can write GLOBAL) gets re-evaluated cleanly.
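The body-validation contract above can be sketched as follows; the type and function names are illustrative, not the handler's actual identifiers:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// memoryPatch / validatePatch are hypothetical names sketching the
// PATCH contract: at least one of content/namespace must be present
// (empty body 400s rather than silently no-op'ing), and content,
// when given, must be non-empty.
type memoryPatch struct {
	Content   *string `json:"content"`
	Namespace *string `json:"namespace"`
}

func validatePatch(p memoryPatch) error {
	if p.Content == nil && p.Namespace == nil {
		return errors.New("at least one of content or namespace is required") // empty body → 400
	}
	if p.Content != nil && strings.TrimSpace(*p.Content) == "" {
		return errors.New("content must not be empty") // empty content → 400
	}
	return nil
}

func main() {
	fmt.Println(validatePatch(memoryPatch{}))
}
```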

7 new sqlmock tests pin: namespace-only, content-only LOCAL,
content-only GLOBAL with audit + escape, empty-body 400, empty-
content 400, 404 on missing/wrong-workspace memory, no-op 200 with
changed=false (and crucially: no UPDATE fires on no-op).

Build clean, full handlers test suite (./internal/handlers) passes
in 4s.

PR-2 (frontend): Add modal + Edit button in MemoryInspectorPanel.tsx
will land separately.
2026-05-04 21:00:47 -07:00
Hongming Wang
2ed4f4fb41 feat(delegations): operator dashboard endpoint over the durable ledger (RFC #2829 PR-4)
Two read endpoints over the `delegations` table (PR-1 schema):

  GET /admin/delegations[?status=in_flight|stuck|failed|completed&limit=N]
  GET /admin/delegations/stats

## What this gives operators

Without this, post-incident investigation requires direct DB access —
only the on-call SRE can answer "is workspace X delegating to a wedged
callee?". This moves that visibility into the same surface as
/admin/queue, /admin/schedules-health, /admin/memories.

## List endpoint

Status filter via tight allowlist:

  - in_flight (default) → status IN (queued, dispatched, in_progress)
  - stuck               → status='stuck' (rows the PR-3 sweeper marked)
  - failed              → status='failed'
  - completed           → status='completed'

Unknown status → 400 with the allowlist in the error body. Limit
1..1000, default 100.

The status allowlist drives a parameterized IN clause (no string-
concatenation of user-controlled values into SQL).
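The allowlist-to-IN-clause shape can be sketched as below; statusFilters mirrors the documented table, while buildInClause is a hypothetical helper showing how values travel only as bind parameters:

```go
package main

import (
	"fmt"
	"strings"
)

// statusFilters mirrors the documented allowlist; buildInClause is a
// hypothetical helper sketching the parameterized IN clause.
var statusFilters = map[string][]string{
	"in_flight": {"queued", "dispatched", "in_progress"},
	"stuck":     {"stuck"},
	"failed":    {"failed"},
	"completed": {"completed"},
}

func buildInClause(key string) (string, []any, bool) {
	statuses, ok := statusFilters[key]
	if !ok {
		return "", nil, false // handler responds 400 with the allowlist
	}
	ph := make([]string, len(statuses))
	args := make([]any, 0, len(statuses))
	for i, s := range statuses {
		ph[i] = fmt.Sprintf("$%d", i+1)
		args = append(args, s) // values bind as parameters, never concatenated
	}
	return "status IN (" + strings.Join(ph, ", ") + ")", args, true
}

func main() {
	clause, args, _ := buildInClause("in_flight")
	fmt.Println(clause, args)
}
```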

Result rows expose all the audit-grade fields the dashboard needs:
delegation_id, caller_id, callee_id, task_preview, status,
last_heartbeat, deadline, result_preview, error_detail, retry_count,
created_at, updated_at. Nullable fields use pointer types so JSON
omits them when NULL (no false-zero "" for missing values).

## Stats endpoint

Zero-fills every known status key (queued, dispatched, in_progress,
completed, failed, stuck) so the dashboard summary card doesn't have
to handle "missing key vs zero" branching.

## Out of scope (deferred)

  - "retry this stuck task" mutation: needs the agent-side cutover
    (RFC #2829 PR-5 plan) before re-fire is safe
  - p95 / p99 duration aggregates: separate metric exposure, not a
    row-level read endpoint
  - Canvas UI: this is the JSON contract; the canvas operator panel
    consumes it in a follow-up canvas PR

## Wiring

NOT wired into the router in this PR — ships separately to keep
PR-by-PR review surface tight. Wiring will land in the
`enable-rfc2829-server-side` follow-up PR alongside the sweeper Start
call and the result-push flag flip.

## Coverage

11 unit tests:

List (8):
  - default status=in_flight, IN(queued,dispatched,in_progress)
  - status=stuck → IN(stuck)
  - status=failed → IN(failed)
  - unknown status → 400 with allowlist
  - negative limit → 400
  - over-cap limit → 400
  - custom limit accepted + echoed in response
  - nullable fields populated correctly (pointer-omitempty)

Stats (2):
  - zero-fills missing status keys
  - empty table → all counts zero

Contract pin (1):
  - statusFilters table shape — every documented key + value pair
    pinned, so accidental edits surface as test drift (forward defense).

Refs RFC #2829.
2026-05-04 20:58:17 -07:00
Hongming Wang
02b325063b feat(delegations): stuck-task sweeper with deadline + heartbeat-staleness rules (RFC #2829 PR-3)
Periodically scans the `delegations` table (PR-1 schema) for in-flight
rows that need terminal action:

  1. Deadline-exceeded → marked `failed` with "deadline exceeded by sweeper"
  2. Heartbeat-stale (no beat for >10× heartbeat interval) → marked `stuck`

## Why both rules

Deadline catches forever-heartbeating wedged agents (the alive-but-not-
advancing class — agent loops on heartbeat call inside its main loop).
Heartbeat-staleness catches OOM-killed and crashed agents that stop cold
without graceful shutdown. Either rule alone misses one of these classes.

## Order matters

Deadline is checked first. A deadline-exceeded AND stale row is marked
`failed` (operator action: investigate + give up), not `stuck` (operator
action: investigate + retry). The semantic difference matters.

## NULL heartbeat is a free pass

A delegation that's just been inserted but hasn't emitted its first
heartbeat yet is NOT stuck-marked — gives the agent its first beat
window. Lets the deadline catch true never-started rows naturally.

## Concurrent-completion safety

Sweep races with UpdateStatus on a delegation that just completed: the
ledger's terminal forward-only protection (PR-1) returns ErrInvalidTransition,
sweeper logs + counts in Errors, the row stays correctly in completed.

## Configuration

  - DELEGATION_SWEEPER_INTERVAL_S — tick cadence (default 5min)
  - DELEGATION_STUCK_THRESHOLD_S  — heartbeat-staleness threshold (default 10min)

Both fall back gracefully on invalid input (typo'd env shouldn't crash
startup). Both read at construction time so a long-running process
picks up overrides via restart.

## Wiring

NOT wired into main.go in this PR — that ships separately so the
sweeper can be enabled/disabled independently of the binary upgrade.
The sweeper is a standalone Sweep(ctx) callable + Start(ctx) ticker
loop, both with panic recovery, both indexed-scan-cheap on the
partial idx_delegations_inflight_heartbeat from PR-1.

## Coverage

13 unit tests against sqlmock-backed *sql.DB:

Sweep semantics (8 tests):
  - empty in-flight set → clean no-op
  - deadline → failed
  - heartbeat-stale → stuck
  - NULL heartbeat is left alone (first-beat free pass)
  - healthy row → no-op
  - both-rule row → marked failed (deadline wins)
  - mixed set → both rules fire on the right rows
  - concurrent-completion race → forward-only protection holds

Env override parsing (5 tests):
  - default on missing env
  - parses positive seconds
  - falls back on garbage
  - falls back on negative
  - constructor picks up overrides; defaults when env unset

Refs RFC #2829.
2026-05-04 20:55:13 -07:00
Hongming Wang
ae79b9e9fe feat(delegations): result-push to caller inbox behind feature flag (RFC #2829 PR-2)
When a delegation completes (or fails), also write an
`activity_type='a2a_receive'` row to the caller's activity_logs so the
caller's inbox poller (workspace/inbox.py — `?type=a2a_receive`) surfaces
the result to the agent.

Why: today the only way the caller agent learns about a delegation result
is by holding open an HTTP `message/send` connection through the platform
proxy. That connection has a hard timeout (~600s) — a 90-iteration
external-runtime task on stream output routinely blows past it, and the
result emitted after the timeout lands in /dev/null. (Hongming's home
hermes hit this on 2026-05-05 — task was actively heartbeating "iteration
14/90" when the proxy timer fired.)

This PR adds the SERVER-SIDE result-push so the result is durably
delivered to the caller's inbox queue. The agent-side cutover (replace
sync httpx delegation with delegate_task_async + wait_for_message poll)
ships in the next PR — once both land, the proxy timeout class is gone.

## Feature flag

`DELEGATION_RESULT_INBOX_PUSH=1` enables the push. Default off — staging
canary first, flip after the RFC #2829 agent-side cutover (PR-5) lands
and proves
the round-trip end-to-end. With the flag off, behavior is byte-identical
to before this PR (verified by TestUpdateStatus_FlagOff_NoNewSQL).

## Two write sites

  1. UpdateStatus handler (POST /workspaces/:id/delegations/:id/update)
     — agent-initiated delegations report status here
  2. executeDelegation goroutine — canvas-initiated delegations
     (POST /workspaces/:id/delegate) report status from this background
     goroutine

Both paths call `pushDelegationResultToInbox` which is best-effort: an
INSERT failure logs but does NOT propagate up. The existing
`delegate_result` row in activity_logs (the dashboard view) remains
authoritative; the new `a2a_receive` row is purely additive for the
inbox-poller to surface.

## Coverage

6 new tests in delegation_inbox_push_test.go:

  - flag off → no SQL fired (the rollout-safety contract)
  - flag on, completed → a2a_receive row with status=ok
  - flag on, failed    → a2a_receive row with status=error + error_detail
  - UpdateStatus end-to-end (flag on, completed)
  - UpdateStatus end-to-end (flag on, failed)
  - UpdateStatus end-to-end (flag off, byte-identical to pre-PR behavior)

All 30 existing delegation_test.go tests still pass — flag default off
keeps the strict-sqlmock surface unchanged.

Refs RFC #2829.
2026-05-04 20:50:46 -07:00
Hongming Wang
ed6dfe01e5 feat(delegations): durable per-task ledger + audit-write helper (RFC #2829 PR-1)
Adds the `delegations` table and the DelegationLedger writer that PRs #2-#4
of RFC #2829 build on. Schema-only foundation — no behavior change in this
PR. PR-2 wires the ledger into the existing handlers and ships the result-
push-to-inbox cutover behind a feature flag.

Why a dedicated table when activity_logs already records every delegation
event:

  Today, "what is currently in flight for this workspace" is reconstructed
  by GROUPing activity_logs by delegation_id and ORDER BY created_at DESC.
  PR-3's stuck-task sweeper needs the join

      SELECT delegation_id FROM delegations
      WHERE status = 'in_progress'
        AND last_heartbeat < now() - interval '10 minutes'

  which is impossible to express against the event stream without a window
  over every (delegation_id, latest event) pair — a planner-killing query
  at scale. The dedicated table makes the sweeper an indexed scan.

  Same posture as tenant_resources (PR #2343, memory
  `reference_tenant_resources_audit`): activity_logs remains the audit-
  grade source of truth, delegations is the queryable view for dashboards
  + sweeper joins. Symmetric writes — both tables are written, neither
  blocks orchestration on the other's failure.

Schema highlights:
  - delegation_id PRIMARY KEY (caller-chosen, idempotent retry on
    restart is a no-op via ON CONFLICT DO NOTHING)
  - caller_id / callee_id NOT FK — workspace delete must NOT cascade-
    delete delegation history (audit retention)
  - status CHECK constraint enforces the lifecycle
    (queued|dispatched|in_progress|completed|failed|stuck)
  - last_heartbeat NULL-able; PR-3 sweeper compares to NOW()
  - deadline default now()+6h matches longest-observed legit delegation
    (memory-namespace migrations) — protects against forever-heartbeating
    wedged agents
  - Partial index `idx_delegations_inflight_heartbeat` keeps the sweeper
    hot path tiny (only non-terminal rows)
  - UNIQUE(caller_id, idempotency_key) WHERE NOT NULL — natural
    collision becomes ON CONFLICT no-op without colliding across callers

DelegationLedger.SetStatus enforces forward-only on terminal states
(completed/failed/stuck cannot be revised) as defense-in-depth on the
schema CHECK. Same-status replay is a no-op. Missing-row SetStatus is
a no-op (transient inconsistency the next agent retry will heal).

Heartbeat updates only in-flight rows — terminal-state delegations are
silently skipped.

Coverage:
  - 17 unit tests against sqlmock-backed *sql.DB (Insert happy path,
    missing-required guards, truncation, lifecycle transitions, terminal
    forward-only protection, replay no-op, missing-row no-op, empty-input
    rejection, heartbeat semantics, transition table shape)
  - Migration roundtrip verified on a real Postgres 15 instance:
    up creates the expected schema with all 4 indexes + CHECK, down
    drops everything cleanly.

Refs RFC #2829.
2026-05-04 20:43:06 -07:00
Hongming Wang
46d79a3e3b
Merge pull request #2824 from Molecule-AI/fix/stop-workspace-auto-saas-1777945000
fix(provision): StopWorkspaceAuto mirror — close SaaS EC2-leak class
2026-05-05 03:05:09 +00:00
Hongming Wang
2198f92dcb
Merge pull request #2823 from Molecule-AI/feat/codex-tab-pypi-install
feat(external-templates): codex tab uses plain pip install
2026-05-05 03:03:08 +00:00
Hongming Wang
11c9ed2a46 fix(provision): StopWorkspaceAuto mirror — close SaaS EC2-leak class
Closes #2813 (team-collapse) and #2814 (workspace delete).

Two leaks, one class. Both call sites had the same shape pre-fix:

  if h.provisioner != nil {
      h.provisioner.Stop(ctx, wsID)
  }

On SaaS where h.provisioner (Docker) is nil and h.cpProv is set, that
gate evaluates false and the EC2 keeps running. Workspace gets marked
removed in DB; EC2 lives on until the orphan sweeper catches it.

Same drift class as PR #2811's org-import provision bug — a Docker-
only check on what should be a both-backend operation. Confirmed in
production: PR #2811's verification step deleted a test workspace and
the EC2 stayed running until I terminated it manually.

Fix: WorkspaceHandler.StopWorkspaceAuto(ctx, wsID) — symmetric mirror
of provisionWorkspaceAuto. CP first, Docker second, no-op when neither
is wired (a workspace nobody is running can't be stopped — that's a
no-op, not a failure, distinct from provision's mark-failed contract).

Three call-site changes:
- team.go:208 (Collapse) → h.wh.StopWorkspaceAuto(ctx, childID)
- workspace_crud.go:432 (stopAndRemove) → h.StopWorkspaceAuto(...);
  RemoveVolume stays Docker-only behind an explicit gate since
  CP-managed workspaces have no host-bind volumes
- TeamHandler.provisioner field + NewTeamHandler's *Provisioner param
  removed as dead code (Stop was the only call site)

Volume cleanup separation is intentional: the abstraction is "stop
the running workload," not "tear down all state." Callers that need
volume cleanup keep their `if h.provisioner != nil { RemoveVolume }`
gate AFTER the Stop call.

Tests:
- TestStopWorkspaceAuto_RoutesToCPWhenSet — SaaS path
- TestStopWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted
- TestStopWorkspaceAuto_NoBackendIsNoOp — pins the contract distinction
  from provisionWorkspaceAuto's mark-failed
- TestNoCallSiteCallsBareStop — source-level pin against
  `.provisioner.Stop(` / `.cpProv.Stop(` outside the dispatcher,
  per-backend bodies, restart helper, and the Docker-daemon-direct
  short-lived-container path. Strips Go comments before substring
  match so archaeology in code comments doesn't trip the gate.
- Verified: pin FAILS against the buggy shape (workspace_crud.go
  reversion); team.go reversion compile-fails because the field is
  gone — even stronger than the test.

Out of scope (tracked under #2799):
- workspace_restart.go's manual if-cpProv-else dispatch with retry
  semantics tuned for the restart hot path. Functionally equivalent
  + wraps cpStopWithRetry, so it's not the bug class this PR closes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 20:00:23 -07:00
Hongming Wang
c0bfd19b9e feat(external-templates): codex tab uses plain pip install for bridge daemon
`codex-channel-molecule` 0.1.0 is now on PyPI, so operators no longer need
the `git+https://...` URL workaround.

Verified: `pip install codex-channel-molecule` from a clean venv installs
the wheel and the `codex-channel-molecule --help` console script runs.

PyPI: https://pypi.org/project/codex-channel-molecule/0.1.0/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:58:56 -07:00
Hongming Wang
e0f9434eaf fix(files-eic): silence ssh known-hosts warning that 500'd Hermes config load
GET /workspaces/:id/files/config.yaml on hongming.moleculesai.app's
Hermes workspace returned 500 with body:

  ssh cat: exit status 1 (Warning: Permanently added '[127.0.0.1]:37951'
   (ED25519) to the list of known hosts.)

Root cause: ssh emits the "Permanently added" notice on every fresh
tunnel connection, even with UserKnownHostsFile=/dev/null (that
prevents persistence, not the warning). It lands on stderr, fooling
readFileViaEIC's classifier:

  if len(out) == 0 && stderr.Len() == 0 {
      return nil, os.ErrNotExist
  }
  return nil, fmt.Errorf("ssh cat: %w (%s)", runErr, ...)

stderr was non-empty (the warning), so we returned the wrapped error
→ 500 from the HTTP layer instead of 404.

Fix: add `-o LogLevel=ERROR` to BOTH writeFileViaEIC and readFileViaEIC
ssh invocations. Silences info+warning while keeping real auth/tunnel
errors visible (those emit at ERROR level).

Test: TestSSHArgs_LogLevelErrorBothSites pins the flag in both blocks.
Mutation-tested: stripping the flag from one site fails the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 19:58:49 -07:00
Hongming Wang
7e1fdf5847 refactor(provision): use HasProvisioner() at all gate-y both-nil checks
SSOT pass — replace 4 bare `h.provisioner == nil && h.cpProv == nil`
checks with `!h.HasProvisioner()`. When a third backend lands (k8s,
containerd, whatever), HasProvisioner gets one new field; bare both-nil
checks would each need to be hunted and updated.
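A minimal sketch of the SSOT shape (field names follow the commit; the
method body is an assumption about the obvious implementation):

```go
package main

import "fmt"

type provisioner struct{}
type cpProvisioner struct{}

type WorkspaceHandler struct {
	provisioner *provisioner   // Docker backend
	cpProv      *cpProvisioner // control-plane (EC2) backend
}

// HasProvisioner is the single source of truth for "some backend is
// wired". A third backend adds one field here instead of a hunt for
// every bare both-nil check.
func (h *WorkspaceHandler) HasProvisioner() bool {
	return h.provisioner != nil || h.cpProv != nil
}

func main() {
	saas := &WorkspaceHandler{cpProv: &cpProvisioner{}}
	fmt.Println(saas.HasProvisioner()) // true — the SaaS-only shape the bare Docker check missed
	fmt.Println((&WorkspaceHandler{}).HasProvisioner()) // false
}
```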

Sites:
- a2a_proxy_helpers.go:166 — maybeMarkContainerDead skip-no-backend
- workspace_restart.go:118 — Restart endpoint guard
- workspace_restart.go:363 — RestartByID coalescer guard
- workspace_restart.go:660 — Resume endpoint guard

Adds TestNoBareBothNilCheck (source-level) so the antipattern can't
slip back in.

Out of scope but discovered during the audit (filed separately):
- team.go:207 — team-collapse Stop is Docker-only, leaks EC2 on SaaS
- workspace_crud.go:423 — workspace delete cleanup is Docker-only,
  leaks EC2 on SaaS
Both need a StopWorkspaceAuto mirror of provisionWorkspaceAuto. Same
class of bug as today's org-import incident, different verb (stop vs
provision).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:51:53 -07:00
Hongming Wang
d084d7e61a fix(provision): consolidate org-import gate + Auto self-marks-failed
Two changes that close the silent-drop bug class:

1. Add WorkspaceHandler.HasProvisioner() and use it as the org-import
   gate. Pre-fix, org_import.go:178 read `h.provisioner != nil` (Docker-
   only) — on SaaS tenants where cpProv is wired but Docker is nil, the
   entire 220-line provisioning prep block was skipped. The Auto call
   PR #2798 added at line 395 was unreachable on SaaS.

   Repro: 2026-05-05 01:14 — hongming prod tenant, 7-workspace org
   import, every workspace sat in 'provisioning' for 10 min until the
   sweeper marked it failed with the misleading "container started but
   never called /registry/register".

2. provisionWorkspaceAuto self-marks-failed on the no-backend path.
   Defense in depth: even if a future caller bypasses HasProvisioner
   gating or ignores the bool return (TeamHandler pre-#2367 did exactly
   this), the workspace ends in a clean failed state with an actionable
   error message instead of lingering until the 10-min sweep.

   Auto becomes the single source of truth for "start a workspace" —
   routing AND the no-backend failure path. Create's redundant
   if-not-Auto-then-mark-failed block collapses (kept only the
   workspace_config UPSERT, which is a Create-specific UI concern for
   rendering runtime/model on the Config tab).

Tests:
- TestProvisionWorkspaceAuto_NoBackendMarksFailed pins the new contract
- TestHasProvisioner_TrueOnCPOnly catches the SaaS-only blind spot
- TestHasProvisioner_TrueOnDockerOnly preserves self-hosted shape
- TestHasProvisioner_FalseWhenNeitherWired pins the gate-out path
- TestOrgImportGate_UsesHasProvisionerNotBareField source-pins the gate
  (verified: FAILS against the buggy `h.provisioner != nil` shape, PASSES
  with `h.workspace.HasProvisioner()`)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:47:02 -07:00
Hongming Wang
dfd0bc528c fix(external-templates): codex-channel-molecule via git+ URL (not on PyPI yet)
Mirrors the pattern hermes-channel-molecule uses (line 256). Drops
the broken `pip install codex-channel-molecule` which would 404.
PyPI publish workflow is a separate piece of work — until then,
git+https install is the path operators get.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:29:23 -07:00
Hongming Wang
4ea6f437e9 feat(external-templates): codex tab now includes the bridge-daemon inbound path
The codex tab in the External Connect modal had an "outbound-tools-only
first cut" caveat — operators got the MCP wiring for codex calling
platform tools, but there was no documented inbound path. Canvas
messages couldn't wake an idle codex session.

That gap is now filled by codex-channel-molecule
(github.com/Molecule-AI/codex-channel-molecule), shipped today as the
codex counterpart to hermes-channel-molecule. The daemon long-polls
the platform inbox, runs `codex exec --resume <session>` per inbound
message, captures the assistant reply, routes it back via
send_message_to_user / delegate_task, and acks the inbox row.
Per-thread session continuity persisted to disk so daemon restarts
don't lose conversation context.

This commit:
- Updates externalCodexTemplate to include `pip install
  codex-channel-molecule` (step 1) and a foreground `nohup
  codex-channel-molecule` invocation (step 3) using the same env-var
  contract as the MCP server (WORKSPACE_ID + PLATFORM_URL +
  MOLECULE_WORKSPACE_TOKEN).
- Adds a "Canvas messages don't wake codex" common-issues entry to the
  TAB_HELP codex section pointing at the bridge daemon log.
- Updates the doc comment to record the upstream deprecation path:
  when openai/codex#17543 lands, the bridge becomes redundant and the
  wired MCP server delivers push natively.

Verified: TestExternalTemplates_NoMoleculeOrgIDPlaceholder still
passes (no MOLECULE_ORG_ID re-introduction); full handlers suite
green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:28:35 -07:00
Hongming Wang
0f389ba325
Merge pull request #2804 from Molecule-AI/fix/external-templates-drop-molecule-org-id
fix(external-templates): drop MOLECULE_ORG_ID from codex/openclaw/hermes snippets
2026-05-05 00:38:45 +00:00
Hongming Wang
472862bc50 fix(external-templates): drop MOLECULE_ORG_ID from operator-facing snippets
Codex / openclaw / hermes-channel snippets each instructed operators
to set `MOLECULE_ORG_ID = "<your org id>"`. The molecule_runtime MCP
subprocess these snippets spawn never reads MOLECULE_ORG_ID — that
env var is consumed only by workspace-server's TenantGuard
middleware, server-side, on the tenant box itself (set by the control
plane via user-data on provision).

External operator → tenant calls pass TenantGuard via the
isSameOriginCanvas path (Origin matches Host), with auth via Bearer
token + X-Workspace-ID. The universal_mcp snippet — which calls into
the same molecule_runtime — has always (correctly) omitted
MOLECULE_ORG_ID; this brings codex / openclaw / hermes-channel into
line.

Symptom that caught it: an external codex CLI session, after pasting
the codex-tab snippet, surfaced "MOLECULE_ORG_ID is still set to
'<your org id>'" as an unresolved blocker — agent reasonably treated
the placeholder as required setup. Operator has no value to fill.

Pinned with a structural test
(TestExternalTemplates_NoMoleculeOrgIDPlaceholder) so the placeholder
can't drift back across all six external-tab templates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:30:07 -07:00
Hongming Wang
b5435b4732 fix(memory v2): warn at boot when cutover env half-configured
MEMORY_V2_CUTOVER=true gates the admin export/import path on the v2
plugin, but the cutoverActive() check in admin_memories.go silently
returns false when the plugin isn't wired:

  func (h *AdminMemoriesHandler) cutoverActive() bool {
      if os.Getenv(envMemoryV2Cutover) != "true" {
          return false
      }
      return h.plugin != nil && h.resolver != nil
  }

Two operator misconfigs hit the silent-fallback path:

  1. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL unset
     → wiring.Build returns nil → handler stays on legacy SQL path
     → operator sees no error, assumes cutover is live, but every
        request still writes the legacy table.

  2. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL set, but plugin
     unreachable at boot
     → wiring.Build still returns the bundle (intentional — circuit
        breaker handles ongoing unavailability), but every cutover
        write quietly falls back via the breaker.
     → only signal: legacy table keeps growing.

Both are exactly the "structurally invisible until prod" failure
mode; the only real-world detection today is "notice the legacy
table is still being written to," which no operator will check.

Add loud, distinctive WARN log lines at Build() time for both
shapes. Boot logs are operator-visible, so a half-config is
immediately obvious without needing dashboards.

Tests:
  * 4 new (cutover+no-URL → warn, neither set → silent, cutover+probe-
    fail → loud warn, probe-fail-without-cutover → quiet generic)
  * 6 existing (still pass; pin no-warning-on-happy-path)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:24:11 -07:00
Hongming Wang
f1b72af97e
Merge pull request #2798 from Molecule-AI/fix/org-import-saas-routing-1777938328
fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2 — closes #2486
2026-05-04 23:54:37 +00:00
Hongming Wang
19e7acdc22 fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2
Org-import called h.workspace.provisionWorkspace directly — same silent-
drop bug that bit TeamHandler.Expand on 2026-05-04 (see workspace.go
:121-125 comment + #2486). Symptom on SaaS: every claude-code workspace
sat in "provisioning" until the 600s sweeper marked it failed with
"container started but never called /registry/register" — because no
container ever existed; the goroutine returned silently when the Docker
provisioner field was nil.

User reproduced 2026-05-04 ~22:30Z importing a 7-workspace template on
the hongming prod tenant. Tenant CP logs (queried live via SSM) showed
ZERO "Provisioner: goroutine entered" or "CPProvisioner: goroutine
entered" lines for any of the 7 failed workspace UUIDs in the 60min
window — confirming the goroutine never ran past line 384 of
org_import.go because provisionWorkspace returned early in SaaS mode.

The fix is one line: replace h.workspace.provisionWorkspace with
h.workspace.provisionWorkspaceAuto. Auto is the single source of
truth for backend selection (workspace.go:130) — picks CP-mode when
h.cpProv is wired, Docker-mode when h.provisioner is wired, returns
false when neither.

ALSO adds a generic source-level gate
(TestNoCallSiteCallsDirectProvisionerExceptAuto) so the next future
caller can't repeat the pattern. Walks every non-test .go file in
handlers/ and fails if any direct call to provisionWorkspace( or
provisionWorkspaceCP( appears outside the dispatcher's own definition
file.

The gate currently allows workspace_restart.go which has its own
manual if-h.cpProv-else dispatch (functionally equivalent to Auto,
not the bug class — but is architectural duplication; follow-up
filed for proper de-dup).

Test plan:
- TestOrgImport_UsesAutoNotDirectDockerPath: pin the org_import.go
  call site
- TestNoCallSiteCallsDirectProvisionerExceptAuto: generic gate against
  future drift
- TestTeamExpand_UsesAutoNotDirectDockerPath (existing): symmetric for
  team.go

All 3 + the rest of the handler suite pass.

Closes #2486
Pairs with: PR #2794 (configurable provision concurrency) which made
            it possible to bisect concurrency-vs-routing as the cause

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:49:07 -07:00
Hongming Wang
872b781f64
Merge pull request #2792 from Molecule-AI/feat/drop-shared-context
feat: drop shared_context — use memory v2 team namespace
2026-05-04 23:37:49 +00:00
Hongming Wang
3bc7749e84 feat(org-import): make provision concurrency configurable via env
Org-import was hard-capped at 3 concurrent workspace provisions (#1084),
calibrated for Docker-mode workspaces where each provision was a
docker-run. Now that workspaces are EC2 instances, AWS RunInstances
parallelises happily and the artificial cap of 3 makes a 7-workspace
org-import take 3-4× longer than necessary (3 batches × ~70s/provision
≈ 4 min wall time when AWS could absorb all 7 in parallel for ~70s).

This PR makes the cap configurable via MOLECULE_PROVISION_CONCURRENCY:
  unset    → 3 (Docker-mode default, unchanged)
  "0"      → effectively unlimited (SaaS / EC2 backend; AWS rate-limit
             + vCPU quota are the real backpressure)
  N>0      → exactly N
  N<0      → fall back to default 3 + warning log
  garbage  → fall back to default 3 + warning log

The "0 = unlimited" mapping is the user-facing convention requested for
SaaS deployments — operators don't have to pick an arbitrary large
number. Internally "0" is mapped to 1<<20, so the channel-based
semaphore effectively never blocks without risking an unbounded buffer.
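The resolution table above can be sketched as (helper name is
illustrative; the warning log is elided):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const defaultProvisionConcurrency = 3
const unlimited = 1 << 20 // large enough that the channel semaphore never blocks

// resolveConcurrency maps MOLECULE_PROVISION_CONCURRENCY to a cap:
// unset → 3, "0" → effectively unlimited, N>0 → N, negative/garbage → 3.
func resolveConcurrency(raw string) int {
	s := strings.TrimSpace(raw)
	if s == "" {
		return defaultProvisionConcurrency
	}
	n, err := strconv.Atoi(s)
	if err != nil || n < 0 {
		// negative or non-numeric: warn (elided) and keep the default
		return defaultProvisionConcurrency
	}
	if n == 0 {
		return unlimited
	}
	return n
}

func main() {
	fmt.Println(resolveConcurrency(""))    // 3
	fmt.Println(resolveConcurrency("0"))   // 1048576
	fmt.Println(resolveConcurrency(" 7 ")) // 7
	fmt.Println(resolveConcurrency("-2"))  // 3
	fmt.Println(resolveConcurrency("huh")) // 3
}
```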

Test coverage (org_provision_concurrency_test.go, 6 cases / 15 subtests):
- unset → default
- "0" → large unlimited cap
- positive integer exact (1, 5, 10, 50)
- negative → default + warning
- non-numeric → default + warning
- whitespace-trimmed (" 7 " → 7)

Boot-time log line confirms the resolved cap so an operator can verify
their env is being honored without re-deploying.

Does NOT address the separate 600s "never registered" timeout the user
also reported during org-import — that's filed as molecule-core#2793
for proper investigation (parallel-provision contention, network
routing, register-retry budget, or container-start failure are all
candidates and need live SSM capture to bisect).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:33:49 -07:00
Hongming Wang
2f7beb9bce feat: drop shared_context — use memory v2 team namespace instead
Parent → child knowledge sharing previously lived behind a `shared_context`
list in config.yaml: at boot, every child workspace HTTP-fetched its parent's
listed files via GET /workspaces/:id/shared-context and prepended them as
a "## Parent Context" block. That paid the full transfer cost on every
boot regardless of whether the agent needed it, made the parent a single
point of failure, offered no team or org scope, and broke if the parent
was unreachable.
Replace with memory v2's team:<id> namespace: agents call recall_memory
on demand. For large blob-shaped artefacts see RFC #2789 (platform-owned
shared file storage).

Removed:
- workspace/coordinator.py: get_parent_context()
- workspace/prompt.py: parent_context arg + injection block
- workspace/adapter_base.py: import + call + arg pass
- workspace/config.py: shared_context field + parser entry
- workspace-server/internal/handlers/templates.go: SharedContext handler
- workspace-server/internal/router/router.go: GET /shared-context route
- canvas/src/components/tabs/ConfigTab.tsx: Shared Context tag input
- canvas/src/components/tabs/config/form-inputs.tsx: schema field + default
- canvas/src/components/tabs/config/yaml-utils.ts: serializer entry
- 6 tests pinning the removed behavior; 5 doc references

Added regression gates so any reintroduction is loud:
- workspace/tests/test_prompt.py: build_system_prompt must NOT emit
  "## Parent Context"
- workspace/tests/test_config.py: legacy YAML key loads cleanly but
  shared_context attr must NOT exist on WorkspaceConfig
- tests/e2e/test_staging_full_saas.sh §9d: GET /shared-context must NOT
  return 200 against a live tenant

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:30:26 -07:00
Hongming Wang
9c7b34cb7f fix(workspace files API): GET ReadFile via SSH-EIC for SaaS workspaces
Pre-fix WriteFile (templates.go:436) had an `instance_id != ""` branch
that dispatched to writeFileViaEIC (SSH through EC2 Instance Connect),
but ReadFile (templates.go:362) skipped that branch entirely. ReadFile
always tried `findContainer` (which only works for local-Docker
workspaces, not SaaS EC2-per-workspace ones) and fell through to
`resolveTemplateDir` (which returns the seed template, not the
persisted workspace state).

Net effect on production: every Canvas Config tab open against a
SaaS workspace returned 404 "No config.yaml found" because GET
couldn't see what PUT had written. Visible to users after PR #2781
("show-misconfigured-state") surfaced the 404 as an error UX.

Caught by the synth-E2E 7c gate's GET-back assertion, but
misdiagnosed as a "test bug" and the GET assertion was dropped in
PR #2783 (rather than fixed at the source). This PR closes the loop:

1. New `readFileViaEIC` helper in template_files_eic.go that mirrors
   writeFileViaEIC's SSH-via-EIC dance and runs `sudo -n cat <path>`.
   Returns os.ErrNotExist on missing file (cat exits 1 with empty
   stdout under `2>/dev/null`) so the handler maps it cleanly to 404.

2. ReadFile dispatch now mirrors WriteFile's: when `instance_id` is
   non-empty, use readFileViaEIC; otherwise fall through to the
   local-Docker / template-dir path.

3. ReadFile's DB query expanded to also select instance_id + runtime
   (was just name). Three sqlmock-based tests updated to match the
   new column shape; the existing local-Docker fallback path stays
   green by passing instance_id="" in the mock rows.

Follow-up (separate PR): the synth-E2E 7c gate should restore the
GET-back marker assertion now that the read/write paths are unified.
That'll also catch any future Files API regression in the round-trip.
This PR doesn't touch the gate to keep the scope tight.

Verification:
- go build ./... clean
- full handlers test suite green (0.4s for ReadFile subset; 5.8s
  full)
- The 3 ReadFile sqlmock tests still cover the local-Docker fallback
  (instance_id=""); SaaS EIC dispatch is covered by the upcoming
  re-enabled synth-E2E 7c GET assertion (deferred to follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:02:26 -07:00
Hongming Wang
61d5908817 fix(workspace files API): write claude-code config to /configs, sudo for root-owned base
Root cause of the user-visible 500 ("install: cannot create directory
'/opt/configs': Permission denied") on PUT
/workspaces/<id>/files/config.yaml:

1. Path map fall-through. claude-code wasn't in workspaceFilePathPrefix,
   so resolveWorkspaceFilePath returned the default `/opt/configs/...`.
   That directory doesn't exist on the workspace EC2 — cloud-init in
   provisioner/userdata_containerized.go runs `mkdir -p /configs` only.
   Even if the SSH write had succeeded at /opt/configs, the docker
   container's bind-mount is host:/configs → container:/configs,
   so the file would have been invisible to the runtime.

2. /configs ownership. cloud-init runs as root, so /configs is
   root-owned. The SSH-as-ubuntu install command can't write into it
   without sudo. Hermes wasn't affected because its base path
   (/home/ubuntu/.hermes) is ubuntu-owned.

Two-line fix:

- Add `claude-code: /configs` to the runtime → base-path map and flip
  the default fall-through from `/opt/configs` to `/configs`. Leave the
  pre-existing langgraph/external entries pointing at /opt/configs
  pending a migration audit (no user report on those today, and
  flipping them would silently relocate any files those runtimes
  already wrote).
- Prefix the remote install command with `sudo -n` so the write
  succeeds under the standard EC2 ubuntu/passwordless-sudo posture.
  `-n` (non-interactive) ensures clean failure if that ever changes,
  rather than a hang waiting for a password prompt.
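The post-fix path resolution can be sketched as (map values follow the
commit; the helper name and lookup shape are illustrative):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Runtime → base-path map after the fix. claude-code lands in /configs
// (the host dir cloud-init creates and bind-mounts into the container);
// langgraph/external stay on /opt/configs pending a migration audit.
var workspaceFileBase = map[string]string{
	"claude-code": "/configs",
	"hermes":      "/home/ubuntu/.hermes", // ubuntu-owned, so never hit the sudo issue
	"langgraph":   "/opt/configs",
	"external":    "/opt/configs",
}

func resolveBase(runtime string) string {
	if base, ok := workspaceFileBase[strings.ToLower(runtime)]; ok {
		return base
	}
	return "/configs" // default fall-through flipped from /opt/configs
}

func main() {
	fmt.Println(path.Join(resolveBase("CLAUDE-CODE"), "config.yaml")) // /configs/config.yaml
	fmt.Println(resolveBase("langgraph"))                             // /opt/configs
	fmt.Println(resolveBase(""))                                      // /configs
}
```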

Tests:
- TestResolveWorkspaceFilePath_KnownRuntimes adds claude-code +
  CLAUDE-CODE coverage and updates the empty/unknown default cases
  to expect /configs. The langgraph/external rows stay green
  (unchanged values), confirming the scope of the rename.

Verification:
- go build ./... clean
- go test ./internal/handlers/ green
- The user-reported bug
  (PUT /workspaces/57fb7043-79a0-4a53-ae4a-efb39deb457f/files/config.yaml
   → 500 EACCES on /opt/configs) is the failure mode this fix addresses
  on both axes (path + sudo).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:29:08 -07:00
Hongming Wang
707e4d7342 Memory v2 wiring: replace decorative tests with real integration
Self-review of #2755 found two tests that didn't actually exercise the
production code path:

- TestNamespaceCleanupFn_NamespaceFormat asserted
  "workspace:" + "abc-123" == "workspace:abc-123" — a compile-time
  invariant, not runtime behavior. Provided no protection if the closure
  in Bundle.NamespaceCleanupFn ever stopped using that prefix.

- TestNamespaceCleanupFn_FailureLogsButReturns built a *parallel*
  cleanup closure inline with errors.New, then invoked the parallel
  closure. The production closure was never exercised. A regression
  in NamespaceCleanupFn (e.g. forgetting the deferred recover, calling
  the plugin without nil-check) would still pass this test.

Replaced both with real integration:

- TestNamespaceCleanupFn_HitsPluginAtCorrectNamespace spins up
  httptest.Server, points MEMORY_PLUGIN_URL at it, calls Build(),
  invokes the production closure, and asserts the server actually
  saw DELETE /v1/namespaces/workspace:abc-123.

- TestNamespaceCleanupFn_PluginErrorDoesNotPanic exercises the
  failure path for real: server returns 500 on DELETE, closure must
  log and return without propagating. defer-recover is belt-and-
  suspenders since production calls this from a for-loop in
  workspace_crud.go that has no recover.

Couldn't ship with #2755 because the merge queue locks the branch
once enqueued. Following up now that #2755 is merged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 10:38:59 -07:00
Hongming Wang
46731729d4 Memory v2 fixup Critical: wire plugin from main.go (was fully dormant)
Caught during continued review: the entire v2 plugin system shipped
in PRs #2729-#2742 + #2744-#2751 was never actually invoked because
main.go and router.go don't construct the plugin client/resolver or
attach the WithMemoryV2 / WithNamespaceCleanup hooks.

Operators setting MEMORY_PLUGIN_URL=... saw zero behavior change
because nothing read it. Every fixup we shipped (idempotency, verify
mode, expires_at validation, audit JSON, namespace cleanup, O(N)
export, boot E2E) was also dormant for the same reason.

Root cause: when a multi-handler feature lands across many PRs, none
of them are individually responsible for wiring main.go — and the
master-task-tracking issue didn't gate-check that the wiring landed.
Add main.go integration to every multi-handler RFC checklist.

What ships:

  * internal/memory/wiring/wiring.go: new package that constructs the
    plugin client + resolver from MEMORY_PLUGIN_URL once. Returns nil
    when unset (preserves zero-config legacy behavior). Probes
    /v1/health at boot but doesn't fail-closed — the MCP layer's
    circuit breaker handles ongoing unavailability.

  * internal/memory/wiring/wiring_test.go: 6 tests covering the
    nil/non-nil bundle paths + the namespace-cleanup closure
    contract (nil-safe, format-stable, failure-tolerant).

  * cmd/server/main.go: imports memwiring, calls Build(db.DB) once
    after WorkspaceHandler creation, attaches WithNamespaceCleanup,
    threads the bundle through router.Setup.

  * internal/router/router.go: Setup signature gains *memwiring.Bundle
    param. Inside, attaches WithMemoryV2 to AdminMemoriesHandler and
    MCPHandler when the bundle is non-nil.

After this, the v2 plugin is reachable end-to-end:

  Operator sets MEMORY_PLUGIN_URL → main.Build instantiates client +
  resolver → WorkspaceHandler gets cleanup hook → router wires
  AdminMemoriesHandler + MCPHandler with WithMemoryV2 → MCP tool
  calls (commit_memory_v2, search_memory, etc.) actually do
  something → admin export/import respects MEMORY_V2_CUTOVER.

Prerequisite for #292 (staging verification) — without this, the
operator runbook's step 2 (set MEMORY_PLUGIN_URL, observe behavior)
silently no-ops.

Verified: all 9 affected test packages still green
(memory/{client,contract,e2e,namespace,pgplugin,wiring}, handlers,
router, plus the build).
2026-05-04 10:22:30 -07:00
Hongming Wang
9f47ecf86e
Merge branch 'staging' into fix/memory-v2-i3-export-on 2026-05-04 09:44:37 -07:00
Hongming Wang
ebc20794f3 fix(admin-memories): include each member's private namespace in export
ReadableNamespaces(rootID) returns {workspace:rootID, team:rootID,
org:rootID} — the workspace: namespace it surfaces is the root's only.
The I3 batching change resolved namespaces once per root which silently
dropped every child workspace's private memories from admin export
(workspace:childID never reached the plugin search).

Keep the per-root batching win for team:/org:/custom: namespaces;
inject each member's workspace:<id> + owner mapping explicitly so
coverage matches the legacy per-workspace iteration.

Cost stays at 1 SQL + N_roots resolver + 1 plugin search.
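The fixed batching shape can be sketched as (resolver and names are
stand-ins; the real code feeds the union into one plugin.Search):

```go
package main

import (
	"fmt"
	"sort"
)

type ws struct{ id, rootID string }

// namespacesForExport: every member contributes its private
// workspace:<id> namespace explicitly, while team:/org:/custom:
// namespaces are resolved only once per unique root.
func namespacesForExport(all []ws, resolve func(rootID string) []string) []string {
	seen := map[string]bool{}
	resolvedRoot := map[string]bool{}
	var union []string
	add := func(ns string) {
		if !seen[ns] {
			seen[ns] = true
			union = append(union, ns)
		}
	}
	for _, w := range all {
		add("workspace:" + w.id) // per-member injection — the part the I3 batching dropped
		if !resolvedRoot[w.rootID] {
			resolvedRoot[w.rootID] = true
			for _, ns := range resolve(w.rootID) { // one resolver walk per root
				add(ns)
			}
		}
	}
	sort.Strings(union)
	return union
}

func main() {
	resolve := func(root string) []string { return []string{"team:" + root, "org:" + root} }
	got := namespacesForExport([]ws{{"a", "r"}, {"b", "r"}, {"c", "r"}}, resolve)
	fmt.Println(len(got)) // 5 — 3 workspace + team + org, with the resolver run once
}
```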

Test changes:
- New TestExport_IncludesEveryMembersPrivateNamespace uses a
  per-workspace resolver stub (mirrors real behaviour) and asserts
  every member's workspace:<id> reaches the plugin search AND that
  children's private memories appear in the response with correct
  owner attribution. Verified to FAIL on the pre-fix code.
- TestExport_BatchesPluginCallsByRoot updated to expect 5 namespaces
  (3 workspace + team + org) instead of 3 — it had pinned the buggy
  3-namespace behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 09:44:06 -07:00
Hongming Wang
6b445aae2d Memory v2 fixup I5: workspace purge cleans up plugin namespace
Self-review #291. When a workspace is hard-purged, its
`workspace:<id>` namespace stays in the plugin storage. Over time
deleted workspaces accumulate as orphan namespaces.

Fix: optional namespaceCleanupFn hook on WorkspaceHandler. The
purge path (workspace_crud.go ~line 520) iterates each purged id
and calls the hook best-effort. main.go wires the hook to
plugin.DeleteNamespace when MEMORY_PLUGIN_URL is set; operators
who haven't enabled the plugin keep the no-op default.

Why a hook (not direct plugin import):
  * Keeps WorkspaceHandler decoupled from the memory contract
    package (easier to test, smaller blast radius if the contract
    bumps)
  * Tests inject a captureCleanupHook stub without standing up a
    real plugin client
  * Production wiring stays a one-liner in main.go
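The hook pattern can be sketched like this (names follow the commit;
the logging and real purge SQL are elided):

```go
package main

import "fmt"

// WorkspaceHandler carries an optional cleanup hook; the nil default
// keeps plugin-less deployments a no-op.
type WorkspaceHandler struct {
	namespaceCleanupFn func(namespace string) error
}

func (h *WorkspaceHandler) WithNamespaceCleanup(fn func(string) error) {
	h.namespaceCleanupFn = fn
}

// purgeNamespaces is the per-id loop body from the purge path: best-effort,
// errors never propagate, nil hook is skipped entirely.
func (h *WorkspaceHandler) purgeNamespaces(ids []string) {
	if h.namespaceCleanupFn == nil {
		return
	}
	for _, id := range ids {
		_ = h.namespaceCleanupFn("workspace:" + id) // best-effort; failure is logged, not fatal
	}
}

func main() {
	var got []string
	h := &WorkspaceHandler{}
	h.purgeNamespaces([]string{"a"}) // nil hook: safe no-op
	h.WithNamespaceCleanup(func(ns string) error { got = append(got, ns); return nil })
	h.purgeNamespaces([]string{"a", "b"})
	fmt.Println(got) // [workspace:a workspace:b]
}
```

In production the hook is a one-liner pointing at plugin.DeleteNamespace;
in tests it's a capture stub, which is exactly the decoupling argument
above.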

What gets cleaned up:
  * `workspace:<id>` for each purged workspace
  * NOT `team:<root>` / `org:<root>` — those may still be
    referenced by other workspaces under the same root, so dropping
    them on a single workspace's purge would orphan team/org data
    for the survivors. Operator can purge those manually after
    confirming the entire root is gone.

What stays untouched:
  * Soft-removed workspaces (status='removed', no ?purge=true). The
    grace window is by design — the data should still be there if
    the operator unremoves.

Tests:
  * TestWithNamespaceCleanup_DefaultIsNil pins the safe default
  * TestWithNamespaceCleanup_NilStaysNil pins the explicit-nil case
  * TestWithNamespaceCleanup_AttachesFn pins the wiring
  * TestPurge_CallsCleanupHookPerID exercises the per-id loop body
  * TestPurge_NilHookIsSkipped pins the nil guard

A full end-to-end Delete-handler test requires mocking broadcaster
+ provisioner + descendant SQL chain, which is out-of-scope for a
single fixup. Integration coverage for the wired path lives in
PR-11's E2E swap test (#293 follow-up).
2026-05-04 09:20:37 -07:00
Hongming Wang
9a64aeaa2c Memory v2 fixup I3: admin export O(workspaces) → O(N_roots+1)
Self-review #289. The previous exportViaPlugin ran one resolver CTE
walk + one plugin search PER WORKSPACE. For a 1000-workspace tenant
that's 1000× of each, mostly redundant — workspaces sharing a
team/org root see identical readable namespaces.

New strategy:
  1. Single SQL pass returns each workspace + its computed root_id
     via a recursive CTE (loadWorkspacesWithRoots).
  2. Group by root → unique tree count is typically << workspace
     count.
  3. Resolver runs ONCE per root (any member sees the same readable
     list).
  4. Build the union of all root namespaces; single plugin.Search
     call.
  5. Map each memory back to a workspace_name via pickOwnerForNamespace
     (workspace:<id> → matching member; team:* / org:* / custom:* →
     canonical first member of root group).

Net call cost: 1 SQL + N_roots resolver + 1 plugin call (vs
N_workspaces × resolver + N_workspaces × plugin in the old code).
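The grouping and attribution steps can be sketched as follows (the struct shape and signatures are illustrative assumptions — the real code also carries member lists and plugin results):

```go
package main

import (
	"fmt"
	"strings"
)

// Workspace pairs a workspace with its precomputed root, as returned by
// the single recursive-CTE pass (loadWorkspacesWithRoots in the commit;
// this struct shape is an assumption).
type Workspace struct {
	ID, Name, RootID string
}

// groupByRoot collapses N workspaces into one bucket per tree root, so
// the resolver runs once per root instead of once per workspace.
func groupByRoot(ws []Workspace) map[string][]Workspace {
	byRoot := make(map[string][]Workspace)
	for _, w := range ws {
		byRoot[w.RootID] = append(byRoot[w.RootID], w)
	}
	return byRoot
}

// pickOwnerForNamespace mirrors the attribution rules above:
// workspace:<id> maps to the matching member; team:*, org:*, custom:*,
// and a non-matching workspace id all fall back to the canonical first
// member of the root group. Empty member lists report false.
func pickOwnerForNamespace(ns string, members []Workspace) (string, bool) {
	if len(members) == 0 {
		return "", false
	}
	if id, ok := strings.CutPrefix(ns, "workspace:"); ok {
		for _, m := range members {
			if m.ID == id {
				return m.Name, true
			}
		}
	}
	return members[0].Name, true
}

func main() {
	ws := []Workspace{
		{"w1", "alpha", "r1"}, {"w2", "beta", "r1"}, {"w3", "gamma", "r2"},
	}
	byRoot := groupByRoot(ws)
	fmt.Println(len(byRoot)) // 2 roots -> 2 resolver calls instead of 3
	name, _ := pickOwnerForNamespace("workspace:w2", byRoot["r1"])
	fmt.Println(name) // beta
	name, _ = pickOwnerForNamespace("team:r1", byRoot["r1"])
	fmt.Println(name) // alpha
}
```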

Tests:
  * TestExport_BatchesPluginCallsByRoot pins the new behavior
    explicitly: 3 workspaces under 1 root → exactly 1 plugin search
    (was 3 with the old code).
  * TestPickOwnerForNamespace covers all five attribution cases:
    workspace:<id> match, workspace:<id> no-match-fallback, team:*,
    org:*, custom:* → first-member-of-root-group; plus empty-members
    fallback.
  * All 9 existing TestExport_* / TestImport_* / TestPickOwner /
    TestNamespaceKindFromLegacyScope / TestSkipImport / etc. tests
    remain green (verified with -run "Export").

The legacy DB path (when MEMORY_V2_CUTOVER unset) is unchanged.
2026-05-04 09:17:30 -07:00
Hongming Wang
d297e75fc9 Merge pull request #2746 from Molecule-AI/fix/memory-v2-i1-i4-small
Memory v2 fixup I1+I4: expires_at validation + audit JSON marshal
2026-05-04 16:05:02 +00:00
Hongming Wang
d48693144b Memory v2 fixup I1+I4: expires_at validation + audit JSON marshal
Two small Important findings from self-review, bundled because both
are <20 line changes touching the same file.

I1: expires_at silent drop
  - mcp_tools_memory_v2.go:130 had `if t, err := ...; err == nil { ... }`
    which dropped malformed timestamps without telling the agent.
    Agent passes `expires_at: "tomorrow"`, gets a 200, and the memory
    has no TTL.
  - Now returns a clear error: "invalid expires_at: must be RFC3339"
  - Test renamed: TestCommitMemoryV2_BadExpiresIsIgnored (which
    codified the bug) → TestCommitMemoryV2_BadExpiresReturnsError
    (which pins the fix).

I4: audit log JSON via Sprintf-%q
  - auditOrgWrite was building activity_logs.metadata via fmt.Sprintf
    with %q. Go-quoted strings happen to coincide with JSON-quoted
    strings for printable ASCII (and today's values are printable
    ASCII: a UUID + a hex digest), so the bug was latent.
  - Replaced with json.Marshal of map[string]string. Same wire shape
    today, but won't silently produce invalid JSON if metadata grows
    to include arbitrary content snippets.
  - New test TestAuditOrgWrite_MetadataIsValidJSON uses a custom
    sqlmock.Argument matcher (jsonValidMatcher) that fails the test
    if the metadata column isn't parseable JSON. The test runs
    auditOrgWrite with a content string containing quotes,
    backslashes, and a control byte — values where %q would diverge
    from JSON quoting.

Both pre-existing tests (TestCommitMemoryV2_AuditsOrgWrites etc.)
remain green.
2026-05-04 08:57:58 -07:00