molecule-core

Author	SHA1	Message	Date
Hongming Wang	56149f8a24	fix(bundle): markFailed sets last_sample_error + AST gate Closes the bug class surfaced by Canvas E2E #2632: a workspace ends up status='failed' with last_sample_error=NULL, and operators (or the E2E poll loop) see the useless "Workspace failed: (no last_sample_error)" with no triage signal. Two pieces: 1. bundle/importer.go markFailed — the UPDATE was setting only status, leaving last_sample_error NULL. Same incident class as the silent-drop bugs in PRs #2811 + #2824, different code path. markProvisionFailed in workspace_provision_shared.go has set the message column for a long time; this writer drifted from the convention. Fix: include last_sample_error in the SET clause + the broadcast. 2. AST drift gate (db/workspace_status_failed_message_drift_test.go) — Go AST walk that finds every db.DB.{Exec,Query,QueryRow}Context call whose argument list binds models.StatusFailed and asserts the SQL literal contains last_sample_error. Catches the next caller that drifts from the same convention. Verified to FAIL against the bug shape (reverted importer.go temporarily — gate flagged the exact line) and PASS against the fix. Why an AST gate vs a regex: pre-fix attempt with a regex over UPDATE statements flagged status='online' / status='hibernating' / status='removed' UPDATEs as false positives. Walking the AST and only flagging calls that pass the StatusFailed constant eliminates that. Out of scope (filed separately if needed): - The Canvas E2E that surfaced the missing message (#2632) is now a required check on staging via PR #2827. Once this fix lands the next staging push should re-run #2632's failing case and produce a meaningful last_sample_error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 21:08:08 -07:00
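The AST drift gate described in 56149f8a24 can be sketched roughly as below. This is a minimal illustration, not the repository's test: `findDrift`, the `sample` source, and the matching rule (any selector named `StatusFailed` among the call arguments, first string literal taken as the SQL) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// sample is hypothetical input: one drifted call, one fixed call, and one
// call that binds a different status and must not be flagged.
const sample = `package p

func bad(db DB, ctx Ctx, id string) {
	db.ExecContext(ctx, "UPDATE workspaces SET status = $1 WHERE id = $2", models.StatusFailed, id)
}

func good(db DB, ctx Ctx, id, msg string) {
	db.ExecContext(ctx, "UPDATE workspaces SET status = $1, last_sample_error = $2 WHERE id = $3", models.StatusFailed, msg, id)
}

func other(db DB, ctx Ctx, id string) {
	db.ExecContext(ctx, "UPDATE workspaces SET status = $1 WHERE id = $2", models.StatusOnline, id)
}
`

// findDrift returns the line of every Exec/Query/QueryRowContext call that
// passes a StatusFailed selector while its SQL literal omits last_sample_error.
func findDrift(src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "sample.go", src, 0)
	if err != nil {
		return nil, err
	}
	methods := map[string]bool{"ExecContext": true, "QueryContext": true, "QueryRowContext": true}
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || !methods[sel.Sel.Name] {
			return true
		}
		sql, bindsFailed := "", false
		for _, arg := range call.Args {
			switch a := arg.(type) {
			case *ast.BasicLit:
				// Assumption: the first string literal argument is the SQL text.
				if a.Kind == token.STRING && sql == "" {
					if s, uerr := strconv.Unquote(a.Value); uerr == nil {
						sql = s
					}
				}
			case *ast.SelectorExpr:
				if a.Sel.Name == "StatusFailed" {
					bindsFailed = true
				}
			}
		}
		if bindsFailed && !strings.Contains(sql, "last_sample_error") {
			lines = append(lines, fset.Position(call.Pos()).Line)
		}
		return true
	})
	return lines, nil
}

func main() {
	lines, err := findDrift(sample)
	fmt.Println("drifted call lines:", lines, "err:", err)
}
```

Keying on the bound constant rather than on the SQL text is what keeps UPDATEs for other statuses out of the findings, which is the false-positive problem the commit attributes to the regex attempt.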
Hongming Wang	ed6dfe01e5	feat(delegations): durable per-task ledger + audit-write helper (RFC #2829 PR-1) Adds the `delegations` table and the DelegationLedger writer that PRs #2-#4 of RFC #2829 build on. Schema-only foundation — no behavior change in this PR. PR-2 wires the ledger into the existing handlers and ships the result-push-to-inbox cutover behind a feature flag. Why a dedicated table when activity_logs already records every delegation event: Today, "what is currently in flight for this workspace" is reconstructed by GROUPing activity_logs by delegation_id and ORDER BY created_at DESC. PR-3's stuck-task sweeper needs the join SELECT delegation_id FROM delegations WHERE status = 'in_progress' AND last_heartbeat < now() - interval '10 minutes' which is impossible to express against the event stream without a window over every (delegation_id, latest event) pair — a planner-killing query at scale. The dedicated table makes the sweeper an indexed scan. Same posture as tenant_resources (PR #2343, memory `reference_tenant_resources_audit`): activity_logs remains the audit-grade source of truth, delegations is the queryable view for dashboards + sweeper joins. Symmetric writes — both tables are written, neither blocks orchestration on the other's failure. 
Schema highlights: - delegation_id PRIMARY KEY (caller-chosen, idempotent retry on restart is a no-op via ON CONFLICT DO NOTHING) - caller_id / callee_id NOT FK — workspace delete must NOT cascade-delete delegation history (audit retention) - status CHECK constraint enforces the lifecycle (queued|dispatched|in_progress|completed|failed|stuck) - last_heartbeat NULL-able; PR-3 sweeper compares to NOW() - deadline default now()+6h matches longest-observed legit delegation (memory-namespace migrations) — protects against forever-heartbeating wedged agents - Partial index `idx_delegations_inflight_heartbeat` keeps the sweeper hot path tiny (only non-terminal rows) - UNIQUE(caller_id, idempotency_key) WHERE NOT NULL — natural collision becomes ON CONFLICT no-op without colliding across callers DelegationLedger.SetStatus enforces forward-only on terminal states (completed/failed/stuck cannot be revised) as defense-in-depth on the schema CHECK. Same-status replay is a no-op. Missing-row SetStatus is a no-op (transient inconsistency the next agent retry will heal). Heartbeat updates only in-flight rows — terminal-state delegations are silently skipped. Coverage: - 17 unit tests against sqlmock-backed *sql.DB (Insert happy path, missing-required guards, truncation, lifecycle transitions, terminal forward-only protection, replay no-op, missing-row no-op, empty-input rejection, heartbeat semantics, transition table shape) - Migration roundtrip verified on a real Postgres 15 instance: up creates the expected schema with all 4 indexes + CHECK, down drops everything cleanly. Refs RFC #2829.	2026-05-04 20:43:06 -07:00
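The SetStatus rules listed for ed6dfe01e5 (terminal states cannot be revised, same-status replay and missing-row updates are no-ops) might be sketched as a pure decision function. The `decide` helper and its three-way result are hypothetical; the commit does not say whether a rejected transition returns an error or is silently dropped, and any ordering rules between non-terminal states are left out here.

```go
package main

import "fmt"

type decision int

const (
	apply decision = iota
	noOp
	reject
)

func (d decision) String() string { return [...]string{"apply", "no-op", "reject"}[d] }

// lifecycle mirrors the status CHECK constraint from the commit message.
var lifecycle = map[string]bool{
	"queued": true, "dispatched": true, "in_progress": true,
	"completed": true, "failed": true, "stuck": true,
}

// terminal states may never be revised once written.
var terminal = map[string]bool{"completed": true, "failed": true, "stuck": true}

// decide models a status update. current == "" stands for "no such row".
func decide(current, requested string) decision {
	switch {
	case !lifecycle[requested]:
		return reject // empty or unknown target status
	case current == "":
		return noOp // missing row: the next agent retry is expected to heal it
	case current == requested:
		return noOp // same-status replay
	case terminal[current]:
		return reject // forward-only protection on terminal rows
	default:
		return apply
	}
}

func main() {
	fmt.Println(decide("in_progress", "completed"))
	fmt.Println(decide("completed", "in_progress"))
	fmt.Println(decide("failed", "failed"))
}
```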
Hongming Wang	46d79a3e3b	Merge pull request #2824 from Molecule-AI/fix/stop-workspace-auto-saas-1777945000 fix(provision): StopWorkspaceAuto mirror — close SaaS EC2-leak class	2026-05-05 03:05:09 +00:00
Hongming Wang	2198f92dcb	Merge pull request #2823 from Molecule-AI/feat/codex-tab-pypi-install feat(external-templates): codex tab uses plain pip install	2026-05-05 03:03:08 +00:00
Hongming Wang	11c9ed2a46	fix(provision): StopWorkspaceAuto mirror — close SaaS EC2-leak class Closes #2813 (team-collapse) and #2814 (workspace delete). Two leaks, one class. Both call sites had the same shape pre-fix: if h.provisioner != nil { h.provisioner.Stop(ctx, wsID) } On SaaS where h.provisioner (Docker) is nil and h.cpProv is set, that gate evaluates false and the EC2 keeps running. Workspace gets marked removed in DB; EC2 lives on until the orphan sweeper catches it. Same drift class as PR #2811's org-import provision bug — a Docker-only check on what should be a both-backend operation. Confirmed in production: PR #2811's verification step deleted a test workspace and the EC2 stayed running until I terminated it manually. Fix: WorkspaceHandler.StopWorkspaceAuto(ctx, wsID) — symmetric mirror of provisionWorkspaceAuto. CP first, Docker second, no-op when neither is wired (a workspace nobody is running can't be stopped — that's a no-op, not a failure, distinct from provision's mark-failed contract). Three call-site changes: - team.go:208 (Collapse) → h.wh.StopWorkspaceAuto(ctx, childID) - workspace_crud.go:432 (stopAndRemove) → h.StopWorkspaceAuto(...); RemoveVolume stays Docker-only behind an explicit gate since CP-managed workspaces have no host-bind volumes - TeamHandler.provisioner field + NewTeamHandler's *Provisioner param removed as dead code (Stop was the only call site) Volume cleanup separation is intentional: the abstraction is "stop the running workload," not "tear down all state." Callers that need volume cleanup keep their `if h.provisioner != nil { RemoveVolume }` gate AFTER the Stop call. 
Tests: - TestStopWorkspaceAuto_RoutesToCPWhenSet — SaaS path - TestStopWorkspaceAuto_RoutesToDockerWhenOnlyDocker — self-hosted - TestStopWorkspaceAuto_NoBackendIsNoOp — pins the contract distinction from provisionWorkspaceAuto's mark-failed - TestNoCallSiteCallsBareStop — source-level pin against `.provisioner.Stop(` / `.cpProv.Stop(` outside the dispatcher, per-backend bodies, restart helper, and the Docker-daemon-direct short-lived-container path. Strips Go comments before substring match so archaeology in code comments doesn't trip the gate. - Verified: pin FAILS against the buggy shape (workspace_crud.go reversion); team.go reversion compile-fails because the field is gone — even stronger than the test. Out of scope (tracked under #2799): - workspace_restart.go's manual if-cpProv-else dispatch with retry semantics tuned for the restart hot path. Functionally equivalent + wraps cpStopWithRetry, so it's not the bug class this PR closes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 20:00:23 -07:00
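The dispatch order described for 11c9ed2a46 (CP first, Docker second, no-op when neither backend is wired) can be illustrated with a small sketch. The `stopper` interface, the `recorder` stub, and the `route` helper are assumptions for the example; the real field types and method signatures are not shown in the log.

```go
package main

import (
	"context"
	"fmt"
)

// stopper is a hypothetical minimal interface shared by both backends.
type stopper interface {
	Stop(ctx context.Context, wsID string) error
}

type workspaceHandler struct {
	cpProv      stopper // control-plane (EC2) backend, nil when self-hosted
	provisioner stopper // Docker backend, nil on SaaS
}

// HasProvisioner reports whether any backend is wired.
func (h *workspaceHandler) HasProvisioner() bool {
	return h.cpProv != nil || h.provisioner != nil
}

// StopWorkspaceAuto routes to CP first, then Docker. With no backend there is
// nothing running to stop, so it returns nil rather than an error.
func (h *workspaceHandler) StopWorkspaceAuto(ctx context.Context, wsID string) error {
	switch {
	case h.cpProv != nil:
		return h.cpProv.Stop(ctx, wsID)
	case h.provisioner != nil:
		return h.provisioner.Stop(ctx, wsID)
	default:
		return nil
	}
}

// recorder is a test stub that notes which backend received the Stop call.
type recorder struct {
	name  string
	calls *[]string
}

func (r recorder) Stop(_ context.Context, wsID string) error {
	*r.calls = append(*r.calls, r.name+":"+wsID)
	return nil
}

// route builds a handler with the requested backends and reports the calls.
func route(withCP, withDocker bool) []string {
	calls := []string{}
	h := &workspaceHandler{}
	if withCP {
		h.cpProv = recorder{name: "cp", calls: &calls}
	}
	if withDocker {
		h.provisioner = recorder{name: "docker", calls: &calls}
	}
	if err := h.StopWorkspaceAuto(context.Background(), "ws-1"); err != nil {
		calls = append(calls, "error")
	}
	return calls
}

func main() {
	fmt.Println(route(true, false), route(false, true), route(false, false))
}
```

Under this shape a Docker-only gate such as `if h.provisioner != nil` has no reason to exist at call sites, which is the leak class the commit closes.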
Hongming Wang	c0bfd19b9e	feat(external-templates): codex tab uses plain pip install for bridge daemon `codex-channel-molecule` 0.1.0 is now on PyPI, so operators no longer need the `git+https://...` URL workaround. Verified: `pip install codex-channel-molecule` from a clean venv installs the wheel and the `codex-channel-molecule --help` console script runs. PyPI: https://pypi.org/project/codex-channel-molecule/0.1.0/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:58:56 -07:00
Hongming Wang	e0f9434eaf	fix(files-eic): silence ssh known-hosts warning that 500'd Hermes config load GET /workspaces/:id/files/config.yaml on hongming.moleculesai.app's Hermes workspace returned 500 with body: ssh cat: exit status 1 (Warning: Permanently added '[127.0.0.1]:37951' (ED25519) to the list of known hosts.) Root cause: ssh emits the "Permanently added" notice on every fresh tunnel connection, even with UserKnownHostsFile=/dev/null (that prevents persistence, not the warning). It lands on stderr, fooling readFileViaEIC's classifier: if len(out) == 0 && stderr.Len() == 0 { return nil, os.ErrNotExist } return nil, fmt.Errorf("ssh cat: %w (%s)", runErr, ...) stderr was non-empty (the warning), so we returned the wrapped error → 500 from the HTTP layer instead of 404. Fix: add `-o LogLevel=ERROR` to BOTH writeFileViaEIC and readFileViaEIC ssh invocations. Silences info+warning while keeping real auth/tunnel errors visible (those emit at ERROR level). Test: TestSSHArgs_LogLevelErrorBothSites pins the flag in both blocks. Mutation-tested: stripping the flag from one site fails the gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 19:58:49 -07:00
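The classifier behaviour behind e0f9434eaf can be reproduced in isolation. In this sketch `classifyRead`, `sshOptions`, `isNotExist`, and `errExit` are hypothetical names; only the branch logic, the error format string, and the two ssh options come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errExit stands in for the *exec.ExitError a failed `cat` would produce.
var errExit = errors.New("exit status 1")

// sshOptions lists the relevant -o flags. LogLevel=ERROR suppresses the
// "Permanently added ..." notice that UserKnownHostsFile=/dev/null alone
// still lets through on stderr.
func sshOptions() []string {
	return []string{
		"-o", "UserKnownHostsFile=/dev/null",
		"-o", "LogLevel=ERROR",
	}
}

// classifyRead maps a failed remote cat to "not found" only when both
// stdout and stderr are empty; anything on stderr is treated as a real error.
func classifyRead(out []byte, stderr string, runErr error) ([]byte, error) {
	if runErr == nil {
		return out, nil
	}
	if len(out) == 0 && stderr == "" {
		return nil, os.ErrNotExist
	}
	return nil, fmt.Errorf("ssh cat: %w (%s)", runErr, stderr)
}

// isNotExist reports whether err would map to a 404 in the handler.
func isNotExist(err error) bool { return errors.Is(err, os.ErrNotExist) }

func main() {
	warning := "Warning: Permanently added '[127.0.0.1]:37951' (ED25519) to the list of known hosts."
	_, noisy := classifyRead(nil, warning, errExit)
	_, quiet := classifyRead(nil, "", errExit)
	fmt.Println("with warning on stderr, not-found:", isNotExist(noisy))
	fmt.Println("with clean stderr, not-found:", isNotExist(quiet))
}
```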
Hongming Wang	7e1fdf5847	refactor(provision): use HasProvisioner() at all gate-y both-nil checks SSOT pass — replace 4 bare `h.provisioner == nil && h.cpProv == nil` checks with `!h.HasProvisioner()`. When a third backend lands (k8s, containerd, whatever), HasProvisioner gets one new field; bare both-nil checks would each need to be hunted and updated. Sites: - a2a_proxy_helpers.go:166 — maybeMarkContainerDead skip-no-backend - workspace_restart.go:118 — Restart endpoint guard - workspace_restart.go:363 — RestartByID coalescer guard - workspace_restart.go:660 — Resume endpoint guard Adds TestNoBareBothNilCheck (source-level) so the antipattern can't slip back in. Out of scope but discovered during the audit (filed separately): - team.go:207 — team-collapse Stop is Docker-only, leaks EC2 on SaaS - workspace_crud.go:423 — workspace delete cleanup is Docker-only, leaks EC2 on SaaS Both need a StopWorkspaceAuto mirror of provisionWorkspaceAuto. Same class of bug as today's org-import incident, different verb (stop vs provision). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:51:53 -07:00
Hongming Wang	d084d7e61a	fix(provision): consolidate org-import gate + Auto self-marks-failed Two changes that close the silent-drop bug class: 1. Add WorkspaceHandler.HasProvisioner() and use it as the org-import gate. Pre-fix, org_import.go:178 read `h.provisioner != nil` (Docker-only) — on SaaS tenants where cpProv is wired but Docker is nil, the entire 220-line provisioning prep block was skipped. The Auto call PR #2798 added at line 395 was unreachable on SaaS. Repro: 2026-05-05 01:14 — hongming prod tenant, 7-workspace org import, every workspace sat in 'provisioning' for 10 min until the sweeper marked it failed with the misleading "container started but never called /registry/register". 2. provisionWorkspaceAuto self-marks-failed on the no-backend path. Defense in depth: even if a future caller bypasses HasProvisioner gating or ignores the bool return (TeamHandler pre-#2367 did exactly this), the workspace ends in a clean failed state with an actionable error message instead of lingering until the 10-min sweep. Auto becomes the single source of truth for "start a workspace" — routing AND the no-backend failure path. Create's redundant if-not-Auto-then-mark-failed block collapses (kept only the workspace_config UPSERT, which is a Create-specific UI concern for rendering runtime/model on the Config tab). Tests: - TestProvisionWorkspaceAuto_NoBackendMarksFailed pins the new contract - TestHasProvisioner_TrueOnCPOnly catches the SaaS-only blind spot - TestHasProvisioner_TrueOnDockerOnly preserves self-hosted shape - TestHasProvisioner_FalseWhenNeitherWired pins the gate-out path - TestOrgImportGate_UsesHasProvisionerNotBareField source-pins the gate (verified: FAILS against the buggy `h.provisioner != nil` shape, PASSES with `h.workspace.HasProvisioner()`) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:47:02 -07:00
Hongming Wang	dfd0bc528c	fix(external-templates): codex-channel-molecule via git+ URL (not on PyPI yet) Mirrors the pattern hermes-channel-molecule uses (line 256). Drops the broken `pip install codex-channel-molecule` which would 404. PyPI publish workflow is a separate piece of work — until then, git+https install is the path operators get. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:29:23 -07:00
Hongming Wang	4ea6f437e9	feat(external-templates): codex tab now includes the bridge-daemon inbound path The codex tab in the External Connect modal had an "outbound-tools-only first cut" caveat — operators got the MCP wiring for codex calling platform tools, but there was no documented inbound path. Canvas messages couldn't wake an idle codex session. That gap is now filled by codex-channel-molecule (github.com/Molecule-AI/codex-channel-molecule), shipped today as the codex counterpart to hermes-channel-molecule. The daemon long-polls the platform inbox, runs `codex exec --resume <session>` per inbound message, captures the assistant reply, routes it back via send_message_to_user / delegate_task, and acks the inbox row. Per-thread session continuity persisted to disk so daemon restarts don't lose conversation context. This commit: - Updates externalCodexTemplate to include `pip install codex-channel-molecule` (step 1) and a foreground `nohup codex-channel-molecule` invocation (step 3) using the same env-var contract as the MCP server (WORKSPACE_ID + PLATFORM_URL + MOLECULE_WORKSPACE_TOKEN). - Adds a "Canvas messages don't wake codex" common-issues entry to the TAB_HELP codex section pointing at the bridge daemon log. - Updates the doc comment to record the upstream deprecation path: when openai/codex#17543 lands, the bridge becomes redundant and the wired MCP server delivers push natively. Verified: TestExternalTemplates_NoMoleculeOrgIDPlaceholder still passes (no MOLECULE_ORG_ID re-introduction); full handlers suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:28:35 -07:00
Hongming Wang	0f389ba325	Merge pull request #2804 from Molecule-AI/fix/external-templates-drop-molecule-org-id fix(external-templates): drop MOLECULE_ORG_ID from codex/openclaw/hermes snippets	2026-05-05 00:38:45 +00:00
Hongming Wang	472862bc50	fix(external-templates): drop MOLECULE_ORG_ID from operator-facing snippets Codex / openclaw / hermes-channel snippets each instructed operators to set `MOLECULE_ORG_ID = "<your org id>"`. The molecule_runtime MCP subprocess these snippets spawn never reads MOLECULE_ORG_ID — that env var is consumed only by workspace-server's TenantGuard middleware, server-side, on the tenant box itself (set by the control plane via user-data on provision). External operator → tenant calls pass TenantGuard via the isSameOriginCanvas path (Origin matches Host), with auth via Bearer token + X-Workspace-ID. The universal_mcp snippet — which calls into the same molecule_runtime — has always (correctly) omitted MOLECULE_ORG_ID; this brings codex / openclaw / hermes-channel into line. Symptom that caught it: an external codex CLI session, after pasting the codex-tab snippet, surfaced "MOLECULE_ORG_ID is still set to '<your org id>'" as an unresolved blocker — agent reasonably treated the placeholder as required setup. Operator has no value to fill. Pinned with a structural test (TestExternalTemplates_NoMoleculeOrgIDPlaceholder) so the placeholder can't drift back across all six external-tab templates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:30:07 -07:00
Hongming Wang	b5435b4732	fix(memory v2): warn at boot when cutover env half-configured MEMORY_V2_CUTOVER=true gates the admin export/import path on the v2 plugin, but the cutoverActive() check in admin_memories.go silently returns false when the plugin isn't wired: func (h AdminMemoriesHandler) cutoverActive() bool { if os.Getenv(envMemoryV2Cutover) != "true" { return false } return h.plugin != nil && h.resolver != nil } Two operator misconfigs hit the silent-fallback path: 1. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL unset → wiring.Build returns nil → handler stays on legacy SQL path → operator sees no error, assumes cutover is live, but every request still writes the legacy table. 2. MEMORY_V2_CUTOVER=true set, MEMORY_PLUGIN_URL set, but plugin unreachable at boot → wiring.Build still returns the bundle (intentional — circuit breaker handles ongoing unavailability), but every cutover write quietly falls back via the breaker. → only signal: legacy table keeps growing. Both are exactly the "structurally invisible until prod" failure mode; the only real-world detection today is "notice the legacy table is still being written to," which no operator will check. Add loud, distinctive WARN log lines at Build() time for both shapes. Boot logs are operator-visible, so a half-config is immediately obvious without needing dashboards. Tests: 4 new (cutover+no-URL → warn, neither set → silent, cutover+probe-fail → loud warn, probe-fail-without-cutover → quiet generic) * 6 existing (still pass; pin no-warning-on-happy-path) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:24:11 -07:00
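The four boot-time cases listed in the tests for b5435b4732 can be laid out as a pure function. This is a sketch only: `bootWarning`, its parameters, and the message wording are invented for illustration, and the real check runs inside wiring.Build with its own log format.

```go
package main

import "fmt"

// bootWarning returns the log line to emit at startup, or "" when the
// configuration is consistent. cutover and pluginURL are the raw values of
// MEMORY_V2_CUTOVER and MEMORY_PLUGIN_URL; probeOK is the boot health probe result.
func bootWarning(cutover, pluginURL string, probeOK bool) string {
	cutoverOn := cutover == "true"
	switch {
	case pluginURL == "" && cutoverOn:
		// Misconfig 1: cutover requested but no plugin wired.
		return "WARN memory-v2: MEMORY_V2_CUTOVER=true but MEMORY_PLUGIN_URL is unset; staying on the legacy SQL path"
	case pluginURL == "":
		return "" // neither set: zero-config legacy behaviour, stay silent
	case !probeOK && cutoverOn:
		// Misconfig 2: cutover requested but the plugin is unreachable at boot.
		return "WARN memory-v2: MEMORY_V2_CUTOVER=true but plugin health probe failed; cutover writes will fall back"
	case !probeOK:
		return "memory-v2: plugin health probe failed at boot"
	default:
		return "" // happy path: no warning
	}
}

func main() {
	fmt.Println(bootWarning("true", "", false))
	fmt.Println(bootWarning("true", "http://plugin.internal", false))
}
```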
Hongming Wang	f1b72af97e	Merge pull request #2798 from Molecule-AI/fix/org-import-saas-routing-1777938328 fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2 — closes #2486	2026-05-04 23:54:37 +00:00
Hongming Wang	19e7acdc22	fix(org-import): route through provisionWorkspaceAuto so SaaS gets EC2 Org-import called h.workspace.provisionWorkspace directly — same silent-drop bug that bit TeamHandler.Expand on 2026-05-04 (see workspace.go:121-125 comment + #2486). Symptom on SaaS: every claude-code workspace sat in "provisioning" until the 600s sweeper marked it failed with "container started but never called /registry/register" — because no container ever existed; the goroutine returned silently when the Docker provisioner field was nil. User reproduced 2026-05-04 ~22:30Z importing a 7-workspace template on the hongming prod tenant. Tenant CP logs (queried live via SSM) showed ZERO "Provisioner: goroutine entered" or "CPProvisioner: goroutine entered" lines for any of the 7 failed workspace UUIDs in the 60min window — confirming the goroutine never ran past line 384 of org_import.go because provisionWorkspace returned early in SaaS mode. The fix is one line: replace h.workspace.provisionWorkspace with h.workspace.provisionWorkspaceAuto. Auto is the single source of truth for backend selection (workspace.go:130) — picks CP-mode when h.cpProv is wired, Docker-mode when h.provisioner is wired, returns false when neither. ALSO adds a generic source-level gate (TestNoCallSiteCallsDirectProvisionerExceptAuto) so the next caller can't repeat the pattern. Walks every non-test .go file in handlers/ and fails if any direct call to provisionWorkspace( or provisionWorkspaceCP( appears outside the dispatcher's own definition file. The gate currently allows workspace_restart.go which has its own manual if-h.cpProv-else dispatch (functionally equivalent to Auto, not the bug class — but is architectural duplication; follow-up filed for proper de-dup). 
Test plan: - TestOrgImport_UsesAutoNotDirectDockerPath: pin the org_import.go call site - TestNoCallSiteCallsDirectProvisionerExceptAuto: generic gate against future drift - TestTeamExpand_UsesAutoNotDirectDockerPath (existing): symmetric for team.go All 3 + the rest of the handler suite pass. Closes #2486 Pairs with: PR #2794 (configurable provision concurrency) which made it possible to bisect concurrency-vs-routing as the cause Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:49:07 -07:00
Hongming Wang	872b781f64	Merge pull request #2792 from Molecule-AI/feat/drop-shared-context feat: drop shared_context — use memory v2 team namespace	2026-05-04 23:37:49 +00:00
Hongming Wang	3bc7749e84	feat(org-import): make provision concurrency configurable via env Org-import was hard-capped at 3 concurrent workspace provisions (#1084), calibrated for Docker-mode workspaces where each provision was a docker-run. Now that workspaces are EC2 instances, AWS RunInstances parallelises happily and the artificial cap of 3 makes a 7-workspace org-import take 3-4× longer than necessary (3 batches × ~70s/provision ≈ 4 min wall time when AWS could absorb all 7 in parallel for ~70s). This PR makes the cap configurable via MOLECULE_PROVISION_CONCURRENCY: unset → 3 (Docker-mode default, unchanged) "0" → effectively unlimited (SaaS / EC2 backend; AWS rate-limit + vCPU quota are the real backpressure) N>0 → exactly N N<0 → fall back to default 3 + warning log garbage → fall back to default 3 + warning log The "0 = unlimited" mapping is the user-facing convention requested for SaaS deployments — operators don't have to pick an arbitrary large number. Implementation hands off 1<<20 internally so the channel-based semaphore stays a no-op without infinite-buffer risk. Test coverage (org_provision_concurrency_test.go, 6 cases / 15 subtests): - unset → default - "0" → large unlimited cap - positive integer exact (1, 5, 10, 50) - negative → default + warning - non-numeric → default + warning - whitespace-trimmed (" 7 " → 7) Boot-time log line confirms the resolved cap so an operator can verify their env is being honored without re-deploying. Does NOT address the separate 600s "never registered" timeout the user also reported during org-import — that's filed as molecule-core#2793 for proper investigation (parallel-provision contention, network routing, register-retry budget, or container-start failure are all candidates and need live SSM capture to bisect). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:33:49 -07:00
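The MOLECULE_PROVISION_CONCURRENCY mapping in 3bc7749e84 is easy to check with a small parser. `resolveConcurrency` and its `(cap, warn)` return shape are assumptions for this sketch; the default of 3, the `1<<20` stand-in for "unlimited", and the fallback rules are taken from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const (
	defaultConcurrency   = 3       // Docker-mode default
	unlimitedConcurrency = 1 << 20 // finite stand-in so a channel semaphore stays bounded
)

// resolveConcurrency maps the raw env value to a provision cap. warn is true
// when the value was rejected and the default was used instead.
func resolveConcurrency(raw string) (limit int, warn bool) {
	s := strings.TrimSpace(raw)
	if s == "" {
		return defaultConcurrency, false // unset (or blank, in this sketch)
	}
	n, err := strconv.Atoi(s)
	switch {
	case err != nil || n < 0:
		return defaultConcurrency, true // garbage or negative: default + warning
	case n == 0:
		return unlimitedConcurrency, false // "0" means effectively unlimited
	default:
		return n, false
	}
}

func main() {
	limit, warn := resolveConcurrency(os.Getenv("MOLECULE_PROVISION_CONCURRENCY"))
	if warn {
		fmt.Println("warning: invalid MOLECULE_PROVISION_CONCURRENCY, using default")
	}
	// A buffered channel of this capacity can serve as the semaphore.
	fmt.Println("org-import provision concurrency:", limit)
}
```

Note that `os.Getenv` cannot distinguish an unset variable from an empty one, so this sketch treats both as "use the default".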
Hongming Wang	2f7beb9bce	feat: drop shared_context — use memory v2 team namespace instead Parent → child knowledge sharing previously lived behind a `shared_context` list in config.yaml: at boot, every child workspace HTTP-fetched its parent's listed files via GET /workspaces/:id/shared-context and prepended them as a "## Parent Context" block. That paid the full transfer cost on every boot regardless of whether the agent needed it, single-parent SPOF, no team or org scope, and broken if the parent was unreachable. Replace with memory v2's team:<id> namespace: agents call recall_memory on demand. For large blob-shaped artefacts see RFC #2789 (platform-owned shared file storage). Removed: - workspace/coordinator.py: get_parent_context() - workspace/prompt.py: parent_context arg + injection block - workspace/adapter_base.py: import + call + arg pass - workspace/config.py: shared_context field + parser entry - workspace-server/internal/handlers/templates.go: SharedContext handler - workspace-server/internal/router/router.go: GET /shared-context route - canvas/src/components/tabs/ConfigTab.tsx: Shared Context tag input - canvas/src/components/tabs/config/form-inputs.tsx: schema field + default - canvas/src/components/tabs/config/yaml-utils.ts: serializer entry - 6 tests pinning the removed behavior; 5 doc references Added regression gates so any reintroduction is loud: - workspace/tests/test_prompt.py: build_system_prompt must NOT emit "## Parent Context" - workspace/tests/test_config.py: legacy YAML key loads cleanly but shared_context attr must NOT exist on WorkspaceConfig - tests/e2e/test_staging_full_saas.sh §9d: GET /shared-context must NOT return 200 against a live tenant Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:30:26 -07:00
Hongming Wang	9c7b34cb7f	fix(workspace files API): GET ReadFile via SSH-EIC for SaaS workspaces Pre-fix WriteFile (templates.go:436) had an `instance_id != ""` branch that dispatched to writeFileViaEIC (SSH through EC2 Instance Connect), but ReadFile (templates.go:362) skipped that branch entirely. ReadFile always tried `findContainer` (which only works for local-Docker workspaces, not SaaS EC2-per-workspace ones) and fell through to `resolveTemplateDir` (which returns the seed template, not the persisted workspace state). Net effect on production: every Canvas Config tab open against a SaaS workspace returned 404 "No config.yaml found" because GET couldn't see what PUT had written. Visible to users after PR #2781 ("show-misconfigured-state") surfaced the 404 as an error UX. Caught by the synth-E2E 7c gate's GET-back assertion, but misdiagnosed as a "test bug" and the GET assertion was dropped in PR #2783 (rather than fixed at the source). This PR closes the loop: 1. New `readFileViaEIC` helper in template_files_eic.go that mirrors writeFileViaEIC's SSH-via-EIC dance and runs `sudo -n cat <path>`. Returns os.ErrNotExist on missing file (cat exits 1 with empty stdout under `2>/dev/null`) so the handler maps it cleanly to 404. 2. ReadFile dispatch now mirrors WriteFile's: when `instance_id` is non-empty, use readFileViaEIC; otherwise fall through to the local-Docker / template-dir path. 3. ReadFile's DB query expanded to also select instance_id + runtime (was just name). Three sqlmock-based tests updated to match the new column shape; the existing local-Docker fallback path stays green by passing instance_id="" in the mock rows. Follow-up (separate PR): the synth-E2E 7c gate should restore the GET-back marker assertion now that the read/write paths are unified. That'll also catch any future Files API regression in the round-trip. This PR doesn't touch the gate to keep the scope tight. Verification: - go build ./... 
clean - full handlers test suite green (0.4s for ReadFile subset; 5.8s full) - The 3 ReadFile sqlmock tests still cover the local-Docker fallback (instance_id=""); SaaS EIC dispatch is covered by the upcoming re-enabled synth-E2E 7c GET assertion (deferred to follow-up) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 16:02:26 -07:00
Hongming Wang	61d5908817	fix(workspace files API): write claude-code config to /configs, sudo for root-owned base Root cause of the user-visible 500 ("install: cannot create directory '/opt/configs': Permission denied") on PUT /workspaces/<id>/files/config.yaml: 1. Path map fall-through. claude-code wasn't in workspaceFilePathPrefix, so resolveWorkspaceFilePath returned the default `/opt/configs/...`. That directory doesn't exist on the workspace EC2 — cloud-init in provisioner/userdata_containerized.go runs `mkdir -p /configs` only. Even if the SSH write had succeeded at /opt/configs, the docker container's bind-mount is host:/configs → container:/configs, so the file would have been invisible to the runtime. 2. /configs ownership. cloud-init runs as root, so /configs is root-owned. The SSH-as-ubuntu install command can't write into it without sudo. Hermes wasn't affected because its base path (/home/ubuntu/.hermes) is ubuntu-owned. Two-line fix: - Add `claude-code: /configs` to the runtime → base-path map and flip the default fall-through from `/opt/configs` to `/configs`. Leave the pre-existing langgraph/external entries pointing at /opt/configs pending a migration audit (no user report on those today, and flipping them would silently relocate any files those runtimes already wrote). - Prefix the remote install command with `sudo -n` so the write succeeds under the standard EC2 ubuntu/passwordless-sudo posture. `-n` (non-interactive) ensures clean failure if that ever changes, rather than a hang waiting for a password prompt. Tests: - TestResolveWorkspaceFilePath_KnownRuntimes adds claude-code + CLAUDE-CODE coverage and updates the empty/unknown default cases to expect /configs. The langgraph/external rows stay green (unchanged values), confirming the scope of the rename. Verification: - go build ./... 
clean - go test ./internal/handlers/ green - The user-reported bug (PUT /workspaces/57fb7043-79a0-4a53-ae4a-efb39deb457f/files/config.yaml → 500 EACCES on /opt/configs) is the failure mode this fix addresses on both axes (path + sudo). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 14:29:08 -07:00
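The runtime-to-base-path lookup changed in 61d5908817 might look roughly like this. The map contents follow the commit message (claude-code and the default on /configs, langgraph and external left on /opt/configs, hermes under /home/ubuntu/.hermes); the function signature and the lower-casing are assumptions inferred from the test description.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// workspaceFilePathPrefix maps a runtime to the host directory its files live in.
var workspaceFilePathPrefix = map[string]string{
	"claude-code": "/configs",
	"hermes":      "/home/ubuntu/.hermes",
	"langgraph":   "/opt/configs", // left unchanged pending a migration audit
	"external":    "/opt/configs", // left unchanged pending a migration audit
}

// defaultBasePath is the fall-through; cloud-init only creates /configs.
const defaultBasePath = "/configs"

// resolveWorkspaceFilePath returns the absolute host path for a workspace file.
func resolveWorkspaceFilePath(runtime, rel string) string {
	base, ok := workspaceFilePathPrefix[strings.ToLower(runtime)]
	if !ok {
		base = defaultBasePath
	}
	return path.Join(base, rel)
}

func main() {
	for _, rt := range []string{"claude-code", "CLAUDE-CODE", "hermes", "langgraph", ""} {
		fmt.Printf("%q -> %s\n", rt, resolveWorkspaceFilePath(rt, "config.yaml"))
	}
}
```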
Hongming Wang	707e4d7342	Memory v2 wiring: replace decorative tests with real integration Self-review of #2755 found two tests that didn't actually exercise the production code path: - TestNamespaceCleanupFn_NamespaceFormat asserted "workspace:" + "abc-123" == "workspace:abc-123" — a compile-time invariant, not runtime behavior. Provided no protection if the closure in Bundle.NamespaceCleanupFn ever stopped using that prefix. - TestNamespaceCleanupFn_FailureLogsButReturns built a parallel cleanup closure inline with errors.New, then invoked the parallel closure. The production closure was never exercised. A regression in NamespaceCleanupFn (e.g. forgetting the deferred recover, calling the plugin without nil-check) would still pass this test. Replaced both with real integration: - TestNamespaceCleanupFn_HitsPluginAtCorrectNamespace spins up httptest.Server, points MEMORY_PLUGIN_URL at it, calls Build(), invokes the production closure, and asserts the server actually saw DELETE /v1/namespaces/workspace:abc-123. - TestNamespaceCleanupFn_PluginErrorDoesNotPanic exercises the failure path for real: server returns 500 on DELETE, closure must log and return without propagating. defer-recover is belt-and-suspenders since production calls this from a for-loop in workspace_crud.go that has no recover. Couldn't ship with #2755 because the merge queue locks the branch once enqueued. Following up now that #2755 is merged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 10:38:59 -07:00
Hongming Wang	46731729d4	Memory v2 fixup Critical: wire plugin from main.go (was fully dormant) Caught during continued review: the entire v2 plugin system shipped in PRs #2729-#2742 + #2744-#2751 was never actually invoked because main.go and router.go don't construct the plugin client/resolver or attach the WithMemoryV2 / WithNamespaceCleanup hooks. Operators setting MEMORY_PLUGIN_URL=... saw zero behavior change because nothing read it. Every fixup we shipped (idempotency, verify mode, expires_at validation, audit JSON, namespace cleanup, O(N) export, boot E2E) was also dormant for the same reason. Root cause: when a multi-handler feature lands across many PRs, none of them are individually responsible for wiring main.go — and the master-task-tracking issue didn't gate-check that the wiring landed. Add main.go integration to every multi-handler RFC checklist. What ships: * internal/memory/wiring/wiring.go: new package that constructs the plugin client + resolver from MEMORY_PLUGIN_URL once. Returns nil when unset (preserves zero-config legacy behavior). Probes /v1/health at boot but doesn't fail-closed — the MCP layer's circuit breaker handles ongoing unavailability. * internal/memory/wiring/wiring_test.go: 6 tests covering the nil/non-nil bundle paths + the namespace-cleanup closure contract (nil-safe, format-stable, failure-tolerant). * cmd/server/main.go: imports memwiring, calls Build(db.DB) once after WorkspaceHandler creation, attaches WithNamespaceCleanup, threads the bundle through router.Setup. * internal/router/router.go: Setup signature gains *memwiring.Bundle param. Inside, attaches WithMemoryV2 to AdminMemoriesHandler and MCPHandler when the bundle is non-nil. After this, the v2 plugin is reachable end-to-end: Operator sets MEMORY_PLUGIN_URL → main.Build instantiates client + resolver → WorkspaceHandler gets cleanup hook → router wires AdminMemoriesHandler + MCPHandler with WithMemoryV2 → MCP tool calls (commit_memory_v2, search_memory, etc.) 
actually do something → admin export/import respects MEMORY_V2_CUTOVER. Prerequisite for #292 (staging verification) — without this, the operator runbook's step 2 (set MEMORY_PLUGIN_URL, observe behavior) silently no-ops. Verified: all 9 affected test packages still green (memory/{client,contract,e2e,namespace,pgplugin,wiring}, handlers, router, plus the build).	2026-05-04 10:22:30 -07:00
Hongming Wang	9f47ecf86e	Merge branch 'staging' into fix/memory-v2-i3-export-on	2026-05-04 09:44:37 -07:00
Hongming Wang	ebc20794f3	fix(admin-memories): include each member's private namespace in export ReadableNamespaces(rootID) returns {workspace:rootID, team:rootID, org:rootID} — the workspace: namespace it surfaces is the root's only. The I3 batching change resolved namespaces once per root which silently dropped every child workspace's private memories from admin export (workspace:childID never reached the plugin search). Keep the per-root batching win for team:/org:/custom: namespaces; inject each member's workspace:<id> + owner mapping explicitly so coverage matches the legacy per-workspace iteration. Cost stays at 1 SQL + N_roots resolver + 1 plugin search. Test changes: - New TestExport_IncludesEveryMembersPrivateNamespace uses a per-workspace resolver stub (mirrors real behaviour) and asserts every member's workspace:<id> reaches the plugin search AND that children's private memories appear in the response with correct owner attribution. Verified to FAIL on the pre-fix code. - TestExport_BatchesPluginCallsByRoot updated to expect 5 namespaces (3 workspace + team + org) instead of 3 — it had pinned the buggy 3-namespace behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 09:44:06 -07:00
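A minimal Go sketch of the namespace coverage described in the commit above. The helper name `exportNamespaces` and the plain string IDs are assumptions for illustration, not the repository's code. It keeps the root's readable set as resolved once per root and adds each member's `workspace:<id>` explicitly, so three workspaces under one root yield five namespaces.

```go
package main

import "fmt"

// exportNamespaces is a hypothetical helper (not from the repository).
// rootReadable stands in for what the resolver returns for the root;
// memberIDs lists every workspace in the root's tree, root included.
func exportNamespaces(rootReadable, memberIDs []string) []string {
	seen := make(map[string]bool)
	var out []string
	add := func(ns string) {
		if !seen[ns] {
			seen[ns] = true
			out = append(out, ns)
		}
	}
	// Per-root batching: team:/org: entries are resolved once.
	for _, ns := range rootReadable {
		add(ns)
	}
	// Each member's private namespace is injected explicitly.
	for _, id := range memberIDs {
		add("workspace:" + id)
	}
	return out
}

func main() {
	readable := []string{"workspace:root", "team:root", "org:root"}
	got := exportNamespaces(readable, []string{"root", "child-a", "child-b"})
	fmt.Println(len(got), got)
}
```

The de-duplication matters because the root's own `workspace:` entry appears in both inputs.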
Hongming Wang	6b445aae2d	Memory v2 fixup I5: workspace purge cleans up plugin namespace Self-review #291. When a workspace is hard-purged, its `workspace:<id>` namespace stays in the plugin storage. Over time deleted workspaces accumulate as orphan namespaces. Fix: optional namespaceCleanupFn hook on WorkspaceHandler. The purge path (workspace_crud.go ~line 520) iterates each purged id and calls the hook best-effort. main.go wires the hook to plugin.DeleteNamespace when MEMORY_PLUGIN_URL is set; operators who haven't enabled the plugin keep the no-op default. Why a hook (not direct plugin import): * Keeps WorkspaceHandler decoupled from the memory contract package (easier to test, smaller blast radius if the contract bumps) * Tests inject a captureCleanupHook stub without standing up a real plugin client * Production wiring stays a one-liner in main.go What gets cleaned up: * `workspace:<id>` for each purged workspace * NOT `team:<root>` / `org:<root>` — those may still be referenced by other workspaces under the same root, so dropping them on a single workspace's purge would orphan team/org data for the survivors. Operator can purge those manually after confirming the entire root is gone. What stays untouched: * Soft-removed workspaces (status='removed', no ?purge=true). The grace window is by design — the data should still be there if the operator unremoves. Tests: * TestWithNamespaceCleanup_DefaultIsNil pins the safe default * TestWithNamespaceCleanup_NilStaysNil pins the explicit-nil case * TestWithNamespaceCleanup_AttachesFn pins the wiring * TestPurge_CallsCleanupHookPerID exercises the per-id loop body * TestPurge_NilHookIsSkipped pins the nil guard A full end-to-end Delete-handler test requires mocking broadcaster + provisioner + descendant SQL chain, which is out-of-scope for a single fixup. Integration coverage for the wired path lives in PR-11's E2E swap test (#293 follow-up).	2026-05-04 09:20:37 -07:00
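The best-effort hook loop from the commit above might look roughly like the Go sketch below. The type `NamespaceCleanupFn`, the function `purgeNamespaces`, and the error value are illustrative names, and the real hook signature may differ. The sketch shows the two properties the tests pin: a nil hook is skipped, and a failing call is logged without stopping the loop.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// NamespaceCleanupFn is an assumed signature for the optional hook.
type NamespaceCleanupFn func(workspaceID string) error

// errPluginDown is an illustrative failure used by the demo.
var errPluginDown = errors.New("plugin unreachable")

// purgeNamespaces calls the hook once per purged id and returns the
// number of calls made. Failures are logged and never propagated.
func purgeNamespaces(hook NamespaceCleanupFn, purgedIDs []string) int {
	if hook == nil {
		return 0 // no-op default when the plugin is not enabled
	}
	calls := 0
	for _, id := range purgedIDs {
		calls++
		if err := hook(id); err != nil {
			log.Printf("cleanup of workspace:%s failed (continuing): %v", id, err)
		}
	}
	return calls
}

func main() {
	ids := []string{"abc-123", "def-456"}
	fmt.Println(purgeNamespaces(nil, ids))
	failing := func(id string) error { return errPluginDown }
	fmt.Println(purgeNamespaces(failing, ids))
}
```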
Hongming Wang	9a64aeaa2c	Memory v2 fixup I3: admin export O(workspaces) → O(N_roots+1) Self-review #289. The previous exportViaPlugin ran one resolver CTE walk + one plugin search PER WORKSPACE. For a 1000-workspace tenant that's 1000× of each, mostly redundant — workspaces sharing a team/org root see identical readable namespaces. New strategy: 1. Single SQL pass returns each workspace + its computed root_id via a recursive CTE (loadWorkspacesWithRoots). 2. Group by root → unique tree count is typically << workspace count. 3. Resolver runs ONCE per root (any member sees the same readable list). 4. Build the union of all root namespaces; single plugin.Search call. 5. Map each memory back to a workspace_name via pickOwnerForNamespace (workspace:<id> → matching member; team:* / org:* / custom:* → canonical first member of root group). Net call cost: 1 SQL + N_roots resolver + 1 plugin call (vs N_workspaces × resolver + N_workspaces × plugin in the old code). Tests: * TestExport_BatchesPluginCallsByRoot pins the new behavior explicitly: 3 workspaces under 1 root → exactly 1 plugin search (was 3 with the old code). * TestPickOwnerForNamespace covers all five attribution cases: workspace:<id> match, workspace:<id> no-match-fallback, team:, org:, custom:* → first-member-of-root-group; plus empty-members fallback. * All 9 existing TestExport_* / TestImport_* / TestPickOwner / TestNamespaceKindFromLegacyScope / TestSkipImport / etc. tests remain green (verified with -run "Export"). The legacy DB path (when MEMORY_V2_CUTOVER unset) is unchanged.	2026-05-04 09:17:30 -07:00
Hongming Wang	d297e75fc9	Merge pull request #2746 from Molecule-AI/fix/memory-v2-i1-i4-small Memory v2 fixup I1+I4: expires_at validation + audit JSON marshal	2026-05-04 16:05:02 +00:00
Hongming Wang	d48693144b	Memory v2 fixup I1+I4: expires_at validation + audit JSON marshal Two small Important findings from self-review, bundled because both are <20 line changes touching the same file. I1: expires_at silent drop - mcp_tools_memory_v2.go:130 had `if t, err := ...; err == nil { ... }` which dropped malformed timestamps without telling the agent. Agent passes `expires_at: "tomorrow"`, gets a 200, and the memory has no TTL. - Now returns a clear error: "invalid expires_at: must be RFC3339" - Test renamed: TestCommitMemoryV2_BadExpiresIsIgnored (which codified the bug) → TestCommitMemoryV2_BadExpiresReturnsError (which pins the fix). I4: audit log JSON via Sprintf-%q - auditOrgWrite was building activity_logs.metadata via fmt.Sprintf with %q. Go-quoted strings happen to coincide with JSON-quoted for ASCII (and today's values are pure ASCII: UUID + hex digest) so the bug was latent. - Replaced with json.Marshal of map[string]string. Same wire shape today, but won't silently produce invalid JSON if metadata grows to include arbitrary content snippets. - New test TestAuditOrgWrite_MetadataIsValidJSON uses a custom sqlmock.Argument matcher (jsonValidMatcher) that fails the test if the metadata column isn't parseable JSON. The test runs auditOrgWrite with a content string containing quotes, backslashes, and a control byte — values where %q would diverge from JSON-quote. Both pre-existing tests (TestCommitMemoryV2_AuditsOrgWrites etc.) remain green.	2026-05-04 08:57:58 -07:00
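To illustrate the I4 finding above, here is a small Go sketch of why `%q` and JSON quoting agree for ASCII but diverge for control bytes. The `content` key and the helper names are assumptions for the example only; the real `auditOrgWrite` metadata has its own fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadataViaSprintf mimics the latent-bug shape: a Go-quoted value
// spliced into a JSON template.
func metadataViaSprintf(content string) string {
	return fmt.Sprintf(`{"content":%q}`, content)
}

// metadataViaMarshal builds the same object with encoding/json.
func metadataViaMarshal(content string) string {
	b, err := json.Marshal(map[string]string{"content": content})
	if err != nil {
		return ""
	}
	return string(b)
}

// isValidJSON reports whether s parses as JSON.
func isValidJSON(s string) bool { return json.Valid([]byte(s)) }

func main() {
	ascii := "9f86d081"
	tricky := "say \"hi\" \\ \x01" // quotes, a backslash, a control byte

	// Pure ASCII: both encodings are valid JSON.
	fmt.Println(isValidJSON(metadataViaSprintf(ascii)), isValidJSON(metadataViaMarshal(ascii)))
	// Control byte: %q emits \x01, which is not a JSON escape.
	fmt.Println(isValidJSON(metadataViaSprintf(tricky)), isValidJSON(metadataViaMarshal(tricky)))
}
```

Go's `%q` writes the control byte as `\x01`, while JSON only permits `\u0001`, which is the form `json.Marshal` produces.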
Hongming Wang	1e97fb9a16	Memory v2 fixup C1: backfill idempotency via MemoryWrite.id Self-review (post-merge) flagged that the backfill claimed to be idempotent on re-run but actually duplicates every row because the plugin's INSERT uses gen_random_uuid() and ignores any id passed in. Fix is contract-level: extend MemoryWrite with an optional `id` idempotency key. When supplied, the plugin MUST treat the write as upsert keyed on this id; when omitted, the plugin generates a fresh UUID (production agent commits keep working unchanged). Changes: * docs/api-protocol/memory-plugin-v1.yaml: add id field with description that flags it as idempotency key * internal/memory/contract/contract.go: add ID to MemoryWrite struct, update memory_write_minimal golden vector * internal/memory/pgplugin/store.go: split CommitMemory into two paths — upsert when body.ID set (INSERT ... ON CONFLICT (id) DO UPDATE), plain INSERT otherwise * cmd/memory-backfill/main.go: pass agent_memories.id to MemoryWrite, fix the false comment about 409 deduplication New tests: * pgplugin: TestCommitMemory_WithIDUpserts pins the upsert SQL is used when id is set; TestCommitMemory_UpsertScanError covers the error branch * backfill: TestBackfill_PassesSourceUUIDAsIdempotencyKey pins the forwarding behavior; TestBackfill_RerunIsIdempotent simulates a retry and asserts both runs pass the same uuid (plugin upsert is what makes this safe) Why this matters: operators retrying a failed backfill (which they will — networks fail, transactions abort) would otherwise create N duplicates per memory. The duplicates aren't visible until search results show obvious dupes — debugging that under prod load is bad. Production agent commits are unaffected: they leave id empty, the plugin generates a fresh UUID via gen_random_uuid(), zero behavior change for the hot path.	2026-05-04 08:54:13 -07:00
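The idempotency contract above can be sketched with an in-memory stand-in for the plugin store. This is not the pgplugin implementation, which uses `INSERT ... ON CONFLICT (id) DO UPDATE`. The `store` type and the counter-based ID generator are assumptions chosen to keep the example self-contained.

```go
package main

import "fmt"

// MemoryWrite carries an optional ID that acts as an idempotency key.
type MemoryWrite struct {
	ID      string
	Content string
}

// store is a hypothetical in-memory stand-in for the plugin's table.
type store struct {
	rows map[string]string
	next int
}

func newStore() *store { return &store{rows: make(map[string]string)} }

// Commit upserts when an ID is supplied and inserts under a fresh ID
// otherwise (a counter here, gen_random_uuid() in the real plugin).
func (s *store) Commit(w MemoryWrite) string {
	id := w.ID
	if id == "" {
		s.next++
		id = fmt.Sprintf("generated-%d", s.next)
	}
	s.rows[id] = w.Content
	return id
}

func main() {
	s := newStore()
	// A backfill run and its retry pass the same source UUID.
	s.Commit(MemoryWrite{ID: "src-uuid-1", Content: "note"})
	s.Commit(MemoryWrite{ID: "src-uuid-1", Content: "note"})
	fmt.Println("rows after retry:", len(s.rows))
	// Agent commits leave ID empty and always create a new row.
	s.Commit(MemoryWrite{Content: "note"})
	s.Commit(MemoryWrite{Content: "note"})
	fmt.Println("rows after agent commits:", len(s.rows))
}
```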
Hongming Wang	b07575c710	Merge branch 'staging' into feat/memory-v2-pr11-e2e-swap	2026-05-04 08:24:26 -07:00
Hongming Wang	b937415e1e	Memory v2 PR-11: E2E test — flat-plugin swap proves contract works Final implementation PR. Builds on PR-1..10 (all merged or queued). Proves the central design property of the plugin contract: ANY plugin satisfying the v1 OpenAPI spec works as a drop-in replacement for the built-in postgres plugin. If this test fails after a refactor, the contract has drifted in a way that breaks ecosystem plugins. What ships: * internal/memory/e2e/swap_test.go — five E2E tests against a deliberately minimal "flat-memory" stub plugin (~50 LOC, single map, zero capabilities) * MCPHandler.Dispatch — small exported wrapper around dispatch so out-of-package E2E tests can drive tools by name without duplicating the whole MCP RPC stack E2E coverage: * TestE2E_FlatPluginRoundTrip: full lifecycle - list_writable_namespaces returns 3 entries - commit_memory_v2 writes through plugin - search_memory finds it back - commit_summary writes a summary - forget_memory deletes - search after forget excludes the deleted memory * TestE2E_LegacyShimRoutesThroughFlatPlugin: PR-6 shim wired up - Legacy commit_memory(scope=LOCAL) ends up in plugin storage - Legacy recall_memory finds it back through plugin search - Response shapes preserved (scope:LOCAL stays scope:LOCAL) * TestE2E_OrgMemoriesDelimiterWrap: prompt-injection mitigation - Org-namespace memory committed - Audit INSERT into activity_logs verified - Search returns content with [MEMORY id=... scope=ORG ns=...] 
prefix applied * TestE2E_StubPluginCapabilitiesAreEmpty: capability negotiation - Stub plugin reports zero capabilities - Client.SupportsCapability returns false for FTS, embedding - Confirms graceful degradation when plugin doesn't support a feature * TestE2E_PluginUnreachable_AgentSeesClearError: failure surface - Plugin URL pointing at bogus port - commit_memory_v2 returns informative error - No nil-pointer dereference; error message is actionable The flat plugin is intentionally minimal — it has no namespaces table distinct from memory records, no FTS, no semantic search, no TTL. The test proves operators can drop in a 50-line plugin and the agent behavior is identical (modulo capability-gated features).	2026-05-04 08:20:35 -07:00
Hongming Wang	7b0bd32957	Memory v2 PR-8: cutover — admin export/import via plugin Builds on merged PR-1..7. Adds the operator-controlled cutover flag that flips admin export/import from the legacy direct-DB path to the v2 plugin path. Activation: MEMORY_V2_CUTOVER=true AND the v2 plugin is wired via WithMemoryV2. Both must be true to take the new path; either being false falls through to the existing legacy SQL code unchanged. What ships: * AdminMemoriesHandler gains plugin + resolver fields, wired via WithMemoryV2 (production) / withMemoryV2APIs (tests) * Export: enumerates workspaces, asks resolver for each one's readable namespaces, searches each via plugin, deduplicates by memory id, applies SAFE-T1201 redaction on emitted content (F1084 parity). Returns the legacy memoryExportEntry shape so existing tooling keeps working. * Import: scope→namespace translation mirrors PR-6 shim. Uses UpsertNamespace + CommitMemory; runs SAFE-T1201 redaction BEFORE the plugin sees the content (F1085 parity). * Helpers: legacyScopeFromNamespace + namespaceKindFromLegacyScope (lifted out so admin_memories doesn't depend on MCP handler helpers). skipImport typed error. Operational rollout (cutover sequencing): 1. Today: MEMORY_V2_CUTOVER unset → legacy DB path. 2. After PR-7 backfill applied + smoke verified: operator sets MEMORY_V2_CUTOVER=true. 3. From that point, admin export/import operate on plugin storage; legacy agent_memories table is read-only for the ~60-day grace window before PR-9 drops it. 
Coverage on new paths: * cutoverActive: 100% * WithMemoryV2 / withMemoryV2APIs: 100% * importViaPlugin: 100% * exportViaPlugin: 97.2% (one defensive scan-error branch in the workspace-list loop) * scopeToWritableNamespaceForImport: 76.9% (resolver-error and no-matching-kind branches exercised end-to-end via Import) * legacyScopeFromNamespace + namespaceKindFromLegacyScope: 100% Edge cases pinned: * Cutover flag matrix (env unset/true/false × wired/unwired) * Export deduplicates memories shared across team (one row per id) * Export tolerates per-workspace failures (resolver / plugin) and keeps going on the rest * Export returns 500 only when the top-level workspace query fails * Empty readable namespaces → empty export (no panic) * Export redacts secrets in plugin path * Import: unknown workspace skipped, unknown scope skipped, plugin upsert/commit errors counted as errors * Import redacts secrets BEFORE plugin sees content * Legacy export/import path unchanged when cutover flag unset	2026-05-04 08:15:10 -07:00
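The activation rule for the cutover (flag set to true AND plugin wired) reduces to a two-input predicate. The Go sketch below is illustrative only: the real `cutoverActive` reads handler state, and whether it accepts values other than the literal `true` is not stated in the commit, so exact string matching here is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// cutoverActive is a simplified sketch: the plugin path is taken only
// when the flag value is exactly "true" and the plugin is wired.
func cutoverActive(flagValue string, pluginWired bool) bool {
	return flagValue == "true" && pluginWired
}

func main() {
	// The flag matrix from the edge cases: unset/true/false x wired/unwired.
	for _, flag := range []string{"", "true", "false"} {
		for _, wired := range []bool{false, true} {
			fmt.Printf("flag=%q wired=%v -> plugin path: %v\n",
				flag, wired, cutoverActive(flag, wired))
		}
	}
	// In a server the flag would come from the environment.
	fmt.Println("this process:", cutoverActive(os.Getenv("MEMORY_V2_CUTOVER"), false))
}
```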
Hongming Wang	290e6dfdc3	Memory v2 PR-6: backward-compat shim — legacy tools route to v2 Builds on merged PR-1..5. Adds the bridge that lets legacy commit_memory / recall_memory tools route through the v2 plugin path when MEMORY_PLUGIN_URL is wired, otherwise fall through to the existing DB-backed code unchanged. What ships: * handlers/mcp_tools_memory_legacy_shim.go — translation helpers: scopeToWritableNamespace, scopeToReadableNamespaces, commitMemoryLegacyShim, recallMemoryLegacyShim, namespaceKindToLegacyScope * handlers/mcp_tools.go — toolCommitMemory + toolRecallMemory now delegate to the shim when memv2 is wired Translation: commit: LOCAL → workspace:<self> TEAM → team:<root> (resolver picks at runtime) empty → defaults to LOCAL (preserves legacy default) GLOBAL → still rejected at MCP bridge (C3 preserved) recall: LOCAL → search restricted to workspace:<self> TEAM → workspace:<self> + team:<root> empty → all readable (matches v2 default behavior) GLOBAL → blocked at MCP bridge (C3 preserved) Response shapes are preserved exactly: commit: {"id":"...","scope":"LOCAL"|"TEAM"} — agents see no diff recall: [{"id":"...","content":"...","scope":"LOCAL"|...,"created_at":"..."}, ...] org-namespace memories get the same [MEMORY id=... scope=ORG ns=...] prefix as v2 search; legacy scope label comes back as "GLOBAL" Operational rollout: * Today: MEMORY_PLUGIN_URL unset on most operators → legacy DB path * After PR-7 backfill: operators set MEMORY_PLUGIN_URL → all writes flow through plugin transparently * After PR-8 cutover: dual-write removed, plugin is the only path * After PR-9 (~60 days later): legacy tool entries dropped entirely Coverage: 100% on every helper, 100% on recallMemoryLegacyShim, 94.7% on commitMemoryLegacyShim. The 1 uncovered line is a defensive guard against a v2-response-parse error that's unreachable when the v2 tool is operating correctly (it always returns valid JSON). 
Edge cases pinned: * scope translation for every legacy value + invalid scope * resolver error propagation * plugin error propagation * GLOBAL still blocked * default-scope fallback (LOCAL) * empty content rejected * No-op when v2 unwired (legacy SQL path exercised via sqlmock) * org-namespace memory wrap on recall + GLOBAL scope label round-trip * No-results returns "No memories found." (legacy message preserved)	2026-05-04 08:01:41 -07:00
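The scope translation table in the commit above can be written as two small Go functions. This is a sketch under assumptions: the names `commitNamespace` and `recallNamespaces` are hypothetical, IDs are plain strings, and the real shim asks the resolver for the team root at runtime rather than taking it as a parameter.

```go
package main

import (
	"errors"
	"fmt"
)

var errGlobalBlocked = errors.New("GLOBAL scope is blocked at the MCP bridge")

// commitNamespace maps a legacy commit scope to one writable namespace.
func commitNamespace(scope, selfID, rootID string) (string, error) {
	switch scope {
	case "", "LOCAL": // empty defaults to LOCAL
		return "workspace:" + selfID, nil
	case "TEAM":
		return "team:" + rootID, nil
	case "GLOBAL":
		return "", errGlobalBlocked
	default:
		return "", fmt.Errorf("invalid scope %q", scope)
	}
}

// recallNamespaces maps a legacy recall scope to the namespaces searched.
// allReadable is what the resolver reports for this workspace.
func recallNamespaces(scope, selfID, rootID string, allReadable []string) ([]string, error) {
	switch scope {
	case "LOCAL":
		return []string{"workspace:" + selfID}, nil
	case "TEAM":
		return []string{"workspace:" + selfID, "team:" + rootID}, nil
	case "": // empty means everything readable
		return allReadable, nil
	case "GLOBAL":
		return nil, errGlobalBlocked
	default:
		return nil, fmt.Errorf("invalid scope %q", scope)
	}
}

func main() {
	ns, _ := commitNamespace("", "ws-1", "root-1")
	fmt.Println(ns)
	list, _ := recallNamespaces("TEAM", "ws-1", "root-1", nil)
	fmt.Println(list)
	_, err := commitNamespace("GLOBAL", "ws-1", "root-1")
	fmt.Println(err)
}
```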
Hongming Wang	5bfa4b1d80	Memory v2 PR-5: 6 new MCP tools wired through the plugin Builds on PR-1, PR-2, PR-3, PR-4 (all merged). Adds the agent-facing v2 surface for the memory plugin contract. What ships (all in handlers/mcp_tools_memory_v2.go, no edits to the legacy commit_memory / recall_memory paths): commit_memory_v2 — write to a namespace; default workspace:self search_memory — search across namespaces; default = all readable commit_summary — kind=summary, 30-day default TTL, runtime-overridable list_writable_namespaces — discover what you can write to list_readable_namespaces — discover what you can read from forget_memory — delete by id, only in namespaces you can write to Workspace-server is the security perimeter — every layer the plugin mustn't be trusted with runs here: * SAFE-T1201 redactSecrets BEFORE every plugin write * Server-side ACL re-validation: CanWrite + IntersectReadable run on EVERY request, never trusting client-supplied namespaces (a canvas re-parent between list_writable and commit would otherwise let a stale namespace slip through) * org:* writes audited to activity_logs (SHA256, not plaintext) — matches memories.go:201-221 so the schema stays uniform * Audit failure does NOT block the write (logged + continue) — failing closed would deny org-scope writes whenever activity_logs is unhappy * org:* memories get the [MEMORY id=... scope=ORG ns=...]: prefix on read — preserves the prompt-injection mitigation from memories.go:455-461 Coexistence design: legacy commit_memory + recall_memory still wired to their old code paths in mcp_tools.go. PR-6 will alias them to delegate to these v2 implementations. PR-9 (60 days post-cutover) removes the legacy entries. 
Wiring: * MCPHandler gains a memv2 field (nil-safe; tools return a clear error when MEMORY_PLUGIN_URL is unset rather than crashing) * WithMemoryV2(plugin, resolver) is the production wiring API main.go calls at boot * withMemoryV2APIs(plugin, resolver) is the test-injectable variant against the memoryPluginAPI / namespaceResolverAPI interfaces Coverage: 100.0% on every new function in mcp_tools_memory_v2.go. Edge cases pinned: * empty/whitespace content → reject before plugin * plugin unconfigured → clear error, no crash * ACL violation → clear error * resolver error → wrapped error * plugin error → wrapped error * malformed expires_at → silently ignored (no exception) * org write audit failure → logged, write proceeds * search namespace intersection drops foreign entries * search with all-foreign namespaces → empty result, plugin not called * search org memories get delimiter wrap, workspace memories do not * forget with explicit + default namespace * forget cross-scope rejected * pickStr / pickStringSlice handle missing keys, wrong types, mixed slices * wrapOrgDelimiter format is exact-match * dispatch wires all 6 tools (no "unknown tool" error)	2026-05-04 07:50:26 -07:00
Hongming Wang	f2397bf138	Merge pull request #2733 from Molecule-AI/feat/memory-v2-pr3-postgres-plugin Memory v2 PR-3: built-in postgres plugin server + schema migrations	2026-05-04 14:37:24 +00:00
Hongming Wang	ff5f4cbf7c	Memory v2 PR-3: built-in postgres plugin server + schema migrations Builds on merged PR-1 (#2729), independent of PR-2/PR-4. Implements every endpoint of the v1 plugin contract behind an HTTP server (cmd/memory-plugin-postgres/) backed by postgres. Operators run this binary next to workspace-server; it's the default implementation MEMORY_PLUGIN_URL points at. What ships: - cmd/memory-plugin-postgres/main.go: boot, signal-driven shutdown, boot-time migrations, configurable LISTEN/DATABASE/MIGRATION_DIR - cmd/memory-plugin-postgres/migrations/001_memory_v2.up.sql: memory_namespaces (PK on name, kind CHECK, expires_at, metadata) memory_records (FK to namespaces with CASCADE, kind+source CHECK, pgvector embedding, FTS tsvector, ivfflat partial index on embedding, partial index on expires_at) - internal/memory/pgplugin/store.go: storage layer using lib/pq - internal/memory/pgplugin/handlers.go: HTTP layer (no router dep — a switch on URL.Path keeps the binary's dep surface tiny) - 100% statement coverage on store.go + handlers.go Schema notes: - These tables live next to the plugin binary, NOT in workspace- server/migrations/. When operators swap the plugin, these tables become orphaned (operator drops manually). Documented in PR-10. - Search supports semantic (pgvector cosine) → FTS (>=2 char query) → ILIKE (1-char query) → recent-listing (no query), with a TTL filter applied uniformly across all paths. - DELETE on namespace cascades to memory_records (FK ON DELETE CASCADE) — a deleted namespace immediately frees its memories. 
Coverage corner cases pinned: - Health: ok, degraded (db ping fails), no-ping fn - Every CRUD endpoint: happy path, bad name, bad JSON, bad body, not-found, store errors, exec/scan/marshal errors - Search: FTS, semantic, short-query (ILIKE), no-query (recent), kinds filter, store errors, scan errors, mid-iteration row error - Routing edge cases: unknown path, empty namespace, unknown sub, method-not-allowed, GET on /v1/health (allowed), POST on /v1/health (404), GET on /v1/search (404) - Helper internals: marshalMetadata (nil/happy/unmarshalable), nullTime (nil/non-nil), vectorString (empty/format), nullVectorString (empty/non-empty), scanNamespace + scanMemory metadata-decode errors No callers in workspace-server yet; integration starts in PR-5 (MCP handlers wire the plugin client through to MCP tools).	2026-05-04 07:31:56 -07:00
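The search fallback order in the schema notes above (semantic, then FTS, then ILIKE, then recent listing) can be sketched as a mode selector in Go. The function name and the returned labels are hypothetical, and counting characters as runes is an assumption. The real store builds SQL for each path and applies the TTL filter on all of them.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// chooseSearchMode picks a search strategy in the order the commit lists.
// hasEmbedding means the request carries a query embedding.
func chooseSearchMode(query string, hasEmbedding bool) string {
	n := utf8.RuneCountInString(query)
	switch {
	case hasEmbedding:
		return "semantic" // pgvector cosine distance
	case n >= 2:
		return "fts" // full-text search needs at least 2 characters
	case n == 1:
		return "ilike" // single-character queries fall back to ILIKE
	default:
		return "recent" // no query: list most recent memories
	}
}

func main() {
	fmt.Println(chooseSearchMode("deploy notes", true))
	fmt.Println(chooseSearchMode("deploy notes", false))
	fmt.Println(chooseSearchMode("x", false))
	fmt.Println(chooseSearchMode("", false))
}
```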
Hongming Wang	01b653d6b0	Memory v2 PR-4: namespace resolver + tests Stacked on PR-1 (#2729). Computes the readable/writable namespace lists for a workspace from the live workspaces tree at request time. No precomputed columns, no migrations — re-parenting on canvas takes effect immediately on the next memory call. What ships: - workspace-server/internal/memory/namespace/resolver.go - walkChain: recursive CTE, walks parent_id chain to root, capped at depth 50 to defend against malformed/cyclic data - derive: maps a chain to (workspace, team, org) namespace strings - ReadableNamespaces / WritableNamespaces: the public API - CanWrite + IntersectReadable: server-side ACL helpers MCP handlers (PR-5) will call before talking to the plugin - resolver_test.go: 100% statement coverage Design choices worth flagging: - Today's tree is depth-1 (root + children). The recursive CTE handles arbitrary depth so we don't have to revisit the resolver when the tree deepens. - GLOBAL→org write restriction (memories.go:167-174) is preserved by gating the org namespace's Writable flag on parent_id IS NULL. - Removed-status workspaces are NOT filtered from the chain walk — matches today's TEAM behavior (memories.go:367-372 filters on read, not on tree walk). - IntersectReadable with empty `requested` returns ALL readable namespaces (default-search-everything semantic from the discovery tools spec). This package has zero callers in this PR; integration starts in PR-5.	2026-05-04 07:25:33 -07:00
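A rough Go sketch of the two ACL helpers named in the commit above, written as free functions over string slices. The real `CanWrite` and `IntersectReadable` live on the resolver and derive their lists from the workspaces tree, so the signatures here are assumptions. The sketch keeps the documented semantic that an empty `requested` list returns every readable namespace.

```go
package main

import "fmt"

// intersectReadable keeps only the requested namespaces the caller may
// read. An empty request means "search everything readable".
func intersectReadable(readable, requested []string) []string {
	if len(requested) == 0 {
		return append([]string(nil), readable...)
	}
	allowed := make(map[string]bool, len(readable))
	for _, ns := range readable {
		allowed[ns] = true
	}
	var out []string
	for _, ns := range requested {
		if allowed[ns] {
			out = append(out, ns) // foreign entries are dropped silently
		}
	}
	return out
}

// canWrite reports whether ns is in the caller's writable set.
func canWrite(writable []string, ns string) bool {
	for _, w := range writable {
		if w == ns {
			return true
		}
	}
	return false
}

func main() {
	readable := []string{"workspace:a", "team:root", "org:root"}
	fmt.Println(intersectReadable(readable, nil))
	fmt.Println(intersectReadable(readable, []string{"team:root", "workspace:other"}))
	fmt.Println(canWrite([]string{"workspace:a", "team:root"}, "org:root"))
}
```

Running the intersection on every request, rather than trusting a list the client obtained earlier, is what closes the re-parenting gap the PR-5 commit describes.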
Hongming Wang	c1cff3169f	Memory v2 PR-2: HTTP plugin client + breaker + capability negotiation Builds on PR-1 (#2729). Implements every endpoint in the OpenAPI spec plus two operational concerns the agent never sees: 1. Capability negotiation. Boot/Refresh probes /v1/health and captures the plugin's capability list. MCP handlers (PR-5) ask SupportsCapability before exposing capability-gated features — e.g., agents can only request semantic search when "embedding" is reported. 2. Circuit breaker. Three consecutive failures open the breaker for 60 seconds; while open, calls fail fast with ErrBreakerOpen. Picked these constants because: - 3 failures: long enough to skip transient blips, short enough to react before all in-flight handlers stack on the timeout - 60s cooldown: long enough to back off a flapping plugin, short enough that recovery is felt within a single session 4xx responses do NOT count toward the breaker (those are client bugs, not plugin health issues); 5xx + transport errors do. What ships: - workspace-server/internal/memory/client/client.go - client_test.go: 100% statement coverage Coverage corner cases pinned: - env-var success branches in New (parseDurationEnv applied) - json.Marshal error (via channel in Propagation) - http.NewRequestWithContext error (via unbalanced bracket in BaseURL) - 204 NoContent on endpoint that normally has a body - 4xx vs 5xx breaker behavior (4xx must NOT trip) - breaker cooldown elapsed → reset on next success - all 6 public endpoints fail-fast when breaker is open This package has no callers in this PR; integration starts in PR-5.	2026-05-04 06:57:24 -07:00
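The breaker rules above (three consecutive failures open it for 60 seconds, 4xx responses do not count) can be sketched in Go as follows. This is not the client package's implementation: the type and method names are hypothetical, an injected clock replaces real time for demonstration, and leaving the failure counter untouched on 4xx is an assumption the commit does not spell out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	maxFailures = 3
	cooldown    = 60 * time.Second
)

var ErrBreakerOpen = errors.New("memory plugin breaker open")

// fakeClock lets the demo move time forward without sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time          { return c.t }
func (c *fakeClock) Advance(d time.Duration) { c.t = c.t.Add(d) }

type breaker struct {
	mu        sync.Mutex
	now       func() time.Time
	failures  int
	openUntil time.Time
}

func newBreaker(now func() time.Time) *breaker { return &breaker{now: now} }

// Allow fails fast while the breaker is open.
func (b *breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.now().Before(b.openUntil) {
		return ErrBreakerOpen
	}
	return nil
}

// Record feeds one call outcome into the breaker.
func (b *breaker) Record(status int, transportErr error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch {
	case transportErr != nil || status >= 500:
		b.failures++
		if b.failures >= maxFailures {
			b.openUntil = b.now().Add(cooldown)
		}
	case status >= 400:
		// 4xx is a client bug, not a plugin health signal (counter untouched).
	default:
		b.failures = 0 // success resets the consecutive-failure count
		b.openUntil = time.Time{}
	}
}

func main() {
	clock := &fakeClock{}
	b := newBreaker(clock.Now)
	b.Record(404, nil) // does not count
	for i := 0; i < maxFailures; i++ {
		b.Record(503, nil)
	}
	fmt.Println(b.Allow())
	clock.Advance(cooldown + time.Second)
	fmt.Println(b.Allow())
}
```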
Hongming Wang	53d823e719	Memory v2 PR-1: OpenAPI plugin contract + Go bindings First of 11 PRs implementing the memory-system plugin refactor (RFC #2728). This PR is pure additive scaffolding — no behavior change, no integration yet. It defines the wire shape between workspace-server and a memory plugin so PR-2 (HTTP client) and PR-3 (built-in postgres plugin) can be built against a single source of truth. What ships: - docs/api-protocol/memory-plugin-v1.yaml: OpenAPI 3.0.3 spec covering /v1/health, namespace upsert/patch/delete, memory commit, search, forget. Auth-free (private network only); workspace-server is the only sanctioned client and the security perimeter. - workspace-server/internal/memory/contract: typed Go bindings with Validate() methods on every wire object so both client (PR-2) and server (PR-3) self-check at the boundary. - Round-trip JSON tests for every type (catch asymmetric tag bugs). - 5 golden vector files under testdata/ pinning the exact wire shape; update via UPDATE_GOLDENS=1. Coverage: 100% of statements in contract.go. The validation rules encode design decisions worth flagging in review: - SearchRequest with empty Namespaces is REJECTED at plugin level — workspace-server is required to intersect the readable set server-side; an empty list reaching the plugin is a bug. - NamespacePatch with no fields is REJECTED — empty patches are pointless round-trips. - MemoryWrite with whitespace-only Content is REJECTED — zero-info memories pollute search results. No code yet calls into this package; integration starts in PR-2.	2026-05-04 06:45:52 -07:00
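The three validation rules flagged above could be expressed as `Validate()` methods along these lines. The struct fields are simplified guesses, not the actual contract types (in particular the `NamespacePatch` fields are assumed). Only the rejection rules themselves come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// SearchRequest: an empty Namespaces list reaching the plugin is a bug.
type SearchRequest struct {
	Namespaces []string
	Query      string
}

func (r SearchRequest) Validate() error {
	if len(r.Namespaces) == 0 {
		return errors.New("search: namespaces must not be empty")
	}
	return nil
}

// NamespacePatch: field set is assumed; a patch with nothing set is rejected.
type NamespacePatch struct {
	ExpiresAt *string
	Metadata  map[string]string
}

func (p NamespacePatch) Validate() error {
	if p.ExpiresAt == nil && p.Metadata == nil {
		return errors.New("patch: at least one field must be set")
	}
	return nil
}

// MemoryWrite: whitespace-only content carries no information.
type MemoryWrite struct {
	Content string
}

func (w MemoryWrite) Validate() error {
	if strings.TrimSpace(w.Content) == "" {
		return errors.New("write: content must not be blank")
	}
	return nil
}

func main() {
	fmt.Println(SearchRequest{}.Validate())
	fmt.Println(NamespacePatch{}.Validate())
	fmt.Println(MemoryWrite{Content: " \t\n"}.Validate())
	fmt.Println(MemoryWrite{Content: "remember this"}.Validate())
}
```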
Hongming Wang	be997883c9	Centralize backend selection in provisionWorkspaceAuto User-reported 2026-05-04: deploying a team org-template ("Design Director" + 6 sub-agents) on a SaaS tenant produced 7-of-7 WORKSPACE_PROVISION_FAILED with the misleading message "container started but never called /registry/register". Diagnose returned "docker client not configured on this workspace-server" and the workspace rows had no instance_id. Root cause: TeamHandler.Expand hardcoded h.wh.provisionWorkspace — the Docker leg of WorkspaceHandler. WorkspaceHandler.Create branched on h.cpProv to pick CP-managed EC2 (SaaS) vs local Docker (self-hosted), but Expand never used that branch. On SaaS the docker goroutine ran but had no socket, so children silently sat in "provisioning" until the 600s sweeper marked them failed. Architectural principle (user): templates own runtime/config/prompts/files/plugins; the platform owns where it runs. Backend selection belongs in one helper. Fix: - Extract WorkspaceHandler.provisionWorkspaceAuto: picks CP when cpProv is set, Docker when only provisioner is set, returns false when neither (caller marks failed). - WorkspaceHandler.Create routes through Auto. - TeamHandler.Expand routes through Auto. Tests pin three invariants: - TestProvisionWorkspaceAuto_NoBackendReturnsFalse — Auto signals fall-through correctly so the caller can persist + mark-failed. - TestProvisionWorkspaceAuto_RoutesToCPWhenSet — when cpProv is wired, Start lands on CP (the user-visible regression target). Discipline-verified: removing the cpProv branch fails this. - TestTeamExpand_UsesAutoNotDirectDockerPath — source-level guard against future refactors reintroducing the hardcoded Docker call. Discipline-verified: reverting team.go fails this with a clear message naming the bug class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 03:43:41 -07:00
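The selection rule in the fix above (CP when `cpProv` is set, Docker when only `provisioner` is set, false when neither) is sketched below in Go. The `backend` interface and the `recorder` stub are hypothetical stand-ins for the real provisioner types, whose `Start` signatures are not shown in the commit.

```go
package main

import "fmt"

// backend is an assumed minimal interface for either provisioning leg.
type backend interface {
	Start(workspaceID string)
}

// recorder is a stub that remembers which workspaces it was asked to start.
type recorder struct {
	started []string
}

func (r *recorder) Start(id string) { r.started = append(r.started, id) }

type WorkspaceHandler struct {
	cpProv      backend // control-plane managed (SaaS)
	provisioner backend // local Docker (self-hosted)
}

// provisionWorkspaceAuto picks the backend in one place. It returns false
// when no backend is configured so the caller can mark the workspace failed.
func (h *WorkspaceHandler) provisionWorkspaceAuto(id string) bool {
	switch {
	case h.cpProv != nil:
		h.cpProv.Start(id)
		return true
	case h.provisioner != nil:
		h.provisioner.Start(id)
		return true
	default:
		return false
	}
}

func main() {
	cp, docker := &recorder{}, &recorder{}
	saas := &WorkspaceHandler{cpProv: cp, provisioner: docker}
	fmt.Println(saas.provisionWorkspaceAuto("child-1"), cp.started, docker.started)

	none := &WorkspaceHandler{}
	fmt.Println(none.provisionWorkspaceAuto("child-2"))
}
```

Routing both `Create` and `Expand` through one function like this is what prevents a second call site from hardcoding the Docker leg again.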
Hongming Wang	bcea8ac822	Broaden empty-URL 422 to cover NULL delivery_mode (production reality) Live-probed user's tenant: three of three external-runtime workspaces register with delivery_mode = NULL, not "poll". The earlier narrow poll-only check fell through to the misleading 503 for the actually- observed shape. Invariant we want: URL empty + not-exactly-"push" → no dispatch path will ever exist → 422. Only push-mode with empty URL is genuinely transient (mid-boot, restart in progress) → 503. Added TestChatUpload_NullModeEmptyURL using the user's actual workspace ID. Existing TestChatUpload_NoURL switched to explicit "push" mode (was relying on default — unsafe given the new branching). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 02:42:46 -07:00
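The invariant stated above (empty URL plus anything other than exactly "push" yields 422, and only push mode with an empty URL yields 503) fits in a few lines of Go. The function name is hypothetical, and a `*string` models the nullable `delivery_mode` column. The real handler also writes an explanatory message alongside the status.

```go
package main

import (
	"fmt"
	"net/http"
)

// emptyURLStatus decides the response when a workspace has no URL.
// deliveryMode is nil when the column is NULL.
func emptyURLStatus(deliveryMode *string) int {
	if deliveryMode != nil && *deliveryMode == "push" {
		// Genuinely transient: the workspace may be mid-boot or restarting.
		return http.StatusServiceUnavailable
	}
	// NULL, "poll", or anything else: no dispatch path will ever exist.
	return http.StatusUnprocessableEntity
}

func main() {
	push, poll := "push", "poll"
	fmt.Println(emptyURLStatus(&push))
	fmt.Println(emptyURLStatus(&poll))
	fmt.Println(emptyURLStatus(nil))
}
```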
Hongming Wang	87ae691e67	Distinguish poll-mode workspace from transient empty-URL on chat upload External-runtime workspaces that register in poll mode have no callback URL by design — the platform never dispatches to them, so chat upload (HTTP-forward by design) can't proceed. Returning 503 + "workspace url not registered yet" was misleading: the "yet" implied transient state, but the URL would never arrive. Caught externally on 2026-05-04: user uploading an image to an external "mac laptop" runtime workspace saw the 503 and assumed they should retry. The workspace's poll mode meant retrying would never help. Fix: include delivery_mode in the workspace lookup. When URL is empty: - poll mode → 422 + "re-register in push mode with a public URL" (Unprocessable Entity — this request can't succeed against this workspace's configuration; no retry will help) - push mode → 503 + "not registered yet" (genuine transient state — retry after next heartbeat is correct) Test: TestChatUpload_PollModeEmptyURL pins the new 422 path; existing TestChatUpload_NoURL strengthened to assert the "not registered yet" substring stays on the push branch (it would have silently passed if the new 422 path had clobbered both branches). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 02:42:46 -07:00
Hongming Wang	d5eb58af56	feat(external-connect): comprehensive setup — fix Claude Code channel snippet + add per-tab Help section User report: handing the modal's Claude Code channel snippet to an agent fails immediately with two errors that the snippet doesn't tell the operator how to resolve: plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · plugin not installed plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · not on the approved channels allowlist Root cause: the snippet's `claude --channels plugin:...` line assumes the plugin is pre-installed AND that the channel is on Anthropic's default allowlist. Both assumptions are wrong for a custom Molecule plugin in a public repo. Two changes: 1. Rewrite externalChannelTemplate (Go) with full setup chain: - Bun prereq check (channel plugins are Bun scripts) - `/plugin marketplace add Molecule-AI/molecule-mcp-claude-channel` + `/plugin install molecule@molecule-mcp-claude-channel` BEFORE the launch — otherwise "plugin not installed" - `--dangerously-load-development-channels` flag on launch — required for non-Anthropic-allowlisted channels, otherwise "not on approved channels allowlist" - Common-errors block at the bottom mapping each error string to which numbered step recovers it - Team/Enterprise managed-settings caveat (the dev-channels flag is blocked there; admin must use channelsEnabled + allowedChannelPlugins) Plugin install info verified by reading `Molecule-AI/molecule-mcp-claude-channel` plugin.json (`name: "molecule"`) and the Claude Code channels + plugin-discovery docs at code.claude.com/docs/en/{channels,discover-plugins}. 2. 
Add per-tab HelpBlock to the modal (canvas): - Collapsible <details> below each snippet, closed by default so the snippet stays the visual focus - "Where to install" link (PyPI for runtime, claude.com for Claude Code, github.com/openai/codex for Codex, NousResearch/hermes-agent for Hermes) - "Documentation" link (docs.molecule.ai/docs/guides/; hostname confirmed by existing blog post canonical metadata; paths map 1:1 to docs/guides/.md files in this repo) - "Common errors" list with concrete recovery steps for each tab (e.g. Codex tab calls out the codex≥0.57 requirement and TOML duplicate-table parse error; OpenClaw calls out the :18789 port conflict check) URL discipline: every URL is either (a) verified against a file path in this repo's docs/, (b) the canonical repo of an existing snippet reference, or (c) a well-known third-party canonical URL. No guessed URLs — broken links would defeat the purpose of "more comprehensive instructions." Verification: - `go build ./...` clean in workspace-server - `go test ./internal/handlers/...` passes (4.3s) - Bash syntax check on test_staging_full_saas.sh (no edits there) clean - TS brace/paren/bracket counts balanced; no full tsc run because the worktree's node_modules isn't installed — counterpart Canvas tabs E2E on the PR will exercise the full type-check + render path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:46:55 -07:00
Hongming Wang	ff0d4dae77	fix(external-connect): address self-review criticals — config corruption + durability Self-review of the modal-tab additions caught footguns in the new hermes/codex/openclaw snippets. Ship the fixes before merge. Critical 1 — Hermes `cat >> ~/.hermes/config.yaml` corrupts existing configs. Most existing hermes installs have a top-level gateway: block; appending creates a duplicate, which YAML rejects. Replaced the auto-append with explicit instructions: 'under your existing gateway: block, add a plugin_platforms entry'. Critical 2 — Codex `cat >> ~/.codex/config.toml` corrupts on re-run. TOML rejects duplicate [mcp_servers.molecule] tables; a second run breaks codex parse. Replaced auto-append with commented config block + explicit 'open ~/.codex/config.toml in your editor and paste'. Canvas-side token stamping still hits the literal in the comment so the operator's clipboard has the real token already substituted. Required 3 — OpenClaw `onboard --non-interactive` missing provider/model defaults. Added explicit --provider + --model placeholders in a commented form so operators see what's needed without a stub default applying silently. Required 4 — OpenClaw gateway started with bare '&' dies on terminal close. Switched to nohup + log file + disown, with a note that systemd is the right answer for production. Optional 5 + 6 (env_vars cleanup, tests) deferred — env_vars stripped to keep the in-tree-vs-external surface narrow; tests for the new response fields can land separately when external_connection.go is next touched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 13:12:54 -07:00
Hongming Wang	eba0c5e3f1	feat(canvas): add Hermes/Codex/OpenClaw tabs to ExternalConnectModal + default to Universal MCP The External Connect modal had tabs for Python SDK / curl / Claude Code channel / Universal MCP. Operators using hermes / codex / openclaw as their external runtime had no copy-paste; they pieced together WORKSPACE_ID + PLATFORM_URL + auth_token into config files by reading docs. Adds three runtime-specific snippets stamped server-side: - Hermes: installs molecule-ai-workspace-runtime + the hermes-channel-molecule plugin, exports the 4 env vars, and writes the gateway.plugin_platforms.molecule block into ~/.hermes/config.yaml. Same long-poll-based push semantics the Claude Code channel tab delivers (push parity with the in-tree template-hermes adapter). - Codex: wires the molecule_runtime A2A MCP server into ~/.codex/config.toml ([mcp_servers.molecule] block with env_vars passthrough + literal env values). Outbound tools only: codex's MCP client doesn't route arbitrary notifications/* (verified by reading codex-rs/codex-mcp/src/connection_manager.rs); push parity on external codex would need a separate bridge daemon, tracked as future work. Snippet calls this out so operators know to pair with Python SDK if they need inbound delivery. - OpenClaw: installs openclaw + onboards, wires the molecule MCP server via openclaw mcp set, starts the gateway on loopback. Same outbound-tools-only caveat as codex; the in-tree template-openclaw adapter implements the full sessions.steer push path, but an external setup would need the same bridge daemon to translate platform inbox events into sessions.steer calls. Future work. Default open tab changed from "Claude Code" to "Universal MCP". Universal MCP is runtime-agnostic and works as a starting point for any operator regardless of their downstream agent runtime; runtime-specific tabs are still one click away.
Pre-2026-05-03 the modal defaulted to Claude Code, so operators using non-Claude runtimes opened to a tab they had to skip past. Tab order also reorganized: Universal MCP → Python SDK → Claude Code → Hermes → Codex → OpenClaw → curl → Fields Each runtime-specific tab is gated on the platform supplying the snippet (older platform builds without the field don't show empty tabs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 13:07:19 -07:00
Hongming Wang	1bff419833	feat(provisioner): digest-pin workspace images via runtime_image_pins (#2272 layer 1) Layer 1 of the runtime-rollout plan. Decouples publish from promotion by giving operators a `runtime_image_pins` table the provisioner consults at container-create time. No row = legacy `:latest` behavior; row present = provisioner pulls `<base>@sha256:<digest>`. One bad publish no longer breaks every workspace simultaneously. Mechanics: - Migration 047: `runtime_image_pins` (template_name PK + sha256 digest + audit columns) and `workspaces.runtime_image_digest` (nullable, with partial index) for "show me workspaces still on the old digest" queries. - `resolveRuntimeImage` (handlers/runtime_image_pin.go): looks up the pin, returns `<base>@sha256:<digest>` on hit, "" on miss/error so the provisioner falls through to the legacy tag map. Availability over pinning: any DB error logs and returns "" rather than blocking the provision. `WORKSPACE_IMAGE_LOCAL_OVERRIDE=1` short-circuits the lookup so devs rebuilding template images locally see their fresh build. - `WorkspaceConfig.Image` carries the resolved value into the provisioner. `selectImage` honors it ahead of the runtime→tag map and falls back to DefaultImage on unknown runtime. - The existing `imageTagIsMoving` predicate (#215) already returns false on `@sha256:` form, so digest pins skip the force-pull path naturally. Tests: - Handler-side (sqlmock): no-pin/db-error/with-pin/empty/unknown/local-override paths cover every branch of `resolveRuntimeImage`. - Provisioner-side: `selectImage` table covers explicit-image preference, runtime-map fallback, unknown-runtime → default, empty-config → default. Plus a struct-literal compile-time pin on `Image` so a future refactor can't silently drop the field. Layer 2 (per-ring routing via `workspaces.runtime_image_digest`) and the admin promote/rollback endpoint ride on top of this and ship separately.	2026-05-03 02:30:00 -07:00
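The pin-resolution flow described in commit 1bff419833 can be sketched roughly as below. This is an illustration only, not the code in handlers/runtime_image_pin.go: `pinLookup`, `resolvePinnedImage`, the image names, and the assumption that the stored digest is bare hex without the `sha256:` prefix are all hypothetical. Only the behavior (pin hit returns `<base>@sha256:<digest>`, miss or error returns "", explicit image beats the runtime map, unknown runtime falls back to the default) comes from the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// pinLookup is a hypothetical stand-in for the runtime_image_pins query.
// It returns the pinned digest (assumed to be bare hex), whether a row
// exists, and any lookup error.
type pinLookup func(template string) (digest string, found bool, err error)

// errLookup simulates a database failure.
var errLookup = errors.New("pin lookup failed")

// resolvePinnedImage returns "<base>@sha256:<digest>" when a pin exists and
// "" otherwise, so the caller can fall through to the legacy tag map.
func resolvePinnedImage(base, template string, localOverride bool, lookup pinLookup) string {
	if localOverride {
		// Local rebuilds skip the pin entirely.
		return ""
	}
	digest, found, err := lookup(template)
	if err != nil {
		// Availability over pinning: log and fall back instead of failing.
		log.Printf("pin lookup for %q failed, using legacy tag: %v", template, err)
		return ""
	}
	if !found || digest == "" {
		return ""
	}
	return base + "@sha256:" + digest
}

// selectImage prefers an explicit (resolved) image, then the runtime tag
// map, then the default image.
func selectImage(explicit, runtime string, tags map[string]string, defaultImage string) string {
	if explicit != "" {
		return explicit
	}
	if img, ok := tags[runtime]; ok {
		return img
	}
	return defaultImage
}

func main() {
	pins := map[string]string{"hermes": "abc123"}
	lookup := func(t string) (string, bool, error) {
		d, ok := pins[t]
		return d, ok, nil
	}
	tags := map[string]string{"hermes": "example/hermes:latest"}

	pinned := resolvePinnedImage("example/hermes", "hermes", false, lookup)
	fmt.Println(selectImage(pinned, "hermes", tags, "example/default:latest"))

	unpinned := resolvePinnedImage("example/codex", "codex", false, lookup)
	fmt.Println(selectImage(unpinned, "codex", tags, "example/default:latest"))
}
```

Returning an empty string rather than an error keeps the fallback decision in one place: the caller never needs to distinguish "no pin" from "lookup failed".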
Hongming Wang	be271aef8b	fix(orphan-sweeper): exclude runtime='external' from stale-token revoke The Docker-mode orphan sweeper was incorrectly targeting external runtime workspaces, revoking their auth tokens ~6 minutes after creation (one sweep cycle past the 5-min grace). External workspaces have NO local container by design — their agent runs off-host. The "no live container" predicate the sweep uses to detect wiped-volume orphans matches every external workspace unconditionally, which was killing the only auth credential the off-host agent has. Reproducer: create runtime=external workspace, paste the auth token into molecule-mcp / curl, wait 5 minutes. Next request returns `HTTP 401 — token may be revoked`. Platform log shows `Orphan sweeper: revoking stale tokens for workspace <id> (no live container; volume likely wiped)`. Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The existing test regexes (third-pass query expectations + the shared expectStaleTokenSweepNoOp helper) are tightened to require the new predicate, so a regression that drops it fails CI immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:49:37 -07:00
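As a rough in-memory mirror of the sweep condition in commit be271aef8b, the sketch below shows why the `runtime != 'external'` guard matters. The real change is a SQL predicate; the `workspace` struct, its field names, and `shouldRevokeTokens` are hypothetical and exist only to illustrate the logic. The 5-minute grace period is taken from the commit message.

```go
package main

import "fmt"

// graceMinutes mirrors the 5-minute grace period mentioned in the commit.
const graceMinutes = 5

// workspace is a hypothetical, simplified view of a workspaces row.
type workspace struct {
	Runtime          string
	HasLiveContainer bool
	AgeMinutes       int
}

// shouldRevokeTokens reports whether the sweeper would treat the workspace
// as a wiped-volume orphan. External workspaces never have a local
// container, so they must be excluded before the container check.
func shouldRevokeTokens(w workspace) bool {
	if w.Runtime == "external" {
		return false
	}
	return !w.HasLiveContainer && w.AgeMinutes > graceMinutes
}

func main() {
	external := workspace{Runtime: "external", HasLiveContainer: false, AgeMinutes: 6}
	orphan := workspace{Runtime: "hermes", HasLiveContainer: false, AgeMinutes: 6}
	fmt.Println(shouldRevokeTokens(external), shouldRevokeTokens(orphan))
}
```

Without the first check, the `!w.HasLiveContainer` condition is true for every external workspace, which is the failure mode the commit describes.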
Hongming Wang	384edb4af0	Merge branch 'staging' into perf/cache-platform-inbound-secret	2026-05-03 00:08:43 -07:00
Hongming Wang	b040171fa1	perf(wsauth): in-process cache for platform_inbound_secret reads Heartbeats fire every 60s per workspace and were the dominant caller of ReadPlatformInboundSecret — one DB SELECT each, purely to redeliver the same value. For an N-workspace fleet that's N SELECTs/minute of pure overhead, growing linearly with the fleet (#189). This adds a sync.Map cache keyed by workspaceID with a 5-minute TTL: - Read-through: cache miss → DB SELECT → populate → return. - Write-through: every IssuePlatformInboundSecret call refreshes the cache with the new value before returning, so the lazy-heal mint path (readOrLazyHealInboundSecret) doesn't see a stale read of the value it just wrote. - TTL eviction: 5 minutes — generous enough that the heartbeat hot path hits cache for ~5 reads in a row before re-validating, short enough that an out-of-band rotation (operator running `UPDATE workspaces SET platform_inbound_secret=...` directly) propagates within minutes without requiring a redeploy. - Absence not cached: ErrNoInboundSecret skips the cache write so the lazy-heal recovery contract for the column-NULL case (readOrLazyHealInboundSecret in workspace_provision_shared.go) keeps working. Memory footprint is bounded by the active workspace fleet (~200 bytes per entry); deleted workspaces leave dead entries until process restart, acceptable given workspace-deletion is operator-rare. Why in-process instead of Redis: workspace-server runs as a single Railway service today (per memory project_controlplane_ownership); adding Redis for this single column read would be over-engineering. The cache is a self-contained, Redis-free upgrade that keeps the same semantic surface (read returns the latest secret) while collapsing the heartbeat read storm. If the deployment ever fans out across replicas, an operator-side rotation propagates per-replica TTL-bounded without needing a shared write log. 
Tests: 5 new cases covering cache hit within TTL, refresh after TTL (simulating an operator rotation via SQL), write-through on Issue, absence-not-cached, and Reset clearing all entries. The setupMock helper in wsauth and setupTestDB helper in handlers both call ResetInboundSecretCacheForTesting() at start + cleanup so write-through state from one test doesn't shadow SELECT expectations in the next. SetInboundSecretCacheNowForTesting() exposes a deterministic clock override so the TTL test doesn't sleep. Task: #189.	2026-05-03 00:04:38 -07:00
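The cache semantics listed in commit b040171fa1 (read-through, write-through, TTL expiry, absence not cached, injectable clock) can be sketched as follows. This is a minimal illustration under stated assumptions, not the wsauth implementation: `secretCache`, `fakeClock`, `ErrNoSecret`, and the loader callback are hypothetical names, and the loader stands in for the DB SELECT.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrNoSecret stands in for the ErrNoInboundSecret named in the commit.
var ErrNoSecret = errors.New("no inbound secret")

// defaultTTL matches the 5-minute TTL described in the commit.
const defaultTTL = 5 * time.Minute

type entry struct {
	secret    string
	expiresAt time.Time
}

// secretCache is a hypothetical read-through / write-through cache.
type secretCache struct {
	entries sync.Map // workspaceID (string) -> entry
	ttl     time.Duration
	now     func() time.Time
	load    func(workspaceID string) (string, error)
}

func newSecretCache(ttl time.Duration, now func() time.Time, load func(string) (string, error)) *secretCache {
	return &secretCache{ttl: ttl, now: now, load: load}
}

// Read returns the cached secret while it is fresh, otherwise reloads it.
func (c *secretCache) Read(id string) (string, error) {
	if v, ok := c.entries.Load(id); ok {
		if e := v.(entry); c.now().Before(e.expiresAt) {
			return e.secret, nil
		}
	}
	secret, err := c.load(id)
	if err != nil {
		// Absence (or any error) is not cached, so recovery paths still run.
		return "", err
	}
	c.Put(id, secret)
	return secret, nil
}

// Put is the write-through path: a newly issued secret replaces the entry.
func (c *secretCache) Put(id, secret string) {
	c.entries.Store(id, entry{secret: secret, expiresAt: c.now().Add(c.ttl)})
}

// fakeClock is a deterministic clock so TTL behavior needs no sleeping.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	clk := &fakeClock{}
	calls := 0
	stored := "v1"
	c := newSecretCache(defaultTTL, clk.Now, func(id string) (string, error) {
		calls++
		return stored, nil
	})

	c.Read("ws-1")
	c.Read("ws-1") // served from cache
	stored = "v2"  // simulate an out-of-band rotation
	clk.Advance(defaultTTL)
	s, _ := c.Read("ws-1") // TTL elapsed, reloads
	fmt.Println(s, calls)
}
```

Passing the clock in as a function is one way to get the deterministic TTL test the commit mentions; the actual test hook in the repository may be shaped differently.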

539 Commits