Stacked on PR-1 (#2729). Computes the readable/writable namespace lists
for a workspace from the live workspaces tree at request time. No
precomputed columns, no migrations — re-parenting on canvas takes
effect immediately on the next memory call.
What ships:
- workspace-server/internal/memory/namespace/resolver.go
- walkChain: recursive CTE, walks parent_id chain to root, capped
at depth 50 to defend against malformed/cyclic data
- derive: maps a chain to (workspace, team, org) namespace strings
- ReadableNamespaces / WritableNamespaces: the public API
- CanWrite + IntersectReadable: server-side ACL helpers that MCP
handlers (PR-5) will call before talking to the plugin
- resolver_test.go: 100% statement coverage
Design choices worth flagging:
- Today's tree is depth-1 (root + children). The recursive CTE
handles arbitrary depth so we don't have to revisit the resolver
when the tree deepens.
- GLOBAL→org write restriction (memories.go:167-174) is preserved
by gating the org namespace's Writable flag on parent_id IS NULL.
- Removed-status workspaces are NOT filtered from the chain walk —
matches today's TEAM behavior (memories.go:367-372 filters on
read, not on tree walk).
- IntersectReadable with empty `requested` returns ALL readable
namespaces (default-search-everything semantic from the discovery
tools spec).
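A minimal sketch of that last rule, assuming plain string namespaces and
an `IntersectReadable` free function (the real resolver's signature may
differ):

```go
// Hypothetical shape of IntersectReadable; an empty `requested` slice means
// "search everything readable", per the discovery-tools spec.
func IntersectReadable(readable, requested []string) []string {
	if len(requested) == 0 {
		return readable
	}
	allowed := make(map[string]struct{}, len(readable))
	for _, ns := range readable {
		allowed[ns] = struct{}{}
	}
	out := make([]string, 0, len(requested))
	for _, ns := range requested {
		if _, ok := allowed[ns]; ok {
			out = append(out, ns)
		}
	}
	return out
}
```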
This package has zero callers in this PR; integration starts in PR-5.
Builds on PR-1 (#2729). Implements every endpoint in the OpenAPI spec
plus two operational concerns the agent never sees:
1. Capability negotiation. Boot/Refresh probes /v1/health and
captures the plugin's capability list. MCP handlers (PR-5) ask
SupportsCapability before exposing capability-gated features —
e.g., agents can only request semantic search when "embedding"
is reported.
2. Circuit breaker. Three consecutive failures open the breaker for
60 seconds; while open, calls fail fast with ErrBreakerOpen.
Picked these constants because:
- 3 failures: long enough to skip transient blips, short enough
to react before all in-flight handlers stack on the timeout
- 60s cooldown: long enough to back off a flapping plugin,
short enough that recovery is felt within a single session
4xx responses do NOT count toward the breaker (those are client
bugs, not plugin health issues); 5xx + transport errors do.
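A hedged sketch of that policy (only the thresholds and the 4xx exemption
come from the description above; the real client.go may structure this
differently):

```go
// Illustrative breaker state, guarded by a mutex.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time
}

var ErrBreakerOpen = errors.New("memory plugin breaker open")

func (b *breaker) allow(now time.Time) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if now.Before(b.openUntil) {
		return ErrBreakerOpen // fail fast while the breaker is open
	}
	return nil
}

func (b *breaker) record(status int, transportErr error, now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch {
	case transportErr == nil && status < 400:
		b.failures = 0 // success resets the counter
	case transportErr == nil && status < 500:
		// 4xx = client bug, not plugin health: neither counts nor resets
	default:
		b.failures++
		if b.failures >= 3 {
			b.openUntil = now.Add(60 * time.Second)
			b.failures = 0
		}
	}
}
```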
What ships:
- workspace-server/internal/memory/client/client.go
- client_test.go: 100% statement coverage
Coverage corner cases pinned:
- env-var success branches in New (parseDurationEnv applied)
- json.Marshal error (via channel in Propagation)
- http.NewRequestWithContext error (via unbalanced bracket in BaseURL)
- 204 NoContent on endpoint that normally has a body
- 4xx vs 5xx breaker behavior (4xx must NOT trip)
- breaker cooldown elapsed → reset on next success
- all 6 public endpoints fail-fast when breaker is open
This package has no callers in this PR; integration starts in PR-5.
First of 11 PRs implementing the memory-system plugin refactor (RFC #2728).
This PR is pure additive scaffolding — no behavior change, no integration
yet. It defines the wire shape between workspace-server and a memory
plugin so PR-2 (HTTP client) and PR-3 (built-in postgres plugin) can be
built against a single source of truth.
What ships:
- docs/api-protocol/memory-plugin-v1.yaml: OpenAPI 3.0.3 spec covering
/v1/health, namespace upsert/patch/delete, memory commit, search,
forget. Auth-free (private network only); workspace-server is the
only sanctioned client and the security perimeter.
- workspace-server/internal/memory/contract: typed Go bindings with
Validate() methods on every wire object so both client (PR-2) and
server (PR-3) self-check at the boundary.
- Round-trip JSON tests for every type (catch asymmetric tag bugs).
- 5 golden vector files under testdata/ pinning the exact wire shape;
update via UPDATE_GOLDENS=1.
Coverage: 100% of statements in contract.go.
The validation rules encode design decisions worth flagging in review:
- SearchRequest with empty Namespaces is REJECTED at plugin level —
workspace-server is required to intersect the readable set
server-side; an empty list reaching the plugin is a bug.
- NamespacePatch with no fields is REJECTED — empty patches are
pointless round-trips.
- MemoryWrite with whitespace-only Content is REJECTED — zero-info
memories pollute search results.
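A sketch of how two of those rules might read in the contract package
(type names follow the text; the exact method bodies and error wording
are assumptions):

```go
// Hypothetical Validate methods encoding the rules above.
func (r SearchRequest) Validate() error {
	if len(r.Namespaces) == 0 {
		// workspace-server must intersect the readable set before calling the
		// plugin; an empty list here is a caller bug, not a "search all".
		return errors.New("search: namespaces must not be empty")
	}
	return nil
}

func (w MemoryWrite) Validate() error {
	if strings.TrimSpace(w.Content) == "" {
		return errors.New("memory write: content must not be blank")
	}
	return nil
}
```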
No code yet calls into this package; integration starts in PR-2.
User-reported 2026-05-04: deploying a team org-template ("Design
Director" + 6 sub-agents) on a SaaS tenant produced 7-of-7
WORKSPACE_PROVISION_FAILED with the misleading message
"container started but never called /registry/register". Diagnose
returned "docker client not configured on this workspace-server" and
the workspace rows had no instance_id.
Root cause: TeamHandler.Expand hardcoded h.wh.provisionWorkspace —
the Docker leg of WorkspaceHandler. WorkspaceHandler.Create branched
on h.cpProv to pick CP-managed EC2 (SaaS) vs local Docker
(self-hosted), but Expand never used that branch. On SaaS the docker
goroutine ran but had no socket, so children silently sat in
"provisioning" until the 600s sweeper marked them failed.
Architectural principle (user): templates own
runtime/config/prompts/files/plugins; the platform owns where it
runs. Backend selection belongs in one helper.
Fix:
- Extract WorkspaceHandler.provisionWorkspaceAuto: picks CP when
cpProv is set, Docker when only provisioner is set, returns false
when neither (caller marks failed).
- WorkspaceHandler.Create routes through Auto.
- TeamHandler.Expand routes through Auto.
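Roughly, the helper in the first bullet looks like this (field names and
the goroutine shape are assumptions based on the description, not the
literal code):

```go
// provisionWorkspaceAuto: the one place that decides which backend provisions.
func (h *WorkspaceHandler) provisionWorkspaceAuto(ctx context.Context, ws Workspace) bool {
	switch {
	case h.cpProv != nil:
		go h.provisionWorkspaceCP(ctx, ws) // SaaS: CP-managed EC2
		return true
	case h.provisioner != nil:
		go h.provisionWorkspace(ctx, ws) // self-hosted: local Docker
		return true
	default:
		return false // no backend configured: caller marks the row failed
	}
}
```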
Tests pin three invariants:
- TestProvisionWorkspaceAuto_NoBackendReturnsFalse — Auto signals
fall-through correctly so the caller can persist + mark-failed.
- TestProvisionWorkspaceAuto_RoutesToCPWhenSet — when cpProv is
wired, Start lands on CP (the user-visible regression target).
Discipline-verified: removing the cpProv branch fails this.
- TestTeamExpand_UsesAutoNotDirectDockerPath — source-level guard
against future refactors reintroducing the hardcoded Docker call.
Discipline-verified: reverting team.go fails this with a clear
message naming the bug class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live-probed the user's tenant: three of three external-runtime workspaces
register with delivery_mode = NULL, not "poll". The earlier narrow
poll-only check fell through to the misleading 503 for the actually-
observed shape.
Invariant we want: URL empty + not-exactly-"push" → no dispatch path
will ever exist → 422. Only push-mode with empty URL is genuinely
transient (mid-boot, restart in progress) → 503.
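A sketch of the branch that encodes this invariant (handler variable names
and response wording are illustrative, not the actual code):

```go
// URL empty: decide between "will never work" (422) and "not ready yet" (503).
if callbackURL == "" {
	if deliveryMode == "push" {
		// Mid-boot / restart in progress: the URL arrives with the next
		// heartbeat, so a retry is the right move.
		c.JSON(http.StatusServiceUnavailable, gin.H{"error": "workspace url not registered yet"})
		return
	}
	// poll, NULL, or anything not exactly "push": no dispatch path will ever
	// exist, so tell the operator how to fix the registration instead.
	c.JSON(http.StatusUnprocessableEntity, gin.H{
		"error": "workspace has no callback URL; re-register in push mode with a public URL",
	})
	return
}
```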
Added TestChatUpload_NullModeEmptyURL using the user's actual workspace
ID. Existing TestChatUpload_NoURL switched to explicit "push" mode
(was relying on default — unsafe given the new branching).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External-runtime workspaces that register in poll mode have no callback
URL by design — the platform never dispatches to them, so chat upload
(HTTP-forward by design) can't proceed. Returning 503 + "workspace url
not registered yet" was misleading: the "yet" implied transient state,
but the URL would never arrive.
Caught externally on 2026-05-04: user uploading an image to an external
"mac laptop" runtime workspace saw the 503 and assumed they should
retry. The workspace's poll mode meant retrying would never help.
Fix: include delivery_mode in the workspace lookup. When URL is empty:
- poll mode → 422 + "re-register in push mode with a public URL"
(Unprocessable Entity — this request can't succeed against this
workspace's configuration; no retry will help)
- push mode → 503 + "not registered yet" (genuine transient state —
retry after next heartbeat is correct)
Test: TestChatUpload_PollModeEmptyURL pins the new 422 path; existing
TestChatUpload_NoURL strengthened to assert the "not registered yet"
substring stays on the push branch (it would have silently passed if
the new 422 path had clobbered both branches).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User report: handing the modal's Claude Code channel snippet to an
agent fails immediately with two errors that the snippet doesn't tell
the operator how to resolve:
plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · plugin not installed
plugin:molecule@Molecule-AI/molecule-mcp-claude-channel · not on the approved channels allowlist
Root cause: the snippet's `claude --channels plugin:...` line assumes
the plugin is pre-installed AND that the channel is on Anthropic's
default allowlist. Both assumptions are wrong for a custom Molecule
plugin in a public repo.
Two changes:
1. Rewrite externalChannelTemplate (Go) with full setup chain:
- Bun prereq check (channel plugins are Bun scripts)
- `/plugin marketplace add Molecule-AI/molecule-mcp-claude-channel`
+ `/plugin install molecule@molecule-mcp-claude-channel` BEFORE the
launch — otherwise "plugin not installed"
- `--dangerously-load-development-channels` flag on launch — required
for non-Anthropic-allowlisted channels, otherwise "not on approved
channels allowlist"
- Common-errors block at the bottom mapping each error string to
which numbered step recovers it
- Team/Enterprise managed-settings caveat (the dev-channels flag is
blocked there; admin must use channelsEnabled + allowedChannelPlugins)
Plugin install info verified by reading `Molecule-AI/molecule-mcp-claude-channel`
plugin.json (`name: "molecule"`) and the Claude Code channels +
plugin-discovery docs at code.claude.com/docs/en/{channels,discover-plugins}.
2. Add per-tab HelpBlock to the modal (canvas):
- Collapsible <details> below each snippet, closed by default so the
snippet stays the visual focus
- "Where to install" link (PyPI for runtime, claude.com for Claude
Code, github.com/openai/codex for Codex, NousResearch/hermes-agent
for Hermes)
- "Documentation" link (docs.molecule.ai/docs/guides/*; hostname
confirmed by existing blog post canonical metadata; paths map
1:1 to docs/guides/*.md files in this repo)
- "Common errors" list with concrete recovery steps for each tab
(e.g. Codex tab calls out the codex≥0.57 requirement and TOML
duplicate-table parse error; OpenClaw calls out the :18789 port
conflict check)
URL discipline: every URL is either (a) verified against a file path
in this repo's docs/, (b) the canonical repo of an existing snippet
reference, or (c) a well-known third-party canonical URL. No guessed
URLs — broken links would defeat the purpose of "more comprehensive
instructions."
Verification:
- `go build ./...` clean in workspace-server
- `go test ./internal/handlers/...` passes (4.3s)
- Bash syntax check on test_staging_full_saas.sh (no edits there) clean
- TS brace/paren/bracket counts balanced; no full tsc run because the
worktree's node_modules isn't installed — counterpart Canvas tabs E2E
on the PR will exercise the full type-check + render path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review of the modal-tab additions caught footguns in the new
hermes/codex/openclaw snippets. Ship the fixes before merge.
Critical 1 — Hermes `cat >> ~/.hermes/config.yaml` corrupts existing
configs. Most existing hermes installs have a top-level gateway:
block; appending creates a duplicate, which YAML rejects. Replaced
the auto-append with explicit instructions: 'under your existing
gateway: block, add a plugin_platforms entry'.
Critical 2 — Codex `cat >> ~/.codex/config.toml` corrupts on
re-run. TOML rejects duplicate [mcp_servers.molecule] tables; a
second run breaks codex parse. Replaced auto-append with commented
config block + explicit 'open ~/.codex/config.toml in your editor
and paste'. Canvas-side token stamping still hits the literal in
the comment so the operator's clipboard has the real token already
substituted.
Required 3 — OpenClaw `onboard --non-interactive` missing
provider/model defaults. Added explicit --provider + --model
placeholders in a commented form so operators see what's needed
without a stub default applying silently.
Required 4 — OpenClaw gateway started with bare '&' dies on
terminal close. Switched to nohup + log file + disown, with a note
that systemd is the right answer for production.
Optional 5 + 6 (env_vars cleanup, tests) deferred — env_vars stripped
to keep the in-tree-vs-external surface narrow; tests for the new
response fields can land separately when external_connection.go is
next touched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The External Connect modal had tabs for Python SDK / curl / Claude Code
channel / Universal MCP. Operators using hermes / codex / openclaw as
their external runtime had no copy-paste snippet; they pieced together
WORKSPACE_ID + PLATFORM_URL + auth_token into config files by reading
docs.
Adds three runtime-specific snippets stamped server-side:
- **Hermes** — installs molecule-ai-workspace-runtime + the
hermes-channel-molecule plugin, exports the 4 env vars, and writes
the gateway.plugin_platforms.molecule block into ~/.hermes/config.yaml.
Same long-poll-based push semantics the Claude Code channel tab
delivers (push parity with the in-tree template-hermes adapter).
- **Codex** — wires the molecule_runtime A2A MCP server into
~/.codex/config.toml ([mcp_servers.molecule] block with env_vars
passthrough + literal env values). Outbound tools only — codex's
MCP client doesn't route arbitrary notifications/* (verified by
reading codex-rs/codex-mcp/src/connection_manager.rs); push parity
on external codex would need a separate bridge daemon, tracked
as future work. Snippet calls this out so operators know to pair
with Python SDK if they need inbound delivery.
- **OpenClaw** — installs openclaw + onboards, wires the molecule
MCP server via openclaw mcp set, starts the gateway on loopback.
Same outbound-tools-only caveat as codex; the in-tree template-
openclaw adapter implements the full sessions.steer push path,
but an external setup would need the same bridge daemon to translate
platform inbox events into sessions.steer calls. Future work.
Default open tab changed from "Claude Code" to "Universal MCP".
Universal MCP is runtime-agnostic and works as a starting point for
any operator regardless of their downstream agent runtime; runtime-
specific tabs are still one click away. Pre-2026-05-03 the modal
defaulted to Claude Code, so operators using non-Claude runtimes
opened to a tab they had to skip past.
Tab order also reorganized:
Universal MCP → Python SDK → Claude Code → Hermes → Codex → OpenClaw → curl → Fields
Each runtime-specific tab is gated on the platform supplying the
snippet (older platform builds without the field don't show empty
tabs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chat_history query
WHERE workspace_id = $1
AND activity_type = 'a2a_receive'
AND (source_id = $2 OR target_id = $2)
ORDER BY created_at DESC
forces a workspace-scoped seq-scan-and-filter at every call —
idx_activity_ws_type_time covers workspace_id+type prefix but the
(source OR target) clause then walks every workspace row. Demo
workspaces (≤50 rows) don't notice; production workspaces accumulate
thousands over months and chat_history latency grows linearly.
Adds two partial btree indexes (workspace_id, source_id) WHERE NOT NULL
and (workspace_id, target_id) WHERE NOT NULL. Postgres BitmapOrs them
into a workspace-scoped BitmapAnd against the existing index, dropping
chat_history from O(workspace_rows) to O(peer_a2a_rows).
Partial WHERE NOT NULL because most activity rows (heartbeats,
agent_log, memory_write, etc.) carry NULL source_id/target_id and
shouldn't bloat the index.
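The migration body would look roughly like this (index names, and the
`activity` table name, are illustrative; the actual migration may spell
them differently):

```go
// Hypothetical migration statement for the two partial indexes described above.
const addA2APeerIndexes = `
CREATE INDEX IF NOT EXISTS idx_activity_ws_source_a2a
    ON activity (workspace_id, source_id) WHERE source_id IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_activity_ws_target_a2a
    ON activity (workspace_id, target_id) WHERE target_id IS NOT NULL;
`
```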
Anti-pattern caveat (per the issue): a single compound (a, b) index
can't serve 'a OR b' — Postgres only uses a compound index for prefix matches.
Two separate indexes + BitmapOr is the right shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layer 1 of the runtime-rollout plan. Decouples publish from promotion by
giving operators a `runtime_image_pins` table the provisioner consults at
container-create time. No row = legacy `:latest` behavior; row present =
provisioner pulls `<base>@sha256:<digest>`. One bad publish no longer
breaks every workspace simultaneously.
Mechanics:
- Migration 047: `runtime_image_pins` (template_name PK + sha256 digest +
audit columns) and `workspaces.runtime_image_digest` (nullable, with
partial index) for "show me workspaces still on the old digest" queries.
- `resolveRuntimeImage` (handlers/runtime_image_pin.go): looks up the
pin, returns `<base>@sha256:<digest>` on hit, "" on miss/error so the
provisioner falls through to the legacy tag map. Availability over
pinning — any DB error logs and returns "" rather than blocking the
provision. `WORKSPACE_IMAGE_LOCAL_OVERRIDE=1` short-circuits the
lookup so devs rebuilding template images locally see their fresh
build.
- `WorkspaceConfig.Image` carries the resolved value into the
provisioner. `selectImage` honors it ahead of the runtime→tag map and
falls back to DefaultImage on unknown runtime.
- The existing `imageTagIsMoving` predicate (#215) already returns false
on `@sha256:` form, so digest pins skip the force-pull path naturally.
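To make the lookup's fall-through behavior concrete, a sketch (the SQL,
column name, and logging are assumptions consistent with the description):

```go
// resolveRuntimeImage: availability over pinning — any miss or error returns
// "" so the provisioner falls back to the legacy tag map.
func resolveRuntimeImage(db *sql.DB, templateName, baseImage string) string {
	if os.Getenv("WORKSPACE_IMAGE_LOCAL_OVERRIDE") == "1" {
		return "" // local template rebuilds should win over any pin
	}
	var digest string
	err := db.QueryRow(`SELECT digest FROM runtime_image_pins WHERE template_name = $1`, templateName).Scan(&digest)
	if err != nil {
		if err != sql.ErrNoRows {
			log.Printf("runtime image pin lookup failed for %s: %v", templateName, err)
		}
		return "" // no row or DB error = legacy :latest behavior
	}
	return baseImage + "@sha256:" + digest
}
```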
Tests:
- Handler-side (sqlmock): no-pin/db-error/with-pin/empty/unknown/local-
override paths cover every branch of `resolveRuntimeImage`.
- Provisioner-side: `selectImage` table covers explicit-image preference,
runtime-map fallback, unknown-runtime → default, empty-config →
default. Plus a struct-literal compile-time pin on `Image` so a future
refactor can't silently drop the field.
Layer 2 (per-ring routing via `workspaces.runtime_image_digest`) and the
admin promote/rollback endpoint ride on top of this and ship separately.
The Docker-mode orphan sweeper was incorrectly targeting external runtime
workspaces, revoking their auth tokens ~6 minutes after creation (one
sweep cycle past the 5-min grace).
External workspaces have NO local container by design — their agent runs
off-host. The "no live container" predicate the sweep uses to detect
wiped-volume orphans matches every external workspace unconditionally,
which was killing the only auth credential the off-host agent has.
Reproducer: create runtime=external workspace, paste the auth token into
molecule-mcp / curl, wait 5 minutes. Next request returns
`HTTP 401 — token may be revoked`. Platform log shows
`Orphan sweeper: revoking stale tokens for workspace <id> (no live
container; volume likely wiped)`.
Fix: add `AND w.runtime != 'external'` to the sweep's SELECT. The
existing test regexes (third-pass query expectations + the shared
expectStaleTokenSweepNoOp helper) are tightened to require the new
predicate, so a regression that drops it fails CI immediately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeats fire every 60s per workspace and were the dominant caller
of ReadPlatformInboundSecret — one DB SELECT each, purely to redeliver
the same value. For an N-workspace fleet that's N SELECTs/minute of
pure overhead, growing linearly with the fleet (#189).
This adds a sync.Map cache keyed by workspaceID with a 5-minute TTL:
- **Read-through**: cache miss → DB SELECT → populate → return.
- **Write-through**: every IssuePlatformInboundSecret call refreshes
the cache with the new value before returning, so the lazy-heal mint
path (readOrLazyHealInboundSecret) doesn't see a stale read of the
value it just wrote.
- **TTL eviction**: 5 minutes — generous enough that the heartbeat
hot path hits cache for ~5 reads in a row before re-validating, short
enough that an out-of-band rotation (operator running
`UPDATE workspaces SET platform_inbound_secret=...` directly)
propagates within minutes without requiring a redeploy.
- **Absence not cached**: ErrNoInboundSecret skips the cache write so
the lazy-heal recovery contract for the column-NULL case
(readOrLazyHealInboundSecret in workspace_provision_shared.go) keeps
working.
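A sketch of the read-through path, assuming a package-level sync.Map and a
hypothetical `readInboundSecretFromDB` helper standing in for the existing
SELECT:

```go
type cachedSecret struct {
	value   string
	fetched time.Time
}

var inboundSecretCache sync.Map // workspaceID -> cachedSecret

const inboundSecretTTL = 5 * time.Minute

func ReadPlatformInboundSecretCached(db *sql.DB, workspaceID string) (string, error) {
	if v, ok := inboundSecretCache.Load(workspaceID); ok {
		if entry := v.(cachedSecret); time.Since(entry.fetched) < inboundSecretTTL {
			return entry.value, nil // heartbeat hot path: no SELECT
		}
	}
	secret, err := readInboundSecretFromDB(db, workspaceID) // hypothetical: the existing SELECT
	if err != nil {
		return "", err // ErrNoInboundSecret is deliberately not cached (lazy-heal contract)
	}
	inboundSecretCache.Store(workspaceID, cachedSecret{value: secret, fetched: time.Now()})
	return secret, nil
}
```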
Memory footprint is bounded by the active workspace fleet (~200 bytes
per entry); deleted workspaces leave dead entries until process restart,
acceptable given workspace-deletion is operator-rare.
Why in-process instead of Redis: workspace-server runs as a single
Railway service today (per memory project_controlplane_ownership);
adding Redis for this single column read would be over-engineering.
The cache is a self-contained, Redis-free upgrade that keeps the same
semantic surface (read returns the latest secret) while collapsing
the heartbeat read storm. If the deployment ever fans out across
replicas, an operator-side rotation still propagates to each replica within
the TTL bound, without needing a shared write log.
Tests: 5 new cases covering cache hit within TTL, refresh after TTL
(simulating an operator rotation via SQL), write-through on Issue,
absence-not-cached, and Reset clearing all entries. The setupMock
helper in wsauth and setupTestDB helper in handlers both call
ResetInboundSecretCacheForTesting() at start + cleanup so write-through
state from one test doesn't shadow SELECT expectations in the next.
SetInboundSecretCacheNowForTesting() exposes a deterministic clock
override so the TTL test doesn't sleep.
Task: #189.
Previously Start() only pulled when the image was missing locally
(imgErr != nil). Once a tenant's Docker daemon had `:latest` cached,
it stuck on that snapshot forever even after publish-runtime pushed
a newer image with the same tag — the same image-cache class that
sibling task #232 closed on the controlplane redeploy path.
Now Start() additionally re-pulls when the tag is "moving"
(`:latest`, no tag, `:staging`, `:main`, `:dev`, `:edge`, `:nightly`,
`:rolling`). Pinned tags (semver, sha-prefixed, date-stamped, build-id)
and digest-pinned references (`@sha256:...`) skip the pull because
their contents are by definition immutable.
The classifier (imageTagIsMoving) is deliberately conservative on the
"moving" side — only the well-known moving tags trip it. Misclassifying
a pinned tag as moving wastes bandwidth on every provision; misclassifying
moving as pinned silently bricks the fleet on stale snapshots, which
is exactly the bug class this fix closes.
Edge cases handled:
- Registry hostname with port (`localhost:5000/foo`) — the `:5000` is
not mistaken for a tag.
- Digest pinning (`image@sha256:...`) — never re-pulled even if a
moving-looking tag is also present.
- Legacy local-build tags (`workspace-template:hermes`) — treated as
pinned (no registry to move from).
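An illustrative classifier consistent with those rules (the shipped
imageTagIsMoving may differ in detail):

```go
func imageTagIsMoving(ref string) bool {
	if strings.Contains(ref, "@sha256:") {
		return false // digest-pinned: contents are immutable
	}
	slash := strings.LastIndex(ref, "/")
	colon := strings.LastIndex(ref, ":")
	if colon <= slash {
		return true // no tag at all (registry port doesn't count) = implicit :latest
	}
	switch ref[colon+1:] {
	case "latest", "staging", "main", "dev", "edge", "nightly", "rolling":
		return true // well-known moving tags
	default:
		return false // semver, sha-prefixed, date-stamped, local-build tags
	}
}
```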
Test coverage: 22 cases across all classifier shapes. No changes to
the pull-failure path (still best-effort, ContainerCreate still
surfaces the actionable "image not found" error if the pull failed
and the cache is also empty).
Task: #215. Companion to #232.
The drift gate's monorepoRoot walk-up looked for workspace-configs-templates/
which is gitignored locally and doesn't exist in this repo at all (the
canonical script lives in molecule-ai-workspace-template-hermes). Test
failed on CI from day one with "could not find monorepo root".
Two layered fixes in one PR:
1. Vendor upstream derive-provider.sh as testdata/ + drop monorepoRoot.
The vendored copy has a header pointing operators at the upstream
source and a one-line cp command for refresh. Test now reads two
files (vendored shell + workspace_provision.go) via package-relative
paths — Go test sets cwd to the package dir, so this is hermetic
without any walk-up gymnastics.
2. Update the case-statement regex to match upstream's renamed variable
(${_HERMES_MODEL} since v0.12.0, the resolved value of
HERMES_INFERENCE_MODEL with a HERMES_DEFAULT_MODEL legacy fallback).
Regex now accepts either spelling so a future rename fails loudly
on the parser-sanity check rather than silently returning empty.
Vendoring upstream surfaced real drift the gate was designed to catch:
upstream v0.12.0 added 12 provider prefixes that deriveProviderFromModelSlug
didn't handle (xai/grok, bedrock/aws, tencent/tencent-tokenhub, gmi,
qwen-oauth, lmstudio/lm-studio, minimax-oauth, alibaba-coding-plan,
google-gemini-cli, openai-codex, copilot-acp, copilot). Without these,
Save+Restart on a workspace using one of those prefixes would persist
LLM_PROVIDER="" and the next boot would fall back to derive-provider.sh's
runtime *=auto branch — losing the user's explicit choice on every restart.
Added all 12 case clauses + 16 new table-driven test cases (covering
both canonical and aliased forms). Drift gate now passes; future
upstream additions will fail loudly with a "DRIFT: ..." message
pointing the engineer at the missing case.
Task: #242
PR #2535 added a Go port of derive-provider.sh
(deriveProviderFromModelSlug) so workspace-server can persist
LLM_PROVIDER into workspace_secrets at provision time. This created
two sources of truth — if a future PR adds a provider prefix to one
without the other, the platform's persisted LLM_PROVIDER silently
disagrees with what the container's derive-provider.sh produces at
boot, with no test going red.
This adds a hermetic drift gate that:
1. Parses workspace-configs-templates/hermes/scripts/derive-provider.sh
with regex (handling both single-line `pat/*) PROVIDER="x" ;;`
clauses and multi-line conditional clauses) to build a
map[prefix]provider.
2. Walks workspace_provision.go's AST with go/ast, finds
deriveProviderFromModelSlug, and extracts every case-clause
prefix → return-string-literal pair.
3. Cross-checks both directions and accepts only the two documented
divergences (nousresearch/* and openai/* both → "openrouter" at
provision time because derive-provider.sh's runtime-env checks
aren't loaded yet) via a hardcoded acceptedDivergences map.
4. Fails with an actionable message that names both files and
suggests the exact fix (add the case OR add to divergence list
with a comment).
Pattern: behavior-based AST gate from PR #2367 / memory feedback —
pin the invariant by what the function maps, not by what it's named.
Stdlib-only (go/ast, go/parser, go/token, regexp); no network, no DB,
no docker — reads two monorepo files in-process.
A second sanity-check test pins anchor prefixes the regex must find,
so a future shell-syntax change can't silently produce an empty map
and trivially pass the main gate.
Closes task #242.
Follow-ups from PR #2543's multi-model code review (audit #253): one for
#256, two for parts of #257.
1. **Log silent yaml.Unmarshal errors (#256).** When a malformed
config.yaml made `yaml.Unmarshal(data, &raw)` fail, the affected
template silently disappeared from /templates with no trace —
operator could not distinguish "excluded due to parse error" from
"never existed." That widened a real foot-gun once PR #2543 added
structured top-level `providers:` (a string-shaped top-level
`providers:` decoded into `[]providerRegistryEntry` would fail and
drop the whole entry). Now logs `templates list: skip <id>:
yaml.Unmarshal: <err>` and continues with the rest.
2. **Coexistence test (#257 part 1).** PR #2543 covered the structured
registry and slug list in isolation. claude-code-default in
production ships BOTH: top-level `providers:` (structured registry,
2 entries) AND `runtime_config.providers:` (slug list, 3 entries).
New `TestTemplatesList_BothProviderShapesCoexist` mirrors that
layout, asserts both shapes surface independently with no
cross-talk (e.g. a slug-only entry like `anthropic-api` does NOT
synthesize a stub in the structured registry), and pins the JSON
wire-shape for both fields side-by-side.
3. **`base_url: null` decoding assertion (#257 part 3).** Adds an
explicit `got[0].BaseURL == ""` check in the existing
`TestTemplatesList_SurfacesProviderRegistry` test, locking in the
`string` (not `*string`) type. A future change to `*string` would
surface as JSON `null` and break canvas's "no base_url = use
provider defaults" branch — caught loudly by this assertion.
Tests: 11 TestTemplatesList_* now green, including the new
MalformedYAMLLogsAndSkips and BothProviderShapesCoexist.
The remaining piece of #257 — renaming `Providers []string` JSON tag
to `provider_slugs` — requires coordinated canvas updates across 4
files and is intentionally deferred to a separate PR (no canvas
churn while user is mid-test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the contract drift caught by audit #253. Task #235 ("Server:
enrich /templates payload with structured providers") was marked
completed, but `templates.go` only ever emitted the
`runtime_config.providers []string` slug list — the structured
ProviderEntry shape (auth_env, model_prefixes, model_aliases, base_url)
the description promised was never plumbed.
Templates ship the structured registry under a TOP-LEVEL `providers:`
block (claude-code carries 6+ entries today; hermes still uses the
slug list). Both shapes coexist and are independent — surface them as
two separate fields:
- `providers` → existing []string slug list (unchanged)
- `provider_registry` → new []providerRegistryEntry (structured)
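A plausible shape for the new entry (the field set follows the
description; exact field names and struct tags are assumptions):

```go
type providerRegistryEntry struct {
	Name          string            `yaml:"name" json:"name,omitempty"`
	AuthEnv       string            `yaml:"auth_env" json:"auth_env,omitempty"`
	ModelPrefixes []string          `yaml:"model_prefixes" json:"model_prefixes,omitempty"`
	ModelAliases  map[string]string `yaml:"model_aliases" json:"model_aliases,omitempty"`
	// string, not *string: an absent base_url decodes to "" rather than null,
	// which the canvas reads as "use provider defaults".
	BaseURL string `yaml:"base_url" json:"base_url,omitempty"`
}
```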
The canvas's ProviderModelSelector comment block already anticipates
this ("Templates that ship explicit vendor metadata (future) should
override the heuristic."). With this field in place, the canvas can
optionally drop its prefix-inference fallback for templates that ship
an explicit registry — separate PR. Today's change is purely additive
on the server side; no canvas change required.
Tests:
- TestTemplatesList_SurfacesProviderRegistry: order preservation +
field plumbing on a claude-code-shaped fixture (oauth + minimax)
+ JSON wire-shape gate to catch struct-tag renames.
- TestTemplatesList_OmitsProviderRegistryWhenAbsent: omitempty so
legacy templates (hermes, langgraph) don't emit `null` and break
Array.isArray on the canvas side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug B fix, server-side complement to molecule-runtime PR #2538.
The runtime PR taught `workspace/config.py` to honour
`MODEL_PROVIDER` over `runtime_config.model` from the template's
verbatim YAML. This PR is the upstream half: workspace-server's
`applyRuntimeModelEnv` now sets `MODEL=<picked>` for **every**
runtime, not just hermes (which got `HERMES_DEFAULT_MODEL` already).
Pre-fix: applyRuntimeModelEnv's per-runtime switch only emitted
HERMES_DEFAULT_MODEL for hermes; every other runtime got nothing,
so the adapter read its template's default model from
/configs/config.yaml. Surfaced 2026-05-02 — picking MiniMax-M2 in
canvas → workspace booted with model=sonnet (claude-code template
default) and demanded CLAUDE_CODE_OAUTH_TOKEN.
Post-fix: MODEL is set unconditionally before the per-runtime switch.
HERMES_DEFAULT_MODEL stays for backwards compat. Adapters opt in by
reading os.environ["MODEL"] in their executor (claude-code adapter
already does this since the same Bug B fix; see
workspace-configs-templates/claude-code-default/adapter.py).
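The post-fix ordering, sketched (variable names are illustrative; `picked`
stands for the model resolved from the payload / MODEL_PROVIDER secret
fallback chain):

```go
// Universal export first: every runtime's adapter can read MODEL.
if picked != "" {
	envVars["MODEL"] = picked
}
// The per-runtime switch stays for backwards compat.
switch runtime {
case "hermes":
	if picked != "" {
		envVars["HERMES_DEFAULT_MODEL"] = picked
	}
}
```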
Tests
=====
- `TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes`:
table-driven across claude-code/hermes/langgraph/crewai + empty-model
fallback + MODEL_PROVIDER-secret-fallback path. Adding a new
runtime = adding a row, not writing a new test.
- All 6 sub-cases pass + existing
`TestWorkspaceCreate_FirstDeploy_UnknownModel_OnlyMintModelProvider`
pin still green.
Why now
=======
This was authored alongside the runtime PR but stashed (not committed)
during a session-handoff cleanup. The molecule-runtime side shipped at
SHA 16ac895a and is live on PyPI as molecule-ai-workspace-runtime
0.1.84, but until the workspace-server side ships, the canvas-picked
MODEL env never reaches non-hermes adapters.
Caught by the systematic stash audit triggered by the user's
discovery that ProviderModelSelector had been similarly stashed.
Closes the workspace-server side of #246. Builds on merged #2538.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deletes the 5 unsupported workspace_templates from manifest.json
(langgraph, crewai, autogen, deepagents, gemini-cli). The runtime
matrix is now claude-code / hermes / openclaw / codex — the four
templates with shipping images, working A2A integration, and active
CI publish-image cascades.
Mirrors the prune in:
- workspace-server/internal/handlers/runtime_registry.go
(fallbackRuntimes for dev/test contexts that boot without the
manifest mounted)
- workspace-server/internal/handlers/workspace_provision.go
(sanitizeRuntime: empty/unknown → "claude-code", was "langgraph";
removes the langgraph/deepagents-specific runtime_config skip
branch — they're no longer supported, so the block is dead)
- tests for both: rename TestEnsureDefaultConfig_LangGraph →
_Hermes, TestEnsureDefaultConfig_EmptyRuntimeDefaultsToLangGraph
→ _ClaudeCode, drop TestEnsureDefaultConfig_DeepAgents,
update TestSanitizeRuntime_Allowlist + the two
TestResolveRestartTemplate_* cases that pinned langgraph-default
as the safe-default name
Why this is safe: production reads manifest.json at boot and uses it
as the authoritative allowlist; the 5 removed runtimes have not
shipped working images for ≥1 release cycle. Any provision request
naming one will now coerce to claude-code (with a log line) instead
of returning a runtime that has no functioning template repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the canvas POSTs /workspaces with {model: "minimax/MiniMax-M2.7"},
the model slug was never written to workspace_secrets. The workspace
booted hermes once with HERMES_DEFAULT_MODEL set from payload.Model, but
on every subsequent restart applyRuntimeModelEnv's fallback chain found
nothing in envVars["MODEL_PROVIDER"] (because nothing wrote it) and
hermes silently fell through to the template default
(nousresearch/hermes-4-70b) — wrong provider keys → hermes gateway
401'd → /health poll failed → molecule-runtime never registered →
"container started but never called /registry/register".
Worse, LLM_PROVIDER was never written either (the canvas doesn't send
provider), so CP user-data wrote no provider: field to
/configs/config.yaml and derive-provider.sh fell through to PROVIDER=auto
on every custom-prefix slug.
Fix: after the workspace row commits, persist MODEL_PROVIDER (verbatim
slug) and LLM_PROVIDER (derived from slug prefix) to workspace_secrets.
LLM_PROVIDER is gating-only — derive-provider.sh remains the runtime
source of truth and can override at boot. Reuses extracted
setModelSecret / setProviderSecret helpers (refactored out of SetModel /
SetProvider gin handlers) so SQL stays in one place.
Symptom: failed-workspace 95ed3ff2 (2026-05-02).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both restart paths (interactive Restart handler + auto-restart's
stopForRestart) used to log-and-continue on cpProv.Stop failure. After
PR #2500 made CPProvisioner.Stop surface CP non-2xx as an error, those
paths became the actual leak generator: every transient CP/AWS hiccup =
one orphan EC2 alongside the freshly provisioned one. The 13 zombie
workspace EC2s on demo-prep staging traced to this exact path.
Adds cpStopWithRetry helper with bounded exponential backoff (3 attempts,
1s/2s/4s). Different policy from workspace_crud.go's Delete handler:
Delete returns 500 to the client on Stop failure (loud-fail-and-block —
user asked to destroy, silent leak unacceptable), whereas Restart's
contract is "make the workspace alive again" — refusing to reprovision
strands the user with a dead workspace. So this helper retries to absorb
transient failures, then on exhaustion emits a structured `LEAK-SUSPECT`
log line for the (forthcoming) CP-side workspace orphan reconciler to
correlate. Caller proceeds to reprovision regardless.
ctx-cancel exits the retry early without sleeping the backoff (matters
during shutdown drain); the cancel path emits a distinct log line and
deliberately does NOT emit LEAK-SUSPECT — operator-cancel and
retry-exhaustion are different signals and conflating them would noise
up the orphan-reconciler queue with workspaces we never had a chance to
retry.
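The retry helper, sketched under the policy above (the cpProv.Stop
signature and log wording are assumptions):

```go
func (h *WorkspaceHandler) cpStopWithRetry(ctx context.Context, workspaceID, instanceID string) {
	backoff := time.Second
	for attempt := 1; attempt <= 3; attempt++ {
		err := h.cpProv.Stop(ctx, instanceID)
		if err == nil {
			return
		}
		if attempt == 3 {
			// Exhausted: flag for the CP-side orphan reconciler, then let the
			// caller reprovision anyway.
			log.Printf("LEAK-SUSPECT: cp stop failed after retries for workspace %s: %v", workspaceID, err)
			return
		}
		select {
		case <-ctx.Done():
			// Operator cancel / shutdown drain: distinct signal, no LEAK-SUSPECT.
			log.Printf("cp stop retry aborted by ctx for workspace %s: %v", workspaceID, ctx.Err())
			return
		case <-time.After(backoff):
			backoff *= 2 // 1s, 2s, 4s
		}
	}
}
```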
Tests: 5 behavior tests covering every branch (no-op, first-try success,
eventual success, exhaustion, ctx-cancel) + 1 AST gate that pins the
helper-only invariant (any future inline `h.cpProv.Stop(...)` in
workspace_restart.go fires the gate, mutation-tested).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
http.Client.Do only errors on transport failure — a CP 5xx (AWS
hiccup, missing IAM, transient outage) was silently treated as
success. The workspace row then flipped to status='removed' and the EC2
stayed alive forever with no DB pointer (the "orphan EC2 on a
0-customer account" scenario flagged in workspace_crud.go #1843).
Found while triaging 13 zombie workspace EC2s on demo-prep staging.
Adds a status-code check that returns an error tagged with the
workspace ID + status + bounded body excerpt, so the existing
loud-fail path in workspace_crud.go's Delete handler can populate
stop_failures and surface a 500. Body read is io.LimitReader-capped
at 512 bytes to keep error logs sane during a CP outage.
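The shape of the check (the surrounding method, field names, and error
wording are illustrative):

```go
resp, err := c.httpClient.Do(req)
if err != nil {
	return fmt.Errorf("cp stop workspace %s: %w", workspaceID, err)
}
defer resp.Body.Close()
if resp.StatusCode < 200 || resp.StatusCode > 299 {
	// Bounded read keeps error logs sane during a CP outage.
	excerpt, _ := io.ReadAll(io.LimitReader(resp.Body, 512))
	return fmt.Errorf("cp stop workspace %s: status %d: %s", workspaceID, resp.StatusCode, excerpt)
}
return nil
```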
Tests: 4 new (5xx surfaces, 4xx surfaces, 2xx variants 200/202/204
all succeed, long body is truncated). Test-first verified — the
first three fail on the buggy code and all four pass on the fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups from PR #2494's review:
1. Two new sweep tests exercise the lookup path through
sweepStuckProvisioning end-to-end:
- ManifestOverrideSparesRow: claude-code 11min old, manifest=20min
→ no UPDATE, no broadcast (sparing works through the sweeper)
- ManifestOverrideStillFlipsPastDeadline: claude-code 21min old,
manifest=20min → flipped + payload.timeout_secs=1200
Closes the gap that the unit-test on provisioningTimeoutFor alone
left open: a future refactor could drop the lookup arg from the
sweeper's call and only the unit test caught it. Verified by
regression-injecting `lookup→nil` in sweepStuckProvisioning — both
new tests fail, the old ones still pass.
2. addProvisionTimeoutMs now goes through ProvisionTimeoutSecondsForRuntime
instead of calling provisionTimeouts.get directly. Single accessor
path for the same data — the canvas response and the sweeper now
resolve identically by construction.
No production behavior change; tests + accessor cleanup only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real wiring gap discovered while investigating the issue #2486 cluster of
prod claude-code workspaces that failed at exactly 10 min. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.
Resolution order pinned by new TestProvisioningTimeout_ManifestOverride:
1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
2. Template manifest lookup (per-runtime, beats hermes default too)
3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
4. DefaultProvisioningTimeout (10 min)
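A sketch of that resolution order (names follow the wiring below; the
lookup's return type and exact signatures are assumptions):

```go
func provisioningTimeoutFor(runtime string, lookup RuntimeTimeoutLookup) time.Duration {
	if s := os.Getenv("PROVISION_TIMEOUT_SECONDS"); s != "" {
		if secs, err := strconv.Atoi(s); err == nil && secs > 0 {
			return time.Duration(secs) * time.Second // 1. ops-debug global override
		}
	}
	if lookup != nil {
		if secs := lookup(runtime); secs > 0 {
			return time.Duration(secs) * time.Second // 2. per-runtime manifest value
		}
	}
	if runtime == "hermes" {
		return 30 * time.Minute // 3. CP bootstrap-watcher 25 min + 5 min slack
	}
	return DefaultProvisioningTimeout // 4. 10 min
}
```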
Wiring:
- registry: new RuntimeTimeoutLookup function type, threaded through
StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
pre-existing provisioningTimeoutFor.
- handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
lookup as a method so main.go can pass it without breaking the
handlers→registry import direction.
- cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
the sweep boot.
Verified:
- go test -race ./... passes (every workspace-server package).
- Regression-injected the lookup arm: 3 manifest-override subcases
fail with the actual-vs-expected gap, confirming the new test is
load-bearing.
- The original two timeout tests (env-override, hermes default) keep
passing — `lookup=nil` argument preserves their semantics.
Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per a review nit on PR #2491: the previous message ("a goroutine reached
cpProv.Start but never broadcast its failure") could mislead an
operator if Assertions 2 and 4 both fire — Assertion 4 also catches
"goroutine exited via an earlier path before reaching Start." Spell
both modes out and cross-reference Assertion 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Post-merge follow-up to PR #2487 review feedback:
1. guardAgainstReraise(fn) helper around every panic-test exercise. The
original RecoversAndMarksFailed had its own outer recover() to detect
re-raise; NoOpWhenNoPanic and PersistFailureLogged didn't. If a future
regression makes logProvisionPanic re-raise, those two would have
crashed the test process (taking sibling tests down) instead of
reporting a clean failure. Now all three use the shared guard.
2. Concurrent repro now asserts bcast.count == 7 — the new
concurrentSafeBroadcaster's count field was added in the race fix
but not actually consumed. Cross-checks the existing recorder-set
assertion from a different angle: a goroutine could in principle
reach cpProv.Start (recorder hits) but then lose its
WORKSPACE_PROVISION_FAILED broadcast on the failure path. Pinning
both rules out that silent-drop variant for the canvas-broadcast
contract specifically.
3. Comment on captureLog noting log.SetOutput is process-global and
incompatible with t.Parallel() — preempts a future footgun if
someone parallelizes the panic suite.
Verified: all four tests pass under -race; full handlers + db packages
green under -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI Platform (Go) ran with -race and the concurrent test tripped the
detector: captureBroadcaster (the sequential-test stub) writes lastData
unguarded, and the 7 fan-out goroutines calling markProvisionFailed all
reach that stub concurrently. The local non-race run had hidden it.
Introduce concurrentSafeBroadcaster (mutex-counted) for this single
fan-out test. Sequential tests keep using captureBroadcaster — the
fix is local to the test that creates the goroutines.
Verified ./internal/handlers passes with -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes addressing review of the issue #2486 observability PR:
1. CI failure: the original inline UPDATE in logProvisionPanic used a hard-coded
`status='failed'` literal, which trips workspace_status_enum_drift_test
(the post-PR-#2396 gate that requires every status write to flow through
models.Status* as a parameterized $N). Refactored to call
h.markProvisionFailed, which passes StatusFailed as a bound parameter.
2. Canvas-broadcast gap (review finding): inline UPDATE skipped
RecordAndBroadcast, so panic recovery marked the row failed in DB but
the canvas spinner stayed on "provisioning" until the next poll.
markProvisionFailed fires WORKSPACE_PROVISION_FAILED, so canvas now
flips to a failure card immediately.
3. Critical test bug (review finding): `defer log.SetOutput(log.Writer())`
in three test sites evaluated log.Writer() AFTER the SetOutput swap (the
defer statement ran post-swap, so its argument was already the captured
buffer) — restoring the buffer to itself, never restoring
os.Stderr. Subsequent tests in the package were running with the panic
tests' captured buffer as their writer. Extracted captureLog(t) helper
that captures `prev` BEFORE the swap and uses t.Cleanup.
Plus: softened the "goroutine never started" comment in the concurrent
repro harness — the harness atomic-counts BEFORE the entry log fires, so
"never started" was misleading; the real failure mode is "entry log
renamed/removed or writer hijacked."
Verified: full handlers suite passes; drift gate passes (Platform Go CI
failure root-caused). Regression-injected the recover body again — both
panic tests still fail as expected, confirming the contract is gated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Goal: a deterministic, in-process reproduction of the prod incident
where 7 simultaneous claude-code provisions on the hongming tenant
produced ZERO log lines from any of the four documented exit paths.
Approach: stub CPProvisioner that records every Start() call,
sqlmock for the prepare flow, fire 7 goroutines concurrently against
provisionWorkspaceCP, then assert:
1. Entry log fired exactly 7 times (one per goroutine).
2. Stub Start() recorded all 7 distinct workspace IDs.
3. Each goroutine's entry log names its own workspace ID.
Result on staging head as of 2026-05-02: PASSES — meaning the
silent-drop class isn't reproducible against current head with stub
CP. Tenant hongming runs sha 76c604fb (725 commits behind staging),
so the bug is most likely already fixed upstream — hongming needs
a redeploy.
The test stays as a regression gate: any future refactor that
re-introduces silent goroutine swallow in the CP provision path
(rate-limit drop, channel-send-without-receiver, panic without
recover, etc.) trips it.
A safeWriter wraps the captured log buffer because raw
bytes.Buffer.Write isn't safe for concurrent goroutines — without
serialization the 7 entry-log lines interleave at byte boundaries
and the strings.Count assertion gets unreliable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #2486: 7 claude-code workspaces stuck in provisioning produced
NONE of the four documented exit-path log lines in
provisionWorkspaceCP — neither prepare-failed, nor start-failed, nor
persist-instance-id-failed, nor success. Operators couldn't tell
whether the goroutine ran at all.
Add an entry log at the top of provisionWorkspaceOpts +
provisionWorkspaceCP so a missing entry distinguishes "goroutine
never started" from "started but exited via an unlogged path."
Add logProvisionPanic at the same defer site so a panic inside
either provisioner doesn't (a) crash the whole workspace-server
process, taking every other tenant workspace with it, and (b)
silently leave the row in `provisioning` until the 10-min sweeper
fires. The recover persists status='failed' with a sanitized
panic-class message via a fresh 10s context (the goroutine's own
ctx may have been the one panicking).
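The defer-site recovery, sketched (the persist helper, `db` handle, and log
wording are hypothetical; only the fresh-context and sanitized-message
behavior come from the description above):

```go
defer func() {
	r := recover()
	if r == nil {
		return // no-op on the happy path: no spurious log line
	}
	log.Printf("provision panic for workspace %s: %v\n%s", workspaceID, r, debug.Stack())
	// The goroutine's own ctx may be the thing that panicked, so persist
	// the failure with a fresh, short-lived context.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := persistProvisionFailure(ctx, db, workspaceID, "provision panicked"); err != nil {
		// Defense-in-depth: never leave the operator with a recovered-panic
		// log but an untouched row.
		log.Printf("provision panic: failed to mark workspace %s failed: %v", workspaceID, err)
	}
}()
```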
Tests pin three contracts:
- no-op when no panic (otherwise every successful provision
emits a spurious log line)
- recovers + persists failed status on panic, with stack trace
- defense-in-depth: if the persist itself fails, log it instead
of leaving the operator with a recovered-panic log but no row
Regression-injected by neutering the recover() body — all three
tests fail until the recover + UPDATE path is restored.
This is observability + resilience only, not a root-cause fix
for #2486. The actual silent-drop class still needs reproduction
once the tenant is on a build that includes this entry log.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-side chat_history MCP tool advertises a `before_ts`
parameter for backward paging through long histories, and the docs
describe it as the canonical pagination knob — but the server
silently ignored it until now. Without this fix, an agent passing
before_ts to chat_history would always get the most-recent N rows
and pagination would be broken end-to-end.
Add `before_ts` query param parsed as RFC3339 at the trust boundary
and translated into a `created_at < $X` clause on the existing
builder. Mirrors the strict-inequality shape since_id uses for
forward paging (`created_at > cursorTime`) so paging across both
directions has consistent semantics.
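The trust-boundary parse, roughly (the query-builder plumbing around
`where` and `args` is illustrative):

```go
if raw := c.Query("before_ts"); raw != "" {
	ts, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "before_ts must be RFC3339"})
		return
	}
	// Strict inequality mirrors since_id's forward-paging shape.
	where = append(where, fmt.Sprintf("created_at < $%d", len(args)+1))
	args = append(args, ts)
}
```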
Tests: 3 new branches (positive filter, composition with peer_id
into the canonical chat_history paging shape, RFC3339 rejection
across 4 malformed inputs including URL-encoded SQL injection).
Mutation-verified pre-commit; existing 9 activity tests still pass.
Reported by self-review on PR #2474.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaces the conversation history with one specific peer for the
wheel-side chat_history MCP tool. The filter matches
(source_id = $X OR target_id = $X) so both inbound (peer was sender)
and outbound (peer was recipient) turns appear in the same view,
ordered by created_at, and composes with existing type/source/
since_secs/since_id/limit filters.
Validates peer_id as a UUID at the trust boundary so a malformed
caller can't smuggle SQL fragments via the parameter — the args are
bound but the explicit rejection gives the wheel a cleaner 400
signal than an empty list, and defends against any future code path
that might interpolate the value into a URL or another query.
Tests: 3 new branches (positive filter, composition with
type+source, UUID-shape rejection across 5 malformed inputs).
Mutation-verified: reverting activity.go fails all peer_id tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Option B PR-5. Canvas Config tab now exposes a Provider override input
that's adapter-driven from each runtime's template — no hardcoded
provider list in the canvas. PUT /workspaces/:id/provider fires on Save
when the field is dirty; auto-restart is suppressed to avoid a double
restart alongside the model handler's own restart.
The dropdown's suggestion list comes from /templates →
runtime_config.providers (the field added in
molecule-ai-workspace-template-hermes PR #31). For templates that
haven't migrated to the explicit providers list yet, suggestions
derive from model[].id slug prefixes — still adapter-driven, just
inferred. This keeps existing templates working while platform team
migrates them one at a time.
workspace-server changes:
- Add Providers []string field to templateSummary JSON
- Parse runtime_config.providers in /templates handler
- 2 new tests pin the surfacing + omitempty behavior
canvas changes:
- Remove hardcoded PROVIDER_SUGGESTIONS constant
- Add provider/originalProvider state + PUT-on-save logic
- Add deriveProvidersFromModels() fallback helper
- Wire RuntimeOption.providers from /templates response
- 8 new tests pin the behavior end-to-end
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of PUT /model. Stores the provider slug as the LLM_PROVIDER
workspace secret so the canvas can update model + provider
independently — a user might keep the same model alias and switch
providers (route through a different gateway), or vice versa.
Forcing both through one endpoint would couple every change to a combined
Save+Restart; two endpoints let the canvas update each as the user picks.
Plumbs through the existing chain: secret-load → envVars → CP
req.Env → user-data env exports → /configs/config.yaml (after
controlplane PR #364 lands the heredoc append).
Tests: 5 new cases mirroring SetModel/GetModel exactly — default
empty response, DB error, upsert with restart trigger, empty-clears,
invalid-UUID rejection.
Part of: Option B PR-2 (#196) — workspace-server plumbs LLM_PROVIDER
Stack: PR-1 schema (#2441 merged)
PR-2 (this) ws-server endpoint
PR-3 (#364 open) CP user-data persistence
PR-4 (pending) hermes adapter consume
PR-5 (pending) canvas Provider dropdown
#2429 review finding. The 410-Gone path issues a follow-up
`SELECT updated_at` after detecting status='removed'. If that query
fails (workspace row deleted between the two queries, transient DB
error, etc.), `removedAt` stays as Go's zero time and the JSON body
emits `"removed_at": "0001-01-01T00:00:00Z"` — a misleading timestamp
the client has to know to ignore.
Now we branch on `removedAt.IsZero()` and emit `null` for the failed
path. The actionable signal (the 410 + hint) is unchanged; only the
timestamp shape gets cleaner.
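The branch, sketched (response construction is illustrative; `removedAt`
stands for the time.Time the follow-up SELECT populates):

```go
var removedAtJSON any // stays nil → emitted as JSON null
if !removedAt.IsZero() {
	removedAtJSON = removedAt.UTC().Format(time.RFC3339)
}
c.JSON(http.StatusGone, gin.H{
	"error":      "workspace removed",
	"id":         workspaceID,
	"removed_at": removedAtJSON, // null when the timestamp fetch failed
	// hint unchanged
})
```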
Pinned by `TestWorkspaceGet_RemovedReturns410WithNullRemovedAtOnTimestampFetchFailure`,
which simulates the row vanishing via `sqlmock`'s `WillReturnError(sql.ErrNoRows)`.
The original `_RemovedReturns410` test now also asserts that the
happy-path timestamp is a non-null value (was just checking the key
existed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Defense-in-depth at the endpoint level. Previously, GET /workspaces/:id
returned 200 OK with `status:"removed"` in the body for deleted
workspaces — a silent-fail UX hit on the hongmingwang tenant 2026-04-30:
the channel bridge / molecule-mcp wheel had a dead workspace_id + token
in .env, get_workspace_info returned 200 → the caller assumed everything
was fine, then every subsequent /registry/* call 401'd because tokens
were revoked, and operators had no idea their workspace was gone.
#2425 fixed the steady-state heartbeat path (escalate to ERROR after
3 consecutive 401s). This change is the startup-time defense — fail
loud when the operator first probes the workspace instead of waiting
for the heartbeat to sour.
The 410 body includes:
{error: "workspace removed", id, removed_at, hint: "Regenerate ..."}
Audit-trail consumers that need the body shape of a removed workspace
(admin views, "show me deleted workspaces" tooling) opt into the
legacy 200 + body via ?include_removed=true. Without this opt-in path
the audit trail becomes invisible at the API layer.
Two new tests pinned:
- TestWorkspaceGet_RemovedReturns410
- TestWorkspaceGet_RemovedWithIncludeQueryReturns200
Follow-ups in separate PRs:
- Update workspace/a2a_client.py get_workspace_info to surface
"removed" specifically rather than collapsing into "not found"
- Update channel bridge getWorkspaceInfo (server.ts) to detect 410
→ log clear "workspace was deleted, re-onboard" error
- Audit canvas/* + admin tooling consumers that may rely on the
legacy 200 + status:"removed" shape; switch them to the
?include_removed=true opt-in if needed
- Update docs (runtime-mcp.mdx Troubleshooting + external-agents.mdx
lifecycle table)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes plus a nit cleanup from /code-review-and-quality on PR #2445:
1. **KI-005 hierarchy check parity with /terminal**
HandleConnect runs the KI-005 cross-workspace guard before dispatch
(terminal.go:85-106): when X-Workspace-ID is set and != :id, validate
the bearer's workspace binding then call canCommunicateCheck. Without
this, an org-level token holder in tenant Foo can probe any
workspace's diagnostic state by guessing the UUID — same enumeration
vector KI-005 closed for /terminal in #1609. Per-workspace bearer
tokens are URL-bound by WorkspaceAuth, so the gap is org tokens
within the same tenant.
Fix: copy the same gate into HandleDiagnose, before the
instance_id SELECT.
Test: TestHandleDiagnose_KI005_RejectsCrossWorkspace stubs
canCommunicateCheck=false and confirms 403 fires before the DB
lookup (sqlmock's ExpectationsWereMet pins that we never reached
the SELECT COALESCE). Mirrors the existing
TestTerminalConnect_KI005_RejectsUnauthorizedCrossWorkspace.
2. **Race-free tunnel stderr capture (syncBuf)**
strings.Builder isn't goroutine-safe. os/exec spawns a background
goroutine that copies the subprocess's stderr fd to cmd.Stderr's
Write, so reading the buffer's String() from the request goroutine
on wait-for-port timeout while the tunnel may still be writing is
a data race that `go test -race` flags. Worst-case impact in
production is a garbled Detail string (not a crash), but the fix
is small.
Fix: guard a bytes.Buffer with a sync.Mutex (the syncBuf type, sketched
after this list). Same io.Writer interface, no API changes elsewhere.
3. **Nit cleanup**
- read-pubkey failure now reports as its own step name instead of
a duplicated "ssh-keygen" entry — disambiguates two different
failure modes that previously shared a name.
- Replaced the hand-rolled numToString int-to-string helper with
strconv.Itoa in the test (there was no import-saving reason for it).
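The syncBuf wrapper from item 2, sketched:

```go
// syncBuf: a mutex-guarded bytes.Buffer satisfying io.Writer, safe for the
// exec-spawned stderr-copy goroutine and the request goroutine to share.
type syncBuf struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *syncBuf) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

func (b *syncBuf) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}
```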
Suite: 4 diagnose tests pass with -race; full handlers suite passes
in 3.95s. go vet clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GET /workspaces/:id/terminal/diagnose runs the same per-stage pipeline as
/terminal (ssh-keygen → EIC send-key → tunnel → ssh) but non-interactively
and returns JSON. Each stage reports {name, ok, duration_ms, error,
detail}, plus a top-level first_failure naming the broken stage.
Why: when the canvas terminal silently disconnects ("Session ended" with
no error frame — the user-reported failure mode on hongmingwang's hermes
workspace), there is no remote-readable signal of WHICH stage failed.
The ssh client's stderr lives only in the workspace-server's stdout on
the tenant CP EC2 — invisible without shell access. /terminal can't
expose stderr cleanly because it has already upgraded to WebSocket
binary frames by the time ssh runs. /terminal/diagnose stays pure
HTTP/JSON, so the same auth (WorkspaceAuth + ADMIN_TOKEN fallback) gives
operators a one-call probe that splits "IAM broke" (send-ssh-public-key
fails) from "tunnel/SG broke" (wait-for-port fails) from "sshd auth
broke" (ssh-probe gets Permission denied) from "shell broke" (probe
exits non-zero with stderr).
Stages mirrored from handleRemoteConnect in terminal.go:
1. ssh-keygen: ephemeral session keypair
2. send-ssh-public-key: AWS EIC API push, IAM-gated
3. pick-free-port: local port for the tunnel
4. open-tunnel: aws ec2-instance-connect open-tunnel start
5. wait-for-port: the tunnel actually listens (folds tunnel stderr
   into Detail when it doesn't)
6. ssh-probe: non-interactive `ssh ... 'echo MARKER'` that confirms
   auth + bash + the marker round-trip (CombinedOutput captures
   stderr verbatim; this is the whole reason the endpoint exists)
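The stderr-capture pattern in stage 6 looks roughly like this (a sketch
with placeholder host, user, and option list; the real argv is built in
terminal.go):

```go
package handlers // probe-capture sketch

import (
	"context"
	"errors"
	"os/exec"
	"strconv"
	"strings"
)

// sshProbeSketch shows the capture pattern: CombinedOutput keeps stderr
// verbatim so "Permission denied" or a broken shell surfaces in the step's
// Detail instead of vanishing into server stdout.
func sshProbeSketch(ctx context.Context, port int, keyPath string) (detail string, err error) {
	const marker = "DIAG_MARKER_OK"
	cmd := exec.CommandContext(ctx, "ssh",
		"-i", keyPath,
		"-p", strconv.Itoa(port),
		"-o", "StrictHostKeyChecking=no",
		"user@127.0.0.1",
		"echo "+marker)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return string(out), err // auth/connection failure: stderr rides along as detail
	}
	if !strings.Contains(string(out), marker) {
		return string(out), errors.New("probe ran but marker never came back")
	}
	return "", nil
}
```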
Local Docker workspaces (no instance_id) get a smaller probe:
container-found + container-running. Same response shape so callers
don't need to branch.
Tests stub sendSSHPublicKey / openTunnelCmd / sshProbeCmd via the
existing package-level vars (same pattern as TestSSHCommandCmd_*) so
the test suite stays hermetic — no AWS, no network. The three new
tests pin: (a) routing to remote when instance_id is present,
(b) routing to local when instance_id is empty, and (c) the
operationally critical case: every stage succeeds through
wait-for-port, then a probe failure surfaces ssh stderr in the
ssh-probe step's Error/Detail with first_failure="ssh-probe".
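The stub-and-restore idiom the tests reuse looks like this (sketch only;
the hook's real signature in the package may differ):

```go
package handlers_test // hermetic-stub sketch

import (
	"context"
	"os/exec"
	"testing"
)

// sshProbeCmd stands in for the package-level hook described above.
var sshProbeCmd = func(ctx context.Context, args ...string) *exec.Cmd {
	return exec.CommandContext(ctx, "ssh", args...)
}

func TestDiagnose_StubbedProbe(t *testing.T) {
	orig := sshProbeCmd
	t.Cleanup(func() { sshProbeCmd = orig })
	// Deterministic failure with stderr output: no AWS, no network needed.
	sshProbeCmd = func(ctx context.Context, args ...string) *exec.Cmd {
		return exec.CommandContext(ctx, "sh", "-c", "echo 'Permission denied' >&2; exit 255")
	}
	// The real test would now call HandleDiagnose and assert
	// first_failure == "ssh-probe" with the stderr text in Detail.
}
```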
Auth: rides on the existing WorkspaceAuth middleware. Operators with the
tenant ADMIN_TOKEN (fetched via /cp/admin/orgs/:slug/admin-token) can
probe any workspace without a per-workspace token; this is the same admin
path the canvas dashboard uses to read workspace activity.
The response is always HTTP 200 (success and step failure are both
reported in the JSON body), so callers don't need to branch on the
status code: the endpoint either reports a first_failure or it doesn't.
Resolves task #200, supports task #193 (workspace EC2 sshd
unresponsive — without this endpoint we couldn't pin the failure
stage from outside the tenant CP EC2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-existing TestSSHCommandCmd_BuildsArgv asserts the literal argv
slice. Adding `-o ConnectTimeout=10` shifted the slice, so this commit
updates the snapshot to match. The new behavior-based
TestSSHCommandCmd_ConnectTimeoutPresent (added in the prior commit)
keeps the invariant pinned without depending on argv ordering, so
future tweaks land in only one place even if more options are added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the workspace EC2's sshd is unresponsive (mid-restart, SG drop,
AMI without ec2-instance-connect), the canvas's xterm shows the user's
typed bytes echoed back by the workspace-server's *local* PTY (cooked +
echo mode before ssh sets it raw post-handshake) and then closes
silently when Cloudflare's idle WebSocket timer fires (~100s) — with no
"Connection refused" or "Permission denied" output ever reaching the
user. This is what hongmingwang's hermes terminal looked like 2026-04-30
right after the heartbeat-fix redeploy: status="online" but the shell
appeared dead.
Caught reproducibly by holding a fresh /workspaces/<id>/terminal
WebSocket open for 60s — server sent zero frames except the local-PTY
echo of one keystroke typed at t=8s. ssh was hung at handshake; bash
never saw the byte.
Fix: add `-o ConnectTimeout=10` to ssh args. Now the failure surfaces
as a real ssh error message in the terminal within 10s, instead of
masquerading as a silently dead shell over the next ~100s. Doesn't
diagnose *why* sshd isn't responding (separate investigation), but
it does mean the user gets actionable feedback within seconds.
Behavior-based regression test asserts `-o ConnectTimeout=N` is in the
ssh argv — pins presence, not the literal value, so operators can tune
without breaking the gate. Verified to FAIL on pre-fix code (matched
the literal arg pair) and PASS on fix.
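The gate reduces to a presence check along these lines (a sketch; the
helper name is illustrative, and the real test builds the argv through
the same fixture the snapshot test uses):

```go
package handlers // presence-not-value check sketch

import (
	"strings"
	"testing"
)

// assertConnectTimeoutPresent pins that "-o ConnectTimeout=N" appears in the
// ssh argv without caring about N or about argument ordering.
func assertConnectTimeoutPresent(t *testing.T, argv []string) {
	t.Helper()
	for i := 0; i+1 < len(argv); i++ {
		if argv[i] == "-o" && strings.HasPrefix(argv[i+1], "ConnectTimeout=") {
			return
		}
	}
	t.Fatalf("ssh argv missing -o ConnectTimeout=N: %v", argv)
}
```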
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix Restart called provisioner.Stop / cpProv.Stop synchronously before
returning the HTTP response. CPProvisioner.Stop is DELETE /cp/workspaces/:id
→ CP → AWS EC2 terminate, which can exceed the canvas's 15s HTTP timeout,
especially right after a platform-wide redeploy when every tenant queues a
CP request at once. The user sees a misleading "signal timed out" red banner
on Save & Restart even though the async re-provision goroutine continues
and the workspace ends up online.
Caught 2026-04-30 on hongmingwang hermes workspace 32993ee7-…cb9d75d112a5
right after the heartbeat-fix platform redeploy at 02:11Z. The workspace
came back online correctly; only the canvas response timed out.
Fix moves Stop into the same goroutine as provisionWorkspaceCP /
provisionWorkspaceOpts. The handler now responds in <500ms (DB lookup +
status UPDATE only). Stop and provision keep their existing ordering
inside the goroutine. Uses context.Background() to detach from the request
lifecycle so an aborted client connection doesn't cancel the in-flight
Stop/provision pair.
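In sketch form (function and parameter names here are stand-ins for the
real provisioner calls, not the committed code):

```go
package handlers // async-tail sketch

import (
	"context"
	"log"
)

// restartAsync sketches the post-fix shape: the handler has already written
// its response; stop and provision run here, in order, detached from the
// request context so a dropped client connection can't cancel them.
// stop/provision stand in for provisioner.Stop / provisionWorkspaceCP.
func restartAsync(wsID string,
	stop func(context.Context, string) error,
	provision func(context.Context, string) error) {

	go func() {
		ctx := context.Background() // detach from the request lifecycle
		if err := stop(ctx, wsID); err != nil {
			log.Printf("restart %s: async stop failed: %v", wsID, err)
		}
		if err := provision(ctx, wsID); err != nil {
			log.Printf("restart %s: async provision failed: %v", wsID, err)
		}
	}()
}
```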
Pinned by a behavior-based AST gate (workspace_restart_async_test.go):
the test parses workspace_restart.go and walks the Restart function body,
flagging any <recv>.{provisioner,cpProv}.Stop call that isn't nested in a
*ast.FuncLit. Same family as callsProvisionStart in
workspace_provision_shared_test.go. Verified the gate fails on the
pre-fix shape (flags lines 151 and 153 — the original sync Stop calls).
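The walk itself is roughly this (a simplified sketch; the real gate also
matches the receiver against provisioner/cpProv):

```go
package handlers_test // AST-gate sketch

import (
	"go/ast"
	"go/parser"
	"go/token"
)

// syncStopLines returns the line numbers of any ".Stop(...)" call inside the
// Restart function body that is NOT nested under a *ast.FuncLit, i.e. any
// Stop that would still run synchronously on the request path.
func syncStopLines(src string) ([]int, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "workspace_restart.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	for _, d := range f.Decls {
		fn, ok := d.(*ast.FuncDecl)
		if !ok || fn.Name.Name != "Restart" {
			continue
		}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if _, isLit := n.(*ast.FuncLit); isLit {
				return false // anything under a func literal is the async path: allowed
			}
			if call, ok := n.(*ast.CallExpr); ok {
				if sel, ok := call.Fun.(*ast.SelectorExpr); ok && sel.Sel.Name == "Stop" {
					lines = append(lines, fset.Position(call.Pos()).Line)
				}
			}
			return true
		})
	}
	return lines, nil
}
```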
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeat now echoes the workspace's platform_inbound_secret on every
beat (mirroring /registry/register), and the molecule-mcp client
persists it to /configs/.platform_inbound_secret on receipt.
Symptom (2026-04-30, hongmingwang tenant): chat upload returned 503
"workspace will pick it up on its next heartbeat" and then 401 on
retry — permanent until workspace restart. The 503 message was a lie:
heartbeat used to discard the platform_inbound_secret entirely; only
register delivered it, and register fires once at startup.
Server (Go):
- Heartbeat handler reuses readOrLazyHealInboundSecret (the same
helper chat_files + register use), so heartbeat-time recovery
covers the rotate / mid-life NULL-column case the existing
register-time heal can't reach.
- Failure is non-fatal: the liveness contract trumps secret delivery;
  chat_files retries the lazy heal on its own next request.
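  In sketch form (the helper's signature and the response-map shape here
  are assumptions; only the non-fatal ordering matters):

  ```go
  package handlers // heartbeat-side sketch

  import (
  	"context"
  	"log"
  )

  // attachInboundSecret sketches the best-effort secret echo: readOrHeal
  // stands in for readOrLazyHealInboundSecret, resp for the heartbeat 200
  // body being assembled. Failure logs and returns; the beat still succeeds.
  func attachInboundSecret(ctx context.Context, resp map[string]any, workspaceID string,
  	readOrHeal func(context.Context, string) (string, error)) {

  	secret, err := readOrHeal(ctx, workspaceID)
  	if err != nil {
  		log.Printf("heartbeat %s: inbound-secret heal failed (non-fatal): %v", workspaceID, err)
  		return
  	}
  	if secret != "" {
  		resp["platform_inbound_secret"] = secret // mirrored from /registry/register
  	}
  }
  ```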
Client (Python):
- _persist_inbound_secret_from_heartbeat parses the heartbeat 200
response and persists via platform_inbound_auth.save_inbound_secret.
- All exceptions swallowed — heartbeat liveness > secret persistence;
next tick (≤20s) retries.
Tests:
- Server: pin secret-present, lazy-heal-mint-on-NULL, and heal-
failure-omits-field branches.
- Client: pin persist-on-200, skip-on-empty, skip-on-non-dict-body,
skip-on-401, swallow-save-OSError.
The canvas tab snippet for the Universal MCP path was written before
this PR added the built-in register + heartbeat thread. Earlier wording
described it as "outbound-only — pair with the Claude Code or Python SDK
tab for heartbeat + inbound messages" — that's stale. molecule-mcp now
handles register + heartbeat itself; the only thing it doesn't yet do is
inbound A2A delivery.
Updated:
- externalUniversalMcpTemplate header comment + body — describes
standalone behavior, points operators at SDK/channel only when they
need INBOUND (not heartbeat).
- Drops the now-redundant curl-register step from the snippet — the
binary registers itself on startup.
- Canvas modal label likewise updated.
No runtime / behavior change; pure docs polish so a copy-pasting
operator's mental model matches what the binary actually does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>