PR #2433 (notifications/claude/channel) shipped 'import inbox as
_inbox_module' inside a2a_mcp_server.py:main(). The build script's
import rewriter expands plain 'import inbox' to
'import molecule_runtime.inbox as inbox', so the original source
became 'import molecule_runtime.inbox as inbox as _inbox_module',
which is invalid Python.
Caught at the publish-runtime + PR-built-wheel-smoke gate (the
SyntaxError trace is in run 25200422679). The wheel didn't ship to
PyPI because publish-runtime's smoke-import step refused to install
it, but staging is currently sitting on a broken-build commit until
this fix-forward lands.
Changes:
- a2a_mcp_server.py: lift `import inbox` to top of file (rewriter
produces clean `import molecule_runtime.inbox as inbox`), call
inbox.set_notification_callback directly in main()
- build_runtime_package.py: rewrite_imports() now raises ValueError
when it sees 'import X as Y' for any X in the workspace allowlist,
instead of silently producing a syntax-error wheel. Operator gets
a clear actionable error at build time pointing at the offending
line + suggested rewrites ('from X import …' or plain 'import X').
The build-time gate (this PR's rewriter check) catches the regression
class earlier than the smoke-time gate (PR #2433's failure). Adding
'PR-built wheel + import smoke' to staging branch protection's
required checks is filed separately so this class doesn't merge again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a notification seam to the universal molecule-mcp wheel so push-
notification-capable MCP hosts (Claude Code today; any compliant
client tomorrow) get inbound A2A messages as conversation interrupts
instead of having to poll wait_for_message / inbox_peek.
Wire-up:
- inbox.py: module-level _NOTIFICATION_CALLBACK + set_notification_callback()
Fires from InboxState.record() AFTER lock release, with same dict
shape inbox_peek returns. Best-effort — a raising callback never
prevents the message from landing in the queue.
- a2a_mcp_server.py: _build_channel_notification() pure helper +
bridge wiring in main() that schedules notifications via
asyncio.run_coroutine_threadsafe (poller is a daemon thread, MCP
loop is asyncio).
- Method name 'notifications/claude/channel' matches the contract
documented in molecule-mcp-claude-channel/server.ts:509.
- wheel_smoke.py: pin set_notification_callback as a published name,
same regression class as the 0.1.16 main_sync incident.
Pollers (wait_for_message / inbox_peek) keep working unchanged for
runtimes without notification support.
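A minimal sketch of the seam, using the names this description gives (_NOTIFICATION_CALLBACK, set_notification_callback, InboxState.record); the real inbox.py will differ in detail:

```python
import threading
from typing import Callable, Optional

_NOTIFICATION_CALLBACK: Optional[Callable[[dict], None]] = None

def set_notification_callback(cb: Optional[Callable[[dict], None]]) -> None:
    global _NOTIFICATION_CALLBACK
    _NOTIFICATION_CALLBACK = cb

class InboxState:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._queue: list = []

    def record(self, message: dict) -> None:
        with self._lock:
            self._queue.append(message)
        # Fire AFTER lock release. Best-effort: the message has already
        # landed in the queue, and a raising callback is swallowed.
        cb = _NOTIFICATION_CALLBACK
        if cb is not None:
            try:
                cb(message)
            except Exception:
                pass
```

Firing outside the lock means a slow or reentrant callback can never deadlock against the queue.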
Tests: 6 new in test_inbox.py (callback fires once on record, dedupe
short-circuits before fire, raising cb doesn't break inbox, set/clear
semantics), 5 new in test_a2a_mcp_server.py (method name pin, content
mapping, meta routing, no-id JSON-RPC notification spec, missing-
field tolerance). All 59 combined tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, Restart called provisioner.Stop / cpProv.Stop synchronously before
returning the HTTP response. CPProvisioner.Stop is DELETE /cp/workspaces/:id
→ CP → AWS EC2 terminate, which can exceed the canvas's 15s HTTP timeout,
especially right after a platform-wide redeploy when every tenant queues a
CP request at once. The user sees a misleading "signal timed out" red banner
on Save & Restart even though the async re-provision goroutine continues
and the workspace ends up online.
Caught 2026-04-30 on hongmingwang hermes workspace 32993ee7-…cb9d75d112a5
right after the heartbeat-fix platform redeploy at 02:11Z. The workspace
came back online correctly; only the canvas response timed out.
Fix moves Stop into the same goroutine as provisionWorkspaceCP /
provisionWorkspaceOpts. The handler now responds in <500ms (DB lookup +
status UPDATE only). Stop and provision keep their existing ordering
inside the goroutine. Uses context.Background() to detach from the request
lifecycle so an aborted client connection doesn't cancel the in-flight
Stop/provision pair.
Pinned by a behavior-based AST gate (workspace_restart_async_test.go):
the test parses workspace_restart.go and walks the Restart function body,
flagging any <recv>.{provisioner,cpProv}.Stop call that isn't nested in a
*ast.FuncLit. Same family as callsProvisionStart in
workspace_provision_shared_test.go. Verified the gate fails on the
pre-fix shape (flags lines 151 and 153 — the original sync Stop calls).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External molecule-mcp runtimes register with hardcoded agent_card.name
= molecule-mcp-{id[:8]} and skills=[]. That made every external
workspace look identical on the canvas and gave peer agents calling
list_peers no signal beyond name — they had to guess capabilities.
Three new env vars let the operator declare identity + capabilities
without code changes:
* MOLECULE_AGENT_NAME — display name on canvas (default unchanged)
* MOLECULE_AGENT_DESCRIPTION — one-line description (default empty)
* MOLECULE_AGENT_SKILLS — comma-separated skill names
Comma-separated skills get expanded to {"name": "..."} objects — the
minimum shape that satisfies both shared_runtime.summarize_peers
(reads s["name"]) AND canvas SkillsTab.tsx (id falls back to name).
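The expansion can be sketched as follows; helper names are hypothetical, only the env-var names and the {"name": ...} shape come from this PR:

```python
import os

def build_skills(csv: str) -> list:
    """Expand 'a, b,,c ' into [{'name': 'a'}, ...]: whitespace stripped,
    empty entries dropped."""
    return [{"name": s.strip()} for s in csv.split(",") if s.strip()]

def agent_card(workspace_id: str) -> dict:
    name = os.environ.get("MOLECULE_AGENT_NAME", "").strip()
    card = {
        # whitespace-only name falls back to the previous hardcoded default
        "name": name or f"molecule-mcp-{workspace_id[:8]}",
        "skills": build_skills(os.environ.get("MOLECULE_AGENT_SKILLS", "")),
    }
    desc = os.environ.get("MOLECULE_AGENT_DESCRIPTION", "").strip()
    if desc:  # omitted when unset, keeping the wire payload minimal
        card["description"] = desc
    return card
```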
Strict-superset behaviour: when no env vars are set, agent_card
matches the previous hardcoded value exactly. No regression for
operators who haven't migrated.
Why this matters end-to-end:
* Canvas Skills tab now shows each declared skill as a chip
* Peer agents calling list_peers see {name, skills} per peer and
can route delegations to the right specialist
* Same applies to the canvas Details tab + workspace card hover
Tests cover: defaults match prior behaviour; name override; CSV →
skill objects; whitespace stripping + empty entries dropped;
description omitted when unset (keeps wire payload minimal);
whitespace-only name falls back to default; end-to-end through
_platform_register's payload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Model dropdown's onChange writes to config.runtime_config.model
whenever a runtime is set (hermes, claude-code, etc.), and only
falls back to top-level config.model when no runtime is selected.
But handleSave used to diff the new value against top-level
nextSource.model only — so for any runtime-bearing workspace, the
PUT /workspaces/:id/model never fired and MODEL_PROVIDER never
landed in workspace_secrets.
Symptom (2026-04-30, hongmingwang Hermes Agent
32993ee7-840e-4c02-8ca8-cb9d75d112a5):
- User picks minimax/MiniMax-M2.7-highspeed from the dropdown
- Hits Save & Restart
- Save reports success; restart fires
- The new EC2 boots with HERMES_DEFAULT_MODEL empty
- install.sh defaults to nousresearch/hermes-4-70b
- hermes-agent errors "No LLM provider configured" on every chat
turn because no NOUS_API_KEY / OPENROUTER_API_KEY is set
- Reload Config tab → model field reverts to whatever
GET /workspaces/:id/model returns (i.e. empty / template default)
handleSave now reads the effective model from runtime_config.model
first and falls back to top-level model for legacy no-runtime
workspaces. Same change for the old-value diff so a no-op Save
still skips the PUT.
Tests pin both branches: PUTs /model when the dropdown changed
runtime_config.model on a hermes workspace; does NOT PUT when
the value is unchanged from what GET /model returned.
The universal molecule-mcp wheel runs in a daemon thread, posting
/registry/heartbeat every 20s. When the workspace gets deleted
server-side (DELETE /workspaces/:id), the platform revokes all tokens
for that workspace. Previous behaviour: heartbeat would 401 forever,
log at WARNING per tick, no actionable signal anywhere.
Failure mode hit on hongmingwang tenant 2026-04-30: workspace
a1771dba was deleted at some prior time, the channel-bridge .env
still pointed at it, MCP tools 401-ed silently with the operator
having no idea why. The register-time path at mcp_cli.py:104-111
already handles 401 loudly and actionably (sys.exit(3) with regenerate-
from-canvas-Tokens text) — extend the same pattern to the heartbeat.
Behaviour:
* count < 3: WARNING per tick (could be transient blip)
* count == 3: ERROR with re-onboard instructions, names the dead
workspace_id, points at the canvas Tokens tab
* count > 3 and every 20 ticks (~7 min): re-log ERROR so a session
that started after the first ERROR still catches it
5xx and other non-auth HTTP errors do NOT increment the auth-failure
counter — that would mislead the operator (e.g. a server blip would
trigger "token revoked" when the token is fine).
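The escalation policy reads as a small counter state machine; this is a sketch under the thresholds above, with a hypothetical tracker class rather than the real heartbeat-loop code:

```python
import logging

log = logging.getLogger("molecule_mcp.heartbeat")

class AuthFailureTracker:
    ESCALATE_AT = 3
    RELOG_EVERY = 20  # ticks, roughly 7 min at a 20s heartbeat

    def __init__(self) -> None:
        self.count = 0

    def on_status(self, status: int, workspace_id: str) -> str:
        """Returns the log level used for this tick (for testability)."""
        if status in (401, 403):
            self.count += 1
            if self.count < self.ESCALATE_AT:
                log.warning("heartbeat auth failure (%d)", self.count)
                return "WARNING"
            if (self.count == self.ESCALATE_AT
                    or (self.count - self.ESCALATE_AT) % self.RELOG_EVERY == 0):
                log.error(
                    "heartbeat token revoked for workspace %s; re-onboard "
                    "via the canvas Tokens tab", workspace_id)
                return "ERROR"
            return "WARNING"
        if 200 <= status < 300:
            self.count = 0  # recovery resets the counter
        # 5xx and other non-auth statuses never touch the auth counter
        return "NONE"
```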
Tests cover: single 401 stays at WARNING; 3 consecutive 401s escalate
to ERROR with the right keywords; 403 treated identically; recovery
via 200 resets the counter; 5xx never triggers the auth path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to PR #2421. The standalone wrapper (mcp_cli.py) got
heartbeat-time secret persistence in #2421, but the in-container
heartbeat (workspace/heartbeat.py) was missed — and that's the path
every workspace EC2 actually runs. Result: hongmingwang Claude Code
agent stayed 401-forever on chat upload after this morning's deploy
because the workspace's runtime never picked up the lazy-healed
secret.
The in-container _loop now captures the heartbeat response and calls
the same _persist_inbound_secret_from_heartbeat helper used by the
standalone path, on both the first POST and the 401-retry POST.
Defensive on every error (non-JSON, non-dict, empty, save failure) —
liveness contract trumps secret persistence.
Tests pin: happy path, absent secret, empty string, non-JSON body,
non-dict body, save_inbound_secret OSError, end-to-end loop.
Two cleanups stacked on PR #2418:
1. Refactor `send_a2a_message(target_url, msg)` →
`send_a2a_message(peer_id, msg)`. After #2418 every caller passes
`${PLATFORM_URL}/workspaces/{peer_id}/a2a` — the function's
parameter pretended to accept arbitrary URLs but in practice only
one shape is meaningful. Owning URL construction inside the
function makes the contract honest and centralises the peer-id
validation introduced below.
2. Add `_validate_peer_id` UUID-shape check at the trust boundary.
`discover_peer` and `send_a2a_message` are the entry points where
agent-controlled strings flow into URL paths; rejecting non-UUID
input at this layer eliminates the URL-interpolation class of
bug (`workspace_id="../admin"` etc.) regardless of how the rest
of the codebase interpolates ids elsewhere. Auth was already
gating malicious access — this is consistency + clear failure
over silent platform 4xx.
In-container tests cover positive UUIDs, malformed input
(``"ws-abc"``, ``"../admin"``, empty), and the contract that
``tool_delegate_task`` hands the peer_id to ``send_a2a_message``
without building URLs itself.
Live-verified: external delegation 8dad3e29 → 97ac32e9 returned
"refactor verified" from Claude Code Agent through the refactored
code; ``_validate_peer_id`` rejects ``"ws-abc"`` and ``"../admin"``
and accepts canonical UUIDs.
Stacked on PR #2418 (proxy-routing fix). Will rebase onto staging
once #2418 merges.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Heartbeat now echoes the workspace's platform_inbound_secret on every
beat (mirroring /registry/register), and the molecule-mcp client
persists it to /configs/.platform_inbound_secret on receipt.
Symptom (2026-04-30, hongmingwang tenant): chat upload returned 503
"workspace will pick it up on its next heartbeat" and then 401 on
retry — permanent until workspace restart. The 503 message was a lie:
heartbeat used to discard the platform_inbound_secret entirely; only
register delivered it, and register fires once at startup.
Server (Go):
- Heartbeat handler reuses readOrLazyHealInboundSecret (the same
helper chat_files + register use), so heartbeat-time recovery
covers the rotate / mid-life NULL-column case the existing
register-time heal can't reach.
- Failure is non-fatal: liveness contract trumps secret delivery,
chat_files retries lazy-heal on its own next request.
Client (Python):
- _persist_inbound_secret_from_heartbeat parses the heartbeat 200
response and persists via platform_inbound_auth.save_inbound_secret.
- All exceptions swallowed — heartbeat liveness > secret persistence;
next tick (≤20s) retries.
Tests:
- Server: pin secret-present, lazy-heal-mint-on-NULL, and heal-
failure-omits-field branches.
- Client: pin persist-on-200, skip-on-empty, skip-on-non-dict-body,
skip-on-401, swallow-save-OSError.
tool_delegate_task was POSTing directly to peer["url"], which is
the Docker-internal hostname (e.g. http://ws-X-Y:8000) for in-
container peers. External callers — the standalone molecule-mcp
wrapper running on an operator's laptop — get [Errno 8] nodename
nor servname every single delegation, breaking the universal-MCP
path's last "ride the same code as in-container" claim.
The platform's /workspaces/:peer-id/a2a proxy endpoint already
handles internal forwarding for in-container peers AND is the only
path external runtimes can use. Unify on it: in-container callers
pay one extra HTTP hop on the same Docker bridge (microseconds);
external callers get a working delegation path for the first time.
discover_peer is still called for access-control + online-status
detection — only the routing target changes. Verified live on
2026-04-30 against workspace 8dad3e29 (external mac runtime) →
97ac32e9 (Claude Code Agent in-container): direct POST returned
ConnectError, proxy POST returned "acknowledged from claude code
agent" as requested.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CodeQL flagged the bare `assert state.pop(...) is None` — under
`python -O` asserts are stripped, which would skip the call entirely
and the test would silently pass without exercising the code. Bind
the result first so the call always runs.
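The before/after shapes, concretely (`state` and the key are illustrative):

```python
state = {"k": None}

# Flagged shape: under `python -O` the entire assert statement,
# expression included, is stripped, so pop() would never run:
#   assert state.pop("k") is None

# Fixed shape: the call always executes; only the assert can be stripped.
popped = state.pop("k")
assert popped is None
```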
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The universal MCP server (a2a_mcp_server.py) was outbound-only — agents
in standalone runtimes (Claude Code, hermes, codex, etc.) could
delegate, list peers, and write memories, but never observed the
canvas-user or peer-agent messages addressed to them. This blocked
"constantly responding" loops without forcing operators back onto a
runtime-specific channel plugin.
This PR closes the inbound gap with a poller-fed in-memory queue and
three new MCP tools:
- wait_for_message(timeout_secs?) — block until next message arrives
- inbox_peek(limit?) — list pending messages (non-destructive)
- inbox_pop(activity_id) — drop a handled message
A daemon thread polls /workspaces/:id/activity?type=a2a_receive every
5s, fills the queue from the cursor (since_id), and persists the cursor
to ${CONFIGS_DIR}/.mcp_inbox_cursor so a restart doesn't replay backlog.
On 410 (cursor pruned) we fall back to since_secs=600 for a bounded
recovery window. Activity-row → InboxMessage extraction mirrors the
molecule-mcp-claude-channel plugin's extractText (envelope shapes #1-3
+ summary fallback).
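Cursor handling can be sketched as below; the fetch callable and cursor path are stand-ins, while the 410 fallback to since_secs=600 follows the description above:

```python
import pathlib
import tempfile
from typing import Optional

# Stand-in for ${CONFIGS_DIR}/.mcp_inbox_cursor
CURSOR_FILE = pathlib.Path(tempfile.gettempdir()) / ".mcp_inbox_cursor"

def load_cursor() -> Optional[str]:
    try:
        return CURSOR_FILE.read_text().strip() or None
    except OSError:
        return None

def poll_once(fetch, queue: list) -> None:
    """fetch(since_id=..., since_secs=...) -> (status, rows). On 410 the
    server pruned our cursor; fall back to a bounded since_secs window."""
    cursor = load_cursor()
    status, rows = fetch(since_id=cursor, since_secs=None)
    if status == 410:
        status, rows = fetch(since_id=None, since_secs=600)
    if status != 200:
        return
    for row in rows:
        queue.append(row)
        cursor = row["id"]
    if cursor:
        CURSOR_FILE.write_text(cursor)  # restart won't replay backlog
```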
mcp_cli.main starts the poller alongside the existing register +
heartbeat threads. In-container runtimes (which have push delivery via
canvas WebSocket) skip activation, so inbox tools return an
informational "(inbox not enabled)" message instead of double-delivery.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Critical:
- ExternalConnectModal.tsx: filledUniversalMcp substitution searched
for WORKSPACE_AUTH_TOKEN but the snippet's placeholder is now
MOLECULE_WORKSPACE_TOKEN (changed in the previous polish commit
876c0bfc). Operators copy-pasting the MCP tab would have gotten a
literal "<paste from create response>" instead of the token. Fix
the substitution to match the new placeholder name.
Important:
- mcp_cli._platform_register: 401/403 from initial register now hard-
exits with code 3 + an actionable stderr message pointing the
operator at the canvas Tokens tab. Pre-fix: warning log + continue,
which made a bad-token startup silently fail (heartbeat 401's
forever, every tool call also 401's, no clear surfacing in the
operator's MCP client). 500/503 still log + continue (transient
platform blips shouldn't abort the MCP loop).
- a2a_mcp_server.cli_main docstring: removed stale claim that this is
the wheel's console-script entry-point target. The actual target is
mcp_cli.main since 2026-04-30. Wheel-smoke pins both names so the
functionality was correct, but the doc was lying.
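A sketch of the hard-exit policy; do_register stands in for the real HTTP call and the message text is illustrative:

```python
import sys

def register_or_die(do_register, workspace_id: str) -> None:
    status = do_register()
    if status in (401, 403):
        print(
            f"molecule-mcp: register for {workspace_id} rejected ({status}).\n"
            "Your token is invalid or revoked; regenerate it from the "
            "canvas Tokens tab.",
            file=sys.stderr,
        )
        sys.exit(3)  # hard-fail: every later call would 401 too
    if status >= 500:
        # transient platform blip: log + continue, don't abort the MCP loop
        print(f"molecule-mcp: register got {status}, continuing",
              file=sys.stderr)
```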
Test coverage: 3 new mcp_cli tests:
- register 401 exits code=3 + stderr mentions canvas Tokens tab
- register 403 (C18 hijack rejection) takes same path
- register 500/503 does NOT exit — only auth errors hard-fail
Findings deferred to follow-up (acceptable per review rubric):
- Code dedup across mcp_cli / heartbeat.py / molecule_agent SDK
- Pooled httpx.Client for connection reuse
- Heartbeat exponential backoff
- Token-resolution ordering parity (env-first vs file-first)
between mcp_cli.main and platform_auth.get_token
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canvas tab snippet for the Universal MCP path was written before
this PR added the built-in register + heartbeat thread. Earlier wording
described it as "outbound-only — pair with the Claude Code or Python SDK
tab for heartbeat + inbound messages" — that's stale. molecule-mcp now
handles register + heartbeat itself; the only thing it doesn't yet do is
inbound A2A delivery.
Updated:
- externalUniversalMcpTemplate header comment + body — describes
standalone behavior, points operators at SDK/channel only when they
need INBOUND (not heartbeat).
- Drops the now-redundant curl-register step from the snippet — the
binary registers itself on startup.
- Canvas modal label likewise updated.
No runtime / behavior change; pure docs polish so a copy-pasting
operator's mental model matches what the binary actually does.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two paired fixes that together let an external operator run a single
process (molecule-mcp) and see their workspace come up online in the
canvas — the bug surfaced live when status got stuck at "awaiting_agent /
OFFLINE" despite an active MCP server.
Platform side (workspace-server/internal/handlers/registry.go):
Heartbeat handler already auto-recovers offline → online and
provisioning → online, but NOT awaiting_agent → online. Healthsweep
flips stale-heartbeat external workspaces TO awaiting_agent, and
with no recovery path the workspace stays "OFFLINE — Restart" in the
canvas forever. Add the symmetric branch: if currentStatus ==
"awaiting_agent" and a heartbeat arrives, flip to online + broadcast
WORKSPACE_ONLINE. Mirrors the existing offline/provisioning patterns
exactly. Test: TestHeartbeatHandler_AwaitingAgentToOnline asserts
the SQL UPDATE fires with the awaiting_agent guard clause.
Wheel side (workspace/mcp_cli.py):
molecule-mcp was outbound-only — operators had to run a separate
SDK process to register + heartbeat. Now mcp_cli.main():
1. Calls /registry/register at startup (idempotent upsert flips
status awaiting_agent → online via the existing register path).
2. Spawns a daemon thread that POSTs /registry/heartbeat every
20s. 20s is comfortably under the healthsweep stale window so
a single missed beat doesn't cause status churn.
3. Runs the MCP stdio loop in the foreground.
Both calls set Origin: ${PLATFORM_URL} so the SaaS edge WAF accepts
them. Threaded heartbeat (not asyncio) chosen because it doesn't
need to share an event loop with the MCP stdio server — daemon=True
cleanly dies when the operator's runtime exits.
MOLECULE_MCP_DISABLE_HEARTBEAT=1 escape hatch lets in-container
callers (which have heartbeat.py running already) reuse the entry
point without double-heartbeating. Default is enabled.
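Steps 1-3 can be sketched as a small harness; the three callables stand in for the real register / heartbeat / stdio-loop code:

```python
import os
import threading
import time

def start_runtime(register, heartbeat, run_stdio_loop, interval=20.0):
    register()  # idempotent upsert flips awaiting_agent -> online
    if os.environ.get("MOLECULE_MCP_DISABLE_HEARTBEAT") != "1":
        def beat():
            while True:
                heartbeat()
                time.sleep(interval)
        # daemon=True: the thread dies cleanly when the operator's
        # runtime exits, no shared event loop needed.
        threading.Thread(target=beat, daemon=True).start()
    run_stdio_loop()  # MCP stdio server in the foreground
```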
End-to-end verification (live, against
hongmingwang.moleculesai.app, workspace 8dad3e29-...):
pre-fix: status=awaiting_agent → canvas shows OFFLINE forever
post-fix: ran `molecule-mcp` for 5s standalone → canvas state:
status=online runtime=external agent=molecule-mcp-8dad3e29
Test coverage: 7 new mcp_cli tests (register-at-startup, heartbeat-
thread-spawned, disable-env-skips-both, env-and-file token resolution,
register payload shape, heartbeat endpoint + headers); 1 new platform
test (awaiting_agent → online recovery). Full workspace + handlers
suites green: 1355 Python, full Go handlers passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Connect your external agent" dialog already covered Claude Code,
Python SDK, curl, and raw fields. This adds a Universal MCP tab that
documents the new `molecule-mcp` console script — the runtime-agnostic
baseline shipped by PR #2413's workspace-runtime changes.
Surface area:
- New `externalUniversalMcpTemplate` constant in workspace-server.
Three-step snippet: pip install runtime → one-shot register via curl
→ wire molecule-mcp into agent's MCP config (Claude Code example,
notes that hermes/codex/etc. take the same env-var contract).
- Workspace create response now includes `universal_mcp_snippet`
alongside the existing curl/python/channel snippets.
- Canvas modal renders the tab when `universal_mcp_snippet` is
present; backward-compatible with older platform builds (tab hides
when empty).
Origin/WAF coverage (the user explicitly asked for this):
- The runtime wheel handles Origin automatically (this PR's earlier
commit on platform_auth.auth_headers).
- The curl tab now sets `Origin: {{PLATFORM_URL}}` preemptively
with an explanatory comment; `/registry/register` is currently
WAF-allowed without it but adding now keeps the snippet working
if WAF rules expand. The comment also explains why
`/workspaces/*` paths return empty 404 without Origin — the
exact failure mode I hit while smoke-testing this PR live.
- The MCP snippet's footer notes that the wheel auto-handles
Origin so operators don't think about it.
End-to-end verification (against live tenant
hongmingwang.moleculesai.app, freshly registered workspace):
- get_workspace_info → full JSON
- list_peers → "Claude Code Agent (ID: 97ac32e9..., status: online)"
- recall_memory → "No memories found."
all returned by the molecule-mcp binary speaking MCP stdio to
this Claude Code session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Discovered while smoke-testing the molecule-mcp external-runtime path
against a live tenant (hongmingwang.moleculesai.app). Every tool call
that hit /workspaces/* or /registry/*/peers returned 404 — but
/registry/register and /registry/heartbeat returned 200. Diagnosis:
the tenant's edge WAF requires a same-origin header. Without it,
unhandled paths get silently rewritten to the canvas Next.js app,
which has no /workspaces or /registry/:id/peers route and returns an
empty 404. The molecule-mcp-claude-channel plugin already sets this
header (server.ts:271-276); the workspace runtime never did because
in-container PLATFORM_URLs (Docker network) aren't behind the WAF.
Fix: extend platform_auth.auth_headers() to include
Origin: ${PLATFORM_URL} whenever PLATFORM_URL is set. Inside-container
behavior is unchanged (the WAF is path-irrelevant for the internal
hostnames). External-runtime calls now thread the WAF correctly.
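A sketch of the extended header helper, assuming only the behavior described here (the real platform_auth.auth_headers may differ; get_token is a stand-in):

```python
import os

def auth_headers(get_token=lambda: None) -> dict:
    headers = {}
    token = get_token()
    if token:
        headers["Authorization"] = f"Bearer {token}"
    platform_url = os.environ.get("PLATFORM_URL")
    if platform_url:
        # Same-origin header the SaaS edge WAF requires; without it,
        # unhandled paths are rewritten to the canvas app and 404.
        headers["Origin"] = platform_url
    return headers
```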
Verification (live, against a freshly-registered external workspace):
pre-fix: get_workspace_info → "not found", list_peers → 404
post-fix: get_workspace_info → full workspace JSON,
list_peers → "Claude Code Agent (ID: 97ac32e9..., status: online)"
This is the kind of bug unit tests can never catch — caught only by
running the wheel against the real tenant. Memory:
feedback_always_run_e2e.md.
Test coverage: 4 new tests in test_platform_auth.py — Origin alone
when no token + Origin + Authorization both, no-PLATFORM_URL falls
through to original empty-dict behavior, env-token path with Origin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ship the baseline universal MCP path that any external runtime (Claude
Code, hermes, codex, anything that speaks MCP stdio) can use, before
optimizing per-runtime channels. Today the workspace MCP server only
spins up inside the container; external operators have no way to call
the 8 platform tools (delegate_task, list_peers, send_message_to_user,
commit_memory, etc.) from outside.
Three additive changes:
1. **`platform_auth.get_token()` env-var fallback** — adds
`MOLECULE_WORKSPACE_TOKEN` as a fallback when no
`${CONFIGS_DIR}/.auth_token` file exists. File-first preserves
in-container behavior unchanged. External operators (no /configs
volume) now have a way to supply the token without faking the
filesystem layout.
2. **`molecule-mcp` console script** — adds a new entry point in the
published `molecule-ai-workspace-runtime` PyPI wheel. Operators run
`pip install molecule-ai-workspace-runtime`, set 3 env vars
(WORKSPACE_ID, PLATFORM_URL, MOLECULE_WORKSPACE_TOKEN), and register
the binary in their agent's MCP config. `mcp_cli.main` is a thin
validator wrapper — it checks env BEFORE importing the heavy
`a2a_mcp_server` module so a misconfigured first-run gets a friendly
3-line error instead of a 20-line module-level RuntimeError
traceback.
3. **Wheel smoke gate** — extends `scripts/wheel_smoke.py` to assert
`cli_main` and `mcp_cli.main` are importable. Same regression class
as the 0.1.16 main_sync incident: a silent rename or unrewritten
import here would break every external operator on the next wheel
publish (memory: feedback_runtime_publish_pipeline_gates.md).
Test coverage:
- `tests/test_platform_auth.py` — 8 new tests for the env-var fallback:
file-priority, env-fallback, whitespace handling, cache, header
construction, empty-env-as-unset.
- `tests/test_mcp_cli.py` — 8 new tests for the validator: each
required var separately, file-or-env satisfies token requirement,
whitespace-only env treated as missing, help mentions canvas Tokens
tab.
- Full `workspace/tests/` suite green: 1346 passed, 1 skipped.
- Local end-to-end: built wheel, installed in venv, ran `molecule-mcp`
with no env → friendly error; with env → MCP server starts.
Why now / why this shape: user redirect was "support the baseline
first so all runtimes can use, then optimize". A claude-only MCP
channel leaves hermes/codex/third-party operators broken on
runtime=external. This PR ships the runtime-agnostic baseline; per-
runtime polish (claude-channel push delivery, hermes-native
bindings) is a follow-up PR. PR #2412 fixed the partner bug where
canvas Restart silently revoked the operator's token — the two
together unblock the external-runtime story end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
POST /workspaces/:id/restart on a runtime=external workspace ran the full
re-provision pipeline (Stop → provisionWorkspace*), which calls
issueAndInjectToken → RevokeAllForWorkspace. For external workspaces
(operator-driven, no container/EC2) that silently destroyed the operator's
local bearer token on every "Restart" click in the canvas — the local
poller would then 401-spam against /activity until the operator manually
regenerated from the Tokens tab.
The auto-restart path (runRestartCycle, line 436) already short-circuits
runtime=external. This patch mirrors that for the manual handler so the
two paths agree, and surfaces a 200 OK with a clear message so the
canvas can tell the operator the fix is on their side rather than
silently no-op'ing.
Test coverage: TestRestartHandler_ExternalRuntimeNoOps asserts the
short-circuit fires *before* any DB write or provision call. sqlmock's
"unexpected query" failure mode would catch a regression that
re-introduced the token revoke or the status=provisioning UPDATE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
peer-discovery-404 imports workspace/a2a_client.py, which depends on
httpx; the runner's stock Python doesn't have it, so the replay's
PARSE assertion (b) fails with ModuleNotFoundError on every run. The
WIRE assertion (a) — pure curl — still passes, which made the replay
LOOK only partially broken when the tenant side was in fact fine.
Adding tests/harness/requirements.txt with only httpx instead of
sourcing workspace/requirements.txt: that file pulls a2a-sdk,
langchain-core, opentelemetry, sqlalchemy, temporalio, etc. — ~30s
of install for one replay's PARSE step. The harness's deps surface
should grow when a new replay introduces a new import, not by
default.
Workflow gains one step (`pip install -r tests/harness/requirements.txt`)
between the /etc/hosts setup and run-all-replays. No other changes.
Replaces the hardcoded base64 sentinel (630dd0da) with a per-run
generation in up.sh, exported into compose's interpolation environment.
Why:
- Hardcoding a 32-byte base64 string in the repo, even one labelled
"test-only", sets a bad muscle-memory pattern. The next agent or
contributor copies the shape into another harness — or worse, into a
staging .env — and the test-only sentinel turns into something
someone treats as a real key.
- Secret scanners flag key-shaped values regardless of the surrounding
comment claiming intent. Avoiding the literal entirely sidesteps the
false-positive.
- A fresh key per harness lifetime more closely mimics prod's
per-tenant isolation, exercising the same code paths without any
pretense of stable encrypted-data fixtures (which the harness wipes
on every ./down.sh anyway).
Implementation:
- up.sh: `openssl rand -base64 32` if SECRETS_ENCRYPTION_KEY isn't
already set in the caller's env. Honoring a pre-set value lets a
debug session pin a key for reproducibility (e.g. when investigating
encrypted-row corruption).
- compose.yml: `${SECRETS_ENCRYPTION_KEY:?…}` makes a misuse loud —
running `docker compose up` directly bypassing up.sh fails fast with
a clear error pointing at the right entry point, rather than a 100s
unhealthy-tenant timeout.
Both paths verified via `docker compose config`:
- with key exported: value interpolates cleanly
- without it: "required variable SECRETS_ENCRYPTION_KEY is missing a
value: must be set — run via tests/harness/up.sh, which generates
one per run"
Found via the first run of the harness-replays-required-check workflow
(#2410): the tenant container failed its healthcheck after 100s with
"refusing to boot without encryption in production". This is the
deferred CRITICAL flagged on PR #2401 — `crypto.InitStrict()` requires
SECRETS_ENCRYPTION_KEY when MOLECULE_ENV=production, and the harness
sets prod-mode but never seeded a key.
Fix: add a clearly-test 32-byte base64 value (encoding the literal
string "harness-test-only-not-for-prod!!") inline. Keeping
MOLECULE_ENV=production preserves the harness's value as a production-
shape replay surface — it now exercises the full encryption boot path
including the strict check, rather than skirting it via dev-mode.
Why inline rather than .env:
- The harness compose file is meant to be self-contained and
reproducible from a clean clone. An external .env would split the
config across two files for one synthetic value.
- The value is intentionally a sentinel; there's no operator decision
here to gate behind a per-deployment file.
After this lands the harness boots clean and `run-all-replays.sh` can
exercise the buildinfo + peer-discovery replays as designed. The
required-check workflow itself (#2410) needs no change.
First run on PR #2410 failed with 'container harness-tenant-1 is unhealthy'
but the dump-compose-logs step printed empty tenant logs because
run-all-replays.sh's trap-on-EXIT had already torn down the harness.
Setting KEEP_UP=1 leaves containers in place; the always-run Force
teardown step at the end owns cleanup explicitly. Now we'll actually
see why the tenant didn't become healthy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterates a list of tenant slugs (default canary set on production,
operator-supplied on staging), curls each tenant's /buildinfo plus
canvas's /api/buildinfo, compares to origin/main's HEAD SHA, prints a
table with one of {current, stale, unreachable} per surface. Returns
non-zero if any surface is stale, so it can be wired into a periodic
alert later.
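The per-surface classification behind the table can be sketched like
this (function and status names are illustrative; the real script is
shell):

```python
def classify(fetched_sha, expected_sha):
    """Map one surface's /buildinfo result to a table status.

    fetched_sha is None/empty when the curl failed or returned no
    git_sha (e.g. an HTML 404 page instead of JSON).
    """
    if not fetched_sha:
        return "unreachable"
    return "current" if fetched_sha == expected_sha else "stale"

def overall_exit_code(statuses):
    # Non-zero only on stale: a stale surface is the hard signal that a
    # deploy didn't land; unreachable stays a soft warning here.
    return 1 if "stale" in statuses else 0
```

This is why a periodic alert can key off the exit code alone without
parsing the table.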
Why this exists: every "is the fix live?" question used to be
answered with a one-off curl + git rev-parse + manual diff. This
script does that uniformly across every public surface (workspace
tenants + canvas) and emits parseable output. The redeploy verifier (#2398)
covers the deploy moment; this covers any-time-after.
Reads EXPECTED_SHA from `gh api repos/Molecule-AI/molecule-core/
commits/main` so it always reflects the actual upstream tip, not
local working-copy state. Falls back to local origin/main with a
WARN if `gh` isn't logged in — debugging is still useful even if
the comparison may lag.
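The resolution order can be sketched like this (the helper and its
injectable runner are illustrative; the real script shells out to `gh`
and `git` directly):

```python
import subprocess

def expected_sha(run=subprocess.run):
    """Return (sha, used_fallback). Prefers the upstream tip via
    `gh api`; falls back to local origin/main when gh is missing or
    not logged in. `run` is injectable for testing."""
    try:
        out = run(
            ["gh", "api", "repos/Molecule-AI/molecule-core/commits/main",
             "--jq", ".sha"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip(), False
    except (OSError, subprocess.CalledProcessError):
        # Local ref may lag upstream; caller prints a WARN but keeps going.
        out = run(
            ["git", "rev-parse", "origin/main"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip(), True
```

The second tuple element is what drives the WARN: the comparison still
runs, it just carries a "may lag" caveat.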
Depends on:
- #2409 (TenantGuard /buildinfo allowlist) — without it every
tenant looks "unreachable" because the route 404s before the
handler. Already merged on staging; will hit production after
the next staging→main fast-forward + redeploy.
- #2407 (canvas /api/buildinfo) — already on main + Vercel.
Usage:
./scripts/ops/check-prod-versions.sh # production canary set
TENANT_SLUGS="a b c" ./scripts/ops/check-prod-versions.sh # custom set
ENV=staging TENANT_SLUGS="..." ./scripts/ops/check-prod-versions.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap between "the harness exists" and "the harness blocks bugs."
Phase 2 of the harness roadmap (per tests/harness/README.md): make
harness-based E2E a required CI check on every PR touching the tenant
binary or the harness itself.
Trigger: push + pull_request to staging+main, paths-filtered to
workspace-server/**, canvas/**, tests/harness/**, and this workflow.
merge_group support included so this becomes branch-protectable.
Single-job-with-conditional-steps pattern (matches e2e-api.yml). One
check run regardless of paths-filter outcome; satisfies branch
protection cleanly per the PR #2264 SKIPPED-in-set finding.
Why this exists: on 2026-04-30 we shipped a TenantGuard allowlist gap
(/buildinfo added to router.go in #2398, never added to the allowlist)
that the existing buildinfo-stale-image.sh replay would have caught.
The harness was wired correctly; nobody ran it. Replays as a discipline
beat replays as a memory item.
The CI pipeline:
detect-changes (paths filter)
└─ harness-replays (always)
   ├─ no-op pass when paths-filter says no relevant change
   └─ otherwise: checkout + sibling plugin checkout +
      /etc/hosts entry + run-all-replays.sh +
      compose-logs-on-failure + force-teardown
Compose logs from tenant/cp-stub/cf-proxy/postgres are dumped on
failure so a CI red is debuggable without re-reproducing locally.
The trap in run-all-replays.sh handles teardown; the always-run
down.sh step is a belt-and-suspenders against trap-bypass kills.
Follow-ups (not in this PR):
- Add this check to staging branch protection once it's been green
for a few PRs (the new-workflow-instability hedge that other gates
followed).
- Eventually wire the buildx GHA cache to speed up tenant image
builds — currently every PR rebuilds the full Dockerfile.tenant
(Go + Next.js + template clones) from scratch. Acceptable for now;
optimize when the timeout-minutes:30 ceiling becomes painful.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /buildinfo route added in #2398 to verify each tenant runs the
published SHA was 404'd by TenantGuard on every production tenant —
the allowlist had /health, /metrics, /registry/register,
/registry/heartbeat, but not /buildinfo. The redeploy workflows
curl /buildinfo from a CI runner with no X-Molecule-Org-Id header;
TenantGuard 404'd those requests, gin's NoRoute proxied them to
canvas, canvas returned its HTML 404 page, jq read an empty git_sha,
and the verifier silently soft-warned every tenant as "unreachable" —
which the workflow doesn't fail on.
Confirmed externally:
curl https://hongmingwang.moleculesai.app/buildinfo
→ HTTP 404 + Content-Type: text/html (Next.js "404: This page
could not be found.") even though /health on the same host
returns {"status":"ok"} from gin.
The buildinfo package's own doc already declares /buildinfo public
by design ("Public is intentional: it's a build identifier, not
operational state. The same string is already published as
org.opencontainers.image.revision on the container image, so no new
info is exposed.") — the allowlist just missed it.
Pin the alignment in tenant_guard_test.go:
TestTenantGuard_AllowlistBypassesCheck now asserts /buildinfo
returns 200 without an org header alongside /health and /metrics,
so a future allowlist edit can't silently regress the verifier
again.
Closes the silent-success failure mode: stale tenants will now
show up as STALE (hard-fail) rather than UNREACHABLE (soft-warn).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on
`prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker
provisioner is never initialized — only `cpProv`. So the goroutine
never started, and `sweepStaleRemoteWorkspaces` (which transitions
runtime='external' workspaces from 'online' to 'awaiting_agent' when
their last_heartbeat_at goes stale) never ran.
Net effect on production: every external-runtime workspace on SaaS
that lost its agent stayed 'online' indefinitely instead of falling
back to 'awaiting_agent' (re-registrable). The drift gate (#2388)
caught the migration side and #2382 fixed the SQL writes, but this
orchestration-side gate slipped through both because there was no
SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent
transition.
Caught by #2392 (live staging external-runtime regression E2E)
failing at step 6 — 180s with no heartbeat, expected
status=awaiting_agent, got online.
Fix: drop the `if prov != nil` gate. `StartHealthSweep` already
handles nil checker correctly (healthsweep.go:50-71): the Docker
sweep is gated inside the loop, the remote sweep always runs. Test
coverage already exists at TestStartHealthSweep_NilCheckerRunsRemoteSweep.
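The fixed loop shape, as a Python stand-in for StartHealthSweep (names
are illustrative; the real code is Go in healthsweep.go):

```python
def run_health_sweep_once(docker_checker, sweep_docker, sweep_remote):
    """One sweep tick. The Docker sweep is gated on having a local
    checker; the remote sweep runs unconditionally, which is what keeps
    external-runtime workspaces transitioning to awaiting_agent on SaaS
    tenants where docker_checker is None."""
    ran = []
    if docker_checker is not None:
        sweep_docker(docker_checker)
        ran.append("docker")
    sweep_remote()  # never gated — the pre-fix bug gated this at startup
    ran.append("remote")
    return ran
```

Gating inside the loop rather than around goroutine startup is the
whole fix: nil-checker callers still get the remote sweep.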
After this lands and tenants redeploy, #2392 step 6 passes and the
regression coverage closes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspace-server has GET /buildinfo (PR #2398) — `curl https://<slug>.
moleculesai.app/buildinfo` returns the live git SHA. Canvas had no
parallel: debugging "is this the deployed code?" required reading
Vercel's UI or response headers (deployment ID, not git SHA).
Add canvas /api/buildinfo returning {git_sha, git_ref, vercel_env}
sourced from VERCEL_GIT_COMMIT_SHA / _REF / VERCEL_ENV — Vercel injects
these at build time from the deploying commit. Outside Vercel (local
`next dev`, harness) all three are unset and the endpoint returns
`git_sha: "dev"`, the same sentinel workspace-server uses pre-ldflags-
injection.
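The fallback logic, sketched in Python for brevity (the real route is a
Next.js handler; the commit only specifies the "dev" fallback for
git_sha, so applying it to all three fields here is an assumption):

```python
import os

def buildinfo(env=os.environ):
    """Vercel injects these at build time from the deploying commit;
    outside Vercel (local `next dev`, harness) all three are unset and
    git_sha falls back to the same "dev" sentinel workspace-server uses
    before ldflags injection."""
    return {
        "git_sha": env.get("VERCEL_GIT_COMMIT_SHA", "dev"),
        "git_ref": env.get("VERCEL_GIT_COMMIT_REF", "dev"),
        "vercel_env": env.get("VERCEL_ENV", "dev"),
    }
```

Passing `env` explicitly is what makes the dev-fallback and
pass-through cases unit-testable without a Vercel build.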
Now both surfaces speak the same vocabulary:
curl https://<slug>.moleculesai.app/buildinfo
curl https://canvas.moleculesai.app/api/buildinfo
Three tests cover dev-fallback, Vercel-injected SHA pass-through, and JSON
content type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boots the harness, runs every script under replays/, tracks pass/fail,
and tears down on exit. Closes the README's TODO for the harness runner
that the per-replay-registration comment referenced.
Usage:
./run-all-replays.sh # boot, run, teardown
KEEP_UP=1 ./run-all-replays.sh # leave harness running on exit
REBUILD=1 ./run-all-replays.sh # rebuild images before booting
Trap-on-EXIT teardown ensures partial-failure runs don't leak Docker
resources. Returns non-zero if any replay failed; CI can adopt this as
a single command without per-replay registration. Phase 2 picks this up
to wire harness-based E2E as a required check.
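The runner's control flow, sketched in Python (the real runner is bash
with a `trap ... EXIT`; names here are illustrative):

```python
def run_all_replays(replays, teardown, keep_up=False):
    """Run every replay, track failures, and guarantee teardown on any
    exit path unless KEEP_UP-style debugging is requested. Returns the
    names of failed replays; non-empty maps to a non-zero exit code."""
    failed = []
    try:
        for name, replay in replays:
            try:
                replay()
            except Exception:
                failed.append(name)  # keep going; report all failures
    finally:
        if not keep_up:
            teardown()  # the bash version does this via the EXIT trap
    return failed
```

The finally-block stands in for the trap: even an unexpected error
mid-run still tears the harness down, which is exactly the property
KEEP_UP=1 deliberately suspends for CI log capture.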
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
publish-runtime.yml had a broad smoke (AgentCard call-shape, well-known
mount alignment, new_text_message) inline as a heredoc. runtime-prbuild-
compat.yml had a narrow inline smoke (just `from main import main_sync`).
Result: a PR could introduce SDK shape regressions that pass at PR time
and only fail at publish time, post-merge.
Extract the broad smoke into scripts/wheel_smoke.py and invoke it from
both workflows. PR-time gate now matches publish-time gate — same script,
same assertions. Eliminates the drift hazard of two heredocs that have
to be kept in lockstep manually.
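The shape of the shared smoke, with a stand-in class (the real script
imports the built wheel and exercises the actual a2a types; the kwarg
names below are illustrative, not the real AgentCard schema):

```python
def smoke_agent_card(agent_card_cls):
    """Construct an AgentCard-like object with the kwargs the runtime
    relies on. An upstream field rename then surfaces here — at PR time
    and publish time alike, since both gates run the same script —
    instead of only after merge."""
    try:
        agent_card_cls(name="smoke", supported_interfaces=["jsonrpc"])
    except (TypeError, ValueError) as exc:
        raise SystemExit(f"wheel smoke failed: {exc}")
```

Because the assertion lives in one script, a new call-shape check added
for publish time is automatically also a PR-time check.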
Verified locally:
* Built wheel from workspace/ source, installed in venv, ran smoke → pass
* Simulated AgentCard kwarg-rename regression → smoke catches it as
`ValueError: Protocol message AgentCard has no "supported_interfaces"
field` (the exact failure mode of #2179 / supported_protocols incident)
Path filter for runtime-prbuild-compat extended to include
scripts/wheel_smoke.py so smoke-only edits get PR-validated. publish-
runtime path filter intentionally NOT extended — smoke-only edits should
not auto-trigger a PyPI version bump.
Subset of #131 (the broader "invoke main() against stub config" goal
remains pending — main() needs a config dir + stub platform server).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>