molecule-core

Author	SHA1	Message	Date
Hongming Wang	08d082d466	Merge pull request #2447 from Molecule-AI/chore/wheel-smoke-fixups chore(smoke-mode): harden module-load + drop dead except clause	2026-05-01 04:33:21 +00:00
Hongming Wang	661eec2659	chore(smoke-mode): harden module-load + drop dead except clause Two follow-ups from the #2275 Phase 1 self-review: 1. `_SMOKE_TIMEOUT_SECS = float(os.environ.get(...))` was evaluated at module load. main.py imports smoke_mode unconditionally — before the is_smoke_mode() check — so a malformed MOLECULE_SMOKE_TIMEOUT_SECS env value would SystemExit every workspace boot, not just smoke runs. Wrapped in try/except with a 5.0 fallback. Probability of a typo'd env var hitting production is low (it's a CI-only knob), but the footgun is removed entirely. Regression test reloads the module under a malformed env value. 2. `_real_a2a_sdk_available()` caught (ImportError, AttributeError). `from X import Y` raises ImportError when Y is missing on X — never AttributeError. Dropped the unreachable branch. No behavior change for the happy path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:31:08 -07:00
Hongming Wang	1a18e9398a	Merge pull request #2445 from Molecule-AI/feat/terminal-diagnose-endpoint feat(terminal): add /workspaces/:id/terminal/diagnose endpoint	2026-05-01 04:29:02 +00:00
Hongming Wang	e6161b15a1	Merge pull request #2446 from Molecule-AI/feat/wheel-smoke-mode-execute-stub feat(wheel-smoke): exercise execute() to catch lazy imports (#2275)	2026-05-01 04:23:43 +00:00
Hongming Wang	aacaba024c	feat(wheel-smoke): exercise executor.execute() to catch lazy imports (#2275) The existing wheel-publish smoke (`wheel_smoke.py`) only IMPORTS `molecule_runtime.main` at module scope. Lazy imports buried inside `async def execute(...)` bodies (e.g. `from a2a.types import FilePart`) NEVER evaluate at static-import time — they crash at first message delivery in production. The 2026-04-2x v0→v1 a2a-sdk migration shipped 5 such regressions in templates that all looked fine at module-load smoke. This change adds `smoke_mode.py` plus a `MOLECULE_SMOKE_MODE=1` short-circuit in `main.py`: after `adapter.create_executor(...)`, the boot path invokes `executor.execute(stub_ctx, stub_queue)` once with a 5s timeout (`MOLECULE_SMOKE_TIMEOUT_SECS`). Healthy import tree → execution proceeds far enough to hit a network boundary and times out (exit 0). Broken lazy import → `ImportError` / `ModuleNotFoundError` from inside the executor body (exit 1). Other downstream errors (auth, validation) pass — those are caught by adapter-level tests, not this gate. Stub `(RequestContext, EventQueue)` is built from the real a2a-sdk so SendMessageRequest/RequestContext constructor changes also surface as import-tree failures (the regression class also includes "SDK refactored mid-publish"). The stub-build itself is wrapped — if it raises, that's a smoke fail too. Phase 2 (separate PR, molecule-ci) wires this into publish-template-image.yml so the publish gate runs the boot smoke against every template image before pushing the tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:21:18 -07:00
Hongming Wang	b9311134cf	fix(terminal-diagnose): KI-005 hierarchy check + race-free stderr capture Two fixes from /code-review-and-quality on PR #2445: 1. KI-005 hierarchy check parity with /terminal HandleConnect runs the KI-005 cross-workspace guard before dispatch (terminal.go:85-106): when X-Workspace-ID is set and != :id, validate the bearer's workspace binding then call canCommunicateCheck. Without this, an org-level token holder in tenant Foo can probe any workspace's diagnostic state by guessing the UUID — same enumeration vector KI-005 closed for /terminal in #1609. Per-workspace bearer tokens are URL-bound by WorkspaceAuth, so the gap is org tokens within the same tenant. Fix: copy the same gate into HandleDiagnose, before the instance_id SELECT. Test: TestHandleDiagnose_KI005_RejectsCrossWorkspace stubs canCommunicateCheck=false and confirms 403 fires before the DB lookup (sqlmock's ExpectationsWereMet pins that we never reached the SELECT COALESCE). Mirrors the existing TestTerminalConnect_KI005_RejectsUnauthorizedCrossWorkspace. 2. Race-free tunnel stderr capture (syncBuf) strings.Builder isn't goroutine-safe. os/exec spawns a background goroutine that copies the subprocess's stderr fd to cmd.Stderr's Write, so reading the buffer's String() from the request goroutine on wait-for-port timeout while the tunnel may still be writing is a data race that `go test -race` flags. Worst-case impact in production is a garbled Detail string (not a crash), but the fix is small. Fix: wrap bytes.Buffer in a sync.Mutex (syncBuf type). Same io.Writer interface, no API changes elsewhere. 3. Nit cleanup - read-pubkey failure now reports as its own step name instead of a duplicated "ssh-keygen" entry — disambiguates two different failure modes that previously shared a name. - Replaced numToString hand-rolled int-to-string with strconv.Itoa in the test (no import savings reason existed). Suite: 4 diagnose tests pass with -race; full handlers suite passes in 3.95s. 
go vet clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:19:18 -07:00
Hongming Wang	d012a803e4	feat(terminal): add diagnose endpoint for SSH probe stages GET /workspaces/:id/terminal/diagnose runs the same per-stage pipeline as /terminal (ssh-keygen → EIC send-key → tunnel → ssh) but non-interactively and returns JSON. Each stage reports {name, ok, duration_ms, error, detail}, plus a top-level first_failure naming the broken stage. Why: when the canvas terminal silently disconnects ("Session ended" with no error frame — the user-reported failure mode on hongmingwang's hermes workspace), there is no remote-readable signal of WHICH stage failed. The ssh client's stderr lives only in the workspace-server's stdout on the tenant CP EC2 — invisible without shell access. /terminal can't expose stderr cleanly because it has already upgraded to WebSocket binary frames by the time ssh runs. /terminal/diagnose stays pure HTTP/JSON, so the same auth (WorkspaceAuth + ADMIN_TOKEN fallback) gives operators a one-call probe that splits "IAM broke" (send-ssh-public-key fails) from "tunnel/SG broke" (wait-for-port fails) from "sshd auth broke" (ssh-probe gets Permission denied) from "shell broke" (probe exits non-zero with stderr). Stages mirrored from handleRemoteConnect in terminal.go: 1. ssh-keygen ephemeral session keypair 2. send-ssh-public-key AWS EIC API push, IAM-gated 3. pick-free-port local port for the tunnel 4. open-tunnel aws ec2-instance-connect open-tunnel start 5. wait-for-port the tunnel actually listens (folds tunnel stderr into Detail when it doesn't) 6. ssh-probe non-interactive `ssh ... 'echo MARKER'` that confirms auth + bash + the marker round-trip (CombinedOutput captures stderr verbatim — this is the whole reason the endpoint exists) Local Docker workspaces (no instance_id) get a smaller probe: container-found + container-running. Same response shape so callers don't need to branch. 
Tests stub sendSSHPublicKey / openTunnelCmd / sshProbeCmd via the existing package-level vars (same pattern as TestSSHCommandCmd_*) so the test suite stays hermetic — no AWS, no network. The three new tests pin: (a) routing to remote on instance_id present, (b) routing to local on empty instance_id, (c) the operationally critical case — full success through wait-for-port then a probe failure surfaces ssh stderr in the ssh-probe step's Error/Detail with first_failure="ssh-probe". Auth: rides on existing WorkspaceAuth middleware. Operators with the tenant ADMIN_TOKEN (fetched via /cp/admin/orgs/:slug/admin-token) can probe any workspace without per-workspace token; same admin path as the canvas dashboard reads workspace activity. Response always returns HTTP 200 (success or step failure are both in the JSON body) so callers don't need to branch on status code — the endpoint either reports a first_failure or doesn't. Resolves task #200, supports task #193 (workspace EC2 sshd unresponsive — without this endpoint we couldn't pin the failure stage from outside the tenant CP EC2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:10:20 -07:00
Hongming Wang	f46c471f9b	Merge pull request #2443 from Molecule-AI/docs/correct-test-ops-scripts-header docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse	2026-05-01 03:55:28 +00:00
Hongming Wang	0b2ea0a50f	Merge pull request #2441 from Molecule-AI/feat/explicit-provider-field feat(config): add explicit `provider:` field alongside `model:` (PR-1 of stack)	2026-05-01 03:54:27 +00:00
Hongming Wang	e58e446444	docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse The previous header said `unittest discover from the scripts/ root walks recursively`, contradicting the workflow body which runs two passes precisely because discover does NOT recurse without __init__.py. Fixed self-review feedback on PR #2440. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:52:58 -07:00
Hongming Wang	f2545fcb57	Merge pull request #2440 from Molecule-AI/chore/wheel-rewriter-tests-and-noqa-cleanup chore: rewriter unit tests + drop misleading noqa on import inbox	2026-05-01 03:48:33 +00:00
Hongming Wang	067ad83ce5	feat(config): add explicit `provider:` field alongside `model:` Adds a top-level `provider` slug to WorkspaceConfig and RuntimeConfig so adapters can route to a specific gateway without re-implementing slug-prefix parsing across hermes / claude-code / codex. Resolution chain in load_config (mirrors how `model` resolves): 1. ``LLM_PROVIDER`` env var — what canvas Save+Restart sets so the operator's Provider dropdown choice survives a CP-driven restart (the regenerated /configs/config.yaml drops most user fields). 2. Explicit YAML ``provider:`` — operator pinned it in the file. 3. Derive from the model slug prefix for backward compat: ``anthropic:claude-opus-4-7`` → ``anthropic`` ``minimax/abab7-chat-preview`` → ``minimax`` bare model names → ``""`` (let the adapter decide). `runtime_config.provider` falls back to the top-level resolved provider, the same shape PR #2438 added for `runtime_config.model`. Why a separate field at all (we already parse the slug): - Custom model aliases without a recognizable prefix need an explicit signal — the canvas Provider dropdown writes it. - Adapters were each rolling their own slug-parse (hermes's derive-provider.sh, claude-code's adapter-default branch, etc.); one resolution point in load_config kills that drift class. - Canvas needs a stable storage field that doesn't get clobbered every time the user picks a new model. Backward-compatible: when `provider:` is absent, slug derivation keeps every existing config.yaml working without a migration. PR-1 of a multi-PR stack (Option B from RFC discussion). Subsequent PRs plumb the field through workspace-server env, CP user-data, adapters (hermes prefers explicit over derive-provider.sh), and canvas Provider dropdown UI. 
Tests cover all four resolution paths + runtime_config inheritance: - test_provider_default_empty_when_bare_model - test_provider_derived_from_colon_slug - test_provider_derived_from_slash_slug - test_provider_yaml_explicit_wins_over_derived - test_provider_env_override_beats_yaml_and_derived - test_runtime_config_provider_yaml_wins_over_top_level - test_provider_default_from_default_model Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:47:09 -07:00
Hongming Wang	6e92fe0a08	chore: rewriter unit tests + drop misleading noqa on `import inbox` Two small follow-ups to the PR #2433 → #2436 → #2439 incident chain. 1) `import inbox # noqa: F401` in workspace/a2a_mcp_server.py was misleading — `inbox` IS used (at the bridge wiring inside main()). F401 means "imported but unused", which would mask a real future F401 if the usage is removed. Drop the noqa, keep the explanatory block comment about the rewriter's `import X` → `import mr.X as X` expansion (and the `import X as Y` → `import mr.X as X as Y` trap the comment exists to prevent re-introducing). 2) scripts/test_build_runtime_package.py — 17 unit tests covering `rewrite_imports()` and `build_import_rewriter()` in scripts/build_runtime_package.py. Until now the function had zero coverage despite the entire wheel build depending on it. Tests pin: bare-import aliasing, dotted-import preservation, indented imports, from-imports (simple + dotted + multi-symbol + block), the `import X as Y` rejection added in PR #2436 (with comment-stripping + indented + comma-not-alias edge cases), allowlist anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction of the PR #2433 failing pattern + the #2436 fix pattern. 3) Wire scripts/test_*.py into CI by adding a second discover pass to test-ops-scripts.yml. Top-level scripts/ tests live alongside their target file (parallels the scripts/ops/ test layout); the existing scripts/ops/ pass keeps running because scripts/ops/ has no __init__.py so a single discover from scripts/ root doesn't recurse. Two passes is simpler than retrofitting namespace packages. Path filter widened from `scripts/ops/` to `scripts/*` so PRs touching the build script trigger the new tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:45:32 -07:00
Hongming Wang	5b70204b01	Merge pull request #2439 from Molecule-AI/ci/wheel-smoke-always-run-for-required-gate ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility	2026-05-01 03:42:51 +00:00
Hongming Wang	3c16c27415	ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility The `PR-built wheel + import smoke` gate caught the broken wheel from PR #2433 (`import inbox as _inbox_module` collision) but couldn't block the merge because it isn't a required check on staging. Promoting it to required is the right move per the runtime publish pipeline gates note (2026-04-27 RuntimeCapabilities ImportError outage), but the existing `paths: [workspace/**, scripts/...]` filter blocks PRs that don't touch those paths from ever generating the check run — branch protection would deadlock waiting on a check that never fires. Refactor (same shape as e2e-api.yml's e2e-api job): - Drop top-level `paths:` filter — workflow runs on every push/PR/merge_group event. - Add `detect-changes` job using dorny/paths-filter to compute the `wheel=true|false` output. - Collapse to ONE always-running `local-build-install` job named `PR-built wheel + import smoke`. Per-step `if:` gates on the detect output. PRs untouched by wheel-relevant paths emit a no-op SUCCESS step ("paths filter excluded this commit") so the check passes without rebuilding the wheel. - merge_group + workflow_dispatch unconditionally `wheel=true` so the queue always validates the to-be-merged state, regardless of which PR composed it. Why one-job-with-step-gates instead of two-jobs-sharing-name: SKIPPED check runs block branch protection even when SUCCESS siblings exist (verified PR #2264 incident, 2026-04-29). Single always-run job emits exactly one SUCCESS check run regardless of paths filter. Follow-up: open a separate PR adding `PR-built wheel + import smoke` to the staging branch protection's required_status_checks.contexts once this lands. Doing both in one PR risks the protection update firing before the workflow refactor merges, deadlocking unrelated PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:40:05 -07:00
Hongming Wang	5ad41f63ce	Merge pull request #2438 from Molecule-AI/fix/runtime-config-model-fallback-clean fix(config): runtime_config.model falls back to top-level model	2026-05-01 03:31:20 +00:00
Hongming Wang	0070d0bd59	fix(config): runtime_config.model falls back to top-level model External feedback (2026-04-30): "Provisioner doesn't read model from config.yaml and doesn't set MODEL env var. Without MODEL, the adapter defaults to sonnet and bypasses the mimo routing." Confirmed accurate for SaaS workspaces. Trace: claude-code-default/adapter.py reads `runtime_config.model or "sonnet"` (and hermes reads HERMES_DEFAULT_MODEL via install.sh, which IS plumbed). For claude-code there's nothing — workspace/config.py loaded `runtime_config.model` only from YAML, ignoring MODEL_PROVIDER env. The CP user-data script regenerates /configs/config.yaml at every boot with only `name`, `runtime`, `a2a` keys (intentionally minimal so it doesn't carry stale state) — so any user-set runtime_config.model is wiped on every restart, and the adapter falls back to "sonnet" even when the user picked Opus in the canvas Config tab. Fix: when YAML omits runtime_config.model, fall back to the top-level resolved `model`, which already honors MODEL_PROVIDER env override. One-line in workspace/config.py. Now MODEL_PROVIDER → top-level model → runtime_config.model → adapter sees the user's selection. Sticky across CP-driven restarts; the canvas Save+Restart loop works as intended for every runtime, not just hermes. Tests: test_runtime_config_model_falls_back_to_top_level — top-level set, runtime_config empty → fallback wins test_runtime_config_model_yaml_wins_over_top_level — YAML explicit → fallback skipped (precedence) test_runtime_config_model_picks_up_env_via_top_level — full canvas Save+Restart simulation: env → top-level → runtime_config.model Negative-control verified: removing the `or model` flips both fallback tests red with the expected "" vs expected-model mismatch; restoring flips them green. The yaml-wins test passes either way (correctly, because precedence is preserved). 
Replaces closed PR #2435 — that PR's commit was on a contaminated branch and accidentally captured unrelated WIP changes (build script + a2a_mcp_server refactor) instead of this fix. Self-review caught it and closed the PR. This branch is clean off main + diff verified before push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:28:50 -07:00
Hongming Wang	4e39609ae0	Merge pull request #2434 from Molecule-AI/fix/terminal-ssh-connect-timeout fix(terminal): cap ssh handshake at 10s so hung sshd surfaces fast	2026-05-01 03:26:41 +00:00
Hongming Wang	bbc994f6c8	Merge pull request #2436 from Molecule-AI/fix/wheel-import-as-collision-fix-forward fix(wheel): import inbox without alias to dodge rewriter collision	2026-05-01 03:24:18 +00:00
Hongming Wang	cda93e3c52	test(terminal): update exact-argv snapshot to include ConnectTimeout The pre-existing TestSSHCommandCmd_BuildsArgv asserts the literal argv slice. Adding `-o ConnectTimeout=10` shifted the slice — this commit tracks the snapshot to match. The new behavior-based TestSSHCommandCmd_ConnectTimeoutPresent (added in the prior commit) keeps the invariant pinned without depending on argv ordering, so future tweaks land in only one place even if more options are added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:23:48 -07:00
Hongming Wang	0acdf3bb56	fix(wheel): import inbox without alias to dodge rewriter collision PR #2433 (notifications/claude/channel) shipped 'import inbox as _inbox_module' inside a2a_mcp_server.py:main(). The build script's import rewriter expands plain 'import inbox' to 'import molecule_runtime.inbox as inbox', so the original source became 'import molecule_runtime.inbox as inbox as _inbox_module', which is invalid Python. Caught at the publish-runtime + PR-built-wheel-smoke gate (the SyntaxError trace is in run 25200422679). The wheel didn't ship to PyPI because publish-runtime's smoke-import step refused to install it, but staging is currently sitting on a broken-build commit until this fix-forward lands. Changes: - a2a_mcp_server.py: lift `import inbox` to top of file (rewriter produces clean `import molecule_runtime.inbox as inbox`), call inbox.set_notification_callback directly in main() - build_runtime_package.py: rewrite_imports() now raises ValueError when it sees 'import X as Y' for any X in the workspace allowlist, instead of silently producing a syntax-error wheel. Operator gets a clear actionable error at build time pointing at the offending line + suggested rewrites ('from X import …' or plain 'import X'). The build-time gate (this PR's rewriter check) catches the regression class earlier than the smoke-time gate (PR #2433's failure). Adding 'PR-built wheel + import smoke' to staging branch protection's required checks is filed separately so this class doesn't merge again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:21:54 -07:00
Hongming Wang	f30b3d4476	fix(terminal): cap ssh handshake at 10s so hung sshd surfaces fast When the workspace EC2's sshd is unresponsive (mid-restart, SG drop, AMI without ec2-instance-connect), the canvas's xterm shows the user's typed bytes echoed back by the workspace-server's local PTY (cooked + echo mode before ssh sets it raw post-handshake) and then closes silently when Cloudflare's idle WebSocket timer fires (~100s) — with no "Connection refused" or "Permission denied" output ever reaching the user. This is what hongmingwang's hermes terminal looked like 2026-04-30 right after the heartbeat-fix redeploy: status="online" but the shell appeared dead. Caught reproducibly by holding a fresh /workspaces/<id>/terminal WebSocket open for 60s — server sent zero frames except the local-PTY echo of one keystroke typed at t=8s. ssh was hung at handshake; bash never saw the byte. Fix: add `-o ConnectTimeout=10` to ssh args. Now the failure surfaces as a real ssh error message in the terminal within 10s, instead of masquerading as a silently dead shell over the next ~100s. Doesn't diagnose why sshd isn't responding (separate investigation), but it does mean the user gets actionable feedback within seconds. Behavior-based regression test asserts `-o ConnectTimeout=N` is in the ssh argv — pins presence, not the literal value, so operators can tune without breaking the gate. Verified to FAIL on pre-fix code (matched the literal arg pair) and PASS on fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:16:41 -07:00
Hongming Wang	c901d52ee3	Merge pull request #2433 from Molecule-AI/feat/mcp-channel-notifications feat(mcp): notifications/claude/channel for push-feel inbox UX	2026-05-01 03:12:31 +00:00
Hongming Wang	0a3ec53f34	feat(mcp): notifications/claude/channel for push-feel inbox UX Adds a notification seam to the universal molecule-mcp wheel so push-notification-capable MCP hosts (Claude Code today; any compliant client tomorrow) get inbound A2A messages as conversation interrupts instead of having to poll wait_for_message / inbox_peek. Wire-up: - inbox.py: module-level _NOTIFICATION_CALLBACK + set_notification_callback() Fires from InboxState.record() AFTER lock release, with same dict shape inbox_peek returns. Best-effort — a raising callback never prevents the message from landing in the queue. - a2a_mcp_server.py: _build_channel_notification() pure helper + bridge wiring in main() that schedules notifications via asyncio.run_coroutine_threadsafe (poller is a daemon thread, MCP loop is asyncio). - Method name 'notifications/claude/channel' matches the contract documented in molecule-mcp-claude-channel/server.ts:509. - wheel_smoke.py: pin set_notification_callback as a published name, same regression class as the 0.1.16 main_sync incident. Pollers (wait_for_message / inbox_peek) keep working unchanged for runtimes without notification support. Tests: 6 new in test_inbox.py (callback fires once on record, dedupe short-circuits before fire, raising cb doesn't break inbox, set/clear semantics), 5 new in test_a2a_mcp_server.py (method name pin, content mapping, meta routing, no-id JSON-RPC notification spec, missing-field tolerance). All 59 combined tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:10:01 -07:00
github-actions[bot]	ed29ad0d2a	Merge pull request #2432 from Molecule-AI/staging staging → main: auto-promote `0c51df9`	2026-04-30 19:52:46 -07:00
Hongming Wang	0c51df989b	Merge pull request #2431 from Molecule-AI/fix/restart-async-stop Move /restart Stop into the async goroutine	2026-05-01 02:38:29 +00:00
github-actions[bot]	05de9f5777	Merge pull request #2430 from Molecule-AI/staging staging → main: auto-promote `03d5f80`	2026-04-30 19:37:38 -07:00
Hongming Wang	f6ddcf66ab	Move /restart Stop into the async goroutine Pre-fix Restart called provisioner.Stop / cpProv.Stop synchronously before returning the HTTP response. CPProvisioner.Stop is DELETE /cp/workspaces/:id → CP → AWS EC2 terminate, which can exceed the canvas's 15s HTTP timeout, especially right after a platform-wide redeploy when every tenant queues a CP request at once. The user sees a misleading "signal timed out" red banner on Save & Restart even though the async re-provision goroutine continues and the workspace ends up online. Caught 2026-04-30 on hongmingwang hermes workspace 32993ee7-…cb9d75d112a5 right after the heartbeat-fix platform redeploy at 02:11Z. The workspace came back online correctly; only the canvas response timed out. Fix moves Stop into the same goroutine as provisionWorkspaceCP / provisionWorkspaceOpts. The handler now responds in <500ms (DB lookup + status UPDATE only). Stop and provision keep their existing ordering inside the goroutine. Uses context.Background() to detach from the request lifecycle so an aborted client connection doesn't cancel the in-flight Stop/provision pair. Pinned by a behavior-based AST gate (workspace_restart_async_test.go): the test parses workspace_restart.go and walks the Restart function body, flagging any <recv>.{provisioner,cpProv}.Stop call that isn't nested in a *ast.FuncLit. Same family as callsProvisionStart in workspace_provision_shared_test.go. Verified the gate fails on the pre-fix shape (flags lines 151 and 153 — the original sync Stop calls). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 19:35:29 -07:00
hongmingwang-moleculeai	03d5f80cb6	Merge pull request #2428 from Molecule-AI/feat/agent-card-env-vars feat(mcp_cli): agent_card from env vars (capability discovery)	2026-05-01 02:23:37 +00:00
github-actions[bot]	ace3c85708	Merge pull request #2427 from Molecule-AI/staging staging → main: auto-promote `2a56697`	2026-04-30 19:08:15 -07:00
Hongming Wang	c4bb803329	feat(mcp_cli): agent_card from env vars (capability discovery) External molecule-mcp runtimes register with hardcoded agent_card.name = molecule-mcp-{id[:8]} and skills=[]. That made every external workspace look identical on the canvas and gave peer agents calling list_peers no signal beyond name — they had to guess capabilities. Three new env vars let the operator declare identity + capabilities without code changes: * MOLECULE_AGENT_NAME — display name on canvas (default unchanged) * MOLECULE_AGENT_DESCRIPTION — one-line description (default empty) * MOLECULE_AGENT_SKILLS — comma-separated skill names Comma-separated skills get expanded to {"name": "..."} objects — the minimum shape that satisfies both shared_runtime.summarize_peers (reads s["name"]) AND canvas SkillsTab.tsx (id falls back to name). Strict-superset behaviour: when no env vars are set, agent_card matches the previous hardcoded value exactly. No regression for operators who haven't migrated. Why this matters end-to-end: * Canvas Skills tab now shows each declared skill as a chip * Peer agents calling list_peers see {name, skills} per peer and can route delegations to the right specialist * Same applies to the canvas Details tab + workspace card hover Tests cover: defaults match prior behaviour; name override; CSV → skill objects; whitespace stripping + empty entries dropped; description omitted when unset (keeps wire payload minimal); whitespace-only name falls back to default; end-to-end through _platform_register's payload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:57:39 -07:00
Hongming Wang	6cca4c5708	Merge pull request #2426 from Molecule-AI/fix/canvas-model-save-runtime-config fix(canvas): persist model on Save+Restart for runtime-bearing workspaces	2026-05-01 01:42:18 +00:00
Hongming Wang	210f6e066a	Merge pull request #2424 from Molecule-AI/fix/in-container-heartbeat-persists-inbound-secret fix(workspace): in-container heartbeat persists platform_inbound_secret	2026-05-01 01:36:52 +00:00
Hongming Wang	7706db5a93	fix(canvas): persist model on Save+Restart for runtime-bearing workspaces The Model dropdown's onChange writes to config.runtime_config.model whenever a runtime is set (hermes, claude-code, etc.), and only falls back to top-level config.model when no runtime is selected. But handleSave used to diff the new value against top-level nextSource.model only — so for any runtime-bearing workspace, the PUT /workspaces/:id/model never fired and MODEL_PROVIDER never landed in workspace_secrets. Symptom (2026-04-30, hongmingwang Hermes Agent 32993ee7-840e-4c02-8ca8-cb9d75d112a5): - User picks minimax/MiniMax-M2.7-highspeed from the dropdown - Hits Save & Restart - Save reports success; restart fires - The new EC2 boots with HERMES_DEFAULT_MODEL empty - install.sh defaults to nousresearch/hermes-4-70b - hermes-agent errors "No LLM provider configured" on every chat turn because no NOUS_API_KEY / OPENROUTER_API_KEY is set - Reload Config tab → model field reverts to whatever GET /workspaces/:id/model returns (i.e. empty / template default) handleSave now reads the effective model from runtime_config.model first and falls back to top-level model for legacy no-runtime workspaces. Same change for the old-value diff so a no-op Save still skips the PUT. Tests pin both branches: PUTs /model when the dropdown changed runtime_config.model on a hermes workspace; does NOT PUT when the value is unchanged from what GET /model returned.	2026-04-30 18:31:43 -07:00
Hongming Wang	2a5669788c	Merge pull request #2425 from Molecule-AI/fix/heartbeat-detect-401 fix(mcp_cli): escalate heartbeat 401s with re-onboard guidance	2026-05-01 01:28:57 +00:00
Hongming Wang	d887ce8e96	fix(mcp_cli): escalate consecutive heartbeat 401s with re-onboard guidance The universal molecule-mcp wheel runs in a daemon thread, posting /registry/heartbeat every 20s. When the workspace gets deleted server-side (DELETE /workspaces/:id), the platform revokes all tokens for that workspace. Previous behaviour: heartbeat would 401 forever, log at WARNING per tick, no actionable signal anywhere. Failure mode hit on hongmingwang tenant 2026-04-30: workspace a1771dba was deleted at some prior time, the channel-bridge .env still pointed at it, MCP tools 401-ed silently with the operator having no idea why. The register-time path at mcp_cli.py:104-111 already does loud + actionable for 401 (sys.exit(3) with regenerate-from-canvas-Tokens text) — extend the same pattern to the heartbeat. Behaviour: * count < 3: WARNING per tick (could be transient blip) * count == 3: ERROR with re-onboard instructions, names the dead workspace_id, points at the canvas Tokens tab * count > 3 and every 20 ticks (~7 min): re-log ERROR so a session that started after the first ERROR still catches it 5xx and other non-auth HTTP errors do NOT increment the auth-failure counter — that would mislead the operator (e.g. a server blip would trigger "token revoked" when the token is fine). Tests cover: single 401 stays at WARNING; 3 consecutive 401s escalate to ERROR with the right keywords; 403 treated identically; recovery via 200 resets the counter; 5xx never triggers the auth path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 18:26:35 -07:00
Hongming Wang	98845c8f42	fix(workspace): in-container heartbeat persists platform_inbound_secret Follow-up to PR #2421. The standalone wrapper (mcp_cli.py) got heartbeat-time secret persistence in #2421, but the in-container heartbeat (workspace/heartbeat.py) was missed — and that's the path every workspace EC2 actually runs. Result: hongmingwang Claude Code agent stayed 401-forever on chat upload after this morning's deploy because the workspace's runtime never picked up the lazy-healed secret. The in-container _loop now captures the heartbeat response and calls the same _persist_inbound_secret_from_heartbeat helper used by the standalone path, on both the first POST and the 401-retry POST. Defensive on every error (non-JSON, non-dict, empty, save failure) — liveness contract trumps secret persistence. Tests pin: happy path, absent secret, empty string, non-JSON body, non-dict body, save_inbound_secret OSError, end-to-end loop.	2026-04-30 18:18:10 -07:00
github-actions[bot]	ce7698efc3	Merge pull request #2423 from Molecule-AI/staging staging → main: auto-promote `c733454`	2026-04-30 18:07:52 -07:00
github-actions[bot]	695d286dba	Merge pull request #2422 from Molecule-AI/staging staging → main: auto-promote `f035482`	2026-04-30 17:55:53 -07:00
Hongming Wang	c733454a56	Merge pull request #2421 from Molecule-AI/fix/heartbeat-delivers-inbound-secret fix(workspace): deliver platform_inbound_secret on every heartbeat	2026-05-01 00:54:00 +00:00
Hongming Wang	f035482e0a	Merge pull request #2420 from Molecule-AI/refactor/send-a2a-message-by-peer-id refactor(workspace-runtime): send_a2a_message takes peer_id + UUID validation [stacks on #2418]	2026-05-01 00:44:56 +00:00
Hongming Wang	993f8c494e	refactor(workspace-runtime): send_a2a_message takes peer_id, validates UUID Two cleanups stacked on PR #2418: 1. Refactor `send_a2a_message(target_url, msg)` → `send_a2a_message(peer_id, msg)`. After #2418 every caller passes `${PLATFORM_URL}/workspaces/{peer_id}/a2a` — the function's parameter pretended to accept arbitrary URLs but in practice only one shape is meaningful. Owning URL construction inside the function makes the contract honest and centralises the peer-id validation introduced below. 2. Add `_validate_peer_id` UUID-shape check at the trust boundary. `discover_peer` and `send_a2a_message` are the entry points where agent-controlled strings flow into URL paths; rejecting non-UUID input at this layer eliminates the URL-interpolation class of bug (`workspace_id="../admin"` etc.) regardless of how the rest of the codebase interpolates ids elsewhere. Auth was already gating malicious access — this is consistency + clear failure over silent platform 4xx. In-container tests cover positive UUIDs, malformed input (``"ws-abc"``, ``"../admin"``, empty), and the contract that ``tool_delegate_task`` hands the peer_id to ``send_a2a_message`` without building URLs itself. Live-verified: external delegation 8dad3e29 → 97ac32e9 returned "refactor verified" from Claude Code Agent through the refactored code; ``_validate_peer_id`` rejects ``"ws-abc"`` and ``"../admin"`` and accepts canonical UUIDs. Stacked on PR #2418 (proxy-routing fix). Will rebase onto staging once #2418 merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:43:01 -07:00
github-actions[bot]	6445c0ad17	Merge pull request #2419 from Molecule-AI/staging staging → main: auto-promote `665582b`	2026-04-30 17:38:19 -07:00
Hongming Wang	a5c5139e3a	fix(workspace): deliver platform_inbound_secret on every heartbeat Heartbeat now echoes the workspace's platform_inbound_secret on every beat (mirroring /registry/register), and the molecule-mcp client persists it to /configs/.platform_inbound_secret on receipt. Symptom (2026-04-30, hongmingwang tenant): chat upload returned 503 "workspace will pick it up on its next heartbeat" and then 401 on retry — permanent until workspace restart. The 503 message was a lie: heartbeat used to discard the platform_inbound_secret entirely; only register delivered it, and register fires once at startup. Server (Go): - Heartbeat handler reuses readOrLazyHealInboundSecret (the same helper chat_files + register use), so heartbeat-time recovery covers the rotate / mid-life NULL-column case the existing register-time heal can't reach. - Failure is non-fatal: liveness contract trumps secret delivery, chat_files retries lazy-heal on its own next request. Client (Python): - _persist_inbound_secret_from_heartbeat parses the heartbeat 200 response and persists via platform_inbound_auth.save_inbound_secret. - All exceptions swallowed — heartbeat liveness > secret persistence; next tick (≤20s) retries. Tests: - Server: pin secret-present, lazy-heal-mint-on-NULL, and heal-failure-omits-field branches. - Client: pin persist-on-200, skip-on-empty, skip-on-non-dict-body, skip-on-401, swallow-save-OSError.	2026-04-30 17:36:33 -07:00
Hongming Wang	665582b612	Merge pull request #2418 from Molecule-AI/fix/external-delegate-via-platform-proxy fix(workspace-runtime): route delegate_task through platform A2A proxy	2026-05-01 00:16:15 +00:00
Hongming Wang	aefb44aff2	fix(workspace-runtime): route delegate_task through platform A2A proxy tool_delegate_task was POSTing directly to peer["url"], which is the Docker-internal hostname (e.g. http://ws-X-Y:8000) for in-container peers. External callers — the standalone molecule-mcp wrapper running on an operator's laptop — get [Errno 8] nodename nor servname every single delegation, breaking the universal-MCP path's last "ride the same code as in-container" claim. The platform's /workspaces/:peer-id/a2a proxy endpoint already handles internal forwarding for in-container peers AND is the only path external runtimes can use. Unify on it: in-container callers pay one extra HTTP hop on the same Docker bridge (microseconds); external callers get a working delegation path for the first time. discover_peer is still called for access-control + online-status detection — only the routing target changes. Verified live on 2026-04-30 against workspace 8dad3e29 (external mac runtime) → 97ac32e9 (Claude Code Agent in-container): direct POST returned ConnectError, proxy POST returned "acknowledged from claude code agent" as requested. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 17:13:50 -07:00
github-actions[bot]	cf6061a6c3	Merge pull request #2417 from Molecule-AI/staging staging → main: auto-promote `cc58e87`	2026-04-30 16:57:37 -07:00
Hongming Wang	cc58e87393	Merge pull request #2415 from Molecule-AI/feat/molecule-mcp-inbox-polling feat(workspace-runtime): inbox polling for standalone molecule-mcp	2026-04-30 23:41:47 +00:00
github-actions[bot]	563b31f2af	Merge pull request #2416 from Molecule-AI/staging staging → main: auto-promote `d00c8be`	2026-04-30 16:39:53 -07:00
Hongming Wang	d061642cfc	test(inbox): bind side-effecting pop() before assert CodeQL flagged the bare `assert state.pop(...) is None` — under `python -O` asserts are stripped, which would skip the call entirely and the test would silently pass without exercising the code. Bind the result first so the call always runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:39:45 -07:00
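
The escalation rule described in d887ce8e96 (heartbeat 401 handling) can be sketched roughly as below. This is an illustrative reading of the commit message, not the code in mcp_cli.py: the class name `AuthFailureTracker` and its `observe` method are hypothetical, and three details are assumptions because the message does not pin them down: the "every 20 ticks" re-log is counted from the first ERROR, ticks between re-logs stay at WARNING, and a non-auth failure leaves the counter unchanged rather than resetting it.

```python
import logging
from dataclasses import dataclass
from typing import Optional

ESCALATE_AT = 3   # from the commit message: ERROR on the 3rd consecutive auth failure
RELOG_EVERY = 20  # from the commit message: repeat the ERROR every 20 ticks (~7 min at 20s)


@dataclass
class AuthFailureTracker:
    """Hypothetical helper: tracks consecutive 401/403 heartbeat responses."""

    consecutive: int = 0

    def observe(self, status: int) -> Optional[int]:
        """Return the logging level to emit for this tick, or None for success."""
        if 200 <= status < 300:
            # Recovery: a successful heartbeat resets the counter.
            self.consecutive = 0
            return None
        if status not in (401, 403):
            # 5xx and other non-auth errors never count toward "token revoked".
            # Assumption: the counter is left untouched rather than reset.
            return logging.WARNING
        self.consecutive += 1
        if self.consecutive < ESCALATE_AT:
            # Could still be a transient blip.
            return logging.WARNING
        if (self.consecutive - ESCALATE_AT) % RELOG_EVERY == 0:
            # First escalation at count == 3, then a periodic re-log (assumed cadence).
            return logging.ERROR
        # Assumption: ticks between re-logs stay at WARNING.
        return logging.WARNING
```

A caller would feed each heartbeat response status into `observe` and log at the returned level, attaching the re-onboard instructions whenever the level is `logging.ERROR`.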

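The peer-id check described in 993f8c494e can be illustrated with a minimal sketch. The function names `validate_peer_id` and `build_a2a_url` are hypothetical stand-ins (the commit names `_validate_peer_id` but does not show its body), and the canonical-form comparison is an assumption about what "UUID-shape" means; only the `/workspaces/{peer_id}/a2a` path shape comes from the commit message.

```python
import uuid


def validate_peer_id(peer_id: str) -> str:
    """Return the canonical lowercase UUID string, or raise ValueError."""
    try:
        parsed = uuid.UUID(peer_id)
    except (ValueError, AttributeError, TypeError):
        raise ValueError(f"peer_id is not a UUID: {peer_id!r}") from None
    # uuid.UUID also accepts braces, a urn:uuid: prefix, and hyphen-less hex.
    # Assumption: only the canonical 8-4-4-4-12 form should pass.
    if str(parsed) != peer_id.lower():
        raise ValueError(f"peer_id is not a canonical UUID: {peer_id!r}")
    return str(parsed)


def build_a2a_url(platform_url: str, peer_id: str) -> str:
    """Build the proxy URL only after the id has passed validation."""
    return f"{platform_url.rstrip('/')}/workspaces/{validate_peer_id(peer_id)}/a2a"
```

Validating before interpolation means inputs such as `"ws-abc"` or `"../admin"` fail with a clear `ValueError` at the call site instead of producing a malformed path and a platform 4xx.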

3623 Commits