molecule-core

Author	SHA1	Message	Date
Hongming Wang	06bebc1b35	Merge pull request #2461 from Molecule-AI/feat/mcp-experimental-channel-capability feat(mcp): declare experimental.claude/channel capability for push UX	2026-05-01 20:47:52 +00:00
Hongming Wang	d294f15c88	Merge pull request #2460 from Molecule-AI/feat/template-always-ask-provider feat(canvas): always ask for provider+model when deploying multi-provider templates	2026-05-01 20:45:37 +00:00
Hongming Wang	0a87dec50e	feat(mcp): declare experimental.claude/channel capability for push UX Without this capability declaration in the initialize handshake, Claude Code's MCP client receives our notifications/claude/channel emissions but silently drops them — they never become inline <channel> tags in the conversation. The push-UX bridge added in PR #2433 ships, fires, and is invisible. This was anticipated as a failure mode in #2444 §2 ("Notification arrives but Claude Code doesn't surface it — host doesn't recognize the method"), and confirmed live in this session: a canvas chat "hi" landed in the inbox queue (inbox_peek returned it) but never woke the agent until inbox_peek was called by hand. The contract matches molecule-mcp-claude-channel/server.ts:374 where the bun bridge declares the same experimental flag. Refactor: extracted _build_initialize_result() so the handshake shape is unit-testable. Pure function, no behavioral change beyond adding the experimental capability to the result. Tests: 3 new pins on the initialize result (capability presence, tools-still-there, protocolVersion stable). Closes the live- verification gap §2 of #2444. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:45:06 -07:00
Hongming Wang	3ba924d174	review: drop destructive Override + single-fetch configuredKeys Self-review of #2460 found two issues: 1. Critical: Override button in ProviderPickerModal called /settings/secrets when no workspaceId, overwriting the GLOBAL secret used by every workspace. The only consumers of this modal today (TemplatePalette, EmptyState via useTemplateDeploy) never pass workspaceId, so Override was always destructive. Removed entirely — the picker still solves the user-reported bug (always-ask + reuse saved keys); per-workspace key override can be a separate PR that plumbs secrets through POST /workspaces. 2. Optional: /settings/secrets was being fetched twice — once inside checkDeploySecrets (silently) and again in the hook to populate configuredKeys. Surfaced configuredKeys on PreflightResult so the hook re-uses the existing fetch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:40:58 -07:00
Hongming Wang	0608e15ab3	feat(canvas): always prompt for provider+model on multi-provider template deploy Clicking a hermes template tile silently deployed when global env covered the API key, producing "No LLM provider configured" 500 because the workspace booted with no explicit model slug — the adapter fell back to its compiled-in default which 401s on the user's actual provider key. Fix: in useTemplateDeploy, open the picker whenever the template declares ≥2 provider options, even when preflight.ok=true. The modal renders pre-saved keys as Saved (with an Override link) and adds a model input pre-filled from the template's default. Single- provider templates (claude-code, langgraph) still skip the picker since there's nothing to choose. POST /workspaces now includes the picker's model slug so hermes- style routing reads the prefix at install time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:34:17 -07:00
Hongming Wang	141ecc1c16	Merge pull request #2459 from Molecule-AI/fix/configs-dir-fallback-non-container fix(runtime): auto-fallback CONFIGS_DIR for non-container hosts	2026-05-01 20:16:55 +00:00
Hongming Wang	b8fdbd9fab	fix(runtime): register configs_dir in TOP_LEVEL_MODULES + drop alias Wheel-build smoke gate detected `configs_dir` missing from scripts/build_runtime_package.py:TOP_LEVEL_MODULES. Without it the build would ship `import configs_dir` un-rewritten and every external-runtime install would die on `ModuleNotFoundError` at first import. Two callers used `import configs_dir as _configs_dir` to belt-and- suspenders against an imagined name collision, but the rewriter rejects `import X as Y` because the rewrite would produce `import molecule_runtime.X as X as Y` (invalid syntax). No actual collision exists (only docstring/comment references). Switched to plain `import configs_dir`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:13:57 -07:00
Hongming Wang	de353a5933	Merge pull request #2457 from Molecule-AI/feat/createworkspacedialog-dynamic-providers feat(canvas): dynamic provider dropdown in CreateWorkspaceDialog	2026-05-01 20:08:35 +00:00
Hongming Wang	c636022d2f	fix(runtime): auto-fallback CONFIGS_DIR for non-container hosts (closes #2458 ) The runtime persists per-workspace state (`.auth_token`, `.platform_inbound_secret`, `.mcp_inbox_cursor`) under `/configs` — the workspace-EC2 mount path. Inside a container that's writable, agent-owned. Outside a container, `/configs` either doesn't exist or isn't writable by an unprivileged user. The default broke the external-runtime path (`pip install molecule-ai-workspace-runtime` + `molecule-mcp` on a Mac/Linux laptop). First heartbeat tries to persist `.platform_inbound_secret` and crashes: [Errno 30] Read-only file system: '/configs' The heartbeat thread logs and dies. Workspace flips offline within a minute. Operator sees no actionable error. Adds workspace/configs_dir.py — single resolution point with a tiered fallback: 1. CONFIGS_DIR env var, if set — explicit operator override (preserves existing tests + custom deployments verbatim). 2. /configs — if it exists AND is writable. In-container default; unchanged behavior for every prod workspace. 3. ~/.molecule-workspace — created with mode 0700 so per-file 0600 perms aren't undermined by a world-readable parent. Migrates the four readers (platform_auth, platform_inbound_auth, mcp_cli, inbox) to call configs_dir.resolve() instead of inlining `Path(os.environ.get("CONFIGS_DIR", "/configs"))`. Existing tests that assert the old `/configs`-as-default contract updated to assert the new contract: when CONFIGS_DIR is unset, path resolves to a writable location — `/configs` if present, fallback otherwise. Tests skip the fallback branch on hosts that DO have a writable `/configs` (CI containers). Verified the original repro is fixed: with no CONFIGS_DIR set on macOS, configs_dir.resolve() returns ~/.molecule-workspace, the dir exists, and writes succeed. Test suite: 1454 passed, 3 skipped, 2 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 13:07:55 -07:00
Hongming Wang	e1496936e9	feat(canvas): dynamic provider dropdown in CreateWorkspaceDialog Mirrors the data-driven pattern PR #2454 set in ConfigTab: read runtime_config.providers from /templates and filter the modal's provider <select> to that subset. Same source of truth, three fewer hardcoded copies of the provider list. Behavior: - Template declares providers → dropdown shows only those. - Template ships no providers field → fall back to full HERMES_PROVIDERS catalog (back-compat for older templates / self-hosted setups). - Declared list has no overlap with our static metadata → fall back to full catalog so the form can't lock the operator out. - hermesProvider snaps back to the first available pick when its current value falls out of the filtered list. Tests: 3 new pinning the filter, no-providers-field fallback, and the unknown-providers fallback. All 27 CreateWorkspaceDialog tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:45:20 -07:00
Hongming Wang	0b809cfa62	Merge pull request #2456 from Molecule-AI/ops/demo-day-freeze-runbook ops: demo-day freeze + rollback runbook	2026-05-01 19:06:51 +00:00
Hongming Wang	6d23611620	ops: demo-day freeze + rollback runbook Demo-day preparation bundle for the funding demo (~2026-05-06). Adds: - scripts/demo-freeze.sh — captures current ghcr.io workspace-template-* :latest digests for all 8 runtimes, then disables both cascade vectors that could re-tag :latest mid-demo: publish-runtime.yml in molecule-core (PATH 1 — staging push to workspace/** auto-bumps the wheel and fans out to 8 templates) and publish-image.yml in each of the 8 template repos (PATH 2 — direct template repo merge re-tags :latest). Defaults to dry-run; requires --execute to apply. Writes both digest + workflow receipts to scripts/demo-freeze-snapshots/. - scripts/demo-thaw.sh — re-enables every workflow demo-freeze.sh disabled, keyed off the receipt timestamp. Defaults to executing (the inverse safety polarity from freeze, where the destructive default is dry-run). --dry-run prints without applying. - scripts/demo-day-runbook.md — operator runbook indexing the six rollback levers (platform image rollback, template image rollback, tenant redeploy, workspace delete, Railway rollback, Vercel rollback) plus pre-warm timing and post-demo cleanup. Also covers read-only diagnostics for "is this working?" moments and the CP_ADMIN_API_TOKEN rotation step that must follow demo (the token gets copy-pasted into shells during incident response). - scripts/demo-freeze-snapshots/.gitignore — generated freeze receipts are operational state, not source. Tracked .gitkeep so the directory exists when the script writes to it. Both scripts dry-run-tested locally. Did not exercise --execute since that would actually disable production workflows mid-development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:04:30 -07:00
Hongming Wang	092724b6d7	Merge pull request #2455 from Molecule-AI/fix/internal-chat-uploads-errno-clarity fix(workspace): surface errno + path on chat-upload mkdir failure	2026-05-01 18:52:46 +00:00
Hongming Wang	2e8892ebc4	fix(workspace): surface errno + path on chat-upload mkdir failure Production incident on hongming.moleculesai.app 2026-05-01T18:30Z — fresh-tenant signup chat upload returned 500 with the body {"error":"failed to prepare uploads dir"}. Diagnosis required SSM access to the workspace stderr to recover errno + actual path. The root-cause fix lives in claude-code template entrypoint (molecule-ai-workspace-template-claude-code#23 — pre-create the .molecule subtree as root before gosu drops to agent). This change is the diagnostic improvement: when mkdir fails for any reason in the future (EACCES, ENOSPC, EROFS, etc.), the response carries the errno + offending path so the operator inspecting browser devtools sees the real cause without needing SSM. Backwards compatible — top-level "error" key is unchanged so existing canvas / external alert rules continue to match. New fields are additive: path, errno, detail. Test pins the diagnostic shape so a future struct refactor can't silently drop these fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:47:53 -07:00
Hongming Wang	d55360c5d8	Merge pull request #2454 from Molecule-AI/feat/canvas-config-provider-dropdown feat(canvas+workspace-server): data-driven Provider dropdown (#199)	2026-05-01 18:32:55 +00:00
Hongming Wang	517bd0efc5	feat(canvas+workspace-server): data-driven Provider dropdown (#199 ) Option B PR-5. Canvas Config tab now exposes a Provider override input that's adapter-driven from each runtime's template — no hardcoded provider list in the canvas. PUT /workspaces/:id/provider on Save when dirty; auto-restart suppression to avoid double-restart with the model handler's own restart. The dropdown's suggestion list comes from /templates → runtime_config.providers (the field added in molecule-ai-workspace-template-hermes PR #31). For templates that haven't migrated to the explicit providers list yet, suggestions derive from model[].id slug prefixes — still adapter-driven, just inferred. This keeps existing templates working while platform team migrates them one at a time. workspace-server changes: - Add Providers []string field to templateSummary JSON - Parse runtime_config.providers in /templates handler - 2 new tests pin the surfacing + omitempty behavior canvas changes: - Remove hardcoded PROVIDER_SUGGESTIONS constant - Add provider/originalProvider state + PUT-on-save logic - Add deriveProvidersFromModels() fallback helper - Wire RuntimeOption.providers from /templates response - 8 new tests pin the behavior end-to-end Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 11:19:17 -07:00
Hongming Wang	1a1285171c	Merge pull request #2453 from Molecule-AI/feat/workspace-server-provider-endpoint feat(workspace-server): PUT /provider endpoint (#196 — Option B PR-2)	2026-05-01 05:37:15 +00:00
Hongming Wang	89a6f27478	Merge pull request #2452 from Molecule-AI/fix/workspace-410-removed-at-zero-value fix(workspace-server): null removed_at when timestamp fetch fails (#2429 review polish)	2026-05-01 05:28:24 +00:00
Hongming Wang	258c6bea44	feat(workspace-server): PUT /provider endpoint for explicit LLM provider (#196 ) Mirror of PUT /model. Stores the provider slug as the LLM_PROVIDER workspace secret so the canvas can update model + provider independently — a user might keep the same model alias and switch providers (route through a different gateway), or vice versa. Forcing both into one endpoint imposes a single Save+Restart per change; two endpoints let canvas update each as the user picks. Plumbs through the existing chain: secret-load → envVars → CP req.Env → user-data env exports → /configs/config.yaml (after controlplane PR #364 lands the heredoc append). Tests: 5 new cases mirroring SetModel/GetModel exactly — default empty response, DB error, upsert with restart trigger, empty-clears, invalid-UUID rejection. Part of: Option B PR-2 (#196) — workspace-server plumbs LLM_PROVIDER Stack: PR-1 schema (#2441 merged) PR-2 (this) ws-server endpoint PR-3 (#364 open) CP user-data persistence PR-4 (pending) hermes adapter consume PR-5 (pending) canvas Provider dropdown	2026-04-30 22:25:48 -07:00
Hongming Wang	364c70fc71	fix(workspace-server): emit null removed_at when timestamp fetch fails #2429 review finding. The 410-Gone path issues a follow-up `SELECT updated_at` after detecting status='removed'. If that query fails (workspace row deleted between the two queries, transient DB error, etc.), `removedAt` stays as Go's zero time and the JSON body emits `"removed_at": "0001-01-01T00:00:00Z"` — a misleading timestamp the client has to know to ignore. Now we branch on `removedAt.IsZero()` and emit `null` for the failed path. The actionable signal (the 410 + hint) is unchanged; only the timestamp shape gets cleaner. Pinned by `TestWorkspaceGet_RemovedReturns410WithNullRemovedAtOnTimestampFetchFailure`, which simulates the row vanishing via `sqlmock`'s `WillReturnError(sql.ErrNoRows)`. The original `_RemovedReturns410` test now also asserts that the happy-path timestamp is a non-null value (was just checking the key existed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:24:59 -07:00
Hongming Wang	c06c4c0f56	Merge pull request #2450 from Molecule-AI/feat/observability-config-schema feat(config): observability block schema (#119 PR-1 of 4)	2026-05-01 05:20:11 +00:00
Hongming Wang	b97a346fbf	Merge pull request #2451 from Molecule-AI/feat/a2a-client-410-removed feat(a2a-client): surface 410 Gone as 'removed' so callers can re-onboard (#2429)	2026-05-01 05:13:36 +00:00
Hongming Wang	645c1862c4	feat(a2a-client): surface 410 Gone as 'removed' error so callers can re-onboard (#2429 ) Follow-up A to PR #2449 — that PR taught the platform to return 410 Gone for status='removed' workspaces; this PR teaches get_workspace_info to consume that signal. Before: every non-200 collapsed into {"error": "not found"}, which made the 2026-04-30 incident impossible to diagnose — the operator KNEW the workspace_id existed (they'd just registered it), but the runtime kept reporting "not found" for a deleted-but-not-purged row. After: 410 produces a distinct {"error": "removed", "id", "removed_at", "hint"} dict so callers (heartbeat-loop, channel bridge, dashboard tools) can surface "your workspace was deleted, re-onboard" instead of "not found". Falls back to a default hint if the platform body isn't parseable so the actionable signal doesn't depend on body shape parity. Two new tests: - TestGetWorkspaceInfo.test_410_returns_removed_with_hint - TestGetWorkspaceInfo.test_410_with_unparseable_body_falls_back_to_default_hint Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 22:08:08 -07:00
Hongming Wang	59902bce83	feat(config): add observability block schema (#119 PR-1 of 4) Hermes-style declarative block grouping cadence + verbosity knobs into one place. Schema-only in this PR — wiring into heartbeat.py and main.py lands in PR-3 of the #119 stack. Two fields with live consumers waiting: - heartbeat_interval_seconds (default 30, clamped to [5, 300]) → heartbeat.py:134 currently has hard-coded HEARTBEAT_INTERVAL = 30 - log_level (default "INFO", uppercased at parse) → main.py:465 currently has hard-coded log_level="info" Clamp band [5, 300] is intentional: sub-5s flooded the platform during IR-2026-03-11; >5min lets crashed workspaces look healthy long enough to mask failure. Coerce at parse so adapters and heartbeat.py can read the value without re-validating. Tests pin defaults, explicit YAML override, partial override, and parametrized clamp behavior (10 cases including garbage strings + None). Part of: task #119 (adopt hermes-style architecture) Stack: PR-1 schema → PR-2 event_log → PR-3 wire consumers → PR-4 skill compat	2026-04-30 21:58:45 -07:00
Hongming Wang	6dbc36d820	Merge pull request #2449 from Molecule-AI/feat/workspace-410-gone-on-removed feat(workspace-server): 410 Gone for removed workspaces (#2429)	2026-05-01 04:58:28 +00:00
Hongming Wang	72f0079c10	feat(workspace-server): GET /workspaces/:id returns 410 Gone when status='removed' (#2429 ) Defense-in-depth at the endpoint level. Previously, GET /workspaces/:id returned 200 OK with `status:"removed"` in the body for deleted workspaces — silent-fail UX hit on the hongmingwang tenant 2026-04-30: the channel bridge / molecule-mcp wheel had a dead workspace_id + token in .env, get_workspace_info returned 200 → caller assumed everything was fine, then every subsequent /registry/* call 401d because tokens were revoked, and operators had no idea their workspace was gone. #2425 fixed the steady-state heartbeat path (escalate to ERROR after 3 consecutive 401s). This change is the startup-time defense — fail loud when the operator first probes the workspace instead of waiting for the heartbeat to sour. The 410 body includes: {error: "workspace removed", id, removed_at, hint: "Regenerate ..."} Audit-trail consumers that need the body shape of a removed workspace (admin views, "show me deleted workspaces" tooling) opt into the legacy 200 + body via ?include_removed=true. Without this opt-in path the audit trail becomes invisible at the API layer. Two new tests pinned: - TestWorkspaceGet_RemovedReturns410 - TestWorkspaceGet_RemovedWithIncludeQueryReturns200 Follow-ups in separate PRs: - Update workspace/a2a_client.py get_workspace_info to surface "removed" specifically rather than collapsing into "not found" - Update channel bridge getWorkspaceInfo (server.ts) to detect 410 → log clear "workspace was deleted, re-onboard" error - Audit canvas/* + admin tooling consumers that may rely on the legacy 200 + status:"removed" shape; switch them to the ?include_removed=true opt-in if needed - Update docs (runtime-mcp.mdx Troubleshooting + external-agents.mdx lifecycle table) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:55:24 -07:00
Hongming Wang	08d082d466	Merge pull request #2447 from Molecule-AI/chore/wheel-smoke-fixups chore(smoke-mode): harden module-load + drop dead except clause	2026-05-01 04:33:21 +00:00
Hongming Wang	661eec2659	chore(smoke-mode): harden module-load + drop dead except clause Two follow-ups from the #2275 Phase 1 self-review: 1. `_SMOKE_TIMEOUT_SECS = float(os.environ.get(...))` was evaluated at module load. main.py imports smoke_mode unconditionally — before the is_smoke_mode() check — so a malformed MOLECULE_SMOKE_TIMEOUT_SECS env value would SystemExit every workspace boot, not just smoke runs. Wrapped in try/except with a 5.0 fallback. Probability of a typo'd env var hitting production is low (it's a CI-only knob), but the footgun is removed entirely. Regression test reloads the module under a malformed env value. 2. `_real_a2a_sdk_available()` caught (ImportError, AttributeError). `from X import Y` raises ImportError when Y is missing on X — never AttributeError. Dropped the unreachable branch. No behavior change for the happy path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:31:08 -07:00
Hongming Wang	1a18e9398a	Merge pull request #2445 from Molecule-AI/feat/terminal-diagnose-endpoint feat(terminal): add /workspaces/:id/terminal/diagnose endpoint	2026-05-01 04:29:02 +00:00
Hongming Wang	e6161b15a1	Merge pull request #2446 from Molecule-AI/feat/wheel-smoke-mode-execute-stub feat(wheel-smoke): exercise execute() to catch lazy imports (#2275)	2026-05-01 04:23:43 +00:00
Hongming Wang	aacaba024c	feat(wheel-smoke): exercise executor.execute() to catch lazy imports (#2275 ) The existing wheel-publish smoke (`wheel_smoke.py`) only IMPORTS `molecule_runtime.main` at module scope. Lazy imports buried inside `async def execute(...)` bodies (e.g. `from a2a.types import FilePart`) NEVER evaluate at static-import time — they crash at first message delivery in production. The 2026-04-2x v0→v1 a2a-sdk migration shipped 5 such regressions in templates that all looked fine at module-load smoke. This change adds `smoke_mode.py` plus a `MOLECULE_SMOKE_MODE=1` short-circuit in `main.py`: after `adapter.create_executor(...)`, the boot path invokes `executor.execute(stub_ctx, stub_queue)` once with a 5s timeout (`MOLECULE_SMOKE_TIMEOUT_SECS`). Healthy import tree → execution proceeds far enough to hit a network boundary and times out (exit 0). Broken lazy import → `ImportError` / `ModuleNotFoundError` from inside the executor body (exit 1). Other downstream errors (auth, validation) pass — those are caught by adapter-level tests, not this gate. Stub `(RequestContext, EventQueue)` is built from the real a2a-sdk so SendMessageRequest/RequestContext constructor changes also surface as import-tree failures (the regression class also includes "SDK refactored mid-publish"). The stub-build itself is wrapped — if it raises, that's a smoke fail too. Phase 2 (separate PR, molecule-ci) wires this into publish-template-image.yml so the publish gate runs the boot smoke against every template image before pushing the tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:21:18 -07:00
Hongming Wang	b9311134cf	fix(terminal-diagnose): KI-005 hierarchy check + race-free stderr capture Two fixes from /code-review-and-quality on PR #2445: 1. KI-005 hierarchy check parity with /terminal HandleConnect runs the KI-005 cross-workspace guard before dispatch (terminal.go:85-106): when X-Workspace-ID is set and != :id, validate the bearer's workspace binding then call canCommunicateCheck. Without this, an org-level token holder in tenant Foo can probe any workspace's diagnostic state by guessing the UUID — same enumeration vector KI-005 closed for /terminal in #1609. Per-workspace bearer tokens are URL-bound by WorkspaceAuth, so the gap is org tokens within the same tenant. Fix: copy the same gate into HandleDiagnose, before the instance_id SELECT. Test: TestHandleDiagnose_KI005_RejectsCrossWorkspace stubs canCommunicateCheck=false and confirms 403 fires before the DB lookup (sqlmock's ExpectationsWereMet pins that we never reached the SELECT COALESCE). Mirrors the existing TestTerminalConnect_KI005_RejectsUnauthorizedCrossWorkspace. 2. Race-free tunnel stderr capture (syncBuf) strings.Builder isn't goroutine-safe. os/exec spawns a background goroutine that copies the subprocess's stderr fd to cmd.Stderr's Write, so reading the buffer's String() from the request goroutine on wait-for-port timeout while the tunnel may still be writing is a data race that `go test -race` flags. Worst-case impact in production is a garbled Detail string (not a crash), but the fix is small. Fix: wrap bytes.Buffer in a sync.Mutex (syncBuf type). Same io.Writer interface, no API changes elsewhere. 3. Nit cleanup - read-pubkey failure now reports as its own step name instead of a duplicated "ssh-keygen" entry — disambiguates two different failure modes that previously shared a name. - Replaced numToString hand-rolled int-to-string with strconv.Itoa in the test (no import savings reason existed). Suite: 4 diagnose tests pass with -race; full handlers suite passes in 3.95s. go vet clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:19:18 -07:00
Hongming Wang	d012a803e4	feat(terminal): add diagnose endpoint for SSH probe stages GET /workspaces/:id/terminal/diagnose runs the same per-stage pipeline as /terminal (ssh-keygen → EIC send-key → tunnel → ssh) but non-interactively and returns JSON. Each stage reports {name, ok, duration_ms, error, detail}, plus a top-level first_failure naming the broken stage. Why: when the canvas terminal silently disconnects ("Session ended" with no error frame — the user-reported failure mode on hongmingwang's hermes workspace), there is no remote-readable signal of WHICH stage failed. The ssh client's stderr lives only in the workspace-server's stdout on the tenant CP EC2 — invisible without shell access. /terminal can't expose stderr cleanly because it has already upgraded to WebSocket binary frames by the time ssh runs. /terminal/diagnose stays pure HTTP/JSON, so the same auth (WorkspaceAuth + ADMIN_TOKEN fallback) gives operators a one-call probe that splits "IAM broke" (send-ssh-public-key fails) from "tunnel/SG broke" (wait-for-port fails) from "sshd auth broke" (ssh-probe gets Permission denied) from "shell broke" (probe exits non-zero with stderr). Stages mirrored from handleRemoteConnect in terminal.go: 1. ssh-keygen ephemeral session keypair 2. send-ssh-public-key AWS EIC API push, IAM-gated 3. pick-free-port local port for the tunnel 4. open-tunnel aws ec2-instance-connect open-tunnel start 5. wait-for-port the tunnel actually listens (folds tunnel stderr into Detail when it doesn't) 6. ssh-probe non-interactive `ssh ... 'echo MARKER'` that confirms auth + bash + the marker round-trip (CombinedOutput captures stderr verbatim — this is the whole reason the endpoint exists) Local Docker workspaces (no instance_id) get a smaller probe: container-found + container-running. Same response shape so callers don't need to branch. Tests stub sendSSHPublicKey / openTunnelCmd / sshProbeCmd via the existing package-level vars (same pattern as TestSSHCommandCmd_*) so the test suite stays hermetic — no AWS, no network. The three new tests pin: (a) routing to remote on instance_id present, (b) routing to local on empty instance_id, (c) the operationally critical case — full success through wait-for-port then a probe failure surfaces ssh stderr in the ssh-probe step's Error/Detail with first_failure="ssh-probe". Auth: rides on existing WorkspaceAuth middleware. Operators with the tenant ADMIN_TOKEN (fetched via /cp/admin/orgs/:slug/admin-token) can probe any workspace without per-workspace token; same admin path as the canvas dashboard reads workspace activity. Response always returns HTTP 200 (success or step failure are both in the JSON body) so callers don't need to branch on status code — the endpoint either reports a first_failure or doesn't. Resolves task #200, supports task #193 (workspace EC2 sshd unresponsive — without this endpoint we couldn't pin the failure stage from outside the tenant CP EC2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 21:10:20 -07:00
Hongming Wang	f46c471f9b	Merge pull request #2443 from Molecule-AI/docs/correct-test-ops-scripts-header docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse	2026-05-01 03:55:28 +00:00
Hongming Wang	0b2ea0a50f	Merge pull request #2441 from Molecule-AI/feat/explicit-provider-field feat(config): add explicit `provider:` field alongside `model:` (PR-1 of stack)	2026-05-01 03:54:27 +00:00
Hongming Wang	e58e446444	docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse The previous header said `unittest discover from the scripts/ root walks recursively`, contradicting the workflow body which runs two passes precisely because discover does NOT recurse without __init__.py. Fixed self-review feedback on PR #2440. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:52:58 -07:00
Hongming Wang	f2545fcb57	Merge pull request #2440 from Molecule-AI/chore/wheel-rewriter-tests-and-noqa-cleanup chore: rewriter unit tests + drop misleading noqa on import inbox	2026-05-01 03:48:33 +00:00
Hongming Wang	067ad83ce5	feat(config): add explicit `provider:` field alongside `model:` Adds a top-level `provider` slug to WorkspaceConfig and RuntimeConfig so adapters can route to a specific gateway without re-implementing slug-prefix parsing across hermes / claude-code / codex. Resolution chain in load_config (mirrors how `model` resolves): 1. ``LLM_PROVIDER`` env var — what canvas Save+Restart sets so the operator's Provider dropdown choice survives a CP-driven restart (the regenerated /configs/config.yaml drops most user fields). 2. Explicit YAML ``provider:`` — operator pinned it in the file. 3. Derive from the model slug prefix for backward compat: ``anthropic:claude-opus-4-7`` → ``anthropic`` ``minimax/abab7-chat-preview`` → ``minimax`` bare model names → ``""`` (let the adapter decide). `runtime_config.provider` falls back to the top-level resolved provider, the same shape PR #2438 added for `runtime_config.model`. Why a separate field at all (we already parse the slug): - Custom model aliases without a recognizable prefix need an explicit signal — the canvas Provider dropdown writes it. - Adapters were each rolling their own slug-parse (hermes's derive-provider.sh, claude-code's adapter-default branch, etc.); one resolution point in load_config kills that drift class. - Canvas needs a stable storage field that doesn't get clobbered every time the user picks a new model. Backward-compatible: when `provider:` is absent, slug derivation keeps every existing config.yaml working without a migration. PR-1 of a multi-PR stack (Option B from RFC discussion). Subsequent PRs plumb the field through workspace-server env, CP user-data, adapters (hermes prefers explicit over derive-provider.sh), and canvas Provider dropdown UI. Tests cover all four resolution paths + runtime_config inheritance: - test_provider_default_empty_when_bare_model - test_provider_derived_from_colon_slug - test_provider_derived_from_slash_slug - test_provider_yaml_explicit_wins_over_derived - test_provider_env_override_beats_yaml_and_derived - test_runtime_config_provider_yaml_wins_over_top_level - test_provider_default_from_default_model Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:47:09 -07:00
Hongming Wang	6e92fe0a08	chore: rewriter unit tests + drop misleading noqa on `import inbox` Two small follow-ups to the PR #2433 → #2436 → #2439 incident chain. 1) `import inbox # noqa: F401` in workspace/a2a_mcp_server.py was misleading — `inbox` IS used (at the bridge wiring inside main()). F401 means "imported but unused", which would mask a real future F401 if the usage is removed. Drop the noqa, keep the explanatory block comment about the rewriter's `import X` → `import mr.X as X` expansion (and the `import X as Y` → `import mr.X as X as Y` trap the comment exists to prevent re-introducing). 2) scripts/test_build_runtime_package.py — 17 unit tests covering `rewrite_imports()` and `build_import_rewriter()` in scripts/build_runtime_package.py. Until now the function had zero coverage despite the entire wheel build depending on it. Tests pin: bare-import aliasing, dotted-import preservation, indented imports, from-imports (simple + dotted + multi-symbol + block), the `import X as Y` rejection added in PR #2436 (with comment- stripping + indented + comma-not-alias edge cases), allowlist anchoring (`a2a` ≠ `a2a_tools`), and end-to-end reproduction of the PR #2433 failing pattern + the #2436 fix pattern. 3) Wire scripts/test_.py into CI by adding a second discover pass to test-ops-scripts.yml. Top-level scripts/ tests live alongside their target file (parallels the scripts/ops/ test layout); the existing scripts/ops/ pass keeps running because scripts/ops/ has no __init__.py so a single discover from scripts/ root doesn't recurse. Two passes is simpler than retrofitting namespace packages. Path filter widened from `scripts/ops/` to `scripts/*` so PRs touching the build script trigger the new tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:45:32 -07:00
Hongming Wang	5b70204b01	Merge pull request #2439 from Molecule-AI/ci/wheel-smoke-always-run-for-required-gate ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility	2026-05-01 03:42:51 +00:00
Hongming Wang	3c16c27415	ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility The `PR-built wheel + import smoke` gate caught the broken wheel from PR #2433 (`import inbox as _inbox_module` collision) but couldn't block the merge because it isn't a required check on staging. Promoting it to required is the right move per the runtime publish pipeline gates note (2026-04-27 RuntimeCapabilities ImportError outage), but the existing `paths: [workspace/**, scripts/...]` filter blocks PRs that don't touch those paths from ever generating the check run — branch protection would deadlock waiting on a check that never fires. Refactor (same shape as e2e-api.yml's e2e-api job): - Drop top-level `paths:` filter — workflow runs on every push/PR/ merge_group event. - Add `detect-changes` job using dorny/paths-filter to compute the `wheel=true\|false` output. - Collapse to ONE always-running `local-build-install` job named `PR-built wheel + import smoke`. Per-step `if:` gates on the detect output. PRs untouched by wheel-relevant paths emit a no-op SUCCESS step ("paths filter excluded this commit") so the check passes without rebuilding the wheel. - merge_group + workflow_dispatch unconditionally `wheel=true` so the queue always validates the to-be-merged state, regardless of which PR composed it. Why one-job-with-step-gates instead of two-jobs-sharing-name: SKIPPED check runs block branch protection even when SUCCESS siblings exist (verified PR #2264 incident, 2026-04-29). Single always-run job emits exactly one SUCCESS check run regardless of paths filter. Follow-up: open a separate PR adding `PR-built wheel + import smoke` to the staging branch protection's required_status_checks.contexts once this lands. Doing both in one PR risks the protection update firing before the workflow refactor merges, deadlocking unrelated PRs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:40:05 -07:00
Hongming Wang	5ad41f63ce	Merge pull request #2438 from Molecule-AI/fix/runtime-config-model-fallback-clean fix(config): runtime_config.model falls back to top-level model	2026-05-01 03:31:20 +00:00
Hongming Wang	0070d0bd59	fix(config): runtime_config.model falls back to top-level model External feedback (2026-04-30): "Provisioner doesn't read model from config.yaml and doesn't set MODEL env var. Without MODEL, the adapter defaults to sonnet and bypasses the mimo routing." Confirmed accurate for SaaS workspaces. Trace: claude-code-default/adapter.py reads `runtime_config.model or "sonnet"` (and hermes reads HERMES_DEFAULT_MODEL via install.sh, which IS plumbed). For claude-code there's nothing — workspace/config.py loaded `runtime_config.model` only from YAML, ignoring MODEL_PROVIDER env. The CP user-data script regenerates /configs/config.yaml at every boot with only `name`, `runtime`, `a2a` keys (intentionally minimal so it doesn't carry stale state) — so any user-set runtime_config.model is wiped on every restart, and the adapter falls back to "sonnet" even when the user picked Opus in the canvas Config tab. Fix: when YAML omits runtime_config.model, fall back to the top-level resolved `model`, which already honors MODEL_PROVIDER env override. One-line in workspace/config.py. Now MODEL_PROVIDER → top-level model → runtime_config.model → adapter sees the user's selection. Sticky across CP-driven restarts; the canvas Save+Restart loop works as intended for every runtime, not just hermes. Tests: test_runtime_config_model_falls_back_to_top_level — top-level set, runtime_config empty → fallback wins test_runtime_config_model_yaml_wins_over_top_level — YAML explicit → fallback skipped (precedence) test_runtime_config_model_picks_up_env_via_top_level — full canvas Save+Restart simulation: env → top-level → runtime_config.model Negative-control verified: removing the `or model` flips both fallback tests red with the expected "" vs expected-model mismatch; restoring flips them green. The yaml-wins test passes either way (correctly, because precedence is preserved). Replaces closed PR #2435 — that PR's commit was on a contaminated branch and accidentally captured unrelated WIP changes (build script + a2a_mcp_server refactor) instead of this fix. Self-review caught it and closed the PR. This branch is clean off main + diff verified before push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:28:50 -07:00
Hongming Wang	4e39609ae0	Merge pull request #2434 from Molecule-AI/fix/terminal-ssh-connect-timeout fix(terminal): cap ssh handshake at 10s so hung sshd surfaces fast	2026-05-01 03:26:41 +00:00
Hongming Wang	bbc994f6c8	Merge pull request #2436 from Molecule-AI/fix/wheel-import-as-collision-fix-forward fix(wheel): import inbox without alias to dodge rewriter collision	2026-05-01 03:24:18 +00:00
Hongming Wang	cda93e3c52	test(terminal): update exact-argv snapshot to include ConnectTimeout The pre-existing TestSSHCommandCmd_BuildsArgv asserts the literal argv slice. Adding `-o ConnectTimeout=10` shifted the slice — this commit tracks the snapshot to match. The new behavior-based TestSSHCommandCmd_ConnectTimeoutPresent (added in the prior commit) keeps the invariant pinned without depending on argv ordering, so future tweaks land in only one place even if more options are added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:23:48 -07:00
Hongming Wang	0acdf3bb56	fix(wheel): import inbox without alias to dodge rewriter collision PR #2433 (notifications/claude/channel) shipped 'import inbox as _inbox_module' inside a2a_mcp_server.py:main(). The build script's import rewriter expands plain 'import inbox' to 'import molecule_runtime.inbox as inbox', so the original source became 'import molecule_runtime.inbox as inbox as _inbox_module', which is invalid Python. Caught at the publish-runtime + PR-built-wheel-smoke gate (the SyntaxError trace is in run 25200422679). The wheel didn't ship to PyPI because publish-runtime's smoke-import step refused to install it, but staging is currently sitting on a broken-build commit until this fix-forward lands. Changes: - a2a_mcp_server.py: lift `import inbox` to top of file (rewriter produces clean `import molecule_runtime.inbox as inbox`), call inbox.set_notification_callback directly in main() - build_runtime_package.py: rewrite_imports() now raises ValueError when it sees 'import X as Y' for any X in the workspace allowlist, instead of silently producing a syntax-error wheel. Operator gets a clear actionable error at build time pointing at the offending line + suggested rewrites ('from X import …' or plain 'import X'). The build-time gate (this PR's rewriter check) catches the regression class earlier than the smoke-time gate (PR #2433's failure). Adding 'PR-built wheel + import smoke' to staging branch protection's required checks is filed separately so this class doesn't merge again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:21:54 -07:00
Hongming Wang	f30b3d4476	fix(terminal): cap ssh handshake at 10s so hung sshd surfaces fast When the workspace EC2's sshd is unresponsive (mid-restart, SG drop, AMI without ec2-instance-connect), the canvas's xterm shows the user's typed bytes echoed back by the workspace-server's local PTY (cooked + echo mode before ssh sets it raw post-handshake) and then closes silently when Cloudflare's idle WebSocket timer fires (~100s) — with no "Connection refused" or "Permission denied" output ever reaching the user. This is what hongmingwang's hermes terminal looked like 2026-04-30 right after the heartbeat-fix redeploy: status="online" but the shell appeared dead. Caught reproducibly by holding a fresh /workspaces/<id>/terminal WebSocket open for 60s — server sent zero frames except the local-PTY echo of one keystroke typed at t=8s. ssh was hung at handshake; bash never saw the byte. Fix: add `-o ConnectTimeout=10` to ssh args. Now the failure surfaces as a real ssh error message in the terminal within 10s, instead of masquerading as a silently dead shell over the next ~100s. Doesn't diagnose why sshd isn't responding (separate investigation), but it does mean the user gets actionable feedback within seconds. Behavior-based regression test asserts `-o ConnectTimeout=N` is in the ssh argv — pins presence, not the literal value, so operators can tune without breaking the gate. Verified to FAIL on pre-fix code (matched the literal arg pair) and PASS on fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:16:41 -07:00
Hongming Wang	c901d52ee3	Merge pull request #2433 from Molecule-AI/feat/mcp-channel-notifications feat(mcp): notifications/claude/channel for push-feel inbox UX	2026-05-01 03:12:31 +00:00
Hongming Wang	0a3ec53f34	feat(mcp): notifications/claude/channel for push-feel inbox UX Adds a notification seam to the universal molecule-mcp wheel so push- notification-capable MCP hosts (Claude Code today; any compliant client tomorrow) get inbound A2A messages as conversation interrupts instead of having to poll wait_for_message / inbox_peek. Wire-up: - inbox.py: module-level _NOTIFICATION_CALLBACK + set_notification_callback() Fires from InboxState.record() AFTER lock release, with same dict shape inbox_peek returns. Best-effort — a raising callback never prevents the message from landing in the queue. - a2a_mcp_server.py: _build_channel_notification() pure helper + bridge wiring in main() that schedules notifications via asyncio.run_coroutine_threadsafe (poller is a daemon thread, MCP loop is asyncio). - Method name 'notifications/claude/channel' matches the contract documented in molecule-mcp-claude-channel/server.ts:509. - wheel_smoke.py: pin set_notification_callback as a published name, same regression class as the 0.1.16 main_sync incident. Pollers (wait_for_message / inbox_peek) keep working unchanged for runtimes without notification support. Tests: 6 new in test_inbox.py (callback fires once on record, dedupe short-circuits before fire, raising cb doesn't break inbox, set/clear semantics), 5 new in test_a2a_mcp_server.py (method name pin, content mapping, meta routing, no-id JSON-RPC notification spec, missing- field tolerance). All 59 combined tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 20:10:01 -07:00

1 2 3 4 5 ...

3649 Commits