molecule-ai-workspace-runtime

molecule-ai/molecule-ai-workspace-runtime

Author	SHA1	Message	Date
molecule-ai[bot]	d5cf872311	feat: migrate a2a-sdk 1.x (KI-009) (#39 ) - Replace a2a.utils.new_agent_text_message → a2a.helpers.new_text_message - Replace Part(root=TextPart(...)) → Part(text=...) (flat Part API) - Replace A2AStarletteApplication → Starlette route factories (create_agent_card_routes, create_jsonrpc_routes) - Update conftest stubs: remove a2a.server.apps/a2a.utils, add a2a.server.routes/a2a.helpers/AgentInterface - Add AgentInterface to AgentCard supported_interfaces - Rename snake_case AgentCard fields per 1.x schema Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-24 01:54:33 +00:00
Hongming Wang	e562b7a03e	Merge branch 'staging' into fix/auto-detect-llm-token-type	2026-04-23 13:52:25 -07:00
rabbitblood	a78b9f229e	test(1877): convert async tests to sync httpx.Client to unblock CI CI doesn't have pytest-asyncio installed, and the async wrapping was incidental — the production retry pattern (refresh-on-401) is identical in sync and async forms. Switching to httpx.Client + MockTransport keeps the same coverage without the async dep. 6/6 still pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:35:45 -07:00
rabbitblood	050c2412b3	fix(heartbeat): refresh on-disk auth token on 401 + retry once (#1877 ) ## Problem Auto-restart rotates the workspace's auth token in two non-atomic steps: 1. Platform issues new token via wsauth.IssueToken 2. Provisioner writes the new token to /configs/.auth_token AFTER ContainerStart returns Between steps 1 and 2, the new container has booted and the runtime has already loaded the OLD cached value of .auth_token (or no value if the file was empty during boot). The runtime's first /registry/heartbeat call sends the stale token, gets 401, but the loop never re-reads the on-disk token — so subsequent heartbeats also send the stale value. Each 401 means the platform never sees the workspace as alive → status stays 'provisioning' → scheduler won't dispatch → workspace looks dead from every angle even though the container is actually running. The existing code comment in workspace_provision.go acknowledges this: "the workspace will get 401 on its first heartbeat and can recover on the next restart." That recovery only worked because workspaces used to crash for unrelated reasons and get restarted. After PR #1861 (provisioner empty-volume auto-recover) removed those crashes, workspaces get stuck in the 401 loop with no exit. ## Fix Two-part runtime-side fix in molecule-ai-workspace-runtime: 1. platform_auth.refresh_from_disk() — new helper that clears the in-memory cache and re-reads /configs/.auth_token. Returns the fresh value (or None if missing). Updates the cache as a side effect. 2. HeartbeatLoop._loop() — on 401 from /registry/heartbeat, calls refresh_from_disk() and retries the request ONCE with the new token. Same pattern in _check_delegations(). Bounded retry budget — if the on-disk token is also stale (bug elsewhere), no infinite loop. ## Tests 6/6 new tests in tests/test_token_refresh_1877.py: - refresh_picks_up_rotated_token — happy path - refresh_returns_none_when_file_missing — defensive - refresh_clears_stale_cache_when_file_disappears - refresh_is_idempotent - 401_retry_pattern_uses_refreshed_token — the production fix path - 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget All pass locally on Python 3.13 + pytest 9. ## Why this fix and not the alternatives - Alternative B (platform writes token before ContainerStart): Right architecturally but invasive — needs provisioner refactor to prep volumes before docker run. - Alternative C (skip rotation on auto-restart): Breaks the multi-instance-safety invariant the existing code calls out (revoke prevents stale tokens from sister deployments). - This fix (A): 3-line core change + helper. Self-healing for any timing edge case, not just the post-restart one. Costs nothing in the happy path (only triggers on 401). ## Version Bumped to 0.1.9. Once published to PyPI + workspace template image rebuilt, deployed workspaces auto-recover from token-rotation races without operator intervention. Closes #1877. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:26:36 -07:00
rabbitblood	4bafea58ae	fix(llm_auth): tighten base-URL hostname match + strip whitespace + no token in logs Self-review findings on #38: 1. Token substring leak: the "unknown prefix" warning included the first 12 chars of the token in the log message. Logs get shipped to Langfuse / CloudWatch / slack-firehose — 12 bytes of a secret in a log is still 12 bytes too many. Warning no longer references the token value at all. 2. Base-URL substring match was too loose: `"anthropic.com" not in base` would accept `https://proxy.anthropic.com.evil.example/` as "looks like Anthropic, keep the URL." Replaced with an allowlist of exact hostnames parsed via urllib.parse.urlparse. 3. Whitespace in pasted tokens: operators frequently paste tokens from terminals with a trailing newline. The token would flow through startswith() detection but then fail downstream auth with a confusing "malformed token" error. Strip and persist the cleaned value. 4. Malformed base URL crash guard: if someone sets ANTHROPIC_BASE_URL to something urlparse can't handle, don't crash — fall through to clearing it, which is the safe choice in OAuth mode. Added 5 new tests covering each of the above. 16/16 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:46:07 -07:00
rabbitblood	0a0f11b41f	feat(runtime): auto-detect LLM token type, normalise env on boot Platform stores per-workspace LLM credentials under a single key (ANTHROPIC_AUTH_TOKEN in workspace_secrets). But downstream tools expect different env var names depending on the token type: sk-ant-oat01-* → CLAUDE_CODE_OAUTH_TOKEN (Claude Code OAuth session) sk-ant-api03-* → ANTHROPIC_API_KEY (direct Anthropic API) sk-cp-* → ANTHROPIC_AUTH_TOKEN (proxy: MiniMax, gateways) Without normalisation, an OAuth token under ANTHROPIC_AUTH_TOKEN gets sent as a bearer to api.anthropic.com, which responds: 401 authentication_error: OAuth authentication is currently not supported. This was a platform-wide footgun: anyone rotating LLM keys had to know the exact env var for each token type, AND make sure stale overrides were cleared, AND set ANTHROPIC_BASE_URL correctly for proxies (or NOT set for native Claude). Nothing downstream could help — the SDK just saw the wrong var. Fix: - New molecule_runtime/llm_auth.py — normalise_llm_env() mutates os.environ (or any dict) to the correct shape based on token prefix. Returns a NormalisationResult for logging. - main.py calls it as step 0, before any adapter/executor import. Every adapter (claude-code, langgraph, crewai, autogen, hermes, …) benefits automatically — no per-adapter branching needed. - 11 unit tests covering all prefix paths, edge cases, and the "operator deliberately set CLAUDE_CODE_OAUTH_TOKEN" precedence rule. Operationally: this means operators can keep using one ANTHROPIC_AUTH_TOKEN slot in platform settings and just paste whatever token the agent needs. No env-var-name awareness required. Tested locally: 11/11 new tests pass. 83 other tests unchanged (pre-existing failures on staging are all unrelated: test_workspace_id_validation, test_a2a_mcp_server RBAC, the test_imports.main module-walker — same signature as on staging HEAD before this PR). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 10:41:47 -07:00
molecule-ai[bot]	dcb6edd1a1	fix(shared_runtime): push heartbeat on CLEAR in set_current_task() (#37 ) Fixes #1372 — phantom busy: canvas showed workspace as active for up to 30s after task completion because set_current_task("") returned early without posting the updated heartbeat. Before: clearing only updated the heartbeat object; the next 30s scheduled heartbeat cycle propagated the clear. Quick tasks would leave a phantom-busy indicator. After: both SET and CLEAR push immediately to /registry/heartbeat. active_tasks=0 on clear, active_tasks=1 on set. Heartbeat object update and HTTP post are now unconditional. Tests: 5 new cases covering SET/CLEAR HTTP body, error resilience, None heartbeat, and missing env vars. Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>	2026-04-22 17:33:42 +00:00
Molecule AI Infra-Runtime-BE	d4b9bff5d0	fix(a2a_cli): validate WORKSPACE_ID in discover() before X-Workspace-ID header PR #32 wrapped all platform URL construction sites with get_validated_workspace_id() but missed a2a_cli.discover(), which passed the raw unvalidated WORKSPACE_ID in the X-Workspace-ID header. All other functions (peers, info) had try/except guards added. discover() now calls get_validated_workspace_id() upfront and returns None (printing the error) if validation fails — consistent with the best-effort error handling pattern used elsewhere in the module. Tests: 2 new cases in TestA2aCliDiscoverValidation covering empty and slash-injected WORKSPACE_ID values. Follow-up to: PR #32 (fix/908-add-namespace-param-commit-memory) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-21 01:35:37 +00:00
Molecule AI Infra-SRE	32a7880f4f	test+fix(builtin_tools/validation): add test coverage + fix ".." bypass in regex Tests: 37 new test cases in tests/test_validation.py covering: - Valid ID patterns (6): normal IDs, underscores, dots, max-length (256) - Empty/missing (1): raises with "empty" in message - Invalid chars (10): / \ .. # ? & whitespace - Caching (2): result is cached; raises on repeated bad calls - Error type (1): WorkspaceIdValidationError is a ValueError subclass Fix: regex now uses negative lookahead `(?!.*\.\.)` to reject ".." anywhere in the string (not just at the start). The old pattern `^[A-Za-z0-9_\-.]{1,256}$` matched ".." literally because two dots ARE in the allowed character class. Also adds test cases for embedded ".." (ws..example, ws../etc). Fixes: the ".." bypass was a gap in the original CWE-20 fix.	2026-04-21 00:55:08 +00:00
molecule-ai[bot]	30d96b4e4e	fix(platform_auth): validate WORKSPACE_ID at import time (issue #14 , CWE-20) (#29 ) WORKSPACE_ID was read via os.environ.get("WORKSPACE_ID", "") in multiple builtin_tools modules and used directly in platform API URLs and X-Workspace-ID headers without validation. A crafted ID containing /, .., or # could cause URL path injection. Fix: validate_workspace_id() in platform_auth.py now validates the ID format at module import time using a regex that permits only lowercase alphanumerics and hyphens (matching UUIDs and org-generated IDs). The validated value is exposed as a module-level WORKSPACE_ID constant. builtin_tools/approval.py and builtin_tools/delegation.py now import from platform_auth instead of reading os.environ directly. Failing input raises ValueError with a clear message — workspace fails fast at startup rather than silently accepting malformed IDs in requests. Add 15 regression tests (45/45 passing total). Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app> Co-authored-by: Infra-Runtime-BE <infra-runtime-be@molecule.ai>	2026-04-21 00:04:54 +00:00
Hongming Wang	4aa0d9f110	fix(adapter-loader): fall back to any BaseAdapter subclass ADAPTER_MODULE resolution required the imported module to export a class literally named `Adapter`. The claude-code, langgraph, and openclaw adapter-template repos (3 of 4 currently in production) don't ship that alias — they export ClaudeCodeAdapter / LangGraphAdapter / OpenClawAdapter directly. Only hermes has the `Adapter = HermesAdapter` shim at the bottom of adapter.py. Consequence in prod: every fresh claude-code / langgraph / openclaw workspace crashed at runtime startup with "module 'adapter' has no attribute 'Adapter'", even with a2a-sdk correctly pinned <1.0. Provisioning looked successful from CP's side (EC2 ran) but the agent never registered because the process never reached A2A bootstrap. Fix: if `Adapter` is absent from the imported module, scan the module for any attribute that is a proper BaseAdapter subclass (excluding BaseAdapter itself — regression guard in tests). The explicit alias remains the preferred contract; this is purely additive tolerance. Bump to 0.1.4 and publish to PyPI via the existing v* tag trigger. 6 new tests cover: explicit alias, subclass-fallback, non-adapter-noise ignored, empty module → error, missing module → error, re-exported BaseAdapter → not selected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 16:59:12 -07:00
molecule-ai[bot]	2bb0f97085	Merge pull request #26 from Molecule-AI/fix/plugin-setup-env-scrub fix(plugins_registry/builtins): strip API keys from plugin setup.sh env	2026-04-20 23:04:33 +00:00
Molecule AI Infra-Runtime-BE	d6944086fe	fix(plugins_registry/builtins): strip API keys from plugin setup.sh env Issue #19 (CWE-C-312): AgentskillsAdaptor.install() passed the full os.environ to the subprocess running setup.sh, including ANTHROPIC_API_KEY, OPENAI_API_KEY, GITHUB_TOKEN, WORKSPACE_AUTH_TOKEN, etc. A malicious or compromised plugin's setup.sh could exfiltrate them. Fix: _scrubbed_env() builds a copy of os.environ with sensitive keys removed, matching the same _SCRUB_KEYS list used in skill_loader/loader.py so the scrubbing policy is consistent. CONFIGS_DIR is still passed via the extra dict. Non-secret vars (PATH, HOME, etc.) are preserved. Add 6 regression tests (30/30 passing). Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>	2026-04-20 22:52:13 +00:00
Molecule AI Infra-Runtime-BE	c72fbfc9a4	fix(builtin_tools/audit): fail-secure RBAC — read-only default when config unavailable Fixes #11 (CWE-285): get_workspace_roles() returned ["operator"] (full delegate/approve/memory.write) when workspace config could not be loaded. Changed to ["read-only"] — deny-by-default per Principle of Least Privilege. Add regression tests in tests/test_audit.py. Also includes: - main.py: remove token prefix log (CWE-532) — issue #10/#17 - a2a_mcp_server.py: RBAC gate on sensitive MCP tools (CWE-862) — issue #12 - cli_executor.py: sanitize stderr in error logs (CWE-209) — issue #13 - tests/test_a2a_mcp_server.py: 5 new regression tests for MCP RBAC Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>	2026-04-20 22:47:38 +00:00
rabbitblood	6cd4d74c5a	test: move sdk stubs to conftest.py (consistent across all test modules)	2026-04-16 11:15:45 -07:00
rabbitblood	a35d128870	test: stub claude_agent_sdk + a2a in session-resume tests CI failed on collect because claude_agent_sdk + a2a aren't test-env deps (they're installed inside the claude-code workspace image). The test file now stubs both via sys.modules so the collector can import claude_sdk_executor without pulling the real SDKs. Tests don't exercise the SDK anyway — only _resolve_resume() glob logic.	2026-04-16 11:13:33 -07:00
rabbitblood	3b56410ad5	fix: gate session resume on file existence (closes #488 ) ## Symptom (cycle 6+ of #488) Workspaces appear `online` (heartbeats fine) but every cron tick fails silently with `No conversation found with session ID: <uuid>` → `ProcessError: exit code 1` → idle loop logs HTTP 200, no actual work happens. Backend Engineer received 5 idle pulses without claiming a single one of the 6 open Hermes issues (#496-500) because the bug prevents `gh issue list` from ever firing. ## Root cause (verified live in ws-20cb8ff8-3e4 today) claude-code stores sessions at `/root/.claude/projects/<cwd-with-/-as-->/<id>.jsonl`. When a workspace container is recreated, `self._session_id` from a prior instance references a file that no longer exists. Passing it as `resume=<id>` to ClaudeAgentOptions crashes the CLI on the very first call. The existing #75 fix only fires AFTER the first ProcessError lands, and per-cycle executor re-instantiation can reload the stale id from elsewhere — restart-with-reset_claude_session was the only working mitigation, hand-fired every cycle. ## Fix New `_resolve_resume()` in ClaudeSDKExecutor: probes a handful of well-known session-file locations (`/root/.claude/projects/*/<id>.jsonl`, `/root/.claude/sessions/<id>.jsonl`, plus the agent-uid variants) via `glob.glob`. If no file matches the in-memory `_session_id`, drops the id (sets to None) AND returns None so `ClaudeAgentOptions.resume` is unset — CLI starts a fresh session. Logged at INFO with `#488` in the message so operators correlate. `_build_options()` now calls `_resolve_resume()` instead of reading `self._session_id` directly. Cheap path when no session set: zero glob calls. Hot path (session set + file exists): one glob call, short-circuits on first match. ## Drive-by fix: stale `from X import` in 4 modules Same regression class as #1 (the runtime release that closed it): - `claude_sdk_executor.py:43`: `from executor_helpers import …` - `cli_executor.py:39-40`: `from config import …`, `from executor_helpers import …` - `main.py:28-30`: `from config import …`, `from heartbeat import …`, `from preflight import …` - `preflight.py:7`: `from config import …` All rewritten to absolute `from molecule_runtime.<module> import …` so they resolve outside of workspace containers (e.g. test environments where `/app` isn't on sys.path). The grep guard in `tests/test_imports.py` already covered `adapters` — extending to all top-level imports would catch this class going forward; not in this PR to keep scope tight. ## Tests 6 new in `tests/test_session_resume_gate.py`: - baseline (no session) → no glob, returns None - file exists → keep id, returns id, single glob (early-exit) - file missing → drop id (clears `_session_id`), returns None - late-pattern match → walks all patterns until hit - log includes session id (operator triage) - log references #488 (debugger discoverability) All 16 tests (10 existing + 6 new) pass. ## Release plan - Bump version 0.1.1 → 0.1.2 (in this commit) - After merge, push v0.1.2 tag → publish.yml auto-publishes to PyPI - Then rebuild workspace template images locally so workspaces pick up the fix (templates pin `>=0.1.0`, will resolve to 0.1.2 on next build) - Then mass-restart workspaces with reset_claude_session=true once to clear any DB-side stale state, and the permanent fix kicks in Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 11:12:03 -07:00
rabbitblood	9cdae9afec	fix: switch top-level `from adapters import` to absolute imports (#1 ) Every modular workspace template repo (claude-code, hermes, langgraph, …) was crashing on boot with: KeyError: "Unknown runtime '<runtime>'. Available: " Root cause: `molecule_runtime/main.py` and four other modules used top-level imports like `from adapters import get_adapter` — a monorepo legacy that resolved when something on sys.path had an `adapters/` package. Standalone template repos COPY only `adapter.py` (singular) to /app and don't ship an `adapters/` package, so this import path went through some side-resolution that left `get_adapter` unable to see the user's adapter. The ADAPTER_MODULE → import → getattr → issubclass chain then silently fell through to the discovery branch and reported "Unknown runtime". Fix is one-line per file: `from adapters` → `from molecule_runtime.adapters` in: - molecule_runtime/main.py:27 - molecule_runtime/a2a_executor.py:44 - molecule_runtime/coordinator.py:20 - molecule_runtime/prompt.py:6 - molecule_runtime/builtin_tools/temporal_workflow.py:417 Tests + CI added so this regression class is caught at PR time, not at runtime in self-hosters' clusters: - tests/test_imports.py: parametrised import smoke for every previously affected module + a grep guard that fails if any future change reintroduces a top-level `from adapters` / `import adapters` line - .github/workflows/ci.yml: runs the smoke on every PR (no CI existed before — the publish workflow only fires on tag push) Closes #1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 07:53:03 -07:00

18 Commits