Commit Graph

57 Commits

Author SHA1 Message Date
Hongming Wang
1b04da2061
Merge pull request #38 from Molecule-AI/fix/auto-detect-llm-token-type
feat(runtime): auto-detect LLM token type, normalise env on boot
2026-04-23 13:53:06 -07:00
Hongming Wang
e562b7a03e
Merge branch 'staging' into fix/auto-detect-llm-token-type
2026-04-23 13:52:25 -07:00
Hongming Wang
3556244725
Merge pull request #40 from Molecule-AI/fix/heartbeat-401-token-refresh-1877
fix(heartbeat): refresh on-disk auth token on 401 + retry once (#1877)
2026-04-23 13:51:42 -07:00
rabbitblood
a78b9f229e test(1877): convert async tests to sync httpx.Client to unblock CI
CI doesn't have pytest-asyncio installed, and the async wrapping was
incidental — the production retry pattern (refresh-on-401) is identical
in sync and async forms. Switching to httpx.Client + MockTransport keeps
the same coverage without the async dep.

6/6 still pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:35:45 -07:00
rabbitblood
050c2412b3 fix(heartbeat): refresh on-disk auth token on 401 + retry once (#1877)
## Problem

Auto-restart rotates the workspace's auth token in two non-atomic steps:
  1. Platform issues new token via wsauth.IssueToken
  2. Provisioner writes the new token to /configs/.auth_token AFTER
     ContainerStart returns

Between steps 1 and 2, the new container has booted and the runtime has
already loaded the OLD cached value of .auth_token (or no value if the
file was empty during boot). The runtime's first /registry/heartbeat
call sends the stale token, gets 401, but the loop never re-reads the
on-disk token — so subsequent heartbeats also send the stale value.

Each 401 means the platform never sees the workspace as alive →
status stays 'provisioning' → scheduler won't dispatch → workspace
looks dead from every angle even though the container is actually
running.

The existing code comment in workspace_provision.go acknowledges this:
"the workspace will get 401 on its first heartbeat and can recover on
the next restart." That recovery only worked because workspaces used
to crash for unrelated reasons and get restarted. After PR #1861
(provisioner empty-volume auto-recover) removed those crashes,
workspaces get stuck in the 401 loop with no exit.

## Fix

Two-part runtime-side fix in molecule-ai-workspace-runtime:

1. **platform_auth.refresh_from_disk()** — new helper that clears the
   in-memory cache and re-reads /configs/.auth_token. Returns the
   fresh value (or None if missing). Updates the cache as a side effect.

2. **HeartbeatLoop._loop()** — on 401 from /registry/heartbeat, calls
   refresh_from_disk() and retries the request ONCE with the new token.
   Same pattern in _check_delegations(). Bounded retry budget — if the
   on-disk token is also stale (bug elsewhere), no infinite loop.
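
The two parts above can be sketched roughly as follows. This is a minimal illustration, not the code in molecule-ai-workspace-runtime: the module-level cache, the `token_path` argument, and the `send` callable (a stand-in for the HTTP POST to /registry/heartbeat that returns a status code) are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Optional

_cached_token: Optional[str] = None  # hypothetical in-memory cache


def refresh_from_disk(token_path: Path) -> Optional[str]:
    """Drop the cached token and re-read it from disk; None if missing or empty."""
    global _cached_token
    _cached_token = None
    try:
        value = token_path.read_text().strip()
    except FileNotFoundError:
        return None
    _cached_token = value or None
    return _cached_token


def post_with_refresh(send: Callable[[Optional[str]], int],
                      token: Optional[str],
                      token_path: Path) -> int:
    """Call send(token); on 401, refresh from disk and retry exactly once."""
    status = send(token)
    if status == 401:
        # Single retry: if the on-disk token is stale too, the 401 is returned.
        status = send(refresh_from_disk(token_path))
    return status
```

Because the retry is a plain `if` rather than a loop, a second 401 simply falls through to the caller, which is the bounded retry budget described above.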

## Tests

6/6 new tests in tests/test_token_refresh_1877.py:

  - refresh_picks_up_rotated_token              — happy path
  - refresh_returns_none_when_file_missing      — defensive
  - refresh_clears_stale_cache_when_file_disappears
  - refresh_is_idempotent
  - 401_retry_pattern_uses_refreshed_token      — the production fix path
  - 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget

All pass locally on Python 3.13 + pytest 9.

## Why this fix and not the alternatives

- **Alternative B (platform writes token before ContainerStart):**
  Right architecturally but invasive — needs provisioner refactor to
  prep volumes before docker run.
- **Alternative C (skip rotation on auto-restart):** Breaks the
  multi-instance-safety invariant the existing code calls out
  (revoke prevents stale tokens from sister deployments).
- **This fix (A):** 3-line core change + helper. Self-healing for any
  timing edge case, not just the post-restart one. Costs nothing in
  the happy path (only triggers on 401).

## Version

Bumped to 0.1.9. Once published to PyPI + workspace template image
rebuilt, deployed workspaces auto-recover from token-rotation races
without operator intervention.

Closes #1877.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:26:36 -07:00
rabbitblood
4bafea58ae fix(llm_auth): tighten base-URL hostname match + strip whitespace + no token in logs
Self-review findings on #38:

1. **Token substring leak**: the "unknown prefix" warning included the
   first 12 chars of the token in the log message. Logs get shipped to
   Langfuse / CloudWatch / slack-firehose — 12 bytes of a secret in a
   log is still 12 bytes too many. Warning no longer references the
   token value at all.

2. **Base-URL substring match was too loose**: `"anthropic.com" not in
   base` would accept `https://proxy.anthropic.com.evil.example/` as
   "looks like Anthropic, keep the URL." Replaced with an allowlist of
   exact hostnames parsed via urllib.parse.urlparse.

3. **Whitespace in pasted tokens**: operators frequently paste tokens
   from terminals with a trailing newline. The token would flow through
   startswith() detection but then fail downstream auth with a
   confusing "malformed token" error. Strip and persist the cleaned
   value.

4. **Malformed base URL crash guard**: if someone sets ANTHROPIC_BASE_URL
   to something urlparse can't handle, don't crash — fall through to
   clearing it, which is the safe choice in OAuth mode.
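
A hedged sketch of the exact-hostname check from items 2 and 4, using only `urllib.parse.urlparse` from the standard library. The `ALLOWED_HOSTS` set and the function name are assumptions for illustration; the commit does not list the real allowlist.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the actual accepted hostnames are not shown in the commit.
ALLOWED_HOSTS = frozenset({"api.anthropic.com"})


def is_allowed_base_url(base_url: str) -> bool:
    """True only when the parsed hostname exactly matches an allowlisted host."""
    try:
        host = urlparse(base_url.strip()).hostname
    except ValueError:
        return False  # unparseable URL: treat it as not allowlisted
    return host is not None and host in ALLOWED_HOSTS
```

With this shape, `https://proxy.anthropic.com.evil.example/` is rejected because its hostname is compared as a whole string, and a URL that `urlparse` cannot handle is treated the same as a non-matching one instead of raising.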

Added 5 new tests covering each of the above. 16/16 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:46:07 -07:00
rabbitblood
0a0f11b41f feat(runtime): auto-detect LLM token type, normalise env on boot
Platform stores per-workspace LLM credentials under a single key
(ANTHROPIC_AUTH_TOKEN in workspace_secrets). But downstream tools
expect different env var names depending on the token type:

  sk-ant-oat01-*  → CLAUDE_CODE_OAUTH_TOKEN  (Claude Code OAuth session)
  sk-ant-api03-*  → ANTHROPIC_API_KEY        (direct Anthropic API)
  sk-cp-*         → ANTHROPIC_AUTH_TOKEN     (proxy: MiniMax, gateways)

Without normalisation, an OAuth token under ANTHROPIC_AUTH_TOKEN gets
sent as a bearer to api.anthropic.com, which responds:

    401 authentication_error: OAuth authentication is currently not
    supported.

This was a platform-wide footgun: anyone rotating LLM keys had to
know the exact env var for each token type, AND make sure stale
overrides were cleared, AND set ANTHROPIC_BASE_URL correctly for
proxies (or NOT set for native Claude). Nothing downstream could
help — the SDK just saw the wrong var.

Fix:

- New molecule_runtime/llm_auth.py — normalise_llm_env() mutates
  os.environ (or any dict) to the correct shape based on token
  prefix. Returns a NormalisationResult for logging.
- main.py calls it as step 0, before any adapter/executor import.
  Every adapter (claude-code, langgraph, crewai, autogen, hermes,
  …) benefits automatically — no per-adapter branching needed.
- 11 unit tests covering all prefix paths, edge cases, and the
  "operator deliberately set CLAUDE_CODE_OAUTH_TOKEN" precedence
  rule.
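
For illustration, the prefix routing could look roughly like the sketch below. It is not the contents of molecule_runtime/llm_auth.py: the function name, the return value, and the choice to let an operator-set target variable win are assumptions, and ANTHROPIC_BASE_URL handling is left out.

```python
from typing import MutableMapping, Optional

# Prefix -> env var expected downstream, taken from the table in this commit message.
PREFIX_TO_VAR = {
    "sk-ant-oat01-": "CLAUDE_CODE_OAUTH_TOKEN",
    "sk-ant-api03-": "ANTHROPIC_API_KEY",
    "sk-cp-": "ANTHROPIC_AUTH_TOKEN",
}


def normalise_env_sketch(env: MutableMapping[str, str]) -> Optional[str]:
    """Move ANTHROPIC_AUTH_TOKEN to the variable matching its prefix.

    Returns the variable the token ended up under, or None if nothing changed.
    """
    token = env.get("ANTHROPIC_AUTH_TOKEN", "").strip()
    if not token:
        return None
    for prefix, target in PREFIX_TO_VAR.items():
        if token.startswith(prefix):
            if target != "ANTHROPIC_AUTH_TOKEN":
                del env["ANTHROPIC_AUTH_TOKEN"]
                # Assumption: a value the operator already set takes precedence.
                env.setdefault(target, token)
            else:
                env[target] = token  # keep the whitespace-stripped value
            return target
    return None  # unknown prefix: leave the mapping untouched
```

Passing a plain dict instead of `os.environ` keeps the function easy to test, which matches the "os.environ (or any dict)" wording above.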

Operationally: this means operators can keep using one
ANTHROPIC_AUTH_TOKEN slot in platform settings and just paste
whatever token the agent needs. No env-var-name awareness required.

Tested locally: 11/11 new tests pass. 83 other tests unchanged
(pre-existing failures on staging are all unrelated:
test_workspace_id_validation, test_a2a_mcp_server RBAC, the
test_imports.main module-walker — same signature as on staging
HEAD before this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:41:47 -07:00
molecule-ai[bot]
dcb6edd1a1
fix(shared_runtime): push heartbeat on CLEAR in set_current_task() (#37)
Fixes #1372 — phantom busy: canvas showed workspace as active for up
to 30s after task completion because set_current_task("") returned
early without posting the updated heartbeat.

Before: clearing only updated the heartbeat object; the next 30s
scheduled heartbeat cycle propagated the clear. Quick tasks would leave
a phantom-busy indicator.

After: both SET and CLEAR push immediately to /registry/heartbeat.
active_tasks=0 on clear, active_tasks=1 on set. Heartbeat object
update and HTTP post are now unconditional.

Tests: 5 new cases covering SET/CLEAR HTTP body, error resilience,
None heartbeat, and missing env vars.

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
2026-04-22 17:33:42 +00:00
rabbitblood
1e545ed6ba chore: bump 0.1.8 — executor_helpers phantom-busy fix confirmed in tree
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 8s
2026-04-21 07:16:47 -07:00
rabbitblood
5a1990552d chore: bump 0.1.7 — ensure executor_helpers phantom-busy fix in PyPI build
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 7s
2026-04-21 07:07:17 -07:00
rabbitblood
59f54560a0 Merge branch 'main' of https://github.com/Molecule-AI/molecule-ai-workspace-runtime into fix/507-mcp-server-path-absolute-imports
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 6s
# Conflicts:
#	pyproject.toml
2026-04-21 06:37:38 -07:00
rabbitblood
d3235cc564 fix(heartbeat): increment/decrement active_tasks + push on clear (#1372, #1408)
Both set_current_task() implementations (shared_runtime.py + executor_helpers.py):
- Increment active_tasks on task start, decrement on completion (was binary 0/1)
- Push heartbeat immediately on BOTH increment AND decrement
- Only clear current_task when active_tasks reaches 0 (preserves description
  for still-running tasks)
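
A minimal sketch of the counter behaviour listed above, assuming an empty description means "task finished" as in `set_current_task("")`. The `TaskTracker` class and the `push` callable (a stand-in for the POST to /registry/heartbeat) are hypothetical names, not the shared_runtime.py or executor_helpers.py code.

```python
from typing import Callable


class TaskTracker:
    """Counts in-flight tasks and pushes a heartbeat on every change."""

    def __init__(self, push: Callable[[int, str], None]) -> None:
        self._push = push  # stand-in for the heartbeat HTTP call
        self.active_tasks = 0
        self.current_task = ""

    def set_current_task(self, description: str) -> None:
        if description:
            self.active_tasks += 1
            self.current_task = description
        else:
            self.active_tasks = max(0, self.active_tasks - 1)
            if self.active_tasks == 0:
                self.current_task = ""  # keep the description while tasks remain
        # Push on both start and completion; there is no early return on clear.
        self._push(self.active_tasks, self.current_task)
```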

Fixes phantom-busy: the old code returned early on clear, leaving
active_tasks=1 in the platform DB until the next 30s heartbeat cycle.
If a new cron fired before the heartbeat, the workspace appeared
permanently busy — required manual DB reset every 30 min.

Bump: 0.1.2 → 0.1.3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 06:37:12 -07:00
Hongming Wang
7febb51382
Merge pull request #36 from Molecule-AI/chore/bump-0.1.5
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 6s
chore: bump to 0.1.5 for X-Molecule-Org-Id header fix
2026-04-20 20:30:54 -07:00
Hongming Wang
742b7d1dfb chore: bump version to 0.1.5 for org-id-header fix
2026-04-20 20:30:31 -07:00
Hongming Wang
4b0185a57b
Merge pull request #35 from Molecule-AI/feat/send-org-id-header
feat(auth): send X-Molecule-Org-Id on every outbound platform call
2026-04-20 20:28:40 -07:00
Hongming Wang
ba5466243b feat(auth): send X-Molecule-Org-Id on every outbound platform call
The SaaS tenant platform's TenantGuard middleware rejects cross-org
routing with synthetic 404s unless the request carries
X-Molecule-Org-Id matching the tenant's MOLECULE_ORG_ID env var. The
runtime never sent it, so every non-allowlisted workspace→platform
path (memories, delegations, notify, a2a, update-card, peers...)
404'd. Paired with CP change feat/workspace-export-org-id which
injects MOLECULE_ORG_ID into workspace user-data env.

auth_headers() now returns both headers — the existing Authorization
bearer AND the new X-Molecule-Org-Id — so every caller that already
threads auth_headers() through httpx picks it up for free. Self-
hosted deployments with MOLECULE_ORG_ID unset keep the old behavior
(no header, TenantGuard is a no-op).
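
As a rough illustration of the behaviour described here, a header builder might look like the following. The function name and the explicit `token` and `env` parameters are assumptions for the sketch; the real auth_headers() signature is not shown in this commit.

```python
import os
from typing import Dict, Mapping, Optional


def auth_headers_sketch(token: Optional[str],
                        env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build outbound platform headers; the org header is added only when set."""
    headers: Dict[str, str] = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    org_id = env.get("MOLECULE_ORG_ID", "").strip()
    if org_id:
        # SaaS tenants: TenantGuard expects this to match the tenant's org id.
        headers["X-Molecule-Org-Id"] = org_id
    return headers
```

When MOLECULE_ORG_ID is unset the result contains only the Authorization header, which corresponds to the self-hosted case above.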

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:28:07 -07:00
molecule-ai[bot]
0e2e1fc2c4
Merge pull request #33 from Molecule-AI/fix/a2a-cli-discover-workspace-id-validation
fix(a2a_cli): validate WORKSPACE_ID in discover() before X-Workspace-ID header
2026-04-21 01:53:19 +00:00
d4b9bff5d0 fix(a2a_cli): validate WORKSPACE_ID in discover() before X-Workspace-ID header
PR #32 wrapped all platform URL construction sites with
get_validated_workspace_id() but missed a2a_cli.discover(), which
passed the raw unvalidated WORKSPACE_ID in the X-Workspace-ID header.
All other functions (peers, info) had try/except guards added.

discover() now calls get_validated_workspace_id() upfront and returns
None (printing the error) if validation fails — consistent with the
best-effort error handling pattern used elsewhere in the module.

Tests: 2 new cases in TestA2aCliDiscoverValidation covering empty
and slash-injected WORKSPACE_ID values.

Follow-up to: PR #32 (fix/908-add-namespace-param-commit-memory)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 01:35:37 +00:00
molecule-ai[bot]
40c30c068a
Merge pull request #32 from Molecule-AI/fix/908-add-namespace-param-commit-memory
fix(CI): set WORKSPACE_ID env var + validation coverage
2026-04-21 01:29:32 +00:00
4bfe6222a6 fix(CI): remove conflicting bandit flags from security linter step
PR #31 added `-ll --severity-level=high` but these flags conflict:
  - `-ll` is the repeated form of `-l/--level` and reports issues of medium
    severity or higher (`-l` is low, `-lll` is high)
  - `--severity-level=high` reports only high-severity issues
The combination causes bandit to exit 2 because `--severity-level` is
not allowed alongside `-l/--level`. Use `--severity-level=high` alone.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:58:43 +00:00
875a8ef952 fix(CI): set WORKSPACE_ID env var for test job
PR #29 introduced WORKSPACE_ID validation at module import time
(platform_auth.py). The CI environment did not set WORKSPACE_ID,
causing 8 failures + 13 errors on every main push. Add a dummy
CI-only value so imports succeed without affecting real workspaces.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:55:08 +00:00
249e5c07eb fix(builtin_tools/validation): complete WORKSPACE_ID validation in a2a_tools.py
Add get_validated_workspace_id() to all 6 remaining unguarded URL positions
in molecule_runtime/a2a_tools.py (the MCP tool body implementations):

- report_activity(): /workspaces/{id}/activity + heartbeat
- tool_delegate_task_async(): /workspaces/{id}/delegate
- tool_check_task_status(): /workspaces/{id}/delegations
- tool_send_message_to_user(): /workspaces/{id}/notify
- tool_commit_memory(): /workspaces/{id}/memories (POST)
- tool_recall_memory(): /workspaces/{id}/memories (GET)

All 6 functions now use validated ws_id. The last remaining unguarded
WORKSPACE_ID use in the entire molecule_runtime package is in
builtin_tools/telemetry.py:142 (metric service name — not a URL path,
low security risk). 67/67 tests pass.
2026-04-21 00:55:08 +00:00
32a7880f4f test+fix(builtin_tools/validation): add test coverage + fix ".." bypass in regex
Tests: 37 new test cases in tests/test_validation.py covering:
- Valid ID patterns (6): normal IDs, underscores, dots, max-length (256)
- Empty/missing (1): raises with "empty" in message
- Invalid chars (10): / \ .. # ? & whitespace
- Caching (2): result is cached; raises on repeated bad calls
- Error type (1): WorkspaceIdValidationError is a ValueError subclass

Fix: regex now uses negative lookahead `(?!.*\.\.)` to reject ".." anywhere
in the string (not just at the start). The old pattern `^[A-Za-z0-9_\-.]{1,256}$`
matched ".." literally because two dots ARE in the allowed character class.
Also adds test cases for embedded ".." (ws..example, ws../etc).
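
The difference between the two patterns can be checked directly. The old pattern is quoted from this commit message; the new one is a reconstruction that prepends the quoted lookahead to it, so treat the exact combined form as an assumption.

```python
import re

# Old pattern, as quoted above: "." is in the character class, so ".." passes.
OLD_PATTERN = re.compile(r"^[A-Za-z0-9_\-.]{1,256}$")

# Assumed new form: the negative lookahead rejects ".." anywhere in the string.
NEW_PATTERN = re.compile(r"^(?!.*\.\.)[A-Za-z0-9_\-.]{1,256}$")


def accepted(pattern: "re.Pattern[str]", workspace_id: str) -> bool:
    """True when the whole workspace id matches the given pattern."""
    return pattern.match(workspace_id) is not None
```

For example, `accepted(OLD_PATTERN, "ws..example")` is true while `accepted(NEW_PATTERN, "ws..example")` is false, and an id with single dots such as `ws-1.prod_a` is accepted by both.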

Fixes: the ".." bypass was a gap in the original CWE-20 fix.
2026-04-21 00:55:08 +00:00
be9c9997c0 fix(builtin_tools/validation): cover remaining WORKSPACE_ID URL usages
Extend get_validated_workspace_id() to all remaining unguarded URL positions:

- consolidation.py: _consolidate() — validates before GET/POST/DELETE to
  /workspaces/{id}/memories endpoints. Graceful skip on failure (log + return).
- coordinator.py: get_children() — validates before /registry/{id}/peers.
  Graceful skip (empty list) on failure.
- molecule_ai_status.py: set_status() — validates before /registry/heartbeat
  and /workspaces/{id}/activity. Exits with descriptive error on failure.

With these three, every runtime use of WORKSPACE_ID in a URL path is now
validated. Remaining WORKSPACE_ID uses are:
- JSON body fields (not injection-risky): heartbeat, memory POST bodies
- Header values (X-Workspace-ID): lower risk, non-URL-injection
2026-04-21 00:55:08 +00:00
42bdf530b5 fix(builtin_tools/validation): extend WORKSPACE_ID validation to top-level modules
Fixes remaining unguarded WORKSPACE_ID URL usages identified after the initial
builtin_tools/ fix:

- a2a_client.py: get_peers() and get_workspace_info() now use
  get_validated_workspace_id() before URL construction. The raw module-level
  constant is still used in the discover_peer() header (low risk, not URL path).
- a2a_cli.py: peers() and info() CLI commands now validate WORKSPACE_ID before
  calling the platform API. Commands exit with error code 1 + descriptive
  message if WORKSPACE_ID is empty or malformed.

Follow-up candidates (lower priority, not URL injection risk):
- coordinator.py: WORKSPACE_ID in registry peer URL
- consolidation.py: WORKSPACE_ID in memory URLs (long-running consolidation job)
- molecule_ai_status.py: WORKSPACE_ID in activity log URL
2026-04-21 00:55:08 +00:00
d52082839f fix(builtin_tools): validate WORKSPACE_ID before URL construction
Add WORKSPACE_ID format validation before every URL/header use to prevent
URL injection (CWE-20 / CWE-88). The validator:
- Rejects empty values (fail-fast with clear error)
- Rejects path-traversal chars (/ \ ..) and fragment/query chars (# ? &)
- Accepts alphanumeric, hyphen, underscore, dot (typical ID formats)
- Caches the result after first successful call (zero overhead per call)

Validated in:
- memory.py: commit_memory, search_memory (both awareness-client + httpx paths)
- approval.py: _create_approval_request, _wait_polling
- delegation.py: _notify_completion, _record_delegation_on_platform,
  _update_delegation_on_platform
- a2a_tools.py: list_peers, delegate_task

Fixes #14.
2026-04-21 00:55:08 +00:00
molecule-ai[bot]
548549d5e9
feat(CI): add bandit security linter (audit rec #2) (#31)
Bandit runs on every PR against molecule_runtime/ at high severity.
Addresses audit recommendation from issue #9.

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:23:17 +00:00
molecule-ai[bot]
30d96b4e4e
fix(platform_auth): validate WORKSPACE_ID at import time (issue #14, CWE-20) (#29)
WORKSPACE_ID was read via os.environ.get("WORKSPACE_ID", "") in multiple
builtin_tools modules and used directly in platform API URLs and X-Workspace-ID
headers without validation. A crafted ID containing /, .., or # could cause
URL path injection.

Fix: validate_workspace_id() in platform_auth.py now validates the ID format
at module import time using a regex that permits only lowercase alphanumerics
and hyphens (matching UUIDs and org-generated IDs). The validated value is
exposed as a module-level WORKSPACE_ID constant. builtin_tools/approval.py
and builtin_tools/delegation.py now import from platform_auth instead of
reading os.environ directly.

Failing input raises ValueError with a clear message — workspace fails fast
at startup rather than silently accepting malformed IDs in requests.

Add 15 regression tests (45/45 passing total).

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-21 00:04:54 +00:00
Hongming Wang
953aa2847c
Merge pull request #30 from Molecule-AI/fix/adapter-loader-find-subclass
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 7s
fix(adapter-loader): fall back to any BaseAdapter subclass
2026-04-20 16:59:38 -07:00
Hongming Wang
4aa0d9f110 fix(adapter-loader): fall back to any BaseAdapter subclass
ADAPTER_MODULE resolution required the imported module to export a
class literally named `Adapter`. The claude-code, langgraph, and
openclaw adapter-template repos (3 of 4 currently in production) don't
ship that alias — they export ClaudeCodeAdapter / LangGraphAdapter /
OpenClawAdapter directly. Only hermes has the `Adapter = HermesAdapter`
shim at the bottom of adapter.py.

Consequence in prod: every fresh claude-code / langgraph / openclaw
workspace crashed at runtime startup with
"module 'adapter' has no attribute 'Adapter'", even with a2a-sdk
correctly pinned <1.0. Provisioning looked successful from CP's side
(EC2 ran) but the agent never registered because the process never
reached A2A bootstrap.

Fix: if `Adapter` is absent from the imported module, scan the module
for any attribute that is a proper BaseAdapter subclass (excluding
BaseAdapter itself — regression guard in tests). The explicit alias
remains the preferred contract; this is purely additive tolerance.
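
The fallback could be sketched like this. The `BaseAdapter` class here is a local stand-in and `find_adapter_class` is a hypothetical name; the loader's real structure is not shown in the commit.

```python
import inspect
import types


class BaseAdapter:
    """Stand-in for the runtime's adapter base class."""


def find_adapter_class(module: types.ModuleType) -> type:
    """Prefer the explicit `Adapter` alias, else any proper BaseAdapter subclass."""
    explicit = getattr(module, "Adapter", None)
    if inspect.isclass(explicit) and issubclass(explicit, BaseAdapter):
        return explicit
    for _name, obj in inspect.getmembers(module, inspect.isclass):
        # A re-exported BaseAdapter must never be selected as the adapter.
        if issubclass(obj, BaseAdapter) and obj is not BaseAdapter:
            return obj
    raise AttributeError(
        f"module {module.__name__!r} exports no BaseAdapter subclass"
    )
```

Note that `inspect.getmembers` returns members sorted by name, so if a module exported several subclasses this sketch would pick the alphabetically first one; the explicit alias avoids that ambiguity.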

Bump to 0.1.4 and publish to PyPI via the existing v* tag trigger.

6 new tests cover: explicit alias, subclass-fallback, non-adapter-noise
ignored, empty module → error, missing module → error, re-exported
BaseAdapter → not selected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:59:12 -07:00
molecule-ai[bot]
457adcbd64
Merge pull request #28 from Molecule-AI/fix/908-add-namespace-param-commit-memory
feat(builtin_tools/memory): add namespace param to commit_memory and search_memory
2026-04-20 23:18:45 +00:00
ecc0a231bf feat(builtin_tools/memory): add optional namespace param to commit_memory and search_memory
Adds optional namespace parameter so agents can organize memories into named
buckets (e.g. "facts", "procedures", "blockers"). Defaults to "general".

- commit_memory(content, scope, *, namespace=None): namespace normalised to
  "general" when None or whitespace-only, forwarded to awareness client and
  included in httpx POST body.
- search_memory(query, scope, *, namespace=None): namespace forwarded as
  ?namespace= query param (omitted when None), matching the existing behaviour
  for the scope param.
- AwarenessClient.commit() and .search() updated to accept namespace kwarg.
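
A small sketch of the two normalisation rules described above. The helper names and the `"q"` query-parameter name are hypothetical; only the `"general"` default and the "omit when None" behaviour come from the commit message.

```python
from typing import Dict, Optional

DEFAULT_NAMESPACE = "general"


def commit_namespace(namespace: Optional[str]) -> str:
    """Namespace stored with a memory: the default when None or whitespace-only."""
    if namespace is None or not namespace.strip():
        return DEFAULT_NAMESPACE
    return namespace


def search_params(query: str, namespace: Optional[str] = None) -> Dict[str, str]:
    """Query params for a search; namespace is left out entirely when None."""
    params = {"q": query}  # "q" is a hypothetical parameter name
    if namespace is not None:
        params["namespace"] = namespace
    return params
```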

Fixes #908.
2026-04-20 23:12:32 +00:00
molecule-ai[bot]
830381d40b
Merge pull request #27 from Molecule-AI/fix/cli-auth-helper-and-sandbox-warn
fix(cli_executor + sandbox): CWE-78 auth helper + subprocess isolation warning
2026-04-20 23:07:07 +00:00
83f87702ea fix(cli_executor + sandbox): CWE-78 auth helper + subprocess warning
Issue #21 (CWE-78): _create_auth_helper() wrote a shell script using
shlex.quote() which does NOT protect against $(...) command substitution
inside the token value. Replaced with a mode-0600 token file passed via
AGENT_AUTH_TOKEN_FILE env var — token is never interpreted by a shell.
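
One way to hand a token to a subprocess without putting it in a shell script is sketched below, assuming a POSIX filesystem. The helper names are hypothetical; only the mode-0600 file and the AGENT_AUTH_TOKEN_FILE variable come from the commit message.

```python
import os
import tempfile
from typing import Dict


def write_token_file(token: str, directory: str) -> str:
    """Write the token to a new file readable only by the current user."""
    # mkstemp creates the file with mode 0600 and never reuses an existing path.
    fd, path = tempfile.mkstemp(prefix="agent-token-", dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(token)
    return path


def child_env(token_path: str) -> Dict[str, str]:
    """Environment for the CLI subprocess: it carries the path, not the token."""
    env = dict(os.environ)
    env["AGENT_AUTH_TOKEN_FILE"] = token_path
    return env
```

The token bytes are written and later read as data, so characters such as `$(` or backticks in the value are never parsed by a shell.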

Issue #22 (CWE-266): sandbox subprocess backend warns once at module
load time when active, alerting operators that SANDBOX_BACKEND=docker or
e2b should be used for production isolation.

Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-20 23:05:57 +00:00
molecule-ai[bot]
2bb0f97085
Merge pull request #26 from Molecule-AI/fix/plugin-setup-env-scrub
fix(plugins_registry/builtins): strip API keys from plugin setup.sh env
2026-04-20 23:04:33 +00:00
molecule-ai[bot]
097908e707
Merge pull request #25 from Molecule-AI/fix/security-failopen-rbac-and-token-log-v2
fix(builtin_tools/audit): fail-secure RBAC + 3 additional security fixes
2026-04-20 23:04:31 +00:00
d6944086fe fix(plugins_registry/builtins): strip API keys from plugin setup.sh env
Issue #19 (CWE-312): AgentskillsAdaptor.install() passed the full
os.environ to the subprocess running setup.sh, including
ANTHROPIC_API_KEY, OPENAI_API_KEY, GITHUB_TOKEN, WORKSPACE_AUTH_TOKEN,
etc. A malicious or compromised plugin's setup.sh could exfiltrate them.

Fix: _scrubbed_env() builds a copy of os.environ with sensitive keys
removed, matching the same _SCRUB_KEYS list used in skill_loader/loader.py
so the scrubbing policy is consistent. CONFIGS_DIR is still passed via
the extra dict. Non-secret vars (PATH, HOME, etc.) are preserved.
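
A minimal sketch of the scrubbing idea. The key set below is an illustrative subset built from the variables named in this message, not the actual _SCRUB_KEYS list, and the function signature is an assumption.

```python
import os
from typing import Dict, Mapping, Optional

# Illustrative subset only; the real list lives in skill_loader/loader.py.
SCRUB_KEYS = frozenset({
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "GITHUB_TOKEN",
    "WORKSPACE_AUTH_TOKEN",
})


def scrubbed_env(base: Optional[Mapping[str, str]] = None,
                 extra: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment without sensitive keys, then apply extra vars."""
    source = os.environ if base is None else base
    env = {key: value for key, value in source.items() if key not in SCRUB_KEYS}
    env.update(extra or {})
    return env
```

The result can be passed as the `env=` argument of `subprocess.run`, so setup.sh sees PATH and HOME but none of the listed secrets.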

Add 6 regression tests (30/30 passing).

Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-20 22:52:13 +00:00
c72fbfc9a4 fix(builtin_tools/audit): fail-secure RBAC — read-only default when config unavailable
Fixes #11 (CWE-285): get_workspace_roles() returned ["operator"] (full
delegate/approve/memory.write) when workspace config could not be loaded.
Changed to ["read-only"] — deny-by-default per Principle of Least
Privilege. Add regression tests in tests/test_audit.py.
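
The deny-by-default behaviour can be sketched as below. The `load_config` callable and the `"roles"` key are hypothetical stand-ins for however the workspace config is actually loaded; treating a config with no roles as read-only is likewise an assumption.

```python
from typing import Any, Callable, List, Mapping

FALLBACK_ROLES = ["read-only"]


def get_workspace_roles_sketch(load_config: Callable[[], Mapping[str, Any]]) -> List[str]:
    """Return configured roles, or read-only when the config cannot be loaded."""
    try:
        roles = load_config().get("roles")
    except Exception:
        # Fail secure: an unreadable config must not grant operator rights.
        return list(FALLBACK_ROLES)
    return list(roles) if roles else list(FALLBACK_ROLES)
```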

Also includes:
- main.py: remove token prefix log (CWE-532) — issue #10/#17
- a2a_mcp_server.py: RBAC gate on sensitive MCP tools (CWE-862) — issue #12
- cli_executor.py: sanitize stderr in error logs (CWE-209) — issue #13
- tests/test_a2a_mcp_server.py: 5 new regression tests for MCP RBAC

Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-20 22:47:38 +00:00
Hongming Wang
0d1c8e711f
Merge pull request #24 from Molecule-AI/fix/pin-a2a-sdk-pre-1-0
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 32s
fix: pin a2a-sdk<1.0 — keep a2a.server.apps import working
2026-04-20 15:36:15 -07:00
Hongming Wang
90a1bdbbf4 fix: pin a2a-sdk<1.0 to keep a2a.server.apps import working
a2a-sdk 1.0.0 restructured the package and removed a2a.server.apps,
which main.py imports directly for A2AStarletteApplication. The
current >=0.3.25 constraint resolves to 1.0.0 on fresh installs and
the runtime crashes at startup with ModuleNotFoundError — which is
exactly what bit production workspace EC2 instances provisioned on
2026-04-20.

Bump to 0.1.3 and pin <1.0 until we're ready to migrate to the 1.x
import paths. Companion fix in molecule-controlplane PR #174 pins at
pip-install time; this PR fixes the upstream package so other callers
don't re-hit the same trap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 15:34:27 -07:00
molecule-ai[bot]
2391952eae
Merge pull request #7 from Molecule-AI/fix/auth-headers-and-pip-audit
fix: add auth headers to skill promotion logs and improve pip-audit severity parsing
2026-04-20 08:50:26 -07:00
Molecule AI Backend Engineer 3
fa64a04cba fix: add auth headers to skill promotion logs and improve pip-audit severity parsing
- Extract _auth_headers_for_platform() helper so _maybe_log_skill_promotion()
  includes auth headers when calling /workspaces/:id/activity (was missing)
- Improve pip-audit severity parsing: if fix_versions is present, severity
  is 'high' (patch available); otherwise 'medium' (no known fix yet)
2026-04-20 05:03:22 +00:00
rabbitblood
2da6f2d1cd Merge branch 'main' of https://github.com/Molecule-AI/molecule-ai-workspace-runtime into fix/507-mcp-server-path-absolute-imports
2026-04-17 21:36:51 -07:00
rabbitblood
d1719dd2a6 fix: strip CRLF from .sh/.py files in plugin hook installer — permanent #507 fix
The TRUE root cause of recurring CRLF: shutil.copy2() in
_copy_dir_files() copies hook files byte-for-byte from /plugins/
(mounted from Windows host) into /configs/.claude/hooks/. Windows
git checkout introduces \r\n regardless of .gitattributes.

Previous fixes were band-aids:
- .gitattributes eol=lf (only works for files IN git, not host disk)
- entrypoint.sh sed strip (runs at boot but after plugin install)
- provisioner CopyTemplateToContainer strip (wrong code path — hooks
  come through the Python plugin installer, not the Go template copier)

This fix strips CRLF at the single point where ALL plugin hooks enter
a container: _copy_dir_files() in builtins.py. read_bytes() + replace
+ write_bytes for .sh/.py files. Other file types pass through unchanged.
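
The read_bytes, replace, write_bytes step described above might look like this in isolation. The function name is hypothetical, and copying the permission bits with `shutil.copymode` is an assumption added so that hook scripts stay executable.

```python
import shutil
from pathlib import Path

TEXT_SUFFIXES = {".sh", ".py"}


def copy_hook_file(src: Path, dst: Path) -> None:
    """Copy one file, normalising CRLF to LF for shell and Python sources."""
    if src.suffix in TEXT_SUFFIXES:
        dst.write_bytes(src.read_bytes().replace(b"\r\n", b"\n"))
        shutil.copymode(src, dst)  # assumption: preserve the executable bit
    else:
        shutil.copy2(src, dst)  # other file types pass through unchanged
```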

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:36:43 -07:00
Hongming Wang
ceeec69c8c
Merge pull request #6 from Molecule-AI/fix/507-mcp-server-path-absolute-imports
fix: resolve MCP server path from package + absolute imports (2nd half of #507)
2026-04-16 13:47:54 -07:00
rabbitblood
18d904cfc1 fix: MCP server path resolution + absolute imports (2nd half of #507)
The a2a MCP subprocess was launched with a hard-coded /app/a2a_mcp_server.py
path that only existed in the legacy workspace-template layout. Current
templates copy adapter.py into /app but not the MCP server script, so
claude-code's mcp_servers={"a2a": ...} config spawned a non-existent file,
the server never registered any tools, and every agent reported that
search_memory / commit_memory / list_peers / delegate_task / send_message_to_user
were unavailable in the tool registry.

This surfaced in the current cycle after the CRLF hook fix (PR
molecule-core#508 + plugin repo's .gitattributes) cleared the primary
"(no response generated)" symptom. Until then, agents crashed before the
missing MCP server could be observed: the two bugs stacked.

Changes
-------
* executor_helpers._default_mcp_server_path: resolves the installed
  molecule_runtime.a2a_mcp_server module's __file__ so the path is
  always correct regardless of template layout. Legacy /app path kept
  as last-resort fallback for any old images still in rotation.
* a2a_mcp_server.py, a2a_tools.py, a2a_client.py: convert bare module
  imports (from a2a_tools import ...) to absolute
  (from molecule_runtime.a2a_tools import ...). Previously this worked
  only when main.py injected the package dir onto sys.path; the MCP
  subprocess doesn't go through main.py, so the bare imports would fail.
  Added a sys.path shim at the top of a2a_mcp_server.py so running as a
  standalone script (python path/to/a2a_mcp_server.py) still works —
  the subprocess can now locate the package root automatically.
* consolidation.py, heartbeat.py, main.py: same bare-to-absolute
  conversion for platform_auth imports (unblocks the same class of
  failure if any of these modules are imported from a non-main.py
  entrypoint in the future).
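
Resolving a script path from the installed package rather than a fixed location can be sketched as follows. This example uses `importlib.util.find_spec`, which locates the module without executing it; the commit itself describes reading the module's `__file__`, so treat this as a variant of the same idea. The function name and parameters are hypothetical.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def resolve_module_path(module_name: str,
                        legacy_fallback: Optional[str] = None) -> Optional[str]:
    """Return the file a module would load from, else the legacy fallback."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None  # e.g. the parent package is not installed
    if spec is not None and spec.origin and Path(spec.origin).is_file():
        return spec.origin
    return legacy_fallback
```

Called as `resolve_module_path("molecule_runtime.a2a_mcp_server", "/app/a2a_mcp_server.py")`, this would return the installed module's path when the package is importable and the legacy /app path otherwise.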

Verification
------------
Deployed the updated files into ws-8010dbd0 (PM) and ran an isolated
sdk.query() as agent user. SystemMessage.init.mcp_servers now reports
[{'name': 'a2a', 'status': 'connected'}] and the tools list includes
all 8 mcp__a2a__* entries:
  mcp__a2a__check_task_status, mcp__a2a__commit_memory,
  mcp__a2a__delegate_task, mcp__a2a__delegate_task_async,
  mcp__a2a__get_workspace_info, mcp__a2a__list_peers,
  mcp__a2a__recall_memory, mcp__a2a__send_message_to_user

Rolled the in-container hotfix across all 22 workspaces pending release
(docker cp the 4 changed files into each site-packages/molecule_runtime/).

Fixes Molecule-AI/molecule-core#507 (secondary)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:28:57 -07:00
Hongming Wang
d140999a09
Merge pull request #5 from Molecule-AI/fix/488-session-file-existence-gate
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 34s
fix: gate session resume on file existence (closes #488 in molecule-core)
2026-04-16 11:16:26 -07:00
rabbitblood
6cd4d74c5a test: move sdk stubs to conftest.py (consistent across all test modules)
2026-04-16 11:15:45 -07:00
rabbitblood
a35d128870 test: stub claude_agent_sdk + a2a in session-resume tests
CI failed on collect because claude_agent_sdk + a2a aren't test-env deps
(they're installed inside the claude-code workspace image). The test file
now stubs both via sys.modules so the collector can import
claude_sdk_executor without pulling the real SDKs. Tests don't exercise
the SDK anyway — only _resolve_resume() glob logic.
2026-04-16 11:13:33 -07:00
rabbitblood
3b56410ad5 fix: gate session resume on file existence (closes #488)
## Symptom (cycle 6+ of #488)
Workspaces appear `online` (heartbeats fine) but every cron tick fails
silently with `No conversation found with session ID: <uuid>` →
`ProcessError: exit code 1` → idle loop logs HTTP 200, no actual work
happens. Backend Engineer received 5 idle pulses without claiming a
single one of the 6 open Hermes issues (#496-500) because the bug
prevents `gh issue list` from ever firing.

## Root cause (verified live in ws-20cb8ff8-3e4 today)
claude-code stores sessions at `/root/.claude/projects/<cwd-with-/-as-->/<id>.jsonl`.
When a workspace container is recreated, `self._session_id` from a
prior instance references a file that no longer exists. Passing it as
`resume=<id>` to ClaudeAgentOptions crashes the CLI on the very first
call. The existing #75 fix only fires AFTER the first ProcessError
lands, and per-cycle executor re-instantiation can reload the stale id
from elsewhere — restart-with-reset_claude_session was the only working
mitigation, hand-fired every cycle.

## Fix
New `_resolve_resume()` in ClaudeSDKExecutor: probes a handful of
well-known session-file locations (`/root/.claude/projects/*/<id>.jsonl`,
`/root/.claude/sessions/<id>.jsonl`, plus the agent-uid variants) via
`glob.glob`. If no file matches the in-memory `_session_id`, drops the
id (sets to None) AND returns None so `ClaudeAgentOptions.resume` is
unset — CLI starts a fresh session. Logged at INFO with `#488` in the
message so operators correlate.

`_build_options()` now calls `_resolve_resume()` instead of reading
`self._session_id` directly. Cheap path when no session set: zero
glob calls. Hot path (session set + file exists): one glob call,
short-circuits on first match.
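
A simplified sketch of the gate, under the assumption that each candidate location is expressed as a glob pattern with a `{sid}` placeholder. The `ResumeGate` class is hypothetical and omits the INFO logging; it is not the ClaudeSDKExecutor code.

```python
import glob
from typing import Optional, Sequence


class ResumeGate:
    """Keeps a session id only while a matching session file exists on disk."""

    def __init__(self, patterns: Sequence[str],
                 session_id: Optional[str] = None) -> None:
        self._patterns = patterns  # e.g. "/root/.claude/projects/*/{sid}.jsonl"
        self._session_id = session_id

    def resolve_resume(self) -> Optional[str]:
        if not self._session_id:
            return None  # cheap path: no session set, no glob calls
        safe_id = glob.escape(self._session_id)
        for pattern in self._patterns:
            if glob.glob(pattern.format(sid=safe_id)):
                return self._session_id  # stop at the first match
        self._session_id = None  # stale id: drop it so a fresh session starts
        return None
```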

## Drive-by fix: stale `from X import` in 4 modules
Same regression class as #1 (the runtime release that closed it):
- `claude_sdk_executor.py:43`: `from executor_helpers import …`
- `cli_executor.py:39-40`: `from config import …`, `from executor_helpers import …`
- `main.py:28-30`: `from config import …`, `from heartbeat import …`, `from preflight import …`
- `preflight.py:7`: `from config import …`

All rewritten to absolute `from molecule_runtime.<module> import …`
so they resolve outside of workspace containers (e.g. test environments
where `/app` isn't on sys.path). The grep guard in `tests/test_imports.py`
already covered `adapters` — extending to all top-level imports would
catch this class going forward; not in this PR to keep scope tight.

## Tests
6 new in `tests/test_session_resume_gate.py`:
- baseline (no session) → no glob, returns None
- file exists → keep id, returns id, single glob (early-exit)
- file missing → drop id (clears `_session_id`), returns None
- late-pattern match → walks all patterns until hit
- log includes session id (operator triage)
- log references #488 (debugger discoverability)

All 16 tests (10 existing + 6 new) pass.

## Release plan
- Bump version 0.1.1 → 0.1.2 (in this commit)
- After merge, push v0.1.2 tag → publish.yml auto-publishes to PyPI
- Then rebuild workspace template images locally so workspaces pick up the
  fix (templates pin `>=0.1.0`, will resolve to 0.1.2 on next build)
- Then mass-restart workspaces with reset_claude_session=true once to clear
  any DB-side stale state, and the permanent fix kicks in

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 11:12:03 -07:00