Commit Graph

38 Commits

Author SHA1 Message Date
Hongming Wang
2d514612a2 fix(pre-commit): align SECRET_PATTERNS with molecule-core canonical (add sk-cp-)
The bundled pre-commit hook is the runtime-side mirror of
molecule-core's canonical .github/workflows/secret-scan.yml SECRET_PATTERNS
array. They drifted: canonical added the MiniMax sk-cp- pattern
(F1088 vector — caught only after the fact) but this side wasn't
updated. Result: a workspace developer's local pre-commit would let
through a sk-cp- token that the org-wide CI scan would then refuse —
useless friction.

This brings the two sides back into byte-aligned-on-the-pattern-list
state. The drift is exactly the maintenance gap that task #139's
upcoming molecule-core CI lint is designed to surface automatically;
this PR clears the gap so the lint passes from day 1.

Refs: task #139.
2026-04-28 15:21:13 -07:00
rabbitblood
f1bede31a8 feat(precommit): add secret scan to bundled pre-commit hook (defense-in-depth for #2090-style leaks)
Adds a secret-scan gate alongside the existing internal-paths block in
the runtime's bundled pre-commit hook. Runs on every commit in every
repo (not scoped to Molecule-AI public repos like the internal-paths
block) — refuses any staged addition matching a high-value credential
shape and prints a recovery message that does NOT echo the secret value.

Pattern set covers GitHub family (ghp_, ghs_, gho_, ghu_, ghr_,
github_pat_), Anthropic / OpenAI / Slack / AWS — same shape as the
tenant-proxy CI scanner; keep aligned when either side adds a pattern.

Single hook file dispatches both checks (renamed
pre-commit-block-internal-paths.sh → pre-commit-checks.sh) so each
agent commit pays one git-config + one hook-install surface, not two.
Both checks share the existing fast-paths (skip if GIT_AUTHOR_NAME
unset; skip during rebase / cherry-pick / merge / revert).

End-to-end test exercises a real bash subprocess against a real temp
git repo with real staged content. Three cases:
 - ghs_-prefixed token in package.json (the actual #2090 vector) → refuse
 - clean README → pass through
 - sk-ant- key in a non-Molecule-AI repo → refuse (secret scan is universal,
   internal-paths block is not)

Skipped when bash is not on PATH so Windows test environments without
WSL stay green.

Bumps version 0.1.15 → 0.1.16.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 11:57:39 -07:00
Hongming Wang
7fc3537b2f
Merge pull request #42 from Molecule-AI/fix/stderr-capture-a2a-response
fix(runtime): capture stderr in A2A error response (closes #66)
2026-04-24 13:25:15 -07:00
rabbitblood
6ead3b433e fix(a2a): include exception class + error code in [A2A_ERROR] (#51)
When an exception's str() is empty (bare TimeoutError(), BrokenPipeError(),
some httpx transport errors) `f"{_A2A_ERROR_PREFIX}{e}"` produced
`"[A2A_ERROR] "` with a trailing space and zero diagnostic context,
masking the real cause of peer-delegation failures in activity_logs.

Observed on main monorepo: 22+ occurrences in 75 min across 7 leads
during the MiniMax M2.7 trial rate-limit episode — zero breadcrumbs
to route the debug from.

Fix:
- Exception branch: fall back to `type(e).__name__` when str(e) is empty
- Error branch: include JSON-RPC `error.code` alongside message when present

Tests: test_a2a_error_observability.py covers both the bare-exception
path (must surface class name) and the message-passthrough path (must
preserve existing useful messages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 11:22:57 -07:00
rabbitblood
c43df7f947 fix(precommit): skip during rebase/cherry-pick/merge/revert — unblocks DIRTY PR rebase
Trace from molecule-core cycle 107 (2026-04-24): 15 staging PRs stuck
DIRTY (real merge conflicts) with 0 merges in 1+ hours. Authors couldn't
rebase to fix the conflicts because the pre-commit hook (shipped in
0.1.11) refuses ANY commit that includes forbidden paths in the diff —
including rebase replays of historical commits that pre-date the gate.

Specifically, agents trying to `git rebase staging` on a PR like
"docs(marketing): Phase 30 social copy" fail at the first commit replay
because that commit added marketing/* files. The fix would require
interactive rebase + manual file deletion + commit amend — agents don't
do that, so the PR stays DIRTY indefinitely.

Detection: check .git for rebase-merge/, rebase-apply/, CHERRY_PICK_HEAD,
MERGE_HEAD, or REVERT_HEAD. These state markers exist only during the
corresponding git operation. Skip the hook silently when present.

The hook still blocks fresh `git commit` (the failure mode it was
designed for). It just doesn't try to police what was already in git
history.

Bumped to 0.1.14.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 04:34:55 -07:00
rabbitblood
19f0033222 fix: enable v0_3 compat in JSON-RPC dispatcher — platform sends old method names
Sister fix to 0.1.12 (root mounting). After fixing the route mount,
every inbound A2A still returned `-32601 Method not found` because the
1.x dispatcher's method table doesn't recognize v0.3-shaped names
(`message/send`, `tasks/get`) that the platform's ProxyA2A still sends.

Reproduces in the SDK on a minimal handler:
    create_jsonrpc_routes(h, "/")                          → "Method not found"
    create_jsonrpc_routes(h, "/", enable_v0_3_compat=True) → dispatches OK

Bumped to 0.1.13. Both 0.1.12 and 0.1.13 are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 02:37:07 -07:00
rabbitblood
30ebe9baf3 fix: mount JSON-RPC at root — platform POSTs to /, not /api/v1/jsonrpc/
Baseline restart 2026-04-24: every workspace came up healthy (uvicorn
listening, agent-card serving) but produced zero delegations for two
maintenance cycles. Tracing revealed platform's ProxyA2A POSTs to
`http://ws-<id>:8000/` (no path suffix, see
workspace-server/internal/provisioner.InternalURL) while the runtime's
JSON-RPC routes were mounted at `/api/v1/jsonrpc/` under the a2a-sdk
1.x API migration.

Result was silent — every inbound A2A returned 404 Not Found, the
platform logged "Not Found" at INFO level, but no error bubbled up
because the SDK's jsonrpc route factory doesn't respond to root when
mounted at a subpath. Agents stayed warm, crons fired, but no work
flowed.

Fix: `create_jsonrpc_routes(handler, "/")` — matches platform
expectation and the agent-card self-advertisement (which also shows
root as the JSON-RPC URL). Agent-card route keeps its hard-coded
`/.well-known/agent-card.json` path so there's no collision.

Bumped to 0.1.12.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 02:06:04 -07:00
rabbitblood
89739bf848 feat: pre-commit hook to block internal paths in public monorepo (A)
Anti-leak proposal item A. Companion to D (decision tree in role
prompts, separate PR on org-templates).

Why a local pre-commit hook
===========================

Agents try to `git add /research/foo.md` despite SHARED_RULES, the
.gitignore patterns, and the CI gate. Each leak attempt costs ~5 cycles
(PR opens, CI fails, agent retries with workaround) and pollutes git
history with reverts.

A pre-commit hook converts the failure from "PR opens then fails" →
"commit refused immediately, with the recovery command printed in the
same error message the agent reads." Agents act on what's in the
current response context — putting the redirect command literally in
the failure output is the highest-density feedback we can provide.

What changes
============

- molecule_runtime/scripts/pre-commit-block-internal-paths.sh —
  bash hook. Checks `git remote get-url origin`, only enforces in
  Molecule-AI/molecule-monorepo + molecule-core. In every other repo
  (internal, plugins, templates, third-party) it's a no-op.

  When forbidden paths are staged, refuses the commit with the redirect
  recipe + the alternative public-facing paths + the workflow-edit path
  for legitimate exceptions.

- molecule_runtime/precommit_hook.py — install_pre_commit_hook():
  1. Extracts bundled hook to ~/.molecule-runtime/git-hooks/pre-commit
  2. chmod +x
  3. Sets core.hooksPath globally — UNLESS already set by an operator
     (then logs a warning + skips, doesn't clobber)

- molecule_runtime/main.py — calls install_pre_commit_hook() at
  step 0.2, right after install_credential_helper()

- pyproject.toml bumped to 0.1.11

Both A and D together close the loop: D ensures the agent knows the
right path before writing; A enforces it at the local git boundary if
the agent forgets. CI gate remains the third backstop for anything
that gets pushed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:48:47 -07:00
rabbitblood
f1329fe230 feat: ship GitHub credential-helper inline in runtime (fixes #1933 class)
Lifts the per-template wiring (Dockerfile COPY + entrypoint.sh git config
+ nohup daemon launch) into the Python runtime. Templates that depend
on molecule-ai-workspace-runtime get the behavior automatically — they
no longer need to maintain their own copy of the helper scripts or
remember to write the right git config in their entrypoint.

Background:
- GitHub App installation tokens (ghs_…) expire ~60min after issue
- claude-code-default template shipped without wiring → 39 workspaces
  lost their tokens, three PMs' A2A queues filled with retry-status
  messages, manual fleet restart required (cycle 62-66 incident)

This commit:
- Adds molecule_runtime/scripts/{molecule-git-token-helper.sh,
  molecule-gh-token-refresh.sh} as package data (copies from canonical
  workspace/scripts/ in molecule-monorepo)
- Adds molecule_runtime/credential_helper.py with
  install_credential_helper() that:
    1. Extracts bundled scripts to ~/.molecule-runtime/scripts/
    2. Configures git credential.helper for github.com
    3. Creates ~/.molecule-token-cache/ mode 0700
    4. Spawns refresh daemon under respawn loop (PID file dedup)
    5. Runs initial gh auth login --with-token
- Hooks call site early in main.py (step 0.1, before config load)
- Fails-soft: each step independently fault-tolerant; missing git/gh
  binary doesn't block runtime startup

Bumped to 0.1.10. Templates can drop their entrypoint.sh credential
helper setup once they update the runtime pin (separate PRs per template).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 00:41:32 -07:00
19fde6f466 fix(runtime): capture stderr in A2A error response (closes #66)
- Lower _PROCESS_ERROR_STDERR_MAX_CHARS to 1024 (was 4096) so A2A
  responses stay bounded — the full context is already in workspace logs
  via logger.error/exception.
- Add stderr= kwarg to sanitize_agent_error() so callers can surface
  subprocess stderr verbatim in A2A responses.
- In _execute_locked() non-retryable error path, extract the first 1 KB
  of exc.stderr and pass it to sanitize_agent_error() so the A2A
  response carries actionable context (rate limit message, auth error,
  etc.) instead of just a class name.
- Add test_executor_helpers.py unit tests for the new stderr= kwarg.
2026-04-24 05:00:51 +00:00
molecule-ai[bot]
d5cf872311
feat: migrate a2a-sdk 1.x (KI-009) (#39)
- Replace a2a.utils.new_agent_text_message → a2a.helpers.new_text_message
- Replace Part(root=TextPart(...)) → Part(text=...) (flat Part API)
- Replace A2AStarletteApplication → Starlette route factories
  (create_agent_card_routes, create_jsonrpc_routes)
- Update conftest stubs: remove a2a.server.apps/a2a.utils,
  add a2a.server.routes/a2a.helpers/AgentInterface
- Add AgentInterface to AgentCard supported_interfaces
- Rename snake_case AgentCard fields per 1.x schema

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-24 01:54:33 +00:00
Hongming Wang
e562b7a03e
Merge branch 'staging' into fix/auto-detect-llm-token-type 2026-04-23 13:52:25 -07:00
rabbitblood
050c2412b3 fix(heartbeat): refresh on-disk auth token on 401 + retry once (#1877)
## Problem

Auto-restart rotates the workspace's auth token in two non-atomic steps:
  1. Platform issues new token via wsauth.IssueToken
  2. Provisioner writes the new token to /configs/.auth_token AFTER
     ContainerStart returns

Between steps 1 and 2, the new container has booted and the runtime has
already loaded the OLD cached value of .auth_token (or no value if the
file was empty during boot). The runtime's first /registry/heartbeat
call sends the stale token, gets 401, but the loop never re-reads the
on-disk token — so subsequent heartbeats also send the stale value.

Each 401 means the platform never sees the workspace as alive →
status stays 'provisioning' → scheduler won't dispatch → workspace
looks dead from every angle even though the container is actually
running.

The existing code comment in workspace_provision.go acknowledges this:
"the workspace will get 401 on its first heartbeat and can recover on
the next restart." That recovery only worked because workspaces used
to crash for unrelated reasons and get restarted. After PR #1861
(provisioner empty-volume auto-recover) removed those crashes,
workspaces get stuck in the 401 loop with no exit.

## Fix

Two-part runtime-side fix in molecule-ai-workspace-runtime:

1. **platform_auth.refresh_from_disk()** — new helper that clears the
   in-memory cache and re-reads /configs/.auth_token. Returns the
   fresh value (or None if missing). Updates the cache as a side effect.

2. **HeartbeatLoop._loop()** — on 401 from /registry/heartbeat, calls
   refresh_from_disk() and retries the request ONCE with the new token.
   Same pattern in _check_delegations(). Bounded retry budget — if the
   on-disk token is also stale (bug elsewhere), no infinite loop.

## Tests

6/6 new tests in tests/test_token_refresh_1877.py:

  - refresh_picks_up_rotated_token              — happy path
  - refresh_returns_none_when_file_missing      — defensive
  - refresh_clears_stale_cache_when_file_disappears
  - refresh_is_idempotent
  - 401_retry_pattern_uses_refreshed_token      — the production fix path
  - 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget

All pass locally on Python 3.13 + pytest 9.

## Why this fix and not the alternatives

- **Alternative B (platform writes token before ContainerStart):**
  Right architecturally but invasive — needs provisioner refactor to
  prep volumes before docker run.
- **Alternative C (skip rotation on auto-restart):** Breaks the
  multi-instance-safety invariant the existing code calls out
  (revoke prevents stale tokens from sister deployments).
- **This fix (A):** 3-line core change + helper. Self-healing for any
  timing edge case, not just the post-restart one. Costs nothing in
  the happy path (only triggers on 401).

## Version

Bumped to 0.1.9. Once published to PyPI + workspace template image
rebuilt, deployed workspaces auto-recover from token-rotation races
without operator intervention.

Closes #1877.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:26:36 -07:00
rabbitblood
4bafea58ae fix(llm_auth): tighten base-URL hostname match + strip whitespace + no token in logs
Self-review findings on #38:

1. **Token substring leak**: the "unknown prefix" warning included the
   first 12 chars of the token in the log message. Logs get shipped to
   Langfuse / CloudWatch / slack-firehose — 12 bytes of a secret in a
   log is still 12 bytes too many. Warning no longer references the
   token value at all.

2. **Base-URL substring match was too loose**: `"anthropic.com" not in
   base` would accept `https://proxy.anthropic.com.evil.example/` as
   "looks like Anthropic, keep the URL." Replaced with an allowlist of
   exact hostnames parsed via urllib.parse.urlparse.

3. **Whitespace in pasted tokens**: operators frequently paste tokens
   from terminals with a trailing newline. The token would flow through
   startswith() detection but then fail downstream auth with a
   confusing "malformed token" error. Strip and persist the cleaned
   value.

4. **Malformed base URL crash guard**: if someone sets ANTHROPIC_BASE_URL
   to something urlparse can't handle, don't crash — fall through to
   clearing it, which is the safe choice in OAuth mode.

Added 5 new tests covering each of the above. 16/16 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:46:07 -07:00
rabbitblood
0a0f11b41f feat(runtime): auto-detect LLM token type, normalise env on boot
Platform stores per-workspace LLM credentials under a single key
(ANTHROPIC_AUTH_TOKEN in workspace_secrets). But downstream tools
expect different env var names depending on the token type:

  sk-ant-oat01-*  → CLAUDE_CODE_OAUTH_TOKEN  (Claude Code OAuth session)
  sk-ant-api03-*  → ANTHROPIC_API_KEY        (direct Anthropic API)
  sk-cp-*         → ANTHROPIC_AUTH_TOKEN     (proxy: MiniMax, gateways)

Without normalisation, an OAuth token under ANTHROPIC_AUTH_TOKEN gets
sent as a bearer to api.anthropic.com, which responds:

    401 authentication_error: OAuth authentication is currently not
    supported.

This was a platform-wide footgun: anyone rotating LLM keys had to
know the exact env var for each token type, AND make sure stale
overrides were cleared, AND set ANTHROPIC_BASE_URL correctly for
proxies (or NOT set for native Claude). Nothing downstream could
help — the SDK just saw the wrong var.

Fix:

- New molecule_runtime/llm_auth.py — normalise_llm_env() mutates
  os.environ (or any dict) to the correct shape based on token
  prefix. Returns a NormalisationResult for logging.
- main.py calls it as step 0, before any adapter/executor import.
  Every adapter (claude-code, langgraph, crewai, autogen, hermes,
  …) benefits automatically — no per-adapter branching needed.
- 11 unit tests covering all prefix paths, edge cases, and the
  "operator deliberately set CLAUDE_CODE_OAUTH_TOKEN" precedence
  rule.

Operationally: this means operators can keep using one
ANTHROPIC_AUTH_TOKEN slot in platform settings and just paste
whatever token the agent needs. No env-var-name awareness required.

Tested locally: 11/11 new tests pass. 83 other tests unchanged
(pre-existing failures on staging are all unrelated:
test_workspace_id_validation, test_a2a_mcp_server RBAC, the
test_imports.main module-walker — same signature as on staging
HEAD before this PR).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 10:41:47 -07:00
rabbitblood
59f54560a0 Merge branch 'main' of https://github.com/Molecule-AI/molecule-ai-workspace-runtime into fix/507-mcp-server-path-absolute-imports
Some checks failed
Publish to PyPI / build-and-publish (push) Failing after 6s
# Conflicts:
#	pyproject.toml
2026-04-21 06:37:38 -07:00
rabbitblood
d3235cc564 fix(heartbeat): increment/decrement active_tasks + push on clear (#1372, #1408)
Both set_current_task() implementations (shared_runtime.py + executor_helpers.py):
- Increment active_tasks on task start, decrement on completion (was binary 0/1)
- Push heartbeat immediately on BOTH increment AND decrement
- Only clear current_task when active_tasks reaches 0 (preserves description
  for still-running tasks)

Fixes phantom-busy: the old code returned early on clear, leaving
active_tasks=1 in the platform DB until the next 30s heartbeat cycle.
If a new cron fired before the heartbeat, the workspace appeared
permanently busy — required manual DB reset every 30 min.

Bump: 0.1.2 → 0.1.3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 06:37:12 -07:00
Hongming Wang
ba5466243b feat(auth): send X-Molecule-Org-Id on every outbound platform call
The SaaS tenant platform's TenantGuard middleware rejects cross-org
routing with synthetic 404s unless the request carries
X-Molecule-Org-Id matching the tenant's MOLECULE_ORG_ID env var. The
runtime never sent it, so every non-allowlisted workspace→platform
path (memories, delegations, notify, a2a, update-card, peers...)
404'd. Paired with CP change feat/workspace-export-org-id which
injects MOLECULE_ORG_ID into workspace user-data env.

auth_headers() now returns both headers — the existing Authorization
bearer AND the new X-Molecule-Org-Id — so every caller that already
threads auth_headers() through httpx picks it up for free. Self-
hosted deployments with MOLECULE_ORG_ID unset keep the old behavior
(no header, TenantGuard is a no-op).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:28:07 -07:00
d4b9bff5d0 fix(a2a_cli): validate WORKSPACE_ID in discover() before X-Workspace-ID header
PR #32 wrapped all platform URL construction sites with
get_validated_workspace_id() but missed a2a_cli.discover(), which
passed the raw unvalidated WORKSPACE_ID in the X-Workspace-ID header.
All other functions (peers, info) had try/except guards added.

discover() now calls get_validated_workspace_id() upfront and returns
None (printing the error) if validation fails — consistent with the
best-effort error handling pattern used elsewhere in the module.

Tests: 2 new cases in TestA2aCliDiscoverValidation covering empty
and slash-injected WORKSPACE_ID values.

Follow-up to: PR #32 (fix/908-add-namespace-param-commit-memory)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 01:35:37 +00:00
249e5c07eb fix(builtin_tools/validation): complete WORKSPACE_ID validation in a2a_tools.py
Add get_validated_workspace_id() to all 6 remaining unguarded URL positions
in molecule_runtime/a2a_tools.py (the MCP tool body implementations):

- report_activity(): /workspaces/{id}/activity + heartbeat
- tool_delegate_task_async(): /workspaces/{id}/delegate
- tool_check_task_status(): /workspaces/{id}/delegations
- tool_send_message_to_user(): /workspaces/{id}/notify
- tool_commit_memory(): /workspaces/{id}/memories (POST)
- tool_recall_memory(): /workspaces/{id}/memories (GET)

All 6 functions now use validated ws_id. The last remaining unguarded
WORKSPACE_ID use in the entire molecule_runtime package is in
builtin_tools/telemetry.py:142 (metric service name — not a URL path,
low security risk). 67/67 tests pass.
2026-04-21 00:55:08 +00:00
32a7880f4f test+fix(builtin_tools/validation): add test coverage + fix ".." bypass in regex
Tests: 37 new test cases in tests/test_validation.py covering:
- Valid ID patterns (6): normal IDs, underscores, dots, max-length (256)
- Empty/missing (1): raises with "empty" in message
- Invalid chars (10): / \ .. # ? & whitespace
- Caching (2): result is cached; raises on repeated bad calls
- Error type (1): WorkspaceIdValidationError is a ValueError subclass

Fix: regex now uses negative lookahead `(?!.*\.\.)` to reject ".." anywhere
in the string (not just at the start). The old pattern `^[A-Za-z0-9_\-.]{1,256}$`
matched ".." literally because two dots ARE in the allowed character class.
Also adds test cases for embedded ".." (ws..example, ws../etc).

Fixes: the ".." bypass was a gap in the original CWE-20 fix.
2026-04-21 00:55:08 +00:00
be9c9997c0 fix(builtin_tools/validation): cover remaining WORKSPACE_ID URL usages
Extend get_validated_workspace_id() to all remaining unguarded URL positions:

- consolidation.py: _consolidate() — validates before GET/POST/DELETE to
  /workspaces/{id}/memories endpoints. Graceful skip on failure (log + return).
- coordinator.py: get_children() — validates before /registry/{id}/peers.
  Graceful skip (empty list) on failure.
- molecule_ai_status.py: set_status() — validates before /registry/heartbeat
  and /workspaces/{id}/activity. Exits with descriptive error on failure.

With these three, every runtime use of WORKSPACE_ID in a URL path is now
validated. Remaining WORKSPACE_ID uses are:
- JSON body fields (not injection-risky): heartbeat, memory POST bodies
- Header values (X-Workspace-ID): lower risk, non-URL-injection
2026-04-21 00:55:08 +00:00
42bdf530b5 fix(builtin_tools/validation): extend WORKSPACE_ID validation to top-level modules
Fixes remaining unguarded WORKSPACE_ID URL usages identified after the initial
builtin_tools/ fix:

- a2a_client.py: get_peers() and get_workspace_info() now use
  get_validated_workspace_id() before URL construction. The raw module-level
  constant is still used in the discover_peer() header (low risk, not URL path).
- a2a_cli.py: peers() and info() CLI commands now validate WORKSPACE_ID before
  calling the platform API. Commands exit with error code 1 + descriptive
  message if WORKSPACE_ID is empty or malformed.

Follow-up candidates (lower priority, not URL injection risk):
- coordinator.py: WORKSPACE_ID in registry peer URL
- consolidation.py: WORKSPACE_ID in memory URLs (long-running consolidation job)
- molecule_ai_status.py: WORKSPACE_ID in activity log URL
2026-04-21 00:55:08 +00:00
d52082839f fix(builtin_tools): validate WORKSPACE_ID before URL construction
Add WORKSPACE_ID format validation before every URL/header use to prevent
URL injection (CWE-20 / CWE-88). The validator:
- Rejects empty values (fail-fast with clear error)
- Rejects path-traversal chars (/ \ ..) and fragment/query chars (# ? &)
- Accepts alphanumeric, hyphen, underscore, dot (typical ID formats)
- Caches the result after first successful call (zero overhead per call)

Validated in:
- memory.py: commit_memory, search_memory (both awareness-client + httpx paths)
- approval.py: _create_approval_request, _wait_polling
- delegation.py: _notify_completion, _record_delegation_on_platform,
  _update_delegation_on_platform
- a2a_tools.py: list_peers, delegate_task

Fixes #14.
2026-04-21 00:55:08 +00:00
molecule-ai[bot]
30d96b4e4e
fix(platform_auth): validate WORKSPACE_ID at import time (issue #14, CWE-20) (#29)
WORKSPACE_ID was read via os.environ.get("WORKSPACE_ID", "") in multiple
builtin_tools modules and used directly in platform API URLs and X-Workspace-ID
headers without validation. A crafted ID containing /, .., or # could cause
URL path injection.

Fix: validate_workspace_id() in platform_auth.py now validates the ID format
at module import time using a regex that permits only lowercase alphanumerics
and hyphens (matching UUIDs and org-generated IDs). The validated value is
exposed as a module-level WORKSPACE_ID constant. builtin_tools/approval.py
and builtin_tools/delegation.py now import from platform_auth instead of
reading os.environ directly.

Failing input raises ValueError with a clear message — workspace fails fast
at startup rather than silently accepting malformed IDs in requests.

Add 15 regression tests (45/45 passing total).

Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-21 00:04:54 +00:00
Hongming Wang
4aa0d9f110 fix(adapter-loader): fall back to any BaseAdapter subclass
ADAPTER_MODULE resolution required the imported module to export a
class literally named `Adapter`. The claude-code, langgraph, and
openclaw adapter-template repos (3 of 4 currently in production) don't
ship that alias — they export ClaudeCodeAdapter / LangGraphAdapter /
OpenClawAdapter directly. Only hermes has the `Adapter = HermesAdapter`
shim at the bottom of adapter.py.

Consequence in prod: every fresh claude-code / langgraph / openclaw
workspace crashed at runtime startup with
"module 'adapter' has no attribute 'Adapter'", even with a2a-sdk
correctly pinned <1.0. Provisioning looked successful from CP's side
(EC2 ran) but the agent never registered because the process never
reached A2A bootstrap.

Fix: if `Adapter` is absent from the imported module, scan the module
for any attribute that is a proper BaseAdapter subclass (excluding
BaseAdapter itself — regression guard in tests). The explicit alias
remains the preferred contract; this is purely additive tolerance.

Bump to 0.1.4 and publish to PyPI via the existing v* tag trigger.

6 new tests cover: explicit alias, subclass-fallback, non-adapter-noise
ignored, empty module → error, missing module → error, re-exported
BaseAdapter → not selected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 16:59:12 -07:00
molecule-ai[bot]
457adcbd64
Merge pull request #28 from Molecule-AI/fix/908-add-namespace-param-commit-memory
feat(builtin_tools/memory): add namespace param to commit_memory and search_memory
2026-04-20 23:18:45 +00:00
ecc0a231bf feat(builtin_tools/memory): add optional namespace param to commit_memory and search_memory
Adds optional namespace parameter so agents can organize memories into named
buckets (e.g. "facts", "procedures", "blockers"). Defaults to "general".

- commit_memory(content, scope, *, namespace=None): namespace normalised to
  "general" when None or whitespace-only, forwarded to awareness client and
  included in httpx POST body.
- search_memory(query, scope, *, namespace=None): namespace forwarded as
  ?namespace= query param (omitted when None), matching the existing behaviour
  for the scope param.
- AwarenessClient.commit() and .search() updated to accept namespace kwarg.

Fixes #908.
2026-04-20 23:12:32 +00:00
83f87702ea fix(cli_executor + sandbox): CWE-78 auth helper + subprocess warning
Issue #21 (CWE-78): _create_auth_helper() wrote a shell script using
shlex.quote() which does NOT protect against $(...) command substitution
inside the token value. Replaced with a mode-0600 token file passed via
AGENT_AUTH_TOKEN_FILE env var — token is never interpreted by a shell.

Issue #22 (CWE-266): sandbox subprocess backend warns once at module
load time when active, alerting operators that SANDBOX_BACKEND=docker or
e2b should be used for production isolation.

Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-20 23:05:57 +00:00
molecule-ai[bot]
2bb0f97085
Merge pull request #26 from Molecule-AI/fix/plugin-setup-env-scrub
fix(plugins_registry/builtins): strip API keys from plugin setup.sh env
2026-04-20 23:04:33 +00:00
d6944086fe fix(plugins_registry/builtins): strip API keys from plugin setup.sh env
Issue #19 (CWE-C-312): AgentskillsAdaptor.install() passed the full
os.environ to the subprocess running setup.sh, including
ANTHROPIC_API_KEY, OPENAI_API_KEY, GITHUB_TOKEN, WORKSPACE_AUTH_TOKEN,
etc. A malicious or compromised plugin's setup.sh could exfiltrate them.

Fix: _scrubbed_env() builds a copy of os.environ with sensitive keys
removed, matching the same _SCRUB_KEYS list used in skill_loader/loader.py
so the scrubbing policy is consistent. CONFIGS_DIR is still passed via
the extra dict. Non-secret vars (PATH, HOME, etc.) are preserved.

Add 6 regression tests (30/30 passing).

Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-20 22:52:13 +00:00
c72fbfc9a4 fix(builtin_tools/audit): fail-secure RBAC — read-only default when config unavailable
Fixes #11 (CWE-285): get_workspace_roles() returned ["operator"] (full
delegate/approve/memory.write) when workspace config could not be loaded.
Changed to ["read-only"] — deny-by-default per Principle of Least
Privilege. Add regression tests in tests/test_audit.py.

Also includes:
- main.py: remove token prefix log (CWE-532) — issue #10/#17
- a2a_mcp_server.py: RBAC gate on sensitive MCP tools (CWE-862) — issue #12
- cli_executor.py: sanitize stderr in error logs (CWE-209) — issue #13
- tests/test_a2a_mcp_server.py: 5 new regression tests for MCP RBAC

Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
2026-04-20 22:47:38 +00:00
Molecule AI Backend Engineer 3
fa64a04cba fix: add auth headers to skill promotion logs and improve pip-audit severity parsing
- Extract _auth_headers_for_platform() helper so _maybe_log_skill_promotion()
  includes auth headers when calling /workspaces/:id/activity (was missing)
- Improve pip-audit severity parsing: if fix_versions is present, severity
  is 'high' (patch available); otherwise 'medium' (no known fix yet)
2026-04-20 05:03:22 +00:00
rabbitblood
d1719dd2a6 fix: strip CRLF from .sh/.py files in plugin hook installer — permanent #507 fix
The TRUE root cause of recurring CRLF: shutil.copy2() in
_copy_dir_files() copies hook files byte-for-byte from /plugins/
(mounted from Windows host) into /configs/.claude/hooks/. Windows
git checkout introduces \r\n regardless of .gitattributes.

Previous fixes were band-aids:
- .gitattributes eol=lf (only works for files IN git, not host disk)
- entrypoint.sh sed strip (runs at boot but after plugin install)
- provisioner CopyTemplateToContainer strip (wrong code path — hooks
  come through the Python plugin installer, not the Go template copier)

This fix strips CRLF at the single point where ALL plugin hooks enter
a container: _copy_dir_files() in builtins.py. read_bytes() + replace
+ write_bytes for .sh/.py files. Other file types pass through unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:36:43 -07:00
rabbitblood
18d904cfc1 fix: MCP server path resolution + absolute imports (2nd half of #507)
The a2a MCP subprocess was launched with a hard-coded /app/a2a_mcp_server.py
path that only existed in the legacy workspace-template layout. Current
templates copy adapter.py into /app but not the MCP server script, so
claude-code's mcp_servers={"a2a": ...} config spawned a non-existent file,
the server never registered any tools, and every agent reported that
search_memory / commit_memory / list_peers / delegate_task / send_message_to_user
were unavailable in the tool registry.

Surfaced this cycle after the CRLF hook fix (PR molecule-core#508 +
plugin repo's .gitattributes) unblocked the primary (no response generated)
symptom. Before that, agents crashed before the missing-MCP issue was
observable — the two bugs stacked.

Changes
-------
* executor_helpers._default_mcp_server_path: resolves the installed
  molecule_runtime.a2a_mcp_server module's __file__ so the path is
  always correct regardless of template layout. Legacy /app path kept
  as last-resort fallback for any old images still in rotation.
* a2a_mcp_server.py, a2a_tools.py, a2a_client.py: convert bare module
  imports (from a2a_tools import ...) to absolute
  (from molecule_runtime.a2a_tools import ...). Previously this worked
  only when main.py injected the package dir onto sys.path; the MCP
  subprocess doesn't go through main.py, so the bare imports would fail.
  Added a sys.path shim at the top of a2a_mcp_server.py so running as a
  standalone script (python path/to/a2a_mcp_server.py) still works —
  the subprocess can now locate the package root automatically.
* consolidation.py, heartbeat.py, main.py: same bare-to-absolute
  conversion for platform_auth imports (unblocks the same class of
  failure if any of these modules are imported from a non-main.py
  entrypoint in the future).

Verification
------------
Deployed the updated files into ws-8010dbd0 (PM) and ran an isolated
sdk.query() as agent user. SystemMessage.init.mcp_servers now reports
[{'name': 'a2a', 'status': 'connected'}] and the tools list includes
all 8 mcp__a2a__* entries:
  mcp__a2a__check_task_status, mcp__a2a__commit_memory,
  mcp__a2a__delegate_task, mcp__a2a__delegate_task_async,
  mcp__a2a__get_workspace_info, mcp__a2a__list_peers,
  mcp__a2a__recall_memory, mcp__a2a__send_message_to_user

Rolled the in-container hotfix across all 22 workspaces pending release
(docker cp the 4 changed files into each site-packages/molecule_runtime/).

Fixes Molecule-AI/molecule-core#507 (secondary)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 13:28:57 -07:00
rabbitblood
3b56410ad5 fix: gate session resume on file existence (closes #488)
## Symptom (cycle 6+ of #488)
Workspaces appear `online` (heartbeats fine) but every cron tick fails
silently with `No conversation found with session ID: <uuid>` →
`ProcessError: exit code 1` → idle loop logs HTTP 200, no actual work
happens. Backend Engineer received 5 idle pulses without claiming a
single one of the 6 open Hermes issues (#496-500) because the bug
prevents `gh issue list` from ever firing.

## Root cause (verified live in ws-20cb8ff8-3e4 today)
claude-code stores sessions at `/root/.claude/projects/<cwd-with-/-as-->/<id>.jsonl`.
When a workspace container is recreated, `self._session_id` from a
prior instance references a file that no longer exists. Passing it as
`resume=<id>` to ClaudeAgentOptions crashes the CLI on the very first
call. The existing #75 fix only fires AFTER the first ProcessError
lands, and per-cycle executor re-instantiation can reload the stale id
from elsewhere — restart-with-reset_claude_session was the only working
mitigation, hand-fired every cycle.

## Fix
New `_resolve_resume()` in ClaudeSDKExecutor: probes a handful of
well-known session-file locations (`/root/.claude/projects/*/<id>.jsonl`,
`/root/.claude/sessions/<id>.jsonl`, plus the agent-uid variants) via
`glob.glob`. If no file matches the in-memory `_session_id`, drops the
id (sets to None) AND returns None so `ClaudeAgentOptions.resume` is
unset — CLI starts a fresh session. Logged at INFO with `#488` in the
message so operators correlate.

`_build_options()` now calls `_resolve_resume()` instead of reading
`self._session_id` directly. Cheap path when no session set: zero
glob calls. Hot path (session set + file exists): one glob call,
short-circuits on first match.

## Drive-by fix: stale `from X import` in 4 modules
Same regression class as #1 (the runtime release that closed it):
- `claude_sdk_executor.py:43`: `from executor_helpers import …`
- `cli_executor.py:39-40`: `from config import …`, `from executor_helpers import …`
- `main.py:28-30`: `from config import …`, `from heartbeat import …`, `from preflight import …`
- `preflight.py:7`: `from config import …`

All rewritten to absolute `from molecule_runtime.<module> import …`
so they resolve outside of workspace containers (e.g. test environments
where `/app` isn't on sys.path). The grep guard in `tests/test_imports.py`
already covered `adapters` — extending to all top-level imports would
catch this class going forward; not in this PR to keep scope tight.

## Tests
6 new in `tests/test_session_resume_gate.py`:
- baseline (no session) → no glob, returns None
- file exists → keep id, returns id, single glob (early-exit)
- file missing → drop id (clears `_session_id`), returns None
- late-pattern match → walks all patterns until hit
- log includes session id (operator triage)
- log references #488 (debugger discoverability)

All 16 tests (10 existing + 6 new) pass.

## Release plan
- Bump version 0.1.1 → 0.1.2 (in this commit)
- After merge, push v0.1.2 tag → publish.yml auto-publishes to PyPI
- Then rebuild workspace template images locally so workspaces pick up the
  fix (templates pin `>=0.1.0`, will resolve to 0.1.2 on next build)
- Then mass-restart workspaces with reset_claude_session=true once to clear
  any DB-side stale state, and the permanent fix kicks in

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 11:12:03 -07:00
rabbitblood
9cdae9afec fix: switch top-level from adapters import to absolute imports (#1)
Every modular workspace template repo (claude-code, hermes, langgraph,
…) was crashing on boot with:

  KeyError: "Unknown runtime '<runtime>'. Available: "

Root cause: `molecule_runtime/main.py` and four other modules used
top-level imports like `from adapters import get_adapter` — a monorepo
legacy that resolved when something on sys.path had an `adapters/`
package. Standalone template repos COPY only `adapter.py` (singular) to
/app and don't ship an `adapters/` package, so this import path went
through some side-resolution that left `get_adapter` unable to see the
user's adapter. The ADAPTER_MODULE → import → getattr → issubclass
chain then silently fell through to the discovery branch and reported
"Unknown runtime".

Fix is one-line per file: `from adapters` → `from molecule_runtime.adapters`
in:
  - molecule_runtime/main.py:27
  - molecule_runtime/a2a_executor.py:44
  - molecule_runtime/coordinator.py:20
  - molecule_runtime/prompt.py:6
  - molecule_runtime/builtin_tools/temporal_workflow.py:417

Tests + CI added so this regression class is caught at PR time, not at
runtime in self-hosters' clusters:
  - tests/test_imports.py: parametrised import smoke for every previously
    affected module + a grep guard that fails if any future change
    reintroduces a top-level `from adapters` / `import adapters` line
  - .github/workflows/ci.yml: runs the smoke on every PR (no CI existed
    before — the publish workflow only fires on tag push)

Closes #1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 07:53:03 -07:00
Hongming Wang
851a6d7bfd feat: initial release of molecule-ai-workspace-runtime 0.1.0
Extracts shared workspace runtime from molecule-monorepo/workspace-template
into a publishable PyPI package.

- molecule_runtime/ package with all shared infrastructure modules
- Adapter discovery via ADAPTER_MODULE env var (standalone repos) + built-in scan
- molecule-runtime console script entry point (main_sync)
- CI workflow to publish on version tags
- Published to PyPI as molecule-ai-workspace-runtime==0.1.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 04:26:06 -07:00