- Lower _PROCESS_ERROR_STDERR_MAX_CHARS to 1024 (was 4096) so A2A
responses stay bounded — the full context is already in workspace logs
via logger.error/exception.
- Add stderr= kwarg to sanitize_agent_error() so callers can surface
subprocess stderr verbatim in A2A responses.
- In _execute_locked() non-retryable error path, extract the first 1 KB
of exc.stderr and pass it to sanitize_agent_error() so the A2A
response carries actionable context (rate limit message, auth error,
etc.) instead of just a class name.
- Add test_executor_helpers.py unit tests for the new stderr= kwarg.
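A minimal sketch of the bounded stderr pass-through described above. The message layout and the keyword-only signature are assumptions made for illustration; only the 1024-character cap and the stderr= kwarg come from this commit.

```python
from typing import Optional

# Cap taken from the commit message (previously 4096).
_PROCESS_ERROR_STDERR_MAX_CHARS = 1024


def sanitize_agent_error(exc: BaseException, *, stderr: Optional[str] = None) -> str:
    """Build a bounded error string for an A2A response.

    The output format here is hypothetical; the real helper may differ.
    """
    message = f"Agent error ({type(exc).__name__})"
    if stderr:
        # Keep only the first _PROCESS_ERROR_STDERR_MAX_CHARS characters.
        message += ": " + stderr[:_PROCESS_ERROR_STDERR_MAX_CHARS]
    return message
```

With no stderr the caller still gets the class name only, which matches the old behaviour.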
Both set_current_task() implementations (shared_runtime.py + executor_helpers.py):
- Increment active_tasks on task start, decrement on completion (was binary 0/1)
- Push heartbeat immediately on BOTH increment AND decrement
- Only clear current_task when active_tasks reaches 0 (preserves description
for still-running tasks)
Fixes phantom-busy: the old code returned early on clear, leaving
active_tasks=1 in the platform DB until the next 30s heartbeat cycle.
If a new cron fired before the heartbeat, the workspace appeared
permanently busy — required manual DB reset every 30 min.
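The counting behaviour can be sketched roughly as follows. TaskTracker and the push_heartbeat callback are hypothetical names, and the convention that a None description marks completion is an assumption; the real set_current_task() signatures may differ.

```python
import threading
from typing import Callable, Optional


class TaskTracker:
    """Counts in-flight tasks and pushes a heartbeat on every change."""

    def __init__(self, push_heartbeat: Callable[[int, Optional[str]], None]) -> None:
        self._lock = threading.Lock()
        self._push_heartbeat = push_heartbeat  # hypothetical callback
        self.active_tasks = 0
        self.current_task: Optional[str] = None

    def set_current_task(self, description: Optional[str]) -> None:
        """A description starts a task; None marks one task as finished."""
        with self._lock:
            if description is not None:
                self.active_tasks += 1
                self.current_task = description
            else:
                self.active_tasks = max(0, self.active_tasks - 1)
                if self.active_tasks == 0:
                    # Clear the description only once nothing is running.
                    self.current_task = None
            snapshot = (self.active_tasks, self.current_task)
        # Push outside the lock, on both increment and decrement.
        self._push_heartbeat(*snapshot)
```

Pushing on the decrement is what removes the window in which the platform DB still shows a busy workspace.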
Bump: 0.1.2 → 0.1.3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SaaS tenant platform's TenantGuard middleware rejects cross-org
routing with synthetic 404s unless the request carries
X-Molecule-Org-Id matching the tenant's MOLECULE_ORG_ID env var. The
runtime never sent it, so every non-allowlisted workspace→platform
path (memories, delegations, notify, a2a, update-card, peers...)
404'd. Paired with CP change feat/workspace-export-org-id which
injects MOLECULE_ORG_ID into workspace user-data env.
auth_headers() now returns both headers — the existing Authorization
bearer AND the new X-Molecule-Org-Id — so every caller that already
threads auth_headers() through httpx picks it up for free. Self-
hosted deployments with MOLECULE_ORG_ID unset keep the old behavior
(no header, TenantGuard is a no-op).
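A rough sketch of the header logic, assuming the token is passed in as an argument (the real auth_headers() may obtain it differently):

```python
import os
from typing import Dict


def auth_headers(token: str) -> Dict[str, str]:
    """Return the bearer header, plus the org header when MOLECULE_ORG_ID is set."""
    headers = {"Authorization": f"Bearer {token}"}
    org_id = os.environ.get("MOLECULE_ORG_ID", "").strip()
    if org_id:
        # Self-hosted deployments leave this unset, so no header is sent.
        headers["X-Molecule-Org-Id"] = org_id
    return headers
```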
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #32 wrapped all platform URL construction sites with
get_validated_workspace_id() but missed a2a_cli.discover(), which
passed the raw unvalidated WORKSPACE_ID in the X-Workspace-ID header.
All other functions (peers, info) had try/except guards added.
discover() now calls get_validated_workspace_id() upfront and returns
None (printing the error) if validation fails — consistent with the
best-effort error handling pattern used elsewhere in the module.
Tests: 2 new cases in TestA2aCliDiscoverValidation covering empty
and slash-injected WORKSPACE_ID values.
Follow-up to: PR #32 (fix/908-add-namespace-param-commit-memory)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add get_validated_workspace_id() to all 6 remaining unguarded URL positions
in molecule_runtime/a2a_tools.py (the MCP tool body implementations):
- report_activity(): /workspaces/{id}/activity + heartbeat
- tool_delegate_task_async(): /workspaces/{id}/delegate
- tool_check_task_status(): /workspaces/{id}/delegations
- tool_send_message_to_user(): /workspaces/{id}/notify
- tool_commit_memory(): /workspaces/{id}/memories (POST)
- tool_recall_memory(): /workspaces/{id}/memories (GET)
All 6 functions now use validated ws_id. The last remaining unguarded
WORKSPACE_ID use in the entire molecule_runtime package is in
builtin_tools/telemetry.py:142 (metric service name — not a URL path,
low security risk). 67/67 tests pass.
Tests: 37 new test cases in tests/test_validation.py covering:
- Valid ID patterns (6): normal IDs, underscores, dots, max-length (256)
- Empty/missing (1): raises with "empty" in message
- Invalid chars (10): / \ .. # ? & whitespace
- Caching (2): result is cached; raises on repeated bad calls
- Error type (1): WorkspaceIdValidationError is a ValueError subclass
Fix: regex now uses negative lookahead `(?!.*\.\.)` to reject ".." anywhere
in the string (not just at the start). The old pattern `^[A-Za-z0-9_\-.]{1,256}$`
matched ".." literally because two dots ARE in the allowed character class.
Also adds test cases for embedded ".." (ws..example, ws../etc).
Fixes: the ".." bypass was a gap in the original CWE-20 fix.
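The difference between the two patterns can be illustrated like this. The exact composition of the new regex is an assumption: it places the lookahead from this commit in front of the old character class.

```python
import re

# Old pattern, as quoted above: dots are in the class, so ".." is accepted.
OLD_PATTERN = re.compile(r"^[A-Za-z0-9_\-.]{1,256}$")

# Assumed new pattern: same class, with a lookahead rejecting ".." anywhere.
NEW_PATTERN = re.compile(r"^(?!.*\.\.)[A-Za-z0-9_\-.]{1,256}$")


def is_valid_workspace_id(ws_id: str) -> bool:
    """True when ws_id passes the assumed new pattern."""
    return NEW_PATTERN.match(ws_id) is not None
```

A single dot (ws.example) still passes; only consecutive dots are rejected.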
Extend get_validated_workspace_id() to all remaining unguarded URL positions:
- consolidation.py: _consolidate() — validates before GET/POST/DELETE to
/workspaces/{id}/memories endpoints. Graceful skip on failure (log + return).
- coordinator.py: get_children() — validates before /registry/{id}/peers.
Graceful skip (empty list) on failure.
- molecule_ai_status.py: set_status() — validates before /registry/heartbeat
and /workspaces/{id}/activity. Exits with descriptive error on failure.
With these three, every runtime use of WORKSPACE_ID in a URL path is now
validated. Remaining WORKSPACE_ID uses are:
- JSON body fields (not injection-risky): heartbeat, memory POST bodies
- Header values (X-Workspace-ID): lower risk, non-URL-injection
Fixes remaining unguarded WORKSPACE_ID URL usages identified after the initial
builtin_tools/ fix:
- a2a_client.py: get_peers() and get_workspace_info() now use
get_validated_workspace_id() before URL construction. The raw module-level
constant is still used in the discover_peer() header (low risk, not URL path).
- a2a_cli.py: peers() and info() CLI commands now validate WORKSPACE_ID before
calling the platform API. Commands exit with error code 1 + descriptive
message if WORKSPACE_ID is empty or malformed.
Follow-up candidates (lower priority, not URL injection risk):
- coordinator.py: WORKSPACE_ID in registry peer URL
- consolidation.py: WORKSPACE_ID in memory URLs (long-running consolidation job)
- molecule_ai_status.py: WORKSPACE_ID in activity log URL
WORKSPACE_ID was read via os.environ.get("WORKSPACE_ID", "") in multiple
builtin_tools modules and used directly in platform API URLs and X-Workspace-ID
headers without validation. A crafted ID containing /, .., or # could cause
URL path injection.
Fix: validate_workspace_id() in platform_auth.py now validates the ID format
at module import time using a regex that permits only lowercase alphanumerics
and hyphens (matching UUIDs and org-generated IDs). The validated value is
exposed as a module-level WORKSPACE_ID constant. builtin_tools/approval.py
and builtin_tools/delegation.py now import from platform_auth instead of
reading os.environ directly.
Failing input raises ValueError with a clear message — workspace fails fast
at startup rather than silently accepting malformed IDs in requests.
Add 15 regression tests (45/45 passing total).
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
ADAPTER_MODULE resolution required the imported module to export a
class literally named `Adapter`. The claude-code, langgraph, and
openclaw adapter-template repos (3 of 4 currently in production) don't
ship that alias — they export ClaudeCodeAdapter / LangGraphAdapter /
OpenClawAdapter directly. Only hermes has the `Adapter = HermesAdapter`
shim at the bottom of adapter.py.
Consequence in prod: every fresh claude-code / langgraph / openclaw
workspace crashed at runtime startup with
"module 'adapter' has no attribute 'Adapter'", even with a2a-sdk
correctly pinned <1.0. Provisioning looked successful from CP's side
(EC2 ran) but the agent never registered because the process never
reached A2A bootstrap.
Fix: if `Adapter` is absent from the imported module, scan the module
for any attribute that is a proper BaseAdapter subclass (excluding
BaseAdapter itself — regression guard in tests). The explicit alias
remains the preferred contract; this is purely additive tolerance.
Bump to 0.1.4 and publish to PyPI via the existing v* tag trigger.
6 new tests cover: explicit alias, subclass-fallback, non-adapter-noise
ignored, empty module → error, missing module → error, re-exported
BaseAdapter → not selected.
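A minimal sketch of the fallback, using a stand-in BaseAdapter and a hypothetical resolve_adapter_class() helper; the runtime's actual function name and error type are not given in this message.

```python
import inspect
import types
from typing import Type


class BaseAdapter:
    """Stand-in for the runtime's BaseAdapter (assumption for this sketch)."""


def resolve_adapter_class(module: types.ModuleType) -> Type[BaseAdapter]:
    """Prefer the explicit `Adapter` alias, else any proper BaseAdapter subclass."""
    explicit = getattr(module, "Adapter", None)
    if inspect.isclass(explicit) and issubclass(explicit, BaseAdapter):
        return explicit
    for _name, obj in inspect.getmembers(module, inspect.isclass):
        # A re-exported BaseAdapter must not be selected.
        if issubclass(obj, BaseAdapter) and obj is not BaseAdapter:
            return obj
    raise AttributeError(
        f"module {module.__name__!r} exports no BaseAdapter subclass"
    )
```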
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds optional namespace parameter so agents can organize memories into named
buckets (e.g. "facts", "procedures", "blockers"). Defaults to "general".
- commit_memory(content, scope, *, namespace=None): namespace normalised to
"general" when None or whitespace-only, forwarded to awareness client and
included in httpx POST body.
- search_memory(query, scope, *, namespace=None): namespace forwarded as
?namespace= query param (omitted when None), matching the existing behaviour
for the scope param.
- AwarenessClient.commit() and .search() updated to accept namespace kwarg.
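The namespace handling above might look roughly like this. The helper names and the "query" parameter key are hypothetical; only the "general" default and the ?namespace= param come from this commit.

```python
from typing import Dict, Optional

DEFAULT_NAMESPACE = "general"


def normalise_namespace(namespace: Optional[str]) -> str:
    """Fall back to "general" for None or whitespace-only values (commit path)."""
    if namespace is None or not namespace.strip():
        return DEFAULT_NAMESPACE
    return namespace


def search_params(
    query: str, scope: Optional[str] = None, *, namespace: Optional[str] = None
) -> Dict[str, str]:
    """Build query params, omitting scope and namespace when they are None."""
    params = {"query": query}  # "query" key is an assumption
    if scope is not None:
        params["scope"] = scope
    if namespace is not None:
        params["namespace"] = namespace
    return params
```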
Fixes #908.
Issue #21 (CWE-78): _create_auth_helper() wrote a shell script using
shlex.quote() which does NOT protect against $(...) command substitution
inside the token value. Replaced with a mode-0600 token file passed via
AGENT_AUTH_TOKEN_FILE env var — token is never interpreted by a shell.
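A sketch of the token-file approach on a POSIX host. The file name, the temporary directory, and both helper names are assumptions; only the 0600 mode and the AGENT_AUTH_TOKEN_FILE variable come from this commit.

```python
import os
import tempfile
from typing import Dict, Optional


def write_token_file(token: str, directory: Optional[str] = None) -> str:
    """Write the token to a file readable only by the owner and return its path."""
    directory = directory or tempfile.mkdtemp(prefix="agent-auth-")
    path = os.path.join(directory, "agent_auth_token")  # hypothetical file name
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(token)  # raw bytes on disk; no shell ever parses them
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed
    return path


def child_env(token_path: str) -> Dict[str, str]:
    """Environment for the child process: it gets the path, not the token."""
    env = dict(os.environ)
    env["AGENT_AUTH_TOKEN_FILE"] = token_path
    return env
```

Because the child reads the file itself, a token containing $(...) or backticks is just data.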
Issue #22 (CWE-266): sandbox subprocess backend warns once at module
load time when active, alerting operators that SANDBOX_BACKEND=docker or
e2b should be used for production isolation.
Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
Issue #19 (CWE-312): AgentskillsAdaptor.install() passed the full
os.environ to the subprocess running setup.sh, including
ANTHROPIC_API_KEY, OPENAI_API_KEY, GITHUB_TOKEN, WORKSPACE_AUTH_TOKEN,
etc. A malicious or compromised plugin's setup.sh could exfiltrate them.
Fix: _scrubbed_env() builds a copy of os.environ with sensitive keys
removed, matching the same _SCRUB_KEYS list used in skill_loader/loader.py
so the scrubbing policy is consistent. CONFIGS_DIR is still passed via
the extra dict. Non-secret vars (PATH, HOME, etc.) are preserved.
Add 6 regression tests (30/30 passing).
Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
- Extract _auth_headers_for_platform() helper so _maybe_log_skill_promotion()
includes auth headers when calling /workspaces/:id/activity (was missing)
- Improve pip-audit severity parsing: if fix_versions is present, severity
is 'high' (patch available); otherwise 'medium' (no known fix yet)
The TRUE root cause of recurring CRLF: shutil.copy2() in
_copy_dir_files() copies hook files byte-for-byte from /plugins/
(mounted from Windows host) into /configs/.claude/hooks/. Windows
git checkout introduces \r\n regardless of .gitattributes.
Previous fixes were band-aids:
- .gitattributes eol=lf (only works for files IN git, not host disk)
- entrypoint.sh sed strip (runs at boot but after plugin install)
- provisioner CopyTemplateToContainer strip (wrong code path — hooks
come through the Python plugin installer, not the Go template copier)
This fix strips CRLF at the single point where ALL plugin hooks enter
a container: _copy_dir_files() in builtins.py. read_bytes() + replace
+ write_bytes for .sh/.py files. Other file types pass through unchanged.
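The per-file step can be sketched as below. copy_hook_file is a hypothetical name for the logic inside _copy_dir_files(), and preserving the file mode via shutil.copymode is an assumption.

```python
import shutil
from pathlib import Path

# Suffixes named in the commit message.
_LF_SUFFIXES = {".sh", ".py"}


def copy_hook_file(src: Path, dst: Path) -> None:
    """Copy src to dst, converting CRLF to LF for shell and Python files."""
    if src.suffix in _LF_SUFFIXES:
        dst.write_bytes(src.read_bytes().replace(b"\r\n", b"\n"))
        shutil.copymode(src, dst)  # keep the executable bit on hooks
    else:
        shutil.copy2(src, dst)  # other file types pass through byte-for-byte
```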
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The a2a MCP subprocess was launched with a hard-coded /app/a2a_mcp_server.py
path that only existed in the legacy workspace-template layout. Current
templates copy adapter.py into /app but not the MCP server script, so
claude-code's mcp_servers={"a2a": ...} config spawned a non-existent file,
the server never registered any tools, and every agent reported that
search_memory / commit_memory / list_peers / delegate_task / send_message_to_user
were unavailable in the tool registry.
Surfaced this cycle after the CRLF hook fix (PR molecule-core#508 +
plugin repo's .gitattributes) cleared the primary symptom,
"(no response generated)". Before that, agents crashed before the
missing-MCP issue was observable, so the two bugs stacked.
Changes
-------
* executor_helpers._default_mcp_server_path: resolves the installed
molecule_runtime.a2a_mcp_server module's __file__ so the path is
always correct regardless of template layout. Legacy /app path kept
as last-resort fallback for any old images still in rotation.
* a2a_mcp_server.py, a2a_tools.py, a2a_client.py: convert bare module
imports (from a2a_tools import ...) to absolute
(from molecule_runtime.a2a_tools import ...). Previously this worked
only when main.py injected the package dir onto sys.path; the MCP
subprocess doesn't go through main.py, so the bare imports would fail.
Added a sys.path shim at the top of a2a_mcp_server.py so running as a
standalone script (python path/to/a2a_mcp_server.py) still works —
the subprocess can now locate the package root automatically.
* consolidation.py, heartbeat.py, main.py: same bare-to-absolute
conversion for platform_auth imports (unblocks the same class of
failure if any of these modules are imported from a non-main.py
entrypoint in the future).
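One way to resolve the script path from the installed package is sketched below. It uses importlib.util.find_spec rather than importing the module and reading __file__, which is a substitution made for this example; default_server_path is a hypothetical name.

```python
import importlib.util

# Legacy location, kept as a last-resort fallback.
LEGACY_PATH = "/app/a2a_mcp_server.py"


def default_server_path(
    module_name: str = "molecule_runtime.a2a_mcp_server",
    legacy_path: str = LEGACY_PATH,
) -> str:
    """Return the installed module's file path, or the legacy path if absent."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is missing.
        spec = None
    if spec is not None and spec.has_location and spec.origin:
        return spec.origin
    return legacy_path
```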
Verification
------------
Deployed the updated files into ws-8010dbd0 (PM) and ran an isolated
sdk.query() as agent user. SystemMessage.init.mcp_servers now reports
[{'name': 'a2a', 'status': 'connected'}] and the tools list includes
all 8 mcp__a2a__* entries:
mcp__a2a__check_task_status, mcp__a2a__commit_memory,
mcp__a2a__delegate_task, mcp__a2a__delegate_task_async,
mcp__a2a__get_workspace_info, mcp__a2a__list_peers,
mcp__a2a__recall_memory, mcp__a2a__send_message_to_user
Rolled the in-container hotfix across all 22 workspaces pending release
(docker cp the 4 changed files into each site-packages/molecule_runtime/).
Fixes Molecule-AI/molecule-core#507 (secondary)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Symptom (cycle 6+ of #488)
Workspaces appear `online` (heartbeats fine) but every cron tick fails
silently with `No conversation found with session ID: <uuid>` →
`ProcessError: exit code 1` → idle loop logs HTTP 200, no actual work
happens. Backend Engineer received 5 idle pulses without claiming a
single one of the 6 open Hermes issues (#496-500) because the bug
prevents `gh issue list` from ever firing.
## Root cause (verified live in ws-20cb8ff8-3e4 today)
claude-code stores sessions at `/root/.claude/projects/<cwd-with-/-as-->/<id>.jsonl`.
When a workspace container is recreated, `self._session_id` from a
prior instance references a file that no longer exists. Passing it as
`resume=<id>` to ClaudeAgentOptions crashes the CLI on the very first
call. The existing #75 fix only fires AFTER the first ProcessError
lands, and per-cycle executor re-instantiation can reload the stale id
from elsewhere — restart-with-reset_claude_session was the only working
mitigation, hand-fired every cycle.
## Fix
New `_resolve_resume()` in ClaudeSDKExecutor: probes a handful of
well-known session-file locations (`/root/.claude/projects/*/<id>.jsonl`,
`/root/.claude/sessions/<id>.jsonl`, plus the agent-uid variants) via
`glob.glob`. If no file matches the in-memory `_session_id`, drops the
id (sets to None) AND returns None so `ClaudeAgentOptions.resume` is
unset — CLI starts a fresh session. Logged at INFO with `#488` in the
message so operators correlate.
`_build_options()` now calls `_resolve_resume()` instead of reading
`self._session_id` directly. Cheap path when no session set: zero
glob calls. Hot path (session set + file exists): one glob call,
short-circuits on first match.
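A rough sketch of the gate, assuming a small holder class in place of ClaudeSDKExecutor. The two patterns are the ones named above; the agent-uid variants are left out because their paths are not given here.

```python
import glob
import logging
from typing import Optional, Sequence

logger = logging.getLogger(__name__)

DEFAULT_PATTERNS = (
    "/root/.claude/projects/*/{sid}.jsonl",
    "/root/.claude/sessions/{sid}.jsonl",
)


class SessionHolder:
    """Hypothetical stand-in for the executor's session state."""

    def __init__(
        self,
        session_id: Optional[str] = None,
        patterns: Sequence[str] = DEFAULT_PATTERNS,
    ) -> None:
        self._session_id = session_id
        self._patterns = patterns

    def _resolve_resume(self) -> Optional[str]:
        """Return the session id only if a matching session file exists."""
        if not self._session_id:
            return None  # cheap path: no glob calls
        sid = glob.escape(self._session_id)
        for pattern in self._patterns:
            if glob.glob(pattern.format(sid=sid)):
                return self._session_id  # stop at the first match
        logger.info(
            "#488: no session file for %s; starting a fresh session",
            self._session_id,
        )
        self._session_id = None
        return None
```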
## Drive-by fix: stale `from X import` in 4 modules
Same regression class as #1 (the runtime release that closed it):
- `claude_sdk_executor.py:43`: `from executor_helpers import …`
- `cli_executor.py:39-40`: `from config import …`, `from executor_helpers import …`
- `main.py:28-30`: `from config import …`, `from heartbeat import …`, `from preflight import …`
- `preflight.py:7`: `from config import …`
All rewritten to absolute `from molecule_runtime.<module> import …`
so they resolve outside of workspace containers (e.g. test environments
where `/app` isn't on sys.path). The grep guard in `tests/test_imports.py`
already covered `adapters` — extending to all top-level imports would
catch this class going forward; not in this PR to keep scope tight.
## Tests
6 new in `tests/test_session_resume_gate.py`:
- baseline (no session) → no glob, returns None
- file exists → keep id, returns id, single glob (early-exit)
- file missing → drop id (clears `_session_id`), returns None
- late-pattern match → walks all patterns until hit
- log includes session id (operator triage)
- log references #488 (debugger discoverability)
All 16 tests (10 existing + 6 new) pass.
## Release plan
- Bump version 0.1.1 → 0.1.2 (in this commit)
- After merge, push v0.1.2 tag → publish.yml auto-publishes to PyPI
- Then rebuild workspace template images locally so workspaces pick up the
fix (templates pin `>=0.1.0`, will resolve to 0.1.2 on next build)
- Then mass-restart workspaces with reset_claude_session=true once to clear
any DB-side stale state, and the permanent fix kicks in
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every modular workspace template repo (claude-code, hermes, langgraph,
…) was crashing on boot with:
KeyError: "Unknown runtime '<runtime>'. Available: "
Root cause: `molecule_runtime/main.py` and four other modules used
top-level imports like `from adapters import get_adapter` — a monorepo
legacy that resolved when something on sys.path had an `adapters/`
package. Standalone template repos COPY only `adapter.py` (singular) to
/app and don't ship an `adapters/` package, so this import path went
through some side-resolution that left `get_adapter` unable to see the
user's adapter. The ADAPTER_MODULE → import → getattr → issubclass
chain then silently fell through to the discovery branch and reported
"Unknown runtime".
Fix is one-line per file: `from adapters` → `from molecule_runtime.adapters`
in:
- molecule_runtime/main.py:27
- molecule_runtime/a2a_executor.py:44
- molecule_runtime/coordinator.py:20
- molecule_runtime/prompt.py:6
- molecule_runtime/builtin_tools/temporal_workflow.py:417
Tests + CI added so this regression class is caught at PR time, not at
runtime in self-hosters' clusters:
- tests/test_imports.py: parametrised import smoke for every previously
affected module + a grep guard that fails if any future change
reintroduces a top-level `from adapters` / `import adapters` line
- .github/workflows/ci.yml: runs the smoke on every PR (no CI existed
before — the publish workflow only fires on tag push)
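The grep guard can be approximated with a per-line regex, as in this sketch. The pattern and the function name are assumptions, not the contents of tests/test_imports.py.

```python
import re
from typing import List

# Matches `from adapters ... import` and `import adapters`, but not
# `from molecule_runtime.adapters import ...` or `import adapters_extra`.
_BARE_ADAPTERS_IMPORT = re.compile(
    r"^\s*(?:from\s+adapters(?:\.[\w.]*)?\s+import\b|import\s+adapters\b)"
)


def find_bare_adapter_imports(source: str) -> List[str]:
    """Return the offending lines, stripped, in source order."""
    return [
        line.strip()
        for line in source.splitlines()
        if _BARE_ADAPTERS_IMPORT.match(line)
    ]
```

A test can read each module's source and assert that this list is empty.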
Closes #1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extracts shared workspace runtime from molecule-monorepo/workspace-template
into a publishable PyPI package.
- molecule_runtime/ package with all shared infrastructure modules
- Adapter discovery via ADAPTER_MODULE env var (standalone repos) + built-in scan
- molecule-runtime console script entry point (main_sync)
- CI workflow to publish on version tags
- Published to PyPI as molecule-ai-workspace-runtime==0.1.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>