PR #52 fixed the empty '[A2A_ERROR] ' suffix but didn't bump the
version — the fix landed on main without a corresponding PyPI
release, so workspace-template rebuilds keep pulling 0.1.14 and the
fix never reaches running agents.
Bump to 0.1.15 to trigger the publish-on-tag workflow (maintainer
pushes v0.1.15 tag after staging→main promotion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI does not install pytest-asyncio — follow test_shared_runtime.py's
_run(coro) helper pattern. Tests still cover the same two paths (bare
exception class-name fallback + message passthrough) but no longer
require the async pytest plugin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an exception's str() is empty (bare TimeoutError(), BrokenPipeError(),
some httpx transport errors), `f"{_A2A_ERROR_PREFIX}{e}"` produced
`"[A2A_ERROR] "` with a trailing space and zero diagnostic context,
masking the real cause of peer-delegation failures in activity_logs.
Observed on main monorepo: 22+ occurrences in 75 min across 7 leads
during the MiniMax M2.7 trial rate-limit episode — zero breadcrumbs
to route the debug from.
Fix:
- Exception branch: fall back to `type(e).__name__` when str(e) is empty
- Error branch: include JSON-RPC `error.code` alongside message when present
Tests: test_a2a_error_observability.py covers both the bare-exception
path (must surface class name) and the message-passthrough path (must
preserve existing useful messages).
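
A minimal sketch of the two branches described above, not the shipped code.
The function names and the `(code=...)` suffix format are assumptions made
for illustration; only the prefix string and the class-name fallback come
from this message.

```python
A2A_ERROR_PREFIX = "[A2A_ERROR] "  # prefix string quoted in this message


def format_a2a_exception(exc: BaseException) -> str:
    """Exception branch: never return the bare prefix."""
    detail = str(exc).strip()
    if not detail:
        # Bare exceptions such as TimeoutError() stringify to "",
        # so fall back to the class name.
        detail = type(exc).__name__
    return f"{A2A_ERROR_PREFIX}{detail}"


def format_a2a_rpc_error(error: dict) -> str:
    """Error branch: include JSON-RPC error.code when it is present."""
    message = str(error.get("message") or "unknown error")
    code = error.get("code")
    if code is not None:
        # Hypothetical suffix format, chosen for this sketch only.
        return f"{A2A_ERROR_PREFIX}{message} (code={code})"
    return f"{A2A_ERROR_PREFIX}{message}"
```

Existing non-empty messages pass through unchanged, so callers that already
parse the text after the prefix should see no difference.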
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trace from molecule-core cycle 107 (2026-04-24): 15 staging PRs stuck
DIRTY (real merge conflicts) with 0 merges in 1+ hours. Authors couldn't
rebase to fix the conflicts because the pre-commit hook (shipped in
0.1.11) refuses ANY commit that includes forbidden paths in the diff —
including rebase replays of historical commits that pre-date the gate.
Specifically, agents trying to `git rebase staging` on a PR like
"docs(marketing): Phase 30 social copy" fail at the first commit replay
because that commit added marketing/* files. The fix would require
interactive rebase + manual file deletion + commit amend — agents don't
do that, so the PR stays DIRTY indefinitely.
Detection: check .git for rebase-merge/, rebase-apply/, CHERRY_PICK_HEAD,
MERGE_HEAD, or REVERT_HEAD. These state markers exist only during the
corresponding git operation. Skip the hook silently when present.
The hook still blocks fresh `git commit` (the failure mode it was
designed for). It just doesn't try to police what was already in git
history.
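
The shipped hook is a bash script; the detection rule can be sketched in
Python as below. The function name is hypothetical, and the sketch assumes
the caller passes the resolved git directory (for example the output of
`git rev-parse --git-dir`), since `.git` can be a file in worktrees.

```python
from pathlib import Path

# Marker names taken from the detection list above.
_IN_PROGRESS_MARKERS = (
    "rebase-merge",
    "rebase-apply",
    "CHERRY_PICK_HEAD",
    "MERGE_HEAD",
    "REVERT_HEAD",
)


def git_operation_in_progress(git_dir: Path) -> bool:
    """Return True while a rebase, cherry-pick, merge, or revert is underway."""
    return any((git_dir / name).exists() for name in _IN_PROGRESS_MARKERS)
```

A hook built on this check would exit 0 early when the function returns
True and run the forbidden-path scan otherwise.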
Bumped to 0.1.14.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sister fix to 0.1.12 (root mounting). After fixing the route mount,
every inbound A2A still returned `-32601 Method not found` because the
1.x dispatcher's method table doesn't recognize v0.3-shaped names
(`message/send`, `tasks/get`) that the platform's ProxyA2A still sends.
Reproduces in the SDK on a minimal handler:
    create_jsonrpc_routes(h, "/") → "Method not found"
    create_jsonrpc_routes(h, "/", enable_v0_3_compat=True) → dispatches OK
Bumped to 0.1.13. Both 0.1.12 and 0.1.13 are needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Baseline restart 2026-04-24: every workspace came up healthy (uvicorn
listening, agent-card serving) but produced zero delegations for two
maintenance cycles. Tracing revealed platform's ProxyA2A POSTs to
`http://ws-<id>:8000/` (no path suffix, see
workspace-server/internal/provisioner.InternalURL) while the runtime's
JSON-RPC routes were mounted at `/api/v1/jsonrpc/` under the a2a-sdk
1.x API migration.
Result was silent — every inbound A2A returned 404 Not Found, the
platform logged "Not Found" at INFO level, but no error bubbled up
because the SDK's jsonrpc route factory doesn't respond to root when
mounted at a subpath. Agents stayed warm, crons fired, but no work
flowed.
Fix: `create_jsonrpc_routes(handler, "/")` — matches platform
expectation and the agent-card self-advertisement (which also shows
root as the JSON-RPC URL). Agent-card route keeps its hard-coded
`/.well-known/agent-card.json` path so there's no collision.
Bumped to 0.1.12.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anti-leak proposal item A. Companion to D (decision tree in role
prompts, separate PR on org-templates).
Why a local pre-commit hook
===========================
Agents try to `git add /research/foo.md` despite SHARED_RULES, the
.gitignore patterns, and the CI gate. Each leak attempt costs ~5 cycles
(PR opens, CI fails, agent retries with workaround) and pollutes git
history with reverts.
A pre-commit hook converts the failure from "PR opens then fails" →
"commit refused immediately, with the recovery command printed in the
same error message the agent reads." Agents act on what's in the
current response context — putting the redirect command literally in
the failure output is the highest-density feedback we can provide.
What changes
============
- molecule_runtime/scripts/pre-commit-block-internal-paths.sh:
  bash hook. Checks `git remote get-url origin` and only enforces in
  Molecule-AI/molecule-monorepo + molecule-core. In every other repo
  (internal, plugins, templates, third-party) it's a no-op.
  When forbidden paths are staged, refuses the commit with the redirect
  recipe + the alternative public-facing paths + the workflow-edit path
  for legitimate exceptions.
- molecule_runtime/precommit_hook.py: install_pre_commit_hook():
    1. Extracts bundled hook to ~/.molecule-runtime/git-hooks/pre-commit
    2. chmod +x
    3. Sets core.hooksPath globally, UNLESS already set by an operator
       (then logs a warning + skips, doesn't clobber)
- molecule_runtime/main.py: calls install_pre_commit_hook() at
  step 0.2, right after install_credential_helper()
- pyproject.toml bumped to 0.1.11
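
The "don't clobber an operator's core.hooksPath" rule can be sketched as
follows. This is an illustration under assumptions, not the code in
precommit_hook.py: `configure_hooks_path` and the injectable `get`/`set_`
callables are hypothetical, and treating a value equal to our own path as
safe to rewrite is a choice made for this sketch.

```python
import subprocess
from typing import Callable, Optional


def _git_config_get(key: str) -> Optional[str]:
    """Return the global git config value for key, or None when unset."""
    proc = subprocess.run(
        ["git", "config", "--global", "--get", key],
        capture_output=True, text=True, check=False,
    )
    value = proc.stdout.strip()
    return value if proc.returncode == 0 and value else None


def _git_config_set(key: str, value: str) -> None:
    subprocess.run(["git", "config", "--global", key, value], check=True)


def configure_hooks_path(
    hooks_dir: str,
    get: Callable[[str], Optional[str]] = _git_config_get,
    set_: Callable[[str, str], None] = _git_config_set,
) -> bool:
    """Point core.hooksPath at hooks_dir unless someone else already set it.

    Returns True when the setting was written, False when it was left alone.
    """
    existing = get("core.hooksPath")
    if existing and existing != hooks_dir:
        # An operator chose a different hooks directory: skip, do not clobber.
        return False
    set_("core.hooksPath", hooks_dir)
    return True
```

Passing the getter and setter in keeps the decision testable without a git
binary, which also fits the fail-soft startup behavior.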
Both A and D together close the loop: D ensures the agent knows the
right path before writing; A enforces it at the local git boundary if
the agent forgets. CI gate remains the third backstop for anything
that gets pushed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lifts the per-template wiring (Dockerfile COPY + entrypoint.sh git config
+ nohup daemon launch) into the Python runtime. Templates that depend
on molecule-ai-workspace-runtime get the behavior automatically — they
no longer need to maintain their own copy of the helper scripts or
remember to write the right git config in their entrypoint.
Background:
- GitHub App installation tokens (ghs_…) expire ~60min after issue
- claude-code-default template shipped without wiring → 39 workspaces
lost their tokens, three PMs' A2A queues filled with retry-status
messages, manual fleet restart required (cycle 62-66 incident)
This commit:
- Adds molecule_runtime/scripts/{molecule-git-token-helper.sh,
  molecule-gh-token-refresh.sh} as package data (copies from canonical
  workspace/scripts/ in molecule-monorepo)
- Adds molecule_runtime/credential_helper.py with
  install_credential_helper() that:
    1. Extracts bundled scripts to ~/.molecule-runtime/scripts/
    2. Configures git credential.helper for github.com
    3. Creates ~/.molecule-token-cache/ mode 0700
    4. Spawns refresh daemon under respawn loop (PID file dedup)
    5. Runs initial gh auth login --with-token
- Hooks call site early in main.py (step 0.1, before config load)
- Fail-soft: each step is independently fault-tolerant; a missing git/gh
  binary doesn't block runtime startup
Bumped to 0.1.10. Templates can drop their entrypoint.sh credential
helper setup once they update the runtime pin (separate PRs per template).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Lower _PROCESS_ERROR_STDERR_MAX_CHARS to 1024 (was 4096) so A2A
responses stay bounded — the full context is already in workspace logs
via logger.error/exception.
- Add stderr= kwarg to sanitize_agent_error() so callers can surface
subprocess stderr verbatim in A2A responses.
- In _execute_locked() non-retryable error path, extract the first 1 KB
of exc.stderr and pass it to sanitize_agent_error() so the A2A
response carries actionable context (rate limit message, auth error,
etc.) instead of just a class name.
- Add test_executor_helpers.py unit tests for the new stderr= kwarg.
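
A rough sketch of the bounded-stderr idea, assuming a simplified signature.
The real sanitize_agent_error() is not shown in this message, so the return
format and the bytes handling below are assumptions; only the 1024-char cap
and the stderr= keyword come from the notes above.

```python
from typing import Optional, Union

_PROCESS_ERROR_STDERR_MAX_CHARS = 1024  # value stated above (was 4096)


def sanitize_agent_error(
    exc: BaseException,
    stderr: Optional[Union[str, bytes]] = None,
) -> str:
    """Return the exception class name plus a bounded stderr excerpt."""
    summary = f"Agent error: {type(exc).__name__}"  # hypothetical format
    if isinstance(stderr, bytes):
        # Subprocess stderr may arrive as bytes; decode without raising.
        stderr = stderr.decode("utf-8", errors="replace")
    excerpt = (stderr or "").strip()[:_PROCESS_ERROR_STDERR_MAX_CHARS]
    if excerpt:
        return f"{summary}\n{excerpt}"
    return summary
```

Truncating to the head of stderr matches the "first 1 KB" wording; whether
the tail would be more useful for a given CLI is left open here.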
CI doesn't have pytest-asyncio installed, and the async wrapping was
incidental — the production retry pattern (refresh-on-401) is identical
in sync and async forms. Switching to httpx.Client + MockTransport keeps
the same coverage without the async dep.
6/6 still pass locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Problem
Auto-restart rotates the workspace's auth token in two non-atomic steps:
1. Platform issues new token via wsauth.IssueToken
2. Provisioner writes the new token to /configs/.auth_token AFTER
ContainerStart returns
Between steps 1 and 2, the new container has booted and the runtime has
already loaded the OLD cached value of .auth_token (or no value if the
file was empty during boot). The runtime's first /registry/heartbeat
call sends the stale token, gets 401, but the loop never re-reads the
on-disk token — so subsequent heartbeats also send the stale value.
Each 401 means the platform never sees the workspace as alive →
status stays 'provisioning' → scheduler won't dispatch → workspace
looks dead from every angle even though the container is actually
running.
The existing code comment in workspace_provision.go acknowledges this:
"the workspace will get 401 on its first heartbeat and can recover on
the next restart." That recovery only worked because workspaces used
to crash for unrelated reasons and get restarted. After PR #1861
(provisioner empty-volume auto-recover) removed those crashes,
workspaces get stuck in the 401 loop with no exit.
## Fix
Two-part runtime-side fix in molecule-ai-workspace-runtime:
1. **platform_auth.refresh_from_disk()** — new helper that clears the
in-memory cache and re-reads /configs/.auth_token. Returns the
fresh value (or None if missing). Updates the cache as a side effect.
2. **HeartbeatLoop._loop()** — on 401 from /registry/heartbeat, calls
refresh_from_disk() and retries the request ONCE with the new token.
Same pattern in _check_delegations(). Bounded retry budget — if the
on-disk token is also stale (bug elsewhere), no infinite loop.
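
The refresh-and-retry-once pattern can be sketched without any HTTP client.
In this illustration `send` is a hypothetical callable that performs the
request with a given token and returns the status code, and
`post_with_401_retry` is an invented name; the actual HeartbeatLoop code is
not reproduced here.

```python
from pathlib import Path
from typing import Callable, Optional

_cached_token: Optional[str] = None


def refresh_from_disk(token_path: Path) -> Optional[str]:
    """Clear the in-memory cache and re-read the token file."""
    global _cached_token
    _cached_token = None
    try:
        value = token_path.read_text().strip()
    except OSError:
        # Missing or unreadable file: the cache stays cleared.
        return None
    _cached_token = value or None
    return _cached_token


def post_with_401_retry(
    send: Callable[[Optional[str]], int],
    token: Optional[str],
    token_path: Path,
) -> int:
    """Call send(token); on 401, refresh from disk and retry exactly once."""
    status = send(token)
    if status != 401:
        return status
    fresh = refresh_from_disk(token_path)
    if fresh is None:
        return status
    # Single retry: if the on-disk token is also stale, the 401 is returned
    # to the caller instead of looping.
    return send(fresh)
```

The happy path makes one call and never touches the disk, which is why the
fix costs nothing when the cached token is valid.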
## Tests
6/6 new tests in tests/test_token_refresh_1877.py:
- refresh_picks_up_rotated_token — happy path
- refresh_returns_none_when_file_missing — defensive
- refresh_clears_stale_cache_when_file_disappears
- refresh_is_idempotent
- 401_retry_pattern_uses_refreshed_token — the production fix path
- 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget
All pass locally on Python 3.13 + pytest 9.
## Why this fix and not the alternatives
- **Alternative B (platform writes token before ContainerStart):**
Right architecturally but invasive — needs provisioner refactor to
prep volumes before docker run.
- **Alternative C (skip rotation on auto-restart):** Breaks the
multi-instance-safety invariant the existing code calls out
(revoke prevents stale tokens from sister deployments).
- **This fix (A):** 3-line core change + helper. Self-healing for any
timing edge case, not just the post-restart one. Costs nothing in
the happy path (only triggers on 401).
## Version
Bumped to 0.1.9. Once published to PyPI + workspace template image
rebuilt, deployed workspaces auto-recover from token-rotation races
without operator intervention.
Closes #1877.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review findings on #38:
1. **Token substring leak**: the "unknown prefix" warning included the
first 12 chars of the token in the log message. Logs get shipped to
Langfuse / CloudWatch / slack-firehose — 12 bytes of a secret in a
log is still 12 bytes too many. Warning no longer references the
token value at all.
2. **Base-URL substring match was too loose**: `"anthropic.com" not in
base` would accept `https://proxy.anthropic.com.evil.example/` as
"looks like Anthropic, keep the URL." Replaced with an allowlist of
exact hostnames parsed via urllib.parse.urlparse.
3. **Whitespace in pasted tokens**: operators frequently paste tokens
from terminals with a trailing newline. The token would flow through
startswith() detection but then fail downstream auth with a
confusing "malformed token" error. Strip and persist the cleaned
value.
4. **Malformed base URL crash guard**: if someone sets ANTHROPIC_BASE_URL
to something urlparse can't handle, don't crash — fall through to
clearing it, which is the safe choice in OAuth mode.
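
Findings 2 and 4 can be illustrated with an exact-hostname check. The
allowlist contents and the function name below are assumptions for this
sketch; the real allowlist is not listed in this message.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; the shipped list of hostnames may differ.
_ANTHROPIC_HOSTS = frozenset({"api.anthropic.com"})


def is_anthropic_base_url(base_url: str) -> bool:
    """Return True only when the URL's hostname is exactly allowlisted."""
    try:
        host = urlparse(base_url.strip()).hostname
    except ValueError:
        # urlparse rejects some malformed inputs (e.g. an unclosed IPv6
        # bracket); treat those as "not Anthropic" instead of crashing.
        return False
    return host is not None and host in _ANTHROPIC_HOSTS
```

Comparing the parsed hostname avoids the substring trap: a host such as
`proxy.anthropic.com.evil.example` contains the trusted name but is not
equal to it.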
Added 5 new tests covering each of the above. 16/16 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Platform stores per-workspace LLM credentials under a single key
(ANTHROPIC_AUTH_TOKEN in workspace_secrets). But downstream tools
expect different env var names depending on the token type:
    sk-ant-oat01-*  → CLAUDE_CODE_OAUTH_TOKEN  (Claude Code OAuth session)
    sk-ant-api03-*  → ANTHROPIC_API_KEY        (direct Anthropic API)
    sk-cp-*         → ANTHROPIC_AUTH_TOKEN     (proxy: MiniMax, gateways)
Without normalisation, an OAuth token under ANTHROPIC_AUTH_TOKEN gets
sent as a bearer to api.anthropic.com, which responds:
401 authentication_error: OAuth authentication is currently not
supported.
This was a platform-wide footgun: anyone rotating LLM keys had to
know the exact env var for each token type, AND make sure stale
overrides were cleared, AND set ANTHROPIC_BASE_URL correctly for
proxies (or NOT set for native Claude). Nothing downstream could
help — the SDK just saw the wrong var.
Fix:
- New molecule_runtime/llm_auth.py — normalise_llm_env() mutates
os.environ (or any dict) to the correct shape based on token
prefix. Returns a NormalisationResult for logging.
- main.py calls it as step 0, before any adapter/executor import.
Every adapter (claude-code, langgraph, crewai, autogen, hermes,
…) benefits automatically — no per-adapter branching needed.
- 11 unit tests covering all prefix paths, edge cases, and the
"operator deliberately set CLAUDE_CODE_OAUTH_TOKEN" precedence
rule.
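
A simplified sketch of prefix-based normalisation follows. It is not the
contents of llm_auth.py: it returns the target variable name rather than a
NormalisationResult, it ignores ANTHROPIC_BASE_URL, and it assumes that an
operator-set target variable wins and that the shared slot is removed once
the token has been moved. Those three points are assumptions.

```python
from typing import MutableMapping, Optional

# Prefix -> target variable, copied from the mapping above.
_PREFIX_TO_VAR = (
    ("sk-ant-oat01-", "CLAUDE_CODE_OAUTH_TOKEN"),
    ("sk-ant-api03-", "ANTHROPIC_API_KEY"),
    ("sk-cp-", "ANTHROPIC_AUTH_TOKEN"),
)


def normalise_llm_env(env: MutableMapping[str, str]) -> Optional[str]:
    """Move ANTHROPIC_AUTH_TOKEN to the variable its prefix calls for.

    Returns the variable that now holds the token, or None when there is
    no token or the prefix is not recognised.
    """
    token = env.get("ANTHROPIC_AUTH_TOKEN", "").strip()
    if not token:
        return None
    for prefix, target in _PREFIX_TO_VAR:
        if token.startswith(prefix):
            if target == "ANTHROPIC_AUTH_TOKEN":
                env["ANTHROPIC_AUTH_TOKEN"] = token  # persist stripped value
            else:
                del env["ANTHROPIC_AUTH_TOKEN"]
                # Assumed precedence: an explicit operator value is kept.
                env.setdefault(target, token)
            return target
    return None
```

Because the function accepts any mutable mapping, tests can pass a plain
dict instead of mutating os.environ.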
Operationally: this means operators can keep using one
ANTHROPIC_AUTH_TOKEN slot in platform settings and just paste
whatever token the agent needs. No env-var-name awareness required.
Tested locally: 11/11 new tests pass. 83 other tests unchanged
(pre-existing failures on staging are all unrelated:
test_workspace_id_validation, test_a2a_mcp_server RBAC, the
test_imports.main module-walker — same signature as on staging
HEAD before this PR).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes #1372 (phantom busy): canvas showed workspace as active for up
to 30s after task completion because set_current_task("") returned
early without posting the updated heartbeat.
Before: clearing only updated the heartbeat object; the next 30s
scheduled heartbeat cycle propagated the clear. Quick tasks would leave
a phantom-busy indicator.
After: both SET and CLEAR push immediately to /registry/heartbeat.
active_tasks=0 on clear, active_tasks=1 on set. Heartbeat object
update and HTTP post are now unconditional.
Tests: 5 new cases covering SET/CLEAR HTTP body, error resilience,
None heartbeat, and missing env vars.
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Both set_current_task() implementations (shared_runtime.py + executor_helpers.py):
- Increment active_tasks on task start, decrement on completion (was binary 0/1)
- Push heartbeat immediately on BOTH increment AND decrement
- Only clear current_task when active_tasks reaches 0 (preserves description
for still-running tasks)
Fixes phantom-busy: the old code returned early on clear, leaving
active_tasks=1 in the platform DB until the next 30s heartbeat cycle.
If a new cron fired before the heartbeat, the workspace appeared
permanently busy — required manual DB reset every 30 min.
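
The counter semantics can be sketched as below. `TaskTracker` and the
`push` callback are hypothetical stand-ins for the heartbeat object and the
HTTP post; the sketch is also not thread-safe, so a real implementation
with concurrent tasks would likely need a lock.

```python
from typing import Callable, Optional


class TaskTracker:
    """Counts active tasks and pushes a heartbeat on every change."""

    def __init__(self, push: Callable[[int, Optional[str]], None]) -> None:
        self._push = push
        self.active_tasks = 0
        self.current_task: Optional[str] = None

    def set_current_task(self, description: str) -> None:
        if description:
            # Task start: increment and record the latest description.
            self.active_tasks += 1
            self.current_task = description
        else:
            # Task completion: decrement, never below zero.
            self.active_tasks = max(0, self.active_tasks - 1)
            if self.active_tasks == 0:
                # Only clear the description once nothing is running.
                self.current_task = None
        # Push immediately on both increment and decrement.
        self._push(self.active_tasks, self.current_task)
```

With two overlapping tasks, the first completion leaves active_tasks at 1
and keeps the description, so the workspace still shows as busy until the
second completion.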
Bump: 0.1.2 → 0.1.3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SaaS tenant platform's TenantGuard middleware rejects cross-org
routing with synthetic 404s unless the request carries
X-Molecule-Org-Id matching the tenant's MOLECULE_ORG_ID env var. The
runtime never sent it, so every non-allowlisted workspace→platform
path (memories, delegations, notify, a2a, update-card, peers...)
404'd. Paired with CP change feat/workspace-export-org-id which
injects MOLECULE_ORG_ID into workspace user-data env.
auth_headers() now returns both headers — the existing Authorization
bearer AND the new X-Molecule-Org-Id — so every caller that already
threads auth_headers() through httpx picks it up for free. Self-
hosted deployments with MOLECULE_ORG_ID unset keep the old behavior
(no header, TenantGuard is a no-op).
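
A minimal sketch of the header construction, assuming a simplified
signature: here the token and environment are passed in explicitly, whereas
the real auth_headers() takes its inputs from the runtime's own state.

```python
import os
from typing import Dict, Mapping, Optional


def auth_headers(
    token: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Build the bearer header plus X-Molecule-Org-Id when an org is set."""
    headers: Dict[str, str] = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    org_id = env.get("MOLECULE_ORG_ID", "").strip()
    if org_id:
        # Omitted entirely on self-hosted deployments with no org ID.
        headers["X-Molecule-Org-Id"] = org_id
    return headers
```

Leaving the header out when MOLECULE_ORG_ID is unset keeps self-hosted
requests byte-for-byte identical to the previous behavior.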
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #32 wrapped all platform URL construction sites with
get_validated_workspace_id() but missed a2a_cli.discover(), which
passed the raw unvalidated WORKSPACE_ID in the X-Workspace-ID header.
All other functions (peers, info) had try/except guards added.
discover() now calls get_validated_workspace_id() upfront and returns
None (printing the error) if validation fails — consistent with the
best-effort error handling pattern used elsewhere in the module.
Tests: 2 new cases in TestA2aCliDiscoverValidation covering empty
and slash-injected WORKSPACE_ID values.
Follow-up to: PR #32 (fix/908-add-namespace-param-commit-memory)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #31 added `-ll --severity-level=high` but these flags conflict:
- `-ll` is the counted form of `-l/--level` and reports medium-severity issues and above
- `--severity-level=high` suppresses everything but high-severity issues
The combination causes bandit to exit 2 because `--severity-level` is
not allowed alongside `-l/--level`. Use `--severity-level=high` alone.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #29 introduced WORKSPACE_ID validation at module import time
(platform_auth.py). The CI environment did not set WORKSPACE_ID,
causing 8 failures + 13 errors on every main push. Add a dummy
CI-only value so imports succeed without affecting real workspaces.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add get_validated_workspace_id() to all 6 remaining unguarded URL positions
in molecule_runtime/a2a_tools.py (the MCP tool body implementations):
- report_activity(): /workspaces/{id}/activity + heartbeat
- tool_delegate_task_async(): /workspaces/{id}/delegate
- tool_check_task_status(): /workspaces/{id}/delegations
- tool_send_message_to_user(): /workspaces/{id}/notify
- tool_commit_memory(): /workspaces/{id}/memories (POST)
- tool_recall_memory(): /workspaces/{id}/memories (GET)
All 6 functions now use validated ws_id. The last remaining unguarded
WORKSPACE_ID use in the entire molecule_runtime package is in
builtin_tools/telemetry.py:142 (metric service name — not a URL path,
low security risk). 67/67 tests pass.
Tests: 37 new test cases in tests/test_validation.py covering:
- Valid ID patterns (6): normal IDs, underscores, dots, max-length (256)
- Empty/missing (1): raises with "empty" in message
- Invalid chars (10): / \ .. # ? & whitespace
- Caching (2): result is cached; raises on repeated bad calls
- Error type (1): WorkspaceIdValidationError is a ValueError subclass
Fix: regex now uses negative lookahead `(?!.*\.\.)` to reject ".." anywhere
in the string (not just at the start). The old pattern `^[A-Za-z0-9_\-.]{1,256}$`
matched ".." literally because two dots ARE in the allowed character class.
Also adds test cases for embedded ".." (ws..example, ws../etc).
Fixes: the ".." bypass was a gap in the original CWE-20 fix.
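
The lookahead fix can be sketched as follows. The pattern combines the
character class quoted above with the `(?!.*\.\.)` lookahead; the function
name and error messages are illustrative, and the use of fullmatch (so a
trailing newline cannot slip past `$`) is a choice made for this sketch.

```python
import re

# Allowed characters as quoted above, plus a lookahead rejecting ".."
# anywhere in the string.
_WORKSPACE_ID_RE = re.compile(r"^(?!.*\.\.)[A-Za-z0-9_\-.]{1,256}$")


class WorkspaceIdValidationError(ValueError):
    """Raised when WORKSPACE_ID is empty or malformed."""


def validate_workspace_id(raw: str) -> str:
    """Return raw unchanged when it is a safe URL path segment."""
    if not raw:
        raise WorkspaceIdValidationError("WORKSPACE_ID is empty")
    if not _WORKSPACE_ID_RE.fullmatch(raw):
        raise WorkspaceIdValidationError("WORKSPACE_ID is malformed")
    return raw
```

Single dots remain valid, so an ID like `ws.example` passes while
`ws..example` and `ws../etc` are rejected.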
Extend get_validated_workspace_id() to all remaining unguarded URL positions:
- consolidation.py: _consolidate() — validates before GET/POST/DELETE to
/workspaces/{id}/memories endpoints. Graceful skip on failure (log + return).
- coordinator.py: get_children() — validates before /registry/{id}/peers.
Graceful skip (empty list) on failure.
- molecule_ai_status.py: set_status() — validates before /registry/heartbeat
and /workspaces/{id}/activity. Exits with descriptive error on failure.
With these three, every runtime use of WORKSPACE_ID in a URL path is now
validated. Remaining WORKSPACE_ID uses are:
- JSON body fields (not injection-risky): heartbeat, memory POST bodies
- Header values (X-Workspace-ID): lower risk, non-URL-injection
Fixes remaining unguarded WORKSPACE_ID URL usages identified after the initial
builtin_tools/ fix:
- a2a_client.py: get_peers() and get_workspace_info() now use
get_validated_workspace_id() before URL construction. The raw module-level
constant is still used in the discover_peer() header (low risk, not URL path).
- a2a_cli.py: peers() and info() CLI commands now validate WORKSPACE_ID before
calling the platform API. Commands exit with error code 1 + descriptive
message if WORKSPACE_ID is empty or malformed.
Follow-up candidates (lower priority, not URL injection risk):
- coordinator.py: WORKSPACE_ID in registry peer URL
- consolidation.py: WORKSPACE_ID in memory URLs (long-running consolidation job)
- molecule_ai_status.py: WORKSPACE_ID in activity log URL
Bandit runs on every PR against molecule_runtime/ at high severity.
Addresses audit recommendation from issue #9.
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
WORKSPACE_ID was read via os.environ.get("WORKSPACE_ID", "") in multiple
builtin_tools modules and used directly in platform API URLs and X-Workspace-ID
headers without validation. A crafted ID containing /, .., or # could cause
URL path injection.
Fix: validate_workspace_id() in platform_auth.py now validates the ID format
at module import time using a regex that permits only lowercase alphanumerics
and hyphens (matching UUIDs and org-generated IDs). The validated value is
exposed as a module-level WORKSPACE_ID constant. builtin_tools/approval.py
and builtin_tools/delegation.py now import from platform_auth instead of
reading os.environ directly.
Failing input raises ValueError with a clear message — workspace fails fast
at startup rather than silently accepting malformed IDs in requests.
Add 15 regression tests (45/45 passing total).
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Infra-Runtime-BE <infra-runtime-be@molecule.ai>