CI doesn't have pytest-asyncio installed, and the async wrapping was
incidental — the production retry pattern (refresh-on-401) is identical
in sync and async forms. Switching to httpx.Client + MockTransport keeps
the same coverage without the async dep.
6/6 still pass locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Problem
Auto-restart rotates the workspace's auth token in two non-atomic steps:
1. Platform issues new token via wsauth.IssueToken
2. Provisioner writes the new token to /configs/.auth_token AFTER
ContainerStart returns
Between steps 1 and 2, the new container has booted and the runtime has
already loaded the OLD cached value of .auth_token (or no value if the
file was empty during boot). The runtime's first /registry/heartbeat
call sends the stale token, gets 401, but the loop never re-reads the
on-disk token — so subsequent heartbeats also send the stale value.
Each 401 means the platform never sees the workspace as alive →
status stays 'provisioning' → scheduler won't dispatch → workspace
looks dead from every angle even though the container is actually
running.
The existing code comment in workspace_provision.go acknowledges this:
"the workspace will get 401 on its first heartbeat and can recover on
the next restart." That recovery only worked because workspaces used
to crash for unrelated reasons and get restarted. After PR #1861
(provisioner empty-volume auto-recover) removed those crashes,
workspaces get stuck in the 401 loop with no exit.
## Fix
Two-part runtime-side fix in molecule-ai-workspace-runtime:
1. **platform_auth.refresh_from_disk()** — new helper that clears the
in-memory cache and re-reads /configs/.auth_token. Returns the
fresh value (or None if missing). Updates the cache as a side effect.
2. **HeartbeatLoop._loop()** — on 401 from /registry/heartbeat, calls
refresh_from_disk() and retries the request ONCE with the new token.
Same pattern in _check_delegations(). Bounded retry budget — if the
on-disk token is also stale (bug elsewhere), no infinite loop.
## Tests
6/6 new tests in tests/test_token_refresh_1877.py:
- refresh_picks_up_rotated_token — happy path
- refresh_returns_none_when_file_missing — defensive
- refresh_clears_stale_cache_when_file_disappears
- refresh_is_idempotent
- 401_retry_pattern_uses_refreshed_token — the production fix path
- 401_retry_no_loop_when_disk_token_also_stale — bounded retry budget
All pass locally on Python 3.13 + pytest 9.
## Why this fix and not the alternatives
- **Alternative B (platform writes token before ContainerStart):**
Right architecturally but invasive — needs provisioner refactor to
prep volumes before docker run.
- **Alternative C (skip rotation on auto-restart):** Breaks the
multi-instance-safety invariant the existing code calls out
(revoke prevents stale tokens from sister deployments).
- **This fix (A):** 3-line core change + helper. Self-healing for any
timing edge case, not just the post-restart one. Costs nothing in
the happy path (only triggers on 401).
## Version
Bumped to 0.1.9. Once published to PyPI + workspace template image
rebuilt, deployed workspaces auto-recover from token-rotation races
without operator intervention.
Closes#1877.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-review findings on #38:
1. **Token substring leak**: the "unknown prefix" warning included the
first 12 chars of the token in the log message. Logs get shipped to
Langfuse / CloudWatch / slack-firehose — 12 bytes of a secret in a
log is still 12 bytes too many. Warning no longer references the
token value at all.
2. **Base-URL substring match was too loose**: `"anthropic.com" not in
base` would accept `https://proxy.anthropic.com.evil.example/` as
"looks like Anthropic, keep the URL." Replaced with an allowlist of
exact hostnames parsed via urllib.parse.urlparse.
3. **Whitespace in pasted tokens**: operators frequently paste tokens
from terminals with a trailing newline. The token would flow through
startswith() detection but then fail downstream auth with a
confusing "malformed token" error. Strip and persist the cleaned
value.
4. **Malformed base URL crash guard**: if someone sets ANTHROPIC_BASE_URL
to something urlparse can't handle, don't crash — fall through to
clearing it, which is the safe choice in OAuth mode.
Added 5 new tests covering each of the above. 16/16 tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Platform stores per-workspace LLM credentials under a single key
(ANTHROPIC_AUTH_TOKEN in workspace_secrets). But downstream tools
expect different env var names depending on the token type:
sk-ant-oat01-* → CLAUDE_CODE_OAUTH_TOKEN (Claude Code OAuth session)
sk-ant-api03-* → ANTHROPIC_API_KEY (direct Anthropic API)
sk-cp-* → ANTHROPIC_AUTH_TOKEN (proxy: MiniMax, gateways)
Without normalisation, an OAuth token under ANTHROPIC_AUTH_TOKEN gets
sent as a bearer to api.anthropic.com, which responds:
401 authentication_error: OAuth authentication is currently not
supported.
This was a platform-wide footgun: anyone rotating LLM keys had to
know the exact env var for each token type, AND make sure stale
overrides were cleared, AND set ANTHROPIC_BASE_URL correctly for
proxies (or NOT set for native Claude). Nothing downstream could
help — the SDK just saw the wrong var.
Fix:
- New molecule_runtime/llm_auth.py — normalise_llm_env() mutates
os.environ (or any dict) to the correct shape based on token
prefix. Returns a NormalisationResult for logging.
- main.py calls it as step 0, before any adapter/executor import.
Every adapter (claude-code, langgraph, crewai, autogen, hermes,
…) benefits automatically — no per-adapter branching needed.
- 11 unit tests covering all prefix paths, edge cases, and the
"operator deliberately set CLAUDE_CODE_OAUTH_TOKEN" precedence
rule.
Operationally: this means operators can keep using one
ANTHROPIC_AUTH_TOKEN slot in platform settings and just paste
whatever token the agent needs. No env-var-name awareness required.
Tested locally: 11/11 new tests pass. 83 other tests unchanged
(pre-existing failures on staging are all unrelated:
test_workspace_id_validation, test_a2a_mcp_server RBAC, the
test_imports.main module-walker — same signature as on staging
HEAD before this PR).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes#1372 — phantom busy: canvas showed workspace as active for up
to 30s after task completion because set_current_task("") returned
early without posting the updated heartbeat.
Before: clearing only updated the heartbeat object; the next 30s
scheduled heartbeat cycle propagated the clear. Quick tasks would leave
a phantom-busy indicator.
After: both SET and CLEAR push immediately to /registry/heartbeat.
active_tasks=0 on clear, active_tasks=1 on set. Heartbeat object
update and HTTP post are now unconditional.
Tests: 5 new cases covering SET/CLEAR HTTP body, error resilience,
None heartbeat, and missing env vars.
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
PR #32 wrapped all platform URL construction sites with
get_validated_workspace_id() but missed a2a_cli.discover(), which
passed the raw unvalidated WORKSPACE_ID in the X-Workspace-ID header.
All other functions (peers, info) had try/except guards added.
discover() now calls get_validated_workspace_id() upfront and returns
None (printing the error) if validation fails — consistent with the
best-effort error handling pattern used elsewhere in the module.
Tests: 2 new cases in TestA2aCliDiscoverValidation covering empty
and slash-injected WORKSPACE_ID values.
Follow-up to: PR #32 (fix/908-add-namespace-param-commit-memory)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests: 37 new test cases in tests/test_validation.py covering:
- Valid ID patterns (6): normal IDs, underscores, dots, max-length (256)
- Empty/missing (1): raises with "empty" in message
- Invalid chars (10): / \ .. # ? & whitespace
- Caching (2): result is cached; raises on repeated bad calls
- Error type (1): WorkspaceIdValidationError is a ValueError subclass
Fix: regex now uses negative lookahead `(?!.*\.\.)` to reject ".." anywhere
in the string (not just at the start). The old pattern `^[A-Za-z0-9_\-.]{1,256}$`
matched ".." literally because two dots ARE in the allowed character class.
Also adds test cases for embedded ".." (ws..example, ws../etc).
Fixes: the ".." bypass was a gap in the original CWE-20 fix.
WORKSPACE_ID was read via os.environ.get("WORKSPACE_ID", "") in multiple
builtin_tools modules and used directly in platform API URLs and X-Workspace-ID
headers without validation. A crafted ID containing /, .., or # could cause
URL path injection.
Fix: validate_workspace_id() in platform_auth.py now validates the ID format
at module import time using a regex that permits only lowercase alphanumerics
and hyphens (matching UUIDs and org-generated IDs). The validated value is
exposed as a module-level WORKSPACE_ID constant. builtin_tools/approval.py
and builtin_tools/delegation.py now import from platform_auth instead of
reading os.environ directly.
Failing input raises ValueError with a clear message — workspace fails fast
at startup rather than silently accepting malformed IDs in requests.
Add 15 regression tests (45/45 passing total).
Co-authored-by: Molecule AI Infra-Runtime-BE <infra-runtime-be@agents.moleculesai.app>
Co-authored-by: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
ADAPTER_MODULE resolution required the imported module to export a
class literally named `Adapter`. The claude-code, langgraph, and
openclaw adapter-template repos (3 of 4 currently in production) don't
ship that alias — they export ClaudeCodeAdapter / LangGraphAdapter /
OpenClawAdapter directly. Only hermes has the `Adapter = HermesAdapter`
shim at the bottom of adapter.py.
Consequence in prod: every fresh claude-code / langgraph / openclaw
workspace crashed at runtime startup with
"module 'adapter' has no attribute 'Adapter'", even with a2a-sdk
correctly pinned <1.0. Provisioning looked successful from CP's side
(EC2 ran) but the agent never registered because the process never
reached A2A bootstrap.
Fix: if `Adapter` is absent from the imported module, scan the module
for any attribute that is a proper BaseAdapter subclass (excluding
BaseAdapter itself — regression guard in tests). The explicit alias
remains the preferred contract; this is purely additive tolerance.
Bump to 0.1.4 and publish to PyPI via the existing v* tag trigger.
6 new tests cover: explicit alias, subclass-fallback, non-adapter-noise
ignored, empty module → error, missing module → error, re-exported
BaseAdapter → not selected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue #19 (CWE-C-312): AgentskillsAdaptor.install() passed the full
os.environ to the subprocess running setup.sh, including
ANTHROPIC_API_KEY, OPENAI_API_KEY, GITHUB_TOKEN, WORKSPACE_AUTH_TOKEN,
etc. A malicious or compromised plugin's setup.sh could exfiltrate them.
Fix: _scrubbed_env() builds a copy of os.environ with sensitive keys
removed, matching the same _SCRUB_KEYS list used in skill_loader/loader.py
so the scrubbing policy is consistent. CONFIGS_DIR is still passed via
the extra dict. Non-secret vars (PATH, HOME, etc.) are preserved.
Add 6 regression tests (30/30 passing).
Co-Authored-By: Infra-Runtime-BE <infra-runtime-be@molecule.ai>
CI failed on collect because claude_agent_sdk + a2a aren't test-env deps
(they're installed inside the claude-code workspace image). The test file
now stubs both via sys.modules so the collector can import
claude_sdk_executor without pulling the real SDKs. Tests don't exercise
the SDK anyway — only _resolve_resume() glob logic.
## Symptom (cycle 6+ of #488)
Workspaces appear `online` (heartbeats fine) but every cron tick fails
silently with `No conversation found with session ID: <uuid>` →
`ProcessError: exit code 1` → idle loop logs HTTP 200, no actual work
happens. Backend Engineer received 5 idle pulses without claiming a
single one of the 6 open Hermes issues (#496-500) because the bug
prevents `gh issue list` from ever firing.
## Root cause (verified live in ws-20cb8ff8-3e4 today)
claude-code stores sessions at `/root/.claude/projects/<cwd-with-/-as-->/<id>.jsonl`.
When a workspace container is recreated, `self._session_id` from a
prior instance references a file that no longer exists. Passing it as
`resume=<id>` to ClaudeAgentOptions crashes the CLI on the very first
call. The existing #75 fix only fires AFTER the first ProcessError
lands, and per-cycle executor re-instantiation can reload the stale id
from elsewhere — restart-with-reset_claude_session was the only working
mitigation, hand-fired every cycle.
## Fix
New `_resolve_resume()` in ClaudeSDKExecutor: probes a handful of
well-known session-file locations (`/root/.claude/projects/*/<id>.jsonl`,
`/root/.claude/sessions/<id>.jsonl`, plus the agent-uid variants) via
`glob.glob`. If no file matches the in-memory `_session_id`, drops the
id (sets to None) AND returns None so `ClaudeAgentOptions.resume` is
unset — CLI starts a fresh session. Logged at INFO with `#488` in the
message so operators correlate.
`_build_options()` now calls `_resolve_resume()` instead of reading
`self._session_id` directly. Cheap path when no session set: zero
glob calls. Hot path (session set + file exists): one glob call,
short-circuits on first match.
## Drive-by fix: stale `from X import` in 4 modules
Same regression class as #1 (the runtime release that closed it):
- `claude_sdk_executor.py:43`: `from executor_helpers import …`
- `cli_executor.py:39-40`: `from config import …`, `from executor_helpers import …`
- `main.py:28-30`: `from config import …`, `from heartbeat import …`, `from preflight import …`
- `preflight.py:7`: `from config import …`
All rewritten to absolute `from molecule_runtime.<module> import …`
so they resolve outside of workspace containers (e.g. test environments
where `/app` isn't on sys.path). The grep guard in `tests/test_imports.py`
already covered `adapters` — extending to all top-level imports would
catch this class going forward; not in this PR to keep scope tight.
## Tests
6 new in `tests/test_session_resume_gate.py`:
- baseline (no session) → no glob, returns None
- file exists → keep id, returns id, single glob (early-exit)
- file missing → drop id (clears `_session_id`), returns None
- late-pattern match → walks all patterns until hit
- log includes session id (operator triage)
- log references #488 (debugger discoverability)
All 16 tests (10 existing + 6 new) pass.
## Release plan
- Bump version 0.1.1 → 0.1.2 (in this commit)
- After merge, push v0.1.2 tag → publish.yml auto-publishes to PyPI
- Then rebuild workspace template images locally so workspaces pick up the
fix (templates pin `>=0.1.0`, will resolve to 0.1.2 on next build)
- Then mass-restart workspaces with reset_claude_session=true once to clear
any DB-side stale state, and the permanent fix kicks in
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every modular workspace template repo (claude-code, hermes, langgraph,
…) was crashing on boot with:
KeyError: "Unknown runtime '<runtime>'. Available: "
Root cause: `molecule_runtime/main.py` and four other modules used
top-level imports like `from adapters import get_adapter` — a monorepo
legacy that resolved when something on sys.path had an `adapters/`
package. Standalone template repos COPY only `adapter.py` (singular) to
/app and don't ship an `adapters/` package, so this import path went
through some side-resolution that left `get_adapter` unable to see the
user's adapter. The ADAPTER_MODULE → import → getattr → issubclass
chain then silently fell through to the discovery branch and reported
"Unknown runtime".
Fix is one-line per file: `from adapters` → `from molecule_runtime.adapters`
in:
- molecule_runtime/main.py:27
- molecule_runtime/a2a_executor.py:44
- molecule_runtime/coordinator.py:20
- molecule_runtime/prompt.py:6
- molecule_runtime/builtin_tools/temporal_workflow.py:417
Tests + CI added so this regression class is caught at PR time, not at
runtime in self-hosters' clusters:
- tests/test_imports.py: parametrised import smoke for every previously
affected module + a grep guard that fails if any future change
reintroduces a top-level `from adapters` / `import adapters` line
- .github/workflows/ci.yml: runs the smoke on every PR (no CI existed
before — the publish workflow only fires on tag push)
Closes#1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>