Phase 4 verification surfaced a follow-up edge case the initial fix missed.
The persona env files use friendlier slugs than the registry's canonical names:
* MODEL_PROVIDER=claude-code -> anthropic-oauth (Claude Code subscription)
* MODEL_PROVIDER=anthropic -> anthropic-api (direct Anthropic API key)
Without an alias map, a lead workspace's MODEL_PROVIDER=claude-code env
fell through the slug-detection path; when the YAML didn't pin a
provider, the model-prefix matcher saw MODEL=MiniMax-M2.7 and routed the
lead to MiniMax — even though CLAUDE_CODE_OAUTH_TOKEN was clearly the
intended auth path.
Add _PROVIDER_SLUG_ALIASES with the two operator-facing slugs that don't
match registry names verbatim. The alias map is consulted before the
slug-vs-legacy detection, so claude-code now resolves to anthropic-oauth
and the lead boots through OAuth as intended.
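A minimal sketch of the alias lookup described above. Only the name `_PROVIDER_SLUG_ALIASES` and its two mappings come from this message; the helper `canonical_provider_slug` and its whitespace/case handling are hypothetical illustration, not the code in adapter.py.

```python
# Operator-facing slugs that differ from the registry's canonical names.
_PROVIDER_SLUG_ALIASES = {
    "claude-code": "anthropic-oauth",  # Claude Code subscription
    "anthropic": "anthropic-api",      # direct Anthropic API key
}


def canonical_provider_slug(raw):
    """Map an env-supplied slug to its registry name (hypothetical helper).

    Values without an alias pass through unchanged so the later
    slug-vs-legacy detection still sees them.
    """
    slug = (raw or "").strip().lower()
    return _PROVIDER_SLUG_ALIASES.get(slug, slug)
```

Consulting the map first means `claude-code` is already `anthropic-oauth` by the time any registry comparison runs.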
Tests
-----
+ test_persona_env_lead_with_minimax_model_routes_via_oauth — lock in
the alias-map behavior so a future contributor can't silently re-introduce
the lead-mis-routed-to-MiniMax bug.
+ test_anthropic_alias_resolves_to_anthropic_api — covers the second
alias path.
Updated test_persona_env_lead_claude_code_resolves_correctly to assert
the new (correct) behavior: provider == 'anthropic-oauth', not None.
Full adapter suite: 78/78 pass.
Fix 2026-05-08 dev-tree wedge: 22/27 non-lead workspaces stuck at SDK initialize timeout because MODEL_PROVIDER=minimax was read as model id instead of provider slug.
Fixes the 2026-05-08 dev-tree wedge: 22/27 non-lead workspaces (minimax tier)
stuck in degraded after /org/import, every chat hanging on
`Control request timeout: initialize`.
Root cause
----------
The persona env files (`~/.molecule-ai/personas/<name>/env`) declare a TWO-
variable convention:
- MODEL = model id ("MiniMax-M2.7-highspeed")
- MODEL_PROVIDER = provider slug ("minimax")
The runtime wheel's legacy `workspace/config.py` interprets MODEL_PROVIDER
as the *model id* — a name chosen long before there was a separate MODEL
env. With both set, the legacy code reads MODEL_PROVIDER="minimax" into
runtime_config.model. The literal string "minimax" doesn't match any
registry prefix (`minimax-` requires a hyphen suffix), so resolution falls
through to providers[0] (anthropic-oauth). The auth check then fails on the
absent CLAUDE_CODE_OAUTH_TOKEN, the claude CLI launches anyway, and the
SDK's `query.initialize()` 60s control timeout fires.
The brief hypothesised `claude_sdk_executor.py` lacked dispatch logic.
Phase 1 evidence: dispatch ALREADY exists in adapter.py — model -> provider
-> base_url + auth_env routing was correctly built for #180. The bug was
upstream: MODEL_PROVIDER's name collision with the persona-env convention
silently corrupted the picked model BEFORE adapter.py saw it.
Fix
---
New helper `_resolve_model_and_provider_from_env` reconciles env vars
against YAML inside adapter.setup() and create_executor():
1. MODEL env -> picked_model (authoritative when set).
2. MODEL_PROVIDER env -> explicit_provider IFF the value matches a
registered provider name. Backward-compat: if MODEL is unset and
MODEL_PROVIDER doesn't match a registered slug, treat it as a
legacy model id (the canvas Save+Restart shape from before this fix).
3. YAML runtime_config.{model,provider} fills any field env didn't
supply.
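The three rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the body of `_resolve_model_and_provider_from_env`: the function name, the argument shapes (a plain env mapping, a YAML dict, a set of lowercase registered names) and the choice to ignore an unrecognised MODEL_PROVIDER when MODEL is set are all hypothetical.

```python
def resolve_model_and_provider(env, yaml_cfg, registered):
    """Return (model, provider) from env first, then YAML (hypothetical sketch).

    env        -- mapping of environment variables
    yaml_cfg   -- dict with optional 'model' / 'provider' keys
    registered -- set of lowercase provider names from the registry
    """
    # Rule 1: MODEL is authoritative when set (blank counts as unset).
    model = (env.get("MODEL") or "").strip() or None

    # Rule 2: MODEL_PROVIDER is a provider slug only if it is registered.
    raw = (env.get("MODEL_PROVIDER") or "").strip()
    provider = None
    if raw:
        if raw.lower() in registered:
            provider = raw.lower()
        elif model is None:
            # Backward-compat: legacy configs stored the model id here.
            model = raw

    # Rule 3: YAML fills whatever env did not supply.
    model = model or yaml_cfg.get("model")
    provider = provider or yaml_cfg.get("provider")
    return model, provider
```

With the persona-env shape (`MODEL=MiniMax-M2.7-highspeed`, `MODEL_PROVIDER=minimax`) this yields the model id and the `minimax` slug separately, instead of `"minimax"` landing in the model field.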
Contained in the template repo per the brief's scope guidance — does NOT
touch the runtime wheel's workspace/config.py (which would need a separate
molecule-core PR), and does NOT change the persona-env dispatch policy
(Phase 2 mapping 2026-05-08).
Tests
-----
Eleven new cases in tests/test_env_model_provider_dispatch.py covering:
- persona-env shape (minimax, GLM, lead claude-code) -> correct model + slug
- legacy MODEL_PROVIDER-as-model-id shape still works
- env wins over YAML
- YAML fallback when env unset
- whitespace/empty defensive handling
- case-insensitive provider slug matching
Full adapter test suite: 76/76 pass.
Verification path
-----------------
After image rebuild + workspace re-provision, ws-* containers will boot
with provider=minimax (not anthropic-oauth), ANTHROPIC_BASE_URL set to
https://api.minimax.io/anthropic, MINIMAX_API_KEY projected onto
ANTHROPIC_AUTH_TOKEN, and the SDK init handshake succeeding.
Refs: task #181, brief 2026-05-08, related #180 (#7 in this repo)
The 5 _load_providers tests were single-path-only: they wrote a
config.yaml to tmp_path and called _load_providers(str(tmp_path)),
expecting the lookup to read tmp_path/config.yaml.
After the multi-path fix in #7, _load_providers also checks:
1. _CANONICAL_ADAPTER_DIR/config.yaml (= /opt/adapter/config.yaml)
2. _TEMPLATE_DIR/config.yaml (= dirname(__file__)/config.yaml)
3. ${config_path}/config.yaml (the test's tmp_path)
Path 2 finds the repo's bundled config.yaml on the test runner's
disk before path 3 is consulted, so the tests see the bundled providers
list instead of the config.yaml they wrote to tmp_path.
Two surface changes:
1. adapter.py — extract `os.path.dirname(os.path.abspath(__file__))`
into a module-level `_TEMPLATE_DIR` constant, mirroring
`_CANONICAL_ADAPTER_DIR`. Production behavior identical
(resolved once at import). Tests can monkeypatch the module
attribute to redirect the path-2 lookup.
2. tests/test_adapter_prevalidate.py — 5 _load_providers tests
monkeypatch `_CANONICAL_ADAPTER_DIR` and `_TEMPLATE_DIR` to a
non-existent tmp subdir, isolating the test to the workspace
config_path branch they always meant to test.
The 6th _load_providers test (`test_load_providers_parses_yaml_and_normalizes`)
already passed because path 2 returns 7 providers and that's what
that test expects — left unchanged.
Verification:
pytest tests/ 65/65 PASS
pytest tests/test_adapter_prevalidate.py -k load_providers
6/6 PASS
Closes molecule-core#129 follow-up — the unit tests were the last
red on the template repo's CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The template's _load_providers had only ONE lookup path
(${config_path}/config.yaml = /configs/config.yaml) — which is the
per-workspace override, NOT the template's bundled provider registry.
Every MiniMax/GLM/Kimi/DeepSeek model resolved to anthropic-oauth
and crashed at first LLM call:
None of CLAUDE_CODE_OAUTH_TOKEN set for model=MiniMax-M2.7-highspeed
(provider=anthropic-oauth) — the adapter will fail on the first
LLM call with AuthenticationError
...
probed_cli_error='Not logged in · Please run /login'
Canary chronic red 38h+ on 2026-05-07/08 traced to this. The fix
that the May-4 image already had bundled — a 4-path lookup with
canonical /opt/adapter/config.yaml + __file__-adjacent + workspace
override + builtins fallback — was never on Gitea main, so post-
suspension rebuilds dropped it. Restoring here.
Resolution order:
1. /opt/adapter/config.yaml (canonical, provisioner-contracted)
2. dirname(__file__)/config.yaml (covers /app/config.yaml from
Dockerfile #6 as well as dev/test imports)
3. ${config_path}/config.yaml (per-workspace override)
4. _BUILTIN_PROVIDERS (oauth + anthropic-api fallback)
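The resolution order above amounts to "first existing file wins". A small sketch of that selection step, assuming a hypothetical helper `pick_providers_config` that takes the directories as arguments (the real `_load_providers` also parses the YAML, which is omitted here):

```python
import os


def pick_providers_config(config_path, canonical_dir="/opt/adapter", template_dir="/app"):
    """Return the first existing config.yaml in resolution order, or None.

    None stands for step 4: the caller falls back to _BUILTIN_PROVIDERS.
    The default directories mirror the paths named above; the helper
    itself is hypothetical.
    """
    for directory in (canonical_dir, template_dir, config_path):
        candidate = os.path.join(directory, "config.yaml")
        if os.path.isfile(candidate):
            return candidate
    return None
```

Because the per-workspace path is checked last, a bundled config.yaml next to adapter.py shadows it, which is worth keeping in mind when a test writes its own config.yaml into a temporary directory.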
Verified locally: ps=_load_providers('/nonexistent') returns the
7 providers from /tmp/cctmpl/config.yaml via path 2 (the
__file__-adjacent lookup). Without the fix, returns 2 (builtins).
Closes molecule-core#129 failure mode #1 (the original "Agent error
(Exception)" 38h chronic red).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The adapter's _load_providers tries 4 paths in order:
1. /opt/adapter/config.yaml — provisioner-managed (currently missing)
2. os.path.dirname(__file__)/config.yaml — alongside adapter.py
3. ${WORKSPACE_CONFIG_PATH}/config.yaml — workspace overrides
4. _BUILTIN_PROVIDERS — oauth + anthropic-api only
On this template's docker image /opt/adapter/ is never populated by
the platform provisioner (verified 2026-05-08 by SSM-exec on a live
canary's workspace EC2: ls /opt/adapter/ → no such file or directory).
That makes path 2 — the dir adjacent to /app/adapter.py — the
load-bearing one for production workloads.
The Dockerfile copies adapter.py + claude_sdk_executor.py + scripts/
+ entrypoint.sh + __init__.py into /app, but it does NOT copy
config.yaml. So /app/config.yaml doesn't exist, path 2 fails, and
the adapter falls all the way through to _BUILTIN_PROVIDERS.
_BUILTIN_PROVIDERS contains only anthropic-oauth + anthropic-api.
Every MiniMax / GLM / Kimi / DeepSeek model id has no matching
prefix in those two, so _resolve_provider returns providers[0] =
anthropic-oauth (per "unknown ids fall back to providers[0]" rule).
That provider needs CLAUDE_CODE_OAUTH_TOKEN, which is unset for
non-OAuth tenants. The claude CLI fails with:
Not logged in · Please run /login
…which surfaces in the A2A response as "Agent error (Exception)".
This is the root cause of:
• Canary chronic red since 2026-05-07 02:30 UTC (38h+ at time of
investigation)
• molecule-core#129 failure mode #1
• Memory feedback_template_vs_workspace_config_separation
(template-claude-code PR #37 added the multi-path lookup but
didn't bundle config.yaml into the image — the lookup paths
point at files that don't exist)
Fix: one-line `COPY config.yaml .` in the Dockerfile.
Verification path (post-merge): publish-runtime workflow rebuilds
the image, deploys to staging tenant fleet, next canary cron run
sees /app/config.yaml → loads minimax provider → MINIMAX_API_KEY
matches → claude CLI auths → A2A returns PONG → green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI runner installs only `pytest pytest-asyncio pyyaml`; without the
molecule_runtime/a2a/claude_sdk_executor stubs, the new
test_provider_resolution.py fails to collect with
ModuleNotFoundError. test_adapter_prevalidate.py owned the same
shims via a per-file _install_stubs(), but two files maintaining
parallel stub copies eventually disagree on shape (BaseAdapter
needing install_plugins_via_registry, etc.).
Move the shim install + sys.path bump into tests/conftest.py so
every test module shares a single canonical stub set, collected
before any test imports adapter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspace operators set 'provider: minimax' in /configs/config.yaml
expecting the adapter to route to MiniMax. Pre-fix behavior: adapter
ignored 'provider:' entirely, _resolve_provider model-matched against
_BUILTIN_PROVIDERS (anthropic-oauth + anthropic-api only), no model_prefix
matched 'MiniMax-M2.7-highspeed', silent fallback to providers[0]
(anthropic-oauth) — SDK kept using CLAUDE_CODE_OAUTH_TOKEN, hit OAuth
quota under a name the operator never asked for.
Fix: _resolve_provider now takes an explicit_provider arg. setup() reads
it from runtime_config.provider OR top-level config.yaml provider:.
Explicit name in registry → returned. Not in registry → ValueError with
the two paths to fix (add provider entry, or switch runtime template).
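A rough sketch of the explicit-provider contract described above, assuming registry entries are dicts with `name` and lowercase `model_prefixes`. The function name, the dict shape and the error wording are illustrative assumptions; only the behaviour (explicit match returned, unknown explicit name raises, otherwise prefix match with a providers[0] fallback) is taken from this message.

```python
def resolve_provider(model, providers, explicit_provider=None):
    """Pick a provider entry; an explicit name always beats prefix matching."""
    if not providers:
        raise ValueError("provider registry is empty")

    wanted = (explicit_provider or "").strip().lower()
    if wanted:
        for entry in providers:
            if entry["name"].lower() == wanted:
                return entry
        known = ", ".join(entry["name"] for entry in providers)
        # Fail loudly instead of silently falling back to providers[0].
        raise ValueError(
            f"provider {explicit_provider!r} is not in the registry ({known}); "
            "add a provider entry or switch runtime template"
        )

    # No explicit provider: match the model id against each entry's prefixes.
    lowered = (model or "").lower()
    for entry in providers:
        if any(lowered.startswith(p) for p in entry.get("model_prefixes", ())):
            return entry
    return providers[0]  # unknown ids fall back to the first entry
```

Treating empty or None `explicit_provider` as "not supplied" keeps the old prefix-matching path intact for existing configs.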
10 new tests cover: explicit-in-registry returns match, case-insensitive,
not-in-registry raises with actionable message, defense-in-depth against
silent fallback regression, custom-registry lookup, empty/None treated as
no-explicit (back-compat).
Closes #180.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Background: post-2026-05-06 SCM is Gitea, not GitHub. Gitea 1.22.6 has
no repository_dispatch / workflow_dispatch trigger API (empirically
verified across 6 candidate paths in molecule-core#20 issuecomment-913).
The molecule-core/publish-runtime.yml cascade therefore cannot fire
templates via curl-dispatch — pivots to push-mode instead.
This PR is the consumer side of that pivot:
- `.runtime-version` file at repo root — single line, plain version
string. Currently 0.1.129 (latest published as of 2026-05-07).
publish-runtime overwrites this on each cascade.
- publish-image.yml gains a `resolve-version` job that reads the file
and forwards the value to the reusable build workflow as the
third-priority source in the resolution chain:
1. client_payload.runtime_version (forward-compat with future
GitHub-style dispatch if Gitea ever adds it)
2. inputs.runtime_version (manual workflow_dispatch override)
3. .runtime-version file (push-mode cascade — the new path)
4. '' (Dockerfile requirements.txt default)
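The priority chain above is a first-non-empty lookup. The workflow itself implements it in YAML; the Python below is only a sketch of the same logic, with a hypothetical function name and arguments standing in for the three sources.

```python
def resolve_runtime_version(client_payload_version, input_version, file_contents):
    """Return the first non-empty source, or '' to use the Dockerfile default."""
    for candidate in (client_payload_version, input_version, file_contents):
        value = (candidate or "").strip()  # the file carries a trailing newline
        if value:
            return value
    return ""
```

On a plain push the first two sources are empty, so the contents of `.runtime-version` decide the pin.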
No behavioural change for PRs / manual dispatches; only fills in the
on-push case where previously the version was empty.
Sequencing context: this PR (and 8 sibling PRs to the other template
repos) MUST land before molecule-core#20 v2 is merged — otherwise the
first cascade push would trigger an on-push rebuild that pins the OLD
requirements.txt floor instead of the freshly-published version.
Refs molecule-core#14, molecule-core#20, molecule-core/issues/20.
Per saved memory feedback_runner_config_partial_deploy: orchestrator
identified that runners 1-8 last restarted before AGENT_TOOLSDIRECTORY
+ RUNNER_TOOL_CACHE were added; cycle 7 retrigger landed ~50% on stale
runners. Orchestrator restarted 1-8 at ~09:37; this empty commit
re-triggers CI on the now-consistent runner pool.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3-line wrapper at .github/workflows/secret-scan.yml referenced
`uses: molecule-ai/molecule-core/.github/workflows/secret-scan.yml@staging`.
molecule-core is private; act_runner clones cross-repo reusable
workflows anonymously, so the resolve fails at 0s with no logs.
Same root cause + same fix that molecule-controlplane already shipped
(see its secret-scan.yml comment block lines 10-22). Inlining keeps
the gate functional until Gitea is upgraded or the canonical scanner
moves to a public repo. When either lands, this file reverts to the
3-line wrapper.
Refs: internal#46 Phase 3 Class 2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empty commit to re-run CI against the act_runner config that landed
in /opt/molecule/runners/config.yaml (cycle ~58 internal#46 Phase 3).
No source change. CI now runs setup-python with /tmp/hostedtoolcache,
which works (verified in cycle 6 task 1022 log, careful-bash#2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s
when the runner tries to resolve the cross-repo workflow / checkout.
Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.
Refs: internal#46
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third-party Anthropic-compat providers (MiniMax, GLM, Kimi, DeepSeek)
all reuse the Anthropic SDK's wire format, which means the claude CLI
and claude-code-sdk read the bearer token from ANTHROPIC_AUTH_TOKEN no
matter which vendor is being talked to. Pre-#244:
* Canvas surfaced the vendor-specific name (MINIMAX_API_KEY, etc.)
to the user — so a user who saved only MINIMAX_API_KEY hit a
silent 401 on first call.
* The boot audit said `MINIMAX_API_KEY=set`, making it look like an
SDK bug rather than a routing gap.
* A user with multiple vendor keys could only run one workspace at a
time because they all fought over the shared ANTHROPIC_AUTH_TOKEN
slot.
Diagnostic-only audit logging shipped earlier (#32) but the actual
routing was never written — task #244 was mismarked complete.
Changes:
* config.yaml: third-party model `required_env` now references the
per-vendor name (MINIMAX_API_KEY, GLM_API_KEY, KIMI_API_KEY,
DEEPSEEK_API_KEY) so canvas asks the user for the right key.
First-party Anthropic models still use ANTHROPIC_AUTH_TOKEN /
CLAUDE_CODE_OAUTH_TOKEN.
* config.yaml: each third-party provider's `auth_env` lists the
vendor name FIRST (priority order) so projection picks the
vendor key over a stale ANTHROPIC_AUTH_TOKEN.
* adapter.py: new `_project_vendor_auth(provider)` helper, called
from `setup()` right after `_resolve_provider`. Idempotent — only
projects when ANTHROPIC_AUTH_TOKEN is unset (operator override
always wins). Logs the projection by NAME, never by VALUE
(mirrors `_audit_auth_env_presence`).
* tests/test_provider_routing.py: 6 new tests pin the contract —
vendor-key-set projects, AUTH_TOKEN-already-set is never
clobbered, first-party providers skip projection, secret value
never leaks into a log record, empty-string vendor env doesn't
trigger projection, and the same routing fires for GLM / Kimi /
DeepSeek.
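The projection rules listed above can be sketched as follows. This is not the body of `_project_vendor_auth`: the function takes the env mapping as an argument for testability, and the `_VENDOR_KEY_NAMES` set used to skip first-party providers is an assumption about how that skip could be expressed.

```python
import logging

logger = logging.getLogger("adapter")

# Vendor-specific key names from config.yaml; the set itself is hypothetical.
_VENDOR_KEY_NAMES = {"MINIMAX_API_KEY", "GLM_API_KEY", "KIMI_API_KEY", "DEEPSEEK_API_KEY"}


def project_vendor_auth(provider, env):
    """Copy the first set vendor key onto ANTHROPIC_AUTH_TOKEN.

    Returns the projected env NAME, or None when nothing was projected.
    """
    if env.get("ANTHROPIC_AUTH_TOKEN"):
        return None  # operator override always wins; never clobber it
    for name in provider.get("auth_env", ()):  # priority order, vendor name first
        if name not in _VENDOR_KEY_NAMES:
            continue  # first-party names are never projected
        value = env.get(name)
        if value:  # an empty string counts as unset
            env["ANTHROPIC_AUTH_TOKEN"] = value
            # Log the NAME only, never the secret value.
            logger.info("projected %s onto ANTHROPIC_AUTH_TOKEN", name)
            return name
    return None
```

Calling it a second time is a no-op because ANTHROPIC_AUTH_TOKEN is then already set, which is what makes the helper idempotent.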
Mirrors the parallel hermes-side fix from task #249 / hermes PR #38;
keeps the two runtimes' multi-vendor UX in lockstep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two operator-visible boot diagnostics that close the diagnosis gap
exposed by the 2026-05-02 MiniMax E2E crash-loop. The universal
canvas-picked-model fix (Bug B) and per-model required_env (Bug D) live
in molecule-core PR #2538 — this PR adds the per-template visibility
that complements them so operators can answer "is the key missing or is
routing wrong?" from `docker logs` alone.
Changes
-------
adapter.py:
- _AUTH_ENV_AUDIT tuple of 8 vendor env names (CLAUDE_CODE_OAUTH_TOKEN,
ANTHROPIC_API_KEY/AUTH_TOKEN/BASE_URL, MINIMAX/GLM/KIMI/DEEPSEEK_API_KEY).
- _audit_auth_env_presence() helper — single INFO line of NAME=set/unset
pairs. NEVER logs values; the test pins this with a "fake-secret-MUST-
NOT-LEAK" sentinel that must never appear in the log message.
- One call site at the end of setup()'s boot banner so every workspace
start emits both "which provider got picked" and "which envs are present"
in adjacent log lines.
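A minimal sketch of the audit line, using the eight env names listed above. The formatting helper `format_auth_env_audit` is hypothetical; the real `_audit_auth_env_presence()` logs the line at INFO, and its exact output format is not shown in this message.

```python
_AUTH_ENV_AUDIT = (
    "CLAUDE_CODE_OAUTH_TOKEN",
    "ANTHROPIC_API_KEY",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_BASE_URL",
    "MINIMAX_API_KEY",
    "GLM_API_KEY",
    "KIMI_API_KEY",
    "DEEPSEEK_API_KEY",
)


def format_auth_env_audit(env):
    """Build one line of NAME=set/unset pairs; values are never included."""
    return " ".join(
        f"{name}={'set' if env.get(name) else 'unset'}"  # empty string is unset
        for name in _AUTH_ENV_AUDIT
    )
```

Only the truthiness of each value is read, so a secret cannot reach the log line by construction.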
entrypoint.sh:
- log_boot_context() function fired once before the gosu drop (as root)
and once after (as agent) so an operator can spot env values lost
across the privilege drop. Emits uid/gid/user/hostname/workspace_id/
platform_url/configs_dir/workspace_dir + the same 8 env names as
NAME=set/unset. Mirror of _AUTH_ENV_AUDIT — list pinned in sync by a
new AST-style test (test_audit_env_list_matches_entrypoint_sh) that
parses entrypoint.sh and asserts set-equality with adapter.py's tuple.
tests/test_adapter_logging.py (new):
- 4 tests covering the audit contract: every name appears, all-unset
scenario, empty-string treated as unset (matches routing semantics),
and the cross-file sync gate against entrypoint.sh's for-loop.
- Stubs molecule_runtime + a2a so the helpers can be imported without
the real wheel installed in CI (mirrors test_adapter_prevalidate.py's
scaffolding pattern).
Why this complements molecule-core PR #2538
-------------------------------------------
- PR #2538 makes Bug B (canvas-picked model silently dropped) impossible
by resolving model centrally in workspace/config.py:load_config —
every adapter (claude-code, hermes, codex, future ones) gets the
passthrough for free.
- PR #2538 makes Bug D (preflight rejects valid auth for non-default
models) impossible by REPLACE-not-union per-entry required_env.
- This template PR is the per-template observability layer: when one
of those universal fixes regresses (or when an operator misconfigures a
vendor key), the boot logs say exactly which env was present at each
tier. Validated end-to-end on workspace
be27badd-00a7-4cef-91e8-af428175c76f (clean boot, MINIMAX_API_KEY=set
audited, no crash-loop).
Closes part of molecule-monorepo task #248. Sibling of #2538 for
molecule-core.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two review nits:
1. Narrow the import-arm catch in _mark_sdk_wedged and
_clear_sdk_wedge_on_success to (ImportError, ModuleNotFoundError).
The bare `except Exception:` would have masked an AttributeError /
TypeError from a runtime_wedge API rename — silently degrading the
mirror to "no-op" and making heartbeat + the smoke gate (#131)
blind to claude-code wedges. The structural snapshot test in
molecule-core (task #169) catches the rename at PR-time. Older
runtimes that don't ship runtime_wedge at all still hit ImportError
and silently no-op — the local sticky flag still gates is_wedged()
inside this module so internal callers keep working.
2. Add mirror-CALL-failure injection tests. The recorder used by the
original tests never raised, so the inner try around
_mark_runtime_wedged(reason) (and the symmetric clear) wasn't
pinned. New tests inject a recorder whose mark/clear raise on call,
then assert: (a) the call attempt was recorded, (b) the local
sticky flag stayed correct, (c) the failure was logged at ERROR.
Pins both the contract ("mirror is best-effort, local is source of
truth") AND the operator-visible signal (an ERROR log line is the
only way to see a silent mirror regression).
Regression-injection-checked: removing the call-side try arm makes
both new tests fail with clear messages. Tests: 7 in
test_runtime_wedge_mirror.py, 45 across the whole tests/ tree.
The local _sdk_wedged_reason flag was only observed inside this module
— heartbeat reads runtime_wedge.is_wedged() (universal cross-cutting
holder) and so does the new boot-smoke gate from molecule-core PR
#2473 / task #131. Without the mirror, a wedged claude-code workspace
stayed green-dot on the canvas while every chat hung, AND the
publish-image gate could not catch PR-25-class init wedges before
the broken image shipped to GHCR.
_mark_sdk_wedged now mirrors into runtime_wedge.mark_wedged, and
_clear_sdk_wedge_on_success mirrors into runtime_wedge.clear_wedge.
Both are best-effort — older runtimes that don't ship runtime_wedge
silently no-op the mirror, so a template pinned to an older runtime
still boots. Mirror exceptions are logged but don't suppress the
local sticky flag, so internal callers (retry loop, cancel handler)
see consistent state regardless of the universal-side outcome.
Tests cover: mark mirrors with reason, first-call-wins propagates,
clear mirrors, no-op when not wedged, ImportError-resilience.
Regression-injection-checked: silencing the mirror branch fails the
mark+first-wins tests at unit-test time with a clear message naming
the missing runtime_wedge call.
Claude Code 2.1.x changed the flag's signature to take an *allowlist* of
tagged entries — `server:<name>` for manually-configured MCP servers,
`plugin:<name>@<marketplace>` for plugin channels. PR #25's
`{flag: None}` rendered as a bare `--<flag>` with no value, the CLI
rejected with `argument missing`, and the SDK timed out at `initialize`,
surfacing upstream as `Control request timeout: initialize` (caught
live on workspace dd40faf8 on 2026-05-01 — 100% of A2A turns wedged).
Pass `server:molecule` so the SDK forwards
`--dangerously-load-development-channels server:molecule`. Live-verified
end-to-end: A2A returns coherent replies AND the host claude session
renders inbound canvas messages as `<channel source="molecule" ...>`
tags inline (push UX without inbox poll).
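To see why `{flag: None}` broke and `server:molecule` works, it helps to look at how a flag dict turns into argv. The renderer below is a simplified approximation written for illustration; it is not the SDK's own code, and `render_extra_args` is a hypothetical name.

```python
CHANNEL_FLAG = "dangerously-load-development-channels"


def render_extra_args(extra_args):
    """Approximate how a {flag: value} dict becomes CLI arguments.

    A None value renders as a bare switch; any other value follows the flag.
    """
    argv = []
    for flag, value in extra_args.items():
        argv.append(f"--{flag}")
        if value is not None:
            argv.append(str(value))
    return argv


# PR #25 shape: bare switch, which 2.1.x rejects as a missing argument.
bare = render_extra_args({CHANNEL_FLAG: None})
# Fixed shape: the flag carries a tagged allowlist entry.
tagged = render_extra_args({CHANNEL_FLAG: "server:molecule"})
```

Under this approximation `bare` has one element and `tagged` has two, which matches the unit-test invariants described below (non-None, non-empty, contains the tag colon).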
Tests: replace the unconditional `None` pin with a tagged-form pin
that asserts the exact `server:molecule` value, plus a defense-in-depth
test that pins the invariants (non-None, non-empty, contains tag
colon) so any regression to the bare-switch shape fails at unit-test
time instead of surfacing as a live SDK initialize wedge. 38/38 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Live-test revealed a regression in PR #24's setup() strip: the wheel-
default `anthropic:claude-opus-4-7` paired with an OAuth workspace
(CLAUDE_CODE_OAUTH_TOKEN set, no ANTHROPIC_API_KEY) is the realistic
production shape. Stripping in setup() routes those users into the
`anthropic-api` provider entry, after which the CLI hangs at
`initialize` because no API key env is set. Caught on workspace
dd40faf8 on 2026-05-01 — banner went `provider=anthropic-api` and
A2A wedged on Control request timeout.
Pre-fix routing (let prefixed strings fall through to providers[0] =
anthropic-oauth) is actually correct for this combo. The strip is only
needed at the CLI invocation site (create_executor) where claude's
`--model` arg must be a bare id.
Tests: replace `test_setup_strip_routes_prefixed_anthropic_to_anthropic_api`
with `test_setup_keeps_prefix_routing_oauth_for_anthropic_prefix`,
which pins the inverse — prefixed model + OAuth env stays on oauth and
emits no API-key warning. The 5 unit cases on `_strip_provider_prefix`
plus the `create_executor` strip pins remain unchanged. 36/36 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel-side push UX gates (capability + instructions, molecule-core
PR #2463) only matter if the host claude CLI is willing to register a
non-allowlisted experimental channel. During the channels research
preview the CLI requires --dangerously-load-development-channels to
bypass its allowlist; without it, every notifications/claude/channel
fired by the inbox bridge arrives at the host and is silently dropped.
claude-agent-sdk forwards arbitrary CLI flags to the spawned subprocess
via ClaudeAgentOptions.extra_args (claude_agent_sdk/_internal/transport/
subprocess_cli.py:340). Wire the flag in unconditionally — the flag is
harmless on builds that already allowlist the capability and required
on builds during the research preview, so there is no version skew to
guard. Remove the line once channels graduate to the default allowlist.
Test pins the wiring with a stubbed ClaudeAgentOptions recorder; runs
in CI without claude_agent_sdk / a2a / molecule_runtime installed via
the same _ensure_module/_ensure_attr pattern as the existing adapter
prevalidate test, but tolerates real packages being present locally.
Verified by injection: removing the extra_args line makes the test
fail with a message naming the missing flag and citing the SDK file
that consumes it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The molecule-runtime wheel's config.py defaults model to
`anthropic:claude-opus-4-7` so langchain/crewai consumers get a uniform
provider:model string out of the box. The claude CLI's --model arg
expects the bare model id and silently exits 1 (no stderr) on prefixed
strings — root cause of the 2026-05-01 "Agent error (Exception)" mid-A2A
bug. Diagnosed via strace on a live workspace: the CLI received
`--model anthropic:claude-opus-4-7` and exit_group(1)'d before any
non-fatal output.
Add `_strip_provider_prefix` and call it in both setup() (so
_resolve_provider routes anthropic:claude-X correctly to anthropic-api
instead of falling back to oauth) and create_executor() (so the bare
id reaches the CLI). Only known-Claude prefixes are stripped; unknown
ones (openai:, bedrock:) pass through so the CLI fails loudly instead
of being silently mangled.
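A minimal sketch of the stripping rule. The set of known prefixes is an assumption (only `anthropic:` is named in this message), and the function body is illustrative rather than the actual `_strip_provider_prefix`.

```python
# Assumed list of known-Claude prefixes; only "anthropic:" appears in the message.
_KNOWN_CLAUDE_PREFIXES = ("anthropic:",)


def strip_provider_prefix(model):
    """Drop a known provider prefix so the CLI receives a bare model id.

    Unknown prefixes (openai:, bedrock:) pass through untouched so the
    CLI fails loudly instead of receiving a mangled id.
    """
    if not model:
        return model
    lowered = model.lower()
    for prefix in _KNOWN_CLAUDE_PREFIXES:
        if lowered.startswith(prefix):
            return model[len(prefix):]
    return model
```

For the wheel default this turns `anthropic:claude-opus-4-7` into `claude-opus-4-7`, the form the CLI's `--model` argument expects.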
Coverage: 8 new tests — unit tests for the helper across all branches,
end-to-end `create_executor` strip on dict + dataclass shapes, and a
caplog-based setup() test that pins provider=anthropic-api routing
after the strip (the silent-fallback failure mode this fix eliminates).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh-tenant signup hits "Upload failed: failed to prepare uploads
dir" on the first chat attachment (reported on hongming.moleculesai.app
2026-05-01T18:30Z). Root cause is that workspace/internal_chat_uploads.py
runs `mkdir -p /workspace/.molecule/chat-uploads` as the agent user,
but the volume's `.molecule` subdir surfaces root-owned in some race
windows (volume cache + new mount + RW remount during reboot/redeploy).
Pre-creating the directory tree as root in the entrypoint, BEFORE
gosu drops to agent, eliminates the class entirely — the upload
handler's `mkdir(parents=True, exist_ok=True)` is a no-op on the
common path and the failure mode it currently surfaces no longer
exists.
Idempotent: works on fresh volumes (creates) and reused volumes
(no-op + chown re-asserts ownership in case a prior process changed
it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The empty-providers fallback in `_resolve_provider` was load-bearing
when `_load_providers` could return an empty tuple, but after PR #22's
per-entry hardening every return path yields a non-empty registry
(builtins on parse failure, the parsed list otherwise). The leftover
`_normalize_provider({})` branch became dead and outright broken: with
the stricter `_normalize_provider` rejecting nameless entries, the
fallback now returns None and would crash setup() on `provider["auth_mode"]`
the moment anything called `_resolve_provider` with an empty tuple.
Replace the dead branch with an explicit ValueError + pre-condition
docstring. Defensive — no production caller can hit this — but turns
a future silent NoneType crash into an actionable error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two correctness issues spotted in self-review of c6f4912:
1. String-as-prefix typo split into character tuple. ``model_prefixes:
mimo-`` (operator forgot brackets) used to iterate over characters
→ ``('m','i','m','o','-')``, silently routing every model id starting
with 'm', 'i', 'o', or '-' through the entry. Now: non-list values coerce
to empty tuple (entry survives but matches nothing — operator notices
in boot banner, not via misrouted requests).
2. Single bad provider entry nuked the whole registry. _load_providers
built the registry via a generator inside tuple(...). One AttributeError
mid-comprehension (e.g. ``[mimo-, 123]`` — int's missing .lower())
propagated out, broad except caught it, registry silently fell back
to _BUILTIN_PROVIDERS (oauth + anthropic-api only). Every third-party
model would then route to anthropic-oauth — exactly the silent-fallback
failure mode this PR was meant to eliminate. Now: per-entry try/except
drops the bad entry with a warning, rest survives.
Also: entries without a string ``name`` field are now dropped with a
warning instead of silently using the placeholder ``<unnamed>`` —
operator typos surface in boot logs.
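Both fixes and the no-name rule can be sketched together. The function names, the dict shape of a normalized entry and the warning texts are assumptions for illustration; the behaviour (string prefixes coerce to an empty tuple, a bad entry is dropped on its own, nameless entries are dropped) follows the description above.

```python
import logging

logger = logging.getLogger("adapter")


def normalize_provider(entry):
    """Return a normalized entry, or None when it has no string name."""
    name = entry.get("name")
    if not isinstance(name, str) or not name.strip():
        return None
    prefixes = entry.get("model_prefixes")
    if not isinstance(prefixes, (list, tuple)):
        # A bare string would iterate per character; match nothing instead.
        prefixes = ()
    # Raises AttributeError on non-string items such as 123.
    return {"name": name.strip(), "model_prefixes": tuple(p.lower() for p in prefixes)}


def build_registry(entries):
    """Normalize entries one by one so a single bad entry cannot sink the rest."""
    registry = []
    for entry in entries:
        try:
            provider = normalize_provider(entry)
        except Exception as exc:
            logger.warning("dropping malformed provider entry: %s", exc)
            continue
        if provider is None:
            logger.warning("dropping provider entry without a string name")
            continue
        registry.append(provider)
    return tuple(registry)
```

The per-entry `try` is the key difference from building the tuple in one generator expression, where the first exception discards every entry already parsed.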
Tests: 28 passing (3 new regression tests covering both issues plus
the no-name path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without pyyaml in CI, adapter._load_providers' broad except-Exception
swallows the ImportError and silently falls back to _BUILTIN_PROVIDERS.
Tests then assert 7 providers but get 2; setup() can't route any
third-party model. Locally pyyaml is system-installed so the issue
went unnoticed.
Same failure mode as the 2026-04-30 incident (CI green, prod broken)
— pinning the dep here closes that gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Routes via the existing `minimax` provider entry (model prefix matches
`minimax-` case-insensitively) — no registry change needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the model→endpoint→auth-env mapping out of hardcoded constants
in adapter.py + entrypoint.sh into a single `providers:` list at the
top of config.yaml. The adapter loads it at boot via _load_providers;
canvas Config tab will read the same YAML for its Provider dropdown so
UI and adapter never disagree on what's available. Adding a new
provider becomes a one-line YAML edit — no Python or shell changes.
Includes 5 third-party providers ready out of the box (Anthropic-compat
endpoints, Bearer-style ANTHROPIC_AUTH_TOKEN OR ANTHROPIC_API_KEY auth):
xiaomi-mimo https://api.xiaomimimo.com/anthropic
minimax https://api.minimax.io/anthropic
zai https://api.z.ai/api/anthropic (NEW)
moonshot https://api.moonshot.ai/anthropic (NEW)
deepseek https://api.deepseek.com/anthropic (NEW)
Plus 9 new model entries in runtime_config.models (mimo-v2.5, MiniMax-M2,
MiniMax-M2.7, GLM-4.6, GLM-4.5, kimi-k2.5, kimi-k2, deepseek-v4-pro,
deepseek-v4-flash) so they show up in the Canvas Config dropdown.
Operator override unchanged: ANTHROPIC_BASE_URL set as a workspace
secret still wins over the registry default — the escape hatch for
regional endpoints (Xiaomi token-plan-sgp, MiniMax api.minimaxi.com).
entrypoint.sh: drops the `mimo-*` case mapping (adapter handles routing
now). _BUILTIN_PROVIDERS retained as malformed-YAML fallback so a
bare-bones workspace still boots with oauth + anthropic-api defaults.
Tests: 25 passing. New coverage:
- YAML parses + normalizes to expected shape
- Malformed YAML falls back to builtins (warning, not raise)
- Each new provider routes its model id to the right base_url
- ANTHROPIC_AUTH_TOKEN alone satisfies third-party auth check
- Operator-set ANTHROPIC_BASE_URL overrides registry default
- Case-insensitive prefix match (MiniMax-M2 / minimax-m2.7 / GLM-4.6)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>