Commit Graph

17 Commits

Author SHA1 Message Date
Hongming Wang
b9a1fa1b1f feat: per-vendor env routing for third-party providers (task #244)
Some checks failed
CI / validate (push) Failing after 0s
CI / Adapter unit tests (push) Failing after 6s
Third-party Anthropic-compat providers (MiniMax, GLM, Kimi, DeepSeek)
all reuse the Anthropic SDK's wire format, which means the claude CLI
and claude-code-sdk read the bearer token from ANTHROPIC_AUTH_TOKEN no
matter which vendor is being talked to. Pre-#244:

  * Canvas surfaced the vendor-specific name (MINIMAX_API_KEY, etc.)
    to the user — so a user who saved only MINIMAX_API_KEY hit a
    silent 401 on first call.
  * The boot audit said `MINIMAX_API_KEY=set`, making it look like an
    SDK bug rather than a routing gap.
  * A user with multiple vendor keys could only run one workspace at a
    time because they all fought over the shared ANTHROPIC_AUTH_TOKEN
    slot.

Diagnostic-only audit logging shipped earlier (#32) but the actual
routing was never written — task #244 was mismarked complete.

Changes:
  * config.yaml: third-party model `required_env` now references the
    per-vendor name (MINIMAX_API_KEY, GLM_API_KEY, KIMI_API_KEY,
    DEEPSEEK_API_KEY) so canvas asks the user for the right key.
    First-party Anthropic models still use ANTHROPIC_AUTH_TOKEN /
    CLAUDE_CODE_OAUTH_TOKEN.
  * config.yaml: each third-party provider's `auth_env` lists the
    vendor name FIRST (priority order) so projection picks the
    vendor key over a stale ANTHROPIC_AUTH_TOKEN.
  * adapter.py: new `_project_vendor_auth(provider)` helper, called
    from `setup()` right after `_resolve_provider`. Idempotent — only
    projects when ANTHROPIC_AUTH_TOKEN is unset (operator override
    always wins). Logs the projection by NAME, never by VALUE
    (mirrors `_audit_auth_env_presence`).
  * tests/test_provider_routing.py: 6 new tests pin the contract —
    vendor-key-set projects, AUTH_TOKEN-already-set is never
    clobbered, first-party providers skip projection, secret value
    never leaks into a log record, empty-string vendor env doesn't
    trigger projection, and the same routing fires for GLM / Kimi /
    DeepSeek.

Mirrors the parallel hermes-side fix from task #249 / hermes PR #38;
keeps the two runtimes' multi-vendor UX in lockstep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:20:03 -07:00
Hongming Wang
78ae139609 feat(adapter,entrypoint): boot env audit + crash-loop diagnosis logging
Adds two operator-visible boot diagnostics that close the diagnosis gap
exposed by the 2026-05-02 MiniMax E2E crash-loop. The universal
canvas-picked-model fix (Bug B) and per-model required_env (Bug D) live
in molecule-core PR #2538 — this PR adds the per-template visibility
that complements them so operators can answer "is the key missing or is
routing wrong?" from `docker logs` alone.

Changes
-------
adapter.py:
- _AUTH_ENV_AUDIT tuple of 8 vendor env names (CLAUDE_CODE_OAUTH_TOKEN,
  ANTHROPIC_API_KEY/AUTH_TOKEN/BASE_URL, MINIMAX/GLM/KIMI/DEEPSEEK_API_KEY).
- _audit_auth_env_presence() helper — single INFO line of NAME=set/unset
  pairs. NEVER logs values; the test pins this with a "fake-secret-MUST-
  NOT-LEAK" sentinel that must never appear in the log message.
- One call site at the end of setup()'s boot banner so every workspace
  start emits both "which provider got picked" and "which envs are present"
  in adjacent log lines.

entrypoint.sh:
- log_boot_context() function fired once before the gosu drop (as root)
  and once after (as agent) so an operator can spot env values lost
  across the privilege drop. Emits uid/gid/user/hostname/workspace_id/
  platform_url/configs_dir/workspace_dir + the same 8 env names as
  NAME=set/unset. Mirror of _AUTH_ENV_AUDIT — list pinned in sync by a
  new AST-style test (test_audit_env_list_matches_entrypoint_sh) that
  parses entrypoint.sh and asserts set-equality with adapter.py's tuple.

tests/test_adapter_logging.py (new):
- 4 tests covering the audit contract: every name appears, all-unset
  scenario, empty-string treated as unset (matches routing semantics),
  and the cross-file sync gate against entrypoint.sh's for-loop.
- Stubs molecule_runtime + a2a so the helpers can be imported without
  the real wheel installed in CI (mirrors test_adapter_prevalidate.py's
  scaffolding pattern).

Why this complements molecule-core PR #2538
-------------------------------------------
- PR #2538 makes Bug B (canvas-picked model silently dropped) impossible
  by resolving model centrally in workspace/config.py:load_config —
  every adapter (claude-code, hermes, codex, future ones) gets the
  passthrough for free.
- PR #2538 makes Bug D (preflight rejects valid auth for non-default
  models) impossible by REPLACE-not-union per-entry required_env.
- This template PR is the per-template observability layer: when one
  of those universal fixes regresses (or when an operator misconfigs a
  vendor key), the boot logs say exactly which env was present at each
  tier. Validated end-to-end on workspace
  be27badd-00a7-4cef-91e8-af428175c76f (clean boot, MINIMAX_API_KEY=set
  audited, no crash-loop).

Closes part of molecule-monorepo task #248. Sibling of #2538 for
molecule-core.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:41:05 -07:00
Hongming Wang
a78626ced4 fix(adapter): keep setup() routing — strip prefix only at CLI invocation
Live-test revealed a regression in PR #24's setup() strip: the wheel-
default `anthropic:claude-opus-4-7` paired with an OAuth workspace
(CLAUDE_CODE_OAUTH_TOKEN set, no ANTHROPIC_API_KEY) is the realistic
production shape. Stripping in setup() routes those users into the
`anthropic-api` provider entry, after which the CLI hangs at
`initialize` because no API key env is set. Caught on workspace
dd40faf8 on 2026-05-01 — banner went `provider=anthropic-api` and
A2A wedged on Control request timeout.

Pre-fix routing (let prefixed strings fall through to providers[0] =
anthropic-oauth) is actually correct for this combo. The strip is only
needed at the CLI invocation site (create_executor) where claude's
`--model` arg must be a bare id.

Tests: replace `test_setup_strip_routes_prefixed_anthropic_to_anthropic_api`
with `test_setup_keeps_prefix_routing_oauth_for_anthropic_prefix`,
which pins the inverse — prefixed model + OAuth env stays on oauth and
emits no API-key warning. The 5 unit cases on `_strip_provider_prefix`
plus the `create_executor` strip pins remain unchanged. 36/36 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:46:54 -07:00
Hongming Wang
6ba4cc6a01 fix(adapter): strip LangChain-style provider prefix before CLI invocation
The molecule-runtime wheel's config.py defaults model to
`anthropic:claude-opus-4-7` so langchain/crewai consumers get a uniform
provider:model string out of the box. The claude CLI's --model arg
expects the bare model id and silently exits 1 (no stderr) on prefixed
strings — root cause of the 2026-05-01 "Agent error (Exception)" mid-A2A
bug. Diagnosed via strace on a live workspace: the CLI received
`--model anthropic:claude-opus-4-7` and exit_group(1)'d before any
non-fatal output.

Add `_strip_provider_prefix` and call it in both setup() (so
_resolve_provider routes anthropic:claude-X correctly to anthropic-api
instead of falling back to oauth) and create_executor() (so the bare
id reaches the CLI). Only known-Claude prefixes are stripped; unknown
ones (openai:, bedrock:) pass through so the CLI fails loudly instead
of being silently mangled.

Coverage: 8 new tests — unit tests for the helper across all branches,
end-to-end `create_executor` strip on dict + dataclass shapes, and a
caplog-based setup() test that pins provider=anthropic-api routing
after the strip (the silent-fallback failure mode this fix eliminates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 16:06:53 -07:00
Hongming Wang
25e86963f3 fix(adapter): drop dead _normalize_provider({}) fallback in _resolve_provider
The empty-providers fallback in `_resolve_provider` was load-bearing
when `_load_providers` could return an empty tuple, but after PR #22's
per-entry hardening every return path yields a non-empty registry
(builtins on parse failure, the parsed list otherwise). The leftover
`_normalize_provider({})` branch became dead and outright broken: with
the stricter `_normalize_provider` rejecting nameless entries, the
fallback now returns None and would crash setup() on `provider["auth_mode"]`
the moment anything called `_resolve_provider` with an empty tuple.

Replace the dead branch with an explicit ValueError + pre-condition
docstring. Defensive — no production caller can hit this — but turns
a future silent NoneType crash into an actionable error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:05:16 -07:00
Hongming Wang
2b9b4306eb fix(adapter): per-entry isolation in _load_providers + tighten _normalize_provider
Two correctness issues spotted in self-review of c6f4912:

1. String-as-prefix typo split into character tuple. ``model_prefixes:
   mimo-`` (operator forgot brackets) used to iterate over characters
   → ``('m','i','m','o','-')``, silently routing every model id starting
   with 'm', 'i', or '-' through the entry. Now: non-list values coerce
   to empty tuple (entry survives but matches nothing — operator notices
   in boot banner, not via misrouted requests).

2. Single bad provider entry nuked the whole registry. _load_providers
   built the registry via a generator inside tuple(...). One AttributeError
   mid-comprehension (e.g. ``[mimo-, 123]`` — int's missing .lower())
   propagated out, broad except caught it, registry silently fell back
   to _BUILTIN_PROVIDERS (oauth + anthropic-api only). Every third-party
   model would then route to anthropic-oauth — exactly the silent-fallback
   failure mode this PR was meant to eliminate. Now: per-entry try/except
   drops the bad entry with a warning, rest survives.

Also: entries without a string ``name`` field are now dropped with a
warning instead of silently using the placeholder ``<unnamed>`` —
operator typos surface in boot logs.

Tests: 28 passing (3 new regression tests covering both issues plus
the no-name path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:58:24 -07:00
Hongming Wang
c6f4912d09 feat(adapter): data-driven provider registry in config.yaml
Move the model→endpoint→auth-env mapping out of hardcoded constants
in adapter.py + entrypoint.sh into a single `providers:` list at the
top of config.yaml. The adapter loads it at boot via _load_providers;
canvas Config tab will read the same YAML for its Provider dropdown so
UI and adapter never disagree on what's available. Adding a new
provider becomes a one-line YAML edit — no Python or shell changes.

Includes 5 third-party providers ready out of the box (Anthropic-compat
endpoints, Bearer-style ANTHROPIC_AUTH_TOKEN OR ANTHROPIC_API_KEY auth):

  xiaomi-mimo  https://api.xiaomimimo.com/anthropic
  minimax      https://api.minimax.io/anthropic
  zai          https://api.z.ai/api/anthropic           (NEW)
  moonshot     https://api.moonshot.ai/anthropic        (NEW)
  deepseek     https://api.deepseek.com/anthropic       (NEW)

Plus 7 new model entries in runtime_config.models (mimo-v2.5, MiniMax-M2,
MiniMax-M2.7, GLM-4.6, GLM-4.5, kimi-k2.5, kimi-k2, deepseek-v4-pro,
deepseek-v4-flash) so they show up in the Canvas Config dropdown.

Operator override unchanged: ANTHROPIC_BASE_URL set as a workspace
secret still wins over the registry default — the escape hatch for
regional endpoints (Xiaomi token-plan-sgp, MiniMax api.minimaxi.com).

entrypoint.sh: drops the `mimo-*` case mapping (adapter handles routing
now). _BUILTIN_PROVIDERS retained as malformed-YAML fallback so a
bare-bones workspace still boots with oauth + anthropic-api defaults.

Tests: 25 passing. New coverage:
  - YAML parses + normalizes to expected shape
  - Malformed YAML falls back to builtins (warning, not raise)
  - Each new provider routes its model id to the right base_url
  - ANTHROPIC_AUTH_TOKEN alone satisfies third-party auth check
  - Operator-set ANTHROPIC_BASE_URL overrides registry default
  - Case-insensitive prefix match (MiniMax-M2 / minimax-m2.7 / GLM-4.6)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:29:40 -07:00
Hongming Wang
c646b8cebe feat(adapter): raise on third-party model without ANTHROPIC_BASE_URL
Aligns setup()'s third-party-model-without-URL handling with
create_executor()'s pre-validate (#19) — both unrecoverable
misconfigurations now raise ValueError at boot instead of one warning
and one raising.

Why: a third-party (mimo-*) model selected without ANTHROPIC_BASE_URL
sends every LLM request to api.anthropic.com with a non-Anthropic key,
401-ing every prompt. Workspace boots, looks "online" via heartbeat,
but is structurally broken on the user-facing path. The previous
warning-only path produced the same end-user symptom as the
2026-04-30 incident (workspace looks alive, every interaction fails)
just via a different misconfig shape.

Symmetry: create_executor raises when ANTHROPIC_BASE_URL is set to a
non-Anthropic host but no model is picked. setup() now raises when a
third-party model is picked but no URL is set. Together they catch
both halves of the misconfig surface at boot, before the workspace
enters "online" status.

Adds 4 setup() tests:
- raises on third-party + no URL
- passes on third-party + URL
- passes on OAuth alias (sonnet) + no URL
- passes on Anthropic API id (claude-*) + no URL

Stubs molecule_runtime.plugins.load_plugins as a no-op so the pass-path
tests run cleanly without the runtime installed. Test count: 11 (7
create_executor + 4 setup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:50:25 -07:00
Hongming Wang
a4d83cb356 chore(adapter): drop redundant urlparse imports + dead ternary
Self-review follow-up to #19. Two cosmetic cleanups:

- urlparse is now imported at module-top (added in #17 alongside the
  auth-mode classification) so the two inline `from urllib.parse import
  urlparse` statements inside conditional branches are redundant.
- The log-format ternary " (custom upstream)" if base_url else "" lives
  inside `if base_url:` — base_url is unconditionally truthy there, so
  the else branch was dead code.

No behavior change. Tests still 7/7 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:45:04 -07:00
Hongming Wang
61f935674f
Merge branch 'main' into feat/adapter-prevalidate 2026-04-30 22:38:53 -07:00
Hongming Wang
0d95b5098a feat(adapter): pre-validate ANTHROPIC_BASE_URL + missing model combo
The 2026-04-30 staging incident traced back to workspaces booting with
ANTHROPIC_BASE_URL pointing at a non-Anthropic shim (MiniMax / OpenAI
gateway) but no explicit model configured. The adapter silently fell
back to "sonnet" — an Anthropic-native alias the upstream didn't
recognize — and the SDK --print probe hung 30s before timing out.
Platform's phantom-busy sweep then nuked the workspace at 10min,
producing "every workspace dead" with the root cause buried in a
30s subprocess hang.

Pre-validate the combo at adapter boot: when ANTHROPIC_BASE_URL host
is non-Anthropic AND no explicit model is set, raise ValueError with
an actionable message pointing to MODEL_PROVIDER / runtime_config.model.
Also log the resolved model + base_url_host every boot so future
failures explain themselves in the workspace logs without digging
into the SDK subprocess.

Tests live under tests/ with their own pytest.ini that anchors rootdir
there — keeps pytest from importing the package __init__.py (which
does the runtime-discovery relative import that requires
molecule_runtime installed). 7 tests cover: misconfig raises with the
right message, Anthropic-native passes, no-base-url passes, custom-url
+ explicit model passes, dataclass + dict shapes, unparseable URL
no-crash. CI runs them on every push/PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 22:35:49 -07:00
Hongming Wang
824bc4a176 adapter: warn for the right env var per auth mode + log boot banner
The pre-multi-provider warning hardcoded CLAUDE_CODE_OAUTH_TOKEN — it
fired even when an operator legitimately picked claude-sonnet-4-6 (API
key) or mimo-v2-flash (third-party) and set ANTHROPIC_API_KEY instead.
Misleading.

Now classifies the picked model into oauth / anthropic_api /
third_party_anthropic_compat and warns about the env var that auth path
actually needs. Adds a single-line boot banner so workspace logs surface
which provider was selected and (for third-party) which base-URL host
took effect — host-only, never full URL.

Adds an additional warning when a third-party model is selected but
ANTHROPIC_BASE_URL is unset, since the symptom otherwise is silent
fall-through to api.anthropic.com with a third-party key (401).

Functional tests against 14 model-id cases (oauth aliases, claude-*
versioned, all 4 mimo-* variants, case-insensitivity, empty/None,
unknown id fallback) all pass — see commit's pre-push validation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 03:15:21 -07:00
c930626f82 fix(adapter): warn at startup if CLAUDE_CODE_OAUTH_TOKEN is absent (KI-001)
adapter.py:setup() now emits a logger.warning() if CLAUDE_CODE_OAUTH_TOKEN
is absent, so operators see the problem immediately instead of getting a silent
AuthenticationError on the first LLM call. known-issues.md updated to mark
KI-001 as resolved.
2026-04-29 01:57:16 -07:00
Hongming Wang
e7dea39df2 fix: qualify all bare imports of runtime modules
Five `from <runtime_module> import` statements in adapter.py +
claude_sdk_executor.py were never qualified when the template was
extracted to its own repo (#87). They worked when the runtime was
bundled into workspace/ where bare imports resolved against
sibling files; in the template repo they explode at startup with
ModuleNotFoundError as soon as Python reaches the import.

Caught by manual provision after pipeline-3 wire-real E2E. The
plugins import was the first one tripped because it sits in
adapter.setup() — earlier bare imports inside claude_sdk_executor.py
are deferred until the executor is constructed.

Pattern: any `from <X> import Y` where X is a workspace/ module ->
`from molecule_runtime.X import Y`. Fixes:
- adapter.py:97          plugins
- claude_sdk_executor.py executor_helpers, heartbeat, a2a_client, platform_auth

Same class of bug as the runtime's TOP_LEVEL_MODULES drift but
inverted — instead of forgetting to rewrite imports IN the wheel,
the template authors forgot to qualify imports IN the template
code (the build script's rewriter only runs on workspace/ -> wheel).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 05:20:24 -07:00
Hongming Wang
280e89c50b fix: export Adapter alias so runtime adapter discovery works
`workspace/adapters/__init__.py:get_adapter()` does
`getattr(mod, "Adapter")` after importing ADAPTER_MODULE. Without the
alias the runtime's preflight check fails with:

  [FAIL] Runtime: ADAPTER_MODULE='adapter' imported, but no `Adapter`
  class is exported. Add `Adapter = YourAdapterClass` at module scope

Symptom: workspace container restarts forever, never reaches `online`.

This contract was added (or hardened) in #123's adapter-discovery
refactor. Hermes's adapter.py already has `Adapter = HermesAgentAdapter`
at module scope; claude-code missed the migration. gemini-cli template
has the same bug — file separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 04:55:30 -07:00
Hongming Wang
c9ca671cd5 feat(adapter): declare provides_native_session + idle_timeout_override
Wires this template into the platform's capability-primitive layer
(molecule-core task #117). Two declarations:

1. RuntimeCapabilities(provides_native_session=True) — the claude-agent-sdk
   maintains a long-lived streaming session with its own client state.
   The platform's a2a_queue would double-buffer that in-flight state
   if it didn't know the SDK owned it. Once primitive #5 lands in
   molecule-core, the platform's enqueue path will skip workspaces
   declaring this and dispatch directly.

2. idle_timeout_override() returning 900 (15 min) — Opus + multi-step
   tool use legitimately runs 8-10 min between broadcaster events.
   The pre-capability bug (molecule-core PR #2128) hit this: the
   platform's 5min idle timer cancelled mid-flight during long
   packaging steps. The override moves the per-workspace ceiling up
   without leaving genuinely-wedged runs hanging too long. Consumed
   by molecule-core PR #2139 in a2a_proxy.dispatchA2A.

Other capability flags stay False — see inline docstring for the
per-flag rationale (notably native_status_mgmt is partially adapter-
driven via runtime_state="wedged" but the recovery path stays platform-
owned, so we don't claim it yet).

Requires molecule-ai-workspace-runtime with RuntimeCapabilities (PR
#2137 in molecule-core, merged 2026-04-27). The current
requirements.txt pin (>=0.1.0) will pick up the latest released
version on next image rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:07:28 -07:00
Hongming Wang
7f9b2b4189 feat: add adapter code + Dockerfile for standalone deployment
Adapters extracted from molecule-monorepo/workspace-template.
Uses molecule-ai-workspace-runtime PyPI package for shared infrastructure.

- adapter.py — runtime-specific adapter class
- requirements.txt — runtime-specific deps + molecule-ai-workspace-runtime
- Dockerfile — FROM python:3.11-slim, pip install, COPY adapter, molecule-runtime entrypoint
- ADAPTER_MODULE=adapter tells the runtime to load this repo's Adapter class

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 04:27:22 -07:00