fix(adapter): surface real CLI error message to user when SDK swallows it (#46) #83

Open
core-be wants to merge 18 commits from fix/template46-adapter-only into main
Member

Summary

Addresses PR #79 CR2 Finding 1 + Finding 2 (scope-creep split).

  • fix(adapter): surface real CLI error message — when the Claude SDK swallows stderr, the adapter now probes the CLI directly to extract the real error message
  • test(adapter): regression tests — sad path (swallowed stderr propagates) + happy path (normal output unaffected)

Test plan

  • python -m pytest tests/test_cli_error_probe.py -v passes

🤖 Generated with Claude Code

## Summary Addresses PR #79 CR2 Finding 1 + Finding 2 (scope-creep split). - `fix(adapter): surface real CLI error message` — when the Claude SDK swallows stderr, the adapter now probes the CLI directly to extract the real error message - `test(adapter): regression tests` — sad path (swallowed stderr propagates) + happy path (normal output unaffected) ## Test plan - [ ] `python -m pytest tests/test_cli_error_probe.py -v` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code)
core-be added 18 commits 2026-06-03 19:32:17 +00:00
Replace direct `new_text_message` + ad-hoc Message construction with the canonical `new_response_message(context, text, files=...)` helper added in molecule-core PR #2271. Threads task_id/context_id back to the platform's a2a proxy uniformly with every other adapter.
test: bash coverage for entrypoint.sh log_boot_context()
CI / validate (push) Failing after 0s
CI / Adapter unit tests (push) Failing after 6s
227787bbbd
The Python adapter audit (test_adapter_logging.py) pins the
adapter.py side, but the entrypoint shell function fires earlier and
twice (pre-gosu + post-gosu). When the SDK import wedge keeps the
adapter from running at all, the shell emission is the operator's
only visibility into the boot env.

Eight new tests cover:
- env NAME=set / env NAME=unset shape for every audited var
- value-leak guard: secret strings never appear in output
- WORKSPACE_ID + PLATFORM_URL passthrough by value (not secret)
- <unset> fallback for missing platform identifiers
- uid/gid line shape (used to verify the privilege drop)
- dated boot banner shape (used to count restarts in a crash loop)
- cross-file gate: shell for-loop names == fixture tuple, mirroring
  test_audit_env_list_matches_entrypoint_sh's adapter.py↔shell gate

Strategy: regex-extract the function body from entrypoint.sh and run
it in a fresh /bin/sh with controlled env. We never source the whole
entrypoint because it would chown /workspace and exec molecule-runtime.

Closes the gap from task #251 (follow-up to PR #32 boot-debug logging).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test: bash coverage for entrypoint.sh log_boot_context()
migrate: use new_response_message helper (threads task_id/context_id)
Same fix as template-langgraph (verified green there) — bumps the
python:3.11-slim base's old transitives (jaraco.context, wheel,
setuptools) that Trivy correctly flags as fixable HIGH severity.
molecule-ci#38 Phase-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude_sdk_executor.py imports new_response_message from
molecule_runtime.executor_helpers (added in PR #16,
migrate/use-new-response-message). The two test files that import
the executor stub a slimmed-down molecule_runtime.executor_helpers
module to avoid pulling the real wheel in CI — but the stub list
was never updated to include new_response_message, so every test
that loads the executor crashes at import:

  ImportError: cannot import name 'new_response_message' from
  'molecule_runtime.executor_helpers' (unknown location)

Add the same `_ensure_attr(helpers, "new_response_message", ...)`
line to both stub bundles. Now matches the public symbol surface
the executor needs at module load.

Was failing every push on main since PR #16 merged.
fix(tests): stub new_response_message in molecule_runtime helpers
The platform provisioner writes /configs/config.yaml with only
runtime/model/name — providers are a TEMPLATE concern and ship in
the template repo's own config.yaml (cloned to /opt/adapter/ at
provision time). _load_providers was only reading from /configs
and silently falling back to _BUILTIN_PROVIDERS (oauth +
anthropic-api), so every MiniMax/GLM/Kimi/DeepSeek model id
resolved to anthropic-oauth and the workspace died at first LLM
call with "Not logged in. Please run /login".

Fix: prefer the template-bundled config.yaml (next to adapter.py
via os.path.dirname(__file__)). Per-workspace config.yaml stays
as a fallback so operators on private deployments can override
the providers list. Builtins stay as the final fallback.

Caught by SSH'ing into a debug workspace on 2026-05-04 — the cron
canary masked this bug because it used direct-Anthropic until the
2026-05-03 migration to MiniMax (PR #2710 in molecule-core).

Includes 4 regression tests pinning the load order:
  - template wins when workspace config has no providers section
  - MiniMax-M2.7-highspeed routes to minimax provider end-to-end
  - per-workspace override honored when template config absent
  - builtins fallback when neither config exists

The new tests fail on the pre-fix code (verified by stashing the
adapter.py change and re-running) — they catch the actual silent
fallback rather than just structural shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5 _load_providers parsing tests in test_adapter_prevalidate.py
write a custom per-workspace config.yaml and assert against its
contents. Post-fix, _load_providers prefers the template-bundled
config.yaml (next to adapter.py) over the workspace one, so those
tests broke against the real shipped template registry.

Add a _isolate_template_config helper that monkeypatches
os.path.dirname in the adapter module's namespace so the template
lookup falls through to the per-workspace candidate. Each affected
test now creates a fake-template / workspace tmp_path pair and
calls the helper.

Test intent is unchanged — they still pin the per-workspace YAML
parsing contract (per-entry isolation, malformed-YAML fallback,
string-vs-list coercion, missing-name handling). They just now
exercise that contract via the per-workspace fallback path
instead of the template path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(adapter): load providers from template config.yaml, not /configs
PR #37 added a template-bundled lookup using
``os.path.dirname(os.path.abspath(__file__))`` and verified the fix
on a workspace whose adapter.py was patched in-place. But the
platform's EC2-host install path
(molecule-controlplane ec2.go:2091) copies adapter.py into the
venv's site-packages dir:

    cp /opt/adapter/*.py /opt/molecule-venv/lib/python3.12/site-packages/

The host install does NOT set ``PYTHONPATH=/opt/adapter``, so site-
packages wins on import — ``__file__`` resolves to the venv path,
which has NO adjacent config.yaml. Same silent fallback to
``anthropic-oauth`` and same "Not logged in. Please run /login"
crash as the original PR #37 incident.

Verified on 2026-05-04 11:08Z: a fresh-provisioned MiniMax workspace
(cloned from main of this repo POST-PR-#37) showed the exact same
boot banner — ``provider=anthropic-oauth`` — because the venv-
installed adapter.py couldn't find its sibling config.yaml.

Fix: add the canonical install path
``/opt/adapter/config.yaml`` as the first lookup candidate.
This is invariant across both Docker (mounted /app→/opt/adapter)
and EC2-host (cloned by ec2.go) install paths. The ``__file__``
candidate stays as a dev/test fallback; the per-workspace yaml
remains as an operator-override path; builtins as final fallback.

Includes a regression test
(``test_load_providers_uses_canonical_path_when_file_dir_is_site_packages``)
that simulates the EC2-host install layout — ``__file__`` in a fake
site-packages dir without config.yaml, canonical path with the
registry — and pins ``MiniMax-M2.7-highspeed`` → ``minimax``
provider routing. Verified the test fails on the pre-this-PR code
(AttributeError on the missing ``_CANONICAL_ADAPTER_DIR`` constant).

Existing parsing tests in test_adapter_prevalidate.py +
test_provider_routing.py monkeypatch ``_CANONICAL_ADAPTER_DIR``
alongside ``os.path.dirname`` so they remain robust on CI runners
that happen to have a populated /opt/adapter dir.

Follow-up: also worth setting ``PYTHONPATH=/opt/adapter`` in the
provisioner env so Python imports adapter from /opt/adapter
directly (instead of relying on the site-packages copy). That's a
molecule-controlplane PR; this template-side fix makes the adapter
robust against any provisioner-side import-path layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fix(adapter): check /opt/adapter/config.yaml as canonical providers source
Pure cleanup — yaml is already a transitive dep via
molecule-ai-workspace-runtime, and _load_providers ran the
import inside the candidate-iteration loop. Moving it to module
scope makes the loop body show only what's actually different
between iterations (yaml_path).

No behavior change. Closes #39.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
adapter: hoist `import yaml` out of _load_providers loop
When the claude-agent-sdk raises a bare Exception with
\"Claude Code returned an error result: <subtype>\" or the
\"Check stderr output for details\" placeholder, the canvas previously
showed only the generic \"Agent error (Exception)\" message.

Now we probe the CLI directly via _probe_claude_cli_error() and pass
the captured stdout to sanitize_agent_error(..., stderr=probed) so the
user sees the actual actionable reason (e.g. rate-limit reset time)
instead of having to SSH into the container to find out why.

Fixes template#46
Related: molecule-core#211

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test(adapter): add happy-path guard for CLI error probe (PR #79 Finding 1)
CI / validate (push) Failing after 1s
CI / validate (pull_request) Failing after 0s
Secret scan / secret-scan (pull_request) Failing after 31s
CI / Adapter unit tests (push) Failing after 1m7s
CI / Adapter unit tests (pull_request) Failing after 1m0s
fc8ef37b82
When _run_query succeeds, the probe must not fire and stderr= must
stay None. This closes the CR2 gap: the existing tests only covered
exception paths, leaving the happy path unguarded.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Some checks are pending
CI / validate (push) Failing after 1s
CI / validate (pull_request) Failing after 0s
Secret scan / secret-scan (pull_request) Failing after 31s
CI / Adapter unit tests (push) Failing after 1m7s
CI / Adapter unit tests (pull_request) Failing after 1m0s
Required
Details
CI / Template validation (runtime) (pull_request)
Required
CI / Template validation (static) (pull_request)
Required
Secret scan / Scan diff for credential-shaped strings (pull_request)
Required
Checking for merge conflicts…
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u origin fix/template46-adapter-only:fix/template46-adapter-only
git checkout fix/template46-adapter-only
Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-ai-workspace-template-claude-code#83