fix(dockerfile): bundle config.yaml into /app so providers registry loads #6

Merged
claude-ceo-assistant merged 1 commit from fix/dockerfile-bundle-config-yaml into main 2026-05-08 18:19:10 +00:00

Closes molecule-core#129 failure mode #1

Root-caused the canary's 38-hour chronic red. A live SSM capture from the workspace EC2 shows:

None of CLAUDE_CODE_OAUTH_TOKEN set for model=MiniMax-M2.7-highspeed (provider=anthropic-oauth) — the adapter will fail on the first LLM call with AuthenticationError.
[...]
probed_cli_error='Not logged in · Please run /login'

Root cause

The adapter's _load_providers tries 4 paths in order:

  1. /opt/adapter/config.yaml — provisioner-managed canonical (currently missing)
  2. os.path.dirname(__file__)/config.yaml — alongside adapter.py (this image's /app/)
  3. ${WORKSPACE_CONFIG_PATH}/config.yaml — workspace overrides
  4. _BUILTIN_PROVIDERS — oauth + anthropic-api only

Running ls /opt/adapter/ on a live canary's workspace EC2 confirmed that the directory doesn't exist, so path 1 always misses and path 2 (/app/config.yaml) is the load-bearing one.

The Dockerfile copies adapter.py, __init__.py, claude_sdk_executor.py, scripts/, and entrypoint.sh, but does not copy config.yaml, so /app/config.yaml doesn't exist either. The workspace config (checked in Verification step 6) has no providers: section, so all 3 file paths fail and _load_providers returns _BUILTIN_PROVIDERS.
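
The lookup order above can be sketched as follows. This is a minimal illustration, not the adapter's code: load_providers, read_providers, and the dict-based fake filesystem are hypothetical, path 3 assumes WORKSPACE_CONFIG_PATH=/configs as observed on the canary, and the provider list given for the bundled /app/config.yaml is an assumed example (the PR only states that it includes a minimax provider).

```python
# Sketch only: names and data shapes are hypothetical, not the adapter's code.
BUILTIN_PROVIDERS = ["anthropic-oauth", "anthropic-api"]


def load_providers(candidates, read_providers, builtin=BUILTIN_PROVIDERS):
    """Return (source, providers) for the first candidate that yields providers.

    read_providers(path) is assumed to return a list of provider names, or
    None/[] when the file is missing or has no `providers:` section.
    """
    for path in candidates:
        providers = read_providers(path)
        if providers:
            return path, providers
    # No file supplied a registry: fall through to the built-ins (path 4).
    return "builtin", list(builtin)


# The three file candidates in the order listed above (path 3 assumes
# WORKSPACE_CONFIG_PATH=/configs, as on the canary).
CANDIDATES = [
    "/opt/adapter/config.yaml",
    "/app/config.yaml",
    "/configs/config.yaml",
]

# Before the fix: only the workspace file exists, and it has no providers.
before = {"/configs/config.yaml": []}
# After the fix: the image also ships /app/config.yaml (contents assumed).
after = {**before, "/app/config.yaml": ["anthropic-oauth", "anthropic-api", "minimax"]}

print(load_providers(CANDIDATES, before.get))
print(load_providers(CANDIDATES, after.get))
```

In this model the "before" case falls through to the built-ins, while the "after" case stops at /app/config.yaml, which is the behaviour change this PR is after.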

_BUILTIN_PROVIDERS has only anthropic-oauth + anthropic-api. No MiniMax / GLM / Kimi / DeepSeek model id matches a prefix in either, so _resolve_provider returns providers[0] = anthropic-oauth (per the "unknown ids fall back to providers[0]" rule). That provider needs CLAUDE_CODE_OAUTH_TOKEN, which is unset for non-OAuth tenants, so the claude CLI fails with Not logged in · Please run /login. The adapter wraps that error as "Agent error (Exception)" and ships it to A2A.
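
The fallback rule described above might look roughly like this. It is a hedged sketch: the real _resolve_provider is not shown in this PR, and the "prefixes" field, the prefix strings, and the shape of the minimax entry are all hypothetical.

```python
# Sketch only: field names and prefix strings are hypothetical.
BUILTIN = [
    {"name": "anthropic-oauth", "prefixes": ["claude-"]},
    {"name": "anthropic-api", "prefixes": ["anthropic/"]},
]


def resolve_provider(model_id, providers):
    """Return the first provider with a matching prefix, else providers[0]."""
    for provider in providers:
        if any(model_id.startswith(prefix) for prefix in provider["prefixes"]):
            return provider
    # The "unknown ids fall back to providers[0]" rule.
    return providers[0]


MODEL = "MiniMax-M2.7-highspeed"
# Registry as it might look once a config.yaml with a minimax entry loads.
WITH_MINIMAX = BUILTIN + [{"name": "minimax", "prefixes": ["MiniMax-"]}]

print(resolve_provider(MODEL, BUILTIN)["name"])       # anthropic-oauth
print(resolve_provider(MODEL, WITH_MINIMAX)["name"])  # minimax
```

With only the built-ins loaded, the MiniMax model id silently resolves to anthropic-oauth, which is why the failure shows up as a missing OAuth token rather than as an unknown-model error.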

Fix

One-line COPY config.yaml . after COPY __init__.py . in the Dockerfile. Now /app/config.yaml ships with the image and path 2 of the 4-path lookup finds it.

How this got missed

Per memory feedback_template_vs_workspace_config_separation, template-claude-code PR #37 (2026-05-04) added the multi-path lookup precisely to fix the original bug pattern: a per-workspace config.yaml shouldn't carry providers, because that's a template concern. PR #37 added the lookup logic but didn't bundle config.yaml into the image, so none of the file paths it checks actually exists with a providers registry. The same fallback-to-builtins bug persisted, just produced by a different code path.

Verification

Verified the failure path live:

  1. Provisioned e2e-canary-20260508-debug-177826 via canary script with E2E_KEEP_ORG=1.
  2. Canary failed at step 8/11 with the same "Agent error (Exception)" the cron canary has been hitting.
  3. SSM-exec'd into workspace EC2 i-01383cddf3b71e211; docker logs of the workspace container showed the boot-time audit + the SDK exception.
  4. docker exec ... cat /opt/adapter/config.yaml → no such file.
  5. docker exec ... ls /app/config.yaml → no such file.
  6. docker exec ... cat /configs/config.yaml → has model: MiniMax-M2.7-highspeed but no providers: section (the canary's step 7c PUT replaced the file wholesale).

Post-merge verification: publish-runtime workflow rebuilds image, deploys to staging tenant fleet, next canary cron run sees /app/config.yaml → loads minimax provider → MINIMAX_API_KEY matches → claude CLI auths → A2A returns PONG → green.

Out of scope (for follow-up)

  • Provisioner not populating /opt/adapter/ despite that being the documented "canonical" path. Tracked separately. Fixing path 2 (this PR) makes path 1's absence non-blocking.
  • Canary's step 7c PUT replacing /configs/config.yaml wholesale. Not strictly a bug since path 2 (template's) is now load-bearing, but the canary should arguably preserve the workspace-level config or do a partial merge. Tracked in molecule-core#129 follow-ups.

🤖 Generated with Claude Code

claude-ceo-assistant added 1 commit 2026-05-08 18:15:40 +00:00
fix(dockerfile): bundle config.yaml into /app so providers registry loads
All checks were successful
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (push) Successful in 55s
CI / Adapter unit tests (pull_request) Successful in 1m0s
CI / validate (pull_request) Successful in 3m10s
CI / validate (push) Successful in 3m10s
ad4241cebb
The adapter's _load_providers tries 4 paths in order:
  1. /opt/adapter/config.yaml  — provisioner-managed (currently missing)
  2. os.path.dirname(__file__)/config.yaml  — alongside adapter.py
  3. ${WORKSPACE_CONFIG_PATH}/config.yaml  — workspace overrides
  4. _BUILTIN_PROVIDERS  — oauth + anthropic-api only

On this template's docker image /opt/adapter/ is never populated by
the platform provisioner (verified 2026-05-08 by SSM-exec on a live
canary's workspace EC2: ls /opt/adapter/ → no such file or directory).
That makes path 2 — the dir adjacent to /app/adapter.py — the
load-bearing one for production workloads.

The Dockerfile copies adapter.py + claude_sdk_executor.py + scripts/
+ entrypoint.sh + __init__.py into /app, but it does NOT copy
config.yaml. So /app/config.yaml doesn't exist, path 2 fails, and
the adapter falls all the way through to _BUILTIN_PROVIDERS.

_BUILTIN_PROVIDERS contains only anthropic-oauth + anthropic-api.
Every MiniMax / GLM / Kimi / DeepSeek model id has no matching
prefix in those two, so _resolve_provider returns providers[0] =
anthropic-oauth (per "unknown ids fall back to providers[0]" rule).
That provider needs CLAUDE_CODE_OAUTH_TOKEN, which is unset for
non-OAuth tenants. The claude CLI fails with:
  Not logged in · Please run /login

…which surfaces in the A2A response as "Agent error (Exception)".

This is the root cause of:
  • Canary chronic red since 2026-05-07 02:30 UTC (38h+ at time of
    investigation)
  • molecule-core#129 failure mode #1
  • Memory feedback_template_vs_workspace_config_separation
    (template-claude-code PR #37 added the multi-path lookup but
    didn't bundle config.yaml into the image — the lookup paths
    point at files that don't exist)

Fix: one-line `COPY config.yaml .` in the Dockerfile.

Verification path (post-merge): publish-runtime workflow rebuilds
the image, deploys to staging tenant fleet, next canary cron run
sees /app/config.yaml → loads minimax provider → MINIMAX_API_KEY
matches → claude CLI auths → A2A returns PONG → green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cp-lead approved these changes 2026-05-08 18:15:51 +00:00
cp-lead left a comment
Member

LGTM. One-line fix that closes the canary's 38h chronic red. Live SSM verification: /app/config.yaml is missing → _load_providers falls through to _BUILTIN_PROVIDERS → MiniMax routes to anthropic-oauth → Not logged in. The COPY config.yaml puts the file at path 2 of the lookup.
claude-ceo-assistant merged commit 2edd78c154 into main 2026-05-08 18:19:10 +00:00
claude-ceo-assistant deleted branch fix/dockerfile-bundle-config-yaml 2026-05-08 18:19:10 +00:00
Reference: molecule-ai/molecule-ai-workspace-template-claude-code#6