Commit Graph

7 Commits

Author SHA1 Message Date
dev-lead
ad4241cebb fix(dockerfile): bundle config.yaml into /app so providers registry loads
The adapter's _load_providers tries 4 sources in order:
  1. /opt/adapter/config.yaml  — provisioner-managed (currently missing)
  2. os.path.dirname(__file__)/config.yaml  — alongside adapter.py
  3. ${WORKSPACE_CONFIG_PATH}/config.yaml  — workspace overrides
  4. _BUILTIN_PROVIDERS  — oauth + anthropic-api only
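A minimal Python sketch of the lookup order above, assuming the first source that exists wins. `candidate_config_paths`, `pick_config_source`, and the injectable `exists` check are hypothetical names for illustration, not the adapter's actual _load_providers code, and the sketch only decides which source would be used (it does not parse YAML).

```python
import os
from typing import Callable, List, Optional


def candidate_config_paths(adapter_file: str, workspace_config_path: Optional[str]) -> List[str]:
    """config.yaml candidates, highest priority first (hypothetical helper)."""
    candidates = [
        "/opt/adapter/config.yaml",  # 1. provisioner-managed
        os.path.join(os.path.dirname(adapter_file), "config.yaml"),  # 2. next to adapter.py
    ]
    if workspace_config_path:  # 3. workspace overrides, only when the env var is set
        candidates.append(os.path.join(workspace_config_path, "config.yaml"))
    return candidates


def pick_config_source(
    adapter_file: str,
    workspace_config_path: Optional[str],
    exists: Callable[[str], bool] = os.path.isfile,
) -> Optional[str]:
    """First existing candidate, or None meaning 'use the built-in providers'."""
    for path in candidate_config_paths(adapter_file, workspace_config_path):
        if exists(path):
            return path
    return None  # 4. fall through to _BUILTIN_PROVIDERS


if __name__ == "__main__":
    # Pre-fix image: no candidate exists, so the adapter ends up on the built-ins.
    print(pick_config_source("/app/adapter.py", None, exists=lambda p: False))
    # Post-fix image: COPY config.yaml puts the file at /app/config.yaml.
    print(pick_config_source("/app/adapter.py", None, exists=lambda p: p == "/app/config.yaml"))
```

In a real process the call would look like `pick_config_source(__file__, os.environ.get("WORKSPACE_CONFIG_PATH"))`; injecting `exists` just makes the pre-fix and post-fix images easy to simulate.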

In this template's Docker image, /opt/adapter/ is never populated by
the platform provisioner (verified 2026-05-08 by SSM-exec on a live
canary's workspace EC2: ls /opt/adapter/ → no such file or directory).
That makes path 2 — the dir adjacent to /app/adapter.py — the
load-bearing one for production workloads.

The Dockerfile copies adapter.py + claude_sdk_executor.py + scripts/
+ entrypoint.sh + __init__.py into /app, but it does NOT copy
config.yaml. So /app/config.yaml doesn't exist, path 2 fails, and
the adapter falls all the way through to _BUILTIN_PROVIDERS.

_BUILTIN_PROVIDERS contains only anthropic-oauth + anthropic-api.
No MiniMax / GLM / Kimi / DeepSeek model id matches a prefix in
either of those two, so _resolve_provider returns providers[0] =
anthropic-oauth (per the "unknown ids fall back to providers[0]" rule).
That provider needs CLAUDE_CODE_OAUTH_TOKEN, which is unset for
non-OAuth tenants. The claude CLI fails with:
  Not logged in · Please run /login

…which surfaces in the A2A response as "Agent error (Exception)".
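The fallback can be sketched as prefix matching with a default. The `Provider` dataclass, its `model_prefixes`/`auth_env` fields, every prefix, the `ANTHROPIC_API_KEY` mapping, and the `MiniMax-M2` model id are hypothetical; only the provider names, CLAUDE_CODE_OAUTH_TOKEN, MINIMAX_API_KEY, and the providers[0] rule come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Provider:
    name: str
    # Hypothetical fields: model-id prefixes this provider serves, and the
    # env var its CLI auth depends on.
    model_prefixes: List[str] = field(default_factory=list)
    auth_env: str = ""


def resolve_provider(model_id: str, providers: List[Provider]) -> Provider:
    """First provider with a matching prefix wins; unknown ids get providers[0]."""
    for provider in providers:
        if any(model_id.startswith(p) for p in provider.model_prefixes):
            return provider
    return providers[0]


# Built-ins only (the pre-fix state); prefixes are made up for illustration.
BUILTINS = [
    Provider("anthropic-oauth", ["claude-"], "CLAUDE_CODE_OAUTH_TOKEN"),
    Provider("anthropic-api", ["anthropic/"], "ANTHROPIC_API_KEY"),
]
# What a bundled /app/config.yaml might add (hypothetical entry).
WITH_CONFIG = BUILTINS + [Provider("minimax", ["MiniMax-"], "MINIMAX_API_KEY")]

if __name__ == "__main__":
    print(resolve_provider("MiniMax-M2", BUILTINS).auth_env)     # CLAUDE_CODE_OAUTH_TOKEN
    print(resolve_provider("MiniMax-M2", WITH_CONFIG).auth_env)  # MINIMAX_API_KEY
```

With only the built-ins loaded, the MiniMax id silently lands on the OAuth provider, whose token is unset, which is exactly the "Not logged in" path described above.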

This is the root cause of:
  • Canary chronic red since 2026-05-07 02:30 UTC (38h+ at time of
    investigation)
  • molecule-core#129 failure mode #1
  • Memory feedback_template_vs_workspace_config_separation
    (template-claude-code PR #37 added the multi-path lookup but
    didn't bundle config.yaml into the image — the lookup paths
    point at files that don't exist)

Fix: one-line `COPY config.yaml .` in the Dockerfile.

Verification path (post-merge): publish-runtime workflow rebuilds
the image, deploys to staging tenant fleet, next canary cron run
sees /app/config.yaml → loads minimax provider → MINIMAX_API_KEY
matches → claude CLI auths → A2A returns PONG → green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:15:39 -07:00
Hongming Wang
fd92de2591
Merge branch 'main' into fix/wire-up-gh-token-refresh 2026-04-29 00:56:02 -07:00
Hongming Wang
de2ab5ab33 feat: forward client_payload.runtime_version + ARG RUNTIME_VERSION
Closes the cache trap structurally (instead of pin-bumping every
runtime release):
1. publish-image.yml caller now forwards
   github.event.client_payload.runtime_version (set by cascade) to
   the molecule-ci reusable workflow as runtime_version input.
2. Reusable workflow forwards it to docker build as a --build-arg.
3. Dockerfile declares ARG RUNTIME_VERSION near the pip install
   layer so its value becomes part of the cache key.
4. The pip install RUN command does an extra targeted upgrade to
   the exact version when ARG is set — guarantees the version is
   what we expect even if requirements.txt resolves to something
   else.
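Steps 3 and 4 can be modeled in a few lines of Python. `pip_install_commands` is a hypothetical helper mirroring the described RUN logic: install requirements.txt, then do a targeted `--upgrade` to the exact version only when RUNTIME_VERSION is non-empty. `toy_layer_cache_key` is a deliberately simplified stand-in for Docker's build cache, not its real algorithm; it captures the one property that matters here, which Docker documents: RUN instructions after an ARG see its value, so a new value causes a cache miss.

```python
import hashlib
from typing import Dict, List, Optional

RUNTIME_PACKAGE = "molecule-ai-workspace-runtime"


def pip_install_commands(runtime_version: Optional[str]) -> List[List[str]]:
    """Hypothetical mirror of the pip RUN layer described in step 4."""
    commands = [["pip", "install", "-r", "requirements.txt"]]
    if runtime_version:  # unset/empty ARG: trust whatever requirements.txt resolves
        commands.append(
            ["pip", "install", "--upgrade", f"{RUNTIME_PACKAGE}=={runtime_version}"]
        )
    return commands


def toy_layer_cache_key(run_instruction: str, build_args: Dict[str, str]) -> str:
    """Toy cache key: hash the RUN text plus the build args in scope."""
    material = run_instruction + "\n" + "\n".join(
        f"{k}={v}" for k, v in sorted(build_args.items())
    )
    return hashlib.sha256(material.encode()).hexdigest()
```

With the ARG in scope, keys computed for two different RUNTIME_VERSION values (say the hypothetical "1.2.3" and "1.2.4") differ, which is why a cascade-triggered rebuild cannot silently reuse a stale pip layer.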

Pairs with molecule-ci PR #12 + molecule-core PR #2181. Together
the pipeline is now race- and cache-proof end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 07:46:14 -07:00
Hongming Wang
fab7c6a929 feat(template): own claude_sdk_executor locally (universal-runtime refactor)
First half of molecule-core task #87 — move adapter-specific code out
of the universal molecule-runtime package into the template that
actually consumes it.

Adds:
  - claude_sdk_executor.py (757 LOC) — copied verbatim from
    molecule-core/workspace/claude_sdk_executor.py @ commit 186f25c2.
    The adapter at adapter.py:59 already does
    `from claude_sdk_executor import ClaudeSDKExecutor` — once this
    file lands at /app/, Python's import order picks the local copy
    over the same-named module that older molecule-runtime versions
    ship under site-packages.
  - Dockerfile: COPY claude_sdk_executor.py . alongside adapter.py.

Purely additive at this stage: molecule-runtime still ships the
file too, so any image built from this template just has two copies
on disk (local /app shadows the site-packages one). No behavior
change.
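The shadowing claim can be checked with a self-contained demo. It assumes /app comes before site-packages on `sys.path` inside the container (the commit relies on this but does not say how it gets there, e.g. the script directory or PYTHONPATH), so the demo forces that order explicitly with two temporary directories. `demo_shadowing` and the module name are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_shadowing(module_name: str = "claude_sdk_executor_demo") -> str:
    """Create two same-named modules; the earlier sys.path entry wins."""
    with tempfile.TemporaryDirectory() as app_dir, tempfile.TemporaryDirectory() as site_dir:
        Path(app_dir, f"{module_name}.py").write_text("ORIGIN = 'local /app copy'\n")
        Path(site_dir, f"{module_name}.py").write_text("ORIGIN = 'site-packages copy'\n")
        # Mimic the assumed container layout: /app searched before site-packages.
        sys.path[:0] = [app_dir, site_dir]
        try:
            sys.modules.pop(module_name, None)
            importlib.invalidate_caches()
            return importlib.import_module(module_name).ORIGIN
        finally:
            sys.path.remove(app_dir)
            sys.path.remove(site_dir)
            sys.modules.pop(module_name, None)


if __name__ == "__main__":
    print(demo_shadowing())  # local /app copy
```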

Sequencing (the molecule-core PR follows AFTER this image rebuilds):
  1. THIS PR — template gets local copy, image rebuilds with it
     (current PR; safe because no removal yet)
  2. molecule-core PR — drop workspace/claude_sdk_executor.py, bump
     molecule-ai-workspace-runtime PyPI version. Templates that
     haven't pulled the new runtime version still work because their
     local copy is unchanged.
  3. (later) Bump requirements.txt pin in this template once the
     new runtime version is on PyPI, so future builds explicitly
     install the slimmed runtime.

Why local-copy-first:
  - Reverse order (drop from runtime first, then add to template)
    creates a window where any template image build pulling the
    latest runtime would fail to import claude_sdk_executor.
  - This order has zero downtime: every intermediate state is valid.
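The zero-downtime argument reduces to a two-flag model: the import succeeds as long as either the template's local copy or the runtime's site-packages copy exists. This is a simplification (it ignores version skew between the two copies), and the state names are illustrative.

```python
from typing import List, Tuple

# (template_has_local_copy, runtime_ships_module); illustrative two-flag state
State = Tuple[bool, bool]


def import_ok(state: State) -> bool:
    """The claude_sdk_executor import works if either copy is on disk."""
    template_has_local, runtime_ships = state
    return template_has_local or runtime_ships


def every_state_valid(states: List[State]) -> bool:
    return all(import_ok(s) for s in states)


START: State = (False, True)  # before this PR: only the runtime ships the module

LOCAL_COPY_FIRST = [START, (True, True), (True, False)]  # this PR, then runtime drop
DROP_FIRST = [START, (False, False), (True, False)]      # the reverse order

if __name__ == "__main__":
    print(every_state_valid(LOCAL_COPY_FIRST))  # True
    print(every_state_valid(DROP_FIRST))        # False
```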

Validates the capability primitives shipped in molecule-core PRs
#2137-#2144 — once this template image rebuilds and the molecule-
core deletion lands, the claude-code workspace is the FIRST adapter
to live entirely outside molecule-runtime, with native_session +
idle_timeout_override declared via capabilities() (PR #12 here).

Source: molecule-core/workspace/claude_sdk_executor.py @ 186f25c2
(commit hash pinned for traceability of any future divergence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:58:05 -07:00
rabbitblood
d4ab584deb fix: wire up GitHub App token refresh — fixes #1933
Symptoms before this PR:
- After ~60 min of workspace uptime, every git push/clone returns 401
- PMM, DevRel, Social Media Brand and other content agents loop
  indefinitely, sending status reports back to PMs ("I tried, GH_TOKEN dead")
- PM A2A queues overflow with retry-status messages (depth 27 on Marketing
  Lead, 18 on Dev Lead, 11 on Core Platform Lead at peak)

Root cause:
- GH_TOKEN/GITHUB_TOKEN injected at provision time has a ~60 min TTL
  (GitHub App installation tokens cap at one hour)
- Workspace env is frozen at container start — no in-process mechanism
  to refresh after expiry
- The credential-helper architecture exists in the codebase but was
  never wired up at template boot. Specifically, the claude-code template:
  - did not COPY the helper scripts into the image
  - did not configure git credential.helper at boot
  - did not start the background refresh daemon
  - did not run initial gh auth login

Fix:
1. Dockerfile COPYs scripts/molecule-git-token-helper.sh and
   scripts/molecule-gh-token-refresh.sh into /app/scripts/
2. entrypoint.sh (root half) configures git credential helper for
   github.com and creates the per-user token cache directory
3. entrypoint.sh (agent half) starts the refresh daemon under a
   respawn loop and runs initial `gh auth login --with-token`
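For readers unfamiliar with git's credential-helper protocol: on `get`, git writes `key=value` lines (e.g. `protocol=https`, `host=github.com`) to the helper's stdin, terminated by a blank line or EOF, and the helper answers with `username=` and `password=` lines. The real helper is the shell script scripts/molecule-git-token-helper.sh; the Python below is a hypothetical stand-in that only illustrates the protocol, with the token source injected as `get_token`. The `x-access-token` username is what GitHub expects when a GitHub App installation token is used over HTTPS.

```python
import sys
from typing import Callable, Dict, Optional, TextIO


def parse_credential_request(stream: TextIO) -> Dict[str, str]:
    """Read git's key=value attribute lines until a blank line or EOF."""
    attrs: Dict[str, str] = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            break
        key, _, value = line.partition("=")
        attrs[key] = value
    return attrs


def handle(action: str, stdin: TextIO, stdout: TextIO,
           get_token: Callable[[], Optional[str]]) -> None:
    if action != "get":
        return  # 'store' and 'erase' are no-ops for a token-minting helper
    attrs = parse_credential_request(stdin)
    if attrs.get("host") != "github.com":
        return  # print nothing; git moves on to other helpers or prompts
    token = get_token()
    if token:
        stdout.write(f"username=x-access-token\npassword={token}\n")


if __name__ == "__main__":
    import os
    # git invokes the helper with the action as its first argument.
    action = sys.argv[1] if len(sys.argv) > 1 else ""
    handle(action, sys.stdin, sys.stdout, lambda: os.environ.get("GH_TOKEN"))
```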

The helper hits the platform's /admin/github-installation-token
endpoint, falling back to the GH_TOKEN env var when the platform is
unreachable. The refresh daemon calls _refresh_gh every ~45 min ± 2 min
of jitter so gh CLI auth and the helper cache stay warm even when no
git operation triggers a refresh.

Acceptance:
- After this image deploys, `gh api /user` from inside a workspace
  should keep returning 200 even after >60 min uptime
- Marketing Lead / Dev Lead A2A queues should drain to <5 within one
  cycle of the new image rolling out

Follow-up issues to file (not in this PR):
- Replicate this wiring in the other 7 template repos (autogen, crewai,
  deepagents, gemini-cli, hermes, langgraph, openclaw)
- Lift the wiring into the molecule-runtime PyPI package so future
  templates inherit it instead of re-implementing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 17:57:30 -07:00
Hongming Wang
fef8fd5c57
fix: install git + gh CLI for agent autonomy loop (#2)
Install git + gh CLI in workspace image for agent autonomy loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 00:50:33 +00:00
Hongming Wang
7f9b2b4189 feat: add adapter code + Dockerfile for standalone deployment
Adapters extracted from molecule-monorepo/workspace-template.
Uses molecule-ai-workspace-runtime PyPI package for shared infrastructure.

- adapter.py — runtime-specific adapter class
- requirements.txt — runtime-specific deps + molecule-ai-workspace-runtime
- Dockerfile — FROM python:3.11-slim, pip install, COPY adapter, molecule-runtime entrypoint
- ADAPTER_MODULE=adapter tells the runtime to load this repo's Adapter class
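The ADAPTER_MODULE contract can be illustrated with `importlib`. `load_adapter_class` is a hypothetical sketch of how a runtime might honor the variable; it is not molecule-runtime's actual loader, whose details this commit does not show.

```python
import importlib
import os


def load_adapter_class(default_module: str = "adapter") -> type:
    """Import the module named by ADAPTER_MODULE and return its Adapter class."""
    module_name = os.environ.get("ADAPTER_MODULE", default_module)
    module = importlib.import_module(module_name)
    adapter_cls = getattr(module, "Adapter", None)
    if not isinstance(adapter_cls, type):
        raise ImportError(f"module {module_name!r} has no Adapter class")
    return adapter_cls
```

With ADAPTER_MODULE=adapter and /app importable, this would resolve to the Adapter class defined in this repo's adapter.py.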

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 04:27:22 -07:00