molecule-ai-workspace-templ.../entrypoint.sh
Hongming Wang 78ae139609 feat(adapter,entrypoint): boot env audit + crash-loop diagnosis logging
Adds two operator-visible boot diagnostics that close the diagnosis gap
exposed by the 2026-05-02 MiniMax E2E crash-loop. The universal
canvas-picked-model fix (Bug B) and per-model required_env (Bug D) live
in molecule-core PR #2538 — this PR adds the per-template visibility
that complements them so operators can answer "is the key missing or is
routing wrong?" from `docker logs` alone.

Changes
-------
adapter.py:
- _AUTH_ENV_AUDIT tuple of 8 vendor env names (CLAUDE_CODE_OAUTH_TOKEN,
  ANTHROPIC_API_KEY/AUTH_TOKEN/BASE_URL, MINIMAX/GLM/KIMI/DEEPSEEK_API_KEY).
- _audit_auth_env_presence() helper — single INFO line of NAME=set/unset
  pairs. NEVER logs values; the test pins this with a "fake-secret-MUST-
  NOT-LEAK" sentinel that must never appear in the log message.
- One call site at the end of setup()'s boot banner so every workspace
  start emits both "which provider got picked" and "which envs are present"
  in adjacent log lines.

entrypoint.sh:
- log_boot_context() function fired once before the gosu drop (as root)
  and once after (as agent) so an operator can spot env values lost
  across the privilege drop. Emits uid/gid/user/hostname/workspace_id/
  platform_url/configs_dir/workspace_dir + the same 8 env names as
  NAME=set/unset. Mirror of _AUTH_ENV_AUDIT — list pinned in sync by a
  new AST-style test (test_audit_env_list_matches_entrypoint_sh) that
  parses entrypoint.sh and asserts set-equality with adapter.py's tuple.

tests/test_adapter_logging.py (new):
- 4 tests covering the audit contract: every name appears, all-unset
  scenario, empty-string treated as unset (matches routing semantics),
  and the cross-file sync gate against entrypoint.sh's for-loop.
- Stubs molecule_runtime + a2a so the helpers can be imported without
  the real wheel installed in CI (mirrors test_adapter_prevalidate.py's
  scaffolding pattern).

Why this complements molecule-core PR #2538
-------------------------------------------
- PR #2538 makes Bug B (canvas-picked model silently dropped) impossible
  by resolving model centrally in workspace/config.py:load_config —
  every adapter (claude-code, hermes, codex, future ones) gets the
  passthrough for free.
- PR #2538 makes Bug D (preflight rejects valid auth for non-default
  models) impossible by REPLACE-not-union per-entry required_env.
- This template PR is the per-template observability layer: when one
  of those universal fixes regresses (or when an operator misconfigs a
  vendor key), the boot logs say exactly which env was present at each
  tier. Validated end-to-end on workspace
  be27badd-00a7-4cef-91e8-af428175c76f (clean boot, MINIMAX_API_KEY=set
  audited, no crash-loop).

Closes part of molecule-monorepo task #248. Sibling of #2538 for
molecule-core.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:41:05 -07:00

134 lines
6.7 KiB
Bash

#!/bin/sh
# Drop privileges to the agent user before exec'ing molecule-runtime.
# claude-code refuses --dangerously-skip-permissions when running as
# root/sudo for safety. Without this entrypoint, every cron tick fails
# with `ProcessError: Command failed with exit code 1` and the agent
# logs `--dangerously-skip-permissions cannot be used with root/sudo
# privileges for security reasons`.
#
# Pattern matches the legacy monorepo workspace-template/entrypoint.sh:
# fix volume ownership as root, then re-exec via gosu as agent (uid 1000).
# Boot-context snapshot — emitted on EVERY container start, including
# every restart of a crash-loop. Lets `docker logs` answer "what env
# was actually present?" without having to docker exec into a dying
# container. Logs NAMES of auth-relevant env vars, never VALUES. Fires
# twice (once as root pre-gosu, once as agent post-gosu) so an operator
# can see whether a value was lost across the privilege drop.
# Keep the env-name list in sync with adapter.py's _AUTH_ENV_AUDIT —
# the same set of vendors should be audited from both sides.
log_boot_context() {
echo "----- entrypoint boot $(date -u +%Y-%m-%dT%H:%M:%SZ) -----"
echo "uid=$(id -u) gid=$(id -g) user=$(id -un 2>/dev/null || echo unknown)"
echo "hostname=$(hostname) workspace_id=${WORKSPACE_ID:-<unset>}"
echo "platform_url=${PLATFORM_URL:-<unset>}"
echo "configs_dir: $(ls -ld /configs 2>/dev/null || echo MISSING)"
echo "configs_contents: $(ls /configs 2>/dev/null | tr '\n' ' ' || echo MISSING)"
echo "workspace_dir: $(ls -ld /workspace 2>/dev/null || echo MISSING)"
# Auth env presence (NAMES + set/unset only — never the values).
# Mirror of _AUTH_ENV_AUDIT in adapter.py — keep in sync if you add a vendor.
for var in CLAUDE_CODE_OAUTH_TOKEN ANTHROPIC_API_KEY ANTHROPIC_AUTH_TOKEN ANTHROPIC_BASE_URL MINIMAX_API_KEY GLM_API_KEY KIMI_API_KEY DEEPSEEK_API_KEY; do
eval "val=\$$var"
if [ -n "$val" ]; then
echo "env $var=set"
else
echo "env $var=unset"
fi
done
echo "------------------------------------------------"
}
log_boot_context
if [ "$(id -u)" = "0" ]; then
# Configs volume is created by Docker as root; agent needs write access
# for plugin installs, memory writes, .auth_token rotation, etc.
chown -R agent:agent /configs 2>/dev/null
# /workspace handling — only chown when the contents are root-owned
# (typical on Docker Desktop on Windows where host uid maps to 0).
# On Linux Docker with matching uids the recursive chown is skipped
# to keep startup fast.
chown agent:agent /workspace 2>/dev/null || true
if [ -d /workspace ]; then
first_entry=$(find /workspace -mindepth 1 -maxdepth 1 -print -quit 2>/dev/null)
if [ -n "$first_entry" ] && [ "$(stat -c '%u' "$first_entry" 2>/dev/null)" = "0" ]; then
chown -R agent:agent /workspace 2>/dev/null
fi
# Pre-create /workspace/.molecule/chat-uploads so the upload
# handler in workspace/internal_chat_uploads.py never has to
# mkdir as agent inside a root-owned tree. Without this the
# first upload after a fresh provision fails with "failed to
# prepare uploads dir" because the volume mount comes up with
# root-owned `.molecule` whenever a sibling subsystem (e.g. an
# adapter writing telemetry, or a workspace runtime that ran
# before the chown landed) raced ahead. Idempotent: a re-run
# finds the dir already there, mode 0755 / agent:agent.
mkdir -p /workspace/.molecule/chat-uploads 2>/dev/null || true
chown -R agent:agent /workspace/.molecule 2>/dev/null || true
fi
# Claude Code session directory — mounted at /root/.claude/sessions by
# the platform provisioner. Symlink it into agent's home so the SDK
# finds it when running as agent. The provisioner's mount point is
# hardcoded to /root/.claude/sessions; we don't want to change the
# platform contract just for this template.
mkdir -p /home/agent/.claude
if [ -d /root/.claude/sessions ]; then
chown -R agent:agent /root/.claude /home/agent/.claude 2>/dev/null
ln -sfn /root/.claude/sessions /home/agent/.claude/sessions
fi
# GitHub credential helper setup (fix #1933 / #1866 / #547).
# Runs as root so the global gitconfig is written before we drop to agent.
# The helper fetches fresh GitHub App installation tokens from the
# platform API on every git push/clone, with caching + env-var fallback.
if [ -x /app/scripts/molecule-git-token-helper.sh ]; then
git config --global "credential.https://github.com.helper" \
"!/app/scripts/molecule-git-token-helper.sh"
git config --global "credential.https://github.com.useHttpPath" true
if [ -f /root/.gitconfig ]; then
cp /root/.gitconfig /home/agent/.gitconfig
chown agent:agent /home/agent/.gitconfig
fi
fi
mkdir -p /home/agent/.molecule-token-cache
chown agent:agent /home/agent/.molecule-token-cache
chmod 700 /home/agent/.molecule-token-cache
exec gosu agent "$0" "$@"
fi
# Now running as agent (uid 1000)
# Background token refresh daemon — keeps `gh` CLI auth + credential helper
# cache warm across the ~60 min GitHub App installation token TTL. Wrapped
# in a respawn loop so a daemon crash doesn't silently leave the workspace
# stuck on an expired token (which is exactly how #1933 was discovered).
if [ -x /app/scripts/molecule-gh-token-refresh.sh ]; then
nohup bash -c '
while true; do
/app/scripts/molecule-gh-token-refresh.sh
rc=$?
echo "[molecule-gh-token-refresh] daemon exited rc=$rc — respawning in 30s" >&2
sleep 30
done
' > /home/agent/.gh-token-refresh.log 2>&1 &
fi
# Initial gh auth — primes the CLI with whatever GH_TOKEN/GITHUB_TOKEN was
# injected at provision time, so commands work in the ~60s window before the
# background daemon's first refresh fires.
if [ -n "${GITHUB_TOKEN:-}" ]; then
echo "${GITHUB_TOKEN}" | gh auth login --hostname github.com --with-token 2>/dev/null || true
elif [ -n "${GH_TOKEN:-}" ]; then
echo "${GH_TOKEN}" | gh auth login --hostname github.com --with-token 2>/dev/null || true
fi
# Third-party provider routing is now handled by adapter.py at boot —
# it reads the `providers:` registry from /configs/config.yaml and sets
# ANTHROPIC_BASE_URL based on the picked MODEL. Adding a new provider
# is a one-line YAML edit (see config.yaml's `providers:` section).
# Operator-set ANTHROPIC_BASE_URL still wins as the escape hatch for
# regional endpoints (e.g. Xiaomi's token-plan-sgp.*, MiniMax's
# api.minimaxi.com China endpoint).
exec molecule-runtime "$@"