fix(claude-sdk): #160 — probe CLI directly when SDK swallowed the real stderr

Context: when the claude-agent-sdk wraps a stream error from the CLI
subprocess that it can't categorize (rate limit, auth, network), it
raises a bare `Exception("Command failed with exit code 1\nError output:
Check stderr output for details")`. The exception has no `.stderr` or
`.exit_code` attributes, so #66's `_format_process_error` — which reads
those attributes — has nothing to surface. The log line becomes:

    SDK agent error [claude-code]: Exception: Command failed with exit
    code 1 (exit code: 1)\nError output: Check stderr output for details

That's the placeholder text from the SDK's error path, not the actual
error. Operators chasing a stuck workspace are forced to `docker exec
ws-xxx claude --print` manually to discover the real cause. Observed
today during the rate-limit incident: every PM error line was identical
"Check stderr output for details" while the real cause ("You've hit
your limit · resets Apr 17, 11pm (UTC)") was only visible via manual
reproduction — that cost ~20 minutes of diagnosis time.

## Fix

Add `_probe_claude_cli_error()`: a best-effort subprocess call that runs
`claude --print` with a small probe input, captures stderr+stdout, and
returns the real error string. Bounded by 30s timeout so a hung CLI
can't stall the error path.

Extend `_format_process_error` with ONE narrow fallback: if the
exception has no stderr/exit_code AND its message contains the specific
"Check stderr output for details" marker, call the probe and append
`probed_cli_error=<real error>` to the formatted line.

Critically: the probe only runs in the narrow case where we have
nothing else to log. If `.stderr` or `.exit_code` are present (the
normal ProcessError path from #66), the probe is skipped — no wasted
subprocess, no 30s latency on every error.

## Test coverage

`workspace-template/tests/test_claude_sdk_executor.py` adds 3 new tests:
- `test_format_process_error_probes_cli_when_stderr_swallowed` — the
  happy path: exception matches the marker, probe runs, result appears
  in the formatted line. Probe is monkeypatched so no subprocess spawns
  in the test.
- `test_format_process_error_does_not_probe_when_stderr_already_present` —
  negative: regular ProcessError with `.stderr` set does NOT trigger
  the probe (skip the wasted call).
- `test_format_process_error_does_not_probe_without_swallowed_marker` —
  negative: unrelated plain exceptions (e.g. RuntimeError) do NOT
  trigger the probe (so the common-case error path stays fast).

All 7 `_format_process_error` tests pass locally (4 existing + 3 new):
\`\`\`
pytest tests/test_claude_sdk_executor.py -k format_process_error
======================= 7 passed in 0.06s ========================
\`\`\`

## Impact

Next time the SDK swallows a real error (rate limit, auth failure,
network outage), the workspace log will contain the actual error string
alongside the generic placeholder:

    SDK agent error [claude-code]: Exception: Command failed with exit
    code 1 ... | probed_cli_error="You've hit your limit · resets Apr
    17, 11pm (UTC)"

Diagnosis time drops from "docker exec each ws, run claude --print,
read stderr" (~20 min) to "grep probed_cli_error in platform logs"
(~10 seconds).

Closes #160.
This commit is contained in:
rabbitblood 2026-04-15 11:50:55 -07:00
parent 8aad65287a
commit 0f2ed6bf0a
2 changed files with 141 additions and 0 deletions

View File

@ -82,6 +82,50 @@ _RETRYABLE_PATTERNS = (
)
_SWALLOWED_STDERR_MARKER = "Check stderr output for details"
def _probe_claude_cli_error() -> str | None:
"""Run ``claude --print`` directly and capture its stderr + stdout.
Used as a fallback when the claude-agent-sdk raises a bare ``Exception``
with the swallowed "Check stderr output for details" placeholder that
happens when the SDK wraps a stream error from the CLI subprocess and
loses both the ``.stderr`` attribute and the exit code. At that point
the only way to see the real failure reason (rate limit, auth error,
network outage, missing token) is to run the CLI ourselves.
Bounded by a 30s timeout so a hung CLI can't stall the error path.
Returns None if the probe itself failed (wrong invariant don't
corrupt the main error message with probe noise).
"""
try:
import subprocess
# --print reads stdin, prints response, exits. Empty stdin gives the
# CLI something to work with without triggering an actual model call
# when it's going to fail anyway.
proc = subprocess.run(
["claude", "--print"],
input="probe",
capture_output=True,
text=True,
timeout=30,
)
if proc.returncode == 0:
# CLI succeeded — the original error was a transient state that
# resolved between the SDK failure and our probe. Signal that.
return "<cli probe succeeded — error was transient>"
raw = (proc.stderr or "") + (proc.stdout or "")
raw = raw.strip()
if not raw:
return f"<cli exited {proc.returncode} with empty output>"
if len(raw) > _PROCESS_ERROR_STDERR_MAX_CHARS:
raw = raw[:_PROCESS_ERROR_STDERR_MAX_CHARS] + "... [truncated]"
return raw
except Exception as probe_exc: # pragma: no cover — best-effort diagnostic
return f"<probe failed: {type(probe_exc).__name__}: {probe_exc}>"
def _format_process_error(exc: BaseException) -> str:
"""Render a Claude-SDK ProcessError (or any ClaudeSDKError) with its full
captured context exit code, stderr, exception type. Plain strings for
@ -92,6 +136,15 @@ def _format_process_error(exc: BaseException) -> str:
ProcessError carries `.stderr`/`.exit_code` attributes that the previous
code silently discarded, leaving every CLI crash with an identical
"Check stderr output for details" message in the workspace log).
Fixes #160: when the SDK raises a bare ``Exception`` containing the
"Check stderr output for details" placeholder (which happens when the
CLI subprocess emits a stream error the SDK can't categorize — rate
limit, auth, network), there's no ``.stderr``/``.exit_code`` to read.
In that case we fall back to running the CLI ourselves via
``_probe_claude_cli_error`` so the operator sees the real failure
reason (e.g. ``You've hit your limit · resets Apr 17``) instead of
chasing ghosts in the workspace logs.
"""
parts = [f"{type(exc).__name__}: {exc}"]
exit_code = getattr(exc, "exit_code", None)
@ -103,6 +156,13 @@ def _format_process_error(exc: BaseException) -> str:
if len(stderr) > _PROCESS_ERROR_STDERR_MAX_CHARS:
trimmed += f"... [{len(stderr) - _PROCESS_ERROR_STDERR_MAX_CHARS} more chars truncated]"
parts.append(f"stderr={trimmed!r}")
elif exit_code is None and _SWALLOWED_STDERR_MARKER in str(exc):
# #160: generic exception with the swallowed-stderr placeholder.
# Probe the CLI directly — this is the only way to surface the real
# error when the SDK lost it in translation.
probed = _probe_claude_cli_error()
if probed:
parts.append(f"probed_cli_error={probed!r}")
return " | ".join(parts)

View File

@ -837,6 +837,87 @@ def test_format_process_error_no_stderr_but_has_exit_code():
assert "stderr" not in out
# ---------------------------------------------------------------------------
# _format_process_error — #160: probe CLI when SDK swallowed the real stderr
# ---------------------------------------------------------------------------
def test_format_process_error_probes_cli_when_stderr_swallowed(monkeypatch):
"""Generic Exception with "Check stderr output for details" → probe the
claude CLI directly so the real error (rate limit, auth, etc.) surfaces
in the log instead of the useless placeholder text. Fixes #160."""
from claude_sdk_executor import _format_process_error
def fake_probe():
return "You've hit your limit · resets Apr 17, 11pm (UTC)"
# Patch the probe helper so we don't actually spawn a subprocess in tests
monkeypatch.setattr(
"claude_sdk_executor._probe_claude_cli_error",
fake_probe,
)
swallowed = Exception(
"Command failed with exit code 1 (exit code: 1)\n"
"Error output: Check stderr output for details"
)
out = _format_process_error(swallowed)
assert "probed_cli_error" in out
assert "hit your limit" in out
def test_format_process_error_does_not_probe_when_stderr_already_present():
"""If the exception already carries .stderr, we DO NOT run the probe —
that'd be wasted subprocess work and could mask the real stderr with
probe-time state."""
from claude_sdk_executor import _format_process_error
# Track whether the probe was called
called = {"value": False}
def fake_probe():
called["value"] = True
return "probe ran"
import claude_sdk_executor
original = claude_sdk_executor._probe_claude_cli_error
claude_sdk_executor._probe_claude_cli_error = fake_probe
try:
class FakeProcessError(Exception):
def __init__(self):
super().__init__("real error")
self.stderr = "real stderr content"
self.exit_code = 1
out = _format_process_error(FakeProcessError())
assert "real stderr content" in out
assert called["value"] is False, "probe should not have been called"
finally:
claude_sdk_executor._probe_claude_cli_error = original
def test_format_process_error_does_not_probe_without_swallowed_marker(monkeypatch):
"""Only probe when the exception message matches the specific swallowed-
stderr marker. Unrelated plain exceptions (e.g. 'generic failure') should
not trigger a CLI probe that'd make every error path 30s slower."""
from claude_sdk_executor import _format_process_error
called = {"value": False}
def fake_probe():
called["value"] = True
return "probe ran"
monkeypatch.setattr(
"claude_sdk_executor._probe_claude_cli_error",
fake_probe,
)
out = _format_process_error(RuntimeError("generic failure"))
assert called["value"] is False
assert "probed_cli_error" not in out
def test_process_error_reaches_logs_via_execute(caplog):
"""End-to-end: a ProcessError in query() → executor logs both the
formatted summary and the full traceback. Fixes #66 — previously no