fix(claude-sdk): #160 — probe CLI directly when SDK swallowed the real stderr
Context: when the claude-agent-sdk wraps a stream error from the CLI
subprocess that it can't categorize (rate limit, auth, network), it
raises a bare `Exception("Command failed with exit code 1\nError output:
Check stderr output for details")`. The exception has no `.stderr` or
`.exit_code` attributes, so #66's `_format_process_error` — which reads
those attributes — has nothing to surface. The log line becomes:
SDK agent error [claude-code]: Exception: Command failed with exit
code 1 (exit code: 1)\nError output: Check stderr output for details
That's the placeholder text from the SDK's error path, not the actual
error. Operators chasing a stuck workspace are forced to `docker exec
ws-xxx claude --print` manually to discover the real cause. Observed
today during the rate-limit incident: every PM error line was identical
"Check stderr output for details" while the real cause ("You've hit
your limit · resets Apr 17, 11pm (UTC)") was only visible via manual
reproduction — that cost ~20 minutes of diagnosis time.
## Fix
Add `_probe_claude_cli_error()`: a best-effort subprocess call that runs
`claude --print` with a small probe input, captures stderr+stdout, and
returns the real error string. Bounded by 30s timeout so a hung CLI
can't stall the error path.
Extend `_format_process_error` with ONE narrow fallback: if the
exception has no stderr/exit_code AND its message contains the specific
"Check stderr output for details" marker, call the probe and append
`probed_cli_error=<real error>` to the formatted line.
Critically: the probe only runs in the narrow case where we have
nothing else to log. If `.stderr` or `.exit_code` are present (the
normal ProcessError path from #66), the probe is skipped — no wasted
subprocess, no 30s latency on every error.
## Test coverage
`workspace-template/tests/test_claude_sdk_executor.py` adds 3 new tests:
- `test_format_process_error_probes_cli_when_stderr_swallowed` — the
happy path: exception matches the marker, probe runs, result appears
in the formatted line. Probe is monkeypatched so no subprocess spawns
in the test.
- `test_format_process_error_does_not_probe_when_stderr_already_present` —
negative: regular ProcessError with `.stderr` set does NOT trigger
the probe (skip the wasted call).
- `test_format_process_error_does_not_probe_without_swallowed_marker` —
negative: unrelated plain exceptions (e.g. RuntimeError) do NOT
trigger the probe (so the common-case error path stays fast).
All 7 `_format_process_error` tests pass locally (4 existing + 3 new):
\`\`\`
pytest tests/test_claude_sdk_executor.py -k format_process_error
======================= 7 passed in 0.06s ========================
\`\`\`
## Impact
Next time the SDK swallows a real error (rate limit, auth failure,
network outage), the workspace log will contain the actual error string
alongside the generic placeholder:
SDK agent error [claude-code]: Exception: Command failed with exit
code 1 ... | probed_cli_error="You've hit your limit · resets Apr
17, 11pm (UTC)"
Diagnosis time drops from "docker exec each ws, run claude --print,
read stderr" (~20 min) to "grep probed_cli_error in platform logs"
(~10 seconds).
Closes #160.
This commit is contained in:
parent
8aad65287a
commit
0f2ed6bf0a
@ -82,6 +82,50 @@ _RETRYABLE_PATTERNS = (
|
||||
)
|
||||
|
||||
|
||||
_SWALLOWED_STDERR_MARKER = "Check stderr output for details"
|
||||
|
||||
|
||||
def _probe_claude_cli_error() -> str | None:
|
||||
"""Run ``claude --print`` directly and capture its stderr + stdout.
|
||||
|
||||
Used as a fallback when the claude-agent-sdk raises a bare ``Exception``
|
||||
with the swallowed "Check stderr output for details" placeholder — that
|
||||
happens when the SDK wraps a stream error from the CLI subprocess and
|
||||
loses both the ``.stderr`` attribute and the exit code. At that point
|
||||
the only way to see the real failure reason (rate limit, auth error,
|
||||
network outage, missing token) is to run the CLI ourselves.
|
||||
|
||||
Bounded by a 30s timeout so a hung CLI can't stall the error path.
|
||||
Returns None if the probe itself failed (wrong invariant — don't
|
||||
corrupt the main error message with probe noise).
|
||||
"""
|
||||
try:
|
||||
import subprocess
|
||||
# --print reads stdin, prints response, exits. Empty stdin gives the
|
||||
# CLI something to work with without triggering an actual model call
|
||||
# when it's going to fail anyway.
|
||||
proc = subprocess.run(
|
||||
["claude", "--print"],
|
||||
input="probe",
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=30,
|
||||
)
|
||||
if proc.returncode == 0:
|
||||
# CLI succeeded — the original error was a transient state that
|
||||
# resolved between the SDK failure and our probe. Signal that.
|
||||
return "<cli probe succeeded — error was transient>"
|
||||
raw = (proc.stderr or "") + (proc.stdout or "")
|
||||
raw = raw.strip()
|
||||
if not raw:
|
||||
return f"<cli exited {proc.returncode} with empty output>"
|
||||
if len(raw) > _PROCESS_ERROR_STDERR_MAX_CHARS:
|
||||
raw = raw[:_PROCESS_ERROR_STDERR_MAX_CHARS] + "... [truncated]"
|
||||
return raw
|
||||
except Exception as probe_exc: # pragma: no cover — best-effort diagnostic
|
||||
return f"<probe failed: {type(probe_exc).__name__}: {probe_exc}>"
|
||||
|
||||
|
||||
def _format_process_error(exc: BaseException) -> str:
|
||||
"""Render a Claude-SDK ProcessError (or any ClaudeSDKError) with its full
|
||||
captured context — exit code, stderr, exception type. Plain strings for
|
||||
@ -92,6 +136,15 @@ def _format_process_error(exc: BaseException) -> str:
|
||||
ProcessError carries `.stderr`/`.exit_code` attributes that the previous
|
||||
code silently discarded, leaving every CLI crash with an identical
|
||||
"Check stderr output for details" message in the workspace log).
|
||||
|
||||
Fixes #160: when the SDK raises a bare ``Exception`` containing the
|
||||
"Check stderr output for details" placeholder (which happens when the
|
||||
CLI subprocess emits a stream error the SDK can't categorize — rate
|
||||
limit, auth, network), there's no ``.stderr``/``.exit_code`` to read.
|
||||
In that case we fall back to running the CLI ourselves via
|
||||
``_probe_claude_cli_error`` so the operator sees the real failure
|
||||
reason (e.g. ``You've hit your limit · resets Apr 17``) instead of
|
||||
chasing ghosts in the workspace logs.
|
||||
"""
|
||||
parts = [f"{type(exc).__name__}: {exc}"]
|
||||
exit_code = getattr(exc, "exit_code", None)
|
||||
@ -103,6 +156,13 @@ def _format_process_error(exc: BaseException) -> str:
|
||||
if len(stderr) > _PROCESS_ERROR_STDERR_MAX_CHARS:
|
||||
trimmed += f"... [{len(stderr) - _PROCESS_ERROR_STDERR_MAX_CHARS} more chars truncated]"
|
||||
parts.append(f"stderr={trimmed!r}")
|
||||
elif exit_code is None and _SWALLOWED_STDERR_MARKER in str(exc):
|
||||
# #160: generic exception with the swallowed-stderr placeholder.
|
||||
# Probe the CLI directly — this is the only way to surface the real
|
||||
# error when the SDK lost it in translation.
|
||||
probed = _probe_claude_cli_error()
|
||||
if probed:
|
||||
parts.append(f"probed_cli_error={probed!r}")
|
||||
return " | ".join(parts)
|
||||
|
||||
|
||||
|
||||
@ -837,6 +837,87 @@ def test_format_process_error_no_stderr_but_has_exit_code():
|
||||
assert "stderr" not in out
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# _format_process_error — #160: probe CLI when SDK swallowed the real stderr
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_format_process_error_probes_cli_when_stderr_swallowed(monkeypatch):
|
||||
"""Generic Exception with "Check stderr output for details" → probe the
|
||||
claude CLI directly so the real error (rate limit, auth, etc.) surfaces
|
||||
in the log instead of the useless placeholder text. Fixes #160."""
|
||||
from claude_sdk_executor import _format_process_error
|
||||
|
||||
def fake_probe():
|
||||
return "You've hit your limit · resets Apr 17, 11pm (UTC)"
|
||||
|
||||
# Patch the probe helper so we don't actually spawn a subprocess in tests
|
||||
monkeypatch.setattr(
|
||||
"claude_sdk_executor._probe_claude_cli_error",
|
||||
fake_probe,
|
||||
)
|
||||
|
||||
swallowed = Exception(
|
||||
"Command failed with exit code 1 (exit code: 1)\n"
|
||||
"Error output: Check stderr output for details"
|
||||
)
|
||||
out = _format_process_error(swallowed)
|
||||
assert "probed_cli_error" in out
|
||||
assert "hit your limit" in out
|
||||
|
||||
|
||||
def test_format_process_error_does_not_probe_when_stderr_already_present():
|
||||
"""If the exception already carries .stderr, we DO NOT run the probe —
|
||||
that'd be wasted subprocess work and could mask the real stderr with
|
||||
probe-time state."""
|
||||
from claude_sdk_executor import _format_process_error
|
||||
|
||||
# Track whether the probe was called
|
||||
called = {"value": False}
|
||||
|
||||
def fake_probe():
|
||||
called["value"] = True
|
||||
return "probe ran"
|
||||
|
||||
import claude_sdk_executor
|
||||
original = claude_sdk_executor._probe_claude_cli_error
|
||||
claude_sdk_executor._probe_claude_cli_error = fake_probe
|
||||
try:
|
||||
class FakeProcessError(Exception):
|
||||
def __init__(self):
|
||||
super().__init__("real error")
|
||||
self.stderr = "real stderr content"
|
||||
self.exit_code = 1
|
||||
|
||||
out = _format_process_error(FakeProcessError())
|
||||
assert "real stderr content" in out
|
||||
assert called["value"] is False, "probe should not have been called"
|
||||
finally:
|
||||
claude_sdk_executor._probe_claude_cli_error = original
|
||||
|
||||
|
||||
def test_format_process_error_does_not_probe_without_swallowed_marker(monkeypatch):
|
||||
"""Only probe when the exception message matches the specific swallowed-
|
||||
stderr marker. Unrelated plain exceptions (e.g. 'generic failure') should
|
||||
not trigger a CLI probe — that'd make every error path 30s slower."""
|
||||
from claude_sdk_executor import _format_process_error
|
||||
|
||||
called = {"value": False}
|
||||
|
||||
def fake_probe():
|
||||
called["value"] = True
|
||||
return "probe ran"
|
||||
|
||||
monkeypatch.setattr(
|
||||
"claude_sdk_executor._probe_claude_cli_error",
|
||||
fake_probe,
|
||||
)
|
||||
|
||||
out = _format_process_error(RuntimeError("generic failure"))
|
||||
assert called["value"] is False
|
||||
assert "probed_cli_error" not in out
|
||||
|
||||
|
||||
def test_process_error_reaches_logs_via_execute(caplog):
|
||||
"""End-to-end: a ProcessError in query() → executor logs both the
|
||||
formatted summary and the full traceback. Fixes #66 — previously no
|
||||
|
||||
Loading…
Reference in New Issue
Block a user