forked from molecule-ai/molecule-core
Three different intermittent failures observed during a single
manual-test session — RemoteProtocolError, ReadTimeout, ConnectError —
each surfaced as a "Failed to deliver to <peer>" error chip in the
canvas Agent Comms panel even though the next attempt would have
succeeded (verified by direct probes from the same source workspace
to the same peer). The error message even told the user "Usually a
transient network blip — retry once," but it left the retry to a
human reading the error message.
Auto-retry inside send_a2a_message itself: up to 5 attempts (1
initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s,
16s-capped), each backoff jittered ±25% to break sync across
siblings. Cumulative wall-clock capped at 600s by
_DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't
make the caller wait 25 minutes — once the deadline elapses, retries
stop even if attempts remain.
Retry only on transport-layer transients:
- ConnectError / ConnectTimeout (peer's listening socket not ready)
- RemoteProtocolError (peer closed TCP without writing — observed
when a peer's prior in-flight Claude SDK session aborted)
- ReadError / WriteError (network blip on Docker bridge)
- ReadTimeout (peer wrote no response in 300s)
Application-level errors are NOT retried — they're deterministic and
retrying just wastes wall-clock:
- HTTP 4xx (peer rejected the request format)
- JSON parse failures (peer returned garbage)
- JSON-RPC error in response body (peer's runtime errored cleanly)
- Programmer-bug exceptions (ValueError, etc.)
8 new tests pin the contract:
- retry succeeds after 2 RemoteProtocolErrors
- retry succeeds after 1 ConnectError
- all 5 attempts fail → returns formatted last-error
- capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for
"did someone bump the constant accidentally?")
- JSON-RPC error response NOT retried (1 attempt only)
- non-httpx exception NOT retried (programmer bugs stay loud)
- total budget caps the loop even if attempts remain
- backoff schedule grows exponentially with ±25% jitter
Refactor: extracted _format_a2a_error() so the success and exhausted
paths share one error-formatting routine. _delegate_backoff_seconds()
is a pure function so the schedule is unit-testable without monkey-
patching asyncio.sleep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| adapters | ||
| builtin_tools | ||
| lib | ||
| molecule_audit | ||
| plugins_registry | ||
| policies | ||
| scripts | ||
| skill_loader | ||
| tests | ||
| .coveragerc | ||
| a2a_cli.py | ||
| a2a_client.py | ||
| a2a_executor.py | ||
| a2a_mcp_server.py | ||
| a2a_tools.py | ||
| adapter_base.py | ||
| agent.py | ||
| agents_md.py | ||
| build-all.sh | ||
| config.py | ||
| consolidation.py | ||
| coordinator.py | ||
| Dockerfile | ||
| entrypoint.sh | ||
| events.py | ||
| executor_helpers.py | ||
| heartbeat.py | ||
| initial_prompt.py | ||
| main.py | ||
| molecule_ai_status.py | ||
| platform_auth.py | ||
| plugins.py | ||
| preflight.py | ||
| prompt.py | ||
| pytest.ini | ||
| rebuild-runtime-images.sh | ||
| requirements.txt | ||
| runtime_wedge.py | ||
| shared_runtime.py | ||
| transcript_auth.py | ||
| watcher.py | ||