molecule-core/workspace
Hongming Wang e87a9c3858 fix(a2a): auto-retry transient transport errors in send_a2a_message
Three different intermittent failures observed during a single
manual-test session — RemoteProtocolError, ReadTimeout, ConnectError —
each surfaced as a "Failed to deliver to <peer>" error chip in the
canvas Agent Comms panel even though the next attempt would have
succeeded (verified by direct probes from the same source workspace
to the same peer). The error message even told the user "Usually a
transient network blip — retry once," but it left the retry to a
human reading the error message.

Auto-retry inside send_a2a_message itself: up to 5 attempts (1
initial + 4 retries) with exponential backoff (1s, 2s, 4s, 8s,
16s-capped), each backoff jittered ±25% to break sync across
siblings. Cumulative wall-clock capped at 600s by
_DELEGATE_TOTAL_BUDGET_S so a string of 5×300s ReadTimeouts can't
make the caller wait 25 minutes — once the deadline elapses, retries
stop even if attempts remain.

Retry only on transport-layer transients:
  - ConnectError / ConnectTimeout (peer's listening socket not ready)
  - RemoteProtocolError (peer closed TCP without writing — observed
    when a peer's prior in-flight Claude SDK session aborted)
  - ReadError / WriteError (network blip on Docker bridge)
  - ReadTimeout (peer wrote no response in 300s)

Application-level errors are NOT retried — they're deterministic and
retrying just wastes wall-clock:
  - HTTP 4xx (peer rejected the request format)
  - JSON parse failures (peer returned garbage)
  - JSON-RPC error in response body (peer's runtime errored cleanly)
  - Programmer-bug exceptions (ValueError, etc.)

8 new tests pin the contract:
  - retry succeeds after 2 RemoteProtocolErrors
  - retry succeeds after 1 ConnectError
  - all 5 attempts fail → returns formatted last-error
  - capped at exactly _DELEGATE_MAX_ATTEMPTS (regression cover for
    "did someone bump the constant accidentally?")
  - JSON-RPC error response NOT retried (1 attempt only)
  - non-httpx exception NOT retried (programmer bugs stay loud)
  - total budget caps the loop even if attempts remain
  - backoff schedule grows exponentially with ±25% jitter

Refactor: extracted _format_a2a_error() so the success and exhausted
paths share one error-formatting routine. _delegate_backoff_seconds()
is a pure function so the schedule is unit-testable without monkey-
patching asyncio.sleep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 13:52:01 -07:00
..
adapters fix: comprehensive a2a-sdk 1.x migration sweep across workspace/ 2026-04-27 09:42:57 -07:00
builtin_tools fix(runtime): use lowercase wire role for v0.3 JSON-RPC compat layer 2026-04-27 12:40:11 -07:00
lib feat(workspace): pre-stop serialization for pause/resume (closes #1386) 2026-04-21 12:40:44 +00:00
molecule_audit chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
plugins_registry feat(plugin): implement MCPServerAdaptor (issue #847) 2026-04-24 01:42:13 +00:00
policies chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
scripts fix(git-token-helper): close TOCTOU window + stop swallowing chmod errors (closes #1552) 2026-04-26 08:22:29 -07:00
skill_loader feat(skills): per-skill runtime compatibility (#119, hermes pattern) 2026-04-27 01:57:43 -07:00
tests fix(a2a): auto-retry transient transport errors in send_a2a_message 2026-04-27 13:52:01 -07:00
.coveragerc test(workspace): centralize pytest-cov config + 92% floor (closes #1817) 2026-04-26 06:21:22 -07:00
a2a_cli.py fix(runtime): use lowercase wire role for v0.3 JSON-RPC compat layer 2026-04-27 12:40:11 -07:00
a2a_client.py fix(a2a): auto-retry transient transport errors in send_a2a_message 2026-04-27 13:52:01 -07:00
a2a_executor.py fix: comprehensive a2a-sdk 1.x migration sweep across workspace/ 2026-04-27 09:42:57 -07:00
a2a_mcp_server.py feat(tools): tighten send_message_to_user description to forbid pasting URLs in body 2026-04-27 01:13:11 -07:00
a2a_tools.py feat(tools): borrow hermes-style discipline — error/summary caps + sharper MCP descriptions 2026-04-26 23:25:54 -07:00
adapter_base.py feat(skills): per-skill runtime compatibility (#119, hermes pattern) 2026-04-27 01:57:43 -07:00
agent.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
agents_md.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
build-all.sh fix: update workspace script comments for workspace-template → workspace rename 2026-04-18 01:48:05 -07:00
config.py fix(compliance): flip default mode to owasp_agentic (detect-only) 2026-04-24 11:52:09 -07:00
consolidation.py fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347) 2026-04-21 08:11:44 +00:00
coordinator.py fix: apply #1124 env-var defaults + scrub F1088 credentials from INCIDENT_LOG.md (#1347) 2026-04-21 08:11:44 +00:00
Dockerfile feat(workspace): 45-min gh-token refresh daemon + credential helper cache 2026-04-22 19:52:46 -07:00
entrypoint.sh fix(workspace): credential helper security hardening (#1797) 2026-04-23 18:14:55 +00:00
events.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
executor_helpers.py fix(runtime): replace remaining /app/ legacy paths in agent prompts + docstrings 2026-04-27 11:22:00 -07:00
heartbeat.py fix(runtime): use lowercase wire role for v0.3 JSON-RPC compat layer 2026-04-27 12:40:11 -07:00
initial_prompt.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
main.py fix(runtime): use lowercase wire role for v0.3 JSON-RPC compat layer 2026-04-27 12:40:11 -07:00
molecule_ai_status.py fix(runtime): replace remaining /app/ legacy paths in agent prompts + docstrings 2026-04-27 11:22:00 -07:00
platform_auth.py fix(workspace): tag self-originated A2A POSTs with X-Workspace-ID 2026-04-24 19:54:43 -07:00
plugins.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
preflight.py feat(preflight): replace SUPPORTED_RUNTIMES static list with adapter discovery 2026-04-27 00:44:51 -07:00
prompt.py fix(review): address code review blockers on tool-trace + instructions 2026-04-22 16:18:06 -07:00
pytest.ini feat(preflight): replace SUPPORTED_RUNTIMES static list with adapter discovery 2026-04-27 00:44:51 -07:00
rebuild-runtime-images.sh fix: update workspace script comments for workspace-template → workspace rename 2026-04-18 01:48:05 -07:00
requirements.txt feat(workspace): migrate a2a-sdk from 0.3.x to 1.0.0 (KI-009) (#1974) 2026-04-24 04:43:17 +00:00
runtime_wedge.py chore(workspace): drop claude_sdk_executor — Phase 2 of #87 2026-04-27 00:52:55 -07:00
shared_runtime.py fix: CWE-78 rm scope, go vet failures, delegation idempotency 2026-04-21 18:22:30 +00:00
transcript_auth.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
watcher.py chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00