hermes-agent/tests/agent
Andre Kurait a9ccb03ccc fix(bedrock): evict cached boto3 client on stale-connection errors
## Problem

When a pooled HTTPS connection to the Bedrock runtime goes stale (NAT
timeout, VPN flap, server-side TCP RST, proxy idle cull), the next
Converse call surfaces as one of:

  * botocore.exceptions.ConnectionClosedError / ReadTimeoutError /
    EndpointConnectionError / ConnectTimeoutError
  * urllib3.exceptions.ProtocolError
  * A bare AssertionError raised from inside urllib3 or botocore
    (internal connection-pool invariant check)

The agent loop retries the request 3x, but the cached boto3 client in
_bedrock_runtime_client_cache is reused across retries — so every
attempt hits the same dead connection pool and fails identically.
Only a process restart clears the cache and lets the user keep working.

The bare-AssertionError variant is particularly user-hostile because
str(AssertionError()) is an empty string, so the retry banner shows:

    ⚠️  API call failed: AssertionError
       📝 Error:

with no hint of what went wrong.

## Fix

Add two helpers to agent/bedrock_adapter.py:

  * is_stale_connection_error(exc) — classifies exceptions that
    indicate dead-client/dead-socket state. Matches botocore
    ConnectionError + HTTPClientError subtrees, urllib3
    ProtocolError / NewConnectionError, and AssertionError
    raised from a frame whose module name starts with urllib3.,
    botocore., or boto3.. Application-level AssertionErrors are
    intentionally excluded.

  * invalidate_runtime_client(region) — per-region counterpart to
    the existing reset_client_cache(). Evicts a single cached
    client so the next call rebuilds it (and its connection pool).

Wire both into the Converse call sites:

  * call_converse() / call_converse_stream() in
    bedrock_adapter.py (defense-in-depth for any future caller)
  * The two direct client.converse(**kwargs) /
    client.converse_stream(**kwargs) call sites in run_agent.py
    (the paths the agent loop actually uses)

On a stale-connection exception, the client is evicted and the
exception re-raised unchanged. The agent's existing retry loop then
builds a fresh client on the next attempt and recovers without
requiring a process restart.

## Tests

tests/agent/test_bedrock_adapter.py gets three new classes (14 tests):

  * TestInvalidateRuntimeClient — per-region eviction correctness;
    non-cached region returns False.
  * TestIsStaleConnectionError — classifies botocore
    ConnectionClosedError / EndpointConnectionError /
    ReadTimeoutError, urllib3 ProtocolError, library-internal
    AssertionError (both urllib3.* and botocore.* frames), and
    correctly ignores application-level AssertionError and
    unrelated exceptions (ValueError, KeyError).
  * TestCallConverseInvalidatesOnStaleError — end-to-end: stale
    error evicts the cached client, non-stale error (validation)
    leaves it alone, successful call leaves it cached.

All 116 tests in test_bedrock_adapter.py pass.

Signed-off-by: Andre Kurait <andrekurait@gmail.com>
2026-04-24 07:26:07 -07:00
..
transports fix(kimi,mcp): Moonshot schema sanitizer + MCP schema robustness (#14805) 2026-04-23 16:11:57 -07:00
__init__.py
test_anthropic_adapter.py refactor: remove _nr_to_assistant_message shim + fix flush_memories guard 2026-04-23 02:30:05 -07:00
test_anthropic_keychain.py fix: re-auth on stale OAuth token; read Claude Code credentials from macOS Keychain 2026-04-24 07:14:00 -07:00
test_auxiliary_client_anthropic_custom.py
test_auxiliary_client.py fix(aux): force anthropic oauth refresh after 401 2026-04-24 07:14:00 -07:00
test_auxiliary_config_bridge.py
test_auxiliary_main_first.py feat: add Xiaomi MiMo v2.5-pro and v2.5 model support (#14635) 2026-04-23 10:06:25 -07:00
test_auxiliary_named_custom_providers.py fix(aux): normalize GitHub Copilot provider slugs 2026-04-24 03:33:29 -07:00
test_bedrock_adapter.py fix(bedrock): evict cached boto3 client on stale-connection errors 2026-04-24 07:26:07 -07:00
test_bedrock_integration.py fix(agent): handle aws_sdk auth type in resolve_provider_client 2026-04-24 07:26:07 -07:00
test_codex_cloudflare_headers.py
test_compress_focus.py
test_context_compressor.py fix(agent): guard context compressor against structured message content 2026-04-22 14:46:51 -07:00
test_context_engine.py
test_context_references.py
test_copilot_acp_client.py fix: set HOME for Copilot ACP subprocesses 2026-04-24 05:09:08 -07:00
test_credential_pool_routing.py
test_credential_pool.py fix(credential_pool): add Nous OAuth cross-process auth-store sync 2026-04-24 05:20:05 -07:00
test_crossloop_client_cache.py
test_direct_provider_url_detection.py
test_display_emoji.py
test_display.py
test_error_classifier.py fix(agent): only set rate-limit cooldown when leaving primary; add tests 2026-04-24 05:35:43 -07:00
test_external_skills.py
test_gemini_cloudcode.py
test_gemini_free_tier_gate.py feat(gemini): block free-tier keys at setup + surface guidance on 429 (#15100) 2026-04-24 04:46:17 -07:00
test_gemini_native_adapter.py fix(gemini): fail fast on missing API key + surface it in hermes dump (#15133) 2026-04-24 05:35:17 -07:00
test_gemini_schema.py fix(gemini): drop integer/number/boolean enums from tool schemas (#15082) 2026-04-24 03:40:00 -07:00
test_image_gen_registry.py
test_insights.py
test_kimi_coding_anthropic_thinking.py
test_local_stream_timeout.py
test_memory_provider.py
test_memory_user_id.py
test_minimax_auxiliary_url.py
test_minimax_provider.py fix(tests): resolve 17 persistent CI test failures (#15084) 2026-04-24 03:46:46 -07:00
test_model_metadata_local_ctx.py
test_model_metadata_ssl.py fix(auth): honor SSL CA env vars across httpx + requests callsites 2026-04-24 03:00:33 -07:00
test_model_metadata.py fix(bedrock): resolve context length via static table before custom-endpoint probe 2026-04-24 07:26:07 -07:00
test_models_dev.py
test_moonshot_schema.py fix(kimi,mcp): Moonshot schema sanitizer + MCP schema robustness (#14805) 2026-04-23 16:11:57 -07:00
test_nous_rate_guard.py
test_prompt_builder.py feat(agent): add PLATFORM_HINTS for matrix, mattermost, and feishu (#14428) 2026-04-23 12:50:22 +05:30
test_prompt_caching.py
test_proxy_and_url_validation.py
test_rate_limit_tracker.py
test_redact.py
test_shell_hooks_consent.py
test_shell_hooks.py
test_skill_commands.py refactor(commands): drop /provider, /plan handler, and clean up slash registry (#15047) 2026-04-24 03:10:52 -07:00
test_subagent_progress.py
test_subagent_stop_hook.py
test_subdirectory_hints.py
test_title_generator.py
test_usage_pricing.py fix(usage): read top-level Anthropic cache fields from OAI-compatible proxies 2026-04-22 17:40:49 -07:00
test_vision_resolved_args.py