From ca0ae56ccbb5a1308beece973b2c93c50075318e Mon Sep 17 00:00:00 2001 From: Teknium <127238744+teknium1@users.noreply.github.com> Date: Tue, 14 Apr 2026 21:03:05 -0700 Subject: [PATCH] fix: add 402 billing error hint to gateway error handler (#5220) (#10057) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * fix: hermes gateway restart waits for service to come back up (#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR #9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable. --- gateway/run.py | 2 ++ 1 file changed, 2 insertions(+) diff --git a/gateway/run.py b/gateway/run.py index d137d73c..c59e9dff 100644 --- a/gateway/run.py +++ b/gateway/run.py @@ -4020,6 +4020,8 @@ class GatewayRunner: _hist_len = len(history) if 'history' in locals() else 0 if status_code == 401: status_hint = " Check your API key or run `claude /login` to refresh OAuth credentials." + elif status_code == 402: + status_hint = " Your API balance or quota is exhausted. Check your provider dashboard." elif status_code == 429: # Check if this is a plan usage limit (resets on a schedule) vs a transient rate limit _err_body = getattr(e, "response", None)