* fix: hermes gateway restart waits for service to come back up (#8260) Previously, systemd_restart() sent SIGUSR1 to the gateway, printed 'restart requested', and returned immediately. The gateway still needed to drain active agents, exit with code 75, wait for systemd's RestartSec=30, and start the new process. The user saw 'success' but the gateway was actually down for 30-60 seconds. Now the SIGUSR1 path blocks with progress feedback: Phase 1 — wait for old process to die: ⏳ User service draining active work... Polls os.kill(pid, 0) until ProcessLookupError (up to 90s) Phase 2 — wait for new process to become active: ⏳ Waiting for hermes-gateway to restart... Polls systemctl is-active + verifies new PID (up to 60s) Success: ✓ User service restarted (PID 12345) Timeout: ⚠ User service did not become active within 60s. Check status: hermes gateway status Check logs: journalctl --user -u hermes-gateway --since '2 min ago' The reload-or-restart fallback path (line 1189) already blocks because systemctl reload-or-restart is synchronous. Test plan: - Updated test to verify wait-for-restart behavior - All 118 gateway CLI tests pass * fix: add 402 billing error hint to gateway error handler (#5220) The gateway's exception handler for agent errors had specific hints for HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota exhausted). Users hitting billing limits from custom proxy providers got a generic error with no guidance. Added: 'Your API balance or quota is exhausted. Check your provider dashboard.' The underlying billing classification (error_classifier.py) already correctly handles 402 as FailoverReason.billing with credential rotation and fallback. The original issue (#5220) where 402 killed the entire gateway was from an older version — on current main, 402 is excluded from the is_client_error abort path (line 9460) and goes through the proper retry/fallback/fail flow. Combined with PR #9875 (auto-recover from unexpected SIGTERM), even edge cases where the gateway dies are now survivable.
This commit is contained in:
parent
23b87c8ca8
commit
ca0ae56ccb
@ -4020,6 +4020,8 @@ class GatewayRunner:
|
||||
_hist_len = len(history) if 'history' in locals() else 0
|
||||
if status_code == 401:
|
||||
status_hint = " Check your API key or run `claude /login` to refresh OAuth credentials."
|
||||
elif status_code == 402:
|
||||
status_hint = " Your API balance or quota is exhausted. Check your provider dashboard."
|
||||
elif status_code == 429:
|
||||
# Check if this is a plan usage limit (resets on a schedule) vs a transient rate limit
|
||||
_err_body = getattr(e, "response", None)
|
||||
|
||||
Loading…
Reference in New Issue
Block a user