From 4c49ff75f6342dbf42fcb7eddd0c75ac129bc245 Mon Sep 17 00:00:00 2001 From: Hongming Wang Date: Sun, 3 May 2026 15:18:42 -0700 Subject: [PATCH] test(e2e): canary classifies provider-quota 429 as operator-action, not platform regression MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The staging canary's A2A step has a ladder of specific regression classifiers (hermes-agent down, model_not_found, Invalid API key, etc.) followed by a generic "error|exception" catch-all. Provider- side OpenAI 429 quota errors fell through to the catch-all, so the canary issue body and CI log just said "A2A returned an error-shaped response" — which is technically true but obscures the actual operator action. This adds a 7th classifier above the catch-all for "exceeded your current quota" / "insufficient_quota" — both terms appear in OpenAI's quota-exhaustion 429 response. When matched, the failure message names the operator action directly (top up MOLECULE_STAGING_OPENAI_KEY or rotate the secret) and links to #2578. Why this is correct, not "lowering the bar": - Steps 0–7 of the canary cover full platform health (CP up, tenant provisioned, DNS+TLS reachable, workspace booted, A2A delivered). - Reaching step 8 with a provider-side 429 means the platform IS healthy — the failure is downstream of all platform invariants. - The canary still exits 1 (CI stays red, threshold-3 alarm still fires); only the failure message changes. - All 6 existing specific classifiers run BEFORE this one, so any real platform regression is still caught with its specific message. Verification: - Regex tested against the actual 429 string from canary run 25291517608: "API call failed after 3 retries: HTTP 429: You exceeded your current quota..." → matches ✅ - Negative tests: "PONG", "hermes-agent unreachable" → no match ✅ - bash -n syntax check passes - shellcheck -S error clean Tracking: #2593 (canary), #2578 (root cause) Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/e2e/test_staging_full_saas.sh | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/tests/e2e/test_staging_full_saas.sh b/tests/e2e/test_staging_full_saas.sh index 330affcf..759af7b9 100755 --- a/tests/e2e/test_staging_full_saas.sh +++ b/tests/e2e/test_staging_full_saas.sh @@ -517,6 +517,7 @@ fi # "Encrypted content is not supported" → hermes codex_responses API misroute (#14) # "Unknown provider" → bridge misconfigured PROVIDER= (regression of #13 fix) # "hermes-agent unreachable" → gateway process died +# "exceeded your current quota" → MOLECULE_STAGING_OPENAI_KEY billing (NOT a platform regression — #2578) # # Fail LOUD with the specific pattern so CI log + alert channel makes the # regression unambiguous. @@ -542,6 +543,16 @@ fi if echo "$AGENT_TEXT" | grep -qF "Invalid API key"; then fail "A2A — REGRESSION: tenant auth chain returned 'Invalid API key'. Likely CP boot-event 401 race (CP #238) or stale OPENAI_API_KEY in the runtime env. Raw: $AGENT_TEXT" fi +# Provider quota exhausted — distinguish from a platform regression so +# the canary alert names the operator action directly instead of falling +# through to the generic "error-shaped response" message. Steps 0-7 having +# passed means the platform itself is healthy (CP up, tenant provisioned, +# workspace online, A2A delivery end-to-end). When the agent comes back +# with a provider-side 429, that is a billing event on the configured +# OpenAI key, not a platform regression. Tracked in #2578. +if echo "$AGENT_TEXT" | grep -qiE "exceeded your current quota|insufficient_quota"; then + fail "A2A — PROVIDER QUOTA EXHAUSTED (NOT a platform regression). Operator action: top up MOLECULE_STAGING_OPENAI_KEY billing or rotate to a higher-quota org at Settings → Secrets and Variables → Actions. Tracked in #2578. Raw: $AGENT_TEXT" +fi # Generic catch-all — falls through if none of the known regressions hit. if echo "$AGENT_TEXT" | grep -qiE "error|exception"; then fail "A2A returned an error-shaped response: $AGENT_TEXT"