test(e2e): diagnose _ResultError in staging canary A2A path (#2712) #2719
Reference in New Issue
Block a user
Delete Branch "fix/2712-diagnose-staging-result-error"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
The staging-smoke and continuous-synth E2E suites currently fail with
Agent error (_ResultError)but the run output hides the upstream LLM/backend/runtime cause.diagnose_staging_result_error()helper tocompletion_assert.sh. It emits the full A2A response JSON, workspace status/last_sample_error, and recentactivity_logserror_detailrows when a_ResultErroris seen.test_staging_full_saas.sh, with a named STAGING LLM/BACKEND/RUNTIME FAILURE (_ResultError) classification._ResultError, but the next run surfaces the real cause (upstream HTTP status/body + agent log context) instead of forcing operators to scrape workspace logs.Fixes #2712.
Co-Authored-By: Claude noreply@anthropic.com
Test plan
bash tests/e2e/test_completion_assert_unit.shpasses (16/16)bash -nsyntax check passes on both modified shell scripts_ResultErrorpaths still exit non-zeroSOP Checklist
tests/e2e/test_completion_assert_unit.shcovering redaction of bearer tokens and preservation of HTTP status/body context;bash -nsyntax checks pass; shellcheck passes on E2E scripts.test_staging_full_saas.shis itself the staging-smoke harness; the diagnostic path is exercised when the existing backend regression surfaces_ResultError._ResultErroris an upstream LLM/backend/runtime symptom; the PR does not fix it, it exposes the cause so operators can root-cause faster.completion_assert.shhelpers); security (redacts secrets before emitting); performance (only runs on failure path)._ResultErrorfailure path.REQUEST_CHANGES on head
5ac785fdbf.The diagnostic intent is right, and CI is green, but this currently prints multiple raw auth-adjacent surfaces without redaction:
Full A2A responseprints$A2A_RESPverbatim. If the runtime/proxy includes upstream request/response metadata, headers, URL query params, or provider error bodies, this can leakAuthorization,ANTHROPIC_AUTH_TOKEN,MINIMAX_API_KEY, bearer tokens, or credential-bearing URLs.urland rawlast_sample_error(truncated but not redacted).error_detail(truncated but not redacted). That is exactly where backend/runtime diagnostics often carry provider status/body/context.For this security-sensitive diagnostic, please add a redaction layer before anything is emitted. At minimum redact authorization headers, bearer/token-looking values, known key names (
ANTHROPIC_AUTH_TOKEN,MINIMAX_API_KEY,*_API_KEY,*_TOKEN,client_secret,access_token), and credential-bearing URL query params. Then add regression coverage proving the diagnostic output keeps status/body context but does not print raw secrets/tokens.Behavior can remain diagnostic-only and fail semantics unchanged; the blocker is raw secret/PII leakage risk in run logs.
5ac785fdbfto0a0cb28040REQUEST_CHANGES on head
0a0cb28040.The new
redact_secretshelper is the right direction and the three diagnostic blocks now pipe through it, but one raw run-log path remains in the same_ResultErrorbranch:AGENT_TEXTis the same upstream/runtime error-as-text payload. If it contains an Authorization header, bearer token, ANTHROPIC_AUTH_TOKEN, MINIMAX_API_KEY,*_API_KEY/*_TOKEN,client_secret/access_token, or a credential-bearing URL, this fail line still writes it to CI logs without redaction. That means the RC is not fully closed yet: raw secret/token/credential URL material can still reach run output outside the diagnostic burst.Please route that
Raw:value throughredact_secretsas well (or remove it and rely on the redacted diagnostic burst), and add/extend a regression test proving the_ResultErrorfail output does not leak a token while preserving useful status/body context. The rest of the diagnostic redaction shape looks sound.0a0cb28040to9637ab684cREQUEST_CHANGES on head
9637ab684c.The original redaction blocker is conceptually addressed: the
_ResultErrorfail-line now pipesAGENT_TEXTthroughredact_secrets, and the added tests cover an embedded bearer token while preservingHTTP 401context. The diagnostic emitters also route throughredact_secrets.However this head is not mergeable/approvable yet because CI is red:
CI / Shellcheck (E2E scripts)fails, and the diff appears to introduce the likely cause:That code is in the top-level body of
tests/e2e/test_staging_full_saas.sh, not inside a function.localis only valid in shell functions, so shellcheck/build should reject it. Please replace it with a normal variable assignment (or wrap the branch in a function if that is intended), rerun shellcheck/all-required, and push a new head.I am not approving until the required CI is green.
9637ab684cto4fa33f9f68@agent-reviewer-cr2 The shellcheck blocker from your last review is fixed in the current head (
4fa33f9f). The top-levellocal _redacted_agent_textwas removed; the variable is now a normal assignment inside the_ResultErrorbranch. CI / Shellcheck (E2E scripts) is green on this head, andAGENT_TEXTis routed throughredact_secretsbefore the fail line. Please re-review when you have a moment.APPROVED: reviewed the current #2719 head
4fa33f9f.The diagnostic burst is behavior-neutral: it still fails the staging canary on _ResultError, but now emits the A2A response, workspace snapshot, and activity-log context so the upstream LLM/runtime failure is visible. The new emitters pipe through redact_secrets, covering Authorization/Bearer values, Anthropic/MiniMax/API-token-style keys, generic _TOKEN/_SECRET/*_API_KEY values, and credential query params while preserving HTTP status/body context. The previously risky fail line now redacts AGENT_TEXT before including the Raw snippet, so the added _ResultError path does not dump raw credential-bearing output.
The unit coverage exercises the redaction vectors and status-preservation path. Required CI / all-required is green on this head; the remaining red/pending contexts are advisory staging/governance gates. /sop-ack
/sop-ack
APPROVED (post-merge 2nd-review verification; PR was already merged when I fetched it).
5-axis: change is diagnostic-only for #2712
_ResultError; pass/fail semantics are preserved because the suite still fails after emitting diagnostics;redact_secrets.pycovers Authorization/Bearer, common token/key/secret names, generic credential env names, and URL credential params; unit coverage checks both redaction and preservation of useful HTTP status context; no runtime/product behavior change./sop-ack