3269e93216
The A2A e2e historically asserted only response SHAPE (test_a2a_e2e.sh
checked '"kind":"text"' only). A broken agent returns its error AS a
text part -- {"kind":"text","text":"Agent error (Exception) ..."} --
which STILL matches the shape check, so it PASSED on a fully broken
agent. That is why the 2026-05-2x drained-key / byok-misroute failures
(agents-team PM + reno marketing erroring on every LLM call) sailed
through CI. "Channel returns text shape" is not "agent completed an LLM
round-trip."
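
A minimal sketch of the trap described above. The `shape_only_check` helper is a hypothetical stand-in for the historical check, not the actual code from test_a2a_e2e.sh; the broken payload is the one quoted in this message.

```bash
# Illustrative payloads: a healthy reply and an error returned AS a text part.
good='{"kind":"text","text":"PINEAPPLE"}'
broken='{"kind":"text","text":"Agent error (Exception) ..."}'

# Hypothetical shape-only check: succeeds whenever the payload has a text part.
shape_only_check() {
  grep -q '"kind":"text"' <<<"$1"
}

# Both payloads pass, so the check cannot tell a working agent from a broken one.
for payload in "$good" "$broken"; do
  if shape_only_check "$payload"; then
    echo "shape check PASSED: $payload"
  else
    echo "shape check FAILED: $payload"
  fi
done
```

Both lines print PASSED, which is why a content-level assertion is needed on top of the shape check.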
Adds, ADDITIVELY (no existing assertion weakened or removed):
- tests/e2e/lib/completion_assert.sh -- reusable gates:
  * a2a_assert_real_completion: deterministic known-answer round-trip;
    asserts CONTAINS the expected token AND NOT an error-as-text marker
    (Agent error / Exception / error result / MISSING_BYOK_CREDENTIAL).
  * provider_liveness_matrix + offered_platform_models_for_runtime:
    per-offered-provider cheap (max_tokens:4) probe; the offered set is
    read from the providers.yaml SSOT (runtimes.<rt>.providers[platform]
    .models) -- not a hardcoded list -- so the matrix tracks the SSOT.
  * assert_byok_not_platform_proxy: #1994 regression guard -- a
    byok-resolving workspace must NOT resolve platform_managed (reads the
    same derived resolver GET /admin/workspaces/:id/llm-billing-mode the
    provision strip gate uses).
- tests/e2e/test_staging_full_saas.sh (the live-agent lane, MiniMax
  primary): new stanzas 8b (PINEAPPLE known-answer, the core gate),
  8c (byok-routing guard), 8d (SSOT-driven per-provider liveness matrix).
- tests/e2e/test_a2a_e2e.sh: added check_no_error_as_text on Echo + SEO
  replies so the brief's literal shape-only example now FAILS on an
  error-as-text payload.
- tests/e2e/test_completion_assert_unit.sh: offline fail-direction proof
  (16 cases) that the negative gates are load-bearing -- error-as-text
  MUST fail, platform_managed MUST trip the #1994 guard. Wired into
  ci.yml "Run E2E bash unit tests (no live infra)" (required, per-PR +
  main). e2e-staging-saas.yml paths filter extended to re-trigger the
  live lane on lib changes.
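
For illustration only, the known-answer gate could look roughly like the sketch below. The `sketch_*` functions are hypothetical and are not the contents of completion_assert.sh. The marker list is the one named above; case-sensitive marker matching and the use of `grep -F` are assumptions.

```bash
# Hypothetical sketch; the real gate lives in tests/e2e/lib/completion_assert.sh.

# Print the first error-as-text marker found in $1; return 1 if there is none.
sketch_error_marker() {
  local text="$1" marker
  for marker in "Agent error" "Exception" "error result" "MISSING_BYOK_CREDENTIAL"; do
    if [[ "$text" == *"$marker"* ]]; then
      printf '%s\n' "$marker"
      return 0
    fi
  done
  return 1
}

# Succeed only if $1 carries no error marker AND contains the expected token
# $2 (case-insensitive). The marker check runs first, so an error string that
# happens to include the token is still rejected.
sketch_assert_real_completion() {
  local text="$1" expected="$2"
  if sketch_error_marker "$text" >/dev/null; then
    return 1
  fi
  grep -qiF -- "$expected" <<<"$text"
}

sketch_assert_real_completion "Sure: PINEAPPLE" "PINEAPPLE" && echo "accepted"
sketch_assert_real_completion "Agent error: could not produce PINEAPPLE" "PINEAPPLE" || echo "rejected"
```

The sketch returns a status instead of calling fail(), so it can be exercised without a subshell; the real helper may well behave differently.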
No #1994 fix code touched -- tests/e2e + workflow wiring only.
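
As a rough illustration of the byok-routing guard (stanza 8c), the sketch below accepts a billing-mode response body only when it resolves to byok. The function name is hypothetical. The `resolved_mode` field is taken from the payloads in the unit test. Extracting it with a bash regex rather than a JSON parser is a simplification assumed here, not the real implementation.

```bash
# Hypothetical sketch of the #1994 guard; not the code in completion_assert.sh.
# $1 is the JSON body returned by the llm-billing-mode resolver.
sketch_assert_byok_not_platform_proxy() {
  local body="$1"
  # Capture the string value of "resolved_mode" (assumes flat, well-formed JSON).
  local re='"resolved_mode"[[:space:]]*:[[:space:]]*"([^"]*)"'
  if [[ "$body" =~ $re ]]; then
    # Anything other than byok (platform_managed, disabled, ...) trips the guard.
    [[ "${BASH_REMATCH[1]}" == "byok" ]]
  else
    # A missing field is treated as a failure rather than silently accepted.
    return 1
  fi
}

sketch_assert_byok_not_platform_proxy '{"resolved_mode":"byok"}' && echo "guard ok"
sketch_assert_byok_not_platform_proxy '{"resolved_mode":"platform_managed"}' || echo "guard tripped"
```

Treating a missing or unknown mode as a failure keeps the guard fail-closed, matching the unit cases for a missing `resolved_mode` and for `disabled`.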
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
112 lines
4.9 KiB
Bash
Executable File
#!/usr/bin/env bash
# Fail-direction / load-bearing proof for lib/completion_assert.sh.
#
# This is the watch-it-FAIL counterpart the dev-SOP Phase 3 requires: it
# proves the new real-completion + byok gates actually CATCH a broken agent,
# not just pass on a good one. It runs entirely offline (no LLM, no network,
# no provisioning) — pure assertion logic — so it can run on every PR in the
# fast lane (e2e-api.yml unit-shell step) and locally via `bash`.
#
# The decisive case is `error-as-text payload MUST FAIL`: that is the exact
# trap (#1994) the historical shape-only check missed. If a refactor weakens
# a2a_assert_real_completion to a substring/shape check, THIS test goes red.
set -uo pipefail

HERE="$(cd "$(dirname "$0")" && pwd)"
PASS=0
FAIL=0

# Minimal stand-ins for the host script's helpers. fail() must NOT exit the
# whole harness here — we want to assert that it WAS called. We trap it by
# running the assertion in a subshell and checking the subshell's exit code:
# the real fail() exits 1, ok() exits 0 implicitly.
log() { echo "[unit] $*"; }
ok() { echo "[unit] OK: $*"; }
fail() { echo "[unit] FAIL-CALLED: $*" >&2; exit 1; }

# shellcheck source=lib/completion_assert.sh
source "$HERE/lib/completion_assert.sh"

expect_pass() {
  local desc="$1"; shift
  if ( "$@" ) >/dev/null 2>&1; then
    echo "PASS: $desc (assertion accepted, as expected)"
    PASS=$((PASS + 1))
  else
    echo "FAIL: $desc — expected the assertion to ACCEPT, but it rejected"
    FAIL=$((FAIL + 1))
  fi
}

expect_fail() {
  local desc="$1"; shift
  if ( "$@" ) >/dev/null 2>&1; then
    echo "FAIL: $desc — expected the assertion to REJECT, but it accepted (gate NOT load-bearing!)"
    FAIL=$((FAIL + 1))
  else
    echo "PASS: $desc (assertion rejected, as expected)"
    PASS=$((PASS + 1))
  fi
}

echo "=== completion_assert.sh fail-direction proof ==="

# ---- a2a_assert_real_completion ----
# Good: real known-answer reply passes.
expect_pass "real PINEAPPLE reply passes" \
  a2a_assert_real_completion "PINEAPPLE" "PINEAPPLE" "unit"
expect_pass "case-insensitive known answer passes" \
  a2a_assert_real_completion "pineapple" "PINEAPPLE" "unit"
expect_pass "known answer with minor wrapping passes" \
  a2a_assert_real_completion "Sure: PINEAPPLE" "PINEAPPLE" "unit"

# DECISIVE: the error-as-text trap. Each MUST fail — these are the payloads a
# broken agent returns that the old shape-only `"kind":"text"` check passed.
expect_fail "Agent error as text payload MUST fail" \
  a2a_assert_real_completion "Agent error (Exception) — see workspace logs for details." "PINEAPPLE" "unit"
expect_fail "bare Exception as text MUST fail" \
  a2a_assert_real_completion "Traceback ... Exception: boom" "PINEAPPLE" "unit"
expect_fail "error result as text MUST fail" \
  a2a_assert_real_completion "tool returned error result" "PINEAPPLE" "unit"
expect_fail "MISSING_BYOK_CREDENTIAL as text MUST fail" \
  a2a_assert_real_completion "MISSING_BYOK_CREDENTIAL: set your own key" "PINEAPPLE" "unit"
# Error-as-text that ALSO happens to contain the token still fails (error
# marker takes precedence — a real completion never carries these markers).
expect_fail "error-as-text containing the token still fails" \
  a2a_assert_real_completion "Agent error: could not produce PINEAPPLE" "PINEAPPLE" "unit"
# Empty text fails.
expect_fail "empty text fails" \
  a2a_assert_real_completion "" "PINEAPPLE" "unit"
# Wrong/echoed content (no token, no error) fails — shape-OK but not a real
# completion.
expect_fail "wrong content without token fails" \
  a2a_assert_real_completion "Reply with exactly the word PINEAPPLE and nothing else." "BANANA" "unit"

# ---- assert_byok_not_platform_proxy (#1994 guard) ----
expect_pass "byok resolution passes the guard" \
  assert_byok_not_platform_proxy '{"resolved_mode":"byok","provider_selection":"minimax","source":"derived_provider"}' "unit"
# DECISIVE: a platform_managed resolution on a byok workspace = the #1994
# regression. MUST fail.
expect_fail "platform_managed resolution trips the #1994 guard" \
  assert_byok_not_platform_proxy '{"resolved_mode":"platform_managed","provider_selection":"platform","source":"derived_provider"}' "unit"
expect_fail "missing resolved_mode trips the guard" \
  assert_byok_not_platform_proxy '{"provider_selection":"x"}' "unit"
expect_fail "disabled mode trips the guard (not byok)" \
  assert_byok_not_platform_proxy '{"resolved_mode":"disabled"}' "unit"

# ---- a2a_completion_error_marker (the scanner under the gate) ----
if hit=$(a2a_completion_error_marker "all good PINEAPPLE"); then
  echo "FAIL: clean text wrongly flagged as error marker ($hit)"; FAIL=$((FAIL + 1))
else
  echo "PASS: clean text has no error marker"; PASS=$((PASS + 1))
fi
if hit=$(a2a_completion_error_marker "An Exception occurred"); then
  echo "PASS: error marker detected ($hit)"; PASS=$((PASS + 1))
else
  echo "FAIL: error marker NOT detected in 'An Exception occurred'"; FAIL=$((FAIL + 1))
fi

echo ""
echo "=== Results: $PASS passed, $FAIL failed ==="
[ "$FAIL" -eq 0 ]