molecule-core/tests/e2e/lib/completion_assert.sh
hongming-ceo-delegated 3269e93216
test(e2e): add real-completion + per-provider liveness + byok-routing A2A gate (#1994 follow-on)
The A2A e2e historically asserted only response SHAPE (test_a2a_e2e.sh
checked '"kind":"text"' only). A broken agent returns its error AS a
text part -- {"kind":"text","text":"Agent error (Exception) ..."} --
which STILL matches the shape check, so it PASSED on a fully broken
agent. That is why the 2026-05-2x drained-key / byok-misroute failures
(agents-team PM + reno marketing erroring on every LLM call) sailed
through CI. "Channel returns text shape" is not "agent completed an LLM
round-trip."
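
To make the false positive concrete, here is a minimal sketch that runs a shape-only check against a hand-written error payload. The payload text and the variable names are illustrative assumptions, not captured CI output.

```shell
#!/usr/bin/env bash
# Hypothetical reply from a broken agent: the error is carried as a text part.
broken_reply='{"kind":"text","text":"Agent error (Exception) - see workspace logs"}'

# Shape-only check: looks for the text-part marker and nothing else.
if printf '%s' "$broken_reply" | grep -qF '"kind":"text"'; then
  shape_check="PASS"
else
  shape_check="FAIL"
fi

# Content-aware check: additionally rejects a known error-as-text marker.
if printf '%s' "$broken_reply" | grep -qiF 'Agent error'; then
  content_check="FAIL"
else
  content_check="PASS"
fi

echo "shape-only: $shape_check, content-aware: $content_check"
# prints: shape-only: PASS, content-aware: FAIL
```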

Adds, ADDITIVELY (no existing assertion weakened or removed):

- tests/e2e/lib/completion_assert.sh -- reusable gates:
  * a2a_assert_real_completion: deterministic known-answer round-trip;
    asserts CONTAINS the expected token AND NOT an error-as-text marker
    (Agent error / Exception / error result / MISSING_BYOK_CREDENTIAL).
  * provider_liveness_matrix + offered_platform_models_for_runtime:
    per-offered-provider cheap (max_tokens:4) probe; the offered set is
    read from the providers.yaml SSOT (runtimes.<rt>.providers[platform]
    .models) -- not a hardcoded list -- so the matrix tracks the SSOT.
  * assert_byok_not_platform_proxy: #1994 regression guard -- a
    byok-resolving workspace must NOT resolve platform_managed (reads the
    same derived resolver GET /admin/workspaces/:id/llm-billing-mode the
    provision strip gate uses).

- tests/e2e/test_staging_full_saas.sh (the live-agent lane, MiniMax
  primary): new stanzas 8b (PINEAPPLE known-answer, the core gate),
  8c (byok-routing guard), 8d (SSOT-driven per-provider liveness matrix).

- tests/e2e/test_a2a_e2e.sh: added check_no_error_as_text on Echo + SEO
  replies so the brief's literal shape-only example now FAILS on an
  error-as-text payload.

- tests/e2e/test_completion_assert_unit.sh: offline fail-direction proof
  (16 cases) that the negative gates are load-bearing -- error-as-text
  MUST fail, platform_managed MUST trip the #1994 guard. Wired into
  ci.yml "Run E2E bash unit tests (no live infra)" (required, per-PR +
  main). e2e-staging-saas.yml paths filter extended to re-trigger the
  live lane on lib changes.
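
The known-answer gate described in the list above can be sketched roughly as follows. This is a simplified stand-in, not the library code: `known_answer_ok` is a hypothetical name, it returns a status instead of calling the host script's fail()/ok(), and the sample replies are made up.

```shell
#!/usr/bin/env bash
# Simplified stand-in for the known-answer gate (hypothetical helper name).
known_answer_ok() {
  local text="$1" expected="$2" marker
  # An empty reply is never a completed round-trip.
  [ -n "$text" ] || return 1
  # Negative assertion: any error-as-text marker means no real completion.
  for marker in "Agent error" "Exception" "error result" "MISSING_BYOK_CREDENTIAL"; do
    if printf '%s' "$text" | grep -qiF -- "$marker"; then
      return 1
    fi
  done
  # Positive assertion: the deterministic token must be present.
  printf '%s' "$text" | grep -qiF -- "$expected"
}

known_answer_ok "The word is PINEAPPLE." "PINEAPPLE" && r_good=pass || r_good=fail
known_answer_ok "Agent error (Exception): PINEAPPLE" "PINEAPPLE" && r_err=pass || r_err=fail
known_answer_ok "Reply with the word." "PINEAPPLE" && r_echo=pass || r_echo=fail

echo "good=$r_good error-as-text=$r_err no-token=$r_echo"
# prints: good=pass error-as-text=fail no-token=fail
```

The second case is the important one: the token is present, but the error marker still fails the gate, so the negative assertion is checked before the positive one.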

No #1994 fix code touched -- tests/e2e + workflow wiring only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-28 18:58:13 +00:00

230 lines
10 KiB
Bash
Executable File

#!/usr/bin/env bash
# Real-completion + per-provider liveness + byok-routing assertion helpers
# for the staging full-SaaS E2E (tests/e2e/test_staging_full_saas.sh).
#
# WHY THIS LIB EXISTS (molecule-core#1995 / #1994 follow-on):
# The A2A e2e historically asserted only response SHAPE — e.g.
# test_a2a_e2e.sh:`check "SEO response has text" '"kind":"text"'`. A fully
# BROKEN agent returns its error AS a text part:
# {"kind":"text","text":"Agent error (Exception) — see workspace logs..."}
# which STILL matches `"kind":"text"` → the shape check PASSES on a broken
# agent. That is exactly why the 2026-05-2x drained-key / byok-misroute
# failures (agents-team PM + reno marketing erroring on every LLM call)
# sailed through CI. "Channel returns text shape" != "agent actually
# completed an LLM round-trip".
#
# These helpers add three load-bearing gates ON TOP of (never replacing) the
# existing shape + PONG checks:
#   1. a2a_assert_real_completion: deterministic known-answer round-trip
#      (CONTAINS the expected token AND NOT an error-as-text payload).
#   2. provider_liveness_matrix: per-offered-provider cheap completion
#      probe, providers sourced from the providers.yaml SSOT runtimes block.
#   3. assert_byok_not_platform_proxy: #1994 regression guard: a
#      byok-resolving workspace must NOT resolve to platform_managed.
#
# Conventions: reuses the host script's fail()/ok()/log() + tenant_call().
# Source this AFTER those are defined. BASH 4+.

# Error-as-text trap markers. If the agent's text part contains ANY of
# these, the "round-trip" did not really complete — the agent surfaced an
# error AS text. This is the negative assertion that makes a broken agent
# FAIL instead of slipping through the shape check.
#
# Kept as an array (not a single regex) so a new failure signature is a
# one-line append + the failure message can name which marker matched.
A2A_ERROR_AS_TEXT_MARKERS=(
  "Agent error"
  "Exception"
  "error result"
  "MISSING_BYOK_CREDENTIAL"
)

# a2a_completion_error_marker <agent_text>
# Echoes the first error-as-text marker found in <agent_text> (case-
# insensitive), or nothing if clean. Exit 0 if a marker matched, 1 if not.
# Pure string scan — no LLM, no network — so it is deterministic and is the
# unit under the fail-direction proof in test_completion_assert_unit.sh.
a2a_completion_error_marker() {
  local text="$1"
  local upper marker
  upper=$(printf '%s' "$text" | tr '[:lower:]' '[:upper:]')
  for marker in "${A2A_ERROR_AS_TEXT_MARKERS[@]}"; do
    if printf '%s' "$upper" | grep -qF -- "$(printf '%s' "$marker" | tr '[:lower:]' '[:upper:]')"; then
      printf '%s' "$marker"
      return 0
    fi
  done
  return 1
}

# a2a_assert_real_completion <agent_text> <expected_token> <context_label>
# The CORE gate. Asserts the agent text:
#   (a) does NOT contain any error-as-text marker (broken-agent trap), AND
#   (b) CONTAINS <expected_token> (case-insensitive), proving a real LLM
#       round-trip produced the deterministic known answer.
# Calls fail() (which exits) on either violation. This MUST fail on an
# error-as-text payload — that is the property test_completion_assert_unit.sh
# pins.
a2a_assert_real_completion() {
  local text="$1"
  local expected="$2"
  local ctx="${3:-A2A}"
  if [ -z "$text" ]; then
    fail "$ctx — real-completion gate: agent returned EMPTY text (no round-trip)."
  fi
  local hit
  if hit=$(a2a_completion_error_marker "$text"); then
    fail "$ctx — real-completion gate: agent returned an ERROR-AS-TEXT payload (matched '$hit'). A broken agent that surfaces its error as a text part is NOT a completed round-trip. This is the trap the shape-only check missed (#1994). Raw: ${text:0:200}"
  fi
  # Known-answer: real LLM round-trip yields the deterministic token. A
  # prompt-echo / truncated-context / wrong-auth pipeline won't.
  if ! printf '%s' "$text" | tr '[:lower:]' '[:upper:]' | grep -qF -- "$(printf '%s' "$expected" | tr '[:lower:]' '[:upper:]')"; then
    fail "$ctx — real-completion gate: reply did NOT contain expected known-answer token '$expected'. The channel returned a text shape but no real completion. Raw: ${text:0:200}"
  fi
  ok "$ctx — real completion verified (contains '$expected', no error-as-text). Reply: \"${text:0:80}\""
}

# offered_platform_models_for_runtime <runtime>
# Emits, one per line, the platform-servable model ids the providers.yaml
# SSOT (runtimes.<runtime>.providers[name=platform].models) declares for
# <runtime>. This is the SSOT-driven offered/platform-servable matrix — NOT
# a hardcoded provider list — so a provider added/removed in providers.yaml
# automatically changes the matrix this probe exercises.
#
# Reads the embedded copy at workspace-server/internal/providers/providers.yaml
# (the same file go:embed compiles into the binary). Requires python3 +
# PyYAML (already a test-harness dep). On parse failure, emits nothing and
# returns 1 so the caller can fail loud rather than silently skip.
offered_platform_models_for_runtime() {
  local runtime="$1"
  local yaml_path="${PROVIDERS_YAML_PATH:-}"
  if [ -z "$yaml_path" ]; then
    # This lib lives at tests/e2e/lib/ -> repo root is three dirs up
    # (lib -> e2e -> tests -> repo-root).
    yaml_path="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)/workspace-server/internal/providers/providers.yaml"
  fi
  if [ ! -f "$yaml_path" ]; then
    log " [provider-matrix] providers.yaml SSOT not found at $yaml_path"
    return 1
  fi
  # The heredoc body and its PY terminator must start at column 0: the
  # delimiter is quoted (<<'PY', not <<-), and Python is indentation-sensitive.
  RUNTIME_REF="$runtime" python3 - "$yaml_path" <<'PY'
import os, sys
try:
    import yaml
except Exception as e:  # PyYAML missing: fail loud, do not silently skip.
    sys.stderr.write(f"PyYAML required for provider-matrix SSOT read: {e}\n")
    sys.exit(2)
rt = os.environ["RUNTIME_REF"]
with open(sys.argv[1]) as f:
    doc = yaml.safe_load(f)
native = (doc.get("runtimes") or {}).get(rt) or {}
for pref in native.get("providers", []) or []:
    if pref.get("name") == "platform":
        for m in pref.get("models", []) or []:
            print(m)
PY
}

# provider_liveness_matrix <runtime> <probe_fn>
# For each platform-servable model the SSOT lists for <runtime>, calls
# <probe_fn> <model_id> which must echo the agent text (or empty) and return
# 0 on a non-error completion, non-zero otherwise. Logs a per-model pass/fail
# matrix. Returns 0 only if EVERY probed model produced a non-error
# completion; non-zero (and a recorded matrix) otherwise.
#
# Purpose: exercise each offered provider's AUTH + ROUTING path so a drained
# key / wrong base-URL / byok-misroute fails the gate (the #1994 class). The
# probe_fn is expected to use minimal max_tokens.
#
# This helper does the SSOT read + matrix bookkeeping; the host script
# supplies probe_fn (it owns workspace ids + tenant_call wiring).
provider_liveness_matrix() {
  local runtime="$1"
  local probe_fn="$2"
  local models model rc total=0 passed=0
  local -a results=()
  models=$(offered_platform_models_for_runtime "$runtime") || {
    fail "provider-liveness: could not read offered-provider matrix from providers.yaml SSOT for runtime=$runtime"
  }
  if [ -z "$models" ]; then
    log " [provider-matrix] runtime=$runtime offers no platform-servable models in the SSOT — nothing to probe (not a failure)."
    return 0
  fi
  log " [provider-matrix] SSOT offered platform models for $runtime:"
  while IFS= read -r model; do
    [ -z "$model" ] && continue
    log " - $model"
  done <<<"$models"
  while IFS= read -r model; do
    [ -z "$model" ] && continue
    total=$((total + 1))
    set +e
    "$probe_fn" "$model"
    rc=$?
    set -e
    if [ "$rc" = "0" ]; then
      passed=$((passed + 1))
      results+=("PASS $model")
    elif [ "$rc" = "75" ]; then
      # 75 (EX_TEMPFAIL convention) = probe skipped (key/runtime not
      # available in this lane). Not counted toward pass/fail, only logged.
      total=$((total - 1))
      results+=("SKIP $model (probe unavailable in this lane)")
    else
      results+=("FAIL $model")
    fi
  done <<<"$models"
  log " [provider-matrix] result matrix (runtime=$runtime):"
  local line
  for line in "${results[@]}"; do
    log " $line"
  done
  log " [provider-matrix] $passed/$total probed providers completed without error"
  if [ "$passed" != "$total" ]; then
    return 1
  fi
  return 0
}

# assert_byok_not_platform_proxy <billing_mode_json> <context_label>
# #1994 regression guard. Given the JSON body from
# GET /admin/workspaces/:id/llm-billing-mode (same derived resolver the
# provision-time strip gate uses), asserts the workspace resolves to BYOK
# and NOT platform_managed. A regression of #1994 (byok workspace baked to
# platform_managed → routed through the platform proxy → platform LLM key
# drained) flips resolved_mode to "platform_managed" and trips this gate.
# Calls fail() (exits) on violation.
assert_byok_not_platform_proxy() {
  local body="$1"
  local ctx="${2:-byok-guard}"
  local mode prov
  # The inline Python starts at column 0 inside the quoted string so that
  # its top-level statements and indented try-body parse correctly.
  mode=$(printf '%s' "$body" | python3 -c "import json,sys
try: print(json.load(sys.stdin).get('resolved_mode',''))
except Exception: print('')" 2>/dev/null || echo "")
  prov=$(printf '%s' "$body" | python3 -c "import json,sys
try:
    d=json.load(sys.stdin); v=d.get('provider_selection')
    print(v if v is not None else '')
except Exception: print('')" 2>/dev/null || echo "")
  if [ -z "$mode" ]; then
    fail "$ctx — byok-routing guard: could not read resolved_mode from billing-mode response. Raw: ${body:0:200}"
  fi
  if [ "$mode" = "platform_managed" ]; then
    fail "$ctx — byok-routing guard TRIPPED (#1994 regression): a byok-configured workspace resolved to 'platform_managed' (provider_selection=$prov) → it would route through the platform proxy and drain the platform LLM key. Expected resolved_mode=byok. Raw: ${body:0:200}"
  fi
  if [ "$mode" != "byok" ]; then
    fail "$ctx — byok-routing guard: unexpected resolved_mode='$mode' (expected 'byok'). provider_selection=$prov. Raw: ${body:0:200}"
  fi
  ok "$ctx — byok-routing guard: workspace resolves byok (provider_selection=$prov), NOT platform-proxy. #1994 stays fixed."
}
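
The probe_fn contract that provider_liveness_matrix relies on (0 = non-error completion, 75 = skipped, anything else = failure) can be sketched in isolation as below. Everything here is an assumption for illustration: `demo_probe` and the `demo/...` model ids are hypothetical, and the loop is a cut-down tally rather than the helper itself. A real probe would send a small completion request through the host script's own wiring.

```shell
#!/usr/bin/env bash
# Hypothetical probe: the outcome is faked from the model id so the
# exit-code contract can be exercised without any network access.
demo_probe() {
  case "$1" in
    demo/healthy-model) return 0 ;;    # non-error completion
    demo/no-key-in-lane) return 75 ;;  # skipped: probe unavailable in this lane
    *) return 1 ;;                     # error, for example a drained key
  esac
}

passed=0
total=0
skipped=0
for model in demo/healthy-model demo/no-key-in-lane demo/drained-key-model; do
  rc=0
  # Capture the status without letting a non-zero return abort the loop.
  demo_probe "$model" || rc=$?
  case "$rc" in
    0) passed=$((passed + 1)); total=$((total + 1)) ;;
    75) skipped=$((skipped + 1)) ;;    # excluded from the pass/fail total
    *) total=$((total + 1)) ;;
  esac
done

echo "$passed/$total passed, $skipped skipped"
# prints: 1/2 passed, 1 skipped
```

Capturing the status with `|| rc=$?` is one way to tolerate a failing probe under errexit without toggling shell options; the helper above uses `set +e` / `set -e` instead, which assumes the host script runs with errexit enabled.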