fix(core#2675): LLM-proxy preflight with DEP-DOWN:staging-llm status description convention #2763
Reference in New Issue
Block a user
Delete Branch "fix/core2675-llm-preflight"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Adds a reusable shell preflight (
tests/e2e/lib/llm_proxy_preflight.sh) that completion-gated e2e lanes source before booting workspaces. The preflight makes ONE cheap completion through the staging LLM proxy with a 30s timeout. On any non-200, 200-with-malformed-body, or unreachable condition, it emitsDEP-DOWN:staging-llm ...as a machine-readable Gitea Actions status description and exits 70 (config-missing is exit 71).Wired into the pr-validate job of
e2e-staging-saas.ymlas the proof-of-concept lane.Why (2026-06-12 staging LLM outage)
4 completion-gated lanes went red identically with no signal distinguishing "dependency down" from "real code bug." Triage required forensic log-diffing and initially mis-attributed an unrelated deploy-path bug to the outage (the /statuses pagination fix mentioned in the issue body). The
DEP-DOWN:staging-llmconvention lets the redgate-reporter dedup N identical reds into ONE incident issue.Test plan
bash tests/e2e/test_llm_proxy_preflight_unit.sh— 5 unit tests PASSpython3 .gitea/scripts/lint-workflow-yaml.py— 61 workflows cleanScope kept tight (deliberately)
tests/e2e/lib/, matching the pattern of the existingcompletion_assert.sh/model_slug.sh/aws_leak_check.shsiblings.pr-validateine2e-staging-saas.yml) as the proof-of-concept. The other 3 completion-gated lanes (local-provision-e2e.ymland the 2 remaininge2e-staging-saas.ymljob blocks) are mechanically derivable — same 3 lines per lane, same source block, same path filter additions. Tracked as a follow-up to keep this PR focused.E2E_LLM_PROXY_URL, withMOLECULE_CP_URL-based derivation as the default.local-provisionoverridesE2E_LLM_PROXY_URLto point at its own built-in proxy.Refs core#2675.
Adds a reusable shell preflight that completion-gated e2e lanes can source before booting workspaces. The preflight makes ONE cheap completion through the staging LLM proxy with a 30s timeout. On any non-200, 200-with-malformed-body, or unreachable condition, it emits 'DEP-DOWN:staging-llm ...' as a machine-readable Gitea Actions status description and exits 70 (config-missing is exit 71). Why this matters (2026-06-12 staging LLM outage): 4 completion-gated lanes went red identically with no signal distinguishing 'dependency down' from 'real code bug.' Triage required forensic log-diffing and initially mis-attributed an unrelated deploy-path bug to the outage (the /statuses pagination fix mentioned in the issue body). The DEP-DOWN:staging-llm convention lets the redgate-reporter dedup N identical reds into ONE incident issue. Wired into the pr-validate job of e2e-staging-saas.yml as the proof-of-concept lane; the other 3 completion-gated lanes (local-provision-e2e.yml and the 2 remaining e2e-staging-saas.yml job blocks) are mechanically derivable and tracked in a follow-up issue to keep this PR's diff focused. Files: + tests/e2e/lib/llm_proxy_preflight.sh — the helper + tests/e2e/test_llm_proxy_preflight_unit.sh — 5 unit tests covering config-missing, unreachable, 200-empty-body, ok, 503 ~ .gitea/workflows/e2e-staging-saas.yml — wires the helper into pr-validate + path filter additions for the new lib + test files Tests: bash tests/e2e/test_llm_proxy_preflight_unit.sh → all 5 PASS. Workflow lint: lint-workflow-yaml.py clean. Scope kept tight: - Workspace-server code NOT touched (this is CI/Python, not Go — consistent with the other 3 lanes that this PR is the proof-of-concept for). - The redgate-reporter's dedup logic is external and out of scope for this PR. The convention (status description prefix + distinct exit codes) is the SSOT — the team can wire the redgate-reporter's parser in a separate change. - LLM proxy URL is configurable via E2E_LLM_PROXY_URL, with MOLECULE_CP_URL-based derivation as the default. Local-provision overrides E2E_LLM_PROXY_URL to its own proxy. Refs core#2675. Co-Authored-By: Claude <noreply@anthropic.com>REQUEST_CHANGES on head
28da216e0f9209c6a4f6022c27194f5393e8ea18.Findings:
Required CI is red, so this cannot be approved.
CI / Shellcheck (E2E scripts)fails on the new unit test script withSC2034attests/e2e/test_llm_proxy_preflight_unit.sh:182(E2E_LLM_PROXY_URL appears unused). BecauseCI / all-requiredis skipped as a consequence, the requested approval precondition is not met. Please fix the shellcheck warning rather than relying on ceremony gates.The config-missing path returns the right exit code (
71) but emits the sameDEP-DOWN:staging-llmprefix attests/e2e/lib/llm_proxy_preflight.sh:65-67. The PR’s core contract is that redgate classifies dependency-down incidents by theDEP-DOWN:staging-llmprefix, whileboth URLs unset -> 71 config-missingis operator/config error and should not dedup as a staging LLM outage. If the reporter keys on the prefix as stated, this branch is ambiguous/misclassified. Please make the config-missing status text distinct from theDEP-DOWN:staging-llmdependency prefix, or explicitly prove the reporter keys on exit code before prefix.What looks good: local execution of
tests/e2e/test_llm_proxy_preflight_unit.shpasses the modeled cases (200+choices -> 0; unreachable/503/200-malformed -> 70; no URL inputs -> 71), and wiring one of four completion-gated lanes as the PoC scope is acceptable. But the CI failure and prefix/exit-code contract gap need correction before approval.REQUEST_CHANGES on head
28da216e.Independent 5-axis review found this is not approvable yet:
Required CI is red.
CI / Shellcheck (E2E scripts)is failing on the new unit script, soCI/all-requiredis skipped. The review request explicitly requires approval only afterCI/all-requiredis green on head.The exit-code/status-prefix contract is ambiguous in the config-missing branch. The PR defines
DEP-DOWN:staging-llmas the SSOT prefix redgate-reporter uses to classify staging LLM dependency outages, and separately defines both URLs unset as exit 71 config-missing/operator error. Butllm_proxy_preflightstill emits:for the exit 71 path. If the reporter keys on the prefix as stated, config errors will dedup as staging LLM outages. Please make config-missing use a distinct prefix/message, or prove/update the reporter contract so exit code 71 takes precedence over the DEP-DOWN prefix.
The 200+choices, non-200/unreachable, and 200-malformed behavior looks directionally correct, and wiring one completion-gated lane as the PoC is acceptable scope. But the red required CI and the dependency-down/config-missing classification gap need fixing before approval.
The previous fix emitted 'DEP-DOWN:staging-llm (config-missing)' on the E2E_LLM_PROXY_URL+MOLECULE_CP_URL-both-unset path. The redgate- reporter dedups on the DEP-DOWN:staging-llm prefix against live dependency outages — folding the config-missing case into that bucket would conflate operator error (a mis-wired lane) with infrastructure outage, suppressing the operator-fix signal. Fix: emit 'CONFIG-MISSING:staging-llm-proxy-url' on the exit-71 path instead. The two prefixes dedup separately in the redgate-reporter: DEP-DOWN:staging-llm — live LLM proxy outage (many runs/lanes dedup into one incident issue) CONFIG-MISSING:staging-llm-proxy-url — operator-misconfigured lane (dedup across runs/lanes that share the same missing env) lib doc comment updated to call out the prefix contract. Test updated: test_config_missing now asserts the CONFIG-MISSING prefix AND that DEP-DOWN:staging-llm is NOT present (the two prefixes must never co-occur in the same output line). All 5 unit tests still PASS. Shellcheck clean. Refs #2763. Co-Authored-By: Claude <noreply@anthropic.com>APPROVED on head
905b8d93.Verified my #11458 blockers are resolved:
E2E_LLM_PROXY_URLin the unit cases that intentionally pass the value across the sourcedtests/e2e/lib/llm_proxy_preflight.shboundary. This is genuine cross-source use, not a suppression hiding dead state.CONFIG-MISSING:staging-llm-proxy-url, notDEP-DOWN:staging-llm; the unit test asserts CONFIG-MISSING is present and DEP-DOWN is absent for that branch.Local verification:
bash tests/e2e/test_llm_proxy_preflight_unit.shpasses all five cases. Remote required CI is green on this head:CI / all-requiredsuccessful andCI / Shellcheck (E2E scripts)successful. Remaining red contexts are approval/review gates that this approval is intended to clear.APPROVED on head
905b8d93.Re-review after CR2 #11459:
CI / Shellcheck (E2E scripts)is green, andCI/all-requiredis green.CONFIG-MISSING:staging-llm-proxy-url, and the unit test asserts thatDEP-DOWN:staging-llmis absent for that branch.5-axis review:
/sop-ack