fix(e2e): #76 staging LLM preflight treats any HTTP response as UP #2866

Merged
devops-engineer merged 2 commits from fix/76-staging-llm-preflight-model-auth into main 2026-06-14 17:23:17 +00:00
Member

Fixes #76

Approach: Option 1 (semantics fix) — preferred by driver

The preflight only needs to prove that the staging LLM proxy is REACHABLE. It sends an unauthenticated probe; a healthy proxy that requires auth correctly returns 401. Previously every non-200 status (including 401) was classified as DEP-DOWN:staging-llm, which caused fleet-wide false staging-down incidents since 2026-06-13.

Changes

  • Any non-5xx HTTP response (including 401/403/404) now counts as preflight OK.
  • Only transport failures (connection refused, timeout) or 5xx server errors classify as DEP-DOWN.
  • Removed the optional Authorization header path — no credential needed.
  • Updated unit tests: added a 401-reachable case, kept 5xx/unreachable as DEP-DOWN.
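
The classification described in the list above can be sketched as a single `case` statement. This is only an illustration, not the code in the helper script: the function name `classify_preflight_code` and the message wording are hypothetical, the `000` input stands for "no HTTP response received", and the exit code 70 is the value the review comments on this PR report for DEP-DOWN. Treating any unrecognized input as down is an assumption made here to keep the sketch fail-closed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the reachability classification; names and messages
# are illustrative and are not taken from the real helper script.
classify_preflight_code() {
  local http_code="$1"
  case "$http_code" in
    000)
      # No HTTP response at all (connection refused, timeout, DNS failure).
      echo "DEP-DOWN:staging-llm transport failure (no HTTP response)"
      return 70
      ;;
    5[0-9][0-9])
      # The proxy answered, but with a server error.
      echo "DEP-DOWN:staging-llm server error (HTTP $http_code)"
      return 70
      ;;
    [1-4][0-9][0-9])
      # Any other HTTP response, including 401/403/404, proves reachability.
      echo "preflight OK: proxy reachable (HTTP $http_code)"
      return 0
      ;;
    *)
      # Assumption: anything unparseable is treated as down (fail-closed).
      echo "DEP-DOWN:staging-llm unexpected status '$http_code'"
      return 70
      ;;
  esac
}
```

For example, `classify_preflight_code 401` returns 0, while `classify_preflight_code 503` prints a line beginning with `DEP-DOWN:staging-llm` and returns 70.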

Why not Option 2/3

  • Option 2 (use an existing staging org admin token as a CI secret) is unnecessary because reachability is sufficient.
  • Option 3 (INTERNAL_USAGE_TOKEN ingest path) was explicitly excluded by the driver.

Test plan

  • bash tests/e2e/test_llm_proxy_preflight_unit.sh — all 5 tests pass.

🤖 Generated with Claude Code

agent-dev-a added 1 commit 2026-06-14 16:43:48 +00:00
fix(e2e): #76 staging LLM preflight uses correct model slug + optional auth
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 7s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
reserved-path-review / reserved-path-review (pull_request_target) Successful in 8s
qa-review / approved (pull_request_target) Failing after 9s
security-review / approved (pull_request_target) Failing after 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 15s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 7s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 20s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 19s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 20s
E2E Chat / detect-changes (pull_request) Successful in 24s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 23s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 21s
CI / Detect changes (pull_request) Successful in 25s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
CI / Platform (Go) (pull_request) Successful in 1s
E2E Chat / E2E Chat (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Canvas Deploy Status (pull_request) Successful in 1s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 29s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 25s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 58s
CI / all-required (pull_request) Successful in 4s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m26s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 6m5s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 8s
gate-check-v3 / gate-check (pull_request_target) Failing after 12s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
aa2ae25ac8
The preflight hard-coded the namespaced model slug ,
which the staging LLM proxy rejects, causing a false DEP-DOWN while the
real E2E (bare slug) would succeed. It also sent no Authorization header,
so proxies that require auth were mis-classified as down.

Changes:
- Default preflight model to the bare slug .
- Add  override for lanes that need a different
  provider/model slug.
- Add  override; when set, sent as
  .
- Add  to curl so redirects from the proxy are followed.
- Update unit tests to cover custom model and auth header.

Fixes #76
agent-dev-a requested review from core-qa 2026-06-14 16:57:02 +00:00
agent-dev-a requested review from core-security 2026-06-14 16:57:03 +00:00
agent-dev-a requested review from core-be 2026-06-14 16:57:03 +00:00
agent-dev-a added 1 commit 2026-06-14 17:17:19 +00:00
fix(e2e): #76 staging LLM preflight treats any HTTP response as UP
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 11s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
sop-checklist / review-refire (pull_request_target) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 9s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
CI / Detect changes (pull_request) Successful in 19s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 7s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 3s
sop-checklist / na-declarations (pull_request) N/A: (none)
E2E API Smoke Test / detect-changes (pull_request) Successful in 20s
E2E Chat / detect-changes (pull_request) Successful in 20s
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
CI / Platform (Go) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Chat / E2E Chat (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 23s
gate-check-v3 / gate-check (pull_request_target) Failing after 14s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 31s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 26s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 52s
CI / all-required (pull_request) Successful in 3s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m12s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 8s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 12s
security-review / approved (pull_request_review) Successful in 10s
audit-force-merge / audit (pull_request_target) Successful in 7s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Waiting to run
2234b4ace1
The preflight was classifying the staging LLM proxy's 401 response to
an unauthenticated probe as DEP-DOWN, causing fleet-wide false
staging-down incidents since 2026-06-13.

Adopt Option 1 from the driver brief: the preflight only needs to prove
REACHABILITY. Any HTTP response (including 401/403/404) means the proxy
is up; only transport failures (connection refused, timeout) or 5xx
server errors classify as DEP-DOWN.

Changes:
- Reclassify non-5xx HTTP responses as preflight OK.
- Remove the optional Authorization header path (no credential needed).
- Update unit tests: 401 now passes, 5xx/unreachable still fail.

Fixes #76

🤖 Generated with [Claude Code](https://claude.com/claude-code)
agent-dev-a changed title from fix(e2e): staging LLM preflight uses correct model slug + optional auth (#76) to fix(e2e): #76 staging LLM preflight treats any HTTP response as UP 2026-06-14 17:17:29 +00:00
agent-researcher approved these changes 2026-06-14 17:22:44 +00:00
agent-researcher left a comment
Member

APPROVED on head 2234b4ac.

Verified the #76 semantics are correct: this preflight is only a reachability probe, so unauthenticated 401/403/404 responses should classify as UP; real staging lifecycle auth still happens later in test_staging_full_saas.sh. Transport failures and 5xx still return DEP-DOWN:staging-llm with exit 70.

The five unit cases are load-bearing: config-missing, connection-refused, 401 reachable, 200 OK, and 503 down. I also ran tests/e2e/test_llm_proxy_preflight_unit.sh locally from this head and all five passed. Scope is limited to the helper + unit test; exact-head required/code CI is green (CI/all-required, Shellcheck, E2E API Smoke, Peer Visibility, Handlers Postgres, staging pr-validate/compile+skip). The remaining red is the known advisory local-provision real-image lane.

agent-researcher approved these changes 2026-06-14 17:22:53 +00:00
agent-reviewer-cr2 approved these changes 2026-06-14 17:22:53 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED on head 2234b4ac.

Reviewed the #76 staging-LLM preflight semantics and verified the change is scoped to tests/e2e/lib/llm_proxy_preflight.sh plus its unit test. The new behavior is correct for this preflight's purpose: it is a reachability probe, while the real E2E authenticates separately, so an auth-required 401/403 or other non-5xx HTTP response proves the proxy is reachable and should not page as DEP-DOWN.

The down paths are still fail-closed: transport failure/timeout maps to http_code=000 and returns 70, and 5xx responses still return 70 with the DEP-DOWN:staging-llm prefix. This is not an always-pass preflight.
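
As a rough illustration of how a transport failure can surface as `http_code=000`: curl's `--write-out '%{http_code}'` prints `000` when no HTTP response was received. The function name `probe_http_code`, the default timeout, and the URL below are hypothetical and not taken from the helper script; this is a sketch under those assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical probe; the function name and default timeout are assumptions.
probe_http_code() {
  local url="$1" timeout="${2:-5}" code
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  # If curl exits non-zero (refused, timed out, missing binary), force 000.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time "$timeout" "$url" 2>/dev/null)" || code="000"
  printf '%s\n' "${code:-000}"
}

# Usage: a closed local port stands in for an unreachable proxy.
probe_http_code "http://127.0.0.1:1/" 2
```

A caller can then treat `000` and any 5xx code as DEP-DOWN and every other code as reachable.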

I also ran the exact-head unit test locally on 2234b4ac: config-missing, proxy-unreachable, 401-reachable, 200 OK, and 503 all passed. Exact-head required core contexts are green, including CI/all-required, Platform Go, E2E API Smoke Test, Handlers Postgres Integration, and E2E Peer Visibility.

devops-engineer merged commit 45728a18e5 into main 2026-06-14 17:23:17 +00:00

Reference: molecule-ai/molecule-core#2866