fix(core#3162): fail-CLOSED on readStoredProviderSecret decrypt/read error #3165
Reference in New Issue
Block a user
Delete Branch "fix/3162-byok-fail-closed"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
What
Closes core#3162.
readStoredProviderSecret(inworkspace-server/internal/handlers/platform_agent.go) used to collapse any read/decrypt error into""and treat it as "unset", so a transient decrypt failure on an existingLLM_PROVIDERrow combined with the empty-MODEL platform pin (core#3160) could silently mis-pin a BYOK/self-host concierge toLLM_PROVIDER=platformand mis-route it through the platform LLM proxy.The fix
readStoredProviderSecretnow returns(string, error)and distinguishes three observed states so the caller can fail closed on a real error:(value, nil)("", nil)sql.ErrNoRows(genuine unset)("", err)DecryptVersionederrorensureConciergeProviderwas updated to handle the new signature: onreadErr != nilit logs and returns without pinning, so the next provision re-tries rather than silently mis-routing.Why this scope (one item, no batching)
readStoredModelSecrethas the same fail-open shape (the function's old docstring even says "Mirrors readStoredModelSecret"), but it is on a different code path (MODEL is the customer's model pick, not a provider pin). The issue body scopes the fix toreadStoredProviderSecret, and PM's standing rule says "one item, don't batch more core". A follow-up issue will track the parallel MODEL fix separately.Tests (4 new + 2 existing)
TestEnsureConciergeProvider_FailsClosedOnReadError/decrypt_error_on_existing_row_fails_closed_(does_NOT_seed_platform)— the primary regression: an unreadable existing row now fails closed (no platform pin). Uses corrupt ciphertext to forceDecryptVersionedto return an error.…/db_scan_error_(non-ErrNoRows)_fails_closed— connection-level errors also fail closed.…/sql.ErrNoRows_(genuine_unset)_proceeds_to_seed_(no_regression)— the fresh-boot path still re-seeds.…/existing_stored_provider_is_respected_on_successful_read_(no_regression)— customer-pick path still wins.TestEnsureConciergeProvider_EmptyModelPins(empty-MODEL pins, BYOK non-platform model skips) continues to pass.Full handler test suite passes (
go test ./internal/handlers/→ ok 38.546s).go build ./...andgo vet ./...clean.Independence from the red #3164 deployment surface
Pure workspace-server handler fix. No concierge / MCP / heartbeat / identity-gate / platform-side touched. Safe to merge on the core-lane.
Gate (per PM)
unit-tests+e2efor workspace-serverCloses core#3162: a transient decrypt/read error on an existing LLM_PROVIDER workspace_secret used to be collapsed into an empty string and treated as "unset" by readStoredProviderSecret. Combined with the empty-MODEL platform pin (core#3160), a BYOK/self-host concierge that hit a transient decrypt failure while its MODEL was momentarily empty (rebuilt-from-DB payload) could be silently mis-pinned to LLM_PROVIDER=platform and mis-routed through the platform LLM proxy. Fix --- * readStoredProviderSecret now returns (string, error) and distinguishes: - (value, nil) → secret is stored and decrypted; caller respects it - ("", nil) → sql.ErrNoRows (genuine unset; re-seed is safe) - ("", err) → any other Scan error or DecryptVersioned error; caller MUST fail closed (not seed platform) * ensureConciergeProvider fails closed on the ('', err) case: logs the error and returns without seeding LLM_PROVIDER, so the next provision re-tries rather than silently mis-routing. Scope discipline (per PM's "one item, don't batch more core") --- * readStoredModelSecret has the same fail-open shape (its docstring even says "Mirrors readStoredModelSecret") but is intentionally OUT OF SCOPE for this PR. The MODEL read is not on the BYOK-leak path the issue body describes (MODEL is the customer's model pick, not a provider pin), and per PM's standing rule, a follow-up issue will track it separately. Tests (4 new + 2 existing) --- * TestEnsureConciergeProvider_FailsClosedOnReadError/ decrypt_error_on_existing_row_fails_closed_(does_NOT_seed_platform) — the primary regression: the fail-OPEN case the issue describes is now closed. * …/db_scan_error_(non-ErrNoRows)_fails_closed — connection-level errors also fail closed. * …/sql.ErrNoRows_(genuine_unset)_proceeds_to_seed_(no_regression) — the fresh-boot path still re-seeds. * …/existing_stored_provider_is_respected_on_successful_read_(no_regression) — the customer-pick path still wins. * The pre-existing TestEnsureConciergeProvider_EmptyModelPins tests (empty-MODEL pins, BYOK non-platform model skips) continue to pass. Independence from the red #3164 deployment surface --- * Pure server-side fix to the secret-read path in workspace-server. No concierge / MCP / heartbeat / identity-gate / platform-side surface touched. Safe to merge on the core-lane. Closes: core#3162 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>APPROVE on
7c3840fee9(target=main).Security/RCA review: the provider-secret fail-open path is closed for this scope. readStoredProviderSecret now distinguishes (value,nil), ("",nil) for sql.ErrNoRows, and ("",err) for scan/decrypt failures; ensureConciergeProvider fails closed on the error case and does not seed LLM_PROVIDER=platform. A grep of readStoredProviderSecret shows this is the only caller, so there is no alternate provider-secret read path still failing open.
Availability tradeoff: acceptable. A transient DB/decrypt failure can temporarily avoid seeding a legitimate platform provider, but that is a fail-closed provision/retry posture. It is preferable to silently converting an unreadable existing BYOK/self-host provider row into a platform provider pin. Genuine unset (sql.ErrNoRows) still seeds, and existing successfully-read provider values are still respected.
Scope note: readStoredModelSecret still had the same fail-open shape, but it is explicitly a sibling fix (#3166) and not bundled here. Technical CI is green on this head: CI / all-required succeeded; Platform Go and handler integration contexts are green. Overall status remains red only because qa/security/governance gates are expected to stage after 2-genuine.
APPROVE on
7c3840fee9(target=main).Security review:
CI: required own-head contexts green, including CI/all-required, E2E API Smoke, qa/security review gates, and reserved-path review. Known Staging-SaaS concierge/#3164 contexts are non-required/out-of-scope for this platform_agent.go fail-closed fix.