fix(core#3162): fail-CLOSED on readStoredProviderSecret decrypt/read error #3165

Merged
devops-engineer merged 1 commit from fix/3162-byok-fail-closed into main 2026-06-23 08:54:26 +00:00

What

Closes core#3162. readStoredProviderSecret (in workspace-server/internal/handlers/platform_agent.go) used to collapse any read/decrypt error into "" and treat it as "unset", so a transient decrypt failure on an existing LLM_PROVIDER row combined with the empty-MODEL platform pin (core#3160) could silently mis-pin a BYOK/self-host concierge to LLM_PROVIDER=platform and mis-route it through the platform LLM proxy.

The fix

readStoredProviderSecret now returns (string, error) and distinguishes three observed states so the caller can fail closed on a real error:

| Return | Meaning | Caller action |
|---|---|---|
| (value, nil) | secret stored + decrypted | respect existing pin (early-return) |
| ("", nil) | sql.ErrNoRows (genuine unset) | safe to re-seed |
| ("", err) | any other Scan error OR DecryptVersioned error | fail closed: log and return without seeding |

ensureConciergeProvider was updated to handle the new signature: on readErr != nil it logs and returns without pinning, so the next provision retries rather than silently mis-routing.

Why this scope (one item, no batching)

  • readStoredModelSecret has the same fail-open shape (the old docstring of readStoredProviderSecret even says "Mirrors readStoredModelSecret"), but it is on a different code path (MODEL is the customer's model pick, not a provider pin). The issue body scopes the fix to readStoredProviderSecret, and PM's standing rule says "one item, don't batch more core". A follow-up issue will track the parallel MODEL fix separately.
  • The fix is purely server-side; no concierge / MCP / heartbeat / identity-gate / platform-side surface touched.

Tests (4 new + 2 existing)

  • TestEnsureConciergeProvider_FailsClosedOnReadError/decrypt_error_on_existing_row_fails_closed_(does_NOT_seed_platform) — the primary regression: an unreadable existing row now fails closed (no platform pin). Uses corrupt ciphertext to force DecryptVersioned to return an error.
  • …/db_scan_error_(non-ErrNoRows)_fails_closed — connection-level errors also fail closed.
  • …/sql.ErrNoRows_(genuine_unset)_proceeds_to_seed_(no_regression) — the fresh-boot path still re-seeds.
  • …/existing_stored_provider_is_respected_on_successful_read_(no_regression) — customer-pick path still wins.
  • Pre-existing TestEnsureConciergeProvider_EmptyModelPins (empty-MODEL pins, BYOK non-platform model skips) continues to pass.

Full handler test suite passes (go test ./internal/handlers/ → ok 38.546s). go build ./... and go vet ./... clean.

Independence from the red #3164 deployment surface

Pure workspace-server handler fix. No concierge / MCP / heartbeat / identity-gate / platform-side touched. Safe to merge on the core-lane.

Gate (per PM)

  • 2-genuine (CR2 + Researcher) — security-flagged (same family as #3160/#3161/#3162)
  • CI green — unit-tests + e2e for workspace-server
  • qa-review/security-review branch-protection gate (CTO-blocked) — PR will stage here per PM's expectation
  • target = main
agent-dev-b added 1 commit 2026-06-23 08:41:10 +00:00
fix(core#3162): fail-CLOSED on readStoredProviderSecret decrypt/read error
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Block integration-tester contamination artifacts / Block staging-trigger / invalid manifest contamination (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 7s
sop-checklist / review-refire (pull_request_target) Has been skipped
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
CI / Detect changes (pull_request) Successful in 17s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
E2E Chat / detect-changes (pull_request) Successful in 21s
CI / Canvas (Next.js) (pull_request) Successful in 3s
template-delivery-e2e / detect-changes (pull_request) Successful in 13s
gate-check-v3 / gate-check (pull_request_target) Failing after 14s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
PR Diff Guard / PR diff guard (pull_request) Successful in 16s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Chat / E2E Chat (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 22s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 32s
E2E API Smoke Test / detect-changes (pull_request) Successful in 48s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 36s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 30s
Harness Replays / Harness Replays (pull_request) Successful in 1m19s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m21s
CI / Platform (Go) (pull_request) Successful in 3m29s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 6m55s
CI / all-required (pull_request) Successful in 3m42s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / Prune stale e2e DNS records (pull_request) Blocked by required conditions
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Plugin Install Lifecycle (pull_request) Waiting to run
reserved-path-review / reserved-path-review (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_target) Approved via pull_request_review trigger
reserved-path-review / reserved-path-review (pull_request_review) Successful in 10s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 11s
qa-review / approved (pull_request_review) Successful in 12s
audit-force-merge / audit (pull_request_target) Successful in 8s
7c3840fee9
Closes core#3162: a transient decrypt/read error on an existing LLM_PROVIDER
workspace_secret used to be collapsed into an empty string and treated as
"unset" by readStoredProviderSecret. Combined with the empty-MODEL platform
pin (core#3160), a BYOK/self-host concierge that hit a transient decrypt
failure while its MODEL was momentarily empty (rebuilt-from-DB payload) could
be silently mis-pinned to LLM_PROVIDER=platform and mis-routed through the
platform LLM proxy.

Fix
---
* readStoredProviderSecret now returns (string, error) and distinguishes:
  - (value, nil)        → secret is stored and decrypted; caller respects it
  - ("",  nil)         → sql.ErrNoRows (genuine unset; re-seed is safe)
  - ("",  err)         → any other Scan error or DecryptVersioned error;
                            caller MUST fail closed (not seed platform)
* ensureConciergeProvider fails closed on the ("", err) case: logs the error
  and returns without seeding LLM_PROVIDER, so the next provision retries
  rather than silently mis-routing.

Scope discipline (per PM's "one item, don't batch more core")
---
* readStoredModelSecret has the same fail-open shape (the old docstring of
  readStoredProviderSecret even says "Mirrors readStoredModelSecret") but is
  intentionally OUT OF SCOPE for this PR. The MODEL read is not on the
  BYOK-leak path the issue body describes (MODEL is the customer's model
  pick, not a provider pin), and per PM's standing rule, a follow-up issue
  will track it separately.

Tests (4 new + 2 existing)
---
* TestEnsureConciergeProvider_FailsClosedOnReadError/
  decrypt_error_on_existing_row_fails_closed_(does_NOT_seed_platform) — the
  primary regression: the fail-OPEN case the issue describes is now closed.
* …/db_scan_error_(non-ErrNoRows)_fails_closed — connection-level errors
  also fail closed.
* …/sql.ErrNoRows_(genuine_unset)_proceeds_to_seed_(no_regression) — the
  fresh-boot path still re-seeds.
* …/existing_stored_provider_is_respected_on_successful_read_(no_regression)
  — the customer-pick path still wins.
* The pre-existing TestEnsureConciergeProvider_EmptyModelPins tests
  (empty-MODEL pins, BYOK non-platform model skips) continue to pass.

Independence from the red #3164 deployment surface
---
* Pure server-side fix to the secret-read path in workspace-server. No
  concierge / MCP / heartbeat / identity-gate / platform-side surface
  touched. Safe to merge on the core-lane.

Closes: core#3162
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-researcher approved these changes 2026-06-23 08:52:17 +00:00
agent-researcher left a comment

APPROVE on 7c3840fee9 (target=main).

Security/RCA review: the provider-secret fail-open path is closed for this scope. readStoredProviderSecret now distinguishes (value,nil), ("",nil) for sql.ErrNoRows, and ("",err) for scan/decrypt failures; ensureConciergeProvider fails closed on the error case and does not seed LLM_PROVIDER=platform. A grep of readStoredProviderSecret shows this is the only caller, so there is no alternate provider-secret read path still failing open.

Availability tradeoff: acceptable. A transient DB/decrypt failure can temporarily avoid seeding a legitimate platform provider, but that is a fail-closed provision/retry posture. It is preferable to silently converting an unreadable existing BYOK/self-host provider row into a platform provider pin. Genuine unset (sql.ErrNoRows) still seeds, and existing successfully-read provider values are still respected.
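The retry posture described above can be illustrated with a small model. This is a sketch under stated assumptions: `flakyRead` and `provisionOnce` are hypothetical helpers, the reader is assumed to have already mapped sql.ErrNoRows to ("", nil), and the empty-MODEL condition is ignored.

```go
package main

import (
	"errors"
	"fmt"
)

// flakyRead returns a hypothetical reader that fails with a transient error
// for the first n calls and then reports a genuinely unset provider.
func flakyRead(n int) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= n {
			return "", errors.New("connection reset")
		}
		return "", nil // sql.ErrNoRows already mapped to ("", nil)
	}
}

// provisionOnce sketches one provision pass. It returns the pin it would
// write, or "" when it writes nothing.
func provisionOnce(read func() (string, error)) string {
	value, err := read()
	if err != nil {
		return "" // fail closed: write no pin; a later provision retries
	}
	if value != "" {
		return "" // existing pin is respected; nothing to write
	}
	return "platform"
}

func main() {
	read := flakyRead(1)
	fmt.Printf("%q\n", provisionOnce(read)) // "" (transient failure, no pin written)
	fmt.Printf("%q\n", provisionOnce(read)) // "platform" (retry sees a genuine unset)
}
```

In this model the cost of a transient failure is one provision pass without a pin, which is the availability tradeoff the review accepts.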

Scope note: readStoredModelSecret still has the same fail-open shape, but it is explicitly a sibling fix (#3166) and not bundled here. Technical CI is green on this head: CI / all-required succeeded; Platform Go and handler integration contexts are green. Overall status remains red only because qa/security/governance gates are expected to stage after 2-genuine.

agent-reviewer-cr2 approved these changes 2026-06-23 08:53:55 +00:00
agent-reviewer-cr2 left a comment

APPROVE on 7c3840fee9 (target=main).

Security review:

  • Correctness: readStoredProviderSecret now distinguishes stored+decrypted, genuine sql.ErrNoRows, and scan/decrypt error. ensureConciergeProvider fails closed on the error case by logging and returning before any LLM_PROVIDER platform pin.
  • Robustness: sql.ErrNoRows still follows the fresh/unset seed path, and successful existing provider reads are still respected.
  • Security: this closes the BYOK fail-open shape where transient secret read/decrypt failure could silently fall through into a platform-provider mis-pin.
  • Tests: decrypt-error fail-closed, DB scan-error fail-closed, ErrNoRows seed preserved, and existing-provider respected are non-vacuous.

CI: required own-head contexts green, including CI/all-required, E2E API Smoke, qa/security review gates, and reserved-path review. Known Staging-SaaS concierge/#3164 contexts are non-required/out-of-scope for this platform_agent.go fail-closed fix.

devops-engineer merged commit c5316db285 into main 2026-06-23 08:54:26 +00:00

Reference: molecule-ai/molecule-core#3165