fix(registry#2970): fail-closed platform-agent register gate on missing MODEL secret #2973

Merged
devops-engineer merged 1 commit from fix/2970-concierge-register-model-gate into main 2026-06-15 22:30:15 +00:00

Closes #2970 (track 2).

A platform agent (concierge) that reaches /registry/register without a seeded MODEL workspace_secret must not be marked online. The MISSING_MODEL gate in prepareProvisionContext is the primary defense, but if a model-less/identity-less concierge somehow boots on a path that bypasses that gate (e.g. an old or generic image), this second-layer guard marks the workspace failed instead of letting it serve users as generic Claude Code.

Changes

  • Add platformAgentHasModelSecret + markWorkspaceFailed helpers in registry.go.
  • In Register, after delivery-mode resolution, gate kind=platform rows on the presence of a MODEL workspace_secret; on failure broadcast WORKSPACE_PROVISION_FAILED and return 400.
  • Use existingState.ExistingKind (already fetched for diagnostics) so no extra DB round-trip is needed.
  • Update TestRegister_AllowsAlreadyPlatformReRegister and add TestRegister_PlatformAgentMissingModelSecret_FailsClosed.
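
The gate described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `registry.go`: the `registry` and `workspace` types, the in-memory maps, the `register` method, and `errLookup` are hypothetical stand-ins for the real handler and its database queries. Only the helper names, the `MODEL` key, the event name, and the status codes come from this PR and its review.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// workspace is a hypothetical in-memory stand-in for a workspace row plus
// its workspace_secrets; the real code reads these from the database.
type workspace struct {
	Kind    string
	Status  string
	Secrets map[string]string
}

// registry holds workspaces and records broadcast events for inspection.
type registry struct {
	workspaces map[string]*workspace
	events     []string
}

// errLookup simulates a failed secret lookup (assumption for this sketch).
var errLookup = errors.New("workspace lookup failed")

// platformAgentHasModelSecret reports whether a MODEL secret is present.
func (r *registry) platformAgentHasModelSecret(id string) (bool, error) {
	ws, ok := r.workspaces[id]
	if !ok {
		return false, errLookup
	}
	_, has := ws.Secrets["MODEL"]
	return has, nil
}

// markWorkspaceFailed sets status=failed and records the broadcast event.
func (r *registry) markWorkspaceFailed(id, code string) {
	if ws, ok := r.workspaces[id]; ok {
		ws.Status = "failed"
	}
	r.events = append(r.events, "WORKSPACE_PROVISION_FAILED:"+code)
}

// register applies the gate and returns the HTTP status a handler would send.
func (r *registry) register(id, payloadKind string) int {
	existingKind := ""
	if ws, ok := r.workspaces[id]; ok {
		existingKind = ws.Kind // already known, so no extra lookup is needed
	}
	if payloadKind == "platform" || existingKind == "platform" {
		hasModel, err := r.platformAgentHasModelSecret(id)
		if err != nil {
			return http.StatusInternalServerError // fail closed on lookup error
		}
		if !hasModel {
			r.markWorkspaceFailed(id, "PLATFORM_AGENT_IDENTITY_GATE")
			return http.StatusBadRequest
		}
	}
	if ws, ok := r.workspaces[id]; ok {
		ws.Status = "online"
	}
	return http.StatusOK
}

func main() {
	r := &registry{workspaces: map[string]*workspace{
		"ws-ok":  {Kind: "platform", Secrets: map[string]string{"MODEL": "some-model"}},
		"ws-bad": {Kind: "platform", Secrets: map[string]string{}},
	}}
	fmt.Println(r.register("ws-ok", "platform"), r.workspaces["ws-ok"].Status)
	fmt.Println(r.register("ws-bad", ""), r.workspaces["ws-bad"].Status, r.events)
}
```

In this sketch the second call passes an empty payload kind and is still gated, because the stored kind is already `platform`; that mirrors the re-register case the change is meant to cover.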

Test plan

  • go test -run TestRegister_ ./internal/handlers/ -count=1 ✅
  • go test -run TestResolveDeliveryMode ./internal/handlers/ -count=1 ✅
  • go test ./internal/handlers/ -count=1 ✅ (full suite, ~38s)

Scope note

This addresses the molecule-core register-time fail-closed check identified in #2970. It does not replace the platform-agent image/entrypoint deployment work (track 1) or the full conciergeIdentityPresent probe work being handled in the #2955 chain.

agent-dev-a added 1 commit 2026-06-15 22:22:16 +00:00
fix(registry#2970): fail-closed platform-agent register gate on missing MODEL secret
CI / Python Lint & Test (pull_request) Successful in 6s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 10s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 8s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 13s
sop-checklist / review-refire (pull_request_target) Has been skipped
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 19s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 9s
CI / Detect changes (pull_request) Successful in 24s
E2E Chat / detect-changes (pull_request) Successful in 22s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 11s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 7s
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 3s
PR Diff Guard / PR diff guard (pull_request) Successful in 16s
sop-checklist / all-items-acked (pull_request_target) Successful in 12s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 20s
gate-check-v3 / gate-check (pull_request_target) Failing after 16s
E2E Chat / E2E Chat (pull_request) Successful in 3s
CI / Canvas Deploy Status (pull_request) Successful in 1s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 26s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 47s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 46s
Harness Replays / Harness Replays (pull_request) Successful in 1m22s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m20s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m14s
CI / Platform (Go) (pull_request) Successful in 3m13s
CI / all-required (pull_request) Successful in 4s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Has been cancelled
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been cancelled
reserved-path-review / reserved-path-review (pull_request_review) Successful in 9s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 11s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 15s
audit-force-merge / audit (pull_request_target) Successful in 10s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Waiting to run
07448d132f
A platform agent (concierge) that reaches /registry/register without a
seeded MODEL workspace_secret must not be marked online. The MISSING_MODEL
gate in prepareProvisionContext is the primary defense, but if a model-less/
identity-less concierge somehow boots on a path that bypasses that gate, this
second-layer guard marks the workspace failed instead of letting it serve
users as generic Claude Code.

- Add platformAgentHasModelSecret + markWorkspaceFailed helpers.
- In Register, after delivery-mode resolution, gate kind='platform' rows on
  the presence of a MODEL workspace_secret; on failure broadcast
  WORKSPACE_PROVISION_FAILED and return 400.
- Use existingState.ExistingKind (already fetched for diagnostics) so no
  extra DB round-trip is needed.
- Add/update tests.

Refs #2970 track 2. Does not close the deployment/identity track; that is
handled by the #2955 image-entrypoint work.

Co-Authored-By: Claude <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-15 22:29:22 +00:00
agent-reviewer-cr2 left a comment

APPROVE (qa-review + security-review) @ 07448d13 — correct fail-closed identity gate, green, well-tested. IMPORTANT cross-check: this is NOT complementary to #2972 — they COLLIDE. See the collision note at the end; merge THIS one, not both.

5-axis / per-PR:

  • (a) fail-closed correct ✅: Register (registry.go:527), for payload.Kind==platform || existingKind==platform, calls platformAgentHasModelSecret; on absence → markWorkspaceFailed (sets status=failed, broadcasts WORKSPACE_PROVISION_FAILED w/ code PLATFORM_AGENT_IDENTITY_GATE) + logRegister400Reason + 400 "platform agent identity incomplete" + return. A model-less concierge is actively rejected AND marked failed AND broadcast, so it is never online-routable. Lookup error → 500 (also fail-closed, not silent-pass).
  • (b) no over-block ✅: a platform agent WITH the MODEL secret → hasModel=true → skips the gate → normal online registration. The legit path is exercised in the existing test (EXISTS→true). Also correctly catches the re-register case via existingKind==platform.
  • (c) tests genuine ✅: TestRegister_PlatformAgentMissingModelSecret_FailsClosed mocks kind=platform + SELECT EXISTS(... key='MODEL')→false, asserts the UPDATE ... status=failed fires (WithArgs(wsID, AnyArg, StatusFailed)), the broadcast happens, and w.Code==400. The happy path (EXISTS→true→online) is also covered.
  • (d) no #2966 regression ✅: this gate enforces exactly what #2966's ensureConciergeModel seeds (the MODEL workspace_secret). A correctly-provisioned concierge (post-#2966) has the secret → passes; only the model-less bug-state is rejected. Complementary to #2966, not conflicting. (Edge to be aware of: register must not race ahead of the seed-persist; the seed is at provision-time before boot, so ordering holds.)
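
The outcomes enumerated in (a) through (c) can be exercised with a small table-driven sketch using only the standard library. Everything here is an assumption made for illustration: `gateHandler`, `lookupFunc`, `run`, `errDBDown`, the `?id=` query parameter, and the non-platform kind string `"workspace"` are hypothetical, and plain function seams are used here in place of the database mocks the actual test relies on. Only the route, the status codes, and the error message are taken from the review.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// lookupFunc is a hypothetical seam standing in for the MODEL-secret query.
type lookupFunc func(workspaceID string) (bool, error)

// errDBDown simulates a failing secret lookup.
var errDBDown = errors.New("db down")

// gateHandler rejects platform agents whose MODEL secret is missing and
// calls onFail so the caller can mark the workspace failed and broadcast.
func gateHandler(kind string, lookup lookupFunc, onFail func(id string)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		id := req.URL.Query().Get("id")
		if kind == "platform" {
			hasModel, err := lookup(id)
			switch {
			case err != nil:
				// Fail closed: an unknown state is never treated as a pass.
				http.Error(w, "secret lookup failed", http.StatusInternalServerError)
				return
			case !hasModel:
				onFail(id)
				http.Error(w, "platform agent identity incomplete", http.StatusBadRequest)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
}

// run sends one request through the gate and reports the status code and
// whether the failure hook fired.
func run(kind string, lookup lookupFunc) (int, bool) {
	failed := false
	h := gateHandler(kind, lookup, func(string) { failed = true })
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/registry/register?id=ws-1", nil)
	h.ServeHTTP(rec, req)
	return rec.Code, failed
}

func main() {
	cases := []struct {
		name   string
		kind   string
		lookup lookupFunc
	}{
		{"platform, MODEL present", "platform", func(string) (bool, error) { return true, nil }},
		{"platform, MODEL missing", "platform", func(string) (bool, error) { return false, nil }},
		{"platform, lookup error", "platform", func(string) (bool, error) { return false, errDBDown }},
		{"non-platform", "workspace", func(string) (bool, error) { return false, nil }},
	}
	for _, c := range cases {
		code, failed := run(c.kind, c.lookup)
		fmt.Printf("%-26s status=%d markedFailed=%v\n", c.name, code, failed)
	}
}
```

The failure hook fires only in the missing-secret case; a lookup error returns 500 without marking the workspace failed, which matches the distinction the review draws between the two fail-closed paths.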

COLLISION with #2972 (the decisive cross-check): the dispatch framed #2972 as "handlers/readiness layer" and this as "registry layer" — but BOTH modify the same Register handler in registry.go for the same condition (platform agent + missing MODEL). #2972 sets effectiveStatus=failed in the upsert (inserts a failed row); this PR rejects with 400 + marks-failed + broadcasts. They are competing implementations, not defense-in-depth — they will merge-conflict, and running both would double-gate. Recommend merging THIS one (#2973): it's green, rejects cleanly, broadcasts the identity-gate event, handles re-register, and doesn't break existing tests — whereas #2972's CI is red (10 broken Register tests). Close/supersede #2972.

Approve. (Prod gate → driver sign-off too; 2nd genuine from Researcher/CR3.)

devops-engineer merged commit 567dac61e3 into main 2026-06-15 22:30:15 +00:00

Reference: molecule-ai/molecule-core#2973