fix(workspace): re-project BYOK Anthropic creds on restart/recreate (#2739/#2712) #2741

Merged
devops-engineer merged 1 commits from fix/2739-reproject-byok-restart-recovery into main 2026-06-13 09:58:22 +00:00
Member

Root-cause gap

#2735 (core#2709/#2712) correctly fixed the BYOK Anthropic-adapter credential projection. After restart a Kimi/MiniMax claude-code workspace now gets ANTHROPIC_AUTH_TOKEN + ANTHROPIC_BASE_URL projected from its provider key (derived from the effective model when ProviderSelection is nil on a workspace_override billing mode). The #2739 RCA confirms container diagnostics show ANTHROPIC_AUTH_TOKEN=set, ANTHROPIC_BASE_URL=set, MINIMAX_API_KEY=set post-restart — so #2739 is NOT a projection gap.

The actual remaining bug is in the degraded→online recovery state machine (registry.go):

  • evaluateStatus only recovers degraded → online when there is no recent last_register_failure_at (5-minute window).
  • That marker is cleared on only two paths: (1) a successful /registry/register, and (2) the heartbeat agent_card backfill — but the backfill clear was bundled into the same UPDATE ... WHERE id = $1 AND agent_card IS NULL.
  • On RESTART the agent_card row is already populated (reconciled on first provision), so the NULL-scoped backfill never fires → the marker is never cleared via that path.
  • The restarted container's authenticated /registry/register 400s with url_validate_failed because it advertises a Docker-internal hostname (e.g. 212851b5693d) the platform can't resolve → last_register_failure_at gets stamped (registry.go:350-363).
  • Result: marker stuck → workspace held degraded for the full 5-minute window, which exceeds the Local Provision Lifecycle restart-survival poll (RESTART_TIMEOUT=240s). Run 358593 Step 4 expired at status=degraded.

This is exactly the RCA's recommended fix shape: "allow degraded→online recovery for heartbeating workspaces even when agent_card is already populated."

Fix

Decouple the register-failure clear from the NULL-scoped agent_card backfill. Clear last_register_failure_at on any heartbeat carrying a valid agent_card — a live card proves the runtime is alive and re-advertising the same reachable card the platform already trusts (the same trust signal the success-on-register clear relies on). The agent_card write stays scoped to IS NULL, so a reconciled card is never overwritten.

  • No hardcoded endpoints, no cross-provider leak — purely the recovery state machine.
  • Also fixes the original #2659/#2665 intent for the already-populated-card case.

Regression test

TestHeartbeatHandler_RegisterFailureClearedOnCardBearingRestart models the restart shape: heartbeat carries a card, backfill affects 0 rows (card already exists), the decoupled marker-clear affects 1 row, then evaluateStatus (now seeing NULL failure) recovers degraded → online.

Test plan / local results

  • go test ./internal/handlers/green (full package, 18.6s).
  • New regression test + related recovery/projection tests pass:
    • TestHeartbeatHandler_RegisterFailureClearedOnCardBearingRestart PASS
    • TestHeartbeat_DegradedRecovery PASS
    • TestHeartbeatHandler_Recovery PASS
    • TestApplyPlatformManagedLLMEnv_BYOKMiniMaxWorkspaceOverrideProjectsCreds (#2735) PASS
  • go vet ./internal/handlers/ clean.

Closes #2739
Refs #2712 #2735

🤖 Generated with Claude Code

## Root-cause gap #2735 (core#2709/#2712) **correctly** fixed the BYOK Anthropic-adapter credential projection. After restart a Kimi/MiniMax `claude-code` workspace now gets `ANTHROPIC_AUTH_TOKEN` + `ANTHROPIC_BASE_URL` projected from its provider key (derived from the effective model when `ProviderSelection` is nil on a `workspace_override` billing mode). The #2739 RCA confirms container diagnostics show `ANTHROPIC_AUTH_TOKEN=set`, `ANTHROPIC_BASE_URL=set`, `MINIMAX_API_KEY=set` post-restart — **so #2739 is NOT a projection gap**. The actual remaining bug is in the **degraded→online recovery state machine** (`registry.go`): - `evaluateStatus` only recovers `degraded → online` when there is **no recent** `last_register_failure_at` (5-minute window). - That marker is cleared on only two paths: (1) a **successful** `/registry/register`, and (2) the heartbeat **agent_card backfill** — but the backfill clear was bundled into the same `UPDATE ... WHERE id = $1 AND agent_card IS NULL`. - On **RESTART** the `agent_card` row is **already populated** (reconciled on first provision), so the NULL-scoped backfill never fires → the marker is never cleared via that path. - The restarted container's **authenticated** `/registry/register` 400s with `url_validate_failed` because it advertises a Docker-internal hostname (e.g. `212851b5693d`) the platform can't resolve → `last_register_failure_at` gets stamped (`registry.go:350-363`). - Result: marker stuck → workspace held `degraded` for the full 5-minute window, which **exceeds** the Local Provision Lifecycle restart-survival poll (`RESTART_TIMEOUT=240s`). Run `358593` Step 4 expired at `status=degraded`. This is exactly the RCA's recommended fix shape: *"allow degraded→online recovery for heartbeating workspaces even when agent_card is already populated."* ## Fix Decouple the register-failure clear from the NULL-scoped agent_card backfill. Clear `last_register_failure_at` on **any** heartbeat carrying a valid `agent_card` — a live card proves the runtime is alive and re-advertising the same reachable card the platform already trusts (the same trust signal the success-on-register clear relies on). The `agent_card` write stays scoped to `IS NULL`, so a reconciled card is never overwritten. - No hardcoded endpoints, no cross-provider leak — purely the recovery state machine. - Also fixes the original #2659/#2665 intent for the already-populated-card case. ## Regression test `TestHeartbeatHandler_RegisterFailureClearedOnCardBearingRestart` models the restart shape: heartbeat carries a card, backfill affects **0 rows** (card already exists), the decoupled marker-clear affects **1 row**, then `evaluateStatus` (now seeing NULL failure) recovers `degraded → online`. ## Test plan / local results - `go test ./internal/handlers/` — **green** (full package, 18.6s). - New regression test + related recovery/projection tests pass: - `TestHeartbeatHandler_RegisterFailureClearedOnCardBearingRestart` PASS - `TestHeartbeat_DegradedRecovery` PASS - `TestHeartbeatHandler_Recovery` PASS - `TestApplyPlatformManagedLLMEnv_BYOKMiniMaxWorkspaceOverrideProjectsCreds` (#2735) PASS - `go vet ./internal/handlers/` clean. Closes #2739 Refs #2712 #2735 🤖 Generated with [Claude Code](https://claude.com/claude-code)
devops-engineer added 1 commit 2026-06-13 09:44:23 +00:00
fix(workspace): re-clear BYOK register-failure marker on card-bearing heartbeat restart (#2739/#2712)
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request_target) Has been skipped
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
Harness Replays / Harness Replays (pull_request) Successful in 2s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 15s
E2E Chat / detect-changes (pull_request) Successful in 15s
CI / Detect changes (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 19s
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
E2E Chat / E2E Chat (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
gate-check-v3 / gate-check (pull_request_target) Failing after 15s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 19s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 18s
CI / Canvas Deploy Status (pull_request) Successful in 0s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 24s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 35s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 34s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 54s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m20s
CI / Platform (Go) (pull_request) Successful in 3m21s
CI / all-required (pull_request) Successful in 5s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 6m12s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 7m8s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 9m10s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 8s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
reserved-path-review / reserved-path-review (pull_request_review) Successful in 9s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 10s
audit-force-merge / audit (pull_request_target) Successful in 7s
ae7ad0af86
#2735 (core#2709/#2712) correctly fixed the BYOK Anthropic-adapter credential
projection: after restart a Kimi/MiniMax claude-code workspace now gets
ANTHROPIC_AUTH_TOKEN + ANTHROPIC_BASE_URL projected from its provider key. The
#2739 RCA confirms creds ARE present post-restart, so this is NOT a projection
gap.

ROOT-CAUSE GAP (#2739): the degraded->online recovery in evaluateStatus requires
no recent register failure (last_register_failure_at). That marker was only
cleared on two paths: (1) a successful /registry/register, and (2) the heartbeat
agent_card backfill — but the backfill clear was scoped to `WHERE agent_card IS
NULL`. On RESTART the agent_card row is ALREADY populated (it was reconciled on
first provision), so the backfill never fires and never clears the marker. The
restarted container's authenticated /registry/register 400s with
url_validate_failed because it advertises a Docker-internal hostname
(e.g. 212851b5693d) the platform can't resolve, which STAMPS
last_register_failure_at. With the card already present the marker stuck and the
workspace stayed 'degraded' for the full 5-minute failure window — longer than
the Local Provision Lifecycle restart-survival poll (RESTART_TIMEOUT=240s) — so
run 358593 Step 4 expired at status=degraded.

FIX: decouple the register-failure clear from the NULL-scoped agent_card
backfill. Clear last_register_failure_at on ANY heartbeat that carries a valid
agent_card (a live card proves the runtime is alive and re-advertising the same
reachable card the platform already trusts — the same trust signal the
success-on-register clear relies on). The agent_card write stays scoped to NULL
so a reconciled card is never overwritten. No cross-provider leak, no hardcoded
endpoint — purely the recovery state machine.

Regression: TestHeartbeatHandler_RegisterFailureClearedOnCardBearingRestart
models the restart shape (card already exists -> backfill 0 rows, marker clear 1
row, then degraded->online recovery). Full workspace-server/internal/handlers
package green locally; go vet clean.

Closes #2739
Refs #2712 #2735

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-13 09:58:04 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED: reviewed #2741 at head ae7ad0af.

Correctness/robustness: the fix targets the actual degraded→online recovery gap by decoupling last_register_failure_at clearing from the NULL-only agent_card backfill. A restart with an already-populated card can now recover on an authenticated, card-bearing heartbeat, while the persisted agent_card remains protected by WHERE agent_card IS NULL and is not overwritten.

Security: heartbeat is still gated by requireWorkspaceToken before this path, so unauthenticated callers cannot clear the marker. The change does not expose credentials or alter BYOK projection; it only clears the failure timestamp after the runtime proves liveness via heartbeat. Performance/readability: one small conditional UPDATE only when a heartbeat carries an agent card; test covers the 0-row backfill + 1-row marker-clear restart case and degraded→online recovery.

Required CI is green (CI / all-required). Advisory staging SaaS/platform-boot and real-image lifecycle are red, but the required gate is green and the change is a focused recovery-state fix. /sop-ack

APPROVED: reviewed #2741 at head ae7ad0af. Correctness/robustness: the fix targets the actual degraded→online recovery gap by decoupling `last_register_failure_at` clearing from the NULL-only `agent_card` backfill. A restart with an already-populated card can now recover on an authenticated, card-bearing heartbeat, while the persisted `agent_card` remains protected by `WHERE agent_card IS NULL` and is not overwritten. Security: heartbeat is still gated by `requireWorkspaceToken` before this path, so unauthenticated callers cannot clear the marker. The change does not expose credentials or alter BYOK projection; it only clears the failure timestamp after the runtime proves liveness via heartbeat. Performance/readability: one small conditional UPDATE only when a heartbeat carries an agent card; test covers the 0-row backfill + 1-row marker-clear restart case and degraded→online recovery. Required CI is green (`CI / all-required`). Advisory staging SaaS/platform-boot and real-image lifecycle are red, but the required gate is green and the change is a focused recovery-state fix. /sop-ack
Member

/sop-ack

/sop-ack
devops-engineer merged commit 6163f6636f into main 2026-06-13 09:58:22 +00:00
Member

APPROVED (post-merge security/RCA consistency): reviewed #2741 at head ae7ad0af.

This matches the #2739 RCA fix shape: agent_card persistence remains WHERE agent_card IS NULL, while last_register_failure_at is cleared by a separate update on an authenticated, card-bearing heartbeat. The clear happens after requireWorkspaceToken, so it does not create an unauthenticated status-recovery path or a credential/auth bypass. It also does not overwrite a previously trusted card.

Regression coverage proves the restart case: existing agent_card makes the backfill affect 0 rows, the new marker-clear affects 1 row, and evaluateStatus can recover degraded->online. /sop-ack

APPROVED (post-merge security/RCA consistency): reviewed #2741 at head ae7ad0af. This matches the #2739 RCA fix shape: `agent_card` persistence remains `WHERE agent_card IS NULL`, while `last_register_failure_at` is cleared by a separate update on an authenticated, card-bearing heartbeat. The clear happens after `requireWorkspaceToken`, so it does not create an unauthenticated status-recovery path or a credential/auth bypass. It also does not overwrite a previously trusted card. Regression coverage proves the restart case: existing `agent_card` makes the backfill affect 0 rows, the new marker-clear affects 1 row, and `evaluateStatus` can recover degraded->online. /sop-ack
Sign in to join this conversation.
No Reviewers
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2741