fix(workspace): re-project BYOK Anthropic creds on restart/recreate (#2739/#2712) #2741
Root-cause gap
#2735 (core#2709/#2712) correctly fixed the BYOK Anthropic-adapter credential projection. After restart, a Kimi/MiniMax `claude-code` workspace now gets `ANTHROPIC_AUTH_TOKEN` + `ANTHROPIC_BASE_URL` projected from its provider key (derived from the effective model when `ProviderSelection` is nil on a `workspace_override` billing mode). The #2739 RCA confirms container diagnostics show `ANTHROPIC_AUTH_TOKEN=set`, `ANTHROPIC_BASE_URL=set`, `MINIMAX_API_KEY=set` post-restart, so #2739 is NOT a projection gap.

The actual remaining bug is in the degraded→online recovery state machine (`registry.go`):

- `evaluateStatus` only recovers `degraded → online` when there is no recent `last_register_failure_at` (5-minute window).
- That marker is cleared in two places: (1) a successful `/registry/register`, and (2) the heartbeat agent_card backfill. The backfill clear, however, was bundled into the same `UPDATE ... WHERE id = $1 AND agent_card IS NULL`.
- On restart the `agent_card` row is already populated (reconciled on first provision), so the NULL-scoped backfill never fires → the marker is never cleared via that path.
- Meanwhile the restarted runtime's `/registry/register` 400s with `url_validate_failed` because it advertises a Docker-internal hostname (e.g. `212851b5693d`) the platform can't resolve → `last_register_failure_at` gets stamped (`registry.go:350-363`).
- The workspace therefore stays `degraded` for the full 5-minute window, which exceeds the Local Provision Lifecycle restart-survival poll (`RESTART_TIMEOUT=240s`). Run 358593 Step 4 expired at `status=degraded`.

This is exactly the RCA's recommended fix shape: "allow degraded→online recovery for heartbeating workspaces even when agent_card is already populated."
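The timing problem above can be illustrated with a minimal sketch. It is not the code in `registry.go`: the `workspace` struct, `canRecover`, and `recoversAfter` are hypothetical names, and the exact boundary handling of the 5-minute window is an assumption. It only shows why a marker stamped at restart keeps the workspace degraded past a 240-second poll unless something clears it.

```go
package main

import (
	"fmt"
	"time"
)

// registerFailureWindow mirrors the 5-minute window named in the description.
const registerFailureWindow = 5 * time.Minute

// workspace is a hypothetical, trimmed-down row; a nil pointer models a NULL column.
type workspace struct {
	Status                string
	LastRegisterFailureAt *time.Time
}

// canRecover reports whether a heartbeating degraded workspace may return to
// online. Assumption: recovery is blocked while a register failure is younger
// than the window, and allowed when the marker is NULL or older than the window.
func canRecover(ws workspace, now time.Time) bool {
	if ws.Status != "degraded" {
		return false
	}
	if ws.LastRegisterFailureAt == nil {
		return true
	}
	return now.Sub(*ws.LastRegisterFailureAt) >= registerFailureWindow
}

// recoversAfter builds a degraded workspace whose register failed at a fixed
// instant, then asks whether it can recover elapsedSeconds later.
func recoversAfter(elapsedSeconds int, markerCleared bool) bool {
	failedAt := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	ws := workspace{Status: "degraded"}
	if !markerCleared {
		ws.LastRegisterFailureAt = &failedAt
	}
	now := failedAt.Add(time.Duration(elapsedSeconds) * time.Second)
	return canRecover(ws, now)
}

func main() {
	fmt.Println(recoversAfter(240, false)) // marker still set at the 240s poll deadline
	fmt.Println(recoversAfter(240, true))  // marker cleared by a heartbeat
	fmt.Println(recoversAfter(301, false)) // marker aged out of the window
}
```

With the marker left in place, the 240-second deadline always falls inside the 300-second window, so the poll can only pass if the marker is cleared earlier.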
Fix
Decouple the register-failure clear from the NULL-scoped agent_card backfill. Clear `last_register_failure_at` on any heartbeat carrying a valid `agent_card`: a live card proves the runtime is alive and re-advertising the same reachable card the platform already trusts (the same trust signal the success-on-register clear relies on). The `agent_card` write stays scoped to `IS NULL`, so a reconciled card is never overwritten.

Regression test
`TestHeartbeatHandler_RegisterFailureClearedOnCardBearingRestart` models the restart shape: heartbeat carries a card, backfill affects 0 rows (card already exists), the decoupled marker-clear affects 1 row, then `evaluateStatus` (now seeing NULL failure) recovers `degraded → online`.

Test plan / local results
- `go test ./internal/handlers/`: green (full package, 18.6s).
- `TestHeartbeatHandler_RegisterFailureClearedOnCardBearingRestart` PASS
- `TestHeartbeat_DegradedRecovery` PASS
- `TestHeartbeatHandler_Recovery` PASS
- `TestApplyPlatformManagedLLMEnv_BYOKMiniMaxWorkspaceOverrideProjectsCreds` (#2735) PASS
- `go vet ./internal/handlers/` clean.

Closes #2739
Refs #2712 #2735
🤖 Generated with Claude Code
APPROVED: reviewed #2741 at head `ae7ad0af`.

Correctness/robustness: the fix targets the actual degraded→online recovery gap by decoupling `last_register_failure_at` clearing from the NULL-only `agent_card` backfill. A restart with an already-populated card can now recover on an authenticated, card-bearing heartbeat, while the persisted `agent_card` remains protected by `WHERE agent_card IS NULL` and is not overwritten.

Security: heartbeat is still gated by `requireWorkspaceToken` before this path, so unauthenticated callers cannot clear the marker. The change does not expose credentials or alter BYOK projection; it only clears the failure timestamp after the runtime proves liveness via heartbeat.

Performance/readability: one small conditional UPDATE only when a heartbeat carries an agent card; test covers the 0-row backfill + 1-row marker-clear restart case and degraded→online recovery.

Required CI is green (`CI / all-required`). Advisory staging SaaS/platform-boot and real-image lifecycle are red, but the required gate is green and the change is a focused recovery-state fix.

/sop-ack
APPROVED (post-merge security/RCA consistency): reviewed #2741 at head `ae7ad0af`.

This matches the #2739 RCA fix shape: `agent_card` persistence remains `WHERE agent_card IS NULL`, while `last_register_failure_at` is cleared by a separate update on an authenticated, card-bearing heartbeat. The clear happens after `requireWorkspaceToken`, so it does not create an unauthenticated status-recovery path or a credential/auth bypass. It also does not overwrite a previously trusted card.

Regression coverage proves the restart case: existing `agent_card` makes the backfill affect 0 rows, the new marker-clear affects 1 row, and `evaluateStatus` can recover degraded→online.

/sop-ack
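The restart case can be modeled in memory as a rough sketch. This is not the handler code from the PR: `row`, `backfillCard`, `clearRegisterFailure`, `onHeartbeat`, and `newRestartedRow` are hypothetical names, and the assumption that the marker-clear reports 0 rows when the marker is already NULL is mine. Each function stands in for one UPDATE and returns its rows-affected count.

```go
package main

import "fmt"

// row is a hypothetical stand-in for one workspace row; nil pointers model NULL columns.
type row struct {
	AgentCard             *string
	LastRegisterFailureAt *string
}

// backfillCard models the UPDATE that stays scoped to agent_card IS NULL.
// It never overwrites an existing card and returns the rows affected.
func backfillCard(r *row, card string) int {
	if r.AgentCard != nil {
		return 0
	}
	r.AgentCard = &card
	return 1
}

// clearRegisterFailure models the decoupled marker-clear UPDATE.
// Assumption: it reports 0 rows when the marker is already NULL.
func clearRegisterFailure(r *row) int {
	if r.LastRegisterFailureAt == nil {
		return 0
	}
	r.LastRegisterFailureAt = nil
	return 1
}

// onHeartbeat runs both updates, but only for a card-bearing heartbeat.
func onHeartbeat(r *row, card string) (backfilled, cleared int) {
	if card == "" {
		return 0, 0
	}
	return backfillCard(r, card), clearRegisterFailure(r)
}

// newRestartedRow builds the post-restart shape: card already reconciled,
// register failure freshly stamped.
func newRestartedRow() *row {
	card := "reconciled-card"
	stamp := "stamped-at-restart"
	return &row{AgentCard: &card, LastRegisterFailureAt: &stamp}
}

func main() {
	r := newRestartedRow()
	backfilled, cleared := onHeartbeat(r, "card-from-heartbeat")
	fmt.Println(backfilled, cleared)             // 0 1
	fmt.Println(*r.AgentCard)                    // reconciled-card
	fmt.Println(r.LastRegisterFailureAt == nil)  // true
}
```

Keeping the two updates separate lets the card write stay conservative while the marker-clear fires on every card-bearing heartbeat.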