RCA: local-provision restart stays degraded after authenticated register-400 despite healthy heartbeat #2739

Closed
opened 2026-06-13 09:20:43 +00:00 by agent-researcher · 6 comments
Member

MECHANISM: Local Provision Lifecycle real-image main run 358593 fails restart-survival because the restarted container's authenticated /registry/register posts a non-resolvable Docker hostname URL (212851b5693d), so validateAgentURL rejects it at workspace-server/internal/handlers/registry.go:270-294 and Register returns url_validate_failed at registry.go:497-504. The deferred register failure marker is stamped at registry.go:350-363; subsequent heartbeats prove the container is alive and the BYOK MiniMax projection is fixed, but heartbeat only clears last_register_failure_at on the agent_card IS NULL backfill branch at registry.go:827-839. On restart the card already exists, so the recent register-failure marker blocks degraded->online recovery until it ages out; the test's Step 4 window (tests/e2e/test_local_provision_lifecycle_e2e.sh:144-149) expires with status degraded.

EVIDENCE: main 0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58, run 358593, job 487659. Log excerpt: reason=url_validate_failed and later got: degraded. The container diagnostics show credentials are present (ANTHROPIC_AUTH_TOKEN=set, ANTHROPIC_BASE_URL=set, MINIMAX_API_KEY=set) and platform reachability succeeds ({"status":"ok"}), so this is not the previous #2709/#2735 token-projection gap. Restart-context also drops because the workspace never returns online: restart-context.go:265-267.

RECOMMENDED FIX SHAPE: in workspace-server/internal/handlers/registry.go, decouple heartbeat recovery from the agent_card IS NULL backfill path. When a heartbeat arrives from an authenticated/live workspace after a recent url_validate_failed register-400, either clear last_register_failure_at if the heartbeat URL/card proves the same reachable runtime, or allow degraded->online recovery for heartbeating workspaces even when agent_card is already populated. Add a regression covering restart with existing agent_card + register-400 + subsequent heartbeat so the local-provision lifecycle test no longer waits for the five-minute failure window.

MECHANISM: Local Provision Lifecycle real-image main run `358593` fails restart-survival because the restarted container's authenticated `/registry/register` posts a non-resolvable Docker hostname URL (`212851b5693d`), so `validateAgentURL` rejects it at `workspace-server/internal/handlers/registry.go:270-294` and `Register` returns `url_validate_failed` at `registry.go:497-504`. The deferred register failure marker is stamped at `registry.go:350-363`; subsequent heartbeats prove the container is alive and the BYOK MiniMax projection is fixed, but heartbeat only clears `last_register_failure_at` on the `agent_card IS NULL` backfill branch at `registry.go:827-839`. On restart the card already exists, so the recent register-failure marker blocks degraded->online recovery until it ages out; the test's Step 4 window (`tests/e2e/test_local_provision_lifecycle_e2e.sh:144-149`) expires with status `degraded`. EVIDENCE: main `0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58`, run `358593`, job `487659`. Log excerpt: `reason=url_validate_failed` and later `got: degraded`. The container diagnostics show credentials are present (`ANTHROPIC_AUTH_TOKEN=set`, `ANTHROPIC_BASE_URL=set`, `MINIMAX_API_KEY=set`) and platform reachability succeeds (`{"status":"ok"}`), so this is not the previous #2709/#2735 token-projection gap. Restart-context also drops because the workspace never returns online: `restart-context.go:265-267`. RECOMMENDED FIX SHAPE: in `workspace-server/internal/handlers/registry.go`, decouple heartbeat recovery from the `agent_card IS NULL` backfill path. When a heartbeat arrives from an authenticated/live workspace after a recent `url_validate_failed` register-400, either clear `last_register_failure_at` if the heartbeat URL/card proves the same reachable runtime, or allow degraded->online recovery for heartbeating workspaces even when `agent_card` is already populated. Add a regression covering restart with existing agent_card + register-400 + subsequent heartbeat so the local-provision lifecycle test no longer waits for the five-minute failure window.
Author
Member

RCA tick confirmation (2026-06-13): current main status still points to the same active red documented here: molecule-core main 0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58, run 358593, job 487659, Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory).

MECHANISM: unchanged from the issue body. Restarted workspace has correct MiniMax/BYOK Anthropic adapter env, but /registry/register returns authenticated 400 url_validate_failed for the container-hostname URL; the failure marker keeps the row degraded while heartbeats continue.

EVIDENCE: current status API still reports only this failing core context; molecule-controlplane and landingpage main are green.

RECOMMENDED FIX SHAPE: keep this routed as #2739; no duplicate issue needed.

RCA tick confirmation (2026-06-13): current main status still points to the same active red documented here: `molecule-core` main `0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58`, run `358593`, job `487659`, `Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory)`. MECHANISM: unchanged from the issue body. Restarted workspace has correct MiniMax/BYOK Anthropic adapter env, but `/registry/register` returns authenticated 400 `url_validate_failed` for the container-hostname URL; the failure marker keeps the row degraded while heartbeats continue. EVIDENCE: current status API still reports only this failing core context; `molecule-controlplane` and `landingpage` main are green. RECOMMENDED FIX SHAPE: keep this routed as #2739; no duplicate issue needed.
Author
Member

RCA tick confirmation (2026-06-13): no new active incident beyond #2739. Current status API still shows only molecule-core main 0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58 failing Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) on run 358593 / job 487659.

MECHANISM: unchanged — restart-degraded after authenticated url_validate_failed register-400, with healthy heartbeat and BYOK adapter env present.

EVIDENCE: molecule-controlplane main and landingpage main are green; no additional failing context is present in current main status.

RECOMMENDED FIX SHAPE: continue routing #2739; no duplicate RCA issue needed.

RCA tick confirmation (2026-06-13): no new active incident beyond #2739. Current status API still shows only `molecule-core` main `0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58` failing `Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory)` on run `358593` / job `487659`. MECHANISM: unchanged — restart-degraded after authenticated `url_validate_failed` register-400, with healthy heartbeat and BYOK adapter env present. EVIDENCE: `molecule-controlplane` main and `landingpage` main are green; no additional failing context is present in current main status. RECOMMENDED FIX SHAPE: continue routing #2739; no duplicate RCA issue needed.
Author
Member

Turnkey fix-spec after re-reading run 358593 / job 487659 and current main 0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58.

Correction to premise: I do not see an ANTHROPIC_AUTH_TOKEN / provider-credential projection failure in this run. The restarted real-image container prints ANTHROPIC_AUTH_TOKEN=set, ANTHROPIC_BASE_URL=set, and MINIMAX_API_KEY=set; the full env also contains ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic/v1. The projection path is workspace-server/internal/handlers/workspace_provision.go:1167-1199: after #2735, BYOK + runtimeUsesAnthropicNativeProxy(runtime) derives the provider from effectiveModel when res.ProviderSelection is empty (workspace_override case), then maps MiniMax auth into ANTHROPIC_AUTH_TOKEN and ANTHROPIC_BASE_URL. The run proves this worked on restart.

Why #2735 does not cover the red: #2735 covers the credential-projection layer only. The actual failing layer is registry/status recovery after the restarted container is already alive and credentialed. The restarted container posts /registry/register, but its advertised URL is a raw Docker hostname (212851b5693d), and validateAgentURL rejects non-resolvable non-platform hostnames at workspace-server/internal/handlers/registry.go:270-294; the register handler returns url_validate_failed at registry.go:497-504. Because the caller is authenticated, the defer stamps last_register_failure_at at registry.go:350-363. Heartbeats then prove the container is live, but heartbeat only clears last_register_failure_at inside the agent_card IS NULL backfill update at registry.go:827-839. On restart the row already has an agent_card, so that UPDATE affects 0 rows and the failure marker remains; degraded→online recovery is blocked until the recent-failure window ages out, longer than RESTART_TIMEOUT=240 in tests/e2e/test_local_provision_lifecycle_e2e.sh:144-149.

Patch spec: keep URL validation strict; do not whitelist arbitrary Docker hostnames. Instead make heartbeat recovery clear stale authenticated register failures when the runtime is actively heartbeating and the card already exists. Minimal shape in workspace-server/internal/handlers/registry.go, immediately after the existing agent-card backfill block:

if len(payload.AgentCard) > 0 {
    res, err := db.DB.ExecContext(ctx, `
        UPDATE workspaces
        SET agent_card = $2,
            last_register_failure_at = NULL
        WHERE id = $1 AND agent_card IS NULL
    `, payload.WorkspaceID, payload.AgentCard)
    if err != nil { ... } else if rows, _ := res.RowsAffected(); rows > 0 { ... }

    // NEW: restart/recreate recovery. If the card already exists, heartbeat is
    // still authoritative proof that the current container is alive. Clear the
    // recent register-failure marker so evaluateStatus/degraded recovery can
    // return the row to online without waiting for the 5m failure window.
    if err == nil {
        if _, err := db.DB.ExecContext(ctx, `
            UPDATE workspaces
            SET last_register_failure_at = NULL,
                updated_at = now()
            WHERE id = $1
              AND agent_card IS NOT NULL
              AND last_register_failure_at IS NOT NULL
        `, payload.WorkspaceID); err != nil {
            log.Printf("Registry heartbeat: failed to clear register-failure marker for %s after live heartbeat: %v", payload.WorkspaceID, err)
        }
    }
}

Then ensure the existing degraded recovery branch can fire on the same heartbeat or next heartbeat. If current code requires a second heartbeat because currentStatus,last_register_failure_at was read before clearing, either (a) after clearing set the in-memory lastRegisterFailureAt.Valid = false before evaluateStatus, or (b) move this clearing before the status/evaluate query. Prefer (a) if smaller and local.

Regression test: add/adjust in workspace-server/internal/handlers/registry_test.go near the existing backfill/recovery tests. Scenario: workspace row is degraded, agent_card already non-NULL, last_register_failure_at=now()-2m, heartbeat payload includes agent_card, and auth/token gate passes. Expect an UPDATE clearing last_register_failure_at even though agent_card IS NULL backfill affects 0 rows, then expect degraded→online recovery/broadcast. This test must fail on current main because the marker is only cleared when agent_card IS NULL. Also add a narrower unit asserting heartbeat with existing card + recent marker does not require a successful /registry/register first.

Engineer workspace correlation: denied on current evidence. #2739’s failing run is not missing ANTHROPIC_AUTH_TOKEN; it is a local Docker restart status-recovery wedge after url_validate_failed. Kimi/MiniMax _ResultError crashes may share the broad “workspace cannot recover after restart” symptom, but this log does not prove a credential-projection crash. If their logs show ANTHROPIC_AUTH_TOKEN=unset, that is a different regression; if they show live heartbeats + degraded after register-400, then route them to this #2739 registry heartbeat-marker fix.

Turnkey fix-spec after re-reading run `358593` / job `487659` and current main `0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58`. **Correction to premise:** I do **not** see an `ANTHROPIC_AUTH_TOKEN` / provider-credential projection failure in this run. The restarted real-image container prints `ANTHROPIC_AUTH_TOKEN=set`, `ANTHROPIC_BASE_URL=set`, and `MINIMAX_API_KEY=set`; the full env also contains `ANTHROPIC_BASE_URL=https://api.minimax.io/anthropic/v1`. The projection path is `workspace-server/internal/handlers/workspace_provision.go:1167-1199`: after #2735, BYOK + `runtimeUsesAnthropicNativeProxy(runtime)` derives the provider from `effectiveModel` when `res.ProviderSelection` is empty (`workspace_override` case), then maps MiniMax auth into `ANTHROPIC_AUTH_TOKEN` and `ANTHROPIC_BASE_URL`. The run proves this worked on restart. **Why #2735 does not cover the red:** #2735 covers the credential-projection layer only. The actual failing layer is registry/status recovery after the restarted container is already alive and credentialed. The restarted container posts `/registry/register`, but its advertised URL is a raw Docker hostname (`212851b5693d`), and `validateAgentURL` rejects non-resolvable non-platform hostnames at `workspace-server/internal/handlers/registry.go:270-294`; the register handler returns `url_validate_failed` at `registry.go:497-504`. Because the caller is authenticated, the defer stamps `last_register_failure_at` at `registry.go:350-363`. Heartbeats then prove the container is live, but heartbeat only clears `last_register_failure_at` inside the `agent_card IS NULL` backfill update at `registry.go:827-839`. On restart the row already has an `agent_card`, so that UPDATE affects 0 rows and the failure marker remains; degraded→online recovery is blocked until the recent-failure window ages out, longer than `RESTART_TIMEOUT=240` in `tests/e2e/test_local_provision_lifecycle_e2e.sh:144-149`. **Patch spec:** keep URL validation strict; do not whitelist arbitrary Docker hostnames. Instead make heartbeat recovery clear stale authenticated register failures when the runtime is actively heartbeating and the card already exists. Minimal shape in `workspace-server/internal/handlers/registry.go`, immediately after the existing agent-card backfill block: ```go if len(payload.AgentCard) > 0 { res, err := db.DB.ExecContext(ctx, ` UPDATE workspaces SET agent_card = $2, last_register_failure_at = NULL WHERE id = $1 AND agent_card IS NULL `, payload.WorkspaceID, payload.AgentCard) if err != nil { ... } else if rows, _ := res.RowsAffected(); rows > 0 { ... } // NEW: restart/recreate recovery. If the card already exists, heartbeat is // still authoritative proof that the current container is alive. Clear the // recent register-failure marker so evaluateStatus/degraded recovery can // return the row to online without waiting for the 5m failure window. if err == nil { if _, err := db.DB.ExecContext(ctx, ` UPDATE workspaces SET last_register_failure_at = NULL, updated_at = now() WHERE id = $1 AND agent_card IS NOT NULL AND last_register_failure_at IS NOT NULL `, payload.WorkspaceID); err != nil { log.Printf("Registry heartbeat: failed to clear register-failure marker for %s after live heartbeat: %v", payload.WorkspaceID, err) } } } ``` Then ensure the existing degraded recovery branch can fire on the same heartbeat or next heartbeat. If current code requires a second heartbeat because `currentStatus,last_register_failure_at` was read before clearing, either (a) after clearing set the in-memory `lastRegisterFailureAt.Valid = false` before `evaluateStatus`, or (b) move this clearing before the status/evaluate query. Prefer (a) if smaller and local. **Regression test:** add/adjust in `workspace-server/internal/handlers/registry_test.go` near the existing backfill/recovery tests. Scenario: workspace row is `degraded`, `agent_card` already non-NULL, `last_register_failure_at=now()-2m`, heartbeat payload includes `agent_card`, and auth/token gate passes. Expect an UPDATE clearing `last_register_failure_at` even though `agent_card IS NULL` backfill affects 0 rows, then expect degraded→online recovery/broadcast. This test must fail on current main because the marker is only cleared when `agent_card IS NULL`. Also add a narrower unit asserting heartbeat with existing card + recent marker does not require a successful `/registry/register` first. **Engineer workspace correlation:** denied on current evidence. #2739’s failing run is not missing `ANTHROPIC_AUTH_TOKEN`; it is a local Docker restart status-recovery wedge after `url_validate_failed`. Kimi/MiniMax `_ResultError` crashes may share the broad “workspace cannot recover after restart” symptom, but this log does not prove a credential-projection crash. If their logs show `ANTHROPIC_AUTH_TOKEN=unset`, that is a different regression; if they show live heartbeats + degraded after register-400, then route them to this #2739 registry heartbeat-marker fix.
Author
Member

RCA tick after turnkey fix-spec (2026-06-13): no new active incident beyond #2739. Current status API still shows only molecule-core main 0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58 failing Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) on run 358593 / job 487659; molecule-controlplane and landingpage main are green.

The actionable fix-spec is in the previous comment: #2739 is not missing ANTHROPIC_AUTH_TOKEN in this run; it is registry heartbeat/status recovery after authenticated url_validate_failed with an existing agent_card.

RCA tick after turnkey fix-spec (2026-06-13): no new active incident beyond #2739. Current status API still shows only `molecule-core` main `0ebfb5d27ee3e63e4407bcc24aa43c6f9961bf58` failing `Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory)` on run `358593` / job `487659`; `molecule-controlplane` and `landingpage` main are green. The actionable fix-spec is in the previous comment: #2739 is not missing `ANTHROPIC_AUTH_TOKEN` in this run; it is registry heartbeat/status recovery after authenticated `url_validate_failed` with an existing agent_card.
Author
Member

Close-out confirmation (2026-06-13): #2741 merged at head ae7ad0af and matches the intended #2739 RCA fix. It keeps the agent_card write NULL-scoped but clears last_register_failure_at on an authenticated, card-bearing heartbeat, allowing a restarted workspace with an existing card to recover degraded->online after an authenticated register-400.

Verification signal: current main 6163f6636fc80a50de0c609d1a414264efe3c509 has Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) green in run 358772/job 488004. No remaining #2739-specific action is visible; any future red should be treated as a new recurrence and compared against the marker-clear path in workspace-server/internal/handlers/registry.go.

Close-out confirmation (2026-06-13): #2741 merged at head `ae7ad0af` and matches the intended #2739 RCA fix. It keeps the `agent_card` write NULL-scoped but clears `last_register_failure_at` on an authenticated, card-bearing heartbeat, allowing a restarted workspace with an existing card to recover degraded->online after an authenticated register-400. Verification signal: current main `6163f6636fc80a50de0c609d1a414264efe3c509` has `Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory)` green in run 358772/job 488004. No remaining #2739-specific action is visible; any future red should be treated as a new recurrence and compared against the marker-clear path in `workspace-server/internal/handlers/registry.go`.
Author
Member

Post-#2754/#2755 re-evaluation (2026-06-13): remains CLOSED / confirmed resolved.

Mechanism: #2739's own root cause was the authenticated register-400 recovery marker path, fixed earlier by #2741: authenticated, card-bearing heartbeat clears stale last_register_failure_at while keeping the agent_card write NULL-scoped. #2754/#2755 addressed the later MiniMax real-image residual, not the #2739 marker mechanism.

Evidence: #2739 is already closed. Current main 1f7f513afbcc62de74fedf7747188e7efe097685 contains #2754/#2755 on top of the #2741 fix lineage, and Local Provision Lifecycle E2E / real image + MiniMax LLM, advisory is successful in 33s. The earlier #2739 failure mode (restart stays degraded after healthy heartbeat/register-400) is not present in current main status.

Recommended state: no reopen. Owner action complete; future degraded-after-register failures should be filed as a new issue with fresh logs.

Post-#2754/#2755 re-evaluation (2026-06-13): remains CLOSED / confirmed resolved. Mechanism: #2739's own root cause was the authenticated register-400 recovery marker path, fixed earlier by #2741: authenticated, card-bearing heartbeat clears stale `last_register_failure_at` while keeping the `agent_card` write NULL-scoped. #2754/#2755 addressed the later MiniMax real-image residual, not the #2739 marker mechanism. Evidence: #2739 is already closed. Current main `1f7f513afbcc62de74fedf7747188e7efe097685` contains #2754/#2755 on top of the #2741 fix lineage, and `Local Provision Lifecycle E2E / real image + MiniMax LLM, advisory` is successful in 33s. The earlier #2739 failure mode (restart stays degraded after healthy heartbeat/register-400) is not present in current main status. Recommended state: no reopen. Owner action complete; future degraded-after-register failures should be filed as a new issue with fresh logs.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2739