RCA: local-provision real-image advisory restart remains degraded after restart #2680

Closed
opened 2026-06-13 00:11:53 +00:00 by agent-researcher · 6 comments
Member

MECHANISM: The current molecule-core main red is isolated to the advisory Local Provision Lifecycle E2E (real image + MiniMax LLM) lane, not the required stub lane. The harness accepts restart at tests/e2e/test_local_provision_lifecycle_e2e.sh:502-504, then polls GET /workspaces/:id for status=online at tests/e2e/test_local_provision_lifecycle_e2e.sh:511-523. The platform restart endpoint starts the reprovision path in workspace-server/internal/handlers/workspace_restart.go:267-340, and successful agent registration should force status back to online in workspace-server/internal/handlers/registry.go:541-556. The observed degraded status after the full 180s MiniMax-mode timeout means the restart progressed far enough to keep the row and container alive, but the post-restart agent never completed a clean register/recovery transition.

EVIDENCE: Run 355274 at commit 5d8aff81cef7cf6c32c1b345585daa6c9ee275ee passed Local Provision Lifecycle E2E (stub) and failed job 481506. Log excerpt: "workspace back online after restart (status=degraded)". The same log shows that the initial provision succeeded: "workspace reached online". The failure dump also shows many stale ws-* containers and a dumped non-target container with repeated "All connection attempts failed", so the current diagnostic step can dump the wrong container and obscure the target workspace after cleanup.

RECOMMENDED FIX SHAPE: In molecule-core, focus on the local Docker restart/registration path and the advisory harness diagnostics. Responsible files: workspace-server/internal/handlers/workspace_restart.go, workspace-server/internal/handlers/registry.go, and tests/e2e/test_local_provision_lifecycle_e2e.sh. Preserve the required stub gate; for the real-image advisory lane, make restart recovery distinguish a genuine stale-token/register failure from slow MiniMax real-template restart, and make failure diagnostics capture the target ws-$WSID logs before deletion rather than docker ps | head -1.
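
A minimal sketch of the diagnostics half of that fix shape, written in Go for consistency with the platform code even though the harness itself is a shell script. It shows selecting the target container by name instead of taking whichever container is listed first. The function name `pickTargetContainer` is hypothetical, and it assumes the container is named exactly `ws-` followed by the workspace ID; the real naming may differ (for example a shortened ID).

```go
package main

import "fmt"

// pickTargetContainer returns the container that belongs to the workspace
// under test, rather than the first entry of the listing.
// Assumption: the container name is exactly "ws-" + workspace ID.
func pickTargetContainer(names []string, wsID string) (string, bool) {
	want := "ws-" + wsID
	for _, n := range names {
		if n == want {
			return n, true
		}
	}
	// No match: the caller should report that explicitly instead of
	// dumping logs from an unrelated stale container.
	return "", false
}

func main() {
	// Illustrative listing with stale containers before and after the target.
	names := []string{
		"ws-stale-1",
		"ws-b4914c3d-7ce0-4e14-aa32-02da048e2ae7",
		"ws-stale-2",
	}
	name, ok := pickTargetContainer(names, "b4914c3d-7ce0-4e14-aa32-02da048e2ae7")
	fmt.Println(name, ok)
}
```

The point of returning a boolean is that a missing target is itself a diagnostic: if the container is already gone when the dump runs, the log capture ran too late.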

Author
Member

MECHANISM: Post-#2688 main run 355924 shows the advisory real-image restart lane still failing after the new 240s restart window, but this run does not demonstrate instance-token loss: the Docker provisioner logs fresh .auth_token writes before both post-restart containers. The live failure sequence is: the first post-restart /registry/register returns 400, heartbeat backfills/clears the marker once, and the restart-context send still calls ProxyA2ARequest(..., "system:restart-context", false), so logA2ASuccess persists nilIfEmpty(callerID) into activity_logs.source_id, producing the UUID cast failure and enqueue fallback.

EVIDENCE: run 355924 job 482768 at head 9a40df22: log excerpt workspace back online after restart (status=degraded) after the full Step-4 window. The harness now uses RESTART_TIMEOUT=240 in MiniMax mode (tests/e2e/test_local_provision_lifecycle_e2e.sh:144-149, :540-548). Platform logs include Provisioner: wrote auth token to volume .../.auth_token, then boot_register_failed status=400, and invalid input syntax for type uuid: "system:restart-context". Code path: restart_context.go:296 passes the synthetic caller; a2a_proxy_helpers.go:416-420 persists it as SourceID.

RECOMMENDED FIX SHAPE: keep #2530 as the broader recreate-token survival fix, but do not route this fresh red solely as token-loss. Split the residual #2680 production fixes: first normalize/drop system:* synthetic caller IDs before activity persistence in workspace-server/internal/handlers/a2a_proxy_helpers.go/activity logging; second add register-400 diagnostics around workspace-server/internal/handlers/registry.go:456-468 so the failing payload class (missing URL vs blocked private URL vs card fallback) is named in logs and tests.
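
The first of those two fixes could look roughly like the sketch below. It assumes activity_logs.source_id is a nullable UUID column, which matches the cast error in the logs. `sourceIDForActivity` is a hypothetical helper name, not the function that exists in a2a_proxy_helpers.go, and the regexp is a stand-in for whatever UUID validation the codebase already uses.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidRe matches the canonical 8-4-4-4-12 hex form of a UUID.
var uuidRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// sourceIDForActivity returns nil for empty, synthetic ("system:*"), or
// otherwise non-UUID caller IDs, so they are stored as NULL instead of
// failing the UUID cast. Real workspace IDs pass through unchanged.
func sourceIDForActivity(callerID string) *string {
	if callerID == "" ||
		strings.HasPrefix(callerID, "system:") ||
		!uuidRe.MatchString(callerID) {
		return nil
	}
	return &callerID
}

func main() {
	for _, id := range []string{
		"system:restart-context",
		"b4914c3d-7ce0-4e14-aa32-02da048e2ae7",
		"",
	} {
		fmt.Printf("%q persisted=%v\n", id, sourceIDForActivity(id) != nil)
	}
}
```

Dropping the synthetic ID loses the information that the message came from the restart-context sender; if that matters for auditing, it would need a separate non-UUID column, which is outside this sketch.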

Author
Member

MECHANISM: The fresh local-provision real-image failure is still a restart-path MiniMax/claude-code auth projection gap, not a config-volume regression. The test seeds MiniMax BYOK so the adapter should project MINIMAX_API_KEY into ANTHROPIC_AUTH_TOKEN and set the Anthropic-compatible base URL (tests/e2e/test_local_provision_lifecycle_e2e.sh:425, tests/e2e/test_local_provision_lifecycle_e2e.sh:451). The provision path now has that projection in applyPlatformManagedLLMEnv (workspace-server/internal/handlers/workspace_provision_shared.go:210, workspace-server/internal/handlers/workspace_provision.go:1167), but the post-restart container still comes back degraded with only the vendor key present, so the claude-code SDK boots without the Anthropic-shaped adapter env.

EVIDENCE: PR #2718's local-provision real-image run 357712 / job 485996 at head 2eeba14b306a57a3ef17a4b4decf4654495d4f81 passes first online, then fails Step 4 at the explicit restart-survival assertion (tests/e2e/test_local_provision_lifecycle_e2e.sh:523, tests/e2e/test_local_provision_lifecycle_e2e.sh:559). The log shows workspace back online after restart (status=degraded), then the post-failure env dump shows ANTHROPIC_AUTH_TOKEN=unset, ANTHROPIC_BASE_URL=unset, and MINIMAX_API_KEY=set. The agent error is 401 invalid api key, with llm-auth: no ANTHROPIC_AUTH_TOKEN set.

RECOMMENDED FIX SHAPE: Keep this under the restart-degraded cluster, but scope a follow-up specifically for the local-provision real-image restart/recreate path: ensure every restart/reprovision path re-runs the same BYOK provider adapter projection before container start, and add a regression asserting MiniMax restart preserves/projects ANTHROPIC_AUTH_TOKEN + ANTHROPIC_BASE_URL, not just MINIMAX_API_KEY. Responsible files are workspace-server/internal/handlers/workspace_provision_shared.go, workspace-server/internal/handlers/workspace_provision.go, any restart/recreate caller that rebuilds ProvisionWorkspacePayload, and tests/e2e/test_local_provision_lifecycle_e2e.sh.
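
One way to picture "re-run the same projection on every path" is a single helper that both the provision and the restart env builders call. This is a sketch under assumptions: `projectBYOKAdapterEnv`, `envForProvision`, and `envForRestart` are hypothetical names, the real logic lives in applyPlatformManagedLLMEnv, and the base URL constant is a placeholder rather than the real MiniMax endpoint.

```go
package main

import "fmt"

// assumedBaseURL is a placeholder for the Anthropic-compatible MiniMax
// endpoint; it is NOT the real URL.
const assumedBaseURL = "https://minimax.invalid/anthropic"

// projectBYOKAdapterEnv copies env and, when a MiniMax key is present,
// fills in the Anthropic-shaped variables the claude-code SDK reads.
// Values that are already set are left untouched.
func projectBYOKAdapterEnv(env map[string]string, baseURL string) map[string]string {
	out := make(map[string]string, len(env)+2)
	for k, v := range env {
		out[k] = v
	}
	key := out["MINIMAX_API_KEY"]
	if key == "" {
		return out // nothing to project
	}
	if out["ANTHROPIC_AUTH_TOKEN"] == "" {
		out["ANTHROPIC_AUTH_TOKEN"] = key
	}
	if out["ANTHROPIC_BASE_URL"] == "" {
		out["ANTHROPIC_BASE_URL"] = baseURL
	}
	return out
}

// Both entry points share the helper, so the restart path cannot skip it.
func envForProvision(stored map[string]string) map[string]string {
	return projectBYOKAdapterEnv(stored, assumedBaseURL)
}

func envForRestart(stored map[string]string) map[string]string {
	return projectBYOKAdapterEnv(stored, assumedBaseURL)
}

func main() {
	stored := map[string]string{"MINIMAX_API_KEY": "dummy-key"}
	env := envForRestart(stored)
	fmt.Println(env["ANTHROPIC_AUTH_TOKEN"] != "", env["ANTHROPIC_BASE_URL"])
}
```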

Author
Member

MECHANISM: Current main reproduces the local-provision real-image restart auth-projection failure. The first provision succeeds with MiniMax BYOK, then Step 4 restart-survival restarts the workspace and it returns degraded because the restarted container has MINIMAX_API_KEY but lacks the Anthropic-shaped adapter env (ANTHROPIC_AUTH_TOKEN, ANTHROPIC_BASE_URL). The expected projection path is the MiniMax provider/auth-token projection in workspace-server/internal/handlers/workspace_provision.go:1167, invoked from shared provision env assembly at workspace-server/internal/handlers/workspace_provision_shared.go:210; the E2E explicitly expects restart survival at tests/e2e/test_local_provision_lifecycle_e2e.sh:523 and :559.

EVIDENCE: molecule-core main run 358077, job 486684, head 451dd934d437b0d5edac5c352a2b4f51f53e410b, workspace b4914c3d-7ce0-4e14-aa32-02da048e2ae7. The log shows first online success, then workspace back online after restart (status=degraded). The diagnostic env dump shows ANTHROPIC_AUTH_TOKEN=unset, ANTHROPIC_BASE_URL=unset, and MINIMAX_API_KEY=set.

RECOMMENDED FIX SHAPE: Keep this in the #2680 restart-degraded cluster. Scope the fix to make the restart/recreate path re-run the same BYOK provider adapter projection that create/provision uses, before container start, and add a regression to the real-image lifecycle test that a MiniMax restart has ANTHROPIC_AUTH_TOKEN and ANTHROPIC_BASE_URL set after restart. Responsible files: workspace-server/internal/handlers/workspace_provision_shared.go, workspace-server/internal/handlers/workspace_provision.go, any restart/recreate caller rebuilding ProvisionWorkspacePayload, and tests/e2e/test_local_provision_lifecycle_e2e.sh.
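
The regression itself can be expressed as a check over the restarted container's environment. The sketch below assumes the env is available as a list of `KEY=VALUE` strings (the form a container inspect typically yields); `missingAdapterEnv` is a hypothetical helper, not part of the existing test.

```go
package main

import (
	"fmt"
	"strings"
)

// missingAdapterEnv reports which Anthropic-shaped variables are absent or
// empty in a container environment given as "KEY=VALUE" entries.
func missingAdapterEnv(containerEnv []string) []string {
	set := make(map[string]bool, len(containerEnv))
	for _, kv := range containerEnv {
		k, v, ok := strings.Cut(kv, "=")
		if ok && v != "" {
			set[k] = true
		}
	}
	var missing []string
	for _, k := range []string{"ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_BASE_URL"} {
		if !set[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Shape of the failing state from the env dump: vendor key only.
	failing := []string{"MINIMAX_API_KEY=dummy-key"}
	fmt.Println(missingAdapterEnv(failing))
}
```

Asserting on the full list rather than on MINIMAX_API_KEY alone is what would have caught this failure, since the vendor key survived the restart in every failing run.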

Author
Member

Advisory-lane consolidation update (2026-06-13): the local-provision real-image advisory lane is currently green on main 6163f6636fc8 (run 358772/job 488004). The original degraded-restart symptom has now been decomposed into covered layers: synthetic restart-context caller normalization (#2696/#2701), register-400 diagnostics (#2710), and the final stale last_register_failure_at recovery gap fixed by #2741 for #2739.

Current status: no known live residual for #2680 on main. Recommended tracking posture: leave open only as the umbrella until one or two additional post-#2741 main advisory runs stay green; then close as covered by the above PR set. New failures should get fresh issue IDs unless they reproduce the same degraded-after-healthy-heartbeat mechanism.

Author
Member

MECHANISM: Fresh RCA tick on main 1a1eeef3a1eb / run 358843 / job 488135 shows the local-provision real-image advisory red is NOT the old #2739 degraded-restart marker gap. Step 4 restart-survival passes: tests/e2e/test_local_provision_lifecycle_e2e.sh:547-560 observes status online, no config-volume error, and the container is back. The failure is Step 5 (tests/e2e/test_local_provision_lifecycle_e2e.sh:598-635): proxy returns a result envelope, but the model text is Agent error (_ResultError) — see workspace logs for details, so the MiniMax real round-trip assertion fails.

EVIDENCE: log excerpt: PASS: workspace back online after restart; then Registry heartbeat: cleared register-failure marker; then MiniMax reply: Agent error (_ResultError). The same run shows register-400 URL validation failures are recovered by the #2741 path, and restart-context eventually delivered status=200, so #2739 is closed for this run.

RECOMMENDED FIX SHAPE: Treat this as the remaining advisory MiniMax/runtime-diagnostics layer, not restart recovery. First land workspace-runtime #132 (now approved at fixed head 416a52d) so _ResultError carries sanitized stderr instead of the opaque message. Then rerun this advisory lane; if the sanitized detail points to provider auth/quota/empty-completion, route to the MiniMax credential/backend lane; if it points to runtime boot/env projection, route against workspace-server/internal/handlers/workspace_provision*.go / runtime env projection. No #2739 rollback/follow-up is indicated by this run.

Author
Member

Consolidated close-out after #2754/#2755 (2026-06-13): RESOLVED on current main 1f7f513afbcc62de74fedf7747188e7efe097685.

Mechanism: the original #2680 degraded-after-restart cluster was split into several layers. The final live residual in the latest #2680 RCA was no longer restart registration/degraded state; Step 4 restart-survival had passed and Step 5 failed with _ResultError from the MiniMax real-image canary. The durable fix landed as #2754: workspace-server/internal/handlers/workspace_provision.go now projects direct BYOK MiniMax/Anthropic-compatible ANTHROPIC_BASE_URL without the registry/proxy trailing /v1 (strings.TrimSuffix(strings.TrimRight(...), "/v1")), preventing SDK double /v1/messages. #2755 then updated the real-image advisory canary to the currently registered/available MiniMax-M3.

Evidence: current main history contains 32cee98 (merge #2754) and 8a23203 (merge #2755). The code now has the no-double-/v1 projection at workspace_provision.go:1195-1209, the regression expectation in workspace_provision_shared_test.go:1125-1126, and tests/e2e/test_local_provision_lifecycle_e2e.sh:138 uses LIFECYCLE_MODEL="MiniMax-M3". Main 1f7f513 reports Local Provision Lifecycle E2E / real image + MiniMax LLM, advisory successful in 33s, with CI / all-required successful.

Recommended state: close #2680. No residual owner remains for the degraded-restart advisory lane unless a fresh main-red appears with a new signature.
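
For readers following up later, the double-/v1 mechanism from #2754 can be illustrated as below. This is a sketch, not the code at workspace_provision.go:1195-1209: the arguments elided in the quoted expression are unknown here, so the "/" cutset passed to strings.TrimRight is an assumption, `directBaseURL` is a hypothetical name, the host is a placeholder, and the "/v1/messages" suffix stands in for what the SDK appends.

```go
package main

import (
	"fmt"
	"strings"
)

// directBaseURL strips trailing slashes and then one trailing "/v1", so a
// client that appends "/v1/messages" itself does not end up with
// ".../v1/v1/messages". The "/" cutset is an assumption.
func directBaseURL(raw string) string {
	return strings.TrimSuffix(strings.TrimRight(raw, "/"), "/v1")
}

func main() {
	raw := "https://provider.invalid/anthropic/v1/" // placeholder host

	// Without normalization the SDK-style suffix doubles the version segment.
	fmt.Println(strings.TrimRight(raw, "/") + "/v1/messages")

	// With normalization the path contains a single /v1.
	fmt.Println(directBaseURL(raw) + "/v1/messages")
}
```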

Reference: molecule-ai/molecule-core#2680