fix(restart-context): re-register post-restart cleanly within wedge window (was: my #2688 was insufficient — #2530 root cause) #2693

Closed
opened 2026-06-13 02:20:18 +00:00 by agent-dev-b · 3 comments
Member

Mechanism

The Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) lane is RED on main at 9a40df22ba (the #2688 merge). Run 355924 / job 482768 failed:

PASS: workspace reached online (status=online)
PASS: container running: ws-a8287b62-8b5f-4ee7-95ea-54d4d9f375dd
--- Step 4: restart-survival (POST /workspaces/.../restart) ---
PASS: restart accepted (provisioning)
FAIL: workspace back online after restart (status=degraded)
  expected to contain: online
  got: degraded

Earlier in the log (the smoking gun):

Registry register: workspace=... boot_register_failed status=400
restart-context: ProxyA2ARequest failed (status=0): workspace agent busy — retry after a short backoff
Registry register: workspace=... boot_register_failed status=400

The wedge detector in workspace-server/internal/handlers/registry.go:950-955 flips online → degraded on hasRecentRegisterFailure (a register failure within the last 5 minutes). After a container restart, the new agent's first boot_register fails with HTTP 400, the wedge detector flips the status to degraded, and the test polls for online until RESTART_TIMEOUT (240s in MiniMax mode) expires, still seeing degraded.
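To make the flip concrete, here is a minimal sketch of the rule as described above. It is not the code in registry.go: the `workspaceState` type, the method signatures, and `demoStart` are assumptions for illustration; only the name `hasRecentRegisterFailure` and the 5-minute window come from this issue.

```go
package main

import (
	"fmt"
	"time"
)

// registerFailureWindow is the 5-minute window quoted in the issue.
const registerFailureWindow = 5 * time.Minute

// demoStart is an arbitrary fixed instant so the demo is deterministic.
var demoStart = time.Date(2026, 6, 13, 2, 0, 0, 0, time.UTC)

// workspaceState is a hypothetical stand-in for the stored workspace row.
type workspaceState struct {
	Status              string
	LastRegisterFailure time.Time // zero value means no failure on record
}

// hasRecentRegisterFailure reports whether the last register failure is
// younger than the window. The signature is assumed for this sketch.
func (w workspaceState) hasRecentRegisterFailure(now time.Time) bool {
	if w.LastRegisterFailure.IsZero() {
		return false
	}
	return now.Sub(w.LastRegisterFailure) < registerFailureWindow
}

// effectiveStatus reports "degraded" instead of "online" while a recent
// register failure is on record; other statuses pass through unchanged.
func (w workspaceState) effectiveStatus(now time.Time) string {
	if w.Status == "online" && w.hasRecentRegisterFailure(now) {
		return "degraded"
	}
	return w.Status
}

func main() {
	w := workspaceState{Status: "online", LastRegisterFailure: demoStart}
	fmt.Println(w.effectiveStatus(demoStart.Add(10 * time.Second)))
	fmt.Println(w.effectiveStatus(demoStart.Add(6 * time.Minute)))
}
```

In this sketch a failure 10 seconds old reads as degraded, and the same failure 6 minutes later no longer affects the status.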

Root cause

This is #2530 (auth-token loss on container re-create). When the restart step re-creates the container, the workspace's bearer token is rotated (issueAndInjectToken → RevokeAllForWorkspace + IssueToken) and the OLD token is no longer valid. The fresh container tries to POST /registry/register with the old token (cached or boot-time-fetched), the registration 401s, and the wedge detector fires. The restart-survival test is meant to confirm that a workspace survives a restart, but its poll-for-online timeout expires while the workspace is stuck in degraded.

My #2688 PR (already MERGED at 9a40df22ba) explicitly called this out as out-of-scope: the PR body documented "Production-code root cause (out of scope for this PR): most likely #2530 (auth-token loss on container re-create → register-failure → 5-minute sticky-degraded). Documented in the verification findings above; tracked separately."

The test-harness fix in #2688 (RESTART_TIMEOUT=240s in MiniMax mode, exact-match diagnostic) was a partial mitigation: it gave the legitimate recovery path more time to clear. But the register-failure → degraded flip happens within seconds of the first failed register, and the failure then counts as recent for 5 minutes (300s), which outlasts the 240s poll. So 240s isn't enough, and the production-code fix needs to address the token-rotation-on-restart contract.
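The timing argument can be written out as a small calculation. This is a sketch under one loud assumption: nothing clears the failure marker before it ages out of the 5-minute window. The function names and the 5-second failure offset are hypothetical; the 5-minute window and the 240s timeout are the values quoted in this issue.

```go
package main

import (
	"fmt"
	"time"
)

const (
	stickyWindow   = 5 * time.Minute   // register-failure window from the issue
	restartTimeout = 240 * time.Second // RESTART_TIMEOUT in MiniMax mode
)

// earliestOnline returns how long after the restart the status can first read
// online again, assuming nothing clears the failure marker before it ages out.
func earliestOnline(lastFailureAfterRestart time.Duration) time.Duration {
	return lastFailureAfterRestart + stickyWindow
}

// pollTimesOut reports whether a poll that starts at the restart and gives up
// after timeout ends before the failure marker has aged out.
func pollTimesOut(lastFailureAfterRestart, timeout time.Duration) bool {
	return earliestOnline(lastFailureAfterRestart) > timeout
}

func main() {
	failure := 5 * time.Second // hypothetical: last failed register 5s after restart
	fmt.Println("earliest online:", earliestOnline(failure))
	fmt.Println("240s poll times out:", pollTimesOut(failure, restartTimeout))
}
```

Even a failure at the instant of restart leaves the marker live for 300s, so under this assumption any timeout at or below the window cannot observe online.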

Recommended fix shape (out of scope for #2688)

In workspace-server/internal/handlers/workspace_restart.go (or wherever the token rotation lives in the restart path), the new container needs to:

(a) Either: re-mint the new token AND re-inject it into the container's secret store as part of the restart provisioning (so the container boots with the new token, not a cached one).
(b) Or: defer the wedge-detector's "register failure → degraded" transition by a window (5-min sticky-degraded) AFTER a restart, so the post-restart agent has time to re-register with the rotated token before the wedge fires.
(c) Or: have the restart step pre-warm the new token in the container's auth cache so the first register uses the new token.

The 2026-05-04 register-failure-sticky-degraded is intentional (registry.go:940-960); the production-code fix needs to either: get the new container to re-register cleanly within the wedge window, or extend the wedge window for restart-survival cases.
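As one possible reading of option (b), the sketch below defers the degraded transition while the workspace is still inside a grace period after a restart. Everything here is hypothetical: the `restartAwareState` type, the 90-second `postRestartGrace`, and the rule that a later successful register clears the failure are assumptions, not existing behavior in registry.go.

```go
package main

import (
	"fmt"
	"time"
)

const (
	stickyWindow     = 5 * time.Minute  // register-failure window from the issue
	postRestartGrace = 90 * time.Second // hypothetical grace period, not a repo value
)

// demoRestart is an arbitrary fixed instant so the demo is deterministic.
var demoRestart = time.Date(2026, 6, 13, 2, 0, 0, 0, time.UTC)

// restartAwareState is a hypothetical view of the timestamps the detector needs.
type restartAwareState struct {
	LastRestart         time.Time
	LastRegisterFailure time.Time
	LastRegisterSuccess time.Time
}

// shouldDegrade applies the sticky-degraded rule, but defers it while the
// workspace is still inside the grace period that follows a restart.
func shouldDegrade(w restartAwareState, now time.Time) bool {
	if w.LastRegisterFailure.IsZero() {
		return false // nothing has failed
	}
	if now.Sub(w.LastRegisterFailure) >= stickyWindow {
		return false // failure has aged out
	}
	if w.LastRegisterSuccess.After(w.LastRegisterFailure) {
		return false // assumption: a later successful register clears the failure
	}
	if !w.LastRestart.IsZero() && now.Sub(w.LastRestart) < postRestartGrace {
		return false // still inside the post-restart grace period
	}
	return true
}

func main() {
	w := restartAwareState{
		LastRestart:         demoRestart,
		LastRegisterFailure: demoRestart.Add(10 * time.Second),
	}
	fmt.Println(shouldDegrade(w, demoRestart.Add(30*time.Second)))
	fmt.Println(shouldDegrade(w, demoRestart.Add(3*time.Minute)))
}
```

The trade-off is that a genuinely wedged agent is reported as degraded up to one grace period later, which is why the grace value would need review alongside the sticky-degraded intent.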

Test logs

  • Run: https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/355924
  • Failing job: https://git.moleculesai.app/molecule-ai/molecule-core/actions/runs/355924/jobs/482768
  • 240s timeout (the RESTART_TIMEOUT bump from #2688): still insufficient because the wedge fires within seconds of the first failed register.

Why I'm filing this instead of fixing it

  • Per feedback_no_such_thing_as_flakes: the test failure is a real bug, not a flake. Reverting #2688 would unblock main but revert the test-harness improvements (RESTART_TIMEOUT=240s, exact-match diagnostic) that are still net-positive.
  • The production-code fix (#2530 / auth-token survival on container re-create) is a substantial change to the restart provisioning contract. It needs a spec-level design (option a vs b vs c above) + CR2 review of the auth-token-rotation semantics. That's beyond the scope of a single tick.
  • Per the watchdog's resolution path: "If the failure is blocking unrelated work for >1 hour, file a follow-up issue and assign someone. Do NOT revert without a human GO per feedback_prod_apply_needs_hongming_chat_go."

Proposed follow-up

Assign to whoever owns the restart-provisioning / auth-token-rotation contract. Should land as a follow-up PR (NOT a #2688 amend; that PR is closed and the scope is settled). Reference #2530, #2688, and this issue.
Member

MECHANISM: I agree #2693 is the right follow-up issue for the main-red, but the run evidence does not support a pure #2530 stale-token/401 root cause for job 482768. In Docker mode issueAndInjectToken rotates and injects before/around container start (workspace-server/internal/handlers/workspace_provision.go:392-440), and the log shows that path ran three times. The register failures are HTTP 400 after requireWorkspaceToken has succeeded enough for authOK to let the failure timestamp mutate state (registry.go:390-393, :344-357); registry.go:338-340 explicitly classifies 400 as push URL invalid/empty, while token loss would be 401.
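The 400-versus-401 distinction can be checked mechanically against the log. The sketch below is a hypothetical triage helper, not code from the repo: `statusFromLogLine` and `likelyCause` are made-up names, and the mapping simply restates the classification argued in this comment.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// statusFromLogLine extracts N from a "boot_register_failed status=N" log line.
func statusFromLogLine(line string) (int, bool) {
	const marker = "boot_register_failed status="
	i := strings.Index(line, marker)
	if i < 0 {
		return 0, false
	}
	rest := line[i+len(marker):]
	end := 0
	for end < len(rest) && rest[end] >= '0' && rest[end] <= '9' {
		end++
	}
	status, err := strconv.Atoi(rest[:end])
	if err != nil {
		return 0, false
	}
	return status, true
}

// likelyCause maps a register failure status to the cause argued above.
func likelyCause(status int) string {
	switch status {
	case http.StatusBadRequest:
		return "payload" // 400: push URL invalid/empty, or a required field is missing
	case http.StatusUnauthorized:
		return "token" // 401: what a stale or revoked bearer token would produce
	default:
		return "unknown"
	}
}

func main() {
	line := "Registry register: workspace=... boot_register_failed status=400"
	if status, ok := statusFromLogLine(line); ok {
		fmt.Println(status, likelyCause(status))
	}
}
```

Run against the line from job 482768, this points at the payload path rather than the token path.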

EVIDENCE: Run 355924/job 482768 logs Provisioner: injected fresh auth token, then boot_register_failed status=400, not 401, then invalid input syntax for type uuid: "system:restart-context", then workspace back online after restart (status=degraded). The 400 path is consistent with RegisterPayload requiring agent_card and conditionally requiring a valid push URL (workspace-server/internal/models/workspace.go:86-104, registry.go:434-469). Separately, restart-context still calls ProxyA2ARequest(..., "system:restart-context", false) (restart_context.go:293-296), and the busy path can enqueue/log that non-UUID caller id (a2a_proxy_helpers.go:77-113), producing the observed UUID error.

RECOMMENDED FIX SHAPE: Keep #2693, but scope the minimal production fix as two boundaries: first, make post-restart register emit/log enough detail for the actual 400 cause and ensure the restarted Docker agent sends a valid push URL/agent_card after token injection; second, normalize or bypass system:* callers before UUID-only A2A queue/activity persistence. Token reinjection should remain covered by the existing Docker path, with a regression test proving the failed run is 400/payload or system-caller persistence, not a stale bearer 401.
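For the second boundary, one way to keep `system:*` callers out of UUID-only columns is to split the caller id before persistence. This is a sketch only: `splitCaller`, the nullable-UUID-plus-label shape, and the canonical-form regexp are assumptions, and the real queue/activity schema may call for a different normalization.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// canonicalUUID matches the hyphenated 8-4-4-4-12 hex form only.
var canonicalUUID = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

var errUnknownCaller = errors.New("caller id is neither a UUID nor a system:* id")

// splitCaller separates a caller id into a value that is safe for a UUID
// column (nil when the caller is synthetic) and a free-text system label.
func splitCaller(caller string) (callerUUID *string, systemCaller string, err error) {
	switch {
	case canonicalUUID.MatchString(caller):
		return &caller, "", nil
	case strings.HasPrefix(caller, "system:"):
		return nil, caller, nil // store NULL in the UUID column, keep the label
	default:
		return nil, "", errUnknownCaller
	}
}

func main() {
	for _, id := range []string{
		"system:restart-context",
		"a8287b62-8b5f-4ee7-95ea-54d4d9f375dd",
	} {
		u, sys, err := splitCaller(id)
		fmt.Println(u != nil, sys, err)
	}
}
```

Rejecting ids that are neither form keeps a malformed caller from reaching the database as a late `invalid input syntax for type uuid` error.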

Member

Advisory-lane consolidation update (2026-06-13): current main 6163f6636fc8 has Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) green (run 358772/job 488004). The restart-context UUID/register-400 cluster this issue tracks appears covered on main by the split normalization + diagnostics work (#2696/#2701/#2710 lineage) plus the later #2739 recovery fix merged via #2741.

What remains for this issue: no live failing advisory signal today. Keep it as watch-only until there are repeated post-#2741 advisory greens, then close or narrow it to a regression guard. If it recurs, the next RCA should separate (a) synthetic caller UUID writes, (b) register-400 diagnostics/URL validation, and (c) degraded->online recovery marker clearing, because those are now distinct covered layers.

Member

Consolidated close-out after #2754/#2755 (2026-06-13): RESOLVED on current main 1f7f513afbcc62de74fedf7747188e7efe097685.

Mechanism: #2693 tracked the restart-context/re-register cluster after #2688 was insufficient. Later RCA separated that from the actual live advisory failure: restart-context/registration recovery was already covered by the #2696/#2701/#2710/#2741 lineage, and the remaining main-red had shifted to the MiniMax adapter call shape/model availability path fixed by #2754/#2755. There is no current evidence that post-restart register remains dirty or that restart-context exceeds the wedge window.

Evidence: current main includes #2741-era recovery plus #2754/#2755, and the live signal is green: Local Provision Lifecycle E2E / real image + MiniMax LLM, advisory successful in 33s on main 1f7f513; CI / all-required and Platform Go are also successful. The local-provision script now uses MiniMax-M3, and the BYOK adapter env projection no longer feeds a double-/v1 base URL into the claude-code SDK.
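For readers unfamiliar with the double-/v1 failure mode, the sketch below shows the general shape of the problem and one way to avoid it. It assumes the client appends "/v1/..." to whatever base URL it is given; the function names and the example host are hypothetical and this is not the change made in #2754/#2755.

```go
package main

import (
	"fmt"
	"strings"
)

// stripVersionSuffix removes trailing slashes and one trailing "/v1" from a
// base URL, so a client that appends "/v1/..." itself does not end up
// requesting "/v1/v1/...".
func stripVersionSuffix(baseURL string) string {
	trimmed := strings.TrimRight(baseURL, "/")
	return strings.TrimSuffix(trimmed, "/v1")
}

// endpoint builds the final URL the way this sketch assumes the client does.
func endpoint(baseURL, path string) string {
	return stripVersionSuffix(baseURL) + "/v1/" + strings.TrimLeft(path, "/")
}

func main() {
	fmt.Println(endpoint("https://llm.example.invalid/v1", "messages"))
	fmt.Println(endpoint("https://llm.example.invalid/", "/messages"))
}
```

Both calls produce the same single-/v1 URL, which is the property a projection layer would want regardless of how the operator wrote the base URL.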

Recommended state: close #2693. If a future run fails specifically at re-register/restart-context again, open a fresh issue with that run/job and do not reuse this resolved mixed-root-cause ticket.

Reference: molecule-ai/molecule-core#2693