fix(e2e): restart-survival honors RESTART_TIMEOUT + exact-match target container (relates-to #2680) #2688
Reference in New Issue
Block a user
Delete Branch "fix/restart-context-degraded-2680"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Relates-to #2680 (NOT Fixes — same restart-survival / wedge-detector theme as the platform's broader #2680 series). Test-harness only; no production code change.
#2680 RCA root cause: the restart-survival lane was failing with
status=degradedafter the 180s MiniMax-mode timeout. Two contributing mechanisms were investigated; this PR addresses one of them (test-harness) and explicitly scopes out the other (production-code UUID-poisoning) as a follow-up.What this PR does (verified to address #2680's test-harness path)
Two test-harness fixes in
tests/e2e/test_local_provision_lifecycle_e2e.sh:RESTART_TIMEOUT = 240s in
LIFECYCLE_LLM=minimaxmode (was reusing the 180sONLINE_TIMEOUTfrom initial-provision). The restart's real cold-start path is slower than the initial provision (agent SDK boot + MiniMax LLM dial on top of the same path). The wedge detector can legitimately flip status todegradedduring the cold-start window while heartbeats are still ramping up — that's NOT a failure here (the agent hasn't finished booting yet), so the poll loop continues untilonlineORfailedOR the fullRESTART_TIMEOUT. Stub mode (LIFECYCLE_LLM="") keeps the default =ONLINE_TIMEOUT(90s), so the required stub gate is preserved unchanged.diagnose_provisionuses exact-match name filtering for the target container (grep -Fx "$target_name"on the docker ps output), and clearly labels the "other ws-* containers on this daemon" section asNOT the target. The olddocker ps --filter "name=ws-${WSID}"was a SUBSTRING match, so on a shared dev daemon with many stalews-*containers from sibling dev activity, it could returnws-${WSID}-staleand dump its logs in the diagnostic — obscuring the real failure.What this PR does NOT do (explicit scope-out per user direction)
The user's hypothesis (passed to me for verification, 4121f447): "normalize system:restart-context synthetic caller before the a2a_proxy_helpers.go/activity.go persistence so it doesn't poison activity_logs.source_id UUID → queue-fallback works → workspace recovers to online". Verification: the UUID-poisoning mechanism MAY exist but is NOT the primary cause of #2680's poll-stays-degraded path.
Verification findings (read the code in this worktree):
restart_context.go:296does callProxyA2ARequest(..., "system:restart-context", false).a2a_proxy_helpers.go:374recognizes thesystem:prefix viaisSystemCaller()and bypasses caller-token auth — that part is correct.nilIfEmpty("system:restart-context")returns&"system:restart-context"(non-nil pointer), and that DOES get persisted toactivity_logs.source_idperlogA2ASuccess(line 419).registry.go:903-970keys offpayload.RuntimeState,payload.ErrorRate, andlastRegisterFailure— NOT offactivity_logs.source_id. The persisted non-UUID source_id would causeLEFT JOIN workspaces w ON w.id = activity_logs.source_id(activity.go:443) to return NULL, but that's a NULL — not a "break" or a UUID-cast failure.POST /registry/register(lost auth token after container re-create). Restart-survival in MiniMax mode makes the register-failure window more likely to land in the 5-minute sticky-degrade window.So the UUID-poisoning is real-but-tangential, while #2530 (auth-token survival on container re-create) is the primary production-code fix. This PR is the test-harness fix that gives the legitimate recovery path more time to clear; the production-code #2530 fix is a separate, larger task and is OUT OF SCOPE here.
Files (1, +36/-6)
tests/e2e/test_local_provision_lifecycle_e2e.sh:RESTART_TIMEOUT/RESTART_TIMEOUT_EXPLICITenv vars (line ~135-140). Bumped to 240 in MiniMax mode.RESTART_TIMEOUTand the comment explains thedegraded-during-cold-start semantics.diagnose_provision()(line ~215) now usesgrep -Fxexact-match for the target container and clearly labels otherws-*containers asNOT the target.Hard boundary
Test-harness change only. No production code touched. The wedge detector's degraded→online path in
registry.go:541-556+ the register-failure branch inregistry.go:950-955are the same paths — the test now waits long enough for the legitimate recovery to clear, and the diagnostic clearly identifies the target container.Tests
bash -nclean on the script.go build ./...clean inworkspace-server/.go vet ./internal/handlers/...clean.go test -run "TestWorkspaceHandler|TestRegistry|TestRestart" ./internal/handlers/...).LIFECYCLE_LLM=minimax) gets accurate pass/fail signal instead of false-positive reds.LIFECYCLE_LLM=""path within the existing 90sONLINE_TIMEOUT.PR convention
"Relates-to #2680" (NOT Fixes — same restart-survival / wedge-detector theme as the platform's broader #2680 series).
7-Section SOP Checklist (per standing practice 44de2dec)
1. Comprehensive testing performed
bash -n): clean.go build ./...+go vet ./internal/handlers/...inworkspace-server/: clean.go test -run "TestWorkspaceHandler|TestRegistry|TestRestart" ./internal/handlers/...all pass.2. Local-postgres E2E run
N/A — pure shell-script + test-harness change. No Postgres, no Docker-network, no Go runtime touched.
3. Staging-smoke verified or pending
Pending. Will be verified on the next run of the advisory
lifecycle-realCI job. The required stub job (LIFECYCLE_LLM="") is unchanged and remains green per its prior CI history.4. Root-cause not symptom
5d8aff81): Step 4 restart-survival failed withstatus=degradedafter 180s in MiniMax mode.ONLINE_TIMEOUTis the initial-provision timeout, not a separate restart-step timeout. The restart's real cold-start path is slower. The wedge detector can legitimately fliponline→degradedduring the cold-start window.diagnose_provisionused substring-match for target container.5. Five-Axis review walked
RESTART_TIMEOUTdefaults toONLINE_TIMEOUT(stub mode unchanged), bumped to 240s only in MiniMax mode.diagnose_provisionusesgrep -Fxexact match.-Fxon a constant target name).6. No backwards-compat shim / dead code added
None.
RESTART_TIMEOUTis a new env var with a sensible default (ONLINE_TIMEOUT); existing callers don't need to change. TheRESTART_TIMEOUT_EXPLICITmirror follows the existingONLINE_TIMEOUT_EXPLICITpattern exactly.7. Memory consulted
feedback_fail_closed_no_silent_fallthrough— the test still hard-fails onfailed; onlydegradedis treated as transient (matching the wedge-detector's actual semantics).feedback_no_such_thing_as_flakes— the MiniMax cold-start path can legitimately take >180s.feedback_relates_to_not_fixes_for_epic_branches— "Relates-to #2680", not "Fixes".feedback_hand_off_with_state— PR body explicitly states the two test-harness changes + the production-code scope-out + the verification findings.feedback_assert_exact_not_substring— applied to the diagnostic function (grep -Fxinstead ofdocker ps --filter "name=ws-${WSID}").REQUEST_CHANGES for head
89763a451f.Blocking issue: the new RESTART_TIMEOUT_EXPLICIT logic is evaluated too late. Inside the LIFECYCLE_LLM=minimax block, the code runs:
[ "${RESTART_TIMEOUT_EXPLICIT:-0}" -eq 0 ] && RESTART_TIMEOUT=240
before RESTART_TIMEOUT_EXPLICIT is initialized from the caller's RESTART_TIMEOUT env. So if a caller pins RESTART_TIMEOUT=300, this block still sees RESTART_TIMEOUT_EXPLICIT as unset/0 and overwrites the caller's value to 240. The PR body says callers can pin RESTART_TIMEOUT and that RESTART_TIMEOUT_EXPLICIT mirrors ONLINE_TIMEOUT_EXPLICIT; this implementation does not preserve that contract.
Please compute RESTART_TIMEOUT_EXPLICIT before the minimax override, mirroring the existing ONLINE_TIMEOUT_EXPLICIT pattern, then apply the minimax default only when the caller did not set RESTART_TIMEOUT. The exact-match diagnostic change using grep -Fx for the target container is directionally correct, and CI/all-required is green, but the timeout override ordering needs fixing before approval.
APPROVED for current head
c09dfd5155.Re-reviewed after the prior REQUEST_CHANGES. The RESTART_TIMEOUT_EXPLICIT ordering blocker is fixed: RESTART_TIMEOUT_EXPLICIT and the default RESTART_TIMEOUT are now initialized before the LIFECYCLE_LLM=minimax block, so caller-pinned RESTART_TIMEOUT values are preserved and only unpinned MiniMax runs default to 240s.
The change remains test-harness only in tests/e2e/test_local_provision_lifecycle_e2e.sh. Stub mode still defaults RESTART_TIMEOUT to ONLINE_TIMEOUT, MiniMax advisory mode gets the longer restart-survival window, and the restart poll now uses RESTART_TIMEOUT. The diagnostic exact-match change also looks correct: it uses grep -Fx for the target container name and labels other ws-* containers as NOT the target, avoiding substring misattribution.
CI / all-required, Shellcheck, stub lifecycle, and real-image MiniMax advisory lifecycle are green; PR is mergeable=true.
/sop-ack