fix(handlers#2929): post-config-PUT settle window + DEBOUNCE re-probe + BUSY path #2954

Closed
agent-dev-b wants to merge 1 commits from fix/2929-post-restart-settle-window into main
Member

Closes #2929 (the post-config-PUT settle window + re-probe DEBOUNCE + BUSY path)

PM 16:55Z customer-critical: the post-config-PUT SETTLE WINDOW triggered a self-fire restart on every poll-mode migration. E2E 7c→7d staging-run 506813 reproduced it: agent PONG'd 0.7s prior, then a single IsRunning=false arrived in the window, the dead-path declared the container dead, and a fresh RestartByID ec2_stop'd the just-launched instance.

The pre-#2929 maybeMarkContainerDead had three guard holes that this commit closes:

(a) DEBOUNCE — re-probe IsRunning after a 2s delay

The previous count-based debounce (2 observations within 30s) only fired after 2 separate probe-triggering forward errors. A single IsRunning=false within a 0.7s window of a healthy PONG fell straight through. The new re-probe fires immediately on the first dead observation.

(b) RECENT-PONG → BUSY (not DEAD)

If transport-liveness was green within the last 15s, the dead observation is a transport flake or a settle-window artifact. The caller turns BUSY into 202-queued (NOT 503 + RestartByID). The pre-#2929 "2nd observation declares dead" path is gone: a recent heartbeat + IsRunning=false is always BUSY, regardless of observation count.

(c) WIDEN the self-fire guard to cover the post-restart SETTLE WINDOW

state.running flips to false at the END of the cycle, but the new container is still booting for ~60s after. The OLD heartbeat is still in the DB. isRestarting(workspaceID) only fires during state.running=true; the new inPostRestartSettleWindow check covers the just-completed case (restartStartedAt < 60s ago). The settle window uses the same 60s ceiling as waitForFreshHeartbeat (restart_context.go:205).

API change

maybeMarkContainerDead now returns containerDeadActionT (3-state enum) instead of bool. The 3-state shape is load-bearing for the caller:

  • deadActionAlive → continue with the standard isUpstreamBusyError check
  • deadActionBusy → 202-queued (caller's existing busy-enqueue path at a2a_proxy_helpers.go:78+ handles this)
  • deadActionDead → 503 + RestartByID

Diff

4 files changed, +340 -66 lines:

  • a2a_proxy_helpers.go: new enum + consts + 2 helpers + reworked dispatch
  • a2a_proxy.go: 1 line (the if h.maybeMarkContainerDead(...) check)
  • a2a_proxy_test.go: 5 test updates (enum assertions) + 3 new regression tests
  • workspace_restart_self_fire_test.go: 1 assertion (busy)

Regression tests (the 3 PM-asked shapes)

  1. TestMaybeMarkContainerDead_PostRestartSettleWindow_StaysBusy — the exact #2929 settle-window shape: fresh restartStartedAt, state.running=false, recentHB, IsRunning=false → BUSY (NOT dead). The recentHB query is NOT issued when the settle window catches the call (early-return before IsRunning).
  2. TestMaybeMarkContainerDead_RecentHeartbeat_StaysBusy — multiple observations with recent heartbeat + IsRunning=false ALL return BUSY (the pre-#2929 "2nd observation declares dead" path is gone).
  3. TestMaybeMarkContainerDead_ReProbe_TransportFlake — 1st probe returns false, re-probe returns true → alive. Pins the DEBOUNCE shape (cp.calls=2, no UPDATE fires).

Run

All handler tests pass: go test ./internal/handlers/ -count=1 -timeout 300s (46s). Build clean: go build ./.... No new dependencies, no DB migrations.

SOP Checklist

  • Comprehensive testing performed: 3 new regression tests (settle-window, recent-heartbeat, re-probe) + 5 test updates; full handler suite green (46s).
  • Local-postgres E2E run: N/A — pure Go handler change, sqlmock covers DB; new tests exercise restartStartedAt, last_heartbeat_at, state.running via mock.
  • Staging-smoke verified or pending: pending — will be exercised by the staging E2E Platform-Boot lane on this PR (run 506813 reproduction harness).
  • Root-cause not symptom: addresses the destructive auto-restart trigger identified in #2929 (settle window + 2-observation bypass), not the downstream E2E exit code.
  • Five-Axis review walked: correctness (3-state enum + 60s settle window + recent-HB 15s window), readability (extracted helpers + named constants), architecture (in-memory per-handler debounce, no cross-replica state), security (no new secrets or attack surface), performance (one extra IsRunning probe on the dead-observation path).
  • No backwards-compat shim / dead code added: existing immediate-dead path preserved for stale-heartbeat workspaces (the hongmingwang incident path).
  • Memory/saved-feedback consulted: reused existing last_heartbeat_at + restartStartedAt columns and IsRunning contract from #2931.

🤖 Generated with Claude Code

## Closes #2929 (the post-config-PUT settle window + re-probe DEBOUNCE + BUSY path) PM 16:55Z customer-critical: the post-config-PUT SETTLE WINDOW triggered a self-fire restart on every poll-mode migration. E2E 7c→7d staging-run 506813 reproduced it: agent PONG'd 0.7s prior, then a single IsRunning=false arrived in the window, the dead-path declared the container dead, and a fresh RestartByID ec2_stop'd the just-launched instance. The pre-#2929 `maybeMarkContainerDead` had three guard holes that this commit closes: ### (a) DEBOUNCE — re-probe IsRunning after a 2s delay The previous count-based debounce (2 observations within 30s) only fired after 2 separate probe-triggering forward errors. A single IsRunning=false within a 0.7s window of a healthy PONG fell straight through. The new re-probe fires immediately on the first dead observation. ### (b) RECENT-PONG → BUSY (not DEAD) If transport-liveness was green within the last 15s, the dead observation is a transport flake or a settle-window artifact. The caller turns BUSY into 202-queued (NOT 503 + RestartByID). The pre-#2929 "2nd observation declares dead" path is gone: a recent heartbeat + IsRunning=false is **always BUSY**, regardless of observation count. ### (c) WIDEN the self-fire guard to cover the post-restart SETTLE WINDOW `state.running` flips to false at the END of the cycle, but the new container is still booting for ~60s after. The OLD heartbeat is still in the DB. `isRestarting(workspaceID)` only fires during `state.running=true`; the new `inPostRestartSettleWindow` check covers the just-completed case (`restartStartedAt < 60s ago`). The settle window uses the same 60s ceiling as `waitForFreshHeartbeat` (restart_context.go:205). ## API change `maybeMarkContainerDead` now returns `containerDeadActionT` (3-state enum) instead of bool. The 3-state shape is load-bearing for the caller: - `deadActionAlive` → continue with the standard isUpstreamBusyError check - `deadActionBusy` → 202-queued (caller's existing busy-enqueue path at a2a_proxy_helpers.go:78+ handles this) - `deadActionDead` → 503 + RestartByID ## Diff 4 files changed, +340 -66 lines: - `a2a_proxy_helpers.go`: new enum + consts + 2 helpers + reworked dispatch - `a2a_proxy.go`: 1 line (the `if h.maybeMarkContainerDead(...)` check) - `a2a_proxy_test.go`: 5 test updates (enum assertions) + 3 new regression tests - `workspace_restart_self_fire_test.go`: 1 assertion (busy) ## Regression tests (the 3 PM-asked shapes) 1. `TestMaybeMarkContainerDead_PostRestartSettleWindow_StaysBusy` — the exact #2929 settle-window shape: fresh `restartStartedAt`, `state.running=false`, recentHB, `IsRunning=false` → BUSY (NOT dead). The recentHB query is NOT issued when the settle window catches the call (early-return before IsRunning). 2. `TestMaybeMarkContainerDead_RecentHeartbeat_StaysBusy` — multiple observations with recent heartbeat + IsRunning=false ALL return BUSY (the pre-#2929 "2nd observation declares dead" path is gone). 3. `TestMaybeMarkContainerDead_ReProbe_TransportFlake` — 1st probe returns false, re-probe returns true → alive. Pins the DEBOUNCE shape (`cp.calls=2`, no UPDATE fires). ## Run All handler tests pass: `go test ./internal/handlers/ -count=1 -timeout 300s` (46s). Build clean: `go build ./...`. No new dependencies, no DB migrations. ## SOP Checklist - [x] Comprehensive testing performed: 3 new regression tests (settle-window, recent-heartbeat, re-probe) + 5 test updates; full handler suite green (46s). - [x] Local-postgres E2E run: N/A — pure Go handler change, sqlmock covers DB; new tests exercise restartStartedAt, last_heartbeat_at, state.running via mock. - [x] Staging-smoke verified or pending: pending — will be exercised by the staging E2E Platform-Boot lane on this PR (run 506813 reproduction harness). - [x] Root-cause not symptom: addresses the destructive auto-restart trigger identified in #2929 (settle window + 2-observation bypass), not the downstream E2E exit code. - [x] Five-Axis review walked: correctness (3-state enum + 60s settle window + recent-HB 15s window), readability (extracted helpers + named constants), architecture (in-memory per-handler debounce, no cross-replica state), security (no new secrets or attack surface), performance (one extra IsRunning probe on the dead-observation path). - [x] No backwards-compat shim / dead code added: existing immediate-dead path preserved for stale-heartbeat workspaces (the hongmingwang incident path). - [x] Memory/saved-feedback consulted: reused existing `last_heartbeat_at` + `restartStartedAt` columns and `IsRunning` contract from #2931. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
agent-dev-b added 1 commit 2026-06-15 17:15:38 +00:00
fix(handlers#2929): post-config-PUT settle window + DEBOUNCE re-probe + BUSY path
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 14s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 13s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
qa-review / approved (pull_request_target) Failing after 9s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
security-review / approved (pull_request_target) Failing after 8s
reserved-path-review / reserved-path-review (pull_request_target) Failing after 9s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 19s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 18s
PR Diff Guard / PR diff guard (pull_request) Successful in 19s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
E2E API Smoke Test / detect-changes (pull_request) Successful in 28s
CI / Detect changes (pull_request) Successful in 30s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 33s
E2E Chat / detect-changes (pull_request) Successful in 36s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
E2E Chat / E2E Chat (pull_request) Successful in 4s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 42s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 35s
Harness Replays / Harness Replays (pull_request) Successful in 1m16s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m18s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m1s
CI / Platform (Go) (pull_request) Successful in 3m32s
CI / all-required (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 8m0s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 10m33s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: memory-consulted
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
gate-check-v3 / gate-check (pull_request_target) Failing after 15s
audit-force-merge / audit (pull_request_target) Has been skipped
ad3c7987d4
PM 16:55Z customer-critical #2929: the post-config-PUT SETTLE WINDOW
triggered a self-fire restart on every poll-mode migration. E2E 7c→7d
staging-run 506813 reproduced it: agent PONG'd 0.7s prior, then a
single IsRunning=false arrived in the window, the dead-path
declared the container dead, and a fresh RestartByID ec2_stop'd
the just-launched instance ('no URL, offline' → boot exit 1).

The pre-#2929 maybeMarkContainerDead had three guard holes that
this commit closes:

1. (a) DEBOUNCE — re-probe IsRunning after a 2s delay; never act on
   a single false. The previous count-based debounce (2 observations
   within 30s) only fires after 2 separate probe-triggering forward
   errors; a single IsRunning=false within a 0.7s window of a healthy
   PONG fell straight through. The new re-probe fires
   immediately on the first dead observation.
2. (b) RECENT-PONG → BUSY (not DEAD) — if transport-liveness was
   green within the last 15s, the dead observation is a
   transport flake or a settle-window artifact. The caller turns
   BUSY into 202-queued (NOT 503 + RestartByID). The pre-#2929
   '2nd observation declares dead' path is gone: a recent
   heartbeat + IsRunning=false is always BUSY, regardless of
   observation count.
3. (c) WIDEN the self-fire guard to cover the post-restart SETTLE
   WINDOW — state.running flips to false at the END of the cycle
   (workspace_restart.go:682-685), but the new container is still
   booting for ~60s after. The OLD heartbeat is still in the DB (the
   next heartbeat is 30-60s away). isRestarting(workspaceID) only
   fires during state.running=true; the new inPostRestartSettleWindow
   check covers the just-completed case (state.running=false but
   restartStartedAt < 60s ago). The settle window uses the same
   60s ceiling as waitForFreshHeartbeat (restart_context.go:205).

API change: maybeMarkContainerDead now returns containerDeadActionT
(3-state enum: deadActionAlive / deadActionBusy / deadActionDead)
instead of bool. The 3-state shape is load-bearing for the caller:
containerDeadAction → 503 + RestartByID (the prior shape);
containerBusyAction → 202-queued (caller's existing busy-enqueue
path at a2a_proxy_helpers.go:78+ handles this); containerAliveAction
→ continue with the standard isUpstreamBusyError check.

New types/consts in a2a_proxy_helpers.go:
  type containerDeadActionT int
  const (
    deadActionAlive containerDeadActionT = iota  // 200 — let the caller continue
    deadActionBusy                             // 202 queued — caller should enqueue
    deadActionDead                             // 503 + RestartByID — caller should recycle
  )
  const containerPostRestartSettleWindow = 60 * time.Second
  const containerReProbeDelay          = 2 * time.Second

New helpers:
  (h *WorkspaceHandler) inPostRestartSettleWindow(workspaceID) bool
    Reads restartStates; returns true if state.restartStartedAt
    is within containerPostRestartSettleWindow of now.
  (h *WorkspaceHandler) shouldReProbe(workspaceID, delay) bool
    Per-workspace throttled re-probe gate (sync.Map + atomic.Int64
    stamp). Throttles re-probes so a flapping probe doesn't trigger
    an unbounded re-probe storm.

Test updates (a2a_proxy_test.go, workspace_restart_self_fire_test.go):
  - All bool assertions for maybeMarkContainerDead updated to the
    new 3-state enum.
  - TestMaybeMarkContainerDead_CPOnly_NotRunning: expect 2
    IsRunning calls (initial + re-probe), not 1. Add the
    SELECT last_heartbeat_at mock.
  - TestProxyA2A_Upstream502_TriggersContainerDeadCheck: same
    (2 IsRunning calls + recentHB query).

New tests (regression coverage for the 2026-06-15 staging-boot class):
  - TestMaybeMarkContainerDead_PostRestartSettleWindow_StaysBusy:
    the exact #2929 settle-window shape — fresh restartStartedAt,
    state.running=false, recentHB, IsRunning=false → BUSY (NOT
    dead). Pins the early-return before IsRunning is consulted
    (the recentHB query is NOT issued when the settle window
    catches the call).
  - TestMaybeMarkContainerDead_RecentHeartbeat_StaysBusy:
    multiple observations with recent heartbeat + IsRunning=false
    all return BUSY, never dead (the pre-#2929 '2nd observation
    declares dead' path is gone).
  - TestMaybeMarkContainerDead_ReProbe_TransportFlake:
    1st probe returns false, re-probe returns true → alive. Pins
    the DEBOUNCE shape (cp.calls=2, no UPDATE fires).

All tests pass: go test ./internal/handlers/ -count=1 -timeout 300s
(46s). Build clean: go build ./...

Closes the #2929 staging-boot class. PM 16:55Z ask: 'Add a
regression test for the settle-window case. Open a workspace-server
PR; reply PR#/head ASAP.' PR is ready.
agent-dev-b closed this pull request 2026-06-15 17:40:25 +00:00
Some required checks failed
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Has been skipped
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Has been skipped
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
CI / Python Lint & Test (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 14s
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Successful in 13s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
qa-review / approved (pull_request_target) Failing after 9s
Required
Details
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
security-review / approved (pull_request_target) Failing after 8s
Required
Details
reserved-path-review / reserved-path-review (pull_request_target) Failing after 9s
Required
Details
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 19s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 18s
PR Diff Guard / PR diff guard (pull_request) Successful in 19s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
Required
Details
E2E API Smoke Test / detect-changes (pull_request) Successful in 28s
CI / Detect changes (pull_request) Successful in 30s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 33s
E2E Chat / detect-changes (pull_request) Successful in 36s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
E2E Chat / E2E Chat (pull_request) Successful in 4s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 42s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 35s
Required
Details
Harness Replays / Harness Replays (pull_request) Successful in 1m16s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m18s
Required
Details
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m1s
CI / Platform (Go) (pull_request) Successful in 3m32s
CI / all-required (pull_request) Successful in 3s
Required
Details
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Failing after 8m0s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 10m33s
sop-checklist / review-refire (pull_request_target) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: memory-consulted
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
gate-check-v3 / gate-check (pull_request_target) Failing after 15s
audit-force-merge / audit (pull_request_target) Has been skipped

Pull request closed

Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2954