From 235aca9908c47d9a76687d7c66586e43add3acfc Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Thu, 30 Apr 2026 12:05:40 -0700
Subject: [PATCH] =?UTF-8?q?fix(boot):=20always=20start=20health-sweep=20go?=
 =?UTF-8?q?routine=20=E2=80=94=20SaaS=20tenants=20need=20it=20for=20extern?=
 =?UTF-8?q?al-runtime=20liveness?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on
`prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker
provisioner is never initialized — only `cpProv`. So the goroutine never
started, and `sweepStaleRemoteWorkspaces` (which transitions
runtime='external' workspaces from 'online' to 'awaiting_agent' when
their last_heartbeat_at goes stale) never ran.

Net effect on production: every external-runtime workspace on SaaS that
lost its agent stayed 'online' indefinitely instead of falling back to
'awaiting_agent' (re-registrable). The drift gate (#2388) caught the
migration side and #2382 fixed the SQL writes, but this
orchestration-side gate slipped through both because there was no
SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent
transition.

Caught by #2392 (live staging external-runtime regression E2E) failing
at step 6 — 180s with no heartbeat, expected status=awaiting_agent, got
online.

Fix: drop the `if prov != nil` gate. `StartHealthSweep` already handles
a nil checker correctly (healthsweep.go:50-71): the Docker sweep is
gated inside the loop, the remote sweep always runs. Test coverage
already exists at TestStartHealthSweep_NilCheckerRunsRemoteSweep.

After this lands and tenants redeploy, #2392 step 6 passes and the
regression coverage closes.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 workspace-server/cmd/server/main.go | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/workspace-server/cmd/server/main.go b/workspace-server/cmd/server/main.go
index d0d5ae57..f620537b 100644
--- a/workspace-server/cmd/server/main.go
+++ b/workspace-server/cmd/server/main.go
@@ -223,13 +223,24 @@ func main() {
 		registry.StartLivenessMonitor(c, onWorkspaceOffline)
 	})
 
-	// Proactive container health sweep — detects dead containers faster than Redis TTL.
-	// Checks all "online" workspaces against Docker every 15 seconds.
-	if prov != nil {
-		go supervised.RunWithRecover(ctx, "health-sweep", func(c context.Context) {
-			registry.StartHealthSweep(c, prov, 15*time.Second, onWorkspaceOffline)
-		})
-	}
+	// Proactive health sweep — two passes per tick:
+	//  1. Docker-side: checks "online" workspaces against the local Docker
+	//     daemon (only runs when prov is non-nil, i.e. self-hosted mode).
+	//  2. Remote-side: scans runtime='external' rows whose last_heartbeat_at
+	//     is past REMOTE_LIVENESS_STALE_AFTER and flips them to
+	//     awaiting_agent. Runs regardless of provisioner mode — SaaS
+	//     tenants need this even though they don't run Docker locally,
+	//     because external-runtime workspaces are operator-managed and
+	//     the platform-side liveness sweep is the only thing that
+	//     transitions them off 'online' when the operator's CLI dies.
+	//
+	// Pre-2026-04-30 this goroutine was gated on prov != nil, which silently
+	// disabled the remote-side sweep on every SaaS tenant. The function in
+	// healthsweep.go has always handled a nil checker correctly; only the
+	// orchestration was wrong. See #2392's CI failure for the trace.
+	go supervised.RunWithRecover(ctx, "health-sweep", func(c context.Context) {
+		registry.StartHealthSweep(c, prov, 15*time.Second, onWorkspaceOffline)
+	})
 
 	// Orphan-container reconcile sweep — finds running containers
 	// whose workspace row is already status='removed' and stops
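
The per-pass gating pattern the fix relies on can be sketched in isolation. This is a minimal illustration, not the real `StartHealthSweep`: the `checker` interface, `sweepOnce`, and both function names are hypothetical stand-ins for the workspace-server internals described above.

```go
package main

import "fmt"

// checker is a hypothetical stand-in for the Docker-side health checker
// (the real one lives behind prov in workspace-server; this sketches the
// gating shape only, not the actual API).
type checker interface {
	sweepOnline()
}

// sweepOnce models one tick as the commit message describes it: the
// Docker pass is skipped when the checker is nil, while the remote pass
// runs unconditionally. Gating per-pass inside the loop body, rather
// than gating the whole goroutine on prov != nil, is what keeps the
// remote sweep alive on SaaS tenants.
func sweepOnce(c checker, remoteSweep func()) (dockerRan bool) {
	if c != nil { // Docker-side pass: self-hosted mode only
		c.sweepOnline()
		dockerRan = true
	}
	remoteSweep() // remote-side pass: always runs
	return dockerRan
}

func main() {
	remoteCalls := 0
	// SaaS mode: no local provisioner, so the checker is nil.
	ran := sweepOnce(nil, func() { remoteCalls++ })
	fmt.Println(ran, remoteCalls) // false 1: remote sweep still ran
}
```

The pre-fix bug corresponds to wrapping the entire `sweepOnce` call in `if c != nil { ... }`, which starves the remote pass whenever the Docker checker is absent.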