From 235aca9908c47d9a76687d7c66586e43add3acfc Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Thu, 30 Apr 2026 12:05:40 -0700
Subject: [PATCH] =?UTF-8?q?fix(boot):=20always=20start=20health-sweep=20go?=
 =?UTF-8?q?routine=20=E2=80=94=20SaaS=20tenants=20need=20it=20for=20extern?=
 =?UTF-8?q?al-runtime=20liveness?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pre-fix, cmd/server/main.go gated the entire health-sweep goroutine on
`prov != nil`. On SaaS tenants (`MOLECULE_ORG_ID` set) the local Docker
provisioner is never initialized — only `cpProv`. So the goroutine never
started, and `sweepStaleRemoteWorkspaces` (which transitions
runtime='external' workspaces from 'online' to 'awaiting_agent' when
their last_heartbeat_at goes stale) never ran.

Net effect on production: every external-runtime workspace on SaaS that
lost its agent stayed 'online' indefinitely instead of falling back to
'awaiting_agent' (re-registrable). The drift gate (#2388) caught the
migration side and #2382 fixed the SQL writes, but this
orchestration-side gate slipped through both because there was no
SaaS-mode E2E coverage on the heartbeat-loss → awaiting_agent
transition.

Caught by #2392 (live staging external-runtime regression E2E) failing
at step 6 — 180s with no heartbeat, expected status=awaiting_agent, got
online.

Fix: drop the `if prov != nil` gate. `StartHealthSweep` already handles
a nil checker correctly (healthsweep.go:50-71): the Docker sweep is
gated inside the loop, the remote sweep always runs. Test coverage
already exists at TestStartHealthSweep_NilCheckerRunsRemoteSweep.

After this lands and tenants redeploy, #2392 step 6 passes and the
regression coverage closes.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
 workspace-server/cmd/server/main.go | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/workspace-server/cmd/server/main.go b/workspace-server/cmd/server/main.go
index d0d5ae57..f620537b 100644
--- a/workspace-server/cmd/server/main.go
+++ b/workspace-server/cmd/server/main.go
@@ -223,13 +223,24 @@ func main() {
 		registry.StartLivenessMonitor(c, onWorkspaceOffline)
 	})
 
-	// Proactive container health sweep — detects dead containers faster than Redis TTL.
-	// Checks all "online" workspaces against Docker every 15 seconds.
-	if prov != nil {
-		go supervised.RunWithRecover(ctx, "health-sweep", func(c context.Context) {
-			registry.StartHealthSweep(c, prov, 15*time.Second, onWorkspaceOffline)
-		})
-	}
+	// Proactive health sweep — two passes per tick:
+	//  1. Docker-side: checks "online" workspaces against the local Docker
+	//     daemon (only runs when prov is non-nil, i.e. self-hosted mode).
+	//  2. Remote-side: scans runtime='external' rows whose last_heartbeat_at
+	//     is past REMOTE_LIVENESS_STALE_AFTER and flips them to
+	//     awaiting_agent. Runs regardless of provisioner mode — SaaS
+	//     tenants need this even though they don't run Docker locally,
+	//     because external-runtime workspaces are operator-managed and
+	//     the platform-side liveness sweep is the only thing that
+	//     transitions them off 'online' when the operator's CLI dies.
+	//
+	// Pre-2026-04-30 this goroutine was gated on prov != nil, which silently
+	// disabled the remote-side sweep on every SaaS tenant. The function in
+	// healthsweep.go has always handled a nil checker correctly; only the
+	// orchestration was wrong. See #2392's CI failure for the trace.
+	go supervised.RunWithRecover(ctx, "health-sweep", func(c context.Context) {
+		registry.StartHealthSweep(c, prov, 15*time.Second, onWorkspaceOffline)
+	})
 
 	// Orphan-container reconcile sweep — finds running containers
 	// whose workspace row is already status='removed' and stops
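
The per-pass gating pattern the fix relies on can be sketched in isolation. This is a minimal illustration, not the real `StartHealthSweep`: the `checker` interface, `sweepOnce`, and both function names are hypothetical stand-ins for the workspace-server internals described above.

```go
package main

import "fmt"

// checker is a hypothetical stand-in for the Docker-side health checker
// (the real one lives behind prov in workspace-server; this sketches the
// gating shape only, not the actual API).
type checker interface {
	sweepOnline()
}

// sweepOnce models one tick as the commit message describes it: the
// Docker pass is skipped when the checker is nil, while the remote pass
// runs unconditionally. Gating per-pass inside the loop body, rather
// than gating the whole goroutine on prov != nil, is what keeps the
// remote sweep alive on SaaS tenants.
func sweepOnce(c checker, remoteSweep func()) (dockerRan bool) {
	if c != nil { // Docker-side pass: self-hosted mode only
		c.sweepOnline()
		dockerRan = true
	}
	remoteSweep() // remote-side pass: always runs
	return dockerRan
}

func main() {
	remoteCalls := 0
	// SaaS mode: no local provisioner, so the checker is nil.
	ran := sweepOnce(nil, func() { remoteCalls++ })
	fmt.Println(ran, remoteCalls) // false 1: remote sweep still ran
}
```

The pre-fix bug corresponds to wrapping the entire `sweepOnce` call in `if c != nil { ... }`, which starves the remote pass whenever the Docker checker is absent.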