Pre-fix: workspace-server's provision-timeout sweep was hardcoded
at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245)
correctly gives hermes 25 min for cold-boot (hermes installs
include apt + uv + Python venv + Node + hermes-agent — 13–25 min on
slow apt mirrors is normal). The two timeout systems disagreed:
the watcher would happily wait 25 min, but the workspace-server's
10-min sweep killed healthy hermes boots mid-install at 10 min and
marked them failed.
Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z
created a hermes workspace, EC2 cloud-init was visibly making
progress on apt-installs (libcjson1, libmbedcrypto7t64) when the
sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The
test threw "Workspace failed: " (empty error from sql.NullString
serialization) and CI failed on a healthy boot.
Fix: provisioningTimeoutFor(runtime) — same shape as the CP's
bootstrapTimeoutFn:
- hermes: 30 min (watcher's 25 min + 5 min slack)
- others: 10 min (unchanged — claude-code/langgraph/etc. boot
in <5 min, 10 min is plenty)
PROVISION_TIMEOUT_SECONDS env override still works (applies to all
runtimes — operators who care about the runtime distinction
shouldn't use the override anyway).
Sweep query change: pulls (id, runtime, age_sec) per row instead
of pre-filtering by age in SQL. Per-row Go evaluation picks the
correct timeout. Slightly more rows scanned but bounded by the
status='provisioning' partial index — workspaces in flight, not
historical.
Tests:
- TestProvisioningTimeout_RuntimeAware — locks in the per-runtime
mapping
- TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at
11 min must NOT be flipped
- TestSweepStuckProvisioning_HermesPastDeadline — hermes at
31 min IS flipped, payload includes runtime
- Existing tests updated for the new query shape
Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... all green
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>