molecule-core/workspace-server/internal/registry
Hongming Wang be1beff4a0 fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min
Pre-fix: workspace-server's provision-timeout sweep was hardcoded
at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245)
correctly gives hermes 25 min for cold-boot (hermes installs
include apt + uv + Python venv + Node + hermes-agent — 13–25 min on
slow apt mirrors is normal). The two timeout systems disagreed:
the watcher would happily wait 25 min, but the workspace-server's
10-min sweep killed healthy hermes boots mid-install at 10 min and
marked them failed.

Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z
created a hermes workspace, EC2 cloud-init was visibly making
progress on apt-installs (libcjson1, libmbedcrypto7t64) when the
sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The
test threw "Workspace failed: " (empty error from sql.NullString
serialization) and CI failed on a healthy boot.

Fix: provisioningTimeoutFor(runtime) — same shape as the CP's
bootstrapTimeoutFn:
  - hermes:  30 min (watcher's 25 min + 5 min slack)
  - others:  10 min (unchanged — claude-code/langgraph/etc. boot
                     in <5 min, 10 min is plenty)

PROVISION_TIMEOUT_SECONDS env override still works (applies to all
runtimes — operators who care about the runtime distinction
shouldn't use the override anyway).

Sweep query change: pulls (id, runtime, age_sec) per row instead
of pre-filtering by age in SQL. Per-row Go evaluation picks the
correct timeout. Slightly more rows scanned but bounded by the
status='provisioning' partial index — workspaces in flight, not
historical.

Tests:
  - TestProvisioningTimeout_RuntimeAware — locks in the per-runtime
    mapping
  - TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at
    11 min must NOT be flipped
  - TestSweepStuckProvisioning_HermesPastDeadline — hermes at
    31 min IS flipped, payload includes runtime
  - Existing tests updated for the new query shape

Verified:
  - go build ./... clean
  - go vet ./... clean
  - go test ./... all green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 01:44:09 -07:00
..
access_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
access.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
healthsweep_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
healthsweep.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
hibernation_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
hibernation.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
liveness_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
liveness.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
provisiontimeout_test.go fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min 2026-04-26 01:44:09 -07:00
provisiontimeout.go fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min 2026-04-26 01:44:09 -07:00