molecule-core


Hongming Wang be1beff4a0 fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min	2026-04-26 01:44:09 -07:00

Pre-fix: workspace-server's provision-timeout sweep was hardcoded at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245) correctly gives hermes 25 min for cold-boot (hermes installs include apt + uv + Python venv + Node + hermes-agent; 13–25 min on slow apt mirrors is normal). The two timeout systems disagreed: the watcher would happily wait 25 min, but the workspace-server's 10-min sweep killed healthy hermes boots mid-install at 10 min and marked them failed.

Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z created a hermes workspace, EC2 cloud-init was visibly making progress on apt-installs (libcjson1, libmbedcrypto7t64) when the sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The test threw "Workspace failed: " (empty error from sql.NullString serialization) and CI failed on a healthy boot.

Fix: provisioningTimeoutFor(runtime), same shape as the CP's bootstrapTimeoutFn:
- hermes: 30 min (watcher's 25 min + 5 min slack)
- others: 10 min (unchanged; claude-code/langgraph/etc. boot in <5 min, 10 min is plenty)

PROVISION_TIMEOUT_SECONDS env override still works (applies to all runtimes; operators who care about the runtime distinction shouldn't use the override anyway).

Sweep query change: pulls (id, runtime, age_sec) per row instead of pre-filtering by age in SQL. Per-row Go evaluation picks the correct timeout. Slightly more rows scanned, but bounded by the status='provisioning' partial index (workspaces in flight, not historical).

Tests:
- TestProvisioningTimeout_RuntimeAware: locks in the per-runtime mapping
- TestSweepStuckProvisioning_HermesGets30MinSlack: hermes at 11 min must NOT be flipped
- TestSweepStuckProvisioning_HermesPastDeadline: hermes at 31 min IS flipped, payload includes runtime
- Existing tests updated for the new query shape

Verified:
- go build ./... clean
- go vet ./... clean
- go test ./... all green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
access_test.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
access.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
healthsweep_test.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
healthsweep.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
hibernation_test.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
hibernation.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
liveness_test.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
liveness.go	chore: open-source restructure — rename dirs, remove internal files, scrub secrets	2026-04-18 00:24:44 -07:00
provisiontimeout_test.go	fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min	2026-04-26 01:44:09 -07:00
provisiontimeout.go	fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min	2026-04-26 01:44:09 -07:00