molecule-core/workspace-server/cmd
Hongming Wang 18edf88d59 fix(sweeper): honour template-manifest provision_timeout_seconds
Real wiring gap discovered while investigating issue #2486, where a
cluster of prod claude-code workspaces failed at exactly 10m. The
runtimeProvisionTimeoutsCache (#2054 phase 2) reads
runtime_config.provision_timeout_seconds from each template's
config.yaml so the **canvas** spinner respects per-template timeouts —
but the **sweeper** in registry/provisiontimeout.go hardcoded 10 min
(claude-code) / 30 min (hermes) and never consulted the manifest. So a
template that declared a longer window had a UI that waited correctly
but a sweeper that killed the row at the hardcoded floor anyway.

Resolution order pinned by new TestProvisioningTimeout_ManifestOverride:

  1. PROVISION_TIMEOUT_SECONDS env (ops-debug global override)
  2. Template manifest lookup (per-runtime, beats hermes default too)
  3. Hermes default (30 min — CP bootstrap-watcher 25 min + 5 min slack)
  4. DefaultProvisioningTimeout (10 min)

Wiring:
  - registry: new RuntimeTimeoutLookup function type, threaded through
    StartProvisioningTimeoutSweep + sweepStuckProvisioning + the
    pre-existing provisioningTimeoutFor.
  - handlers: ProvisionTimeoutSecondsForRuntime exposes the cache's
    lookup as a method so main.go can pass it without breaking the
    handlers→registry import direction.
  - cmd/server/main.go: wire wh.ProvisionTimeoutSecondsForRuntime into
    the sweep boot.

Verified:
  - go test -race ./... passes (every workspace-server package).
  - Injected a regression into the lookup arm: 3 manifest-override
    subcases fail with the actual-vs-expected gap, confirming the new
    test is load-bearing.
  - The original two timeout tests (env-override, hermes default) keep
    passing — `lookup=nil` argument preserves their semantics.

Operator action enabled: a template wanting a 15-min window can now
just set `runtime_config.provision_timeout_seconds: 900` in its
config.yaml and the sweeper honours it on the next workspace-server
restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:44:42 -07:00
server fix(sweeper): honour template-manifest provision_timeout_seconds 2026-05-01 21:44:42 -07:00