molecule-core

Author	SHA1	Message	Date
Hongming Wang	fdf1b5d76a	refactor(workspace-status): typed constants + AST-based drift gate Eliminate raw 'awaiting_agent'/'hibernating'/'failed'/etc string literals from production status writes. Adds models.WorkspaceStatus typed alias and models.AllWorkspaceStatuses canonical slice; every UPDATE workspaces SET status = ... now passes a parameterized $N typed value rather than a hard-coded SQL literal. Defense-in-depth follow-up to migration 046 (#2388): the Postgres enum type was missing 'awaiting_agent' + 'hibernating' for ~5 days because sqlmock regex matching cannot enforce live enum constraints. The drift gate is now a proper Go AST + SQL parser (no regex), asserting the codebase ⊆ migration enum and every const appears in the canonical slice. With status as a parameterized typed value, future enum mismatches fail at the SQL layer in tests, not silently in prod. Test coverage: full suite passes with -race; drift gate green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 10:41:41 -07:00
Hongming Wang	be1beff4a0	fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min Pre-fix: workspace-server's provision-timeout sweep was hardcoded at 10 min for all runtimes. The CP-side bootstrap-watcher (cp#245) correctly gives hermes 25 min for cold-boot (hermes installs include apt + uv + Python venv + Node + hermes-agent — 13–25 min on slow apt mirrors is normal). The two timeout systems disagreed: the watcher would happily wait 25 min, but the workspace-server's 10-min sweep killed healthy hermes boots mid-install at 10 min and marked them failed. Today's example: #2061's E2E run on 2026-04-26 at 08:06:34Z created a hermes workspace, EC2 cloud-init was visibly making progress on apt-installs (libcjson1, libmbedcrypto7t64) when the sweep flipped status to 'failed' at 08:17:00Z (10:26 elapsed). The test threw "Workspace failed: " (empty error from sql.NullString serialization) and CI failed on a healthy boot. Fix: provisioningTimeoutFor(runtime) — same shape as the CP's bootstrapTimeoutFn: - hermes: 30 min (watcher's 25 min + 5 min slack) - others: 10 min (unchanged — claude-code/langgraph/etc. boot in <5 min, 10 min is plenty) PROVISION_TIMEOUT_SECONDS env override still works (applies to all runtimes — operators who care about the runtime distinction shouldn't use the override anyway). Sweep query change: pulls (id, runtime, age_sec) per row instead of pre-filtering by age in SQL. Per-row Go evaluation picks the correct timeout. Slightly more rows scanned but bounded by the status='provisioning' partial index — workspaces in flight, not historical. Tests: - TestProvisioningTimeout_RuntimeAware — locks in the per-runtime mapping - TestSweepStuckProvisioning_HermesGets30MinSlack — hermes at 11 min must NOT be flipped - TestSweepStuckProvisioning_HermesPastDeadline — hermes at 31 min IS flipped, payload includes runtime - Existing tests updated for the new query shape Verified: - go build ./... clean - go vet ./... clean - go test ./... all green Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 01:44:09 -07:00
Hongming Wang	ec52d155f4	fix(sweeper): emit WORKSPACE_PROVISION_FAILED so canvas updates UI The provision-timeout sweeper was emitting a new WORKSPACE_PROVISION_TIMEOUT event type, but the canvas event handler (canvas-events.ts:234) only has a case for WORKSPACE_PROVISION_FAILED — the sweep's event fell through silently. DB was being marked 'failed' but the UI stayed on 'starting' indefinitely until the user hard-refreshed. Reusing the existing event name keeps the UI reaction uniform across both fail paths (runtime-crash via bootstrap-watcher and boot-timeout via sweeper). Operators who need to distinguish can read the `source` payload field — "bootstrap_watcher" vs "provision_timeout_sweep". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:38:41 -07:00
Hongming Wang	c3f7447e86	fix: harden stuck-provisioning UX — details crash, preflight, sweeper Workspaces stuck in status='provisioning' previously surfaced in three bad ways: 1. Details tab crashed with `Cannot read properties of undefined (reading 'toLocaleString')`. `BudgetSection` + `WorkspaceUsage` assumed full response shapes but a provisioning-stuck workspace returns partial `{}`. Guard each deep field with `?? 0` and cover the partial-response case with regression tests. 2. Missing required env vars failed silently 15+ minutes later as a cosmetic "Provisioning Timeout" banner. The in-container preflight catches them but by then the container has already crashed without calling /registry/register, so the workspace sat in 'provisioning' forever. Mirror the preflight server-side: parse config.yaml's `runtime_config.required_env` before launch, fail fast with a WORKSPACE_PROVISION_FAILED event naming the missing vars. 3. No backend timeout ever flipped a stuck workspace to 'failed'. Add a registry sweeper (10m default, env-overridable) that detects workspaces stuck past the window, flips them to 'failed', and emits WORKSPACE_PROVISION_TIMEOUT. Race-safe: the UPDATE re-checks the status + age predicate so a concurrent register/restart wins. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:51:39 -07:00

4 Commits