702ea5dc09
When a workspace boot FAILS — the provision-timeout sweep flips it to
`failed`, or the control plane's bootstrap-watcher POSTs bootstrap-failed
— capture a fixed forensic "rescue bundle" off the still-running (but
boot-failed) EC2 BEFORE the control plane reaps it, and ship it to
obs/Loki. This makes a wedged workspace (e.g. the codex
provider-derivation failure that motivated the RFC) post-mortem-
inspectable instead of an uninspectable wall.
What it collects (fixed set, redacted before anything leaves the box):
/configs/config.yaml, /configs/system-prompt.md, tail -200 of
cloud-init-output.log, `docker ps -a`, the agent container's
`docker logs --tail 200`, and the resolved MODEL|PROVIDER|RUNTIME env.
Every section is run through the existing SAFE-T1201 secret-scan
(handlers.redactSecrets) before shipping — and fails CLOSED (ships
nothing) if the redactor is unwired.
Shipping reuses the existing obs shipper (internal/audit → Loki via the
tenant Vector stdout source) with event_type="rescue.bundle" and
kind="rescue" / org / workspace_id in the record body, queryable as
`{kind="rescue"} | json`.
Hook points (the two boot-failure VERDICT paths only — never normal
teardown/deprovision/recreate/billing-suspend/hibernate):
- registry.sweepStuckProvisioning: fires the injected
registry.BootFailureRescueHook only on a real flip (affected==1),
never on a race (affected==0) or a non-overdue row.
- handlers.WorkspaceHandler.BootstrapFailed: fires captureRescueBundle
only after the row is actually flipped to `failed`.
Capture is best-effort + non-blocking: it runs in its own goroutine with
its own 45s timeout, detached from the request/sweep context, so it can
never change boot-failure semantics or add latency to the failure path.
The leaf internal/rescue package injects the EIC/SSH runner + redactor as
package vars (wired from handlers at init) so registry can call it
without importing handlers (no import cycle) — mirroring the existing
RuntimeTimeoutLookup injection pattern.
Volume retention: in molecule-core the boot-failure verdict only flips
status to `failed`; it never terminates. Both platform reapers
(registry.StartCPOrphanSweeper + handlers deprovision) act ONLY on
status='removed', so a `failed` workspace's instance + /configs data
volume are RETAINED by construction through the rescue grace
(rescue.RescueVolumeGrace = 24h, the SSOT the CP reaper must honour),
distinct from the user-prune erase path. Added a regression test pinning
the orphan-sweeper's status='removed' predicate so a future widening to
`failed` (which would terminate boxes mid-rescue) fails the build.
Tests: capture fires on boot-failure (not on healthy teardown/race),
bundle redacts secrets + fails closed without a redactor, Loki push
called with the right labels, volume retained on boot-failure. EIC/SSH +
Loki + ec2 faked via package-var swaps (mirrors existing provisioner
test fakes).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
249 lines
10 KiB
Go
249 lines
10 KiB
Go
package registry
|
||
|
||
import (
|
||
"context"
|
||
"log"
|
||
"os"
|
||
"strconv"
|
||
"time"
|
||
|
||
"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/db"
|
||
"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/events"
|
||
"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/models"
|
||
)
|
||
|
||
// ProvisionTimeoutEmitter is the narrow broadcaster dependency the sweeper
|
||
// needs. Defined locally so the registry package stays event-bus agnostic
|
||
// (same pattern as OfflineHandler in healthsweep.go).
|
||
type ProvisionTimeoutEmitter interface {
|
||
RecordAndBroadcast(ctx context.Context, eventType string, workspaceID string, payload interface{}) error
|
||
}
|
||
|
||
// DefaultProvisioningTimeout is how long a workspace may sit in
|
||
// status='provisioning' before the sweeper flips it to 'failed'.
|
||
// Default for non-hermes runtimes (claude-code, codex, openclaw, etc.)
|
||
// which cold-boot in <5 min. The container-launch path
|
||
// has its own 3-minute context timeout (provisioner.ProvisionTimeout)
|
||
// but that only bounds the docker API call — a container that started
|
||
// but crashes before /registry/register never triggers that path and
|
||
// would sit in provisioning forever. 10 minutes covers pathological
|
||
// image-pull + user-data execution on a cold EC2 worker while still
|
||
// getting well ahead of the "15+ minute" stuck state users see in
|
||
// production.
|
||
const DefaultProvisioningTimeout = 10 * time.Minute
|
||
|
||
// HermesProvisioningTimeout matches the CP bootstrap-watcher's
|
||
// runtime-aware deadline (cp#245) for hermes workspaces: 25 min watcher
|
||
// + 5 min sweep slack. Hermes cold-boot does apt + uv + Python venv +
|
||
// Node + hermes-agent install — 13–25 min on slow apt mirrors is
|
||
// normal. Without this, the sweep would flip the workspace to 'failed'
|
||
// at 10 min while the watcher (and the workspace itself) is still
|
||
// happily progressing through install. Issue #1843 follow-up: a
|
||
// healthy 10.5-min hermes boot was killed by the 10-min sweep on
|
||
// 2026-04-26, breaking #2061's E2E.
|
||
const HermesProvisioningTimeout = 30 * time.Minute
|
||
|
||
// DefaultProvisionSweepInterval is how often the sweeper polls. Same cadence
|
||
// as the hibernation monitor — cheap and bounded by the provisioning-state
|
||
// query which hits the primary key / status partial index.
|
||
const DefaultProvisionSweepInterval = 30 * time.Second
|
||
|
||
// RuntimeTimeoutLookup returns the per-runtime provision timeout in
|
||
// seconds when a template's config.yaml declared
|
||
// `runtime_config.provision_timeout_seconds`, else zero (= "no override,
|
||
// fall through to runtime defaults below"). Same shape as
|
||
// runtimeProvisionTimeoutsCache.get in handlers — wired through main.go
|
||
// so this package stays template-discovery agnostic.
|
||
//
|
||
// Why an interface instead of importing the cache directly: registry
|
||
// already sits below handlers in the import graph (handlers → registry,
|
||
// not the reverse). A function-typed argument keeps that flow.
|
||
type RuntimeTimeoutLookup func(runtime string) int
|
||
|
||
// provisioningTimeoutFor picks the per-runtime sweep deadline. Resolution
|
||
// order:
|
||
//
|
||
// 1. PROVISION_TIMEOUT_SECONDS env — global override, ops-debug only.
|
||
// 2. Template manifest override (lookup) — what the canvas spinner
|
||
// also reads via #2054 phase 2. Without this, a template that
|
||
// declared `runtime_config.provision_timeout_seconds: 900` would
|
||
// still get killed by the sweeper at the 10-min hardcoded floor —
|
||
// a real wiring gap that drove every claude-code burst on a cold
|
||
// EC2 to false-positive timeout.
|
||
// 3. Hermes special-case (CP bootstrap-watcher 25 min + 5 min slack).
|
||
// 4. DefaultProvisioningTimeout (10 min) for everything else.
|
||
//
|
||
// lookup may be nil (during package tests, or before main.go has wired
|
||
// it) — falls through to the legacy hermes/default split.
|
||
func provisioningTimeoutFor(runtime string, lookup RuntimeTimeoutLookup) time.Duration {
|
||
if v := os.Getenv("PROVISION_TIMEOUT_SECONDS"); v != "" {
|
||
if n, err := strconv.Atoi(v); err == nil && n > 0 {
|
||
return time.Duration(n) * time.Second
|
||
}
|
||
}
|
||
if lookup != nil {
|
||
if secs := lookup(runtime); secs > 0 {
|
||
return time.Duration(secs) * time.Second
|
||
}
|
||
}
|
||
if runtime == "hermes" {
|
||
return HermesProvisioningTimeout
|
||
}
|
||
return DefaultProvisioningTimeout
|
||
}
|
||
|
||
// BootFailureRescueHook, when wired, is invoked once per workspace the
|
||
// sweep flips from `provisioning` to `failed` — i.e. on the boot-failure
|
||
// verdict, BEFORE the control plane reaps the instance. It captures a
|
||
// forensic rescue bundle off the still-running (but boot-failed) EC2 and
|
||
// ships it to obs/Loki (RFC internal#742 Part 2). Wired in main.go to
|
||
// handlers.captureRescueBundle via a thin adapter; nil in tests + on
|
||
// self-hosted deploys (no rescue shipping there).
|
||
//
|
||
// Function-typed injection (not an import of handlers) keeps the
|
||
// existing handlers→registry import direction intact — registry must not
|
||
// import handlers.
|
||
//
|
||
// MUST be best-effort + non-blocking: the hook itself dispatches the
|
||
// capture on its own goroutine with its own timeout, so the sweep loop
|
||
// is never slowed or blocked by a hung EIC tunnel on the dead box.
|
||
var BootFailureRescueHook func(workspaceID, instanceID, reason string)
|
||
|
||
// StartProvisioningTimeoutSweep periodically scans for workspaces stuck in
|
||
// `status='provisioning'` past the timeout window, flips them to `failed`,
|
||
// and broadcasts a WORKSPACE_PROVISION_TIMEOUT event so the canvas can
|
||
// render a fail-state instead of the indefinite cosmetic "Provisioning
|
||
// Timeout" banner.
|
||
//
|
||
// The sweep is idempotent: the UPDATE's WHERE clause re-checks both status
|
||
// and age under the same row lock, so a workspace that raced to `online` or
|
||
// was restarted while the sweep was scanning will not get flipped.
|
||
func StartProvisioningTimeoutSweep(ctx context.Context, emitter ProvisionTimeoutEmitter, interval time.Duration, lookup RuntimeTimeoutLookup) {
|
||
if emitter == nil {
|
||
log.Println("Provision-timeout sweep: emitter is nil — skipping (no one to broadcast to)")
|
||
return
|
||
}
|
||
if interval <= 0 {
|
||
interval = DefaultProvisionSweepInterval
|
||
}
|
||
ticker := time.NewTicker(interval)
|
||
defer ticker.Stop()
|
||
|
||
log.Printf("Provision-timeout sweep: started (interval=%s, timeout=%s default / %s hermes / per-runtime manifest override=%v)",
|
||
interval, DefaultProvisioningTimeout, HermesProvisioningTimeout, lookup != nil)
|
||
|
||
for {
|
||
select {
|
||
case <-ctx.Done():
|
||
return
|
||
case <-ticker.C:
|
||
sweepStuckProvisioning(ctx, emitter, lookup)
|
||
}
|
||
}
|
||
}
|
||
|
||
// sweepStuckProvisioning is one tick of the sweeper. Exported-for-test via
|
||
// the package boundary: keep all time.Now reads inside so tests can drive it
|
||
// deterministically by seeding updated_at rather than manipulating time.
|
||
//
|
||
// Runtime-aware: the per-workspace timeout depends on `runtime`. Hermes
|
||
// gets 30 min (matching the CP bootstrap-watcher's 25-min deadline + 5
|
||
// min slack); everything else gets 10 min. Without this distinction a
|
||
// healthy hermes cold-boot at 10–25 min got killed mid-install by this
|
||
// sweep, leaving an incoherent "marked failed but actually working"
|
||
// state. See bootstrap_watcher.go's bootstrapTimeoutFn for the
|
||
// canonical CP-side gating.
|
||
func sweepStuckProvisioning(ctx context.Context, emitter ProvisionTimeoutEmitter, lookup RuntimeTimeoutLookup) {
|
||
// We can't pre-filter by age in SQL because the threshold depends
|
||
// on the row's runtime. Pull every provisioning row + its runtime
|
||
// + its age, evaluate per-row in Go. Still cheap — the
|
||
// status='provisioning' row count is bounded (workspaces in
|
||
// flight, not historical) and the partial index on status keeps
|
||
// it fast.
|
||
rows, err := db.DB.QueryContext(ctx, `
|
||
SELECT id, COALESCE(runtime, ''), COALESCE(instance_id, ''), EXTRACT(EPOCH FROM (now() - updated_at))::int
|
||
FROM workspaces
|
||
WHERE status = 'provisioning'
|
||
`)
|
||
if err != nil {
|
||
log.Printf("Provision-timeout sweep: query error: %v", err)
|
||
return
|
||
}
|
||
defer rows.Close()
|
||
|
||
type candidate struct {
|
||
id string
|
||
runtime string
|
||
instanceID string
|
||
ageSec int
|
||
}
|
||
var ids []candidate
|
||
for rows.Next() {
|
||
var c candidate
|
||
if err := rows.Scan(&c.id, &c.runtime, &c.instanceID, &c.ageSec); err == nil {
|
||
ids = append(ids, c)
|
||
}
|
||
}
|
||
if err := rows.Err(); err != nil {
|
||
log.Printf("Provision-timeout sweep: rows error: %v", err)
|
||
}
|
||
|
||
for _, c := range ids {
|
||
timeout := provisioningTimeoutFor(c.runtime, lookup)
|
||
timeoutSec := int(timeout / time.Second)
|
||
if c.ageSec < timeoutSec {
|
||
continue
|
||
}
|
||
msg := "provisioning timed out — container started but never called /registry/register. Check container logs and network connectivity to the platform."
|
||
res, err := db.DB.ExecContext(ctx, `
|
||
UPDATE workspaces
|
||
SET status = $4,
|
||
last_sample_error = $2,
|
||
updated_at = now()
|
||
WHERE id = $1
|
||
AND status = 'provisioning'
|
||
AND updated_at < now() - ($3 || ' seconds')::interval
|
||
`, c.id, msg, timeoutSec, models.StatusFailed)
|
||
if err != nil {
|
||
log.Printf("Provision-timeout sweep: failed to flip %s to failed: %v", c.id, err)
|
||
continue
|
||
}
|
||
affected, err := res.RowsAffected()
|
||
if err != nil {
|
||
log.Printf("Provision-timeout sweep: RowsAffected error for %s: %v", c.id, err)
|
||
continue
|
||
}
|
||
if affected == 0 {
|
||
// Raced with restart / register — no harm, just skip.
|
||
continue
|
||
}
|
||
log.Printf("Provision-timeout sweep: %s (runtime=%q) stuck in provisioning > %s — marked failed", c.id, c.runtime, timeout)
|
||
|
||
// RFC internal#742 Part 2: this flip is a boot-failure verdict.
|
||
// The instance is still running (the CP reaps it shortly after);
|
||
// capture a forensic rescue bundle off it NOW, before teardown.
|
||
// Best-effort + non-blocking — the hook dispatches on its own
|
||
// goroutine + timeout, so a hung EIC tunnel on the dead box can't
|
||
// slow the sweep. Only fires on a real flip (affected==1), never
|
||
// on a race (affected==0) or a non-overdue row — guaranteeing it
|
||
// runs once per boot-failure verdict and never on a healthy row.
|
||
if BootFailureRescueHook != nil {
|
||
BootFailureRescueHook(c.id, c.instanceID, "provision_timeout_sweep")
|
||
}
|
||
|
||
// Emit as WORKSPACE_PROVISION_FAILED, not _TIMEOUT, because the
|
||
// canvas event handler only flips node state on the _FAILED case.
|
||
// A separate event type was considered but the UI reaction is
|
||
// identical either way — operators who need to distinguish can
|
||
// tell from the `source` payload field.
|
||
if emitErr := emitter.RecordAndBroadcast(ctx, string(events.EventWorkspaceProvisionFailed), c.id, map[string]interface{}{
|
||
"error": msg,
|
||
"timeout_secs": timeoutSec,
|
||
"runtime": c.runtime,
|
||
"source": "provision_timeout_sweep",
|
||
}); emitErr != nil {
|
||
log.Printf("Provision-timeout sweep: broadcast failed for %s: %v", c.id, emitErr)
|
||
}
|
||
}
|
||
}
|