molecule-core/platform
rabbitblood 7dc9d83792 fix(scheduler): recover from panics + add liveness watchdog (#85)
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.

Specifically:

  func (s *Scheduler) Start(ctx context.Context) {
      for {
          select {
          case <-ticker.C:
              s.tick(ctx)   // <- if this panics, the for-loop exits forever
          }
      }
  }

And inside tick:

  go func(s2 scheduleRow) {
      defer wg.Done()
      defer func() { <-sem }()
      s.fireSchedule(ctx, s2)   // <- panic here propagates up wg.Wait()
  }(sched)

Two `defer recover()` additions:

1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
   row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
   the rest of the batch down.

Plus a liveness watchdog:

- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
  2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.

Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.

Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
2026-04-14 19:32:01 -07:00
..
cmd initial commit — Molecule AI platform 2026-04-13 11:55:37 -07:00
internal fix(scheduler): recover from panics + add liveness watchdog (#85) 2026-04-14 19:32:01 -07:00
migrations fix(schedules): backfill legacy rows to 'template' + extract import SQL const 2026-04-14 14:30:22 -07:00
Dockerfile initial commit — Molecule AI platform 2026-04-13 11:55:37 -07:00
go.mod initial commit — Molecule AI platform 2026-04-13 11:55:37 -07:00
go.sum initial commit — Molecule AI platform 2026-04-13 11:55:37 -07:00