The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.
Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
the per-fire work completes
(The post-tick Heartbeat in Start() is still there as the "all idle" case.)
Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.
This PR:
1. Introduces `platform/internal/supervised/` with two primitives:
a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
On panic logs the stack + exponential-backoff restart (1s → 2s →
4s → … → 30s cap). On clean return (fn decided to stop) returns.
On ctx.Done cancels cleanly.
b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
staleThreshold) — shared in-memory liveness registry. Every
subsystem calls Heartbeat(name) at the end of each tick so
operators can distinguish "goroutine alive and healthy" from
"alive but stuck inside a single tick".
2. Wraps every `go X.Start(ctx)` in main.go:
- broadcaster.Subscribe (Redis pub/sub relay → WebSocket)
- registry.StartLivenessMonitor
- registry.StartHealthSweep
- scheduler.Start (the one that died yesterday)
- channelMgr.Start (Telegram / Slack)
3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
loop as the first end-to-end demonstration. Follow-up PRs will add
heartbeats to the other four subsystems.
4. Adds `GET /admin/liveness` endpoint returning per-subsystem
last_tick_at + seconds_ago. Operators can poll this and alert on
any subsystem whose seconds_ago exceeds 2x its cron/tick interval.
5. Unit tests for RunWithRecover (clean return no restart; panic
restarts with backoff; ctx cancel stops restart loop) and for the
liveness registry.
Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on happy path; only lifts what
happens on a panic.
Closes#92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.
Specifically:
func (s *Scheduler) Start(ctx context.Context) {
for {
select {
case <-ticker.C:
s.tick(ctx) // <- if this panics, the for-loop exits forever
}
}
}
And inside tick:
go func(s2 scheduleRow) {
defer wg.Done()
defer func() { <-sem }()
s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait()
}(sched)
Two `defer recover()` additions:
1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
the rest of the batch down.
Plus a liveness watchdog:
- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.
Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.
Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.