Commit Graph

5 Commits

Author SHA1 Message Date
Hongming Wang
593c7e2984 merge: resolve scheduler conflicts with main (#85 panic-recover + supervised heartbeat) 2026-04-15 00:12:29 -07:00
rabbitblood
101f284e5d fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.

Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
   keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
   the per-fire work completes

(The post-tick Heartbeat in Start() is still there as the "all idle" case.)

Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
2026-04-14 21:20:06 -07:00
rabbitblood
e4535560cf fix(platform): panic-recovering supervisor for every background goroutine (#92)
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.

This PR:

1. Introduces `platform/internal/supervised/` with two primitives:

   a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
      On panic logs the stack + exponential-backoff restart (1s → 2s →
      4s → … → 30s cap). On clean return (fn decided to stop) returns.
      On ctx.Done cancels cleanly.

   b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
      staleThreshold) — shared in-memory liveness registry. Every
      subsystem calls Heartbeat(name) at the end of each tick so
      operators can distinguish "goroutine alive and healthy" from
      "alive but stuck inside a single tick".

2. Wraps every `go X.Start(ctx)` in main.go:
   - broadcaster.Subscribe   (Redis pub/sub relay → WebSocket)
   - registry.StartLivenessMonitor
   - registry.StartHealthSweep
   - scheduler.Start         (the one that died yesterday)
   - channelMgr.Start        (Telegram / Slack)

3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
   loop as the first end-to-end demonstration. Follow-up PRs will add
   heartbeats to the other four subsystems.

4. Adds `GET /admin/liveness` endpoint returning per-subsystem
   last_tick_at + seconds_ago. Operators can poll this and alert on
   any subsystem whose seconds_ago exceeds 2x its cron/tick interval.

5. Unit tests for RunWithRecover (clean return no restart; panic
   restarts with backoff; ctx cancel stops restart loop) and for the
   liveness registry.

Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on happy path; only lifts what
happens on a panic.

Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
2026-04-14 20:34:18 -07:00
rabbitblood
7dc9d83792 fix(scheduler): recover from panics + add liveness watchdog (#85)
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.

Specifically:

  func (s *Scheduler) Start(ctx context.Context) {
      for {
          select {
          case <-ticker.C:
              s.tick(ctx)   // <- if this panics, the for-loop exits forever
          }
      }
  }

And inside tick:

  go func(s2 scheduleRow) {
      defer wg.Done()
      defer func() { <-sem }()
      s.fireSchedule(ctx, s2)   // <- panic here propagates up wg.Wait()
  }(sched)

Two `defer recover()` additions:

1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
   row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
   the rest of the batch down.

Plus a liveness watchdog:

- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
  2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.

Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.

Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
2026-04-14 19:32:01 -07:00
Hongming Wang
24fec62d7f initial commit — Molecule AI platform
Forked clean from public hackathon repo (Starfire-AgentTeam, BSL 1.1)
with full rebrand to Molecule AI under github.com/Molecule-AI/molecule-monorepo.

Brand: Starfire → Molecule AI.
Slug: starfire / agent-molecule → molecule.
Env vars: STARFIRE_* → MOLECULE_*.
Go module: github.com/agent-molecule/platform → github.com/Molecule-AI/molecule-monorepo/platform.
Python packages: starfire_plugin → molecule_plugin, starfire_agent → molecule_agent.
DB: agentmolecule → molecule.

History truncated; see public repo for prior commits and contributor
attribution. Verified green: go test -race ./... (platform), pytest
(workspace-template 1129 + sdk 132), vitest (canvas 352), build (mcp).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 11:55:37 -07:00