Addresses self-review of the 10-PR batch merged earlier this session.
Splits the follow-ups into this Go-side PR and a later Python/docs PR.
## Fixes
1. wsauth_middleware.go CanvasOrBearer — invalid bearer now hard-rejects
with 401 instead of falling through to the Origin check. Previous code
let an attacker with an expired token + matching Origin bypass auth.
Empty bearer still falls through to the Origin path (the intended
canvas path).
2. scheduler.go short() helper — extracts UUID prefix truncation into a
bounds-safe helper (sketched after this list). Pre-existing unsafe [:12]
and [:8] slices would panic on workspace IDs shorter than the bound.
#115's new skip path had the bounds check; the happy-path log lines did
not. One helper, three call sites.
3. activity.go security-event log on source_id spoof — #209 added the
403 but the attempt was invisible to any auditor cron. Stable
greppable log line with authed_workspace, body_source_id, client IP.
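A minimal sketch of the fix-2 helper; the exact signature is an assumption:

```go
// short returns at most n leading characters of id for log output.
// Unlike a bare id[:n], it cannot panic on a short workspace ID.
func short(id string, n int) string {
	if len(id) <= n {
		return id
	}
	return id[:n]
}
```

Call sites then use short(wsID, 12) / short(wsID, 8) in place of raw slices.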
## New tests
- TestShort_helper — bounds-safety regression guard for the helper
- TestRecordSkipped_writesSkippedStatus — #115 coverage gap, exercises
UPDATE + INSERT via sqlmock
- TestRecordSkipped_shortWorkspaceIDNoPanic — short-ID crash regression
- TestActivityHandler_Report_SourceIDSpoofRejected — #209 403 path
- TestActivityHandler_Report_MatchingSourceIDAccepted — non-spoof path
- TestHistory_IncludesErrorDetail — #152 problem B coverage
`go test -race ./...` is green locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #115. The Security Auditor hourly cron (and likely others) hit a
~36% miss rate because the platform's A2A proxy rejected fires with
"workspace agent busy — retry after a short backoff" while the agent was
still executing the prior audit. That error was recorded as a hard
failure and polluted last_error.
New behaviour:
Before fireSchedule calls into the A2A proxy, it reads
workspaces.active_tasks for the target. If >0, it (sketched after this
list):
- Advances next_run_at to the next cron slot (cron keeps ticking)
- Bumps run_count
- Sets last_status='skipped' + last_error=<reason>
- Inserts a cron_run activity_logs row with status='skipped' + error_detail
- Broadcasts CRON_SKIPPED for canvas + operators
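A condensed sketch of the new path at the top of fireSchedule. Table and
column names follow the list above; the slot helper, broadcast call, and
exact SQL are illustrative:

```go
// Pre-fire busy check: record a skip instead of a hard failure.
var active int
err := s.db.QueryRowContext(ctx,
	`SELECT active_tasks FROM workspaces WHERE id = $1`, ws.ID).Scan(&active)
if err == nil && active > 0 {
	const reason = "skipped: workspace agent busy"
	next := nextCronSlot(sched) // hypothetical: advance to the next cron slot
	s.db.ExecContext(ctx, `
		UPDATE schedules
		   SET next_run_at = $1, run_count = run_count + 1,
		       last_status = 'skipped', last_error = $2
		 WHERE id = $3`, next, reason, sched.ID)
	s.db.ExecContext(ctx, `
		INSERT INTO activity_logs (workspace_id, activity_type, status, error_detail)
		VALUES ($1, 'cron_run', 'skipped', $2)`, ws.ID, reason)
	s.broadcast("CRON_SKIPPED", sched.ID) // canvas + operators
	return
}
```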
Effect: busy-collision ceases to be an error. The history surface now
distinguishes "ran and failed" from "skipped because busy". Operators
can tell the difference at a glance, and the liveness view doesn't
stall waiting for the next ticker cycle.
Pairs with #149 (dedicated heartbeat pulse) and #152 problem B
(error_detail surfaced in history) for a coherent scheduler story.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #152 problem B (schedule history API drops error detail).
Two tiny changes:
1. scheduler.fireSchedule now writes lastError into activity_logs.error_detail
when inserting the cron_run row. Previously the column was left NULL even
on failure because the INSERT didn't include it.
2. schedules.History SELECT now reads error_detail and includes it in the
JSON response under error_detail (sketched after this list). Frontend +
audit cron can now display "why did this run fail" instead of just
"status=error".
No schema change — activity_logs.error_detail already exists from
migration 009. This just starts using the column.
Problem A of #152 (Research Lead ecosystem-watch 50% error rate on its
own) is a separate ops investigation and stays open.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The #95 scheduler heartbeat scheme relied on:
1. Top of tick() (once per poll interval)
2. Per-fire goroutine entry + exit
That leaves a gap: tick() ends with wg.Wait(), so if a single fire takes
longer than pollInterval (UIUX audits routinely take 60-120s; max fireTimeout
is 5min), the next tick doesn't run and no top-of-tick heartbeat fires.
Per-fire heartbeats only bracket the fire — between entry and the HTTP
response returning, nothing heartbeats either.
Observed today: /admin/liveness reports seconds_ago=251 while docker logs
show the scheduler actively firing 'Hourly ecosystem watch'. Scheduler is
fine; liveness is lying.
Adds an independent 10s heartbeat pulse goroutine inside Start(), decoupled
from tick completion (sketched below). The existing heartbeats at tick top
and per-fire are kept as redundant signals, but this pulse is the one that
guarantees liveness freshness regardless of what tick is doing.
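Roughly (registry call per the supervised package; the surrounding wiring
is assumed):

```go
// Inside Start(): an independent pulse, decoupled from tick() duration.
go func() {
	pulse := time.NewTicker(10 * time.Second)
	defer pulse.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-pulse.C:
			supervised.Heartbeat("scheduler") // stays fresh even mid-fire
		}
	}
}()
```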
Ships the exact fix proposed in #140's body.
Closes #140.
Added scheduler_test.go with 8 test cases covering all previously untested
security-critical code paths from PR #90 (one sketched after the list):
TestLastTickAt_zero — zero time before first tick
TestHealthy_beforeStart — false on fresh scheduler (zero lastTickAt)
TestHealthy_freshTick — true when lastTickAt == now
TestHealthy_stale — false when lastTickAt is 3×pollInterval ago
TestComputeNextRun_valid — "0 * * * *" / UTC returns top-of-hour future time
TestComputeNextRun_invalid — unparseable expression returns non-nil error
TestComputeNextRun_invalidTimezone — unrecognised IANA zone returns non-nil error
TestPanicRecovery — panicProxy crashes ProxyA2ARequest; scheduler
goroutine recovers and remains Healthy
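Representative shape of the staleness case; struct construction is an
assumption, fields match the changes listed below:

```go
func TestHealthy_stale(t *testing.T) {
	s := &Scheduler{tickInterval: time.Second}
	s.mu.Lock()
	s.lastTickAt = time.Now().Add(-3 * time.Second) // 3× tickInterval ago
	s.mu.Unlock()
	if s.Healthy() {
		t.Fatal("Healthy() = true for a tick 3× the interval ago; want false")
	}
}
```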
To support these tests, scheduler.go gained four changes (minimal surface):
1. Added mu sync.RWMutex, lastTickAt time.Time, and tickInterval time.Duration
fields to Scheduler. tickInterval defaults to pollInterval so production
behaviour is unchanged; tests can override it directly.
2. Added LastTickAt() and Healthy() methods with read-lock protection
(sketched after this list).
3. tick() now records lastTickAt after wg.Wait() — a single write under the
mutex, no hot-path cost.
4. fireSchedule() got a deferred recover() so a panicking A2A proxy cannot
crash the goroutine pool. Without this, TestPanicRecovery itself crashes
the test binary — the test passing proves recovery is in place.
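Change 2's accessors, approximately:

```go
// LastTickAt returns the time of the most recent completed tick
// (zero before the first tick).
func (s *Scheduler) LastTickAt() time.Time {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.lastTickAt
}

// Healthy reports whether the last tick completed within 2× the tick
// interval; false while lastTickAt is still zero.
func (s *Scheduler) Healthy() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.lastTickAt.IsZero() {
		return false
	}
	return time.Since(s.lastTickAt) <= 2*s.tickInterval
}
```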
Bug fix: ComputeNextRun previously fell back to UTC silently on an invalid
timezone; it now returns a non-nil error (sketched below). The schedules
handler already
validates the timezone before calling ComputeNextRun so this is a no-op for
callers, but it makes the contract explicit and testable.
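The tightened contract, assuming a robfig/cron-style parser (library
choice and signature are both assumptions):

```go
import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

// ComputeNextRun parses expr in the given IANA zone and returns the next
// fire time. An unknown zone is now an error, not a silent UTC fallback.
func ComputeNextRun(expr, tz string) (time.Time, error) {
	loc, err := time.LoadLocation(tz)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid timezone %q: %w", tz, err)
	}
	sched, err := cron.ParseStandard(expr)
	if err != nil {
		return time.Time{}, fmt.Errorf("invalid cron expression %q: %w", expr, err)
	}
	return sched.Next(time.Now().In(loc)), nil
}
```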
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: the scheduler was firing the UIUX audit while
last_tick_at lagged by 95s+ and kept growing.
Three places now call Heartbeat (sketched after this list):
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
the per-fire work completes
(The post-tick Heartbeat in Start() is still there as the "all idle" case.)
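Schematically (the due-schedule fetch and struct layout are assumptions;
the heartbeat positions match the list):

```go
func (s *Scheduler) tick(ctx context.Context) {
	supervised.Heartbeat("scheduler") // 1: proves we're past the ticker.C wait
	for _, row := range s.due(ctx) {  // hypothetical fetch of due schedules
		row := row
		s.wg.Add(1)
		go func() {
			defer s.wg.Done()
			supervised.Heartbeat("scheduler") // 2: any active fire keeps it fresh
			s.fireSchedule(ctx, row)
			supervised.Heartbeat("scheduler") // 3: the fire just completed
		}()
	}
	s.wg.Wait()
}
```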
Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.
This PR:
1. Introduces `platform/internal/supervised/` with two primitives:
a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper
(sketched after this list). On panic, it logs the stack and restarts
with exponential backoff (1s → 2s → 4s → … → 30s cap). On clean
return (fn decided to stop), it returns. On ctx.Done, it stops cleanly.
b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
staleThreshold) — shared in-memory liveness registry. Every
subsystem calls Heartbeat(name) at the end of each tick so
operators can distinguish "goroutine alive and healthy" from
"alive but stuck inside a single tick".
2. Wraps every `go X.Start(ctx)` in main.go:
- broadcaster.Subscribe (Redis pub/sub relay → WebSocket)
- registry.StartLivenessMonitor
- registry.StartHealthSweep
- scheduler.Start (the one that died yesterday)
- channelMgr.Start (Telegram / Slack)
3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
loop as the first end-to-end demonstration. Follow-up PRs will add
heartbeats to the other four subsystems.
4. Adds `GET /admin/liveness` endpoint returning per-subsystem
last_tick_at + seconds_ago. Operators can poll this and alert on
any subsystem whose seconds_ago exceeds 2× its cron/tick interval.
5. Unit tests for RunWithRecover (clean return no restart; panic
restarts with backoff; ctx cancel stops restart loop) and for the
liveness registry.
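A sketch of RunWithRecover under the semantics above; log wording and
backoff bookkeeping are assumptions:

```go
package supervised

import (
	"context"
	"log"
	"runtime/debug"
	"time"
)

// RunWithRecover runs fn, restarting it with exponential backoff after a
// panic. A clean return from fn, or ctx cancellation, ends the loop.
func RunWithRecover(ctx context.Context, name string, fn func(context.Context)) {
	backoff := time.Second
	for {
		panicked := func() (p bool) {
			defer func() {
				if r := recover(); r != nil {
					p = true
					log.Printf("supervised: %s panicked: %v\n%s", name, r, debug.Stack())
				}
			}()
			fn(ctx)
			return false
		}()
		if !panicked {
			return // clean return: fn decided to stop
		}
		select {
		case <-ctx.Done():
			return // cancelled while backing off
		case <-time.After(backoff):
		}
		backoff *= 2 // 1s → 2s → 4s → …
		if backoff > 30*time.Second {
			backoff = 30 * time.Second // cap
		}
	}
}
```

main.go wiring then reads `go supervised.RunWithRecover(ctx, "scheduler",
scheduler.Start)`, and likewise for the other four subsystems.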
Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 lines changed. No behavior change on the happy path; only what
happens on a panic changes.
Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.
Specifically:
func (s *Scheduler) Start(ctx context.Context) {
    ticker := time.NewTicker(s.pollInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            s.tick(ctx) // <- if this panics, the for-loop exits forever
        }
    }
}
And inside tick:
go func(s2 scheduleRow) {
    defer wg.Done()
    defer func() { <-sem }()
    s.fireSchedule(ctx, s2) // <- a panic here is never recovered
}(sched)
Two `defer recover()` additions (sketched below):
1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
the rest of the batch down.
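Shape of the two additions (log wording assumed; later superseded by the
shared helper from #92):

```go
// 1. In Start's loop: a panicking tick is logged, the next tick still fires.
case <-ticker.C:
	func() {
		defer func() {
			if r := recover(); r != nil {
				log.Printf("scheduler: tick panic recovered: %v\n%s", r, debug.Stack())
			}
		}()
		s.tick(ctx)
	}()

// 2. In each fire goroutine: one bad workspace can't sink the batch.
go func(s2 scheduleRow) {
	defer wg.Done()
	defer func() { <-sem }()
	defer func() {
		if r := recover(); r != nil {
			log.Printf("scheduler: fireSchedule panic recovered: %v", r)
		}
	}()
	s.fireSchedule(ctx, s2)
}(sched)
```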
Plus a liveness watchdog:
- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.
Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.
Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.