Problem observed 2026-04-16: Research Lead, Dev Lead, Security Auditor,
and UIUX Designer were being auto-restarted by the liveness monitor every
~30 minutes, even though their containers were healthy and processing
real work. A2A callers (the PM and child agents) saw regular EOFs:
`A2A request to <leader-id> failed: Post http://ws-*:8000: EOF`
Followed in platform logs by:
`Liveness: workspace <id> TTL expired`
`Auto-restart: restarting <name> (was: offline)`
`Provisioner: stopped and removed container ws-*`
Root cause: the liveness key `ws:{id}` in Redis has a 60s TTL
(platform/internal/db/redis.go). The workspace heartbeat loop
(workspace-template/heartbeat.py) refreshes it every 30s, so a single
missed (or even slightly late) heartbeat is enough for the key to expire.
A busy Claude Code Opus synthesis can starve the container's asyncio
scheduler for 60-120s (the SDK spawns the claude CLI subprocess and
blocks until the message-reader yields; the heartbeat coroutine doesn't
run during that window). Leaders running 5-minute orchestrator pulses
or processing deep delegations routinely hit this. The platform then
mistakes a busy-but-healthy container for a dead one, marks it offline,
tears it down, and re-provisions — interrupting whatever work was mid-
synthesis and generating a cascade of EOF errors on pending A2A calls.
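For context, the refresh path has roughly this shape (a minimal sketch
assuming a go-redis v9 style client; the real redis.go and heartbeat
plumbing differ in naming and structure):

```go
// liveness_sketch.go: hypothetical sketch of the old refresh path, not the
// actual platform/internal/db/redis.go code.
package db

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// refreshLiveness re-arms the ws:{id} key with a hard-coded 60s TTL. The
// workspace heartbeat triggers this every 30s, so one late or missed beat
// is enough for the key to expire and the monitor to mark the workspace
// offline.
func refreshLiveness(ctx context.Context, rdb *redis.Client, workspaceID string) error {
	return rdb.Set(ctx, "ws:"+workspaceID, "alive", 60*time.Second).Err()
}
```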
Fix: hoist the TTL into a named `LivenessTTL` constant and raise it to
180s. With a 30s heartbeat interval this now tolerates up to ~5 missed
beats before expiry, comfortably longer than any realistic Opus stall,
while still detecting genuinely dead containers within 3 minutes.
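Continuing the same sketch, the change amounts to naming the TTL and
raising it (the constant name `LivenessTTL` comes from the description
above; everything else is illustrative):

```go
// Continuing the sketch above (same package and imports).

// LivenessTTL is how long the ws:{id} key survives without a heartbeat.
// At 180s with a 30s heartbeat interval, up to five consecutive beats can
// be missed before the platform treats the workspace as dead.
const LivenessTTL = 180 * time.Second

func refreshLiveness(ctx context.Context, rdb *redis.Client, workspaceID string) error {
	return rdb.Set(ctx, "ws:"+workspaceID, "alive", LivenessTTL).Err()
}
```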
Safety: real crashes are still caught immediately by a2a_proxy's reactive
IsRunning() check (maybeMarkContainerDead in a2a_proxy.go:439). That path
doesn't depend on TTL; it fires on the first failed forward. So this PR
only relaxes the "slow but alive" false positive; dead-container
detection is unchanged.
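That reactive path is roughly this shape (a hypothetical sketch only; the
real maybeMarkContainerDead in a2a_proxy.go has its own types and wiring):

```go
package a2aproxy

import "context"

// ContainerChecker stands in for whatever the proxy uses to ask the
// provisioner about a container; the interface name is illustrative.
type ContainerChecker interface {
	IsRunning(ctx context.Context, containerID string) (bool, error)
}

// shouldMarkDead mirrors the reactive check: a container is only treated
// as dead when a forward has failed AND the container is confirmed not
// running. A slow-but-alive container (the false positive this PR fixes)
// is left alone, and its fate is decided by the TTL alone.
func shouldMarkDead(ctx context.Context, c ContainerChecker, containerID string, forwardErr error) bool {
	if forwardErr == nil {
		return false
	}
	running, err := c.IsRunning(ctx, containerID)
	return err == nil && !running
}
```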
Observed impact before fix (2026-04-16 ~06:40–06:49 UTC, 10-minute
window, 4 containers affected):
| Container | EOF errors | Forced restart |
|-------------------|-----------:|:--------------:|
| Dev Lead | 5 | yes (06:48) |
| Research Lead | 5 | yes (06:47) |
| Security Auditor | 5 | yes (06:49) |
| UIUX Designer | 4 | no (not yet) |
Expected impact after merge + redeploy: drop to ~0 forced restarts on
healthy-busy leaders. If genuinely stuck agents stop responding, the
IsRunning check still catches them on the next A2A forward.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Closes #211 HIGH ops/security. RunMigrations globbed `*.sql`, which
matches both `.up.sql` and `.down.sql`. Alphabetical sort puts "d"
before "u", so every platform boot ran the rollback BEFORE the forward
migration for every pair from migration 018 onward.
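Reduced to its essentials, the old loop looked something like this (a
sketch; `apply` stands in for the real read-and-exec step, and the real
RunMigrations has more around it):

```go
package db

import "path/filepath"

// runMigrationsOld sketches the buggy behaviour: filepath.Glob("*.sql")
// matches both .up.sql and .down.sql, and since Glob returns matches in
// lexical order ("d" < "u"), 020_x.down.sql runs before 020_x.up.sql on
// every boot.
func runMigrationsOld(dir string, apply func(path string) error) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.sql"))
	if err != nil {
		return err
	}
	for _, f := range files {
		if err := apply(f); err != nil {
			return err
		}
	}
	return nil
}
```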
Net effect: every restart wiped workspace_auth_tokens (the 020 pair),
which in turn regressed AdminAuth to its fail-open bootstrap bypass for
every route protected by it: the live server was effectively
unauthenticated from restart until the next workspace re-registered.
The 018_secrets_encryption_version and 019_workspace_access pairs were
also silently wiped.
The fix is a 3-line filter: skip files whose base name ends in `.down.sql`.
Down migrations remain on disk for operator-driven rollback via psql,
but are never picked up by the auto-run loop.
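In the same reduced form, the filter is just a suffix check inside the
loop (again a sketch, not the literal diff):

```go
package db

import (
	"path/filepath"
	"strings"
)

// runMigrations applies only forward migrations: anything whose base name
// ends in .down.sql is skipped by the auto-run loop but stays on disk for
// operator-driven rollback via psql.
func runMigrations(dir string, apply func(path string) error) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.sql"))
	if err != nil {
		return err
	}
	for _, f := range files {
		if strings.HasSuffix(filepath.Base(f), ".down.sql") {
			continue // rollback file: never auto-applied
		}
		if err := apply(f); err != nil {
			return err
		}
	}
	return nil
}
```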
Added a unit test against a tmp dir to lock in the filter behaviour so
this can never regress: it stages a mix of legacy plain .sql files and
matched up/down pairs, and asserts that only the forward files survive.
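A test of that shape, written against the sketch above (the real test's
names and assertions will differ):

```go
package db

import (
	"os"
	"path/filepath"
	"testing"
)

func TestRunMigrationsSkipsDownFiles(t *testing.T) {
	dir := t.TempDir()
	// Stage a mix of a legacy plain .sql file and a matched up/down pair.
	for _, name := range []string{
		"001_init.sql",
		"020_workspace_auth_tokens.up.sql",
		"020_workspace_auth_tokens.down.sql",
	} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("-- noop"), 0o644); err != nil {
			t.Fatal(err)
		}
	}

	var applied []string
	if err := runMigrations(dir, func(path string) error {
		applied = append(applied, filepath.Base(path))
		return nil
	}); err != nil {
		t.Fatal(err)
	}

	// Only the forward files should survive the filter.
	want := []string{"001_init.sql", "020_workspace_auth_tokens.up.sql"}
	if len(applied) != 2 || applied[0] != want[0] || applied[1] != want[1] {
		t.Fatalf("applied %v, want %v", applied, want)
	}
}
```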
Follow-up (not in this PR): the runner still re-applies every migration
on every boot, so migrations must stay idempotent. A proper
schema_migrations tracking table is left as a future cleanup.
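For reference, one common shape for that follow-up (not part of this PR;
the table layout and helper name are illustrative, assuming Postgres via
database/sql):

```go
package db

import "database/sql"

// createSchemaMigrations records which forward files have already run.
const createSchemaMigrations = `
CREATE TABLE IF NOT EXISTS schema_migrations (
    filename   text PRIMARY KEY,
    applied_at timestamptz NOT NULL DEFAULT now()
)`

// alreadyApplied lets the runner skip a file it recorded on a previous
// boot instead of relying on every migration being idempotent.
func alreadyApplied(db *sql.DB, filename string) (bool, error) {
	var exists bool
	err := db.QueryRow(
		`SELECT EXISTS (SELECT 1 FROM schema_migrations WHERE filename = $1)`,
		filename,
	).Scan(&exists)
	return exists, err
}
```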
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>