molecule-core

Hongming Wang 9604584384 fix(liveness): raise workspace TTL 60s → 180s to survive Opus synthesis (#386)

Problem observed 2026-04-16: Research Lead, Dev Lead, Security Auditor, and UIUX Designer were being auto-restarted by the liveness monitor every ~30 minutes, even though their containers were healthy and processing real work. A2A callers (PM, children agents) saw regular EOFs:

    A2A request to <leader-id> failed: Post http://ws-:8000: EOF

Followed in platform logs by:

    Liveness: workspace <id> TTL expired
    Auto-restart: restarting <name> (was: offline)
    Provisioner: stopped and removed container ws-

Root cause: the liveness key `ws:{id}` in Redis has a 60s TTL (platform/internal/db/redis.go). The workspace heartbeat loop (workspace-template/heartbeat.py) refreshes it every 30s. That leaves room for exactly ONE missed heartbeat before expiry. A busy Claude Code Opus synthesis can starve the container's asyncio scheduler for 60-120s (the SDK spawns the claude CLI subprocess and blocks until the message-reader yields; the heartbeat coroutine doesn't run during that window). Leaders running 5-minute orchestrator pulses or processing deep delegations routinely hit this.

The platform then mistakes a busy-but-healthy container for a dead one, marks it offline, tears it down, and re-provisions, interrupting whatever work was mid-synthesis and generating a cascade of EOF errors on pending A2A calls.

Fix: hoist the TTL into a named `LivenessTTL` constant and raise it to 180s. With a 30s heartbeat interval this now tolerates up to ~5 missed beats before expiry, comfortably longer than any realistic Opus stall, while still detecting genuinely-dead containers within 3 minutes.

Safety: real crashes are still caught immediately by a2a_proxy's reactive IsRunning() check (maybeMarkContainerDead in a2a_proxy.go:439). That path doesn't depend on TTL; it fires on the first failed forward. So this PR only relaxes the "slow but alive" false-positive; dead-container detection is unchanged.

Observed impact before fix (2026-04-16 ~06:40–06:49 UTC, 10-minute window, 4 containers affected):

| Container         | EOF errors | Forced restart |
|-------------------|-----------:|:--------------:|
| Dev Lead          |          5 | yes (06:48)    |
| Research Lead     |          5 | yes (06:47)    |
| Security Auditor  |          5 | yes (06:49)    |
| UIUX Designer     |          4 | no (not yet)   |

Expected impact after merge + redeploy: drop to ~0 forced restarts on healthy-busy leaders. If genuinely-stuck agents stop responding, the IsRunning check still catches them on the next A2A forward.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>		2026-04-16 00:05:45 -07:00
..
bundle	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
channels	fix(security): scope PausePollersForToken to requesting workspace (closes #329)	2026-04-15 21:22:50 -07:00
crypto	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
db	fix(liveness): raise workspace TTL 60s → 180s to survive Opus synthesis (#386)	2026-04-16 00:05:45 -07:00
envx	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
events	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
handlers	fix(security): forward Authorization header in transcript proxy (#405) (#380)	2026-04-15 23:38:07 -07:00
metrics	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
middleware	chore(test): remove dead constants from wsauth_middleware_test.go (#358)	2026-04-16 05:02:11 +00:00
models	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
plugins	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
provisioner	config(org): add Telegram to Dev Lead and Research Lead (#385)	2026-04-16 00:00:10 -07:00
registry	fix(registry): allow ancestor↔descendant A2A so audit_summary can reach PM	2026-04-14 22:18:38 -07:00
router	feat: GET /workspaces/:id/transcript — live agent session log	2026-04-15 14:29:43 -07:00
scheduler	fix(code-review): CanvasOrBearer fall-through, scheduler short(), activity spoof log + 6 new tests	2026-04-15 11:48:25 -07:00
supervised	fix(platform): panic-recovering supervisor for every background goroutine (#92)	2026-04-14 20:34:18 -07:00
ws	initial commit — Molecule AI platform	2026-04-13 11:55:37 -07:00
wsauth	fix(security): close WorkspaceAuth fail-open on non-existent workspace IDs (#318)	2026-04-15 21:02:29 -07:00