Registry & Heartbeat
Every workspace registers with the platform on startup and sends a heartbeat every 30 seconds. This is how the platform knows which workspaces are alive and where to find them.
Registration Flow
When a workspace container starts:
POST /registry/register
Body: { id, url, agent_card }
The platform:
- Writes the Agent Card to Postgres (`workspaces` table)
- Sets Redis key `ws:{id} = "online"` with TTL 60 seconds
- Appends a `WORKSPACE_ONLINE` event to `structure_events`
- Broadcasts the event via WebSocket — canvas node turns green
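The workspace side of this handshake is just a JSON POST. A minimal sketch in Python (`WORKSPACE_ID`, `WORKSPACE_URL`, and the Agent Card contents are placeholder assumptions; the real workspace sends this payload with its HTTP client at container start):

```python
WORKSPACE_ID = "ws-demo"                 # hypothetical id for illustration
WORKSPACE_URL = "http://ws-demo:9000"    # where peers can reach this workspace

def build_registration_payload(agent_card):
    """Body for POST /registry/register: { id, url, agent_card }."""
    return {
        "id": WORKSPACE_ID,
        "url": WORKSPACE_URL,
        "agent_card": agent_card,
    }

# The workspace POSTs this to {platform}/registry/register; the platform
# responds by writing the card to Postgres and setting ws:{id} = "online"
# in Redis with a 60-second TTL.
```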
Heartbeat Flow
Every 30 seconds:
# workspace/heartbeat.py
await platform.post("/registry/heartbeat", json={
    "workspace_id": WORKSPACE_ID,
    # used by platform to make status decisions
    "error_rate": error_tracker.error_rate,      # triggers degraded
    "sample_error": error_tracker.sample_error,  # shown on canvas
    # informational — shown on canvas node tooltip
    "active_tasks": task_counter.current,        # how many tasks running now
    "uptime_seconds": time.time() - START_TIME,  # how long since container start
    "current_task": current_task_description,    # what the agent is working on
})
Five fields. Memory usage, CPU, queue depth — those are infrastructure metrics for Prometheus/Grafana or CloudWatch. The platform registry is a service discovery layer, not a metrics store.
active_tasks is included because the canvas uses it for a busy indicator on the node, and it sets up backpressure for Phase 2 without a schema change.
current_task is a human-readable description of what the agent is currently working on. The platform stores it in workspaces.current_task and broadcasts a TASK_UPDATED WebSocket event only when the value changes (not on every heartbeat). The canvas shows it as an amber banner on the workspace node and side panel header.
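The broadcast-on-change behavior can be sketched as a small helper (illustrative names; the real platform does this in Go against the `workspaces` row):

```python
class CurrentTaskBroadcaster:
    """Broadcast TASK_UPDATED only when current_task actually changes,
    not on every heartbeat. Sketch only; names are illustrative."""

    def __init__(self, broadcast):
        self._last = {}             # workspace_id -> last seen current_task
        self._broadcast = broadcast  # e.g. a WebSocket fan-out callable

    def on_heartbeat(self, workspace_id, current_task):
        if self._last.get(workspace_id) == current_task:
            return False            # unchanged: no event, avoid canvas churn
        self._last[workspace_id] = current_task
        self._broadcast({
            "type": "TASK_UPDATED",
            "workspace_id": workspace_id,
            "current_task": current_task,
        })
        return True
```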
The platform:
- Overwrites heartbeat columns in Postgres (latest snapshot only — no history)
- Refreshes Redis TTL to 60 seconds
- Checks error rate for status transitions (see Health Monitoring below)
// workspace-server/internal/registry/heartbeat.go
func HandleHeartbeat(workspaceID string, stats HeartbeatStats) {
    db.Exec(`
        UPDATE workspaces SET
            last_heartbeat_at = now(),
            last_error_rate   = $2,
            last_sample_error = $3,
            active_tasks      = $4,
            uptime_seconds    = $5,
            current_task      = $6
        WHERE id = $1
    `, workspaceID,
        stats.ErrorRate, stats.SampleError,
        stats.ActiveTasks, stats.UptimeSeconds,
        stats.CurrentTask,
    )
    redis.Refresh(workspaceID, 60*time.Second)
    evaluateStatusTransition(workspaceID, stats)
}
No heartbeat history table. Heartbeats arrive every 30 seconds — storing history would be 2880 rows per workspace per day with no practical use. If you need health trends, Langfuse traces capture that at the task level with far more detail.
Health Monitoring
The workspace self-reports its health via the heartbeat payload. The platform decides status transitions based on error rate thresholds:
| Condition | Transition | Event |
|---|---|---|
| `error_rate >= 0.5` and status is `online` | `online -> degraded` | `WORKSPACE_DEGRADED` |
| `error_rate < 0.1` and status is `degraded` | `degraded -> online` | `WORKSPACE_ONLINE` |
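The transition logic is a pure function of the current status and the reported error rate (a Python sketch of what the Go side's `evaluateStatusTransition` decides). Note the deliberate gap between the two thresholds: a workspace with an error rate between 0.1 and 0.5 keeps its current status, which prevents flapping.

```python
def evaluate_status_transition(status, error_rate):
    """Apply the threshold table above.

    Returns (new_status, event) where event is None if nothing changed.
    Degrade at >= 0.5, recover only below 0.1 (hysteresis gap in between).
    """
    if status == "online" and error_rate >= 0.5:
        return "degraded", "WORKSPACE_DEGRADED"
    if status == "degraded" and error_rate < 0.1:
        return "online", "WORKSPACE_ONLINE"
    return status, None
```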
What counts as an error: Only things that indicate the workspace itself is broken — 5xx responses, timeouts, connection errors. Client errors (400, 403) are the caller's fault and are not counted.
The workspace tracks errors locally using a rolling 60-second window and includes the stats in every heartbeat. The platform doesn't sit in the A2A message path, so it can't monitor response codes directly — self-reporting via heartbeat is the mechanism.
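A minimal sketch of that workspace-side tracker, assuming a deque-based rolling window (the real `error_tracker` interface is only what the heartbeat payload above shows: `error_rate` and `sample_error`; everything else here is illustrative):

```python
import time
from collections import deque

class ErrorTracker:
    """Rolling 60-second window of request outcomes. Only server-side
    failures count: 5xx, timeouts (no status), connection errors."""

    WINDOW = 60.0  # seconds

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._events = deque()      # (timestamp, is_error) pairs
        self.sample_error = None    # latest example, shown on the canvas

    def record(self, status_code, detail=""):
        # status_code of None models a timeout / connection error.
        is_error = status_code is None or status_code >= 500
        self._events.append((self._clock(), is_error))
        if is_error and detail:
            self.sample_error = detail

    @property
    def error_rate(self):
        cutoff = self._clock() - self.WINDOW
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()  # drop outcomes older than the window
        if not self._events:
            return 0.0
        return sum(err for _, err in self._events) / len(self._events)
```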
Liveness Detection (No Polling)
Redis keyspace notifications are enabled (`notify-keyspace-events = KEA`). When the `ws:{id}` TTL expires (the workspace missed two heartbeats), Redis fires an expiry event automatically.
Workspace starts
|
v
POST /registry/register -> Platform writes Agent Card to Postgres
-> Platform sets Redis key: ws:{id} = "online" TTL 60s
Every 30s:
POST /registry/heartbeat -> Platform refreshes Redis TTL
Workspace crashes / goes dark:
|
v
Redis TTL expires (60s)
|
v
Redis keyspace event fires
|
v
Platform marks workspace offline in Postgres
|
v
WebSocket broadcast -> canvas node turns gray
On expiry, the platform:
- Receives Redis keyspace expiry event
- Writes a `WORKSPACE_OFFLINE` event to Postgres
- Updates `workspaces.status = 'offline'`
- Broadcasts via WebSocket — canvas node turns gray
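On the platform side, the listener mostly just parses the expired key back into a workspace id. A sketch of that step (the subscription itself would use a Redis client's pattern-subscribe on `__keyevent@0__:expired`; DB 0 and the exact key layout are assumptions based on the `ws:{id}` pattern above):

```python
def workspace_from_expiry_event(channel, key):
    """Given a Redis keyspace notification, return the workspace id if the
    expired key is a ws:{id} liveness key, else None.

    Filters out other keys such as the ws:{id}:url cache entries.
    """
    if channel != "__keyevent@0__:expired":
        return None
    if not key.startswith("ws:") or key.count(":") != 1:
        return None
    return key.split(":", 1)[1]

# On a non-None result the platform writes WORKSPACE_OFFLINE, flips
# workspaces.status, and broadcasts over WebSocket.
```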
Heartbeat Delegation Checking
In addition to alive signals, the heartbeat loop checks for completed delegations every 30 seconds:
Every 30s:
1. POST /registry/heartbeat → alive signal
2. GET /workspaces/:id/delegations → check results
3. If new completions found:
   a. Write to /tmp/delegation_results.jsonl (for context injection)
   b. Send A2A self-message to wake the agent (5-min cooldown)
   c. Push notification to user via POST /notify
Self-message trigger: When the heartbeat detects a completed delegation, it sends an A2A message to its own workspace: "Delegation results are ready. Review them and take appropriate action." This wakes the agent so it can report results back up the chain.
Cooldown: At most one self-message per 5 minutes to prevent infinite loops (agent delegates → completes → self-message → agent delegates again → ...).
Context injection: When the agent receives any message (including the self-message), the CLI executor reads /tmp/delegation_results.jsonl and injects results into the prompt as [Delegation results received while you were idle].
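The cooldown itself is simple to state precisely. A sketch, assuming a 300-second window and a monotonic clock (class and parameter names are illustrative):

```python
import time

class SelfMessageCooldown:
    """At most one wake-up self-message per cooldown period, to break the
    delegate -> complete -> self-message -> delegate loop."""

    def __init__(self, cooldown_seconds=300.0, clock=time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_sent = None

    def try_send(self, send):
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._cooldown:
            return False  # still cooling down: skip this wake-up
        self._last_sent = now
        send("Delegation results are ready. Review them and take appropriate action.")
        return True
```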
Delegation status lifecycle:
pending → dispatched → received → in_progress → completed | failed
Each transition broadcasts a DELEGATION_STATUS WebSocket event. DELEGATION_COMPLETE and DELEGATION_FAILED are broadcast on terminal states.
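Assuming the chain above is strict (no skipped states, and failure only out of `in_progress`, which is our reading rather than something the doc states), the per-transition event fan-out can be sketched as:

```python
# Allowed next states per status (assumed strict; illustrative only).
DELEGATION_TRANSITIONS = {
    "pending": {"dispatched"},
    "dispatched": {"received"},
    "received": {"in_progress"},
    "in_progress": {"completed", "failed"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
}

def delegation_events(old, new):
    """WebSocket events to broadcast for one status change."""
    if new not in DELEGATION_TRANSITIONS.get(old, set()):
        raise ValueError(f"illegal transition {old} -> {new}")
    events = ["DELEGATION_STATUS"]
    if new == "completed":
        events.append("DELEGATION_COMPLETE")
    elif new == "failed":
        events.append("DELEGATION_FAILED")
    return events
```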
Workspace Moves to a New Machine
When a workspace starts on a new machine (e.g. new EC2 instance):
- Workspace sends `POST /registry/register` with the new URL
- Platform updates `workspaces.url` in Postgres
- Platform busts the Redis URL cache key `ws:{id}:url`
- All subsequent inter-workspace calls use the new URL automatically
Workspace Forwarding
The forwarded_to column is set when a workspace is retired but a replacement exists. Three scenarios:
1. Workspace replaced by a better version: User deploys a new SEO agent with improved skills and retires the old one. Old workspace gets forwarded_to = new_workspace_id. Any workspace that still has the old URL cached gets a 410 Gone + redirect pointer and updates its reference automatically.
2. Workspace expanded into a team: A single Developer Agent expands into a Developer Team. The original single-agent workspace is retired. forwarded_to = developer_pm_id (the team lead). Anything that was talking to the old Developer Agent now gets redirected to Developer PM.
3. Workspace moved to a different parent: A workspace is reorganized in the hierarchy. Old workspace entry kept for redirect, new one created under the new parent.
In all cases, the forwarding is transparent — callers follow the redirect and update their cached URL.
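Caller-side redirect handling can be sketched as follows. `fetch` is a stand-in for the HTTP client, and delivering the forwarding pointer in a `Location` header is an assumption; the 301 and 410-plus-pointer responses themselves come from the scenarios above:

```python
def resolve_workspace_url(cached_url, fetch):
    """Return the URL the caller should use (and re-cache).

    fetch(url) -> (status_code, headers). A 301 or 410 carrying a
    forwarding pointer means the workspace was moved or retired and
    forwarded_to points at its replacement.
    """
    status, headers = fetch(cached_url)
    if status in (301, 410) and "Location" in headers:
        return headers["Location"]  # adopt the replacement URL
    return cached_url               # still valid, keep the cache entry
```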
Querying a Deleted Workspace
When a workspace is queried after being deleted or retired:
- Platform checks the `workspaces` table — status is `removed`
- Platform queries `structure_events` for the last event on that ID
- If `workspaces.forwarded_to` is set, returns `301` with the new workspace URL
- If no forwarding, returns `410 Gone` with the last event payload
- Canvas removes the node; parent workspace is notified
Related Docs
- Platform API — Full endpoint reference
- Event Log — How events are recorded
- Database Schema — Redis key patterns