docs/content/docs/observability.mdx
Hongming Wang a620e5a7a3 docs: comprehensive content for all 15 documentation pages
Previously 7 pages were stubs ("Coming soon"). Now all 15 have full content:

- index.mdx: SaaS subdomain table, runtime adapters, MCP/SDK links
- quickstart.mdx: 3 setup options (dev-start.sh, docker-compose, manual), SaaS alternative
- concepts.mdx: added external agents, Lark channel, tokens, MCP integration
- architecture.mdx: system diagram, 4 components, infra services, health detection, deployment modes
- api-reference.mdx: all 80+ routes across 19 categories with auth requirements
- channels.mdx: Telegram, Slack, Lark/Feishu adapters with config examples
- plugins.mdx: two-axis model, 12 built-in plugins, install safeguards
- schedules.mdx: cron syntax, concurrency handling, supervision, org template examples
- org-template.mdx: YAML structure, defaults layer, plugin UNION, template registry
- self-hosting.mdx: dev-start.sh, docker-compose, env vars, production deployment
- observability.mdx: activity logs, Langfuse, Prometheus, liveness, WebSocket events
- troubleshooting.mdx: 10 common issues with fixes

Build verified: 19/19 static pages generated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:05:12 -07:00


---
title: Observability
description: Monitor agent activity, LLM traces, and platform health.
---
## Overview
Molecule AI provides multiple layers of observability -- from real-time WebSocket events on the canvas to structured activity logs, LLM traces, Prometheus metrics, and admin health endpoints.
## Activity Logs
Every significant action in the platform is recorded in the `activity_logs` table. Query logs for a specific workspace:
```
GET /workspaces/:id/activity
```
Activity types include:
- **A2A communications** -- request/response capture with duration and method
- **Task updates** -- agent-reported task status changes
- **Agent logs** -- structured log entries from workspace runtimes
- **Errors** -- failures with `error_detail` for debugging

Filter by source to separate user-agent chat (`source=canvas`) from agent-to-agent traffic (`source=agent`).
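A minimal sketch of building the filtered activity query, assuming the platform API listens on `http://localhost:8080` (the host this page uses for `/metrics`) and that `source` is passed as a query parameter. The helper name `activity_url` and the workspace ID `ws-123` are illustrative only and not part of any Molecule AI SDK.

```python
from urllib.parse import quote, urlencode

# Assumption: the platform API is reachable here; adjust for your deployment.
BASE_URL = "http://localhost:8080"


def activity_url(workspace_id, source=None, base_url=BASE_URL):
    """Build the activity-log URL for a workspace, optionally filtered by source."""
    url = f"{base_url}/workspaces/{quote(workspace_id, safe='')}/activity"
    if source is not None:
        # "canvas" and "agent" are the two source values described in the text.
        url += "?" + urlencode({"source": source})
    return url


if __name__ == "__main__":
    # "ws-123" is a made-up workspace ID used only for illustration.
    print(activity_url("ws-123", source="canvas"))
    print(activity_url("ws-123", source="agent"))
```

Keeping URL construction separate from the HTTP call makes the filter easy to unit-test without a running platform.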
Activity logs are pruned automatically: entries older than `ACTIVITY_RETENTION_DAYS` (default 7) are removed by a cleanup job that runs every `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default 6).
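The interaction between the two settings can be sketched as follows. This is an illustration under stated assumptions, not the platform's cleanup code: it assumes the job removes every entry older than the retention window each time it runs, so an entry can outlive the window by at most one cleanup interval. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Defaults taken from the text above.
DEFAULT_RETENTION_DAYS = 7
DEFAULT_CLEANUP_INTERVAL_HOURS = 6


def retention_cutoff(now, retention_days=DEFAULT_RETENTION_DAYS):
    """Entries created before this instant are eligible for cleanup."""
    return now - timedelta(days=retention_days)


def is_expired(created_at, now, retention_days=DEFAULT_RETENTION_DAYS):
    """True when an entry falls outside the retention window (assumed strict)."""
    return created_at < retention_cutoff(now, retention_days)


def max_lifetime(retention_days=DEFAULT_RETENTION_DAYS,
                 cleanup_interval_hours=DEFAULT_CLEANUP_INTERVAL_HOURS):
    """Upper bound on how long an entry can exist before a cleanup run removes it."""
    return timedelta(days=retention_days, hours=cleanup_interval_hours)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("cutoff:", retention_cutoff(now).isoformat())
    print("worst-case lifetime:", max_lifetime())
```

With the defaults, an entry that expires just after a cleanup run would remain queryable for up to 7 days and 6 hours under these assumptions.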
## LLM Traces
Molecule AI integrates with [Langfuse](https://langfuse.com) for LLM observability. Langfuse runs as part of the infrastructure stack on port 3001, backed by ClickHouse for efficient trace storage.
View traces for a specific workspace:
```
GET /workspaces/:id/traces
```
The Langfuse UI at `http://localhost:3001` provides:
- Token usage and cost tracking per workspace
- Latency breakdowns for LLM calls
- Prompt/completion pairs for debugging
- Trace timelines showing multi-step agent reasoning
## Prometheus Metrics
The platform exposes Prometheus-format metrics at:
```
GET /metrics
```
This endpoint requires no authentication and is safe to scrape. Metrics are in Prometheus text format (v0.0.4) and include:
- Request counts by method, path, and status code
- Request latency histograms
- Active WebSocket connections
- Workspace status counts

Configure your Prometheus instance to scrape `http://localhost:8080/metrics` at your preferred interval.
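For ad hoc inspection outside Prometheus, the text format is simple enough to parse by hand. The sketch below is a minimal parser for sample lines in the Prometheus text exposition format. It skips `# HELP` and `# TYPE` comments and does not unescape label values. The metric names in the demo are hypothetical, since this page does not list the actual names exposed by the platform.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?$'
)
# One label pair inside the braces: key="value" (escaped quotes tolerated).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore lines this sketch does not understand
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    demo = (
        '# TYPE http_requests_total counter\n'
        'http_requests_total{method="GET",path="/metrics",status="200"} 42\n'
        'active_websocket_connections 3\n'
    )
    for name, labels, value in parse_metrics(demo):
        print(name, labels, value)
```

To use it against a live platform, fetch `http://localhost:8080/metrics` with any HTTP client and pass the response body to `parse_metrics`.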
## Admin Liveness
The liveness endpoint reports the health of every supervised subsystem:
```
GET /admin/liveness
```
This endpoint requires `AdminAuth` (bearer token). It returns the `supervised.Snapshot()` for each subsystem, including its age: how long it has been since that subsystem last reported healthy. Use it to spot stuck schedulers, stalled heartbeat goroutines, or unresponsive health sweeps before diving into logs.
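A hedged sketch of calling the endpoint and flagging stale subsystems. The base URL is an assumption, and this page does not document the response schema, so `stale_subsystems` works on a plain mapping of subsystem name to age in seconds. Converting the real response into that mapping is left to the caller. All helper and subsystem names are hypothetical.

```python
import json
import urllib.request

# Assumption: the platform API is reachable here; adjust for your deployment.
BASE_URL = "http://localhost:8080"


def liveness_request(admin_token, base_url=BASE_URL):
    """Build a GET request for /admin/liveness carrying the admin bearer token."""
    return urllib.request.Request(
        f"{base_url}/admin/liveness",
        headers={"Authorization": f"Bearer {admin_token}"},
    )


def fetch_liveness(admin_token, base_url=BASE_URL, timeout=5.0):
    """Perform the request and decode the JSON body (schema not assumed here)."""
    request = liveness_request(admin_token, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def stale_subsystems(ages_seconds, max_age_seconds):
    """Return the sorted names of subsystems whose age exceeds the threshold."""
    return sorted(
        name for name, age in ages_seconds.items() if age > max_age_seconds
    )


if __name__ == "__main__":
    # Hypothetical subsystem names and ages, for illustration only.
    ages = {"scheduler": 400.0, "heartbeat": 2.0, "health_sweep": 90.0}
    print(stale_subsystems(ages, max_age_seconds=60))
```

A sensible threshold depends on how often each subsystem is expected to report, so the cutoff is left as a parameter rather than hard-coded.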
## WebSocket Events
The canvas receives real-time updates via WebSocket at `/ws`. Every state change in the platform is broadcast to connected clients:
| Event | Trigger |
|-------|---------|
| `WORKSPACE_ONLINE` | Workspace registers successfully |
| `WORKSPACE_OFFLINE` | Heartbeat TTL expires or health sweep detects dead container |
| `WORKSPACE_DEGRADED` | Error rate exceeds threshold |
| `WORKSPACE_RECOVERED` | Error rate drops back to normal |
| `WORKSPACE_REMOVED` | Workspace deleted |
| `HEARTBEAT` | Periodic heartbeat from workspace |
| `A2A_RESPONSE` | Agent-to-agent message received |
| `AGENT_MESSAGE` | Agent pushes a message to the user |

Events flow through Redis pub/sub to ensure all platform instances broadcast consistently.
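A client consuming this stream can keep a local view of workspace status by folding events into a map. The sketch below assumes each decoded message is a dict with `event` and `workspace_id` keys. Those key names, the status strings, and the choice to treat `WORKSPACE_RECOVERED` as a return to `online` are assumptions, since the message schema is not documented on this page. Connecting to `/ws` requires a WebSocket client and is omitted.

```python
# Status implied by each lifecycle event in the table above.
# The status strings themselves are illustrative labels.
STATUS_BY_EVENT = {
    "WORKSPACE_ONLINE": "online",
    "WORKSPACE_OFFLINE": "offline",
    "WORKSPACE_DEGRADED": "degraded",
    "WORKSPACE_RECOVERED": "online",
}


def apply_event(statuses, event):
    """Return a new workspace-status map after applying one event.

    Assumes hypothetical payload keys "event" and "workspace_id".
    """
    updated = dict(statuses)
    kind = event["event"]
    workspace_id = event["workspace_id"]
    if kind == "WORKSPACE_REMOVED":
        updated.pop(workspace_id, None)
    elif kind in STATUS_BY_EVENT:
        updated[workspace_id] = STATUS_BY_EVENT[kind]
    # HEARTBEAT, A2A_RESPONSE and AGENT_MESSAGE do not change status here.
    return updated


if __name__ == "__main__":
    state = {}
    for message in [
        {"event": "WORKSPACE_ONLINE", "workspace_id": "ws-1"},
        {"event": "WORKSPACE_DEGRADED", "workspace_id": "ws-1"},
        {"event": "HEARTBEAT", "workspace_id": "ws-1"},
    ]:
        state = apply_event(state, message)
    print(state)
```

Returning a new map on each call keeps the reducer free of side effects, which makes replaying a recorded event sequence straightforward.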
## Structure Events
The `structure_events` table is an append-only audit log of every structural change in the platform. Each event is:
1. Inserted into the database via `broadcaster.RecordAndBroadcast()`
2. Published to Redis pub/sub
3. Relayed to WebSocket clients

Query events for a specific workspace or globally:
```
GET /events/:workspaceId # Workspace-specific
GET /events # All events
```
Both endpoints require `AdminAuth`.
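The record, publish, relay sequence described in this section can be illustrated with an in-memory stand-in. This is a conceptual sketch only, not the implementation of `broadcaster.RecordAndBroadcast()`: a Python list plays the role of the `structure_events` table, and registered callbacks play the role of platform instances subscribed to Redis that relay to their WebSocket clients. The class and method names are hypothetical.

```python
class SketchBroadcaster:
    """In-memory illustration of record-then-broadcast ordering."""

    def __init__(self):
        self.log = []          # stands in for the append-only structure_events table
        self.subscribers = []  # stand in for instances subscribed via Redis pub/sub

    def subscribe(self, deliver):
        """Register a callback that relays an event to connected clients."""
        self.subscribers.append(deliver)

    def record_and_broadcast(self, event):
        # Step 1: record the event before anyone can observe it.
        self.log.append(dict(event))
        # Steps 2 and 3: publish to every subscriber, which relays to its clients.
        for deliver in self.subscribers:
            deliver(dict(event))


if __name__ == "__main__":
    broadcaster = SketchBroadcaster()
    broadcaster.subscribe(lambda event: print("relayed:", event))
    broadcaster.record_and_broadcast(
        {"event": "WORKSPACE_ONLINE", "workspace_id": "ws-1"}
    )
    print("audit log size:", len(broadcaster.log))
```

Recording before publishing means that, in this sketch, any event a client receives is already present in the audit log.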
## Session Search
Search through chat history for a workspace:
```
GET /workspaces/:id/session-search?q=deployment+error
```
This searches across both user-agent conversations and agent-to-agent A2A traffic stored in the activity logs.
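Search terms must be URL-encoded before they are placed in `q`. A minimal sketch, assuming the platform API is at `http://localhost:8080`. The helper name `session_search_url` and the workspace ID are illustrative.

```python
from urllib.parse import quote, urlencode

# Assumption: the platform API is reachable here; adjust for your deployment.
BASE_URL = "http://localhost:8080"


def session_search_url(workspace_id, query, base_url=BASE_URL):
    """Build the session-search URL with the query safely encoded."""
    path = f"/workspaces/{quote(workspace_id, safe='')}/session-search"
    # urlencode encodes spaces as "+" and escapes reserved characters such as "&".
    return f"{base_url}{path}?{urlencode({'q': query})}"


if __name__ == "__main__":
    # "ws-123" is a made-up workspace ID used only for illustration.
    print(session_search_url("ws-123", "deployment error"))
```

For the input `deployment error`, this produces the same `q=deployment+error` form shown in the request above.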
## Current Task Visibility
Each workspace reports its current task via heartbeat. This is visible in two places:
- **Canvas node** -- the workspace card on the canvas shows the current task text
- **Heartbeat data** -- `GET /registry/discover/:id` includes `current_task` in the workspace info

When `active_tasks` drops to zero, the current task field clears and the idle loop (if configured) begins its countdown.
## Schedule Run History
For workspaces with cron schedules, inspect past runs:
```
GET /workspaces/:id/schedules/:scheduleId/history
```
Each history entry includes:
- Execution timestamp
- Status (`success`, `failed`, `skipped`)
- Duration
- `error_detail` when the run failed (populated by `scheduler.fireSchedule`)

A status of `skipped` means the workspace was busy (active tasks > 0) when the schedule fired and the concurrency-aware scheduler chose not to queue the prompt.
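A short sketch of summarizing a history response, assuming it has already been decoded into a list of dicts. The `status` key name is an assumption about the JSON shape. The status values and `error_detail` come from the list above.

```python
from collections import Counter


def summarize_history(entries):
    """Count runs per status and collect error details from failed runs."""
    counts = Counter(entry["status"] for entry in entries)
    failures = [
        entry.get("error_detail", "")
        for entry in entries
        if entry["status"] == "failed"
    ]
    return {
        "success": counts["success"],
        "failed": counts["failed"],
        "skipped": counts["skipped"],
        "failures": failures,
    }


if __name__ == "__main__":
    # Made-up entries, for illustration only.
    history = [
        {"status": "success"},
        {"status": "skipped"},
        {"status": "failed", "error_detail": "example failure text"},
    ]
    print(summarize_history(history))
```

A high share of `skipped` runs suggests the workspace is frequently busy when the schedule fires, which may be a reason to revisit the cron timing.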