Previously 7 pages were stubs ("Coming soon"). Now all 15 have full content:
- index.mdx: SaaS subdomain table, runtime adapters, MCP/SDK links
- quickstart.mdx: 3 setup options (dev-start.sh, docker-compose, manual), SaaS alternative
- concepts.mdx: added external agents, Lark channel, tokens, MCP integration
- architecture.mdx: system diagram, 4 components, infra services, health detection, deployment modes
- api-reference.mdx: all 80+ routes across 19 categories with auth requirements
- channels.mdx: Telegram, Slack, Lark/Feishu adapters with config examples
- plugins.mdx: two-axis model, 12 built-in plugins, install safeguards
- schedules.mdx: cron syntax, concurrency handling, supervision, org template examples
- org-template.mdx: YAML structure, defaults layer, plugin UNION, template registry
- self-hosting.mdx: dev-start.sh, docker-compose, env vars, production deployment
- observability.mdx: activity logs, Langfuse, Prometheus, liveness, WebSocket events
- troubleshooting.mdx: 10 common issues with fixes
Build verified: 19/19 static pages generated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
142 lines
4.7 KiB
Plaintext
---
title: Observability
description: Monitor agent activity, LLM traces, and platform health.
---

## Overview

Molecule AI provides multiple layers of observability -- from real-time WebSocket events on the canvas to structured activity logs, LLM traces, Prometheus metrics, and admin health endpoints.

## Activity Logs

Every significant action in the platform is recorded in the `activity_logs` table. Query logs for a specific workspace:

```
GET /workspaces/:id/activity
```

Activity types include:

- **A2A communications** -- request/response capture with duration and method
- **Task updates** -- agent-reported task status changes
- **Agent logs** -- structured log entries from workspace runtimes
- **Errors** -- failures with `error_detail` for debugging

Filter by source to separate user-agent chat (`source=canvas`) from agent-to-agent traffic (`source=agent`).
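
As a rough illustration of the source filter, the sketch below builds the request URL for either kind of traffic. The base URL `http://localhost:8080` and the helper name `activity_url` are assumptions made for this example; only the path and the `source` values come from the text above.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Assumed base URL of the platform API; adjust for your deployment.
BASE_URL = "http://localhost:8080"


def activity_url(workspace_id: str, source: Optional[str] = None) -> str:
    """Build the activity-log URL for a workspace, optionally filtered by source."""
    url = f"{BASE_URL}/workspaces/{quote(workspace_id, safe='')}/activity"
    if source is not None:
        # `source=canvas` is user chat, `source=agent` is agent-to-agent traffic.
        url += "?" + urlencode({"source": source})
    return url


print(activity_url("ws-123", source="agent"))
# prints http://localhost:8080/workspaces/ws-123/activity?source=agent
```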

Activity logs are automatically cleaned up based on `ACTIVITY_RETENTION_DAYS` (default 7). The cleanup job runs every `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default 6).
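
To make the retention arithmetic concrete, here is a minimal sketch that reads the two variables with their documented defaults and computes the cutoff before which entries would be eligible for cleanup. The function names are hypothetical, and how the platform applies the window internally (for example, which timestamp it compares against) is not specified here.

```python
import os
from datetime import datetime, timedelta, timezone


def retention_cutoff(now: datetime, env=os.environ) -> datetime:
    """Return the timestamp before which activity logs would be cleaned up."""
    days = int(env.get("ACTIVITY_RETENTION_DAYS", "7"))  # documented default: 7
    return now - timedelta(days=days)


def cleanup_interval(env=os.environ) -> timedelta:
    """Return how often the cleanup job is expected to run."""
    hours = int(env.get("ACTIVITY_CLEANUP_INTERVAL_HOURS", "6"))  # documented default: 6
    return timedelta(hours=hours)


now = datetime(2025, 1, 8, 12, 0, tzinfo=timezone.utc)
print(retention_cutoff(now, env={}))  # prints 2025-01-01 12:00:00+00:00
```

With the defaults, an entry can outlive the retention window by up to one cleanup interval before the next run removes it.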

## LLM Traces

Molecule AI integrates with [Langfuse](https://langfuse.com) for LLM observability. Langfuse runs as part of the infrastructure stack on port 3001, backed by ClickHouse for efficient trace storage.

View traces for a specific workspace:

```
GET /workspaces/:id/traces
```

The Langfuse UI at `http://localhost:3001` provides:

- Token usage and cost tracking per workspace
- Latency breakdowns for LLM calls
- Prompt/completion pairs for debugging
- Trace timelines showing multi-step agent reasoning

## Prometheus Metrics

The platform exposes Prometheus-format metrics at:

```
GET /metrics
```

This endpoint requires no authentication and is safe to scrape. Metrics are in Prometheus text format (v0.0.4) and include:

- Request counts by method, path, and status code
- Request latency histograms
- Active WebSocket connections
- Workspace status counts

Configure your Prometheus instance to scrape `http://localhost:8080/metrics` at your preferred interval.
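
For a quick check without a Prometheus server, the text format is simple enough to parse by hand. The sketch below is a minimal parser that assumes each sample line has no trailing timestamp. The metric names in `SAMPLE` are hypothetical placeholders, not the names the platform actually exports.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text-format sample lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        # Split on the last space: the series (name plus labels) may itself
        # contain spaces inside label values. Assumes no trailing timestamp.
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


# Hypothetical output; real metric names will differ.
SAMPLE = """\
# HELP http_requests_total Hypothetical request counter.
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/metrics",status="200"} 42
websocket_connections_active 3
"""

metrics = parse_metrics(SAMPLE)
print(metrics["websocket_connections_active"])  # prints 3.0
```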

## Admin Liveness

The liveness endpoint reports the health of every supervised subsystem:

```
GET /admin/liveness
```

This endpoint requires `AdminAuth` (bearer token). It returns a `supervised.Snapshot()` for each subsystem, including its age: how long ago the subsystem last reported healthy. Use this to debug stuck schedulers, stalled heartbeat goroutines, or unresponsive health sweeps before diving into logs.
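
The JSON shape of the liveness response is not shown here, so the following sketch assumes it has already been reduced to a mapping from subsystem name to age in seconds. The subsystem names, the 60-second threshold, and the helper `stale_subsystems` are all hypothetical.

```python
def stale_subsystems(ages_seconds: dict, max_age_seconds: float) -> list:
    """Return subsystems whose last healthy report is older than the limit, oldest first."""
    stale = [(age, name) for name, age in ages_seconds.items() if age > max_age_seconds]
    return [name for age, name in sorted(stale, reverse=True)]


# Hypothetical ages extracted from a liveness response.
ages = {"scheduler": 4.0, "heartbeat": 95.0, "health_sweep": 310.0}
print(stale_subsystems(ages, max_age_seconds=60))  # prints ['health_sweep', 'heartbeat']
```

Sorting oldest first puts the subsystem most likely to be stuck at the top of the list.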

## WebSocket Events

The canvas receives real-time updates via WebSocket at `/ws`. Every state change in the platform is broadcast to connected clients:

| Event | Trigger |
|-------|---------|
| `WORKSPACE_ONLINE` | Workspace registers successfully |
| `WORKSPACE_OFFLINE` | Heartbeat TTL expires or health sweep detects dead container |
| `WORKSPACE_DEGRADED` | Error rate exceeds threshold |
| `WORKSPACE_RECOVERED` | Error rate drops back to normal |
| `WORKSPACE_REMOVED` | Workspace deleted |
| `HEARTBEAT` | Periodic heartbeat from workspace |
| `A2A_RESPONSE` | Agent-to-agent message received |
| `AGENT_MESSAGE` | Agent pushes a message to the user |

Events flow through Redis pub/sub to ensure all platform instances broadcast consistently.
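
A client could fold these events into a per-workspace status map along the following lines. This is only a sketch: the message field names `event` and `workspace_id` are assumptions about the payload, and treating `WORKSPACE_RECOVERED` as a return to `online` is a simplification chosen for the example.

```python
import json

# Event names from the table above, mapped to the status they imply here.
STATUS_BY_EVENT = {
    "WORKSPACE_ONLINE": "online",
    "WORKSPACE_OFFLINE": "offline",
    "WORKSPACE_DEGRADED": "degraded",
    "WORKSPACE_RECOVERED": "online",  # assumption: recovery means back to online
}


def apply_event(statuses: dict, raw_message: str) -> dict:
    """Update a {workspace_id: status} map in place from one WebSocket message."""
    msg = json.loads(raw_message)
    event, ws_id = msg["event"], msg["workspace_id"]  # hypothetical field names
    if event == "WORKSPACE_REMOVED":
        statuses.pop(ws_id, None)
    elif event in STATUS_BY_EVENT:
        statuses[ws_id] = STATUS_BY_EVENT[event]
    # HEARTBEAT, A2A_RESPONSE and AGENT_MESSAGE leave the status unchanged.
    return statuses


statuses = {}
apply_event(statuses, '{"event": "WORKSPACE_ONLINE", "workspace_id": "ws-1"}')
apply_event(statuses, '{"event": "WORKSPACE_DEGRADED", "workspace_id": "ws-1"}')
print(statuses)  # prints {'ws-1': 'degraded'}
```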

## Structure Events

The `structure_events` table is an append-only audit log of every structural change in the platform. Each event is:

1. Inserted into the database via `broadcaster.RecordAndBroadcast()`
2. Published to Redis pub/sub
3. Relayed to WebSocket clients

Query events for a specific workspace or globally:

```
GET /events/:workspaceId # Workspace-specific
GET /events # All events
```

Both endpoints require `AdminAuth`.
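
The record-then-broadcast ordering described in this section can be illustrated with a toy in-memory model. This is not the platform's implementation: the `Broadcaster` class below is a hypothetical Python stand-in in which a list plays the role of the `structure_events` table and plain callbacks play the role of Redis pub/sub and the WebSocket relay.

```python
class Broadcaster:
    """Toy model of the record-then-broadcast flow; everything lives in memory."""

    def __init__(self):
        self.log = []          # stands in for the append-only structure_events table
        self.subscribers = []  # stands in for Redis pub/sub plus the WebSocket relay

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def record_and_broadcast(self, event: dict) -> int:
        self.log.append(dict(event))       # step 1: persist before anything else
        for callback in self.subscribers:  # steps 2-3: fan out to listeners
            callback(event)
        return len(self.log)               # number of events recorded so far


broadcaster = Broadcaster()
received = []
broadcaster.subscribe(received.append)
broadcaster.record_and_broadcast({"type": "WORKSPACE_REMOVED", "workspace_id": "ws-1"})
print(len(broadcaster.log), len(received))  # prints 1 1
```

Persisting before fanning out means an event that reaches a client is already in the audit log.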

## Session Search

Search through chat history for a workspace:

```
GET /workspaces/:id/session-search?q=deployment+error
```

This searches across both user-agent conversations and agent-to-agent A2A traffic stored in the activity logs.
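
Search terms containing spaces or reserved characters need URL encoding, as the `+` in the example request suggests. A minimal sketch using Python's standard library follows. The helper name `session_search_path` is made up for this example.

```python
from urllib.parse import quote, urlencode


def session_search_path(workspace_id: str, query: str) -> str:
    """Build the session-search path with a properly encoded `q` parameter."""
    encoded_query = urlencode({"q": query})  # spaces become '+', '&' becomes %26
    return f"/workspaces/{quote(workspace_id, safe='')}/session-search?{encoded_query}"


print(session_search_path("ws-123", "deployment error"))
# prints /workspaces/ws-123/session-search?q=deployment+error
```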

## Current Task Visibility

Each workspace reports its current task via heartbeat. This is visible in two places:

- **Canvas node** -- the workspace card on the canvas shows the current task text
- **Heartbeat data** -- `GET /registry/discover/:id` includes `current_task` in the workspace info

When `active_tasks` drops to zero, the current task field clears and the idle loop (if configured) begins its countdown.

## Schedule Run History

For workspaces with cron schedules, inspect past runs:

```
GET /workspaces/:id/schedules/:scheduleId/history
```

Each history entry includes:

- Execution timestamp
- Status (`success`, `failed`, `skipped`)
- Duration
- `error_detail` when the run failed (populated by `scheduler.fireSchedule`)

A status of `skipped` means the workspace was busy (active tasks > 0) when the schedule fired and the concurrency-aware scheduler chose not to queue the prompt.
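
To spot a schedule that is frequently skipped or failing, the history can be tallied by status. The sketch below assumes each entry is a JSON object with a `status` key and, for failures, an `error_detail` key. The exact key names and the sample data are assumptions, since the response schema is not shown here.

```python
from collections import Counter


def summarize_runs(history: list) -> dict:
    """Count runs per status and collect the error details of failed runs."""
    counts = Counter(entry["status"] for entry in history)
    failures = [entry.get("error_detail", "") for entry in history if entry["status"] == "failed"]
    return {"counts": dict(counts), "failures": failures}


# Hypothetical history entries.
history = [
    {"status": "success"},
    {"status": "skipped"},
    {"status": "failed", "error_detail": "workspace unreachable"},
    {"status": "skipped"},
]
summary = summarize_runs(history)
print(summary["counts"]["skipped"])  # prints 2
```

A high `skipped` count relative to `success` suggests the workspace is usually busy when the schedule fires, so the cron expression may need adjusting.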