---
title: Observability
description: Monitor agent activity, LLM traces, and platform health.
---
## Overview

Molecule AI provides multiple layers of observability -- from real-time WebSocket events on the canvas to structured activity logs, LLM traces, Prometheus metrics, and admin health endpoints.
## Activity Logs

Every significant action in the platform is recorded in the `activity_logs` table. Query logs for a specific workspace:

```
GET /workspaces/:id/activity
```

Activity types include:

- **A2A communications** -- request/response capture with duration and method
- **Task updates** -- agent-reported task status changes
- **Agent logs** -- structured log entries from workspace runtimes
- **Errors** -- failures with `error_detail` for debugging

Filter by source to separate user-agent chat (`source=canvas`) from agent-to-agent traffic (`source=agent`).

Activity logs are automatically cleaned up based on `ACTIVITY_RETENTION_DAYS` (default 7). The cleanup job runs every `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default 6).
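As an illustration, the source split can also be applied client-side when post-processing fetched entries. The entry shape below (`source`, `type` keys) is an assumed simplification, not the documented response schema:

```python
# Split fetched activity-log entries by source.
# NOTE: the {"source": ..., "type": ...} entry shape is an assumed
# example; consult the real /workspaces/:id/activity response schema.

def split_by_source(entries):
    """Separate canvas (user-agent) traffic from agent-to-agent traffic."""
    canvas = [e for e in entries if e.get("source") == "canvas"]
    agent = [e for e in entries if e.get("source") == "agent"]
    return canvas, agent

entries = [
    {"source": "canvas", "type": "task_update"},
    {"source": "agent", "type": "a2a_communication"},
    {"source": "agent", "type": "error"},
]
canvas, agent = split_by_source(entries)
print(len(canvas), len(agent))  # 1 2
```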
## LLM Traces

Molecule AI integrates with [Langfuse](https://langfuse.com) for LLM observability. Langfuse runs as part of the infrastructure stack on port 3001, backed by ClickHouse for efficient trace storage.

View traces for a specific workspace:

```
GET /workspaces/:id/traces
```

The Langfuse UI at `http://localhost:3001` provides:

- Token usage and cost tracking per workspace
- Latency breakdowns for LLM calls
- Prompt/completion pairs for debugging
- Trace timelines showing multi-step agent reasoning
## Prometheus Metrics

The platform exposes Prometheus-format metrics at:

```
GET /metrics
```

This endpoint requires no authentication and is safe to scrape. Metrics are in Prometheus text format (v0.0.4) and include:

- Request counts by method, path, and status code
- Request latency histograms
- Active WebSocket connections
- Workspace status counts

Configure your Prometheus instance to scrape `http://localhost:8080/metrics` at your preferred interval.
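A minimal `prometheus.yml` scrape job for this endpoint could look like the following sketch; the job name and the 15s interval are illustrative choices, not platform requirements:

```yaml
scrape_configs:
  - job_name: "molecule-platform"   # illustrative name
    scrape_interval: 15s            # pick your preferred interval
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8080"]
```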
## Per-Workspace Token Metrics

Track LLM token consumption per workspace -- input tokens, output tokens, and Anthropic prompt-cache reads/writes -- aggregated over two rolling windows:

```
GET /workspaces/:id/metrics
```

Requires a **workspace bearer token** (`Authorization: Bearer <token>`). Returns:

```json
{
  "workspace_id": "uuid",
  "token_metrics": {
    "1h": {
      "input_tokens": 1250,
      "output_tokens": 430,
      "cache_read_tokens": 800,
      "cache_write_tokens": 200
    },
    "30d": {
      "input_tokens": 84200,
      "output_tokens": 28100,
      "cache_read_tokens": 52000,
      "cache_write_tokens": 9400
    }
  }
}
```

| Field | Description |
|-------|-------------|
| `input_tokens` | Tokens in the prompt sent to the LLM (sum over window) |
| `output_tokens` | Tokens in the completion returned by the LLM |
| `cache_read_tokens` | Prompt tokens served from Anthropic's prompt cache |
| `cache_write_tokens` | Prompt tokens written into Anthropic's prompt cache |

The **canvas WorkspaceUsage panel** (⊞ icon → Usage tab) displays these same metrics live, updating each time the workspace reports a heartbeat.
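For example, a window from this payload can be post-processed into a prompt-cache hit ratio. The ratio is a derived convenience, not a field the endpoint returns, and it assumes `input_tokens` counts uncached prompt tokens:

```python
# Derive a prompt-cache hit ratio per window from the token_metrics
# payload. Assumption: input_tokens counts uncached prompt tokens, so
# the full prompt volume is cache_read_tokens + input_tokens.

def cache_hit_ratio(window):
    read = window["cache_read_tokens"]
    total_prompt = read + window["input_tokens"]
    return read / total_prompt if total_prompt else 0.0

token_metrics = {
    "1h": {"input_tokens": 1250, "output_tokens": 430,
           "cache_read_tokens": 800, "cache_write_tokens": 200},
    "30d": {"input_tokens": 84200, "output_tokens": 28100,
            "cache_read_tokens": 52000, "cache_write_tokens": 9400},
}
for name, window in token_metrics.items():
    print(name, round(cache_hit_ratio(window), 3))  # 1h 0.39 / 30d 0.382
```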
## Admin Liveness

The liveness endpoint reports the health of every supervised subsystem:

```
GET /admin/liveness
```

This endpoint requires `AdminAuth` (bearer token). It returns a `supervised.Snapshot()` for each subsystem with ages -- how long since each subsystem last reported healthy. Use this to debug stuck schedulers, stalled heartbeat goroutines, or unresponsive health sweeps before diving into logs.
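A monitoring script can flag subsystems whose last healthy report is too old. The flat `subsystem -> age in seconds` mapping below is an assumed simplification of the liveness payload, and the 60s threshold is an arbitrary example:

```python
# Flag supervised subsystems whose age (seconds since last healthy
# report) exceeds a threshold. The {"scheduler": 4.2, ...} shape is an
# assumed simplification of the /admin/liveness response.

def stale_subsystems(snapshot_ages, max_age_seconds=60.0):
    return sorted(name for name, age in snapshot_ages.items()
                  if age > max_age_seconds)

ages = {"scheduler": 4.2, "heartbeat": 310.0, "health_sweep": 12.5}
print(stale_subsystems(ages))  # ['heartbeat']
```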
## WebSocket Events

The canvas receives real-time updates via WebSocket at `/ws`. Every state change in the platform is broadcast to connected clients:

| Event | Trigger |
|-------|---------|
| `WORKSPACE_ONLINE` | Workspace registers successfully |
| `WORKSPACE_OFFLINE` | Heartbeat TTL expires or health sweep detects dead container |
| `WORKSPACE_DEGRADED` | Error rate exceeds threshold |
| `WORKSPACE_RECOVERED` | Error rate drops back to normal |
| `WORKSPACE_REMOVED` | Workspace deleted |
| `HEARTBEAT` | Periodic heartbeat from workspace |
| `A2A_RESPONSE` | Agent-to-agent message received |
| `AGENT_MESSAGE` | Agent pushes a message to the user |

Events flow through Redis pub/sub to ensure all platform instances broadcast consistently.
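A custom `/ws` client would typically dispatch on the event types in the table above. The `{"type": ..., "payload": ...}` message envelope here is an assumed shape for illustration:

```python
# Minimal client-side dispatch over the event types listed above.
# Assumption: each message is a {"type": ..., "payload": ...} envelope;
# unhandled types are silently ignored.

def dispatch(event, handlers):
    handler = handlers.get(event["type"], lambda payload: None)
    return handler(event.get("payload"))

seen = []
handlers = {
    "HEARTBEAT": lambda p: seen.append(("beat", p["id"])),
    "WORKSPACE_OFFLINE": lambda p: seen.append(("offline", p["id"])),
}
dispatch({"type": "HEARTBEAT", "payload": {"id": "ws-1"}}, handlers)
dispatch({"type": "WORKSPACE_OFFLINE", "payload": {"id": "ws-1"}}, handlers)
dispatch({"type": "AGENT_MESSAGE", "payload": {"id": "ws-1"}}, handlers)
print(seen)  # [('beat', 'ws-1'), ('offline', 'ws-1')]
```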
## Structure Events

The `structure_events` table is an append-only audit log of every structural change in the platform. Each event is:

1. Inserted into the database via `broadcaster.RecordAndBroadcast()`
2. Published to Redis pub/sub
3. Relayed to WebSocket clients
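The record-then-fan-out pattern above can be sketched with in-memory stand-ins for the database and Redis; only the ordering of the steps is taken from this page, the real implementation lives in `broadcaster.RecordAndBroadcast()`:

```python
# In-memory sketch of record -> publish -> relay. The database and the
# Redis channel are replaced by plain Python lists; the key property is
# that the event is persisted before any client sees it.

class Broadcaster:
    def __init__(self):
        self.db = []            # stand-in for the structure_events table
        self.subscribers = []   # stand-ins for WebSocket clients

    def record_and_broadcast(self, event):
        self.db.append(event)             # 1. persist first (audit log)
        for deliver in self.subscribers:  # 2-3. fan out to subscribers
            deliver(event)

received = []
b = Broadcaster()
b.subscribers.append(received.append)
b.record_and_broadcast({"type": "WORKSPACE_REMOVED", "id": "ws-1"})
print(len(b.db), len(received))  # 1 1
```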
Query events for a specific workspace or globally:
|
|
|
|
```
|
|
GET /events/:workspaceId # Workspace-specific
|
|
GET /events # All events
|
|
```
|
|
|
|
Both endpoints require `AdminAuth`.
|
|
|
|
## Session Search

Search through chat history for a workspace:

```
GET /workspaces/:id/session-search?q=deployment+error
```

This searches across both user-agent conversations and agent-to-agent A2A traffic stored in the activity logs.
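Since the search term travels in the `q` query parameter, it should be URL-encoded. A small helper, with the base URL and workspace id as placeholder values:

```python
# Build a session-search URL with a properly encoded query string.
# "http://localhost:8080" and "ws-1" are placeholder values.
from urllib.parse import urlencode

def session_search_url(base, workspace_id, query):
    qs = urlencode({"q": query})  # encodes the space as "+"
    return f"{base}/workspaces/{workspace_id}/session-search?{qs}"

url = session_search_url("http://localhost:8080", "ws-1", "deployment error")
print(url)  # http://localhost:8080/workspaces/ws-1/session-search?q=deployment+error
```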
## Current Task Visibility

Each workspace reports its current task via heartbeat. This is visible in two places:

- **Canvas node** -- the workspace card on the canvas shows the current task text
- **Heartbeat data** -- `GET /registry/discover/:id` includes `current_task` in the workspace info

When `active_tasks` drops to zero, the current task field clears and the idle loop (if configured) begins its countdown.
## Schedule Run History

For workspaces with cron schedules, inspect past runs:

```
GET /workspaces/:id/schedules/:scheduleId/history
```

Each history entry includes:

- Execution timestamp
- Status (`success`, `failed`, `skipped`)
- Duration
- `error_detail` when the run failed (populated by `scheduler.fireSchedule`)

A status of `skipped` means the workspace was busy (active tasks > 0) when the schedule fired and the concurrency-aware scheduler chose not to queue the prompt.