# Architecture Overview

Molecule AI is a platform for orchestrating AI agent workspaces that form an organizational hierarchy. Workspaces register with a central platform, communicate via the A2A protocol, and are visualized on a drag-and-drop canvas.

## System Diagram

```
Canvas (Next.js :3000) ←WebSocket→ Platform (Go :8080) ←HTTP→ Postgres + Redis
                                        ↑
        Workspace A ←──A2A──→ Workspace B
        (Python agents)
          ↑ register/heartbeat ↑
          └────── Platform ────┘
```

## Main Components

- **Workspace Server** (`workspace-server/`): Go/Gin control plane — workspace CRUD, registry, discovery, WebSocket hub, liveness monitoring.
- **Canvas** (`canvas/`): Next.js 15 + React Flow (@xyflow/react v12) + Zustand + Tailwind — visual workspace graph.
- **Workspace Runtime** (`workspace/`): Shared runtime published as [`molecule-ai-workspace-runtime`](https://pypi.org/project/molecule-ai-workspace-runtime/) on PyPI. Supports LangGraph, Claude Code, OpenClaw, DeepAgents, CrewAI, AutoGen. Each adapter lives in its own standalone template repo (e.g. `molecule-ai-workspace-template-claude-code`). See `docs/workspace-runtime-package.md` for the full picture.
- **molecli** (`workspace-server/cmd/cli/`): Go TUI dashboard (Bubbletea + Lipgloss) — real-time workspace monitoring, event log, health overview, delete/filter operations.
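The register/heartbeat arrows in the diagram can be sketched as a minimal workspace-side client. This is an illustrative sketch only: the payload fields and the heartbeat endpoint path are assumptions, not the real registry schema; only `/registry/register` and the 60-second heartbeat TTL are documented here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// registration is an illustrative payload shape; the real schema lives
// in the workspace-server registry handlers.
type registration struct {
	WorkspaceID string `json:"workspace_id"`
	AgentURL    string `json:"agent_url"`
}

// registerBody builds the JSON body a workspace would POST to /registry/register.
func registerBody(id, agentURL string) ([]byte, error) {
	return json.Marshal(registration{WorkspaceID: id, AgentURL: agentURL})
}

// heartbeat POSTs to a hypothetical heartbeat endpoint until stop is closed.
// The platform's Redis heartbeat key expires after 60s, so the interval
// must stay well below that.
func heartbeat(platformURL, id string, interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			body, _ := registerBody(id, "")
			resp, err := http.Post(platformURL+"/registry/heartbeat", "application/json", bytes.NewReader(body))
			if err != nil {
				fmt.Println("heartbeat failed:", err)
				continue
			}
			resp.Body.Close()
		}
	}
}

func main() {
	body, _ := registerBody("ws-a", "http://workspace-a:9000")
	// → {"workspace_id":"ws-a","agent_url":"http://workspace-a:9000"}
	fmt.Println(string(body))
}
```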
+ +## Key Architectural Patterns + +### Import Cycle Prevention + +The platform uses function injection to avoid Go import cycles between `ws`, `registry`, and `events` packages: + +- `ws.NewHub(canCommunicate AccessChecker)` — Hub accepts `registry.CanCommunicate` as a function parameter. +- `registry.StartLivenessMonitor(ctx, onOffline OfflineHandler)` — Liveness accepts a broadcaster callback. +- `registry.StartHealthSweep(ctx, checker ContainerChecker, interval, onOffline)` — Health sweep accepts a Docker checker interface. + +Wiring happens in `workspace-server/cmd/server/main.go` — init order: `wh → onWorkspaceOffline → liveness/healthSweep → router`. + +### Container Health Detection + +Three layers detect dead containers (e.g. Docker Desktop crash): + +1. **Passive (Redis TTL):** 60s heartbeat key expires → liveness monitor → auto-restart. +2. **Proactive (Health Sweep):** `registry.StartHealthSweep` polls Docker API every 15s — catches dead containers faster than TTL expiry. +3. **Reactive (A2A Proxy):** On connection error, checks `provisioner.IsRunning()` → immediate offline + restart. + +All three call `onWorkspaceOffline`, which broadcasts `WORKSPACE_OFFLINE` and calls `go wh.RestartByID()`. Redis cleanup uses the shared `db.ClearWorkspaceKeys()` helper. + +### Template Resolution (Workspace Create) + +Runtime detection happens **before** the DB insert: if `payload.Runtime` is empty and a template is specified, the handler reads `runtime:` from `configsDir/template/config.yaml` first. If still empty, it defaults to `"langgraph"`. This ensures the correct runtime (e.g. `claude-code`) is persisted in the DB and used for container image selection. + +When the requested template does not exist, the Create handler falls back in order: + +1. Check `os.Stat(configsDir/template)` — use if exists. +2. Try `{runtime}-default` template (e.g. `claude-code-default/`). +3. 
Generate a default config via `ensureDefaultConfig()` (includes `.auth-token` copy for CLI runtimes). + +### Communication Rules (`registry/access.go`) + +`CanCommunicate(callerID, targetID)` determines whether two workspaces may communicate: + +- Same workspace → allowed +- Siblings (same `parent_id`) → allowed +- Root-level siblings (both `parent_id IS NULL`) → allowed +- Parent ↔ child → allowed +- Everything else → denied + +The A2A proxy (`POST /workspaces/:id/a2a`) enforces this for agent-to-agent calls. Canvas requests (no `X-Workspace-ID` header), self-calls, and system callers (`webhook:*`, `system:*`, `test:*` prefixes via `isSystemCaller()` in `a2a_proxy.go`) bypass the check. + +### Handler Decomposition + +Large handler functions are split into focused private helpers to keep individual functions under ~60 lines. The decomposition pattern used across the codebase: + +- `a2a_proxy.go::proxyA2ARequest` — helpers: `resolveAgentURL`, `normalizeA2APayload`, `dispatchA2A`, `handleA2ADispatchError`, `maybeMarkContainerDead`, `logA2AFailure`, `logA2ASuccess`; sentinel `proxyDispatchBuildError`. +- `delegation.go::Delegate` — helpers: `bindDelegateRequest`, `lookupIdempotentDelegation`, `insertDelegationRow`; typed `insertDelegationOutcome` enum replaces a `(bool, bool)` positional return. +- `discovery.go::Discover` — helpers: `discoverWorkspacePeer`, `writeExternalWorkspaceURL`, `discoverHostPeer`. +- `activity.go::SessionSearch` — helpers: `parseSessionSearchParams`, `buildSessionSearchQuery`, `scanSessionSearchRows`. + +When modifying any of these handlers, prefer extending the helper rather than inlining logic back into the top-level function. + +### JSONB Gotcha + +When inserting Go `[]byte` (from `json.Marshal`) into Postgres JSONB columns, you must: + +1. Convert to `string()` first. +2. Use a `::jsonb` cast in the SQL statement. 
+ +`lib/pq` treats `[]byte` as `bytea`, not JSONB, so skipping either step silently stores binary data instead of a JSON value. + +### WebSocket Events Flow + +1. An action occurs (register, heartbeat, config change, etc.). +2. `broadcaster.RecordAndBroadcast()` inserts a row into the `structure_events` table and publishes to Redis pub/sub. +3. The Redis subscriber relays the message to the WebSocket hub. +4. The hub broadcasts to canvas clients (all events) and workspace clients (filtered by `CanCommunicate`). + +### Canvas State Management + +- **Initial load:** HTTP fetch from `GET /workspaces` → Zustand hydrate. +- **Real-time updates:** WebSocket events → `applyEvent()` in the Zustand store. +- **Position persistence:** `onNodeDragStop` → `PATCH /workspaces/:id` with `{x, y}`. +- **Embedded sub-workspaces:** `nestNode` sets `hidden: !!targetId` on child nodes; children render as recursive `TeamMemberChip` components inside the parent (up to 3 levels), not as separate canvas nodes. Use `n.data.parentId` (not React Flow's `n.parentId`) for hierarchy lookups. +- **Chat:** two sub-tabs — "My Chat" (user↔agent, `source=canvas`) and "Agent Comms" (agent↔agent A2A traffic, `source=agent`). History loaded from `GET /activity` with source filter. Real-time via `A2A_RESPONSE` + `AGENT_MESSAGE` WebSocket events. Conversation history (last 20 messages) sent via `params.metadata.history` in A2A `message/send` requests. +- **Config save:** "Save & Restart" writes `config.yaml` and auto-restarts the workspace. "Save" writes only (shows a restart banner). Secrets POST/DELETE auto-restart on the platform side. + +### Initial Prompt + +Agents can auto-execute a prompt on startup before any user interaction. Configure via `initial_prompt` (inline string) or `initial_prompt_file` (path relative to config dir) in `config.yaml`. After the A2A server is ready, `main.py` sends the prompt as a `message/send` to self. 
A `.initial_prompt_done` marker file prevents re-execution on restart. Org templates support `initial_prompt` on both `defaults` (applies to all agents) and per-workspace (overrides the default). + +**Important:** Initial prompts must not send A2A messages (`delegate_task`, `send_message_to_user`) because other agents may not yet be ready. Keep them local: clone repos, read docs, save to memory, wait for tasks. + +### Idle Loop + +Opt-in pattern: when `idle_prompt` is non-empty in `config.yaml`, the workspace self-sends it every `idle_interval_seconds` (default 600) **while `heartbeat.active_tasks == 0`**. The idle check is local (no LLM call) and the prompt only fires when there is genuinely nothing to do. Set per-workspace or as a per-org default in `org.yaml`. The fire timeout clamps to `max(60, min(300, idle_interval_seconds))`. Both the idle loop and `initial_prompt` self-posts include `auth_headers()` so they work in multi-tenant mode. + +### Admin Auth Middleware Variants + +Three Gin middleware classes gate server-side routes. Full contract in `docs/runbooks/admin-auth.md`. + +- **`middleware.AdminAuth(db.DB)`** — strict bearer-only. Used for any route where a forged request could leak prompts/memory, create/mutate workspaces, or leak ops intel. Lazy-bootstrap fail-open when `HasAnyLiveTokenGlobal` returns 0. +- **`middleware.CanvasOrBearer(db.DB)`** — accepts a bearer token OR an Origin matching `CORS_ORIGINS`. Used **only** for cosmetic routes where a forged request has zero data/security impact. Currently only on `PUT /canvas/viewport`. Do not extend this to any route that leaks data or creates resources — see the runbook. +- **`middleware.WorkspaceAuth(db.DB)`** — binds a bearer token to `:id`. Workspace A's token cannot hit workspace B's sub-routes. Used for the entire `/workspaces/:id/*` group except the A2A proxy (which has its own `CanCommunicate` layer). 
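The `CanCommunicate` rules from `registry/access.go` described earlier reduce to a small pure function. This is a sketch over an assumed in-memory shape; the real check reads `parent_id` from Postgres, and the system-caller bypass lives in the A2A proxy, not here:

```go
package main

import "fmt"

// workspace is an illustrative in-memory shape; the real implementation
// queries parent_id from Postgres.
type workspace struct {
	ID       string
	ParentID *string // nil means root-level
}

// canCommunicate mirrors the documented rules: self, parent<->child,
// same-parent siblings, and root-level siblings are allowed.
func canCommunicate(caller, target workspace) bool {
	if caller.ID == target.ID {
		return true // same workspace
	}
	// Parent <-> child, in either direction.
	if caller.ParentID != nil && *caller.ParentID == target.ID {
		return true
	}
	if target.ParentID != nil && *target.ParentID == caller.ID {
		return true
	}
	// Root-level siblings: both parent_id IS NULL.
	if caller.ParentID == nil && target.ParentID == nil {
		return true
	}
	// Siblings: same non-nil parent_id.
	if caller.ParentID != nil && target.ParentID != nil && *caller.ParentID == *target.ParentID {
		return true
	}
	return false
}

func main() {
	p := "root"
	fmt.Println(canCommunicate(workspace{"a", &p}, workspace{"b", &p}))      // siblings → true
	fmt.Println(canCommunicate(workspace{"root", nil}, workspace{"a", &p})) // parent → child → true
}
```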
+ +### Migration Runner (`workspace-server/internal/db/postgres.go`) + +`RunMigrations` globs `*.sql` in `migrationsDir`, filters out `.down.sql` files, sorts alphabetically, then `DB.Exec()`s each file on boot. The filter is load-bearing: without it, alphabetical sort places `.down.sql` before `.up.sql` (since "d" sorts before "u"), which would wipe tables like `workspace_auth_tokens` on every boot. All `.up.sql` files must be **idempotent** (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`) because the runner re-applies every migration on every startup. + +### Workspace Lifecycle + +``` +provisioning → online → degraded → online → offline → (auto-restart) → provisioning → ... → removed + ↑ ↑ + └──────────────────────────── paused ◄──────── any state ──────────────────────────────┘ + │ + └── (user resumes) → provisioning +``` + +State transitions: + +- `provisioning` → `online`: workspace registers via `/registry/register`. +- `online` → `degraded`: error rate exceeds 0.5. +- `degraded` → `online`: error rate recovers. +- `online`/`degraded` → `offline`: Redis TTL expires OR the health sweep detects a dead container. +- `offline` → `provisioning`: auto-restart fires. +- Any state → `paused`: user pauses the workspace (container is stopped). +- `paused` → `provisioning`: user resumes. +- Any state → `removed`: workspace is deleted. + +Paused workspaces are excluded from the health sweep, liveness monitor, and auto-restart. + +**Restart context message:** After any restart and successful re-registration, the platform sends a synthetic A2A `message/send` to the workspace with `metadata.kind=restart_context`. The body contains the restart timestamp, previous session end time + duration, and the env-var keys (keys only, never values) now available in the container. The sender uses the `system:restart-context` caller prefix, which bypasses `CanCommunicate` via `isSystemCaller()`. 
If the workspace does not re-register within 30 seconds, the message is dropped (logged). Handler: `workspace-server/internal/handlers/restart_context.go`.
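The lifecycle transitions above can be condensed into a small table-driven sketch. The event names are assumptions introduced here for illustration; in the real platform these transitions are driven by the registry, liveness monitor, health sweep, and user actions rather than a single function:

```go
package main

import "fmt"

// transition applies one lifecycle event to a state, returning the new
// state and whether the event was applicable. Event names are illustrative.
func transition(state, event string) (string, bool) {
	switch event {
	case "register": // workspace hits /registry/register
		if state == "provisioning" {
			return "online", true
		}
	case "error_rate_high": // error rate exceeds 0.5
		if state == "online" {
			return "degraded", true
		}
	case "error_rate_recovered":
		if state == "degraded" {
			return "online", true
		}
	case "ttl_expired", "container_dead": // Redis TTL or health sweep
		if state == "online" || state == "degraded" {
			return "offline", true
		}
	case "auto_restart":
		if state == "offline" {
			return "provisioning", true
		}
	case "pause": // user pauses; container stopped
		if state != "removed" {
			return "paused", true
		}
	case "resume":
		if state == "paused" {
			return "provisioning", true
		}
	case "remove":
		return "removed", true
	}
	return state, false // paused workspaces ignore liveness events, etc.
}

func main() {
	s := "provisioning"
	for _, ev := range []string{"register", "error_rate_high", "ttl_expired", "auto_restart"} {
		s, _ = transition(s, ev)
	}
	fmt.Println(s) // → provisioning (back around the restart loop)
}
```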