Cross-checked every runtime list against the canonical AllRuntimes
allowlist in workspace-server/internal/handlers/admin_workspace_images.go
(claude-code, langgraph, crewai, autogen, deepagents, hermes, gemini-cli,
openclaw). Several lists had drifted back to an old 6-runtime snapshot, missing
hermes + gemini-cli; one file referenced a non-existent NemoClaw branch
plus NVIDIA T4 hardware that has no source presence anywhere in the
monorepo (verified: no `feat/nemoclaw-t4-docker` branch, no `nemoclaw`
references outside docs).
Files changed:
content/docs/agent-runtime/workspace-runtime.md
- Before: "Current `main` ships six adapters" listing langgraph,
deepagents, claude-code, crewai, autogen, openclaw + a
"Branch-level experiments such as NemoClaw" sentence.
- After: "Current `main` ships eight adapters" — added hermes +
gemini-cli; removed NemoClaw line; pointed readers at AllRuntimes
as the source of truth and external-workspace path for BYO.
content/docs/architecture/molecule-technical-doc.md (two locations)
- Before (runtime table ~line 575): 6 rows + "Branch-level WIP:
NemoClaw (NVIDIA T4 + Docker socket) on feat/nemoclaw-t4-docker".
- After: 8 rows including Hermes + Gemini CLI; NemoClaw line
deleted; AllRuntimes pointer added.
- Before (Branch-Level Work table ~line 981): row
"feat/nemoclaw-t4-docker | NemoClaw adapter (NVIDIA T4 support) | WIP".
- After: row removed (the branch does not exist).
content/docs/architecture/overview.md
- Before: "Supports LangGraph, Claude Code, OpenClaw, DeepAgents,
CrewAI, AutoGen."
- After: "Supports Claude Code, LangGraph, CrewAI, AutoGen, DeepAgents,
Hermes, Gemini CLI, and OpenClaw."
content/docs/glossary.md
- Before: runtime defined as "one of langgraph, claude-code, openclaw,
crewai, autogen, deepagents, hermes" (7, missing gemini-cli).
- After: full 8-entry list ordered to match AllRuntimes.
Out-of-scope sweep results (no changes needed):
- "NeMo Guardrails" — zero references in molecule-docs (the
"guardrails" word elsewhere refers to generic plugin/role guardrails,
not the NVIDIA NeMo Guardrails product). Nothing to remove.
- "Nemotron" — only present in the monorepo as a NIM model alias
inside a model-dispatcher test; no first-class model claim in docs.
- Model lists in docs match workspace/agent.py (Anthropic, OpenAI,
Gemini, MiniMax, DeepSeek, Moonshot, local Ollama/vLLM); no
Nemotron claims to remove.
- Observability sinks (OpenTelemetry, Langfuse, Sentry, Prometheus
references) match implementation; no NeMo Guardrails sink claimed.
- index.mdx, concepts.mdx, architecture.mdx already list the full 8
runtimes; no edit needed there.
Verified via `npm run build` — all 111 static pages regenerate clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architecture Overview
Molecule AI is a platform for orchestrating AI agent workspaces that form an organizational hierarchy. Workspaces register with a central platform, communicate via A2A protocol, and are visualized on a drag-and-drop canvas.
System Diagram
```
Canvas (Next.js :3000) ←WebSocket→ Platform (Go :8080) ←HTTP→ Postgres + Redis
                                        ↑ register/heartbeat ↑
                                        │                    │
                                   Workspace A ←──A2A──→ Workspace B
                                   (Python agents)
```
Main Components
- Workspace Server (`workspace-server/`): Go/Gin control plane — workspace CRUD, registry, discovery, WebSocket hub, liveness monitoring.
- Canvas (`canvas/`): Next.js 15 + React Flow (`@xyflow/react` v12) + Zustand + Tailwind — visual workspace graph.
- Workspace Runtime (`workspace/`): Shared runtime published as `molecule-ai-workspace-runtime` on PyPI. Supports Claude Code, LangGraph, CrewAI, AutoGen, DeepAgents, Hermes, Gemini CLI, and OpenClaw. Each adapter lives in its own standalone template repo (e.g. `molecule-ai-workspace-template-claude-code`). See `docs/workspace-runtime-package.md` for the full picture.
- molecli (`workspace-server/cmd/cli/`): Go TUI dashboard (Bubbletea + Lipgloss) — real-time workspace monitoring, event log, health overview, delete/filter operations.
Key Architectural Patterns
Import Cycle Prevention
The platform uses function injection to avoid Go import cycles between the `ws`, `registry`, and `events` packages:

- `ws.NewHub(canCommunicate AccessChecker)` — Hub accepts `registry.CanCommunicate` as a function parameter.
- `registry.StartLivenessMonitor(ctx, onOffline OfflineHandler)` — Liveness accepts a broadcaster callback.
- `registry.StartHealthSweep(ctx, checker ContainerChecker, interval, onOffline)` — Health sweep accepts a Docker checker interface.

Wiring happens in `workspace-server/cmd/server/main.go` — init order: `wh` → `onWorkspaceOffline` → `liveness`/`healthSweep` → router.
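A minimal sketch of the injection seam, with illustrative signatures (the real `Hub` carries connection state this sketch omits):

```go
// Package ws declares the capability it needs as a function type,
// so it never imports registry: the ws -> registry -> ws cycle disappears.
package ws

// AccessChecker reports whether callerID may communicate with targetID.
type AccessChecker func(callerID, targetID string) bool

// Hub stores the injected checker instead of a registry dependency.
type Hub struct {
	canCommunicate AccessChecker
}

func NewHub(canCommunicate AccessChecker) *Hub {
	return &Hub{canCommunicate: canCommunicate}
}
```

In `main.go` the concrete function is then passed in as `ws.NewHub(registry.CanCommunicate)`, so only the binary's entry point knows about both packages.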
Container Health Detection
Three layers detect dead containers (e.g. Docker Desktop crash):

- Passive (Redis TTL): 60s heartbeat key expires → liveness monitor → auto-restart.
- Proactive (Health Sweep): `registry.StartHealthSweep` polls the Docker API every 15s — catches dead containers faster than TTL expiry.
- Reactive (A2A Proxy): On connection error, checks `provisioner.IsRunning()` → immediate offline + restart.

All three call `onWorkspaceOffline`, which broadcasts `WORKSPACE_OFFLINE` and calls `go wh.RestartByID()`. Redis cleanup uses the shared `db.ClearWorkspaceKeys()` helper.
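A sketch of the proactive layer under the signature named above; the `onlineIDs` lookup is an assumption added so the example stays self-contained:

```go
package registry

import (
	"context"
	"time"
)

// ContainerChecker abstracts the Docker API query.
type ContainerChecker interface {
	IsRunning(workspaceID string) bool
}

// OfflineHandler is the broadcaster callback (onWorkspaceOffline).
type OfflineHandler func(workspaceID string)

// onlineIDs stands in for the registry's real store of online workspaces.
var onlineIDs func() []string

// StartHealthSweep polls the Docker API every interval, catching dead
// containers before their 60s Redis TTL expires.
func StartHealthSweep(ctx context.Context, checker ContainerChecker, interval time.Duration, onOffline OfflineHandler) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				for _, id := range onlineIDs() {
					if !checker.IsRunning(id) {
						onOffline(id) // broadcasts WORKSPACE_OFFLINE, triggers restart
					}
				}
			}
		}
	}()
}
```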
Template Resolution (Workspace Create)
Runtime detection happens before the DB insert: if `payload.Runtime` is empty and a template is specified, the handler reads `runtime:` from `configsDir/template/config.yaml` first. If still empty, it defaults to `"langgraph"`. This ensures the correct runtime (e.g. `claude-code`) is persisted in the DB and used for container image selection.
When the requested template does not exist, the Create handler falls back in order:
- Check `os.Stat(configsDir/template)` — use it if it exists.
- Try the `{runtime}-default` template (e.g. `claude-code-default/`).
- Generate a default config via `ensureDefaultConfig()` (includes `.auth-token` copy for CLI runtimes).
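Condensed as a hypothetical helper (the three steps and the `ensureDefaultConfig()` name come from the doc; the function shape and error sentinel are assumptions):

```go
package handlers

import (
	"errors"
	"os"
	"path/filepath"
)

var errNeedDefaultConfig = errors.New("no template found: generate default config")

// resolveTemplateDir walks the fallback chain for a workspace create.
func resolveTemplateDir(configsDir, template, runtime string) (string, error) {
	// 1. Use the requested template directory if it exists on disk.
	if template != "" {
		dir := filepath.Join(configsDir, template)
		if _, err := os.Stat(dir); err == nil {
			return dir, nil
		}
	}
	// 2. Fall back to the runtime's default template, e.g. claude-code-default/.
	dir := filepath.Join(configsDir, runtime+"-default")
	if _, err := os.Stat(dir); err == nil {
		return dir, nil
	}
	// 3. Signal the caller to run ensureDefaultConfig(), which also
	//    copies .auth-token for CLI runtimes.
	return "", errNeedDefaultConfig
}
```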
Communication Rules (registry/access.go)
`CanCommunicate(callerID, targetID)` determines whether two workspaces may communicate:

- Same workspace → allowed
- Siblings (same `parent_id`) → allowed
- Root-level siblings (both `parent_id IS NULL`) → allowed
- Parent ↔ child → allowed
- Everything else → denied

The A2A proxy (`POST /workspaces/:id/a2a`) enforces this for agent-to-agent calls. Canvas requests (no `X-Workspace-ID` header), self-calls, and system callers (`webhook:*`, `system:*`, `test:*` prefixes via `isSystemCaller()` in `a2a_proxy.go`) bypass the check.
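The rule set condenses to a few comparisons. A sketch assuming a `parentOf` lookup that returns the empty string for root workspaces:

```go
package registry

// parentOf stands in for the real parent_id lookup; it returns ""
// for a root workspace.
var parentOf func(workspaceID string) string

// CanCommunicate applies the sibling/parent-child policy listed above.
func CanCommunicate(callerID, targetID string) bool {
	if callerID == targetID {
		return true // same workspace
	}
	callerParent, targetParent := parentOf(callerID), parentOf(targetID)
	if callerParent == targetParent {
		return true // siblings, including two roots (both parents empty)
	}
	if callerParent == targetID || targetParent == callerID {
		return true // parent <-> child, either direction
	}
	return false // everything else is denied
}
```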
Handler Decomposition
Large handler functions are split into focused private helpers to keep individual functions under ~60 lines. The decomposition pattern used across the codebase:
- `a2a_proxy.go::proxyA2ARequest` — helpers: `resolveAgentURL`, `normalizeA2APayload`, `dispatchA2A`, `handleA2ADispatchError`, `maybeMarkContainerDead`, `logA2AFailure`, `logA2ASuccess`; sentinel `proxyDispatchBuildError`.
- `delegation.go::Delegate` — helpers: `bindDelegateRequest`, `lookupIdempotentDelegation`, `insertDelegationRow`; a typed `insertDelegationOutcome` enum replaces a `(bool, bool)` positional return.
- `discovery.go::Discover` — helpers: `discoverWorkspacePeer`, `writeExternalWorkspaceURL`, `discoverHostPeer`.
- `activity.go::SessionSearch` — helpers: `parseSessionSearchParams`, `buildSessionSearchQuery`, `scanSessionSearchRows`.
When modifying any of these handlers, prefer extending the helper rather than inlining logic back into the top-level function.
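As an example of the enum swap, a sketch of `insertDelegationOutcome` (only the type name comes from the codebase; the constant names are assumptions):

```go
package handlers

// insertDelegationOutcome replaces an opaque (bool, bool) return from
// the row-insert helper, so call sites branch on named states.
type insertDelegationOutcome int

const (
	delegationInserted  insertDelegationOutcome = iota // new row written
	delegationDuplicate                                // idempotency hit: row already existed
	delegationFailed                                   // insert error
)
```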
JSONB Gotcha
When inserting Go `[]byte` (from `json.Marshal`) into Postgres JSONB columns, you must:

- Convert to `string()` first.
- Use a `::jsonb` cast in the SQL statement.

lib/pq treats `[]byte` as `bytea`, not JSONB, so skipping either step silently stores binary data instead of a JSON value.
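Both steps together, in a sketch (the column names on the `structure_events` table are assumptions):

```go
package db

import (
	"database/sql"
	"encoding/json"

	_ "github.com/lib/pq" // registers the postgres driver
)

// insertPayload shows both required steps for JSONB inserts.
func insertPayload(conn *sql.DB, id string, v any) error {
	raw, err := json.Marshal(v)
	if err != nil {
		return err
	}
	// Step 1: string(raw) makes lib/pq send text instead of bytea.
	// Step 2: $2::jsonb makes Postgres parse that text as JSON.
	_, err = conn.Exec(
		`INSERT INTO structure_events (id, payload) VALUES ($1, $2::jsonb)`,
		id, string(raw),
	)
	return err
}
```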
WebSocket Events Flow
- An action occurs (register, heartbeat, config change, etc.).
- `broadcaster.RecordAndBroadcast()` inserts a row into the `structure_events` table and publishes to Redis pub/sub.
- The Redis subscriber relays the message to the WebSocket hub.
- The hub broadcasts to canvas clients (all events) and workspace clients (filtered by `CanCommunicate`).
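A sketch of the relay step, assuming the go-redis v9 client and a pub/sub channel named `structure_events` (the doc confirms neither):

```go
package events

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Hub is the minimal surface the relay needs; the real ws.Hub has more.
type Hub interface {
	Broadcast(payload []byte)
}

// RelayStructureEvents subscribes to the pub/sub channel and forwards
// every message to the WebSocket hub for fan-out.
func RelayStructureEvents(ctx context.Context, rdb *redis.Client, hub Hub) {
	sub := rdb.Subscribe(ctx, "structure_events")
	go func() {
		defer sub.Close()
		for msg := range sub.Channel() {
			hub.Broadcast([]byte(msg.Payload))
		}
	}()
}
```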
Canvas State Management
- Initial load: HTTP fetch from `GET /workspaces` → Zustand hydrate.
- Real-time updates: WebSocket events → `applyEvent()` in the Zustand store.
- Position persistence: `onNodeDragStop` → `PATCH /workspaces/:id` with `{x, y}`.
- Embedded sub-workspaces: `nestNode` sets `hidden: !!targetId` on child nodes; children render as recursive `TeamMemberChip` components inside the parent (up to 3 levels), not as separate canvas nodes. Use `n.data.parentId` (not React Flow's `n.parentId`) for hierarchy lookups.
- Chat: two sub-tabs — "My Chat" (user↔agent, `source=canvas`) and "Agent Comms" (agent↔agent A2A traffic, `source=agent`). History loaded from `GET /activity` with a source filter. Real-time via `A2A_RESPONSE` + `AGENT_MESSAGE` WebSocket events. Conversation history (last 20 messages) sent via `params.metadata.history` in A2A `message/send` requests.
- Config save: "Save & Restart" writes `config.yaml` and auto-restarts the workspace. "Save" writes only (shows a restart banner). Secrets POST/DELETE auto-restart on the platform side.
Initial Prompt
Agents can auto-execute a prompt on startup before any user interaction. Configure via `initial_prompt` (inline string) or `initial_prompt_file` (path relative to the config dir) in `config.yaml`. After the A2A server is ready, `main.py` sends the prompt as a `message/send` to self. A `.initial_prompt_done` marker file prevents re-execution on restart. Org templates support `initial_prompt` on both defaults (applies to all agents) and per-workspace (overrides the default).
Important: Initial prompts must not send A2A messages (`delegate_task`, `send_message_to_user`) because other agents may not yet be ready. Keep them local: clone repos, read docs, save to memory, wait for tasks.
Idle Loop
Opt-in pattern: when `idle_prompt` is non-empty in `config.yaml`, the workspace self-sends it every `idle_interval_seconds` (default 600) while `heartbeat.active_tasks == 0`. The idle check is local (no LLM call) and the prompt only fires when there is genuinely nothing to do. Set per-workspace or as a per-org default in `org.yaml`. The fire timeout clamps to `max(60, min(300, idle_interval_seconds))`. Both the idle loop and `initial_prompt` self-posts include `auth_headers()` so they work in multi-tenant mode.
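The clamp is easiest to read as code. A Go rendering for illustration only (the runtime itself is Python, so the function is hypothetical):

```go
package workspace

import "time"

// fireTimeout clamps to max(60, min(300, idle_interval_seconds)):
// never wait less than 60s or more than 300s to fire the idle prompt.
func fireTimeout(idleIntervalSeconds int) time.Duration {
	s := idleIntervalSeconds
	if s > 300 {
		s = 300
	}
	if s < 60 {
		s = 60
	}
	return time.Duration(s) * time.Second
}
```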
Admin Auth Middleware Variants
Three Gin middleware variants gate server-side routes. Full contract in `docs/runbooks/admin-auth.md`.

- `middleware.AdminAuth(db.DB)` — strict bearer-only. Used for any route where a forged request could leak prompts/memory, create/mutate workspaces, or leak ops intel. Lazy-bootstrap fail-open when `HasAnyLiveTokenGlobal` returns 0.
- `middleware.CanvasOrBearer(db.DB)` — accepts a bearer token OR an Origin matching `CORS_ORIGINS`. Used only for cosmetic routes where a forged request has zero data/security impact. Currently only on `PUT /canvas/viewport`. Do not extend this to any route that leaks data or creates resources — see the runbook.
- `middleware.WorkspaceAuth(db.DB)` — binds a bearer token to `:id`. Workspace A's token cannot hit workspace B's sub-routes. Used for the entire `/workspaces/:id/*` group except the A2A proxy (which has its own `CanCommunicate` layer).
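A sketch of the strict variant's shape in Gin. The real middleware takes the DB handle; here the token checks are stand-in function variables, and only the fail-open behavior and the `HasAnyLiveTokenGlobal` name come from the doc:

```go
package middleware

import (
	"net/http"
	"strings"

	"github.com/gin-gonic/gin"
)

// Stand-ins for the DB-backed checks (HasAnyLiveTokenGlobal and the
// per-token lookup); their shapes are assumptions.
var (
	hasAnyLiveToken func() bool
	tokenIsLive     func(token string) bool
)

// AdminAuth sketches the strict bearer-only variant, including the
// lazy-bootstrap fail-open: with zero live tokens in the system,
// requests pass so a first token can be created.
func AdminAuth() gin.HandlerFunc {
	return func(c *gin.Context) {
		if !hasAnyLiveToken() {
			c.Next() // bootstrap window: no tokens exist yet
			return
		}
		token, ok := strings.CutPrefix(c.GetHeader("Authorization"), "Bearer ")
		if !ok || !tokenIsLive(token) {
			c.AbortWithStatus(http.StatusUnauthorized)
			return
		}
		c.Next()
	}
}
```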
Migration Runner (workspace-server/internal/db/postgres.go)
`RunMigrations` globs `*.sql` in `migrationsDir`, filters out `.down.sql` files, sorts alphabetically, then `DB.Exec()`s each file on boot. The filter is load-bearing: without it, alphabetical sort places `.down.sql` before `.up.sql` (since "d" sorts before "u"), which would wipe tables like `workspace_auth_tokens` on every boot. All `.up.sql` files must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`) because the runner re-applies every migration on every startup.
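A sketch of that logic with error handling trimmed (the glob/filter/sort/exec steps come from the doc; the exact signature is an assumption):

```go
package db

import (
	"database/sql"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// RunMigrations re-applies every .up migration on boot, which is why
// each file must be idempotent.
func RunMigrations(conn *sql.DB, migrationsDir string) error {
	files, err := filepath.Glob(filepath.Join(migrationsDir, "*.sql"))
	if err != nil {
		return err
	}
	sort.Strings(files)
	for _, f := range files {
		// Load-bearing filter: never execute rollback files on boot.
		if strings.HasSuffix(f, ".down.sql") {
			continue
		}
		stmt, err := os.ReadFile(f)
		if err != nil {
			return err
		}
		if _, err := conn.Exec(string(stmt)); err != nil {
			return err
		}
	}
	return nil
}
```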
Workspace Lifecycle
```
provisioning → online → degraded → online → offline → (auto-restart) → provisioning → ... → removed

      any state ──(user pauses)──► paused ──(user resumes)──► provisioning
```
State transitions:
- `provisioning` → `online`: workspace registers via `/registry/register`.
- `online` → `degraded`: error rate exceeds 0.5.
- `degraded` → `online`: error rate recovers.
- `online`/`degraded` → `offline`: Redis TTL expires OR the health sweep detects a dead container.
- `offline` → `provisioning`: auto-restart fires.
- Any state → `paused`: user pauses the workspace (container is stopped).
- `paused` → `provisioning`: user resumes.
- Any state → `removed`: workspace is deleted.
Paused workspaces are excluded from the health sweep, liveness monitor, and auto-restart.
Restart context message: After any restart and successful re-registration, the platform sends a synthetic A2A `message/send` to the workspace with `metadata.kind=restart_context`. The body contains the restart timestamp, previous session end time + duration, and the env-var keys (keys only, never values) now available in the container. The sender uses the `system:restart-context` caller prefix, which bypasses `CanCommunicate` via `isSystemCaller()`. If the workspace does not re-register within 30 seconds, the message is dropped (logged). Handler: `workspace-server/internal/handlers/restart_context.go`.