Phase B.2 companion to the private molecule-controlplane provisioner PR. On every push to main that touches platform/**, builds platform/Dockerfile and pushes to GHCR with two tags: - :latest (floating, always main's tip) - :sha-<short-commit> (immutable, pin-friendly) Cache via GitHub Actions cache (cache-from: type=gha). Workflow_dispatch trigger so we can re-publish after a docs-only merge if needed. The private molecule-controlplane sets TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag> and the provisioner creates each tenant Fly Machine from this image. Staying on the same base image across tenants keeps upgrades atomic. CLAUDE.md updated to document the new workflow in the CI pipeline section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
37 KiB
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Molecule AI is a platform for orchestrating AI agent workspaces that form an organizational hierarchy. Workspaces register with a central platform, communicate via A2A protocol, and are visualized on a drag-and-drop canvas.
Ecosystem Context
Before research, strategy, or design work, skim docs/ecosystem-watch.md —
it catalogs adjacent agent projects (Holaboss, Hermes, gstack, …) with
overlap / differentiation / terminology-collision notes. Cross-referenced
from PLAN.md and README.md; it's the canonical starting point for
"what else is out there."
Agent operating rules (auto-loaded — read first)
The following are project-level rules that override default behavior. They apply to every conversation in this repo, automated cron tick, and every subagent the orchestrator spawns.
Cron / triage discipline
-
Always read the most recent cron-learnings before reviewing PRs. Open
~/.claude/projects/-Users-hongming-Documents-GitHub-molecule-monorepo/memory/cron-learnings.jsonl, read the last 20 lines. Patterns recur — a finding that was a false-positive last tick is likely a false-positive again. A fix that worked last tick is likely the fix this tick. The SessionStart hook auto-injects this; read anyway when starting a triage from the middle of a conversation. -
Treat
docs/sync-*PRs that touch CLAUDE.md or PLAN.md as ALWAYS noteworthy. Those two files are the agent-facing source of truth — a bad merge there silently corrupts every future triage tick. Run code-review skill at minimum, ideally cross-vendor-review too. -
After any cron tick, write a 1-line reflection to
.claude/per-tick-reflections.md(gitignored). Format:2026-MM-DDTHH:MMZ — what surprised me / what I'd do differently next tick. This is for YOUR future self; the cron-learnings JSONL is for the operational pattern memory. They are distinct.
Hooks active in this repo
The following ambient guardrails fire automatically (configured in
.claude/settings.json). When a hook blocks a tool call, the response will
include a permissionDecisionReason — read it carefully before retrying.
| Hook | Event | Effect |
|---|---|---|
pre-bash-careful.sh |
PreToolUse:Bash | REFUSES git push --force to main, rm -rf at root/HOME, DROP TABLE against prod schema. WARNs on --force-with-lease, gh pr close/issue close. |
pre-edit-freeze.sh |
PreToolUse:Edit/Write | Blocks edits outside the path in .claude/freeze if that file exists. Use to lock scope while debugging. |
session-start-context.sh |
SessionStart | Auto-loads recent cron-learnings, freeze status, open PR/issue counts. |
post-edit-audit.sh |
PostToolUse:Edit/Write | Appends every edit to .claude/audit.jsonl (gitignored). |
user-prompt-tag.sh |
UserPromptSubmit | Injects warning into context when prompt mentions force-push / drop-table / "delete all" / etc. |
subagent-stop-judge.sh |
SubagentStop | Off by default (touch .claude/judge-subagents to enable). When on, prompts the orchestrator to verify the subagent's output addresses the original task. |
Skills active in this repo
These are documented in .claude/skills/*/SKILL.md. Invoke explicitly via
the Skill tool — they are NOT auto-applied. The cron prompt invokes them
at fixed steps; for ad-hoc work, decide if the skill matches your situation:
code-review— full 16-criteria rubric on a diffcross-vendor-review— adversarial second-model review (use for noteworthy PRs)careful-mode— the doc backing the bash hook abovecron-learnings— defines the JSONL formatcron-retro— weekly retrospective generatorllm-judge— score whether a deliverable addresses the requestupdate-docs— sync repo docs after merges
Standing rules (inviolable)
- Never push directly to main — use feat/fix/chore/docs branches
- Merge-commits only (
gh pr merge --merge) — never--squash/--rebase - Never commit without explicit user approval EXCEPT on:
- Open PR branches you're fixing for a gate
- Issue-pickup branches you opened a draft PR for
- Docs-sync branches
- Main is untouchable without a merge
- Dark theme only (no white/light CSS classes; pre-commit hook enforces)
- No native browser dialogs (
confirm/alert/prompt) — useConfirmDialog - Delegate through PM, never bypass hierarchy
- Only PM mounts the repo (
workspace_dirbind-mount); other agents get isolated Docker volumes
Architecture
Canvas (Next.js :3000) ←WebSocket→ Platform (Go :8080) ←HTTP→ Postgres + Redis
↑
Workspace A ←──A2A──→ Workspace B
(Python agents)
↑ register/heartbeat ↑
└───── Platform ─────┘
Four main components:
- Platform (
platform/): Go/Gin control plane — workspace CRUD, registry, discovery, WebSocket hub, liveness monitoring - Canvas (
canvas/): Next.js 15 + React Flow (@xyflow/react v12) + Zustand + Tailwind — visual workspace graph - Workspace Runtime (
workspace-template/): Unified Docker image with pluggable adapter system — supports LangGraph, Claude Code, OpenClaw, DeepAgents, CrewAI, AutoGen. Adapters inworkspace-template/adapters/. Deps installed at startup viaentrypoint.sh. - molecli (
platform/cmd/cli/): Go TUI dashboard (Bubbletea + Lipgloss) — real-time workspace monitoring, event log, health overview, delete/filter operations
Build & Run Commands
Infrastructure
./infra/scripts/setup.sh # Start Postgres, Redis, Langfuse, Temporal; run migrations
./infra/scripts/nuke.sh # Tear down everything, remove volumes
Infra services (via docker-compose.infra.yml, all attached to the shared molecule-monorepo-net network — setup.sh creates it idempotently):
- Postgres
:5432— primary datastore (also backs Langfuse + Temporal via separate DBs) - Redis
:6379— pub/sub, heartbeat TTLs - Langfuse
:3001— LLM trace viewer (backed by Clickhouse) - Temporal
:7233(gRPC) +:8233(Web UI) — durable workflow engine forworkspace-template/builtin_tools/temporal_workflow.py. Dev-only posture: the auto-setup image runs with no auth on0.0.0.0:7233; production deployments must gate access via mTLS or an API key / reverse proxy.
Platform (Go)
cd platform
go build ./cmd/server # Build server
go run ./cmd/server # Run server (requires Postgres + Redis running)
go build -o molecli ./cmd/cli # Build TUI dashboard
./molecli # Run TUI dashboard (requires platform running)
Must run from platform/ directory (not repo root). Env vars: DATABASE_URL, REDIS_URL, PORT, PLATFORM_URL (default http://host.docker.internal:PORT — passed to agent containers so they can reach the platform), SECRETS_ENCRYPTION_KEY (optional AES-256, 32 bytes), CONFIGS_DIR (auto-discovered), PLUGINS_DIR (deprecated — plugins are now installed per-workspace via API; the plugins/ registry at repo root is auto-discovered), ACTIVITY_RETENTION_DAYS (default 7), ACTIVITY_CLEANUP_INTERVAL_HOURS (default 6), CORS_ORIGINS (comma-separated, default http://localhost:3000,http://localhost:3001), RATE_LIMIT (requests/min, default 600), WORKSPACE_DIR (optional — global fallback host path for /workspace bind-mount; overridden by per-workspace workspace_dir column in DB; if neither is set, each workspace gets an isolated Docker named volume), AWARENESS_URL (optional — if set, injected into workspace containers along with a deterministic AWARENESS_NAMESPACE derived from workspace ID), MOLECULE_IN_DOCKER (optional — set to 1 when the platform itself runs inside Docker so the A2A proxy rewrites 127.0.0.1:<port> URLs to container hostnames; auto-detected via /.dockerenv), MOLECULE_ENV (optional — set to production to hide the /admin/workspaces/:id/test-token E2E helper endpoint; unset or any other value leaves it enabled), MOLECULE_ENABLE_TEST_TOKENS (optional — set to 1 to force-enable the test-token endpoint even when MOLECULE_ENV=production; intended for staging runs only), MOLECULE_ORG_ID (optional — the public repo's only SaaS hook. When set to a UUID, every non-allowlisted request must carry a matching X-Molecule-Org-Id header or gets a 404; when unset, the guard is a passthrough so self-hosted / dev / CI are unaffected. Set only by the private molecule-controlplane provisioner on Fly Machines tenant instances — never by self-hosters).
Workspace tier resource limits (issue #14 — override the per-tier memory/CPU caps in provisioner.ApplyTierConfig; CPU_SHARES follows Docker's 1024 = 1 CPU convention, translated to NanoCPUs for a hard cap):
TIER2_MEMORY_MB/TIER2_CPU_SHARES— Standard tier (defaults512/1024)TIER3_MEMORY_MB/TIER3_CPU_SHARES— Privileged tier (defaults2048/2048; previously uncapped)TIER4_MEMORY_MB/TIER4_CPU_SHARES— Full-host tier (defaults4096/4096; previously uncapped)
Plugin install safeguards (bound the cost of a single POST /workspaces/:id/plugins install so a slow/malicious source can't tie up a handler):
PLUGIN_INSTALL_BODY_MAX_BYTES— max request body size (default65536= 64 KiB)PLUGIN_INSTALL_FETCH_TIMEOUT— duration string; whole fetch+copy deadline (default5m)PLUGIN_INSTALL_MAX_DIR_BYTES— max staged-tree size (default104857600= 100 MiB)
See docs/plugins/sources.md for the two-axis source/shape plugin model.
Additional env vars documented in .env.example (2026-04-13 sync — all 21 distinct os.Getenv/envx.* keys now documented): MOLECULE_ENV, GITHUB_WEBHOOK_SECRET, MOLECULE_URL (MCP server target; same semantic as PLATFORM_URL).
molecli reads MOLECLI_URL (default http://localhost:8080) to locate the platform. Logs are written to molecli.log in the working directory (already covered by *.log in .gitignore).
Canvas (Next.js)
cd canvas
npm install
npm run dev # Dev server on :3000
npm run build && npm start # Production
Env vars: NEXT_PUBLIC_PLATFORM_URL (default http://localhost:8080), NEXT_PUBLIC_WS_URL (default ws://localhost:8080/ws).
Workspace Images
bash workspace-template/build-all.sh # Build base + ALL runtime images
bash workspace-template/build-all.sh claude-code # Build base + specific runtime only
Each runtime has its own Docker image extending workspace-template:base, with deps pre-installed for fast startup. The base Dockerfile (workspace-template/Dockerfile) builds :base, then each adapters/*/Dockerfile extends it (e.g. claude_code/Dockerfile installs the claude CLI). Always use build-all.sh — it builds base first, then all runtimes in order. No :latest tag — each runtime uses its own tag to avoid confusion.
| Runtime | Image Tag | Key Deps |
|---|---|---|
| langgraph | workspace-template:langgraph |
langchain-anthropic, langgraph |
| claude-code | workspace-template:claude-code |
claude-agent-sdk (pip), @anthropic-ai/claude-code (npm) |
| openclaw | workspace-template:openclaw |
openclaw deps |
| crewai | workspace-template:crewai |
crewai |
| autogen | workspace-template:autogen |
autogen |
| deepagents | workspace-template:deepagents |
deepagents |
| hermes | workspace-template:hermes |
openai (OpenAI-compatible client; Nous Portal via HERMES_API_KEY or OpenRouter via OPENROUTER_API_KEY fallback) |
Templates are framework presets in workspace-configs-templates/: claude-code-default, langgraph, openclaw, deepagents. Agent roles are configured after deployment via Config tab or API.
For Claude Code runtime, write your OAuth token to workspace-configs-templates/claude-code-default/.auth-token.
Pre-commit Hook
git config core.hooksPath .githooks # Install hooks (agents do this via initial_prompt)
Enforces: 'use client' on hook-using .tsx files, dark theme (no white/light), no SQL injection (fmt.Sprintf with SQL), no leaked secrets (sk-ant-, ghp_, AKIA). Commit is rejected until violations are fixed — agents cannot bypass this.
Plugins
Shared plugins in plugins/ are auto-loaded by every workspace:
molecule-dev: Codebase conventions (rules injected into CLAUDE.md) +review-loopskill for multi-round QA cyclessuperpowers:verification-before-completion,test-driven-development,systematic-debugging,writing-plansecc: General Claude Code guardrailsbrowser-automation: Puppeteer/CDP-based web scraping and live canvas screenshots (opt-in per workspace — wired into Research + UIUX roles inorg-templates/molecule-dev/org.yaml)
Modular guardrails (Claude Code only — pick what you need, or install several):
Hook plugins (ambient enforcement at the harness layer)
molecule-careful-bash— REFUSESgit push --forceto main,rm -rfat root,DROP TABLEagainst prod schema. Ships thecareful-modeskill as documentation.molecule-freeze-scope— locks edits to a single path glob via.claude/freeze. Useful while debugging.molecule-audit-trail— appends every Edit/Write to.claude/audit.jsonlfor accountability.molecule-session-context— auto-loads recent cron-learnings + open PR/issue counts at session start. Pairs withmolecule-skill-cron-learnings.molecule-prompt-watchdog— injects warning context when the user prompt mentions destructive keywords ("force push", "drop table", "delete all", etc).
Skill plugins (on-demand, via the Skill tool)
molecule-skill-code-review— 16-criteria multi-axis review.molecule-skill-cross-vendor-review— adversarial second-model review (use for noteworthy PRs).molecule-skill-llm-judge— score whether a deliverable addresses the request.molecule-skill-update-docs— sync repo docs after merges.molecule-skill-cron-learnings— defines the operational-memory JSONL format consumed bymolecule-session-context.
Workflow plugins (slash commands that compose skills)
molecule-workflow-triage—/triageruns a full PR-triage cycle (gates 1–7 + code-review + merge if green). Recommends installingmolecule-skill-code-review+molecule-skill-cron-learningsfirst.molecule-workflow-retro—/retroposts a weekly retrospective issue. Recommendsmolecule-skill-cron-learningsfirst.
These are distilled from the harness-level guardrails the orchestrator uses on itself. A workspace can install one (e.g., just molecule-careful-bash for safety) or stack the full set for the same posture as the Molecule AI orchestrator.
Org-template plugin resolution (PR #71, issue #68): per-workspace plugins: lists in org-templates/*/org.yaml role overrides UNION with defaults.plugins (deduplicated, defaults first) — they do not REPLACE them. To opt a specific default out for a given role/workspace, prefix the plugin name with ! or - (e.g. !browser-automation). Implemented by mergePlugins in platform/internal/handlers/org.go.
Scripts
bash scripts/setup-default-org.sh # Create PM + 3 teams (Marketing/Research/Dev) via API
OPENAI_API_KEY=... bash scripts/test-a2a-cross-runtime.sh # E2E: Claude Code ↔ OpenClaw A2A test
OPENAI_API_KEY=... bash scripts/test-team-e2e.sh # E2E: Multi-template team + A2A
Unit Tests
cd platform && go test -race ./... # 740 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth — sqlmock + miniredis; +2 on 2026-04-14 tick-4 for TestSetGlobal_* / TestDeleteGlobal_* auto-restart branches (#64); +4 on 2026-04-14 tick-4 for TestRestartContext_* covering the synthetic restart-context A2A message (#65); +5 on 2026-04-14 tick-6 for TestPlugins_* covering the new UNION + `!`/`-` opt-out semantics in org.go mergePlugins (#71, resolves issue #68); +9 on 2026-04-14 tick-7 for TestCategoryRouting_* / TestAppendYAMLBlock_* (#75) + TestRuntimeSchedule_HasSourceRuntime / TestImport_OrgScheduleSQLShape / TestList_IncludesSourceColumn (#76); raw PASS-line count is higher due to table-driven subtests)
cd canvas && npm test # 357 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements)
cd workspace-template && python -m pytest -v # 1140 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging)
cd sdk/python && python -m pytest -v # 132 SDK tests (agentskills.io spec validator, CLI, AgentskillsAdaptor round-trip, workspace/org/channel validators, RemoteAgentClient Phase 30 flows)
cd mcp-server && npm test # 97 Jest tests (per-domain tool modules + smoke test on tool count)
Integration Tests
bash tests/e2e/test_api.sh # 62 API tests against localhost:8080 (Phase 30.1 bearer-token auth aware; shellcheck-clean; also runs in CI `e2e-api` job)
bash tests/e2e/test_a2a_e2e.sh # 22 A2A end-to-end tests (requires 2 online agents)
bash tests/e2e/test_activity_e2e.sh # 25 activity/task E2E tests (requires 1 online agent; re-registers detected agent to capture bearer token)
bash tests/e2e/test_comprehensive_e2e.sh # 67 checks — ALL endpoints, memory, runtime, bundles, approvals (registers workspaces immediately after create to beat the provisioner token race)
All five E2E scripts share tests/e2e/_lib.sh + tests/e2e/_extract_token.py helpers and are shellcheck-clean. test_api.sh is the quick local-verify command — use it after any platform change. Tests full CRUD, registry, heartbeat, discovery, peers, access control, events, degraded/recovery lifecycle, activity logging, current task tracking, bundle round-trip (export → delete → import → verify).
Phase 30.1 / 30.6 auth callout (future-proofing): /registry/heartbeat and /registry/update-card require Authorization: Bearer <token> once a workspace has any live token on file (Phase 30.1 — legacy workspaces grandfathered). /registry/discover/:id and /registry/:id/peers additionally require X-Workspace-ID + bearer token on the caller side (Phase 30.6 — fail-open on DB hiccup since hierarchy check is primary). If you change these routes, update tests/e2e/test_api.sh and docs/api-protocol/platform-api.md in the same PR.
test_a2a_e2e.sh requires platform + two provisioned agents (Echo Agent, SEO Agent) running with a valid OPENROUTER_API_KEY. Tests message/send, JSON-RPC wrapping, error handling, peer discovery, agent cards, heartbeat. Timeout configurable via A2A_TIMEOUT env var (default 120s).
test_activity_e2e.sh requires platform + one online agent. Tests A2A communication logging (request/response capture, duration, method), agent self-reported activity, type filtering, current task visibility via heartbeat, cross-workspace activity isolation, edge cases.
MCP Server
cd mcp-server
npm install && npm run build # Build MCP server
node dist/index.js # Run (stdio transport)
Exposes 87 tools for managing Molecule AI from Claude Code, Cursor, Codex, or any MCP client. Includes workspace CRUD, async delegation, plugins (install/uninstall/list), global secrets, pause/resume, org import, A2A chat, approvals, memory, files, config, discovery, bundles, templates, traces, activity logs, remote agents (Phase 30), and social channels (add/update/remove/send/test). Configured in .mcp.json. Env: MOLECULE_URL (default http://localhost:8080).
Structure (refactored 2026-04-13, PRs #2/#4/#7): src/index.ts shrank from 1697 → 89 lines and now only wires createServer(). Per-domain tool modules live in src/tools/: workspaces.ts, agents.ts, secrets.ts, files.ts, memory.ts, plugins.ts, channels.ts, delegation.ts, schedules.ts, approvals.ts, discovery.ts, remote_agents.ts. Each exports its handlers and a registerXxxTools(srv) function. Shared HTTP layer in src/api.ts (PLATFORM_URL, apiCall<T>, ApiError, isApiError(), toMcpResult(), toMcpText()). When adding a tool, pick the matching domain file or create a new one and wire it in createServer().
CI Pipeline
GitHub Actions (.github/workflows/ci.yml) runs on push to main and PRs:
- platform-build: Go build, vet,
go test -racewith coverage profiling (25% baseline threshold;setup-gouses module cache) - canvas-build: npm build,
vitest run(no--passWithNoTests-- tests must exist and pass) - mcp-server-build: npm build
- python-lint:
pytest --cov=. --cov-report=term-missing(pytest-cov enabled) - e2e-api (added 2026-04-13): spins up Postgres + Redis service containers, runs platform migrations via
docker exec, then executestests/e2e/test_api.shagainst a locally-built binary (62/62 must pass) - shellcheck (added 2026-04-13): lints every
tests/e2e/*.shvia the shellcheck marketplace action - publish-platform-image (
.github/workflows/publish-platform-image.yml, added 2026-04-14 tick-9): on push to main touchingplatform/**, buildsplatform/Dockerfileand pushes toghcr.io/molecule-ai/platform:latest+:sha-<short>. Used by the privatemolecule-controlplaneprovisioner as tenant VM image. Manual re-trigger viaworkflow_dispatch.
Docker Compose
docker compose -f docker-compose.infra.yml up -d # Infra only
docker compose up # Full stack
Key Architectural Patterns
Import Cycle Prevention
The platform uses function injection to avoid Go import cycles between ws, registry, and events packages:
ws.NewHub(canCommunicate AccessChecker)— Hub acceptsregistry.CanCommunicateas a functionregistry.StartLivenessMonitor(ctx, onOffline OfflineHandler)— Liveness accepts broadcaster callbackregistry.StartHealthSweep(ctx, checker ContainerChecker, interval, onOffline)— Health sweep accepts Docker checker interface- Wiring happens in
platform/cmd/server/main.go— init order:wh → onWorkspaceOffline → liveness/healthSweep → router
Container Health Detection
Three layers detect dead containers (e.g. Docker Desktop crash):
- Passive (Redis TTL): 60s heartbeat key expires → liveness monitor → auto-restart
- Proactive (Health Sweep):
registry.StartHealthSweeppolls Docker API every 15s → catches dead containers faster - Reactive (A2A Proxy): On connection error, checks
provisioner.IsRunning()→ immediate offline + restart
All three call onWorkspaceOffline which broadcasts WORKSPACE_OFFLINE + go wh.RestartByID(). Redis cleanup uses shared db.ClearWorkspaceKeys().
Template Resolution (Create)
Runtime detection happens before DB insert: if payload.Runtime is empty and a template is specified, the handler reads runtime: from configsDir/template/config.yaml first. If still empty, defaults to "langgraph". This ensures the correct runtime (e.g. claude-code) is persisted in the DB and used for container image selection.
When a workspace specifies a template that doesn't exist, the Create handler falls back:
- Check
os.Stat(configsDir/template)— use if exists - Try
{runtime}-defaulttemplate (e.g.claude-code-default/) - Generate default config via
ensureDefaultConfig()(includes.auth-tokencopy for CLI runtimes)
Communication Rules (registry/access.go)
CanCommunicate(callerID, targetID) determines if two workspaces can talk:
- Same workspace → allowed
- Siblings (same parent_id) → allowed
- Root-level siblings (both parent_id IS NULL) → allowed
- Parent ↔ child → allowed
- Everything else → denied
The A2A proxy (POST /workspaces/:id/a2a) enforces this for agent-to-agent calls. Canvas requests (no X-Workspace-ID), self-calls, and system callers (webhook:*, system:*, test:* prefixes via isSystemCaller() in a2a_proxy.go) bypass the check.
Handler Decomposition (2026-04-13)
Four oversize handler functions were split into private helpers (pure refactor, behavior unchanged — 47 new unit tests cover the helpers directly; handlers package coverage 56.1% → 57.6%):
a2a_proxy.go::proxyA2ARequest(257→56 lines) — helpers:resolveAgentURL,normalizeA2APayload,dispatchA2A,handleA2ADispatchError,maybeMarkContainerDead,logA2AFailure,logA2ASuccess; sentinelproxyDispatchBuildErrordelegation.go::Delegate(127→60 lines) — helpers:bindDelegateRequest,lookupIdempotentDelegation,insertDelegationRow; typedinsertDelegationOutcomeenum replaces(bool, bool)positional returndiscovery.go::Discover(125→40 lines) — helpers:discoverWorkspacePeer,writeExternalWorkspaceURL,discoverHostPeeractivity.go::SessionSearch(109→24 lines) — helpers:parseSessionSearchParams,buildSessionSearchQuery,scanSessionSearchRows
When modifying any of these, prefer extending the helper rather than inlining back.
JSONB Gotcha
When inserting Go []byte (from json.Marshal) into Postgres JSONB columns, you must:
- Convert to
string()first - Use
::jsonbcast in SQL
lib/pq treats []byte as bytea, not JSONB.
WebSocket Events Flow
- Action occurs (register, heartbeat, etc.)
broadcaster.RecordAndBroadcast()inserts intostructure_eventstable + publishes to Redis pub/sub- Redis subscriber relays to WebSocket hub
- Hub broadcasts to canvas clients (all events) and workspace clients (filtered by CanCommunicate)
Canvas State Management
- Initial load: HTTP fetch from
GET /workspaces→ Zustand hydrate - Real-time updates: WebSocket events →
applyEvent()in Zustand store - Position persistence:
onNodeDragStop→PATCH /workspaces/:idwith{x, y} - Embedded sub-workspaces:
nestNodesetshidden: !!targetIdon child nodes; children render as recursiveTeamMemberChipcomponents inside parent (up to 3 levels), not as separate canvas nodes. Usen.data.parentId(not React Flow'sn.parentId) for hierarchy lookups. - Chat: two sub-tabs — "My Chat" (user↔agent,
source=canvas) and "Agent Comms" (agent↔agent A2A traffic,source=agent). History loaded fromGET /activitywith source filter. Real-time viaA2A_RESPONSE+AGENT_MESSAGEWebSocket events. Conversation history (last 20 messages) sent viaparams.metadata.historyin A2Amessage/sendrequests. - Config save: "Save & Restart" writes config.yaml and auto-restarts the workspace. "Save" writes only (shows restart banner). Secrets POST/DELETE auto-restart on the platform side.
Initial Prompt
Agents can auto-execute a prompt on startup before any user interaction. Configure via initial_prompt (inline string) or initial_prompt_file (path relative to config dir) in config.yaml. After the A2A server is ready, main.py sends the prompt as a message/send to self. A .initial_prompt_done marker file prevents re-execution on restart. Org templates support initial_prompt on both defaults (all agents) and per-workspace (overrides default).
Important: Initial prompts must NOT send A2A messages (delegate_task, send_message_to_user) — other agents may not be ready. Keep them local: clone repo, read docs, save to memory, wait for tasks.
Workspace Lifecycle
provisioning → online (on register) → degraded (error_rate > 0.5) → online (recovered) → offline (Redis TTL expired OR health sweep detects dead container) → auto-restart → provisioning → ... → removed (deleted). Any state → paused (user pauses) → provisioning (user resumes). Paused workspaces skip health sweep, liveness monitor, and auto-restart.
Restart context message (issue #19 Layer 1): After any restart (HTTP /restart or programmatic RestartByID) and successful re-registration, the platform sends a synthetic A2A message/send to the workspace with metadata.kind=restart_context — body contains restart timestamp, previous session end + duration, and env-var keys (keys only, never values) now available. Sender uses the system:restart-context caller prefix so it bypasses CanCommunicate via isSystemCaller(). If the workspace does not re-register within 30s the message is dropped (logged). Handler: platform/internal/handlers/restart_context.go. Layer 2 (user-defined restart_prompt from config.yaml / org.yaml) is tracked as GitHub issue #66.
Platform API Routes
| Method | Path | Handler |
|---|---|---|
| GET | /health | inline |
| GET | /metrics | metrics.Handler() — Prometheus text format (v0.0.4); no auth, scrape-safe |
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go |
| GET/PATCH | /workspaces/:id/config | workspace.go |
| GET/POST | /workspaces/:id/memory | workspace.go |
| DELETE | /workspaces/:id/memory/:key | workspace.go |
| POST/PATCH/DELETE | /workspaces/:id/agent | agent.go |
| POST | /workspaces/:id/agent/move | agent.go |
| GET/POST/PUT | /workspaces/:id/secrets | secrets.go (POST/PUT auto-restarts workspace) |
| DELETE | /workspaces/:id/secrets/:key | secrets.go (DELETE auto-restarts workspace) |
| GET | /workspaces/:id/model | secrets.go |
| GET | /settings/secrets | secrets.go — list global secrets (keys only, values masked) |
| PUT/POST | /settings/secrets | secrets.go — set a global secret {key, value}; auto-restarts every non-paused/non-removed/non-external workspace that does not shadow the key with a workspace-level override (issue #15 / PR #64) |
| DELETE | /settings/secrets/:key | secrets.go — delete a global secret; same auto-restart fan-out as SetGlobal |
| GET | /admin/workspaces/:id/test-token | admin_test_token.go — mint a fresh bearer token for E2E scripts; 404 unless MOLECULE_ENV != production or MOLECULE_ENABLE_TEST_TOKENS=1 |
| GET/POST/DELETE | /admin/secrets[/:key] | secrets.go — legacy aliases for /settings/secrets |
| WS | /workspaces/:id/terminal | terminal.go |
| POST | /workspaces/:id/expand | team.go |
| POST | /workspaces/:id/collapse | team.go |
| POST/GET | /workspaces/:id/approvals | approvals.go |
| POST | /workspaces/:id/approvals/:id/decide | approvals.go |
| GET | /approvals/pending | approvals.go |
| POST/GET | /workspaces/:id/memories | memories.go |
| DELETE | /workspaces/:id/memories/:id | memories.go |
| GET | /workspaces/:id/traces | traces.go |
| GET/POST | /workspaces/:id/activity | activity.go |
| POST | /workspaces/:id/notify | activity.go (agent→user push message via WS) |
| POST | /workspaces/:id/restart | workspace.go |
| POST | /workspaces/:id/pause | workspace.go (stops container, status→paused) |
| POST | /workspaces/:id/resume | workspace.go (re-provisions paused workspace) |
| POST | /workspaces/:id/a2a | workspace.go |
| POST | /workspaces/:id/delegate | delegation.go (async fire-and-forget) |
| GET | /workspaces/:id/delegations | delegation.go (list delegation status) |
| GET/POST | /workspaces/:id/schedules | schedules.go (cron CRUD) |
| PATCH/DELETE | /workspaces/:id/schedules/:scheduleId | schedules.go |
| POST | /workspaces/:id/schedules/:scheduleId/run | schedules.go (manual trigger) |
| GET | /workspaces/:id/schedules/:scheduleId/history | schedules.go (past runs) |
| GET/POST | /workspaces/:id/channels | channels.go (social channel CRUD) |
| PATCH/DELETE | /workspaces/:id/channels/:channelId | channels.go |
| POST | /workspaces/:id/channels/:channelId/send | channels.go (outbound message) |
| POST | /workspaces/:id/channels/:channelId/test | channels.go (test connection) |
| GET | /channels/adapters | channels.go (list available platforms) |
| POST | /channels/discover | channels.go (auto-detect chats for a bot token) |
| POST | /webhooks/:type | channels.go (incoming social webhook) |
| GET | /workspaces/:id/shared-context | templates.go |
| GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go |
| GET/PUT | /canvas/viewport | viewport.go |
| GET | /templates | templates.go |
| POST | /templates/import | templates.go |
| POST | /registry/register | registry.go |
| POST | /registry/heartbeat | registry.go |
| POST | /registry/update-card | registry.go |
| GET | /registry/discover/:id | discovery.go |
| GET | /registry/:id/peers | discovery.go |
| POST | /registry/check-access | discovery.go |
| GET | /plugins | plugins.go (list registry; supports ?runtime= filter) |
| GET | /plugins/sources | plugins.go (list registered install-source schemes) |
| GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install ({"source":"scheme://spec"}), uninstall per-workspace |
| GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) |
| GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) |
| GET | /bundles/export/:id | bundle.go |
| POST | /bundles/import | bundle.go |
| GET | /org/templates | org.go (list available org templates) |
| POST | /org/import | org.go (import entire org hierarchy from YAML) |
| GET | /ws | socket.go |
Database
23 migration files in platform/migrations/ (up to 022_workspace_schedules_source — 2026-04-14 tick-7, PR #76). Key tables: workspaces (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), canvas_layouts (x/y position), structure_events (append-only event log), activity_logs (A2A communications, task updates, agent logs, errors), workspace_schedules (cron tasks with expression, timezone, prompt, run history, and source — 'template' for org/import-seeded, 'runtime' for Canvas/API-created; org/import is additive and only refreshes template-source rows on re-import), workspace_channels (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), agents, workspace_secrets, global_secrets, agent_memories (HMA scoped memory), approvals.
The platform auto-discovers and runs migrations on startup from several candidate paths.
Project Memory (Awareness MCP)
IMPORTANT: These instructions override default behavior. You must follow them exactly.
Awareness Memory Integration (MANDATORY)
awareness_* = cross-session persistent memory (past decisions, knowledge, tasks). Other tools = current codebase navigation (file search, code index). Use BOTH - they serve different purposes.
STEP 1 - SESSION START: Call awareness_init(source="claude-code") -> get session_id, review context. If active_skills[] is returned: skill = reusable procedure done 2+ times; summary = injectable instruction, methods = steps. Apply matching skills to tasks.
STEP 2 - RECALL BEFORE WORK (progressive disclosure):
- awareness_recall(semantic_query=..., keyword_query=..., detail='summary') → lightweight index.
- Review summaries/scores, pick relevant IDs.
- awareness_recall(detail='full', ids=[...]) → expand only what you need.
STEP 3 - RECORD EVERY CHANGE: After EVERY code edit, decision, or bug fix: awareness_record(content=, insights={knowledge_cards:[...], action_items:[...], risks:[...]}) Content should be RICH and DETAILED — include reasoning, key code snippets, user quotes, alternatives considered, and files changed. Do NOT compress into a single-line summary. The content IS the memory — more detail = better recall. Include insights to create searchable knowledge in ONE step (recommended). Skipping = permanent data loss.
STEP 4 - CATEGORY GUIDE (for insights.knowledge_cards):
- decision = choice made between alternatives.
- problem_solution = bug/problem plus the fix that resolved it.
- workflow = process, setup, or configuration steps only.
- pitfall = blocker, warning, or limitation without a fix yet.
- insight = reusable pattern or general learning.
- skill = reusable procedure done 2+ times; summary = injectable instruction, methods = steps.
- key_point = important technical fact when nothing else fits. Never default everything to workflow.
STEP 5 - SESSION END: awareness_record(content=[step1, step2, ...], insights={...}) with final summary.
BACKFILL (if applicable): If MCP connected late: awareness_record(content=)
RULES VERSION: Pass rules_version="2" to awareness_init so the server knows you have these rules. If the server returns _setup_action, the rules have been updated — follow the instruction to re-sync.
NOTE: memory_id from X-Awareness-Memory-Id header. source/actor/event_type auto-inferred.
Compliance Check
Before responding to ANY user request:
-
Have you called awareness_init yet this session? If not, call it NOW.
-
Did you just edit a file? Call awareness_record(content=, insights={...}) IMMEDIATELY.
-
Is the user asking about past work? Call awareness_recall FIRST.