molecule-core/CLAUDE.md
Hongming Wang 035287df38 feat(ci): publish-platform-image workflow → ghcr.io/molecule-ai/platform
Phase B.2 companion to the private molecule-controlplane provisioner PR.
On every push to main that touches platform/**, builds platform/Dockerfile
and pushes to GHCR with two tags:

- :latest              (floating, always main's tip)
- :sha-<short-commit>  (immutable, pin-friendly)

Cache via GitHub Actions cache (cache-from: type=gha). Workflow_dispatch
trigger so we can re-publish after a docs-only merge if needed.

The private molecule-controlplane sets TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag>
and the provisioner creates each tenant Fly Machine from this image. Staying
on the same base image across tenants keeps upgrades atomic.

CLAUDE.md updated to document the new workflow in the CI pipeline section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 16:37:49 -07:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Molecule AI is a platform for orchestrating AI agent workspaces that form an organizational hierarchy. Workspaces register with a central platform, communicate via A2A protocol, and are visualized on a drag-and-drop canvas.

Ecosystem Context

Before research, strategy, or design work, skim docs/ecosystem-watch.md — it catalogs adjacent agent projects (Holaboss, Hermes, gstack, …) with overlap / differentiation / terminology-collision notes. Cross-referenced from PLAN.md and README.md; it's the canonical starting point for "what else is out there."

Agent operating rules (auto-loaded — read first)

The following are project-level rules that override default behavior. They apply to every conversation in this repo, every automated cron tick, and every subagent the orchestrator spawns.

Cron / triage discipline

  1. Always read the most recent cron-learnings before reviewing PRs. Open ~/.claude/projects/-Users-hongming-Documents-GitHub-molecule-monorepo/memory/cron-learnings.jsonl, read the last 20 lines. Patterns recur — a finding that was a false-positive last tick is likely a false-positive again. A fix that worked last tick is likely the fix this tick. The SessionStart hook auto-injects this; read anyway when starting a triage from the middle of a conversation.

  2. Treat docs/sync-* PRs that touch CLAUDE.md or PLAN.md as ALWAYS noteworthy. Those two files are the agent-facing source of truth — a bad merge there silently corrupts every future triage tick. Run code-review skill at minimum, ideally cross-vendor-review too.

  3. After any cron tick, write a 1-line reflection to .claude/per-tick-reflections.md (gitignored). Format: 2026-MM-DDTHH:MMZ — what surprised me / what I'd do differently next tick. This is for YOUR future self; the cron-learnings JSONL is for the operational pattern memory. They are distinct.
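Rule 1 (read the last 20 lines of cron-learnings.jsonl) can be sketched as a small helper. This is a minimal sketch, not the SessionStart hook's actual code: the function name `tail_learnings` is hypothetical, and records are treated as opaque JSON objects because the schema is defined by the cron-learnings skill, not here.

```python
import json
from collections import deque
from pathlib import Path


def tail_learnings(path, n=20):
    """Return the last n parsed records of a JSONL file (hypothetical helper)."""
    file = Path(path).expanduser()
    if not file.exists():
        return []
    with file.open(encoding="utf-8") as fh:
        # deque(maxlen=n) keeps only the final n non-blank lines while streaming.
        last_lines = deque((line for line in fh if line.strip()), maxlen=n)
    records = []
    for line in last_lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # Skip a partially written line instead of failing the whole triage.
            continue
    return records
```

Skipping unparseable lines is a design choice of the sketch: a half-written final line should not block a triage tick.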

Hooks active in this repo

The following ambient guardrails fire automatically (configured in .claude/settings.json). When a hook blocks a tool call, the response will include a permissionDecisionReason — read it carefully before retrying.

| Hook | Event | Effect |
| --- | --- | --- |
| pre-bash-careful.sh | PreToolUse:Bash | REFUSES git push --force to main, rm -rf at root/HOME, DROP TABLE against prod schema. WARNs on --force-with-lease, gh pr close/issue close. |
| pre-edit-freeze.sh | PreToolUse:Edit/Write | Blocks edits outside the path in .claude/freeze if that file exists. Use to lock scope while debugging. |
| session-start-context.sh | SessionStart | Auto-loads recent cron-learnings, freeze status, open PR/issue counts. |
| post-edit-audit.sh | PostToolUse:Edit/Write | Appends every edit to .claude/audit.jsonl (gitignored). |
| user-prompt-tag.sh | UserPromptSubmit | Injects warning into context when prompt mentions force-push / drop-table / "delete all" / etc. |
| subagent-stop-judge.sh | SubagentStop | Off by default (touch .claude/judge-subagents to enable). When on, prompts the orchestrator to verify the subagent's output addresses the original task. |
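To make the REFUSE / WARN split of pre-bash-careful.sh concrete, here is a rough Python sketch of the decision logic. The real hook is a shell script whose exact patterns are not shown in this file, so every rule below is an assumption based only on the table's wording. `classify_bash` and `ROOT_TARGETS` are hypothetical names, and the sketch only handles a single simple command (no pipelines or `&&` chains).

```python
import re
import shlex

ROOT_TARGETS = {"/", "~", "~/", "$HOME", "$HOME/"}


def classify_bash(command):
    """Return (decision, reason); decision is 'refuse', 'warn', or 'allow'."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        tokens = command.split()  # unbalanced quotes: fall back to a plain split

    if tokens[:2] == ["git", "push"]:
        if ("--force" in tokens or "-f" in tokens) and "main" in tokens:
            return "refuse", "force-push to main"
        if any(t.startswith("--force-with-lease") for t in tokens):
            return "warn", "--force-with-lease: review before pushing"

    if tokens[:1] == ["rm"]:
        short_flags = "".join(
            t[1:] for t in tokens[1:] if t.startswith("-") and not t.startswith("--")
        )
        recursive_force = "r" in short_flags.lower() and "f" in short_flags
        if recursive_force and ROOT_TARGETS.intersection(tokens[1:]):
            return "refuse", "rm -rf at root or HOME"

    # The real hook scopes this to the prod schema; the sketch cannot know which
    # schema is prod, so it refuses any DROP TABLE.
    if re.search(r"\bdrop\s+table\b", command, re.IGNORECASE):
        return "refuse", "DROP TABLE"

    if tokens[:3] in (["gh", "pr", "close"], ["gh", "issue", "close"]):
        return "warn", "closing a PR or issue"

    return "allow", ""
```

For example, `classify_bash("git push --force-with-lease origin feat/x")` yields a warning, while the same command with `--force` and `main` is refused.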

Skills active in this repo

These are documented in .claude/skills/*/SKILL.md. Invoke explicitly via the Skill tool — they are NOT auto-applied. The cron prompt invokes them at fixed steps; for ad-hoc work, decide if the skill matches your situation:

  • code-review — full 16-criteria rubric on a diff
  • cross-vendor-review — adversarial second-model review (use for noteworthy PRs)
  • careful-mode — the doc backing the bash hook above
  • cron-learnings — defines the JSONL format
  • cron-retro — weekly retrospective generator
  • llm-judge — score whether a deliverable addresses the request
  • update-docs — sync repo docs after merges

Standing rules (inviolable)

  • Never push directly to main — use feat/fix/chore/docs branches
  • Merge-commits only (gh pr merge --merge) — never --squash / --rebase
  • Never commit without explicit user approval EXCEPT on:
    • Open PR branches you're fixing for a gate
    • Issue-pickup branches you opened a draft PR for
    • Docs-sync branches
    • Main is untouchable without a merge
  • Dark theme only (no white/light CSS classes; pre-commit hook enforces)
  • No native browser dialogs (confirm/alert/prompt) — use ConfirmDialog
  • Delegate through PM, never bypass hierarchy
  • Only PM mounts the repo (workspace_dir bind-mount); other agents get isolated Docker volumes

Architecture

Canvas (Next.js :3000) ←WebSocket→ Platform (Go :8080) ←HTTP→ Postgres + Redis
                                                                  ↑
                                   Workspace A ←──A2A──→ Workspace B
                                   (Python agents)
                                        ↑ register/heartbeat ↑
                                        └───── Platform ─────┘

Four main components:

  • Platform (platform/): Go/Gin control plane — workspace CRUD, registry, discovery, WebSocket hub, liveness monitoring
  • Canvas (canvas/): Next.js 15 + React Flow (@xyflow/react v12) + Zustand + Tailwind — visual workspace graph
  • Workspace Runtime (workspace-template/): Unified Docker image with pluggable adapter system — supports LangGraph, Claude Code, OpenClaw, DeepAgents, CrewAI, AutoGen. Adapters in workspace-template/adapters/. Deps installed at startup via entrypoint.sh.
  • molecli (platform/cmd/cli/): Go TUI dashboard (Bubbletea + Lipgloss) — real-time workspace monitoring, event log, health overview, delete/filter operations

Build & Run Commands

Infrastructure

./infra/scripts/setup.sh    # Start Postgres, Redis, Langfuse, Temporal; run migrations
./infra/scripts/nuke.sh     # Tear down everything, remove volumes

Infra services (via docker-compose.infra.yml, all attached to the shared molecule-monorepo-net network — setup.sh creates it idempotently):

  • Postgres :5432 — primary datastore (also backs Langfuse + Temporal via separate DBs)
  • Redis :6379 — pub/sub, heartbeat TTLs
  • Langfuse :3001 — LLM trace viewer (backed by Clickhouse)
  • Temporal :7233 (gRPC) + :8233 (Web UI) — durable workflow engine for workspace-template/builtin_tools/temporal_workflow.py. Dev-only posture: the auto-setup image runs with no auth on 0.0.0.0:7233; production deployments must gate access via mTLS or an API key / reverse proxy.

Platform (Go)

cd platform
go build ./cmd/server       # Build server
go run ./cmd/server          # Run server (requires Postgres + Redis running)
go build -o molecli ./cmd/cli  # Build TUI dashboard
./molecli                    # Run TUI dashboard (requires platform running)

Must run from the platform/ directory (not the repo root). Env vars:

  • DATABASE_URL, REDIS_URL, PORT
  • PLATFORM_URL (default http://host.docker.internal:PORT): passed to agent containers so they can reach the platform
  • SECRETS_ENCRYPTION_KEY (optional AES-256, 32 bytes)
  • CONFIGS_DIR (auto-discovered)
  • PLUGINS_DIR (deprecated): plugins are now installed per-workspace via API; the plugins/ registry at repo root is auto-discovered
  • ACTIVITY_RETENTION_DAYS (default 7)
  • ACTIVITY_CLEANUP_INTERVAL_HOURS (default 6)
  • CORS_ORIGINS (comma-separated, default http://localhost:3000,http://localhost:3001)
  • RATE_LIMIT (requests/min, default 600)
  • WORKSPACE_DIR (optional): global fallback host path for the /workspace bind-mount; overridden by the per-workspace workspace_dir column in the DB; if neither is set, each workspace gets an isolated Docker named volume
  • AWARENESS_URL (optional): if set, injected into workspace containers along with a deterministic AWARENESS_NAMESPACE derived from the workspace ID
  • MOLECULE_IN_DOCKER (optional): set to 1 when the platform itself runs inside Docker so the A2A proxy rewrites 127.0.0.1:<port> URLs to container hostnames; auto-detected via /.dockerenv
  • MOLECULE_ENV (optional): set to production to hide the /admin/workspaces/:id/test-token E2E helper endpoint; unset or any other value leaves it enabled
  • MOLECULE_ENABLE_TEST_TOKENS (optional): set to 1 to force-enable the test-token endpoint even when MOLECULE_ENV=production; intended for staging runs only
  • MOLECULE_ORG_ID (optional): the public repo's only SaaS hook. When set to a UUID, every non-allowlisted request must carry a matching X-Molecule-Org-Id header or gets a 404; when unset, the guard is a passthrough so self-hosted / dev / CI are unaffected. Set only by the private molecule-controlplane provisioner on Fly Machines tenant instances, never by self-hosters.
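The MOLECULE_ORG_ID guard described above reduces to a three-way check. The sketch below is illustrative Python, not the platform's Go middleware: `org_guard` is a hypothetical name, and the allowlist contents (`/health`, `/metrics`) are an assumption, since this file does not say which routes are allowlisted.

```python
def org_guard(configured_org_id, path, headers, allowlist=("/health", "/metrics")):
    """Return 404 when the request must be rejected, or None to pass it through."""
    if not configured_org_id:
        return None  # MOLECULE_ORG_ID unset: guard is a passthrough
    if path in allowlist:
        return None  # allowlisted routes skip the header check (assumed list)
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-molecule-org-id") == configured_org_id:
        return None
    return 404
```

Returning 404 instead of 401/403 matches the behavior stated above; a tenant instance does not reveal its existence to callers carrying the wrong org ID.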

Workspace tier resource limits (issue #14 — override the per-tier memory/CPU caps in provisioner.ApplyTierConfig; CPU_SHARES follows Docker's 1024 = 1 CPU convention, translated to NanoCPUs for a hard cap):

  • TIER2_MEMORY_MB / TIER2_CPU_SHARES — Standard tier (defaults 512 / 1024)
  • TIER3_MEMORY_MB / TIER3_CPU_SHARES — Privileged tier (defaults 2048 / 2048; previously uncapped)
  • TIER4_MEMORY_MB / TIER4_CPU_SHARES — Full-host tier (defaults 4096 / 4096; previously uncapped)
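The shares-to-NanoCPUs translation can be worked through in a few lines. This is a sketch under the convention stated above (1024 shares = 1 CPU) plus Docker's NanoCPUs unit (1e9 = 1 CPU); it is not the body of provisioner.ApplyTierConfig, and `tier_limits` is a hypothetical name.

```python
NANO_CPUS_PER_CPU = 1_000_000_000  # Docker's NanoCPUs unit: 1e9 = one full CPU
SHARES_PER_CPU = 1024              # convention from the paragraph above

# (memory MB, CPU shares) defaults per tier, copied from the list above.
TIER_DEFAULTS = {2: (512, 1024), 3: (2048, 2048), 4: (4096, 4096)}


def tier_limits(tier, env):
    """Resolve the caps for a tier from an env mapping, falling back to defaults."""
    default_memory, default_shares = TIER_DEFAULTS[tier]
    memory_mb = int(env.get(f"TIER{tier}_MEMORY_MB", default_memory))
    cpu_shares = int(env.get(f"TIER{tier}_CPU_SHARES", default_shares))
    nano_cpus = cpu_shares * NANO_CPUS_PER_CPU // SHARES_PER_CPU
    return {"memory_mb": memory_mb, "cpu_shares": cpu_shares, "nano_cpus": nano_cpus}
```

With no overrides, tier 2 resolves to one CPU (1,000,000,000 NanoCPUs); setting TIER3_CPU_SHARES=512 would cap tier 3 at half a CPU.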

Plugin install safeguards (bound the cost of a single POST /workspaces/:id/plugins install so a slow/malicious source can't tie up a handler):

  • PLUGIN_INSTALL_BODY_MAX_BYTES — max request body size (default 65536 = 64 KiB)
  • PLUGIN_INSTALL_FETCH_TIMEOUT — duration string; whole fetch+copy deadline (default 5m)
  • PLUGIN_INSTALL_MAX_DIR_BYTES — max staged-tree size (default 104857600 = 100 MiB)

See docs/plugins/sources.md for the two-axis source/shape plugin model.

Additional env vars documented in .env.example (2026-04-13 sync — all 21 distinct os.Getenv/envx.* keys now documented): MOLECULE_ENV, GITHUB_WEBHOOK_SECRET, MOLECULE_URL (MCP server target; same semantic as PLATFORM_URL).

molecli reads MOLECLI_URL (default http://localhost:8080) to locate the platform. Logs are written to molecli.log in the working directory (already covered by *.log in .gitignore).

Canvas (Next.js)

cd canvas
npm install
npm run dev                  # Dev server on :3000
npm run build && npm start   # Production

Env vars: NEXT_PUBLIC_PLATFORM_URL (default http://localhost:8080), NEXT_PUBLIC_WS_URL (default ws://localhost:8080/ws).

Workspace Images

bash workspace-template/build-all.sh                   # Build base + ALL runtime images
bash workspace-template/build-all.sh claude-code       # Build base + specific runtime only

Each runtime has its own Docker image extending workspace-template:base, with deps pre-installed for fast startup. The base Dockerfile (workspace-template/Dockerfile) builds :base, then each adapters/*/Dockerfile extends it (e.g. claude_code/Dockerfile installs the claude CLI). Always use build-all.sh — it builds base first, then all runtimes in order. No :latest tag — each runtime uses its own tag to avoid confusion.

| Runtime | Image Tag | Key Deps |
| --- | --- | --- |
| langgraph | workspace-template:langgraph | langchain-anthropic, langgraph |
| claude-code | workspace-template:claude-code | claude-agent-sdk (pip), @anthropic-ai/claude-code (npm) |
| openclaw | workspace-template:openclaw | openclaw deps |
| crewai | workspace-template:crewai | crewai |
| autogen | workspace-template:autogen | autogen |
| deepagents | workspace-template:deepagents | deepagents |
| hermes | workspace-template:hermes | openai (OpenAI-compatible client; Nous Portal via HERMES_API_KEY or OpenRouter via OPENROUTER_API_KEY fallback) |

Templates are framework presets in workspace-configs-templates/: claude-code-default, langgraph, openclaw, deepagents. Agent roles are configured after deployment via Config tab or API.

For Claude Code runtime, write your OAuth token to workspace-configs-templates/claude-code-default/.auth-token.

Pre-commit Hook

git config core.hooksPath .githooks            # Install hooks (agents do this via initial_prompt)

Enforces: 'use client' on hook-using .tsx files, dark theme (no white/light), no SQL injection (fmt.Sprintf with SQL), no leaked secrets (sk-ant-, ghp_, AKIA). Commit is rejected until violations are fixed — agents cannot bypass this.

Plugins

Shared plugins in plugins/ are auto-loaded by every workspace:

  • molecule-dev: Codebase conventions (rules injected into CLAUDE.md) + review-loop skill for multi-round QA cycles
  • superpowers: verification-before-completion, test-driven-development, systematic-debugging, writing-plans
  • ecc: General Claude Code guardrails
  • browser-automation: Puppeteer/CDP-based web scraping and live canvas screenshots (opt-in per workspace — wired into Research + UIUX roles in org-templates/molecule-dev/org.yaml)

Modular guardrails (Claude Code only — pick what you need, or install several):

Hook plugins (ambient enforcement at the harness layer)

  • molecule-careful-bash — REFUSES git push --force to main, rm -rf at root, DROP TABLE against prod schema. Ships the careful-mode skill as documentation.
  • molecule-freeze-scope — locks edits to a single path glob via .claude/freeze. Useful while debugging.
  • molecule-audit-trail — appends every Edit/Write to .claude/audit.jsonl for accountability.
  • molecule-session-context — auto-loads recent cron-learnings + open PR/issue counts at session start. Pairs with molecule-skill-cron-learnings.
  • molecule-prompt-watchdog — injects warning context when the user prompt mentions destructive keywords ("force push", "drop table", "delete all", etc).

Skill plugins (on-demand, via the Skill tool)

  • molecule-skill-code-review — 16-criteria multi-axis review.
  • molecule-skill-cross-vendor-review — adversarial second-model review (use for noteworthy PRs).
  • molecule-skill-llm-judge — score whether a deliverable addresses the request.
  • molecule-skill-update-docs — sync repo docs after merges.
  • molecule-skill-cron-learnings — defines the operational-memory JSONL format consumed by molecule-session-context.

Workflow plugins (slash commands that compose skills)

  • molecule-workflow-triage: /triage runs a full PR-triage cycle (gates 17 + code-review + merge if green). Recommends installing molecule-skill-code-review + molecule-skill-cron-learnings first.
  • molecule-workflow-retro: /retro posts a weekly retrospective issue. Recommends molecule-skill-cron-learnings first.

These are distilled from the harness-level guardrails the orchestrator uses on itself. A workspace can install one (e.g., just molecule-careful-bash for safety) or stack the full set for the same posture as the Molecule AI orchestrator.

Org-template plugin resolution (PR #71, issue #68): per-workspace plugins: lists in org-templates/*/org.yaml role overrides UNION with defaults.plugins (deduplicated, defaults first) — they do not REPLACE them. To opt a specific default out for a given role/workspace, prefix the plugin name with ! or - (e.g. !browser-automation). Implemented by mergePlugins in platform/internal/handlers/org.go.
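The UNION-with-opt-out rule is easiest to see in code. The following is a Python sketch of the semantics described above, not a port of mergePlugins in org.go. One assumption is labeled in the comments: an opt-out also suppresses the same name if it appears as a plain entry in the role's own list.

```python
OPT_OUT_PREFIXES = ("!", "-")


def merge_plugins(defaults, overrides):
    """Union defaults with per-workspace plugins: defaults first, deduplicated."""
    opted_out = {
        name[1:] for name in overrides
        if name.startswith(OPT_OUT_PREFIXES) and len(name) > 1
    }
    additions = [name for name in overrides if not name.startswith(OPT_OUT_PREFIXES)]
    merged, seen = [], set()
    for name in list(defaults) + additions:
        # Assumption: an opt-out wins even over a plain entry with the same name.
        if name in opted_out or name in seen:
            continue
        seen.add(name)
        merged.append(name)
    return merged
```

For example, defaults `["molecule-dev", "browser-automation"]` with a role list of `["ecc", "!browser-automation", "molecule-dev"]` resolve to `["molecule-dev", "ecc"]`.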

Scripts

bash scripts/setup-default-org.sh              # Create PM + 3 teams (Marketing/Research/Dev) via API
OPENAI_API_KEY=... bash scripts/test-a2a-cross-runtime.sh  # E2E: Claude Code ↔ OpenClaw A2A test
OPENAI_API_KEY=... bash scripts/test-team-e2e.sh           # E2E: Multi-template team + A2A

Unit Tests

cd platform && go test -race ./...               # 740 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth — sqlmock + miniredis; +2 on 2026-04-14 tick-4 for TestSetGlobal_* / TestDeleteGlobal_* auto-restart branches (#64); +4 on 2026-04-14 tick-4 for TestRestartContext_* covering the synthetic restart-context A2A message (#65); +5 on 2026-04-14 tick-6 for TestPlugins_* covering the new UNION + `!`/`-` opt-out semantics in org.go mergePlugins (#71, resolves issue #68); +9 on 2026-04-14 tick-7 for TestCategoryRouting_* / TestAppendYAMLBlock_* (#75) + TestRuntimeSchedule_HasSourceRuntime / TestImport_OrgScheduleSQLShape / TestList_IncludesSourceColumn (#76); raw PASS-line count is higher due to table-driven subtests)
cd canvas && npm test                            # 357 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements)
cd workspace-template && python -m pytest -v     # 1140 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging)
cd sdk/python && python -m pytest -v              # 132 SDK tests (agentskills.io spec validator, CLI, AgentskillsAdaptor round-trip, workspace/org/channel validators, RemoteAgentClient Phase 30 flows)
cd mcp-server && npm test                        # 97 Jest tests (per-domain tool modules + smoke test on tool count)

Integration Tests

bash tests/e2e/test_api.sh             # 62 API tests against localhost:8080 (Phase 30.1 bearer-token auth aware; shellcheck-clean; also runs in CI `e2e-api` job)
bash tests/e2e/test_a2a_e2e.sh         # 22 A2A end-to-end tests (requires 2 online agents)
bash tests/e2e/test_activity_e2e.sh    # 25 activity/task E2E tests (requires 1 online agent; re-registers detected agent to capture bearer token)
bash tests/e2e/test_comprehensive_e2e.sh # 67 checks — ALL endpoints, memory, runtime, bundles, approvals (registers workspaces immediately after create to beat the provisioner token race)

All five E2E scripts share tests/e2e/_lib.sh + tests/e2e/_extract_token.py helpers and are shellcheck-clean. test_api.sh is the quick local-verify command — use it after any platform change. Tests full CRUD, registry, heartbeat, discovery, peers, access control, events, degraded/recovery lifecycle, activity logging, current task tracking, bundle round-trip (export → delete → import → verify).

Phase 30.1 / 30.6 auth callout (future-proofing): /registry/heartbeat and /registry/update-card require Authorization: Bearer <token> once a workspace has any live token on file (Phase 30.1 — legacy workspaces grandfathered). /registry/discover/:id and /registry/:id/peers additionally require X-Workspace-ID + bearer token on the caller side (Phase 30.6 — fail-open on DB hiccup since hierarchy check is primary). If you change these routes, update tests/e2e/test_api.sh and docs/api-protocol/platform-api.md in the same PR.

test_a2a_e2e.sh requires platform + two provisioned agents (Echo Agent, SEO Agent) running with a valid OPENROUTER_API_KEY. Tests message/send, JSON-RPC wrapping, error handling, peer discovery, agent cards, heartbeat. Timeout configurable via A2A_TIMEOUT env var (default 120s).

test_activity_e2e.sh requires platform + one online agent. Tests A2A communication logging (request/response capture, duration, method), agent self-reported activity, type filtering, current task visibility via heartbeat, cross-workspace activity isolation, edge cases.

MCP Server

cd mcp-server
npm install && npm run build   # Build MCP server
node dist/index.js             # Run (stdio transport)

Exposes 87 tools for managing Molecule AI from Claude Code, Cursor, Codex, or any MCP client. Includes workspace CRUD, async delegation, plugins (install/uninstall/list), global secrets, pause/resume, org import, A2A chat, approvals, memory, files, config, discovery, bundles, templates, traces, activity logs, remote agents (Phase 30), and social channels (add/update/remove/send/test). Configured in .mcp.json. Env: MOLECULE_URL (default http://localhost:8080).

Structure (refactored 2026-04-13, PRs #2/#4/#7): src/index.ts shrank from 1697 → 89 lines and now only wires createServer(). Per-domain tool modules live in src/tools/: workspaces.ts, agents.ts, secrets.ts, files.ts, memory.ts, plugins.ts, channels.ts, delegation.ts, schedules.ts, approvals.ts, discovery.ts, remote_agents.ts. Each exports its handlers and a registerXxxTools(srv) function. Shared HTTP layer in src/api.ts (PLATFORM_URL, apiCall<T>, ApiError, isApiError(), toMcpResult(), toMcpText()). When adding a tool, pick the matching domain file or create a new one and wire it in createServer().

CI Pipeline

GitHub Actions (.github/workflows/ci.yml) runs on push to main and PRs:

  • platform-build: Go build, vet, go test -race with coverage profiling (25% baseline threshold; setup-go uses module cache)
  • canvas-build: npm build, vitest run (no --passWithNoTests -- tests must exist and pass)
  • mcp-server-build: npm build
  • python-lint: pytest --cov=. --cov-report=term-missing (pytest-cov enabled)
  • e2e-api (added 2026-04-13): spins up Postgres + Redis service containers, runs platform migrations via docker exec, then executes tests/e2e/test_api.sh against a locally-built binary (62/62 must pass)
  • shellcheck (added 2026-04-13): lints every tests/e2e/*.sh via the shellcheck marketplace action
  • publish-platform-image (.github/workflows/publish-platform-image.yml, added 2026-04-14 tick-9): on push to main touching platform/**, builds platform/Dockerfile and pushes to ghcr.io/molecule-ai/platform:latest + :sha-<short>. Used by the private molecule-controlplane provisioner as tenant VM image. Manual re-trigger via workflow_dispatch.

Docker Compose

docker compose -f docker-compose.infra.yml up -d    # Infra only
docker compose up                                     # Full stack

Key Architectural Patterns

Import Cycle Prevention

The platform uses function injection to avoid Go import cycles between ws, registry, and events packages:

  • ws.NewHub(canCommunicate AccessChecker) — Hub accepts registry.CanCommunicate as a function
  • registry.StartLivenessMonitor(ctx, onOffline OfflineHandler) — Liveness accepts broadcaster callback
  • registry.StartHealthSweep(ctx, checker ContainerChecker, interval, onOffline) — Health sweep accepts Docker checker interface
  • Wiring happens in platform/cmd/server/main.go — init order: wh → onWorkspaceOffline → liveness/healthSweep → router

Container Health Detection

Three layers detect dead containers (e.g. Docker Desktop crash):

  1. Passive (Redis TTL): 60s heartbeat key expires → liveness monitor → auto-restart
  2. Proactive (Health Sweep): registry.StartHealthSweep polls Docker API every 15s → catches dead containers faster
  3. Reactive (A2A Proxy): On connection error, checks provisioner.IsRunning() → immediate offline + restart

All three call onWorkspaceOffline which broadcasts WORKSPACE_OFFLINE + go wh.RestartByID(). Redis cleanup uses shared db.ClearWorkspaceKeys().

Template Resolution (Create)

Runtime detection happens before DB insert: if payload.Runtime is empty and a template is specified, the handler reads runtime: from configsDir/template/config.yaml first. If still empty, defaults to "langgraph". This ensures the correct runtime (e.g. claude-code) is persisted in the DB and used for container image selection.

When a workspace specifies a template that doesn't exist, the Create handler falls back:

  1. Check os.Stat(configsDir/template) — use if exists
  2. Try {runtime}-default template (e.g. claude-code-default/)
  3. Generate default config via ensureDefaultConfig() (includes .auth-token copy for CLI runtimes)
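The runtime detection and the three-step template fallback can be sketched together. This is illustrative Python rather than the Go Create handler: `detect_runtime` and `resolve_template` are hypothetical names, the `runtime:` lookup is a naive line match instead of a real YAML parse, and step 3 is represented only by a `"generated"` marker in place of ensureDefaultConfig().

```python
from pathlib import Path


def detect_runtime(payload_runtime, configs_dir, template):
    """Payload runtime wins, then the template's config.yaml, then 'langgraph'."""
    if payload_runtime:
        return payload_runtime
    if template:
        config = Path(configs_dir) / template / "config.yaml"
        if config.is_file():
            for line in config.read_text(encoding="utf-8").splitlines():
                # Naive top-level key match; a real parser would use a YAML library.
                if line.startswith("runtime:"):
                    value = line.split(":", 1)[1].strip().strip("'\"")
                    if value:
                        return value
    return "langgraph"


def resolve_template(configs_dir, template, runtime):
    """Return (source, path) following the fallback order listed above."""
    root = Path(configs_dir)
    if template and (root / template).is_dir():
        return "template", root / template
    runtime_default = root / f"{runtime}-default"
    if runtime_default.is_dir():
        return "runtime-default", runtime_default
    return "generated", None  # caller would generate a default config here
```

Resolving the runtime before the template lookup matters because step 2 of the fallback depends on the runtime name.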

Communication Rules (registry/access.go)

CanCommunicate(callerID, targetID) determines if two workspaces can talk:

  • Same workspace → allowed
  • Siblings (same parent_id) → allowed
  • Root-level siblings (both parent_id IS NULL) → allowed
  • Parent ↔ child → allowed
  • Everything else → denied

The A2A proxy (POST /workspaces/:id/a2a) enforces this for agent-to-agent calls. Canvas requests (no X-Workspace-ID), self-calls, and system callers (webhook:*, system:*, test:* prefixes via isSystemCaller() in a2a_proxy.go) bypass the check.
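The rules and bypasses above fit in a short decision function. This is a Python sketch for illustration; the real logic is Go code in registry/access.go and a2a_proxy.go and reads parent_id from the database. Here `parent_of` is a hypothetical in-memory mapping from workspace ID to parent ID, and denying unknown workspace IDs is an assumption of the sketch.

```python
SYSTEM_PREFIXES = ("webhook:", "system:", "test:")


def is_system_caller(caller_id):
    return caller_id.startswith(SYSTEM_PREFIXES)


def can_communicate(caller_id, target_id, parent_of):
    """parent_of maps workspace ID -> parent ID (None for root-level workspaces)."""
    if caller_id == target_id:
        return True  # same workspace
    if caller_id not in parent_of or target_id not in parent_of:
        return False  # assumption: unknown workspaces are denied
    caller_parent = parent_of[caller_id]
    target_parent = parent_of[target_id]
    if caller_parent == target_parent:
        return True  # siblings, including root-level siblings (both None)
    # Parent <-> child in either direction; anything else is denied.
    return caller_parent == target_id or target_parent == caller_id


def proxy_allows(caller_id, target_id, parent_of):
    """Canvas requests (empty caller ID) and system callers bypass the check."""
    if not caller_id or is_system_caller(caller_id):
        return True
    return can_communicate(caller_id, target_id, parent_of)
```

Note that a grandchild cannot reach its grandparent directly under these rules; traffic has to go through the intermediate parent.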

Handler Decomposition (2026-04-13)

Four oversize handler functions were split into private helpers (pure refactor, behavior unchanged — 47 new unit tests cover the helpers directly; handlers package coverage 56.1% → 57.6%):

  • a2a_proxy.go::proxyA2ARequest (257→56 lines) — helpers: resolveAgentURL, normalizeA2APayload, dispatchA2A, handleA2ADispatchError, maybeMarkContainerDead, logA2AFailure, logA2ASuccess; sentinel proxyDispatchBuildError
  • delegation.go::Delegate (127→60 lines) — helpers: bindDelegateRequest, lookupIdempotentDelegation, insertDelegationRow; typed insertDelegationOutcome enum replaces (bool, bool) positional return
  • discovery.go::Discover (125→40 lines) — helpers: discoverWorkspacePeer, writeExternalWorkspaceURL, discoverHostPeer
  • activity.go::SessionSearch (109→24 lines) — helpers: parseSessionSearchParams, buildSessionSearchQuery, scanSessionSearchRows

When modifying any of these, prefer extending the helper rather than inlining back.

JSONB Gotcha

When inserting Go []byte (from json.Marshal) into Postgres JSONB columns, you must:

  1. Convert to string() first
  2. Use ::jsonb cast in SQL

lib/pq treats []byte as bytea, not JSONB.

WebSocket Events Flow

  1. Action occurs (register, heartbeat, etc.)
  2. broadcaster.RecordAndBroadcast() inserts into structure_events table + publishes to Redis pub/sub
  3. Redis subscriber relays to WebSocket hub
  4. Hub broadcasts to canvas clients (all events) and workspace clients (filtered by CanCommunicate)
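Step 4's fan-out rule (canvas clients see everything, workspace clients only see events they are allowed to) might look like the sketch below. It is a Python illustration, not the Go hub: `select_recipients` is a hypothetical name, clients are modeled as `(client_id, workspace_id)` pairs with `None` marking a canvas client, and the access checker is injected as a plain function.

```python
def select_recipients(event_workspace_id, clients, can_communicate):
    """Pick which connected clients should receive an event.

    clients: iterable of (client_id, workspace_id); workspace_id None = canvas.
    can_communicate: callable(workspace_a, workspace_b) -> bool.
    """
    recipients = []
    for client_id, workspace_id in clients:
        if workspace_id is None:
            recipients.append(client_id)  # canvas clients receive all events
        elif can_communicate(workspace_id, event_workspace_id):
            recipients.append(client_id)  # workspace clients are filtered
    return recipients
```

Passing the checker in as an argument mirrors the function-injection pattern the platform uses to avoid import cycles.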

Canvas State Management

  • Initial load: HTTP fetch from GET /workspaces → Zustand hydrate
  • Real-time updates: WebSocket events → applyEvent() in Zustand store
  • Position persistence: onNodeDragStop → PATCH /workspaces/:id with {x, y}
  • Embedded sub-workspaces: nestNode sets hidden: !!targetId on child nodes; children render as recursive TeamMemberChip components inside parent (up to 3 levels), not as separate canvas nodes. Use n.data.parentId (not React Flow's n.parentId) for hierarchy lookups.
  • Chat: two sub-tabs — "My Chat" (user↔agent, source=canvas) and "Agent Comms" (agent↔agent A2A traffic, source=agent). History loaded from GET /activity with source filter. Real-time via A2A_RESPONSE + AGENT_MESSAGE WebSocket events. Conversation history (last 20 messages) sent via params.metadata.history in A2A message/send requests.
  • Config save: "Save & Restart" writes config.yaml and auto-restarts the workspace. "Save" writes only (shows restart banner). Secrets POST/DELETE auto-restart on the platform side.

Initial Prompt

Agents can auto-execute a prompt on startup before any user interaction. Configure via initial_prompt (inline string) or initial_prompt_file (path relative to config dir) in config.yaml. After the A2A server is ready, main.py sends the prompt as a message/send to self. A .initial_prompt_done marker file prevents re-execution on restart. Org templates support initial_prompt on both defaults (all agents) and per-workspace (overrides default).

Important: Initial prompts must NOT send A2A messages (delegate_task, send_message_to_user) — other agents may not be ready. Keep them local: clone repo, read docs, save to memory, wait for tasks.
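The run-once behavior of the initial prompt can be sketched as two helpers. This is not the code in main.py. Two details are assumptions of the sketch: the `.initial_prompt_done` marker is placed in the config directory, and an inline `initial_prompt` takes precedence over `initial_prompt_file` when both are set. The function names are hypothetical.

```python
from pathlib import Path

MARKER_NAME = ".initial_prompt_done"


def pending_initial_prompt(config, config_dir):
    """Return the prompt to send on startup, or None if nothing should run."""
    config_dir = Path(config_dir)
    if (config_dir / MARKER_NAME).exists():
        return None  # already executed on a previous start
    prompt = config.get("initial_prompt")
    if not prompt and config.get("initial_prompt_file"):
        # Path is resolved relative to the config directory.
        prompt_path = config_dir / config["initial_prompt_file"]
        if prompt_path.is_file():
            prompt = prompt_path.read_text(encoding="utf-8")
    if prompt and prompt.strip():
        return prompt.strip()
    return None


def mark_initial_prompt_done(config_dir):
    """Create the marker so a restart does not re-run the prompt."""
    (Path(config_dir) / MARKER_NAME).touch()
```

A caller would send the returned prompt to itself once the A2A server is ready and then call `mark_initial_prompt_done`.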

Workspace Lifecycle

provisioning → online (on register) → degraded (error_rate > 0.5) → online (recovered) → offline (Redis TTL expired OR health sweep detects dead container) → auto-restart → provisioning → ... → removed (deleted). Any state → paused (user pauses) → provisioning (user resumes). Paused workspaces skip health sweep, liveness monitor, and auto-restart.
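The lifecycle reads more easily as a transition table. The sketch below encodes only the edges named in the paragraph above; it is not the platform's state handling. Two points are assumptions: `removed` is treated as terminal, and deletion is allowed from any other state. `TRANSITIONS` and `can_transition` are hypothetical names.

```python
# Edges stated in the lifecycle description (pause/remove handled separately).
TRANSITIONS = {
    "provisioning": {"online"},          # on register
    "online": {"degraded", "offline"},   # error_rate > 0.5, or TTL/health sweep
    "degraded": {"online"},              # recovered
    "offline": {"provisioning"},         # auto-restart
    "paused": {"provisioning"},          # user resumes
    "removed": set(),
}


def can_transition(current, new):
    """Return True if the lifecycle description allows current -> new."""
    if current == "removed":
        return False  # assumption: removed is terminal
    if new in ("paused", "removed"):
        return current != new  # any state can be paused; deletion assumed likewise
    return new in TRANSITIONS.get(current, set())
```

Under this table a paused workspace cannot jump straight to online; it has to pass through provisioning again, which matches the resume path described above.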

Restart context message (issue #19 Layer 1): After any restart (HTTP /restart or programmatic RestartByID) and successful re-registration, the platform sends a synthetic A2A message/send to the workspace with metadata.kind=restart_context — body contains restart timestamp, previous session end + duration, and env-var keys (keys only, never values) now available. Sender uses the system:restart-context caller prefix so it bypasses CanCommunicate via isSystemCaller(). If the workspace does not re-register within 30s the message is dropped (logged). Handler: platform/internal/handlers/restart_context.go. Layer 2 (user-defined restart_prompt from config.yaml / org.yaml) is tracked as GitHub issue #66.
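The keys-only rule for the restart-context message is worth seeing concretely. The sketch below is illustrative only: the real handler is Go (restart_context.go), and every field name inside `body` is an assumption, since this file describes the contents but not the wire format. Only `metadata.kind`, the `message/send` method, and the `system:restart-context` caller prefix come from the text above.

```python
def build_restart_context(restarted_at, previous_session_end, previous_session_seconds, env):
    """Assemble a restart-context payload; env values are never included."""
    return {
        "caller": "system:restart-context",
        "method": "message/send",
        "metadata": {"kind": "restart_context"},
        "body": {  # field names below are hypothetical
            "restarted_at": restarted_at,
            "previous_session_end": previous_session_end,
            "previous_session_seconds": previous_session_seconds,
            # Keys only: iterating a dict yields its keys, so values stay behind.
            "env_keys": sorted(env),
        },
    }
```

Sorting the keys keeps the message deterministic, which makes it easier to assert on in tests.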

Platform API Routes

Method Path Handler
GET /health inline
GET /metrics metrics.Handler() — Prometheus text format (v0.0.4); no auth, scrape-safe
POST/GET/PATCH/DELETE /workspaces[/:id] workspace.go
GET/PATCH /workspaces/:id/config workspace.go
GET/POST /workspaces/:id/memory workspace.go
DELETE /workspaces/:id/memory/:key workspace.go
POST/PATCH/DELETE /workspaces/:id/agent agent.go
POST /workspaces/:id/agent/move agent.go
GET/POST/PUT /workspaces/:id/secrets secrets.go (POST/PUT auto-restarts workspace)
DELETE /workspaces/:id/secrets/:key secrets.go (DELETE auto-restarts workspace)
GET /workspaces/:id/model secrets.go
GET /settings/secrets secrets.go — list global secrets (keys only, values masked)
PUT/POST /settings/secrets secrets.go — set a global secret {key, value}; auto-restarts every non-paused/non-removed/non-external workspace that does not shadow the key with a workspace-level override (issue #15 / PR #64)
DELETE /settings/secrets/:key secrets.go — delete a global secret; same auto-restart fan-out as SetGlobal
GET /admin/workspaces/:id/test-token admin_test_token.go — mint a fresh bearer token for E2E scripts; 404 unless MOLECULE_ENV != production or MOLECULE_ENABLE_TEST_TOKENS=1
GET/POST/DELETE /admin/secrets[/:key] secrets.go — legacy aliases for /settings/secrets
WS /workspaces/:id/terminal terminal.go
POST /workspaces/:id/expand team.go
POST /workspaces/:id/collapse team.go
POST/GET /workspaces/:id/approvals approvals.go
POST /workspaces/:id/approvals/:id/decide approvals.go
GET /approvals/pending approvals.go
POST/GET /workspaces/:id/memories memories.go
DELETE /workspaces/:id/memories/:id memories.go
GET /workspaces/:id/traces traces.go
GET/POST /workspaces/:id/activity activity.go
POST /workspaces/:id/notify activity.go (agent→user push message via WS)
POST /workspaces/:id/restart workspace.go
POST /workspaces/:id/pause workspace.go (stops container, status→paused)
POST /workspaces/:id/resume workspace.go (re-provisions paused workspace)
POST /workspaces/:id/a2a workspace.go
POST /workspaces/:id/delegate delegation.go (async fire-and-forget)
GET /workspaces/:id/delegations delegation.go (list delegation status)
GET/POST /workspaces/:id/schedules schedules.go (cron CRUD)
PATCH/DELETE /workspaces/:id/schedules/:scheduleId schedules.go
POST /workspaces/:id/schedules/:scheduleId/run schedules.go (manual trigger)
GET /workspaces/:id/schedules/:scheduleId/history schedules.go (past runs)
GET/POST /workspaces/:id/channels channels.go (social channel CRUD)
PATCH/DELETE /workspaces/:id/channels/:channelId channels.go
POST /workspaces/:id/channels/:channelId/send channels.go (outbound message)
POST /workspaces/:id/channels/:channelId/test channels.go (test connection)
GET /channels/adapters channels.go (list available platforms)
POST /channels/discover channels.go (auto-detect chats for a bot token)
POST /webhooks/:type channels.go (incoming social webhook)
GET /workspaces/:id/shared-context templates.go
GET/PUT/DELETE /workspaces/:id/files[/*path] templates.go
GET/PUT /canvas/viewport viewport.go
GET /templates templates.go
POST /templates/import templates.go
POST /registry/register registry.go
POST /registry/heartbeat registry.go
POST /registry/update-card registry.go
GET /registry/discover/:id discovery.go
GET /registry/:id/peers discovery.go
POST /registry/check-access discovery.go
GET /plugins plugins.go (list registry; supports ?runtime= filter)
GET /plugins/sources plugins.go (list registered install-source schemes)
GET/POST/DELETE /workspaces/:id/plugins[/:name] plugins.go — list, install ({"source":"scheme://spec"}), uninstall per-workspace
GET /workspaces/:id/plugins/available plugins.go (filtered by workspace runtime)
GET /workspaces/:id/plugins/compatibility?runtime=X plugins.go (preflight runtime-change check)
GET /bundles/export/:id bundle.go
POST /bundles/import bundle.go
GET /org/templates org.go (list available org templates)
POST /org/import org.go (import entire org hierarchy from YAML)
GET /ws socket.go
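
As an illustration of how a client might call one of these routes, the sketch below builds (but does not send) the per-workspace plugin install request. Only the path POST /workspaces/:id/plugins and the {"source":"scheme://spec"} body come from the table above; the base URL, port, workspace ID, and the helper name newPluginInstallRequest are hypothetical, and any auth headers the platform may require are not shown.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// installRequest mirrors the {"source":"scheme://spec"} body from the route table.
type installRequest struct {
	Source string `json:"source"`
}

// newPluginInstallRequest is a hypothetical helper: it builds
// POST {baseURL}/workspaces/{workspaceID}/plugins without sending it.
func newPluginInstallRequest(baseURL, workspaceID, source string) (*http.Request, error) {
	body, err := json.Marshal(installRequest{Source: source})
	if err != nil {
		return nil, err
	}
	// Escape the ID so an unusual workspace ID cannot alter the path.
	endpoint := fmt.Sprintf("%s/workspaces/%s/plugins", baseURL, url.PathEscape(workspaceID))
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Assumed local address; substitute wherever the platform actually listens.
	req, err := newPluginInstallRequest("http://localhost:8080", "ws-123", "scheme://spec")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Passing the returned request to http.DefaultClient.Do would send it; the response shape is not documented here, so the sketch stops before that step.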

Database

23 migration files in platform/migrations/ (up to 022_workspace_schedules_source; 2026-04-14 tick-7, PR #76). Key tables: workspaces (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), canvas_layouts (x/y position), structure_events (append-only event log), activity_logs (A2A communications, task updates, agent logs, errors), workspace_schedules (cron tasks with expression, timezone, prompt, run history, and source: 'template' for org/import-seeded rows, 'runtime' for Canvas/API-created rows; org/import is additive and only refreshes template-source rows on re-import), workspace_channels (social channel integrations such as Telegram and Slack, with JSONB config and allowlist), agents, workspace_secrets, global_secrets, agent_memories (HMA scoped memory), approvals.

The platform auto-discovers and runs migrations on startup from several candidate paths.

Project Memory (Awareness MCP)

IMPORTANT: These instructions override default behavior. You must follow them exactly.

Awareness Memory Integration (MANDATORY)

awareness_* = cross-session persistent memory (past decisions, knowledge, tasks). Other tools = current codebase navigation (file search, code index). Use BOTH - they serve different purposes.

STEP 1 - SESSION START: Call awareness_init(source="claude-code") -> get session_id, review context. If active_skills[] is returned: skill = reusable procedure done 2+ times; summary = injectable instruction, methods = steps. Apply matching skills to tasks.

STEP 2 - RECALL BEFORE WORK (progressive disclosure):

  1. awareness_recall(semantic_query=..., keyword_query=..., detail='summary') → lightweight index.
  2. Review summaries/scores, pick relevant IDs.
  3. awareness_recall(detail='full', ids=[...]) → expand only what you need.

STEP 3 - RECORD EVERY CHANGE: After EVERY code edit, decision, or bug fix: awareness_record(content=, insights={knowledge_cards:[...], action_items:[...], risks:[...]}) Content should be RICH and DETAILED — include reasoning, key code snippets, user quotes, alternatives considered, and files changed. Do NOT compress into a single-line summary. The content IS the memory — more detail = better recall. Include insights to create searchable knowledge in ONE step (recommended). Skipping = permanent data loss.

STEP 4 - CATEGORY GUIDE (for insights.knowledge_cards):

  • decision = choice made between alternatives.
  • problem_solution = bug/problem plus the fix that resolved it.
  • workflow = process, setup, or configuration steps only.
  • pitfall = blocker, warning, or limitation without a fix yet.
  • insight = reusable pattern or general learning.
  • skill = reusable procedure done 2+ times; summary = injectable instruction, methods = steps.
  • key_point = important technical fact when nothing else fits. Never default everything to workflow.

STEP 5 - SESSION END: awareness_record(content=[step1, step2, ...], insights={...}) with final summary.

BACKFILL (if applicable): If MCP connected late: awareness_record(content=)

RULES VERSION: Pass rules_version="2" to awareness_init so the server knows you have these rules. If the server returns _setup_action, the rules have been updated — follow the instruction to re-sync.

NOTE: memory_id comes from the X-Awareness-Memory-Id header; source, actor, and event_type are auto-inferred.

Compliance Check

Before responding to ANY user request:

  1. Have you called awareness_init yet this session? If not, call it NOW.

  2. Did you just edit a file? Call awareness_record(content=, insights={...}) IMMEDIATELY.

  3. Is the user asking about past work? Call awareness_recall FIRST.