molecule-core/CLAUDE.md

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Molecule AI is a platform for orchestrating AI agent workspaces that form an organizational hierarchy. Workspaces register with a central platform, communicate via A2A protocol, and are visualized on a drag-and-drop canvas.

## Ecosystem Context

Before research, strategy, or design work, skim **`docs/ecosystem-watch.md`** —
it catalogs adjacent agent projects (Holaboss, Hermes, gstack, …) with
overlap / differentiation / terminology-collision notes. Cross-referenced
from `PLAN.md` and `README.md`; it's the canonical starting point for
"what else is out there."

## SaaS ops

When rotating SaaS credentials (Fly / Neon / Upstash / envelope key), read
**`docs/runbooks/saas-secrets.md`** first. It documents which secrets live
in multiple places (e.g. `FLY_API_TOKEN` in both GitHub Actions and `fly
secrets` on `molecule-cp`), the correct rotation order, and danger cases —
notably `SECRETS_ENCRYPTION_KEY`, which cannot be rotated without a data
migration until Phase H lands KMS envelope encryption.

## Agent operating rules (auto-loaded — read first)

The following are project-level rules that override default behavior. They
apply to every conversation in this repo, automated cron tick, and every
subagent the orchestrator spawns.

### Cron / triage discipline

1. **Always read the most recent cron-learnings before reviewing PRs.** Open
   `~/.claude/projects/-Users-hongming-Documents-GitHub-molecule-monorepo/memory/cron-learnings.jsonl`,
   read the last 20 lines. Patterns recur — a finding that was a false-positive
   last tick is likely a false-positive again. A fix that worked last tick is
   likely the fix this tick. The SessionStart hook auto-injects this; read
   anyway when starting a triage from the middle of a conversation.

2. **Treat `docs/sync-*` PRs that touch CLAUDE.md or PLAN.md as ALWAYS
   noteworthy.** Those two files are the agent-facing source of truth — a
   bad merge there silently corrupts every future triage tick. Run code-review
   skill at minimum, ideally cross-vendor-review too.

3. **After any cron tick, write a 1-line reflection** to
   `.claude/per-tick-reflections.md` (gitignored). Format: `2026-MM-DDTHH:MMZ
   — what surprised me / what I'd do differently next tick`. This is for
   YOUR future self; the cron-learnings JSONL is for the operational pattern
   memory. They are distinct.

### Hooks active in this repo

The following ambient guardrails fire automatically (configured in
`.claude/settings.json`). When a hook blocks a tool call, the response will
include a `permissionDecisionReason` — read it carefully before retrying.

| Hook | Event | Effect |
|------|-------|--------|
| `pre-bash-careful.sh` | PreToolUse:Bash | REFUSES `git push --force` to main, `rm -rf` at root/HOME, `DROP TABLE` against prod schema. WARNs on `--force-with-lease`, `gh pr close/issue close`. |
| `pre-edit-freeze.sh` | PreToolUse:Edit/Write | Blocks edits outside the path in `.claude/freeze` if that file exists. Use to lock scope while debugging. |
| `session-start-context.sh` | SessionStart | Auto-loads recent cron-learnings, freeze status, open PR/issue counts. |
| `post-edit-audit.sh` | PostToolUse:Edit/Write | Appends every edit to `.claude/audit.jsonl` (gitignored). |
| `user-prompt-tag.sh` | UserPromptSubmit | Injects warning into context when prompt mentions force-push / drop-table / "delete all" / etc. |
| `subagent-stop-judge.sh` | SubagentStop | Off by default (touch `.claude/judge-subagents` to enable). When on, prompts the orchestrator to verify the subagent's output addresses the original task. |

### Skills active in this repo

These are documented in `.claude/skills/*/SKILL.md`. Invoke explicitly via
the `Skill` tool — they are NOT auto-applied. The cron prompt invokes them
at fixed steps; for ad-hoc work, decide if the skill matches your situation:

- `code-review` — full 16-criteria rubric on a diff
- `cross-vendor-review` — adversarial second-model review (use for noteworthy PRs)
- `careful-mode` — the doc backing the bash hook above
- `cron-learnings` — defines the JSONL format
- `cron-retro` — weekly retrospective generator
- `llm-judge` — score whether a deliverable addresses the request
- `update-docs` — sync repo docs after merges

### Standing rules (inviolable)

- Never push directly to main — use feat/fix/chore/docs branches
- Merge-commits only (`gh pr merge --merge`) — never `--squash` / `--rebase`
- Never commit without explicit user approval EXCEPT on:
  - Open PR branches you're fixing for a gate
  - Issue-pickup branches you opened a draft PR for
  - Docs-sync branches
  - Main is untouchable without a merge
- Dark theme only (no white/light CSS classes; pre-commit hook enforces)
- No native browser dialogs (`confirm`/`alert`/`prompt`) — use `ConfirmDialog`
- Delegate through PM, never bypass hierarchy
- Only PM mounts the repo (`workspace_dir` bind-mount); other agents get isolated Docker volumes

## Architecture

```
Canvas (Next.js :3000) ←WebSocket→ Platform (Go :8080) ←HTTP→ Postgres + Redis
                                                                  ↑
                                   Workspace A ←──A2A──→ Workspace B
                                   (Python agents)
                                        ↑ register/heartbeat ↑
                                        └───── Platform ─────┘
```

Four main components:
- **Platform** (`platform/`): Go/Gin control plane — workspace CRUD, registry, discovery, WebSocket hub, liveness monitoring
- **Canvas** (`canvas/`): Next.js 15 + React Flow (@xyflow/react v12) + Zustand + Tailwind — visual workspace graph
- **Workspace Runtime** (`workspace-template/`): Unified Docker image with pluggable adapter system — supports LangGraph, Claude Code, OpenClaw, DeepAgents, CrewAI, AutoGen. Adapters in `workspace-template/adapters/`. Deps installed at startup via `entrypoint.sh`.
- **molecli** (`platform/cmd/cli/`): Go TUI dashboard (Bubbletea + Lipgloss) — real-time workspace monitoring, event log, health overview, delete/filter operations

## Build & Run Commands

### Infrastructure
```bash
./infra/scripts/setup.sh    # Start Postgres, Redis, Langfuse, Temporal; run migrations
./infra/scripts/nuke.sh     # Tear down everything, remove volumes
```

Infra services (via `docker-compose.infra.yml`, all attached to the shared `molecule-monorepo-net` network — `setup.sh` creates it idempotently):
- **Postgres** `:5432` — primary datastore (also backs Langfuse + Temporal via separate DBs)
- **Redis** `:6379` — pub/sub, heartbeat TTLs
- **Langfuse** `:3001` — LLM trace viewer (backed by Clickhouse)
- **Temporal** `:7233` (gRPC) + `:8233` (Web UI) — durable workflow engine for `workspace-template/builtin_tools/temporal_workflow.py`. **Dev-only posture:** the auto-setup image runs with no auth on `0.0.0.0:7233`; production deployments must gate access via mTLS or an API key / reverse proxy.

### Platform (Go)
```bash
cd platform
go build ./cmd/server       # Build server
go run ./cmd/server          # Run server (requires Postgres + Redis running)
go build -o molecli ./cmd/cli  # Build TUI dashboard
./molecli                    # Run TUI dashboard (requires platform running)
```
Must run from `platform/` directory (not repo root). Env vars: `DATABASE_URL`, `REDIS_URL`, `PORT`, `PLATFORM_URL` (default `http://host.docker.internal:PORT` — passed to agent containers so they can reach the platform), `SECRETS_ENCRYPTION_KEY` (optional AES-256, 32 bytes), `CONFIGS_DIR` (auto-discovered), `PLUGINS_DIR` (deprecated — plugins are now installed per-workspace via API; the `plugins/` registry at repo root is auto-discovered), `ACTIVITY_RETENTION_DAYS` (default `7`), `ACTIVITY_CLEANUP_INTERVAL_HOURS` (default `6`), `CORS_ORIGINS` (comma-separated, default `http://localhost:3000,http://localhost:3001`), `RATE_LIMIT` (requests/min, default `600`), `WORKSPACE_DIR` (optional — global fallback host path for `/workspace` bind-mount; overridden by per-workspace `workspace_dir` column in DB; if neither is set, each workspace gets an isolated Docker named volume), `AWARENESS_URL` (optional — if set, injected into workspace containers along with a deterministic `AWARENESS_NAMESPACE` derived from workspace ID), `MOLECULE_IN_DOCKER` (optional — set to `1` when the platform itself runs inside Docker so the A2A proxy rewrites `127.0.0.1:<port>` URLs to container hostnames; auto-detected via `/.dockerenv`), `MOLECULE_ENV` (optional — set to `production` to hide the `/admin/workspaces/:id/test-token` E2E helper endpoint; unset or any other value leaves it enabled), `MOLECULE_ENABLE_TEST_TOKENS` (optional — set to `1` to force-enable the test-token endpoint even when `MOLECULE_ENV=production`; intended for staging runs only), `MOLECULE_ORG_ID` (optional — the public repo's only SaaS hook. When set to a UUID, every non-allowlisted request must carry a matching `X-Molecule-Org-Id` header or gets a 404; when unset, the guard is a passthrough so self-hosted / dev / CI are unaffected. Set only by the private `molecule-controlplane` provisioner on Fly Machines tenant instances — never by self-hosters).

**Workspace tier resource limits** (issue #14 — override the per-tier memory/CPU caps in `provisioner.ApplyTierConfig`; CPU_SHARES follows Docker's 1024 = 1 CPU convention, translated to NanoCPUs for a hard cap):
- `TIER2_MEMORY_MB` / `TIER2_CPU_SHARES` — Standard tier (defaults `512` / `1024`)
- `TIER3_MEMORY_MB` / `TIER3_CPU_SHARES` — Privileged tier (defaults `2048` / `2048`; previously uncapped)
- `TIER4_MEMORY_MB` / `TIER4_CPU_SHARES` — Full-host tier (defaults `4096` / `4096`; previously uncapped)

**Plugin install safeguards** (bound the cost of a single `POST /workspaces/:id/plugins` install so a slow/malicious source can't tie up a handler):
- `PLUGIN_INSTALL_BODY_MAX_BYTES` — max request body size (default `65536` = 64 KiB)
- `PLUGIN_INSTALL_FETCH_TIMEOUT` — duration string; whole fetch+copy deadline (default `5m`)
- `PLUGIN_INSTALL_MAX_DIR_BYTES` — max staged-tree size (default `104857600` = 100 MiB)

See `docs/plugins/sources.md` for the two-axis source/shape plugin model.

Additional env vars documented in `.env.example` (2026-04-13 sync — all 21 distinct `os.Getenv`/`envx.*` keys now documented): `MOLECULE_ENV`, `GITHUB_WEBHOOK_SECRET`, `MOLECULE_URL` (MCP server target; same semantic as `PLATFORM_URL`).

`molecli` reads `MOLECLI_URL` (default http://localhost:8080) to locate the platform. Logs are written to `molecli.log` in the working directory (already covered by `*.log` in `.gitignore`).

### Canvas (Next.js)
```bash
cd canvas
npm install
npm run dev                  # Dev server on :3000
npm run build && npm start   # Production
```
Env vars: `NEXT_PUBLIC_PLATFORM_URL` (default http://localhost:8080), `NEXT_PUBLIC_WS_URL` (default ws://localhost:8080/ws).

### Workspace Images
```bash
bash workspace-template/build-all.sh                   # Build base + ALL runtime images
bash workspace-template/build-all.sh claude-code       # Build base + specific runtime only
```
Each runtime has its own Docker image extending `workspace-template:base`, with deps pre-installed for fast startup. The base Dockerfile (`workspace-template/Dockerfile`) builds `:base`, then each `adapters/*/Dockerfile` extends it (e.g. `claude_code/Dockerfile` installs the `claude` CLI). **Always use `build-all.sh`** — it builds base first, then all runtimes in order. No `:latest` tag — each runtime uses its own tag to avoid confusion.

| Runtime | Image Tag | Key Deps |
|---------|-----------|----------|
| langgraph | `workspace-template:langgraph` | langchain-anthropic, langgraph |
| claude-code | `workspace-template:claude-code` | claude-agent-sdk (pip), @anthropic-ai/claude-code (npm) |
| openclaw | `workspace-template:openclaw` | openclaw deps |
| crewai | `workspace-template:crewai` | crewai |
| autogen | `workspace-template:autogen` | autogen |
| deepagents | `workspace-template:deepagents` | deepagents |
| hermes | `workspace-template:hermes` | openai (OpenAI-compatible client; Nous Portal via `HERMES_API_KEY` or OpenRouter via `OPENROUTER_API_KEY` fallback) |

Templates are framework presets in `workspace-configs-templates/`: `claude-code-default`, `langgraph`, `openclaw`, `deepagents`. Agent roles are configured after deployment via Config tab or API.

For Claude Code runtime, write your OAuth token to `workspace-configs-templates/claude-code-default/.auth-token`.

### Pre-commit Hook
```bash
git config core.hooksPath .githooks            # Install hooks (agents do this via initial_prompt)
```
Enforces: `'use client'` on hook-using `.tsx` files, dark theme (no white/light), no SQL injection (`fmt.Sprintf` with SQL), no leaked secrets (`sk-ant-`, `ghp_`, `AKIA`). Commit is rejected until violations are fixed — agents cannot bypass this.

### Plugins
Shared plugins in `plugins/` are auto-loaded by every workspace:
- **`molecule-dev`**: Codebase conventions (rules injected into CLAUDE.md) + `review-loop` skill for multi-round QA cycles
- **`superpowers`**: `verification-before-completion`, `test-driven-development`, `systematic-debugging`, `writing-plans`
- **`ecc`**: General Claude Code guardrails
- **`browser-automation`**: Puppeteer/CDP-based web scraping and live canvas screenshots (opt-in per workspace — wired into Research + UIUX roles in `org-templates/molecule-dev/org.yaml`)

**Modular guardrails** (Claude Code only — pick what you need, or install several):

*Hook plugins (ambient enforcement at the harness layer)*
- **`molecule-careful-bash`** — REFUSES `git push --force` to main, `rm -rf` at root, `DROP TABLE` against prod schema. Ships the `careful-mode` skill as documentation.
- **`molecule-freeze-scope`** — locks edits to a single path glob via `.claude/freeze`. Useful while debugging.
- **`molecule-audit-trail`** — appends every Edit/Write to `.claude/audit.jsonl` for accountability.
- **`molecule-session-context`** — auto-loads recent cron-learnings + open PR/issue counts at session start. Pairs with `molecule-skill-cron-learnings`.
- **`molecule-prompt-watchdog`** — injects warning context when the user prompt mentions destructive keywords ("force push", "drop table", "delete all", etc).

*Skill plugins (on-demand, via the `Skill` tool)*
- **`molecule-skill-code-review`** — 16-criteria multi-axis review.
- **`molecule-skill-cross-vendor-review`** — adversarial second-model review (use for noteworthy PRs).
- **`molecule-skill-llm-judge`** — score whether a deliverable addresses the request.
- **`molecule-skill-update-docs`** — sync repo docs after merges.
- **`molecule-skill-cron-learnings`** — defines the operational-memory JSONL format consumed by `molecule-session-context`.

*Workflow plugins (slash commands that compose skills)*
- **`molecule-workflow-triage`** — `/triage` runs a full PR-triage cycle (gates 1–7 + code-review + merge if green). Recommends installing `molecule-skill-code-review` + `molecule-skill-cron-learnings` first.
- **`molecule-workflow-retro`** — `/retro` posts a weekly retrospective issue. Recommends `molecule-skill-cron-learnings` first.

These are distilled from the harness-level guardrails the orchestrator uses on itself. A workspace can install one (e.g., just `molecule-careful-bash` for safety) or stack the full set for the same posture as the Molecule AI orchestrator.

**Org-template plugin resolution (PR #71, issue #68):** per-workspace `plugins:` lists in `org-templates/*/org.yaml` role overrides **UNION** with `defaults.plugins` (deduplicated, defaults first) — they do **not** REPLACE them. To opt a specific default out for a given role/workspace, prefix the plugin name with `!` or `-` (e.g. `!browser-automation`). Implemented by `mergePlugins` in `platform/internal/handlers/org.go`.

### Scripts
```bash
bash scripts/setup-default-org.sh              # Create PM + 3 teams (Marketing/Research/Dev) via API
OPENAI_API_KEY=... bash scripts/test-a2a-cross-runtime.sh  # E2E: Claude Code ↔ OpenClaw A2A test
OPENAI_API_KEY=... bash scripts/test-team-e2e.sh           # E2E: Multi-template team + A2A
```

### Unit Tests
```bash
cd platform && go test -race ./...               # 746 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth, middleware — sqlmock + miniredis; +6 on 2026-04-14 tick-8 for TestTenantGuard_* covering MOLECULE_ORG_ID passthrough/match/mismatch/missing/allowlist/exact-match (#78, Phase 32 PR #1); prior: +9 tick-7 for category_routing + schedules.source; +5 tick-6 for plugins UNION; +6 tick-4 for auto-restart + restart-context branches)
cd canvas && npm test                            # 357 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements)
cd workspace-template && python -m pytest -v     # 1140 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging)
cd sdk/python && python -m pytest -v              # 132 SDK tests (agentskills.io spec validator, CLI, AgentskillsAdaptor round-trip, workspace/org/channel validators, RemoteAgentClient Phase 30 flows)
cd mcp-server && npm test                        # 97 Jest tests (per-domain tool modules + smoke test on tool count)
```

### Integration Tests
```bash
bash tests/e2e/test_api.sh             # 62 API tests against localhost:8080 (Phase 30.1 bearer-token auth aware; shellcheck-clean; also runs in CI `e2e-api` job)
bash tests/e2e/test_a2a_e2e.sh         # 22 A2A end-to-end tests (requires 2 online agents)
bash tests/e2e/test_activity_e2e.sh    # 25 activity/task E2E tests (requires 1 online agent; re-registers detected agent to capture bearer token)
bash tests/e2e/test_comprehensive_e2e.sh # 67 checks — ALL endpoints, memory, runtime, bundles, approvals (registers workspaces immediately after create to beat the provisioner token race)
```
All five E2E scripts share `tests/e2e/_lib.sh` + `tests/e2e/_extract_token.py` helpers and are shellcheck-clean. `test_api.sh` is the quick local-verify command — use it after any platform change. Tests full CRUD, registry, heartbeat, discovery, peers, access control, events, degraded/recovery lifecycle, activity logging, current task tracking, bundle round-trip (export → delete → import → verify).

**Phase 30.1 / 30.6 auth callout (future-proofing):** `/registry/heartbeat` and `/registry/update-card` require `Authorization: Bearer <token>` once a workspace has any live token on file (Phase 30.1 — legacy workspaces grandfathered). `/registry/discover/:id` and `/registry/:id/peers` additionally require `X-Workspace-ID` + bearer token on the caller side (Phase 30.6 — fail-open on DB hiccup since hierarchy check is primary). If you change these routes, update `tests/e2e/test_api.sh` and `docs/api-protocol/platform-api.md` in the same PR.

`test_a2a_e2e.sh` requires platform + two provisioned agents (Echo Agent, SEO Agent) running with a valid `OPENROUTER_API_KEY`. Tests message/send, JSON-RPC wrapping, error handling, peer discovery, agent cards, heartbeat. Timeout configurable via `A2A_TIMEOUT` env var (default 120s).

`test_activity_e2e.sh` requires platform + one online agent. Tests A2A communication logging (request/response capture, duration, method), agent self-reported activity, type filtering, current task visibility via heartbeat, cross-workspace activity isolation, edge cases.

### MCP Server
```bash
cd mcp-server
npm install && npm run build   # Build MCP server
node dist/index.js             # Run (stdio transport)
```
Exposes **87 tools** for managing Molecule AI from Claude Code, Cursor, Codex, or any MCP client. Includes workspace CRUD, async delegation, plugins (install/uninstall/list), global secrets, pause/resume, org import, A2A chat, approvals, memory, files, config, discovery, bundles, templates, traces, activity logs, remote agents (Phase 30), and social channels (add/update/remove/send/test). Configured in `.mcp.json`. Env: `MOLECULE_URL` (default http://localhost:8080).

**Structure (refactored 2026-04-13, PRs #2/#4/#7):** `src/index.ts` shrank from 1697 → 89 lines and now only wires `createServer()`. Per-domain tool modules live in `src/tools/`: `workspaces.ts`, `agents.ts`, `secrets.ts`, `files.ts`, `memory.ts`, `plugins.ts`, `channels.ts`, `delegation.ts`, `schedules.ts`, `approvals.ts`, `discovery.ts`, `remote_agents.ts`. Each exports its handlers and a `registerXxxTools(srv)` function. Shared HTTP layer in `src/api.ts` (`PLATFORM_URL`, `apiCall<T>`, `ApiError`, `isApiError()`, `toMcpResult()`, `toMcpText()`). When adding a tool, pick the matching domain file or create a new one and wire it in `createServer()`.

### CI Pipeline
GitHub Actions (`.github/workflows/ci.yml`) runs on push to main and PRs:
- **platform-build**: Go build, vet, `go test -race` with coverage profiling (25% baseline threshold; `setup-go` uses module cache)
- **canvas-build**: npm build, `vitest run` (no `--passWithNoTests` -- tests must exist and pass)
- **mcp-server-build**: npm build
- **python-lint**: `pytest --cov=. --cov-report=term-missing` (pytest-cov enabled)
- **e2e-api** (added 2026-04-13): spins up Postgres + Redis service containers, runs platform migrations via `docker exec`, then executes `tests/e2e/test_api.sh` against a locally-built binary (62/62 must pass)
- **shellcheck** (added 2026-04-13): lints every `tests/e2e/*.sh` via the shellcheck marketplace action
- **publish-platform-image** (`.github/workflows/publish-platform-image.yml`, added 2026-04-14 tick-9): on push to main touching `platform/**`, builds `platform/Dockerfile` and pushes to `ghcr.io/molecule-ai/platform:latest` + `:sha-<short>`. Used by the private `molecule-controlplane` provisioner as tenant VM image. Manual re-trigger via `workflow_dispatch`.

### Docker Compose
```bash
docker compose -f docker-compose.infra.yml up -d    # Infra only
docker compose up                                     # Full stack
```

## Key Architectural Patterns

### Import Cycle Prevention
The platform uses function injection to avoid Go import cycles between ws, registry, and events packages:
- `ws.NewHub(canCommunicate AccessChecker)` — Hub accepts `registry.CanCommunicate` as a function
- `registry.StartLivenessMonitor(ctx, onOffline OfflineHandler)` — Liveness accepts broadcaster callback
- `registry.StartHealthSweep(ctx, checker ContainerChecker, interval, onOffline)` — Health sweep accepts Docker checker interface
- Wiring happens in `platform/cmd/server/main.go` — init order: `wh → onWorkspaceOffline → liveness/healthSweep → router`

### Container Health Detection
Three layers detect dead containers (e.g. Docker Desktop crash):
1. **Passive (Redis TTL):** 60s heartbeat key expires → liveness monitor → auto-restart
2. **Proactive (Health Sweep):** `registry.StartHealthSweep` polls Docker API every 15s → catches dead containers faster
3. **Reactive (A2A Proxy):** On connection error, checks `provisioner.IsRunning()` → immediate offline + restart

All three call `onWorkspaceOffline` which broadcasts `WORKSPACE_OFFLINE` + `go wh.RestartByID()`. Redis cleanup uses shared `db.ClearWorkspaceKeys()`.

### Template Resolution (Create)
Runtime detection happens **before** DB insert: if `payload.Runtime` is empty and a template is specified, the handler reads `runtime:` from `configsDir/template/config.yaml` first. If still empty, defaults to `"langgraph"`. This ensures the correct runtime (e.g. `claude-code`) is persisted in the DB and used for container image selection.

When a workspace specifies a template that doesn't exist, the Create handler falls back:
1. Check `os.Stat(configsDir/template)` — use if exists
2. Try `{runtime}-default` template (e.g. `claude-code-default/`)
3. Generate default config via `ensureDefaultConfig()` (includes `.auth-token` copy for CLI runtimes)

### Communication Rules (`registry/access.go`)
`CanCommunicate(callerID, targetID)` determines if two workspaces can talk:
- Same workspace → allowed
- Siblings (same parent_id) → allowed
- Root-level siblings (both parent_id IS NULL) → allowed
- Parent ↔ child → allowed
- Everything else → denied

The A2A proxy (`POST /workspaces/:id/a2a`) enforces this for agent-to-agent calls. Canvas requests (no `X-Workspace-ID`), self-calls, and system callers (`webhook:*`, `system:*`, `test:*` prefixes via `isSystemCaller()` in `a2a_proxy.go`) bypass the check.

### Handler Decomposition (2026-04-13)
Four oversize handler functions were split into private helpers (pure refactor, behavior unchanged — 47 new unit tests cover the helpers directly; `handlers` package coverage 56.1% → 57.6%):
- `a2a_proxy.go::proxyA2ARequest` (257→56 lines) — helpers: `resolveAgentURL`, `normalizeA2APayload`, `dispatchA2A`, `handleA2ADispatchError`, `maybeMarkContainerDead`, `logA2AFailure`, `logA2ASuccess`; sentinel `proxyDispatchBuildError`
- `delegation.go::Delegate` (127→60 lines) — helpers: `bindDelegateRequest`, `lookupIdempotentDelegation`, `insertDelegationRow`; typed `insertDelegationOutcome` enum replaces `(bool, bool)` positional return
- `discovery.go::Discover` (125→40 lines) — helpers: `discoverWorkspacePeer`, `writeExternalWorkspaceURL`, `discoverHostPeer`
- `activity.go::SessionSearch` (109→24 lines) — helpers: `parseSessionSearchParams`, `buildSessionSearchQuery`, `scanSessionSearchRows`

When modifying any of these, prefer extending the helper rather than inlining back.

### JSONB Gotcha
When inserting Go `[]byte` (from `json.Marshal`) into Postgres JSONB columns, you must:
1. Convert to `string()` first
2. Use `::jsonb` cast in SQL

lib/pq treats `[]byte` as `bytea`, not JSONB.

### WebSocket Events Flow
1. Action occurs (register, heartbeat, etc.)
2. `broadcaster.RecordAndBroadcast()` inserts into `structure_events` table + publishes to Redis pub/sub
3. Redis subscriber relays to WebSocket hub
4. Hub broadcasts to canvas clients (all events) and workspace clients (filtered by CanCommunicate)

### Canvas State Management
- Initial load: HTTP fetch from `GET /workspaces` → Zustand hydrate
- Real-time updates: WebSocket events → `applyEvent()` in Zustand store
- Position persistence: `onNodeDragStop` → `PATCH /workspaces/:id` with `{x, y}`
- Embedded sub-workspaces: `nestNode` sets `hidden: !!targetId` on child nodes; children render as recursive `TeamMemberChip` components inside parent (up to 3 levels), not as separate canvas nodes. Use `n.data.parentId` (not React Flow's `n.parentId`) for hierarchy lookups.
- Chat: two sub-tabs — "My Chat" (user↔agent, `source=canvas`) and "Agent Comms" (agent↔agent A2A traffic, `source=agent`). History loaded from `GET /activity` with source filter. Real-time via `A2A_RESPONSE` + `AGENT_MESSAGE` WebSocket events. Conversation history (last 20 messages) sent via `params.metadata.history` in A2A `message/send` requests.
- Config save: "Save & Restart" writes config.yaml and auto-restarts the workspace. "Save" writes only (shows restart banner). Secrets POST/DELETE auto-restart on the platform side.

### Initial Prompt
Agents can auto-execute a prompt on startup before any user interaction. Configure via `initial_prompt` (inline string) or `initial_prompt_file` (path relative to config dir) in `config.yaml`. After the A2A server is ready, `main.py` sends the prompt as a `message/send` to self. A `.initial_prompt_done` marker file prevents re-execution on restart. Org templates support `initial_prompt` on both `defaults` (all agents) and per-workspace (overrides default).

**Important:** Initial prompts must NOT send A2A messages (delegate_task, send_message_to_user) — other agents may not be ready. Keep them local: clone repo, read docs, save to memory, wait for tasks.

### Workspace Lifecycle
`provisioning` → `online` (on register) → `degraded` (error_rate > 0.5) → `online` (recovered) → `offline` (Redis TTL expired OR health sweep detects dead container) → auto-restart → `provisioning` → ... → `removed` (deleted). Any state → `paused` (user pauses) → `provisioning` (user resumes). Paused workspaces skip health sweep, liveness monitor, and auto-restart.

**Restart context message (issue #19 Layer 1):** After any restart (HTTP `/restart` or programmatic `RestartByID`) and successful re-registration, the platform sends a synthetic A2A `message/send` to the workspace with `metadata.kind=restart_context` — body contains restart timestamp, previous session end + duration, and env-var keys (keys only, never values) now available. Sender uses the `system:restart-context` caller prefix so it bypasses `CanCommunicate` via `isSystemCaller()`. If the workspace does not re-register within 30s the message is dropped (logged). Handler: `platform/internal/handlers/restart_context.go`. Layer 2 (user-defined `restart_prompt` from `config.yaml` / `org.yaml`) is tracked as GitHub issue #66.

## Platform API Routes

| Method | Path | Handler |
|--------|------|---------|
| GET | /health | inline |
| GET | /metrics | metrics.Handler() — Prometheus text format (v0.0.4); no auth, scrape-safe |
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go |
| GET/PATCH | /workspaces/:id/config | workspace.go |
| GET/POST | /workspaces/:id/memory | workspace.go |
| DELETE | /workspaces/:id/memory/:key | workspace.go |
| POST/PATCH/DELETE | /workspaces/:id/agent | agent.go |
| POST | /workspaces/:id/agent/move | agent.go |
| GET/POST/PUT | /workspaces/:id/secrets | secrets.go (POST/PUT auto-restarts workspace) |
| DELETE | /workspaces/:id/secrets/:key | secrets.go (DELETE auto-restarts workspace) |
| GET | /workspaces/:id/model | secrets.go |
| GET | /settings/secrets | secrets.go — list global secrets (keys only, values masked) |
| PUT/POST | /settings/secrets | secrets.go — set a global secret {key, value}; auto-restarts every non-paused/non-removed/non-external workspace that does not shadow the key with a workspace-level override (issue #15 / PR #64) |
| DELETE | /settings/secrets/:key | secrets.go — delete a global secret; same auto-restart fan-out as SetGlobal |
| GET | /admin/workspaces/:id/test-token | admin_test_token.go — mint a fresh bearer token for E2E scripts; 404 unless `MOLECULE_ENV != production` or `MOLECULE_ENABLE_TEST_TOKENS=1` |
| GET/POST/DELETE | /admin/secrets[/:key] | secrets.go — legacy aliases for /settings/secrets |
| WS | /workspaces/:id/terminal | terminal.go |
| POST | /workspaces/:id/expand | team.go |
| POST | /workspaces/:id/collapse | team.go |
| POST/GET | /workspaces/:id/approvals | approvals.go |
| POST | /workspaces/:id/approvals/:id/decide | approvals.go |
| GET | /approvals/pending | approvals.go |
| POST/GET | /workspaces/:id/memories | memories.go |
| DELETE | /workspaces/:id/memories/:id | memories.go |
| GET | /workspaces/:id/traces | traces.go |
| GET/POST | /workspaces/:id/activity | activity.go |
| POST | /workspaces/:id/notify | activity.go (agent→user push message via WS) |
| POST | /workspaces/:id/restart | workspace.go |
| POST | /workspaces/:id/pause | workspace.go (stops container, status→paused) |
| POST | /workspaces/:id/resume | workspace.go (re-provisions paused workspace) |
| POST | /workspaces/:id/a2a | workspace.go |
| POST | /workspaces/:id/delegate | delegation.go (async fire-and-forget) |
| GET | /workspaces/:id/delegations | delegation.go (list delegation status) |
| GET/POST | /workspaces/:id/schedules | schedules.go (cron CRUD) |
| PATCH/DELETE | /workspaces/:id/schedules/:scheduleId | schedules.go |
| POST | /workspaces/:id/schedules/:scheduleId/run | schedules.go (manual trigger) |
| GET | /workspaces/:id/schedules/:scheduleId/history | schedules.go (past runs) |
| GET/POST | /workspaces/:id/channels | channels.go (social channel CRUD) |
| PATCH/DELETE | /workspaces/:id/channels/:channelId | channels.go |
| POST | /workspaces/:id/channels/:channelId/send | channels.go (outbound message) |
| POST | /workspaces/:id/channels/:channelId/test | channels.go (test connection) |
| GET | /channels/adapters | channels.go (list available platforms) |
| POST | /channels/discover | channels.go (auto-detect chats for a bot token) |
| POST | /webhooks/:type | channels.go (incoming social webhook) |
| GET | /workspaces/:id/shared-context | templates.go |
| GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go |
| GET/PUT | /canvas/viewport | viewport.go |
| GET | /templates | templates.go |
| POST | /templates/import | templates.go |
| POST | /registry/register | registry.go |
| POST | /registry/heartbeat | registry.go |
| POST | /registry/update-card | registry.go |
| GET | /registry/discover/:id | discovery.go |
| GET | /registry/:id/peers | discovery.go |
| POST | /registry/check-access | discovery.go |
| GET | /plugins | plugins.go (list registry; supports `?runtime=` filter) |
| GET | /plugins/sources | plugins.go (list registered install-source schemes) |
| GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install (`{"source":"scheme://spec"}`), uninstall per-workspace |
| GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) |
| GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) |
| GET | /bundles/export/:id | bundle.go |
| POST | /bundles/import | bundle.go |
| GET | /org/templates | org.go (list available org templates) |
| POST | /org/import | org.go (import entire org hierarchy from YAML) || GET | /events[/:workspaceId] | events.go |
| GET | /ws | socket.go |

## Database

23 migration files in `platform/migrations/` (up to `022_workspace_schedules_source` — 2026-04-14 tick-7, PR #76). Key tables: `workspaces` (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), `canvas_layouts` (x/y position), `structure_events` (append-only event log), `activity_logs` (A2A communications, task updates, agent logs, errors), `workspace_schedules` (cron tasks with expression, timezone, prompt, run history, and `source` — `'template'` for org/import-seeded, `'runtime'` for Canvas/API-created; org/import is additive and only refreshes template-source rows on re-import), `workspace_channels` (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), `agents`, `workspace_secrets`, `global_secrets`, `agent_memories` (HMA scoped memory), `approvals`.

The platform auto-discovers and runs migrations on startup from several candidate paths.

<!-- AWARENESS_RULES_START -->
# Project Memory (Awareness MCP)

> IMPORTANT: These instructions override default behavior. You must follow them exactly.

## Awareness Memory Integration (MANDATORY)

awareness_* = cross-session persistent memory (past decisions, knowledge, tasks).
Other tools = current codebase navigation (file search, code index).
Use BOTH - they serve different purposes.

STEP 1 - SESSION START:
  Call awareness_init(source="claude-code") -> get session_id, review context.
  If active_skills[] is returned: skill = reusable procedure done 2+ times;
  summary = injectable instruction, methods = steps. Apply matching skills to tasks.

STEP 2 - RECALL BEFORE WORK (progressive disclosure):
  1. awareness_recall(semantic_query=..., keyword_query=..., detail='summary') → lightweight index.
  2. Review summaries/scores, pick relevant IDs.
  3. awareness_recall(detail='full', ids=[...]) → expand only what you need.

STEP 3 - RECORD EVERY CHANGE:
  After EVERY code edit, decision, or bug fix:
  awareness_record(content=<detailed natural language description>,
    insights={knowledge_cards:[...], action_items:[...], risks:[...]})
  Content should be RICH and DETAILED — include reasoning, key code snippets,
  user quotes, alternatives considered, and files changed. Do NOT compress into
  a single-line summary. The content IS the memory — more detail = better recall.
  Include insights to create searchable knowledge in ONE step (recommended).
  Skipping = permanent data loss.

STEP 4 - CATEGORY GUIDE (for insights.knowledge_cards):
  - decision = choice made between alternatives.
  - problem_solution = bug/problem plus the fix that resolved it.
  - workflow = process, setup, or configuration steps only.
  - pitfall = blocker, warning, or limitation without a fix yet.
  - insight = reusable pattern or general learning.
  - skill = reusable procedure done 2+ times; summary = injectable instruction, methods = steps.
  - key_point = important technical fact when nothing else fits.
  Never default everything to workflow.

STEP 5 - SESSION END:
  awareness_record(content=[step1, step2, ...], insights={...}) with final summary.

BACKFILL (if applicable):
  If MCP connected late: awareness_record(content=<transcript>)

RULES VERSION: Pass rules_version="2" to awareness_init so the server knows you have these rules.
If the server returns _setup_action, the rules have been updated — follow the instruction to re-sync.

NOTE: memory_id from X-Awareness-Memory-Id header. source/actor/event_type auto-inferred.

## Compliance Check

Before responding to ANY user request:

1. Have you called awareness_init yet this session? If not, call it NOW.

2. Did you just edit a file? Call awareness_record(content=<detailed description>, insights={...}) IMMEDIATELY.

3. Is the user asking about past work? Call awareness_recall FIRST.
<!-- AWARENESS_RULES_END -->