# 2026-04-10 Session

## Summary

Documentation maintenance for the new long-form Molecule AI product and technical narratives. Moved both repository-root drafts into the VitePress docs tree, added sidebar and homepage entry points so they are discoverable from the docs site, and linked them from the product overview so they can be maintained inside `docs/`.

Also brought the landing-page messaging report under docs maintenance by tracking `docs/product/landing-messaging-report.md` in git and adding it to the product navigation surface.

## Changes

### New Long-Form Docs Added To `docs/`
- Moved `MOLECULE_PRODUCT_DOC.md` into `docs/product/molecule-product-doc.md`
- Moved `MOLECULE_TECHNICAL_DOC.md` into `docs/architecture/molecule-technical-doc.md`
- Kept the full source content intact while relocating it into the maintained docs structure

### VitePress Navigation Updated
- `docs/.vitepress/config.ts`
  - Added `Product Narrative` under the Product sidebar group
  - Added `Landing Messaging Report` under the Product sidebar group
  - Added `Technical Documentation` under the Architecture sidebar group

### Docs Entry Points Updated
- `docs/index.md`
  - Added homepage recommended-reading links for the new product and technical documents

### Product Overview Cross-Links Updated
- `docs/product/overview.md`
  - Added direct links to the product narrative, landing messaging report, and comprehensive technical documentation

### Additional Product Doc Tracked
- Added `docs/product/landing-messaging-report.md` to version control under the Product docs section

## Files Changed
- `docs/.vitepress/config.ts`
- `docs/index.md`
- `docs/product/overview.md`
- `docs/product/landing-messaging-report.md`
- `docs/product/molecule-product-doc.md`
- `docs/architecture/molecule-technical-doc.md`
- `docs/edit-history/2026-04-10.md` (new)

---

## CEO Session — Infrastructure Audit + Chain Break Fix

### Infra Audit (fix/infra-audit-critical — PR #5)

Comprehensive codebase audit identified 19 issues across 4 priority levels. Critical fixes:

1. **Race condition in crypto/aes.go** — `encryptionKey` global accessed without sync. Fixed with `sync.Once`. Added `ResetForTesting()` for tests.
2. **Missing DB indexes** — Migration 014: `workspaces(parent_id)`, `workspaces(status)`, `canvas_layouts(workspace_id)`. Speeds up hierarchy queries, cascade deletes, list/get joins.
3. **N+1 cascade delete** — Replaced the per-child `UPDATE`+`DELETE` loop with a single recursive CTE batch query. Docker container stops are still issued once per child.
4. **CI linting** — Added `golangci-lint` step (continue-on-error until codebase clean).
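
The recursive CTE approach in item 3 can be sketched as follows. This is an illustration only: the `workspaces(id, parent_id)` layout is a simplified assumption, the function names are hypothetical, and SQLite in-memory is used purely so the sketch runs standalone. The actual query in `workspace.go` is not reproduced here.

```python
import sqlite3

# Hypothetical, simplified schema: only the columns needed to show the idea.
def make_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE workspaces (id TEXT PRIMARY KEY, parent_id TEXT)")
    conn.execute("CREATE INDEX idx_workspaces_parent ON workspaces(parent_id)")
    conn.executemany(
        "INSERT INTO workspaces VALUES (?, ?)",
        [("ceo", None), ("pm", "ceo"), ("dev", "pm"), ("qa", "pm"), ("other", None)],
    )
    return conn

# Recursive CTE that walks from the root down through every descendant.
SUBTREE_CTE = """
WITH RECURSIVE subtree(id) AS (
    SELECT id FROM workspaces WHERE id = ?
    UNION ALL
    SELECT w.id FROM workspaces w JOIN subtree s ON w.parent_id = s.id
)
"""

def delete_subtree(conn: sqlite3.Connection, root_id: str) -> list:
    """Delete a workspace and all descendants in one statement; return removed IDs."""
    ids = [row[0] for row in conn.execute(SUBTREE_CTE + "SELECT id FROM subtree", (root_id,))]
    conn.execute(
        SUBTREE_CTE + "DELETE FROM workspaces WHERE id IN (SELECT id FROM subtree)",
        (root_id,),
    )
    conn.commit()
    return ids
```

The returned ID list is what a caller could still loop over for the per-child container stops.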

### Chain Break Root Cause + Fix

**Problem:** Delegation chain died after first result. PM delegated to Dev Lead + QA, results completed, heartbeat wrote results file — but PM was never woken again.

**Root cause:** Self-message cooldown was 5 minutes. First delegation triggered a self-message within the window. All subsequent completions were blocked by cooldown. PM never woke up to report.

**Fix:** Reduced `SELF_MESSAGE_COOLDOWN` from 300s to 60s. With 30s heartbeat cycles, new results trigger a self-message within 1-2 cycles. Results file dedup prevents double-processing.
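
A minimal sketch of how a cooldown plus result dedup could interact, assuming a gate object checked once per heartbeat cycle. The class and method names are hypothetical and the real logic in `workspace/heartbeat.py` may differ; only the 60s value and the 30s cycle come from the notes above.

```python
import time

SELF_MESSAGE_COOLDOWN = 60  # seconds (was 300 before the fix)

class SelfMessageGate:
    """Hypothetical helper: decides whether new results may trigger a self-message."""

    def __init__(self, cooldown=SELF_MESSAGE_COOLDOWN, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self.last_sent = None
        self.seen_results = set()

    def should_wake(self, result_ids):
        """Return the result IDs to report now, or [] if nothing new or still cooling down."""
        new = [r for r in result_ids if r not in self.seen_results]
        if not new:
            return []
        now = self.clock()
        if self.last_sent is not None and now - self.last_sent < self.cooldown:
            return []  # results stay pending and are retried on the next cycle
        self.last_sent = now
        self.seen_results.update(new)  # dedup: never report the same result twice
        return new
```

With a 300s cooldown the same gate would suppress ten consecutive 30s cycles, which matches the stall described in the root cause.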

### Agent-Authored PRs Received

Agents autonomously created PRs while CEO did infra work:
- **PR #3** — Settings Panel (Frontend Engineer): 34 files, 279 tests, full UX spec implementation
- **PR #4** — Onboarding Interception (Frontend Engineer): 10 files, 1362 additions, deploy preflight + missing keys modal

### Monitoring

- 13/13 workspaces online throughout session
- Heartbeats active (Redis TTL refreshing)
- Frontend Engineer + QA Engineer were actively processing tasks
- No container crashes, no degraded workspaces

## Files Changed (CEO Session)
- `workspace-server/internal/crypto/aes.go` (sync.Once)
- `workspace-server/internal/crypto/aes_test.go` (ResetForTesting)
- `workspace-server/internal/handlers/workspace.go` (recursive CTE delete)
- `workspace-server/internal/handlers/workspace_test.go` (updated mocks)
- `workspace-server/migrations/014_indexes.sql` (new — 3 indexes)
- `.github/workflows/ci.yml` (golangci-lint)
- `workspace/heartbeat.py` (60s cooldown, parent reporting, cached lookup)
- `workspace-server/internal/handlers/plugins_test.go` (new — 16 tests)
- `CLAUDE.md` (test counts: Go 365+, Python 869, migration 14)
- `docs/api-protocol/registry-and-heartbeat.md` (delegation checking section)

### Delegation Chain — Last Mile Fix

**Problem:** PM received delegation results but never reported to CEO. The heartbeat self-message said "report back to them" without specifying who.

**Fix:** Heartbeat looks up parent workspace name (cached after first call) and includes explicit instruction: "Report these results back to your parent 'CEO'." This closes the full chain: CEO → PM → team → results → PM wakes → reports to CEO.
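
The cached lookup can be sketched like this. The class name and the `fetch_parent_name` callable are hypothetical stand-ins for whatever the heartbeat uses to query the platform; returning an empty instruction when there is no parent is also an assumption.

```python
class ParentReporter:
    """Hypothetical helper: fetches the parent name once and reuses it."""

    def __init__(self, fetch_parent_name):
        self._fetch = fetch_parent_name
        self._cached = None
        self._loaded = False

    def parent_name(self):
        if not self._loaded:
            self._cached = self._fetch()  # only the first call hits the platform
            self._loaded = True
        return self._cached

    def instruction(self):
        name = self.parent_name()
        if name is None:
            return ""  # assumption: a root workspace has no one to report to
        return f"Report these results back to your parent '{name}'."
```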

### Plugins Handler Tests (16 new)

Covered: ListRegistry (empty/nonexistent/with plugins), Install validation (missing name, path traversal, not found), Uninstall validation, validatePluginName (valid/slash/dotdot/backslash/empty), parseManifestYAML (valid/invalid/minimal).

### Agent PRs Completed

Team autonomously completed test plan checklists:
- PR #3 (Settings Panel): 9/9 tasks ✅
- PR #4 (Onboarding): 10/10 tasks ✅

Chain worked: CEO → PM → Dev Lead → FE + QA → PRs updated → all checklists done → PM reported back.

### Root Scripts Cleanup

Deleted 4 dead scripts replaced by platform features:
- `setup-org.sh`, `setup_reno_stars.sh` → `POST /org/import`
- `import-ecc.sh` → plugin system
- `scripts/setup-default-org.sh` → `POST /org/import`

Moved utility scripts to `scripts/`: `import-agent.sh`, `bundle-compile.sh`

Moved 5 E2E test scripts to `tests/e2e/`: `test_api.sh` (62 tests), `test_a2a_e2e.sh` (22), `test_activity_e2e.sh` (25), `test_claude_code_e2e.sh`, `test_comprehensive_e2e.sh` (68). Updated CLAUDE.md paths.

### PR #3 + #4 Code Review Delegated

CEO reviewed both PRs and found 6 critical bugs + 9 warnings. Delegated fixes through PM → Dev Lead → FE. Both PRs updated at 4:50 with fixes in progress.

### Provisioner Stale Image Fix

**Root cause:** Docker's `unless-stopped` restart policy races with provisioner's Stop → Start sequence. Old container restarts before `ContainerRemove` completes, blocking `ContainerCreate`. Result: old image keeps running after rebuild.

**Fix:** Pre-emptive `ContainerRemove(force: true)` before `ContainerCreate` — kills any stale container from restart policy. Added image ID logging on create and start for immediate visibility of stale-image issues.

### PRs #3 + #4 Reverted

Agent-authored PRs had too many integration bugs (infinite re-renders, wrong API format, white theme on dark canvas). Reverted both via cherry-pick rebuild of main.

### Template Runtime Detection Bug

**Problem:** Deploying "Claude Code Agent" from the template palette started a `langgraph` container instead of `claude-code`. The agent error was `[Errno 2] No such file or directory: '/claude'`.

**Root cause:** `workspace.go:Create` defaulted `payload.Runtime` to `"langgraph"` (lines 50-52) **before** reading the template's `config.yaml`. The later detection block (line 142) checked `if payload.Runtime == ""`, but the field was already set, so the template's `runtime: claude-code` was never used.

**Fix:** Moved the runtime detection that reads the template's `config.yaml` to run **before** both the default fallback and the DB insert. Removed the now-dead duplicate detection block in the provisioning section. Added a debug log for when the `config.yaml` read fails.
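
The corrected precedence can be sketched as a small function. This is an illustration in Python rather than the Go handler itself; `resolve_runtime` is a hypothetical name and the template config is assumed to be already parsed into a dict.

```python
from typing import Optional

DEFAULT_RUNTIME = "langgraph"

def resolve_runtime(payload_runtime: str, template_config: Optional[dict]) -> str:
    """Explicit request first, then the template's config.yaml, then the default."""
    if payload_runtime:
        return payload_runtime
    if template_config and template_config.get("runtime"):
        return template_config["runtime"]
    # The default is applied last, so it can no longer mask the template value.
    return DEFAULT_RUNTIME
```

The bug corresponded to applying the final `return` before the template check.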

### Branding + License

- Replaced gradient "S" square in toolbar with actual Molecule AI flame icon (`/molecule-icon.png`)
- Added Molecule AI favicon (`canvas/src/app/icon.png`)
- Added BSL 1.1 LICENSE file — personal/non-commercial use OK, no competing SaaS, converts to Apache 2.0 on 2029-01-01
- Updated README badge and license section

### AutoGen Adapter `'kwargs'` Fix

**Problem:** Deploying AutoGen Agent from template palette resulted in `AutoGen error: 'kwargs'` on every message.

**Root cause:** `_langchain_to_autogen()` wrapped LangChain tools as `async def wrapper(**kwargs)`. AutoGen 0.7.5's `FunctionTool` introspects function signatures with type hints — `**kwargs` has no type annotation, causing `KeyError: 'kwargs'` in `_function_utils.py`.

**Fix:** Replaced `**kwargs` wrapper with typed `async def _invoke(input: str) -> str` and used `autogen_core.tools.FunctionTool` directly. JSON parsing bridges structured input for tools that expect dicts.
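
A sketch of the typed wrapper shape, written without importing AutoGen so it runs standalone. `make_typed_invoke`, `legacy_wrapper`, and `demo_tool` are hypothetical names; the sketch only shows that a `**kwargs` signature exposes no type hints while the typed `_invoke` does, which is consistent with the `KeyError` described but does not reproduce AutoGen's internal lookup.

```python
import asyncio
import json
from typing import get_type_hints

def make_typed_invoke(tool_fn):
    """Wrap an async tool so it exposes a fully annotated (input: str) -> str signature."""
    async def _invoke(input: str) -> str:
        try:
            parsed = json.loads(input)
        except json.JSONDecodeError:
            parsed = input
        # Pass dicts through for structured tools; otherwise hand over the raw string.
        payload = parsed if isinstance(parsed, dict) else input
        return str(await tool_fn(payload))
    return _invoke

async def legacy_wrapper(**kwargs):
    """Old shape: nothing here is annotated, so introspection finds no hints."""
    return ""

async def demo_tool(arg):
    """Stand-in tool: accepts either a dict with key 'q' or a plain string."""
    return arg["q"].upper() if isinstance(arg, dict) else arg[::-1]
```

Usage: `asyncio.run(make_typed_invoke(demo_tool)('{"q": "hi"}'))` passes the decoded dict to the tool, while a non-JSON string is forwarded unchanged.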

### Chat Duplicate Messages Fix

**Problem:** Sending a message showed the agent response twice in the chat.

**Root cause:** Two paths both added the response: (1) WebSocket `A2A_RESPONSE` handler in ChatTab, and (2) Zustand store's `pendingA2AResponse` effect. Both fired from the same event.

**Fix:** Removed the duplicate WebSocket handler in ChatTab — the store effect is the canonical path.

### Canvas Pan-to-Node on Deploy

New workspaces now appear near center and the canvas smoothly pans to them on deploy instead of placing them all at (0,0).

### Docs Cleanup

Deleted 6 UX spec files for reverted Phase 20 features (settings panel, onboarding interception, deploy interception) — no longer in codebase.

### Initial Prompt System

New feature: agents can auto-execute a configurable prompt on startup — before any user interaction.

**Architecture:**
- `config.py`: new `initial_prompt` field (string or `initial_prompt_file` reference)
- `main.py`: after server ready, sends initial_prompt as A2A `message/send` to self
- `org.go`: `InitialPrompt` on `OrgDefaults` and `OrgWorkspace` structs with JSON+YAML tags; injected into config.yaml as YAML block scalar during org import
- Org template: per-agent initial prompts instruct dev agents to clone repo, read CLAUDE.md, study codebase, and report ready

**Manual E2E verified:** 12 agents deployed, 11/11 non-PM agents cloned repo to `/workspace/repo/`, PM has repo at `/workspace` (bind-mounted). All 12 have codebase access.

### Runtime Change on Restart Fix

**Problem:** Comprehensive E2E test "Runtime change langgraph→deepagents on restart" failed — container kept using old image.

**Root cause:** `workspace_restart.go` read runtime from DB (`COALESCE(runtime, 'langgraph')`) but when the user changes `config.yaml` runtime, the DB is never updated. Also, `ExecRead` was called after `Stop()` (container already stopped).

**Fix:** Read config.yaml runtime from running container *before* stopping it. If runtime differs from DB, update DB. Use `configDirName(id)` for container name (not raw workspace ID).

### QA System Prompt Overhaul

Comprehensive rewrite: never trust self-reported results, must clone repo independently, run ALL test suites to 100% green, E2E tests required, visual style verification against dark zinc theme, red flags checklist.

### Org Struct JSON Tags

Added `json` tags to `OrgTemplate`, `OrgDefaults`, and `OrgWorkspace` structs — without them, JSON POST bodies couldn't populate `initial_prompt` and other snake_case fields.

## Files Changed
- `workspace-server/internal/handlers/workspace.go` — runtime detection before DB insert
- `workspace-server/internal/handlers/workspace_restart.go` — read runtime from container config before stop
- `workspace-server/internal/handlers/org.go` — InitialPrompt field, JSON tags, config.yaml injection
- `workspace-server/internal/handlers/org_test.go` — 5 new tests (YAML parsing, injection, special chars)
- `workspace/config.py` — initial_prompt field + file reference
- `workspace/main.py` — auto-send initial_prompt after server ready
- `workspace/tests/test_config.py` — 5 new tests (inline, file, precedence, default, missing)
- `workspace/cli_executor.py` — __del__ getattr guard
- `workspace/adapters/autogen/adapter.py` — FunctionTool wrapper
- `workspace/tests/test_common_setup.py` — autogen skipif + FunctionTool assertions
- `org-templates/molecule-dev/org.yaml` — per-agent initial prompts
- `org-templates/molecule-dev/qa-engineer/system-prompt.md` — comprehensive QA rewrite
- `canvas/src/components/Canvas.tsx` — pan-to-node on deploy
- `canvas/src/components/Toolbar.tsx` — Molecule AI icon
- `canvas/src/components/tabs/ChatTab.tsx` — remove duplicate A2A_RESPONSE handler
- `canvas/src/store/canvas-events.ts` — node position offset + pan event + window guard
- `canvas/src/store/__tests__/canvas.test.ts` — relaxed position assertion
- `canvas/src/lib/api/__tests__/secrets.test.ts` — match actual API format
- `canvas/src/app/icon.png` — favicon
- `tests/e2e/test_comprehensive_e2e.sh` — fix secrets test assumption
- `.gitignore` — test-results/, playwright-report/
- `LICENSE` — BSL 1.1
- `README.md` — license badge + section
- `CLAUDE.md` — template resolution docs, initial prompt section, test counts
- Deleted: `docs/ux-specs/*`, `docs/onboarding-interception.md`

### Initial Prompt Cascade Loop Fix

**Problem:** 12 agents all executed initial prompts simultaneously on first boot. Each prompt ended with "report ready to parent" — sending A2A messages while other agents were still booting. Under load, containers died → ProxyA2A detected dead containers → triggered auto-restart → new container → initial prompt fired again → cascade loop.

**Root cause:** Two issues: (1) initial prompts instructed agents to send A2A messages during boot, (2) initial prompt re-executed on every restart (no idempotency guard).

**Fixes:**
- `main.py`: writes `.initial_prompt_done` marker file after first execution. Skips on restart.
- `org.yaml`: rewrote all 12 agent prompts — no outbound A2A, no test suite runs during boot. Agents clone repo, read docs, save to `commit_memory`, then wait for tasks.
- `workspace_restart.go`: fixed misleading "after secret change" log in `RestartByID` (called by multiple paths, not just secrets).
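
The marker-file guard can be sketched as below. The function name, the `send` callable, and the choice of directory for the marker are assumptions made for illustration; only the `.initial_prompt_done` file name comes from the notes.

```python
import tempfile
from pathlib import Path

MARKER_NAME = ".initial_prompt_done"

def run_initial_prompt_once(state_dir: Path, prompt: str, send) -> bool:
    """Send the initial prompt only if no marker exists; return True if it was sent."""
    marker = state_dir / MARKER_NAME
    if not prompt or marker.exists():
        return False  # restart path: the prompt already ran once
    send(prompt)
    # Written only after a successful send, so a failed first boot can retry.
    marker.write_text("done\n", encoding="utf-8")
    return True

def demo():
    """Simulate first boot followed by a restart against the same directory."""
    sent = []
    with tempfile.TemporaryDirectory() as tmp:
        first = run_initial_prompt_once(Path(tmp), "clone repo", sent.append)
        second = run_initial_prompt_once(Path(tmp), "clone repo", sent.append)
    return first, second, sent
```

The guard only survives restarts if the marker lives on storage that persists across container recreation, which is assumed here.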

### Chat Separation: My Chat + Agent Comms

Refactored ChatTab into two sub-tabs:
- **My Chat**: user↔agent conversation only (`source=canvas` filter)
- **Agent Comms**: agent↔agent A2A traffic (`source=agent` filter), read-only, live WebSocket updates

**Backend:** Added `source` query param to `GET /workspaces/:id/activity` — `canvas` filters `source_id IS NULL`, `agent` filters `source_id IS NOT NULL`. Invalid values return 400.
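
The filter mapping can be sketched as a whitelist lookup. This is a Python illustration of the behavior described, not the Go handler; the function name and the `(status, clause)` return shape are hypothetical.

```python
# Whitelist: user input selects a clause but is never interpolated into SQL.
SOURCE_FILTERS = {
    "canvas": "source_id IS NULL",
    "agent": "source_id IS NOT NULL",
}

def activity_source_clause(source):
    """Return (http_status, where_clause); an absent param means no filter."""
    if not source:
        return 200, ""
    clause = SOURCE_FILTERS.get(source)
    if clause is None:
        return 400, ""  # invalid value is rejected rather than ignored
    return 200, clause
```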

**Initial prompt fix:** Routes through platform A2A proxy instead of self-send, so the prompt appears as a proper user message in chat history (logged with `source_id=NULL`). Removed `/notify` push code — proxy's `A2A_RESPONSE` broadcast handles delivery.

**Shared helper:** Extracted `extractRequestText()` into `message-parser.ts` — used by both ChatTab and AgentCommsPanel.

## Files Changed (Chat Separation)
- `workspace-server/internal/handlers/activity.go` — `source` query param + validation
- `workspace/main.py` — route initial prompt through proxy, remove /notify
- `canvas/src/components/tabs/ChatTab.tsx` — sub-tab container + MyChatPanel
- `canvas/src/components/tabs/chat/AgentCommsPanel.tsx` — new agent comms view
- `canvas/src/components/tabs/chat/message-parser.ts` — shared `extractRequestText()`

### Claude Code Adapter: CLI Subprocess → Claude Agent SDK Migration

Replaced the `claude-code` runtime's subprocess-based `CLIAgentExecutor` with a new `ClaudeSDKExecutor` that uses the official `claude-agent-sdk` Python package. The SDK wraps the same Claude Code engine, so plugins/skills/CLAUDE.md still work — but eliminates subprocess fragility (stdout buffering, zombie processes, session-ID parsing, ~500ms startup overhead).

**New files:**
- `workspace/claude_sdk_executor.py` — `ClaudeSDKExecutor` with asyncio.Lock serialization, cooperative cancel, `QueryResult` dataclass, session resume via SDK
- `workspace/executor_helpers.py` — shared helpers extracted from `cli_executor.py`: memory recall/commit, delegation results, heartbeat, system prompt, error sanitization (`sanitize_agent_error` + `classify_subprocess_error`), markdown-aware `brief_summary`, `extract_message_text`
- `workspace/tests/test_claude_sdk_executor.py` — 30 tests including concurrency (timestamp-ordered), cancel (GeneratorExit via async generator), session resume, error sanitization
- `workspace/tests/test_executor_helpers.py` — 73 tests for all shared helpers
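
The `asyncio.Lock` serialization mentioned for `ClaudeSDKExecutor` can be illustrated with a toy executor. `SerializedExecutor` and its fields are hypothetical; the `asyncio.sleep(0)` stands in for the real SDK call, which is not reproduced here.

```python
import asyncio

class SerializedExecutor:
    """Toy executor: one turn at a time, so session state is never interleaved."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self.session_id = None
        self.log = []

    async def execute(self, prompt: str) -> str:
        async with self._lock:
            self.log.append(("start", prompt))
            await asyncio.sleep(0)  # yield control, as a real SDK call would
            self.session_id = f"session-after-{prompt}"
            self.log.append(("end", prompt))
            return prompt.upper()

async def run_two_turns():
    executor = SerializedExecutor()
    results = await asyncio.gather(executor.execute("a"), executor.execute("b"))
    return results, executor.log
```

Without the lock, both turns would log `start` before either logged `end`, and the final `session_id` would depend on scheduling order.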

**Modified files:**
- `workspace/adapters/claude_code/adapter.py` — `create_executor()` returns `ClaudeSDKExecutor`; removed `shutil.which` CLI check
- `workspace/adapters/claude_code/Dockerfile` — pre-installs SDK via `pip install -r requirements.txt`
- `workspace/adapters/claude_code/requirements.txt` — added `claude-agent-sdk>=0.1.58`
- `workspace/cli_executor.py` — removed `claude-code` from `RUNTIME_PRESETS`, deleted all `self.runtime == "claude-code"` branches (JSON parsing, `--resume`, `--output-format json`, `_session_id`), calls shared helpers directly (no more one-line wrapper methods), uses `sys.executable` for MCP server, regex word-boundary error classification
- `workspace/tests/conftest.py` — session-wide `claude_agent_sdk` stub for test imports
- `.gitignore` — `.initial_prompt_done`, `.coverage*`

**Architecture decisions:**
- `asyncio.Lock` on the SDK executor serializes concurrent turns (matches old CLI behavior, keeps session_id race-free)
- `ResultMessage.result` preferred over concatenated `AssistantMessage` chunks (avoids doubled pre/post-tool text)
- Error sanitization unified: `sanitize_agent_error(exc=..., category=...)` serves both SDK exceptions and CLI subprocess stderr
- `classify_subprocess_error()` uses regex word boundaries to avoid false positives (`\brate\b` not `"rate" in`)
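
The word-boundary point in the last item can be shown with a short comparison. The two function names are hypothetical; only the `\brate\b` pattern comes from the notes, and the real classifier's categories are not reproduced.

```python
import re

_RATE_WORD = re.compile(r"\brate\b", re.IGNORECASE)

def looks_rate_limited_substring(stderr: str) -> bool:
    """Old style: plain substring test, which also fires on 'generate'."""
    return "rate" in stderr.lower()

def looks_rate_limited_word(stderr: str) -> bool:
    """New style: matches 'rate' only as a standalone word."""
    return bool(_RATE_WORD.search(stderr))
```

For example, `"failed to generate response"` trips the substring check but not the word-boundary check, while `"Rate-limit hit"` matches both.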

**Coverage:** 100% on `claude_sdk_executor.py` (110 stmts), `cli_executor.py` (179 stmts), `executor_helpers.py` (154 stmts). Total: 443 stmts, 0 misses.

**Live verification:** 12 workspaces restarted on new image. Echo, session resume, Bash tool, TodoWrite, PM→QA MCP delegation, and concurrent requests all verified. Rate-limited on quota (not a code bug).

**5 iterative code review passes** caught and fixed: the `_active_stream` race, dead claude-code branches, duplicated A2A instructions, raw-stderr leaks, deprecated `typing.AsyncIterator`, the `_install_fake_sdk` teardown leak, inconsistent error patterns, missing `encoding` args, and 7 other issues across successive rounds.

### Agent Quality Enforcement Stack

Built three layers of quality enforcement after observing that agents (running the same Claude Opus model) missed bugs such as absent `'use client'` directives because they lacked institutional memory and system-level enforcement.

**Layer 1: Git pre-commit hook** (`.githooks/pre-commit`)
- Rejects commits missing `'use client'` on hook-using `.tsx` files
- Rejects light theme colors in canvas components
- Rejects SQL injection patterns in Go (`fmt.Sprintf` with SQL)
- Rejects leaked secrets (`sk-ant-`, `ghp_`, `AKIA`)
- System-enforced — agents cannot bypass
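
The leaked-secret check can be sketched as a scan over added diff lines. This is a Python illustration under assumptions: the real hook in `.githooks/pre-commit` may be written differently, and `find_secret_lines` is a hypothetical name. Only the three prefixes come from the list above.

```python
import re

# Prefixes named in the hook description; matching on prefix alone is deliberately coarse.
SECRET_PATTERN = re.compile(r"sk-ant-|ghp_|AKIA")

def find_secret_lines(diff_text: str) -> list:
    """Return 1-based line numbers of added diff lines that contain a secret prefix."""
    hits = []
    for number, line in enumerate(diff_text.splitlines(), start=1):
        is_added = line.startswith("+") and not line.startswith("+++")
        if is_added and SECRET_PATTERN.search(line):
            hits.append(number)
    return hits
```

A hook built on this would exit non-zero whenever the returned list is non-empty.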

**Layer 2: `molecule-dev` plugin** (`plugins/molecule-dev/`)
- `rules/codebase-conventions.md` — injected into every agent's CLAUDE.md with past bugs, patterns, self-check scripts
- `skills/review-loop/SKILL.md` — multi-round FE→QA→fix→re-verify workflow for Dev Lead

**Layer 3: Awareness memory via initial_prompt**
- Key conventions saved to `commit_memory` on first boot
- Agents recall them on every future task via memory system
- Builds institutional knowledge across sessions

**Also shipped:**
- SDK executor retry logic (exponential backoff: 5s→10s→20s for rate limits)
- Force-remove in `provisioner.Stop()` to prevent restart-policy zombie containers
- All 12 agent system prompts rewritten from checklists to senior-engineer expectations
- Dev Lead prompt requires UIUX + Security involvement for UI/credential work
- Repo made public — removed GITHUB_TOKEN from initial_prompt
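
The 5s→10s→20s retry schedule can be sketched as follows. `RateLimitError` and `call_with_backoff` are hypothetical names, and capping at three retries is an assumption inferred from the three delays listed; the SDK executor's actual retry code is not shown here.

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for whatever error the executor treats as a rate limit."""

async def call_with_backoff(fn, base_delay=5.0, max_retries=3, sleep=asyncio.sleep):
    """Retry fn on rate limits, doubling the delay each time: 5s, 10s, 20s."""
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            await sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter exists so tests can record delays instead of actually waiting.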

### Cron Scheduling System (Phase 22)

New feature: users can set up recurring tasks that fire A2A messages to agents on a cron schedule.

**Backend:**
- `workspace-server/migrations/015_workspace_schedules.sql` — new table with cron_expr, timezone, prompt, enabled, last_run_at, next_run_at, run_count, last_status
- `workspace-server/internal/scheduler/scheduler.go` — goroutine polls every 30s, fires due schedules via proxyA2ARequest with `system:scheduler` caller, WaitGroup for completion, semaphore (max 10 concurrent)
- `workspace-server/internal/handlers/schedules.go` — 6 REST endpoints: list, create, update (COALESCE-based), delete, run-now, history
- `robfig/cron/v3` for cron expression parsing + next-run computation
- `proxyA2ARequest` exposed as public method for internal callers
- Dedicated `cron_run` activity log entries with schedule metadata for history queries
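
The bounded-concurrency firing described for the scheduler can be illustrated in Python with `asyncio.Semaphore`. The Go implementation uses a goroutine, WaitGroup, and semaphore; this sketch only mirrors the shape. `fire_due` and the `"error"` status string are assumptions (`ok` appears in the E2E note below).

```python
import asyncio

async def fire_due(schedules, fire, max_concurrent=10):
    """Fire every due schedule, never running more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(schedule):
        async with semaphore:
            try:
                await fire(schedule)
                return (schedule, "ok")
            except Exception:
                # One failing schedule must not prevent the others from firing.
                return (schedule, "error")

    # gather waits for all runs to finish, playing the role of the WaitGroup.
    return await asyncio.gather(*(run_one(s) for s in schedules))
```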

**Frontend:**
- `canvas/src/components/tabs/ScheduleTab.tsx` — CRUD UI with create/edit form, cron-to-English helper, status indicators, Run Now button, delete confirmation
- Wired into SidePanel as new "Schedule" tab (⏲ icon)

**Org template:**
- `OrgSchedule` struct in `org.go`, inserted during org import
- Example: Security Auditor daily scan in `org-templates/molecule-dev/org.yaml`

**E2E verified:** Created every-minute schedule, scheduler fired at next minute boundary, agent received and responded, schedule updated with status=ok + run_count=1.

### Volume Ownership: Root → Gosu Agent Pattern

Docker creates volume contents as root, but workspace containers run as UID 1000 (`agent`). This caused `PermissionError` when the adapter tried to write CLAUDE.md with plugin rules. Initially fixed with scattered `chown` hacks in the provisioner and plugin handler, then properly fixed with the standard Docker pattern:

- `Dockerfile`: installs `gosu`, removes `USER agent` (entrypoint handles privilege drop)
- `entrypoint.sh`: starts as root → `chown -R agent:agent /configs /workspace` → `exec gosu agent` → `python3 main.py`
- Removed all band-aid chown calls from provisioner and plugin handler
- Verified: 12/12 containers, CLAUDE.md owned by `agent:agent`, plugin rules injected

### Comprehensive Code Review — 13 Issues Fixed + Test Coverage

Two-pass code review across the entire repo identified 24 issues. All 13 critical/warning items fixed:

**Critical (8):**
- `a2a_proxy.go`: add access control via `CanCommunicate` for agent-to-agent proxy requests (closes the security boundary). Canvas requests (no `X-Workspace-ID`), self-calls, and system callers (`webhook:*`, `system:*`, `test:*`) bypass the check via an explicit `isSystemCaller()` helper.
- `org.go`, `delegation.go`: replace `db.DB.Exec()` with `ExecContext` + error checks. Errors no longer silently dropped on inserts/updates.
- `activity.go`, `workspace.go`: add `rows.Err()` checks after iteration loops to catch DB iteration failures (was returning partial results).
- `ws/hub.go`: add `safeSend` with `recover` for race between Broadcast and Unregister (defensive fix for closed channel send).
- `workspace.go`: improve `canvas_layouts` insert error log (non-fatal).
- `ChatTab.tsx`, `AgentCommsPanel.tsx`: add WebSocket `onerror` handlers (orphaned connections on failure).
- `app/page.tsx`: log hydration errors instead of silent catch.
- `cli_executor.py`: guarantee `proc.wait()` after `kill` on timeout to prevent zombie processes; bounded 5s wait timeout.
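
The proxy access rule from the first item can be sketched as a decision function. This is a Python rendering of the described behavior, not the Go code; `may_proxy` and the `can_communicate` callable are hypothetical, while the three prefixes come from the item itself.

```python
SYSTEM_PREFIXES = ("webhook:", "system:", "test:")

def is_system_caller(caller_id: str) -> bool:
    """True for platform-internal callers identified by prefix."""
    return caller_id.startswith(SYSTEM_PREFIXES)

def may_proxy(caller_id: str, target_id: str, can_communicate) -> bool:
    """Decide whether an A2A proxy request is allowed."""
    if not caller_id:
        return True  # canvas request: no X-Workspace-ID header
    if caller_id == target_id:
        return True  # self-call
    if is_system_caller(caller_id):
        return True
    return can_communicate(caller_id, target_id)
```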

**Warning (5):**
- `a2a_proxy.go`: cap `LogActivity` context with 30s timeout (was `WithoutCancel` = unbounded lifetime).
- `activity.go`: log JSON marshal failures in `LogActivity` instead of silently corrupting activity logs with nil bodies.
- `org.go`: replace 500ms `time.Sleep` with `workspaceCreatePacingMs = 50` constant (at 500ms per workspace, importing a 12-workspace org took 6s+).
- `main.py`: stop heartbeat if `adapter.setup()` raises (resource leak).
- `Canvas.tsx`: document intentional `getState()` pattern in imperative event handlers.

**Test coverage added:**
- `a2a_proxy_test.go`: `mockCanCommunicate` helper + 4 access control tests (denied, self-exempt, system caller, canvas) + table-driven `TestIsSystemCaller` (7 cases)
- `test_cli_executor.py`: 2 zombie reap tests (verify `proc.wait()` called after `kill`; degraded path when `wait()` also times out)
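
The kill-then-reap pattern exercised by the zombie tests can be sketched with the standard `subprocess` module. The real executor may use a different subprocess API; `run_with_timeout` and `SLEEP_CMD` are hypothetical, and only the bounded 5s reap wait comes from the notes.

```python
import subprocess
import sys

# Demo command: a child that would outlive any short timeout.
SLEEP_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]

def run_with_timeout(cmd, timeout, reap_timeout=5.0):
    """Run cmd; on timeout, kill it and wait so no zombie is left behind.

    Returns (timed_out, returncode). returncode is None only if the reap
    itself timed out (the degraded path).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=timeout)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            proc.wait(timeout=reap_timeout)  # reaps the killed child
        except subprocess.TimeoutExpired:
            pass  # degraded path: give up rather than block forever
        return True, proc.returncode
```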

**Verification:**
- Go: 6 packages, all tests pass
- Canvas Vitest: 344 tests pass
- Python pytest: 874 tests pass (was 872, +2 new)
- Playwright E2E: 13/13 pass (incl. 3 data-flow tests verifying real browser content)
- Comprehensive bash E2E: 68/68 pass
- Manual verification: 12-agent org deployed, initial prompts complete, chat shows messages