molecule-core/docs/edit-history/2026-04-08.md
Hongming Wang d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00

2026-04-08 Session

Summary

Fixed ChatTab agent reachability, added conversation history to all A2A adapters, added current_task heartbeat reporting, fixed WORKSPACE_PROVISIONING for restarts, fixed Config tab runtime dropdown, and improved config save/restart UX.

Changes

ChatTab — Agent Reachability Fix

  • Problem: ChatTab called GET /registry/discover/:id without X-Workspace-ID header → 400 error → "Agent not available" even though agent was online
  • Fix: Derived reachability from data.status (online/degraded) instead of network call. Messages are proxied through POST /workspaces/:id/a2a so browser never needs the agent's internal URL.
  • Files: canvas/src/components/tabs/ChatTab.tsx

Conversation History

  • ChatTab now sends last 20 messages via params.metadata.history in A2A message/send
  • a2a_executor.py: New _extract_history() function extracts history from request metadata
  • LangGraph/DeepAgents: History prepended as ("human"/"ai", text) tuples
  • CrewAI/AutoGen: History prepended as text prefix in task description
  • Files: ChatTab.tsx, a2a_executor.py, all adapter files
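
The history handling above can be sketched roughly as follows. This is an illustration only, not the code in a2a_executor.py: the entry shape (a dict with "role" and "text" keys), the MAX_HISTORY constant, and the prepend_as_text helper are assumptions made for the example.

```python
from typing import Any, List, Tuple

MAX_HISTORY = 20  # ChatTab sends the last 20 messages


def extract_history(metadata: Any) -> List[Tuple[str, str]]:
    """Return (role, text) tuples from request metadata, ignoring bad input."""
    if not isinstance(metadata, dict):
        return []
    raw = metadata.get("history")
    if not isinstance(raw, list):
        return []  # None, strings, dicts, etc. are treated as "no history"
    history: List[Tuple[str, str]] = []
    for entry in raw[-MAX_HISTORY:]:
        if not isinstance(entry, dict):
            continue  # guard against non-dict entries in the list
        text = entry.get("text")  # hypothetical key name
        if not isinstance(text, str) or not text:
            continue
        # Assumed mapping: "user" becomes "human", anything else becomes "ai".
        role = "human" if entry.get("role") == "user" else "ai"
        history.append((role, text))
    return history


def prepend_as_text(history: List[Tuple[str, str]], task: str) -> str:
    """Text-prefix form, in the spirit of the CrewAI/AutoGen approach."""
    if not history:
        return task
    lines = [f"{role}: {text}" for role, text in history]
    return "Conversation so far:\n" + "\n".join(lines) + "\n\nCurrent task: " + task
```

The tuple form suits frameworks that accept message lists directly, while the text form is a fallback for frameworks that only take a single task string.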

Current Task Heartbeat

  • New shared set_current_task(heartbeat, task) function in a2a_executor.py
  • All 5 adapters now set current_task during execution (truncated to 60 chars)
  • Task cleared in finally block after execution completes
  • Heartbeat passed from AdapterConfig through create_executor() in all adapters
  • Files: a2a_executor.py, langgraph/adapter.py, deepagents/adapter.py, crewai/adapter.py, autogen/adapter.py, openclaw/adapter.py
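
A minimal sketch of the pattern described above: truncate to 60 characters, tolerate a missing heartbeat, and clear the task in a finally block. The Heartbeat class and the execute function are hypothetical stand-ins; only the set_current_task(heartbeat, task) signature comes from these notes.

```python
from typing import Optional

MAX_TASK_CHARS = 60


class Heartbeat:
    """Hypothetical stand-in for the real heartbeat object."""

    def __init__(self) -> None:
        self.current_task = ""


def set_current_task(heartbeat: Optional[Heartbeat], task: str) -> None:
    """Record the task on the heartbeat; a None heartbeat is a no-op."""
    if heartbeat is None:
        return
    heartbeat.current_task = (task or "")[:MAX_TASK_CHARS]


def execute(heartbeat: Optional[Heartbeat], prompt: str) -> str:
    """Set the task for the duration of a run and always clear it afterwards."""
    set_current_task(heartbeat, prompt)
    try:
        return prompt.upper()  # placeholder for the adapter's real work
    finally:
        set_current_task(heartbeat, "")  # runs on success and on exceptions
```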

WORKSPACE_PROVISIONING for Restarts

  • Problem: applyEvent WORKSPACE_PROVISIONING only created new nodes, silently ignored restarts of existing nodes → UI didn't show "starting" state
  • Fix: Added else branch that sets existing node to status: "provisioning", clears needsRestart and currentTask
  • Files: canvas/src/store/canvas.ts

Config Tab Improvements

  • Runtime dropdown: Removed invalid options (Codex, Ollama). Now shows only available adapters: LangGraph, Claude Code, CrewAI, AutoGen, DeepAgents, OpenClaw
  • Save & Restart: Config save now auto-restarts workspace so changes take effect immediately. "Save" button also available for save-only (sets needsRestart banner)
  • Secrets: Removed needsRestart: true from secrets save/delete since platform already auto-restarts
  • Retry→Restart: Chat error banner button changed from no-op "Retry" to functional "Restart" with confirmation dialog
  • Files: canvas/src/components/tabs/ConfigTab.tsx, ChatTab.tsx

Tests

  • 8 new Python tests (15 total in test_a2a_executor.py, 80 total):
    • _extract_history: 5 tests (basic, empty, None, malformed, non-list)
    • History prepend in executor: 1 test
    • set_current_task: 2 tests (update + None heartbeat)
  • 1 updated Canvas test: WORKSPACE_PROVISIONING updates existing node status on restart
  • All existing tests updated ("user" → "human" role format, metadata in mock context)

Code Review Fixes

  • PEP 8 spacing in all set_current_task() calls
  • OpenClaw set_current_task("") moved into finally block
  • _extract_history guards against non-dict entries in history list

Merged PR #1: Workspace Awareness Integration

  • Platform assigns deterministic awareness_namespace (workspace:<id>) per workspace
  • AWARENESS_URL and AWARENESS_NAMESPACE injected into containers during provisioning (only when AWARENESS_URL env var is set on the platform)
  • commit_memory / search_memory tools route through awareness when configured, fall back to platform memory API
  • New migration 010_workspace_awareness.sql adds awareness_namespace column to workspaces
  • agent.py: Anthropic/OpenAI base URL support via ANTHROPIC_BASE_URL / OPENAI_BASE_URL env vars
  • test_sandbox.py: asyncio.get_event_loop() → asyncio.run() for Python 3.13 compat
  • New files: workspace/tools/awareness_client.py, workspace/tests/test_memory.py, workspace/tests/test_agent_base_urls.py
  • Files: workspace-server/internal/handlers/workspace.go, workspace-server/internal/models/workspace.go, workspace-server/internal/provisioner/provisioner.go, workspace-server/migrations/010_workspace_awareness.sql, workspace/agent.py, workspace/main.py, workspace/tools/memory.py, workspace/tools/awareness_client.py

Restart Runtime Detection + Template Fallback

  • Problem: Changing runtime via Config tab (e.g. langgraph → claude-code) didn't take effect on restart — provisioner used the old image because it only read runtime from the template dir, not the container's config volume
  • Fix: Restart handler reads runtime from the running container via ExecRead (docker exec cat) BEFORE stopping it. Falls back to this value when no template provides a runtime.
  • Template auto-apply: When a runtime has a default template (e.g. claude-code-default/), it's automatically applied on restart — copies CLAUDE.md, .claude/settings.json, etc. into the container
  • Replaced ReadFileFromVolume (temp Alpine container, slow) with ExecRead (exec in existing container, instant)
  • Files: workspace-server/internal/handlers/workspace.go, workspace-server/internal/provisioner/provisioner.go

MCP Memory Tools for CLI Runtimes

  • Added commit_memory and recall_memory to a2a_mcp_server.py — now ALL runtimes (including Claude Code) can persist and recall memories via platform API
  • Updated workspace-configs-templates/claude-code-default/CLAUDE.md with memory usage guidelines (recall at conversation start, commit after interactions)
  • 7 unit tests in test_mcp_memory.py + 16 new E2E checks for memory CRUD, scope filtering, cross-workspace isolation

Comprehensive Test Suite

  • registry/access_test.go: 10 tests for CanCommunicate (siblings, parent-child, root, denied, grandchild)
  • handlers_extended_test.go: 14 tests for Delete, Update, Restart, Secrets, Discover, Peers, CheckAccess, Bundle, Config
  • test_cli_executor.py: 14 tests for CLI command building, session resume, model flags, timeout
  • test_plugins.py: 9 tests for plugin loading (rules, skills, prompts)
  • test_comprehensive_e2e.sh: 68 checks covering ALL platform endpoints including runtime assignment and memory

UI Cleanup

  • Removed 3 redundant task notifications from SidePanel/ChatTab (kept only the amber banner below tabs)
  • PM system prompt updated for fully autonomous delegation (no more "Shall I delegate?")

Runtime Persisted in Database (migration 011)

  • Root cause: runtime was only in config.yaml inside Docker volumes — fragile detection via ExecRead/ReadFromVolume failed when containers were dead
  • Fix: Added runtime column to workspaces table. Stored at creation, read on restart with simple SELECT
  • Fixed 6 broken paths: Restart, RestartByID, Create, Update (PATCH), Bundle import, ConfigTab
  • Removed ExecRead/ReadFromVolume workarounds entirely

Auto-Memory for CLI Agents

  • cli_executor.py: auto-recalls memories on first message (no session), auto-commits summary after each response
  • Memories persist via platform API, survive container restarts
  • Fixed memory pollution: saves original input, not memory-injected version
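
The recall/commit flow and the pollution fix can be illustrated with the sketch below. It is not the cli_executor.py implementation: the recall, commit, and run_cli callables, the prompt layout, and the summary format are all hypothetical. The point it shows is that the committed summary is built from the original input rather than from the memory-injected prompt.

```python
from typing import Callable, List, Optional


def build_prompt(user_input: str, memories: List[str]) -> str:
    """Inject recalled memories ahead of the user's message."""
    if not memories:
        return user_input
    block = "\n".join(f"- {m}" for m in memories)
    return f"Relevant memories:\n{block}\n\n{user_input}"


def run_turn(
    user_input: str,
    session_id: Optional[str],
    recall: Callable[[], List[str]],
    commit: Callable[[str], None],
    run_cli: Callable[[str], str],
) -> str:
    # Recall only on the first message, i.e. when there is no session to resume.
    memories = recall() if session_id is None else []
    response = run_cli(build_prompt(user_input, memories))
    # Commit a summary built from the ORIGINAL input, so recalled memories
    # are not stored again as if the user had typed them.
    commit(f"User: {user_input}\nAgent: {response}")
    return response
```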

Real-Time Task Status on Canvas

  • set_current_task pushes heartbeat immediately when setting a task (not just on 30s loop)
  • Clearing deferred to next heartbeat cycle — keeps task visible for quick A2A responses
  • Team leads now show task banners during delegation

Auth & Session Fixes

  • CLI executor clears session_id on auth errors (prevents poisoned session resume)
  • FilesTab: deduplicated tree keys with path:type (.claude dir + file collision)

UX Improvements

  • Chat tab is now first and default tab (was Details)
  • Rate limit increased from 100 to 600 req/min (15 workspaces overwhelmed the default)
  • Merged PR #3: Awareness memory dashboard embedded as iframe in Memory tab

CI Fixes

  • Updated handler tests for runtime column (INSERT 7 args, SELECT includes runtime)

Build Fixes

  • workspace/Dockerfile: Added COPY policies/ ./policies/
  • workspace/requirements.txt: Added langchain-core to base deps
  • adapters/crewai/adapter.py: Fixed _langchain_to_crewai docstring

Container Health Detection & Auto-Restart

  • Problem: When Docker Desktop crashes, containers die but platform still thinks workspaces are "online" for up to 60s (Redis TTL). A2A proxy returns errors, terminal fails, discovery returns stale URLs.
  • Three-layer fix:
    1. Reactive: A2A proxy checks provisioner.IsRunning() on connection error → marks offline, clears Redis, triggers restart. Returns 503 with "restarting": true (or 502 if container is running but unresponsive)
    2. Proactive: New registry.StartHealthSweep polls Docker API every 15s for all online workspaces → catches dead containers before users notice
    3. Auto-restart: Both liveness monitor and health sweep trigger RestartByID() on offline detection. Per-workspace mutex deduplicates concurrent restart attempts.
  • WorkspaceHandler moved from router.Setup to main.go creation so RestartByID is accessible in offline callbacks
  • New db.ClearWorkspaceKeys() shared helper replaces 3x duplicated Redis cleanup
  • New files: workspace-server/internal/registry/healthsweep.go, healthsweep_test.go (3 tests)
  • Files: workspace-server/cmd/server/main.go, workspace-server/internal/handlers/workspace.go, workspace-server/internal/router/router.go, workspace-server/internal/db/redis.go, workspace-server/internal/registry/healthsweep.go
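
The platform code here is Go; the Python sketch below only illustrates two ideas from the list above: the 503 versus 502 decision in the reactive layer, and per-workspace deduplication of restart attempts. The RestartGuard class, the non-blocking try-lock behaviour, and the "error" field in the response body are assumptions for the example; only the status codes and the "restarting": true field come from these notes.

```python
import threading
from typing import Dict, Tuple


def proxy_error_response(container_running: bool) -> Tuple[int, dict]:
    """Pick the status code after an A2A connection error."""
    if container_running:
        # Container is up but the agent did not answer.
        return 502, {"error": "agent unresponsive"}
    # Container is gone: report that a restart has been triggered.
    return 503, {"error": "workspace offline", "restarting": True}


class RestartGuard:
    """Allows at most one in-flight restart per workspace ID."""

    def __init__(self) -> None:
        self._table_lock = threading.Lock()
        self._locks: Dict[str, threading.Lock] = {}

    def try_begin(self, workspace_id: str) -> bool:
        """Return True if the caller may restart, False if one is already running."""
        with self._table_lock:
            lock = self._locks.setdefault(workspace_id, threading.Lock())
        return lock.acquire(blocking=False)

    def end(self, workspace_id: str) -> None:
        """Mark the restart as finished so a later attempt can proceed."""
        with self._table_lock:
            lock = self._locks[workspace_id]
        lock.release()
```

With a guard like this, the liveness monitor and the health sweep can both call try_begin for the same workspace and only the first caller proceeds.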

Template Fallback for Missing Templates

  • Root cause of auth error: setup-org.sh referenced non-existent org-* templates → containers got empty /configs → fell back to langgraph runtime with anthropic:claude-sonnet-4-6 but no ANTHROPIC_API_KEY
  • Fix: Create handler now validates template exists via os.Stat, falls back to {runtime}-default template, then ensureDefaultConfig()
  • runtime column added to List/Get API response (scanWorkspaceRow, workspaceListQuery, Get query)
  • Files: workspace-server/internal/handlers/workspace.go, workspace-server/internal/handlers/handlers_test.go
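
The fallback chain above (requested template, then {runtime}-default, then a generated default config) can be sketched as below. The real handler is Go and uses os.Stat; this Python version, the resolve_template name, and the templates_root parameter are illustrative assumptions.

```python
import os
from typing import Optional


def resolve_template(templates_root: str, requested: Optional[str], runtime: str) -> Optional[str]:
    """Return the template directory to apply, or None to generate a default config."""
    candidates = []
    if requested:
        candidates.append(requested)
    candidates.append(f"{runtime}-default")
    for name in candidates:
        path = os.path.join(templates_root, name)
        if os.path.isdir(path):
            return path
    # Nothing on disk: the caller falls through to its ensureDefaultConfig() step.
    return None
```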

Graceful Delegation Error Handling

  • Problem: When child workspace fails (auth error, offline), PM forwarded raw error message to user instead of handling gracefully
  • Fix (3 layers):
    1. a2a_mcp_server.py: delegate_task detects errors via [A2A_ERROR] sentinel prefix, wraps as DELEGATION FAILED with instructions to try another peer or handle itself
    2. coordinator.py: Strengthened coordination rule 5 — "do NOT forward raw errors to user"
    3. cli_executor.py: Added IMPORTANT block in A2A instructions for delegation failure handling
  • Auth errors in CLI executor now retry with exponential backoff (same as rate limits)
  • Claude Code adapter: Fixed dict.get("command", "claude") → .get("command") or "claude" for empty string handling
  • Files: workspace/a2a_mcp_server.py, workspace/coordinator.py, workspace/cli_executor.py, workspace/adapters/claude_code/adapter.py
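
Two of the fixes above lend themselves to a short sketch. The sentinel prefix [A2A_ERROR] and the DELEGATION FAILED label come from these notes; the function names, the peer parameter, and the exact wording of the wrapped message are assumptions.

```python
A2A_ERROR_PREFIX = "[A2A_ERROR]"


def wrap_delegation_result(peer: str, raw: str) -> str:
    """Turn a sentinel-prefixed error into instructions for the delegating agent."""
    if not raw.startswith(A2A_ERROR_PREFIX):
        return raw  # normal result, pass through unchanged
    detail = raw[len(A2A_ERROR_PREFIX):].strip()
    return (
        f"DELEGATION FAILED: {peer} could not complete the task ({detail}). "
        "Do not forward this error to the user. "
        "Try another peer or handle the task yourself."
    )


def resolve_command(runtime_config: dict) -> str:
    """Fall back to "claude" when the command is missing, None, or empty."""
    # dict.get("command", "claude") only applies the default when the key is
    # absent, so an empty string would be returned as-is. Using `or` also
    # covers the empty-string and None cases.
    return runtime_config.get("command") or "claude"
```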

Agent Push Messaging (send_message_to_user)

  • Feature: Agents can now push messages to the user's canvas chat at any time — not just as A2A responses
  • Use case: Agent says "Got it, delegating now...", continues working, then sends results when done
  • Platform: New POST /workspaces/:id/notify endpoint → broadcasts AGENT_MESSAGE via WebSocket (BroadcastOnly)
  • MCP tool: send_message_to_user in a2a_mcp_server.py — calls notify endpoint
  • Canvas: AGENT_MESSAGE handled in global applyEvent → stored in agentMessages map → ChatTab consumes via store subscription (no extra WS connection)
  • Prompts: Updated A2A instructions + CLAUDE.md with "RESPOND FAST, FOLLOW UP LATER" rule
  • Files: workspace-server/internal/handlers/activity.go, workspace-server/internal/router/router.go, workspace/a2a_mcp_server.py, canvas/src/store/canvas.ts, canvas/src/components/tabs/ChatTab.tsx, workspace/cli_executor.py, workspace-configs-templates/claude-code-default/CLAUDE.md

WebSocket Error Suppression

  • Suppressed noisy "WebSocket error: {}" console.error in socket.ts; onerror fires before onclose and the Event object has no useful info
  • Files: canvas/src/store/socket.ts

Setup Script Fix

  • Removed dead code copying auth tokens to non-existent org-* template dirs
  • Auth token now auto-propagated via claude-code-default template fallback
  • Files: setup-org.sh

Remove Default Agent Timeout

  • Problem: PM timed out after 300s during delegation chains. Long-running tasks (multi-agent coordination, research) are expected to exceed 5 minutes.
  • Fix: Changed default timeout from 300s to 0 (no timeout) in three places:
    • workspace-configs-templates/claude-code-default/config.yaml — template default
    • workspace/config.py: RuntimeConfig.timeout dataclass default + YAML parser default
    • workspace-server/internal/handlers/workspace.go: ensureDefaultConfig generated config
  • timeout: 0 → self.config.timeout or None → None → proc.communicate() waits indefinitely
  • Files: workspace-configs-templates/claude-code-default/config.yaml, workspace/config.py, workspace-server/internal/handlers/workspace.go
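
The timeout chain above relies on two Python facts: 0 is falsy, so `0 or None` evaluates to None, and Popen.communicate(timeout=None) blocks until the process exits. A minimal sketch, with effective_timeout and run_command as hypothetical names:

```python
import subprocess
from typing import List, Optional


def effective_timeout(configured: Optional[float]) -> Optional[float]:
    """Map 0 (and None) to None, which subprocess treats as 'no timeout'."""
    return configured or None


def run_command(args: List[str], configured_timeout: Optional[float]) -> str:
    """Run a command and return its stdout, honouring the configured timeout."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, text=True)
    # timeout=None makes communicate() wait for the process indefinitely;
    # a positive number raises subprocess.TimeoutExpired when exceeded.
    stdout, _ = proc.communicate(timeout=effective_timeout(configured_timeout))
    return stdout
```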

Build Script for Runtime Images

  • Problem: Each runtime has its own Dockerfile extending workspace-template:base with pre-installed deps. Manually running docker build for each is error-prone — we shipped with 5-hour-old images and didn't notice.
  • Fix: New workspace/build-all.sh — builds base first, then all 6 runtime images in order. Supports selective builds (build-all.sh claude-code langgraph). Handles underscore/hyphen naming mismatch (dir claude_code → tag claude-code). No :latest tag — each runtime uses its own explicit tag.
  • Added missing error logging in activity.go List handler (was returning 500 "query failed" without logging the actual SQL error)
  • Files: workspace/build-all.sh (new), workspace-server/internal/provisioner/provisioner.go, workspace-server/internal/handlers/activity.go, CLAUDE.md
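
build-all.sh is a shell script; the sketch below only models its ordering and naming rules in Python. The RUNTIME_DIRS list is inferred from the adapters named in these notes, and the build_plan function is a hypothetical illustration, not a port of the script.

```python
from typing import List

# Adapter directory names (assumed); claude_code uses an underscore on disk.
RUNTIME_DIRS = ["langgraph", "claude_code", "crewai", "autogen", "deepagents", "openclaw"]


def tag_for_dir(runtime_dir: str) -> str:
    """Directory names use underscores, image tags use hyphens."""
    return runtime_dir.replace("_", "-")


def build_plan(selected: List[str]) -> List[str]:
    """Return tags in build order: base first, then the selected runtimes.

    An empty selection means "build everything".
    """
    known = {tag_for_dir(d) for d in RUNTIME_DIRS}
    unknown = set(selected) - known
    if unknown:
        raise ValueError(f"unknown runtimes: {sorted(unknown)}")
    plan = ["base"]  # every runtime image extends the base image
    for runtime_dir in RUNTIME_DIRS:
        tag = tag_for_dir(runtime_dir)
        if not selected or tag in selected:
            plan.append(tag)
    return plan
```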

Codebase Modularization (Major Refactoring)

Split 6 large files (~4,200 lines total) into 22 focused modules. Purely structural, with no behavior changes. All tests pass.

Platform handlers:

  • workspace.go (978→377 lines) → split out workspace_provision.go (217), workspace_restart.go (173), a2a_proxy.go (251)
  • templates.go (814→371 lines) → split out container_files.go (168), template_import.go (175)

Workspace template:

  • a2a_mcp_server.py (572→293 lines) → split out a2a_client.py (97), a2a_tools.py (275)

Canvas:

  • ConfigTab.tsx (738→310 lines) → split out config/form-inputs.tsx, config/secrets-section.tsx, config/yaml-utils.ts
  • ChatTab.tsx (635→340 lines) → split out chat/types.ts, chat/storage.ts, chat/message-parser.ts
  • canvas.ts (449→215 lines) → split out canvas-events.ts, canvas-topology.ts, canvas-capabilities.ts

Tier System Simplified (T1/T2/T3, removed T4)

  • T1 Sandboxed: No /workspace mount, config only (unchanged)
  • T2 Standard: Normal Docker + /workspace mount (unchanged, was identical to T3 before)
  • T3 Full Access: --privileged + --pid=host — full machine access for dev team
  • T4 removed: EC2 VMs were unimplemented; privileged Docker achieves the same goal
  • Updated provisioner switch statement, CreateWorkspaceDialog (3-col grid, no T4), docs/architecture/workspace-tiers.md (full rewrite)
  • Files: workspace-server/internal/provisioner/provisioner.go, canvas/src/components/CreateWorkspaceDialog.tsx, docs/architecture/workspace-tiers.md

Config Volume Persistence (Restart no longer overwrites)

  • Problem: Restart re-applied claude-code-default template, overwriting user config changes (e.g. model: opus → sonnet)
  • Fix: Restart handler skips templates by default. New "apply_template": true flag in restart body for explicit re-application (used when runtime changes).
  • RestartByID (auto-restart) also skips templates — passes empty template path
  • Files: workspace-server/internal/handlers/workspace_restart.go
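
The restart rule above reduces to a small decision: apply a template only when the request body explicitly asks for it. The sketch below assumes the re-applied template is the runtime's default ({runtime}-default) and uses an empty string for "skip templates", mirroring the empty template path mentioned for RestartByID; the function name is hypothetical.

```python
from typing import Optional


def template_for_restart(body: Optional[dict], runtime: str) -> str:
    """Return the template name to apply on restart, or "" to keep the volume as-is."""
    if body and body.get("apply_template") is True:
        return f"{runtime}-default"
    return ""  # default: leave the user's config volume untouched
```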

Skills Self-Improvement System

  • Documented how agents can create persistent skills in /configs/skills/<name>/SKILL.md
  • Skills are auto-loaded into system prompt via skills/loader.py
  • Skills persist on Docker named volume — survive restarts
  • Updated workspace-configs-templates/claude-code-default/CLAUDE.md with skills creation guide
  • Trained PM agent to convert operating procedures into skills

Agent Code Fixes (from agent-written code)

  • Fixed pytest.ini: removed --cov-fail-under=100 that broke test runner
  • Fixed 6 test files: replaced hardcoded /workspace/workspace/ paths with os.path.dirname(__file__) relative paths
  • Fixed aes_test.go: test key that wasn't 32 bytes after base64 decode
  • Fixed agent_test.go: SQL mock arg count mismatch (2 args for 1-param query)
  • Fixed liveness_test.go: unused variable
  • Cleaned up .coverage, .coveragerc, __pycache__, index_minimal.ts
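
The path fix above follows a common pattern: resolve test resources relative to the test file instead of a hardcoded container path. A minimal sketch, with sibling_path as a hypothetical helper name:

```python
import os


def sibling_path(test_file: str, *parts: str) -> str:
    """Build a path relative to the directory containing test_file.

    Call it as sibling_path(__file__, "fixtures", "config.yaml") from a test
    module so the test works regardless of where the repo is checked out.
    """
    return os.path.join(os.path.dirname(os.path.abspath(test_file)), *parts)
```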

Agent Training via A2A

  • Sent feedback to PM, Dev Lead, QA Engineer about test-writing rules, path handling, config discipline
  • All 3 agents committed rules to persistent memory
  • PM + dev team upgraded to Opus 4.6 model, T3 tier
  • Marketing/Research teams remain Sonnet, T2

Misc

  • .gitignore: Added .claude/worktrees/ to prevent stale worktrees showing as submodule changes

Workspace Pause/Resume (PR #4)

  • New POST /workspaces/:id/pause — stops container, sets status='paused', clears Redis keys
  • New POST /workspaces/:id/resume — re-provisions from existing config volume
  • Health sweep, liveness monitor, and auto-restart all skip paused workspaces
  • Canvas: indigo "Paused" status dot, Legend entry, context menu Pause/Resume toggle
  • WORKSPACE_PAUSED WebSocket event handled in canvas-events.ts
  • Cascade: pausing a parent pauses all descendants (recursive CTE), resuming does the reverse
  • Guard: children cannot restart or resume while any ancestor is paused (409 Conflict)
  • isParentPaused() recursive helper checks ancestor chain
  • Context menu: right-click nested team members now opens correct child menu (not parent's)
  • Context menu closes immediately on pause/resume click (before API call, not after)
  • Files: workspace-server/internal/handlers/workspace_restart.go, workspace-server/internal/router/router.go, workspace-server/internal/registry/liveness.go, canvas/src/store/canvas-events.ts, canvas/src/components/StatusDot.tsx, canvas/src/components/WorkspaceNode.tsx, canvas/src/components/Legend.tsx, canvas/src/components/ContextMenu.tsx
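
The cascade and guard rules above can be modelled in memory as follows. The platform does this in Go with a recursive CTE; the dictionary layout, the function names, and the 200 status for an allowed resume are assumptions made for the illustration. Only the 409 Conflict and the ancestor-chain check come from these notes.

```python
from typing import Dict, List, Optional

Workspace = Dict[str, Optional[str]]  # keys: "parent_id", "status"


def is_ancestor_paused(workspaces: Dict[str, Workspace], workspace_id: str) -> bool:
    """Walk up the parent chain and report whether any ancestor is paused."""
    seen = set()
    parent = workspaces[workspace_id]["parent_id"]
    while parent is not None and parent not in seen:
        seen.add(parent)  # defensive: stop if the data contains a cycle
        if workspaces[parent]["status"] == "paused":
            return True
        parent = workspaces[parent]["parent_id"]
    return False


def pause_tree(workspaces: Dict[str, Workspace], root_id: str) -> List[str]:
    """Pause root_id and every descendant; return the IDs in the order paused."""
    paused: List[str] = []
    queue = [root_id]
    while queue:
        current = queue.pop(0)
        workspaces[current]["status"] = "paused"
        paused.append(current)
        queue.extend(w for w, ws in workspaces.items() if ws["parent_id"] == current)
    return paused


def resume_status(workspaces: Dict[str, Workspace], workspace_id: str) -> int:
    """409 Conflict while any ancestor is paused, otherwise 200 (assumed)."""
    return 409 if is_ancestor_paused(workspaces, workspace_id) else 200
```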

Files Changed

  • canvas/src/components/tabs/ChatTab.tsx
  • canvas/src/components/tabs/ConfigTab.tsx
  • canvas/src/store/canvas.ts
  • canvas/src/store/__tests__/canvas.test.ts
  • workspace/a2a_executor.py
  • workspace/adapters/langgraph/adapter.py
  • workspace/adapters/deepagents/adapter.py
  • workspace/adapters/crewai/adapter.py
  • workspace/adapters/autogen/adapter.py
  • workspace/adapters/openclaw/adapter.py
  • workspace/tests/test_a2a_executor.py
  • workspace-server/cmd/server/main.go
  • workspace-server/internal/db/redis.go
  • workspace-server/internal/handlers/workspace.go
  • workspace-server/internal/handlers/handlers_test.go
  • workspace-server/internal/router/router.go
  • workspace-server/internal/registry/healthsweep.go (new)
  • workspace-server/internal/registry/healthsweep_test.go (new)
  • workspace/a2a_mcp_server.py
  • workspace/adapters/claude_code/adapter.py
  • workspace/cli_executor.py
  • workspace/coordinator.py
  • setup-org.sh
  • CLAUDE.md
  • docs/architecture/provisioner.md
  • workspace/config.py
  • workspace-configs-templates/claude-code-default/config.yaml
  • workspace-configs-templates/claude-code-default/CLAUDE.md
  • workspace-server/internal/handlers/activity.go
  • canvas/src/store/socket.ts
  • workspace-server/internal/provisioner/provisioner.go
  • workspace/build-all.sh (new)
  • docs/agent-runtime/cli-runtime.md
  • docs/agent-runtime/config-format.md
  • workspace-server/internal/handlers/workspace_provision.go (new — extracted from workspace.go)
  • workspace-server/internal/handlers/workspace_restart.go (new — extracted from workspace.go)
  • workspace-server/internal/handlers/a2a_proxy.go (new — extracted from workspace.go)
  • workspace-server/internal/handlers/container_files.go (new — extracted from templates.go)
  • workspace-server/internal/handlers/template_import.go (new — extracted from templates.go)
  • workspace/a2a_client.py (new — extracted from a2a_mcp_server.py)
  • workspace/a2a_tools.py (new — extracted from a2a_mcp_server.py)
  • workspace/tests/test_mcp_memory.py
  • canvas/src/store/canvas-events.ts (new — extracted from canvas.ts)
  • canvas/src/store/canvas-topology.ts (new — extracted from canvas.ts)
  • canvas/src/store/canvas-capabilities.ts (new — extracted from canvas.ts)
  • canvas/src/components/tabs/chat/types.ts (new)
  • canvas/src/components/tabs/chat/storage.ts (new)
  • canvas/src/components/tabs/chat/message-parser.ts (new)
  • canvas/src/components/tabs/chat/index.ts (new)
  • canvas/src/components/tabs/config/form-inputs.tsx (new)
  • canvas/src/components/tabs/config/secrets-section.tsx (new)
  • canvas/src/components/tabs/config/yaml-utils.ts (new)
  • canvas/src/components/tabs/config/index.ts (new)