molecule-core/docs/edit-history/2026-04-09.md
Hongming Wang d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00


# 2026-04-09 Session
## Summary
Infrastructure hardening: removed exposed database ports, switched the Postgres connection from `sslmode=disable` to `sslmode=prefer`, added HTTP security headers middleware, added healthchecks, and gitignored cryptographic key files. Expanded handler unit test coverage with 22 additional edge-case tests. Fixed an outdated T4 tier documentation reference.
Documentation sync: refreshed the English and Chinese README, VitePress docs home, quickstart, product overview, runtime/memory/canvas/API docs, and tightened wording so runtime count, memory architecture, global secrets, onboarding, and WebSocket-first chat behavior all match the current `main` branch.
## Changes
### Network Isolation (docker-compose.yml)
- Removed exposed host ports for Postgres (5432) and Redis (6379)
- Both services now communicate exclusively over internal `molecule-monorepo-net` Docker network
- Prevents accidental direct access from host or external containers
### Database SSL (docker-compose.yml)
- Changed `DATABASE_URL` sslmode from `disable` to `prefer`
- Added comment that production deployments must use `sslmode=require`
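The production requirement noted above could also be enforced at startup. The following is a minimal sketch, not code from this repo: `sslMode` and `checkProductionDSN` are hypothetical helpers, the database name in the example URL is made up, and accepting `verify-ca` and `verify-full` alongside `require` is an assumption (they are the stricter libpq modes).

```go
package main

import (
	"fmt"
	"net/url"
)

// sslMode extracts the sslmode query parameter from a URL-style Postgres DSN.
func sslMode(dsn string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", err
	}
	return u.Query().Get("sslmode"), nil
}

// checkProductionDSN (hypothetical) rejects any DSN whose sslmode is weaker
// than "require". Allowing the two verify-* modes is an assumption.
func checkProductionDSN(dsn string) error {
	mode, err := sslMode(dsn)
	if err != nil {
		return err
	}
	switch mode {
	case "require", "verify-ca", "verify-full":
		return nil
	}
	return fmt.Errorf("sslmode=%q is not allowed in production; use sslmode=require", mode)
}

func main() {
	// Example DSNs only; the database name "molecule" is illustrative.
	dev := "postgres://dev:dev@postgres:5432/molecule?sslmode=prefer"
	prod := "postgres://app:secret@db.internal:5432/molecule?sslmode=require"
	fmt.Println(checkProductionDSN(dev) != nil)
	fmt.Println(checkProductionDSN(prod) == nil)
}
```

With `prefer`, the client falls back to an unencrypted connection when the server does not offer TLS, which is why it suits local development but not production.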
### Postgres Password Warning (docker-compose.yml)
- Added healthcheck warning that fires if `POSTGRES_PASSWORD` is still set to the default `dev` value
### Langfuse DB Init Healthcheck (docker-compose.infra.yml)
- Added healthcheck to `langfuse-db-init` service to verify initialization completes
### HTTP Security Headers (workspace-server/internal/middleware/securityheaders.go)
- New middleware setting `X-Content-Type-Options: nosniff`, `X-Frame-Options: DENY`, `X-XSS-Protection: 1; mode=block`
- Wired into router after CORS middleware (`workspace-server/internal/router/router.go`)
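A middleware of this kind can be sketched as follows. This is an illustration using plain `net/http`; the actual `securityheaders.go` may be written against a different router framework, and `demoHeaders` is a hypothetical helper that exists only to exercise the wrapper.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// SecurityHeaders sets the three response headers listed above and then
// calls the wrapped handler. Using net/http directly is an assumption.
func SecurityHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		h := w.Header()
		h.Set("X-Content-Type-Options", "nosniff")
		h.Set("X-Frame-Options", "DENY")
		h.Set("X-XSS-Protection", "1; mode=block")
		next.ServeHTTP(w, r)
	})
}

// demoHeaders (hypothetical) sends one request through the middleware and
// returns the response headers that were recorded.
func demoHeaders() http.Header {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	SecurityHeaders(ok).ServeHTTP(rec, req)
	return rec.Header()
}

func main() {
	h := demoHeaders()
	fmt.Println(h.Get("X-Content-Type-Options"))
	fmt.Println(h.Get("X-Frame-Options"))
	fmt.Println(h.Get("X-XSS-Protection"))
}
```

Setting the headers before calling the wrapped handler matters: once the handler writes the status line, later header changes are ignored.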
### Gitignore Patterns (.gitignore)
- Added `*.pem`, `*.key`, `*.crt`, `*.p12`, `*.pfx` to prevent accidental commits of cryptographic material
### Documentation Updates
- `docs/architecture/architecture.md`: Added Security section (headers, network isolation, DB SSL, gitignore patterns)
- `docs/development/local-development.md`: Updated service table (Postgres/Redis now "internal only"), added note about `docker compose exec` for direct access, updated DATABASE_URL with sslmode
- `docs/api-protocol/platform-api.md`: Updated DATABASE_URL env var with sslmode
- `docs/development/constraints-and-rules.md`: Added rules #13 (security headers) and #14 (no exposed database ports)
### Handler Unit Tests (workspace-server/internal/handlers/handlers_additional_test.go)
- Added 22 new edge-case tests covering gaps across all 6 critical handlers
- **workspace.go**: Create with parent_id, explicit claude-code runtime, missing name validation, update name-only, update parent_id, list with data (role/agent_card parsing)
- **registry.go**: Provisioner URL preservation during register, exact threshold (0.5) degraded transition, degraded→online recovery
- **a2a_proxy.go**: Workspace with no URL (503), agent unreachable (502), nilIfEmpty utility
- **discovery.go**: Access denied between different teams, target offline (503), sibling access allowed, parent→child access, different teams denied
- **secrets.go**: Auto-restart on Set/Delete, nil restart func safety, UUID validation edge cases (uppercase, no hyphens, SQL injection), invalid JSON handling
- Total handler tests: 187 across 14 test files
### Comprehensive Handler Unit Tests (6 new test files — 73 additional tests)
- **workspace_test.go** (14 tests): Get success/not-found/DB-error, Create bad-JSON/DB-error/defaults-applied, List empty/DB-error, Update bad-JSON/multiple-fields/runtime, Delete confirmation-required/cascade-with-children/children-query-error
- **registry_test.go** (12 tests): Register bad-JSON/missing-fields/DB-error, Heartbeat offline→online/bad-JSON/missing-ID/DB-error/online-stays-online, UpdateCard success/bad-JSON/missing-fields/DB-error
- **a2a_proxy_test.go** (7 tests): Invalid JSON, already-wrapped JSON-RPC, DB lookup fallback, DB lookup error, agent returns error, messageId injection, caller-ID propagation
- **discovery_test.go** (10 tests): Missing caller header, workspace-not-found with caller, external not-found, Peers with-parent/not-found/DB-error/root-no-peers, CheckAccess bad-JSON/missing-fields/same-workspace
- **workspace_provision_test.go** (13 tests): workspaceAwarenessNamespace (3 cases), configDirName (5 cases), findTemplateByName by-dir/by-config-yaml/not-found/skips-ws-prefix/invalid-dir, ensureDefaultConfig langgraph/claude-code/custom-model/special-chars, buildProvisionerConfig basic/env-vars
- **secrets_test.go** (17 tests): List success/empty/invalid-UUID/DB-error, Set invalid-UUID/missing-key/missing-value/success/auto-restart/DB-error, Delete success/not-found/invalid-UUID/DB-error/auto-restart, GetModel default/DB-error
- Also fixed pre-existing panic in `handlers_additional_test.go` TestSecretsUUIDValidation (SQL injection test path caused httptest.NewRequest panic)
- Total Go platform tests: 263 across 15 test files
### QA Feedback Fixes (Restart/Pause/Resume tests + time.Sleep replacement)
- **handlers_additional_test.go** (15 new tests): Restart not-found/DB-error/parent-paused/provisioner-nil, Pause success/not-found/DB-error/with-descendants, Resume not-paused/DB-error/provisioner-nil, RestartByID provisioner-nil/removed-skipped. Replaced time.Sleep with channel-based sync in 2 secrets restart callback tests.
- **secrets_test.go**: Replaced time.Sleep(100ms) with channel-based sync in TestSecretsSet_AutoRestart and TestSecretsDelete_AutoRestart (2 tests)
- Total Go platform tests: 278 across 15 test files (was 263)
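The `time.Sleep` replacement follows a common Go testing pattern: the asynchronous callback signals on a channel, and the test waits on that channel with a timeout. The sketch below illustrates the pattern only; `restarter`, `awaitRestart`, and `demoTimeout` are hypothetical names, not the identifiers used in `secrets_test.go`.

```go
package main

import (
	"fmt"
	"time"
)

// demoTimeout bounds how long the waiter blocks before giving up.
const demoTimeout = 2 * time.Second

// restarter is a hypothetical stand-in for a handler that triggers a
// restart callback asynchronously after a secret changes.
type restarter struct {
	onRestart func(workspaceID string)
}

// SetSecret fires the restart callback in a goroutine, if one is set.
func (r *restarter) SetSecret(workspaceID string) {
	if r.onRestart != nil {
		go r.onRestart(workspaceID)
	}
}

// awaitRestart calls SetSecret and blocks until the callback reports the
// workspace ID on the channel, or until the timeout elapses.
func awaitRestart(workspaceID string, timeout time.Duration) (string, bool) {
	called := make(chan string, 1) // buffered so the callback never blocks
	r := &restarter{onRestart: func(id string) { called <- id }}
	r.SetSecret(workspaceID)
	select {
	case id := <-called:
		return id, true
	case <-time.After(timeout):
		return "", false
	}
}

func main() {
	id, ok := awaitRestart("ws-1", demoTimeout)
	fmt.Println(id, ok)
}
```

Compared with a fixed sleep, the test returns as soon as the callback fires and only pays the full timeout when something is actually broken.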
## Files Changed
- `docker-compose.yml`
- `docker-compose.infra.yml`
- `workspace-server/internal/middleware/securityheaders.go` (new)
- `workspace-server/internal/router/router.go`
- `.gitignore`
- `docs/architecture/architecture.md`
- `docs/development/local-development.md`
- `docs/api-protocol/platform-api.md`
- `docs/development/constraints-and-rules.md`
- `workspace-server/internal/handlers/handlers_additional_test.go` (new — 37 tests: 22 edge-case + 15 restart/pause/resume; SQL injection test panic fixed; time.Sleep replaced with channels)
- `workspace-server/internal/handlers/workspace_test.go` (new — 14 tests)
- `workspace-server/internal/handlers/registry_test.go` (new — 12 tests)
- `workspace-server/internal/handlers/a2a_proxy_test.go` (new — 7 tests)
- `workspace-server/internal/handlers/discovery_test.go` (new — 10 tests)
- `workspace-server/internal/handlers/workspace_provision_test.go` (new — 13 tests)
- `workspace-server/internal/handlers/secrets_test.go` (new — 17 tests)
- `workspace-server/internal/handlers/secrets_test.go` (updated — time.Sleep replaced with channels in 2 tests)
- `CLAUDE.md` (updated Go test count: 141 → 278)
- `docs/architecture/technology-choices.md` (fixed outdated T4 "EC2 VMs" reference → Docker-based full-host)
### CI Pipeline Hardening (.github/workflows/ci.yml)
- Go tests now run with `-race` flag for data race detection
- Added Go coverage report step: `go test -race -coverprofile=coverage.out ./... && go tool cover -func=coverage.out`
- Removed `--passWithNoTests` from vitest — Canvas tests are now required to exist and pass
- Added `pytest-cov` to Python test dependencies and enabled `--cov=. --cov-report=term-missing`
### Documentation Updates (CI hardening)
- `CLAUDE.md`: Updated "Unit Tests" commands and "CI Pipeline" section to reflect race detection, coverage, and stricter vitest
- `docs/development/local-development.md`: Updated "Unit Tests" commands and "CI Pipeline" section to match
### Canvas Error Boundary (canvas/src/components/ErrorBoundary.tsx — new)
- React class component implementing `getDerivedStateFromError` + `componentDidCatch`
- Full-screen fallback UI: dark overlay with error icon, error message, "Reload" button (triggers `window.location.reload()`), "Report" link (opens mailto with error details)
- Logs caught errors and component stack to `console.error`
- Handles null errors gracefully (displays "Unknown error")
- Wrapped around `{children}` in `canvas/src/app/layout.tsx` — catches all unhandled React render errors app-wide
### Hydration Error Banner (canvas/src/app/page.tsx)
- Added `hydrationError` state — set when initial `GET /workspaces` or `GET /canvas/viewport` fetch fails
- Displays a fixed red banner at top of viewport with error message including `PLATFORM_URL` for debugging
- "Retry" button clears the error and re-attempts hydration (calls `hydrateData()` again)
- Viewport fetch failure is non-fatal — only workspace fetch failure triggers the banner
### Vitest OXC JSX Config (canvas/vitest.config.ts)
- Added `oxc.jsx = 'automatic'` and `oxc.jsxImportSource = 'react'` to support TSX test files
- Required for ErrorBoundary.test.tsx which uses `React.createElement` and class component instantiation
### Canvas Error Boundary Tests (canvas/src/components/__tests__/ErrorBoundary.test.tsx — new, 7 tests)
- Pure-unit tests instantiating the class directly (no DOM renderer needed in vitest `environment: "node"`)
- `getDerivedStateFromError` returns correct state
- `componentDidCatch` logs to console.error with component stack
- Initial state has no error
- `render()` returns children when no error
- `render()` returns fallback UI with fixed/inset-0 class when error
- Fallback UI contains error message, "Something went wrong", Reload/Report buttons
- Fallback UI handles null error gracefully ("Unknown error")
### Hydration Error Tests (canvas/src/app/__tests__/page-hydration.test.ts — new, 5 tests)
- Tests hydration logic in isolation (mocks fetch, socket, canvas store)
- No error when fetches succeed
- Error message set when workspace fetch fails (includes PLATFORM_URL)
- Retry clears previous error and re-attempts fetch
- Viewport fetch failure is non-fatal (succeeds with workspace data only)
- Total Canvas Vitest tests: 188 across 8 test files (was 176)
### Documentation Updates (Error Boundary)
- `CLAUDE.md`: Updated Vitest test count (61 → 188)
- `docs/frontend/canvas.md`: Added Error Handling section documenting ErrorBoundary and hydration error banner
## Files Changed (Error Boundary)
- `canvas/src/components/ErrorBoundary.tsx` (new)
- `canvas/src/app/layout.tsx` (modified — wraps children with ErrorBoundary)
- `canvas/src/app/page.tsx` (modified — hydration error state + banner + retry)
- `canvas/vitest.config.ts` (modified — added oxc jsx config)
- `canvas/src/components/__tests__/ErrorBoundary.test.tsx` (new — 7 tests)
- `canvas/src/app/__tests__/page-hydration.test.ts` (new — 5 tests)
- `CLAUDE.md` (updated Vitest test count)
- `docs/frontend/canvas.md` (added Error Handling section)
---
### Sprint: Handler Unit Tests (feat/handler-unit-tests — 80 new tests)
- **workspace_restart_test.go** (10 tests): Restart not-found/DB-error/ancestor-paused/nil-provisioner, Pause not-found/DB-error/success-no-children, Resume not-paused/DB-error/nil-provisioner
- **templates_test.go** (24 tests): validateRelPath valid/invalid, List empty/with-templates/nonexistent-dir, ListFiles invalid-root/not-found/fallback-no-template/fallback-with-template, ReadFile path-traversal/invalid-root/not-found/fallback-success/fallback-not-found, WriteFile path-traversal/invalid-body/not-found, DeleteFile path-traversal/not-found, SharedContext not-found/no-template/with-files, resolveTemplateDir by-name/not-found
- **template_import_test.go** (14 tests): normalizeName 9 cases, generateDefaultConfig with-files/empty, writeFiles success/path-traversal, Import success/missing-name/too-many-files/already-exists/with-config-yaml, ReplaceFiles missing-body/too-many-files/not-found/path-traversal
- **memory_test.go** (13 tests): List success/empty/DB-error, Get success/not-found/DB-error, Set success/with-TTL/missing-key/invalid-JSON/DB-error, Delete success/DB-error
- **events_test.go** (5 tests): List success/empty/DB-error, ListByWorkspace success/DB-error
- **config_test.go** (6 tests): Get success/no-config/DB-error, Patch success/invalid-JSON/DB-error
- **viewport_test.go** (5 tests): Get success/no-saved-viewport, Save success/invalid-body/DB-error
- **traces_test.go** (3 tests): No Langfuse config, partial config, unreachable Langfuse
- Total Go platform tests: 358 across 23 test files (was 278)
### Sprint: Docker Compose Hardening (feat/infra-hardening)
- Removed exposed host ports for Postgres (5432) and Redis (6379) — services only communicate over internal Docker network
- Changed DATABASE_URL sslmode from `disable` to `prefer` for dev flexibility
- Added WARNING comments on dev-only credentials (dev:dev Postgres, Langfuse secret/salt defaults)
- Added X-Content-Type-Options: nosniff and X-Frame-Options: DENY security headers middleware in router.go
- Added router_test.go verifying security headers on /health and API endpoints
### Sprint: Provisioner Tier 2/4 Enforcement (feat/tier-enforcement)
- Extracted tier logic from `Start()` into exported `ApplyTierConfig()` function for testability
- Added Tier 1: Sandboxed — readonly rootfs, tmpfs /tmp, strip /workspace mount
- Documented Tier 2: Standard — resource limits (512 MiB memory, 1 CPU), no special flags (default for unknown/zero tiers)
- Kept Tier 3: Privileged — privileged mode, host PID, Docker network (not host)
- Added Tier 4: Full Access — privileged, host PID, host network, Docker socket mount
- All 11 provisioner tests pass (T1-T4, unknown tier, zero tier, tier escalation matrix)
- Updated docs/architecture/workspace-tiers.md and docs/architecture/provisioner.md with 4-tier model
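The tier rules listed above can be sketched as one pure function. This is an illustration under stated assumptions: `ContainerOpts` is a hypothetical, simplified stand-in for the Docker host configuration, the real `ApplyTierConfig()` signature may differ, and the socket path `/var/run/docker.sock` is assumed to be Docker's default.

```go
package main

import "fmt"

// ContainerOpts is a hypothetical, simplified view of container settings.
type ContainerOpts struct {
	ReadonlyRootfs bool
	Tmpfs          []string
	Mounts         []string // container-side mount targets
	Privileged     bool
	HostPID        bool
	HostNetwork    bool
	MemoryBytes    int64
	CPUs           float64
}

// without returns a copy of items with every occurrence of drop removed.
func without(items []string, drop string) []string {
	kept := make([]string, 0, len(items))
	for _, it := range items {
		if it != drop {
			kept = append(kept, it)
		}
	}
	return kept
}

// ApplyTierConfig returns opts adjusted for the given tier. Unknown and
// zero tiers fall through to the Tier 2 defaults.
func ApplyTierConfig(tier int, opts ContainerOpts) ContainerOpts {
	switch tier {
	case 1: // Sandboxed
		opts.ReadonlyRootfs = true
		opts.Tmpfs = append(without(opts.Tmpfs, ""), "/tmp")
		opts.Mounts = without(opts.Mounts, "/workspace")
	case 3: // Privileged, stays on the Docker network
		opts.Privileged = true
		opts.HostPID = true
	case 4: // Full Access
		opts.Privileged = true
		opts.HostPID = true
		opts.HostNetwork = true
		opts.Mounts = append(without(opts.Mounts, ""), "/var/run/docker.sock")
	default: // Standard (Tier 2)
		opts.MemoryBytes = 512 << 20 // 512 MiB
		opts.CPUs = 1
	}
	return opts
}

func main() {
	base := ContainerOpts{Mounts: []string{"/workspace", "/configs"}}
	for _, tier := range []int{0, 1, 2, 3, 4, 99} {
		fmt.Printf("tier %d: %+v\n", tier, ApplyTierConfig(tier, base))
	}
}
```

Keeping the function free of Docker client calls is what makes it straightforward to unit test each tier, including the unknown and zero cases.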
## Sprint Files Changed
- `workspace-server/internal/handlers/workspace_restart_test.go` (new — 10 tests)
- `workspace-server/internal/handlers/templates_test.go` (new — 24 tests)
- `workspace-server/internal/handlers/template_import_test.go` (new — 14 tests)
- `workspace-server/internal/handlers/memory_test.go` (new — 13 tests)
- `workspace-server/internal/handlers/events_test.go` (new — 5 tests)
- `workspace-server/internal/handlers/config_test.go` (new — 6 tests)
- `workspace-server/internal/handlers/viewport_test.go` (new — 5 tests)
- `workspace-server/internal/handlers/traces_test.go` (new — 3 tests)
- `docker-compose.yml` (ports removed, sslmode changed, warning comments added)
- `workspace-server/internal/router/router.go` (security headers middleware)
- `workspace-server/internal/router/router_test.go` (new — 2 tests)
- `workspace-server/internal/provisioner/provisioner.go` (ApplyTierConfig extracted, T2/T4 added)
- `docs/architecture/workspace-tiers.md` (updated for 4-tier model)
- `docs/architecture/provisioner.md` (updated tier table and descriptions)
### Remaining Audit Fixes (PR #16)
- **Hub double-close race**: `sync.Once` on `Close()`, `done` channel guards `ReadPump` deferred `Unregister` send. `Run()` exits on done signal. Prevents panic on concurrent shutdown.
- **Silent ExecContext in team.go**: expand layout insert and collapse remove/delete now log errors.
- **A2A proxy canvas timeout**: canvas-initiated requests get 5-min timeout; workspace-to-workspace (delegation chains) keep no timeout.
- **Python JSONDecodeError guards**: `delegation.py` and `approval.py` catch invalid JSON responses with specific error messages.
- **Ephemeral port retry**: provisioner retries `ContainerInspect` 3x with 500ms delay if Docker hasn't bound the port.
- **Files**: `workspace-server/internal/ws/hub.go`, `workspace-server/internal/handlers/team.go`, `workspace-server/internal/handlers/a2a_proxy.go`, `workspace-server/internal/provisioner/provisioner.go`, `workspace/tools/delegation.py`, `workspace/tools/approval.py`
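The hub double-close fix combines `sync.Once` with a `done` channel. The sketch below shows that combination in isolation. The `Hub` type here is a simplified, hypothetical version: the real `hub.go` tracks clients and broadcasts, which are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub is a simplified, hypothetical hub with only the shutdown pieces.
type Hub struct {
	unregister chan string
	done       chan struct{}
	closeOnce  sync.Once
}

func NewHub() *Hub {
	return &Hub{unregister: make(chan string), done: make(chan struct{})}
}

// Run handles unregister requests until the done channel is closed.
func (h *Hub) Run() {
	for {
		select {
		case <-h.unregister:
			// A real hub would remove the client here.
		case <-h.done:
			return
		}
	}
}

// Close may be called many times from many goroutines; the done channel
// is closed exactly once, so a second call cannot panic.
func (h *Hub) Close() {
	h.closeOnce.Do(func() { close(h.done) })
}

// Unregister sends to the hub unless it has shut down, in which case it
// returns false instead of blocking forever.
func (h *Hub) Unregister(clientID string) bool {
	select {
	case h.unregister <- clientID:
		return true
	case <-h.done:
		return false
	}
}

func main() {
	h := NewHub()
	stopped := make(chan struct{})
	go func() { h.Run(); close(stopped) }()

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); h.Close() }()
	}
	wg.Wait()
	<-stopped // Run has exited
	fmt.Println(h.Unregister("client-1")) // false: the hub is closed
}
```

The `select` in `Unregister` plays the role described for the `ReadPump` deferred send: a goroutine that is still winding down cannot block on a hub that has already stopped reading.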
### Branch Cleanup
- Deleted 10 stale remote branches (merged PRs + agent branches with 0 unique commits)
- Closed PR #5 (NemoClaw) in favor of `feat/nemoclaw-t4-docker` WIP branch
- Final state: `main` + `feat/nemoclaw-t4-docker` only
### Canvas Stale Tab State Fix (PR #18)
- **SidePanel.tsx**: Added `key={selectedNodeId}` to all 10 tab components — forces React to remount when switching workspaces, preventing chat/config/terminal from showing previous workspace's data
- **ChatTab.tsx**: Skip initial localStorage save on mount (was writing back the data just loaded). Removed workspaceId reload effect since key-based remounting handles it.
- Agent-authored fix, reviewed and verified by Claude Code
- **Files**: `canvas/src/components/SidePanel.tsx`, `canvas/src/components/tabs/ChatTab.tsx`
### Phase 1 Delivery — Streaming, Onboarding, Global API Keys (PR #21)
- **A2A streaming response**: proxy broadcasts `A2A_RESPONSE` via WebSocket on completion. ChatTab receives instantly, poll fallback reduced to 10s (recovery only). Added `responseReceivedRef` to prevent duplicate messages from poll+WS race.
- **Critical fix**: restored `context.WithoutCancel` in a2a_proxy.go — agents removed it, which would cancel delegation chains when browser tab closes.
- **Onboarding wizard**: 4-step guided setup (OnboardingWizard.tsx, 185 lines)
- **Global API keys**: Migration 012 `global_secrets` table. Secrets API returns merged workspace+global view with scope field.
- **VitePress docs site**: quickstart.md, index.md, .vitepress/config.ts
- **Files**: 27 files changed across platform, canvas, docs
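The reason `context.WithoutCancel` matters for delegation chains can be shown in a few lines. This is a sketch rather than the code in `a2a_proxy.go`: `detach`, `demoDetach`, and `callerKey` are hypothetical names, and the optional timeout mirrors the canvas-initiated limit described in the audit fixes. `context.WithoutCancel` requires Go 1.21 or later.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type ctxKey string

// callerKey is a hypothetical context key carrying the caller identity.
const callerKey ctxKey = "caller"

// detach returns a context that keeps the parent's values but ignores its
// cancellation. A positive timeout adds an independent deadline.
func detach(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	ctx := context.WithoutCancel(parent)
	if timeout > 0 {
		return context.WithTimeout(ctx, timeout)
	}
	return ctx, func() {}
}

// demoDetach cancels the parent (as a closed browser tab would) and reports
// what the detached context still sees.
func demoDetach() (string, error) {
	base := context.WithValue(context.Background(), callerKey, "canvas")
	parent, cancel := context.WithCancel(base)
	ctx, stop := detach(parent, 0)
	defer stop()

	cancel() // the inbound request goes away

	caller, _ := ctx.Value(callerKey).(string)
	return caller, ctx.Err()
}

func main() {
	caller, err := demoDetach()
	fmt.Println(caller, err) // canvas <nil>
}
```

Without the detach step, cancelling the inbound HTTP request would propagate to every outbound call made with the same context, which is the failure mode the fix restores protection against.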
### Coordinator Delegation Enforcement (PR #20)
- Removed "handle the task yourself" escape hatch from coordinator.py
- All coordinators (PM, Dev Lead, Research Lead, Marketing Lead) MUST delegate
- Added language matching rule to all agent prompts
- Corrected PM, Dev Lead, Research Lead, Marketing Lead via direct A2A
### Documentation Refresh (README + docs sync)
- Rewrote `README.md` and `README.zh-CN.md` as current repo homepages around the real product positioning: org-native control plane, heterogeneous runtime compatibility, HMA memory, skill evolution, canvas, and operational guardrails
- Elevated both README files again into a more commercial GitHub-homepage structure with stronger category framing, sharper competitive positioning, clearer defensibility, and a more shareable first-screen narrative
- Added an explicit compatibility comparison table and kept `NemoClaw` labeled as WIP branch work instead of merged `main` support
- Updated `docs/index.md` feature cards and quick reference to reflect the real six-adapter `main` surface, global secrets, and skill evolution
- Reworked `docs/quickstart.md` to match the current empty-state deployment flow, onboarding wizard, config/secrets UI, and WebSocket-first chat path
- Tightened `docs/product/overview.md` around the current abstraction boundary: workspaces as roles, not task nodes
- Rewrote `docs/agent-runtime/workspace-runtime.md` to match current startup flow, hot reload, awareness-backed memory, plugin loading, and coordinator-only delegation behavior
- Corrected `docs/architecture/memory.md` to describe the current implementation accurately: scoped `agent_memories`, key/value `workspace_memory`, session-search recall, optional awareness backend, and optional future pgvector extension
- Rewrote `docs/frontend/canvas.md` so the side-panel tab count, onboarding, global secret scopes, drag-to-nest teams, and `A2A_RESPONSE` delivery path match the current UI
- Rewrote `docs/api-protocol/platform-api.md` to reflect the real route surface, global secrets, pause/resume, activity recall, files roots, and `RATE_LIMIT=600` default
## Files Changed (Documentation Refresh)
- `README.md`
- `README.zh-CN.md`
- `docs/index.md`
- `docs/quickstart.md`
- `docs/product/overview.md`
- `docs/agent-runtime/workspace-runtime.md`
- `docs/architecture/memory.md`
- `docs/frontend/canvas.md`
- `docs/api-protocol/platform-api.md`
### Chat Rewrite — DB-backed History (PR #24, #25)
- **Replaced localStorage with database**: Chat messages now load from `activity_logs` table via `GET /workspaces/:id/activity?type=a2a_receive`. Each workspace has its own history, persisted in Postgres.
- **Removed**: localStorage sessions, session sidebar, session management, `chat/storage.ts`, `ChatSession` type (416 lines deleted)
- **Kept**: Real-time via A2A_RESPONSE WebSocket + push messages, conversation history in A2A metadata
- **Cleanup**: Removed broad `startsWith("CRITICAL")` message filter, dead code
- **Fixes**: Workspace switching now correctly shows per-agent chat history
- **Files**: `canvas/src/components/tabs/ChatTab.tsx` (579→346 lines), `chat/storage.ts` (deleted), `chat/types.ts`, `chat/index.ts`
### External Workspace Bridge — Pluggable A2A Agent Framework (PRs #28-#34)
- **Native external workspace type**: `POST /workspaces` with `external: true` skips Docker provisioning, sets URL directly, marks online immediately
- **Platform guards**: health sweep, auto-restart, and A2A proxy container checks all skip external workspaces (runtime='external')
- **Pluggable bridge**: `scripts/bridge/` package with MessageProcessor interface and 5 built-in backends:
  - `claude-code`: spawns `claude --print` CLI with codebase access
  - `openai`: calls any OpenAI-compatible API
  - `anthropic`: calls Anthropic API directly
  - `http`: forwards to any HTTP endpoint
  - `echo`: testing
- **Auto-respond**: bridge processes messages immediately via the configured backend — agents get instant technical answers
- **API key validation**: OpenAI/Anthropic processors check for missing keys at init + process time
- **Files**: `scripts/bridge/{__init__,processor,server,platform}.py`, `scripts/claude-code-bridge.py`, `workspace-server/internal/{handlers,registry,models}/`
### Chat Rewrite + Coordinator Enforcement + Language Rules
- **Chat from DB**: replaced localStorage with activity_logs database (PR #24-#25)
- **Coordinator rules**: removed "handle it yourself" escape hatch (PR #20)
- **Language matching**: all agents respond in user's language (Chinese in → Chinese out)
### Org Template Import — Platform-Native Org Deployment (PR #35)
- **New endpoints**: `GET /org/templates` lists available org templates, `POST /org/import {"dir":"molecule-dev"}` creates entire hierarchy
- **Folder-based templates**: each org is a directory with `org.yaml` + per-workspace folders containing system-prompt.md, skills/, CLAUDE.md, .env
- **Per-workspace .env secrets**: each workspace folder can have a `.env` file (gitignored). On import, parsed and stored as encrypted workspace secrets. Resolution: workspace .env → org root .env (workspace overrides).
- **Canvas positions**: `canvas: {x, y}` in org.yaml for initial node placement
- **files_dir**: copies folder contents into workspace /configs (system prompts, tools, memory)
- **Replaces**: setup-org.sh and setup_reno_stars.sh shell scripts
- **Templates**: `org-templates/molecule-dev/` (11 workspaces, PM + Research + Dev teams)
- **Files**: `workspace-server/internal/handlers/org.go`, `workspace-server/internal/router/router.go`, `org-templates/`
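The `.env` resolution order described above (workspace values override org-root values) can be sketched as a parse step followed by a two-layer merge. This is a simplified illustration: `parseDotEnv` and `resolveSecrets` are hypothetical names, the parser handles only plain `KEY=VALUE` lines with `#` comments, and the quoting rules of the real importer are not known from this log. Encryption before storage is omitted.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDotEnv reads KEY=VALUE lines, skipping blanks and # comments.
// Quoting and escaping are deliberately not handled in this sketch.
func parseDotEnv(content string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		key = strings.TrimSpace(key)
		if !ok || key == "" {
			continue
		}
		out[key] = strings.TrimSpace(val)
	}
	return out
}

// resolveSecrets layers workspace values over org-root values, so a key
// defined in both places takes the workspace value.
func resolveSecrets(orgRoot, workspace map[string]string) map[string]string {
	merged := make(map[string]string, len(orgRoot)+len(workspace))
	for k, v := range orgRoot {
		merged[k] = v
	}
	for k, v := range workspace {
		merged[k] = v
	}
	return merged
}

func main() {
	org := parseDotEnv("# org defaults\nGITHUB_TOKEN=org-token\nGITHUB_REPO=example/repo\n")
	ws := parseDotEnv("GITHUB_TOKEN=pm-token\n")
	merged := resolveSecrets(org, ws)
	fmt.Println(merged["GITHUB_TOKEN"]) // pm-token
	fmt.Println(merged["GITHUB_REPO"])  // example/repo
}
```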
### Discovery Fix for External Workspaces
- Discovery handler rewrites `127.0.0.1` → `host.docker.internal` for external workspaces so containers can reach host-side bridge
- Tested: PM successfully delegated to Claude Code Advisor and got response back
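The host rewrite can be sketched with `net/url`. This is an assumption about the approach, not the discovery handler's actual code: `rewriteLoopback` is a hypothetical helper, and it touches only the literal `127.0.0.1` host named above, leaving every other URL unchanged.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// rewriteLoopback replaces a 127.0.0.1 host with host.docker.internal,
// preserving scheme, port, and path. Other hosts are returned as-is.
func rewriteLoopback(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Hostname() != "127.0.0.1" {
		return raw, nil
	}
	if port := u.Port(); port != "" {
		u.Host = net.JoinHostPort("host.docker.internal", port)
	} else {
		u.Host = "host.docker.internal"
	}
	return u.String(), nil
}

func main() {
	// Example URLs only; the port and path are illustrative.
	for _, raw := range []string{"http://127.0.0.1:9000/a2a", "http://10.0.0.5:8000"} {
		out, err := rewriteLoopback(raw)
		fmt.Println(out, err)
	}
}
```

Inside a container, `127.0.0.1` refers to the container itself, so a bridge listening on the host is unreachable until the address is rewritten.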
### File Browser Lazy Loading (fix/files-lazy-loading — 6 commits)
**Platform (templates.go)**:
- Added `?path=` and `?depth=` query params to `GET /workspaces/:id/files`
- Default depth=1 (was 5) — only fetches immediate children
- `path` validated with `validateRelPath()` to block command injection and traversal
- Invalid `depth` returns 400 (was silently defaulting)
- Shell `find` arguments quoted for paths with spaces/special chars
- Host-side fallback now also respects `subPath` and `depth`, excludes `__pycache__`/`node_modules`
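The two input checks above can be sketched as follows. The rule set is an assumption: this log says `validateRelPath()` blocks traversal and command injection but does not list its exact rules, so the metacharacter list here is illustrative, and `parseDepth` is a hypothetical helper name. Spaces are allowed, on the basis that the `find` arguments are quoted.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strconv"
	"strings"
)

var (
	errBadPath  = errors.New("invalid path")
	errBadDepth = errors.New("invalid depth")
)

// validateRelPath accepts an empty path (the root) or a relative path with
// no ".." segment and none of a conservative set of shell metacharacters.
func validateRelPath(p string) error {
	if p == "" {
		return nil
	}
	if path.IsAbs(p) || strings.ContainsAny(p, "\x00`$;|&<>\"'\\\n") {
		return errBadPath
	}
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return errBadPath
		}
	}
	return nil
}

// parseDepth returns 1 when the parameter is absent and an error (which a
// handler would map to HTTP 400) when it is not a positive integer.
func parseDepth(raw string) (int, error) {
	if raw == "" {
		return 1, nil
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return 0, errBadDepth
	}
	return n, nil
}

func main() {
	for _, p := range []string{"src/app", "my docs/notes", "../etc", "x; rm -rf /"} {
		fmt.Printf("%q -> %v\n", p, validateRelPath(p))
	}
	for _, d := range []string{"", "3", "abc"} {
		n, err := parseDepth(d)
		fmt.Printf("%q -> %d %v\n", d, n, err)
	}
}
```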
**Canvas (FilesTab.tsx)**:
- Lazy loading: expanding a folder triggers `GET ...&path=<dir>&depth=1` on demand
- Loading indicator ("…") on folder arrow while fetching
- `expandedDirs` state lifted from local TreeItem to parent FilesTab
- `buildTree()` dedup fix: top-level dir entries now registered in `dirMap` — prevents duplicated folder nodes when subfolder children are merged
- Merge logic preserves expanded grandchildren when re-loading a parent
- `toggleDir` uses ref to avoid stale closure / infinite re-render loop
- Extracted `TreeCallbacks` interface to deduplicate TreeView/TreeItem prop types
- Exported `buildTree` for testability
**Tests**:
- Updated 3 sqlmock expectations in `handlers_additional_test.go` and `handlers_extended_test.go` to match new discovery query (`SELECT COALESCE(name,''), COALESCE(runtime,'langgraph')`)
- Added `buildTree.test.ts` — 8 unit tests covering empty input, flat files, dir sorting, nested children, dedup (the original bug), implicit parent dirs, nested same-name dirs, out-of-order entries
- Canvas tests: 195 → 203. All Go tests pass.
**Code review (4 rounds)**:
- Round 1: Found critical command injection in `subPath` → fixed with `validateRelPath()`
- Round 2: Found stale closure in `toggleDir` → fixed with ref
- Round 3: Shell quoting + buildTree unit tests
- Round 4: Clean — 0 issues
## Files Changed (Lazy Loading)
- `canvas/src/components/tabs/FilesTab.tsx`
- `canvas/src/components/__tests__/buildTree.test.ts` (new — 8 tests)
- `workspace-server/internal/handlers/templates.go`
- `workspace-server/internal/handlers/handlers_additional_test.go`
- `workspace-server/internal/handlers/handlers_extended_test.go`
- `CLAUDE.md` (Vitest count 188 → 203)
- `docs/api-protocol/platform-api.md` (added `path`/`depth` query param docs)
- `docs/api-reference.md` (updated files endpoint description)
- `docs/frontend/canvas.md` (added Lazy Loading + Input Validation sections)
### Per-Workspace workspace_dir (feat/per-workspace-dir — PR #38)
**Problem:** `WORKSPACE_DIR` was a global env var — ALL containers got the same host directory bind-mounted. No way to give PM repo access while keeping other agents isolated.
**Solution:** Per-workspace `workspace_dir` column with priority chain: per-workspace DB value → global env → isolated Docker volume.
**Platform changes:**
- Migration 013: `workspace_dir TEXT` column on `workspaces` table
- `CreateWorkspacePayload`: added `WorkspaceDir` field
- Create handler: validates path (absolute, no `..`, no system paths), stores in DB
- Update handler: validates, stores, returns `{"needs_restart": true}`
- Get/List: includes `workspace_dir` in response (null when not set)
- `buildProvisionerConfig`: reads per-workspace value from DB on restarts, falls back to global env
- `validateWorkspaceDir`: rejects relative paths, `..` traversal, and system paths (/etc, /var, /proc, etc.)
- Org import: `workspace_dir` field in org.yaml, validated before DB insert
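The validation rules and the priority chain above can be sketched together. This is a hedged illustration: the denylist holds only the three system paths named in this entry (the real list is longer, per the "etc."), and `resolveWorkspaceDir` with its `source` labels is a hypothetical helper, not the signature of `buildProvisionerConfig`.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// systemPaths is a partial denylist; only the paths named in this log.
var systemPaths = []string{"/etc", "/var", "/proc"}

// validateWorkspaceDir rejects relative paths, ".." segments, and paths at
// or below a denylisted system directory.
func validateWorkspaceDir(dir string) error {
	if !path.IsAbs(dir) {
		return errors.New("workspace_dir must be an absolute path")
	}
	for _, seg := range strings.Split(dir, "/") {
		if seg == ".." {
			return errors.New("workspace_dir must not contain '..'")
		}
	}
	clean := path.Clean(dir)
	for _, sp := range systemPaths {
		if clean == sp || strings.HasPrefix(clean, sp+"/") {
			return errors.New("workspace_dir must not be a system path")
		}
	}
	return nil
}

// resolveWorkspaceDir applies the priority chain: per-workspace DB value,
// then the global env value, then an isolated Docker volume (empty dir).
func resolveWorkspaceDir(perWorkspace, globalEnv string) (dir, source string) {
	switch {
	case perWorkspace != "":
		return perWorkspace, "workspace"
	case globalEnv != "":
		return globalEnv, "env"
	default:
		return "", "volume"
	}
}

func main() {
	for _, d := range []string{"/home/dev/repo", "repo", "/home/dev/../../etc", "/etc/ssh"} {
		fmt.Printf("%q -> %v\n", d, validateWorkspaceDir(d))
	}
	fmt.Println(resolveWorkspaceDir("/home/dev/repo", "/srv/shared"))
	fmt.Println(resolveWorkspaceDir("", "/srv/shared"))
	fmt.Println(resolveWorkspaceDir("", ""))
}
```

The `..` check runs on the raw segments before `path.Clean`, because cleaning would silently resolve the traversal and hide it from the check.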
**Org template:**
- `org-templates/molecule-dev/org.yaml`: PM gets `workspace_dir: /Users/hongming/.../molecule-monorepo`
- All other 10 agents: no `workspace_dir` → isolated Docker volumes
**Code review (3 rounds):**
- Round 1: Found no path validation (critical) + unnecessary DB query + no restart hint → all fixed
- Round 2: Found missing org import validation + no system path denylist → all fixed
- Round 3: Clean — 0 issues
**E2E verified:**
- 11/11 workspaces online after org import
- PM: bind mount, can see CLAUDE.md, workspace-server/, canvas/
- Backend Engineer: isolated volume, empty /workspace
- Path traversal rejected (400), system paths rejected (400), relative paths rejected (400)
## Files Changed (Per-Workspace Dir)
- `workspace-server/migrations/013_workspace_dir.sql` (new)
- `workspace-server/internal/models/workspace.go`
- `workspace-server/internal/handlers/workspace.go`
- `workspace-server/internal/handlers/workspace_provision.go`
- `workspace-server/internal/handlers/org.go`
- `workspace-server/internal/handlers/handlers_test.go` (mock updates)
- `workspace-server/internal/handlers/handlers_additional_test.go` (mock updates)
- `workspace-server/internal/handlers/workspace_test.go` (mock updates)
- `org-templates/molecule-dev/org.yaml`
- `CLAUDE.md` (env var docs, migration count)
- `docs/architecture/provisioner.md` (rewrote Shared Workspace section)
- `docs/development/local-development.md` (updated WORKSPACE_DIR comment)
- `docs/edit-history/2026-04-09.md`
### Per-Workspace Plugin System (feat/per-workspace-plugins — PR #39)
**Problem:** Plugins were mounted as a shared read-only volume (`/plugins`) into ALL containers. No way to install/uninstall per workspace. No adapter-specific injection.
**Solution:** Per-workspace plugin installation with registry, API, adapter hooks, and canvas UI.
**Platform API (plugins.go, 346 lines):**
- `GET /plugins` — list available plugins from registry (`plugins/` dir at repo root)
- `GET /workspaces/:id/plugins` — list installed plugins in workspace container
- `POST /workspaces/:id/plugins {"name":"ecc"}` — install (TAR copy to `/configs/plugins/`) + auto-restart
- `DELETE /workspaces/:id/plugins/:name` — uninstall (root exec `rm -rf`) + auto-restart with 2s delay
- Plugin name validation: rejects `/`, `\`, `..`, non-base names (prevents path traversal)
- Shared `parseManifestYAML()` for host-side and container-side manifest parsing
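The plugin name validation described above can be sketched as a small predicate. The function name `validatePluginName` is hypothetical and the rules are inferred from the bullet (reject `/`, `\`, `..`, and anything that is not a base name); the actual check in `plugins.go` may differ in detail.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// validatePluginName reports whether name is safe to join onto the plugins
// directory: non-empty, no path separators, no "..", and a pure base name.
func validatePluginName(name string) bool {
	if name == "" || name == "." {
		return false
	}
	if strings.ContainsAny(name, `/\`) || strings.Contains(name, "..") {
		return false
	}
	return filepath.Base(name) == name
}

func main() {
	for _, n := range []string{"ecc", "superpowers", "../etc", "a/b", `a\b`, ""} {
		fmt.Printf("%q -> %v\n", n, validatePluginName(n))
	}
}
```

Because the uninstall path runs `rm -rf` as root inside the container, rejecting any name that could escape `/configs/plugins/` is the critical guard.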
**Plugin manifests:**
- `plugins/ecc/plugin.yaml` — 5 skills (api-design, coding-standards, deep-research, security-review, tdd-workflow), 2 rules
- `plugins/superpowers/plugin.yaml` — 5 skills (executing-plans, systematic-debugging, test-driven-development, verification-before-completion, writing-plans)
**Runtime integration (Python):**
- `plugins.py` rewritten: dual-source loader (`/configs/plugins/` first, `/plugins/` fallback), `PluginManifest` dataclass
- `config.py`: added `plugins: list[str]` field to `WorkspaceConfig`
- `adapters/base.py`: `inject_plugins()` hook in `BaseAdapter`, dual-source in `_common_setup()`
- `adapters/claude_code/adapter.py`: overrides `inject_plugins()` — appends rules to CLAUDE.md (idempotent), copies skills to `/configs/skills/`
- LangGraph/CrewAI: use default `_common_setup()` pipeline (system prompt + LangChain tools)
**Org import:**
- `OrgDefaults.Plugins` and `OrgWorkspace.Plugins` fields — auto-install plugins during provisioning
- Plugin files copied into `configFiles` map and written to container on provision
**Provisioner:**
- Removed global `/plugins:ro` bind mount — per-workspace is now the model
- T1 sandboxed tier updated (no more plugins mount)
**Canvas UI (SkillsTab.tsx):**
- Plugins section at top of Skills tab: shows installed count, per-plugin skills/version
- "+ Install Plugin" expands registry browser with available plugins and Install/Installed badges
- Remove button per installed plugin
- Loading states, toast notifications, cleanup timer on unmount
**Code review (4 rounds):**
- Round 1: Found path traversal in Uninstall (critical), command injection, duplicate parsing, magic timeout, non-idempotent CLAUDE.md injection
- Round 2: All fixed
- Round 3: Timer cleanup on unmount
- Round 4: Clean — 0 issues
## Files Changed (Plugin System)
- `workspace-server/internal/handlers/plugins.go` (new — 346 lines)
- `workspace-server/internal/router/router.go` (plugin routes + findPluginsDir)
- `workspace-server/internal/handlers/org.go` (Plugins field + auto-install)
- `workspace-server/internal/provisioner/provisioner.go` (removed /plugins mount)
- `workspace-server/internal/provisioner/provisioner_test.go` (updated T1 test)
- `workspace/plugins.py` (rewritten — dual source + manifest)
- `workspace/config.py` (plugins field)
- `workspace/adapters/base.py` (inject_plugins hook)
- `workspace/adapters/claude_code/adapter.py` (inject_plugins override)
- `workspace/tests/test_common_setup.py` (mock kwargs fix)
- `canvas/src/components/tabs/SkillsTab.tsx` (plugins section)
- `plugins/ecc/plugin.yaml` (new)
- `plugins/superpowers/plugin.yaml` (new)
- `CLAUDE.md` (routes, PLUGINS_DIR deprecation)
- `docs/api-reference.md` (plugins endpoints)
- `docs/api-protocol/platform-api.md` (plugins section)
- `docs/edit-history/2026-04-09.md`
### Agent GitHub Access + MCP Tool Coverage (feat/agent-github-access — PR #40)
**Docker image:**
- Added `git` and `gh` CLI to base Dockerfile — all runtimes can clone repos and create PRs
- Removed `set -e` from entrypoint to prevent silent crash-loops
- Entrypoint performs no GitHub setup itself; agents read the `GITHUB_TOKEN`/`GITHUB_REPO` env vars on demand
**Org template .env (gitignored):**
- `GITHUB_TOKEN`, `GITHUB_REPO`, `CLAUDE_CODE_OAUTH_TOKEN` — auto-injected as workspace secrets on org import
**UIUX Designer agent:**
- Added to dev team under Dev Lead (T3, opus)
**MCP server (41 → 52 tools):**
- `list_plugin_registry`, `list_installed_plugins`, `install_plugin`, `uninstall_plugin`
- `list_global_secrets`, `set_global_secret`, `delete_global_secret`
- `pause_workspace`, `resume_workspace`
- `list_org_templates`, `import_org`
## Files Changed (PR #40)
- `workspace/Dockerfile`, `workspace/entrypoint.sh`
- `org-templates/molecule-dev/org.yaml`, `org-templates/molecule-dev/uiux-designer/system-prompt.md` (new)
- `mcp-server/src/index.ts` (11 new tools)
- `CLAUDE.md` (MCP tool count 20 → 52)
### Async Delegation (feat/async-delegation — PR #41)
**Problem:** Delegation was synchronous and blocking: the PM sent a task to the Dev Lead, waited for the full response (855s), and timed out. Deep delegation chains (PM → Dev Lead → UIUX Designer) were unusable.
**Solution:** Fire-and-forget delegation with status polling.
**New behavior:**
1. `delegate_to_workspace(id, task)` → returns `{task_id, status: "delegated"}` instantly
2. Background asyncio task sends the A2A request, retries on failure
3. `check_delegation_status(task_id)` → poll for results anytime
4. `check_delegation_status("")` → list all active delegations
5. Push notification via `POST /notify` when delegation completes/fails
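
The fire-and-forget pattern above can be sketched with plain `asyncio`. This is a simplified illustration, not the contents of `tools/delegation.py`: the names `delegate`, `check_status`, and the `send` callable are hypothetical, and retries, RBAC, and the `POST /notify` push are left out.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict

_tasks: Dict[str, dict] = {}
_background: set = set()  # strong references so tasks are not garbage-collected


async def _run(task_id: str, send: Callable[[], Awaitable[str]]) -> None:
    """Background coroutine: perform the request and record the outcome."""
    record = _tasks[task_id]
    record["status"] = "in_progress"
    try:
        record["result"] = await send()
        record["status"] = "completed"
    except Exception as exc:  # record the failure instead of losing it
        record["error"] = str(exc)
        record["status"] = "failed"


async def delegate(send: Callable[[], Awaitable[str]]) -> dict:
    """Schedule the work and return a handle without waiting for it."""
    task_id = str(uuid.uuid4())
    _tasks[task_id] = {"status": "pending"}
    bg = asyncio.create_task(_run(task_id, send))
    _background.add(bg)
    bg.add_done_callback(_background.discard)
    return {"task_id": task_id, "status": "delegated"}


def check_status(task_id: str = "") -> dict:
    """Empty id lists active delegations; otherwise return one record."""
    if task_id == "":
        return {
            tid: rec
            for tid, rec in _tasks.items()
            if rec["status"] in ("pending", "in_progress")
        }
    return _tasks.get(task_id, {"status": "not_found"})
```

The caller gets a `task_id` back before the request has even been sent, which is what lets a deep chain of delegations proceed without each hop blocking on the one below it.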
**Code changes:**
- `tools/delegation.py` rewritten (272 lines):
  - `DelegationTask` dataclass with status enum (pending/in_progress/completed/failed)
  - `_delegations` dict (bounded at 100, auto-evicts completed/failed)
  - `_execute_delegation` background coroutine with full A2A retry logic
  - `_notify_completion` pushes WebSocket event on done
  - `_on_task_done` callback logs unhandled exceptions
  - `_evict_old_delegations` prevents memory leaks
- `coordinator.py`: `route_task_to_team` uses same async pattern
- `adapters/base.py`: `check_delegation_status` registered as 6th core tool
- `tests/test_delegation.py` rewritten (13 tests): RBAC, async return, background completion, list all, not found, discovery errors, A2A success/failure
- `tests/test_common_setup.py`: expected tool counts updated (5→6, 6→7)
- `tests/conftest.py`: added check_delegation_status mock
- 865 Python tests pass (0 failures)
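
One way the bounded `_delegations` dict could be kept under its limit is to drop the oldest finished entries first and never touch active ones. The sketch below assumes that policy; the function name, the record shape, and the oldest-first order are illustrative guesses rather than the actual `_evict_old_delegations` implementation.

```python
from typing import Dict

MAX_DELEGATIONS = 100  # limit stated in the notes above
TERMINAL = {"completed", "failed"}


def evict_old(delegations: Dict[str, dict], limit: int = MAX_DELEGATIONS) -> int:
    """Remove the oldest finished records until the dict fits the limit.

    Relies on dicts preserving insertion order, so iteration runs from
    oldest to newest. Active records are never removed, so the dict can
    stay above the limit if everything in it is still running.
    """
    removed = 0
    for task_id in list(delegations):
        if len(delegations) <= limit:
            break
        if delegations[task_id]["status"] in TERMINAL:
            del delegations[task_id]
            removed += 1
    return removed
```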
**Code review (2 rounds):**
- Round 1: Found unbounded _delegations, silent exception swallowing, no push notification
- Round 2: Clean — 0 issues
## Files Changed (PR #41)
- `workspace/tools/delegation.py` (rewritten)
- `workspace/coordinator.py`
- `workspace/adapters/base.py`
- `workspace/tests/test_delegation.py` (rewritten)
- `workspace/tests/test_common_setup.py`
- `workspace/tests/conftest.py`
### Platform-Level Async Delegation (feat/platform-async-delegation — PR #42)
**Problem:** Delegation was synchronous: the PM blocked for 855s waiting for the full delegation chain. The earlier fix (PR #41) put the async logic in Python tools, but Claude Code agents use MCP rather than Python tools, so the fix sat at the wrong layer.
**Solution:** Platform-level async delegation that works for ALL runtimes.
**New endpoints:**
- `POST /workspaces/:id/delegate {"target_id", "task"}` → returns `{delegation_id, status: "delegated"}` immediately, without waiting for the target
- `GET /workspaces/:id/delegations` → list with status (pending/completed/failed), delegation_id, response_preview
**How it works:**
1. Platform receives delegation request, validates target UUID, stores in activity_logs
2. Background goroutine sends A2A to target workspace (30min timeout)
3. On completion: stores result in DB, broadcasts DELEGATION_COMPLETE via WebSocket
4. On failure: stores error, broadcasts DELEGATION_FAILED
5. `delegation_id` tracked in both request and response JSONB for correlation
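
The real handler is Go (`delegation.go`); the Python below only illustrates steps 1 and 5: validate the target UUID up front and attach one `delegation_id` to the stored request so later results can be correlated. The function name, the in-memory `activity_log` list, and the `202` success code are assumptions; the notes only confirm that an invalid UUID yields `400`.

```python
import uuid
from typing import List, Tuple


def handle_delegate(body: dict, activity_log: List[dict]) -> Tuple[int, dict]:
    """Validate a delegation request and record it before any dispatch."""
    target = str(body.get("target_id", ""))
    try:
        uuid.UUID(target)  # raises ValueError for malformed ids
    except ValueError:
        return 400, {"error": "invalid target_id"}

    delegation_id = str(uuid.uuid4())
    # The same id is stored with the request and would later be stored
    # with the response, which is what makes correlation possible.
    activity_log.append(
        {
            "delegation_id": delegation_id,
            "target_id": target,
            "task": body.get("task", ""),
            "status": "pending",
        }
    )
    # 202 is an assumed status code for "accepted, running in background".
    return 202, {"delegation_id": delegation_id, "status": "delegated"}
```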
**MCP tools (54 total):**
- `async_delegate` — fire-and-forget delegation from any MCP client
- `check_delegations` — poll for results
**Code review (2 rounds):**
- Round 1: Silent DB error, JSON dependency, no UUID validation, no delegation_id tracking
- Round 2: Clean — 0 issues
**E2E verified:**
- Delegate returns in 0s (was 855s)
- Status shows "pending" immediately, "completed" with response in ~10s
- Invalid UUID rejected with 400
- delegation_id returned in list for correlation
## Files Changed (PR #42)
- `workspace-server/internal/handlers/delegation.go` (new — 220 lines)
- `workspace-server/internal/router/router.go` (2 routes added)
- `mcp-server/src/index.ts` (2 new tools — async_delegate, check_delegations)
- `CLAUDE.md` (routes, MCP 52→54)
- `docs/api-protocol/platform-api.md` (Async Delegation section)
- `docs/api-reference.md` (Async Delegation table)
- `docs/edit-history/2026-04-09.md`
### Full Claude Code Tool Access (fix/full-claude-tools — PR #43)
**Bug:** `--allowed-tools Bash` restricted agents to Bash only, so they could not use Read, Write, Edit, or any other tool. Agents acknowledged tasks but never executed them.
**Fix:** Removed the restriction, set `cwd=/workspace`, and added a retry for stale sessions.
### Resilient Heartbeat + Platform-Routed Delegation (fix/heartbeat-and-reporting — PR #44)
**Heartbeat:** Auto-restart on crash, recreate client after 10 failures, proper logging. Now also checks delegation status every 30s — writes completed results to `/tmp/delegation_results.jsonl` for agent pickup. Bounded `_seen_delegation_ids` at 200 entries.
**Delegation lifecycle:** `pending → dispatched → received → in_progress → completed/failed`. Platform broadcasts `DELEGATION_STATUS` WebSocket event on each transition. `updateDelegationStatus()` updates activity_logs by delegation_id.
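
The lifecycle above reads naturally as a small state machine. The sketch below is illustrative only: the transition table assumes a delegation may fail from any non-terminal state, and `advance` and the `broadcast` callable are hypothetical stand-ins for `updateDelegationStatus()` and the WebSocket broadcaster.

```python
from typing import Callable, Dict, Set

# Allowed moves; "failed" from every non-terminal state is an assumption.
TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"dispatched", "failed"},
    "dispatched": {"received", "failed"},
    "received": {"in_progress", "failed"},
    "in_progress": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def advance(
    delegation_id: str,
    current: str,
    new: str,
    broadcast: Callable[[dict], None],
) -> str:
    """Move to the next status and emit one event per transition."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    broadcast(
        {
            "event": "DELEGATION_STATUS",
            "delegation_id": delegation_id,
            "status": new,
        }
    )
    return new
```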
**MCP tools:** Route through platform API (`POST /delegate`, `GET /delegations`) instead of direct peer-to-peer. Full DB tracking + WebSocket events.
**CLI executor:** Reads delegation results on each message, injects as `[Delegation results received while you were idle]` context. Atomic file rename prevents race with heartbeat writer.
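
A minimal sketch of the rename-then-read idea, assuming the heartbeat opens the results file in append mode for each write. The helper name and the `.consuming` suffix are hypothetical; only the use of an atomic rename comes from the notes above.

```python
import json
import os
from pathlib import Path
from typing import List


def consume_results(path: Path) -> List[dict]:
    """Claim the results file atomically, then parse it as JSON Lines.

    After the rename, a writer that appends to `path` creates a fresh
    file, so it never interleaves with the copy being read here.
    """
    claimed = path.with_name(f"{path.name}.{os.getpid()}.consuming")
    try:
        os.replace(path, claimed)  # atomic on the same filesystem
    except FileNotFoundError:
        return []  # nothing was written since the last read
    try:
        with claimed.open() as fh:
            return [json.loads(line) for line in fh if line.strip()]
    finally:
        claimed.unlink()
```

Each result is delivered at most once: a second call finds no file and returns an empty list until the heartbeat writes again.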
**7 Go delegation handler tests:** Delegate validation, success, DB failure, ListDelegations empty/with results.
## Files Changed (PRs #43-44)
- `workspace/cli_executor.py` (delegation context injection, atomic file consume)
- `workspace/heartbeat.py` (delegation checker, auto-restart, bounded IDs)
- `workspace/a2a_tools.py` (platform-routed delegation)
- `workspace-server/internal/handlers/delegation.go` (status lifecycle, updateDelegationStatus)
- `workspace-server/internal/handlers/delegation_test.go` (7 tests)
- `workspace/tests/test_a2a_tools_impl.py`
- `workspace/tests/test_heartbeat.py` (6 new delegation tests)
- `workspace/tests/test_cli_executor.py` (3 new delegation injection tests)
- `CLAUDE.md` (test counts: Go 365+, Python 869)
- `docs/api-protocol/registry-and-heartbeat.md` (delegation checking section)