# 2026-04-12

## Summary

Shipped the full two-axis plugin architecture on `feat/agentskills-compliance` (PR #62). **Plugin source** (where files come from) and **plugin shape** (what's inside them) are now independent, pluggable axes.

- **Source axis** — `workspace-server/internal/plugins/` package: `SourceResolver` interface, `Registry`, `LocalResolver`, `GithubResolver`, `ParseSource`. `POST /workspaces/:id/plugins` accepts `{name}` (back-compat → local) or `{source: "scheme://spec"}`. New `GET /plugins/sources` enumerates registered schemes.
- **Shape axis** — `workspace/plugins_registry/` package: `PluginAdaptor` protocol, hybrid resolver (registry > plugin-shipped > raw-drop), `AgentskillsAdaptor` built-in for agentskills.io-format skills + Molecule AI's rules extension. Named sub-type adapters planned for MCP, DeepAgents sub-agents, LangGraph sub-graphs, etc.
- **agentskills.io compliance** — every first-party skill passes the open standard; the `python -m molecule_plugin validate` CLI enforces it in CI. Our skills are now installable in ~35 other agent tools (Cursor, Codex, Copilot, Gemini CLI, etc.).
- **Gemini org parity** — `molecule-worker-gemini` mirrors `molecule-dev` (11 workspaces, Research + Dev branches, schedules, Telegram channel, per-agent prompts) as the E2E proof point.

## Files touched

Platform (Go):

- `workspace-server/internal/plugins/{source,local,github}.go` + tests — source layer (sketched below), 97.4% coverage.
- `workspace-server/internal/envx/envx.go` + test — env-var helpers, 100% coverage.
- `workspace-server/internal/handlers/plugins.go` — install pipeline refactored into `resolveAndStage` + `deliverToContainer`; typed `httpErr` for status propagation; `sort.Strings` in `Registry.Schemes`; `logInstallLimitsOnce` on startup.
- `workspace-server/internal/router/router.go` — new routes (`/plugins/sources`, `/workspaces/:id/plugins/available`, `/workspaces/:id/plugins/compatibility`).
- `workspace-server/Dockerfile` — `apk add git` for the github resolver.
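A minimal sketch of the source axis, for reference — names follow the bullets above; the bodies are illustrative, not the shipped code:

```go
// Illustrative sketch — the real implementations live in internal/plugins/.
package plugins

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// SourceResolver stages a plugin spec locally and returns the staged dir.
type SourceResolver interface {
	Resolve(spec string) (stagedDir string, err error)
}

// Registry maps scheme -> resolver; the RWMutex guards concurrent handlers.
type Registry struct {
	mu        sync.RWMutex
	resolvers map[string]SourceResolver
}

// Schemes lists registered schemes, sorted for a stable API response.
func (r *Registry) Schemes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.resolvers))
	for scheme := range r.resolvers {
		out = append(out, scheme)
	}
	sort.Strings(out)
	return out
}

// ParseSource splits "scheme://spec", trimming whitespace and rejecting an
// empty spec — both behaviours hardened in review rounds 4-5.
func ParseSource(raw string) (scheme, spec string, err error) {
	scheme, spec, ok := strings.Cut(strings.TrimSpace(raw), "://")
	if !ok || scheme == "" {
		return "", "", fmt.Errorf("source %q: expected scheme://spec", raw)
	}
	if spec = strings.TrimSpace(spec); spec == "" {
		return "", "", fmt.Errorf("empty spec after %q", scheme)
	}
	return scheme, spec, nil
}
```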
Workspace runtime (Python):

- `workspace/plugins_registry/` — new module: `protocol.py`, `builtins.py` (`AgentskillsAdaptor`), `raw_drop.py`, resolver.
- `workspace/skill_loader/` — renamed from `skills/`; reads `scripts/` per the agentskills.io spec.
- `workspace/builtin_tools/` — renamed from `tools/` to disambiguate from user-plugin tool dirs.
- `workspace/adapters/base.py` — added hooks: `memory_filename`, `register_tool_hook`, `register_subagent_hook`, `append_to_memory_hook`, `install_plugins_via_registry`. Default `inject_plugins()` drives the new pipeline.
- `workspace/adapters/claude_code/adapter.py` — deleted the 40-line `inject_plugins()` override.
- `workspace/adapters/deepagents/Dockerfile` — ships `plugins_registry/`.
- `workspace/plugins.py` — `PluginManifest.runtimes` field.

Plugins (content):

- `plugins/*/adapters/{claude_code,deepagents}.py` — one-line `from plugins_registry.builtins import AgentskillsAdaptor as Adaptor`.
- `plugins/*/plugin.yaml` — declare `runtimes: [claude_code, deepagents]`.

SDK (Python):

- `sdk/python/molecule_plugin/` — `protocol.py`, `builtins.py` (SDK-vendored `AgentskillsAdaptor`), `manifest.py` (spec validator), CLI via `__main__.py`.
- `sdk/python/template/` — cookiecutter skeleton.

Org templates:

- `org-templates/molecule-worker-gemini/org.yaml` — full parity with `molecule-dev` (11 workspaces, schedules, Telegram, per-agent prompts, `workspace_dir` mount on PM, `required_env: [GOOGLE_API_KEY]`).
- Copied 5 `system-prompt.md` files from molecule-dev (research-lead, market-analyst, technical-researcher, competitive-intelligence, uiux-designer).

Docs:

- `docs/plugins/agentskills-compat.md` — two-layer model, spec mapping.
- `docs/plugins/sources.md` — two-axis source/shape architecture, security model, future resolvers.
- `docs/ecosystem-watch.md` — Holaboss, Hermes Agent, gstack entries (adjacent projects to track).
- `.env.example` — `PLUGIN_INSTALL_*` vars documented.
- `PLAN.md` — plugin-adaptor landed; deferred items listed.
- `CLAUDE.md` — new endpoints, env vars, test counts.

## Test counts

- Go platform: all packages green under `-race`.
- Python workspace: 1040 passed, 9 skipped.
- Python SDK: 50 passed.
- Total: **1090 passing**.

Coverage on new code:

- `workspace-server/internal/plugins/*`: 97.4%
- `workspace-server/internal/envx/*`: 100%
- `workspace/plugins_registry/*`: 100%
- `workspace/skill_loader/*`: 100%
- `sdk/python/molecule_plugin/*`: 100%

## 5 rounds of code review

Every round addressed by new commits on the branch:

1. Round 1 — initial coverage pass.
2. Round 2 — `memory_filename` plumbing through `InstallContext`; `logger` in `skill_loader`; module constants for `SKILLS_SUBDIR`, `SKIP_ROOT_MD`, `SKILL_NAME_*`; SDK↔runtime drift-guard test; frontmatter parser unification.
3. Round 3 — fetch timeout + body size cap + staged-dir size cap via new env vars; typed `ErrPluginNotFound` sentinel replaces string matching; reject both `name`+`source`; `sort.Strings` in Schemes; `sync.RWMutex` on Registry; `--` in git clone; docs clarify the github resolver is public-only.
4. Round 4 — `ParseSource` empty-spec guard; `dirSize(cap)` → `(limit)`; `localNameRE` length bound; extract `envDuration`/`envInt64` into `internal/envx`; `LANG=C LC_ALL=C` in git child env for locale-stable error parsing.
5. Round 5 — typed `httpErr` replaces a 5-value tuple (sketched below); `resolveAndStage` decoupled from `*gin.Context` via an `installRequest` struct; drop unused `source` param from `deliverToContainer`; trim whitespace in `ParseSource`; consolidate 3 test resolver stubs into 1 parameterized `fakeResolver` + 3 constructors.
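The round-5 `httpErr` refactor, sketched — field and helper names are assumptions:

```go
// Illustrative sketch: a typed error that carries its HTTP status, replacing
// the old 5-value return tuple from the install pipeline.
package handlers

import (
	"errors"
	"fmt"
	"net/http"
)

// httpErr lets resolveAndStage report failures without touching *gin.Context.
type httpErr struct {
	status int
	msg    string
}

func (e *httpErr) Error() string { return fmt.Sprintf("%d: %s", e.status, e.msg) }

// statusOf maps any error to a response code; non-httpErr defaults to 500.
func statusOf(err error) int {
	var he *httpErr
	if errors.As(err, &he) {
		return he.status
	}
	return http.StatusInternalServerError
}
```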
## Live E2E confirmed

- `GET /plugins/sources` → `{"schemes":["github","local"]}`.
- `POST {"name":"molecule-dev"}` → installed via local (back-compat).
- `POST {"source":"local:// molecule-dev "}` → installed (whitespace trimmed).
- `POST {"name":"a","source":"local://b"}` → 400 "not both".
- `POST {"source":"github://"}` → 400 "empty spec after 'github'".
- `POST {"source":"mystery://x"}` → 400 + `available_schemes: [...]`.
- Uninstall + reinstall on PM workspace: CLAUDE.md has the `# Plugin: molecule-dev / rule: codebase-conventions.md` marker; `/configs/skills/review-loop/` present; zero container errors.
- Startup log on platform boot: `Plugin install limits: body=65536 bytes timeout=5m0s staged=104857600 bytes`.

## Branch

`feat/agentskills-compliance` → PR #62 (open, all CI green, ready to merge). Use `git log --oneline origin/main..` for the commit list — counting commits inline goes stale fast.

---

## Post-merge session — team coordination, platform hardening, new backlog

After PR #62 landed, the session continued with the ecosystem-watch ship, a gemini-org proof-point attempt, and a PLAN.md refresh coordinated through the agent team. Several platform bugs surfaced; all filed and tracked.

### Shipped

- **A2A proxy regression fix** — PR #59 had rewritten `http://127.0.0.1:` → `http://ws-:8000` unconditionally, breaking platform-on-host mode. The rewrite is now gated behind `platformInDocker` detection (`/.dockerenv` or `MOLECULE_IN_DOCKER=1`; sketched below). `workspace-server/internal/handlers/a2a_proxy.go`. Commit `4b42913`.
- **PR #61** — `docs/ecosystem-watch.md`: Holaboss / Hermes / gstack entries + template + backlog candidates. Merged.
- **Cross-references for ecosystem-watch** — wired into `PLAN.md` (new "Ecosystem Awareness" section), `README.md` + `README.zh-CN.md` Documentation Map, and `CLAUDE.md` (new "Ecosystem Context" section). Agents couldn't discover the doc because it wasn't linked anywhere; the PM reported it missing despite it being in its bind mount. Commit `8ae5e73`.
- **DeepAgents adapter: `virtual_mode=False`** in `workspace/adapters/deepagents/adapter.py`. Previously `read_file`/`ls`/`write_file`/`edit_file` operated on an in-memory snapshot that drifted from the bind-mounted `/workspace`; writes didn't persist across restarts and real files were reported as missing. Commit `bc563d1`.
- **LangGraph recursion limit 100 → 500** default in `workspace/a2a_executor.py`. PM fan-out to 6+ reports routinely overran the 100-step ceiling. Still overridable via the `LANGGRAPH_RECURSION_LIMIT` env var. Commit `d892eb4`.
- **Gemini org model swap** — `gemini-3.1-pro-preview` → `gemini-2.5-pro` in `org-templates/molecule-worker-gemini/org.yaml` (3.1-pro-preview's 25 req/min couldn't sustain 11-workspace delegation waves). Commit `4b42913`.
- **Backlog tracking** for #64 / #65 added to the `PLAN.md` Backlog. Commit `ba1cc15`.
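The detection gate from the A2A proxy fix, sketched — this is the described behaviour, not a copy of `a2a_proxy.go`:

```go
// Only rewrite loopback URLs to container hostnames when the platform itself
// runs inside Docker.
package handlers

import "os"

// platformInDocker checks the conventional /.dockerenv sentinel, with an
// explicit MOLECULE_IN_DOCKER=1 override for runtimes that omit the file.
func platformInDocker() bool {
	if os.Getenv("MOLECULE_IN_DOCKER") == "1" {
		return true
	}
	_, err := os.Stat("/.dockerenv")
	return err == nil
}
```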
### Open PRs (awaiting CEO approval)

- **#68** `docs/plan-refresh` — PLAN.md refresh: correct test counts (Canvas 325→345, Python 990→1,040, +SDK row 50, total 1,811→1,911), promote #66/#67 to the backlog with actual issue content. Coordinated with the molecule-dev team; corrected the PM's hallucinated content for #66/#67 before opening.
- **#69** `chore/team-system-prompts-hardening` — harden PM / Dev Lead / Research Lead system prompts with hard-learned rules from today's coordination incident (15 rules total across 3 roles). Every rule maps to a specific failure we hit today.

### New platform issues filed

- **#64** — `GET /workspaces/:id/delegations` returns `[]` while the agent-side `check_delegation_status` tool shows 4 delegations. Sources-of-truth mismatch. Bug.
- **#65** — Per-agent repo-access config in `org.yaml`. New `workspace_access: none | read_only | read_write` field + `:ro` bind-mount for research agents. Eliminates the "PM couriers documents to reports" workaround. Enhancement.
- **#66** — `claude_sdk_executor.py` swallows subprocess stderr on CLI exit ≠ 0. Every failure surfaces the same opaque `"Command failed with exit code 1 / Check stderr output for details"`. High-priority bug; blocked real debugging today.
- **#67** — The agent MCP client defaults to `http://localhost:8080`, which inside a workspace container is the container itself. Inject `MOLECULE_URL=${PLATFORM_URL}` at provision time. High-priority bug; blocked the PM from restarting its own reports.

### Gemini org — proof-point attempt, rolled back

Deployed molecule-worker-gemini (11 DeepAgents workspaces), exercised the full delegation tree, hit three distinct blockers:

1. `virtual_mode=True` made the PM report real files as missing (fixed in `bc563d1` above).
2. The LangGraph recursion limit of 100 tripped on PM fan-out (fixed in `d892eb4` above).
3. The Google AI Studio **monthly spending cap** exhausted the whole project after repeated retries.

Rolled back to molecule-dev (Claude Code runtime) to finish the PLAN.md refresh task.

### Session-state contamination note

After a `ProcessError` crash on a Claude Code workspace, subsequent A2A calls to that workspace keep failing identically until the workspace is restarted — even when the same SDK query run manually from inside the container succeeds. Root cause is likely session resume state in the executor. Workaround: restart on `ProcessError`. Worth formalizing in the executor as an auto-reset on `exit_code != 0` once #66 lands and we can see the real stderr.

### Rules distilled for the team (now encoded in #69)

- Never commit to `main` — always a feature branch + PR.
- Verify external refs (issue numbers, PRs, SHAs, file paths) before citing them.
- Inline documents into every sub-delegation — reports don't have the repo mount.
- `delegation.status == completed` ≠ work was done.
- Pause ~60s after a batch restart before delegating (warm-up race).
- Quote errors verbatim, don't paraphrase.
- Research Lead must always fan out — solo synthesis is a role failure.

---

## #71 fix — initial_prompt marker written up-front

**Root cause:** `main.py` previously wrote `/workspace/.initial_prompt_done` only AFTER the initial_prompt self-send succeeded. If the prompt crashed (any ProcessError, network failure, SDK exit), the marker was never written — the next container boot replayed the same failing prompt and cascaded into "every message crashes" until an operator intervened. Observed three times on 2026-04-12 (gemini org + molecule-dev import + post-restart).

**Fix** (extracted from main.py into `workspace/initial_prompt.py` so it's unit-testable without uvicorn):

- `resolve_initial_prompt_marker(config_path)` — prefer `/...` when writable, fall back to `/workspace/...`.
- `mark_initial_prompt_attempted(marker_path)` — best-effort write, returns `True`/`False` so the caller can log a loud warning on I/O failure.
- `main.py` calls `mark_initial_prompt_attempted` **before** scheduling the self-send. The post-send marker write is removed.

**Semantic change:** the prompt is attempted at most once per fresh boot; if it fails, operators re-send manually via chat. The trade-off: silent auto-retry-on-restart (which could cascade) is given up for a one-time attempt with a loud failure log.

**Tests:** 5 new unit tests in `tests/test_main_initial_prompt.py`, 100% coverage on `initial_prompt.py`. Live E2E verified all 12 containers write the marker up-front and no replay occurs on restart. A manual browser test via canvas chat against Research Lead returned the expected reply — full round-trip through the UI.

Branch: `fix/71-initial-prompt-marker-at-start`. Closes #71.

---

## #66 fix — surface Claude SDK subprocess stderr + exit_code

**Root cause:** `claude_sdk_executor.py` caught `ProcessError` but extracted only `str(exc)`, which for a crashing CLI reads "Command failed with exit code 1 (exit code: 1) / Error output: Check stderr output for details". The SDK's `ProcessError` actually carries `.exit_code` and `.stderr` attributes — we were silently dropping both. Every CLI crash looked identical and required ad-hoc reproduction inside the container to diagnose.

**Fix:** new `_format_process_error(exc)` helper that extracts `type(exc).__name__`, `exc.exit_code`, and `exc.stderr` (capped at `_PROCESS_ERROR_STDERR_MAX_CHARS = 4096` to prevent log flooding). Called in the retry loop (`logger.warning`) and the terminal error path (`logger.error` + `logger.exception` for the full traceback). Plain exceptions without SDK attributes fall back to `str(exc)` — no crash on missing attrs.
**Tests:** 5 new unit tests in `tests/test_claude_sdk_executor.py` (format with full context / truncation / plain exception / exit-code only / end-to-end via `execute()` with caplog). Python pytest 1050 → 1055.

**E2E:** rebuilt `workspace-template:claude-code`, restarted an agent, ran `_format_process_error` with a real `claude_agent_sdk._errors.ProcessError(exit_code=2, stderr='disk full: /tmp')` inside the live container → output shows both `exit_code=2` and the stderr verbatim.

**Manual browser:** canvas chat against Research Lead — reply `BROWSER-OK-66` returned cleanly; the full UI round-trip works with the new log format live.

Branch: `fix/66-capture-claude-sdk-stderr`. Closes #66.

---

## #75 fix — auto-reset session_id on subprocess-level errors

**Root cause:** after a `ProcessError` (or `CLIConnectionError`), the executor's `self._session_id` still points at the dead session. On the next call, `_build_options()` passes `resume=` to the SDK, which boots a new subprocess that can't resume the prior session state — and crashes again. Observed as "crashed once → crashes forever" on 2026-04-12 across PM / RL / DL in the coordination runs.

**Fix:** new `_reset_session_after_error(exc)` method clears `self._session_id` when the exception looks subprocess-level (`ProcessError`, `CLIConnectionError`, has an `exit_code` attribute, or the message contains "exit code"). Rate-limit / capacity errors are left alone so normal retry preserves conversational continuity. Called in the retry loop, right after `_format_process_error` logs the context.

**Tests:** 5 new tests in `tests/test_claude_sdk_executor.py` — clears on ProcessError / preserves on rate-limit / no-op when session_id is already None / triggers on "exit code" message only / end-to-end via `execute()` with `caplog` + a spy on `_build_options` asserting that the second retry attempt sees `session_id=None` rather than the stale ID. Python pytest 1055 → 1060.

**E2E:** verified in the live container — `_reset_session_after_error` clears a stale session on ProcessError, preserves it on rate-limit.

**Manual browser:** canvas chat round-trip on Research Lead — the message went through and the agent responded normally. Zero ProcessError indicators.

Branch: `fix/75-session-reset-on-process-error`. Closes #75.

---

## Top-5 #1 — Memory FTS + namespace scoping

Backend proposal from the ecosystem-research outcomes doc; the highest-convergence team ask (BE + FE + QA + UX all independently proposed some flavour of this).

**Migration `017_memories_fts_namespace.up.sql`:**

- `agent_memories.namespace VARCHAR(50) NOT NULL DEFAULT 'general'`
- `agent_memories.content_tsv tsvector` (STORED generated column from `to_tsvector('english', content)`)
- `idx_memories_fts` (GIN on `content_tsv`)
- `idx_memories_ns` (composite on `workspace_id, namespace`)

**Handler `workspace-server/internal/handlers/memories.go`:**

- `POST /workspaces/:id/memories` accepts an optional `namespace` (default `"general"`, 50-char max validated at the handler).
- `GET /workspaces/:id/memories?q=...` routes multi-char queries through `content_tsv @@ plainto_tsquery('english', ?)` with `ts_rank` ordering; single-char queries fall back to `ILIKE` (the 'english' tsvector config can't tokenise single chars). Sketched below.
- `GET /workspaces/:id/memories?namespace=...` filters by namespace, alone or combined with `q`.
- The response always includes the `namespace` field.
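The `?q=` routing, sketched — the helper name and placeholder plumbing are assumptions; the real logic lives in `memories.go`:

```go
// Illustrative sketch of the FTS-vs-ILIKE routing for ?q=.
package handlers

import "fmt"

// memorySearchClause returns a SQL predicate plus its argument for ?q=.
// argN is the positional index of the next free placeholder.
func memorySearchClause(q string, argN int) (string, []any) {
	if len([]rune(q)) > 1 {
		// FTS path: hits the GIN index on content_tsv; the caller adds
		// ORDER BY ts_rank(content_tsv, plainto_tsquery('english', $N)) DESC.
		return fmt.Sprintf("content_tsv @@ plainto_tsquery('english', $%d)", argN),
			[]any{q}
	}
	// Single-character fallback: the 'english' config won't tokenise it.
	return fmt.Sprintf("content ILIKE $%d", argN), []any{"%" + q + "%"}
}
```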
**Tests:** 5 existing tests updated for the new column list; 4 new tests added (commit-with-namespace, namespace-too-long, FTS path, ILIKE fallback, namespace filter). Handler test suite passes.

**E2E (live Postgres + running platform):**

- Platform restart applied migration 017 → column + indexes present.
- `POST` with / without namespace → both work, default kicks in.
- `?q=zinc+theme` → FTS returns reference memory.
- `?namespace=procedures` → scoped retrieval works.
- `?q=restart&namespace=procedures` → combined filter works.

Branch: `feat/memory-fts-namespace`.

---

## Top-5 #5 — Fail-secure encryption at boot

Security Auditor's top proposal from the outcomes doc. The platform previously booted without `SECRETS_ENCRYPTION_KEY` and silently stored workspace secrets in plaintext with only a WARNING log. OWASP A02:2021 (Cryptographic Failures) / STRIDE "Information Disclosure".

**Fix** (`workspace-server/internal/crypto/aes.go`):

- New `InitStrict() error` variant that returns `ErrEncryptionKeyMissing` when `MOLECULE_ENV=prod`/`production` and the key is unset, malformed, or the wrong length (sketched below). Existing `Init()` retained for any callers that prefer the warn-and-continue behaviour; only `cmd/server/main.go` switched to the strict variant.
- `isProdEnv()` accepts `prod`, `production`, case-insensitive + trimmed.
- `loadKeyFromEnv` refactor: one helper returns the parse error so both entry points can format it the same way.
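The strict-boot variant, sketched — `loadKeyFromEnv` is stubbed here; the real parser does the base64 + length validation:

```go
// Illustrative sketch of the fail-secure boot check.
package crypto

import (
	"errors"
	"log"
	"os"
	"strings"
)

var ErrEncryptionKeyMissing = errors.New(
	"SECRETS_ENCRYPTION_KEY missing or invalid in a prod environment")

// isProdEnv: trimmed, case-insensitive match on prod/production.
func isProdEnv() bool {
	switch strings.ToLower(strings.TrimSpace(os.Getenv("MOLECULE_ENV"))) {
	case "prod", "production":
		return true
	}
	return false
}

// loadKeyFromEnv parses SECRETS_ENCRYPTION_KEY (base64, 32 bytes); elided.
func loadKeyFromEnv() error { return nil }

// InitStrict refuses to boot in prod on any key problem; elsewhere it keeps
// the legacy warn-and-continue ergonomics of Init().
func InitStrict() error {
	if err := loadKeyFromEnv(); err != nil {
		if isProdEnv() {
			return errors.Join(ErrEncryptionKeyMissing, err)
		}
		log.Printf("WARNING: secrets will be stored in plaintext: %v", err)
	}
	return nil
}
```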
**`cmd/server/main.go`:** `crypto.InitStrict()` + `log.Fatalf` on error. Local dev (no `MOLECULE_ENV`) keeps the existing warn-and-continue.

**Tests:** 6 new tests in `internal/crypto/aes_test.go`:

- fails in prod when the key is missing
- fails in prod on a wrong-length key
- succeeds in prod with a valid key
- allows dev mode without a key (ergonomics)
- allows staging without a key (non-prod)
- isProdEnv case-insensitivity table

**E2E:** the `/tmp/platform-failsec` binary run with `MOLECULE_ENV=prod` + empty key → `log.Fatalf` triggers, the platform refuses to start. The same binary with `MOLECULE_ENV=prod` + a valid base64 key → boots, prints "AES-256-GCM enabled", serves 200 on `/health`.

Branch: `fix/top5-5-fail-secure-encryption`.

---

## #85 fix — encryption_version column + DecryptVersioned

**Root cause (from the investigation):** rows in `workspace_secrets` / `global_secrets` store `encrypted_value bytea`, but whether they're *actually* encrypted depends entirely on whether `SECRETS_ENCRYPTION_KEY` was set at the moment of `Encrypt` — `crypto.Encrypt` short-circuits and returns plaintext bytes when encryption is disabled. Switching on the key later makes `crypto.Decrypt` try GCM on plaintext bytes → fails → the provisioner silently skips the row → the container crashes on a missing OAuth token. With PR #83 (fail-secure) pushing operators toward setting the key, this trap was about to start biting real installs.

**Fix:**

- Migration `018_secrets_encryption_version` adds `encryption_version INT NOT NULL DEFAULT 0` to both secret tables. All existing rows become `version=0` (plaintext). Additive, safe.
- `crypto/aes.go`:
  - `EncryptionVersionPlaintext = 0`, `EncryptionVersionAESGCM = 1` constants.
  - `CurrentEncryptionVersion()` — tells callers which tag to write.
  - `DecryptVersioned(value, version)` — dispatches on the tag; `v=0` passes through, `v=1` runs GCM (and errors if `IsEnabled()` is false). Unknown version → clear error.
  - Existing `Decrypt` deprecated-in-comment but kept for callers that haven't migrated (backward-compat during the transition).
- `handlers/workspace_provision.go`: SELECT now pulls `encryption_version`; decrypt uses `DecryptVersioned`; on failure it **aborts provisioning with a loud FATAL log + marks the workspace failed** (#66-style silent failure removed).
- `handlers/secrets.go`: both `Set` and the global `SetGlobalSecret` persist `encryption_version = CurrentEncryptionVersion()` on INSERT. `ON CONFLICT` also updates the version — re-setting a historical plaintext row while a key is active upgrades it to GCM in-place.
- `handlers/secrets.go::GetModel`: SELECT pulls the version, uses `DecryptVersioned`.

**Tests:** 6 new crypto tests (plaintext pass-through, GCM round-trip, GCM requires key, unknown version rejected, `CurrentEncryptionVersion` tracks key state, the exact #85 scenario end-to-end). 6 existing secret handler tests updated for the 4-arg INSERT. Full Go test suite passes.

**E2E (live):**

- Migration applied automatically on platform boot: `encryption_version` column present on both tables.
- 102 pre-existing plaintext rows correctly tagged `version=0`.
- New `TEST_NEW_SECRET_85` stored as 39 bytes (11 plaintext + 12 nonce + 16 tag = ✓) with `version=1`.
- PM container restart succeeds — both `CLAUDE_CODE_OAUTH_TOKEN` (v=0 historical plaintext) AND `TEST_NEW_SECRET_85` (v=1 encrypted) are decrypted correctly and injected into the container env.

Branch: `fix/85-encryption-version-migration`. Closes #85.

---

## #67 fix — inject MOLECULE_URL at workspace provision time

**Root cause:** agents calling `mcp__molecule__*` tools from inside a workspace container were hitting `localhost:8080` (the container's own localhost, not the host). The MCP client (`mcp-server/src/index.ts`) defaulted to `MOLECULE_URL || "http://localhost:8080"` and the provisioner only injected `PLATFORM_URL`, never `MOLECULE_URL`.

**Fix (two-sided, belt-and-suspenders):**

1. `workspace-server/internal/provisioner/provisioner.go` — extracted env building into a pure `buildContainerEnv(cfg WorkspaceConfig) []string` so it's unit-testable (sketched below). Now injects `MOLECULE_URL` alongside `PLATFORM_URL` with the same value.
2. `mcp-server/src/index.ts` — the client now prefers `MOLECULE_URL`, falls back to `PLATFORM_URL`, then `localhost:8080`. Protects older containers that don't yet have `MOLECULE_URL`.
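The extracted builder, sketched — `WorkspaceConfig` trimmed to what the example needs:

```go
// Illustrative sketch of the pure env builder.
package provisioner

// WorkspaceConfig (trimmed): only the fields the sketch uses.
type WorkspaceConfig struct {
	PlatformURL string   // host-reachable platform address
	ExtraEnv    []string // per-workspace custom env
}

// buildContainerEnv is pure, so the MOLECULE_URL/PLATFORM_URL pairing is
// unit-testable without Docker. Both carry the same value: inside the
// container, localhost:8080 is the container itself (the #67 bug).
func buildContainerEnv(cfg WorkspaceConfig) []string {
	env := []string{
		"PLATFORM_URL=" + cfg.PlatformURL,
		"MOLECULE_URL=" + cfg.PlatformURL,
	}
	return append(env, cfg.ExtraEnv...)
}
```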
**Tests:** 4 new Go tests (`buildContainerEnv` injects both env vars, MOLECULE_URL always matches PLATFORM_URL across URL shapes, awareness both-or-nothing, custom envs append). Full provisioner suite green. 88 existing MCP tests still pass (the fallback chain preserves existing behaviour).

**E2E verified live:** rebuilt the platform, restarted the PM, `docker exec env` shows both `PLATFORM_URL=http://host.docker.internal:8080` and `MOLECULE_URL=http://host.docker.internal:8080` on the recreated container.

**Side-discovery (filed as #85):** enabling `SECRETS_ENCRYPTION_KEY` on an install with pre-existing plaintext secrets silently breaks every secret — `crypto.Decrypt` runs GCM on plaintext bytes → fails → `log.Printf + continue` → row dropped → workspace crashes on preflight. Proposed fix: `encryption_version` column + boot-time re-encryption migration + fail-loud on decrypt mismatch.

Branch: `fix/67-inject-molecule-url`.

---

## #73 fix — close three real delete-race windows

**Observed symptom (corrected):** during the session's bulk-delete runs, PM / Research Lead / Dev Lead consistently survived as "stragglers." Turned out the cause wasn't a race — it was the `DELETE /workspaces/:id` endpoint returning **HTTP 200** with `{"status":"confirmation_required"}` when the workspace has children and `?confirm=true` is not set. The bulk-delete script read HTTP 200 as success and moved on.

**What the #73 fix actually closes:** three real but distinct race windows that would bite in production even with correct `?confirm=true` usage:

1. `handlers/registry.go::Register` — `ON CONFLICT DO UPDATE SET status='online'` ran unconditionally; a late heartbeat from a workspace that was just soft-deleted (status='removed') could resurrect the row. Guard added: `WHERE workspaces.status IS DISTINCT FROM 'removed'`.
2. `handlers/registry.go::Heartbeat` — the same UPDATE path had no filter; late heartbeats refreshed `last_heartbeat_at` on tombstoned rows (confusing liveness). Guard: `AND status != 'removed'`. Plus the `evaluateStatus` recovery path made conditional in-SQL (`AND status = 'offline'`).
3. `handlers/workspace.go::Delete` — the sequence was: stop container → UPDATE status='removed'. Between those calls, Redis TTL expiry could trigger the liveness monitor, which called `RestartByID`, recreating the container. New order (sketched below): UPDATE status='removed' FIRST (for self + descendants as a single batch), THEN stop containers + remove volumes. Auto-restart paths now see status='removed' immediately and bail out via their existing `NOT IN ('removed', ...)` guards.
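The reordered delete path from fix 3, sketched — the recursive-CTE shape and the `parent_id` column are assumptions about the schema:

```go
// Illustrative sketch: tombstone first, teardown second.
package handlers

import (
	"context"
	"database/sql"
)

// Tombstone self + all descendants in one statement.
const tombstoneSQL = `
WITH RECURSIVE tree AS (
    SELECT id FROM workspaces WHERE id = $1
    UNION ALL
    SELECT w.id FROM workspaces w JOIN tree t ON w.parent_id = t.id
)
UPDATE workspaces SET status = 'removed' WHERE id IN (SELECT id FROM tree)`

func deleteWorkspaceTree(ctx context.Context, db *sql.DB,
	teardown func(context.Context, string) error, rootID string) error {
	// Step 1: tombstone FIRST. From here on, the liveness monitor's
	// NOT IN ('removed', ...) guard refuses to auto-restart anything in the tree.
	if _, err := db.ExecContext(ctx, tombstoneSQL, rootID); err != nil {
		return err
	}
	// Step 2: only now stop containers + remove volumes; a Redis-TTL expiry
	// firing in between can no longer resurrect the container (#73 race 3).
	return teardown(ctx, rootID)
}
```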
**Tests:** 2 new registry tests pinning the SQL guards (substring match on the emitted UPDATE); 2 existing delete tests updated for the new order (a single batch UPDATE covering self + descendants). Full `go test ./... -race` green.

**Live E2E:** bulk delete of 12 workspaces with `?confirm=true` → all cleanly removed, **zero stragglers**, no pending provisions.

**Separate issue filed:** API DX — DELETE should return 4xx (e.g. 409 Conflict) when confirmation is required, not 200. The misleading status code made the session's symptom diagnosis wrong for hours.

Branch: `fix/73-delete-workspace-race`.

---

## #88 fix — DELETE returns 409 Conflict when confirmation required

**Observed during #73:** bulk-delete scripts that read HTTP 200 as success silently skipped every parent workspace, leaving tier-3 / parent nodes behind and looking like a platform race bug.

**Fix:** a one-line change in `handlers/workspace.go::Delete` — return `http.StatusConflict` (409) instead of `http.StatusOK` (200) when children exist and `?confirm=true` isn't set. The response body shape is unchanged (the canvas UI + MCP server both parse the JSON body, not the status code).

No regressions: canvas (`DetailsTab.tsx:75`) and the MCP server (`mcp-server/src/index.ts:80`) already pass `?confirm=true` on every delete. The 409 only affects manual API users + bulk scripts that forgot — exactly the cohort that was silently failing.

**Tests:** 1 existing delete test updated to expect 409. Full `go test ./...` green.

**Live E2E:** real platform, real parent+child workspaces — `DELETE /workspaces/:id` (no confirm) returns `http=409` with the expected JSON body; `DELETE /workspaces/:id?confirm=true` still returns 200.

Branch: `fix/88-delete-confirm-409`. Closes #88.

---

## #74 fix — retry delegation once after reactive URL refresh

**Clarification of the original issue:** the delegation worker (`handlers/delegation.go::executeDelegation`) already calls the shared `h.workspace.proxyA2ARequest(...)` path — so it DOES benefit from the A2A proxy's reactive health-check / URL-refresh on connection errors. The real gap is that the reactive refresh runs *after* the current request fails; the caller still gets an error for that specific delegation attempt. During bulk restarts (observed 21:40 today), the PM's delegation worker fired during the warm-up window, hit a stale URL, and the single-attempt logic marked the delegation `failed`.

**Fix:** add a single retry with an 8-second pause when `proxyA2ARequest` returns a transient-looking error. The pause is long enough for the reactive refresh + container restart to land a fresh URL in the cache. `isTransientProxyError` classifies which statuses retry (sketched below):

- **502 Bad Gateway** (plain connection failure) — retry
- **503 Service Unavailable** (the reactive check decided to restart the container) — retry
- **404 / 403 / 400 / 500** — static, don't waste the retry window
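The classifier, sketched from the list above (function shape assumed):

```go
// Illustrative sketch: only failures that a reactive URL refresh +
// container restart can cure earn the single 8-second retry.
package handlers

import "net/http"

func isTransientProxyError(status int) bool {
	switch status {
	case http.StatusBadGateway, http.StatusServiceUnavailable: // 502, 503
		return true
	default: // 400/403/404/500 are static; retrying just burns the window
		return false
	}
}
```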
**Tests:** 7 new cases on the classifier matrix + a regression guard on the 8-second window. Full `go test ./... -race` green.

Branch: `fix/74-delegation-via-a2a-proxy`. Closes #74.

---

## 100% platform coverage — MCP + molecli

Full parity pass so every platform endpoint is reachable from both client layers.

### MCP server (`mcp-server/src/index.ts`): 61 → 83 tools

**+22 new handlers** added in a single coverage-completion block at the bottom of the file:

- Delegations (#64): `record_delegation`, `update_delegation_status`
- Activity: `report_activity`, `notify_user`
- Canvas viewport: `get_canvas_viewport`, `set_canvas_viewport`
- Channels (platform-level): `discover_channel_chats`
- Plugins: `list_plugin_sources`, `list_available_plugins`, `check_plugin_compatibility`
- Schedules (cron): `list_schedules`, `create_schedule`, `update_schedule`, `delete_schedule`, `run_schedule`, `get_schedule_history`
- Session + shared context: `session_search`, `get_shared_context`
- K/V memory (distinct from HMA): `memory_set`, `memory_get`, `memory_list`, `memory_delete_kv`

**Updated schemas:** `create_workspace` + `update_workspace` now accept `workspace_access` (none / read_only / read_write) + explicit `runtime` / `workspace_dir` params.

All 88 existing MCP tests still pass; `npm run build` green.

### molecli CLI (`workspace-server/cmd/cli/`): 9 → 21 top-level commands

Two new files:

- `cmd_api.go` — `molecli api [json-body]` raw escape hatch. Hits any endpoint without a typed wrapper.
- `cmd_ops.go` — typed subcommands (thin wrappers over the shared `callAPI` helper) for operator ergonomics:
  - `ws restart|pause|resume` — lifecycle ops
  - `plugin registry|sources|list|available|install|uninstall`
  - `secret list|set|delete|list-global|set-global|delete-global`
  - `schedule list|add|remove|run|history`
  - `channel adapters|list|remove|send|test`
  - `approval pending|list|decide`
  - `delegation list|create`
  - `bundle export|import`
  - `org templates|import`
  - `traces `
  - `activity list `
  - `hma commit|search`

`go test ./cmd/cli/` passes; live smoke-test against the running platform: `api GET /health`, `plugin sources`, `org templates`, `ws restart ` all return expected responses.

Branch: `feat/mcp-molecli-full-coverage`.

---

## #65 fix — per-agent workspace_access in org.yaml + API

**Design from the ecosystem-research outcomes doc:** a new `workspace_access: none | read_only | read_write` field on every workspace, enforced at container provision time via Docker's native `:ro` bind-mount flag. Eliminates the "PM couriers documents to reports" workaround by letting research agents have read-only repo access without the write risk.

**Changes:**

- **Migration 019** — adds `workspace_access VARCHAR(20) NOT NULL DEFAULT 'none'` with a CHECK constraint. Additive; all existing rows become 'none' (current isolated-volume behaviour preserved).
- **`provisioner.go`:**
  - New `WorkspaceAccess` field on `WorkspaceConfig`.
  - Constants `WorkspaceAccessNone`/`ReadOnly`/`ReadWrite`.
  - `buildWorkspaceMount(cfg)` — pure helper, selects between named-volume, rw bind, and `:ro` bind based on access + workspace_path (sketched below).
  - `ValidateWorkspaceAccess(access, path)` — rejects `read_*` without a path and unknown values.
- **`handlers/workspace.go::Create`** and **`handlers/org.go::createOrgWorkspace`** — validate + persist `workspace_access` on INSERT. The response body echoes the stored value.
- **`handlers/workspace_provision.go::buildProvisionerConfig`** — reads `workspace_access` from the DB (with payload override) and forwards it to the provisioner. Restart paths preserve the mode.
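The selection helpers, sketched — signatures condensed (the real `buildWorkspaceMount` takes the full config) and the mount-string format is an assumption:

```go
// Illustrative sketch of the access × path selection matrix.
package provisioner

import (
	"errors"
	"fmt"
)

const (
	WorkspaceAccessNone      = "none"
	WorkspaceAccessReadOnly  = "read_only"
	WorkspaceAccessReadWrite = "read_write"
)

// ValidateWorkspaceAccess rejects read_* without a host path, and unknown values.
func ValidateWorkspaceAccess(access, path string) error {
	switch access {
	case WorkspaceAccessNone:
		return nil
	case WorkspaceAccessReadOnly, WorkspaceAccessReadWrite:
		if path == "" {
			return errors.New("workspace_access " + access + " requires workspace_dir")
		}
		return nil
	default:
		return fmt.Errorf("unknown workspace_access %q (allowed: none, read_only, read_write)", access)
	}
}

// buildWorkspaceMount picks named volume vs bind mount, adding :ro for read_only.
func buildWorkspaceMount(volumeName, access, hostPath string) string {
	switch access {
	case WorkspaceAccessReadOnly:
		return hostPath + ":/workspace:ro" // Docker-native read-only enforcement
	case WorkspaceAccessReadWrite:
		return hostPath + ":/workspace"
	default:
		return volumeName + ":/workspace" // isolated named volume (current default)
	}
}
```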
**Tests:**

- Provisioner: 2 new tables — `TestBuildWorkspaceMount_SelectionMatrix` (6 cases covering the full access × path matrix) and `TestValidateWorkspaceAccess` (7 cases).
- Handler INSERT WithArgs updated across 5 existing tests for the new 9th column.
- Full `go test ./... -race` green.

**Live E2E:**

- Migration auto-applied → the `workspaces` table has `workspace_access` with the CHECK constraint.
- `POST /workspaces {"workspace_access":"read_only","workspace_dir":"/repo"}` → 201 with `"workspace_access":"read_only"` echoed; the DB row is correct.
- `POST {"workspace_access":"read_only"}` (no workspace_dir) → 400 with a clear error.
- `POST {"workspace_access":"wildcard"}` → 400 with the allowed-values list.
- Container inspected after provision: the `/workspace` mount has `RW=false Mode=ro`; `touch /workspace/foo` from inside returns `Read-only file system` → enforcement is real.

Branch: `feat/65-workspace-access-yaml`. Closes #65.

---

## #64 fix — agent registers delegations with platform (Option A)

**Root cause (confirmed in a comment on #64):** `check_delegation_status` reads from the agent's local `_delegations` dict; the platform's `GET /workspaces/:id/delegations` reads from `activity_logs`. The agent's `delegate_to_workspace` MCP tool sends A2A directly and never touches `activity_logs` — so the platform's view was always empty for agent-initiated delegations.

**Fix (minimal Option A, dual-write):**

- Platform: two new endpoints on `DelegationHandler` —
  - `POST /workspaces/:id/delegations/record` — inserts a single `activity_logs` row with `method='delegate'`, status='dispatched'. No A2A fired (the agent does that directly for OTEL/retry reasons).
  - `POST /workspaces/:id/delegations/:delegation_id/update` — accepts `status ∈ {completed, failed}` + optional error + preview. UPDATEs the original row and (on completion) INSERTs a `delegate_result` row matching the canvas-path flow.
- Agent (`workspace/builtin_tools/delegation.py`):
  - New best-effort async helpers `_record_delegation_on_platform` and `_update_delegation_on_platform`. Failures are logged at debug and swallowed — they never block the actual A2A delegation path.
  - `_execute_delegation` calls `_record_...` at task start and `_update_...` on completion / failure (alongside the existing `_notify_completion`).

**Result:** the agent keeps direct A2A for speed + OTEL trace-context propagation + the existing retry logic; the platform's activity_logs mirrors the same set the agent's local dict holds. `GET /delegations` now returns rows for agent-initiated delegations.

**Tests:** 5 new Go tests (Record inserts + rejects invalid UUID, UpdateStatus completed inserts the result row + rejects unknown status + failed broadcast). 4 new Python tests (record fires HTTP POST, best-effort on platform error, update completed, update truncates a large preview to 500 chars). Python pytest 1060 → 1064; full Go suite green.

Branch: `fix/64-agent-delegate-via-platform`. Closes #64.

---

## SDK — workspace / org / channel validators

**Issue:** the SDK only validated plugins. Authors publishing workspace config templates, org templates, or channel configs had no lint step — errors only surfaced at `POST /org/import` or container startup.

**Fix:** extended `sdk/python/molecule_plugin/` with three new modules:

- `workspace.py` — validates `config.yaml` (name, runtime, tier, runtime_config shape). `SUPPORTED_RUNTIMES` kept in sync with `provisioner.RuntimeImages`.
- `org.py` — recursively validates `org.yaml` (name, workspaces tree, workspace_access + workspace_dir pairing per #65, channels via the delegated `validate_channel_config`, schedules, plugins, external+url, children).
- `channel.py` — validates channel configs (standalone dict or YAML file). `SUPPORTED_CHANNEL_TYPES` is currently `{telegram}`; extend when the Slack/Discord adapters land.

The CLI (`python -m molecule_plugin validate {plugin|workspace|org|channel} `) dispatches to the right validator; bare `validate ` still defaults to plugin for back-compat. Exit 0 on valid, 1 on any error. `validate_channel_config` is the single source of truth for the channel schema — `org.py` delegates to it rather than duplicating checks.

**Tests:** `sdk/python/tests/test_validators.py` — 37 new tests (happy path, missing file, bad YAML, non-object, each field error, null-safety on `runtime_config: None` / `defaults: null`, CLI dispatch for all 4 kinds, back-compat form). Fixed a bug found during test authoring: `org.py` crashed on non-dict children; now guarded with an `isinstance` check.

**Live smoke:** all 4 in-repo org templates (`free-beats-all`, `reno-stars`, `molecule-dev`, `molecule-worker-gemini`) validate clean.

**SDK pytest:** 50 → 87.

Branch: `feat/sdk-workspace-org-channel`.

---

## Top-5 #3 — parallel adapter builds

DevOps proposal from the ecosystem-research outcomes doc. All six adapter Dockerfiles build `FROM workspace-template:base` with no inter-adapter dependency, so they're safe to build concurrently once the base is done.

**Change** (`workspace/build-all.sh`):

- Serial path kept for single-runtime rebuilds and `SERIAL_BUILD=1` CI environments (preserves the bounded-concurrency option).
- Parallel path: fan out one `docker build` per adapter, capture stdout/stderr to `/tmp/build_.log`, wait for all, tally per-tag success/failure. Failures still exit non-zero.

**E2E:** `bash build-all.sh claude-code deepagents langgraph` finished in **43s wall-clock** (three adapter builds running concurrently). Previously ~120s serial. Log files live under `/tmp/build_*.log` for post-hoc debugging.

Branch: `feat/top5-3-parallel-adapter-builds`.