
2026-04-12

Summary

Shipped the full two-axis plugin architecture on feat/agentskills-compliance (PR #62). Plugin source (where files come from) and plugin shape (what's inside them) are now independent, pluggable axes.

  • Source axis — workspace-server/internal/plugins/ package: SourceResolver interface, Registry, LocalResolver, GithubResolver, ParseSource (sketched below). POST /workspaces/:id/plugins accepts {name} (back-compat → local) or {source: "scheme://spec"}. New GET /plugins/sources enumerates registered schemes.
  • Shape axis — workspace/plugins_registry/ package: PluginAdaptor protocol, hybrid resolver (registry > plugin-shipped > raw-drop), AgentskillsAdaptor built-in for agentskills.io-format skills + Molecule AI's rules extension. Named sub-type adapters planned for MCP, DeepAgents sub-agents, LangGraph sub-graphs, etc.
  • agentskills.io compliance — every first-party skill passes the open standard; python -m molecule_plugin validate CLI enforces it in CI. Our skills are now installable in ~35 other agent tools (Cursor, Codex, Copilot, Gemini CLI, etc.).
  • Gemini org parity — molecule-worker-gemini mirrors molecule-dev (11 workspaces, Research + Dev branches, schedules, Telegram channel, per-agent prompts) as the E2E proof point.
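
The Go surface of the source axis is small. As a rough sketch — only the names (SourceResolver, Registry, Schemes, ParseSource) come from this log; every signature and body below is an assumption:

```go
package plugins

import (
	"context"
	"fmt"
	"sort"
	"strings"
	"sync"
)

// SourceResolver turns a source spec ("molecule-dev", "owner/repo") into a
// staged directory on disk that the shape axis can then adapt.
type SourceResolver interface {
	Resolve(ctx context.Context, spec string) (stagedDir string, err error)
}

// Registry maps URI schemes ("local", "github") to resolvers.
type Registry struct {
	mu        sync.RWMutex
	resolvers map[string]SourceResolver
}

func (r *Registry) Register(scheme string, res SourceResolver) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.resolvers == nil {
		r.resolvers = make(map[string]SourceResolver)
	}
	r.resolvers[scheme] = res
}

// Schemes returns registered schemes, sorted for stable API output.
func (r *Registry) Schemes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.resolvers))
	for s := range r.resolvers {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

// ParseSource splits "scheme://spec", trimming whitespace and rejecting an
// empty spec — both behaviours called out in the review rounds below.
func ParseSource(source string) (scheme, spec string, err error) {
	scheme, spec, ok := strings.Cut(strings.TrimSpace(source), "://")
	if !ok || scheme == "" {
		return "", "", fmt.Errorf("source must look like scheme://spec")
	}
	if spec = strings.TrimSpace(spec); spec == "" {
		return "", "", fmt.Errorf("empty spec after '%s'", scheme)
	}
	return scheme, spec, nil
}
```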

Files touched

Platform (Go):

  • workspace-server/internal/plugins/{source,local,github}.go + tests — source layer, 97.4% coverage.
  • workspace-server/internal/envx/envx.go + test — env-var helpers, 100% coverage.
  • workspace-server/internal/handlers/plugins.go — install pipeline refactored into resolveAndStage + deliverToContainer; typed httpErr for status propagation; sort.Strings in Registry.Schemes; logInstallLimitsOnce on startup.
  • workspace-server/internal/router/router.go — new routes (/plugins/sources, /workspaces/:id/plugins/available, /workspaces/:id/plugins/compatibility).
  • workspace-server/Dockerfile — apk add git for the github resolver.

Workspace runtime (Python):

  • workspace/plugins_registry/ — new module: protocol.py, builtins.py (AgentskillsAdaptor), raw_drop.py, and the hybrid resolver.
  • workspace/skill_loader/ — renamed from skills/; reads scripts/ per the agentskills.io spec.
  • workspace/builtin_tools/ — renamed from tools/ to disambiguate from user-plugin tool dirs.
  • workspace/adapters/base.py — added hooks: memory_filename, register_tool_hook, register_subagent_hook, append_to_memory_hook, install_plugins_via_registry. Default inject_plugins() drives the new pipeline.
  • workspace/adapters/claude_code/adapter.py — deleted the 40-line inject_plugins() override.
  • workspace/adapters/deepagents/Dockerfile — ships plugins_registry/.
  • workspace/plugins.py — new PluginManifest.runtimes field.

Plugins (content):

  • plugins/*/adapters/{claude_code,deepagents}.py — one-line from plugins_registry.builtins import AgentskillsAdaptor as Adaptor.
  • plugins/*/plugin.yaml — declare runtimes: [claude_code, deepagents].

SDK (Python):

  • sdk/python/molecule_plugin/protocol.py, builtins.py (SDK-vendored AgentskillsAdaptor), manifest.py (spec validator), CLI via __main__.py.
  • sdk/python/template/ — cookiecutter skeleton.

Org templates:

  • org-templates/molecule-worker-gemini/org.yaml — full parity with molecule-dev (11 workspaces, schedules, Telegram, per-agent prompts, workspace_dir mount on PM, required_env: [GOOGLE_API_KEY]).
  • Copied 5 system-prompt.md files from molecule-dev (research-lead, market-analyst, technical-researcher, competitive-intelligence, uiux-designer).

Docs:

  • docs/plugins/agentskills-compat.md — two-layer model, spec mapping.
  • docs/plugins/sources.md — two-axis source/shape architecture, security model, future resolvers.
  • docs/ecosystem-watch.md — Holaboss, Hermes Agent, gstack entries (adjacent projects to track).
  • .env.example — PLUGIN_INSTALL_* vars documented.
  • PLAN.md — plugin-adaptor landed; deferred items listed.
  • CLAUDE.md — new endpoints, env vars, test counts.

Test counts

  • Go platform: all packages green under -race.
  • Python workspace: 1040 passed, 9 skipped.
  • Python SDK: 50 passed.
  • Total: 1090 passing.

Coverage on new code:

  • workspace-server/internal/plugins/*: 97.4%
  • workspace-server/internal/envx/*: 100%
  • workspace/plugins_registry/*: 100%
  • workspace/skill_loader/*: 100%
  • sdk/python/molecule_plugin/*: 100%

5 rounds of code review

Every round addressed by new commits on the branch:

  1. Round 1 — initial coverage pass.
  2. Round 2 — memory_filename plumbing through InstallContext; logger in skill_loader; module constants for SKILLS_SUBDIR, SKIP_ROOT_MD, SKILL_NAME_*; SDK↔runtime drift-guard test; frontmatter parser unification.
  3. Round 3 — fetch timeout + body size cap + staged-dir size cap via new env vars; typed ErrPluginNotFound sentinel replaces string matching; reject requests that set both name and source; sort.Strings in Schemes; sync.RWMutex on Registry; a -- argument separator in git clone; docs clarify the github resolver is public-only.
  4. Round 4 — ParseSource empty-spec guard; dirSize(cap) → dirSize(limit) parameter rename; localNameRE length bound; extract envDuration/envInt64 into internal/envx (see the sketch after this list); LANG=C LC_ALL=C in git child env for locale-stable error parsing.
  5. Round 5 — typed httpErr replaces 5-value tuple; resolveAndStage decoupled from *gin.Context via installRequest struct; drop unused source param from deliverToContainer; trim whitespace in ParseSource; consolidate 3 test resolver stubs into 1 parameterized fakeResolver + 3 constructors.
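
A plausible shape for the extracted internal/envx helpers — the exported names and the fall-back-on-parse-error behaviour here are assumptions, not confirmed API:

```go
package envx

import (
	"os"
	"strconv"
	"time"
)

// Duration reads name as a time.Duration, returning def when the variable
// is unset or unparseable.
func Duration(name string, def time.Duration) time.Duration {
	v := os.Getenv(name)
	if v == "" {
		return def
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return def
	}
	return d
}

// Int64 reads name as an int64 with the same fallback rule.
func Int64(name string, def int64) int64 {
	v := os.Getenv(name)
	if v == "" {
		return def
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		return def
	}
	return n
}
```

This is consistent with the PLUGIN_INSTALL_* env vars driving the startup log quoted below (body=65536 bytes timeout=5m0s staged=104857600 bytes).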

Live E2E confirmed

  • GET /plugins/sources → {"schemes":["github","local"]}.
  • POST {"name":"molecule-dev"} → installed via local (back-compat).
  • POST {"source":"local:// molecule-dev "} → installed (whitespace trimmed).
  • POST {"name":"a","source":"local://b"} → 400 "not both".
  • POST {"source":"github://"} → 400 "empty spec after 'github'".
  • POST {"source":"mystery://x"} → 400 + available_schemes: [...].
  • Uninstall + reinstall on PM workspace: CLAUDE.md has # Plugin: molecule-dev / rule: codebase-conventions.md marker; /configs/skills/review-loop/ present; zero container errors.
  • Startup log on platform boot: Plugin install limits: body=65536 bytes timeout=5m0s staged=104857600 bytes.

Branch

feat/agentskills-compliance → PR #62 (open, all CI green, ready to merge). Use git log --oneline origin/main.. for the commit list — counting commits inline goes stale fast.


Post-merge session — team coordination, platform hardening, new backlog

After PR #62 landed, the session continued with the ecosystem-watch doc ship, a gemini-org proof-point attempt, and a PLAN.md refresh coordinated through the agent team. Several platform bugs surfaced; all filed and tracked.

Shipped

  • PR #59 regression fix — PR #59 had rewritten http://127.0.0.1:<port> → http://ws-<id>:8000 unconditionally, breaking platform-on-host mode. Now gated behind platformInDocker detection (/.dockerenv or MOLECULE_IN_DOCKER=1). workspace-server/internal/handlers/a2a_proxy.go. Commit 4b42913.
  • PR #61 — docs/ecosystem-watch.md: Holaboss / Hermes / gstack entries + template + backlog candidates. Merged.
  • Cross-references for ecosystem-watch — wired into PLAN.md (new "Ecosystem Awareness" section), README.md + README.zh-CN.md Documentation Map, and CLAUDE.md (new "Ecosystem Context" section). Agents couldn't discover the doc because it wasn't linked anywhere; PM reported it missing despite being in its bind mount. Commit 8ae5e73.
  • DeepAgents adapter: virtual_mode=False in workspace/adapters/deepagents/adapter.py. Previously read_file/ls/write_file/edit_file operated on an in-memory snapshot that drifted from the bind-mounted /workspace; writes didn't persist across restarts and real files reported as missing. Commit bc563d1.
  • LangGraph recursion limit 100 → 500 default in workspace/a2a_executor.py. PM fan-out to 6+ reports routinely overran the 100-step ceiling. Still overridable via LANGGRAPH_RECURSION_LIMIT env var. Commit d892eb4.
  • Gemini org model swap — gemini-3.1-pro-preview → gemini-2.5-pro in org-templates/molecule-worker-gemini/org.yaml (3.1-pro-preview's 25 req/min couldn't sustain 11-workspace delegation waves). Commit 4b42913.
  • Backlog tracking for #64 / #65 added to PLAN.md Backlog. Commit ba1cc15.

Open PRs (awaiting CEO approval)

  • #68 docs/plan-refresh — PLAN.md refresh: correct test counts (Canvas 325→345, Python 990→1,040, +SDK row 50, total 1,811→1,911), promote #66/#67 to backlog with actual issue content. Coordinated with the molecule-dev team; corrected the PM's hallucinated content for #66/#67 before opening.
  • #69 chore/team-system-prompts-hardening — harden PM / Dev Lead / Research Lead system prompts with hard-learned rules from today's coordination incident (15 rules total across 3 roles). Every rule maps to a specific failure we hit today.

New platform issues filed

  • #64 — GET /workspaces/:id/delegations returns [] while the agent-side check_delegation_status tool shows 4 delegations. Sources-of-truth mismatch. Bug.
  • #65 — Per-agent repo-access config in org.yaml. New workspace_access: none | read_only | read_write field + :ro bind-mount for research agents. Eliminates the "PM couriers documents to reports" workaround. Enhancement.
  • #66 — claude_sdk_executor.py swallows subprocess stderr on CLI exit ≠ 0. Every failure surfaces the same opaque "Command failed with exit code 1 / Check stderr output for details". High-priority bug; blocked real debugging today.
  • #67 — Agent MCP client defaults to http://localhost:8080, which inside a workspace container is the container itself. Inject MOLECULE_URL=${PLATFORM_URL} at provision time. High-priority bug; blocked PM from restarting its own reports.

Gemini org — proof-point attempt, rolled back

Deployed molecule-worker-gemini (11 DeepAgents workspaces), exercised the full delegation tree, hit three distinct blockers:

  1. virtual_mode=True made PM report real files as missing (fixed in bc563d1 above).
  2. LangGraph recursion limit 100 tripped on PM fan-out (fixed in d892eb4 above).
  3. Repeated retries exhausted the Google AI Studio monthly spending cap for the whole project.

Rolled back to molecule-dev (Claude Code runtime) to finish the PLAN.md refresh task.

Session-state contamination note

After a ProcessError crash on a Claude Code workspace, subsequent A2A calls to that workspace keep failing identically until the workspace is restarted — even when the same SDK query run manually from inside the container succeeds. Root cause likely session resume state in the executor. Workaround: restart on ProcessError. Worth formalizing in the executor as an auto-reset on exit_code != 0 once #66 lands and we can see the real stderr.

Rules distilled for the team (now encoded in #69)

  • Never commit to main — always a feature branch + PR.
  • Verify external refs (issue numbers, PRs, SHAs, file paths) before citing them.
  • Inline documents into every sub-delegation — reports don't have the repo mount.
  • delegation.status == completed ≠ work was done.
  • Pause ~60s after a batch restart before delegating (warm-up race).
  • Quote errors verbatim, don't paraphrase.
  • Research Lead must always fan out — solo synthesis is a role failure.

#71 fix — initial_prompt marker written up-front

Root cause: main.py previously wrote /workspace/.initial_prompt_done only AFTER the initial_prompt self-send succeeded. If the prompt crashed (any ProcessError, network failure, SDK exit), the marker was never written — the next container boot replayed the same failing prompt and cascaded into "every message crashes" until an operator intervened. Observed three times on 2026-04-12 (gemini org + molecule-dev import + post-restart).

Fix (extracted from main.py into workspace/initial_prompt.py so it's unit-testable without uvicorn):

  • resolve_initial_prompt_marker(config_path) — prefer <config>/... when writable, fall back to /workspace/....
  • mark_initial_prompt_attempted(marker_path) — best-effort write, returns True/False so the caller can log a loud warning on I/O failure.
  • main.py calls mark_initial_prompt_attempted before scheduling the self-send. The post-send marker write is removed.

Semantic change: the prompt is attempted at most once per fresh boot; if it fails, operators re-send manually via chat. Trade-off: silent auto-retry-on-restart (which could cascade) is exchanged for a one-time attempt with a loud failure log.

Tests: 5 new unit tests in tests/test_main_initial_prompt.py, 100% coverage on initial_prompt.py. Live E2E verified all 12 containers write the marker up-front and no replay occurs on restart. Manual browser test via canvas chat against Research Lead returned the expected reply — full round-trip through the UI.

Branch: fix/71-initial-prompt-marker-at-start. Closes #71.


#66 fix — surface Claude SDK subprocess stderr + exit_code

Root cause: claude_sdk_executor.py caught ProcessError but extracted only str(exc), which for a crashing CLI reads "Command failed with exit code 1 (exit code: 1) / Error output: Check stderr output for details". The SDK's ProcessError actually carries .exit_code and .stderr attributes — we were silently dropping both. Every CLI crash looked identical and required ad-hoc reproduction inside the container to diagnose.

Fix: new _format_process_error(exc) helper that extracts type(exc).__name__, exc.exit_code, and exc.stderr (capped at _PROCESS_ERROR_STDERR_MAX_CHARS = 4096 to prevent log flooding). Called in the retry loop (logger.warning) and the terminal error path (logger.error + logger.exception for the full traceback). Plain exceptions without SDK attributes fall back to str(exc) — no crash on missing attrs.

Tests: 5 new unit tests in tests/test_claude_sdk_executor.py (format with full context / truncation / plain exception / exit-code only / end-to-end via execute() with caplog). Python pytest 1050 → 1055.

E2E: rebuilt workspace-template:claude-code, restarted an agent, ran _format_process_error with a real claude_agent_sdk._errors.ProcessError(exit_code=2, stderr='disk full: /tmp') inside the live container → output shows both exit_code=2 and the stderr verbatim.

Manual browser: canvas chat against Research Lead — reply BROWSER-OK-66 returned cleanly, full UI round-trip works with the new log format live.

Branch: fix/66-capture-claude-sdk-stderr. Closes #66.


#75 fix — auto-reset session_id on subprocess-level errors

Root cause: after a ProcessError (or CLIConnectionError), the executor's self._session_id still points at the dead session. On the next call, _build_options() passes resume=<stale-id> to the SDK, which boots a new subprocess that can't resume the prior session state — and crashes again. Observed as "crashed once → crashes forever" on 2026-04-12 across PM / RL / DL in the coordination runs.

Fix: new _reset_session_after_error(exc) method clears self._session_id when the exception looks subprocess-level (ProcessError, CLIConnectionError, has exit_code attribute, or message contains "exit code"). Rate-limit / capacity errors are left alone so normal retry preserves conversational continuity. Called in the retry loop, right after _format_process_error logs the context.

Tests: 5 new tests in tests/test_claude_sdk_executor.py — clears on ProcessError / preserves on rate-limit / no-op when session_id is already None / triggers on "exit code" message only / end-to-end via execute() with caplog + spy-on-_build_options asserting that the second retry attempt sees session_id=None rather than the stale ID. Python pytest 1055 → 1060.

E2E: verified in live container — _reset_session_after_error clears a stale session on ProcessError, preserves it on rate-limit.

Manual browser: canvas chat round-trip on Research Lead — message went through and agent responded normally. Zero ProcessError indicators.

Branch: fix/75-session-reset-on-process-error. Closes #75.


Top-5 #1 — Memory FTS + namespace scoping

Backend proposal from the ecosystem-research outcomes doc, the highest-convergence team ask (BE + FE + QA + UX all independently proposed some flavour of this).

Migration 017_memories_fts_namespace.up.sql:

  • agent_memories.namespace VARCHAR(50) NOT NULL DEFAULT 'general'
  • agent_memories.content_tsv tsvector (STORED generated column from to_tsvector('english', content))
  • idx_memories_fts (GIN on content_tsv)
  • idx_memories_ns (composite on workspace_id, namespace)

Handler workspace-server/internal/handlers/memories.go:

  • POST /workspaces/:id/memories accepts optional namespace (default "general", 50-char max validated at the handler).
  • GET /workspaces/:id/memories?q=... routes multi-char queries through content_tsv @@ plainto_tsquery('english', ?) with ts_rank ordering; single-char queries fall back to ILIKE (tsvector can't tokenise single chars in the 'english' config).
  • GET /workspaces/:id/memories?namespace=... filters regardless of scope.
  • Response always includes the namespace field.
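
The single-character fallback is the easy part to get wrong, so here is a hypothetical sketch of the routing branch — the helper name and exact SQL strings are illustrative, not the real handler code:

```go
package handlers

// buildSearchQuery picks the WHERE clause for ?q=: multi-character queries
// go through the GIN-indexed tsvector; single characters fall back to ILIKE
// because the 'english' config drops single-char tokens.
func buildSearchQuery(q string) (where string, arg any) {
	if len([]rune(q)) > 1 {
		// Callers ORDER BY ts_rank(content_tsv, plainto_tsquery('english', $1)).
		return "content_tsv @@ plainto_tsquery('english', $1)", q
	}
	return "content ILIKE $1", "%" + q + "%"
}
```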

Tests: 5 existing tests updated for the new column list; 4 new tests added (commit-with-namespace, namespace-too-long, FTS path, ILIKE fallback, namespace filter). Handler test suite passes.

E2E (live Postgres + running platform):

  • Platform restart applied migration 017 → column + indexes present.
  • POST with / without namespace → both work, default kicks in.
  • ?q=zinc+theme → FTS returns reference memory.
  • ?namespace=procedures → scoped retrieval works.
  • ?q=restart&namespace=procedures → combined filter works.

Branch: feat/memory-fts-namespace.


Top-5 #5 — Fail-secure encryption at boot

Security Auditor's top proposal from the outcomes doc. The platform previously booted without SECRETS_ENCRYPTION_KEY and silently stored workspace secrets in plaintext with only a WARNING log. OWASP A02:2021 (Cryptographic Failures) / STRIDE "Information Disclosure".

Fix (workspace-server/internal/crypto/aes.go):

  • New InitStrict() error variant that returns ErrEncryptionKeyMissing when MOLECULE_ENV=prod/production and the key is unset, malformed, or the wrong length. Existing Init() retained for any callers that prefer the warn-and-continue behaviour; only cmd/server/main.go switched to the strict variant.
  • isProdEnv() accepts prod, production, case-insensitive + trimmed.
  • loadKeyFromEnv refactor: one helper returns the parse error so both entry points can format it the same way.

cmd/server/main.go: crypto.InitStrict() + log.Fatalf on error. Local dev (no MOLECULE_ENV) keeps the existing warn-and-continue.
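
A standalone sketch of the fail-secure shape — the names InitStrict, isProdEnv, loadKeyFromEnv, and ErrEncryptionKeyMissing come from this log; the bodies are assumptions:

```go
package crypto

import (
	"encoding/base64"
	"errors"
	"fmt"
	"os"
	"strings"
)

var (
	key []byte // installed by Init / InitStrict when a valid key is present

	ErrEncryptionKeyMissing = errors.New(
		"SECRETS_ENCRYPTION_KEY is required when MOLECULE_ENV=prod/production")
)

func isProdEnv() bool {
	env := strings.ToLower(strings.TrimSpace(os.Getenv("MOLECULE_ENV")))
	return env == "prod" || env == "production"
}

// loadKeyFromEnv is the shared parse helper, so Init and InitStrict format
// the unset / malformed / wrong-length errors identically.
func loadKeyFromEnv() ([]byte, error) {
	raw := os.Getenv("SECRETS_ENCRYPTION_KEY")
	if raw == "" {
		return nil, errors.New("key unset")
	}
	k, err := base64.StdEncoding.DecodeString(raw)
	if err != nil {
		return nil, fmt.Errorf("malformed key: %w", err)
	}
	if len(k) != 32 { // AES-256
		return nil, fmt.Errorf("wrong key length %d, want 32", len(k))
	}
	return k, nil
}

// InitStrict fails closed in prod; everywhere else it keeps Init's
// warn-and-continue ergonomics.
func InitStrict() error {
	k, err := loadKeyFromEnv()
	if err != nil {
		if isProdEnv() {
			return fmt.Errorf("%w: %v", ErrEncryptionKeyMissing, err)
		}
		fmt.Fprintln(os.Stderr, "WARNING: secrets stored in plaintext:", err)
		return nil
	}
	key = k
	return nil
}
```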

Tests: 6 new tests in internal/crypto/aes_test.go:

  • fails in prod when key is missing
  • fails in prod on wrong-length key
  • succeeds in prod with valid key
  • allows dev mode without key (ergonomics)
  • allows staging without key (non-prod)
  • isProdEnv case-insensitivity table

E2E: /tmp/platform-failsec binary run with MOLECULE_ENV=prod + empty key → log.Fatalf triggers, platform refuses to start. Same binary with MOLECULE_ENV=prod + valid base64 key → boots, prints "AES-256-GCM enabled", serves 200 on /health.

Branch: fix/top5-5-fail-secure-encryption.


#85 fix — encryption_version column + DecryptVersioned

Root cause (from the investigation): rows in workspace_secrets / global_secrets are typed as encrypted_value bytea, but whether they're actually encrypted depends entirely on whether SECRETS_ENCRYPTION_KEY was set at the moment of Encrypt — crypto.Encrypt short-circuits and returns plaintext bytes when encryption is disabled. Switching on the key later makes crypto.Decrypt try GCM on plaintext bytes → fails → provisioner silently skips the row → container crashes on missing OAuth token.

With PR #83 (fail-secure) pushing operators toward setting the key, this trap was about to start biting real installs.

Fix:

  • Migration 018_secrets_encryption_version adds encryption_version INT NOT NULL DEFAULT 0 to both secret tables. All existing rows become version=0 (plaintext). Additive, safe.
  • crypto.aes.go:
    • EncryptionVersionPlaintext = 0, EncryptionVersionAESGCM = 1 constants.
    • CurrentEncryptionVersion() — tells callers which tag to write.
    • DecryptVersioned(value, version) — dispatches on tag; v=0 passes through, v=1 runs GCM (and errors if IsEnabled() is false). Unknown version → clear error.
    • Existing Decrypt deprecated-in-comment but kept for callers that haven't migrated (backward-compat during transition).
  • handlers/workspace_provision.go: SELECT now pulls encryption_version; decrypt uses DecryptVersioned; on failure aborts provisioning with a loud FATAL log + marks workspace failed (#66-style silent-failure removed).
  • handlers/secrets.go: both Set and global SetGlobalSecret persist encryption_version = CurrentEncryptionVersion() on INSERT. ON CONFLICT also updates the version — re-setting a historical plaintext row while a key is active upgrades it to GCM in-place.
  • handlers/secrets.go::GetModel: SELECT pulls version, uses DecryptVersioned.
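
A self-contained sketch of the crypto side — the constants, CurrentEncryptionVersion, DecryptVersioned, and IsEnabled are the names above; the decryptGCM helper and byte-layout checks are assumptions:

```go
package crypto

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

const (
	EncryptionVersionPlaintext = 0 // value column holds raw bytes
	EncryptionVersionAESGCM    = 1 // value column holds nonce || ciphertext || tag
)

var key []byte // nil when SECRETS_ENCRYPTION_KEY is unset

func IsEnabled() bool { return len(key) == 32 }

// CurrentEncryptionVersion tells writers which tag to persist with a new row.
func CurrentEncryptionVersion() int {
	if IsEnabled() {
		return EncryptionVersionAESGCM
	}
	return EncryptionVersionPlaintext
}

// DecryptVersioned dispatches on the stored tag instead of guessing from the
// bytes — the guessing is exactly what made #85 a silent failure.
func DecryptVersioned(value []byte, version int) ([]byte, error) {
	switch version {
	case EncryptionVersionPlaintext:
		return value, nil // historical plaintext rows pass through untouched
	case EncryptionVersionAESGCM:
		if !IsEnabled() {
			return nil, fmt.Errorf("row is AES-GCM encrypted but no key is configured")
		}
		return decryptGCM(value)
	default:
		return nil, fmt.Errorf("unknown encryption_version %d", version)
	}
}

func decryptGCM(value []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block) // 12-byte nonce, 16-byte tag
	if err != nil {
		return nil, err
	}
	if len(value) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext shorter than nonce")
	}
	nonce, ct := value[:gcm.NonceSize()], value[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}
```

That layout matches the E2E observation below: 11 plaintext bytes + 12-byte nonce + 16-byte tag = 39 stored bytes.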

Tests: 6 new crypto tests (plaintext pass-through, GCM round-trip, GCM requires key, unknown version rejected, CurrentEncryptionVersion tracks key state, the exact #85 scenario end-to-end). 6 existing secret handler tests updated for the 4-arg INSERT. Full Go test suite passes.

E2E (live):

  • Migration applied automatically on platform boot: encryption_version column present on both tables.
  • 102 pre-existing plaintext rows correctly tagged version=0.
  • New TEST_NEW_SECRET_85 stored as 39 bytes (11 plaintext + 12 nonce + 16 tag ✓) with version=1.
  • PM container restart succeeds — both CLAUDE_CODE_OAUTH_TOKEN (v=0 historical plaintext) AND TEST_NEW_SECRET_85 (v=1 encrypted) are decrypted correctly and injected into the container env.

Branch: fix/85-encryption-version-migration. Closes #85.


#67 fix — inject MOLECULE_URL at workspace provision time

Root cause: Agents calling mcp__molecule__* tools from inside a workspace container were hitting localhost:8080 (container's own localhost, not the host). The MCP client (mcp-server/src/index.ts) defaulted to MOLECULE_URL || "http://localhost:8080" and the provisioner only injected PLATFORM_URL, never MOLECULE_URL.

Fix (two-sided, belt-and-suspenders):

  1. workspace-server/internal/provisioner/provisioner.go — extracted env building into pure buildContainerEnv(cfg WorkspaceConfig) []string so it's unit-testable. Now injects MOLECULE_URL=<PlatformURL> alongside PLATFORM_URL.
  2. mcp-server/src/index.ts — client now prefers MOLECULE_URL, falls back to PLATFORM_URL, then localhost:8080. Protects older containers that don't yet have MOLECULE_URL.
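
A sketch of the extracted helper under those constraints — the struct carries only the fields this sketch needs; everything beyond the names buildContainerEnv, WorkspaceConfig, PLATFORM_URL, and MOLECULE_URL is invented:

```go
package provisioner

// WorkspaceConfig is trimmed to what the sketch needs.
type WorkspaceConfig struct {
	PlatformURL string
	ExtraEnv    []string
}

// buildContainerEnv is pure over the config, so tests can assert the env
// slice without touching Docker.
func buildContainerEnv(cfg WorkspaceConfig) []string {
	env := []string{}
	if cfg.PlatformURL != "" {
		// Same value under both names: PLATFORM_URL for existing consumers,
		// MOLECULE_URL so the MCP client stops resolving to the container's
		// own localhost:8080. This is the "both-or-nothing" the tests pin.
		env = append(env,
			"PLATFORM_URL="+cfg.PlatformURL,
			"MOLECULE_URL="+cfg.PlatformURL,
		)
	}
	return append(env, cfg.ExtraEnv...)
}
```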

Tests: 4 new Go tests (buildContainerEnv injects both env vars, MOLECULE_URL always matches PLATFORM_URL across URL shapes, awareness both-or-nothing, custom envs append). Full provisioner suite green. 88 existing MCP tests still pass (fallback chain preserves existing behaviour).

E2E verified live: rebuilt platform, restarted PM, docker exec env shows both PLATFORM_URL=http://host.docker.internal:8080 and MOLECULE_URL=http://host.docker.internal:8080 on the recreated container.

Side-discovery (filed as #85): enabling SECRETS_ENCRYPTION_KEY on an install with pre-existing plaintext secrets silently breaks every secret — crypto.Decrypt runs GCM on plaintext bytes → fails → log.Printf + continue → row dropped → workspace crashes on preflight. Proposed fix: encryption_version column + boot-time re-encryption migration + fail-loud on decrypt mismatch.

Branch: fix/67-inject-molecule-url.


#73 fix — close three real delete-race windows

Observed symptom (corrected): During the session's bulk-delete runs, PM / Research Lead / Dev Lead consistently survived as "stragglers." Turned out the cause wasn't a race — it was the DELETE /workspaces/:id endpoint returning HTTP 200 with {"status":"confirmation_required"} when the workspace has children and ?confirm=true is not set. The bulk-delete script read HTTP 200 as success and moved on.

What the #73 fix actually closes: three real but distinct race windows that would bite in production even with correct ?confirm=true usage:

  1. handlers/registry.go::Register — ON CONFLICT DO UPDATE SET status='online' ran unconditionally; a late heartbeat from a workspace that was just soft-deleted (status='removed') could resurrect the row. Guard added: WHERE workspaces.status IS DISTINCT FROM 'removed'.
  2. handlers/registry.go::Heartbeat — same UPDATE path had no filter; late heartbeats refreshed last_heartbeat_at on tombstoned rows (confusing liveness). Guard: AND status != 'removed'. Plus evaluateStatus recovery path made conditional in-SQL (AND status = 'offline').
  3. handlers/workspace.go::Delete — sequence was Stop container → UPDATE status='removed'. Between those calls, Redis TTL expiry could trigger the liveness monitor, which called RestartByID, recreating the container. New order: UPDATE status='removed' FIRST (for self + descendants as a single batch), THEN stop containers + remove volumes. Auto-restart paths now see status='removed' immediately and bail out via their existing NOT IN ('removed', ...) guards.
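
Illustrative versions of the three guards — the real queries have fuller column lists, but the tombstone filters and the mark-first ordering are the point:

```go
package handlers

// 1. Register: a late heartbeat-driven upsert can no longer resurrect a row
//    that Delete already tombstoned.
const registerUpsert = `
INSERT INTO workspaces (id, status)
VALUES ($1, 'online')
ON CONFLICT (id) DO UPDATE SET status = 'online'
WHERE workspaces.status IS DISTINCT FROM 'removed'`

// 2. Heartbeat: tombstoned rows stop looking alive.
const heartbeatUpdate = `
UPDATE workspaces
SET last_heartbeat_at = now()
WHERE id = $1 AND status != 'removed'`

// 3. Delete: tombstone self + descendants in one batch BEFORE stopping
//    containers, so a liveness monitor firing in the gap sees 'removed'
//    and bails out instead of calling RestartByID.
const tombstoneBatch = `
UPDATE workspaces SET status = 'removed' WHERE id = ANY($1)`
```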

Tests: 2 new registry tests pinning the SQL guards (substring match on the emitted UPDATE); 2 existing delete tests updated for the new order (single batch UPDATE covering self+descendants). Full go test ./... -race green.

Live E2E: bulk delete of 12 workspaces with ?confirm=true → all cleanly removed, zero stragglers, no pending provisions.

Separate issue filed: API DX — DELETE should return 4xx (e.g. 409 Conflict) when confirmation is required, not 200. Misleading status code made the session's symptom diagnosis wrong for hours.

Branch: fix/73-delete-workspace-race.


#88 fix — DELETE returns 409 Conflict when confirmation required

Observed during #73: bulk-delete scripts that read HTTP 200 as success silently skipped every parent workspace, leaving tier-3 / parent nodes behind and looking like a platform race bug.

Fix: one-line change in handlers/workspace.go::Delete — return http.StatusConflict (409) instead of http.StatusOK (200) when children exist and ?confirm=true isn't set. Response body shape unchanged (canvas UI + MCP server both parse the JSON body, not the status code).

No regressions: canvas (DetailsTab.tsx:75) and MCP server (mcp-server/src/index.ts:80) already pass ?confirm=true on every delete. The 409 only affects manual API users + bulk scripts that forgot — exactly the cohort that was silently failing.

Tests: 1 existing delete test updated to expect 409. Full go test ./... green.

Live E2E: real platform, real parent+child workspaces — DELETE /workspaces/:id (no confirm) returns http=409 with the expected JSON body; DELETE /workspaces/:id?confirm=true still returns 200.

Branch: fix/88-delete-confirm-409. Closes #88.

#74 fix — retry delegation once after reactive URL refresh

Clarification of the original issue: The delegation worker (handlers/delegation.go::executeDelegation) already calls the shared h.workspace.proxyA2ARequest(...) path — so it DOES benefit from the A2A proxy's reactive health-check / URL-refresh on connection errors. The real gap is that the reactive refresh runs after the current request fails; the caller still gets an error for that specific delegation attempt. During bulk restarts (observed 21:40 today), PM's delegation worker fired during the warm-up window, hit a stale URL, and the single-attempt logic marked the delegation failed.

Fix: add a single retry with an 8-second pause when proxyA2ARequest returns a transient-looking error. The pause is long enough for the reactive refresh + container restart to land a fresh URL in the cache. isTransientProxyError classifies which statuses retry:

  • 502 Bad Gateway (plain connection failure) — retry
  • 503 Service Unavailable (reactive check decided to restart the container) — retry
  • 404 / 403 / 400 / 500 — static, don't waste the retry window
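
A sketch of the classifier plus the single-retry wrapper, with the proxy call abstracted to a closure (the real code lives in the delegation worker and calls proxyA2ARequest directly):

```go
package handlers

import (
	"net/http"
	"time"
)

func isTransientProxyError(status int) bool {
	switch status {
	case http.StatusBadGateway, // 502: plain connection failure
		http.StatusServiceUnavailable: // 503: reactive check restarting the container
		return true
	default: // 400/403/404/500 are static — don't burn the retry window
		return false
	}
}

// retryOnce pauses 8s on a transient failure — long enough for the reactive
// refresh + container restart to land a fresh URL — then retries exactly once.
func retryOnce(call func() (status int, err error)) (int, error) {
	status, err := call()
	if err == nil && !isTransientProxyError(status) {
		return status, nil
	}
	time.Sleep(8 * time.Second)
	return call()
}
```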

Tests: 7 new cases on the classifier matrix + a regression guard on the 8-second window. Full go test ./... -race green.

Branch: fix/74-delegation-via-a2a-proxy. Closes #74.


100% platform coverage — MCP + molecli

Full parity pass so every platform endpoint is reachable from both client layers.

MCP server (mcp-server/src/index.ts): 61 → 83 tools

+22 new handlers added in a single coverage-completion block at the bottom of the file:

  • Delegations (#64): record_delegation, update_delegation_status
  • Activity: report_activity, notify_user
  • Canvas viewport: get_canvas_viewport, set_canvas_viewport
  • Channels (platform-level): discover_channel_chats
  • Plugins: list_plugin_sources, list_available_plugins, check_plugin_compatibility
  • Schedules (cron): list_schedules, create_schedule, update_schedule, delete_schedule, run_schedule, get_schedule_history
  • Session + shared context: session_search, get_shared_context
  • K/V memory (distinct from HMA): memory_set, memory_get, memory_list, memory_delete_kv

Updated schemas: create_workspace + update_workspace now accept workspace_access (none / read_only / read_write) + explicit runtime / workspace_dir params.

All 88 existing MCP tests still pass; npm run build green.

molecli CLI (workspace-server/cmd/cli/): 9 → 21 top-level commands

Two new files:

  • cmd_api.go — molecli api <METHOD> <PATH> [json-body] raw escape hatch. Hits any endpoint without a typed wrapper.
  • cmd_ops.go — typed subcommands (thin wrappers over shared callAPI helper) for operator ergonomics:
    • ws restart|pause|resume — lifecycle ops
    • plugin registry|sources|list|available|install|uninstall
    • secret list|set|delete|list-global|set-global|delete-global
    • schedule list|add|remove|run|history
    • channel adapters|list|remove|send|test
    • approval pending|list|decide
    • delegation list|create
    • bundle export|import
    • org templates|import
    • traces <workspace-id>
    • activity list <workspace-id>
    • hma commit|search

go test ./cmd/cli/ passes; live smoke-test against running platform: api GET /health, plugin sources, org templates, ws restart <bad-id> all return expected responses.

Branch: feat/mcp-molecli-full-coverage.

#65 fix — per-agent workspace_access in org.yaml + API

Design from the ecosystem-research outcomes doc: new workspace_access: none | read_only | read_write field on every workspace, enforced at container provision time via Docker's native :ro bind-mount flag. Eliminates the "PM couriers documents to reports" workaround by letting research agents have read-only repo access without the write risk.

Changes:

  • Migration 019 — adds workspace_access VARCHAR(20) NOT NULL DEFAULT 'none' with CHECK constraint. Additive, all existing rows become 'none' (current isolated-volume behaviour preserved).
  • provisioner.go:
    • New WorkspaceAccess field on WorkspaceConfig.
    • Constants WorkspaceAccessNone/ReadOnly/ReadWrite.
    • buildWorkspaceMount(cfg) — pure helper, selects between named-volume, rw bind, and :ro bind based on access + workspace_path.
    • ValidateWorkspaceAccess(access, path) — rejects read_* without a path and unknown values.
  • handlers/workspace.go::Create and handlers/org.go::createOrgWorkspace — validate + persist workspace_access on INSERT. Response body echoes the stored value.
  • handlers/workspace_provision.go::buildProvisionerConfig — reads workspace_access from DB (with payload override) and forwards to the provisioner. Restart paths preserve the mode.
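
A sketch of the two pure helpers under those rules — mount-string format and struct fields are assumptions:

```go
package provisioner

import "fmt"

const (
	WorkspaceAccessNone      = "none"
	WorkspaceAccessReadOnly  = "read_only"
	WorkspaceAccessReadWrite = "read_write"
)

type WorkspaceConfig struct {
	ID              string
	WorkspaceAccess string
	WorkspacePath   string // host repo path; empty means isolated volume
}

// buildWorkspaceMount selects named volume vs rw bind vs Docker-native :ro bind.
func buildWorkspaceMount(cfg WorkspaceConfig) string {
	switch cfg.WorkspaceAccess {
	case WorkspaceAccessReadOnly:
		return fmt.Sprintf("%s:/workspace:ro", cfg.WorkspacePath)
	case WorkspaceAccessReadWrite:
		return fmt.Sprintf("%s:/workspace", cfg.WorkspacePath)
	default: // none → isolated named volume, the pre-#65 behaviour
		return fmt.Sprintf("ws-%s-data:/workspace", cfg.ID)
	}
}

// ValidateWorkspaceAccess rejects read modes without a path and unknown values.
func ValidateWorkspaceAccess(access, path string) error {
	switch access {
	case WorkspaceAccessNone:
		return nil
	case WorkspaceAccessReadOnly, WorkspaceAccessReadWrite:
		if path == "" {
			return fmt.Errorf("workspace_access=%s requires workspace_dir", access)
		}
		return nil
	default:
		return fmt.Errorf("unknown workspace_access %q (allowed: none, read_only, read_write)", access)
	}
}
```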

Tests:

  • Provisioner: 2 new tables — TestBuildWorkspaceMount_SelectionMatrix (6 cases covering the full access × path matrix) and TestValidateWorkspaceAccess (7 cases).
  • Handler INSERT WithArgs updated across 5 existing tests for the new 9th column.
  • Full go test ./... -race green.

Live E2E:

  • Migration auto-applied → workspaces table has workspace_access with the CHECK constraint.
  • POST /workspaces {"workspace_access":"read_only","workspace_dir":"/repo"} → 201 with "workspace_access":"read_only" echoed; DB row correct.
  • POST {"workspace_access":"read_only"} (no workspace_dir) → 400 with clear error.
  • POST {"workspace_access":"wildcard"} → 400 with allowed-values list.
  • Container inspected after provision: /workspace mount has RW=false Mode=ro; touch /workspace/foo from inside returns Read-only file system → enforcement is real.

Branch: feat/65-workspace-access-yaml. Closes #65.

#64 fix — agent registers delegations with platform (Option A)

Root cause (confirmed in comment on #64): check_delegation_status reads from the agent's local _delegations dict; platform's GET /workspaces/:id/delegations reads from activity_logs. The agent's delegate_to_workspace MCP tool sends A2A directly and never touches activity_logs — so the platform's view was always empty for agent-initiated delegations.

Fix (minimal Option A, dual-write):

  • Platform: two new endpoints on DelegationHandler

    • POST /workspaces/:id/delegations/record — inserts a single activity_logs row with method='delegate', status='dispatched'. No A2A fired (agent does that directly for OTEL/retry reasons).
    • POST /workspaces/:id/delegations/:delegation_id/update — accepts status ∈ {completed, failed} + optional error + preview. UPDATEs the original row and (on completion) INSERTs a delegate_result row matching the canvas-path flow.
  • Agent (workspace/builtin_tools/delegation.py):

    • New best-effort async helpers _record_delegation_on_platform and _update_delegation_on_platform. Failures are logged at debug and swallowed — never block the actual A2A delegation path.
    • _execute_delegation calls _record_... at task start and _update_... on completion / failure (alongside the existing _notify_completion).
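
A hypothetical sketch of the platform-side record endpoint — the route, method='delegate', and status='dispatched' come from above; the handler shape, request fields, and column list are assumptions:

```go
package handlers

import (
	"database/sql"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
)

type DelegationHandler struct{ db *sql.DB }

type recordDelegationReq struct {
	DelegationID string `json:"delegation_id"`
	Task         string `json:"task"`
}

// Record mirrors the agent's local dict into activity_logs. No A2A is fired
// here — the agent keeps the direct path for OTEL/retry reasons.
func (h *DelegationHandler) Record(c *gin.Context) {
	var req recordDelegationReq
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}
	if _, err := uuid.Parse(req.DelegationID); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid delegation id"})
		return
	}
	_, err := h.db.ExecContext(c.Request.Context(),
		`INSERT INTO activity_logs (workspace_id, delegation_id, method, status, detail)
		 VALUES ($1, $2, 'delegate', 'dispatched', $3)`,
		c.Param("id"), req.DelegationID, req.Task)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}
	c.JSON(http.StatusCreated, gin.H{"status": "dispatched"})
}
```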

Result: agent keeps direct A2A for speed + OTEL trace-context propagation + existing retry logic; platform's activity_logs mirrors the same set the agent's local dict holds. GET /delegations now returns rows for agent-initiated delegations.

Tests: 5 new Go tests (Record inserts + rejects invalid UUID, UpdateStatus completed inserts result row + rejects unknown status + failed broadcast). 4 new Python tests (record fires HTTP POST, best-effort on platform error, update completed, update truncates large preview to 500 chars). Python pytest 1060 → 1064; full Go suite green.

Branch: fix/64-agent-delegate-via-platform. Closes #64.

SDK — workspace / org / channel validators

Issue: the SDK only validated plugins. Authors publishing workspace config templates, org-templates, or channel configs had no lint step — errors only surfaced at POST /org/import or container startup.

Fix: extended sdk/python/molecule_plugin/ with three new modules:

  • workspace.py — validates config.yaml (name, runtime, tier, runtime_config shape). SUPPORTED_RUNTIMES kept in sync with provisioner.RuntimeImages.
  • org.py — recursively validates org.yaml (name, workspaces tree, workspace_access + workspace_dir pairing per #65, channels via delegated validate_channel_config, schedules, plugins, external+url, children).
  • channel.py — validates channel configs (standalone dict or YAML file). SUPPORTED_CHANNEL_TYPES currently {telegram}; extend when Slack/Discord adapters land.

CLI (python -m molecule_plugin validate {plugin|workspace|org|channel} <path>) dispatches to the right validator; bare validate <path> still defaults to plugin for back-compat. Exit 0 on valid, 1 on any error.

validate_channel_config is the single source of truth for channel schema — org.py delegates to it rather than duplicating checks.

Tests: sdk/python/tests/test_validators.py — 37 new tests (happy, missing file, bad YAML, non-object, each field error, null-safety on runtime_config: None / defaults: null, CLI dispatch for all 4 kinds, back-compat form). Fixed bug found during test authoring: org.py crashed on non-dict children; now guarded with isinstance check.

Live smoke: all 4 in-repo org templates (free-beats-all, reno-stars, molecule-dev, molecule-worker-gemini) validate clean.

SDK pytest: 50 → 87. Branch: feat/sdk-workspace-org-channel.

Top-5 #3 — parallel adapter builds

DevOps proposal from the ecosystem-research outcomes doc. All six adapter Dockerfiles build FROM workspace-template:base with no inter-adapter dependencies, so they're safe to build concurrently once the base image is done.

Change (workspace/build-all.sh):

  • Serial path kept for single-runtime rebuilds and SERIAL_BUILD=1 CI environments (preserves bounded-concurrency option).
  • Parallel path: fan out one docker build per adapter, capture stdout/stderr to /tmp/build_<tag>.log, wait for all, tally per-tag success/failure. Failures still exit non-zero.

E2E: bash build-all.sh claude-code deepagents langgraph finished in 43s wall-clock (three adapter builds running concurrently). Previously ~120s serial. Log files live under /tmp/build_*.log for post-hoc debugging.

Branch: feat/top5-3-parallel-adapter-builds.