Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, and CLAUDE.md updated for the new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12
Summary
Shipped the full two-axis plugin architecture on feat/agentskills-compliance
(PR #62). Plugin source (where files come from) and plugin shape
(what's inside them) are now independent, pluggable axes.
- Source axis — workspace-server/internal/plugins/ package: SourceResolver interface, Registry, LocalResolver, GithubResolver, ParseSource (sketched below). POST /workspaces/:id/plugins accepts {name} (back-compat → local) or {source: "scheme://spec"}. New GET /plugins/sources enumerates registered schemes.
- Shape axis — workspace/plugins_registry/ package: PluginAdaptor protocol, hybrid resolver (registry > plugin-shipped > raw-drop), AgentskillsAdaptor built in for agentskills.io-format skills plus Molecule AI's rules extension. Named sub-type adapters planned for MCP, DeepAgents sub-agents, LangGraph sub-graphs, etc.
- agentskills.io compliance — every first-party skill passes the open standard; the python -m molecule_plugin validate CLI enforces it in CI. Our skills are now installable in ~35 other agent tools (Cursor, Codex, Copilot, Gemini CLI, etc.).
- Gemini org parity — molecule-worker-gemini mirrors molecule-dev (11 workspaces, Research + Dev branches, schedules, Telegram channel, per-agent prompts) as the E2E proof point.
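For orientation, a minimal Go sketch of the source axis, assuming simplified signatures (the real interface, Registry, and error strings live in workspace-server/internal/plugins/ and may differ):

```go
package plugins

import (
	"context"
	"fmt"
	"strings"
)

// SourceResolver fetches a plugin for one scheme (local://, github://, ...)
// and stages it into a directory the install pipeline can deliver to the
// workspace container. Signatures here are illustrative.
type SourceResolver interface {
	Scheme() string
	Resolve(ctx context.Context, spec string) (stagedDir string, err error)
}

// ParseSource splits "scheme://spec" into its two halves, trimming
// whitespace and rejecting an empty spec (round-4/round-5 review items).
func ParseSource(source string) (scheme, spec string, err error) {
	source = strings.TrimSpace(source)
	parts := strings.SplitN(source, "://", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", "", fmt.Errorf("source must look like scheme://spec, got %q", source)
	}
	scheme, spec = parts[0], strings.TrimSpace(parts[1])
	if spec == "" {
		return "", "", fmt.Errorf("empty spec after %q", scheme)
	}
	return scheme, spec, nil
}
```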
Files touched
Platform (Go):
- workspace-server/internal/plugins/{source,local,github}.go + tests — source layer, 97.4% coverage.
- workspace-server/internal/envx/envx.go + test — env-var helpers, 100% coverage.
- workspace-server/internal/handlers/plugins.go — install pipeline refactored into resolveAndStage + deliverToContainer; typed httpErr for status propagation; sort.Strings in Registry.Schemes; logInstallLimitsOnce on startup.
- workspace-server/internal/router/router.go — new routes (/plugins/sources, /workspaces/:id/plugins/available, /workspaces/:id/plugins/compatibility).
- workspace-server/Dockerfile — apk add git for the github resolver.
Workspace runtime (Python):
- workspace/plugins_registry/ — new module: protocol.py, builtins.py (AgentskillsAdaptor), raw_drop.py, resolver.
- workspace/skill_loader/ — renamed from skills/; reads scripts/ per the agentskills.io spec.
- workspace/builtin_tools/ — renamed from tools/ to disambiguate from user-plugin tool dirs.
- workspace/adapters/base.py — added hooks: memory_filename, register_tool_hook, register_subagent_hook, append_to_memory_hook, install_plugins_via_registry. Default inject_plugins() drives the new pipeline.
- workspace/adapters/claude_code/adapter.py — deleted the 40-line inject_plugins() override.
- workspace/adapters/deepagents/Dockerfile — ships plugins_registry/.
- workspace/plugins.py — PluginManifest.runtimes field.
Plugins (content):
- plugins/*/adapters/{claude_code,deepagents}.py — one-line from plugins_registry.builtins import AgentskillsAdaptor as Adaptor.
- plugins/*/plugin.yaml — declare runtimes: [claude_code, deepagents].
SDK (Python):
- sdk/python/molecule_plugin/ — protocol.py, builtins.py (SDK-vendored AgentskillsAdaptor), manifest.py (spec validator), CLI via __main__.py.
- sdk/python/template/ — cookiecutter skeleton.
Org templates:
- org-templates/molecule-worker-gemini/org.yaml — full parity with molecule-dev (11 workspaces, schedules, Telegram, per-agent prompts, workspace_dir mount on PM, required_env: [GOOGLE_API_KEY]).
- Copied 5 system-prompt.md files from molecule-dev (research-lead, market-analyst, technical-researcher, competitive-intelligence, uiux-designer).
Docs:
- docs/plugins/agentskills-compat.md — two-layer model, spec mapping.
- docs/plugins/sources.md — two-axis source/shape architecture, security model, future resolvers.
- docs/ecosystem-watch.md — Holaboss, Hermes Agent, gstack entries (adjacent projects to track).
- .env.example — PLUGIN_INSTALL_* vars documented.
- PLAN.md — plugin-adaptor landed; deferred items listed.
- CLAUDE.md — new endpoints, env vars, test counts.
Test counts
- Go platform: all packages green under -race.
- Python workspace: 1040 passed, 9 skipped.
- Python SDK: 50 passed.
- Total: 1090 passing.
Coverage on new code:
- workspace-server/internal/plugins/*: 97.4%
- workspace-server/internal/envx/*: 100%
- workspace/plugins_registry/*: 100%
- workspace/skill_loader/*: 100%
- sdk/python/molecule_plugin/*: 100%
5 rounds of code review
Every round addressed by new commits on the branch:
- Round 1 — initial coverage pass.
- Round 2 — memory_filename plumbing through InstallContext; logger in skill_loader; module constants for SKILLS_SUBDIR, SKIP_ROOT_MD, SKILL_NAME_*; SDK↔runtime drift-guard test; frontmatter parser unification.
- Round 3 — fetch timeout + body size cap + staged-dir size cap via new env vars; typed ErrPluginNotFound sentinel replaces string matching; reject both name + source; sort.Strings in Schemes; sync.RWMutex on Registry; -- in git clone; docs clarify the github resolver is public-only.
- Round 4 — ParseSource empty-spec guard; dirSize (cap) → (limit); localNameRE length bound; extract envDuration / envInt64 into internal/envx; LANG=C LC_ALL=C in the git child env for locale-stable error parsing.
- Round 5 — typed httpErr replaces the 5-value tuple (sketched below); resolveAndStage decoupled from *gin.Context via an installRequest struct; drop the unused source param from deliverToContainer; trim whitespace in ParseSource; consolidate 3 test resolver stubs into 1 parameterized fakeResolver + 3 constructors.
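Illustrative sketch of the round-5 httpErr shape — a typed error that carries the HTTP status out of resolveAndStage so the handler maps it without the old 5-value tuple; the exact fields are assumptions:

```go
package handlers

import (
	"errors"
	"net/http"
)

// httpErr carries the HTTP status chosen by resolveAndStage up to the
// handler through normal Go error flow.
type httpErr struct {
	status int
	msg    string
}

func (e *httpErr) Error() string { return e.msg }

// statusFor maps any error to a response status: a typed httpErr keeps
// its status, everything else falls back to 500.
func statusFor(err error) (int, string) {
	var he *httpErr
	if errors.As(err, &he) {
		return he.status, he.msg
	}
	return http.StatusInternalServerError, err.Error()
}
```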
Live E2E confirmed
- GET /plugins/sources → {"schemes":["github","local"]}.
- POST {"name":"molecule-dev"} → installed via local (back-compat).
- POST {"source":"local:// molecule-dev "} → installed (whitespace trimmed).
- POST {"name":"a","source":"local://b"} → 400 "not both".
- POST {"source":"github://"} → 400 "empty spec after 'github'".
- POST {"source":"mystery://x"} → 400 + available_schemes: [...].
- Uninstall + reinstall on PM workspace: CLAUDE.md has the # Plugin: molecule-dev / rule: codebase-conventions.md marker; /configs/skills/review-loop/ present; zero container errors.
- Startup log on platform boot: Plugin install limits: body=65536 bytes timeout=5m0s staged=104857600 bytes.
Branch
feat/agentskills-compliance → PR #62 (open, all CI green, ready to
merge). Use git log --oneline origin/main.. for the commit list —
counting commits inline goes stale fast.
Post-merge session — team coordination, platform hardening, new backlog
After PR #62 landed, the session continued with shipping ecosystem-watch, a gemini-org proof-point attempt, and a PLAN.md refresh coordinated through the agent team. Several platform bugs surfaced; all were filed and tracked.
Shipped
- A2A proxy regression fix — PR #59 had rewritten http://127.0.0.1:<port> → http://ws-<id>:8000 unconditionally, breaking platform-on-host mode. Now gated behind platformInDocker detection (/.dockerenv or MOLECULE_IN_DOCKER=1); see the sketch after this list. workspace-server/internal/handlers/a2a_proxy.go. Commit 4b42913.
- PR #61 — docs/ecosystem-watch.md: Holaboss / Hermes / gstack entries + template + backlog candidates. Merged.
- Cross-references for ecosystem-watch — wired into PLAN.md (new "Ecosystem Awareness" section), the README.md + README.zh-CN.md Documentation Map, and CLAUDE.md (new "Ecosystem Context" section). Agents couldn't discover the doc because it wasn't linked anywhere; PM reported it missing despite it being in its bind mount. Commit 8ae5e73.
- DeepAgents adapter: virtual_mode=False in workspace/adapters/deepagents/adapter.py. Previously read_file / ls / write_file / edit_file operated on an in-memory snapshot that drifted from the bind-mounted /workspace; writes didn't persist across restarts and real files were reported as missing. Commit bc563d1.
- LangGraph recursion limit raised from the 100 default to 500 in workspace/a2a_executor.py. PM fan-out to 6+ reports routinely overran the 100-step ceiling. Still overridable via the LANGGRAPH_RECURSION_LIMIT env var. Commit d892eb4.
- Gemini org model swap gemini-3.1-pro-preview → gemini-2.5-pro in org-templates/molecule-worker-gemini/org.yaml (3.1-pro-preview's 25 req/min couldn't sustain 11-workspace delegation waves). Commit 4b42913.
- Backlog tracking for #64 / #65 added to the PLAN.md Backlog. Commit ba1cc15.
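Hedged sketch of the platformInDocker gate described in the A2A proxy item above (the real helper in a2a_proxy.go may differ in detail):

```go
package handlers

import "os"

// platformInDocker reports whether the platform itself runs inside a
// container; only then is it correct to rewrite 127.0.0.1:<port> into
// the ws-<id>:8000 service address.
func platformInDocker() bool {
	if os.Getenv("MOLECULE_IN_DOCKER") == "1" {
		return true
	}
	_, err := os.Stat("/.dockerenv")
	return err == nil
}
```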
Open PRs (awaiting CEO approval)
- #68 docs/plan-refresh — PLAN.md refresh: correct test counts (Canvas 325→345, Python 990→1,040, +SDK row 50, total 1,811→1,911), promote #66/#67 to the backlog with actual issue content. Coordinated with the molecule-dev team; corrected PM's hallucinated content for #66/#67 before opening.
- #69 chore/team-system-prompts-hardening — harden the PM / Dev Lead / Research Lead system prompts with hard-learned rules from today's coordination incident (15 rules total across 3 roles). Every rule maps to a specific failure we hit today.
New platform issues filed
- #64 — GET /workspaces/:id/delegations returns [] while the agent-side check_delegation_status tool shows 4 delegations. Source-of-truth mismatch. Bug.
- #65 — Per-agent repo-access config in org.yaml. New workspace_access: none | read_only | read_write field + :ro bind-mount for research agents. Eliminates the "PM couriers documents to reports" workaround. Enhancement.
- #66 — claude_sdk_executor.py swallows subprocess stderr on CLI exit ≠ 0. Every failure surfaces the same opaque "Command failed with exit code 1 / Check stderr output for details". High-priority bug; blocked real debugging today.
- #67 — Agent MCP client defaults to http://localhost:8080, which inside a workspace container is the container itself. Inject MOLECULE_URL=${PLATFORM_URL} at provision time. High-priority bug; blocked PM from restarting its own reports.
Gemini org — proof-point attempt, rolled back
Deployed molecule-worker-gemini (11 DeepAgents workspaces), exercised the full delegation tree, and hit three distinct blockers:
- virtual_mode=True made PM report real files as missing (fixed in bc563d1 above).
- LangGraph recursion limit 100 tripped on PM fan-out (fixed in d892eb4 above).
- Repeated retries exhausted the Google AI Studio monthly spending cap for the whole project.
Rolled back to molecule-dev (Claude Code runtime) to finish the PLAN.md refresh task.
Session-state contamination note
After a ProcessError crash on a Claude Code workspace, subsequent
A2A calls to that workspace keep failing identically until the
workspace is restarted — even when the same SDK query run manually
from inside the container succeeds. Root cause likely session
resume state in the executor. Workaround: restart on ProcessError.
Worth formalizing in the executor as an auto-reset on exit_code != 0
once #66 lands and we can see the real stderr.
Rules distilled for the team (now encoded in #69)
- Never commit to main — always a feature branch + PR.
- Verify external refs (issue numbers, PRs, SHAs, file paths) before citing them.
- Inline documents into every sub-delegation — reports don't have the repo mount.
- delegation.status == completed ≠ work was done.
- Pause ~60s after a batch restart before delegating (warm-up race).
- Quote errors verbatim, don't paraphrase.
- Research Lead must always fan out — solo synthesis is a role failure.
#71 fix — initial_prompt marker written up-front
Root cause: main.py previously wrote /workspace/.initial_prompt_done
only AFTER the initial_prompt self-send succeeded. If the prompt crashed
(any ProcessError, network failure, SDK exit), the marker was never
written — the next container boot replayed the same failing prompt and
cascaded into "every message crashes" until an operator intervened.
Observed three times on 2026-04-12 (gemini org + molecule-dev import +
post-restart).
Fix (extracted from main.py into workspace/initial_prompt.py
so it's unit-testable without uvicorn):
- resolve_initial_prompt_marker(config_path) — prefer <config>/... when writable, fall back to /workspace/....
- mark_initial_prompt_attempted(marker_path) — best-effort write, returns True/False so the caller can log a loud warning on I/O failure.
- main.py calls mark_initial_prompt_attempted before scheduling the self-send. The post-send marker write is removed.
Semantic change: the prompt is attempted at most once per fresh boot; if it fails, operators re-send it manually via chat. Trade-off: silent auto-retry-on-restart (which could cascade) is exchanged for a one-time attempt with a loud failure log.
Tests: 5 new unit tests in tests/test_main_initial_prompt.py, 100%
coverage on initial_prompt.py. Live E2E verified all 12 containers
write the marker up-front and no replay occurs on restart. Manual
browser test via canvas chat against Research Lead returned the
expected reply — full round-trip through the UI.
Branch: fix/71-initial-prompt-marker-at-start. Closes #71.
#66 fix — surface Claude SDK subprocess stderr + exit_code
Root cause: claude_sdk_executor.py caught ProcessError but
extracted only str(exc), which for a crashing CLI reads "Command
failed with exit code 1 (exit code: 1) / Error output: Check stderr
output for details". The SDK's ProcessError actually carries
.exit_code and .stderr attributes — we were silently dropping both.
Every CLI crash looked identical and required ad-hoc reproduction
inside the container to diagnose.
Fix: new _format_process_error(exc) helper that extracts
type(exc).__name__, exc.exit_code, and exc.stderr (capped at
_PROCESS_ERROR_STDERR_MAX_CHARS = 4096 to prevent log flooding).
Called in the retry loop (logger.warning) and the terminal error
path (logger.error + logger.exception for the full traceback).
Plain exceptions without SDK attributes fall back to str(exc) —
no crash on missing attrs.
Tests: 5 new unit tests in tests/test_claude_sdk_executor.py
(format with full context / truncation / plain exception / exit-code
only / end-to-end via execute() with caplog). Python pytest 1050 →
1055.
E2E: rebuilt workspace-template:claude-code, restarted an agent,
ran _format_process_error with a real claude_agent_sdk._errors.ProcessError(exit_code=2, stderr='disk full: /tmp') inside the live
container → output shows both exit_code=2 and the stderr verbatim.
Manual browser: canvas chat against Research Lead — reply
BROWSER-OK-66 returned cleanly, full UI round-trip works with the
new log format live.
Branch: fix/66-capture-claude-sdk-stderr. Closes #66.
#75 fix — auto-reset session_id on subprocess-level errors
Root cause: after a ProcessError (or CLIConnectionError), the
executor's self._session_id still points at the dead session. On the
next call, _build_options() passes resume=<stale-id> to the SDK,
which boots a new subprocess that can't resume the prior session state
— and crashes again. Observed as "crashed once → crashes forever" on
2026-04-12 across PM / RL / DL in the coordination runs.
Fix: new _reset_session_after_error(exc) method clears
self._session_id when the exception looks subprocess-level
(ProcessError, CLIConnectionError, has exit_code attribute, or
message contains "exit code"). Rate-limit / capacity errors are left
alone so normal retry preserves conversational continuity. Called in
the retry loop, right after _format_process_error logs the context.
Tests: 5 new tests in tests/test_claude_sdk_executor.py — clears
on ProcessError / preserves on rate-limit / no-op when session_id is
already None / triggers on "exit code" message only / end-to-end via
execute() with caplog + spy-on-_build_options asserting that the
second retry attempt sees session_id=None rather than the stale ID.
Python pytest 1055 → 1060.
E2E: verified in live container — _reset_session_after_error
clears a stale session on ProcessError, preserves it on rate-limit.
Manual browser: canvas chat round-trip on Research Lead — message went through and agent responded normally. Zero ProcessError indicators.
Branch: fix/75-session-reset-on-process-error. Closes #75.
Top-5 #1 — Memory FTS + namespace scoping
Backend proposal from the ecosystem-research outcomes doc, the highest-convergence team ask (BE + FE + QA + UX all independently proposed some flavour of this).
Migration 017_memories_fts_namespace.up.sql:
- agent_memories.namespace VARCHAR(50) NOT NULL DEFAULT 'general'
- agent_memories.content_tsv tsvector (STORED generated column from to_tsvector('english', content))
- idx_memories_fts (GIN on content_tsv)
- idx_memories_ns (composite on workspace_id, namespace)
Handler workspace-server/internal/handlers/memories.go:
- POST /workspaces/:id/memories accepts an optional namespace (default "general", 50-char max validated at the handler).
- GET /workspaces/:id/memories?q=... routes multi-char queries through content_tsv @@ plainto_tsquery('english', ?) with ts_rank ordering; single-char queries fall back to ILIKE (tsvector can't tokenise single chars in the 'english' config). Query routing sketched after this list.
- GET /workspaces/:id/memories?namespace=... filters regardless of scope.
- Response always includes the namespace field.
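Hedged sketch of the query routing described above — table and column names come from migration 017, the handler wiring is simplified:

```go
package handlers

import (
	"database/sql"
	"unicode/utf8"
)

// searchMemories routes multi-character queries through the tsvector
// index and falls back to ILIKE for single characters, which the
// 'english' text-search config cannot tokenise.
func searchMemories(db *sql.DB, workspaceID, q string) (*sql.Rows, error) {
	if utf8.RuneCountInString(q) > 1 {
		return db.Query(`
			SELECT id, namespace, content
			FROM agent_memories
			WHERE workspace_id = $1
			  AND content_tsv @@ plainto_tsquery('english', $2)
			ORDER BY ts_rank(content_tsv, plainto_tsquery('english', $2)) DESC`,
			workspaceID, q)
	}
	return db.Query(`
		SELECT id, namespace, content
		FROM agent_memories
		WHERE workspace_id = $1 AND content ILIKE '%' || $2 || '%'`,
		workspaceID, q)
}
```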
Tests: 5 existing tests updated for the new column list; 4 new tests added (commit-with-namespace, namespace-too-long, FTS path, ILIKE fallback, namespace filter). Handler test suite passes.
E2E (live Postgres + running platform):
- Platform restart applied migration 017 → column + indexes present.
- POST with / without namespace → both work, the default kicks in.
- ?q=zinc+theme → FTS returns the reference memory.
- ?namespace=procedures → scoped retrieval works.
- ?q=restart&namespace=procedures → combined filter works.
Branch: feat/memory-fts-namespace.
Top-5 #5 — Fail-secure encryption at boot
Security Auditor's top proposal from the outcomes doc. The platform
previously booted without SECRETS_ENCRYPTION_KEY and silently stored
workspace secrets in plaintext with only a WARNING log. OWASP A02:2021
(Cryptographic Failures) / STRIDE "Information Disclosure".
Fix (workspace-server/internal/crypto/aes.go):
- New InitStrict() error variant that returns ErrEncryptionKeyMissing when MOLECULE_ENV=prod/production and the key is unset, malformed, or the wrong length (see the sketch below). Existing Init() retained for any callers that prefer the warn-and-continue behaviour; only cmd/server/main.go switched to the strict variant.
- isProdEnv() accepts prod, production — case-insensitive + trimmed.
- loadKeyFromEnv refactor: one helper returns the parse error so both entry points can format it the same way.
cmd/server/main.go: crypto.InitStrict() + log.Fatalf on error.
Local dev (no MOLECULE_ENV) keeps the existing warn-and-continue.
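A minimal sketch of the fail-secure boot path, assuming simplified signatures (the real code also base64-decodes the key, checks its length, and shares loadKeyFromEnv with Init):

```go
package crypto

import (
	"errors"
	"os"
	"strings"
)

var ErrEncryptionKeyMissing = errors.New("SECRETS_ENCRYPTION_KEY required when MOLECULE_ENV is prod")

func isProdEnv() bool {
	env := strings.ToLower(strings.TrimSpace(os.Getenv("MOLECULE_ENV")))
	return env == "prod" || env == "production"
}

// InitStrict refuses to boot in prod without a usable key; Init (not shown)
// keeps the old warn-and-continue behaviour for non-strict callers.
func InitStrict() error {
	key := os.Getenv("SECRETS_ENCRYPTION_KEY")
	if key == "" {
		if isProdEnv() {
			return ErrEncryptionKeyMissing
		}
		return nil // dev/staging: encryption stays disabled, warning logged elsewhere
	}
	// Real code: decode base64, validate the 32-byte length, build the AES-GCM cipher.
	return nil
}
```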
Tests: 6 new tests in internal/crypto/aes_test.go:
- fails in prod when key is missing
- fails in prod on wrong-length key
- succeeds in prod with valid key
- allows dev mode without key (ergonomics)
- allows staging without key (non-prod)
- isProdEnv case-insensitivity table
E2E: /tmp/platform-failsec binary run with MOLECULE_ENV=prod +
empty key → log.Fatalf triggers, platform refuses to start. Same
binary with MOLECULE_ENV=prod + valid base64 key → boots, prints
"AES-256-GCM enabled", serves 200 on /health.
Branch: fix/top5-5-fail-secure-encryption.
#85 fix — encryption_version column + DecryptVersioned
Root cause (from the investigation): rows in workspace_secrets /
global_secrets are tagged as encrypted_value bytea but whether
they're actually encrypted depends entirely on whether
SECRETS_ENCRYPTION_KEY was set at the moment of Encrypt —
crypto.Encrypt short-circuits and returns plaintext bytes when
encryption is disabled. Switching on the key later makes
crypto.Decrypt try GCM on plaintext bytes → fails → provisioner
silently skips the row → container crashes on missing OAuth token.
With PR #83 (fail-secure) pushing operators toward setting the key, this trap was about to start biting real installs.
Fix:
- Migration 018_secrets_encryption_version adds encryption_version INT NOT NULL DEFAULT 0 to both secret tables. All existing rows become version=0 (plaintext). Additive, safe.
- crypto/aes.go: EncryptionVersionPlaintext = 0, EncryptionVersionAESGCM = 1 constants. CurrentEncryptionVersion() — tells callers which tag to write. DecryptVersioned(value, version) — dispatches on the tag; v=0 passes through, v=1 runs GCM (and errors if IsEnabled() is false); unknown version → clear error. Dispatch sketched after this list.
- Existing Decrypt deprecated-in-comment but kept for callers that haven't migrated (backward compat during the transition).
- handlers/workspace_provision.go: SELECT now pulls encryption_version; decrypt uses DecryptVersioned; on failure, provisioning aborts with a loud FATAL log and the workspace is marked failed (#66-style silent failure removed).
- handlers/secrets.go: both Set and the global SetGlobalSecret persist encryption_version = CurrentEncryptionVersion() on INSERT. ON CONFLICT also updates the version — re-setting a historical plaintext row while a key is active upgrades it to GCM in place.
- handlers/secrets.go::GetModel: SELECT pulls the version, uses DecryptVersioned.
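Sketch of the version dispatch, with the package's existing IsEnabled/Decrypt helpers stubbed so the fragment stands alone (real signatures may differ):

```go
package crypto

import "fmt"

const (
	EncryptionVersionPlaintext = 0
	EncryptionVersionAESGCM    = 1
)

// Stand-ins for the existing package functions, shown only so this sketch compiles.
func IsEnabled() bool                  { return false }
func Decrypt(b []byte) ([]byte, error) { return b, nil }

// DecryptVersioned trusts the stored tag instead of guessing from the bytes:
// version-0 rows pass through untouched, version-1 rows require the key.
func DecryptVersioned(value []byte, version int) ([]byte, error) {
	switch version {
	case EncryptionVersionPlaintext:
		return value, nil
	case EncryptionVersionAESGCM:
		if !IsEnabled() {
			return nil, fmt.Errorf("row is AES-GCM encrypted but SECRETS_ENCRYPTION_KEY is not set")
		}
		return Decrypt(value) // existing GCM path
	default:
		return nil, fmt.Errorf("unknown encryption_version %d", version)
	}
}
```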
Tests: 6 new crypto tests (plaintext pass-through, GCM round-trip,
GCM requires key, unknown version rejected, CurrentEncryptionVersion
tracks key state, the exact #85 scenario end-to-end). 6 existing
secret handler tests updated for the 4-arg INSERT. Full Go test suite
passes.
E2E (live):
- Migration applied automatically on platform boot: encryption_version column present on both tables.
- 102 pre-existing plaintext rows correctly tagged version=0.
- New TEST_NEW_SECRET_85 stored as 39 bytes (11 plaintext + 12 nonce + 16 tag = ✓) with version=1.
- PM container restart succeeds — both CLAUDE_CODE_OAUTH_TOKEN (v=0 historical plaintext) AND TEST_NEW_SECRET_85 (v=1 encrypted) are decrypted correctly and injected into the container env.
Branch: fix/85-encryption-version-migration. Closes #85.
#67 fix — inject MOLECULE_URL at workspace provision time
Root cause: Agents calling mcp__molecule__* tools from inside a
workspace container were hitting localhost:8080 (container's own
localhost, not the host). The MCP client
(mcp-server/src/index.ts) defaulted to MOLECULE_URL || "http://localhost:8080" and the provisioner only injected
PLATFORM_URL, never MOLECULE_URL.
Fix (two-sided, belt-and-suspenders):
- workspace-server/internal/provisioner/provisioner.go — extracted env building into a pure buildContainerEnv(cfg WorkspaceConfig) []string so it's unit-testable. Now injects MOLECULE_URL=<PlatformURL> alongside PLATFORM_URL (sketched below).
- mcp-server/src/index.ts — the client now prefers MOLECULE_URL, falls back to PLATFORM_URL, then localhost:8080. Protects older containers that don't yet have MOLECULE_URL.
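Hedged sketch of the Go side — the exact WorkspaceConfig fields are assumptions; the point is the pure function injecting both names with the same value:

```go
package provisioner

import "fmt"

// WorkspaceConfig is trimmed to the fields relevant here.
type WorkspaceConfig struct {
	PlatformURL string
	ExtraEnv    []string
}

// buildContainerEnv is a pure function so the MOLECULE_URL/PLATFORM_URL
// pairing can be unit-tested without touching Docker.
func buildContainerEnv(cfg WorkspaceConfig) []string {
	env := []string{
		fmt.Sprintf("PLATFORM_URL=%s", cfg.PlatformURL),
		// #67: same value under the name the MCP client actually reads.
		fmt.Sprintf("MOLECULE_URL=%s", cfg.PlatformURL),
	}
	return append(env, cfg.ExtraEnv...)
}
```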
Tests: 4 new Go tests (buildContainerEnv injects both env vars,
MOLECULE_URL always matches PLATFORM_URL across URL shapes, awareness
both-or-nothing, custom envs append). Full provisioner suite green.
88 existing MCP tests still pass (fallback chain preserves existing
behaviour).
E2E verified live: rebuilt platform, restarted PM, docker exec env shows both PLATFORM_URL=http://host.docker.internal:8080 and
MOLECULE_URL=http://host.docker.internal:8080 on the recreated
container.
Side-discovery (filed as #85): enabling SECRETS_ENCRYPTION_KEY
on an install with pre-existing plaintext secrets silently breaks
every secret — crypto.Decrypt runs GCM on plaintext bytes → fails
→ log.Printf + continue → row dropped → workspace crashes on
preflight. Proposed fix: encryption_version column + boot-time
re-encryption migration + fail-loud on decrypt mismatch.
Branch: fix/67-inject-molecule-url.
#73 fix — close three real delete-race windows
Observed symptom (corrected): During the session's bulk-delete runs,
PM / Research Lead / Dev Lead consistently survived as "stragglers."
Turned out the cause wasn't a race — it was the DELETE /workspaces/:id
endpoint returning HTTP 200 with {"status":"confirmation_required"}
when the workspace has children and ?confirm=true is not set. The
bulk-delete script read HTTP 200 as success and moved on.
What the #73 fix actually closes: three real but distinct race
windows that would bite in production even with correct ?confirm=true
usage:
- handlers/registry.go::Register — ON CONFLICT DO UPDATE SET status='online' ran unconditionally; a late heartbeat from a workspace that was just soft-deleted (status='removed') could resurrect the row. Guard added: WHERE workspaces.status IS DISTINCT FROM 'removed' (sketched after this list).
- handlers/registry.go::Heartbeat — the same UPDATE path had no filter; late heartbeats refreshed last_heartbeat_at on tombstoned rows (confusing liveness). Guard: AND status != 'removed'. Plus the evaluateStatus recovery path made conditional in-SQL (AND status = 'offline').
- handlers/workspace.go::Delete — the sequence was stop container → UPDATE status='removed'. Between those calls, Redis TTL expiry could trigger the liveness monitor, which called RestartByID, recreating the container. New order: UPDATE status='removed' FIRST (for self + descendants as a single batch), THEN stop containers + remove volumes. Auto-restart paths now see status='removed' immediately and bail out via their existing NOT IN ('removed', ...) guards.
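Illustrative version of the Register guard (column list trimmed; the real upsert carries more fields):

```go
package handlers

// The point is the WHERE clause on the conflict branch: a late Register
// (or heartbeat) from a workspace that was just soft-deleted can no
// longer flip its row back to 'online', because tombstoned rows are
// excluded from the update.
const registerUpsert = `
INSERT INTO workspaces (id, name, status)
VALUES ($1, $2, 'online')
ON CONFLICT (id) DO UPDATE SET status = 'online'
WHERE workspaces.status IS DISTINCT FROM 'removed'`
```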
Tests: 2 new registry tests pinning the SQL guards (substring
match on the emitted UPDATE); 2 existing delete tests updated for
the new order (single batch UPDATE covering self+descendants).
Full go test ./... -race green.
Live E2E: bulk delete of 12 workspaces with ?confirm=true
→ all cleanly removed, zero stragglers, no pending provisions.
Separate issue filed: API DX — DELETE should return 4xx (e.g. 409 Conflict) when confirmation is required, not 200. Misleading status code made the session's symptom diagnosis wrong for hours.
Branch: fix/73-delete-workspace-race.
#88 fix — DELETE returns 409 Conflict when confirmation required
Observed during #73: bulk-delete scripts that read HTTP 200 as success silently skipped every parent workspace, leaving tier-3 / parent nodes behind and looking like a platform race bug.
Fix: one-line change in handlers/workspace.go::Delete — return
http.StatusConflict (409) instead of http.StatusOK (200) when
children exist and ?confirm=true isn't set. Response body shape
unchanged (canvas UI + MCP server both parse the JSON body, not the
status code).
No regressions: canvas (DetailsTab.tsx:75) and MCP server
(mcp-server/src/index.ts:80) already pass ?confirm=true on every
delete. The 409 only affects manual API users + bulk scripts that
forgot — exactly the cohort that was silently failing.
Tests: 1 existing delete test updated to expect 409. Full
go test ./... green.
Live E2E: real platform, real parent+child workspaces —
DELETE /workspaces/:id (no confirm) returns http=409 with the
expected JSON body; DELETE /workspaces/:id?confirm=true still
returns 200.
Branch: fix/88-delete-confirm-409. Closes #88.
#74 fix — retry delegation once after reactive URL refresh
Clarification of the original issue: The delegation worker
(handlers/delegation.go::executeDelegation) already calls the shared
h.workspace.proxyA2ARequest(...) path — so it DOES benefit from the
A2A proxy's reactive health-check / URL-refresh on connection errors.
The real gap is that the reactive refresh runs after the current
request fails; the caller still gets an error for that specific
delegation attempt. During bulk restarts (observed 21:40 today), PM's
delegation worker fired during the warm-up window, hit a stale URL,
and the single-attempt logic marked the delegation failed.
Fix: add a single retry with an 8-second pause when
proxyA2ARequest returns a transient-looking error. The pause is
long enough for the reactive refresh + container restart to land a
fresh URL in the cache. isTransientProxyError classifies which
statuses retry:
- 502 Bad Gateway (plain connection failure) — retry
- 503 Service Unavailable (reactive check decided to restart the container) — retry
- 404 / 403 / 400 / 500 — static, don't waste the retry window
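Hedged sketch of the classifier plus the single-retry shape described above (helper names from this section; the wrapper signature is an assumption):

```go
package handlers

import (
	"net/http"
	"time"
)

// isTransientProxyError marks only the statuses where a fresh URL is
// likely to appear within the warm-up window.
func isTransientProxyError(status int) bool {
	return status == http.StatusBadGateway || status == http.StatusServiceUnavailable
}

// proxyWithOneRetry retries exactly once after an 8-second pause, long
// enough for the reactive health check + container restart to land a
// fresh URL in the cache before the second (and last) attempt.
func proxyWithOneRetry(send func() (status int, err error)) (int, error) {
	status, err := send()
	if !isTransientProxyError(status) {
		return status, err
	}
	time.Sleep(8 * time.Second)
	return send()
}
```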
Tests: 7 new cases on the classifier matrix + a regression
guard on the 8-second window. Full go test ./... -race green.
Branch: fix/74-delegation-via-a2a-proxy. Closes #74.
100% platform coverage — MCP + molecli
Full parity pass so every platform endpoint is reachable from both client layers.
MCP server (mcp-server/src/index.ts): 61 → 83 tools
+22 new handlers added in a single coverage-completion block at the bottom of the file:
- Delegations (#64): record_delegation, update_delegation_status
- Activity: report_activity, notify_user
- Canvas viewport: get_canvas_viewport, set_canvas_viewport
- Channels (platform-level): discover_channel_chats
- Plugins: list_plugin_sources, list_available_plugins, check_plugin_compatibility
- Schedules (cron): list_schedules, create_schedule, update_schedule, delete_schedule, run_schedule, get_schedule_history
- Session + shared context: session_search, get_shared_context
- K/V memory (distinct from HMA): memory_set, memory_get, memory_list, memory_delete_kv
Updated schemas: create_workspace + update_workspace now
accept workspace_access (none / read_only / read_write) + explicit
runtime / workspace_dir params.
All 88 existing MCP tests still pass; npm run build green.
molecli CLI (workspace-server/cmd/cli/): 9 → 21 top-level commands
Two new files:
- cmd_api.go — molecli api <METHOD> <PATH> [json-body] raw escape hatch. Hits any endpoint without a typed wrapper (sketched below).
- cmd_ops.go — typed subcommands (thin wrappers over the shared callAPI helper) for operator ergonomics:
  - ws restart|pause|resume — lifecycle ops
  - plugin registry|sources|list|available|install|uninstall
  - secret list|set|delete|list-global|set-global|delete-global
  - schedule list|add|remove|run|history
  - channel adapters|list|remove|send|test
  - approval pending|list|decide
  - delegation list|create
  - bundle export|import
  - org templates|import
  - traces <workspace-id>
  - activity list <workspace-id>
  - hma commit|search
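Hedged sketch of the cmd_api.go escape hatch; callAPI is the shared helper named above, stubbed here so the fragment stands alone:

```go
package main

import (
	"fmt"
	"os"
)

// Stand-in for the shared helper the typed subcommands wrap; the real
// one performs the authenticated HTTP call against the platform.
func callAPI(method, path, body string) (string, error) { return "", nil }

// cmdAPI implements `molecli api <METHOD> <PATH> [json-body]` — a raw
// escape hatch so any endpoint is reachable before it grows a typed wrapper.
func cmdAPI(args []string) error {
	if len(args) < 2 {
		return fmt.Errorf("usage: molecli api <METHOD> <PATH> [json-body]")
	}
	body := ""
	if len(args) > 2 {
		body = args[2]
	}
	out, err := callAPI(args[0], args[1], body)
	if err != nil {
		return err
	}
	fmt.Fprintln(os.Stdout, out)
	return nil
}
```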
go test ./cmd/cli/ passes; live smoke-test against running
platform: api GET /health, plugin sources, org templates,
ws restart <bad-id> all return expected responses.
Branch: feat/mcp-molecli-full-coverage.
#65 fix — per-agent workspace_access in org.yaml + API
Design from the ecosystem-research outcomes doc: new
workspace_access: none | read_only | read_write field on every
workspace, enforced at container provision time via Docker's native
:ro bind-mount flag. Eliminates the "PM couriers documents to
reports" workaround by letting research agents have read-only repo
access without the write risk.
Changes:
- Migration 019 — adds workspace_access VARCHAR(20) NOT NULL DEFAULT 'none' with a CHECK constraint. Additive; all existing rows become 'none' (current isolated-volume behaviour preserved).
- provisioner.go:
  - New WorkspaceAccess field on WorkspaceConfig.
  - Constants WorkspaceAccessNone / ReadOnly / ReadWrite.
  - buildWorkspaceMount(cfg) — pure helper, selects between named volume, rw bind, and :ro bind based on access + workspace_path (sketched after this list).
  - ValidateWorkspaceAccess(access, path) — rejects read_* without a path and unknown values.
- handlers/workspace.go::Create and handlers/org.go::createOrgWorkspace — validate + persist workspace_access on INSERT. Response body echoes the stored value.
- handlers/workspace_provision.go::buildProvisionerConfig — reads workspace_access from the DB (with payload override) and forwards it to the provisioner. Restart paths preserve the mode.
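Hedged sketch of the provisioner helpers named above (signatures simplified — the real buildWorkspaceMount takes the full WorkspaceConfig):

```go
package provisioner

import "fmt"

const (
	WorkspaceAccessNone      = "none"
	WorkspaceAccessReadOnly  = "read_only"
	WorkspaceAccessReadWrite = "read_write"
)

// ValidateWorkspaceAccess rejects read_* without a host path and unknown values.
func ValidateWorkspaceAccess(access, path string) error {
	switch access {
	case WorkspaceAccessNone:
		return nil
	case WorkspaceAccessReadOnly, WorkspaceAccessReadWrite:
		if path == "" {
			return fmt.Errorf("workspace_access=%s requires workspace_dir", access)
		}
		return nil
	default:
		return fmt.Errorf("unknown workspace_access %q", access)
	}
}

// buildWorkspaceMount picks the Docker mount string: isolated named
// volume, read-write bind, or :ro bind. Names are illustrative.
func buildWorkspaceMount(access, hostPath, volumeName string) string {
	switch access {
	case WorkspaceAccessReadOnly:
		return fmt.Sprintf("%s:/workspace:ro", hostPath)
	case WorkspaceAccessReadWrite:
		return fmt.Sprintf("%s:/workspace", hostPath)
	default: // none → isolated named volume, the pre-#65 behaviour
		return fmt.Sprintf("%s:/workspace", volumeName)
	}
}
```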
Tests:
- Provisioner: 2 new table-driven tests — TestBuildWorkspaceMount_SelectionMatrix (6 cases covering the full access × path matrix) and TestValidateWorkspaceAccess (7 cases).
- Handler INSERT WithArgs updated across 5 existing tests for the new 9th column.
- Full go test ./... -race green.
Live E2E:
- Migration auto-applied → the workspaces table has workspace_access with the CHECK constraint.
- POST /workspaces {"workspace_access":"read_only","workspace_dir":"/repo"} → 201 with "workspace_access":"read_only" echoed; DB row correct.
- POST {"workspace_access":"read_only"} (no workspace_dir) → 400 with a clear error.
- POST {"workspace_access":"wildcard"} → 400 with the allowed-values list.
- Container inspected after provision: the /workspace mount has RW=false Mode=ro; touch /workspace/foo from inside returns Read-only file system → enforcement is real.
Branch: feat/65-workspace-access-yaml. Closes #65.
#64 fix — agent registers delegations with platform (Option A)
Root cause (confirmed in comment on #64): check_delegation_status
reads from the agent's local _delegations dict; platform's
GET /workspaces/:id/delegations reads from activity_logs. The
agent's delegate_to_workspace MCP tool sends A2A directly and
never touches activity_logs — so the platform's view was always empty
for agent-initiated delegations.
Fix (minimal Option A, dual-write):
- Platform: two new endpoints on DelegationHandler (a handler sketch follows the Result note below):
  - POST /workspaces/:id/delegations/record — inserts a single activity_logs row with method='delegate', status='dispatched'. No A2A fired (the agent does that directly for OTEL/retry reasons).
  - POST /workspaces/:id/delegations/:delegation_id/update — accepts status ∈ {completed, failed} + optional error + preview. UPDATEs the original row and (on completion) INSERTs a delegate_result row matching the canvas-path flow.
- Agent (workspace/builtin_tools/delegation.py):
  - New best-effort async helpers _record_delegation_on_platform and _update_delegation_on_platform. Failures are logged at debug level and swallowed — they never block the actual A2A delegation path.
  - _execute_delegation calls _record_... at task start and _update_... on completion / failure (alongside the existing _notify_completion).
Result: agent keeps direct A2A for speed + OTEL trace-context
propagation + existing retry logic; platform's activity_logs mirrors
the same set the agent's local dict holds. GET /delegations now
returns rows for agent-initiated delegations.
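For illustration, a hedged sketch of the record endpoint — the activity_logs column list and the gin/uuid wiring are assumptions, not the exact handler:

```go
package handlers

import (
	"database/sql"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
)

// RecordDelegation mirrors the agent's direct A2A call into activity_logs
// so GET /workspaces/:id/delegations also sees agent-initiated delegations.
func RecordDelegation(db *sql.DB) gin.HandlerFunc {
	return func(c *gin.Context) {
		var req struct {
			TargetWorkspaceID string `json:"target_workspace_id"`
			Message           string `json:"message"`
		}
		if err := c.ShouldBindJSON(&req); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		if _, err := uuid.Parse(req.TargetWorkspaceID); err != nil {
			c.JSON(http.StatusBadRequest, gin.H{"error": "target_workspace_id must be a UUID"})
			return
		}
		_, err := db.Exec(
			`INSERT INTO activity_logs (workspace_id, target_workspace_id, method, status, message)
			 VALUES ($1, $2, 'delegate', 'dispatched', $3)`,
			c.Param("id"), req.TargetWorkspaceID, req.Message)
		if err != nil {
			c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
			return
		}
		c.JSON(http.StatusCreated, gin.H{"status": "dispatched"})
	}
}
```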
Tests: 5 new Go tests (Record inserts + rejects invalid UUID, UpdateStatus completed inserts result row + rejects unknown status + failed broadcast). 4 new Python tests (record fires HTTP POST, best-effort on platform error, update completed, update truncates large preview to 500 chars). Python pytest 1060 → 1064; full Go suite green.
Branch: fix/64-agent-delegate-via-platform. Closes #64.
SDK — workspace / org / channel validators
Issue: SDK only validated plugins. Authors publishing
workspace-configs-templates, org-templates, or channel configs had no
lint step — errors only surfaced at POST /org/import or container
startup.
Fix: extended sdk/python/molecule_plugin/ with three new modules:
- workspace.py — validates config.yaml (name, runtime, tier, runtime_config shape). SUPPORTED_RUNTIMES kept in sync with provisioner.RuntimeImages.
- org.py — recursively validates org.yaml (name, workspaces tree, workspace_access + workspace_dir pairing per #65, channels via the delegated validate_channel_config, schedules, plugins, external + url, children).
- channel.py — validates channel configs (standalone dict or YAML file). SUPPORTED_CHANNEL_TYPES currently {telegram}; extend when Slack/Discord adapters land.
CLI (python -m molecule_plugin validate {plugin|workspace|org|channel} <path>)
dispatches to the right validator; bare validate <path> still defaults
to plugin for back-compat. Exit 0 on valid, 1 on any error.
validate_channel_config is the single source of truth for channel
schema — org.py delegates to it rather than duplicating checks.
Tests: sdk/python/tests/test_validators.py — 37 new tests (happy,
missing file, bad YAML, non-object, each field error, null-safety on
runtime_config: None / defaults: null, CLI dispatch for all 4 kinds,
back-compat form). Fixed bug found during test authoring: org.py crashed
on non-dict children; now guarded with isinstance check.
Live smoke: all 4 in-repo org templates (free-beats-all,
reno-stars, molecule-dev, molecule-worker-gemini) validate clean.
SDK pytest: 50 → 87. Branch: feat/sdk-workspace-org-channel.
Top-5 #3 — parallel adapter builds
DevOps proposal from the ecosystem-research outcomes doc. All six
adapter Dockerfiles FROM workspace-template:base with no
inter-adapter dependency, so they're safe to build concurrently once
the base is done.
Change (workspace/build-all.sh):
- Serial path kept for single-runtime rebuilds and SERIAL_BUILD=1 CI environments (preserves the bounded-concurrency option).
- Parallel path: fan out one docker build per adapter, capture stdout/stderr to /tmp/build_<tag>.log, wait for all, tally per-tag success/failure. Failures still exit non-zero.
E2E: bash build-all.sh claude-code deepagents langgraph
finished in 43s wall-clock (three adapter builds running
concurrently). Previously ~120s serial. Log files live under
/tmp/build_*.log for post-hoc debugging.
Branch: feat/top5-3-parallel-adapter-builds.