
2026-04-12

Summary

Shipped the full two-axis plugin architecture on feat/agentskills-compliance (PR #62). Plugin source (where files come from) and plugin shape (what's inside them) are now independent, pluggable axes.

  • Source axis — workspace-server/internal/plugins/ package: SourceResolver interface, Registry, LocalResolver, GithubResolver, ParseSource (sketched below). POST /workspaces/:id/plugins accepts {name} (back-compat → local) or {source: "scheme://spec"}. New GET /plugins/sources enumerates registered schemes.
  • Shape axis — workspace/plugins_registry/ package: PluginAdaptor protocol, hybrid resolver (registry > plugin-shipped > raw-drop), AgentskillsAdaptor built-in for agentskills.io-format skills + Molecule AI's rules extension. Named sub-type adapters planned for MCP, DeepAgents sub-agents, LangGraph sub-graphs, etc.
  • agentskills.io compliance — every first-party skill passes the open standard; python -m molecule_plugin validate CLI enforces it in CI. Our skills are now installable in ~35 other agent tools (Cursor, Codex, Copilot, Gemini CLI, etc.).
  • Gemini org parity — molecule-worker-gemini mirrors molecule-dev (11 workspaces, Research + Dev branches, schedules, Telegram channel, per-agent prompts) as the E2E proof point.
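
The Go surface of the source axis is small. As a rough sketch — only the names (SourceResolver, Registry, Schemes, ParseSource) come from this log; every signature and body below is an assumption:

```go
package plugins

import (
	"context"
	"fmt"
	"sort"
	"strings"
	"sync"
)

// SourceResolver turns a source spec ("molecule-dev", "owner/repo") into a
// staged directory on disk that the shape axis can then adapt.
type SourceResolver interface {
	Resolve(ctx context.Context, spec string) (stagedDir string, err error)
}

// Registry maps URI schemes ("local", "github") to resolvers.
type Registry struct {
	mu        sync.RWMutex
	resolvers map[string]SourceResolver
}

func (r *Registry) Register(scheme string, res SourceResolver) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.resolvers == nil {
		r.resolvers = make(map[string]SourceResolver)
	}
	r.resolvers[scheme] = res
}

// Schemes returns registered schemes, sorted for stable API output.
func (r *Registry) Schemes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.resolvers))
	for s := range r.resolvers {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

// ParseSource splits "scheme://spec", trimming whitespace and rejecting an
// empty spec — both behaviours called out in the review rounds below.
func ParseSource(source string) (scheme, spec string, err error) {
	scheme, spec, ok := strings.Cut(strings.TrimSpace(source), "://")
	if !ok || scheme == "" {
		return "", "", fmt.Errorf("source must look like scheme://spec")
	}
	if spec = strings.TrimSpace(spec); spec == "" {
		return "", "", fmt.Errorf("empty spec after '%s'", scheme)
	}
	return scheme, spec, nil
}
```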

Files touched

Platform (Go):

  • workspace-server/internal/plugins/{source,local,github}.go + tests — source layer, 97.4% coverage.
  • workspace-server/internal/envx/envx.go + test — env-var helpers, 100% coverage.
  • workspace-server/internal/handlers/plugins.go — install pipeline refactored into resolveAndStage + deliverToContainer; typed httpErr for status propagation; sort.Strings in Registry.Schemes; logInstallLimitsOnce on startup.
  • workspace-server/internal/router/router.go — new routes (/plugins/sources, /workspaces/:id/plugins/available, /workspaces/:id/plugins/compatibility).
  • workspace-server/Dockerfile — apk add git for the github resolver.

Workspace runtime (Python):

  • workspace/plugins_registry/ — new module: protocol.py, builtins.py (AgentskillsAdaptor), raw_drop.py, and the hybrid resolver.
  • workspace/skill_loader/ — renamed from skills/; reads scripts/ per the agentskills.io spec.
  • workspace/builtin_tools/ — renamed from tools/ to disambiguate from user-plugin tool dirs.
  • workspace/adapters/base.py — added hooks: memory_filename, register_tool_hook, register_subagent_hook, append_to_memory_hook, install_plugins_via_registry. Default inject_plugins() drives the new pipeline.
  • workspace/adapters/claude_code/adapter.py — deleted the 40-line inject_plugins() override.
  • workspace/adapters/deepagents/Dockerfile — ships plugins_registry/.
  • workspace/plugins.py — new PluginManifest.runtimes field.

Plugins (content):

  • plugins/*/adapters/{claude_code,deepagents}.py — one-line from plugins_registry.builtins import AgentskillsAdaptor as Adaptor.
  • plugins/*/plugin.yaml — declare runtimes: [claude_code, deepagents].

SDK (Python):

  • sdk/python/molecule_plugin/protocol.py, builtins.py (SDK-vendored AgentskillsAdaptor), manifest.py (spec validator), CLI via __main__.py.
  • sdk/python/template/ — cookiecutter skeleton.

Org templates:

  • org-templates/molecule-worker-gemini/org.yaml — full parity with molecule-dev (11 workspaces, schedules, Telegram, per-agent prompts, workspace_dir mount on PM, required_env: [GOOGLE_API_KEY]).
  • Copied 5 system-prompt.md files from molecule-dev (research-lead, market-analyst, technical-researcher, competitive-intelligence, uiux-designer).

Docs:

  • docs/plugins/agentskills-compat.md — two-layer model, spec mapping.
  • docs/plugins/sources.md — two-axis source/shape architecture, security model, future resolvers.
  • docs/ecosystem-watch.md — Holaboss, Hermes Agent, gstack entries (adjacent projects to track).
  • .env.example — PLUGIN_INSTALL_* vars documented.
  • PLAN.md — plugin-adaptor landed; deferred items listed.
  • CLAUDE.md — new endpoints, env vars, test counts.

Test counts

  • Go platform: all packages green under -race.
  • Python workspace: 1040 passed, 9 skipped.
  • Python SDK: 50 passed.
  • Total: 1090 passing.

Coverage on new code:

  • workspace-server/internal/plugins/*: 97.4%
  • workspace-server/internal/envx/*: 100%
  • workspace/plugins_registry/*: 100%
  • workspace/skill_loader/*: 100%
  • sdk/python/molecule_plugin/*: 100%

5 rounds of code review

Every round addressed by new commits on the branch:

  1. Round 1 — initial coverage pass.
  2. Round 2 — memory_filename plumbing through InstallContext; logger in skill_loader; module constants for SKILLS_SUBDIR, SKIP_ROOT_MD, SKILL_NAME_*; SDK↔runtime drift-guard test; frontmatter parser unification.
  3. Round 3 — fetch timeout + body size cap + staged-dir size cap via new env vars; typed ErrPluginNotFound sentinel replaces string matching; reject requests that set both name and source; sort.Strings in Schemes; sync.RWMutex on Registry; a -- argument separator in git clone; docs clarify the github resolver is public-only.
  4. Round 4 — ParseSource empty-spec guard; dirSize(cap) → dirSize(limit) parameter rename; localNameRE length bound; extract envDuration/envInt64 into internal/envx (see the sketch after this list); LANG=C LC_ALL=C in git child env for locale-stable error parsing.
  5. Round 5 — typed httpErr replaces 5-value tuple; resolveAndStage decoupled from *gin.Context via installRequest struct; drop unused source param from deliverToContainer; trim whitespace in ParseSource; consolidate 3 test resolver stubs into 1 parameterized fakeResolver + 3 constructors.
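
A plausible shape for the extracted internal/envx helpers — the exported names and the fall-back-on-parse-error behaviour here are assumptions, not confirmed API:

```go
package envx

import (
	"os"
	"strconv"
	"time"
)

// Duration reads name as a time.Duration, returning def when the variable
// is unset or unparseable.
func Duration(name string, def time.Duration) time.Duration {
	v := os.Getenv(name)
	if v == "" {
		return def
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return def
	}
	return d
}

// Int64 reads name as an int64 with the same fallback rule.
func Int64(name string, def int64) int64 {
	v := os.Getenv(name)
	if v == "" {
		return def
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		return def
	}
	return n
}
```

This is consistent with the PLUGIN_INSTALL_* env vars driving the startup log quoted below (body=65536 bytes timeout=5m0s staged=104857600 bytes).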

Live E2E confirmed

  • GET /plugins/sources → {"schemes":["github","local"]}.
  • POST {"name":"molecule-dev"} → installed via local (back-compat).
  • POST {"source":"local:// molecule-dev "} → installed (whitespace trimmed).
  • POST {"name":"a","source":"local://b"} → 400 "not both".
  • POST {"source":"github://"} → 400 "empty spec after 'github'".
  • POST {"source":"mystery://x"} → 400 + available_schemes: [...].
  • Uninstall + reinstall on PM workspace: CLAUDE.md has # Plugin: molecule-dev / rule: codebase-conventions.md marker; /configs/skills/review-loop/ present; zero container errors.
  • Startup log on platform boot: Plugin install limits: body=65536 bytes timeout=5m0s staged=104857600 bytes.

Branch

feat/agentskills-compliance → PR #62 (open, all CI green, ready to merge). Use git log --oneline origin/main.. for the commit list — counting commits inline goes stale fast.


Post-merge session — team coordination, platform hardening, new backlog

After PR #62 landed, the session continued with the ecosystem-watch doc ship, a gemini-org proof-point attempt, and a PLAN.md refresh coordinated through the agent team. Several platform bugs surfaced; all filed and tracked.

Shipped

  • PR #59 regression fix — PR #59 had rewritten http://127.0.0.1:<port> → http://ws-<id>:8000 unconditionally, breaking platform-on-host mode. Now gated behind platformInDocker detection (/.dockerenv or MOLECULE_IN_DOCKER=1). workspace-server/internal/handlers/a2a_proxy.go. Commit 4b42913.
  • PR #61 — docs/ecosystem-watch.md: Holaboss / Hermes / gstack entries + template + backlog candidates. Merged.
  • Cross-references for ecosystem-watch — wired into PLAN.md (new "Ecosystem Awareness" section), README.md + README.zh-CN.md Documentation Map, and CLAUDE.md (new "Ecosystem Context" section). Agents couldn't discover the doc because it wasn't linked anywhere; PM reported it missing despite being in its bind mount. Commit 8ae5e73.
  • DeepAgents adapter: virtual_mode=False in workspace/adapters/deepagents/adapter.py. Previously read_file/ls/write_file/edit_file operated on an in-memory snapshot that drifted from the bind-mounted /workspace; writes didn't persist across restarts and real files reported as missing. Commit bc563d1.
  • LangGraph recursion limit 100 → 500 default in workspace/a2a_executor.py. PM fan-out to 6+ reports routinely overran the 100-step ceiling. Still overridable via LANGGRAPH_RECURSION_LIMIT env var. Commit d892eb4.
  • Gemini org model swap — gemini-3.1-pro-preview → gemini-2.5-pro in org-templates/molecule-worker-gemini/org.yaml (3.1-pro-preview's 25 req/min couldn't sustain 11-workspace delegation waves). Commit 4b42913.
  • Backlog tracking for #64 / #65 added to PLAN.md Backlog. Commit ba1cc15.

Open PRs (awaiting CEO approval)

  • #68 docs/plan-refresh — PLAN.md refresh: correct test counts (Canvas 325→345, Python 990→1,040, +SDK row 50, total 1,811→1,911), promote #66/#67 to backlog with actual issue content. Coordinated with the molecule-dev team; corrected the PM's hallucinated content for #66/#67 before opening.
  • #69 chore/team-system-prompts-hardening — harden PM / Dev Lead / Research Lead system prompts with hard-learned rules from today's coordination incident (15 rules total across 3 roles). Every rule maps to a specific failure we hit today.

New platform issues filed

  • #64 — GET /workspaces/:id/delegations returns [] while the agent-side check_delegation_status tool shows 4 delegations. Sources-of-truth mismatch. Bug.
  • #65 — Per-agent repo-access config in org.yaml. New workspace_access: none | read_only | read_write field + :ro bind-mount for research agents. Eliminates the "PM couriers documents to reports" workaround. Enhancement.
  • #66 — claude_sdk_executor.py swallows subprocess stderr on CLI exit ≠ 0. Every failure surfaces the same opaque "Command failed with exit code 1 / Check stderr output for details". High-priority bug; blocked real debugging today.
  • #67 — Agent MCP client defaults to http://localhost:8080, which inside a workspace container is the container itself. Inject MOLECULE_URL=${PLATFORM_URL} at provision time. High-priority bug; blocked PM from restarting its own reports.

Gemini org — proof-point attempt, rolled back

Deployed molecule-worker-gemini (11 DeepAgents workspaces), exercised the full delegation tree, hit three distinct blockers:

  1. virtual_mode=True made PM report real files as missing (fixed in bc563d1 above).
  2. LangGraph recursion limit 100 tripped on PM fan-out (fixed in d892eb4 above).
  3. Repeated retries exhausted the Google AI Studio monthly spending cap for the whole project.

Rolled back to molecule-dev (Claude Code runtime) to finish the PLAN.md refresh task.

Session-state contamination note

After a ProcessError crash on a Claude Code workspace, subsequent A2A calls to that workspace keep failing identically until the workspace is restarted — even when the same SDK query run manually from inside the container succeeds. Root cause likely session resume state in the executor. Workaround: restart on ProcessError. Worth formalizing in the executor as an auto-reset on exit_code != 0 once #66 lands and we can see the real stderr.

Rules distilled for the team (now encoded in #69)

  • Never commit to main — always a feature branch + PR.
  • Verify external refs (issue numbers, PRs, SHAs, file paths) before citing them.
  • Inline documents into every sub-delegation — reports don't have the repo mount.
  • delegation.status == completed ≠ work was done.
  • Pause ~60s after a batch restart before delegating (warm-up race).
  • Quote errors verbatim, don't paraphrase.
  • Research Lead must always fan out — solo synthesis is a role failure.

#71 fix — initial_prompt marker written up-front

Root cause: main.py previously wrote /workspace/.initial_prompt_done only AFTER the initial_prompt self-send succeeded. If the prompt crashed (any ProcessError, network failure, SDK exit), the marker was never written — the next container boot replayed the same failing prompt and cascaded into "every message crashes" until an operator intervened. Observed three times on 2026-04-12 (gemini org + molecule-dev import + post-restart).

Fix (extracted from main.py into workspace/initial_prompt.py so it's unit-testable without uvicorn):

  • resolve_initial_prompt_marker(config_path) — prefer <config>/... when writable, fall back to /workspace/....
  • mark_initial_prompt_attempted(marker_path) — best-effort write, returns True/False so the caller can log a loud warning on I/O failure.
  • main.py calls mark_initial_prompt_attempted before scheduling the self-send. The post-send marker write is removed.

Semantic change: the prompt is attempted at most once per fresh boot; if it fails, operators re-send manually via chat. Trade-off: silent auto-retry-on-restart (which could cascade) is exchanged for a one-time attempt with a loud failure log.

Tests: 5 new unit tests in tests/test_main_initial_prompt.py, 100% coverage on initial_prompt.py. Live E2E verified all 12 containers write the marker up-front and no replay occurs on restart. Manual browser test via canvas chat against Research Lead returned the expected reply — full round-trip through the UI.

Branch: fix/71-initial-prompt-marker-at-start. Closes #71.


#66 fix — surface Claude SDK subprocess stderr + exit_code

Root cause: claude_sdk_executor.py caught ProcessError but extracted only str(exc), which for a crashing CLI reads "Command failed with exit code 1 (exit code: 1) / Error output: Check stderr output for details". The SDK's ProcessError actually carries .exit_code and .stderr attributes — we were silently dropping both. Every CLI crash looked identical and required ad-hoc reproduction inside the container to diagnose.

Fix: new _format_process_error(exc) helper that extracts type(exc).__name__, exc.exit_code, and exc.stderr (capped at _PROCESS_ERROR_STDERR_MAX_CHARS = 4096 to prevent log flooding). Called in the retry loop (logger.warning) and the terminal error path (logger.error + logger.exception for the full traceback). Plain exceptions without SDK attributes fall back to str(exc) — no crash on missing attrs.

Tests: 5 new unit tests in tests/test_claude_sdk_executor.py (format with full context / truncation / plain exception / exit-code only / end-to-end via execute() with caplog). Python pytest 1050 → 1055.

E2E: rebuilt workspace-template:claude-code, restarted an agent, ran _format_process_error with a real claude_agent_sdk._errors.ProcessError(exit_code=2, stderr='disk full: /tmp') inside the live container → output shows both exit_code=2 and the stderr verbatim.

Manual browser: canvas chat against Research Lead — reply BROWSER-OK-66 returned cleanly, full UI round-trip works with the new log format live.

Branch: fix/66-capture-claude-sdk-stderr. Closes #66.


#75 fix — auto-reset session_id on subprocess-level errors

Root cause: after a ProcessError (or CLIConnectionError), the executor's self._session_id still points at the dead session. On the next call, _build_options() passes resume=<stale-id> to the SDK, which boots a new subprocess that can't resume the prior session state — and crashes again. Observed as "crashed once → crashes forever" on 2026-04-12 across PM / RL / DL in the coordination runs.

Fix: new _reset_session_after_error(exc) method clears self._session_id when the exception looks subprocess-level (ProcessError, CLIConnectionError, has exit_code attribute, or message contains "exit code"). Rate-limit / capacity errors are left alone so normal retry preserves conversational continuity. Called in the retry loop, right after _format_process_error logs the context.

Tests: 5 new tests in tests/test_claude_sdk_executor.py — clears on ProcessError / preserves on rate-limit / no-op when session_id is already None / triggers on "exit code" message only / end-to-end via execute() with caplog + spy-on-_build_options asserting that the second retry attempt sees session_id=None rather than the stale ID. Python pytest 1055 → 1060.

E2E: verified in live container — _reset_session_after_error clears a stale session on ProcessError, preserves it on rate-limit.

Manual browser: canvas chat round-trip on Research Lead — message went through and agent responded normally. Zero ProcessError indicators.

Branch: fix/75-session-reset-on-process-error. Closes #75.


Top-5 #1 — Memory FTS + namespace scoping

Backend proposal from the ecosystem-research outcomes doc, the highest-convergence team ask (BE + FE + QA + UX all independently proposed some flavour of this).

Migration 017_memories_fts_namespace.up.sql:

  • agent_memories.namespace VARCHAR(50) NOT NULL DEFAULT 'general'
  • agent_memories.content_tsv tsvector (STORED generated column from to_tsvector('english', content))
  • idx_memories_fts (GIN on content_tsv)
  • idx_memories_ns (composite on workspace_id, namespace)

Handler workspace-server/internal/handlers/memories.go:

  • POST /workspaces/:id/memories accepts optional namespace (default "general", 50-char max validated at the handler).
  • GET /workspaces/:id/memories?q=... routes multi-char queries through content_tsv @@ plainto_tsquery('english', ?) with ts_rank ordering; single-char queries fall back to ILIKE (tsvector can't tokenise single chars in the 'english' config).
  • GET /workspaces/:id/memories?namespace=... filters regardless of scope.
  • Response always includes the namespace field.
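
The single-character fallback is the easy part to get wrong, so here is a hypothetical sketch of the routing branch — the helper name and exact SQL strings are illustrative, not the real handler code:

```go
package handlers

// buildSearchQuery picks the WHERE clause for ?q=: multi-character queries
// go through the GIN-indexed tsvector; single characters fall back to ILIKE
// because the 'english' config drops single-char tokens.
func buildSearchQuery(q string) (where string, arg any) {
	if len([]rune(q)) > 1 {
		// Callers ORDER BY ts_rank(content_tsv, plainto_tsquery('english', $1)).
		return "content_tsv @@ plainto_tsquery('english', $1)", q
	}
	return "content ILIKE $1", "%" + q + "%"
}
```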

Tests: 5 existing tests updated for the new column list; 4 new tests added (commit-with-namespace, namespace-too-long, FTS path, ILIKE fallback, namespace filter). Handler test suite passes.

E2E (live Postgres + running platform):

  • Platform restart applied migration 017 → column + indexes present.
  • POST with / without namespace → both work, default kicks in.
  • ?q=zinc+theme → FTS returns reference memory.
  • ?namespace=procedures → scoped retrieval works.
  • ?q=restart&namespace=procedures → combined filter works.

Branch: feat/memory-fts-namespace.


Top-5 #5 — Fail-secure encryption at boot

Security Auditor's top proposal from the outcomes doc. The platform previously booted without SECRETS_ENCRYPTION_KEY and silently stored workspace secrets in plaintext with only a WARNING log. OWASP A02:2021 (Cryptographic Failures) / STRIDE "Information Disclosure".

Fix (workspace-server/internal/crypto/aes.go):

  • New InitStrict() error variant that returns ErrEncryptionKeyMissing when MOLECULE_ENV=prod/production and the key is unset, malformed, or the wrong length. Existing Init() retained for any callers that prefer the warn-and-continue behaviour; only cmd/server/main.go switched to the strict variant.
  • isProdEnv() accepts prod, production, case-insensitive + trimmed.
  • loadKeyFromEnv refactor: one helper returns the parse error so both entry points can format it the same way.

cmd/server/main.go: crypto.InitStrict() + log.Fatalf on error. Local dev (no MOLECULE_ENV) keeps the existing warn-and-continue.
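
A standalone sketch of the fail-secure shape — the names InitStrict, isProdEnv, loadKeyFromEnv, and ErrEncryptionKeyMissing come from this log; the bodies are assumptions:

```go
package crypto

import (
	"encoding/base64"
	"errors"
	"fmt"
	"os"
	"strings"
)

var (
	key []byte // installed by Init / InitStrict when a valid key is present

	ErrEncryptionKeyMissing = errors.New(
		"SECRETS_ENCRYPTION_KEY is required when MOLECULE_ENV=prod/production")
)

func isProdEnv() bool {
	env := strings.ToLower(strings.TrimSpace(os.Getenv("MOLECULE_ENV")))
	return env == "prod" || env == "production"
}

// loadKeyFromEnv is the shared parse helper, so Init and InitStrict format
// the unset / malformed / wrong-length errors identically.
func loadKeyFromEnv() ([]byte, error) {
	raw := os.Getenv("SECRETS_ENCRYPTION_KEY")
	if raw == "" {
		return nil, errors.New("key unset")
	}
	k, err := base64.StdEncoding.DecodeString(raw)
	if err != nil {
		return nil, fmt.Errorf("malformed key: %w", err)
	}
	if len(k) != 32 { // AES-256
		return nil, fmt.Errorf("wrong key length %d, want 32", len(k))
	}
	return k, nil
}

// InitStrict fails closed in prod; everywhere else it keeps Init's
// warn-and-continue ergonomics.
func InitStrict() error {
	k, err := loadKeyFromEnv()
	if err != nil {
		if isProdEnv() {
			return fmt.Errorf("%w: %v", ErrEncryptionKeyMissing, err)
		}
		fmt.Fprintln(os.Stderr, "WARNING: secrets stored in plaintext:", err)
		return nil
	}
	key = k
	return nil
}
```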

Tests: 6 new tests in internal/crypto/aes_test.go:

  • fails in prod when key is missing
  • fails in prod on wrong-length key
  • succeeds in prod with valid key
  • allows dev mode without key (ergonomics)
  • allows staging without key (non-prod)
  • isProdEnv case-insensitivity table

E2E: /tmp/platform-failsec binary run with MOLECULE_ENV=prod + empty key → log.Fatalf triggers, platform refuses to start. Same binary with MOLECULE_ENV=prod + valid base64 key → boots, prints "AES-256-GCM enabled", serves 200 on /health.

Branch: fix/top5-5-fail-secure-encryption.


#85 fix — encryption_version column + DecryptVersioned

Root cause (from the investigation): rows in workspace_secrets / global_secrets are typed as encrypted_value bytea, but whether they're actually encrypted depends entirely on whether SECRETS_ENCRYPTION_KEY was set at the moment of Encrypt — crypto.Encrypt short-circuits and returns plaintext bytes when encryption is disabled. Switching on the key later makes crypto.Decrypt try GCM on plaintext bytes → fails → provisioner silently skips the row → container crashes on missing OAuth token.

With PR #83 (fail-secure) pushing operators toward setting the key, this trap was about to start biting real installs.

Fix:

  • Migration 018_secrets_encryption_version adds encryption_version INT NOT NULL DEFAULT 0 to both secret tables. All existing rows become version=0 (plaintext). Additive, safe.
  • crypto.aes.go:
    • EncryptionVersionPlaintext = 0, EncryptionVersionAESGCM = 1 constants.
    • CurrentEncryptionVersion() — tells callers which tag to write.
    • DecryptVersioned(value, version) — dispatches on tag; v=0 passes through, v=1 runs GCM (and errors if IsEnabled() is false). Unknown version → clear error.
    • Existing Decrypt deprecated-in-comment but kept for callers that haven't migrated (backward-compat during transition).
  • handlers/workspace_provision.go: SELECT now pulls encryption_version; decrypt uses DecryptVersioned; on failure aborts provisioning with a loud FATAL log + marks workspace failed (#66-style silent-failure removed).
  • handlers/secrets.go: both Set and global SetGlobalSecret persist encryption_version = CurrentEncryptionVersion() on INSERT. ON CONFLICT also updates the version — re-setting a historical plaintext row while a key is active upgrades it to GCM in-place.
  • handlers/secrets.go::GetModel: SELECT pulls version, uses DecryptVersioned.
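
A self-contained sketch of the crypto side — the constants, CurrentEncryptionVersion, DecryptVersioned, and IsEnabled are the names above; the decryptGCM helper and byte-layout checks are assumptions:

```go
package crypto

import (
	"crypto/aes"
	"crypto/cipher"
	"fmt"
)

const (
	EncryptionVersionPlaintext = 0 // value column holds raw bytes
	EncryptionVersionAESGCM    = 1 // value column holds nonce || ciphertext || tag
)

var key []byte // nil when SECRETS_ENCRYPTION_KEY is unset

func IsEnabled() bool { return len(key) == 32 }

// CurrentEncryptionVersion tells writers which tag to persist with a new row.
func CurrentEncryptionVersion() int {
	if IsEnabled() {
		return EncryptionVersionAESGCM
	}
	return EncryptionVersionPlaintext
}

// DecryptVersioned dispatches on the stored tag instead of guessing from the
// bytes — the guessing is exactly what made #85 a silent failure.
func DecryptVersioned(value []byte, version int) ([]byte, error) {
	switch version {
	case EncryptionVersionPlaintext:
		return value, nil // historical plaintext rows pass through untouched
	case EncryptionVersionAESGCM:
		if !IsEnabled() {
			return nil, fmt.Errorf("row is AES-GCM encrypted but no key is configured")
		}
		return decryptGCM(value)
	default:
		return nil, fmt.Errorf("unknown encryption_version %d", version)
	}
}

func decryptGCM(value []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block) // 12-byte nonce, 16-byte tag
	if err != nil {
		return nil, err
	}
	if len(value) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext shorter than nonce")
	}
	nonce, ct := value[:gcm.NonceSize()], value[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}
```

That layout matches the E2E observation below: 11 plaintext bytes + 12-byte nonce + 16-byte tag = 39 stored bytes.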

Tests: 6 new crypto tests (plaintext pass-through, GCM round-trip, GCM requires key, unknown version rejected, CurrentEncryptionVersion tracks key state, the exact #85 scenario end-to-end). 6 existing secret handler tests updated for the 4-arg INSERT. Full Go test suite passes.

E2E (live):

  • Migration applied automatically on platform boot: encryption_version column present on both tables.
  • 102 pre-existing plaintext rows correctly tagged version=0.
  • New TEST_NEW_SECRET_85 stored as 39 bytes (11 plaintext + 12 nonce + 16 tag ✓) with version=1.
  • PM container restart succeeds — both CLAUDE_CODE_OAUTH_TOKEN (v=0 historical plaintext) AND TEST_NEW_SECRET_85 (v=1 encrypted) are decrypted correctly and injected into the container env.

Branch: fix/85-encryption-version-migration. Closes #85.


#67 fix — inject MOLECULE_URL at workspace provision time

Root cause: Agents calling mcp__molecule__* tools from inside a workspace container were hitting localhost:8080 (container's own localhost, not the host). The MCP client (mcp-server/src/index.ts) defaulted to MOLECULE_URL || "http://localhost:8080" and the provisioner only injected PLATFORM_URL, never MOLECULE_URL.

Fix (two-sided, belt-and-suspenders):

  1. workspace-server/internal/provisioner/provisioner.go — extracted env building into pure buildContainerEnv(cfg WorkspaceConfig) []string so it's unit-testable. Now injects MOLECULE_URL=<PlatformURL> alongside PLATFORM_URL.
  2. mcp-server/src/index.ts — client now prefers MOLECULE_URL, falls back to PLATFORM_URL, then localhost:8080. Protects older containers that don't yet have MOLECULE_URL.
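
A sketch of the extracted helper under those constraints — the struct carries only the fields this sketch needs; everything beyond the names buildContainerEnv, WorkspaceConfig, PLATFORM_URL, and MOLECULE_URL is invented:

```go
package provisioner

// WorkspaceConfig is trimmed to what the sketch needs.
type WorkspaceConfig struct {
	PlatformURL string
	ExtraEnv    []string
}

// buildContainerEnv is pure over the config, so tests can assert the env
// slice without touching Docker.
func buildContainerEnv(cfg WorkspaceConfig) []string {
	env := []string{}
	if cfg.PlatformURL != "" {
		// Same value under both names: PLATFORM_URL for existing consumers,
		// MOLECULE_URL so the MCP client stops resolving to the container's
		// own localhost:8080. This is the "both-or-nothing" the tests pin.
		env = append(env,
			"PLATFORM_URL="+cfg.PlatformURL,
			"MOLECULE_URL="+cfg.PlatformURL,
		)
	}
	return append(env, cfg.ExtraEnv...)
}
```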

Tests: 4 new Go tests (buildContainerEnv injects both env vars, MOLECULE_URL always matches PLATFORM_URL across URL shapes, awareness both-or-nothing, custom envs append). Full provisioner suite green. 88 existing MCP tests still pass (fallback chain preserves existing behaviour).

E2E verified live: rebuilt platform, restarted PM, docker exec env shows both PLATFORM_URL=http://host.docker.internal:8080 and MOLECULE_URL=http://host.docker.internal:8080 on the recreated container.

Side-discovery (filed as #85): enabling SECRETS_ENCRYPTION_KEY on an install with pre-existing plaintext secrets silently breaks every secret — crypto.Decrypt runs GCM on plaintext bytes → fails → log.Printf + continue → row dropped → workspace crashes on preflight. Proposed fix: encryption_version column + boot-time re-encryption migration + fail-loud on decrypt mismatch.

Branch: fix/67-inject-molecule-url.


#73 fix — close three real delete-race windows

Observed symptom (corrected): During the session's bulk-delete runs, PM / Research Lead / Dev Lead consistently survived as "stragglers." Turned out the cause wasn't a race — it was the DELETE /workspaces/:id endpoint returning HTTP 200 with {"status":"confirmation_required"} when the workspace has children and ?confirm=true is not set. The bulk-delete script read HTTP 200 as success and moved on.

What the #73 fix actually closes: three real but distinct race windows that would bite in production even with correct ?confirm=true usage:

  1. handlers/registry.go::Register — ON CONFLICT DO UPDATE SET status='online' ran unconditionally; a late heartbeat from a workspace that was just soft-deleted (status='removed') could resurrect the row. Guard added: WHERE workspaces.status IS DISTINCT FROM 'removed'.
  2. handlers/registry.go::Heartbeat — same UPDATE path had no filter; late heartbeats refreshed last_heartbeat_at on tombstoned rows (confusing liveness). Guard: AND status != 'removed'. Plus evaluateStatus recovery path made conditional in-SQL (AND status = 'offline').
  3. handlers/workspace.go::Delete — sequence was Stop container → UPDATE status='removed'. Between those calls, Redis TTL expiry could trigger the liveness monitor, which called RestartByID, recreating the container. New order: UPDATE status='removed' FIRST (for self + descendants as a single batch), THEN stop containers + remove volumes. Auto-restart paths now see status='removed' immediately and bail out via their existing NOT IN ('removed', ...) guards.
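
Illustrative versions of the three guards — the real queries have fuller column lists, but the tombstone filters and the mark-first ordering are the point:

```go
package handlers

// 1. Register: a late heartbeat-driven upsert can no longer resurrect a row
//    that Delete already tombstoned.
const registerUpsert = `
INSERT INTO workspaces (id, status)
VALUES ($1, 'online')
ON CONFLICT (id) DO UPDATE SET status = 'online'
WHERE workspaces.status IS DISTINCT FROM 'removed'`

// 2. Heartbeat: tombstoned rows stop looking alive.
const heartbeatUpdate = `
UPDATE workspaces
SET last_heartbeat_at = now()
WHERE id = $1 AND status != 'removed'`

// 3. Delete: tombstone self + descendants in one batch BEFORE stopping
//    containers, so a liveness monitor firing in the gap sees 'removed'
//    and bails out instead of calling RestartByID.
const tombstoneBatch = `
UPDATE workspaces SET status = 'removed' WHERE id = ANY($1)`
```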

Tests: 2 new registry tests pinning the SQL guards (substring match on the emitted UPDATE); 2 existing delete tests updated for the new order (single batch UPDATE covering self+descendants). Full go test ./... -race green.

Live E2E: bulk delete of 12 workspaces with ?confirm=true → all cleanly removed, zero stragglers, no pending provisions.

Separate issue filed: API DX — DELETE should return 4xx (e.g. 409 Conflict) when confirmation is required, not 200. Misleading status code made the session's symptom diagnosis wrong for hours.

Branch: fix/73-delete-workspace-race.


#88 fix — DELETE returns 409 Conflict when confirmation required

Observed during #73: bulk-delete scripts that read HTTP 200 as success silently skipped every parent workspace, leaving tier-3 / parent nodes behind and looking like a platform race bug.

Fix: one-line change in handlers/workspace.go::Delete — return http.StatusConflict (409) instead of http.StatusOK (200) when children exist and ?confirm=true isn't set. Response body shape unchanged (canvas UI + MCP server both parse the JSON body, not the status code).

No regressions: canvas (DetailsTab.tsx:75) and MCP server (mcp-server/src/index.ts:80) already pass ?confirm=true on every delete. The 409 only affects manual API users + bulk scripts that forgot — exactly the cohort that was silently failing.

Tests: 1 existing delete test updated to expect 409. Full go test ./... green.

Live E2E: real platform, real parent+child workspaces — DELETE /workspaces/:id (no confirm) returns http=409 with the expected JSON body; DELETE /workspaces/:id?confirm=true still returns 200.

Branch: fix/88-delete-confirm-409. Closes #88.

#74 fix — retry delegation once after reactive URL refresh

Clarification of the original issue: The delegation worker (handlers/delegation.go::executeDelegation) already calls the shared h.workspace.proxyA2ARequest(...) path — so it DOES benefit from the A2A proxy's reactive health-check / URL-refresh on connection errors. The real gap is that the reactive refresh runs after the current request fails; the caller still gets an error for that specific delegation attempt. During bulk restarts (observed 21:40 today), PM's delegation worker fired during the warm-up window, hit a stale URL, and the single-attempt logic marked the delegation failed.

Fix: add a single retry with an 8-second pause when proxyA2ARequest returns a transient-looking error. The pause is long enough for the reactive refresh + container restart to land a fresh URL in the cache. isTransientProxyError classifies which statuses retry:

  • 502 Bad Gateway (plain connection failure) — retry
  • 503 Service Unavailable (reactive check decided to restart the container) — retry
  • 404 / 403 / 400 / 500 — static, don't waste the retry window
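
A sketch of the classifier plus the single-retry wrapper, with the proxy call abstracted to a closure (the real code lives in the delegation worker and calls proxyA2ARequest directly):

```go
package handlers

import (
	"net/http"
	"time"
)

func isTransientProxyError(status int) bool {
	switch status {
	case http.StatusBadGateway, // 502: plain connection failure
		http.StatusServiceUnavailable: // 503: reactive check restarting the container
		return true
	default: // 400/403/404/500 are static — don't burn the retry window
		return false
	}
}

// retryOnce pauses 8s on a transient failure — long enough for the reactive
// refresh + container restart to land a fresh URL — then retries exactly once.
func retryOnce(call func() (status int, err error)) (int, error) {
	status, err := call()
	if err == nil && !isTransientProxyError(status) {
		return status, nil
	}
	time.Sleep(8 * time.Second)
	return call()
}
```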

Tests: 7 new cases on the classifier matrix + a regression guard on the 8-second window. Full go test ./... -race green.

Branch: fix/74-delegation-via-a2a-proxy. Closes #74.


100% platform coverage — MCP + molecli

Full parity pass so every platform endpoint is reachable from both client layers.

MCP server (mcp-server/src/index.ts): 61 → 83 tools

+22 new handlers added in a single coverage-completion block at the bottom of the file:

  • Delegations (#64): record_delegation, update_delegation_status
  • Activity: report_activity, notify_user
  • Canvas viewport: get_canvas_viewport, set_canvas_viewport
  • Channels (platform-level): discover_channel_chats
  • Plugins: list_plugin_sources, list_available_plugins, check_plugin_compatibility
  • Schedules (cron): list_schedules, create_schedule, update_schedule, delete_schedule, run_schedule, get_schedule_history
  • Session + shared context: session_search, get_shared_context
  • K/V memory (distinct from HMA): memory_set, memory_get, memory_list, memory_delete_kv

Updated schemas: create_workspace + update_workspace now accept workspace_access (none / read_only / read_write) + explicit runtime / workspace_dir params.

All 88 existing MCP tests still pass; npm run build green.

molecli CLI (workspace-server/cmd/cli/): 9 → 21 top-level commands

Two new files:

  • cmd_api.go — molecli api <METHOD> <PATH> [json-body] raw escape hatch. Hits any endpoint without a typed wrapper.
  • cmd_ops.go — typed subcommands (thin wrappers over shared callAPI helper) for operator ergonomics:
    • ws restart|pause|resume — lifecycle ops
    • plugin registry|sources|list|available|install|uninstall
    • secret list|set|delete|list-global|set-global|delete-global
    • schedule list|add|remove|run|history
    • channel adapters|list|remove|send|test
    • approval pending|list|decide
    • delegation list|create
    • bundle export|import
    • org templates|import
    • traces <workspace-id>
    • activity list <workspace-id>
    • hma commit|search

go test ./cmd/cli/ passes; live smoke-test against running platform: api GET /health, plugin sources, org templates, ws restart <bad-id> all return expected responses.

Branch: feat/mcp-molecli-full-coverage.

#65 fix — per-agent workspace_access in org.yaml + API

Design from the ecosystem-research outcomes doc: new workspace_access: none | read_only | read_write field on every workspace, enforced at container provision time via Docker's native :ro bind-mount flag. Eliminates the "PM couriers documents to reports" workaround by letting research agents have read-only repo access without the write risk.

Changes:

  • Migration 019 — adds workspace_access VARCHAR(20) NOT NULL DEFAULT 'none' with CHECK constraint. Additive, all existing rows become 'none' (current isolated-volume behaviour preserved).
  • provisioner.go:
    • New WorkspaceAccess field on WorkspaceConfig.
    • Constants WorkspaceAccessNone/ReadOnly/ReadWrite.
    • buildWorkspaceMount(cfg) — pure helper, selects between named-volume, rw bind, and :ro bind based on access + workspace_path.
    • ValidateWorkspaceAccess(access, path) — rejects read_* without a path and unknown values.
  • handlers/workspace.go::Create and handlers/org.go::createOrgWorkspace — validate + persist workspace_access on INSERT. Response body echoes the stored value.
  • handlers/workspace_provision.go::buildProvisionerConfig — reads workspace_access from DB (with payload override) and forwards to the provisioner. Restart paths preserve the mode.
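
A sketch of the two pure helpers under those rules — mount-string format and struct fields are assumptions:

```go
package provisioner

import "fmt"

const (
	WorkspaceAccessNone      = "none"
	WorkspaceAccessReadOnly  = "read_only"
	WorkspaceAccessReadWrite = "read_write"
)

type WorkspaceConfig struct {
	ID              string
	WorkspaceAccess string
	WorkspacePath   string // host repo path; empty means isolated volume
}

// buildWorkspaceMount selects named volume vs rw bind vs Docker-native :ro bind.
func buildWorkspaceMount(cfg WorkspaceConfig) string {
	switch cfg.WorkspaceAccess {
	case WorkspaceAccessReadOnly:
		return fmt.Sprintf("%s:/workspace:ro", cfg.WorkspacePath)
	case WorkspaceAccessReadWrite:
		return fmt.Sprintf("%s:/workspace", cfg.WorkspacePath)
	default: // none → isolated named volume, the pre-#65 behaviour
		return fmt.Sprintf("ws-%s-data:/workspace", cfg.ID)
	}
}

// ValidateWorkspaceAccess rejects read modes without a path and unknown values.
func ValidateWorkspaceAccess(access, path string) error {
	switch access {
	case WorkspaceAccessNone:
		return nil
	case WorkspaceAccessReadOnly, WorkspaceAccessReadWrite:
		if path == "" {
			return fmt.Errorf("workspace_access=%s requires workspace_dir", access)
		}
		return nil
	default:
		return fmt.Errorf("unknown workspace_access %q (allowed: none, read_only, read_write)", access)
	}
}
```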

Tests:

  • Provisioner: 2 new tables — TestBuildWorkspaceMount_SelectionMatrix (6 cases covering the full access × path matrix) and TestValidateWorkspaceAccess (7 cases).
  • Handler INSERT WithArgs updated across 5 existing tests for the new 9th column.
  • Full go test ./... -race green.

Live E2E:

  • Migration auto-applied → workspaces table has workspace_access with the CHECK constraint.
  • POST /workspaces {"workspace_access":"read_only","workspace_dir":"/repo"} → 201 with "workspace_access":"read_only" echoed; DB row correct.
  • POST {"workspace_access":"read_only"} (no workspace_dir) → 400 with clear error.
  • POST {"workspace_access":"wildcard"} → 400 with allowed-values list.
  • Container inspected after provision: /workspace mount has RW=false Mode=ro; touch /workspace/foo from inside returns Read-only file system → enforcement is real.

Branch: feat/65-workspace-access-yaml. Closes #65.

#64 fix — agent registers delegations with platform (Option A)

Root cause (confirmed in comment on #64): check_delegation_status reads from the agent's local _delegations dict; platform's GET /workspaces/:id/delegations reads from activity_logs. The agent's delegate_to_workspace MCP tool sends A2A directly and never touches activity_logs — so the platform's view was always empty for agent-initiated delegations.

Fix (minimal Option A, dual-write):

  • Platform: two new endpoints on DelegationHandler

    • POST /workspaces/:id/delegations/record — inserts a single activity_logs row with method='delegate', status='dispatched'. No A2A fired (agent does that directly for OTEL/retry reasons).
    • POST /workspaces/:id/delegations/:delegation_id/update — accepts status ∈ {completed, failed} + optional error + preview. UPDATEs the original row and (on completion) INSERTs a delegate_result row matching the canvas-path flow.
  • Agent (workspace/builtin_tools/delegation.py):

    • New best-effort async helpers _record_delegation_on_platform and _update_delegation_on_platform. Failures are logged at debug and swallowed — never block the actual A2A delegation path.
    • _execute_delegation calls _record_... at task start and _update_... on completion / failure (alongside the existing _notify_completion).
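
A hypothetical sketch of the platform-side record endpoint — the route, method='delegate', and status='dispatched' come from above; the handler shape, request fields, and column list are assumptions:

```go
package handlers

import (
	"database/sql"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
)

type DelegationHandler struct{ db *sql.DB }

type recordDelegationReq struct {
	DelegationID string `json:"delegation_id"`
	Task         string `json:"task"`
}

// Record mirrors the agent's local dict into activity_logs. No A2A is fired
// here — the agent keeps the direct path for OTEL/retry reasons.
func (h *DelegationHandler) Record(c *gin.Context) {
	var req recordDelegationReq
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}
	if _, err := uuid.Parse(req.DelegationID); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid delegation id"})
		return
	}
	_, err := h.db.ExecContext(c.Request.Context(),
		`INSERT INTO activity_logs (workspace_id, delegation_id, method, status, detail)
		 VALUES ($1, $2, 'delegate', 'dispatched', $3)`,
		c.Param("id"), req.DelegationID, req.Task)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}
	c.JSON(http.StatusCreated, gin.H{"status": "dispatched"})
}
```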

Result: agent keeps direct A2A for speed + OTEL trace-context propagation + existing retry logic; platform's activity_logs mirrors the same set the agent's local dict holds. GET /delegations now returns rows for agent-initiated delegations.

Tests: 5 new Go tests (Record inserts + rejects invalid UUID, UpdateStatus completed inserts result row + rejects unknown status + failed broadcast). 4 new Python tests (record fires HTTP POST, best-effort on platform error, update completed, update truncates large preview to 500 chars). Python pytest 1060 → 1064; full Go suite green.

Branch: fix/64-agent-delegate-via-platform. Closes #64.

SDK — workspace / org / channel validators

Issue: the SDK only validated plugins. Authors publishing workspace config templates, org-templates, or channel configs had no lint step — errors only surfaced at POST /org/import or container startup.

Fix: extended sdk/python/molecule_plugin/ with three new modules:

  • workspace.py — validates config.yaml (name, runtime, tier, runtime_config shape). SUPPORTED_RUNTIMES kept in sync with provisioner.RuntimeImages.
  • org.py — recursively validates org.yaml (name, workspaces tree, workspace_access + workspace_dir pairing per #65, channels via delegated validate_channel_config, schedules, plugins, external+url, children).
  • channel.py — validates channel configs (standalone dict or YAML file). SUPPORTED_CHANNEL_TYPES currently {telegram}; extend when Slack/Discord adapters land.

CLI (python -m molecule_plugin validate {plugin|workspace|org|channel} <path>) dispatches to the right validator; bare validate <path> still defaults to plugin for back-compat. Exit 0 on valid, 1 on any error.

validate_channel_config is the single source of truth for channel schema — org.py delegates to it rather than duplicating checks.

Tests: sdk/python/tests/test_validators.py — 37 new tests (happy, missing file, bad YAML, non-object, each field error, null-safety on runtime_config: None / defaults: null, CLI dispatch for all 4 kinds, back-compat form). Fixed bug found during test authoring: org.py crashed on non-dict children; now guarded with isinstance check.

Live smoke: all 4 in-repo org templates (free-beats-all, reno-stars, molecule-dev, molecule-worker-gemini) validate clean.

SDK pytest: 50 → 87. Branch: feat/sdk-workspace-org-channel.

Top-5 #3 — parallel adapter builds

DevOps proposal from the ecosystem-research outcomes doc. All six adapter Dockerfiles build FROM workspace-template:base with no inter-adapter dependencies, so they're safe to build concurrently once the base image is done.

Change (workspace/build-all.sh):

  • Serial path kept for single-runtime rebuilds and SERIAL_BUILD=1 CI environments (preserves bounded-concurrency option).
  • Parallel path: fan out one docker build per adapter, capture stdout/stderr to /tmp/build_<tag>.log, wait for all, tally per-tag success/failure. Failures still exit non-zero.

E2E: bash build-all.sh claude-code deepagents langgraph finished in 43s wall-clock (three adapter builds running concurrently). Previously ~120s serial. Log files live under /tmp/build_*.log for post-hoc debugging.

Branch: feat/top5-3-parallel-adapter-builds.