
# 2026-04-12
## Summary
Shipped the full two-axis plugin architecture on `feat/agentskills-compliance`
(PR #62). **Plugin source** (where files come from) and **plugin shape**
(what's inside them) are now independent, pluggable axes.
- **Source axis** — `workspace-server/internal/plugins/` package: `SourceResolver`
interface, `Registry`, `LocalResolver`, `GithubResolver`, `ParseSource` (sketched after this list).
`POST /workspaces/:id/plugins` accepts `{name}` (back-compat → local) or
`{source: "scheme://spec"}`. New `GET /plugins/sources` enumerates
registered schemes.
- **Shape axis** — `workspace/plugins_registry/` package:
`PluginAdaptor` protocol, hybrid resolver (registry > plugin-shipped >
raw-drop), `AgentskillsAdaptor` built-in for agentskills.io-format
skills + Molecule AI's rules extension. Named sub-type adapters planned
for MCP, DeepAgents sub-agents, LangGraph sub-graphs, etc.
- **agentskills.io compliance** — every first-party skill passes the
open standard; `python -m molecule_plugin validate` CLI enforces it
in CI. Our skills are now installable in ~35 other agent tools
(Cursor, Codex, Copilot, Gemini CLI, etc.).
- **Gemini org parity** — `molecule-worker-gemini` mirrors `molecule-dev`
(11 workspaces, Research + Dev branches, schedules, Telegram channel,
per-agent prompts) as the E2E proof point.
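For the source axis, a minimal Go sketch of how the pieces described above could fit together. The `SourceResolver`, `Registry`, and `ParseSource` names come from this log; the exact signatures, fields, and error strings are assumptions, not the shipped code.
```go
package plugins

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// SourceResolver stages a plugin's files and returns the staged directory.
// (Interface shape assumed.)
type SourceResolver interface {
	Resolve(spec string) (stagedDir string, err error)
}

// Registry maps scheme -> resolver; the RWMutex reflects review round 3.
type Registry struct {
	mu        sync.RWMutex
	resolvers map[string]SourceResolver
}

func (r *Registry) Register(scheme string, res SourceResolver) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.resolvers == nil {
		r.resolvers = map[string]SourceResolver{}
	}
	r.resolvers[scheme] = res
}

// Schemes returns the registered schemes sorted, matching the sort.Strings change.
func (r *Registry) Schemes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.resolvers))
	for s := range r.resolvers {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

// ParseSource splits "scheme://spec", trimming whitespace and rejecting an
// empty spec, as exercised in the live E2E checks below.
func ParseSource(source string) (scheme, spec string, err error) {
	source = strings.TrimSpace(source)
	scheme, spec, ok := strings.Cut(source, "://")
	if !ok || scheme == "" {
		return "", "", fmt.Errorf("source %q is not of the form scheme://spec", source)
	}
	if strings.TrimSpace(spec) == "" {
		return "", "", fmt.Errorf("empty spec after %q", scheme)
	}
	return scheme, strings.TrimSpace(spec), nil
}
```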
## Files touched
Platform (Go):
- `workspace-server/internal/plugins/{source,local,github}.go` + tests — source
layer, 97.4% coverage.
- `workspace-server/internal/envx/envx.go` + test — env-var helpers, 100%
coverage (sketched below).
- `workspace-server/internal/handlers/plugins.go` — install pipeline refactored
into `resolveAndStage` + `deliverToContainer`; typed `httpErr` for
status propagation; `sort.Strings` in `Registry.Schemes`; `logInstallLimitsOnce`
on startup.
- `workspace-server/internal/router/router.go` — new routes (`/plugins/sources`,
`/workspaces/:id/plugins/available`, `/workspaces/:id/plugins/compatibility`).
- `workspace-server/Dockerfile` — `apk add git` for the github resolver.
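A rough sketch of what the `internal/envx` helpers might look like. The helper names follow the `envDuration`/`envInt64` extraction noted in review round 4; exported names, defaults, and fallback-on-parse-error behaviour are assumptions.
```go
// Package envx holds the env-var helpers that back the PLUGIN_INSTALL_* limits
// (body size, fetch timeout, staged-dir size) logged at startup.
package envx

import (
	"os"
	"strconv"
	"time"
)

// Duration reads key as a time.Duration, falling back to def when unset or malformed.
func Duration(key string, def time.Duration) time.Duration {
	raw := os.Getenv(key)
	if raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return def
	}
	return d
}

// Int64 reads key as an int64 with the same fallback policy.
func Int64(key string, def int64) int64 {
	raw := os.Getenv(key)
	if raw == "" {
		return def
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return def
	}
	return n
}
```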
Workspace runtime (Python):
- `workspace/plugins_registry/` — new module: `protocol.py`,
`builtins.py` (`AgentskillsAdaptor`), `raw_drop.py`, resolver.
- `workspace/skill_loader/` — renamed from `skills/`; reads
`scripts/` per the agentskills.io spec.
- `workspace/builtin_tools/` — renamed from `tools/` to
disambiguate from user-plugin tool dirs.
- `workspace/adapters/base.py` — added hooks: `memory_filename`,
`register_tool_hook`, `register_subagent_hook`, `append_to_memory_hook`,
`install_plugins_via_registry`. Default `inject_plugins()` drives the
new pipeline.
- `workspace/adapters/claude_code/adapter.py` — deleted the
40-line `inject_plugins()` override.
- `workspace/adapters/deepagents/Dockerfile` — ships
`plugins_registry/`.
- `workspace/plugins.py` — `PluginManifest.runtimes` field.
Plugins (content):
- `plugins/*/adapters/{claude_code,deepagents}.py` — one-line
`from plugins_registry.builtins import AgentskillsAdaptor as Adaptor`.
- `plugins/*/plugin.yaml` — declare `runtimes: [claude_code, deepagents]`.
SDK (Python):
- `sdk/python/molecule_plugin/` — `protocol.py`, `builtins.py`
(SDK-vendored `AgentskillsAdaptor`), `manifest.py` (spec validator), CLI
via `__main__.py`.
- `sdk/python/template/` — cookiecutter skeleton.
Org templates:
- `org-templates/molecule-worker-gemini/org.yaml` — full parity with
`molecule-dev` (11 workspaces, schedules, Telegram, per-agent
prompts, `workspace_dir` mount on PM, `required_env: [GOOGLE_API_KEY]`).
- Copied 5 `system-prompt.md` files from molecule-dev (research-lead,
market-analyst, technical-researcher, competitive-intelligence,
uiux-designer).
Docs:
- `docs/plugins/agentskills-compat.md` — two-layer model, spec mapping.
- `docs/plugins/sources.md` — two-axis source/shape architecture,
security model, future resolvers.
- `docs/ecosystem-watch.md` — Holaboss, Hermes Agent, gstack entries
(adjacent projects to track).
- `.env.example` — `PLUGIN_INSTALL_*` vars documented.
- `PLAN.md` — plugin-adaptor landed; deferred items listed.
- `CLAUDE.md` — new endpoints, env vars, test counts.
## Test counts
- Go platform: all packages green under `-race`.
- Python workspace: 1040 passed, 9 skipped.
- Python SDK: 50 passed.
- Total: **1090 passing**.
Coverage on new code:
- `workspace-server/internal/plugins/*`: 97.4%
- `workspace-server/internal/envx/*`: 100%
- `workspace/plugins_registry/*`: 100%
- `workspace/skill_loader/*`: 100%
- `sdk/python/molecule_plugin/*`: 100%
## 5 rounds of code review
Every round was addressed by new commits on the branch:
1. Round 1 — initial coverage pass.
2. Round 2 — `memory_filename` plumbing through `InstallContext`;
`logger` in `skill_loader`; module constants for `SKILLS_SUBDIR`,
`SKIP_ROOT_MD`, `SKILL_NAME_*`; SDK↔runtime drift-guard test;
frontmatter parser unification.
3. Round 3 — fetch timeout + body size cap + staged-dir size cap via
new env vars; typed `ErrPluginNotFound` sentinel replaces string
matching; reject both `name`+`source`; `sort.Strings` in Schemes;
`sync.RWMutex` on Registry; `--` in git clone; docs clarify
github resolver is public-only.
4. Round 4 — `ParseSource` empty-spec guard; `dirSize(cap)` → `dirSize(limit)`;
`localNameRE` length bound; extract `envDuration`/`envInt64` into
`internal/envx`; `LANG=C LC_ALL=C` in git child env for locale-
stable error parsing.
5. Round 5 — typed `httpErr` replaces 5-value tuple; `resolveAndStage`
decoupled from `*gin.Context` via `installRequest` struct; drop
unused `source` param from `deliverToContainer`; trim whitespace in
`ParseSource`; consolidate 3 test resolver stubs into 1
parameterized `fakeResolver` + 3 constructors.
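As a sketch of the round-5 `httpErr` pattern. The field names and helper functions are assumptions; only the idea of carrying a status inside a plain `error` is from the review notes.
```go
package handlers

import (
	"errors"
	"fmt"
	"net/http"
)

// httpErr lets resolveAndStage return a single error value while still telling
// the handler which HTTP status to send, replacing the old 5-value tuple.
type httpErr struct {
	status int
	msg    string
}

func (e *httpErr) Error() string { return e.msg }

// badRequest is a hypothetical constructor for 400-class failures.
func badRequest(format string, args ...any) error {
	return &httpErr{status: http.StatusBadRequest, msg: fmt.Sprintf(format, args...)}
}

// statusOf maps any error back to an HTTP status, defaulting to 500.
func statusOf(err error) int {
	var he *httpErr
	if errors.As(err, &he) {
		return he.status
	}
	return http.StatusInternalServerError
}
```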
## Live E2E confirmed
- `GET /plugins/sources` → `{"schemes":["github","local"]}`.
- `POST {"name":"molecule-dev"}` → installed via local (back-compat).
- `POST {"source":"local:// molecule-dev "}` → installed
(whitespace trimmed).
- `POST {"name":"a","source":"local://b"}` → 400 "not both".
- `POST {"source":"github://"}` → 400 "empty spec after 'github'".
- `POST {"source":"mystery://x"}` → 400 + `available_schemes: [...]`.
- Uninstall + reinstall on PM workspace: CLAUDE.md has
`# Plugin: molecule-dev / rule: codebase-conventions.md` marker;
`/configs/skills/review-loop/` present; zero container errors.
- Startup log on platform boot: `Plugin install limits: body=65536
bytes timeout=5m0s staged=104857600 bytes`.
## Branch
`feat/agentskills-compliance` → PR #62 (open, all CI green, ready to
merge). Use `git log --oneline origin/main..` for the commit list —
counting commits inline goes stale fast.
---
## Post-merge session — team coordination, platform hardening, new backlog
After PR #62 landed, the session continued with shipping ecosystem-watch, a
gemini-org proof-point attempt, and a PLAN.md refresh coordinated through
the agent team. Several platform bugs surfaced; all filed and tracked.
### Shipped
- **A2A proxy regression fix (PR #59)** — PR #59 had rewritten
`http://127.0.0.1:<port>` → `http://ws-<id>:8000` unconditionally,
breaking platform-on-host mode. The rewrite is now gated behind `platformInDocker`
detection (`/.dockerenv` or `MOLECULE_IN_DOCKER=1`). `workspace-server/internal/handlers/a2a_proxy.go`.
Commit `4b42913`.
- **PR #61** — `docs/ecosystem-watch.md`: Holaboss / Hermes / gstack
entries + template + backlog candidates. Merged.
- **Cross-references for ecosystem-watch** — wired into `PLAN.md` (new
"Ecosystem Awareness" section), `README.md` + `README.zh-CN.md`
Documentation Map, and `CLAUDE.md` (new "Ecosystem Context" section).
Agents couldn't discover the doc because it wasn't linked anywhere;
PM reported it missing despite being in its bind mount. Commit `8ae5e73`.
- **DeepAgents adapter: `virtual_mode=False`** in
`workspace/adapters/deepagents/adapter.py`. Previously
`read_file`/`ls`/`write_file`/`edit_file` operated on an in-memory
snapshot that drifted from the bind-mounted `/workspace`; writes
didn't persist across restarts and real files were reported as missing.
Commit `bc563d1`.
- **LangGraph recursion limit 100 → 500** default in
`workspace/a2a_executor.py`. PM fan-out to 6+ reports routinely
overran the 100-step ceiling. Still overridable via
`LANGGRAPH_RECURSION_LIMIT` env var. Commit `d892eb4`.
- **Gemini org model swap** `gemini-3.1-pro-preview` →
`gemini-2.5-pro` in `org-templates/molecule-worker-gemini/org.yaml`
(3.1-pro-preview's 25 req/min couldn't sustain 11-workspace delegation
waves). Commit `4b42913`.
- **Backlog tracking** for #64 / #65 added to `PLAN.md` Backlog. Commit `ba1cc15`.
### Open PRs (awaiting CEO approval)
- **#68** `docs/plan-refresh` — PLAN.md refresh: correct test counts
(Canvas 325→345, Python 990→1,040, +SDK row 50, total 1,811→1,911),
promote #66/#67 to backlog with actual issue content. Coordinated
with the molecule-dev team; corrected PM's hallucinated content for
#66/#67 before opening the PR.
- **#69** `chore/team-system-prompts-hardening` — harden PM / Dev Lead /
Research Lead system prompts with hard-learned rules from today's
coordination incident (15 rules total across 3 roles). Every rule
maps to a specific failure we hit today.
### New platform issues filed
- **#64** — `GET /workspaces/:id/delegations` returns `[]` while the
agent-side `check_delegation_status` tool shows 4 delegations.
Sources-of-truth mismatch. Bug.
- **#65** — Per-agent repo-access config in `org.yaml`. New
`workspace_access: none | read_only | read_write` field +
`:ro` bind-mount for research agents. Eliminates the
"PM couriers documents to reports" workaround. Enhancement.
- **#66** — `claude_sdk_executor.py` swallows subprocess stderr on
CLI exit ≠ 0. Every failure surfaces the same opaque
`"Command failed with exit code 1 / Check stderr output for details"`.
High-priority bug; blocked real debugging today.
- **#67** — Agent MCP client defaults to `http://localhost:8080`,
which inside a workspace container is the container itself.
Inject `MOLECULE_URL=${PLATFORM_URL}` at provision time. High-priority
bug; blocked PM from restarting its own reports.
### Gemini org — proof-point attempt, rolled back
Deployed molecule-worker-gemini (11 DeepAgents workspaces), exercised
the full delegation tree, hit three distinct blockers:
1. `virtual_mode=True` made PM report real files as missing (fixed
in `bc563d1` above).
2. LangGraph recursion limit 100 tripped on PM fan-out (fixed in
`d892eb4` above).
3. Google AI Studio's **monthly spending cap** was exhausted for the whole
project after repeated retries.
Rolled back to molecule-dev (Claude Code runtime) to finish the
PLAN.md refresh task.
### Session-state contamination note
After a `ProcessError` crash on a Claude Code workspace, subsequent
A2A calls to that workspace keep failing identically until the
workspace is restarted — even when the same SDK query run manually
from inside the container succeeds. Root cause likely session
resume state in the executor. Workaround: restart on `ProcessError`.
Worth formalizing in the executor as an auto-reset on `exit_code != 0`
once #66 lands and we can see the real stderr.
### Rules distilled for the team (now encoded in #69)
- Never commit to `main` — always a feature branch + PR.
- Verify external refs (issue numbers, PRs, SHAs, file paths) before
citing them.
- Inline documents into every sub-delegation — reports don't have the
repo mount.
- `delegation.status == completed` ≠ work was done.
- Pause ~60s after a batch restart before delegating (warm-up race).
- Quote errors verbatim, don't paraphrase.
- Research Lead must always fan out — solo synthesis is a role failure.
---
## #71 fix — initial_prompt marker written up-front
**Root cause:** `main.py` previously wrote `/workspace/.initial_prompt_done`
only AFTER the initial_prompt self-send succeeded. If the prompt crashed
(any ProcessError, network failure, SDK exit), the marker was never
written — the next container boot replayed the same failing prompt and
cascaded into "every message crashes" until an operator intervened.
Observed three times on 2026-04-12 (gemini org + molecule-dev import +
post-restart).
**Fix (extracted from main.py into `workspace/initial_prompt.py`
so it's unit-testable without uvicorn):**
- `resolve_initial_prompt_marker(config_path)` — prefer `<config>/...`
when writable, fall back to `/workspace/...`.
- `mark_initial_prompt_attempted(marker_path)` — best-effort write,
returns `True`/`False` so the caller can log a loud warning on I/O
failure.
- `main.py` calls `mark_initial_prompt_attempted` **before** scheduling
the self-send. The post-send marker write is removed.
**Semantic change:** the prompt is attempted at most once per fresh boot;
if it fails, operators re-send manually via chat. The trade-off: silent
auto-retry-on-restart (which could cascade) is exchanged for a one-time
attempt with a loud failure log.
**Tests:** 5 new unit tests in `tests/test_main_initial_prompt.py`, 100%
coverage on `initial_prompt.py`. Live E2E verified all 12 containers
write the marker up-front and no replay occurs on restart. Manual
browser test via canvas chat against Research Lead returned the
expected reply — full round-trip through the UI.
Branch: `fix/71-initial-prompt-marker-at-start`. Closes #71.
---
## #66 fix — surface Claude SDK subprocess stderr + exit_code
**Root cause:** `claude_sdk_executor.py` caught `ProcessError` but
extracted only `str(exc)`, which for a crashing CLI reads "Command
failed with exit code 1 (exit code: 1) / Error output: Check stderr
output for details". The SDK's `ProcessError` actually carries
`.exit_code` and `.stderr` attributes — we were silently dropping both.
Every CLI crash looked identical and required ad-hoc reproduction
inside the container to diagnose.
**Fix:** new `_format_process_error(exc)` helper that extracts
`type(exc).__name__`, `exc.exit_code`, and `exc.stderr` (capped at
`_PROCESS_ERROR_STDERR_MAX_CHARS = 4096` to prevent log flooding).
Called in the retry loop (`logger.warning`) and the terminal error
path (`logger.error` + `logger.exception` for the full traceback).
Plain exceptions without SDK attributes fall back to `str(exc)`; no
crash on missing attrs.
**Tests:** 5 new unit tests in `tests/test_claude_sdk_executor.py`
(format with full context / truncation / plain exception / exit-code
only / end-to-end via `execute()` with caplog). Python pytest 1050 →
1055.
**E2E:** rebuilt `workspace-template:claude-code`, restarted an agent,
ran `_format_process_error` with a real
`claude_agent_sdk._errors.ProcessError(exit_code=2, stderr='disk full: /tmp')`
inside the live container → output shows both `exit_code=2` and the stderr verbatim.
**Manual browser:** canvas chat against Research Lead — reply
`BROWSER-OK-66` returned cleanly, full UI round-trip works with the
new log format live.
Branch: `fix/66-capture-claude-sdk-stderr`. Closes #66.
---
## #75 fix — auto-reset session_id on subprocess-level errors
**Root cause:** after a `ProcessError` (or `CLIConnectionError`), the
executor's `self._session_id` still points at the dead session. On the
next call, `_build_options()` passes `resume=<stale-id>` to the SDK,
which boots a new subprocess that can't resume the prior session state
— and crashes again. Observed as "crashed once → crashes forever" on
2026-04-12 across PM / RL / DL in the coordination runs.
**Fix:** new `_reset_session_after_error(exc)` method clears
`self._session_id` when the exception looks subprocess-level
(`ProcessError`, `CLIConnectionError`, has `exit_code` attribute, or
message contains "exit code"). Rate-limit / capacity errors are left
alone so normal retry preserves conversational continuity. Called in
the retry loop, right after `_format_process_error` logs the context.
**Tests:** 5 new tests in `tests/test_claude_sdk_executor.py` — clears
on ProcessError / preserves on rate-limit / no-op when session_id is
already None / triggers on "exit code" message only / end-to-end via
`execute()` with `caplog` + spy-on-`_build_options` asserting that the
second retry attempt sees `session_id=None` rather than the stale ID.
Python pytest 1055 → 1060.
**E2E:** verified in live container — `_reset_session_after_error`
clears a stale session on ProcessError, preserves it on rate-limit.
**Manual browser:** canvas chat round-trip on Research Lead — message
went through and agent responded normally. Zero ProcessError
indicators.
Branch: `fix/75-session-reset-on-process-error`. Closes #75.
---
## Top-5 #1 — Memory FTS + namespace scoping
Backend proposal from the ecosystem-research outcomes doc; the
highest-convergence team ask (BE + FE + QA + UX all independently proposed
some flavour of this).
**Migration `017_memories_fts_namespace.up.sql`:**
- `agent_memories.namespace VARCHAR(50) NOT NULL DEFAULT 'general'`
- `agent_memories.content_tsv tsvector` (STORED generated column from
`to_tsvector('english', content)`)
- `idx_memories_fts` (GIN on `content_tsv`)
- `idx_memories_ns` (composite on `workspace_id, namespace`)
**Handler `workspace-server/internal/handlers/memories.go`:**
- `POST /workspaces/:id/memories` accepts optional `namespace` (default
`"general"`, 50-char max validated at the handler).
- `GET /workspaces/:id/memories?q=...` routes multi-char queries
through `content_tsv @@ plainto_tsquery('english', ?)` with
`ts_rank` ordering; single-char queries fall back to `ILIKE`
(tsvector can't tokenise single chars in the 'english' config).
- `GET /workspaces/:id/memories?namespace=...` filters regardless of
scope.
- Response always includes the `namespace` field.
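A sketch of the FTS-vs-ILIKE routing described above. Table and column names follow migration 017; the exact SQL text and the surrounding handler plumbing are assumptions.
```go
// buildMemoryQuery routes multi-character searches through Postgres FTS with
// ts_rank ordering, and single-character searches through ILIKE, since the
// 'english' tsvector config cannot tokenise a single character.
func buildMemoryQuery(q string) (sqlText string, arg string) {
	const base = `SELECT id, namespace, content
	                FROM agent_memories
	               WHERE workspace_id = $1 AND `
	if len([]rune(q)) > 1 {
		return base + `content_tsv @@ plainto_tsquery('english', $2)
		            ORDER BY ts_rank(content_tsv, plainto_tsquery('english', $2)) DESC`, q
	}
	return base + `content ILIKE $2`, "%" + q + "%"
}
```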
**Tests:** 5 existing tests updated for the new column list; 4 new
tests added (commit-with-namespace, namespace-too-long, FTS path,
ILIKE fallback, namespace filter). Handler test suite passes.
**E2E (live Postgres + running platform):**
- Platform restart applied migration 017 → column + indexes present.
- `POST` with / without namespace → both work, default kicks in.
- `?q=zinc+theme` → FTS returns reference memory.
- `?namespace=procedures` → scoped retrieval works.
- `?q=restart&namespace=procedures` → combined filter works.
Branch: `feat/memory-fts-namespace`.
---
## Top-5 #5 — Fail-secure encryption at boot
Security Auditor's top proposal from the outcomes doc. The platform
previously booted without `SECRETS_ENCRYPTION_KEY` and silently stored
workspace secrets in plaintext with only a WARNING log. OWASP A02:2021
(Cryptographic Failures) / STRIDE "Information Disclosure".
**Fix** (`workspace-server/internal/crypto/aes.go`):
- New `InitStrict() error` variant that returns `ErrEncryptionKeyMissing`
when `MOLECULE_ENV=prod`/`production` and the key is unset, malformed,
or the wrong length. Existing `Init()` retained for any callers that
prefer the warn-and-continue behaviour; only `cmd/server/main.go`
switched to the strict variant.
- `isProdEnv()` accepts `prod`, `production`, case-insensitive + trimmed.
- `loadKeyFromEnv` refactor: one helper returns the parse error so both
entry points can format it the same way.
**`cmd/server/main.go`:** `crypto.InitStrict()` + `log.Fatalf` on error.
Local dev (no `MOLECULE_ENV`) keeps the existing warn-and-continue.
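A compressed sketch of the fail-secure shape. `InitStrict`, `ErrEncryptionKeyMissing`, and `isProdEnv` are named in this entry; the key decoding details and log text here are assumptions.
```go
package crypto

import (
	"encoding/base64"
	"errors"
	"fmt"
	"log"
	"os"
	"strings"
)

var ErrEncryptionKeyMissing = errors.New("SECRETS_ENCRYPTION_KEY missing or invalid in production")

func isProdEnv() bool {
	env := strings.ToLower(strings.TrimSpace(os.Getenv("MOLECULE_ENV")))
	return env == "prod" || env == "production"
}

// InitStrict refuses to start in prod without a usable key; outside prod it keeps
// the historical warn-and-continue behaviour that Init provides.
func InitStrict() error {
	raw := strings.TrimSpace(os.Getenv("SECRETS_ENCRYPTION_KEY"))
	key, decErr := base64.StdEncoding.DecodeString(raw)
	if raw == "" || decErr != nil || len(key) != 32 {
		if isProdEnv() {
			return fmt.Errorf("%w: got %d key bytes (decode error: %v)", ErrEncryptionKeyMissing, len(key), decErr)
		}
		log.Printf("WARNING: SECRETS_ENCRYPTION_KEY unset or invalid; secrets will be stored in plaintext")
		return nil
	}
	// ... initialise the AES-256-GCM AEAD with the 32-byte key (omitted) ...
	return nil
}
```
`cmd/server/main.go` then just wraps it: `if err := crypto.InitStrict(); err != nil { log.Fatalf("refusing to start: %v", err) }`.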
**Tests:** 6 new tests in `internal/crypto/aes_test.go`:
- fails in prod when key is missing
- fails in prod on wrong-length key
- succeeds in prod with valid key
- allows dev mode without key (ergonomics)
- allows staging without key (non-prod)
- isProdEnv case-insensitivity table
**E2E:** `/tmp/platform-failsec` binary run with `MOLECULE_ENV=prod` +
empty key → `log.Fatalf` triggers, platform refuses to start. Same
binary with `MOLECULE_ENV=prod` + valid base64 key → boots, prints
"AES-256-GCM enabled", serves 200 on `/health`.
Branch: `fix/top5-5-fail-secure-encryption`.
---
## #85 fix — encryption_version column + DecryptVersioned
**Root cause (from the investigation):** rows in `workspace_secrets` /
`global_secrets` are tagged as `encrypted_value bytea` but whether
they're *actually* encrypted depends entirely on whether
`SECRETS_ENCRYPTION_KEY` was set at the moment of `Encrypt`
`crypto.Encrypt` short-circuits and returns plaintext bytes when
encryption is disabled. Switching on the key later makes
`crypto.Decrypt` try GCM on plaintext bytes → fails → provisioner
silently skips the row → container crashes on missing OAuth token.
With PR #83 (fail-secure) pushing operators toward setting the key,
this trap was about to start biting real installs.
**Fix:**
- Migration `018_secrets_encryption_version` adds
`encryption_version INT NOT NULL DEFAULT 0` to both secret tables.
All existing rows become `version=0` (plaintext). Additive, safe.
- `crypto/aes.go`:
- `EncryptionVersionPlaintext = 0`, `EncryptionVersionAESGCM = 1` constants.
- `CurrentEncryptionVersion()` — tells callers which tag to write.
- `DecryptVersioned(value, version)` — dispatches on tag; `v=0`
passes through, `v=1` runs GCM (and errors if `IsEnabled()` is
false). Unknown version → clear error.
- Existing `Decrypt` deprecated-in-comment but kept for callers
that haven't migrated (backward-compat during transition).
- `handlers/workspace_provision.go`: SELECT now pulls
`encryption_version`; decrypt uses `DecryptVersioned`; on failure
**aborts provisioning with a loud FATAL log + marks workspace
failed** (#66-style silent-failure removed).
- `handlers/secrets.go`: both `Set` and global `SetGlobalSecret`
persist `encryption_version = CurrentEncryptionVersion()` on
INSERT. `ON CONFLICT` also updates the version — re-setting a
historical plaintext row while a key is active upgrades it to
GCM in-place.
- `handlers/secrets.go::GetModel`: SELECT pulls version, uses
`DecryptVersioned`.
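A sketch of the version dispatch inside `internal/crypto`. The constants, `CurrentEncryptionVersion`, and `DecryptVersioned` are named above; it assumes the package's existing `IsEnabled()` and `Decrypt()` helpers and is illustrative rather than the exact shipped code.
```go
const (
	EncryptionVersionPlaintext = 0
	EncryptionVersionAESGCM    = 1
)

// CurrentEncryptionVersion tells writers which tag to persist alongside the value.
func CurrentEncryptionVersion() int {
	if IsEnabled() {
		return EncryptionVersionAESGCM
	}
	return EncryptionVersionPlaintext
}

// DecryptVersioned dispatches on the stored tag instead of guessing from the bytes,
// which is exactly the guessing that produced the #85 failure mode.
func DecryptVersioned(value []byte, version int) ([]byte, error) {
	switch version {
	case EncryptionVersionPlaintext:
		return value, nil // historical rows written while no key was configured
	case EncryptionVersionAESGCM:
		if !IsEnabled() {
			return nil, fmt.Errorf("row is AES-GCM encrypted but no SECRETS_ENCRYPTION_KEY is configured")
		}
		return Decrypt(value) // existing GCM path
	default:
		return nil, fmt.Errorf("unknown encryption_version %d", version)
	}
}
```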
**Tests:** 6 new crypto tests (plaintext pass-through, GCM round-trip,
GCM requires key, unknown version rejected, `CurrentEncryptionVersion`
tracks key state, the exact #85 scenario end-to-end). 6 existing
secret handler tests updated for the 4-arg INSERT. Full Go test suite
passes.
**E2E (live):**
- Migration applied automatically on platform boot: `encryption_version`
column present on both tables.
- 102 pre-existing plaintext rows correctly tagged `version=0`.
- New `TEST_NEW_SECRET_85` stored as 39 bytes (11 plaintext + 12 nonce
+ 16 tag = ✓) with `version=1`.
- PM container restart succeeds — both `CLAUDE_CODE_OAUTH_TOKEN`
(v=0 historical plaintext) AND `TEST_NEW_SECRET_85` (v=1 encrypted)
are decrypted correctly and injected into the container env.
Branch: `fix/85-encryption-version-migration`. Closes #85.
---
## #67 fix — inject MOLECULE_URL at workspace provision time
**Root cause:** Agents calling `mcp__molecule__*` tools from inside a
workspace container were hitting `localhost:8080` (container's own
localhost, not the host). The MCP client
(`mcp-server/src/index.ts`) defaulted to `MOLECULE_URL ||
"http://localhost:8080"` and the provisioner only injected
`PLATFORM_URL`, never `MOLECULE_URL`.
**Fix (two-sided, belt-and-suspenders):**
1. `workspace-server/internal/provisioner/provisioner.go` — extracted env
building into pure `buildContainerEnv(cfg WorkspaceConfig) []string`
so it's unit-testable. Now injects `MOLECULE_URL=<PlatformURL>`
alongside `PLATFORM_URL`.
2. `mcp-server/src/index.ts` — client now prefers `MOLECULE_URL`, falls
back to `PLATFORM_URL`, then `localhost:8080`. Protects older
containers that don't yet have `MOLECULE_URL`.
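A sketch of the pure env builder from point 1 (the `WorkspaceConfig` field names here are assumptions):
```go
package provisioner

import "fmt"

type WorkspaceConfig struct {
	PlatformURL string
	ExtraEnv    []string // per-workspace custom env entries, appended last
}

// buildContainerEnv is pure so the MOLECULE_URL / PLATFORM_URL pairing can be
// unit-tested without touching Docker.
func buildContainerEnv(cfg WorkspaceConfig) []string {
	env := []string{
		fmt.Sprintf("PLATFORM_URL=%s", cfg.PlatformURL),
		// #67: also inject MOLECULE_URL so the MCP client inside the container
		// stops defaulting to its own localhost:8080.
		fmt.Sprintf("MOLECULE_URL=%s", cfg.PlatformURL),
	}
	return append(env, cfg.ExtraEnv...)
}
```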
**Tests:** 4 new Go tests (`buildContainerEnv` injects both env vars,
MOLECULE_URL always matches PLATFORM_URL across URL shapes, awareness
both-or-nothing, custom envs append). Full provisioner suite green.
88 existing MCP tests still pass (fallback chain preserves existing
behaviour).
**E2E verified live:** rebuilt platform, restarted PM, `docker exec
env` shows both `PLATFORM_URL=http://host.docker.internal:8080` and
`MOLECULE_URL=http://host.docker.internal:8080` on the recreated
container.
**Side-discovery (filed as #85):** enabling `SECRETS_ENCRYPTION_KEY`
on an install with pre-existing plaintext secrets silently breaks
every secret — `crypto.Decrypt` runs GCM on plaintext bytes → fails
`log.Printf + continue` → row dropped → workspace crashes on
preflight. Proposed fix: `encryption_version` column + boot-time
re-encryption migration + fail-loud on decrypt mismatch.
Branch: `fix/67-inject-molecule-url`.
---
## #73 fix — close three real delete-race windows
**Observed symptom (corrected):** During the session's bulk-delete runs,
PM / Research Lead / Dev Lead consistently survived as "stragglers."
Turned out the cause wasn't a race — it was the `DELETE /workspaces/:id`
endpoint returning **HTTP 200** with `{"status":"confirmation_required"}`
when the workspace has children and `?confirm=true` is not set. The
bulk-delete script read HTTP 200 as success and moved on.
**What the #73 fix actually closes:** three real but distinct race
windows that would bite in production even with correct `?confirm=true`
usage:
1. `handlers/registry.go::Register` — `ON CONFLICT DO UPDATE SET
status='online'` ran unconditionally; a late heartbeat from a
workspace that was just soft-deleted (status='removed') could
resurrect the row. Guard added: `WHERE workspaces.status IS
DISTINCT FROM 'removed'`.
2. `handlers/registry.go::Heartbeat` — same UPDATE path had no
filter; late heartbeats refreshed `last_heartbeat_at` on
tombstoned rows (confusing liveness). Guard: `AND status !=
'removed'`. Plus `evaluateStatus` recovery path made conditional
in-SQL (`AND status = 'offline'`).
3. `handlers/workspace.go::Delete` — sequence was Stop container →
UPDATE status='removed'. Between those calls, Redis TTL expiry
could trigger the liveness monitor, which called `RestartByID`,
recreating the container. New order: UPDATE status='removed'
FIRST (for self + descendants as a single batch), THEN stop
containers + remove volumes. Auto-restart paths now see
status='removed' immediately and bail out via their existing
`NOT IN ('removed', ...)` guards.
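The SQL guards for windows 1 and 2 are short enough to sketch. Table and column names are inferred from the description above; the real statements may differ.
```go
// Window 1: the Register upsert must not resurrect a soft-deleted workspace.
const registerSQL = `
INSERT INTO workspaces (id, status, last_heartbeat_at)
VALUES ($1, 'online', now())
ON CONFLICT (id) DO UPDATE
   SET status = 'online', last_heartbeat_at = now()
 WHERE workspaces.status IS DISTINCT FROM 'removed'`

// Window 2: a late heartbeat must not refresh liveness on a tombstoned row.
const heartbeatSQL = `
UPDATE workspaces
   SET last_heartbeat_at = now()
 WHERE id = $1
   AND status != 'removed'`
```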
**Tests:** 2 new registry tests pinning the SQL guards (substring
match on the emitted UPDATE); 2 existing delete tests updated for
the new order (single batch UPDATE covering self+descendants).
Full `go test ./... -race` green.
**Live E2E:** bulk delete of 12 workspaces with `?confirm=true`
→ all cleanly removed, **zero stragglers**, no pending provisions.
**Separate issue filed:** API DX — DELETE should return 4xx (e.g.
409 Conflict) when confirmation is required, not 200. Misleading
status code made the session's symptom diagnosis wrong for hours.
Branch: `fix/73-delete-workspace-race`.
---
## #88 fix — DELETE returns 409 Conflict when confirmation required
**Observed during #73:** bulk-delete scripts that read HTTP 200 as
success silently skipped every parent workspace, leaving tier-3 /
parent nodes behind and looking like a platform race bug.
**Fix:** one-line change in `handlers/workspace.go::Delete` — return
`http.StatusConflict` (409) instead of `http.StatusOK` (200) when
children exist and `?confirm=true` isn't set. Response body shape
unchanged (canvas UI + MCP server both parse the JSON body, not the
status code).
No regressions: canvas (`DetailsTab.tsx:75`) and MCP server
(`mcp-server/src/index.ts:80`) already pass `?confirm=true` on every
delete. The 409 only affects manual API users + bulk scripts that
forgot — exactly the cohort that was silently failing.
**Tests:** 1 existing delete test updated to expect 409. Full
`go test ./...` green.
**Live E2E:** real platform, real parent+child workspaces —
`DELETE /workspaces/:id` (no confirm) returns `http=409` with the
expected JSON body; `DELETE /workspaces/:id?confirm=true` still
returns 200.
Branch: `fix/88-delete-confirm-409`. Closes #88.
---
## #74 fix — retry delegation once after reactive URL refresh
**Clarification of the original issue:** The delegation worker
(`handlers/delegation.go::executeDelegation`) already calls the shared
`h.workspace.proxyA2ARequest(...)` path — so it DOES benefit from the
A2A proxy's reactive health-check / URL-refresh on connection errors.
The real gap is that the reactive refresh runs *after* the current
request fails; the caller still gets an error for that specific
delegation attempt. During bulk restarts (observed 21:40 today), PM's
delegation worker fired during the warm-up window, hit a stale URL,
and the single-attempt logic marked the delegation `failed`.
**Fix:** add a single retry with an 8-second pause when
`proxyA2ARequest` returns a transient-looking error. The pause is
long enough for the reactive refresh + container restart to land a
fresh URL in the cache. `isTransientProxyError` classifies which
statuses retry:
- **502 Bad Gateway** (plain connection failure) — retry
- **503 Service Unavailable** (reactive check decided to restart the
container) — retry
- **404 / 403 / 400 / 500** — static, don't waste the retry window
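A sketch of the retry shape. The classifier's exact inputs are assumptions; it may inspect errors rather than bare status codes.
```go
package handlers

import (
	"net/http"
	"time"
)

// isTransientProxyError marks the statuses worth the single retry: the plain
// connection failure (502) and the "reactive check decided to restart" case (503).
func isTransientProxyError(status int) bool {
	return status == http.StatusBadGateway || status == http.StatusServiceUnavailable
}

// delegateOnce tries the A2A proxy call, and on a transient failure waits ~8s for
// the reactive refresh (and any container restart) to land a fresh URL, then
// retries exactly once.
func delegateOnce(call func() (int, error)) error {
	status, err := call()
	if err == nil {
		return nil
	}
	if !isTransientProxyError(status) {
		return err
	}
	time.Sleep(8 * time.Second)
	_, err = call()
	return err
}
```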
**Tests:** 7 new cases on the classifier matrix + a regression
guard on the 8-second window. Full `go test ./... -race` green.
Branch: `fix/74-delegation-via-a2a-proxy`. Closes #74.
---
## 100% platform coverage — MCP + molecli
Full parity pass so every platform endpoint is reachable from both
client layers.
### MCP server (`mcp-server/src/index.ts`): 61 → 83 tools
**+22 new handlers** added in a single coverage-completion block at
the bottom of the file:
- Delegations (#64): `record_delegation`, `update_delegation_status`
- Activity: `report_activity`, `notify_user`
- Canvas viewport: `get_canvas_viewport`, `set_canvas_viewport`
- Channels (platform-level): `discover_channel_chats`
- Plugins: `list_plugin_sources`, `list_available_plugins`,
`check_plugin_compatibility`
- Schedules (cron): `list_schedules`, `create_schedule`,
`update_schedule`, `delete_schedule`, `run_schedule`,
`get_schedule_history`
- Session + shared context: `session_search`, `get_shared_context`
- K/V memory (distinct from HMA): `memory_set`, `memory_get`,
`memory_list`, `memory_delete_kv`
**Updated schemas:** `create_workspace` + `update_workspace` now
accept `workspace_access` (none / read_only / read_write) + explicit
`runtime` / `workspace_dir` params.
All 88 existing MCP tests still pass; `npm run build` green.
### molecli CLI (`workspace-server/cmd/cli/`): 9 → 21 top-level commands
Two new files:
- `cmd_api.go` — `molecli api <METHOD> <PATH> [json-body]`, a raw
escape hatch. Hits any endpoint without a typed wrapper.
- `cmd_ops.go` — typed subcommands (thin wrappers over shared
`callAPI` helper) for operator ergonomics:
- `ws restart|pause|resume` — lifecycle ops
- `plugin registry|sources|list|available|install|uninstall`
- `secret list|set|delete|list-global|set-global|delete-global`
- `schedule list|add|remove|run|history`
- `channel adapters|list|remove|send|test`
- `approval pending|list|decide`
- `delegation list|create`
- `bundle export|import`
- `org templates|import`
- `traces <workspace-id>`
- `activity list <workspace-id>`
- `hma commit|search`
`go test ./cmd/cli/` passes; live smoke-test against running
platform: `api GET /health`, `plugin sources`, `org templates`,
`ws restart <bad-id>` all return expected responses.
Branch: `feat/mcp-molecli-full-coverage`.
---
## #65 fix — per-agent workspace_access in org.yaml + API
**Design from the ecosystem-research outcomes doc:** new
`workspace_access: none | read_only | read_write` field on every
workspace, enforced at container provision time via Docker's native
`:ro` bind-mount flag. Eliminates the "PM couriers documents to
reports" workaround by letting research agents have read-only repo
access without the write risk.
**Changes:**
- **Migration 019** — adds `workspace_access VARCHAR(20) NOT NULL
DEFAULT 'none'` with CHECK constraint. Additive, all existing rows
become 'none' (current isolated-volume behaviour preserved).
- **`provisioner.go`:**
- New `WorkspaceAccess` field on `WorkspaceConfig`.
- Constants `WorkspaceAccessNone`/`ReadOnly`/`ReadWrite`.
- `buildWorkspaceMount(cfg)` — pure helper, selects between
named-volume, rw bind, and `:ro` bind based on access +
workspace_path.
- `ValidateWorkspaceAccess(access, path)` — rejects `read_*`
without a path and unknown values.
- **`handlers/workspace.go::Create`** and
**`handlers/org.go::createOrgWorkspace`** — validate +
persist `workspace_access` on INSERT. Response body echoes
the stored value.
- **`handlers/workspace_provision.go::buildProvisionerConfig`** —
reads `workspace_access` from DB (with payload override) and
forwards to the provisioner. Restart paths preserve the mode.
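A sketch of the two provisioner helpers. The mount-string format and parameter shapes here are assumptions; the real `buildWorkspaceMount` takes the full config struct.
```go
package provisioner

import "fmt"

const (
	WorkspaceAccessNone      = "none"
	WorkspaceAccessReadOnly  = "read_only"
	WorkspaceAccessReadWrite = "read_write"
)

// ValidateWorkspaceAccess rejects read_* without a host path and unknown values.
func ValidateWorkspaceAccess(access, path string) error {
	switch access {
	case WorkspaceAccessNone:
		return nil
	case WorkspaceAccessReadOnly, WorkspaceAccessReadWrite:
		if path == "" {
			return fmt.Errorf("workspace_access %q requires workspace_dir", access)
		}
		return nil
	default:
		return fmt.Errorf("unknown workspace_access %q (allowed: none, read_only, read_write)", access)
	}
}

// buildWorkspaceMount picks between an isolated named volume, a rw bind, and a :ro bind.
func buildWorkspaceMount(workspaceID, access, hostPath string) string {
	switch access {
	case WorkspaceAccessReadWrite:
		return hostPath + ":/workspace"
	case WorkspaceAccessReadOnly:
		return hostPath + ":/workspace:ro"
	default:
		return "ws-" + workspaceID + "-data:/workspace" // current isolated-volume behaviour
	}
}
```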
**Tests:**
- Provisioner: 2 new tables — `TestBuildWorkspaceMount_SelectionMatrix`
(6 cases covering the full access × path matrix) and
`TestValidateWorkspaceAccess` (7 cases).
- Handler INSERT WithArgs updated across 5 existing tests for the
new 9th column.
- Full `go test ./... -race` green.
**Live E2E:**
- Migration auto-applied → `workspaces` table has `workspace_access`
with the CHECK constraint.
- `POST /workspaces {"workspace_access":"read_only","workspace_dir":"/repo"}`
→ 201 with `"workspace_access":"read_only"` echoed; DB row correct.
- `POST {"workspace_access":"read_only"}` (no workspace_dir) → 400
with clear error.
- `POST {"workspace_access":"wildcard"}` → 400 with allowed-values
list.
- Container inspected after provision: `/workspace` mount has
`RW=false Mode=ro`; `touch /workspace/foo` from inside returns
`Read-only file system` → enforcement is real.
Branch: `feat/65-workspace-access-yaml`. Closes #65.
---
## #64 fix — agent registers delegations with platform (Option A)
**Root cause (confirmed in comment on #64):** `check_delegation_status`
reads from the agent's local `_delegations` dict; platform's
`GET /workspaces/:id/delegations` reads from `activity_logs`. The
agent's `delegate_to_workspace` MCP tool sends A2A directly and
never touches `activity_logs` — so the platform's view was always empty
for agent-initiated delegations.
**Fix (minimal Option A, dual-write):**
- Platform: two new endpoints on `DelegationHandler`
- `POST /workspaces/:id/delegations/record` — inserts a single
`activity_logs` row with `method='delegate'`, status='dispatched'.
No A2A fired (agent does that directly for OTEL/retry reasons).
- `POST /workspaces/:id/delegations/:delegation_id/update` — accepts
`status ∈ {completed, failed}` + optional error + preview. UPDATEs
the original row and (on completion) INSERTs a `delegate_result`
row matching the canvas-path flow.
- Agent (`workspace/builtin_tools/delegation.py`):
- New best-effort async helpers `_record_delegation_on_platform`
and `_update_delegation_on_platform`. Failures are logged at debug
and swallowed — never block the actual A2A delegation path.
- `_execute_delegation` calls `_record_...` at task start and
`_update_...` on completion / failure (alongside the existing
`_notify_completion`).
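A sketch of the platform-side record endpoint, assuming the gin + database/sql stack used elsewhere in the handlers; the request-body fields and the `activity_logs` column list are illustrative only.
```go
package handlers

import (
	"database/sql"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
)

type DelegationHandler struct {
	db *sql.DB
}

// RecordDelegation mirrors an agent-initiated delegation into activity_logs.
// No A2A is fired here: the agent keeps sending A2A directly for OTEL and retry reasons.
func (h *DelegationHandler) RecordDelegation(c *gin.Context) {
	workspaceID := c.Param("id")
	if _, err := uuid.Parse(workspaceID); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid workspace id"})
		return
	}
	var req struct {
		TargetWorkspaceID string `json:"target_workspace_id"`
		Task              string `json:"task"`
	}
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}
	_, err := h.db.ExecContext(c.Request.Context(),
		`INSERT INTO activity_logs (workspace_id, target_workspace_id, method, status, detail)
		 VALUES ($1, $2, 'delegate', 'dispatched', $3)`,
		workspaceID, req.TargetWorkspaceID, req.Task)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}
	c.JSON(http.StatusCreated, gin.H{"status": "dispatched"})
}
```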
**Result:** agent keeps direct A2A for speed + OTEL trace-context
propagation + existing retry logic; platform's activity_logs mirrors
the same set the agent's local dict holds. `GET /delegations` now
returns rows for agent-initiated delegations.
**Tests:** 5 new Go tests (Record inserts + rejects invalid UUID,
UpdateStatus completed inserts result row + rejects unknown status +
failed broadcast). 4 new Python tests (record fires HTTP POST, best-
effort on platform error, update completed, update truncates large
preview to 500 chars). Python pytest 1060 → 1064; full Go suite green.
Branch: `fix/64-agent-delegate-via-platform`. Closes #64.
---
## SDK — workspace / org / channel validators
**Issue:** SDK only validated plugins. Authors publishing
workspace config templates, org-templates, or channel configs had no
lint step — errors only surfaced at `POST /org/import` or container
startup.
**Fix:** extended `sdk/python/molecule_plugin/` with three new modules:
- `workspace.py` — validates `config.yaml` (name, runtime, tier,
runtime_config shape). `SUPPORTED_RUNTIMES` kept in sync with
`provisioner.RuntimeImages`.
- `org.py` — recursively validates `org.yaml` (name, workspaces tree,
workspace_access + workspace_dir pairing per #65, channels via
delegated `validate_channel_config`, schedules, plugins, external+url,
children).
- `channel.py` — validates channel configs (standalone dict or YAML
file). `SUPPORTED_CHANNEL_TYPES` currently `{telegram}`; extend when
Slack/Discord adapters land.
CLI (`python -m molecule_plugin validate {plugin|workspace|org|channel} <path>`)
dispatches to the right validator; bare `validate <path>` still defaults
to plugin for back-compat. Exit 0 on valid, 1 on any error.
`validate_channel_config` is the single source of truth for channel
schema — `org.py` delegates to it rather than duplicating checks.
**Tests:** `sdk/python/tests/test_validators.py` — 37 new tests (happy,
missing file, bad YAML, non-object, each field error, null-safety on
`runtime_config: None` / `defaults: null`, CLI dispatch for all 4 kinds,
back-compat form). Fixed bug found during test authoring: `org.py` crashed
on non-dict children; now guarded with `isinstance` check.
**Live smoke:** all 4 in-repo org templates (`free-beats-all`,
`reno-stars`, `molecule-dev`, `molecule-worker-gemini`) validate clean.
**SDK pytest:** 50 → 87. Branch: `feat/sdk-workspace-org-channel`.
---
## Top-5 #3 — parallel adapter builds
DevOps proposal from the ecosystem-research outcomes doc. All six
adapter Dockerfiles build `FROM workspace-template:base` with no
inter-adapter dependency, so they're safe to build concurrently once
the base is done.
**Change** (`workspace/build-all.sh`):
- Serial path kept for single-runtime rebuilds and `SERIAL_BUILD=1`
CI environments (preserves bounded-concurrency option).
- Parallel path: fan out one `docker build` per adapter, capture
stdout/stderr to `/tmp/build_<tag>.log`, wait for all, tally
per-tag success/failure. Failures still exit non-zero.
**E2E:** `bash build-all.sh claude-code deepagents langgraph`
finished in **43s wall-clock** (three adapter builds running
concurrently). Previously ~120s serial. Log files live under
`/tmp/build_*.log` for post-hoc debugging.
Branch: `feat/top5-3-parallel-adapter-builds`.