
# 2026-04-12
## Summary
Shipped the full two-axis plugin architecture on `feat/agentskills-compliance`
(PR #62). **Plugin source** (where files come from) and **plugin shape**
(what's inside them) are now independent, pluggable axes.
- **Source axis** — `workspace-server/internal/plugins/` package: `SourceResolver`
interface, `Registry`, `LocalResolver`, `GithubResolver`, `ParseSource` (sketched after this list).
`POST /workspaces/:id/plugins` accepts `{name}` (back-compat → local) or
`{source: "scheme://spec"}`. New `GET /plugins/sources` enumerates
registered schemes.
- **Shape axis** — `workspace/plugins_registry/` package:
`PluginAdaptor` protocol, hybrid resolver (registry > plugin-shipped >
raw-drop), `AgentskillsAdaptor` built-in for agentskills.io-format
skills + Molecule AI's rules extension. Named sub-type adapters planned
for MCP, DeepAgents sub-agents, LangGraph sub-graphs, etc.
- **agentskills.io compliance** — every first-party skill passes the
open standard; `python -m molecule_plugin validate` CLI enforces it
in CI. Our skills are now installable in ~35 other agent tools
(Cursor, Codex, Copilot, Gemini CLI, etc.).
- **Gemini org parity** — `molecule-worker-gemini` mirrors `molecule-dev`
(11 workspaces, Research + Dev branches, schedules, Telegram channel,
per-agent prompts) as the E2E proof point.
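For the source axis, a minimal Go sketch of how the pieces described above could fit together. The `SourceResolver`, `Registry`, and `ParseSource` names come from this log; the exact signatures, fields, and error strings are assumptions, not the shipped code.
```go
package plugins

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// SourceResolver stages a plugin's files and returns the staged directory.
// (Interface shape assumed.)
type SourceResolver interface {
	Resolve(spec string) (stagedDir string, err error)
}

// Registry maps scheme -> resolver; the RWMutex reflects review round 3.
type Registry struct {
	mu        sync.RWMutex
	resolvers map[string]SourceResolver
}

func (r *Registry) Register(scheme string, res SourceResolver) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.resolvers == nil {
		r.resolvers = map[string]SourceResolver{}
	}
	r.resolvers[scheme] = res
}

// Schemes returns the registered schemes sorted, matching the sort.Strings change.
func (r *Registry) Schemes() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.resolvers))
	for s := range r.resolvers {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

// ParseSource splits "scheme://spec", trimming whitespace and rejecting an
// empty spec, as exercised in the live E2E checks below.
func ParseSource(source string) (scheme, spec string, err error) {
	source = strings.TrimSpace(source)
	scheme, spec, ok := strings.Cut(source, "://")
	if !ok || scheme == "" {
		return "", "", fmt.Errorf("source %q is not of the form scheme://spec", source)
	}
	if strings.TrimSpace(spec) == "" {
		return "", "", fmt.Errorf("empty spec after %q", scheme)
	}
	return scheme, strings.TrimSpace(spec), nil
}
```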
## Files touched
Platform (Go):
- `workspace-server/internal/plugins/{source,local,github}.go` + tests — source
layer, 97.4% coverage.
- `workspace-server/internal/envx/envx.go` + test — env-var helpers, 100%
coverage (sketched below).
- `workspace-server/internal/handlers/plugins.go` — install pipeline refactored
into `resolveAndStage` + `deliverToContainer`; typed `httpErr` for
status propagation; `sort.Strings` in `Registry.Schemes`; `logInstallLimitsOnce`
on startup.
- `workspace-server/internal/router/router.go` — new routes (`/plugins/sources`,
`/workspaces/:id/plugins/available`, `/workspaces/:id/plugins/compatibility`).
- `workspace-server/Dockerfile` — `apk add git` for the github resolver.
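A rough sketch of what the `internal/envx` helpers might look like. The helper names follow the `envDuration`/`envInt64` extraction noted in review round 4; exported names, defaults, and fallback-on-parse-error behaviour are assumptions.
```go
// Package envx holds the env-var helpers that back the PLUGIN_INSTALL_* limits
// (body size, fetch timeout, staged-dir size) logged at startup.
package envx

import (
	"os"
	"strconv"
	"time"
)

// Duration reads key as a time.Duration, falling back to def when unset or malformed.
func Duration(key string, def time.Duration) time.Duration {
	raw := os.Getenv(key)
	if raw == "" {
		return def
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return def
	}
	return d
}

// Int64 reads key as an int64 with the same fallback policy.
func Int64(key string, def int64) int64 {
	raw := os.Getenv(key)
	if raw == "" {
		return def
	}
	n, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return def
	}
	return n
}
```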
Workspace runtime (Python):
- `workspace/plugins_registry/` — new module: `protocol.py`,
`builtins.py` (`AgentskillsAdaptor`), `raw_drop.py`, resolver.
- `workspace/skill_loader/` — renamed from `skills/`; reads
`scripts/` per the agentskills.io spec.
- `workspace/builtin_tools/` — renamed from `tools/` to
disambiguate from user-plugin tool dirs.
- `workspace/adapters/base.py` — added hooks: `memory_filename`,
`register_tool_hook`, `register_subagent_hook`, `append_to_memory_hook`,
`install_plugins_via_registry`. Default `inject_plugins()` drives the
new pipeline.
- `workspace/adapters/claude_code/adapter.py` — deleted the
40-line `inject_plugins()` override.
- `workspace/adapters/deepagents/Dockerfile` — ships
`plugins_registry/`.
- `workspace/plugins.py` — `PluginManifest.runtimes` field.
Plugins (content):
- `plugins/*/adapters/{claude_code,deepagents}.py` — one-line
`from plugins_registry.builtins import AgentskillsAdaptor as Adaptor`.
- `plugins/*/plugin.yaml` — declare `runtimes: [claude_code, deepagents]`.
SDK (Python):
- `sdk/python/molecule_plugin/` — `protocol.py`, `builtins.py`
(SDK-vendored `AgentskillsAdaptor`), `manifest.py` (spec validator), CLI
via `__main__.py`.
- `sdk/python/template/` — cookiecutter skeleton.
Org templates:
- `org-templates/molecule-worker-gemini/org.yaml` — full parity with
`molecule-dev` (11 workspaces, schedules, Telegram, per-agent
prompts, `workspace_dir` mount on PM, `required_env: [GOOGLE_API_KEY]`).
- Copied 5 `system-prompt.md` files from molecule-dev (research-lead,
market-analyst, technical-researcher, competitive-intelligence,
uiux-designer).
Docs:
- `docs/plugins/agentskills-compat.md` — two-layer model, spec mapping.
- `docs/plugins/sources.md` — two-axis source/shape architecture,
security model, future resolvers.
- `docs/ecosystem-watch.md` — Holaboss, Hermes Agent, gstack entries
(adjacent projects to track).
- `.env.example` — `PLUGIN_INSTALL_*` vars documented.
- `PLAN.md` — plugin-adaptor landed; deferred items listed.
- `CLAUDE.md` — new endpoints, env vars, test counts.
## Test counts
- Go platform: all packages green under `-race`.
- Python workspace: 1040 passed, 9 skipped.
- Python SDK: 50 passed.
- Total: **1090 passing**.
Coverage on new code:
- `workspace-server/internal/plugins/*`: 97.4%
- `workspace-server/internal/envx/*`: 100%
- `workspace/plugins_registry/*`: 100%
- `workspace/skill_loader/*`: 100%
- `sdk/python/molecule_plugin/*`: 100%
## 5 rounds of code review
Every round was addressed by new commits on the branch:
1. Round 1 — initial coverage pass.
2. Round 2 — `memory_filename` plumbing through `InstallContext`;
`logger` in `skill_loader`; module constants for `SKILLS_SUBDIR`,
`SKIP_ROOT_MD`, `SKILL_NAME_*`; SDK↔runtime drift-guard test;
frontmatter parser unification.
3. Round 3 — fetch timeout + body size cap + staged-dir size cap via
new env vars; typed `ErrPluginNotFound` sentinel replaces string
matching; reject both `name`+`source`; `sort.Strings` in Schemes;
`sync.RWMutex` on Registry; `--` in git clone; docs clarify
github resolver is public-only.
4. Round 4 — `ParseSource` empty-spec guard; `dirSize(cap)` → `dirSize(limit)`;
`localNameRE` length bound; extract `envDuration`/`envInt64` into
`internal/envx`; `LANG=C LC_ALL=C` in git child env for locale-
stable error parsing.
5. Round 5 — typed `httpErr` replaces 5-value tuple; `resolveAndStage`
decoupled from `*gin.Context` via `installRequest` struct; drop
unused `source` param from `deliverToContainer`; trim whitespace in
`ParseSource`; consolidate 3 test resolver stubs into 1
parameterized `fakeResolver` + 3 constructors.
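As a sketch of the round-5 `httpErr` pattern. The field names and helper functions are assumptions; only the idea of carrying a status inside a plain `error` is from the review notes.
```go
package handlers

import (
	"errors"
	"fmt"
	"net/http"
)

// httpErr lets resolveAndStage return a single error value while still telling
// the handler which HTTP status to send, replacing the old 5-value tuple.
type httpErr struct {
	status int
	msg    string
}

func (e *httpErr) Error() string { return e.msg }

// badRequest is a hypothetical constructor for 400-class failures.
func badRequest(format string, args ...any) error {
	return &httpErr{status: http.StatusBadRequest, msg: fmt.Sprintf(format, args...)}
}

// statusOf maps any error back to an HTTP status, defaulting to 500.
func statusOf(err error) int {
	var he *httpErr
	if errors.As(err, &he) {
		return he.status
	}
	return http.StatusInternalServerError
}
```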
## Live E2E confirmed
- `GET /plugins/sources` → `{"schemes":["github","local"]}`.
- `POST {"name":"molecule-dev"}` → installed via local (back-compat).
- `POST {"source":"local:// molecule-dev "}` → installed
(whitespace trimmed).
- `POST {"name":"a","source":"local://b"}` → 400 "not both".
- `POST {"source":"github://"}` → 400 "empty spec after 'github'".
- `POST {"source":"mystery://x"}` → 400 + `available_schemes: [...]`.
- Uninstall + reinstall on PM workspace: CLAUDE.md has
`# Plugin: molecule-dev / rule: codebase-conventions.md` marker;
`/configs/skills/review-loop/` present; zero container errors.
- Startup log on platform boot: `Plugin install limits: body=65536
bytes timeout=5m0s staged=104857600 bytes`.
## Branch
`feat/agentskills-compliance` → PR #62 (open, all CI green, ready to
merge). Use `git log --oneline origin/main..` for the commit list —
counting commits inline goes stale fast.
---
## Post-merge session — team coordination, platform hardening, new backlog
After PR #62 landed, the session continued with shipping ecosystem-watch, a
gemini-org proof-point attempt, and a PLAN.md refresh coordinated through
the agent team. Several platform bugs surfaced; all filed and tracked.
### Shipped
- **A2A proxy regression fix (PR #59)** — PR #59 had rewritten
`http://127.0.0.1:<port>` → `http://ws-<id>:8000` unconditionally,
breaking platform-on-host mode. The rewrite is now gated behind `platformInDocker`
detection (`/.dockerenv` or `MOLECULE_IN_DOCKER=1`). `workspace-server/internal/handlers/a2a_proxy.go`.
Commit `4b42913`.
- **PR #61** — `docs/ecosystem-watch.md`: Holaboss / Hermes / gstack
entries + template + backlog candidates. Merged.
- **Cross-references for ecosystem-watch** — wired into `PLAN.md` (new
"Ecosystem Awareness" section), `README.md` + `README.zh-CN.md`
Documentation Map, and `CLAUDE.md` (new "Ecosystem Context" section).
Agents couldn't discover the doc because it wasn't linked anywhere;
PM reported it missing despite being in its bind mount. Commit `8ae5e73`.
- **DeepAgents adapter: `virtual_mode=False`** in
`workspace/adapters/deepagents/adapter.py`. Previously
`read_file`/`ls`/`write_file`/`edit_file` operated on an in-memory
snapshot that drifted from the bind-mounted `/workspace`; writes
didn't persist across restarts and real files were reported as missing.
Commit `bc563d1`.
- **LangGraph recursion limit 100 → 500** default in
`workspace/a2a_executor.py`. PM fan-out to 6+ reports routinely
overran the 100-step ceiling. Still overridable via
`LANGGRAPH_RECURSION_LIMIT` env var. Commit `d892eb4`.
- **Gemini org model swap** `gemini-3.1-pro-preview` →
`gemini-2.5-pro` in `org-templates/molecule-worker-gemini/org.yaml`
(3.1-pro-preview's 25 req/min couldn't sustain 11-workspace delegation
waves). Commit `4b42913`.
- **Backlog tracking** for #64 / #65 added to `PLAN.md` Backlog. Commit `ba1cc15`.
### Open PRs (awaiting CEO approval)
- **#68** `docs/plan-refresh` — PLAN.md refresh: correct test counts
(Canvas 325→345, Python 990→1,040, +SDK row 50, total 1,811→1,911),
promote #66/#67 to backlog with actual issue content. Coordinated
with the molecule-dev team; corrected PM's hallucinated content for
#66/#67 before opening the PR.
- **#69** `chore/team-system-prompts-hardening` — harden PM / Dev Lead /
Research Lead system prompts with hard-learned rules from today's
coordination incident (15 rules total across 3 roles). Every rule
maps to a specific failure we hit today.
### New platform issues filed
- **#64** — `GET /workspaces/:id/delegations` returns `[]` while the
agent-side `check_delegation_status` tool shows 4 delegations.
Sources-of-truth mismatch. Bug.
- **#65** — Per-agent repo-access config in `org.yaml`. New
`workspace_access: none | read_only | read_write` field +
`:ro` bind-mount for research agents. Eliminates the
"PM couriers documents to reports" workaround. Enhancement.
- **#66** — `claude_sdk_executor.py` swallows subprocess stderr on
CLI exit ≠ 0. Every failure surfaces the same opaque
`"Command failed with exit code 1 / Check stderr output for details"`.
High-priority bug; blocked real debugging today.
- **#67** — Agent MCP client defaults to `http://localhost:8080`,
which inside a workspace container is the container itself.
Inject `MOLECULE_URL=${PLATFORM_URL}` at provision time. High-priority
bug; blocked PM from restarting its own reports.
### Gemini org — proof-point attempt, rolled back
Deployed molecule-worker-gemini (11 DeepAgents workspaces), exercised
the full delegation tree, hit three distinct blockers:
1. `virtual_mode=True` made PM report real files as missing (fixed
in `bc563d1` above).
2. LangGraph recursion limit 100 tripped on PM fan-out (fixed in
`d892eb4` above).
3. Google AI Studio's **monthly spending cap** was exhausted for the whole
project after repeated retries.
Rolled back to molecule-dev (Claude Code runtime) to finish the
PLAN.md refresh task.
### Session-state contamination note
After a `ProcessError` crash on a Claude Code workspace, subsequent
A2A calls to that workspace keep failing identically until the
workspace is restarted — even when the same SDK query run manually
from inside the container succeeds. Root cause likely session
resume state in the executor. Workaround: restart on `ProcessError`.
Worth formalizing in the executor as an auto-reset on `exit_code != 0`
once #66 lands and we can see the real stderr.
### Rules distilled for the team (now encoded in #69)
- Never commit to `main` — always a feature branch + PR.
- Verify external refs (issue numbers, PRs, SHAs, file paths) before
citing them.
- Inline documents into every sub-delegation — reports don't have the
repo mount.
- `delegation.status == completed` ≠ work was done.
- Pause ~60s after a batch restart before delegating (warm-up race).
- Quote errors verbatim, don't paraphrase.
- Research Lead must always fan out — solo synthesis is a role failure.
---
## #71 fix — initial_prompt marker written up-front
**Root cause:** `main.py` previously wrote `/workspace/.initial_prompt_done`
only AFTER the initial_prompt self-send succeeded. If the prompt crashed
(any ProcessError, network failure, SDK exit), the marker was never
written — the next container boot replayed the same failing prompt and
cascaded into "every message crashes" until an operator intervened.
Observed three times on 2026-04-12 (gemini org + molecule-dev import +
post-restart).
**Fix (extracted from main.py into `workspace/initial_prompt.py`
so it's unit-testable without uvicorn):**
- `resolve_initial_prompt_marker(config_path)` — prefer `<config>/...`
when writable, fall back to `/workspace/...`.
- `mark_initial_prompt_attempted(marker_path)` — best-effort write,
returns `True`/`False` so the caller can log a loud warning on I/O
failure.
- `main.py` calls `mark_initial_prompt_attempted` **before** scheduling
the self-send. The post-send marker write is removed.
**Semantic change:** the prompt is attempted at most once per fresh boot;
if it fails, operators re-send manually via chat. The trade-off: silent
auto-retry-on-restart (which could cascade) is exchanged for a one-time
attempt with a loud failure log.
**Tests:** 5 new unit tests in `tests/test_main_initial_prompt.py`, 100%
coverage on `initial_prompt.py`. Live E2E verified all 12 containers
write the marker up-front and no replay occurs on restart. Manual
browser test via canvas chat against Research Lead returned the
expected reply — full round-trip through the UI.
Branch: `fix/71-initial-prompt-marker-at-start`. Closes #71.
---
## #66 fix — surface Claude SDK subprocess stderr + exit_code
**Root cause:** `claude_sdk_executor.py` caught `ProcessError` but
extracted only `str(exc)`, which for a crashing CLI reads "Command
failed with exit code 1 (exit code: 1) / Error output: Check stderr
output for details". The SDK's `ProcessError` actually carries
`.exit_code` and `.stderr` attributes — we were silently dropping both.
Every CLI crash looked identical and required ad-hoc reproduction
inside the container to diagnose.
**Fix:** new `_format_process_error(exc)` helper that extracts
`type(exc).__name__`, `exc.exit_code`, and `exc.stderr` (capped at
`_PROCESS_ERROR_STDERR_MAX_CHARS = 4096` to prevent log flooding).
Called in the retry loop (`logger.warning`) and the terminal error
path (`logger.error` + `logger.exception` for the full traceback).
Plain exceptions without SDK attributes fall back to `str(exc)`; no
crash on missing attrs.
**Tests:** 5 new unit tests in `tests/test_claude_sdk_executor.py`
(format with full context / truncation / plain exception / exit-code
only / end-to-end via `execute()` with caplog). Python pytest 1050 →
1055.
**E2E:** rebuilt `workspace-template:claude-code`, restarted an agent,
ran `_format_process_error` with a real
`claude_agent_sdk._errors.ProcessError(exit_code=2, stderr='disk full: /tmp')`
inside the live container → output shows both `exit_code=2` and the stderr verbatim.
**Manual browser:** canvas chat against Research Lead — reply
`BROWSER-OK-66` returned cleanly, full UI round-trip works with the
new log format live.
Branch: `fix/66-capture-claude-sdk-stderr`. Closes #66.
---
## #75 fix — auto-reset session_id on subprocess-level errors
**Root cause:** after a `ProcessError` (or `CLIConnectionError`), the
executor's `self._session_id` still points at the dead session. On the
next call, `_build_options()` passes `resume=<stale-id>` to the SDK,
which boots a new subprocess that can't resume the prior session state
— and crashes again. Observed as "crashed once → crashes forever" on
2026-04-12 across PM / RL / DL in the coordination runs.
**Fix:** new `_reset_session_after_error(exc)` method clears
`self._session_id` when the exception looks subprocess-level
(`ProcessError`, `CLIConnectionError`, has `exit_code` attribute, or
message contains "exit code"). Rate-limit / capacity errors are left
alone so normal retry preserves conversational continuity. Called in
the retry loop, right after `_format_process_error` logs the context.
**Tests:** 5 new tests in `tests/test_claude_sdk_executor.py` — clears
on ProcessError / preserves on rate-limit / no-op when session_id is
already None / triggers on "exit code" message only / end-to-end via
`execute()` with `caplog` + spy-on-`_build_options` asserting that the
second retry attempt sees `session_id=None` rather than the stale ID.
Python pytest 1055 → 1060.
**E2E:** verified in live container — `_reset_session_after_error`
clears a stale session on ProcessError, preserves it on rate-limit.
**Manual browser:** canvas chat round-trip on Research Lead — message
went through and agent responded normally. Zero ProcessError
indicators.
Branch: `fix/75-session-reset-on-process-error`. Closes #75.
---
## Top-5 #1 — Memory FTS + namespace scoping
Backend proposal from the ecosystem-research outcomes doc; the
highest-convergence team ask (BE + FE + QA + UX all independently proposed
some flavour of this).
**Migration `017_memories_fts_namespace.up.sql`:**
- `agent_memories.namespace VARCHAR(50) NOT NULL DEFAULT 'general'`
- `agent_memories.content_tsv tsvector` (STORED generated column from
`to_tsvector('english', content)`)
- `idx_memories_fts` (GIN on `content_tsv`)
- `idx_memories_ns` (composite on `workspace_id, namespace`)
**Handler `workspace-server/internal/handlers/memories.go`:**
- `POST /workspaces/:id/memories` accepts optional `namespace` (default
`"general"`, 50-char max validated at the handler).
- `GET /workspaces/:id/memories?q=...` routes multi-char queries
through `content_tsv @@ plainto_tsquery('english', ?)` with
`ts_rank` ordering; single-char queries fall back to `ILIKE`
(tsvector can't tokenise single chars in the 'english' config).
- `GET /workspaces/:id/memories?namespace=...` filters regardless of
scope.
- Response always includes the `namespace` field.
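A sketch of the FTS-vs-ILIKE routing described above. Table and column names follow migration 017; the exact SQL text and the surrounding handler plumbing are assumptions.
```go
// buildMemoryQuery routes multi-character searches through Postgres FTS with
// ts_rank ordering, and single-character searches through ILIKE, since the
// 'english' tsvector config cannot tokenise a single character.
func buildMemoryQuery(q string) (sqlText string, arg string) {
	const base = `SELECT id, namespace, content
	                FROM agent_memories
	               WHERE workspace_id = $1 AND `
	if len([]rune(q)) > 1 {
		return base + `content_tsv @@ plainto_tsquery('english', $2)
		            ORDER BY ts_rank(content_tsv, plainto_tsquery('english', $2)) DESC`, q
	}
	return base + `content ILIKE $2`, "%" + q + "%"
}
```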
**Tests:** 5 existing tests updated for the new column list; 4 new
tests added (commit-with-namespace, namespace-too-long, FTS path,
ILIKE fallback, namespace filter). Handler test suite passes.
**E2E (live Postgres + running platform):**
- Platform restart applied migration 017 → column + indexes present.
- `POST` with / without namespace → both work, default kicks in.
- `?q=zinc+theme` → FTS returns reference memory.
- `?namespace=procedures` → scoped retrieval works.
- `?q=restart&namespace=procedures` → combined filter works.
Branch: `feat/memory-fts-namespace`.
---
## Top-5 #5 — Fail-secure encryption at boot
Security Auditor's top proposal from the outcomes doc. The platform
previously booted without `SECRETS_ENCRYPTION_KEY` and silently stored
workspace secrets in plaintext with only a WARNING log. OWASP A02:2021
(Cryptographic Failures) / STRIDE "Information Disclosure".
**Fix** (`workspace-server/internal/crypto/aes.go`):
- New `InitStrict() error` variant that returns `ErrEncryptionKeyMissing`
when `MOLECULE_ENV=prod`/`production` and the key is unset, malformed,
or the wrong length. Existing `Init()` retained for any callers that
prefer the warn-and-continue behaviour; only `cmd/server/main.go`
switched to the strict variant.
- `isProdEnv()` accepts `prod`, `production`, case-insensitive + trimmed.
- `loadKeyFromEnv` refactor: one helper returns the parse error so both
entry points can format it the same way.
**`cmd/server/main.go`:** `crypto.InitStrict()` + `log.Fatalf` on error.
Local dev (no `MOLECULE_ENV`) keeps the existing warn-and-continue.
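A compressed sketch of the fail-secure shape. `InitStrict`, `ErrEncryptionKeyMissing`, and `isProdEnv` are named in this entry; the key decoding details and log text here are assumptions.
```go
package crypto

import (
	"encoding/base64"
	"errors"
	"fmt"
	"log"
	"os"
	"strings"
)

var ErrEncryptionKeyMissing = errors.New("SECRETS_ENCRYPTION_KEY missing or invalid in production")

func isProdEnv() bool {
	env := strings.ToLower(strings.TrimSpace(os.Getenv("MOLECULE_ENV")))
	return env == "prod" || env == "production"
}

// InitStrict refuses to start in prod without a usable key; outside prod it keeps
// the historical warn-and-continue behaviour that Init provides.
func InitStrict() error {
	raw := strings.TrimSpace(os.Getenv("SECRETS_ENCRYPTION_KEY"))
	key, decErr := base64.StdEncoding.DecodeString(raw)
	if raw == "" || decErr != nil || len(key) != 32 {
		if isProdEnv() {
			return fmt.Errorf("%w: got %d key bytes (decode error: %v)", ErrEncryptionKeyMissing, len(key), decErr)
		}
		log.Printf("WARNING: SECRETS_ENCRYPTION_KEY unset or invalid; secrets will be stored in plaintext")
		return nil
	}
	// ... initialise the AES-256-GCM AEAD with the 32-byte key (omitted) ...
	return nil
}
```
`cmd/server/main.go` then just wraps it: `if err := crypto.InitStrict(); err != nil { log.Fatalf("refusing to start: %v", err) }`.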
**Tests:** 6 new tests in `internal/crypto/aes_test.go`:
- fails in prod when key is missing
- fails in prod on wrong-length key
- succeeds in prod with valid key
- allows dev mode without key (ergonomics)
- allows staging without key (non-prod)
- isProdEnv case-insensitivity table
**E2E:** `/tmp/platform-failsec` binary run with `MOLECULE_ENV=prod` +
empty key → `log.Fatalf` triggers, platform refuses to start. Same
binary with `MOLECULE_ENV=prod` + valid base64 key → boots, prints
"AES-256-GCM enabled", serves 200 on `/health`.
Branch: `fix/top5-5-fail-secure-encryption`.
---
## #85 fix — encryption_version column + DecryptVersioned
**Root cause (from the investigation):** rows in `workspace_secrets` /
`global_secrets` are tagged as `encrypted_value bytea` but whether
they're *actually* encrypted depends entirely on whether
`SECRETS_ENCRYPTION_KEY` was set at the moment of `Encrypt`
`crypto.Encrypt` short-circuits and returns plaintext bytes when
encryption is disabled. Switching on the key later makes
`crypto.Decrypt` try GCM on plaintext bytes → fails → provisioner
silently skips the row → container crashes on missing OAuth token.
With PR #83 (fail-secure) pushing operators toward setting the key,
this trap was about to start biting real installs.
**Fix:**
- Migration `018_secrets_encryption_version` adds
`encryption_version INT NOT NULL DEFAULT 0` to both secret tables.
All existing rows become `version=0` (plaintext). Additive, safe.
- `crypto/aes.go`:
- `EncryptionVersionPlaintext = 0`, `EncryptionVersionAESGCM = 1` constants.
- `CurrentEncryptionVersion()` — tells callers which tag to write.
- `DecryptVersioned(value, version)` — dispatches on tag; `v=0`
passes through, `v=1` runs GCM (and errors if `IsEnabled()` is
false). Unknown version → clear error.
- Existing `Decrypt` deprecated-in-comment but kept for callers
that haven't migrated (backward-compat during transition).
- `handlers/workspace_provision.go`: SELECT now pulls
`encryption_version`; decrypt uses `DecryptVersioned`; on failure
**aborts provisioning with a loud FATAL log + marks workspace
failed** (#66-style silent-failure removed).
- `handlers/secrets.go`: both `Set` and global `SetGlobalSecret`
persist `encryption_version = CurrentEncryptionVersion()` on
INSERT. `ON CONFLICT` also updates the version — re-setting a
historical plaintext row while a key is active upgrades it to
GCM in-place.
- `handlers/secrets.go::GetModel`: SELECT pulls version, uses
`DecryptVersioned`.
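A sketch of the version dispatch inside `internal/crypto`. The constants, `CurrentEncryptionVersion`, and `DecryptVersioned` are named above; it assumes the package's existing `IsEnabled()` and `Decrypt()` helpers and is illustrative rather than the exact shipped code.
```go
const (
	EncryptionVersionPlaintext = 0
	EncryptionVersionAESGCM    = 1
)

// CurrentEncryptionVersion tells writers which tag to persist alongside the value.
func CurrentEncryptionVersion() int {
	if IsEnabled() {
		return EncryptionVersionAESGCM
	}
	return EncryptionVersionPlaintext
}

// DecryptVersioned dispatches on the stored tag instead of guessing from the bytes,
// which is exactly the guessing that produced the #85 failure mode.
func DecryptVersioned(value []byte, version int) ([]byte, error) {
	switch version {
	case EncryptionVersionPlaintext:
		return value, nil // historical rows written while no key was configured
	case EncryptionVersionAESGCM:
		if !IsEnabled() {
			return nil, fmt.Errorf("row is AES-GCM encrypted but no SECRETS_ENCRYPTION_KEY is configured")
		}
		return Decrypt(value) // existing GCM path
	default:
		return nil, fmt.Errorf("unknown encryption_version %d", version)
	}
}
```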
**Tests:** 6 new crypto tests (plaintext pass-through, GCM round-trip,
GCM requires key, unknown version rejected, `CurrentEncryptionVersion`
tracks key state, the exact #85 scenario end-to-end). 6 existing
secret handler tests updated for the 4-arg INSERT. Full Go test suite
passes.
**E2E (live):**
- Migration applied automatically on platform boot: `encryption_version`
column present on both tables.
- 102 pre-existing plaintext rows correctly tagged `version=0`.
- New `TEST_NEW_SECRET_85` stored as 39 bytes (11 plaintext + 12 nonce
+ 16 tag = ✓) with `version=1`.
- PM container restart succeeds — both `CLAUDE_CODE_OAUTH_TOKEN`
(v=0 historical plaintext) AND `TEST_NEW_SECRET_85` (v=1 encrypted)
are decrypted correctly and injected into the container env.
Branch: `fix/85-encryption-version-migration`. Closes #85.
---
## #67 fix — inject MOLECULE_URL at workspace provision time
**Root cause:** Agents calling `mcp__molecule__*` tools from inside a
workspace container were hitting `localhost:8080` (container's own
localhost, not the host). The MCP client
(`mcp-server/src/index.ts`) defaulted to `MOLECULE_URL ||
"http://localhost:8080"` and the provisioner only injected
`PLATFORM_URL`, never `MOLECULE_URL`.
**Fix (two-sided, belt-and-suspenders):**
1. `workspace-server/internal/provisioner/provisioner.go` — extracted env
building into pure `buildContainerEnv(cfg WorkspaceConfig) []string`
so it's unit-testable. Now injects `MOLECULE_URL=<PlatformURL>`
alongside `PLATFORM_URL`.
2. `mcp-server/src/index.ts` — client now prefers `MOLECULE_URL`, falls
back to `PLATFORM_URL`, then `localhost:8080`. Protects older
containers that don't yet have `MOLECULE_URL`.
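A sketch of the pure env builder from point 1 (the `WorkspaceConfig` field names here are assumptions):
```go
package provisioner

import "fmt"

type WorkspaceConfig struct {
	PlatformURL string
	ExtraEnv    []string // per-workspace custom env entries, appended last
}

// buildContainerEnv is pure so the MOLECULE_URL / PLATFORM_URL pairing can be
// unit-tested without touching Docker.
func buildContainerEnv(cfg WorkspaceConfig) []string {
	env := []string{
		fmt.Sprintf("PLATFORM_URL=%s", cfg.PlatformURL),
		// #67: also inject MOLECULE_URL so the MCP client inside the container
		// stops defaulting to its own localhost:8080.
		fmt.Sprintf("MOLECULE_URL=%s", cfg.PlatformURL),
	}
	return append(env, cfg.ExtraEnv...)
}
```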
**Tests:** 4 new Go tests (`buildContainerEnv` injects both env vars,
MOLECULE_URL always matches PLATFORM_URL across URL shapes, awareness
both-or-nothing, custom envs append). Full provisioner suite green.
88 existing MCP tests still pass (fallback chain preserves existing
behaviour).
**E2E verified live:** rebuilt platform, restarted PM, `docker exec
env` shows both `PLATFORM_URL=http://host.docker.internal:8080` and
`MOLECULE_URL=http://host.docker.internal:8080` on the recreated
container.
**Side-discovery (filed as #85):** enabling `SECRETS_ENCRYPTION_KEY`
on an install with pre-existing plaintext secrets silently breaks
every secret — `crypto.Decrypt` runs GCM on plaintext bytes → fails
`log.Printf + continue` → row dropped → workspace crashes on
preflight. Proposed fix: `encryption_version` column + boot-time
re-encryption migration + fail-loud on decrypt mismatch.
Branch: `fix/67-inject-molecule-url`.
---
## #73 fix — close three real delete-race windows
**Observed symptom (corrected):** During the session's bulk-delete runs,
PM / Research Lead / Dev Lead consistently survived as "stragglers."
Turned out the cause wasn't a race — it was the `DELETE /workspaces/:id`
endpoint returning **HTTP 200** with `{"status":"confirmation_required"}`
when the workspace has children and `?confirm=true` is not set. The
bulk-delete script read HTTP 200 as success and moved on.
**What the #73 fix actually closes:** three real but distinct race
windows that would bite in production even with correct `?confirm=true`
usage:
1. `handlers/registry.go::Register` — `ON CONFLICT DO UPDATE SET
status='online'` ran unconditionally; a late heartbeat from a
workspace that was just soft-deleted (status='removed') could
resurrect the row. Guard added: `WHERE workspaces.status IS
DISTINCT FROM 'removed'`.
2. `handlers/registry.go::Heartbeat` — same UPDATE path had no
filter; late heartbeats refreshed `last_heartbeat_at` on
tombstoned rows (confusing liveness). Guard: `AND status !=
'removed'`. Plus `evaluateStatus` recovery path made conditional
in-SQL (`AND status = 'offline'`).
3. `handlers/workspace.go::Delete` — sequence was Stop container →
UPDATE status='removed'. Between those calls, Redis TTL expiry
could trigger the liveness monitor, which called `RestartByID`,
recreating the container. New order: UPDATE status='removed'
FIRST (for self + descendants as a single batch), THEN stop
containers + remove volumes. Auto-restart paths now see
status='removed' immediately and bail out via their existing
`NOT IN ('removed', ...)` guards.
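The SQL guards for windows 1 and 2 are short enough to sketch. Table and column names are inferred from the description above; the real statements may differ.
```go
// Window 1: the Register upsert must not resurrect a soft-deleted workspace.
const registerSQL = `
INSERT INTO workspaces (id, status, last_heartbeat_at)
VALUES ($1, 'online', now())
ON CONFLICT (id) DO UPDATE
   SET status = 'online', last_heartbeat_at = now()
 WHERE workspaces.status IS DISTINCT FROM 'removed'`

// Window 2: a late heartbeat must not refresh liveness on a tombstoned row.
const heartbeatSQL = `
UPDATE workspaces
   SET last_heartbeat_at = now()
 WHERE id = $1
   AND status != 'removed'`
```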
**Tests:** 2 new registry tests pinning the SQL guards (substring
match on the emitted UPDATE); 2 existing delete tests updated for
the new order (single batch UPDATE covering self+descendants).
Full `go test ./... -race` green.
**Live E2E:** bulk delete of 12 workspaces with `?confirm=true`
→ all cleanly removed, **zero stragglers**, no pending provisions.
**Separate issue filed:** API DX — DELETE should return 4xx (e.g.
409 Conflict) when confirmation is required, not 200. Misleading
status code made the session's symptom diagnosis wrong for hours.
Branch: `fix/73-delete-workspace-race`.
---
## #88 fix — DELETE returns 409 Conflict when confirmation required
**Observed during #73:** bulk-delete scripts that read HTTP 200 as
success silently skipped every parent workspace, leaving tier-3 /
parent nodes behind and looking like a platform race bug.
**Fix:** one-line change in `handlers/workspace.go::Delete` — return
`http.StatusConflict` (409) instead of `http.StatusOK` (200) when
children exist and `?confirm=true` isn't set. Response body shape
unchanged (canvas UI + MCP server both parse the JSON body, not the
status code).
No regressions: canvas (`DetailsTab.tsx:75`) and MCP server
(`mcp-server/src/index.ts:80`) already pass `?confirm=true` on every
delete. The 409 only affects manual API users + bulk scripts that
forgot — exactly the cohort that was silently failing.
**Tests:** 1 existing delete test updated to expect 409. Full
`go test ./...` green.
**Live E2E:** real platform, real parent+child workspaces —
`DELETE /workspaces/:id` (no confirm) returns `http=409` with the
expected JSON body; `DELETE /workspaces/:id?confirm=true` still
returns 200.
Branch: `fix/88-delete-confirm-409`. Closes #88.
---
## #74 fix — retry delegation once after reactive URL refresh
**Clarification of the original issue:** The delegation worker
(`handlers/delegation.go::executeDelegation`) already calls the shared
`h.workspace.proxyA2ARequest(...)` path — so it DOES benefit from the
A2A proxy's reactive health-check / URL-refresh on connection errors.
The real gap is that the reactive refresh runs *after* the current
request fails; the caller still gets an error for that specific
delegation attempt. During bulk restarts (observed 21:40 today), PM's
delegation worker fired during the warm-up window, hit a stale URL,
and the single-attempt logic marked the delegation `failed`.
**Fix:** add a single retry with an 8-second pause when
`proxyA2ARequest` returns a transient-looking error. The pause is
long enough for the reactive refresh + container restart to land a
fresh URL in the cache. `isTransientProxyError` classifies which
statuses retry:
- **502 Bad Gateway** (plain connection failure) — retry
- **503 Service Unavailable** (reactive check decided to restart the
container) — retry
- **404 / 403 / 400 / 500** — static, don't waste the retry window
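A sketch of the retry shape. The classifier's exact inputs are assumptions; it may inspect errors rather than bare status codes.
```go
package handlers

import (
	"net/http"
	"time"
)

// isTransientProxyError marks the statuses worth the single retry: the plain
// connection failure (502) and the "reactive check decided to restart" case (503).
func isTransientProxyError(status int) bool {
	return status == http.StatusBadGateway || status == http.StatusServiceUnavailable
}

// delegateOnce tries the A2A proxy call, and on a transient failure waits ~8s for
// the reactive refresh (and any container restart) to land a fresh URL, then
// retries exactly once.
func delegateOnce(call func() (int, error)) error {
	status, err := call()
	if err == nil {
		return nil
	}
	if !isTransientProxyError(status) {
		return err
	}
	time.Sleep(8 * time.Second)
	_, err = call()
	return err
}
```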
**Tests:** 7 new cases on the classifier matrix + a regression
guard on the 8-second window. Full `go test ./... -race` green.
Branch: `fix/74-delegation-via-a2a-proxy`. Closes #74.
---
## 100% platform coverage — MCP + molecli
Full parity pass so every platform endpoint is reachable from both
client layers.
### MCP server (`mcp-server/src/index.ts`): 61 → 83 tools
**+22 new handlers** added in a single coverage-completion block at
the bottom of the file:
- Delegations (#64): `record_delegation`, `update_delegation_status`
- Activity: `report_activity`, `notify_user`
- Canvas viewport: `get_canvas_viewport`, `set_canvas_viewport`
- Channels (platform-level): `discover_channel_chats`
- Plugins: `list_plugin_sources`, `list_available_plugins`,
`check_plugin_compatibility`
- Schedules (cron): `list_schedules`, `create_schedule`,
`update_schedule`, `delete_schedule`, `run_schedule`,
`get_schedule_history`
- Session + shared context: `session_search`, `get_shared_context`
- K/V memory (distinct from HMA): `memory_set`, `memory_get`,
`memory_list`, `memory_delete_kv`
**Updated schemas:** `create_workspace` + `update_workspace` now
accept `workspace_access` (none / read_only / read_write) + explicit
`runtime` / `workspace_dir` params.
All 88 existing MCP tests still pass; `npm run build` green.
### molecli CLI (`workspace-server/cmd/cli/`): 9 → 21 top-level commands
Two new files:
- `cmd_api.go` — `molecli api <METHOD> <PATH> [json-body]`, a raw
escape hatch. Hits any endpoint without a typed wrapper.
- `cmd_ops.go` — typed subcommands (thin wrappers over shared
`callAPI` helper) for operator ergonomics:
- `ws restart|pause|resume` — lifecycle ops
- `plugin registry|sources|list|available|install|uninstall`
- `secret list|set|delete|list-global|set-global|delete-global`
- `schedule list|add|remove|run|history`
- `channel adapters|list|remove|send|test`
- `approval pending|list|decide`
- `delegation list|create`
- `bundle export|import`
- `org templates|import`
- `traces <workspace-id>`
- `activity list <workspace-id>`
- `hma commit|search`
`go test ./cmd/cli/` passes; live smoke-test against running
platform: `api GET /health`, `plugin sources`, `org templates`,
`ws restart <bad-id>` all return expected responses.
Branch: `feat/mcp-molecli-full-coverage`.
---
## #65 fix — per-agent workspace_access in org.yaml + API
**Design from the ecosystem-research outcomes doc:** new
`workspace_access: none | read_only | read_write` field on every
workspace, enforced at container provision time via Docker's native
`:ro` bind-mount flag. Eliminates the "PM couriers documents to
reports" workaround by letting research agents have read-only repo
access without the write risk.
**Changes:**
- **Migration 019** — adds `workspace_access VARCHAR(20) NOT NULL
DEFAULT 'none'` with CHECK constraint. Additive, all existing rows
become 'none' (current isolated-volume behaviour preserved).
- **`provisioner.go`:**
- New `WorkspaceAccess` field on `WorkspaceConfig`.
- Constants `WorkspaceAccessNone`/`ReadOnly`/`ReadWrite`.
- `buildWorkspaceMount(cfg)` — pure helper, selects between
named-volume, rw bind, and `:ro` bind based on access +
workspace_path.
- `ValidateWorkspaceAccess(access, path)` — rejects `read_*`
without a path and unknown values.
- **`handlers/workspace.go::Create`** and
**`handlers/org.go::createOrgWorkspace`** — validate +
persist `workspace_access` on INSERT. Response body echoes
the stored value.
- **`handlers/workspace_provision.go::buildProvisionerConfig`** —
reads `workspace_access` from DB (with payload override) and
forwards to the provisioner. Restart paths preserve the mode.
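A sketch of the two provisioner helpers. The mount-string format and parameter shapes here are assumptions; the real `buildWorkspaceMount` takes the full config struct.
```go
package provisioner

import "fmt"

const (
	WorkspaceAccessNone      = "none"
	WorkspaceAccessReadOnly  = "read_only"
	WorkspaceAccessReadWrite = "read_write"
)

// ValidateWorkspaceAccess rejects read_* without a host path and unknown values.
func ValidateWorkspaceAccess(access, path string) error {
	switch access {
	case WorkspaceAccessNone:
		return nil
	case WorkspaceAccessReadOnly, WorkspaceAccessReadWrite:
		if path == "" {
			return fmt.Errorf("workspace_access %q requires workspace_dir", access)
		}
		return nil
	default:
		return fmt.Errorf("unknown workspace_access %q (allowed: none, read_only, read_write)", access)
	}
}

// buildWorkspaceMount picks between an isolated named volume, a rw bind, and a :ro bind.
func buildWorkspaceMount(workspaceID, access, hostPath string) string {
	switch access {
	case WorkspaceAccessReadWrite:
		return hostPath + ":/workspace"
	case WorkspaceAccessReadOnly:
		return hostPath + ":/workspace:ro"
	default:
		return "ws-" + workspaceID + "-data:/workspace" // current isolated-volume behaviour
	}
}
```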
**Tests:**
- Provisioner: 2 new tables — `TestBuildWorkspaceMount_SelectionMatrix`
(6 cases covering the full access × path matrix) and
`TestValidateWorkspaceAccess` (7 cases).
- Handler INSERT WithArgs updated across 5 existing tests for the
new 9th column.
- Full `go test ./... -race` green.
**Live E2E:**
- Migration auto-applied → `workspaces` table has `workspace_access`
with the CHECK constraint.
- `POST /workspaces {"workspace_access":"read_only","workspace_dir":"/repo"}`
→ 201 with `"workspace_access":"read_only"` echoed; DB row correct.
- `POST {"workspace_access":"read_only"}` (no workspace_dir) → 400
with clear error.
- `POST {"workspace_access":"wildcard"}` → 400 with allowed-values
list.
- Container inspected after provision: `/workspace` mount has
`RW=false Mode=ro`; `touch /workspace/foo` from inside returns
`Read-only file system` → enforcement is real.
Branch: `feat/65-workspace-access-yaml`. Closes #65.
---
## #64 fix — agent registers delegations with platform (Option A)
**Root cause (confirmed in comment on #64):** `check_delegation_status`
reads from the agent's local `_delegations` dict; platform's
`GET /workspaces/:id/delegations` reads from `activity_logs`. The
agent's `delegate_to_workspace` MCP tool sends A2A directly and
never touches `activity_logs` — so the platform's view was always empty
for agent-initiated delegations.
**Fix (minimal Option A, dual-write):**
- Platform: two new endpoints on `DelegationHandler`
- `POST /workspaces/:id/delegations/record` — inserts a single
`activity_logs` row with `method='delegate'`, status='dispatched'.
No A2A fired (agent does that directly for OTEL/retry reasons).
- `POST /workspaces/:id/delegations/:delegation_id/update` — accepts
`status ∈ {completed, failed}` + optional error + preview. UPDATEs
the original row and (on completion) INSERTs a `delegate_result`
row matching the canvas-path flow.
- Agent (`workspace/builtin_tools/delegation.py`):
- New best-effort async helpers `_record_delegation_on_platform`
and `_update_delegation_on_platform`. Failures are logged at debug
and swallowed — never block the actual A2A delegation path.
- `_execute_delegation` calls `_record_...` at task start and
`_update_...` on completion / failure (alongside the existing
`_notify_completion`).
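A sketch of the platform-side record endpoint, assuming the gin + database/sql stack used elsewhere in the handlers; the request-body fields and the `activity_logs` column list are illustrative only.
```go
package handlers

import (
	"database/sql"
	"net/http"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
)

type DelegationHandler struct {
	db *sql.DB
}

// RecordDelegation mirrors an agent-initiated delegation into activity_logs.
// No A2A is fired here: the agent keeps sending A2A directly for OTEL and retry reasons.
func (h *DelegationHandler) RecordDelegation(c *gin.Context) {
	workspaceID := c.Param("id")
	if _, err := uuid.Parse(workspaceID); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid workspace id"})
		return
	}
	var req struct {
		TargetWorkspaceID string `json:"target_workspace_id"`
		Task              string `json:"task"`
	}
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}
	_, err := h.db.ExecContext(c.Request.Context(),
		`INSERT INTO activity_logs (workspace_id, target_workspace_id, method, status, detail)
		 VALUES ($1, $2, 'delegate', 'dispatched', $3)`,
		workspaceID, req.TargetWorkspaceID, req.Task)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}
	c.JSON(http.StatusCreated, gin.H{"status": "dispatched"})
}
```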
**Result:** agent keeps direct A2A for speed + OTEL trace-context
propagation + existing retry logic; platform's activity_logs mirrors
the same set the agent's local dict holds. `GET /delegations` now
returns rows for agent-initiated delegations.
**Tests:** 5 new Go tests (Record inserts + rejects invalid UUID,
UpdateStatus completed inserts result row + rejects unknown status +
failed broadcast). 4 new Python tests (record fires HTTP POST, best-
effort on platform error, update completed, update truncates large
preview to 500 chars). Python pytest 1060 → 1064; full Go suite green.
Branch: `fix/64-agent-delegate-via-platform`. Closes #64.
---
## SDK — workspace / org / channel validators
**Issue:** SDK only validated plugins. Authors publishing
workspace config templates, org-templates, or channel configs had no
lint step — errors only surfaced at `POST /org/import` or container
startup.
**Fix:** extended `sdk/python/molecule_plugin/` with three new modules:
- `workspace.py` — validates `config.yaml` (name, runtime, tier,
runtime_config shape). `SUPPORTED_RUNTIMES` kept in sync with
`provisioner.RuntimeImages`.
- `org.py` — recursively validates `org.yaml` (name, workspaces tree,
workspace_access + workspace_dir pairing per #65, channels via
delegated `validate_channel_config`, schedules, plugins, external+url,
children).
- `channel.py` — validates channel configs (standalone dict or YAML
file). `SUPPORTED_CHANNEL_TYPES` currently `{telegram}`; extend when
Slack/Discord adapters land.
CLI (`python -m molecule_plugin validate {plugin|workspace|org|channel} <path>`)
dispatches to the right validator; bare `validate <path>` still defaults
to plugin for back-compat. Exit 0 on valid, 1 on any error.
`validate_channel_config` is the single source of truth for channel
schema — `org.py` delegates to it rather than duplicating checks.
**Tests:** `sdk/python/tests/test_validators.py` — 37 new tests (happy,
missing file, bad YAML, non-object, each field error, null-safety on
`runtime_config: None` / `defaults: null`, CLI dispatch for all 4 kinds,
back-compat form). Fixed bug found during test authoring: `org.py` crashed
on non-dict children; now guarded with `isinstance` check.
**Live smoke:** all 4 in-repo org templates (`free-beats-all`,
`reno-stars`, `molecule-dev`, `molecule-worker-gemini`) validate clean.
**SDK pytest:** 50 → 87. Branch: `feat/sdk-workspace-org-channel`.
---
## Top-5 #3 — parallel adapter builds
DevOps proposal from the ecosystem-research outcomes doc. All six
adapter Dockerfiles build `FROM workspace-template:base` with no
inter-adapter dependency, so they're safe to build concurrently once
the base is done.
**Change** (`workspace/build-all.sh`):
- Serial path kept for single-runtime rebuilds and `SERIAL_BUILD=1`
CI environments (preserves bounded-concurrency option).
- Parallel path: fan out one `docker build` per adapter, capture
stdout/stderr to `/tmp/build_<tag>.log`, wait for all, tally
per-tag success/failure. Failures still exit non-zero.
**E2E:** `bash build-all.sh claude-code deepagents langgraph`
finished in **43s wall-clock** (three adapter builds running
concurrently). Previously ~120s serial. Log files live under
`/tmp/build_*.log` for post-hoc debugging.
Branch: `feat/top5-3-parallel-adapter-builds`.