# 2026-04-12

## Summary

Shipped the full two-axis plugin architecture on `feat/agentskills-compliance`
(PR #62). **Plugin source** (where files come from) and **plugin shape**
(what's inside them) are now independent, pluggable axes.

- **Source axis** — `workspace-server/internal/plugins/` package: `SourceResolver`
  interface, `Registry`, `LocalResolver`, `GithubResolver`, `ParseSource`.
  `POST /workspaces/:id/plugins` accepts `{name}` (back-compat → local) or
  `{source: "scheme://spec"}`. New `GET /plugins/sources` enumerates
  registered schemes.
- **Shape axis** — `workspace/plugins_registry/` package:
  `PluginAdaptor` protocol, hybrid resolver (registry > plugin-shipped >
  raw-drop), `AgentskillsAdaptor` built-in for agentskills.io-format
  skills + Molecule AI's rules extension. Named sub-type adapters planned
  for MCP, DeepAgents sub-agents, LangGraph sub-graphs, etc.
- **agentskills.io compliance** — every first-party skill passes the
  open standard; `python -m molecule_plugin validate` CLI enforces it
  in CI. Our skills are now installable in ~35 other agent tools
  (Cursor, Codex, Copilot, Gemini CLI, etc.).
- **Gemini org parity** — `molecule-worker-gemini` mirrors `molecule-dev`
  (11 workspaces, Research + Dev branches, schedules, Telegram channel,
  per-agent prompts) as the E2E proof point.

## Files touched

Platform (Go):
- `workspace-server/internal/plugins/{source,local,github}.go` + tests — source
  layer, 97.4% coverage.
- `workspace-server/internal/envx/envx.go` + test — env-var helpers, 100%
  coverage.
- `workspace-server/internal/handlers/plugins.go` — install pipeline refactored
  into `resolveAndStage` + `deliverToContainer`; typed `httpErr` for
  status propagation; `sort.Strings` in `Registry.Schemes`;
  `logInstallLimitsOnce` on startup.
- `workspace-server/internal/router/router.go` — new routes (`/plugins/sources`,
  `/workspaces/:id/plugins/available`, `/workspaces/:id/plugins/compatibility`).
- `workspace-server/Dockerfile` — `apk add git` for the github resolver.

Workspace runtime (Python):
- `workspace/plugins_registry/` — new module: `protocol.py`,
  `builtins.py` (`AgentskillsAdaptor`), `raw_drop.py`, resolver.
- `workspace/skill_loader/` — renamed from `skills/`; reads
  `scripts/` per the agentskills.io spec.
- `workspace/builtin_tools/` — renamed from `tools/` to
  disambiguate from user-plugin tool dirs.
- `workspace/adapters/base.py` — added hooks: `memory_filename`,
  `register_tool_hook`, `register_subagent_hook`, `append_to_memory_hook`,
  `install_plugins_via_registry`. Default `inject_plugins()` drives the
  new pipeline.
- `workspace/adapters/claude_code/adapter.py` — deleted the
  40-line `inject_plugins()` override.
- `workspace/adapters/deepagents/Dockerfile` — ships
  `plugins_registry/`.
- `workspace/plugins.py` — `PluginManifest.runtimes` field.

Plugins (content):
- `plugins/*/adapters/{claude_code,deepagents}.py` — one-line
  `from plugins_registry.builtins import AgentskillsAdaptor as Adaptor`.
- `plugins/*/plugin.yaml` — declare `runtimes: [claude_code, deepagents]`.

SDK (Python):
- `sdk/python/molecule_plugin/` — `protocol.py`, `builtins.py`
  (SDK-vendored `AgentskillsAdaptor`), `manifest.py` (spec validator), CLI
  via `__main__.py`.
- `sdk/python/template/` — cookiecutter skeleton.

Org templates:
- `org-templates/molecule-worker-gemini/org.yaml` — full parity with
  `molecule-dev` (11 workspaces, schedules, Telegram, per-agent
  prompts, `workspace_dir` mount on PM, `required_env: [GOOGLE_API_KEY]`).
- Copied 5 `system-prompt.md` files from molecule-dev (research-lead,
  market-analyst, technical-researcher, competitive-intelligence,
  uiux-designer).

Docs:
- `docs/plugins/agentskills-compat.md` — two-layer model, spec mapping.
- `docs/plugins/sources.md` — two-axis source/shape architecture,
  security model, future resolvers.
- `docs/ecosystem-watch.md` — Holaboss, Hermes Agent, gstack entries
  (adjacent projects to track).
- `.env.example` — `PLUGIN_INSTALL_*` vars documented.
- `PLAN.md` — plugin-adaptor landed; deferred items listed.
- `CLAUDE.md` — new endpoints, env vars, test counts.

## Test counts

- Go platform: all packages green under `-race`.
- Python workspace: 1040 passed, 9 skipped.
- Python SDK: 50 passed.
- Total: **1090 passing**.

Coverage on new code:
- `workspace-server/internal/plugins/*`: 97.4%
- `workspace-server/internal/envx/*`: 100%
- `workspace/plugins_registry/*`: 100%
- `workspace/skill_loader/*`: 100%
- `sdk/python/molecule_plugin/*`: 100%

## 5 rounds of code review

Every round addressed by new commits on the branch:

1. Round 1 — initial coverage pass.
2. Round 2 — `memory_filename` plumbing through `InstallContext`;
   `logger` in `skill_loader`; module constants for `SKILLS_SUBDIR`,
   `SKIP_ROOT_MD`, `SKILL_NAME_*`; SDK↔runtime drift-guard test;
   frontmatter parser unification.
3. Round 3 — fetch timeout + body size cap + staged-dir size cap via
   new env vars; typed `ErrPluginNotFound` sentinel replaces string
   matching; reject both `name`+`source`; `sort.Strings` in Schemes;
   `sync.RWMutex` on Registry; `--` in git clone; docs clarify
   github resolver is public-only.
4. Round 4 — `ParseSource` empty-spec guard; `dirSize(cap)` → `(limit)`;
   `localNameRE` length bound; extract `envDuration`/`envInt64` into
   `internal/envx`; `LANG=C LC_ALL=C` in git child env for
   locale-stable error parsing.
5. Round 5 — typed `httpErr` replaces 5-value tuple; `resolveAndStage`
   decoupled from `*gin.Context` via `installRequest` struct; drop
   unused `source` param from `deliverToContainer`; trim whitespace in
   `ParseSource`; consolidate 3 test resolver stubs into 1
   parameterized `fakeResolver` + 3 constructors.

## Live E2E confirmed

- `GET /plugins/sources` → `{"schemes":["github","local"]}`.
- `POST {"name":"molecule-dev"}` → installed via local (back-compat).
- `POST {"source":"local:// molecule-dev "}` → installed
  (whitespace trimmed).
- `POST {"name":"a","source":"local://b"}` → 400 "not both".
- `POST {"source":"github://"}` → 400 "empty spec after 'github'".
- `POST {"source":"mystery://x"}` → 400 + `available_schemes: [...]`.
- Uninstall + reinstall on PM workspace: CLAUDE.md has
  `# Plugin: molecule-dev / rule: codebase-conventions.md` marker;
  `/configs/skills/review-loop/` present; zero container errors.
- Startup log on platform boot: `Plugin install limits: body=65536
  bytes timeout=5m0s staged=104857600 bytes`.

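
The request-validation behaviour exercised in the E2E list above can be sketched roughly as follows. This is an illustrative Python sketch only: the real `ParseSource` lives in the Go platform, and the function name `parse_source`, the `KNOWN_SCHEMES` tuple, and the exact error strings are assumptions made for the example.

```python
from typing import Optional, Tuple

# Assumed scheme list; mirrors the GET /plugins/sources output shown above.
KNOWN_SCHEMES = ("github", "local")


def parse_source(name: Optional[str], source: Optional[str]) -> Tuple[str, str]:
    """Return (scheme, spec), or raise ValueError (which would map to HTTP 400)."""
    if name and source:
        raise ValueError("provide name or source, not both")
    if name:
        # A bare name keeps the back-compat path through the local resolver.
        return ("local", name.strip())
    if not source:
        raise ValueError("name or source is required")

    scheme, sep, spec = source.strip().partition("://")
    spec = spec.strip()  # "local:// molecule-dev " -> "molecule-dev"
    if not sep:
        raise ValueError(f"malformed source {source!r}")
    if not spec:
        raise ValueError(f"empty spec after '{scheme}'")
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(
            f"unknown scheme {scheme!r}; available_schemes: {sorted(KNOWN_SCHEMES)}"
        )
    return (scheme, spec)
```

Checking the empty spec before the scheme lookup matches the order of the 400 responses listed above, where `github://` reports the empty spec rather than anything about the scheme.
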
## Branch

`feat/agentskills-compliance` → PR #62 (open, all CI green, ready to
merge). Use `git log --oneline origin/main..` for the commit list —
counting commits inline goes stale fast.

---

## Post-merge session — team coordination, platform hardening, new backlog

After PR #62 landed, the session continued with ecosystem-watch ship, a
gemini-org proof-point attempt, and a PLAN.md refresh coordinated through
the agent team. Several platform bugs surfaced; all filed and tracked.

### Shipped

- **PR #59** — A2A proxy regression fix. PR #59 had rewritten
  `http://127.0.0.1:<port>` → `http://ws-<id>:8000` unconditionally,
  breaking platform-on-host mode. Gated behind `platformInDocker` detection
  (`/.dockerenv` or `MOLECULE_IN_DOCKER=1`). `workspace-server/internal/handlers/a2a_proxy.go`.
  Commit `4b42913`.
- **PR #61** — `docs/ecosystem-watch.md`: Holaboss / Hermes / gstack
  entries + template + backlog candidates. Merged.
- **Cross-references for ecosystem-watch** — wired into `PLAN.md` (new
  "Ecosystem Awareness" section), `README.md` + `README.zh-CN.md`
  Documentation Map, and `CLAUDE.md` (new "Ecosystem Context" section).
  Agents couldn't discover the doc because it wasn't linked anywhere;
  PM reported it missing despite being in its bind mount. Commit `8ae5e73`.
- **DeepAgents adapter: `virtual_mode=False`** in
  `workspace/adapters/deepagents/adapter.py`. Previously
  `read_file`/`ls`/`write_file`/`edit_file` operated on an in-memory
  snapshot that drifted from the bind-mounted `/workspace`; writes
  didn't persist across restarts and real files reported as missing.
  Commit `bc563d1`.
- **LangGraph recursion limit 100 → 500** default in
  `workspace/a2a_executor.py`. PM fan-out to 6+ reports routinely
  overran the 100-step ceiling. Still overridable via
  `LANGGRAPH_RECURSION_LIMIT` env var. Commit `d892eb4`.
- **Gemini org model swap** `gemini-3.1-pro-preview` →
  `gemini-2.5-pro` in `org-templates/molecule-worker-gemini/org.yaml`
  (3.1-pro-preview's 25 req/min couldn't sustain 11-workspace delegation
  waves). Commit `4b42913`.
- **Backlog tracking** for #64 / #65 added to `PLAN.md` Backlog. Commit `ba1cc15`.

### Open PRs (awaiting CEO approval)

- **#68** `docs/plan-refresh` — PLAN.md refresh: correct test counts
  (Canvas 325→345, Python 990→1,040, +SDK row 50, total 1,811→1,911),
  promote #66/#67 to backlog with actual issue content. Coordinated
  with the molecule-dev team; corrected PM's hallucinated content for
  #66/#67 before open.
- **#69** `chore/team-system-prompts-hardening` — harden PM / Dev Lead /
  Research Lead system prompts with hard-learned rules from today's
  coordination incident (15 rules total across 3 roles). Every rule
  maps to a specific failure we hit today.

### New platform issues filed

- **#64** — `GET /workspaces/:id/delegations` returns `[]` while the
  agent-side `check_delegation_status` tool shows 4 delegations.
  Sources-of-truth mismatch. Bug.
- **#65** — Per-agent repo-access config in `org.yaml`. New
  `workspace_access: none | read_only | read_write` field +
  `:ro` bind-mount for research agents. Eliminates the
  "PM couriers documents to reports" workaround. Enhancement.
- **#66** — `claude_sdk_executor.py` swallows subprocess stderr on
  CLI exit ≠ 0. Every failure surfaces the same opaque
  `"Command failed with exit code 1 / Check stderr output for details"`.
  High-priority bug; blocked real debugging today.
- **#67** — Agent MCP client defaults to `http://localhost:8080`,
  which inside a workspace container is the container itself.
  Inject `MOLECULE_URL=${PLATFORM_URL}` at provision time. High-priority
  bug; blocked PM from restarting its own reports.

### Gemini org — proof-point attempt, rolled back

Deployed molecule-worker-gemini (11 DeepAgents workspaces), exercised
the full delegation tree, hit three distinct blockers:

1. `virtual_mode=True` made PM report real files as missing (fixed
   in `bc563d1` above).
2. LangGraph recursion limit 100 tripped on PM fan-out (fixed in
   `d892eb4` above).
3. Google AI Studio **monthly spending cap** exhausted the whole
   project after repeated retries.

Rolled back to molecule-dev (Claude Code runtime) to finish the
PLAN.md refresh task.

### Session-state contamination note

After a `ProcessError` crash on a Claude Code workspace, subsequent
A2A calls to that workspace keep failing identically until the
workspace is restarted, even when the same SDK query run manually
from inside the container succeeds. The root cause is likely session
resume state in the executor. Workaround: restart on `ProcessError`.
Worth formalizing in the executor as an auto-reset on `exit_code != 0`
once #66 lands and we can see the real stderr.

### Rules distilled for the team (now encoded in #69)

- Never commit to `main` — always a feature branch + PR.
- Verify external refs (issue numbers, PRs, SHAs, file paths) before
  citing them.
- Inline documents into every sub-delegation — reports don't have the
  repo mount.
- `delegation.status == completed` ≠ work was done.
- Pause ~60s after a batch restart before delegating (warm-up race).
- Quote errors verbatim, don't paraphrase.
- Research Lead must always fan out — solo synthesis is a role failure.

---

## #71 fix — initial_prompt marker written up-front

**Root cause:** `main.py` previously wrote `/workspace/.initial_prompt_done`
only AFTER the initial_prompt self-send succeeded. If the prompt crashed
(any ProcessError, network failure, SDK exit), the marker was never
written — the next container boot replayed the same failing prompt and
cascaded into "every message crashes" until an operator intervened.
Observed three times on 2026-04-12 (gemini org + molecule-dev import +
post-restart).

**Fix (extracted from main.py into `workspace/initial_prompt.py`
so it's unit-testable without uvicorn):**

- `resolve_initial_prompt_marker(config_path)` — prefer `<config>/...`
  when writable, fall back to `/workspace/...`.
- `mark_initial_prompt_attempted(marker_path)` — best-effort write,
  returns `True`/`False` so the caller can log a loud warning on I/O
  failure.
- `main.py` calls `mark_initial_prompt_attempted` **before** scheduling
  the self-send. The post-send marker write is removed.

**Semantic change:** the prompt is attempted at most once per fresh boot;
if it fails, operators re-send manually via chat. Trade-off: this gives up
silent auto-retry-on-restart (which could cascade) in exchange for a
one-time attempt with a loud failure log.

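
The mark-before-send ordering described above can be sketched as follows. This is a minimal sketch, not the code in `workspace/initial_prompt.py`: the `fallback_dir` parameter, the `maybe_send_initial_prompt` wrapper, the marker file contents, and the assumption that the marker keeps the same file name under the config directory are all hypothetical.

```python
import os
from pathlib import Path

MARKER_NAME = ".initial_prompt_done"


def resolve_initial_prompt_marker(config_path: str, fallback_dir: str = "/workspace") -> Path:
    """Prefer the config dir when it is writable, otherwise use the fallback dir."""
    base = config_path if os.access(config_path, os.W_OK) else fallback_dir
    return Path(base) / MARKER_NAME


def mark_initial_prompt_attempted(marker_path: Path) -> bool:
    """Best-effort write; False tells the caller to log a loud warning."""
    try:
        marker_path.write_text("attempted\n")
        return True
    except OSError:
        return False


def maybe_send_initial_prompt(config_path: str, send, fallback_dir: str = "/workspace") -> bool:
    """Hypothetical caller: write the marker first, then attempt the send once."""
    marker = resolve_initial_prompt_marker(config_path, fallback_dir)
    if marker.exists():
        return False  # already attempted on an earlier boot, never replay
    if not mark_initial_prompt_attempted(marker):
        print(f"WARNING: could not write initial-prompt marker at {marker}")
    send()  # may raise; the marker is already on disk, so a restart will not retry
    return True
```

Because the marker is written before `send()` runs, a crashing prompt leaves the marker behind and the next boot skips the replay.
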
**Tests:** 5 new unit tests in `tests/test_main_initial_prompt.py`, 100%
coverage on `initial_prompt.py`. Live E2E verified all 12 containers
write the marker up-front and no replay occurs on restart. Manual
browser test via canvas chat against Research Lead returned the
expected reply — full round-trip through the UI.

Branch: `fix/71-initial-prompt-marker-at-start`. Closes #71.

---

## #66 fix — surface Claude SDK subprocess stderr + exit_code

**Root cause:** `claude_sdk_executor.py` caught `ProcessError` but
extracted only `str(exc)`, which for a crashing CLI reads "Command
failed with exit code 1 (exit code: 1) / Error output: Check stderr
output for details". The SDK's `ProcessError` actually carries
`.exit_code` and `.stderr` attributes — we were silently dropping both.
Every CLI crash looked identical and required ad-hoc reproduction
inside the container to diagnose.

**Fix:** new `_format_process_error(exc)` helper that extracts
`type(exc).__name__`, `exc.exit_code`, and `exc.stderr` (capped at
`_PROCESS_ERROR_STDERR_MAX_CHARS = 4096` to prevent log flooding).
Called in the retry loop (`logger.warning`) and the terminal error
path (`logger.error` + `logger.exception` for the full traceback).
Plain exceptions without SDK attributes fall back to `str(exc)` —
no crash on missing attrs.

**Tests:** 5 new unit tests in `tests/test_claude_sdk_executor.py`
(format with full context / truncation / plain exception / exit-code
only / end-to-end via `execute()` with caplog). Python pytest 1050 →
1055.

**E2E:** rebuilt `workspace-template:claude-code`, restarted an agent,
ran `_format_process_error` with a real
`claude_agent_sdk._errors.ProcessError(exit_code=2, stderr='disk full: /tmp')`
inside the live container → output shows both `exit_code=2` and the
stderr verbatim.

**Manual browser:** canvas chat against Research Lead — reply
`BROWSER-OK-66` returned cleanly, full UI round-trip works with the
new log format live.

Branch: `fix/66-capture-claude-sdk-stderr`. Closes #66.

---

## #75 fix — auto-reset session_id on subprocess-level errors

**Root cause:** after a `ProcessError` (or `CLIConnectionError`), the
executor's `self._session_id` still points at the dead session. On the
next call, `_build_options()` passes `resume=<stale-id>` to the SDK,
which boots a new subprocess that can't resume the prior session state
— and crashes again. Observed as "crashed once → crashes forever" on
2026-04-12 across PM / RL / DL in the coordination runs.

**Fix:** new `_reset_session_after_error(exc)` method clears
`self._session_id` when the exception looks subprocess-level
(`ProcessError`, `CLIConnectionError`, has `exit_code` attribute, or
message contains "exit code"). Rate-limit / capacity errors are left
alone so normal retry preserves conversational continuity. Called in
the retry loop, right after `_format_process_error` logs the context.

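
The classification rule above can be illustrated with a small sketch. The `ExecutorSketch` class, the name-based type check (used here to avoid importing the SDK), and the boolean return value are assumptions for the example, not the executor's actual code.

```python
class ExecutorSketch:
    """Hypothetical stand-in for the executor; only the reset logic is shown."""

    # Matched by class name so the sketch runs without the SDK installed.
    SUBPROCESS_ERROR_NAMES = {"ProcessError", "CLIConnectionError"}

    def __init__(self, session_id=None):
        self._session_id = session_id

    def _reset_session_after_error(self, exc: BaseException) -> bool:
        """Clear the stale session on subprocess-level errors; report if cleared."""
        looks_subprocess = (
            type(exc).__name__ in self.SUBPROCESS_ERROR_NAMES
            or hasattr(exc, "exit_code")
            or "exit code" in str(exc).lower()
        )
        if looks_subprocess and self._session_id is not None:
            self._session_id = None  # next attempt starts a fresh session
            return True
        return False  # e.g. rate-limit errors keep conversational continuity
```

After a reset, the next `_build_options()` call would see `session_id=None` and stop passing the stale `resume` value.
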
**Tests:** 5 new tests in `tests/test_claude_sdk_executor.py` — clears
on ProcessError / preserves on rate-limit / no-op when session_id is
already None / triggers on "exit code" message only / end-to-end via
`execute()` with `caplog` + spy-on-`_build_options` asserting that the
second retry attempt sees `session_id=None` rather than the stale ID.
Python pytest 1055 → 1060.

**E2E:** verified in live container — `_reset_session_after_error`
clears a stale session on ProcessError, preserves it on rate-limit.

**Manual browser:** canvas chat round-trip on Research Lead — message
went through and agent responded normally. Zero ProcessError
indicators.

Branch: `fix/75-session-reset-on-process-error`. Closes #75.

---

## Top-5 #1 — Memory FTS + namespace scoping

Backend proposal from the ecosystem-research outcomes doc,
highest-convergence team ask (BE + FE + QA + UX all independently proposed
some flavour of this).

**Migration `017_memories_fts_namespace.up.sql`:**
- `agent_memories.namespace VARCHAR(50) NOT NULL DEFAULT 'general'`
- `agent_memories.content_tsv tsvector` (STORED generated column from
  `to_tsvector('english', content)`)
- `idx_memories_fts` (GIN on `content_tsv`)
- `idx_memories_ns` (composite on `workspace_id, namespace`)

**Handler `workspace-server/internal/handlers/memories.go`:**
- `POST /workspaces/:id/memories` accepts optional `namespace` (default
  `"general"`, 50-char max validated at the handler).
- `GET /workspaces/:id/memories?q=...` routes multi-char queries
  through `content_tsv @@ plainto_tsquery('english', ?)` with
  `ts_rank` ordering; single-char queries fall back to `ILIKE`
  (tsvector can't tokenise single chars in the 'english' config).
- `GET /workspaces/:id/memories?namespace=...` filters regardless of
  scope.
- Response always includes the `namespace` field.

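
The query routing described for the `GET` handler can be sketched as below. The real handler is Go; this Python sketch only mirrors the branching. The selected column list, the `created_at` default ordering, and the `%s` placeholder style are assumptions, not details from `memories.go`.

```python
def build_memory_query(workspace_id, q=None, namespace=None):
    """Return (sql, params) for the memory search, mirroring the routing above."""
    sql = "SELECT id, content, namespace FROM agent_memories WHERE workspace_id = %s"
    params = [workspace_id]
    order_by = "created_at DESC"  # assumed default ordering

    if namespace:
        sql += " AND namespace = %s"
        params.append(namespace)

    if q:
        if len(q) > 1:
            # Multi-char query: full-text search, ranked by relevance.
            sql += " AND content_tsv @@ plainto_tsquery('english', %s)"
            params.append(q)
            order_by = "ts_rank(content_tsv, plainto_tsquery('english', %s)) DESC"
            params.append(q)
        else:
            # Single-char query: substring match instead of full-text search.
            sql += " AND content ILIKE %s"
            params.append(f"%{q}%")

    return f"{sql} ORDER BY {order_by}", params
```

For example, `build_memory_query("ws-1", q="restart", namespace="procedures")` combines the namespace filter with the full-text branch, which corresponds to the combined-filter case in the E2E list.
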
**Tests:** 5 existing tests updated for the new column list; 4 new
tests added (commit-with-namespace, namespace-too-long, FTS path,
ILIKE fallback, namespace filter). Handler test suite passes.

**E2E (live Postgres + running platform):**
- Platform restart applied migration 017 → column + indexes present.
- `POST` with / without namespace → both work, default kicks in.
- `?q=zinc+theme` → FTS returns reference memory.
- `?namespace=procedures` → scoped retrieval works.
- `?q=restart&namespace=procedures` → combined filter works.

Branch: `feat/memory-fts-namespace`.

---

## Top-5 #5 — Fail-secure encryption at boot

Security Auditor's top proposal from the outcomes doc. The platform
previously booted without `SECRETS_ENCRYPTION_KEY` and silently stored
workspace secrets in plaintext with only a WARNING log. OWASP A02:2021
(Cryptographic Failures) / STRIDE "Information Disclosure".

**Fix** (`workspace-server/internal/crypto/aes.go`):

- New `InitStrict() error` variant that returns `ErrEncryptionKeyMissing`
  when `MOLECULE_ENV=prod`/`production` and the key is unset, malformed,
  or the wrong length. Existing `Init()` retained for any callers that
  prefer the warn-and-continue behaviour; only `cmd/server/main.go`
  switched to the strict variant.
- `isProdEnv()` accepts `prod`, `production`, case-insensitive + trimmed.
- `loadKeyFromEnv` refactor: one helper returns the parse error so both
  entry points can format it the same way.

**`cmd/server/main.go`:** `crypto.InitStrict()` + `log.Fatalf` on error.
Local dev (no `MOLECULE_ENV`) keeps the existing warn-and-continue.

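
The strict-versus-lenient boot behaviour above can be sketched in a few lines. The real code is Go in `internal/crypto/aes.go`; this Python version only mirrors the decision logic. It assumes the key is base64 text decoding to 32 bytes (consistent with the "valid base64 key" and AES-256-GCM mentions in this section), and the function and exception names are illustrative.

```python
import base64

KEY_BYTES = 32  # AES-256


class EncryptionKeyMissing(Exception):
    """Raised in prod when the key is unset, malformed, or the wrong length."""


def is_prod_env(value) -> bool:
    """Accept 'prod' / 'production', case-insensitive and trimmed."""
    return (value or "").strip().lower() in {"prod", "production"}


def load_key_from_env(env: dict) -> bytes:
    """Parse the key once so both entry points can report the same error."""
    raw = env.get("SECRETS_ENCRYPTION_KEY", "")
    if not raw:
        raise ValueError("SECRETS_ENCRYPTION_KEY is unset")
    key = base64.b64decode(raw, validate=True)  # binascii.Error is a ValueError
    if len(key) != KEY_BYTES:
        raise ValueError(f"key is {len(key)} bytes, expected {KEY_BYTES}")
    return key


def init_strict(env: dict):
    """Return the key, or None in non-prod; refuse to continue in prod."""
    try:
        return load_key_from_env(env)
    except ValueError as err:
        if is_prod_env(env.get("MOLECULE_ENV")):
            raise EncryptionKeyMissing(str(err)) from err
        print(f"WARNING: secrets encryption disabled: {err}")
        return None
```

A caller in the position of `main.go` would treat `EncryptionKeyMissing` as fatal and exit, while local dev and staging continue with the warning.
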
**Tests:** 6 new tests in `internal/crypto/aes_test.go`:
- fails in prod when key is missing
- fails in prod on wrong-length key
- succeeds in prod with valid key
- allows dev mode without key (ergonomics)
- allows staging without key (non-prod)
- isProdEnv case-insensitivity table

**E2E:** `/tmp/platform-failsec` binary run with `MOLECULE_ENV=prod` +
empty key → `log.Fatalf` triggers, platform refuses to start. Same
binary with `MOLECULE_ENV=prod` + valid base64 key → boots, prints
"AES-256-GCM enabled", serves 200 on `/health`.

Branch: `fix/top5-5-fail-secure-encryption`.

---

## #85 fix — encryption_version column + DecryptVersioned

**Root cause (from the investigation):** rows in `workspace_secrets` /
`global_secrets` are tagged as `encrypted_value bytea` but whether
they're *actually* encrypted depends entirely on whether
`SECRETS_ENCRYPTION_KEY` was set at the moment of `Encrypt` —
`crypto.Encrypt` short-circuits and returns plaintext bytes when
encryption is disabled. Switching on the key later makes
`crypto.Decrypt` try GCM on plaintext bytes → fails → provisioner
silently skips the row → container crashes on missing OAuth token.

With PR #83 (fail-secure) pushing operators toward setting the key,
this trap was about to start biting real installs.

**Fix:**

- Migration `018_secrets_encryption_version` adds
  `encryption_version INT NOT NULL DEFAULT 0` to both secret tables.
  All existing rows become `version=0` (plaintext). Additive, safe.
- `crypto/aes.go`:
  - `EncryptionVersionPlaintext = 0`, `EncryptionVersionAESGCM = 1` constants.
  - `CurrentEncryptionVersion()` — tells callers which tag to write.
  - `DecryptVersioned(value, version)` — dispatches on tag; `v=0`
    passes through, `v=1` runs GCM (and errors if `IsEnabled()` is
    false). Unknown version → clear error.
  - Existing `Decrypt` deprecated-in-comment but kept for callers
    that haven't migrated (backward-compat during transition).
- `handlers/workspace_provision.go`: SELECT now pulls
  `encryption_version`; decrypt uses `DecryptVersioned`; on failure
  **aborts provisioning with a loud FATAL log + marks workspace
  failed** (#66-style silent-failure removed).
- `handlers/secrets.go`: both `Set` and global `SetGlobalSecret`
  persist `encryption_version = CurrentEncryptionVersion()` on
  INSERT. `ON CONFLICT` also updates the version — re-setting a
  historical plaintext row while a key is active upgrades it to
  GCM in-place.
- `handlers/secrets.go::GetModel`: SELECT pulls version, uses
  `DecryptVersioned`.

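
The version dispatch described above can be sketched as follows. The real `DecryptVersioned` is Go; this Python sketch mirrors only the branching and takes the GCM routine as an injected callable (`gcm_decrypt`, a hypothetical parameter) so it does not depend on any crypto library. The `stored_size` helper is likewise an illustration of the nonce-plus-tag overhead.

```python
ENCRYPTION_VERSION_PLAINTEXT = 0
ENCRYPTION_VERSION_AESGCM = 1

NONCE_BYTES = 12  # GCM nonce stored alongside the ciphertext
TAG_BYTES = 16    # GCM authentication tag


def current_encryption_version(key) -> int:
    """Tag to write on INSERT: AES-GCM when a key is configured, else plaintext."""
    return ENCRYPTION_VERSION_AESGCM if key else ENCRYPTION_VERSION_PLAINTEXT


def decrypt_versioned(value: bytes, version: int, key=None, gcm_decrypt=None) -> bytes:
    """Dispatch on the stored version tag instead of guessing from the bytes."""
    if version == ENCRYPTION_VERSION_PLAINTEXT:
        return value  # historical plaintext row passes through untouched
    if version == ENCRYPTION_VERSION_AESGCM:
        if key is None or gcm_decrypt is None:
            raise ValueError("row is AES-GCM encrypted but encryption is not enabled")
        return gcm_decrypt(key, value)
    raise ValueError(f"unknown encryption_version {version}")


def stored_size(plaintext_len: int) -> int:
    """Expected stored length of a version-1 value."""
    return plaintext_len + NONCE_BYTES + TAG_BYTES
```

With these constants, an 11-byte secret stored as version 1 occupies `stored_size(11)` = 11 + 12 + 16 = 39 bytes.
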
**Tests:** 6 new crypto tests (plaintext pass-through, GCM round-trip,
GCM requires key, unknown version rejected, `CurrentEncryptionVersion`
tracks key state, the exact #85 scenario end-to-end). 6 existing
secret handler tests updated for the 4-arg INSERT. Full Go test suite
passes.

**E2E (live):**
- Migration applied automatically on platform boot: `encryption_version`
  column present on both tables.
- 102 pre-existing plaintext rows correctly tagged `version=0`.
- New `TEST_NEW_SECRET_85` stored as 39 bytes (11 plaintext + 12 nonce
  + 16 tag = ✓) with `version=1`.
- PM container restart succeeds — both `CLAUDE_CODE_OAUTH_TOKEN`
  (v=0 historical plaintext) AND `TEST_NEW_SECRET_85` (v=1 encrypted)
  are decrypted correctly and injected into the container env.

Branch: `fix/85-encryption-version-migration`. Closes #85.

---

## #67 fix — inject MOLECULE_URL at workspace provision time

**Root cause:** Agents calling `mcp__molecule__*` tools from inside a
workspace container were hitting `localhost:8080` (container's own
localhost, not the host). The MCP client
(`mcp-server/src/index.ts`) defaulted to `MOLECULE_URL ||
"http://localhost:8080"` and the provisioner only injected
`PLATFORM_URL`, never `MOLECULE_URL`.

**Fix (two-sided, belt-and-suspenders):**

1. `workspace-server/internal/provisioner/provisioner.go` — extracted env
   building into pure `buildContainerEnv(cfg WorkspaceConfig) []string`
   so it's unit-testable. Now injects `MOLECULE_URL=<PlatformURL>`
   alongside `PLATFORM_URL`.
2. `mcp-server/src/index.ts` — client now prefers `MOLECULE_URL`, falls
   back to `PLATFORM_URL`, then `localhost:8080`. Protects older
   containers that don't yet have `MOLECULE_URL`.

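
Both sides of the fix above are small enough to sketch together. The real code is Go (provisioner) and TypeScript (MCP client); this Python sketch only mirrors the logic, and the function names and the `extra` parameter are assumptions.

```python
DEFAULT_URL = "http://localhost:8080"


def build_container_env(platform_url: str, extra=()) -> list:
    """Provisioner side: inject both variables with the same value."""
    return [
        f"PLATFORM_URL={platform_url}",
        f"MOLECULE_URL={platform_url}",
        *extra,  # custom envs are appended after the platform ones
    ]


def resolve_platform_url(env: dict) -> str:
    """Client side: MOLECULE_URL first, then PLATFORM_URL, then localhost."""
    return env.get("MOLECULE_URL") or env.get("PLATFORM_URL") or DEFAULT_URL
```

The middle step of the fallback chain is what protects containers provisioned before the fix, which carry `PLATFORM_URL` but not `MOLECULE_URL`.
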
**Tests:** 4 new Go tests (`buildContainerEnv` injects both env vars,
MOLECULE_URL always matches PLATFORM_URL across URL shapes, awareness
both-or-nothing, custom envs append). Full provisioner suite green.
88 existing MCP tests still pass (fallback chain preserves existing
behaviour).

**E2E verified live:** rebuilt platform, restarted PM, `docker exec
env` shows both `PLATFORM_URL=http://host.docker.internal:8080` and
`MOLECULE_URL=http://host.docker.internal:8080` on the recreated
container.

**Side-discovery (filed as #85):** enabling `SECRETS_ENCRYPTION_KEY`
on an install with pre-existing plaintext secrets silently breaks
every secret — `crypto.Decrypt` runs GCM on plaintext bytes → fails
→ `log.Printf + continue` → row dropped → workspace crashes on
preflight. Proposed fix: `encryption_version` column + boot-time
re-encryption migration + fail-loud on decrypt mismatch.

Branch: `fix/67-inject-molecule-url`.

---

## #73 fix — close three real delete-race windows

**Observed symptom (corrected):** During the session's bulk-delete runs,
PM / Research Lead / Dev Lead consistently survived as "stragglers."
The cause turned out not to be a race: the `DELETE /workspaces/:id`
endpoint returns **HTTP 200** with `{"status":"confirmation_required"}`
when the workspace has children and `?confirm=true` is not set. The
bulk-delete script read HTTP 200 as success and moved on.

**What the #73 fix actually closes:** three real but distinct race
windows that would bite in production even with correct `?confirm=true`
usage:

1. `handlers/registry.go::Register` — `ON CONFLICT DO UPDATE SET
   status='online'` ran unconditionally; a late heartbeat from a
   workspace that was just soft-deleted (status='removed') could
   resurrect the row. Guard added: `WHERE workspaces.status IS
   DISTINCT FROM 'removed'`.
2. `handlers/registry.go::Heartbeat` — same UPDATE path had no
   filter; late heartbeats refreshed `last_heartbeat_at` on
   tombstoned rows (confusing liveness). Guard: `AND status !=
   'removed'`. Plus `evaluateStatus` recovery path made conditional
   in-SQL (`AND status = 'offline'`).
3. `handlers/workspace.go::Delete` — sequence was Stop container →
   UPDATE status='removed'. Between those calls, Redis TTL expiry
   could trigger the liveness monitor, which called `RestartByID`,
   recreating the container. New order: UPDATE status='removed'
   FIRST (for self + descendants as a single batch), THEN stop
   containers + remove volumes. Auto-restart paths now see
   status='removed' immediately and bail out via their existing
   `NOT IN ('removed', ...)` guards.

**Tests:** 2 new registry tests pinning the SQL guards (substring
match on the emitted UPDATE); 2 existing delete tests updated for
the new order (single batch UPDATE covering self+descendants).
Full `go test ./... -race` green.

**Live E2E:** bulk delete of 12 workspaces with `?confirm=true`
→ all cleanly removed, **zero stragglers**, no pending provisions.

**Separate issue filed:** API DX — DELETE should return 4xx (e.g.
409 Conflict) when confirmation is required, not 200. The misleading
status code kept the session's symptom diagnosis wrong for hours.

Branch: `fix/73-delete-workspace-race`.

---

## #88 fix — DELETE returns 409 Conflict when confirmation required

**Observed during #73:** bulk-delete scripts that read HTTP 200 as
success silently skipped every parent workspace, leaving tier-3 /
parent nodes behind and looking like a platform race bug.

**Fix:** one-line change in `handlers/workspace.go::Delete` — return
`http.StatusConflict` (409) instead of `http.StatusOK` (200) when
children exist and `?confirm=true` isn't set. Response body shape
unchanged (canvas UI + MCP server both parse the JSON body, not the
status code).

No regressions: canvas (`DetailsTab.tsx:75`) and MCP server
(`mcp-server/src/index.ts:80`) already pass `?confirm=true` on every
delete. The 409 only affects manual API users + bulk scripts that
forgot — exactly the cohort that was silently failing.

**Tests:** 1 existing delete test updated to expect 409. Full
`go test ./...` green.

**Live E2E:** real platform, real parent+child workspaces —
`DELETE /workspaces/:id` (no confirm) returns `http=409` with the
expected JSON body; `DELETE /workspaces/:id?confirm=true` still
returns 200.

Branch: `fix/88-delete-confirm-409`. Closes #88.

---

## #74 fix — retry delegation once after reactive URL refresh

**Clarification of the original issue:** The delegation worker
(`handlers/delegation.go::executeDelegation`) already calls the shared
`h.workspace.proxyA2ARequest(...)` path — so it DOES benefit from the
A2A proxy's reactive health-check / URL-refresh on connection errors.
The real gap is that the reactive refresh runs *after* the current
request fails; the caller still gets an error for that specific
delegation attempt. During bulk restarts (observed 21:40 today), PM's
delegation worker fired during the warm-up window, hit a stale URL,
and the single-attempt logic marked the delegation `failed`.

**Fix:** add a single retry with an 8-second pause when
`proxyA2ARequest` returns a transient-looking error. The pause is
long enough for the reactive refresh + container restart to land a
fresh URL in the cache. `isTransientProxyError` classifies which
statuses retry:

- **502 Bad Gateway** (plain connection failure) — retry
- **503 Service Unavailable** (reactive check decided to restart the
  container) — retry
- **404 / 403 / 400 / 500** — static, don't waste the retry window

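
A rough sketch of the classifier plus single retry, under stated assumptions: `isTransientStatus` and `callWithOneRetry` are hypothetical simplifications (the real `isTransientProxyError` signature is not shown in this log), and the call is modelled as a function returning a status code and an error. The pause is a parameter so the example runs quickly; the fix itself uses 8 seconds.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// retryPause mirrors the 8-second window described in the notes.
const retryPause = 8 * time.Second

// errStale is a placeholder error for the demo.
var errStale = errors.New("stale workspace URL")

// isTransientStatus is a simplified stand-in for isTransientProxyError:
// only 502 and 503 are worth a retry; everything else is treated as static.
func isTransientStatus(status int) bool {
	switch status {
	case http.StatusBadGateway, http.StatusServiceUnavailable:
		return true
	default:
		return false
	}
}

// callWithOneRetry runs call once and, if it failed with a transient
// status, waits for pause and runs it exactly one more time.
func callWithOneRetry(call func() (int, error), pause time.Duration) (int, error) {
	status, err := call()
	if err == nil || !isTransientStatus(status) {
		return status, err
	}
	time.Sleep(pause)
	return call()
}

func main() {
	attempts := 0
	call := func() (int, error) {
		attempts++
		if attempts == 1 {
			return http.StatusBadGateway, errStale // first try hits the stale URL
		}
		return http.StatusOK, nil
	}
	// A short pause keeps the demo fast; production would pass retryPause.
	status, err := callWithOneRetry(call, 10*time.Millisecond)
	fmt.Println(status, err, attempts, "production pause:", retryPause)
}
```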
**Tests:** 7 new cases on the classifier matrix + a regression
guard on the 8-second window. Full `go test ./... -race` green.

Branch: `fix/74-delegation-via-a2a-proxy`. Closes #74.

---

## 100% platform coverage — MCP + molecli

Full parity pass so every platform endpoint is reachable from both
client layers.

### MCP server (`mcp-server/src/index.ts`): 61 → 83 tools

**+22 new handlers** added in a single coverage-completion block at
the bottom of the file:

- Delegations (#64): `record_delegation`, `update_delegation_status`
- Activity: `report_activity`, `notify_user`
- Canvas viewport: `get_canvas_viewport`, `set_canvas_viewport`
- Channels (platform-level): `discover_channel_chats`
- Plugins: `list_plugin_sources`, `list_available_plugins`,
  `check_plugin_compatibility`
- Schedules (cron): `list_schedules`, `create_schedule`,
  `update_schedule`, `delete_schedule`, `run_schedule`,
  `get_schedule_history`
- Session + shared context: `session_search`, `get_shared_context`
- K/V memory (distinct from HMA): `memory_set`, `memory_get`,
  `memory_list`, `memory_delete_kv`

**Updated schemas:** `create_workspace` + `update_workspace` now
accept `workspace_access` (none / read_only / read_write) + explicit
`runtime` / `workspace_dir` params.

All 88 existing MCP tests still pass; `npm run build` green.

### molecli CLI (`workspace-server/cmd/cli/`): 9 → 21 top-level commands

Two new files:

- `cmd_api.go` — `molecli api <METHOD> <PATH> [json-body]` raw
  escape hatch. Hits any endpoint without a typed wrapper.
- `cmd_ops.go` — typed subcommands (thin wrappers over shared
  `callAPI` helper) for operator ergonomics:
  - `ws restart|pause|resume` — lifecycle ops
  - `plugin registry|sources|list|available|install|uninstall`
  - `secret list|set|delete|list-global|set-global|delete-global`
  - `schedule list|add|remove|run|history`
  - `channel adapters|list|remove|send|test`
  - `approval pending|list|decide`
  - `delegation list|create`
  - `bundle export|import`
  - `org templates|import`
  - `traces <workspace-id>`
  - `activity list <workspace-id>`
  - `hma commit|search`

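
To illustrate the "thin wrapper over a shared helper" shape, here is a hedged sketch. The signature of `callAPI` below is an assumption (the real helper is not shown in this log), and `newDemoServer` is a fake platform that answers only `GET /health` with a made-up body, so the example can run without a real deployment.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// callAPI is a hypothetical version of the shared helper: it sends one
// request and returns the status code and response body as a string.
func callAPI(client *http.Client, baseURL, method, path string, body []byte) (int, string, error) {
	var reader io.Reader
	if body != nil {
		reader = bytes.NewReader(body)
	}
	req, err := http.NewRequest(method, baseURL+path, reader)
	if err != nil {
		return 0, "", err
	}
	if body != nil {
		req.Header.Set("Content-Type", "application/json")
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return resp.StatusCode, "", err
	}
	return resp.StatusCode, string(data), nil
}

// newDemoServer fakes a platform for the demo; the response body is invented.
func newDemoServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"status":"ok"}`)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newDemoServer()
	defer srv.Close()
	// Equivalent in spirit to: molecli api GET /health
	status, body, err := callAPI(srv.Client(), srv.URL, http.MethodGet, "/health", nil)
	fmt.Println(status, body, err)
}
```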
`go test ./cmd/cli/` passes; live smoke-test against running
platform: `api GET /health`, `plugin sources`, `org templates`,
`ws restart <bad-id>` all return expected responses.

Branch: `feat/mcp-molecli-full-coverage`.

## #65 fix — per-agent workspace_access in org.yaml + API

**Design from the ecosystem-research outcomes doc:** new
`workspace_access: none | read_only | read_write` field on every
workspace, enforced at container provision time via Docker's native
`:ro` bind-mount flag. Eliminates the "PM couriers documents to
reports" workaround by letting research agents have read-only repo
access without the write risk.

**Changes:**

- **Migration 019** — adds
  `workspace_access VARCHAR(20) NOT NULL DEFAULT 'none'` with CHECK
  constraint. Additive, all existing rows become 'none' (current
  isolated-volume behaviour preserved).
- **`provisioner.go`:**
  - New `WorkspaceAccess` field on `WorkspaceConfig`.
  - Constants `WorkspaceAccessNone`/`ReadOnly`/`ReadWrite`.
  - `buildWorkspaceMount(cfg)` — pure helper, selects between
    named-volume, rw bind, and `:ro` bind based on access +
    workspace_path.
  - `ValidateWorkspaceAccess(access, path)` — rejects `read_*`
    without a path, and rejects unknown values.
- **`handlers/workspace.go::Create`** and
  **`handlers/org.go::createOrgWorkspace`** — validate +
  persist `workspace_access` on INSERT. Response body echoes
  the stored value.
- **`handlers/workspace_provision.go::buildProvisionerConfig`** —
  reads `workspace_access` from DB (with payload override) and
  forwards to the provisioner. Restart paths preserve the mode.

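
The validation and mount-selection rules above might look roughly like this. It is a sketch under assumptions: `validateAccess` and `mountSpec` are illustrative stand-ins for `ValidateWorkspaceAccess` and `buildWorkspaceMount` (whose real signatures take other types), the volume name is hypothetical, and falling back to the named volume for `none` regardless of path is a guess. The `/workspace` target and the `:ro` suffix come from the notes.

```go
package main

import "fmt"

const (
	WorkspaceAccessNone      = "none"
	WorkspaceAccessReadOnly  = "read_only"
	WorkspaceAccessReadWrite = "read_write"
)

// validateAccess rejects read_* modes without a host path and any
// value outside the three allowed modes.
func validateAccess(access, path string) error {
	switch access {
	case WorkspaceAccessNone:
		return nil
	case WorkspaceAccessReadOnly, WorkspaceAccessReadWrite:
		if path == "" {
			return fmt.Errorf("workspace_access %q requires a workspace path", access)
		}
		return nil
	default:
		return fmt.Errorf("unknown workspace_access %q (allowed: none, read_only, read_write)", access)
	}
}

// mountSpec returns a docker -v style mount string for /workspace.
// volumeName is a hypothetical per-workspace named volume.
func mountSpec(access, path, volumeName string) string {
	switch {
	case access == WorkspaceAccessReadOnly && path != "":
		return path + ":/workspace:ro" // read-only bind mount
	case access == WorkspaceAccessReadWrite && path != "":
		return path + ":/workspace" // read-write bind mount
	default:
		return volumeName + ":/workspace" // isolated named volume
	}
}

func main() {
	fmt.Println(validateAccess("read_only", ""))
	fmt.Println(validateAccess("wildcard", "/repo"))
	fmt.Println(mountSpec("read_only", "/repo", "ws-demo"))
	fmt.Println(mountSpec("none", "", "ws-demo"))
}
```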
**Tests:**
- Provisioner: 2 new table-driven tests — `TestBuildWorkspaceMount_SelectionMatrix`
  (6 cases covering the full access × path matrix) and
  `TestValidateWorkspaceAccess` (7 cases).
- Handler INSERT WithArgs updated across 5 existing tests for the
  new 9th column.
- Full `go test ./... -race` green.

**Live E2E:**
- Migration auto-applied → `workspaces` table has `workspace_access`
  with the CHECK constraint.
- `POST /workspaces {"workspace_access":"read_only","workspace_dir":"/repo"}`
  → 201 with `"workspace_access":"read_only"` echoed; DB row correct.
- `POST {"workspace_access":"read_only"}` (no workspace_dir) → 400
  with clear error.
- `POST {"workspace_access":"wildcard"}` → 400 with allowed-values
  list.
- Container inspected after provision: `/workspace` mount has
  `RW=false Mode=ro`; `touch /workspace/foo` from inside returns
  `Read-only file system` → enforcement is real.

Branch: `feat/65-workspace-access-yaml`. Closes #65.

## #64 fix — agent registers delegations with platform (Option A)

**Root cause (confirmed in comment on #64):** `check_delegation_status`
reads from the agent's local `_delegations` dict; platform's
`GET /workspaces/:id/delegations` reads from `activity_logs`. The
agent's `delegate_to_workspace` MCP tool sends A2A directly and
never touches `activity_logs` — so the platform's view was always empty
for agent-initiated delegations.

**Fix (minimal Option A, dual-write):**

- Platform: two new endpoints on `DelegationHandler`:
  - `POST /workspaces/:id/delegations/record` — inserts a single
    `activity_logs` row with `method='delegate'`, `status='dispatched'`.
    No A2A fired (agent does that directly for OTEL/retry reasons).
  - `POST /workspaces/:id/delegations/:delegation_id/update` — accepts
    `status ∈ {completed, failed}` + optional error + preview. UPDATEs
    the original row and (on completion) INSERTs a `delegate_result`
    row matching the canvas-path flow.

- Agent (`workspace/builtin_tools/delegation.py`):
  - New best-effort async helpers `_record_delegation_on_platform`
    and `_update_delegation_on_platform`. Failures are logged at debug
    and swallowed — never block the actual A2A delegation path.
  - `_execute_delegation` calls `_record_...` at task start and
    `_update_...` on completion / failure (alongside the existing
    `_notify_completion`).

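
For the platform side, the request checks on the update endpoint could be sketched like this. Everything here is illustrative: `updateRequest`, `parseUpdate`, and the JSON key names `error` and `preview` are assumptions, as is requiring `:delegation_id` to be a UUID (the notes only confirm UUID rejection for the record endpoint).

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// uuidRe matches the canonical 8-4-4-4-12 hex form of a UUID.
var uuidRe = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// updateRequest is a hypothetical payload shape for
// POST /workspaces/:id/delegations/:delegation_id/update.
type updateRequest struct {
	Status  string `json:"status"`
	Error   string `json:"error,omitempty"`
	Preview string `json:"preview,omitempty"`
}

// parseUpdate validates the path parameter and body before any DB write.
func parseUpdate(delegationID string, body []byte) (updateRequest, error) {
	var req updateRequest
	if !uuidRe.MatchString(delegationID) {
		return req, fmt.Errorf("invalid delegation id %q", delegationID)
	}
	if err := json.Unmarshal(body, &req); err != nil {
		return req, fmt.Errorf("invalid JSON body: %w", err)
	}
	switch req.Status {
	case "completed", "failed":
		return req, nil
	default:
		return req, fmt.Errorf("status must be completed or failed, got %q", req.Status)
	}
}

func main() {
	id := "123e4567-e89b-12d3-a456-426614174000"
	req, err := parseUpdate(id, []byte(`{"status":"completed","preview":"done"}`))
	fmt.Println(req.Status, req.Preview, err)
	_, err = parseUpdate(id, []byte(`{"status":"paused"}`))
	fmt.Println(err)
}
```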
**Result:** agent keeps direct A2A for speed + OTEL trace-context
propagation + existing retry logic; platform's activity_logs mirrors
the same set the agent's local dict holds. `GET /delegations` now
returns rows for agent-initiated delegations.

**Tests:** 5 new Go tests (Record inserts + rejects invalid UUID,
UpdateStatus completed inserts result row + rejects unknown status +
failed broadcast). 4 new Python tests (record fires HTTP POST,
best-effort on platform error, update completed, update truncates
large preview to 500 chars). Python pytest 1060 → 1064; full Go suite
green.

Branch: `fix/64-agent-delegate-via-platform`. Closes #64.

## SDK — workspace / org / channel validators

**Issue:** SDK only validated plugins. Authors publishing
workspace-configs-templates, org-templates, or channel configs had no
lint step — errors only surfaced at `POST /org/import` or container
startup.

**Fix:** extended `sdk/python/molecule_plugin/` with three new modules:

- `workspace.py` — validates `config.yaml` (name, runtime, tier,
  runtime_config shape). `SUPPORTED_RUNTIMES` kept in sync with
  `provisioner.RuntimeImages`.
- `org.py` — recursively validates `org.yaml` (name, workspaces tree,
  workspace_access + workspace_dir pairing per #65, channels via
  delegated `validate_channel_config`, schedules, plugins, external+url,
  children).
- `channel.py` — validates channel configs (standalone dict or YAML
  file). `SUPPORTED_CHANNEL_TYPES` currently `{telegram}`; extend when
  Slack/Discord adapters land.

CLI (`python -m molecule_plugin validate {plugin|workspace|org|channel} <path>`)
dispatches to the right validator; bare `validate <path>` still defaults
to plugin for back-compat. Exit 0 on valid, 1 on any error.

`validate_channel_config` is the single source of truth for channel
schema — `org.py` delegates to it rather than duplicating checks.

**Tests:** `sdk/python/tests/test_validators.py` — 37 new tests (happy,
missing file, bad YAML, non-object, each field error, null-safety on
`runtime_config: None` / `defaults: null`, CLI dispatch for all 4 kinds,
back-compat form). Fixed bug found during test authoring: `org.py` crashed
on non-dict children; now guarded with `isinstance` check.

**Live smoke:** all 4 in-repo org templates (`free-beats-all`,
`reno-stars`, `molecule-dev`, `molecule-worker-gemini`) validate clean.

**SDK pytest:** 50 → 87. Branch: `feat/sdk-workspace-org-channel`.

---

## Top-5 #3 — parallel adapter builds

DevOps proposal from the ecosystem-research outcomes doc. All six
adapter Dockerfiles start `FROM workspace-template:base` and have no
inter-adapter dependency, so they're safe to build concurrently once
the base is done.

**Change** (`workspace/build-all.sh`):

- Serial path kept for single-runtime rebuilds and `SERIAL_BUILD=1`
  CI environments (preserves bounded-concurrency option).
- Parallel path: fan out one `docker build` per adapter, capture
  stdout/stderr to `/tmp/build_<tag>.log`, wait for all, tally
  per-tag success/failure. Failures still exit non-zero.

**E2E:** `bash build-all.sh claude-code deepagents langgraph`
finished in **43s wall-clock** (three adapter builds running
concurrently). Previously ~120s serial. Log files live under
`/tmp/build_*.log` for post-hoc debugging.

Branch: `feat/top5-3-parallel-adapter-builds`.