docs: remove internal-only docs from the public docs repo #78

Merged
documentation-specialist merged 1 commits from docs/remove-internal-docs into main 2026-06-02 18:14:07 +00:00
16 changed files with 1 additions and 2417 deletions
@@ -1,96 +0,0 @@
---
title: "Hermes Adapter — Shell Design Spec"
description: "Design spec for the Hermes runtime adapter — the BaseAdapter shell, provider map, and integration points."
---
# Hermes Adapter — Shell Design Spec
**Perspective:** DevOps Engineer + Backend Engineer
**Status:** Draft — pre-implementation
**Hermes source:** `NousResearch/hermes-agent` (~61k ⭐)
**Adapter runtime key:** `hermes`
---
## 1. Files Under `workspace/adapters/hermes/`
| File | Purpose |
|------|---------|
| `Dockerfile` | Extends `workspace-template:base`; installs `hermes-agent` Python SDK and its deps via pip at image build time |
| `requirements.txt` | Python package list — at minimum `hermes-agent`; pin to a specific release tag for reproducibility |
| `adapter.py` | `HermesAdapter(BaseAdapter)` — implements `name()`, `display_name()`, `description()`, `get_config_schema()`, `setup()`, `create_executor()`; delegates to `_common_setup()` for plugins/skills/tools |
| `__init__.py` | Exports `Adapter = HermesAdapter` — required by the adapter autodiscovery loader in `workspace/adapters/__init__.py` |
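
The `Adapter = HermesAdapter` export matters because a loader can then resolve any runtime key to a class without knowing the class name. The real loader in `workspace/adapters/__init__.py` is not shown in this spec, so the following is only a minimal sketch of the idea; `load_adapter` and its `package` parameter are hypothetical names.

```python
import importlib


def load_adapter(runtime: str, package: str = "adapters") -> type:
    """Resolve a runtime key such as "hermes" to its adapter class.

    Hypothetical sketch: assumes each adapter lives in `<package>.<runtime>`
    and exports a module-level `Adapter` attribute, as the table describes.
    """
    module = importlib.import_module(f"{package}.{runtime}")
    adapter_cls = getattr(module, "Adapter", None)
    if not isinstance(adapter_cls, type):
        raise ImportError(f"{module.__name__} does not export an `Adapter` class")
    return adapter_cls
```

Under these assumptions, `load_adapter("hermes")` would return `HermesAdapter` once the package is importable, and a package that forgets the export fails with a clear `ImportError` instead of an `AttributeError` later on.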
### `Dockerfile` sketch (no implementation — shape only)
```dockerfile
FROM workspace-template:base
COPY adapters/hermes/requirements.txt /tmp/hermes-requirements.txt
RUN pip install --no-cache-dir -r /tmp/hermes-requirements.txt
```
### `adapter.py` shape
```python
class HermesAdapter(BaseAdapter):
    @staticmethod
    def name() -> str:
        return "hermes"

    async def setup(self, config: AdapterConfig) -> None:
        # validate NOUS_API_KEY or OPENROUTER_API_KEY is set
        # call self._common_setup(config) for plugins/skills/tools
        ...

    async def create_executor(self, config: AdapterConfig) -> AgentExecutor:
        # wrap Hermes SDK session as an A2A AgentExecutor
        ...
```
---
## 2. Platform-Side Changes
### `workspace-server/internal/provisioner/provisioner.go` — `RuntimeImages` map
Add one entry to the existing map:
```go
var RuntimeImages = map[string]string{
	// ... existing entries ...
	"hermes": "workspace-template:hermes", // ← ADD THIS
}
```
No other platform Go changes are required for the minimal adapter shell. The `runtime` column in the `workspaces` table is a free-form string; no enum migration needed.
### `workspace/build-all.sh`
Add `hermes` to the adapter build loop so `build-all.sh` (and the `build-all.sh claude-code`-style single-runtime path) includes it:
```bash
ADAPTERS=(langgraph claude_code openclaw autogen hermes codex google-adk)
```
---
## 3. Required Environment Variables
| Name | Required | Description |
|------|----------|-------------|
| `NOUS_API_KEY` | Required (unless `OPENROUTER_API_KEY` set) | Nous Research Portal API key — primary model provider for Hermes; obtain from `nousresearch.com` |
| `OPENROUTER_API_KEY` | Optional | Fallback provider; lets operators use any Hermes-supported model via OpenRouter instead of Nous Portal |
| `HERMES_MODEL` | Optional | Model identifier (e.g. `nous-hermes-3`, `openrouter:anthropic/claude-sonnet-4-5`); adapter defaults to `nous-hermes-3` if unset |
| `HERMES_SKILLS_DIR` | Optional | Path inside the container where Hermes looks for skills; defaults to `/configs/skills` — consistent with the Claude Code and LangGraph adapters |
**Note:** `NOUS_API_KEY` and `OPENROUTER_API_KEY` must be set as workspace secrets via `POST /workspaces/:id/secrets`, not baked into the image. At least one of the two must be present at container start; `setup()` should `raise RuntimeError` early with a clear message if both are absent.
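
The fail-fast check described in the note could look like the following minimal sketch. `require_provider_key` is a hypothetical helper name, not part of `BaseAdapter`; it accepts the environment as a mapping so it can be exercised without touching `os.environ`.

```python
import os
from typing import Mapping, Optional

# Env var names come from the table above; the helper itself is hypothetical.
PROVIDER_KEYS = ("NOUS_API_KEY", "OPENROUTER_API_KEY")


def require_provider_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the name of the first provider key that is set and non-empty.

    Raises RuntimeError when neither key is present, so a misconfigured
    workspace fails during setup() rather than on the first model call.
    """
    env = os.environ if env is None else env
    for name in PROVIDER_KEYS:
        if env.get(name, "").strip():
            return name
    raise RuntimeError(
        "Hermes adapter needs NOUS_API_KEY or OPENROUTER_API_KEY; "
        "set one as a workspace secret via POST /workspaces/:id/secrets."
    )
```

Checking `NOUS_API_KEY` first mirrors the table, where it is the primary provider and OpenRouter is the fallback. Whitespace-only values are treated as unset, which is a design choice of this sketch rather than something the spec states.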
---
## 4. Smallest Viable Adapter — Scope Constraints
This spec covers the **shell only** — the minimum to make a Hermes workspace provision, boot, and accept A2A messages:
- No Hermes learning loop (skill self-improvement) in v1 — that requires persistent storage writes outside `/configs`; defer to a follow-up PR.
- No multi-messenger gateway integration — Hermes's Telegram/Discord/Slack channels are separate from Molecule AI's `/channels` feature; map these later via the channels adapter.
- No FTS5 memory backend — use Molecule AI's existing `commit_memory` / `search_memory` built-in tools for v1; Hermes-native memory can be layered in a subsequent PR.
- The executor wraps one Hermes agent session per workspace, matching the 1:1 workspace→agent model used by all other adapters.
@@ -1,78 +0,0 @@
---
title: "Hermes Adapter — Implementation Plan"
description: "Implementation plan for the Hermes runtime adapter, from SDK import path to adapter.py build steps."
---
# Hermes Adapter — Implementation Plan
**Author:** Dev Lead
**Date:** 2026-04-13
**Branch convention:** `feat/hermes-adapter-<step>` for each PR below
**Target:** Ship a minimal but functional Hermes workspace adapter in 4 PRs, each ≤200 lines changed.
---
## PR Sequence
### PR 1 — Docker image shell
**Title:** `feat(hermes): add workspace-template:hermes Docker image`
**Files touched:**
- `workspace/adapters/hermes/Dockerfile` (new)
- `workspace/adapters/hermes/requirements.txt` (new)
- `workspace/adapters/hermes/__init__.py` (new)
- `workspace/build-all.sh` (1-line addition)
**Description:** Adds the Hermes Docker image layer. `Dockerfile` extends `workspace-template:base` and installs `hermes-agent` (and declared deps) via pip at build time. `build-all.sh` gains `hermes` in the adapter list so `bash build-all.sh` and `bash build-all.sh hermes` both work. No Python adapter logic yet — just proves the image builds and that `import hermes` succeeds inside the container. CI: add `hermes` to the docker-build matrix.
---
### PR 2 — Python adapter + A2A executor
**Title:** `feat(hermes): implement HermesAdapter and A2A executor`
**Files touched:**
- `workspace/adapters/hermes/adapter.py` (new, ~80 lines)
- `workspace/tests/test_adapters.py` (extend existing test file, ~30 lines)
**Description:** Implements `HermesAdapter(BaseAdapter)` with `name()`, `display_name()`, `description()`, `get_config_schema()`, `setup()`, and `create_executor()`. `setup()` calls `_common_setup()` to load plugins/skills/tools identically to other adapters, then validates that `NOUS_API_KEY` or `OPENROUTER_API_KEY` is present and initialises a Hermes SDK session. `create_executor()` wraps the session as an `AgentExecutor`. Tests cover: adapter name/display_name contract, `setup()` raises `RuntimeError` when both API keys are absent, executor is returned after valid setup.
---
### PR 3 — Platform RuntimeImages entry
**Title:** `fix(provisioner): add hermes to RuntimeImages map`
**Files touched:**
- `workspace-server/internal/provisioner/provisioner.go` (1-line addition)
- `workspace-server/internal/provisioner/provisioner_test.go` (1-line addition in RuntimeImages coverage test)
**Description:** Adds `"hermes": "workspace-template:hermes"` to the `RuntimeImages` map. Without this entry the platform falls back to `workspace-template:langgraph` (wrong deps, agent fails to start). Test: extend the existing table-driven test that asserts every declared runtime resolves to a non-empty image tag.
---
### PR 4 — Integration docs + org template entry
**Title:** `docs(hermes): adapter usage guide and org template example`
**Files touched:**
- `docs/adapters/hermes-adapter-design.md` (update status from Draft → Implemented)
- `workspace-configs-templates/hermes/config.yaml` (new, ~20 lines — minimal config template)
- `org-templates/molecule-worker-gemini/org.yaml` or a new `molecule-hermes/` org template (optional, ~30 lines)
**Description:** Marks the design doc as implemented, adds a `workspace-configs-templates/hermes/config.yaml` so operators can create a Hermes workspace from the UI template picker, and optionally adds a minimal org template showing a Hermes-runtime team. Documents the three env vars (`NOUS_API_KEY`, `OPENROUTER_API_KEY`, `HERMES_MODEL`) in the config template comments.
---
## Sequencing Notes
- PRs 1 and 2 can overlap in development but PR 2 must merge after PR 1 (image must exist before adapter tests run in CI).
- PR 3 is a single-line change and can merge any time after PR 1 lands.
- PR 4 has no code risk; it can be drafted alongside PR 2 and merged last.
- Total estimated diff: ~180 lines of new code across all 4 PRs; well within the ≤200 lines/PR budget.
## Open Questions (resolve before PR 2)
1. **Hermes SDK import path** — confirm the pip package name and the Python import path (`import hermes`? `from hermes_agent import ...`?). Check `NousResearch/hermes-agent` README before writing adapter.py.
2. **Session persistence** — Hermes has a learning loop that writes skill files. Decide at PR 2 time whether to mount `/workspace` as the Hermes skills root or suppress auto-write in v1.
3. **Model default** — confirm the correct model identifier string for Nous Portal (e.g. `nous-hermes-3-70b` vs `hermes-3`); hardcode a safe default in `get_config_schema()`.
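
For the first open question, a quick probe inside the built image can show which import name actually resolves. This is a minimal sketch: the candidate names are the guesses from question 1 and are not confirmed, and `first_importable` is a hypothetical helper.

```python
import importlib.util
from typing import Iterable, Optional

# Unconfirmed guesses taken from open question 1.
CANDIDATE_MODULES = ("hermes", "hermes_agent")


def first_importable(candidates: Iterable[str] = CANDIDATE_MODULES) -> Optional[str]:
    """Return the first candidate module name that can be located, else None."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # find_spec can raise for broken or half-initialised modules.
            continue
    return None


if __name__ == "__main__":
    print(first_importable() or "no candidate module found")
```

`find_spec` only locates the module and does not execute it, so the probe stays cheap and has no side effects even if the package does heavy work at import time.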
@@ -1,264 +0,0 @@
---
title: "Hermes Agent — Adapter Reconnaissance"
description: "Reconnaissance of the NousResearch hermes-agent project as a candidate Molecule AI runtime adapter."
---
# Hermes Agent — Adapter Reconnaissance
Reconnaissance of [NousResearch/hermes-agent](https://github.com/NousResearch/hermes-agent) (v0.8.0, 68,713 ⭐, MIT) for potential Molecule AI adapter integration.
> **Status:** Design-only recon — no implementation.
---
## a) CLI Invocation
**Install** (curl-to-bash, targets Linux/macOS/WSL2/Termux):
```bash
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
```
The `hermes` binary in the repo root is a Python script (`#!/usr/bin/env python3`) that imports and calls `hermes_cli.main.main()`. After install it lands on `$PATH`.
**Minimal interactive session:**
```bash
hermes # launches TUI, auto-detects provider from env
hermes chat # explicit; same as bare `hermes`
hermes setup # one-time wizard: sets model, provider, API keys
```
**Key runtime flags:**
```bash
hermes chat \
  --model anthropic/claude-opus-4.6 \
  --provider openrouter \
  --toolsets terminal,file,web \
  --max-turns 60 \
  --query "build me a FastAPI app"

# Further flags that can be appended to the command above:
#   --resume             continue most recent session
#   --worktree           git-worktree isolation per session
#   --profile myprofile  load alternate HERMES_HOME profile
```
**One-shot (non-interactive):**
```bash
hermes chat --query "summarise this repo" --quiet
```
**Gateway (messaging platforms) start:**
```bash
hermes gateway start # daemonises; reads gateway config from config.yaml
hermes gateway status
hermes gateway stop
```
**OpenClaw migration:**
```bash
hermes claw migrate --dry-run # preview; drop --dry-run to execute
```
---
## b) Config Format
**Format:** YAML
**Primary path:** `~/.hermes/config.yaml` (default), overrideable via `HERMES_HOME` env var.
**Reference file in repo:** `cli-config.yaml.example`
**Minimal working config** (provider = OpenRouter, local terminal backend):
```yaml
# ~/.hermes/config.yaml
model:
  default: "anthropic/claude-opus-4.6"
  provider: "openrouter"  # required; "auto" if you want env-var detection
  base_url: "https://openrouter.ai/api/v1"
terminal:
  backend: "local"  # required; options: local | ssh | docker | singularity | modal | daytona
  cwd: "."
  timeout: 180
  lifetime_seconds: 300
memory:
  memory_enabled: true
  user_profile_enabled: true
  memory_char_limit: 2200
  user_char_limit: 1375
  nudge_interval: 10
agent:
  max_turns: 60
  reasoning_effort: "medium"  # xhigh | high | medium | low | minimal | none
```
**Required fields:** `model.default`, `model.provider`, `terminal.backend`.
Everything else has a hardcoded default.
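
A pre-flight check for the three required fields could be sketched as follows. It works on an already-parsed config (a plain nested dict, for example from `yaml.safe_load`) so the sketch needs no third-party import; `missing_required` is a hypothetical helper, not part of Hermes.

```python
from collections.abc import Mapping
from typing import Any, List

# Dotted paths taken from the "Required fields" line above.
REQUIRED_PATHS = ("model.default", "model.provider", "terminal.backend")


def missing_required(config: Mapping) -> List[str]:
    """Return the required dotted paths that are absent or empty in config."""
    missing = []
    for path in REQUIRED_PATHS:
        node: Any = config
        for key in path.split("."):
            if not isinstance(node, Mapping) or key not in node:
                node = None
                break
            node = node[key]
        if node is None or node == "":
            missing.append(path)
    return missing
```

An empty return value means the config satisfies the documented minimum; anything else names exactly which keys an operator still has to set.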
**Credentials** go in `~/.hermes/.env` (separate from config.yaml):
```bash
OPENROUTER_API_KEY=sk-or-...
ANTHROPIC_API_KEY=sk-ant-...
HERMES_HOME=~/.hermes # optional override
```
**Skills config** (in `config.yaml`):
```yaml
skills:
  creation_nudge_interval: 15  # remind agent to persist a skill every N tool iterations
  external_dirs:
    - ~/.agents/shared-skills  # read-only external skill dirs
```
**Compression config** (in `config.yaml`):
```yaml
compression:
  enabled: true
  threshold: 0.50
  summary_model: "google/gemini-3-flash-preview"
```
---
## c) Runtime Dependencies
**Python version:** 3.13 (Dockerfile base: `ghcr.io/astral-sh/uv:0.11.6-python3.13-trixie`)
**Package manager:** [uv](https://github.com/astral-sh/uv) (not pip directly; `uv pip install .`)
**Package version:** `hermes-agent==0.8.0`
**Top core pip dependencies** (from `pyproject.toml`):
| Package | Version constraint | Purpose |
|---|---|---|
| `openai` | `>=2.21.0,<3` | Primary LLM client (all providers via OpenAI-compat API) |
| `anthropic` | `>=0.39.0,<1` | Direct Anthropic API adapter |
| `python-dotenv` | `>=1.2.1,<2` | `.env` loading |
| `fire` | `>=0.7.1,<1` | CLI argument dispatch |
| `httpx[socks]` | `>=0.28.1,<1` | Async HTTP (gateway, webhooks) |
| `rich` | `>=14.3.3,<15` | TUI rendering |
| `pyyaml` | `>=6.0.2,<7` | Config file parsing |
| `pydantic` | `>=2.12.5,<3` | Data validation |
| `prompt_toolkit` | `>=3.0.52,<4` | Interactive TUI / multiline input |
| `tenacity` | `>=9.1.4,<10` | Retry logic |
**Key optional extras:**
```bash
pip install "hermes-agent[modal]" # modal>=1.0.0 — serverless backend
pip install "hermes-agent[daytona]" # daytona>=0.148.0 — cloud sandbox backend
pip install "hermes-agent[mcp]" # mcp>=1.2.0 — MCP server/client
pip install "hermes-agent[honcho]" # honcho-ai — cross-session user modeling
pip install "hermes-agent[messaging]" # telegram, discord.py, aiohttp, slack
pip install "hermes-agent[voice]" # faster-whisper, sounddevice, numpy
pip install "hermes-agent[rl]" # atroposlib, fastapi, uvicorn, wandb
```
**System binaries** (from Dockerfile `apt-get install`):
```
nodejs npm ripgrep ffmpeg gcc python3-dev libffi-dev procps build-essential
```
`ripgrep` is used by the `file` toolset for fast codebase search. `ffmpeg` is used for voice transcription pre-processing.
---
## d) Session State
**All persistent state lives under `HERMES_HOME`** (default: `~/.hermes/`, overrideable via env var).
**Primary state store: SQLite**
```
~/.hermes/state.db ← DEFAULT_DB_PATH = get_hermes_home() / "state.db"
```
- Schema version: **6** (`SCHEMA_VERSION = 6` in `hermes_state.py`)
- WAL mode (`PRAGMA journal_mode=WAL`) — supports concurrent gateway + CLI writers
- Three core tables: `schema_version`, `sessions`, `messages`
- **FTS5 virtual table** `messages_fts` with auto-sync triggers on INSERT/UPDATE/DELETE — backs the `session_search` toolset (full-text search across all past conversation content)
- Compression-triggered session splitting tracked via `parent_session_id` chain in `sessions` table
- Session source tagged as `'cli'`, `'telegram'`, `'discord'`, etc. for per-platform filtering
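
The FTS5 arrangement in the list above (a virtual table kept in sync by INSERT/UPDATE/DELETE triggers) follows the standard SQLite external-content pattern. The sketch below illustrates that pattern with Python's built-in `sqlite3`; the column names and trigger names are assumptions for illustration and are not taken from `hermes_state.py`. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Illustrative schema only; the real Hermes tables have more columns.
SCHEMA = """
CREATE TABLE messages (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    content TEXT NOT NULL
);
CREATE VIRTUAL TABLE messages_fts USING fts5(
    content, content='messages', content_rowid='id'
);
CREATE TRIGGER messages_ai AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER messages_ad AFTER DELETE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER messages_au AFTER UPDATE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the demo schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def search_sessions(conn: sqlite3.Connection, query: str) -> list:
    """Return session IDs of messages whose content matches the FTS5 query."""
    rows = conn.execute(
        "SELECT session_id FROM messages WHERE id IN "
        "(SELECT rowid FROM messages_fts WHERE messages_fts MATCH ?) "
        "ORDER BY id",
        (query,),
    )
    return [row[0] for row in rows]
```

Because the triggers mirror every write, callers only ever touch `messages`; the index never needs a separate rebuild step in normal operation.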
**Full directory layout:**
```
~/.hermes/
├── config.yaml ← get_config_path()
├── .env ← get_env_path()
├── state.db ← SQLite WAL, FTS5
├── skills/ ← get_skills_dir() — user-created skill SKILL.md files
├── logs/ ← get_logs_dir() — trajectory JSONs
│ └── session_YYYYMMDD_HHMMSS_<uuid>.json
├── MEMORY.md ← agent's curated notes (injected into system prompt)
├── USER.md ← user profile (injected into system prompt)
└── skins/ ← optional custom theme YAMLs
```
**State is persistent by default.** Session history, memories (`MEMORY.md`/`USER.md`), and skills survive restarts. The `session_reset` config controls when gateway sessions are cleared (default: `mode: both`, idle after 1440 min or at 4 AM daily). Before any reset, Hermes is given one flush turn to write important context to `MEMORY.md`.
Container backend state is controlled separately by `container_persistent: true/false` in the `terminal:` block.
---
## e) Execution Backends
**Six backends configured via a single `terminal.backend` key in `config.yaml`:**
| Backend | Where commands run | Key extra config |
|---|---|---|
| `local` | Host machine, current dir | — |
| `ssh` | Remote server | `ssh_host`, `ssh_user`, `ssh_key` |
| `docker` | Inside a Docker container | `docker_image`, `docker_mount_cwd_to_workspace` |
| `singularity` | Singularity/Apptainer container (HPC) | `singularity_image` |
| `modal` | Modal cloud sandbox (serverless) | `modal_image`, `pip install hermes-agent[modal]` |
| `daytona` | Daytona cloud sandbox | `daytona_image`, `container_disk`, `pip install hermes-agent[daytona]` |
**Architecture clarification:** Hermes's Python process **always runs locally** (or wherever you launched it). The `backend` setting controls only where the **`terminal` tool** executes shell commands. For `docker`, Hermes calls the Docker API to spawn/reuse a container and routes `terminal` tool calls into it via exec — Hermes itself is **not** containerised by this setting.
**Docker backend minimal config:**
```yaml
terminal:
  backend: "docker"
  cwd: "/workspace"  # path inside the container
  timeout: 180
  lifetime_seconds: 300
  docker_image: "nikolaik/python-nodejs:python3.11-nodejs20"
  docker_mount_cwd_to_workspace: false  # default: false (security off). Set true to bind-mount launch dir into /workspace
  docker_forward_env:
    - "GITHUB_TOKEN"
    - "NPM_TOKEN"
  container_cpu: 1
  container_memory: 5120  # MB
  container_disk: 51200  # MB
  container_persistent: true  # false = ephemeral container, wiped after session
```
**The Dockerfile** (for running *all of Hermes* inside Docker, distinct from the backend setting) uses:
```dockerfile
FROM debian:13.4
ENV HERMES_HOME=/opt/data
ENV PLAYWRIGHT_BROWSERS_PATH=/opt/hermes/.playwright
VOLUME /opt/data
ENTRYPOINT ["/opt/hermes/docker/entrypoint.sh"]
# Runs as non-root user hermes (UID 10000), home /opt/data
```
**Serverless hibernation** (Modal + Daytona): `container_persistent: false` produces fully ephemeral sandboxes that are destroyed after `lifetime_seconds`; `true` persists the container filesystem between sessions (warm-resume, no re-install overhead).
---
## f) Value Proposition
Integrating Hermes adds one capability that none of the other existing adapters (LangGraph, Claude Code, AutoGen, OpenClaw, Codex, Google ADK) deliver end-to-end: **a closed learning loop that compounds across sessions at the skill, memory, and user-model layers simultaneously.**

Concretely: after a complex task, Hermes autonomously creates a `SKILL.md` file in `~/.hermes/skills/` (prompted every `creation_nudge_interval=15` tool iterations), and those skills are re-injected as context in future sessions. Agents get better at tasks they've done before without any human curation step. The `session_search` toolset adds FTS5 + Gemini Flash summarization over `state.db`, so the agent can recall specific conversations from months ago with semantic-quality results.

Layered on top is **Honcho dialectic user modeling** (`plastic-labs/honcho`): a cross-session profile that tracks user communication style, preferences, and expectations, shared across any Honcho-integrated tool (not just Hermes).

Finally, the **Modal and Daytona serverless backends with `container_persistent`** give Molecule AI a path to hibernating, pay-per-use sandboxes that no existing adapter exposes, which is directly relevant to Molecule AI's multi-workspace billing model.

The `hermes claw migrate` command (backed by `optional-skills/migration/openclaw-migration/scripts/openclaw_to_hermes.py`) is also relevant: Molecule AI could offer equivalent migration tooling to attract OpenClaw's existing ~247k-user base. The **`agentskills.io` skill-manifest spec** (referenced in `optional-skills/`) should be reviewed before Molecule AI finalises its own plugin manifest schema, to ensure interoperability with what is rapidly becoming the de-facto file-based skill standard.
@@ -1,177 +0,0 @@
---
title: "MeDo Integration Design — Molecule AI Hackathon (May 20 2026)"
description: "Design for integrating the Baidu MeDo / Miaoda App Builder as an OpenClaw-runtime workspace, with A2A delegation and open questions."
---
# MeDo Integration Design — Molecule AI Hackathon (May 20 2026)
**Status:** Design — implementation pending operator sign-off on open questions (§5).
**Scope:** How the molecule-dev team builds MeDo apps for the "Build with MeDo" hackathon.
**Key constraint:** MeDo App Builder is an OpenClaw skill on ClawHub (`seiriosPlus/miaoda-app-builder`),
not a REST API. All interactions go through natural-language messages to an OpenClaw workspace.
---
## 1. Architecture Overview
```
CEO / Canvas
│ A2A task
PM (claude-code)
│ delegate_task_async → workspace: medo-builder
MeDo Builder workspace [runtime: openclaw, skill: miaoda-app-builder]
│ OpenClaw CLI → skill → api.miaoda.cn
MeDo platform (app created / published → URL returned)
│ result relayed via A2A event_queue
PM → CEO
```
The MeDo Builder workspace is a **dedicated OpenClaw-runtime workspace** inside the
molecule-dev org with the Miaoda App Builder skill pre-installed. PM delegates natural-language
app-build requests to it via `delegate_task_async` and polls for the result (5–8 min latency).
---
## 2. Installing the Miaoda App Builder Skill
### 2.1 API Key
The skill requires `MIAODA_API_KEY` (not `MEDO_API_KEY`).
> ⚠️ **Credential name mismatch**: the global platform secret is currently named `MEDO_API_KEY`.
> The skill's frontmatter declares `primaryEnv: MIAODA_API_KEY`. The MeDo Builder workspace must
> set `MIAODA_API_KEY` — either rename the global secret or add a workspace-level alias.
> See open question §5-A.
Obtain the key from: **MeDo website → Settings → API Keys**. Keys do not expire, but generating
a new one immediately invalidates the previous one.
### 2.2 Installation Query
OpenClaw installs skills by sending a natural-language install message to the agent.
No CLI command is documented on ClawHub — send this message to the OpenClaw workspace on first boot:
```
Install the Miaoda App Builder skill from ClawHub: seiriosPlus/miaoda-app-builder
```
OpenClaw auto-downloads the skill, installs Python runtime deps (`requests`), and makes the skill
available for subsequent messages.
### 2.3 Workspace Config Sketch (`org-templates/medo-builder/workspace.yaml`)
```yaml
name: MeDo Builder
role: Builds and publishes MeDo applications via the Miaoda App Builder OpenClaw skill
runtime: openclaw
tier: 2
required_env:
  - MIAODA_API_KEY      # TODO: resolve name vs platform secret MEDO_API_KEY (§5-A)
  - OPENROUTER_API_KEY  # OpenClaw needs an LLM provider
initial_prompt: |
  You are a MeDo App Builder. On startup:
  1. Install the Miaoda App Builder skill:
     "Install the Miaoda App Builder skill from ClawHub: seiriosPlus/miaoda-app-builder"
  2. Confirm installation succeeded.
  3. Wait for build tasks from PM via A2A.
  When you receive a build task, use natural language to instruct the skill:
  "Create a [description] app and publish it when done."
  App generation takes 5–8 minutes; poll the skill or wait for confirmation before reporting done.
```
---
## 3. A2A Delegation Pattern (5–8 Min Latency)
App generation is asynchronous and slow. PM **must** use `delegate_task_async` + `check_task_status`
rather than `delegate_task` (which has a shorter timeout and will return before the app is ready).
### 3.1 PM Delegation Flow
```python
import asyncio

# Runs inside an async function; delegate_task_async and check_task_status
# are the platform's built-in delegation tools.

# Step 1: fire and forget
task = await delegate_task_async(
    workspace_id="medo-builder-workspace-id",
    task="Build a restaurant reservation tool with online booking, menu display, "
         "and contact form. Publish when done and return the URL.",
)

# Step 2: poll every 60s (app takes 5–8 min)
while True:
    status = await check_task_status(task_id=task["task_id"])
    if status["status"] in ("completed", "failed"):
        break
    await asyncio.sleep(60)

result_url = status.get("result")  # MeDo app URL on success
```
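
The polling loop above has no upper bound, so a task that never reaches a terminal state would block PM indefinitely. A bounded variant might look like the sketch below. `poll_until_done` is a hypothetical helper, the 15-minute default timeout is an arbitrary assumption, and the status check is injected as a callable so the sketch does not depend on the real `check_task_status` signature.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict

TERMINAL_STATES = ("completed", "failed")


async def poll_until_done(
    check: Callable[[], Awaitable[Dict[str, Any]]],
    interval_s: float = 60.0,
    timeout_s: float = 15 * 60.0,
) -> Dict[str, Any]:
    """Call `check` every `interval_s` seconds until a terminal status appears.

    Raises TimeoutError if another sleep would overrun `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = await check()
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"task still not finished after {timeout_s:.0f}s")
        await asyncio.sleep(interval_s)
```

In the delegation flow this would be called as `await poll_until_done(lambda: check_task_status(task_id=task["task_id"]))`, letting PM report a timeout to the CEO instead of hanging.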
### 3.2 Invocation Patterns (verified from Baidu doc)
Natural-language messages the MeDo Builder workspace should accept from PM:
| Intent | Message to send to MeDo Builder workspace |
|--------|-------------------------------------------|
| List existing apps | `"Show me my apps"` |
| Create + auto-publish | `"Create a [description] and publish it when done"` |
| Create only | `"Create a [description]"` |
| Modify existing | `"Add a search function to app [name/ID]"` |
| Publish draft | `"Publish this app"` |
| Status check | `"Is the app generation done yet?"` |
---
## 4. Proposed Org Template — `org-templates/medo-builder/`
```
org-templates/medo-builder/
├── org.yaml ← minimal single-workspace org (not full team)
├── medo-builder/
│ ├── system-prompt.md ← MeDo Builder agent persona + delegation rules
│ └── workspace.yaml ← runtime: openclaw, skill install, env
```
**org.yaml sketch:**
```yaml
name: MeDo Builder
description: Single-workspace org for building MeDo apps (hackathon)
defaults:
  runtime: openclaw
  tier: 2
  required_env: [MIAODA_API_KEY, OPENROUTER_API_KEY]
workspaces:
  - name: MeDo Builder
    role: Builds and publishes MeDo applications via Miaoda App Builder skill
    files_dir: medo-builder
    canvas: { x: 400, y: 300 }
```
The medo-builder workspace is deployed **as a child of the molecule-dev PM** in the hackathon org,
not as a standalone org. Full `org-templates/medo-builder/` implementation is Week 2 scope.
---
## 5. Open Questions (Operator Resolution Required)
| # | Question | Why it blocks |
|---|----------|---------------|
| 5-A | **Credential name**: platform secret is `MEDO_API_KEY`; skill expects `MIAODA_API_KEY`. Rename global secret or add workspace alias? | Workspace boot will fail with "MIAODA_API_KEY not set" |
| 5-B | **Credit cost per app**: Baidu doc mentions a Credit System but content was not rendered. How many credits does create+generate+publish consume? Do we have enough for hackathon testing? | Budget planning |
| 5-C | **Rate limits**: no rate-limit info in docs or ClawHub page. What's the max concurrent app generations per API key? | Parallelism planning |
| 5-D | **Failure recovery**: what happens if the OpenClaw skill process crashes mid-generation (after Confirm & Generate, before Publish)? Is there a way to resume or check status by app ID? | Reliability design |
| 5-E | **Submission format**: does the hackathon judge the published MeDo app URL, the Molecule AI org config, or both? | Determines whether we need a polished demo org or just a working app |
---
## 6. Implementation Checklist (Weeks 1–3)
- [x] Week 1: This design doc (`docs/adapters/medo-integration.md`)
- [ ] Week 1: Resolve §5-A (credential name) + obtain API key credits estimate
- [ ] Week 2: `org-templates/medo-builder/` — full system-prompt + workspace.yaml
- [ ] Week 2: Integration test — PM delegates one real app build end-to-end
- [ ] Week 3: Polish demo org; rehearse submission flow; publish hackathon entry
@@ -1,117 +0,0 @@
---
title: "MeDo Smoke Test Log — 2026-04-13 (Run 4)"
description: "Smoke-test run log for the MeDo / Miaoda App Builder OpenClaw integration."
---
# MeDo Smoke Test Log — 2026-04-13 (Run 4)
**Tester:** PM (direct execution)
**Goal:** Install Miaoda App Builder skill → build "Hello Molecule AI" landing page → publish → URL.
**Credits spent:** 0 across all four runs.
---
## Run Summary
| Run | Blocker | Resolution |
|-----|---------|------------|
| 1 | `workspace-template:openclaw` image not built | ✅ Operator rebuilt image |
| 2 | Adapter key lookup ignores `AISTUDIO_API_KEY` / `QIANFAN_API_KEY` | ✅ Code fix committed (d779e16) |
| 3 | Executor creates fresh OpenClaw session per A2A message | ✅ Code fix committed (9466943) |
| 4 | `payloads: []` on every response — agent never returns text via `--json` mode | ❌ Root cause below |
---
## Run 4 — Detailed Findings
### Environment — all green
| Check | Result |
|-------|--------|
| Platform health | ✅ |
| `workspace-template:openclaw` image | ✅ boots in 31s |
| AISTUDIO_API_KEY + gemini-2.0-flash | ✅ confirmed in every response meta |
| Stable session ID (workspace ID) | ✅ `sessionKey: agent:main:explicit:a507780d-...` consistent across all calls |
### Messages Sent and Responses
| Message | Response | Duration |
|---------|----------|----------|
| Install skill | `payloads: [], livenessState: working` | 1.7s |
| Build Hello Molecule AI | `payloads: [], livenessState: working` | 0.8s |
| Check status (sessions_list) | `LLM request failed: provider rejected request schema/payload` | — |
| Reply with exactly: STATUS_OK | `payloads: [], livenessState: working` (after restart) | 1.8s |
The "Reply with exactly: STATUS_OK" response is decisive. A vanilla LLM call with no tool use should produce a text payload. It didn't. This rules out skill complexity or message ambiguity as the cause.
### Root Cause — `openclaw agent --json` Does Not Surface Agent Text in `payloads`
The OpenClaw agent processes messages using background session dispatch (`sessions_spawn` / `sessions_yield`). In this mode:
1. Main session receives message → immediately spawns background session → calls `sessions_yield`
2. `openclaw agent --json` exits with `payloads: [], livenessState: 'working'`
3. Background session processes the actual work and produces text — but only visible in interactive/streaming mode, not in the `--json` subprocess call
**Evidence:** Even "Reply with exactly: STATUS_OK" returns `payloads: []`. The agent is using background sessions for everything, including trivial echo requests.
**Likely cause:** OpenClaw's default `SOUL.md` / `BOOTSTRAP.md` workspace config instructs the agent to always use async session patterns. In a terminal session these background responses appear naturally; via subprocess `--json`, only the main session's synchronous output is captured.
### Transient issue: LLM request failed
After 3+ rapid A2A calls (install → build → status check), the Gemini AI Studio API returned a schema/payload rejection, likely a rate-limit or context-size rejection from Gemini. Restarting the workspace (`POST /workspaces/:id/restart`) resolved it: the workspace came back in 30s and behaved normally on the next call.
---
## 4. Required Fix — OpenClawA2AExecutor Response Capture
The executor must retrieve the agent's text response from session history **after** the main session yields. The `sessions_history` CLI command (exposed as `session_history` tool) retrieves past messages.
**Proposed change** to `workspace/adapters/openclaw/adapter.py` (`execute()` method):
```python
# After proc.communicate() returns with payloads=[]:
if not reply or reply.startswith("{'payloads': []"):
    # Agent yielded without responding: fetch last message from session history
    await asyncio.sleep(2)  # brief wait for background session to complete short tasks
    hist_proc = await asyncio.create_subprocess_exec(
        "openclaw", "sessions", "history",
        "--session-id", self._session_id,
        "--limit", "1", "--json",
        stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE,
        env={**os.environ, "PATH": f"{os.path.expanduser('~/.local/bin')}:{os.environ.get('PATH', '')}"},
    )
    hist_stdout, _ = await asyncio.wait_for(hist_proc.communicate(), timeout=15)
    hist_data = json.loads(hist_stdout.decode().strip() or "{}")
    last_msg = (hist_data.get("messages") or [{}])[-1]
    reply = last_msg.get("content", reply)  # fall back to original if no history
```
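
The `startswith` check above only matches one exact string form of the empty response. A more tolerant check could parse the output first, as in this minimal sketch. `needs_history_fallback` is a hypothetical helper, and the only field it relies on is the `payloads` key seen in the run log; whether `reply` arrives as JSON or as a Python dict repr is an assumption, so both are accepted.

```python
import ast
import json
from typing import Optional


def parse_agent_output(raw: str) -> Optional[dict]:
    """Parse agent output as JSON, or as a Python dict literal, else None."""
    text = raw.strip()
    if not text:
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        try:
            data = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            return None
    return data if isinstance(data, dict) else None


def needs_history_fallback(raw: str) -> bool:
    """True when the agent yielded without returning any payload."""
    if not raw.strip():
        return True
    data = parse_agent_output(raw)
    if data is None:
        # Unparseable output is treated as a plain-text reply.
        return False
    return "payloads" in data and not data["payloads"]
```

Treating unparseable output as a genuine reply keeps the fallback from firing on ordinary text answers, at the cost of missing malformed empty responses.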
**Note on long tasks (5–8 min builds):** Session history won't have the build result until it completes. For Miaoda App Builder, PM must poll: send a follow-up "What is the status of the Hello Molecule AI app build?" message every 60s until the response contains a URL or error.
---
## 5. Open Questions Status
### 5-C — Rate limits
**UNKNOWN.** Never reached skill invocation.
*New data:* Gemini AI Studio hit a schema/payload rejection after 3 rapid calls. This may be a Gemini-specific issue with large tool schemas (OpenClaw's `cron` schema is 6311 chars). Worth filing separately.
### 5-D — Failure recovery
**UNKNOWN.** Never reached app generation.
---
## 6. Issues to File
| # | Issue | Status | Location |
|---|-------|--------|----------|
| A | `fix(openclaw): use stable workspace session ID` | ✅ fixed in 9466943 | adapter.py |
| B | `fix(openclaw): extend key lookup for AISTUDIO/QIANFAN` | ✅ fixed in d779e16 | adapter.py |
| C | `fix(provisioner): surface Docker errors in last_sample_error` | ❌ open | provisioner.go |
| **D** | **`fix(openclaw): capture agent response via session history when payloads=[]`** | ❌ open — see §4 | adapter.py |
| **E** | **`fix(openclaw): Gemini rejects request after N rapid calls with large tool schema`** | ❌ open — investigate cron schema size | adapter.py |
---
## 7. Next Steps (before Run 5)
- [ ] **Dev Lead:** Implement §4 session-history fallback in `OpenClawA2AExecutor.execute()`
- [ ] **Dev Lead (optional):** Trim `cron` tool schema to reduce Gemini schema-size rejection risk
- [ ] **Operator:** Rebuild image: `bash workspace/build-all.sh openclaw`
- [ ] **PM (Run 5):** Re-run smoke test — expected to finally reach skill install confirmation
@@ -1,112 +0,0 @@
---
title: "ADR-001: Admin endpoints accept any workspace bearer token"
description: "ADR-001: why admin endpoints validate any workspace bearer token, and the AdminAuth lockdown that followed."
---
# ADR-001: Admin endpoints accept any workspace bearer token
**Status:** Accepted — known risk, Phase-H remediation planned
**Date:** 2026-04-17
**Issue:** #684
**Tracking:** Phase-H — #710
## Context
The `AdminAuth` middleware validates callers by calling `ValidateAnyToken`, which
accepts any live workspace bearer token regardless of which workspace issued it.
There is no separation between workspace-scoped tokens (issued to individual
agents) and admin-scoped tokens (intended for platform operators).
This means any workspace agent that has been issued a token can reach every
admin-gated route on the platform.
## Decision
Proper token-tier separation (workspace vs. admin scope) is deferred to Phase-H.
The known risk is explicitly accepted. Mitigation controls are documented below.
## Blast radius — affected admin endpoints
A compromised workspace token grants unauthenticated-equivalent access to all
of the following:
| Endpoint | Impact |
|----------|--------|
| `GET /admin/workspaces/:id/test-token` | Mint a fresh bearer token for any workspace |
| `DELETE /workspaces/:id` | Delete any workspace and auto-revoke its tokens |
| `PUT /settings/secrets` / `POST /admin/secrets` | Overwrite any global secret (env-poisons every agent on restart) |
| `DELETE /settings/secrets/:key` / `DELETE /admin/secrets/:key` | Delete any global secret; same fan-out restart |
| `GET /settings/secrets` / `GET /admin/secrets` | Read all global secret keys (values masked, but key enumeration enables targeted attacks) |
| `GET /workspaces/:id/budget` + `PATCH /workspaces/:id/budget` | Read or clear any workspace's token budget |
| `GET /events` / `GET /events/:workspaceId` | Read the full structural event log across all workspaces |
| `POST /bundles/import` | Import an arbitrary workspace bundle — creates workspaces, injects secrets, overwrites configs |
| `GET /bundles/export/:id` | Exfiltrate full workspace bundle including config, secrets references, and files |
| `POST /org/import` | Instantiate an entire org template — creates multiple workspaces with arbitrary roles and secrets |
| `GET /org/templates` | Enumerate all org template names and their configured roles/system prompts |
| `POST /templates/import` | Write arbitrary files into `configsDir` (workspace template injection) |
| `GET /templates` | Enumerate all template names and metadata |
| `GET /admin/liveness` | Read platform subsystem health (ops intel) |
| `GET /admin/schedules/health` | Read cron scheduler health across all workspaces |
## Risk statement
**A single compromised workspace agent can achieve full platform takeover via
admin endpoints.**
Attack chain example:
1. Agent A's token is exfiltrated (e.g. via a prompt-injection in a delegated task).
2. Attacker calls `PUT /settings/secrets` to overwrite `CLAUDE_API_KEY` with a
controlled value.
3. Every non-paused workspace restarts and loads the poisoned key.
4. Attacker now controls the LLM backend for the entire platform.
Alternatively: call `POST /bundles/import` with a crafted bundle to inject a
malicious workspace with a pre-configured `initial_prompt` and elevated secrets.
## Current mitigations
- **Workspace isolation** — `CanCommunicate()` in the A2A proxy limits which
workspaces can send tasks to which, reducing the blast radius of a single
compromised agent during normal operation.
- **Audit logging** — PR #651 writes all admin-route calls to `structure_events`.
Forensic recovery is possible after the fact.
- **`ValidateAnyToken` removed-workspace JOIN** — tokens belonging to deleted
workspaces are filtered at the DB layer (PR #682 defense-in-depth) so
post-deletion token replay is blocked.
- **`MOLECULE_ENV=production` gate** — hides the `/admin/workspaces/:id/test-token`
endpoint in production deployments unless `MOLECULE_ENABLE_TEST_TOKENS=1`.
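The last mitigation amounts to a small predicate over the environment. The sketch below is illustrative rather than the server's actual code: the function name is hypothetical, and treating an unset `MOLECULE_ENV` as non-production is an assumption.

```python
from typing import Mapping


def is_test_token_route_enabled(env: Mapping[str, str]) -> bool:
    """Decide whether /admin/workspaces/:id/test-token should be served."""
    # Explicit opt-in wins, even in production.
    if env.get("MOLECULE_ENABLE_TEST_TOKENS") == "1":
        return True
    # Assumption: anything other than "production" (including unset) is a dev env.
    return env.get("MOLECULE_ENV") != "production"
```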
## Phase-H remediation plan
Tracked in GitHub issue **#710**.
### Schema change
Add a `token_type` column to `workspace_auth_tokens`:
```sql
ALTER TABLE workspace_auth_tokens
ADD COLUMN IF NOT EXISTS token_type TEXT NOT NULL DEFAULT 'workspace'
CHECK (token_type IN ('workspace', 'admin'));
```
Admin tokens are minted only via a dedicated privileged endpoint that itself
requires an existing admin token or a one-time bootstrap secret.
### Middleware update
- `WorkspaceAuth` — continue accepting `token_type = 'workspace'` only.
- `AdminAuth` — require `token_type = 'admin'`. Workspace tokens rejected.
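A minimal sketch of the two checks, assuming the `token_type` values from the schema change above. The function names and the workspace-ID comparison are hypothetical stand-ins for the real middleware.

```python
from enum import Enum


class TokenType(str, Enum):
    WORKSPACE = "workspace"
    ADMIN = "admin"


def admin_auth_allows(token_type: str) -> bool:
    """AdminAuth after Phase-H: only admin-tier tokens pass."""
    return token_type == TokenType.ADMIN.value


def workspace_auth_allows(
    token_type: str, token_workspace_id: str, route_workspace_id: str
) -> bool:
    """WorkspaceAuth: workspace-tier token bound to the workspace in the route."""
    return (
        token_type == TokenType.WORKSPACE.value
        and token_workspace_id == route_workspace_id
    )
```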
### Bootstrap flow
On first boot (no tokens exist), a single-use bootstrap secret is printed to
the server log. The operator uses it to mint the first admin token. Subsequent
admin tokens are minted by existing admin token holders. The fail-open path in
`HasAnyLiveTokenGlobal` is retired once Phase-H ships.
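One way to picture the single-use property of the bootstrap secret is the sketch below. `BootstrapSecret` and its 32-byte hex format are assumptions for illustration; the ADR does not specify how the secret is generated or stored.

```python
import hmac
import secrets


class BootstrapSecret:
    """A one-time secret: the first correct redemption succeeds, later ones fail."""

    def __init__(self) -> None:
        self._value = secrets.token_hex(32)  # assumption: 32 random bytes, hex-encoded
        self._used = False

    @property
    def value(self) -> str:
        # In the ADR's flow this is what would be printed to the server log.
        return self._value

    def redeem(self, presented: str) -> bool:
        if self._used:
            return False
        # Constant-time comparison to avoid leaking the secret via timing.
        if not hmac.compare_digest(presented.encode(), self._value.encode()):
            return False
        self._used = True
        return True
```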
### Migration path
Phase-H is a breaking change for any automation that currently uses workspace
tokens against admin endpoints. A migration guide and a `MOLECULE_PHASE_H=1`
feature flag will be provided so operators can opt in before the strict
enforcement date.
@@ -1,125 +0,0 @@
---
title: API Reference
description: Full REST API reference for the Molecule AI workspace server — workspace management, A2A communication, file operations, secrets, tokens, and more.
---
# API Reference
This document describes the REST API exposed by the Molecule AI workspace server (Go/Gin, default port `:8080`). Clients include the Canvas frontend, workspace agents communicating over A2A, and external tooling such as the MCP server and CLI.
**Base URL:** `http://localhost:8080` (development default)
**Rate limit:** 600 req/min (configurable via `RATE_LIMIT`)
**CORS origins:** `http://localhost:3000,http://localhost:3001` by default (configurable via `CORS_ORIGINS`)
---
## Authentication
Three middleware classes gate server-side routes:
- **`AdminAuth`** — strict bearer-only. Required for any route that can leak prompts/memory, create/mutate workspaces, or expose ops intel. Lazy-bootstrap fail-open when no live tokens exist globally.
- **`WorkspaceAuth`** — binds a bearer token to a specific workspace `:id`. A token for workspace A cannot be used against workspace B's sub-routes.
- **`CanvasOrBearer`** — accepts a bearer token OR a request Origin matching `CORS_ORIGINS`. Used only for cosmetic routes with zero data/security impact (currently `PUT /canvas/viewport` only). Do not extend to routes that leak data or create resources.
Full contract: `docs/runbooks/admin-auth.md`.
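As a rough illustration of the `CanvasOrBearer` rule, the sketch below accepts either a valid bearer token or an allowlisted Origin. The function names and the `is_valid_token` callback are hypothetical; only the comma-separated `CORS_ORIGINS` format is taken from the defaults listed above.

```python
from typing import Callable, Optional, Set


def parse_cors_origins(raw: str) -> Set[str]:
    """Split a comma-separated CORS_ORIGINS value into a set of origins."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def canvas_or_bearer_allows(
    bearer: Optional[str],
    origin: Optional[str],
    cors_origins: str,
    is_valid_token: Callable[[str], bool],
) -> bool:
    # Path 1: a bearer token that the (hypothetical) validator accepts.
    if bearer and is_valid_token(bearer):
        return True
    # Path 2: no usable bearer, but the Origin header is on the allowlist.
    return origin is not None and origin in parse_cors_origins(cors_origins)
```

Because an Origin header is trivially forgeable outside a browser, this check is only suitable for the cosmetic routes the section restricts it to.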
---
## Routes
| Method | Path | Handler |
|--------|------|---------|
| GET | /health | inline |
| GET | /metrics | metrics.Handler() — Prometheus text format; no auth, scrape-safe |
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go — `GET /workspaces`, `POST /workspaces`, and `DELETE /workspaces/:id` require `AdminAuth`. `PATCH /workspaces/:id` enforces field-level authz: cosmetic fields (name, role, x, y, canvas) pass through; sensitive fields (tier, parent_id, runtime, workspace_dir) require a valid bearer token when any live token exists. |
| GET/PATCH | /workspaces/:id/config | workspace.go |
| GET/POST | /workspaces/:id/memory | workspace.go |
| DELETE | /workspaces/:id/memory/:key | workspace.go |
| POST/PATCH/DELETE | /workspaces/:id/agent | agent.go |
| POST | /workspaces/:id/agent/move | agent.go |
| GET/POST/PUT | /workspaces/:id/secrets | secrets.go (POST/PUT auto-restarts workspace) |
| DELETE | /workspaces/:id/secrets/:key | secrets.go (DELETE auto-restarts workspace) |
| GET | /workspaces/:id/model | secrets.go |
| GET | /settings/secrets | secrets.go — list global secrets (keys only, values masked) |
| PUT/POST | /settings/secrets | secrets.go — set a global secret `{key, value}`; auto-restarts every non-paused/non-removed/non-external workspace that does not shadow the key with a workspace-level override |
| DELETE | /settings/secrets/:key | secrets.go — delete a global secret; same auto-restart fan-out as PUT/POST |
| GET | /admin/workspaces/:id/test-token | admin_test_token.go — mint a fresh bearer token for E2E scripts; returns 404 unless `MOLECULE_ENV != production` or `MOLECULE_ENABLE_TEST_TOKENS=1` |
| GET/POST/DELETE | /admin/secrets[/:key] | secrets.go — legacy aliases for /settings/secrets |
| WS | /workspaces/:id/terminal | terminal.go |
| POST | /workspaces/:id/expand | team.go |
| POST | /workspaces/:id/collapse | team.go |
| POST/GET | /workspaces/:id/approvals | approvals.go |
| POST | /workspaces/:id/approvals/:id/decide | approvals.go |
| GET | /approvals/pending | approvals.go |
| POST/GET | /workspaces/:id/memories | memories.go |
| DELETE | /workspaces/:id/memories/:id | memories.go |
| GET | /workspaces/:id/traces | traces.go |
| GET/POST | /workspaces/:id/activity | activity.go |
| POST | /workspaces/:id/notify | activity.go (agent→user push message via WebSocket) |
| POST | /workspaces/:id/restart | workspace.go |
| POST | /workspaces/:id/pause | workspace.go (stops container, status→paused) |
| POST | /workspaces/:id/resume | workspace.go (re-provisions paused workspace) |
| POST | /workspaces/:id/a2a | workspace.go |
| POST | /workspaces/:id/delegate | delegation.go (async fire-and-forget) |
| GET | /workspaces/:id/delegations | delegation.go (list delegation status) |
| GET/POST | /workspaces/:id/schedules | schedules.go (cron CRUD) |
| PATCH/DELETE | /workspaces/:id/schedules/:scheduleId | schedules.go |
| POST | /workspaces/:id/schedules/:scheduleId/run | schedules.go (manual trigger) |
| GET | /workspaces/:id/schedules/:scheduleId/history | schedules.go (past runs) |
| GET/POST | /workspaces/:id/channels | channels.go (social channel CRUD) |
| PATCH/DELETE | /workspaces/:id/channels/:channelId | channels.go |
| POST | /workspaces/:id/channels/:channelId/send | channels.go (outbound message) |
| POST | /workspaces/:id/channels/:channelId/test | channels.go (test connection) |
| GET | /channels/adapters | channels.go (list available platforms) |
| POST | /channels/discover | channels.go (auto-detect chats for a bot token) |
| POST | /webhooks/:type | channels.go (incoming social webhook) |
| GET | /workspaces/:id/shared-context | templates.go |
| GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go |
| GET | /canvas/viewport | viewport.go — open, no auth required (cosmetic, bootstrap-friendly) |
| PUT | /canvas/viewport | viewport.go — `CanvasOrBearer` middleware; accepts bearer OR Origin matching `CORS_ORIGINS`. Cosmetic-only route — worst case viewport corruption, recovered by page refresh. |
| GET | /templates | templates.go |
| POST | /templates/import | templates.go — `AdminAuth` required |
| POST | /registry/register | registry.go |
| POST | /registry/heartbeat | registry.go — requires `Authorization: Bearer <token>` once a workspace has any live token on file (legacy workspaces grandfathered) |
| POST | /registry/update-card | registry.go — requires `Authorization: Bearer <token>` once a workspace has any live token on file |
| GET | /registry/discover/:id | discovery.go — requires `X-Workspace-ID` + bearer token on the caller side |
| GET | /registry/:id/peers | discovery.go — requires `X-Workspace-ID` + bearer token on the caller side |
| POST | /registry/check-access | discovery.go |
| GET | /plugins | plugins.go (list registry; supports `?runtime=` filter) |
| GET | /plugins/sources | plugins.go (list registered install-source schemes) |
| GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install (`{"source":"scheme://spec"}`), uninstall per-workspace |
| GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) |
| GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) |
| GET/POST | /workspaces/:id/tokens | tokens.go — list active tokens (prefix + metadata), create new token (plaintext returned once). Max 50 per workspace. |
| DELETE | /workspaces/:id/tokens/:tokenId | tokens.go — revoke specific token by ID |
| GET | /bundles/export/:id | bundle.go — `AdminAuth` required |
| POST | /bundles/import | bundle.go — `AdminAuth` required |
| GET | /org/templates | org.go (list available org templates) |
| POST | /org/import | org.go — `AdminAuth` required; applies `resolveInsideRoot` path sanitiser on template paths |
| GET | /events | events.go — `AdminAuth` required |
| GET | /events/:workspaceId | events.go — `AdminAuth` required |
| GET | /admin/liveness | inline — `AdminAuth` required. Returns per-subsystem `supervised.Snapshot()` ages; use to check health of scheduler/heartbeat goroutines |
| GET | /ws | socket.go |
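The field-level rule on `PATCH /workspaces/:id` in the table above can be sketched as follows. The field list comes from that row; the function name and the way "any live token exists" is passed in are assumptions for the example.

```python
from typing import Mapping

# Sensitive fields named in the PATCH /workspaces/:id row.
SENSITIVE_FIELDS = frozenset({"tier", "parent_id", "runtime", "workspace_dir"})


def patch_requires_bearer(patch_body: Mapping[str, object], any_live_token: bool) -> bool:
    """Return True when the PATCH must carry a valid bearer token.

    Cosmetic-only patches pass through; sensitive fields need a token
    once any live token exists.
    """
    touches_sensitive = any(field in SENSITIVE_FIELDS for field in patch_body)
    return touches_sensitive and any_live_token
```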
---
## Database
Migration files live in `workspace-server/migrations/` (latest: `022_workspace_schedules_source`). Each migration ships as a `.up.sql`/`.down.sql` pair. The migration runner globs `*.sql`, filters out `.down.sql` files, sorts alphabetically, and executes each file on boot. All `.up.sql` files must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... IF NOT EXISTS`) because the runner re-applies every migration on every boot.
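The selection step of the runner (glob, filter, sort) might look like this in outline. It is a sketch over plain filenames, not the Go runner itself, and `select_up_migrations` is a hypothetical name.

```python
import fnmatch
from typing import Iterable, List


def select_up_migrations(filenames: Iterable[str]) -> List[str]:
    """Mimic the runner: glob *.sql, drop .down.sql files, sort alphabetically."""
    sql_files = [name for name in filenames if fnmatch.fnmatchcase(name, "*.sql")]
    up_files = [name for name in sql_files if not name.endswith(".down.sql")]
    return sorted(up_files)
```

Since every selected file runs again on each boot, the alphabetical order and the idempotency requirement are what keep repeated application safe.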
### Key Tables
| Table | Description |
|-------|-------------|
| `workspaces` | Core entity — status, runtime, `agent_card` JSONB, heartbeat columns, `current_task`, `awareness_namespace`, `workspace_dir` |
| `canvas_layouts` | Per-workspace x/y canvas position |
| `structure_events` | Append-only event log (workspace lifecycle, agent, approval events) |
| `activity_logs` | A2A communications, task updates, agent logs, errors. `error_detail` is populated by the scheduler so cron run history can surface failure reasons. |
| `workspace_schedules` | Cron tasks — expression, timezone, prompt, run history, `source` (`'template'` for org/import-seeded, `'runtime'` for Canvas/API-created), `last_status` (includes `'skipped'` when the scheduler concurrency-skips a busy workspace) |
| `workspace_channels` | Social channel integrations (Telegram, Slack, etc.) with JSONB config and allowlist |
| `agents` | Agent records |
| `workspace_secrets` | Per-workspace encrypted secrets |
| `global_secrets` | Platform-wide encrypted secrets |
| `workspace_auth_tokens` | Bearer tokens; auto-revoked on workspace delete |
| `agent_memories` | HMA scoped memory (LOCAL / TEAM / GLOBAL) |
| `approvals` | Human-in-the-loop approval requests |
@@ -1,83 +0,0 @@
---
title: "Canary release pipeline"
description: "The canary release pipeline that ships workspace-server changes to the prod tenant fleet, and how to halt it."
---
# Canary release pipeline
How a workspace-server code change reaches the prod tenant fleet — and how to stop it if something's wrong.
## The loop
```
PR merged to staging → main
        │
        ▼
publish-workspace-server-image.yml   ← pushes :staging-<sha> ONLY
        │                              (NOT :latest — prod is untouched)
        ▼
Canary tenants auto-update to :staging-<sha>
        │                              (5-min auto-updater cycle on each canary EC2)
        ▼
canary-verify.yml waits 6 min, runs scripts/canary-smoke.sh
        │
        ├─► GREEN → crane tag :staging-<sha> → :latest
        │              │
        │              ▼
        │           Prod tenants auto-update within 5 min
        │
        └─► RED → :latest stays on prior good digest
                  GitHub Step Summary flags the rejected sha
                  Ops fixes forward OR rolls back manually
```
## Canary fleet
Lives in a separate AWS account (`molecule-canary`, `004947743811`) via an assumed role (`MoleculeStagingProvisioner`). The CP's `is_canary` org flag routes provisioning there; every other org goes to the default staging account. See `docs/architecture/saas-prod-migration-2026-04-19.md` for the account bootstrap.
Canary tenants are configured to pull `:staging-<sha>` (not `:latest`) via `TENANT_IMAGE` on their provisioner, so they ingest each new build before prod does.
## Smoke suite
`scripts/canary-smoke.sh` hits each canary tenant (URL + ADMIN_TOKEN pair) and asserts:
- `/admin/liveness` returns a subsystems map (tenant booted, AdminAuth reachable)
- `/workspaces` returns a JSON array (wsAuth + DB healthy)
- `/memories/commit` + `/memories/search` round-trip (encryption + scrubber)
- `/events` admin read (C4 fail-closed proof)
- `/admin/liveness` without bearer → 401 (C4 regression gate)
Expand by editing the script — each `check "name" "expected" "$response"` call is one line.
## Adding a canary tenant
1. `POST /cp/orgs` — create the org normally (is_canary defaults to false)
2. `POST /cp/admin/orgs/<slug>/canary` with `{"is_canary": true}` — admin only, refuses to flip if already provisioned
3. Re-trigger provision (or delete + recreate if the org was already provisioned into staging) — the fresh EC2 lands in account `004947743811`
Then set repo secrets:
- `CANARY_TENANT_URLS` — append the new tenant's URL
- `CANARY_ADMIN_TOKENS` — append its ADMIN_TOKEN in the same position
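Because the two secrets are matched by position, a mismatch in length silently pairs the wrong token with a tenant. The sketch below shows one way to pair and validate them; the comma separator and the function name are assumptions, since the source does not state the delimiter.

```python
from typing import List, Tuple


def pair_canary_tenants(urls_raw: str, tokens_raw: str, sep: str = ",") -> List[Tuple[str, str]]:
    """Pair each tenant URL with the admin token at the same position."""
    urls = [u.strip() for u in urls_raw.split(sep) if u.strip()]
    tokens = [t.strip() for t in tokens_raw.split(sep) if t.strip()]
    if len(urls) != len(tokens):
        # Fail loudly rather than testing a tenant with another tenant's token.
        raise ValueError(f"{len(urls)} URLs but {len(tokens)} tokens")
    return list(zip(urls, tokens))
```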
## Rolling back `:latest`
When canary was green but something surfaces post-promotion, retag `:latest` to a prior digest:
```bash
export GITHUB_TOKEN=ghp_... # write:packages
scripts/rollback-latest.sh 4c1d56e # retags both platform + tenant images
```
`scripts/rollback-latest.sh` pre-checks that `:staging-<sha>` exists before moving `:latest`, and verifies the digest after the move. Prod tenants pick up the rolled-back image on their next 5-min auto-update.
A post-mortem should always include:
- the commit sha that broke
- why canary didn't catch it (new code path the smoke suite doesn't exercise?)
- whether the smoke suite should grow a new check to prevent the same class of bug
## What this gate doesn't catch
- Bugs that only surface under prod-only data (customer workloads with scale or shape canary doesn't produce). Canary uses real traffic shapes but can't simulate weeks of accumulated state.
- Config drift between canary and prod (different env-var values, different feature flags). Keep canary's config deltas minimal and documented.
- Cross-tenant interactions — canary tenants run in their own AWS account, so a bug that only appears when two tenants compete for a shared resource won't reproduce here.
When these miss, `rollback-latest.sh` is the escape hatch.
@@ -1,76 +0,0 @@
---
title: "SaaS prod migration — 2026-04-19"
description: "Prod cutover notes for the 2026-04-19 staging→main promotion of molecule-controlplane and molecule-core."
---
# SaaS prod migration — 2026-04-19
Promoted staging → main on both `Molecule-AI/molecule-controlplane` and `Molecule-AI/molecule-core`. This note captures the prod cutover deltas so ops can cross-check against the running system.
## What changed
Ten PRs landed, split across the two repos:
**Control plane (`molecule-controlplane`)**
- PR #50 — C1/C2/C3: bearer auth on `/cp/workspaces/*`, shell-escape tenant user-data, per-tenant security group
- PR #51 — H1/H2: crash-safe `SECRETS_ENCRYPTION_KEY` log, dropped `admin_token` from `/instance` SELECT
- PR #52 — SSRF guard on `platform_url`
- PR #53 — CP injects `MOLECULE_CP_SHARED_SECRET` + `MOLECULE_CP_URL` into tenant env
- PR #54 — Stripe webhook body capped at 1 MiB
**Core (`molecule-core` / this repo)**
- PR #978 — H3/H4: LimitReader on Discord webhook + workspace config PATCH
- PR #979 — C4: `AdminAuth` fail-closed on fresh install when `ADMIN_TOKEN` is set
- PR #980 — log-scrub: dropped token prefix logging, stopped logging raw upstream response bodies
- PR #981 — tenant `CPProvisioner` attaches the CP bearer on every outbound `/cp/workspaces/*` call
- PR #982 — Canvas API fetch timeout (15s)
- PR #984 — E2E smoke test sync for #966 (public GET no longer exposes `current_task`)
## New prod env vars (Railway, project `molecule-platform`, env `production`)
Set before the CP merge landed:
| Variable | Value shape | Purpose |
|---|---|---|
| `PROVISION_SHARED_SECRET` | 32-byte hex | Gates `/cp/workspaces/*` on CP. Routes refuse to mount when unset — C1 fail-closed. |
| `EC2_VPC_ID` | `vpc-…` | Enables per-tenant SG creation (C3). Shared-SG fallback emits a startup warning. |
| `CP_BASE_URL` | `https://api.moleculesai.app` | Injected into newly-provisioned tenant containers as `MOLECULE_CP_URL`. |
The live prod `PROVISION_SHARED_SECRET` value is held only in Railway; not committed anywhere. Rotate by `railway variables --set` + redeploy.
## Existing-tenant migration (the sharp edge)
Tenants provisioned **before** this cutover are still running the previous workspace-server image. When they pull the new image on their next boot or auto-update cycle, their `CPProvisioner` will start expecting `MOLECULE_CP_SHARED_SECRET` in the container env — but the existing tenant EC2s don't have that variable in their user-data (the CP only started injecting it from PR #53 onward).
**Symptom**: a pre-cutover tenant can still serve its users' existing workspaces, but any attempt to **provision a new workspace** from inside the tenant UI will hit the CP's new bearer gate and get `401` or `404` back, surfacing as "workspace provision failed" with a generic error.
**Fix per existing tenant (pick one)**:
1. **SSH in + add the env var**
- Copy `PROVISION_SHARED_SECRET` from Railway prod env.
- `ssh ubuntu@<tenant-ip>` and append to the running container's env (`docker stop && docker run … -e MOLECULE_CP_SHARED_SECRET='…' -e MOLECULE_CP_URL=https://api.moleculesai.app …`). Rolling this into an auto-update hook is follow-up work.
2. **Re-provision the tenant**
- `DELETE /cp/orgs/:slug` → re-create via normal signup flow. Tenant-level data survives only if the tenant's own Postgres volume is preserved; workspace_id values change. This is the heavy hammer — only for tenants where existing data can be recreated easily.
3. **Wait for the auto-update + user-data refresh cycle**
- Tenant auto-updater (cron, 5-minute cadence) pulls the new container image but **does not refresh env vars** — those are frozen from the initial user-data. So option 3 alone doesn't fix this; it still needs option 1 or 2.
Script at `scripts/migrate-tenant-cp-secret.sh` (follow-up) will automate option 1 across all running tenants in the prod AWS account.
## Post-deploy verification checklist
- [ ] Railway prod deploy for `controlplane` lands on the new commit (check `https://railway.com/project/7ccc…/service/ae76…`)
- [ ] `curl https://api.moleculesai.app/health` → 200 `{service: molecule-cp, status: ok}`
- [ ] `curl -X POST https://api.moleculesai.app/cp/workspaces/provision` (no bearer) → 401 (**not** 404 — proves the env var is live and routes mounted)
- [ ] GHCR publishes new `workspace-server` image for the core main commit
- [ ] Vercel canvas prod deploy lands
## Rollback
If prod is on fire:
1. `gh pr revert 46 -R Molecule-AI/molecule-controlplane` — reverts all 6 CP PRs together.
2. `gh pr revert 983 -R Molecule-AI/molecule-core` — reverts the core bundle.
3. Both reverts auto-deploy via Railway / GHCR / Vercel.
Existing tenants aren't affected by a rollback — they're running whichever tenant image tag they booted with. Only newly-provisioned tenants pick up the reverted control plane code.
@@ -1,218 +0,0 @@
---
title: "Staging Environment Design"
description: "The staging environment design on Railway, mirroring prod for safe pre-release validation."
---
# Staging Environment Design
> **Status:** Planned — gates all future infra changes (Tunnel migration,
> security fixes, etc.)
>
> **Problem:** We merge directly to main and auto-deploy to production.
> Today's session broke CI twice and caused hours of Cloudflare edge cache
> issues because there was no staging to test infra changes first.
>
> **Goal:** Full staging environment that mirrors production. Every change
> ships to staging first, gets verified, then promotes to production.
---
## Architecture
```
                 staging                          production
                 ───────                          ──────────
Git branch:      main (auto-deploy)               main (manual promote)
                 or staging branch
CP (Railway):    staging service                  production service
                 staging.api.moleculesai.app      api.moleculesai.app
Tenant EC2s:     staging EC2 instances            production EC2 instances
                 *.staging.moleculesai.app        *.moleculesai.app
App (Vercel):    staging.app.moleculesai.app      app.moleculesai.app
                 (Vercel preview)                 (Vercel production)
DB (Neon):       staging branch                   main branch
                 (or separate project)
Docker images:   platform-tenant:staging          platform-tenant:latest
                 (GHCR)                           (GHCR)
Cloudflare:      *.staging.moleculesai.app        *.moleculesai.app
                 (separate tunnel/worker)         (tunnel per tenant)
```
## Deploy flow
```
Developer pushes to PR branch
→ CI runs (tests, build, lint)
→ PR merged to main
→ Auto-deploy to STAGING
→ Staging smoke tests (automated)
→ Manual verification if needed
→ Promote to PRODUCTION (manual trigger or approval)
```
## Components
### 1. Railway: two environments
Railway supports multiple environments per project. Create a `staging`
environment alongside `production`:
```bash
railway environment create staging
railway variables --environment staging --set "DATABASE_URL=<staging-neon>"
railway variables --environment staging --set "MOLECULE_ENV=staging"
# ... all other vars with staging-specific values
```
**Deploy trigger:**
- `staging`: auto-deploy on push to main
- `production`: manual promote via `railway up --environment production`
or GitHub Actions workflow_dispatch
**Domains:**
- staging: `staging.api.moleculesai.app` (Railway custom domain)
- production: `api.moleculesai.app` (unchanged)
### 2. Neon: branch per environment
Neon supports database branches (like git branches):
```bash
# Create staging branch from main
neon branch create --project-id <id> --name staging --parent main
```
- Staging DB has same schema, separate data
- Can reset staging by re-branching from main
- Production data never touched by staging tests
### 3. Vercel: preview deployments
Vercel already supports this natively:
- Push to main → deploys to `app.moleculesai.app` (production)
- Push to `staging` branch → deploys to preview URL
**Or** use Vercel environments:
- `staging.app.moleculesai.app` → staging deployment
- `app.moleculesai.app` → production deployment
### 4. GHCR: tagged images
```
platform-tenant:staging — built on every push to main
platform-tenant:latest — promoted from staging after verification
platform-tenant:sha-xxxxx — immutable, pinned to specific commit
```
**Publish workflow change:**
```yaml
# Current: pushes :latest on every main merge
# New: pushes :staging on every main merge
# pushes :latest only on manual promote
```
### 5. Cloudflare: staging subdomain
Option A (simple): `*.staging.moleculesai.app` with its own tunnel/worker
Option B (full): separate Cloudflare zone for staging (overkill)
Recommend Option A:
- Add `staging.moleculesai.app` DNS records
- Staging tenants get `slug.staging.moleculesai.app` subdomains
- Production tenants get `slug.moleculesai.app` (unchanged)
### 6. EC2: staging tag
Staging EC2 instances tagged with `Environment=staging`:
- Separate from production instances in AWS console
- Can use different AMI, instance type, security group
- Easy to identify and clean up
## Environment variables
| Variable | Staging | Production |
|----------|---------|------------|
| `MOLECULE_ENV` | `staging` | `production` |
| `DATABASE_URL` | Neon staging branch | Neon main branch |
| `TENANT_IMAGE` | `platform-tenant:staging` | `platform-tenant:latest` |
| `APP_DOMAIN` | `staging.moleculesai.app` | `moleculesai.app` |
| `CORS_ORIGINS` | `https://staging.app.moleculesai.app` | `https://app.moleculesai.app` |
| `ADMIN_TOKEN` | per-tenant (same mechanism) | per-tenant |
## Promotion workflow
### Automated (CI/CD)
```yaml
# .github/workflows/promote-to-production.yml
name: Promote to Production
on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "promote" to confirm'
        required: true
jobs:
  promote:
    if: github.event.inputs.confirm == 'promote'
    runs-on: ubuntu-latest  # assumption: the original sketch omitted the required runner
    steps:
      # 0. Check out the repo so the smoke-test script exists on the runner
      - uses: actions/checkout@v4
      # 1. Run staging smoke tests one more time
      - run: bash tests/e2e/test_saas_tenant.sh
        env:
          TENANT_SLUG: smoke-test
          BASE_URL: https://staging.api.moleculesai.app
      # 2. Tag Docker image
      - run: |
          docker pull ghcr.io/molecule-ai/platform-tenant:staging
          docker tag ghcr.io/molecule-ai/platform-tenant:staging \
            ghcr.io/molecule-ai/platform-tenant:latest
          docker push ghcr.io/molecule-ai/platform-tenant:latest
      # 3. Deploy CP to production
      - run: railway up --environment production
      # 4. Production tenants auto-update within 5 min (Option B cron)
```
### Manual (for now)
Until the automated workflow is built:
1. Verify on staging (`staging.api.moleculesai.app`)
2. `docker tag platform-tenant:staging platform-tenant:latest && docker push`
3. `railway up --environment production`
4. Monitor production health
## What this prevents
- CI breakage from untested path filters (today's dorny/paths-filter issue)
- Cloudflare edge cache poisoning (test DNS changes on staging subdomain)
- Workspace boot script regressions (test on staging EC2 first)
- DB migration failures (test on Neon staging branch)
- Auth/security regressions (staging has same auth stack)
## Implementation order
1. **Railway staging environment** — create + configure vars (~30 min)
2. **Neon staging branch** — create from main (~5 min)
3. **Staging DNS** — `staging.api.moleculesai.app` CNAME to Railway (~5 min)
4. **Publish workflow** — push `:staging` tag instead of `:latest` (~15 min)
5. **Promotion workflow** — manual trigger to promote staging → production (~30 min)
6. **Vercel staging** — configure preview deployment URL (~15 min)
7. **Staging smoke test** — automated test after staging deploy (~30 min)
**Total:** ~2.5 hours for full staging pipeline.
## Cost
- Railway staging: ~$5/mo (same as production, but can be smaller)
- Neon staging branch: free (included in plan)
- EC2 staging instances: only when testing (terminate after)
- Vercel: free (preview deployments included)
- Cloudflare: free (same zone, additional records)
@@ -1,154 +0,0 @@
---
title: "Tenant Image Upgrade Strategies"
description: "Strategies for rolling a new platform-tenant image out to existing EC2 tenants, with trade-offs."
---
# Tenant Image Upgrade Strategies
> **Status:** Option B (sidecar auto-updater) implemented. Options A and C
> documented for future use.
## Problem
When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant
instances keep running the old image. New orgs get the latest image at boot,
but existing tenants fall behind — missing bug fixes, security patches, and
new features.
## Option A: Rolling restart on publish (coordinated)
The publish workflow calls a CP admin endpoint after pushing the image.
The CP iterates all running tenants and restarts them one by one.
```
publish-platform-image succeeds
  → POST https://api.moleculesai.app/cp/admin/rolling-upgrade
    → CP queries org_instances WHERE status = 'running'
    → For each tenant (staggered, 30s apart):
        1. AWS SSM Run Command: docker pull + docker restart
        2. Wait for /health 200
        3. Update org_instances.updated_at
        4. If health fails after 60s, rollback (docker run old image)
    → Return summary: {upgraded: N, failed: M, skipped: K}
```
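The control flow above could be sketched like this. It is not the CP handler: `upgrade`, `is_healthy`, and `rollback` are hypothetical callbacks standing in for the SSM command, the `/health` probe, and the old-image restart, and the 5-second poll interval is an assumption. The source does not define when a tenant is skipped, so `skipped` stays at zero here.

```python
import time
from typing import Callable, Dict, Iterable


def rolling_upgrade(
    tenants: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stagger_s: float = 30.0,
    health_timeout_s: float = 60.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> Dict[str, int]:
    """Upgrade tenants one at a time, rolling back any that fail the health check."""
    summary = {"upgraded": 0, "failed": 0, "skipped": 0}
    tenant_list = list(tenants)
    for index, tenant in enumerate(tenant_list):
        upgrade(tenant)
        # Poll health until it passes or the timeout elapses.
        deadline = now() + health_timeout_s
        healthy = is_healthy(tenant)
        while not healthy and now() < deadline:
            sleep(poll_s)
            healthy = is_healthy(tenant)
        if healthy:
            summary["upgraded"] += 1
        else:
            rollback(tenant)
            summary["failed"] += 1
        # Stagger between tenants, but not after the last one.
        if index < len(tenant_list) - 1:
            sleep(stagger_s)
    return summary
```

A canary variant would stop the loop after the first failure instead of continuing to the remaining tenants.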
### Pros
- Immediate, coordinated upgrades across all tenants
- CP has full visibility into upgrade status
- Can implement canary (upgrade 1 tenant first, verify, then rest)
- Rollback capability per tenant
### Cons
- Requires AWS SSM agent on EC2 instances (not installed yet)
- Alternatively requires SSH access from Railway → EC2 (network/key management)
- Brief downtime per tenant during restart (~10-30s)
- Blast radius: a bad image can take down all tenants before canary catches it
### Implementation effort
- Add SSM agent to EC2 user-data script
- Add `POST /cp/admin/rolling-upgrade` handler
- Add upgrade step to publish workflow
- Add rollback logic
- ~2-3 days
### When to use
- Urgent security patches that can't wait 5 min
- Breaking changes that need coordinated rollout
- When you want canary/staged deployment
---
## Option B: Sidecar auto-updater (implemented)
A cron job on each EC2 checks GHCR for a new image digest every 5 minutes.
If the digest changed, it pulls the new image and restarts the container.
```bash
# Runs every 5 min on each EC2 (added to user-data)
*/5 * * * * /usr/local/bin/molecule-auto-update.sh
```
The update script:
1. `docker pull platform-tenant:latest`
2. Compare digest with running container's image digest
3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...`
4. Wait for `/health` 200
5. Log result to `/var/log/molecule-auto-update.log`
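The decision logic of those five steps can be sketched as follows. This is an illustration only, not the contents of `molecule-auto-update.sh`: the function name and every injected callable are hypothetical stand-ins for the corresponding `docker` commands and health probe.

```python
from typing import Callable


def auto_update_once(
    pull_latest_digest: Callable[[], str],   # hypothetical: docker pull, then read the image digest
    running_digest: Callable[[], str],       # hypothetical: digest of the running container's image
    recreate_container: Callable[[], None],  # hypothetical: docker stop + rm + run
    wait_for_health: Callable[[], bool],     # hypothetical: poll /health until 200 or timeout
    log: Callable[[str], None],
) -> str:
    """Run one cron tick: recreate the container only if the digest changed."""
    latest = pull_latest_digest()
    current = running_digest()
    if latest == current:
        log(f"up to date ({current})")
        return "unchanged"
    recreate_container()
    if wait_for_health():
        log(f"updated {current} -> {latest}")
        return "updated"
    log(f"unhealthy after update to {latest}")
    return "unhealthy"
```

Comparing digests rather than tags matters here because every tenant tracks `latest`, so the tag alone never changes.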
### Pros
- Zero CP involvement — fully autonomous per tenant
- Tenants upgrade within 5 min of any publish
- No SSH/SSM infrastructure needed
- Each tenant upgrades independently (natural canary)
- Simple to implement (2 lines in user-data + a small script)
### Cons
- Up to 5 min delay between publish and tenant upgrade
- Brief downtime during restart (~10-30s)
- No centralized visibility into upgrade status
- Can't selectively hold back specific tenants
- All tenants track `latest` — no pinned versions
### When to use
- Default for all tenants
- Works well for early-stage SaaS with frequent deploys
---
## Option C: Blue-green via Worker (zero downtime)
Each EC2 runs two container slots: `blue` (current) and `green` (new).
The Cloudflare Worker routes traffic to whichever is healthy.
```
EC2 instance:
  molecule-tenant-blue  → :8080 (current, serving traffic)
  molecule-tenant-green → :8081 (new, starting up)

Upgrade flow:
  1. Pull new image
  2. Start green on :8081
  3. Health check green: GET :8081/health
  4. If healthy: update Worker routing (KV: slug → port 8081)
  5. Stop blue
  6. Next upgrade: blue becomes the new slot

Worker routing:
  KV key: "example-org" → {"ip": "<EC2_IP>", "port": 8081}
  (port defaults to 8080 when not in KV)
```
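A minimal sketch of the routing lookup and slot swap, assuming the KV store behaves like a dictionary keyed by org slug. `resolve_backend`, `idle_slot`, and `promote` are hypothetical names, and the IP address in the usage note is a documentation-range placeholder.

```python
from typing import Callable, Optional, Tuple

DEFAULT_PORT = 8080
SLOT_PORTS = {"blue": 8080, "green": 8081}


def resolve_backend(kv: dict, slug: str) -> Optional[Tuple[str, int]]:
    """Return (ip, port) for a tenant; port falls back to 8080 when absent."""
    record = kv.get(slug)
    if record is None:
        return None
    return record["ip"], record.get("port", DEFAULT_PORT)


def idle_slot(active: str) -> str:
    """The slot that is not currently serving traffic."""
    return "green" if active == "blue" else "blue"


def promote(kv: dict, slug: str, active: str, is_healthy: Callable[[int], bool]) -> str:
    """Switch routing to the idle slot only after its health check passes.

    Returns the slot that is active after the call.
    """
    candidate = idle_slot(active)
    port = SLOT_PORTS[candidate]
    if not is_healthy(port):
        return active  # keep serving from the old slot; nothing changed
    kv[slug] = {**kv[slug], "port": port}
    return candidate
```

With `kv = {"example-org": {"ip": "192.0.2.10"}}`, the first lookup resolves to port 8080, and a successful `promote(..., "blue", ...)` moves it to 8081. Rollback is the same call in the other direction.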
### Pros
- Zero downtime — traffic switches atomically after health check
- Instant rollback — just switch back to the old slot
- Worker already exists — just add port to the routing lookup
- Health-verified before any traffic switches
### Cons
- Double memory usage during transition (~512MB extra per tenant)
- More complex user-data script (manage two containers)
- Worker needs port-aware routing (KV schema change)
- Need to track which slot is active per tenant
### Implementation effort
- Update user-data to manage blue/green containers
- Update Worker to read port from KV
- Add blue/green state tracking to CP (org_instances.active_slot)
- Update auto-updater script for blue-green swap
- ~3-5 days
### When to use
- When tenants have SLAs requiring zero downtime
- Production deployments with paying customers
- After Option B proves the auto-update pattern works
---
## Migration path
```
Now: Option B (auto-updater, 5 min delay, brief downtime)
Growth: Option A (add SSM for urgent patches, keep B as default)
Scale: Option C (zero-downtime for premium/enterprise tenants)
```
@@ -1,592 +0,0 @@
---
title: "Incident Log — molecule-core"
description: "Chronological incident log for molecule-core — summaries, resolutions, and references."
---
# Incident Log — molecule-core
> This file documents security incidents, outages, and degraded states.
> Active incidents are listed first. Resolved incidents remain for historical record.
---
*Last updated: 2026-04-21T07:45Z by Core Platform Lead — Incident log rebuilt after linter reset*
---
## Security Audit Cycle 6 — ALL CLEAR (2026-04-21 ~07:15Z)
**SHA range:** e69cb26 → 674384b on main (~5 commits + ~10 merged PRs)
**Verdict:** ✅ No critical/high findings
### Commits Reviewed — All CLEAN
| Commit | Description |
|--------|-------------|
| `dc9c64e` / PR #1258 | F1097 org_id context — eliminates redundant 2nd SELECT in AdminAuth |
| `33f1d1a` | Canvas cascade-delete UX — `pendingDelete.hasChildren`, warning dialog |
| `0790d57` | Canvas metrics guard — null coalescing |
| `781c217` | CI YAML fix |
| `169120d` / PR #1310 | CWE-78/CWE-22 — exec form + path traversal guards |
| `e431fc4` / PR #1302 | CWE-918 SSRF — `isSafeURL` in `a2a_proxy.go` |
| `a66f889` / PR #1261 | CWE path-injection — `resolveInsideRoot` for template paths |
Full audit saved to TEAM memory id `abc58b47`.
---
## F1100 — workspace_restart.go Path Traversal (RESOLVED)
**Severity:** Medium | **Finding ID:** F1100
**Status:** Resolved — fix applied via `a66f889` (PR #1261) on both main and staging
### Summary
`workspace_restart.go:127-133` accepted `body.Template` (attacker-controlled) via raw `filepath.Join(h.configsDir, template)`, allowing path traversal (e.g. `../../../etc`) to escape `configsDir`. **Issue #1043 triage missed this — legitimate gap, not false positive.**
Authenticated callers could pass a crafted `body.Template` value to escape the configs directory.
### Fix Applied
PR #1260 (the originally intended fix) was closed without merge. The fix landed via **PR #1261 (`a66f889`)** on both main and staging:
```go
// Fixed (a66f889):
candidatePath, resolveErr := resolveInsideRoot(h.configsDir, template)
if resolveErr != nil {
	template = "" // fallback fires safely
}
```
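For readers less familiar with the pattern, here is a Python analogue of a "resolve inside root" guard. It is a sketch of the general technique under stated assumptions, not a port of the Go `resolveInsideRoot` helper, whose exact behaviour is not shown in this log; the name `resolve_inside_root` is hypothetical.

```python
import os


def resolve_inside_root(root: str, candidate: str) -> str:
    """Join candidate onto root and reject any result that escapes root."""
    root_real = os.path.realpath(root)
    # realpath collapses ".." segments and follows symlinks before the check.
    target = os.path.realpath(os.path.join(root_real, candidate))
    if os.path.commonpath([root_real, target]) != root_real:
        raise ValueError(f"path escapes root: {candidate!r}")
    return target
```

The key design choice is to compare fully resolved paths. Checking the raw string for `..` alone would miss absolute paths and symlink escapes.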
### References
- PR #1260: closed without merge — superseded by PR #1261
- PR #1261 (`a66f889`): merged ✅
- Closes: #1043
---
## F1088 Credential Exposure — CLOSED
**All prior F1088 entries below remain valid. Summary of current state:**
- Credentials: MiniMax revoked (⚠️), GitHub PAT revoked (✅), Admin token — treat as potentially exposed
- BFG git-history scrub: NOT REQUIRED — incident management closure, 0 public forks confirmed
- Git history still contains values — admin token rotation recommended as precaution
- PR #1179 (`b89f3fd`) merged — active code is clean
- Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind main — superseded by incident management closure
**Required remaining action:** Rotate `ADMIN_TOKEN` (`HlgeMb8...ShARE=`) as precaution. All other actions complete.
---
### Summary
Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains three production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appeared in the git diff and were permanently visible in the public commit history.
### Credentials Status
| # | Credential | Value | Status |
|---|------------|-------|--------|
| 1 | ANTHROPIC_AUTH_TOKEN | `sk-cp-lHt-QFSyZwZxeo...KVw` | ⚠️ Revoked or inactive (404 on API call) |
| 2 | GITHUB_TOKEN | `github_pat_11BPRRWQI0m...hsIJLIL` | ✅ Revoked (confirmed 401) |
| 3 | ADMIN_TOKEN | `HlgeMb8...ShARE=` | Needs confirmation — treated as active until proven otherwise |
### Resolution
PR #1179 (`b89f3fd`: "ci: retry — trigger fresh runner allocation") closed this finding. The incident was closed at the finding-management level. Git history scrub via BFG was discussed but deemed not required by security team (no active public forks confirmed, credentials were already revoked/inactive).
Active code is clean (`d513a0c` replaced hardcoded defaults with env-var reads).
### Summary
Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains two production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appear in the git diff and are permanently visible in the public commit history.
The commit itself fixed the problem by replacing hardcoded defaults with env-var reads (MINIMAX_API_KEY, GITHUB_PAT). However, git history still shows the original values.
### Credentials Exposed
> **Token values redacted from this table 2026-04-26** to reduce public-search surface (the docs repo is publicly indexed). Short-suffix references match the convention in the Blast Radius table below (lines 134-137). Full values remain in `molecule-core` git history per the F1088 closure decision (no BFG scrub).
| # | Credential | Value (short suffix) | Service |
|---|------------|----------------------|---------|
| 1 | ANTHROPIC_AUTH_TOKEN | `sk-cp-...KVw` | MiniMax API (api.minimax.io/anthropic) |
| 2 | GITHUB_TOKEN | `github_pat_...hsIJLIL` | GitHub (fine-grained PAT, scope unknown) |
| 3 | ADMIN_TOKEN | `HlgeMb8...ShARE=` | Platform admin authentication |
### Affected Files
- `scripts/post-rebuild-setup.sh` (commit d513a0c, PR #1098 → merged to staging → merged to main)
### Timeline
- **~2026-04-20T13:02Z**: Commit `d513a0c` pushed by `rabbitblood`. GitGuardian flagged credentials in the diff. Fix committed in same commit.
- **~2026-04-20T**: Credentials removed from active code, but git history still contains them.
- **2026-04-20T22:32Z**: Incident discovered and escalated.
### Actions Taken
1. Dev Lead notified (delegation failed — Dev Lead unreachable)
2. All child workspaces notified (delegation failed — all unreachable)
3. Incident documented in this file
4. Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind `origin/main`
5. **Incident CLOSED** — PR #1179 merged, finding management closure, BFG scrub deemed not required (no active public forks confirmed)
### Blast Radius (Confirmed by Core-Security)
| Credential | Test Result | Status |
|------------|-------------|--------|
| MiniMax API key (`sk-cp-...KVw`) | `404 Not Found` on real API call | ⚠️ **REVOKED** (or endpoint inactive) |
| GitHub PAT (`github_pat_...hsIJLIL`) | `401 Bad credentials` | ✅ **REVOKED** |
| Admin token (`HlgeMb8...ShARE=`) | Base64 — cannot test directly | ⚠️ **Treated as active** — recommend rotation as precaution |
**Public forks:** 0 confirmed (GH API `/forks` returns none) — low fork blast radius.
**Git history scope:** Credentials exist in both `main` and `staging` in commits `f787873`..`d513a0c`. They were introduced in `f787873` ("feat: nuke-and-rebuild.sh") and removed from active code in `d513a0c`. Both branches require BFG cleanup.
### Required Actions (RESOLVED)
- [x] Credentials revoked (MiniMax ⚠️, GitHub PAT ✅)
- [x] BFG git history cleanup **NOT REQUIRED** — incident management closure, no active public forks, credentials confirmed revoked/inactive
- [x] Team notification — documented in this log
- [ ] **Admin token rotation** — recommended as precaution (value still in git history, treat as potentially exposed)
### BFG Repo-Cleaner Procedure
**NOT REQUIRED** — F1088 closed without BFG scrub per security team decision. Retained for reference only.
**Step 1 — Create credentials manifest (`creds.txt`) [NOT NEEDED]:**
```
<ADMIN_TOKEN value>
<MiniMax sk-cp-... value>
<GitHub fine-grained PAT value>
```
Full token values redacted from this doc 2026-04-26 (see note in the
Credentials Exposed table above). Pull from the Core-Security incident
ticket if a future revival of this BFG procedure is needed.
**Step 2 — Clean origin/main:**
```bash
git clone --mirror https://git.moleculesai.app/molecule-ai/molecule-core /tmp/molecule-main-mirror
java -jar bfgr.jar --replace-text creds.txt --rewrite-not-committed-by-oss --no-blob-protection /tmp/molecule-main-mirror
cd /tmp/molecule-main-mirror && git push --mirror
```
**Step 3 — Clean origin/staging:**
```bash
git clone --mirror https://git.moleculesai.app/molecule-ai/molecule-core /tmp/molecule-staging-mirror
java -jar bfgr.jar --replace-text creds.txt --rewrite-not-committed-by-oss --no-blob-protection /tmp/molecule-staging-mirror
cd /tmp/molecule-staging-mirror && git push --mirror
```
**Step 4 — Notify team to re-clone both branches if cloned before ~13:02 UTC 2026-04-20.**
### References
- Commit: `d513a0ced549ef2be8903a7b4794256110ba1805`
- PR: #1098 (staging → main merge)
- Cleanup branch: `origin/fix/credential-history-cleanup-f1088` (behind main by 38 commits)
- Scanners triggered: GitGuardian
- Security investigation: Core-Security (confirmed credentials revoked via API tests)
- GitHub issue: #1282 (filed by Core-OffSec)
- **Closed by:** PR #1179 (`b89f3fd`) — incident management closure, BFG scrub deemed not required
### Known Issue — PR #1230 Incomplete (QA Round 16, 2026-04-21)
PR #1230 / commit `524e3c6` ("fix(security): replace err.Error() leaks") failed to carry the `mcp.go` fixes into main's tree. All 3 MCP error leaks, plus one in `org_plugin_allowlist.go`, remain on main:
- `mcp.go:259`: `"parse error: " + err.Error()`
- `mcp.go:347`: `"invalid params: " + err.Error()`
- `mcp.go:352`: `err.Error()`
- `org_plugin_allowlist.go:260`: `"detail": err.Error()`
Fix is covered by PR #1226 (rebased, MERGEABLE). Gap should close after #1226 merges.
---
## CWE-918 SSRF — Backport to Main (RESOLVED)
**Severity:** High
**Status:** Resolved — PR #1302 merged to main
### Summary
SSRF defence (`isSafeURL` in `a2a_proxy.go`) was backported to main to address CWE-918 (Server-Side Request Forgery). The fix prevents the A2A proxy from forwarding requests to internal network addresses (localhost, private ranges, etc.).
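The kind of check described above can be sketched in Python. This is an illustration of the technique only: it is not the Go `isSafeURL` from `a2a_proxy.go`, the name `is_safe_url` is hypothetical, and it deliberately skips DNS resolution, which a real defence would need in order to catch hostnames that resolve to internal addresses.

```python
import ipaddress
from urllib.parse import urlsplit

BLOCKED_HOSTNAMES = {"localhost"}  # assumption: minimal deny-list for the sketch


def is_safe_url(url: str) -> bool:
    """Reject non-HTTP schemes and literal internal addresses."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # malformed URL, e.g. an unterminated IPv6 bracket
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    if host in BLOCKED_HOSTNAMES or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name, not an IP literal. This sketch lets it through;
        # production code should resolve it and check every address.
        return True
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )
```

Link-local coverage matters in cloud environments because the instance metadata endpoint (`169.254.169.254`) sits in that range.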
### References
- Commit: `e431fc4` (fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in a2a_proxy.go (#1292) (#1302))
---
## CWE-22 + CWE-78 Security Fixes — Merged (RESOLVED)
**Severity:** Critical
**Status:** Resolved — proper fixes merged to staging and main
### Summary
The `fix/cwe78-delete-via-ephemeral-shell-injection` branch had the right diagnosis but the wrong implementation (it removed `safeName` from `copyFilesToContainer`). The correct fixes were merged separately:
| Location | Commit | Fix |
|----------|--------|-----|
| staging | `ce2491e` | CWE-22: `copyFilesToContainer` safeName + `deleteViaEphemeral` validateRelPath + exec form |
| main | `169120d` | CWE-78/CWE-22: block shell injection in `deleteViaEphemeral` |
Both CWEs are fully resolved on both branches. The regression branch is superseded and must not be merged as-is.
### Verification (staging `ce2491e`)
`copyFilesToContainer` (container_files.go:73-99):
```go
clean := filepath.Clean(name)
if filepath.IsAbs(clean) || strings.Contains(clean, "..") {
	return fmt.Errorf("path traversal blocked: %s", name)
}
safeName := filepath.Join(destPath, clean)
header := &tar.Header{Name: safeName, ...}
```
`deleteViaEphemeral` (container_files.go:152-168):
```go
validateRelPath(filePath)
Cmd: []string{"rm", "-rf", "/configs", filePath} // exec form, no shell interpolation
```
---
## CI Runner Saturation (RESOLVED)
**Severity:** High
**Period:** ~2026-04-20T22:00Z → 2026-04-21T03:30Z
**Finding IDs:** N/A (infra incident)
**Status:** Resolved
### Summary
All self-hosted macOS arm64 runners were saturated: 27 runs queued, 0 in progress, 0 completed. Only cancellations were being processed. PRs #1053 and #1036 had zero CI runs.
### Root Causes (multiple)
1. `changes` job ran on `[self-hosted, macos, arm64]` despite having zero macOS dependencies (plain `git diff`) — wasted runner slots
2. YAML corruption in `ci.yml` (JSON-escaped `\n` sequences from commits `12c52d4`/`5831b4e`) caused "workflow file issue" failures before any job could start
3. `cancel-in-progress: false` at workflow level caused stale runs to queue instead of being cancelled
4. Workflow-level concurrency not set — multiple in-flight runs queued on same ref
---
## CI Stall — molecule-core/staging (RESOLVED 2026-04-21 ~07:05Z)
**Severity:** High
**Period:** ~2026-04-21T02:47Z → ~2026-04-21T07:00Z
**Status:** Resolved — CI progressing normally, no config problems remain
### Resolution
All prior runner-saturation and YAML-corruption fixes were correct. The stall resolved naturally once stale queued runs drained. Current CI state (2026-04-21 ~07:07Z):
- Staging run #24708961892: **success** (SHA `5d32373`)
- Staging run #24708976467: **success** (changes job, SHA `72d825f`)
- Main run #24708984339: queued (normal — healthy queue, not stalled)
- Runner agent healthy — no dead slots
### Root Causes (all resolved)
1. `changes` job on `[self-hosted, macos, arm64]` — fixed by moving to `ubuntu-latest` (`9601545`)
2. YAML corruption in `ci.yml` — fixed by PR #1264 / `b61692c`
3. `cancel-in-progress: false` at workflow level — reverted to `true` on staging ✅
4. `cancel-in-progress: false` on main — correct for single-runner env, aligned via PR #1248
### Staging CI Config (confirmed healthy)
- `ci.yml`: `cancel-in-progress: true`, `changes` job on `ubuntu-latest`
- `codeql.yml`: `cancel-in-progress: false`
- `e2e-api.yml`: `cancel-in-progress: false`
### Infra Recommendations (for long-term stability)
1. Provision org-wide GitHub App installation token for CI automation (PATs rotate too frequently)
2. Update remote URLs on controlplane and tenant-proxy repos
3. Monitor runner agent health on mac mini — restart agent if future stalls recur
---
## PR #1242 YAML Corruption — RESOLVED (PR never merged)
**Severity:** Critical
**Status:** Resolved — PR #1242 closed without merge, staging unaffected
### Summary
PR #1242 (`fix/ci-runner-queue-contention`) branch contained a YAML corruption in `ci.yml` — the `concurrency` block was replaced with a commit-SHA string literal:
```yaml
e4a62e1 (ci: add workflow-level concurrency to ci.yml and codeql.yml)
```
However, PR #1242 was **closed without merging**. Staging received `cancel-in-progress: true` via PR #1264 (commit `b61692c`) instead, which is the correct clean version.
### Current State (updated 2026-04-21 ~04:30Z)
- **main:** `cancel-in-progress: false` ✅ (from PR #1248 / `2ffd11c` or similar clean commit)
- **staging:** `cancel-in-progress: true` (via `0b30465` tick restore after corruption)
- **PR #1248** (`2ffd11c`): open, sets staging `cancel-in-progress: false` — aligns staging with main ✅
- **Main has moved to `false`** — staging should follow to stay consistent
### PR #1248 — URGENT MERGE
PR #1248 (`fix/ci: restore corrupted ci.yml concurrency block`) by Dev Lead:
- Fixes the corruption pattern (same as prior incident)
- Sets `cancel-in-progress: false` — correct for single-runner environment
- Aligns staging CI config with main (which already has `false`)
- Must merge before any further CI runs on staging
### References
- PR: #1242 (`fix/ci-runner-queue-contention`) — closed, not merged
- Staging corruption restored via: PR #1264 / `b61692c`
- PR #1248 (`2ffd11c`): open, Dev Lead fix, `cancel-in-progress: false`
- Main: `cancel-in-progress: false`
---
## PR #1036 QA Audit (STALE)
**Severity:** Low
**Date:** 2026-04-20 (QA audit performed)
**Status:** Stale — CI infrastructure has been fixed since audit
### Summary
QA audit (2026-04-20) flagged CI as failing on PR #1036. However, CI was failing due to infrastructure issues (runner saturation, YAML corruption) that have since been resolved. The audit should be re-run now that staging CI is healthy.
---
## PR #1246 / #1247 — Sed Regression Fix — RESOLVED (PR #1247 merged)
**Severity:** Critical
**Status:** Resolved — PR #1247 merged to main (2026-04-21 ~03:18Z)
### Summary
PR #1246 (`364712d`) was closed without merging. However, **PR #1247** (`04be218`) achieved the same fix cleanly and merged to main:
```
fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247)
```
Commit `04be218` (merged by molecule-ai[bot]) applied:
```
sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g'
```
### Affected Files (all fixed on main)
- `workspace-server/cmd/server/cp_config.go`
- `workspace-server/internal/handlers/a2a_proxy.go`
- `workspace-server/internal/handlers/github_token.go`
- `workspace-server/internal/handlers/traces.go`
- `workspace-server/internal/handlers/transcript.go`
- `workspace-server/internal/middleware/session_auth.go`
- `workspace-server/internal/provisioner/cp_provisioner.go` (3 occurrences)
**Staging:** Fix present via prior commits. `cp_config.go` on staging has SHA `d1021c2` (correct form).
**PR #1246:** Closed without merging — superseded by PR #1247. No further action needed.
---
## CWE-78/CWE-22 Branch — RESOLVED (proper fixes merged separately)
**Severity:** Critical
**Status:** Resolved — proper fixes merged via `ce2491e` (staging) and `169120d` (main)
### Summary
The `fix/cwe78-delete-via-ephemeral-shell-injection` branch (commit `17419dd`) was **correct** for CWE-78 (`deleteViaEphemeral` exec form + `validateRelPath`) but **regressed** `copyFilesToContainer` by removing the `safeName` path-traversal guard.
**Resolution — both branches merged to main and staging:**
| Branch | Commit | Status |
|--------|--------|--------|
| staging | `ce2491e` — fix(security): CWE-22 in copyFilesToContainer and deleteViaEphemeral | ✅ merged |
| main | `169120d` — fix(security): CWE-78/CWE-22 — block shell injection in deleteViaEphemeral | ✅ merged |
### What was fixed (staging `ce2491e`)
- `copyFilesToContainer`: `filepath.Clean` + `IsAbs` + `strings.Contains("..")` validation, `safeName` in tar header ✅
- `deleteViaEphemeral`: `validateRelPath(filePath)` check before rm command ✅
- Both CWE-22 and CWE-78 addressed correctly
### `fix/cwe78-delete-via-ephemeral-shell-injection` branch status
**Do NOT merge** — it's now superseded by `ce2491e`/`169120d`. The regression it introduced (removing `safeName` from `copyFilesToContainer`) was never the right approach. If this branch is revived, it must be rebased on top of `ce2491e` to preserve existing CWE-22 protections while adding the CWE-78 exec-form fix.
---
## F1085 Regression Branch (`fix/f1085-regression-1283`) — IS a Regression
**Severity:** High
**Status:** Active — branch removes the confirmed-good F1085 fix (confirmed 2026-04-21 ~07:10Z)
### Summary
Branch `origin/fix/f1085-regression-1283` (commit `3b244e6`) removes `redactSecrets(workspaceID, content)` from `seedInitialMemories` in `workspace_provision.go:249`:
```diff
-`, workspaceID, redactSecrets(workspaceID, content), scope, awarenessNamespace); err != nil {
+`, workspaceID, content, scope, awarenessNamespace); err != nil {
```
**Staging still has the correct fix** (`workspace_provision.go:253` on origin/staging confirms `redactSecrets` is present). This branch is behind staging and would regress it if merged.
### Required Fix
Close or revert this branch. `redactSecrets` must remain in `seedInitialMemories`. If there is a legitimate reason to change this (e.g., a different redaction strategy), document it clearly in the PR before merging.
---
## F1097 — org_id Context Fix — RESOLVED
**Severity:** Medium
**Status:** Resolved — PR #1258 merged to main (`dc9c64e`)
### Summary
`orgToken.Validate` refactored to return `org_id` directly, eliminating the redundant 2nd SELECT in `AdminAuth`. All SQL parameterized correctly.
### References
- PR #1258 (`dc9c64e`): fix(F1097): set org_id in Gin context for org-token callers
---
## PR #1226 — err.Error() Leaks (STALE — closed without merge)
**Severity:** Medium
**Status:** Open — PR closed without merging, leaks still present on main
### Summary
PR #1226 (`fix(security): sanitize remaining err.Error() leaks + errcheck artifacts/client.go`) was **closed without merging**. The following leaks remain on main:
| File | Line | Code | Fix |
|------|------|------|-----|
| `mcp.go` | 259 | `"parse error: " + err.Error()` | → `"parse error: invalid JSON request body"` |
| `mcp.go` | 347 | `"invalid params: " + err.Error()` | → `"invalid params: malformed JSON"` |
| `mcp.go` | 352 | `err.Error()` | → `"dispatch error"` |
| `org_plugin_allowlist.go` | 260 | `"detail": err.Error()` | → `"detail": "plugin name validation failed"` |
| `admin_memories.go` | 99 | `"invalid JSON: " + err.Error()` | → `"invalid JSON request body"` |
**Already fixed:** `artifacts/client.go:175`: `defer func() { _ = resp.Body.Close() }()` confirmed correct (via PR #1247).
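The pattern behind these fixes is to log the detailed error server-side and return a fixed, generic message to the caller. A minimal Python sketch of that idea follows; the handlers in question are Go, and `client_error` and `parse_body` are hypothetical names used only for illustration.

```python
import json
import logging
from typing import Optional, Tuple

logger = logging.getLogger("handlers")


def client_error(public_message: str, exc: Exception) -> dict:
    """Keep the exception detail in server logs; expose only a fixed message."""
    logger.warning("%s: %s", public_message, exc)
    return {"error": public_message}


def parse_body(raw: str) -> Tuple[Optional[dict], Optional[dict]]:
    """Return (payload, None) on success or (None, error_response) on failure."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        return None, client_error("parse error: invalid JSON request body", exc)
```

The response no longer varies with parser internals, so it cannot leak offsets, type names, or file paths that the raw error string might contain.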
### Action Required
Reopen PR #1226 and fast-track merge. Alternatively, cherry-pick the 4 commits from that PR onto a fresh branch.
---
## QA Round 18 — orgs-page Test Regression (FIXED on main, pending staging port)
**Severity:** Medium
**SHA tested:** `ce33da5` (PR #1257 branch merge with staging)
**Status:** Regression identified in PR #1255, fixed on main, not yet on staging
### Findings
| Finding | Status |
|---------|--------|
| Canvas tests: 53 passed, **1 FAILED** | orgs-page.test.tsx line 133 — `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` |
| PR #1257 conflict | MERGEABLE, approved — closed without merge; fix is on main/staging via `a66f889` |
| PR #1255 regression | Introduced orgs-page test flakiness — +18/-2 in orgs-page.test.tsx |
### orgs-page Test Regression — Root Cause
PR #1255 (`e885fa1`) regressed the timer fix from PR #1235. It replaced `waitFor()` with `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` — causing microtask flush issues.
### Resolution
**Main:** Fixed in `674384b` (PR #1313) — wraps all 10 affected `vi.advanceTimersByTimeAsync(50)` calls in `act(async () => { ... })`. All 813 canvas tests pass on main.
**Staging:** Regression NOT yet fixed — `origin/staging` is 13 commits behind main.
### Action needed
Cherry-pick or port the orgs-page test fix from `674384b` to staging.
---
## Issue #1124 — Orchestrator GET /workspaces 404: Env Var Misconfiguration (OPEN)
**Severity:** Medium
**Status:** Active — root cause confirmed, fix pending, delegated to Core-BE
### Summary
The orchestrator (workspace agent, `workspace/` directory) gets a 404 from `GET /workspaces/{WORKSPACE_ID}` when the `WORKSPACE_ID` env var is missing or empty. Confirmed via code review (2026-04-21 ~07:10Z).
### Root Causes
**Platform-side (provisioner.go:375-377) is CORRECT:**
```go
env := []string{
	fmt.Sprintf("WORKSPACE_ID=%s", cfg.WorkspaceID), // ✅ correctly injected
	"WORKSPACE_CONFIG_PATH=/configs",
	fmt.Sprintf("PLATFORM_URL=%s", cfg.PlatformURL),
}
```
The platform injects `WORKSPACE_ID` at container provision time. **The bug is in the Python orchestrator modules** that default to empty string instead of validating the injected value.
**Buggy Python module-level defaults (empty string → broken API calls):**
| File | Line | Code |
|------|------|------|
| `workspace/a2a_cli.py` | 24 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/a2a_client.py` | 17 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/coordinator.py` | 26 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/consolidation.py` | 22 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/molecule_ai_status.py` | 25 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
When `WORKSPACE_ID` is empty, API calls produce URLs like `/workspaces//heartbeat` or `/registry/discover/` — platform returns 404 or wrong routing.
**Note — main.py is already correct:**
```python
workspace_id = os.environ.get("WORKSPACE_ID", "workspace-default") # main.py:55 ✅
```
However, `main.py` uses a local variable — it doesn't export `WORKSPACE_ID` as a module constant. The other modules that import `WORKSPACE_ID` from `a2a_client` etc. still get the empty-string default.
### Fix Required (Quick Win for Core-BE)
**Option A — Fail fast at module import (recommended):**
```python
import os

WORKSPACE_ID = os.environ.get("WORKSPACE_ID")
if not WORKSPACE_ID:
    raise RuntimeError("WORKSPACE_ID environment variable is required but not set")
```
Apply to all 5 affected modules. This surfaces the misconfiguration immediately instead of producing silent 404s downstream.
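Since the same check would be repeated in five modules, one possible shape is a shared helper. This is a sketch under assumptions: `require_env` and `heartbeat_path` are hypothetical names, not existing functions in `workspace/`, and the heartbeat path is taken from the broken-URL example earlier in this entry.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return a non-empty env var value or fail fast with a clear error."""
    source = os.environ if environ is None else environ
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} environment variable is required but not set")
    return value


def heartbeat_path(workspace_id: str) -> str:
    """Illustrates how an empty ID silently yields a malformed path."""
    return f"/workspaces/{workspace_id}/heartbeat"
```

Each module could then use `WORKSPACE_ID = require_env("WORKSPACE_ID")`. The optional `environ` argument exists only so the helper can be exercised in tests without mutating the real environment.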
**Option B — Align with main.py's approach (safer):**
```python
WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "workspace-default")
```
This keeps the modules importable, but it masks real misconfigurations, so Option A is preferred.
### Modules Requiring Fix
- `workspace/a2a_cli.py` — line 24
- `workspace/a2a_client.py` — line 17
- `workspace/coordinator.py` — line 26
- `workspace/consolidation.py` — line 22
- `workspace/molecule_ai_status.py` — line 25
### PLATFORM_URL Note
All modules default to `http://platform:8080` (container mesh hostname). This is correct for in-container use but fails outside Docker. No action needed for in-container orchestrators — the platform injects `PLATFORM_URL` at provision time which overrides this default.
### Owner
Core-BE — delegated to Dev Lead (A2A failed). Core-BE sub-team: please pick up.
### Fix PR
[PR #1336](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1336) filed — `fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty`. Targets staging. Labels: bug, needs-work, area:backend-engineer, area:dev-lead.
---
*Last updated: 2026-04-21T07:10Z by Core Platform Lead (post-restart session — all findings re-verified)*
@@ -1,214 +0,0 @@
---
title: "a2a-sdk v0 → v1 migration"
description: "Cheat sheet for migrating workspace runtime code (and forks) from a2a-sdk 0.3.x to 1.x — renamed/removed symbols, common error shapes, before/after diffs."
---
import { Callout } from 'fumadocs-ui/components/callout';
The `a2a-sdk` Python package released v1.0 in late April 2026. The
Molecule workspace runtime migrated under tracking ID **KI-009** and
shipped in `molecule-ai-workspace-runtime` **v0.1.11** (commit
`d5cf872`, PR #39). The platform now runs exclusively on v1.
If you're consuming the platform's published wheel, bumping
`molecule-ai-workspace-runtime>=0.1.11` handles the migration for
you. If you maintain a fork of the runtime, an external agent talking
A2A directly, or your own adapter that imports from `a2a.*`, this page
is your checklist.
## Why migrate
- **Upstream**: `a2a-sdk` 1.0 reorganised the import surface, flattened
`Part`, removed deprecated capability flags, and replaced the
`A2AStarletteApplication` wrapper with explicit Starlette route
factories.
- **Platform**: as of 2026-04-24 the platform sends/receives via v1
shapes natively. The SDK ships a v0_3 compat layer (enabled in the
runtime via `enable_v0_3_compat=True` on `create_jsonrpc_routes`) so
in-flight 0.x callers don't break, but new code should target v1.
- **Forks/external runtimes**: v0 code throws on `import a2a.utils`
and `from a2a.server.apps import A2AStarletteApplication` once you
install v1, so the migration is a hard cutover at install time, not
a soft deprecation.
## Cheat sheet — renamed and removed symbols
The four breaking changes that hit the Molecule runtime during KI-009.
All four are confirmed against
`molecule-core/workspace/` source.
### 1. `new_agent_text_message` renamed to `new_text_message`
- **v0 location**: `a2a.utils.new_agent_text_message`
- **v1 location**: `a2a.helpers.new_text_message`
Both the module path and the symbol name changed.
### 2. `Part` API flattened — `TextPart` removed
- **v0**: `Part(root=TextPart(text="..."))` — `Part` wrapped a `root`
union of `TextPart` / `FilePart` / `DataPart`.
- **v1**: `Part(text="...")` — `Part` accepts the text payload
directly. `TextPart` no longer exists as a public symbol.
`FilePart` / `DataPart` are similarly flattened (`Part(file=...)`,
`Part(data=...)`); the Molecule runtime only emits text parts so the
file/data shapes weren't exercised in KI-009 and aren't covered by
this guide.
### 3. `A2AStarletteApplication` removed — use route factories
- **v0**: `from a2a.server.apps import A2AStarletteApplication` then
`A2AStarletteApplication(agent_card, request_handler).build()`.
- **v1**: `from a2a.server.routes import create_agent_card_routes,
create_jsonrpc_routes` then build a Starlette app from the returned
route lists.
The factories also let you mount the JSON-RPC endpoint at any path
(the runtime mounts at `/` because the platform POSTs to root, see
`workspace/main.py:279`).
### 4. `state_transition_history` capability flag removed
- **v0**: `AgentCapabilities(streaming=..., push_notifications=...,
state_transition_history=True)` was a per-agent opt-in.
- **v1**: the field is gone from `AgentCapabilities`. Per the SDK's own
`a2a/compat/v0_3/conversions.py`: *"No longer supported in v1.0"*.
The capability is now universal — `Task.history` is always available
and `tasks/get` accepts `historyLength` via `apply_history_length()`.
If you pass `state_transition_history=...` as a kwarg to
`AgentCapabilities` under v1, Pydantic will reject it. Drop the kwarg.
See [`workspace/main.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/main.py)
for the explanatory comment that prevents future accidental re-adds.
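When capability kwargs are assembled dynamically (for example from a config file), a small filter can strip removed fields before they reach the constructor. This is a minimal sketch under stated assumptions: `strip_removed_fields` and `REMOVED_CAPABILITY_FIELDS` are hypothetical names, and the removed-field set contains only what this guide documents.

```python
from typing import Any, Dict

# Fields this guide lists as removed from AgentCapabilities in v1.
REMOVED_CAPABILITY_FIELDS = frozenset({"state_transition_history"})


def strip_removed_fields(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs without fields that v1 no longer accepts."""
    return {k: v for k, v in kwargs.items() if k not in REMOVED_CAPABILITY_FIELDS}


legacy_kwargs = {
    "streaming": True,
    "push_notifications": False,
    "state_transition_history": True,
}
clean_kwargs = strip_removed_fields(legacy_kwargs)
# clean_kwargs can then be passed on as AgentCapabilities(**clean_kwargs).
```

The filter returns a copy, so the original config mapping is left untouched.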
## Common error shapes
When v0 code runs against the v1 SDK, the failure modes look like this:
| Error | Cause |
|---|---|
| `ModuleNotFoundError: No module named 'a2a.utils'` | v0 import path; module renamed to `a2a.helpers`. |
| `ImportError: cannot import name 'A2AStarletteApplication' from 'a2a.server.apps'` | The whole `a2a.server.apps` module is gone in v1. Switch to `a2a.server.routes` factories. |
| `ImportError: cannot import name 'TextPart' from 'a2a.types'` | Flattened `Part` API; use `Part(text=...)`. |
| `ValueError: Protocol message AgentCapabilities has no "state_transition_history" field` | Removed capability flag passed as kwarg; drop it. |
| `ValueError: Protocol message Part has no "root" field` | v0 `Part(root=TextPart(...))` shape against v1 schema; flatten to `Part(text=...)`. |
The protobuf-style `ValueError` messages always follow the pattern
`Protocol message <Type> has no "<field>" field` — that's the
fingerprint of "v0 shape against v1 schema." Treat it as a v0→v1 hint
even if the field name isn't on the cheat sheet above.
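The fingerprint above is regular enough to detect in a log scanner or an exception handler. The following sketch matches the message pattern quoted in the table and attaches a hint. `v0_shape_hint` and the hint strings are hypothetical; only the message format and the two known type/field pairs come from this guide.

```python
import re
from typing import Optional

# Message pattern quoted above for v0 shapes hitting the v1 schema.
_PROTO_FIELD_ERROR = re.compile(r'Protocol message (\w+) has no "(\w+)" field')

# Hints for the pairs listed in the table; other pairs get a generic hint.
_KNOWN_HINTS = {
    ("AgentCapabilities", "state_transition_history"): "drop the removed kwarg",
    ("Part", "root"): "flatten to Part(text=...)",
}


def v0_shape_hint(exc: BaseException) -> Optional[str]:
    """Return a migration hint if exc looks like a v0 shape against v1."""
    if not isinstance(exc, ValueError):
        return None
    match = _PROTO_FIELD_ERROR.search(str(exc))
    if match is None:
        return None
    message_type, field = match.groups()
    hint = _KNOWN_HINTS.get((message_type, field), "check for a v0-only field")
    return f"{message_type}.{field}: {hint}"
```

Unrelated `ValueError`s and other exception types return `None`, so the helper can wrap existing error handling without masking other failures.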
## Migration checklist
1. **Bump the dep** — `a2a-sdk[http-server]>=0.3.25` is the floor; remove
any `<1.0` upper bound. The Molecule wheel uses
`a2a-sdk[http-server]>=0.3.25` with no upper bound (see
[`molecule-ai-workspace-runtime/pyproject.toml`](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime/src/branch/main/pyproject.toml)).
2. **Fix imports** — sweep the four renamed/removed symbols above. A
safe grep is `grep -rn "from a2a\\|import a2a"` across your tree.
3. **Fix removed-field reads/writes** — search for
`state_transition_history` usage and delete the kwarg/field access.
4. **Flatten `Part` constructors** — search for `Part(root=` and
convert to `Part(text=...)` / `Part(file=...)` / `Part(data=...)`.
5. **Replace the app factory** — search for `A2AStarletteApplication`
and rewrite the bootstrap using `create_agent_card_routes` +
`create_jsonrpc_routes`. Pass `enable_v0_3_compat=True` to
`create_jsonrpc_routes` if your peers may still be on v0.
6. **Re-run tests** — fixture-level mocks of `a2a.helpers` /
`a2a.utils` need to mock both names so tests still pass during the
rename rollout (see
[`workspace/tests/conftest.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/tests/conftest.py)
for the dual-name pattern).
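Steps 2 through 5 of the checklist are all text searches, so they can be folded into one pass over a source file. The sketch below is one way to do that. `scan_source`, `Finding`, and the regexes are hypothetical and cover only the four cheat-sheet symbols. Treat hits as candidates for review, not as a complete audit.

```python
import re
from typing import List, NamedTuple


class Finding(NamedTuple):
    line_no: int
    symbol: str
    fix: str


# (regex, symbol label, suggested fix), one entry per cheat-sheet item.
V0_PATTERNS = [
    (r"\bnew_agent_text_message\b", "new_agent_text_message",
     "use a2a.helpers.new_text_message"),
    (r"\bPart\(\s*root\s*=", "Part(root=", "flatten to Part(text=...)"),
    (r"\bA2AStarletteApplication\b", "A2AStarletteApplication",
     "use the a2a.server.routes factories"),
    (r"\bstate_transition_history\b", "state_transition_history",
     "delete the kwarg or field access"),
]


def scan_source(source: str) -> List[Finding]:
    """Return one Finding per (line, v0 pattern) match in source."""
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for pattern, symbol, fix in V0_PATTERNS:
            if re.search(pattern, line):
                findings.append(Finding(line_no, symbol, fix))
    return findings
```

To sweep a tree, call `scan_source(path.read_text())` for each `*.py` file, for example via `pathlib.Path.rglob`.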
## Before / after diffs
### `new_agent_text_message` → `new_text_message`
```diff
-from a2a.utils import new_agent_text_message
+from a2a.helpers import new_text_message

 async def execute(self, context, event_queue):
-    await event_queue.enqueue_event(new_agent_text_message("hello"))
+    await event_queue.enqueue_event(new_text_message("hello"))
```
### Flat `Part` API
```diff
-from a2a.types import Part, TextPart
+from a2a.types import Part
-msg_parts = [Part(root=TextPart(text=final_text))]
+msg_parts = [Part(text=final_text)]
```
### `AgentCapabilities` — drop `state_transition_history`
```diff
 capabilities=AgentCapabilities(
     streaming=config.a2a.streaming,
     push_notifications=config.a2a.push_notifications,
-    state_transition_history=True,
 ),
```
### `A2AStarletteApplication` → route factories
```diff
-from a2a.server.apps import A2AStarletteApplication
+from a2a.server.routes import create_agent_card_routes, create_jsonrpc_routes
+from starlette.applications import Starlette

-app = A2AStarletteApplication(
-    agent_card=agent_card,
-    http_handler=request_handler,
-).build()
+routes = []
+routes.extend(create_agent_card_routes(agent_card))
+routes.extend(create_jsonrpc_routes(
+    request_handler=request_handler,
+    rpc_url="/",
+    enable_v0_3_compat=True,
+))
+app = Starlette(routes=routes)
```
The `enable_v0_3_compat=True` flag on `create_jsonrpc_routes` is what
keeps in-flight v0 callers (peers that haven't migrated yet) from
breaking — it accepts the old method names and translates them. The
Molecule runtime ships with this flag on (see
[`workspace/main.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/main.py));
strip it once your entire fleet is on v1.
## For downstream consumers
- **Using the published wheel** (`pip install
molecule-ai-workspace-runtime>=0.1.11`): the migration is in the
wheel — no code changes needed in your adapter or workspace template
beyond bumping the pin.
- **Running a fork of the runtime**: cherry-pick or rebase against
commit `d5cf872` ("feat: migrate a2a-sdk 1.x (KI-009) (#39)") in
`molecule-ai-workspace-runtime`. The diff is the canonical reference
for what KI-009 actually changed.
- **Standalone external agent** (talking A2A without the wheel): apply
the [Migration checklist](#migration-checklist) directly to your
source. The four cheat-sheet items are the entire surface that
changed for the typical agent role; only `Part` flattening and the
`state_transition_history` removal affect on-the-wire shapes — the
other two are import-only.
<Callout type="info">
The wheel keeps `enable_v0_3_compat=True` on `create_jsonrpc_routes`,
so a v0 peer can still hit a v1 wheel and vice versa during the
migration window. You don't need to coordinate a fleet-wide cutover —
migrate at your own pace.
</Callout>
## See also
- [`molecule-ai-workspace-runtime` v0.1.11 release](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime/releases/tag/v0.1.11) — first wheel containing KI-009
- PR #39 (feat: migrate a2a-sdk 1.x / KI-009) — closed without merge; PR content is historical
- PR #48 (feat(a2a): dual-compat for a2a-sdk 0.3.x and 1.x) — closed without merge; PR content is historical
- [Bring Your Own Runtime (MCP)](/docs/runtime-mcp) — universal wheel install path
- [External Agents](/docs/external-agents) — manual A2A path for non-MCP runtimes
@@ -1,69 +0,0 @@
---
title: "Cognee Architecture Deep-Dive — Workspace Isolation"
description: "Deep-dive into Cognee's isolation primitives versus Molecule AI's per-workspace memory requirements."
---
# Cognee Architecture Deep-Dive — Workspace Isolation
**Date:** 2026-04-20
**Issue:** Molecule-AI/molecule-core#1146
**Research by:** Research Lead
**Status:** Complete
---
## Executive Summary
Cognee has **dataset-level isolation primitives** but **no storage-layer enforcement** and **no native `workspace_id` support** in its MCP tool interface. Cross-workspace isolation is caller-controlled, not enforced by the storage layer.
---
## Isolation Layer Analysis
| Layer | Mechanism | Enforced? | Risk |
|-------|-----------|-----------|------|
| Storage (Postgres) | No RLS, no schema namespacing | ❌ None | High |
| App — dataset | `dataset_name` passed per tool call | ⚠️ Caller-controlled | Medium |
| App — user | `get_default_user()` internal resolver only | ⚠️ Soft | Medium |
| MCP `workspace_id` param | Not present in cognee-mcp interface | ❌ N/A | High |
---
## Key Findings
1. **Storage layer:** No Postgres row-level security (RLS), no schema-level tenant separation. Any admin with DB access can read any tenant's data.
2. **Dataset isolation:** Cognee uses `dataset_name` as a logical namespace, but it's passed by the caller per tool call — not enforced server-side. A misconfigured or malicious caller could read/write across datasets.
3. **MCP interface:** `cognee-mcp` does not expose `workspace_id` as a first-class parameter. Workspaces would need to be mapped to dataset names externally.
4. **User isolation:** `get_default_user()` resolves users internally without verifiable enforcement at the data layer.
---
## Migration Implications
Adopting Cognee as the memory substrate requires an **auth bridge**:
- The bridge wraps cognee-mcp and injects the `workspace_id`-to-`dataset_name` mapping
- All tool calls are routed through the bridge, which enforces tenant context
- Estimated effort: **~100–200 LOC** for the MCP proxy wrapper
- This is a pragmatic path — the bridge provides the isolation Cognee's storage layer lacks
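
The bridge described above can be sketched in a few dozen lines. This is a hedged illustration, not a design decision: `DatasetBridge`, `WorkspaceIsolationError`, and the `forward` callable (a stand-in for a real cognee-mcp client) are hypothetical, and the sketch assumes every tool call carries its dataset under a `dataset_name` argument.

```python
from typing import Any, Callable, Dict


class WorkspaceIsolationError(PermissionError):
    """Raised when a call tries to reach a dataset outside its workspace."""


class DatasetBridge:
    """Hypothetical proxy that pins every tool call to the caller's dataset."""

    def __init__(
        self,
        forward: Callable[[str, Dict[str, Any]], Any],
        workspace_to_dataset: Dict[str, str],
    ) -> None:
        self._forward = forward  # stand-in for the real MCP client call
        self._workspace_to_dataset = dict(workspace_to_dataset)

    def call_tool(self, workspace_id: str, tool: str, args: Dict[str, Any]) -> Any:
        dataset = self._workspace_to_dataset.get(workspace_id)
        if dataset is None:
            raise WorkspaceIsolationError(f"unknown workspace: {workspace_id}")
        requested = args.get("dataset_name")
        if requested is not None and requested != dataset:
            # Reject explicit attempts to name another workspace's dataset.
            raise WorkspaceIsolationError(
                f"workspace {workspace_id} may not access dataset {requested}"
            )
        # Always inject the server-side mapping; never trust the caller's value.
        return self._forward(tool, {**args, "dataset_name": dataset})
```

Because the mapping is applied inside the bridge, a caller that omits `dataset_name` still lands in its own dataset, and a caller that names another one is rejected before anything is forwarded. Validating that rejection path is step 2 of the recommendation below.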
---
## Recommendation
**Attempt the auth bridge prototype first (1–2 days of engineering):**
1. Build MCP proxy that maps workspace_id to dataset_name on each call
2. Validate that cross-workspace calls are correctly rejected
3. If clean → adopt Cognee for Phase 9
4. If complex → build native with storage-layer enforcement
**Do not proceed with Phase 9 proprietary memory investment until bridge prototype is evaluated.**
---
## Sources
- Cognee GitHub: https://github.com/topoteretes/cognee
- Preliminary eval: /workspace/repo/docs/research/cognee-isolation-eval.md
@@ -1,41 +0,0 @@
---
title: "Cognee Workspace Isolation Evaluation"
description: "Evaluating Cognee, an open-source AI memory engine, against Molecule AI's hierarchical memory isolation needs."
---
# Cognee Workspace Isolation Evaluation
**Date:** 2026-04-20
**Issue:** Molecule-AI/molecule-core#1146
**Status:** Preliminary — needs deeper architecture review
## Summary
Cognee (Apache-2.0, by Topoteretes UG) is an open-source AI memory engine with a shipped MCP component. It has direct overlap with Molecule AI's Phase 9 hierarchical memory architecture.
## Workspace Isolation Assessment
**Signal: Partial/Positive**
Cognee's GitHub README explicitly lists "agentic user/tenant isolation, traceability, OTEL collector, audit traits" as a core architectural feature.
This is a positive signal. However:
- The README mention does not specify the technical mechanism (namespace-level separation? separate vector DB instances per tenant? row-level security in a shared DB?)
- The cognee-mcp MCP component's handling of multi-workspace contexts is not documented in the surface-level readme
**Verdict:** Cognee claims tenant isolation. Further due diligence required before treating this as confirmed.
## Next Steps
1. **Deep-dive into cognee architecture docs** — check if isolation is enforced at the storage layer (separate DB/collection per workspace), application layer (row-level), or both
2. **Test cognee-mcp with a multi-workspace scenario** — the MCP tool interface should reveal whether workspace_id is a first-class parameter
3. **Check cognee's GitHub issues/discussions** — any community reports of cross-tenant data leakage?
4. **Evaluate migration path** — if Cognee is adopted, what's involved in migrating existing Phase 9 work?
## Recommendation
Proceed with Phase 9 build-vs-buy review. Cognee is a credible candidate — isolation is claimed but mechanism needs verification. The Phase 9 halt stands until this is resolved.
## Sources
- https://github.com/topoteretes/cognee (README, 2026-04-20)
- /workspace/repo/research/cognee-memo.md
+1 -1
@@ -239,7 +239,7 @@ This terminates all EC2 instances, drops the Neon branch, and removes the org re
- **Scoped roles**: give different team members read-only vs admin access within a tenant org (roadmap: Phase 34)
- **Usage-based billing**: Meter workspace runtime and forward events to Stripe for custom billing tiers
For runbook-level details on the provisioning flow, see the architecture docs at [`docs/architecture/saas-prod-migration-2026-04-19`](/docs/architecture/saas-prod-migration-2026-04-19).
For the provisioning flow internals, see the [Provisioner](/docs/architecture/provisioner) and [Workspace Tiers](/docs/architecture/workspace-tiers) reference.
For the API reference, see [`docs/api-reference`](/docs/api-reference) — the `/cp/orgs/*` endpoints are documented there.