docs: remove internal-only docs from the public docs repo #78
@@ -1,96 +0,0 @@
---
title: "Hermes Adapter — Shell Design Spec"
description: "Design spec for the Hermes runtime adapter — the BaseAdapter shell, provider map, and integration points."
---
# Hermes Adapter — Shell Design Spec

**Perspective:** DevOps Engineer + Backend Engineer
**Status:** Draft — pre-implementation
**Hermes source:** `NousResearch/hermes-agent` (~61k ⭐)
**Adapter runtime key:** `hermes`

---

## 1. Files Under `workspace/adapters/hermes/`

| File | Purpose |
|------|---------|
| `Dockerfile` | Extends `workspace-template:base`; installs `hermes-agent` Python SDK and its deps via pip at image build time |
| `requirements.txt` | Python package list — at minimum `hermes-agent`; pin to a specific release tag for reproducibility |
| `adapter.py` | `HermesAdapter(BaseAdapter)` — implements `name()`, `display_name()`, `description()`, `get_config_schema()`, `setup()`, `create_executor()`; delegates to `_common_setup()` for plugins/skills/tools |
| `__init__.py` | Exports `Adapter = HermesAdapter` — required by the adapter autodiscovery loader in `workspace/adapters/__init__.py` |

### `Dockerfile` sketch (no implementation — shape only)

```dockerfile
FROM workspace-template:base
COPY adapters/hermes/requirements.txt /tmp/hermes-requirements.txt
RUN pip install --no-cache-dir -r /tmp/hermes-requirements.txt
```

### `adapter.py` shape

```python
class HermesAdapter(BaseAdapter):
    @staticmethod
    def name() -> str:
        return "hermes"

    async def setup(self, config: AdapterConfig) -> None:
        # validate NOUS_API_KEY or OPENROUTER_API_KEY is set
        # call self._common_setup(config) for plugins/skills/tools
        ...

    async def create_executor(self, config: AdapterConfig) -> AgentExecutor:
        # wrap Hermes SDK session as an A2A AgentExecutor
        ...
```

---

## 2. Platform-Side Changes

### `workspace-server/internal/provisioner/provisioner.go` — `RuntimeImages` map

Add one entry to the existing map:

```go
var RuntimeImages = map[string]string{
	// ... existing entries ...
	"hermes": "workspace-template:hermes", // ← ADD THIS
}
```

No other platform Go changes are required for the minimal adapter shell. The `runtime` column in the `workspaces` table is a free-form string; no enum migration needed.

### `workspace/build-all.sh`

Add `hermes` to the adapter build loop so `build-all.sh` (and the `build-all.sh claude-code`-style single-runtime path) includes it:

```bash
ADAPTERS=(langgraph claude_code openclaw autogen hermes codex google-adk)
```

---

## 3. Required Environment Variables

| Name | Required | Description |
|------|----------|-------------|
| `NOUS_API_KEY` | Required (unless `OPENROUTER_API_KEY` set) | Nous Research Portal API key — primary model provider for Hermes; obtain from `nousresearch.com` |
| `OPENROUTER_API_KEY` | Optional | Fallback provider; lets operators use any Hermes-supported model via OpenRouter instead of Nous Portal |
| `HERMES_MODEL` | Optional | Model identifier (e.g. `nous-hermes-3`, `openrouter:anthropic/claude-sonnet-4-5`); adapter defaults to `nous-hermes-3` if unset |
| `HERMES_SKILLS_DIR` | Optional | Path inside the container where Hermes looks for skills; defaults to `/configs/skills` — consistent with the Claude Code and LangGraph adapters |

**Note:** `NOUS_API_KEY` and `OPENROUTER_API_KEY` must be set as workspace secrets via `POST /workspaces/:id/secrets`, not baked into the image. At least one of the two must be present at container start; `setup()` should `raise RuntimeError` early with a clear message if both are absent.
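
The early key check and the defaults from the table above can be sketched as follows. This is a minimal sketch, not the adapter's actual code: the helper name `resolve_hermes_env`, the `provider` labels, and the returned dictionary shape are assumptions made for illustration. Only the variable names and default values come from this spec.

```python
import os
from collections.abc import Mapping
from typing import Optional


def resolve_hermes_env(env: Optional[Mapping] = None) -> dict:
    """Validate provider keys and apply the documented defaults.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    env = os.environ if env is None else env

    nous_key = (env.get("NOUS_API_KEY") or "").strip()
    openrouter_key = (env.get("OPENROUTER_API_KEY") or "").strip()

    # Fail early with a clear message when neither provider key is present.
    if not nous_key and not openrouter_key:
        raise RuntimeError(
            "Hermes adapter: set NOUS_API_KEY or OPENROUTER_API_KEY "
            "as a workspace secret before starting the container"
        )

    return {
        # Nous Portal is the primary provider; OpenRouter is the fallback.
        "provider": "nous" if nous_key else "openrouter",
        "model": env.get("HERMES_MODEL") or "nous-hermes-3",
        "skills_dir": env.get("HERMES_SKILLS_DIR") or "/configs/skills",
    }
```

Taking the environment as a parameter keeps the check testable without touching `os.environ`. Whether the `nous-hermes-3` default is also valid when only `OPENROUTER_API_KEY` is set is not stated in this spec and would need confirming.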

---

## 4. Smallest Viable Adapter — Scope Constraints

This spec covers the **shell only** — the minimum to make a Hermes workspace provision, boot, and accept A2A messages:

- No Hermes learning loop (skill self-improvement) in v1 — that requires persistent storage writes outside `/configs`; defer to a follow-up PR.
- No multi-messenger gateway integration — Hermes's Telegram/Discord/Slack channels are separate from Molecule AI's `/channels` feature; map these later via the channels adapter.
- No FTS5 memory backend — use Molecule AI's existing `commit_memory` / `search_memory` built-in tools for v1; Hermes-native memory can be layered in a subsequent PR.
- The executor wraps one Hermes agent session per workspace, matching the 1:1 workspace→agent model used by all other adapters.
@@ -1,78 +0,0 @@
---
title: "Hermes Adapter — Implementation Plan"
description: "Implementation plan for the Hermes runtime adapter, from SDK import path to adapter.py build steps."
---
# Hermes Adapter — Implementation Plan

**Author:** Dev Lead
**Date:** 2026-04-13
**Branch convention:** `feat/hermes-adapter-<step>` for each PR below
**Target:** Ship a minimal but functional Hermes workspace adapter in 4 PRs, each ≤200 lines changed.

---

## PR Sequence

### PR 1 — Docker image shell

**Title:** `feat(hermes): add workspace-template:hermes Docker image`

**Files touched:**
- `workspace/adapters/hermes/Dockerfile` (new)
- `workspace/adapters/hermes/requirements.txt` (new)
- `workspace/adapters/hermes/__init__.py` (new)
- `workspace/build-all.sh` (1-line addition)

**Description:** Adds the Hermes Docker image layer. `Dockerfile` extends `workspace-template:base` and installs `hermes-agent` (and declared deps) via pip at build time. `build-all.sh` gains `hermes` in the adapter list so `bash build-all.sh` and `bash build-all.sh hermes` both work. No Python adapter logic yet — just proves the image builds and that `import hermes` succeeds inside the container. CI: add `hermes` to the docker-build matrix.

---

### PR 2 — Python adapter + A2A executor

**Title:** `feat(hermes): implement HermesAdapter and A2A executor`

**Files touched:**
- `workspace/adapters/hermes/adapter.py` (new, ~80 lines)
- `workspace/tests/test_adapters.py` (extend existing test file, ~30 lines)

**Description:** Implements `HermesAdapter(BaseAdapter)` with `name()`, `display_name()`, `description()`, `get_config_schema()`, `setup()`, and `create_executor()`. `setup()` first validates that `NOUS_API_KEY` or `OPENROUTER_API_KEY` is present (raising `RuntimeError` if both are absent), then calls `_common_setup()` to load plugins/skills/tools identically to other adapters and initialises a Hermes SDK session. `create_executor()` wraps the session as an `AgentExecutor`. Tests cover: the adapter `name()`/`display_name()` contract, `setup()` raising `RuntimeError` when both API keys are absent, and an executor being returned after valid setup.

---

### PR 3 — Platform RuntimeImages entry

**Title:** `fix(provisioner): add hermes to RuntimeImages map`

**Files touched:**
- `workspace-server/internal/provisioner/provisioner.go` (1-line addition)
- `workspace-server/internal/provisioner/provisioner_test.go` (1-line addition in RuntimeImages coverage test)

**Description:** Adds `"hermes": "workspace-template:hermes"` to the `RuntimeImages` map. Without this entry the platform falls back to `workspace-template:langgraph` (wrong deps, agent fails to start). Test: extend the existing table-driven test that asserts every declared runtime resolves to a non-empty image tag.

---

### PR 4 — Integration docs + org template entry

**Title:** `docs(hermes): adapter usage guide and org template example`

**Files touched:**
- `docs/adapters/hermes-adapter-design.md` (update status from Draft → Implemented)
- `workspace-configs-templates/hermes/config.yaml` (new, ~20 lines — minimal config template)
- `org-templates/molecule-worker-gemini/org.yaml` or a new `molecule-hermes/` org template (optional, ~30 lines)

**Description:** Marks the design doc as implemented, adds a `workspace-configs-templates/hermes/config.yaml` so operators can create a Hermes workspace from the UI template picker, and optionally adds a minimal org template showing a Hermes-runtime team. Documents the three env vars (`NOUS_API_KEY`, `OPENROUTER_API_KEY`, `HERMES_MODEL`) in the config template comments.

---

## Sequencing Notes

- PRs 1 and 2 can overlap in development but PR 2 must merge after PR 1 (image must exist before adapter tests run in CI).
- PR 3 is a single-line change and can merge any time after PR 1 lands.
- PR 4 has no code risk; it can be drafted alongside PR 2 and merged last.
- Total estimated diff: ~180 lines of new code across all 4 PRs; well within the ≤200 lines/PR budget.

## Open Questions (resolve before PR 2)

1. **Hermes SDK import path** — confirm the pip package name and the Python import path (`import hermes`? `from hermes_agent import ...`?). Check `NousResearch/hermes-agent` README before writing adapter.py.
2. **Session persistence** — Hermes has a learning loop that writes skill files. Decide at PR 2 time whether to mount `/workspace` as the Hermes skills root or suppress auto-write in v1.
3. **Model default** — confirm the correct model identifier string for Nous Portal (e.g. `nous-hermes-3-70b` vs `hermes-3`); hardcode a safe default in `get_config_schema()`.
@@ -1,264 +0,0 @@
---
title: "Hermes Agent — Adapter Reconnaissance"
description: "Reconnaissance of the NousResearch hermes-agent project as a candidate Molecule AI runtime adapter."
---
# Hermes Agent — Adapter Reconnaissance

Reconnaissance of [NousResearch/hermes-agent](https://github.com/NousResearch/hermes-agent) (v0.8.0, 68,713 ⭐, MIT) for potential Molecule AI adapter integration.

> **Status:** Design-only recon — no implementation.

---

## a) CLI Invocation

**Install** (curl-to-bash, targets Linux/macOS/WSL2/Termux):

```bash
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash
```

The `hermes` binary in the repo root is a Python script (`#!/usr/bin/env python3`) that imports and calls `hermes_cli.main.main()`. After install it lands on `$PATH`.

**Minimal interactive session:**

```bash
hermes          # launches TUI, auto-detects provider from env
hermes chat     # explicit; same as bare `hermes`
hermes setup    # one-time wizard: sets model, provider, API keys
```

**Key runtime flags:**

```bash
# --resume    continue most recent session
# --worktree  git-worktree isolation per session
# --profile   load alternate HERMES_HOME profile
# (comments sit up here because a trailing comment would end the
#  backslash-continued command early)
hermes chat \
  --model anthropic/claude-opus-4.6 \
  --provider openrouter \
  --toolsets terminal,file,web \
  --max-turns 60 \
  --query "build me a FastAPI app" \
  --resume \
  --worktree \
  --profile myprofile
```

**One-shot (non-interactive):**

```bash
hermes chat --query "summarise this repo" --quiet
```

**Gateway (messaging platforms) start:**

```bash
hermes gateway start    # daemonises; reads gateway config from config.yaml
hermes gateway status
hermes gateway stop
```

**OpenClaw migration:**

```bash
hermes claw migrate --dry-run    # preview; drop --dry-run to execute
```

---

## b) Config Format

**Format:** YAML
**Primary path:** `~/.hermes/config.yaml` (default), overridable via `HERMES_HOME` env var.
**Reference file in repo:** `cli-config.yaml.example`

**Minimal working config** (provider = OpenRouter, local terminal backend):

```yaml
# ~/.hermes/config.yaml

model:
  default: "anthropic/claude-opus-4.6"
  provider: "openrouter"    # required; "auto" if you want env-var detection
  base_url: "https://openrouter.ai/api/v1"

terminal:
  backend: "local"          # required; options: local | ssh | docker | singularity | modal | daytona
  cwd: "."
  timeout: 180
  lifetime_seconds: 300

memory:
  memory_enabled: true
  user_profile_enabled: true
  memory_char_limit: 2200
  user_char_limit: 1375
  nudge_interval: 10

agent:
  max_turns: 60
  reasoning_effort: "medium"    # xhigh | high | medium | low | minimal | none
```

**Required fields:** `model.default`, `model.provider`, `terminal.backend`.
Everything else has a hardcoded default.
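
A pre-flight check for the three required fields could look like the sketch below. It assumes the YAML has already been parsed into nested dictionaries; the names `check_hermes_config`, `REQUIRED_FIELDS`, and `KNOWN_BACKENDS` are hypothetical and not part of Hermes. The backend names are copied from the config comment above.

```python
from collections.abc import Mapping
from typing import Any

# Backend names as listed in the config comment above.
KNOWN_BACKENDS = {"local", "ssh", "docker", "singularity", "modal", "daytona"}

# Dotted paths of the three fields the recon identifies as required.
REQUIRED_FIELDS = ("model.default", "model.provider", "terminal.backend")


def _lookup(config: Mapping, dotted: str) -> Any:
    """Walk a dotted path through nested mappings; None if any hop is missing."""
    node: Any = config
    for part in dotted.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node


def check_hermes_config(config: Mapping) -> list:
    """Return human-readable problems; an empty list means the config passes."""
    problems = [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if _lookup(config, field) in (None, "")
    ]
    backend = _lookup(config, "terminal.backend")
    if backend not in (None, "") and backend not in KNOWN_BACKENDS:
        problems.append(f"unknown terminal.backend: {backend!r}")
    return problems
```

Returning a list rather than raising on the first problem lets an adapter report every missing field in one error message.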

**Credentials** go in `~/.hermes/.env` (separate from config.yaml):

```bash
OPENROUTER_API_KEY=sk-or-...
ANTHROPIC_API_KEY=sk-ant-...
HERMES_HOME=~/.hermes    # optional override
```

**Skills config** (in `config.yaml`):

```yaml
skills:
  creation_nudge_interval: 15    # remind agent to persist a skill every N tool iterations
  external_dirs:
    - ~/.agents/shared-skills    # read-only external skill dirs
```

**Compression config** (in `config.yaml`):

```yaml
compression:
  enabled: true
  threshold: 0.50
  summary_model: "google/gemini-3-flash-preview"
```

---

## c) Runtime Dependencies

**Python version:** 3.13 (Dockerfile base: `ghcr.io/astral-sh/uv:0.11.6-python3.13-trixie`)
**Package manager:** [uv](https://github.com/astral-sh/uv) (not pip directly; `uv pip install .`)
**Package version:** `hermes-agent==0.8.0`

**Top core pip dependencies** (from `pyproject.toml`):

| Package | Version constraint | Purpose |
|---|---|---|
| `openai` | `>=2.21.0,<3` | Primary LLM client (all providers via OpenAI-compat API) |
| `anthropic` | `>=0.39.0,<1` | Direct Anthropic API adapter |
| `python-dotenv` | `>=1.2.1,<2` | `.env` loading |
| `fire` | `>=0.7.1,<1` | CLI argument dispatch |
| `httpx[socks]` | `>=0.28.1,<1` | Async HTTP (gateway, webhooks) |
| `rich` | `>=14.3.3,<15` | TUI rendering |
| `pyyaml` | `>=6.0.2,<7` | Config file parsing |
| `pydantic` | `>=2.12.5,<3` | Data validation |
| `prompt_toolkit` | `>=3.0.52,<4` | Interactive TUI / multiline input |
| `tenacity` | `>=9.1.4,<10` | Retry logic |

**Key optional extras:**

```bash
pip install "hermes-agent[modal]"       # modal>=1.0.0 — serverless backend
pip install "hermes-agent[daytona]"     # daytona>=0.148.0 — cloud sandbox backend
pip install "hermes-agent[mcp]"         # mcp>=1.2.0 — MCP server/client
pip install "hermes-agent[honcho]"      # honcho-ai — cross-session user modeling
pip install "hermes-agent[messaging]"   # telegram, discord.py, aiohttp, slack
pip install "hermes-agent[voice]"       # faster-whisper, sounddevice, numpy
pip install "hermes-agent[rl]"          # atroposlib, fastapi, uvicorn, wandb
```

**System binaries** (from Dockerfile `apt-get install`):

```
nodejs npm ripgrep ffmpeg gcc python3-dev libffi-dev procps build-essential
```

`ripgrep` is used by the `file` toolset for fast codebase search. `ffmpeg` is used for voice transcription pre-processing.

---

## d) Session State

**All persistent state lives under `HERMES_HOME`** (default: `~/.hermes/`, overridable via env var).

**Primary state store: SQLite**

```
~/.hermes/state.db    ← DEFAULT_DB_PATH = get_hermes_home() / "state.db"
```

- Schema version: **6** (`SCHEMA_VERSION = 6` in `hermes_state.py`)
- WAL mode (`PRAGMA journal_mode=WAL`) — supports concurrent gateway + CLI writers
- Three core tables: `schema_version`, `sessions`, `messages`
- **FTS5 virtual table** `messages_fts` with auto-sync triggers on INSERT/UPDATE/DELETE — backs the `session_search` toolset (full-text search across all past conversation content)
- Compression-triggered session splitting tracked via `parent_session_id` chain in `sessions` table
- Session source tagged as `'cli'`, `'telegram'`, `'discord'`, etc. for per-platform filtering
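
The FTS5-plus-triggers arrangement described in the list above follows a standard SQLite pattern, sketched below against an in-memory database. The column names and trigger names are assumptions: the real schema in `hermes_state.py` was not captured in this recon, so treat this only as an illustration of how an external-content FTS5 index stays in sync with its base table. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Hypothetical, simplified schema; the real Hermes tables have more columns.
SCHEMA = """
CREATE TABLE messages (
    id         INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    content    TEXT NOT NULL
);

-- External-content FTS5 index over messages.content.
CREATE VIRTUAL TABLE messages_fts USING fts5(
    content, content='messages', content_rowid='id'
);

-- Triggers keep the index in step with the base table.
CREATE TRIGGER messages_ai AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
CREATE TRIGGER messages_ad AFTER DELETE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
END;
CREATE TRIGGER messages_au AFTER UPDATE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES ('delete', old.id, old.content);
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
"""


def search_messages(conn, query):
    """Return (session_id, content) rows whose content matches the FTS5 query."""
    return conn.execute(
        "SELECT m.session_id, m.content FROM messages_fts "
        "JOIN messages AS m ON m.id = messages_fts.rowid "
        "WHERE messages_fts MATCH ? ORDER BY m.id",
        (query,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
insert = "INSERT INTO messages (session_id, content) VALUES (?, ?)"
conn.execute(insert, ("s1", "deploy the FastAPI app"))
conn.execute(insert, ("s2", "summarise this repo"))
# The UPDATE trigger removes the old tokens and indexes the new text.
conn.execute("UPDATE messages SET content = ? WHERE session_id = ?",
             ("summarise the changelog", "s2"))
```

After the update, a search for a word that only appeared in the old text returns nothing, which is the behaviour the auto-sync triggers exist to guarantee.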

**Full directory layout:**

```
~/.hermes/
├── config.yaml          ← get_config_path()
├── .env                 ← get_env_path()
├── state.db             ← SQLite WAL, FTS5
├── skills/              ← get_skills_dir() — user-created skill SKILL.md files
├── logs/                ← get_logs_dir() — trajectory JSONs
│   └── session_YYYYMMDD_HHMMSS_<uuid>.json
├── MEMORY.md            ← agent's curated notes (injected into system prompt)
├── USER.md              ← user profile (injected into system prompt)
└── skins/               ← optional custom theme YAMLs
```

**State is persistent by default.** Session history, memories (`MEMORY.md`/`USER.md`), and skills survive restarts. The `session_reset` config controls when gateway sessions are cleared (default: `mode: both`, idle after 1440 min or at 4 AM daily). Before any reset, Hermes is given one flush turn to write important context to `MEMORY.md`.
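
One plausible reading of `mode: both` is "reset when either the idle window or the daily boundary has passed". The sketch below encodes that reading only; the function name, parameter names, and the exact boundary semantics are assumptions, and the real Hermes logic may differ.

```python
from datetime import datetime, timedelta


def should_reset(last_activity: datetime, last_reset: datetime, now: datetime,
                 idle_minutes: int = 1440, daily_hour: int = 4) -> bool:
    """Return True when the idle window or the daily boundary has passed.

    Hypothetical interpretation of `session_reset` with `mode: both`.
    """
    idle_expired = now - last_activity >= timedelta(minutes=idle_minutes)

    # Most recent daily boundary (for example 04:00) at or before `now`.
    boundary = now.replace(hour=daily_hour, minute=0, second=0, microsecond=0)
    if boundary > now:
        boundary -= timedelta(days=1)
    daily_due = last_reset < boundary

    return idle_expired or daily_due
```

For an adapter, the relevant point is that a reset can happen without any restart, so anything the workspace needs to keep should be written to `MEMORY.md` or another persistent location before that flush turn ends.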

Container backend state is controlled separately by `container_persistent: true/false` in the `terminal:` block.

---

## e) Execution Backends

**Six backends configured via a single `terminal.backend` key in `config.yaml`:**

| Backend | Where commands run | Key extra config |
|---|---|---|
| `local` | Host machine, current dir | — |
| `ssh` | Remote server | `ssh_host`, `ssh_user`, `ssh_key` |
| `docker` | Inside a Docker container | `docker_image`, `docker_mount_cwd_to_workspace` |
| `singularity` | Singularity/Apptainer container (HPC) | `singularity_image` |
| `modal` | Modal cloud sandbox (serverless) | `modal_image`, `pip install hermes-agent[modal]` |
| `daytona` | Daytona cloud sandbox | `daytona_image`, `container_disk`, `pip install hermes-agent[daytona]` |

**Architecture clarification:** Hermes's Python process **always runs locally** (or wherever you launched it). The `backend` setting controls only where the **`terminal` tool** executes shell commands. For `docker`, Hermes calls the Docker API to spawn/reuse a container and routes `terminal` tool calls into it via exec — Hermes itself is **not** containerised by this setting.

**Docker backend minimal config:**

```yaml
terminal:
  backend: "docker"
  cwd: "/workspace"                       # path inside the container
  timeout: 180
  lifetime_seconds: 300
  docker_image: "nikolaik/python-nodejs:python3.11-nodejs20"
  docker_mount_cwd_to_workspace: false    # default: false (host mount off, the safer setting); set true to bind-mount the launch dir into /workspace
  docker_forward_env:
    - "GITHUB_TOKEN"
    - "NPM_TOKEN"
  container_cpu: 1
  container_memory: 5120                  # MB
  container_disk: 51200                   # MB
  container_persistent: true              # false = ephemeral container, wiped after session
```

**The Dockerfile** (for running *all of Hermes* inside Docker, distinct from the backend setting) uses:

```dockerfile
FROM debian:13.4
ENV HERMES_HOME=/opt/data
ENV PLAYWRIGHT_BROWSERS_PATH=/opt/hermes/.playwright
VOLUME /opt/data
ENTRYPOINT ["/opt/hermes/docker/entrypoint.sh"]
# Runs as non-root user hermes (UID 10000), home /opt/data
```

**Serverless hibernation** (Modal + Daytona): `container_persistent: false` produces fully ephemeral sandboxes that are destroyed after `lifetime_seconds`; `true` persists the container filesystem between sessions (warm-resume, no re-install overhead).

---

## f) Value Proposition

Integrating Hermes adds one capability that none of the other existing adapters (LangGraph, Claude Code, AutoGen, OpenClaw, Codex, Google ADK) deliver end-to-end: **a closed learning loop that compounds across sessions at the skill, memory, and user-model layers simultaneously.**

Concretely: after a complex task, Hermes autonomously creates a `SKILL.md` file in `~/.hermes/skills/` (prompted every `creation_nudge_interval=15` tool iterations), and those skills are re-injected as context in future sessions — agents get better at tasks they've done before without any human curation step. The `session_search` toolset adds FTS5 + Gemini Flash summarization over `state.db`, so the agent can recall specific conversations from months ago with semantic-quality results.

Layered on top is **Honcho dialectic user modeling** (`plastic-labs/honcho`) — a cross-session profile that tracks user communication style, preferences, and expectations, shared across any Honcho-integrated tool (not just Hermes).

Finally, the **Modal and Daytona serverless backends with `container_persistent`** give Molecule AI a path to hibernating, pay-per-use sandboxes that no existing adapter exposes — directly relevant to Molecule AI's multi-workspace billing model.

The `hermes claw migrate` command (backed by `optional-skills/migration/openclaw-migration/scripts/openclaw_to_hermes.py`) is also relevant: Molecule AI could offer equivalent migration tooling to attract OpenClaw's existing ~247k-user base, and the **`agentskills.io` skill-manifest spec** (referenced in `optional-skills/`) should be reviewed before Molecule AI finalises its own plugin manifest schema to ensure interoperability with what is rapidly becoming the de-facto file-based skill standard.
@@ -1,177 +0,0 @@
---
title: "MeDo Integration Design — Molecule AI Hackathon (May 20 2026)"
description: "Design for integrating the Baidu MeDo / Miaoda App Builder as an OpenClaw-runtime workspace, with A2A delegation and open questions."
---
# MeDo Integration Design — Molecule AI Hackathon (May 20 2026)

**Status:** Design — implementation pending operator sign-off on open questions (§5).
**Scope:** How the molecule-dev team builds MeDo apps for the "Build with MeDo" hackathon.
**Key constraint:** MeDo App Builder is an OpenClaw skill on ClawHub (`seiriosPlus/miaoda-app-builder`),
not a REST API. All interactions go through natural-language messages to an OpenClaw workspace.

---

## 1. Architecture Overview

```
CEO / Canvas
    │  A2A task
    ▼
PM (claude-code)
    │  delegate_task_async → workspace: medo-builder
    ▼
MeDo Builder workspace  [runtime: openclaw, skill: miaoda-app-builder]
    │  OpenClaw CLI → skill → api.miaoda.cn
    ▼
MeDo platform  (app created / published → URL returned)
    │  result relayed via A2A event_queue
    ▼
PM → CEO
```

The MeDo Builder workspace is a **dedicated OpenClaw-runtime workspace** inside the
molecule-dev org with the Miaoda App Builder skill pre-installed. PM delegates natural-language
app-build requests to it via `delegate_task_async` and polls for the result (5–8 min latency).

---

## 2. Installing the Miaoda App Builder Skill

### 2.1 API Key

The skill requires `MIAODA_API_KEY` (not `MEDO_API_KEY`).

> ⚠️ **Credential name mismatch**: the global platform secret is currently named `MEDO_API_KEY`.
> The skill's frontmatter declares `primaryEnv: MIAODA_API_KEY`. The MeDo Builder workspace must
> set `MIAODA_API_KEY` — either rename the global secret or add a workspace-level alias.
> See open question §5-A.

Obtain the key from: **MeDo website → Settings → API Keys**. Keys do not expire, but generating
a new one immediately invalidates the previous one.
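
If the alias route is chosen for the credential name mismatch, the mapping could be as small as the sketch below. The helper name `alias_miaoda_key` is hypothetical, and where such a step would run (adapter `setup()`, a boot script, or the platform's secret injection) is an open design choice that this document does not settle.

```python
import os
from collections.abc import MutableMapping
from typing import Optional


def alias_miaoda_key(env: Optional[MutableMapping] = None) -> bool:
    """Copy MEDO_API_KEY into MIAODA_API_KEY when only the former is set.

    Returns True if an alias was created. Hypothetical helper for illustration.
    """
    env = os.environ if env is None else env
    if env.get("MIAODA_API_KEY"):
        return False  # the name the skill expects is already set; leave it alone
    legacy = env.get("MEDO_API_KEY")
    if not legacy:
        return False  # nothing to alias; the skill will still fail at boot
    env["MIAODA_API_KEY"] = legacy
    return True
```

An explicit `MIAODA_API_KEY` always wins here, so a later rename of the global secret would not require removing the alias step first.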

### 2.2 Installation Query

OpenClaw installs skills by sending a natural-language install message to the agent.
No CLI command is documented on ClawHub — send this message to the OpenClaw workspace on first boot:

```
Install the Miaoda App Builder skill from ClawHub: seiriosPlus/miaoda-app-builder
```

OpenClaw auto-downloads the skill, installs Python runtime deps (`requests`), and makes the skill
available for subsequent messages.

### 2.3 Workspace Config Sketch (`org-templates/medo-builder/workspace.yaml`)

```yaml
name: MeDo Builder
role: Builds and publishes MeDo applications via the Miaoda App Builder OpenClaw skill
runtime: openclaw
tier: 2
required_env:
  - MIAODA_API_KEY        # TODO: resolve name vs platform secret MEDO_API_KEY (§5-A)
  - OPENROUTER_API_KEY    # OpenClaw needs an LLM provider
initial_prompt: |
  You are a MeDo App Builder. On startup:
  1. Install the Miaoda App Builder skill:
     "Install the Miaoda App Builder skill from ClawHub: seiriosPlus/miaoda-app-builder"
  2. Confirm installation succeeded.
  3. Wait for build tasks from PM via A2A.
  When you receive a build task, use natural language to instruct the skill:
  "Create a [description] app and publish it when done."
  App generation takes 5–8 minutes — poll the skill or wait for confirmation before reporting done.
```

---

## 3. A2A Delegation Pattern (5–8 Min Latency)

App generation is asynchronous and slow. PM **must** use `delegate_task_async` + `check_task_status`
rather than `delegate_task` (which has a shorter timeout and will return before the app is ready).

### 3.1 PM Delegation Flow

```python
import asyncio

# Runs inside an async context where the platform tools are available.

# Step 1: start the build without blocking (a task handle is returned immediately)
task = await delegate_task_async(
    workspace_id="medo-builder-workspace-id",
    task="Build a restaurant reservation tool with online booking, menu display, "
         "and contact form. Publish when done and return the URL.",
)

# Step 2: poll every 60s (app takes 5–8 min)
while True:
    status = await check_task_status(task_id=task["task_id"])
    if status["status"] in ("completed", "failed"):
        break
    await asyncio.sleep(60)

# Check status["status"] before trusting the result: it is only a URL on success.
result_url = status.get("result")  # MeDo app URL on success
```

### 3.2 Invocation Patterns (verified from Baidu doc)

Natural-language messages the MeDo Builder workspace should accept from PM:

| Intent | Message to send to MeDo Builder workspace |
|--------|-------------------------------------------|
| List existing apps | `"Show me my apps"` |
| Create + auto-publish | `"Create a [description] and publish it when done"` |
| Create only | `"Create a [description]"` |
| Modify existing | `"Add a search function to app [name/ID]"` |
| Publish draft | `"Publish this app"` |
| Status check | `"Is the app generation done yet?"` |

---

## 4. Proposed Org Template — `org-templates/medo-builder/`

```
org-templates/medo-builder/
├── org.yaml                 ← minimal single-workspace org (not full team)
├── medo-builder/
│   ├── system-prompt.md     ← MeDo Builder agent persona + delegation rules
│   └── workspace.yaml       ← runtime: openclaw, skill install, env
```

**org.yaml sketch:**

```yaml
name: MeDo Builder
description: Single-workspace org for building MeDo apps (hackathon)
defaults:
  runtime: openclaw
  tier: 2
  required_env: [MIAODA_API_KEY, OPENROUTER_API_KEY]

workspaces:
  - name: MeDo Builder
    role: Builds and publishes MeDo applications via Miaoda App Builder skill
    files_dir: medo-builder
    canvas: { x: 400, y: 300 }
```

The medo-builder workspace is deployed **as a child of the molecule-dev PM** in the hackathon org,
not as a standalone org. Full `org-templates/medo-builder/` implementation is Week 2 scope.

---

## 5. Open Questions (Operator Resolution Required)

| # | Question | Why it blocks |
|---|----------|---------------|
| 5-A | **Credential name**: platform secret is `MEDO_API_KEY`; skill expects `MIAODA_API_KEY`. Rename global secret or add workspace alias? | Workspace boot will fail with "MIAODA_API_KEY not set" |
| 5-B | **Credit cost per app**: Baidu doc mentions a Credit System but content was not rendered. How many credits does create+generate+publish consume? Do we have enough for hackathon testing? | Budget planning |
| 5-C | **Rate limits**: no rate-limit info in docs or ClawHub page. What's the max concurrent app generations per API key? | Parallelism planning |
| 5-D | **Failure recovery**: what happens if the OpenClaw skill process crashes mid-generation (after Confirm & Generate, before Publish)? Is there a way to resume or check status by app ID? | Reliability design |
| 5-E | **Submission format**: does the hackathon judge the published MeDo app URL, the Molecule AI org config, or both? | Determines whether we need a polished demo org or just a working app |

---

## 6. Implementation Checklist (Weeks 1–3)

- [x] Week 1: This design doc (`docs/adapters/medo-integration.md`)
- [ ] Week 1: Resolve §5-A (credential name) + obtain API key credits estimate
- [ ] Week 2: `org-templates/medo-builder/` — full system-prompt + workspace.yaml
- [ ] Week 2: Integration test — PM delegates one real app build end-to-end
- [ ] Week 3: Polish demo org; rehearse submission flow; publish hackathon entry
@@ -1,117 +0,0 @@
---
title: "MeDo Smoke Test Log — 2026-04-13 (Run 4)"
description: "Smoke-test run log for the MeDo / Miaoda App Builder OpenClaw integration."
---
# MeDo Smoke Test Log — 2026-04-13 (Run 4)

**Tester:** PM (direct execution)
**Goal:** Install Miaoda App Builder skill → build "Hello Molecule AI" landing page → publish → URL.
**Credits spent:** 0 across all four runs.

---

## Run Summary

| Run | Blocker | Resolution |
|-----|---------|------------|
| 1 | `workspace-template:openclaw` image not built | ✅ Operator rebuilt image |
| 2 | Adapter key lookup ignores `AISTUDIO_API_KEY` / `QIANFAN_API_KEY` | ✅ Code fix committed (d779e16) |
| 3 | Executor creates fresh OpenClaw session per A2A message | ✅ Code fix committed (9466943) |
| 4 | `payloads: []` on every response — agent never returns text via `--json` mode | ❌ Root cause below |

---

## Run 4 — Detailed Findings

### Environment — all green
| Check | Result |
|-------|--------|
| Platform health | ✅ |
| `workspace-template:openclaw` image | ✅ boots in 31s |
| AISTUDIO_API_KEY + gemini-2.0-flash | ✅ confirmed in every response meta |
| Stable session ID (workspace ID) | ✅ `sessionKey: agent:main:explicit:a507780d-...` consistent across all calls |

### Messages Sent and Responses

| Message | Response | Duration |
|---------|----------|----------|
| Install skill | `payloads: [], livenessState: working` | 1.7s |
| Build Hello Molecule AI | `payloads: [], livenessState: working` | 0.8s |
| Check status (sessions_list) | `LLM request failed: provider rejected request schema/payload` | — |
| Reply with exactly: STATUS_OK | `payloads: [], livenessState: working` (after restart) | 1.8s |

The "Reply with exactly: STATUS_OK" response is decisive. A vanilla LLM call with no tool use should produce a text payload. It didn't. This rules out skill complexity or message ambiguity as the cause.
|
||||
|
||||
### Root Cause — `openclaw agent --json` Does Not Surface Agent Text in `payloads`
|
||||
|
||||
The OpenClaw agent processes messages using background session dispatch (`sessions_spawn` / `sessions_yield`). In this mode:
|
||||
1. Main session receives message → immediately spawns background session → calls `sessions_yield`
|
||||
2. `openclaw agent --json` exits with `payloads: [], livenessState: 'working'`
|
||||
3. Background session processes the actual work and produces text — but only visible in interactive/streaming mode, not in the `--json` subprocess call
|
||||
|
||||
**Evidence:** Even "Reply with exactly: STATUS_OK" returns `payloads: []`. The agent is using background sessions for everything, including trivial echo requests.
|
||||
|
||||
**Likely cause:** OpenClaw's default `SOUL.md` / `BOOTSTRAP.md` workspace config instructs the agent to always use async session patterns. In a terminal session these background responses appear naturally; via subprocess `--json`, only the main session's synchronous output is captured.
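A minimal sketch of how a caller could detect this condition from the `--json` output. It assumes stdout is a single JSON object with the `payloads` key seen in the responses above. The function name `is_yielded_without_text` is hypothetical and not part of the adapter.

```python
import json


def is_yielded_without_text(stdout: str) -> bool:
    """Return True when the JSON output carries no text payloads.

    Assumes the output is one JSON object with a `payloads` list, as
    observed in Run 4. Unparseable output is treated as "no text".
    """
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError:
        return True
    if not isinstance(data, dict):
        return True
    # An absent or empty `payloads` list means the main session yielded
    # before any text was produced.
    return not data.get("payloads")
```

Parsing the JSON rather than matching a string prefix keeps the check independent of how the reply happens to be serialised.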
|
||||
|
||||
### Transient issue: LLM request failed
|
||||
After 3+ rapid A2A calls (install → build → status check), the Gemini AI Studio API returned a schema/payload rejection. Restarting the workspace (`POST /workspaces/:id/restart`) resolved it: the workspace came back in 30s and the next call was normal. The likely cause is a rate-limit or context-size rejection from Gemini.
|
||||
|
||||
---
|
||||
|
||||
## 4. Required Fix — OpenClawA2AExecutor Response Capture
|
||||
|
||||
The executor must retrieve the agent's text response from session history **after** the main session yields. The `sessions_history` CLI command (exposed as `session_history` tool) retrieves past messages.
|
||||
|
||||
**Proposed change** to `workspace/adapters/openclaw/adapter.py` (`execute()` method):
|
||||
|
||||
```python
|
||||
# After proc.communicate() returns with payloads=[]:
|
||||
if not reply or reply.startswith("{'payloads': []"):
|
||||
# Agent yielded without responding — fetch last message from session history
|
||||
await asyncio.sleep(2) # brief wait for background session to complete short tasks
|
||||
hist_proc = await asyncio.create_subprocess_exec(
|
||||
"openclaw", "sessions", "history",
|
||||
"--session-id", self._session_id,
|
||||
"--limit", "1", "--json",
|
||||
stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE,
|
||||
env={**os.environ, "PATH": f"{os.path.expanduser('~/.local/bin')}:{os.environ.get('PATH', '')}"}
|
||||
)
|
||||
hist_stdout, _ = await asyncio.wait_for(hist_proc.communicate(), timeout=15)
|
||||
hist_data = json.loads(hist_stdout.decode().strip() or "{}")
|
||||
last_msg = (hist_data.get("messages") or [{}])[-1]
|
||||
reply = last_msg.get("content", reply) # fall back to original if no history
|
||||
```
|
||||
|
||||
**Note on long tasks (5–8 min builds):** Session history won't have the build result until it completes. For Miaoda App Builder, PM must poll: send a follow-up "What is the status of the Hello Molecule AI app build?" message every 60s until the response contains a URL or error.
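The polling described in the note could look roughly like this. It is a sketch under stated assumptions: `send_message` is a hypothetical callable that sends one A2A message and returns the reply text, and the URL/error detection is a simple heuristic, not the adapter's actual logic. The default of 10 attempts at 60s is chosen only to cover a 5–8 minute build.

```python
import re
import time
from typing import Callable, Optional

# Heuristic: any http(s) URL in the reply counts as "build finished".
URL_RE = re.compile(r"https?://\S+")


def poll_build_status(
    send_message: Callable[[str], str],
    app_name: str,
    interval_s: float = 60.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Ask for build status until a reply contains a URL or an error.

    Returns the final reply, or None if `max_attempts` is exhausted.
    """
    prompt = f"What is the status of the {app_name} app build?"
    for attempt in range(max_attempts):
        reply = send_message(prompt)
        if URL_RE.search(reply) or "error" in reply.lower():
            return reply
        # Do not sleep after the last attempt.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return None
```

Passing `sleep` as a parameter lets a test run the loop without waiting.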
|
||||
|
||||
---
|
||||
|
||||
## 5. Open Questions Status
|
||||
|
||||
### 5-C — Rate limits
|
||||
**UNKNOWN.** Never reached skill invocation.
|
||||
*New data:* Gemini AI Studio hit a schema/payload rejection after 3 rapid calls. This may be a Gemini-specific issue with large tool schemas (OpenClaw's `cron` schema is 6311 chars). Worth filing separately.
|
||||
|
||||
### 5-D — Failure recovery
|
||||
**UNKNOWN.** Never reached app generation.
|
||||
|
||||
---
|
||||
|
||||
## 6. Issues to File
|
||||
|
||||
| # | Issue | Status | Location |
|
||||
|---|-------|--------|----------|
|
||||
| A | `fix(openclaw): use stable workspace session ID` | ✅ fixed in 9466943 | adapter.py |
|
||||
| B | `fix(openclaw): extend key lookup for AISTUDIO/QIANFAN` | ✅ fixed in d779e16 | adapter.py |
|
||||
| C | `fix(provisioner): surface Docker errors in last_sample_error` | ❌ open | provisioner.go |
|
||||
| **D** | **`fix(openclaw): capture agent response via session history when payloads=[]`** | ❌ open — see §4 | adapter.py |
|
||||
| **E** | **`fix(openclaw): Gemini rejects request after N rapid calls with large tool schema`** | ❌ open — investigate cron schema size | adapter.py |
|
||||
|
||||
---
|
||||
|
||||
## 7. Next Steps (before Run 5)
|
||||
|
||||
- [ ] **Dev Lead:** Implement §4 session-history fallback in `OpenClawA2AExecutor.execute()`
|
||||
- [ ] **Dev Lead (optional):** Trim `cron` tool schema to reduce Gemini schema-size rejection risk
|
||||
- [ ] **Operator:** Rebuild image: `bash workspace/build-all.sh openclaw`
|
||||
- [ ] **PM (Run 5):** Re-run smoke test — expected to finally reach skill install confirmation
|
||||
@@ -1,112 +0,0 @@
|
||||
---
|
||||
title: "ADR-001: Admin endpoints accept any workspace bearer token"
|
||||
description: "ADR-001: why admin endpoints validate any workspace bearer token, and the AdminAuth lockdown that followed."
|
||||
---
|
||||
# ADR-001: Admin endpoints accept any workspace bearer token
|
||||
|
||||
**Status:** Accepted — known risk, Phase-H remediation planned
|
||||
**Date:** 2026-04-17
|
||||
**Issue:** #684
|
||||
**Tracking:** Phase-H — #710
|
||||
|
||||
## Context
|
||||
|
||||
The `AdminAuth` middleware validates callers by calling `ValidateAnyToken`, which
|
||||
accepts any live workspace bearer token regardless of which workspace issued it.
|
||||
There is no separation between workspace-scoped tokens (issued to individual
|
||||
agents) and admin-scoped tokens (intended for platform operators).
|
||||
|
||||
This means any workspace agent that has been issued a token can reach every
|
||||
admin-gated route on the platform.
|
||||
|
||||
## Decision
|
||||
|
||||
Proper token-tier separation (workspace vs. admin scope) is deferred to Phase-H.
|
||||
The known risk is explicitly accepted. Mitigation controls are documented below.
|
||||
|
||||
## Blast radius — affected admin endpoints
|
||||
|
||||
A compromised workspace token gives its holder unrestricted access to all
|
||||
of the following admin-gated routes:
|
||||
|
||||
| Endpoint | Impact |
|
||||
|----------|--------|
|
||||
| `GET /admin/workspaces/:id/test-token` | Mint a fresh bearer token for any workspace |
|
||||
| `DELETE /workspaces/:id` | Delete any workspace and auto-revoke its tokens |
|
||||
| `PUT /settings/secrets` / `POST /admin/secrets` | Overwrite any global secret (env-poisons every agent on restart) |
|
||||
| `DELETE /settings/secrets/:key` / `DELETE /admin/secrets/:key` | Delete any global secret; same fan-out restart |
|
||||
| `GET /settings/secrets` / `GET /admin/secrets` | Read all global secret keys (values masked, but key enumeration enables targeted attacks) |
|
||||
| `GET /workspaces/:id/budget` + `PATCH /workspaces/:id/budget` | Read or clear any workspace's token budget |
|
||||
| `GET /events` / `GET /events/:workspaceId` | Read the full structural event log across all workspaces |
|
||||
| `POST /bundles/import` | Import an arbitrary workspace bundle — creates workspaces, injects secrets, overwrites configs |
|
||||
| `GET /bundles/export/:id` | Exfiltrate full workspace bundle including config, secrets references, and files |
|
||||
| `POST /org/import` | Instantiate an entire org template — creates multiple workspaces with arbitrary roles and secrets |
|
||||
| `GET /org/templates` | Enumerate all org template names and their configured roles/system prompts |
|
||||
| `POST /templates/import` | Write arbitrary files into `configsDir` (workspace template injection) |
|
||||
| `GET /templates` | Enumerate all template names and metadata |
|
||||
| `GET /admin/liveness` | Read platform subsystem health (ops intel) |
|
||||
| `GET /admin/schedules/health` | Read cron scheduler health across all workspaces |
|
||||
|
||||
## Risk statement
|
||||
|
||||
**A single compromised workspace agent can achieve full platform takeover via
|
||||
admin endpoints.**
|
||||
|
||||
Attack chain example:
|
||||
1. Agent A's token is exfiltrated (e.g. via a prompt-injection in a delegated task).
|
||||
2. Attacker calls `PUT /settings/secrets` to overwrite `CLAUDE_API_KEY` with a
|
||||
controlled value.
|
||||
3. Every non-paused workspace restarts and loads the poisoned key.
|
||||
4. Attacker now controls the LLM backend for the entire platform.
|
||||
|
||||
Alternatively: call `POST /bundles/import` with a crafted bundle to inject a
|
||||
malicious workspace with a pre-configured `initial_prompt` and elevated secrets.
|
||||
|
||||
## Current mitigations
|
||||
|
||||
- **Workspace isolation** — `CanCommunicate()` in the A2A proxy limits which
|
||||
workspaces can send tasks to which, reducing the blast radius of a single
|
||||
compromised agent during normal operation.
|
||||
- **Audit logging** — PR #651 writes all admin-route calls to `structure_events`.
|
||||
Forensic recovery is possible after the fact.
|
||||
- **`ValidateAnyToken` removed-workspace JOIN** — tokens belonging to deleted
|
||||
workspaces are filtered at the DB layer (PR #682 defense-in-depth) so
|
||||
post-deletion token replay is blocked.
|
||||
- **`MOLECULE_ENV=production` gate** — hides the `/admin/workspaces/:id/test-token`
|
||||
endpoint in production deployments unless `MOLECULE_ENABLE_TEST_TOKENS=1`.
|
||||
|
||||
## Phase-H remediation plan
|
||||
|
||||
Tracked in GitHub issue **#710**.
|
||||
|
||||
### Schema change
|
||||
|
||||
Add a `token_type` column to `workspace_auth_tokens`:
|
||||
|
||||
```sql
|
||||
ALTER TABLE workspace_auth_tokens
|
||||
ADD COLUMN IF NOT EXISTS token_type TEXT NOT NULL DEFAULT 'workspace'
|
||||
CHECK (token_type IN ('workspace', 'admin'));
|
||||
```
|
||||
|
||||
Admin tokens are minted only via a dedicated privileged endpoint that itself
|
||||
requires an existing admin token or a one-time bootstrap secret.
|
||||
|
||||
### Middleware update
|
||||
|
||||
- `WorkspaceAuth` — continue accepting `token_type = 'workspace'` only.
|
||||
- `AdminAuth` — require `token_type = 'admin'`. Workspace tokens rejected.
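The intended rule can be illustrated with a small decision function. This is a Python illustration only, not the server's code, and the names `TokenType`, `REQUIRED_TYPE`, and `is_allowed` are hypothetical. It also leaves out the per-workspace binding that `WorkspaceAuth` performs.

```python
from enum import Enum


class TokenType(str, Enum):
    WORKSPACE = "workspace"
    ADMIN = "admin"


# Which token type each middleware would accept under Phase-H.
REQUIRED_TYPE = {
    "WorkspaceAuth": TokenType.WORKSPACE,
    "AdminAuth": TokenType.ADMIN,
}


def is_allowed(middleware: str, token_type: str) -> bool:
    """Return True if a token of `token_type` may pass `middleware`."""
    required = REQUIRED_TYPE.get(middleware)
    if required is None:
        return False  # unknown middleware: fail closed
    return token_type == required.value
```

Under this rule a workspace token presented to an `AdminAuth` route is rejected, which closes the escalation path described in the risk statement.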
|
||||
|
||||
### Bootstrap flow
|
||||
|
||||
On first boot (no tokens exist), a single-use bootstrap secret is printed to
|
||||
the server log. The operator uses it to mint the first admin token. Subsequent
|
||||
admin tokens are minted by existing admin token holders. The fail-open path in
|
||||
`HasAnyLiveTokenGlobal` is retired once Phase-H ships.
|
||||
|
||||
### Migration path
|
||||
|
||||
Phase-H is a breaking change for any automation that currently uses workspace
|
||||
tokens against admin endpoints. A migration guide and a `MOLECULE_PHASE_H=1`
|
||||
feature flag will be provided so operators can opt in before the strict
|
||||
enforcement date.
|
||||
@@ -1,125 +0,0 @@
|
||||
---
|
||||
title: API Reference
|
||||
description: Full REST API reference for the Molecule AI workspace server — workspace management, A2A communication, file operations, secrets, tokens, and more.
|
||||
---
|
||||
|
||||
# API Reference
|
||||
|
||||
This document describes the REST API exposed by the Molecule AI workspace server (Go/Gin, default port `:8080`). Clients include the Canvas frontend, workspace agents communicating over A2A, and external tooling such as the MCP server and CLI.
|
||||
|
||||
**Base URL:** `http://localhost:8080` (development default)
|
||||
**Rate limit:** 600 req/min (configurable via `RATE_LIMIT`)
|
||||
**CORS origins:** `http://localhost:3000,http://localhost:3001` by default (configurable via `CORS_ORIGINS`)
|
||||
|
||||
---
|
||||
|
||||
## Authentication
|
||||
|
||||
Three middleware classes gate server-side routes:
|
||||
|
||||
- **`AdminAuth`** — strict bearer-only. Required for any route that can leak prompts/memory, create/mutate workspaces, or expose ops intel. Lazy-bootstrap fail-open when no live tokens exist globally.
|
||||
- **`WorkspaceAuth`** — binds a bearer token to a specific workspace `:id`. A token for workspace A cannot be used against workspace B's sub-routes.
|
||||
- **`CanvasOrBearer`** — accepts a bearer token OR a request Origin matching `CORS_ORIGINS`. Used only for cosmetic routes with zero data/security impact (currently `PUT /canvas/viewport` only). Do not extend to routes that leak data or create resources.
|
||||
|
||||
Full contract: `docs/runbooks/admin-auth.md`.
|
||||
|
||||
---
|
||||
|
||||
## Routes
|
||||
|
||||
| Method | Path | Handler |
|
||||
|--------|------|---------|
|
||||
| GET | /health | inline |
|
||||
| GET | /metrics | metrics.Handler() — Prometheus text format; no auth, scrape-safe |
|
||||
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go — `GET /workspaces`, `POST /workspaces`, and `DELETE /workspaces/:id` require `AdminAuth`. `PATCH /workspaces/:id` enforces field-level authz: cosmetic fields (name, role, x, y, canvas) pass through; sensitive fields (tier, parent_id, runtime, workspace_dir) require a valid bearer token when any live token exists. |
|
||||
| GET/PATCH | /workspaces/:id/config | workspace.go |
|
||||
| GET/POST | /workspaces/:id/memory | workspace.go |
|
||||
| DELETE | /workspaces/:id/memory/:key | workspace.go |
|
||||
| POST/PATCH/DELETE | /workspaces/:id/agent | agent.go |
|
||||
| POST | /workspaces/:id/agent/move | agent.go |
|
||||
| GET/POST/PUT | /workspaces/:id/secrets | secrets.go (POST/PUT auto-restarts workspace) |
|
||||
| DELETE | /workspaces/:id/secrets/:key | secrets.go (DELETE auto-restarts workspace) |
|
||||
| GET | /workspaces/:id/model | secrets.go |
|
||||
| GET | /settings/secrets | secrets.go — list global secrets (keys only, values masked) |
|
||||
| PUT/POST | /settings/secrets | secrets.go — set a global secret `{key, value}`; auto-restarts every non-paused/non-removed/non-external workspace that does not shadow the key with a workspace-level override |
|
||||
| DELETE | /settings/secrets/:key | secrets.go — delete a global secret; same auto-restart fan-out as PUT/POST |
|
||||
| GET | /admin/workspaces/:id/test-token | admin_test_token.go — mint a fresh bearer token for E2E scripts; returns 404 unless `MOLECULE_ENV != production` or `MOLECULE_ENABLE_TEST_TOKENS=1` |
|
||||
| GET/POST/DELETE | /admin/secrets[/:key] | secrets.go — legacy aliases for /settings/secrets |
|
||||
| WS | /workspaces/:id/terminal | terminal.go |
|
||||
| POST | /workspaces/:id/expand | team.go |
|
||||
| POST | /workspaces/:id/collapse | team.go |
|
||||
| POST/GET | /workspaces/:id/approvals | approvals.go |
|
||||
| POST | /workspaces/:id/approvals/:id/decide | approvals.go |
|
||||
| GET | /approvals/pending | approvals.go |
|
||||
| POST/GET | /workspaces/:id/memories | memories.go |
|
||||
| DELETE | /workspaces/:id/memories/:id | memories.go |
|
||||
| GET | /workspaces/:id/traces | traces.go |
|
||||
| GET/POST | /workspaces/:id/activity | activity.go |
|
||||
| POST | /workspaces/:id/notify | activity.go (agent→user push message via WebSocket) |
|
||||
| POST | /workspaces/:id/restart | workspace.go |
|
||||
| POST | /workspaces/:id/pause | workspace.go (stops container, status→paused) |
|
||||
| POST | /workspaces/:id/resume | workspace.go (re-provisions paused workspace) |
|
||||
| POST | /workspaces/:id/a2a | workspace.go |
|
||||
| POST | /workspaces/:id/delegate | delegation.go (async fire-and-forget) |
|
||||
| GET | /workspaces/:id/delegations | delegation.go (list delegation status) |
|
||||
| GET/POST | /workspaces/:id/schedules | schedules.go (cron CRUD) |
|
||||
| PATCH/DELETE | /workspaces/:id/schedules/:scheduleId | schedules.go |
|
||||
| POST | /workspaces/:id/schedules/:scheduleId/run | schedules.go (manual trigger) |
|
||||
| GET | /workspaces/:id/schedules/:scheduleId/history | schedules.go (past runs) |
|
||||
| GET/POST | /workspaces/:id/channels | channels.go (social channel CRUD) |
|
||||
| PATCH/DELETE | /workspaces/:id/channels/:channelId | channels.go |
|
||||
| POST | /workspaces/:id/channels/:channelId/send | channels.go (outbound message) |
|
||||
| POST | /workspaces/:id/channels/:channelId/test | channels.go (test connection) |
|
||||
| GET | /channels/adapters | channels.go (list available platforms) |
|
||||
| POST | /channels/discover | channels.go (auto-detect chats for a bot token) |
|
||||
| POST | /webhooks/:type | channels.go (incoming social webhook) |
|
||||
| GET | /workspaces/:id/shared-context | templates.go |
|
||||
| GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go |
|
||||
| GET | /canvas/viewport | viewport.go — open, no auth required (cosmetic, bootstrap-friendly) |
|
||||
| PUT | /canvas/viewport | viewport.go — `CanvasOrBearer` middleware; accepts bearer OR Origin matching `CORS_ORIGINS`. Cosmetic-only route — worst case viewport corruption, recovered by page refresh. |
|
||||
| GET | /templates | templates.go |
|
||||
| POST | /templates/import | templates.go — `AdminAuth` required |
|
||||
| POST | /registry/register | registry.go |
|
||||
| POST | /registry/heartbeat | registry.go — requires `Authorization: Bearer <token>` once a workspace has any live token on file (legacy workspaces grandfathered) |
|
||||
| POST | /registry/update-card | registry.go — requires `Authorization: Bearer <token>` once a workspace has any live token on file |
|
||||
| GET | /registry/discover/:id | discovery.go — requires `X-Workspace-ID` + bearer token on the caller side |
|
||||
| GET | /registry/:id/peers | discovery.go — requires `X-Workspace-ID` + bearer token on the caller side |
|
||||
| POST | /registry/check-access | discovery.go |
|
||||
| GET | /plugins | plugins.go (list registry; supports `?runtime=` filter) |
|
||||
| GET | /plugins/sources | plugins.go (list registered install-source schemes) |
|
||||
| GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install (`{"source":"scheme://spec"}`), uninstall per-workspace |
|
||||
| GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) |
|
||||
| GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) |
|
||||
| GET/POST | /workspaces/:id/tokens | tokens.go — list active tokens (prefix + metadata), create new token (plaintext returned once). Max 50 per workspace. |
|
||||
| DELETE | /workspaces/:id/tokens/:tokenId | tokens.go — revoke specific token by ID |
|
||||
| GET | /bundles/export/:id | bundle.go — `AdminAuth` required |
|
||||
| POST | /bundles/import | bundle.go — `AdminAuth` required |
|
||||
| GET | /org/templates | org.go (list available org templates) |
|
||||
| POST | /org/import | org.go — `AdminAuth` required; applies `resolveInsideRoot` path sanitiser on template paths |
|
||||
| GET | /events | events.go — `AdminAuth` required |
|
||||
| GET | /events/:workspaceId | events.go — `AdminAuth` required |
|
||||
| GET | /admin/liveness | inline — `AdminAuth` required. Returns per-subsystem `supervised.Snapshot()` ages; use to check health of scheduler/heartbeat goroutines |
|
||||
| GET | /ws | socket.go |
|
||||
|
||||
---
|
||||
|
||||
## Database
|
||||
|
||||
Migration files live in `workspace-server/migrations/` (latest: `022_workspace_schedules_source`). Each migration ships as a `.up.sql`/`.down.sql` pair. The migration runner globs `*.sql`, filters out `.down.sql` files, sorts alphabetically, and executes each file on boot. All `.up.sql` files must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`) because the runner re-applies every migration on every boot.
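The file-selection rule can be sketched as follows. This is a Python illustration of the described behaviour, not the runner itself, and `migrations_to_apply` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def migrations_to_apply(migrations_dir: str) -> List[str]:
    """List migration file names in the order the runner would execute them.

    Mirrors the described behaviour: glob `*.sql`, drop `.down.sql`
    files, sort alphabetically.
    """
    names = [p.name for p in Path(migrations_dir).glob("*.sql")]
    return sorted(n for n in names if not n.endswith(".down.sql"))
```

Because ordering is purely alphabetical, the zero-padded numeric prefixes are what keep migrations in sequence.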
|
||||
|
||||
### Key Tables
|
||||
|
||||
| Table | Description |
|
||||
|-------|-------------|
|
||||
| `workspaces` | Core entity — status, runtime, `agent_card` JSONB, heartbeat columns, `current_task`, `awareness_namespace`, `workspace_dir` |
|
||||
| `canvas_layouts` | Per-workspace x/y canvas position |
|
||||
| `structure_events` | Append-only event log (workspace lifecycle, agent, approval events) |
|
||||
| `activity_logs` | A2A communications, task updates, agent logs, errors. `error_detail` is populated by the scheduler so cron run history can surface failure reasons. |
|
||||
| `workspace_schedules` | Cron tasks — expression, timezone, prompt, run history, `source` (`'template'` for org/import-seeded, `'runtime'` for Canvas/API-created), `last_status` (includes `'skipped'` when the scheduler concurrency-skips a busy workspace) |
|
||||
| `workspace_channels` | Social channel integrations (Telegram, Slack, etc.) with JSONB config and allowlist |
|
||||
| `agents` | Agent records |
|
||||
| `workspace_secrets` | Per-workspace encrypted secrets |
|
||||
| `global_secrets` | Platform-wide encrypted secrets |
|
||||
| `workspace_auth_tokens` | Bearer tokens; auto-revoked on workspace delete |
|
||||
| `agent_memories` | HMA scoped memory (LOCAL / TEAM / GLOBAL) |
|
||||
| `approvals` | Human-in-the-loop approval requests |
|
||||
@@ -1,83 +0,0 @@
|
||||
---
|
||||
title: "Canary release pipeline"
|
||||
description: "The canary release pipeline that ships workspace-server changes to the prod tenant fleet, and how to halt it."
|
||||
---
|
||||
# Canary release pipeline
|
||||
|
||||
How a workspace-server code change reaches the prod tenant fleet — and how to stop it if something's wrong.
|
||||
|
||||
## The loop
|
||||
|
||||
```
|
||||
PR merged to staging → main
|
||||
│
|
||||
▼
|
||||
publish-workspace-server-image.yml ← pushes :staging-<sha> ONLY
|
||||
│ (NOT :latest — prod is untouched)
|
||||
▼
|
||||
Canary tenants auto-update to :staging-<sha>
|
||||
│ (5-min auto-updater cycle on each canary EC2)
|
||||
▼
|
||||
canary-verify.yml waits 6 min, runs scripts/canary-smoke.sh
|
||||
│
|
||||
├─► GREEN → crane tag :staging-<sha> → :latest
|
||||
│ │
|
||||
│ ▼
|
||||
│ Prod tenants auto-update within 5 min
|
||||
│
|
||||
└─► RED → :latest stays on prior good digest
|
||||
GitHub Step Summary flags the rejected sha
|
||||
Ops fixes forward OR rolls back manually
|
||||
```
|
||||
|
||||
## Canary fleet
|
||||
|
||||
The canary fleet lives in a separate AWS account (`molecule-canary`, `004947743811`), reached via an assumed role (`MoleculeStagingProvisioner`). The CP's `is_canary` org flag routes provisioning there; every other org goes to the default staging account. See `docs/architecture/saas-prod-migration-2026-04-19.md` for the account bootstrap.
|
||||
|
||||
Canary tenants are configured to pull `:staging-<sha>` (not `:latest`) via `TENANT_IMAGE` on their provisioner, so they ingest each new build before prod does.
|
||||
|
||||
## Smoke suite
|
||||
|
||||
`scripts/canary-smoke.sh` hits each canary tenant (URL + ADMIN_TOKEN pair) and asserts:
|
||||
|
||||
- `/admin/liveness` returns a subsystems map (tenant booted, AdminAuth reachable)
|
||||
- `/workspaces` returns a JSON array (wsAuth + DB healthy)
|
||||
- `/memories/commit` + `/memories/search` round-trip (encryption + scrubber)
|
||||
- `/events` admin read (C4 fail-closed proof)
|
||||
- `/admin/liveness` without bearer → 401 (C4 regression gate)
|
||||
|
||||
Expand by editing the script — each `check "name" "expected" "$response"` call is one line.
|
||||
|
||||
## Adding a canary tenant
|
||||
|
||||
1. `POST /cp/orgs` — create the org normally (is_canary defaults to false)
|
||||
2. `POST /cp/admin/orgs/<slug>/canary` with `{"is_canary": true}` — admin only, refuses to flip if already provisioned
|
||||
3. Re-trigger provision (or delete + recreate if the org was already provisioned into staging) — the fresh EC2 lands in account `004947743811`
|
||||
|
||||
Then set repo secrets:
|
||||
- `CANARY_TENANT_URLS` — append the new tenant's URL
|
||||
- `CANARY_ADMIN_TOKENS` — append its ADMIN_TOKEN in the same position
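Because the two secrets are matched by position, a misaligned append silently pairs a tenant with the wrong token. The sketch below shows one way to validate the pairing. The comma separator and the function name `pair_canary_tenants` are assumptions; the source only states that entries share the same position.

```python
from typing import List, Tuple


def pair_canary_tenants(urls: str, tokens: str, sep: str = ",") -> List[Tuple[str, str]]:
    """Pair each tenant URL with the token at the same position.

    The separator is an assumption; adjust it to the real secret format.
    """
    url_list = [u.strip() for u in urls.split(sep) if u.strip()]
    token_list = [t.strip() for t in tokens.split(sep) if t.strip()]
    if len(url_list) != len(token_list):
        raise ValueError(
            f"{len(url_list)} URLs but {len(token_list)} tokens; lists must align"
        )
    return list(zip(url_list, token_list))
```

Running a check like this before saving the secrets catches a missing or extra entry early.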
|
||||
|
||||
## Rolling back `:latest`
|
||||
|
||||
When canary was green but something surfaces post-promotion, retag `:latest` to a prior digest:
|
||||
|
||||
```bash
|
||||
export GITHUB_TOKEN=ghp_... # write:packages
|
||||
scripts/rollback-latest.sh 4c1d56e # retags both platform + tenant images
|
||||
```
|
||||
|
||||
`scripts/rollback-latest.sh` pre-checks that `:staging-<sha>` exists before moving `:latest`, and verifies the digest after the move. Prod tenants pick up the rolled-back image on their next 5-min auto-update.
|
||||
|
||||
A post-mortem should always include:
|
||||
- the commit sha that broke
|
||||
- why canary didn't catch it (new code path the smoke suite doesn't exercise?)
|
||||
- whether the smoke suite should grow a new check to prevent the same class of bug
|
||||
|
||||
## What this gate doesn't catch
|
||||
|
||||
- Bugs that only surface under prod-only data (customer workloads with scale or shape canary doesn't produce). Canary uses real traffic shapes but can't simulate weeks of accumulated state.
|
||||
- Config drift between canary and prod (different env-var values, different feature flags). Keep canary's config deltas minimal and documented.
|
||||
- Cross-tenant interactions — canary tenants run in their own AWS account, so a bug that only appears when two tenants compete for a shared resource won't reproduce here.
|
||||
|
||||
When these miss, `rollback-latest.sh` is the escape hatch.
|
||||
@@ -1,76 +0,0 @@
|
||||
---
|
||||
title: "SaaS prod migration — 2026-04-19"
|
||||
description: "Prod cutover notes for the 2026-04-19 staging→main promotion of molecule-controlplane and molecule-core."
|
||||
---
|
||||
# SaaS prod migration — 2026-04-19
|
||||
|
||||
Promoted staging → main on both `Molecule-AI/molecule-controlplane` and `Molecule-AI/molecule-core`. This note captures the prod cutover deltas so ops can cross-check against the running system.
|
||||
|
||||
## What changed
|
||||
|
||||
Eleven PRs landed, split across the two repos:
|
||||
|
||||
**Control plane (`molecule-controlplane`)**
|
||||
- PR #50 — C1/C2/C3: bearer auth on `/cp/workspaces/*`, shell-escape tenant user-data, per-tenant security group
|
||||
- PR #51 — H1/H2: crash-safe `SECRETS_ENCRYPTION_KEY` log, dropped `admin_token` from `/instance` SELECT
|
||||
- PR #52 — SSRF guard on `platform_url`
|
||||
- PR #53 — CP injects `MOLECULE_CP_SHARED_SECRET` + `MOLECULE_CP_URL` into tenant env
|
||||
- PR #54 — Stripe webhook body capped at 1 MiB
|
||||
|
||||
**Core (`molecule-core` / this repo)**
|
||||
- PR #978 — H3/H4: LimitReader on Discord webhook + workspace config PATCH
|
||||
- PR #979 — C4: `AdminAuth` fail-closed on fresh install when `ADMIN_TOKEN` is set
|
||||
- PR #980 — log-scrub: dropped token prefix logging, stopped logging raw upstream response bodies
|
||||
- PR #981 — tenant `CPProvisioner` attaches the CP bearer on every outbound `/cp/workspaces/*` call
|
||||
- PR #982 — Canvas API fetch timeout (15s)
|
||||
- PR #984 — E2E smoke test sync for #966 (public GET no longer exposes `current_task`)
|
||||
|
||||
## New prod env vars (Railway, project `molecule-platform`, env `production`)
|
||||
|
||||
Set before the CP merge landed:
|
||||
|
||||
| Variable | Value shape | Purpose |
|
||||
|---|---|---|
|
||||
| `PROVISION_SHARED_SECRET` | 32-byte hex | Gates `/cp/workspaces/*` on CP. Routes refuse to mount when unset — C1 fail-closed. |
|
||||
| `EC2_VPC_ID` | `vpc-…` | Enables per-tenant SG creation (C3). Shared-SG fallback emits a startup warning. |
|
||||
| `CP_BASE_URL` | `https://api.moleculesai.app` | Injected into newly-provisioned tenant containers as `MOLECULE_CP_URL`. |
|
||||
|
||||
The live prod `PROVISION_SHARED_SECRET` value is held only in Railway and is not committed anywhere. To rotate it, run `railway variables --set` and redeploy.
|
||||
|
||||
## Existing-tenant migration (the sharp edge)
|
||||
|
||||
Tenants provisioned **before** this cutover are still running the previous workspace-server image. When they pull the new image on their next boot or auto-update cycle, their `CPProvisioner` will start expecting `MOLECULE_CP_SHARED_SECRET` in the container env — but the existing tenant EC2s don't have that variable in their user-data (the CP only started injecting it from PR #53 onward).
|
||||
|
||||
**Symptom**: a pre-cutover tenant can still serve its users' existing workspaces, but any attempt to **provision a new workspace** from inside the tenant UI will hit the CP's new bearer gate and get `401` or `404` back, surfacing as "workspace provision failed" with a generic error.
|
||||
|
||||
**Fix per existing tenant (pick one)**:
|
||||
|
||||
1. **SSH in + add the env var**
|
||||
- Copy `PROVISION_SHARED_SECRET` from Railway prod env.
|
||||
- `ssh ubuntu@<tenant-ip>` and append to the running container's env (`docker stop && docker run … -e MOLECULE_CP_SHARED_SECRET='…' -e MOLECULE_CP_URL=https://api.moleculesai.app …`). Rolling this into an auto-update hook is follow-up work.
|
||||
|
||||
2. **Re-provision the tenant**
|
||||
- `DELETE /cp/orgs/:slug` → re-create via normal signup flow. Tenant-level data survives only if the tenant's own Postgres volume is preserved; workspace_id values change. This is the heavy hammer — only for tenants where existing data can be recreated easily.
|
||||
|
||||
3. **Wait for the auto-update + user-data refresh cycle**
|
||||
- The tenant auto-updater (cron, 5-minute cadence) pulls the new container image but **does not refresh env vars**, which are frozen from the initial user-data. Waiting alone therefore does not fix this; option 1 or 2 is still required.
|
||||
|
||||
Script at `scripts/migrate-tenant-cp-secret.sh` (follow-up) will automate option 1 across all running tenants in the prod AWS account.
|
||||
|
||||
## Post-deploy verification checklist
|
||||
|
||||
- [ ] Railway prod deploy for `controlplane` lands on the new commit (check `https://railway.com/project/7ccc…/service/ae76…`)
|
||||
- [ ] `curl https://api.moleculesai.app/health` → 200 `{service: molecule-cp, status: ok}`
|
||||
- [ ] `curl -X POST https://api.moleculesai.app/cp/workspaces/provision` (no bearer) → 401 (**not** 404 — proves the env var is live and routes mounted)
|
||||
- [ ] GHCR publishes new `workspace-server` image for the core main commit
|
||||
- [ ] Vercel canvas prod deploy lands
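The unauthenticated provision probe in the checklist distinguishes two failure modes by status code. A small sketch of that interpretation is shown here. `interpret_provision_probe` is a hypothetical helper, and the messages are illustrative.

```python
def interpret_provision_probe(status_code: int) -> str:
    """Classify the result of an unauthenticated POST /cp/workspaces/provision."""
    if status_code == 401:
        # Routes are mounted and the bearer gate rejected the request.
        return "ok: routes mounted and bearer gate active"
    if status_code == 404:
        # Routes refuse to mount when PROVISION_SHARED_SECRET is unset.
        return "fail: routes not mounted (PROVISION_SHARED_SECRET likely unset)"
    return f"unexpected: HTTP {status_code}"
```

A `404` here therefore points at missing configuration rather than at a wrong URL.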
|
||||
|
||||
## Rollback
|
||||
|
||||
If prod is on fire:
|
||||
|
||||
1. `gh pr revert 46 -R Molecule-AI/molecule-controlplane` — reverts all 6 CP PRs together.
|
||||
2. `gh pr revert 983 -R Molecule-AI/molecule-core` — reverts the core bundle.
|
||||
3. Both reverts auto-deploy via Railway / GHCR / Vercel.
|
||||
|
||||
Existing tenants aren't affected by a rollback — they're running whichever tenant image tag they booted with. Only newly-provisioned tenants pick up the reverted control plane code.
|
||||
@@ -1,218 +0,0 @@
|
||||
---
|
||||
title: "Staging Environment Design"
|
||||
description: "The staging environment design on Railway, mirroring prod for safe pre-release validation."
|
||||
---
|
||||
# Staging Environment Design
|
||||
|
||||
> **Status:** Planned — gates all future infra changes (Tunnel migration,
|
||||
> security fixes, etc.)
|
||||
>
|
||||
> **Problem:** We merge directly to main and auto-deploy to production.
|
||||
> Today's session broke CI twice and caused hours of Cloudflare edge cache
|
||||
> issues because there was no staging to test infra changes first.
|
||||
>
|
||||
> **Goal:** Full staging environment that mirrors production. Every change
|
||||
> ships to staging first, gets verified, then promotes to production.
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
staging production
|
||||
─────── ──────────
|
||||
Git branch: main (auto-deploy) main (manual promote)
|
||||
or staging branch
|
||||
|
||||
CP (Railway): staging service production service
|
||||
staging.api.moleculesai.app api.moleculesai.app
|
||||
|
||||
Tenant EC2s: staging EC2 instances production EC2 instances
|
||||
*.staging.moleculesai.app *.moleculesai.app
|
||||
|
||||
App (Vercel): staging.app.moleculesai.app app.moleculesai.app
|
||||
(Vercel preview) (Vercel production)
|
||||
|
||||
DB (Neon): staging branch main branch
|
||||
(or separate project)
|
||||
|
||||
Docker images: platform-tenant:staging platform-tenant:latest
|
||||
(GHCR) (GHCR)
|
||||
|
||||
Cloudflare: *.staging.moleculesai.app *.moleculesai.app
|
||||
(separate tunnel/worker) (tunnel per tenant)
|
||||
```
|
||||
|
||||
## Deploy flow
|
||||
|
||||
```
|
||||
Developer pushes to PR branch
|
||||
→ CI runs (tests, build, lint)
|
||||
→ PR merged to main
|
||||
→ Auto-deploy to STAGING
|
||||
→ Staging smoke tests (automated)
|
||||
→ Manual verification if needed
|
||||
→ Promote to PRODUCTION (manual trigger or approval)
|
||||
```
|
||||
|
||||
## Components
|
||||
|
||||
### 1. Railway: two environments
|
||||
|
||||
Railway supports multiple environments per project. Create a `staging`
|
||||
environment alongside `production`:
|
||||
|
||||
```bash
|
||||
railway environment new staging
|
||||
railway variables --environment staging --set "DATABASE_URL=<staging-neon>"
|
||||
railway variables --environment staging --set "MOLECULE_ENV=staging"
|
||||
# ... all other vars with staging-specific values
|
||||
```
|
||||
|
||||
**Deploy trigger:**
|
||||
- `staging`: auto-deploy on push to main
|
||||
- `production`: manual promote via `railway up --environment production`
|
||||
or GitHub Actions workflow_dispatch
|
||||
|
||||
**Domains:**
|
||||
- staging: `staging.api.moleculesai.app` (Railway custom domain)
|
||||
- production: `api.moleculesai.app` (unchanged)
|
||||
|
||||
### 2. Neon: branch per environment
|
||||
|
||||
Neon supports database branches (like git branches):
|
||||
|
||||
```bash
|
||||
# Create staging branch from main
|
||||
neon branch create --project-id <id> --name staging --parent main
|
||||
```
|
||||
|
||||
- Staging DB has same schema, separate data
|
||||
- Can reset staging by re-branching from main
|
||||
- Production data never touched by staging tests
|
||||
|
||||
### 3. Vercel: preview deployments
|
||||
|
||||
Vercel already supports this natively:
|
||||
- Push to main → deploys to `app.moleculesai.app` (production)
|
||||
- Push to `staging` branch → deploys to preview URL
|
||||
|
||||
**Or** use Vercel environments:
|
||||
- `staging.app.moleculesai.app` → staging deployment
|
||||
- `app.moleculesai.app` → production deployment
|
||||
|
||||
### 4. GHCR: tagged images
|
||||
|
||||
```
|
||||
platform-tenant:staging — built on every push to main
|
||||
platform-tenant:latest — promoted from staging after verification
|
||||
platform-tenant:sha-xxxxx — immutable, pinned to specific commit
|
||||
```
|
||||
|
||||
**Publish workflow change:**
|
||||
```yaml
|
||||
# Current: pushes :latest on every main merge
|
||||
# New: pushes :staging on every main merge
|
||||
# pushes :latest only on manual promote
|
||||
```
|
||||
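The tagging rule can be written down as a small pure function. This is a sketch under assumptions: the event names `push_main` and `promote`, the helper name `tags_for_event`, the 7-character short SHA, and publishing the `sha-` tag on every main push are all illustrative choices, not something the design above fixes.

```python
from typing import List


def tags_for_event(event: str, commit_sha: str) -> List[str]:
    """Return the image tags to publish for a given pipeline event."""
    if event == "push_main":
        # Every main merge refreshes :staging and adds an immutable per-commit tag.
        return ["staging", f"sha-{commit_sha[:7]}"]
    if event == "promote":
        # :latest only moves when someone promotes a verified staging image.
        return ["latest"]
    raise ValueError(f"unknown event: {event!r}")
```

Keeping the rule in one place makes it harder for a workflow edit to start pushing `:latest` on every merge again.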
|
||||
### 5. Cloudflare: staging subdomain
|
||||
|
||||
Option A (simple): `*.staging.moleculesai.app` with its own tunnel/worker
|
||||
Option B (full): separate Cloudflare zone for staging (overkill)
|
||||
|
||||
Recommend Option A:
|
||||
- Add `staging.moleculesai.app` DNS records
|
||||
- Staging tenants get `slug.staging.moleculesai.app` subdomains
|
||||
- Production tenants get `slug.moleculesai.app` (unchanged)
|
||||
|
||||
### 6. EC2: staging tag
|
||||
|
||||
Staging EC2 instances tagged with `Environment=staging`:
|
||||
- Separate from production instances in AWS console
|
||||
- Can use different AMI, instance type, security group
|
||||
- Easy to identify and clean up
|
||||
|
||||
## Environment variables
|
||||
|
||||
| Variable | Staging | Production |
|
||||
|----------|---------|------------|
|
||||
| `MOLECULE_ENV` | `staging` | `production` |
|
||||
| `DATABASE_URL` | Neon staging branch | Neon main branch |
|
||||
| `TENANT_IMAGE` | `platform-tenant:staging` | `platform-tenant:latest` |
|
||||
| `APP_DOMAIN` | `staging.moleculesai.app` | `moleculesai.app` |
|
||||
| `CORS_ORIGINS` | `https://staging.app.moleculesai.app` | `https://app.moleculesai.app` |
|
||||
| `ADMIN_TOKEN` | per-tenant (same mechanism) | per-tenant |
|
||||
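The table can also be held as data, which makes it easy to derive per-tenant hostnames consistently. A minimal sketch follows; `EnvConfig`, `ENVIRONMENTS`, and `tenant_host` are hypothetical names, and only the non-secret rows of the table are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    molecule_env: str
    tenant_image: str
    app_domain: str
    cors_origins: str


# Values copied from the table above; secrets such as DATABASE_URL are left out.
ENVIRONMENTS = {
    "staging": EnvConfig(
        molecule_env="staging",
        tenant_image="platform-tenant:staging",
        app_domain="staging.moleculesai.app",
        cors_origins="https://staging.app.moleculesai.app",
    ),
    "production": EnvConfig(
        molecule_env="production",
        tenant_image="platform-tenant:latest",
        app_domain="moleculesai.app",
        cors_origins="https://app.moleculesai.app",
    ),
}


def tenant_host(slug: str, env: str) -> str:
    """Build the tenant subdomain, e.g. slug.staging.moleculesai.app."""
    return f"{slug}.{ENVIRONMENTS[env].app_domain}"
```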
|
||||
## Promotion workflow
|
||||
|
||||
### Automated (CI/CD)
|
||||
|
||||
```yaml
|
||||
# .github/workflows/promote-to-production.yml
|
||||
name: Promote to Production
|
||||
on:
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
confirm:
|
||||
description: 'Type "promote" to confirm'
|
||||
required: true
|
||||
|
||||
jobs:
|
||||
promote:
|
||||
if: github.event.inputs.confirm == 'promote'
|
||||
steps:
|
||||
# 1. Run staging smoke tests one more time
|
||||
- run: bash tests/e2e/test_saas_tenant.sh
|
||||
env:
|
||||
TENANT_SLUG: smoke-test
|
||||
BASE_URL: https://staging.api.moleculesai.app
|
||||
|
||||
# 2. Tag Docker image
|
||||
- run: |
|
||||
docker pull ghcr.io/molecule-ai/platform-tenant:staging
|
||||
docker tag ghcr.io/molecule-ai/platform-tenant:staging \
|
||||
ghcr.io/molecule-ai/platform-tenant:latest
|
||||
docker push ghcr.io/molecule-ai/platform-tenant:latest
|
||||
|
||||
# 3. Deploy CP to production
|
||||
- run: railway up --environment production
|
||||
|
||||
# 4. Production tenants auto-update within 5 min (Option B cron)
|
||||
```
|
||||
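The gating in the workflow above (typed confirmation, then smoke tests, then tag and deploy) can be sketched as a function that either refuses or returns the ordered commands. `promotion_commands` is a hypothetical helper written for illustration; it only builds the command list and does not execute anything.

```python
from typing import List

IMAGE = "ghcr.io/molecule-ai/platform-tenant"


def promotion_commands(confirm: str, smoke_tests_passed: bool) -> List[List[str]]:
    """Return the promotion steps in order, or raise if a gate is not satisfied."""
    if confirm != "promote":
        raise ValueError('confirmation input must be exactly "promote"')
    if not smoke_tests_passed:
        raise RuntimeError("staging smoke tests failed; refusing to promote")
    return [
        ["docker", "pull", f"{IMAGE}:staging"],
        ["docker", "tag", f"{IMAGE}:staging", f"{IMAGE}:latest"],
        ["docker", "push", f"{IMAGE}:latest"],
        ["railway", "up", "--environment", "production"],
    ]
```

Returning argument lists rather than shell strings means a caller can pass each step straight to `subprocess.run` without a shell.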
|
||||
### Manual (for now)
|
||||
|
||||
Until the automated workflow is built:
|
||||
1. Verify on staging (`staging.api.moleculesai.app`)
|
||||
2. `docker tag platform-tenant:staging platform-tenant:latest && docker push platform-tenant:latest`
|
||||
3. `railway up --environment production`
|
||||
4. Monitor production health
|
||||
|
||||
## What this prevents
|
||||
|
||||
- CI breakage from untested path filters (today's dorny/paths-filter issue)
|
||||
- Cloudflare edge cache poisoning (test DNS changes on staging subdomain)
|
||||
- Workspace boot script regressions (test on staging EC2 first)
|
||||
- DB migration failures (test on Neon staging branch)
|
||||
- Auth/security regressions (staging has same auth stack)
|
||||
|
||||
## Implementation order
|
||||
|
||||
1. **Railway staging environment** — create + configure vars (~30 min)
|
||||
2. **Neon staging branch** — create from main (~5 min)
|
||||
3. **Staging DNS** — `staging.api.moleculesai.app` CNAME to Railway (~5 min)
|
||||
4. **Publish workflow** — push `:staging` tag instead of `:latest` (~15 min)
|
||||
5. **Promotion workflow** — manual trigger to promote staging → production (~30 min)
|
||||
6. **Vercel staging** — configure preview deployment URL (~15 min)
|
||||
7. **Staging smoke test** — automated test after staging deploy (~30 min)
|
||||
|
||||
**Total:** ~2.5 hours for full staging pipeline.
|
||||
|
||||
## Cost
|
||||
|
||||
- Railway staging: ~$5/mo (same as production, but can be smaller)
|
||||
- Neon staging branch: free (included in plan)
|
||||
- EC2 staging instances: only when testing (terminate after)
|
||||
- Vercel: free (preview deployments included)
|
||||
- Cloudflare: free (same zone, additional records)
|
||||
@@ -1,154 +0,0 @@
|
||||
---
|
||||
title: "Tenant Image Upgrade Strategies"
|
||||
description: "Strategies for rolling a new platform-tenant image out to existing EC2 tenants, with trade-offs."
|
||||
---
|
||||
# Tenant Image Upgrade Strategies
|
||||
|
||||
> **Status:** Option B (sidecar auto-updater) implemented. Options A and C
|
||||
> documented for future use.
|
||||
|
||||
## Problem
|
||||
|
||||
When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant
|
||||
instances keep running the old image. New orgs get the latest image at boot,
|
||||
but existing tenants fall behind — missing bug fixes, security patches, and
|
||||
new features.
|
||||
|
||||
## Option A: Rolling restart on publish (coordinated)
|
||||
|
||||
The publish workflow calls a CP admin endpoint after pushing the image.
|
||||
The CP iterates all running tenants and restarts them one by one.
|
||||
|
||||
```
|
||||
publish-platform-image succeeds
|
||||
→ POST https://api.moleculesai.app/cp/admin/rolling-upgrade
|
||||
→ CP queries org_instances WHERE status = 'running'
|
||||
→ For each tenant (staggered, 30s apart):
|
||||
1. AWS SSM Run Command: docker pull + docker restart
|
||||
2. Wait for /health 200
|
||||
3. Update org_instances.updated_at
|
||||
4. If health fails after 60s, rollback (docker run old image)
|
||||
→ Return summary: {upgraded: N, failed: M, skipped: K}
|
||||
```
|
||||
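The loop the CP would run can be sketched as follows. This is an illustration under assumptions, not the handler itself: `rolling_upgrade` and its injected callables (`upgrade`, `is_healthy`, `rollback`) are hypothetical, the 60-second health wait is assumed to live inside `is_healthy`, and counting non-running tenants as "skipped" is a guess at what that field means.

```python
import time
from typing import Callable, Dict, Iterable


def rolling_upgrade(
    tenants: Iterable[dict],
    upgrade: Callable[[dict], None],
    is_healthy: Callable[[dict], bool],
    rollback: Callable[[dict], None],
    stagger_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Upgrade tenants one at a time and return the summary counters."""
    summary = {"upgraded": 0, "failed": 0, "skipped": 0}
    for index, tenant in enumerate(tenants):
        if index > 0:
            sleep(stagger_seconds)  # stagger so a bad image surfaces early
        if tenant.get("status") != "running":
            summary["skipped"] += 1
            continue
        upgrade(tenant)  # e.g. SSM Run Command: docker pull + docker restart
        if is_healthy(tenant):
            summary["upgraded"] += 1
        else:
            rollback(tenant)  # restart the previous image
            summary["failed"] += 1
    return summary
```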
|
||||
### Pros
|
||||
- Immediate, coordinated upgrades across all tenants
|
||||
- CP has full visibility into upgrade status
|
||||
- Can implement canary (upgrade 1 tenant first, verify, then rest)
|
||||
- Rollback capability per tenant
|
||||
|
||||
### Cons
|
||||
- Requires AWS SSM agent on EC2 instances (not installed yet)
|
||||
- Alternatively requires SSH access from Railway → EC2 (network/key management)
|
||||
- Brief downtime per tenant during restart (~10-30s)
|
||||
- Blast radius: unless a canary stage runs first, a bad image can take down every tenant
|
||||
|
||||
### Implementation effort
|
||||
- Add SSM agent to EC2 user-data script
|
||||
- Add `POST /cp/admin/rolling-upgrade` handler
|
||||
- Add upgrade step to publish workflow
|
||||
- Add rollback logic
|
||||
- ~2-3 days
|
||||
|
||||
### When to use
|
||||
- Urgent security patches that can't wait 5 min
|
||||
- Breaking changes that need coordinated rollout
|
||||
- When you want canary/staged deployment
|
||||
|
||||
---
|
||||
|
||||
## Option B: Sidecar auto-updater (implemented)
|
||||
|
||||
A cron job on each EC2 checks GHCR for a new image digest every 5 minutes.
|
||||
If the digest changed, it pulls the new image and restarts the container.
|
||||
|
||||
```bash
|
||||
# Runs every 5 min on each EC2 (added to user-data)
|
||||
*/5 * * * * /usr/local/bin/molecule-auto-update.sh
|
||||
```
|
||||
|
||||
The update script:
|
||||
1. `docker pull platform-tenant:latest`
|
||||
2. Compare digest with running container's image digest
|
||||
3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...`
|
||||
4. Wait for `/health` 200
|
||||
5. Log result to `/var/log/molecule-auto-update.log`
|
||||
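Steps 1 and 2 of the update script amount to a pull followed by a comparison. The sketch below shows one way to express that check; it is not the contents of `molecule-auto-update.sh`. `update_needed` and the injected `run` callable are hypothetical, and the sketch compares local image IDs (via `docker inspect`) rather than registry digests.

```python
import subprocess
from typing import Callable, List

Run = Callable[[List[str]], str]


def _run(argv: List[str]) -> str:
    """Run a command and return its trimmed stdout; raises if it exits non-zero."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout.strip()


def update_needed(
    run: Run = _run,
    image: str = "platform-tenant:latest",
    container: str = "molecule-tenant",
) -> bool:
    """Pull the image, then report whether the container runs an older one."""
    run(["docker", "pull", image])
    pulled_id = run(["docker", "image", "inspect", "--format", "{{.Id}}", image])
    running_id = run(["docker", "inspect", "--format", "{{.Image}}", container])
    return pulled_id != running_id
```

When `update_needed()` returns `True`, the script would continue with the stop, remove, and run sequence from step 3.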
|
||||
### Pros
|
||||
- Zero CP involvement — fully autonomous per tenant
|
||||
- Tenants upgrade within 5 min of any publish
|
||||
- No SSH/SSM infrastructure needed
|
||||
- Each tenant upgrades independently (natural canary)
|
||||
- Simple to implement (2 lines in user-data + a small script)
|
||||
|
||||
### Cons
|
||||
- Up to 5 min delay between publish and tenant upgrade
|
||||
- Brief downtime during restart (~10-30s)
|
||||
- No centralized visibility into upgrade status
|
||||
- Can't selectively hold back specific tenants
|
||||
- All tenants track `latest` — no pinned versions
|
||||
|
||||
### When to use
|
||||
- Default for all tenants
|
||||
- Works well for early-stage SaaS with frequent deploys
|
||||
|
||||
---
|
||||
|
||||
## Option C: Blue-green via Worker (zero downtime)
|
||||
|
||||
Each EC2 runs two container slots: `blue` (current) and `green` (new).
|
||||
The Cloudflare Worker routes traffic to whichever is healthy.
|
||||
|
||||
```
|
||||
EC2 instance:
|
||||
molecule-tenant-blue → :8080 (current, serving traffic)
|
||||
molecule-tenant-green → :8081 (new, starting up)
|
||||
|
||||
Upgrade flow:
|
||||
1. Pull new image
|
||||
2. Start green on :8081
|
||||
3. Health check green: GET :8081/health
|
||||
4. If healthy: update Worker routing (KV: slug → port 8081)
|
||||
5. Stop blue
|
||||
6. Next upgrade: blue becomes the new slot
|
||||
|
||||
Worker routing:
|
||||
KV key: "example-org" → {"ip": "<EC2_IP>", "port": 8081}
|
||||
(port defaults to 8080 when not in KV)
|
||||
```
|
||||
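The slot bookkeeping and the port-aware lookup can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the KV value is assumed to be the JSON object shown above, and the real Worker would be written for the Cloudflare runtime rather than in Python.

```python
import json
from typing import Tuple

SLOT_PORTS = {"blue": 8080, "green": 8081}
DEFAULT_PORT = 8080  # used when the KV entry has no "port" field


def inactive_slot(active: str) -> str:
    """Return the slot that is not currently serving traffic."""
    return "green" if active == "blue" else "blue"


def resolve_upstream(kv_value: str) -> Tuple[str, int]:
    """Parse a KV entry into (ip, port), falling back to the default port."""
    entry = json.loads(kv_value)
    return entry["ip"], int(entry.get("port", DEFAULT_PORT))


def routing_after_swap(ip: str, active: str, new_slot_healthy: bool) -> Tuple[str, str]:
    """Return (active slot, KV value) after an upgrade attempt.

    Traffic moves to the other slot only if it passed its health check.
    """
    target = inactive_slot(active) if new_slot_healthy else active
    return target, json.dumps({"ip": ip, "port": SLOT_PORTS[target]})
```

Because the KV write is the only step that moves traffic, rolling back is the same call with the slots reversed.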
|
||||
### Pros
|
||||
- Zero downtime — traffic switches atomically after health check
|
||||
- Instant rollback — just switch back to the old slot
|
||||
- Worker already exists — just add port to the routing lookup
|
||||
- Health-verified before any traffic switches
|
||||
|
||||
### Cons
|
||||
- Double memory usage during transition (~512MB extra per tenant)
|
||||
- More complex user-data script (manage two containers)
|
||||
- Worker needs port-aware routing (KV schema change)
|
||||
- Need to track which slot is active per tenant
|
||||
|
||||
### Implementation effort
|
||||
- Update user-data to manage blue/green containers
|
||||
- Update Worker to read port from KV
|
||||
- Add blue/green state tracking to CP (org_instances.active_slot)
|
||||
- Update auto-updater script for blue-green swap
|
||||
- ~3-5 days
|
||||
|
||||
### When to use
|
||||
- When tenants have SLAs requiring zero downtime
|
||||
- Production deployments with paying customers
|
||||
- After Option B proves the auto-update pattern works
|
||||
|
||||
---
|
||||
|
||||
## Migration path
|
||||
|
||||
```
|
||||
Now: Option B (auto-updater, 5 min delay, brief downtime)
|
||||
↓
|
||||
Growth: Option A (add SSM for urgent patches, keep B as default)
|
||||
↓
|
||||
Scale: Option C (zero-downtime for premium/enterprise tenants)
|
||||
```
|
||||
@@ -1,592 +0,0 @@
|
||||
---
|
||||
title: "Incident Log — molecule-core"
|
||||
description: "Chronological incident log for molecule-core — summaries, resolutions, and references."
|
||||
---
|
||||
# Incident Log — molecule-core
|
||||
|
||||
> This file documents security incidents, outages, and degraded states.
|
||||
> Active incidents are listed first. Resolved incidents remain for historical record.
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2026-04-21T07:45Z by Core Platform Lead — Incident log rebuilt after linter reset*
|
||||
|
||||
---
|
||||
|
||||
## Security Audit Cycle 6 — ALL CLEAR (2026-04-21 ~07:15Z)
|
||||
|
||||
**SHA range:** e69cb26 → 674384b on main (~5 commits + ~10 merged PRs)
|
||||
**Verdict:** ✅ No critical/high findings
|
||||
|
||||
### Commits Reviewed — All CLEAN
|
||||
|
||||
| Commit | Description |
|
||||
|--------|-------------|
|
||||
| `dc9c64e` / PR #1258 | F1097 org_id context — eliminates redundant 2nd SELECT in AdminAuth |
|
||||
| `33f1d1a` | Canvas cascade-delete UX — `pendingDelete.hasChildren`, warning dialog |
|
||||
| `0790d57` | Canvas metrics guard — null coalescing |
|
||||
| `781c217` | CI YAML fix |
|
||||
| `169120d` / PR #1310 | CWE-78/CWE-22 — exec form + path traversal guards |
|
||||
| `e431fc4` / PR #1302 | CWE-918 SSRF — `isSafeURL` in `a2a_proxy.go` |
|
||||
| `a66f889` / PR #1261 | CWE path-injection — `resolveInsideRoot` for template paths |
|
||||
|
||||
Full audit saved to TEAM memory id `abc58b47`.
|
||||
|
||||
---
|
||||
|
||||
## F1100 — workspace_restart.go Path Traversal (RESOLVED)
|
||||
|
||||
**Severity:** Medium | **Finding ID:** F1100
|
||||
**Status:** Resolved — fix applied via `a66f889` (PR #1261) on both main and staging
|
||||
|
||||
### Summary
|
||||
|
||||
`workspace_restart.go:127-133` accepted `body.Template` (attacker-controlled) via raw `filepath.Join(h.configsDir, template)`, allowing path traversal (e.g. `../../../etc`) to escape `configsDir`. **Issue #1043 triage missed this — legitimate gap, not false positive.**
|
||||
|
||||
Authenticated callers could pass a crafted `body.Template` value to escape the configs directory.
|
||||
|
||||
### Fix Applied
|
||||
|
||||
PR #1260 (intended) closed without merge. Fix landed via **PR #1261 (`a66f889`)** on both main and staging:
|
||||
|
||||
```go
|
||||
// Fixed (a66f889):
|
||||
candidatePath, resolveErr := resolveInsideRoot(h.configsDir, template)
|
||||
if resolveErr != nil {
|
||||
template = "" // fallback fires safely
|
||||
}
|
||||
```
|
||||
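For readers unfamiliar with the containment check, the idea behind `resolveInsideRoot` can be sketched as follows. This is a Python illustration of the general technique, not a port of the Go helper, whose exact behaviour is not shown in this log; `resolve_inside_root` is a hypothetical name.

```python
import os


def resolve_inside_root(root: str, user_path: str) -> str:
    """Resolve user_path under root, raising if the result escapes root."""
    root_real = os.path.realpath(root)
    # realpath collapses ".." segments and follows symlinks before the check.
    candidate = os.path.realpath(os.path.join(root_real, user_path))
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise ValueError(f"path escapes root: {user_path!r}")
    return candidate
```

The check runs on the resolved path, so both `../../../etc` and an absolute path such as `/etc/passwd` are rejected, while ordinary relative template names pass through.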
|
||||
### References
|
||||
|
||||
- PR #1260: closed without merge — superseded by PR #1261
|
||||
- PR #1261 (`a66f889`): merged ✅
|
||||
- Closes: #1043
|
||||
|
||||
---
|
||||
|
||||
## F1088 Credential Exposure — CLOSED
|
||||
|
||||
**All prior F1088 entries below remain valid. Summary of current state:**
|
||||
|
||||
- Credentials: MiniMax revoked (⚠️), GitHub PAT revoked (✅), Admin token — treat as potentially exposed
|
||||
- BFG git-history scrub: NOT REQUIRED — incident management closure, 0 public forks confirmed
|
||||
- Git history still contains values — admin token rotation recommended as precaution
|
||||
- PR #1179 (`b89f3fd`) merged — active code is clean
|
||||
- Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind main — superseded by incident management closure
|
||||
|
||||
**Required remaining action:** Rotate `ADMIN_TOKEN` (`HlgeMb8...ShARE=`) as precaution. All other actions complete.
|
||||
|
||||
---
|
||||
|
||||
### Summary
|
||||
|
||||
Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains three production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appeared in the git diff and were permanently visible in the public commit history.
|
||||
|
||||
### Credentials Status
|
||||
|
||||
| # | Credential | Value | Status |
|
||||
|---|------------|-------|--------|
|
||||
| 1 | ANTHROPIC_AUTH_TOKEN | `sk-cp-...KVw` | ⚠️ Revoked or inactive (404 on API call) |
|
||||
| 2 | GITHUB_TOKEN | `github_pat_...hsIJLIL` | ✅ Revoked (confirmed 401) |
|
||||
| 3 | ADMIN_TOKEN | `HlgeMb8...ShARE=` | Needs confirmation — treated as active until proven otherwise |
|
||||
|
||||
### Resolution
|
||||
|
||||
PR #1179 (`b89f3fd`: "ci: retry — trigger fresh runner allocation") closed this finding. The incident was closed at the finding-management level. Git history scrub via BFG was discussed but deemed not required by security team (no active public forks confirmed, credentials were already revoked/inactive).
|
||||
|
||||
Active code is clean (`d513a0c` replaced hardcoded defaults with env-var reads).
|
||||
|
||||
### Summary
|
||||
|
||||
Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains three production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appear in the git diff and are permanently visible in the public commit history.
|
||||
|
||||
The commit itself fixed the problem by replacing hardcoded defaults with env-var reads (MINIMAX_API_KEY, GITHUB_PAT). However, git history still shows the original values.
|
||||
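The pattern the fix moved to, reading secrets from the environment with no built-in fallback, can be sketched like this. The helper name `require_env` is hypothetical and the affected script is a shell script, so this only illustrates the idea.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return a required secret from the environment, failing fast if absent."""
    value = environ.get(name, "")
    if not value:
        # No hardcoded default: a missing secret is an error, never a fallback.
        raise RuntimeError(f"{name} is not set; refusing to fall back to a default")
    return value
```

A caller would use `require_env("MINIMAX_API_KEY")` or `require_env("GITHUB_PAT")` at startup, so a misconfigured host fails immediately instead of running with a value baked into the repository.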
|
||||
### Credentials Exposed
|
||||
|
||||
> **Token values redacted from this table 2026-04-26** to reduce public-search surface (the docs repo is publicly indexed). Short-suffix references match the convention in the Blast Radius table below (lines 134-137). Full values remain in `molecule-core` git history per the F1088 closure decision (no BFG scrub).
|
||||
|
||||
| # | Credential | Value (short suffix) | Service |
|
||||
|---|------------|----------------------|---------|
|
||||
| 1 | ANTHROPIC_AUTH_TOKEN | `sk-cp-...KVw` | MiniMax API (api.minimax.io/anthropic) |
|
||||
| 2 | GITHUB_TOKEN | `github_pat_...hsIJLIL` | GitHub (fine-grained PAT, scope unknown) |
|
||||
| 3 | ADMIN_TOKEN | `HlgeMb8...ShARE=` | Platform admin authentication |
|
||||
|
||||
### Affected Files
|
||||
|
||||
- `scripts/post-rebuild-setup.sh` (commit d513a0c, PR #1098 → merged to staging → merged to main)
|
||||
|
||||
### Timeline
|
||||
|
||||
- **~2026-04-20T13:02Z**: Commit `d513a0c` pushed by `rabbitblood`. GitGuardian flagged credentials in the diff. Fix committed in same commit.
|
||||
- **~2026-04-20T**: Credentials removed from active code, but git history still contains them.
|
||||
- **2026-04-20T22:32Z**: Incident discovered and escalated.
|
||||
|
||||
### Actions Taken
|
||||
|
||||
1. Dev Lead notified (delegation failed — Dev Lead unreachable)
|
||||
2. All child workspaces notified (delegation failed — all unreachable)
|
||||
3. Incident documented in this file
|
||||
4. Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind `origin/main`
|
||||
5. **Incident CLOSED** — PR #1179 merged, finding management closure, BFG scrub deemed not required (no active public forks confirmed)
|
||||
|
||||
### Blast Radius (Confirmed by Core-Security)
|
||||
|
||||
| Credential | Test Result | Status |
|
||||
|------------|-------------|--------|
|
||||
| MiniMax API key (`sk-cp-...KVw`) | `404 Not Found` on real API call | ⚠️ **REVOKED** (or endpoint inactive) |
|
||||
| GitHub PAT (`github_pat_...hsIJLIL`) | `401 Bad credentials` | ✅ **REVOKED** |
|
||||
| Admin token (`HlgeMb8...ShARE=`) | Base64 — cannot test directly | ⚠️ **Treated as active** — recommend rotation as precaution |
|
||||
|
||||
**Public forks:** 0 confirmed (GH API `/forks` returns none) — low fork blast radius.
|
||||
|
||||
**Git history scope:** Credentials exist in both `main` and `staging` in commits `f787873`..`d513a0c`. They were introduced in `f787873` ("feat: nuke-and-rebuild.sh") and removed from active code in `d513a0c`. A BFG scrub, if one is ever performed, would need to cover both branches.
|
||||
|
||||
### Required Actions (RESOLVED)
|
||||
|
||||
- [x] Credentials revoked (MiniMax ⚠️, GitHub PAT ✅)
|
||||
- [x] BFG git history cleanup **NOT REQUIRED** — incident management closure, no active public forks, credentials confirmed revoked/inactive
|
||||
- [x] Team notification — documented in this log
|
||||
- [ ] **Admin token rotation** — recommended as precaution (value still in git history, treat as potentially exposed)
|
||||
|
||||
### BFG Repo-Cleaner Procedure
|
||||
|
||||
**NOT REQUIRED** — F1088 closed without BFG scrub per security team decision. Retained for reference only.
|
||||
|
||||
**Step 1 — Create credentials manifest (`creds.txt`) [NOT NEEDED]:**
|
||||
```
|
||||
<ADMIN_TOKEN value>
|
||||
<MiniMax sk-cp-... value>
|
||||
<GitHub fine-grained PAT value>
|
||||
```
|
||||
Full token values redacted from this doc 2026-04-26 (see note in the
|
||||
Credentials Exposed table above). Pull from the Core-Security incident
|
||||
ticket if a future revival of this BFG procedure is needed.
|
||||
|
||||
**Step 2 — Clean origin/main:**
|
||||
```bash
|
||||
# A mirror clone carries every ref, so history is rewritten on all of them.
git clone --mirror https://git.moleculesai.app/molecule-ai/molecule-core /tmp/molecule-main-mirror
java -jar bfg.jar --replace-text creds.txt --no-blob-protection /tmp/molecule-main-mirror
cd /tmp/molecule-main-mirror
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --mirror
|
||||
```
|
||||
|
||||
**Step 3 — Clean origin/staging:**
|
||||
```bash
|
||||
# Same procedure against a second mirror clone.
git clone --mirror https://git.moleculesai.app/molecule-ai/molecule-core /tmp/molecule-staging-mirror
java -jar bfg.jar --replace-text creds.txt --no-blob-protection /tmp/molecule-staging-mirror
cd /tmp/molecule-staging-mirror
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --mirror
|
||||
```
|
||||
|
||||
**Step 4 — Notify team to re-clone both branches if cloned before ~13:02 UTC 2026-04-20.**
|
||||
|
||||
### References
|
||||
|
||||
- Commit: `d513a0ced549ef2be8903a7b4794256110ba1805`
|
||||
- PR: #1098 (staging → main merge)
|
||||
- Cleanup branch: `origin/fix/credential-history-cleanup-f1088` (behind main by 38 commits)
|
||||
- Scanners triggered: GitGuardian
|
||||
- Security investigation: Core-Security (confirmed credentials revoked via API tests)
|
||||
- GitHub issue: #1282 (filed by Core-OffSec)
|
||||
- **Closed by:** PR #1179 (`b89f3fd`) — incident management closure, BFG scrub deemed not required
|
||||
|
||||
### Known Issue — PR #1230 Incomplete (QA Round 16, 2026-04-21)
|
||||
|
||||
PR #1230 / commit `524e3c6` ("fix(security): replace err.Error() leaks") failed to carry the `mcp.go` fixes into main's tree. All 3 MCP error leaks, plus one in `org_plugin_allowlist.go`, remain on main:
|
||||
- `mcp.go:259`: "parse error: " + err.Error()
|
||||
- `mcp.go:347`: "invalid params: " + err.Error()
|
||||
- `mcp.go:352`: err.Error()
|
||||
- `org_plugin_allowlist.go:260`: "detail": err.Error()
|
||||
|
||||
Fix is covered by PR #1226 (rebased, MERGEABLE). Gap should close after #1226 merges.
|
||||
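The class of fix involved here, keeping internal error text out of responses, can be sketched as below. This is a generic Python illustration, not the Go change in PR #1226; `error_response` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("mcp")


def error_response(public_message: str, exc: Exception) -> dict:
    """Log the full exception server-side and return only a generic message."""
    # Internal detail (paths, SQL, upstream errors) goes to the log only.
    logger.error("%s: %s", public_message, exc)
    return {"error": public_message}
```

The caller still sees which stage failed ("parse error", "invalid params"), but nothing from the underlying exception reaches the client.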
|
||||
---
|
||||
|
||||
## CWE-918 SSRF — Backport to Main (RESOLVED)
|
||||
|
||||
**Severity:** High
|
||||
**Status:** Resolved — PR #1302 merged to main
|
||||
|
||||
### Summary
|
||||
|
||||
SSRF defence (`isSafeURL` in `a2a_proxy.go`) was backported to main to address CWE-918 (Server-Side Request Forgery). The fix prevents the A2A proxy from forwarding requests to internal network addresses (localhost, private ranges, etc.).
|
||||
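The kind of check `isSafeURL` performs can be sketched as follows. This is a simplified Python illustration, not the Go function in `a2a_proxy.go`: `is_safe_url` is a hypothetical name, and the sketch only inspects literal IP addresses and `localhost`. It does not resolve DNS names, which a production guard would also need to do.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_url(url: str) -> bool:
    """Reject URLs that point at loopback, private, or link-local addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    if host == "localhost" or host.endswith(".localhost"):
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A full implementation must also vet the
        # addresses this name resolves to before connecting.
        return True
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )
```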
|
||||
### References
|
||||
|
||||
- Commit: `e431fc4` (fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in a2a_proxy.go (#1292) (#1302))
|
||||
|
||||
---
|
||||
|
||||
## CWE-22 + CWE-78 Security Fixes — Merged (RESOLVED)
|
||||
|
||||
**Severity:** Critical
|
||||
**Status:** Resolved — proper fixes merged to staging and main
|
||||
|
||||
### Summary
|
||||
|
||||
The `fix/cwe78-delete-via-ephemeral-shell-injection` branch had the right diagnosis but the wrong implementation: it removed the `safeName` guard from `copyFilesToContainer`. The correct fixes were merged separately:
|
||||
|
||||
| Location | Commit | Fix |
|
||||
|----------|--------|-----|
|
||||
| staging | `ce2491e` | CWE-22: `copyFilesToContainer` safeName + `deleteViaEphemeral` validateRelPath + exec form |
|
||||
| main | `169120d` | CWE-78/CWE-22: block shell injection in `deleteViaEphemeral` |
|
||||
|
||||
Both CWEs are fully resolved on both branches. The regression branch is superseded and must not be merged as-is.
|
||||
|
||||
### Verification (staging `ce2491e`)
|
||||
|
||||
`copyFilesToContainer` (container_files.go:73-99):
|
||||
```go
|
||||
clean := filepath.Clean(name)
|
||||
if filepath.IsAbs(clean) || strings.Contains(clean, "..") {
|
||||
return fmt.Errorf("path traversal blocked: %s", name)
|
||||
}
|
||||
safeName := filepath.Join(destPath, clean)
|
||||
header := &tar.Header{Name: safeName, ...} ✅
|
||||
```
|
||||
|
||||
`deleteViaEphemeral` (container_files.go:152-168):
|
||||
```go
|
||||
validateRelPath(filePath) ✅
|
||||
Cmd: []string{"rm", "-rf", "/configs", filePath} ✅ exec form, no shell interpolation
|
||||
```
|
||||
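The two guards verified above, a relative-path check and an exec-form command, can be illustrated together. This is a Python sketch of the technique, not the Go code: `validate_rel_path` and `delete_argv` are hypothetical names, and joining the validated path under the root before deletion is an assumption of this sketch.

```python
import posixpath
from typing import List


def validate_rel_path(path: str) -> str:
    """Return the normalised path, raising if it is absolute or climbs upward."""
    clean = posixpath.normpath(path)
    if (
        posixpath.isabs(clean)
        or clean in (".", "..")
        or clean.startswith("../")
    ):
        raise ValueError(f"path traversal blocked: {path!r}")
    return clean


def delete_argv(root: str, rel_path: str) -> List[str]:
    """Build an exec-form rm command; no shell ever parses the path."""
    clean = validate_rel_path(rel_path)
    # "--" ends option parsing, so a name starting with "-" is not read as a flag.
    return ["rm", "-rf", "--", posixpath.join(root, clean)]
```

Because the path travels as a single argv element, shell metacharacters such as `;` stay part of the file name instead of starting a new command.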
|
||||
---
|
||||
|
||||
## CI Runner Saturation: self-hosted macOS arm64 (RESOLVED)
|
||||
|
||||
**Severity:** High
|
||||
**Period:** ~2026-04-20T22:00Z – 2026-04-21T03:30Z
|
||||
**Finding IDs:** N/A (infra incident)
|
||||
**Status:** Resolved
|
||||
|
||||
### Summary
|
||||
|
||||
All self-hosted macOS arm64 runners saturated. 27 runs queued, 0 in-progress, 0 completed. Only cancellations processing. PRs #1053 and #1036 had zero CI runs.
|
||||
|
||||
### Root Causes (multiple)
|
||||
|
||||
1. `changes` job ran on `[self-hosted, macos, arm64]` despite having zero macOS dependencies (plain `git diff`) — wasted runner slots
|
||||
2. YAML corruption in `ci.yml` (JSON-escaped `\n` sequences from commits `12c52d4`/`5831b4e`) caused "workflow file issue" failures before any job could start
|
||||
3. `cancel-in-progress: false` at workflow level caused stale runs to queue instead of being cancelled
|
||||
4. Workflow-level concurrency not set — multiple in-flight runs queued on same ref
|
||||
|
||||
---
|
||||
|
||||
## CI Stall — molecule-core/staging (RESOLVED 2026-04-21 ~07:05Z)
|
||||
|
||||
**Severity:** High
|
||||
**Period:** ~2026-04-21T02:47Z – ~2026-04-21T07:00Z
|
||||
**Status:** Resolved — CI progressing normally, no config problems remain
|
||||
|
||||
### Resolution
|
||||
|
||||
All prior runner-saturation and YAML-corruption fixes were correct. The stall resolved naturally once stale queued runs drained. Current CI state (2026-04-21 ~07:07Z):
|
||||
|
||||
- Staging run #24708961892: **success** (SHA `5d32373`)
|
||||
- Staging run #24708976467: **success** (changes job, SHA `72d825f`)
|
||||
- Main run #24708984339: queued (normal — healthy queue, not stalled)
|
||||
- Runner agent healthy — no dead slots
|
||||
|
||||
### Root Causes (all resolved)
|
||||
|
||||
1. `changes` job on `[self-hosted, macos, arm64]` — fixed by moving to `ubuntu-latest` (`9601545`)
|
||||
2. YAML corruption in `ci.yml` — fixed by PR #1264 / `b61692c` ✅
|
||||
3. `cancel-in-progress: false` at workflow level — reverted to `true` on staging ✅
|
||||
4. `cancel-in-progress: false` on main — correct for single-runner env, aligned via PR #1248 ✅
|
||||
|
||||
### Staging CI Config (confirmed healthy)
|
||||
|
||||
- `ci.yml`: `cancel-in-progress: true`, `changes` job on `ubuntu-latest` ✅
|
||||
- `codeql.yml`: `cancel-in-progress: false` ✅
|
||||
- `e2e-api.yml`: `cancel-in-progress: false` ✅
|
||||
|
||||
### Infra Recommendations (for long-term stability)
|
||||
|
||||
1. Provision org-wide GitHub App installation token for CI automation (PATs rotate too frequently)
|
||||
2. Update remote URLs on controlplane and tenant-proxy repos
|
||||
3. Monitor runner agent health on mac mini — restart agent if future stalls recur
|
||||
|
||||
---
|
||||
|
||||
## PR #1242 YAML Corruption — RESOLVED (PR never merged)
|
||||
|
||||
**Severity:** Critical
|
||||
**Status:** Resolved — PR #1242 closed without merge, staging unaffected
|
||||
|
||||
### Summary
|
||||
|
||||
PR #1242 (`fix/ci-runner-queue-contention`) branch contained a YAML corruption in `ci.yml` — the `concurrency` block was replaced with a commit-SHA string literal:
|
||||
|
||||
```yaml
|
||||
e4a62e1 (ci: add workflow-level concurrency to ci.yml and codeql.yml)
|
||||
```
|
||||
|
||||
However, PR #1242 was **closed without merging**. Staging received `cancel-in-progress: true` via PR #1264 (commit `b61692c`) instead, which is the correct clean version.
|
||||
|
||||
### Current State (updated 2026-04-21 ~04:30Z)
|
||||
|
||||
- **main:** `cancel-in-progress: false` ✅ (from PR #1248 / `2ffd11c` or similar clean commit)
|
||||
- **staging:** `cancel-in-progress: true` (via `0b30465` tick restore after corruption)
|
||||
- **PR #1248** (`2ffd11c`): open, sets staging `cancel-in-progress: false` — aligns staging with main ✅
|
||||
- **Main has moved to `false`** — staging should follow to stay consistent
|
||||
|
||||
### PR #1248 — URGENT MERGE
|
||||
|
||||
PR #1248 (`fix/ci: restore corrupted ci.yml concurrency block`) by Dev Lead:
|
||||
- Fixes the corruption pattern (same as prior incident)
|
||||
- Sets `cancel-in-progress: false` — correct for single-runner environment
|
||||
- Aligns staging CI config with main (which already has `false`)
|
||||
- Must merge before any further CI runs on staging
|
||||
|
||||
### References
|
||||
|
||||
- PR: #1242 (`fix/ci-runner-queue-contention`) — closed, not merged
|
||||
- Staging corruption restored via: PR #1264 / `b61692c`
|
||||
- PR #1248 (`2ffd11c`): open, Dev Lead fix, `cancel-in-progress: false`
|
||||
- Main: `cancel-in-progress: false` ✅
|
||||
|
||||
---
|
||||
|
||||
## PR #1036 QA Audit (STALE)
|
||||
|
||||
**Severity:** Low
|
||||
**Date:** 2026-04-20 (QA audit performed)
|
||||
**Status:** Stale — CI infrastructure has been fixed since audit
|
||||
|
||||
### Summary
|
||||
|
||||
QA audit (2026-04-20) flagged CI as failing on PR #1036. However, CI was failing due to infrastructure issues (runner saturation, YAML corruption) that have since been resolved. The audit should be re-run now that staging CI is healthy.
|
||||
|
||||
---
|
||||
|
||||
## PR #1246 / #1247 — Sed Regression Fix — RESOLVED (PR #1247 merged)
|
||||
|
||||
**Severity:** Critical
|
||||
**Status:** Resolved — PR #1247 merged to main (2026-04-21 ~03:18Z)
|
||||
|
||||
### Summary
|
||||
|
||||
PR #1246 (`364712d`) was closed without merging. However, **PR #1247** (`04be218`) achieved the same fix cleanly and merged to main:
|
||||
|
||||
```
fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247)
```
|
||||
|
||||
Commit `04be218` (merged by molecule-ai[bot]) applied:
|
||||
```
sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g'
```
|
||||
|
||||
### Affected Files (all fixed on main)
|
||||
|
||||
- `workspace-server/cmd/server/cp_config.go`
|
||||
- `workspace-server/internal/handlers/a2a_proxy.go`
|
||||
- `workspace-server/internal/handlers/github_token.go`
|
||||
- `workspace-server/internal/handlers/traces.go`
|
||||
- `workspace-server/internal/handlers/transcript.go`
|
||||
- `workspace-server/internal/middleware/session_auth.go`
|
||||
- `workspace-server/internal/provisioner/cp_provisioner.go` (3 occurrences)
|
||||
|
||||
**Staging:** Fix present via prior commits. `cp_config.go` on staging has SHA `d1021c2` (correct form).
|
||||
|
||||
**PR #1246:** Closed without merging — superseded by PR #1247. No further action needed.
|
||||
|
||||
---
|
||||
|
||||
## CWE-78/CWE-22 Branch — RESOLVED (proper fixes merged separately)
|
||||
|
||||
**Severity:** Critical
|
||||
**Status:** Resolved — proper fixes merged via `ce2491e` (staging) and `169120d` (main)
|
||||
|
||||
### Summary
|
||||
|
||||
The `fix/cwe78-delete-via-ephemeral-shell-injection` branch (commit `17419dd`) was **correct** for CWE-78 (`deleteViaEphemeral` exec form + `validateRelPath`) but **regressed** `copyFilesToContainer` by removing the `safeName` path-traversal guard.
|
||||
|
||||
**Resolution — both branches merged to main and staging:**
|
||||
|
||||
| Branch | Commit | Status |
|--------|--------|--------|
| staging | `ce2491e` — fix(security): CWE-22 in copyFilesToContainer and deleteViaEphemeral | ✅ merged |
| main | `169120d` — fix(security): CWE-78/CWE-22 — block shell injection in deleteViaEphemeral | ✅ merged |
|
||||
|
||||
### What was fixed (staging `ce2491e`)
|
||||
|
||||
- `copyFilesToContainer`: `filepath.Clean` + `IsAbs` + `strings.Contains("..")` validation, `safeName` in tar header ✅
|
||||
- `deleteViaEphemeral`: `validateRelPath(filePath)` check before rm command ✅
|
||||
- Both CWE-22 and CWE-78 addressed correctly
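
As an illustration only, the three checks listed for `copyFilesToContainer` can be mirrored in a few lines of Python. The real code is Go; `validate_rel_path` below is a hypothetical stand-in and not a port of `validateRelPath`. It checks path components for `..` rather than doing a substring match, which is an assumption of this sketch.

```python
import posixpath


def validate_rel_path(name: str) -> str:
    """Return a cleaned relative path, or raise ValueError if it could escape the root."""
    if not name or "\x00" in name:
        raise ValueError("empty path or NUL byte")
    if posixpath.isabs(name):
        raise ValueError("absolute paths are not allowed")
    cleaned = posixpath.normpath(name)
    # After normalisation, any surviving ".." component points above the root.
    # "." is rejected too because it names the root itself, not a file in it.
    if cleaned == "." or ".." in cleaned.split("/"):
        raise ValueError("path escapes the destination directory")
    return cleaned
```

The cleaned value, not the caller-supplied one, is what would go into a tar header in this sketch, so a later consumer never sees the raw input.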
|
||||
|
||||
### `fix/cwe78-delete-via-ephemeral-shell-injection` branch status
|
||||
|
||||
**Do NOT merge** — it's now superseded by `ce2491e`/`169120d`. The regression it introduced (removing `safeName` from `copyFilesToContainer`) was never the right approach. If this branch is revived, it must be rebased on top of `ce2491e` to preserve existing CWE-22 protections while adding the CWE-78 exec-form fix.
|
||||
|
||||
---
|
||||
|
||||
## F1085 Regression Branch (`fix/f1085-regression-1283`) — IS a Regression
|
||||
|
||||
**Severity:** High
|
||||
**Status:** Active — branch removes the confirmed-good F1085 fix (confirmed 2026-04-21 ~07:10Z)
|
||||
|
||||
### Summary
|
||||
|
||||
Branch `origin/fix/f1085-regression-1283` (commit `3b244e6`) removes `redactSecrets(workspaceID, content)` from `seedInitialMemories` in `workspace_provision.go:249`:
|
||||
|
||||
```diff
-`, workspaceID, redactSecrets(workspaceID, content), scope, awarenessNamespace); err != nil {
+`, workspaceID, content, scope, awarenessNamespace); err != nil {
```
|
||||
|
||||
**Staging still has the correct fix** (`workspace_provision.go:253` on origin/staging confirms `redactSecrets` is present). This branch is behind staging and would regress it if merged.
|
||||
|
||||
### Required Fix
|
||||
|
||||
Close or revert this branch. `redactSecrets` must remain in `seedInitialMemories`. If there is a legitimate reason to change this (e.g., a different redaction strategy), document it clearly in the PR before merging.
|
||||
|
||||
---
|
||||
|
||||
## F1097 — org_id Context Fix — RESOLVED
|
||||
|
||||
**Severity:** Medium
|
||||
**Status:** Resolved — PR #1258 merged to main (`dc9c64e`)
|
||||
|
||||
### Summary
|
||||
|
||||
`orgToken.Validate` was refactored to return `org_id` directly, eliminating the redundant second SELECT in `AdminAuth`. All SQL is parameterized correctly.
|
||||
|
||||
### References
|
||||
|
||||
- PR #1258 (`dc9c64e`): fix(F1097): set org_id in Gin context for org-token callers
|
||||
|
||||
---
|
||||
|
||||
## PR #1226 — err.Error() Leaks (STALE — closed without merge)
|
||||
|
||||
**Severity:** Medium
|
||||
**Status:** Open — PR closed without merging, leaks still present on main
|
||||
|
||||
### Summary
|
||||
|
||||
PR #1226 (`fix(security): sanitize remaining err.Error() leaks + errcheck artifacts/client.go`) was **closed without merging**. The following leaks remain on main:
|
||||
|
||||
| File | Line | Code | Fix |
|------|------|------|-----|
| `mcp.go` | 259 | `"parse error: " + err.Error()` | → `"parse error: invalid JSON request body"` |
| `mcp.go` | 347 | `"invalid params: " + err.Error()` | → `"invalid params: malformed JSON"` |
| `mcp.go` | 352 | `err.Error()` | → `"dispatch error"` |
| `org_plugin_allowlist.go` | 260 | `"detail": err.Error()` | → `"detail": "plugin name validation failed"` |
| `admin_memories.go` | 99 | `"invalid JSON: " + err.Error()` | → `"invalid JSON request body"` |
|
||||
|
||||
**Already fixed:** `artifacts/client.go:175` — `defer func() { _ = resp.Body.Close() }()` confirmed correct (via PR #1247).
|
||||
|
||||
### Action Required
|
||||
|
||||
Reopen PR #1226 and fast-track merge. Alternatively, cherry-pick the 4 commits from that PR onto a fresh branch.
|
||||
|
||||
---
|
||||
|
||||
## QA Round 18 — orgs-page Test Regression (FIXED on main, pending staging port)
|
||||
|
||||
**Severity:** Medium
|
||||
**SHA tested:** `ce33da5` (PR #1257 branch merge with staging)
|
||||
**Status:** Regression identified in PR #1255, fixed on main, not yet on staging
|
||||
|
||||
### Findings
|
||||
|
||||
| Finding | Status |
|---------|--------|
| Canvas tests: 53 passed, **1 FAILED** | orgs-page.test.tsx line 133 — `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` |
| PR #1257 conflict | MERGEABLE, approved — closed without merge; fix is on main/staging via `a66f889` |
| PR #1255 regression | Introduced orgs-page test flakiness — +18/-2 in orgs-page.test.tsx |
|
||||
|
||||
### orgs-page Test Regression — Root Cause
|
||||
|
||||
PR #1255 (`e885fa1`) regressed the timer fix from PR #1235. It replaced `waitFor()` with `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` — causing microtask flush issues.
|
||||
|
||||
### Resolution
|
||||
|
||||
**Main:** Fixed in `674384b` (PR #1313) — wraps all 10 affected `vi.advanceTimersByTimeAsync(50)` calls in `act(async () => { ... })`. All 813 canvas tests pass on main.
|
||||
**Staging:** Regression NOT yet fixed — `origin/staging` is 13 commits behind main.
|
||||
|
||||
### Action needed
|
||||
|
||||
Cherry-pick or port the orgs-page test fix from `674384b` to staging.
|
||||
|
||||
---
|
||||
|
||||
## Issue #1124 — Orchestrator GET /workspaces 404: Env Var Misconfiguration (OPEN)
|
||||
|
||||
**Severity:** Medium
|
||||
**Status:** Active — root cause confirmed, fix pending, delegated to Core-BE
|
||||
|
||||
### Summary
|
||||
|
||||
The orchestrator (the workspace agent under the `workspace/` directory) receives a 404 from `GET /workspaces/{WORKSPACE_ID}` when the `WORKSPACE_ID` env var is missing or empty. Confirmed via code review (2026-04-21 ~07:10Z).
|
||||
|
||||
### Root Causes
|
||||
|
||||
**Platform-side (provisioner.go:375-377) is CORRECT:**
|
||||
```go
env := []string{
	fmt.Sprintf("WORKSPACE_ID=%s", cfg.WorkspaceID), // ✅ correctly injected
	"WORKSPACE_CONFIG_PATH=/configs",
	fmt.Sprintf("PLATFORM_URL=%s", cfg.PlatformURL),
}
```
|
||||
The platform injects `WORKSPACE_ID` at container provision time. **The bug is in the Python orchestrator modules** that default to empty string instead of validating the injected value.
|
||||
|
||||
**Buggy Python module-level defaults (empty string → broken API calls):**
|
||||
| File | Line | Code |
|------|------|------|
| `workspace/a2a_cli.py` | 24 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/a2a_client.py` | 17 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/coordinator.py` | 26 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/consolidation.py` | 22 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
| `workspace/molecule_ai_status.py` | 25 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` |
|
||||
|
||||
When `WORKSPACE_ID` is empty, API calls produce URLs like `/workspaces//heartbeat` or `/registry/discover/` — platform returns 404 or wrong routing.
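
To make the failure mode concrete, here is a minimal sketch of how an empty ID collapses the path. The `heartbeat_url` helper and the `ws-123` ID are hypothetical and only mirror the URL shape quoted above; the affected modules may assemble their URLs differently.

```python
def heartbeat_url(platform_url: str, workspace_id: str) -> str:
    """Build a heartbeat URL with plain string formatting (hypothetical helper)."""
    return f"{platform_url}/workspaces/{workspace_id}/heartbeat"


# With an ID injected by the provisioner, the path is well-formed.
good = heartbeat_url("http://platform:8080", "ws-123")

# With the empty-string default, the ID segment vanishes and leaves "//".
bad = heartbeat_url("http://platform:8080", "")

print(good)
print(bad)
```

Nothing in the formatting step fails, which is why the misconfiguration only shows up later as a 404 from the platform.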
|
||||
|
||||
**Note — main.py is already correct:**
|
||||
```python
workspace_id = os.environ.get("WORKSPACE_ID", "workspace-default")  # main.py:55 ✅
```
|
||||
However, `main.py` uses a local variable — it doesn't export `WORKSPACE_ID` as a module constant. The other modules that import `WORKSPACE_ID` from `a2a_client` etc. still get the empty-string default.
|
||||
|
||||
### Fix Required (Quick Win for Core-BE)
|
||||
|
||||
**Option A — Fail fast at module import (recommended):**
|
||||
```python
import os

WORKSPACE_ID = os.environ.get("WORKSPACE_ID")
if not WORKSPACE_ID:
    raise RuntimeError("WORKSPACE_ID environment variable is required but not set")
```
|
||||
Apply to all 5 affected modules. This surfaces the misconfiguration immediately instead of producing silent 404s downstream.
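
Since the same check has to land in five modules, one way to avoid copy-paste drift is a small shared helper. This is a sketch under assumptions: `require_env` is a hypothetical name, where it would live is left open, and treating whitespace-only values as unset goes slightly beyond what Option A specifies.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of `name`, raising if it is unset, empty, or whitespace."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} environment variable is required but not set")
    return value


# In each affected module the constant would then become:
#     WORKSPACE_ID = require_env("WORKSPACE_ID")
```

The optional `env` mapping exists only so the helper can be unit-tested without mutating the process environment.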
|
||||
|
||||
**Option B — Align with main.py's approach (more lenient, not recommended):**
|
||||
```python
WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "workspace-default")
```
|
||||
This keeps every module importable, but it masks real misconfigurations. Option A is better.
|
||||
|
||||
### Modules Requiring Fix
|
||||
|
||||
- `workspace/a2a_cli.py` — line 24
|
||||
- `workspace/a2a_client.py` — line 17
|
||||
- `workspace/coordinator.py` — line 26
|
||||
- `workspace/consolidation.py` — line 22
|
||||
- `workspace/molecule_ai_status.py` — line 25
|
||||
|
||||
### PLATFORM_URL Note
|
||||
|
||||
All modules default to `http://platform:8080` (container mesh hostname). This is correct for in-container use but fails outside Docker. No action needed for in-container orchestrators — the platform injects `PLATFORM_URL` at provision time which overrides this default.
|
||||
|
||||
### Owner
|
||||
|
||||
Core-BE — delegated to Dev Lead (A2A failed). Core-BE sub-team: please pick up.
|
||||
|
||||
### Fix PR
|
||||
|
||||
[PR #1336](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1336) filed — `fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty`. Targets staging. Labels: bug, needs-work, area:backend-engineer, area:dev-lead.
|
||||
|
||||
---
|
||||
|
||||
*Last updated: 2026-04-21T07:10Z by Core Platform Lead (post-restart session — all findings re-verified)*
|
||||
@@ -1,214 +0,0 @@
|
||||
---
|
||||
title: "a2a-sdk v0 → v1 migration"
|
||||
description: "Cheat sheet for migrating workspace runtime code (and forks) from a2a-sdk 0.3.x to 1.x — renamed/removed symbols, common error shapes, before/after diffs."
|
||||
---
|
||||
|
||||
import { Callout } from 'fumadocs-ui/components/callout';
|
||||
|
||||
The `a2a-sdk` Python package released v1.0 in late April 2026. The
Molecule workspace runtime migrated under tracking ID **KI-009** and
shipped in `molecule-ai-workspace-runtime` **v0.1.11** (commit
`d5cf872`, PR #39). The platform now runs exclusively on v1.
|
||||
|
||||
If you're consuming the platform's published wheel, bumping
`molecule-ai-workspace-runtime>=0.1.11` handles the migration for
you. If you maintain a fork of the runtime, an external agent talking
A2A directly, or your own adapter that imports from `a2a.*`, this page
is your checklist.
|
||||
|
||||
## Why migrate
|
||||
|
||||
- **Upstream**: `a2a-sdk` 1.0 reorganised the import surface, flattened
  `Part`, removed deprecated capability flags, and replaced the
  `A2AStarletteApplication` wrapper with explicit Starlette route
  factories.
- **Platform**: as of 2026-04-24 the platform sends/receives via v1
  shapes natively. The SDK ships a v0_3 compat layer (enabled in the
  runtime via `enable_v0_3_compat=True` on `create_jsonrpc_routes`) so
  in-flight 0.x callers don't break, but new code should target v1.
- **Forks/external runtimes**: v0 code throws on `import a2a.utils`
  and `from a2a.server.apps import A2AStarletteApplication` once you
  install v1, so the migration is a hard cutover at install time, not
  a soft deprecation.
|
||||
|
||||
## Cheat sheet — renamed and removed symbols
|
||||
|
||||
Four breaking changes hit the Molecule runtime during KI-009. All four
are confirmed against the `molecule-core/workspace/` source.
|
||||
|
||||
### 1. `new_agent_text_message` renamed to `new_text_message`
|
||||
|
||||
- **v0 location**: `a2a.utils.new_agent_text_message`
|
||||
- **v1 location**: `a2a.helpers.new_text_message`
|
||||
|
||||
Both the module path and the symbol name changed.
|
||||
|
||||
### 2. `Part` API flattened — `TextPart` removed
|
||||
|
||||
- **v0**: `Part(root=TextPart(text="..."))` — `Part` wrapped a `root`
  union of `TextPart` / `FilePart` / `DataPart`.
- **v1**: `Part(text="...")` — `Part` accepts the text payload
  directly. `TextPart` no longer exists as a public symbol.

`FilePart` / `DataPart` are similarly flattened (`Part(file=...)`,
`Part(data=...)`); the Molecule runtime only emits text parts so the
file/data shapes weren't exercised in KI-009 and aren't covered by
this guide.
|
||||
|
||||
### 3. `A2AStarletteApplication` removed — use route factories
|
||||
|
||||
- **v0**: `from a2a.server.apps import A2AStarletteApplication` then
  `A2AStarletteApplication(agent_card, request_handler).build()`.
- **v1**: `from a2a.server.routes import create_agent_card_routes,
  create_jsonrpc_routes` then build a Starlette app from the returned
  route lists.

The factories also let you mount the JSON-RPC endpoint at any path
(the runtime mounts at `/` because the platform POSTs to root, see
`workspace/main.py:279`).
|
||||
|
||||
### 4. `state_transition_history` capability flag removed
|
||||
|
||||
- **v0**: `AgentCapabilities(streaming=..., push_notifications=...,
  state_transition_history=True)` was a per-agent opt-in.
- **v1**: the field is gone from `AgentCapabilities`. Per the SDK's own
  `a2a/compat/v0_3/conversions.py`: *"No longer supported in v1.0"*.
  The capability is now universal — `Task.history` is always available
  and `tasks/get` accepts `historyLength` via `apply_history_length()`.

If you pass `state_transition_history=...` as a kwarg to
`AgentCapabilities` under v1, Pydantic will reject it. Drop the kwarg.
See [`workspace/main.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/main.py)
for the explanatory comment that prevents future accidental re-adds.
|
||||
|
||||
## Common error shapes
|
||||
|
||||
When v0 code runs against the v1 SDK, the failure modes look like this:
|
||||
|
||||
| Error | Cause |
|---|---|
| `ModuleNotFoundError: No module named 'a2a.utils'` | v0 import path; module renamed to `a2a.helpers`. |
| `ImportError: cannot import name 'A2AStarletteApplication' from 'a2a.server.apps'` | The whole `a2a.server.apps` module is gone in v1. Switch to `a2a.server.routes` factories. |
| `ImportError: cannot import name 'TextPart' from 'a2a.types'` | Flattened `Part` API; use `Part(text=...)`. |
| `ValueError: Protocol message AgentCapabilities has no "state_transition_history" field` | Removed capability flag passed as kwarg; drop it. |
| `ValueError: Protocol message Part has no "root" field` | v0 `Part(root=TextPart(...))` shape against v1 schema; flatten to `Part(text=...)`. |
|
||||
|
||||
The protobuf-style `ValueError` messages always follow the pattern
`Protocol message <Type> has no "<field>" field` — that's the
fingerprint of "v0 shape against v1 schema." Treat it as a v0→v1 hint
even if the field name isn't on the cheat sheet above.
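
If you want logs or test failures to call out this fingerprint automatically, a small matcher is enough. The sketch below assumes only the message pattern quoted above; `v0_shape_hint` is a hypothetical helper and not part of `a2a-sdk`.

```python
import re
from typing import Optional, Tuple

# Matches the message shape: Protocol message <Type> has no "<field>" field
_FINGERPRINT = re.compile(r'Protocol message ([\w.]+) has no "(\w+)" field')


def v0_shape_hint(exc: BaseException) -> Optional[Tuple[str, str]]:
    """Return (type_name, field_name) if exc looks like a v0 shape hitting a v1 schema."""
    if not isinstance(exc, ValueError):
        return None
    match = _FINGERPRINT.search(str(exc))
    return (match.group(1), match.group(2)) if match else None
```

A caller could wrap message construction in `try/except ValueError` and append a "looks like a v0 shape" note to the log line whenever the helper returns a tuple.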
|
||||
|
||||
## Migration checklist
|
||||
|
||||
1. **Bump the dep** — `a2a-sdk[http-server]>=0.3.25` is the floor; remove
   any `<1.0` upper bound. The Molecule wheel uses
   `a2a-sdk[http-server]>=0.3.25` with no upper bound (see
   [`molecule-ai-workspace-runtime/pyproject.toml`](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime/src/branch/main/pyproject.toml)).
2. **Fix imports** — sweep the four renamed/removed symbols above. A
   safe grep is `grep -rn "from a2a\\|import a2a"` across your tree.
3. **Fix removed-field reads/writes** — search for
   `state_transition_history` usage and delete the kwarg/field access.
4. **Flatten `Part` constructors** — search for `Part(root=` and
   convert to `Part(text=...)` / `Part(file=...)` / `Part(data=...)`.
5. **Replace the app factory** — search for `A2AStarletteApplication`
   and rewrite the bootstrap using `create_agent_card_routes` +
   `create_jsonrpc_routes`. Pass `enable_v0_3_compat=True` to
   `create_jsonrpc_routes` if your peers may still be on v0.
6. **Re-run tests** — fixture-level mocks of `a2a.helpers` /
   `a2a.utils` need to mock both names so tests still pass during the
   rename rollout (see
   [`workspace/tests/conftest.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/tests/conftest.py)
   for the dual-name pattern).
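
Steps 2 to 5 are all searches, so they can be folded into one pass over the tree. The script below is a rough sketch and not a tool shipped with the runtime: the pattern list is taken from the checklist, and `scan_text` / `scan_tree` are hypothetical names.

```python
import re
from pathlib import Path
from typing import Iterator, Tuple

# Patterns taken from the checklist above; extend the dict for your own tree.
V0_PATTERNS = {
    "a2a.utils import path": re.compile(r"\ba2a\.utils\b"),
    "new_agent_text_message": re.compile(r"\bnew_agent_text_message\b"),
    "TextPart": re.compile(r"\bTextPart\b"),
    "Part(root=...)": re.compile(r"\bPart\(\s*root\s*="),
    "A2AStarletteApplication": re.compile(r"\bA2AStarletteApplication\b"),
    "state_transition_history": re.compile(r"\bstate_transition_history\b"),
}


def scan_text(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, label) for every v0 leftover found in text."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in V0_PATTERNS.items():
            if pattern.search(line):
                yield lineno, label


def scan_tree(root: str) -> Iterator[Tuple[Path, int, str]]:
    """Walk root and report v0 leftovers in every .py file."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, label in scan_text(text):
            yield path, lineno, label
```

Running `for hit in scan_tree("workspace"): print(*hit)` before and after the migration gives a quick check that nothing on the cheat sheet was missed. Matches inside comments or strings are reported too, which is usually what you want here.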
|
||||
|
||||
## Before / after diffs
|
||||
|
||||
### `new_agent_text_message` → `new_text_message`
|
||||
|
||||
```diff
-from a2a.utils import new_agent_text_message
+from a2a.helpers import new_text_message

 async def execute(self, context, event_queue):
-    await event_queue.enqueue_event(new_agent_text_message("hello"))
+    await event_queue.enqueue_event(new_text_message("hello"))
```
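
A fork that has to run against both SDK generations during the rollout could resolve the helper at import time instead of hard-coding one name. This is a sketch, not how the Molecule runtime does it: `resolve_first` and `load_text_message_factory` are hypothetical, and it assumes both helpers accept a single text argument, as the diff above suggests.

```python
import importlib
from typing import Any, Iterable, Tuple


def resolve_first(candidates: Iterable[Tuple[str, str]]) -> Any:
    """Return the first attribute that imports cleanly from (module, attribute) pairs."""
    errors = []
    for module_name, attr in candidates:
        try:
            return getattr(importlib.import_module(module_name), attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attr}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


def load_text_message_factory() -> Any:
    """Prefer the v1 helper and fall back to the v0 name (both from the diff above)."""
    return resolve_first([
        ("a2a.helpers", "new_text_message"),
        ("a2a.utils", "new_agent_text_message"),
    ])
```

Once the whole fleet is on v1, the shim can be deleted in favour of the plain `from a2a.helpers import new_text_message` import.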
|
||||
|
||||
### Flat `Part` API
|
||||
|
||||
```diff
-from a2a.types import Part, TextPart
+from a2a.types import Part

-msg_parts = [Part(root=TextPart(text=final_text))]
+msg_parts = [Part(text=final_text)]
```
|
||||
|
||||
### `AgentCapabilities` — drop `state_transition_history`
|
||||
|
||||
```diff
 capabilities=AgentCapabilities(
     streaming=config.a2a.streaming,
     push_notifications=config.a2a.push_notifications,
-    state_transition_history=True,
 ),
```
|
||||
|
||||
### `A2AStarletteApplication` → route factories
|
||||
|
||||
```diff
-from a2a.server.apps import A2AStarletteApplication
+from a2a.server.routes import create_agent_card_routes, create_jsonrpc_routes
+from starlette.applications import Starlette

-app = A2AStarletteApplication(
-    agent_card=agent_card,
-    http_handler=request_handler,
-).build()
+routes = []
+routes.extend(create_agent_card_routes(agent_card))
+routes.extend(create_jsonrpc_routes(
+    request_handler=request_handler,
+    rpc_url="/",
+    enable_v0_3_compat=True,
+))
+app = Starlette(routes=routes)
```
|
||||
|
||||
The `enable_v0_3_compat=True` flag on `create_jsonrpc_routes` is what
keeps in-flight v0 callers (peers that haven't migrated yet) from
breaking — it accepts the old method names and translates them. The
Molecule runtime ships with this flag on (see
[`workspace/main.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/main.py));
strip it once your entire fleet is on v1.
|
||||
|
||||
## For downstream consumers
|
||||
|
||||
- **Using the published wheel** (`pip install
  molecule-ai-workspace-runtime>=0.1.11`): the migration is in the
  wheel — no code changes needed in your adapter or workspace template
  beyond bumping the pin.
- **Running a fork of the runtime**: cherry-pick or rebase against
  commit `d5cf872` ("feat: migrate a2a-sdk 1.x (KI-009) (#39)") in
  `molecule-ai-workspace-runtime`. The diff is the canonical reference
  for what KI-009 actually changed.
- **Standalone external agent** (talking A2A without the wheel): apply
  the [Migration checklist](#migration-checklist) directly to your
  source. The four cheat-sheet items are the entire surface that
  changed for the typical agent role; only `Part` flattening and the
  `state_transition_history` removal affect on-the-wire shapes — the
  other two are import-only.
|
||||
|
||||
<Callout type="info">
The wheel keeps `enable_v0_3_compat=True` on `create_jsonrpc_routes`,
so a v0 peer can still hit a v1 wheel and vice versa during the
migration window. You don't need to coordinate a fleet-wide cutover —
migrate at your own pace.
</Callout>
|
||||
|
||||
## See also
|
||||
|
||||
- [`molecule-ai-workspace-runtime` v0.1.11 release](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime/releases/tag/v0.1.11) — first wheel containing KI-009
|
||||
- PR #39 (feat: migrate a2a-sdk 1.x / KI-009) — closed without merge; PR content is historical
|
||||
- PR #48 (feat(a2a): dual-compat for a2a-sdk 0.3.x and 1.x) — closed without merge; PR content is historical
|
||||
- [Bring Your Own Runtime (MCP)](/docs/runtime-mcp) — universal wheel install path
|
||||
- [External Agents](/docs/external-agents) — manual A2A path for non-MCP runtimes
|
||||
@@ -1,69 +0,0 @@
|
||||
---
|
||||
title: "Cognee Architecture Deep-Dive — Workspace Isolation"
|
||||
description: "Deep-dive into Cognee's isolation primitives versus Molecule AI's per-workspace memory requirements."
|
||||
---
|
||||
# Cognee Architecture Deep-Dive — Workspace Isolation
|
||||
|
||||
**Date:** 2026-04-20
|
||||
**Issue:** Molecule-AI/molecule-core#1146
|
||||
**Research by:** Research Lead
|
||||
**Status:** Complete
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Cognee has **dataset-level isolation primitives** but **no storage-layer enforcement** and **no native `workspace_id` support** in its MCP tool interface. Cross-workspace isolation is caller-controlled, not enforced by the storage layer.
|
||||
|
||||
---
|
||||
|
||||
## Isolation Layer Analysis
|
||||
|
||||
| Layer | Mechanism | Enforced? | Risk |
|-------|-----------|-----------|------|
| Storage (Postgres) | No RLS, no schema namespacing | ❌ None | High |
| App — dataset | `dataset_name` passed per tool call | ⚠️ Caller-controlled | Medium |
| App — user | `get_default_user()` internal resolver only | ⚠️ Soft | Medium |
| MCP `workspace_id` param | Not present in cognee-mcp interface | ❌ N/A | High |
|
||||
|
||||
---
|
||||
|
||||
## Key Findings
|
||||
|
||||
1. **Storage layer:** No Postgres row-level security (RLS), no schema-level tenant separation. Any admin with DB access can read any tenant's data.
|
||||
|
||||
2. **Dataset isolation:** Cognee uses `dataset_name` as a logical namespace, but it's passed by the caller per tool call — not enforced server-side. A misconfigured or malicious caller could read/write across datasets.
|
||||
|
||||
3. **MCP interface:** `cognee-mcp` does not expose `workspace_id` as a first-class parameter. Workspaces would need to be mapped to dataset names externally.
|
||||
|
||||
4. **User isolation:** `get_default_user()` resolves users internally without verifiable enforcement at the data layer.
|
||||
|
||||
---
|
||||
|
||||
## Migration Implications
|
||||
|
||||
Adopting Cognee as the memory substrate requires an **auth bridge**:
|
||||
|
||||
- The bridge wraps cognee-mcp and injects `workspace_id` → `dataset_name` mapping
|
||||
- All tool calls are routed through the bridge, which enforces tenant context
|
||||
- Estimated effort: **~100–200 LOC** for the MCP proxy wrapper
|
||||
- This is a pragmatic path — the bridge provides the isolation Cognee's storage layer lacks
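
The core of such a bridge is small. The sketch below shows only the mapping and rejection logic, under stated assumptions: `WorkspaceBridge`, the `forward` callable, and the example tool name are hypothetical, and nothing here reflects the actual cognee-mcp tool signatures beyond the `dataset_name` argument noted above.

```python
from typing import Any, Callable, Dict, Mapping


class WorkspaceBridge:
    """Pins every forwarded tool call to the dataset owned by one workspace."""

    def __init__(
        self,
        forward: Callable[[str, Dict[str, Any]], Any],
        datasets: Mapping[str, str],
    ) -> None:
        self._forward = forward    # hypothetical: sends the call on to cognee-mcp
        self._datasets = datasets  # workspace_id -> dataset_name

    def call(self, workspace_id: str, tool: str, args: Dict[str, Any]) -> Any:
        dataset = self._datasets.get(workspace_id)
        if dataset is None:
            raise PermissionError(f"unknown workspace: {workspace_id!r}")
        requested = args.get("dataset_name")
        if requested is not None and requested != dataset:
            raise PermissionError("cross-workspace dataset access rejected")
        # Overwrite rather than trust the caller-supplied value.
        return self._forward(tool, {**args, "dataset_name": dataset})
```

In this sketch the `workspace_id` has to come from an authenticated context, such as the platform's own token validation, and never from the tool arguments. Otherwise the bridge is as caller-controlled as the layer it wraps.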
|
||||
|
||||
---
|
||||
|
||||
## Recommendation
|
||||
|
||||
**Attempt the auth bridge prototype first (1–2 days of engineering):**
1. Build an MCP proxy that maps `workspace_id` to `dataset_name` on each call
2. Validate that cross-workspace calls are correctly rejected
3. If the prototype is clean → adopt Cognee for Phase 9
4. If it proves complex → build natively with storage-layer enforcement
|
||||
|
||||
**Do not proceed with Phase 9 proprietary memory investment until bridge prototype is evaluated.**
|
||||
|
||||
---
|
||||
|
||||
## Sources
|
||||
|
||||
- Cognee GitHub: https://github.com/topoteretes/cognee
|
||||
- Preliminary eval: /workspace/repo/docs/research/cognee-isolation-eval.md
|
||||
@@ -1,41 +0,0 @@
|
||||
---
|
||||
title: "Cognee Workspace Isolation Evaluation"
|
||||
description: "Evaluating Cognee, an open-source AI memory engine, against Molecule AI's hierarchical memory isolation needs."
|
||||
---
|
||||
# Cognee Workspace Isolation Evaluation
|
||||
|
||||
**Date:** 2026-04-20
|
||||
**Issue:** Molecule-AI/molecule-core#1146
|
||||
**Status:** Preliminary — needs deeper architecture review
|
||||
|
||||
## Summary
|
||||
|
||||
Cognee (Apache-2.0, by Topoteretes UG) is an open-source AI memory engine with a shipped MCP component. It has direct overlap with Molecule AI's Phase 9 hierarchical memory architecture.
|
||||
|
||||
## Workspace Isolation Assessment
|
||||
|
||||
**Signal: Partial/Positive**
|
||||
|
||||
Cognee's GitHub README explicitly lists "agentic user/tenant isolation, traceability, OTEL collector, audit traits" as a core architectural feature.
|
||||
|
||||
This is a positive signal. However:
|
||||
- The README mention does not specify the technical mechanism (namespace-level separation? separate vector DB instances per tenant? row-level security in a shared DB?)
|
||||
- How the cognee-mcp MCP component handles multi-workspace contexts is not documented in the top-level README
|
||||
|
||||
**Verdict:** Cognee claims tenant isolation. Further due diligence required before treating this as confirmed.
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Deep-dive into cognee architecture docs** — check if isolation is enforced at the storage layer (separate DB/collection per workspace), application layer (row-level), or both
|
||||
2. **Test cognee-mcp with a multi-workspace scenario** — the MCP tool interface should reveal whether workspace_id is a first-class parameter
|
||||
3. **Check cognee's GitHub issues/discussions** — any community reports of cross-tenant data leakage?
|
||||
4. **Evaluate migration path** — if Cognee is adopted, what's involved in migrating existing Phase 9 work?
|
||||
|
||||
## Recommendation
|
||||
|
||||
Proceed with the Phase 9 build-vs-buy review. Cognee is a credible candidate: isolation is claimed, but the mechanism needs verification. The Phase 9 halt stands until this is resolved.
|
||||
|
||||
## Sources
|
||||
|
||||
- https://github.com/topoteretes/cognee (README, 2026-04-20)
|
||||
- /workspace/repo/research/cognee-memo.md
|
||||
@@ -239,7 +239,7 @@ This terminates all EC2 instances, drops the Neon branch, and removes the org re
|
||||
- **Scoped roles**: give different team members read-only vs admin access within a tenant org (roadmap: Phase 34)
|
||||
- **Usage-based billing**: Meter workspace runtime and forward events to Stripe for custom billing tiers
|
||||
|
||||
For runbook-level details on the provisioning flow, see the architecture docs at [`docs/architecture/saas-prod-migration-2026-04-19`](/docs/architecture/saas-prod-migration-2026-04-19).
|
||||
For the provisioning flow internals, see the [Provisioner](/docs/architecture/provisioner) and [Workspace Tiers](/docs/architecture/workspace-tiers) reference.
|
||||
|
||||
For the API reference, see [`docs/api-reference`](/docs/api-reference) — the `/cp/orgs/*` endpoints are documented there.
|
||||
|
||||
|
||||