diff --git a/content/docs/adapters/hermes-adapter-design.md b/content/docs/adapters/hermes-adapter-design.md deleted file mode 100644 index e85f6b5..0000000 --- a/content/docs/adapters/hermes-adapter-design.md +++ /dev/null @@ -1,96 +0,0 @@ ---- -title: "Hermes Adapter — Shell Design Spec" -description: "Design spec for the Hermes runtime adapter — the BaseAdapter shell, provider map, and integration points." ---- -# Hermes Adapter — Shell Design Spec - -**Perspective:** DevOps Engineer + Backend Engineer -**Status:** Draft — pre-implementation -**Hermes source:** `NousResearch/hermes-agent` (~61k ⭐) -**Adapter runtime key:** `hermes` - ---- - -## 1. Files Under `workspace/adapters/hermes/` - -| File | Purpose | -|------|---------| -| `Dockerfile` | Extends `workspace-template:base`; installs `hermes-agent` Python SDK and its deps via pip at image build time | -| `requirements.txt` | Python package list — at minimum `hermes-agent`; pin to a specific release tag for reproducibility | -| `adapter.py` | `HermesAdapter(BaseAdapter)` — implements `name()`, `display_name()`, `description()`, `get_config_schema()`, `setup()`, `create_executor()`; delegates to `_common_setup()` for plugins/skills/tools | -| `__init__.py` | Exports `Adapter = HermesAdapter` — required by the adapter autodiscovery loader in `workspace/adapters/__init__.py` | - -### `Dockerfile` sketch (no implementation — shape only) - -```dockerfile -FROM workspace-template:base -COPY adapters/hermes/requirements.txt /tmp/hermes-requirements.txt -RUN pip install --no-cache-dir -r /tmp/hermes-requirements.txt -``` - -### `adapter.py` shape - -```python -class HermesAdapter(BaseAdapter): - @staticmethod - def name() -> str: - return "hermes" - - async def setup(self, config: AdapterConfig) -> None: - # validate NOUS_API_KEY or OPENROUTER_API_KEY is set - # call self._common_setup(config) for plugins/skills/tools - ... 
- - async def create_executor(self, config: AdapterConfig) -> AgentExecutor: - # wrap Hermes SDK session as an A2A AgentExecutor - ... -``` - ---- - -## 2. Platform-Side Changes - -### `workspace-server/internal/provisioner/provisioner.go` — `RuntimeImages` map - -Add one entry to the existing map: - -```go -var RuntimeImages = map[string]string{ - // ... existing entries ... - "hermes": "workspace-template:hermes", // ← ADD THIS -} -``` - -No other platform Go changes are required for the minimal adapter shell. The `runtime` column in the `workspaces` table is a free-form string; no enum migration needed. - -### `workspace/build-all.sh` - -Add `hermes` to the adapter build loop so `build-all.sh` (and the `build-all.sh claude-code`-style single-runtime path) includes it: - -```bash -ADAPTERS=(langgraph claude_code openclaw autogen hermes codex google-adk) -``` - ---- - -## 3. Required Environment Variables - -| Name | Required | Description | -|------|----------|-------------| -| `NOUS_API_KEY` | Required (unless `OPENROUTER_API_KEY` set) | Nous Research Portal API key — primary model provider for Hermes; obtain from `nousresearch.com` | -| `OPENROUTER_API_KEY` | Optional | Fallback provider; lets operators use any Hermes-supported model via OpenRouter instead of Nous Portal | -| `HERMES_MODEL` | Optional | Model identifier (e.g. `nous-hermes-3`, `openrouter:anthropic/claude-sonnet-4-5`); adapter defaults to `nous-hermes-3` if unset | -| `HERMES_SKILLS_DIR` | Optional | Path inside the container where Hermes looks for skills; defaults to `/configs/skills` — consistent with the Claude Code and LangGraph adapters | - -**Note:** `NOUS_API_KEY` and `OPENROUTER_API_KEY` must be set as workspace secrets via `POST /workspaces/:id/secrets`, not baked into the image. At least one of the two must be present at container start; `setup()` should `raise RuntimeError` early with a clear message if both are absent. - ---- - -## 4. 
Smallest Viable Adapter — Scope Constraints - -This spec covers the **shell only** — the minimum to make a Hermes workspace provision, boot, and accept A2A messages: - -- No Hermes learning loop (skill self-improvement) in v1 — that requires persistent storage writes outside `/configs`; defer to a follow-up PR. -- No multi-messenger gateway integration — Hermes's Telegram/Discord/Slack channels are separate from Molecule AI's `/channels` feature; map these later via the channels adapter. -- No FTS5 memory backend — use Molecule AI's existing `commit_memory` / `search_memory` built-in tools for v1; Hermes-native memory can be layered in a subsequent PR. -- The executor wraps one Hermes agent session per workspace, matching the 1:1 workspace→agent model used by all other adapters. diff --git a/content/docs/adapters/hermes-adapter-plan.md b/content/docs/adapters/hermes-adapter-plan.md deleted file mode 100644 index 2c35964..0000000 --- a/content/docs/adapters/hermes-adapter-plan.md +++ /dev/null @@ -1,78 +0,0 @@ ---- -title: "Hermes Adapter — Implementation Plan" -description: "Implementation plan for the Hermes runtime adapter, from SDK import path to adapter.py build steps." ---- -# Hermes Adapter — Implementation Plan - -**Author:** Dev Lead -**Date:** 2026-04-13 -**Branch convention:** `feat/hermes-adapter-` for each PR below -**Target:** Ship a minimal but functional Hermes workspace adapter in 4 PRs, each ≤200 lines changed. - ---- - -## PR Sequence - -### PR 1 — Docker image shell - -**Title:** `feat(hermes): add workspace-template:hermes Docker image` - -**Files touched:** -- `workspace/adapters/hermes/Dockerfile` (new) -- `workspace/adapters/hermes/requirements.txt` (new) -- `workspace/adapters/hermes/__init__.py` (new) -- `workspace/build-all.sh` (1-line addition) - -**Description:** Adds the Hermes Docker image layer. `Dockerfile` extends `workspace-template:base` and installs `hermes-agent` (and declared deps) via pip at build time. 
`build-all.sh` gains `hermes` in the adapter list so `bash build-all.sh` and `bash build-all.sh hermes` both work. No Python adapter logic yet — just proves the image builds and that `import hermes` succeeds inside the container. CI: add `hermes` to the docker-build matrix. - ---- - -### PR 2 — Python adapter + A2A executor - -**Title:** `feat(hermes): implement HermesAdapter and A2A executor` - -**Files touched:** -- `workspace/adapters/hermes/adapter.py` (new, ~80 lines) -- `workspace/tests/test_adapters.py` (extend existing test file, ~30 lines) - -**Description:** Implements `HermesAdapter(BaseAdapter)` with `name()`, `display_name()`, `description()`, `get_config_schema()`, `setup()`, and `create_executor()`. `setup()` calls `_common_setup()` to load plugins/skills/tools identically to other adapters, then validates that `NOUS_API_KEY` or `OPENROUTER_API_KEY` is present and initialises a Hermes SDK session. `create_executor()` wraps the session as an `AgentExecutor`. Tests cover: adapter name/display_name contract, `setup()` raises `RuntimeError` when both API keys are absent, executor is returned after valid setup. - ---- - -### PR 3 — Platform RuntimeImages entry - -**Title:** `fix(provisioner): add hermes to RuntimeImages map` - -**Files touched:** -- `workspace-server/internal/provisioner/provisioner.go` (1-line addition) -- `workspace-server/internal/provisioner/provisioner_test.go` (1-line addition in RuntimeImages coverage test) - -**Description:** Adds `"hermes": "workspace-template:hermes"` to the `RuntimeImages` map. Without this entry the platform falls back to `workspace-template:langgraph` (wrong deps, agent fails to start). Test: extend the existing table-driven test that asserts every declared runtime resolves to a non-empty image tag. 
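The `setup()` contract described for PR 2 above (fail fast with `RuntimeError` when neither provider key is present) can be sketched in isolation. This is a minimal sketch, not the adapter's actual code: the helper name `require_provider_key` and the injectable `env` parameter are hypothetical, introduced here only so the check can be tested without touching the real environment.

```python
import os


def require_provider_key(env=None) -> str:
    """Return the name of the first provider key that is set.

    Hypothetical helper: `env` defaults to os.environ and may be replaced
    by a plain dict in tests. Nous Portal is checked first because the
    design spec names it as the primary provider.
    """
    env = os.environ if env is None else env
    for name in ("NOUS_API_KEY", "OPENROUTER_API_KEY"):
        if env.get(name):  # treats unset and empty string the same way
            return name
    raise RuntimeError(
        "Hermes adapter: set NOUS_API_KEY or OPENROUTER_API_KEY "
        "as a workspace secret before starting the workspace"
    )
```

Returning the key name rather than its value keeps the secret out of logs if the caller prints the result.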
- ---- - -### PR 4 — Integration docs + org template entry - -**Title:** `docs(hermes): adapter usage guide and org template example` - -**Files touched:** -- `docs/adapters/hermes-adapter-design.md` (update status from Draft → Implemented) -- `workspace-configs-templates/hermes/config.yaml` (new, ~20 lines — minimal config template) -- `org-templates/molecule-worker-gemini/org.yaml` or a new `molecule-hermes/` org template (optional, ~30 lines) - -**Description:** Marks the design doc as implemented, adds a `workspace-configs-templates/hermes/config.yaml` so operators can create a Hermes workspace from the UI template picker, and optionally adds a minimal org template showing a Hermes-runtime team. Documents the three env vars (`NOUS_API_KEY`, `OPENROUTER_API_KEY`, `HERMES_MODEL`) in the config template comments. - ---- - -## Sequencing Notes - -- PRs 1 and 2 can overlap in development but PR 2 must merge after PR 1 (image must exist before adapter tests run in CI). -- PR 3 is a single-line change and can merge any time after PR 1 lands. -- PR 4 has no code risk; it can be drafted alongside PR 2 and merged last. -- Total estimated diff: ~180 lines of new code across all 4 PRs; well within the ≤200 lines/PR budget. - -## Open Questions (resolve before PR 2) - -1. **Hermes SDK import path** — confirm the pip package name and the Python import path (`import hermes`? `from hermes_agent import ...`?). Check `NousResearch/hermes-agent` README before writing adapter.py. -2. **Session persistence** — Hermes has a learning loop that writes skill files. Decide at PR 2 time whether to mount `/workspace` as the Hermes skills root or suppress auto-write in v1. -3. **Model default** — confirm the correct model identifier string for Nous Portal (e.g. `nous-hermes-3-70b` vs `hermes-3`); hardcode a safe default in `get_config_schema()`. 
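Open question 1 could be answered empirically inside the PR 1 image before any adapter code is written. The sketch below only probes which top-level module names resolve. The candidate names are guesses taken from this plan and the recon notes (`hermes`, `hermes_agent`, `hermes_cli`); none of them is confirmed as the SDK's public import path, and the function name `find_importable` is hypothetical.

```python
import importlib.util

# Candidate module names are assumptions, not confirmed import paths.
CANDIDATES = ("hermes", "hermes_agent", "hermes_cli")


def find_importable(candidates=CANDIDATES):
    """Return the candidate top-level module names that resolve here.

    Uses find_spec, which locates a module without importing it, so the
    probe has no side effects from package import-time code.
    """
    found = []
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            found.append(name)
    return found


if __name__ == "__main__":
    print(find_importable())
```

Running this inside the built container lists whichever names are installed, which narrows the question before reading the upstream README.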
diff --git a/content/docs/adapters/hermes-recon.md b/content/docs/adapters/hermes-recon.md deleted file mode 100644 index 317acff..0000000 --- a/content/docs/adapters/hermes-recon.md +++ /dev/null @@ -1,264 +0,0 @@ ---- -title: "Hermes Agent — Adapter Reconnaissance" -description: "Reconnaissance of the NousResearch hermes-agent project as a candidate Molecule AI runtime adapter." ---- -# Hermes Agent — Adapter Reconnaissance - -Reconnaissance of [NousResearch/hermes-agent](https://github.com/NousResearch/hermes-agent) (v0.8.0, 68,713 ⭐, MIT) for potential Molecule AI adapter integration. - -> **Status:** Design-only recon — no implementation. - ---- - -## a) CLI Invocation - -**Install** (curl-to-bash, targets Linux/macOS/WSL2/Termux): - -```bash -curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash -``` - -The `hermes` binary in the repo root is a Python script (`#!/usr/bin/env python3`) that imports and calls `hermes_cli.main.main()`. After install it lands on `$PATH`. 
- -**Minimal interactive session:** - -```bash -hermes # launches TUI, auto-detects provider from env -hermes chat # explicit; same as bare `hermes` -hermes setup # one-time wizard: sets model, provider, API keys -``` - -**Key runtime flags:** - -```bash -# --resume continues the most recent session; --worktree gives git-worktree isolation per session; -# --profile loads an alternate HERMES_HOME profile. -hermes chat \ - --model anthropic/claude-opus-4.6 \ - --provider openrouter \ - --toolsets terminal,file,web \ - --max-turns 60 \ - --query "build me a FastAPI app" \ - --resume \ - --worktree \ - --profile myprofile -``` - -**One-shot (non-interactive):** - -```bash -hermes chat --query "summarise this repo" --quiet -``` - -**Gateway (messaging platforms) start:** - -```bash -hermes gateway start # daemonises; reads gateway config from config.yaml -hermes gateway status -hermes gateway stop -``` - -**OpenClaw migration:** - -```bash -hermes claw migrate --dry-run # preview; drop --dry-run to execute -``` - ---- - -## b) Config Format - -**Format:** YAML -**Primary path:** `~/.hermes/config.yaml` (default), overrideable via `HERMES_HOME` env var. -**Reference file in repo:** `cli-config.yaml.example` - -**Minimal working config** (provider = OpenRouter, local terminal backend): - -```yaml -# ~/.hermes/config.yaml - -model: - default: "anthropic/claude-opus-4.6" - provider: "openrouter" # required; "auto" if you want env-var detection - base_url: "https://openrouter.ai/api/v1" - -terminal: - backend: "local" # required; options: local | ssh | docker | singularity | modal | daytona - cwd: "." - timeout: 180 - lifetime_seconds: 300 - -memory: - memory_enabled: true - user_profile_enabled: true - memory_char_limit: 2200 - user_char_limit: 1375 - nudge_interval: 10 - -agent: - max_turns: 60 - reasoning_effort: "medium" # xhigh | high | medium | low | minimal | none -``` - -**Required fields:** `model.default`, `model.provider`, `terminal.backend`. -Everything else has a hardcoded default. 
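An adapter that generates `config.yaml` for a workspace may want to check the three required fields before writing the file. The sketch below works on an already-parsed mapping, so it does not depend on any YAML library; the function name `missing_required` is hypothetical and is not part of Hermes.

```python
# Required fields as listed in this recon: model.default, model.provider,
# terminal.backend. Each entry is (section, key).
REQUIRED_KEYS = (
    ("model", "default"),
    ("model", "provider"),
    ("terminal", "backend"),
)


def missing_required(config: dict) -> list[str]:
    """Return dotted names of required fields that are absent or empty."""
    missing = []
    for section, key in REQUIRED_KEYS:
        block = config.get(section)
        # A section that is missing or not a mapping fails all of its keys.
        if not isinstance(block, dict) or block.get(key) in (None, ""):
            missing.append(f"{section}.{key}")
    return missing
```

An empty return value means the mapping carries every field the recon lists as required; it says nothing about whether the values themselves are valid.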
- -**Credentials** go in `~/.hermes/.env` (separate from config.yaml): - -```bash -OPENROUTER_API_KEY=sk-or-... -ANTHROPIC_API_KEY=sk-ant-... -HERMES_HOME=~/.hermes # optional override -``` - -**Skills config** (in `config.yaml`): - -```yaml -skills: - creation_nudge_interval: 15 # remind agent to persist a skill every N tool iterations - external_dirs: - - ~/.agents/shared-skills # read-only external skill dirs -``` - -**Compression config** (in `config.yaml`): - -```yaml -compression: - enabled: true - threshold: 0.50 - summary_model: "google/gemini-3-flash-preview" -``` - ---- - -## c) Runtime Dependencies - -**Python version:** 3.13 (Dockerfile base: `ghcr.io/astral-sh/uv:0.11.6-python3.13-trixie`) -**Package manager:** [uv](https://github.com/astral-sh/uv) (not pip directly; `uv pip install .`) -**Package version:** `hermes-agent==0.8.0` - -**Top core pip dependencies** (from `pyproject.toml`): - -| Package | Version constraint | Purpose | -|---|---|---| -| `openai` | `>=2.21.0,<3` | Primary LLM client (all providers via OpenAI-compat API) | -| `anthropic` | `>=0.39.0,<1` | Direct Anthropic API adapter | -| `python-dotenv` | `>=1.2.1,<2` | `.env` loading | -| `fire` | `>=0.7.1,<1` | CLI argument dispatch | -| `httpx[socks]` | `>=0.28.1,<1` | Async HTTP (gateway, webhooks) | -| `rich` | `>=14.3.3,<15` | TUI rendering | -| `pyyaml` | `>=6.0.2,<7` | Config file parsing | -| `pydantic` | `>=2.12.5,<3` | Data validation | -| `prompt_toolkit` | `>=3.0.52,<4` | Interactive TUI / multiline input | -| `tenacity` | `>=9.1.4,<10` | Retry logic | - -**Key optional extras:** - -```bash -pip install "hermes-agent[modal]" # modal>=1.0.0 — serverless backend -pip install "hermes-agent[daytona]" # daytona>=0.148.0 — cloud sandbox backend -pip install "hermes-agent[mcp]" # mcp>=1.2.0 — MCP server/client -pip install "hermes-agent[honcho]" # honcho-ai — cross-session user modeling -pip install "hermes-agent[messaging]" # telegram, discord.py, aiohttp, slack -pip install 
"hermes-agent[voice]" # faster-whisper, sounddevice, numpy -pip install "hermes-agent[rl]" # atroposlib, fastapi, uvicorn, wandb -``` - -**System binaries** (from Dockerfile `apt-get install`): - -``` -nodejs npm ripgrep ffmpeg gcc python3-dev libffi-dev procps build-essential -``` - -`ripgrep` is used by the `file` toolset for fast codebase search. `ffmpeg` is used for voice transcription pre-processing. - ---- - -## d) Session State - -**All persistent state lives under `HERMES_HOME`** (default: `~/.hermes/`, overrideable via env var). - -**Primary state store: SQLite** - -``` -~/.hermes/state.db ← DEFAULT_DB_PATH = get_hermes_home() / "state.db" -``` - -- Schema version: **6** (`SCHEMA_VERSION = 6` in `hermes_state.py`) -- WAL mode (`PRAGMA journal_mode=WAL`) — supports concurrent gateway + CLI writers -- Three core tables: `schema_version`, `sessions`, `messages` -- **FTS5 virtual table** `messages_fts` with auto-sync triggers on INSERT/UPDATE/DELETE — backs the `session_search` toolset (full-text search across all past conversation content) -- Compression-triggered session splitting tracked via `parent_session_id` chain in `sessions` table -- Session source tagged as `'cli'`, `'telegram'`, `'discord'`, etc. for per-platform filtering - -**Full directory layout:** - -``` -~/.hermes/ -├── config.yaml ← get_config_path() -├── .env ← get_env_path() -├── state.db ← SQLite WAL, FTS5 -├── skills/ ← get_skills_dir() — user-created skill SKILL.md files -├── logs/ ← get_logs_dir() — trajectory JSONs -│ └── session_YYYYMMDD_HHMMSS_.json -├── MEMORY.md ← agent's curated notes (injected into system prompt) -├── USER.md ← user profile (injected into system prompt) -└── skins/ ← optional custom theme YAMLs -``` - -**State is persistent by default.** Session history, memories (`MEMORY.md`/`USER.md`), and skills survive restarts. The `session_reset` config controls when gateway sessions are cleared (default: `mode: both`, idle after 1440 min or at 4 AM daily). 
Before any reset, Hermes is given one flush turn to write important context to `MEMORY.md`. - -Container backend state is controlled separately by `container_persistent: true/false` in the `terminal:` block. - ---- - -## e) Execution Backends - -**Six backends configured via a single `terminal.backend` key in `config.yaml`:** - -| Backend | Where commands run | Key extra config | -|---|---|---| -| `local` | Host machine, current dir | — | -| `ssh` | Remote server | `ssh_host`, `ssh_user`, `ssh_key` | -| `docker` | Inside a Docker container | `docker_image`, `docker_mount_cwd_to_workspace` | -| `singularity` | Singularity/Apptainer container (HPC) | `singularity_image` | -| `modal` | Modal cloud sandbox (serverless) | `modal_image`, `pip install hermes-agent[modal]` | -| `daytona` | Daytona cloud sandbox | `daytona_image`, `container_disk`, `pip install hermes-agent[daytona]` | - -**Architecture clarification:** Hermes's Python process **always runs locally** (or wherever you launched it). The `backend` setting controls only where the **`terminal` tool** executes shell commands. For `docker`, Hermes calls the Docker API to spawn/reuse a container and routes `terminal` tool calls into it via exec — Hermes itself is **not** containerised by this setting. - -**Docker backend minimal config:** - -```yaml -terminal: - backend: "docker" - cwd: "/workspace" # path inside the container - timeout: 180 - lifetime_seconds: 300 - docker_image: "nikolaik/python-nodejs:python3.11-nodejs20" - docker_mount_cwd_to_workspace: false # default: false (security off). 
Set true to bind-mount launch dir into /workspace - docker_forward_env: - - "GITHUB_TOKEN" - - "NPM_TOKEN" - container_cpu: 1 - container_memory: 5120 # MB - container_disk: 51200 # MB - container_persistent: true # false = ephemeral container, wiped after session -``` - -**The Dockerfile** (for running *all of Hermes* inside Docker, distinct from the backend setting) uses: - -```dockerfile -FROM debian:13.4 -ENV HERMES_HOME=/opt/data -ENV PLAYWRIGHT_BROWSERS_PATH=/opt/hermes/.playwright -VOLUME /opt/data -ENTRYPOINT ["/opt/hermes/docker/entrypoint.sh"] -# Runs as non-root user hermes (UID 10000), home /opt/data -``` - -**Serverless hibernation** (Modal + Daytona): `container_persistent: false` produces fully ephemeral sandboxes that are destroyed after `lifetime_seconds`; `true` persists the container filesystem between sessions (warm-resume, no re-install overhead). - ---- - -## f) Value Proposition - -Integrating Hermes adds one capability that none of the other existing adapters (LangGraph, Claude Code, AutoGen, OpenClaw, Codex, Google ADK) deliver end-to-end: **a closed learning loop that compounds across sessions at the skill, memory, and user-model layers simultaneously.** Concretely: after a complex task, Hermes autonomously creates a `SKILL.md` file in `~/.hermes/skills/` (prompted every `creation_nudge_interval=15` tool iterations), and those skills are re-injected as context in future sessions — agents get better at tasks they've done before without any human curation step. The `session_search` toolset adds FTS5 + Gemini Flash summarization over `state.db`, so the agent can recall specific conversations from months ago with semantic-quality results. Layered on top is **Honcho dialectic user modeling** (`plastic-labs/honcho`) — a cross-session profile that tracks user communication style, preferences, and expectations, shared across any Honcho-integrated tool (not just Hermes). 
Finally, the **Modal and Daytona serverless backends with `container_persistent`** give Molecule AI a path to hibernating, pay-per-use sandboxes that no existing adapter exposes — directly relevant to Molecule AI's multi-workspace billing model. The `hermes claw migrate` command (backed by `optional-skills/migration/openclaw-migration/scripts/openclaw_to_hermes.py`) is also relevant: Molecule AI could offer equivalent migration tooling to attract OpenClaw's existing ~247k-user base, and the **`agentskills.io` skill-manifest spec** (referenced in `optional-skills/`) should be reviewed before Molecule AI finalises its own plugin manifest schema to ensure interoperability with what is rapidly becoming the de-facto file-based skill standard. diff --git a/content/docs/adapters/medo-integration.md b/content/docs/adapters/medo-integration.md deleted file mode 100644 index 5461d10..0000000 --- a/content/docs/adapters/medo-integration.md +++ /dev/null @@ -1,177 +0,0 @@ ---- -title: "MeDo Integration Design — Molecule AI Hackathon (May 20 2026)" -description: "Design for integrating the Baidu MeDo / Miaoda App Builder as an OpenClaw-runtime workspace, with A2A delegation and open questions." ---- -# MeDo Integration Design — Molecule AI Hackathon (May 20 2026) - -**Status:** Design — implementation pending operator sign-off on open questions (§5). -**Scope:** How the molecule-dev team builds MeDo apps for the "Build with MeDo" hackathon. -**Key constraint:** MeDo App Builder is an OpenClaw skill on ClawHub (`seiriosPlus/miaoda-app-builder`), -not a REST API. All interactions go through natural-language messages to an OpenClaw workspace. - ---- - -## 1. 
Architecture Overview - -``` -CEO / Canvas - │ A2A task - ▼ - PM (claude-code) - │ delegate_task_async → workspace: medo-builder - ▼ - MeDo Builder workspace [runtime: openclaw, skill: miaoda-app-builder] - │ OpenClaw CLI → skill → api.miaoda.cn - ▼ - MeDo platform (app created / published → URL returned) - │ result relayed via A2A event_queue - ▼ - PM → CEO -``` - -The MeDo Builder workspace is a **dedicated OpenClaw-runtime workspace** inside the -molecule-dev org with the Miaoda App Builder skill pre-installed. PM delegates natural-language -app-build requests to it via `delegate_task_async` and polls for the result (5–8 min latency). - ---- - -## 2. Installing the Miaoda App Builder Skill - -### 2.1 API Key - -The skill requires `MIAODA_API_KEY` (not `MEDO_API_KEY`). - -> ⚠️ **Credential name mismatch**: the global platform secret is currently named `MEDO_API_KEY`. -> The skill's frontmatter declares `primaryEnv: MIAODA_API_KEY`. The MeDo Builder workspace must -> set `MIAODA_API_KEY` — either rename the global secret or add a workspace-level alias. -> See open question §5-A. - -Obtain the key from: **MeDo website → Settings → API Keys**. Keys do not expire, but generating -a new one immediately invalidates the previous one. - -### 2.2 Installation Query - -OpenClaw installs skills by sending a natural-language install message to the agent. -No CLI command is documented on ClawHub — send this message to the OpenClaw workspace on first boot: - -``` -Install the Miaoda App Builder skill from ClawHub: seiriosPlus/miaoda-app-builder -``` - -OpenClaw auto-downloads the skill, installs Python runtime deps (`requests`), and makes the skill -available for subsequent messages. 
- -### 2.3 Workspace Config Sketch (`org-templates/medo-builder/workspace.yaml`) - -```yaml -name: MeDo Builder -role: Builds and publishes MeDo applications via the Miaoda App Builder OpenClaw skill -runtime: openclaw -tier: 2 -required_env: - - MIAODA_API_KEY # TODO: resolve name vs platform secret MEDO_API_KEY (§5-A) - - OPENROUTER_API_KEY # OpenClaw needs an LLM provider -initial_prompt: | - You are a MeDo App Builder. On startup: - 1. Install the Miaoda App Builder skill: - "Install the Miaoda App Builder skill from ClawHub: seiriosPlus/miaoda-app-builder" - 2. Confirm installation succeeded. - 3. Wait for build tasks from PM via A2A. - When you receive a build task, use natural language to instruct the skill: - "Create a [description] app and publish it when done." - App generation takes 5–8 minutes — poll the skill or wait for confirmation before reporting done. -``` - ---- - -## 3. A2A Delegation Pattern (5–8 Min Latency) - -App generation is asynchronous and slow. PM **must** use `delegate_task_async` + `check_task_status` -rather than `delegate_task` (which has a shorter timeout and will return before the app is ready). - -### 3.1 PM Delegation Flow - -```python -# Step 1: fire and forget -task = await delegate_task_async( - workspace_id="medo-builder-workspace-id", - task="Build a restaurant reservation tool with online booking, menu display, " - "and contact form. Publish when done and return the URL." 
-) - -# Step 2: poll every 60s (app takes 5–8 min) -while True: - status = await check_task_status(task_id=task["task_id"]) - if status["status"] in ("completed", "failed"): - break - await asyncio.sleep(60) - -result_url = status.get("result") # MeDo app URL on success -``` - -### 3.2 Invocation Patterns (verified from Baidu doc) - -Natural-language messages the MeDo Builder workspace should accept from PM: - -| Intent | Message to send to MeDo Builder workspace | -|--------|-------------------------------------------| -| List existing apps | `"Show me my apps"` | -| Create + auto-publish | `"Create a [description] and publish it when done"` | -| Create only | `"Create a [description]"` | -| Modify existing | `"Add a search function to app [name/ID]"` | -| Publish draft | `"Publish this app"` | -| Status check | `"Is the app generation done yet?"` | - ---- - -## 4. Proposed Org Template — `org-templates/medo-builder/` - -``` -org-templates/medo-builder/ -├── org.yaml ← minimal single-workspace org (not full team) -├── medo-builder/ -│ ├── system-prompt.md ← MeDo Builder agent persona + delegation rules -│ └── workspace.yaml ← runtime: openclaw, skill install, env -``` - -**org.yaml sketch:** - -```yaml -name: MeDo Builder -description: Single-workspace org for building MeDo apps (hackathon) -defaults: - runtime: openclaw - tier: 2 - required_env: [MIAODA_API_KEY, OPENROUTER_API_KEY] - -workspaces: - - name: MeDo Builder - role: Builds and publishes MeDo applications via Miaoda App Builder skill - files_dir: medo-builder - canvas: { x: 400, y: 300 } -``` - -The medo-builder workspace is deployed **as a child of the molecule-dev PM** in the hackathon org, -not as a standalone org. Full `org-templates/medo-builder/` implementation is Week 2 scope. - ---- - -## 5. 
Open Questions (Operator Resolution Required) - -| # | Question | Why it blocks | -|---|----------|---------------| -| 5-A | **Credential name**: platform secret is `MEDO_API_KEY`; skill expects `MIAODA_API_KEY`. Rename global secret or add workspace alias? | Workspace boot will fail with "MIAODA_API_KEY not set" | -| 5-B | **Credit cost per app**: Baidu doc mentions a Credit System but content was not rendered. How many credits does create+generate+publish consume? Do we have enough for hackathon testing? | Budget planning | -| 5-C | **Rate limits**: no rate-limit info in docs or ClawHub page. What's the max concurrent app generations per API key? | Parallelism planning | -| 5-D | **Failure recovery**: what happens if the OpenClaw skill process crashes mid-generation (after Confirm & Generate, before Publish)? Is there a way to resume or check status by app ID? | Reliability design | -| 5-E | **Submission format**: does the hackathon judge the published MeDo app URL, the Molecule AI org config, or both? | Determines whether we need a polished demo org or just a working app | - ---- - -## 6. Implementation Checklist (Weeks 1–3) - -- [x] Week 1: This design doc (`docs/adapters/medo-integration.md`) -- [ ] Week 1: Resolve §5-A (credential name) + obtain API key credits estimate -- [ ] Week 2: `org-templates/medo-builder/` — full system-prompt + workspace.yaml -- [ ] Week 2: Integration test — PM delegates one real app build end-to-end -- [ ] Week 3: Polish demo org; rehearse submission flow; publish hackathon entry diff --git a/content/docs/adapters/medo-smoke-test-log.md b/content/docs/adapters/medo-smoke-test-log.md deleted file mode 100644 index 0896e97..0000000 --- a/content/docs/adapters/medo-smoke-test-log.md +++ /dev/null @@ -1,117 +0,0 @@ ---- -title: "MeDo Smoke Test Log — 2026-04-13 (Run 4)" -description: "Smoke-test run log for the MeDo / Miaoda App Builder OpenClaw integration." 
---- -# MeDo Smoke Test Log — 2026-04-13 (Run 4) - -**Tester:** PM (direct execution) -**Goal:** Install Miaoda App Builder skill → build "Hello Molecule AI" landing page → publish → URL. -**Credits spent:** 0 across all four runs. - ---- - -## Run Summary - -| Run | Blocker | Resolution | -|-----|---------|------------| -| 1 | `workspace-template:openclaw` image not built | ✅ Operator rebuilt image | -| 2 | Adapter key lookup ignores `AISTUDIO_API_KEY` / `QIANFAN_API_KEY` | ✅ Code fix committed (d779e16) | -| 3 | Executor creates fresh OpenClaw session per A2A message | ✅ Code fix committed (9466943) | -| 4 | `payloads: []` on every response — agent never returns text via `--json` mode | ❌ Root cause below | - ---- - -## Run 4 — Detailed Findings - -### Environment — all green -| Check | Result | -|-------|--------| -| Platform health | ✅ | -| `workspace-template:openclaw` image | ✅ boots in 31s | -| AISTUDIO_API_KEY + gemini-2.0-flash | ✅ confirmed in every response meta | -| Stable session ID (workspace ID) | ✅ `sessionKey: agent:main:explicit:a507780d-...` consistent across all calls | - -### Messages Sent and Responses - -| Message | Response | Duration | -|---------|----------|----------| -| Install skill | `payloads: [], livenessState: working` | 1.7s | -| Build Hello Molecule AI | `payloads: [], livenessState: working` | 0.8s | -| Check status (sessions_list) | `LLM request failed: provider rejected request schema/payload` | — | -| Reply with exactly: STATUS_OK | `payloads: [], livenessState: working` (after restart) | 1.8s | - -The "Reply with exactly: STATUS_OK" response is decisive. A vanilla LLM call with no tool use should produce a text payload. It didn't. This rules out skill complexity or message ambiguity as the cause. - -### Root Cause — `openclaw agent --json` Does Not Surface Agent Text in `payloads` - -The OpenClaw agent processes messages using background session dispatch (`sessions_spawn` / `sessions_yield`). In this mode: -1. 
Main session receives message → immediately spawns background session → calls `sessions_yield` -2. `openclaw agent --json` exits with `payloads: [], livenessState: 'working'` -3. Background session processes the actual work and produces text — but only visible in interactive/streaming mode, not in the `--json` subprocess call - -**Evidence:** Even "Reply with exactly: STATUS_OK" returns `payloads: []`. The agent is using background sessions for everything, including trivial echo requests. - -**Likely cause:** OpenClaw's default `SOUL.md` / `BOOTSTRAP.md` workspace config instructs the agent to always use async session patterns. In a terminal session these background responses appear naturally; via subprocess `--json`, only the main session's synchronous output is captured. - -### Transient issue: LLM request failed -After 3+ rapid A2A calls (install → build → status check), the Gemini AI Studio API returned a schema/payload rejection. Resolved by restarting the workspace (`POST /workspaces/:id/restart`). Likely a rate-limit or context-size rejection from Gemini. Restarted in 30s, normal on next call. - ---- - -## 4. Required Fix — OpenClawA2AExecutor Response Capture - -The executor must retrieve the agent's text response from session history **after** the main session yields. The `sessions_history` CLI command (exposed as `session_history` tool) retrieves past messages. 
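The "yielded without text" condition described in the root-cause section can be detected structurally instead of by matching a string prefix. This is a minimal sketch under one assumption: that `openclaw agent --json` prints a single JSON object with top-level `payloads` and `livenessState` keys, as the responses logged in this run suggest. The helper name `is_yielded_without_text` is hypothetical.

```python
import json


def is_yielded_without_text(raw_stdout: str) -> bool:
    """Return True when the output looks like a yield with no text payload.

    Assumes a top-level JSON object with `payloads` and `livenessState`
    keys; anything that does not parse into that shape returns False so
    the caller keeps the original reply.
    """
    try:
        data = json.loads(raw_stdout)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(data, dict):
        return False
    return data.get("payloads") == [] and data.get("livenessState") == "working"
```

A check like this could gate the session-history fallback, so that a reply which merely begins with similar text is not mistaken for an empty response.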
- -**Proposed change** to `workspace/adapters/openclaw/adapter.py` (`execute()` method): - -```python -# After proc.communicate() returns with payloads=[]: -if not reply or reply.startswith("{'payloads': []"): - # Agent yielded without responding — fetch last message from session history - await asyncio.sleep(2) # brief wait for background session to complete short tasks - hist_proc = await asyncio.create_subprocess_exec( - "openclaw", "sessions", "history", - "--session-id", self._session_id, - "--limit", "1", "--json", - stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE, - env={**os.environ, "PATH": f"{os.path.expanduser('~/.local/bin')}:{os.environ.get('PATH', '')}"} - ) - hist_stdout, _ = await asyncio.wait_for(hist_proc.communicate(), timeout=15) - hist_data = json.loads(hist_stdout.decode().strip() or "{}") - last_msg = (hist_data.get("messages") or [{}])[-1] - reply = last_msg.get("content", reply) # fall back to original if no history -``` - -**Note on long tasks (5–8 min builds):** Session history won't have the build result until it completes. For Miaoda App Builder, PM must poll: send a follow-up "What is the status of the Hello Molecule AI app build?" message every 60s until the response contains a URL or error. - ---- - -## 5. Open Questions Status - -### 5-C — Rate limits -**UNKNOWN.** Never reached skill invocation. -*New data:* Gemini AI Studio hit a schema/payload rejection after 3 rapid calls. This may be a Gemini-specific issue with large tool schemas (OpenClaw's `cron` schema is 6311 chars). Worth filing separately. - -### 5-D — Failure recovery -**UNKNOWN.** Never reached app generation. - ---- - -## 6. 
Issues to File - -| # | Issue | Status | Location | -|---|-------|--------|----------| -| A | `fix(openclaw): use stable workspace session ID` | ✅ fixed in 9466943 | adapter.py | -| B | `fix(openclaw): extend key lookup for AISTUDIO/QIANFAN` | ✅ fixed in d779e16 | adapter.py | -| C | `fix(provisioner): surface Docker errors in last_sample_error` | ❌ open | provisioner.go | -| **D** | **`fix(openclaw): capture agent response via session history when payloads=[]`** | ❌ open — see §4 | adapter.py | -| **E** | **`fix(openclaw): Gemini rejects request after N rapid calls with large tool schema`** | ❌ open — investigate cron schema size | adapter.py | - ---- - -## 7. Next Steps (before Run 5) - -- [ ] **Dev Lead:** Implement §4 session-history fallback in `OpenClawA2AExecutor.execute()` -- [ ] **Dev Lead (optional):** Trim `cron` tool schema to reduce Gemini schema-size rejection risk -- [ ] **Operator:** Rebuild image: `bash workspace/build-all.sh openclaw` -- [ ] **PM (Run 5):** Re-run smoke test — expected to finally reach skill install confirmation diff --git a/content/docs/adr/ADR-001-admin-token-scope.md b/content/docs/adr/ADR-001-admin-token-scope.md deleted file mode 100644 index 6970721..0000000 --- a/content/docs/adr/ADR-001-admin-token-scope.md +++ /dev/null @@ -1,112 +0,0 @@ ---- -title: "ADR-001: Admin endpoints accept any workspace bearer token" -description: "ADR-001: why admin endpoints validate any workspace bearer token, and the AdminAuth lockdown that followed." ---- -# ADR-001: Admin endpoints accept any workspace bearer token - -**Status:** Accepted — known risk, Phase-H remediation planned -**Date:** 2026-04-17 -**Issue:** #684 -**Tracking:** Phase-H — #710 - -## Context - -The `AdminAuth` middleware validates callers by calling `ValidateAnyToken`, which -accepts any live workspace bearer token regardless of which workspace issued it. 
-There is no separation between workspace-scoped tokens (issued to individual -agents) and admin-scoped tokens (intended for platform operators). - -This means any workspace agent that has been issued a token can reach every -admin-gated route on the platform. - -## Decision - -Proper token-tier separation (workspace vs. admin scope) is deferred to Phase-H. -The known risk is explicitly accepted. Mitigation controls are documented below. - -## Blast radius — affected admin endpoints - -A compromised workspace token grants unauthenticated-equivalent access to all -of the following: - -| Endpoint | Impact | -|----------|--------| -| `GET /admin/workspaces/:id/test-token` | Mint a fresh bearer token for any workspace | -| `DELETE /workspaces/:id` | Delete any workspace and auto-revoke its tokens | -| `PUT /settings/secrets` / `POST /admin/secrets` | Overwrite any global secret (env-poisons every agent on restart) | -| `DELETE /settings/secrets/:key` / `DELETE /admin/secrets/:key` | Delete any global secret; same fan-out restart | -| `GET /settings/secrets` / `GET /admin/secrets` | Read all global secret keys (values masked, but key enumeration enables targeted attacks) | -| `GET /workspaces/:id/budget` + `PATCH /workspaces/:id/budget` | Read or clear any workspace's token budget | -| `GET /events` / `GET /events/:workspaceId` | Read the full structural event log across all workspaces | -| `POST /bundles/import` | Import an arbitrary workspace bundle — creates workspaces, injects secrets, overwrites configs | -| `GET /bundles/export/:id` | Exfiltrate full workspace bundle including config, secrets references, and files | -| `POST /org/import` | Instantiate an entire org template — creates multiple workspaces with arbitrary roles and secrets | -| `GET /org/templates` | Enumerate all org template names and their configured roles/system prompts | -| `POST /templates/import` | Write arbitrary files into `configsDir` (workspace template injection) | -| `GET /templates` | 
Enumerate all template names and metadata | -| `GET /admin/liveness` | Read platform subsystem health (ops intel) | -| `GET /admin/schedules/health` | Read cron scheduler health across all workspaces | - -## Risk statement - -**A single compromised workspace agent can achieve full platform takeover via -admin endpoints.** - -Attack chain example: -1. Agent A's token is exfiltrated (e.g. via a prompt-injection in a delegated task). -2. Attacker calls `PUT /settings/secrets` to overwrite `CLAUDE_API_KEY` with a - controlled value. -3. Every non-paused workspace restarts and loads the poisoned key. -4. Attacker now controls the LLM backend for the entire platform. - -Alternatively: call `POST /bundles/import` with a crafted bundle to inject a -malicious workspace with a pre-configured `initial_prompt` and elevated secrets. - -## Current mitigations - -- **Workspace isolation** — `CanCommunicate()` in the A2A proxy limits which - workspaces can send tasks to which, reducing the blast radius of a single - compromised agent during normal operation. -- **Audit logging** — PR #651 writes all admin-route calls to `structure_events`. - Forensic recovery is possible after the fact. -- **`ValidateAnyToken` removed-workspace JOIN** — tokens belonging to deleted - workspaces are filtered at the DB layer (PR #682 defense-in-depth) so - post-deletion token replay is blocked. -- **`MOLECULE_ENV=production` gate** — hides the `/admin/workspaces/:id/test-token` - endpoint in production deployments unless `MOLECULE_ENABLE_TEST_TOKENS=1`. - -## Phase-H remediation plan - -Tracked in GitHub issue **#710**. 
- -### Schema change - -Add a `token_type` column to `workspace_auth_tokens`: - -```sql -ALTER TABLE workspace_auth_tokens - ADD COLUMN IF NOT EXISTS token_type TEXT NOT NULL DEFAULT 'workspace' - CHECK (token_type IN ('workspace', 'admin')); -``` - -Admin tokens are minted only via a dedicated privileged endpoint that itself -requires an existing admin token or a one-time bootstrap secret. - -### Middleware update - -- `WorkspaceAuth` — continue accepting `token_type = 'workspace'` only. -- `AdminAuth` — require `token_type = 'admin'`. Workspace tokens rejected. - -### Bootstrap flow - -On first boot (no tokens exist), a single-use bootstrap secret is printed to -the server log. The operator uses it to mint the first admin token. Subsequent -admin tokens are minted by existing admin token holders. The fail-open path in -`HasAnyLiveTokenGlobal` is retired once Phase-H ships. - -### Migration path - -Phase-H is a breaking change for any automation that currently uses workspace -tokens against admin endpoints. A migration guide and a `MOLECULE_PHASE_H=1` -feature flag will be provided so operators can opt in before the strict -enforcement date. diff --git a/content/docs/api/reference.md b/content/docs/api/reference.md deleted file mode 100644 index 19c4fc0..0000000 --- a/content/docs/api/reference.md +++ /dev/null @@ -1,125 +0,0 @@ ---- -title: API Reference -description: Full REST API reference for the Molecule AI workspace server — workspace management, A2A communication, file operations, secrets, tokens, and more. ---- - -# API Reference - -This document describes the REST API exposed by the Molecule AI workspace server (Go/Gin, default port `:8080`). Clients include the Canvas frontend, workspace agents communicating over A2A, and external tooling such as the MCP server and CLI. 
- -**Base URL:** `http://localhost:8080` (development default) -**Rate limit:** 600 req/min (configurable via `RATE_LIMIT`) -**CORS origins:** `http://localhost:3000,http://localhost:3001` by default (configurable via `CORS_ORIGINS`) - ---- - -## Authentication - -Three middleware classes gate server-side routes: - -- **`AdminAuth`** — strict bearer-only. Required for any route that can leak prompts/memory, create/mutate workspaces, or expose ops intel. Lazy-bootstrap fail-open when no live tokens exist globally. -- **`WorkspaceAuth`** — binds a bearer token to a specific workspace `:id`. A token for workspace A cannot be used against workspace B's sub-routes. -- **`CanvasOrBearer`** — accepts a bearer token OR a request Origin matching `CORS_ORIGINS`. Used only for cosmetic routes with zero data/security impact (currently `PUT /canvas/viewport` only). Do not extend to routes that leak data or create resources. - -Full contract: `docs/runbooks/admin-auth.md`. - ---- - -## Routes - -| Method | Path | Handler | -|--------|------|---------| -| GET | /health | inline | -| GET | /metrics | metrics.Handler() — Prometheus text format; no auth, scrape-safe | -| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go — `GET /workspaces`, `POST /workspaces`, and `DELETE /workspaces/:id` require `AdminAuth`. `PATCH /workspaces/:id` enforces field-level authz: cosmetic fields (name, role, x, y, canvas) pass through; sensitive fields (tier, parent_id, runtime, workspace_dir) require a valid bearer token when any live token exists. 
| -| GET/PATCH | /workspaces/:id/config | workspace.go | -| GET/POST | /workspaces/:id/memory | workspace.go | -| DELETE | /workspaces/:id/memory/:key | workspace.go | -| POST/PATCH/DELETE | /workspaces/:id/agent | agent.go | -| POST | /workspaces/:id/agent/move | agent.go | -| GET/POST/PUT | /workspaces/:id/secrets | secrets.go (POST/PUT auto-restarts workspace) | -| DELETE | /workspaces/:id/secrets/:key | secrets.go (DELETE auto-restarts workspace) | -| GET | /workspaces/:id/model | secrets.go | -| GET | /settings/secrets | secrets.go — list global secrets (keys only, values masked) | -| PUT/POST | /settings/secrets | secrets.go — set a global secret `{key, value}`; auto-restarts every non-paused/non-removed/non-external workspace that does not shadow the key with a workspace-level override | -| DELETE | /settings/secrets/:key | secrets.go — delete a global secret; same auto-restart fan-out as PUT/POST | -| GET | /admin/workspaces/:id/test-token | admin_test_token.go — mint a fresh bearer token for E2E scripts; returns 404 unless `MOLECULE_ENV != production` or `MOLECULE_ENABLE_TEST_TOKENS=1` | -| GET/POST/DELETE | /admin/secrets[/:key] | secrets.go — legacy aliases for /settings/secrets | -| WS | /workspaces/:id/terminal | terminal.go | -| POST | /workspaces/:id/expand | team.go | -| POST | /workspaces/:id/collapse | team.go | -| POST/GET | /workspaces/:id/approvals | approvals.go | -| POST | /workspaces/:id/approvals/:id/decide | approvals.go | -| GET | /approvals/pending | approvals.go | -| POST/GET | /workspaces/:id/memories | memories.go | -| DELETE | /workspaces/:id/memories/:id | memories.go | -| GET | /workspaces/:id/traces | traces.go | -| GET/POST | /workspaces/:id/activity | activity.go | -| POST | /workspaces/:id/notify | activity.go (agent→user push message via WebSocket) | -| POST | /workspaces/:id/restart | workspace.go | -| POST | /workspaces/:id/pause | workspace.go (stops container, status→paused) | -| POST | /workspaces/:id/resume | 
workspace.go (re-provisions paused workspace) | -| POST | /workspaces/:id/a2a | workspace.go | -| POST | /workspaces/:id/delegate | delegation.go (async fire-and-forget) | -| GET | /workspaces/:id/delegations | delegation.go (list delegation status) | -| GET/POST | /workspaces/:id/schedules | schedules.go (cron CRUD) | -| PATCH/DELETE | /workspaces/:id/schedules/:scheduleId | schedules.go | -| POST | /workspaces/:id/schedules/:scheduleId/run | schedules.go (manual trigger) | -| GET | /workspaces/:id/schedules/:scheduleId/history | schedules.go (past runs) | -| GET/POST | /workspaces/:id/channels | channels.go (social channel CRUD) | -| PATCH/DELETE | /workspaces/:id/channels/:channelId | channels.go | -| POST | /workspaces/:id/channels/:channelId/send | channels.go (outbound message) | -| POST | /workspaces/:id/channels/:channelId/test | channels.go (test connection) | -| GET | /channels/adapters | channels.go (list available platforms) | -| POST | /channels/discover | channels.go (auto-detect chats for a bot token) | -| POST | /webhooks/:type | channels.go (incoming social webhook) | -| GET | /workspaces/:id/shared-context | templates.go | -| GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go | -| GET | /canvas/viewport | viewport.go — open, no auth required (cosmetic, bootstrap-friendly) | -| PUT | /canvas/viewport | viewport.go — `CanvasOrBearer` middleware; accepts bearer OR Origin matching `CORS_ORIGINS`. Cosmetic-only route — worst case viewport corruption, recovered by page refresh. 
| -| GET | /templates | templates.go | -| POST | /templates/import | templates.go — `AdminAuth` required | -| POST | /registry/register | registry.go | -| POST | /registry/heartbeat | registry.go — requires `Authorization: Bearer ` once a workspace has any live token on file (legacy workspaces grandfathered) | -| POST | /registry/update-card | registry.go — requires `Authorization: Bearer ` once a workspace has any live token on file | -| GET | /registry/discover/:id | discovery.go — requires `X-Workspace-ID` + bearer token on the caller side | -| GET | /registry/:id/peers | discovery.go — requires `X-Workspace-ID` + bearer token on the caller side | -| POST | /registry/check-access | discovery.go | -| GET | /plugins | plugins.go (list registry; supports `?runtime=` filter) | -| GET | /plugins/sources | plugins.go (list registered install-source schemes) | -| GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install (`{"source":"scheme://spec"}`), uninstall per-workspace | -| GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) | -| GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) | -| GET/POST | /workspaces/:id/tokens | tokens.go — list active tokens (prefix + metadata), create new token (plaintext returned once). Max 50 per workspace. | -| DELETE | /workspaces/:id/tokens/:tokenId | tokens.go — revoke specific token by ID | -| GET | /bundles/export/:id | bundle.go — `AdminAuth` required | -| POST | /bundles/import | bundle.go — `AdminAuth` required | -| GET | /org/templates | org.go (list available org templates) | -| POST | /org/import | org.go — `AdminAuth` required; applies `resolveInsideRoot` path sanitiser on template paths | -| GET | /events | events.go — `AdminAuth` required | -| GET | /events/:workspaceId | events.go — `AdminAuth` required | -| GET | /admin/liveness | inline — `AdminAuth` required. 
Returns per-subsystem `supervised.Snapshot()` ages; use to check health of scheduler/heartbeat goroutines | -| GET | /ws | socket.go | - ---- - -## Database - -Migration files live in `workspace-server/migrations/` (latest: `022_workspace_schedules_source`). Each migration ships as a `.up.sql`/`.down.sql` pair. The migration runner globs `*.sql`, filters out `.down.sql` files, sorts alphabetically, and executes each file on boot. All `.up.sql` files must be idempotent (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... IF NOT EXISTS`) because the runner re-applies every migration on every boot. - -### Key Tables - -| Table | Description | -|-------|-------------| -| `workspaces` | Core entity — status, runtime, `agent_card` JSONB, heartbeat columns, `current_task`, `awareness_namespace`, `workspace_dir` | -| `canvas_layouts` | Per-workspace x/y canvas position | -| `structure_events` | Append-only event log (workspace lifecycle, agent, approval events) | -| `activity_logs` | A2A communications, task updates, agent logs, errors. `error_detail` is populated by the scheduler so cron run history can surface failure reasons. | -| `workspace_schedules` | Cron tasks — expression, timezone, prompt, run history, `source` (`'template'` for org/import-seeded, `'runtime'` for Canvas/API-created), `last_status` (includes `'skipped'` when the scheduler concurrency-skips a busy workspace) | -| `workspace_channels` | Social channel integrations (Telegram, Slack, etc.) 
with JSONB config and allowlist | -| `agents` | Agent records | -| `workspace_secrets` | Per-workspace encrypted secrets | -| `global_secrets` | Platform-wide encrypted secrets | -| `workspace_auth_tokens` | Bearer tokens; auto-revoked on workspace delete | -| `agent_memories` | HMA scoped memory (LOCAL / TEAM / GLOBAL) | -| `approvals` | Human-in-the-loop approval requests | diff --git a/content/docs/architecture/canary-release.md b/content/docs/architecture/canary-release.md deleted file mode 100644 index 90bcbf1..0000000 --- a/content/docs/architecture/canary-release.md +++ /dev/null @@ -1,83 +0,0 @@ ---- -title: "Canary release pipeline" -description: "The canary release pipeline that ships workspace-server changes to the prod tenant fleet, and how to halt it." ---- -# Canary release pipeline - -How a workspace-server code change reaches the prod tenant fleet — and how to stop it if something's wrong. - -## The loop - -``` -PR merged to staging → main - │ - ▼ -publish-workspace-server-image.yml ← pushes :staging- ONLY - │ (NOT :latest — prod is untouched) - ▼ -Canary tenants auto-update to :staging- - │ (5-min auto-updater cycle on each canary EC2) - ▼ -canary-verify.yml waits 6 min, runs scripts/canary-smoke.sh - │ - ├─► GREEN → crane tag :staging- → :latest - │ │ - │ ▼ - │ Prod tenants auto-update within 5 min - │ - └─► RED → :latest stays on prior good digest - GitHub Step Summary flags the rejected sha - Ops fixes forward OR rolls back manually -``` - -## Canary fleet - -Lives in a separate AWS account (`molecule-canary`, `004947743811`) via an assumed role (`MoleculeStagingProvisioner`). The CP's `is_canary` org flag routes provisioning there; every other org goes to the default staging account. See `docs/architecture/saas-prod-migration-2026-04-19.md` for the account bootstrap. - -Canary tenants are configured to pull `:staging-` (not `:latest`) via `TENANT_IMAGE` on their provisioner, so they ingest each new build before prod does. 
- -## Smoke suite - -`scripts/canary-smoke.sh` hits each canary tenant (URL + ADMIN_TOKEN pair) and asserts: - -- `/admin/liveness` returns a subsystems map (tenant booted, AdminAuth reachable) -- `/workspaces` returns a JSON array (wsAuth + DB healthy) -- `/memories/commit` + `/memories/search` round-trip (encryption + scrubber) -- `/events` admin read (C4 fail-closed proof) -- `/admin/liveness` without bearer → 401 (C4 regression gate) - -Expand by editing the script — each `check "name" "expected" "$response"` call is one line. - -## Adding a canary tenant - -1. `POST /cp/orgs` — create the org normally (is_canary defaults to false) -2. `POST /cp/admin/orgs//canary` with `{"is_canary": true}` — admin only, refuses to flip if already provisioned -3. Re-trigger provision (or delete + recreate if the org was already provisioned into staging) — the fresh EC2 lands in account `004947743811` - -Then set repo secrets: -- `CANARY_TENANT_URLS` — append the new tenant's URL -- `CANARY_ADMIN_TOKENS` — append its ADMIN_TOKEN in the same position - -## Rolling back `:latest` - -When canary was green but something surfaces post-promotion, retag `:latest` to a prior digest: - -```bash -export GITHUB_TOKEN=ghp_... # write:packages -scripts/rollback-latest.sh 4c1d56e # retags both platform + tenant images -``` - -`scripts/rollback-latest.sh` pre-checks that `:staging-` exists before moving `:latest`, and verifies the digest after the move. Prod tenants pick up the rolled-back image on their next 5-min auto-update. - -A post-mortem should always include: -- the commit sha that broke -- why canary didn't catch it (new code path the smoke suite doesn't exercise?) -- whether the smoke suite should grow a new check to prevent the same class of bug - -## What this gate doesn't catch - -- Bugs that only surface under prod-only data (customer workloads with scale or shape canary doesn't produce). Canary uses real traffic shapes but can't simulate weeks of accumulated state. 
-- Config drift between canary and prod (different env-var values, different feature flags). Keep canary's config deltas minimal and documented. -- Cross-tenant interactions — canary tenants run in their own AWS account, so a bug that only appears when two tenants compete for a shared resource won't reproduce here. - -When these miss, `rollback-latest.sh` is the escape hatch. diff --git a/content/docs/architecture/saas-prod-migration-2026-04-19.md b/content/docs/architecture/saas-prod-migration-2026-04-19.md deleted file mode 100644 index 0aaa99e..0000000 --- a/content/docs/architecture/saas-prod-migration-2026-04-19.md +++ /dev/null @@ -1,76 +0,0 @@ ---- -title: "SaaS prod migration — 2026-04-19" -description: "Prod cutover notes for the 2026-04-19 staging→main promotion of molecule-controlplane and molecule-core." ---- -# SaaS prod migration — 2026-04-19 - -Promoted staging → main on both `Molecule-AI/molecule-controlplane` and `Molecule-AI/molecule-core`. This note captures the prod cutover deltas so ops can cross-check against the running system. 
- -## What changed - -Ten PRs landed, split across the two repos: - -**Control plane (`molecule-controlplane`)** -- PR #50 — C1/C2/C3: bearer auth on `/cp/workspaces/*`, shell-escape tenant user-data, per-tenant security group -- PR #51 — H1/H2: crash-safe `SECRETS_ENCRYPTION_KEY` log, dropped `admin_token` from `/instance` SELECT -- PR #52 — SSRF guard on `platform_url` -- PR #53 — CP injects `MOLECULE_CP_SHARED_SECRET` + `MOLECULE_CP_URL` into tenant env -- PR #54 — Stripe webhook body capped at 1 MiB - -**Core (`molecule-core` / this repo)** -- PR #978 — H3/H4: LimitReader on Discord webhook + workspace config PATCH -- PR #979 — C4: `AdminAuth` fail-closed on fresh install when `ADMIN_TOKEN` is set -- PR #980 — log-scrub: dropped token prefix logging, stopped logging raw upstream response bodies -- PR #981 — tenant `CPProvisioner` attaches the CP bearer on every outbound `/cp/workspaces/*` call -- PR #982 — Canvas API fetch timeout (15s) -- PR #984 — E2E smoke test sync for #966 (public GET no longer exposes `current_task`) - -## New prod env vars (Railway, project `molecule-platform`, env `production`) - -Set before the CP merge landed: - -| Variable | Value shape | Purpose | -|---|---|---| -| `PROVISION_SHARED_SECRET` | 32-byte hex | Gates `/cp/workspaces/*` on CP. Routes refuse to mount when unset — C1 fail-closed. | -| `EC2_VPC_ID` | `vpc-…` | Enables per-tenant SG creation (C3). Shared-SG fallback emits a startup warning. | -| `CP_BASE_URL` | `https://api.moleculesai.app` | Injected into newly-provisioned tenant containers as `MOLECULE_CP_URL`. | - -The live prod `PROVISION_SHARED_SECRET` value is held only in Railway; not committed anywhere. Rotate by `railway variables --set` + redeploy. - -## Existing-tenant migration (the sharp edge) - -Tenants provisioned **before** this cutover are still running the previous workspace-server image. 
When they pull the new image on their next boot or auto-update cycle, their `CPProvisioner` will start expecting `MOLECULE_CP_SHARED_SECRET` in the container env — but the existing tenant EC2s don't have that variable in their user-data (the CP only started injecting it from PR #53 onward). - -**Symptom**: a pre-cutover tenant can still serve its users' existing workspaces, but any attempt to **provision a new workspace** from inside the tenant UI will hit the CP's new bearer gate and get `401` or `404` back, surfacing as "workspace provision failed" with a generic error. - -**Fix per existing tenant (pick one)**: - -1. **SSH in + add the env var** - - Copy `PROVISION_SHARED_SECRET` from Railway prod env. - - `ssh ubuntu@` and append to the running container's env (`docker stop && docker run … -e MOLECULE_CP_SHARED_SECRET='…' -e MOLECULE_CP_URL=https://api.moleculesai.app …`). Rolling this into an auto-update hook is follow-up work. - -2. **Re-provision the tenant** - - `DELETE /cp/orgs/:slug` → re-create via normal signup flow. Tenant-level data survives only if the tenant's own Postgres volume is preserved; workspace_id values change. This is the heavy hammer — only for tenants where existing data can be recreated easily. - -3. **Wait for the auto-update + user-data refresh cycle** - - Tenant auto-updater (cron, 5-minute cadence) pulls the new container image but **does not refresh env vars** — those are frozen from the initial user-data. So option 3 alone doesn't fix this; it still needs option 1 or 2. - -Script at `scripts/migrate-tenant-cp-secret.sh` (follow-up) will automate option 1 across all running tenants in the prod AWS account. 
- -## Post-deploy verification checklist - -- [ ] Railway prod deploy for `controlplane` lands on the new commit (check `https://railway.com/project/7ccc…/service/ae76…`) -- [ ] `curl https://api.moleculesai.app/health` → 200 `{service: molecule-cp, status: ok}` -- [ ] `curl -X POST https://api.moleculesai.app/cp/workspaces/provision` (no bearer) → 401 (**not** 404 — proves the env var is live and routes mounted) -- [ ] GHCR publishes new `workspace-server` image for the core main commit -- [ ] Vercel canvas prod deploy lands - -## Rollback - -If prod is on fire: - -1. `gh pr revert 46 -R Molecule-AI/molecule-controlplane` — reverts all 6 CP PRs together. -2. `gh pr revert 983 -R Molecule-AI/molecule-core` — reverts the core bundle. -3. Both reverts auto-deploy via Railway / GHCR / Vercel. - -Existing tenants aren't affected by a rollback — they're running whichever tenant image tag they booted with. Only newly-provisioned tenants pick up the reverted control plane code. diff --git a/content/docs/architecture/staging-environment.md b/content/docs/architecture/staging-environment.md deleted file mode 100644 index a82226b..0000000 --- a/content/docs/architecture/staging-environment.md +++ /dev/null @@ -1,218 +0,0 @@ ---- -title: "Staging Environment Design" -description: "The staging environment design on Railway, mirroring prod for safe pre-release validation." ---- -# Staging Environment Design - -> **Status:** Planned — gates all future infra changes (Tunnel migration, -> security fixes, etc.) -> -> **Problem:** We merge directly to main and auto-deploy to production. -> Today's session broke CI twice and caused hours of Cloudflare edge cache -> issues because there was no staging to test infra changes first. -> -> **Goal:** Full staging environment that mirrors production. Every change -> ships to staging first, gets verified, then promotes to production. 
- ---- - -## Architecture - -``` - staging production - ─────── ────────── -Git branch: main (auto-deploy) main (manual promote) - or staging branch - -CP (Railway): staging service production service - staging.api.moleculesai.app api.moleculesai.app - -Tenant EC2s: staging EC2 instances production EC2 instances - *.staging.moleculesai.app *.moleculesai.app - -App (Vercel): staging.app.moleculesai.app app.moleculesai.app - (Vercel preview) (Vercel production) - -DB (Neon): staging branch main branch - (or separate project) - -Docker images: platform-tenant:staging platform-tenant:latest - (GHCR) (GHCR) - -Cloudflare: *.staging.moleculesai.app *.moleculesai.app - (separate tunnel/worker) (tunnel per tenant) -``` - -## Deploy flow - -``` -Developer pushes to PR branch - → CI runs (tests, build, lint) - → PR merged to main - → Auto-deploy to STAGING - → Staging smoke tests (automated) - → Manual verification if needed - → Promote to PRODUCTION (manual trigger or approval) -``` - -## Components - -### 1. Railway: two environments - -Railway supports multiple environments per project. Create a `staging` -environment alongside `production`: - -```bash -railway environment create staging -railway variables --environment staging --set "DATABASE_URL=" -railway variables --environment staging --set "MOLECULE_ENV=staging" -# ... all other vars with staging-specific values -``` - -**Deploy trigger:** -- `staging`: auto-deploy on push to main -- `production`: manual promote via `railway up --environment production` - or GitHub Actions workflow_dispatch - -**Domains:** -- staging: `staging.api.moleculesai.app` (Railway custom domain) -- production: `api.moleculesai.app` (unchanged) - -### 2. 
Neon: branch per environment - -Neon supports database branches (like git branches): - -```bash -# Create staging branch from main -neon branch create --project-id --name staging --parent main -``` - -- Staging DB has same schema, separate data -- Can reset staging by re-branching from main -- Production data never touched by staging tests - -### 3. Vercel: preview deployments - -Vercel already supports this natively: -- Push to main → deploys to `app.moleculesai.app` (production) -- Push to `staging` branch → deploys to preview URL - -**Or** use Vercel environments: -- `staging.app.moleculesai.app` → staging deployment -- `app.moleculesai.app` → production deployment - -### 4. GHCR: tagged images - -``` -platform-tenant:staging — built on every push to main -platform-tenant:latest — promoted from staging after verification -platform-tenant:sha-xxxxx — immutable, pinned to specific commit -``` - -**Publish workflow change:** -```yaml -# Current: pushes :latest on every main merge -# New: pushes :staging on every main merge -# pushes :latest only on manual promote -``` - -### 5. Cloudflare: staging subdomain - -Option A (simple): `*.staging.moleculesai.app` with its own tunnel/worker -Option B (full): separate Cloudflare zone for staging (overkill) - -Recommend Option A: -- Add `staging.moleculesai.app` DNS records -- Staging tenants get `slug.staging.moleculesai.app` subdomains -- Production tenants get `slug.moleculesai.app` (unchanged) - -### 6. 
EC2: staging tag - -Staging EC2 instances tagged with `Environment=staging`: -- Separate from production instances in AWS console -- Can use different AMI, instance type, security group -- Easy to identify and clean up - -## Environment variables - -| Variable | Staging | Production | -|----------|---------|------------| -| `MOLECULE_ENV` | `staging` | `production` | -| `DATABASE_URL` | Neon staging branch | Neon main branch | -| `TENANT_IMAGE` | `platform-tenant:staging` | `platform-tenant:latest` | -| `APP_DOMAIN` | `staging.moleculesai.app` | `moleculesai.app` | -| `CORS_ORIGINS` | `https://staging.app.moleculesai.app` | `https://app.moleculesai.app` | -| `ADMIN_TOKEN` | per-tenant (same mechanism) | per-tenant | - -## Promotion workflow - -### Automated (CI/CD) - -```yaml -# .github/workflows/promote-to-production.yml -name: Promote to Production -on: - workflow_dispatch: - inputs: - confirm: - description: 'Type "promote" to confirm' - required: true - -jobs: - promote: - if: github.event.inputs.confirm == 'promote' - steps: - # 1. Run staging smoke tests one more time - - run: bash tests/e2e/test_saas_tenant.sh - env: - TENANT_SLUG: smoke-test - BASE_URL: https://staging.api.moleculesai.app - - # 2. Tag Docker image - - run: | - docker pull ghcr.io/molecule-ai/platform-tenant:staging - docker tag ghcr.io/molecule-ai/platform-tenant:staging \ - ghcr.io/molecule-ai/platform-tenant:latest - docker push ghcr.io/molecule-ai/platform-tenant:latest - - # 3. Deploy CP to production - - run: railway up --environment production - - # 4. Production tenants auto-update within 5 min (Option B cron) -``` - -### Manual (for now) - -Until the automated workflow is built: -1. Verify on staging (`staging.api.moleculesai.app`) -2. `docker tag platform-tenant:staging platform-tenant:latest && docker push` -3. `railway up --environment production` -4. 
Monitor production health - -## What this prevents - -- CI breakage from untested path filters (today's dorny/paths-filter issue) -- Cloudflare edge cache poisoning (test DNS changes on staging subdomain) -- Workspace boot script regressions (test on staging EC2 first) -- DB migration failures (test on Neon staging branch) -- Auth/security regressions (staging has same auth stack) - -## Implementation order - -1. **Railway staging environment** — create + configure vars (~30 min) -2. **Neon staging branch** — create from main (~5 min) -3. **Staging DNS** — `staging.api.moleculesai.app` CNAME to Railway (~5 min) -4. **Publish workflow** — push `:staging` tag instead of `:latest` (~15 min) -5. **Promotion workflow** — manual trigger to promote staging → production (~30 min) -6. **Vercel staging** — configure preview deployment URL (~15 min) -7. **Staging smoke test** — automated test after staging deploy (~30 min) - -**Total:** ~2.5 hours for full staging pipeline. - -## Cost - -- Railway staging: ~$5/mo (same as production, but can be smaller) -- Neon staging branch: free (included in plan) -- EC2 staging instances: only when testing (terminate after) -- Vercel: free (preview deployments included) -- Cloudflare: free (same zone, additional records) diff --git a/content/docs/architecture/tenant-image-upgrades.md b/content/docs/architecture/tenant-image-upgrades.md deleted file mode 100644 index 60c18e7..0000000 --- a/content/docs/architecture/tenant-image-upgrades.md +++ /dev/null @@ -1,154 +0,0 @@ ---- -title: "Tenant Image Upgrade Strategies" -description: "Strategies for rolling a new platform-tenant image out to existing EC2 tenants, with trade-offs." ---- -# Tenant Image Upgrade Strategies - -> **Status:** Option B (sidecar auto-updater) implemented. Options A and C -> documented for future use. - -## Problem - -When we push a new `platform-tenant:latest` to GHCR, existing EC2 tenant -instances keep running the old image. 
New orgs get the latest image at boot, -but existing tenants fall behind — missing bug fixes, security patches, and -new features. - -## Option A: Rolling restart on publish (coordinated) - -The publish workflow calls a CP admin endpoint after pushing the image. -The CP iterates all running tenants and restarts them one by one. - -``` -publish-platform-image succeeds - → POST https://api.moleculesai.app/cp/admin/rolling-upgrade - → CP queries org_instances WHERE status = 'running' - → For each tenant (staggered, 30s apart): - 1. AWS SSM Run Command: docker pull + docker restart - 2. Wait for /health 200 - 3. Update org_instances.updated_at - 4. If health fails after 60s, rollback (docker run old image) - → Return summary: {upgraded: N, failed: M, skipped: K} -``` - -### Pros -- Immediate, coordinated upgrades across all tenants -- CP has full visibility into upgrade status -- Can implement canary (upgrade 1 tenant first, verify, then rest) -- Rollback capability per tenant - -### Cons -- Requires AWS SSM agent on EC2 instances (not installed yet) -- Alternatively requires SSH access from Railway → EC2 (network/key management) -- Brief downtime per tenant during restart (~10-30s) -- Blast radius: a bad image can take down all tenants before canary catches it - -### Implementation effort -- Add SSM agent to EC2 user-data script -- Add `POST /cp/admin/rolling-upgrade` handler -- Add upgrade step to publish workflow -- Add rollback logic -- ~2-3 days - -### When to use -- Urgent security patches that can't wait 5 min -- Breaking changes that need coordinated rollout -- When you want canary/staged deployment - ---- - -## Option B: Sidecar auto-updater (implemented) - -A cron job on each EC2 checks GHCR for a new image digest every 5 minutes. -If the digest changed, it pulls the new image and restarts the container. - -```bash -# Runs every 5 min on each EC2 (added to user-data) -*/5 * * * * /usr/local/bin/molecule-auto-update.sh -``` - -The update script: -1. 
`docker pull platform-tenant:latest` -2. Compare digest with running container's image digest -3. If different: `docker stop molecule-tenant && docker rm molecule-tenant && docker run ...` -4. Wait for `/health` 200 -5. Log result to `/var/log/molecule-auto-update.log` - -### Pros -- Zero CP involvement — fully autonomous per tenant -- Tenants upgrade within 5 min of any publish -- No SSH/SSM infrastructure needed -- Each tenant upgrades independently (natural canary) -- Simple to implement (2 lines in user-data + a small script) - -### Cons -- Up to 5 min delay between publish and tenant upgrade -- Brief downtime during restart (~10-30s) -- No centralized visibility into upgrade status -- Can't selectively hold back specific tenants -- All tenants track `latest` — no pinned versions - -### When to use -- Default for all tenants -- Works well for early-stage SaaS with frequent deploys - ---- - -## Option C: Blue-green via Worker (zero downtime) - -Each EC2 runs two container slots: `blue` (current) and `green` (new). -The Cloudflare Worker routes traffic to whichever is healthy. - -``` -EC2 instance: - molecule-tenant-blue → :8080 (current, serving traffic) - molecule-tenant-green → :8081 (new, starting up) - -Upgrade flow: - 1. Pull new image - 2. Start green on :8081 - 3. Health check green: GET :8081/health - 4. If healthy: update Worker routing (KV: slug → port 8081) - 5. Stop blue - 6. 
Next upgrade: blue becomes the new slot - -Worker routing: - KV key: "example-org" → {"ip": "", "port": 8081} - (port defaults to 8080 when not in KV) -``` - -### Pros -- Zero downtime — traffic switches atomically after health check -- Instant rollback — just switch back to the old slot -- Worker already exists — just add port to the routing lookup -- Health-verified before any traffic switches - -### Cons -- Double memory usage during transition (~512MB extra per tenant) -- More complex user-data script (manage two containers) -- Worker needs port-aware routing (KV schema change) -- Need to track which slot is active per tenant - -### Implementation effort -- Update user-data to manage blue/green containers -- Update Worker to read port from KV -- Add blue/green state tracking to CP (org_instances.active_slot) -- Update auto-updater script for blue-green swap -- ~3-5 days - -### When to use -- When tenants have SLAs requiring zero downtime -- Production deployments with paying customers -- After Option B proves the auto-update pattern works - ---- - -## Migration path - -``` -Now: Option B (auto-updater, 5 min delay, brief downtime) - ↓ -Growth: Option A (add SSM for urgent patches, keep B as default) - ↓ -Scale: Option C (zero-downtime for premium/enterprise tenants) -``` diff --git a/content/docs/incidents/INCIDENT_LOG.md b/content/docs/incidents/INCIDENT_LOG.md deleted file mode 100644 index a4eb885..0000000 --- a/content/docs/incidents/INCIDENT_LOG.md +++ /dev/null @@ -1,592 +0,0 @@ ---- -title: "Incident Log — molecule-core" -description: "Chronological incident log for molecule-core — summaries, resolutions, and references." ---- -# Incident Log — molecule-core - -> This file documents security incidents, outages, and degraded states. -> Active incidents are listed first. Resolved incidents remain for historical record. 
- ---- - -*Last updated: 2026-04-21T07:45Z by Core Platform Lead — Incident log rebuilt after linter reset* - ---- - -## Security Audit Cycle 6 — ALL CLEAR (2026-04-21 ~07:15Z) - -**SHA range:** e69cb26 → 674384b on main (~5 commits + ~10 merged PRs) -**Verdict:** ✅ No critical/high findings - -### Commits Reviewed — All CLEAN - -| Commit | Description | -|--------|-------------| -| `dc9c64e` / PR #1258 | F1097 org_id context — eliminates redundant 2nd SELECT in AdminAuth | -| `33f1d1a` | Canvas cascade-delete UX — `pendingDelete.hasChildren`, warning dialog | -| `0790d57` | Canvas metrics guard — null coalescing | -| `781c217` | CI YAML fix | -| `169120d` / PR #1310 | CWE-78/CWE-22 — exec form + path traversal guards | -| `e431fc4` / PR #1302 | CWE-918 SSRF — `isSafeURL` in `a2a_proxy.go` | -| `a66f889` / PR #1261 | CWE path-injection — `resolveInsideRoot` for template paths | - -Full audit saved to TEAM memory id `abc58b47`. - ---- - -## F1100 — workspace_restart.go Path Traversal (RESOLVED) - -**Severity:** Medium | **Finding ID:** F1100 -**Status:** Resolved — fix applied via `a66f889` (PR #1261) on both main and staging - -### Summary - -`workspace_restart.go:127-133` accepted `body.Template` (attacker-controlled) via raw `filepath.Join(h.configsDir, template)`, allowing path traversal (e.g. `../../../etc`) to escape `configsDir`. **Issue #1043 triage missed this — legitimate gap, not false positive.** - -Authenticated callers could pass a crafted `body.Template` value to escape the configs directory. - -### Fix Applied - -PR #1260 (intended) closed without merge. 
Fix landed via **PR #1261 (`a66f889`)** on both main and staging: - -```go -// Fixed (a66f889): -candidatePath, resolveErr := resolveInsideRoot(h.configsDir, template) -if resolveErr != nil { - template = "" // fallback fires safely -} -``` - -### References - -- PR #1260: closed without merge — superseded by PR #1261 -- PR #1261 (`a66f889`): merged ✅ -- Closes: #1043 - ---- - -## F1088 Credential Exposure — CLOSED - -**All prior F1088 entries below remain valid. Summary of current state:** - -- Credentials: MiniMax revoked (⚠️), GitHub PAT revoked (✅), Admin token — treat as potentially exposed -- BFG git-history scrub: NOT REQUIRED — incident management closure, 0 public forks confirmed -- Git history still contains values — admin token rotation recommended as precaution -- PR #1179 (`b89f3fd`) merged — active code is clean -- Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind main — superseded by incident management closure - -**Required remaining action:** Rotate `ADMIN_TOKEN` (`HlgeMb8...ShARE=`) as precaution. All other actions complete. - ---- - -### Summary - -Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains three production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appeared in the git diff and were permanently visible in the public commit history. - -### Credentials Status - -| # | Credential | Value | Status | -|---|------------|-------|--------| -| 1 | ANTHROPIC_AUTH_TOKEN | `sk-cp-lHt-QFSyZwZxeo...KVw` | ⚠️ Revoked or inactive (404 on API call) | -| 2 | GITHUB_TOKEN | `github_pat_11BPRRWQI0m...hsIJLIL` | ✅ Revoked (confirmed 401) | -| 3 | ADMIN_TOKEN | `HlgeMb8...ShARE=` | Needs confirmation — treated as active until proven otherwise | - -### Resolution - -PR #1179 (`b89f3fd`: "ci: retry — trigger fresh runner allocation") closed this finding. The incident was closed at the finding-management level. 
Git history scrub via BFG was discussed but deemed not required by security team (no active public forks confirmed, credentials were already revoked/inactive). - -Active code is clean (`d513a0c` replaced hardcoded defaults with env-var reads). - -### Summary - -Commit `d513a0ced549ef2be8903a7b4794256110ba1805` on staging (merged to main via PR #1098) contains two production credentials as hardcoded default values in `scripts/post-rebuild-setup.sh`. The credentials appear in the git diff and are permanently visible in the public commit history. - -The commit itself fixed the problem by replacing hardcoded defaults with env-var reads (MINIMAX_API_KEY, GITHUB_PAT). However, git history still shows the original values. - -### Credentials Exposed - -> **Token values redacted from this table 2026-04-26** to reduce public-search surface (the docs repo is publicly indexed). Short-suffix references match the convention in the Blast Radius table below (lines 134-137). Full values remain in `molecule-core` git history per the F1088 closure decision (no BFG scrub). - -| # | Credential | Value (short suffix) | Service | -|---|------------|----------------------|---------| -| 1 | ANTHROPIC_AUTH_TOKEN | `sk-cp-...KVw` | MiniMax API (api.minimax.io/anthropic) | -| 2 | GITHUB_TOKEN | `github_pat_...hsIJLIL` | GitHub (fine-grained PAT, scope unknown) | -| 3 | ADMIN_TOKEN | `HlgeMb8...ShARE=` | Platform admin authentication | - -### Affected Files - -- `scripts/post-rebuild-setup.sh` (commit d513a0c, PR #1098 → merged to staging → merged to main) - -### Timeline - -- **~2026-04-20T13:02Z**: Commit `d513a0c` pushed by `rabbitblood`. GitGuardian flagged credentials in the diff. Fix committed in same commit. -- **~2026-04-20T**: Credentials removed from active code, but git history still contains them. -- **2026-04-20T22:32Z**: Incident discovered and escalated. - -### Actions Taken - -1. Dev Lead notified (delegation failed — Dev Lead unreachable) -2. 
All child workspaces notified (delegation failed — all unreachable) -3. Incident documented in this file -4. Branch `origin/fix/credential-history-cleanup-f1088` exists but is 38 commits behind `origin/main` -5. **Incident CLOSED** — PR #1179 merged, finding management closure, BFG scrub deemed not required (no active public forks confirmed) - -### Blast Radius (Confirmed by Core-Security) - -| Credential | Test Result | Status | -|------------|-------------|--------| -| MiniMax API key (`sk-cp-...KVw`) | `404 Not Found` on real API call | ⚠️ **REVOKED** (or endpoint inactive) | -| GitHub PAT (`github_pat_...hsIJLIL`) | `401 Bad credentials` | ✅ **REVOKED** | -| Admin token (`HlgeMb8...ShARE=`) | Base64 — cannot test directly | ⚠️ **Treated as active** — recommend rotation as precaution | - -**Public forks:** 0 confirmed (GH API `/forks` returns none) — low fork blast radius. - -**Git history scope:** Credentials exist in both `main` and `staging` in commits `f787873`..`d513a0c`. They were introduced in `f787873` ("feat: nuke-and-rebuild.sh") and removed from active code in `d513a0c`. Both branches require BFG cleanup. - -### Required Actions (RESOLVED) - -- [x] Credentials revoked (MiniMax ⚠️, GitHub PAT ✅) -- [x] BFG git history cleanup **NOT REQUIRED** — incident management closure, no active public forks, credentials confirmed revoked/inactive -- [x] Team notification — documented in this log -- [ ] **Admin token rotation** — recommended as precaution (value still in git history, treat as potentially exposed) - -### BFG Repo-Cleaner Procedure - -**NOT REQUIRED** — F1088 closed without BFG scrub per security team decision. Retained for reference only. - -**Step 1 — Create credentials manifest (`creds.txt`) [NOT NEEDED]:** -``` - - - -``` -Full token values redacted from this doc 2026-04-26 (see note in the -Credentials Exposed table above). Pull from the Core-Security incident -ticket if a future revival of this BFG procedure is needed. 
- -**Step 2 — Clean origin/main:** -```bash -git clone --mirror https://git.moleculesai.app/molecule-ai/molecule-core /tmp/molecule-main-mirror -java -jar bfgr.jar --replace-text creds.txt --rewrite-not-committed-by-oss --no-blob-protection /tmp/molecule-main-mirror -cd /tmp/molecule-main-mirror && git push --mirror -``` - -**Step 3 — Clean origin/staging:** -```bash -git clone --mirror https://git.moleculesai.app/molecule-ai/molecule-core /tmp/molecule-staging-mirror -java -jar bfgr.jar --replace-text creds.txt --rewrite-not-committed-by-oss --no-blob-protection /tmp/molecule-staging-mirror -cd /tmp/molecule-staging-mirror && git push --mirror -``` - -**Step 4 — Notify team to re-clone both branches if cloned before ~13:02 UTC 2026-04-20.** - -### References - -- Commit: `d513a0ced549ef2be8903a7b4794256110ba1805` -- PR: #1098 (staging → main merge) -- Cleanup branch: `origin/fix/credential-history-cleanup-f1088` (behind main by 38 commits) -- Scanners triggered: GitGuardian -- Security investigation: Core-Security (confirmed credentials revoked via API tests) -- GitHub issue: #1282 (filed by Core-OffSec) -- **Closed by:** PR #1179 (`b89f3fd`) — incident management closure, BFG scrub deemed not required - -### Known Issue — PR #1230 Incomplete (QA Round 16, 2026-04-21) - -PR #1230 / commit `524e3c6` ("fix(security): replace err.Error() leaks") failed to carry mcp.go fixes into main's tree. All 3 MCP error leaks remain on main: -- `mcp.go:259`: "parse error: " + err.Error() -- `mcp.go:347`: "invalid params: " + err.Error() -- `mcp.go:352`: err.Error() -- `org_plugin_allowlist.go:260`: "detail": err.Error() - -Fix is covered by PR #1226 (rebased, MERGEABLE). Gap should close after #1226 merges. - ---- - -## CWE-918 SSRF — Backport to Main (RESOLVED) - -**Severity:** High -**Status:** Resolved — PR #1302 merged to main - -### Summary - -SSRF defence (`isSafeURL` in `a2a_proxy.go`) was backported to main to address CWE-918 (Server-Side Request Forgery). 
The fix prevents the A2A proxy from forwarding requests to internal network addresses (localhost, private ranges, etc.). - -### References - -- Commit: `e431fc4` (fix(security): backport SSRF defence (CWE-918) to main — isSafeURL in a2a_proxy.go (#1292) (#1302)) - ---- - -## CWE-22 + CWE-78 Security Fixes — Merged (RESOLVED) - -**Severity:** Critical -**Status:** Resolved — proper fixes merged to staging and main - -### Summary - -The `fix/cwe78-delete-via-ephemeral-shell-injection` branch was the right diagnosis but wrong implementation (removed `safeName` from `copyFilesToContainer`). The correct fixes were merged separately: - -| Location | Commit | Fix | -|----------|--------|-----| -| staging | `ce2491e` | CWE-22: `copyFilesToContainer` safeName + `deleteViaEphemeral` validateRelPath + exec form | -| main | `169120d` | CWE-78/CWE-22: block shell injection in `deleteViaEphemeral` | - -Both CWEs are fully resolved on both branches. The regression branch is superseded and must not be merged as-is. - -### Verification (staging `ce2491e`) - -`copyFilesToContainer` (container_files.go:73-99): -```go -clean := filepath.Clean(name) -if filepath.IsAbs(clean) || strings.Contains(clean, "..") { - return fmt.Errorf("path traversal blocked: %s", name) -} -safeName := filepath.Join(destPath, clean) -header := &tar.Header{Name: safeName, ...} ✅ -``` - -`deleteViaEphemeral` (container_files.go:152-168): -```go -validateRelPath(filePath) ✅ -Cmd: []string{"rm", "-rf", "/configs", filePath} ✅ exec form, no shell interpolation -``` - ---- - - - -**Severity:** High -**Period:** ~2026-04-20T22:00Z – 2026-04-21T03:30Z -**Finding IDs:** N/A (infra incident) -**Status:** Resolved - -### Summary - -All self-hosted macOS arm64 runners saturated. 27 runs queued, 0 in-progress, 0 completed. Only cancellations processing. PRs #1053 and #1036 had zero CI runs. - -### Root Causes (multiple) - -1. 
`changes` job ran on `[self-hosted, macos, arm64]` despite having zero macOS dependencies (plain `git diff`) — wasted runner slots -2. YAML corruption in `ci.yml` (JSON-escaped `\n` sequences from commits `12c52d4`/`5831b4e`) caused "workflow file issue" failures before any job could start -3. `cancel-in-progress: false` at workflow level caused stale runs to queue instead of being cancelled -4. Workflow-level concurrency not set — multiple in-flight runs queued on same ref - ---- - -## CI Stall — molecule-core/staging (RESOLVED 2026-04-21 ~07:05Z) - -**Severity:** High -**Period:** ~2026-04-21T02:47Z – ~2026-04-21T07:00Z -**Status:** Resolved — CI progressing normally, no config problems remain - -### Resolution - -All prior runner-saturation and YAML-corruption fixes were correct. The stall resolved naturally once stale queued runs drained. Current CI state (2026-04-21 ~07:07Z): - -- Staging run #24708961892: **success** (SHA `5d32373`) -- Staging run #24708976467: **success** (changes job, SHA `72d825f`) -- Main run #24708984339: queued (normal — healthy queue, not stalled) -- Runner agent healthy — no dead slots - -### Root Causes (all resolved) - -1. `changes` job on `[self-hosted, macos, arm64]` — fixed by moving to `ubuntu-latest` (`9601545`) -2. YAML corruption in `ci.yml` — fixed by PR #1264 / `b61692c` ✅ -3. `cancel-in-progress: false` at workflow level — reverted to `true` on staging ✅ -4. `cancel-in-progress: false` on main — correct for single-runner env, aligned via PR #1248 ✅ - -### Staging CI Config (confirmed healthy) - -- `ci.yml`: `cancel-in-progress: true`, `changes` job on `ubuntu-latest` ✅ -- `codeql.yml`: `cancel-in-progress: false` ✅ -- `e2e-api.yml`: `cancel-in-progress: false` ✅ - -### Infra Recommendations (for long-term stability) - -1. Provision org-wide GitHub App installation token for CI automation (PATs rotate too frequently) -2. Update remote URLs on controlplane and tenant-proxy repos -3. 
Monitor runner agent health on mac mini — restart agent if future stalls recur - ---- - -## PR #1242 YAML Corruption — RESOLVED (PR never merged) - -**Severity:** Critical -**Status:** Resolved — PR #1242 closed without merge, staging unaffected - -### Summary - -PR #1242 (`fix/ci-runner-queue-contention`) branch contained a YAML corruption in `ci.yml` — the `concurrency` block was replaced with a commit-SHA string literal: - -```yaml -e4a62e1 (ci: add workflow-level concurrency to ci.yml and codeql.yml) -``` - -However, PR #1242 was **closed without merging**. Staging received `cancel-in-progress: true` via PR #1264 (commit `b61692c`) instead, which is the correct clean version. - -### Current State (updated 2026-04-21 ~04:30Z) - -- **main:** `cancel-in-progress: false` ✅ (from PR #1248 / `2ffd11c` or similar clean commit) -- **staging:** `cancel-in-progress: true` (via `0b30465` tick restore after corruption) -- **PR #1248** (`2ffd11c`): open, sets staging `cancel-in-progress: false` — aligns staging with main ✅ -- **Main has moved to `false`** — staging should follow to stay consistent - -### PR #1248 — URGENT MERGE - -PR #1248 (`fix/ci: restore corrupted ci.yml concurrency block`) by Dev Lead: -- Fixes the corruption pattern (same as prior incident) -- Sets `cancel-in-progress: false` — correct for single-runner environment -- Aligns staging CI config with main (which already has `false`) -- Must merge before any further CI runs on staging - -### References - -- PR: #1242 (`fix/ci-runner-queue-contention`) — closed, not merged -- Staging corruption restored via: PR #1264 / `b61692c` -- PR #1248 (`2ffd11c`): open, Dev Lead fix, `cancel-in-progress: false` -- Main: `cancel-in-progress: false` ✅ - ---- - -## PR #1036 QA Audit (STALE) - -**Severity:** Low -**Date:** 2026-04-20 (QA audit performed) -**Status:** Stale — CI infrastructure has been fixed since audit - -### Summary - -QA audit (2026-04-20) flagged CI as failing on PR #1036. 
However, CI was failing due to infrastructure issues (runner saturation, YAML corruption) that have since been resolved. The audit should be re-run now that staging CI is healthy. - ---- - -## PR #1246 / #1247 — Sed Regression Fix — RESOLVED (PR #1247 merged) - -**Severity:** Critical -**Status:** Resolved — PR #1247 merged to main (2026-04-21 ~03:18Z) - -### Summary - -PR #1246 (`364712d`) was closed without merging. However, **PR #1247** (`04be218`) achieved the same fix cleanly and merged to main: - -``` -fix(go): replace $1 literal with resp.Body.Close() in 7 files (#1247) -``` - -Commit `04be218` (merged by molecule-ai[bot]) applied: -``` -sed -i 's/defer func() { _ = \$1 }()/defer func() { _ = resp.Body.Close() }()/g' -``` - -### Affected Files (all fixed on main) - -- `workspace-server/cmd/server/cp_config.go` -- `workspace-server/internal/handlers/a2a_proxy.go` -- `workspace-server/internal/handlers/github_token.go` -- `workspace-server/internal/handlers/traces.go` -- `workspace-server/internal/handlers/transcript.go` -- `workspace-server/internal/middleware/session_auth.go` -- `workspace-server/internal/provisioner/cp_provisioner.go` (3 occurrences) - -**Staging:** Fix present via prior commits. `cp_config.go` on staging has SHA `d1021c2` (correct form). - -**PR #1246:** Closed without merging — superseded by PR #1247. No further action needed. - ---- - -## CWE-78/CWE-22 Branch — RESOLVED (proper fixes merged separately) - -**Severity:** Critical -**Status:** Resolved — proper fixes merged via `ce2491e` (staging) and `169120d` (main) - -### Summary - -The `fix/cwe78-delete-via-ephemeral-shell-injection` branch (commit `17419dd`) was **correct** for CWE-78 (`deleteViaEphemeral` exec form + `validateRelPath`) but **regressed** `copyFilesToContainer` by removing the `safeName` path-traversal guard. 
- -**Resolution — both branches merged to main and staging:** - -| Branch | Commit | Status | -|--------|--------|--------| -| staging | `ce2491e` — fix(security): CWE-22 in copyFilesToContainer and deleteViaEphemeral | ✅ merged | -| main | `169120d` — fix(security): CWE-78/CWE-22 — block shell injection in deleteViaEphemeral | ✅ merged | - -### What was fixed (staging `ce2491e`) - -- `copyFilesToContainer`: `filepath.Clean` + `IsAbs` + `strings.Contains("..")` validation, `safeName` in tar header ✅ -- `deleteViaEphemeral`: `validateRelPath(filePath)` check before rm command ✅ -- Both CWE-22 and CWE-78 addressed correctly - -### `fix/cwe78-delete-via-ephemeral-shell-injection` branch status - -**Do NOT merge** — it's now superseded by `ce2491e`/`169120d`. The regression it introduced (removing `safeName` from `copyFilesToContainer`) was never the right approach. If this branch is revived, it must be rebased on top of `ce2491e` to preserve existing CWE-22 protections while adding the CWE-78 exec-form fix. - ---- - -## F1085 Regression Branch (`fix/f1085-regression-1283`) — IS a Regression - -**Severity:** High -**Status:** Active — branch removes the confirmed-good F1085 fix (confirmed 2026-04-21 ~07:10Z) - -### Summary - -Branch `origin/fix/f1085-regression-1283` (commit `3b244e6`) removes `redactSecrets(workspaceID, content)` from `seedInitialMemories` in `workspace_provision.go:249`: - -```diff --`, workspaceID, redactSecrets(workspaceID, content), scope, awarenessNamespace); err != nil { -+`, workspaceID, content, scope, awarenessNamespace); err != nil { -``` - -**Staging still has the correct fix** (`workspace_provision.go:253` on origin/staging confirms `redactSecrets` is present). This branch is behind staging and would regress it if merged. - -### Required Fix - -Close or revert this branch. `redactSecrets` must remain in `seedInitialMemories`. 
If there is a legitimate reason to change this (e.g., a different redaction strategy), document it clearly in the PR before merging. - ---- - -## F1097 — org_id Context Fix — RESOLVED - -**Severity:** Medium -**Status:** Resolved — PR #1258 merged to main (`dc9c64e`) - -### Summary - -`orgToken.Validate` refactored to return `org_id` directly, eliminating the redundant 2nd SELECT in `AdminAuth`. All SQL parameterized correctly. - -### References - -- PR #1258 (`dc9c64e`): fix(F1097): set org_id in Gin context for org-token callers - ---- - -## PR #1226 — err.Error() Leaks (STALE — closed without merge) - -**Severity:** Medium -**Status:** Open — PR closed without merging, leaks still present on main - -### Summary - -PR #1226 (`fix(security): sanitize remaining err.Error() leaks + errcheck artifacts/client.go`) was **closed without merging**. The following leaks remain on main: - -| File | Line | Code | Fix | -|------|------|------|-----| -| `mcp.go` | 259 | `"parse error: " + err.Error()` | → `"parse error: invalid JSON request body"` | -| `mcp.go` | 347 | `"invalid params: " + err.Error()` | → `"invalid params: malformed JSON"` | -| `mcp.go` | 352 | `err.Error()` | → `"dispatch error"` | -| `org_plugin_allowlist.go` | 260 | `"detail": err.Error()` | → `"detail": "plugin name validation failed"` | -| `admin_memories.go` | 99 | `"invalid JSON: " + err.Error()` | → `"invalid JSON request body"` | - -**Already fixed:** `artifacts/client.go:175` — `defer func() { _ = resp.Body.Close() }()` confirmed correct (via PR #1247). - -### Action Required - -Reopen PR #1226 and fast-track merge. Alternatively, cherry-pick the 4 commits from that PR onto a fresh branch. 
- ---- - -## QA Round 18 — orgs-page Test Regression (FIXED on main, pending staging port) - -**Severity:** Medium -**SHA tested:** `ce33da5` (PR #1257 branch merge with staging) -**Status:** Regression identified in PR #1255, fixed on main, not yet on staging - -### Findings - -| Finding | Status | -|---------|--------| -| Canvas tests: 53 passed, **1 FAILED** | orgs-page.test.tsx line 133 — `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` | -| PR #1257 conflict | MERGEABLE, approved — closed without merge; fix is on main/staging via `a66f889` | -| PR #1255 regression | Introduced orgs-page test flakiness — +18/-2 in orgs-page.test.tsx | - -### orgs-page Test Regression — Root Cause - -PR #1255 (`e885fa1`) regressed the timer fix from PR #1235. It replaced `waitFor()` with `vi.useRealTimers()` + raw `setTimeout(50)` without `act()` — causing microtask flush issues. - -### Resolution - -**Main:** Fixed in `674384b` (PR #1313) — wraps all 10 affected `vi.advanceTimersByTimeAsync(50)` calls in `act(async () => { ... })`. All 813 canvas tests pass on main. -**Staging:** Regression NOT yet fixed — `origin/staging` is 13 commits behind main. - -### Action needed - -Cherry-pick or port the orgs-page test fix from `674384b` to staging. - ---- - -## Issue #1124 — Orchestrator GET /workspaces 404: Env Var Misconfiguration (OPEN) - -**Severity:** Medium -**Status:** Active — root cause confirmed, fix pending, delegated to Core-BE - -### Summary - -Orchestrator (workspace agent, `workspace/` directory) GET /workspaces/{WORKSPACE_ID} returns 404 due to missing or empty `WORKSPACE_ID` env var. Confirmed via code review (2026-04-21 ~07:10Z). 
- -### Root Causes - -**Platform-side (provisioner.go:375-377) is CORRECT:** -```go -env := []string{ - fmt.Sprintf("WORKSPACE_ID=%s", cfg.WorkspaceID), // ✅ correctly injected - "WORKSPACE_CONFIG_PATH=/configs", - fmt.Sprintf("PLATFORM_URL=%s", cfg.PlatformURL), -} -``` -The platform injects `WORKSPACE_ID` at container provision time. **The bug is in the Python orchestrator modules** that default to empty string instead of validating the injected value. - -**Buggy Python module-level defaults (empty string → broken API calls):** -| File | Line | Code | -|------|------|------| -| `workspace/a2a_cli.py` | 24 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` | -| `workspace/a2a_client.py` | 17 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` | -| `workspace/coordinator.py` | 26 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` | -| `workspace/consolidation.py` | 22 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` | -| `workspace/molecule_ai_status.py` | 25 | `WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "")` | - -When `WORKSPACE_ID` is empty, API calls produce URLs like `/workspaces//heartbeat` or `/registry/discover/` — platform returns 404 or wrong routing. - -**Note — main.py is already correct:** -```python -workspace_id = os.environ.get("WORKSPACE_ID", "workspace-default") # main.py:55 ✅ -``` -However, `main.py` uses a local variable — it doesn't export `WORKSPACE_ID` as a module constant. The other modules that import `WORKSPACE_ID` from `a2a_client` etc. still get the empty-string default. - -### Fix Required (Quick Win for Core-BE) - -**Option A — Fail fast at module import (recommended):** -```python -WORKSPACE_ID = os.environ.get("WORKSPACE_ID") -if not WORKSPACE_ID: - raise RuntimeError("WORKSPACE_ID environment variable is required but not set") -``` -Apply to all 5 affected modules. This surfaces the misconfiguration immediately instead of producing silent 404s downstream. 
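Because Option A has to be applied identically in five modules, one way to keep the check consistent is a small shared helper. The sketch below is illustrative only: the `require_env` name and the idea of a shared module are assumptions, not part of the existing `workspace/` code, and the error message simply mirrors the one in Option A.

```python
import os


def require_env(name, environ=None):
    """Return the non-empty value of an environment variable or fail fast.

    `require_env` is a hypothetical helper, not an existing function in the
    workspace runtime. `environ` defaults to os.environ and is only a
    parameter so the check can be exercised without touching the process env.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        # Same failure mode as Option A: surface the misconfiguration at
        # import time instead of building URLs like /workspaces//heartbeat.
        raise RuntimeError(f"{name} environment variable is required but not set")
    return value
```

Under this assumption, each affected module would replace its `os.environ.get("WORKSPACE_ID", "")` line with `WORKSPACE_ID = require_env("WORKSPACE_ID")`, so an unset or whitespace-only value stops the process at import time.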
- -**Option B — Align with main.py's approach (safer):** -```python -WORKSPACE_ID = os.environ.get("WORKSPACE_ID", "workspace-default") -``` -But this masks real misconfigurations. Option A is better. - -### Modules Requiring Fix - -- `workspace/a2a_cli.py` — line 24 -- `workspace/a2a_client.py` — line 17 -- `workspace/coordinator.py` — line 26 -- `workspace/consolidation.py` — line 22 -- `workspace/molecule_ai_status.py` — line 25 - -### PLATFORM_URL Note - -All modules default to `http://platform:8080` (container mesh hostname). This is correct for in-container use but fails outside Docker. No action needed for in-container orchestrators — the platform injects `PLATFORM_URL` at provision time which overrides this default. - -### Owner - -Core-BE — delegated to Dev Lead (A2A failed). Core-BE sub-team: please pick up. - -### Fix PR - -[PR #1336](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1336) filed — `fix(orchestrator): fail-fast if WORKSPACE_ID env var is unset/empty`. Targets staging. Labels: bug, needs-work, area:backend-engineer, area:dev-lead. - ---- - -*Last updated: 2026-04-21T07:10Z by Core Platform Lead (post-restart session — all findings re-verified)* \ No newline at end of file diff --git a/content/docs/migration/a2a-sdk-v0-to-v1.mdx b/content/docs/migration/a2a-sdk-v0-to-v1.mdx deleted file mode 100644 index 45f88f3..0000000 --- a/content/docs/migration/a2a-sdk-v0-to-v1.mdx +++ /dev/null @@ -1,214 +0,0 @@ ---- -title: "a2a-sdk v0 → v1 migration" -description: "Cheat sheet for migrating workspace runtime code (and forks) from a2a-sdk 0.3.x to 1.x — renamed/removed symbols, common error shapes, before/after diffs." ---- - -import { Callout } from 'fumadocs-ui/components/callout'; - -The `a2a-sdk` Python package released v1.0 in late April 2026. The -Molecule workspace runtime migrated under tracking ID **KI-009** and -shipped in `molecule-ai-workspace-runtime` **v0.1.11** (commit -`d5cf872`, PR #39). 
The platform now runs exclusively on v1. - -If you're consuming the platform's published wheel, bumping -`molecule-ai-workspace-runtime>=0.1.11` handles the migration for -you. If you maintain a fork of the runtime, an external agent talking -A2A directly, or your own adapter that imports from `a2a.*`, this page -is your checklist. - -## Why migrate - -- **Upstream**: `a2a-sdk` 1.0 reorganised the import surface, flattened - `Part`, removed deprecated capability flags, and replaced the - `A2AStarletteApplication` wrapper with explicit Starlette route - factories. -- **Platform**: as of 2026-04-24 the platform sends/receives via v1 - shapes natively. The SDK ships a v0_3 compat layer (enabled in the - runtime via `enable_v0_3_compat=True` on `create_jsonrpc_routes`) so - in-flight 0.x callers don't break, but new code should target v1. -- **Forks/external runtimes**: v0 code throws on `import a2a.utils` - and `from a2a.server.apps import A2AStarletteApplication` once you - install v1, so the migration is a hard cutover at install time, not - a soft deprecation. - -## Cheat sheet — renamed and removed symbols - -The four breaking changes that hit the Molecule runtime during KI-009. -All four are confirmed against -`molecule-core/workspace/` source. - -### 1. `new_agent_text_message` renamed to `new_text_message` - -- **v0 location**: `a2a.utils.new_agent_text_message` -- **v1 location**: `a2a.helpers.new_text_message` - -Both the module path and the symbol name changed. - -### 2. `Part` API flattened — `TextPart` removed - -- **v0**: `Part(root=TextPart(text="..."))` — `Part` wrapped a `root` - union of `TextPart` / `FilePart` / `DataPart`. -- **v1**: `Part(text="...")` — `Part` accepts the text payload - directly. `TextPart` no longer exists as a public symbol. 
- -`FilePart` / `DataPart` are similarly flattened (`Part(file=...)`, -`Part(data=...)`); the Molecule runtime only emits text parts so the -file/data shapes weren't exercised in KI-009 and aren't covered by -this guide. - -### 3. `A2AStarletteApplication` removed — use route factories - -- **v0**: `from a2a.server.apps import A2AStarletteApplication` then - `A2AStarletteApplication(agent_card, request_handler).build()`. -- **v1**: `from a2a.server.routes import create_agent_card_routes, - create_jsonrpc_routes` then build a Starlette app from the returned - route lists. - -The factories also let you mount the JSON-RPC endpoint at any path -(the runtime mounts at `/` because the platform POSTs to root, see -`workspace/main.py:279`). - -### 4. `state_transition_history` capability flag removed - -- **v0**: `AgentCapabilities(streaming=..., push_notifications=..., - state_transition_history=True)` was a per-agent opt-in. -- **v1**: the field is gone from `AgentCapabilities`. Per the SDK's own - `a2a/compat/v0_3/conversions.py`: *"No longer supported in v1.0"*. - The capability is now universal — `Task.history` is always available - and `tasks/get` accepts `historyLength` via `apply_history_length()`. - -If you pass `state_transition_history=...` as a kwarg to -`AgentCapabilities` under v1, Pydantic will reject it. Drop the kwarg. -See [`workspace/main.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/main.py) -for the explanatory comment that prevents future accidental re-adds. - -## Common error shapes - -When v0 code runs against the v1 SDK, the failure modes look like this: - -| Error | Cause | -|---|---| -| `ModuleNotFoundError: No module named 'a2a.utils'` | v0 import path; module renamed to `a2a.helpers`. | -| `ImportError: cannot import name 'A2AStarletteApplication' from 'a2a.server.apps'` | The whole `a2a.server.apps` module is gone in v1. Switch to `a2a.server.routes` factories. 
| -| `ImportError: cannot import name 'TextPart' from 'a2a.types'` | Flattened `Part` API; use `Part(text=...)`. | -| `ValueError: Protocol message AgentCapabilities has no "state_transition_history" field` | Removed capability flag passed as kwarg; drop it. | -| `ValueError: Protocol message Part has no "root" field` | v0 `Part(root=TextPart(...))` shape against v1 schema; flatten to `Part(text=...)`. | - -The protobuf-style `ValueError` messages always follow the pattern -`Protocol message {Type} has no "{field}" field` — that's the -fingerprint of "v0 shape against v1 schema." Treat it as a v0→v1 hint -even if the field name isn't on the cheat sheet above. - -## Migration checklist - -1. **Bump the dep** — `a2a-sdk[http-server]>=0.3.25` is the floor; remove - any `<1.0` upper bound. The Molecule wheel uses - `a2a-sdk[http-server]>=0.3.25` with no upper bound (see - [`molecule-ai-workspace-runtime/pyproject.toml`](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime/src/branch/main/pyproject.toml)). -2. **Fix imports** — sweep the four renamed/removed symbols above. A - safe grep is `grep -rn "from a2a\\|import a2a"` across your tree. -3. **Fix removed-field reads/writes** — search for - `state_transition_history` usage and delete the kwarg/field access. -4. **Flatten `Part` constructors** — search for `Part(root=` and - convert to `Part(text=...)` / `Part(file=...)` / `Part(data=...)`. -5. **Replace the app factory** — search for `A2AStarletteApplication` - and rewrite the bootstrap using `create_agent_card_routes` + - `create_jsonrpc_routes`. Pass `enable_v0_3_compat=True` to - `create_jsonrpc_routes` if your peers may still be on v0. -6. **Re-run tests** — fixture-level mocks of `a2a.helpers` / - `a2a.utils` need to mock both names so tests still pass during the - rename rollout (see - [`workspace/tests/conftest.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/tests/conftest.py) - for the dual-name pattern).
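
The search steps in the checklist can be combined into one pass over a source tree. The following is a minimal sketch, not part of the runtime or the SDK: the function names `scan_source` and `scan_tree`, the `V0_PATTERNS` table, and the hint strings are all illustrative assumptions, and only the symbol names being searched for come from the cheat sheet. It uses only the Python standard library and does not import `a2a`.

```python
import re
from pathlib import Path

# v0-only symbols named in the cheat sheet; the hint text is illustrative.
V0_PATTERNS = {
    r"\ba2a\.utils\b": "import from a2a.helpers instead",
    r"\bnew_agent_text_message\b": "renamed to new_text_message",
    r"\bTextPart\b": "use Part(text=...)",
    r"\bPart\(\s*root\s*=": "flatten to Part(text=...) / Part(file=...) / Part(data=...)",
    r"\bA2AStarletteApplication\b": "use create_agent_card_routes + create_jsonrpc_routes",
    r"\bstate_transition_history\b": "drop the kwarg/field access",
}


def scan_source(text):
    """Return (line_number, hint) pairs for every v0-only pattern found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, hint in V0_PATTERNS.items():
            if re.search(pattern, line):
                hits.append((lineno, hint))
    return hits


def scan_tree(root):
    """Scan every .py file under root; return {path: hits} for files with hits."""
    report = {}
    for path in Path(root).rglob("*.py"):
        hits = scan_source(path.read_text(encoding="utf-8", errors="replace"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree(".")` from the root of a fork and printing each `path:line: hint` entry gives a to-do list for checklist steps 2 through 5. A regex sweep like this can report false positives (for example, a comment that mentions `TextPart`), so treat the output as a list of places to review rather than a list of confirmed breakages.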
- -## Before / after diffs - -### `new_agent_text_message` → `new_text_message` - -```diff --from a2a.utils import new_agent_text_message -+from a2a.helpers import new_text_message - - async def execute(self, context, event_queue): -- await event_queue.enqueue_event(new_agent_text_message("hello")) -+ await event_queue.enqueue_event(new_text_message("hello")) -``` - -### Flat `Part` API - -```diff --from a2a.types import Part, TextPart -+from a2a.types import Part - --msg_parts = [Part(root=TextPart(text=final_text))] -+msg_parts = [Part(text=final_text)] -``` - -### `AgentCapabilities` — drop `state_transition_history` - -```diff - capabilities=AgentCapabilities( - streaming=config.a2a.streaming, - push_notifications=config.a2a.push_notifications, -- state_transition_history=True, - ), -``` - -### `A2AStarletteApplication` → route factories - -```diff --from a2a.server.apps import A2AStarletteApplication -+from a2a.server.routes import create_agent_card_routes, create_jsonrpc_routes - --app = A2AStarletteApplication( -- agent_card=agent_card, -- http_handler=request_handler, --).build() -+routes = [] -+routes.extend(create_agent_card_routes(agent_card)) -+routes.extend(create_jsonrpc_routes( -+ request_handler=request_handler, -+ rpc_url="/", -+ enable_v0_3_compat=True, -+)) -+app = Starlette(routes=routes) -``` - -The `enable_v0_3_compat=True` flag on `create_jsonrpc_routes` is what -keeps in-flight v0 callers (peers that haven't migrated yet) from -breaking — it accepts the old method names and translates them. The -Molecule runtime ships with this flag on (see -[`workspace/main.py`](https://git.moleculesai.app/molecule-ai/molecule-core/src/branch/main/workspace/main.py)); -strip it once your entire fleet is on v1. - -## For downstream consumers - -- **Using the published wheel** (`pip install - molecule-ai-workspace-runtime>=0.1.11`): the migration is in the - wheel — no code changes needed in your adapter or workspace template - beyond bumping the pin. 
-- **Running a fork of the runtime**: cherry-pick or rebase against - commit `d5cf872` ("feat: migrate a2a-sdk 1.x (KI-009) (#39)") in - `molecule-ai-workspace-runtime`. The diff is the canonical reference - for what KI-009 actually changed. -- **Standalone external agent** (talking A2A without the wheel): apply - the [Migration checklist](#migration-checklist) directly to your - source. The four cheat-sheet items are the entire surface that - changed for the typical agent role; only `Part` flattening and the - `state_transition_history` removal affect on-the-wire shapes — the - other two are import-only. - - -The wheel keeps `enable_v0_3_compat=True` on `create_jsonrpc_routes`, -so a v0 peer can still hit a v1 wheel and vice versa during the -migration window. You don't need to coordinate a fleet-wide cutover — -migrate at your own pace. - - -## See also - -- [`molecule-ai-workspace-runtime` v0.1.11 release](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-runtime/releases/tag/v0.1.11) — first wheel containing KI-009 -- PR #39 (feat: migrate a2a-sdk 1.x / KI-009) — closed without merge; PR content is historical -- PR #48 (feat(a2a): dual-compat for a2a-sdk 0.3.x and 1.x) — closed without merge; PR content is historical -- [Bring Your Own Runtime (MCP)](/docs/runtime-mcp) — universal wheel install path -- [External Agents](/docs/external-agents) — manual A2A path for non-MCP runtimes diff --git a/content/docs/research/cognee-architecture-deep-dive.md b/content/docs/research/cognee-architecture-deep-dive.md deleted file mode 100644 index d3e5040..0000000 --- a/content/docs/research/cognee-architecture-deep-dive.md +++ /dev/null @@ -1,69 +0,0 @@ ---- -title: "Cognee Architecture Deep-Dive — Workspace Isolation" -description: "Deep-dive into Cognee's isolation primitives versus Molecule AI's per-workspace memory requirements." 
---- -# Cognee Architecture Deep-Dive — Workspace Isolation - -**Date:** 2026-04-20 -**Issue:** Molecule-AI/molecule-core#1146 -**Research by:** Research Lead -**Status:** Complete - ---- - -## Executive Summary - -Cognee has **dataset-level isolation primitives** but **no storage-layer enforcement** and **no native `workspace_id` support** in its MCP tool interface. Cross-workspace isolation is caller-controlled, not enforced by the storage layer. - ---- - -## Isolation Layer Analysis - -| Layer | Mechanism | Enforced? | Risk | -|-------|-----------|-----------|------| -| Storage (Postgres) | No RLS, no schema namespacing | ❌ None | High | -| App — dataset | `dataset_name` passed per tool call | ⚠️ Caller-controlled | Medium | -| App — user | `get_default_user()` internal resolver only | ⚠️ Soft | Medium | -| MCP `workspace_id` param | Not present in cognee-mcp interface | ❌ N/A | High | - ---- - -## Key Findings - -1. **Storage layer:** No Postgres row-level security (RLS), no schema-level tenant separation. Any admin with DB access can read any tenant's data. - -2. **Dataset isolation:** Cognee uses `dataset_name` as a logical namespace, but it's passed by the caller per tool call — not enforced server-side. A misconfigured or malicious caller could read/write across datasets. - -3. **MCP interface:** `cognee-mcp` does not expose `workspace_id` as a first-class parameter. Workspaces would need to be mapped to dataset names externally. - -4. **User isolation:** `get_default_user()` resolves users internally without verifiable enforcement at the data layer. 
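
Findings 2 and 3 imply that any workspace-to-dataset mapping has to live outside Cognee, and that the caller's own `dataset_name` cannot be trusted. A minimal sketch of that idea follows, under stated assumptions: the `ws_` prefix scheme, the `WorkspaceScopeError` type, the `dataset_for` and `scope_tool_args` helpers, and the shape of `tool_args` as a plain dict are all hypothetical and are not part of `cognee-mcp`. Only the `dataset_name` and `workspace_id` names come from the findings above.

```python
import re


class WorkspaceScopeError(Exception):
    """Raised when a call tries to reach outside its workspace's dataset."""


# Hypothetical constraint on workspace IDs so they are safe to embed in a name.
_WORKSPACE_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def dataset_for(workspace_id):
    """Map a workspace ID to a dataset name (assumed 'ws_' prefix scheme)."""
    if not _WORKSPACE_ID.fullmatch(workspace_id):
        raise WorkspaceScopeError(f"invalid workspace_id: {workspace_id!r}")
    return f"ws_{workspace_id}"


def scope_tool_args(workspace_id, tool_args):
    """Return a copy of tool_args with dataset_name pinned to the workspace.

    A caller-supplied dataset_name that points at another workspace is
    rejected instead of being silently overwritten, so misuse is visible.
    """
    expected = dataset_for(workspace_id)
    requested = tool_args.get("dataset_name")
    if requested is not None and requested != expected:
        raise WorkspaceScopeError(
            f"workspace {workspace_id!r} may not access dataset {requested!r}"
        )
    return {**tool_args, "dataset_name": expected}
```

The design choice here is to derive the dataset name from the authenticated workspace on every call rather than accept it from the caller. This only moves enforcement into the wrapper; it does not change the storage-layer finding, since anyone with direct database access still bypasses it.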
- ---- - -## Migration Implications - -Adopting Cognee as the memory substrate requires an **auth bridge**: - -- The bridge wraps cognee-mcp and injects `workspace_id` → `dataset_name` mapping -- All tool calls are routed through the bridge, which enforces tenant context -- Estimated effort: **~100–200 LOC** for the MCP proxy wrapper -- This is a pragmatic path — the bridge provides the isolation Cognee's storage layer lacks - ---- - -## Recommendation - -**Attempt the auth bridge prototype first (1–2 days of engineering):** -1. Build MCP proxy that maps workspace_id to dataset_name on each call -2. Validate that cross-workspace calls are correctly rejected -3. If clean → adopt Cognee for Phase 9 -4. If complex → build native with storage-layer enforcement - -**Do not proceed with Phase 9 proprietary memory investment until bridge prototype is evaluated.** - ---- - -## Sources - -- Cognee GitHub: https://github.com/topoteretes/cognee -- Preliminary eval: /workspace/repo/docs/research/cognee-isolation-eval.md diff --git a/content/docs/research/cognee-isolation-eval.md b/content/docs/research/cognee-isolation-eval.md deleted file mode 100644 index b3656a4..0000000 --- a/content/docs/research/cognee-isolation-eval.md +++ /dev/null @@ -1,41 +0,0 @@ ---- -title: "Cognee Workspace Isolation Evaluation" -description: "Evaluating Cognee, an open-source AI memory engine, against Molecule AI's hierarchical memory isolation needs." ---- -# Cognee Workspace Isolation Evaluation - -**Date:** 2026-04-20 -**Issue:** Molecule-AI/molecule-core#1146 -**Status:** Preliminary — needs deeper architecture review - -## Summary - -Cognee (Apache-2.0, by Topoteretes UG) is an open-source AI memory engine with a shipped MCP component. It has direct overlap with Molecule AI's Phase 9 hierarchical memory architecture. 
- -## Workspace Isolation Assessment - -**Signal: Partial/Positive** - -Cognee's GitHub README explicitly lists "agentic user/tenant isolation, traceability, OTEL collector, audit traits" as a core architectural feature. - -This is a positive signal. However: -- The README mention does not specify the technical mechanism (namespace-level separation? separate vector DB instances per tenant? row-level security in a shared DB?) -- The cognee-mcp MCP component's handling of multi-workspace contexts is not documented in the surface-level readme - -**Verdict:** Cognee claims tenant isolation. Further due diligence required before treating this as confirmed. - -## Next Steps - -1. **Deep-dive into cognee architecture docs** — check if isolation is enforced at the storage layer (separate DB/collection per workspace), application layer (row-level), or both -2. **Test cognee-mcp with a multi-workspace scenario** — the MCP tool interface should reveal whether workspace_id is a first-class parameter -3. **Check cognee's GitHub issues/discussions** — any community reports of cross-tenant data leakage? -4. **Evaluate migration path** — if Cognee is adopted, what's involved in migrating existing Phase 9 work? - -## Recommendation - -Proceed with Phase 9 build-vs-buy review. Cognee is a credible candidate — isolation is claimed but mechanism needs verification. The Phase 9 halt stands until this is resolved. 
- -## Sources - -- https://github.com/topoteretes/cognee (README, 2026-04-20) -- /workspace/repo/research/cognee-memo.md diff --git a/content/docs/tutorials/saas-federation.md b/content/docs/tutorials/saas-federation.md index 41e4170..f0ee957 100644 --- a/content/docs/tutorials/saas-federation.md +++ b/content/docs/tutorials/saas-federation.md @@ -239,7 +239,7 @@ This terminates all EC2 instances, drops the Neon branch, and removes the org re - **Scoped roles**: give different team members read-only vs admin access within a tenant org (roadmap: Phase 34) - **Usage-based billing**: Meter workspace runtime and forward events to Stripe for custom billing tiers -For runbook-level details on the provisioning flow, see the architecture docs at [`docs/architecture/saas-prod-migration-2026-04-19`](/docs/architecture/saas-prod-migration-2026-04-19). +For the provisioning flow internals, see the [Provisioner](/docs/architecture/provisioner) and [Workspace Tiers](/docs/architecture/workspace-tiers) reference. For the API reference, see [`docs/api-reference`](/docs/api-reference) — the `/cp/orgs/*` endpoints are documented there.