diff --git a/docs/adr/ADR-003-runtime-platform-plugin-responsibilities.md b/docs/adr/ADR-003-runtime-platform-plugin-responsibilities.md new file mode 100644 index 00000000..7f0ffd1a --- /dev/null +++ b/docs/adr/ADR-003-runtime-platform-plugin-responsibilities.md @@ -0,0 +1,66 @@ +# ADR-003: Runtime adapts to the platform; the plugin adapts to each runtime + +**Status:** Accepted — committed architecture +**Date:** 2026-06-25 +**Supersedes context:** RFC `rfc-platform-mcp-as-plugin` §2b/§3.4 (platform-MCP-as-plugin, de-bake) +**Related incident:** 2026-06-25 fleet-wide concierge `degraded` (half-wired `loaded_mcp_tools` producer) + +## Context + +Molecule runs one agent codebase across many runtimes (claude-code, codex, +hermes, openclaw, gemini-cli) and exposes the same capabilities (management MCP, +A2A, memory) on all of them. Which component adapts to which was, until now, +tribal knowledge: the *plugin → runtime* half was documented, but the +*runtime → platform* status contract lived only in source docstrings, the named +contract doc (`api-protocol/registry-and-heartbeat.md`) was stale, and the +de-bake rationale lived only in an RFC. There was no single canonical statement +and no ADR. + +That gap had teeth. On 2026-06-25 every de-baked concierge was marked +`degraded` because the runtime reported `mcp_server_present=true` but never +produced `loaded_mcp_tools` — the producer was a scaffold with **zero callers**, +`omitempty` masked it, the unit tests bypassed the production gate with +`force=True`, and the only end-to-end check was a non-deterministic LLM +self-enumeration that was also `continue-on-error` and didn't run on PRs. A +half-wired producer crossing the runtime↔platform seam shipped silently. + +## Decision + +Adopt as **committed architecture** the two-layer, opposite-direction +responsibility split, and **enforce it with guardrails** so it cannot regress to +tribal knowledge: + +1. 
**The runtime adapts the agent to the platform.** It owns the register/ + heartbeat **status contract** (`mcp_server_present` *and* `loaded_mcp_tools`), + reported runtime-agnostically in one place. Every gate-consumed field must + have a wired producer + a liveness test. The required tool id should be pinned + in a shared contract and **derived** on both sides rather than hardcoded per + layer — *target state*; today core holds it as a literal const guarded by a + drift test and the runtime enumerates it live (contract-pin in progress). + +2. **The plugin adapts its abilities to each runtime.** One runtime-agnostic + descriptor (the plugin = SSOT); per-runtime renderers write it into native + config. Renderer + reader + present-probe move in lockstep; an unmapped + runtime fails **closed**. + +3. **Platform-ness is a composition, not an image.** A concierge is an ordinary + runtime image + the org-admin key + the management MCP plugin. "Is this a + concierge?" is detected via `mcp_server_present()`, not the baked-image marker + (runtime#181). The baked `molecule-platform-agent` image is being removed and + guarded against return — *in progress* (artifact deletion + absence guard, + #78); until then the CP still carries vestigial `resolvePlatformAgentImage` + references. + +The full statement, the field tables, and the guardrail matrix live in +[`architecture/runtime-platform-plugin-responsibilities.md`](/architecture/runtime-platform-plugin-responsibilities). + +## Consequences + +- New runtimes and new plugins have a single doc to conform to; reviewers have a + named contract to check against. +- A set of red-on-regression guardrails becomes required (producer-liveness boot + test, contract-drift blocking with `loaded_mcp_tools` pinned, renderer/reader + lockstep, deterministic online+`create_workspace` e2e, de-bake absence guard). + Until each is green it is tracked under the guardrail/SSOT workstream. 
+- The stale `api-protocol/registry-and-heartbeat.md` "five fields" status model + is corrected to include the MCP status fields. diff --git a/docs/api-protocol/registry-and-heartbeat.md b/docs/api-protocol/registry-and-heartbeat.md index f35bb6c5..882ce581 100644 --- a/docs/api-protocol/registry-and-heartbeat.md +++ b/docs/api-protocol/registry-and-heartbeat.md @@ -38,7 +38,33 @@ await platform.post("/registry/heartbeat", json={ }) ``` -Five fields. Memory usage, CPU, queue depth — those are infrastructure metrics for Prometheus/Grafana or CloudWatch. The platform registry is a service discovery layer, not a metrics store. +Five fields for an ordinary workspace. Memory usage, CPU, queue depth — those are infrastructure metrics for Prometheus/Grafana or CloudWatch. The platform registry is a service discovery layer, not a metrics store. + +### Platform-agent (concierge) status fields + +A `kind=platform` concierge reports **two additional fields** that the +online/degraded gate reads (built by `platform_agent_identity.identity_gate_payload`): + +```python + # platform concierge only — read by the online/degraded gate + "mcp_server_present": True, # management molecule-platform MCP is DECLARED/wired (RCA#2970) + "loaded_mcp_tools": ["mcp__molecule-platform__create_workspace", ...], # tools ACTUALLY loaded (core#3082) +``` + +`mcp_server_present` (declared) is **necessary but not sufficient** — a +declared-but-dead MCP is the false-green core#3082 catches, so the concierge also +reports `loaded_mcp_tools`, the namespaced ids that actually loaded into the +model. The runtime produces `loaded_mcp_tools` by enumerating the connected MCP +servers over the wire (retrying in the background until the management MCP is +connectable), so the concierge reaches `online` without waiting for a user turn. +Tri-state: the field is **omitted** until something is observed (grace window +applies), `[]` when a server connected with zero tools, or the id list. 
The +required tool id is `mcp__molecule-platform__create_workspace`; *current state* +core holds it as a literal const (`conciergePlatformMCPCreateWorkspaceTool`) +guarded by a contract **drift test**, and the contract pins the management server +name + status-field shape only — pinning the full id in the contract and deriving +both sides is a *target* (in progress). See +[Runtime ↔ Platform ↔ Plugin Responsibilities](/architecture/runtime-platform-plugin-responsibilities). `active_tasks` is included because the canvas uses it for a busy indicator on the node, and it sets up backpressure for Phase 2 without a schema change. @@ -83,6 +109,14 @@ The workspace self-reports its health via the heartbeat payload. The platform de | `error_rate >= 0.5` and status is `online` | `online` -> `degraded` | `WORKSPACE_DEGRADED` | | `error_rate < 0.1` and status is `degraded` | `degraded` -> `online` | `WORKSPACE_ONLINE` | +For a `kind=platform` concierge, two **MCP-presence** transitions also apply +(fail-closed; `registry.go`): + +| Condition | Transition | Rationale | +|-----------|-----------|-----------| +| `mcp_server_present=false` on register/heartbeat | refuse `online` | RCA#2970 — no management MCP, refuse to serve a crippled concierge | +| `mcp_server_present=true` but `loaded_mcp_tools` absent / lacks `mcp__molecule-platform__create_workspace` past the grace window | `online` -> `degraded` | core#3082 — MCP declared but its tools never loaded (false-green) | + **What counts as an error:** Only things that indicate the workspace itself is broken — 5xx responses, timeouts, connection errors. Client errors (400, 403) are the caller's fault and are not counted. The workspace tracks errors locally using a rolling 60-second window and includes the stats in every heartbeat. The platform doesn't sit in the A2A message path, so it can't monitor response codes directly — self-reporting via heartbeat is the mechanism. 
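As an illustration of the self-reported error rate, here is a minimal sketch of a rolling 60-second window, assuming the workspace records one outcome per handled request. The class name `RollingErrorWindow`, its methods, and the injectable clock are hypothetical and are not taken from the runtime's source. Only the 60-second window and the rule that 5xx responses, timeouts, and connection errors count (while 400/403 do not) come from the text above.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class RollingErrorWindow:
    """Hypothetical helper: tracks request outcomes over a rolling time window."""

    def __init__(self, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._window = window_seconds
        self._clock = clock  # injectable so the window can be tested without sleeping
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    @staticmethod
    def is_workspace_error(status_code: Optional[int],
                           transport_failed: bool = False) -> bool:
        # Timeouts and connection errors carry no status code; the caller flags them.
        if transport_failed:
            return True
        # 5xx means the workspace itself is broken; 4xx is the caller's fault.
        return status_code is not None and status_code >= 500

    def record(self, status_code: Optional[int] = None,
               transport_failed: bool = False) -> None:
        now = self._clock()
        is_error = self.is_workspace_error(status_code, transport_failed)
        self._events.append((now, is_error))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop outcomes older than the window.
        while self._events and now - self._events[0][0] > self._window:
            self._events.popleft()

    def error_rate(self) -> float:
        self._evict(self._clock())
        if not self._events:
            return 0.0  # no traffic in the window is treated as healthy here
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)
```

A heartbeat builder could then read `error_rate()` each time it assembles the payload. Treating an empty window as `0.0` is a choice made for this sketch, not something the source specifies.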
diff --git a/docs/architecture/runtime-platform-plugin-responsibilities.md b/docs/architecture/runtime-platform-plugin-responsibilities.md new file mode 100644 index 00000000..e6a9ce4d --- /dev/null +++ b/docs/architecture/runtime-platform-plugin-responsibilities.md @@ -0,0 +1,133 @@ +# Runtime ↔ Platform ↔ Plugin: division of responsibilities + +> **Status:** Committed architecture (see [ADR-003](/adr/ADR-003-runtime-platform-plugin-responsibilities)). +> This page is the single canonical statement of *who adapts to whom*. If you are +> wiring a new runtime, a new plugin, or anything the online/degraded gate reads, +> start here. + +Molecule runs the **same agent code across many runtimes** (claude-code, codex, +hermes, openclaw, gemini-cli, …) and exposes the **same capabilities** (the +management MCP, A2A, memory) regardless of which runtime an org picks. Two +adapter layers make that work, and they have **opposite directions**: + +| Layer | Adapts… | …to… | Lives in | +|-------|---------|------|----------| +| **Runtime** | the agent | **the platform** | `molecule-ai-workspace-runtime` (shared pip pkg) | +| **Plugin** | its abilities | **each runtime** | the plugin (descriptor) + per-runtime renderers | + +Stated as one sentence each: + +- **The runtime's job is to adapt the agent to the platform.** It registers, + heartbeats, and reports the status the platform's online/degraded gate reads — + **runtime-agnostically, in one place**. The platform never asks "which runtime + is this?"; it reads a fixed status contract. +- **The plugin's job is to adapt its abilities to each runtime.** It ships one + runtime-agnostic descriptor; a per-runtime renderer writes that descriptor into + the runtime's *native* MCP config. The same plugin works on every runtime, and + the agent stays runtime-switchable. + +## 1. 
Runtime → Platform: the status contract
+
+The runtime reports to the platform on **register + every heartbeat** via one
+payload builder (`platform_agent_identity.identity_gate_payload`). The platform's
+fail-closed online/degraded gate (`workspace-server/internal/handlers/registry.go`)
+consumes it. The load-bearing fields:
+
+| Field | Meaning | Consumed by |
+|-------|---------|-------------|
+| `mcp_server_present` | the management `molecule-platform` MCP is **declared/wired** in the active runtime's native config | RCA#2970 online gate |
+| `loaded_mcp_tools` | the management MCP's tools **actually loaded** into the model (the `mcp__<server>__<tool>` ids) | core#3082 degrade gate |
+
+**Both halves are required.** "Declared" (`mcp_server_present`) is necessary but
+not sufficient: a declared-but-dead server is the exact false-green that core#3082
+catches, so the runtime must also report what *loaded* (`loaded_mcp_tools`). The
+runtime produces `loaded_mcp_tools` **at init** by enumerating the connected MCP
+servers itself (`loaded_mcp_tools_probe.py`), so the gate can flip
+`degraded → online` **without waiting for a user turn**. This enumeration is
+runtime-agnostic: it speaks the MCP wire protocol directly, not any one SDK's
+tool-list message.
+
+**Rule:** any field the gate reads MUST have a runtime-side **producer** that is
+actually wired (not just a scaffold) and a **liveness test**. An unreferenced
+producer serializes (`omitempty`) identically to one that is still warming up,
+which is how a half-wired producer silently degraded every concierge in the
+2026-06-25 incident.
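The reason an unwired producer goes unnoticed can be sketched in a few lines. This is a minimal illustration, not the code in `platform_agent_identity` or `registry.go`: the function names `build_gate_payload` and `classify`, the `grace_expired` flag, and the returned labels are assumptions made for the example. The field names and the required tool id are the ones used in this document.

```python
from typing import Dict, List, Optional

REQUIRED_TOOL = "mcp__molecule-platform__create_workspace"


def build_gate_payload(mcp_server_present: bool,
                       loaded_mcp_tools: Optional[List[str]]) -> Dict[str, object]:
    """Build the status fields; omit loaded_mcp_tools until something is observed."""
    payload: Dict[str, object] = {"mcp_server_present": mcp_server_present}
    # None means "nothing observed yet", so the key is left out entirely
    # (the Python analogue of Go's omitempty in this sketch).
    if loaded_mcp_tools is not None:
        payload["loaded_mcp_tools"] = list(loaded_mcp_tools)
    return payload


def classify(payload: Dict[str, object], grace_expired: bool) -> str:
    """Hypothetical gate: returns 'refused', 'grace', 'degraded', or 'online'."""
    if not payload.get("mcp_server_present", False):
        return "refused"  # fail closed: no management MCP declared
    tools = payload.get("loaded_mcp_tools")
    if isinstance(tools, list) and REQUIRED_TOOL in tools:
        return "online"
    # Field omitted, empty, or missing the required tool.
    return "degraded" if grace_expired else "grace"


# A producer that was never wired and one that has not observed anything yet
# both pass None, so the gate sees the same payload in both cases.
unobserved = build_gate_payload(True, None)
print(unobserved)                      # the loaded_mcp_tools key is absent
print(classify(unobserved, False))     # tolerated inside the grace window
print(classify(unobserved, True))      # degraded once the grace window expires
```

Because the two cases are identical on the wire, the payload alone cannot distinguish them; only a liveness test that drives the real producer can.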
+ +> **Current state vs target.** Today the required tool id is a Go **literal +> constant** (`conciergePlatformMCPCreateWorkspaceTool`) guarded by a contract +> **drift test**, and `contracts/mcp-plugin-delivery.contract.json` pins the +> management **server name** + the **status-field** shape — it does **not** yet +> pin the full tool id, and the runtime does not derive it (it enumerates the +> live MCP). **Target** (guardrail/SSOT workstream, in progress): pin the full +> id `mcp__molecule-platform__create_workspace` in the contract and have core's +> gate **derive** it, so a rename on either side fails the (blocking) drift +> check rather than agreeing by convention. + +## 2. Plugin → Runtime: per-runtime rendering + +The plugin declaration is the **single source of truth** for its MCP descriptor +(server name, command, args, env). A per-runtime **shape adapter** renders that +one descriptor into the runtime's native config: + +| Runtime | Native MCP config | +|---------|-------------------| +| claude-code | `.claude/settings.json` (`mcpServers`) | +| codex | `~/.codex/config.toml` | +| gemini-cli | `settings.json` | +| openclaw | `~/.openclaw/openclaw.json` | + +Renderers + their inverse readers + the present-probe live in +`molecule-ai-workspace-runtime/molecule_runtime/mcp_render.py` +(`_RUNTIME_SPECS` / `render_for_runtime` / `read_mcp_servers_for` / +`management_mcp_present_for`), dispatched **by runtime name** — never by a baked +image. See [agentskills-compat](/plugins/agentskills-compat) and +[cli-runtime](/agent-runtime/cli-runtime) for the plugin author's view. + +**Rule:** adding a runtime means adding its renderer **and** reader **and** +present-probe together (kept in lockstep by a test). An unmapped runtime must +fail **closed**, never silently fall back to `.claude/settings.json` (the #3159 +mis-attribution class). + +## 3. 
Corollaries + +- **No logic in the wrong layer.** The platform-facing gate has zero + `if runtime == …` branches; runtime selection happens once via `ADAPTER_MODULE` + → `get_adapter`, and per-runtime shape happens via the `mcp_render` dispatch. + No platform logic is baked into a plugin. +- **Tool ids should be SSOT.** Re-spelling `mcp__molecule-platform__create_workspace` + per layer is how the codex#142 `mcp__platform__` vs `mcp__molecule-platform__` + drift happened. *Current state:* core holds it as a literal const guarded by a + drift test; the runtime enumerates it from the live MCP (no literal). *Target:* + pin the full id in the contract and derive core's gate from it (in progress). +- **Platform-ness is a composition, not an image.** A platform concierge is an + **ordinary runtime image** plus (a) the org-admin key and (b) the management + MCP plugin — *not* a special baked `molecule-platform-agent` image. The baked + image structurally binds the concierge to claude-code; for any non-claude-code + concierge it is not just redundant but **wrong** + ([rfc-platform-mcp-as-plugin](/design/rfc-platform-mcp-as-plugin) §3.4). Detect + "is this a concierge?" via the **composition** (`mcp_server_present()`), never + the baked-image marker (`MOLECULE_PLATFORM_AGENT_IMAGE_BAKED`), which is absent + on a de-baked concierge. + +## 4. How this is enforced (guardrails) + +Each rule should be a red-on-regression test so the principle can't drift back +into tribal knowledge. 
**Status is honest** — ✅ enforced in CI today, +◻ target/in-progress (guardrail/SSOT workstream, post the 2026-06-25 audit): + +| Rule | Guardrail | Status | +|------|-----------|--------| +| Plugin renders per-runtime native config | `test_mcp_plugin_delivery_contract` (codex writes `config.toml`, not claude settings) | ✅ | +| No runtime branching in the gate | gate has zero `if runtime==`; `…RoutesThroughPort` | ✅ | +| Unmapped runtime fails closed | G6 `test_mcp_render_completeness_g6` | ✅ | +| Prompt SSOT (filename / channel) | G0 `test_prompt_filename_ssot_g0`, G1 `test_prompt_channel_ssot_g1`, `test_mcp_ssot` | ✅ | +| `loaded_mcp_tools` producer fires through the real gate | `test_debaked_concierge_runs_via_mcp_server_present` (runtime#181) | ✅ (partial) | +| Renderer/reader/present-probe lockstep | `test_mcp_render_lockstep` (`set(_RUNTIME_SPECS)==set(_RUNTIME_READERS)`) | ◻ target | +| Full tool id derived from a shared contract | pin `mcp__molecule-platform__create_workspace` in the contract + derive core's gate | ◻ target | +| `loaded_mcp_tools` / required tool pinned both sides + blocking drift | `mcp-plugin-delivery-contract-drift` made fail-closed, runtime copy in the compare set | ◻ target | +| Producer-liveness via the full `main.py` boot path | boot test (real gate, no `force`) | ◻ target | +| Concierge reaches online + has `create_workspace` | **current**: e2e LLM self-enumeration; **target**: deterministic `status=online` + heartbeat `loaded_mcp_tools` ∋ `create_workspace` | ◻ target (pending the deterministic e2e PR) | +| Baked image cannot return | de-bake absence guard (fails if `Dockerfile.platform-agent` / `resolvePlatformAgentImage` / baked publish job reappear) | ◻ target (#78) | + +See the guardrail audit (2026-06-25) for the full enforced/gap analysis; the ◻ +items are tracked under the guardrail/SSOT workstream. 
diff --git a/docs/index.md b/docs/index.md index 2bbe5037..bd12b13d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -63,6 +63,7 @@ features: - [Product Overview](/product/overview) - [Product Narrative](/product/molecule-product-doc) - [System Architecture](/architecture/architecture) +- [Runtime ↔ Platform ↔ Plugin Responsibilities](/architecture/runtime-platform-plugin-responsibilities) - [Comprehensive Technical Documentation](/architecture/molecule-technical-doc) - [Memory Architecture](/architecture/memory) - [Workspace Runtime](/agent-runtime/workspace-runtime)