docs(architecture): canonical runtime↔platform↔plugin responsibilities + ADR-003 (guardrail/SSOT) #3255

Merged
agent-reviewer-cr2 merged 3 commits from docs/runtime-platform-plugin-responsibilities into main 2026-06-25 05:03:51 +00:00
4 changed files with 235 additions and 1 deletions
@@ -0,0 +1,66 @@
# ADR-003: Runtime adapts to the platform; the plugin adapts to each runtime
**Status:** Accepted — committed architecture
**Date:** 2026-06-25
**Supersedes context:** RFC `rfc-platform-mcp-as-plugin` §2b/§3.4 (platform-MCP-as-plugin, de-bake)
**Related incident:** 2026-06-25 fleet-wide concierge `degraded` (half-wired `loaded_mcp_tools` producer)
## Context
Molecule runs one agent codebase across many runtimes (claude-code, codex,
hermes, openclaw, gemini-cli) and exposes the same capabilities (management MCP,
A2A, memory) on all of them. Which component adapts to which was, until now,
tribal knowledge: the *plugin → runtime* half was documented, but the
*runtime → platform* status contract lived only in source docstrings, the named
contract doc (`api-protocol/registry-and-heartbeat.md`) was stale, and the
de-bake rationale lived only in an RFC. There was no single canonical statement
and no ADR.
That gap had teeth. On 2026-06-25 every de-baked concierge was marked
`degraded` because the runtime reported `mcp_server_present=true` but never
produced `loaded_mcp_tools` — the producer was a scaffold with **zero callers**,
`omitempty` masked it, the unit tests bypassed the production gate with
`force=True`, and the only end-to-end check was a non-deterministic LLM
self-enumeration that was also `continue-on-error` and didn't run on PRs. A
half-wired producer crossing the runtime↔platform seam shipped silently.
## Decision
Adopt as **committed architecture** the two-layer, opposite-direction
responsibility split, and **enforce it with guardrails** so it cannot regress to
tribal knowledge:
1. **The runtime adapts the agent to the platform.** It owns the register/
heartbeat **status contract** (`mcp_server_present` *and* `loaded_mcp_tools`),
reported runtime-agnostically in one place. Every gate-consumed field must
have a wired producer + a liveness test. The required tool id should be pinned
in a shared contract and **derived** on both sides rather than hardcoded per
layer — *target state*; today core holds it as a literal const guarded by a
drift test and the runtime enumerates it live (contract-pin in progress).
2. **The plugin adapts its abilities to each runtime.** One runtime-agnostic
descriptor (the plugin = SSOT); per-runtime renderers write it into native
config. Renderer + reader + present-probe move in lockstep; an unmapped
runtime fails **closed**.
3. **Platform-ness is a composition, not an image.** A concierge is an ordinary
runtime image + the org-admin key + the management MCP plugin. "Is this a
concierge?" is detected via `mcp_server_present()`, not the baked-image marker
(runtime#181). The baked `molecule-platform-agent` image is being removed and
guarded against return — *in progress* (artifact deletion + absence guard,
#78); until then the CP still carries vestigial `resolvePlatformAgentImage`
references.
The full statement, the field tables, and the guardrail matrix live in
[`architecture/runtime-platform-plugin-responsibilities.md`](/architecture/runtime-platform-plugin-responsibilities).
## Consequences
- New runtimes and new plugins have a single doc to conform to; reviewers have a
named contract to check against.
- A set of red-on-regression guardrails becomes required (producer-liveness boot
test, contract-drift blocking with `loaded_mcp_tools` pinned, renderer/reader
lockstep, deterministic online+`create_workspace` e2e, de-bake absence guard).
Until each is green it is tracked under the guardrail/SSOT workstream.
- The stale `api-protocol/registry-and-heartbeat.md` "five fields" status model
is corrected to include the MCP status fields.
+35 -1
@@ -38,7 +38,33 @@ await platform.post("/registry/heartbeat", json={
})
```
Five fields. Memory usage, CPU, queue depth — those are infrastructure metrics for Prometheus/Grafana or CloudWatch. The platform registry is a service discovery layer, not a metrics store.
Five fields for an ordinary workspace. Memory usage, CPU, queue depth — those are infrastructure metrics for Prometheus/Grafana or CloudWatch. The platform registry is a service discovery layer, not a metrics store.
### Platform-agent (concierge) status fields
A `kind=platform` concierge reports **two additional fields** that the
online/degraded gate reads (built by `platform_agent_identity.identity_gate_payload`):
```python
# platform concierge only — read by the online/degraded gate
"mcp_server_present": True, # management molecule-platform MCP is DECLARED/wired (RCA#2970)
"loaded_mcp_tools": ["mcp__molecule-platform__create_workspace", ...], # tools ACTUALLY loaded (core#3082)
```
`mcp_server_present` (declared) is **necessary but not sufficient** — a
declared-but-dead MCP is the false-green core#3082 catches, so the concierge also
reports `loaded_mcp_tools`, the namespaced ids that actually loaded into the
model. The runtime produces `loaded_mcp_tools` by enumerating the connected MCP
servers over the wire (retrying in the background until the management MCP is
connectable), so the concierge reaches `online` without waiting for a user turn.
Tri-state: the field is **omitted** until something is observed (grace window
applies), `[]` when a server connected with zero tools, or the id list. The
required tool id is `mcp__molecule-platform__create_workspace`. *Current state:*
core holds it as a literal const (`conciergePlatformMCPCreateWorkspaceTool`)
guarded by a contract **drift test**, and the contract pins the management server
name + status-field shape only — pinning the full id in the contract and deriving
both sides is a *target* (in progress). See
[Runtime ↔ Platform ↔ Plugin Responsibilities](/architecture/runtime-platform-plugin-responsibilities).
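The tri-state can be sketched as follows. This is a minimal illustration, not the runtime's code: `build_status_fields` and its `observed_tools` parameter are hypothetical names, and only the two field names and the three states are taken from the text above.

```python
from typing import Optional


def build_status_fields(
    mcp_server_present: bool,
    observed_tools: Optional[list[str]],
) -> dict:
    """Build the two concierge status fields (hypothetical helper).

    observed_tools is None until the probe has observed something,
    [] when a server connected but exposed zero tools, and the
    namespaced id list otherwise.
    """
    fields: dict = {"mcp_server_present": mcp_server_present}
    if observed_tools is not None:
        # Copy so later mutation of the caller's list cannot change the payload.
        fields["loaded_mcp_tools"] = list(observed_tools)
    return fields
```

Representing "nothing observed yet" by the key's absence, rather than by `[]`, is what lets a reader of the payload tell "still warming up" apart from "connected with no tools".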
`active_tasks` is included because the canvas uses it for a busy indicator on the node, and it sets up backpressure for Phase 2 without a schema change.
@@ -83,6 +109,14 @@ The workspace self-reports its health via the heartbeat payload. The platform de
| `error_rate >= 0.5` and status is `online` | `online` -> `degraded` | `WORKSPACE_DEGRADED` |
| `error_rate < 0.1` and status is `degraded` | `degraded` -> `online` | `WORKSPACE_ONLINE` |
For a `kind=platform` concierge, two **MCP-presence** transitions also apply
(fail-closed; `registry.go`):
| Condition | Transition | Rationale |
|-----------|-----------|-----------|
| `mcp_server_present=false` on register/heartbeat | refuse `online` | RCA#2970 — no management MCP, refuse to serve a crippled concierge |
| `mcp_server_present=true` but `loaded_mcp_tools` absent / lacks `mcp__molecule-platform__create_workspace` past the grace window | `online` -> `degraded` | core#3082 — MCP declared but its tools never loaded (false-green) |
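The two rows above can be read as a small decision function. The sketch below is in Python for illustration only (the real gate lives in `registry.go`); `concierge_gate_action`, `grace_expired`, and the returned labels are hypothetical, and the sketch assumes the grace window covers both an absent field and a list that lacks the required id.

```python
from typing import Optional

REQUIRED_TOOL_ID = "mcp__molecule-platform__create_workspace"


def concierge_gate_action(
    mcp_server_present: bool,
    loaded_mcp_tools: Optional[list[str]],
    grace_expired: bool,
) -> str:
    """Return 'refuse_online', 'degrade', or 'allow' (illustrative labels)."""
    if not mcp_server_present:
        # No management MCP declared: never serve as a concierge.
        return "refuse_online"
    has_required = (
        loaded_mcp_tools is not None and REQUIRED_TOOL_ID in loaded_mcp_tools
    )
    if has_required:
        return "allow"
    # Field absent, empty, or missing the required id: only degrade once
    # the grace window has passed (assumption, see lead-in).
    return "degrade" if grace_expired else "allow"
```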
**What counts as an error:** Only things that indicate the workspace itself is broken — 5xx responses, timeouts, connection errors. Client errors (400, 403) are the caller's fault and are not counted.
The workspace tracks errors locally using a rolling 60-second window and includes the stats in every heartbeat. The platform doesn't sit in the A2A message path, so it can't monitor response codes directly — self-reporting via heartbeat is the mechanism.
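A rolling window of this kind might look like the sketch below. It is an assumption-laden illustration, not the workspace's tracker: the class name, method names, and the injectable `clock` are hypothetical. It counts 5xx responses and transport failures as errors and treats 4xx as non-errors, as described above.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class RollingErrorWindow:
    """Track the error rate over the last `window_seconds` (hypothetical)."""

    def __init__(
        self,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()

    def record_status(self, status_code: int) -> None:
        # 5xx means the workspace is broken; 4xx is the caller's fault.
        self._record(status_code >= 500)

    def record_failure(self) -> None:
        # Timeouts and connection errors always count as errors.
        self._record(True)

    def _record(self, is_error: bool) -> None:
        now = self._clock()
        self._events.append((now, is_error))
        self._evict(now)

    def _evict(self, now: float) -> None:
        cutoff = now - self._window
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def error_rate(self) -> float:
        self._evict(self._clock())
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)
```

Injecting the clock keeps the window testable without sleeping; a heartbeat builder would call `error_rate()` when assembling each payload.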
@@ -0,0 +1,133 @@
# Runtime ↔ Platform ↔ Plugin: division of responsibilities
> **Status:** Committed architecture (see [ADR-003](/adr/ADR-003-runtime-platform-plugin-responsibilities)).
> This page is the single canonical statement of *who adapts to whom*. If you are
> wiring a new runtime, a new plugin, or anything the online/degraded gate reads,
> start here.
Molecule runs the **same agent code across many runtimes** (claude-code, codex,
hermes, openclaw, gemini-cli, …) and exposes the **same capabilities** (the
management MCP, A2A, memory) regardless of which runtime an org picks. Two
adapter layers make that work, and they have **opposite directions**:
| Layer | Adapts… | …to… | Lives in |
|-------|---------|------|----------|
| **Runtime** | the agent | **the platform** | `molecule-ai-workspace-runtime` (shared pip pkg) |
| **Plugin** | its abilities | **each runtime** | the plugin (descriptor) + per-runtime renderers |
Stated as one sentence each:
- **The runtime's job is to adapt the agent to the platform.** It registers,
heartbeats, and reports the status the platform's online/degraded gate reads —
**runtime-agnostically, in one place**. The platform never asks "which runtime
is this?"; it reads a fixed status contract.
- **The plugin's job is to adapt its abilities to each runtime.** It ships one
runtime-agnostic descriptor; a per-runtime renderer writes that descriptor into
the runtime's *native* MCP config. The same plugin works on every runtime, and
the agent stays runtime-switchable.
## 1. Runtime → Platform: the status contract
The runtime reports to the platform on **register + every heartbeat** via one
payload builder (`platform_agent_identity.identity_gate_payload`). The platform's
fail-closed online/degraded gate (`workspace-server/internal/handlers/registry.go`)
consumes it. The load-bearing fields:
| Field | Meaning | Consumed by |
|-------|---------|-------------|
| `mcp_server_present` | the management `molecule-platform` MCP is **declared/wired** in the active runtime's native config | RCA#2970 online gate |
| `loaded_mcp_tools` | the management MCP's tools **actually loaded** into the model (the `mcp__<server>__<tool>` ids) | core#3082 degrade gate |
**Both halves are required.** "Declared" (`mcp_server_present`) is necessary but
not sufficient — a declared-but-dead server is the exact false-green core#3082
catches, so the runtime must also report what *loaded* (`loaded_mcp_tools`). The
runtime produces `loaded_mcp_tools` **at init** (it enumerates the connected MCP
servers itself — `loaded_mcp_tools_probe.py`), so the gate can flip
`degraded → online` **without waiting for a user turn**. This enumeration is
runtime-agnostic: it speaks the MCP wire protocol directly, not any one SDK's
tool-list message.
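At the wire level, enumerating tools amounts to a JSON-RPC 2.0 `tools/list` request and a response whose `result.tools` entries carry a `name`. The sketch below shows only message construction and id namespacing; the transport, the `initialize` handshake, and pagination are omitted, and both function names are hypothetical rather than taken from `loaded_mcp_tools_probe.py`.

```python
import json


def tools_list_request(request_id: int) -> str:
    """Serialize a JSON-RPC 2.0 `tools/list` request."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}
    )


def namespaced_tool_ids(server_name: str, response_text: str) -> list[str]:
    """Turn a `tools/list` response into `mcp__<server>__<tool>` ids."""
    response = json.loads(response_text)
    if "error" in response:
        # Surface the server's error instead of reporting an empty list,
        # which would be indistinguishable from "zero tools".
        raise RuntimeError(f"tools/list failed: {response['error']}")
    tools = response.get("result", {}).get("tools", [])
    return [f"mcp__{server_name}__{tool['name']}" for tool in tools]
```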
**Rule:** any field the gate reads MUST have a runtime-side **producer** that is
actually wired (not just a scaffold) and a **liveness test**. With `omitempty`, a
field whose producer is never called serializes identically to one that is still
warming up: the field is simply absent in both cases. That is how a half-wired
producer silently degraded every concierge (the 2026-06-25 incident).
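The rule can be illustrated with a toy payload builder and a liveness check. Everything here is hypothetical (`heartbeat_fields`, `assert_producer_is_live`, the `Probe` type); the point is only that an unwired producer and a warming-up one yield the same payload, so a test has to assert that the field appears when a tool list is available.

```python
from typing import Callable, Optional

Probe = Callable[[], Optional[list[str]]]


def heartbeat_fields(probe: Optional[Probe]) -> dict:
    """Hypothetical builder: omit the field when nothing was observed."""
    fields: dict = {"mcp_server_present": True}
    observed = probe() if probe is not None else None
    if observed is not None:
        fields["loaded_mcp_tools"] = observed
    return fields


def assert_producer_is_live(probe: Optional[Probe]) -> None:
    """Liveness check: with a connectable MCP the field must be present."""
    if "loaded_mcp_tools" not in heartbeat_fields(probe):
        raise AssertionError("loaded_mcp_tools has no live producer")


# An unwired producer (None) and a warming-up probe look the same on the wire.
unwired = heartbeat_fields(None)
warming_up = heartbeat_fields(lambda: None)
```

Because `unwired == warming_up`, only a check like `assert_producer_is_live`, run against a probe that should succeed, distinguishes the two cases.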
> **Current state vs target.** Today the required tool id is a Go **literal
> constant** (`conciergePlatformMCPCreateWorkspaceTool`) guarded by a contract
> **drift test**, and `contracts/mcp-plugin-delivery.contract.json` pins the
> management **server name** + the **status-field** shape — it does **not** yet
> pin the full tool id, and the runtime does not derive it (it enumerates the
> live MCP). **Target** (guardrail/SSOT workstream, in progress): pin the full
> id `mcp__molecule-platform__create_workspace` in the contract and have core's
> gate **derive** it, so a rename on either side fails the (blocking) drift
> check rather than agreeing by convention.
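
A sketch of the target state, under stated assumptions: the contract keys `management_server_name` and `required_tool_id` are invented for illustration (the real `contracts/mcp-plugin-delivery.contract.json` schema is not shown here), as are both function names.

```python
import json

# Hypothetical contract fragment; key names are assumptions.
CONTRACT_JSON = """
{
  "management_server_name": "molecule-platform",
  "required_tool_id": "mcp__molecule-platform__create_workspace"
}
"""


def required_tool_id(contract: dict) -> str:
    """Read the pinned id and check it agrees with the pinned server name."""
    tool_id = contract["required_tool_id"]
    prefix = f"mcp__{contract['management_server_name']}__"
    if not tool_id.startswith(prefix):
        raise ValueError(f"{tool_id!r} does not start with {prefix!r}")
    return tool_id


def assert_no_drift(contract: dict, literal_in_code: str) -> None:
    """Fail when a hardcoded literal disagrees with the contract."""
    pinned = required_tool_id(contract)
    if literal_in_code != pinned:
        raise AssertionError(f"drift: {literal_in_code!r} != {pinned!r}")
```

With the id pinned once, a rename on either side trips `assert_no_drift` instead of passing by convention.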
## 2. Plugin → Runtime: per-runtime rendering
The plugin declaration is the **single source of truth** for its MCP descriptor
(server name, command, args, env). A per-runtime **shape adapter** renders that
one descriptor into the runtime's native config:
| Runtime | Native MCP config |
|---------|-------------------|
| claude-code | `.claude/settings.json` (`mcpServers`) |
| codex | `~/.codex/config.toml` |
| gemini-cli | `settings.json` |
| openclaw | `~/.openclaw/openclaw.json` |
Renderers + their inverse readers + the present-probe live in
`molecule-ai-workspace-runtime/molecule_runtime/mcp_render.py`
(`_RUNTIME_SPECS` / `render_for_runtime` / `read_mcp_servers_for` /
`management_mcp_present_for`), dispatched **by runtime name** — never by a baked
image. See [agentskills-compat](/plugins/agentskills-compat) and
[cli-runtime](/agent-runtime/cli-runtime) for the plugin author's view.
**Rule:** adding a runtime means adding its renderer **and** reader **and**
present-probe together (kept in lockstep by a test). An unmapped runtime must
fail **closed**, never silently fall back to `.claude/settings.json` (the #3159
mis-attribution class).
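Fail-closed dispatch and the lockstep check can be sketched as below. The mapping names mirror `_RUNTIME_SPECS` and `_RUNTIME_READERS`, but their contents are stand-ins: the paths come from the table above, the readers are placeholder callables, and `native_config_path`, `assert_lockstep`, and `UnmappedRuntimeError` are hypothetical.

```python
from typing import Callable


class UnmappedRuntimeError(LookupError):
    """Raised instead of falling back to another runtime's config."""


# Stand-in contents; the real spec objects are richer than a path.
_RUNTIME_SPECS: dict[str, str] = {
    "claude-code": ".claude/settings.json",
    "codex": "~/.codex/config.toml",
    "gemini-cli": "settings.json",
    "openclaw": "~/.openclaw/openclaw.json",
}

# Placeholder readers: each would parse its runtime's native config.
_RUNTIME_READERS: dict[str, Callable[[str], dict]] = {
    name: (lambda text: {}) for name in _RUNTIME_SPECS
}


def native_config_path(runtime: str) -> str:
    """Dispatch by runtime name; an unknown name fails closed."""
    try:
        return _RUNTIME_SPECS[runtime]
    except KeyError:
        raise UnmappedRuntimeError(f"no renderer for {runtime!r}") from None


def assert_lockstep() -> None:
    """Every runtime with a renderer must also have a reader, and vice versa."""
    if set(_RUNTIME_SPECS) != set(_RUNTIME_READERS):
        diff = set(_RUNTIME_SPECS) ^ set(_RUNTIME_READERS)
        raise AssertionError(f"renderer/reader drift: {sorted(diff)}")
```

Raising on an unknown name, rather than using `dict.get` with a default, is what prevents a silent fallback to `.claude/settings.json`.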
## 3. Corollaries
- **No logic in the wrong layer.** The platform-facing gate has zero
`if runtime == …` branches; runtime selection happens once via `ADAPTER_MODULE` /
`get_adapter`, and per-runtime shape happens via the `mcp_render` dispatch.
No platform logic is baked into a plugin.
- **Tool ids should be SSOT.** Re-spelling `mcp__molecule-platform__create_workspace`
per layer is how the codex#142 `mcp__platform__` vs `mcp__molecule-platform__`
drift happened. *Current state:* core holds it as a literal const guarded by a
drift test; the runtime enumerates it from the live MCP (no literal). *Target:*
pin the full id in the contract and derive core's gate from it (in progress).
- **Platform-ness is a composition, not an image.** A platform concierge is an
**ordinary runtime image** plus (a) the org-admin key and (b) the management
MCP plugin — *not* a special baked `molecule-platform-agent` image. The baked
image structurally binds the concierge to claude-code; for any non-claude-code
concierge it is not just redundant but **wrong**
([rfc-platform-mcp-as-plugin](/design/rfc-platform-mcp-as-plugin) §3.4). Detect
"is this a concierge?" via the **composition** (`mcp_server_present()`), never
the baked-image marker (`MOLECULE_PLATFORM_AGENT_IMAGE_BAKED`), which is absent
on a de-baked concierge.
## 4. How this is enforced (guardrails)
Each rule should be a red-on-regression test so the principle can't drift back
into tribal knowledge. The Status column is candid: ✅ means enforced in CI today;
◻ means target or in progress (guardrail/SSOT workstream, following the 2026-06-25 audit):
| Rule | Guardrail | Status |
|------|-----------|--------|
| Plugin renders per-runtime native config | `test_mcp_plugin_delivery_contract` (codex writes `config.toml`, not claude settings) | ✅ |
| No runtime branching in the gate | gate has zero `if runtime==`; `…RoutesThroughPort` | ✅ |
| Unmapped runtime fails closed | G6 `test_mcp_render_completeness_g6` | ✅ |
| Prompt SSOT (filename / channel) | G0 `test_prompt_filename_ssot_g0`, G1 `test_prompt_channel_ssot_g1`, `test_mcp_ssot` | ✅ |
| `loaded_mcp_tools` producer fires through the real gate | `test_debaked_concierge_runs_via_mcp_server_present` (runtime#181) | ✅ (partial) |
| Renderer/reader/present-probe lockstep | `test_mcp_render_lockstep` (`set(_RUNTIME_SPECS)==set(_RUNTIME_READERS)`) | ◻ target |
| Full tool id derived from a shared contract | pin `mcp__molecule-platform__create_workspace` in the contract + derive core's gate | ◻ target |
| `loaded_mcp_tools` / required tool pinned both sides + blocking drift | `mcp-plugin-delivery-contract-drift` made fail-closed, runtime copy in the compare set | ◻ target |
| Producer-liveness via the full `main.py` boot path | boot test (real gate, no `force`) | ◻ target |
| Concierge reaches online + has `create_workspace` | **current**: e2e LLM self-enumeration; **target**: deterministic `status=online` + heartbeat `loaded_mcp_tools` includes `create_workspace` | ◻ target (pending the deterministic e2e PR) |
| Baked image cannot return | de-bake absence guard (fails if `Dockerfile.platform-agent` / `resolvePlatformAgentImage` / baked publish job reappear) | ◻ target (#78) |
See the guardrail audit (2026-06-25) for the full enforced/gap analysis; the ◻
items are tracked under the guardrail/SSOT workstream.
+1
@@ -63,6 +63,7 @@ features:
- [Product Overview](/product/overview)
- [Product Narrative](/product/molecule-product-doc)
- [System Architecture](/architecture/architecture)
- [Runtime ↔ Platform ↔ Plugin Responsibilities](/architecture/runtime-platform-plugin-responsibilities)
- [Comprehensive Technical Documentation](/architecture/molecule-technical-doc)
- [Memory Architecture](/architecture/memory)
- [Workspace Runtime](/agent-runtime/workspace-runtime)