docs(architecture): canonical runtime↔platform↔plugin responsibilities + ADR-003 (guardrail/SSOT) #3255
@@ -0,0 +1,66 @@
# ADR-003: Runtime adapts to the platform; the plugin adapts to each runtime

**Status:** Accepted — committed architecture
**Date:** 2026-06-25
**Supersedes context:** RFC `rfc-platform-mcp-as-plugin` §2b/§3.4 (platform-MCP-as-plugin, de-bake)
**Related incident:** 2026-06-25 fleet-wide concierge `degraded` (half-wired `loaded_mcp_tools` producer)

## Context

Molecule runs one agent codebase across many runtimes (claude-code, codex,
hermes, openclaw, gemini-cli) and exposes the same capabilities (management MCP,
A2A, memory) on all of them. Which component adapts to which was, until now,
tribal knowledge: the *plugin → runtime* half was documented, but the
*runtime → platform* status contract lived only in source docstrings, the named
contract doc (`api-protocol/registry-and-heartbeat.md`) was stale, and the
de-bake rationale lived only in an RFC. There was no single canonical statement
and no ADR.

That gap had teeth. On 2026-06-25 every de-baked concierge was marked
`degraded` because the runtime reported `mcp_server_present=true` but never
produced `loaded_mcp_tools` — the producer was a scaffold with **zero callers**,
`omitempty` masked it, the unit tests bypassed the production gate with
`force=True`, and the only end-to-end check was a non-deterministic LLM
self-enumeration that was also `continue-on-error` and didn't run on PRs. A
half-wired producer crossing the runtime↔platform seam shipped silently.

## Decision

Adopt as **committed architecture** the two-layer, opposite-direction
responsibility split, and **enforce it with guardrails** so it cannot regress to
tribal knowledge:

1. **The runtime adapts the agent to the platform.** It owns the
   register/heartbeat **status contract** (`mcp_server_present` *and* `loaded_mcp_tools`),
   reported runtime-agnostically in one place. Every gate-consumed field must
   have a wired producer + a liveness test. The required tool id should be pinned
   in a shared contract and **derived** on both sides rather than hardcoded per
   layer — *target state*; today core holds it as a literal const guarded by a
   drift test and the runtime enumerates it live (contract-pin in progress).

2. **The plugin adapts its abilities to each runtime.** One runtime-agnostic
   descriptor (the plugin = SSOT); per-runtime renderers write it into native
   config. Renderer + reader + present-probe move in lockstep; an unmapped
   runtime fails **closed**.

3. **Platform-ness is a composition, not an image.** A concierge is an ordinary
   runtime image + the org-admin key + the management MCP plugin. "Is this a
   concierge?" is detected via `mcp_server_present()`, not the baked-image marker
   (runtime#181). The baked `molecule-platform-agent` image is being removed and
   guarded against return — *in progress* (artifact deletion + absence guard,
   #78); until then the CP still carries vestigial `resolvePlatformAgentImage`
   references.

The full statement, the field tables, and the guardrail matrix live in
[`architecture/runtime-platform-plugin-responsibilities.md`](/architecture/runtime-platform-plugin-responsibilities).

## Consequences

- New runtimes and new plugins have a single doc to conform to; reviewers have a
  named contract to check against.
- A set of red-on-regression guardrails becomes required (producer-liveness boot
  test, contract-drift blocking with `loaded_mcp_tools` pinned, renderer/reader
  lockstep, deterministic online+`create_workspace` e2e, de-bake absence guard).
  Until each is green it is tracked under the guardrail/SSOT workstream.
- The stale `api-protocol/registry-and-heartbeat.md` "five fields" status model
  is corrected to include the MCP status fields.
@@ -38,7 +38,33 @@ await platform.post("/registry/heartbeat", json={
})
```

-Five fields. Memory usage, CPU, queue depth — those are infrastructure metrics for Prometheus/Grafana or CloudWatch. The platform registry is a service discovery layer, not a metrics store.
+Five fields for an ordinary workspace. Memory usage, CPU, queue depth — those are infrastructure metrics for Prometheus/Grafana or CloudWatch. The platform registry is a service discovery layer, not a metrics store.

### Platform-agent (concierge) status fields

A `kind=platform` concierge reports **two additional fields** that the
online/degraded gate reads (built by `platform_agent_identity.identity_gate_payload`):

```python
# platform concierge only — read by the online/degraded gate
"mcp_server_present": True,  # management molecule-platform MCP is DECLARED/wired (RCA#2970)
"loaded_mcp_tools": ["mcp__molecule-platform__create_workspace", ...],  # tools ACTUALLY loaded (core#3082)
```

`mcp_server_present` (declared) is **necessary but not sufficient** — a
declared-but-dead MCP is the false-green core#3082 catches, so the concierge also
reports `loaded_mcp_tools`, the namespaced ids that actually loaded into the
model. The runtime produces `loaded_mcp_tools` by enumerating the connected MCP
servers over the wire (retrying in the background until the management MCP is
connectable), so the concierge reaches `online` without waiting for a user turn.
Tri-state: the field is **omitted** until something is observed (grace window
applies), `[]` when a server connected with zero tools, or the id list. The
required tool id is `mcp__molecule-platform__create_workspace`. *Current state:*
core holds it as a literal const (`conciergePlatformMCPCreateWorkspaceTool`)
guarded by a contract **drift test**, and the contract pins the management server
name + status-field shape only — pinning the full id in the contract and deriving
both sides is a *target* (in progress). See
[Runtime ↔ Platform ↔ Plugin Responsibilities](/architecture/runtime-platform-plugin-responsibilities).

`active_tasks` is included because the canvas uses it for a busy indicator on the node, and it sets up backpressure for Phase 2 without a schema change.

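The tri-state for `loaded_mcp_tools` described above (omitted, `[]`, or the id list) can be sketched as a small payload helper. This is a minimal illustration and not the real `identity_gate_payload`: the function name `build_gate_fields` and its parameters are hypothetical, and only the two field names and the tool id come from the text.

```python
from typing import List, Optional


def build_gate_fields(mcp_declared: bool,
                      observed_tools: Optional[List[str]]) -> dict:
    """Hypothetical sketch of the two concierge-only status fields.

    observed_tools is None until the probe has observed something,
    [] when a server connected with zero tools, else the loaded ids.
    """
    fields: dict = {"mcp_server_present": mcp_declared}
    if observed_tools is not None:
        # Include the key only once something was observed, so the
        # platform can tell "not yet known" apart from "known empty".
        fields["loaded_mcp_tools"] = list(observed_tools)
    return fields


warming_up = build_gate_fields(True, None)
connected_empty = build_gate_fields(True, [])
loaded = build_gate_fields(
    True, ["mcp__molecule-platform__create_workspace"]
)
```

Note that `warming_up` serializes the same way whether the probe is still retrying or was never wired at all, which is why the omitted state needs a grace window on the platform side and a liveness test on the runtime side.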
@@ -83,6 +109,14 @@ The workspace self-reports its health via the heartbeat payload. The platform de
| `error_rate >= 0.5` and status is `online` | `online` -> `degraded` | `WORKSPACE_DEGRADED` |
| `error_rate < 0.1` and status is `degraded` | `degraded` -> `online` | `WORKSPACE_ONLINE` |

For a `kind=platform` concierge, two **MCP-presence** transitions also apply
(fail-closed; `registry.go`):

| Condition | Transition | Rationale |
|-----------|-----------|-----------|
| `mcp_server_present=false` on register/heartbeat | refuse `online` | RCA#2970 — no management MCP, refuse to serve a crippled concierge |
| `mcp_server_present=true` but `loaded_mcp_tools` absent / lacks `mcp__molecule-platform__create_workspace` past the grace window | `online` -> `degraded` | core#3082 — MCP declared but its tools never loaded (false-green) |

**What counts as an error:** Only things that indicate the workspace itself is broken — 5xx responses, timeouts, connection errors. Client errors (400, 403) are the caller's fault and are not counted.

The workspace tracks errors locally using a rolling 60-second window and includes the stats in every heartbeat. The platform doesn't sit in the A2A message path, so it can't monitor response codes directly — self-reporting via heartbeat is the mechanism.

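The transition rows and the rolling window described above can be read together as three small pieces. The sketch below is illustrative only: the real gate is Go code in `registry.go`, and the names `RollingErrorWindow`, `error_rate_transition`, `mcp_presence_verdict`, the verdict strings, and the `grace_expired` flag are all hypothetical. It also assumes `error_rate` means errors divided by total recorded requests in the window, which the text does not spell out.

```python
import time
from collections import deque
from typing import List, Optional

REQUIRED_TOOL = "mcp__molecule-platform__create_workspace"


class RollingErrorWindow:
    """Workspace-side tracker over a rolling window (60 s by default)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events: deque = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._events.append((now, is_error))

    def error_rate(self, now: Optional[float] = None) -> float:
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events)


def error_rate_transition(status: str, error_rate: float) -> str:
    """Apply the two error-rate rows; any other case keeps the status."""
    if status == "online" and error_rate >= 0.5:
        return "degraded"
    if status == "degraded" and error_rate < 0.1:
        return "online"
    return status


def mcp_presence_verdict(mcp_server_present: bool,
                         loaded_mcp_tools: Optional[List[str]],
                         grace_expired: bool) -> str:
    """Apply the two MCP-presence rows for a kind=platform concierge."""
    if not mcp_server_present:
        return "refuse_online"  # RCA#2970 row
    if loaded_mcp_tools is not None and REQUIRED_TOOL in loaded_mcp_tools:
        return "pass"
    # Absent or missing the required tool: degrade only past the grace window.
    return "degraded" if grace_expired else "pending"


window = RollingErrorWindow()
window.record(True, now=0.0)    # e.g. a 5xx or a timeout
window.record(False, now=10.0)  # a successful request
```

The gap between the `0.5` and `0.1` thresholds gives the status hysteresis: a workspace hovering around one threshold does not flap between `online` and `degraded` on every heartbeat.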
@@ -0,0 +1,133 @@
# Runtime ↔ Platform ↔ Plugin: division of responsibilities

> **Status:** Committed architecture (see [ADR-003](/adr/ADR-003-runtime-platform-plugin-responsibilities)).
> This page is the single canonical statement of *who adapts to whom*. If you are
> wiring a new runtime, a new plugin, or anything the online/degraded gate reads,
> start here.

Molecule runs the **same agent code across many runtimes** (claude-code, codex,
hermes, openclaw, gemini-cli, …) and exposes the **same capabilities** (the
management MCP, A2A, memory) regardless of which runtime an org picks. Two
adapter layers make that work, and they have **opposite directions**:

| Layer | Adapts… | …to… | Lives in |
|-------|---------|------|----------|
| **Runtime** | the agent | **the platform** | `molecule-ai-workspace-runtime` (shared pip pkg) |
| **Plugin** | its abilities | **each runtime** | the plugin (descriptor) + per-runtime renderers |

Stated as one sentence each:

- **The runtime's job is to adapt the agent to the platform.** It registers,
  heartbeats, and reports the status the platform's online/degraded gate reads —
  **runtime-agnostically, in one place**. The platform never asks "which runtime
  is this?"; it reads a fixed status contract.
- **The plugin's job is to adapt its abilities to each runtime.** It ships one
  runtime-agnostic descriptor; a per-runtime renderer writes that descriptor into
  the runtime's *native* MCP config. The same plugin works on every runtime, and
  the agent stays runtime-switchable.

## 1. Runtime → Platform: the status contract

The runtime reports to the platform on **register + every heartbeat** via one
payload builder (`platform_agent_identity.identity_gate_payload`). The platform's
fail-closed online/degraded gate (`workspace-server/internal/handlers/registry.go`)
consumes it. The load-bearing fields:

| Field | Meaning | Consumed by |
|-------|---------|-------------|
| `mcp_server_present` | the management `molecule-platform` MCP is **declared/wired** in the active runtime's native config | RCA#2970 online gate |
| `loaded_mcp_tools` | the management MCP's tools **actually loaded** into the model (the `mcp__<server>__<tool>` ids) | core#3082 degrade gate |

**Both halves are required.** "Declared" (`mcp_server_present`) is necessary but
not sufficient — a declared-but-dead server is the exact false-green core#3082
catches, so the runtime must also report what *loaded* (`loaded_mcp_tools`). The
runtime produces `loaded_mcp_tools` **at init** (it enumerates the connected MCP
servers itself — `loaded_mcp_tools_probe.py`), so the gate can flip
`degraded → online` **without waiting for a user turn**. This enumeration is
runtime-agnostic: it speaks the MCP wire protocol directly, not any one SDK's
tool-list message.

**Rule:** any field the gate reads MUST have a runtime-side **producer** that is
actually wired (not just a scaffold) and a **liveness test**. A producer that is
unreferenced serializes (`omitempty`) identically to a correctly-warming-up one —
that is how a half-wired producer silently degrades every concierge (the
2026-06-25 incident).

> **Current state vs target.** Today the required tool id is a Go **literal
> constant** (`conciergePlatformMCPCreateWorkspaceTool`) guarded by a contract
> **drift test**, and `contracts/mcp-plugin-delivery.contract.json` pins the
> management **server name** + the **status-field** shape — it does **not** yet
> pin the full tool id, and the runtime does not derive it (it enumerates the
> live MCP). **Target** (guardrail/SSOT workstream, in progress): pin the full
> id `mcp__molecule-platform__create_workspace` in the contract and have core's
> gate **derive** it, so a rename on either side fails the (blocking) drift
> check rather than agreeing by convention.

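The target state in the note above (pin the id once, derive it wherever it is needed, and fail on drift) might look like the following. This is a sketch under stated assumptions: the `CONTRACT` keys `management_server` and `required_tool` and the helper names are hypothetical and do not reflect the actual shape of `contracts/mcp-plugin-delivery.contract.json`. Only the `mcp__<server>__<tool>` id format, the server name, and the tool name come from the text.

```python
CONTRACT = {
    "management_server": "molecule-platform",  # hypothetical key name
    "required_tool": "create_workspace",       # hypothetical key name
}


def namespaced_tool_id(server: str, tool: str) -> str:
    """Build an id in the mcp__<server>__<tool> format."""
    return f"mcp__{server}__{tool}"


def required_tool_id(contract: dict) -> str:
    """Derive the required id from the contract instead of re-spelling it."""
    return namespaced_tool_id(
        contract["management_server"], contract["required_tool"]
    )


def check_no_drift(contract: dict, literal_in_code: str) -> None:
    """Fail loudly when a literal held in code disagrees with the contract."""
    expected = required_tool_id(contract)
    if literal_in_code != expected:
        raise AssertionError(
            f"tool id drift: code has {literal_in_code!r}, "
            f"contract derives {expected!r}"
        )


derived = required_tool_id(CONTRACT)
```

With a check of this shape made blocking, a rename on either side turns CI red, instead of the two layers continuing to agree only by convention.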
## 2. Plugin → Runtime: per-runtime rendering

The plugin declaration is the **single source of truth** for its MCP descriptor
(server name, command, args, env). A per-runtime **shape adapter** renders that
one descriptor into the runtime's native config:

| Runtime | Native MCP config |
|---------|-------------------|
| claude-code | `.claude/settings.json` (`mcpServers`) |
| codex | `~/.codex/config.toml` |
| gemini-cli | `settings.json` |
| openclaw | `~/.openclaw/openclaw.json` |

Renderers + their inverse readers + the present-probe live in
`molecule-ai-workspace-runtime/molecule_runtime/mcp_render.py`
(`_RUNTIME_SPECS` / `render_for_runtime` / `read_mcp_servers_for` /
`management_mcp_present_for`), dispatched **by runtime name** — never by a baked
image. See [agentskills-compat](/plugins/agentskills-compat) and
[cli-runtime](/agent-runtime/cli-runtime) for the plugin author's view.

**Rule:** adding a runtime means adding its renderer **and** reader **and**
present-probe together (kept in lockstep by a test). An unmapped runtime must
fail **closed**, never silently fall back to `.claude/settings.json` (the #3159
mis-attribution class).

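The dispatch-by-runtime-name pattern and the fail-closed rule above can be sketched in a few lines. This is not the real `mcp_render.py`: the tables `RENDERERS` and `READERS`, the helper names, the descriptor's `command` and `args` values, and the simplified hand-written TOML are all assumptions made for illustration. Only two runtimes are shown, and everything stays in memory rather than touching config files.

```python
import json
import re

# Hypothetical descriptor; the command and args values are made up.
DESCRIPTOR = {"name": "molecule-platform", "command": "molecule-mcp",
              "args": ["--stdio"], "env": {}}


def render_claude_code(d: dict) -> str:
    entry = {"command": d["command"], "args": d["args"], "env": d["env"]}
    return json.dumps({"mcpServers": {d["name"]: entry}}, indent=2)


def read_claude_code(text: str) -> list:
    return sorted(json.loads(text).get("mcpServers", {}))


def render_codex(d: dict) -> str:
    # Simplified TOML table; assumed shape, env omitted for brevity.
    args = ", ".join(json.dumps(a) for a in d["args"])
    return (f'[mcp_servers.{d["name"]}]\n'
            f'command = {json.dumps(d["command"])}\n'
            f'args = [{args}]\n')


def read_codex(text: str) -> list:
    pattern = r"^\[mcp_servers\.([A-Za-z0-9_-]+)\]$"
    return sorted(re.findall(pattern, text, flags=re.MULTILINE))


RENDERERS = {"claude-code": render_claude_code, "codex": render_codex}
READERS = {"claude-code": read_claude_code, "codex": read_codex}


class UnmappedRuntimeError(LookupError):
    """Raised instead of falling back to another runtime's config."""


def render_for(runtime: str, descriptor: dict) -> str:
    try:
        renderer = RENDERERS[runtime]
    except KeyError:
        raise UnmappedRuntimeError(f"no renderer for {runtime!r}") from None
    return renderer(descriptor)


def management_mcp_present(runtime: str, config_text: str,
                           server: str = "molecule-platform") -> bool:
    reader = READERS.get(runtime)
    if reader is None:
        return False  # fail closed: unknown runtime reports "not present"
    return server in reader(config_text)
```

A lockstep test in this sketch reduces to comparing the key sets of the two tables and round-tripping each renderer through its reader, so a renderer added without its reader fails immediately.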
## 3. Corollaries

- **No logic in the wrong layer.** The platform-facing gate has zero
  `if runtime == …` branches; runtime selection happens once via `ADAPTER_MODULE`
  → `get_adapter`, and per-runtime shape happens via the `mcp_render` dispatch.
  No platform logic is baked into a plugin.
- **Tool ids should be SSOT.** Re-spelling `mcp__molecule-platform__create_workspace`
  per layer is how the codex#142 `mcp__platform__` vs `mcp__molecule-platform__`
  drift happened. *Current state:* core holds it as a literal const guarded by a
  drift test; the runtime enumerates it from the live MCP (no literal). *Target:*
  pin the full id in the contract and derive core's gate from it (in progress).
- **Platform-ness is a composition, not an image.** A platform concierge is an
  **ordinary runtime image** plus (a) the org-admin key and (b) the management
  MCP plugin — *not* a special baked `molecule-platform-agent` image. The baked
  image structurally binds the concierge to claude-code; for any non-claude-code
  concierge it is not just redundant but **wrong**
  ([rfc-platform-mcp-as-plugin](/design/rfc-platform-mcp-as-plugin) §3.4). Detect
  "is this a concierge?" via the **composition** (`mcp_server_present()`), never
  the baked-image marker (`MOLECULE_PLATFORM_AGENT_IMAGE_BAKED`), which is absent
  on a de-baked concierge.

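The last corollary can be illustrated with two detection functions side by side. Both function names and the shapes of their inputs are hypothetical; only the marker name `MOLECULE_PLATFORM_AGENT_IMAGE_BAKED` and the server name `molecule-platform` come from the text, and the real check is `mcp_server_present()`.

```python
from typing import Iterable, Mapping

MANAGEMENT_SERVER = "molecule-platform"


def is_concierge_by_marker(env: Mapping[str, str]) -> bool:
    """Anti-pattern: the marker is absent on a de-baked concierge."""
    return bool(env.get("MOLECULE_PLATFORM_AGENT_IMAGE_BAKED"))


def is_concierge_by_composition(declared_servers: Iterable[str]) -> bool:
    """Detect by what is wired in, not by which image was built."""
    return MANAGEMENT_SERVER in set(declared_servers)


# A de-baked concierge: ordinary image, management MCP declared.
debaked_env: dict = {}
debaked_servers = ["molecule-platform"]
by_marker = is_concierge_by_marker(debaked_env)
by_composition = is_concierge_by_composition(debaked_servers)
```

For this de-baked concierge the marker check answers no while the composition check answers yes, which is the mismatch the corollary warns about.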
## 4. How this is enforced (guardrails)

Each rule should be a red-on-regression test so the principle can't drift back
into tribal knowledge. **Status is honest** — ✅ enforced in CI today,
◻ target/in-progress (guardrail/SSOT workstream, post the 2026-06-25 audit):

| Rule | Guardrail | Status |
|------|-----------|--------|
| Plugin renders per-runtime native config | `test_mcp_plugin_delivery_contract` (codex writes `config.toml`, not claude settings) | ✅ |
| No runtime branching in the gate | gate has zero `if runtime==`; `…RoutesThroughPort` | ✅ |
| Unmapped runtime fails closed | G6 `test_mcp_render_completeness_g6` | ✅ |
| Prompt SSOT (filename / channel) | G0 `test_prompt_filename_ssot_g0`, G1 `test_prompt_channel_ssot_g1`, `test_mcp_ssot` | ✅ |
| `loaded_mcp_tools` producer fires through the real gate | `test_debaked_concierge_runs_via_mcp_server_present` (runtime#181) | ✅ (partial) |
| Renderer/reader/present-probe lockstep | `test_mcp_render_lockstep` (`set(_RUNTIME_SPECS)==set(_RUNTIME_READERS)`) | ◻ target |
| Full tool id derived from a shared contract | pin `mcp__molecule-platform__create_workspace` in the contract + derive core's gate | ◻ target |
| `loaded_mcp_tools` / required tool pinned both sides + blocking drift | `mcp-plugin-delivery-contract-drift` made fail-closed, runtime copy in the compare set | ◻ target |
| Producer-liveness via the full `main.py` boot path | boot test (real gate, no `force`) | ◻ target |
| Concierge reaches online + has `create_workspace` | **current**: e2e LLM self-enumeration; **target**: deterministic `status=online` + heartbeat `loaded_mcp_tools` ∋ `create_workspace` | ◻ target (pending the deterministic e2e PR) |
| Baked image cannot return | de-bake absence guard (fails if `Dockerfile.platform-agent` / `resolvePlatformAgentImage` / baked publish job reappear) | ◻ target (#78) |

See the guardrail audit (2026-06-25) for the full enforced/gap analysis; the ◻
items are tracked under the guardrail/SSOT workstream.
@@ -63,6 +63,7 @@ features:
- [Product Overview](/product/overview)
- [Product Narrative](/product/molecule-product-doc)
- [System Architecture](/architecture/architecture)
- [Runtime ↔ Platform ↔ Plugin Responsibilities](/architecture/runtime-platform-plugin-responsibilities)
- [Comprehensive Technical Documentation](/architecture/molecule-technical-doc)
- [Memory Architecture](/architecture/memory)
- [Workspace Runtime](/agent-runtime/workspace-runtime)