Merge pull request #2261 from Molecule-AI/fix/harness-cleanup-failed-event
harness: SaaS routing + provider-agnostic config for RFC #2251 measurement
Commit a18d116606

scripts/README.md (new file, +47 lines)

# scripts/

Operational and one-off scripts for molecule-core. Most are
self-documenting — see the header comments in each file.

## RFC #2251 coordinator task-bound harnesses

There are three related scripts; pick the right one:

| Script | Purpose | Targets |
|---|---|---|
| `measure-coordinator-task-bounds.sh` | **Canonical** v1 harness for the RFC #2251 / Issue 4 reproduction. Provisions a PM coordinator + Researcher child via `claude-code-default` + `langgraph` templates, sends a synthesis-heavy A2A kickoff, observes elapsed time + heartbeat trace. | OSS-shape platform — localhost or any `/workspaces`-shaped endpoint. Has tenant/admin-token guards for non-localhost runs. |
| `measure-coordinator-task-bounds-runner.sh` | Generalised runner for the same measurement contract but with **arbitrary template + secret + model combinations** (Hermes/MiniMax, etc.). Useful for cross-runtime variants without modifying the canonical harness. | Same as above (local or SaaS via `MODE=saas`). |
| `measure-coordinator-task-bounds.sh` (in [molecule-controlplane](https://github.com/Molecule-AI/molecule-controlplane)) | **Production-shape** variant that bootstraps a real staging tenant via `POST /cp/admin/orgs`, then runs the same measurement against `<slug>.staging.moleculesai.app`. | Staging controlplane only — refuses to run against production. |

See `reference_harness_pair_pattern` (auto-memory) for when to use which
and the cross-repo design rationale.

### Common safety pattern across all three

- **Cleanup trap** on EXIT/INT/TERM auto-deletes provisioned resources.
- **`DRY_RUN=1`** prints the plan + auth fingerprint, then exits before any
  state mutation. Run this before pointing at staging or any shared
  infrastructure.
- **Non-target guard** refuses arbitrary endpoints (the controlplane
  variant is locked to `staging-api.moleculesai.app`; the OSS variant
  requires explicit auth + tenant scoping for a non-localhost `PLATFORM`).
- **Cleanup failures emit `cleanup_*_failed` events** with remediation
  hints; no silenced curl. An `ADMIN_TOKEN` expiring mid-run surfaces as a
  structured event rather than a silent leak.
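Because the harnesses emit NDJSON on stdout, a captured run log can be checked for these failure events mechanically. A minimal sketch, assuming the `{"ts","event","data"}` event shape the runner emits (the helper name is ours, not part of the harnesses):

```python
import json

def find_cleanup_failures(ndjson_text):
    """Return all cleanup_*_failed events from a captured harness run log."""
    failures = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        try:
            evt = json.loads(line)
        except ValueError:
            continue  # tolerate interleaved non-JSON lines (e.g. stderr summary)
        name = evt.get("event", "")
        if name.startswith("cleanup_") and name.endswith("_failed"):
            failures.append(evt)
    return failures
```

Feed it the whole stdout capture; anything it returns needs the manual teardown described under `cleanup-rogue-workspaces.sh` below.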

### Heartbeat trace caveat

If `heartbeat_trace.raw == "<endpoint_unavailable>"`, the per-workspace
`/heartbeat-history` endpoint isn't wired on the target build — the
bound measurement is INCONCLUSIVE on the platform-ceiling question.
Either wire the endpoint or replace it with the equivalent Datadog query.
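When post-processing a run log, this caveat reduces to one check on the final `heartbeat_trace` event. A sketch, assuming the `raw` field shapes the runner emits (the function name is ours):

```python
def heartbeat_trace_conclusive(event: dict) -> bool:
    """True if a heartbeat_trace event can speak to the platform-ceiling
    question; False for the unavailable/skipped placeholder payloads."""
    raw = event.get("data", {}).get("raw", "")
    return bool(raw) and raw != "<endpoint_unavailable>" and not raw.startswith("<skipped")
```

Treat a `False` here as "re-run after wiring the endpoint (or swap in the Datadog query)", not as evidence either way.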

## Other scripts

- `cleanup-rogue-workspaces.sh` — emergency teardown for leaked
  workspaces. Prompts for confirmation. Pair with the harnesses if a
  cleanup trap fails (see `cleanup_*_failed` events).
- `canary-smoke.sh` — quick smoke test for canary releases.
- `dev-start.sh` — local-dev platform bring-up.

The rest are self-documenting in their header comments.

scripts/measure-coordinator-task-bounds-runner.sh (new executable file, +273 lines)

#!/usr/bin/env bash
# Standalone runner for Issue 4 reproduction (RFC #2251) — exists alongside
# `measure-coordinator-task-bounds.sh` to support arbitrary template + secret
# combinations without modifying the canonical harness. The canonical harness
# stays focused on its v1 contract (claude-code-default + langgraph + OpenRouter);
# this runner wraps the same workspace-server API calls but takes everything as
# env-var inputs so a Hermes/MiniMax run can share the measurement code path.
#
# Two routing modes:
#   MODE=local (default) — direct workspace-server API
#   MODE=saas            — same calls routed to a tenant deployment; expects
#                          PLATFORM=<tenant base URL> plus Authorization
#                          (tenant admin token) and X-Molecule-Org-Id headers
#                          (see the SaaS auth chain below)
#
# Required env:
#   PLATFORM        workspace-server base URL (default http://localhost:8080)
#   PM_TEMPLATE     template slug for coordinator
#   CHILD_TEMPLATE  template slug for researcher child
#   SECRET_NAME     workspace_secrets key (e.g. MINIMAX_API_KEY)
#   SECRET_VALUE    the secret value (or read from $SECRET_NAME if unset)
#
# Optional:
#   MODEL               PUT /workspaces/:id/model after provision
#   SYNTHESIS_DEPTH=3   number of delegation rounds in the kickoff task
#   A2A_TIMEOUT=600     ceiling on measurement-side wait (seconds)
#   KEEP_WORKSPACES=0   skip cleanup-on-exit when 1 (for log inspection)
#   MODE=local|saas     local-dev vs SaaS routing posture
#   ORG_ID              required when MODE=saas; sent as X-Molecule-Org-Id
#   TENANT_ADMIN_TOKEN  sent as the Authorization bearer in SaaS mode; when
#                       unset, fetched via ORG_SLUG + CP_ADMIN_API_TOKEN
#
# Output: NDJSON event stream on stdout + a human summary on stderr.
#
set -euo pipefail

PLATFORM="${PLATFORM:-http://localhost:8080}"
MODE="${MODE:-local}"
PM_TEMPLATE="${PM_TEMPLATE:?PM_TEMPLATE is required (e.g. claude-code-default, hermes)}"
CHILD_TEMPLATE="${CHILD_TEMPLATE:?CHILD_TEMPLATE is required}"
SECRET_NAME="${SECRET_NAME:?SECRET_NAME is required (e.g. MINIMAX_API_KEY)}"
MODEL="${MODEL:-}"
SYNTHESIS_DEPTH="${SYNTHESIS_DEPTH:-3}"
A2A_TIMEOUT="${A2A_TIMEOUT:-600}"
KEEP_WORKSPACES="${KEEP_WORKSPACES:-0}"

# SaaS-mode auth chain: workspace-server (per-tenant Go binary on EC2)
# requires BOTH headers:
#   Authorization: Bearer <tenant-admin-token>   (per-tenant secret)
#   X-Molecule-Org-Id: <org-uuid>                (TenantGuard middleware)
# The tenant-admin-token is provisioned by controlplane and retrievable via:
#   GET /cp/admin/orgs/<slug>/admin-token        (CP_ADMIN_API_TOKEN bearer-gated)
# The runner can either:
#   1. Take ORG_SLUG + CP_ADMIN_API_TOKEN and fetch the tenant token itself, or
#   2. Take ORG_ID + TENANT_ADMIN_TOKEN directly.
ORG_ID="${ORG_ID:-}"
ORG_SLUG="${ORG_SLUG:-}"
TENANT_ADMIN_TOKEN="${TENANT_ADMIN_TOKEN:-}"
CP_ADMIN_API_TOKEN="${CP_ADMIN_API_TOKEN:-}"
CP_API_URL="${CP_API_URL:-https://staging-api.moleculesai.app}"

# Resolve secret value: $SECRET_VALUE, else the env var named by
# $SECRET_NAME, else error.
SECRET_VALUE="${SECRET_VALUE:-}"
if [ -z "$SECRET_VALUE" ]; then
  SECRET_VALUE="$(printenv "$SECRET_NAME" 2>/dev/null || true)"
fi
[ -n "$SECRET_VALUE" ] || { echo "ERROR: set \$$SECRET_NAME or \$SECRET_VALUE" >&2; exit 1; }

# SaaS-mode preflight + format validation.
# Validating ORG_ID + ORG_SLUG client-side gives an actionable error
# before the request hits TenantGuard's intentionally-opaque 404
# (which doesn't tell the operator whether the slug is wrong, the
# UUID is wrong, or auth is wrong).
if [ "$MODE" = "saas" ]; then
  [ -n "$ORG_ID" ] || { echo "ERROR: MODE=saas requires ORG_ID (the org UUID)" >&2; exit 1; }
  case "$ORG_ID" in
    [0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]) ;;
    *) echo "ERROR: ORG_ID must be a UUID (got '$ORG_ID')" >&2; exit 1;;
  esac
  if [ -n "$ORG_SLUG" ]; then
    case "$ORG_SLUG" in
      *[!a-z0-9-]* | -* | *-) echo "ERROR: ORG_SLUG must match ^[a-z0-9][a-z0-9-]*[a-z0-9]\$ (got '$ORG_SLUG')" >&2; exit 1;;
    esac
  fi
  if [ -z "$TENANT_ADMIN_TOKEN" ]; then
    [ -n "$ORG_SLUG" ] || { echo "ERROR: MODE=saas needs TENANT_ADMIN_TOKEN or ORG_SLUG (to fetch it via CP)" >&2; exit 1; }
    [ -n "$CP_ADMIN_API_TOKEN" ] || { echo "ERROR: ORG_SLUG path needs CP_ADMIN_API_TOKEN to fetch tenant token from $CP_API_URL" >&2; exit 1; }
    TENANT_ADMIN_TOKEN=$(curl -s -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
      "$CP_API_URL/cp/admin/orgs/$ORG_SLUG/admin-token" \
      | python3 -c "import sys,json; print(json.load(sys.stdin).get('admin_token',''))" 2>/dev/null || echo "")
    [ -n "$TENANT_ADMIN_TOKEN" ] || { echo "ERROR: failed to resolve tenant admin token via $CP_API_URL/cp/admin/orgs/$ORG_SLUG/admin-token" >&2; exit 1; }
  fi
fi

# GNU date supports %3N (milliseconds); BSD/macOS date prints "%3N"
# literally instead of failing, so check the output rather than the exit
# status before falling back to second precision.
ts() {
  local t
  t=$(date -u +%Y-%m-%dT%H:%M:%S.%3NZ 2>/dev/null)
  case "$t" in *%*) t=$(date -u +%Y-%m-%dT%H:%M:%SZ) ;; esac
  printf '%s\n' "$t"
}
emit() { printf '{"ts":"%s","event":"%s","data":%s}\n' "$(ts)" "$1" "${2:-null}"; }

api() {
  local args=()
  if [ "$MODE" = "saas" ]; then
    args+=(-H "Authorization: Bearer $TENANT_ADMIN_TOKEN")
    args+=(-H "X-Molecule-Org-Id: $ORG_ID")
  fi
  curl -s ${args[@]+"${args[@]}"} "$@"
}

PM_ID=""
CHILD_ID=""
cleanup() {
  local rc=$?
  set +e
  if [ "$KEEP_WORKSPACES" = "1" ]; then
    emit "cleanup_skipped" "{\"reason\":\"KEEP_WORKSPACES=1\",\"pm_id\":\"$PM_ID\",\"child_id\":\"$CHILD_ID\"}"
    return $rc
  fi
  for id in "$CHILD_ID" "$PM_ID"; do
    [ -z "$id" ] && continue
    code=$(api -o /dev/null -w '%{http_code}' -X DELETE "$PLATFORM/workspaces/$id" 2>/dev/null || echo "curl_err")
    if [ "$code" = "200" ] || [ "$code" = "204" ] || [ "$code" = "404" ]; then
      emit "cleanup_deleted" "{\"workspace_id\":\"$id\",\"http_code\":\"$code\"}"
    else
      emit "cleanup_failed" "{\"workspace_id\":\"$id\",\"http_code\":\"$code\"}"
    fi
  done
  return $rc
}
trap cleanup EXIT INT TERM

emit "run_started" "{\"platform\":\"$PLATFORM\",\"mode\":\"$MODE\",\"pm_template\":\"$PM_TEMPLATE\",\"child_template\":\"$CHILD_TEMPLATE\",\"model\":\"$MODEL\",\"secret_name\":\"$SECRET_NAME\",\"synthesis_depth\":$SYNTHESIS_DEPTH,\"a2a_timeout_secs\":$A2A_TIMEOUT}"

# ---- Provision via JSON-encoded bodies (defends against templates/values
# with embedded shell-special chars). ----
pm_body=$(python3 -c '
import json, sys
print(json.dumps({"name":"PM","role":"Coordinator — delegates and synthesizes","tier":2,"template":sys.argv[1]}))' "$PM_TEMPLATE")
child_body=$(python3 -c '
import json, sys
print(json.dumps({"name":"Researcher","role":"Returns short research findings","tier":2,"template":sys.argv[1]}))' "$CHILD_TEMPLATE")
secret_body=$(python3 -c '
import json, sys
print(json.dumps({"key":sys.argv[1],"value":sys.argv[2]}))' "$SECRET_NAME" "$SECRET_VALUE")

emit "provisioning_pm" "{\"template\":\"$PM_TEMPLATE\"}"
R=$(api -X POST "$PLATFORM/workspaces" -H 'Content-Type: application/json' -d "$pm_body")
PM_ID=$(echo "$R" | python3 -c "import sys,json; print(json.load(sys.stdin).get('id',''))" 2>/dev/null || echo "")
[ -n "$PM_ID" ] || { echo "ERROR: PM create failed — response: $R" >&2; exit 1; }
emit "pm_provisioned" "{\"workspace_id\":\"$PM_ID\"}"

emit "provisioning_child" "{\"template\":\"$CHILD_TEMPLATE\"}"
R=$(api -X POST "$PLATFORM/workspaces" -H 'Content-Type: application/json' -d "$child_body")
CHILD_ID=$(echo "$R" | python3 -c "import sys,json; print(json.load(sys.stdin).get('id',''))" 2>/dev/null || echo "")
[ -n "$CHILD_ID" ] || { echo "ERROR: child create failed — response: $R" >&2; exit 1; }
emit "child_provisioned" "{\"workspace_id\":\"$CHILD_ID\"}"

api -X PATCH "$PLATFORM/workspaces/$CHILD_ID" -H 'Content-Type: application/json' \
  -d "{\"parent_id\":\"$PM_ID\"}" > /dev/null

# Seed the secret on BOTH workspaces: Hermes/MiniMax runs need it on both
# sides, and templates that ignore unknown env vars treat the extra key
# as a no-op.
for id in "$PM_ID" "$CHILD_ID"; do
  api -X POST "$PLATFORM/workspaces/$id/secrets" -H 'Content-Type: application/json' -d "$secret_body" > /dev/null
done
emit "secrets_seeded" "{\"key\":\"$SECRET_NAME\",\"workspaces\":[\"$PM_ID\",\"$CHILD_ID\"]}"

if [ -n "$MODEL" ]; then
  model_body=$(python3 -c 'import json,sys; print(json.dumps({"model":sys.argv[1]}))' "$MODEL")
  for id in "$PM_ID" "$CHILD_ID"; do
    api -X PUT "$PLATFORM/workspaces/$id/model" -H 'Content-Type: application/json' -d "$model_body" > /dev/null
  done
  emit "model_set" "{\"model\":\"$MODEL\",\"workspaces\":[\"$PM_ID\",\"$CHILD_ID\"]}"
fi

# ---- Wait for both online ----
WAIT_ONLINE_SECS="${WAIT_ONLINE_SECS:-180}"
wait_online() {
  local id="$1" label="$2"
  # Round up so a non-multiple-of-3 budget waits at least the requested
  # seconds (200 → 67 polls × 3s = 201s, not 198s).
  local polls=$(( (WAIT_ONLINE_SECS + 2) / 3 ))
  local last_status=""
  for i in $(seq 1 "$polls"); do
    s=$(api "$PLATFORM/workspaces/$id" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))" 2>/dev/null || echo "")
    if [ "$s" != "$last_status" ]; then
      emit "status_change" "{\"workspace\":\"$label\",\"from\":\"$last_status\",\"to\":\"$s\",\"poll\":$i}"
      last_status="$s"
    fi
    [ "$s" = "online" ] && { emit "online" "{\"workspace\":\"$label\",\"after_polls\":$i,\"after_secs\":$((i * 3))}"; return 0; }
    [ "$s" = "failed" ] && { emit "failed" "{\"workspace\":\"$label\"}"; return 1; }
    sleep 3
  done
  emit "online_timeout" "{\"workspace\":\"$label\",\"last_status\":\"$last_status\",\"waited_secs\":$WAIT_ONLINE_SECS}"
  return 1
}
wait_online "$PM_ID" "PM" || exit 2
wait_online "$CHILD_ID" "child" || exit 2

# ---- Build a synthesis-heavy kickoff task ----
TASK="You are coordinating a research analysis. Delegate $SYNTHESIS_DEPTH separate sub-questions to the Researcher (one at a time, sequentially — wait for each response before sending the next), then synthesize all findings into a single coherent report. Sub-questions: (a) historical context of distributed consensus, (b) modern Byzantine-fault-tolerant protocols, (c) practical trade-offs between Raft and Paxos. After all delegations complete, write a 600-word synthesis comparing the three responses and drawing one cross-cutting insight. Do not respond until the synthesis is complete."

# ---- A2A kickoff round-trip ----
emit "a2a_kickoff_sent" "{\"to\":\"$PM_ID\",\"task_chars\":${#TASK}}"
START_NS=$(python3 -c 'import time; print(int(time.time_ns()))')

a2a_body=$(python3 -c '
import json, sys
print(json.dumps({"method":"message/send","params":{"message":{"role":"user","parts":[{"type":"text","text":sys.argv[1]}]}}}))' "$TASK")

RESP=$(api --max-time "$A2A_TIMEOUT" -X POST "$PLATFORM/workspaces/$PM_ID/a2a" \
  -H "Content-Type: application/json" -d "$a2a_body" || echo "<curl_failed_or_timed_out>")

END_NS=$(python3 -c 'import time; print(int(time.time_ns()))')
ELAPSED_SECS=$(python3 -c "print(round(($END_NS - $START_NS) / 1e9, 2))")

emit "a2a_response_observed" "{\"elapsed_secs\":$ELAPSED_SECS,\"response_chars\":${#RESP},\"response_head\":$(python3 -c "import json,sys; print(json.dumps(sys.argv[1][:200]))" "$RESP")}"

# ---- Heartbeat trace ----
# `/workspaces/:id/heartbeat-history` is a local-dev workspace-server
# route — on tenant deployments the platform's :8080 fallback proxies
# any unmatched path to the canvas Next.js, so this 404s with 28KB of
# HTML rather than a clean error. Skip the fetch entirely in SaaS mode
# and emit an explicit placeholder instead of polluting the event log
# with HTML.
emit "fetching_heartbeat_trace" "{\"mode\":\"$MODE\"}"
if [ "$MODE" = "saas" ]; then
  emit "heartbeat_trace" "{\"raw\":\"<skipped: heartbeat-history endpoint unavailable in SaaS — see scripts/README.md>\"}"
else
  HB=$(api "$PLATFORM/workspaces/$PM_ID/heartbeat-history?since_secs=$A2A_TIMEOUT" 2>&1 || echo "<endpoint_unavailable>")
  emit "heartbeat_trace" "{\"raw\":$(python3 -c "import json,sys; print(json.dumps(sys.argv[1]))" "$HB")}"
fi

# ---- rfc2251_phase log lines from the workspace container ----
# Local Docker provisioner: workspace container name is workspace-<id>.
# SaaS: container is on EC2 — skip log capture, fall back to heartbeat only.
if [ "$MODE" = "local" ] && command -v docker >/dev/null 2>&1; then
  for id in "$PM_ID"; do
    container=$(docker ps --filter "name=workspace-$id" --format '{{.Names}}' | head -1)
    if [ -n "$container" ]; then
      phase_log=$(docker logs --since "${A2A_TIMEOUT}s" "$container" 2>&1 | grep 'rfc2251_phase=' || echo "<no rfc2251_phase log lines — container running stale image without #2255 instrumentation>")
      emit "phase_log" "{\"workspace_id\":\"$id\",\"container\":\"$container\",\"raw\":$(python3 -c "import json,sys; print(json.dumps(sys.argv[1]))" "$phase_log")}"
    fi
  done
fi

emit "run_completed" "{\"elapsed_secs\":$ELAPSED_SECS,\"pm_id\":\"$PM_ID\",\"child_id\":\"$CHILD_ID\"}"

cat <<EOF >&2

=========================================
Measurement complete. (RFC #2251 / Issue 4 repro)
Mode:                 $MODE
Coordinator template: $PM_TEMPLATE
Child template:       $CHILD_TEMPLATE
Model:                ${MODEL:-<template default>}
Coordinator response: ${ELAPSED_SECS}s
PM workspace:         $PM_ID
Child workspace:      $CHILD_ID
=========================================

Interpretation:

ELAPSED < 60        → Synthesis fast; not informative about platform bounds.
                      Re-run with SYNTHESIS_DEPTH=8 for a longer synthesis.

60 <= ELAPSED < 300 → Within DELEGATION_TIMEOUT. Doesn't prove or refute
                      Issue 4 — an HTTP-level timeout would be sufficient.

ELAPSED >= 300      → BUG CONFIRMED IF heartbeat_trace shows no platform-side
                      transition. The coordinator ran past DELEGATION_TIMEOUT
                      without any platform ceiling kicking in — exactly the
                      gap V1.0 plans to close with MAX_TASK_EXECUTION_SECS.

curl_failed_or_timed_out → \$A2A_TIMEOUT exceeded. The coordinator likely
                      hung, or synthesis is just very slow.

EOF

workspace/platform_tools/README.md (new file, +107 lines)

# Platform tool registry

Single source of truth for every tool the platform exposes to agents
(A2A delegation, hierarchical memory, broadcast, introspection).

## Why this exists

Pre-#2240, three places independently declared each tool:

1. **MCP server** (`workspace/a2a_mcp_server.py`) — the `TOOLS` JSON list
2. **LangChain `@tool` wrappers** (`workspace/builtin_tools/{delegation,memory}.py`)
3. **Agent-facing system-prompt docs** (`workspace/executor_helpers.py`)

Adding a tool to one and forgetting the others happened repeatedly. The
canonical case: `send_message_to_user` was registered in MCP TOOLS but
the executor_helpers doc string never mentioned it, so agents saw the
tool as available but had no usage guidance — a silent capability
regression.

## What the registry does

`registry.py` defines each tool ONCE as a frozen `ToolSpec`:

```python
ToolSpec(
    name="delegate_task",
    short="Delegate a task to a peer workspace via A2A and WAIT for the response.",
    when_to_use="Use for QUICK questions and small sub-tasks where you can afford to wait inline...",
    input_schema={...},       # JSON Schema, consumed by MCP server
    impl=tool_delegate_task,  # the actual coroutine
    section="a2a",            # which prompt section it belongs to
)
```

Adapters consume specs; no hardcoded names anywhere else:

- **MCP server** builds its `TOOLS` list from `_PLATFORM_TOOL_SPECS` at import time
- **LangChain `@tool` wrappers** read `name=spec.name` from the registry
- **Doc generator** (`executor_helpers._render_section()`) produces the
  system-prompt block from `spec.short` (bullet) + `spec.when_to_use`
  (heading + paragraph)
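The shape of that consumption can be sketched as follows. This is an illustrative reconstruction, not the real `registry.py` or adapters: the field set comes from the `ToolSpec` example above, `impl` is omitted to keep the sketch self-contained, and the helper names are ours.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:  # minimal stand-in with the documented fields
    name: str
    short: str
    when_to_use: str
    input_schema: dict
    section: str

SPECS = [
    ToolSpec("delegate_task",
             "Delegate a task to a peer workspace.",
             "Use for quick questions you can afford to wait on inline.",
             {"type": "object"}, "a2a"),
]

def mcp_tools(specs):
    # MCP-server view: derive the TOOLS list from the specs, never hardcode.
    return [{"name": s.name, "description": s.short, "inputSchema": s.input_schema}
            for s in specs]

def render_section(specs, section):
    # Doc-generator view: a bullet from `short`, then a when-to-use paragraph.
    lines = []
    for s in (x for x in specs if x.section == section):
        lines.append(f"- `{s.name}`: {s.short}")
        lines.append(f"  {s.when_to_use}")
    return "\n".join(lines)
```

Both views are pure functions of the spec list, so a rename or removal in one place propagates everywhere.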

## CLI subprocess block — special case

Non-MCP runtimes (ollama, custom subprocess adapters) use a separate
hand-maintained block in `executor_helpers._A2A_INSTRUCTIONS_CLI` because
the CLI subcommand vocabulary (`peers`, `delegate`, `status`, `info`)
differs from the MCP tool names (`list_peers`, `delegate_task`, etc.).
Auto-generation would lose the readable invocation syntax.

Alignment is enforced via `_CLI_A2A_COMMAND_KEYWORDS` (in
`executor_helpers.py`): every a2a-section spec must be keyed there with
either a CLI subcommand keyword OR an explicit `None` if the tool is
intentionally not exposed via subprocess (e.g.
`send_message_to_user`, because its structured `attachments` field
doesn't survive positional-arg shell invocation).
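The enforced invariant reduces to a set check; a sketch with illustrative entries (this is not the real mapping table):

```python
# a2a-section tool names, as an adapter would read them from the registry.
a2a_tool_names = {"list_peers", "delegate_task", "send_message_to_user"}

# Every a2a tool must be keyed: either with its CLI subcommand keyword,
# or with an explicit None meaning "deliberately not exposed via subprocess".
cli_keywords = {
    "list_peers": "peers",
    "delegate_task": "delegate",
    "send_message_to_user": None,  # attachments don't survive positional args
}

def unmapped_tools(tool_names, keywords):
    # Tools added to the registry without a CLI mapping decision.
    return tool_names - keywords.keys()
```

The explicit `None` is the important design choice: it records that someone decided not to expose the tool, rather than leaving "forgot" and "decided against" indistinguishable.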

## Tests that catch drift

`workspace/tests/test_platform_tools.py`:

| Test | What it catches |
|---|---|
| `test_mcp_server_registers_every_registry_tool` | MCP TOOLS list out of sync with registry |
| `test_mcp_tool_descriptions_match_registry_short` | hand-edited MCP description that drifted |
| `test_mcp_tool_input_schemas_match_registry` | schema duplicated in server file |
| `test_a2a_instructions_text_includes_every_a2a_tool` | doc generator missed a tool |
| `test_old_pre_rename_names_not_present_in_docs` | stale name leaked back in |
| `test_a2a_mcp_instructions_match_snapshot` | rendered shape (bullet ordering, headings, footers) drifted |
| `test_a2a_cli_instructions_match_snapshot` | CLI block edited in a way that changes shape |
| `test_hma_instructions_match_snapshot` | HMA section drifted |
| `test_cli_keyword_mapping_covers_every_a2a_tool` | tool added to registry without a CLI mapping decision |
| `test_cli_keyword_substrings_appear_in_cli_block` | CLI keyword in the mapping but missing from the doc block |

The snapshot files at `workspace/tests/snapshots/*.txt` are LF-pinned
in `.gitattributes` so a Windows contributor with `core.autocrlf=true`
doesn't get mysterious test failures.
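The registry-vs-MCP checks in that file reduce to simple membership and equality assertions. A sketch with illustrative data (the real tests import the registry and server modules rather than defining literals):

```python
# Registry view: tool name -> `short` description (illustrative entry).
registry_shorts = {
    "delegate_task": "Delegate a task to a peer workspace via A2A and WAIT for the response.",
}

# MCP-server view of the same tool, as the TOOLS list would expose it.
mcp_tools = [
    {"name": "delegate_task",
     "description": "Delegate a task to a peer workspace via A2A and WAIT for the response."},
]

def drift_errors(registry, tools):
    """Return human-readable drift findings between registry and MCP views."""
    errors = []
    by_name = {t["name"]: t["description"] for t in tools}
    for name, short in registry.items():
        if name not in by_name:
            errors.append(f"MCP server missing registry tool: {name}")
        elif by_name[name] != short:
            errors.append(f"MCP description drifted for: {name}")
    return errors
```

Returning findings instead of asserting inline keeps the sketch reusable; the real tests just assert the list is empty.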

## Adding a new tool

1. Append a `ToolSpec(...)` to `TOOLS` in `registry.py`.
2. Add the LangChain `@tool` wrapper in `workspace/builtin_tools/`
   (the wrapper body just calls `spec.impl`).
3. Update `_CLI_A2A_COMMAND_KEYWORDS` in `executor_helpers.py` — set the
   value to the CLI subcommand keyword, or to `None` if the tool isn't
   exposed via the subprocess interface.
4. Regenerate snapshots — see the comment block at the top of
   `workspace/tests/test_platform_tools.py` for the one-liner.
5. Run `pytest workspace/tests/test_platform_tools.py --no-cov`.
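Step 1 might look like this for a hypothetical `ping_peer` tool. Everything here is illustrative: `ping_peer`, `tool_ping_peer`, and the stand-in `ToolSpec` dataclass (which only mirrors the fields shown earlier in this README) do not exist in the codebase.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ToolSpec:  # stand-in; the real class lives in registry.py
    name: str
    short: str
    when_to_use: str
    input_schema: dict
    impl: Callable[..., Any]
    section: str

async def tool_ping_peer(peer_id: str) -> str:
    # Hypothetical coroutine; real impls live in workspace/builtin_tools/.
    return f"pong from {peer_id}"

new_spec = ToolSpec(  # the entry you would append to TOOLS in registry.py
    name="ping_peer",
    short="Check whether a peer workspace is reachable via A2A.",
    when_to_use="Use before delegating when a peer may be offline.",
    input_schema={"type": "object",
                  "properties": {"peer_id": {"type": "string"}},
                  "required": ["peer_id"]},
    impl=tool_ping_peer,
    section="a2a",
)
```

With `section="a2a"`, the coverage tests would then force the step-3 decision: give `ping_peer` a CLI keyword or an explicit `None`.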

## Renaming a tool

Edit `name` in `registry.py` only. Then:

1. The MCP TOOLS list rebuilds automatically.
2. The doc generator regenerates automatically (snapshots will fail
   the diff — regenerate them).
3. Search `workspace/` for the old literal in case a non-adapter
   consumer (tests, plugin code) hardcoded the old name; update those.
4. Update any `_CLI_A2A_COMMAND_KEYWORDS` key + the literal substring
   in `_A2A_INSTRUCTIONS_CLI` if applicable.

## Removing a tool

Delete the `ToolSpec` and the `_CLI_A2A_COMMAND_KEYWORDS` key. Adapters
and doc generators stop registering it automatically; the structural
tests prevent stale references from surviving.