diff --git a/scripts/README.md b/scripts/README.md new file mode 100644 index 00000000..3713429c --- /dev/null +++ b/scripts/README.md @@ -0,0 +1,47 @@ +# scripts/ + +Operational and one-off scripts for molecule-core. Most are +self-documenting — see the header comments in each file. + +## RFC #2251 coordinator task-bound harnesses + +There are three related scripts; pick the right one: + +| Script | Purpose | Targets | +|---|---|---| +| `measure-coordinator-task-bounds.sh` | **Canonical** v1 harness for the RFC #2251 / Issue 4 reproduction. Provisions a PM coordinator + Researcher child via `claude-code-default` + `langgraph` templates, sends a synthesis-heavy A2A kickoff, observes elapsed time + heartbeat trace. | OSS-shape platform — localhost or any `/workspaces`-shaped endpoint. Has tenant/admin-token guards for non-localhost runs. | +| `measure-coordinator-task-bounds-runner.sh` | Generalised runner for the same measurement contract but with **arbitrary template + secret + model combinations** (Hermes/MiniMax, etc.). Useful for cross-runtime variants without modifying the canonical harness. | Same as above (local or SaaS via `MODE=saas`). | +| `measure-coordinator-task-bounds.sh` (in [molecule-controlplane](https://github.com/Molecule-AI/molecule-controlplane)) | **Production-shape** variant that bootstraps a real staging tenant via `POST /cp/admin/orgs`, then runs the same measurement against `.staging.moleculesai.app`. | Staging controlplane only — refuses to run against production. | + +See `reference_harness_pair_pattern` (auto-memory) for when to use which +and the cross-repo design rationale. + +### Common safety pattern across all three + +- **Cleanup trap** on EXIT/INT/TERM auto-deletes provisioned resources. +- **`DRY_RUN=1`** prints plan + auth fingerprint, exits before any + state mutation. Run this before pointing at staging or any shared + infrastructure. +- **Non-target guard** refuses arbitrary endpoints (the controlplane + variant is locked to `staging-api.moleculesai.app`; the OSS variant + requires explicit auth + tenant scoping for non-localhost PLATFORM). +- **Cleanup failures emit `cleanup_*_failed` events** with remediation + hints; no silenced curl. ADMIN_TOKEN expiring mid-run surfaces as a + structured event rather than a silent leak. + +### Heartbeat trace caveat + +If `heartbeat_trace.raw == ""`, the per-workspace +`/heartbeat-history` endpoint isn't wired on the target build — the +bound measurement is INCONCLUSIVE on the platform-ceiling question. +Either wire the endpoint or replace with the equivalent Datadog query. + +## Other scripts + +- `cleanup-rogue-workspaces.sh` — emergency teardown for leaked + workspaces. Prompts for confirmation. Pair with the harnesses if a + cleanup trap fails (see `cleanup_*_failed` events). +- `canary-smoke.sh` — quick smoke test for canary releases. +- `dev-start.sh` — local-dev platform bring-up. + +The rest are self-documenting in their header comments. diff --git a/scripts/measure-coordinator-task-bounds.sh b/scripts/measure-coordinator-task-bounds.sh index 5de1b8f8..732f2ce7 100755 --- a/scripts/measure-coordinator-task-bounds.sh +++ b/scripts/measure-coordinator-task-bounds.sh @@ -75,33 +75,16 @@ set -euo pipefail PLATFORM="${PLATFORM:-http://localhost:8080}" - -# Templates, model, and the workspace-secret name are configurable so -# the harness covers any provider combo without code edits. Default -# values reproduce the original local-dev shape (claude-code coordinator -# + langgraph child + OpenRouter secret). MiniMax via Hermes Token Plan -# example: PM_TEMPLATE=hermes CHILD_TEMPLATE=hermes -# MODEL=minimax/MiniMax-M2.7-highspeed SECRET_NAME=MINIMAX_API_KEY -# MINIMAX_API_KEY=sk-cp-... -PM_TEMPLATE="${PM_TEMPLATE:-claude-code-default}" -CHILD_TEMPLATE="${CHILD_TEMPLATE:-langgraph}" -MODEL="${MODEL:-}" # empty = template default; non-empty = PUT /workspaces/:id/model -SECRET_NAME="${SECRET_NAME:-OPENROUTER_API_KEY}" - -# Resolve the secret value: prefer $SECRET_NAME (so MINIMAX_API_KEY=... -# works when SECRET_NAME=MINIMAX_API_KEY), then SECRET_VALUE override, -# then legacy OPENROUTER_API_KEY/OPENAI_API_KEY for v1 one-liner compat. -SECRET_VALUE="${SECRET_VALUE:-}" -if [ -z "$SECRET_VALUE" ]; then - SECRET_VALUE="$(printenv "$SECRET_NAME" 2>/dev/null || true)" -fi -if [ -z "$SECRET_VALUE" ]; then - SECRET_VALUE="${OPENROUTER_API_KEY:-${OPENAI_API_KEY:-}}" -fi -if [ -z "$SECRET_VALUE" ]; then - echo "ERROR: no value found for secret \$$SECRET_NAME — set it (or SECRET_VALUE / OPENROUTER_API_KEY / OPENAI_API_KEY)" >&2 +# Require an explicitly-set non-empty key. The previous chained +# default (`${OPENROUTER_API_KEY:-${OPENAI_API_KEY:?...}}`) silently +# accepted `OPENROUTER_API_KEY=""` and only failed when OPENAI_API_KEY +# was also unset — defeating the guard against running with no LLM +# credentials. +if [ -z "${OPENROUTER_API_KEY:-}" ] && [ -z "${OPENAI_API_KEY:-}" ]; then + echo "ERROR: set OPENROUTER_API_KEY (or OPENAI_API_KEY) to a non-empty value" >&2 exit 1 fi +OR_KEY="${OPENROUTER_API_KEY:-${OPENAI_API_KEY}}" # Required for non-localhost platforms — staging-api etc. enforce # tenant-admin auth on /workspaces. Without it the harness would either diff --git a/workspace/platform_tools/README.md b/workspace/platform_tools/README.md new file mode 100644 index 00000000..56180fe8 --- /dev/null +++ b/workspace/platform_tools/README.md @@ -0,0 +1,107 @@ +# Platform tool registry + +Single source of truth for every tool the platform exposes to agents +(A2A delegation, hierarchical memory, broadcast, introspection). + +## Why this exists + +Pre-#2240, three places independently declared each tool: + +1. **MCP server** (`workspace/a2a_mcp_server.py`) — the `TOOLS` JSON list +2. **LangChain `@tool` wrappers** (`workspace/builtin_tools/{delegation,memory}.py`) +3. **Agent-facing system-prompt docs** (`workspace/executor_helpers.py`) + +Adding a tool to one and forgetting the others happened repeatedly. The +canonical case: `send_message_to_user` was registered in MCP TOOLS but +the executor_helpers doc string never mentioned it, so agents saw the +tool as available but had no usage guidance — a silent capability +regression. + +## What the registry does + +`registry.py` defines each tool ONCE as a frozen `ToolSpec`: + +```python +ToolSpec( + name="delegate_task", + short="Delegate a task to a peer workspace via A2A and WAIT for the response.", + when_to_use="Use for QUICK questions and small sub-tasks where you can afford to wait inline...", + input_schema={...}, # JSON Schema, consumed by MCP server + impl=tool_delegate_task, # the actual coroutine + section="a2a", # which prompt section it belongs to +) +``` + +Adapters consume specs; no hardcoded names anywhere else: + +- **MCP server** builds its `TOOLS` list from `_PLATFORM_TOOL_SPECS` at import time +- **LangChain `@tool` wrappers** read `name=spec.name` from the registry +- **Doc generator** (`executor_helpers._render_section()`) produces the + system-prompt block from `spec.short` (bullet) + `spec.when_to_use` + (heading + paragraph) + +## CLI subprocess block — special case + +Non-MCP runtimes (ollama, custom subprocess adapters) use a separate +hand-maintained block in `executor_helpers._A2A_INSTRUCTIONS_CLI` because +the CLI subcommand vocabulary (`peers`, `delegate`, `status`, `info`) +differs from the MCP tool names (`list_peers`, `delegate_task`, etc.). +Auto-generation would lose the readable invocation syntax. + +Alignment is enforced via `_CLI_A2A_COMMAND_KEYWORDS` (in +`executor_helpers.py`): every a2a-section spec must be keyed there with +either a CLI subcommand keyword OR an explicit `None` if the tool is +intentionally not exposed via subprocess (e.g. +`send_message_to_user` because its structured `attachments` field +doesn't survive positional-arg shell invocation). + +## Tests that catch drift + +`workspace/tests/test_platform_tools.py`: + +| Test | What it catches | +|---|---| +| `test_mcp_server_registers_every_registry_tool` | MCP TOOLS list out of sync with registry | +| `test_mcp_tool_descriptions_match_registry_short` | hand-edited MCP description that drifted | +| `test_mcp_tool_input_schemas_match_registry` | schema duplicated in server file | +| `test_a2a_instructions_text_includes_every_a2a_tool` | doc generator missed a tool | +| `test_old_pre_rename_names_not_present_in_docs` | stale name leaked back in | +| `test_a2a_mcp_instructions_match_snapshot` | rendered shape (bullet ordering, headings, footers) drifted | +| `test_a2a_cli_instructions_match_snapshot` | CLI block edited in a way that changes shape | +| `test_hma_instructions_match_snapshot` | HMA section drifted | +| `test_cli_keyword_mapping_covers_every_a2a_tool` | tool added to registry without a CLI mapping decision | +| `test_cli_keyword_substrings_appear_in_cli_block` | CLI keyword in the mapping but missing from the doc block | + +The snapshot files at `workspace/tests/snapshots/*.txt` are LF-pinned +in `.gitattributes` so a Windows contributor with `core.autocrlf=true` +doesn't get mysterious test failures. + +## Adding a new tool + +1. Append a `ToolSpec(...)` to `TOOLS` in `registry.py`. +2. Add the LangChain `@tool` wrapper in `workspace/builtin_tools/` + (the wrapper body just calls `spec.impl`). +3. Update `_CLI_A2A_COMMAND_KEYWORDS` in `executor_helpers.py` — set the + value to the CLI subcommand keyword, or to `None` if the tool isn't + exposed via the subprocess interface. +4. Regenerate snapshots — see the comment block at the top of + `workspace/tests/test_platform_tools.py` for the one-liner. +5. Run `pytest workspace/tests/test_platform_tools.py --no-cov`. + +## Renaming a tool + +Edit `name` in `registry.py` only. Then: + +1. The MCP TOOLS list rebuilds automatically. +2. The doc generator regenerates automatically (snapshots will fail + the diff — regenerate them). +3. Search `workspace/` for the old literal in case a non-adapter + consumer (tests, plugin code) hardcoded the old name; update those. +4. Update any `_CLI_A2A_COMMAND_KEYWORDS` key + the literal substring + in `_A2A_INSTRUCTIONS_CLI` if applicable. + +## Removing a tool + +Delete the `ToolSpec` and the `_CLI_A2A_COMMAND_KEYWORDS` key. Adapters +and doc generators stop registering it automatically; the structural +tests prevent stale references from surviving.