import from local vendored copy (2026-05-06)
Checks: CI / validate (push) failed after 1s

Hongming Wang 2026-05-06 13:53:35 -07:00
commit 64c2d7be17
12 changed files with 460 additions and 0 deletions

.github/workflows/ci.yml

@@ -0,0 +1,5 @@
name: CI
on: [push, pull_request]
jobs:
validate:
uses: Molecule-AI/molecule-ci/.github/workflows/validate-plugin.yml@main

.gitignore

@@ -0,0 +1,21 @@
# Credentials — never commit. Use .env.example as the template.
.env
.env.local
.env.*.local
.env.*
!.env.example
!.env.sample
# Private keys + certs
*.pem
*.key
*.crt
*.p12
*.pfx
# Secret directories
.secrets/
# Workspace auth tokens
.auth-token
.auth_token

@@ -0,0 +1 @@
pyyaml>=6.0

@@ -0,0 +1,46 @@
#!/usr/bin/env python3
"""Validate a Molecule AI plugin repo."""
import os, sys, yaml

errors = []

if not os.path.isfile("plugin.yaml"):
    print("::error::plugin.yaml not found at repo root")
    sys.exit(1)

with open("plugin.yaml") as f:
    plugin = yaml.safe_load(f)

# Required manifest fields
for field in ["name", "version", "description"]:
    if not plugin.get(field):
        errors.append(f"Missing required field: {field}")

# Version must be dotted digits (e.g. 1.0.0)
v = str(plugin.get("version", ""))
if v and not all(c in "0123456789." for c in v):
    errors.append(f"Invalid version format: {v}")

runtimes = plugin.get("runtimes")
if runtimes is not None and not isinstance(runtimes, list):
    errors.append(f"runtimes must be a list, got {type(runtimes).__name__}")

# The plugin must ship at least one piece of content
content_paths = ["SKILL.md", "hooks", "skills", "rules"]
found = [p for p in content_paths if os.path.exists(p)]
if not found:
    errors.append("Plugin must contain at least one of: SKILL.md, hooks/, skills/, rules/")

if os.path.isfile("SKILL.md"):
    with open("SKILL.md") as f:
        first_line = f.readline().strip()
    if first_line and not first_line.startswith("#"):
        print("::warning::SKILL.md should start with a markdown heading (e.g., # Plugin Name)")

if errors:
    for e in errors:
        print(f"::error::{e}")
    sys.exit(1)

print(f"✓ plugin.yaml valid: {plugin['name']} v{plugin['version']}")
if found:
    print(f"  Content: {', '.join(found)}")
if runtimes:
    print(f"  Runtimes: {', '.join(runtimes)}")

CLAUDE.md

@@ -0,0 +1,118 @@
# molecule-skill-llm-judge — LLM-as-Judge Gate
`molecule-skill-llm-judge` is a **cheap LLM-as-judge gate** that scores whether
a deliverable (PR diff, A2A response, generated config) actually addresses the
original request. It catches the failure mode unit tests miss: the code works
but solves the wrong problem.

**Version:** 1.0.0

**Runtime:** `claude_code`

---
## Repository Layout
```
molecule-skill-llm-judge/
├── plugin.yaml — Plugin manifest
├── skills/
│ └── llm-judge/
│ └── SKILL.md — Scoring criteria and process
└── adapters/ — Harness adaptors
```
---
## How It Works
### The Judge Prompt
The skill sends the original request + the deliverable to a judge LLM and
asks for a score of 1–5:

| Score | Meaning |
|---|---|
| 5 | Deliverable fully addresses the request |
| 4 | Addresses most of the request, minor gaps |
| 3 | Partial address, significant gaps |
| 2 | Mostly irrelevant |
| 1 | Completely wrong |
### Gate Behaviour
Configure the threshold in workspace settings:
```json
{
"llm_judge": {
"threshold": 4,
"model": "claude-sonnet-4-20250514"
}
}
```
If the score is below threshold, the skill returns a denial with the judge's reasoning.
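
In code, the gate amounts to a threshold comparison. A minimal sketch, assuming the judge returns the JSON shape documented in `skills/llm-judge/SKILL.md`; the function name and return payload here are illustrative, not the plugin's actual API:

```python
# Hypothetical gate check: names are illustrative, not the plugin's API.
def apply_gate(judge_result: dict, settings: dict) -> dict:
    """Allow or deny a deliverable based on the configured judge threshold."""
    threshold = settings.get("llm_judge", {}).get("threshold", 4)
    score = judge_result["score"]
    if score >= threshold:
        return {"decision": "allow", "score": score}
    # Below threshold: deny and surface the judge's reasoning to the caller.
    return {
        "decision": "deny",
        "score": score,
        "reasons": judge_result.get("reasons", []),
    }
```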
---
## When to Use

✅ Use for:

- Verifying PR diffs against the original issue
- Checking A2A responses address the task
- Validating generated configs against requirements

❌ Don't use for:

- Well-tested pure logic (unit tests catch this)
- Exploratory work where "wrong" isn't well-defined

---
## Development
### Prerequisites
- Python 3.11+
- `gh` CLI authenticated
- Write access to `Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge`
### Setup
```bash
git clone https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge.git
cd molecule-ai-plugin-molecule-skill-llm-judge
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
```
### Pre-Commit Checklist
```bash
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
python3 -c "
import re, sys
with open('plugin.yaml') as f:
content = f.read()
patterns = [r'sk.ant', r'ghp.', r'AKIA[A-Z0-9]']
if any(re.search(p, content) for p in patterns):
print('FAIL: possible credentials found')
sys.exit(1)
print('No credentials: OK')
"
```
---
## Release Process
1. Review changes: `git log origin/main..HEAD --oneline`
2. Bump `version` in `plugin.yaml` (semver)
3. Commit: `chore: bump version to X.Y.Z`
4. Tag and push: `git tag vX.Y.Z && git push origin main --tags`
5. Create GitHub Release with changelog
---
## Known Issues
See `known-issues.md` at the repo root.

README.md

@@ -0,0 +1,42 @@
# molecule-skill-llm-judge — LLM-as-Judge Gate
Plugin for Claude Code. Scores whether an agent's deliverable (a PR, a delegation
result, a generated config) actually addresses the original request — the failure mode
unit tests miss.
## The problem it solves
Unit tests verify the code *ran*. They don't verify it did the *right thing* for the
customer's actual request. An agent can implement the wrong solution perfectly.
## When to use
After an agent (PM, Dev Lead, QA, etc.) produces a deliverable:

- A PR opened in response to an issue
- A delegation result (A2A `message/send` response)
- A generated config or template
- A code review they posted

**Trigger:** "Agent came back with 'done' — before we believe them."
## What it does
1. Presents the original request and the agent's deliverable to an LLM judge
2. Scores: does the deliverable actually address the request?
3. Reports: passes, partial, or fails — with evidence
## Installation
### In org template (org.yaml)
```yaml
plugins:
- molecule-skill-llm-judge
```
### From URL
```
github://Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge
```
## License
Business Source License 1.1 — © Molecule AI.

adapters/__init__.py (empty)

adapters/claude_code.py

@@ -0,0 +1,2 @@
"""Claude Code adaptor — uses the generic rule+skill+hooks installer."""
from plugins_registry.builtins import AgentskillsAdaptor as Adaptor # noqa: F401

known-issues.md

@@ -0,0 +1,54 @@
# Known Issues — molecule-skill-llm-judge
---
## Active Issues
*(None currently open. This section is updated when issues are filed.)*

---
## Recently Resolved
*(No recently resolved issues.)*

---
## How to Update This File
When a new issue is identified:
1. Add it under **Active Issues** using the template below
2. Include: symptom, cause (if known), workaround
3. When fixed, move to **Recently Resolved** and note the fix version
### Issue Template
```markdown
## [TICKET-NUMBER] <Short Title>
**Severity:** P0 / P1 / P2 / P3
**Status:** Workaround / Fix in progress / Fix available
**Affected versions:** All / vX.Y.Z+
**Symptoms:**
**Cause:**
**Workaround:**
**Fix (if available):**
```
---
## Severity Definitions
| Level | Description |
|---|---|
| P0 | Judge always returns 5 (bypass) |
| P1 | Judge always returns 1 (false negative on good work) |
| P2 | Judge score inconsistent between runs |
| P3 | Cosmetic or documentation issue |
---
## Reporting
Use the Molecule-AI/internal issue tracker. Tag with `plugin-molecule-skill-llm-judge`.

plugin.yaml

@@ -0,0 +1,11 @@
name: molecule-skill-llm-judge
version: 1.0.0
description: Cheap LLM-as-judge gate that catches "agent shipped the wrong thing". Scores whether a deliverable (PR diff, A2A response, generated config) actually addresses the original request — the failure mode unit tests miss.
author: Molecule AI
tags: [molecule, guardrails, evaluation]
runtimes:
- claude_code
skills:
- llm-judge

@@ -0,0 +1,84 @@
# Local Development Setup
This runbook covers setting up a local development environment for
`molecule-skill-llm-judge`.

---
## Prerequisites
- Python 3.11+
- `gh` CLI authenticated
- Write access to `Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge`
---
## Clone & Bootstrap
```bash
git clone https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge.git
cd molecule-ai-plugin-molecule-skill-llm-judge
```
---
## Validating Plugin Structure
```bash
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
echo "plugin.yaml OK"
```
---
## Testing the LLM Judge
The harness wrapper is provided by the Molecule AI platform at runtime.
To test:
1. Install the plugin in a test workspace
2. Create a test issue with a clear request
3. Submit a deliberately wrong deliverable
4. Run `llm-judge` and verify the score is low (below threshold)
Example:
```
Request: "Add user authentication with JWT tokens"
Deliverable: "Added logging to all API endpoints"
Expected score: 1-2
```
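
If you want to script that check, here is a sketch; `run_llm_judge` is a hypothetical stand-in for however your test workspace invokes the skill, not a helper shipped by this plugin:

```python
# Hypothetical smoke test: run_llm_judge is whatever callable your workspace
# exposes for invoking the skill; it is not part of this plugin.
def check_wrong_deliverable_scores_low(run_llm_judge) -> None:
    request = "Add user authentication with JWT tokens"
    deliverable = "Added logging to all API endpoints"
    result = run_llm_judge(request=request, deliverable=deliverable)
    # A deliberately off-target deliverable should land well below the gate.
    assert result["score"] <= 2
```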
---
## Tuning the Judge Prompt
If the judge is consistently wrong, adjust the scoring criteria in
`skills/llm-judge/SKILL.md`. Key things to tune:
- Clarity of the original request
- Whether the deliverable was checked against the request
- Calibration of score 3 vs score 4
---
## Troubleshooting
### Judge always scores 5
- The judge prompt may be too lenient
- Verify the original request is included in the judge prompt
### Judge scores 1 on good work
- The judge prompt may be too strict
- Check the criteria — ensure "correct but different approach" scores ≥ 4
### Inconsistent scores between runs
- LLM judges have inherent non-determinism
- Consider setting the judge call's temperature to 0 to reduce variance
---
## Related
- `skills/llm-judge/SKILL.md` — scoring criteria and usage

skills/llm-judge/SKILL.md

@@ -0,0 +1,76 @@
---
name: llm-judge
description: Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra.
origin: molecule-skill-llm-judge
---
# llm-judge
Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.
## When to Use
After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:
- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted

Specifically: when a worker agent comes back with "done", before we believe them.
## Inputs
1. The ORIGINAL request — the issue body, the user message, the delegation prompt
2. The DELIVERABLE — the diff, the response text, the generated artifact
3. ACCEPTANCE CRITERIA if explicit (often in the issue body)
## How to evaluate
Send to a small fast model (Haiku, GPT-mini, Gemini Flash):
```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.
REQUEST:
<paste original>
DELIVERABLE:
<paste artifact>
ACCEPTANCE CRITERIA (if any):
<paste>
Output JSON:
{
"score": 0..5,
"addresses_request": true|false,
"missing": ["...", "..."],
"wrong": ["...", "..."],
"reasons": ["...", "...", "..."]
}
```
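
A sketch of one way to run that prompt, assuming the Anthropic Python SDK and a Haiku-class judge model; the model name is a placeholder, and real code should handle responses that are not clean JSON:

```python
import json
import anthropic  # assumption: Anthropic Python SDK; swap in your judge backend

JUDGE_PROMPT = """You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.

REQUEST:
{request}

DELIVERABLE:
{deliverable}

ACCEPTANCE CRITERIA (if any):
{criteria}

Output JSON with keys: score, addresses_request, missing, wrong, reasons."""

def judge(request: str, deliverable: str, criteria: str = "") -> dict:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder: any small, fast judge model
        max_tokens=512,
        temperature=0,  # reduce run-to-run variance
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                request=request, deliverable=deliverable, criteria=criteria
            ),
        }],
    )
    return json.loads(response.content[0].text)  # sketch: no retry on malformed JSON
```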
## Decision
| Score | Action |
|---|---|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0–2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |
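
The same policy as a sketch, for wiring into a harness; the action names are descriptive labels, not a defined plugin API:

```python
def decide(score: int) -> str:
    """Map a judge score to the next step; mirrors the decision table above."""
    if score >= 5:
        return "accept"
    if score == 4:
        return "accept_with_note"        # file a follow-up issue if the gap is material
    if score == 3:
        return "send_back_for_revision"  # include the judge's "missing" list
    return "reject_and_escalate"         # 0-2: the agent likely misunderstood the task
```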
## Cost
Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day = $0.10/day. Negligible.
## Where to plug it in
- **Cron Step 4 (issue pickup)**: after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4 (see the sketch after this list).
- **A2A delegation in workspaces**: optionally enable per-org. PM gets the worker's response, runs llm-judge, only forwards to the next stage if accepted.
- **Manual**: `npm run skill:llm-judge -- --request <file> --deliverable <file>`
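
A sketch of the cron-step gate, under the same assumptions as the `judge()` sketch above; the function signature is illustrative and the `gh` call assumes an already-authenticated runner:

```python
import subprocess

def gate_draft_pr(pr_number: int, issue_body: str, pr_diff: str) -> None:
    """Promote a draft PR to ready only if the judge clears the threshold."""
    result = judge(request=issue_body, deliverable=pr_diff)  # judge() sketched above
    if result["score"] >= 4:
        subprocess.run(["gh", "pr", "ready", str(pr_number)], check=True)
    else:
        # Leave the PR in draft and surface the judge's reasoning for revision.
        print(f"llm-judge blocked PR #{pr_number}: {result['reasons']}")
```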
## Why this exists
gstack runs LLM-as-judge as a test tier ($0.15 per eval, ~30s). Our worker agents produce many more deliverables per day than gstack's single-session model, so making the eval cheaper and more frequent matches our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).