import from local vendored copy (2026-05-06)
commit 64c2d7be17
CI / validate (push): failing after 1s
.github/workflows/ci.yml (vendored, new file, 5 lines)
@@ -0,0 +1,5 @@
name: CI
on: [push, pull_request]
jobs:
  validate:
    uses: Molecule-AI/molecule-ci/.github/workflows/validate-plugin.yml@main
.gitignore (vendored, new file, 21 lines)
@@ -0,0 +1,21 @@
# Credentials — never commit. Use .env.example as the template.
.env
.env.local
.env.*.local
.env.*
!.env.example
!.env.sample

# Private keys + certs
*.pem
*.key
*.crt
*.p12
*.pfx

# Secret directories
.secrets/

# Workspace auth tokens
.auth-token
.auth_token
.molecule-ci/scripts/requirements.txt (new file, 1 line)
@@ -0,0 +1 @@
pyyaml>=6.0
.molecule-ci/scripts/validate-plugin.py (new file, 46 lines)
@@ -0,0 +1,46 @@
#!/usr/bin/env python3
"""Validate a Molecule AI plugin repo."""
import os, sys, yaml

errors = []

if not os.path.isfile("plugin.yaml"):
    print("::error::plugin.yaml not found at repo root")
    sys.exit(1)

with open("plugin.yaml") as f:
    plugin = yaml.safe_load(f)

for field in ["name", "version", "description"]:
    if not plugin.get(field):
        errors.append(f"Missing required field: {field}")

v = str(plugin.get("version", ""))
if v and not all(c in "0123456789." for c in v):
    errors.append(f"Invalid version format: {v}")

runtimes = plugin.get("runtimes")
if runtimes is not None and not isinstance(runtimes, list):
    errors.append(f"runtimes must be a list, got {type(runtimes).__name__}")

content_paths = ["SKILL.md", "hooks", "skills", "rules"]
found = [p for p in content_paths if os.path.exists(p)]
if not found:
    errors.append("Plugin must contain at least one of: SKILL.md, hooks/, skills/, rules/")

if os.path.isfile("SKILL.md"):
    with open("SKILL.md") as f:
        first_line = f.readline().strip()
    if first_line and not first_line.startswith("#"):
        print("::warning::SKILL.md should start with a markdown heading (e.g., # Plugin Name)")

if errors:
    for e in errors:
        print(f"::error::{e}")
    sys.exit(1)

print(f"✓ plugin.yaml valid: {plugin['name']} v{plugin['version']}")
if found:
    print(f"  Content: {', '.join(found)}")
if runtimes:
    print(f"  Runtimes: {', '.join(runtimes)}")
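As a sketch of what this validator accepts and rejects, the required-field and version checks above can be exercised against an in-memory manifest (the dict is a stand-in for a parsed `plugin.yaml`; `manifest_errors` is an illustrative name, not part of the script):

```python
def manifest_errors(plugin: dict) -> list[str]:
    """Apply the same required-field and version checks as the validator."""
    errors = []
    for field in ["name", "version", "description"]:
        if not plugin.get(field):
            errors.append(f"Missing required field: {field}")
    v = str(plugin.get("version", ""))
    if v and not all(c in "0123456789." for c in v):
        errors.append(f"Invalid version format: {v}")
    return errors

# Stand-in for yaml.safe_load(open("plugin.yaml")) — illustrative only.
plugin = {"name": "demo", "version": "1.0.0", "description": "a demo plugin"}
print(manifest_errors(plugin))                         # []
print(manifest_errors({**plugin, "version": "1.0b"}))  # ['Invalid version format: 1.0b']
```

Note the digit/dot check is loose: it rejects letters but would still accept strings like `1..2`; the script relies on the release process to keep versions semver-shaped.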
CLAUDE.md (new file, 118 lines)
@@ -0,0 +1,118 @@
# molecule-skill-llm-judge — LLM-as-Judge Gate

`molecule-skill-llm-judge` is a **cheap LLM-as-judge gate** that scores whether
a deliverable (PR diff, A2A response, generated config) actually addresses the
original request. It catches the failure mode unit tests miss: the code works
but solves the wrong problem.

**Version:** 1.0.0
**Runtime:** `claude_code`

---

## Repository Layout

```
molecule-skill-llm-judge/
├── plugin.yaml              — Plugin manifest
├── skills/
│   └── llm-judge/
│       └── SKILL.md         — Scoring criteria and process
└── adapters/                — Harness adaptors
```

---

## How It Works

### The Judge Prompt

The skill sends the original request + the deliverable to a judge LLM and
asks for a score 1–5:

| Score | Meaning |
|---|---|
| 5 | Deliverable fully addresses the request |
| 4 | Addresses most of the request, minor gaps |
| 3 | Partial address, significant gaps |
| 2 | Mostly irrelevant |
| 1 | Completely wrong |

### Gate Behaviour

Configure the threshold in workspace settings:

```json
{
  "llm_judge": {
    "threshold": 4,
    "model": "claude-sonnet-4-20250514"
  }
}
```

If the score is below the threshold, the skill returns a denial with the judge's reasoning.

---

## When to Use

✅ Use for:
- Verifying PR diffs against the original issue
- Checking A2A responses address the task
- Validating generated configs against requirements

❌ Don't use for:
- Well-tested pure logic (unit tests catch this)
- Exploratory work where "wrong" isn't well-defined

---

## Development

### Prerequisites

- Python 3.11+
- `gh` CLI authenticated
- Write access to `Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge`

### Setup

```bash
git clone https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge.git
cd molecule-ai-plugin-molecule-skill-llm-judge
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
```

### Pre-Commit Checklist

```bash
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"

python3 -c "
import re, sys
with open('plugin.yaml') as f:
    content = f.read()
patterns = [r'sk.ant', r'ghp.', r'AKIA[A-Z0-9]']
if any(re.search(p, content) for p in patterns):
    print('FAIL: possible credentials found')
    sys.exit(1)
print('No credentials: OK')
"
```

---

## Release Process

1. Review changes: `git log origin/main..HEAD --oneline`
2. Bump `version` in `plugin.yaml` (semver)
3. Commit: `chore: bump version to X.Y.Z`
4. Tag and push: `git tag vX.Y.Z && git push origin main --tags`
5. Create GitHub Release with changelog

---

## Known Issues

See `known-issues.md` at the repo root.
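The gate behaviour described in CLAUDE.md can be sketched as a pure function over the judge score and the workspace settings — a hedged illustration, not the plugin's actual implementation (`gate` is a hypothetical name; the settings dict mirrors the JSON example):

```python
def gate(score: int, settings: dict) -> tuple[bool, str]:
    """Compare a judge score against the configured threshold (default 4)."""
    threshold = settings.get("llm_judge", {}).get("threshold", 4)
    if score >= threshold:
        return True, f"accept: score {score} >= threshold {threshold}"
    return False, f"deny: score {score} < threshold {threshold}"

settings = {"llm_judge": {"threshold": 4, "model": "claude-sonnet-4-20250514"}}
print(gate(5, settings))  # (True, 'accept: score 5 >= threshold 4')
print(gate(3, settings))  # (False, 'deny: score 3 < threshold 4')
```

On a denial the skill would attach the judge's reasoning to the message; here only the pass/deny decision is modelled.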
README.md (new file, 42 lines)
@@ -0,0 +1,42 @@
# molecule-skill-llm-judge — LLM-as-Judge Gate

Plugin for Claude Code. Scores whether an agent's deliverable (a PR, a delegation
result, a generated config) actually addresses the original request — the failure mode
unit tests miss.

## The problem it solves

Unit tests verify the code *ran*. They don't verify it did the *right thing* for the
customer's actual request. An agent can implement the wrong solution perfectly.

## When to use

After an agent (PM, Dev Lead, QA, etc.) produces a deliverable:
- A PR opened in response to an issue
- A delegation result (A2A `message/send` response)
- A generated config or template
- A code review they posted

**Trigger:** "Agent came back with 'done' — before we believe them."

## What it does

1. Presents the original request and the agent's deliverable to an LLM judge
2. Scores: does the deliverable actually address the request?
3. Reports: passes, partial, or fails — with evidence

## Installation

### In org template (org.yaml)
```yaml
plugins:
  - molecule-skill-llm-judge
```

### From URL
```
github://Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge
```

## License

Business Source License 1.1 — © Molecule AI.
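Step 1 of "What it does" amounts to prompt assembly: pairing the original request with the deliverable for the judge. A minimal sketch (the template here is illustrative; the real one lives in `skills/llm-judge/SKILL.md`):

```python
def build_judge_prompt(request: str, deliverable: str) -> str:
    """Assemble the judge input from the original request and the deliverable."""
    return (
        "You are an evaluator. Rate, on a 0-5 scale, how well the "
        "deliverable addresses the original request.\n\n"
        f"REQUEST:\n{request}\n\n"
        f"DELIVERABLE:\n{deliverable}\n"
    )

prompt = build_judge_prompt("Add JWT authentication", "Added logging to endpoints")
assert "REQUEST:" in prompt and "DELIVERABLE:" in prompt
```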
adapters/__init__.py (new file, empty)
adapters/claude_code.py (new file, 2 lines)
@@ -0,0 +1,2 @@
"""Claude Code adaptor — uses the generic rule+skill+hooks installer."""
from plugins_registry.builtins import AgentskillsAdaptor as Adaptor  # noqa: F401
known-issues.md (new file, 54 lines)
@@ -0,0 +1,54 @@
# Known Issues — molecule-skill-llm-judge

---

## Active Issues

*(None currently open. This section is updated when issues are filed.)*

---

## Recently Resolved

*(No recently resolved issues.)*

---

## How to Update This File

When a new issue is identified:
1. Add it under **Active Issues** using the template below
2. Include: symptom, cause (if known), workaround
3. When fixed, move to **Recently Resolved** and note the fix version

### Issue Template

```markdown
## [TICKET-NUMBER] <Short Title>

**Severity:** P0 / P1 / P2 / P3
**Status:** Workaround / Fix in progress / Fix available
**Affected versions:** All / vX.Y.Z+

**Symptoms:**
**Cause:**
**Workaround:**
**Fix (if available):**
```

---

## Severity Definitions

| Level | Description |
|---|---|
| P0 | Judge always returns 5 (bypass) |
| P1 | Judge always returns 1 (false negative on good work) |
| P2 | Judge score inconsistent between runs |
| P3 | Cosmetic or documentation issue |

---

## Reporting

Use the Molecule-AI/internal issue tracker. Tag with `plugin-molecule-skill-llm-judge`.
plugin.yaml (new file, 11 lines)
@@ -0,0 +1,11 @@
name: molecule-skill-llm-judge
version: 1.0.0
description: Cheap LLM-as-judge gate that catches "agent shipped the wrong thing". Scores whether a deliverable (PR diff, A2A response, generated config) actually addresses the original request — the failure mode unit tests miss.
author: Molecule AI
tags: [molecule, guardrails, evaluation]

runtimes:
  - claude_code

skills:
  - llm-judge
runbooks/local-dev-setup.md (new file, 84 lines)
@@ -0,0 +1,84 @@
# Local Development Setup

This runbook covers setting up a local development environment for
`molecule-skill-llm-judge`.

---

## Prerequisites

- Python 3.11+
- `gh` CLI authenticated
- Write access to `Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge`

---

## Clone & Bootstrap

```bash
git clone https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge.git
cd molecule-ai-plugin-molecule-skill-llm-judge
```

---

## Validating Plugin Structure

```bash
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
echo "plugin.yaml OK"
```

---

## Testing the LLM Judge

The harness wrapper is provided by the Molecule AI platform at runtime.
To test:

1. Install the plugin in a test workspace
2. Create a test issue with a clear request
3. Submit a deliberately wrong deliverable
4. Run `llm-judge` and verify the score is low (below threshold)

Example:
```
Request: "Add user authentication with JWT tokens"
Deliverable: "Added logging to all API endpoints"
Expected score: 1-2
```

---

## Tuning the Judge Prompt

If the judge is consistently wrong, adjust the scoring criteria in
`skills/llm-judge/SKILL.md`. Key things to tune:
- Clarity of the original request
- Whether the deliverable was checked against the request
- Calibration of score 3 vs score 4

---

## Troubleshooting

### Judge always scores 5

- The judge prompt may be too lenient
- Verify the original request is included in the judge prompt

### Judge scores 1 on good work

- The judge prompt may be too strict
- Check the criteria — ensure "correct but different approach" scores ≥ 4

### Inconsistent scores between runs

- LLM judges have inherent non-determinism
- Consider setting temperature to 0 to reduce variance

---

## Related

- `skills/llm-judge/SKILL.md` — scoring criteria and usage
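The "deliberately wrong deliverable" check in the runbook's testing steps can be captured as a small fixture — a sketch with illustrative names, not platform tooling:

```python
# Hypothetical fixture: a request paired with an off-target deliverable
# and the score range the judge is expected to return for it.
FIXTURE = {
    "request": "Add user authentication with JWT tokens",
    "deliverable": "Added logging to all API endpoints",
    "expected_range": (1, 2),  # clearly off-target, per the runbook example
}

def score_in_expected_range(fixture: dict, score: int) -> bool:
    lo, hi = fixture["expected_range"]
    return lo <= score <= hi

assert score_in_expected_range(FIXTURE, 1)      # judge correctly flags it
assert not score_in_expected_range(FIXTURE, 5)  # a 5 here is the P0 bypass failure
```

A set of such fixtures, run against the installed judge, doubles as a regression check on the P0/P1 failure modes in `known-issues.md`.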
skills/llm-judge/SKILL.md (new file, 76 lines)
@@ -0,0 +1,76 @@
---
name: llm-judge
description: Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra.
origin: molecule-skill-llm-judge
---

# llm-judge

Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.

## When to Use

After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:
- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted

Specifically: when a worker agent comes back with "done", before we believe them.

## Inputs

1. The ORIGINAL request — the issue body, the user message, the delegation prompt
2. The DELIVERABLE — the diff, the response text, the generated artifact
3. ACCEPTANCE CRITERIA if explicit (often in the issue body)

## How to evaluate

Send to a small fast model (Haiku, GPT-mini, Gemini Flash):

```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.

REQUEST:
<paste original>

DELIVERABLE:
<paste artifact>

ACCEPTANCE CRITERIA (if any):
<paste>

Output JSON:
{
  "score": 0..5,
  "addresses_request": true|false,
  "missing": ["...", "..."],
  "wrong": ["...", "..."],
  "reasons": ["...", "...", "..."]
}
```

## Decision

| Score | Action |
|---|---|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0–2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |

## Cost

Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day = $0.10/day. Negligible.

## Where to plug it in

- **Cron Step 4 (issue pickup)**: after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4.
- **A2A delegation in workspaces**: optionally enable per-org. PM gets the worker's response, runs llm-judge, only forwards to the next stage if accepted.
- **Manual**: `npm run skill:llm-judge -- --request <file> --deliverable <file>`

## Why this exists

gstack runs LLM-as-judge as a test tier ($0.15 per eval, ~30s). Our worker agents produce many more deliverables per day than gstack's single-session model — making the eval cheaper and more frequent matches our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).
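The SKILL.md decision table can be sketched as a parser over the judge's JSON output (the schema comes from the prompt; the action strings are illustrative, not a platform API):

```python
import json

def decide(judge_output: str) -> str:
    """Map the judge's JSON score to the action column of the decision table."""
    result = json.loads(judge_output)
    score = result["score"]
    if score == 5:
        return "accept"
    if score == 4:
        return "accept-with-note"
    if score == 3:
        return "revise: " + ", ".join(result.get("missing", []))
    return "reject-escalate"  # 0-2: escalate; likely a misunderstood task

raw = '{"score": 3, "addresses_request": false, "missing": ["JWT auth"]}'
print(decide(raw))  # revise: JWT auth
```

Feeding the "missing" list back into the revision request is what makes the score-3 path actionable rather than a bare rejection.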