import from local vendored copy (2026-05-06)
Checks: CI / validate (push) failed after 1s

Hongming Wang 2026-05-06 13:53:35 -07:00
commit 64c2d7be17
12 changed files with 460 additions and 0 deletions

.github/workflows/ci.yml

@@ -0,0 +1,5 @@
name: CI
on: [push, pull_request]
jobs:
validate:
uses: Molecule-AI/molecule-ci/.github/workflows/validate-plugin.yml@main

.gitignore

@@ -0,0 +1,21 @@
# Credentials — never commit. Use .env.example as the template.
.env
.env.local
.env.*.local
.env.*
!.env.example
!.env.sample
# Private keys + certs
*.pem
*.key
*.crt
*.p12
*.pfx
# Secret directories
.secrets/
# Workspace auth tokens
.auth-token
.auth_token

@@ -0,0 +1 @@
pyyaml>=6.0

@@ -0,0 +1,46 @@
#!/usr/bin/env python3
"""Validate a Molecule AI plugin repo."""
import os, sys, yaml

errors = []

if not os.path.isfile("plugin.yaml"):
    print("::error::plugin.yaml not found at repo root")
    sys.exit(1)

with open("plugin.yaml") as f:
    plugin = yaml.safe_load(f)

# Required manifest fields
for field in ["name", "version", "description"]:
    if not plugin.get(field):
        errors.append(f"Missing required field: {field}")

# Version must be dotted digits (e.g. 1.0.0)
v = str(plugin.get("version", ""))
if v and not all(c in "0123456789." for c in v):
    errors.append(f"Invalid version format: {v}")

runtimes = plugin.get("runtimes")
if runtimes is not None and not isinstance(runtimes, list):
    errors.append(f"runtimes must be a list, got {type(runtimes).__name__}")

# The plugin must ship at least one piece of content
content_paths = ["SKILL.md", "hooks", "skills", "rules"]
found = [p for p in content_paths if os.path.exists(p)]
if not found:
    errors.append("Plugin must contain at least one of: SKILL.md, hooks/, skills/, rules/")

if os.path.isfile("SKILL.md"):
    with open("SKILL.md") as f:
        first_line = f.readline().strip()
    if first_line and not first_line.startswith("#"):
        print("::warning::SKILL.md should start with a markdown heading (e.g., # Plugin Name)")

if errors:
    for e in errors:
        print(f"::error::{e}")
    sys.exit(1)

print(f"✓ plugin.yaml valid: {plugin['name']} v{plugin['version']}")
if found:
    print(f"  Content: {', '.join(found)}")
if runtimes:
    print(f"  Runtimes: {', '.join(runtimes)}")

CLAUDE.md

@@ -0,0 +1,118 @@
# molecule-skill-llm-judge — LLM-as-Judge Gate
`molecule-skill-llm-judge` is a **cheap LLM-as-judge gate** that scores whether
a deliverable (PR diff, A2A response, generated config) actually addresses the
original request. It catches the failure mode unit tests miss: the code works
but solves the wrong problem.

**Version:** 1.0.0

**Runtime:** `claude_code`

---
## Repository Layout
```
molecule-skill-llm-judge/
├── plugin.yaml — Plugin manifest
├── skills/
│ └── llm-judge/
│ └── SKILL.md — Scoring criteria and process
└── adapters/ — Harness adaptors
```
---
## How It Works
### The Judge Prompt
The skill sends the original request + the deliverable to a judge LLM and
asks for a score of 1–5:

| Score | Meaning |
|---|---|
| 5 | Deliverable fully addresses the request |
| 4 | Addresses most of the request, minor gaps |
| 3 | Partial address, significant gaps |
| 2 | Mostly irrelevant |
| 1 | Completely wrong |
### Gate Behaviour
Configure the threshold in workspace settings:
```json
{
"llm_judge": {
"threshold": 4,
"model": "claude-sonnet-4-20250514"
}
}
```
If the score is below threshold, the skill returns a denial with the judge's reasoning.
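
In code, the gate amounts to a threshold comparison. A minimal sketch, assuming the judge returns the JSON shape documented in `skills/llm-judge/SKILL.md`; the function name and return payload here are illustrative, not the plugin's actual API:

```python
# Hypothetical gate check: names are illustrative, not the plugin's API.
def apply_gate(judge_result: dict, settings: dict) -> dict:
    """Allow or deny a deliverable based on the configured judge threshold."""
    threshold = settings.get("llm_judge", {}).get("threshold", 4)
    score = judge_result["score"]
    if score >= threshold:
        return {"decision": "allow", "score": score}
    # Below threshold: deny and surface the judge's reasoning to the caller.
    return {
        "decision": "deny",
        "score": score,
        "reasons": judge_result.get("reasons", []),
    }
```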
---
## When to Use

✅ Use for:

- Verifying PR diffs against the original issue
- Checking A2A responses address the task
- Validating generated configs against requirements

❌ Don't use for:

- Well-tested pure logic (unit tests catch this)
- Exploratory work where "wrong" isn't well-defined

---
## Development
### Prerequisites
- Python 3.11+
- `gh` CLI authenticated
- Write access to `Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge`
### Setup
```bash
git clone https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge.git
cd molecule-ai-plugin-molecule-skill-llm-judge
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
```
### Pre-Commit Checklist
```bash
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
python3 -c "
import re, sys
with open('plugin.yaml') as f:
content = f.read()
patterns = [r'sk.ant', r'ghp.', r'AKIA[A-Z0-9]']
if any(re.search(p, content) for p in patterns):
print('FAIL: possible credentials found')
sys.exit(1)
print('No credentials: OK')
"
```
---
## Release Process
1. Review changes: `git log origin/main..HEAD --oneline`
2. Bump `version` in `plugin.yaml` (semver)
3. Commit: `chore: bump version to X.Y.Z`
4. Tag and push: `git tag vX.Y.Z && git push origin main --tags`
5. Create GitHub Release with changelog
---
## Known Issues
See `known-issues.md` at the repo root.

README.md

@@ -0,0 +1,42 @@
# molecule-skill-llm-judge — LLM-as-Judge Gate
Plugin for Claude Code. Scores whether an agent's deliverable (a PR, a delegation
result, a generated config) actually addresses the original request — the failure mode
unit tests miss.
## The problem it solves
Unit tests verify the code *ran*. They don't verify it did the *right thing* for the
customer's actual request. An agent can implement the wrong solution perfectly.
## When to use
After an agent (PM, Dev Lead, QA, etc.) produces a deliverable:

- A PR opened in response to an issue
- A delegation result (A2A `message/send` response)
- A generated config or template
- A code review they posted

**Trigger:** "Agent came back with 'done' — before we believe them."
## What it does
1. Presents the original request and the agent's deliverable to an LLM judge
2. Scores: does the deliverable actually address the request?
3. Reports: passes, partial, or fails — with evidence
## Installation
### In org template (org.yaml)
```yaml
plugins:
- molecule-skill-llm-judge
```
### From URL
```
github://Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge
```
## License
Business Source License 1.1 — © Molecule AI.

adapters/__init__.py (empty)

adapters/claude_code.py

@@ -0,0 +1,2 @@
"""Claude Code adaptor — uses the generic rule+skill+hooks installer."""
from plugins_registry.builtins import AgentskillsAdaptor as Adaptor # noqa: F401

known-issues.md

@@ -0,0 +1,54 @@
# Known Issues — molecule-skill-llm-judge
---
## Active Issues
*(None currently open. This section is updated when issues are filed.)*

---
## Recently Resolved
*(No recently resolved issues.)*

---
## How to Update This File
When a new issue is identified:
1. Add it under **Active Issues** using the template below
2. Include: symptom, cause (if known), workaround
3. When fixed, move to **Recently Resolved** and note the fix version
### Issue Template
```markdown
## [TICKET-NUMBER] <Short Title>
**Severity:** P0 / P1 / P2 / P3
**Status:** Workaround / Fix in progress / Fix available
**Affected versions:** All / vX.Y.Z+
**Symptoms:**
**Cause:**
**Workaround:**
**Fix (if available):**
```
---
## Severity Definitions
| Level | Description |
|---|---|
| P0 | Judge always returns 5 (bypass) |
| P1 | Judge always returns 1 (false negative on good work) |
| P2 | Judge score inconsistent between runs |
| P3 | Cosmetic or documentation issue |
---
## Reporting
Use the Molecule-AI/internal issue tracker. Tag with `plugin-molecule-skill-llm-judge`.

plugin.yaml

@@ -0,0 +1,11 @@
name: molecule-skill-llm-judge
version: 1.0.0
description: Cheap LLM-as-judge gate that catches "agent shipped the wrong thing". Scores whether a deliverable (PR diff, A2A response, generated config) actually addresses the original request — the failure mode unit tests miss.
author: Molecule AI
tags: [molecule, guardrails, evaluation]
runtimes:
- claude_code
skills:
- llm-judge

@@ -0,0 +1,84 @@
# Local Development Setup
This runbook covers setting up a local development environment for
`molecule-skill-llm-judge`.

---
## Prerequisites
- Python 3.11+
- `gh` CLI authenticated
- Write access to `Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge`
---
## Clone & Bootstrap
```bash
git clone https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge.git
cd molecule-ai-plugin-molecule-skill-llm-judge
```
---
## Validating Plugin Structure
```bash
python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
echo "plugin.yaml OK"
```
---
## Testing the LLM Judge
The harness wrapper is provided by the Molecule AI platform at runtime.
To test:
1. Install the plugin in a test workspace
2. Create a test issue with a clear request
3. Submit a deliberately wrong deliverable
4. Run `llm-judge` and verify the score is low (below threshold)
Example:
```
Request: "Add user authentication with JWT tokens"
Deliverable: "Added logging to all API endpoints"
Expected score: 1-2
```
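
If you want to script that check, here is a sketch; `run_llm_judge` is a hypothetical stand-in for however your test workspace invokes the skill, not a helper shipped by this plugin:

```python
# Hypothetical smoke test: run_llm_judge is whatever callable your workspace
# exposes for invoking the skill; it is not part of this plugin.
def check_wrong_deliverable_scores_low(run_llm_judge) -> None:
    request = "Add user authentication with JWT tokens"
    deliverable = "Added logging to all API endpoints"
    result = run_llm_judge(request=request, deliverable=deliverable)
    # A deliberately off-target deliverable should land well below the gate.
    assert result["score"] <= 2
```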
---
## Tuning the Judge Prompt
If the judge is consistently wrong, adjust the scoring criteria in
`skills/llm-judge/SKILL.md`. Key things to tune:
- Clarity of the original request
- Whether the deliverable was checked against the request
- Calibration of score 3 vs score 4
---
## Troubleshooting
### Judge always scores 5
- The judge prompt may be too lenient
- Verify the original request is included in the judge prompt
### Judge scores 1 on good work
- The judge prompt may be too strict
- Check the criteria — ensure "correct but different approach" scores ≥ 4
### Inconsistent scores between runs
- LLM judges have inherent non-determinism
- Consider setting the judge call's temperature to 0 to reduce variance
---
## Related
- `skills/llm-judge/SKILL.md` — scoring criteria and usage

skills/llm-judge/SKILL.md

@@ -0,0 +1,76 @@
---
name: llm-judge
description: Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra.
origin: molecule-skill-llm-judge
---
# llm-judge
Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.
## When to Use
After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:
- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted

Specifically: when a worker agent comes back with "done", before we believe them.
## Inputs
1. The ORIGINAL request — the issue body, the user message, the delegation prompt
2. The DELIVERABLE — the diff, the response text, the generated artifact
3. ACCEPTANCE CRITERIA if explicit (often in the issue body)
## How to evaluate
Send to a small fast model (Haiku, GPT-mini, Gemini Flash):
```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.
REQUEST:
<paste original>
DELIVERABLE:
<paste artifact>
ACCEPTANCE CRITERIA (if any):
<paste>
Output JSON:
{
"score": 0..5,
"addresses_request": true|false,
"missing": ["...", "..."],
"wrong": ["...", "..."],
"reasons": ["...", "...", "..."]
}
```
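
A sketch of one way to run that prompt, assuming the Anthropic Python SDK and a Haiku-class judge model; the model name is a placeholder, and real code should handle responses that are not clean JSON:

```python
import json
import anthropic  # assumption: Anthropic Python SDK; swap in your judge backend

JUDGE_PROMPT = """You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.

REQUEST:
{request}

DELIVERABLE:
{deliverable}

ACCEPTANCE CRITERIA (if any):
{criteria}

Output JSON with keys: score, addresses_request, missing, wrong, reasons."""

def judge(request: str, deliverable: str, criteria: str = "") -> dict:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder: any small, fast judge model
        max_tokens=512,
        temperature=0,  # reduce run-to-run variance
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                request=request, deliverable=deliverable, criteria=criteria
            ),
        }],
    )
    return json.loads(response.content[0].text)  # sketch: no retry on malformed JSON
```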
## Decision
| Score | Action |
|---|---|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0–2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |
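
The same policy as a sketch, for wiring into a harness; the action names are descriptive labels, not a defined plugin API:

```python
def decide(score: int) -> str:
    """Map a judge score to the next step; mirrors the decision table above."""
    if score >= 5:
        return "accept"
    if score == 4:
        return "accept_with_note"        # file a follow-up issue if the gap is material
    if score == 3:
        return "send_back_for_revision"  # include the judge's "missing" list
    return "reject_and_escalate"         # 0-2: the agent likely misunderstood the task
```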
## Cost
Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day = $0.10/day. Negligible.
## Where to plug it in
- **Cron Step 4 (issue pickup)**: after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4 (see the sketch after this list).
- **A2A delegation in workspaces**: optionally enable per-org. PM gets the worker's response, runs llm-judge, only forwards to the next stage if accepted.
- **Manual**: `npm run skill:llm-judge -- --request <file> --deliverable <file>`
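
A sketch of the cron-step gate, under the same assumptions as the `judge()` sketch above; the function signature is illustrative and the `gh` call assumes an already-authenticated runner:

```python
import subprocess

def gate_draft_pr(pr_number: int, issue_body: str, pr_diff: str) -> None:
    """Promote a draft PR to ready only if the judge clears the threshold."""
    result = judge(request=issue_body, deliverable=pr_diff)  # judge() sketched above
    if result["score"] >= 4:
        subprocess.run(["gh", "pr", "ready", str(pr_number)], check=True)
    else:
        # Leave the PR in draft and surface the judge's reasoning for revision.
        print(f"llm-judge blocked PR #{pr_number}: {result['reasons']}")
```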
## Why this exists
gstack runs LLM-as-judge as a test tier ($0.15 per eval, ~30s). Our worker agents produce many more deliverables per day than gstack's single-session model, so making the eval cheaper and more frequent matches our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).