molecule-ai/molecule-ai-plugin-molecule-skill-llm-judge

documentation-specialist 39ac34935e

docs(install): migrate git clone URL to git.moleculesai.app (#37)

Anonymous git-clone refs in CLAUDE.md, runbooks/local-dev-setup.md migrated github.com/Molecule-AI → git.moleculesai.app/molecule-ai. Public repo, no auth-shape change. Same pattern as the other plugin-* sweeps in the #37 series.

Refs: molecule-ai/internal#37, molecule-ai/internal#38, molecule-ai/internal#42

2026-05-07 00:01:17 -07:00

Local Development Setup

This runbook covers setting up a local development environment for molecule-skill-llm-judge.

Prerequisites

Python 3.11+
gh CLI authenticated
Write access to Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge

Clone & Bootstrap

git clone https://git.moleculesai.app/molecule-ai/molecule-ai-plugin-molecule-skill-llm-judge.git
cd molecule-ai-plugin-molecule-skill-llm-judge

Validating Plugin Structure

python3 -c "import yaml; yaml.safe_load(open('plugin.yaml'))"
echo "plugin.yaml OK"
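
The one-liner above only confirms that plugin.yaml parses. A slightly stricter check could also confirm that the file is a mapping and carries the fields you expect. The sketch below is one way to do that; the key names in REQUIRED_KEYS are hypothetical placeholders, since this runbook does not document the plugin.yaml schema, and the file loader assumes PyYAML is installed.

```python
from typing import Any

# Hypothetical required keys: the runbook does not list the plugin.yaml schema,
# so replace these with the fields your manifest actually needs.
REQUIRED_KEYS = ("name", "version")


def check_manifest(data: Any, required: tuple[str, ...] = REQUIRED_KEYS) -> list[str]:
    """Return a list of problems found in a parsed manifest (empty means OK)."""
    if not isinstance(data, dict):
        return [f"top level must be a mapping, got {type(data).__name__}"]
    return [f"missing key: {key}" for key in required if key not in data]


def check_manifest_file(path: str = "plugin.yaml") -> list[str]:
    """Parse the YAML file at `path` and check it. Requires PyYAML."""
    import yaml  # imported lazily so check_manifest works without PyYAML

    with open(path, encoding="utf-8") as handle:
        return check_manifest(yaml.safe_load(handle))
```

Calling check_manifest_file() from the repository root returns an empty list when nothing is missing, and a list of human-readable problems otherwise.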

Testing the LLM Judge

The harness wrapper is provided by the Molecule AI platform at runtime, so the judge is tested inside a workspace rather than as a standalone script. To test:

Install the plugin in a test workspace
Create a test issue with a clear request
Submit a deliberately wrong deliverable
Run llm-judge and verify the score is low (below threshold)

Example:

Request: "Add user authentication with JWT tokens"
Deliverable: "Added logging to all API endpoints"
Expected score: 1-2
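
If you repeat this check often, it may help to keep the cases in a small file and compare the score you observe against the expected range. The sketch below records the example above as data. JudgeCase and score_in_range are hypothetical names for illustration, not part of the plugin; the observed score is whatever llm-judge reported in your test workspace, because the judge itself runs on the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeCase:
    """One manual test case: a request, a deliverable, and the expected score range."""

    request: str
    deliverable: str
    min_score: int
    max_score: int


def score_in_range(case: JudgeCase, observed: int) -> bool:
    """True if the score reported by the judge falls inside the expected range."""
    return case.min_score <= observed <= case.max_score


# The mismatched request/deliverable pair from the example above.
MISMATCH_CASE = JudgeCase(
    request="Add user authentication with JWT tokens",
    deliverable="Added logging to all API endpoints",
    min_score=1,
    max_score=2,
)
```

After a run, score_in_range(MISMATCH_CASE, observed) tells you whether the judge rated the wrong deliverable as low as expected.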

Tuning the Judge Prompt

If the judge is consistently wrong, adjust the scoring criteria in skills/llm-judge/SKILL.md. Key things to tune:

How clearly the original request is presented to the judge
Whether the judge is told to check the deliverable against the request
Calibration of score 3 vs score 4
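
One way to see whether the 3 vs 4 boundary needs tuning is to score a handful of deliverables by hand and compare them with the judge's scores. The sketch below counts disagreements at that boundary. It assumes you have collected (human_score, judge_score) pairs yourself; the function name and the labels in the result are illustrative only.

```python
from collections import Counter


def boundary_disagreements(
    pairs: list[tuple[int, int]], low: int = 3, high: int = 4
) -> dict[str, int]:
    """Count disagreements at the low/high boundary.

    Each pair is (human_score, judge_score).
    """
    counts = Counter(pairs)
    return {
        # Human gave the lower score, judge gave the higher one.
        "judge_too_generous": counts[(low, high)],
        # Human gave the higher score, judge gave the lower one.
        "judge_too_strict": counts[(high, low)],
    }
```

If one of the two counts dominates, that suggests which direction to move the wording of the score 3 and score 4 criteria.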

Troubleshooting

Judge always scores 5

The judge prompt may be too lenient
Verify the original request is included in the judge prompt
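
If you can capture the prompt text that is actually sent to the judge, the second check can be done mechanically. How to capture that text depends on the platform and is not covered here; the helper below is a hypothetical sketch that only compares two strings.

```python
def _normalize(text: str) -> str:
    """Collapse whitespace runs and ignore case so line wrapping does not matter."""
    return " ".join(text.split()).casefold()


def request_in_prompt(request: str, rendered_prompt: str) -> bool:
    """True if the request text appears somewhere in the rendered judge prompt."""
    needle = _normalize(request)
    if not needle:
        return False  # an empty request cannot meaningfully be "included"
    return needle in _normalize(rendered_prompt)
```

A False result means the judge never saw the request, which would explain scores that ignore what was asked.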

Judge scores 1 on good work

The judge prompt may be too strict
Check the criteria — ensure "correct but different approach" scores ≥ 4

Inconsistent scores between runs

LLM judges are inherently non-deterministic
Consider setting the temperature to 0 to reduce variance
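
To judge how much variance there is before and after a change, you can run the same request and deliverable several times and summarize the scores. The sketch below assumes you have noted the scores by hand; summarize_runs is a hypothetical helper, not part of the plugin.

```python
from statistics import median, pstdev


def summarize_runs(scores: list[int]) -> dict[str, float]:
    """Summarize repeated judge scores for one request/deliverable pair."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "median": median(scores),
        "spread": max(scores) - min(scores),  # 0 means every run agreed
        "stdev": pstdev(scores),  # population standard deviation
    }
```

A spread of 0 or 1 across runs is usually easier to live with than a judge that swings between passing and failing scores on the same input.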

skills/llm-judge/SKILL.md — scoring criteria and usage