molecule-core/.claude/skills/llm-judge/SKILL.md
Hongming Wang 239883920e feat(.claude): 5 gstack-inspired skills + cron upgrades
Research on garrytan/gstack surfaced 5 patterns worth importing into
our cron / agent setup. These are skills, not platform code — they
guide how the cron and our own subagents work, not what the platform
does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for
   noteworthy PRs (auth, billing, data deletion, migrations). Catches
   the 15-30% of bugs single-model review misses. Inspired by
   gstack's /codex.

2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive
   commands. Refuses force-push to main, blocks merging draft PRs,
   prevents rm -rf outside scratch dirs. Inspired by gstack's
   /careful + /freeze.

3. **cron-learnings** — per-project JSONL of operational learnings
   appended at the end of every tick, replayed at the start of the
   next. Stops the cron from re-litigating decided issues.
   Inspired by gstack's /learn.

4. **cron-retro** — weekly retrospective auto-posted as a GitHub
   issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate
   failure trends, code-review severity over time. Inspired by
   gstack's /retro.

5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped
   the wrong thing" — the failure mode unit tests miss. Plugs into the
   issue-pickup pipeline so worker-agent draft PRs get scored before
   being marked ready. Inspired by gstack's tier-3 test infra.

## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation +
  cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge
  (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge
  scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:36:55 -07:00


---
name: llm-judge
description: Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra.
---
# llm-judge
Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.
## When to invoke
After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:
- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted

Specifically: when a worker agent comes back with "done", before we believe them.
## Inputs
1. The ORIGINAL request — the issue body, the user message, the delegation prompt
2. The DELIVERABLE — the diff, the response text, the generated artifact
3. ACCEPTANCE CRITERIA if explicit (often in the issue body)
## How to evaluate
Send to a small fast model (Haiku, GPT-mini, Gemini Flash):
```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.
REQUEST:
<paste original>
DELIVERABLE:
<paste artifact>
ACCEPTANCE CRITERIA (if any):
<paste>
Output JSON:
{
  "score": 0..5,
  "addresses_request": true|false,
  "missing": ["...", "..."],
  "wrong": ["...", "..."],
  "reasons": ["...", "...", "..."]
}
```
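How this might look in code: a minimal sketch of the judge call, assuming the Anthropic TypeScript SDK and a Haiku-class model id (both are illustrative, not mandated by this skill). The `JudgeVerdict` shape mirrors the JSON in the prompt above.
```ts
// llm-judge.ts: minimal judge call (sketch; SDK choice and model id are assumptions)
import Anthropic from "@anthropic-ai/sdk";

export interface JudgeVerdict {
  score: number;                // 0..5
  addresses_request: boolean;
  missing: string[];
  wrong: string[];
  reasons: string[];
}

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function judge(
  request: string,
  deliverable: string,
  acceptanceCriteria = "none given",
): Promise<JudgeVerdict> {
  const prompt = [
    "You are an evaluator. Below is a customer request and the deliverable",
    "the AI agent produced. Rate, on a 0-5 scale, how well the deliverable",
    "addresses the original request. Then list the top 3 reasons for the score.",
    `REQUEST:\n${request}`,
    `DELIVERABLE:\n${deliverable}`,
    `ACCEPTANCE CRITERIA (if any):\n${acceptanceCriteria}`,
    'Output JSON only: {"score": 0..5, "addresses_request": true|false, "missing": [], "wrong": [], "reasons": []}',
  ].join("\n\n");

  const msg = await client.messages.create({
    model: "claude-3-5-haiku-latest", // assumption: any Haiku-class model id
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  // Tolerate the model wrapping the JSON in prose or a code fence.
  const block = msg.content[0];
  const text = block.type === "text" ? block.text : "";
  const json = text.slice(text.indexOf("{"), text.lastIndexOf("}") + 1);
  return JSON.parse(json) as JudgeVerdict;
}
```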
## Decision
| Score | Action |
|---|---|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0-2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |
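The same table as a small decision function, so the cron can act on a verdict mechanically. The `Action` names and module path are hypothetical; only the thresholds come from the table above.
```ts
import type { JudgeVerdict } from "./llm-judge"; // the judge() sketch above (path is hypothetical)

type Action = "accept" | "accept_with_note" | "revise" | "reject_escalate";

export function decide(verdict: JudgeVerdict): Action {
  if (verdict.score >= 5) return "accept";            // log to telemetry
  if (verdict.score === 4) return "accept_with_note"; // file a follow-up issue if the gap is material
  if (verdict.score === 3) return "revise";           // send back with verdict.missing
  return "reject_escalate";                           // 0-2: escalate to CEO; fix the prompt first
}
```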
## Cost
Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day = $0.10/day. Negligible.
## Where to plug it in
- **Cron Step 4 (issue pickup)**: after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4 (a sketch of this gate follows the list).
- **A2A delegation in workspaces**: optionally enable per-org. PM gets the worker's response, runs llm-judge, only forwards to the next stage if accepted.
- **Manual**: `npm run skill:llm-judge -- --request <file> --deliverable <file>`
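For the Step 4 bullet, a sketch of the gate, assuming `gh` is on the PATH and the `judge()` helper from the earlier sketch; the PR number, issue body, and diff come from whatever context the cron already holds.
```ts
// Gate a subagent's draft PR on the judge score (sketch; module path and helper names are assumptions).
import { execFileSync } from "node:child_process";
import { judge } from "./llm-judge"; // the judge() sketch from "How to evaluate"

export async function gateDraftPr(prNumber: number, issueBody: string, diff: string) {
  const verdict = await judge(issueBody, diff);

  if (verdict.score >= 4) {
    // `gh pr ready` flips a draft PR to "ready for review".
    execFileSync("gh", ["pr", "ready", String(prNumber)]);
  } else {
    // Leave the PR as a draft and hand the judge's gap list back to the agent.
    const body = `llm-judge score ${verdict.score}/5; revision needed:\n` +
      verdict.missing.map((m) => `- ${m}`).join("\n");
    execFileSync("gh", ["pr", "comment", String(prNumber), "--body", body]);
  }
  return verdict;
}
```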
## Why this exists
gstack runs LLM-as-judge as a test tier (~$0.15 per eval, ~30s). Our worker agents produce far more deliverables per day than gstack's single-session model, so making the eval cheaper and running it more often is what matches our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).