molecule-core/.claude/skills/llm-judge/SKILL.md
Hongming Wang 9d914193d2 feat(.claude): 5 gstack-inspired skills + cron upgrades
Research on garrytan/gstack surfaced 5 patterns worth importing into
our cron / agent setup. These are skills, not platform code — they
guide how the cron and our own subagents work, not what the platform
does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for
   noteworthy PRs (auth, billing, data deletion, migrations). Catches
   the 15-30% of bugs single-model review misses. Inspired by
   gstack's /codex.

2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive
   commands. Refuses force-push to main, blocks merging draft PRs,
   prevents rm -rf outside scratch dirs. Inspired by gstack's
   /careful + /freeze.

3. **cron-learnings** — per-project JSONL of operational learnings
   appended at the end of every tick, replayed at the start of the
   next. Stops the cron from re-litigating decided issues.
   Inspired by gstack's /learn.

4. **cron-retro** — weekly retrospective auto-posted as a GitHub
   issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate
   failure trends, code-review severity over time. Inspired by
   gstack's /retro.

5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped
   the wrong thing" — the failure mode unit tests miss. Plug into
   issue-pickup pipeline so worker-agent draft PRs get scored before
   being marked ready. Inspired by gstack's tier-3 test infra.

## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation +
  cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge
  (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge
  scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:36:55 -07:00


---
name: llm-judge
description: Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra.
---

# llm-judge

Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.

## When to invoke

After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:

- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted

Specifically: when a worker agent comes back with "done", before we believe them.

## Inputs

  1. The ORIGINAL request — the issue body, the user message, the delegation prompt
  2. The DELIVERABLE — the diff, the response text, the generated artifact
  3. ACCEPTANCE CRITERIA if explicit (often in the issue body)
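
As a concrete shape, here is a minimal sketch assuming a TypeScript codebase (the interface name and fields are illustrative, not an existing platform type):

```typescript
// Hypothetical shape for one llm-judge evaluation.
interface JudgeInput {
  request: string;             // the ORIGINAL request: issue body, user message, or delegation prompt
  deliverable: string;         // the DELIVERABLE: diff, response text, or generated artifact
  acceptanceCriteria?: string; // explicit criteria, if the issue body has them
}
```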

## How to evaluate

Send to a small, fast model (Haiku, GPT-mini, Gemini Flash):

```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.

REQUEST:
<paste original>

DELIVERABLE:
<paste artifact>

ACCEPTANCE CRITERIA (if any):
<paste>

Output JSON:
{
  "score": 0..5,
  "addresses_request": true|false,
  "missing": ["...", "..."],
  "wrong": ["...", "..."],
  "reasons": ["...", "...", "..."]
}
```
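
Wired up, the call might look like the following: a sketch assuming the Anthropic TypeScript SDK and a Haiku-class model. `runJudge`, `JudgeVerdict`, and the model id are illustrative choices, and `JudgeInput` is the shape sketched under Inputs above.

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Shape of the JSON verdict the prompt above asks the judge to emit.
interface JudgeVerdict {
  score: number; // 0..5
  addresses_request: boolean;
  missing: string[];
  wrong: string[];
  reasons: string[];
}

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function runJudge(input: JudgeInput): Promise<JudgeVerdict> {
  const prompt =
    "You are an evaluator. Below is a customer request and the deliverable\n" +
    "the AI agent produced. Rate, on a 0-5 scale, how well the deliverable\n" +
    "addresses the original request. Then list the top 3 reasons for the score.\n\n" +
    `REQUEST:\n${input.request}\n\n` +
    `DELIVERABLE:\n${input.deliverable}\n\n` +
    `ACCEPTANCE CRITERIA (if any):\n${input.acceptanceCriteria ?? "none"}\n\n` +
    'Output JSON: {"score": 0..5, "addresses_request": true|false, ' +
    '"missing": [...], "wrong": [...], "reasons": [...]}';

  const msg = await client.messages.create({
    model: "claude-3-5-haiku-latest", // tier-3: cheap and fast
    max_tokens: 512,
    messages: [{ role: "user", content: prompt }],
  });

  const text = msg.content[0].type === "text" ? msg.content[0].text : "";
  // Judges sometimes wrap the JSON in prose; take the outermost {...} span.
  return JSON.parse(text.slice(text.indexOf("{"), text.lastIndexOf("}") + 1)) as JudgeVerdict;
}
```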

## Decision

| Score | Action |
|-------|--------|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0–2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |
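
In code, the table is a straight mapping (a sketch; `decide` and `JudgeAction` are names invented here, not an existing platform API):

```typescript
type JudgeAction = "accept" | "accept_with_note" | "revise" | "reject_escalate";

// Map a verdict to the action column of the table above.
function decide(verdict: JudgeVerdict): JudgeAction {
  if (verdict.score >= 5) return "accept";            // log to telemetry
  if (verdict.score === 4) return "accept_with_note"; // follow-up issue if the gap is material
  if (verdict.score === 3) return "revise";           // hand back the judge's "missing" list
  return "reject_escalate";                           // 0-2: escalate, fix the prompt first
}
```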

## Cost

Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day, that's $0.10/day. Negligible.

## Where to plug it in

- Cron Step 4 (issue pickup): after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4 (sketched below).
- A2A delegation in workspaces: optionally enable per-org. PM gets the worker's response, runs llm-judge, and only forwards to the next stage if accepted.
- Manual: `npm run skill:llm-judge -- --request <file> --deliverable <file>`
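
For the cron Step 4 hook, a minimal sketch under stated assumptions: `gateDraftPr` is hypothetical, `runJudge` is the sketch above, and the `gh` CLI is one way to flip the draft bit.

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical Step 4 gate: score a subagent's draft PR against the issue
// body and only mark it ready when the judge accepts (score >= 4).
async function gateDraftPr(prNumber: number, issueBody: string, diff: string): Promise<JudgeVerdict> {
  const verdict = await runJudge({ request: issueBody, deliverable: diff });
  if (verdict.score >= 4) {
    execFileSync("gh", ["pr", "ready", String(prNumber)]); // flip draft -> ready
  } else {
    // Leave it a draft and hand the judge's gap list back to the worker agent.
    execFileSync("gh", [
      "pr", "comment", String(prNumber),
      "--body", `llm-judge score ${verdict.score}/5. Missing: ${verdict.missing.join("; ")}`,
    ]);
  }
  return verdict;
}
```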

## Why this exists

gstack runs LLM-as-judge as a test tier ($0.15 per eval, ~30s). Our worker agents produce far more deliverables per day than gstack's single-session model, so making the eval cheaper and running it more often is what fits our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).