Research on garrytan/gstack surfaced 5 patterns worth importing into our cron / agent setup. These are skills, not platform code — they guide how the cron and our own subagents work, not what the platform does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for noteworthy PRs (auth, billing, data deletion, migrations). Catches the 15-30% of bugs single-model review misses. Inspired by gstack's /codex.
2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive commands. Refuses force-push to main, blocks merging draft PRs, prevents rm -rf outside scratch dirs. Inspired by gstack's /careful + /freeze.
3. **cron-learnings** — per-project JSONL of operational learnings appended at the end of every tick, replayed at the start of the next. Stops the cron from re-litigating decided issues. Inspired by gstack's /learn. (A sketch of the record shape is appended below.)
4. **cron-retro** — weekly retrospective auto-posted as a GitHub issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate failure trends, code-review severity over time. Inspired by gstack's /retro.
5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped the wrong thing" — the failure mode unit tests miss. Plug into the issue-pickup pipeline so worker-agent draft PRs get scored before being marked ready. Inspired by gstack's tier-3 test infra.

## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation + cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
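
For illustration, a minimal TypeScript sketch of what a cron-learnings record and its append/replay helpers could look like; the field names and file layout here are assumptions, not the skill's actual schema:

```ts
// Hypothetical cron-learnings record; field names are illustrative only.
import { appendFileSync, existsSync, readFileSync } from "node:fs";

interface Learning {
  ts: string;        // ISO timestamp of the tick that recorded it
  project: string;   // which project's JSONL this belongs to
  learning: string;  // the operational note to replay next tick
  source?: string;   // e.g. a PR or issue URL that prompted it
}

// End of tick: append one JSON object per line.
function appendLearning(path: string, entry: Learning): void {
  appendFileSync(path, JSON.stringify(entry) + "\n", "utf8");
}

// Start of the next tick: replay everything recorded so far.
function replayLearnings(path: string): Learning[] {
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as Learning);
}
```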
---
name: llm-judge
description: Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra.
---

# llm-judge

Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.

## When to invoke

After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:

- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted

Specifically: when a worker agent comes back with "done", before we believe them.
## Inputs

1. The ORIGINAL request — the issue body, the user message, the delegation prompt
2. The DELIVERABLE — the diff, the response text, the generated artifact
3. ACCEPTANCE CRITERIA if explicit (often in the issue body)
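
A minimal sketch of how these three inputs could be bundled (TypeScript; the type and field names are illustrative assumptions, not a fixed schema):

```ts
// Hypothetical input bundle for one llm-judge run.
interface JudgeInput {
  request: string;              // the ORIGINAL request: issue body, user message, or delegation prompt
  deliverable: string;          // the DELIVERABLE: diff, response text, or generated artifact
  acceptanceCriteria?: string;  // explicit ACCEPTANCE CRITERIA, when the issue provides them
}
```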
## How to evaluate

Send to a small fast model (Haiku, GPT-mini, Gemini Flash):

```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.

REQUEST:
<paste original>

DELIVERABLE:
<paste artifact>

ACCEPTANCE CRITERIA (if any):
<paste>

Output JSON:
{
  "score": 0..5,
  "addresses_request": true|false,
  "missing": ["...", "..."],
  "wrong": ["...", "..."],
  "reasons": ["...", "...", "..."]
}
```
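
A sketch of assembling that prompt and parsing the judge's reply in TypeScript. The `callModel` callback stands in for whichever small-model client we use; nothing below is the skill's actual implementation:

```ts
// Hypothetical sketch: build the prompt above, call a small model, parse the JSON verdict.

interface JudgeVerdict {
  score: number;               // 0..5
  addresses_request: boolean;
  missing: string[];
  wrong: string[];
  reasons: string[];
}

function buildJudgePrompt(input: { request: string; deliverable: string; acceptanceCriteria?: string }): string {
  return [
    "You are an evaluator. Below is a customer request and the deliverable",
    "the AI agent produced. Rate, on a 0-5 scale, how well the deliverable",
    "addresses the original request. Then list the top 3 reasons for the score.",
    "",
    "REQUEST:",
    input.request,
    "",
    "DELIVERABLE:",
    input.deliverable,
    "",
    "ACCEPTANCE CRITERIA (if any):",
    input.acceptanceCriteria ?? "(none provided)",
    "",
    'Output JSON: { "score": 0..5, "addresses_request": true|false, "missing": [...], "wrong": [...], "reasons": [...] }',
  ].join("\n");
}

async function judge(
  input: { request: string; deliverable: string; acceptanceCriteria?: string },
  callModel: (prompt: string) => Promise<string>,  // assumed client for the judge model
): Promise<JudgeVerdict> {
  const raw = await callModel(buildJudgePrompt(input));
  // Tolerate models that wrap the JSON object in prose or a code fence.
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("llm-judge: model reply contained no JSON object");
  return JSON.parse(match[0]) as JudgeVerdict;
}
```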
## Decision

| Score | Action |
|---|---|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0–2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |
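
The same mapping as a tiny TypeScript helper, purely as a sketch; the action names are made up for illustration:

```ts
// Hypothetical mapping of the decision table onto code.
type Action = "accept" | "accept-with-note" | "revise" | "reject-and-escalate";

function decide(score: number): Action {
  if (score >= 5) return "accept";              // log to telemetry
  if (score === 4) return "accept-with-note";   // file a follow-up issue if the gap is material
  if (score === 3) return "revise";             // send back with the judge's "missing" list
  return "reject-and-escalate";                 // 0-2: escalate to CEO, fix the prompt first
}
```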
## Cost

Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day = $0.10/day. Negligible.
## Where to plug it in

- **Cron Step 4 (issue pickup)**: after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4 (a minimal gate is sketched after this list).
- **A2A delegation in workspaces**: optionally enable per-org. PM gets the worker's response, runs llm-judge, only forwards to the next stage if accepted.
- **Manual**: `npm run skill:llm-judge -- --request <file> --deliverable <file>`
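
For the cron Step 4 hook, a minimal sketch of the gate in TypeScript; `markPrReady` and `requestRevision` are hypothetical stand-ins for however the cron actually talks to GitHub and the worker agent:

```ts
// Hypothetical cron Step 4 gate. `judge` is assumed to be pre-bound to a model client
// (see the sketch under "How to evaluate"); the other two callbacks are stand-ins, not real APIs.
async function gateDraftPr(
  issueBody: string,
  prDiff: string,
  deps: {
    judge: (input: { request: string; deliverable: string }) => Promise<{ score: number; missing: string[] }>;
    markPrReady: () => Promise<void>;
    requestRevision: (missing: string[]) => Promise<void>;
  },
): Promise<void> {
  const verdict = await deps.judge({ request: issueBody, deliverable: prDiff });
  if (verdict.score >= 4) {
    // Only a score of 4 or 5 flips the draft PR to "ready for review".
    await deps.markPrReady();
  } else {
    // Hand the judge's gap list back to the worker agent for revision.
    await deps.requestRevision(verdict.missing);
  }
}
```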
## Why this exists

gstack runs LLM-as-judge as a test tier ($0.15 per eval, ~30s). Our worker agents produce many more deliverables per day than gstack's single-session model, so a cheaper, more frequent eval is what matches our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).