Research on garrytan/gstack surfaced 5 patterns worth importing into our cron / agent setup. These are skills, not platform code — they guide how the cron and our own subagents work, not what the platform does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for noteworthy PRs (auth, billing, data deletion, migrations). Catches the 15-30% of bugs single-model review misses. Inspired by gstack's /codex.
2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive commands. Refuses force-push to main, blocks merging draft PRs, prevents rm -rf outside scratch dirs. Inspired by gstack's /careful + /freeze.
3. **cron-learnings** — per-project JSONL of operational learnings appended at the end of every tick, replayed at the start of the next. Stops the cron from re-litigating decided issues. Inspired by gstack's /learn.
4. **cron-retro** — weekly retrospective auto-posted as a GitHub issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate failure trends, code-review severity over time. Inspired by gstack's /retro.
5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped the wrong thing" — the failure mode unit tests miss. Plugged into the issue-pickup pipeline so worker-agent draft PRs get scored before being marked ready. Inspired by gstack's tier-3 test infra.
## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation + cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| name | description |
|---|---|
| llm-judge | Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra. |
# llm-judge
Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.
## When to invoke
After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:
- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted
Specifically: when a worker agent comes back with "done", before we believe them.
## Inputs
- The ORIGINAL request — the issue body, the user message, the delegation prompt
- The DELIVERABLE — the diff, the response text, the generated artifact
- ACCEPTANCE CRITERIA if explicit (often in the issue body)
## How to evaluate
Send to a small, fast model (Haiku, GPT-mini, Gemini Flash):

```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.

REQUEST:
<paste original>

DELIVERABLE:
<paste artifact>

ACCEPTANCE CRITERIA (if any):
<paste>

Output JSON:
{
  "score": 0..5,
  "addresses_request": true|false,
  "missing": ["...", "..."],
  "wrong": ["...", "..."],
  "reasons": ["...", "...", "..."]
}
```
## Decision
| Score | Action |
|---|---|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0–2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |
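The table above as a straight mapping (the action names are illustrative labels, not real APIs):

```typescript
type JudgeAction = "accept" | "accept-with-note" | "send-back" | "reject-escalate";

// Map a 0-5 judge score to the disposition in the table above.
function decide(score: number): JudgeAction {
  if (score >= 5) return "accept";           // log to telemetry
  if (score >= 4) return "accept-with-note"; // file follow-up issue if material
  if (score >= 3) return "send-back";        // include judge's "missing" list
  return "reject-escalate";                  // 0-2: fix the prompt, not the deliverable
}
```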
## Cost
Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day = $0.10/day. Negligible.
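Sanity-checking the arithmetic above:

```typescript
// Tier-3 judge cost at the quoted rate.
const perEvalUsd = 0.001;              // Haiku-class estimate from above
const dailyCostUsd = 100 * perEvalUsd; // 100 evals/day ≈ $0.10/day
```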
## Where to plug it in
- Cron Step 4 (issue pickup): after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4.
- A2A delegation in workspaces: optionally enable per-org. PM gets the worker's response, runs llm-judge, only forwards to the next stage if accepted.
- Manual:
  ```
  npm run skill:llm-judge -- --request <file> --deliverable <file>
  ```
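A sketch of the Step 4 gate — the threshold and result shape are assumptions, and whatever marks the PR ready or posts the revision comment lives elsewhere in the cron:

```typescript
const READY_THRESHOLD = 4; // matches "score >= 4" above

interface GateResult {
  ready: boolean;
  feedback: string[]; // judge's "missing" list when the PR is sent back
}

// Decide whether a subagent's draft PR can be flipped to "ready".
function gateDraftPr(score: number, missing: string[]): GateResult {
  return score >= READY_THRESHOLD
    ? { ready: true, feedback: [] }
    : { ready: false, feedback: missing };
}
```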
## Why this exists
gstack runs LLM-as-judge as a test tier ($0.15 per eval, ~30s). Our worker agents produce far more deliverables per day than gstack's single-session model, so a cheaper, more frequent eval is the right fit for our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).