Research on garrytan/gstack surfaced 5 patterns worth importing into our cron / agent setup. These are skills, not platform code — they guide how the cron and our own subagents work, not what the platform does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for noteworthy PRs (auth, billing, data deletion, migrations). Catches the 15-30% of bugs single-model review misses. Inspired by gstack's /codex.
2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive commands. Refuses force-push to main, blocks merging draft PRs, prevents rm -rf outside scratch dirs. Inspired by gstack's /careful + /freeze.
3. **cron-learnings** — per-project JSONL of operational learnings appended at the end of every tick, replayed at the start of the next. Stops the cron from re-litigating decided issues. Inspired by gstack's /learn.
4. **cron-retro** — weekly retrospective auto-posted as a GitHub issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate failure trends, code-review severity over time. Inspired by gstack's /retro.
5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped the wrong thing" — the failure mode unit tests miss. Plugged into the issue-pickup pipeline so worker-agent draft PRs get scored before being marked ready. Inspired by gstack's tier-3 test infra.
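The cron-learnings pattern (skill 3 above) is a small append/replay loop. A minimal sketch, assuming a per-project JSONL path and a hypothetical record shape — the real skill may store different fields:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def append_learning(path: Path, text: str, tick_id: str) -> None:
    """End-of-tick: append one operational learning as a JSONL record."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "tick": tick_id,
        "learning": text,
    }
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

def replay_learnings(path: Path, limit: int = 50) -> list[str]:
    """Start-of-tick: replay the most recent learnings so decided issues stay decided."""
    if not path.exists():
        return []
    lines = path.read_text().splitlines()
    return [json.loads(line)["learning"] for line in lines[-limit:]]
```

Append-only JSONL keeps each tick's write cheap and conflict-free; the replay cap keeps the prompt overhead bounded as the file grows.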
## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation + cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| name | description |
|---|---|
| cross-vendor-review | Run an adversarial code review against a non-Claude model (Codex / GPT / Gemini) and surface disagreements with Claude's own review. Use ONLY for noteworthy PRs (auth, billing, data-deletion, irreversible migration, large-blast-radius). Inspired by gstack's /codex command. |
# cross-vendor-review
Two LLMs catch bugs one doesn't. Claude has blind spots; so does GPT-5; so does Gemini. For high-stakes PRs the cost of a second model is dwarfed by the cost of a missed defect.
## When to invoke
ALWAYS for PRs touching:
- Authentication, authorization, session, or token handling
- Billing / payments / Stripe / metering
- Destructive operations (delete cascades, mass-update, drop)
- Database migrations (schema changes, data backfills)
- Cross-tenant isolation logic
- Cryptographic primitives
OPTIONAL for:
- Large refactors (>500 LOC)
- Performance-sensitive changes
- Anything where the cron's standard code-review skill returned conflicting signals
NEVER for:
- Docs, templates, CI tweaks, dependency bumps, test-only changes
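The ALWAYS/OPTIONAL/NEVER triage can be sketched as a path-keyword classifier. The keyword lists and the `triage` helper below are illustrative assumptions, not the skill's actual implementation — a real version would key off the repo's layout:

```python
# Hypothetical keyword lists; tune to the repo's actual directory structure.
ALWAYS_KEYWORDS = ("auth", "session", "token", "billing", "payment", "stripe",
                   "migration", "tenant", "crypto", "delete")
NEVER_KEYWORDS = ("docs/", ".github/", "requirements", "package-lock", "_test.")

def triage(changed_files: list[str], loc_changed: int) -> str:
    """Classify a PR for cross-vendor review: ALWAYS / OPTIONAL / NEVER."""
    paths = [p.lower() for p in changed_files]
    if any(k in p for p in paths for k in ALWAYS_KEYWORDS):
        return "ALWAYS"            # high-stakes surface touched
    if all(any(k in p for k in NEVER_KEYWORDS) for p in paths):
        return "NEVER"             # docs / CI / deps / test-only
    if loc_changed > 500:
        return "OPTIONAL"          # large refactor
    return "NEVER"                 # uncertain: default to no second model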
## How to invoke
- Pull the diff: `gh pr diff N --repo OWNER/REPO`
- Run Claude's own code-review skill first; capture its findings
- Send the SAME diff + the SAME rubric to a second model:
- Preferred order: GPT-5 (via Codex CLI or API), Gemini Pro 2.5, Llama 3.3 70B
- One-shot prompt; no conversation
- Instruct the second model to be ADVERSARIAL: assume the diff has at least one bug and find it
- Compare the two reports. For each finding:
- Both flag it → real, must address
- Only Claude → likely real, address or justify dismissal
- Only second model → may be real, investigate
- Both clean → ok to merge
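The four-way comparison above is a simple decision matrix; a minimal sketch (function name and labels are assumptions, matching the verdict column in the output format):

```python
def verdict(claude_flagged: bool, second_flagged: bool) -> str:
    """Map the two reviewers' flags for one finding to an action."""
    if claude_flagged and second_flagged:
        return "MUST FIX"      # both flag it: treat as real
    if claude_flagged:
        return "SHOULD FIX"    # only Claude: address or justify dismissal
    if second_flagged:
        return "INVESTIGATE"   # only the second model: may be real
    return "OK"                # both clean: ok to merge
```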
## Output format
## Cross-vendor review for PR #N
| Finding | Claude | <2nd model> | Verdict |
|---|---|---|---|
| Token compared with == not constant-time | 🔴 | 🔴 | MUST FIX |
| ctx not propagated through goroutine | 🟡 | — | SHOULD FIX |
| — | — | 🟡 stale jwt cache on revoke | INVESTIGATE |
## Disagreements
- Claude said X; <model> said Y. Resolution: ...
## Verdict
- ☐ Merge (both clean)
- ☐ Address findings then re-review
- ☐ Escalate to CEO (irreconcilable models)
## Cost guard
Cross-vendor calls cost real money. Cap:
- One pass per PR per session
- Skip if the noteworthy-flag is uncertain (default: no second model)
- Log per-tick spend in the cron telemetry channel
## Why this exists
gstack's /codex showed that single-model review misses ~15-30% of real findings catchable by a different vendor. Auth bugs are precisely the class where blind spots are catastrophic. This skill formalizes the pattern.