molecule-core/.claude/skills/cross-vendor-review/SKILL.md
Hongming Wang · 239883920e · feat(.claude): 5 gstack-inspired skills + cron upgrades
Research on garrytan/gstack surfaced 5 patterns worth importing into
our cron / agent setup. These are skills, not platform code — they
guide how the cron and our own subagents work, not what the platform
does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for
   noteworthy PRs (auth, billing, data deletion, migrations). Catches
   the 15-30% of bugs single-model review misses. Inspired by
   gstack's /codex.

2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive
   commands. Refuses force-push to main, blocks merging draft PRs,
   prevents rm -rf outside scratch dirs. Inspired by gstack's
   /careful + /freeze.

3. **cron-learnings** — per-project JSONL of operational learnings
   appended at the end of every tick, replayed at the start of the
   next. Stops the cron from re-litigating decided issues (see the
   sketch after this list). Inspired by gstack's /learn.

4. **cron-retro** — weekly retrospective auto-posted as a GitHub
   issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate
   failure trends, code-review severity over time. Inspired by
   gstack's /retro.

5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped
   the wrong thing" — the failure mode unit tests miss. Plug into
   issue-pickup pipeline so worker-agent draft PRs get scored before
   being marked ready. Inspired by gstack's tier-3 test infra.
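
A minimal sketch of what a cron-learnings entry and its append/replay halves could look like, assuming a hypothetical per-project file at `.claude/cron-learnings.jsonl`; the field names are illustrative, not a fixed schema.

```python
import datetime
import json
import pathlib

# Hypothetical per-project learnings file; path and field names are illustrative.
LEARNINGS = pathlib.Path(".claude/cron-learnings.jsonl")

def append_learning(decision: str, context: str) -> None:
    """End of tick: record one operational learning."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "decision": decision,  # what was decided
        "context": context,    # why, so the next tick doesn't re-litigate it
    }
    with LEARNINGS.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def replay_learnings(limit: int = 50) -> list[dict]:
    """Start of the next tick: replay the most recent learnings."""
    if not LEARNINGS.exists():
        return []
    lines = LEARNINGS.read_text().splitlines()
    return [json.loads(line) for line in lines[-limit:]]
```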

## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation +
  cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge
  (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge
  scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---
name: cross-vendor-review
description: Run an adversarial code review against a non-Claude model (Codex / GPT / Gemini) and surface disagreements with Claude's own review. Use ONLY for noteworthy PRs (auth, billing, data-deletion, irreversible migration, large-blast-radius). Inspired by gstack's /codex command.
---

# cross-vendor-review

Two LLMs catch bugs one doesn't. Claude has blind spots; so does GPT-5; so does Gemini. For high-stakes PRs the cost of a second model is dwarfed by the cost of a missed defect.

## When to invoke

ALWAYS for PRs touching:

- Authentication, authorization, session, or token handling
- Billing / payments / Stripe / metering
- Destructive operations (delete cascades, mass-update, drop)
- Database migrations (schema changes, data backfills)
- Cross-tenant isolation logic
- Cryptographic primitives

OPTIONAL for:

- Large refactors (>500 LOC)
- Performance-sensitive changes
- Anything where the cron's standard code-review skill returned conflicting signals

NEVER for:

- Docs, templates, CI tweaks, dependency bumps, test-only changes
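
As a sketch of how the ALWAYS / NEVER buckets above could be applied automatically, assuming `gh pr diff --name-only` is available in your gh version; the path patterns are purely illustrative, not an exhaustive policy.

```python
import re
import subprocess

# Illustrative patterns for the ALWAYS bucket; tune per repository.
NOTEWORTHY = [
    r"auth", r"session", r"token",
    r"billing", r"payment", r"stripe", r"metering",
    r"migration", r"crypto", r"tenant",
]
# Docs / CI / test-only changes from the NEVER bucket.
SKIP = [r"\.md$", r"^docs/", r"^\.github/", r"(^|/)tests?/", r"_test\."]

def is_noteworthy(pr_number: int, repo: str) -> bool:
    """Classify a PR from its changed file paths."""
    out = subprocess.run(
        ["gh", "pr", "diff", str(pr_number), "--repo", repo, "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    paths = [p for p in out.splitlines() if p]
    if paths and all(any(re.search(s, p) for s in SKIP) for p in paths):
        return False  # NEVER bucket: nothing but docs/CI/tests changed
    return any(re.search(n, p, re.IGNORECASE) for p in paths for n in NOTEWORTHY)
```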

## How to invoke

1. Pull the diff: `gh pr diff N --repo OWNER/REPO`
2. Run Claude's own code-review skill first; capture its findings
3. Send the SAME diff + the SAME rubric to a second model:
   - Preferred order: GPT-5 (via Codex CLI or API), Gemini 2.5 Pro, Llama 3.3 70B
   - One-shot prompt; no conversation
   - Instruct the second model to be ADVERSARIAL: assume the diff has at least one bug and find it
4. Compare the two reports. For each finding:
   - Both flag it → real, must address
   - Only Claude → likely real, address or justify dismissal
   - Only second model → may be real, investigate
   - Both clean → ok to merge
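
A minimal sketch of steps 1 and 3, assuming the second model is reachable through the OpenAI Python SDK; the model identifier and rubric text are placeholders, and the Codex CLI route works just as well.

```python
import subprocess

from openai import OpenAI  # assumption: second model reachable via this SDK

RUBRIC = "<same rubric handed to Claude's code-review skill>"  # placeholder

def second_model_review(pr_number: int, repo: str, model: str = "gpt-5") -> str:
    """Steps 1 and 3: pull the diff, send one adversarial one-shot prompt."""
    # Step 1: pull the diff.
    diff = subprocess.run(
        ["gh", "pr", "diff", str(pr_number), "--repo", repo],
        capture_output=True, text=True, check=True,
    ).stdout

    # Step 3: same diff, same rubric, explicitly adversarial, no conversation.
    prompt = (
        "You are an adversarial code reviewer. Assume this diff contains at "
        "least one real bug and find it. Apply the rubric below.\n\n"
        f"{RUBRIC}\n\n--- DIFF ---\n{diff}"
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,  # placeholder identifier; whichever second vendor is configured
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```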

## Output format

## Cross-vendor review for PR #N

| Finding | Claude | <2nd model> | Verdict |
|---|---|---|---|
| Token compared with == not constant-time | 🔴 | 🔴 | MUST FIX |
| ctx not propagated through goroutine | 🟡 | — | SHOULD FIX |
| — | — | 🟡 stale jwt cache on revoke | INVESTIGATE |

## Disagreements
- Claude said X; <model> said Y. Resolution: ...

## Verdict
- ☐ Merge (both clean)
- ☐ Address findings then re-review
- ☐ Escalate to CEO (irreconcilable models)
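
A minimal sketch of the step-4 decision matrix rendered into a table of this shape, assuming findings from both reviewers have already been normalized so that a key shared by both reports means the same finding.

```python
def verdict_table(claude: dict[str, str], other: dict[str, str], other_name: str) -> str:
    """Map the step-4 decision matrix onto the report table.

    `claude` and `other` map a finding key to its severity marker; matching
    of equivalent findings is assumed to happen upstream.
    """
    header = f"| Finding | Claude | {other_name} | Verdict |\n|---|---|---|---|"
    rows = []
    for key in sorted(set(claude) | set(other)):
        if key in claude and key in other:
            verdict = "MUST FIX"       # both flag it -> real
        elif key in claude:
            verdict = "SHOULD FIX"     # only Claude -> likely real
        else:
            verdict = "INVESTIGATE"    # only the second model -> may be real
        rows.append(f"| {key} | {claude.get(key, '—')} | {other.get(key, '—')} | {verdict} |")
    if not rows:
        rows.append("| — | — | — | ok to merge |")  # both clean
    return "\n".join([header, *rows])
```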

## Cost guard

Cross-vendor calls cost real money. Cap:

- One pass per PR per session
- Skip if the noteworthy-flag is uncertain (default: no second model)
- Log per-tick spend in the cron telemetry channel
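
A minimal sketch of the one-pass-per-PR-per-session cap and the spend log, assuming a hypothetical telemetry file at `.claude/cross-vendor-spend.jsonl`; the path, field names, and the cost figure passed in are illustrative.

```python
import datetime
import json
import pathlib

# Hypothetical spend log for the cron telemetry channel.
SPEND_LOG = pathlib.Path(".claude/cross-vendor-spend.jsonl")

def already_reviewed(pr_number: int, session_id: str) -> bool:
    """Enforce the one-pass-per-PR-per-session cap."""
    if not SPEND_LOG.exists():
        return False
    for line in SPEND_LOG.read_text().splitlines():
        entry = json.loads(line)
        if entry["pr"] == pr_number and entry["session"] == session_id:
            return True
    return False

def log_spend(pr_number: int, session_id: str, usd: float) -> None:
    """Record spend so the per-tick total can be reported."""
    entry = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "pr": pr_number,
        "session": session_id,
        "usd": usd,  # illustrative: actual cost accounting depends on the vendor
    }
    with SPEND_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```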

## Why this exists

gstack's /codex showed that single-model review misses ~15-30% of real findings catchable by a different vendor. Auth bugs are precisely the class where blind spots are catastrophic. This skill formalizes the pattern.