Research on garrytan/gstack surfaced 5 patterns worth importing into our cron / agent setup. These are skills, not platform code — they guide how the cron and our own subagents work, not what the platform does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for noteworthy PRs (auth, billing, data deletion, migrations). Catches the 15-30% of bugs single-model review misses. Inspired by gstack's /codex.
2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive commands. Refuses force-push to main, blocks merging draft PRs, prevents rm -rf outside scratch dirs. Inspired by gstack's /careful + /freeze.
3. **cron-learnings** — per-project JSONL of operational learnings appended at the end of every tick, replayed at the start of the next. Stops the cron from re-litigating decided issues. Inspired by gstack's /learn.
4. **cron-retro** — weekly retrospective auto-posted as a GitHub issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate failure trends, code-review severity over time. Inspired by gstack's /retro.
5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped the wrong thing" — the failure mode unit tests miss. Plugged into the issue-pickup pipeline so worker-agent draft PRs get scored before being marked ready. Inspired by gstack's tier-3 test infra.
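The cron-learnings pattern (skill 3 above) is a small append/replay loop. A minimal sketch, assuming a per-project JSONL path and a hypothetical record shape — the real skill may store different fields:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def append_learning(path: Path, text: str, tick_id: str) -> None:
    """End-of-tick: append one operational learning as a JSONL record."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "tick": tick_id,
        "learning": text,
    }
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

def replay_learnings(path: Path, limit: int = 50) -> list[str]:
    """Start-of-tick: replay the most recent learnings so decided issues stay decided."""
    if not path.exists():
        return []
    lines = path.read_text().splitlines()
    return [json.loads(line)["learning"] for line in lines[-limit:]]
```

Append-only JSONL keeps each tick's write cheap and conflict-free; the replay cap keeps the prompt overhead bounded as the file grows.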
## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation + cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| name | description |
|---|---|
| cross-vendor-review | Run an adversarial code review against a non-Claude model (Codex / GPT / Gemini) and surface disagreements with Claude's own review. Use ONLY for noteworthy PRs (auth, billing, data-deletion, irreversible migration, large-blast-radius). Inspired by gstack's /codex command. |
# cross-vendor-review
Two LLMs catch bugs one doesn't. Claude has blind spots; so does GPT-5; so does Gemini. For high-stakes PRs the cost of a second model is dwarfed by the cost of a missed defect.
## When to invoke
ALWAYS for PRs touching:
- Authentication, authorization, session, or token handling
- Billing / payments / Stripe / metering
- Destructive operations (delete cascades, mass-update, drop)
- Database migrations (schema changes, data backfills)
- Cross-tenant isolation logic
- Cryptographic primitives
OPTIONAL for:
- Large refactors (>500 LOC)
- Performance-sensitive changes
- Anything where the cron's standard code-review skill returned conflicting signals
NEVER for:
- Docs, templates, CI tweaks, dependency bumps, test-only changes
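The ALWAYS/OPTIONAL/NEVER triage can be sketched as a path-keyword classifier. The keyword lists and the `triage` helper below are illustrative assumptions, not the skill's actual implementation — a real version would key off the repo's layout:

```python
# Hypothetical keyword lists; tune to the repo's actual directory structure.
ALWAYS_KEYWORDS = ("auth", "session", "token", "billing", "payment", "stripe",
                   "migration", "tenant", "crypto", "delete")
NEVER_KEYWORDS = ("docs/", ".github/", "requirements", "package-lock", "_test.")

def triage(changed_files: list[str], loc_changed: int) -> str:
    """Classify a PR for cross-vendor review: ALWAYS / OPTIONAL / NEVER."""
    paths = [p.lower() for p in changed_files]
    if any(k in p for p in paths for k in ALWAYS_KEYWORDS):
        return "ALWAYS"            # high-stakes surface touched
    if all(any(k in p for k in NEVER_KEYWORDS) for p in paths):
        return "NEVER"             # docs / CI / deps / test-only
    if loc_changed > 500:
        return "OPTIONAL"          # large refactor
    return "NEVER"                 # uncertain: default to no second model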
## How to invoke
- Pull the diff: `gh pr diff N --repo OWNER/REPO`
- Run Claude's own code-review skill first; capture its findings
- Send the SAME diff + the SAME rubric to a second model:
- Preferred order: GPT-5 (via Codex CLI or API), Gemini Pro 2.5, Llama 3.3 70B
- One-shot prompt; no conversation
- Instruct the second model to be ADVERSARIAL: assume the diff has at least one bug and find it
- Compare the two reports. For each finding:
- Both flag it → real, must address
- Only Claude → likely real, address or justify dismissal
- Only second model → may be real, investigate
- Both clean → ok to merge
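The four-way comparison above is a simple decision matrix; a minimal sketch (function name and labels are assumptions, matching the verdict column in the output format):

```python
def verdict(claude_flagged: bool, second_flagged: bool) -> str:
    """Map the two reviewers' flags for one finding to an action."""
    if claude_flagged and second_flagged:
        return "MUST FIX"      # both flag it: treat as real
    if claude_flagged:
        return "SHOULD FIX"    # only Claude: address or justify dismissal
    if second_flagged:
        return "INVESTIGATE"   # only the second model: may be real
    return "OK"                # both clean: ok to merge
```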
## Output format
## Cross-vendor review for PR #N
| Finding | Claude | <2nd model> | Verdict |
|---|---|---|---|
| Token compared with == not constant-time | 🔴 | 🔴 | MUST FIX |
| ctx not propagated through goroutine | 🟡 | — | SHOULD FIX |
| — | — | 🟡 stale jwt cache on revoke | INVESTIGATE |
## Disagreements
- Claude said X; <model> said Y. Resolution: ...
## Verdict
- ☐ Merge (both clean)
- ☐ Address findings then re-review
- ☐ Escalate to CEO (irreconcilable models)
## Cost guard
Cross-vendor calls cost real money. Cap:
- One pass per PR per session
- Skip if the noteworthy-flag is uncertain (default: no second model)
- Log per-tick spend in the cron telemetry channel
## Why this exists
gstack's /codex showed that single-model review misses ~15-30% of real findings catchable by a different vendor. Auth bugs are precisely the class where blind spots are catastrophic. This skill formalizes the pattern.