Hongming Wang 239883920e feat(.claude): 5 gstack-inspired skills + cron upgrades

Research on garrytan/gstack surfaced 5 patterns worth importing into
our cron / agent setup. These are skills, not platform code — they
guide how the cron and our own subagents work, not what the platform
does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for
   noteworthy PRs (auth, billing, data deletion, migrations). Catches
   the 15-30% of bugs single-model review misses. Inspired by
   gstack's /codex.

2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive
   commands. Refuses force-push to main, blocks merging draft PRs,
   prevents rm -rf outside scratch dirs. Inspired by gstack's
   /careful + /freeze.

3. **cron-learnings** — per-project JSONL of operational learnings
   appended at the end of every tick, replayed at the start of the
   next. Stops the cron from re-litigating decided issues.
   Inspired by gstack's /learn.

4. **cron-retro** — weekly retrospective auto-posted as a GitHub
   issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate
   failure trends, code-review severity over time. Inspired by
   gstack's /retro.

5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped
   the wrong thing" — the failure mode unit tests miss. Plug into
   issue-pickup pipeline so worker-agent draft PRs get scored before
   being marked ready. Inspired by gstack's tier-3 test infra.

## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation +
  cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge
  (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge
  scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-14 11:36:55 -07:00

3.2 KiB

Raw Blame History

name	description
cron-retro	Weekly retrospective digest of cron activity — PRs merged, gates failed, issues picked, code-review findings by severity, time-to-merge, regression trend. Posts to a dedicated GitHub issue. Inspired by gstack's /retro.

cron-retro

The cron runs hourly and ships a lot. Without a periodic summary, drift happens silently — Gate 4 starts failing more often, code-review noise climbs, time-to-merge balloons, and nobody notices for weeks.

When to run

Every Sunday at 23:00 local (0 23 * * 0 cron expression)
On-demand by the CEO

What to compute (over the prior 7 days)

From gh pr list --state merged --search "merged:>=YYYY-MM-DD" and our local cron-learnings.jsonl:

Merged PR count — total + by category (auth/security, refactor, feat, fix, docs, infra)
Issues closed — count, with PR-link for each
Time-to-merge distribution — median, p90, max. Excluding docs PRs (they merge instantly).
Gate failure breakdown — which gates failed how often. Patterns?
Code-review findings — total 🔴 / 🟡 / 🔵 across all PRs. Trend vs prior week.
Mechanical fixes pushed — how often did the cron fix a gate failure on-branch?
Skips by reason — categorize: design-judgment, CI-down, scope-too-open, noteworthy-CEO-needed
Code volume — net LOC added/removed (Garry Tan publishes these in his retros — keep us honest)
Test count delta — Go + Python + Vitest + Jest from start to end of week
New runtime / library / tool added or removed — anything strategic

Format

Post a new GitHub issue titled Cron retro: 2026-04-14 → 2026-04-21 (week N) with body:

# Week summary
- Merged: X PRs (Y closed issues)
- Median TTM: 3h12m (excluding docs)
- Code-review findings: 0 🔴 / 4 🟡 / 18 🔵 (vs last week: 0 / 6 / 24)
- Mechanical fixes pushed: 5
- Skips: 2 design-judgment, 1 CI-down

# Trend signals
- ↑ Frontend test coverage (+12 vitest, +1 file)
- ↓ Time-to-merge for auth PRs (down from 8h median to 3h — likely
   because Gate-4 doc-sync subagent now catches missing .env entries)
- ⚠ Gate 7 (Playwright) failed 3 times this week vs 0 last week —
   probably the canvas dev-server stale-chunk issue. Action item.

# Code volume
- 12,847 lines added, 8,213 removed across 23 commits

# Notes
- Closed #6, #13, #17, #23 — 4 issues from the launch backlog
- 2 issues remain in the SaaS-launch Tier 1 list (multi-tenancy, Fly Machines)
- New skills added this week: cross-vendor-review, careful-mode, cron-learnings, cron-retro

# Action items for next week
- [ ] Investigate Gate 7 flakes (likely fix: persistent canvas dev daemon)
- [ ] Pick up issue #19 (workspace restart context)
- [ ] PR #58 needs CEO review (configurable tier limits — behavior change)

Why this exists

What gets measured improves. gstack publishes weekly retros and credits them with knowing where to invest. We have no analog. This is the smallest viable analog: one issue per week, generated automatically, costs nothing to ignore, valuable when the metrics start drifting.

Implementation note

This skill should be invoked from a separate cron job (not the hourly triage cron). Suggested cron expression: 7 23 * * 0 — Sunday 23:07 local.

3.2 KiB Raw Blame History