feat(.claude): 5 gstack-inspired skills + cron upgrades

Research on garrytan/gstack surfaced 5 patterns worth importing into
our cron / agent setup. These are skills, not platform code — they
guide how the cron and our own subagents work, not what the platform
does at runtime.

## New skills

1. **cross-vendor-review** — adversarial second-model review for
   noteworthy PRs (auth, billing, data deletion, migrations). Catches
   the 15-30% of bugs single-model review misses. Inspired by
   gstack's /codex.

2. **careful-mode** — REFUSE/WARN/ALLOW lists for destructive
   commands. Refuses force-push to main, blocks merging draft PRs,
   prevents rm -rf outside scratch dirs. Inspired by gstack's
   /careful + /freeze.

3. **cron-learnings** — per-project JSONL of operational learnings
   appended at the end of every tick, replayed at the start of the
   next. Stops the cron from re-litigating decided issues.
   Inspired by gstack's /learn.

4. **cron-retro** — weekly retrospective auto-posted as a GitHub
   issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate
   failure trends, code-review severity over time. Inspired by
   gstack's /retro.

5. **llm-judge** — cheap LLM-as-judge eval to catch "agent shipped
   the wrong thing" — the failure mode unit tests miss. Plug into
   issue-pickup pipeline so worker-agent draft PRs get scored before
   being marked ready. Inspired by gstack's tier-3 test infra.

## Cron updates (session-only, c5074cd5 + 060d136c)

- Hourly triage cron now opens with careful-mode activation +
  cron-learnings replay (Step 0)
- code-review skill on every PR being considered for merge
  (Step 2 supplement A — already present, formalized)
- cross-vendor-review on noteworthy PRs (Step 2 supplement B — new)
- llm-judge on issue-pickup draft PRs before marking ready (Step 4)
- Status report now includes cross-vendor pass/fail and llm-judge
  scores (Step 5)
- End-of-tick cron-learnings append (Step 5)
- New weekly cron at Sun 23:07 invokes the cron-retro skill

## What we did NOT take from gstack

- Their browser fork — not our product
- The 23 named roles — we have agent role templates already
- Bun toolchain — adds yet another runtime to our stack
- /design-shotgun and design-tool variants — we're not a design tool
- /document-release — our update-docs skill already covers this

See PR description for full research notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-14 11:36:55 -07:00
parent 639c32045d
commit 9d914193d2
6 changed files with 365 additions and 0 deletions

@@ -1,5 +1,21 @@
# Loop discipline — process notes
## 2026-04-14 — gstack-inspired cron upgrades
Five new skills added under `.claude/skills/` (inspired by garrytan/gstack):
- **`cross-vendor-review`** — second-model adversarial review for noteworthy PRs (auth, billing, data deletion, migrations). Catches the 15-30% of bugs single-model review misses.
- **`careful-mode`** — REFUSE/WARN/ALLOW lists for destructive commands. Active at the start of every cron tick. Refuses force-push to main, blocks merging draft PRs, prevents `rm -rf` outside scratch dirs.
- **`cron-learnings`** — per-project JSONL of operational learnings. End each cron tick by appending 1-3 lines; start the next tick by replaying the last 20.
- **`cron-retro`** — weekly retrospective auto-posted as a GitHub issue. Sunday 23:07 local. Tracks PR count, time-to-merge, gate failures, code-review severity trends.
- **`llm-judge`** — cheap LLM-as-judge eval to catch "agent shipped the wrong thing" — the failure mode unit tests miss.
Two crons govern this:
- **Hourly triage** (`:17` past each hour) — Step 0 activates careful-mode + replays cron-learnings; Step 2 supplements run code-review and (for noteworthy PRs) cross-vendor-review; Step 4 issue-pickup runs llm-judge before marking ready; Step 5 appends cron-learnings.
- **Weekly retro** (Sunday `23:07`) — invokes cron-retro skill, posts a GitHub issue.
Both crons are session-only per the runtime; re-invoke in a new session if needed.
## Rule: a "skipped" PR must have a comment explaining the skip
When the hourly maintenance loop skips a PR for any reason — CI red,

@@ -0,0 +1,74 @@
---
name: careful-mode
description: Refuse or warn before destructive irreversible commands (rm -rf, force push, DROP TABLE, gh pr close, gh issue close, mass DELETE). Inspired by gstack's /careful and /freeze. Activate at the start of any cron tick or when about to write to shared resources.
---
# careful-mode
Cron has merge authority + commit authority. That is enough rope to do permanent damage. This skill is the seatbelt.
## Activate when
- The hourly cron tick starts
- About to call `gh pr merge` / `gh pr close` / `gh issue close`
- About to push to a branch other than your own draft
- About to run `git push --force` for any reason
- About to run `rm -rf` on anything inside the repo
- About to issue `DROP TABLE` / `TRUNCATE` / `DELETE FROM ... WHERE` without a known small WHERE
## Categories
### REFUSE — hard stop
- `git push --force` to `main`, `master`, or any protected branch
- `gh pr merge` on a PR that:
  - has CI failing
  - has `state: draft`
  - has unresolved review comments from a non-bot author
  - was created in the same conversation context (need 1 tick of distance)
- `git reset --hard` against a branch that has commits I haven't seen pushed to a remote
- `rm -rf` against any path matching `**/migrations/**`, `.git/`, `~/.molecule/`, or repo root
- `DROP TABLE`, `TRUNCATE TABLE` against any table in the molecule schema
- `DELETE FROM workspaces` without a `WHERE id = $known_uuid` clause
### WARN — proceed only with explicit confirmation in the prompt
- `gh pr close` on a PR not authored by me
- `gh issue close` on any issue
- `git push --force-with-lease` (safer than `--force`, still requires care)
- `rm -rf node_modules/ dist/` (safe, but worth a one-line "yes I meant this")
- `chmod -R` on anything outside the current PR's diff
- Mass curl-DELETE loops over `/workspaces` (the cleanup-rogue-workspaces.sh pattern is OK but document the prefix)
### ALLOW
- Anything against `/tmp/`, the agent's own scratch dir, or test artifacts
- Reads of any kind
- Standard merges via `gh pr merge --merge --delete-branch` once the gates pass
- Single-row updates / deletes with explicit WHERE on a known-uuid
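As a sketch, the agent's pre-check of a proposed command against these lists can be expressed in code. This is a minimal illustration, not the canonical rules: the regex patterns below cover only a few examples from each list.

```python
import re

# Illustrative REFUSE/WARN patterns; the authoritative lists are the
# prose sections above. `--force(\s|$)` deliberately excludes
# `--force-with-lease`, which falls under WARN instead.
REFUSE_PATTERNS = [
    r"git push\s+--force(\s|$).*\b(main|master)\b",  # force-push to protected branch
    r"rm -rf\s+\S*(migrations|\.git)\b",             # rm -rf on protected paths
    r"\b(DROP|TRUNCATE)\s+TABLE\b",                  # irreversible DDL
]
WARN_PATTERNS = [
    r"git push\s+--force-with-lease\b",
    r"gh (pr|issue) close\b",
    r"rm -rf\s+(node_modules|dist)/?",
]

def classify(command: str) -> str:
    """Return REFUSE, WARN, or ALLOW for a proposed shell command."""
    for pat in REFUSE_PATTERNS:
        if re.search(pat, command):
            return "REFUSE"
    for pat in WARN_PATTERNS:
        if re.search(pat, command):
            return "WARN"
    return "ALLOW"
```

REFUSE is checked before WARN so the stricter verdict always wins when both lists could match.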
## Freeze mode
When debugging a tricky issue, lock edits to one directory. Example invocation:
```
careful-mode freeze platform/internal/handlers/
# now any Edit/Write outside that path refuses
careful-mode unfreeze
```
This is conceptually like gstack's `/freeze` — prevents accidental scope creep when an agent is spelunking.
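A hypothetical sketch of the freeze check itself, assuming the agent gates every Edit/Write through a helper like this (function names and paths are illustrative):

```python
from pathlib import Path

# Module-level freeze scope; None means no freeze is active.
_frozen_to = None

def freeze(path: str) -> None:
    """Lock writes to the given directory subtree."""
    global _frozen_to
    _frozen_to = Path(path).resolve()

def unfreeze() -> None:
    global _frozen_to
    _frozen_to = None

def write_allowed(target: str) -> bool:
    """True if a write to `target` respects the current freeze scope."""
    if _frozen_to is None:
        return True
    return Path(target).resolve().is_relative_to(_frozen_to)
```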
## How to honor this skill
The skill is enforced by the AGENT, not by the harness. When making a tool call that lands in the REFUSE / WARN list, the agent must:
1. Stop
2. State the exact command + which list it falls under
3. Explain why this case is or isn't safe
4. For WARN, ask for explicit user confirmation
5. For REFUSE, decline and propose a safer alternative
## Why this exists
The cron has merge authority. gstack documented several near-misses where Claude wiped working directories or force-pushed to main. We avoid those by making the rules explicit and machine-readable, applied at the start of every tick.

@@ -0,0 +1,60 @@
---
name: cron-learnings
description: At the end of every cron tick, append 1-3 lines of operational learnings (what worked, what surprised, what should change next tick) to a per-project JSONL. Replay at start of next tick. Inspired by gstack's /learn skill.
---
# cron-learnings
Each tick, the cron does a lot of work. Half the lessons are forgotten by the next tick. This skill is the compounding layer.
## Storage
Per-project file at:
```
~/.claude/projects/<sanitized-project-path>/memory/cron-learnings.jsonl
```
For molecule-monorepo, that's:
```
~/.claude/projects/-Users-hongming-Documents-GitHub-molecule-monorepo/memory/cron-learnings.jsonl
```
One JSON object per line:
```json
{"ts": "2026-04-14T05:17:00Z", "tick_id": "5939aa3f-001", "category": "gate-fail", "summary": "Gate 4 (security) flagged token!=secret in PR #28; requireInternalAPISecret needs subtle.ConstantTimeCompare", "next_action": "When reviewing auth-gate code, grep for `subtle.ConstantTimeCompare`. Flag plain == on tokens."}
```
Categories:
- `gate-fail` — a verification gate caught something
- `mechanical-fix` — fixed a gate failure on-branch
- `false-positive` — a code-review finding turned out to be wrong; record so we don't keep flagging it
- `tool-error` — an MCP tool / CLI flaked; note the workaround
- `repo-state` — something about the repo's state that next tick should know
- `pattern` — a cross-PR pattern worth remembering (e.g., "every cron loop adds itself as `noreply@anthropic.com`; reviewers OK with it")
## When to write
End of every cron tick (Step 5 of the cron prompt). 1-3 lines max — be terse.
## When to read
Start of every cron tick. Read the last 20 lines (most recent first) before Step 1. Use them to:
- Skip false-positive paths the previous tick flagged
- Apply learned patterns (e.g., "PR #28 found INTERNAL_API_SECRET missing from .env.example — when reviewing future security PRs, always check .env.example sync as a first move")
- Avoid re-litigating decided design choices
## Pruning
Cap at 500 lines. When exceeded, the next write also drops the oldest 100 lines. The point is recent operational memory, not an audit log.
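The append-with-prune and replay rules above can be sketched in a few lines. A minimal illustration, assuming the file layout described under Storage (constants mirror the stated caps):

```python
import json
from pathlib import Path

MAX_LINES = 500    # cap from the pruning rule
DROP_OLDEST = 100  # lines dropped once the cap is exceeded
REPLAY_LINES = 20  # recent learnings replayed at the start of a tick

def append_learning(path: Path, entry: dict) -> None:
    """Append one learning; drop the oldest 100 lines past the 500-line cap."""
    lines = path.read_text().splitlines() if path.exists() else []
    lines.append(json.dumps(entry))
    if len(lines) > MAX_LINES:
        lines = lines[DROP_OLDEST:]
    path.write_text("\n".join(lines) + "\n")

def replay(path: Path) -> list:
    """Last 20 learnings, most recent first, for the start of the next tick."""
    if not path.exists():
        return []
    lines = path.read_text().splitlines()
    return [json.loads(line) for line in reversed(lines[-REPLAY_LINES:])]
```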
## Format discipline
- One line per event
- ASCII-only for grep-friendliness
- No PII, no tokens, no URLs with auth
- `summary` is what HAPPENED; `next_action` is what FUTURE-YOU should DO
- If you can't think of a concrete next_action, it's not worth logging
## Why this exists
gstack's `/learn` showed that AI sessions repeatedly make the same mistakes because the lessons live only in the conversation that produced them. Writing them to disk lets every tick start with the accumulated wisdom of every prior tick, at zero cost. The awareness MCP we have is fine for cross-session human/agent memory — this file is specifically for the cron's own automation.

@@ -0,0 +1,69 @@
---
name: cron-retro
description: Weekly retrospective digest of cron activity — PRs merged, gates failed, issues picked, code-review findings by severity, time-to-merge, regression trend. Posts to a dedicated GitHub issue. Inspired by gstack's /retro.
---
# cron-retro
The cron runs hourly and ships a lot. Without a periodic summary, drift happens silently — Gate 4 starts failing more often, code-review noise climbs, time-to-merge balloons, and nobody notices for weeks.
## When to run
- Every Sunday at 23:07 local (`7 23 * * 0` cron expression)
- On-demand by the CEO
## What to compute (over the prior 7 days)
From `gh pr list --state merged --search "merged:>=YYYY-MM-DD"` and our local `cron-learnings.jsonl`:
1. **Merged PR count** — total + by category (auth/security, refactor, feat, fix, docs, infra)
2. **Issues closed** — count, with a PR link for each
3. **Time-to-merge distribution** — median, p90, max, excluding docs PRs (they merge instantly)
4. **Gate failure breakdown** — which gates failed how often. Patterns?
5. **Code-review findings** — total 🔴 / 🟡 / 🔵 across all PRs. Trend vs prior week.
6. **Mechanical fixes pushed** — how often did the cron fix a gate failure on-branch?
7. **Skips by reason** — categorize: design-judgment, CI-down, scope-too-open, noteworthy-CEO-needed
8. **Code volume** — net LOC added/removed (Garry Tan publishes these in his retros — keep us honest)
9. **Test count delta** — Go + Python + Vitest + Jest from start to end of week
10. **New runtime / library / tool added or removed** — anything strategic
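Item 3 above (time-to-merge distribution) is the fiddliest to compute. A sketch, assuming `gh pr list --state merged --json createdAt,mergedAt,labels` as the data source; the docs-label filter is an assumption about how docs PRs are tagged:

```python
from datetime import datetime
from statistics import median, quantiles

def ttm_stats(prs: list) -> dict:
    """Median / p90 / max time-to-merge in hours, excluding docs PRs."""
    hours = []
    for pr in prs:
        if any(l["name"] == "docs" for l in pr.get("labels", [])):
            continue  # docs PRs merge instantly and skew the distribution
        created = datetime.fromisoformat(pr["createdAt"].replace("Z", "+00:00"))
        merged = datetime.fromisoformat(pr["mergedAt"].replace("Z", "+00:00"))
        hours.append((merged - created).total_seconds() / 3600)
    return {
        "median_h": median(hours),
        "p90_h": quantiles(hours, n=10)[-1],  # 90th percentile cut point
        "max_h": max(hours),
    }
```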
## Format
Post a new GitHub issue titled `Cron retro: 2026-04-14 → 2026-04-21 (week N)` with body:
```markdown
# Week summary
- Merged: X PRs (Y closed issues)
- Median TTM: 3h12m (excluding docs)
- Code-review findings: 0 🔴 / 4 🟡 / 18 🔵 (vs last week: 0 / 6 / 24)
- Mechanical fixes pushed: 5
- Skips: 2 design-judgment, 1 CI-down
# Trend signals
- ↑ Frontend test coverage (+12 vitest, +1 file)
- ↓ Time-to-merge for auth PRs (down from 8h median to 3h — likely
  because Gate-4 doc-sync subagent now catches missing .env entries)
- ⚠ Gate 7 (Playwright) failed 3 times this week vs 0 last week —
  probably the canvas dev-server stale-chunk issue. Action item.
# Code volume
- 12,847 lines added, 8,213 removed across 23 commits
# Notes
- Closed #6, #13, #17, #23 — 4 issues from the launch backlog
- 2 issues remain in the SaaS-launch Tier 1 list (multi-tenancy, Fly Machines)
- New skills added this week: cross-vendor-review, careful-mode, cron-learnings, cron-retro
# Action items for next week
- [ ] Investigate Gate 7 flakes (likely fix: persistent canvas dev daemon)
- [ ] Pick up issue #19 (workspace restart context)
- [ ] PR #58 needs CEO review (configurable tier limits — behavior change)
```
## Why this exists
What gets measured improves. gstack publishes weekly retros and credits them with knowing where to invest. We have no analog. This is the smallest viable analog: one issue per week, generated automatically, costs nothing to ignore, valuable when the metrics start drifting.
## Implementation note
This skill should be invoked from a separate cron job (not the hourly triage cron). Suggested cron expression: `7 23 * * 0` — Sunday 23:07 local.

@@ -0,0 +1,71 @@
---
name: cross-vendor-review
description: Run an adversarial code review against a non-Claude model (Codex / GPT / Gemini) and surface disagreements with Claude's own review. Use ONLY for noteworthy PRs (auth, billing, data-deletion, irreversible migration, large-blast-radius). Inspired by gstack's /codex command.
---
# cross-vendor-review
Two LLMs catch bugs one doesn't. Claude has blind spots; so does GPT-5; so does Gemini. For high-stakes PRs the cost of a second model is dwarfed by the cost of a missed defect.
## When to invoke
ALWAYS for PRs touching:
- Authentication, authorization, session, or token handling
- Billing / payments / Stripe / metering
- Destructive operations (delete cascades, mass-update, drop)
- Database migrations (schema changes, data backfills)
- Cross-tenant isolation logic
- Cryptographic primitives
OPTIONAL for:
- Large refactors (>500 LOC)
- Performance-sensitive changes
- Anything where the cron's standard code-review skill returned conflicting signals
NEVER for:
- Docs, templates, CI tweaks, dependency bumps, test-only changes
## How to invoke
1. Pull the diff: `gh pr diff N --repo OWNER/REPO`
2. Run Claude's own code-review skill first; capture its findings
3. Send the SAME diff + the SAME rubric to a second model:
- Preferred order: GPT-5 (via Codex CLI or API), Gemini Pro 2.5, Llama 3.3 70B
- One-shot prompt; no conversation
- Instruct the second model to be ADVERSARIAL: assume the diff has at least one bug and find it
4. Compare the two reports. For each finding:
- Both flag it → real, must address
- Only Claude → likely real, address or justify dismissal
- Only second model → may be real, investigate
- Both clean → ok to merge
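Step 4's comparison rules reduce to a small merge function. A sketch, assuming each report has been normalized to a set of short finding descriptions (the keying scheme is an assumption, not part of the skill):

```python
def merge_findings(claude: set, other: set) -> dict:
    """Map each finding to a verdict per the comparison rules above."""
    verdicts = {}
    for f in claude | other:
        if f in claude and f in other:
            verdicts[f] = "MUST FIX"        # both models flag it: real
        elif f in claude:
            verdicts[f] = "SHOULD FIX"      # Claude-only: likely real
        else:
            verdicts[f] = "INVESTIGATE"     # second-model-only: may be real
    return verdicts
```

An empty result corresponds to the "both clean" case: ok to merge.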
## Output format
```
## Cross-vendor review for PR #N
| Finding | Claude | <2nd model> | Verdict |
|---|---|---|---|
| Token compared with == not constant-time | 🔴 | 🔴 | MUST FIX |
| ctx not propagated through goroutine | 🟡 | — | SHOULD FIX |
| — | — | 🟡 stale jwt cache on revoke | INVESTIGATE |
## Disagreements
- Claude said X; <model> said Y. Resolution: ...
## Verdict
- ☐ Merge (both clean)
- ☐ Address findings then re-review
- ☐ Escalate to CEO (irreconcilable models)
```
## Cost guard
Cross-vendor calls cost real money. Cap:
- One pass per PR per session
- Skip if the noteworthy-flag is uncertain (default: no second model)
- Log per-tick spend in the cron telemetry channel
## Why this exists
gstack's `/codex` showed that single-model review misses ~15-30% of real findings catchable by a different vendor. Auth bugs are precisely the class where blind spots are catastrophic. This skill formalizes the pattern.

@@ -0,0 +1,75 @@
---
name: llm-judge
description: Evaluate whether a Molecule AI agent's output (a PR, a delegation result, a generated config) actually addresses the original request. Cheap LLM-as-judge gate that catches "wrong answer to right question" — the failure mode unit tests miss. Inspired by gstack's tier-3 LLM-as-judge test infra.
---
# llm-judge
Unit tests verify the code RAN. They don't verify it did the RIGHT THING for the customer's actual request. This skill closes that gap.
## When to invoke
After a Molecule AI agent (PM, Dev Lead, QA, etc.) produces a deliverable:
- A PR they opened in response to an issue
- A delegation result (response to an A2A `message/send`)
- A generated config or template
- A code review they posted
Specifically: when a worker agent comes back with "done", before we believe them.
## Inputs
1. The ORIGINAL request — the issue body, the user message, the delegation prompt
2. The DELIVERABLE — the diff, the response text, the generated artifact
3. ACCEPTANCE CRITERIA if explicit (often in the issue body)
## How to evaluate
Send to a small fast model (Haiku, GPT-mini, Gemini Flash):
```
You are an evaluator. Below is a customer request and the deliverable
the AI agent produced. Rate, on a 0-5 scale, how well the deliverable
addresses the original request. Then list the top 3 reasons for the score.
REQUEST:
<paste original>
DELIVERABLE:
<paste artifact>
ACCEPTANCE CRITERIA (if any):
<paste>
Output JSON:
{
"score": 0..5,
"addresses_request": true|false,
"missing": ["...", "..."],
"wrong": ["...", "..."],
"reasons": ["...", "...", "..."]
}
```
## Decision
| Score | Action |
|---|---|
| 5 | Accept — log to telemetry |
| 4 | Accept with note — file a follow-up issue for the gap if material |
| 3 | Send back to the agent for revision with the judge's "missing" list |
| 0-2 | Reject. Escalate to CEO. Likely the agent misunderstood the task — fixing the prompt > fixing the deliverable |
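The table reads directly as a routing function. A sketch; the action strings are illustrative, the thresholds come straight from the table:

```python
def route(judgment: dict) -> str:
    """Map a judge score (0-5) to the next pipeline action."""
    score = judgment["score"]
    if score == 5:
        return "accept"
    if score == 4:
        return "accept-with-followup"   # file an issue for the gap if material
    if score == 3:
        return "revise"                 # send the judge's "missing" list back
    return "reject-escalate"            # 0-2: likely a misunderstood task
```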
## Cost
Tier-3 (Haiku-class): ~$0.001 per eval. Even at 100 evals/day = $0.10/day. Negligible.
## Where to plug it in
- **Cron Step 4 (issue pickup)**: after a draft PR is opened by a subagent, run llm-judge against the issue body. Mark the PR ready ONLY if score >= 4.
- **A2A delegation in workspaces**: optionally enable per-org. PM gets the worker's response, runs llm-judge, only forwards to the next stage if accepted.
- **Manual**: `npm run skill:llm-judge -- --request <file> --deliverable <file>`
## Why this exists
gstack runs LLM-as-judge as a test-tier ($0.15 per eval, ~30s). Our worker agents produce many more deliverables per day than gstack's single-session model — making the eval cheaper and more frequent matches our scale. The failure mode this catches — "agent shipped the wrong thing" — is invisible to unit tests AND to code-review skills (both verify the code, not the intent).