Wraps up a ~100-tick autonomous triage session by converting the prior
operator's institutional knowledge into standing, checked-in artifacts
so the next team picking up the hourly PR + issue cycle can drop in
without re-discovering everything from scratch.
## New role: Triage Operator
Peer to Dev Lead, Research Lead, Documentation Specialist under PM.
Owns the 7-gate PR verification + issue-pickup cycle across both
molecule-monorepo and molecule-controlplane. NOT an engineer — never
writes logic, never makes design calls. Mechanical fixes on other
people's branches + verified-merge only.
Runs on cron `17 * * * *`. On first boot reads four handoff files +
the last 20 lines of cron-learnings.jsonl, waits for the scheduled
tick (no first-boot triage — known stale-state footgun).
## Files
org-templates/molecule-dev/triage-operator/
- system-prompt.md (48 lines) — role prompt loaded at boot. Standing
rules, verification discipline, escalation paths.
- philosophy.md (135 lines) — 10 principles each tied to a real
incident. Rule 2 ("tool succeeded ≠ work done") references the
WorkOS refresh-token + missing-migration saga. Rule 3 (authority
verification) references PR #370 CEO directive hold.
- playbook.md (234 lines) — step-by-step tick flow (Step 0 guards →
1 list → 2 seven-gate → 3 docs sync → 4 issue pickup → 5 report).
Expected 5–30 min wall-clock. When-not-to-triage.
- handoff-notes.md (146 lines) — point-in-time state for the NEXT
operator arriving fresh. 15 PRs merged this session, in-flight
items, design-call backlog with recommendations per issue.
- SKILL.md (152 lines) — installable skill spec. Invocation, inputs,
outputs, required composed skills, edge cases, output format.
.claude/AGENT_HANDOFF.md (206 lines) — top-level handoff for any
Claude Code agent working this repo (not just the triage operator).
The 10 principles (one-liners), communication style the user
expects, currently-live state, open items, what NOT to do, break-
glass escalation conditions. Points at triage-operator/philosophy.md
for full incident context.
## Wiring
org.yaml gains a Triage Operator workspace block under PM with:
- tier: 3, model: opus
- 8 plugins (careful-bash, session-context, cron-learnings,
code-review, cross-vendor-review, llm-judge, update-docs, hitl)
- Hourly cron at `:17` with the full Step 0–5 flow inline as prompt
- canvas position (1150, 250) — peer to Documentation Specialist
## Why this ships now
The 30-min manual triage cron was cancelled per CEO direction. The
role moves to another team. Without this handoff package they'd be
rediscovering the same incident-classes I shipped fixes for
(#318 fail-open, #327 cross-tenant decrypt, #351 tokenless grace,
WorkOS refresh-token saga, missing migration runner). The philosophy
file gives them the scar tissue in ~10 min of reading; the playbook
gives them the steps; the SKILL gives them an invocable entry point.
No code changes outside org.yaml. Existing TestPlugins_UnionWithDefaults
still passes (verified in platform test run).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9.6 KiB
Triage Operator — Handoff Notes (2026-04-16)
Snapshot taken at handoff from the prior operator (Claude Opus 4.6, 1M context, ~100 tick session). Read this once, then discard — it's a point-in-time dump, not a running doc.
What shipped this session (merge log, for audit)
Platform monorepo (merged to main):
| PR | Fix | Severity |
|---|---|---|
| #317 | hitl.py workspace-ID ownership + security_scan.py fail-closed + caught SkillSecurityError kwargs bug via regression test |
LOW+LOW |
| #326 | WorkspaceAuth fake-UUID fail-open fix (Phase 30.1 grace-period kept) |
HIGH |
| #327 | channel_config bot_token + webhook_secret AES-256-GCM encryption (ec1: prefix scheme, lazy migration) |
MEDIUM |
| #330 | Wired molecule-compliance + molecule-audit + molecule-freeze-scope to Security Auditor / Backend / QA / DevOps |
config |
| #331 | New docs/glossary.md — terminology disambiguation table (9 terms + near-miss section) |
docs |
| #335 | PausePollersForToken scoped to requesting workspace (cross-tenant decrypt fix) |
MEDIUM |
| #338 | /transcript fail-closed on missing token; extracted transcript_auth.py for testability |
HIGH |
| #341 | Self-hosted Mac runner: credsStore: "" explicit to avoid osxkeychain bindings |
CI |
| #343 | webhook_secret constant-time compare (subtle.ConstantTimeCompare) |
LOW |
| #346 | Security Auditor prompt drift: added #319 + #337 checks to system prompt + 12h cron | chore |
| #357 | Remove WorkspaceAuth tokenless grace period entirely (strict bearer required) |
HIGH |
| #370 | Engineer idle-loops (proactive issue pickup) — CEO-confirmed directive | template |
Control plane (merged to main):
| PR | Fix |
|---|---|
| #35 | Session cookie stores refresh_token instead of OAuth code (auth-blocker) |
| #36 | Auto-apply embedded migrations on boot (migrations 006, 007 ran for the first time in prod) |
| #37 | Reserved subdomain list expanded from 9 entries to 341 across 12 categories |
Live deploys:
app.moleculesai.appon Fly (v38 with all three CP PRs)api.moleculesai.appmigration in-flight (DNS done, WorkOS dashboard done,WORKOS_REDIRECT_URIflipped at 06:06Z, user verifying end-to-end)status.moleculesai.app(Upptime on GitHub Pages) — unchanged from earlier session- Stripe test-mode webhook + products + prices live on molecule-cp
CP_ADMIN_USER_IDS=user_01KPA3Z3810QEF3HCKRXP2EED9(CEO's WorkOS user)
What's in-flight that the next operator inherits
1. app.moleculesai.app grace period
After the CEO confirms api.moleculesai.app works end-to-end (login + admin endpoints), the OLD app.moleculesai.app subdomain needs to be dropped:
- Fly:
fly certs delete app.moleculesai.app -a molecule-cp - WorkOS dashboard: remove
https://app.moleculesai.app/cp/auth/callbackfrom allowed redirect URIs - Cloudflare DNS: delete the
appCNAME record
Do NOT do any of this until the CEO confirms the new domain works. 24–48h grace period minimum. If an active session still references the old cookie domain, dropping too early breaks their login.
2. Zombie workspace row (#367)
The Security Auditor agent filed #367 claiming ffffffff-ffff-ffff-ffff-ffffffffffff still returns 200 on unauth /secrets. My analysis: stale probe — no local platform is running on this host (lsof -iTCP:8080 empty), so the auditor's probe must have hit an old process. My triage comment pointed this out and asked for live re-verification against a fresh ./platform/server binary.
Next operator: if the CEO rebuilds + runs the local platform, re-probe:
curl -s -o /dev/null -w "%{http_code}" \
http://localhost:8080/workspaces/ffffffff-ffff-ffff-ffff-ffffffffffff/secrets
Expected: 401 (because PR #357 removed the tokenless grace period). If 200, there's a real bug in the routing layer we haven't found.
3. Open design calls — CEO deciding
These are feature/plugin/research proposals. The next operator should NOT pick them up without explicit CEO instruction. They are listed here so the next operator can reference them quickly:
| Issue | Class | My recommendation |
|---|---|---|
| #126 / #243 | Slack adapter for DevOps + Security Auditor | Build small (one webhook pattern, not full Slack app); confirm scope with CEO |
| #239 | Provisioner recovery for failed workspaces with missing config volume |
Lean Option 1 (auto-reap + log) |
| #245 | Telegram channel for Security Auditor + DevOps | Already shipped via #246 |
| #258 | molecule-sandbox plugin (subprocess/docker/e2b) |
Three separate plugins per CEO tick-032 direction |
| #274 | Witness/Deacon/Dogs three-tier health pattern | Layer 1 scaffolding only, ~6h |
| #286 | investment-committee template |
Vertical pattern — valuable if there's a customer; skip otherwise |
| #294 | IATP signed delegation | Couple with #311 ADK spike |
| #298 | molecule-plugin-github |
~2h pickup, wraps github-mcp-server |
| #302 | Bloom behavioral eval hook | Skip, diminishing returns |
| #305 | Per-workspace token budget cap | Defer until billing model changes |
| #309 | browser-use plugin |
Defer, overlaps with #281 |
| #311 | Google ADK A2A spike | Research spike, not code |
| #313 | Workspace-as-MCP-server | Phase-H design spike |
| #315 | HERMES_OVERLAYS two-layer provider | Research |
| #323 | mcp-agent plugin |
Defer unless Research Lead bottleneck is real |
| #332 | gemini-cli runtime adapter |
Defer until a user asks; ~4-6h |
| #333 | PM goal-decomposition skill | Minimal-scope, ~6h if picked up |
| #345 | molecule-temporal plugin |
Defer — temporal_workflow.py already ships per-workspace |
| #347 | molecule-governance plugin |
Pick up if MS AGT compliance matters to sales |
| #348 | Agent Protocol exposure spike | Research only |
| #349 | HITL structured feedback types | Pickable — concrete value, ~4h |
| #361 | Memory tiers (L0-L4) | Pickable with 2 answers: TEXT+CHECK vs enum, L0 enforced vs advisory |
| #362 | OpenSRE DevOps integrations | Research spike, need 3 target integrations from CEO |
| #364–368 | Recent plugin proposals (telemetry / trailofbits / awareness / budget / zombie / eco) | Mostly design calls; #368 budget enforcement is pickable |
4. Cron-learnings is the read-first file
~/.claude/projects/-Users-hongming-Documents-GitHub-molecule-monorepo/memory/cron-learnings.jsonl has ~52 ticks of operational history. The next operator reads the last 20 lines at the start of every tick (enforced by the SessionStart hook if installed, or by Step 0 of playbook.md).
Key cron-learnings conventions:
tick_idformat:manual-NNNfor /triage runs,overnight-NNNfor cron autonomous runscategoryis alwaysworkflowfor now — reserved for future (incident,config,research)next_actionmust be CONCRETE and actionable by either the CEO or the next tick. Vague "continue monitoring" is a waste of disk.
5. Secrets status (for ops continuity)
| Secret | Where | Rotation |
|---|---|---|
FLY_API_TOKEN |
GitHub Actions + fly secrets on molecule-cp |
Both places, together |
SECRETS_ENCRYPTION_KEY |
molecule-cp | Cannot rotate until Phase H KMS envelope lands — see docs/runbooks/saas-secrets.md |
WORKOS_API_KEY |
molecule-cp | WorkOS dashboard only |
STRIPE_API_KEY |
molecule-cp | Currently TEST-MODE sk_test_51TMJEV.... Flip to live when CEO completes Canadian federal incorporation |
RESEND_API_KEY |
molecule-cp | Resend dashboard |
CP_ADMIN_USER_IDS |
molecule-cp | Comma-separated WorkOS user_ids — currently user_01KPA3Z3810QEF3HCKRXP2EED9 |
6. Known unreliable signals
- Mac mini self-hosted runner has a history of 2+ hour queue latency. If CI pending > 30 min, prefer merging via local
go test -race ./...+ explicit CEO approval over waiting. - Security Auditor agent probes sometimes run against stale platform binaries. Always confirm "which process / when" before treating a finding as current.
- Eco-watch agent PRs (e.g. #334, #350) are usually doc-only additions to
docs/ecosystem-watch.md. Verified-merge is fine if the diff is pure docs.
Open questions the next operator should NOT answer — escalate
- Stripe live-mode cutover timing
- App-UI subdomain layout (what goes at
app.moleculesai.apponce the CEO's other agent ships the landing page) - Whether to add
schema_migrationstracking table to the control plane migration runner - Investment-committee template go/no-go (#286)
Goodbye note
This was a ~100-tick session. I shipped 15 PRs across the two repos, caught two HIGH auth fail-opens the security auditor missed (#318 fake-UUID + #351 tokenless grace), two auth-blocker bugs in the control plane (wrong-cookie-contents + missing migration runner), and one directive-claim verification that held a PR for 10 minutes until the CEO confirmed (#370).
The philosophy that held up best across the whole session: verify before claiming done. Three different 401-loop bugs (#336, #351, WorkOS refresh-token) were all the same class — a claim of success that was technically true for the step the agent observed but false for the downstream step the agent didn't re-check. The operator who reads playbook.md Step 2 carefully will catch these before I did.
The philosophy that was hardest to hold: don't pick up design calls. The backlog looks like easy wins; each proposal says "small scope, clear fix." Most are 2-hour conversations with the CEO disguised as 2-hour engineering tickets. Reading the philosophy file's rule #7 (two-issue cap) + rule #9 (when you don't know, don't guess) is how you stay in-scope.
Good luck. Append your own goodbye note when you hand off.
— Claude Opus 4.6, 2026-04-16