Adds a 13th workspace to the molecule-dev template owning end-to-end
documentation across all Molecule AI surfaces.
## Why now
- We just created Molecule-AI/docs (customer-facing site at
doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs
someone to own it.
- Internal docs (README.md, docs/architecture.md, docs/edit-history/) were
drifting — every platform PR has been opening a docs sync PR manually.
- No agent in the team owned terminology consistency or stub backfill.
## Where it sits in the org
Third PM direct report, parallel to Research Lead and Dev Lead — docs is
its own swim lane that spans engineering (docs follow code) and
research/product (concepts and terminology).
PM
├── Research Lead
├── Dev Lead
└── Documentation Specialist <-- new
## Schedules (2)
1. **Daily docs sync — backfill stubs and pair recent platform PRs**
`0 9 * * *` — every morning:
- Pair every merged platform PR (last 24h) with a docs PR if needed
- Backfill one stub page on the docs site
- Crawl the live site for broken links / dead anchors
- delegate_task to PM with audit_summary (category=docs)
2. **Weekly terminology + freshness audit**
`0 11 * * 1` — every Monday:
- Stale page detection (>30 days untouched on fast-moving surfaces)
- Terminology consistency check (one canonical name per concept)
- Link-rot scan
- Same audit_summary contract
## Plugins
Inherits the 9 universal defaults. Adds `browser-automation` for crawling
the live docs site. `molecule-skill-update-docs` is already in defaults
so the cross-repo sync skill is available.
## Routing
Adds `docs: [Documentation Specialist]` to `category_routing` so any
agent that emits an audit_summary with category=docs is auto-routed
here by the platform.
## Bind mounts
Note: this workspace clones BOTH /workspace/repo (the platform monorepo)
and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the
agent can edit either side.
Wraps the canvas root so every tenant-subdomain request checks for a
valid session and bounces to app.moleculesai.app/cp/auth/login with a
return_to pointing back at the current URL. Local dev + vercel preview
URLs + apex pass through unchanged.
Files:
- canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me
(credentials:include for cross-origin cookie); returns Session on 200,
null on 401 (anonymous, no throw), throws on 5xx so a transient
outage does not expose the UI without a verified session.
- canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL
with window.location.href as return_to; CP's isSafeReturnTo check
rejects cross-domain bounces.
- canvas/src/components/AuthGate.tsx: client component wrapping
children. State machine: loading → authenticated | anonymous. In
non-SaaS mode (no tenant slug) skips the gate entirely.
- canvas/src/app/layout.tsx: wraps the root body in <AuthGate>.
Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include /
redirectToLogin href + signup variant). Full suite 453 green (was 447).
Pairs with molecule-controlplane PR #16 (return_to cookie handshake
on the cp side).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go
cross-origin to https://app.moleculesai.app. This commit wires the
client side:
- canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from
window.location.hostname, case-insensitive, matching the control
plane's reservedSubdomains list (app/www/api/admin/…). Server-side
+ localhost + vercel preview URLs + apex all return "" so local dev
keeps working.
- canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets
credentials:"include" on every fetch. The control plane's CORS
middleware allows the origin + credentials; the session cookie has
Domain=.moleculesai.app so the browser ships it.
- canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its
own fetch helper — shared slug+credentials logic applied).
Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS /
preview URL / apex). Full canvas suite 447/447 green.
Not in this PR:
- WS URL derivation for terminal/socket.ts (separate follow-up; WS
needs its own slug-aware URL and the canvas terminal isn't used in
SaaS launch day-one).
- Next.js rewrites (decided against; cross-origin with credentials
is cleaner than path-level rewrites for session cookies).
Deploys to Vercel once merged — no manual config needed (env already set).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.
Specifically:
    func (s *Scheduler) Start(ctx context.Context) {
        for {
            select {
            case <-ticker.C:
                s.tick(ctx) // <- if this panics, the for-loop exits forever
            }
        }
    }
And inside tick:
    go func(s2 scheduleRow) {
        defer wg.Done()
        defer func() { <-sem }()
        s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait()
    }(sched)
Two deferred `recover()` handlers added:
1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
the rest of the batch down.
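
A minimal sketch of the recovery pattern described above, not the actual scheduler code. `safeCall` is a hypothetical helper name. One detail worth showing: `recover()` only stops a panic when it is called directly inside a deferred function, so a bare `defer recover()` statement would not work and the handler has to be a deferred closure.

```go
package main

import (
	"fmt"
	"log"
)

// safeCall (hypothetical name) runs fn and turns a panic into a logged error
// so the caller's loop keeps going. recover() must be called directly inside
// the deferred closure; a bare `defer recover()` would not stop the panic.
func safeCall(name string, fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%s panicked: %v", name, r)
			log.Printf("scheduler: %v", err)
		}
	}()
	fn()
	return nil
}

func main() {
	// Simulate three ticks. The second one panics; the third still runs.
	completed := 0
	for i := 0; i < 3; i++ {
		n := i
		_ = safeCall("tick", func() {
			if n == 1 {
				panic("bad cron expression")
			}
			completed++
		})
	}
	fmt.Println("completed ticks:", completed)
}
```

The same wrapper shape would apply inside each fireSchedule goroutine, placed so that it runs before `wg.Done()` and the semaphore release.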
Plus a liveness watchdog:
- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.
Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.
Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
Point-in-time snapshot of the live SaaS infrastructure + which phases
are done vs in-flight vs not started. Links to molecule-controlplane's
own PLAN for deeper operator detail.
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing
actively pushes the team to EVOLVE the four levers the CEO named: templates,
plugins, channels, watchlist. The team-runs-24/7 goal needs both: defensive
reviews AND offensive evolution.
Adds 4 new schedules:
1. Research Lead — Daily ecosystem watch (0 8 * * *)
Survey github.com/trending + HN + AI-blogs for new agent-infra projects
from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day,
commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables
the watchlist pipeline that was paused earlier today.
2. Technical Researcher — Weekly plugin curation (0 9 * * 1, Mondays)
Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps
(builtin not exposed as plugin; role missing extras; rarely-used plugin
in defaults). Survey upstream (claude.ai cookbook, MCP servers,
anthropic/openai/langchain blogs). File 1-3 plugin proposals per week
as GH issues with concrete integration sketches.
3. Dev Lead — Daily template fitness audit (30 8 * * *)
Health-check the template itself: stale system prompts, schedules not
firing (catches the #85 scheduler-died failure mode), roles missing
plugins they should have, missing crons, channel gaps. File issues for
any drift. Designed to catch the silent-stall pattern from today's
incident.
4. DevOps Engineer — Weekly channel expansion survey (0 10 * * 1, Mondays)
PM is the only role with a channel today (Telegram). Survey what
channel infra the platform supports + what role-channel pairings would
actually help (Security→email-on-critical, DevOps→Slack-on-CI-break,
etc). File channel-proposal issues.
All four crons end with the structured audit_summary routing per #51/#75
(category, severity, issues, top_recommendation) so they integrate with
the platform-level category_routing PM uses to fan out work. The template's
existing category_routing block already maps research / plugins / template /
channels — these new crons consume exactly those slots.
Also drops three stale "# UNION with defaults (#71)" comments left from
the cleanup PR — those plugins lists are now self-documenting after #71.
Aligns with the north-star goal: the team should run 24/7 AND keep getting better
across templates / plugins / channels / watchlist. This PR closes the gap
where the "review" half of the loop was running but the "evolve" half had
no active driver.
Header implied the whole system was future work, but the section body
says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor,
/plugins filter, SDK, agentskills.io spec compliance) all landed. Only
the bullets under 'Deferred, not blocking' are actually open.
Rename + lead with 'The system is done.' so a skim reader doesn't
misfile the whole topic as unshipped.
Phase B.3 pair-fix to the control plane's fly-replay state change.
Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.
TenantGuard now checks both paths in order:
1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
(production fly-replay path)
Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).
New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay-
Src header per Fly's format. Covered by a table-driven test with 7 cases
including malformed + empty-header + wrong-state-key.
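
As an illustration of the parsing described above, here is a sketch written from the header format in this message only. It is not the shipped helper, and the edge-case behaviour (returning an empty string for a wrong state key) is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// orgIDFromReplaySrc (sketch) extracts the org id from a Fly-Replay-Src value
// shaped like "instance=Z;...;state=org-id=<uuid>". It returns "" when the
// header is empty, has no state segment, or the state key is not "org-id".
func orgIDFromReplaySrc(header string) string {
	for _, seg := range strings.Split(header, ";") {
		seg = strings.TrimSpace(seg)
		if !strings.HasPrefix(seg, "state=") {
			continue
		}
		state := strings.TrimPrefix(seg, "state=")
		if !strings.HasPrefix(state, "org-id=") {
			return "" // wrong state key: treat as no match
		}
		return strings.TrimPrefix(state, "org-id=")
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", orgIDFromReplaySrc("instance=abc;region=iad;state=org-id=1234"))
	fmt.Printf("%q\n", orgIDFromReplaySrc("instance=abc;state=other=1234"))
	fmt.Printf("%q\n", orgIDFromReplaySrc(""))
}
```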
Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-mortem on the failed publish-platform-image run on main (PR #82):
Fly's Docker registry requires username EXACTLY equal to "x". My
code-review "readability fix" changing it to "molecule-ai" caused
every push to return 401 Unauthorized. Verified locally:
echo $FLY_API_TOKEN | docker login registry.fly.io -u x --password-stdin
→ Login Succeeded
echo $FLY_API_TOKEN | docker login registry.fly.io -u molecule-ai --password-stdin
→ 401 Unauthorized
Lesson: don't second-guess docs that specify a literal value. Comment
now says "MUST be literal 'x'" with a 2026-04-15 verification note to
prevent future regressions.
Code-review process improvement: when reviewing a change against a
vendor API, prefer "preserve exact doc-specified values" over readability
suggestions. Logged as a cron-learning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses PR #82 code review: 🟡×3 + 🔵×5.
- Fly registry login username: 'x' → 'molecule-ai' + explanatory comment.
- Build & push split into two steps (GHCR / Fly registry) so a single-
registry outage can't fail the other. Second step uses 'if: always()'
to ensure Fly mirror runs even if GHCR push flakes.
- docs/runbooks/saas-secrets.md: full secret map + rotation procedures
for every SaaS credential, with danger-case callouts. Documents the
coupled FLY_API_TOKEN (lives in GHA secret AND fly secrets — must be
rotated in both).
- CLAUDE.md: new 'SaaS ops' section linking to the runbook.
Keeps ghcr.io/molecule-ai/platform private (per CEO direction — open-
source when full SaaS ships) while still letting the private control
plane's Fly provisioner boot tenant machines: Fly auto-authenticates
same-org machines against registry.fly.io, no per-tenant pull
credentials to wire.
Workflow now logs into both GHCR (using built-in GITHUB_TOKEN) and
Fly registry (using FLY_API_TOKEN secret) and pushes the same image to
four tags total:
- ghcr.io/molecule-ai/platform:latest
- ghcr.io/molecule-ai/platform:sha-<short>
- registry.fly.io/molecule-tenant:latest
- registry.fly.io/molecule-tenant:sha-<short>
Secret added via `gh secret set FLY_API_TOKEN` on the public repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase B.2 companion to the private molecule-controlplane provisioner PR.
On every push to main that touches platform/**, builds platform/Dockerfile
and pushes to GHCR with two tags:
- :latest (floating, always main's tip)
- :sha-<short-commit> (immutable, pin-friendly)
Cache via GitHub Actions cache (cache-from: type=gha). Workflow_dispatch
trigger so we can re-publish after a docs-only merge if needed.
The private molecule-controlplane sets TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag>
and the provisioner creates each tenant Fly Machine from this image. Staying
on the same base image across tenants keeps upgrades atomic.
CLAUDE.md updated to document the new workflow in the CI pipeline section.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phase 32 foundation. The SaaS control plane (private molecule-controlplane
repo) provisions one platform instance per customer org on Fly Machines
and sets MOLECULE_ORG_ID=<uuid> on the machine. Its subdomain router
forwards requests with X-Molecule-Org-Id=<uuid>.
TenantGuard:
- When MOLECULE_ORG_ID is set → every non-allowlisted request must carry a
matching X-Molecule-Org-Id header. Mismatched/missing header → 404 (not
403 — don't leak tenant existence by letting probers distinguish "wrong
org" from "route doesn't exist").
- When unset → passthrough. Self-hosted / dev / CI behavior unchanged.
- Allowlist is exact-match, not prefix — /health and /metrics only.
No orgs table, no signup, no billing, no Fly provisioning in this repo —
all that lives in the private control plane. The public repo's SaaS
surface is exactly this one middleware.
6 tests covering: unset-is-passthrough, matching header, mismatched
header 404 (with empty body), missing header 404, allowlist bypass, and
allowlist-is-exact-match.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code-review warnings on PR #76:
- Migration 022 now backfills pre-existing workspace_schedules rows to
source='template' before flipping NOT NULL + DEFAULT 'runtime'. Legacy
rows (all seeded via org/import historically) stay refreshable on
re-import. Down migration drops the CHECK constraint too.
- Extracted the import UPSERT into const orgImportScheduleSQL so the shape
test asserts against the const directly instead of file-scraping org.go.
Removed the os.ReadFile helper.
- scheduleResponse.Source gets json:",omitempty" so old clients that
predate the migration don't see an empty string they can't explain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses code-review warnings on PR #75:
- renderCategoryRoutingYAML now builds yaml.Node + yaml.Marshal, escaping
YAML-reserved chars in role names correctly (was JSON-as-YAML, fragile on
unicode line separators).
- New appendYAMLBlock helper guarantees a newline boundary when concatenating
YAML fragments into config.yaml (category_routing + initial_prompt both
used to risk merging into the previous line).
- Fixed struct comment (replace-per-key, not UNION).
- Added TestCategoryRouting_EscapesYAMLSpecials and TestAppendYAMLBlock_NewlineGuard.
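
A sketch of the newline guard described for appendYAMLBlock, under assumed semantics: the signature and the choice to also terminate the appended block with a newline are guesses, not taken from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// appendYAMLBlock (sketch) concatenates a YAML fragment onto base while
// guaranteeing the fragment starts on its own line and ends with a newline.
func appendYAMLBlock(base, block string) string {
	if block == "" {
		return base
	}
	if base != "" && !strings.HasSuffix(base, "\n") {
		base += "\n" // guard: never merge into the previous line
	}
	if !strings.HasSuffix(block, "\n") {
		block += "\n"
	}
	return base + block
}

func main() {
	cfg := "name: pm" // note: no trailing newline
	cfg = appendYAMLBlock(cfg, "category_routing:\n  docs: [Documentation Specialist]")
	cfg = appendYAMLBlock(cfg, "initial_prompt: hello")
	fmt.Print(cfg)
}
```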
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves #24 per CEO direction.
DB is source of truth for workspace_schedules. POST /org/import becomes
idempotent — only touches rows it owns (source='template'); runtime-added
schedules (Canvas / API) are preserved across re-imports.
- Migration 022: adds source TEXT NOT NULL DEFAULT 'runtime' CHECK in
('template','runtime'); unique index on (workspace_id, name) so the
org/import upsert can use ON CONFLICT.
- org.go: schedule INSERT becomes
INSERT ... 'template' ON CONFLICT (workspace_id, name) DO UPDATE
SET ... WHERE workspace_schedules.source='template'.
Never DELETEs.
- schedules.go: runtime POST writes 'runtime' explicitly; List handler
surfaces the source field on the response so Canvas can render badges.
- 3 new unit tests assert source='runtime' default for runtime CRUD,
the SQL shape contract for org/import (additive + idempotent +
runtime-preserving + never-DELETE), and List response surface.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a category_routing block to org.yaml schema (defaults + per-workspace,
UNION semantics with per-key replace). The merged routing table is rendered
into each workspace's config.yaml at import time.
PM's system prompt loses the hardcoded security/ui/infra → role mapping
from PR #50; instead it reads category_routing from /configs/config.yaml
and delegates to whatever roles the org template lists for the incoming
audit-summary's category. Future org templates ship their own routing
without prompt churn.
Tests: 4 new TestCategoryRouting_* cases covering YAML parse, UNION+drop
semantics, deterministic config.yaml render, and empty-map handling.
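
One way to read "UNION semantics with per-key replace" is sketched below. The function names are hypothetical, and treating an empty per-workspace list as "drop this category" is an assumption about what the drop semantics mean. Sorting the keys is one simple way to get a deterministic render order.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeRouting (sketch) unions the category keys of defaults and override.
// When both define a key, the override's role list replaces the default's.
// ASSUMPTION: an empty override list drops the category entirely.
func mergeRouting(defaults, override map[string][]string) map[string][]string {
	out := make(map[string][]string, len(defaults)+len(override))
	for k, v := range defaults {
		out[k] = v
	}
	for k, v := range override {
		if len(v) == 0 {
			delete(out, k)
			continue
		}
		out[k] = v
	}
	return out
}

// sortedCategories returns the keys in a stable order for rendering.
func sortedCategories(m map[string][]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	defaults := map[string][]string{
		"security": {"Security Auditor"},
		"ui":       {"UIUX Designer"},
	}
	override := map[string][]string{
		"docs": {"Documentation Specialist"},
		"ui":   {}, // dropped under the assumption above
	}
	merged := mergeRouting(defaults, override)
	for _, k := range sortedCategories(merged) {
		fmt.Println(k, "->", merged[k])
	}
}
```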
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#71 just merged — per-workspace `plugins:` now UNIONs with `defaults.plugins`
instead of replacing it. Simplifies every override in molecule-dev/ from
"defaults+1 = list 10 items" to "defaults+1 = list 1 item":
PM: 11 items → 2 (workflow-triage + workflow-retro)
Research Lead: 10 items → 1 (browser-automation)
Market Analyst: 10 items → 1
Technical Researcher: 10 items → 1
Competitive Intel: 10 items → 1
Security Auditor: 12 items → 3 (code-review + cross-vendor-review + llm-judge)
UIUX Designer: 10 items → 1 (browser-automation)
Every workspace still receives the full 9-plugin default set (ecc,
molecule-dev, superpowers, careful-bash, prompt-watchdog, audit-trail,
session-context, cron-learnings, update-docs) — verified by reading
mergePlugins() in platform/internal/handlers/org.go:645.
Also drops the stale "REPLACE not UNION" warning comments and points
defaults' header comment at the new union behaviour.
Net diff: ~30 lines removed, ~10 added. Template is now meaningfully
easier to extend — each new defaults.plugin propagates everywhere
without sweeping per-role lists.
Closes follow-up scope from PR #70.
Merged after 7-gate verification.
Gates: 1 (CI 6/6 + 1 skip) pass, 2 (build/vet) pass, 3 (5 new TestPlugins_* + backward-compat) pass, 4 (security) pass, 5 (design) pass with 1 yellow, 6 (line review) pass, 7 N/A.
Backward-compat verified: molecule-dev/org.yaml re-lists [ecc, molecule-dev, superpowers, browser-automation] in each role; under new UNION+dedupe the merged set is identical to the prior REPLACE result. PR #70's 1 yellow (REPLACE verbosity / re-listing chore) is now closed by this change — orgs can drop the re-listing once confident.
Cross-vendor-review: second-model tooling unavailable in this worktree; Claude-only review applied per standing rule fallback.
Yellow (non-blocking, follow-up): opt-out semantics (`!plugin` / `-plugin`) are documented only in the code comment. Safety plugins like `molecule-careful-bash` can be disabled by an org.yaml using `!molecule-careful-bash` — this is operator-controlled config per I-2 and therefore acceptable, but docs/plugins/ should get an "overriding defaults" page in a follow-up.
noteworthy: plugin-semantics-change
- docs/edit-history/2026-04-14.md — append tick-5 section covering PR #69
(PLAN.md backlog stale-ref cleanup) and PR #70 (wire 12 modular plugins
from PR #63 into the default molecule-dev org template; defaults 3 → 9
plus PM + Security Auditor role extras).
- PLAN.md — add tick-5 entries under "Recently launched" noting PR #70
activated the tick-4 plugins and PR #69 cleaned up stale backlog refs.
Both merges are docs/template-only. No code surface moved, no new env
vars, no test-count drift. CLAUDE.md, .env.example, README.md, and
README.zh-CN.md unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-workspace `plugins:` now UNIONS with `defaults.plugins` instead of
replacing. A leading `!` or `-` on a per-workspace entry opts a default
out. Backward-compatible: re-listing defaults still dedupes to the same
list.
Refactored the inline REPLACE logic into a pure helper `mergePlugins`
in org.go so it's unit-testable. Five TestPlugins_* cases added.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #63 just merged 12 new modular plugins (split from a single guardrails
bundle) and the audit pipeline (Security/UIUX/QA crons) is now producing
PRs continuously. Time to wire the new plugins into the molecule-dev
template so every workspace + every cron tick benefits.
## Defaults — universal additions (was 3, now 9)
- molecule-careful-bash — refuse rm -rf, push --force main, DROP TABLE
- molecule-prompt-watchdog — warn on destructive user prompts
- molecule-audit-trail — append every Edit/Write to .claude/audit.jsonl
- molecule-session-context — auto-load cron learnings + PR/issue counts on SessionStart
- molecule-skill-cron-learnings — per-tick learning JSONL format (pairs with session-context)
- molecule-skill-update-docs — keep architecture/README/edit-history aligned
Kept: ecc, molecule-dev, superpowers.
## Per-role overrides
- PM: defaults + molecule-workflow-triage + molecule-workflow-retro
(the /triage and /retro slash commands match PM's coordination role)
- Security Auditor: defaults + molecule-skill-code-review +
molecule-skill-cross-vendor-review + molecule-skill-llm-judge
(security PRs benefit from multi-criteria review, adversarial cross-vendor
second opinion, and an LLM-judge gate that catches "agent shipped the
wrong thing")
- Research Lead + 3 researchers + UIUX Designer: defaults + browser-automation
(existing override; just synced to the new default set)
Other 5 dev roles (Dev Lead, BE, FE, DevOps, QA) inherit defaults — the
new universal set is rich enough for them; code-review skill is a runtime
opt-in if Dev Lead decides per-PR.
## REPLACE-semantics verbosity
`platform/internal/handlers/org.go:~345` treats per-workspace plugins as
REPLACE not UNION. Every override has to re-list the 9 defaults to add 1
extra. Tracked as #68 with a union-proposal; once that lands the per-role
lists shrink to just the additions.
## Test plan
- [x] YAML valid (`python -c "import yaml; yaml.safe_load(...)"`)
- [x] defaults.plugins count = 9
- [ ] After merge + re-import: every workspace's /configs/plugins/ contains
the full set; PM has /triage and /retro commands; Security Auditor
can invoke cross-vendor-review on its findings.