Closes #101, layer 1: buildGitHubA2APayload now handles workflow_run
events, routing failed CI runs to a workspace via the existing
X-Molecule-Workspace-ID / webhook path. Only completed runs with a
failure/cancelled/timed_out conclusion fan out — success/skipped/neutral
are dropped via errIgnoredGitHubAction.
Surface message is human-readable + includes the run URL so DevOps can
jump straight to the failing job. Metadata carries the full run context
(workflow_name, run_id, run_number, conclusion, head_branch, head_sha,
run_url, trigger_event) for programmatic handling.
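Roughly, the gate looks like the sketch below. Illustrative only: the package and struct are invented for the example; errIgnoredGitHubAction and the action/conclusion values are the ones named in this change.

```go
package webhook // illustrative package name

import "errors"

var errIgnoredGitHubAction = errors.New("ignored github action event")

// workflowRunEvent carries just the two fields this sketch needs from
// GitHub's workflow_run webhook payload.
type workflowRunEvent struct {
    Action     string // "completed", "requested", "in_progress"
    Conclusion string // "failure", "cancelled", "timed_out", "success", ...
}

// shouldFanOut returns nil only for completed runs that ended badly;
// everything else is dropped via errIgnoredGitHubAction.
func shouldFanOut(ev workflowRunEvent) error {
    if ev.Action != "completed" {
        return errIgnoredGitHubAction
    }
    switch ev.Conclusion {
    case "failure", "cancelled", "timed_out":
        return nil
    default: // success, skipped, neutral, ...
        return errIgnoredGitHubAction
    }
}
```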
4 new tests cover the failure path, success skip, non-completed action
skip, and short-SHA edge case.
Layer 2 (org.yaml wiring for DevOps workspace + GITHUB_WEBHOOK_SECRET
docs) stays as a follow-up PR.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #93 and #105.
#93 — add research/plugins/template/channels entries to org.yaml
category_routing defaults. Without them, evolution crons firing with
these categories found no target and their audit summaries were
silently dropped at PM. Routes each back to the role that generated it so the
author acts on their own findings.
#105 — emit X-RateLimit-Limit / -Remaining / -Reset on every response
(allowed and throttled) and Retry-After on 429s per RFC 6585. 2 tests
cover both paths. Clients and monitoring tools can now back off
proactively instead of polling into 429 walls.
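For illustration, a minimal net/http-style sketch of the header emission; the real middleware, its limiter, and the exact reset format aren't spelled out here, so those details are assumptions.

```go
package ratelimit // illustrative

import (
    "net/http"
    "strconv"
    "time"
)

// writeRateLimitHeaders sets the X-RateLimit-* trio on every response and,
// when the request is being rejected, adds Retry-After and writes 429.
// Reset is emitted as a Unix timestamp here; the platform's exact format
// may differ.
func writeRateLimitHeaders(w http.ResponseWriter, limit, remaining int, reset time.Time, throttled bool) {
    h := w.Header()
    h.Set("X-RateLimit-Limit", strconv.Itoa(limit))
    h.Set("X-RateLimit-Remaining", strconv.Itoa(remaining))
    h.Set("X-RateLimit-Reset", strconv.FormatInt(reset.Unix(), 10))
    if throttled {
        h.Set("Retry-After", strconv.Itoa(int(time.Until(reset).Seconds())))
        w.WriteHeader(http.StatusTooManyRequests)
    }
}
```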
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Soft-delete leaves workspace_auth_tokens rows alive, so HasAnyLiveTokenGlobal
stays non-zero and admin-auth 401s an unauth GET /workspaces. The assertion
was verifying deletion, not auth; the bundle round-trip below still covers
the deletion path end-to-end.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #103 (HIGH). Three attack surfaces on the import endpoint —
body.Dir, workspace.Template, workspace.FilesDir — were concatenated
via filepath.Join without validation, letting an unauthenticated
caller probe arbitrary filesystem paths with "../../../etc".
Two layers of defense:
1. resolveInsideRoot() rejects absolute paths and any relative path
whose lexically cleaned join escapes the provided root (Abs +
HasPrefix + separator guard); see the sketch after this list. 6 tests
cover the happy path, traversal attempts, absolute path, empty input,
prefix-sibling escape, and deep subpath resolution.
2. Route now runs behind middleware.AdminAuth so an unauthenticated
attacker can't reach the handler at all once a token exists.
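Sketch of the layer-1 check (the package, error value, and exact signature are assumptions; the Abs + HasPrefix + separator-guard shape follows the description above, and treating empty input as a rejection is an assumption):

```go
package workspaceimport // illustrative

import (
    "errors"
    "path/filepath"
    "strings"
)

var errPathEscapesRoot = errors.New("path escapes import root") // illustrative error

// resolveInsideRoot joins userPath onto root and rejects anything that is
// absolute, empty, or that resolves outside root after lexical cleaning.
func resolveInsideRoot(root, userPath string) (string, error) {
    if userPath == "" || filepath.IsAbs(userPath) {
        return "", errPathEscapesRoot
    }
    absRoot, err := filepath.Abs(root)
    if err != nil {
        return "", err
    }
    joined, err := filepath.Abs(filepath.Join(absRoot, userPath))
    if err != nil {
        return "", err
    }
    // Separator guard: "/root-sibling" must not slip past a bare HasPrefix("/root").
    if joined != absRoot && !strings.HasPrefix(joined, absRoot+string(filepath.Separator)) {
        return "", errPathEscapesRoot
    }
    return joined, nil
}
```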
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
C1 fix (#99) moved GET /workspaces behind AdminAuth. Three late-script
calls that run after tokens exist now include Authorization headers;
the post-delete-all call stays anonymous since revoked tokens trigger
the no-live-token fail-open path.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Found via deep workspace inspection during a maintenance cycle: Security
Auditor's hourly cron correctly tries to delegate_task its audit_summary
to PM, the platform proxy rejects with "access denied: workspaces cannot
communicate per hierarchy", the agent falls back to delegating to its
direct parent (Dev Lead), and PM's category_routing dispatcher (#75) is
never reached.
This breaks the audit-routing contract end-to-end. Every audit cycle was
landing on Dev Lead instead of being fanned out via PM's category_routing
to the right dev role (security → BE+DevOps, ui/ux → FE, etc).
## Root cause
`registry.CanCommunicate()` only allowed:
- self → self
- siblings (same parent)
- root-level siblings
- direct parent → child
- direct child → parent
A grandchild → grandparent (Security Auditor → PM, where parent is Dev
Lead and grandparent is PM) was DENIED. The original design wanted strict
hierarchy to prevent rogue horizontal A2A — but it also broke the
fundamental "child can talk to its leadership chain" pattern that any
audit/escalation flow needs.
## Fix
Generalise to ancestor ↔ descendant. Any workspace can talk to any
ancestor (any depth) and any descendant (any depth). Direct parent/child
remains a fast path that avoids the walk. Sibling rules unchanged.
Cousins still cannot directly communicate (would need to go through their
shared ancestor). Cross-subtree A2A is still rejected.
Implementation: `isAncestorOf(ancestorID, childID)` walks the parent
chain in Go with a maxAncestorWalk=32 safety cap so a malformed cycle in
the workspaces table cannot loop forever. One DB lookup per step. For a
typical 3-deep tree, this adds 1-2 extra lookups vs the old direct-parent
fast path. Could be optimized to a single recursive CTE if profiling
shows it matters; not now.
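Shape of the walk, roughly (the real helper is a registry method backed by SQL; here the per-step lookup is abstracted into a function value so the sketch stands alone):

```go
package registry

const maxAncestorWalk = 32 // safety cap from this change

// parentLookup stands in for the one-row workspaces query per step.
type parentLookup func(workspaceID string) (parentID string, err error)

// isAncestorOf reports whether ancestorID appears anywhere on childID's
// parent chain, giving up after maxAncestorWalk hops so a malformed cycle
// cannot loop forever.
func isAncestorOf(lookup parentLookup, ancestorID, childID string) (bool, error) {
    current := childID
    for i := 0; i < maxAncestorWalk; i++ {
        parent, err := lookup(current) // one DB lookup per step
        if err != nil {
            return false, err
        }
        if parent == "" { // reached a root workspace
            return false, nil
        }
        if parent == ancestorID {
            return true, nil
        }
        current = parent
    }
    return false, nil // cap hit: treat as unrelated rather than spinning
}
```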
## Tests
- TestCanCommunicate_Denied_Grandchild → REPLACED with two new tests:
  - TestCanCommunicate_Allowed_GrandparentToGrandchild
  - TestCanCommunicate_Allowed_GrandchildToGrandparent (the actual bug)
- TestCanCommunicate_Allowed_DeepAncestor — 4-level chain
- TestCanCommunicate_Denied_UnrelatedAncestors — ensures cross-subtree
walks still terminate denied
- TestCanCommunicate_Denied_DifferentParents — extended with the walk
lookup mocks so sqlmock doesn't log warnings
- TestCanCommunicate_Denied_CousinToRoot — same
All 13 tests pass clean. The previous direct parent/child / siblings /
self tests are unchanged (fast paths preserved).
## Why platform-level
Per the "platform-wide fixes are mine to ship" rule. Every org template
hits the same broken audit-routing chain — fixing it at the platform
benefits all users, not just molecule-dev. This unblocks #50 (PM
dispatcher prompt) and #75 (category_routing).
Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology
without any authentication. The endpoint was intentionally left open for
the canvas browser frontend; this PR closes that gap.
Router change:
- Move GET /workspaces from the bare root router into the wsAdmin AdminAuth
group alongside POST /workspaces and DELETE /workspaces/:id.
- AdminAuth uses the same fail-open bootstrap contract as all other auth
gates: fresh installs (no live tokens) pass through; once any workspace
has registered with a token, a valid bearer is required.
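The fail-open contract, sketched in plain net/http terms (the real middleware.AdminAuth has its own signature and token store; HasAnyLiveTokenGlobal is the only name below taken from the codebase, the rest is illustrative):

```go
package middleware

import (
    "net/http"
    "strings"
)

type tokenStore interface {
    HasAnyLiveTokenGlobal() bool        // any workspace anywhere holds a live token
    IsValidAdminBearer(tok string) bool // illustrative
}

// AdminAuth lets everything through on a fresh install (no live tokens
// exist anywhere); once any workspace has registered a token, a valid
// bearer is required and everything else gets 401.
func AdminAuth(store tokenStore, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !store.HasAnyLiveTokenGlobal() { // bootstrap fail-open
            next.ServeHTTP(w, r)
            return
        }
        bearer := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
        if bearer == "" || !store.IsValidAdminBearer(bearer) {
            http.Error(w, "unauthorized", http.StatusUnauthorized)
            return
        }
        next.ServeHTTP(w, r)
    })
}
```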
Status of findings C2–C11 (documented here for audit trail):
- C2 POST /workspaces/:id/activity → already in wsAuth group (Cycle 5)
- C3 POST /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5)
- C4 POST /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5)
- C5 GET /workspaces/:id/delegations → already in wsAuth group (Cycle 5)
- C7 GET /workspaces/:id/memories → already in wsAuth group (Cycle 5)
- C8 POST /workspaces/:id/memories → already in wsAuth group (Cycle 5)
- C9 POST /workspaces/:id/delegate → already in wsAuth group (Cycle 5)
- C10 GET /admin/secrets → already in adminAuth group (Cycle 7)
- C11 POST+DELETE /admin/secrets → already in adminAuth group (Cycle 7)
Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new):
WorkspaceAuth:
- fail-open when workspace has no tokens (bootstrap path)
- C4: no bearer on /delegations/:id/update → 401
- C8: no bearer on /memories POST → 401
- invalid bearer → 401
- cross-workspace token replay → 401
- valid bearer for correct workspace → 200
AdminAuth:
- fail-open when no tokens exist globally (fresh install)
- C10: no bearer on GET /admin/secrets → 401
- C11: no bearer on POST /admin/secrets → 401
- C11: no bearer on DELETE /admin/secrets/:key → 401
- valid bearer → 200
- invalid bearer → 401
Note: did NOT touch DELETE /admin/secrets in production — no destructive
calls to live secrets endpoints were made during this work.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16
(link-local/IMDS). An attacker could still register a workspace with
a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and
redirect A2A proxy traffic to internal services.
Block all five reserved ranges in validateAgentURL:
- 169.254.0.0/16 link-local (IMDS: AWS/GCP/Azure)
- 127.0.0.0/8 loopback (self-SSRF)
- 10.0.0.0/8 RFC-1918
- 172.16.0.0/12 RFC-1918 (includes Docker bridge networks)
- 192.168.0.0/16 RFC-1918
Agents must use DNS hostnames, not IP literals. The provisioner
still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard
preserves those); this blocklist only applies to the /registry/register
request body.
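Sketch of the range check only (the real validateAgentURL does more than this, and per the note above the provisioner path bypasses it entirely via direct SQL; everything below other than the five CIDRs is illustrative):

```go
package registry

import (
    "fmt"
    "net"
    "net/url"
)

var blockedCIDRs = mustParseCIDRs(
    "169.254.0.0/16", // link-local / cloud metadata (IMDS)
    "127.0.0.0/8",    // loopback (self-SSRF)
    "10.0.0.0/8",     // RFC 1918
    "172.16.0.0/12",  // RFC 1918, includes Docker bridge networks
    "192.168.0.0/16", // RFC 1918
)

func mustParseCIDRs(cidrs ...string) []*net.IPNet {
    out := make([]*net.IPNet, 0, len(cidrs))
    for _, c := range cidrs {
        _, n, err := net.ParseCIDR(c)
        if err != nil {
            panic(err)
        }
        out = append(out, n)
    }
    return out
}

// rejectBlockedIPLiteral errors when the URL's host is an IP literal inside
// any reserved range; DNS hostnames pass through untouched.
func rejectBlockedIPLiteral(raw string) error {
    u, err := url.Parse(raw)
    if err != nil {
        return err
    }
    ip := net.ParseIP(u.Hostname())
    if ip == nil {
        return nil // hostname, not an IP literal
    }
    for _, n := range blockedCIDRs {
        if n.Contains(ip) {
            return fmt.Errorf("agent URL %q is in blocked range %s", raw, n)
        }
    }
    return nil
}
```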
Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection;
added 9 new cases covering range boundaries and the Docker bridge range.
All 22 validateAgentURL subtests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CEO 2026-04-15: the team's evolution loops should be hourly, not daily/weekly.
A 24h or 7d cadence is the wrong rhythm for a team that's expected to run 24/7
and keep improving. At hourly, every drift, every new project, every plugin
gap, every channel opportunity gets surfaced within an hour of becoming visible.
| Schedule | Was | Now |
|-----------------------------------|----------------|--------------|
| Hourly ecosystem watch | 0 8 * * * | 8 * * * * |
| Hourly plugin curation | 0 9 * * 1 | 22 * * * * |
| Hourly template fitness audit | 30 8 * * * | 15 * * * * |
| Hourly channel expansion survey | 0 10 * * 1 | 47 * * * * |
Spread across the hour (:08, :11, :15, :17, :22, :47) so the four evolution
crons + UIUX :11 + Security :17 don't collide and don't all bury PM with
audit_summary deliveries at the same instant.
Renamed from "Daily..." / "Weekly..." to "Hourly..." to match the new cadence
and so the prompts (which still say "Daily survey" etc.) read consistently.
A follow-up will fix the body wording.
Live-synced into running DB via PATCH (3 of 4) and direct UPDATE on the 4th
(Dev Lead workspace requires a token the script didn't have). next_run_at
recomputed for all 4. First fire: 04:47 UTC (channel expansion).
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.
Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
the per-fire work completes
(The post-tick Heartbeat in Start() is still there as the "all idle" case.)
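Roughly where the calls land (stubbed sketch: supervised.Heartbeat, fireSchedule, and the ticker.C wait are the real names; the scaffolding and the import path are assumptions):

```go
package scheduler

import (
    "context"
    "sync"

    "platform/internal/supervised" // module path assumed
)

type scheduleRow struct{ ID string }

// Stubs standing in for the real scheduler internals.
type Scheduler struct {
    wg  sync.WaitGroup
    sem chan struct{}
}

func (s *Scheduler) dueSchedules(ctx context.Context) []scheduleRow    { return nil }
func (s *Scheduler) fireSchedule(ctx context.Context, row scheduleRow) {}

func (s *Scheduler) tick(ctx context.Context) {
    supervised.Heartbeat("scheduler") // (1) proves we're past the ticker.C wait

    for _, sched := range s.dueSchedules(ctx) {
        s.wg.Add(1)
        s.sem <- struct{}{}
        go func(row scheduleRow) {
            defer s.wg.Done()
            defer func() { <-s.sem }()
            supervised.Heartbeat("scheduler") // (2) any active fire keeps the heartbeat fresh
            s.fireSchedule(ctx, row)
            supervised.Heartbeat("scheduler") // (3) the moment this fire completes
        }(sched)
    }
    s.wg.Wait()
}
```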
Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
Per CEO 2026-04-15: the SaaS controlplane (Molecule-AI/molecule-controlplane,
PRIVATE Go/Fly.io provisioner) needs documentation coverage too.
Updates the agent's role description, initial_prompt, and daily docs-sync
cron to handle a third repo with a strict public/private split.
## Privacy rule (the critical addition)
molecule-controlplane is private. Two-bucket model:
Internal-only changes (handlers, schemas, infra config, billing logic,
fly.toml, provisioner internals) → docs go INSIDE the controlplane repo
itself (README.md, PLAN.md, docs/internal/*.md). NEVER mentioned in the
public docs site.
Customer-facing changes (new tier, new region, new SLA, pricing change,
signup flow change) → sanitized PUBLIC description on doc.moleculesai.app.
Describes the PRODUCT, never the implementation.
When unsure: default to internal-only and ask PM before publishing.
The privacy rule is repeated three times in the prompt (top of initial_prompt,
1b inside the daily cron, and the role description) so the agent can't miss it.
## Changes
- role: extended to mention all three repos + privacy split
- initial_prompt: clones controlplane in step 1, reads README+PLAN in step 5,
scans recent commits in step 8, lists the four owned surfaces with public/private
labels in step 10
- Daily cron: adds step 1b "PAIR RECENT CONTROLPLANE PRS" with the (i)/(ii)
internal/customer-facing branching logic
- SETUP block: adds controlplane git pull
Adds a 13th workspace to the molecule-dev template owning end-to-end
documentation across all Molecule AI surfaces.
## Why now
- We just created Molecule-AI/docs (customer-facing site at
doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs
someone to own it.
- Internal docs (README.md, docs/architecture.md, docs/edit-history/) were
drifting — every platform PR has been opening a docs sync PR manually.
- No agent in the team owned terminology consistency or stub backfill.
## Where it sits in the org
Third PM direct report, parallel to Research Lead and Dev Lead — docs is
its own swim lane that spans engineering (docs follow code) and
research/product (concepts and terminology).
    PM
    ├── Research Lead
    ├── Dev Lead
    └── Documentation Specialist <-- new
## Schedules (2)
1. **Daily docs sync — backfill stubs and pair recent platform PRs**
`0 9 * * *` — every morning:
- Pair every merged platform PR (last 24h) with a docs PR if needed
- Backfill one stub page on the docs site
- Crawl the live site for broken links / dead anchors
- delegate_task to PM with audit_summary (category=docs)
2. **Weekly terminology + freshness audit**
`0 11 * * 1` — every Monday:
- Stale page detection (>30 days untouched on fast-moving surfaces)
- Terminology consistency check (one canonical name per concept)
- Link-rot scan
- Same audit_summary contract
## Plugins
Inherits the 9 universal defaults. Adds `browser-automation` for crawling
the live docs site. `molecule-skill-update-docs` is already in defaults
so the cross-repo sync skill is available.
## Routing
Adds `docs: [Documentation Specialist]` to `category_routing` so any
agent that emits an audit_summary with category=docs is auto-routed
here by the platform.
## Bind mounts
Note: this workspace clones BOTH /workspace/repo (the platform monorepo)
and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the
agent can edit either side.
Wraps the canvas root so every tenant-subdomain request checks for a
valid session and bounces to app.moleculesai.app/cp/auth/login with a
return_to pointing back at the current URL. Local dev + vercel preview
URLs + apex pass through unchanged.
Files:
- canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me
(credentials:include for cross-origin cookie); returns Session on 200,
null on 401 (anonymous, no throw), throws on 5xx so transient
outages don't leak the UI.
- canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL
with window.location.href as return_to; CP's isSafeReturnTo check
rejects cross-domain bounces.
- canvas/src/components/AuthGate.tsx: client component wrapping
children. State machine: loading → authenticated | anonymous. In
non-SaaS mode (no tenant slug) skips the gate entirely.
- canvas/src/app/layout.tsx: wraps the root body in <AuthGate>.
Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include /
redirectToLogin href + signup variant). Full suite 453 green (was 447).
Pairs with molecule-controlplane PR #16 (return_to cookie handshake
on the cp side).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.
This PR:
1. Introduces `platform/internal/supervised/` with two primitives:
a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
On panic logs the stack + exponential-backoff restart (1s → 2s →
4s → … → 30s cap). On clean return (fn decided to stop) returns.
On ctx.Done cancels cleanly. (Sketch after this list.)
b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
staleThreshold) — shared in-memory liveness registry. Every
subsystem calls Heartbeat(name) at the end of each tick so
operators can distinguish "goroutine alive and healthy" from
"alive but stuck inside a single tick".
2. Wraps every `go X.Start(ctx)` in main.go:
- broadcaster.Subscribe (Redis pub/sub relay → WebSocket)
- registry.StartLivenessMonitor
- registry.StartHealthSweep
- scheduler.Start (the one that died yesterday)
- channelMgr.Start (Telegram / Slack)
3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
loop as the first end-to-end demonstration. Follow-up PRs will add
heartbeats to the other four subsystems.
4. Adds `GET /admin/liveness` endpoint returning per-subsystem
last_tick_at + seconds_ago. Operators can poll this and alert on
any subsystem whose seconds_ago exceeds 2x its cron/tick interval.
5. Unit tests for RunWithRecover (clean return no restart; panic
restarts with backoff; ctx cancel stops restart loop) and for the
liveness registry.
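A minimal sketch of the 1a contract (logging, exact signature, and backoff bookkeeping may differ in the real helper):

```go
package supervised

import (
    "context"
    "log"
    "runtime/debug"
    "time"
)

// RunWithRecover keeps fn alive: a panic is logged with its stack and fn is
// restarted after an exponentially growing delay (1s, 2s, 4s, ... capped at
// 30s); a clean return or a cancelled context ends the loop.
func RunWithRecover(ctx context.Context, name string, fn func(context.Context)) {
    backoff := time.Second
    for {
        clean := func() (ok bool) {
            defer func() {
                if r := recover(); r != nil {
                    log.Printf("supervised: %s panicked: %v\n%s", name, r, debug.Stack())
                }
            }()
            fn(ctx)
            return true // fn decided to stop
        }()
        if clean || ctx.Err() != nil {
            return
        }
        select {
        case <-ctx.Done():
            return
        case <-time.After(backoff):
        }
        if backoff *= 2; backoff > 30*time.Second {
            backoff = 30 * time.Second
        }
    }
}
```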
Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on the happy path; only what
happens on a panic changes.
Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).
Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.
Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).
Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go
cross-origin to https://app.moleculesai.app. This commit wires the
client side:
- canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from
window.location.hostname, case-insensitive, matching the control
plane's reservedSubdomains list (app/www/api/admin/…). Server-side
+ localhost + vercel preview URLs + apex all return "" so local dev
keeps working.
- canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets
credentials:"include" on every fetch. The control plane's CORS
middleware allows the origin + credentials; the session cookie has
Domain=.moleculesai.app so the browser ships it.
- canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its
own fetch helper — shared slug+credentials logic applied).
Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS /
preview URL / apex). Full canvas suite 447/447 green.
Not in this PR:
- WS URL derivation for terminal/socket.ts (separate follow-up; WS
needs its own slug-aware URL and the canvas terminal isn't used in
SaaS launch day-one).
- Next.js rewrites (decided against; cross-origin with credentials
is cleaner than path-level rewrites for session cookies).
Deploys to Vercel once merged — no manual config needed (env already set).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.
Specifically:
    func (s *Scheduler) Start(ctx context.Context) {
        for {
            select {
            case <-ticker.C:
                s.tick(ctx) // <- if this panics, the for-loop exits forever
            }
        }
    }
And inside tick:
    go func(s2 scheduleRow) {
        defer wg.Done()
        defer func() { <-sem }()
        s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait()
    }(sched)
Two `defer recover()` additions:
1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
the rest of the batch down.
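Fix 1 looks roughly like this (stubbed so it stands alone; names beyond Start/tick/fireSchedule are invented, and fix 2 adds the same deferred recover at the top of each fireSchedule goroutine):

```go
package scheduler

import (
    "context"
    "log"
    "runtime/debug"
    "time"
)

type Scheduler struct{ pollInterval time.Duration }

func (s *Scheduler) tick(ctx context.Context) {} // stub for the real per-tick work

func (s *Scheduler) Start(ctx context.Context) {
    ticker := time.NewTicker(s.pollInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            func() {
                defer func() {
                    if r := recover(); r != nil {
                        log.Printf("scheduler: tick panicked: %v\n%s", r, debug.Stack())
                    }
                }()
                s.tick(ctx) // a panic here is logged; the next tick fires normally
            }()
        }
    }
}
```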
Plus a liveness watchdog:
- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.
Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.
Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
Point-in-time snapshot of the live SaaS infrastructure + which phases
are done vs in-flight vs not started. Links to molecule-controlplane's
own PLAN for deeper operator detail.
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing
actively pushes the team to EVOLVE the four levers CEO named: templates,
plugins, channels, watchlist. The team-runs-24/7 goal needs both — defensive
reviews AND offensive evolution.
Adds 4 new schedules:
1. Research Lead — Daily ecosystem watch (0 8 * * *)
Survey github.com/trending + HN + AI-blogs for new agent-infra projects
from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day,
commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables
the watchlist pipeline that was paused earlier today.
2. Technical Researcher — Weekly plugin curation (0 9 * * 1, Mondays)
Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps
(builtin not exposed as plugin; role missing extras; rarely-used plugin
in defaults). Survey upstream (claude.ai cookbook, MCP servers,
anthropic/openai/langchain blogs). File 1-3 plugin proposals per week
as GH issues with concrete integration sketches.
3. Dev Lead — Daily template fitness audit (30 8 * * *)
Health-check the template itself: stale system prompts, schedules not
firing (catches the #85 scheduler-died failure mode), roles missing
plugins they should have, missing crons, channel gaps. File issues for
any drift. Designed to catch the silent-stall pattern from today's
incident.
4. DevOps Engineer — Weekly channel expansion survey (0 10 * * 1, Mondays)
PM is the only role with a channel today (Telegram). Survey what
channel infra the platform supports + what role-channel pairings would
actually help (Security→email-on-critical, DevOps→Slack-on-CI-break,
etc). File channel-proposal issues.
All four crons end with the structured audit_summary routing per #51/#75
(category, severity, issues, top_recommendation) so they integrate with
the platform-level category_routing PM uses to fan out work. The template's
existing category_routing block already maps research / plugins / template /
channels — these new crons consume exactly those slots.
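For reference, the audit_summary fields those crons emit look roughly like this (Go types and JSON keys are assumptions, not a published schema):

```go
package template // illustrative

type auditSummary struct {
    Category          string   `json:"category"`           // research | plugins | template | channels | ...
    Severity          string   `json:"severity"`           // assumed, e.g. low | medium | high
    Issues            []string `json:"issues"`
    TopRecommendation string   `json:"top_recommendation"`
}
```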
Also drops three stale "# UNION with defaults (#71)" comments left from
the cleanup PR — those plugins lists are now self-documenting after #71.
Aligns with north-star goal: team should run 24/7 AND keep getting better
across templates / plugins / channels / watchlist. This PR closes the gap
where the "review" half of the loop was running but the "evolve" half had
no active driver.
Header implied the whole system was future work, but the section body
says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor,
/plugins filter, SDK, agentskills.io spec compliance) all landed. Only
the bullets under 'Deferred, not blocking' are actually open.
Rename + lead with 'The system is done.' so a skim reader doesn't
misfile the whole topic as unshipped.
Phase B.3 pair-fix to the control plane's fly-replay state change.
Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.
TenantGuard now checks both paths in order:
1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
(production fly-replay path)
Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).
New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay-
Src header per Fly's format. Covered by a table-driven test with 7 cases
including malformed + empty-header + wrong-state-key.
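The parse is roughly (sketch; the real helper's package layout and error handling may differ):

```go
package middleware

import "strings"

// orgIDFromReplaySrc extracts the org UUID from a Fly-Replay-Src header of
// the form "instance=...;...;state=org-id=<uuid>". Returns "" when the
// state segment is missing or carries the wrong key.
func orgIDFromReplaySrc(header string) string {
    for _, part := range strings.Split(header, ";") {
        part = strings.TrimSpace(part)
        val, found := strings.CutPrefix(part, "state=")
        if !found {
            continue
        }
        id, found := strings.CutPrefix(val, "org-id=")
        if !found {
            return "" // wrong state key
        }
        return id
    }
    return ""
}
```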
Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-mortem on the failed publish-platform-image run on main (PR #82):
Fly's Docker registry requires username EXACTLY equal to "x". My
code-review "readability fix" changing it to "molecule-ai" caused
every push to return 401 Unauthorized. Verified locally:
    echo $FLY_API_TOKEN | docker login registry.fly.io -u x --password-stdin
    → Login Succeeded
    echo $FLY_API_TOKEN | docker login registry.fly.io -u molecule-ai --password-stdin
    → 401 Unauthorized
Lesson: don't second-guess docs that specify a literal value. Comment
now says "MUST be literal 'x'" with a 2026-04-15 verification note to
prevent future regressions.
Code-review process improvement: when reviewing a change against a
vendor API, prefer "preserve exact doc-specified values" over readability
suggestions. Logged as a cron-learning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses PR #82 code review: 🟡×3 + 🔵×5.
- Fly registry login username: 'x' → 'molecule-ai' + explanatory comment.
- Build & push split into two steps (GHCR / Fly registry) so a single-
registry outage can't fail the other. Second step uses 'if: always()'
to ensure Fly mirror runs even if GHCR push flakes.
- docs/runbooks/saas-secrets.md: full secret map + rotation procedures
for every SaaS credential, with danger-case callouts. Documents the
coupled FLY_API_TOKEN (lives in GHA secret AND fly secrets — must be
rotated in both).
- CLAUDE.md: new 'SaaS ops' section linking to the runbook.
Keeps ghcr.io/molecule-ai/platform private (per CEO direction —
open-source when full SaaS ships) while still letting the private control
plane's Fly provisioner boot tenant machines: Fly auto-authenticates
same-org machines against registry.fly.io, no per-tenant pull
credentials to wire.
Workflow now logs into both GHCR (using built-in GITHUB_TOKEN) and
Fly registry (using FLY_API_TOKEN secret) and pushes the same image to
four tags total:
- ghcr.io/molecule-ai/platform:latest
- ghcr.io/molecule-ai/platform:sha-<short>
- registry.fly.io/molecule-tenant:latest
- registry.fly.io/molecule-tenant:sha-<short>
Secret added via `gh secret set FLY_API_TOKEN` on the public repo.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>