molecule-core

Author	SHA1	Message	Date
Backend Engineer	1a28ec8ee5	fix(security): C1 — gate GET /workspaces behind AdminAuth; add auth middleware tests Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology without any authentication. The endpoint was intentionally left open for the canvas browser frontend; this PR closes that gap. Router change: - Move GET /workspaces from the bare root router into the wsAdmin AdminAuth group alongside POST /workspaces and DELETE /workspaces/:id. - AdminAuth uses the same fail-open bootstrap contract as all other auth gates: fresh installs (no live tokens) pass through; once any workspace has registered with a token, a valid bearer is required. Status of findings C2–C11 (documented here for audit trail): - C2 POST /workspaces/:id/activity → already in wsAuth group (Cycle 5) - C3 POST /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5) - C4 POST /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5) - C5 GET /workspaces/:id/delegations → already in wsAuth group (Cycle 5) - C7 GET /workspaces/:id/memories → already in wsAuth group (Cycle 5) - C8 POST /workspaces/:id/memories → already in wsAuth group (Cycle 5) - C9 POST /workspaces/:id/delegate → already in wsAuth group (Cycle 5) - C10 GET /admin/secrets → already in adminAuth group (Cycle 7) - C11 POST+DELETE /admin/secrets → already in adminAuth group (Cycle 7) Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new): WorkspaceAuth: - fail-open when workspace has no tokens (bootstrap path) - C4: no bearer on /delegations/:id/update → 401 - C8: no bearer on /memories POST → 401 - invalid bearer → 401 - cross-workspace token replay → 401 - valid bearer for correct workspace → 200 AdminAuth: - fail-open when no tokens exist globally (fresh install) - C10: no bearer on GET /admin/secrets → 401 - C11: no bearer on POST /admin/secrets → 401 - C11: no bearer on DELETE /admin/secrets/:key → 401 - valid bearer → 200 - invalid bearer → 401 Note: did NOT touch DELETE /admin/secrets in production — no destructive calls to live secrets endpoints were made during this work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 04:37:14 +00:00
Backend Engineer	80c2161687	fix(security): C1 — gate GET /workspaces behind AdminAuth; add auth middleware tests Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology without any authentication. The endpoint was intentionally left open for the canvas browser frontend; this PR closes that gap. Router change: - Move GET /workspaces from the bare root router into the wsAdmin AdminAuth group alongside POST /workspaces and DELETE /workspaces/:id. - AdminAuth uses the same fail-open bootstrap contract as all other auth gates: fresh installs (no live tokens) pass through; once any workspace has registered with a token, a valid bearer is required. Status of findings C2–C11 (documented here for audit trail): - C2 POST /workspaces/:id/activity → already in wsAuth group (Cycle 5) - C3 POST /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5) - C4 POST /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5) - C5 GET /workspaces/:id/delegations → already in wsAuth group (Cycle 5) - C7 GET /workspaces/:id/memories → already in wsAuth group (Cycle 5) - C8 POST /workspaces/:id/memories → already in wsAuth group (Cycle 5) - C9 POST /workspaces/:id/delegate → already in wsAuth group (Cycle 5) - C10 GET /admin/secrets → already in adminAuth group (Cycle 7) - C11 POST+DELETE /admin/secrets → already in adminAuth group (Cycle 7) Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new): WorkspaceAuth: - fail-open when workspace has no tokens (bootstrap path) - C4: no bearer on /delegations/:id/update → 401 - C8: no bearer on /memories POST → 401 - invalid bearer → 401 - cross-workspace token replay → 401 - valid bearer for correct workspace → 200 AdminAuth: - fail-open when no tokens exist globally (fresh install) - C10: no bearer on GET /admin/secrets → 401 - C11: no bearer on POST /admin/secrets → 401 - C11: no bearer on DELETE /admin/secrets/:key → 401 - valid bearer → 200 - invalid bearer → 401 Note: did NOT touch DELETE /admin/secrets in production — no destructive calls to live secrets endpoints were made during this work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 04:37:14 +00:00
Backend Engineer	96253ca8ca	fix(security): C6 — extend SSRF blocklist to RFC-1918 private ranges PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16 (link-local/IMDS). An attacker could still register a workspace with a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and redirect A2A proxy traffic to internal services. Block all five reserved ranges in validateAgentURL: - 169.254.0.0/16 link-local (IMDS: AWS/GCP/Azure) - 127.0.0.0/8 loopback (self-SSRF) - 10.0.0.0/8 RFC-1918 - 172.16.0.0/12 RFC-1918 (includes Docker bridge networks) - 192.168.0.0/16 RFC-1918 Agents must use DNS hostnames, not IP literals. The provisioner still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard preserves those); this blocklist only applies to the /registry/register request body. Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection; added 9 new cases covering range boundaries and the Docker bridge range. All 22 validateAgentURL subtests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 04:35:05 +00:00
Backend Engineer	63e482f05b	fix(security): C6 — extend SSRF blocklist to RFC-1918 private ranges PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16 (link-local/IMDS). An attacker could still register a workspace with a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and redirect A2A proxy traffic to internal services. Block all five reserved ranges in validateAgentURL: - 169.254.0.0/16 link-local (IMDS: AWS/GCP/Azure) - 127.0.0.0/8 loopback (self-SSRF) - 10.0.0.0/8 RFC-1918 - 172.16.0.0/12 RFC-1918 (includes Docker bridge networks) - 192.168.0.0/16 RFC-1918 Agents must use DNS hostnames, not IP literals. The provisioner still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard preserves those); this blocklist only applies to the /registry/register request body. Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection; added 9 new cases covering range boundaries and the Docker bridge range. All 22 validateAgentURL subtests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 04:35:05 +00:00
rabbitblood	0c5a1fdab0	chore(template): switch evolution crons from daily/weekly to hourly CEO 2026-04-15: the team's evolution loops should be hourly, not daily/weekly. A 24h or 7d cadence is the wrong rhythm for a team that's expected to run 24/7 and keep improving. At hourly, every drift, every new project, every plugin gap, every channel opportunity gets surfaced within an hour of becoming visible. \| Schedule \| Was \| Now \| \|-----------------------------------\|----------------\|--------------\| \| Hourly ecosystem watch \| 0 8 * * * \| 8 * * * * \| \| Hourly plugin curation \| 0 9 * * 1 \| 22 * * * * \| \| Hourly template fitness audit \| 30 8 * * * \| 15 * * * * \| \| Hourly channel expansion survey \| 0 10 * * 1 \| 47 * * * * \| Spread across the hour (:08, :11, :15, :17, :22, :47) so the four evolution crons + UIUX :11 + Security :17 don't collide and don't all bury PM with audit_summary deliveries at the same instant. Renamed from "Daily..." / "Weekly..." to "Hourly..." to match the new cadence and so the prompts (which still say "Daily survey" etc.) read consistently. A follow-up will fix the body wording. Live-synced into running DB via PATCH (3 of 4) and direct UPDATE on the 4th (Dev Lead workspace requires a token the script didn't have). next_run_at recomputed for all 4. First fire: 04:47 UTC (channel expansion).	2026-04-14 21:33:31 -07:00
rabbitblood	c0142edbce	chore(template): switch evolution crons from daily/weekly to hourly CEO 2026-04-15: the team's evolution loops should be hourly, not daily/weekly. A 24h or 7d cadence is the wrong rhythm for a team that's expected to run 24/7 and keep improving. At hourly, every drift, every new project, every plugin gap, every channel opportunity gets surfaced within an hour of becoming visible. \| Schedule \| Was \| Now \| \|-----------------------------------\|----------------\|--------------\| \| Hourly ecosystem watch \| 0 8 * * * \| 8 * * * * \| \| Hourly plugin curation \| 0 9 * * 1 \| 22 * * * * \| \| Hourly template fitness audit \| 30 8 * * * \| 15 * * * * \| \| Hourly channel expansion survey \| 0 10 * * 1 \| 47 * * * * \| Spread across the hour (:08, :11, :15, :17, :22, :47) so the four evolution crons + UIUX :11 + Security :17 don't collide and don't all bury PM with audit_summary deliveries at the same instant. Renamed from "Daily..." / "Weekly..." to "Hourly..." to match the new cadence and so the prompts (which still say "Daily survey" etc.) read consistently. A follow-up will fix the body wording. Live-synced into running DB via PATCH (3 of 4) and direct UPDATE on the 4th (Dev Lead workspace requires a token the script didn't have). next_run_at recomputed for all 4. First fire: 04:47 UTC (channel expansion).	2026-04-14 21:33:31 -07:00
rabbitblood	80a6fa6db5	fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress The first scheduler heartbeat (#95) only fired AFTER each tick completed. A tick that runs fireSchedule for 110+ seconds (long agent prompts) would make /admin/liveness report scheduler as stale even though it was actively working. Observed today: scheduler firing UIUX audit, last_tick_at lagged by 95s+ and incrementing. Three places now call Heartbeat: 1. Top of tick() — proves we're past the ticker.C wait 2. Inside each fire goroutine, before fireSchedule — ANY active fire keeps the heartbeat fresh 3. Inside each fire goroutine, after fireSchedule — captures the moment the per-fire work completes (The post-tick Heartbeat in Start() is still there as the "all idle" case.) Net result: /admin/liveness reports stale only if the scheduler genuinely isn't doing anything for >2× pollInterval, which is the actual signal we want.	2026-04-14 21:20:06 -07:00
rabbitblood	101f284e5d	fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress The first scheduler heartbeat (#95) only fired AFTER each tick completed. A tick that runs fireSchedule for 110+ seconds (long agent prompts) would make /admin/liveness report scheduler as stale even though it was actively working. Observed today: scheduler firing UIUX audit, last_tick_at lagged by 95s+ and incrementing. Three places now call Heartbeat: 1. Top of tick() — proves we're past the ticker.C wait 2. Inside each fire goroutine, before fireSchedule — ANY active fire keeps the heartbeat fresh 3. Inside each fire goroutine, after fireSchedule — captures the moment the per-fire work completes (The post-tick Heartbeat in Start() is still there as the "all idle" case.) Net result: /admin/liveness reports stale only if the scheduler genuinely isn't doing anything for >2× pollInterval, which is the actual signal we want.	2026-04-14 21:20:06 -07:00
rabbitblood	446111e43e	chore(template): Documentation Specialist also watches private molecule-controlplane Per CEO 2026-04-15: the SaaS controlplane (Molecule-AI/molecule-controlplane, PRIVATE Go/Fly.io provisioner) needs documentation coverage too. Updates the agent's role description, initial_prompt, and daily docs-sync cron to handle a third repo with a strict public/private split. ## Privacy rule (the critical addition) molecule-controlplane is private. Two-bucket model: Internal-only changes (handlers, schemas, infra config, billing logic, fly.toml, provisioner internals) → docs go INSIDE the controlplane repo itself (README.md, PLAN.md, docs/internal/*.md). NEVER mentioned in the public docs site. Customer-facing changes (new tier, new region, new SLA, pricing change, signup flow change) → sanitized PUBLIC description on doc.moleculesai.app. Describes the PRODUCT, never the implementation. When unsure: default to internal-only and ask PM before publishing. The privacy rule is repeated three times in the prompt (top of initial_prompt, 1b inside the daily cron, and the role description) so the agent can't miss it. ## Changes - role: extended to mention all three repos + privacy split - initial_prompt: clones controlplane in step 1, reads README+PLAN in step 5, scans recent commits in step 8, lists the four owned surfaces with public/private labels in step 10 - Daily cron: adds step 1b "PAIR RECENT CONTROLPLANE PRS" with the (i)/(ii) internal/customer-facing branching logic - SETUP block: adds controlplane git pull	2026-04-14 21:06:41 -07:00
rabbitblood	41e39c2626	chore(template): Documentation Specialist also watches private molecule-controlplane Per CEO 2026-04-15: the SaaS controlplane (Molecule-AI/molecule-controlplane, PRIVATE Go/Fly.io provisioner) needs documentation coverage too. Updates the agent's role description, initial_prompt, and daily docs-sync cron to handle a third repo with a strict public/private split. ## Privacy rule (the critical addition) molecule-controlplane is private. Two-bucket model: Internal-only changes (handlers, schemas, infra config, billing logic, fly.toml, provisioner internals) → docs go INSIDE the controlplane repo itself (README.md, PLAN.md, docs/internal/*.md). NEVER mentioned in the public docs site. Customer-facing changes (new tier, new region, new SLA, pricing change, signup flow change) → sanitized PUBLIC description on doc.moleculesai.app. Describes the PRODUCT, never the implementation. When unsure: default to internal-only and ask PM before publishing. The privacy rule is repeated three times in the prompt (top of initial_prompt, 1b inside the daily cron, and the role description) so the agent can't miss it. ## Changes - role: extended to mention all three repos + privacy split - initial_prompt: clones controlplane in step 1, reads README+PLAN in step 5, scans recent commits in step 8, lists the four owned surfaces with public/private labels in step 10 - Daily cron: adds step 1b "PAIR RECENT CONTROLPLANE PRS" with the (i)/(ii) internal/customer-facing branching logic - SETUP block: adds controlplane git pull	2026-04-14 21:06:41 -07:00
rabbitblood	7af5da31c2	chore(template): add Documentation Specialist as 3rd PM direct report Adds a 13th workspace to the molecule-dev template owning end-to-end documentation across all Molecule AI surfaces. ## Why now - We just created Molecule-AI/docs (customer-facing site at doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs someone to own it. - Internal docs (README.md, docs/architecture.md, docs/edit-history/) were drifting — every platform PR has been opening a docs sync PR manually. - No agent in the team owned terminology consistency or stub backfill. ## Where it sits in the org Third PM direct report, parallel to Research Lead and Dev Lead — docs is its own swim lane that spans engineering (docs follow code) and research/product (concepts and terminology). PM ├── Research Lead ├── Dev Lead └── Documentation Specialist <-- new ## Schedules (2) 1. Daily docs sync — backfill stubs and pair recent platform PRs `0 9 * * ` — every morning: - Pair every merged platform PR (last 24h) with a docs PR if needed - Backfill one stub page on the docs site - Crawl the live site for broken links / dead anchors - delegate_task to PM with audit_summary (category=docs) 2. Weekly terminology + freshness audit* `0 11 * * 1` — every Monday: - Stale page detection (>30 days untouched on fast-moving surfaces) - Terminology consistency check (one canonical name per concept) - Link-rot scan - Same audit_summary contract ## Plugins Inherits the 9 universal defaults. Adds `browser-automation` for crawling the live docs site. `molecule-skill-update-docs` is already in defaults so the cross-repo sync skill is available. ## Routing Adds `docs: [Documentation Specialist]` to `category_routing` so any agent that emits an audit_summary with category=docs is auto-routed here by the platform. ## Bind mounts Note: this workspace clones BOTH /workspace/repo (the platform monorepo) and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the agent can edit either side.	2026-04-14 21:03:22 -07:00
rabbitblood	53fdffd2c5	chore(template): add Documentation Specialist as 3rd PM direct report Adds a 13th workspace to the molecule-dev template owning end-to-end documentation across all Molecule AI surfaces. ## Why now - We just created Molecule-AI/docs (customer-facing site at doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs someone to own it. - Internal docs (README.md, docs/architecture.md, docs/edit-history/) were drifting — every platform PR has been opening a docs sync PR manually. - No agent in the team owned terminology consistency or stub backfill. ## Where it sits in the org Third PM direct report, parallel to Research Lead and Dev Lead — docs is its own swim lane that spans engineering (docs follow code) and research/product (concepts and terminology). PM ├── Research Lead ├── Dev Lead └── Documentation Specialist <-- new ## Schedules (2) 1. Daily docs sync — backfill stubs and pair recent platform PRs `0 9 * * ` — every morning: - Pair every merged platform PR (last 24h) with a docs PR if needed - Backfill one stub page on the docs site - Crawl the live site for broken links / dead anchors - delegate_task to PM with audit_summary (category=docs) 2. Weekly terminology + freshness audit* `0 11 * * 1` — every Monday: - Stale page detection (>30 days untouched on fast-moving surfaces) - Terminology consistency check (one canonical name per concept) - Link-rot scan - Same audit_summary contract ## Plugins Inherits the 9 universal defaults. Adds `browser-automation` for crawling the live docs site. `molecule-skill-update-docs` is already in defaults so the cross-repo sync skill is available. ## Routing Adds `docs: [Documentation Specialist]` to `category_routing` so any agent that emits an audit_summary with category=docs is auto-routed here by the platform. ## Bind mounts Note: this workspace clones BOTH /workspace/repo (the platform monorepo) and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the agent can edit either side.	2026-04-14 21:03:22 -07:00
Hongming Wang	6a8555ef61	Merge pull request #96 from Molecule-AI/feat/canvas-auth-redirect feat(canvas): AuthGate — redirect anonymous users to cp login	2026-04-14 20:42:12 -07:00
Hongming Wang	96d88f42a6	Merge pull request #96 from Molecule-AI/feat/canvas-auth-redirect feat(canvas): AuthGate — redirect anonymous users to cp login	2026-04-14 20:42:12 -07:00
Hongming Wang	043b8fe159	feat(canvas): AuthGate — redirect anonymous users to cp login (Phase F close) Wraps the canvas root so every tenant-subdomain request checks for a valid session and bounces to app.moleculesai.app/cp/auth/login with a return_to pointing back at the current URL. Local dev + vercel preview URLs + apex pass through unchanged. Files: - canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me (credentials:include for cross-origin cookie); returns Session on 200, null on 401 (anonymous, no throw), throws on 5xx so transient outages don't leak the UI. - canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL with window.location.href as return_to; CP's isSafeReturnTo check rejects cross-domain bounces. - canvas/src/components/AuthGate.tsx: client component wrapping children. State machine: loading → authenticated \| anonymous. In non-SaaS mode (no tenant slug) skips the gate entirely. - canvas/src/app/layout.tsx: wraps the root body in <AuthGate>. Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include / redirectToLogin href + signup variant). Full suite 453 green (was 447). Pairs with molecule-controlplane PR #16 (return_to cookie handshake on the cp side). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:37:26 -07:00
Hongming Wang	aedd3db697	feat(canvas): AuthGate — redirect anonymous users to cp login (Phase F close) Wraps the canvas root so every tenant-subdomain request checks for a valid session and bounces to app.moleculesai.app/cp/auth/login with a return_to pointing back at the current URL. Local dev + vercel preview URLs + apex pass through unchanged. Files: - canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me (credentials:include for cross-origin cookie); returns Session on 200, null on 401 (anonymous, no throw), throws on 5xx so transient outages don't leak the UI. - canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL with window.location.href as return_to; CP's isSafeReturnTo check rejects cross-domain bounces. - canvas/src/components/AuthGate.tsx: client component wrapping children. State machine: loading → authenticated \| anonymous. In non-SaaS mode (no tenant slug) skips the gate entirely. - canvas/src/app/layout.tsx: wraps the root body in <AuthGate>. Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include / redirectToLogin href + signup variant). Full suite 453 green (was 447). Pairs with molecule-controlplane PR #16 (return_to cookie handshake on the cp side). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:37:26 -07:00
rabbitblood	76a36e8062	fix(platform): panic-recovering supervisor for every background goroutine (#92 ) Yesterday's scheduler-died incident (#85) was one instance of a systemic bug: every long-running goroutine in the platform lacks panic recovery and exposes no liveness signal. In a multi-tenant SaaS deployment, a single tenant's bad data panicking any subsystem takes down the subsystem for every tenant, silently, with all standard health probes still green. That is a scale-of-one sev-1. This PR: 1. Introduces `platform/internal/supervised/` with two primitives: a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper. On panic logs the stack + exponential-backoff restart (1s → 2s → 4s → … → 30s cap). On clean return (fn decided to stop) returns. On ctx.Done cancels cleanly. b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names, staleThreshold) — shared in-memory liveness registry. Every subsystem calls Heartbeat(name) at the end of each tick so operators can distinguish "goroutine alive and healthy" from "alive but stuck inside a single tick". 2. Wraps every `go X.Start(ctx)` in main.go: - broadcaster.Subscribe (Redis pub/sub relay → WebSocket) - registry.StartLivenessMonitor - registry.StartHealthSweep - scheduler.Start (the one that died yesterday) - channelMgr.Start (Telegram / Slack) 3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick loop as the first end-to-end demonstration. Follow-up PRs will add heartbeats to the other four subsystems. 4. Adds `GET /admin/liveness` endpoint returning per-subsystem last_tick_at + seconds_ago. Operators can poll this and alert on any subsystem whose seconds_ago exceeds 2x its cron/tick interval. 5. Unit tests for RunWithRecover (clean return no restart; panic restarts with backoff; ctx cancel stops restart loop) and for the liveness registry. Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go: ~10 line changes. No behavior change on happy path; only lifts what happens on a panic. Closes #92. Supersedes the local recover added to scheduler.go in #90 (kept conceptually, but now via the shared helper).	2026-04-14 20:34:18 -07:00
rabbitblood	e4535560cf	fix(platform): panic-recovering supervisor for every background goroutine (#92 ) Yesterday's scheduler-died incident (#85) was one instance of a systemic bug: every long-running goroutine in the platform lacks panic recovery and exposes no liveness signal. In a multi-tenant SaaS deployment, a single tenant's bad data panicking any subsystem takes down the subsystem for every tenant, silently, with all standard health probes still green. That is a scale-of-one sev-1. This PR: 1. Introduces `platform/internal/supervised/` with two primitives: a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper. On panic logs the stack + exponential-backoff restart (1s → 2s → 4s → … → 30s cap). On clean return (fn decided to stop) returns. On ctx.Done cancels cleanly. b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names, staleThreshold) — shared in-memory liveness registry. Every subsystem calls Heartbeat(name) at the end of each tick so operators can distinguish "goroutine alive and healthy" from "alive but stuck inside a single tick". 2. Wraps every `go X.Start(ctx)` in main.go: - broadcaster.Subscribe (Redis pub/sub relay → WebSocket) - registry.StartLivenessMonitor - registry.StartHealthSweep - scheduler.Start (the one that died yesterday) - channelMgr.Start (Telegram / Slack) 3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick loop as the first end-to-end demonstration. Follow-up PRs will add heartbeats to the other four subsystems. 4. Adds `GET /admin/liveness` endpoint returning per-subsystem last_tick_at + seconds_ago. Operators can poll this and alert on any subsystem whose seconds_ago exceeds 2x its cron/tick interval. 5. Unit tests for RunWithRecover (clean return no restart; panic restarts with backoff; ctx cancel stops restart loop) and for the liveness registry. Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go: ~10 line changes. No behavior change on happy path; only lifts what happens on a panic. Closes #92. Supersedes the local recover added to scheduler.go in #90 (kept conceptually, but now via the shared helper).	2026-04-14 20:34:18 -07:00
Backend Engineer	602dcb6283	fix(security): C6 — block loopback IP literals in /registry/register A workspace that self-registers with a 127.0.0.x URL on first INSERT could redirect A2A proxy traffic back to the platform itself (SSRF). The previous fix only blocked 169.254.0.0/16 (cloud metadata). Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container networking depends on them. Safe because the provisioner writes 127.0.0.1 URLs via direct SQL UPDATE, not through /registry/register, so the UPSERT CASE that preserves provisioner URLs is unaffected. Local-dev agents can still register using "localhost" by name (hostname, not IP literal). Tests: removed "valid localhost http" case (now correctly rejected), added "valid localhost name" + three loopback-block assertions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 03:34:14 +00:00
Backend Engineer	19bdd81ba4	fix(security): C6 — block loopback IP literals in /registry/register A workspace that self-registers with a 127.0.0.x URL on first INSERT could redirect A2A proxy traffic back to the platform itself (SSRF). The previous fix only blocked 169.254.0.0/16 (cloud metadata). Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container networking depends on them. Safe because the provisioner writes 127.0.0.1 URLs via direct SQL UPDATE, not through /registry/register, so the UPSERT CASE that preserves provisioner URLs is unaffected. Local-dev agents can still register using "localhost" by name (hostname, not IP literal). Tests: removed "valid localhost http" case (now correctly rejected), added "valid localhost name" + three loopback-block assertions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 03:34:14 +00:00
Hongming Wang	8019adb811	Merge pull request #90 from Molecule-AI/fix/scheduler-watchdog-recover fix(scheduler): recover from panics + add liveness watchdog (#85)	2026-04-14 20:30:31 -07:00
Hongming Wang	c02bfb4257	Merge pull request #90 from Molecule-AI/fix/scheduler-watchdog-recover fix(scheduler): recover from panics + add liveness watchdog (#85)	2026-04-14 20:30:31 -07:00
Hongming Wang	e33ff21cad	Merge pull request #87 from Molecule-AI/chore/template-evolution-crons chore(template): add 4 evolution crons — ecosystem / plugins / template / channels	2026-04-14 20:30:26 -07:00
Hongming Wang	12ef17f8e0	Merge pull request #87 from Molecule-AI/chore/template-evolution-crons chore(template): add 4 evolution crons — ecosystem / plugins / template / channels	2026-04-14 20:30:26 -07:00
Hongming Wang	857218ad35	Merge pull request #81 from Molecule-AI/docs/sync-2026-04-15-tick-9 QA verified: docs-only change (PLAN.md + edit-history). CI green (all 6 checks pass). No code changes. Safe to merge.	2026-04-14 20:30:18 -07:00
Hongming Wang	092652770c	Merge pull request #81 from Molecule-AI/docs/sync-2026-04-15-tick-9 QA verified: docs-only change (PLAN.md + edit-history). CI green (all 6 checks pass). No code changes. Safe to merge.	2026-04-14 20:30:18 -07:00
Hongming Wang	e2ba3e3256	Merge pull request #91 from Molecule-AI/feat/canvas-saas-cross-origin feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)	2026-04-14 20:10:46 -07:00
Hongming Wang	e7275531d8	Merge pull request #91 from Molecule-AI/feat/canvas-saas-cross-origin feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)	2026-04-14 20:10:46 -07:00
Hongming Wang	3abcee11b3	feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F) Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go cross-origin to https://app.moleculesai.app. This commit wires the client side: - canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from window.location.hostname, case-insensitive, matching the control plane's reservedSubdomains list (app/www/api/admin/…). Server-side + localhost + vercel preview URLs + apex all return "" so local dev keeps working. - canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets credentials:"include" on every fetch. The control plane's CORS middleware allows the origin + credentials; the session cookie has Domain=.moleculesai.app so the browser ships it. - canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its own fetch helper — shared slug+credentials logic applied). Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS / preview URL / apex). Full canvas suite 447/447 green. Not in this PR: - WS URL derivation for terminal/socket.ts (separate follow-up; WS needs its own slug-aware URL and the canvas terminal isn't used in SaaS launch day-one). - Next.js rewrites (decided against; cross-origin with credentials is cleaner than path-level rewrites for session cookies). Deploys to Vercel once merged — no manual config needed (env already set). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:08:39 -07:00
Hongming Wang	c7537436ff	feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F) Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go cross-origin to https://app.moleculesai.app. This commit wires the client side: - canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from window.location.hostname, case-insensitive, matching the control plane's reservedSubdomains list (app/www/api/admin/…). Server-side + localhost + vercel preview URLs + apex all return "" so local dev keeps working. - canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets credentials:"include" on every fetch. The control plane's CORS middleware allows the origin + credentials; the session cookie has Domain=.moleculesai.app so the browser ships it. - canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its own fetch helper — shared slug+credentials logic applied). Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS / preview URL / apex). Full canvas suite 447/447 green. Not in this PR: - WS URL derivation for terminal/socket.ts (separate follow-up; WS needs its own slug-aware URL and the canvas terminal isn't used in SaaS launch day-one). - Next.js rewrites (decided against; cross-origin with credentials is cleaner than path-level rewrites for session cookies). Deploys to Vercel once merged — no manual config needed (env already set). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:08:39 -07:00
rabbitblood	ef7f482593	fix(scheduler): recover from panics + add liveness watchdog (#85 ) The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for 12+ hours. Platform restart didn't recover it. Root cause: tick() and fireSchedule() goroutines have no panic recovery. A single bad row, bad cron expression, DB blip, or transient panic anywhere in the chain permanently kills the scheduler goroutine — and the only signal to an operator is "no crons firing", which is invisible if you're not watching. Specifically: func (s *Scheduler) Start(ctx context.Context) { for { select { case <-ticker.C: s.tick(ctx) // <- if this panics, the for-loop exits forever } } } And inside tick: go func(s2 scheduleRow) { defer wg.Done() defer func() { <-sem }() s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait() }(sched) Two `defer recover()` additions: 1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse, row processing) is logged and the next tick fires normally. 2. In each fireSchedule goroutine — a single bad workspace can't take the rest of the batch down. Plus a liveness watchdog: - Scheduler now records `lastTickAt` after each successful tick. - New methods `LastTickAt()` and `Healthy()` (true if last tick within 2× pollInterval = 60s). - Initialised at Start so Healthy() returns true on a fresh process. Endpoint plumbing for /admin/scheduler/health is a follow-up — needs threading the scheduler instance through router.Setup(). Documented on #85. Closes the silent-outage failure mode of #85. The other proposed fixes (force-kill on /restart hang, active_tasks watchdog) are separate concerns tracked in #85's comments.	2026-04-14 19:32:01 -07:00
rabbitblood	7dc9d83792	fix(scheduler): recover from panics + add liveness watchdog (#85 ) The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for 12+ hours. Platform restart didn't recover it. Root cause: tick() and fireSchedule() goroutines have no panic recovery. A single bad row, bad cron expression, DB blip, or transient panic anywhere in the chain permanently kills the scheduler goroutine — and the only signal to an operator is "no crons firing", which is invisible if you're not watching. Specifically: func (s *Scheduler) Start(ctx context.Context) { for { select { case <-ticker.C: s.tick(ctx) // <- if this panics, the for-loop exits forever } } } And inside tick: go func(s2 scheduleRow) { defer wg.Done() defer func() { <-sem }() s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait() }(sched) Two `defer recover()` additions: 1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse, row processing) is logged and the next tick fires normally. 2. In each fireSchedule goroutine — a single bad workspace can't take the rest of the batch down. Plus a liveness watchdog: - Scheduler now records `lastTickAt` after each successful tick. - New methods `LastTickAt()` and `Healthy()` (true if last tick within 2× pollInterval = 60s). - Initialised at Start so Healthy() returns true on a fresh process. Endpoint plumbing for /admin/scheduler/health is a follow-up — needs threading the scheduler instance through router.Setup(). Documented on #85. Closes the silent-outage failure mode of #85. The other proposed fixes (force-kill on /restart hang, active_tasks watchdog) are separate concerns tracked in #85's comments.	2026-04-14 19:32:01 -07:00
Hongming Wang	d81d15537e	Merge pull request #89 from Molecule-AI/docs/sync-saas-progress docs(plan): add Phase 32 current-state snapshot	2026-04-14 18:17:36 -07:00
Hongming Wang	15ad2a8dbe	Merge pull request #89 from Molecule-AI/docs/sync-saas-progress docs(plan): add Phase 32 current-state snapshot	2026-04-14 18:17:36 -07:00
Hongming Wang	73887948b2	Merge pull request #88 from Molecule-AI/fix/tenant-guard-state-no-prefix fix(middleware): tenant guard reads bare UUID from state= (pair with cp #8)	2026-04-14 18:14:14 -07:00
Hongming Wang	ff6499f634	Merge pull request #88 from Molecule-AI/fix/tenant-guard-state-no-prefix fix(middleware): tenant guard reads bare UUID from state= (pair with cp #8)	2026-04-14 18:14:14 -07:00
Hongming Wang	c24c7bdb97	docs(plan): add Phase 32 current-state block Point-in-time snapshot of the live SaaS infrastructure + which phases are done vs in-flight vs not started. Links to molecule-controlplane's own PLAN for deeper operator detail.	2026-04-14 18:13:47 -07:00
Hongming Wang	821ed3a532	docs(plan): add Phase 32 current-state block Point-in-time snapshot of the live SaaS infrastructure + which phases are done vs in-flight vs not started. Links to molecule-controlplane's own PLAN for deeper operator detail.	2026-04-14 18:13:47 -07:00
Hongming Wang	7af4f10226	fix(middleware): tenant guard reads bare UUID from state= (no prefix) Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the fly-replay state value contains '=', so the control plane now puts the bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the whole 'state=...' value as the org id.	2026-04-14 18:09:44 -07:00
Hongming Wang	e38257ac88	fix(middleware): tenant guard reads bare UUID from state= (no prefix) Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the fly-replay state value contains '=', so the control plane now puts the bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the whole 'state=...' value as the org id.	2026-04-14 18:09:44 -07:00
rabbitblood	4f2b28c060	chore(template): add 4 evolution crons — ecosystem / plugins / template / channels Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing actively pushes the team to EVOLVE the four levers CEO named: templates, plugins, channels, watchlist. The team-runs-24/7 goal needs both — defensive reviews AND offensive evolution. Adds 4 new schedules: 1. Research Lead — Daily ecosystem watch (0 8 * * ) Survey github.com/trending + HN + AI-blogs for new agent-infra projects from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day, commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables the watchlist pipeline that was paused earlier today. 2. Technical Researcher — Weekly plugin curation (0 9 * 1, Mondays) Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps (builtin not exposed as plugin; role missing extras; rarely-used plugin in defaults). Survey upstream (claude.ai cookbook, MCP servers, anthropic/openai/langchain blogs). File 1-3 plugin proposals per week as GH issues with concrete integration sketches. 3. Dev Lead — Daily template fitness audit (30 8 * * ) Health-check the template itself: stale system prompts, schedules not firing (catches the #85 scheduler-died failure mode), roles missing plugins they should have, missing crons, channel gaps. File issues for any drift. Designed to catch the silent-stall pattern from today's incident. 4. DevOps Engineer — Weekly channel expansion survey (0 10 * 1, Mondays) PM is the only role with a channel today (Telegram). Survey what channel infra the platform supports + what role-channel pairings would actually help (Security→email-on-critical, DevOps→Slack-on-CI-break, etc). File channel-proposal issues. All four crons end with the structured audit_summary routing per #51/#75 (category, severity, issues, top_recommendation) so they integrate with the platform-level category_routing PM uses to fan out work. The template's existing category_routing block already maps research / plugins / template / channels — these new crons consume exactly those slots. Also drops three stale "# UNION with defaults (#71)" comments left from the cleanup PR — those plugins lists are now self-documenting after #71. Aligns with north-star goal: team should run 24/7 AND keep getting better across templates / plugins / channels / watchlist. This PR closes the gap where the "review" half of the loop was running but the "evolve" half had no active driver.	2026-04-14 18:04:00 -07:00
rabbitblood	18ded13ab3	chore(template): add 4 evolution crons — ecosystem / plugins / template / channels Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing actively pushes the team to EVOLVE the four levers CEO named: templates, plugins, channels, watchlist. The team-runs-24/7 goal needs both — defensive reviews AND offensive evolution. Adds 4 new schedules: 1. Research Lead — Daily ecosystem watch (0 8 * * ) Survey github.com/trending + HN + AI-blogs for new agent-infra projects from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day, commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables the watchlist pipeline that was paused earlier today. 2. Technical Researcher — Weekly plugin curation (0 9 * 1, Mondays) Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps (builtin not exposed as plugin; role missing extras; rarely-used plugin in defaults). Survey upstream (claude.ai cookbook, MCP servers, anthropic/openai/langchain blogs). File 1-3 plugin proposals per week as GH issues with concrete integration sketches. 3. Dev Lead — Daily template fitness audit (30 8 * * ) Health-check the template itself: stale system prompts, schedules not firing (catches the #85 scheduler-died failure mode), roles missing plugins they should have, missing crons, channel gaps. File issues for any drift. Designed to catch the silent-stall pattern from today's incident. 4. DevOps Engineer — Weekly channel expansion survey (0 10 * 1, Mondays) PM is the only role with a channel today (Telegram). Survey what channel infra the platform supports + what role-channel pairings would actually help (Security→email-on-critical, DevOps→Slack-on-CI-break, etc). File channel-proposal issues. All four crons end with the structured audit_summary routing per #51/#75 (category, severity, issues, top_recommendation) so they integrate with the platform-level category_routing PM uses to fan out work. The template's existing category_routing block already maps research / plugins / template / channels — these new crons consume exactly those slots. Also drops three stale "# UNION with defaults (#71)" comments left from the cleanup PR — those plugins lists are now self-documenting after #71. Aligns with north-star goal: team should run 24/7 AND keep getting better across templates / plugins / channels / watchlist. This PR closes the gap where the "review" half of the loop was running but the "evolve" half had no active driver.	2026-04-14 18:04:00 -07:00
Hongming Wang	e523ca9b20	Merge pull request #86 from Molecule-AI/docs/plugin-adaptor-header-fix docs(plan): plugin adaptor system is shipped, not future work	2026-04-14 18:03:28 -07:00
Hongming Wang	5b814ca1a7	Merge pull request #86 from Molecule-AI/docs/plugin-adaptor-header-fix docs(plan): plugin adaptor system is shipped, not future work	2026-04-14 18:03:28 -07:00
Hongming Wang	2db410cccb	Merge pull request #84 from Molecule-AI/fix/tenant-guard-fly-replay-src fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state	2026-04-14 18:03:19 -07:00
Hongming Wang	a7619d4f9a	Merge pull request #84 from Molecule-AI/fix/tenant-guard-fly-replay-src fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state	2026-04-14 18:03:19 -07:00
Hongming Wang	c442d79aac	docs(plan): rename 'Future Work — Plugin Adaptor System' to reflect shipped state Header implied the whole system was future work, but the section body says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor, /plugins filter, SDK, agentskills.io spec compliance) all landed. Only the bullets under 'Deferred, not blocking' are actually open. Rename + lead with 'The system is done.' so a skim reader doesn't misfile the whole topic as unshipped.	2026-04-14 18:02:28 -07:00
Hongming Wang	a99517f4ec	docs(plan): rename 'Future Work — Plugin Adaptor System' to reflect shipped state Header implied the whole system was future work, but the section body says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor, /plugins filter, SDK, agentskills.io spec compliance) all landed. Only the bullets under 'Deferred, not blocking' are actually open. Rename + lead with 'The system is done.' so a skim reader doesn't misfile the whole topic as unshipped.	2026-04-14 18:02:28 -07:00
Hongming Wang	f1dd7cc367	fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state Phase B.3 pair-fix to the control plane's fly-replay state change. Background: the private molecule-controlplane's router emits `fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays the request to the tenant and injects `Fly-Replay-Src: instance=Z;...; state=org-id=<uuid>` on the replayed request. But response headers from the cp (like X-Molecule-Org-Id) never travel to the replayed tenant — only the state= param does. TenantGuard now checks both paths in order: 1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli) 2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment (production fly-replay path) Either matching configured MOLECULE_ORG_ID → allow. Neither matches → 404 (still don't leak tenant existence). New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay- Src header per Fly's format. Covered by a table-driven test with 7 cases including malformed + empty-header + wrong-state-key. Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 17:54:13 -07:00
Hongming Wang	522d055758	fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state Phase B.3 pair-fix to the control plane's fly-replay state change. Background: the private molecule-controlplane's router emits `fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays the request to the tenant and injects `Fly-Replay-Src: instance=Z;...; state=org-id=<uuid>` on the replayed request. But response headers from the cp (like X-Molecule-Org-Id) never travel to the replayed tenant — only the state= param does. TenantGuard now checks both paths in order: 1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli) 2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment (production fly-replay path) Either matching configured MOLECULE_ORG_ID → allow. Neither matches → 404 (still don't leak tenant existence). New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay- Src header per Fly's format. Covered by a table-driven test with 7 cases including malformed + empty-header + wrong-state-key. Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 17:54:13 -07:00

... 65 66 67 68 69 ...

3663 Commits