Commit Graph

3663 Commits

Author SHA1 Message Date
Backend Engineer
1a28ec8ee5 fix(security): C1 — gate GET /workspaces behind AdminAuth; add auth middleware tests
Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology
without any authentication. The endpoint was intentionally left open for
the canvas browser frontend; this PR closes that gap.

Router change:
- Move GET /workspaces from the bare root router into the wsAdmin AdminAuth
  group alongside POST /workspaces and DELETE /workspaces/:id.
- AdminAuth uses the same fail-open bootstrap contract as all other auth
  gates: fresh installs (no live tokens) pass through; once any workspace
  has registered with a token, a valid bearer is required.

Status of findings C2–C11 (documented here for audit trail):
- C2  POST   /workspaces/:id/activity           → already in wsAuth group (Cycle 5)
- C3  POST   /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5)
- C4  POST   /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5)
- C5  GET    /workspaces/:id/delegations        → already in wsAuth group (Cycle 5)
- C7  GET    /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C8  POST   /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C9  POST   /workspaces/:id/delegate           → already in wsAuth group (Cycle 5)
- C10 GET    /admin/secrets                     → already in adminAuth group (Cycle 7)
- C11 POST+DELETE /admin/secrets                → already in adminAuth group (Cycle 7)

Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new):
WorkspaceAuth:
  - fail-open when workspace has no tokens (bootstrap path)
  - C4: no bearer on /delegations/:id/update → 401
  - C8: no bearer on /memories POST → 401
  - invalid bearer → 401
  - cross-workspace token replay → 401
  - valid bearer for correct workspace → 200

AdminAuth:
  - fail-open when no tokens exist globally (fresh install)
  - C10: no bearer on GET /admin/secrets → 401
  - C11: no bearer on POST /admin/secrets → 401
  - C11: no bearer on DELETE /admin/secrets/:key → 401
  - valid bearer → 200
  - invalid bearer → 401

Note: did NOT touch DELETE /admin/secrets in production — no destructive
calls to live secrets endpoints were made during this work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:37:14 +00:00
Backend Engineer
80c2161687 fix(security): C1 — gate GET /workspaces behind AdminAuth; add auth middleware tests
Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology
without any authentication. The endpoint was intentionally left open for
the canvas browser frontend; this PR closes that gap.

Router change:
- Move GET /workspaces from the bare root router into the wsAdmin AdminAuth
  group alongside POST /workspaces and DELETE /workspaces/:id.
- AdminAuth uses the same fail-open bootstrap contract as all other auth
  gates: fresh installs (no live tokens) pass through; once any workspace
  has registered with a token, a valid bearer is required.

Status of findings C2–C11 (documented here for audit trail):
- C2  POST   /workspaces/:id/activity           → already in wsAuth group (Cycle 5)
- C3  POST   /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5)
- C4  POST   /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5)
- C5  GET    /workspaces/:id/delegations        → already in wsAuth group (Cycle 5)
- C7  GET    /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C8  POST   /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C9  POST   /workspaces/:id/delegate           → already in wsAuth group (Cycle 5)
- C10 GET    /admin/secrets                     → already in adminAuth group (Cycle 7)
- C11 POST+DELETE /admin/secrets                → already in adminAuth group (Cycle 7)

Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new):
WorkspaceAuth:
  - fail-open when workspace has no tokens (bootstrap path)
  - C4: no bearer on /delegations/:id/update → 401
  - C8: no bearer on /memories POST → 401
  - invalid bearer → 401
  - cross-workspace token replay → 401
  - valid bearer for correct workspace → 200

AdminAuth:
  - fail-open when no tokens exist globally (fresh install)
  - C10: no bearer on GET /admin/secrets → 401
  - C11: no bearer on POST /admin/secrets → 401
  - C11: no bearer on DELETE /admin/secrets/:key → 401
  - valid bearer → 200
  - invalid bearer → 401

Note: did NOT touch DELETE /admin/secrets in production — no destructive
calls to live secrets endpoints were made during this work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:37:14 +00:00
Backend Engineer
96253ca8ca fix(security): C6 — extend SSRF blocklist to RFC-1918 private ranges
PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16
(link-local/IMDS). An attacker could still register a workspace with
a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and
redirect A2A proxy traffic to internal services.

Block all five reserved ranges in validateAgentURL:
  - 169.254.0.0/16  link-local (IMDS: AWS/GCP/Azure)
  - 127.0.0.0/8     loopback (self-SSRF)
  - 10.0.0.0/8      RFC-1918
  - 172.16.0.0/12   RFC-1918 (includes Docker bridge networks)
  - 192.168.0.0/16  RFC-1918

Agents must use DNS hostnames, not IP literals. The provisioner
still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard
preserves those); this blocklist only applies to the /registry/register
request body.

Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection;
added 9 new cases covering range boundaries and the Docker bridge range.
All 22 validateAgentURL subtests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:35:05 +00:00
Backend Engineer
63e482f05b fix(security): C6 — extend SSRF blocklist to RFC-1918 private ranges
PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16
(link-local/IMDS). An attacker could still register a workspace with
a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and
redirect A2A proxy traffic to internal services.

Block all five reserved ranges in validateAgentURL:
  - 169.254.0.0/16  link-local (IMDS: AWS/GCP/Azure)
  - 127.0.0.0/8     loopback (self-SSRF)
  - 10.0.0.0/8      RFC-1918
  - 172.16.0.0/12   RFC-1918 (includes Docker bridge networks)
  - 192.168.0.0/16  RFC-1918

Agents must use DNS hostnames, not IP literals. The provisioner
still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard
preserves those); this blocklist only applies to the /registry/register
request body.

Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection;
added 9 new cases covering range boundaries and the Docker bridge range.
All 22 validateAgentURL subtests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:35:05 +00:00
rabbitblood
0c5a1fdab0 chore(template): switch evolution crons from daily/weekly to hourly
CEO 2026-04-15: the team's evolution loops should be hourly, not daily/weekly.
A 24h or 7d cadence is the wrong rhythm for a team that's expected to run 24/7
and keep improving. At hourly, every drift, every new project, every plugin
gap, every channel opportunity gets surfaced within an hour of becoming visible.

| Schedule                          | Was            | Now          |
|-----------------------------------|----------------|--------------|
| Hourly ecosystem watch            | 0 8 * * *      | 8 * * * *    |
| Hourly plugin curation            | 0 9 * * 1      | 22 * * * *   |
| Hourly template fitness audit     | 30 8 * * *     | 15 * * * *   |
| Hourly channel expansion survey   | 0 10 * * 1     | 47 * * * *   |

Spread across the hour (:08, :11, :15, :17, :22, :47) so the four evolution
crons + UIUX :11 + Security :17 don't collide and don't all bury PM with
audit_summary deliveries at the same instant.

Renamed from "Daily..." / "Weekly..." to "Hourly..." to match the new cadence
and so the prompts (which still say "Daily survey" etc.) read consistently.
A follow-up will fix the body wording.

Live-synced into running DB via PATCH (3 of 4) and direct UPDATE on the 4th
(Dev Lead workspace requires a token the script didn't have). next_run_at
recomputed for all 4. First fire: 04:47 UTC (channel expansion).
2026-04-14 21:33:31 -07:00
rabbitblood
c0142edbce chore(template): switch evolution crons from daily/weekly to hourly
CEO 2026-04-15: the team's evolution loops should be hourly, not daily/weekly.
A 24h or 7d cadence is the wrong rhythm for a team that's expected to run 24/7
and keep improving. At hourly, every drift, every new project, every plugin
gap, every channel opportunity gets surfaced within an hour of becoming visible.

| Schedule                          | Was            | Now          |
|-----------------------------------|----------------|--------------|
| Hourly ecosystem watch            | 0 8 * * *      | 8 * * * *    |
| Hourly plugin curation            | 0 9 * * 1      | 22 * * * *   |
| Hourly template fitness audit     | 30 8 * * *     | 15 * * * *   |
| Hourly channel expansion survey   | 0 10 * * 1     | 47 * * * *   |

Spread across the hour (:08, :11, :15, :17, :22, :47) so the four evolution
crons + UIUX :11 + Security :17 don't collide and don't all bury PM with
audit_summary deliveries at the same instant.

Renamed from "Daily..." / "Weekly..." to "Hourly..." to match the new cadence
and so the prompts (which still say "Daily survey" etc.) read consistently.
A follow-up will fix the body wording.

Live-synced into running DB via PATCH (3 of 4) and direct UPDATE on the 4th
(Dev Lead workspace requires a token the script didn't have). next_run_at
recomputed for all 4. First fire: 04:47 UTC (channel expansion).
2026-04-14 21:33:31 -07:00
rabbitblood
80a6fa6db5 fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.

Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
   keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
   the per-fire work completes

(The post-tick Heartbeat in Start() is still there as the "all idle" case.)

Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
2026-04-14 21:20:06 -07:00
rabbitblood
101f284e5d fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.

Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
   keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
   the per-fire work completes

(The post-tick Heartbeat in Start() is still there as the "all idle" case.)

Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
2026-04-14 21:20:06 -07:00
rabbitblood
446111e43e chore(template): Documentation Specialist also watches private molecule-controlplane
Per CEO 2026-04-15: the SaaS controlplane (Molecule-AI/molecule-controlplane,
PRIVATE Go/Fly.io provisioner) needs documentation coverage too.

Updates the agent's role description, initial_prompt, and daily docs-sync
cron to handle a third repo with a strict public/private split.

## Privacy rule (the critical addition)

molecule-controlplane is private. Two-bucket model:

  Internal-only changes (handlers, schemas, infra config, billing logic,
  fly.toml, provisioner internals) → docs go INSIDE the controlplane repo
  itself (README.md, PLAN.md, docs/internal/*.md). NEVER mentioned in the
  public docs site.

  Customer-facing changes (new tier, new region, new SLA, pricing change,
  signup flow change) → sanitized PUBLIC description on doc.moleculesai.app.
  Describes the PRODUCT, never the implementation.

  When unsure: default to internal-only and ask PM before publishing.

The privacy rule is repeated three times in the prompt (top of initial_prompt,
1b inside the daily cron, and the role description) so the agent can't miss it.

## Changes
- role: extended to mention all three repos + privacy split
- initial_prompt: clones controlplane in step 1, reads README+PLAN in step 5,
  scans recent commits in step 8, lists the four owned surfaces with public/private
  labels in step 10
- Daily cron: adds step 1b "PAIR RECENT CONTROLPLANE PRS" with the (i)/(ii)
  internal/customer-facing branching logic
- SETUP block: adds controlplane git pull
2026-04-14 21:06:41 -07:00
rabbitblood
41e39c2626 chore(template): Documentation Specialist also watches private molecule-controlplane
Per CEO 2026-04-15: the SaaS controlplane (Molecule-AI/molecule-controlplane,
PRIVATE Go/Fly.io provisioner) needs documentation coverage too.

Updates the agent's role description, initial_prompt, and daily docs-sync
cron to handle a third repo with a strict public/private split.

## Privacy rule (the critical addition)

molecule-controlplane is private. Two-bucket model:

  Internal-only changes (handlers, schemas, infra config, billing logic,
  fly.toml, provisioner internals) → docs go INSIDE the controlplane repo
  itself (README.md, PLAN.md, docs/internal/*.md). NEVER mentioned in the
  public docs site.

  Customer-facing changes (new tier, new region, new SLA, pricing change,
  signup flow change) → sanitized PUBLIC description on doc.moleculesai.app.
  Describes the PRODUCT, never the implementation.

  When unsure: default to internal-only and ask PM before publishing.

The privacy rule is repeated three times in the prompt (top of initial_prompt,
1b inside the daily cron, and the role description) so the agent can't miss it.

## Changes
- role: extended to mention all three repos + privacy split
- initial_prompt: clones controlplane in step 1, reads README+PLAN in step 5,
  scans recent commits in step 8, lists the four owned surfaces with public/private
  labels in step 10
- Daily cron: adds step 1b "PAIR RECENT CONTROLPLANE PRS" with the (i)/(ii)
  internal/customer-facing branching logic
- SETUP block: adds controlplane git pull
2026-04-14 21:06:41 -07:00
rabbitblood
7af5da31c2 chore(template): add Documentation Specialist as 3rd PM direct report
Adds a 13th workspace to the molecule-dev template owning end-to-end
documentation across all Molecule AI surfaces.

## Why now
- We just created Molecule-AI/docs (customer-facing site at
  doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs
  someone to own it.
- Internal docs (README.md, docs/architecture.md, docs/edit-history/) were
  drifting — every platform PR has been opening a docs sync PR manually.
- No agent in the team owned terminology consistency or stub backfill.

## Where it sits in the org
Third PM direct report, parallel to Research Lead and Dev Lead — docs is
its own swim lane that spans engineering (docs follow code) and
research/product (concepts and terminology).

  PM
  ├── Research Lead
  ├── Dev Lead
  └── Documentation Specialist  <-- new

## Schedules (2)

1. **Daily docs sync — backfill stubs and pair recent platform PRs**
   `0 9 * * *` — every morning:
   - Pair every merged platform PR (last 24h) with a docs PR if needed
   - Backfill one stub page on the docs site
   - Crawl the live site for broken links / dead anchors
   - delegate_task to PM with audit_summary (category=docs)

2. **Weekly terminology + freshness audit**
   `0 11 * * 1` — every Monday:
   - Stale page detection (>30 days untouched on fast-moving surfaces)
   - Terminology consistency check (one canonical name per concept)
   - Link-rot scan
   - Same audit_summary contract

## Plugins
Inherits the 9 universal defaults. Adds `browser-automation` for crawling
the live docs site. `molecule-skill-update-docs` is already in defaults
so the cross-repo sync skill is available.

## Routing
Adds `docs: [Documentation Specialist]` to `category_routing` so any
agent that emits an audit_summary with category=docs is auto-routed
here by the platform.

## Bind mounts
Note: this workspace clones BOTH /workspace/repo (the platform monorepo)
and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the
agent can edit either side.
2026-04-14 21:03:22 -07:00
rabbitblood
53fdffd2c5 chore(template): add Documentation Specialist as 3rd PM direct report
Adds a 13th workspace to the molecule-dev template owning end-to-end
documentation across all Molecule AI surfaces.

## Why now
- We just created Molecule-AI/docs (customer-facing site at
  doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs
  someone to own it.
- Internal docs (README.md, docs/architecture.md, docs/edit-history/) were
  drifting — every platform PR has been opening a docs sync PR manually.
- No agent in the team owned terminology consistency or stub backfill.

## Where it sits in the org
Third PM direct report, parallel to Research Lead and Dev Lead — docs is
its own swim lane that spans engineering (docs follow code) and
research/product (concepts and terminology).

  PM
  ├── Research Lead
  ├── Dev Lead
  └── Documentation Specialist  <-- new

## Schedules (2)

1. **Daily docs sync — backfill stubs and pair recent platform PRs**
   `0 9 * * *` — every morning:
   - Pair every merged platform PR (last 24h) with a docs PR if needed
   - Backfill one stub page on the docs site
   - Crawl the live site for broken links / dead anchors
   - delegate_task to PM with audit_summary (category=docs)

2. **Weekly terminology + freshness audit**
   `0 11 * * 1` — every Monday:
   - Stale page detection (>30 days untouched on fast-moving surfaces)
   - Terminology consistency check (one canonical name per concept)
   - Link-rot scan
   - Same audit_summary contract

## Plugins
Inherits the 9 universal defaults. Adds `browser-automation` for crawling
the live docs site. `molecule-skill-update-docs` is already in defaults
so the cross-repo sync skill is available.

## Routing
Adds `docs: [Documentation Specialist]` to `category_routing` so any
agent that emits an audit_summary with category=docs is auto-routed
here by the platform.

## Bind mounts
Note: this workspace clones BOTH /workspace/repo (the platform monorepo)
and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the
agent can edit either side.
2026-04-14 21:03:22 -07:00
Hongming Wang
6a8555ef61 Merge pull request #96 from Molecule-AI/feat/canvas-auth-redirect
feat(canvas): AuthGate — redirect anonymous users to cp login
2026-04-14 20:42:12 -07:00
Hongming Wang
96d88f42a6
Merge pull request #96 from Molecule-AI/feat/canvas-auth-redirect
feat(canvas): AuthGate — redirect anonymous users to cp login
2026-04-14 20:42:12 -07:00
Hongming Wang
043b8fe159 feat(canvas): AuthGate — redirect anonymous users to cp login (Phase F close)
Wraps the canvas root so every tenant-subdomain request checks for a
valid session and bounces to app.moleculesai.app/cp/auth/login with a
return_to pointing back at the current URL. Local dev + vercel preview
URLs + apex pass through unchanged.

Files:
- canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me
  (credentials:include for cross-origin cookie); returns Session on 200,
  null on 401 (anonymous, no throw), throws on 5xx so transient
  outages don't leak the UI.
- canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL
  with window.location.href as return_to; CP's isSafeReturnTo check
  rejects cross-domain bounces.
- canvas/src/components/AuthGate.tsx: client component wrapping
  children. State machine: loading → authenticated | anonymous. In
  non-SaaS mode (no tenant slug) skips the gate entirely.
- canvas/src/app/layout.tsx: wraps the root body in <AuthGate>.

Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include /
redirectToLogin href + signup variant). Full suite 453 green (was 447).

Pairs with molecule-controlplane PR #16 (return_to cookie handshake
on the cp side).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:37:26 -07:00
Hongming Wang
aedd3db697 feat(canvas): AuthGate — redirect anonymous users to cp login (Phase F close)
Wraps the canvas root so every tenant-subdomain request checks for a
valid session and bounces to app.moleculesai.app/cp/auth/login with a
return_to pointing back at the current URL. Local dev + vercel preview
URLs + apex pass through unchanged.

Files:
- canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me
  (credentials:include for cross-origin cookie); returns Session on 200,
  null on 401 (anonymous, no throw), throws on 5xx so transient
  outages don't leak the UI.
- canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL
  with window.location.href as return_to; CP's isSafeReturnTo check
  rejects cross-domain bounces.
- canvas/src/components/AuthGate.tsx: client component wrapping
  children. State machine: loading → authenticated | anonymous. In
  non-SaaS mode (no tenant slug) skips the gate entirely.
- canvas/src/app/layout.tsx: wraps the root body in <AuthGate>.

Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include /
redirectToLogin href + signup variant). Full suite 453 green (was 447).

Pairs with molecule-controlplane PR #16 (return_to cookie handshake
on the cp side).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:37:26 -07:00
rabbitblood
76a36e8062 fix(platform): panic-recovering supervisor for every background goroutine (#92)
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.

This PR:

1. Introduces `platform/internal/supervised/` with two primitives:

   a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
      On panic logs the stack + exponential-backoff restart (1s → 2s →
      4s → … → 30s cap). On clean return (fn decided to stop) returns.
      On ctx.Done cancels cleanly.

   b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
      staleThreshold) — shared in-memory liveness registry. Every
      subsystem calls Heartbeat(name) at the end of each tick so
      operators can distinguish "goroutine alive and healthy" from
      "alive but stuck inside a single tick".

2. Wraps every `go X.Start(ctx)` in main.go:
   - broadcaster.Subscribe   (Redis pub/sub relay → WebSocket)
   - registry.StartLivenessMonitor
   - registry.StartHealthSweep
   - scheduler.Start         (the one that died yesterday)
   - channelMgr.Start        (Telegram / Slack)

3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
   loop as the first end-to-end demonstration. Follow-up PRs will add
   heartbeats to the other four subsystems.

4. Adds `GET /admin/liveness` endpoint returning per-subsystem
   last_tick_at + seconds_ago. Operators can poll this and alert on
   any subsystem whose seconds_ago exceeds 2x its cron/tick interval.

5. Unit tests for RunWithRecover (clean return no restart; panic
   restarts with backoff; ctx cancel stops restart loop) and for the
   liveness registry.

Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on happy path; only lifts what
happens on a panic.

Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
2026-04-14 20:34:18 -07:00
rabbitblood
e4535560cf fix(platform): panic-recovering supervisor for every background goroutine (#92)
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.

This PR:

1. Introduces `platform/internal/supervised/` with two primitives:

   a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
      On panic logs the stack + exponential-backoff restart (1s → 2s →
      4s → … → 30s cap). On clean return (fn decided to stop) returns.
      On ctx.Done cancels cleanly.

   b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
      staleThreshold) — shared in-memory liveness registry. Every
      subsystem calls Heartbeat(name) at the end of each tick so
      operators can distinguish "goroutine alive and healthy" from
      "alive but stuck inside a single tick".

2. Wraps every `go X.Start(ctx)` in main.go:
   - broadcaster.Subscribe   (Redis pub/sub relay → WebSocket)
   - registry.StartLivenessMonitor
   - registry.StartHealthSweep
   - scheduler.Start         (the one that died yesterday)
   - channelMgr.Start        (Telegram / Slack)

3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
   loop as the first end-to-end demonstration. Follow-up PRs will add
   heartbeats to the other four subsystems.

4. Adds `GET /admin/liveness` endpoint returning per-subsystem
   last_tick_at + seconds_ago. Operators can poll this and alert on
   any subsystem whose seconds_ago exceeds 2x its cron/tick interval.

5. Unit tests for RunWithRecover (clean return no restart; panic
   restarts with backoff; ctx cancel stops restart loop) and for the
   liveness registry.

Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on happy path; only lifts what
happens on a panic.

Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
2026-04-14 20:34:18 -07:00
Backend Engineer
602dcb6283 fix(security): C6 — block loopback IP literals in /registry/register
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).

Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.

Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).

Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 03:34:14 +00:00
Backend Engineer
19bdd81ba4 fix(security): C6 — block loopback IP literals in /registry/register
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).

Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.

Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).

Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 03:34:14 +00:00
Hongming Wang
8019adb811 Merge pull request #90 from Molecule-AI/fix/scheduler-watchdog-recover
fix(scheduler): recover from panics + add liveness watchdog (#85)
2026-04-14 20:30:31 -07:00
Hongming Wang
c02bfb4257
Merge pull request #90 from Molecule-AI/fix/scheduler-watchdog-recover
fix(scheduler): recover from panics + add liveness watchdog (#85)
2026-04-14 20:30:31 -07:00
Hongming Wang
e33ff21cad Merge pull request #87 from Molecule-AI/chore/template-evolution-crons
chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
2026-04-14 20:30:26 -07:00
Hongming Wang
12ef17f8e0
Merge pull request #87 from Molecule-AI/chore/template-evolution-crons
chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
2026-04-14 20:30:26 -07:00
Hongming Wang
857218ad35 Merge pull request #81 from Molecule-AI/docs/sync-2026-04-15-tick-9
QA verified: docs-only change (PLAN.md + edit-history). CI green (all 6 checks pass). No code changes. Safe to merge.
2026-04-14 20:30:18 -07:00
Hongming Wang
092652770c
Merge pull request #81 from Molecule-AI/docs/sync-2026-04-15-tick-9
QA verified: docs-only change (PLAN.md + edit-history). CI green (all 6 checks pass). No code changes. Safe to merge.
2026-04-14 20:30:18 -07:00
Hongming Wang
e2ba3e3256 Merge pull request #91 from Molecule-AI/feat/canvas-saas-cross-origin
feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
2026-04-14 20:10:46 -07:00
Hongming Wang
e7275531d8
Merge pull request #91 from Molecule-AI/feat/canvas-saas-cross-origin
feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
2026-04-14 20:10:46 -07:00
Hongming Wang
3abcee11b3 feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go
cross-origin to https://app.moleculesai.app. This commit wires the
client side:

- canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from
  window.location.hostname, case-insensitive, matching the control
  plane's reservedSubdomains list (app/www/api/admin/…). Server-side
  + localhost + vercel preview URLs + apex all return "" so local dev
  keeps working.

- canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets
  credentials:"include" on every fetch. The control plane's CORS
  middleware allows the origin + credentials; the session cookie has
  Domain=.moleculesai.app so the browser ships it.

- canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its
  own fetch helper — shared slug+credentials logic applied).

Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS /
preview URL / apex). Full canvas suite 447/447 green.

Not in this PR:
- WS URL derivation for terminal/socket.ts (separate follow-up; WS
  needs its own slug-aware URL and the canvas terminal isn't used in
  SaaS launch day-one).
- Next.js rewrites (decided against; cross-origin with credentials
  is cleaner than path-level rewrites for session cookies).

Deploys to Vercel once merged — no manual config needed (env already set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:08:39 -07:00
Hongming Wang
c7537436ff feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go
cross-origin to https://app.moleculesai.app. This commit wires the
client side:

- canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from
  window.location.hostname, case-insensitive, matching the control
  plane's reservedSubdomains list (app/www/api/admin/…). Server-side
  + localhost + vercel preview URLs + apex all return "" so local dev
  keeps working.

- canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets
  credentials:"include" on every fetch. The control plane's CORS
  middleware allows the origin + credentials; the session cookie has
  Domain=.moleculesai.app so the browser ships it.

- canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its
  own fetch helper — shared slug+credentials logic applied).

Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS /
preview URL / apex). Full canvas suite 447/447 green.

Not in this PR:
- WS URL derivation for terminal/socket.ts (separate follow-up; WS
  needs its own slug-aware URL and the canvas terminal isn't used in
  SaaS launch day-one).
- Next.js rewrites (decided against; cross-origin with credentials
  is cleaner than path-level rewrites for session cookies).

Deploys to Vercel once merged — no manual config needed (env already set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:08:39 -07:00
rabbitblood
ef7f482593 fix(scheduler): recover from panics + add liveness watchdog (#85)
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.

Specifically:

  func (s *Scheduler) Start(ctx context.Context) {
      for {
          select {
          case <-ticker.C:
              s.tick(ctx)   // <- if this panics, the for-loop exits forever
          }
      }
  }

And inside tick:

  go func(s2 scheduleRow) {
      defer wg.Done()
      defer func() { <-sem }()
      s.fireSchedule(ctx, s2)   // <- panic here propagates up wg.Wait()
  }(sched)

Two `defer recover()` additions:

1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
   row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
   the rest of the batch down.

Plus a liveness watchdog:

- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
  2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.

Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.

Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
2026-04-14 19:32:01 -07:00
rabbitblood
7dc9d83792 fix(scheduler): recover from panics + add liveness watchdog (#85)
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.

Specifically:

  func (s *Scheduler) Start(ctx context.Context) {
      for {
          select {
          case <-ticker.C:
              s.tick(ctx)   // <- if this panics, the for-loop exits forever
          }
      }
  }

And inside tick:

  go func(s2 scheduleRow) {
      defer wg.Done()
      defer func() { <-sem }()
      s.fireSchedule(ctx, s2)   // <- panic here propagates up wg.Wait()
  }(sched)

Two `defer recover()` additions:

1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
   row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
   the rest of the batch down.

Plus a liveness watchdog:

- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
  2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.

Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.

Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
2026-04-14 19:32:01 -07:00
Hongming Wang
d81d15537e Merge pull request #89 from Molecule-AI/docs/sync-saas-progress
docs(plan): add Phase 32 current-state snapshot
2026-04-14 18:17:36 -07:00
Hongming Wang
15ad2a8dbe
Merge pull request #89 from Molecule-AI/docs/sync-saas-progress
docs(plan): add Phase 32 current-state snapshot
2026-04-14 18:17:36 -07:00
Hongming Wang
73887948b2 Merge pull request #88 from Molecule-AI/fix/tenant-guard-state-no-prefix
fix(middleware): tenant guard reads bare UUID from state= (pair with cp #8)
2026-04-14 18:14:14 -07:00
Hongming Wang
ff6499f634
Merge pull request #88 from Molecule-AI/fix/tenant-guard-state-no-prefix
fix(middleware): tenant guard reads bare UUID from state= (pair with cp #8)
2026-04-14 18:14:14 -07:00
Hongming Wang
c24c7bdb97 docs(plan): add Phase 32 current-state block
Point-in-time snapshot of the live SaaS infrastructure + which phases
are done vs in-flight vs not started. Links to molecule-controlplane's
own PLAN for deeper operator detail.
2026-04-14 18:13:47 -07:00
Hongming Wang
821ed3a532 docs(plan): add Phase 32 current-state block
Point-in-time snapshot of the live SaaS infrastructure + which phases
are done vs in-flight vs not started. Links to molecule-controlplane's
own PLAN for deeper operator detail.
2026-04-14 18:13:47 -07:00
Hongming Wang
7af4f10226 fix(middleware): tenant guard reads bare UUID from state= (no prefix)
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
2026-04-14 18:09:44 -07:00
Hongming Wang
e38257ac88 fix(middleware): tenant guard reads bare UUID from state= (no prefix)
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
2026-04-14 18:09:44 -07:00
rabbitblood
4f2b28c060 chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing
actively pushes the team to EVOLVE the four levers CEO named: templates,
plugins, channels, watchlist. The team-runs-24/7 goal needs both — defensive
reviews AND offensive evolution.

Adds 4 new schedules:

1. Research Lead — Daily ecosystem watch (0 8 * * *)
   Survey github.com/trending + HN + AI-blogs for new agent-infra projects
   from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day,
   commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables
   the watchlist pipeline that was paused earlier today.

2. Technical Researcher — Weekly plugin curation (0 9 * * 1, Mondays)
   Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps
   (builtin not exposed as plugin; role missing extras; rarely-used plugin
   in defaults). Survey upstream (claude.ai cookbook, MCP servers,
   anthropic/openai/langchain blogs). File 1-3 plugin proposals per week
   as GH issues with concrete integration sketches.

3. Dev Lead — Daily template fitness audit (30 8 * * *)
   Health-check the template itself: stale system prompts, schedules not
   firing (catches the #85 scheduler-died failure mode), roles missing
   plugins they should have, missing crons, channel gaps. File issues for
   any drift. Designed to catch the silent-stall pattern from today's
   incident.

4. DevOps Engineer — Weekly channel expansion survey (0 10 * * 1, Mondays)
   PM is the only role with a channel today (Telegram). Survey what
   channel infra the platform supports + what role-channel pairings would
   actually help (Security→email-on-critical, DevOps→Slack-on-CI-break,
   etc). File channel-proposal issues.

All four crons end with the structured audit_summary routing per #51/#75
(category, severity, issues, top_recommendation) so they integrate with
the platform-level category_routing PM uses to fan out work. The template's
existing category_routing block already maps research / plugins / template /
channels — these new crons consume exactly those slots.

Also drops three stale "# UNION with defaults (#71)" comments left from
the cleanup PR — those plugins lists are now self-documenting after #71.

Aligns with north-star goal: team should run 24/7 AND keep getting better
across templates / plugins / channels / watchlist. This PR closes the gap
where the "review" half of the loop was running but the "evolve" half had
no active driver.
2026-04-14 18:04:00 -07:00
rabbitblood
18ded13ab3 chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing
actively pushes the team to EVOLVE the four levers CEO named: templates,
plugins, channels, watchlist. The team-runs-24/7 goal needs both — defensive
reviews AND offensive evolution.

Adds 4 new schedules:

1. Research Lead — Daily ecosystem watch (0 8 * * *)
   Survey github.com/trending + HN + AI-blogs for new agent-infra projects
   from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day,
   commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables
   the watchlist pipeline that was paused earlier today.

2. Technical Researcher — Weekly plugin curation (0 9 * * 1, Mondays)
   Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps
   (builtin not exposed as plugin; role missing extras; rarely-used plugin
   in defaults). Survey upstream (claude.ai cookbook, MCP servers,
   anthropic/openai/langchain blogs). File 1-3 plugin proposals per week
   as GH issues with concrete integration sketches.

3. Dev Lead — Daily template fitness audit (30 8 * * *)
   Health-check the template itself: stale system prompts, schedules not
   firing (catches the #85 scheduler-died failure mode), roles missing
   plugins they should have, missing crons, channel gaps. File issues for
   any drift. Designed to catch the silent-stall pattern from today's
   incident.

4. DevOps Engineer — Weekly channel expansion survey (0 10 * * 1, Mondays)
   PM is the only role with a channel today (Telegram). Survey what
   channel infra the platform supports + what role-channel pairings would
   actually help (Security→email-on-critical, DevOps→Slack-on-CI-break,
   etc). File channel-proposal issues.

All four crons end with the structured audit_summary routing per #51/#75
(category, severity, issues, top_recommendation) so they integrate with
the platform-level category_routing PM uses to fan out work. The template's
existing category_routing block already maps research / plugins / template /
channels — these new crons consume exactly those slots.

Also drops three stale "# UNION with defaults (#71)" comments left from
the cleanup PR — those plugins lists are now self-documenting after #71.

Aligns with north-star goal: team should run 24/7 AND keep getting better
across templates / plugins / channels / watchlist. This PR closes the gap
where the "review" half of the loop was running but the "evolve" half had
no active driver.
2026-04-14 18:04:00 -07:00
Hongming Wang
e523ca9b20 Merge pull request #86 from Molecule-AI/docs/plugin-adaptor-header-fix
docs(plan): plugin adaptor system is shipped, not future work
2026-04-14 18:03:28 -07:00
Hongming Wang
5b814ca1a7
Merge pull request #86 from Molecule-AI/docs/plugin-adaptor-header-fix
docs(plan): plugin adaptor system is shipped, not future work
2026-04-14 18:03:28 -07:00
Hongming Wang
2db410cccb Merge pull request #84 from Molecule-AI/fix/tenant-guard-fly-replay-src
fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
2026-04-14 18:03:19 -07:00
Hongming Wang
a7619d4f9a
Merge pull request #84 from Molecule-AI/fix/tenant-guard-fly-replay-src
fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
2026-04-14 18:03:19 -07:00
Hongming Wang
c442d79aac docs(plan): rename 'Future Work — Plugin Adaptor System' to reflect shipped state
Header implied the whole system was future work, but the section body
says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor,
/plugins filter, SDK, agentskills.io spec compliance) all landed. Only
the bullets under 'Deferred, not blocking' are actually open.

Rename + lead with 'The system is done.' so a skim reader doesn't
misfile the whole topic as unshipped.
2026-04-14 18:02:28 -07:00
Hongming Wang
a99517f4ec docs(plan): rename 'Future Work — Plugin Adaptor System' to reflect shipped state
Header implied the whole system was future work, but the section body
says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor,
/plugins filter, SDK, agentskills.io spec compliance) all landed. Only
the bullets under 'Deferred, not blocking' are actually open.

Rename + lead with 'The system is done.' so a skim reader doesn't
misfile the whole topic as unshipped.
2026-04-14 18:02:28 -07:00
Hongming Wang
f1dd7cc367 fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
Phase B.3 pair-fix to the control plane's fly-replay state change.

Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.

TenantGuard now checks both paths in order:
  1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
  2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
     (production fly-replay path)

Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).

New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay-
Src header per Fly's format. Covered by a table-driven test with 7 cases
including malformed + empty-header + wrong-state-key.

Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:54:13 -07:00
Hongming Wang
522d055758 fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
Phase B.3 pair-fix to the control plane's fly-replay state change.

Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.

TenantGuard now checks both paths in order:
  1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
  2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
     (production fly-replay path)

Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).

New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay-
Src header per Fly's format. Covered by a table-driven test with 7 cases
including malformed + empty-header + wrong-state-key.

Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:54:13 -07:00