Commit Graph

206 Commits

Author SHA1 Message Date
Dev Lead Agent
cf8db07020 fix(canvas): WCAG critical — ARIA live toasts, dialog focus trap, keyboard nav
Addresses the three release-blocking WCAG violations from the UX audit
(3rd consecutive cycle) and the new ChatTab ARIA gap from Audit #2.

Changes:
- Toaster: split into polite (success/info) + assertive (error) live
  regions, both always in DOM so screen readers register them before
  any toast fires. Adds x dismiss button on every toast. Errors no
  longer auto-expire after 4s — persist until explicitly dismissed.
- ConfirmDialog: on open, requestAnimationFrame focuses the first
  button inside the dialog. Tab/Shift-Tab is now trapped inside the
  dialog while open. Added role="dialog" aria-modal="true" and
  aria-labelledby pointing to the title h3.
- WorkspaceNode: outer div gains role="button", tabIndex={0},
  aria-label, aria-pressed, and onKeyDown (Enter/Space => selectNode,
  ContextMenu key => openContextMenu). Keyboard-only users can now
  reach and activate workspace nodes.
- ChatTab sub-tab bar: role="tablist" on wrapper, role="tab" +
  aria-selected + aria-controls on each button, matching
  role="tabpanel" + id on each panel div. Textarea gets
  aria-label="Message to agent".

453/453 Vitest tests pass. Production build clean (Next.js 15).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 08:31:06 +00:00
Hongming Wang
4a65c72860
Merge pull request #130 from Molecule-AI/chore/eco-watch-2026-04-15
chore: ecosystem watch 2026-04-15 — scion, claude-mem, multica
2026-04-15 01:22:19 -07:00
Hongming Wang
5d2777bbcf
Merge pull request #123 from Molecule-AI/fix/settings-dark-theme-a11y
fix(canvas): dark theme a11y — settings buttons, input fields, ReactFlow colorMode, zinc-400 contrast, aria-labels
2026-04-15 01:22:16 -07:00
Hongming Wang
a44cd0156a
Merge pull request #122 from Molecule-AI/fix/provisioning-grid-origin
fix(canvas): WORKSPACE_PROVISIONING grid origin offset — prevent viewport clipping
2026-04-15 01:22:13 -07:00
Hongming Wang
a7e9d0b824 chore: eco-watch 2026-04-15 — add scion, claude-mem, multica 2026-04-15 08:15:56 +00:00
Dev Lead Agent
3df2130458 fix(canvas): dark theme a11y — settings buttons, input fields, ReactFlow colorMode, zinc-400 contrast, aria-labels
Resolves low-contrast text and theming issues in the settings panel and
canvas overlays when running in dark mode:

- settings-panel.css: input fields (#d4d4d8 text), settings-button--active
  (#1e3a8a bg for better contrast against #3b82f6 accent)
- SearchDialog: placeholder-zinc-400, kbd hints, tier badge, footer counts,
  empty-state text — all lifted from zinc-600 → zinc-400
- ConversationTraceModal: timestamp, arrow separators, truncation ellipsis
  — lifted from zinc-600 → zinc-400
- CommunicationOverlay: arrow separator, age label, duration — zinc-600 → zinc-400
- TemplatePalette: dynamic aria-label on toggle button
  ("Open/Close template palette") for screen-reader clarity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:56:53 +00:00
Dev Lead Agent
3b7da330f1 fix(canvas): WORKSPACE_PROVISIONING grid origin offset — prevent viewport clipping
New nodes were placed at (0,0) or close to it, causing them to spawn
behind the toolbar/palette chrome and require manual panning to find.
Add GRID_ORIGIN_X/Y = 100 offset so the first node lands in clear canvas
space, and update the position assertion in the unit test accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:53:45 +00:00
Hongming Wang
8ba88011b4
Merge pull request #109 from Molecule-AI/feat/issue-101-github-workflow-run
feat(webhooks): #101 — GitHub workflow_run event → DevOps A2A
2026-04-15 00:51:01 -07:00
Hongming Wang
7a41d67fa3
Merge pull request #108 from Molecule-AI/fix/issue-93-category-routing
fix: #93 category_routing + #105 X-RateLimit headers
2026-04-15 00:50:58 -07:00
Hongming Wang
de6ebe2262
Merge pull request #106 from Molecule-AI/fix/org-import-path-traversal
fix(security): #103 — path-sanitize + admin-gate POST /org/import
2026-04-15 00:26:16 -07:00
Hongming Wang
7859d43685
Merge pull request #95 from Molecule-AI/fix/supervised-goroutines
fix(platform): panic-recovering supervisor for every background goroutine (#92)
2026-04-15 00:26:13 -07:00
Hongming Wang
f8c1b786ac
Merge pull request #99 from Molecule-AI/fix/auth-middleware-critical
fix(security): C1 — auth-gate GET /workspaces + middleware test coverage (C4/C8/C10/C11)
2026-04-15 00:26:10 -07:00
Hongming Wang
958789f4ba feat(webhooks): #101 — workflow_run event → DevOps A2A
Closes #101 layer 1: buildGitHubA2APayload now handles workflow_run
events, routing failed CI runs to a workspace via the existing
X-Molecule-Workspace-ID / webhook path. Only completed runs with a
failure/cancelled/timed_out conclusion fan out — success/skipped/neutral
are dropped via errIgnoredGitHubAction.

Surface message is human-readable + includes the run URL so DevOps can
jump straight to the failing job. Metadata carries the full run context
(workflow_name, run_id, run_number, conclusion, head_branch, head_sha,
run_url, trigger_event) for programmatic handling.

4 new tests cover the failure path, success skip, non-completed action
skip, and short-SHA edge case.

Layer 2 (org.yaml wiring for DevOps workspace + GITHUB_WEBHOOK_SECRET
docs) stays as a follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:25:49 -07:00
Hongming Wang
2a74a7b11b fix: #93 category_routing + #105 X-RateLimit headers
Closes #93 and #105.

#93 — add research/plugins/template/channels entries to org.yaml
category_routing defaults. Without them, evolution crons firing with
these categories found no target and their audit summaries silently
dropped at PM. Routes each back to the role that generated it so the
author acts on their own findings.

#105 — emit X-RateLimit-Limit / -Remaining / -Reset on every response
(allowed and throttled) and Retry-After on 429s per RFC 6585. 2 tests
cover both paths. Clients and monitoring tools can now back off
proactively instead of polling into 429 walls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:23:46 -07:00
Hongming Wang
418a250d54 test(e2e): skip count=0 post-delete assertion — conflicts with #99 C1 gate
Soft-delete leaves workspace_auth_tokens rows alive, so HasAnyLiveTokenGlobal
stays non-zero and admin-auth 401s an unauth GET /workspaces. The assertion
was verifying deletion, not auth; the bundle round-trip below still covers
the deletion path end-to-end.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:22:02 -07:00
Hongming Wang
4dbf335d7f fix(security): #103 — path-sanitize + admin-gate POST /org/import
Closes #103 (HIGH). Three attack surfaces on the import endpoint —
body.Dir, workspace.Template, workspace.FilesDir — were concatenated
via filepath.Join without validation, letting an unauthenticated
caller probe arbitrary filesystem paths with "../../../etc".

Two layers of defense:
  1. resolveInsideRoot() rejects absolute paths and any relative path
     whose lexically cleaned join escapes the provided root (Abs +
     HasPrefix + separator guard). 6 tests cover happy path, traversal
     attempts, absolute path, empty input, prefix-sibling escape, and
     deep subpath resolution.
  2. Route now runs behind middleware.AdminAuth so an unauthenticated
     attacker can't reach the handler at all once a token exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:18:09 -07:00
Hongming Wang
80b0ad25ff
Merge pull request #94 from Molecule-AI/fix/c6-loopback-ssrf
fix(security): C6 — block loopback IP literals in /registry/register
2026-04-15 00:15:23 -07:00
Hongming Wang
593c7e2984 merge: resolve scheduler conflicts with main (#85 panic-recover + supervised heartbeat) 2026-04-15 00:12:29 -07:00
Hongming Wang
a25daa633f test(e2e): pass bearer token to admin-gated GET /workspaces calls
C1 fix (#99) moved GET /workspaces behind AdminAuth. Three late-script
calls that run after tokens exist now include Authorization headers;
the post-delete-all call stays anonymous since revoked tokens trigger
the no-live-token fail-open path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:11:29 -07:00
Hongming Wang
d55362fece
Merge pull request #98 from Molecule-AI/chore/template-evolution-crons-hourly
chore(template): evolution crons hourly instead of daily/weekly
2026-04-15 00:08:19 -07:00
Hongming Wang
b669b9f6ee
Merge pull request #97 from Molecule-AI/chore/template-documentation-specialist
chore(template): add Documentation Specialist as 3rd PM direct report
2026-04-15 00:08:16 -07:00
Hongming Wang
edcfd615d7
Merge pull request #102 from Molecule-AI/fix/can-communicate-ancestor-chain
fix(registry): allow ancestor↔descendant A2A so audit_summary can reach PM
2026-04-15 00:08:12 -07:00
rabbitblood
0653e78262 fix(registry): allow ancestor↔descendant A2A so audit_summary can reach PM
Found via deep workspace inspection during a maintenance cycle: Security
Auditor's hourly cron correctly tries to delegate_task its audit_summary
to PM, the platform proxy rejects with "access denied: workspaces cannot
communicate per hierarchy", the agent falls back to delegating to its
direct parent (Dev Lead), and PM's category_routing dispatcher (#75) is
never reached.

This breaks the audit-routing contract end-to-end. Every audit cycle was
landing on Dev Lead instead of being fanned out via PM's category_routing
to the right dev role (security → BE+DevOps, ui/ux → FE, etc).

## Root cause
`registry.CanCommunicate()` only allowed:
- self → self
- siblings (same parent)
- root-level siblings
- direct parent → child
- direct child → parent

A grandchild → grandparent (Security Auditor → PM, where parent is Dev
Lead and grandparent is PM) was DENIED. The original design wanted strict
hierarchy to prevent rogue horizontal A2A — but it also broke the
fundamental "child can talk to its leadership chain" pattern that any
audit/escalation flow needs.

## Fix
Generalise to ancestor ↔ descendant. Any workspace can talk to any
ancestor (any depth) and any descendant (any depth). Direct parent/child
remains a fast path that avoids the walk. Sibling rules unchanged.

Cousins still cannot directly communicate (would need to go through their
shared ancestor). Cross-subtree A2A is still rejected.

Implementation: `isAncestorOf(ancestorID, childID)` walks the parent
chain in Go with a maxAncestorWalk=32 safety cap so a malformed cycle in
the workspaces table cannot loop forever. One DB lookup per step. For a
typical 3-deep tree, this adds 1-2 extra lookups vs the old direct-parent
fast path. Could be optimized to a single recursive CTE if profiling
shows it matters; not now.

## Tests
- TestCanCommunicate_Denied_Grandchild → REPLACED with two new tests:
  - TestCanCommunicate_Allowed_GrandparentToGrandchild
  - TestCanCommunicate_Allowed_GrandchildToGrandparent  (the actual bug)
- TestCanCommunicate_Allowed_DeepAncestor — 4-level chain
- TestCanCommunicate_Denied_UnrelatedAncestors — ensures cross-subtree
  walks still terminate denied
- TestCanCommunicate_Denied_DifferentParents — extended with the walk
  lookup mocks so sqlmock doesn't log warnings
- TestCanCommunicate_Denied_CousinToRoot — same

All 13 tests pass clean. The previous direct parent/child / siblings /
self tests are unchanged (fast paths preserved).

## Why platform-level
Per the "platform-wide fixes are mine to ship" rule. Every org template
hits the same broken audit-routing chain — fixing it at the platform
benefits all users, not just molecule-dev. This unblocks #50 (PM
dispatcher prompt) and #75 (category_routing).
2026-04-14 22:18:38 -07:00
Backend Engineer
80c2161687 fix(security): C1 — gate GET /workspaces behind AdminAuth; add auth middleware tests
Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology
without any authentication. The endpoint was intentionally left open for
the canvas browser frontend; this PR closes that gap.

Router change:
- Move GET /workspaces from the bare root router into the wsAdmin AdminAuth
  group alongside POST /workspaces and DELETE /workspaces/:id.
- AdminAuth uses the same fail-open bootstrap contract as all other auth
  gates: fresh installs (no live tokens) pass through; once any workspace
  has registered with a token, a valid bearer is required.

Status of findings C2–C11 (documented here for audit trail):
- C2  POST   /workspaces/:id/activity           → already in wsAuth group (Cycle 5)
- C3  POST   /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5)
- C4  POST   /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5)
- C5  GET    /workspaces/:id/delegations        → already in wsAuth group (Cycle 5)
- C7  GET    /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C8  POST   /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C9  POST   /workspaces/:id/delegate           → already in wsAuth group (Cycle 5)
- C10 GET    /admin/secrets                     → already in adminAuth group (Cycle 7)
- C11 POST+DELETE /admin/secrets                → already in adminAuth group (Cycle 7)

Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new):
WorkspaceAuth:
  - fail-open when workspace has no tokens (bootstrap path)
  - C4: no bearer on /delegations/:id/update → 401
  - C8: no bearer on /memories POST → 401
  - invalid bearer → 401
  - cross-workspace token replay → 401
  - valid bearer for correct workspace → 200

AdminAuth:
  - fail-open when no tokens exist globally (fresh install)
  - C10: no bearer on GET /admin/secrets → 401
  - C11: no bearer on POST /admin/secrets → 401
  - C11: no bearer on DELETE /admin/secrets/:key → 401
  - valid bearer → 200
  - invalid bearer → 401

Note: did NOT touch DELETE /admin/secrets in production — no destructive
calls to live secrets endpoints were made during this work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:37:14 +00:00
Backend Engineer
63e482f05b fix(security): C6 — extend SSRF blocklist to RFC-1918 private ranges
PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16
(link-local/IMDS). An attacker could still register a workspace with
a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and
redirect A2A proxy traffic to internal services.

Block all five reserved ranges in validateAgentURL:
  - 169.254.0.0/16  link-local (IMDS: AWS/GCP/Azure)
  - 127.0.0.0/8     loopback (self-SSRF)
  - 10.0.0.0/8      RFC-1918
  - 172.16.0.0/12   RFC-1918 (includes Docker bridge networks)
  - 192.168.0.0/16  RFC-1918

Agents must use DNS hostnames, not IP literals. The provisioner
still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard
preserves those); this blocklist only applies to the /registry/register
request body.

Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection;
added 9 new cases covering range boundaries and the Docker bridge range.
All 22 validateAgentURL subtests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:35:05 +00:00
rabbitblood
c0142edbce chore(template): switch evolution crons from daily/weekly to hourly
CEO 2026-04-15: the team's evolution loops should be hourly, not daily/weekly.
A 24h or 7d cadence is the wrong rhythm for a team that's expected to run 24/7
and keep improving. At hourly, every drift, every new project, every plugin
gap, every channel opportunity gets surfaced within an hour of becoming visible.

| Schedule                          | Was            | Now          |
|-----------------------------------|----------------|--------------|
| Hourly ecosystem watch            | 0 8 * * *      | 8 * * * *    |
| Hourly plugin curation            | 0 9 * * 1      | 22 * * * *   |
| Hourly template fitness audit     | 30 8 * * *     | 15 * * * *   |
| Hourly channel expansion survey   | 0 10 * * 1     | 47 * * * *   |

Spread across the hour (:08, :11, :15, :17, :22, :47) so the four evolution
crons + UIUX :11 + Security :17 don't collide and don't all bury PM with
audit_summary deliveries at the same instant.

Renamed from "Daily..." / "Weekly..." to "Hourly..." to match the new cadence
and so the prompts (which still say "Daily survey" etc.) read consistently.
A follow-up will fix the body wording.

Live-synced into running DB via PATCH (3 of 4) and direct UPDATE on the 4th
(Dev Lead workspace requires a token the script didn't have). next_run_at
recomputed for all 4. First fire: 04:47 UTC (channel expansion).
2026-04-14 21:33:31 -07:00
rabbitblood
101f284e5d fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.

Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
   keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
   the per-fire work completes

(The post-tick Heartbeat in Start() is still there as the "all idle" case.)

Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
2026-04-14 21:20:06 -07:00
rabbitblood
41e39c2626 chore(template): Documentation Specialist also watches private molecule-controlplane
Per CEO 2026-04-15: the SaaS controlplane (Molecule-AI/molecule-controlplane,
PRIVATE Go/Fly.io provisioner) needs documentation coverage too.

Updates the agent's role description, initial_prompt, and daily docs-sync
cron to handle a third repo with a strict public/private split.

## Privacy rule (the critical addition)

molecule-controlplane is private. Two-bucket model:

  Internal-only changes (handlers, schemas, infra config, billing logic,
  fly.toml, provisioner internals) → docs go INSIDE the controlplane repo
  itself (README.md, PLAN.md, docs/internal/*.md). NEVER mentioned in the
  public docs site.

  Customer-facing changes (new tier, new region, new SLA, pricing change,
  signup flow change) → sanitized PUBLIC description on doc.moleculesai.app.
  Describes the PRODUCT, never the implementation.

  When unsure: default to internal-only and ask PM before publishing.

The privacy rule is repeated three times in the prompt (top of initial_prompt,
1b inside the daily cron, and the role description) so the agent can't miss it.

## Changes
- role: extended to mention all three repos + privacy split
- initial_prompt: clones controlplane in step 1, reads README+PLAN in step 5,
  scans recent commits in step 8, lists the four owned surfaces with public/private
  labels in step 10
- Daily cron: adds step 1b "PAIR RECENT CONTROLPLANE PRS" with the (i)/(ii)
  internal/customer-facing branching logic
- SETUP block: adds controlplane git pull
2026-04-14 21:06:41 -07:00
rabbitblood
53fdffd2c5 chore(template): add Documentation Specialist as 3rd PM direct report
Adds a 13th workspace to the molecule-dev template owning end-to-end
documentation across all Molecule AI surfaces.

## Why now
- We just created Molecule-AI/docs (customer-facing site at
  doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs
  someone to own it.
- Internal docs (README.md, docs/architecture.md, docs/edit-history/) were
  drifting — every platform PR has been opening a docs sync PR manually.
- No agent in the team owned terminology consistency or stub backfill.

## Where it sits in the org
Third PM direct report, parallel to Research Lead and Dev Lead — docs is
its own swim lane that spans engineering (docs follow code) and
research/product (concepts and terminology).

  PM
  ├── Research Lead
  ├── Dev Lead
  └── Documentation Specialist  <-- new

## Schedules (2)

1. **Daily docs sync — backfill stubs and pair recent platform PRs**
   `0 9 * * *` — every morning:
   - Pair every merged platform PR (last 24h) with a docs PR if needed
   - Backfill one stub page on the docs site
   - Crawl the live site for broken links / dead anchors
   - delegate_task to PM with audit_summary (category=docs)

2. **Weekly terminology + freshness audit**
   `0 11 * * 1` — every Monday:
   - Stale page detection (>30 days untouched on fast-moving surfaces)
   - Terminology consistency check (one canonical name per concept)
   - Link-rot scan
   - Same audit_summary contract

## Plugins
Inherits the 9 universal defaults. Adds `browser-automation` for crawling
the live docs site. `molecule-skill-update-docs` is already in defaults
so the cross-repo sync skill is available.

## Routing
Adds `docs: [Documentation Specialist]` to `category_routing` so any
agent that emits an audit_summary with category=docs is auto-routed
here by the platform.

## Bind mounts
Note: this workspace clones BOTH /workspace/repo (the platform monorepo)
and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the
agent can edit either side.
2026-04-14 21:03:22 -07:00
Hongming Wang
96d88f42a6
Merge pull request #96 from Molecule-AI/feat/canvas-auth-redirect
feat(canvas): AuthGate — redirect anonymous users to cp login
2026-04-14 20:42:12 -07:00
Hongming Wang
aedd3db697 feat(canvas): AuthGate — redirect anonymous users to cp login (Phase F close)
Wraps the canvas root so every tenant-subdomain request checks for a
valid session and bounces to app.moleculesai.app/cp/auth/login with a
return_to pointing back at the current URL. Local dev + vercel preview
URLs + apex pass through unchanged.

Files:
- canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me
  (credentials:include for cross-origin cookie); returns Session on 200,
  null on 401 (anonymous, no throw), throws on 5xx so transient
  outages don't leak the UI.
- canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL
  with window.location.href as return_to; CP's isSafeReturnTo check
  rejects cross-domain bounces.
- canvas/src/components/AuthGate.tsx: client component wrapping
  children. State machine: loading → authenticated | anonymous. In
  non-SaaS mode (no tenant slug) skips the gate entirely.
- canvas/src/app/layout.tsx: wraps the root body in <AuthGate>.

Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include /
redirectToLogin href + signup variant). Full suite 453 green (was 447).

Pairs with molecule-controlplane PR #16 (return_to cookie handshake
on the cp side).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:37:26 -07:00
rabbitblood
e4535560cf fix(platform): panic-recovering supervisor for every background goroutine (#92)
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.

This PR:

1. Introduces `platform/internal/supervised/` with two primitives:

   a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
      On panic logs the stack + exponential-backoff restart (1s → 2s →
      4s → … → 30s cap). On clean return (fn decided to stop) returns.
      On ctx.Done cancels cleanly.

   b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
      staleThreshold) — shared in-memory liveness registry. Every
      subsystem calls Heartbeat(name) at the end of each tick so
      operators can distinguish "goroutine alive and healthy" from
      "alive but stuck inside a single tick".

2. Wraps every `go X.Start(ctx)` in main.go:
   - broadcaster.Subscribe   (Redis pub/sub relay → WebSocket)
   - registry.StartLivenessMonitor
   - registry.StartHealthSweep
   - scheduler.Start         (the one that died yesterday)
   - channelMgr.Start        (Telegram / Slack)

3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
   loop as the first end-to-end demonstration. Follow-up PRs will add
   heartbeats to the other four subsystems.

4. Adds `GET /admin/liveness` endpoint returning per-subsystem
   last_tick_at + seconds_ago. Operators can poll this and alert on
   any subsystem whose seconds_ago exceeds 2x its cron/tick interval.

5. Unit tests for RunWithRecover (clean return no restart; panic
   restarts with backoff; ctx cancel stops restart loop) and for the
   liveness registry.

Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on happy path; only lifts what
happens on a panic.

Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
2026-04-14 20:34:18 -07:00
Backend Engineer
19bdd81ba4 fix(security): C6 — block loopback IP literals in /registry/register
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).

Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.

Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).

Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 03:34:14 +00:00
Hongming Wang
c02bfb4257
Merge pull request #90 from Molecule-AI/fix/scheduler-watchdog-recover
fix(scheduler): recover from panics + add liveness watchdog (#85)
2026-04-14 20:30:31 -07:00
Hongming Wang
12ef17f8e0
Merge pull request #87 from Molecule-AI/chore/template-evolution-crons
chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
2026-04-14 20:30:26 -07:00
Hongming Wang
092652770c
Merge pull request #81 from Molecule-AI/docs/sync-2026-04-15-tick-9
QA verified: docs-only change (PLAN.md + edit-history). CI green (all 6 checks pass). No code changes. Safe to merge.
2026-04-14 20:30:18 -07:00
Hongming Wang
e7275531d8
Merge pull request #91 from Molecule-AI/feat/canvas-saas-cross-origin
feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
2026-04-14 20:10:46 -07:00
Hongming Wang
c7537436ff feat(canvas): SaaS cross-origin — slug header + cookie credentials (Phase F)
Canvas will be served at <slug>.moleculesai.app (Vercel). API calls go
cross-origin to https://app.moleculesai.app. This commit wires the
client side:

- canvas/src/lib/tenant.ts: getTenantSlug() derives the slug from
  window.location.hostname, case-insensitive, matching the control
  plane's reservedSubdomains list (app/www/api/admin/…). Server-side
  + localhost + vercel preview URLs + apex all return "" so local dev
  keeps working.

- canvas/src/lib/api.ts: adds X-Molecule-Org-Slug header + sets
  credentials:"include" on every fetch. The control plane's CORS
  middleware allows the origin + credentials; the session cookie has
  Domain=.moleculesai.app so the browser ships it.

- canvas/src/lib/api/secrets.ts: same treatment (secrets API uses its
  own fetch helper — shared slug+credentials logic applied).

Tests: +6 (tenant.test.ts covers slug / reserved / case / non-SaaS /
preview URL / apex). Full canvas suite 447/447 green.

Not in this PR:
- WS URL derivation for terminal/socket.ts (separate follow-up; WS
  needs its own slug-aware URL and the canvas terminal isn't used in
  SaaS launch day-one).
- Next.js rewrites (decided against; cross-origin with credentials
  is cleaner than path-level rewrites for session cookies).

Deploys to Vercel once merged — no manual config needed (env already set).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:08:39 -07:00
rabbitblood
7dc9d83792 fix(scheduler): recover from panics + add liveness watchdog (#85)
The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for
12+ hours. Platform restart didn't recover it. Root cause: tick() and
fireSchedule() goroutines have no panic recovery. A single bad row, bad
cron expression, DB blip, or transient panic anywhere in the chain
permanently kills the scheduler goroutine — and the only signal to an
operator is "no crons firing", which is invisible if you're not watching.

Specifically:

  func (s *Scheduler) Start(ctx context.Context) {
      for {
          select {
          case <-ticker.C:
              s.tick(ctx)   // <- if this panics, the for-loop exits forever
          }
      }
  }

And inside tick:

  go func(s2 scheduleRow) {
      defer wg.Done()
      defer func() { <-sem }()
      s.fireSchedule(ctx, s2)   // <- panic here propagates up wg.Wait()
  }(sched)

Two `defer recover()` additions:

1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse,
   row processing) is logged and the next tick fires normally.
2. In each fireSchedule goroutine — a single bad workspace can't take
   the rest of the batch down.

Plus a liveness watchdog:

- Scheduler now records `lastTickAt` after each successful tick.
- New methods `LastTickAt()` and `Healthy()` (true if last tick within
  2× pollInterval = 60s).
- Initialised at Start so Healthy() returns true on a fresh process.

Endpoint plumbing for /admin/scheduler/health is a follow-up — needs
threading the scheduler instance through router.Setup(). Documented
on #85.

Closes the silent-outage failure mode of #85. The other proposed
fixes (force-kill on /restart hang, active_tasks watchdog) are
separate concerns tracked in #85's comments.
2026-04-14 19:32:01 -07:00
Hongming Wang
15ad2a8dbe
Merge pull request #89 from Molecule-AI/docs/sync-saas-progress
docs(plan): add Phase 32 current-state snapshot
2026-04-14 18:17:36 -07:00
Hongming Wang
ff6499f634
Merge pull request #88 from Molecule-AI/fix/tenant-guard-state-no-prefix
fix(middleware): tenant guard reads bare UUID from state= (pair with cp #8)
2026-04-14 18:14:14 -07:00
Hongming Wang
821ed3a532 docs(plan): add Phase 32 current-state block
Point-in-time snapshot of the live SaaS infrastructure + which phases
are done vs in-flight vs not started. Links to molecule-controlplane's
own PLAN for deeper operator detail.
2026-04-14 18:13:47 -07:00
Hongming Wang
e38257ac88 fix(middleware): tenant guard reads bare UUID from state= (no prefix)
Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the
fly-replay state value contains '=', so the control plane now puts the
bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the
whole 'state=...' value as the org id.
2026-04-14 18:09:44 -07:00
rabbitblood
18ded13ab3 chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
Today's crons are all REVIEW (Security audit, UIUX audit, QA tests). Nothing
actively pushes the team to EVOLVE the four levers CEO named: templates,
plugins, channels, watchlist. The team-runs-24/7 goal needs both — defensive
reviews AND offensive evolution.

Adds 4 new schedules:

1. Research Lead — Daily ecosystem watch (0 8 * * *)
   Survey github.com/trending + HN + AI-blogs for new agent-infra projects
   from the last 24h. Add 1-3 entries to docs/ecosystem-watch.md per day,
   commit to chore/eco-watch-YYYY-MM-DD branch + push + PR. Re-enables
   the watchlist pipeline that was paused earlier today.

2. Technical Researcher — Weekly plugin curation (0 9 * * 1, Mondays)
   Inventory plugins/ + builtin_tools/ + recent landings. Identify gaps
   (builtin not exposed as plugin; role missing extras; rarely-used plugin
   in defaults). Survey upstream (claude.ai cookbook, MCP servers,
   anthropic/openai/langchain blogs). File 1-3 plugin proposals per week
   as GH issues with concrete integration sketches.

3. Dev Lead — Daily template fitness audit (30 8 * * *)
   Health-check the template itself: stale system prompts, schedules not
   firing (catches the #85 scheduler-died failure mode), roles missing
   plugins they should have, missing crons, channel gaps. File issues for
   any drift. Designed to catch the silent-stall pattern from today's
   incident.

4. DevOps Engineer — Weekly channel expansion survey (0 10 * * 1, Mondays)
   PM is the only role with a channel today (Telegram). Survey what
   channel infra the platform supports + what role-channel pairings would
   actually help (Security→email-on-critical, DevOps→Slack-on-CI-break,
   etc). File channel-proposal issues.

All four crons end with the structured audit_summary routing per #51/#75
(category, severity, issues, top_recommendation) so they integrate with
the platform-level category_routing PM uses to fan out work. The template's
existing category_routing block already maps research / plugins / template /
channels — these new crons consume exactly those slots.

Also drops three stale "# UNION with defaults (#71)" comments left from
the cleanup PR — those plugins lists are now self-documenting after #71.

Aligns with north-star goal: team should run 24/7 AND keep getting better
across templates / plugins / channels / watchlist. This PR closes the gap
where the "review" half of the loop was running but the "evolve" half had
no active driver.
2026-04-14 18:04:00 -07:00
Hongming Wang
5b814ca1a7
Merge pull request #86 from Molecule-AI/docs/plugin-adaptor-header-fix
docs(plan): plugin adaptor system is shipped, not future work
2026-04-14 18:03:28 -07:00
Hongming Wang
a7619d4f9a
Merge pull request #84 from Molecule-AI/fix/tenant-guard-fly-replay-src
fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
2026-04-14 18:03:19 -07:00
Hongming Wang
a99517f4ec docs(plan): rename 'Future Work — Plugin Adaptor System' to reflect shipped state
Header implied the whole system was future work, but the section body
says the core (per-runtime adapters, hybrid resolver, AgentskillsAdaptor,
/plugins filter, SDK, agentskills.io spec compliance) all landed. Only
the bullets under 'Deferred, not blocking' are actually open.

Rename + lead with 'The system is done.' so a skim reader doesn't
misfile the whole topic as unshipped.
2026-04-14 18:02:28 -07:00
Hongming Wang
522d055758 fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state
Phase B.3 pair-fix to the control plane's fly-replay state change.

Background: the private molecule-controlplane's router emits
`fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays
the request to the tenant and injects `Fly-Replay-Src: instance=Z;...;
state=org-id=<uuid>` on the replayed request. But response headers from
the cp (like X-Molecule-Org-Id) never travel to the replayed tenant —
only the state= param does.

TenantGuard now checks both paths in order:
  1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli)
  2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment
     (production fly-replay path)

Either matching configured MOLECULE_ORG_ID → allow. Neither matches →
404 (still don't leak tenant existence).

New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay-
Src header per Fly's format. Covered by a table-driven test with 7 cases
including malformed + empty-header + wrong-state-key.

Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:54:13 -07:00
Hongming Wang
63cf7e5693
Merge pull request #83 from Molecule-AI/fix/fly-registry-username
fix(ci): revert Fly registry username to 'x' — 401 on any other value
2026-04-14 17:26:12 -07:00
Hongming Wang
8decdd491e fix(ci): revert Fly registry username to 'x' — 'molecule-ai' gets 401
Post-mortem on the failed publish-platform-image run on main (PR #82):

Fly's Docker registry requires username EXACTLY equal to "x". My
code-review "readability fix" changing it to "molecule-ai" caused
every push to return 401 Unauthorized. Verified locally:

  echo $FLY_API_TOKEN | docker login registry.fly.io -u x --password-stdin
  → Login Succeeded

  echo $FLY_API_TOKEN | docker login registry.fly.io -u molecule-ai --password-stdin
  → 401 Unauthorized

Lesson: don't second-guess docs that specify a literal value. Comment
now says "MUST be literal 'x'" with a 2026-04-15 verification note to
prevent future regressions.

Code-review process improvement: when reviewing a change against a
vendor API, prefer "preserve exact doc-specified values" over readability
suggestions. Logged as a cron-learning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 17:21:53 -07:00