Commit Graph

1070 Commits

Author SHA1 Message Date
Hongming Wang
00126db676 Merge pull request #150 from Molecule-AI/chore/eco-watch-2026-04-15
chore(eco-watch): 2026-04-15 daily survey — Skills CLI, Archon, Claude Code Routines
2026-04-15 03:20:58 -07:00
Hongming Wang
a95fbf7f52 Merge pull request #149 from Molecule-AI/fix/140-scheduler-heartbeat-pulse
fix(scheduler): independent heartbeat pulse so liveness doesn't false-stale during long fires (#140)
2026-04-15 03:20:55 -07:00
Research Lead
f922f87a22 chore(eco-watch): 2026-04-15 daily survey — 3 new entries, 3 issues
New entries:
- vercel-labs/skills: canonical agentskills.io CLI (14.2k , +153)
- coleam00/Archon: YAML-DAG harness builder for AI coding (18.1k , +396)
- Claude Code Routines: Anthropic cloud-scheduled agents (611 HN pts)

Issues filed:
- #146 plugins/: align with agentskills.io SKILL.md spec
- #147 workspace_schedules: add GitHub event trigger types
- #148 workspace-template/: workflow.yaml YAML-DAG convention

HEAD at survey time: 229c2ab

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 10:14:59 +00:00
rabbitblood
b00f478b6e fix(scheduler): independent heartbeat pulse so liveness doesn't false-stale during long fires (#140)
The #95 scheduler heartbeat scheme relied on:
1. Top of tick() (once per poll interval)
2. Per-fire goroutine entry + exit

That leaves a gap: tick() ends with wg.Wait(), so if a single fire takes
longer than pollInterval (UIUX audits routinely take 60-120s; max fireTimeout
is 5min), the next tick doesn't run and no top-of-tick heartbeat fires.
Per-fire heartbeats only bracket the fire — between entry and the HTTP
response returning, nothing heartbeats either.

Observed today: /admin/liveness reports seconds_ago=251 while docker logs
show the scheduler actively firing 'Hourly ecosystem watch'. Scheduler is
fine; liveness is lying.

Adds an independent 10s heartbeat pulse goroutine inside Start(), decoupled
from tick completion. The existing heartbeats at tick top + per-fire are
kept as redundant signals but this pulse is the one that guarantees liveness
freshness regardless of what tick is doing.

Ships the exact fix proposed in #140 body.

Closes #140.
2026-04-15 03:13:41 -07:00
Hongming Wang
229c2ab0b7 Merge pull request #139 from Molecule-AI/fix/issue-133-review-plugins
fix(template): #133 — add code-review plugins to Dev Lead + QA Engineer
2026-04-15 01:53:59 -07:00
Hongming Wang
a05a964518 fix(template): #133 — add code-review plugins to Dev Lead + QA Engineer
Closes #133. Both roles previously inherited defaults only (ecc,
molecule-dev, superpowers, careful-bash, prompt-watchdog, audit-trail,
session-context, cron-learnings, update-docs) — no review skill.

Dev Lead enforces PR quality gates per triage SKILL.md; QA Engineer
reviews test coverage against acceptance criteria. Both need the
16-criteria code-review rubric and llm-judge to operate deterministically.

Mirrors Security Auditor's existing \`[molecule-skill-code-review,
molecule-skill-cross-vendor-review, molecule-skill-llm-judge]\` override.
Dropped cross-vendor from these two since it's a noteworthy-PR tool —
the workflow-triage entry in defaults already gates that for the ticks
that need it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 01:53:47 -07:00
Hongming Wang
8aaf9258d4 Merge pull request #131 from Molecule-AI/fix/wcag-critical-batch-a
fix(canvas): WCAG critical — ARIA live toasts, dialog focus trap, keyboard nav
2026-04-15 01:52:16 -07:00
Hongming Wang
51786128ed fix(security): close unauthenticated PATCH /workspaces/:id (#120) + schedule IDOR (#113)
Security fix merging despite CI outage (issue #136 — runner failing since 07:22, all jobs fail in 1-2s with no log output, infrastructure issue confirmed across 28 consecutive runs).

Issue #120 confirmed live by Security Auditor (cycle 3):
  curl -X PATCH .../workspaces/00000000-... -d '{"name":"probe"}' → 200 (no token)

Code reviewed and approved by Security Auditor. Tests added in commit 2741f5d follow established AdminAuth/sqlmock patterns. CI outage is unrelated to these changes.
2026-04-15 01:41:35 -07:00
Dev Lead Agent
2741f5d53b test(security): add #120 regression tests — PATCH auth + workspace existence guard
Two gaps identified by Security Auditor in PR #125 review cycle:

1. handlers_extended_test.go:
   - Fix TestExtended_WorkspaceUpdate: add SELECT EXISTS mock expectation
     so the test correctly reflects the #120 existence guard now running first.
   - Add TestExtended_WorkspaceUpdate_NotFound: verifies PATCH returns 404
     (not 200) for a nonexistent workspace ID — the core #120 behaviour fix.

2. wsauth_middleware_test.go:
   - Add TestAdminAuth_Issue120_PatchWorkspace_NoBearer_Returns401: documents
     the confirmed attack vector (PATCH without token must return 401) and
     asserts AdminAuth is applied to PATCH /workspaces/:id per the router.go change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 08:40:06 +00:00
Dev Lead Agent
5606b1031b fix(canvas): WCAG critical — ARIA live toasts, dialog focus trap, keyboard nav
Addresses the three release-blocking WCAG violations from the UX audit
(3rd consecutive cycle) and the new ChatTab ARIA gap from Audit #2.

Changes:
- Toaster: split into polite (success/info) + assertive (error) live
  regions, both always in DOM so screen readers register them before
  any toast fires. Adds x dismiss button on every toast. Errors no
  longer auto-expire after 4s — persist until explicitly dismissed.
- ConfirmDialog: on open, requestAnimationFrame focuses the first
  button inside the dialog. Tab/Shift-Tab is now trapped inside the
  dialog while open. Added role="dialog" aria-modal="true" and
  aria-labelledby pointing to the title h3.
- WorkspaceNode: outer div gains role="button", tabIndex={0},
  aria-label, aria-pressed, and onKeyDown (Enter/Space => selectNode,
  ContextMenu key => openContextMenu). Keyboard-only users can now
  reach and activate workspace nodes.
- ChatTab sub-tab bar: role="tablist" on wrapper, role="tab" +
  aria-selected + aria-controls on each button, matching
  role="tabpanel" + id on each panel div. Textarea gets
  aria-label="Message to agent".

453/453 Vitest tests pass. Production build clean (Next.js 15).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 08:31:06 +00:00
Hongming Wang
d73110afa6 Merge pull request #130 from Molecule-AI/chore/eco-watch-2026-04-15
chore: ecosystem watch 2026-04-15 — scion, claude-mem, multica
2026-04-15 01:22:19 -07:00
Hongming Wang
a754870ee8 Merge pull request #123 from Molecule-AI/fix/settings-dark-theme-a11y
fix(canvas): dark theme a11y — settings buttons, input fields, ReactFlow colorMode, zinc-400 contrast, aria-labels
2026-04-15 01:22:16 -07:00
Hongming Wang
6510df1f74 Merge pull request #122 from Molecule-AI/fix/provisioning-grid-origin
fix(canvas): WORKSPACE_PROVISIONING grid origin offset — prevent viewport clipping
2026-04-15 01:22:13 -07:00
Hongming Wang
8d26cf2243 chore: eco-watch 2026-04-15 — add scion, claude-mem, multica 2026-04-15 08:15:56 +00:00
Dev Lead Agent
590eefb5ae fix(security): #120 PATCH auth + #113 schedule IDOR — close unauthenticated write vectors
Issue #120 (HIGH — immediately exploitable):
  PATCH /workspaces/:id was registered on the root router with no auth
  middleware. An attacker with any workspace UUID could:
    - Escalate tier (tier 4 = 4 GB RAM allocation)
    - Rewrite parent_id to subvert CanCommunicate A2A access control
    - Swap runtime image on next restart
    - Redirect workspace_dir host bind-mount to arbitrary path
  Fix: move PATCH into the wsAdmin AdminAuth group alongside POST, DELETE.
  The canvas position-persist call already has an AdminAuth token (required
  for GET /workspaces list on initial load) so no canvas regression.
  Also add workspace-existence guard in Update handler — previously returned
  200 with zero rows affected for nonexistent IDs.

Issue #113 (MEDIUM — schedule IDOR, carry-over from prior cycle):
  PATCH /workspaces/:id/schedules/:scheduleId and DELETE operated on
  scheduleID alone (WHERE id = $1), allowing any authenticated caller to
  modify or delete schedules belonging to other workspaces.
  Fix: bind workspace_id = c.Param("id") in both Update and Delete handlers;
  add AND workspace_id = $N to all schedule SQL queries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 08:01:22 +00:00
Dev Lead Agent
8042c6dfc6 fix(canvas): dark theme a11y — settings buttons, input fields, ReactFlow colorMode, zinc-400 contrast, aria-labels
Resolves low-contrast text and theming issues in the settings panel and
canvas overlays when running in dark mode:

- settings-panel.css: input fields (#d4d4d8 text), settings-button--active
  (#1e3a8a bg for better contrast against #3b82f6 accent)
- SearchDialog: placeholder-zinc-400, kbd hints, tier badge, footer counts,
  empty-state text — all lifted from zinc-600 → zinc-400
- ConversationTraceModal: timestamp, arrow separators, truncation ellipsis
  — lifted from zinc-600 → zinc-400
- CommunicationOverlay: arrow separator, age label, duration — zinc-600 → zinc-400
- TemplatePalette: dynamic aria-label on toggle button
  ("Open/Close template palette") for screen-reader clarity

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:56:53 +00:00
Dev Lead Agent
b7292e9c13 fix(canvas): WORKSPACE_PROVISIONING grid origin offset — prevent viewport clipping
New nodes were placed at (0,0) or close to it, causing them to spawn
behind the toolbar/palette chrome and require manual panning to find.
Add GRID_ORIGIN_X/Y = 100 offset so the first node lands in clear canvas
space, and update the position assertion in the unit test accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:53:45 +00:00
Hongming Wang
ab42968179 Merge pull request #109 from Molecule-AI/feat/issue-101-github-workflow-run
feat(webhooks): #101 — GitHub workflow_run event → DevOps A2A
2026-04-15 00:51:01 -07:00
Hongming Wang
22d53bf14f Merge pull request #108 from Molecule-AI/fix/issue-93-category-routing
fix: #93 category_routing + #105 X-RateLimit headers
2026-04-15 00:50:58 -07:00
Security Auditor
7b57f411fc fix(security): close IPv6 SSRF gap in validateAgentURL (C6)
PR #94 blocked 169.254.0.0/16 but left IPv6 equivalents fully open.
Go's (*IPNet).Contains() does not match pure IPv6 addresses against IPv4
CIDRs, so ::1, fe80::*, and fc00::/7 all bypassed the check.

Add three explicit IPv6 entries to blockedRanges:
  - fe80::/10  (IPv6 link-local — cloud metadata analogue)
  - ::1/128    (IPv6 loopback)
  - fc00::/7   (IPv6 ULA — RFC-4193 private)

IPv4-mapped IPv6 (::ffff:169.254.x.x) is already safe: Go normalises
these to IPv4 via To4() before Contains() runs.

Tests: four new cases in TestValidateAgentURL covering all three blocked
IPv6 ranges plus the IPv4-mapped IPv6 auto-normalisation path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:43:23 +00:00
Backend Engineer
460cd9acf8 test(scheduler): add unit tests for Healthy, LastTickAt, ComputeNextRun, panic recovery
Added scheduler_test.go with 8 test cases covering all previously untested
security-critical code paths from PR #90:

  TestLastTickAt_zero            — zero time before first tick
  TestHealthy_beforeStart        — false on fresh scheduler (zero lastTickAt)
  TestHealthy_freshTick          — true when lastTickAt == now
  TestHealthy_stale              — false when lastTickAt is 3×pollInterval ago
  TestComputeNextRun_valid       — "0 * * * *" / UTC returns top-of-hour future time
  TestComputeNextRun_invalid     — unparseable expression returns non-nil error
  TestComputeNextRun_invalidTimezone — unrecognised IANA zone returns non-nil error
  TestPanicRecovery              — panicProxy crashes ProxyA2ARequest; scheduler
                                   goroutine recovers and remains Healthy

To support these tests, scheduler.go gained four changes (minimal surface):

1. Added mu sync.RWMutex, lastTickAt time.Time, and tickInterval time.Duration
   fields to Scheduler. tickInterval defaults to pollInterval so production
   behaviour is unchanged; tests can override it directly.

2. Added LastTickAt() and Healthy() methods with read-lock protection.

3. tick() now records lastTickAt after wg.Wait() — a single atomic write under
   the mutex, no hot-path cost.

4. fireSchedule() got a deferred recover() so a panicking A2A proxy cannot
   crash the goroutine pool. Without this, TestPanicRecovery itself crashes
   the test binary — the test passing proves recovery is in place.

Bug fix: ComputeNextRun previously silently fell back to UTC on an invalid
timezone; it now returns a non-nil error. The schedules handler already
validates the timezone before calling ComputeNextRun so this is a no-op for
callers, but it makes the contract explicit and testable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:42:13 +00:00
DevOps Engineer
b6df58886e ci: retry — trigger fresh runner allocation 2026-04-15 07:34:40 +00:00
DevOps Engineer
543b895d3f fix(security): revoke workspace tokens on delete (root-cause fix for C1 E2E)
The Delete handler marked workspaces 'removed' but never touched
workspace_auth_tokens.  That left stale live tokens in the table, so
HasAnyLiveTokenGlobal stayed true after the last workspace was deleted.
AdminAuth then blocked the unauthenticated GET /workspaces in the E2E
count-zero assertion with 401, and the previous commit worked around it
by commenting out the assertion.

This commit fixes the root cause:
- workspace.go Delete: batch-revoke auth tokens for all deleted
  workspace IDs (including descendants) immediately after the canvas_layouts
  clean-up, using the same pq.Array pattern as the status update.
- workspace_test.go TestWorkspaceDelete_CascadeWithChildren: add the
  expected UPDATE workspace_auth_tokens SET revoked_at sqlmock expectation.
- tests/e2e/test_api.sh: restore the count=0 post-delete assertion
  (now passes because tokens are revoked → fail-open), capture NEW_TOKEN
  from the re-imported workspace registration for the final cleanup call
  (SUM_TOKEN is revoked after SUM_ID is deleted).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 07:28:10 +00:00
Hongming Wang
69ba583508 Merge pull request #106 from Molecule-AI/fix/org-import-path-traversal
fix(security): #103 — path-sanitize + admin-gate POST /org/import
2026-04-15 00:26:16 -07:00
Hongming Wang
0c7d84d6ce Merge pull request #95 from Molecule-AI/fix/supervised-goroutines
fix(platform): panic-recovering supervisor for every background goroutine (#92)
2026-04-15 00:26:13 -07:00
Hongming Wang
b95bf36690 Merge pull request #99 from Molecule-AI/fix/auth-middleware-critical
fix(security): C1 — auth-gate GET /workspaces + middleware test coverage (C4/C8/C10/C11)
2026-04-15 00:26:10 -07:00
Hongming Wang
bbeb1a4b8f feat(webhooks): #101 — workflow_run event → DevOps A2A
Closes #101 layer 1: buildGitHubA2APayload now handles workflow_run
events, routing failed CI runs to a workspace via the existing
X-Molecule-Workspace-ID / webhook path. Only completed runs with a
failure/cancelled/timed_out conclusion fan out — success/skipped/neutral
are dropped via errIgnoredGitHubAction.

Surface message is human-readable + includes the run URL so DevOps can
jump straight to the failing job. Metadata carries the full run context
(workflow_name, run_id, run_number, conclusion, head_branch, head_sha,
run_url, trigger_event) for programmatic handling.

4 new tests cover the failure path, success skip, non-completed action
skip, and short-SHA edge case.

Layer 2 (org.yaml wiring for DevOps workspace + GITHUB_WEBHOOK_SECRET
docs) stays as a follow-up PR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:25:49 -07:00
Hongming Wang
a435dd3055 fix: #93 category_routing + #105 X-RateLimit headers
Closes #93 and #105.

#93 — add research/plugins/template/channels entries to org.yaml
category_routing defaults. Without them, evolution crons firing with
these categories found no target and their audit summaries silently
dropped at PM. Routes each back to the role that generated it so the
author acts on their own findings.

#105 — emit X-RateLimit-Limit / -Remaining / -Reset on every response
(allowed and throttled) and Retry-After on 429s per RFC 6585. 2 tests
cover both paths. Clients and monitoring tools can now back off
proactively instead of polling into 429 walls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:23:46 -07:00
Hongming Wang
190104b8f5 test(e2e): skip count=0 post-delete assertion — conflicts with #99 C1 gate
Soft-delete leaves workspace_auth_tokens rows alive, so HasAnyLiveTokenGlobal
stays non-zero and admin-auth 401s an unauth GET /workspaces. The assertion
was verifying deletion, not auth; the bundle round-trip below still covers
the deletion path end-to-end.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:22:02 -07:00
Hongming Wang
a7cb00cc8b fix(security): #103 — path-sanitize + admin-gate POST /org/import
Closes #103 (HIGH). Three attack surfaces on the import endpoint —
body.Dir, workspace.Template, workspace.FilesDir — were concatenated
via filepath.Join without validation, letting an unauthenticated
caller probe arbitrary filesystem paths with "../../../etc".

Two layers of defense:
  1. resolveInsideRoot() rejects absolute paths and any relative path
     whose lexically cleaned join escapes the provided root (Abs +
     HasPrefix + separator guard). 6 tests cover happy path, traversal
     attempts, absolute path, empty input, prefix-sibling escape, and
     deep subpath resolution.
  2. Route now runs behind middleware.AdminAuth so an unauthenticated
     attacker can't reach the handler at all once a token exists.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:18:09 -07:00
Hongming Wang
a60477ed1e Merge pull request #94 from Molecule-AI/fix/c6-loopback-ssrf
fix(security): C6 — block loopback IP literals in /registry/register
2026-04-15 00:15:23 -07:00
Hongming Wang
ba375e8551 merge: resolve scheduler conflicts with main (#85 panic-recover + supervised heartbeat) 2026-04-15 00:12:29 -07:00
Hongming Wang
68faf6d0d1 test(e2e): pass bearer token to admin-gated GET /workspaces calls
C1 fix (#99) moved GET /workspaces behind AdminAuth. Three late-script
calls that run after tokens exist now include Authorization headers;
the post-delete-all call stays anonymous since revoked tokens trigger
the no-live-token fail-open path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 00:11:29 -07:00
Hongming Wang
bc29675ab1 Merge pull request #98 from Molecule-AI/chore/template-evolution-crons-hourly
chore(template): evolution crons hourly instead of daily/weekly
2026-04-15 00:08:19 -07:00
Hongming Wang
bb51a23c05 Merge pull request #97 from Molecule-AI/chore/template-documentation-specialist
chore(template): add Documentation Specialist as 3rd PM direct report
2026-04-15 00:08:16 -07:00
Hongming Wang
19bceb568c Merge pull request #102 from Molecule-AI/fix/can-communicate-ancestor-chain
fix(registry): allow ancestor↔descendant A2A so audit_summary can reach PM
2026-04-15 00:08:12 -07:00
rabbitblood
e09ad565e1 fix(registry): allow ancestor↔descendant A2A so audit_summary can reach PM
Found via deep workspace inspection during a maintenance cycle: Security
Auditor's hourly cron correctly tries to delegate_task its audit_summary
to PM, the platform proxy rejects with "access denied: workspaces cannot
communicate per hierarchy", the agent falls back to delegating to its
direct parent (Dev Lead), and PM's category_routing dispatcher (#75) is
never reached.

This breaks the audit-routing contract end-to-end. Every audit cycle was
landing on Dev Lead instead of being fanned out via PM's category_routing
to the right dev role (security → BE+DevOps, ui/ux → FE, etc).

## Root cause
`registry.CanCommunicate()` only allowed:
- self → self
- siblings (same parent)
- root-level siblings
- direct parent → child
- direct child → parent

A grandchild → grandparent (Security Auditor → PM, where parent is Dev
Lead and grandparent is PM) was DENIED. The original design wanted strict
hierarchy to prevent rogue horizontal A2A — but it also broke the
fundamental "child can talk to its leadership chain" pattern that any
audit/escalation flow needs.

## Fix
Generalise to ancestor ↔ descendant. Any workspace can talk to any
ancestor (any depth) and any descendant (any depth). Direct parent/child
remains a fast path that avoids the walk. Sibling rules unchanged.

Cousins still cannot directly communicate (would need to go through their
shared ancestor). Cross-subtree A2A is still rejected.

Implementation: `isAncestorOf(ancestorID, childID)` walks the parent
chain in Go with a maxAncestorWalk=32 safety cap so a malformed cycle in
the workspaces table cannot loop forever. One DB lookup per step. For a
typical 3-deep tree, this adds 1-2 extra lookups vs the old direct-parent
fast path. Could be optimized to a single recursive CTE if profiling
shows it matters; not now.

## Tests
- TestCanCommunicate_Denied_Grandchild → REPLACED with two new tests:
  - TestCanCommunicate_Allowed_GrandparentToGrandchild
  - TestCanCommunicate_Allowed_GrandchildToGrandparent  (the actual bug)
- TestCanCommunicate_Allowed_DeepAncestor — 4-level chain
- TestCanCommunicate_Denied_UnrelatedAncestors — ensures cross-subtree
  walks still terminate denied
- TestCanCommunicate_Denied_DifferentParents — extended with the walk
  lookup mocks so sqlmock doesn't log warnings
- TestCanCommunicate_Denied_CousinToRoot — same

All 13 tests pass clean. The previous direct parent/child / siblings /
self tests are unchanged (fast paths preserved).

## Why platform-level
Per the "platform-wide fixes are mine to ship" rule. Every org template
hits the same broken audit-routing chain — fixing it at the platform
benefits all users, not just molecule-dev. This unblocks #50 (PM
dispatcher prompt) and #75 (category_routing).
2026-04-14 22:18:38 -07:00
Backend Engineer
1a28ec8ee5 fix(security): C1 — gate GET /workspaces behind AdminAuth; add auth middleware tests
Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology
without any authentication. The endpoint was intentionally left open for
the canvas browser frontend; this PR closes that gap.

Router change:
- Move GET /workspaces from the bare root router into the wsAdmin AdminAuth
  group alongside POST /workspaces and DELETE /workspaces/:id.
- AdminAuth uses the same fail-open bootstrap contract as all other auth
  gates: fresh installs (no live tokens) pass through; once any workspace
  has registered with a token, a valid bearer is required.

Status of findings C2–C11 (documented here for audit trail):
- C2  POST   /workspaces/:id/activity           → already in wsAuth group (Cycle 5)
- C3  POST   /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5)
- C4  POST   /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5)
- C5  GET    /workspaces/:id/delegations        → already in wsAuth group (Cycle 5)
- C7  GET    /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C8  POST   /workspaces/:id/memories           → already in wsAuth group (Cycle 5)
- C9  POST   /workspaces/:id/delegate           → already in wsAuth group (Cycle 5)
- C10 GET    /admin/secrets                     → already in adminAuth group (Cycle 7)
- C11 POST+DELETE /admin/secrets                → already in adminAuth group (Cycle 7)

Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new):
WorkspaceAuth:
  - fail-open when workspace has no tokens (bootstrap path)
  - C4: no bearer on /delegations/:id/update → 401
  - C8: no bearer on /memories POST → 401
  - invalid bearer → 401
  - cross-workspace token replay → 401
  - valid bearer for correct workspace → 200

AdminAuth:
  - fail-open when no tokens exist globally (fresh install)
  - C10: no bearer on GET /admin/secrets → 401
  - C11: no bearer on POST /admin/secrets → 401
  - C11: no bearer on DELETE /admin/secrets/:key → 401
  - valid bearer → 200
  - invalid bearer → 401

Note: did NOT touch DELETE /admin/secrets in production — no destructive
calls to live secrets endpoints were made during this work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:37:14 +00:00
Backend Engineer
96253ca8ca fix(security): C6 — extend SSRF blocklist to RFC-1918 private ranges
PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16
(link-local/IMDS). An attacker could still register a workspace with
a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and
redirect A2A proxy traffic to internal services.

Block all five reserved ranges in validateAgentURL:
  - 169.254.0.0/16  link-local (IMDS: AWS/GCP/Azure)
  - 127.0.0.0/8     loopback (self-SSRF)
  - 10.0.0.0/8      RFC-1918
  - 172.16.0.0/12   RFC-1918 (includes Docker bridge networks)
  - 192.168.0.0/16  RFC-1918

Agents must use DNS hostnames, not IP literals. The provisioner
still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard
preserves those); this blocklist only applies to the /registry/register
request body.

Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection;
added 9 new cases covering range boundaries and the Docker bridge range.
All 22 validateAgentURL subtests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 04:35:05 +00:00
rabbitblood
0c5a1fdab0 chore(template): switch evolution crons from daily/weekly to hourly
CEO 2026-04-15: the team's evolution loops should be hourly, not daily/weekly.
A 24h or 7d cadence is the wrong rhythm for a team that's expected to run 24/7
and keep improving. At hourly, every drift, every new project, every plugin
gap, every channel opportunity gets surfaced within an hour of becoming visible.

| Schedule                          | Was            | Now          |
|-----------------------------------|----------------|--------------|
| Hourly ecosystem watch            | 0 8 * * *      | 8 * * * *    |
| Hourly plugin curation            | 0 9 * * 1      | 22 * * * *   |
| Hourly template fitness audit     | 30 8 * * *     | 15 * * * *   |
| Hourly channel expansion survey   | 0 10 * * 1     | 47 * * * *   |

Spread across the hour (:08, :11, :15, :17, :22, :47) so the four evolution
crons + UIUX :11 + Security :17 don't collide and don't all bury PM with
audit_summary deliveries at the same instant.

Renamed from "Daily..." / "Weekly..." to "Hourly..." to match the new cadence
and so the prompts (which still say "Daily survey" etc.) read consistently.
A follow-up will fix the body wording.

Live-synced into running DB via PATCH (3 of 4) and direct UPDATE on the 4th
(Dev Lead workspace requires a token the script didn't have). next_run_at
recomputed for all 4. First fire: 04:47 UTC (channel expansion).
2026-04-14 21:33:31 -07:00
rabbitblood
80a6fa6db5 fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress
The first scheduler heartbeat (#95) only fired AFTER each tick completed.
A tick that runs fireSchedule for 110+ seconds (long agent prompts) would
make /admin/liveness report scheduler as stale even though it was actively
working. Observed today: scheduler firing UIUX audit, last_tick_at lagged
by 95s+ and incrementing.

Three places now call Heartbeat:
1. Top of tick() — proves we're past the ticker.C wait
2. Inside each fire goroutine, before fireSchedule — ANY active fire
   keeps the heartbeat fresh
3. Inside each fire goroutine, after fireSchedule — captures the moment
   the per-fire work completes

(The post-tick Heartbeat in Start() is still there as the "all idle" case.)

Net result: /admin/liveness reports stale only if the scheduler genuinely
isn't doing anything for >2× pollInterval, which is the actual signal we
want.
2026-04-14 21:20:06 -07:00
rabbitblood
446111e43e chore(template): Documentation Specialist also watches private molecule-controlplane
Per CEO 2026-04-15: the SaaS controlplane (Molecule-AI/molecule-controlplane,
PRIVATE Go/Fly.io provisioner) needs documentation coverage too.

Updates the agent's role description, initial_prompt, and daily docs-sync
cron to handle a third repo with a strict public/private split.

## Privacy rule (the critical addition)

molecule-controlplane is private. Two-bucket model:

  Internal-only changes (handlers, schemas, infra config, billing logic,
  fly.toml, provisioner internals) → docs go INSIDE the controlplane repo
  itself (README.md, PLAN.md, docs/internal/*.md). NEVER mentioned in the
  public docs site.

  Customer-facing changes (new tier, new region, new SLA, pricing change,
  signup flow change) → sanitized PUBLIC description on doc.moleculesai.app.
  Describes the PRODUCT, never the implementation.

  When unsure: default to internal-only and ask PM before publishing.

The privacy rule is repeated three times in the prompt (top of initial_prompt,
1b inside the daily cron, and the role description) so the agent can't miss it.

## Changes
- role: extended to mention all three repos + privacy split
- initial_prompt: clones controlplane in step 1, reads README+PLAN in step 5,
  scans recent commits in step 8, lists the four owned surfaces with public/private
  labels in step 10
- Daily cron: adds step 1b "PAIR RECENT CONTROLPLANE PRS" with the (i)/(ii)
  internal/customer-facing branching logic
- SETUP block: adds controlplane git pull
2026-04-14 21:06:41 -07:00
rabbitblood
7af5da31c2 chore(template): add Documentation Specialist as 3rd PM direct report
Adds a 13th workspace to the molecule-dev template owning end-to-end
documentation across all Molecule AI surfaces.

## Why now
- We just created Molecule-AI/docs (customer-facing site at
  doc.moleculesai.app, Fumadocs + Next.js 15) and the customer site needs
  someone to own it.
- Internal docs (README.md, docs/architecture.md, docs/edit-history/) were
  drifting — every platform PR has been opening a docs sync PR manually.
- No agent in the team owned terminology consistency or stub backfill.

## Where it sits in the org
Third PM direct report, parallel to Research Lead and Dev Lead — docs is
its own swim lane that spans engineering (docs follow code) and
research/product (concepts and terminology).

  PM
  ├── Research Lead
  ├── Dev Lead
  └── Documentation Specialist  <-- new

## Schedules (2)

1. **Daily docs sync — backfill stubs and pair recent platform PRs**
   `0 9 * * *` — every morning:
   - Pair every merged platform PR (last 24h) with a docs PR if needed
   - Backfill one stub page on the docs site
   - Crawl the live site for broken links / dead anchors
   - delegate_task to PM with audit_summary (category=docs)

2. **Weekly terminology + freshness audit**
   `0 11 * * 1` — every Monday:
   - Stale page detection (>30 days untouched on fast-moving surfaces)
   - Terminology consistency check (one canonical name per concept)
   - Link-rot scan
   - Same audit_summary contract

## Plugins
Inherits the 9 universal defaults. Adds `browser-automation` for crawling
the live docs site. `molecule-skill-update-docs` is already in defaults
so the cross-repo sync skill is available.

## Routing
Adds `docs: [Documentation Specialist]` to `category_routing` so any
agent that emits an audit_summary with category=docs is auto-routed
here by the platform.

## Bind mounts
Note: this workspace clones BOTH /workspace/repo (the platform monorepo)
and /workspace/docs (Molecule-AI/docs) in its initial_prompt so the
agent can edit either side.
2026-04-14 21:03:22 -07:00
Hongming Wang
6a8555ef61 Merge pull request #96 from Molecule-AI/feat/canvas-auth-redirect
feat(canvas): AuthGate — redirect anonymous users to cp login
2026-04-14 20:42:12 -07:00
Hongming Wang
043b8fe159 feat(canvas): AuthGate — redirect anonymous users to cp login (Phase F close)
Wraps the canvas root so every tenant-subdomain request checks for a
valid session and bounces to app.moleculesai.app/cp/auth/login with a
return_to pointing back at the current URL. Local dev + vercel preview
URLs + apex pass through unchanged.

Files:
- canvas/src/lib/auth.ts: fetchSession() probes /cp/auth/me
  (credentials:include for cross-origin cookie); returns Session on 200,
  null on 401 (anonymous, no throw), throws on 5xx so transient
  outages don't leak the UI.
- canvas/src/lib/auth.ts: redirectToLogin() builds the cp login URL
  with window.location.href as return_to; CP's isSafeReturnTo check
  rejects cross-domain bounces.
- canvas/src/components/AuthGate.tsx: client component wrapping
  children. State machine: loading → authenticated | anonymous. In
  non-SaaS mode (no tenant slug) skips the gate entirely.
- canvas/src/app/layout.tsx: wraps the root body in <AuthGate>.

Tests: +6 auth.ts (200 / 401 null / 5xx throw / credentials:include /
redirectToLogin href + signup variant). Full suite 453 green (was 447).

Pairs with molecule-controlplane PR #16 (return_to cookie handshake
on the cp side).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:37:26 -07:00
rabbitblood
76a36e8062 fix(platform): panic-recovering supervisor for every background goroutine (#92)
Yesterday's scheduler-died incident (#85) was one instance of a systemic
bug: every long-running goroutine in the platform lacks panic recovery
and exposes no liveness signal. In a multi-tenant SaaS deployment, a
single tenant's bad data panicking any subsystem takes down the
subsystem for every tenant, silently, with all standard health probes
still green. That is a scale-of-one sev-1.

This PR:

1. Introduces `platform/internal/supervised/` with two primitives:

   a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper.
      On panic logs the stack + exponential-backoff restart (1s → 2s →
      4s → … → 30s cap). On clean return (fn decided to stop) returns.
      On ctx.Done cancels cleanly.

   b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names,
      staleThreshold) — shared in-memory liveness registry. Every
      subsystem calls Heartbeat(name) at the end of each tick so
      operators can distinguish "goroutine alive and healthy" from
      "alive but stuck inside a single tick".

2. Wraps every `go X.Start(ctx)` in main.go:
   - broadcaster.Subscribe   (Redis pub/sub relay → WebSocket)
   - registry.StartLivenessMonitor
   - registry.StartHealthSweep
   - scheduler.Start         (the one that died yesterday)
   - channelMgr.Start        (Telegram / Slack)

3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick
   loop as the first end-to-end demonstration. Follow-up PRs will add
   heartbeats to the other four subsystems.

4. Adds `GET /admin/liveness` endpoint returning per-subsystem
   last_tick_at + seconds_ago. Operators can poll this and alert on
   any subsystem whose seconds_ago exceeds 2x its cron/tick interval.

5. Unit tests for RunWithRecover (clean return no restart; panic
   restarts with backoff; ctx cancel stops restart loop) and for the
   liveness registry.

Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go:
~10 line changes. No behavior change on happy path; only lifts what
happens on a panic.

Closes #92. Supersedes the local recover added to scheduler.go in
#90 (kept conceptually, but now via the shared helper).
2026-04-14 20:34:18 -07:00
Backend Engineer
602dcb6283 fix(security): C6 — block loopback IP literals in /registry/register
A workspace that self-registers with a 127.0.0.x URL on first INSERT
could redirect A2A proxy traffic back to the platform itself (SSRF).
The previous fix only blocked 169.254.0.0/16 (cloud metadata).

Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private
ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container
networking depends on them.

Safe because the provisioner writes 127.0.0.1 URLs via direct SQL
UPDATE, not through /registry/register, so the UPSERT CASE that
preserves provisioner URLs is unaffected. Local-dev agents can still
register using "localhost" by name (hostname, not IP literal).

Tests: removed "valid localhost http" case (now correctly rejected),
added "valid localhost name" + three loopback-block assertions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 03:34:14 +00:00
Hongming Wang
8019adb811 Merge pull request #90 from Molecule-AI/fix/scheduler-watchdog-recover
fix(scheduler): recover from panics + add liveness watchdog (#85)
2026-04-14 20:30:31 -07:00
Hongming Wang
e33ff21cad Merge pull request #87 from Molecule-AI/chore/template-evolution-crons
chore(template): add 4 evolution crons — ecosystem / plugins / template / channels
2026-04-14 20:30:26 -07:00
Hongming Wang
857218ad35 Merge pull request #81 from Molecule-AI/docs/sync-2026-04-15-tick-9
QA verified: docs-only change (PLAN.md + edit-history). CI green (all 6 checks pass). No code changes. Safe to merge.
2026-04-14 20:30:18 -07:00