molecule-core

Author	SHA1	Message	Date
Hongming Wang	74046ca2cf	Merge pull request #187 from Molecule-AI/fix/issue-179-trusted-proxies fix(router): SetTrustedProxies(nil) closes rate-limit bypass via X-Forwarded-For (#179)	2026-04-15 10:55:01 -07:00
Hongming Wang	940a7772c3	Merge branch 'main' into fix/issue-170-secret-delete-auth	2026-04-15 10:54:36 -07:00
Backend Engineer	6edaebca00	fix: require workspace auth on DELETE /secrets/:key (#170 ) The route wsAuth.DELETE("/secrets/:key", sech.Delete) was already moved inside the WorkspaceAuth group in a prior commit, closing the CWE-306 unauthenticated-delete vector. This commit adds two regression tests to lock that in: - TestWorkspaceAuth_Issue170_SecretDelete_NoBearer_Returns401: workspace with live tokens, no bearer header → 401 (blocks the attack). - TestWorkspaceAuth_Issue170_SecretDelete_FailOpen_NoTokens: workspace with no tokens (bootstrap/legacy) → 200 (fail-open preserved). Mirrors the TestAdminAuth_Issue120_* and TestWorkspaceAuth_C4_C8_* patterns. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 17:42:08 +00:00
Backend Engineer	1ad98be17b	fix(router): call SetTrustedProxies(nil) to close IP-spoofing bypass (#179 ) Without this call Gin's default trusts all X-Forwarded-For headers, letting any caller rotate their effective IP and bypass per-IP rate limiting. SetTrustedProxies(nil) forces c.ClientIP() to always return the real TCP RemoteAddr. Adds two regression tests: one documenting the pre-fix bypass, one asserting the spoofed header is ignored after the fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 17:32:54 +00:00
Backend Engineer	3cbeab45ba	fix(security): gate GET /approvals/pending behind AdminAuth (#180 ) GET /approvals/pending was registered on the open router with no middleware, allowing any unauthenticated caller to enumerate all pending approvals across every workspace on the platform. Fix: add inline middleware.AdminAuth(db.DB) to the route registration, matching the pattern used in PR #167 for bundles, events, and viewport. The three workspace-scoped approvals routes (POST/GET /approvals, POST /approvals/:id/decide) were already correctly behind WorkspaceAuth inside the wsAuth group — no change needed there. Tests: two new regression tests in wsauth_middleware_test.go — TestAdminAuth_Issue180_ApprovalsListing_NoBearer_Returns401 TestAdminAuth_Issue180_ApprovalsListing_FailOpen_NoTokens Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 17:25:09 +00:00
Hongming Wang	ad5e7b88b3	fix(security): #164 + #165 + #166 — gate 6 unauth routes behind AdminAuth CRITICAL (#164): POST /bundles/import — anon callers could create arbitrary workspaces with user-supplied system prompts, plugins, and secrets envelopes. Fixed by gating behind AdminAuth (bundleAdmin group). HIGH (#165): GET /bundles/export/:id — anon UUID probe leaked full system prompts, agent_card, plugins, memory for any workspace. GET /events + GET /events/:workspaceId — anon read of the append-only event log leaked org topology, workspace names, card fragments. Both moved into the same bundleAdmin / eventsAdmin groups. MEDIUM (#166): PUT /canvas/viewport — anon callers could reset shared viewport state. Gated via a scoped viewportAdmin group; GET stays open so canvas bootstraps without a bearer. GET /admin/liveness — operational-intel leak (scheduler cadence reveals work pattern). Inline AdminAuth on the single handler. All 6 routes use the same lazy-bootstrap admin auth the rest of the platform uses: zero-token installs fail-open, once any token exists every request must present a valid bearer. Known follow-up: canvas uses session cookies not bearer tokens (same pattern as #138). In multi-tenant production these canvas features — Events tab, Export/Duplicate, viewport persist — will return 401 once a workspace is token-enrolled. Needs cookie-accepting AdminAuth as a follow-up (tracked as option B in #138 triage discussion); a new issue will be filed for that scope. The security gain from closing #164 CRITICAL outweighs the canvas UX regression for tonight. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:52:32 -07:00
Hongming Wang	146f4c781b	Merge pull request #162 from Molecule-AI/fix/issue-138-field-whitelist fix(auth): #138 — field-level authz on PATCH /workspaces/:id (canvas regression fix)	2026-04-15 09:39:22 -07:00
Hongming Wang	0fc4edab2a	fix(auth): #138 — field-level authz on PATCH /workspaces/:id Closes #138. #125 moved PATCH /workspaces/:id into the wsAdmin AdminAuth group to close the #120 unauth vulnerability, but broke canvas drag- reposition and inline rename because canvas uses session cookies not bearer tokens. Multi-tenant deployments with any live token would have seen every canvas PATCH 401. Option A per #138 triage: PATCH goes back on the open router, but WorkspaceHandler.Update now enforces field-level authz: Cosmetic (no bearer required): name, role, x, y, canvas Sensitive (bearer required when any live token exists): tier — resource escalation parent_id — A2A hierarchy manipulation runtime — container image swap workspace_dir — host bind-mount redirection Fail-open bootstrap: HasAnyLiveTokenGlobal = 0 → pass-through (fresh install, pre-Phase-30 upgrade path). Matches the same lazy-bootstrap contract WorkspaceAuth and AdminAuth use elsewhere. 3 new tests cover all three branches of the matrix (cosmetic no-bearer, sensitive no-bearer-rejected, sensitive fail-open). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:39:09 -07:00
Hongming Wang	f06574428e	Merge pull request #119 from Molecule-AI/fix/111-112-clean fix(security+scheduler): IPv6 SSRF gap + scheduler unit tests [supersedes #111, #112]	2026-04-15 09:36:59 -07:00
Hongming Wang	5c389efc82	Merge branch 'main' into fix/111-112-clean	2026-04-15 09:36:14 -07:00
Hongming Wang	639f225142	Merge branch 'main' into fix/delete-revokes-tokens	2026-04-15 09:35:44 -07:00
Hongming Wang	0f5ab7a2c9	fix(tests): add EXISTS probe mock to 4 WorkspaceUpdate tests #125 added a SELECT EXISTS guard before WorkspaceHandler.Update applies any UPDATE so nonexistent workspace IDs return 404 instead of silent zero-row successes. The 4 existing WorkspaceUpdate_* sqlmock tests didn't mock the probe, so they broke on main. This was not caught because CI is blocked by the Actions billing cap. Adds ExpectQuery for the EXISTS probe to: - TestWorkspaceUpdate_ParentID - TestWorkspaceUpdate_NameOnly - TestWorkspaceUpdate_MultipleFields - TestWorkspaceUpdate_RuntimeField TestWorkspaceUpdate_BadJSON doesn't need the fix — it aborts on c.ShouldBindJSON before reaching the guard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:35:08 -07:00
Hongming Wang	30d2d268b5	fix(security): #151 — register SecurityHeaders middleware Closes #151. The middleware was already implemented + tested (3 passing tests in securityheaders_test.go covering base set, multi-route, and the don't-override-existing contract) but never registered in router.go. One-line wire-up, runs after TenantGuard so rejected requests still get the same headers as accepted ones, and before routes so handlers can still opt out by setting their own header before c.Next() returns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 03:50:52 -07:00
rabbitblood	3e13b727f7	fix(scheduler): independent heartbeat pulse so liveness doesn't false-stale during long fires (#140 ) The #95 scheduler heartbeat scheme relied on: 1. Top of tick() (once per poll interval) 2. Per-fire goroutine entry + exit That leaves a gap: tick() ends with wg.Wait(), so if a single fire takes longer than pollInterval (UIUX audits routinely take 60-120s; max fireTimeout is 5min), the next tick doesn't run and no top-of-tick heartbeat fires. Per-fire heartbeats only bracket the fire — between entry and the HTTP response returning, nothing heartbeats either. Observed today: /admin/liveness reports seconds_ago=251 while docker logs show the scheduler actively firing 'Hourly ecosystem watch'. Scheduler is fine; liveness is lying. Adds an independent 10s heartbeat pulse goroutine inside Start(), decoupled from tick completion. The existing heartbeats at tick top + per-fire are kept as redundant signals but this pulse is the one that guarantees liveness freshness regardless of what tick is doing. Ships the exact fix proposed in #140 body. Closes #140.	2026-04-15 03:13:41 -07:00
Hongming Wang	55827baafa	fix(security): close unauthenticated PATCH /workspaces/:id (#120 ) + schedule IDOR (#113 ) Security fix merging despite CI outage (issue #136 — runner failing since 07:22, all jobs fail in 1-2s with no log output, infrastructure issue confirmed across 28 consecutive runs). Issue #120 confirmed live by Security Auditor (cycle 3): curl -X PATCH .../workspaces/00000000-... -d '{"name":"probe"}' → 200 (no token) Code reviewed and approved by Security Auditor. Tests added in commit `76cb7c3` follow established AdminAuth/sqlmock patterns. CI outage is unrelated to these changes.	2026-04-15 01:41:35 -07:00
Dev Lead Agent	76cb7c3760	test(security): add #120 regression tests — PATCH auth + workspace existence guard Two gaps identified by Security Auditor in PR #125 review cycle: 1. handlers_extended_test.go: - Fix TestExtended_WorkspaceUpdate: add SELECT EXISTS mock expectation so the test correctly reflects the #120 existence guard now running first. - Add TestExtended_WorkspaceUpdate_NotFound: verifies PATCH returns 404 (not 200) for a nonexistent workspace ID — the core #120 behaviour fix. 2. wsauth_middleware_test.go: - Add TestAdminAuth_Issue120_PatchWorkspace_NoBearer_Returns401: documents the confirmed attack vector (PATCH without token must return 401) and asserts AdminAuth is applied to PATCH /workspaces/:id per the router.go change. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 08:40:06 +00:00
Dev Lead Agent	3705377a6c	fix(security): #120 PATCH auth + #113 schedule IDOR — close unauthenticated write vectors Issue #120 (HIGH — immediately exploitable): PATCH /workspaces/:id was registered on the root router with no auth middleware. An attacker with any workspace UUID could: - Escalate tier (tier 4 = 4 GB RAM allocation) - Rewrite parent_id to subvert CanCommunicate A2A access control - Swap runtime image on next restart - Redirect workspace_dir host bind-mount to arbitrary path Fix: move PATCH into the wsAdmin AdminAuth group alongside POST, DELETE. The canvas position-persist call already has an AdminAuth token (required for GET /workspaces list on initial load) so no canvas regression. Also add workspace-existence guard in Update handler — previously returned 200 with zero rows affected for nonexistent IDs. Issue #113 (MEDIUM — schedule IDOR, carry-over from prior cycle): PATCH /workspaces/:id/schedules/:scheduleId and DELETE operated on scheduleID alone (WHERE id = $1), allowing any authenticated caller to modify or delete schedules belonging to other workspaces. Fix: bind workspace_id = c.Param("id") in both Update and Delete handlers; add AND workspace_id = $N to all schedule SQL queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 08:01:22 +00:00
Hongming Wang	8ba88011b4	Merge pull request #109 from Molecule-AI/feat/issue-101-github-workflow-run feat(webhooks): #101 — GitHub workflow_run event → DevOps A2A	2026-04-15 00:51:01 -07:00
Hongming Wang	7a41d67fa3	Merge pull request #108 from Molecule-AI/fix/issue-93-category-routing fix: #93 category_routing + #105 X-RateLimit headers	2026-04-15 00:50:58 -07:00
Security Auditor	5718b05cc7	fix(security): close IPv6 SSRF gap in validateAgentURL (C6) PR #94 blocked 169.254.0.0/16 but left IPv6 equivalents fully open. Go's (IPNet).Contains() does not match pure IPv6 addresses against IPv4 CIDRs, so ::1, fe80::, and fc00::/7 all bypassed the check. Add three explicit IPv6 entries to blockedRanges: - fe80::/10 (IPv6 link-local — cloud metadata analogue) - ::1/128 (IPv6 loopback) - fc00::/7 (IPv6 ULA — RFC-4193 private) IPv4-mapped IPv6 (::ffff:169.254.x.x) is already safe: Go normalises these to IPv4 via To4() before Contains() runs. Tests: four new cases in TestValidateAgentURL covering all three blocked IPv6 ranges plus the IPv4-mapped IPv6 auto-normalisation path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 07:43:23 +00:00
Backend Engineer	140ae9ebee	test(scheduler): add unit tests for Healthy, LastTickAt, ComputeNextRun, panic recovery Added scheduler_test.go with 8 test cases covering all previously untested security-critical code paths from PR #90: TestLastTickAt_zero — zero time before first tick TestHealthy_beforeStart — false on fresh scheduler (zero lastTickAt) TestHealthy_freshTick — true when lastTickAt == now TestHealthy_stale — false when lastTickAt is 3×pollInterval ago TestComputeNextRun_valid — "0 * * * *" / UTC returns top-of-hour future time TestComputeNextRun_invalid — unparseable expression returns non-nil error TestComputeNextRun_invalidTimezone — unrecognised IANA zone returns non-nil error TestPanicRecovery — panicProxy crashes ProxyA2ARequest; scheduler goroutine recovers and remains Healthy To support these tests, scheduler.go gained four changes (minimal surface): 1. Added mu sync.RWMutex, lastTickAt time.Time, and tickInterval time.Duration fields to Scheduler. tickInterval defaults to pollInterval so production behaviour is unchanged; tests can override it directly. 2. Added LastTickAt() and Healthy() methods with read-lock protection. 3. tick() now records lastTickAt after wg.Wait() — a single atomic write under the mutex, no hot-path cost. 4. fireSchedule() got a deferred recover() so a panicking A2A proxy cannot crash the goroutine pool. Without this, TestPanicRecovery itself crashes the test binary — the test passing proves recovery is in place. Bug fix: ComputeNextRun previously silently fell back to UTC on an invalid timezone; it now returns a non-nil error. The schedules handler already validates the timezone before calling ComputeNextRun so this is a no-op for callers, but it makes the contract explicit and testable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 07:42:13 +00:00
DevOps Engineer	3ef9142914	fix(security): revoke workspace tokens on delete (root-cause fix for C1 E2E) The Delete handler marked workspaces 'removed' but never touched workspace_auth_tokens. That left stale live tokens in the table, so HasAnyLiveTokenGlobal stayed true after the last workspace was deleted. AdminAuth then blocked the unauthenticated GET /workspaces in the E2E count-zero assertion with 401, and the previous commit worked around it by commenting out the assertion. This commit fixes the root cause: - workspace.go Delete: batch-revoke auth tokens for all deleted workspace IDs (including descendants) immediately after the canvas_layouts clean-up, using the same pq.Array pattern as the status update. - workspace_test.go TestWorkspaceDelete_CascadeWithChildren: add the expected UPDATE workspace_auth_tokens SET revoked_at sqlmock expectation. - tests/e2e/test_api.sh: restore the count=0 post-delete assertion (now passes because tokens are revoked → fail-open), capture NEW_TOKEN from the re-imported workspace registration for the final cleanup call (SUM_TOKEN is revoked after SUM_ID is deleted). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 07:28:10 +00:00
Hongming Wang	de6ebe2262	Merge pull request #106 from Molecule-AI/fix/org-import-path-traversal fix(security): #103 — path-sanitize + admin-gate POST /org/import	2026-04-15 00:26:16 -07:00
Hongming Wang	7859d43685	Merge pull request #95 from Molecule-AI/fix/supervised-goroutines fix(platform): panic-recovering supervisor for every background goroutine (#92)	2026-04-15 00:26:13 -07:00
Hongming Wang	f8c1b786ac	Merge pull request #99 from Molecule-AI/fix/auth-middleware-critical fix(security): C1 — auth-gate GET /workspaces + middleware test coverage (C4/C8/C10/C11)	2026-04-15 00:26:10 -07:00
Hongming Wang	958789f4ba	feat(webhooks): #101 — workflow_run event → DevOps A2A Closes #101 layer 1: buildGitHubA2APayload now handles workflow_run events, routing failed CI runs to a workspace via the existing X-Molecule-Workspace-ID / webhook path. Only completed runs with a failure/cancelled/timed_out conclusion fan out — success/skipped/neutral are dropped via errIgnoredGitHubAction. Surface message is human-readable + includes the run URL so DevOps can jump straight to the failing job. Metadata carries the full run context (workflow_name, run_id, run_number, conclusion, head_branch, head_sha, run_url, trigger_event) for programmatic handling. 4 new tests cover the failure path, success skip, non-completed action skip, and short-SHA edge case. Layer 2 (org.yaml wiring for DevOps workspace + GITHUB_WEBHOOK_SECRET docs) stays as a follow-up PR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 00:25:49 -07:00
Hongming Wang	2a74a7b11b	fix: #93 category_routing + #105 X-RateLimit headers Closes #93 and #105. #93 — add research/plugins/template/channels entries to org.yaml category_routing defaults. Without them, evolution crons firing with these categories found no target and their audit summaries silently dropped at PM. Routes each back to the role that generated it so the author acts on their own findings. #105 — emit X-RateLimit-Limit / -Remaining / -Reset on every response (allowed and throttled) and Retry-After on 429s per RFC 6585. 2 tests cover both paths. Clients and monitoring tools can now back off proactively instead of polling into 429 walls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 00:23:46 -07:00
Hongming Wang	4dbf335d7f	fix(security): #103 — path-sanitize + admin-gate POST /org/import Closes #103 (HIGH). Three attack surfaces on the import endpoint — body.Dir, workspace.Template, workspace.FilesDir — were concatenated via filepath.Join without validation, letting an unauthenticated caller probe arbitrary filesystem paths with "../../../etc". Two layers of defense: 1. resolveInsideRoot() rejects absolute paths and any relative path whose lexically cleaned join escapes the provided root (Abs + HasPrefix + separator guard). 6 tests cover happy path, traversal attempts, absolute path, empty input, prefix-sibling escape, and deep subpath resolution. 2. Route now runs behind middleware.AdminAuth so an unauthenticated attacker can't reach the handler at all once a token exists. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 00:18:09 -07:00
Hongming Wang	80b0ad25ff	Merge pull request #94 from Molecule-AI/fix/c6-loopback-ssrf fix(security): C6 — block loopback IP literals in /registry/register	2026-04-15 00:15:23 -07:00
Hongming Wang	593c7e2984	merge: resolve scheduler conflicts with main (#85 panic-recover + supervised heartbeat)	2026-04-15 00:12:29 -07:00
rabbitblood	0653e78262	fix(registry): allow ancestor↔descendant A2A so audit_summary can reach PM Found via deep workspace inspection during a maintenance cycle: Security Auditor's hourly cron correctly tries to delegate_task its audit_summary to PM, the platform proxy rejects with "access denied: workspaces cannot communicate per hierarchy", the agent falls back to delegating to its direct parent (Dev Lead), and PM's category_routing dispatcher (#75) is never reached. This breaks the audit-routing contract end-to-end. Every audit cycle was landing on Dev Lead instead of being fanned out via PM's category_routing to the right dev role (security → BE+DevOps, ui/ux → FE, etc). ## Root cause `registry.CanCommunicate()` only allowed: - self → self - siblings (same parent) - root-level siblings - direct parent → child - direct child → parent A grandchild → grandparent (Security Auditor → PM, where parent is Dev Lead and grandparent is PM) was DENIED. The original design wanted strict hierarchy to prevent rogue horizontal A2A — but it also broke the fundamental "child can talk to its leadership chain" pattern that any audit/escalation flow needs. ## Fix Generalise to ancestor ↔ descendant. Any workspace can talk to any ancestor (any depth) and any descendant (any depth). Direct parent/child remains a fast path that avoids the walk. Sibling rules unchanged. Cousins still cannot directly communicate (would need to go through their shared ancestor). Cross-subtree A2A is still rejected. Implementation: `isAncestorOf(ancestorID, childID)` walks the parent chain in Go with a maxAncestorWalk=32 safety cap so a malformed cycle in the workspaces table cannot loop forever. One DB lookup per step. For a typical 3-deep tree, this adds 1-2 extra lookups vs the old direct-parent fast path. Could be optimized to a single recursive CTE if profiling shows it matters; not now. ## Tests - TestCanCommunicate_Denied_Grandchild → REPLACED with two new tests: - TestCanCommunicate_Allowed_GrandparentToGrandchild - TestCanCommunicate_Allowed_GrandchildToGrandparent (the actual bug) - TestCanCommunicate_Allowed_DeepAncestor — 4-level chain - TestCanCommunicate_Denied_UnrelatedAncestors — ensures cross-subtree walks still terminate denied - TestCanCommunicate_Denied_DifferentParents — extended with the walk lookup mocks so sqlmock doesn't log warnings - TestCanCommunicate_Denied_CousinToRoot — same All 13 tests pass clean. The previous direct parent/child / siblings / self tests are unchanged (fast paths preserved). ## Why platform-level Per the "platform-wide fixes are mine to ship" rule. Every org template hits the same broken audit-routing chain — fixing it at the platform benefits all users, not just molecule-dev. This unblocks #50 (PM dispatcher prompt) and #75 (category_routing).	2026-04-14 22:18:38 -07:00
Backend Engineer	80c2161687	fix(security): C1 — gate GET /workspaces behind AdminAuth; add auth middleware tests Security Auditor confirmed C1 (GET /workspaces) exposes workspace topology without any authentication. The endpoint was intentionally left open for the canvas browser frontend; this PR closes that gap. Router change: - Move GET /workspaces from the bare root router into the wsAdmin AdminAuth group alongside POST /workspaces and DELETE /workspaces/:id. - AdminAuth uses the same fail-open bootstrap contract as all other auth gates: fresh installs (no live tokens) pass through; once any workspace has registered with a token, a valid bearer is required. Status of findings C2–C11 (documented here for audit trail): - C2 POST /workspaces/:id/activity → already in wsAuth group (Cycle 5) - C3 POST /workspaces/:id/delegations/record → already in wsAuth group (Cycle 5) - C4 POST /workspaces/:id/delegations/:id/update → already in wsAuth group (Cycle 5) - C5 GET /workspaces/:id/delegations → already in wsAuth group (Cycle 5) - C7 GET /workspaces/:id/memories → already in wsAuth group (Cycle 5) - C8 POST /workspaces/:id/memories → already in wsAuth group (Cycle 5) - C9 POST /workspaces/:id/delegate → already in wsAuth group (Cycle 5) - C10 GET /admin/secrets → already in adminAuth group (Cycle 7) - C11 POST+DELETE /admin/secrets → already in adminAuth group (Cycle 7) Tests (platform/internal/middleware/wsauth_middleware_test.go — 13 new): WorkspaceAuth: - fail-open when workspace has no tokens (bootstrap path) - C4: no bearer on /delegations/:id/update → 401 - C8: no bearer on /memories POST → 401 - invalid bearer → 401 - cross-workspace token replay → 401 - valid bearer for correct workspace → 200 AdminAuth: - fail-open when no tokens exist globally (fresh install) - C10: no bearer on GET /admin/secrets → 401 - C11: no bearer on POST /admin/secrets → 401 - C11: no bearer on DELETE /admin/secrets/:key → 401 - valid bearer → 200 - invalid bearer → 401 Note: did NOT touch DELETE /admin/secrets in production — no destructive calls to live secrets endpoints were made during this work. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 04:37:14 +00:00
Backend Engineer	63e482f05b	fix(security): C6 — extend SSRF blocklist to RFC-1918 private ranges PR #94 only blocked 127.0.0.0/8 (loopback) and 169.254.0.0/16 (link-local/IMDS). An attacker could still register a workspace with a URL in any RFC-1918 range (10.x, 172.16–31.x, 192.168.x) and redirect A2A proxy traffic to internal services. Block all five reserved ranges in validateAgentURL: - 169.254.0.0/16 link-local (IMDS: AWS/GCP/Azure) - 127.0.0.0/8 loopback (self-SSRF) - 10.0.0.0/8 RFC-1918 - 172.16.0.0/12 RFC-1918 (includes Docker bridge networks) - 192.168.0.0/16 RFC-1918 Agents must use DNS hostnames, not IP literals. The provisioner still writes 127.0.0.1 URLs via direct SQL UPDATE (CASE guard preserves those); this blocklist only applies to the /registry/register request body. Tests: updated 3 previously-allowed RFC-1918 cases to expect rejection; added 9 new cases covering range boundaries and the Docker bridge range. All 22 validateAgentURL subtests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 04:35:05 +00:00
rabbitblood	101f284e5d	fix(scheduler): heartbeat at tick start + per-fire so liveness reflects work-in-progress The first scheduler heartbeat (#95) only fired AFTER each tick completed. A tick that runs fireSchedule for 110+ seconds (long agent prompts) would make /admin/liveness report scheduler as stale even though it was actively working. Observed today: scheduler firing UIUX audit, last_tick_at lagged by 95s+ and incrementing. Three places now call Heartbeat: 1. Top of tick() — proves we're past the ticker.C wait 2. Inside each fire goroutine, before fireSchedule — ANY active fire keeps the heartbeat fresh 3. Inside each fire goroutine, after fireSchedule — captures the moment the per-fire work completes (The post-tick Heartbeat in Start() is still there as the "all idle" case.) Net result: /admin/liveness reports stale only if the scheduler genuinely isn't doing anything for >2× pollInterval, which is the actual signal we want.	2026-04-14 21:20:06 -07:00
rabbitblood	e4535560cf	fix(platform): panic-recovering supervisor for every background goroutine (#92 ) Yesterday's scheduler-died incident (#85) was one instance of a systemic bug: every long-running goroutine in the platform lacks panic recovery and exposes no liveness signal. In a multi-tenant SaaS deployment, a single tenant's bad data panicking any subsystem takes down the subsystem for every tenant, silently, with all standard health probes still green. That is a scale-of-one sev-1. This PR: 1. Introduces `platform/internal/supervised/` with two primitives: a. RunWithRecover(ctx, name, fn) — runs fn in a recover wrapper. On panic logs the stack + exponential-backoff restart (1s → 2s → 4s → … → 30s cap). On clean return (fn decided to stop) returns. On ctx.Done cancels cleanly. b. Heartbeat(name) + LastTick(name) + Snapshot() + IsHealthy(names, staleThreshold) — shared in-memory liveness registry. Every subsystem calls Heartbeat(name) at the end of each tick so operators can distinguish "goroutine alive and healthy" from "alive but stuck inside a single tick". 2. Wraps every `go X.Start(ctx)` in main.go: - broadcaster.Subscribe (Redis pub/sub relay → WebSocket) - registry.StartLivenessMonitor - registry.StartHealthSweep - scheduler.Start (the one that died yesterday) - channelMgr.Start (Telegram / Slack) 3. Adds `supervised.Heartbeat("scheduler")` inside the scheduler tick loop as the first end-to-end demonstration. Follow-up PRs will add heartbeats to the other four subsystems. 4. Adds `GET /admin/liveness` endpoint returning per-subsystem last_tick_at + seconds_ago. Operators can poll this and alert on any subsystem whose seconds_ago exceeds 2x its cron/tick interval. 5. Unit tests for RunWithRecover (clean return no restart; panic restarts with backoff; ctx cancel stops restart loop) and for the liveness registry. Net new code: ~160 lines + ~100 lines of tests. Refactor of main.go: ~10 line changes. No behavior change on happy path; only lifts what happens on a panic. Closes #92. Supersedes the local recover added to scheduler.go in #90 (kept conceptually, but now via the shared helper).	2026-04-14 20:34:18 -07:00
Backend Engineer	19bdd81ba4	fix(security): C6 — block loopback IP literals in /registry/register A workspace that self-registers with a 127.0.0.x URL on first INSERT could redirect A2A proxy traffic back to the platform itself (SSRF). The previous fix only blocked 169.254.0.0/16 (cloud metadata). Add 127.0.0.0/8 to validateAgentURL's blocklist. RFC-1918 private ranges (10.x, 172.16.x, 192.168.x) remain allowed — Docker container networking depends on them. Safe because the provisioner writes 127.0.0.1 URLs via direct SQL UPDATE, not through /registry/register, so the UPSERT CASE that preserves provisioner URLs is unaffected. Local-dev agents can still register using "localhost" by name (hostname, not IP literal). Tests: removed "valid localhost http" case (now correctly rejected), added "valid localhost name" + three loopback-block assertions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-15 03:34:14 +00:00
rabbitblood	7dc9d83792	fix(scheduler): recover from panics + add liveness watchdog (#85 ) The scheduler died silently on 2026-04-14 14:21 UTC and stayed dead for 12+ hours. Platform restart didn't recover it. Root cause: tick() and fireSchedule() goroutines have no panic recovery. A single bad row, bad cron expression, DB blip, or transient panic anywhere in the chain permanently kills the scheduler goroutine — and the only signal to an operator is "no crons firing", which is invisible if you're not watching. Specifically: func (s *Scheduler) Start(ctx context.Context) { for { select { case <-ticker.C: s.tick(ctx) // <- if this panics, the for-loop exits forever } } } And inside tick: go func(s2 scheduleRow) { defer wg.Done() defer func() { <-sem }() s.fireSchedule(ctx, s2) // <- panic here propagates up wg.Wait() }(sched) Two `defer recover()` additions: 1. In Start's tick wrapper — a panic in tick() (DB scan, cron parse, row processing) is logged and the next tick fires normally. 2. In each fireSchedule goroutine — a single bad workspace can't take the rest of the batch down. Plus a liveness watchdog: - Scheduler now records `lastTickAt` after each successful tick. - New methods `LastTickAt()` and `Healthy()` (true if last tick within 2× pollInterval = 60s). - Initialised at Start so Healthy() returns true on a fresh process. Endpoint plumbing for /admin/scheduler/health is a follow-up — needs threading the scheduler instance through router.Setup(). Documented on #85. Closes the silent-outage failure mode of #85. The other proposed fixes (force-kill on /restart hang, active_tasks watchdog) are separate concerns tracked in #85's comments.	2026-04-14 19:32:01 -07:00
Hongming Wang	e38257ac88	fix(middleware): tenant guard reads bare UUID from state= (no prefix) Pair to molecule-controlplane PR #8. Fly's proxy returns 502 if the fly-replay state value contains '=', so the control plane now puts the bare UUID in state= (no 'org-id=' prefix). TenantGuard now treats the whole 'state=...' value as the org id.	2026-04-14 18:09:44 -07:00
Hongming Wang	522d055758	fix(middleware): TenantGuard accepts org id via Fly-Replay-Src state Phase B.3 pair-fix to the control plane's fly-replay state change. Background: the private molecule-controlplane's router emits `fly-replay: app=X;instance=Y;state=org-id=<uuid>`. Fly's edge replays the request to the tenant and injects `Fly-Replay-Src: instance=Z;...; state=org-id=<uuid>` on the replayed request. But response headers from the cp (like X-Molecule-Org-Id) never travel to the replayed tenant — only the state= param does. TenantGuard now checks both paths in order: 1. Primary: X-Molecule-Org-Id header (direct-access path, e.g. molecli) 2. Secondary: Fly-Replay-Src's `state=org-id=<uuid>` segment (production fly-replay path) Either matching configured MOLECULE_ORG_ID → allow. Neither matches → 404 (still don't leak tenant existence). New helper orgIDFromReplaySrc parses the semicolon-separated Fly-Replay- Src header per Fly's format. Covered by a table-driven test with 7 cases including malformed + empty-header + wrong-state-key. Tests: +3 new TestTenantGuard_* (FlyReplaySrc match, mismatch, table). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 17:54:13 -07:00
Hongming Wang	2094f4f0c2	feat(platform): TenantGuard middleware — public repo's only SaaS hook Phase 32 foundation. The SaaS control plane (private molecule-controlplane repo) provisions one platform instance per customer org on Fly Machines and sets MOLECULE_ORG_ID=<uuid> on the machine. Its subdomain router forwards requests with X-Molecule-Org-Id=<uuid>. TenantGuard: - When MOLECULE_ORG_ID is set → every non-allowlisted request must carry a matching X-Molecule-Org-Id header. Mismatched/missing header → 404 (not 403 — don't leak tenant existence by letting probers distinguish "wrong org" from "route doesn't exist"). - When unset → passthrough. Self-hosted / dev / CI behavior unchanged. - Allowlist is exact-match, not prefix — /health and /metrics only. No orgs table, no signup, no billing, no Fly provisioning in this repo — all that lives in the private control plane. The public repo's SaaS surface is exactly this one middleware. 6 tests covering: unset-is-passthrough, matching header, mismatched header 404 (with empty body), missing header 404, allowlist bypass, and allowlist-is-exact-match. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:20:33 -07:00
Hongming Wang	07a5ca3c51	Merge pull request #76 from Molecule-AI/fix/issue-24-schedules-db-authoritative fix(org): DB-authoritative schedules; org/import is additive on template rows (#24)	2026-04-14 14:40:54 -07:00
Hongming Wang	a921644f9c	fix(schedules): backfill legacy rows to 'template' + extract import SQL const Addresses code-review warnings on PR #76: - Migration 022 now backfills pre-existing workspace_schedules rows to source='template' before flipping NOT NULL + DEFAULT 'runtime'. Legacy rows (all seeded via org/import historically) stay refreshable on re-import. Down migration drops the CHECK constraint too. - Extracted the import UPSERT into const orgImportScheduleSQL so the shape test asserts against the const directly instead of file-scraping org.go. Removed the os.ReadFile helper. - scheduleResponse.Source gets json:\",omitempty\" so old clients that predate the migration don't see an empty string they can't explain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 14:30:22 -07:00
Hongming Wang	608d6745b6	fix(org): use yaml.Marshal for category_routing + newline-guard block appends Addresses code-review warnings on PR #75: - renderCategoryRoutingYAML now builds yaml.Node + yaml.Marshal, escaping YAML-reserved chars in role names correctly (was JSON-as-YAML, fragile on unicode line separators). - New appendYAMLBlock helper guarantees a newline boundary when concatenating YAML fragments into config.yaml (category_routing + initial_prompt both used to risk merging into the previous line). - Fixed struct comment (replace-per-key, not UNION). - Added TestCategoryRouting_EscapesYAMLSpecials and TestAppendYAMLBlock_NewlineGuard. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 14:28:22 -07:00
Hongming Wang	293033de23	fix(org): DB-authoritative schedules; org/import is additive on template rows (#24 ) Resolves #24 per CEO direction. DB is source of truth for workspace_schedules. POST /org/import becomes idempotent — only touches rows it owns (source='template'); runtime-added schedules (Canvas / API) are preserved across re-imports. - Migration 022: adds source TEXT NOT NULL DEFAULT 'runtime' CHECK in ('template','runtime'); unique index on (workspace_id, name) so the org/import upsert can use ON CONFLICT. - org.go: schedule INSERT becomes INSERT ... 'template' ON CONFLICT (workspace_id, name) DO UPDATE SET ... WHERE workspace_schedules.source='template'. Never DELETEs. - schedules.go: runtime POST writes 'runtime' explicitly; List handler surfaces the source field on the response so Canvas can render badges. - 3 new unit tests assert source='runtime' default for runtime CRUD, the SQL shape contract for org/import (additive + idempotent + runtime-preserving + never-DELETE), and List response surface. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 14:09:44 -07:00
Hongming Wang	932ada2c59	feat(platform): generic category_routing replaces hardcoded audit dispatch (#51 ) Add a category_routing block to org.yaml schema (defaults + per-workspace, UNION semantics with per-key replace). The merged routing table is rendered into each workspace's config.yaml at import time. PM's system prompt loses the hardcoded security/ui/infra → role mapping from PR #50; instead it reads category_routing from /configs/config.yaml and delegates to whatever roles the org template lists for the incoming audit-summary's category. Future org templates ship their own routing without prompt churn. Tests: 4 new TestCategoryRouting_* cases covering YAML parse, UNION+drop semantics, deterministic config.yaml render, and empty-map handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 14:06:47 -07:00
Hongming Wang	d9603a77ce	fix(org): per-workspace plugins UNION with defaults; '!' prefix opts out (#68 ) Per-workspace `plugins:` now UNIONS with `defaults.plugins` instead of replacing. A leading `!` or `-` on a per-workspace entry opts a default out. Backward-compatible: re-listing defaults still dedupes to the same list. Refactored the inline REPLACE logic into a pure helper `mergePlugins` in org.go so it's unit-testable. Five TestPlugins_* cases added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 13:21:23 -07:00
Hongming Wang	383582fbbf	Merge pull request #64 from Molecule-AI/fix/issue-15-refresh-oauth-on-restart fix(secrets): auto-refresh global_secrets on workspace restart (#15)	2026-04-14 12:49:19 -07:00
Hongming Wang	c4240e32c1	feat(platform): inject restart context system message (#19 Layer 1) After a workspace restart (HTTP /restart or programmatic RestartByID) and re-registration, the platform sends a synthetic A2A message/send to the workspace containing: - restart timestamp - previous session end timestamp + human duration - env-var keys now available (keys only — never values) The message is rendered in the format proposed in #19 and marked with metadata.kind=restart_context so agents can detect and handle it specifically if they choose. Skip path: if the workspace doesn't re-register within 30s, log and drop. The Restart HTTP response is unaffected by delivery success. Layer 2 (user-defined restart_prompt via config.yaml / org.yaml) is deferred — tracked as a separate follow-up issue. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:41:01 -07:00
Hongming Wang	e658f86c08	fix(secrets): auto-restart workspaces on global secret change (#15 ) Global secrets (e.g. CLAUDE_CODE_OAUTH_TOKEN) are injected as container env vars at Start() time. Until now, rotating one only propagated to a workspace on the next full restart-from-zero, which manual ops had to drive via a `POST /workspaces/:id/restart` loop. Tier-3 Claude Code agents hit the stale-token path first and surfaced as 401s inside the SDK. Restart-time re-read of global_secrets + workspace_secrets was already correct in `provisionWorkspaceOpts` — the missing piece was the trigger. SetGlobal / DeleteGlobal now enqueue RestartByID for every non-paused, non-removed, non-external workspace that does NOT shadow the key with a workspace-level override. Matches the existing behaviour of workspace-scoped `Set` / `Delete`. Adds two sqlmock-backed tests exercising both branches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:39:00 -07:00
Hongming Wang	9c7f57688c	Merge pull request #57 from Molecule-AI/fix/issue-12-preserve-claude-sessions fix(provisioner): preserve Claude session directory across restart (#12)	2026-04-14 12:26:12 -07:00

1 2

65 Commits