Commit Graph

1127 Commits

Author SHA1 Message Date
Hongming Wang
488fde03a7 fix(middleware): TenantGuard passes through /cp/* to CP proxy
Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a
reverse-proxy to the control plane, but the TenantGuard middleware
runs first in the global chain and 404s anything that isn't in its
exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch
from canvas landed on a 40µs 404 before ever reaching the proxy.

/cp/* is handled upstream (WorkOS session + admin bearer), so the
tenant doesn't need to attach org identity for those paths. Passing
them through is correct — matches the design where the tenant
platform is a pure transit layer for /cp/*.

Verified: /cp/auth/me via tunnel now returns 401 (correct unauth
from CP) instead of 404 from TenantGuard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:14:56 -07:00
Hongming Wang
e2ec12292b
Merge pull request #1096 from Molecule-AI/staging
promote: tenant cp-proxy same-origin
2026-04-20 13:01:51 -07:00
Hongming Wang
4ba498ca94
Merge pull request #1095 from Molecule-AI/feat/tenant-cp-proxy-same-origin
feat(router): /cp/* reverse-proxy + same-origin canvas fetches
2026-04-20 13:01:46 -07:00
Hongming Wang
eb4f262d2a feat(router): /cp/* reverse-proxy to CP + same-origin canvas fetches
Canvas's browser bundle issues fetches to both CP endpoints
(/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints
(/canvas/viewport, /approvals/pending, /org/templates). They
share ONE build-time base URL. Baking api.moleculesai.app
broke tenant calls with 404; baking the tenant subdomain broke
auth. Tried both today and saw exactly one failure mode per
attempt.

Real fix: same-origin fetches + tenant-side split. Adds:

  internal/router/cp_proxy.go      # /cp/* → CP_UPSTREAM_URL

mounted before NoRoute(canvasProxy). Now a tenant serves:

  /cp/*              → reverse-proxy to api.moleculesai.app
  /canvas/viewport,
  /approvals/pending,
  /workspaces/:id/*,
  /ws, /registry,    → tenant platform (existing handlers)
  /metrics
  everything else    → canvas UI (existing reverse-proxy)

Canvas middleware reverts to `connect-src 'self' wss:` for the
same-origin path (keeping explicit PLATFORM_URL whitelist as a
self-hosted escape hatch when the build-arg is non-empty).

CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle
issues relative fetches.

Security of cp_proxy:
  - Cookie + Authorization PRESERVED across the hop (opposite of
    canvas proxy) — they carry the WorkOS session, which is the
    whole point.
  - Host rewritten to upstream so CORS + cookie-domain on the CP
    side see their own hostname.
  - Upstream URL validated at construction: must parse, must be
    http(s), must have a host — misconfig fails closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:01:40 -07:00
Hongming Wang
5edc95e279
Merge pull request #1094 from Molecule-AI/staging
promote: CSP platform_url whitelist
2026-04-20 12:55:15 -07:00
Hongming Wang
c0ef6d92bf
Merge pull request #1093 from Molecule-AI/fix/csp-allow-platform-url
fix(canvas): include PLATFORM_URL origin in CSP connect-src
2026-04-20 12:55:09 -07:00
Hongming Wang
1bca58a01b fix(canvas): include NEXT_PUBLIC_PLATFORM_URL in CSP connect-src
Tenant page loads were blocked by:

  Refused to connect to 'https://api.moleculesai.app/cp/auth/me'
  because it violates the document's Content Security Policy.

CSP had `connect-src 'self' wss:` — fine for same-origin + any wss,
but browser refuses cross-origin HTTPS fetches that aren't listed.
PLATFORM_URL (baked from NEXT_PUBLIC_PLATFORM_URL, which is the CP
origin on SaaS tenants) needs to be explicit.

Fix: middleware reads NEXT_PUBLIC_PLATFORM_URL at build/runtime
and adds both the https and wss siblings to connect-src. Self-
hosted deploys that override the build-arg automatically get a
matching CSP — no hardcoded hostname.

Test added: buildCsp includes NEXT_PUBLIC_PLATFORM_URL origin in
connect-src when set. Also loosens the dev `ws:` assertion since
dev uses `connect-src *` which subsumes ws (pre-existing behavior,
test was stale).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:55:03 -07:00
rabbitblood
f787873698 feat: nuke-and-rebuild.sh — one-command fleet reset
Two scripts:
- nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup
- post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT),
  import org template, wait for platform health

Global secrets ensure every provisioned container gets MiniMax API
config and GitHub PAT injected as env vars automatically — no manual
settings.json deployment needed.

Usage: bash scripts/nuke-and-rebuild.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 12:53:30 -07:00
Hongming Wang
1c945d02f5
Merge pull request #1092 from Molecule-AI/staging
promote: bake CP origin into tenant canvas
2026-04-20 12:51:33 -07:00
Hongming Wang
3783e6f5a1
Merge pull request #1091 from Molecule-AI/fix/tenant-canvas-cp-origin
fix(ci): bake api.moleculesai.app into tenant canvas bundle
2026-04-20 12:51:28 -07:00
Hongming Wang
ee40880f39 fix(ci): bake api.moleculesai.app into tenant canvas bundle
Canvas's browser-side code (auth.ts, api.ts, billing.ts) all call
fetch(PLATFORM_URL + /cp/*). PLATFORM_URL comes from
NEXT_PUBLIC_PLATFORM_URL at build time; with the build arg unset,
it falls back to http://localhost:8080 in the compiled bundle.

That means on a tenant like hongmingwang.moleculesai.app, the
user's browser actually tried to fetch http://localhost:8080/cp/
auth/me — which resolves to the USER'S OWN machine, not the tenant.
Login redirect loops 404. Every tenant canvas has been unable to
complete a fresh login on this path; existing sessions only worked
because the cookie was already set domain-wide.

Fix: pass NEXT_PUBLIC_PLATFORM_URL=https://api.moleculesai.app
as a build arg in the tenant-image workflow. CP already allows
CORS from *.moleculesai.app + credentials, and the session cookie
is scoped to .moleculesai.app so tenant subdomains inherit it.

Verified in prod by rebuilding canvas locally with the flag and
hot-patching the hongmingwang instance via SSM. Baked chunks now
contain api.moleculesai.app; browser auth redirects resolve
cleanly to the CP.

Self-hosted users override by rebuilding with their own URL —
same pattern molecule-app uses with NEXT_PUBLIC_CP_ORIGIN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:51:22 -07:00
rabbitblood
6091fca961 fix(auth): accept admin token in CanvasOrBearer for viewport PUT 2026-04-20 12:45:09 -07:00
rabbitblood
d47ca547ac fix(auth): accept admin token in WorkspaceAuth for canvas dashboard
The canvas sends NEXT_PUBLIC_ADMIN_TOKEN on all API calls but per-workspace
routes (/activity, /delegations, /traces) use WorkspaceAuth which only
accepts per-workspace bearer tokens. This made the canvas dashboard 401
on every workspace detail view.

Fix: WorkspaceAuth now accepts the admin token as a fallback after
workspace token validation fails. This lets the canvas read all workspace
data with a single admin credential.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 12:42:43 -07:00
Hongming Wang
05aa0cc787
Merge pull request #1090 from Molecule-AI/staging
promote: canvas CSP nonce fix
2026-04-20 12:34:14 -07:00
Hongming Wang
5babbb47bd
Merge pull request #1089 from Molecule-AI/fix/canvas-csp-nonce-propagation
fix(canvas): root layout dynamic so CSP nonce reaches Next scripts
2026-04-20 12:34:08 -07:00
Hongming Wang
d70aef58f5 fix(canvas): make root layout dynamic so CSP nonce reaches Next scripts
Tenant page loads were failing with repeated CSP violations:

  Executing inline script violates ... script-src 'self'
  'nonce-M2M4YTVh...' 'strict-dynamic'. ...

because Next.js's bootstrap inline scripts were emitted without a
nonce attribute. The middleware was generating per-request nonces
correctly and sending them via `x-nonce` — but the layout was
fully static, so Next.js cached the HTML once and served that cached
bundle (no nonces baked in) for every request.

Fix: call `await headers()` in the root layout. That opts the tree
into dynamic rendering AND signals Next.js to propagate the
x-nonce value to its own generated <script> tags.

The `nonce` return value is intentionally unused — the framework
handles its bootstrap scripts automatically once the read happens.
Future code that adds third-party <Script> components (analytics,
etc.) should pass the returned nonce explicitly.

Verified against live tenant: before this change every /_next/
chunk script tag in the HTML had no nonce attribute; expected after
deploy is `<script nonce="..." src="/_next/...">` on each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:34:03 -07:00
rabbitblood
5f5f70151b fix(canvas): CSP_DEV_MODE + admin token for local Docker (#1052 follow-up)
Three changes that keep getting lost on nuke+rebuild:
1. middleware.ts: read CSP_DEV_MODE env to relax CSP in local Docker
2. api.ts: send NEXT_PUBLIC_ADMIN_TOKEN header (AdminAuth on /workspaces)
3. Dockerfile: accept NEXT_PUBLIC_ADMIN_TOKEN as build arg

All three are required for the canvas to work in local Docker where
canvas (port 3000) fetches from platform (port 8080) cross-origin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 12:23:43 -07:00
rabbitblood
b0ea25cc36 fix(canvas): add NEXT_PUBLIC_ADMIN_TOKEN + CSP_DEV_MODE to docker-compose
Canvas needs AdminAuth token to fetch /workspaces (gated since PR #729)
and CSP_DEV_MODE to allow cross-port fetches in local Docker.

These were added earlier but lost on nuke+rebuild because they weren't
committed to staging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 12:19:12 -07:00
rabbitblood
6e6de392d9 chore: remove org-templates/molecule-dev from git tracking
This directory belongs in the dedicated repo
Molecule-AI/molecule-ai-org-template-molecule-dev.
It should be cloned locally for platform mounting, never
committed to molecule-core. The .gitignore already blocks it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 11:47:13 -07:00
molecule-ai[bot]
5c3ea0b61d
Merge pull request #1088 from Molecule-AI/fix/workspace-purge-delete-1087
fix: add ?purge=true hard-delete to DELETE /workspaces/:id (#1087)
2026-04-20 11:43:40 -07:00
rabbitblood
5a9658f83c fix: add ?purge=true hard-delete to DELETE /workspaces/:id (#1087)
Soft-delete (status='removed') leaves orphan DB rows and FK data forever.
When ?purge=true is passed, after container cleanup the handler cascade-
deletes all leaf FK tables and hard-removes the workspace row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 11:08:44 -07:00
molecule-ai[bot]
7d931afce9
Merge pull request #1085 from Molecule-AI/fix/org-import-concurrency-1084
fix(org-import): limit concurrent Docker provisioning to 3 (#1084)
2026-04-20 10:38:26 -07:00
rabbitblood
5afc759859 fix(org-import): limit concurrent Docker provisioning to 3 (#1084)
The org import fired all workspace provisioning goroutines concurrently,
overwhelming Docker when creating 39+ containers. Containers timed out,
leaving workspaces stuck in 'provisioning' with no schedules or hooks.

Fix:
- Add provisionConcurrency=3 semaphore limiting concurrent Docker ops
- Increase workspaceCreatePacingMs from 50ms to 2000ms between siblings
- Pass semaphore through createWorkspaceTree recursion

With 39 workspaces at 3 concurrent + 2s pacing, import takes ~30s instead
of timing out. Each workspace gets its full template: schedules, hooks,
settings, hierarchy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 10:08:17 -07:00
Hongming Wang
7c3cff22c6
Merge pull request #1083 from Molecule-AI/staging
promote: staging → main (remove dead canvas waitlist)
2026-04-20 09:56:11 -07:00
Hongming Wang
cd4d2c5140
Merge pull request #1082 from Molecule-AI/chore/canvas-remove-waitlist-dead-page
chore(canvas): remove dead /waitlist page (lives in molecule-app)
2026-04-20 09:56:01 -07:00
Hongming Wang
f59473f1fd chore(canvas): remove dead /waitlist page (lives in molecule-app)
#1080 added /waitlist to canvas, but canvas isn't served at
app.moleculesai.app — it backs the tenant subdomains (acme.moleculesai.app
etc.). The real /waitlist lives in the separate molecule-app repo,
which is what the CP auth callback redirects to.

molecule-app#12 has the real page + contact form wiring to
/cp/waitlist/request. This canvas copy was never reachable and would
only diverge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:55:35 -07:00
Hongming Wang
59dd873f26
Merge pull request #1081 from Molecule-AI/staging
promote: staging → main (waitlist page)
2026-04-20 09:47:52 -07:00
Hongming Wang
61ed4ca293
Merge pull request #1080 from Molecule-AI/feat/waitlist-page
feat(canvas): /waitlist page with contact form
2026-04-20 09:47:35 -07:00
Hongming Wang
6bdad3d1b8 feat(canvas): /waitlist page with contact form
Adds the user-facing half of the beta-gate: a page at /waitlist that
the CP auth callback redirects users to when their email isn't on
the allowlist. Collects email + optional name + use-case and POSTs
to /cp/waitlist/request (backend landed in controlplane #150).

## Behavior

- No auto-pre-fill of email from URL query (CP's #145 dropped the
  ?email= param for the privacy reason; this test guards against a
  future regression on the client side).
- Client-side validates email shape for instant feedback; backend
  re-validates.
- Three UI states after submit:
    success → "your request is in" banner, form hidden
    dedup   → softer "already on file" banner when backend returns
              dedup=true (same 200, no 409 to avoid enumeration)
    error   → inline banner with backend message or network fallback

## Tests

9 tests in __tests__/waitlist-page.test.tsx covering:
- default render + a11y (role=button, role=status, role=alert)
- URL-pre-fill privacy regression guard
- HTML5 + JS validation (empty, malformed)
- successful POST with trimmed body
- dedup branch
- non-2xx with + without error field
- network rejection

Follow-up to the beta-gate rollout on controlplane #145 / #150.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:47:06 -07:00
Hongming Wang
4a072ae130
Merge pull request #1077 from Molecule-AI/staging
promote: staging → main (bounded IsRunning body read)
2026-04-20 09:06:54 -07:00
Hongming Wang
dc9f934446
Merge pull request #1076 from Molecule-AI/fix/cp-provisioner-bounded-body-read
fix(cp_provisioner): cap IsRunning body read at 64 KiB
2026-04-20 09:06:36 -07:00
Hongming Wang
2d80f61419 fix(cp_provisioner): cap IsRunning body read at 64 KiB
IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on
CP status responses. Start already caps its body read at 64 KiB
(cp_provisioner.go:137) to defend against a misconfigured or
compromised CP streaming a huge body and exhausting memory.

IsRunning is called reactively per-request from a2a_proxy and
periodically from healthsweep, so it's a hotter path than Start
and arguably deserves the same defense more.

Adds TestIsRunning_BoundedBodyRead that serves a body padded past
the cap and asserts the decode still succeeds on the JSON prefix.

Follow-up to code-review Nit-2 on #1073.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:06:20 -07:00
Hongming Wang
ec99d7b5f1
Merge pull request #1074 from Molecule-AI/staging
promote: staging → main (IsRunning contract fix)
2026-04-20 08:59:07 -07:00
Hongming Wang
35f7193ca9
Merge pull request #1073 from Molecule-AI/fix/isrunning-alive-on-transient
fix(cp_provisioner): IsRunning returns (true, err) on transient failures
2026-04-20 08:58:44 -07:00
Hongming Wang
25b560960a fix(cp_provisioner): IsRunning returns (true, err) on transient failures
My #1071 made IsRunning return (false, err) on all error paths, but that
breaks a2a_proxy which depends on Docker provisioner's (true, err) contract.
Without this fix, any brief CP outage causes a2a_proxy to mark workspaces
offline and trigger restart cascades across every tenant.

Contract now matches Docker.IsRunning:
  transport error    → (true, err)  — alive, degraded signal
  non-2xx response   → (true, err)  — alive, degraded signal
  JSON decode error  → (true, err)  — alive, degraded signal
  2xx state!=running → (false, nil)
  2xx state==running → (true, nil)

healthsweep.go is also happy with this — it skips on err regardless.

Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that
asserts each error path explicitly against the a2a_proxy expectations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 08:58:18 -07:00
Hongming Wang
d29ca3ce22
Merge pull request #1072 from Molecule-AI/staging
chore: promote IsRunning error surfacing to main
2026-04-20 08:50:28 -07:00
Hongming Wang
1fd9aa238c
Merge pull request #1071 from Molecule-AI/fix/isrunning-surface-http-errors
fix(workspace-server): IsRunning surfaces non-2xx + JSON errors
2026-04-20 08:50:03 -07:00
molecule-ai[bot]
3fbf40bf1b
Merge pull request #949 from Molecule-AI/feat/canvas-batch-operations
feat(canvas): batch operations — multi-select + restart/pause/delete
2026-04-20 08:48:26 -07:00
molecule-ai[bot]
78a434dfc1
Merge pull request #1011 from Molecule-AI/test/qa-coverage-orgs-page-and-api-timeout
test(canvas): QA coverage — orgs page polling + API timeout
2026-04-20 08:48:00 -07:00
molecule-ai[bot]
fe3e4366a3
Merge pull request #1015 from Molecule-AI/fix/canary-verify-health-poll-1013
fix(ci): replace sleep 360 with health-check poll in canary-verify (#1013)
2026-04-20 08:47:56 -07:00
Hongming Wang
47a15c340e fix(workspace-server): IsRunning surfaces non-2xx + JSON errors
Pre-existing silent-failure path: IsRunning decoded CP responses
regardless of HTTP status, so a CP 500 → empty body → State="" →
returned (false, nil). The sweeper couldn't distinguish "workspace
stopped" from "CP broken" and would leave a dead row in place.

## Fix

  - Non-2xx → wrapped error, does NOT echo body (CP 5xx bodies may
    contain echoed headers; leaking into logs would expose bearer)
  - JSON decode error → wrapped error
  - Transport error → now wrapped with "cp provisioner: status:"
    prefix for easier log grepping

## Tests

+7 cases (5-status table + malformed JSON + existing transport).
IsRunning coverage 100%; overall cp_provisioner at 98%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 08:47:55 -07:00
molecule-ai[bot]
692625b774
Merge pull request #1016 from Molecule-AI/fix/a11y-workspace-node
fix(a11y): WorkspaceNode font floor, contrast, focus rings
2026-04-20 08:47:53 -07:00
molecule-ai[bot]
67eb87f43b
Merge pull request #1017 from Molecule-AI/fix/rows-err-missing
fix(bundle/exporter): add rows.Err() check + MCP secret scrub
2026-04-20 08:47:49 -07:00
molecule-ai[bot]
e7b2c10c60
Merge pull request #1022 from Molecule-AI/fix/unchecked-exec-workspace-provision
fix(mcp): scrub secrets in commit_memory + MCP handler tests
2026-04-20 08:47:25 -07:00
molecule-ai[bot]
70637ff4f7
Merge pull request #1049 from Molecule-AI/feat/platform-native-hma-instructions
feat(runtime): inject HMA memory instructions at platform level (#1047)
2026-04-20 08:47:20 -07:00
Hongming Wang
b955b97416
Merge pull request #1070 from Molecule-AI/staging
chore: promote workspace-server tenant-auth fix to main
2026-04-20 08:42:08 -07:00
Hongming Wang
df44524f6c merge main into staging for #1070 promotion
# Conflicts:
#	.gitignore
2026-04-20 08:41:58 -07:00
Hongming Wang
4e5071ffe2
Merge pull request #1067 from Molecule-AI/fix/tenant-workspace-auth
fix(workspace-server): send X-Molecule-Admin-Token on CP calls
2026-04-20 08:39:49 -07:00
molecule-ai[bot]
24a75954ff
Merge pull request #1069 from Molecule-AI/fix/github-token-refresh-1068
fix: GitHub token refresh — WorkspaceAuth path for credential helper (#1068)
2026-04-20 08:37:46 -07:00
Hongming Wang
e8943fba6c test(workspace-server): cover Stop/IsRunning/Close + auth-header + transport errors
Closes review gap: pre-PR coverage on CPProvisioner was 37%.
After this commit every exported method is exercised:

  - NewCPProvisioner            100%
  - authHeaders                  100%
  - Start                         91.7% (remainder: json.Marshal error
                                   path, unreachable with fixed-type
                                   request struct)
  - Stop                         100% (new — header + path + error)
  - IsRunning                    100% (new — 4-state matrix + auth)
  - Close                        100% (new — contract no-op)

New cases assert both auth headers (shared secret + admin_token) land
on every outbound request, transport failures surface clear errors
on Start/Stop, and IsRunning doesn't misreport on transport failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 08:37:39 -07:00