Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a
reverse-proxy to the control plane, but the TenantGuard middleware
runs first in the global chain and 404s anything that isn't in its
exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch
from canvas landed on a 40µs 404 before ever reaching the proxy.
/cp/* is handled upstream (WorkOS session + admin bearer), so the
tenant doesn't need to attach org identity for those paths. Passing
them through is correct — matches the design where the tenant
platform is a pure transit layer for /cp/*.
Verified: /cp/auth/me via tunnel now returns 401 (correct unauth
from CP) instead of 404 from TenantGuard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas's browser bundle issues fetches to both CP endpoints
(/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints
(/canvas/viewport, /approvals/pending, /org/templates). They
share ONE build-time base URL. Baking api.moleculesai.app
broke tenant calls with 404; baking the tenant subdomain broke
auth. Tried both today and saw exactly one failure mode per
attempt.
Real fix: same-origin fetches + tenant-side split. Adds:
internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL
mounted before NoRoute(canvasProxy). Now a tenant serves:
  /cp/*                    → reverse-proxy to api.moleculesai.app
  /canvas/viewport,
  /approvals/pending,
  /workspaces/:id/*,
  /ws, /registry, /metrics → tenant platform (existing handlers)
  everything else          → canvas UI (existing reverse-proxy)
Canvas middleware reverts to `connect-src 'self' wss:` for the
same-origin path (keeping explicit PLATFORM_URL whitelist as a
self-hosted escape hatch when the build-arg is non-empty).
CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle
issues relative fetches.
Security of cp_proxy:
- Cookie + Authorization PRESERVED across the hop (opposite of
canvas proxy) — they carry the WorkOS session, which is the
whole point.
- Host rewritten to upstream so CORS + cookie-domain on the CP
side see their own hostname.
- Upstream URL validated at construction: must parse, must be
http(s), must have a host — misconfig fails closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant page loads were blocked by:
Refused to connect to 'https://api.moleculesai.app/cp/auth/me'
because it violates the document's Content Security Policy.
CSP had `connect-src 'self' wss:` — fine for same-origin + any wss,
but browser refuses cross-origin HTTPS fetches that aren't listed.
PLATFORM_URL (baked from NEXT_PUBLIC_PLATFORM_URL, which is the CP
origin on SaaS tenants) needs to be explicit.
Fix: middleware reads NEXT_PUBLIC_PLATFORM_URL at build/runtime
and adds both the https and wss siblings to connect-src. Self-
hosted deploys that override the build-arg automatically get a
matching CSP — no hardcoded hostname.
Test added: buildCsp includes NEXT_PUBLIC_PLATFORM_URL origin in
connect-src when set. Also loosens the dev `ws:` assertion since
dev uses `connect-src *` which subsumes ws (pre-existing behavior,
test was stale).
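The connect-src derivation is simple enough to sketch; the real implementation is the canvas TypeScript middleware, so this Go `connectSrc` is purely illustrative of the logic:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// connectSrc mirrors the middleware logic for illustration only: given the
// NEXT_PUBLIC_PLATFORM_URL origin it appends both the https and wss
// siblings of that host; when the build-arg is empty it stays same-origin.
func connectSrc(platformURL string) string {
	parts := []string{"'self'", "wss:"}
	if u, err := url.Parse(platformURL); err == nil && u.Host != "" {
		parts = append(parts, "https://"+u.Host, "wss://"+u.Host)
	}
	return "connect-src " + strings.Join(parts, " ")
}

func main() {
	fmt.Println(connectSrc("https://api.moleculesai.app"))
}
```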
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scripts:
- nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup
- post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT),
import org template, wait for platform health
Global secrets ensure every provisioned container gets MiniMax API
config and GitHub PAT injected as env vars automatically — no manual
settings.json deployment needed.
Usage: bash scripts/nuke-and-rebuild.sh
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Canvas's browser-side code (auth.ts, api.ts, billing.ts) all call
fetch(PLATFORM_URL + /cp/*). PLATFORM_URL comes from
NEXT_PUBLIC_PLATFORM_URL at build time; with the build arg unset,
it falls back to http://localhost:8080 in the compiled bundle.
That means on a tenant like hongmingwang.moleculesai.app, the
user's browser actually tried to fetch
http://localhost:8080/cp/auth/me, which resolves to the USER'S
OWN machine, not the tenant. The login redirect loops into a 404.
Every tenant canvas has been unable to complete a fresh login on
this path; existing sessions only worked because the cookie was
already set domain-wide.
Fix: pass NEXT_PUBLIC_PLATFORM_URL=https://api.moleculesai.app
as a build arg in the tenant-image workflow. CP already allows
CORS from *.moleculesai.app + credentials, and the session cookie
is scoped to .moleculesai.app so tenant subdomains inherit it.
Verified in prod by rebuilding canvas locally with the flag and
hot-patching the hongmingwang instance via SSM. Baked chunks now
contain api.moleculesai.app; browser auth redirects resolve
cleanly to the CP.
Self-hosted users override by rebuilding with their own URL —
same pattern molecule-app uses with NEXT_PUBLIC_CP_ORIGIN.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canvas sends NEXT_PUBLIC_ADMIN_TOKEN on all API calls but per-workspace
routes (/activity, /delegations, /traces) use WorkspaceAuth which only
accepts per-workspace bearer tokens. This made the canvas dashboard 401
on every workspace detail view.
Fix: WorkspaceAuth now accepts the admin token as a fallback after
workspace token validation fails. This lets the canvas read all workspace
data with a single admin credential.
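A schematic sketch of the fallback order, assuming plain bearer strings (`authorize` and its parameters are illustrative, not the real WorkspaceAuth middleware):

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// authorize sketches the fallback: the per-workspace token is checked
// first, and only after that fails is the admin token tried. Constant-time
// comparison avoids leaking token prefixes through timing.
func authorize(bearer, workspaceToken, adminToken string) bool {
	if subtle.ConstantTimeCompare([]byte(bearer), []byte(workspaceToken)) == 1 {
		return true
	}
	// fallback: the canvas's single admin credential reads any workspace
	return adminToken != "" && subtle.ConstantTimeCompare([]byte(bearer), []byte(adminToken)) == 1
}

func main() {
	fmt.Println(authorize("admin-tok", "ws-tok", "admin-tok"))
}
```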
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tenant page loads were failing with repeated CSP violations:
Executing inline script violates ... script-src 'self'
'nonce-M2M4YTVh...' 'strict-dynamic'. ...
because Next.js's bootstrap inline scripts were emitted without a
nonce attribute. The middleware was generating per-request nonces
correctly and sending them via `x-nonce` — but the layout was
fully static, so Next.js cached the HTML once and served that cached
copy (no nonces baked in) for every request.
Fix: call `await headers()` in the root layout. That opts the tree
into dynamic rendering AND signals Next.js to propagate the
x-nonce value to its own generated <script> tags.
The `nonce` return value is intentionally unused — the framework
handles its bootstrap scripts automatically once the read happens.
Future code that adds third-party <Script> components (analytics,
etc.) should pass the returned nonce explicitly.
Verified against live tenant: before this change every /_next/
chunk script tag in the HTML had no nonce attribute; expected after
deploy is `<script nonce="..." src="/_next/...">` on each.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that keep getting lost on nuke+rebuild:
1. middleware.ts: read CSP_DEV_MODE env to relax CSP in local Docker
2. api.ts: send NEXT_PUBLIC_ADMIN_TOKEN header (AdminAuth on /workspaces)
3. Dockerfile: accept NEXT_PUBLIC_ADMIN_TOKEN as build arg
All three are required for the canvas to work in local Docker where
canvas (port 3000) fetches from platform (port 8080) cross-origin.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Canvas needs AdminAuth token to fetch /workspaces (gated since PR #729)
and CSP_DEV_MODE to allow cross-port fetches in local Docker.
These were added earlier but lost on nuke+rebuild because they weren't
committed to staging.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This directory belongs in the dedicated repo
Molecule-AI/molecule-ai-org-template-molecule-dev.
It should be cloned locally for platform mounting, never
committed to molecule-core. The .gitignore already blocks it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Soft-delete (status='removed') leaves orphan DB rows and FK data forever.
When ?purge=true is passed, after container cleanup the handler cascade-
deletes all leaf FK tables and hard-removes the workspace row.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The org import fired all workspace provisioning goroutines concurrently,
overwhelming Docker when creating 39+ containers. Containers timed out,
leaving workspaces stuck in 'provisioning' with no schedules or hooks.
Fix:
- Add provisionConcurrency=3 semaphore limiting concurrent Docker ops
- Increase workspaceCreatePacingMs from 50ms to 2000ms between siblings
- Pass semaphore through createWorkspaceTree recursion
With 39 workspaces at 3 concurrent + 2s pacing, import takes ~30s instead
of timing out. Each workspace gets its full template: schedules, hooks,
settings, hierarchy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#1080 added /waitlist to canvas, but canvas isn't served at
app.moleculesai.app — it backs the tenant subdomains (acme.moleculesai.app
etc.). The real /waitlist lives in the separate molecule-app repo,
which is what the CP auth callback redirects to.
molecule-app#12 has the real page + contact form wiring to
/cp/waitlist/request. This canvas copy was never reachable and would
only diverge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the user-facing half of the beta-gate: a page at /waitlist that
the CP auth callback redirects users to when their email isn't on
the allowlist. Collects email + optional name + use-case and POSTs
to /cp/waitlist/request (backend landed in controlplane #150).
## Behavior
- No auto-pre-fill of email from the URL query (CP's #145 dropped the
  ?email= param for privacy reasons; a test guards against a
  future regression on the client side).
- Client-side validates email shape for instant feedback; backend
re-validates.
- Three UI states after submit:
success → "your request is in" banner, form hidden
dedup → softer "already on file" banner when backend returns
dedup=true (same 200, no 409 to avoid enumeration)
error → inline banner with backend message or network fallback
## Tests
9 tests in __tests__/waitlist-page.test.tsx covering:
- default render + a11y (role=button, role=status, role=alert)
- URL-pre-fill privacy regression guard
- HTML5 + JS validation (empty, malformed)
- successful POST with trimmed body
- dedup branch
- non-2xx with + without error field
- network rejection
Follow-up to the beta-gate rollout on controlplane #145 / #150.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
IsRunning used an unbounded json.NewDecoder(resp.Body).Decode on
CP status responses. Start already caps its body read at 64 KiB
(cp_provisioner.go:137) to defend against a misconfigured or
compromised CP streaming a huge body and exhausting memory.
IsRunning is called reactively per-request from a2a_proxy and
periodically from healthsweep, so it's a hotter path than Start
and arguably deserves the same defense more.
Adds TestIsRunning_BoundedBodyRead that serves a body padded past
the cap and asserts the decode still succeeds on the JSON prefix.
Follow-up to code-review Nit-2 on #1073.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
My #1071 made IsRunning return (false, err) on all error paths, but that
breaks a2a_proxy which depends on Docker provisioner's (true, err) contract.
Without this fix, any brief CP outage causes a2a_proxy to mark workspaces
offline and trigger restart cascades across every tenant.
Contract now matches Docker.IsRunning:
transport error → (true, err) — alive, degraded signal
non-2xx response → (true, err) — alive, degraded signal
JSON decode error → (true, err) — alive, degraded signal
2xx state!=running → (false, nil)
2xx state==running → (true, nil)
healthsweep.go is also happy with this — it skips on err regardless.
Adds TestIsRunning_ContractCompat_A2AProxy as regression guard that
asserts each error path explicitly against the a2a_proxy expectations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-existing silent-failure path: IsRunning decoded CP responses
regardless of HTTP status, so a CP 500 → empty body → State="" →
returned (false, nil). The sweeper couldn't distinguish "workspace
stopped" from "CP broken" and would leave a dead row in place.
## Fix
- Non-2xx → wrapped error, does NOT echo body (CP 5xx bodies may
contain echoed headers; leaking into logs would expose bearer)
- JSON decode error → wrapped error
- Transport error → now wrapped with "cp provisioner: status:"
prefix for easier log grepping
## Tests
+7 cases (5-status table + malformed JSON + existing transport).
IsRunning coverage 100%; overall cp_provisioner at 98%.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes review gap: pre-PR coverage on CPProvisioner was 37%.
After this commit every exported method is exercised:
- NewCPProvisioner 100%
- authHeaders 100%
- Start 91.7% (remainder: json.Marshal error path,
  unreachable with fixed-type request struct)
- Stop 100% (new — header + path + error)
- IsRunning 100% (new — 4-state matrix + auth)
- Close 100% (new — contract no-op)
New cases assert both auth headers (shared secret + admin_token) land
on every outbound request, transport failures surface clear errors
on Start/Stop, and IsRunning doesn't misreport on transport failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>