The monorepo docs/ tree is for ecosystem and user-facing docs. The
internal roadmap ("what we'll build next", priorities, effort
estimates) doesn't belong there: customers reading our docs don't
need our backlog in their face, and we shouldn't appear to commit
to "feature X is coming" when it's just a P2 item in internal
tracking.
Removes:
- docs/architecture/org-api-keys-followups.md (the whole
prioritized roadmap). Moved to the internal repo at
runbooks/org-api-keys-followups.md where it belongs.
- "Follow-up roadmap" section in docs/architecture/org-api-keys.md,
replaced with a shorter "Known limitations" section
that names the current constraints (full-admin only, no
expiry, no user_id in session-minted audit) without
speculating on when they change.
- "What's coming" section in docs/guides/org-api-keys.md,
replaced with "Current limits" that names the same
constraints from the user's POV.
Public docs now describe the feature as it exists TODAY. Internal
tracking of what comes next lives in Molecule-AI/internal (private).
Addresses the Critical + Important findings from today's code
review of the org API keys feature (PRs #1105-1108).
## Critical-1: rate-limit mint endpoint
Previously POST /org/tokens had no mint-rate limit. A compromised
WorkOS session or leaked bearer could mint thousands of tokens in
seconds, forcing a painful manual cleanup of each one.
Fix: dedicated per-IP token bucket, 10 mints/hour/IP. Legitimate
bursts fit under the ceiling; abuse bounces. List + Delete stay
on the global limiter — they can't be used to generate new
secret material.
## Important-1: HTTP handler integration tests
internal/orgtoken had 9 unit tests; the HTTP layer (org_tokens.go)
had none. Adds org_tokens_test.go covering:
- List happy path + DB error → 500
- Create actor="admin-token" (bootstrap), actor="org-token:<prefix>"
(chained mint), actor="session" (canvas browser path)
- Create name>100 chars → 400
- Create with empty body mints with no name
- Revoke happy path 200, missing id 404, empty id 400
- Plaintext returned in response body and prefix matches first 8 chars
- Warning text present
A regression that breaks the tier-ordering, drops the createdBy
field, or accepts oversized names now fails at CI not prod.
## Important-2: bound List output
List() had no LIMIT, so a mint-storm bug or abuse could make the
admin UI slow to render and allocate memory in proportion to the
row count. Adds LIMIT 500 at the SQL layer: about 10x the realistic
ceiling, a guardrail against pathological cases.
## Important-3: audit provenance uses plaintext prefix, not UUID
orgTokenActor() was logging "org-token:<first-8-of-uuid>" which
couldn't be cross-referenced with the UI (which shows first-8
of the plaintext). Users could not correlate "who minted this"
audit entries with the revoke button they're looking at.
Fix: Validate() now returns (id, prefix, error). Middleware
stashes both on the gin context. Handler reads prefix for the
actor string. Audit rows now match UI prefixes exactly.
## Nit: named constants for audit labels
actorOrgTokenPrefix / actorSession / actorAdminToken replace
the hardcoded strings scattered across the handler. Greppable
across log pipelines + audit queries; one place to change if
the format evolves.
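A sketch of how the named constants and the prefix-based actor string could fit together. The constant names and string values come from this message; the helper `mintActor`, its signature, and its precedence order are assumptions for illustration only.

```go
package main

import "fmt"

// Audit labels written to created_by. Names and values as described above.
const (
	actorOrgTokenPrefix = "org-token:"
	actorSession        = "session"
	actorAdminToken     = "admin-token"
)

// mintActor is a hypothetical helper. orgTokenPrefix is the first 8
// characters of the minting token's plaintext (empty when the caller did
// not authenticate with an org token); viaAdminToken is true for the
// ADMIN_TOKEN bootstrap path.
func mintActor(orgTokenPrefix string, viaAdminToken bool) string {
	switch {
	case orgTokenPrefix != "":
		return actorOrgTokenPrefix + orgTokenPrefix
	case viaAdminToken:
		return actorAdminToken
	default:
		return actorSession
	}
}

func main() {
	fmt.Println(mintActor("ab12cd34", false))
	fmt.Println(mintActor("", true))
	fmt.Println(mintActor("", false))
}
```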
## Tests
- internal/orgtoken: 9 existing + 0 new, all still green (updated
signatures for Validate returning prefix).
- internal/handlers/org_tokens_test.go: new — 9 HTTP-layer tests
above. Full gin.Context + sqlmock harness.
- Full `go test ./...` green except one pre-existing
TestGitHubToken_NoTokenProvider flake unrelated to this change
(expects 404, gets 500 — tracked separately).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends WorkspaceAuth to accept org API tokens as a valid
credential for any workspace sub-route in the org. Previously a
user minting an org token could hit admin-surface endpoints
(/workspaces, /org/import, etc.) but couldn't reach per-workspace
routes like /workspaces/:id/channels — those were gated by
WorkspaceAuth which only knew about workspace-scoped tokens.
Scope matches the explicit product spec: one org API key can
manipulate every workspace in the org. AI agents given a key can
read/write channels, tokens, schedules, secrets, tasks across all
workspaces.
## WorkspaceAuth tier order
1. ADMIN_TOKEN exact match (break-glass / bootstrap)
2. Org API token (Validate against org_api_tokens) NEW
3. Workspace-scoped token (ValidateToken with :id binding)
4. Same-origin canvas referer
The org token tier sits above the per-workspace check so that a
caller presenting an org key doesn't hit the narrower ValidateToken
failure path first. The isSameOriginCanvas path is unchanged.
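The tier order above can be sketched as a first-match chain. Everything here is illustrative: the `request` and `authDeps` types, the validator callbacks, and the returned labels are hypothetical stand-ins for the real middleware, and the constant-time comparison for ADMIN_TOKEN is an assumption about good practice rather than a statement about the actual code.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// request is a hypothetical, reduced view of an incoming call.
type request struct {
	bearer      string
	workspaceID string
	sameOrigin  bool // stand-in for the same-origin canvas referer check
}

// authDeps bundles the checks each tier relies on (all hypothetical).
type authDeps struct {
	adminToken    string
	orgTokenValid func(bearer string) bool
	wsTokenValid  func(bearer, workspaceID string) bool
}

// workspaceAuthTier returns the first tier that accepts the request,
// or "" when every tier rejects it.
func workspaceAuthTier(r request, d authDeps) string {
	switch {
	case d.adminToken != "" &&
		subtle.ConstantTimeCompare([]byte(r.bearer), []byte(d.adminToken)) == 1:
		return "admin-token" // tier 1: break-glass / bootstrap
	case r.bearer != "" && d.orgTokenValid(r.bearer):
		return "org-token" // tier 2: any workspace in the org
	case r.bearer != "" && d.wsTokenValid(r.bearer, r.workspaceID):
		return "workspace-token" // tier 3: bound to :id
	case r.sameOrigin:
		return "same-origin-canvas" // tier 4
	}
	return ""
}

func main() {
	d := authDeps{
		adminToken:    "adm",
		orgTokenValid: func(b string) bool { return b == "org" },
		wsTokenValid:  func(b, id string) bool { return b == "ws" && id == "w1" },
	}
	fmt.Println(workspaceAuthTier(request{bearer: "org", workspaceID: "w2"}, d))
	fmt.Println(workspaceAuthTier(request{bearer: "ws", workspaceID: "w2"}, d) == "")
}
```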
## End-to-end verified
Minted test token via ADMIN_TOKEN, then with that org token:
- GET /workspaces → 200 (list all)
- GET /workspaces/<id> → 200 (detail, admin-only route)
- GET /workspaces/<id>/channels → 200 (workspace sub-route)
- GET /workspaces/<id>/tokens → 200 (workspace tokens list)
- GET /workspaces/<bad-uuid> → 404 workspace not found
(routing still scoped correctly)
## Documentation
- docs/architecture/org-api-keys.md — design, data model, threat
model, security properties
- docs/architecture/org-api-keys-followups.md — 10 tracked
follow-ups prioritized (role scoping P1, per-workspace binding
P1, expiry P2, usage metrics P2, WorkOS user_id capture P2,
rotation webhooks P3, mint-rate limit P3, audit log P2, CLI
P3, migrate ADMIN_TOKEN to the same table P4)
- docs/guides/org-api-keys.md — end-user guide (mint via UI,
use in curl/Python/TS/AI agents, session-vs-key comparison)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds user-facing API keys with full-org admin scope. Replaces the
single ADMIN_TOKEN env var with named, revocable, audited tokens
that users can mint/rotate from the canvas UI without ops
intervention.
Designed for the beta growth phase: one token tier (full admin).
Future work will split into scoped roles (admin / workspace-write
/ read-only) and per-workspace bindings. See
docs/architecture/org-api-keys.md for the design + follow-up roadmap.
## Surface
POST /org/tokens mint (plaintext returned once)
GET /org/tokens list live keys (prefix-only)
DELETE /org/tokens/:id revoke (idempotent)
All AdminAuth-gated. Bootstrap path: mint the first token via
ADMIN_TOKEN or canvas session; tokens can mint more tokens after.
## Validation as a new AdminAuth tier (2a)
AdminAuth evaluation order:
Tier 0 lazy-bootstrap fail-open (only when no live tokens AND
no ADMIN_TOKEN env)
Tier 1 verified WorkOS session via /cp/auth/tenant-member
Tier 2a org_api_tokens SELECT — NEW
Tier 2b ADMIN_TOKEN env (bootstrap / CLI break-glass)
Tier 3 any live workspace token (deprecated, only when ADMIN_TOKEN
unset)
Tier 2a runs ONE indexed lookup (partial index on
token_hash WHERE revoked_at IS NULL) + an async last_used_at
bump. No measurable latency cost on the hot path.
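A minimal in-memory sketch of the Tier 2a lookup: hash the presented plaintext with sha256, look it up among live (non-revoked) rows, and collapse every failure into one error. The `store` type, the 32-random-byte hex token format, and the method names are assumptions; the real code queries org_api_tokens through the partial index rather than a map.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ErrInvalidToken is returned for every failure shape so callers cannot
// distinguish "never existed" from "revoked".
var ErrInvalidToken = errors.New("invalid token")

type storedToken struct {
	prefix  string // first 8 chars of the plaintext, for display
	revoked bool
}

// store maps hex(sha256(plaintext)) to token metadata. The plaintext
// itself is never kept.
type store map[string]*storedToken

func hashToken(plaintext string) string {
	sum := sha256.Sum256([]byte(plaintext))
	return hex.EncodeToString(sum[:])
}

// mint creates a token and returns the plaintext exactly once.
func (s store) mint() (string, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return "", err
	}
	plaintext := hex.EncodeToString(raw) // hypothetical token format
	s[hashToken(plaintext)] = &storedToken{prefix: plaintext[:8]}
	return plaintext, nil
}

// revoke marks the token as revoked; unknown tokens are a no-op.
func (s store) revoke(plaintext string) {
	if t, ok := s[hashToken(plaintext)]; ok {
		t.revoked = true
	}
}

// validate returns the display prefix for a live token.
func (s store) validate(plaintext string) (string, error) {
	if plaintext == "" {
		return "", ErrInvalidToken
	}
	t, ok := s[hashToken(plaintext)]
	if !ok || t.revoked {
		return "", ErrInvalidToken
	}
	return t.prefix, nil
}

func main() {
	s := store{}
	tok, err := s.mint()
	if err != nil {
		panic(err)
	}
	prefix, err := s.validate(tok)
	fmt.Println(prefix == tok[:8], err)
	s.revoke(tok)
	_, err = s.validate(tok)
	fmt.Println(err)
}
```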
## UI
New "Org API Keys" tab in the settings panel. Label field for
human-readable naming. Plaintext shown once + clipboard copy.
Revoke with confirm dialog. Mirrors the existing workspace-
TokensTab flow so users who've used one get the other for free.
## Security properties
- Plaintext never stored. sha256 hash + 8-char display prefix.
- Revocation is immediate: partial index on revoked_at IS NULL
means the next request validates or fails in microseconds.
- created_by audit field captures provenance: "org-token:<short>"
when a token mints another, "session" for browser-UI mints,
"admin-token" for the ADMIN_TOKEN bootstrap path.
- Validate() collapses all failure shapes into ErrInvalidToken
so response-shape can't distinguish "never existed" from
"revoked".
## Tests
- internal/orgtoken: 9 unit tests (hash storage, empty field
null-ing, validation happy path, empty plaintext, unknown hash,
revoked filtering, list ordering, revoke idempotency, has-any-
live short-circuit).
- AdminAuth tier-2a integration covered by existing middleware
tests unchanged (fail-open + bearer paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.
## Critical-1: session verification scoped to the current tenant
session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.
Fix:
- CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
org_members × organizations and only returns member:true when
the authenticated user is actually in that org.
- Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
- session_auth now calls tenant-member (not /me), passing its
own slug. Cache key includes slug so one tenant's cached
positive never satisfies another's check.
## Critical-2: cp_proxy path allowlist (lateral-movement fix)
cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.
Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.
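A sketch of a fail-closed prefix allowlist along these lines. The function name isCPProxyAllowedPath appears in the tests section of this message; the matching rules shown here (path cleaning, segment-boundary prefix match) are assumptions and may differ from the real implementation.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Prefixes the canvas browser bundle needs, as listed above.
var cpProxyAllowedPrefixes = []string{
	"/cp/auth",
	"/cp/orgs",
	"/cp/billing",
	"/cp/templates",
	"/cp/legal",
}

// isCPProxyAllowedPath reports whether p may be forwarded upstream.
// The path is cleaned first so "/cp/auth/../admin" cannot slip through,
// and a prefix only matches on a segment boundary so "/cp/authx" is
// rejected.
func isCPProxyAllowedPath(p string) bool {
	clean := path.Clean("/" + p)
	for _, prefix := range cpProxyAllowedPrefixes {
		if clean == prefix || strings.HasPrefix(clean, prefix+"/") {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"/cp/auth/me",
		"/cp/admin/tenants/other-slug/diagnostics",
		"/cp/auth/../admin/tenants",
	} {
		fmt.Printf("%-45s allowed=%v\n", p, isCPProxyAllowedPath(p))
	}
}
```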
## Important-1,2: bounded session cache + split positive/negative TTL
Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.
Fix:
- Bounded map with batch random eviction at cap (10k entries ×
~100 bytes = 1 MB ceiling). Random eviction is O(1)
expected; we don't need precise LRU.
- Periodic sweeper goroutine (2 min) reclaims expired entries
even when they're not re-hit.
- Positive TTL 30s, negative TTL 5s — short negative so CP
flakes self-heal fast.
- Transport errors NOT cached (would otherwise trap every
user during a multi-second upstream outage).
- Cache key = sha256(slug + cookie) so raw session tokens
don't sit in process memory, and cross-tenant isolation is
structural not policy.
## Important-3: TenantGuard /cp/* bypass documented
Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).
## Tests
- session_auth_test.go: 12 new tests — empty cookie, missing
slug, no CP, member:true happy path with cache hit, member:
false, 401 upstream, malformed JSON, transport error not
cached, cross-tenant isolation (same cookie different
tenants hit upstream separately), bounded eviction, expired
entries, cache key collision resistance.
- cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
allow/block cases, forwarding preserves Cookie+Auth, Host
rewritten, blocked paths 404 without calling upstream.
All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The github-app-auth plugin's go.mod had a relative replace directive
(../molecule-monorepo/platform) that didn't resolve in Docker where
the plugin is at /plugin/ and the platform at /app/. This caused the
plugin's provisionhook.TokenProvider interface to come from a different
package path than the platform's, so the type assertion in
FirstTokenProvider() failed — "no token provider registered".
Fix: sed the plugin's go.mod replace to point at /app during Docker build.
Also added debug logging to GetInstallationToken for future diagnosis.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Canvas (SaaS tenant UI) runs in the browser and authenticates the
user via a WorkOS session cookie scoped to .moleculesai.app. It
has no bearer token — the token-based ADMIN_TOKEN scheme is for
CLI + server-to-server callers, not end users.
Adds a session-verification tier to AdminAuth that runs BEFORE the
bearer check:
1. If Cookie header present AND CP_UPSTREAM_URL configured →
GET /cp/auth/me upstream with the same cookie. 200 + valid
user_id → grant admin access. Non-200 → fall through.
2. Else (no cookie, or no CP configured, or CP said no) →
existing bearer-only path unchanged.
Positive verifications are cached 30s keyed by the raw Cookie
header, so a burst of canvas admin-page renders doesn't DDoS
the CP. Revocations propagate within that window.
Self-hosted / dev deploys without CP_UPSTREAM_URL: feature
disabled, behavior unchanged. So this is strictly additive for
the SaaS case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a
reverse-proxy to the control plane, but the TenantGuard middleware
runs first in the global chain and 404s anything that isn't in its
exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch
from canvas landed on a 40µs 404 before ever reaching the proxy.
/cp/* is handled upstream (WorkOS session + admin bearer), so the
tenant doesn't need to attach org identity for those paths. Passing
them through is correct — matches the design where the tenant
platform is a pure transit layer for /cp/*.
Verified: /cp/auth/me via tunnel now returns 401 (correct unauth
from CP) instead of 404 from TenantGuard.
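A reduced sketch of the bypass decision, assuming the guard keeps an exact-path allowlist and now additionally lets the /cp/ prefix through. The names are hypothetical, and what TenantGuard checks for every other path is not described in this message, so it is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// Exact paths that skipped the guard before this change.
var tenantGuardExact = map[string]bool{
	"/health":  true,
	"/metrics": true,
}

// tenantGuardBypass reports whether a path skips the guard entirely.
// /cp/* is passed through because auth for it happens upstream.
func tenantGuardBypass(p string) bool {
	return tenantGuardExact[p] || strings.HasPrefix(p, "/cp/")
}

func main() {
	for _, p := range []string{"/cp/auth/me", "/health", "/workspaces"} {
		fmt.Printf("%-12s bypass=%v\n", p, tenantGuardBypass(p))
	}
}
```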
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitGuardian detected exposed MiniMax API key and GitHub PAT in the
script's default values. Replaced with env var reads from .env file
(which is gitignored). Script now validates required secrets exist
before proceeding.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Canvas's browser bundle issues fetches to both CP endpoints
(/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints
(/canvas/viewport, /approvals/pending, /org/templates). They
share ONE build-time base URL. Baking api.moleculesai.app
broke tenant calls with 404; baking the tenant subdomain broke
auth. Tried both today and saw exactly one failure mode per
attempt.
Real fix: same-origin fetches + tenant-side split. Adds:
internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL
mounted before NoRoute(canvasProxy). Now a tenant serves:
/cp/* → reverse-proxy to api.moleculesai.app
/canvas/viewport,
/approvals/pending,
/workspaces/:id/*,
/ws, /registry, → tenant platform (existing handlers)
/metrics
everything else → canvas UI (existing reverse-proxy)
Canvas middleware reverts to `connect-src 'self' wss:` for the
same-origin path (keeping explicit PLATFORM_URL whitelist as a
self-hosted escape hatch when the build-arg is non-empty).
CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle
issues relative fetches.
Security of cp_proxy:
- Cookie + Authorization PRESERVED across the hop (opposite of
canvas proxy) — they carry the WorkOS session, which is the
whole point.
- Host rewritten to upstream so CORS + cookie-domain on the CP
side see their own hostname.
- Upstream URL validated at construction: must parse, must be
http(s), must have a host — misconfig fails closed.
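The construction-time validation described in the last bullet could look roughly like this, using only `net/url`. The function name `validateUpstream` and the error wording are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateUpstream rejects any upstream URL that does not parse, is not
// http(s), or has no host, so a misconfigured value fails closed.
func validateUpstream(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, fmt.Errorf("upstream URL does not parse: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("upstream URL scheme %q is not http(s)", u.Scheme)
	}
	if u.Host == "" {
		return nil, fmt.Errorf("upstream URL %q has no host", raw)
	}
	return u, nil
}

func main() {
	for _, raw := range []string{
		"https://api.moleculesai.app",
		"api.moleculesai.app", // no scheme
		"ftp://example.com",   // wrong scheme
	} {
		_, err := validateUpstream(raw)
		fmt.Printf("%-30q ok=%v\n", raw, err == nil)
	}
}
```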
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tenant page loads were blocked by:
Refused to connect to 'https://api.moleculesai.app/cp/auth/me'
because it violates the document's Content Security Policy.
CSP had `connect-src 'self' wss:` — fine for same-origin + any wss,
but browser refuses cross-origin HTTPS fetches that aren't listed.
PLATFORM_URL (baked from NEXT_PUBLIC_PLATFORM_URL, which is the CP
origin on SaaS tenants) needs to be explicit.
Fix: middleware reads NEXT_PUBLIC_PLATFORM_URL at build/runtime
and adds both the https and wss siblings to connect-src. Self-
hosted deploys that override the build-arg automatically get a
matching CSP — no hardcoded hostname.
Test added: buildCsp includes NEXT_PUBLIC_PLATFORM_URL origin in
connect-src when set. Also loosens the dev `ws:` assertion since
dev uses `connect-src *` which subsumes ws (pre-existing behavior,
test was stale).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two scripts:
- nuke-and-rebuild.sh: docker down -v, clean orphans, rebuild, setup
- post-rebuild-setup.sh: insert global secrets (MiniMax + GH PAT),
import org template, wait for platform health
Global secrets ensure every provisioned container gets MiniMax API
config and GitHub PAT injected as env vars automatically — no manual
settings.json deployment needed.
Usage: bash scripts/nuke-and-rebuild.sh
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Canvas's browser-side code (auth.ts, api.ts, billing.ts) all call
fetch(PLATFORM_URL + /cp/*). PLATFORM_URL comes from
NEXT_PUBLIC_PLATFORM_URL at build time; with the build arg unset,
it falls back to http://localhost:8080 in the compiled bundle.
That means on a tenant like hongmingwang.moleculesai.app, the
user's browser actually tried to fetch
http://localhost:8080/cp/auth/me, which resolves to the USER'S OWN
machine, not the tenant. The login redirect loops and 404s. Every
tenant canvas has been unable to complete a fresh login on this
path; existing sessions only worked because the cookie was already
set domain-wide.
Fix: pass NEXT_PUBLIC_PLATFORM_URL=https://api.moleculesai.app
as a build arg in the tenant-image workflow. CP already allows
CORS from *.moleculesai.app + credentials, and the session cookie
is scoped to .moleculesai.app so tenant subdomains inherit it.
Verified in prod by rebuilding canvas locally with the flag and
hot-patching the hongmingwang instance via SSM. Baked chunks now
contain api.moleculesai.app; browser auth redirects resolve
cleanly to the CP.
Self-hosted users override by rebuilding with their own URL —
same pattern molecule-app uses with NEXT_PUBLIC_CP_ORIGIN.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canvas sends NEXT_PUBLIC_ADMIN_TOKEN on all API calls but per-workspace
routes (/activity, /delegations, /traces) use WorkspaceAuth which only
accepts per-workspace bearer tokens. This made the canvas dashboard 401
on every workspace detail view.
Fix: WorkspaceAuth now accepts the admin token as a fallback after
workspace token validation fails. This lets the canvas read all workspace
data with a single admin credential.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tenant page loads were failing with repeated CSP violations:
Executing inline script violates ... script-src 'self'
'nonce-M2M4YTVh...' 'strict-dynamic'. ...
because Next.js's bootstrap inline scripts were emitted without a
nonce attribute. The middleware was generating per-request nonces
correctly and sending them via `x-nonce` — but the layout was
fully static, so Next.js cached the HTML once and served that cached
bundle (no nonces baked in) for every request.
Fix: call `await headers()` in the root layout. That opts the tree
into dynamic rendering AND signals Next.js to propagate the
x-nonce value to its own generated <script> tags.
The `nonce` return value is intentionally unused — the framework
handles its bootstrap scripts automatically once the read happens.
Future code that adds third-party <Script> components (analytics,
etc.) should pass the returned nonce explicitly.
Verified against live tenant: before this change every /_next/
chunk script tag in the HTML had no nonce attribute; expected after
deploy is `<script nonce="..." src="/_next/...">` on each.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that keep getting lost on nuke+rebuild:
1. middleware.ts: read CSP_DEV_MODE env to relax CSP in local Docker
2. api.ts: send NEXT_PUBLIC_ADMIN_TOKEN header (AdminAuth on /workspaces)
3. Dockerfile: accept NEXT_PUBLIC_ADMIN_TOKEN as build arg
All three are required for the canvas to work in local Docker where
canvas (port 3000) fetches from platform (port 8080) cross-origin.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Canvas needs AdminAuth token to fetch /workspaces (gated since PR #729)
and CSP_DEV_MODE to allow cross-port fetches in local Docker.
These were added earlier but lost on nuke+rebuild because they weren't
committed to staging.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This directory belongs in the dedicated repo
Molecule-AI/molecule-ai-org-template-molecule-dev.
It should be cloned locally for platform mounting, never
committed to molecule-core. The .gitignore already blocks it.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Soft-delete (status='removed') leaves orphan DB rows and FK data forever.
When ?purge=true is passed, after container cleanup the handler cascade-
deletes all leaf FK tables and hard-removes the workspace row.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The org import fired all workspace provisioning goroutines concurrently,
overwhelming Docker when creating 39+ containers. Containers timed out,
leaving workspaces stuck in 'provisioning' with no schedules or hooks.
Fix:
- Add provisionConcurrency=3 semaphore limiting concurrent Docker ops
- Increase workspaceCreatePacingMs from 50ms to 2000ms between siblings
- Pass semaphore through createWorkspaceTree recursion
With 39 workspaces at 3 concurrent + 2s pacing, import takes ~30s instead
of timing out. Each workspace gets its full template: schedules, hooks,
settings, hierarchy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#1080 added /waitlist to canvas, but canvas isn't served at
app.moleculesai.app — it backs the tenant subdomains (acme.moleculesai.app
etc.). The real /waitlist lives in the separate molecule-app repo,
which is what the CP auth callback redirects to.
molecule-app#12 has the real page + contact form wiring to
/cp/waitlist/request. This canvas copy was never reachable and would
only diverge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the user-facing half of the beta-gate: a page at /waitlist that
the CP auth callback redirects users to when their email isn't on
the allowlist. Collects email + optional name + use-case and POSTs
to /cp/waitlist/request (backend landed in controlplane #150).
## Behavior
- No auto-pre-fill of email from the URL query (CP's #145 dropped
the ?email= param for privacy reasons; a test guards against a
future regression on the client side).
- Client-side validates email shape for instant feedback; backend
re-validates.
- Three UI states after submit:
success → "your request is in" banner, form hidden
dedup → softer "already on file" banner when backend returns
dedup=true (same 200, no 409 to avoid enumeration)
error → inline banner with backend message or network fallback
## Tests
9 tests in __tests__/waitlist-page.test.tsx covering:
- default render + a11y (role=button, role=status, role=alert)
- URL-pre-fill privacy regression guard
- HTML5 + JS validation (empty, malformed)
- successful POST with trimmed body
- dedup branch
- non-2xx with + without error field
- network rejection
Follow-up to the beta-gate rollout on controlplane #145 / #150.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>