Commit Graph

10 Commits

Author SHA1 Message Date
Hongming Wang
ad28e10bf4 fix(org-tokens): rate-limit mint, bound list, correct audit provenance
Addresses the Critical + Important findings from today's code
review of the org API keys feature (PRs #1105-1108).

## Critical-1: rate-limit mint endpoint

Previously POST /org/tokens had no mint-rate limit. A compromised
WorkOS session or leaked bearer could mint thousands of tokens in
seconds, forcing a painful manual cleanup of each one.

Fix: dedicated per-IP token bucket, 10 mints/hour/IP. Legitimate
bursts fit under the ceiling; abuse bounces. List + Delete stay
on the global limiter — they can't be used to generate new
secret material.

## Important-1: HTTP handler integration tests

internal/orgtoken had 9 unit tests; the HTTP layer (org_tokens.go)
had none. Adds org_tokens_test.go covering:
  - List happy path + DB error → 500
  - Create actor="admin-token" (bootstrap), actor="org-token:<prefix>"
    (chained mint), actor="session" (canvas browser path)
  - Create name>100 chars → 400
  - Create with empty body mints with no name
  - Revoke happy path 200, missing id 404, empty id 400
  - Plaintext returned in response body and prefix matches first 8 chars
  - Warning text present

A regression that breaks the tier-ordering, drops the createdBy
field, or accepts oversized names now fails at CI not prod.

## Important-2: bound List output

List() had no LIMIT — a mint-storm bug or abuse could make the
admin UI slow to render and allocate proportionally. Adds
LIMIT 500 at the SQL layer. 10x realistic ceiling, guardrail
against pathological cases.

## Important-3: audit provenance uses plaintext prefix, not UUID

orgTokenActor() was logging "org-token:<first-8-of-uuid>" which
couldn't be cross-referenced with the UI (which shows first-8
of the plaintext). Users could not correlate "who minted this"
audit entries with the revoke button they're looking at.

Fix: Validate() now returns (id, prefix, error). Middleware
stashes both on the gin context. Handler reads prefix for the
actor string. Audit rows now match UI prefixes exactly.

## Nit: named constants for audit labels

actorOrgTokenPrefix / actorSession / actorAdminToken replace
the hardcoded strings scattered across the handler. Greppable
across log pipelines + audit queries; one place to change if
the format evolves.

## Tests

  - internal/orgtoken: 9 existing + 0 new, all still green (updated
    signatures for Validate returning prefix).
  - internal/handlers/org_tokens_test.go: new — 9 HTTP-layer tests
    above. Full gin.Context + sqlmock harness.
  - Full `go test ./...` green except one pre-existing
    TestGitHubToken_NoTokenProvider flake unrelated to this change
    (expects 404, gets 500 — tracked separately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:22:38 -07:00
Hongming Wang
3d7244ab94 feat(auth): org tokens reach /workspaces/:id/* subroutes + docs
Extends WorkspaceAuth to accept org API tokens as a valid
credential for any workspace sub-route in the org. Previously a
user minting an org token could hit admin-surface endpoints
(/workspaces, /org/import, etc.) but couldn't reach per-workspace
routes like /workspaces/:id/channels — those were gated by
WorkspaceAuth which only knew about workspace-scoped tokens.

Scope matches the explicit product spec: one org API key can
manipulate every workspace in the org. AI agents given a key can
read/write channels, tokens, schedules, secrets, tasks across all
workspaces.

## WorkspaceAuth tier order

  1. ADMIN_TOKEN exact match (break-glass / bootstrap)
  2. Org API token (Validate against org_api_tokens)           NEW
  3. Workspace-scoped token (ValidateToken with :id binding)
  4. Same-origin canvas referer

Org token tier sits above the per-workspace check so a presenter
of an org key doesn't hit the narrower ValidateToken failure path
first. Checked with isSameOriginCanvas path unchanged.

## End-to-end verified

Minted test token via ADMIN_TOKEN, then with that org token:
  - GET /workspaces             → 200 (list all)
  - GET /workspaces/<id>        → 200 (detail, admin-only route)
  - GET /workspaces/<id>/channels → 200 (workspace sub-route)
  - GET /workspaces/<id>/tokens   → 200 (workspace tokens list)
  - GET /workspaces/<bad-uuid>    → 404 workspace not found
                                    (routing still scoped correctly)

## Documentation

  - docs/architecture/org-api-keys.md — design, data model, threat
    model, security properties
  - docs/architecture/org-api-keys-followups.md — 10 tracked
    follow-ups prioritized (role scoping P1, per-workspace binding
    P1, expiry P2, usage metrics P2, WorkOS user_id capture P2,
    rotation webhooks P3, mint-rate limit P3, audit log P2, CLI
    P3, migrate ADMIN_TOKEN to the same table P4)
  - docs/guides/org-api-keys.md — end-user guide (mint via UI,
    use in curl/Python/TS/AI agents, session-vs-key comparison)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:11:45 -07:00
Hongming Wang
91187342b4 feat(auth): organization-scoped API keys for admin access
Adds user-facing API keys with full-org admin scope. Replaces the
single ADMIN_TOKEN env var with named, revocable, audited tokens
that users can mint/rotate from the canvas UI without ops
intervention.

Designed for the beta growth phase — one token tier (full admin).
Future work will split into scoped roles (admin / workspace-write
/ read-only) and per-workspace bindings. See docs/architecture/
org-api-keys.md for the design + follow-up roadmap.

## Surface

  POST   /org/tokens        mint (plaintext returned once)
  GET    /org/tokens        list live keys (prefix-only)
  DELETE /org/tokens/:id    revoke (idempotent)

All AdminAuth-gated. Bootstrap path: mint the first token via
ADMIN_TOKEN or canvas session; tokens can mint more tokens after.

## Validation as a new AdminAuth tier (2a)

AdminAuth evaluation order:
  Tier 0  lazy-bootstrap fail-open (only when no live tokens AND
          no ADMIN_TOKEN env)
  Tier 1  verified WorkOS session via /cp/auth/tenant-member
  Tier 2a org_api_tokens SELECT — NEW
  Tier 2b ADMIN_TOKEN env (bootstrap / CLI break-glass)
  Tier 3  any live workspace token (deprecated, only when ADMIN_TOKEN
          unset)

Tier 2a runs ONE indexed lookup (partial index on
token_hash WHERE revoked_at IS NULL) + an async last_used_at
bump. No measurable latency cost on the hot path.

## UI

New "Org API Keys" tab in the settings panel. Label field for
human-readable naming. Plaintext shown once + clipboard copy.
Revoke with confirm dialog. Mirrors the existing workspace-
TokensTab flow so users who've used one get the other for free.

## Security properties

  - Plaintext never stored. sha256 hash + 8-char display prefix.
  - Revocation is immediate: partial index on revoked_at IS NULL
    means the next request validates or fails in microseconds.
  - created_by audit field captures provenance: "org-token:<short>"
    when a token mints another, "session" for browser-UI mints,
    "admin-token" for the ADMIN_TOKEN bootstrap path.
  - Validate() collapses all failure shapes into ErrInvalidToken
    so response-shape can't distinguish "never existed" from
    "revoked".

## Tests

  - internal/orgtoken: 9 unit tests (hash storage, empty field
    null-ing, validation happy path, empty plaintext, unknown hash,
    revoked filtering, list ordering, revoke idempotency, has-any-
    live short-circuit).
  - AdminAuth tier-2a integration covered by existing middleware
    tests unchanged (fail-open + bearer paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:01:41 -07:00
Hongming Wang
d03f2d47e0 fix: close cross-tenant authz + cp_proxy admin-traversal gaps
Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.

## Critical-1: session verification scoped to the current tenant

session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.

Fix:
  - CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
    org_members × organizations and only returns member:true when
    the authenticated user is actually in that org.
  - Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
  - session_auth now calls tenant-member (not /me), passing its
    own slug. Cache key includes slug so one tenant's cached
    positive never satisfies another's check.

## Critical-2: cp_proxy path allowlist (lateral-movement fix)

cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.

Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.

## Important-1,2: bounded session cache + split positive/negative TTL

Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.

Fix:
  - Bounded map with batch random eviction at cap (10k entries ×
    ~100 bytes = 1 MB ceiling). Random eviction is O(1)
    expected; we don't need precise LRU.
  - Periodic sweeper goroutine (2 min) reclaims expired entries
    even when they're not re-hit.
  - Positive TTL 30s, negative TTL 5s — short negative so CP
    flakes self-heal fast.
  - Transport errors NOT cached (would otherwise trap every
    user during a multi-second upstream outage).
  - Cache key = sha256(slug + cookie) so raw session tokens
    don't sit in process memory, and cross-tenant isolation is
    structural not policy.

## Important-3: TenantGuard /cp/* bypass documented

Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).

## Tests

  - session_auth_test.go: 12 new tests — empty cookie, missing
    slug, no CP, member:true happy path with cache hit, member:
    false, 401 upstream, malformed JSON, transport error not
    cached, cross-tenant isolation (same cookie different
    tenants hit upstream separately), bounded eviction, expired
    entries, cache key collision resistance.
  - cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
    allow/block cases, forwarding preserves Cookie+Auth, Host
    rewritten, blocked paths 404 without calling upstream.

All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:45:57 -07:00
Hongming Wang
03178b4712 feat(middleware): AdminAuth accepts CP-verified WorkOS session
Canvas (SaaS tenant UI) runs in the browser and authenticates the
user via a WorkOS session cookie scoped to .moleculesai.app. It
has no bearer token — the token-based ADMIN_TOKEN scheme is for
CLI + server-to-server callers, not end users.

Adds a session-verification tier to AdminAuth that runs BEFORE the
bearer check:

 1. If Cookie header present AND CP_UPSTREAM_URL configured →
    GET /cp/auth/me upstream with the same cookie. 200 + valid
    user_id → grant admin access. Non-200 → fall through.
 2. Else (no cookie, or no CP configured, or CP said no) →
    existing bearer-only path unchanged.

Positive verifications are cached 30s keyed by the raw Cookie
header, so a burst of canvas admin-page renders doesn't DDoS
the CP. Revocations propagate within that window.

Self-hosted / dev deploys without CP_UPSTREAM_URL: feature
disabled, behavior unchanged. So this is strictly additive for
the SaaS case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:27:13 -07:00
Hongming Wang
0b8f3239f6 fix(middleware): TenantGuard passes through /cp/* to CP proxy
Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a
reverse-proxy to the control plane, but the TenantGuard middleware
runs first in the global chain and 404s anything that isn't in its
exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch
from canvas landed on a 40µs 404 before ever reaching the proxy.

/cp/* is handled upstream (WorkOS session + admin bearer), so the
tenant doesn't need to attach org identity for those paths. Passing
them through is correct — matches the design where the tenant
platform is a pure transit layer for /cp/*.

Verified: /cp/auth/me via tunnel now returns 401 (correct unauth
from CP) instead of 404 from TenantGuard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:14:56 -07:00
rabbitblood
992e6d3f38 fix(auth): accept admin token in CanvasOrBearer for viewport PUT 2026-04-20 12:45:09 -07:00
rabbitblood
1e30386aec fix(auth): accept admin token in WorkspaceAuth for canvas dashboard
The canvas sends NEXT_PUBLIC_ADMIN_TOKEN on all API calls but per-workspace
routes (/activity, /delegations, /traces) use WorkspaceAuth which only
accepts per-workspace bearer tokens. This made the canvas dashboard 401
on every workspace detail view.

Fix: WorkspaceAuth now accepts the admin token as a fallback after
workspace token validation fails. This lets the canvas read all workspace
data with a single admin credential.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-20 12:42:43 -07:00
Hongming Wang
481b5cfb1a fix(security): C4 — close AdminAuth fail-open race on hosted-SaaS fresh install
Pre-launch review blocker. AdminAuth's Tier-1 fail-open fired whenever
the workspace_auth_tokens table was empty — including the window between
a hosted tenant EC2 booting and the first workspace being created. In
that window, every admin-gated route (POST /org/import, POST /workspaces,
POST /bundles/import, etc.) was reachable without a bearer, letting an
attacker pre-empt the first real user by importing a hostile workspace
into a freshly provisioned instance.

Fix: fail-open is now ONLY applied when ADMIN_TOKEN is unset (self-
hosted dev with zero auth configured). Hosted SaaS always sets
ADMIN_TOKEN at provision time, so the branch never fires in prod and
requests with no bearer get 401 even before the first token is minted.

Tier-2 / Tier-3 paths unchanged.

The old TestAdminAuth_684_FailOpen_AdminTokenSet_NoGlobalTokens test
was codifying exactly this bug (asserting 200 on fresh install with
ADMIN_TOKEN set). Renamed and flipped to
TestAdminAuth_C4_AdminTokenSet_FreshInstall_FailsClosed asserting 401.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 01:28:13 -07:00
Hongming Wang
d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00