feat(auth): org tokens reach /workspaces/:id/* subroutes + docs

Extends WorkspaceAuth to accept org API tokens as a valid credential
for any workspace sub-route in the org. Previously a user minting an
org token could hit admin-surface endpoints (/workspaces, /org/import,
etc.) but couldn't reach per-workspace routes like
/workspaces/:id/channels; those were gated by WorkspaceAuth, which
only knew about workspace-scoped tokens.

Scope matches the explicit product spec: one org API key can
manipulate every workspace in the org. AI agents given a key can
read/write channels, tokens, schedules, secrets, tasks across all
workspaces.

## WorkspaceAuth tier order

1. ADMIN_TOKEN exact match (break-glass / bootstrap)
2. Org API token (Validate against org_api_tokens) NEW
3. Workspace-scoped token (ValidateToken with :id binding)
4. Same-origin canvas referer

Org token tier sits above the per-workspace check so a presenter of
an org key doesn't hit the narrower ValidateToken failure path first.
Checked with isSameOriginCanvas path unchanged.

## End-to-end verified

Minted test token via ADMIN_TOKEN, then with that org token:

- GET /workspaces → 200 (list all)
- GET /workspaces/<id> → 200 (detail, admin-only route)
- GET /workspaces/<id>/channels → 200 (workspace sub-route)
- GET /workspaces/<id>/tokens → 200 (workspace tokens list)
- GET /workspaces/<bad-uuid> → 404 workspace not found (routing still
  scoped correctly)

## Documentation

- docs/architecture/org-api-keys.md: design, data model, threat model,
  security properties
- docs/architecture/org-api-keys-followups.md: 10 tracked follow-ups
  prioritized (role scoping P1, per-workspace binding P1, expiry P2,
  usage metrics P2, WorkOS user_id capture P2, rotation webhooks P3,
  mint-rate limit P3, audit log P2, CLI P3, migrate ADMIN_TOKEN to the
  same table P4)
- docs/guides/org-api-keys.md: end-user guide (mint via UI, use in
  curl/Python/TS/AI agents, session-vs-key comparison)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:11:45 -07:00
commit 3d7244ab94
parent c6bb4ae5c4
4 changed files with 536 additions and 1 deletions
--- a/docs/architecture/org-api-keys-followups.md
+++ b/docs/architecture/org-api-keys-followups.md
@@ -0,0 +1,213 @@
+# Organization API Keys — Follow-up Work
+
+> Tracked improvements to the beta `org_api_tokens` system. Each item
+> has a rationale, a sketch implementation, and a rough effort estimate.
+> Roughly ordered by priority; each heading carries its own P-level.
+
+## 1. Role scoping (P1 — next after beta signal)
+
+**Problem:** Today every token is full-admin. A token given to a
+simple read-only monitoring script is as dangerous as one given to
+a deploy bot. No way to hand an AI agent a token that lets it read
+workspace state but not nuke the org.
+
+**Proposal:** Add a `role` column to `org_api_tokens`:
+
+```sql
+ALTER TABLE org_api_tokens
+  ADD COLUMN role TEXT NOT NULL DEFAULT 'admin'
+  CHECK (role IN ('admin', 'editor', 'reader'));
+```
+
+- `admin` — current behavior (all AdminAuth routes)
+- `editor` — workspace CRUD + secrets + approvals, but NOT mint/
+  revoke org tokens (closes the self-escalation loop)
+- `reader` — GETs only, no mutations
+
+A new middleware wrapper, `RequireRole(role)`, checks the token's
+role against the route's required minimum. Extend AdminAuth to stash
+the resolved role on the gin context via `c.Set("org_token_role", r)`.
+
+**Effort:** ~200 LOC + migration + UI role-picker in
+`OrgTokensTab.tsx`. Not a breaking change for existing tokens (the
+`admin` default preserves current behavior).
+
+## 2. Per-workspace binding (P1)
+
+**Problem:** An org-admin token that only needs to touch one
+workspace is overkill. AWS IAM equivalent: "this key can only read
+bucket foo".
+
+**Proposal:** Optional `workspace_id` FK on the token. When set,
+AdminAuth + WorkspaceAuth both accept the token ONLY for routes
+scoped to that workspace (`/workspaces/<id>/*`). Tokens with
+`workspace_id = NULL` behave as today (full-org).
+
+```sql
+ALTER TABLE org_api_tokens
+  ADD COLUMN workspace_id UUID REFERENCES workspaces(id) ON DELETE CASCADE;
+```
+
+Cascade delete means deleting a workspace deletes its scoped
+tokens automatically. UI adds a workspace dropdown at mint time.
+
+**Effort:** ~250 LOC. Pairs naturally with role scoping.
+
+## 3. Expiry (P2)
+
+**Problem:** Long-lived tokens are a liability. "Mint this key for
+this one deploy and die after 1 hour" is a common ask.
+
+**Proposal:** Optional `expires_at` on the row, enforced in the
+hot-path query:
+
+```sql
+WHERE token_hash = $1 AND revoked_at IS NULL
+  AND (expires_at IS NULL OR expires_at > now())
+```
+
+UI: mint form has "Expires in: [Never / 1h / 1d / 30d]" picker.
+Show time-left on the list view; flag soon-to-expire in amber.
+
+**Effort:** ~80 LOC. Additive; existing tokens have NULL = never.
+
+## 4. Usage metrics (P2)
+
+**Problem:** `last_used_at` is the only observation we have. Users
+want to see what a token is doing — which paths, from which IPs,
+how often — so they can detect anomalies.
+
+**Proposal:** Async counter writes on every successful Validate.
+New table:
+
+```sql
+CREATE TABLE org_api_token_usage (
+  token_id       UUID REFERENCES org_api_tokens(id) ON DELETE CASCADE,
+  hour           TIMESTAMPTZ NOT NULL,  -- truncated to hour
+  request_count  BIGINT NOT NULL DEFAULT 0,
+  last_path      TEXT,
+  last_ip        INET,
+  last_user_agent TEXT,
+  PRIMARY KEY (token_id, hour)
+);
+```
+
+`ON CONFLICT (token_id, hour) DO UPDATE SET request_count =
+org_api_token_usage.request_count + 1` gives atomic counter
+upserts, one row per token-hour. UI graphs last 30 days per token.
+
+**Effort:** ~150 LOC + background sweep to prune >90-day rows.
+
+## 5. Rotation webhooks (P3)
+
+**Problem:** When a user revokes a token, integrations using it
+get 401 with no warning. Big ones want "you're about to lose
+access, here's 60s to rotate" signals.
+
+**Proposal:** Soft-revoke tier. Revoke now accepts
+`?drain_seconds=60`. Token enters a `draining` state (still valid
+but a warning header `X-Molecule-Token-Draining: true` is added to
+every response). After drain window, fully revoked.
+
+Alternative / complement: webhook URL on the token. POST to it
+when revoked. Safer because no drain period.
+
+**Effort:** ~200 LOC. Webhook variant requires retry logic +
+delivery audit.
+
+## 6. Capture WorkOS user_id in created_by (P2, quick win)
+
+**Problem:** Today, tokens minted via the canvas UI log
+`created_by: "session"` — we know it was a session but not whose.
+Post-incident review can't link a token back to a user.
+
+**Proposal:** Thread the WorkOS user_id from the session-auth
+verification through to the handler. The CP's
+`/cp/auth/tenant-member` already returns `user_id`; stash it on
+the gin context in `session_auth.go`; handler reads it for
+`created_by`.
+
+```go
+// session_auth.go after successful verify
+c.Set("session_user_id", body.UserID)
+
+// handler: GetString returns "" if the key is missing or not a string
+if uid := c.GetString("session_user_id"); uid != "" {
+    createdBy = "session:" + uid
+}
+```
+
+**Effort:** ~20 LOC. Unblocks Important follow-up #6 from today's
+code review.
+
+## 7. Mint-rate limit (P3)
+
+**Problem:** A compromised session or admin token could mint
+thousands of org tokens quickly, making forensic cleanup painful.
+
+**Proposal:** Rate limit mint calls per-org: max N tokens per 5 min.
+Existing `middleware/ratelimit` package does exactly this — bind
+the limiter to the mint route with a low ceiling.
+
+**Effort:** ~30 LOC. Do this before #5 — revoke-storms could hit
+the same pattern.
+
+## 8. Audit log (P2)
+
+**Problem:** Token revocation is logged to stdout. That's fine for
+Railway's retention window but ops want a queryable audit log.
+
+**Proposal:** New table `org_token_audit` with (token_id, action,
+actor, occurred_at). Write on mint/revoke. Surface in admin
+diagnostics endpoint.
+
+**Effort:** ~100 LOC + lightweight read API.
+
+## 9. CLI for local development (P3)
+
+**Problem:** Developers running canvas locally can't easily mint
+and use org tokens against their dev tenant because the UI
+requires a WorkOS session.
+
+**Proposal:** `molecli org-token create --name <label>` uses
+`ADMIN_TOKEN` from env + `MOLECULE_ORG_URL` to mint. Same API,
+scripts-friendly.
+
+**Effort:** ~80 LOC in molecli + a line in the docs guide.
+
+## 10. Migrate ADMIN_TOKEN to org_api_tokens table (P4 — long-term)
+
+**Problem:** `ADMIN_TOKEN` as an env var is a special case that
+every auth tier has to handle. Once org tokens are feature-
+complete (roles, expiry, binding), the env-var token is redundant
+and complicates the auth code.
+
+**Proposal:** Bootstrap the tenant by inserting a row labeled
+`bootstrap` into `org_api_tokens` at provision time with the
+current ADMIN_TOKEN value's hash. Remove the env-var check entirely
+from AdminAuth. `ADMIN_TOKEN` becomes just "the initial token that
+happens to be stored as a normal row".
+
+Requires: roles + expiry shipped first (bootstrap token needs to
+be demarcated as revocable-but-permanent-by-default).
+
+**Effort:** ~150 LOC once prerequisites land.
+
+---
+
+## Tracked issues to file
+
+Each of the above should become a GitHub issue when we're ready to
+work it. One-liner label for the batch: `area:org-api-keys`.
+
+## Non-goals
+
+Explicit list of things we do NOT want to add:
+
+- JWT / signed tokens. Opaque bearers + DB lookup is simpler and
+  matches every other token type in the system.
+- OAuth scopes. We're not a third-party OAuth provider; this is
+  for internal integrations only.
+- IP allow-lists per token. Captured nominally by the usage log
+  (#4) for detection, but enforcement adds operational friction
+  (customer VPN changes → all tokens break).
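
The role-scoping proposal in item 1 of the follow-ups doc comes down to a rank comparison between the role on the token row and the minimum role a route requires. Below is a minimal sketch of that comparison only, not the planned implementation: the `roleAllows` helper and the numeric ranks are hypothetical, and a real `RequireRole` would wrap this in gin middleware and read the role that AdminAuth stashes on the context.

```go
package main

import "fmt"

// roleRank orders the three proposed roles from least to most privileged.
// The numeric values are an assumption made for this illustration.
var roleRank = map[string]int{"reader": 1, "editor": 2, "admin": 3}

// roleAllows reports whether a token holding role `have` meets a route's
// minimum role `need`. Unknown roles on either side are rejected, so a
// typo or an unmigrated value fails closed rather than open.
func roleAllows(have, need string) bool {
	h, okHave := roleRank[have]
	n, okNeed := roleRank[need]
	return okHave && okNeed && h >= n
}

func main() {
	cases := []struct{ have, need string }{
		{"admin", "editor"},
		{"editor", "editor"},
		{"reader", "editor"},
		{"superuser", "reader"}, // not a known role
	}
	for _, tc := range cases {
		fmt.Printf("have=%s need=%s allowed=%v\n", tc.have, tc.need, roleAllows(tc.have, tc.need))
	}
}
```

Failing closed on unknown roles mirrors the `CHECK (role IN (...))` constraint in the proposed migration: a value outside the three names should never grant access.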
--- a/docs/architecture/org-api-keys.md
+++ b/docs/architecture/org-api-keys.md
@@ -0,0 +1,167 @@
+# Organization API Keys
+
+> **Status:** Shipped (beta), 2026-04-20. See `docs/guides/org-api-keys.md` for user-facing usage.
+
+Full-admin bearer tokens scoped to a single tenant org. User-visible
+replacement for the single `ADMIN_TOKEN` env var — named, revocable,
+audited, mintable from the canvas UI without ops intervention.
+
+## Why this exists
+
+Before these, admin access on a tenant required the bootstrap
+`ADMIN_TOKEN` from AWS Secrets Manager. That token:
+
+- Is a single shared value with no name or audit trail
+- Can't be rotated without redeploying the tenant
+- Is inaccessible to users (stored in ops-only SM)
+- Can't be revoked individually — rotating it kills every integration
+
+For the beta growth phase we want users to hand an AI agent an API
+key and not worry about ops. Org API keys solve that: mint, use,
+revoke, all from the canvas UI.
+
+## Data model
+
+```sql
+CREATE TABLE org_api_tokens (
+    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    token_hash    BYTEA NOT NULL,     -- sha256(plaintext)
+    prefix        TEXT  NOT NULL,     -- first 8 plaintext chars for UI
+    name          TEXT,               -- user label ("zapier", "ci-bot")
+    created_by    TEXT,               -- provenance: "session"/"org-token:xxxxxxxx"/"admin-token"
+    created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
+    last_used_at  TIMESTAMPTZ,
+    revoked_at    TIMESTAMPTZ,
+    UNIQUE (token_hash)
+);
+
+CREATE INDEX org_api_tokens_live_idx
+    ON org_api_tokens (token_hash)
+    WHERE revoked_at IS NULL;
+```
+
+Plaintext is NEVER stored. Only sha256 hash. Recovery is impossible
+— lost tokens must be revoked and replaced.
+
+The partial index keeps the hot-path `SELECT id WHERE token_hash=$1
+AND revoked_at IS NULL` O(log live-tokens) regardless of how many
+tokens have been minted + revoked over the tenant's lifetime.
+
+## Request flow
+
+```
+Browser / CLI / Agent
+   │  Authorization: Bearer <plaintext>
+   ▼
+Cloudflare edge
+   │
+   ▼  tunnel (path-matched)
+Tenant platform :8080
+   │
+   ▼  TenantGuard (allowed; same-origin or header)
+   ▼  AdminAuth middleware
+       ├ Tier 0: fail-open (only if no ADMIN_TOKEN and no live tokens)
+       ├ Tier 1: CP session cookie → /cp/auth/tenant-member
+       ├ Tier 2a: sha256(bearer) IN org_api_tokens WHERE revoked_at IS NULL   ← THIS
+       ├ Tier 2b: bearer == ADMIN_TOKEN (bootstrap / break-glass)
+       └ Tier 3: any live workspace token (deprecated, only if no ADMIN_TOKEN)
+```
+
+Cost per request on the hot path: ONE indexed SELECT + one async
+last_used_at UPDATE. Both hit the partial index; negligible vs
+everything else the request does.
+
+## Authorization scope
+
+Every live org API token grants the SAME access as `ADMIN_TOKEN`:
+
+- All `/workspaces/*` CRUD (create, delete, list, any workspace's sub-routes)
+- All `/approvals/pending`, `/bundles/import`, `/org/import`, `/org/templates`
+- All `/admin/*` routes
+- All `/settings/secrets`, `/channels/discover`, `/events/*`
+- Mint + revoke other org API tokens (self-sustaining after bootstrap)
+
+It does NOT grant:
+
+- Access to the control plane (`/cp/*`) directly — those are proxied
+  by the tenant and the CP has its own auth (WorkOS session). An
+  org token alone can't hit `/cp/admin/orgs` or `/cp/billing/*`.
+- Cross-tenant access — each tenant's `org_api_tokens` table is
+  isolated in its own Postgres.
+
+## Bootstrap + self-sustenance
+
+The FIRST org token on a fresh tenant is minted via either:
+
+1. **Canvas UI**: a user with a WorkOS session cookie (verified via
+   `/cp/auth/tenant-member`) opens Settings → Org API Keys → New.
+2. **ADMIN_TOKEN CLI**: `curl -XPOST /org/tokens -H "Authorization:
+   Bearer $ADMIN_TOKEN"`. Useful in provisioning scripts or when
+   the canvas is down.
+
+After that, any existing org token can mint more. Revocation
+leaves ADMIN_TOKEN as the break-glass credential — operators can
+still recover admin access even if every user-minted token is
+revoked.
+
+## Security properties
+
+- **Plaintext never persisted**: only sha256 hash. A DB leak gives
+  the attacker prefixes + hashes — neither lets them forge a token.
+- **Timing-safe lookup**: single hash-indexed SELECT. No
+  path-dependent branches that could leak hash-prefix info.
+- **Immediate revocation**: `UPDATE ... SET revoked_at = now()` is a
+  single-row write; the next request returns 401. The partial index
+  means no lag from rebuilding full indexes.
+- **Idempotent revoke**: revoking twice returns 404 the second
+  time, not a conflict. Simplifies revoke tooling that might
+  double-deliver.
+- **Collapsed failure responses**: `Validate()` returns
+  `ErrInvalidToken` for any failure (bad bytes, revoked, deleted,
+  never-existed). Response shape cannot distinguish, so enumeration
+  is blind.
+- **Audit trail via `created_by`**: every token row records its
+  provenance ("session", "org-token:<prefix>", "admin-token") so
+  post-incident review can follow a chain of mints.
+
+## Threat model
+
+| Threat | Mitigation |
+|---|---|
+| Attacker exfiltrates a token via leaked logs | Tokens NEVER logged at INFO — only prefixes. `created_by` audit shows who minted what. |
+| Attacker cracks a stored hash | sha256 of 256 bits of uniform-random input — not crackable in our lifetime. Rainbow tables would need 2^256 entries. |
+| Attacker brute-forces the bearer | 256 bits of entropy, base64url-encoded 43-char string. At 1e9 guesses/sec it would take >1e60 years. Rate limiting is not the primary defense here; entropy is. |
+| Admin's session cookie is stolen | Cookie mints org tokens. Revoke the fresh tokens, rotate ADMIN_TOKEN, force WorkOS re-auth via logout. Mitigations: WorkOS session expiry + `created_by: session` audit trail makes post-hoc detection possible. |
+| Token leaks to an AI that misbehaves | Full-org access — damage confined to the tenant but large within it. Beta trade-off accepted. **Future work:** scoped roles. |
+| Tenant Postgres is compromised | Attacker can't forge tokens (only hashes stored). They CAN read workspace secrets — that's the separate secrets-encryption story (`SECRETS_ENCRYPTION_KEY`). |
+
+## HTTP surface
+
+```
+GET    /org/tokens              list live tokens (prefix + metadata only)
+POST   /org/tokens               mint; plaintext returned once
+       body: {"name": "optional label"}
+DELETE /org/tokens/:id           revoke; idempotent (404 on already-revoked)
+```
+
+All three behind `AdminAuth`. See `internal/handlers/org_tokens.go`.
+
+## Follow-up roadmap
+
+See `docs/architecture/org-api-keys-followups.md` for the full
+list; headline items:
+
+1. **Role scoping**: split into ADMIN / EDITOR / READER tiers. Then
+   WORKSPACE-SPECIFIC tokens ("this key can only touch workspace
+   X"). Aligns with the AWS IAM-style direction the product wants.
+2. **Expiry**: optional `expires_at`, enforced in the hot-path
+   query. Lets users mint short-lived tokens for specific jobs.
+3. **Usage metrics**: counter + last-request metadata
+   (path/ip/user-agent) for the UI so users can see what a token
+   is actually doing.
+4. **Rotation hooks**: webhook-on-revoke so integrations know to
+   re-mint.
+5. **Capture WorkOS user_id in `created_by`** when minted via session
+   (currently just records "session"). Requires propagating session
+   identity from the CP's tenant-member check through
+   `session_auth.go`.
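
The data model and threat model in the architecture doc describe a token as 256 bits of random input, base64url-encoded to 43 characters, stored only as `sha256(plaintext)` plus an 8-character prefix. The sketch below shows one way those pieces could fit together; `mintToken` and `hashToken` are hypothetical names, and hashing the encoded string rather than the raw bytes is an assumption based on the `sha256(plaintext)` comment in the schema.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// hashToken computes the value a validator would look up in token_hash.
// Assumption: the hash covers the encoded plaintext string.
func hashToken(plaintext string) []byte {
	sum := sha256.Sum256([]byte(plaintext))
	return sum[:]
}

// mintToken returns a fresh plaintext token, the hash to persist, and the
// display prefix for the UI. Only hash and prefix would be stored.
func mintToken() (plaintext string, hash []byte, prefix string, err error) {
	raw := make([]byte, 32) // 256 bits of entropy
	if _, err = rand.Read(raw); err != nil {
		return "", nil, "", err
	}
	// Unpadded base64url of 32 bytes is always 43 characters.
	plaintext = base64.RawURLEncoding.EncodeToString(raw)
	return plaintext, hashToken(plaintext), plaintext[:8], nil
}

func main() {
	plaintext, hash, prefix, err := mintToken()
	if err != nil {
		fmt.Println("mint failed:", err)
		return
	}
	// Real code would return the plaintext once and never log it.
	fmt.Printf("prefix=%s token_len=%d hash_len=%d\n", prefix, len(plaintext), len(hash))
}
```

Because the lookup key is a digest of a high-entropy secret, no salt or slow password hash is needed here; the entropy argument in the threat model table carries the security.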
--- a/docs/guides/org-api-keys.md
+++ b/docs/guides/org-api-keys.md
@@ -0,0 +1,140 @@
+# Organization API Keys — User Guide
+
+> Full-admin API keys for your Molecule AI organization. Use these to
+> let AI agents, scripts, or integrations manage your org without a
+> browser session.
+
+## TL;DR
+
+1. Open your org's canvas UI (`https://<your-slug>.moleculesai.app`)
+2. Settings (⌘,) → **Org API Keys** tab
+3. Click **New Key**, give it a label (e.g. "zapier", "my-claude-agent")
+4. **Copy the token immediately** — it will never be shown again
+5. Hand it to whatever needs org-admin access:
+   ```
+   Authorization: Bearer <your-token>
+   ```
+
+Revoke from the same UI the moment anything looks wrong.
+
+## What these keys can do
+
+**Full organization admin.** A valid org API key is equivalent to
+being logged in as an admin user. With it, a script or AI can:
+
+- Create, delete, list workspaces
+- Import a complete org definition (can wipe + recreate everything)
+- Manage per-workspace secrets (your OpenAI/Anthropic/etc. keys)
+- Register + install templates, bundles, plugins
+- Approve or reject pending workspace approvals
+- Configure channels (Slack, Discord, etc.)
+- Mint more org API keys
+- Revoke any org API key (including itself)
+
+**What they cannot do:**
+
+- Reach the control plane's admin API (`/cp/admin/*`) — CP admin
+  lives on a separate allowlist.
+- Touch other organizations — each org's keys work only on its own
+  tenant.
+- Edit the tenant's environment variables or restart the underlying
+  EC2 instance — those are ops-only operations.
+
+## Treat keys like passwords
+
+- **Don't** commit keys to git. If you must have one in source,
+  reference an env var and keep the var in your secret manager.
+- **Don't** paste keys into Slack or email. Share via a password
+  manager when you can.
+- **Do** give each integration its own key with a descriptive name.
+  If Zapier gets compromised, you revoke `zapier` and leave
+  `github-action-deploy` untouched.
+- **Do** revoke any key you stop using.
+
+If you leak one, revoke it and mint a new one. Revocation is
+immediate — the next request with the old key gets 401.
+
+## Using a key
+
+### curl
+
+```bash
+curl -H "Authorization: Bearer $MOLECULE_ORG_TOKEN" \
+  https://acme.moleculesai.app/workspaces
+```
+
+### Python
+
+```python
+import os, requests
+
+resp = requests.get(
+    "https://acme.moleculesai.app/workspaces",
+    headers={"Authorization": f"Bearer {os.environ['MOLECULE_ORG_TOKEN']}"},
+)
+resp.raise_for_status()
+print(resp.json())
+```
+
+### TypeScript / Node
+
+```ts
+const resp = await fetch("https://acme.moleculesai.app/workspaces", {
+  headers: { Authorization: `Bearer ${process.env.MOLECULE_ORG_TOKEN}` },
+});
+if (!resp.ok) throw new Error(`${resp.status}: ${await resp.text()}`);
+console.log(await resp.json());
+```
+
+### Hand it to an AI agent
+
+Add the key to the agent's environment or config, with clear
+instructions about what routes it should touch. Claude Code, for
+example, can use it to inspect the tenant's state programmatically:
+
+```bash
+export MOLECULE_ORG_TOKEN=...   # the key you just minted
+```
+
+Then tell the agent: "Using MOLECULE_ORG_TOKEN, list my workspaces
+and tell me which ones are idle."
+
+## Endpoints you'll hit most often
+
+| Method | Path | What it does |
+|---|---|---|
+| GET | `/workspaces` | list all workspaces |
+| POST | `/workspaces` | create a workspace |
+| DELETE | `/workspaces/:id` | delete a workspace |
+| GET | `/org/templates` | list registered templates |
+| POST | `/org/import` | import a full org YAML |
+| POST | `/bundles/import` | install a bundle |
+| GET | `/approvals/pending` | list pending approvals |
+
+Each workspace you create gets its own workspace-scoped token
+returned in the create response. Use that token (not the org key)
+for agent-to-platform calls inside that specific workspace — it
+has a narrower blast radius if leaked.
+
+Full API reference: `docs/api-reference.md`.
+
+## Keys vs session cookies
+
+| | Org API Key | WorkOS session cookie |
+|---|---|---|
+| Who holds it | Integrations, AI, CLI | Your browser |
+| Where you see it | `/org/tokens` UI | Browser cookies |
+| Revocation | One-click in UI | Log out / session expiry |
+| Use from code | Yes | No (HttpOnly) |
+| Blast radius | Full org admin | Full org admin |
+
+Both unlock the same surface; the key is just the non-browser
+equivalent.
+
+## What's coming
+
+Scoped roles (`reader` / `editor` / `admin`), expiry timers,
+per-workspace bindings, and usage metrics are on the roadmap. See
+`docs/architecture/org-api-keys-followups.md`. For now every key
+is full-admin by design, trading scope granularity for beta
+shipping speed.
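
The user guide shows curl, Python, and TypeScript clients; the same call from Go might look like the sketch below. It is illustrative only: `newWorkspacesRequest` is a hypothetical helper, and the `acme` host is the placeholder tenant used in the guide's own examples. The program skips the network call when `MOLECULE_ORG_TOKEN` is unset.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// newWorkspacesRequest builds GET {baseURL}/workspaces with the org key
// attached as a bearer token.
func newWorkspacesRequest(baseURL, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/workspaces", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	token := os.Getenv("MOLECULE_ORG_TOKEN")
	if token == "" {
		fmt.Println("MOLECULE_ORG_TOKEN is not set; skipping request")
		return
	}
	req, err := newWorkspacesRequest("https://acme.moleculesai.app", token)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	// A client timeout keeps a hung tenant from blocking the script forever.
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read body:", err)
		return
	}
	fmt.Printf("%d: %s\n", resp.StatusCode, body)
}
```

Reading the key from the environment follows the guide's advice to keep keys out of source control.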
--- a/workspace-server/internal/middleware/wsauth_middleware.go
+++ b/workspace-server/internal/middleware/wsauth_middleware.go
@@ -54,7 +54,22 @@ func WorkspaceAuth(database *sql.DB) gin.HandlerFunc {
 				c.Next()
 				return
 			}
-			// Per-workspace token
+			// Org-scoped API token — user-minted from canvas UI. Grants
+			// access to EVERY workspace in the org (that's the explicit
+			// product spec: one org key can touch each workspace). Same
+			// power surface as ADMIN_TOKEN but named, revocable, audited.
+			// Check before per-workspace token so an org-key presenter
+			// doesn't hit the narrower ValidateToken failure path.
+			if id, err := orgtoken.Validate(ctx, database, tok); err == nil {
+				c.Set("org_token_id", id)
+				c.Next()
+				return
+			} else if !errors.Is(err, orgtoken.ErrInvalidToken) {
+				log.Printf("wsauth: WorkspaceAuth: orgtoken.Validate: %v", err)
+				c.AbortWithStatusJSON(http.StatusInternalServerError, gin.H{"error": "auth check failed"})
+				return
+			}
+			// Per-workspace token — narrowest scope, bound to this :id.
 			if err := wsauth.ValidateToken(ctx, database, workspaceID, tok); err != nil {
 				c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "invalid workspace auth token"})
 				return
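
The ordering in the hunk above (org token before workspace token, with only `ErrInvalidToken` allowed to fall through) can be exercised in isolation. The sketch below is a simplified model, not the middleware itself: `authorize`, `validator`, `errInvalidToken`, and `errDBDown` are hypothetical stand-ins, the constant-time comparison for `ADMIN_TOKEN` is an assumption, and the same-origin canvas tier is left out.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
	"net/http"
)

// errInvalidToken stands in for orgtoken.ErrInvalidToken.
var errInvalidToken = errors.New("invalid token")

// errDBDown simulates an infrastructure failure during validation.
var errDBDown = errors.New("database unavailable")

// validator models orgtoken.Validate or wsauth.ValidateToken with the
// database handle and workspace ID already bound.
type validator func(tok string) error

// authorize walks the bearer-token tiers in order and returns the tier
// that accepted the token plus the HTTP status the caller would see.
func authorize(tok, adminToken string, orgValidate, wsValidate validator) (string, int) {
	// Tier 1: ADMIN_TOKEN exact match (comparison method is an assumption).
	if adminToken != "" && subtle.ConstantTimeCompare([]byte(tok), []byte(adminToken)) == 1 {
		return "admin-token", http.StatusOK
	}
	// Tier 2: org API token. Only "invalid token" falls through; any other
	// error is an infrastructure failure and must not be masked as a 401.
	if err := orgValidate(tok); err == nil {
		return "org-token", http.StatusOK
	} else if !errors.Is(err, errInvalidToken) {
		return "", http.StatusInternalServerError
	}
	// Tier 3: workspace-scoped token, the narrowest scope.
	if err := wsValidate(tok); err != nil {
		return "", http.StatusUnauthorized
	}
	return "workspace-token", http.StatusOK
}

func main() {
	reject := func(string) error { return errInvalidToken }
	accept := func(string) error { return nil }
	broken := func(string) error { return errDBDown }

	fmt.Println(authorize("t", "admin-secret", accept, reject))
	fmt.Println(authorize("t", "admin-secret", reject, accept))
	fmt.Println(authorize("t", "admin-secret", broken, accept))
	fmt.Println(authorize("t", "admin-secret", reject, reject))
}
```

The third case is the one the `errors.Is` branch protects: a database outage during the org-token check surfaces as a 500 instead of silently falling through to the workspace-token path.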