feat(auth): org tokens reach /workspaces/:id/* subroutes + docs

Extends WorkspaceAuth to accept org API tokens as a valid
credential for any workspace sub-route in the org. Previously a
user minting an org token could hit admin-surface endpoints
(/workspaces, /org/import, etc.) but couldn't reach per-workspace
routes like /workspaces/:id/channels — those were gated by
WorkspaceAuth which only knew about workspace-scoped tokens.

Scope matches the explicit product spec: one org API key can
manipulate every workspace in the org. AI agents given a key can
read/write channels, tokens, schedules, secrets, tasks across all
workspaces.

## WorkspaceAuth tier order

  1. ADMIN_TOKEN exact match (break-glass / bootstrap)
  2. Org API token (Validate against org_api_tokens)           NEW
  3. Workspace-scoped token (ValidateToken with :id binding)
  4. Same-origin canvas referer

The org token tier sits above the per-workspace check so that a
presenter of an org key doesn't hit the narrower ValidateToken
failure path first. The isSameOriginCanvas path is unchanged.

## End-to-end verified

Minted test token via ADMIN_TOKEN, then with that org token:
  - GET /workspaces             → 200 (list all)
  - GET /workspaces/<id>        → 200 (detail, admin-only route)
  - GET /workspaces/<id>/channels → 200 (workspace sub-route)
  - GET /workspaces/<id>/tokens   → 200 (workspace tokens list)
  - GET /workspaces/<bad-uuid>    → 404 workspace not found
                                    (routing still scoped correctly)

## Documentation

  - docs/architecture/org-api-keys.md — design, data model, threat
    model, security properties
  - docs/architecture/org-api-keys-followups.md — 10 tracked
    follow-ups prioritized (role scoping P1, per-workspace binding
    P1, expiry P2, usage metrics P2, WorkOS user_id capture P2,
    rotation webhooks P3, mint-rate limit P3, audit log P2, CLI
    P3, migrate ADMIN_TOKEN to the same table P4)
  - docs/guides/org-api-keys.md — end-user guide (mint via UI,
    use in curl/Python/TS/AI agents, session-vs-key comparison)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-20 14:11:45 -07:00
parent c6bb4ae5c4
commit 3d7244ab94
4 changed files with 536 additions and 1 deletions


@@ -0,0 +1,213 @@
# Organization API Keys — Follow-up Work
> Tracked improvements to the beta `org_api_tokens` system. Each item
> has a rationale + sketch implementation + rough effort estimate.
> Ordered by priority.
## 1. Role scoping (P1 — next after beta signal)
**Problem:** Today every token is full-admin. A token given to a
simple read-only monitoring script is as dangerous as one given to
a deploy bot. No way to hand an AI agent a token that lets it read
workspace state but not nuke the org.
**Proposal:** Add a `role` column to `org_api_tokens`:
```sql
ALTER TABLE org_api_tokens
ADD COLUMN role TEXT NOT NULL DEFAULT 'admin'
CHECK (role IN ('admin', 'editor', 'reader'));
```
- `admin` — current behavior (all AdminAuth routes)
- `editor` — workspace CRUD + secrets + approvals, but NOT mint/
revoke org tokens (closes the self-escalation loop)
- `reader` — GETs only, no mutations
New middleware wrapper `RequireRole(role)` checks token's row
against the route's required minimum. Extend AdminAuth to stash
the resolved role on `c.Set("org_token_role", r)`.
**Effort:** ~200 LOC + migration + UI role-picker in
`OrgTokensTab.tsx`. Not a breaking change for existing tokens: the
`admin` default preserves current behavior.
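
A minimal sketch of the minimum-role comparison that a `RequireRole`
wrapper could perform. The role names come from the `CHECK` constraint
above; the numeric ranks and the `roleAllows` helper are assumptions made
for illustration, not the shipped middleware.

```go
package main

// roleRank orders the proposed roles from least to most privileged.
// The ranks are an assumption of this sketch.
var roleRank = map[string]int{
	"reader": 1,
	"editor": 2,
	"admin":  3,
}

// roleAllows reports whether a token holding role `have` satisfies a
// route that requires at least role `need`. Unknown roles on either
// side are rejected rather than treated as the lowest tier.
func roleAllows(have, need string) bool {
	h, okHave := roleRank[have]
	n, okNeed := roleRank[need]
	if !okHave || !okNeed {
		return false
	}
	return h >= n
}
```

Rejecting unknown roles keeps a typo in a route declaration from
silently granting access.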
## 2. Per-workspace binding (P1)
**Problem:** An org-admin token that only needs to touch one
workspace is overkill. AWS IAM equivalent: "this key can only read
bucket foo".
**Proposal:** Optional `workspace_id` FK on the token. When set,
AdminAuth + WorkspaceAuth both accept the token ONLY for routes
scoped to that workspace (`/workspaces/<id>/*`). Tokens with
`workspace_id = NULL` behave as today (full-org).
```sql
ALTER TABLE org_api_tokens
ADD COLUMN workspace_id UUID REFERENCES workspaces(id) ON DELETE CASCADE;
```
The cascade means deleting a workspace also deletes its scoped
tokens automatically. The UI adds a workspace dropdown at mint time.
**Effort:** ~250 LOC. Pairs naturally with role scoping.
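
A minimal sketch of the path check for a workspace-bound token. An empty
binding models `workspace_id = NULL`. The helper name `tokenAllowsPath`
is hypothetical, and accepting the bare `/workspaces/<id>` route in
addition to its sub-routes is an assumption.

```go
package main

import "strings"

// tokenAllowsPath reports whether a token may be used on the given
// request path. An empty boundWorkspaceID means the token is unbound
// (full-org). A bound token is accepted only on its own workspace.
func tokenAllowsPath(boundWorkspaceID, path string) bool {
	if boundWorkspaceID == "" {
		return true
	}
	base := "/workspaces/" + boundWorkspaceID
	// Match the workspace itself or anything below it. The trailing
	// slash prevents "abc" from matching "/workspaces/abcd/...".
	return path == base || strings.HasPrefix(path, base+"/")
}
```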
## 3. Expiry (P2)
**Problem:** Long-lived tokens are a liability. "Mint this key for
this one deploy and die after 1 hour" is a common ask.
**Proposal:** Optional `expires_at` on the row, enforced in the
hot-path query:
```sql
WHERE token_hash = $1 AND revoked_at IS NULL
AND (expires_at IS NULL OR expires_at > now())
```
UI: mint form has "Expires in: [Never / 1h / 1d / 30d]" picker.
Show time-left on the list view; flag soon-to-expire in amber.
**Effort:** ~80 LOC. Additive; existing tokens have NULL = never.
## 4. Usage metrics (P2)
**Problem:** `last_used_at` is the only observation we have. Users
want to see what a token is doing — which paths, from which IPs,
how often — so they can detect anomalies.
**Proposal:** Async counter writes on every successful Validate.
New table:
```sql
CREATE TABLE org_api_token_usage (
token_id UUID REFERENCES org_api_tokens(id) ON DELETE CASCADE,
hour TIMESTAMPTZ NOT NULL, -- truncated to hour
request_count BIGINT NOT NULL DEFAULT 0,
last_path TEXT,
last_ip INET,
last_user_agent TEXT,
PRIMARY KEY (token_id, hour)
);
```
An `ON CONFLICT (token_id, hour) DO UPDATE SET request_count =
org_api_token_usage.request_count + 1` clause gives atomic counter
upserts, one row per token-hour. The UI graphs the last 30 days per
token.
**Effort:** ~150 LOC + background sweep to prune >90-day rows.
## 5. Rotation webhooks (P3)
**Problem:** When a user revokes a token, integrations using it
get 401 with no warning. Big ones want "you're about to lose
access, here's 60s to rotate" signals.
**Proposal:** Soft-revoke tier. Revoke now accepts
`?drain_seconds=60`. Token enters a `draining` state (still valid
but a warning header `X-Molecule-Token-Draining: true` is added to
every response). After drain window, fully revoked.
Alternative / complement: webhook URL on the token. POST to it
when revoked. Safer because no drain period.
**Effort:** ~200 LOC. Webhook variant requires retry logic +
delivery audit.
## 6. Capture WorkOS user_id in created_by (P2, quick win)
**Problem:** Today, tokens minted via the canvas UI log
`created_by: "session"` — we know it was a session but not whose.
Post-incident review can't link a token back to a user.
**Proposal:** Thread the WorkOS user_id from the session-auth
verification through to the handler. The CP's
`/cp/auth/tenant-member` already returns `user_id`; stash it on
the gin context in `session_auth.go`; handler reads it for
`created_by`.
```go
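// session_auth.go, after a successful verify: stash the WorkOS user ID.
c.Set("session_user_id", body.UserID)

// handler: read it back, guarding the type assertion so that a
// non-string value cannot panic the request.
if v, ok := c.Get("session_user_id"); ok {
	if userID, isString := v.(string); isString {
		createdBy = "session:" + userID
	}
}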
```
**Effort:** ~20 LOC. Unblocks Important follow-up #6 from today's
code review.
## 7. Mint-rate limit (P3)
**Problem:** A compromised session or admin token could mint
thousands of org tokens quickly, making forensic cleanup painful.
**Proposal:** Rate limit mint calls per-org: max N tokens per 5 min.
Existing `middleware/ratelimit` package does exactly this — bind
the limiter to the mint route with a low ceiling.
**Effort:** ~30 LOC. Do this before #5 — revoke-storms could hit
the same pattern.
## 8. Audit log (P2)
**Problem:** Token revocation is logged to stdout. That's fine for
Railway's retention window but ops want a queryable audit log.
**Proposal:** New table `org_token_audit` with (token_id, action,
actor, occurred_at). Write on mint/revoke. Surface in admin
diagnostics endpoint.
**Effort:** ~100 LOC + lightweight read API.
## 9. CLI for local development (P3)
**Problem:** Developers running canvas locally can't easily mint
and use org tokens against their dev tenant because the UI
requires a WorkOS session.
**Proposal:** `molecli org-token create --name <label>` uses
`ADMIN_TOKEN` from env + `MOLECULE_ORG_URL` to mint. Same API,
scripts-friendly.
**Effort:** ~80 LOC in molecli + a line in the docs guide.
## 10. Migrate ADMIN_TOKEN to org_api_tokens table (P4 — long-term)
**Problem:** `ADMIN_TOKEN` as an env var is a special case that
every auth tier has to handle. Once org tokens are feature-
complete (roles, expiry, binding), the env-var token is redundant
and complicates the auth code.
**Proposal:** Bootstrap the tenant by inserting a row labeled
`bootstrap` into `org_api_tokens` at provision time with the
current ADMIN_TOKEN value's hash. Remove the env-var check entirely
from AdminAuth. `ADMIN_TOKEN` becomes just "the initial token that
happens to be stored as a normal row".
Requires: roles + expiry shipped first (bootstrap token needs to
be demarcated as revocable-but-permanent-by-default).
**Effort:** ~150 LOC once prerequisites land.
---
## Tracked issues to file
Each of the above should become a GitHub issue when we're ready to
work it. One-liner label for the batch: `area:org-api-keys`.
## Non-goals
Explicit list of things we do NOT want to add:
- JWT / signed tokens. Opaque bearers + DB lookup is simpler and
matches every other token type in the system.
- OAuth scopes. We're not a third-party OAuth provider; this is
for internal integrations only.
- IP allow-lists per token. Captured nominally by the usage log
(#4) for detection, but enforcement adds operational friction
(customer VPN changes → all tokens break).


@@ -0,0 +1,167 @@
# Organization API Keys
> **Status:** Shipped (beta), 2026-04-20. See `docs/guides/org-api-keys.md` for user-facing usage.
Full-admin bearer tokens scoped to a single tenant org. User-visible
replacement for the single `ADMIN_TOKEN` env var — named, revocable,
audited, mintable from the canvas UI without ops intervention.
## Why this exists
Before these, admin access on a tenant required the bootstrap
`ADMIN_TOKEN` from AWS Secrets Manager. That token:
- Is a single shared value with no name or audit trail
- Can't be rotated without redeploying the tenant
- Is inaccessible to users (stored in ops-only SM)
- Can't be revoked individually — rotating it kills every integration
For the beta growth phase we want users to hand an AI agent an API
key and not worry about ops. Org API keys solve that: mint, use,
revoke, all from the canvas UI.
## Data model
```sql
CREATE TABLE org_api_tokens (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    token_hash   BYTEA NOT NULL,   -- sha256(plaintext)
    prefix       TEXT NOT NULL,    -- first 8 plaintext chars for UI
    name         TEXT,             -- user label ("zapier", "ci-bot")
    created_by   TEXT,             -- provenance: "session"/"org-token:xxxxxxxx"/"admin-token"
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    last_used_at TIMESTAMPTZ,
    revoked_at   TIMESTAMPTZ,
    UNIQUE (token_hash)
);
CREATE INDEX org_api_tokens_live_idx
    ON org_api_tokens (token_hash)
    WHERE revoked_at IS NULL;
```
Plaintext is NEVER stored. Only sha256 hash. Recovery is impossible
— lost tokens must be revoked and replaced.
The partial index keeps the hot-path `SELECT id WHERE token_hash=$1
AND revoked_at IS NULL` O(log live-tokens) regardless of how many
tokens have been minted + revoked over the tenant's lifetime.
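
A minimal sketch of how a row's `token_hash` and `prefix` could be
derived at mint time, assuming 32 random bytes encoded as unpadded
base64url (which yields the 43-character bearer described in the threat
model). The `mintedToken` type and both function names are hypothetical;
see the `orgtoken` package for the shipped implementation.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
)

// mintedToken holds what the mint path needs: the plaintext is shown to
// the user once, while only Hash and Prefix are persisted.
type mintedToken struct {
	Plaintext string
	Hash      []byte // sha256(plaintext), stored in token_hash
	Prefix    string // first 8 plaintext chars, stored in prefix
}

// hashToken computes the lookup hash for a presented bearer.
func hashToken(plaintext string) []byte {
	sum := sha256.Sum256([]byte(plaintext))
	return sum[:]
}

// mintToken draws 256 bits from the OS CSPRNG and derives the stored
// fields from the encoded plaintext.
func mintToken() (mintedToken, error) {
	raw := make([]byte, 32)
	if _, err := rand.Read(raw); err != nil {
		return mintedToken{}, err
	}
	plaintext := base64.RawURLEncoding.EncodeToString(raw)
	return mintedToken{
		Plaintext: plaintext,
		Hash:      hashToken(plaintext),
		Prefix:    plaintext[:8],
	}, nil
}
```

Validation then only needs `hashToken(bearer)` as the parameter to the
indexed `SELECT`; the plaintext never has to touch the database.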
## Request flow
```
Browser / CLI / Agent
│ Authorization: Bearer <plaintext>
Cloudflare edge
▼ tunnel (path-matched)
Tenant platform :8080
▼ TenantGuard (allowed; same-origin or header)
▼ AdminAuth middleware
├ Tier 0: fail-open (only if no ADMIN_TOKEN and no live tokens)
├ Tier 1: CP session cookie → /cp/auth/tenant-member
├ Tier 2a: sha256(bearer) IN org_api_tokens WHERE revoked_at IS NULL ← THIS
├ Tier 2b: bearer == ADMIN_TOKEN (bootstrap / break-glass)
└ Tier 3: any live workspace token (deprecated, only if no ADMIN_TOKEN)
```
Cost per request on the hot path: ONE indexed SELECT + one async
last_used_at UPDATE. Both hit the partial index; negligible vs
everything else the request does.
## Authorization scope
Every live org API token grants the SAME access as `ADMIN_TOKEN`:
- All `/workspaces/*` CRUD (create, delete, list, any workspace's sub-routes)
- All `/approvals/pending`, `/bundles/import`, `/org/import`, `/org/templates`
- All `/admin/*` routes
- All `/settings/secrets`, `/channels/discover`, `/events/*`
- Mint + revoke other org API tokens (self-sustaining after bootstrap)
It does NOT grant:
- Access to the control plane (`/cp/*`) directly — those are proxied
by the tenant and the CP has its own auth (WorkOS session). An
org token alone can't hit `/cp/admin/orgs` or `/cp/billing/*`.
- Cross-tenant access — each tenant's `org_api_tokens` table is
isolated in its own Postgres.
## Bootstrap + self-sustenance
The FIRST org token on a fresh tenant is minted via either:
1. **Canvas UI**: a user with a WorkOS session cookie (verified via
`/cp/auth/tenant-member`) opens Settings → Org API Keys → New.
2. **ADMIN_TOKEN CLI**: `curl -XPOST /org/tokens -H "Authorization:
Bearer $ADMIN_TOKEN"`. Useful in provisioning scripts or when
the canvas is down.
After that, any existing org token can mint more. Revocation
leaves ADMIN_TOKEN as the break-glass credential — operators can
still recover admin access even if every user-minted token is
revoked.
## Security properties
- **Plaintext never persisted**: only sha256 hash. A DB leak gives
the attacker prefixes + hashes — neither lets them forge a token.
- **Timing-safe lookup**: single hash-indexed SELECT. No
path-dependent branches that could leak hash-prefix info.
- **Immediate revocation**: `UPDATE ... SET revoked_at = now()` takes
microseconds; the next request returns 401. Partial index means
no lag from rebuilding full indexes.
- **Idempotent revoke**: revoking twice returns 404 the second
time, not a conflict. Simplifies revoke tooling that might
double-deliver.
- **Collapsed failure responses**: `Validate()` returns
`ErrInvalidToken` for any failure (bad bytes, revoked, deleted,
never-existed). Response shape cannot distinguish, so enumeration
is blind.
- **Audit trail via `created_by`**: every token row records its
provenance ("session", "org-token:<prefix>", "admin-token") so
post-incident review can follow a chain of mints.
## Threat model
| Threat | Mitigation |
|---|---|
| Attacker exfiltrates a token via leaked logs | Tokens NEVER logged at INFO — only prefixes. `created_by` audit shows who minted what. |
| Attacker cracks a stored hash | sha256 of 256 bits of uniform-random input — not crackable in our lifetime. Rainbow tables would need 2^256 entries. |
| Attacker brute-forces the bearer | 256 bits of entropy, base64url-encoded 43-char string. At 1e9 guesses/sec it would take >1e60 years. Rate limiting is not the primary defense here; entropy is. |
| Admin's session cookie is stolen | Cookie mints org tokens. Revoke the fresh tokens, rotate ADMIN_TOKEN, force WorkOS re-auth via logout. Mitigations: WorkOS session expiry + `created_by: session` audit trail makes post-hoc detection possible. |
| Token leaks to an AI that misbehaves | Full-org access — damage confined to the tenant but large within it. Beta trade-off accepted. **Future work:** scoped roles. |
| Tenant Postgres is compromised | Attacker can't forge tokens (only hashes stored). They CAN read workspace secrets — that's the separate secrets-encryption story (`SECRETS_ENCRYPTION_KEY`). |
## HTTP surface
```
GET /org/tokens list live tokens (prefix + metadata only)
POST /org/tokens mint; plaintext returned once
body: {"name": "optional label"}
DELETE /org/tokens/:id revoke; idempotent (404 on already-revoked)
```
All three behind `AdminAuth`. See `internal/handlers/org_tokens.go`.
## Follow-up roadmap
See `docs/architecture/org-api-keys-followups.md` for the full
list; headline items:
1. **Role scoping**: split into ADMIN / EDITOR / READER tiers. Then
WORKSPACE-SPECIFIC tokens ("this key can only touch workspace
X"). Aligns with the AWS IAM-style direction the product wants.
2. **Expiry**: optional `expires_at`, enforced in the hot-path
query. Lets users mint short-lived tokens for specific jobs.
3. **Usage metrics**: counter + last-request metadata
(path/ip/user-agent) for the UI so users can see what a token
is actually doing.
4. **Rotation hooks**: webhook-on-revoke so integrations know to
re-mint.
5. **Capture WorkOS user_id in `created_by`** when minted via session
(currently just records "session"). Requires propagating session
identity from the CP's tenant-member check through
`session_auth.go`.

docs/guides/org-api-keys.md Normal file

@@ -0,0 +1,140 @@
# Organization API Keys — User Guide
> Full-admin API keys for your Molecule AI organization. Use these to
> let AI agents, scripts, or integrations manage your org without a
> browser session.
## TL;DR
1. Open your org's canvas UI (`https://<your-slug>.moleculesai.app`)
2. Settings (⌘,) → **Org API Keys** tab
3. Click **New Key**, give it a label (e.g. "zapier", "my-claude-agent")
4. **Copy the token immediately** — it will never be shown again
5. Hand it to whatever needs org-admin access:
```
Authorization: Bearer <your-token>
```
Revoke from the same UI the moment anything looks wrong.
## What these keys can do
**Full organization admin.** A valid org API key is equivalent to
being logged in as an admin user. With it, a script or AI can:
- Create, delete, list workspaces
- Import a complete org definition (can wipe + recreate everything)
- Manage per-workspace secrets (your OpenAI/Anthropic/etc. keys)
- Register + install templates, bundles, plugins
- Approve or reject pending workspace approvals
- Configure channels (Slack, Discord, etc.)
- Mint more org API keys
- Revoke any org API key (including itself)
**What they cannot do:**
- Reach the control plane's admin API (`/cp/admin/*`) — CP admin
lives on a separate allowlist.
- Touch other organizations — each org's keys work only on its own
tenant.
- Edit the tenant's environment variables or restart the underlying
EC2 instance — those are ops-only operations.
## Treat keys like passwords
- **Don't** commit keys to git. If you must have one in source,
reference an env var and keep the var in your secret manager.
- **Don't** paste keys into Slack or email. Share via a password
manager when you can.
- **Do** give each integration its own key with a descriptive name.
If Zapier gets compromised, you revoke `zapier` and leave
`github-action-deploy` untouched.
- **Do** revoke any key you stop using.
If you leak one, revoke it and mint a new one. Revocation is
immediate — the next request with the old key gets 401.
## Using a key
### curl
```bash
curl -H "Authorization: Bearer $MOLECULE_ORG_TOKEN" \
https://acme.moleculesai.app/workspaces
```
### Python
```python
import os, requests
resp = requests.get(
"https://acme.moleculesai.app/workspaces",
headers={"Authorization": f"Bearer {os.environ['MOLECULE_ORG_TOKEN']}"},
)
resp.raise_for_status()
print(resp.json())
```
### TypeScript / Node
```ts
const resp = await fetch("https://acme.moleculesai.app/workspaces", {
headers: { Authorization: `Bearer ${process.env.MOLECULE_ORG_TOKEN}` },
});
if (!resp.ok) throw new Error(`${resp.status}: ${await resp.text()}`);
console.log(await resp.json());
```
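### Go
The same request from Go, as a minimal sketch using only the standard
library. The helper name `newOrgRequest` is hypothetical; the host and
the bearer header follow the examples above.
```go
package main

import (
	"net/http"
	"strings"
)

// newOrgRequest builds a GET request against the tenant with the org
// API key attached as a bearer token. Hypothetical helper for
// illustration only.
func newOrgRequest(baseURL, token, path string) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + path
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}
```
Read the key with `os.Getenv("MOLECULE_ORG_TOKEN")`, pass the request to
`http.DefaultClient.Do`, and check `resp.StatusCode` before decoding the
body.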
### Hand it to an AI agent
Add the key to the agent's environment or config, with clear
instructions about what routes it should touch. Claude Code, for
example, can use it to inspect the tenant's state programmatically:
```bash
export MOLECULE_ORG_TOKEN=... # the key you just minted
```
Then tell the agent: "Using MOLECULE_ORG_TOKEN, list my workspaces
and tell me which ones are idle."
## Endpoints you'll hit most often
| Method | Path | What it does |
|---|---|---|
| GET | `/workspaces` | list all workspaces |
| POST | `/workspaces` | create a workspace |
| DELETE | `/workspaces/:id` | delete a workspace |
| GET | `/org/templates` | list registered templates |
| POST | `/org/import` | import a full org YAML |
| POST | `/bundles/import` | install a bundle |
| GET | `/approvals/pending` | list pending approvals |
Each workspace you create gets its own workspace-scoped token
returned in the create response. Use that token (not the org key)
for agent-to-platform calls inside that specific workspace — it
has a narrower blast radius if leaked.
Full API reference: `docs/api-reference.md`.
## Keys vs session cookies
| | Org API Key | WorkOS session cookie |
|---|---|---|
| Who holds it | Integrations, AI, CLI | Your browser |
| Where you see it | `/org/tokens` UI | Browser cookies |
| Revocation | One-click in UI | Log out / session expiry |
| Use from code | Yes | No (HttpOnly) |
| Blast radius | Full org admin | Full org admin |
Both unlock the same surface; the key is just the non-browser
equivalent.
## What's coming
Scoped roles (admin / editor / reader), expiry timers,
per-workspace bindings, and usage metrics are on the roadmap. See
`docs/architecture/org-api-keys-followups.md`. For now every key
is full-admin by design, trading scope granularity for beta
shipping speed.


@@ -54,7 +54,22 @@ func WorkspaceAuth(database *sql.DB) gin.HandlerFunc {
 			c.Next()
 			return
 		}
-		// Per-workspace token
+		// Org-scoped API token — user-minted from canvas UI. Grants
+		// access to EVERY workspace in the org (that's the explicit
+		// product spec: one org key can touch each workspace). Same
+		// power surface as ADMIN_TOKEN but named, revocable, audited.
+		// Check before per-workspace token so an org-key presenter
+		// doesn't hit the narrower ValidateToken failure path.
+		if id, err := orgtoken.Validate(ctx, database, tok); err == nil {
+			c.Set("org_token_id", id)
+			c.Next()
+			return
+		} else if !errors.Is(err, orgtoken.ErrInvalidToken) {
+			log.Printf("wsauth: WorkspaceAuth: orgtoken.Validate: %v", err)
+			c.AbortWithStatusJSON(http.StatusInternalServerError, gin.H{"error": "auth check failed"})
+			return
+		}
+		// Per-workspace token — narrowest scope, bound to this :id.
 		if err := wsauth.ValidateToken(ctx, database, workspaceID, tok); err != nil {
 			c.AbortWithStatusJSON(http.StatusUnauthorized, gin.H{"error": "invalid workspace auth token"})
 			return