URLs returned from DB and Redis cache (db.GetCachedURL, workspaces.url column)
are now validated via validateAgentURL() before any HTTP request is made:
- mcpResolveURL (mcp.go): added validateAgentURL() calls on all three return
paths (internal cache, Redis cache, DB fallback).
- resolveAgentURL (a2a_proxy.go): added validateAgentURL() call before
returning agentURL to the A2A dispatcher.
validateAgentURL() was extended (registry.go) to resolve DNS hostnames and
check each returned IP against the blocklist (private ranges, loopback,
cloud-metadata 169.254.0.0/16). "localhost" is allowed by name for local dev.
GET /admin/memories/export now applies redactSecrets() to each content field
before including it in the JSON response. Pre-SAFE-T1201 memories (stored
before redactSecrets was mandatory on writes) no longer leak credentials.
POST /admin/memories/import now calls redactSecrets() on content before both
the deduplication check and the INSERT. Imported memories with embedded
credentials cannot bypass SAFE-T1201 (#838).
- admin_memories.go: GET /admin/memories/export + POST /admin/memories/import
handler (from PR #1051, with security fixes applied).
- admin_memories_test.go: 6 tests covering redactSecrets parity on both endpoints.
- registry_test.go: added DNS-lookup test cases for validateAgentURL (F1083).
"localhost" allowed by name (preserves existing test); nxdomain blocked.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Workspaces stuck in provisioning used to sit in "starting" for 10min
until the sweeper flipped them. The real signal — a runtime crash at
EC2 boot — lands on the serial console within seconds but nothing
listened. These endpoints close the loop.
1. POST /admin/workspaces/:id/bootstrap-failed
The control plane's bootstrap watcher posts here when it spots
"RUNTIME CRASHED" in ec2:GetConsoleOutput. Handler:
- UPDATEs workspaces SET status='failed' only when status was
'provisioning' (idempotent — a raced online/failed stays put)
- Stores the error + log_tail in last_sample_error so the canvas
can render the real stack trace, not a generic "timeout" string
- Broadcasts WORKSPACE_PROVISION_FAILED with source='bootstrap_watcher'
2. GET /workspaces/:id/console
Proxies to CP's new /cp/admin/workspaces/:id/console endpoint so
the tenant platform can surface EC2 serial console output without
holding AWS credentials. CPProvisioner.GetConsoleOutput is the
client; returns 501 in non-CP deployments (docker-compose dev).
Both gated by AdminAuth — CP holds the tenant ADMIN_TOKEN that the
middleware accepts on its tier 2b branch.
Tests cover: happy-path fail, already-transitioned no-op, empty id,
log_tail truncation, and the 501 fallback when no CP is wired.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the Critical + Important findings from today's code
review of the org API keys feature (PRs #1105-1108).
## Critical-1: rate-limit mint endpoint
Previously POST /org/tokens had no mint-rate limit. A compromised
WorkOS session or leaked bearer could mint thousands of tokens in
seconds, forcing a painful manual cleanup of each one.
Fix: dedicated per-IP token bucket, 10 mints/hour/IP. Legitimate
bursts fit under the ceiling; abuse bounces. List + Delete stay
on the global limiter — they can't be used to generate new
secret material.
## Important-1: HTTP handler integration tests
internal/orgtoken had 9 unit tests; the HTTP layer (org_tokens.go)
had none. Adds org_tokens_test.go covering:
- List happy path + DB error → 500
- Create actor="admin-token" (bootstrap), actor="org-token:<prefix>"
(chained mint), actor="session" (canvas browser path)
- Create name>100 chars → 400
- Create with empty body mints with no name
- Revoke happy path 200, missing id 404, empty id 400
- Plaintext returned in response body and prefix matches first 8 chars
- Warning text present
A regression that breaks the tier-ordering, drops the createdBy
field, or accepts oversized names now fails at CI not prod.
## Important-2: bound List output
List() had no LIMIT — a mint-storm bug or abuse could make the
admin UI slow to render and allocate proportionally. Adds
LIMIT 500 at the SQL layer. 10x realistic ceiling, guardrail
against pathological cases.
## Important-3: audit provenance uses plaintext prefix, not UUID
orgTokenActor() was logging "org-token:<first-8-of-uuid>" which
couldn't be cross-referenced with the UI (which shows first-8
of the plaintext). Users could not correlate "who minted this"
audit entries with the revoke button they're looking at.
Fix: Validate() now returns (id, prefix, error). Middleware
stashes both on the gin context. Handler reads prefix for the
actor string. Audit rows now match UI prefixes exactly.
## Nit: named constants for audit labels
actorOrgTokenPrefix / actorSession / actorAdminToken replace
the hardcoded strings scattered across the handler. Greppable
across log pipelines + audit queries; one place to change if
the format evolves.
## Tests
- internal/orgtoken: 9 existing + 0 new, all still green (updated
signatures for Validate returning prefix).
- internal/handlers/org_tokens_test.go: new — 9 HTTP-layer tests
above. Full gin.Context + sqlmock harness.
- Full `go test ./...` green except one pre-existing
TestGitHubToken_NoTokenProvider flake unrelated to this change
(expects 404, gets 500 — tracked separately).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds user-facing API keys with full-org admin scope. Replaces the
single ADMIN_TOKEN env var with named, revocable, audited tokens
that users can mint/rotate from the canvas UI without ops
intervention.
Designed for the beta growth phase — one token tier (full admin).
Future work will split into scoped roles (admin / workspace-write
/ read-only) and per-workspace bindings. See docs/architecture/
org-api-keys.md for the design + follow-up roadmap.
## Surface
POST /org/tokens mint (plaintext returned once)
GET /org/tokens list live keys (prefix-only)
DELETE /org/tokens/:id revoke (idempotent)
All AdminAuth-gated. Bootstrap path: mint the first token via
ADMIN_TOKEN or canvas session; tokens can mint more tokens after.
## Validation as a new AdminAuth tier (2a)
AdminAuth evaluation order:
Tier 0 lazy-bootstrap fail-open (only when no live tokens AND
no ADMIN_TOKEN env)
Tier 1 verified WorkOS session via /cp/auth/tenant-member
Tier 2a org_api_tokens SELECT — NEW
Tier 2b ADMIN_TOKEN env (bootstrap / CLI break-glass)
Tier 3 any live workspace token (deprecated, only when ADMIN_TOKEN
unset)
Tier 2a runs ONE indexed lookup (partial index on
token_hash WHERE revoked_at IS NULL) + an async last_used_at
bump. No measurable latency cost on the hot path.
## UI
New "Org API Keys" tab in the settings panel. Label field for
human-readable naming. Plaintext shown once + clipboard copy.
Revoke with confirm dialog. Mirrors the existing workspace-
TokensTab flow so users who've used one get the other for free.
## Security properties
- Plaintext never stored. sha256 hash + 8-char display prefix.
- Revocation is immediate: partial index on revoked_at IS NULL
means the next request validates or fails in microseconds.
- created_by audit field captures provenance: "org-token:<short>"
when a token mints another, "session" for browser-UI mints,
"admin-token" for the ADMIN_TOKEN bootstrap path.
- Validate() collapses all failure shapes into ErrInvalidToken
so response-shape can't distinguish "never existed" from
"revoked".
## Tests
- internal/orgtoken: 9 unit tests (hash storage, empty field
null-ing, validation happy path, empty plaintext, unknown hash,
revoked filtering, list ordering, revoke idempotency, has-any-
live short-circuit).
- AdminAuth tier-2a integration covered by existing middleware
tests unchanged (fail-open + bearer paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.
## Critical-1: session verification scoped to the current tenant
session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.
Fix:
- CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
org_members × organizations and only returns member:true when
the authenticated user is actually in that org.
- Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
- session_auth now calls tenant-member (not /me), passing its
own slug. Cache key includes slug so one tenant's cached
positive never satisfies another's check.
## Critical-2: cp_proxy path allowlist (lateral-movement fix)
cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.
Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.
## Important-1,2: bounded session cache + split positive/negative TTL
Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.
Fix:
- Bounded map with batch random eviction at cap (10k entries ×
~100 bytes = 1 MB ceiling). Random eviction is O(1)
expected; we don't need precise LRU.
- Periodic sweeper goroutine (2 min) reclaims expired entries
even when they're not re-hit.
- Positive TTL 30s, negative TTL 5s — short negative so CP
flakes self-heal fast.
- Transport errors NOT cached (would otherwise trap every
user during a multi-second upstream outage).
- Cache key = sha256(slug + cookie) so raw session tokens
don't sit in process memory, and cross-tenant isolation is
structural not policy.
## Important-3: TenantGuard /cp/* bypass documented
Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).
## Tests
- session_auth_test.go: 12 new tests — empty cookie, missing
slug, no CP, member:true happy path with cache hit, member:
false, 401 upstream, malformed JSON, transport error not
cached, cross-tenant isolation (same cookie different
tenants hit upstream separately), bounded eviction, expired
entries, cache key collision resistance.
- cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
allow/block cases, forwarding preserves Cookie+Auth, Host
rewritten, blocked paths 404 without calling upstream.
All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Canvas's browser bundle issues fetches to both CP endpoints
(/cp/auth/me, /cp/orgs, ...) AND tenant-platform endpoints
(/canvas/viewport, /approvals/pending, /org/templates). They
share ONE build-time base URL. Baking api.moleculesai.app
broke tenant calls with 404; baking the tenant subdomain broke
auth. Tried both today and saw exactly one failure mode per
attempt.
Real fix: same-origin fetches + tenant-side split. Adds:
internal/router/cp_proxy.go # /cp/* → CP_UPSTREAM_URL
mounted before NoRoute(canvasProxy). Now a tenant serves:
/cp/* → reverse-proxy to api.moleculesai.app
/canvas/viewport,
/approvals/pending,
/workspaces/:id/*,
/ws, /registry, → tenant platform (existing handlers)
/metrics
everything else → canvas UI (existing reverse-proxy)
Canvas middleware reverts to `connect-src 'self' wss:` for the
same-origin path (keeping explicit PLATFORM_URL whitelist as a
self-hosted escape hatch when the build-arg is non-empty).
CI build-arg flips to NEXT_PUBLIC_PLATFORM_URL="" so the bundle
issues relative fetches.
Security of cp_proxy:
- Cookie + Authorization PRESERVED across the hop (opposite of
canvas proxy) — they carry the WorkOS session, which is the
whole point.
- Host rewritten to upstream so CORS + cookie-domain on the CP
side see their own hostname.
- Upstream URL validated at construction: must parse, must be
http(s), must have a host — misconfig fails closed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #729 tightened AdminAuth to require ADMIN_TOKEN, breaking the
workspace credential helper which called /admin/github-installation-token
with a workspace bearer token. Tokens expired after 60 min with no refresh.
Fix: Add /workspaces/:id/github-installation-token under WorkspaceAuth
so any authenticated workspace can refresh its GitHub token. Keep the
admin path as backward-compatible alias.
Update molecule-git-token-helper.sh to use the workspace-scoped path
when WORKSPACE_ID is set.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GET /admin/memories/export returns all agent memories with workspace
name mapping. POST /admin/memories/import accepts the same format,
resolves workspaces by name, and deduplicates on content+scope.
Both endpoints are AdminAuth-gated.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>