Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.
## Critical-1: session verification scoped to the current tenant
session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.
Fix:
- CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
org_members × organizations and only returns member:true when
the authenticated user is actually in that org.
- Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
- session_auth now calls tenant-member (not /me), passing its
own slug. Cache key includes slug so one tenant's cached
positive never satisfies another's check.
## Critical-2: cp_proxy path allowlist (lateral-movement fix)
cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.
Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.
## Important-1,2: bounded session cache + split positive/negative TTL
Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.
Fix:
- Bounded map with batch random eviction at cap (10k entries ×
~100 bytes = 1 MB ceiling). Random eviction is O(1)
expected; we don't need precise LRU.
- Periodic sweeper goroutine (2 min) reclaims expired entries
even when they're not re-hit.
- Positive TTL 30s, negative TTL 5s — short negative so CP
flakes self-heal fast.
- Transport errors NOT cached (would otherwise trap every
user during a multi-second upstream outage).
- Cache key = sha256(slug + cookie) so raw session tokens
don't sit in process memory, and cross-tenant isolation is
structural not policy.
## Important-3: TenantGuard /cp/* bypass documented
Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).
## Tests
- session_auth_test.go: 12 new tests — empty cookie, missing
slug, no CP, member:true happy path with cache hit, member:
false, 401 upstream, malformed JSON, transport error not
cached, cross-tenant isolation (same cookie different
tenants hit upstream separately), bounded eviction, expired
entries, cache key collision resistance.
- cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
allow/block cases, forwarding preserves Cookie+Auth, Host
rewritten, blocked paths 404 without calling upstream.
All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
233 lines
6.9 KiB
Go
233 lines
6.9 KiB
Go
package middleware
|
||
|
||
import (
|
||
"crypto/sha256"
|
||
"encoding/hex"
|
||
"encoding/json"
|
||
"log"
|
||
"math/rand/v2"
|
||
"net/http"
|
||
"net/url"
|
||
"os"
|
||
"strings"
|
||
"sync"
|
||
"time"
|
||
)
|
||
|
||
// sessionCache holds short-lived verification results for upstream
|
||
// session-cookie checks. Entries are scoped BY TENANT SLUG so one
|
||
// tenant's cache can't satisfy another tenant's check even when the
|
||
// same cookie is presented.
|
||
//
|
||
// Keyed by a sha256 of (slug + cookie) rather than raw cookie bytes:
|
||
// - Avoids storing raw session tokens in memory for longer than
|
||
// needed to look them up.
|
||
// - Makes the cache lookup deterministic regardless of cookie
|
||
// ordering / whitespace that browsers sometimes introduce.
|
||
//
|
||
// Bounded: we evict random entries when size breaches sessionCacheMax.
|
||
// Periodic sweeper GCs expired entries even when they aren't re-hit.
|
||
var sessionCache = struct {
|
||
sync.Mutex
|
||
entries map[string]sessionCacheEntry
|
||
}{entries: make(map[string]sessionCacheEntry)}
|
||
|
||
const (
|
||
// Positive TTL: on the higher end because a valid session is
|
||
// stable until logout. 30s means logout or role change takes at
|
||
// most 30s to propagate.
|
||
sessionCacheTTLOK = 30 * time.Second
|
||
|
||
// Negative TTL: shorter, because a transient CP 502 (see
|
||
// controlplane issue #157 — terms-status flake) must heal
|
||
// quickly. 5s still absorbs a burst of retries from a single
|
||
// page render without fanning out to CP.
|
||
sessionCacheTTLFail = 5 * time.Second
|
||
|
||
// Cap on cached entries. 10k × ~100 bytes = ~1 MB — enough
|
||
// headroom for realistic tenant traffic without a slow leak.
|
||
sessionCacheMax = 10_000
|
||
|
||
// Sweeper runs opportunistically; cost is O(N) per sweep.
|
||
sessionCacheSweepEvery = 2 * time.Minute
|
||
)
|
||
|
||
type sessionCacheEntry struct {
|
||
expiresAt time.Time
|
||
ok bool
|
||
}
|
||
|
||
// cacheKey derives the lookup key. Using sha256 here isn't about
|
||
// cryptographic secrecy — it's about keying by (tenant, cookie) in a
|
||
// fixed-size string and not sprinkling raw tokens around the map.
|
||
func cacheKey(slug, cookie string) string {
|
||
h := sha256.New()
|
||
h.Write([]byte(slug))
|
||
h.Write([]byte{0}) // separator so ("a","bc") ≠ ("ab","c")
|
||
h.Write([]byte(cookie))
|
||
return hex.EncodeToString(h.Sum(nil))
|
||
}
|
||
|
||
// sessionCacheGet returns (ok, hit). hit=false means expired or absent.
|
||
func sessionCacheGet(key string) (ok bool, hit bool) {
|
||
sessionCache.Lock()
|
||
defer sessionCache.Unlock()
|
||
e, present := sessionCache.entries[key]
|
||
if !present {
|
||
return false, false
|
||
}
|
||
if time.Now().After(e.expiresAt) {
|
||
delete(sessionCache.entries, key)
|
||
return false, false
|
||
}
|
||
return e.ok, true
|
||
}
|
||
|
||
// sessionCachePut stores the result with the appropriate TTL. On
|
||
// overflow it evicts a pseudo-random entry so the cache stays
|
||
// bounded. This isn't LRU — we don't need precise recency, just
|
||
// ceiling behaviour. Random eviction is O(1) expected and avoids
|
||
// the bookkeeping of a doubly-linked list.
|
||
func sessionCachePut(key string, ok bool) {
|
||
ttl := sessionCacheTTLFail
|
||
if ok {
|
||
ttl = sessionCacheTTLOK
|
||
}
|
||
sessionCache.Lock()
|
||
defer sessionCache.Unlock()
|
||
if len(sessionCache.entries) >= sessionCacheMax {
|
||
// Evict N random entries to amortize the sweep cost. Pick
|
||
// the first N in map-iteration order (Go randomizes this).
|
||
const evictBatch = 128
|
||
i := 0
|
||
for k := range sessionCache.entries {
|
||
delete(sessionCache.entries, k)
|
||
i++
|
||
if i >= evictBatch {
|
||
break
|
||
}
|
||
}
|
||
}
|
||
sessionCache.entries[key] = sessionCacheEntry{
|
||
expiresAt: time.Now().Add(ttl),
|
||
ok: ok,
|
||
}
|
||
}
|
||
|
||
func init() {
|
||
go func() {
|
||
// Jitter startup so restarts don't align sweeps.
|
||
time.Sleep(time.Duration(rand.Int64N(int64(sessionCacheSweepEvery))))
|
||
t := time.NewTicker(sessionCacheSweepEvery)
|
||
defer t.Stop()
|
||
for range t.C {
|
||
sweepExpired()
|
||
}
|
||
}()
|
||
}
|
||
|
||
// sweepExpired removes expired entries so a low-hit-rate cache still
|
||
// releases memory. Cheap — we hold the lock briefly per entry.
|
||
func sweepExpired() {
|
||
now := time.Now()
|
||
sessionCache.Lock()
|
||
defer sessionCache.Unlock()
|
||
for k, e := range sessionCache.entries {
|
||
if now.After(e.expiresAt) {
|
||
delete(sessionCache.entries, k)
|
||
}
|
||
}
|
||
}
|
||
|
||
// cpSessionVerifyURL builds the upstream /cp/auth/tenant-member URL
|
||
// with the tenant slug attached. Returns "" when the tenant isn't
|
||
// configured for CP verification (CP_UPSTREAM_URL unset).
|
||
func cpSessionVerifyURL(slug string) string {
|
||
base := strings.TrimRight(os.Getenv("CP_UPSTREAM_URL"), "/")
|
||
if base == "" {
|
||
return ""
|
||
}
|
||
return base + "/cp/auth/tenant-member?slug=" + url.QueryEscape(slug)
|
||
}
|
||
|
||
// tenantSlug returns the slug this platform represents. Pulled from
|
||
// the MOLECULE_ORG_SLUG env at provision time; falls back to empty
|
||
// when unset (self-hosted / dev).
|
||
func tenantSlug() string {
|
||
return strings.TrimSpace(os.Getenv("MOLECULE_ORG_SLUG"))
|
||
}
|
||
|
||
// verifiedCPSession returns true when the request carries a cookie
|
||
// that the CP confirms belongs to a MEMBER of THIS tenant's org (not
|
||
// just "someone is logged in"). The difference is the authz boundary:
|
||
// any WorkOS-authed user could hit /cp/auth/me successfully; only
|
||
// actual org members pass /cp/auth/tenant-member?slug=<us>.
|
||
//
|
||
// Returns (false, false) when no cookie at all, so callers can
|
||
// distinguish "no credential presented" (fall through to bearer)
|
||
// from "credential presented but invalid" (abort with 401).
|
||
//
|
||
// Also returns (false, false) when MOLECULE_ORG_SLUG isn't configured
|
||
// — fail-safe: better to refuse session auth than to accept it
|
||
// without knowing which tenant we ARE. Deployments that want session
|
||
// auth MUST set both CP_UPSTREAM_URL and MOLECULE_ORG_SLUG.
|
||
func verifiedCPSession(cookieHeader string) (valid, presented bool) {
|
||
if cookieHeader == "" {
|
||
return false, false
|
||
}
|
||
slug := tenantSlug()
|
||
if slug == "" {
|
||
return false, false
|
||
}
|
||
verifyURL := cpSessionVerifyURL(slug)
|
||
if verifyURL == "" {
|
||
return false, true
|
||
}
|
||
|
||
key := cacheKey(slug, cookieHeader)
|
||
if ok, hit := sessionCacheGet(key); hit {
|
||
return ok, true
|
||
}
|
||
|
||
// Short timeout — a slow CP mustn't gate every canvas render.
|
||
client := &http.Client{Timeout: 3 * time.Second}
|
||
req, err := http.NewRequest("GET", verifyURL, nil)
|
||
if err != nil {
|
||
log.Printf("verifiedCPSession: build req: %v", err)
|
||
return false, true
|
||
}
|
||
req.Header.Set("Cookie", cookieHeader)
|
||
req.Header.Set("User-Agent", "molecule-tenant-platform/session-verifier")
|
||
|
||
resp, err := client.Do(req)
|
||
if err != nil {
|
||
log.Printf("verifiedCPSession: upstream: %v", err)
|
||
// NOTE: we deliberately do NOT cache transport failures.
|
||
// Caching them would mean a 3s CP blip locks out all users
|
||
// for the negative-TTL window. Next request retries.
|
||
return false, true
|
||
}
|
||
defer resp.Body.Close()
|
||
|
||
if resp.StatusCode != http.StatusOK {
|
||
sessionCachePut(key, false)
|
||
return false, true
|
||
}
|
||
|
||
var body struct {
|
||
Member bool `json:"member"`
|
||
UserID string `json:"user_id"`
|
||
}
|
||
if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
|
||
sessionCachePut(key, false)
|
||
return false, true
|
||
}
|
||
if !body.Member || body.UserID == "" {
|
||
sessionCachePut(key, false)
|
||
return false, true
|
||
}
|
||
|
||
sessionCachePut(key, true)
|
||
return true, true
|
||
}
|