c1a94deabc
5 Commits
8516a8f9c6
fix(tenant-guard): allowlist /buildinfo so redeploy verifier can reach it

The /buildinfo route added in #2398 to verify each tenant runs the published SHA was 404'd by TenantGuard on every production tenant: the allowlist had /health, /metrics, /registry/register, /registry/heartbeat, but not /buildinfo.

The redeploy workflows curl /buildinfo from a CI runner with no X-Molecule-Org-Id header. TenantGuard 404'd them, gin's NoRoute proxied to canvas, canvas returned its HTML 404 page, jq read an empty git_sha, and the verifier silently soft-warned every tenant as "unreachable", which the workflow doesn't fail on.

Confirmed externally: curl https://hongmingwang.moleculesai.app/buildinfo → HTTP 404 + Content-Type: text/html (Next.js "404: This page could not be found.") even though /health on the same host returns {"status":"ok"} from gin.

The buildinfo package's own doc already declares /buildinfo public by design ("Public is intentional: it's a build identifier, not operational state. The same string is already published as org.opencontainers.image.revision on the container image, so no new info is exposed."). The allowlist just missed it.

Pin the alignment in tenant_guard_test.go: TestTenantGuard_AllowlistBypassesCheck now asserts /buildinfo returns 200 without an org header alongside /health and /metrics, so a future allowlist edit can't silently regress the verifier again.

Closes the silent-success failure mode: stale tenants will now show up as STALE (hard-fail) rather than UNREACHABLE (soft-warn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
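The allowlist behaviour described in this commit can be sketched with plain net/http. This is an illustrative stand-in, not the repository's code: the real TenantGuard is gin middleware, and the names `tenantGuard`, `publicPaths`, and `statusFor` are hypothetical. The /cp/ passthrough comes from commit 0b8f3239f6, and comparing the header against a single configured org ID is an assumption about how the check works.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// publicPaths are the exact paths the commit messages describe as allowlisted.
var publicPaths = map[string]bool{
	"/health":             true,
	"/metrics":            true,
	"/registry/register":  true,
	"/registry/heartbeat": true,
	"/buildinfo":          true,
}

// tenantGuard is a hypothetical net/http stand-in for the gin middleware.
// A request passes when its path is allowlisted, sits under /cp/, or carries
// the expected X-Molecule-Org-Id header (assumed check). Everything else
// receives a 404.
func tenantGuard(orgID string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		p := r.URL.Path
		hdr := r.Header.Get("X-Molecule-Org-Id")
		if publicPaths[p] || strings.HasPrefix(p, "/cp/") || (hdr != "" && hdr == orgID) {
			next.ServeHTTP(w, r)
			return
		}
		http.NotFound(w, r)
	})
}

// statusFor sends one GET through the guard and reports the status code.
func statusFor(path, orgHeader string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if orgHeader != "" {
		req.Header.Set("X-Molecule-Org-Id", orgHeader)
	}
	rec := httptest.NewRecorder()
	tenantGuard("org-123", ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// No org header on any of these, as with the CI runner's curl.
	for _, p := range []string{"/health", "/metrics", "/buildinfo", "/workspaces"} {
		fmt.Println(p, statusFor(p, ""))
	}
}
```

A regression test in the spirit of TestTenantGuard_AllowlistBypassesCheck would assert that each allowlisted path returns 200 with no header while an ordinary tenant path returns 404.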
8059fee128
fix(tenant-guard): allowlist /registry/register + /registry/heartbeat (#1236)

* fix(security): call redactSecrets before seeding workspace memories (F1085)

seedInitialMemories() in workspace_provision.go was inserting template/config memories directly into agent_memories without scrubbing credential patterns. A workspace provisioned from a template containing API keys, tokens, or other secrets would store them in plain text, the same class of issue as #838.

Fix: call redactSecrets(workspaceID, content) on the truncated memory content before the INSERT. The truncation (maxMemoryContentLength = 100 KiB, CWE-400) is preserved: redaction runs after truncation so the size limit still applies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(workspace_provision): add seedInitialMemories coverage for #1208

Cover the truncate-at-100k boundary (PR #1167, CWE-400) and the redactSecrets call (F1085 / #1132), both identified as untested in #1208.

- TestSeedInitialMemories_TruncatesOversizedContent: boundary at exactly 100k, 1 byte over, far over, and well under. Verifies INSERT receives exactly maxMemoryContentLength bytes.
- TestSeedInitialMemories_RedactsSecrets: verifies redactSecrets runs before INSERT, regression test for F1085.
- TestSeedInitialMemories_InvalidScopeSkipped: invalid scope is silently skipped, no INSERT called.
- TestSeedInitialMemories_EmptyMemoriesNil: nil slice is handled without DB calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(marketing): Discord adapter launch visual assets (#1209)

Squash-merge: Discord adapter launch visual assets (3 PNGs) + social copy. Acceptance: assets on staging.

* fix(ci): golangci-lint errcheck failures on staging

Suppress errcheck warnings for calls where the return value is safely ignored:

- resp.Body.Close() (artifacts/client.go): deferred cleanup; failure to close a response body is non-critical, and the defer itself is what matters for connection reuse.
- rows.Close() (bundle/exporter.go): deferred cleanup in a loop where rows.Err() already handles query errors.
- filepath.Walk (bundle/exporter.go): top-level walk call; errors in sub-directory traversal are handled by the inner callback (which returns nil for err != nil).
- broadcaster.RecordAndBroadcast (bundle/importer.go): fire-and-forget event broadcast; errors are logged internally by the broadcaster.
- db.DB.ExecContext (bundle/importer.go): best-effort runtime column update; non-critical auxiliary data that the provisioner re-extracts if needed.

Fixes: #1143

* test(artifacts): suppress w.Write return values to satisfy errcheck

All httptest.ResponseWriter.Write calls in client_test.go now discard the byte count and error return with the _, _ = prefix. The Write method is safe to discard in test handlers: httptest.ResponseWriter.Write never returns an error for in-memory buffers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(CI): move changes job off self-hosted runner + add workflow concurrency

Cherry-pick from staging PR #1194 for main. Two changes to relieve macOS arm64 runner saturation:

1. `changes` job: runs on ubuntu-latest instead of [self-hosted, macos, arm64]. This job does a plain `git diff` with zero macOS dependencies, so moving it off the runner frees a slot immediately on every workflow trigger.
2. Add workflow-level concurrency: concurrency: group: ci-${{ github.ref }}; cancel-in-progress: true. Prevents multiple stale in-flight CI runs from queuing on the same ref when new commits arrive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(security): call redactSecrets before seeding workspace memories (F1085) (#1203)

seedInitialMemories() in workspace_provision.go was inserting template/config memories directly into agent_memories without scrubbing credential patterns. A workspace provisioned from a template containing API keys, tokens, or other secrets would store them in plain text, the same class of issue as #838.

Fix: call redactSecrets(workspaceID, content) on the truncated memory content before the INSERT. The truncation (maxMemoryContentLength = 100 KiB, CWE-400) is preserved: redaction runs after truncation so the size limit still applies.

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* tick: 2026-04-21 ~03:40Z: CI stalled 59+ min, GH_TOKEN 4th rotation, PR reviews done

* fix(tenant-guard): allowlist /registry/register + /registry/heartbeat

Final layer of today's stuck-provisioning saga. With the private-IP platform_url fix and the intra-VPC :8080 SG rule in place, workspace EC2s finally reached the tenant on the right port, only to have every POST bounced with a synthetic 404 by TenantGuard.

TenantGuard is the SaaS hook that rejects cross-tenant routing. It demands X-Molecule-Org-Id on every request, but CP's workspace user-data doesn't export MOLECULE_ORG_ID (only WORKSPACE_ID, PLATFORM_URL, RUNTIME, PORT), so the runtime can't attach the header. Net effect: every workspace's first heartbeat to /registry/heartbeat was a silent 404, and the workspace sat in 'provisioning' until the platform sweeper timed it out.

Allowlist the two workspace-boot paths:

- /registry/register: one-shot at runtime startup
- /registry/heartbeat: every 30s

Both are still gated by wsauth.HasAnyLiveToken (workspaces with a token on file must present it; legacy tokenless workspaces are grandfathered). And the tenant SG already scopes :8080 to the VPC CIDR, so only intra-VPC callers can reach these paths in the first place. The allowlist bypasses cross-org routing, not auth.

Follow-up: passing MOLECULE_ORG_ID into the workspace env would let the runtime attach the header and drop this allowlist entry. Tracked separately; not urgent since the multi-layer auth above is already adequate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Infra-SRE <infra-sre@agents.moleculesai.app>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
d03f2d47e0
fix: close cross-tenant authz + cp_proxy admin-traversal gaps
Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.
## Critical-1: session verification scoped to the current tenant
session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.
Fix:
- CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
org_members × organizations and only returns member:true when
the authenticated user is actually in that org.
- Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
- session_auth now calls tenant-member (not /me), passing its
own slug. Cache key includes slug so one tenant's cached
positive never satisfies another's check.
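A minimal sketch of the tenant-scoped check described above, under stated assumptions: the function name `isTenantMember`, the stub `fakeCP`, and the exact JSON shape (`{"member": true}`) are hypothetical. Only the endpoint path, the `slug` query parameter, and the "member:true" result come from the commit message. Caching is left out here.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// isTenantMember asks the control plane whether the session cookie belongs
// to a member of the tenant identified by slug. Any non-200 answer is
// treated as "not a member"; transport errors are returned to the caller.
func isTenantMember(cpBase, slug, cookie string) (bool, error) {
	if slug == "" || cookie == "" {
		return false, errors.New("missing slug or cookie")
	}
	target := cpBase + "/cp/auth/tenant-member?slug=" + url.QueryEscape(slug)
	req, err := http.NewRequest(http.MethodGet, target, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Cookie", cookie)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, nil
	}
	var body struct {
		Member bool `json:"member"` // assumed response field
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.Member, nil
}

// fakeCP is a hypothetical stub control plane: the cookie "session=alice"
// is a member of "acme" and of nothing else.
func fakeCP() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/cp/auth/tenant-member" {
			http.NotFound(w, r)
			return
		}
		member := r.URL.Query().Get("slug") == "acme" && r.Header.Get("Cookie") == "session=alice"
		_ = json.NewEncoder(w).Encode(map[string]bool{"member": member})
	}))
}

func main() {
	cp := fakeCP()
	defer cp.Close()
	// Same cookie, two tenants: only the tenant the user belongs to passes.
	for _, slug := range []string{"acme", "bob"} {
		ok, err := isTenantMember(cp.URL, slug, "session=alice")
		fmt.Println(slug, ok, err)
	}
}
```

The point of the design is that the tenant passes its own slug, so a valid session alone is no longer sufficient.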
## Critical-2: cp_proxy path allowlist (lateral-movement fix)
cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.
Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.
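The fail-closed allowlist might look roughly like the sketch below. `isCPProxyAllowedPath` is named in the test notes of this commit, but its body here is an assumption: the segment-boundary match and the `path.Clean` normalisation are illustrative choices, not confirmed behaviour.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// allowedCPPrefixes lists the prefixes the commit message says the canvas
// browser bundle needs.
var allowedCPPrefixes = []string{
	"/cp/auth",
	"/cp/orgs",
	"/cp/billing",
	"/cp/templates",
	"/cp/legal",
}

// isCPProxyAllowedPath reports whether p may be forwarded upstream.
// The path is cleaned first so "/cp/auth/../admin" cannot slip through,
// and a prefix only matches on a segment boundary so "/cp/authx" is
// rejected. Both details are assumptions made for this sketch.
func isCPProxyAllowedPath(p string) bool {
	clean := path.Clean(p)
	for _, prefix := range allowedCPPrefixes {
		if clean == prefix || strings.HasPrefix(clean, prefix+"/") {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"/cp/auth/me",
		"/cp/admin/tenants/other-slug/diagnostics",
		"/cp/auth/../admin/tenants",
	} {
		fmt.Println(p, isCPProxyAllowedPath(p))
	}
}
```

Anything that returns false would be answered with a 404 at the tenant, before the cookie or bearer is attached to an upstream request.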
## Important-1,2: bounded session cache + split positive/negative TTL
Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.
Fix:
- Bounded map with batch random eviction at cap (10k entries ×
~100 bytes = 1 MB ceiling). Random eviction is O(1)
expected; we don't need precise LRU.
- Periodic sweeper goroutine (2 min) reclaims expired entries
even when they're not re-hit.
- Positive TTL 30s, negative TTL 5s — short negative so CP
flakes self-heal fast.
- Transport errors NOT cached (would otherwise trap every
user during a multi-second upstream outage).
- Cache key = sha256(slug + cookie) so raw session tokens
don't sit in process memory, and cross-tenant isolation is
structural not policy.
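The cache properties listed above can be illustrated with a small sketch. The type and method names are hypothetical, the NUL separator inside the hash input is an addition for this example (the commit only says sha256(slug + cookie)), and the sweeper is shown as a plain method rather than a goroutine on a 2-minute ticker. Eviction relies on Go's unspecified map iteration order as a cheap stand-in for random choice.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

const (
	positiveTTL = 30 * time.Second // lifetime of member:true results
	negativeTTL = 5 * time.Second  // lifetime of member:false results
)

type cacheEntry struct {
	member  bool
	expires time.Time
}

// sessionCache is a hypothetical bounded cache keyed by cacheKey output.
type sessionCache struct {
	mu      sync.Mutex
	limit   int
	entries map[string]cacheEntry
}

func newSessionCache(limit int) *sessionCache {
	return &sessionCache{limit: limit, entries: make(map[string]cacheEntry)}
}

// cacheKey hashes slug and cookie together so raw session tokens are not
// kept as map keys and one tenant's entry cannot answer for another.
func cacheKey(slug, cookie string) string {
	sum := sha256.Sum256([]byte(slug + "\x00" + cookie))
	return hex.EncodeToString(sum[:])
}

// put stores a result; at the cap it first evicts a small batch of
// arbitrary entries. Transport errors should simply never reach put.
func (c *sessionCache) put(key string, member bool, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, exists := c.entries[key]; !exists && len(c.entries) >= c.limit {
		evict := c.limit/10 + 1
		for k := range c.entries {
			delete(c.entries, k)
			evict--
			if evict == 0 {
				break
			}
		}
	}
	ttl := negativeTTL
	if member {
		ttl = positiveTTL
	}
	c.entries[key] = cacheEntry{member: member, expires: now.Add(ttl)}
}

// get returns the cached result, treating expired entries as misses.
func (c *sessionCache) get(key string, now time.Time) (member, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, found := c.entries[key]
	if !found || !now.Before(e.expires) {
		delete(c.entries, key)
		return false, false
	}
	return e.member, true
}

// sweep drops expired entries that were never re-read.
func (c *sessionCache) sweep(now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, e := range c.entries {
		if !now.Before(e.expires) {
			delete(c.entries, k)
		}
	}
}

func main() {
	c := newSessionCache(10000)
	now := time.Now()
	c.put(cacheKey("acme", "session=alice"), true, now)
	_, hit := c.get(cacheKey("bob", "session=alice"), now)
	fmt.Println("same cookie, other tenant hits cache:", hit)
}
```

Passing `now` explicitly keeps the TTL logic deterministic under test; a production version would more likely read the clock internally.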
## Important-3: TenantGuard /cp/* bypass documented
Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).
## Tests
- session_auth_test.go: 12 new tests — empty cookie, missing
slug, no CP, member:true happy path with cache hit, member:
false, 401 upstream, malformed JSON, transport error not
cached, cross-tenant isolation (same cookie different
tenants hit upstream separately), bounded eviction, expired
entries, cache key collision resistance.
- cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
allow/block cases, forwarding preserves Cookie+Auth, Host
rewritten, blocked paths 404 without calling upstream.
All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0b8f3239f6
fix(middleware): TenantGuard passes through /cp/* to CP proxy

Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a reverse-proxy to the control plane, but the TenantGuard middleware runs first in the global chain and 404s anything that isn't in its exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch from canvas landed on a 40µs 404 before ever reaching the proxy.

/cp/* is handled upstream (WorkOS session + admin bearer), so the tenant doesn't need to attach org identity for those paths. Passing them through is correct and matches the design where the tenant platform is a pure transit layer for /cp/*.

Verified: /cp/auth/me via tunnel now returns 401 (correct unauth from CP) instead of 404 from TenantGuard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d8026347e5
chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat; will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md: internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md: one-time internal session docs
- .claude/: gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/: ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md: build, test, branch conventions
- CODE_OF_CONDUCT.md: Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>