Commit Graph

5 Commits

Author SHA1 Message Date
Hongming Wang
8516a8f9c6 fix(tenant-guard): allowlist /buildinfo so redeploy verifier can reach it
The /buildinfo route added in #2398 to verify each tenant runs the
published SHA was 404'd by TenantGuard on every production tenant —
the allowlist had /health, /metrics, /registry/register,
/registry/heartbeat, but not /buildinfo. The redeploy workflows
curl /buildinfo from a CI runner with no X-Molecule-Org-Id header,
TenantGuard 404'd them, gin's NoRoute proxied to canvas, canvas
returned its HTML 404 page, jq read empty git_sha, and the verifier
silently soft-warned every tenant as "unreachable" — which the
workflow doesn't fail on.

Confirmed externally:
  curl https://hongmingwang.moleculesai.app/buildinfo
  → HTTP 404 + Content-Type: text/html (Next.js "404: This page
    could not be found.") even though /health on the same host
    returns {"status":"ok"} from gin.

The buildinfo package's own doc already declares /buildinfo public
by design ("Public is intentional: it's a build identifier, not
operational state. The same string is already published as
org.opencontainers.image.revision on the container image, so no new
info is exposed.") — the allowlist just missed it.

Pin the alignment in tenant_guard_test.go:
TestTenantGuard_AllowlistBypassesCheck now asserts /buildinfo
returns 200 without an org header alongside /health and /metrics,
so a future allowlist edit can't silently regress the verifier
again.

Closes the silent-success failure mode: stale tenants will now
show up as STALE (hard-fail) rather than UNREACHABLE (soft-warn).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 12:54:51 -07:00
Hongming Wang
8059fee128 fix(tenant-guard): allowlist /registry/register + /registry/heartbeat (#1236)
* fix(security): call redactSecrets before seeding workspace memories (F1085)

seedInitialMemories() in workspace_provision.go was inserting template/config
memories directly into agent_memories without scrubbing credential patterns.
A workspace provisioned from a template containing API keys, tokens, or other
secrets would store them in plain text — the same class of issue as #838.

Fix: call redactSecrets(workspaceID, content) on the truncated memory content
before the INSERT. The truncation (maxMemoryContentLength = 100 KiB, CWE-400)
is preserved — redaction runs after truncation so the size limit still applies.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(workspace_provision): add seedInitialMemories coverage for #1208

Cover the truncate-at-100k boundary (PR #1167, CWE-400) and the
redactSecrets call (F1085 / #1132), both identified as untested in #1208.

- TestSeedInitialMemories_TruncatesOversizedContent: boundary at exactly
  100k, 1 byte over, far over, and well under. Verifies INSERT receives
  exactly maxMemoryContentLength bytes.
- TestSeedInitialMemories_RedactsSecrets: verifies redactSecrets runs
  before INSERT, regression test for F1085.
- TestSeedInitialMemories_InvalidScopeSkipped: invalid scope is silently
  skipped, no INSERT called.
- TestSeedInitialMemories_EmptyMemoriesNil: nil slice is handled without
  DB calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(marketing): Discord adapter launch visual assets (#1209)

Squash-merge: Discord adapter launch visual assets (3 PNGs) + social copy. Acceptance: assets on staging.

* fix(ci): golangci-lint errcheck failures on staging

Suppress errcheck warnings for calls where the return value is safely
ignored:
  - resp.Body.Close() (artifacts/client.go): deferred cleanup — failure
    to close a response body is non-critical; the defer itself is what
    matters for connection reuse.
  - rows.Close() (bundle/exporter.go): deferred cleanup in a loop where
    rows.Err() already handles query errors.
  - filepath.Walk (bundle/exporter.go): top-level walk call; errors in
    sub-directory traversal are handled by the inner callback (which
    returns nil for err != nil).
  - broadcaster.RecordAndBroadcast (bundle/importer.go): fire-and-forget
    event broadcast; errors are logged internally by the broadcaster.
  - db.DB.ExecContext (bundle/importer.go): best-effort runtime column
    update; non-critical auxiliary data that the provisioner re-extracts
    if needed.

Fixes: #1143

* test(artifacts): suppress w.Write return values to satisfy errcheck

All httptest.ResponseWriter.Write calls in client_test.go now discard
the byte count and error return with _, _ = prefix. The Write method
is safe to discard in test handlers — httptest.ResponseWriter.Write
never returns an error for in-memory buffers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(CI): move changes job off self-hosted runner + add workflow concurrency

Cherry-pick from staging PR #1194 for main. Two changes to relieve
macOS arm64 runner saturation:

1. `changes` job: runs on ubuntu-latest instead of
   [self-hosted, macos, arm64]. This job does a plain `git diff`
   with zero macOS dependencies — moving it off the runner frees
   a slot immediately on every workflow trigger.

2. Add workflow-level concurrency:
   concurrency: group: ci-${{ github.ref }}; cancel-in-progress: true

   Prevents multiple stale in-flight CI runs from queuing on the
   same ref when new commits arrive.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(security): call redactSecrets before seeding workspace memories (F1085) (#1203)

seedInitialMemories() in workspace_provision.go was inserting template/config
memories directly into agent_memories without scrubbing credential patterns.
A workspace provisioned from a template containing API keys, tokens, or other
secrets would store them in plain text — the same class of issue as #838.

Fix: call redactSecrets(workspaceID, content) on the truncated memory content
before the INSERT. The truncation (maxMemoryContentLength = 100 KiB, CWE-400)
is preserved — redaction runs after truncation so the size limit still applies.

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* tick: 2026-04-21 ~03:40Z — CI stalled 59+ min, GH_TOKEN 4th rotation, PR reviews done

* fix(tenant-guard): allowlist /registry/register + /registry/heartbeat

Final layer of today's stuck-provisioning saga. With the private-IP
platform_url fix and the intra-VPC :8080 SG rule in place, workspace
EC2s finally reached the tenant on the right port — only to have every
POST bounced with a synthetic 404 by TenantGuard.

TenantGuard is the SaaS hook that rejects cross-tenant routing. It
demands X-Molecule-Org-Id on every request, but CP's workspace user-
data doesn't export MOLECULE_ORG_ID (only WORKSPACE_ID, PLATFORM_URL,
RUNTIME, PORT), so the runtime can't attach the header. Net effect:
every workspace's first heartbeat to /registry/heartbeat was a silent
404, and the workspace sat in 'provisioning' until the platform
sweeper timed it out.

Allowlist the two workspace-boot paths:
  - /registry/register  — one-shot at runtime startup
  - /registry/heartbeat — every 30s

Both are still gated by wsauth.HasAnyLiveToken (workspaces with a
token on file must present it; legacy tokenless workspaces are
grandfathered). And the tenant SG already scopes :8080 to the VPC
CIDR, so only intra-VPC callers can reach these paths in the first
place. The allowlist bypasses cross-org routing, not auth.

Follow-up: passing MOLECULE_ORG_ID into the workspace env would let
the runtime attach the header and drop this allowlist entry. Tracked
separately; not urgent since the multi-layer auth above is already
adequate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Molecule AI Core-BE <core-be@agents.moleculesai.app>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Molecule AI Infra-SRE <infra-sre@agents.moleculesai.app>
Co-authored-by: molecule-ai[bot] <276602405+molecule-ai[bot]@users.noreply.github.com>
Co-authored-by: Molecule AI Core-DevOps <core-devops@agents.moleculesai.app>
Co-authored-by: Molecule AI Core-UIUX <core-uiux@agents.moleculesai.app>
Co-authored-by: Hongming Wang <hongmingwang.rabbit@users.noreply.github.com>
2026-04-21 02:47:27 +00:00
Hongming Wang
d03f2d47e0 fix: close cross-tenant authz + cp_proxy admin-traversal gaps
Addresses three Critical findings from today's code review of the
SaaS-canvas routing stack.

## Critical-1: session verification scoped to the current tenant

session_auth.go previously verified via GET /cp/auth/me, which
only answers "is someone logged in" — NOT "is this user in the
org they're targeting." Every WorkOS-authed user (including folks
who only signed up via app.moleculesai.app with no tenant
relationship) could call /workspaces, /approvals/pending,
/bundles/import, /org/import etc. on ANY tenant they could reach.
Cross-tenant read: user at acme.moleculesai.app could hit
bob.moleculesai.app/workspaces with their cookie and get Bob's
workspaces.

Fix:
  - CP gains GET /cp/auth/tenant-member?slug=<slug> which joins
    org_members × organizations and only returns member:true when
    the authenticated user is actually in that org.
  - Tenant sets MOLECULE_ORG_SLUG at boot via user-data.
  - session_auth now calls tenant-member (not /me), passing its
    own slug. Cache key includes slug so one tenant's cached
    positive never satisfies another's check.

## Critical-2: cp_proxy path allowlist (lateral-movement fix)

cp_proxy.go forwarded any /cp/* path upstream with the cookie
and bearer attached. Since /cp/admin/* accepts sessions as one
of its auth tiers, a tenant-authed user could curl
/cp/admin/tenants/other-slug/diagnostics through their tenant
and the CP would honor it — turning any tenant into a lateral
hop into admin surface.

Fix: explicit allowlist of paths the canvas browser bundle
actually needs (/cp/auth, /cp/orgs, /cp/billing, /cp/templates,
/cp/legal). Everything else 404s at the tenant before cookies
leave. Fail-closed: future UI paths require explicit entries.

## Important-1,2: bounded session cache + split positive/negative TTL

Previous sync.Map cache grew unbounded (one entry per unique
Cookie header for process lifetime) and cached failures for 30s,
meaning a 3s CP blip locked users out for the full window.

Fix:
  - Bounded map with batch random eviction at cap (10k entries ×
    ~100 bytes = 1 MB ceiling). Random eviction is O(1)
    expected; we don't need precise LRU.
  - Periodic sweeper goroutine (2 min) reclaims expired entries
    even when they're not re-hit.
  - Positive TTL 30s, negative TTL 5s — short negative so CP
    flakes self-heal fast.
  - Transport errors NOT cached (would otherwise trap every
    user during a multi-second upstream outage).
  - Cache key = sha256(slug + cookie) so raw session tokens
    don't sit in process memory, and cross-tenant isolation is
    structural not policy.

## Important-3: TenantGuard /cp/* bypass documented

Added a security note to the bypass explaining why it's safe
only under the current setup (cp_proxy allowlist + tunnel-only
ingress), and what would require revisiting (SG opens :8080
inbound to the VPC).

## Tests

  - session_auth_test.go: 12 new tests — empty cookie, missing
    slug, no CP, member:true happy path with cache hit, member:
    false, 401 upstream, malformed JSON, transport error not
    cached, cross-tenant isolation (same cookie different
    tenants hit upstream separately), bounded eviction, expired
    entries, cache key collision resistance.
  - cp_proxy_test.go: new — isCPProxyAllowedPath covers 17
    allow/block cases, forwarding preserves Cookie+Auth, Host
    rewritten, blocked paths 404 without calling upstream.

All platform tests pass. CP provisioner tests pass after
threading cfg.OrgSlug into the container env.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:45:57 -07:00
Hongming Wang
0b8f3239f6 fix(middleware): TenantGuard passes through /cp/* to CP proxy
Today's rollout of cp_proxy (PR #1095/1096) mounted /cp/* as a
reverse-proxy to the control plane, but the TenantGuard middleware
runs first in the global chain and 404s anything that isn't in its
exact-path allowlist (/health + /metrics). Every /cp/auth/me fetch
from canvas landed on a 40µs 404 before ever reaching the proxy.

/cp/* is handled upstream (WorkOS session + admin bearer), so the
tenant doesn't need to attach org identity for those paths. Passing
them through is correct — matches the design where the tenant
platform is a pure transit layer for /cp/*.

Verified: /cp/auth/me via tunnel now returns 401 (correct unauth
from CP) instead of 404 from TenantGuard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:14:56 -07:00
Hongming Wang
d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00