molecule-ai/molecule-core

Fork 2

Files

T

hongming 6964b26474

ci-arm64-advisory / fast-checks (push) Waiting to run

Details

Block internal-flavored paths / Block forbidden paths (push) Successful in 9s

Details

CI / Python Lint & Test (push) Successful in 8s

Details

CI / Detect changes (push) Successful in 12s

Details

Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (push) Successful in 10s

Details

Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (push) Successful in 10s

Details

Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (push) Successful in 11s

Details

Handlers Postgres Integration / detect-changes (push) Successful in 12s

Details

E2E API Smoke Test / detect-changes (push) Successful in 25s

Details

CI / all-required (push) Successful in 27s

Details

E2E Chat / detect-changes (push) Successful in 25s

Details

E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 23s

Details

Secret scan / Scan diff for credential-shaped strings (push) Successful in 8s

Details

CI / Platform (Go) (push) Successful in 4s

Details

CI / Canvas (Next.js) (push) Successful in 4s

Details

CI / Shellcheck (E2E scripts) (push) Successful in 2s

Details

E2E API Smoke Test / E2E API Smoke Test (push) Successful in 5s

Details

E2E Chat / E2E Chat (push) Successful in 8s

Details

Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 1m58s

Details

CI / Canvas Deploy Reminder (push) Successful in 6s

Details

E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 6s

Details

publish-workspace-server-image / build-and-push (push) Successful in 3m59s

Details

publish-workspace-server-image / Production auto-deploy (push) Has been cancelled

Details

docs(arch): #1793 workspace-placement RFC — formalize org-per-EC2 architecture (#1819 )

CTO-bypass merge 2026-05-24: #1793 workspace-placement RFC

2026-05-25 00:09:38 +00:00

13 KiB

Raw Permalink Blame History

Workspace placement — org-per-EC2 architecture

Status: Accepted (implicit since 2026-05; formalized 2026-05-24) Owners: hongming (CTO), cui (CEO) Tracking: #1793

This RFC formalizes the architecture decision that has been implicit in the system since the post-suspension rebuild: each Molecule AI org is one isolated tenant on its own EC2 instance, with every functional surface (workspace-server, memory plugin, Postgres, Redis, canvas) co-located on that instance. The platform's role is provisioning, billing, and the cross-tenant control plane — never the data path.

The implementation already follows this pattern in every direction we look (provisioner, memory v2 cutover, tenant entrypoint, controlplane user-data, even the OSS deploy story). Writing it down so it stays that way.

TL;DR

                ┌──────────────────────────────────┐
                │  Platform (controlplane)         │
                │  Railway-hosted                  │
                │  api.moleculesai.app             │
                │                                  │
                │  - org provisioning              │
                │  - billing + Stripe integration  │
                │  - DNS + tunnel orchestration    │
                │  - auth / org-token issuance     │
                │  - fleet redeploy orchestration  │
                │                                  │
                │  NEVER holds tenant data         │
                └──────────────────────────────────┘
                         │                      │
              provision  │                      │  provision
              + billing  │                      │  + billing
                         ▼                      ▼
       ┌─────────────────────────┐    ┌─────────────────────────┐
       │  Tenant: agents-team    │    │  Tenant: <other-org>    │
       │  Own EC2 (us-east-2)    │    │  Own EC2 (us-east-2)    │
       │  agents-team.molecule.. │    │  <slug>.moleculesai.app │
       │                         │    │                         │
       │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
       │  │ molecule-tenant   │  │    │  │ molecule-tenant   │  │
       │  │ (workspace-server │  │    │  │ (workspace-server │  │
       │  │  + canvas + go)   │  │    │  │  + canvas + go)   │  │
       │  └───────────────────┘  │    │  └───────────────────┘  │
       │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
       │  │ memory-plugin     │  │    │  │ memory-plugin     │  │
       │  │ (loopback :9100)  │  │    │  │ (loopback :9100)  │  │
       │  └───────────────────┘  │    │  └───────────────────┘  │
       │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
       │  │ postgres pgvector │  │    │  │ postgres pgvector │  │
       │  │ (172.17.0.1:5432) │  │    │  │ (172.17.0.1:5432) │  │
       │  └───────────────────┘  │    │  └───────────────────┘  │
       │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
       │  │ redis             │  │    │  │ redis             │  │
       │  └───────────────────┘  │    │  └───────────────────┘  │
       │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
       │  │ workspace runtime │  │    │  │ workspace runtime │  │
       │  │ containers (ws-*) │  │    │  │ containers (ws-*) │  │
       │  └───────────────────┘  │    │  └───────────────────┘  │
       └─────────────────────────┘    └─────────────────────────┘

Every tenant is a self-contained molecule-core instance. The platform is a thin coordinator above them.

What crosses the platform/tenant boundary

What the platform sends down to the tenant:

Initial EC2 provisioning (user-data script via SSM) — see molecule-controlplane/internal/provisioner/ec2.go
Per-tenant secrets (DB password, SECRETS_ENCRYPTION_KEY, MOLECULE_CP_SHARED_SECRET) injected as env at boot
Image redeploys via POST /cp/admin/tenants/:slug/redeploy → SSM → docker pull && docker stop && docker run
DNS records (Cloudflare) and tunnel registration (cloudflared)
Billing-state changes (subscription status, plan upgrades)

What the tenant sends up to the platform:

Boot-stage telemetry (report_stage calls during EC2 user-data execution)
LLM usage events (for billing aggregation; documented in controlplane/migrations/037_llm_usage_billing.up.sql)
Workspace lifecycle events for cross-tenant analytics — read-only, no remote control implied

What does NOT cross the boundary:

Memory contents (HMA scopes, agent_memories before A3, memory_plugin records after)
Workspace state, files, canvas layouts
Workspace runtime container state
Per-org user authentication state (tenant issues its own session tokens via wsauth)

If a feature design wants to put any of those on the platform side, that's a violation of this RFC and needs explicit justification.

SSOT rationale

The single-source-of-truth boundary is the tenant EC2.

This decision was the implicit basis for the memory v1→v2 migration that ran 2026-05-24 (issues #1747 → #1791 → #1792). The v2 memory plugin runs as a sidecar on each tenant EC2, sharing the tenant's Postgres under a dedicated memory_plugin schema. There is no platform-side memory aggregation, no central index, no cross-tenant memory federation. Memory writes are loopback-only (workspace-server → memory-plugin on 127.0.0.1:9100).

Why this is correct:

Organizational isolation is the product. A tenant's memory, workspaces, secrets, and conversation history must not be readable by another org, ever. The simplest enforcement is physical: different EC2, different DB, different network. Application-level multi-tenancy adds a class of cross-tenant data leak bugs that can't happen here.
The platform must remain horizontally scalable independent of tenant data volume. If memory aggregation lived on the platform, billing/provisioning/auth would scale with the volume of memory across all tenants. With per-tenant storage, the platform's scaling envelope depends only on the number of orgs.
OSS-deployability requires it. molecule-core is open-source; anyone can deploy it. If functional state lived on a centralized platform, OSS deployers would either have to run their own platform (high barrier) or call ours (privacy concern + scale concern). Per-tenant SSOT means the OSS molecule-core instance is functionally complete — it just talks to a platform for billing.

OSS-deployment shape

A workspace inside any tenant reaches its parent tenant by injecting two env vars at container start:

MOLECULE_ORG_ID — the UUID of the org this workspace belongs to
MOLECULE_PLATFORM_URL — the tenant's HTTPS URL (e.g. https://agents-team.moleculesai.app)

These are baked into the workspace runtime's docker run by the workspace-server when it provisions a workspace. The workspace's agent runtime uses them to:

Register itself in the tenant's workspaces table
Send heartbeats (Redis TTL key on the tenant)
Subscribe to A2A messages via the tenant's WebSocket hub
Commit memories via the tenant's MCP bridge or HTTP /memories endpoints

An OSS deployer running their own molecule-core instance gets the same shape: their workspaces inject the deployer's tenant URL and org ID. The agent runtime is agnostic to whether it's talking to our hosted platform or a self-hosted one.

The only thing tying a tenant to our platform is the billing/auth path:

MOLECULE_CP_URL env on the tenant container points at api.moleculesai.app
MOLECULE_CP_SHARED_SECRET env authenticates the tenant→platform direction
LLM usage events POST to cp_url/cp/llm-usage-events for billing aggregation

An OSS deployer can leave MOLECULE_CP_URL unset (or point at their own platform). The workspace-server's wiring.go and cp_provisioner.go already handle the absent-CP case gracefully — the tenant is fully functional without it.

Scaling envelope

Per-tenant resource shape (current):

Layer	Sizing
EC2	t3.medium (2 vCPU, 4 GiB) for default-tier orgs
Postgres	Single container, pgvector pre-installed, ~1-10 GiB per org expected
Memory plugin	Loopback only, ~50 MB resident, scales with memory record count
Workspace runtime containers (ws-*)	One per workspace; sized by template tier

The platform's scaling envelope:

Layer	Sizing
controlplane	Single Railway service, scales horizontally
Postgres	One Railway-hosted Postgres for billing + org registry + auth tokens
DNS	Cloudflare zone with one CNAME per tenant
Tunnels	One Cloudflare tunnel per tenant

Order-of-magnitude:

100 orgs: trivial (100 EC2s, controlplane unchanged)
10K orgs: needs an EC2 placement strategy (region pinning, dedicated-tier hosts), but the platform is still a single service
1M orgs: this design starts to strain — Cloudflare tunnel-per-tenant becomes expensive, EC2-per-tenant becomes resource-wasteful, and we'd want a denser tenant-on-shared-infra mode

The current architecture is sized for the 100–10K range. The 1M-org variant is explicitly out of scope for this RFC.

Decision points for new feature design

When proposing a new feature, the design must answer "where does the data live?" Pick one:

On the tenant. Default choice for anything functional. Tenant DB, tenant memory plugin, tenant filesystem. The feature ships in molecule-core and is deployed via the tenant image.
On the platform. ONLY for billing, cross-org analytics (anonymized), org registry, auth tokens, DNS/tunnel state. The feature ships in molecule-controlplane.
Both, with one as SSOT. Rare. The tenant is the SSOT; the platform may cache for cross-tenant queries but must be willing to re-read from the tenant on miss. Document the cache invalidation contract.

When in doubt, default to #1. If you find yourself wanting to put HMA memory, workspace state, or session history on the platform, stop — you're re-introducing the SSOT violation the v1→v2 memory migration was designed to remove.

Migration path for non-conforming code

The implementation already conforms. There is no migration backlog as of 2026-05-24:

Memory: v1→v2 migration complete (#1747 → #1791 → #1792). v2 plugin per-tenant is SSOT.
Workspace state: always per-tenant (the workspaces table lives in the tenant Postgres).
Activity logs: per-tenant activity_logs table.
Files: per-tenant (Docker volumes attached to ws-* containers).
Secrets: per-tenant (workspace_secrets + global_secrets tables in tenant DB).
LLM usage events: tenant emits, platform aggregates for billing — correct shape.

If a future PR proposes platform-side aggregation of something functional, link this RFC in the review.

What this RFC does NOT cover

Out of scope for this document; tracked separately if needed:

Multi-region tenant placement — current design is single-region (us-east-2). Multi-region needs its own RFC because it changes the EC2 placement contract.
BYO-compute / customer-managed VPC — adjacent design; the org-per-EC2 boundary holds but the EC2 ownership shifts to the customer.
Workspace runtime selection — separately documented in docs/architecture/workspace-tiers.md.
Tenant image upgrade strategy — separately documented in docs/architecture/tenant-image-upgrades.md.
OSS billing alternatives — how OSS deployers handle billing without our controlplane is a separate go-to-market decision.

References

docs/architecture/memory.md — HMA scopes + v2 plugin
docs/architecture/saas-prod-migration-2026-04-19.md — provisioning pipeline reference
docs/architecture/molecule-technical-doc.md §3 (System Architecture) — top-level picture
molecule-controlplane/internal/provisioner/ec2.go — the canonical user-data + docker run for tenants
workspace-server/entrypoint-tenant.sh — the canonical tenant boot script
Memory system migration: #1747 (kill v1 fallback), #1791 (Phase A2 backfill), #1792 (Phase A3 drop table)

13 KiB Raw Permalink Blame History Unescape Escape

Workspace placement — org-per-EC2 architecture

TL;DR

What crosses the platform/tenant boundary

SSOT rationale

OSS-deployment shape

Scaling envelope

Decision points for new feature design

Migration path for non-conforming code

What this RFC does NOT cover

References

13 KiB

Raw Permalink Blame History