From b583347d1e4d4f32f48cda2849cb2c80e2fa31a0 Mon Sep 17 00:00:00 2001
From: hongming
Date: Sun, 24 May 2026 17:04:08 -0700
Subject: [PATCH] =?UTF-8?q?docs(arch):=20#1793=20workspace-placement=20RFC?=
 =?UTF-8?q?=20=E2=80=94=20formalize=20org-per-EC2=20architecture?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Writes down the architecture decision that has been implicit since the post-2026-05-06 GitHub-suspension rebuild: every Molecule org runs as a fully isolated tenant on its own EC2, with workspace-server + memory plugin + Postgres + Redis + canvas co-located on that instance. The platform (controlplane on Railway) handles provisioning, billing, and DNS — it never holds tenant data.

The RFC was the implicit basis for every design decision in the 2026-05-24 memory v1→v2 migration (#1747 → #1791 → #1792). Formalizing it now so future architectural choices stay aligned.

Closes #1793.

## Changes

1. **docs/architecture/workspace-placement.md** (new) — the RFC itself. Covers the platform/tenant boundary, what crosses it, SSOT rationale, OSS-deployment shape, scaling envelope, decision rules for new features, migration path (currently no backlog), and explicit out-of-scope items (multi-region, BYO-compute, OSS billing).

2. **docs/architecture/molecule-technical-doc.md** — added a one-line note at the top of §3 (System Architecture) linking to the RFC. This is the highest-traffic architecture doc; readers landing there should see the contract immediately.

3. **docs/architecture/memory.md** — added a paragraph in the HMA intro making the tenant-isolation enforcement explicit. HMA's organizational-boundary principle is enforced PHYSICALLY (per-EC2 memory plugin), not at the application layer. Links back to the RFC.

## What the RFC does

- Diagrams the boundary (controlplane → tenants, never the data path)
- Lists what crosses (provisioning, billing, telemetry) vs what doesn't (memory contents, workspace state, files, sessions)
- Documents SSOT rationale: org isolation, platform scaling independent of tenant data volume, OSS-deployability requirements
- Defines OSS deploy shape — workspaces inject MOLECULE_ORG_ID + MOLECULE_PLATFORM_URL; runtime is agnostic to whether it's our hosted platform or self-hosted
- Sizing envelope (100 → 10K orgs supported by current architecture; 1M-org variant explicitly out of scope)
- Decision rules for new feature design (default: tenant; platform only for billing/anonymized analytics; both rare and tenant-as-SSOT)

## What this RFC does NOT cover (separately tracked)

- Multi-region tenant placement (needs its own RFC)
- BYO-compute / customer-managed VPC
- Workspace runtime selection (docs/architecture/workspace-tiers.md)
- Tenant image upgrades (docs/architecture/tenant-image-upgrades.md)
- OSS billing alternatives

## SOP Checklist (RFC #351)

### 1. Comprehensive testing performed

N/A — pure architecture documentation. Diagram + prose verified for internal consistency.

### 2. Local-postgres E2E run

N/A.

### 3. Staging-smoke verified or pending

N/A.

### 4. Root-cause not symptom

The root cause of architectural drift is undocumented decisions becoming implicit. This document removes that surface — anyone proposing a platform-side aggregation of functional state can be pointed at this RFC during review.

### 5. Five-Axis review walked

Walked solo. The product framing (especially OSS-deploy shape + 1M-org out-of-scope) merits Cui's sign-off post-merge — added 'Reviewed by Cui' as an acceptance criterion in #1793's body. PR is mergeable on technical grounds; product review can happen async.

### 6. No backwards-compat shim / dead code added

Pure addition. +198 LOC across 1 new doc + 2 small cross-link edits in existing docs.

### 7. Memory/saved-feedback consulted

- 'reference_post_suspension_pipeline' — context for why per-tenant SSOT became the architecture post-suspension
- 'feedback_no_single_source_of_truth' — this RFC encodes the rule at the architecture level
- Saved new memory pointer 'reference_workspace_placement_rfc' linking to this doc so future Claude sessions reach it before design work

Closes #1793.
---
 docs/architecture/memory.md                 |   2 +
 docs/architecture/molecule-technical-doc.md |   2 +
 docs/architecture/workspace-placement.md    | 194 ++++++++++++++++++++
 3 files changed, 198 insertions(+)
 create mode 100644 docs/architecture/workspace-placement.md

diff --git a/docs/architecture/memory.md b/docs/architecture/memory.md
index 2af02c70e..81fd98472 100644
--- a/docs/architecture/memory.md
+++ b/docs/architecture/memory.md
@@ -6,6 +6,8 @@ Molecule AI's memory model is built around one principle:
 
 That is the purpose of **HMA: Hierarchical Memory Architecture**.
 
+The organizational boundary is enforced **physically**, not at the application layer: each org runs as its own tenant on its own EC2, with its own memory plugin sidecar and its own Postgres. Memory writes are loopback-only — never cross-tenant. See [`workspace-placement.md`](workspace-placement.md) for the architecture contract that makes HMA tenant-isolated by construction.
+
 ## The Three Scopes
 
 | Scope | Meaning | Intended use |
diff --git a/docs/architecture/molecule-technical-doc.md b/docs/architecture/molecule-technical-doc.md
index cb4410db6..5120c2e00 100644
--- a/docs/architecture/molecule-technical-doc.md
+++ b/docs/architecture/molecule-technical-doc.md
@@ -84,6 +84,8 @@ Six runtime adapters ship production-ready on `main`: LangGraph, DeepAgents, Cla
 
 ## 3. System Architecture
 
+> **Workspace placement contract:** every Molecule org runs as a fully isolated tenant on its own EC2, with workspace-server, memory plugin, Postgres, and Redis all co-located. The platform (controlplane on Railway) handles provisioning, billing, and DNS only — it never holds tenant data. See [`workspace-placement.md`](workspace-placement.md) for the formal RFC.
+
 ### System Boundary Diagram
 
 ```
diff --git a/docs/architecture/workspace-placement.md b/docs/architecture/workspace-placement.md
new file mode 100644
index 000000000..1c2fb17a3
--- /dev/null
+++ b/docs/architecture/workspace-placement.md
@@ -0,0 +1,194 @@
+# Workspace placement — org-per-EC2 architecture
+
+Status: Accepted (implicit since 2026-05; formalized 2026-05-24)
+Owners: hongming (CTO), cui (CEO)
+Tracking: #1793
+
+This RFC formalizes the architecture decision that has been implicit in the system since the post-suspension rebuild: **each Molecule AI org is one isolated tenant on its own EC2 instance**, with every functional surface (workspace-server, memory plugin, Postgres, Redis, canvas) co-located on that instance. The platform's role is provisioning, billing, and the cross-tenant control plane — never the data path.
+
+The implementation already follows this pattern in every direction we look (provisioner, memory v2 cutover, tenant entrypoint, controlplane user-data, even the OSS deploy story). Writing it down so it stays that way.
+
+## TL;DR
+
+```
+             ┌──────────────────────────────────┐
+             │  Platform (controlplane)         │
+             │  Railway-hosted                  │
+             │  api.moleculesai.app             │
+             │                                  │
+             │  - org provisioning              │
+             │  - billing + Stripe integration  │
+             │  - DNS + tunnel orchestration    │
+             │  - auth / org-token issuance     │
+             │  - fleet redeploy orchestration  │
+             │                                  │
+             │  NEVER holds tenant data         │
+             └──────────────────────────────────┘
+                 │                          │
+       provision │                          │ provision
+       + billing │                          │ + billing
+                 ▼                          ▼
+  ┌─────────────────────────┐    ┌─────────────────────────┐
+  │ Tenant: agents-team     │    │ Tenant:                 │
+  │ Own EC2 (us-east-2)     │    │ Own EC2 (us-east-2)     │
+  │ agents-team.molecule..  │    │ .moleculesai.app        │
+  │                         │    │                         │
+  │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
+  │  │ molecule-tenant   │  │    │  │ molecule-tenant   │  │
+  │  │ (workspace-server │  │    │  │ (workspace-server │  │
+  │  │  + canvas + go)   │  │    │  │  + canvas + go)   │  │
+  │  └───────────────────┘  │    │  └───────────────────┘  │
+  │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
+  │  │ memory-plugin     │  │    │  │ memory-plugin     │  │
+  │  │ (loopback :9100)  │  │    │  │ (loopback :9100)  │  │
+  │  └───────────────────┘  │    │  └───────────────────┘  │
+  │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
+  │  │ postgres pgvector │  │    │  │ postgres pgvector │  │
+  │  │ (172.17.0.1:5432) │  │    │  │ (172.17.0.1:5432) │  │
+  │  └───────────────────┘  │    │  └───────────────────┘  │
+  │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
+  │  │ redis             │  │    │  │ redis             │  │
+  │  └───────────────────┘  │    │  └───────────────────┘  │
+  │  ┌───────────────────┐  │    │  ┌───────────────────┐  │
+  │  │ workspace runtime │  │    │  │ workspace runtime │  │
+  │  │ containers (ws-*) │  │    │  │ containers (ws-*) │  │
+  │  └───────────────────┘  │    │  └───────────────────┘  │
+  └─────────────────────────┘    └─────────────────────────┘
+```
+
+Every tenant is a self-contained molecule-core instance. The platform is a thin coordinator above them.
+
+## What crosses the platform/tenant boundary
+
+What the platform sends down to the tenant:
+
+- Initial EC2 provisioning (user-data script via SSM) — see `molecule-controlplane/internal/provisioner/ec2.go`
+- Per-tenant secrets (DB password, `SECRETS_ENCRYPTION_KEY`, `MOLECULE_CP_SHARED_SECRET`) injected as env at boot
+- Image redeploys via `POST /cp/admin/tenants/:slug/redeploy` → SSM → `docker pull && docker stop && docker run`
+- DNS records (Cloudflare) and tunnel registration (cloudflared)
+- Billing-state changes (subscription status, plan upgrades)
+
+What the tenant sends up to the platform:
+
+- Boot-stage telemetry (`report_stage` calls during EC2 user-data execution)
+- LLM usage events (for billing aggregation; documented in `controlplane/migrations/037_llm_usage_billing.up.sql`)
+- Workspace lifecycle events for cross-tenant analytics — read-only, no remote control implied
+
+What does NOT cross the boundary:
+
+- Memory contents (HMA scopes, agent_memories before A3, memory_plugin records after)
+- Workspace state, files, canvas layouts
+- Workspace runtime container state
+- Per-org user authentication state (tenant issues its own session tokens via `wsauth`)
+
+If a feature design wants to put any of those on the platform side, that's a violation of this RFC and needs explicit justification.
+
+## SSOT rationale
+
+The single-source-of-truth boundary is **the tenant EC2**.
+
+This decision was the implicit basis for the memory v1→v2 migration that ran 2026-05-24 (issues #1747 → #1791 → #1792). The v2 memory plugin runs as a sidecar on each tenant EC2, sharing the tenant's Postgres under a dedicated `memory_plugin` schema. There is no platform-side memory aggregation, no central index, no cross-tenant memory federation. Memory writes are loopback-only (workspace-server → memory-plugin on `127.0.0.1:9100`).
+
+Why this is correct:
+
+1. **Organizational isolation is the product.** A tenant's memory, workspaces, secrets, and conversation history must not be readable by another org, ever. The simplest enforcement is physical: different EC2, different DB, different network. Application-level multi-tenancy adds a class of cross-tenant data leak bugs that can't happen here.
+
+2. **The platform must remain horizontally scalable independent of tenant data volume.** If memory aggregation lived on the platform, billing/provisioning/auth would scale with the volume of memory across all tenants. With per-tenant storage, the platform's scaling envelope depends only on the number of orgs.
+
+3. **OSS-deployability requires it.** molecule-core is open-source; anyone can deploy it. If functional state lived on a centralized platform, OSS deployers would either have to run their own platform (high barrier) or call ours (privacy concern + scale concern). Per-tenant SSOT means the OSS molecule-core instance is functionally complete — it just talks to a platform for billing.
+
+## OSS-deployment shape
+
+A workspace inside any tenant reaches its parent tenant through two env vars that the workspace-server injects at container start:
+
+- `MOLECULE_ORG_ID` — the UUID of the org this workspace belongs to
+- `MOLECULE_PLATFORM_URL` — the tenant's HTTPS URL (e.g. `https://agents-team.moleculesai.app`)
+
+These are baked into the workspace runtime's docker run by the workspace-server when it provisions a workspace. The workspace's agent runtime uses them to:
+
+- Register itself in the tenant's `workspaces` table
+- Send heartbeats (Redis TTL key on the tenant)
+- Subscribe to A2A messages via the tenant's WebSocket hub
+- Commit memories via the tenant's MCP bridge or HTTP `/memories` endpoints
+
+An OSS deployer running their own molecule-core instance gets the same shape: their workspaces receive the deployer's tenant URL and org ID. The agent runtime is **agnostic** to whether it's talking to our hosted platform or a self-hosted one.
+
+The only thing tying a tenant to **our** platform is the billing/auth path:
+
+- `MOLECULE_CP_URL` env on the tenant container points at `api.moleculesai.app`
+- `MOLECULE_CP_SHARED_SECRET` env authenticates the tenant→platform direction
+- LLM usage events POST to `cp_url/cp/llm-usage-events` for billing aggregation
+
+An OSS deployer can leave `MOLECULE_CP_URL` unset (or point at their own platform). The workspace-server's `wiring.go` and `cp_provisioner.go` already handle the absent-CP case gracefully — the tenant is fully functional without it.
+
+## Scaling envelope
+
+Per-tenant resource shape (current):
+
+| Layer | Sizing |
+|---|---|
+| EC2 | t3.medium (2 vCPU, 4 GiB) for default-tier orgs |
+| Postgres | Single container, pgvector pre-installed, ~1-10 GiB per org expected |
+| Memory plugin | Loopback only, ~50 MB resident, scales with memory record count |
+| Workspace runtime containers (ws-\*) | One per workspace; sized by template tier |
+
+The platform's scaling envelope:
+
+| Layer | Sizing |
+|---|---|
+| controlplane | Single Railway service, scales horizontally |
+| Postgres | One Railway-hosted Postgres for billing + org registry + auth tokens |
+| DNS | Cloudflare zone with one CNAME per tenant |
+| Tunnels | One Cloudflare tunnel per tenant |
+
+Order-of-magnitude:
+
+- 100 orgs: trivial (100 EC2s, controlplane unchanged)
+- 10K orgs: needs an EC2 placement strategy (region pinning, dedicated-tier hosts), but the platform is still a single service
+- 1M orgs: this design starts to strain — Cloudflare tunnel-per-tenant becomes expensive, EC2-per-tenant becomes resource-wasteful, and we'd want a denser tenant-on-shared-infra mode
+
+The current architecture is sized for the 100–10K range. The 1M-org variant is explicitly out of scope for this RFC.
+
+## Decision points for new feature design
+
+When proposing a new feature, the design must answer "where does the data live?" Pick one:
+
+1. **On the tenant.** Default choice for anything functional. Tenant DB, tenant memory plugin, tenant filesystem. The feature ships in `molecule-core` and is deployed via the tenant image.
+
+2. **On the platform.** ONLY for billing, cross-org analytics (anonymized), org registry, auth tokens, DNS/tunnel state. The feature ships in `molecule-controlplane`.
+
+3. **Both, with one as SSOT.** Rare. The tenant is the SSOT; the platform may cache for cross-tenant queries but must be willing to re-read from the tenant on miss. Document the cache invalidation contract.
+
+When in doubt, default to #1. If you find yourself wanting to put HMA memory, workspace state, or session history on the platform, stop — you're re-introducing the SSOT violation the v1→v2 memory migration was designed to remove.
+
+## Migration path for non-conforming code
+
+The implementation already conforms. There is no migration backlog as of 2026-05-24:
+
+- Memory: v1→v2 migration complete (#1747 → #1791 → #1792). v2 plugin per-tenant is SSOT.
+- Workspace state: always per-tenant (the `workspaces` table lives in the tenant Postgres).
+- Activity logs: per-tenant `activity_logs` table.
+- Files: per-tenant (Docker volumes attached to ws-\* containers).
+- Secrets: per-tenant (`workspace_secrets` + `global_secrets` tables in tenant DB).
+- LLM usage events: tenant emits, platform aggregates for billing — correct shape.
+
+If a future PR proposes platform-side aggregation of something functional, link this RFC in the review.
+
+## What this RFC does NOT cover
+
+Out of scope for this document; tracked separately if needed:
+
+- **Multi-region tenant placement** — current design is single-region (us-east-2). Multi-region needs its own RFC because it changes the EC2 placement contract.
+- **BYO-compute / customer-managed VPC** — adjacent design; the org-per-EC2 boundary holds but the EC2 ownership shifts to the customer.
+- **Workspace runtime selection** — separately documented in `docs/architecture/workspace-tiers.md`.
+- **Tenant image upgrade strategy** — separately documented in `docs/architecture/tenant-image-upgrades.md`.
+- **OSS billing alternatives** — how OSS deployers handle billing without our controlplane is a separate go-to-market decision.
+
+## References
+
+- `docs/architecture/memory.md` — HMA scopes + v2 plugin
+- `docs/architecture/saas-prod-migration-2026-04-19.md` — provisioning pipeline reference
+- `docs/architecture/molecule-technical-doc.md` §3 (System Architecture) — top-level picture
+- `molecule-controlplane/internal/provisioner/ec2.go` — the canonical user-data + docker run for tenants
+- `workspace-server/entrypoint-tenant.sh` — the canonical tenant boot script
+- Memory system migration: #1747 (kill v1 fallback), #1791 (Phase A2 backfill), #1792 (Phase A3 drop table)
-- 
2.52.0
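
Review note on the OSS-deployment shape in the RFC: the env-var contract can be illustrated with a minimal Python sketch. This is not code from the repository. The env var names (`MOLECULE_ORG_ID`, `MOLECULE_PLATFORM_URL`, `MOLECULE_CP_URL`) and the `/cp/llm-usage-events` path come from the RFC text; `WorkspaceEnv`, `load_workspace_env`, and `usage_events_url` are hypothetical names, and the sketch assumes `MOLECULE_CP_URL` is set as a full URL including its scheme.

```python
from dataclasses import dataclass
from typing import Mapping, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class WorkspaceEnv:
    """Hypothetical holder for the two values a workspace container receives."""
    org_id: str
    platform_url: str  # the tenant's URL, hosted or self-hosted


def load_workspace_env(env: Mapping[str, str]) -> WorkspaceEnv:
    """Read the workspace-side contract; fail fast if either value is missing."""
    org_id = env.get("MOLECULE_ORG_ID", "").strip()
    platform_url = env.get("MOLECULE_PLATFORM_URL", "").strip().rstrip("/")
    if not org_id:
        raise ValueError("MOLECULE_ORG_ID is required")
    parsed = urlparse(platform_url)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError("MOLECULE_PLATFORM_URL must be an absolute URL")
    return WorkspaceEnv(org_id=org_id, platform_url=platform_url)


def usage_events_url(env: Mapping[str, str]) -> Optional[str]:
    """Tenant-side: where LLM usage events would be POSTed, or None.

    None models the absent-CP case: an OSS deployer who leaves
    MOLECULE_CP_URL unset simply has no billing upload target.
    """
    cp_url = env.get("MOLECULE_CP_URL", "").strip().rstrip("/")
    if not cp_url:
        return None
    return f"{cp_url}/cp/llm-usage-events"
```

The workspace side never reads `MOLECULE_CP_URL`, which is the point of the contract: the runtime only knows its tenant, so the same container works against a hosted or a self-hosted tenant.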