molecule-core/docs/retrospectives/2026-04-17-saas-buildout.md
Hongming Wang 39074cc4ae chore: final open-source cleanup — binary, stale paths, private refs
- Remove compiled workspace-server/server binary from git
- Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs
- Fix CI workflow path filters (workspace-template → workspace)
- Replace real EC2 IP and personal slug in test_saas_tenant.sh
- Scrub molecule-controlplane references in docs
- Fix stale workspace-template/ paths in provisioner, handlers, tests
- Clean tracked Python cache files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:38:55 -07:00

11 KiB

Session Retrospective: 2026-04-16/17 SaaS Buildout

Duration: ~24 hours (overnight autonomous + daytime interactive) Scope: Full SaaS infrastructure migration + E2E workspace provisioning Status: Platform API 17/17 pass, workspace A2A confirmed working, multiple issues remain for production readiness


What was done

Infrastructure migration (Fly.io → Railway + EC2)

Change Repo Status
Railway deployment for control plane the private control-plane repo Deployed, auto-deploy on push
EC2 provisioner for tenants (Postgres + Redis + Platform in Docker) the private control-plane repo Deployed
EC2 provisioner for workspaces (pip install runtime at boot) the private control-plane repo Deployed, 9 min cold start
Cloudflare Worker for wildcard subdomain routing molecule-tenant-proxy (new repo) Deployed
Wildcard DNS *.moleculesai.app → Worker Cloudflare dashboard Done
Per-tenant ADMIN_TOKEN for Worker auth injection the private control-plane repo Deployed
Auto-updater cron on tenant EC2s (Option B) the private control-plane repo Deployed
Phase 33.2: stop creating per-tenant DNS records the private control-plane repo Deployed
Provisioning status page (progress bar + ETA) molecule-app Deployed to Vercel
Delete org button with type-to-confirm molecule-app Deployed to Vercel
Remove admin section from SaaS app molecule-app Deployed to Vercel

Monorepo PRs merged (by me)

PR Title
#584 TenantGuard same-origin bypass for EC2 tenant Canvas
#585 Remove Fly registry from publish pipeline
#586 Remove brand-monitor from monorepo
#587 5 Canvas UX fixes (error handling, a11y, loading state)
#588 Hermes + gemini-cli deploy preflight required keys
#589 Ecosystem-watch MAF v1.0 update
#646 Migration TEXT→UUID FK type mismatch (critical E2E unblock)
#751 A2A topology overlay
#771 mcp-eval quality gate
#843 pgvector migration DO block guard (critical E2E unblock)

Monorepo PRs merged (by other agents, reviewed by me)

#601, #602, #606, #610, #611, #612, #627, #629, #630, #639, #640, #641, #644, #645, #650, #655, #656, #659, #669, #764, #784, #785, #791, #793, #794, #796, #797, #798, #803, #808 — 30+ PRs total.

Issues filed

Issue Title
#590 AG-UI compatible SSE endpoint (implemented in #601)
#591 Per-org tool governance registry
#592 Per-workspace cost transparency
#850 Canvas :3000 not running on tenant EC2 (fixed)
#863 Workspace boot script missing config.yaml (fixed)

Docs created

Doc Purpose
docs/architecture/wildcard-dns-proxy.md Phase 33 Cloudflare Worker architecture
docs/architecture/tenant-image-upgrades.md Options A/B/C for tenant auto-upgrade
docs/architecture/partner-api-keys.md Phase 34 partner/programmatic API access
tests/e2e/test_saas_tenant.sh Reusable SaaS tenant smoke test

Standalone repos created

Repo Purpose
Molecule-AI/molecule-tenant-proxy Cloudflare Worker for subdomain routing

What should NOT have been changed (but was)

1. Wildcard DNS record changed 4 times in one session

The wildcard A record for *.moleculesai.app was pointed at:

  1. <EC2_IP> (real EC2 IP) — initial
  2. 198.51.100.1 (RFC 5737 TEST-NET) — Cloudflare blocked it (1003)
  3. <EC2_IP> (terminated EC2) — caused 1003 for all subdomains
  4. <EC2_IP> (another terminated EC2) — same issue
  5. <EC2_IP> (final live EC2) — current

Impact: Every subdomain queried during configs 2-4 got permanently cached as 1003 at Cloudflare's edge. Cache purge didn't help (different cache layer). These subdomains are stuck until Cloudflare's DNS routing cache expires (~24h).

Lesson: The wildcard should have pointed to a stable, always-live IP from the start. In production, this should be a dedicated proxy/load balancer IP that never changes, not an individual EC2 instance.

Follow-up: Consider using a Cloudflare Tunnel instead of a proxied A record — tunnels don't have the origin-IP-must-be-reachable requirement.

2. AdminAuth Origin bypass attempted then reverted

Attempted to add canvasOriginAllowed() to AdminAuth middleware to let the Canvas through without a bearer token. A test (#623) correctly blocked this — Origin is forgeable, and AdminAuth protects sensitive routes (secrets, events, bundles).

What should have been done from the start: Per-tenant ADMIN_TOKEN (which we eventually implemented). The Origin bypass was a security shortcut that the existing test suite caught.

Current state: Reverted. ADMIN_TOKEN is the correct approach.

3. Debug code left in CP provisioner

The workspace boot script still has:

  • python3 -m http.server 9999 debug server exposing /var/log/
  • Crash detection echo "RUNTIME CRASHED" with log dump
  • set -ex showing all commands in cloud-init console

Follow-up: Remove debug instrumentation before production. The debug server on :9999 exposes boot logs to anyone who can reach the EC2 IP.

4. GHCR auth removed then re-added

Removed docker login from tenant boot script (assuming public GHCR), then had to re-add it when the package couldn't be made public (linked to private repo). Wasted one provisioning cycle.

5. DB rows deleted manually via psql

Multiple times during testing, org/instance rows were deleted directly via psql instead of going through the proper DELETE /cp/orgs/:slug cascade. This left orphaned EC2 instances running (costing money) and skipped the GDPR purge audit trail.

Lesson: Always use the API for deletions. The cascade handles EC2 termination + DNS cleanup + audit logging.


Security concerns to address

CRITICAL

  1. #756 — X-Workspace-ID header forge bypasses CanCommunicate Any workspace can reach any other workspace by setting X-Workspace-ID: system:anything. Complete access control bypass. Fix options proposed, awaiting CEO design decision.

  2. #757 — GLOBAL memory poisoning Root workspaces can inject persistent prompt injection into all agents via GLOBAL memory scope. Mitigations proposed, awaiting CEO decision.

HIGH

  1. ADMIN_TOKEN in plaintext in org_instances table The per-tenant ADMIN_TOKEN is stored unencrypted in the CP database. Should be encrypted with the envelope key like other secrets.

  2. ADMIN_TOKEN exposed via /cp/orgs/:slug/instance public endpoint The Worker's routing endpoint returns the admin_token in plaintext. This endpoint is public (no auth). Anyone who knows the slug can get the admin token and access all AdminAuth-protected routes. Fix: Remove admin_token from the public response. Store it in Worker KV at provision time instead.

  3. Debug HTTP server on workspace EC2 port 9999 Exposes boot logs (may contain secrets in env exports) to anyone who can reach the EC2 IP. Must be removed before production.

  4. set -ex in boot scripts Shows all commands including secret values in cloud-init console output. EC2 console output is accessible via AWS API.

MEDIUM

  1. Workspace EC2 security group allows all inbound Should restrict to: Cloudflare IPs (for Worker proxying), tenant EC2 IP (for direct platform communication), SSH from admin IP only.

  2. No HTTPS between Worker and EC2 Worker connects to EC2 on http://IP:8080 (plain HTTP). Traffic crosses the public internet unencrypted. Should use a tunnel or at minimum restrict to VPC.


What needs proper workflow

1. Workspace registration not working

Workspace EC2s boot, start the A2A server on :8000, but never register with the tenant platform (POST /registry/register). The workspace stays at "provisioning" status forever on the Canvas.

Root cause: The boot script starts molecule-runtime which handles registration, but the runtime may not have the workspace auth token needed for registration. The token is issued by the tenant platform after the CP provision call, but it's not passed to the workspace EC2.

Fix needed: Pass the workspace auth token in the boot script env, or have the runtime request a token at startup.

2. Workspace boot time (9 min cold start)

The workspace EC2 boot sequence:

  • apt-get update + install (~2 min)
  • python3 -m venv + pip install molecule-ai-workspace-runtime (~2 min)
  • git clone adapter repo + pip install adapter deps (~2 min)
  • Runtime initialization (~2-3 min)

Fix: Pre-baked AMIs per runtime (tracked in project_ami_pipeline.md). Each AMI has all deps pre-installed. Boot reduces to ~30s.

3. CI blocked by go.mod replace directive

PR #900 fixes replace github.com/...plugin... => /plugin which breaks native Go builds. The replace is needed only in Docker builds where the plugin is COPYed to /plugin. Fix: add replace at Docker build time via RUN echo 'replace ...' >> go.mod.

4. Cloudflare edge cache poisoning

Changing the wildcard A record origin IP causes all previously-queried subdomains to cache the 1003 error for hours. HTTP cache purge doesn't clear DNS routing cache.

Fix for production: Use a stable origin IP (dedicated proxy) or Cloudflare Tunnel. Never change the wildcard origin IP in production.


Tests needed

Automated (add to CI)

  • Workspace EC2 boot script integration test (mock EC2, verify user-data contains config.yaml, adapter clone, env vars)
  • CP workspace provision handler test (verify env map passthrough)
  • Worker routing test (mock CP lookup, verify correct backend proxy)
  • Tenant ADMIN_TOKEN validation test (verify AdminAuth accepts it)
  • Provisioning status endpoint test (verify direct-IP health check)

Manual (before GA)

  • Full org lifecycle: create → provision → deploy workspace → send message → get AI response → delete workspace → delete org
  • Multi-org isolation: create 2 orgs, verify workspace A cannot reach workspace B
  • Workspace auto-update: push new image, verify tenant picks it up within 5 min
  • Org deletion cascade: verify EC2 terminated, DNS cleaned, DB purged, audit trail written
  • Browser E2E: Canvas loads, onboarding wizard works, deploy template prompts for API key, workspace comes online, chat works

Security (before GA)

  • Fix #756 (X-Workspace-ID forge) — complete access control bypass
  • Fix #757 (GLOBAL memory poisoning)
  • Remove ADMIN_TOKEN from public /instance endpoint
  • Encrypt ADMIN_TOKEN in DB
  • Remove debug server (:9999) from workspace boot script
  • Remove set -ex from boot scripts (leaks secrets to console)
  • Restrict workspace EC2 security group
  • Add HTTPS between Worker and EC2 (or use tunnel)