- Remove compiled workspace-server/server binary from git - Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs - Fix CI workflow path filters (workspace-template → workspace) - Replace real EC2 IP and personal slug in test_saas_tenant.sh - Scrub molecule-controlplane references in docs - Fix stale workspace-template/ paths in provisioner, handlers, tests - Clean tracked Python cache files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
11 KiB
Session Retrospective: 2026-04-16/17 SaaS Buildout
Duration: ~24 hours (overnight autonomous + daytime interactive) Scope: Full SaaS infrastructure migration + E2E workspace provisioning Status: Platform API 17/17 pass, workspace A2A confirmed working, multiple issues remain for production readiness
What was done
Infrastructure migration (Fly.io → Railway + EC2)
| Change | Repo | Status |
|---|---|---|
| Railway deployment for control plane | the private control-plane repo | Deployed, auto-deploy on push |
| EC2 provisioner for tenants (Postgres + Redis + Platform in Docker) | the private control-plane repo | Deployed |
| EC2 provisioner for workspaces (pip install runtime at boot) | the private control-plane repo | Deployed, 9 min cold start |
| Cloudflare Worker for wildcard subdomain routing | molecule-tenant-proxy (new repo) | Deployed |
Wildcard DNS *.moleculesai.app → Worker |
Cloudflare dashboard | Done |
| Per-tenant ADMIN_TOKEN for Worker auth injection | the private control-plane repo | Deployed |
| Auto-updater cron on tenant EC2s (Option B) | the private control-plane repo | Deployed |
| Phase 33.2: stop creating per-tenant DNS records | the private control-plane repo | Deployed |
| Provisioning status page (progress bar + ETA) | molecule-app | Deployed to Vercel |
| Delete org button with type-to-confirm | molecule-app | Deployed to Vercel |
| Remove admin section from SaaS app | molecule-app | Deployed to Vercel |
Monorepo PRs merged (by me)
| PR | Title |
|---|---|
| #584 | TenantGuard same-origin bypass for EC2 tenant Canvas |
| #585 | Remove Fly registry from publish pipeline |
| #586 | Remove brand-monitor from monorepo |
| #587 | 5 Canvas UX fixes (error handling, a11y, loading state) |
| #588 | Hermes + gemini-cli deploy preflight required keys |
| #589 | Ecosystem-watch MAF v1.0 update |
| #646 | Migration TEXT→UUID FK type mismatch (critical E2E unblock) |
| #751 | A2A topology overlay |
| #771 | mcp-eval quality gate |
| #843 | pgvector migration DO block guard (critical E2E unblock) |
Monorepo PRs merged (by other agents, reviewed by me)
#601, #602, #606, #610, #611, #612, #627, #629, #630, #639, #640, #641, #644, #645, #650, #655, #656, #659, #669, #764, #784, #785, #791, #793, #794, #796, #797, #798, #803, #808 — 30+ PRs total.
Issues filed
| Issue | Title |
|---|---|
| #590 | AG-UI compatible SSE endpoint (implemented in #601) |
| #591 | Per-org tool governance registry |
| #592 | Per-workspace cost transparency |
| #850 | Canvas :3000 not running on tenant EC2 (fixed) |
| #863 | Workspace boot script missing config.yaml (fixed) |
Docs created
| Doc | Purpose |
|---|---|
docs/architecture/wildcard-dns-proxy.md |
Phase 33 Cloudflare Worker architecture |
docs/architecture/tenant-image-upgrades.md |
Options A/B/C for tenant auto-upgrade |
docs/architecture/partner-api-keys.md |
Phase 34 partner/programmatic API access |
tests/e2e/test_saas_tenant.sh |
Reusable SaaS tenant smoke test |
Standalone repos created
| Repo | Purpose |
|---|---|
Molecule-AI/molecule-tenant-proxy |
Cloudflare Worker for subdomain routing |
What should NOT have been changed (but was)
1. Wildcard DNS record changed 4 times in one session
The wildcard A record for *.moleculesai.app was pointed at:
<EC2_IP>(real EC2 IP) — initial198.51.100.1(RFC 5737 TEST-NET) — Cloudflare blocked it (1003)<EC2_IP>(terminated EC2) — caused 1003 for all subdomains<EC2_IP>(another terminated EC2) — same issue<EC2_IP>(final live EC2) — current
Impact: Every subdomain queried during configs 2-4 got permanently cached as 1003 at Cloudflare's edge. Cache purge didn't help (different cache layer). These subdomains are stuck until Cloudflare's DNS routing cache expires (~24h).
Lesson: The wildcard should have pointed to a stable, always-live IP from the start. In production, this should be a dedicated proxy/load balancer IP that never changes, not an individual EC2 instance.
Follow-up: Consider using a Cloudflare Tunnel instead of a proxied A record — tunnels don't have the origin-IP-must-be-reachable requirement.
2. AdminAuth Origin bypass attempted then reverted
Attempted to add canvasOriginAllowed() to AdminAuth middleware to let
the Canvas through without a bearer token. A test (#623) correctly blocked
this — Origin is forgeable, and AdminAuth protects sensitive routes
(secrets, events, bundles).
What should have been done from the start: Per-tenant ADMIN_TOKEN (which we eventually implemented). The Origin bypass was a security shortcut that the existing test suite caught.
Current state: Reverted. ADMIN_TOKEN is the correct approach.
3. Debug code left in CP provisioner
The workspace boot script still has:
python3 -m http.server 9999debug server exposing/var/log/- Crash detection
echo "RUNTIME CRASHED"with log dump set -exshowing all commands in cloud-init console
Follow-up: Remove debug instrumentation before production. The debug server on :9999 exposes boot logs to anyone who can reach the EC2 IP.
4. GHCR auth removed then re-added
Removed docker login from tenant boot script (assuming public GHCR),
then had to re-add it when the package couldn't be made public (linked
to private repo). Wasted one provisioning cycle.
5. DB rows deleted manually via psql
Multiple times during testing, org/instance rows were deleted directly
via psql instead of going through the proper DELETE /cp/orgs/:slug
cascade. This left orphaned EC2 instances running (costing money) and
skipped the GDPR purge audit trail.
Lesson: Always use the API for deletions. The cascade handles EC2 termination + DNS cleanup + audit logging.
Security concerns to address
CRITICAL
-
#756 — X-Workspace-ID header forge bypasses CanCommunicate Any workspace can reach any other workspace by setting
X-Workspace-ID: system:anything. Complete access control bypass. Fix options proposed, awaiting CEO design decision. -
#757 — GLOBAL memory poisoning Root workspaces can inject persistent prompt injection into all agents via GLOBAL memory scope. Mitigations proposed, awaiting CEO decision.
HIGH
-
ADMIN_TOKEN in plaintext in org_instances table The per-tenant ADMIN_TOKEN is stored unencrypted in the CP database. Should be encrypted with the envelope key like other secrets.
-
ADMIN_TOKEN exposed via
/cp/orgs/:slug/instancepublic endpoint The Worker's routing endpoint returns the admin_token in plaintext. This endpoint is public (no auth). Anyone who knows the slug can get the admin token and access all AdminAuth-protected routes. Fix: Remove admin_token from the public response. Store it in Worker KV at provision time instead. -
Debug HTTP server on workspace EC2 port 9999 Exposes boot logs (may contain secrets in env exports) to anyone who can reach the EC2 IP. Must be removed before production.
-
set -exin boot scripts Shows all commands including secret values in cloud-init console output. EC2 console output is accessible via AWS API.
MEDIUM
-
Workspace EC2 security group allows all inbound Should restrict to: Cloudflare IPs (for Worker proxying), tenant EC2 IP (for direct platform communication), SSH from admin IP only.
-
No HTTPS between Worker and EC2 Worker connects to EC2 on
http://IP:8080(plain HTTP). Traffic crosses the public internet unencrypted. Should use a tunnel or at minimum restrict to VPC.
What needs proper workflow
1. Workspace registration not working
Workspace EC2s boot, start the A2A server on :8000, but never register
with the tenant platform (POST /registry/register). The workspace stays
at "provisioning" status forever on the Canvas.
Root cause: The boot script starts molecule-runtime which handles
registration, but the runtime may not have the workspace auth token
needed for registration. The token is issued by the tenant platform
after the CP provision call, but it's not passed to the workspace EC2.
Fix needed: Pass the workspace auth token in the boot script env, or have the runtime request a token at startup.
2. Workspace boot time (9 min cold start)
The workspace EC2 boot sequence:
apt-get update + install(~2 min)python3 -m venv + pip install molecule-ai-workspace-runtime(~2 min)git clone adapter repo + pip install adapter deps(~2 min)- Runtime initialization (~2-3 min)
Fix: Pre-baked AMIs per runtime (tracked in project_ami_pipeline.md).
Each AMI has all deps pre-installed. Boot reduces to ~30s.
3. CI blocked by go.mod replace directive
PR #900 fixes replace github.com/...plugin... => /plugin which breaks
native Go builds. The replace is needed only in Docker builds where the
plugin is COPYed to /plugin. Fix: add replace at Docker build time via
RUN echo 'replace ...' >> go.mod.
4. Cloudflare edge cache poisoning
Changing the wildcard A record origin IP causes all previously-queried subdomains to cache the 1003 error for hours. HTTP cache purge doesn't clear DNS routing cache.
Fix for production: Use a stable origin IP (dedicated proxy) or Cloudflare Tunnel. Never change the wildcard origin IP in production.
Tests needed
Automated (add to CI)
- Workspace EC2 boot script integration test (mock EC2, verify user-data contains config.yaml, adapter clone, env vars)
- CP workspace provision handler test (verify env map passthrough)
- Worker routing test (mock CP lookup, verify correct backend proxy)
- Tenant ADMIN_TOKEN validation test (verify AdminAuth accepts it)
- Provisioning status endpoint test (verify direct-IP health check)
Manual (before GA)
- Full org lifecycle: create → provision → deploy workspace → send message → get AI response → delete workspace → delete org
- Multi-org isolation: create 2 orgs, verify workspace A cannot reach workspace B
- Workspace auto-update: push new image, verify tenant picks it up within 5 min
- Org deletion cascade: verify EC2 terminated, DNS cleaned, DB purged, audit trail written
- Browser E2E: Canvas loads, onboarding wizard works, deploy template prompts for API key, workspace comes online, chat works
Security (before GA)
- Fix #756 (X-Workspace-ID forge) — complete access control bypass
- Fix #757 (GLOBAL memory poisoning)
- Remove ADMIN_TOKEN from public
/instanceendpoint - Encrypt ADMIN_TOKEN in DB
- Remove debug server (:9999) from workspace boot script
- Remove
set -exfrom boot scripts (leaks secrets to console) - Restrict workspace EC2 security group
- Add HTTPS between Worker and EC2 (or use tunnel)