docs: session retrospective + Phase 35 hardening plan

Full retrospective of the 2026-04-16/17 SaaS buildout session:
- What was done (infra migration, 40+ PRs, 5 issues, 4 docs, 1 new repo)
- What should NOT have been changed (wildcard DNS churn, AdminAuth shortcut)
- Security concerns (8 items, 2 CRITICAL)
- Workflow gaps (registration, boot time, CI)
- Tests needed (automated + manual + security)

Phase 35 in PLAN.md covers production hardening follow-ups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:08:39 -07:00 · da0be04a19
commit da0be04a19
parent eabca3679e
2 changed files with 315 additions and 0 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -622,6 +622,56 @@ self-hosted per-customer). Ordered by dependency + ROI.

 ---

+## Phase 35: SaaS Production Hardening (post-2026-04-17 retrospective)
+
+> **Goal:** Address security gaps, remove debug code, fix workspace
+> registration, and reduce boot time identified during the SaaS buildout
+> session. See `docs/retrospectives/2026-04-17-saas-buildout.md` for full
+> context.
+
+### Phase 35.1 — Security (CRITICAL, before any public launch)
+
+- [ ] Fix #756 — X-Workspace-ID header forge bypasses CanCommunicate
+  (derive callerID from authenticated token, not raw header)
+- [ ] Fix #757 — GLOBAL memory poisoning mitigations (content delimiters
+  + audit log at minimum)
+- [ ] Remove ADMIN_TOKEN from public `/cp/orgs/:slug/instance` endpoint —
+  store in Worker KV at provision time instead
+- [ ] Encrypt ADMIN_TOKEN in `org_instances` table (use envelope key)
+- [ ] Remove debug HTTP server (:9999) from workspace boot script
+- [ ] Remove `set -ex` from boot scripts (leaks env vars to EC2 console)
+- [ ] Restrict workspace EC2 security group (Cloudflare IPs + tenant IP only)
+- [ ] Add HTTPS between Worker and EC2 (or Cloudflare Tunnel)
+
+### Phase 35.2 — Workspace registration fix
+
+- [ ] Pass workspace auth token in EC2 boot script env so runtime can
+  register with `POST /registry/register`
+- [ ] Or: have runtime request a token at startup via
+  `GET /admin/workspaces/:id/test-token`
+- [ ] Verify workspace status flips to "online" on Canvas after boot
+- [ ] Test full Canvas flow: deploy → STARTING → online → chat works
+
+### Phase 35.3 — Boot time optimization
+
+- [ ] Pre-baked AMI per runtime (Packer or EC2 Image Builder):
+  - `ami-hermes`: Python + openai + anthropic + molecule-runtime + hermes adapter
+  - `ami-claude-code`: Node + claude-code SDK + molecule-runtime
+  - `ami-langgraph`: Python + langchain + langgraph + molecule-runtime
+- [ ] Runtime switch = launch from different AMI. Boot ~30s vs current ~9 min
+- [ ] Remove apt-get + pip install from boot script (only config + secrets + start)
+
+### Phase 35.4 — Stability + CI
+
+- [ ] Fix go.mod replace directive (PR #900) — unblocks all CI
+- [ ] Use stable origin IP for wildcard DNS (dedicated proxy or Tunnel)
+- [ ] Add workspace boot integration test to CI
+- [ ] Add SaaS tenant smoke test (`tests/e2e/test_saas_tenant.sh`) to CI
+- [ ] Clean up Cloudflare edge cache poisoning from session
+  (or wait ~24h for natural expiry)
+
+---
+
 ## Infra footnote — Temporal

 `docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web
--- a/docs/retrospectives/2026-04-17-saas-buildout.md
+++ b/docs/retrospectives/2026-04-17-saas-buildout.md
@@ -0,0 +1,265 @@
+# Session Retrospective: 2026-04-16/17 SaaS Buildout
+
+> **Duration:** ~24 hours (overnight autonomous + daytime interactive)
+> **Scope:** Full SaaS infrastructure migration + E2E workspace provisioning
+> **Status:** Platform API 17/17 pass, workspace A2A confirmed working,
+> multiple issues remain for production readiness
+
+---
+
+## What was done
+
+### Infrastructure migration (Fly.io → Railway + EC2)
+
+| Change | Repo | Status |
+|--------|------|--------|
+| Railway deployment for control plane | molecule-controlplane | Deployed, auto-deploy on push |
+| EC2 provisioner for tenants (Postgres + Redis + Platform in Docker) | molecule-controlplane | Deployed |
+| EC2 provisioner for workspaces (pip install runtime at boot) | molecule-controlplane | Deployed, 9 min cold start |
+| Cloudflare Worker for wildcard subdomain routing | molecule-tenant-proxy (new repo) | Deployed |
+| Wildcard DNS `*.moleculesai.app` → Worker | Cloudflare dashboard | Done |
+| Per-tenant ADMIN_TOKEN for Worker auth injection | molecule-controlplane | Deployed |
+| Auto-updater cron on tenant EC2s (Option B) | molecule-controlplane | Deployed |
+| Phase 33.2: stop creating per-tenant DNS records | molecule-controlplane | Deployed |
+| Provisioning status page (progress bar + ETA) | molecule-app | Deployed to Vercel |
+| Delete org button with type-to-confirm | molecule-app | Deployed to Vercel |
+| Remove admin section from SaaS app | molecule-app | Deployed to Vercel |
+
+### Monorepo PRs merged (by me)
+
+| PR | Title |
+|----|-------|
+| #584 | TenantGuard same-origin bypass for EC2 tenant Canvas |
+| #585 | Remove Fly registry from publish pipeline |
+| #586 | Remove brand-monitor from monorepo |
+| #587 | 5 Canvas UX fixes (error handling, a11y, loading state) |
+| #588 | Hermes + gemini-cli deploy preflight required keys |
+| #589 | Ecosystem-watch MAF v1.0 update |
+| #646 | Migration TEXT→UUID FK type mismatch (critical E2E unblock) |
+| #751 | A2A topology overlay |
+| #771 | mcp-eval quality gate |
+| #843 | pgvector migration DO block guard (critical E2E unblock) |
+
+### Monorepo PRs merged (by other agents, reviewed by me)
+
+#601, #602, #606, #610, #611, #612, #627, #629, #630, #639, #640, #641,
+#644, #645, #650, #655, #656, #659, #669, #764, #784, #785, #791, #793,
+#794, #796, #797, #798, #803, #808 — 30+ PRs total.
+
+### Issues filed
+
+| Issue | Title |
+|-------|-------|
+| #590 | AG-UI compatible SSE endpoint (implemented in #601) |
+| #591 | Per-org tool governance registry |
+| #592 | Per-workspace cost transparency |
+| #850 | Canvas :3000 not running on tenant EC2 (fixed) |
+| #863 | Workspace boot script missing config.yaml (fixed) |
+
+### Docs created
+
+| Doc | Purpose |
+|-----|---------|
+| `docs/architecture/wildcard-dns-proxy.md` | Phase 33 Cloudflare Worker architecture |
+| `docs/architecture/tenant-image-upgrades.md` | Options A/B/C for tenant auto-upgrade |
+| `docs/architecture/partner-api-keys.md` | Phase 34 partner/programmatic API access |
+| `tests/e2e/test_saas_tenant.sh` | Reusable SaaS tenant smoke test |
+
+### Standalone repos created
+
+| Repo | Purpose |
+|------|---------|
+| `Molecule-AI/molecule-tenant-proxy` | Cloudflare Worker for subdomain routing |
+
+---
+
+## What should NOT have been changed (but was)
+
+### 1. Wildcard DNS record changed 4 times in one session
+
+The wildcard A record for `*.moleculesai.app` was pointed, in order, at:
+1. `18.220.182.88` (real EC2 IP): initial value
+2. `198.51.100.1` (RFC 5737 TEST-NET-2 documentation address): Cloudflare blocked it (1003)
+3. `3.16.109.132` (terminated EC2): caused 1003 for all subdomains
+4. `3.143.250.95` (another terminated EC2): same issue
+5. `3.131.96.216` (final live EC2): current
+
+**Impact:** Every subdomain queried during configs 2-4 got cached as
+error 1003 at Cloudflare's edge. An HTTP cache purge didn't help (this is
+a different cache layer), so these subdomains stay stuck until
+Cloudflare's DNS routing cache expires (~24h).
+
+**Lesson:** The wildcard should have pointed to a **stable, always-live IP**
+from the start. In production, this should be a dedicated proxy/load
+balancer IP that never changes, not an individual EC2 instance.
+
+**Follow-up:** Consider using a Cloudflare Tunnel instead of a proxied A
+record — tunnels don't have the origin-IP-must-be-reachable requirement.
+
+### 2. AdminAuth Origin bypass attempted then reverted
+
+Attempted to add `canvasOriginAllowed()` to `AdminAuth` middleware to let
+the Canvas through without a bearer token. A test (#623) correctly blocked
+this — Origin is forgeable, and AdminAuth protects sensitive routes
+(secrets, events, bundles).
+
+**What should have been done from the start:** Per-tenant ADMIN_TOKEN
+(which we eventually implemented). The Origin bypass was a security
+shortcut that the existing test suite caught.
+
+**Current state:** Reverted. ADMIN_TOKEN is the correct approach.
+
+### 3. Debug code left in CP provisioner
+
+The workspace boot script still has:
+- `python3 -m http.server 9999` debug server exposing `/var/log/`
+- Crash detection `echo "RUNTIME CRASHED"` with log dump
+- `set -ex` showing all commands in cloud-init console
+
+**Follow-up:** Remove debug instrumentation before production. The debug
+server on :9999 exposes boot logs to anyone who can reach the EC2 IP.
+
+### 4. GHCR auth removed then re-added
+
+Removed `docker login` from tenant boot script (assuming public GHCR),
+then had to re-add it when the package couldn't be made public (linked
+to private repo). Wasted one provisioning cycle.
+
+### 5. DB rows deleted manually via psql
+
+Multiple times during testing, org/instance rows were deleted directly
+via psql instead of going through the proper `DELETE /cp/orgs/:slug`
+cascade. This left orphaned EC2 instances running (costing money) and
+skipped the GDPR purge audit trail.
+
+**Lesson:** Always use the API for deletions. The cascade handles EC2
+termination + DNS cleanup + audit logging.
+
+---
+
+## Security concerns to address
+
+### CRITICAL
+
+1. **#756 — X-Workspace-ID header forge bypasses CanCommunicate**
+   Any workspace can reach any other workspace by setting
+   `X-Workspace-ID: system:anything`. Complete access control bypass.
+   Fix options proposed, awaiting CEO design decision.
+
+2. **#757 — GLOBAL memory poisoning**
+   Root workspaces can inject persistent prompt injection into all agents
+   via GLOBAL memory scope. Mitigations proposed, awaiting CEO decision.
+
+### HIGH
+
+3. **ADMIN_TOKEN in plaintext in org_instances table**
+   The per-tenant ADMIN_TOKEN is stored unencrypted in the CP database.
+   Should be encrypted with the envelope key like other secrets.
+
+4. **ADMIN_TOKEN exposed via `/cp/orgs/:slug/instance` public endpoint**
+   The Worker's routing endpoint returns the admin_token in plaintext.
+   This endpoint is public (no auth). Anyone who knows the slug can get
+   the admin token and access all AdminAuth-protected routes.
+   **Fix:** Remove admin_token from the public response. Store it in
+   Worker KV at provision time instead.
+
+5. **Debug HTTP server on workspace EC2 port 9999**
+   Exposes boot logs (may contain secrets in env exports) to anyone
+   who can reach the EC2 IP. Must be removed before production.
+
+6. **`set -ex` in boot scripts**
+   Shows all commands including secret values in cloud-init console
+   output. EC2 console output is accessible via AWS API.
+
+### MEDIUM
+
+7. **Workspace EC2 security group allows all inbound**
+   Should restrict to: Cloudflare IPs (for Worker proxying), tenant
+   EC2 IP (for direct platform communication), SSH from admin IP only.
+
+8. **No HTTPS between Worker and EC2**
+   Worker connects to EC2 on `http://IP:8080` (plain HTTP). Traffic
+   crosses the public internet unencrypted. Should use a tunnel or
+   at minimum restrict to VPC.
+
+---
+
+## What needs proper workflow
+
+### 1. Workspace registration not working
+
+Workspace EC2s boot, start the A2A server on :8000, but never register
+with the tenant platform (`POST /registry/register`). The workspace stays
+at "provisioning" status forever on the Canvas.
+
+**Likely root cause:** The boot script starts `molecule-runtime`, which
+handles registration, but the runtime may not have the workspace auth
+token that registration requires. The tenant platform issues the token
+after the CP provision call, but it is never passed to the workspace EC2.
+
+**Fix needed:** Pass the workspace auth token in the boot script env,
+or have the runtime request a token at startup.
+
+### 2. Workspace boot time (9 min cold start)
+
+The workspace EC2 boot sequence:
+- `apt-get update + install` (~2 min)
+- `python3 -m venv + pip install molecule-ai-workspace-runtime` (~2 min)
+- `git clone adapter repo + pip install adapter deps` (~2 min)
+- Runtime initialization (~2-3 min)
+
+**Fix:** Pre-baked AMIs per runtime (tracked in `project_ami_pipeline.md`).
+Each AMI has all deps pre-installed. Boot reduces to ~30s.
+
+### 3. CI blocked by go.mod replace directive
+
+PR #900 fixes the `replace github.com/...plugin... => /plugin` directive,
+which breaks native Go builds. The replace is needed only in Docker builds,
+where the plugin is COPYed to `/plugin`. Fix: add the replace at Docker
+build time via `RUN echo 'replace ...' >> go.mod`.
+
+### 4. Cloudflare edge cache poisoning
+
+Changing the wildcard A record origin IP causes all previously-queried
+subdomains to cache the 1003 error for hours. HTTP cache purge doesn't
+clear DNS routing cache.
+
+**Fix for production:** Use a stable origin IP (dedicated proxy) or
+Cloudflare Tunnel. Never change the wildcard origin IP in production.
+
+---
+
+## Tests needed
+
+### Automated (add to CI)
+
+- [ ] Workspace EC2 boot script integration test (mock EC2, verify
+  user-data contains config.yaml, adapter clone, env vars)
+- [ ] CP workspace provision handler test (verify env map passthrough)
+- [ ] Worker routing test (mock CP lookup, verify correct backend proxy)
+- [ ] Tenant ADMIN_TOKEN validation test (verify AdminAuth accepts it)
+- [ ] Provisioning status endpoint test (verify direct-IP health check)
+
+### Manual (before GA)
+
+- [ ] Full org lifecycle: create → provision → deploy workspace →
+  send message → get AI response → delete workspace → delete org
+- [ ] Multi-org isolation: create 2 orgs, verify workspace A cannot
+  reach workspace B
+- [ ] Workspace auto-update: push new image, verify tenant picks it up
+  within 5 min
+- [ ] Org deletion cascade: verify EC2 terminated, DNS cleaned, DB
+  purged, audit trail written
+- [ ] Browser E2E: Canvas loads, onboarding wizard works, deploy
+  template prompts for API key, workspace comes online, chat works
+
+### Security (before GA)
+
+- [ ] Fix #756 (X-Workspace-ID forge) — complete access control bypass
+- [ ] Fix #757 (GLOBAL memory poisoning)
+- [ ] Remove ADMIN_TOKEN from public `/instance` endpoint
+- [ ] Encrypt ADMIN_TOKEN in DB
+- [ ] Remove debug server (:9999) from workspace boot script
+- [ ] Remove `set -ex` from boot scripts (leaks secrets to console)
+- [ ] Restrict workspace EC2 security group
+- [ ] Add HTTPS between Worker and EC2 (or use tunnel)