diff --git a/PLAN.md b/PLAN.md index 10c37359..d3217943 100644 --- a/PLAN.md +++ b/PLAN.md @@ -622,6 +622,56 @@ self-hosted per-customer). Ordered by dependency + ROI. --- +## Phase 35: SaaS Production Hardening (post-2026-04-17 retrospective) + +> **Goal:** Address security gaps, remove debug code, fix workspace +> registration, and reduce boot time identified during the SaaS buildout +> session. See `docs/retrospectives/2026-04-17-saas-buildout.md` for full +> context. + +### Phase 35.1 — Security (CRITICAL, before any public launch) + +- [ ] Fix #756 — X-Workspace-ID header forge bypasses CanCommunicate + (derive callerID from authenticated token, not raw header) +- [ ] Fix #757 — GLOBAL memory poisoning mitigations (content delimiters + + audit log at minimum) +- [ ] Remove ADMIN_TOKEN from public `/cp/orgs/:slug/instance` endpoint — + store in Worker KV at provision time instead +- [ ] Encrypt ADMIN_TOKEN in `org_instances` table (use envelope key) +- [ ] Remove debug HTTP server (:9999) from workspace boot script +- [ ] Remove `set -ex` from boot scripts (leaks env vars to EC2 console) +- [ ] Restrict workspace EC2 security group (Cloudflare IPs + tenant IP only) +- [ ] Add HTTPS between Worker and EC2 (or Cloudflare Tunnel) + +### Phase 35.2 — Workspace registration fix + +- [ ] Pass workspace auth token in EC2 boot script env so runtime can + register with `POST /registry/register` +- [ ] Or: have runtime request a token at startup via + `GET /admin/workspaces/:id/test-token` +- [ ] Verify workspace status flips to "online" on Canvas after boot +- [ ] Test full Canvas flow: deploy → STARTING → online → chat works + +### Phase 35.3 — Boot time optimization + +- [ ] Pre-baked AMI per runtime (Packer or EC2 Image Builder): + - `ami-hermes`: Python + openai + anthropic + molecule-runtime + hermes adapter + - `ami-claude-code`: Node + claude-code SDK + molecule-runtime + - `ami-langgraph`: Python + langchain + langgraph + molecule-runtime +- [ ] Runtime switch = launch from different AMI. Boot ~30s vs current ~9 min +- [ ] Remove apt-get + pip install from boot script (only config + secrets + start) + +### Phase 35.4 — Stability + CI + +- [ ] Fix go.mod replace directive (PR #900) — unblocks all CI +- [ ] Use stable origin IP for wildcard DNS (dedicated proxy or Tunnel) +- [ ] Add workspace boot integration test to CI +- [ ] Add SaaS tenant smoke test (`tests/e2e/test_saas_tenant.sh`) to CI +- [ ] Clean up Cloudflare edge cache poisoning from session + (or wait ~24h for natural expiry) + +--- + ## Infra footnote — Temporal `docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web diff --git a/docs/retrospectives/2026-04-17-saas-buildout.md b/docs/retrospectives/2026-04-17-saas-buildout.md new file mode 100644 index 00000000..c851367a --- /dev/null +++ b/docs/retrospectives/2026-04-17-saas-buildout.md @@ -0,0 +1,265 @@ +# Session Retrospective: 2026-04-16/17 SaaS Buildout + +> **Duration:** ~24 hours (overnight autonomous + daytime interactive) +> **Scope:** Full SaaS infrastructure migration + E2E workspace provisioning +> **Status:** Platform API 17/17 pass, workspace A2A confirmed working, +> multiple issues remain for production readiness + +--- + +## What was done + +### Infrastructure migration (Fly.io → Railway + EC2) + +| Change | Repo | Status | +|--------|------|--------| +| Railway deployment for control plane | molecule-controlplane | Deployed, auto-deploy on push | +| EC2 provisioner for tenants (Postgres + Redis + Platform in Docker) | molecule-controlplane | Deployed | +| EC2 provisioner for workspaces (pip install runtime at boot) | molecule-controlplane | Deployed, 9 min cold start | +| Cloudflare Worker for wildcard subdomain routing | molecule-tenant-proxy (new repo) | Deployed | +| Wildcard DNS `*.moleculesai.app` → Worker | Cloudflare dashboard | Done | +| Per-tenant ADMIN_TOKEN for Worker auth injection | molecule-controlplane | Deployed | +| Auto-updater cron on tenant EC2s (Option B) | molecule-controlplane | Deployed | +| Phase 33.2: stop creating per-tenant DNS records | molecule-controlplane | Deployed | +| Provisioning status page (progress bar + ETA) | molecule-app | Deployed to Vercel | +| Delete org button with type-to-confirm | molecule-app | Deployed to Vercel | +| Remove admin section from SaaS app | molecule-app | Deployed to Vercel | + +### Monorepo PRs merged (by me) + +| PR | Title | +|----|-------| +| #584 | TenantGuard same-origin bypass for EC2 tenant Canvas | +| #585 | Remove Fly registry from publish pipeline | +| #586 | Remove brand-monitor from monorepo | +| #587 | 5 Canvas UX fixes (error handling, a11y, loading state) | +| #588 | Hermes + gemini-cli deploy preflight required keys | +| #589 | Ecosystem-watch MAF v1.0 update | +| #646 | Migration TEXT→UUID FK type mismatch (critical E2E unblock) | +| #751 | A2A topology overlay | +| #771 | mcp-eval quality gate | +| #843 | pgvector migration DO block guard (critical E2E unblock) | + +### Monorepo PRs merged (by other agents, reviewed by me) + +#601, #602, #606, #610, #611, #612, #627, #629, #630, #639, #640, #641, +#644, #645, #650, #655, #656, #659, #669, #764, #784, #785, #791, #793, +#794, #796, #797, #798, #803, #808 — 30+ PRs total. + +### Issues filed + +| Issue | Title | +|-------|-------| +| #590 | AG-UI compatible SSE endpoint (implemented in #601) | +| #591 | Per-org tool governance registry | +| #592 | Per-workspace cost transparency | +| #850 | Canvas :3000 not running on tenant EC2 (fixed) | +| #863 | Workspace boot script missing config.yaml (fixed) | + +### Docs created + +| Doc | Purpose | +|-----|---------| +| `docs/architecture/wildcard-dns-proxy.md` | Phase 33 Cloudflare Worker architecture | +| `docs/architecture/tenant-image-upgrades.md` | Options A/B/C for tenant auto-upgrade | +| `docs/architecture/partner-api-keys.md` | Phase 34 partner/programmatic API access | +| `tests/e2e/test_saas_tenant.sh` | Reusable SaaS tenant smoke test | + +### Standalone repos created + +| Repo | Purpose | +|------|---------| +| `Molecule-AI/molecule-tenant-proxy` | Cloudflare Worker for subdomain routing | + +--- + +## What should NOT have been changed (but was) + +### 1. Wildcard DNS record changed 4 times in one session + +The wildcard A record for `*.moleculesai.app` was pointed at: +1. `18.220.182.88` (real EC2 IP) — initial +2. `198.51.100.1` (RFC 5737 TEST-NET) — Cloudflare blocked it (1003) +3. `3.16.109.132` (terminated EC2) — caused 1003 for all subdomains +4. `3.143.250.95` (another terminated EC2) — same issue +5. `3.131.96.216` (final live EC2) — current + +**Impact:** Every subdomain queried during configs 2-4 got permanently +cached as 1003 at Cloudflare's edge. Cache purge didn't help (different +cache layer). These subdomains are stuck until Cloudflare's DNS routing +cache expires (~24h). + +**Lesson:** The wildcard should have pointed to a **stable, always-live IP** +from the start. In production, this should be a dedicated proxy/load +balancer IP that never changes, not an individual EC2 instance. + +**Follow-up:** Consider using a Cloudflare Tunnel instead of a proxied A +record — tunnels don't have the origin-IP-must-be-reachable requirement. + +### 2. AdminAuth Origin bypass attempted then reverted + +Attempted to add `canvasOriginAllowed()` to `AdminAuth` middleware to let +the Canvas through without a bearer token. A test (#623) correctly blocked +this — Origin is forgeable, and AdminAuth protects sensitive routes +(secrets, events, bundles). + +**What should have been done from the start:** Per-tenant ADMIN_TOKEN +(which we eventually implemented). The Origin bypass was a security +shortcut that the existing test suite caught. + +**Current state:** Reverted. ADMIN_TOKEN is the correct approach. + +### 3. Debug code left in CP provisioner + +The workspace boot script still has: +- `python3 -m http.server 9999` debug server exposing `/var/log/` +- Crash detection `echo "RUNTIME CRASHED"` with log dump +- `set -ex` showing all commands in cloud-init console + +**Follow-up:** Remove debug instrumentation before production. The debug +server on :9999 exposes boot logs to anyone who can reach the EC2 IP. + +### 4. GHCR auth removed then re-added + +Removed `docker login` from tenant boot script (assuming public GHCR), +then had to re-add it when the package couldn't be made public (linked +to private repo). Wasted one provisioning cycle. + +### 5. DB rows deleted manually via psql + +Multiple times during testing, org/instance rows were deleted directly +via psql instead of going through the proper `DELETE /cp/orgs/:slug` +cascade. This left orphaned EC2 instances running (costing money) and +skipped the GDPR purge audit trail. + +**Lesson:** Always use the API for deletions. The cascade handles EC2 +termination + DNS cleanup + audit logging. + +--- + +## Security concerns to address + +### CRITICAL + +1. **#756 — X-Workspace-ID header forge bypasses CanCommunicate** + Any workspace can reach any other workspace by setting + `X-Workspace-ID: system:anything`. Complete access control bypass. + Fix options proposed, awaiting CEO design decision. + +2. **#757 — GLOBAL memory poisoning** + Root workspaces can inject persistent prompt injection into all agents + via GLOBAL memory scope. Mitigations proposed, awaiting CEO decision. + +### HIGH + +3. **ADMIN_TOKEN in plaintext in org_instances table** + The per-tenant ADMIN_TOKEN is stored unencrypted in the CP database. + Should be encrypted with the envelope key like other secrets. + +4. **ADMIN_TOKEN exposed via `/cp/orgs/:slug/instance` public endpoint** + The Worker's routing endpoint returns the admin_token in plaintext. + This endpoint is public (no auth). Anyone who knows the slug can get + the admin token and access all AdminAuth-protected routes. + **Fix:** Remove admin_token from the public response. Store it in + Worker KV at provision time instead. + +5. **Debug HTTP server on workspace EC2 port 9999** + Exposes boot logs (may contain secrets in env exports) to anyone + who can reach the EC2 IP. Must be removed before production. + +6. **`set -ex` in boot scripts** + Shows all commands including secret values in cloud-init console + output. EC2 console output is accessible via AWS API. + +### MEDIUM + +7. **Workspace EC2 security group allows all inbound** + Should restrict to: Cloudflare IPs (for Worker proxying), tenant + EC2 IP (for direct platform communication), SSH from admin IP only. + +8. **No HTTPS between Worker and EC2** + Worker connects to EC2 on `http://IP:8080` (plain HTTP). Traffic + crosses the public internet unencrypted. Should use a tunnel or + at minimum restrict to VPC. + +--- + +## What needs proper workflow + +### 1. Workspace registration not working + +Workspace EC2s boot, start the A2A server on :8000, but never register +with the tenant platform (`POST /registry/register`). The workspace stays +at "provisioning" status forever on the Canvas. + +**Root cause:** The boot script starts `molecule-runtime` which handles +registration, but the runtime may not have the workspace auth token +needed for registration. The token is issued by the tenant platform +after the CP provision call, but it's not passed to the workspace EC2. + +**Fix needed:** Pass the workspace auth token in the boot script env, +or have the runtime request a token at startup. + +### 2. Workspace boot time (9 min cold start) + +The workspace EC2 boot sequence: +- `apt-get update + install` (~2 min) +- `python3 -m venv + pip install molecule-ai-workspace-runtime` (~2 min) +- `git clone adapter repo + pip install adapter deps` (~2 min) +- Runtime initialization (~2-3 min) + +**Fix:** Pre-baked AMIs per runtime (tracked in `project_ami_pipeline.md`). +Each AMI has all deps pre-installed. Boot reduces to ~30s. + +### 3. CI blocked by go.mod replace directive + +PR #900 fixes `replace github.com/...plugin... => /plugin` which breaks +native Go builds. The replace is needed only in Docker builds where the +plugin is COPYed to `/plugin`. Fix: add replace at Docker build time via +`RUN echo 'replace ...' >> go.mod`. + +### 4. Cloudflare edge cache poisoning + +Changing the wildcard A record origin IP causes all previously-queried +subdomains to cache the 1003 error for hours. HTTP cache purge doesn't +clear DNS routing cache. + +**Fix for production:** Use a stable origin IP (dedicated proxy) or +Cloudflare Tunnel. Never change the wildcard origin IP in production. + +--- + +## Tests needed + +### Automated (add to CI) + +- [ ] Workspace EC2 boot script integration test (mock EC2, verify + user-data contains config.yaml, adapter clone, env vars) +- [ ] CP workspace provision handler test (verify env map passthrough) +- [ ] Worker routing test (mock CP lookup, verify correct backend proxy) +- [ ] Tenant ADMIN_TOKEN validation test (verify AdminAuth accepts it) +- [ ] Provisioning status endpoint test (verify direct-IP health check) + +### Manual (before GA) + +- [ ] Full org lifecycle: create → provision → deploy workspace → + send message → get AI response → delete workspace → delete org +- [ ] Multi-org isolation: create 2 orgs, verify workspace A cannot + reach workspace B +- [ ] Workspace auto-update: push new image, verify tenant picks it up + within 5 min +- [ ] Org deletion cascade: verify EC2 terminated, DNS cleaned, DB + purged, audit trail written +- [ ] Browser E2E: Canvas loads, onboarding wizard works, deploy + template prompts for API key, workspace comes online, chat works + +### Security (before GA) + +- [ ] Fix #756 (X-Workspace-ID forge) — complete access control bypass +- [ ] Fix #757 (GLOBAL memory poisoning) +- [ ] Remove ADMIN_TOKEN from public `/instance` endpoint +- [ ] Encrypt ADMIN_TOKEN in DB +- [ ] Remove debug server (:9999) from workspace boot script +- [ ] Remove `set -ex` from boot scripts (leaks secrets to console) +- [ ] Restrict workspace EC2 security group +- [ ] Add HTTPS between Worker and EC2 (or use tunnel)