docs: session retrospective + Phase 35 hardening plan

Full retrospective of the 2026-04-16/17 SaaS buildout session:
- What was done (infra migration, 40+ PRs, 5 issues, 4 docs, 1 new repo)
- What should NOT have been changed (wildcard DNS churn, AdminAuth shortcut)
- Security concerns (8 items, 2 CRITICAL)
- Workflow gaps (registration, boot time, CI)
- Tests needed (automated + manual + security)

Phase 35 in PLAN.md covers production hardening follow-ups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Hongming Wang 2026-04-17 20:08:39 -07:00
parent eabca3679e
commit da0be04a19
2 changed files with 315 additions and 0 deletions

50
PLAN.md
View File

@ -622,6 +622,56 @@ self-hosted per-customer). Ordered by dependency + ROI.
---
## Phase 35: SaaS Production Hardening (post-2026-04-17 retrospective)
> **Goal:** Address security gaps, remove debug code, fix workspace
> registration, and reduce boot time identified during the SaaS buildout
> session. See `docs/retrospectives/2026-04-17-saas-buildout.md` for full
> context.
### Phase 35.1 — Security (CRITICAL, before any public launch)
- [ ] Fix #756 — X-Workspace-ID header forge bypasses CanCommunicate
(derive callerID from authenticated token, not raw header)
- [ ] Fix #757 — GLOBAL memory poisoning mitigations (content delimiters
+ audit log at minimum)
- [ ] Remove ADMIN_TOKEN from public `/cp/orgs/:slug/instance` endpoint —
store in Worker KV at provision time instead
- [ ] Encrypt ADMIN_TOKEN in `org_instances` table (use envelope key)
- [ ] Remove debug HTTP server (:9999) from workspace boot script
- [ ] Remove `set -ex` from boot scripts (leaks env vars to EC2 console)
- [ ] Restrict workspace EC2 security group (Cloudflare IPs + tenant IP only)
- [ ] Add HTTPS between Worker and EC2 (or Cloudflare Tunnel)
### Phase 35.2 — Workspace registration fix
- [ ] Pass workspace auth token in EC2 boot script env so runtime can
register with `POST /registry/register`
- [ ] Or: have runtime request a token at startup via
`GET /admin/workspaces/:id/test-token`
- [ ] Verify workspace status flips to "online" on Canvas after boot
- [ ] Test full Canvas flow: deploy → STARTING → online → chat works
### Phase 35.3 — Boot time optimization
- [ ] Pre-baked AMI per runtime (Packer or EC2 Image Builder):
- `ami-hermes`: Python + openai + anthropic + molecule-runtime + hermes adapter
- `ami-claude-code`: Node + claude-code SDK + molecule-runtime
- `ami-langgraph`: Python + langchain + langgraph + molecule-runtime
- [ ] Runtime switch = launch from different AMI. Boot ~30s vs current ~9 min
- [ ] Remove apt-get + pip install from boot script (only config + secrets + start)
### Phase 35.4 — Stability + CI
- [ ] Fix go.mod replace directive (PR #900) — unblocks all CI
- [ ] Use stable origin IP for wildcard DNS (dedicated proxy or Tunnel)
- [ ] Add workspace boot integration test to CI
- [ ] Add SaaS tenant smoke test (`tests/e2e/test_saas_tenant.sh`) to CI
- [ ] Clean up Cloudflare edge cache poisoning from session
(or wait ~24h for natural expiry)
---
## Infra footnote — Temporal
`docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web

View File

@ -0,0 +1,265 @@
# Session Retrospective: 2026-04-16/17 SaaS Buildout
> **Duration:** ~24 hours (overnight autonomous + daytime interactive)
> **Scope:** Full SaaS infrastructure migration + E2E workspace provisioning
> **Status:** Platform API 17/17 pass, workspace A2A confirmed working,
> multiple issues remain for production readiness
---
## What was done
### Infrastructure migration (Fly.io → Railway + EC2)
| Change | Repo | Status |
|--------|------|--------|
| Railway deployment for control plane | molecule-controlplane | Deployed, auto-deploy on push |
| EC2 provisioner for tenants (Postgres + Redis + Platform in Docker) | molecule-controlplane | Deployed |
| EC2 provisioner for workspaces (pip install runtime at boot) | molecule-controlplane | Deployed, 9 min cold start |
| Cloudflare Worker for wildcard subdomain routing | molecule-tenant-proxy (new repo) | Deployed |
| Wildcard DNS `*.moleculesai.app` → Worker | Cloudflare dashboard | Done |
| Per-tenant ADMIN_TOKEN for Worker auth injection | molecule-controlplane | Deployed |
| Auto-updater cron on tenant EC2s (Option B) | molecule-controlplane | Deployed |
| Phase 33.2: stop creating per-tenant DNS records | molecule-controlplane | Deployed |
| Provisioning status page (progress bar + ETA) | molecule-app | Deployed to Vercel |
| Delete org button with type-to-confirm | molecule-app | Deployed to Vercel |
| Remove admin section from SaaS app | molecule-app | Deployed to Vercel |
### Monorepo PRs merged (by me)
| PR | Title |
|----|-------|
| #584 | TenantGuard same-origin bypass for EC2 tenant Canvas |
| #585 | Remove Fly registry from publish pipeline |
| #586 | Remove brand-monitor from monorepo |
| #587 | 5 Canvas UX fixes (error handling, a11y, loading state) |
| #588 | Hermes + gemini-cli deploy preflight required keys |
| #589 | Ecosystem-watch MAF v1.0 update |
| #646 | Migration TEXT→UUID FK type mismatch (critical E2E unblock) |
| #751 | A2A topology overlay |
| #771 | mcp-eval quality gate |
| #843 | pgvector migration DO block guard (critical E2E unblock) |
### Monorepo PRs merged (by other agents, reviewed by me)
#601, #602, #606, #610, #611, #612, #627, #629, #630, #639, #640, #641,
#644, #645, #650, #655, #656, #659, #669, #764, #784, #785, #791, #793,
#794, #796, #797, #798, #803, #808 — 30+ PRs total.
### Issues filed
| Issue | Title |
|-------|-------|
| #590 | AG-UI compatible SSE endpoint (implemented in #601) |
| #591 | Per-org tool governance registry |
| #592 | Per-workspace cost transparency |
| #850 | Canvas :3000 not running on tenant EC2 (fixed) |
| #863 | Workspace boot script missing config.yaml (fixed) |
### Docs created
| Doc | Purpose |
|-----|---------|
| `docs/architecture/wildcard-dns-proxy.md` | Phase 33 Cloudflare Worker architecture |
| `docs/architecture/tenant-image-upgrades.md` | Options A/B/C for tenant auto-upgrade |
| `docs/architecture/partner-api-keys.md` | Phase 34 partner/programmatic API access |
| `tests/e2e/test_saas_tenant.sh` | Reusable SaaS tenant smoke test |
### Standalone repos created
| Repo | Purpose |
|------|---------|
| `Molecule-AI/molecule-tenant-proxy` | Cloudflare Worker for subdomain routing |
---
## What should NOT have been changed (but was)
### 1. Wildcard DNS record changed 4 times in one session
The wildcard A record for `*.moleculesai.app` was pointed at:
1. `18.220.182.88` (real EC2 IP) — initial
2. `198.51.100.1` (RFC 5737 TEST-NET) — Cloudflare blocked it (1003)
3. `3.16.109.132` (terminated EC2) — caused 1003 for all subdomains
4. `3.143.250.95` (another terminated EC2) — same issue
5. `3.131.96.216` (final live EC2) — current
**Impact:** Every subdomain queried during configs 2-4 got permanently
cached as 1003 at Cloudflare's edge. Cache purge didn't help (different
cache layer). These subdomains are stuck until Cloudflare's DNS routing
cache expires (~24h).
**Lesson:** The wildcard should have pointed to a **stable, always-live IP**
from the start. In production, this should be a dedicated proxy/load
balancer IP that never changes, not an individual EC2 instance.
**Follow-up:** Consider using a Cloudflare Tunnel instead of a proxied A
record — tunnels don't have the origin-IP-must-be-reachable requirement.
### 2. AdminAuth Origin bypass attempted then reverted
Attempted to add `canvasOriginAllowed()` to `AdminAuth` middleware to let
the Canvas through without a bearer token. A test (#623) correctly blocked
this — Origin is forgeable, and AdminAuth protects sensitive routes
(secrets, events, bundles).
**What should have been done from the start:** Per-tenant ADMIN_TOKEN
(which we eventually implemented). The Origin bypass was a security
shortcut that the existing test suite caught.
**Current state:** Reverted. ADMIN_TOKEN is the correct approach.
### 3. Debug code left in CP provisioner
The workspace boot script still has:
- `python3 -m http.server 9999` debug server exposing `/var/log/`
- Crash detection `echo "RUNTIME CRASHED"` with log dump
- `set -ex` showing all commands in cloud-init console
**Follow-up:** Remove debug instrumentation before production. The debug
server on :9999 exposes boot logs to anyone who can reach the EC2 IP.
### 4. GHCR auth removed then re-added
Removed `docker login` from tenant boot script (assuming public GHCR),
then had to re-add it when the package couldn't be made public (linked
to private repo). Wasted one provisioning cycle.
### 5. DB rows deleted manually via psql
Multiple times during testing, org/instance rows were deleted directly
via psql instead of going through the proper `DELETE /cp/orgs/:slug`
cascade. This left orphaned EC2 instances running (costing money) and
skipped the GDPR purge audit trail.
**Lesson:** Always use the API for deletions. The cascade handles EC2
termination + DNS cleanup + audit logging.
---
## Security concerns to address
### CRITICAL
1. **#756 — X-Workspace-ID header forge bypasses CanCommunicate**
Any workspace can reach any other workspace by setting
`X-Workspace-ID: system:anything`. Complete access control bypass.
Fix options proposed, awaiting CEO design decision.
2. **#757 — GLOBAL memory poisoning**
Root workspaces can inject persistent prompt injection into all agents
via GLOBAL memory scope. Mitigations proposed, awaiting CEO decision.
### HIGH
3. **ADMIN_TOKEN in plaintext in org_instances table**
The per-tenant ADMIN_TOKEN is stored unencrypted in the CP database.
Should be encrypted with the envelope key like other secrets.
4. **ADMIN_TOKEN exposed via `/cp/orgs/:slug/instance` public endpoint**
The Worker's routing endpoint returns the admin_token in plaintext.
This endpoint is public (no auth). Anyone who knows the slug can get
the admin token and access all AdminAuth-protected routes.
**Fix:** Remove admin_token from the public response. Store it in
Worker KV at provision time instead.
5. **Debug HTTP server on workspace EC2 port 9999**
Exposes boot logs (may contain secrets in env exports) to anyone
who can reach the EC2 IP. Must be removed before production.
6. **`set -ex` in boot scripts**
Shows all commands including secret values in cloud-init console
output. EC2 console output is accessible via AWS API.
### MEDIUM
7. **Workspace EC2 security group allows all inbound**
Should restrict to: Cloudflare IPs (for Worker proxying), tenant
EC2 IP (for direct platform communication), SSH from admin IP only.
8. **No HTTPS between Worker and EC2**
Worker connects to EC2 on `http://IP:8080` (plain HTTP). Traffic
crosses the public internet unencrypted. Should use a tunnel or
at minimum restrict to VPC.
---
## What needs proper workflow
### 1. Workspace registration not working
Workspace EC2s boot, start the A2A server on :8000, but never register
with the tenant platform (`POST /registry/register`). The workspace stays
at "provisioning" status forever on the Canvas.
**Root cause:** The boot script starts `molecule-runtime` which handles
registration, but the runtime may not have the workspace auth token
needed for registration. The token is issued by the tenant platform
after the CP provision call, but it's not passed to the workspace EC2.
**Fix needed:** Pass the workspace auth token in the boot script env,
or have the runtime request a token at startup.
### 2. Workspace boot time (9 min cold start)
The workspace EC2 boot sequence:
- `apt-get update + install` (~2 min)
- `python3 -m venv + pip install molecule-ai-workspace-runtime` (~2 min)
- `git clone adapter repo + pip install adapter deps` (~2 min)
- Runtime initialization (~2-3 min)
**Fix:** Pre-baked AMIs per runtime (tracked in `project_ami_pipeline.md`).
Each AMI has all deps pre-installed. Boot reduces to ~30s.
### 3. CI blocked by go.mod replace directive
PR #900 fixes `replace github.com/...plugin... => /plugin` which breaks
native Go builds. The replace is needed only in Docker builds where the
plugin is COPYed to `/plugin`. Fix: add replace at Docker build time via
`RUN echo 'replace ...' >> go.mod`.
### 4. Cloudflare edge cache poisoning
Changing the wildcard A record origin IP causes all previously-queried
subdomains to cache the 1003 error for hours. HTTP cache purge doesn't
clear DNS routing cache.
**Fix for production:** Use a stable origin IP (dedicated proxy) or
Cloudflare Tunnel. Never change the wildcard origin IP in production.
---
## Tests needed
### Automated (add to CI)
- [ ] Workspace EC2 boot script integration test (mock EC2, verify
user-data contains config.yaml, adapter clone, env vars)
- [ ] CP workspace provision handler test (verify env map passthrough)
- [ ] Worker routing test (mock CP lookup, verify correct backend proxy)
- [ ] Tenant ADMIN_TOKEN validation test (verify AdminAuth accepts it)
- [ ] Provisioning status endpoint test (verify direct-IP health check)
### Manual (before GA)
- [ ] Full org lifecycle: create → provision → deploy workspace →
send message → get AI response → delete workspace → delete org
- [ ] Multi-org isolation: create 2 orgs, verify workspace A cannot
reach workspace B
- [ ] Workspace auto-update: push new image, verify tenant picks it up
within 5 min
- [ ] Org deletion cascade: verify EC2 terminated, DNS cleaned, DB
purged, audit trail written
- [ ] Browser E2E: Canvas loads, onboarding wizard works, deploy
template prompts for API key, workspace comes online, chat works
### Security (before GA)
- [ ] Fix #756 (X-Workspace-ID forge) — complete access control bypass
- [ ] Fix #757 (GLOBAL memory poisoning)
- [ ] Remove ADMIN_TOKEN from public `/instance` endpoint
- [ ] Encrypt ADMIN_TOKEN in DB
- [ ] Remove debug server (:9999) from workspace boot script
- [ ] Remove `set -ex` from boot scripts (leaks secrets to console)
- [ ] Restrict workspace EC2 security group
- [ ] Add HTTPS between Worker and EC2 (or use tunnel)