Merge pull request #924 from Molecule-AI/docs/session-retrospective-2026-04-17
docs: SaaS buildout retrospective + Phase 35 hardening plan
commit
7e8eff2fe5
50
PLAN.md
@@ -622,6 +622,56 @@ self-hosted per-customer). Ordered by dependency + ROI.

---

## Phase 35: SaaS Production Hardening (post-2026-04-17 retrospective)

> **Goal:** Address security gaps, remove debug code, fix workspace
> registration, and reduce boot time identified during the SaaS buildout
> session. See `docs/retrospectives/2026-04-17-saas-buildout.md` for full
> context.

### Phase 35.1 — Security (CRITICAL, before any public launch)

- [ ] Fix #756 — X-Workspace-ID header forge bypasses CanCommunicate
      (derive callerID from authenticated token, not raw header)
- [ ] Fix #757 — GLOBAL memory poisoning mitigations (content delimiters
      + audit log at minimum)
- [ ] Remove ADMIN_TOKEN from public `/cp/orgs/:slug/instance` endpoint —
      store in Worker KV at provision time instead
- [ ] Encrypt ADMIN_TOKEN in `org_instances` table (use envelope key)
- [ ] Remove debug HTTP server (:9999) from workspace boot script
- [ ] Remove `set -ex` from boot scripts (leaks env vars to EC2 console)
- [ ] Restrict workspace EC2 security group (Cloudflare IPs + tenant IP only)
- [ ] Add HTTPS between Worker and EC2 (or Cloudflare Tunnel)
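
The #757 item above names two minimum mitigations: content delimiters and an audit log. A minimal sketch of both follows, assuming memory entries are plain strings. The delimiter strings, the `wrap_memory` and `audit_record` names, and the audit fields are all hypothetical; the plan does not specify a format.

```python
import hashlib
import json
import time
from typing import Optional

# Hypothetical delimiters; the plan does not define the real ones.
OPEN = "<<<GLOBAL_MEMORY untrusted>>>"
CLOSE = "<<<END_GLOBAL_MEMORY>>>"


def wrap_memory(content: str) -> str:
    """Fence a stored memory entry so a prompt template can treat it as data."""
    # Strip any embedded delimiter so the entry cannot close the fence early.
    safe = content.replace(OPEN, "[delimiter removed]")
    safe = safe.replace(CLOSE, "[delimiter removed]")
    return f"{OPEN}\n{safe}\n{CLOSE}"


def audit_record(workspace_id: str, content: str, now: Optional[float] = None) -> str:
    """Return one JSON line describing a GLOBAL memory write."""
    record = {
        "ts": time.time() if now is None else now,
        "workspace_id": workspace_id,
        "scope": "GLOBAL",
        # Store a digest rather than the content, so the log itself
        # does not replay the injected text.
        "sha256": hashlib.sha256(content.encode()).hexdigest(),
        "length": len(content),
    }
    return json.dumps(record, sort_keys=True)
```

Delimiters only help if the prompt template tells the model that fenced text is untrusted data, so this is a mitigation rather than a fix.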

### Phase 35.2 — Workspace registration fix

- [ ] Pass workspace auth token in EC2 boot script env so runtime can
      register with `POST /registry/register`
- [ ] Or: have runtime request a token at startup via
      `GET /admin/workspaces/:id/test-token`
- [ ] Verify workspace status flips to "online" on Canvas after boot
- [ ] Test full Canvas flow: deploy → STARTING → online → chat works

### Phase 35.3 — Boot time optimization

- [ ] Pre-baked AMI per runtime (Packer or EC2 Image Builder):
  - `ami-hermes`: Python + openai + anthropic + molecule-runtime + hermes adapter
  - `ami-claude-code`: Node + claude-code SDK + molecule-runtime
  - `ami-langgraph`: Python + langchain + langgraph + molecule-runtime
- [ ] Runtime switch = launch from different AMI. Boot ~30s vs current ~9 min
- [ ] Remove apt-get + pip install from boot script (only config + secrets + start)

### Phase 35.4 — Stability + CI

- [ ] Fix go.mod replace directive (PR #900) — unblocks all CI
- [ ] Use stable origin IP for wildcard DNS (dedicated proxy or Tunnel)
- [ ] Add workspace boot integration test to CI
- [ ] Add SaaS tenant smoke test (`tests/e2e/test_saas_tenant.sh`) to CI
- [ ] Clean up Cloudflare edge cache poisoning from session
      (or wait ~24h for natural expiry)

---

## Infra footnote — Temporal

`docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web

265
docs/retrospectives/2026-04-17-saas-buildout.md
Normal file
@@ -0,0 +1,265 @@
# Session Retrospective: 2026-04-16/17 SaaS Buildout

> **Duration:** ~24 hours (overnight autonomous + daytime interactive)
> **Scope:** Full SaaS infrastructure migration + E2E workspace provisioning
> **Status:** Platform API 17/17 pass, workspace A2A confirmed working,
> multiple issues remain for production readiness

---

## What was done

### Infrastructure migration (Fly.io → Railway + EC2)

| Change | Repo | Status |
|--------|------|--------|
| Railway deployment for control plane | molecule-controlplane | Deployed, auto-deploy on push |
| EC2 provisioner for tenants (Postgres + Redis + Platform in Docker) | molecule-controlplane | Deployed |
| EC2 provisioner for workspaces (pip install runtime at boot) | molecule-controlplane | Deployed, 9 min cold start |
| Cloudflare Worker for wildcard subdomain routing | molecule-tenant-proxy (new repo) | Deployed |
| Wildcard DNS `*.moleculesai.app` → Worker | Cloudflare dashboard | Done |
| Per-tenant ADMIN_TOKEN for Worker auth injection | molecule-controlplane | Deployed |
| Auto-updater cron on tenant EC2s (Option B) | molecule-controlplane | Deployed |
| Phase 33.2: stop creating per-tenant DNS records | molecule-controlplane | Deployed |
| Provisioning status page (progress bar + ETA) | molecule-app | Deployed to Vercel |
| Delete org button with type-to-confirm | molecule-app | Deployed to Vercel |
| Remove admin section from SaaS app | molecule-app | Deployed to Vercel |

### Monorepo PRs merged (by me)

| PR | Title |
|----|-------|
| #584 | TenantGuard same-origin bypass for EC2 tenant Canvas |
| #585 | Remove Fly registry from publish pipeline |
| #586 | Remove brand-monitor from monorepo |
| #587 | 5 Canvas UX fixes (error handling, a11y, loading state) |
| #588 | Hermes + gemini-cli deploy preflight required keys |
| #589 | Ecosystem-watch MAF v1.0 update |
| #646 | Migration TEXT→UUID FK type mismatch (critical E2E unblock) |
| #751 | A2A topology overlay |
| #771 | mcp-eval quality gate |
| #843 | pgvector migration DO block guard (critical E2E unblock) |

### Monorepo PRs merged (by other agents, reviewed by me)

#601, #602, #606, #610, #611, #612, #627, #629, #630, #639, #640, #641,
#644, #645, #650, #655, #656, #659, #669, #764, #784, #785, #791, #793,
#794, #796, #797, #798, #803, #808 — 30+ PRs total.

### Issues filed

| Issue | Title |
|-------|-------|
| #590 | AG-UI compatible SSE endpoint (implemented in #601) |
| #591 | Per-org tool governance registry |
| #592 | Per-workspace cost transparency |
| #850 | Canvas :3000 not running on tenant EC2 (fixed) |
| #863 | Workspace boot script missing config.yaml (fixed) |

### Docs created

| Doc | Purpose |
|-----|---------|
| `docs/architecture/wildcard-dns-proxy.md` | Phase 33 Cloudflare Worker architecture |
| `docs/architecture/tenant-image-upgrades.md` | Options A/B/C for tenant auto-upgrade |
| `docs/architecture/partner-api-keys.md` | Phase 34 partner/programmatic API access |
| `tests/e2e/test_saas_tenant.sh` | Reusable SaaS tenant smoke test |

### Standalone repos created

| Repo | Purpose |
|------|---------|
| `Molecule-AI/molecule-tenant-proxy` | Cloudflare Worker for subdomain routing |

---

## What should NOT have been changed (but was)

### 1. Wildcard DNS record changed 4 times in one session

The wildcard A record for `*.moleculesai.app` was pointed at:
1. `18.220.182.88` (real EC2 IP) — initial
2. `198.51.100.1` (RFC 5737 TEST-NET-2) — Cloudflare blocked it (1003)
3. `3.16.109.132` (terminated EC2) — caused 1003 for all subdomains
4. `3.143.250.95` (another terminated EC2) — same issue
5. `3.131.96.216` (final live EC2) — current

**Impact:** Every subdomain queried during configs 2-4 got cached as
1003 at Cloudflare's edge. An HTTP cache purge didn't help (different
cache layer). These subdomains are stuck until Cloudflare's DNS routing
cache expires (~24h).

**Lesson:** The wildcard should have pointed to a **stable, always-live IP**
from the start. In production, this should be a dedicated proxy/load
balancer IP that never changes, not an individual EC2 instance.

**Follow-up:** Consider using a Cloudflare Tunnel instead of a proxied A
record — tunnels don't have the origin-IP-must-be-reachable requirement.

### 2. AdminAuth Origin bypass attempted then reverted

Attempted to add `canvasOriginAllowed()` to `AdminAuth` middleware to let
the Canvas through without a bearer token. A test (#623) correctly blocked
this — Origin is forgeable, and AdminAuth protects sensitive routes
(secrets, events, bundles).

**What should have been done from the start:** Per-tenant ADMIN_TOKEN
(which we eventually implemented). The Origin bypass was a security
shortcut that the existing test suite caught.

**Current state:** Reverted. ADMIN_TOKEN is the correct approach.

### 3. Debug code left in CP provisioner

The workspace boot script still has:
- `python3 -m http.server 9999` debug server exposing `/var/log/`
- Crash detection `echo "RUNTIME CRASHED"` with log dump
- `set -ex` showing all commands in cloud-init console

**Follow-up:** Remove debug instrumentation before production. The debug
server on :9999 exposes boot logs to anyone who can reach the EC2 IP.

### 4. GHCR auth removed then re-added

Removed `docker login` from tenant boot script (assuming public GHCR),
then had to re-add it when the package couldn't be made public (linked
to private repo). Wasted one provisioning cycle.

### 5. DB rows deleted manually via psql

Multiple times during testing, org/instance rows were deleted directly
via psql instead of going through the proper `DELETE /cp/orgs/:slug`
cascade. This left orphaned EC2 instances running (costing money) and
skipped the GDPR purge audit trail.

**Lesson:** Always use the API for deletions. The cascade handles EC2
termination + DNS cleanup + audit logging.

---

## Security concerns to address

### CRITICAL

1. **#756 — X-Workspace-ID header forge bypasses CanCommunicate**
   Any workspace can reach any other workspace by setting
   `X-Workspace-ID: system:anything`. Complete access control bypass.
   Fix options proposed, awaiting CEO design decision.

2. **#757 — GLOBAL memory poisoning**
   Root workspaces can inject persistent prompt injection into all agents
   via GLOBAL memory scope. Mitigations proposed, awaiting CEO decision.
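
For #756, the plan's proposed direction is to derive the caller ID from the authenticated token rather than the raw `X-Workspace-ID` header. The sketch below illustrates that idea only. The token format (`<workspace_id>.<hex HMAC>`), the `issue_token` and `caller_id` names, and the use of a plain dict for headers are assumptions made for the example, not the platform's actual scheme.

```python
import hashlib
import hmac
from typing import Mapping, Optional


def issue_token(workspace_id: str, secret: bytes) -> str:
    """Hypothetical token: the workspace ID plus an HMAC-SHA256 signature."""
    sig = hmac.new(secret, workspace_id.encode(), hashlib.sha256).hexdigest()
    return f"{workspace_id}.{sig}"


def caller_id(headers: Mapping[str, str], secret: bytes) -> Optional[str]:
    """Return the verified workspace ID, or None if the request is unauthenticated.

    X-Workspace-ID is never read: the identity comes only from the token.
    """
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer":
        return None
    workspace_id, _, sig = token.rpartition(".")
    if not workspace_id:
        return None
    expected = hmac.new(secret, workspace_id.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return workspace_id
```

With this shape, a request that sets `X-Workspace-ID: system:anything` is still identified by whatever workspace its token was issued to, so the access check runs against the real caller.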

### HIGH

3. **ADMIN_TOKEN in plaintext in org_instances table**
   The per-tenant ADMIN_TOKEN is stored unencrypted in the CP database.
   Should be encrypted with the envelope key like other secrets.

4. **ADMIN_TOKEN exposed via `/cp/orgs/:slug/instance` public endpoint**
   The Worker's routing endpoint returns the admin_token in plaintext.
   This endpoint is public (no auth). Anyone who knows the slug can get
   the admin token and access all AdminAuth-protected routes.
   **Fix:** Remove admin_token from the public response. Store it in
   Worker KV at provision time instead.

5. **Debug HTTP server on workspace EC2 port 9999**
   Exposes boot logs (may contain secrets in env exports) to anyone
   who can reach the EC2 IP. Must be removed before production.

6. **`set -ex` in boot scripts**
   Shows all commands including secret values in cloud-init console
   output. EC2 console output is accessible via AWS API.

### MEDIUM

7. **Workspace EC2 security group allows all inbound**
   Should restrict to: Cloudflare IPs (for Worker proxying), tenant
   EC2 IP (for direct platform communication), SSH from admin IP only.

8. **No HTTPS between Worker and EC2**
   Worker connects to EC2 on `http://IP:8080` (plain HTTP). Traffic
   crosses the public internet unencrypted. Should use a tunnel or
   at minimum restrict to VPC.
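
The allowlist in item 7 can be modelled as a CIDR membership check, which is also a convenient way to unit-test a proposed rule set before applying it. This is a sketch only: the CIDRs below are documentation-range placeholders (RFC 5737), not real Cloudflare or tenant addresses, and `ALLOWED_SOURCES` and `is_allowed` are hypothetical names.

```python
import ipaddress
from typing import Iterable

# Placeholder ranges only. Real rules would use Cloudflare's published
# ranges plus the tenant EC2 address.
ALLOWED_SOURCES = [
    "192.0.2.0/24",     # stands in for a Cloudflare range
    "203.0.113.10/32",  # stands in for the tenant EC2 IP
]


def is_allowed(source_ip: str, allowlist: Iterable[str] = ALLOWED_SOURCES) -> bool:
    """Return True if source_ip falls inside any allowlisted CIDR."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowlist)
```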

---

## What needs proper workflow

### 1. Workspace registration not working

Workspace EC2s boot, start the A2A server on :8000, but never register
with the tenant platform (`POST /registry/register`). The workspace stays
at "provisioning" status forever on the Canvas.

**Root cause:** The boot script starts `molecule-runtime` which handles
registration, but the runtime may not have the workspace auth token
needed for registration. The token is issued by the tenant platform
after the CP provision call, but it's not passed to the workspace EC2.

**Fix needed:** Pass the workspace auth token in the boot script env,
or have the runtime request a token at startup.
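
A minimal sketch of the first option, assuming the boot script exports the token as an environment variable and the platform accepts it as a bearer token. The variable name `WORKSPACE_AUTH_TOKEN`, the JSON body, and the helper names are hypothetical; only the `POST /registry/register` route comes from this document. The request is built but not sent, so the shape can be checked without a running platform.

```python
import json
import urllib.request
from typing import Mapping


def token_from_env(env: Mapping[str, str]) -> str:
    """Fail fast at boot if the token was not passed through (the current bug)."""
    token = env.get("WORKSPACE_AUTH_TOKEN", "")  # hypothetical variable name
    if not token:
        raise RuntimeError("WORKSPACE_AUTH_TOKEN missing: cannot register")
    return token


def build_register_request(platform_url: str, workspace_id: str,
                           token: str) -> urllib.request.Request:
    """Build (but do not send) the registration call."""
    body = json.dumps({"workspace_id": workspace_id}).encode()  # assumed payload
    return urllib.request.Request(
        url=platform_url.rstrip("/") + "/registry/register",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Raising when the token is absent makes a missing token visible at boot instead of leaving the workspace silently stuck at "provisioning".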

### 2. Workspace boot time (9 min cold start)

The workspace EC2 boot sequence:
- `apt-get update + install` (~2 min)
- `python3 -m venv + pip install molecule-ai-workspace-runtime` (~2 min)
- `git clone adapter repo + pip install adapter deps` (~2 min)
- Runtime initialization (~2-3 min)

**Fix:** Pre-baked AMIs per runtime (tracked in `project_ami_pipeline.md`).
Each AMI has all deps pre-installed. Boot reduces to ~30s.
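
As a quick check on the figures above, the per-step estimates sum to the reported cold start, and the ~30s target can be expressed as a speedup. The step labels are shortened from the list; the numbers are the approximate ones quoted here, not new measurements.

```python
# Per-step estimates in minutes as (low, high), taken from the list above.
BOOT_STEPS = {
    "apt-get update + install": (2, 2),
    "venv + pip install runtime": (2, 2),
    "git clone adapter + pip install deps": (2, 2),
    "runtime initialization": (2, 3),
}
TARGET_SECONDS = 30  # pre-baked AMI boot target quoted above

low = sum(lo for lo, _ in BOOT_STEPS.values())
high = sum(hi for _, hi in BOOT_STEPS.values())
speedup_low = low * 60 / TARGET_SECONDS
speedup_high = high * 60 / TARGET_SECONDS

print(f"cold start: {low}-{high} min; "
      f"AMI target is {speedup_low:.0f}-{speedup_high:.0f}x faster")
# prints "cold start: 8-9 min; AMI target is 16-18x faster"
```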

### 3. CI blocked by go.mod replace directive

PR #900 fixes the `replace github.com/...plugin... => /plugin` directive,
which breaks native Go builds. The replace is needed only in Docker builds
where the plugin is COPYed to `/plugin`. Fix: add the replace at Docker
build time via `RUN echo 'replace ...' >> go.mod`.

### 4. Cloudflare edge cache poisoning

Changing the wildcard A record origin IP causes all previously-queried
subdomains to cache the 1003 error for hours. HTTP cache purge doesn't
clear DNS routing cache.

**Fix for production:** Use a stable origin IP (dedicated proxy) or
Cloudflare Tunnel. Never change the wildcard origin IP in production.

---

## Tests needed

### Automated (add to CI)

- [ ] Workspace EC2 boot script integration test (mock EC2, verify
      user-data contains config.yaml, adapter clone, env vars)
- [ ] CP workspace provision handler test (verify env map passthrough)
- [ ] Worker routing test (mock CP lookup, verify correct backend proxy)
- [ ] Tenant ADMIN_TOKEN validation test (verify AdminAuth accepts it)
- [ ] Provisioning status endpoint test (verify direct-IP health check)

### Manual (before GA)

- [ ] Full org lifecycle: create → provision → deploy workspace →
      send message → get AI response → delete workspace → delete org
- [ ] Multi-org isolation: create 2 orgs, verify workspace A cannot
      reach workspace B
- [ ] Workspace auto-update: push new image, verify tenant picks it up
      within 5 min
- [ ] Org deletion cascade: verify EC2 terminated, DNS cleaned, DB
      purged, audit trail written
- [ ] Browser E2E: Canvas loads, onboarding wizard works, deploy
      template prompts for API key, workspace comes online, chat works

### Security (before GA)

- [ ] Fix #756 (X-Workspace-ID forge) — complete access control bypass
- [ ] Fix #757 (GLOBAL memory poisoning)
- [ ] Remove ADMIN_TOKEN from public `/instance` endpoint
- [ ] Encrypt ADMIN_TOKEN in DB
- [ ] Remove debug server (:9999) from workspace boot script
- [ ] Remove `set -ex` from boot scripts (leaks secrets to console)
- [ ] Restrict workspace EC2 security group
- [ ] Add HTTPS between Worker and EC2 (or use tunnel)