Security: - Replace hardcoded Cloudflare account/zone/KV IDs in wrangler.toml with placeholders; add wrangler.toml to .gitignore, ship .example - Replace real EC2 IPs in docs with <EC2_IP> placeholders - Redact partial CF API token prefix in retrospective - Parameterize Langfuse dev credentials in docker-compose.infra.yml - Replace Neon project ID in runbook with <neon-project-id> Community: - Add CONTRIBUTING.md (build, test, branch conventions, CI info) - Add CODE_OF_CONDUCT.md (Contributor Covenant 2.1) Cleanup: - Replace personal runner username/machine name in CI + PLAN.md - Replace personal tenant URL in MCP setup guide - Replace personal author field in bundle-system doc - Replace personal login in webhook test fixture - Rewrite cryptominer incident reference as generic security remediation - Remove private repo commit hashes from PLAN.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
266 lines
11 KiB
Markdown
266 lines
11 KiB
Markdown
# Session Retrospective: 2026-04-16/17 SaaS Buildout
|
|
|
|
> **Duration:** ~24 hours (overnight autonomous + daytime interactive)
|
|
> **Scope:** Full SaaS infrastructure migration + E2E workspace provisioning
|
|
> **Status:** Platform API 17/17 pass, workspace A2A confirmed working,
|
|
> multiple issues remain for production readiness
|
|
|
|
---
|
|
|
|
## What was done
|
|
|
|
### Infrastructure migration (Fly.io → Railway + EC2)
|
|
|
|
| Change | Repo | Status |
|
|
|--------|------|--------|
|
|
| Railway deployment for control plane | molecule-controlplane | Deployed, auto-deploy on push |
|
|
| EC2 provisioner for tenants (Postgres + Redis + Platform in Docker) | molecule-controlplane | Deployed |
|
|
| EC2 provisioner for workspaces (pip install runtime at boot) | molecule-controlplane | Deployed, 9 min cold start |
|
|
| Cloudflare Worker for wildcard subdomain routing | molecule-tenant-proxy (new repo) | Deployed |
|
|
| Wildcard DNS `*.moleculesai.app` → Worker | Cloudflare dashboard | Done |
|
|
| Per-tenant ADMIN_TOKEN for Worker auth injection | molecule-controlplane | Deployed |
|
|
| Auto-updater cron on tenant EC2s (Option B) | molecule-controlplane | Deployed |
|
|
| Phase 33.2: stop creating per-tenant DNS records | molecule-controlplane | Deployed |
|
|
| Provisioning status page (progress bar + ETA) | molecule-app | Deployed to Vercel |
|
|
| Delete org button with type-to-confirm | molecule-app | Deployed to Vercel |
|
|
| Remove admin section from SaaS app | molecule-app | Deployed to Vercel |
|
|
|
|
### Monorepo PRs merged (by me)
|
|
|
|
| PR | Title |
|
|
|----|-------|
|
|
| #584 | TenantGuard same-origin bypass for EC2 tenant Canvas |
|
|
| #585 | Remove Fly registry from publish pipeline |
|
|
| #586 | Remove brand-monitor from monorepo |
|
|
| #587 | 5 Canvas UX fixes (error handling, a11y, loading state) |
|
|
| #588 | Hermes + gemini-cli deploy preflight required keys |
|
|
| #589 | Ecosystem-watch MAF v1.0 update |
|
|
| #646 | Migration TEXT→UUID FK type mismatch (critical E2E unblock) |
|
|
| #751 | A2A topology overlay |
|
|
| #771 | mcp-eval quality gate |
|
|
| #843 | pgvector migration DO block guard (critical E2E unblock) |
|
|
|
|
### Monorepo PRs merged (by other agents, reviewed by me)
|
|
|
|
#601, #602, #606, #610, #611, #612, #627, #629, #630, #639, #640, #641,
|
|
#644, #645, #650, #655, #656, #659, #669, #764, #784, #785, #791, #793,
|
|
#794, #796, #797, #798, #803, #808 — 30+ PRs total.
|
|
|
|
### Issues filed
|
|
|
|
| Issue | Title |
|
|
|-------|-------|
|
|
| #590 | AG-UI compatible SSE endpoint (implemented in #601) |
|
|
| #591 | Per-org tool governance registry |
|
|
| #592 | Per-workspace cost transparency |
|
|
| #850 | Canvas :3000 not running on tenant EC2 (fixed) |
|
|
| #863 | Workspace boot script missing config.yaml (fixed) |
|
|
|
|
### Docs created
|
|
|
|
| Doc | Purpose |
|
|
|-----|---------|
|
|
| `docs/architecture/wildcard-dns-proxy.md` | Phase 33 Cloudflare Worker architecture |
|
|
| `docs/architecture/tenant-image-upgrades.md` | Options A/B/C for tenant auto-upgrade |
|
|
| `docs/architecture/partner-api-keys.md` | Phase 34 partner/programmatic API access |
|
|
| `tests/e2e/test_saas_tenant.sh` | Reusable SaaS tenant smoke test |
|
|
|
|
### Standalone repos created
|
|
|
|
| Repo | Purpose |
|
|
|------|---------|
|
|
| `Molecule-AI/molecule-tenant-proxy` | Cloudflare Worker for subdomain routing |
|
|
|
|
---
|
|
|
|
## What should NOT have been changed (but was)
|
|
|
|
### 1. Wildcard DNS record changed 4 times in one session
|
|
|
|
The wildcard A record for `*.moleculesai.app` was pointed at:
|
|
1. `<EC2_IP>` (real EC2 IP) — initial
|
|
2. `198.51.100.1` (RFC 5737 TEST-NET) — Cloudflare blocked it (1003)
|
|
3. `<EC2_IP>` (terminated EC2) — caused 1003 for all subdomains
|
|
4. `<EC2_IP>` (another terminated EC2) — same issue
|
|
5. `<EC2_IP>` (final live EC2) — current
|
|
|
|
**Impact:** Every subdomain queried during configs 2-4 got permanently
|
|
cached as 1003 at Cloudflare's edge. Cache purge didn't help (different
|
|
cache layer). These subdomains are stuck until Cloudflare's DNS routing
|
|
cache expires (~24h).
|
|
|
|
**Lesson:** The wildcard should have pointed to a **stable, always-live IP**
|
|
from the start. In production, this should be a dedicated proxy/load
|
|
balancer IP that never changes, not an individual EC2 instance.
|
|
|
|
**Follow-up:** Consider using a Cloudflare Tunnel instead of a proxied A
|
|
record — tunnels don't have the origin-IP-must-be-reachable requirement.
|
|
|
|
### 2. AdminAuth Origin bypass attempted then reverted
|
|
|
|
Attempted to add `canvasOriginAllowed()` to `AdminAuth` middleware to let
|
|
the Canvas through without a bearer token. A test (#623) correctly blocked
|
|
this — Origin is forgeable, and AdminAuth protects sensitive routes
|
|
(secrets, events, bundles).
|
|
|
|
**What should have been done from the start:** Per-tenant ADMIN_TOKEN
|
|
(which we eventually implemented). The Origin bypass was a security
|
|
shortcut that the existing test suite caught.
|
|
|
|
**Current state:** Reverted. ADMIN_TOKEN is the correct approach.
|
|
|
|
### 3. Debug code left in CP provisioner
|
|
|
|
The workspace boot script still has:
|
|
- `python3 -m http.server 9999` debug server exposing `/var/log/`
|
|
- Crash detection `echo "RUNTIME CRASHED"` with log dump
|
|
- `set -ex` showing all commands in cloud-init console
|
|
|
|
**Follow-up:** Remove debug instrumentation before production. The debug
|
|
server on :9999 exposes boot logs to anyone who can reach the EC2 IP.
|
|
|
|
### 4. GHCR auth removed then re-added
|
|
|
|
Removed `docker login` from tenant boot script (assuming public GHCR),
|
|
then had to re-add it when the package couldn't be made public (linked
|
|
to private repo). Wasted one provisioning cycle.
|
|
|
|
### 5. DB rows deleted manually via psql
|
|
|
|
Multiple times during testing, org/instance rows were deleted directly
|
|
via psql instead of going through the proper `DELETE /cp/orgs/:slug`
|
|
cascade. This left orphaned EC2 instances running (costing money) and
|
|
skipped the GDPR purge audit trail.
|
|
|
|
**Lesson:** Always use the API for deletions. The cascade handles EC2
|
|
termination + DNS cleanup + audit logging.
|
|
|
|
---
|
|
|
|
## Security concerns to address
|
|
|
|
### CRITICAL
|
|
|
|
1. **#756 — X-Workspace-ID header forge bypasses CanCommunicate**
|
|
Any workspace can reach any other workspace by setting
|
|
`X-Workspace-ID: system:anything`. Complete access control bypass.
|
|
Fix options proposed, awaiting CEO design decision.
|
|
|
|
2. **#757 — GLOBAL memory poisoning**
|
|
Root workspaces can inject persistent prompt injection into all agents
|
|
via GLOBAL memory scope. Mitigations proposed, awaiting CEO decision.
|
|
|
|
### HIGH
|
|
|
|
3. **ADMIN_TOKEN in plaintext in org_instances table**
|
|
The per-tenant ADMIN_TOKEN is stored unencrypted in the CP database.
|
|
Should be encrypted with the envelope key like other secrets.
|
|
|
|
4. **ADMIN_TOKEN exposed via `/cp/orgs/:slug/instance` public endpoint**
|
|
The Worker's routing endpoint returns the admin_token in plaintext.
|
|
This endpoint is public (no auth). Anyone who knows the slug can get
|
|
the admin token and access all AdminAuth-protected routes.
|
|
**Fix:** Remove admin_token from the public response. Store it in
|
|
Worker KV at provision time instead.
|
|
|
|
5. **Debug HTTP server on workspace EC2 port 9999**
|
|
Exposes boot logs (may contain secrets in env exports) to anyone
|
|
who can reach the EC2 IP. Must be removed before production.
|
|
|
|
6. **`set -ex` in boot scripts**
|
|
Shows all commands including secret values in cloud-init console
|
|
output. EC2 console output is accessible via AWS API.
|
|
|
|
### MEDIUM
|
|
|
|
7. **Workspace EC2 security group allows all inbound**
|
|
Should restrict to: Cloudflare IPs (for Worker proxying), tenant
|
|
EC2 IP (for direct platform communication), SSH from admin IP only.
|
|
|
|
8. **No HTTPS between Worker and EC2**
|
|
Worker connects to EC2 on `http://IP:8080` (plain HTTP). Traffic
|
|
crosses the public internet unencrypted. Should use a tunnel or
|
|
at minimum restrict to VPC.
|
|
|
|
---
|
|
|
|
## What needs proper workflow
|
|
|
|
### 1. Workspace registration not working
|
|
|
|
Workspace EC2s boot, start the A2A server on :8000, but never register
|
|
with the tenant platform (`POST /registry/register`). The workspace stays
|
|
at "provisioning" status forever on the Canvas.
|
|
|
|
**Root cause:** The boot script starts `molecule-runtime` which handles
|
|
registration, but the runtime may not have the workspace auth token
|
|
needed for registration. The token is issued by the tenant platform
|
|
after the CP provision call, but it's not passed to the workspace EC2.
|
|
|
|
**Fix needed:** Pass the workspace auth token in the boot script env,
|
|
or have the runtime request a token at startup.
|
|
|
|
### 2. Workspace boot time (9 min cold start)
|
|
|
|
The workspace EC2 boot sequence:
|
|
- `apt-get update + install` (~2 min)
|
|
- `python3 -m venv + pip install molecule-ai-workspace-runtime` (~2 min)
|
|
- `git clone adapter repo + pip install adapter deps` (~2 min)
|
|
- Runtime initialization (~2-3 min)
|
|
|
|
**Fix:** Pre-baked AMIs per runtime (tracked in `project_ami_pipeline.md`).
|
|
Each AMI has all deps pre-installed. Boot reduces to ~30s.
|
|
|
|
### 3. CI blocked by go.mod replace directive
|
|
|
|
PR #900 fixes `replace github.com/...plugin... => /plugin` which breaks
|
|
native Go builds. The replace is needed only in Docker builds where the
|
|
plugin is COPYed to `/plugin`. Fix: add replace at Docker build time via
|
|
`RUN echo 'replace ...' >> go.mod`.
|
|
|
|
### 4. Cloudflare edge cache poisoning
|
|
|
|
Changing the wildcard A record origin IP causes all previously-queried
|
|
subdomains to cache the 1003 error for hours. HTTP cache purge doesn't
|
|
clear DNS routing cache.
|
|
|
|
**Fix for production:** Use a stable origin IP (dedicated proxy) or
|
|
Cloudflare Tunnel. Never change the wildcard origin IP in production.
|
|
|
|
---
|
|
|
|
## Tests needed
|
|
|
|
### Automated (add to CI)
|
|
|
|
- [ ] Workspace EC2 boot script integration test (mock EC2, verify
|
|
user-data contains config.yaml, adapter clone, env vars)
|
|
- [ ] CP workspace provision handler test (verify env map passthrough)
|
|
- [ ] Worker routing test (mock CP lookup, verify correct backend proxy)
|
|
- [ ] Tenant ADMIN_TOKEN validation test (verify AdminAuth accepts it)
|
|
- [ ] Provisioning status endpoint test (verify direct-IP health check)
|
|
|
|
### Manual (before GA)
|
|
|
|
- [ ] Full org lifecycle: create → provision → deploy workspace →
|
|
send message → get AI response → delete workspace → delete org
|
|
- [ ] Multi-org isolation: create 2 orgs, verify workspace A cannot
|
|
reach workspace B
|
|
- [ ] Workspace auto-update: push new image, verify tenant picks it up
|
|
within 5 min
|
|
- [ ] Org deletion cascade: verify EC2 terminated, DNS cleaned, DB
|
|
purged, audit trail written
|
|
- [ ] Browser E2E: Canvas loads, onboarding wizard works, deploy
|
|
template prompts for API key, workspace comes online, chat works
|
|
|
|
### Security (before GA)
|
|
|
|
- [ ] Fix #756 (X-Workspace-ID forge) — complete access control bypass
|
|
- [ ] Fix #757 (GLOBAL memory poisoning)
|
|
- [ ] Remove ADMIN_TOKEN from public `/instance` endpoint
|
|
- [ ] Encrypt ADMIN_TOKEN in DB
|
|
- [ ] Remove debug server (:9999) from workspace boot script
|
|
- [ ] Remove `set -ex` from boot scripts (leaks secrets to console)
|
|
- [ ] Restrict workspace EC2 security group
|
|
- [ ] Add HTTPS between Worker and EC2 (or use tunnel)
|