# PLAN.md — Molecule AI Build Plan

> Completed phases (1–11, 13–14) are documented in `/docs` and removed from here.
> This file tracks only **in-progress and upcoming work**.

---

## Completed Phases (see /docs for details)

| Phase | Name | Docs |
|-------|------|------|
| 1 | Core Loop | `docs/architecture/architecture.md`, `CLAUDE.md` |
| 2 | E2E Validation | `CLAUDE.md` (build/test commands) |
| 3 | Hierarchy & Communication | `docs/api-protocol/communication-rules.md` |
| 4 | Provisioner | `docs/architecture/provisioner.md` |
| 5 | Agent Management | `CLAUDE.md` (API routes) |
| 6 | Bundle Export/Import | `docs/agent-runtime/bundle-system.md` |
| 7 | Team Expansion | `docs/agent-runtime/team-expansion.md` |
| 8 | Human-in-the-Loop Approvals | `docs/agent-runtime/system-prompt-structure.md` |
| 9 | Hierarchical Memory | `docs/architecture/memory.md` |
| 10 | Observability (Langfuse) | `docs/development/observability.md` |
| 11 | Canvas Polish & UX | `docs/frontend/canvas.md` |
| 13 | Runtime Enhancements | `docs/agent-runtime/workspace-runtime.md` |
| 14 | Production Hardening | `docs/architecture/provisioner.md`, `CLAUDE.md` |
| 15 | Per-Workspace Dir | PR #38 — `workspace_dir` per workspace |
| 16 | Plugin System | PR #39 — per-workspace plugins with registry |
| 17 | Agent GitHub Access | PR #40 — git/gh in images, GITHUB_TOKEN env |
| 18 | File Browser Lazy Loading | PR #37 — depth=1, path traversal protection |
| 19 | MCP Full Coverage | PR #40 — 52→54 tools (plugins, global secrets, pause/resume, org, delegation) |
| 20 | Canvas UX Sprint | PRs #4, #21, #39 — Settings Panel, Onboarding, Plugins UI, Pause/Resume |
| 21 | Claude Agent SDK Migration | PR #48 — `ClaudeSDKExecutor` replaces CLI subprocess |
| 22 | Cron Scheduling | PR #49 — recurring tasks via cron expressions, Canvas Schedule tab |
| 23 | Code Quality & Multi-Provider | PR #50 — model fallback, DeepAgents full SDK, 7 LLM providers, 100% test coverage |
| 24 | Async Delegation | PR #41 — non-blocking delegation with status polling, `check_delegation_status` tool |
| 25 | Social Channels | PR #54 — adapter-based Telegram integration, Canvas Channels tab, 7 MCP tools, hot reload, multi-chat IDs, auto-detect, /start auto-reply, full Telegram Bot API audit fixes |
| 26 | Auth Env Vars | PR #55 — `required_env` config replaces `.auth-token` files, env-var only path; reno-stars 15-agent org template |
| 27 | Channel Polish & Org Auto-link | PR #56 — poller lifetime fix (bgCtx), Restart Pending button (only when needed), org template `channels:` field auto-links Telegram on import |

---

## Phase 12: Code Sandbox — DONE

> Three-backend sandbox for the `run_code` tool, selectable per-workspace
> via `SANDBOX_BACKEND` env (set from `config.yaml → sandbox.backend`).

- [x] `run_code` tool — `workspace-template/builtin_tools/sandbox.py`
- [x] `subprocess` backend (default) — asyncio subprocess with hard timeout
- [x] `docker` backend — throwaway container with resource limits (MVP)
- [x] `e2b` backend (cloud) — E2B microVMs via `e2b-code-interpreter`, reads `E2B_API_KEY`
- [x] Sandbox config — `SandboxConfig` dataclass in `workspace-template/config.py`

Firecracker-as-a-backend is intentionally skipped: each tenant platform now
runs on a Fly Machine (which IS a Firecracker microVM — see Phase 32
Phase B), so the entire workspace process is already Firecracker-isolated
from other tenants. Running Firecracker inside Firecracker would
double-nest for no additional security. For stronger per-call isolation
within one tenant, use the `e2b` backend.
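
A minimal sketch of what the default `subprocess` backend's "asyncio subprocess with hard timeout" could look like. This is not the code in `workspace-template/builtin_tools/sandbox.py`; the function name `run_code_subprocess`, its parameters, and the shape of the returned dict are assumptions made for illustration. The child is killed outright at the deadline and then reaped, so a runaway snippet cannot outlive the call.

```python
import asyncio
import sys


async def run_code_subprocess(source: str, timeout_s: float = 5.0) -> dict:
    """Run Python source in a child interpreter and kill it at the deadline.

    Hypothetical helper: name, signature, and result keys are illustrative.
    """
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", source,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()        # hard stop, no grace period
        await proc.wait()  # reap the child so it does not linger as a zombie
        return {"timed_out": True, "exit_code": None, "stdout": "", "stderr": ""}
    return {
        "timed_out": False,
        "exit_code": proc.returncode,
        "stdout": out.decode(),
        "stderr": err.decode(),
    }
```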

---

## Phase 20: Canvas UX Sprint — MOSTLY COMPLETE

> UX specs created by UIUX Designer agent. See `docs/ux-specs/` for full specs.

### 20.1 Settings Panel (Global Secrets UI) — DONE
**Spec**: `docs/ux-specs/ux-spec-settings-panel.md`

- [x] Gear icon in canvas top bar (Cmd+, shortcut)
- [x] Slide-over drawer (480px, right-anchored)
- [x] Service groups (GitHub, Anthropic, OpenRouter, Custom)
- [x] CRUD: add, view (masked), edit, delete secrets
- [x] Empty state with guided setup
- [x] Unsaved changes guard on close

### 20.2 Onboarding / Deploy Interception — DONE
**Spec**: `docs/ux-specs/ux-spec-onboarding-interception.md`

- [x] Pre-deploy secret check — detect missing API keys per runtime
- [x] Missing Keys Modal — inline form, only asks for what's needed
- [x] Provisioning timeout → named error state with recovery actions
- [x] No dead ends — every error has a fix action

### 20.3 Canvas UI Improvements — PARTIAL
**Spec**: `docs/ux-specs/ux-spec-canvas-improvements.md`

- [x] Plugins install/uninstall in Skills tab (PR #39)
- [x] Pause/resume from context menu
- [x] Org template import from canvas (PR — `OrgTemplatesSection` in TemplatePalette)
- [ ] Workspace search (Cmd+K)
- [ ] Batch operations

---

## Phase 30: SaaS — Remote Workspaces & Cross-Network Federation — IN PROGRESS

**Goal:** let a Python agent running on a laptop in another city boot,
register, authenticate, accept A2A from its parent PM on the platform,
and appear on the canvas as a first-class workspace.

**Why now:** the self-hostable single-box model has landed; the next
meaningful expansion is letting orgs span machines and networks. This
is the step that turns Molecule AI from "Docker-compose on one box" into
a multi-tenant SaaS-shaped product.

**Design thesis:** ride the existing `runtime='external'` escape hatch.
Every Docker-touching handler already short-circuits when a workspace
is external. We don't need a parallel subsystem — we need to close
four small gaps and add per-workspace auth. See
[`docs/remote-workspaces-readiness.md`](docs/remote-workspaces-readiness.md)
for the full code audit.

### Shipping order (eight bounded steps, ~2 weeks to GA)

- [x] **30.1 Workspace auth tokens** — foundation; prevents spoofing.
  New `workspace_auth_tokens` table; `POST /registry/register` issues
  a token; middleware validates `Authorization: Bearer <token>` on
  `/registry/heartbeat`, `/registry/update-card`. Lazy bootstrap so
  in-flight workspaces upgrade gracefully. Transparent to local
  containers — provisioner carries the token through the existing env-var
  pattern. No feature flag.

- [x] **30.2 Secrets pull endpoint** — `GET /workspaces/:id/secrets/values`
  returns decrypted secrets JSON, gated by the 30.1 token. Local agents
  can use it too (removes env-at-create coupling for rotating secrets).

- [x] **30.3 Plugin tarball download** — `GET /plugins/:name/download`
  returns a tarball; agent unpacks locally. Replaces Docker-exec plugin
  install for remote agents. Behind `REMOTE_PLUGIN_DOWNLOAD_ENABLED`.

- [x] **30.4 Workspace state polling** — `GET /workspaces/:id/state`
  returns `{status, paused, deleted_at, pending_events[]}` as a drop-in
  for the WebSocket feed remote agents can't reach. Behind
  `REMOTE_STATE_POLLING_ENABLED`.

- [x] **30.5 A2A proxy token validation** — the proxy enforces the caller's
  auth token on `POST /workspaces/:id/a2a`. Mutual auth between agents.

- [x] **30.6 Direct sibling discovery + URL caching** — agents call
  `GET /registry/{parent_id}/peers` once, cache sibling URLs, call them
  directly for A2A. Resilient to brief platform outages.

- [x] **30.7 Poll-liveness for external runtime** — `LivenessChecker`
  interface in `registry/`; `PollLiveness` marks offline if no heartbeat
  in 90s. Docker checker becomes one implementation, poll-liveness
  another. Health sweep routes by runtime. Behind
  `REMOTE_LIVENESS_POLLING_ENABLED`.

- [x] **30.8 Remote-agent SDK + docs** — `sdk/python/molecule_agent/`
  thin client: register → pull secrets → run A2A loop → poll state →
  heartbeat. Working `sdk/python/examples/remote-agent/` a new user can run on a
  laptop. Remove the three feature flags. Remote workspaces become GA.
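
The liveness split in step 30.7 can be sketched as follows. The real `LivenessChecker` lives in the platform's `registry/` package; this Python rendering is only an illustration of the idea, and the method names (`is_alive`, `record_heartbeat`), the `sweep` helper, and the in-memory timestamp map are assumptions. Only the 90-second threshold and the "route by runtime" behaviour come from the plan.

```python
from dataclasses import dataclass, field
from typing import Protocol

HEARTBEAT_TIMEOUT_S = 90.0  # from the plan: offline if no heartbeat in 90s


class LivenessChecker(Protocol):
    """One implementation per runtime (Docker inspection, heartbeat polling, ...)."""

    def is_alive(self, workspace_id: str, now: float) -> bool: ...


@dataclass
class PollLiveness:
    """Liveness for external runtimes, driven only by heartbeat timestamps."""

    last_seen: dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, workspace_id: str, now: float) -> None:
        self.last_seen[workspace_id] = now

    def is_alive(self, workspace_id: str, now: float) -> bool:
        seen = self.last_seen.get(workspace_id)
        return seen is not None and (now - seen) <= HEARTBEAT_TIMEOUT_S


def sweep(
    checkers: dict[str, LivenessChecker],
    workspaces: list[tuple[str, str]],  # (workspace_id, runtime) pairs
    now: float,
) -> list[str]:
    """Return the IDs to mark offline, picking the checker by runtime."""
    return [
        ws_id
        for ws_id, runtime in workspaces
        if not checkers[runtime].is_alive(ws_id, now)
    ]
```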

### Out of scope for Phase 30

- Mutual TLS / platform-identity verification from the agent side.
  Agent trusts any platform URL in its env. Defer until real
  multi-tenant deployment forces the question.
- Agent-to-agent mesh across NATs. Direct sibling calls only work when
  siblings are reachable from each other. Behind-NAT ↔ behind-NAT needs
  a relay — defer to Phase 31.
- Platform-managed persistent state for remote agents. Remote agents
  own their filesystem; platform never mounts.

### Success criteria

- `sdk/python/examples/remote-agent/` boots on a laptop disconnected from the
  platform's LAN, registers, receives a task from parent PM via A2A,
  returns a result, appears on the canvas.
- `tests/e2e/test_federation.sh` spawns a second platform instance +
  remote agent pointing at the first; both platforms see the agent as
  a workspace in the right state.
- Spoofing test: attempt to impersonate a workspace with a guessed ID
  but no token → 401.
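
The spoofing criterion can be made concrete with a small sketch of the 30.1 token check. It is an assumption-heavy illustration, not the platform middleware: the `TokenStore` class, storing a SHA-256 digest instead of the raw token, and the integer status return are all hypothetical, and the lazy-bootstrap path for in-flight workspaces is deliberately left out. What it does show is the contract the test relies on: a known workspace ID without the matching bearer token gets 401.

```python
import hashlib
import hmac
import secrets
from typing import Optional


class TokenStore:
    """Hypothetical stand-in for the `workspace_auth_tokens` table."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}  # workspace_id -> sha256 hex digest

    def issue(self, workspace_id: str) -> str:
        """Mint a token at registration time; only its digest is kept."""
        token = secrets.token_urlsafe(32)
        self._digests[workspace_id] = hashlib.sha256(token.encode()).hexdigest()
        return token

    def authorize(self, workspace_id: str, authorization: Optional[str]) -> int:
        """Return 200 if the header carries this workspace's token, else 401."""
        expected = self._digests.get(workspace_id)
        if expected is None or not authorization:
            return 401
        scheme, _, token = authorization.partition(" ")
        if scheme != "Bearer" or not token:
            return 401
        presented = hashlib.sha256(token.encode()).hexdigest()
        # Constant-time comparison avoids leaking how many characters matched.
        return 200 if hmac.compare_digest(presented, expected) else 401
```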

---

## Phase 31 — Quality + Infra Pass (Q2 2026) — SHIPPED 2026-04-13

Completed in PRs #1–#8 and documented in `docs/edit-history/2026-04-13.md`:

- [x] **Brand migration cleanup** — LICENSE "Agent Molecule" → "Molecule AI"; new icon assets (PR #1).
- [x] **Repo structural cleanup** — moved `examples/remote-agent/` → `sdk/python/examples/`, `docs/superpowers/plans/` → `plugins/superpowers/plans/`; deleted empty `platform/plugins/`; gitignored `.agents/`, `platform/workspace-configs-templates/`, `backups/`, `logs/`, `test-results/`; added READMEs under `tests/` and `docs/` (PR #3).
- [x] **MCP per-domain split** — `mcp-server/src/index.ts` 1697 → 89 lines; 12 per-domain modules in `src/tools/`; shared `src/api.ts`; startup log now reports 87 tools (PRs #2, #4, #7).
- [x] **Canvas dialog unification** — native `confirm()`/`alert()` replaced with `ConfirmDialog` in 7 sites; new `singleButton` prop + 5 tests (vitest 352 → 357).
- [x] **Platform handler decomposition** — 4 oversize functions (`proxyA2ARequest`, `Delegate`, `Discover`, `SessionSearch`) split into testable helpers; +47 Go tests; `handlers` coverage 56.1% → 57.6%.
- [x] **Env-var documentation** — `.env.example` gained 11 previously-undocumented vars; all 21 distinct `os.Getenv`/`envx.*` keys now documented.
- [x] **E2E hardening + CI** — Phase 30.1 bearer auth + Phase 30.6 `X-Workspace-ID` requirements baked into `test_api.sh` (62/62) and `test_comprehensive_e2e.sh` (67/67); shared `_lib.sh` + `_extract_token.py`; new CI jobs `e2e-api` and `shellcheck`; `setup-go` gains module cache (PRs #5, #7, #8).

---

## PR Workflow Rules

All PRs must follow this checklist:

1. **Branch**: Never push to main. Always create a feature/fix branch.
2. **Code Review**: Run `/code-review` skill and fix all issues before requesting merge.
3. **Tests**: All existing tests must pass. New features require new tests.
4. **Documentation**: Run `/update-docs` skill. Every PR must update:
   - `docs/edit-history/` session log
   - Relevant docs in `docs/` (API, architecture, frontend, etc.)
   - `CLAUDE.md` if routes, env vars, or commands changed
   - `PLAN.md` if the work completes a phase or adds new items
5. **E2E Test**: Rebuild, restart service, and manually verify before reporting done.
6. **QA Review**: QA Engineer reviews for edge cases, plan compliance, and documentation completeness before CEO merge approval.
7. **CEO Approval**: Only the CEO approves merges. Never merge without explicit approval.

---

## Ecosystem Awareness

Adjacent projects worth tracking (Holaboss, Hermes, gstack, …) are catalogued
in **[`docs/ecosystem-watch.md`](docs/ecosystem-watch.md)**. Skim quarterly,
add entries liberally, and when one of those projects ships something we
should react to, file a "Signals to react to" line in that doc and create a
Backlog entry below pointing at it. Agents doing research or strategy work
should read `docs/ecosystem-watch.md` first — it's the canonical starting
point for "what else is out there."

---

## Backlog (prioritized)

1. **Canvas: Org template import** — Phase 20.3 (deploy org from canvas UI)
2. **Canvas: Workspace search (Cmd+K)** — Phase 20.3 (quick find)
3. **Canvas: Batch operations** — Phase 20.3 (multi-select delete/restart)
4. **Sandbox: Firecracker/E2B backends** — Phase 12 (production isolation)
5. **NemoClaw adapter** — stub exists at `adapters/nemoclaw/`, no implementation yet
6. **Remote plugin registry** — install plugins from npm/git (currently local only)
7. **Agent git worktrees** — per-agent branches without full clone
8. **SDK follow-ups** — live tool-call visibility, cost telemetry, cancel UX, governance hooks
9. **Real webhook mode for channels** — Phase 27 candidate. Currently polling-only; webhook needs:
   - `mode: "webhook"|"polling"` config field
   - `PUBLIC_URL` env var
   - Platform calls `setWebhook` on channel create (with random `webhook_secret`), `deleteWebhook` on delete
   - Canvas toggle to enable webhook mode (only when PUBLIC_URL is set)
   - Polling works fine for ≤hundreds of bots; webhook needed at thousands+ scale or for serverless
10. **More channel adapters** — Slack (OAuth + Events API), Discord (Bot + Gateway), WhatsApp (Cloud API)
11. **Delegations list endpoint mismatch** — `GET /workspaces/:id/delegations` returns `[]` while the agent's internal `check_delegation_status` shows active/completed delegations. One source of truth.
12. **YAML-configurable per-agent repo access** — new `workspace_access: none|read_only|read_write` field in `org.yaml` + `:ro` bind-mount for research agents; eliminates the "PM couriers documents to reports" workaround.
13. **SDK executor swallows subprocess stderr** — `workspace-template/claude_sdk_executor.py` surfaces only "Command failed with exit code 1 / Check stderr output for details" when the `claude` CLI crashes, making every failure opaque. Capture stderr, log at ERROR, include first ~1 KB in the A2A error response. **High priority** — blocked real debugging during PLAN.md coordination on 2026-04-12.
14. **Agent MCP client defaults to `localhost:8080`** — inside a workspace container, `localhost` is the container itself, not the platform — so `mcp__molecule__*` tools fail with "platform unreachable." Inject `MOLECULE_URL=${PLATFORM_URL}` into every container at provision time and change the MCP client default to `http://host.docker.internal:8080`. **High priority** — blocks agents from calling platform tools (e.g. PM couldn't restart its own reports).

> Note: items 11–14 previously carried sequential refs `#64`–`#67`. Those refs were placeholder enumeration, not GitHub issues. They now collide with actual merged PRs and issues with different scopes, so the refs were removed in 2026-04-14 tick-5. If/when these items get prioritized, file real GitHub issues for them.

15. **Workspace `restart_prompt` — user-defined restart context (#19 Layer 2)** — GitHub issue **#66** (new 2026-04-14 tick-4 follow-up to PR #65 which shipped Layer 1). Let `config.yaml` / `org.yaml` declare a user-authored `restart_prompt` that is delivered alongside the platform-generated restart-context system message — e.g. "re-read your CLAUDE.md, re-hydrate TODOs from memory, resume the active delegation." Layer 1 (platform state snapshot) already ships; Layer 2 adds the user-defined side.

### Recently launched (2026-04-14 tick-4)
- **GitHub issue #15** — Provisioner: auto-refresh `CLAUDE_CODE_OAUTH_TOKEN` from `global_secrets` on workspace restart → **DONE** via PR #64 (`SetGlobal` / `DeleteGlobal` now fan out `RestartByID` to every affected workspace).
- **GitHub issue #19 Layer 1** — Platform-generated restart context → **DONE** via PR #65 (synthetic A2A `message/send` with `metadata.kind=restart_context`, `system:restart-context` caller prefix, 30s re-register wait). Layer 2 deferred to issue #66 (see Backlog item 15 above).

### Recently launched (2026-04-15 overnight sweep — ticks 17–30+, ~27 PRs)

**Security hardening cluster.** Roughly half the sweep was closing auth gaps surfaced by the Security Auditor's hourly audit cron:
- `#94` RFC-1918 + link-local in registry URL validator
- `#99` AdminAuth gate on `GET /workspaces` (topology leak / #104)
- `#106` path-sanitize + admin-gate `POST /org/import` (#103 HIGH)
- `#110` revoke `workspace_auth_tokens` on workspace delete
- `#119` IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests
- `#162` field-level authz on `PATCH /workspaces/:id` (#138 — cosmetic vs sensitive split)
- `#155` wire existing `SecurityHeaders` middleware into router
- `#167` gate 6 previously-unauth routes behind `AdminAuth` (#164 CRITICAL anon bundles/import; #165 HIGH events+bundles/export topology leak; #166 MED viewport+liveness)
- `#185` `AdminAuth` on `GET /approvals/pending` (#180)
- `#200` `AdminAuth` on `POST /templates/import` (#190 HIGH)
- `#203` `CanvasOrBearer` middleware — route-split for #168 canvas regression, only `PUT /canvas/viewport`; rejected PR #194's broader Origin-fallback approach because it would have re-opened #164
- `#209` source_id spoof defense in `activity.Report` (cherry-picked from the rejected #169 batch)
- `#233` `resolveInsideRoot` on `POST /workspaces template/runtime` (#226 MED)
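
The address checks behind `#94` and `#119` can be illustrated with a short blocklist sketch. The IPv6 ranges are the ones listed above; the IPv4 ranges are the standard RFC 1918 and link-local blocks. Everything else is an assumption: the function name `is_blocked_url` is hypothetical, the actual validator is part of the Go platform, and this sketch only inspects literal IP hosts, so a hostname that resolves to a private address would need an extra DNS step that is not shown.

```python
import ipaddress
from urllib.parse import urlsplit

BLOCKED_NETS = [
    ipaddress.ip_network(net)
    for net in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918 private ranges
        "169.254.0.0/16",                                  # IPv4 link-local
        "fe80::/10", "::1/128", "fc00::/7",                # IPv6 ranges from #119
    )
]


def is_blocked_url(url: str) -> bool:
    """Return True if the URL's host is a literal IP inside a blocked range."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no host at all: reject rather than guess
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; resolving it is outside this sketch
    # Membership tests across IP versions simply evaluate to False.
    return any(addr in net for net in BLOCKED_NETS)
```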

**Data integrity.** Three bugs that would have silently corrupted state:
- `#212` **CRITICAL** migration-runner bug — `RunMigrations` globbed `*.sql` and alphabetically ran `.down.sql` BEFORE `.up.sql` on every boot, wiping `workspace_auth_tokens` (and 018/019 pairs). Filter fix + unit test in `postgres_migrate_test.go`.
- `#224` YAML injection in `generateDefaultConfig` — body.Name now emitted as a double-quoted YAML scalar with all control chars escaped. Structural test (parse + verify key count).
- `#236` log-injection in the #209 security-event log line — attacker-controlled source_id echoed via `%s` allowed fake log entries; switched to `%q`.
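
The ordering problem in `#212` is easy to reproduce in isolation: sorted lexically, `.down.sql` precedes `.up.sql` for the same version prefix. The sketch below shows the kind of filter the fix describes. The file names and the helper `up_migrations` are made up for illustration, and keeping plain `.sql` files that lack an `.up` suffix is an assumption about legacy migrations, not something the plan states.

```python
def up_migrations(filenames: list[str]) -> list[str]:
    """Return only forward migrations, in lexical (version-prefix) order."""
    return sorted(
        name
        for name in filenames
        if name.endswith(".sql") and not name.endswith(".down.sql")
    )


# Hypothetical migration directory listing.
files = [
    "019_tokens.up.sql",
    "018_sessions.down.sql",
    "018_sessions.up.sql",
    "019_tokens.down.sql",
]

# A naive "glob and sort" runner would start with a rollback:
print(sorted(files)[0])       # 018_sessions.down.sql
# The filtered runner only ever sees forward migrations:
print(up_migrations(files))   # ['018_sessions.up.sql', '019_tokens.up.sql']
```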

**CI / infra.**
- `#186` + controlplane `#28` — every CI job migrated from `ubuntu-latest` to `[self-hosted, macos, arm64]` (Mac mini `self-hosted-runner`). Non-trivial: `services:` replaced with inline `docker run` containers (ports 15432/16379), `actions/setup-python` bypassed via Homebrew python3.11 on `$GITHUB_PATH`, `docker/setup-qemu-action` added for cross-arch builds. Workaround for GH Actions billing cap on private repos.
- `#149` independent heartbeat pulse goroutine so long cron fires don't look stale on `/admin/liveness` (#140)
- `#211` migration runner regression (see #212 above — PR #212 is the fix)
- **Fly registry `FLY_API_TOKEN`** rotated to a deploy token scoped to `molecule-tenant` (previously a personal token; rotated during the security incident remediation)

**Platform / Scheduler reliability.**
- `#95` panic-recover in scheduler `tick()` + per-fire goroutines (closes #85)
- `#207` concurrency-aware skip — `scheduler.fireSchedule` reads `workspaces.active_tasks` and advances `next_run_at` + records a `cron_run` row with `status='skipped'` instead of colliding with a busy agent (#115)
- `#206` surface `error_detail` in schedule history API (#152 problem B)
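
The concurrency-aware skip from `#207` can be sketched roughly as below. This is an illustrative Python rendering, not `scheduler.fireSchedule` itself: the `Schedule` dataclass, the fixed `interval` standing in for a parsed cron expression, the `dispatch` callback, and the `'fired'` status are all assumptions. The part taken from the plan is the behaviour: when the workspace has active tasks, advance `next_run_at` and record a run with `status='skipped'` instead of sending the prompt.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable


@dataclass
class Schedule:
    id: str
    next_run_at: datetime
    interval: timedelta  # stand-in for a parsed cron expression


def fire_schedule(
    schedule: Schedule,
    active_tasks: int,
    now: datetime,
    runs: list[dict],
    dispatch: Callable[[Schedule], None],
) -> str:
    """Fire a due schedule, or skip it when the agent is already busy."""
    # Advance first so a busy agent never accumulates a backlog of due fires.
    schedule.next_run_at = now + schedule.interval
    if active_tasks > 0:
        runs.append({"schedule_id": schedule.id, "status": "skipped"})
        return "skipped"
    dispatch(schedule)
    runs.append({"schedule_id": schedule.id, "status": "fired"})
    return "fired"
```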

**Workspace runtime features.**
- `#205` idle-loop reflection pattern — opt-in `idle_prompt` + `idle_interval_seconds` in `config.yaml`; self-sends when `heartbeat.active_tasks == 0`. Hermes/Letta shape.
- `#208` Hermes Phase 1 multi-provider registry — 15 providers via `adapters/hermes/providers.py` (Nous, OpenRouter, OpenAI, Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, Together, Fireworks, Mistral). 26 tests.
- `#198` A2A protocol compliance batch (#173/#174/#175): `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`, `stateTransitionHistory=True` in AgentCapabilities. **Regression:** `push_sender=PushNotificationSender()` crashed on startup because PushNotificationSender is abstract — reverted in #210.
- `#216` idle-loop pilot enabled on Technical Researcher workspace.
- `#225` + `#235` `auth_headers()` on `/registry/register` + initial_prompt + idle loop self-posts (#215/#220)
- `#231` Claude SDK stderr probe for proper rate-limit error attribution (#160 diagnostics)

**Controlplane (molecule-controlplane).**
- `#19`+`#20` Grafana Cloud remote-write counter registry (`cp_requests_total`), push loop to `prometheus-prod-32-prod-ca-east-0.grafana.net`, Basic auth with user 3116422
- `#21` AWS KMS envelope encryption — per-secret DEK via `GenerateDataKey`, dual-mode (v2 blobs via KMS, legacy via static key, auto-routes by leading byte)
- `#24` `/cp/status` deep probe for Betterstack
- `#26`+`#27` public `/legal/{terms,privacy,dpa,acceptable}` pages from embedded markdown + smoke coverage
- Isolation red-team test suite + observability runbooks (Grafana dashboard, Betterstack, Stripe Atlas)

**Self code-review follow-ups (`#228` + `#232`).** Ran `/code-review` on the batch merges, surfaced 8 🟡 issues, split into Go (#228) and Python/docs (#232):
- `CanvasOrBearer` invalid-bearer fall-through fix
- `short()` helper replacing unsafe `[:N]` slices in `scheduler.go`
- 6 new tests (`TestShort_helper`, `TestRecordSkipped_*`, `TestActivityHandler_Report_*`, `TestHistory_IncludesErrorDetail`)
- idle-loop hardening (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamp, typed exception handling, `add_done_callback` for fire-and-forget error logging)
- `idle_prompt` / `idle_interval_seconds` documented in `org.yaml` defaults
- New `docs/runbooks/admin-auth.md` — the three middleware variants + three-question test for adding to `CanvasOrBearer`

**Test counts post-sweep:** +70 Go (816 total), +40 Python (1180 total), +0 Canvas vitest (453 unchanged — UI/a11y patches only).

**Outstanding (user action):** `#126` Slack adapter (Phase-H product decision), `#160` Claude Max OAuth quota (wait for 2026-04-17 23:00Z reset OR upgrade OR switch to ANTHROPIC_API_KEY), `#191` runner persistent-state docs (P3), `#199` Fly registry token (**resolved** this session but publish-platform-image re-run pending runner), Stripe Atlas application (launch blocker, 2-week lead).

### Recently launched (2026-04-15 tick-9)
- **Phase 32 Phase B.2 (image pipeline)** — PR #80 adds `.github/workflows/publish-platform-image.yml`: on every main-merge touching `platform/**`, builds `platform/Dockerfile` and pushes `ghcr.io/molecule-ai/platform:latest` + `:sha-<commit>` to GHCR. Paired with the private `molecule-controlplane` Fly + Neon provisioner (PR #3 there) that reads `TENANT_IMAGE` env and boots tenant Fly Machines from this image. Tick-8 docs-sync PR #79 also landed.

### Recently launched (2026-04-14 tick-8)
- **Phase 32 PR #1** — `TenantGuard` middleware (PR #78). Public repo's only SaaS hook: when `MOLECULE_ORG_ID` env is set, non-allowlisted requests require matching `X-Molecule-Org-Id` header or 404. Unset → passthrough (self-hosted unchanged). Allowlist is exact-match: `/health` + `/metrics`. Paired with the private `Molecule-AI/molecule-controlplane` repo scaffolded this tick (Fly Machines provisioner stub, `/cp/orgs` CRUD, subdomain→fly-replay router, migrations 001-003 for `organizations`/`org_instances`/`org_members`). +6 `TestTenantGuard_*` tests. Phase 32 plan: follow-up PRs wire real Fly provisioner, WorkOS AuthKit, Stripe, Cloudflare, signup UX — all in the private repo except the single public middleware.
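
The `TenantGuard` decision described above reduces to three checks, sketched here in Python for illustration only. The real middleware is Go; the function `tenant_guard`, its arguments, and the plain-dict header lookup (which ignores HTTP header case-insensitivity) are assumptions. The rules themselves come from the entry: unset `MOLECULE_ORG_ID` passes everything through, `/health` and `/metrics` match exactly, and any other request needs a matching `X-Molecule-Org-Id` header or gets a 404.

```python
from typing import Mapping, Optional

ALLOWLIST = frozenset({"/health", "/metrics"})  # exact-match paths only


def tenant_guard(path: str, headers: Mapping[str, str], org_id: Optional[str]) -> int:
    """Return 200 to let the request through, or 404 to hide the tenant.

    `org_id` is the value of the MOLECULE_ORG_ID env var, or None when unset.
    """
    if not org_id:
        return 200  # self-hosted mode: the guard is a no-op
    if path in ALLOWLIST:
        return 200  # probes and metrics stay reachable without the header
    if headers.get("X-Molecule-Org-Id") == org_id:
        return 200
    return 404      # 404 rather than 403, so a wrong org learns nothing
```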

### Recently launched (2026-04-14 tick-7)
- **GitHub issue #24** — Runtime-added workspace_schedules drift on org re-import → **DONE** via PR #76 (new `source` column on `workspace_schedules` via migration `022`; org/import now upserts with `ON CONFLICT (workspace_id, name) DO UPDATE ... WHERE source='template'`, so runtime-added rows survive re-imports; legacy rows backfilled to `'template'`; +3 tests).
- **GitHub issue #51** — PM hardcoded audit-category routing → **DONE** via PR #75 (generic `category_routing:` block in `org-templates/<name>/org.yaml` `defaults` + per-workspace override; rendered into each workspace's `config.yaml` via `renderCategoryRoutingYAML` using `yaml.Node` + `yaml.Marshal` for safe escaping; PM prompt replaced with generic config-lookup; +6 tests).
- **PR #74** — `org-templates/molecule-dev/org.yaml` role overrides shrunk to just the deltas now that UNION semantics (PR #71) are in effect — removes verbose re-listing of defaults across PM, Research Lead, Research sub-roles, Security Auditor, UIUX Designer.

### Recently launched (2026-04-14 tick-6)
- **GitHub issue #68** — Per-workspace `plugins:` REPLACE semantics caveat → **DONE** via PR #71 (`mergePlugins` helper in `platform/internal/handlers/org.go` now UNIONs per-workspace with `defaults.plugins`; `!plugin` or `-plugin` prefix on a per-workspace entry opts a default out; +5 `TestPlugins_*` tests). Role overrides in `org-templates/*/org.yaml` can now declare just the delta instead of restating every default.
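
The UNION-with-opt-out rule can be sketched as below. This is a Python illustration of the semantics described for `mergePlugins`, not a port of the Go helper: the function name `merge_plugins`, the result ordering (defaults first, then additions), and the de-duplication are assumptions, and most plugin names in the usage lines are placeholders.

```python
def merge_plugins(defaults: list[str], overrides: list[str]) -> list[str]:
    """Union per-workspace plugins with the defaults, honouring opt-outs.

    An override entry starting with "!" or "-" removes that default;
    any other entry is added if it is not already present.
    """
    opted_out = {entry[1:] for entry in overrides if entry[:1] in ("!", "-")}
    merged = [name for name in defaults if name not in opted_out]
    for entry in overrides:
        if entry[:1] in ("!", "-"):
            continue  # opt-out markers never become plugins themselves
        if entry not in merged:
            merged.append(entry)
    return merged


defaults = ["safety-hooks", "memory-skills", "superpowers"]  # placeholder names
print(merge_plugins(defaults, ["molecule-workflow-triage"]))
print(merge_plugins(defaults, ["!memory-skills", "-superpowers"]))
```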

### Recently launched (2026-04-14 tick-5)
- **PR #70** — Wired the 12 modular plugins from PR #63 (tick-4) into the default `molecule-dev` org template. `defaults.plugins` expands from 3 → 9 (safety hooks + operational-memory skills become universal); PM role gains `molecule-workflow-triage` + `molecule-workflow-retro`, Security Auditor gains `molecule-skill-code-review` + `molecule-skill-cross-vendor-review` + `molecule-skill-llm-judge`. Verbose per-role re-listing is a consequence of REPLACE (not UNION) semantics in `platform/internal/handlers/org.go`; union-semantics proposal tracked as issue **#68**.
- **PR #69** — Backlog items 11–14 stripped of stale sequential refs `#64`–`#67` (see footnote near item 15 above).

---

## Test Coverage

| Stack | Tests | Framework |
|-------|-------|-----------|
| Go (platform) | 726 | `go test -race` (raw PASS lines incl. subtests; +6 top-level `Test*` this tick: #64 secrets auto-restart x2, #65 restart-context x4) |
| Python (workspace) | 1,140 | pytest |
| Canvas (frontend) | 357 | Vitest |
| SDK (python) | 132 | pytest |
| MCP server | 97 | Jest |
| **Total** | **2,452** | |

E2E: 67/67 comprehensive checks passing, 62/62 API tests (also gated in CI `e2e-api` job), shellcheck-clean across all 5 E2E scripts.

---

## Team Assignments

| Agent | Current Focus |
|-------|--------------|
| PM | Sprint coordination, backlog prioritization |
| Dev Lead | Engineering planning, PR review |
| UIUX Designer | UX specs for Phase 20 (DONE — 5 specs delivered) |
| Frontend Engineer | Phase 20.3 remaining items (org import, search, batch) |
| Backend Engineer | Sandbox production backends, API completeness |
| QA Engineer | **Review every PR for docs + plan compliance** |
| DevOps Engineer | CI/CD, Docker image optimization |
| Security Auditor | API key handling, path traversal, auth review |

---

## Next Steps

1. Frontend Engineer implements remaining Phase 20.3 items (org import from canvas, Cmd+K search)
2. Backend Engineer scopes Firecracker/E2B sandbox backends (Phase 12)
3. QA Engineer reviews PR #52 for docs compliance before merge
4. All agents use `GITHUB_TOKEN` env var to clone repo, branch, and create PRs

---

## Plugin Adaptor System — shipped; deferred follow-ups only

**The system is done.** Landed (see `feat/plugin-adaptor-registry` and `feat/agentskills-compliance`):
per-runtime plugin adaptors, hybrid resolver (registry > plugin-shipped >
raw-drop), `AgentskillsAdaptor` covering rule+skill plugins for all
runtimes, `/plugins?runtime=` filter, `/workspaces/:id/plugins/available`
endpoint, `molecule-plugin` SDK, gemini org parity with molecule-dev,
and **full agentskills.io spec compliance** for all first-party skills
(installable in Claude Code, Cursor, Codex, and ~35 other skill-compatible
tools — see `docs/plugins/agentskills-compat.md`).
|
||
|
||
Deferred, not blocking:
|
||
|
||
- **Upstream `runtime-adapters/` extension to agentskills.io spec** —
|
||
once we've lived with our own per-runtime adapter model for ~month,
|
||
propose it as a spec extension to `agentskills/agentskills` so other
|
||
tools can share Molecule AI-authored adaptors.
|
||
- **Install-from-GitHub-URL flow** — `POST /plugins/install {git_url}` that
|
||
clones a repo into the registry, validates the manifest, and runs the
|
||
adaptor through a sandbox. Needs signature/version pinning and a review
|
||
of the adaptor-execution threat model before shipping.
|
||
- **Promote-to-default UI** — today, promoting a community plugin to
|
||
"curated" means manually copying its `adapters/<runtime>.py` into
|
||
`workspace-template/plugins_registry/<plugin>/`. Later add a canvas
|
||
button + PR template that opens an upstream PR automatically.
|
||
- **Plugin packs** — manifest that lists other plugins to bundle
|
||
(`superpowers-pack` → install `superpowers-tdd` + `superpowers-debug` + …).
|
||
Skip until a real user asks; first-party plugins are small enough to
|
||
install individually today.
|
||
- **Hot-reload on DeepAgents** — upstream docs say skills/sub-agents are
|
||
startup-only; would need platform-level container restart on plugin
|
||
file change. Defer until users complain.
|
||
- **Atomic split of first-party plugins** — `superpowers` and `ecc` still
|
||
ship as multi-skill bundles. Pipeline already supports splitting but
|
||
non-urgent.
|
||
- **Sub-agent plugins for non-DeepAgents runtimes** — Claude Code /
|
||
LangGraph don't have a native sub-agent feature; emulating via
|
||
tool-routing is possible but invasive. Defer.
|
||
- **Workspace install tracking table** — a `workspace_plugin_installs`
|
||
table would let uninstall call the adaptor's `uninstall()` path
|
||
reliably. Today uninstall is a `rm -rf /configs/plugins/<name>` which
|
||
leaves copied skill dirs behind. Low user impact.
|
||
- **Shared org-template `system-prompt.md` via `_shared/`** — DRY molecule-dev
|
||
and molecule-worker-gemini. Drift risk; revisit at 3+ orgs.
|
||
|
||
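As a rough illustration of the plugin-packs item above, a pack could be flattened into an install list along these lines. This is only a sketch: the `includes` manifest field and the `expand_pack` helper are hypothetical and do not reflect an existing manifest schema.

```python
def expand_pack(name, manifests):
    """Flatten a plugin or pack into an ordered list of leaf plugins.

    `manifests` maps plugin name -> manifest dict. A pack is assumed to be a
    manifest carrying an `includes` list (hypothetical field name).
    """
    order = []        # leaf plugins, in install order, without duplicates
    visiting = set()  # packs currently being expanded (cycle guard)

    def visit(plugin):
        if plugin not in manifests:
            raise KeyError(f"unknown plugin: {plugin}")
        if plugin in visiting:
            raise ValueError(f"pack cycle at: {plugin}")
        includes = manifests[plugin].get("includes", [])
        if not includes:
            # Leaf plugin: schedule it once.
            if plugin not in order:
                order.append(plugin)
            return
        visiting.add(plugin)
        for child in includes:
            visit(child)
        visiting.discard(plugin)

    visit(name)
    return order
```

Rejecting cycles up front keeps a malformed community pack from looping the installer.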
## Phase 32 — Cloud SaaS launch (2026-Q2/Q3)

Goal: ship Molecule AI as a multi-tenant cloud SaaS (not just
self-hosted per-customer). Ordered by dependency + ROI.

### Current state (2026-04-15)

**Live infrastructure:**
- Control plane deployed: https://molecule-cp.fly.dev (Fly app `molecule-cp`, 2 machines, Neon project `molecule-cp` / `cool-sea-89357706`)
- Tenant app: Fly app `molecule-tenant` (Neon parent project `molecule-tenants` / `dawn-bar-08311714`, tenants get a branch per org)
- Shared Redis: Upstash `grateful-prawn-89393.upstash.io` (key-prefix isolation, Phase H moves to per-tenant)
- Container registry: `registry.fly.io/molecule-tenant:latest` (mirrored from `ghcr.io/molecule-ai/platform:latest` via GH Actions on every main push)
- First real tenant provisioned: org `acme` → Fly machine + Neon branch + encrypted URLs in `org_instances`
- WorkOS AuthKit live at `/cp/auth/{signup,login,callback,signout,me}` — hosted signup redirects correctly; see https://molecule-cp.fly.dev/cp/auth/signup
- Stripe billing scaffold deployed in orgs-only mode (no Stripe creds configured yet; webhook handler + signature verification code ready)
- Domain: `moleculesai.app` (DNS not yet wired — subdomain routing works via `X-Molecule-Org-Slug` header pending Cloudflare)

**Phase status (post 2026-04-15 overnight sweep):**
- **A — Foundation** (accounts, tokens, domain): ✅ done
- **B — Fly provisioner + Neon branching**: ✅ done
- **C — WorkOS AuthKit scaffold + RequireSession + org-ownership check**: ✅ done
- **D — Stripe billing scaffold + auth-scoped checkout + plan quotas**: ✅ code done; live keys pending Stripe Atlas
- **E — Cloudflare + DNS `*.moleculesai.app` + per-tenant Vercel canvas**: ✅ done
- **F — Sign-up UX + onboarding**: ✅ basic flow done (signup / org create / canvas redirect); polish + email pending
- **G — Observability + quotas + admin**: ✅ Sentry + Grafana remote-write + `/cp/status` Betterstack probe + per-org rate limiter; admin panel `/cp/admin/*` pending
- **H — Hardening**: ⏳ partial — AWS KMS envelope encryption ✅ (controlplane PR #21), tenant-isolation red-team CI gate ✅ (`isolation_test.go`), legal pages ✅ (`/legal/*` from controlplane PR #26); load test + Stripe Atlas application + status page custom domain pending
- **I — Launch**: pending Stripe Atlas (~2 week lead)

**Live infrastructure deltas (post-sweep):**
- Migration runner safety fix landed (#212) — `*.down.sql` filter; was wiping `workspace_auth_tokens` on every restart
- Workspace auth tokens now revoked on workspace delete (#110)
- All known unauth admin routes gated; #138 canvas regression resolved via field-level authz + `CanvasOrBearer` middleware
- Self-hosted Mac mini CI runner replaced GH-hosted Linux to bypass private-repo Actions billing cap; `FLY_API_TOKEN` replaced by a deploy token scoped to `molecule-tenant` as part of the security incident remediation
- `/legal/{terms,privacy,dpa,acceptable}` live at `https://app.moleculesai.app/legal/*`

**Known open issues on the live system:**
- Tenant `/workspaces` returns Neon pooler warnings (`unnamed prepared statement does not exist`) — lib/pq + Neon pooler incompatibility, tracked for lib/pq → pgx migration in a later phase
- `#160` Claude Max OAuth quota exhausted on the agent-fleet token until 2026-04-17 23:00 UTC; mitigations: wait, upgrade plan, OR switch workspace containers to `ANTHROPIC_API_KEY` env var
- `#191` self-hosted runner persistent-state docs (P3, low urgency)
- `#199` Fly registry token — **resolved** in the 2026-04-15 sweep but `publish-platform-image` re-run pending runner availability

**Companion repo:** `Molecule-AI/molecule-controlplane` (private). n8n-style open-core split: this public repo stays OSS (tenant binary + plugins + channels, contributable surface); control plane (orgs / signup / billing / provisioner / routing) is private. See `molecule-controlplane/PLAN.md` for its roadmap.


### Tier 1 — blocks multi-tenant launch

- [ ] **Multi-tenancy**: `organizations` table, `org_id` FK +
  `WHERE org_id = $caller_org` filter on every row-returning
  handler (`workspaces`, `workspace_secrets`, `global_secrets`,
  `activity_logs`, `structure_events`, `agent_memories`,
  `workspace_schedules`, `workspace_channels`). Middleware resolves
  caller's org from session token → ctx. Full security audit of
  tenant isolation before first external user.
- [ ] **Human auth + orgs**: **WorkOS AuthKit** (NOT build-yourself,
  NOT Clerk — WorkOS treats per-org SSO as first-class; Clerk
  treats it as an upsell). Keep Phase 30.1 bearer tokens for
  machine-to-machine (agents). Stripe integration via WorkOS hooks.
- [ ] **Container isolation**: replace raw-Docker-socket provisioner
  with **Fly Machines API** (Firecracker microVMs, per-workspace
  isolation, sub-second boot, pay-per-second). Today's shared
  `/var/run/docker.sock` is an RCE-to-host footgun that cannot ship
  multi-tenant. `provisioner` interface stays — only backend swaps.
  Docker path remains for local dev.
- [ ] **Stripe billing**: subscriptions + usage metering
  (workspace-hours, LLM-token pass-through, storage), trial flow,
  dunning, invoices.
- [ ] **Per-org resource quotas**: tier memory/CPU is configurable
  (PR #58) but unenforced at provision time. Add per-org ceilings:
  max workspaces, max concurrent-running, max total memory.
- [ ] **Managed Postgres + Redis**: move off `docker-compose` for
  prod. **Neon** (serverless, branch-per-PR) for Postgres; **Upstash**
  for Redis. Alternative: drop Redis entirely — `LISTEN/NOTIFY`
  + advisory locks cover heartbeat TTL + URL cache.
- [ ] **Secrets at rest via KMS**: current `SECRETS_ENCRYPTION_KEY`
  is a single static AES-256 key. Move to **AWS/GCP KMS**-backed
  envelope encryption; the `secrets_encryption_version` table slot
  is already reserved for rotation.
- [ ] **Migration runner out of app boot**: a bad migration
  currently crashes platform boot with no rollback. Extract to
  **goose** as a release step / init container. Auto-discovery
  runner stays for dev mode only.

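The multi-tenancy item above boils down to every row-returning query carrying the caller's org as a bound parameter. A minimal sketch of that pattern follows, using the standard-library `sqlite3` module as a stand-in for Postgres. The reduced schema, the demo data, and the `list_workspaces` helper are assumptions for illustration, not the platform's handlers.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with two orgs (illustrative schema only)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE organizations (id INTEGER PRIMARY KEY, slug TEXT UNIQUE)")
    db.execute(
        "CREATE TABLE workspaces ("
        " id INTEGER PRIMARY KEY,"
        " org_id INTEGER NOT NULL REFERENCES organizations(id),"
        " name TEXT NOT NULL)"
    )
    db.executemany(
        "INSERT INTO organizations (id, slug) VALUES (?, ?)",
        [(1, "acme"), (2, "globex")],
    )
    db.executemany(
        "INSERT INTO workspaces (org_id, name) VALUES (?, ?)",
        [(1, "alpha"), (1, "beta"), (2, "gamma")],
    )
    return db


def list_workspaces(db, caller_org_id):
    """Return workspace names for the caller's org only.

    The org ID must come from the authenticated session (resolved by
    middleware), never from a request parameter.
    """
    rows = db.execute(
        "SELECT name FROM workspaces WHERE org_id = ? ORDER BY name",
        (caller_org_id,),
    )
    return [name for (name,) in rows]
```

The same shape doubles as an automated isolation check: seed two orgs, query as each, and assert that neither sees the other's rows.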
### Tier 1 follow-ups (before customer #1)

- [ ] **Observability**: wire `/metrics` to a scraper (Grafana
  Cloud or self-hosted). Add **Sentry** for Go + Next.js error
  tracking. Langfuse stays for LLM traces.
- [ ] **Rate limiting per-org**: global `RATE_LIMIT=600/min` is a
  shared bucket today. Needs per-org + per-endpoint buckets.
- [ ] **Cloudflare in front**: WAF + CDN + DDoS. Free tier covers
  pre-revenue.
- [ ] **Sign-up / onboarding flow**: landing → signup → first
  workspace in 60 seconds. No such flow today.
- [ ] **Transactional email**: Resend or Postmark.
- [ ] **Admin panel**: view orgs, suspend accounts, see usage,
  issue refunds. SQL-only at first; UI by ~50 orgs.
- [ ] **Privacy policy + ToS + DPA**: real ones, vetted. GDPR /
  CCPA data-export + deletion endpoints (workspace-export already
  exists; need org-level).

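To make the per-org + per-endpoint bucket idea above concrete, here is a small token-bucket sketch keyed by `(org_id, endpoint)`. It is an in-memory illustration under assumed names (`OrgRateLimiter`, `allow`); a multi-machine deployment would need shared state, which this does not address.

```python
import time


class OrgRateLimiter:
    """Token bucket per (org, endpoint) pair, refilled continuously."""

    def __init__(self, limit_per_min=600, clock=time.monotonic):
        self.capacity = float(limit_per_min)
        self.refill_per_sec = limit_per_min / 60.0
        self.clock = clock  # injectable so tests can control time
        self.buckets = {}   # (org_id, endpoint) -> (tokens, last_seen)

    def allow(self, org_id, endpoint):
        """Return True and spend one token if the bucket has capacity."""
        key = (org_id, endpoint)
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Credit tokens for the time elapsed since the last request.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens < 1.0:
            self.buckets[key] = (tokens, now)
            return False
        self.buckets[key] = (tokens - 1.0, now)
        return True
```

Because each key has its own bucket, one noisy org (or one hot endpoint) exhausts only its own allowance.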
### Tier 2 — tech-stack upgrades (high ROI, non-blocking)

- [ ] **Go platform**: migrate `lib/pq` → **pgx/v5** (1–2 days;
  `lib/pq` in maintenance since ~2021). Then **sqlc** incrementally
  for new queries — keeps the no-ORM philosophy + typed Go.
- [ ] **Platform async: River** (Postgres-backed, Go-native job
  queue). Delegation dispatch, `workspace_schedules` cron, future
  billing events + webhook fan-out all migrate cleanly. **NOT**
  Temporal — Temporal already ships in workspace-template as an
  agent tool; keep the separation.
- [ ] **Frontend: TanStack Query** for server state. Zustand keeps
  pure UI state. Stops reimplementing cache / refetch / dedup. WS
  updates flow via `qc.setQueryData`. Single highest-ROI frontend
  refactor.
- [ ] **Turbopack for `next build`**: one flag, 2–5× cold-build
  speedup.
- [ ] **Python workspace runtime → uv**: `uv pip install` in
  `entrypoint.sh` cuts workspace cold-start 10–100×. User-visible
  latency win.
- [ ] **Python MCP client inside runtime**: today `mcp-server/`
  exposes the platform as an MCP server; agents inside workspaces
  can't yet consume external MCP servers. Closing the gap joins
  the winning 2026 ecosystem.
- [ ] **shadcn/ui CLI convention**: already Radix + Tailwind;
  adopt `npx shadcn add …` passively for new components.
  No rewrite.

### Tier 3 — explicitly NOT doing

- **Kubernetes**: company-of-one cannot run K8s. Fly Machines
  covers isolation without the ops tax.
- **ORM** (GORM / ent / bun): raw-SQL + sqlc covers every case.
- **Framework swap** (Next → Vite / TanStack Start): 2-week
  rewrite buys nothing users see.
- **Auth-from-scratch**: every hour on auth is an hour not on
  product.
- **Canvas library swap** (xyflow → tldraw): xyflow is still the
  correct tool for typed node graphs.

### Tier 4 — compliance / enterprise (when revenue lands)

- [ ] SOC 2 via Drata / Vanta
- [ ] Status page (Betterstack or Instatus)
- [ ] Staging environment that mirrors prod
- [ ] Blue-green / canary deploy pipeline
- [ ] Per-org backup + point-in-time restore
- [ ] Load testing (`hey` / `vegeta`) — current per-node ceiling
  unknown

### Success criteria for Phase 32

- Customer can sign up at moleculesai.app, create an org, deploy their
  first workspace, send their first message in < 5 minutes.
- Two orgs on the same cluster cannot observe each other's
  workspaces, secrets, memory, or activity — verified by automated
  tenant-isolation test + manual red-team.
- Fly Machines cost per active workspace-hour documented and
  reproducible.
- Stripe-backed subscription + usage-based add-ons working end-to-end
  in sandbox.
- One paying design partner on the cluster, paying a real invoice.

---

## Phase 34: Partner API Keys — Programmatic Org Management

> **Goal:** Enable partner platforms, CI/CD pipelines, and automation tools to
> create and manage orgs via API without a browser session. Critical for
> partner integrations, marketplace resellers, and internal testing.
>
> **Docs:** `docs/architecture/partner-api-keys.md`

### Phase 34.1 — Core infrastructure

- [ ] Migration: `partner_api_keys` table (key_hash, scopes, org_id, rate_limit)
- [ ] `internal/auth/partner_keys.go` — key validation, SHA-256 hashing, scope check
- [ ] Update `auth.Middleware` — check `Bearer mol_pk_*` before WorkOS session
- [ ] Scope enforcement helpers — `RequireScope("orgs:create")` per handler

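The key-validation flow listed above (hash on creation, look up by hash, then check scope) can be sketched as follows. This is a Python illustration of the idea, not the planned `partner_keys.go`: the helper names, the in-memory `stored` dict standing in for the `partner_api_keys` table, and the `revoked` flag are all assumptions.

```python
import hashlib
import secrets

KEY_PREFIX = "mol_pk_"


def hash_key(plaintext):
    """SHA-256 hex digest of the key; only this value would be persisted."""
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()


def generate_key():
    """Return (plaintext, key_hash). The plaintext is shown to the caller once."""
    plaintext = KEY_PREFIX + secrets.token_urlsafe(32)
    return plaintext, hash_key(plaintext)


def authorize(bearer, stored, required_scope):
    """Check a presented key against stored records.

    `stored` maps key_hash -> {"scopes": [...], "revoked": bool} and stands in
    for a table lookup by hash.
    """
    if not bearer.startswith(KEY_PREFIX):
        return False  # not a partner key; fall through to session auth
    record = stored.get(hash_key(bearer))
    if record is None or record["revoked"]:
        return False
    return required_scope in record["scopes"]
```

Storing only the digest means a database leak does not expose usable keys, which is why the create endpoint can return the plaintext only once.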
### Phase 34.2 — Admin endpoints

- [ ] `POST /cp/admin/partner-keys` — create key (returns plaintext once)
- [ ] `GET /cp/admin/partner-keys` — list keys (prefix + metadata only)
- [ ] `DELETE /cp/admin/partner-keys/:id` — revoke key

### Phase 34.3 — Rate limiting + audit

- [ ] Per-key rate limiter (separate from session rate limit)
- [ ] `last_used_at` tracking on each request
- [ ] Add `mol_pk_` to pre-commit secret scanner

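For the secret-scanner item above, a pattern check might look like the sketch below. The key body shape (20 or more URL-safe characters after `mol_pk_`) is an assumption, since the key format is not specified here, and `find_partner_keys` is a hypothetical helper rather than part of any existing scanner.

```python
import re

# Assumed shape: the `mol_pk_` prefix followed by 20+ URL-safe characters.
PARTNER_KEY_RE = re.compile(r"mol_pk_[A-Za-z0-9_-]{20,}")


def find_partner_keys(text):
    """Return (line_number, redacted_match) pairs for suspected partner keys."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PARTNER_KEY_RE.finditer(line):
            # Report only the prefix plus four characters, never the full key.
            hits.append((lineno, match.group(0)[:11] + "..."))
    return hits
```

Redacting the match keeps the scanner's own output from becoming a second place where the key leaks.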
### Phase 34.4 — Partner onboarding

- [ ] Partner onboarding guide (docs)
- [ ] Example: create org → poll status → redirect user to tenant
- [ ] Example: CI/CD test org lifecycle (create → test → delete)

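The "create org → poll status" example above might take roughly this shape on the partner side. The status endpoint and its values are not defined in this plan, so the sketch takes a `get_status` callable instead of naming a URL, and the `"ready"` / `"failed"` strings are placeholders.

```python
import time


def wait_for_org(get_status, slug, timeout_s=300.0, interval_s=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the org is ready.

    Returns True when ready, False on timeout, and raises RuntimeError if
    provisioning reports failure. `get_status(slug)` is supplied by the
    caller (for example, a thin wrapper around an HTTP GET).
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status(slug)
        if status == "ready":
            return True
        if status == "failed":
            raise RuntimeError(f"provisioning failed for {slug}")
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A partner would call this after a successful `POST /cp/orgs` and redirect the user to the tenant only once it returns True.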
### Success criteria for Phase 34

- Partner can `POST /cp/orgs` with an API key and get a provisioned org
- Org-scoped keys cannot access other orgs
- Revoked keys immediately return 401
- Rate limiting prevents abuse
- Full audit trail: who created which key, when last used

---

## Phase 36: Full Staging Environment — GATES ALL INFRA CHANGES

> **Goal:** Stop merging untested infra changes to production. Every change
> ships to staging first, gets verified, then promotes to production.
>
> **Why now:** The 2026-04-17 session broke CI twice and caused hours of
> edge cache issues because there was no staging to catch regressions.
> This gates Phase 33 (Tunnel migration) and Phase 35 (security hardening).
>
> **Docs:** `docs/architecture/staging-environment.md`

### Phase 36.1 — Railway + Neon staging

- [ ] Create Railway `staging` environment with staging-specific vars
- [ ] Create Neon staging branch from main
- [ ] Add `staging.api.moleculesai.app` CNAME to Railway staging
- [ ] Verify CP deploys and boots on staging

### Phase 36.2 — Image + deploy pipeline

- [ ] Publish workflow pushes `:staging` tag (not `:latest`) on main merge
- [ ] Add `promote-to-production.yml` workflow (manual trigger)
- [ ] Promotion: retag `:staging` → `:latest`, deploy CP to production
- [ ] Production tenants auto-update via Option B cron

### Phase 36.3 — Staging DNS + Vercel

- [ ] `*.staging.moleculesai.app` for staging tenant subdomains
- [ ] `staging.app.moleculesai.app` for Vercel staging preview
- [ ] Staging Cloudflare Tunnel (or Worker) for tenant routing

### Phase 36.4 — Automated verification

- [ ] Post-deploy staging smoke test (run `test_saas_tenant.sh`)
- [ ] Block promotion if smoke test fails
- [ ] Slack/GitHub notification on staging deploy + promotion

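The "block promotion if smoke test fails" rule above reduces to gating one action on an exit code. In CI this would most likely be a workflow step; the Python sketch below only illustrates the logic, and `promote_if_green` plus the injected `promote` callable are hypothetical names.

```python
import subprocess


def promote_if_green(smoke_cmd, promote):
    """Run the smoke test command and promote only when it exits 0.

    `smoke_cmd` is an argv list (for example, the path to the smoke test
    script). `promote` is a zero-argument callable that performs the retag
    and deploy. Returns True if promotion ran, False if it was blocked.
    """
    result = subprocess.run(smoke_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return False
    promote()
    return True
```

Keeping promotion behind a single function makes it harder for a manual retag to bypass the smoke test.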
### Success criteria for Phase 36

- No infra change reaches production without passing staging first
- Staging mirrors production (same services, same auth, separate data)
- Promotion is a single manual action (button click or CLI command)
- Staging cleanup is automated (terminate test EC2s after verification)

---

## Phase 33: Tenant Subdomain Routing — MIGRATING TO CLOUDFLARE TUNNEL

> **Original:** Wildcard DNS + Cloudflare Worker (implemented 2026-04-17).
> **Replacing with:** Cloudflare Tunnel per tenant (issue #933).
> Worker approach caused edge cache poisoning + security gaps (ADMIN_TOKEN
> in plaintext, unencrypted HTTP). Tunnel eliminates all of these.
> **Docs:** `docs/architecture/wildcard-dns-proxy.md` (original),
> issue #933 (tunnel migration plan).
> **Prerequisite:** Phase 36 (staging) — test tunnel on staging first.

### Phase 33.1 — Worker + wildcard DNS (no tenant changes)

- [ ] Create Cloudflare Worker that extracts slug from hostname, looks up
  backend IP from CP API, proxies request to EC2
- [ ] Add `GET /cp/orgs/:slug/instance` endpoint to CP (public, rate-limited)
- [ ] Add `*.moleculesai.app` wildcard DNS record (proxied, orange cloud)
- [ ] Worker serves static "provisioning" splash page when tenant not ready
- [ ] Deploy Worker via `wrangler deploy` + GitHub Actions
- [ ] Verify Worker routing works for existing tenants alongside old A records

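The slug-extraction step in the first item above is where hostname spoofing has to be ruled out. The sketch below shows the parsing logic in Python for illustration (the Worker itself would be JavaScript or TypeScript). The reserved-label set is an assumption, and `slug_from_host` is a hypothetical helper.

```python
BASE_DOMAIN = "moleculesai.app"
# Labels assumed to be reserved for non-tenant hosts.
RESERVED = {"app", "api", "www", "staging"}


def slug_from_host(host):
    """Return the tenant slug for `<slug>.moleculesai.app`, else None."""
    # Drop any port, normalise case, and strip a trailing root dot.
    host = host.split(":", 1)[0].lower().rstrip(".")
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        return None  # apex domain or a foreign host
    label = host[: -len(suffix)]
    # Exactly one label is allowed, and it must not be a reserved name.
    if not label or "." in label or label in RESERVED:
        return None
    return label
```

Matching on the full suffix, rather than searching for the domain anywhere in the string, is what rejects hosts such as `acme.moleculesai.app.evil.com`.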
### Phase 33.2 — Stop per-tenant DNS records

- [ ] Remove Cloudflare A record creation from `ec2.go` provisioner
- [ ] Remove Cloudflare DNS cleanup from deprovision/purge cascade
- [ ] Existing A records coexist harmlessly (explicit wins over wildcard)

### Phase 33.3 — Remove Caddy from EC2

- [ ] Worker handles TLS termination — EC2 runs plain HTTP only
- [ ] Remove Caddy install + Caddyfile from EC2 user-data script
- [ ] EC2 security group: allow inbound HTTP from Cloudflare IPs only
- [ ] ~30s faster cold start (no apt-get caddy, no Let's Encrypt)

### Phase 33.4 — Cleanup

- [ ] Delete old per-tenant A records from Cloudflare
- [ ] Remove `cloudflareapi/` package from CP (Worker replaces it)
- [ ] Update `docs/runbooks/saas-secrets.md` with Worker secrets

### Success criteria for Phase 33

- New org subdomain resolves instantly (zero DNS wait)
- No NXDOMAIN caching — user never sees "site can't be reached"
- Provisioning splash page shown while EC2 boots (auto-refreshes)
- Cold start ~30s faster (no Caddy/Let's Encrypt)
- Cost: Cloudflare Worker free tier or $5/mo

---

## Phase 35: SaaS Production Hardening (post-2026-04-17 retrospective)

> **Goal:** Address security gaps, remove debug code, fix workspace
> registration, and reduce boot time identified during the SaaS buildout
> session. See `docs/retrospectives/2026-04-17-saas-buildout.md` for full
> context.

### Phase 35.1 — Security (CRITICAL, before any public launch)

- [ ] Fix #756 — X-Workspace-ID header forge bypasses CanCommunicate
  (derive callerID from authenticated token, not raw header)
- [ ] Fix #757 — GLOBAL memory poisoning mitigations (content delimiters
  + audit log at minimum)
- [ ] Remove ADMIN_TOKEN from public `/cp/orgs/:slug/instance` endpoint —
  store in Worker KV at provision time instead
- [ ] Encrypt ADMIN_TOKEN in `org_instances` table (use envelope key)
- [ ] Remove debug HTTP server (:9999) from workspace boot script
- [ ] Remove `set -ex` from boot scripts (leaks env vars to EC2 console)
- [ ] Restrict workspace EC2 security group (Cloudflare IPs + tenant IP only)
- [ ] Add HTTPS between Worker and EC2 (or Cloudflare Tunnel)

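The #756 fix above comes down to treating the bearer token as the only source of identity. A minimal sketch of that rule follows. The `resolve_caller_id` helper and the plain token-to-workspace dict are illustrative stand-ins (a real store would hold hashes), and rejecting a mismatched `X-Workspace-ID` outright is one possible policy, not necessarily the one the fix will adopt.

```python
def resolve_caller_id(headers, token_to_workspace):
    """Return the workspace ID bound to the bearer token, or None.

    The X-Workspace-ID header is never trusted as identity; it is only
    compared against the token-derived ID.
    """
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return None
    caller_id = token_to_workspace.get(auth[len("Bearer "):])
    if caller_id is None:
        return None
    claimed = headers.get("X-Workspace-ID")
    if claimed is not None and claimed != caller_id:
        return None  # header disagrees with the token: treat as forged
    return caller_id
```

Any communication check (such as CanCommunicate) would then receive the returned ID rather than reading the header itself.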
### Phase 35.2 — Workspace registration fix

- [ ] Pass workspace auth token in EC2 boot script env so runtime can
  register with `POST /registry/register`
- [ ] Or: have runtime request a token at startup via
  `GET /admin/workspaces/:id/test-token`
- [ ] Verify workspace status flips to "online" on Canvas after boot
- [ ] Test full Canvas flow: deploy → STARTING → online → chat works

### Phase 35.3 — Boot time optimization

- [ ] Pre-baked AMI per runtime (Packer or EC2 Image Builder):
  - `ami-hermes`: Python + openai + anthropic + molecule-runtime + hermes adapter
  - `ami-claude-code`: Node + claude-code SDK + molecule-runtime
  - `ami-langgraph`: Python + langchain + langgraph + molecule-runtime
- [ ] Runtime switch = launch from different AMI. Boot ~30s vs current ~9 min
- [ ] Remove apt-get + pip install from boot script (only config + secrets + start)

### Phase 35.4 — Stability + CI

- [ ] Fix go.mod replace directive (PR #900) — unblocks all CI
- [ ] Use stable origin IP for wildcard DNS (dedicated proxy or Tunnel)
- [ ] Add workspace boot integration test to CI
- [ ] Add SaaS tenant smoke test (`tests/e2e/test_saas_tenant.sh`) to CI
- [ ] Clean up Cloudflare edge cache poisoning from the 2026-04-17 session
  (or wait ~24h for natural expiry)

---

## Infra footnote — Temporal

`docker-compose.infra.yml` now includes Temporal (`:7233` gRPC, `:8233` Web
UI) backing `workspace-template/builtin_tools/temporal_workflow.py` for
durable long-running agent workflows. All infra services share the
`molecule-monorepo-net` Docker network, which `infra/scripts/setup.sh`
creates idempotently. Temporal currently runs with **no auth** on
`0.0.0.0:7233` — dev-only; any production deployment must front it with
mTLS, API keys, or a reverse proxy before exposing the cluster.