docs: sync CLAUDE.md + PLAN.md + edit-history with 2026-04-15 overnight sweep

Captures ~27 PRs merged across both repos this session: security
hardening cluster (#94/#99/#106/#110/#119/#162/#155/#167/#185/#200/#203/
#209/#233), data-integrity fixes (#212/#224/#236), CI runner migration
(#186), platform/scheduler reliability (#95/#149/#207/#206), workspace
runtime features (#205/#208/#198/#216/#225/#235/#231), code-review
follow-ups (#228/#232).

Updated counts: 816 Go (+70), 1180 Python (+40), 453 vitest (unchanged
— UI/a11y patches), 97 jest (unchanged).

CLAUDE.md additions:
- Idle Loop section (#205) under Architectural Patterns
- Admin auth middleware variants section linking docs/runbooks/admin-auth.md
- Migration runner section explaining the .down.sql filter (#212)
- Per-route auth notes in the API table (PATCH field-whitelist, CanvasOrBearer
  on PUT /canvas/viewport, AdminAuth on bundles/events/templates-import/
  approvals-pending/admin-liveness)
- Database section updated with workspace_auth_tokens auto-revoke (#110),
  scheduler.error_detail surfacing (#206), workspace_schedules.last_status
  'skipped' state (#207)

PLAN.md additions:
- New Recently launched (overnight sweep) section with full PR/issue index
- Phase status updated (B–G now complete, H partial)
- Live infrastructure deltas (migration fix, token rotation, legal pages)
- Outstanding items consolidated

Edit-history file expanded from the tick-9 stub to a full session record
covering malware cleanup, CI runner migration, security cluster, data
integrity, infra/feature/code-review batches, and outstanding user
actions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hongming Wang 2026-04-15 12:16:24 -07:00
parent eb6796042b
commit fda2b56532
3 changed files with 335 additions and 21 deletions


@ -225,9 +225,9 @@ OPENAI_API_KEY=... bash scripts/test-team-e2e.sh # E2E: Multi-template
### Unit Tests
```bash
cd platform && go test -race ./... # 746 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth, middleware — sqlmock + miniredis; +6 on 2026-04-14 tick-8 for TestTenantGuard_* covering MOLECULE_ORG_ID passthrough/match/mismatch/missing/allowlist/exact-match (#78, Phase 32 PR #1); prior: +9 tick-7 for category_routing + schedules.source; +5 tick-6 for plugins UNION; +6 tick-4 for auto-restart + restart-context branches)
cd canvas && npm test # 357 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements)
cd workspace-template && python -m pytest -v # 1140 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging)
cd platform && go test -race ./... # 816 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth, middleware, scheduler, crypto, db — sqlmock + miniredis; +70 on 2026-04-15 overnight sweep across the security fix cluster: CanvasOrBearer middleware tests, scheduler recordSkipped + short() helper, source_id spoof rejection + log injection regression guards, YAML-parse structural injection tests, migration runner .down.sql filter, plugins_install_pipeline_test.go 37-case suite, resolveInsideRoot coverage; +6 on 2026-04-14 tick-8 for TestTenantGuard_*; prior: +9 tick-7 for category_routing + schedules.source)
cd canvas && npm test # 453 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements, WCAG critical batch — ARIA live toasts + dialog focus trap + keyboard nav)
cd workspace-template && python -m pytest -v # 1180 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging, Hermes multi-provider registry 26 tests, a2a_executor cancel emits canceled event, idle loop + initial_prompt auth_headers())
cd sdk/python && python -m pytest -v # 132 SDK tests (agentskills.io spec validator, CLI, AgentskillsAdaptor round-trip, workspace/org/channel validators, RemoteAgentClient Phase 30 flows)
cd mcp-server && npm test # 97 Jest tests (per-domain tool modules + smoke test on tool count)
```
@ -343,6 +343,19 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
**Important:** Initial prompts must NOT send A2A messages (delegate_task, send_message_to_user) — other agents may not be ready. Keep them local: clone repo, read docs, save to memory, wait for tasks.
### Idle Loop (#205 — reflection-on-completion)
Opt-in pattern: when `idle_prompt` is non-empty in `config.yaml`, the workspace self-sends it every `idle_interval_seconds` (default 600) **while `heartbeat.active_tasks == 0`**. Hermes/Letta shape from the 2026-04-15 agent-framework survey. Cost collapses to event-driven — the idle check is local (no LLM call) and the prompt only fires when there's genuinely nothing to do. Set per-workspace or per org.yaml default. Fire timeout clamps to `max(60, min(300, idle_interval_seconds))`. Both the idle loop and `initial_prompt` self-posts include `auth_headers()` so they work in multi-tenant mode (#220 / PR #235). Pilot enabled on Technical Researcher (#216).
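
The fire-timeout clamp described above, `max(60, min(300, idle_interval_seconds))`, can be sketched as follows. This is an illustration only: the real idle loop lives in the Python workspace template, and the function name `fireTimeoutSeconds` is hypothetical. Go is used here purely for consistency with the platform code.

```go
package main

import "fmt"

// fireTimeoutSeconds clamps the idle-fire timeout to the range [60, 300],
// which is equivalent to max(60, min(300, idleIntervalSeconds)).
// The name is hypothetical; only the formula comes from the docs.
func fireTimeoutSeconds(idleIntervalSeconds int) int {
	timeout := idleIntervalSeconds
	if timeout > 300 {
		timeout = 300 // upper bound: never wait longer than 5 minutes
	}
	if timeout < 60 {
		timeout = 60 // lower bound: always allow at least 1 minute
	}
	return timeout
}

func main() {
	for _, interval := range []int{30, 120, 600} {
		fmt.Printf("interval=%ds -> fire timeout=%ds\n", interval, fireTimeoutSeconds(interval))
	}
}
```

With the default interval of 600 seconds the timeout therefore sits at the 300-second ceiling.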
### Admin auth middleware variants
Three Gin middleware classes gate server-side routes — pick the right one. Full contract in `docs/runbooks/admin-auth.md`.
- **`middleware.AdminAuth(db.DB)`** — strict bearer-only. Used for any route where a forged request could leak prompts/memory, create/mutate workspaces, or leak ops intel. Lazy-bootstrap fail-open when `HasAnyLiveTokenGlobal` returns 0.
- **`middleware.CanvasOrBearer(db.DB)`** — accepts bearer OR Origin matching `CORS_ORIGINS`. Used ONLY for cosmetic routes where a forged request has zero data/security impact. Currently only on `PUT /canvas/viewport`. **Do not extend** without rereading the runbook — PR #194 was rejected because adding this to `/bundles/import` would have re-opened #164 CRITICAL.
- **`middleware.WorkspaceAuth(db.DB)`** — binds a bearer to `:id`. Workspace A's token cannot hit workspace B's sub-routes. Used for the entire `/workspaces/:id/*` group except the A2A proxy (which has its own `CanCommunicate` layer).
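
A framework-free sketch of the `CanvasOrBearer` decision may help when reviewing routes. It is not the Gin middleware itself: the `request` type, the `validToken` callback, and the rule that a presented but invalid bearer is rejected outright (rather than falling through to the Origin check) are assumptions made for illustration.

```go
package main

import "fmt"

// request is a hypothetical stand-in for the two headers the decision reads.
type request struct {
	Bearer string // token from the Authorization header, "" if absent
	Origin string // value of the Origin header, "" if absent
}

// canvasOrBearerAllows sketches "bearer OR Origin matching CORS_ORIGINS".
// Assumption: if a bearer is presented it must be valid on its own.
func canvasOrBearerAllows(r request, validToken func(string) bool, corsOrigins []string) bool {
	if r.Bearer != "" {
		return validToken(r.Bearer)
	}
	for _, allowed := range corsOrigins {
		if r.Origin != "" && r.Origin == allowed {
			return true
		}
	}
	return false
}

func main() {
	valid := func(tok string) bool { return tok == "good-token" }
	origins := []string{"https://canvas.example.test"}
	cases := []request{
		{Bearer: "good-token"},
		{Bearer: "forged", Origin: "https://canvas.example.test"},
		{Origin: "https://canvas.example.test"},
		{Origin: "https://evil.example.test"},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> allowed=%v\n", c, canvasOrBearerAllows(c, valid, origins))
	}
}
```

Because any non-browser client can set the Origin header, the Origin branch only makes sense on routes where a forged request is harmless, which is why the list above restricts it to cosmetic routes.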
### Migration runner (`platform/internal/db/postgres.go`)
`RunMigrations` globs `*.sql` in `migrationsDir`, filters out `.down.sql` files, sorts alphabetically, then `DB.Exec()`s each on boot. The filter is load-bearing: before PR #212 every boot ran `.down.sql` **before** `.up.sql` (alphabetical sort puts "d" before "u"), wiping `workspace_auth_tokens` + other pair-migration tables and silently regressing AdminAuth to fail-open. All `.up.sql` files must be **idempotent** (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... IF NOT EXISTS`) because the runner re-applies every migration on every boot. A proper `schema_migrations` tracking table is tracked as a Phase-H cleanup.
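
To make the ordering problem concrete, here is a minimal sketch of the filter-then-sort step. The helper name `upMigrations` and the file names are hypothetical; the actual `RunMigrations` code is not reproduced here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// upMigrations keeps *.sql files, drops *.down.sql, and sorts the rest.
// Hypothetical helper illustrating the filter described above.
func upMigrations(files []string) []string {
	var out []string
	for _, f := range files {
		if !strings.HasSuffix(f, ".sql") || strings.HasSuffix(f, ".down.sql") {
			continue
		}
		out = append(out, f)
	}
	sort.Strings(out)
	return out
}

func main() {
	files := []string{"018_example.up.sql", "018_example.down.sql", "001_init.sql"}

	// Without the filter, a plain alphabetical sort puts ".down.sql" ahead
	// of ".up.sql" for the same prefix, because "d" sorts before "u".
	naive := append([]string(nil), files...)
	sort.Strings(naive)
	fmt.Println("naive order:   ", naive)
	fmt.Println("filtered order:", upMigrations(files))
}
```

The naive order shows why an unfiltered runner would run the teardown of a pair immediately before its setup on every boot.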
### Workspace Lifecycle
`provisioning` → `online` (on register) → `degraded` (error_rate > 0.5) → `online` (recovered) → `offline` (Redis TTL expired OR health sweep detects dead container) → auto-restart → `provisioning` → ... → `removed` (deleted). Any state → `paused` (user pauses) → `provisioning` (user resumes). Paused workspaces skip health sweep, liveness monitor, and auto-restart.
@ -354,7 +367,7 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
|--------|------|---------|
| GET | /health | inline |
| GET | /metrics | metrics.Handler() — Prometheus text format (v0.0.4); no auth, scrape-safe |
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go |
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go — GET /workspaces + POST /workspaces + DELETE /workspaces/:id are behind `AdminAuth` (#99/#167 C1+C20). PATCH /workspaces/:id is on the open router but `WorkspaceHandler.Update` enforces **field-level authz** (#138/PR #162): cosmetic fields (name, role, x, y, canvas) pass through; sensitive fields (tier, parent_id, runtime, workspace_dir) require a valid bearer token whenever any live token exists. POST /workspaces uses `resolveInsideRoot` on payload.Template (#226 / PR #233). Create handler generates the name as a double-quoted YAML scalar to block #221 injection |
| GET/PATCH | /workspaces/:id/config | workspace.go |
| GET/POST | /workspaces/:id/memory | workspace.go |
| DELETE | /workspaces/:id/memory/:key | workspace.go |
@ -398,9 +411,10 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
| POST | /webhooks/:type | channels.go (incoming social webhook) |
| GET | /workspaces/:id/shared-context | templates.go |
| GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go |
| GET/PUT | /canvas/viewport | viewport.go |
| GET | /canvas/viewport | viewport.go — open (cosmetic, bootstrap-friendly) |
| PUT | /canvas/viewport | viewport.go — `CanvasOrBearer` middleware (#203): accepts bearer OR Origin matching `CORS_ORIGINS`. Cosmetic-only — worst case viewport corruption, recovered by page refresh. DO NOT use this middleware for any route that leaks data or creates resources (see `docs/runbooks/admin-auth.md`) |
| GET | /templates | templates.go |
| POST | /templates/import | templates.go |
| POST | /templates/import | templates.go, gated by `AdminAuth` (#190 / PR #200) |
| POST | /registry/register | registry.go |
| POST | /registry/heartbeat | registry.go |
| POST | /registry/update-card | registry.go |
@ -412,17 +426,20 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
| GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install (`{"source":"scheme://spec"}`), uninstall per-workspace |
| GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) |
| GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) |
| GET | /bundles/export/:id | bundle.go |
| POST | /bundles/import | bundle.go |
| GET | /bundles/export/:id | bundle.go, gated by `AdminAuth` (#165 / PR #167) |
| POST | /bundles/import | bundle.go, gated by `AdminAuth` (#164 CRITICAL / PR #167) |
| GET | /org/templates | org.go (list available org templates) |
| POST | /org/import | org.go (import entire org hierarchy from YAML) |
| GET | /events[/:workspaceId] | events.go |
| POST | /org/import | org.go — `AdminAuth` + `resolveInsideRoot` path sanitiser (#103 / PR #106) |
| GET | /events | events.go — `AdminAuth` (#165 / PR #167) |
| GET | /events/:workspaceId | events.go — `AdminAuth` (#165 / PR #167) |
| GET | /admin/liveness | inline — `AdminAuth` (#166 / PR #167). Per-subsystem `supervised.Snapshot()` ages; operators check this before debugging stuck scheduler / heartbeat goroutines |
| GET | /ws | socket.go |
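
The field-level authz rule on `PATCH /workspaces/:id` from the table above can be sketched as a pure function. The field lists come from the table; the function names and the boolean inputs are hypothetical, and the real `WorkspaceHandler.Update` may be structured quite differently.

```go
package main

import "fmt"

// sensitiveFields are the PATCH fields the table lists as bearer-required.
var sensitiveFields = map[string]bool{
	"tier":          true,
	"parent_id":     true,
	"runtime":       true,
	"workspace_dir": true,
}

// patchTouchesSensitive reports whether any sensitive field is in the payload.
func patchTouchesSensitive(patch map[string]any) bool {
	for field := range patch {
		if sensitiveFields[field] {
			return true
		}
	}
	return false
}

// authorizePatch sketches the rule: cosmetic-only patches pass; sensitive
// patches need a valid bearer whenever any live token exists.
func authorizePatch(patch map[string]any, anyLiveToken, bearerValid bool) bool {
	if !patchTouchesSensitive(patch) {
		return true
	}
	if !anyLiveToken {
		return true // bootstrap case: no tokens issued yet
	}
	return bearerValid
}

func main() {
	cosmetic := map[string]any{"name": "Researcher", "x": 120, "y": 80}
	sensitive := map[string]any{"name": "Researcher", "tier": "pro"}
	fmt.Println("cosmetic, no bearer: ", authorizePatch(cosmetic, true, false))
	fmt.Println("sensitive, no bearer:", authorizePatch(sensitive, true, false))
	fmt.Println("sensitive, bearer ok:", authorizePatch(sensitive, true, true))
}
```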
## Database
23 migration files in `platform/migrations/` (up to `022_workspace_schedules_source` — 2026-04-14 tick-7, PR #76). Key tables: `workspaces` (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), `canvas_layouts` (x/y position), `structure_events` (append-only event log), `activity_logs` (A2A communications, task updates, agent logs, errors), `workspace_schedules` (cron tasks with expression, timezone, prompt, run history, and `source`: `'template'` for org/import-seeded, `'runtime'` for Canvas/API-created; org/import is additive and only refreshes template-source rows on re-import), `workspace_channels` (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), `agents`, `workspace_secrets`, `global_secrets`, `agent_memories` (HMA scoped memory), `approvals`.
Migration files in `platform/migrations/` (latest: `022_workspace_schedules_source` — 2026-04-14 tick-7, PR #76). Each later migration is a `.up.sql`/`.down.sql` pair. Key tables: `workspaces` (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), `canvas_layouts` (x/y position), `structure_events` (append-only event log), `activity_logs` (A2A communications, task updates, agent logs, errors; `error_detail` is now populated by `scheduler.fireSchedule` so `GET /workspaces/:id/schedules/:id/history` can surface why a cron run failed, #152 / PR #206), `workspace_schedules` (cron tasks with expression, timezone, prompt, run history, `source`: `'template'` for org/import-seeded, `'runtime'` for Canvas/API-created, and `last_status` now includes `'skipped'` when `scheduler.fireSchedule` concurrency-aware-skips a busy workspace, #115 / PR #207), `workspace_channels` (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), `agents`, `workspace_secrets`, `global_secrets`, `workspace_auth_tokens` (Phase 30.1 bearer tokens; now auto-revoked on workspace delete, #110), `agent_memories` (HMA scoped memory), `approvals`.
The platform auto-discovers and runs migrations on startup from several candidate paths.
The platform auto-discovers and runs migrations on startup from several candidate paths. The runner filters out `*.down.sql` files — see the "Migration runner" section above for the history of PR #212 and why this filter is load-bearing.
<!-- AWARENESS_RULES_START -->
# Project Memory (Awareness MCP)

PLAN.md

@ -247,6 +247,66 @@ point for "what else is out there."
- **GitHub issue #15** — Provisioner: auto-refresh `CLAUDE_CODE_OAUTH_TOKEN` from `global_secrets` on workspace restart → **DONE** via PR #64 (`SetGlobal` / `DeleteGlobal` now fan out `RestartByID` to every affected workspace).
- **GitHub issue #19 Layer 1** — Platform-generated restart context → **DONE** via PR #65 (synthetic A2A `message/send` with `metadata.kind=restart_context`, `system:restart-context` caller prefix, 30s re-register wait). Layer 2 deferred to issue #66 (see Backlog item 15 above).
### Recently launched (2026-04-15 overnight sweep — ticks 17–30+, ~27 PRs)
**Security hardening cluster.** Roughly half the sweep was closing auth gaps surfaced by the Security Auditor's hourly audit cron:
- `#94` RFC-1918 + link-local in registry URL validator
- `#99` AdminAuth gate on `GET /workspaces` (topology leak / #104)
- `#106` path-sanitize + admin-gate `POST /org/import` (#103 HIGH)
- `#110` revoke `workspace_auth_tokens` on workspace delete
- `#119` IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests
- `#162` field-level authz on `PATCH /workspaces/:id` (#138 — cosmetic vs sensitive split)
- `#155` wire existing `SecurityHeaders` middleware into router
- `#167` gate 6 previously-unauth routes behind `AdminAuth` (#164 CRITICAL anon bundles/import; #165 HIGH events+bundles/export topology leak; #166 MED viewport+liveness)
- `#185` `AdminAuth` on `GET /approvals/pending` (#180)
- `#200` `AdminAuth` on `POST /templates/import` (#190 HIGH)
- `#203` `CanvasOrBearer` middleware — route-split for #168 canvas regression, only `PUT /canvas/viewport`; rejected PR #194's broader Origin-fallback approach because it would have re-opened #164
- `#209` source_id spoof defense in `activity.Report` (cherry-picked from the rejected #169 batch)
- `#233` `resolveInsideRoot` on `POST /workspaces template/runtime` (#226 MED)
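
The address blocklist from `#94` and `#119` above can be sketched with the standard `net/netip` package. The prefixes are the ones named in the list plus the three RFC 1918 ranges and the IPv4 link-local range; the function name `isBlocked` and the choice to reject anything that is not an IP literal are assumptions for this sketch (a real URL validator would typically resolve hostnames first).

```go
package main

import (
	"fmt"
	"net/netip"
)

// blockedPrefixes: RFC 1918 private ranges, IPv4 link-local, and the IPv6
// ranges named above (link-local, loopback, unique-local).
var blockedPrefixes = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
	netip.MustParsePrefix("169.254.0.0/16"),
	netip.MustParsePrefix("fe80::/10"),
	netip.MustParsePrefix("::1/128"),
	netip.MustParsePrefix("fc00::/7"),
}

// isBlocked reports whether host is an IP literal inside a blocked range.
// Assumption: non-literals are rejected outright in this sketch.
func isBlocked(host string) bool {
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return true
	}
	// Unmap turns ::ffff:10.0.0.1 into 10.0.0.1; WithZone("") strips a zone
	// such as %eth0, since Prefix.Contains never matches zoned addresses.
	addr = addr.Unmap().WithZone("")
	for _, p := range blockedPrefixes {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	for _, h := range []string{"10.1.2.3", "fe80::1%eth0", "fd12::1", "::ffff:192.168.0.5", "8.8.8.8"} {
		fmt.Printf("%-20s blocked=%v\n", h, isBlocked(h))
	}
}
```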
**Data integrity.** Three bugs that would have silently corrupted state:
- `#212` **CRITICAL** migration-runner bug — `RunMigrations` globbed `*.sql` and alphabetically ran `.down.sql` BEFORE `.up.sql` on every boot, wiping `workspace_auth_tokens` (and 018/019 pairs). Filter fix + unit test in `postgres_migrate_test.go`.
- `#224` YAML injection in `generateDefaultConfig` — body.Name now emitted as a double-quoted YAML scalar with all control chars escaped. Structural test (parse + verify key count).
- `#236` log-injection in the #209 security-event log line — attacker-controlled source_id echoed via `%s` allowed fake log entries; switched to `%q`.
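
For the `#224` fix above, emitting a name as a double-quoted YAML scalar might look roughly like this. The helper `yamlDoubleQuoted` is hypothetical and covers only backslash, double quote, and ASCII control characters; it is not the code in `generateDefaultConfig`.

```go
package main

import (
	"fmt"
	"strings"
)

// yamlDoubleQuoted wraps s in double quotes and escapes backslash, quote,
// and ASCII control characters so the value stays a single scalar.
// Hypothetical helper; other non-printable code points are not handled.
func yamlDoubleQuoted(s string) string {
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch {
		case r == '\\':
			b.WriteString(`\\`)
		case r == '"':
			b.WriteString(`\"`)
		case r == '\n':
			b.WriteString(`\n`)
		case r == '\r':
			b.WriteString(`\r`)
		case r == '\t':
			b.WriteString(`\t`)
		case r < 0x20 || r == 0x7f:
			fmt.Fprintf(&b, `\x%02X`, r)
		default:
			b.WriteRune(r)
		}
	}
	b.WriteByte('"')
	return b.String()
}

func main() {
	// A name that tries to smuggle a second key onto its own line.
	name := "demo\nruntime: evil"
	fmt.Println("name: " + yamlDoubleQuoted(name))
}
```

The injected newline stays inside the quoted scalar, so the document still has a single `name` key. That property is what a parse-and-count-keys test checks.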
**CI / infra.**
- `#186` + controlplane `#28` — every CI job migrated from `ubuntu-latest` to `[self-hosted, macos, arm64]` (Mac mini `hongming-m1-mini`). Non-trivial: `services:` replaced with inline `docker run` containers (ports 15432/16379), `actions/setup-python` bypassed via Homebrew python3.11 on `$GITHUB_PATH`, `docker/setup-qemu-action` added for cross-arch builds. Workaround for GH Actions billing cap on private repos.
- `#149` independent heartbeat pulse goroutine so long cron fires don't look stale on `/admin/liveness` (#140)
- `#211` migration runner regression (see #212 above — PR #212 is the fix)
- **Fly registry `FLY_API_TOKEN`** rotated to a deploy token scoped to `molecule-tenant` (previously personal token, invalidated by `flyctl auth login` during the malware cleanup)
**Platform / Scheduler reliability.**
- `#95` panic-recover in scheduler `tick()` + per-fire goroutines (closes #85)
- `#207` concurrency-aware skip — `scheduler.fireSchedule` reads `workspaces.active_tasks` and advances `next_run_at` + records a `cron_run` row with `status='skipped'` instead of colliding with a busy agent (#115)
- `#206` surface `error_detail` in schedule history API (#152 problem B)
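
The panic-recover pattern from `#95` above is the usual Go `defer`/`recover` idiom. A minimal sketch, with a hypothetical wrapper name `safely`, is:

```go
package main

import "fmt"

// safely runs fn and converts a panic into an error so that one bad fire
// cannot take down the caller's loop. Hypothetical wrapper for illustration.
func safely(name string, fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%s panicked: %v", name, r)
		}
	}()
	fn()
	return nil
}

func main() {
	fmt.Println(safely("fire-1", func() {}))
	fmt.Println(safely("fire-2", func() { panic("nil map write") }))
	fmt.Println("loop still running")
}
```

Note that `recover` only catches panics in the goroutine where the deferred function runs, so each per-fire goroutine needs its own wrapper.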
**Workspace runtime features.**
- `#205` idle-loop reflection pattern — opt-in `idle_prompt` + `idle_interval_seconds` in `config.yaml`; self-sends when `heartbeat.active_tasks == 0`. Hermes/Letta shape.
- `#208` Hermes Phase 1 multi-provider registry — 15 providers via `adapters/hermes/providers.py` (Nous, OpenRouter, OpenAI, Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, Together, Fireworks, Mistral). 26 tests.
- `#198` A2A protocol compliance batch (#173/#174/#175): `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`, `stateTransitionHistory=True` in AgentCapabilities. **Regression:** `push_sender=PushNotificationSender()` crashed on startup because PushNotificationSender is abstract — reverted in #210.
- `#216` idle-loop pilot enabled on Technical Researcher workspace.
- `#225` + `#235` `auth_headers()` on `/registry/register` + initial_prompt + idle loop self-posts (#215/#220)
- `#231` Claude SDK stderr probe for proper rate-limit error attribution (#160 diagnostics)
**Controlplane (molecule-controlplane).**
- `#19`+`#20` Grafana Cloud remote-write counter registry (`cp_requests_total`), push loop to `prometheus-prod-32-prod-ca-east-0.grafana.net`, Basic auth with user 3116422
- `#21` AWS KMS envelope encryption — per-secret DEK via `GenerateDataKey`, dual-mode (v2 blobs via KMS, legacy via static key, auto-routes by leading byte)
- `#24` `/cp/status` deep probe for Betterstack
- `#26`+`#27` public `/legal/{terms,privacy,dpa,acceptable}` pages from embedded markdown + smoke coverage
- Isolation red-team test suite + observability runbooks (Grafana dashboard, Betterstack, Stripe Atlas)
**Self code-review follow-ups (`#228` + `#232`).** Ran `/code-review` on the batch merges, surfaced 8 🟡 issues, split into Go (#228) and Python/docs (#232):
- `CanvasOrBearer` invalid-bearer fall-through fix
- `short()` helper replacing unsafe `[:N]` slices in `scheduler.go`
- 6 new tests (`TestShort_helper`, `TestRecordSkipped_*`, `TestActivityHandler_Report_*`, `TestHistory_IncludesErrorDetail`)
- idle-loop hardening (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamp, typed exception handling, `add_done_callback` for fire-and-forget error logging)
- `idle_prompt` / `idle_interval_seconds` documented in `org.yaml` defaults
- New `docs/runbooks/admin-auth.md` — the three middleware variants + three-question test for adding to `CanvasOrBearer`
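
As an illustration of the `short()` item above: slicing `s[:N]` panics when `s` has fewer than `N` bytes, which a bounds-checked helper avoids. The signature below is an assumption; only the helper's name and purpose come from the list.

```go
package main

import "fmt"

// short returns at most the first n bytes of s and never panics.
// Assumed signature; byte-based, so it suits ASCII identifiers such as UUIDs.
func short(s string, n int) string {
	if n < 0 {
		n = 0
	}
	if len(s) <= n {
		return s
	}
	return s[:n]
}

func main() {
	fmt.Println(short("3f2b9c1e-aaaa-bbbb-cccc-1234567890ab", 8))
	fmt.Println(short("ws1", 8)) // shorter than n: returned unchanged
}
```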
**Test counts post-sweep:** +70 Go (816 total), +40 Python (1180 total), +0 Canvas vitest (453 unchanged — UI/a11y patches only).
**Outstanding (user action):** `#126` Slack adapter (Phase-H product decision), `#160` Claude Max OAuth quota (wait for 2026-04-17 23:00Z reset OR upgrade OR switch to ANTHROPIC_API_KEY), `#191` runner persistent-state docs (P3), `#199` Fly registry token (**resolved** this session but publish-platform-image re-run pending runner), Stripe Atlas application (launch blocker, 2-week lead).
### Recently launched (2026-04-15 tick-9)
- **Phase 32 Phase B.2 (image pipeline)** — PR #80 (merged `c3cc8e87`) adds `.github/workflows/publish-platform-image.yml`: on every main-merge touching `platform/**`, builds `platform/Dockerfile` and pushes `ghcr.io/molecule-ai/platform:latest` + `:sha-<commit>` to GHCR. Paired with the private `molecule-controlplane` Fly + Neon provisioner (PR #3 there, merged `2e85d5ad`) that reads `TENANT_IMAGE` env and boots tenant Fly Machines from this image. Tick-8 docs-sync PR #79 (merged `d53a1287`) also landed.
@ -368,20 +428,29 @@ self-hosted per-customer). Ordered by dependency + ROI.
- Stripe billing scaffold deployed in orgs-only mode (no Stripe creds configured yet; webhook handler + signature verification code ready)
- Domain: `moleculesai.app` (DNS not yet wired — subdomain routing works via `X-Molecule-Org-Slug` header pending Cloudflare)
**Phase status:**
**Phase status (post 2026-04-15 overnight sweep):**
- **A — Foundation** (accounts, tokens, domain): ✅ done
- **B — Fly provisioner + Neon branching**: ✅ done (control plane + tenant machine config + services + healthchecks)
- **C — WorkOS AuthKit scaffold**: ✅ done (live redirect to hosted signup); Phase C.2 (RequireSession on /cp/orgs + org-ownership check) pending
- **D — Stripe billing scaffold**: ✅ code done; Phase D.2 (auth-scoped checkout + customer create) and D.3 (plan quotas) pending — not blocked on user
- **E — Cloudflare + DNS `*.moleculesai.app`**: not started
- **F — Sign-up UX + onboarding**: not started
- **G — Observability + quotas + admin**: not started
- **H — Hardening (KMS, isolation test suite, load test, legal)**: not started
- **I — Launch**: not started
- **B — Fly provisioner + Neon branching**: ✅ done
- **C — WorkOS AuthKit scaffold + RequireSession + org-ownership check**: ✅ done
- **D — Stripe billing scaffold + auth-scoped checkout + plan quotas**: ✅ code done; live keys pending Stripe Atlas
- **E — Cloudflare + DNS `*.moleculesai.app` + per-tenant Vercel canvas**: ✅ done
- **F — Sign-up UX + onboarding**: ✅ basic flow done (signup / org create / canvas redirect); polish + email pending
- **G — Observability + quotas + admin**: ✅ Sentry + Grafana remote-write + `/cp/status` Betterstack probe + per-org rate limiter; admin panel `/cp/admin/*` pending
- **H — Hardening**: ⏳ partial — AWS KMS envelope encryption ✅ (controlplane PR #21), tenant-isolation red-team CI gate ✅ (`isolation_test.go`), legal pages ✅ (`/legal/*` from controlplane PR #26); load test + Stripe Atlas application + status page custom domain pending
- **I — Launch**: pending Stripe Atlas (~2 week lead)
**Live infrastructure deltas (post-sweep):**
- Migration runner safety fix landed (#212) — `*.down.sql` filter; was wiping `workspace_auth_tokens` on every restart
- Workspace auth tokens now revoked on workspace delete (#110)
- All known unauth admin routes gated; #138 canvas regression resolved via field-level authz + `CanvasOrBearer` middleware
- Self-hosted Mac mini CI runner replaced GH-hosted Linux to bypass private-repo Actions billing cap; `FLY_API_TOKEN` rotated to a deploy token scoped to `molecule-tenant` after the personal token was invalidated by `flyctl auth login` during the 2026-04-15 cleanup of the cryptominer installed on 2025-12-06
- `/legal/{terms,privacy,dpa,acceptable}` live at `https://app.moleculesai.app/legal/*`
**Known open issues on the live system:**
- fly-replay state format iteration: Fly's proxy returned 502 on `state=org-id=<uuid>` (second `=`); fix dropped the prefix, PRs `molecule-controlplane#8` + `molecule-monorepo#88` in flight to make bare UUID work end-to-end
- Tenant `/workspaces` returns Neon pooler warnings (`unnamed prepared statement does not exist`) — lib/pq + Neon pooler incompatibility, tracked for lib/pq → pgx migration in a later phase
- `#160` Claude Max OAuth quota exhausted on the agent-fleet token until 2026-04-17 23:00 UTC; mitigations: wait, upgrade plan, OR switch workspace containers to `ANTHROPIC_API_KEY` env var
- `#191` self-hosted runner persistent-state docs (P3, low urgency)
- `#199` Fly registry token — **resolved** in the 2026-04-15 sweep but `publish-platform-image` re-run pending runner availability
**Companion repo:** `Molecule-AI/molecule-controlplane` (private). n8n-style open-core split: this public repo stays OSS (tenant binary + plugins + channels, contributable surface); control plane (orgs / signup / billing / provisioner / routing) is private. See `molecule-controlplane/PLAN.md` for its roadmap.


@ -35,3 +35,231 @@ each tenant Fly Machine from this image.
- `.github/workflows/publish-platform-image.yml` — new.
- `CLAUDE.md` — tick-9 block for the new CI workflow.
- `PLAN.md` — new "Recently launched (2026-04-15 tick-9)" entry.
---
## Overnight sweep (2026-04-15 16:30–19:10 UTC, ticks 17–30+)
One long session that started with a malware discovery, pivoted through a
half-day of security triage, landed ~27 PRs across both repos, and ended
with a self code-review cleanup round. Chronological order below, compressed
to the load-bearing details so future ticks can grep this file instead of
re-reading the JSONL cron-learnings stream.
### Security: malware cleanup + Fly credential rotation
Discovered `xmrig` cryptominer installed Dec 6 2025 via commodity
npm-dropper, running out of `/var/tmp/.X11-unix/xmrig-6.24.0/` as
`systemd-udevd` (camouflaged Linux daemon name on a Mac mini). Crontab
entry `*/10 * * * *` had been firing every 10 min for ~4 months until
tonight — ~17,500 launches. Wiped crontab, removed payload, rotated
`FLY_API_TOKEN` + `CLAUDE_CODE_OAUTH_TOKEN` + `GRAFANA_PROM_TOKEN`.
The payload appears to be mining-only; no backdoor was found (no SSH
auth-keys, no LaunchAgents, no extra shell hooks, no other xmrig
copies). However, rotating the personal Fly token via `flyctl auth login`
invalidated the copy still stored in GitHub Actions secrets, which
surfaced much later as the #199 publish workflow 401. **Operator rule of
thumb: always use `flyctl tokens create deploy -a <app>` for CI, never a
personal auth token.**
### Self-hosted CI runner migration
#186 switched every `ci.yml` job + `publish-platform-image.yml` from
`runs-on: ubuntu-latest` to `[self-hosted, macos, arm64]` (Apple-silicon
Mac mini `hongming-m1-mini`). Non-trivial adaptations:
- Replaced GH Actions `services: postgres/redis` (Linux-only) with
inline `docker run` with `PG_CONTAINER` / `REDIS_CONTAINER` env vars
and `docker rm -f` teardown in `if: always()`. Ports 15432/16379
to avoid collision with host services.
- `ludeeus/action-shellcheck` (Docker action, Linux-only) → fallback
to local `brew install shellcheck` + `find | xargs shellcheck`.
- `actions/setup-python@v5` hardcodes `/Users/runner/hostedtoolcache`
(non-overridable — upstream limitation in the prebuilt setup.sh from
`actions/python-versions`). Bypassed with a `Verify Python 3.11
(Homebrew)` step that prepends `/opt/homebrew/opt/python@3.11/bin`
to `$GITHUB_PATH`. One-time runner prep: `brew install python@3.11`.
- `publish-platform-image.yml` adds `docker/setup-qemu-action@v3`
+ `platforms: linux/amd64` explicit because the runner is arm64 and
Fly tenant machines are amd64.
Controlplane PR #28 mirrored the same migration on its own single-job
ci.yml (1-line `runs-on` swap — no matrix adaptations needed).
Known runner rough edges tracked as follow-ups: #191 (persistent-state
docs), #199 (Fly registry 401 — resolved by minting a deploy token
scoped to `molecule-tenant`, tokens table previously empty).
### Security fixes — auth gating
Closed a cluster of unauthenticated-route findings surfaced by the
Security Auditor's hourly audit:
| PR | Issue | Fix |
|---|---|---|
| #94 | #C6 | RFC-1918 + link-local in registry URL validator |
| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) |
| #102 | — | ancestor↔descendant A2A for hierarchy routing |
| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import |
| #110 | — | revoke workspace_auth_tokens on workspace delete |
| #119 | — | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests |
| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) |
| #155 | #151 | wire SecurityHeaders middleware |
| #167 | #164 CRIT #165 HIGH #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) |
| #185 | #180 | AdminAuth on GET /approvals/pending |
| #200 | #190 HIGH | AdminAuth on POST /templates/import |
| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) |
| #209 | #169 C2 | source_id spoof defense in activity.Report |
| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime |
Rejected PR #194 (Origin-fallback approach) because it would have
re-opened #164 CRITICAL to curl-based spoofing. #168 correctly fixed
via the narrower route-split in #203.
Rejected PR #169 (large C1-C6 batch) because 4/7 findings were
duplicates of already-merged work and migration 022 numbering
collided with 022_workspace_schedules_source. Cherry-picked the one
genuinely new fix (C2 source_id spoof check) into #209 and closed
#169.
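
The path-sanitising idea behind `resolveInsideRoot` (PRs #106 and #233 in the table above) can be sketched as below. Only the function name comes from this log; the signature, the rejection of absolute paths, and the error text are assumptions, and this sketch does not resolve symlinks.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

var errEscapesRoot = errors.New("path escapes root")

// resolveInsideRoot joins rel onto root and rejects any result that lands
// outside root. Assumed signature; symlinks are not resolved here.
func resolveInsideRoot(root, rel string) (string, error) {
	if filepath.IsAbs(rel) {
		return "", errEscapesRoot
	}
	joined := filepath.Join(root, rel) // Join also cleans "a/../b" segments
	relToRoot, err := filepath.Rel(root, joined)
	if err != nil {
		return "", errEscapesRoot
	}
	if relToRoot == ".." || strings.HasPrefix(relToRoot, ".."+string(filepath.Separator)) {
		return "", errEscapesRoot
	}
	return joined, nil
}

func main() {
	for _, p := range []string{"team/dev.yaml", "../secrets.env", "a/../../etc/passwd", "/etc/passwd"} {
		resolved, err := resolveInsideRoot("/srv/templates", p)
		fmt.Printf("%-22q -> %q, err=%v\n", p, resolved, err)
	}
}
```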
### Security fixes — data integrity
- **#212** CRITICAL migration-runner bug: `RunMigrations` globbed
`*.sql` and sorted alphabetically, running `.down.sql` BEFORE
`.up.sql` on every boot. Wiped `workspace_auth_tokens` + two other
pairs on every platform restart, regressing AdminAuth to fail-open
bootstrap mode. Filter to skip `.down.sql` + unit test in
`postgres_migrate_test.go`.
- **#224** YAML injection in `generateDefaultConfig` — body.Name
concatenated into YAML without escaping. Fixed by emitting as
double-quoted YAML scalar with all control chars escaped. Structural
test (parse + verify key count) instead of substring match.
- **#236** log-injection in the #209 security-event log line —
attacker-controlled `source_id` echoed via `%s` allowed newline
injection of fake log entries. Switched to `%q`.
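
The `%s` versus `%q` difference behind #236 is easy to demonstrate with `fmt`. The log message text and function names below are made up for illustration; only the two format verbs come from the fix.

```go
package main

import (
	"fmt"
	"strings"
)

// unsafeLine echoes the attacker-controlled value verbatim.
func unsafeLine(sourceID string) string {
	return fmt.Sprintf("security: source_id mismatch source_id=%s", sourceID)
}

// safeLine quotes the value, so control characters are escaped.
func safeLine(sourceID string) string {
	return fmt.Sprintf("security: source_id mismatch source_id=%q", sourceID)
}

// lineCount reports how many log lines a reader would see.
func lineCount(s string) int {
	return strings.Count(s, "\n") + 1
}

func main() {
	attack := "ws-1\n2026/04/15 INFO admin login ok"
	fmt.Println(unsafeLine(attack)) // prints two lines, the second one forged
	fmt.Println(safeLine(attack))   // prints one line with an escaped \n
	fmt.Println(lineCount(unsafeLine(attack)), lineCount(safeLine(attack)))
}
```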
### Infrastructure
- **AWS KMS envelope encryption** (controlplane PR #21). Per-secret DEK
via `kms.GenerateDataKey`; blob layout `[0x02][dek_len][enc_dek][nonce][ct]`.
Dual-mode: v2 blobs via KMS, legacy blobs via static `SECRETS_ENCRYPTION_KEY`.
Auto-routes by leading byte; no rewrap migration needed.
- **Grafana Cloud remote-write** (controlplane PR #19 + #20). In-process
counter registry + hand-rolled protobuf encoder. `cp_requests_total`
emitted on every request. Push loop to
`prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push` with
Basic auth. User 3116422, token via GRAFANA_PROM_TOKEN Fly secret.
- **/cp/status deep-probe** (controlplane PR #24) for Betterstack.
Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from
`/health`.
- **Legal pages** (controlplane PR #26/#27). Public `/legal/{terms,
privacy,dpa,acceptable}` served from embedded markdown. Dark-theme
HTML shell, minimal markdown→HTML renderer (no dep), path-traversal
safe via slug allowlist. Smoke covered.
- **Scheduler reliability**: #95 panic-recover in tick(), #149
independent heartbeat goroutine so long fires don't look stale on
/admin/liveness, #207 concurrency-aware skip when workspace
active_tasks>0.
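
The dual-mode routing described in the KMS item above ("auto-routes by leading byte") reduces to a one-byte check. The sketch below shows only that routing decision; the constant name, function name, and return labels are hypothetical, and no decryption is attempted.

```go
package main

import "fmt"

// blobVersionKMS is the leading byte of the v2 layout
// [0x02][dek_len][enc_dek][nonce][ct] described above.
const blobVersionKMS = 0x02

// decryptRoute picks a decryption path from the first byte of the blob.
// Hypothetical helper: it only illustrates the routing, nothing else.
func decryptRoute(blob []byte) string {
	if len(blob) > 0 && blob[0] == blobVersionKMS {
		return "kms-v2"
	}
	return "legacy-static-key"
}

func main() {
	fmt.Println(decryptRoute([]byte{0x02, 0x00, 0x10}))
	fmt.Println(decryptRoute([]byte{0x9a, 0x41}))
	fmt.Println(decryptRoute(nil))
}
```

Routing on read is what lets old and new blobs coexist without a rewrap migration.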
### Features
- **#205** idle-loop reflection pattern in workspace-template. Opt-in
via `idle_prompt` + `idle_interval_seconds` in `config.yaml`.
Self-sends the idle prompt via platform A2A proxy every interval
while `heartbeat.active_tasks == 0`. Hermes/Letta shape.
- **#208** Hermes Phase 1 multi-provider. 15 providers via
`adapters/hermes/providers.py` registry (Nous, OpenRouter, OpenAI,
Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq,
Together, Fireworks, Mistral). Back-compat with PR2 key resolution
preserved. 26 tests.
- **#198** A2A protocol compliance batch closing #173/#174/#175:
`cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`,
`stateTransitionHistory=True` in AgentCapabilities. *Note:* wired
`push_sender=PushNotificationSender()` and this crashed on startup
because PushNotificationSender is an abstract base class — reverted
in #210.
- **#186** self-hosted macOS runner migration (described above).
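
For orientation, an opt-in `config.yaml` fragment for the #205 idle loop might look like the following. Only the two key names come from the notes above; their placement at the top level and both values are illustrative assumptions.

```yaml
# Hypothetical values; the key names are from #205, everything else is assumed.
idle_prompt: "Review your recent work and note anything worth following up."
idle_interval_seconds: 600
```

Since the feature is opt-in, omitting these keys presumably leaves the loop disabled.
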
### Code-review self-audit
Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split
follow-ups into two PRs:
- **#228** (Go side): CanvasOrBearer invalid-bearer fall-through fix,
`short()` helper to replace unsafe `[:N]` slices in scheduler.go,
security-event log on source_id spoof. 6 new tests:
`TestShort_helper`, `TestRecordSkipped_writesSkippedStatus`,
`TestRecordSkipped_shortWorkspaceIDNoPanic`,
`TestActivityHandler_Report_SourceIDSpoofRejected`,
`TestActivityHandler_Report_MatchingSourceIDAccepted`,
`TestHistory_IncludesErrorDetail`.
- **#232** (Python/docs): idle-loop hardening
(`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamped,
typed `HTTPError`/`URLError`/catch-all, `add_done_callback` for
fire-and-forget error logging). `idle_prompt` documented in
`org-templates/molecule-dev/org.yaml` defaults. New
`docs/runbooks/admin-auth.md` documenting the three middleware
variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth
per-id) + the three-question test for adding routes to
CanvasOrBearer.
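
The `short()` helper from #228 replaces raw `[:N]` slices, which panic in Go when the string is shorter than N. Its real signature is not shown in these notes, so the version below is a minimal sketch that assumes it takes a string and a maximum length.

```go
package main

import "fmt"

// short returns at most n leading bytes of s. Unlike s[:n], it never panics
// when s is shorter than n or when n is negative.
// Assumed signature; the actual helper in scheduler.go may differ.
func short(s string, n int) string {
	if n <= 0 {
		return ""
	}
	if len(s) <= n {
		return s
	}
	return s[:n]
}

func main() {
	fmt.Println(short("3f9c1a2b-7d4e-4c1a-9b0e-1234567890ab", 8)) // long ID is truncated
	fmt.Println(short("ws1", 8))                                  // short ID is returned whole
}
```

This sketch truncates by bytes, which is adequate for ASCII identifiers such as workspace IDs but could split a multi-byte character in arbitrary text.
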
### Other merged fixes
- #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0)
- #123 dark-theme a11y (input contrast, search dialog, kbd hints)
- #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav)
- #139 code-review plugins for Dev Lead + QA Engineer
- #149 scheduler heartbeat pulse (#140)
- #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents)
- #157 ecosystem-watch PM sweep
- #161 e2e test mock fix for #125 EXISTS probe
- #187 `SetTrustedProxies(nil)` closes #179 rate-limit bypass
- #188 e2e auth headers on `/events` + `/bundles/export` post-#167
- #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression)
- #192 test regression lock for #170 `DELETE /secrets/:key`
- #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash)
- #206 surface cron `error_detail` in schedule history (#152 problem B)
- #210 revert PushNotificationSender ABC crash (#204)
- #211 migration runner skips `.down.sql` (data loss regression)
- #216 enable idle-loop pilot on Technical Researcher
- #223 reno-stars default plugins to browser-automation
- #225 `auth_headers()` on `/registry/register` (#215)
- #227 unit tests for `plugins_install_pipeline.go` (37 cases, #217)
- #231 Claude SDK stderr probe for rate-limit error attribution (#160)
- #235 `auth_headers()` on `initial_prompt` + idle loop (#220)
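
The `.down.sql` fix listed above comes down to keeping rollback files out of the set the migration runner applies. The function below sketches that filter only; `upMigrations` and the file names are hypothetical, and the real runner's file discovery and ordering are not shown in this document.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// upMigrations returns the .sql files that should be applied, in lexical
// order, leaving out rollback scripts (*.down.sql) and non-SQL files.
// Hypothetical helper illustrating the filter, not the runner's actual code.
func upMigrations(names []string) []string {
	var out []string
	for _, name := range names {
		if !strings.HasSuffix(name, ".sql") {
			continue // not a migration file
		}
		if strings.HasSuffix(name, ".down.sql") {
			continue // rollback script: applying it would undo the schema
		}
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	files := []string{"002_tokens.up.sql", "001_init.down.sql", "001_init.up.sql", "README.md"}
	fmt.Println(upMigrations(files))
}
```

Checking the `.down.sql` suffix explicitly matters because a plain `*.sql` match would pick up rollback scripts alongside the forward migrations.
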
### Issues closed (by merge or factual correction)
#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127,
#128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144,
#145, #146, #147, #148, #151, #152 problem B, #153, #154, #156, #160
(diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172,
#173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191
(accepted risk), #195, #199 (fixed Fly token rotation), #201, #202,
#204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229,
#230, #234.
### Outstanding — needs user
- **#126** Slack adapter (Phase-H product decision)
- **#160** Claude Max OAuth quota (wait for reset / upgrade / API key switch)
- **#191** self-hosted runner persistent-state docs (P3)
- **#199** Fly registry token — **resolved this session** but re-run
of `publish-platform-image` pending runner capacity
- Stripe Atlas application (launch blocker, 2-week lead)
### Test counts (post-session)
- Platform Go: **816 test functions** (+70 this session — scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234)
- Canvas vitest: **453 tests** (unchanged this session; UI/a11y patches added no new tests)
- Workspace-template pytest: **1180 tests** (+40 this session: Hermes providers, A2A cancel, idle loop covered implicitly)
- MCP server jest: **97 tests** (unchanged)
### Infra notes (not in any repo)
- `FLY_API_TOKEN` GH Actions secret rotated to a deploy token scoped to
`molecule-tenant` (1-year expiry). Docs runbook update needed.
- Mac mini runner env has `RUNNER_TOOL_CACHE` + `AGENT_TOOLSDIRECTORY`
overrides. Installing Python via Homebrew is a required one-time prep step.
- `molecule-monorepo` is still private; the workaround for Actions
billing is the self-hosted runner, rather than making the repo public
or raising the cap.