diff --git a/CLAUDE.md b/CLAUDE.md index e14c0ee7..6b53618e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -225,9 +225,9 @@ OPENAI_API_KEY=... bash scripts/test-team-e2e.sh # E2E: Multi-template ### Unit Tests ```bash -cd platform && go test -race ./... # 746 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth, middleware — sqlmock + miniredis; +6 on 2026-04-14 tick-8 for TestTenantGuard_* covering MOLECULE_ORG_ID passthrough/match/mismatch/missing/allowlist/exact-match (#78, Phase 32 PR #1); prior: +9 tick-7 for category_routing + schedules.source; +5 tick-6 for plugins UNION; +6 tick-4 for auto-restart + restart-context branches) -cd canvas && npm test # 357 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements) -cd workspace-template && python -m pytest -v # 1140 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging) +cd platform && go test -race ./... 
# 816 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth, middleware, scheduler, crypto, db — sqlmock + miniredis; +70 on 2026-04-15 overnight sweep across the security fix cluster: CanvasOrBearer middleware tests, scheduler recordSkipped + short() helper, source_id spoof rejection + log injection regression guards, YAML-parse structural injection tests, migration runner .down.sql filter, plugins_install_pipeline_test.go 37-case suite, resolveInsideRoot coverage; +6 on 2026-04-14 tick-8 for TestTenantGuard_*; prior: +9 tick-7 for category_routing + schedules.source) +cd canvas && npm test # 453 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements, WCAG critical batch — ARIA live toasts + dialog focus trap + keyboard nav) +cd workspace-template && python -m pytest -v # 1180 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging, Hermes multi-provider registry 26 tests, a2a_executor cancel emits canceled event, idle loop + initial_prompt auth_headers()) cd sdk/python && python -m pytest -v # 132 SDK tests (agentskills.io spec validator, CLI, AgentskillsAdaptor round-trip, workspace/org/channel validators, RemoteAgentClient Phase 30 flows) cd mcp-server && npm test # 97 Jest tests (per-domain tool modules + smoke test on tool count) ``` @@ -343,6 +343,19 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu **Important:** Initial prompts must NOT send A2A messages (delegate_task, send_message_to_user) — other agents may not be ready. Keep them local: clone repo, read docs, save to memory, wait for tasks. +### Idle Loop (#205 — reflection-on-completion) +Opt-in pattern: when `idle_prompt` is non-empty in `config.yaml`, the workspace self-sends it every `idle_interval_seconds` (default 600) **while `heartbeat.active_tasks == 0`**. 
Hermes/Letta shape from the 2026-04-15 agent-framework survey. Cost collapses to event-driven — the idle check is local (no LLM call) and the prompt only fires when there's genuinely nothing to do. Set per-workspace or per org.yaml default. Fire timeout clamps to `max(60, min(300, idle_interval_seconds))`. Both the idle loop and `initial_prompt` self-posts include `auth_headers()` so they work in multi-tenant mode (#220 / PR #235). Pilot enabled on Technical Researcher (#216). + +### Admin auth middleware variants +Three Gin middleware classes gate server-side routes — pick the right one. Full contract in `docs/runbooks/admin-auth.md`. + +- **`middleware.AdminAuth(db.DB)`** — strict bearer-only. Used for any route where a forged request could leak prompts/memory, create/mutate workspaces, or leak ops intel. Lazy-bootstrap fail-open when `HasAnyLiveTokenGlobal` returns 0. +- **`middleware.CanvasOrBearer(db.DB)`** — accepts bearer OR Origin matching `CORS_ORIGINS`. Used ONLY for cosmetic routes where a forged request has zero data/security impact. Currently only on `PUT /canvas/viewport`. **Do not extend** without rereading the runbook — PR #194 was rejected because adding this to `/bundles/import` would have re-opened #164 CRITICAL. +- **`middleware.WorkspaceAuth(db.DB)`** — binds a bearer to `:id`. Workspace A's token cannot hit workspace B's sub-routes. Used for the entire `/workspaces/:id/*` group except the A2A proxy (which has its own `CanCommunicate` layer). + +### Migration runner (`platform/internal/db/postgres.go`) +`RunMigrations` globs `*.sql` in `migrationsDir`, filters out `.down.sql` files, sorts alphabetically, then `DB.Exec()`s each on boot. The filter is load-bearing: before PR #212 every boot ran `.down.sql` **before** `.up.sql` (alphabetical sort puts "d" before "u"), wiping `workspace_auth_tokens` + other pair-migration tables and silently regressing AdminAuth to fail-open. 
All `.up.sql` files must be **idempotent** (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`) because the runner re-applies every migration on every boot. A proper `schema_migrations` tracking table is planned as a Phase-H cleanup. + ### Workspace Lifecycle `provisioning` → `online` (on register) → `degraded` (error_rate > 0.5) → `online` (recovered) → `offline` (Redis TTL expired OR health sweep detects dead container) → auto-restart → `provisioning` → ... → `removed` (deleted). Any state → `paused` (user pauses) → `provisioning` (user resumes). Paused workspaces skip health sweep, liveness monitor, and auto-restart. @@ -354,7 +367,7 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu |--------|------|---------| | GET | /health | inline | | GET | /metrics | metrics.Handler() — Prometheus text format (v0.0.4); no auth, scrape-safe | -| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go | +| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go — GET /workspaces + POST /workspaces + DELETE /workspaces/:id are behind `AdminAuth` (#99/#167 C1+C20). PATCH /workspaces/:id is on the open router but `WorkspaceHandler.Update` enforces **field-level authz** (#138/PR #162): cosmetic fields (name, role, x, y, canvas) pass through; sensitive fields (tier, parent_id, runtime, workspace_dir) require a valid bearer token whenever any live token exists. POST /workspaces uses `resolveInsideRoot` on payload.Template (#226 / PR #233). Create handler generates the name as a double-quoted YAML scalar to block #221 injection | | GET/PATCH | /workspaces/:id/config | workspace.go | | GET/POST | /workspaces/:id/memory | workspace.go | | DELETE | /workspaces/:id/memory/:key | workspace.go | @@ -398,9 +411,10 @@ Agents can auto-execute a prompt on startup before any user interaction.
Configu | POST | /webhooks/:type | channels.go (incoming social webhook) | | GET | /workspaces/:id/shared-context | templates.go | | GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go | -| GET/PUT | /canvas/viewport | viewport.go | +| GET | /canvas/viewport | viewport.go — open (cosmetic, bootstrap-friendly) | +| PUT | /canvas/viewport | viewport.go — `CanvasOrBearer` middleware (#203): accepts bearer OR Origin matching `CORS_ORIGINS`. Cosmetic-only — worst case viewport corruption, recovered by page refresh. DO NOT use this middleware for any route that leaks data or creates resources (see `docs/runbooks/admin-auth.md`) | | GET | /templates | templates.go | -| POST | /templates/import | templates.go | +| POST | /templates/import | templates.go — `AdminAuth` (#190 / PR #200) | | POST | /registry/register | registry.go | | POST | /registry/heartbeat | registry.go | | POST | /registry/update-card | registry.go | @@ -412,17 +426,20 @@ Agents can auto-execute a prompt on startup before any user interaction. 
Configu | GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install (`{"source":"scheme://spec"}`), uninstall per-workspace | | GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) | | GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) | -| GET | /bundles/export/:id | bundle.go | -| POST | /bundles/import | bundle.go | +| GET | /bundles/export/:id | bundle.go — `AdminAuth` (#165 / PR #167) | +| POST | /bundles/import | bundle.go — `AdminAuth` (#164 CRITICAL / PR #167) | | GET | /org/templates | org.go (list available org templates) | -| POST | /org/import | org.go (import entire org hierarchy from YAML) || GET | /events[/:workspaceId] | events.go | +| POST | /org/import | org.go — `AdminAuth` + `resolveInsideRoot` path sanitiser (#103 / PR #106) | +| GET | /events | events.go — `AdminAuth` (#165 / PR #167) | +| GET | /events/:workspaceId | events.go — `AdminAuth` (#165 / PR #167) | +| GET | /admin/liveness | inline — `AdminAuth` (#166 / PR #167). Per-subsystem `supervised.Snapshot()` ages; operators check this before debugging stuck scheduler / heartbeat goroutines | | GET | /ws | socket.go | ## Database -23 migration files in `platform/migrations/` (up to `022_workspace_schedules_source` — 2026-04-14 tick-7, PR #76). 
Key tables: `workspaces` (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), `canvas_layouts` (x/y position), `structure_events` (append-only event log), `activity_logs` (A2A communications, task updates, agent logs, errors), `workspace_schedules` (cron tasks with expression, timezone, prompt, run history, and `source` — `'template'` for org/import-seeded, `'runtime'` for Canvas/API-created; org/import is additive and only refreshes template-source rows on re-import), `workspace_channels` (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), `agents`, `workspace_secrets`, `global_secrets`, `agent_memories` (HMA scoped memory), `approvals`. +Migration files in `platform/migrations/` (latest: `022_workspace_schedules_source` — 2026-04-14 tick-7, PR #76). Each later migration is a `.up.sql`/`.down.sql` pair. Key tables: `workspaces` (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), `canvas_layouts` (x/y position), `structure_events` (append-only event log), `activity_logs` (A2A communications, task updates, agent logs, errors — `error_detail` is now populated by `scheduler.fireSchedule` so `GET /workspaces/:id/schedules/:id/history` can surface why a cron run failed, #152 / PR #206), `workspace_schedules` (cron tasks with expression, timezone, prompt, run history, `source` — `'template'` for org/import-seeded, `'runtime'` for Canvas/API-created, and `last_status` now includes `'skipped'` when `scheduler.fireSchedule` concurrency-aware-skips a busy workspace, #115 / PR #207), `workspace_channels` (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), `agents`, `workspace_secrets`, `global_secrets`, `workspace_auth_tokens` (Phase 30.1 bearer tokens; now auto-revoked on workspace delete, #110), `agent_memories` (HMA scoped memory), `approvals`. 
-The platform auto-discovers and runs migrations on startup from several candidate paths. +The platform auto-discovers and runs migrations on startup from several candidate paths. The runner filters out `*.down.sql` files — see the "Migration runner" section above for the history of PR #212 and why this filter is load-bearing. # Project Memory (Awareness MCP) diff --git a/PLAN.md b/PLAN.md index e23374fd..158e132a 100644 --- a/PLAN.md +++ b/PLAN.md @@ -247,6 +247,66 @@ point for "what else is out there." - **GitHub issue #15** — Provisioner: auto-refresh `CLAUDE_CODE_OAUTH_TOKEN` from `global_secrets` on workspace restart → **DONE** via PR #64 (`SetGlobal` / `DeleteGlobal` now fan out `RestartByID` to every affected workspace). - **GitHub issue #19 Layer 1** — Platform-generated restart context → **DONE** via PR #65 (synthetic A2A `message/send` with `metadata.kind=restart_context`, `system:restart-context` caller prefix, 30s re-register wait). Layer 2 deferred to issue #66 (see Backlog item 15 above). 
+### Recently launched (2026-04-15 overnight sweep — ticks 17–30+, ~27 PRs) + +**Security hardening cluster.** Roughly half the sweep was closing auth gaps surfaced by the Security Auditor's hourly audit cron: +- `#94` RFC-1918 + link-local in registry URL validator +- `#99` AdminAuth gate on `GET /workspaces` (topology leak / #104) +- `#106` path-sanitize + admin-gate `POST /org/import` (#103 HIGH) +- `#110` revoke `workspace_auth_tokens` on workspace delete +- `#119` IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests +- `#162` field-level authz on `PATCH /workspaces/:id` (#138 — cosmetic vs sensitive split) +- `#155` wire existing `SecurityHeaders` middleware into router +- `#167` gate 6 previously-unauth routes behind `AdminAuth` (#164 CRITICAL anon bundles/import; #165 HIGH events+bundles/export topology leak; #166 MED viewport+liveness) +- `#185` `AdminAuth` on `GET /approvals/pending` (#180) +- `#200` `AdminAuth` on `POST /templates/import` (#190 HIGH) +- `#203` `CanvasOrBearer` middleware — route-split for #168 canvas regression, only `PUT /canvas/viewport`; rejected PR #194's broader Origin-fallback approach because it would have re-opened #164 +- `#209` source_id spoof defense in `activity.Report` (cherry-picked from the rejected #169 batch) +- `#233` `resolveInsideRoot` on `POST /workspaces template/runtime` (#226 MED) + +**Data integrity.** Three bugs that would have silently corrupted state: +- `#212` **CRITICAL** migration-runner bug — `RunMigrations` globbed `*.sql` and alphabetically ran `.down.sql` BEFORE `.up.sql` on every boot, wiping `workspace_auth_tokens` (and 018/019 pairs). Filter fix + unit test in `postgres_migrate_test.go`. +- `#224` YAML injection in `generateDefaultConfig` — body.Name now emitted as a double-quoted YAML scalar with all control chars escaped. Structural test (parse + verify key count). 
+- `#236` log-injection in the #209 security-event log line — attacker-controlled source_id echoed via `%s` allowed fake log entries; switched to `%q`. + +**CI / infra.** +- `#186` + controlplane `#28` — every CI job migrated from `ubuntu-latest` to `[self-hosted, macos, arm64]` (Mac mini `hongming-m1-mini`). Non-trivial: `services:` replaced with inline `docker run` containers (ports 15432/16379), `actions/setup-python` bypassed via Homebrew python3.11 on `$GITHUB_PATH`, `docker/setup-qemu-action` added for cross-arch builds. Workaround for GH Actions billing cap on private repos. +- `#149` independent heartbeat pulse goroutine so long cron fires don't look stale on `/admin/liveness` (#140) +- `#211` migration runner regression (see #212 above — PR #212 is the fix) +- **Fly registry `FLY_API_TOKEN`** rotated to a deploy token scoped to `molecule-tenant` (previously personal token, invalidated by `flyctl auth login` during the malware cleanup) + +**Platform / Scheduler reliability.** +- `#95` panic-recover in scheduler `tick()` + per-fire goroutines (closes #85) +- `#207` concurrency-aware skip — `scheduler.fireSchedule` reads `workspaces.active_tasks` and advances `next_run_at` + records a `cron_run` row with `status='skipped'` instead of colliding with a busy agent (#115) +- `#206` surface `error_detail` in schedule history API (#152 problem B) + +**Workspace runtime features.** +- `#205` idle-loop reflection pattern — opt-in `idle_prompt` + `idle_interval_seconds` in `config.yaml`; self-sends when `heartbeat.active_tasks == 0`. Hermes/Letta shape. +- `#208` Hermes Phase 1 multi-provider registry — 15 providers via `adapters/hermes/providers.py` (Nous, OpenRouter, OpenAI, Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, Together, Fireworks, Mistral). 26 tests. +- `#198` A2A protocol compliance batch (#173/#174/#175): `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`, `stateTransitionHistory=True` in AgentCapabilities. 
**Regression:** `push_sender=PushNotificationSender()` crashed on startup because PushNotificationSender is abstract — reverted in #210. +- `#216` idle-loop pilot enabled on Technical Researcher workspace. +- `#225` + `#235` `auth_headers()` on `/registry/register` + initial_prompt + idle loop self-posts (#215/#220) +- `#231` Claude SDK stderr probe for proper rate-limit error attribution (#160 diagnostics) + +**Controlplane (molecule-controlplane).** +- `#19`+`#20` Grafana Cloud remote-write counter registry (`cp_requests_total`), push loop to `prometheus-prod-32-prod-ca-east-0.grafana.net`, Basic auth with user 3116422 +- `#21` AWS KMS envelope encryption — per-secret DEK via `GenerateDataKey`, dual-mode (v2 blobs via KMS, legacy via static key, auto-routes by leading byte) +- `#24` `/cp/status` deep probe for Betterstack +- `#26`+`#27` public `/legal/{terms,privacy,dpa,acceptable}` pages from embedded markdown + smoke coverage +- Isolation red-team test suite + observability runbooks (Grafana dashboard, Betterstack, Stripe Atlas) + +**Self code-review follow-ups (`#228` + `#232`).** Ran `/code-review` on the batch merges, surfaced 8 🟡 issues, split into Go (#228) and Python/docs (#232): +- `CanvasOrBearer` invalid-bearer fall-through fix +- `short()` helper replacing unsafe `[:N]` slices in `scheduler.go` +- 6 new tests (`TestShort_helper`, `TestRecordSkipped_*`, `TestActivityHandler_Report_*`, `TestHistory_IncludesErrorDetail`) +- idle-loop hardening (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamp, typed exception handling, `add_done_callback` for fire-and-forget error logging) +- `idle_prompt` / `idle_interval_seconds` documented in `org.yaml` defaults +- New `docs/runbooks/admin-auth.md` — the three middleware variants + three-question test for adding to `CanvasOrBearer` + +**Test counts post-sweep:** +70 Go (816 total), +40 Python (1180 total), +0 Canvas vitest (453 unchanged — UI/a11y patches only). 
+ +**Outstanding (user action):** `#126` Slack adapter (Phase-H product decision), `#160` Claude Max OAuth quota (wait for 2026-04-17 23:00Z reset OR upgrade OR switch to ANTHROPIC_API_KEY), `#191` runner persistent-state docs (P3), `#199` Fly registry token (**resolved** this session but publish-platform-image re-run pending runner), Stripe Atlas application (launch blocker, 2-week lead). + ### Recently launched (2026-04-15 tick-9) - **Phase 32 Phase B.2 (image pipeline)** — PR #80 (merged `c3cc8e87`) adds `.github/workflows/publish-platform-image.yml`: on every main-merge touching `platform/**`, builds `platform/Dockerfile` and pushes `ghcr.io/molecule-ai/platform:latest` + `:sha-` to GHCR. Paired with the private `molecule-controlplane` Fly + Neon provisioner (PR #3 there, merged `2e85d5ad`) that reads `TENANT_IMAGE` env and boots tenant Fly Machines from this image. Tick-8 docs-sync PR #79 (merged `d53a1287`) also landed. @@ -368,20 +428,29 @@ self-hosted per-customer). Ordered by dependency + ROI. 
- Stripe billing scaffold deployed in orgs-only mode (no Stripe creds configured yet; webhook handler + signature verification code ready) - Domain: `moleculesai.app` (DNS not yet wired — subdomain routing works via `X-Molecule-Org-Slug` header pending Cloudflare) -**Phase status:** +**Phase status (post 2026-04-15 overnight sweep):** - **A — Foundation** (accounts, tokens, domain): ✅ done -- **B — Fly provisioner + Neon branching**: ✅ done (control plane + tenant machine config + services + healthchecks) -- **C — WorkOS AuthKit scaffold**: ✅ done (live redirect to hosted signup); Phase C.2 (RequireSession on /cp/orgs + org-ownership check) pending -- **D — Stripe billing scaffold**: ✅ code done; Phase D.2 (auth-scoped checkout + customer create) and D.3 (plan quotas) pending — not blocked on user -- **E — Cloudflare + DNS `*.moleculesai.app`**: not started -- **F — Sign-up UX + onboarding**: not started -- **G — Observability + quotas + admin**: not started -- **H — Hardening (KMS, isolation test suite, load test, legal)**: not started -- **I — Launch**: not started +- **B — Fly provisioner + Neon branching**: ✅ done +- **C — WorkOS AuthKit scaffold + RequireSession + org-ownership check**: ✅ done +- **D — Stripe billing scaffold + auth-scoped checkout + plan quotas**: ✅ code done; live keys pending Stripe Atlas +- **E — Cloudflare + DNS `*.moleculesai.app` + per-tenant Vercel canvas**: ✅ done +- **F — Sign-up UX + onboarding**: ✅ basic flow done (signup / org create / canvas redirect); polish + email pending +- **G — Observability + quotas + admin**: ✅ Sentry + Grafana remote-write + `/cp/status` Betterstack probe + per-org rate limiter; admin panel `/cp/admin/*` pending +- **H — Hardening**: ⏳ partial — AWS KMS envelope encryption ✅ (controlplane PR #21), tenant-isolation red-team CI gate ✅ (`isolation_test.go`), legal pages ✅ (`/legal/*` from controlplane PR #26); load test + Stripe Atlas application + status page custom domain pending +- **I — Launch**: 
pending Stripe Atlas (~2 week lead) + +**Live infrastructure deltas (post-sweep):** +- Migration runner safety fix landed (#212) — `*.down.sql` filter; was wiping `workspace_auth_tokens` on every restart +- Workspace auth tokens now revoked on workspace delete (#110) +- All known unauth admin routes gated; #138 canvas regression resolved via field-level authz + `CanvasOrBearer` middleware +- Self-hosted Mac mini CI runner replaced GH-hosted Linux to bypass private-repo Actions billing cap; `FLY_API_TOKEN` rotated to a deploy token scoped to `molecule-tenant` after the personal token was invalidated by `flyctl auth login` during the 2026-04-15 cleanup of the cryptominer installed on 2025-12-06 +- `/legal/{terms,privacy,dpa,acceptable}` live at `https://app.moleculesai.app/legal/*` **Known open issues on the live system:** -- fly-replay state format iteration: Fly's proxy returned 502 on `state=org-id=` (second `=`); fix dropped the prefix, PRs `molecule-controlplane#8` + `molecule-monorepo#88` in flight to make bare UUID work end-to-end - Tenant `/workspaces` returns Neon pooler warnings (`unnamed prepared statement does not exist`) — lib/pq + Neon pooler incompatibility, tracked for lib/pq → pgx migration in a later phase +- `#160` Claude Max OAuth quota exhausted on the agent-fleet token until 2026-04-17 23:00 UTC; mitigations: wait, upgrade plan, OR switch workspace containers to `ANTHROPIC_API_KEY` env var +- `#191` self-hosted runner persistent-state docs (P3, low urgency) +- `#199` Fly registry token — **resolved** in the 2026-04-15 sweep but `publish-platform-image` re-run pending runner availability **Companion repo:** `Molecule-AI/molecule-controlplane` (private). n8n-style open-core split: this public repo stays OSS (tenant binary + plugins + channels, contributable surface); control plane (orgs / signup / billing / provisioner / routing) is private. See `molecule-controlplane/PLAN.md` for its roadmap.
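
The ordering bug behind the #212 entry in the PLAN.md hunk above follows from plain lexicographic sorting: within a migration pair, `.down.sql` sorts ahead of `.up.sql` because `d` precedes `u`. The sketch below only illustrates that ordering and the filter; it is Python, not the Go code in `platform/internal/db/postgres.go`, and the function name `runnable_migrations` and the filenames are hypothetical.

```python
def runnable_migrations(filenames):
    """Return the files a boot-time runner should apply, in sorted order.

    Illustrative only: keeps *.sql files, drops *.down.sql rollbacks,
    then sorts alphabetically, mirroring the behaviour the diff describes.
    """
    ups = [
        name
        for name in filenames
        if name.endswith(".sql") and not name.endswith(".down.sql")
    ]
    return sorted(ups)


# Hypothetical file names, chosen only to show the ordering.
files = ["018_example.up.sql", "018_example.down.sql", "001_init.sql"]

# Without the filter, the rollback sorts ahead of its own .up.sql partner.
unfiltered = sorted(files)
print(unfiltered)

# With the filter, only forward migrations remain.
print(runnable_migrations(files))
```

Without the filter, `018_example.down.sql` lands before `018_example.up.sql`, so a runner that executes every file in order applies the rollback and then the forward migration on each boot. That is the pattern the diff blames for wiping `workspace_auth_tokens`.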
diff --git a/docs/edit-history/2026-04-15.md b/docs/edit-history/2026-04-15.md index 47547eae..d8fcd779 100644 --- a/docs/edit-history/2026-04-15.md +++ b/docs/edit-history/2026-04-15.md @@ -35,3 +35,231 @@ each tenant Fly Machine from this image. - `.github/workflows/publish-platform-image.yml` — new. - `CLAUDE.md` — tick-9 block for the new CI workflow. - `PLAN.md` — new "Recently launched (2026-04-15 tick-9)" entry. + +--- + +## Overnight sweep (2026-04-15 16:30–19:10 UTC, ticks 17–30+) + +One long session that started with a malware discovery, pivoted through a +half-day of security triage, landed ~27 PRs across both repos, and ended +with a self code-review cleanup round. Chronological order below, compressed +to the load-bearing details so future ticks can grep this file instead of +re-reading the JSONL cron-learnings stream. + +### Security: malware cleanup + Fly credential rotation + +Discovered `xmrig` cryptominer installed Dec 6 2025 via commodity +npm-dropper, running out of `/var/tmp/.X11-unix/xmrig-6.24.0/` as +`systemd-udevd` (camouflaged Linux daemon name on a Mac mini). Crontab +entry `*/10 * * * *` had been firing every 10 min for ~4 months until +tonight — ~17,500 launches. Wiped crontab, removed payload, rotated +`FLY_API_TOKEN` + `CLAUDE_CODE_OAUTH_TOKEN` + `GRAFANA_PROM_TOKEN`. +Mining-only payload (no backdoor found): no SSH auth-keys, no +LaunchAgents, no extra shell hooks, no other xmrig copies. However, +rotating the personal Fly token via `flyctl auth login` invalidated the +copy still stored in GitHub Actions secrets, which surfaced much later +as the #199 publish workflow 401. **Operator rule of thumb: always use +`flyctl tokens create deploy -a ` for CI, never a personal auth token.** + +### Self-hosted CI runner migration + +#186 switched every `ci.yml` job + `publish-platform-image.yml` from +`runs-on: ubuntu-latest` to `[self-hosted, macos, arm64]` (Apple-silicon +Mac mini `hongming-m1-mini`).
Non-trivial adaptations: +- Replaced GH Actions `services: postgres/redis` (Linux-only) with + inline `docker run` with `PG_CONTAINER` / `REDIS_CONTAINER` env vars + and `docker rm -f` teardown in `if: always()`. Ports 15432/16379 + to avoid collision with host services. +- `ludeeus/action-shellcheck` (Docker action, Linux-only) → fallback + to local `brew install shellcheck` + `find | xargs shellcheck`. +- `actions/setup-python@v5` hardcodes `/Users/runner/hostedtoolcache` + (non-overridable — upstream limitation in the prebuilt setup.sh from + `actions/python-versions`). Bypassed with a `Verify Python 3.11 + (Homebrew)` step that prepends `/opt/homebrew/opt/python@3.11/bin` + to `$GITHUB_PATH`. One-time runner prep: `brew install python@3.11`. +- `publish-platform-image.yml` adds `docker/setup-qemu-action@v3` + + `platforms: linux/amd64` explicit because the runner is arm64 and + Fly tenant machines are amd64. + +Controlplane PR #28 mirrored the same migration on its own single-job +ci.yml (1-line `runs-on` swap — no matrix adaptations needed). + +Known runner rough edges tracked as follow-ups: #191 (persistent-state +docs), #199 (Fly registry 401 — resolved by minting a deploy token +scoped to `molecule-tenant`, tokens table previously empty). 
+ +### Security fixes — auth gating + +Closed a cluster of unauthenticated-route findings surfaced by the +Security Auditor's hourly audit: + +| PR | Issue | Fix | +|---|---|---| +| #94 | #C6 | RFC-1918 + link-local in registry URL validator | +| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) | +| #102 | — | ancestor↔descendant A2A for hierarchy routing | +| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import | +| #110 | — | revoke workspace_auth_tokens on workspace delete | +| #119 | — | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests | +| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) | +| #155 | #151 | wire SecurityHeaders middleware | +| #167 | #164 CRIT #165 HIGH #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) | +| #185 | #180 | AdminAuth on GET /approvals/pending | +| #200 | #190 HIGH | AdminAuth on POST /templates/import | +| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) | +| #209 | #169 C2 | source_id spoof defense in activity.Report | +| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime | + +Rejected PR #194 (Origin-fallback approach) because it would have +re-opened #164 CRITICAL to curl-based spoofing. #168 correctly fixed +via the narrower route-split in #203. + +Rejected PR #169 (large C1-C6 batch) because 4/7 findings were +duplicates of already-merged work and migration 022 numbering +collided with 022_workspace_schedules_source. Cherry-picked the one +genuinely new fix (C2 source_id spoof check) into #209 and closed +#169. + +### Security fixes — data integrity + +- **#212** CRITICAL migration-runner bug: `RunMigrations` globbed + `*.sql` and sorted alphabetically, running `.down.sql` BEFORE + `.up.sql` on every boot. 
Wiped `workspace_auth_tokens` + two other + pairs on every platform restart, regressing AdminAuth to fail-open + bootstrap mode. Filter to skip `.down.sql` + unit test in + `postgres_migrate_test.go`. +- **#224** YAML injection in `generateDefaultConfig` — body.Name + concatenated into YAML without escaping. Fixed by emitting as + double-quoted YAML scalar with all control chars escaped. Structural + test (parse + verify key count) instead of substring match. +- **#236** log-injection in the #209 security-event log line — + attacker-controlled `source_id` echoed via `%s` allowed newline + injection of fake log entries. Switched to `%q`. + +### Infrastructure + +- **AWS KMS envelope encryption** (controlplane PR #21). Per-secret DEK + via `kms.GenerateDataKey`; blob layout `[0x02][dek_len][enc_dek][nonce][ct]`. + Dual-mode: v2 blobs via KMS, legacy blobs via static `SECRETS_ENCRYPTION_KEY`. + Auto-routes by leading byte; no rewrap migration needed. +- **Grafana Cloud remote-write** (controlplane PR #19 + #20). In-process + counter registry + hand-rolled protobuf encoder. `cp_requests_total` + emitted on every request. Push loop to + `prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push` with + Basic auth. User 3116422, token via GRAFANA_PROM_TOKEN Fly secret. +- **/cp/status deep-probe** (controlplane PR #24) for Betterstack. + Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from + `/health`. +- **Legal pages** (controlplane PR #26/#27). Public `/legal/{terms, + privacy,dpa,acceptable}` served from embedded markdown. Dark-theme + HTML shell, minimal markdown→HTML renderer (no dep), path-traversal + safe via slug allowlist. Smoke covered. +- **Scheduler reliability**: #95 panic-recover in tick(), #149 + independent heartbeat goroutine so long fires don't look stale on + /admin/liveness, #207 concurrency-aware skip when workspace + active_tasks>0. + +### Features + +- **#205** idle-loop reflection pattern in workspace-template. 
Opt-in + via `idle_prompt` + `idle_interval_seconds` in `config.yaml`. + Self-sends the idle prompt via platform A2A proxy every interval + while `heartbeat.active_tasks == 0`. Hermes/Letta shape. +- **#208** Hermes Phase 1 multi-provider. 15 providers via + `adapters/hermes/providers.py` registry (Nous, OpenRouter, OpenAI, + Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, + Together, Fireworks, Mistral). Back-compat with PR2 key resolution + preserved. 26 tests. +- **#198** A2A protocol compliance batch closing #173/#174/#175: + `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`, + `stateTransitionHistory=True` in AgentCapabilities. *Note:* wired + `push_sender=PushNotificationSender()` and this crashed on startup + because PushNotificationSender is an abstract base class — reverted + in #210. +- **#186** self-hosted macOS runner migration (described above). + +### Code-review self-audit + +Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split +follow-ups into two PRs: + +- **#228** (Go side): CanvasOrBearer invalid-bearer fall-through fix, + `short()` helper to replace unsafe `[:N]` slices in scheduler.go, + security-event log on source_id spoof. 6 new tests: + `TestShort_helper`, `TestRecordSkipped_writesSkippedStatus`, + `TestRecordSkipped_shortWorkspaceIDNoPanic`, + `TestActivityHandler_Report_SourceIDSpoofRejected`, + `TestActivityHandler_Report_MatchingSourceIDAccepted`, + `TestHistory_IncludesErrorDetail`. +- **#232** (Python/docs): idle-loop hardening + (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamped, + typed `HTTPError`/`URLError`/catch-all, `add_done_callback` for + fire-and-forget error logging). `idle_prompt` documented in + `org-templates/molecule-dev/org.yaml` defaults. New + `docs/runbooks/admin-auth.md` documenting the three middleware + variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth + per-id) + the three-question test for adding routes to + CanvasOrBearer. 
+ +### Other merged fixes + +- #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0) +- #123 dark-theme a11y (input contrast, search dialog, kbd hints) +- #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav) +- #139 code-review plugins for Dev Lead + QA Engineer +- #149 scheduler heartbeat pulse (#140) +- #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents) +- #157 ecosystem-watch PM sweep +- #161 e2e test mock fix for #125 EXISTS probe +- #187 `SetTrustedProxies(nil)` closes #179 rate-limit bypass +- #188 e2e auth headers on `/events` + `/bundles/export` post-#167 +- #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression) +- #192 test regression lock for #170 `DELETE /secrets/:key` +- #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash) +- #206 surface cron `error_detail` in schedule history (#152 problem B) +- #210 revert PushNotificationSender ABC crash (#204) +- #211 migration runner skips `.down.sql` (data loss regression) +- #216 enable idle-loop pilot on Technical Researcher +- #223 reno-stars default plugins to browser-automation +- #225 auth_headers() on /registry/register (#215) +- #227 unit tests for plugins_install_pipeline.go (37 cases, #217) +- #231 Claude SDK stderr probe for rate-limit error attribution (#160) +- #235 auth_headers() on initial_prompt + idle loop (#220) + +### Issues closed (by merge or factual correction) + +#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127, +#128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144, +#145, #146, #147, #148, #151, #152 prob B, #153, #154, #156, #160 +(diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172, +#173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191 +(accepted risk), #195, #199 (fixed Fly token rotation), #201, #202, +#204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229, +#230, #234. 
+ +### Outstanding — needs user + +- **#126** Slack adapter (Phase-H product decision) +- **#160** Claude Max OAuth quota (wait for reset / upgrade / API key switch) +- **#191** self-hosted runner persistent-state docs (P3) +- **#199** Fly registry token — **resolved this session** but re-run + of `publish-platform-image` pending runner capacity +- Stripe Atlas application (launch blocker, 2-week lead) + +### Test counts (post-session) + +- Platform Go: **816 test functions** (+70 this session — scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234) +- Canvas vitest: **453 tests** (+0 structure, +0 new tests this session — UI/a11y patches) +- Workspace-template pytest: **1180 tests** (+40 this session — Hermes providers, a2a cancel, idle loop implicit) +- MCP server jest: **97 tests** (unchanged) + +### Infra notes (not in any repo) + +- FLY_API_TOKEN GH Actions secret rotated to a deploy token scoped to + `molecule-tenant` (1-year expiry). Docs runbook update needed. +- Mac mini runner env has `RUNNER_TOOL_CACHE` + `AGENT_TOOLSDIRECTORY` + overrides. Python install via Homebrew is required one-time prep. +- `molecule-monorepo` still private; Actions billing workaround is + the self-hosted runner rather than flipping public or raising the + cap. +
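
The idle-loop fire-timeout clamp that #232 hardened is given in the CLAUDE.md hunk as `max(60, min(300, idle_interval_seconds))`. A minimal Python sketch of that formula, assuming it is applied to the configured interval exactly as written; the function name `idle_fire_timeout` is hypothetical and only the expression itself comes from the source.

```python
def idle_fire_timeout(idle_interval_seconds: int) -> int:
    """Clamp the idle-fire timeout into the 60..300 second window."""
    return max(60, min(300, idle_interval_seconds))


# A short interval is raised to the floor, a mid-range interval passes
# through unchanged, and the documented default of 600 hits the ceiling.
for interval in (30, 120, 600):
    print(interval, idle_fire_timeout(interval))
```

With the documented default `idle_interval_seconds` of 600, the sketch yields a 300 second timeout, so under this reading a single idle fire never waits longer than half the default interval.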