Merge pull request #238 from Molecule-AI/docs/sync-2026-04-15-overnight-sweep
docs: sync 2026-04-15 overnight sweep — CLAUDE.md + PLAN.md + edit-history
commit 51e3556efe
39
CLAUDE.md
@@ -232,9 +232,9 @@ OPENAI_API_KEY=... bash scripts/test-team-e2e.sh # E2E: Multi-template
|
||||
|
||||
### Unit Tests
|
||||
```bash
|
||||
cd platform && go test -race ./... # 746 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth, middleware — sqlmock + miniredis; +6 on 2026-04-14 tick-8 for TestTenantGuard_* covering MOLECULE_ORG_ID passthrough/match/mismatch/missing/allowlist/exact-match (#78, Phase 32 PR #1); prior: +9 tick-7 for category_routing + schedules.source; +5 tick-6 for plugins UNION; +6 tick-4 for auto-restart + restart-context branches)
|
||||
cd canvas && npm test # 357 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements)
|
||||
cd workspace-template && python -m pytest -v # 1140 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging)
|
||||
cd platform && go test -race ./... # 816 Go tests (handlers, registry, provisioner, CLI, delegation, org, channels, wsauth, middleware, scheduler, crypto, db — sqlmock + miniredis; +70 on 2026-04-15 overnight sweep across the security fix cluster: CanvasOrBearer middleware tests, scheduler recordSkipped + short() helper, source_id spoof rejection + log injection regression guards, YAML-parse structural injection tests, migration runner .down.sql filter, plugins_install_pipeline_test.go 37-case suite, resolveInsideRoot coverage; +6 on 2026-04-14 tick-8 for TestTenantGuard_*; prior: +9 tick-7 for category_routing + schedules.source)
|
||||
cd canvas && npm test # 453 Vitest tests (store, components, hydration, buildTree, secrets API, org template import, ConfirmDialog singleButton + 7 native-dialog replacements, WCAG critical batch — ARIA live toasts + dialog focus trap + keyboard nav)
|
||||
cd workspace-template && python -m pytest -v # 1180 pytest tests (adds platform_auth token store for Phase 30.1, memory_write activity logging, Hermes multi-provider registry 26 tests, a2a_executor cancel emits canceled event, idle loop + initial_prompt auth_headers())
|
||||
cd sdk/python && python -m pytest -v # 132 SDK tests (agentskills.io spec validator, CLI, AgentskillsAdaptor round-trip, workspace/org/channel validators, RemoteAgentClient Phase 30 flows)
|
||||
cd mcp-server && npm test # 97 Jest tests (per-domain tool modules + smoke test on tool count)
|
||||
```
|
||||
@@ -350,6 +350,19 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
|
||||
|
||||
**Important:** Initial prompts must NOT send A2A messages (delegate_task, send_message_to_user) — other agents may not be ready. Keep them local: clone repo, read docs, save to memory, wait for tasks.
|
||||
|
||||
### Idle Loop (#205 — reflection-on-completion)
|
||||
Opt-in pattern: when `idle_prompt` is non-empty in `config.yaml`, the workspace self-sends it every `idle_interval_seconds` (default 600) **while `heartbeat.active_tasks == 0`**. This is the Hermes/Letta shape from the 2026-04-15 agent-framework survey. Cost becomes event-driven: the idle check is local (no LLM call), and the prompt only fires when the workspace genuinely has nothing to do. Set it per workspace or as an `org.yaml` default. The fire timeout is clamped to `max(60, min(300, idle_interval_seconds))`. Both the idle loop and `initial_prompt` self-posts include `auth_headers()` so they work in multi-tenant mode (#220 / PR #235). Pilot enabled on Technical Researcher (#216).
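A minimal Python sketch of the fire-timeout clamp described above. Only the expression `max(60, min(300, idle_interval_seconds))` comes from this section; the function name `idle_fire_timeout` is hypothetical.

```python
def idle_fire_timeout(idle_interval_seconds: int) -> int:
    """Clamp the per-fire timeout to the range [60, 300] seconds."""
    return max(60, min(300, idle_interval_seconds))


# Short intervals are raised to the 60 s floor, long ones capped at 300 s.
for interval in (30, 120, 600):
    print(interval, idle_fire_timeout(interval))
```

With the default interval of 600 the timeout lands on the 300-second cap, so a single fire can never occupy more than half of the default idle window.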
|
||||
|
||||
### Admin auth middleware variants
|
||||
Three Gin middleware variants gate server-side routes; pick the right one for each route. The full contract is in `docs/runbooks/admin-auth.md`.
|
||||
|
||||
- **`middleware.AdminAuth(db.DB)`** — strict bearer-only. Used for any route where a forged request could leak prompts/memory, create/mutate workspaces, or leak ops intel. Lazy-bootstrap fail-open when `HasAnyLiveTokenGlobal` returns 0.
|
||||
- **`middleware.CanvasOrBearer(db.DB)`** — accepts bearer OR Origin matching `CORS_ORIGINS`. Used ONLY for cosmetic routes where a forged request has zero data/security impact. Currently only on `PUT /canvas/viewport`. **Do not extend** without rereading the runbook — PR #194 was rejected because adding this to `/bundles/import` would have re-opened #164 CRITICAL.
|
||||
- **`middleware.WorkspaceAuth(db.DB)`** — binds a bearer to `:id`. Workspace A's token cannot hit workspace B's sub-routes. Used for the entire `/workspaces/:id/*` group except the A2A proxy (which has its own `CanCommunicate` layer).
|
||||
|
||||
### Migration runner (`platform/internal/db/postgres.go`)
|
||||
`RunMigrations` globs `*.sql` in `migrationsDir`, filters out `.down.sql` files, sorts alphabetically, then `DB.Exec()`s each on boot. The filter is load-bearing: before PR #212 every boot ran `.down.sql` **before** `.up.sql` (alphabetical sort puts "d" before "u"), wiping `workspace_auth_tokens` + other pair-migration tables and silently regressing AdminAuth to fail-open. All `.up.sql` files must be **idempotent** (`CREATE TABLE IF NOT EXISTS`, `ALTER TABLE ... IF NOT EXISTS`) because the runner re-applies every migration on every boot. A proper `schema_migrations` tracking table is tracked as a Phase-H cleanup.
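The ordering hazard is easy to reproduce. The sketch below is illustrative Python, not the Go code in `postgres.go`; `select_migrations` and the file names are hypothetical stand-ins for a pair migration.

```python
def select_migrations(filenames):
    """Drop *.down.sql files, then sort alphabetically (the post-#212 behaviour)."""
    return sorted(name for name in filenames if not name.endswith(".down.sql"))


# Hypothetical directory listing with one single-file migration and one pair.
files = ["018_example.up.sql", "018_example.down.sql", "001_init.sql"]

# Without the filter, plain alphabetical order puts the .down.sql first.
unfiltered = sorted(files)
print(unfiltered)

# With the filter, only the forward migrations remain.
print(select_migrations(files))
```

Because every surviving file is re-applied on each boot, the filter only makes boots safe in combination with idempotent `.up.sql` files.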
|
||||
|
||||
### Workspace Lifecycle
|
||||
`provisioning` → `online` (on register) → `degraded` (error_rate > 0.5) → `online` (recovered) → `offline` (Redis TTL expired OR health sweep detects dead container) → auto-restart → `provisioning` → ... → `removed` (deleted). Any state → `paused` (user pauses) → `provisioning` (user resumes). Paused workspaces skip health sweep, liveness monitor, and auto-restart.
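The lifecycle can be read as a small transition table. This is a hedged Python sketch for reasoning about the states, not the platform's implementation; `TRANSITIONS` and `can_transition` are hypothetical names, and treating `removed` as terminal and reachable from any state is an assumption.

```python
# Transitions named explicitly in the lifecycle line above.
TRANSITIONS = {
    "provisioning": {"online"},          # on register
    "online": {"degraded", "offline"},   # error_rate > 0.5, or TTL expiry / dead container
    "degraded": {"online"},              # recovered
    "offline": {"provisioning"},         # auto-restart
    "paused": {"provisioning"},          # user resumes
}


def can_transition(src: str, dst: str) -> bool:
    if src == "removed":
        return False  # assumption: removed is terminal
    if dst in ("paused", "removed"):
        # Pausing is allowed from any state; deletion from any state is an assumption.
        return True
    return dst in TRANSITIONS.get(src, set())
```

A paused workspace can only go back through `provisioning`, which matches the note that paused workspaces skip the health sweep, liveness monitor, and auto-restart.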
|
||||
|
||||
@@ -361,7 +374,7 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
|
||||
|--------|------|---------|
|
||||
| GET | /health | inline |
|
||||
| GET | /metrics | metrics.Handler() — Prometheus text format (v0.0.4); no auth, scrape-safe |
|
||||
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go |
|
||||
| POST/GET/PATCH/DELETE | /workspaces[/:id] | workspace.go — GET /workspaces + POST /workspaces + DELETE /workspaces/:id are behind `AdminAuth` (#99/#167 C1+C20). PATCH /workspaces/:id is on the open router but `WorkspaceHandler.Update` enforces **field-level authz** (#138/PR #162): cosmetic fields (name, role, x, y, canvas) pass through; sensitive fields (tier, parent_id, runtime, workspace_dir) require a valid bearer token whenever any live token exists. POST /workspaces uses `resolveInsideRoot` on payload.Template (#226 / PR #233). Create handler generates the name as a double-quoted YAML scalar to block #221 injection |
|
||||
| GET/PATCH | /workspaces/:id/config | workspace.go |
|
||||
| GET/POST | /workspaces/:id/memory | workspace.go |
|
||||
| DELETE | /workspaces/:id/memory/:key | workspace.go |
|
||||
@@ -405,9 +418,10 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
|
||||
| POST | /webhooks/:type | channels.go (incoming social webhook) |
|
||||
| GET | /workspaces/:id/shared-context | templates.go |
|
||||
| GET/PUT/DELETE | /workspaces/:id/files[/*path] | templates.go |
|
||||
| GET/PUT | /canvas/viewport | viewport.go |
|
||||
| GET | /canvas/viewport | viewport.go — open (cosmetic, bootstrap-friendly) |
|
||||
| PUT | /canvas/viewport | viewport.go — `CanvasOrBearer` middleware (#203): accepts bearer OR Origin matching `CORS_ORIGINS`. Cosmetic-only — worst case viewport corruption, recovered by page refresh. DO NOT use this middleware for any route that leaks data or creates resources (see `docs/runbooks/admin-auth.md`) |
|
||||
| GET | /templates | templates.go |
|
||||
| POST | /templates/import | templates.go |
|
||||
| POST | /templates/import | templates.go — `AdminAuth` (#190 / PR #200) |
|
||||
| POST | /registry/register | registry.go |
|
||||
| POST | /registry/heartbeat | registry.go |
|
||||
| POST | /registry/update-card | registry.go |
|
||||
@@ -419,17 +433,20 @@ Agents can auto-execute a prompt on startup before any user interaction. Configu
|
||||
| GET/POST/DELETE | /workspaces/:id/plugins[/:name] | plugins.go — list, install (`{"source":"scheme://spec"}`), uninstall per-workspace |
|
||||
| GET | /workspaces/:id/plugins/available | plugins.go (filtered by workspace runtime) |
|
||||
| GET | /workspaces/:id/plugins/compatibility?runtime=X | plugins.go (preflight runtime-change check) |
|
||||
| GET | /bundles/export/:id | bundle.go |
|
||||
| POST | /bundles/import | bundle.go |
|
||||
| GET | /bundles/export/:id | bundle.go — `AdminAuth` (#165 / PR #167) |
|
||||
| POST | /bundles/import | bundle.go — `AdminAuth` (#164 CRITICAL / PR #167) |
|
||||
| GET | /org/templates | org.go (list available org templates) |
|
||||
| POST | /org/import | org.go (import entire org hierarchy from YAML) || GET | /events[/:workspaceId] | events.go |
|
||||
| POST | /org/import | org.go — `AdminAuth` + `resolveInsideRoot` path sanitiser (#103 / PR #106) |
|
||||
| GET | /events | events.go — `AdminAuth` (#165 / PR #167) |
|
||||
| GET | /events/:workspaceId | events.go — `AdminAuth` (#165 / PR #167) |
|
||||
| GET | /admin/liveness | inline — `AdminAuth` (#166 / PR #167). Per-subsystem `supervised.Snapshot()` ages; operators check this before debugging stuck scheduler / heartbeat goroutines |
|
||||
| GET | /ws | socket.go |
|
||||
|
||||
## Database
|
||||
|
||||
23 migration files in `platform/migrations/` (up to `022_workspace_schedules_source` — 2026-04-14 tick-7, PR #76). Key tables: `workspaces` (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), `canvas_layouts` (x/y position), `structure_events` (append-only event log), `activity_logs` (A2A communications, task updates, agent logs, errors), `workspace_schedules` (cron tasks with expression, timezone, prompt, run history, and `source` — `'template'` for org/import-seeded, `'runtime'` for Canvas/API-created; org/import is additive and only refreshes template-source rows on re-import), `workspace_channels` (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), `agents`, `workspace_secrets`, `global_secrets`, `agent_memories` (HMA scoped memory), `approvals`.
|
||||
Migration files in `platform/migrations/` (latest: `022_workspace_schedules_source` — 2026-04-14 tick-7, PR #76). Each later migration is a `.up.sql`/`.down.sql` pair. Key tables: `workspaces` (core entity with status, runtime, agent_card JSONB, heartbeat columns, current_task, awareness_namespace, workspace_dir), `canvas_layouts` (x/y position), `structure_events` (append-only event log), `activity_logs` (A2A communications, task updates, agent logs, errors — `error_detail` is now populated by `scheduler.fireSchedule` so `GET /workspaces/:id/schedules/:id/history` can surface why a cron run failed, #152 / PR #206), `workspace_schedules` (cron tasks with expression, timezone, prompt, run history, `source` — `'template'` for org/import-seeded, `'runtime'` for Canvas/API-created, and `last_status` now includes `'skipped'` when `scheduler.fireSchedule` concurrency-aware-skips a busy workspace, #115 / PR #207), `workspace_channels` (social channel integrations — Telegram, Slack, etc., with JSONB config and allowlist), `agents`, `workspace_secrets`, `global_secrets`, `workspace_auth_tokens` (Phase 30.1 bearer tokens; now auto-revoked on workspace delete, #110), `agent_memories` (HMA scoped memory), `approvals`.
|
||||
|
||||
The platform auto-discovers and runs migrations on startup from several candidate paths.
|
||||
The platform auto-discovers and runs migrations on startup from several candidate paths. The runner filters out `*.down.sql` files — see the "Migration runner" section above for the history of PR #212 and why this filter is load-bearing.
|
||||
|
||||
<!-- AWARENESS_RULES_START -->
|
||||
# Project Memory (Awareness MCP)
|
||||
|
||||
89
PLAN.md
@@ -247,6 +247,66 @@ point for "what else is out there."
|
||||
- **GitHub issue #15** — Provisioner: auto-refresh `CLAUDE_CODE_OAUTH_TOKEN` from `global_secrets` on workspace restart → **DONE** via PR #64 (`SetGlobal` / `DeleteGlobal` now fan out `RestartByID` to every affected workspace).
|
||||
- **GitHub issue #19 Layer 1** — Platform-generated restart context → **DONE** via PR #65 (synthetic A2A `message/send` with `metadata.kind=restart_context`, `system:restart-context` caller prefix, 30s re-register wait). Layer 2 deferred to issue #66 (see Backlog item 15 above).
|
||||
|
||||
### Recently launched (2026-04-15 overnight sweep — ticks 17–30+, ~27 PRs)
|
||||
|
||||
**Security hardening cluster.** Roughly half the sweep was closing auth gaps surfaced by the Security Auditor's hourly audit cron:
|
||||
- `#94` RFC-1918 + link-local in registry URL validator
|
||||
- `#99` AdminAuth gate on `GET /workspaces` (topology leak / #104)
|
||||
- `#106` path-sanitize + admin-gate `POST /org/import` (#103 HIGH)
|
||||
- `#110` revoke `workspace_auth_tokens` on workspace delete
|
||||
- `#119` IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests
|
||||
- `#162` field-level authz on `PATCH /workspaces/:id` (#138 — cosmetic vs sensitive split)
|
||||
- `#155` wire existing `SecurityHeaders` middleware into router
|
||||
- `#167` gate 6 previously-unauth routes behind `AdminAuth` (#164 CRITICAL anon bundles/import; #165 HIGH events+bundles/export topology leak; #166 MED viewport+liveness)
|
||||
- `#185` `AdminAuth` on `GET /approvals/pending` (#180)
|
||||
- `#200` `AdminAuth` on `POST /templates/import` (#190 HIGH)
|
||||
- `#203` `CanvasOrBearer` middleware — route-split for #168 canvas regression, only `PUT /canvas/viewport`; rejected PR #194's broader Origin-fallback approach because it would have re-opened #164
|
||||
- `#209` source_id spoof defense in `activity.Report` (cherry-picked from the rejected #169 batch)
|
||||
- `#233` `resolveInsideRoot` on `POST /workspaces` template/runtime fields (#226 MED)
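The address ranges named in `#94` and `#119` can be checked with a few lines of standard-library Python. This is a sketch of the idea only, not the registry URL validator itself (which is Go); `BLOCKED_NETWORKS` and `is_blocked` are hypothetical names, and the list contains only the ranges mentioned above.

```python
import ipaddress

BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918 private ranges
        "169.254.0.0/16",                                  # IPv4 link-local
        "fe80::/10", "::1/128", "fc00::/7",                # IPv6 ranges from #119
    )
]


def is_blocked(host_ip: str) -> bool:
    """Return True when the literal IP falls inside any blocked network."""
    addr = ipaddress.ip_address(host_ip)
    # Membership tests across IP versions simply evaluate to False.
    return any(addr in net for net in BLOCKED_NETWORKS)
```

A real validator would also need to resolve hostnames before checking, since a URL can point at a blocked address through DNS.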
|
||||
|
||||
**Data integrity.** Three bugs that would have silently corrupted state:
|
||||
- `#212` **CRITICAL** migration-runner bug — `RunMigrations` globbed `*.sql` and alphabetically ran `.down.sql` BEFORE `.up.sql` on every boot, wiping `workspace_auth_tokens` (and 018/019 pairs). Filter fix + unit test in `postgres_migrate_test.go`.
|
||||
- `#224` YAML injection in `generateDefaultConfig` — body.Name now emitted as a double-quoted YAML scalar with all control chars escaped. Structural test (parse + verify key count).
|
||||
- `#236` log-injection in the #209 security-event log line — attacker-controlled source_id echoed via `%s` allowed fake log entries; switched to `%q`.
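The `#236` fix is easiest to see side by side. The sketch uses Python's `%r` as a rough analogue of Go's `%q`; the log message text and function names are hypothetical.

```python
def log_unquoted(source_id: str) -> str:
    # Mirrors the vulnerable pattern: the value is interpolated verbatim.
    return "source_id spoof rejected: %s" % source_id


def log_quoted(source_id: str) -> str:
    # Quoting escapes the newline, so the value stays on one log line.
    return "source_id spoof rejected: %r" % source_id


payload = "abc\n2026-04-15 INFO admin login ok"
print(len(log_unquoted(payload).splitlines()))
print(len(log_quoted(payload).splitlines()))
```

The unquoted form yields two log lines, the second one fully attacker-controlled; the quoted form keeps everything inside a single escaped literal.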
|
||||
|
||||
**CI / infra.**
|
||||
- `#186` + controlplane `#28` — every CI job migrated from `ubuntu-latest` to `[self-hosted, macos, arm64]` (Mac mini `hongming-m1-mini`). Non-trivial: `services:` replaced with inline `docker run` containers (ports 15432/16379), `actions/setup-python` bypassed via Homebrew python3.11 on `$GITHUB_PATH`, `docker/setup-qemu-action` added for cross-arch builds. Workaround for GH Actions billing cap on private repos.
|
||||
- `#149` independent heartbeat pulse goroutine so long cron fires don't look stale on `/admin/liveness` (#140)
|
||||
- `#211` migration runner regression (see #212 above — PR #212 is the fix)
|
||||
- **Fly registry `FLY_API_TOKEN`** rotated to a deploy token scoped to `molecule-tenant` (previously personal token, invalidated by `flyctl auth login` during the malware cleanup)
|
||||
|
||||
**Platform / Scheduler reliability.**
|
||||
- `#95` panic-recover in scheduler `tick()` + per-fire goroutines (closes #85)
|
||||
- `#207` concurrency-aware skip — `scheduler.fireSchedule` reads `workspaces.active_tasks` and advances `next_run_at` + records a `cron_run` row with `status='skipped'` instead of colliding with a busy agent (#115)
|
||||
- `#206` surface `error_detail` in schedule history API (#152 problem B)
|
||||
|
||||
**Workspace runtime features.**
|
||||
- `#205` idle-loop reflection pattern — opt-in `idle_prompt` + `idle_interval_seconds` in `config.yaml`; self-sends when `heartbeat.active_tasks == 0`. Hermes/Letta shape.
|
||||
- `#208` Hermes Phase 1 multi-provider registry — 15 providers via `adapters/hermes/providers.py` (Nous, OpenRouter, OpenAI, Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, Together, Fireworks, Mistral). 26 tests.
|
||||
- `#198` A2A protocol compliance batch (#173/#174/#175): `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`, `stateTransitionHistory=True` in AgentCapabilities. **Regression:** `push_sender=PushNotificationSender()` crashed on startup because PushNotificationSender is abstract — reverted in #210.
|
||||
- `#216` idle-loop pilot enabled on Technical Researcher workspace.
|
||||
- `#225` + `#235` `auth_headers()` on `/registry/register` + initial_prompt + idle loop self-posts (#215/#220)
|
||||
- `#231` Claude SDK stderr probe for proper rate-limit error attribution (#160 diagnostics)
|
||||
|
||||
**Controlplane (molecule-controlplane).**
|
||||
- `#19`+`#20` Grafana Cloud remote-write counter registry (`cp_requests_total`), push loop to `prometheus-prod-32-prod-ca-east-0.grafana.net`, Basic auth with user 3116422
|
||||
- `#21` AWS KMS envelope encryption — per-secret DEK via `GenerateDataKey`, dual-mode (v2 blobs via KMS, legacy via static key, auto-routes by leading byte)
|
||||
- `#24` `/cp/status` deep probe for Betterstack
|
||||
- `#26`+`#27` public `/legal/{terms,privacy,dpa,acceptable}` pages from embedded markdown + smoke coverage
|
||||
- Isolation red-team test suite + observability runbooks (Grafana dashboard, Betterstack, Stripe Atlas)
|
||||
|
||||
**Self code-review follow-ups (`#228` + `#232`).** Ran `/code-review` on the batch merges, surfaced 8 🟡 issues, split into Go (#228) and Python/docs (#232):
|
||||
- `CanvasOrBearer` invalid-bearer fall-through fix
|
||||
- `short()` helper replacing unsafe `[:N]` slices in `scheduler.go`
|
||||
- 6 new tests (`TestShort_helper`, `TestRecordSkipped_*`, `TestActivityHandler_Report_*`, `TestHistory_IncludesErrorDetail`)
|
||||
- idle-loop hardening (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamp, typed exception handling, `add_done_callback` for fire-and-forget error logging)
|
||||
- `idle_prompt` / `idle_interval_seconds` documented in `org.yaml` defaults
|
||||
- New `docs/runbooks/admin-auth.md` — the three middleware variants + three-question test for adding to `CanvasOrBearer`
|
||||
|
||||
**Test counts post-sweep:** +70 Go (816 total), +40 Python (1180 total), +0 Canvas vitest (453 unchanged — UI/a11y patches only).
|
||||
|
||||
**Outstanding (user action):** `#126` Slack adapter (Phase-H product decision), `#160` Claude Max OAuth quota (wait for 2026-04-17 23:00Z reset OR upgrade OR switch to ANTHROPIC_API_KEY), `#191` runner persistent-state docs (P3), `#199` Fly registry token (**resolved** this session but publish-platform-image re-run pending runner), Stripe Atlas application (launch blocker, 2-week lead).
|
||||
|
||||
### Recently launched (2026-04-15 tick-9)
|
||||
- **Phase 32 Phase B.2 (image pipeline)** — PR #80 (merged `c3cc8e87`) adds `.github/workflows/publish-platform-image.yml`: on every main-merge touching `platform/**`, builds `platform/Dockerfile` and pushes `ghcr.io/molecule-ai/platform:latest` + `:sha-<commit>` to GHCR. Paired with the private `molecule-controlplane` Fly + Neon provisioner (PR #3 there, merged `2e85d5ad`) that reads `TENANT_IMAGE` env and boots tenant Fly Machines from this image. Tick-8 docs-sync PR #79 (merged `d53a1287`) also landed.
|
||||
|
||||
@@ -368,20 +428,29 @@ self-hosted per-customer). Ordered by dependency + ROI.
|
||||
- Stripe billing scaffold deployed in orgs-only mode (no Stripe creds configured yet; webhook handler + signature verification code ready)
|
||||
- Domain: `moleculesai.app` (DNS not yet wired — subdomain routing works via `X-Molecule-Org-Slug` header pending Cloudflare)
|
||||
|
||||
**Phase status:**
|
||||
**Phase status (post 2026-04-15 overnight sweep):**
|
||||
- **A — Foundation** (accounts, tokens, domain): ✅ done
|
||||
- **B — Fly provisioner + Neon branching**: ✅ done (control plane + tenant machine config + services + healthchecks)
|
||||
- **C — WorkOS AuthKit scaffold**: ✅ done (live redirect to hosted signup); Phase C.2 (RequireSession on /cp/orgs + org-ownership check) pending
|
||||
- **D — Stripe billing scaffold**: ✅ code done; Phase D.2 (auth-scoped checkout + customer create) and D.3 (plan quotas) pending — not blocked on user
|
||||
- **E — Cloudflare + DNS `*.moleculesai.app`**: not started
|
||||
- **F — Sign-up UX + onboarding**: not started
|
||||
- **G — Observability + quotas + admin**: not started
|
||||
- **H — Hardening (KMS, isolation test suite, load test, legal)**: not started
|
||||
- **I — Launch**: not started
|
||||
- **B — Fly provisioner + Neon branching**: ✅ done
|
||||
- **C — WorkOS AuthKit scaffold + RequireSession + org-ownership check**: ✅ done
|
||||
- **D — Stripe billing scaffold + auth-scoped checkout + plan quotas**: ✅ code done; live keys pending Stripe Atlas
|
||||
- **E — Cloudflare + DNS `*.moleculesai.app` + per-tenant Vercel canvas**: ✅ done
|
||||
- **F — Sign-up UX + onboarding**: ✅ basic flow done (signup / org create / canvas redirect); polish + email pending
|
||||
- **G — Observability + quotas + admin**: ✅ Sentry + Grafana remote-write + `/cp/status` Betterstack probe + per-org rate limiter; admin panel `/cp/admin/*` pending
|
||||
- **H — Hardening**: ⏳ partial — AWS KMS envelope encryption ✅ (controlplane PR #21), tenant-isolation red-team CI gate ✅ (`isolation_test.go`), legal pages ✅ (`/legal/*` from controlplane PR #26); load test + Stripe Atlas application + status page custom domain pending
|
||||
- **I — Launch**: pending Stripe Atlas (~2 week lead)
|
||||
|
||||
**Live infrastructure deltas (post-sweep):**
|
||||
- Migration runner safety fix landed (#212) — `*.down.sql` filter; was wiping `workspace_auth_tokens` on every restart
|
||||
- Workspace auth tokens now revoked on workspace delete (#110)
|
||||
- All known unauth admin routes gated; #138 canvas regression resolved via field-level authz + `CanvasOrBearer` middleware
|
||||
- Self-hosted Mac mini CI runner replaced GH-hosted Linux to bypass private-repo Actions billing cap; `FLY_API_TOKEN` rotated to a deploy token scoped to `molecule-tenant` after the personal token was invalidated by `flyctl auth login` during cleanup of the cryptominer installed on 2025-12-06
|
||||
- `/legal/{terms,privacy,dpa,acceptable}` live at `https://app.moleculesai.app/legal/*`
|
||||
|
||||
**Known open issues on the live system:**
|
||||
- fly-replay state format iteration: Fly's proxy returned 502 on `state=org-id=<uuid>` (second `=`); fix dropped the prefix, PRs `molecule-controlplane#8` + `molecule-monorepo#88` in flight to make bare UUID work end-to-end
|
||||
- Tenant `/workspaces` returns Neon pooler warnings (`unnamed prepared statement does not exist`) — lib/pq + Neon pooler incompatibility, tracked for lib/pq → pgx migration in a later phase
|
||||
- `#160` Claude Max OAuth quota exhausted on the agent-fleet token until 2026-04-17 23:00 UTC; mitigations: wait, upgrade plan, OR switch workspace containers to `ANTHROPIC_API_KEY` env var
|
||||
- `#191` self-hosted runner persistent-state docs (P3, low urgency)
|
||||
- `#199` Fly registry token — **resolved** in the 2026-04-15 sweep but `publish-platform-image` re-run pending runner availability
|
||||
|
||||
**Companion repo:** `Molecule-AI/molecule-controlplane` (private). n8n-style open-core split: this public repo stays OSS (tenant binary + plugins + channels, contributable surface); control plane (orgs / signup / billing / provisioner / routing) is private. See `molecule-controlplane/PLAN.md` for its roadmap.
|
||||
|
||||
|
||||
@@ -35,3 +35,231 @@ each tenant Fly Machine from this image.
|
||||
- `.github/workflows/publish-platform-image.yml` — new.
|
||||
- `CLAUDE.md` — tick-9 block for the new CI workflow.
|
||||
- `PLAN.md` — new "Recently launched (2026-04-15 tick-9)" entry.
|
||||
|
||||
---
|
||||
|
||||
## Overnight sweep (2026-04-15 16:30–19:10 UTC, ticks 17–30+)
|
||||
|
||||
One long session that started with a malware discovery, pivoted through a
half-day of security triage, landed ~27 PRs across both repos, and ended
with a self code-review cleanup round. Chronological order below, compressed
to the load-bearing details so future ticks can grep this file instead of
re-reading the JSONL cron-learnings stream.
|
||||
|
||||
### Security: malware cleanup + Fly credential rotation
|
||||
|
||||
Discovered an `xmrig` cryptominer, installed on Dec 6 2025 via a commodity
npm dropper, running out of `/var/tmp/.X11-unix/xmrig-6.24.0/` as
`systemd-udevd` (a camouflaged Linux daemon name on a Mac mini). A crontab
entry `*/10 * * * *` had been firing every 10 min for ~4 months until
tonight (~17,500 launches). Wiped the crontab, removed the payload, and rotated
`FLY_API_TOKEN` + `CLAUDE_CODE_OAUTH_TOKEN` + `GRAFANA_PROM_TOKEN`.
The payload appears to be mining-only (no backdoor found): no SSH auth-keys, no
LaunchAgents, no extra shell hooks, no other xmrig copies. However, rotating
the personal Fly token via `flyctl auth login` invalidated the token still
stored in GitHub Actions secrets, which surfaced much later as the #199 publish
workflow 401. **Operator rule of thumb: always use `flyctl tokens create
deploy -a <app>` for CI, never a personal auth token.**
|
||||
|
||||
### Self-hosted CI runner migration
|
||||
|
||||
#186 switched every `ci.yml` job + `publish-platform-image.yml` from
`runs-on: ubuntu-latest` to `[self-hosted, macos, arm64]` (Apple-silicon
Mac mini `hongming-m1-mini`). Non-trivial adaptations:
- Replaced GH Actions `services: postgres/redis` (Linux-only) with
  inline `docker run` with `PG_CONTAINER` / `REDIS_CONTAINER` env vars
  and `docker rm -f` teardown in `if: always()`. Ports 15432/16379
  to avoid collision with host services.
- `ludeeus/action-shellcheck` (Docker action, Linux-only) → fallback
  to local `brew install shellcheck` + `find | xargs shellcheck`.
- `actions/setup-python@v5` hardcodes `/Users/runner/hostedtoolcache`
  (non-overridable; an upstream limitation in the prebuilt setup.sh from
  `actions/python-versions`). Bypassed with a `Verify Python 3.11 (Homebrew)`
  step that prepends `/opt/homebrew/opt/python@3.11/bin`
  to `$GITHUB_PATH`. One-time runner prep: `brew install python@3.11`.
- `publish-platform-image.yml` adds `docker/setup-qemu-action@v3`
  plus an explicit `platforms: linux/amd64`, because the runner is arm64 and
  Fly tenant machines are amd64.
|
||||
|
||||
Controlplane PR #28 mirrored the same migration on its own single-job
ci.yml (1-line `runs-on` swap; no matrix adaptations needed).

Known runner rough edges tracked as follow-ups: #191 (persistent-state
docs), #199 (Fly registry 401, resolved by minting a deploy token
scoped to `molecule-tenant`; the tokens table was previously empty).
|
||||
|
||||
### Security fixes — auth gating
|
||||
|
||||
Closed a cluster of unauthenticated-route findings surfaced by the
Security Auditor's hourly audit:

| PR | Issue | Fix |
|---|---|---|
| #94 | #C6 | RFC-1918 + link-local in registry URL validator |
| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) |
| #102 | - | ancestor↔descendant A2A for hierarchy routing |
| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import |
| #110 | - | revoke workspace_auth_tokens on workspace delete |
| #119 | - | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests |
| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) |
| #155 | #151 | wire SecurityHeaders middleware |
| #167 | #164 CRIT #165 HIGH #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) |
| #185 | #180 | AdminAuth on GET /approvals/pending |
| #200 | #190 HIGH | AdminAuth on POST /templates/import |
| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) |
| #209 | #169 C2 | source_id spoof defense in activity.Report |
| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime |
|
||||
|
||||
Rejected PR #194 (Origin-fallback approach) because it would have
re-opened #164 CRITICAL to curl-based spoofing. #168 was correctly fixed
via the narrower route-split in #203.

Rejected PR #169 (large C1-C6 batch) because 4/7 findings were
duplicates of already-merged work and migration 022 numbering
collided with 022_workspace_schedules_source. Cherry-picked the one
genuinely new fix (C2 source_id spoof check) into #209 and closed #169.
|
||||
|
||||
### Security fixes — data integrity
|
||||
|
||||
- **#212** CRITICAL migration-runner bug: `RunMigrations` globbed
  `*.sql` and sorted alphabetically, running `.down.sql` BEFORE
  `.up.sql` on every boot. Wiped `workspace_auth_tokens` + two other
  pairs on every platform restart, regressing AdminAuth to fail-open
  bootstrap mode. Fix: filter to skip `.down.sql` + unit test in
  `postgres_migrate_test.go`.
- **#224** YAML injection in `generateDefaultConfig`: body.Name was
  concatenated into YAML without escaping. Fixed by emitting it as a
  double-quoted YAML scalar with all control chars escaped. Structural
  test (parse + verify key count) instead of substring match.
- **#236** log injection in the #209 security-event log line:
  attacker-controlled `source_id` echoed via `%s` allowed newline
  injection of fake log entries. Switched to `%q`.
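To illustrate the #224 class of bug, here is a small Python sketch of emitting a name as a double-quoted scalar. It is not the Go code in `generateDefaultConfig`; it relies on `json.dumps`, whose string output is also a valid YAML double-quoted scalar for typical inputs (YAML 1.2 is designed as a superset of JSON). The function names are hypothetical.

```python
import json


def yaml_name_line_unsafe(name: str) -> str:
    # Vulnerable pattern: raw concatenation lets a newline start a new key.
    return "name: " + name


def yaml_name_line(name: str) -> str:
    # json.dumps quotes the value and escapes control characters such as \n.
    return "name: " + json.dumps(name)


malicious = "demo\nruntime: evil"
print(yaml_name_line_unsafe(malicious))  # two lines, second one is an injected key
print(yaml_name_line(malicious))         # one line, newline escaped inside quotes
```

This is also why a structural test (parse, then count keys) catches regressions that a substring match would miss: the injected text is present in both outputs, but only the unsafe one changes the document's structure.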
|
||||
|
||||
### Infrastructure
|
||||
|
||||
- **AWS KMS envelope encryption** (controlplane PR #21). Per-secret DEK
  via `kms.GenerateDataKey`; blob layout `[0x02][dek_len][enc_dek][nonce][ct]`.
  Dual-mode: v2 blobs via KMS, legacy blobs via static `SECRETS_ENCRYPTION_KEY`.
  Auto-routes by leading byte; no rewrap migration needed.
- **Grafana Cloud remote-write** (controlplane PR #19 + #20). In-process
  counter registry + hand-rolled protobuf encoder. `cp_requests_total`
  emitted on every request. Push loop to
  `prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push` with
  Basic auth. User 3116422, token via GRAFANA_PROM_TOKEN Fly secret.
- **/cp/status deep-probe** (controlplane PR #24) for Betterstack.
  Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from
  `/health`.
- **Legal pages** (controlplane PR #26/#27). Public
  `/legal/{terms,privacy,dpa,acceptable}` served from embedded markdown.
  Dark-theme HTML shell, minimal markdown→HTML renderer (no dep),
  path-traversal safe via slug allowlist. Smoke covered.
- **Scheduler reliability**: #95 panic-recover in tick(), #149
  independent heartbeat goroutine so long fires don't look stale on
  /admin/liveness, #207 concurrency-aware skip when workspace
  active_tasks>0.
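The leading-byte routing in the KMS envelope bullet above can be sketched as follows. Only the field order `[0x02][dek_len][enc_dek][nonce][ct]` comes from this file; the field widths (a 2-byte big-endian `dek_len`, a 12-byte nonce) and the name `split_blob` are assumptions made for illustration, and no decryption is performed.

```python
import struct

V2_MARKER = 0x02
NONCE_LEN = 12  # assumption: a typical AES-GCM nonce size, not confirmed by the source


def split_blob(blob: bytes):
    """Route by leading byte, then split a v2 blob into its fields."""
    if not blob or blob[0] != V2_MARKER:
        return ("legacy", blob)  # handled with the static key path
    # Assumption: dek_len is a 2-byte big-endian integer right after the marker.
    (dek_len,) = struct.unpack_from(">H", blob, 1)
    start = 3
    enc_dek = blob[start:start + dek_len]
    nonce = blob[start + dek_len:start + dek_len + NONCE_LEN]
    ciphertext = blob[start + dek_len + NONCE_LEN:]
    return ("kms-v2", enc_dek, nonce, ciphertext)


# Build a toy v2 blob and split it again.
toy = bytes([V2_MARKER]) + struct.pack(">H", 4) + b"DEK!" + b"N" * 12 + b"ciphertext"
print(split_blob(toy)[0])
```

Routing on the first byte is what lets both formats coexist: old blobs keep decrypting through the static key while new writes go through KMS, so no rewrap migration is required.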
|
||||
|
||||
### Features
|
||||
|
||||
- **#205** idle-loop reflection pattern in workspace-template. Opt-in
  via `idle_prompt` + `idle_interval_seconds` in `config.yaml`.
  Self-sends the idle prompt via platform A2A proxy every interval
  while `heartbeat.active_tasks == 0`. Hermes/Letta shape.
- **#208** Hermes Phase 1 multi-provider. 15 providers via
  `adapters/hermes/providers.py` registry (Nous, OpenRouter, OpenAI,
  Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq,
  Together, Fireworks, Mistral). Back-compat with PR2 key resolution
  preserved. 26 tests.
- **#198** A2A protocol compliance batch closing #173/#174/#175:
  `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`,
  `stateTransitionHistory=True` in AgentCapabilities. *Note:* it also wired
  `push_sender=PushNotificationSender()`, which crashed on startup
  because PushNotificationSender is an abstract base class; reverted
  in #210.
- **#186** self-hosted macOS runner migration (described above).
|
||||
|
||||
### Code-review self-audit

Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split
follow-ups into two PRs:

- **#228** (Go side): CanvasOrBearer invalid-bearer fall-through fix,
  `short()` helper to replace unsafe `[:N]` slices in scheduler.go,
  security-event log on source_id spoof. 6 new tests:
  `TestShort_helper`, `TestRecordSkipped_writesSkippedStatus`,
  `TestRecordSkipped_shortWorkspaceIDNoPanic`,
  `TestActivityHandler_Report_SourceIDSpoofRejected`,
  `TestActivityHandler_Report_MatchingSourceIDAccepted`,
  `TestHistory_IncludesErrorDetail`.
- **#232** (Python/docs): idle-loop hardening
  (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamped,
  typed `HTTPError`/`URLError`/catch-all, `add_done_callback` for
  fire-and-forget error logging). `idle_prompt` documented in
  `org-templates/molecule-dev/org.yaml` defaults. New
  `docs/runbooks/admin-auth.md` documenting the three middleware
  variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth
  per-id) + the three-question test for adding routes to
  CanvasOrBearer.

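The reason a `short()` helper is safer than a raw `[:N]` slice can be shown with a minimal sketch. Only the helper's name comes from the notes above; the signature and body here are assumptions, and the real implementation in scheduler.go may differ.

```go
package main

import "fmt"

// short returns at most n bytes of s. Unlike s[:n], it never panics when
// s is shorter than n. Signature and body are assumptions for illustration.
// Note: it slices bytes, so it suits ASCII identifiers such as UUIDs.
func short(s string, n int) string {
	if n < 0 {
		n = 0
	}
	if len(s) <= n {
		return s
	}
	return s[:n]
}

func main() {
	fmt.Println(short("3f2a9c1e-aaaa-bbbb", 8)) // 3f2a9c1e
	fmt.Println(short("ws1", 8))                // ws1
}
```

A bare `id[:8]` on the second input would panic with a slice-bounds error, which is presumably the case `TestRecordSkipped_shortWorkspaceIDNoPanic` guards against.
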
### Other merged fixes

- #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0)
- #123 dark-theme a11y (input contrast, search dialog, kbd hints)
- #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav)
- #139 code-review plugins for Dev Lead + QA Engineer
- #149 scheduler heartbeat pulse (#140)
- #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents)
- #157 ecosystem-watch PM sweep
- #161 e2e test mock fix for #125 EXISTS probe
- #187 `SetTrustedProxies(nil)` closes #179 rate-limit bypass
- #188 e2e auth headers on `/events` + `/bundles/export` post-#167
- #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression)
- #192 test regression lock for #170 `DELETE /secrets/:key`
- #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash)
- #206 surface cron `error_detail` in schedule history (#152 problem B)
- #210 revert PushNotificationSender ABC crash (#204)
- #211 migration runner skips `.down.sql` (data loss regression)
- #216 enable idle-loop pilot on Technical Researcher
- #223 reno-stars default plugins to browser-automation
- #225 `auth_headers()` on `/registry/register` (#215)
- #227 unit tests for plugins_install_pipeline.go (37 cases, #217)
- #231 Claude SDK stderr probe for rate-limit error attribution (#160)
- #235 `auth_headers()` on initial_prompt + idle loop (#220)

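The `.down.sql` filter from #211 can be sketched as a small selection function. The function name `upMigrations` and the file names are hypothetical; the sketch assumes the runner picks files by suffix and applies them in lexical order, which the source does not confirm.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// upMigrations returns the .sql files to apply, in lexical order,
// excluding rollback scripts that end in .down.sql.
func upMigrations(names []string) []string {
	var ups []string
	for _, name := range names {
		if !strings.HasSuffix(name, ".sql") {
			continue // not a migration file
		}
		if strings.HasSuffix(name, ".down.sql") {
			continue // rollback script: must never run on the forward path
		}
		ups = append(ups, name)
	}
	sort.Strings(ups)
	return ups
}

func main() {
	files := []string{
		"002_users.down.sql", "001_init.up.sql",
		"002_users.up.sql", "001_init.down.sql", "README.md",
	}
	fmt.Println(upMigrations(files)) // [001_init.up.sql 002_users.up.sql]
}
```

A selector that matched every `*.sql` file would also execute the rollback scripts, which is one plausible way a data-loss regression of this kind arises.
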
### Issues closed (by merge or factual correction)

#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127,
#128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144,
#145, #146, #147, #148, #151, #152 prob B, #153, #154, #156, #160
(diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172,
#173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191
(accepted risk), #195, #199 (fixed Fly token rotation), #201, #202,
#204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229,
#230, #234.

### Outstanding — needs user

- **#126** Slack adapter (Phase-H product decision)
- **#160** Claude Max OAuth quota (wait for reset / upgrade / API key switch)
- **#191** self-hosted runner persistent-state docs (P3)
- **#199** Fly registry token: **resolved this session**, but the re-run
  of `publish-platform-image` is pending runner capacity
- Stripe Atlas application (launch blocker, 2-week lead)

### Test counts (post-session)

- Platform Go: **816 test functions** (+70 this session: scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234)
- Canvas vitest: **453 tests** (+0 structure, +0 new tests this session; UI/a11y patches)
- Workspace-template pytest: **1180 tests** (+40 this session: Hermes providers, a2a cancel, idle loop implicit)
- MCP server jest: **97 tests** (unchanged)

### Infra notes (not in any repo)

- `FLY_API_TOKEN` GH Actions secret rotated to a deploy token scoped to
  `molecule-tenant` (1-year expiry). Docs runbook update needed.
- Mac mini runner env has `RUNNER_TOOL_CACHE` + `AGENT_TOOLSDIRECTORY`
  overrides. Python install via Homebrew is required one-time prep.
- `molecule-monorepo` still private; Actions billing workaround is
  the self-hosted runner rather than flipping public or raising the
  cap.