# Edit history — 2026-04-15

## tick-9: Phase 32 Phase B.2 image pipeline (PR #80) + tick-8 docs sync (PR #79)

Two merges:

### PR #79 — `docs: sync documentation with 2026-04-14 tick-8 merge (#78)`

Merge commit `d53a1287`. Tick-8 docs sync for the TenantGuard middleware. Pure docs; CLAUDE.md test count + PLAN.md tick-8 block + edit-history entry.

### PR #80 — `feat(ci): publish-platform-image → ghcr.io/molecule-ai/platform (Phase B.2)`

Merge commit `c3cc8e87`. Noteworthy: ci-infra.

Adds `.github/workflows/publish-platform-image.yml`:

- Trigger: push to main touching `workspace-server/**`; also `workflow_dispatch`.
- Builds `workspace-server/Dockerfile` via `docker/build-push-action@v5`.
- Pushes two tags per run: `ghcr.io/molecule-ai/platform:latest` (floating) and `:sha-` (immutable, pin-friendly).
- GHA cache via `cache-from/cache-to: type=gha` for warm rebuilds.
- Permissions: `contents:read` + `packages:write`; authenticates to GHCR using the built-in `GITHUB_TOKEN`, no extra secrets.
- OCI labels propagate source URL + commit SHA for provenance.

Purpose: pairs with the private `molecule-controlplane` Fly + Neon provisioner (PR #3 there, merged `2e85d5ad`) which reads `TENANT_IMAGE=ghcr.io/molecule-ai/platform:` from env and spawns each tenant Fly Machine from this image.

### Deployment state (informational — not in any repo)

- Fly apps (`molecule-cp`, `molecule-tenant`): **pending CEO** (`flyctl apps create`).
- Fly billing card: **pending CEO**.
- First real tenant provision: **blocked** on the two above.

### File deltas (public repo)

- `.github/workflows/publish-platform-image.yml` — new.
- `CLAUDE.md` — tick-9 block for the new CI workflow.
- `PLAN.md` — new "Recently launched (2026-04-15 tick-9)" entry.

---

## Overnight sweep (2026-04-15 16:30–19:10 UTC, ticks 17–30+)

One long session that started with a malware discovery, pivoted through a half-day of security triage, landed ~27 PRs across both repos, and ended with a self code-review cleanup round.
Chronological order below, compressed to the load-bearing details so future ticks can grep this file instead of re-reading the JSONL cron-learnings stream.

### Security: malware cleanup + Fly credential rotation

Discovered an `xmrig` cryptominer installed Dec 6 2025 via a commodity npm-dropper, running out of `/var/tmp/.X11-unix/xmrig-6.24.0/` as `systemd-udevd` (camouflaged Linux daemon name on a Mac mini). Crontab entry `*/10 * * * *` had been firing every 10 min for ~4 months until tonight — ~17,500 launches. Wiped crontab, removed payload, rotated `FLY_API_TOKEN` + `CLAUDE_CODE_OAUTH_TOKEN` + `GRAFANA_PROM_TOKEN`.

Mining-only payload, with no backdoor found: no SSH auth-keys, no LaunchAgents, no extra shell hooks, no other xmrig copies. However, rotating the personal Fly token via `flyctl auth login` invalidated the token still stored in GitHub Actions secrets, which surfaced much later as the #199 publish-workflow 401.

**Operator rule of thumb: always use `flyctl tokens create deploy -a ` for CI, never a personal auth token.**

### Self-hosted CI runner migration

#186 switched every `ci.yml` job + `publish-platform-image.yml` from `runs-on: ubuntu-latest` to `[self-hosted, macos, arm64]` (Apple-silicon Mac mini `hongming-m1-mini`). Non-trivial adaptations:

- Replaced GH Actions `services: postgres/redis` (Linux-only) with inline `docker run` steps, using `PG_CONTAINER` / `REDIS_CONTAINER` env vars and `docker rm -f` teardown in `if: always()`. Ports 15432/16379 avoid collisions with host services.
- `ludeeus/action-shellcheck` (Docker action, Linux-only) → fallback to local `brew install shellcheck` + `find | xargs shellcheck`.
- `actions/setup-python@v5` hardcodes `/Users/runner/hostedtoolcache` (non-overridable — upstream limitation in the prebuilt setup.sh from `actions/python-versions`). Bypassed with a `Verify Python 3.11 (Homebrew)` step that prepends `/opt/homebrew/opt/python@3.11/bin` to `$GITHUB_PATH`. One-time runner prep: `brew install python@3.11`.
- `publish-platform-image.yml` adds `docker/setup-qemu-action@v3` + an explicit `platforms: linux/amd64` because the runner is arm64 and Fly tenant machines are amd64.

Controlplane PR #28 mirrored the same migration on its own single-job ci.yml (1-line `runs-on` swap — no matrix adaptations needed).

Known runner rough edges tracked as follow-ups: #191 (persistent-state docs), #199 (Fly registry 401 — resolved by minting a deploy token scoped to `molecule-tenant`, tokens table previously empty).

### Security fixes — auth gating

Closed a cluster of unauthenticated-route findings surfaced by the Security Auditor's hourly audit:

| PR | Issue | Fix |
|---|---|---|
| #94 | #C6 | RFC-1918 + link-local in registry URL validator |
| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) |
| #102 | — | ancestor↔descendant A2A for hierarchy routing |
| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import |
| #110 | — | revoke workspace_auth_tokens on workspace delete |
| #119 | — | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests |
| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) |
| #155 | #151 | wire SecurityHeaders middleware |
| #167 | #164 CRIT, #165 HIGH, #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) |
| #185 | #180 | AdminAuth on GET /approvals/pending |
| #200 | #190 HIGH | AdminAuth on POST /templates/import |
| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) |
| #209 | #169 C2 | source_id spoof defense in activity.Report |
| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime |

Rejected PR #194 (Origin-fallback approach) because it would have re-opened #164 CRITICAL to curl-based spoofing. #168 was instead fixed correctly via the narrower route-split in #203.
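
As an illustration of the address-range checks behind #94 and #119 in the table above, here is a minimal Python sketch (the platform itself is Go, and its real validator is not shown in this file). The function names `is_blocked_address` and `is_registry_url_allowed` are hypothetical, and the network list is assumed from the ranges named in the table: the RFC 1918 blocks, IPv4 link-local, and the three IPv6 ranges.

```python
import ipaddress
from urllib.parse import urlsplit

# Ranges named in the table (#94, #119). The exact list the platform
# enforces is an assumption; this file only names the range families.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),      # RFC 1918
    ipaddress.ip_network("172.16.0.0/12"),   # RFC 1918
    ipaddress.ip_network("192.168.0.0/16"),  # RFC 1918
    ipaddress.ip_network("169.254.0.0/16"),  # IPv4 link-local
    ipaddress.ip_network("fe80::/10"),       # IPv6 link-local
    ipaddress.ip_network("::1/128"),         # IPv6 loopback
    ipaddress.ip_network("fc00::/7"),        # IPv6 unique local
]


def is_blocked_address(host: str) -> bool:
    """Return True if host is an IP literal inside a blocked range."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A real validator would also have to check
        # the addresses the hostname resolves to; that is omitted here.
        return False
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_NETWORKS
    )


def is_registry_url_allowed(url: str) -> bool:
    """Accept only http(s) URLs whose host is not a blocked IP literal."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    return not is_blocked_address(parts.hostname)
```

This sketch only inspects IP literals in the URL. Hostnames that resolve to internal addresses and IPv4-mapped IPv6 forms would need extra handling.
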
Rejected PR #169 (large C1-C6 batch) because 4/7 findings were duplicates of already-merged work and migration 022 numbering collided with 022_workspace_schedules_source. Cherry-picked the one genuinely new fix (C2 source_id spoof check) into #209 and closed #169.

### Security fixes — data integrity

- **#212** CRITICAL migration-runner bug: `RunMigrations` globbed `*.sql` and sorted alphabetically, running `.down.sql` BEFORE `.up.sql` on every boot. This wiped `workspace_auth_tokens` + two other pairs on every platform restart, regressing AdminAuth to fail-open bootstrap mode. Fixed by filtering out `.down.sql` files, plus a unit test in `postgres_migrate_test.go`.
- **#224** YAML injection in `generateDefaultConfig` — body.Name concatenated into YAML without escaping. Fixed by emitting it as a double-quoted YAML scalar with all control chars escaped. Structural test (parse + verify key count) instead of substring match.
- **#236** log-injection in the #209 security-event log line — attacker-controlled `source_id` echoed via `%s` allowed newline injection of fake log entries. Switched to `%q`.

### Infrastructure

- **AWS KMS envelope encryption** (controlplane PR #21). Per-secret DEK via `kms.GenerateDataKey`; blob layout `[0x02][dek_len][enc_dek][nonce][ct]`. Dual-mode: v2 blobs via KMS, legacy blobs via static `SECRETS_ENCRYPTION_KEY`. Auto-routes by leading byte; no rewrap migration needed.
- **Grafana Cloud remote-write** (controlplane PR #19 + #20). In-process counter registry + hand-rolled protobuf encoder. `cp_requests_total` emitted on every request. Push loop to `prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push` with Basic auth. User 3116422, token via `GRAFANA_PROM_TOKEN` Fly secret.
- **/cp/status deep-probe** (controlplane PR #24) for Betterstack. Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from `/health`.
- **Legal pages** (controlplane PR #26/#27). Public `/legal/{terms,privacy,dpa,acceptable}` served from embedded markdown.
  Dark-theme HTML shell, minimal markdown→HTML renderer (no dep), path-traversal safe via slug allowlist. Smoke covered.
- **Scheduler reliability**: #95 panic-recover in tick(), #149 independent heartbeat goroutine so long fires don't look stale on /admin/liveness, #207 concurrency-aware skip when workspace active_tasks>0.

### Features

- **#205** idle-loop reflection pattern in workspace-template. Opt-in via `idle_prompt` + `idle_interval_seconds` in `config.yaml`. Self-sends the idle prompt via platform A2A proxy every interval while `heartbeat.active_tasks == 0`. Hermes/Letta shape.
- **#208** Hermes Phase 1 multi-provider. 15 providers via `adapters/hermes/providers.py` registry (Nous, OpenRouter, OpenAI, Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, Together, Fireworks, Mistral). Back-compat with PR2 key resolution preserved. 26 tests.
- **#198** A2A protocol compliance batch closing #173/#174/#175: `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`, `stateTransitionHistory=True` in AgentCapabilities. *Note:* wired `push_sender=PushNotificationSender()` and this crashed on startup because PushNotificationSender is an abstract base class — reverted in #210.
- **#186** self-hosted macOS runner migration (described above).

### Code-review self-audit

Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split follow-ups into two PRs:

- **#228** (Go side): CanvasOrBearer invalid-bearer fall-through fix, `short()` helper to replace unsafe `[:N]` slices in scheduler.go, security-event log on source_id spoof. 6 new tests: `TestShort_helper`, `TestRecordSkipped_writesSkippedStatus`, `TestRecordSkipped_shortWorkspaceIDNoPanic`, `TestActivityHandler_Report_SourceIDSpoofRejected`, `TestActivityHandler_Report_MatchingSourceIDAccepted`, `TestHistory_IncludesErrorDetail`.
- **#232** (Python/docs): idle-loop hardening (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamped, typed `HTTPError`/`URLError`/catch-all, `add_done_callback` for fire-and-forget error logging). `idle_prompt` documented in `org-templates/molecule-dev/org.yaml` defaults. New `docs/runbooks/admin-auth.md` documenting the three middleware variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth per-id) + the three-question test for adding routes to CanvasOrBearer.

### Other merged fixes

- #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0)
- #123 dark-theme a11y (input contrast, search dialog, kbd hints)
- #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav)
- #139 code-review plugins for Dev Lead + QA Engineer
- #149 scheduler heartbeat pulse (#140)
- #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents)
- #157 ecosystem-watch PM sweep
- #161 e2e test mock fix for #125 EXISTS probe
- #187 `SetTrustedProxies(nil)` closes #179 rate-limit bypass
- #188 e2e auth headers on `/events` + `/bundles/export` post-#167
- #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression)
- #192 test regression lock for #170 `DELETE /secrets/:key`
- #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash)
- #206 surface cron `error_detail` in schedule history (#152 problem B)
- #210 revert PushNotificationSender ABC crash (#204)
- #211 migration runner skips `.down.sql` (data loss regression)
- #216 enable idle-loop pilot on Technical Researcher
- #223 reno-stars default plugins to browser-automation
- #225 auth_headers() on /registry/register (#215)
- #227 unit tests for plugins_install_pipeline.go (37 cases, #217)
- #231 Claude SDK stderr probe for rate-limit error attribution (#160)
- #235 auth_headers() on initial_prompt + idle loop (#220)

### Issues closed (by merge or factual correction)

#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127,
#128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144, #145, #146, #147, #148, #151, #152 prob B, #153, #154, #156, #160 (diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172, #173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191 (accepted risk), #195, #199 (fixed Fly token rotation), #201, #202, #204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229, #230, #234.

### Outstanding — needs user

- **#126** Slack adapter (Phase-H product decision)
- **#160** Claude Max OAuth quota (wait for reset / upgrade / API key switch)
- **#191** self-hosted runner persistent-state docs (P3)
- **#199** Fly registry token — **resolved this session** but re-run of `publish-platform-image` pending runner capacity
- Stripe Atlas application (launch blocker, 2-week lead)

### Test counts (post-session)

- Platform Go: **816 test functions** (+70 this session — scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234)
- Canvas vitest: **453 tests** (+0 structure, +0 new tests this session — UI/a11y patches)
- Workspace-template pytest: **1180 tests** (+40 this session — Hermes providers, a2a cancel, idle loop implicit)
- MCP server jest: **97 tests** (unchanged)

### Infra notes (not in any repo)

- FLY_API_TOKEN GH Actions secret rotated to a deploy token scoped to `molecule-tenant` (1-year expiry). Docs runbook update needed.
- Mac mini runner env has `RUNNER_TOOL_CACHE` + `AGENT_TOOLSDIRECTORY` overrides. Python install via Homebrew is required one-time prep.
- `molecule-monorepo` still private; Actions billing workaround is the self-hosted runner rather than flipping public or raising the cap.
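
For reference, the ordering problem behind the #212 migration-runner bug (see "Security fixes — data integrity") can be reproduced in a few lines. This is a Python sketch of the idea only; the real runner is the Go `RunMigrations` function, and the file names and the `pending_up_migrations` helper below are hypothetical.

```python
# Hypothetical migration files; each migration ships an up/down pair.
files = [
    "021_example_schedules.up.sql",
    "021_example_schedules.down.sql",
    "020_example_tokens.up.sql",
    "020_example_tokens.down.sql",
]

# Naive approach: sort everything matching *.sql alphabetically.
# "down" sorts before "up", so each table is dropped, then recreated.
naive = sorted(f for f in files if f.endswith(".sql"))


def pending_up_migrations(filenames):
    """Return forward migrations only, in lexical order."""
    return sorted(
        f for f in filenames
        if f.endswith(".sql") and not f.endswith(".down.sql")
    )


for name in naive:
    print("naive order:", name)
for name in pending_up_migrations(files):
    print("fixed order:", name)
```

With the filter in place, only the `.up.sql` files are applied on boot, so a restart no longer runs the matching rollback first.
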