# Edit history — 2026-04-15

## tick-9: Phase 32 Phase B.2 image pipeline (PR #80) + tick-8 docs sync (PR #79)

Two merges:

### PR #79 — `docs: sync documentation with 2026-04-14 tick-8 merge (#78)`
Merge commit `d53a1287`. Tick-8 docs sync for the TenantGuard middleware.
Pure docs; CLAUDE.md test count + PLAN.md tick-8 block + edit-history entry.

### PR #80 — `feat(ci): publish-platform-image → ghcr.io/molecule-ai/platform (Phase B.2)`
Merge commit `c3cc8e87`. Noteworthy: ci-infra.

Adds `.github/workflows/publish-platform-image.yml`:
- Trigger: push to main touching `workspace-server/**`; also `workflow_dispatch`.
- Builds `workspace-server/Dockerfile` via `docker/build-push-action@v5`.
- Pushes two tags per run: `ghcr.io/molecule-ai/platform:latest` (floating)
  and `:sha-<short-commit>` (immutable, pin-friendly).
- GHA cache via `cache-from/cache-to: type=gha` for warm rebuilds.
- Permissions: `contents:read` + `packages:write`; authenticates to GHCR
  using the built-in `GITHUB_TOKEN`, no extra secrets.
- OCI labels propagate source URL + commit SHA for provenance.

Purpose: pairs with the private `molecule-controlplane` Fly + Neon
provisioner (PR #3 there, merged `2e85d5ad`) which reads
`TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag>` from env and spawns
each tenant Fly Machine from this image.

### Deployment state (informational — not in any repo)
- Fly apps (`molecule-cp`, `molecule-tenant`): **pending CEO** (`flyctl apps create`).
- Fly billing card: **pending CEO**.
- First real tenant provision: **blocked** on the two above.

### File deltas (public repo)
- `.github/workflows/publish-platform-image.yml` — new.
- `CLAUDE.md` — tick-9 block for the new CI workflow.
- `PLAN.md` — new "Recently launched (2026-04-15 tick-9)" entry.

---

## Overnight sweep (2026-04-15 16:30–19:10 UTC, ticks 17–30+)

One long session that started with a malware discovery, pivoted through a
half-day of security triage, landed ~27 PRs across both repos, and ended
with a self code-review cleanup round. Chronological order below, compressed
to the load-bearing details so future ticks can grep this file instead of
re-reading the JSONL cron-learnings stream.

### Security: malware cleanup + Fly credential rotation

Discovered `xmrig` cryptominer installed Dec 6 2025 via commodity
npm-dropper, running out of `/var/tmp/.X11-unix/xmrig-6.24.0/` as
`systemd-udevd` (camouflaged Linux daemon name on a Mac mini). Crontab
entry `*/10 * * * *` had been firing every 10 min for ~4 months until
tonight — ~17,500 launches. Wiped crontab, removed payload, rotated
`FLY_API_TOKEN` + `CLAUDE_CODE_OAUTH_TOKEN` + `GRAFANA_PROM_TOKEN`.
Mining-only payload; checked for a backdoor and found none: no SSH
auth-keys, no LaunchAgents, no extra shell hooks, no other xmrig copies.
However, rotating the personal Fly token via `flyctl auth login`
invalidated the token still stored in GitHub Actions secrets, which
surfaced much later as the #199 publish-workflow 401. **Operator rule of
thumb: always use `flyctl tokens create deploy -a <app>` for CI, never a
personal auth token.**

### Self-hosted CI runner migration

#186 switched every `ci.yml` job + `publish-platform-image.yml` from
`runs-on: ubuntu-latest` to `[self-hosted, macos, arm64]` (Apple-silicon
Mac mini `hongming-m1-mini`). Non-trivial adaptations:
- Replaced GH Actions `services: postgres/redis` (Linux-only) with
  inline `docker run` using `PG_CONTAINER` / `REDIS_CONTAINER` env vars
  and `docker rm -f` teardown in `if: always()`. Ports 15432/16379
  to avoid collision with host services.
- `ludeeus/action-shellcheck` (Docker action, Linux-only) → fallback
  to local `brew install shellcheck` + `find | xargs shellcheck`.
- `actions/setup-python@v5` hardcodes `/Users/runner/hostedtoolcache`
  (non-overridable — upstream limitation in the prebuilt setup.sh from
  `actions/python-versions`). Bypassed with a
  `Verify Python 3.11 (Homebrew)` step that prepends
  `/opt/homebrew/opt/python@3.11/bin` to `$GITHUB_PATH`. One-time runner
  prep: `brew install python@3.11`.
- `publish-platform-image.yml` adds `docker/setup-qemu-action@v3` and an
  explicit `platforms: linux/amd64`, because the runner is arm64 and
  Fly tenant machines are amd64.

Controlplane PR #28 mirrored the same migration on its own single-job
ci.yml (1-line `runs-on` swap — no matrix adaptations needed).

Known runner rough edges tracked as follow-ups: #191 (persistent-state
docs), #199 (Fly registry 401 — resolved by minting a deploy token
scoped to `molecule-tenant`, tokens table previously empty).

### Security fixes — auth gating

Closed a cluster of unauthenticated-route findings surfaced by the
Security Auditor's hourly audit:

| PR | Issue | Fix |
|---|---|---|
| #94 | #C6 | RFC-1918 + link-local in registry URL validator |
| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) |
| #102 | — | ancestor↔descendant A2A for hierarchy routing |
| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import |
| #110 | — | revoke workspace_auth_tokens on workspace delete |
| #119 | — | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests |
| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) |
| #155 | #151 | wire SecurityHeaders middleware |
| #167 | #164 CRIT #165 HIGH #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) |
| #185 | #180 | AdminAuth on GET /approvals/pending |
| #200 | #190 HIGH | AdminAuth on POST /templates/import |
| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) |
| #209 | #169 C2 | source_id spoof defense in activity.Report |
| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime |

Rejected PR #194 (Origin-fallback approach) because it would have
re-opened #164 CRITICAL to curl-based spoofing. #168 was instead fixed
correctly via the narrower route-split in #203.

Rejected PR #169 (large C1-C6 batch) because 4/7 findings were
duplicates of already-merged work and migration 022 numbering
collided with 022_workspace_schedules_source. Cherry-picked the one
genuinely new fix (C2 source_id spoof check) into #209 and closed
#169.

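
The URL-validator rows in the table (#94, #119) block private, link-local, and loopback ranges. As a rough illustration only, the sketch below checks an IP literal against those ranges with Python's standard `ipaddress` module. The platform's validator is Go and is not reproduced in this file; the helper name `is_blocked_address`, the IPv4 loopback entry, and the IPv4-mapped unwrapping are assumptions added for the example.

```python
import ipaddress

# Ranges named in the table: RFC-1918 + link-local (#94) and the IPv6
# entries from #119. 127.0.0.0/8 is an assumption, not stated in the log.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC-1918
        "169.254.0.0/16",                                  # IPv4 link-local
        "127.0.0.0/8",                                     # IPv4 loopback (assumed)
        "fe80::/10", "::1/128", "fc00::/7",                # IPv6 entries from #119
    )
]


def is_blocked_address(host: str) -> bool:
    """Return True if host is an IP literal inside a blocked range.

    Hypothetical helper: non-IP hostnames return False here; a real
    validator would resolve them and check every resolved address.
    """
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot dodge the v4 list.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    # Membership tests across IP versions simply evaluate to False.
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Checking only the literal is not sufficient against DNS rebinding; the check would need to run on the address actually dialed.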
### Security fixes — data integrity

- **#212** CRITICAL migration-runner bug: `RunMigrations` globbed
  `*.sql` and sorted alphabetically, running `.down.sql` BEFORE
  `.up.sql` on every boot. This wiped `workspace_auth_tokens` + two other
  pairs on every platform restart, regressing AdminAuth to fail-open
  bootstrap mode. Fix: filter to skip `.down.sql` + unit test in
  `postgres_migrate_test.go`.
- **#224** YAML injection in `generateDefaultConfig` — `body.Name` was
  concatenated into YAML without escaping. Fixed by emitting it as a
  double-quoted YAML scalar with all control chars escaped. Structural
  test (parse + verify key count) instead of substring match.
- **#236** log-injection in the #209 security-event log line —
  attacker-controlled `source_id` echoed via `%s` allowed newline
  injection of fake log entries. Switched to `%q`.

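
The #212 ordering bug is easy to reproduce outside Go: in a plain lexicographic sort, `d` sorts before `u`, so each `.down.sql` lands ahead of its matching `.up.sql`. A minimal sketch, assuming a `NNN_name.up.sql` / `NNN_name.down.sql` naming scheme; the file names and the `select_migrations` helper are hypothetical, and the real fix lives in the Go `RunMigrations`.

```python
def select_migrations(filenames):
    """Return the migration files to apply on boot, in order.

    Hypothetical stand-in for the fixed runner: drop rollback files
    (*.down.sql) before sorting so only forward migrations are applied.
    """
    forward = [
        name for name in filenames
        if name.endswith(".sql") and not name.endswith(".down.sql")
    ]
    return sorted(forward)


# Illustrative file names only; not taken from the repository.
files = [
    "020_workspace_auth_tokens.up.sql",
    "020_workspace_auth_tokens.down.sql",
    "021_example.sql",
]

# The buggy behaviour: a naive sort schedules the rollback first.
naive_order = sorted(files)

# The fixed behaviour: rollback files never reach the runner.
fixed_order = select_migrations(files)
```

In the naive order the rollback for `020` runs immediately before its own forward migration, which matches the wipe-on-every-restart symptom described above.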
### Infrastructure

- **AWS KMS envelope encryption** (controlplane PR #21). Per-secret DEK
  via `kms.GenerateDataKey`; blob layout `[0x02][dek_len][enc_dek][nonce][ct]`.
  Dual-mode: v2 blobs via KMS, legacy blobs via static `SECRETS_ENCRYPTION_KEY`.
  Auto-routes by leading byte; no rewrap migration needed.
- **Grafana Cloud remote-write** (controlplane PR #19 + #20). In-process
  counter registry + hand-rolled protobuf encoder. `cp_requests_total`
  emitted on every request. Push loop to
  `prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push` with
  Basic auth. User 3116422, token via `GRAFANA_PROM_TOKEN` Fly secret.
- **/cp/status deep-probe** (controlplane PR #24) for Betterstack.
  Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from
  `/health`.
- **Legal pages** (controlplane PR #26/#27). Public
  `/legal/{terms,privacy,dpa,acceptable}` served from embedded markdown.
  Dark-theme HTML shell, minimal markdown→HTML renderer (no dep),
  path-traversal safe via slug allowlist. Smoke covered.
- **Scheduler reliability**: #95 panic-recover in tick(), #149
  independent heartbeat goroutine so long fires don't look stale on
  /admin/liveness, #207 concurrency-aware skip when workspace
  active_tasks>0.

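
To make the envelope layout `[0x02][dek_len][enc_dek][nonce][ct]` and the leading-byte routing concrete, here is a parsing sketch. The field widths are not stated in this log, so the 2-byte big-endian `dek_len` and the 12-byte nonce are assumptions, as are all function and type names. No decryption is performed; a real reader would pass `enc_dek` to KMS and then open the ciphertext with the recovered key.

```python
import struct
from typing import NamedTuple

V2_MARKER = 0x02
NONCE_LEN = 12      # assumption: typical AES-GCM nonce size
DEK_LEN_FMT = ">H"  # assumption: 2-byte big-endian length field


class EnvelopeV2(NamedTuple):
    enc_dek: bytes     # data key, still encrypted under the KMS key
    nonce: bytes
    ciphertext: bytes


def pack_v2(enc_dek: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    """Serialize as [0x02][dek_len][enc_dek][nonce][ct]."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("unexpected nonce length")
    header = bytes([V2_MARKER]) + struct.pack(DEK_LEN_FMT, len(enc_dek))
    return header + enc_dek + nonce + ciphertext


def parse_blob(blob: bytes):
    """Route by leading byte: ("v2", EnvelopeV2) or ("legacy", raw blob)."""
    if not blob or blob[0] != V2_MARKER:
        return "legacy", blob  # legacy path: static-key decryption elsewhere
    header_len = 1 + struct.calcsize(DEK_LEN_FMT)
    if len(blob) < header_len:
        raise ValueError("truncated v2 header")
    (dek_len,) = struct.unpack_from(DEK_LEN_FMT, blob, 1)
    dek_end = header_len + dek_len
    nonce_end = dek_end + NONCE_LEN
    if len(blob) < nonce_end:
        raise ValueError("truncated v2 blob")
    return "v2", EnvelopeV2(
        enc_dek=blob[header_len:dek_end],
        nonce=blob[dek_end:nonce_end],
        ciphertext=blob[nonce_end:],
    )
```

Routing on the first byte is what lets old and new blobs coexist without a rewrap pass, provided a legacy blob can never begin with the v2 marker; whether the legacy format guarantees that is not covered here.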
### Features

- **#205** idle-loop reflection pattern in workspace-template. Opt-in
  via `idle_prompt` + `idle_interval_seconds` in `config.yaml`.
  Self-sends the idle prompt via platform A2A proxy every interval
  while `heartbeat.active_tasks == 0`. Hermes/Letta shape.
- **#208** Hermes Phase 1 multi-provider. 15 providers via
  `adapters/hermes/providers.py` registry (Nous, OpenRouter, OpenAI,
  Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq,
  Together, Fireworks, Mistral). Back-compat with PR2 key resolution
  preserved. 26 tests.
- **#198** A2A protocol compliance batch closing #173/#174/#175:
  `cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`,
  `stateTransitionHistory=True` in AgentCapabilities. *Note:* this PR
  also wired `push_sender=PushNotificationSender()`, which crashed on
  startup because `PushNotificationSender` is an abstract base class;
  reverted in #210.
- **#186** self-hosted macOS runner migration (described above).

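
The idle-loop behaviour from #205 (fire the configured prompt once per interval, but only while no tasks are active) can be sketched as follows. This is not the workspace-template implementation: `get_active_tasks` and `send_prompt` are hypothetical injection points standing in for the heartbeat counter and the A2A proxy call, and `max_ticks` exists only so the example terminates.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def idle_loop(
    idle_prompt: str,
    idle_interval_seconds: float,
    get_active_tasks: Callable[[], int],
    send_prompt: Callable[[str], Awaitable[None]],
    max_ticks: Optional[int] = None,
) -> int:
    """Self-send idle_prompt every interval while no tasks are active.

    Returns the number of prompts actually sent.
    """
    fired = 0
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        await asyncio.sleep(idle_interval_seconds)
        ticks += 1
        if get_active_tasks() > 0:
            continue  # workspace is busy; skip this interval
        try:
            await send_prompt(idle_prompt)
            fired += 1
        except Exception as exc:  # a failed fire should not kill the loop
            print(f"idle fire failed: {exc}")
    return fired


async def _demo() -> None:
    async def fake_send(prompt: str) -> None:
        print(f"sent: {prompt}")

    # Always idle in this demo, so every tick fires.
    await idle_loop("Reflect on open work.", 0.01, lambda: 0, fake_send, max_ticks=2)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Injecting the two callables keeps the gating logic testable without a running platform.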
### Code-review self-audit

Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split
follow-ups into two PRs:

- **#228** (Go side): CanvasOrBearer invalid-bearer fall-through fix,
  `short()` helper to replace unsafe `[:N]` slices in scheduler.go,
  security-event log on source_id spoof. 6 new tests:
  `TestShort_helper`, `TestRecordSkipped_writesSkippedStatus`,
  `TestRecordSkipped_shortWorkspaceIDNoPanic`,
  `TestActivityHandler_Report_SourceIDSpoofRejected`,
  `TestActivityHandler_Report_MatchingSourceIDAccepted`,
  `TestHistory_IncludesErrorDetail`.
- **#232** (Python/docs): idle-loop hardening
  (`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamped,
  typed `HTTPError`/`URLError`/catch-all, `add_done_callback` for
  fire-and-forget error logging). `idle_prompt` documented in
  `org-templates/molecule-dev/org.yaml` defaults. New
  `docs/runbooks/admin-auth.md` documenting the three middleware
  variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth
  per-id) + the three-question test for adding routes to
  CanvasOrBearer.

### Other merged fixes

- #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0)
- #123 dark-theme a11y (input contrast, search dialog, kbd hints)
- #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav)
- #139 code-review plugins for Dev Lead + QA Engineer
- #149 scheduler heartbeat pulse (#140)
- #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents)
- #157 ecosystem-watch PM sweep
- #161 e2e test mock fix for #125 EXISTS probe
- #187 `SetTrustedProxies(nil)` closes #179 rate-limit bypass
- #188 e2e auth headers on `/events` + `/bundles/export` post-#167
- #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression)
- #192 test regression lock for #170 `DELETE /secrets/:key`
- #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash)
- #206 surface cron `error_detail` in schedule history (#152 problem B)
- #210 revert PushNotificationSender ABC crash (#204)
- #211 migration runner skips `.down.sql` (data loss regression)
- #216 enable idle-loop pilot on Technical Researcher
- #223 reno-stars default plugins to browser-automation
- #225 auth_headers() on /registry/register (#215)
- #227 unit tests for plugins_install_pipeline.go (37 cases, #217)
- #231 Claude SDK stderr probe for rate-limit error attribution (#160)
- #235 auth_headers() on initial_prompt + idle loop (#220)

### Issues closed (by merge or factual correction)

#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127,
#128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144,
#145, #146, #147, #148, #151, #152 prob B, #153, #154, #156, #160
(diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172,
#173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191
(accepted risk), #195, #199 (fixed Fly token rotation), #201, #202,
#204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229,
#230, #234.

### Outstanding — needs user

- **#126** Slack adapter (Phase-H product decision)
- **#160** Claude Max OAuth quota (wait for reset / upgrade / API key switch)
- **#191** self-hosted runner persistent-state docs (P3)
- **#199** Fly registry token — **resolved this session** but re-run
  of `publish-platform-image` pending runner capacity
- Stripe Atlas application (launch blocker, 2-week lead)

### Test counts (post-session)

- Platform Go: **816 test functions** (+70 this session — scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234)
- Canvas vitest: **453 tests** (+0 structure, +0 new tests this session — UI/a11y patches)
- Workspace-template pytest: **1180 tests** (+40 this session — Hermes providers, a2a cancel, idle loop implicit)
- MCP server jest: **97 tests** (unchanged)

### Infra notes (not in any repo)

- `FLY_API_TOKEN` GH Actions secret rotated to a deploy token scoped to
  `molecule-tenant` (1-year expiry). Docs runbook update needed.
- Mac mini runner env has `RUNNER_TOOL_CACHE` + `AGENT_TOOLSDIRECTORY`
  overrides. Python install via Homebrew is required one-time prep.
- `molecule-monorepo` still private; Actions billing workaround is
  the self-hosted runner rather than flipping public or raising the
  cap.