molecule-core/docs/edit-history/2026-04-15.md
Hongming Wang d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00


# Edit history — 2026-04-15
## tick-9: Phase 32 Phase B.2 image pipeline (PR #80) + tick-8 docs sync (PR #79)
Two merges:
### PR #79 — `docs: sync documentation with 2026-04-14 tick-8 merge (#78)`
Merge commit `d53a1287`. Tick-8 docs sync for the TenantGuard middleware.
Pure docs; CLAUDE.md test count + PLAN.md tick-8 block + edit-history entry.
### PR #80 — `feat(ci): publish-platform-image → ghcr.io/molecule-ai/platform (Phase B.2)`
Merge commit `c3cc8e87`. Noteworthy: ci-infra.
Adds `.github/workflows/publish-platform-image.yml`:
- Trigger: push to main touching `workspace-server/**`; also `workflow_dispatch`.
- Builds `workspace-server/Dockerfile` via `docker/build-push-action@v5`.
- Pushes two tags per run: `ghcr.io/molecule-ai/platform:latest` (floating)
and `:sha-<short-commit>` (immutable, pin-friendly).
- GHA cache via `cache-from/cache-to: type=gha` for warm rebuilds.
- Permissions: `contents:read` + `packages:write`; authenticates to GHCR
using the built-in `GITHUB_TOKEN`, no extra secrets.
- OCI labels propagate source URL + commit SHA for provenance.
Purpose: pairs with the private `molecule-controlplane` Fly + Neon
provisioner (PR #3 there, merged `2e85d5ad`) which reads
`TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag>` from env and spawns
each tenant Fly Machine from this image.
### Deployment state (informational — not in any repo)
- Fly apps (`molecule-cp`, `molecule-tenant`): **pending CEO** (`flyctl apps create`).
- Fly billing card: **pending CEO**.
- First real tenant provision: **blocked** on the two above.
### File deltas (public repo)
- `.github/workflows/publish-platform-image.yml` — new.
- `CLAUDE.md` — tick-9 block for the new CI workflow.
- `PLAN.md` — new "Recently launched (2026-04-15 tick-9)" entry.
---
## Overnight sweep (2026-04-15 16:30-19:10 UTC, ticks 17-30+)
One long session that started with a malware discovery, pivoted through a
half-day of security triage, landed ~27 PRs across both repos, and ended
with a self code-review cleanup round. Chronological order below, compressed
to the load-bearing details so future ticks can grep this file instead of
re-reading the JSONL cron-learnings stream.
### Security: malware cleanup + Fly credential rotation
Discovered `xmrig` cryptominer installed Dec 6 2025 via commodity
npm-dropper, running out of `/var/tmp/.X11-unix/xmrig-6.24.0/` as
`systemd-udevd` (camouflaged Linux daemon name on a Mac mini). Crontab
entry `*/10 * * * *` had been firing every 10 min for ~4 months until
tonight — ~17,500 launches. Wiped crontab, removed payload, rotated
`FLY_API_TOKEN` + `CLAUDE_CODE_OAUTH_TOKEN` + `GRAFANA_PROM_TOKEN`.
The payload appears to be mining-only, with no sign of a backdoor: no
SSH auth-keys, no LaunchAgents, no extra shell hooks, no other xmrig
copies. One side effect: rotating the personal Fly token via
`flyctl auth login` invalidated the token still stored in GitHub
Actions secrets, which surfaced much later as the #199 publish-workflow
401. **Operator rule of thumb: always use `flyctl tokens create
deploy -a <app>` for CI, never a personal auth token.**
### Self-hosted CI runner migration
#186 switched every `ci.yml` job + `publish-platform-image.yml` from
`runs-on: ubuntu-latest` to `[self-hosted, macos, arm64]` (Apple-silicon
Mac mini `hongming-m1-mini`). Non-trivial adaptations:
- Replaced GH Actions `services: postgres/redis` (Linux-only) with
inline `docker run` steps that use `PG_CONTAINER` / `REDIS_CONTAINER`
env vars and a `docker rm -f` teardown under `if: always()`. Ports
15432/16379 avoid collisions with host services.
- `ludeeus/action-shellcheck` (Docker action, Linux-only) → fallback
to local `brew install shellcheck` + `find | xargs shellcheck`.
- `actions/setup-python@v5` hardcodes `/Users/runner/hostedtoolcache`
(non-overridable — upstream limitation in the prebuilt setup.sh from
`actions/python-versions`). Bypassed with a `Verify Python 3.11
(Homebrew)` step that prepends `/opt/homebrew/opt/python@3.11/bin`
to `$GITHUB_PATH`. One-time runner prep: `brew install python@3.11`.
- `publish-platform-image.yml` adds `docker/setup-qemu-action@v3`
+ `platforms: linux/amd64` explicit because the runner is arm64 and
Fly tenant machines are amd64.
Controlplane PR #28 mirrored the same migration on its own single-job
ci.yml (1-line `runs-on` swap — no matrix adaptations needed).
Known runner rough edges tracked as follow-ups: #191 (persistent-state
docs), #199 (Fly registry 401 — resolved by minting a deploy token
scoped to `molecule-tenant`, tokens table previously empty).
### Security fixes — auth gating
Closed a cluster of unauthenticated-route findings surfaced by the
Security Auditor's hourly audit:
| PR | Issue | Fix |
|---|---|---|
| #94 | #C6 | RFC-1918 + link-local in registry URL validator |
| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) |
| #102 | — | ancestor↔descendant A2A for hierarchy routing |
| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import |
| #110 | — | revoke workspace_auth_tokens on workspace delete |
| #119 | — | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests |
| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) |
| #155 | #151 | wire SecurityHeaders middleware |
| #167 | #164 CRIT #165 HIGH #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) |
| #185 | #180 | AdminAuth on GET /approvals/pending |
| #200 | #190 HIGH | AdminAuth on POST /templates/import |
| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) |
| #209 | #169 C2 | source_id spoof defense in activity.Report |
| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime |

Rejected PR #194 (Origin-fallback approach) because it would have
re-opened #164 CRITICAL to curl-based spoofing. #168 was fixed
correctly via the narrower route-split in #203 instead.
Rejected PR #169 (large C1-C6 batch) because 4/7 findings were
duplicates of already-merged work and migration 022 numbering
collided with 022_workspace_schedules_source. Cherry-picked the one
genuinely new fix (C2 source_id spoof check) into #209 and closed
#169.
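
The SSRF blocklist rows in the table above (#94, #119) amount to a prefix check on the target address. A minimal Go sketch, assuming the validator works on literal IPs; `blockedPrefixes` and `isBlockedIP` are hypothetical names rather than the platform's actual identifiers, and a real validator would also need to resolve hostnames before checking:

```go
package main

import (
	"fmt"
	"net/netip"
)

// blockedPrefixes covers only the ranges named in the table above:
// RFC 1918, IPv4 link-local, and the IPv6 set from #119.
var blockedPrefixes = []netip.Prefix{
	netip.MustParsePrefix("10.0.0.0/8"),
	netip.MustParsePrefix("172.16.0.0/12"),
	netip.MustParsePrefix("192.168.0.0/16"),
	netip.MustParsePrefix("169.254.0.0/16"),
	netip.MustParsePrefix("fe80::/10"),
	netip.MustParsePrefix("::1/128"),
	netip.MustParsePrefix("fc00::/7"),
}

// isBlockedIP reports whether a literal IP falls inside a blocked range.
// Hypothetical helper for illustration only.
func isBlockedIP(s string) (bool, error) {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return false, err
	}
	// Unmap so ::ffff:10.0.0.1 is checked as 10.0.0.1, and drop the zone
	// because Prefix.Contains never matches zoned addresses.
	addr = addr.Unmap().WithZone("")
	for _, p := range blockedPrefixes {
		if p.Contains(addr) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	for _, s := range []string{"10.1.2.3", "fe80::1%eth0", "8.8.8.8"} {
		blocked, err := isBlockedIP(s)
		fmt.Println(s, blocked, err)
	}
}
```

The unmap and zone-strip steps matter: without them an IPv4-mapped or zoned link-local address would slip past a plain prefix match.
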
### Security fixes — data integrity
- **#212** CRITICAL migration-runner bug: `RunMigrations` globbed
`*.sql` and sorted alphabetically, so each `.down.sql` ran BEFORE its
`.up.sql` on every boot. This wiped `workspace_auth_tokens` + two other
pairs on every platform restart, regressing AdminAuth to fail-open
bootstrap mode. Fix: skip `.down.sql` files, with a unit test in
`postgres_migrate_test.go`.
- **#224** YAML injection in `generateDefaultConfig` — body.Name
concatenated into YAML without escaping. Fixed by emitting as
double-quoted YAML scalar with all control chars escaped. Structural
test (parse + verify key count) instead of substring match.
- **#236** log-injection in the #209 security-event log line —
attacker-controlled `source_id` echoed via `%s` allowed newline
injection of fake log entries. Switched to `%q`.
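
The ordering bug in #212 and the `%q` fix in #236 can both be reproduced in a few lines. A minimal Go sketch; `upMigrations` and `spoofLogLine` are hypothetical helpers written for illustration (the real `RunMigrations` and log call are not shown in this file), and the file names and log wording are made up:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// upMigrations returns forward migrations in lexical order and skips
// rollback files. Hypothetical stand-in for the filter added in the fix.
func upMigrations(files []string) []string {
	var ups []string
	for _, f := range files {
		if strings.HasSuffix(f, ".down.sql") {
			continue // never auto-run rollbacks on boot
		}
		ups = append(ups, f)
	}
	sort.Strings(ups)
	return ups
}

// spoofLogLine formats an attacker-controlled value with %q, so control
// characters such as newlines are escaped instead of starting a new line.
func spoofLogLine(sourceID string) string {
	return fmt.Sprintf("security: source_id spoof rejected source_id=%q", sourceID)
}

func main() {
	files := []string{"022_tokens.up.sql", "022_tokens.down.sql", "021_init.up.sql"}

	// Naive behaviour: a plain sort puts "down" before "up" ('d' < 'u').
	naive := append([]string(nil), files...)
	sort.Strings(naive)
	fmt.Println("naive order:   ", naive)

	fmt.Println("filtered order:", upMigrations(files))
	fmt.Println(spoofLogLine("abc\nFAKE ENTRY"))
}
```
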
### Infrastructure
- **AWS KMS envelope encryption** (controlplane PR #21). Per-secret DEK
via `kms.GenerateDataKey`; blob layout `[0x02][dek_len][enc_dek][nonce][ct]`.
Dual-mode: v2 blobs via KMS, legacy blobs via static `SECRETS_ENCRYPTION_KEY`.
Auto-routes by leading byte; no rewrap migration needed.
- **Grafana Cloud remote-write** (controlplane PR #19 + #20). In-process
counter registry + hand-rolled protobuf encoder. `cp_requests_total`
emitted on every request. Push loop to
`prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push` with
Basic auth. User 3116422, token via GRAFANA_PROM_TOKEN Fly secret.
- **/cp/status deep-probe** (controlplane PR #24) for Betterstack.
Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from
`/health`.
- **Legal pages** (controlplane PR #26/#27). Public `/legal/{terms,
privacy,dpa,acceptable}` served from embedded markdown. Dark-theme
HTML shell, minimal markdown→HTML renderer (no dep), path-traversal
safe via slug allowlist. Smoke covered.
- **Scheduler reliability**: #95 panic-recover in tick(), #149
independent heartbeat goroutine so long fires don't look stale on
/admin/liveness, #207 concurrency-aware skip when workspace
active_tasks>0.
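
The dual-mode routing described in the AWS KMS item above can be sketched as a parse keyed on the leading byte. This is an illustration only: the entry gives the field order `[0x02][dek_len][enc_dek][nonce][ct]` but not the field widths, so the 2-byte big-endian `dek_len` and the 12-byte nonce below are assumptions, as are the names `parseV2` and `v2Blob`:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	versionKMS = 0x02 // leading byte of a v2 (KMS) blob, per the layout above
	nonceSize  = 12   // ASSUMPTION: standard AES-GCM nonce length
)

// v2Blob holds the parsed fields of a KMS envelope blob (hypothetical type).
type v2Blob struct {
	EncDEK     []byte // data key, encrypted by KMS
	Nonce      []byte
	Ciphertext []byte
}

// parseV2 returns isV2=false for legacy blobs so the caller can fall back
// to the static-key path. ASSUMPTION: dek_len is a 2-byte big-endian field.
func parseV2(blob []byte) (b *v2Blob, isV2 bool, err error) {
	if len(blob) == 0 || blob[0] != versionKMS {
		return nil, false, nil // legacy blob: route to the static key
	}
	if len(blob) < 3 {
		return nil, true, errors.New("v2 blob: truncated header")
	}
	dekLen := int(binary.BigEndian.Uint16(blob[1:3]))
	rest := blob[3:]
	if len(rest) < dekLen+nonceSize {
		return nil, true, errors.New("v2 blob: truncated body")
	}
	return &v2Blob{
		EncDEK:     rest[:dekLen],
		Nonce:      rest[dekLen : dekLen+nonceSize],
		Ciphertext: rest[dekLen+nonceSize:],
	}, true, nil
}

func main() {
	blob := []byte{0x02, 0x00, 0x02, 0xAA, 0xBB,
		1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 0xCC, 0xDD}
	b, isV2, err := parseV2(blob)
	fmt.Println(isV2, err, len(b.EncDEK), len(b.Nonce), len(b.Ciphertext))
}
```

Routing on the first byte is what lets old and new blobs coexist without a rewrap migration: the reader decides per blob which key path to use.
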
### Features
- **#205** idle-loop reflection pattern in workspace-template. Opt-in
via `idle_prompt` + `idle_interval_seconds` in `config.yaml`.
Self-sends the idle prompt via platform A2A proxy every interval
while `heartbeat.active_tasks == 0`. Hermes/Letta shape.
- **#208** Hermes Phase 1 multi-provider. 15 providers via
`adapters/hermes/providers.py` registry (Nous, OpenRouter, OpenAI,
Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq,
Together, Fireworks, Mistral). Back-compat with PR2 key resolution
preserved. 26 tests.
- **#198** A2A protocol compliance batch closing #173/#174/#175:
`cancel()` emits `TaskStatusUpdateEvent(canceled, final=True)`,
`stateTransitionHistory=True` in AgentCapabilities. *Note:* the batch
also wired `push_sender=PushNotificationSender()`, which crashed on
startup because `PushNotificationSender` is an abstract base class;
reverted in #210.
- **#186** self-hosted macOS runner migration (described above).
### Code-review self-audit
Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split
follow-ups into two PRs:
- **#228** (Go side): CanvasOrBearer invalid-bearer fall-through fix,
`short()` helper to replace unsafe `[:N]` slices in scheduler.go,
security-event log on source_id spoof. 6 new tests:
`TestShort_helper`, `TestRecordSkipped_writesSkippedStatus`,
`TestRecordSkipped_shortWorkspaceIDNoPanic`,
`TestActivityHandler_Report_SourceIDSpoofRejected`,
`TestActivityHandler_Report_MatchingSourceIDAccepted`,
`TestHistory_IncludesErrorDetail`.
- **#232** (Python/docs): idle-loop hardening
(`asyncio.get_running_loop()`, `IDLE_FIRE_TIMEOUT_SECONDS` clamped,
typed `HTTPError`/`URLError`/catch-all, `add_done_callback` for
fire-and-forget error logging). `idle_prompt` documented in
`org-templates/molecule-dev/org.yaml` defaults. New
`docs/runbooks/admin-auth.md` documenting the three middleware
variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth
per-id) + the three-question test for adding routes to
CanvasOrBearer.
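
The `short()` helper from #228 guards against slicing a string that is shorter than `N`, which panics in Go. Its actual signature is not shown in this file; the sketch below assumes a `(s string, n int) string` shape purely for illustration:

```go
package main

import "fmt"

// short returns at most the first n bytes of s and never panics.
// ASSUMED signature; the real helper in scheduler.go may differ.
// It slices bytes, which is fine for ASCII IDs but can split a
// multi-byte UTF-8 character in arbitrary text.
func short(s string, n int) string {
	if n < 0 {
		n = 0
	}
	if len(s) <= n {
		return s
	}
	return s[:n]
}

func main() {
	fmt.Println(short("3f2a9c1e-workspace", 8))
	fmt.Println(short("abc", 8)) // "abc"[:8] would panic here
}
```
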
### Other merged fixes
- #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0)
- #123 dark-theme a11y (input contrast, search dialog, kbd hints)
- #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav)
- #139 code-review plugins for Dev Lead + QA Engineer
- #149 scheduler heartbeat pulse (#140)
- #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents)
- #157 ecosystem-watch PM sweep
- #161 e2e test mock fix for #125 EXISTS probe
- #187 `SetTrustedProxies(nil)` closes #179 rate-limit bypass
- #188 e2e auth headers on `/events` + `/bundles/export` post-#167
- #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression)
- #192 test regression lock for #170 `DELETE /secrets/:key`
- #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash)
- #206 surface cron `error_detail` in schedule history (#152 problem B)
- #210 revert PushNotificationSender ABC crash (#204)
- #211 migration runner skips `.down.sql` (data loss regression)
- #216 enable idle-loop pilot on Technical Researcher
- #223 reno-stars default plugins to browser-automation
- #225 auth_headers() on /registry/register (#215)
- #227 unit tests for plugins_install_pipeline.go (37 cases, #217)
- #231 Claude SDK stderr probe for rate-limit error attribution (#160)
- #235 auth_headers() on initial_prompt + idle loop (#220)
### Issues closed (by merge or factual correction)
#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127,
#128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144,
#145, #146, #147, #148, #151, #152 prob B, #153, #154, #156, #160
(diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172,
#173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191
(accepted risk), #195, #199 (fixed with a scoped Fly deploy token), #201, #202,
#204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229,
#230, #234.
### Outstanding — needs user
- **#126** Slack adapter (Phase-H product decision)
- **#160** Claude Max OAuth quota (wait for reset / upgrade / API key switch)
- **#191** self-hosted runner persistent-state docs (P3)
- **#199** Fly registry token — **resolved this session** but re-run
of `publish-platform-image` pending runner capacity
- Stripe Atlas application (launch blocker, 2-week lead)
### Test counts (post-session)
- Platform Go: **816 test functions** (+70 this session — scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234)
- Canvas vitest: **453 tests** (unchanged; this session's UI/a11y patches added no tests)
- Workspace-template pytest: **1180 tests** (+40 this session — Hermes providers, a2a cancel, idle loop implicit)
- MCP server jest: **97 tests** (unchanged)
### Infra notes (not in any repo)
- FLY_API_TOKEN GH Actions secret rotated to a deploy token scoped to
`molecule-tenant` (1-year expiry). Docs runbook update needed.
- Mac mini runner env has `RUNNER_TOOL_CACHE` + `AGENT_TOOLSDIRECTORY`
overrides. Python install via Homebrew is required one-time prep.
- `molecule-monorepo` still private; Actions billing workaround is
the self-hosted runner rather than flipping public or raising the
cap.