molecule-core/docs/edit-history/2026-04-15.md
Hongming Wang d8026347e5 chore: open-source restructure — rename dirs, remove internal files, scrub secrets
Renames:
- platform/ → workspace-server/ (Go module path stays as "platform" for
  external dep compat — will update after plugin module republish)
- workspace-template/ → workspace/

Removed (moved to separate repos or deleted):
- PLAN.md — internal roadmap (move to private project board)
- HANDOFF.md, AGENTS.md — one-time internal session docs
- .claude/ — gitignored entirely (local agent config)
- infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy
- org-templates/molecule-dev/ → standalone template repo
- .mcp-eval/ → molecule-mcp-server repo
- test-results/ — ephemeral, gitignored

Security scrubbing:
- Cloudflare account/zone/KV IDs → placeholders
- Real EC2 IPs → <EC2_IP> in all docs
- CF token prefix, Neon project ID, Fly app names → redacted
- Langfuse dev credentials → parameterized
- Personal runner username/machine name → generic

Community files:
- CONTRIBUTING.md — build, test, branch conventions
- CODE_OF_CONDUCT.md — Contributor Covenant 2.1

All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml,
README, CLAUDE.md updated for new directory names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 00:24:44 -07:00


Edit history — 2026-04-15

tick-9: Phase 32 Phase B.2 image pipeline (PR #80) + tick-8 docs sync (PR #79)

Two merges:

PR #79 — docs: sync documentation with 2026-04-14 tick-8 merge (#78)

Merge commit d53a1287. Tick-8 docs sync for the TenantGuard middleware. Pure docs; CLAUDE.md test count + PLAN.md tick-8 block + edit-history entry.

PR #80 — feat(ci): publish-platform-image → ghcr.io/molecule-ai/platform (Phase B.2)

Merge commit c3cc8e87. Noteworthy: ci-infra.

Adds .github/workflows/publish-platform-image.yml:

  • Trigger: push to main touching workspace-server/**; also workflow_dispatch.
  • Builds workspace-server/Dockerfile via docker/build-push-action@v5.
  • Pushes two tags per run: ghcr.io/molecule-ai/platform:latest (floating) and :sha-<short-commit> (immutable, pin-friendly).
  • GHA cache via cache-from/cache-to: type=gha for warm rebuilds.
  • Permissions: contents:read + packages:write; authenticates to GHCR using the built-in GITHUB_TOKEN, no extra secrets.
  • OCI labels propagate source URL + commit SHA for provenance.

Purpose: pairs with the private molecule-controlplane Fly + Neon provisioner (PR #3 there, merged 2e85d5ad) which reads TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag> from env and spawns each tenant Fly Machine from this image.

Deployment state (informational — not in any repo)

  • Fly apps (molecule-cp, molecule-tenant): pending CEO (flyctl apps create).
  • Fly billing card: pending CEO.
  • First real tenant provision: blocked on the two above.

File deltas (public repo)

  • .github/workflows/publish-platform-image.yml — new.
  • CLAUDE.md — tick-9 block for the new CI workflow.
  • PLAN.md — new "Recently launched (2026-04-15 tick-9)" entry.

Overnight sweep (2026-04-15 16:30-19:10 UTC, ticks 17-30+)

One long session that started with a malware discovery, pivoted through a half-day of security triage, landed ~27 PRs across both repos, and ended with a self code-review cleanup round. Chronological order below, compressed to the load-bearing details so future ticks can grep this file instead of re-reading the JSONL cron-learnings stream.

Security: malware cleanup + Fly credential rotation

Discovered an xmrig cryptominer, installed Dec 6 2025 via a commodity npm dropper, running out of /var/tmp/.X11-unix/xmrig-6.24.0/ as systemd-udevd (a Linux daemon name used as camouflage on a Mac mini). Its crontab entry */10 * * * * had been firing every 10 min for ~4 months until tonight (~17,500 launches). Wiped the crontab, removed the payload, and rotated FLY_API_TOKEN + CLAUDE_CODE_OAUTH_TOKEN + GRAFANA_PROM_TOKEN. The payload appears to be mining-only, with no backdoor found: no SSH auth-keys, no LaunchAgents, no extra shell hooks, no other xmrig copies. Side effect: rotating the personal Fly token via flyctl auth login invalidated the token still stored in GitHub Actions secrets, which surfaced much later as the #199 publish-workflow 401. Operator rule of thumb: always use flyctl tokens create deploy -a <app> for CI, never a personal auth token.

Self-hosted CI runner migration

#186 switched every ci.yml job + publish-platform-image.yml from runs-on: ubuntu-latest to [self-hosted, macos, arm64] (Apple-silicon Mac mini hongming-m1-mini). Non-trivial adaptations:

  • Replaced GH Actions services: postgres/redis (Linux-only) with inline docker run with PG_CONTAINER / REDIS_CONTAINER env vars and docker rm -f teardown in if: always(). Ports 15432/16379 to avoid collision with host services.
  • ludeeus/action-shellcheck (Docker action, Linux-only) → fallback to local brew install shellcheck + find | xargs shellcheck.
  • actions/setup-python@v5 hardcodes /Users/runner/hostedtoolcache (non-overridable — upstream limitation in the prebuilt setup.sh from actions/python-versions). Bypassed with a Verify Python 3.11 (Homebrew) step that prepends /opt/homebrew/opt/python@3.11/bin to $GITHUB_PATH. One-time runner prep: brew install python@3.11.
  • publish-platform-image.yml adds docker/setup-qemu-action@v3 plus an explicit platforms: linux/amd64, because the runner is arm64 and Fly tenant machines are amd64.

Controlplane PR #28 mirrored the same migration on its own single-job ci.yml (1-line runs-on swap — no matrix adaptations needed).

Known runner rough edges tracked as follow-ups: #191 (persistent-state docs), #199 (Fly registry 401 — resolved by minting a deploy token scoped to molecule-tenant, tokens table previously empty).

Security fixes — auth gating

Closed a cluster of unauthenticated-route findings surfaced by the Security Auditor's hourly audit:

| PR | Issue | Fix |
|---|---|---|
| #94 | #C6 | RFC-1918 + link-local in registry URL validator |
| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) |
| #102 | | ancestor↔descendant A2A for hierarchy routing |
| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import |
| #110 | | revoke workspace_auth_tokens on workspace delete |
| #119 | | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests |
| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) |
| #155 | #151 | wire SecurityHeaders middleware |
| #167 | #164 CRIT, #165 HIGH, #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) |
| #185 | #180 | AdminAuth on GET /approvals/pending |
| #200 | #190 HIGH | AdminAuth on POST /templates/import |
| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) |
| #209 | #169 C2 | source_id spoof defense in activity.Report |
| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime |

Rejected PR #194 (Origin-fallback approach) because it would have re-opened #164 CRITICAL to curl-based spoofing. #168 correctly fixed via the narrower route-split in #203.

Rejected PR #169 (large C1-C6 batch) because 4/7 findings were duplicates of already-merged work and migration 022 numbering collided with 022_workspace_schedules_source. Cherry-picked the one genuinely new fix (C2 source_id spoof check) into #209 and closed #169.
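
As an illustration of the address checks named in #94 and #119, here is a minimal Python sketch of a blocklist covering the RFC-1918, link-local, and IPv6 ranges listed above. It is not the platform's Go validator: the helper name is_blocked_address and the handling of IPv4-mapped IPv6 literals are assumptions made for the example, and hostnames would still need to be resolved before checking.

```python
import ipaddress

# Ranges taken from the #94 and #119 rows; the exact list used by the
# real validator is not shown in this file, so treat this as illustrative.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",       # RFC 1918
        "172.16.0.0/12",    # RFC 1918
        "192.168.0.0/16",   # RFC 1918
        "169.254.0.0/16",   # IPv4 link-local
        "fe80::/10",        # IPv6 link-local
        "::1/128",          # IPv6 loopback
        "fc00::/7",         # IPv6 unique local
    )
]


def is_blocked_address(host: str) -> bool:
    """Return True if host is an IP literal inside a blocked range.

    Hypothetical helper: non-IP hostnames return False here and would
    need DNS resolution before a real SSRF decision.
    """
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal
    # Assumption: unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot
    # be used to sidestep the IPv4 ranges.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    # Membership across IP versions is simply False, so one loop suffices.
    return any(addr in net for net in BLOCKED_NETWORKS)
```

Checking against parsed networks rather than string prefixes avoids misses on alternate spellings of the same address (for example fe80:0:0:0:0:0:0:1).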

Security fixes — data integrity

  • #212 CRITICAL migration-runner bug: RunMigrations globbed *.sql and sorted alphabetically, so each .down.sql ran BEFORE its matching .up.sql on every boot. This wiped workspace_auth_tokens and the tables from two other migration pairs on every platform restart, regressing AdminAuth to fail-open bootstrap mode. Fix: filter out .down.sql files, plus a unit test in postgres_migrate_test.go.
  • #224 YAML injection in generateDefaultConfig — body.Name concatenated into YAML without escaping. Fixed by emitting as double-quoted YAML scalar with all control chars escaped. Structural test (parse + verify key count) instead of substring match.
  • #236 log-injection in the #209 security-event log line — attacker-controlled source_id echoed via %s allowed newline injection of fake log entries. Switched to %q.
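
The ordering bug in #212 is easy to reproduce outside the runner. The sketch below is Python rather than the Go of RunMigrations, and the file names are hypothetical; it only shows why an alphabetical sort puts each .down.sql ahead of its .up.sql, and what filtering the down files changes.

```python
def order_naive(filenames):
    """Buggy ordering: every *.sql file, sorted alphabetically."""
    return sorted(f for f in filenames if f.endswith(".sql"))


def order_up_only(filenames):
    """Fixed ordering: same sort, but .down.sql files are skipped."""
    return sorted(
        f for f in filenames
        if f.endswith(".sql") and not f.endswith(".down.sql")
    )


# Hypothetical migration directory listing.
files = [
    "001_workspaces.up.sql",
    "001_workspaces.down.sql",
    "002_tokens.up.sql",
    "002_tokens.down.sql",
]

# "d" sorts before "u", so each down migration runs first.
naive = order_naive(files)
fixed = order_up_only(files)
```

With the naive ordering each pair is torn down and recreated on every boot, which matches the token-table wipe described above; the filtered ordering only ever applies up migrations.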

Infrastructure

  • AWS KMS envelope encryption (controlplane PR #21). Per-secret DEK via kms.GenerateDataKey; blob layout [0x02][dek_len][enc_dek][nonce][ct]. Dual-mode: v2 blobs via KMS, legacy blobs via static SECRETS_ENCRYPTION_KEY. Auto-routes by leading byte; no rewrap migration needed.
  • Grafana Cloud remote-write (controlplane PR #19 + #20). In-process counter registry + hand-rolled protobuf encoder. cp_requests_total emitted on every request. Push loop to prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push with Basic auth. User 3116422, token via GRAFANA_PROM_TOKEN Fly secret.
  • /cp/status deep-probe (controlplane PR #24) for Betterstack. Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from /health.
  • Legal pages (controlplane PR #26/#27). Public /legal/{terms,privacy,dpa,acceptable} served from embedded markdown. Dark-theme HTML shell, minimal markdown→HTML renderer (no dep), path-traversal safe via slug allowlist. Smoke covered.
  • Scheduler reliability: #95 panic-recover in tick(), #149 independent heartbeat goroutine so long fires don't look stale on /admin/liveness, #207 concurrency-aware skip when workspace active_tasks>0.
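
To make the dual-mode routing from controlplane PR #21 concrete, here is a rough Python sketch of parsing the [0x02][dek_len][enc_dek][nonce][ct] layout and falling back to the legacy path on any other leading byte. Field widths are not given in this file, so the 2-byte big-endian dek_len and the 12-byte nonce are assumptions, as are all function and type names; no actual KMS call or decryption is performed.

```python
import struct
from typing import NamedTuple

VERSION_KMS = 0x02   # leading byte for v2 (KMS) blobs, per the layout above
NONCE_LEN = 12       # assumption: AES-GCM style 96-bit nonce


class EnvelopeV2(NamedTuple):
    enc_dek: bytes      # data key, still encrypted under the KMS key
    nonce: bytes
    ciphertext: bytes


def pack_v2(enc_dek: bytes, nonce: bytes, ciphertext: bytes) -> bytes:
    """Serialize a v2 blob. dek_len is assumed to be uint16 big-endian."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("unexpected nonce length")
    header = bytes([VERSION_KMS]) + struct.pack(">H", len(enc_dek))
    return header + enc_dek + nonce + ciphertext


def parse_blob(blob: bytes):
    """Route by leading byte: ("kms", EnvelopeV2) or ("legacy", raw blob)."""
    if not blob or blob[0] != VERSION_KMS:
        return "legacy", blob  # handled by the static-key path
    if len(blob) < 3:
        raise ValueError("truncated v2 header")
    (dek_len,) = struct.unpack_from(">H", blob, 1)
    dek_end = 3 + dek_len
    nonce_end = dek_end + NONCE_LEN
    if len(blob) < nonce_end:
        raise ValueError("truncated v2 blob")
    return "kms", EnvelopeV2(
        blob[3:dek_end], blob[dek_end:nonce_end], blob[nonce_end:]
    )
```

Because the version byte travels with each blob, old and new secrets can coexist in the same column, which is why no rewrap migration is needed.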

Features

  • #205 idle-loop reflection pattern in workspace-template. Opt-in via idle_prompt + idle_interval_seconds in config.yaml. Self-sends the idle prompt via platform A2A proxy every interval while heartbeat.active_tasks == 0. Hermes/Letta shape.
  • #208 Hermes Phase 1 multi-provider. 15 providers via adapters/hermes/providers.py registry (Nous, OpenRouter, OpenAI, Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, Together, Fireworks, Mistral). Back-compat with PR2 key resolution preserved. 26 tests.
  • #198 A2A protocol compliance batch closing #173/#174/#175: cancel() emits TaskStatusUpdateEvent(canceled, final=True), stateTransitionHistory=True in AgentCapabilities. Note: the batch also wired push_sender=PushNotificationSender(), which crashed on startup because PushNotificationSender is an abstract base class; reverted in #210.
  • #186 self-hosted macOS runner migration (described above).
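
The startup crash noted under #198 is standard Python behaviour for abstract base classes. The sketch below uses a hypothetical stand-in class (PushSenderBase, with an invented send method) rather than the real SDK type, and shows that instantiating the base raises TypeError while a concrete subclass constructs normally.

```python
from abc import ABC, abstractmethod


class PushSenderBase(ABC):
    """Hypothetical stand-in for an abstract push-notification sender."""

    @abstractmethod
    def send(self, task):
        """Deliver a notification for task."""


class RecordingPushSender(PushSenderBase):
    """Concrete subclass: implements the abstract method."""

    def __init__(self):
        self.sent = []

    def send(self, task):
        self.sent.append(task)


def try_construct(cls):
    """Return (instance, None) on success or (None, error) on TypeError."""
    try:
        return cls(), None
    except TypeError as exc:
        return None, exc


base_obj, base_err = try_construct(PushSenderBase)
impl_obj, impl_err = try_construct(RecordingPushSender)
```

The error is raised at construction time, not at first use, so wiring an abstract class into startup code fails immediately, consistent with the crash that #210 reverted.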

Code-review self-audit

Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split follow-ups into two PRs:

  • #228 (Go side): CanvasOrBearer invalid-bearer fall-through fix, short() helper to replace unsafe [:N] slices in scheduler.go, security-event log on source_id spoof. 6 new tests: TestShort_helper, TestRecordSkipped_writesSkippedStatus, TestRecordSkipped_shortWorkspaceIDNoPanic, TestActivityHandler_Report_SourceIDSpoofRejected, TestActivityHandler_Report_MatchingSourceIDAccepted, TestHistory_IncludesErrorDetail.
  • #232 (Python/docs): idle-loop hardening (asyncio.get_running_loop(), IDLE_FIRE_TIMEOUT_SECONDS clamped, typed HTTPError/URLError/catch-all, add_done_callback for fire-and-forget error logging). idle_prompt documented in org-templates/molecule-dev/org.yaml defaults. New docs/runbooks/admin-auth.md documenting the three middleware variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth per-id) + the three-question test for adding routes to CanvasOrBearer.
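
The idle-loop hardening in #232 combines three small patterns: resolving the loop with asyncio.get_running_loop(), clamping the fire timeout, and attaching add_done_callback so a fire-and-forget task cannot fail silently. The following is a minimal sketch of that combination, not the workspace code itself; the clamp bounds, function names, and the on_error hook are assumptions for illustration.

```python
import asyncio
import logging

log = logging.getLogger("idle_loop")

# Hypothetical bounds; the real clamp range is not stated in this file.
MIN_TIMEOUT_SECONDS = 1.0
MAX_TIMEOUT_SECONDS = 300.0


def clamp_timeout(value: float) -> float:
    """Keep a configured timeout inside a sane range."""
    return max(MIN_TIMEOUT_SECONDS, min(MAX_TIMEOUT_SECONDS, value))


def fire_and_forget(coro, on_error=None) -> asyncio.Task:
    """Schedule coro without awaiting it, but still surface failures."""
    # get_running_loop() raises RuntimeError when no loop is running,
    # instead of silently creating a new one.
    loop = asyncio.get_running_loop()
    task = loop.create_task(coro)

    def _done(finished: asyncio.Task) -> None:
        if finished.cancelled():
            return
        exc = finished.exception()  # also marks the exception as retrieved
        if exc is not None:
            if on_error is not None:
                on_error(exc)
            else:
                log.warning("idle fire failed: %r", exc)

    task.add_done_callback(_done)
    return task
```

Retrieving the exception inside the callback is what keeps a failed background task from surfacing only as an "exception was never retrieved" warning at garbage collection.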

Other merged fixes

  • #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0)
  • #123 dark-theme a11y (input contrast, search dialog, kbd hints)
  • #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav)
  • #139 code-review plugins for Dev Lead + QA Engineer
  • #149 scheduler heartbeat pulse (#140)
  • #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents)
  • #157 ecosystem-watch PM sweep
  • #161 e2e test mock fix for #125 EXISTS probe
  • #187 SetTrustedProxies(nil) closes #179 rate-limit bypass
  • #188 e2e auth headers on /events + /bundles/export post-#167
  • #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression)
  • #192 test regression lock for #170 DELETE /secrets/:key
  • #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash)
  • #206 surface cron error_detail in schedule history (#152 problem B)
  • #210 revert PushNotificationSender ABC crash (#204)
  • #211 migration runner skips .down.sql (data loss regression)
  • #216 enable idle-loop pilot on Technical Researcher
  • #223 reno-stars default plugins to browser-automation
  • #225 auth_headers() on /registry/register (#215)
  • #227 unit tests for plugins_install_pipeline.go (37 cases, #217)
  • #231 Claude SDK stderr probe for rate-limit error attribution (#160)
  • #235 auth_headers() on initial_prompt + idle loop (#220)

Issues closed (by merge or factual correction)

#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127, #128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144, #145, #146, #147, #148, #151, #152 prob B, #153, #154, #156, #160 (diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172, #173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191 (accepted risk), #195, #199 (fixed Fly token rotation), #201, #202, #204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229, #230, #234.

Outstanding — needs user

  • #126 Slack adapter (Phase-H product decision)
  • #160 Claude Max OAuth quota (wait for reset / upgrade / API key switch)
  • #191 self-hosted runner persistent-state docs (P3)
  • #199 Fly registry token — resolved this session but re-run of publish-platform-image pending runner capacity
  • Stripe Atlas application (launch blocker, 2-week lead)

Test counts (post-session)

  • Platform Go: 816 test functions (+70 this session — scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234)
  • Canvas vitest: 453 tests (unchanged; this session's UI/a11y patches added no tests)
  • Workspace-template pytest: 1180 tests (+40 this session — Hermes providers, a2a cancel, idle loop implicit)
  • MCP server jest: 97 tests (unchanged)

Infra notes (not in any repo)

  • FLY_API_TOKEN GH Actions secret rotated to a deploy token scoped to molecule-tenant (1-year expiry). Docs runbook update needed.
  • Mac mini runner env has RUNNER_TOOL_CACHE + AGENT_TOOLSDIRECTORY overrides. Python install via Homebrew is required one-time prep.
  • molecule-monorepo still private; Actions billing workaround is the self-hosted runner rather than flipping public or raising the cap.