Edit history — 2026-04-15
tick-9: Phase 32 Phase B.2 image pipeline (PR #80) + tick-8 docs sync (PR #79)
Two merges:
PR #79 — docs: sync documentation with 2026-04-14 tick-8 merge (#78)
Merge commit d53a1287. Tick-8 docs sync for the TenantGuard middleware.
Pure docs; CLAUDE.md test count + PLAN.md tick-8 block + edit-history entry.
PR #80 — feat(ci): publish-platform-image → ghcr.io/molecule-ai/platform (Phase B.2)
Merge commit c3cc8e87. Noteworthy: ci-infra.
Adds .github/workflows/publish-platform-image.yml:
- Trigger: push to main touching workspace-server/**; also workflow_dispatch.
- Builds workspace-server/Dockerfile via docker/build-push-action@v5.
- Pushes two tags per run: ghcr.io/molecule-ai/platform:latest (floating) and :sha-<short-commit> (immutable, pin-friendly).
- GHA cache via cache-from/cache-to: type=gha for warm rebuilds.
- Permissions: contents:read + packages:write; authenticates to GHCR using the built-in GITHUB_TOKEN, no extra secrets.
- OCI labels propagate source URL + commit SHA for provenance.
Purpose: pairs with the private control-plane repo's Fly + Neon
provisioner (PR #3 there, merged 2e85d5ad), which reads
TENANT_IMAGE=ghcr.io/molecule-ai/platform:<tag> from env and spawns
each tenant Fly Machine from this image.
Deployment state (informational — not in any repo)
- Fly apps (molecule-cp, molecule-tenant): pending CEO (flyctl apps create).
- Fly billing card: pending CEO.
- First real tenant provision: blocked on the two above.
File deltas (public repo)
- .github/workflows/publish-platform-image.yml: new.
- CLAUDE.md: tick-9 block for the new CI workflow.
- PLAN.md: new "Recently launched (2026-04-15 tick-9)" entry.
Overnight sweep (2026-04-15 16:30–19:10 UTC, ticks 17–30+)
One long session that started with a malware discovery, pivoted through a half-day of security triage, landed ~27 PRs across both repos, and ended with a self code-review cleanup round. Chronological order below, compressed to the load-bearing details so future ticks can grep this file instead of re-reading the JSONL cron-learnings stream.
Security: malware cleanup + Fly credential rotation
Discovered xmrig cryptominer installed Dec 6 2025 via commodity
npm-dropper, running out of /var/tmp/.X11-unix/xmrig-6.24.0/ as
systemd-udevd (camouflaged Linux daemon name on a Mac mini). Crontab
entry */10 * * * * had been firing every 10 min for ~4 months until
tonight — ~17,500 launches. Wiped crontab, removed payload, rotated
FLY_API_TOKEN + CLAUDE_CODE_OAUTH_TOKEN + GRAFANA_PROM_TOKEN.
Mining-only payload (no backdoor confirmed): no SSH auth-keys, no
LaunchAgents, no extra shell hooks, no other xmrig copies. However,
rotating the personal Fly token via flyctl auth login invalidated the
token still stored in GitHub Actions secrets, which surfaced much later
as the #199 publish workflow 401. Operator rule of thumb: always use
flyctl tokens create deploy -a <app> for CI, never a personal auth token.
Self-hosted CI runner migration
#186 switched every ci.yml job + publish-platform-image.yml from
runs-on: ubuntu-latest to [self-hosted, macos, arm64] (Apple-silicon
Mac mini hongming-m1-mini). Non-trivial adaptations:
- Replaced GH Actions services: postgres/redis (Linux-only) with inline docker run with PG_CONTAINER/REDIS_CONTAINER env vars and docker rm -f teardown in if: always(). Ports 15432/16379 to avoid collision with host services.
- ludeeus/action-shellcheck (Docker action, Linux-only) → fallback to local brew install shellcheck + find | xargs shellcheck.
- actions/setup-python@v5 hardcodes /Users/runner/hostedtoolcache (non-overridable; upstream limitation in the prebuilt setup.sh from actions/python-versions). Bypassed with a Verify Python 3.11 (Homebrew) step that prepends /opt/homebrew/opt/python@3.11/bin to $GITHUB_PATH. One-time runner prep: brew install python@3.11.
- publish-platform-image.yml adds docker/setup-qemu-action@v3 and an explicit platforms: linux/amd64, because the runner is arm64 and Fly tenant machines are amd64.
Controlplane PR #28 mirrored the same migration on its own single-job
ci.yml (1-line runs-on swap — no matrix adaptations needed).
Known runner rough edges tracked as follow-ups: #191 (persistent-state
docs), #199 (Fly registry 401 — resolved by minting a deploy token
scoped to molecule-tenant, tokens table previously empty).
Security fixes — auth gating
Closed a cluster of unauthenticated-route findings surfaced by the Security Auditor's hourly audit:
| PR | Issue | Fix |
|---|---|---|
| #94 | #C6 | RFC-1918 + link-local in registry URL validator |
| #99 | #104 | AdminAuth gate on GET /workspaces (topology leak) |
| #102 | — | ancestor↔descendant A2A for hierarchy routing |
| #106 | #103 HIGH | path-sanitize + admin-gate POST /org/import |
| #110 | — | revoke workspace_auth_tokens on workspace delete |
| #119 | — | IPv6 SSRF blocklist (fe80::/10, ::1/128, fc00::/7) + scheduler unit tests |
| #125/#162 | #138 | field-level authz on PATCH /workspaces/:id (cosmetic fields passthrough, sensitive fields bearer-required) |
| #155 | #151 | wire SecurityHeaders middleware |
| #167 | #164 CRIT #165 HIGH #166 MED | gate 6 unauth routes (bundles/export, bundles/import, events, events/:id, canvas/viewport PUT, admin/liveness) |
| #185 | #180 | AdminAuth on GET /approvals/pending |
| #200 | #190 HIGH | AdminAuth on POST /templates/import |
| #203 | #168 | CanvasOrBearer middleware on PUT /canvas/viewport only (route-split approach) |
| #209 | #169 C2 | source_id spoof defense in activity.Report |
| #233 | #226 MED | resolveInsideRoot on POST /workspaces template/runtime |
Rejected PR #194 (Origin-fallback approach) because it would have re-opened #164 CRITICAL to curl-based spoofing. #168 correctly fixed via the narrower route-split in #203.
Rejected PR #169 (large C1-C6 batch) because 4/7 findings were duplicates of already-merged work and migration 022 numbering collided with 022_workspace_schedules_source. Cherry-picked the one genuinely new fix (C2 source_id spoof check) into #209 and closed #169.
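The address ranges named in #94 and #119 can be illustrated with a short sketch. This is not the platform's Go validator: it is a minimal Python illustration, the helper name is_blocked_host is hypothetical, and the IPv4 ranges are the standard RFC-1918 and link-local blocks, assumed here to match what #94 covers. Only the IPv6 ranges are spelled out in the table above.

```python
import ipaddress

# IPv6 ranges are the ones listed for #119. The IPv4 ranges are the standard
# RFC-1918 and link-local blocks (an assumption about what #94 validates).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC-1918
        "169.254.0.0/16",                                  # IPv4 link-local
        "fe80::/10", "::1/128", "fc00::/7",                # IPv6 ranges from #119
    )
]


def is_blocked_host(host: str) -> bool:
    """Return True if host is an IP literal inside a blocked range (hypothetical helper)."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Resolving hostnames is out of scope for this sketch.
        return False
    # Membership is always False when address and network versions differ.
    return any(addr in net for net in BLOCKED_NETWORKS)
```

A real validator would also need to resolve hostnames and re-check the resolved addresses; the sketch only covers IP literals.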
Security fixes — data integrity
- #212 CRITICAL migration-runner bug: RunMigrations globbed *.sql and sorted alphabetically, running .down.sql BEFORE .up.sql on every boot. Wiped workspace_auth_tokens + two other pairs on every platform restart, regressing AdminAuth to fail-open bootstrap mode. Filter to skip .down.sql + unit test in postgres_migrate_test.go.
- #224 YAML injection in generateDefaultConfig: body.Name concatenated into YAML without escaping. Fixed by emitting as double-quoted YAML scalar with all control chars escaped. Structural test (parse + verify key count) instead of substring match.
- #236 log-injection in the #209 security-event log line: attacker-controlled source_id echoed via %s allowed newline injection of fake log entries. Switched to %q.
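The ordering problem behind #212 is easy to reproduce outside Go. The sketch below is a Python illustration, not the RunMigrations code; the file names and the helper select_up_migrations are hypothetical. It shows why a plain alphabetical sort puts each .down.sql ahead of its .up.sql ("d" sorts before "u"), and what a skip filter of the kind described above does.

```python
def select_up_migrations(filenames):
    """Return forward migrations only, in sorted order (hypothetical helper)."""
    return sorted(name for name in filenames if not name.endswith(".down.sql"))


# Hypothetical migration files, as a glob of *.sql might return them.
files = [
    "001_example.up.sql",
    "001_example.down.sql",
    "002_other.up.sql",
    "002_other.down.sql",
]

# Naive approach: sort everything. Each .down.sql lands before its .up.sql.
naive_order = sorted(files)

# Filtered approach: rollback scripts never reach the runner.
safe_order = select_up_migrations(files)
```

With the naive order, applying every file in sequence on boot runs each rollback script before its forward script, which matches the token-table wipe described in the entry.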
Infrastructure
- AWS KMS envelope encryption (controlplane PR #21). Per-secret DEK via kms.GenerateDataKey; blob layout [0x02][dek_len][enc_dek][nonce][ct]. Dual-mode: v2 blobs via KMS, legacy blobs via static SECRETS_ENCRYPTION_KEY. Auto-routes by leading byte; no rewrap migration needed.
- Grafana Cloud remote-write (controlplane PR #19 + #20). In-process counter registry + hand-rolled protobuf encoder. cp_requests_total emitted on every request. Push loop to prometheus-prod-32-prod-ca-east-0.grafana.net/api/prom/push with Basic auth. User 3116422, token via GRAFANA_PROM_TOKEN Fly secret.
- /cp/status deep-probe (controlplane PR #24) for Betterstack. Pings Postgres with 2s budget; returns 503 on DB miss. Distinct from /health.
- Legal pages (controlplane PR #26/#27). Public /legal/{terms,privacy,dpa,acceptable} served from embedded markdown. Dark-theme HTML shell, minimal markdown→HTML renderer (no dep), path-traversal safe via slug allowlist. Smoke covered.
- Scheduler reliability: #95 panic-recover in tick(), #149 independent heartbeat goroutine so long fires don't look stale on /admin/liveness, #207 concurrency-aware skip when workspace active_tasks>0.
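The v2 blob layout and the leading-byte routing from the KMS entry can be sketched as follows. This is a Python illustration under stated assumptions, not the controlplane code: the entry does not give field widths, so the 2-byte big-endian dek_len and the 12-byte nonce are guesses, and pack_v2 and parse_blob are hypothetical names. No encryption or KMS call is performed here.

```python
import struct

V2_PREFIX = 0x02   # leading byte from the documented layout
NONCE_LEN = 12     # assumption: 12-byte nonce (typical for AES-GCM)
HEADER_LEN = 3     # prefix byte + assumed 2-byte dek_len


def pack_v2(enc_dek: bytes, nonce: bytes, ct: bytes) -> bytes:
    """Assemble [0x02][dek_len][enc_dek][nonce][ct] (field widths are assumptions)."""
    if len(nonce) != NONCE_LEN:
        raise ValueError("unexpected nonce length")
    # Assumption: dek_len is a 2-byte big-endian unsigned integer.
    return bytes([V2_PREFIX]) + struct.pack(">H", len(enc_dek)) + enc_dek + nonce + ct


def parse_blob(blob: bytes):
    """Route by leading byte: ("v2", enc_dek, nonce, ct) or ("legacy", blob)."""
    if not blob or blob[0] != V2_PREFIX:
        return ("legacy", blob)
    if len(blob) < HEADER_LEN:
        raise ValueError("truncated v2 blob")
    (dek_len,) = struct.unpack_from(">H", blob, 1)
    dek_end = HEADER_LEN + dek_len
    nonce_end = dek_end + NONCE_LEN
    enc_dek, nonce, ct = blob[HEADER_LEN:dek_end], blob[dek_end:nonce_end], blob[nonce_end:]
    if len(enc_dek) != dek_len or len(nonce) != NONCE_LEN:
        raise ValueError("truncated v2 blob")
    return ("v2", enc_dek, nonce, ct)
```

Because routing depends only on the first byte, old and new blobs can coexist without a rewrap pass. How the real code guards against a legacy blob that happens to start with 0x02 is not stated in this entry.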
Features
- #205 idle-loop reflection pattern in workspace-template. Opt-in via idle_prompt + idle_interval_seconds in config.yaml. Self-sends the idle prompt via platform A2A proxy every interval while heartbeat.active_tasks == 0. Hermes/Letta shape.
- #208 Hermes Phase 1 multi-provider. 15 providers via adapters/hermes/providers.py registry (Nous, OpenRouter, OpenAI, Anthropic, xAI, Gemini, Qwen, GLM, Kimi, MiniMax, DeepSeek, Groq, Together, Fireworks, Mistral). Back-compat with PR2 key resolution preserved. 26 tests.
- #198 A2A protocol compliance batch closing #173/#174/#175: cancel() emits TaskStatusUpdateEvent(canceled, final=True), stateTransitionHistory=True in AgentCapabilities. Note: wired push_sender=PushNotificationSender() and this crashed on startup because PushNotificationSender is an abstract base class; reverted in #210.
- #186 self-hosted macOS runner migration (described above).
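The startup crash noted under #198 follows from how Python abstract base classes behave: instantiating one that still has abstract methods raises TypeError. The sketch below uses a stand-in class with the same name; the method name send_notification and the NoopPushSender subclass are assumptions for illustration, not the SDK's actual definitions.

```python
from abc import ABC, abstractmethod


class PushNotificationSender(ABC):
    """Stand-in for the SDK class; the abstract method name is an assumption."""

    @abstractmethod
    async def send_notification(self, task) -> None:
        ...


class NoopPushSender(PushNotificationSender):
    """Hypothetical concrete subclass that satisfies the abstract interface."""

    async def send_notification(self, task) -> None:
        return None


def try_construct(cls):
    """Return (instance, None) on success or (None, error) if construction fails."""
    try:
        return cls(), None
    except TypeError as exc:
        return None, exc


# The abstract class cannot be constructed; the concrete subclass can.
abstract_result = try_construct(PushNotificationSender)
concrete_result = try_construct(NoopPushSender)
```

Passing a concrete implementation, or leaving the parameter unset as the revert in #210 does, avoids the crash.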
Code-review self-audit
Ran /code-review on my own batch merges, surfaced 8 🟡 issues, split follow-ups into two PRs:
- #228 (Go side): CanvasOrBearer invalid-bearer fall-through fix, short() helper to replace unsafe [:N] slices in scheduler.go, security-event log on source_id spoof. 6 new tests: TestShort_helper, TestRecordSkipped_writesSkippedStatus, TestRecordSkipped_shortWorkspaceIDNoPanic, TestActivityHandler_Report_SourceIDSpoofRejected, TestActivityHandler_Report_MatchingSourceIDAccepted, TestHistory_IncludesErrorDetail.
- #232 (Python/docs): idle-loop hardening (asyncio.get_running_loop(), IDLE_FIRE_TIMEOUT_SECONDS clamped, typed HTTPError/URLError/catch-all, add_done_callback for fire-and-forget error logging). idle_prompt documented in org-templates/molecule-dev/org.yaml defaults. New docs/runbooks/admin-auth.md documenting the three middleware variants (AdminAuth strict, CanvasOrBearer soft, WorkspaceAuth per-id) + the three-question test for adding routes to CanvasOrBearer.
Other merged fixes
- #122 canvas grid origin offset (nodes spawn at 100,100 not 0,0)
- #123 dark-theme a11y (input contrast, search dialog, kbd hints)
- #131 WCAG critical (ARIA live toasts, dialog focus trap, keyboard nav)
- #139 code-review plugins for Dev Lead + QA Engineer
- #149 scheduler heartbeat pulse (#140)
- #150 ecosystem-watch daily sweep (Microsoft Agent Framework, Vercel Open Agents)
- #157 ecosystem-watch PM sweep
- #161 e2e test mock fix for #125 EXISTS probe
- #187 SetTrustedProxies(nil) closes #179 rate-limit bypass
- #188 e2e auth headers on /events + /bundles/export post-#167
- #189 revert Security Auditor cron to 2x/day (closes #178 token-budget regression)
- #192 test regression lock for #170 DELETE /secrets/:key
- #197 reapply user's a6cfc5f bypass-setup-python to main (dropped by #186 squash)
- #206 surface cron error_detail in schedule history (#152 problem B)
- #210 revert PushNotificationSender ABC crash (#204)
- #211 migration runner skips .down.sql (data loss regression)
- #216 enable idle-loop pilot on Technical Researcher
- #223 reno-stars default plugins to browser-automation
- #225 auth_headers() on /registry/register (#215)
- #227 unit tests for plugins_install_pipeline.go (37 cases, #217)
- #231 Claude SDK stderr probe for rate-limit error attribution (#160)
- #235 auth_headers() on initial_prompt + idle loop (#220)
Issues closed (by merge or factual correction)
#85, #93, #100, #101, #103, #104, #105, #115, #126 epic parent, #127, #128, #129, #132, #134, #135, #136, #138, #140, #141, #142, #143, #144, #145, #146, #147, #148, #151, #152 prob B, #153, #154, #156, #160 (diagnosed, not fixed), #163, #164, #165, #166, #168, #170, #171, #172, #173, #174, #175, #176, #177, #178, #180, #181, #183, #184, #190, #191 (accepted risk), #195, #199 (fixed Fly token rotation), #201, #202, #204, #211, #213, #214, #215, #217, #218, #219, #220, #221, #226, #229, #230, #234.
Outstanding — needs user
- #126 Slack adapter (Phase-H product decision)
- #160 Claude Max OAuth quota (wait for reset / upgrade / API key switch)
- #191 self-hosted runner persistent-state docs (P3)
- #199 Fly registry token: resolved this session, but re-run of publish-platform-image pending runner capacity
- Stripe Atlas application (launch blocker, 2-week lead)
Test counts (post-session)
- Platform Go: 816 test functions (+70 this session — scheduler, handlers, middleware, db, crypto tests added across #95/#99/#106/#110/#119/#151/#167/#185/#187/#192/#200/#203/#206/#207/#210/#211/#212/#227/#228/#232/#234)
- Canvas vitest: 453 tests (+0 structure, +0 new tests this session — UI/a11y patches)
- Workspace-template pytest: 1180 tests (+40 this session — Hermes providers, a2a cancel, idle loop implicit)
- MCP server jest: 97 tests (unchanged)
Infra notes (not in any repo)
- FLY_API_TOKEN GH Actions secret rotated to a deploy token scoped to molecule-tenant (1-year expiry). Docs runbook update needed.
- Mac mini runner env has RUNNER_TOOL_CACHE + AGENT_TOOLSDIRECTORY overrides. Python install via Homebrew is required one-time prep.
- molecule-monorepo still private; Actions billing workaround is the self-hosted runner rather than flipping public or raising the cap.