Commit Graph

712 Commits

Author SHA1 Message Date
claude-ceo-assistant
b3041c13d3 fix(org-import): emit started event after YAML parse so name is populated
The org.import.started event was firing immediately after request body
bind, before the YAML at body.Dir was loaded. Result: payload.name was
"" whenever the caller passed `dir` (the common path — the canvas and
all live imports use dir, not an inline template). Three started rows
already in the local platform's structure_events have an empty name.

Fix: move the started emit (and importStart timestamp) to after the
YAML unmarshal / inline-template fallthrough, where tmpl.Name is
guaranteed populated.

Bonus: pre-parse error returns (invalid body, traversal-rejected dir,
file-not-found, YAML expansion fail, YAML unmarshal fail, neither dir
nor template provided) no longer emit an orphan started row — every
started is now guaranteed a paired completed/failed.
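The reordered flow can be sketched in miniature; names here (importBody, loadTemplate, emitStarted) are hypothetical stand-ins for the real handler internals in org_import.go:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for the real handler shapes.
type importBody struct{ Dir, Template string }
type orgTemplate struct{ Name string }

var events []string

func emitStarted(name string) { events = append(events, "started:"+name) }

// loadTemplate stands in for the YAML unmarshal / inline-template fallthrough.
func loadTemplate(b importBody) (orgTemplate, error) {
	if b.Dir == "" && b.Template == "" {
		return orgTemplate{}, errors.New("neither dir nor template provided")
	}
	return orgTemplate{Name: "Molecule AI Dev Team (dev-only)"}, nil
}

func handleImport(b importBody) error {
	// Pre-parse error returns happen BEFORE any emit, so a failed parse
	// can no longer leave an orphan started row.
	tmpl, err := loadTemplate(b)
	if err != nil {
		return err
	}
	importStart := time.Now() // timestamp moved here too
	emitStarted(tmpl.Name)    // tmpl.Name is guaranteed populated here
	_ = importStart
	return nil
}

func main() {
	_ = handleImport(importBody{Dir: "teams/dev"})
	fmt.Println(events)
}
```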

Verified live against running platform: re-imported molecule-dev-only,
new started row in structure_events carries
"Molecule AI Dev Team (dev-only)" instead of "".

Tests: full handler suite green (`go test ./internal/handlers/`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:25:24 -07:00
claude-ceo-assistant
bfefcb315b refactor(handlers): Delete() delegates to CascadeDelete helper
Drops ~150 lines of duplicated cascade logic from the Delete HTTP
handler — workspace_crud.go's CascadeDelete (added in PR #137) and
Delete() were running the same #73 race-guard sequence (status update →
canvas_layouts → tokens → schedules → container stop → broadcast),
just with Delete() inlined and CascadeDelete owning the OrgImport
reconcile path.

CascadeDelete now returns the descendant id list (was: count) so
Delete() can drive the optional ?purge=true hard-delete against the
same set the cascade just touched.
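A minimal sketch of the returned-id-list contract; names are hypothetical, and the real helper runs the full #73 teardown sequence per node:

```go
package main

import "fmt"

// cascadeDelete walks descendants leaf-first and returns every id it
// touched, so the caller can drive ?purge=true against exactly that set.
func cascadeDelete(root string, children map[string][]string) []string {
	var ids []string
	var walk func(id string)
	walk = func(id string) {
		for _, c := range children[id] {
			walk(c)
		}
		// teardown sequence (status update, layouts, tokens, ...) goes here
		ids = append(ids, id)
	}
	walk(root)
	return ids
}

// deleteHandler sketches the slimmed Delete(): delegate, then optionally purge.
func deleteHandler(root string, children map[string][]string, purge bool) int {
	ids := cascadeDelete(root, children)
	if purge {
		for range ids {
			// hard-delete each row the cascade just touched
		}
	}
	return len(ids) - 1 // cascade_deleted counts descendants, not the root
}

func main() {
	fmt.Println(deleteHandler("ws-1", map[string][]string{"ws-1": {"a", "b"}}, false))
}
```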

Net diff: workspace_crud.go shrinks from ~270 lines in Delete() to
~75 lines (parse + 409 confirm gate + CascadeDelete call + stop-error
500 + purge block + 200 response). Behavior identical — same SQL
ordering, same #73 race guard, same response shapes. Three sqlmock
tests for the 0-children case gained one extra ExpectQuery for the
recursive-CTE descendants scan (the old inline code skipped that
query when len(children)==0; CascadeDelete walks unconditionally —
returns 0 rows, same end state, one extra cheap query).

Tests: full handler suite green (`go test ./internal/handlers/`).
Live-tested against the running local platform: DELETE on a fake
workspace returns `{"cascade_deleted":0,"status":"removed"}`,
fleet of 9 workspaces preserved, refactored handler matches the
prior wire-shape exactly.

Tracked as the PR #137 follow-up tech-debt item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:47:51 -07:00
claude-ceo-assistant
3de51faa19 fix(org-import): reconcile mode + audit-event emission
Closes the additive-import zombie bug — re-running /org/import with a
tree shape that reparents same-named roles left the prior workspace
online because lookupExistingChild's dedupe is parent-scoped (different
parent_id → "different" workspace). Caught 2026-05-08 after a dev-tree
re-import left 8 orphans co-existing with the new tree on canvas until
manual cascade-delete.

Three layers in this PR:

- mode="reconcile" on /org/import — after the import loop, online
  workspaces whose name matches an imported name but whose id isn't in
  the result set are cascade-deleted. Default mode "" / "merge"
  preserves existing additive behavior. Empty-set guards prevent
  accidental "delete everything" if either array comes up empty.

- WorkspaceHandler.CascadeDelete extracted as a callable helper from
  the existing Delete HTTP handler so OrgImport's reconcile path shares
  the same teardown sequence (#73 race guard, container stop, volume
  removal, token revocation, schedule disable, event broadcast). The
  HTTP Delete handler still inlines the same logic; deduplication
  tracked as tech-debt follow-up.

- emitOrgEvent(structure_events) records org.import.started +
  org.import.completed with mode, created/skipped/reconcile_removed
  counts, duration_ms, error. Replaces the lost-on-restart stdout-only
  log shape for an audit-trail surface that's queryable by SQL. Closes
  the "what happened at 20:13?" debugging gap that motivated this fix.
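The reconcile pass and its empty-set guards can be sketched as follows; types and names are hypothetical:

```go
package main

import "fmt"

type onlineWS struct{ ID, Name string }

// reconcileRemovals returns ids of online workspaces whose name matches
// an imported name but whose id is not in the import result set.
func reconcileRemovals(imported map[string]string /* name to new id */, online []onlineWS) []string {
	// Empty-set guards: never turn an empty import into "delete everything".
	if len(imported) == 0 || len(online) == 0 {
		return nil
	}
	resultIDs := make(map[string]bool, len(imported))
	for _, id := range imported {
		resultIDs[id] = true
	}
	var removed []string
	for _, w := range online {
		_, nameMatch := imported[w.Name]
		if nameMatch && !resultIDs[w.ID] {
			removed = append(removed, w.ID) // cascade-deleted by the caller
		}
	}
	return removed
}

func main() {
	fmt.Println(reconcileRemovals(
		map[string]string{"Dev Lead": "new-1"},
		[]onlineWS{{ID: "old-9", Name: "Dev Lead"}, {ID: "new-1", Name: "Dev Lead"}},
	))
}
```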

Verified live against the local platform: cascade-delete on an old
tree's removed root cleared 8 surviving orphans; mode="reconcile" with
a freshly-INSERTed fake orphan removed exactly the fake; idempotent
re-run of reconcile is a no-op (0 removed, no errors); structure_events
captures every started+completed pair with full payload.

7 new unit tests (walkOrgWorkspaceNames flat/nested/spawning:false/
empty-name; emitOrgEvent success + DB-error-swallow; errString). Full
handler suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:04:47 -07:00
6f861926bd Merge pull request 'fix(workspace_provision): preserve MODEL secret over MODEL_PROVIDER slug on restart' (#136) from fix/preserve-model-secret-on-restart into main 2026-05-08 21:31:50 +00:00
15c5f32491 fix(workspace_provision): preserve MODEL secret over MODEL_PROVIDER slug on restart
Phase 4 follow-up to template-claude-code PR #9 (2026-05-08 dev-tree wedge).

Pre-fix: applyRuntimeModelEnv unconditionally overwrote envVars["MODEL"]
with the MODEL_PROVIDER slug whenever payload.Model was empty (the restart
path). This silently wiped the operator's explicit per-persona MODEL
secret on every restart.

Symptom: dev-tree workspaces booted correctly on first /org/import (the
envVars map was populated direct from the persona env file with both
MODEL=MiniMax-M2.7-highspeed and MODEL_PROVIDER=minimax), then on the
next Restart the MODEL secret got clobbered to literal "minimax" — a
provider slug, not a valid model id — and the workspace template's
adapter failed to match any registry prefix, fell through to providers[0]
(anthropic-oauth), and wedged at SDK initialize.

Fix: resolution order in applyRuntimeModelEnv is now:
  1. payload.Model (caller passed the canvas-picked model id verbatim)
  2. envVars["MODEL"] (workspace_secret persisted from persona env)
  3. envVars["MODEL_PROVIDER"] (legacy canvas Save+Restart shape)
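The precedence chain reads naturally as an early-return ladder. A sketch with a hypothetical signature (the real logic mutates envVars inside applyRuntimeModelEnv):

```go
package main

import "fmt"

// resolveModel sketches the fixed precedence: explicit payload first,
// then the persisted MODEL secret, then the legacy MODEL_PROVIDER slug.
func resolveModel(payloadModel string, envVars map[string]string) string {
	if payloadModel != "" {
		return payloadModel // 1. canvas-picked model id, verbatim
	}
	if m := envVars["MODEL"]; m != "" {
		return m // 2. persona-env secret: no longer clobbered on restart
	}
	return envVars["MODEL_PROVIDER"] // 3. legacy fallback (may be empty)
}

func main() {
	restart := map[string]string{"MODEL": "MiniMax-M2.7-highspeed", "MODEL_PROVIDER": "minimax"}
	fmt.Println(resolveModel("", restart))
}
```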

Tests
-----
TestApplyRuntimeModelEnv_PersonaEnvMODELSecretPreserved — locks in
the new resolution order with four cases:
  - MODEL secret wins over MODEL_PROVIDER slug (persona-env shape)
  - MODEL secret wins even when same as MODEL_PROVIDER
  - MODEL absent → fall back to MODEL_PROVIDER (legacy shape)
  - Both absent → no MODEL set (no-op)

Existing TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes
continues to pass — fix is strictly additive on the precedence chain.
2026-05-08 14:31:14 -07:00
9b5e89bb42 Merge pull request 'feat(org-import): add spawning:false field to skip workspace + descendants' (#135) from feat/org-import-spawning-false into main 2026-05-08 21:20:56 +00:00
claude-ceo-assistant
b91da1ab77 feat(org-import): add spawning:false field to skip workspace + descendants
Lets a workspace declare it (and its entire subtree) should be skipped
during /org/import. Pointer-typed `*bool` so we distinguish "explicitly
false" from "unset" (default = spawn).

## Use case

The dev-tree org template ships the full role taxonomy (Dev Lead with
Core Platform / Controlplane / App & Docs / Infra / SDK Leads, each with
their own engineering / QA / security / UI-UX children — 27 personas
total in a single import). Some setups need a smaller set:

- Local dev on a memory-constrained machine
- Demo / smoke runs that don't need the full org breathing
- Customer trials starting with leadership-only before fan-out

Pre-fix the only options were:
- Edit the canonical template (mutates shared state)
- Author a parallel slimmer template (duplicates structure)
- Manual workspace deprovision after full import (wasteful — already paid
  the docker pull / build cost)

`spawning: false` is the per-workspace knob that solves this without
touching the canonical template structure.

## Semantics

- Unset: workspace spawns (current behaviour, no migration)
- `spawning: true`: explicitly spawns (same as unset)
- `spawning: false`: workspace is skipped AND every descendant is
  skipped. The guard sits BEFORE any side effect in
  createWorkspaceTree — no DB row, no docker provision, no children
  recursion. A false-spawning subtree is genuinely a no-op except for
  the log line. countWorkspaces still counts the subtree (so /org/templates
  numbers reflect the full structure).
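The guard and the counting asymmetry can be sketched in a few lines; field names follow the commit, helper shapes are hypothetical:

```go
package main

import "fmt"

type wsNode struct {
	Name     string
	Spawning *bool // *bool distinguishes "explicitly false" from "unset"
	Children []wsNode
}

// createWorkspaceTree sketch: the guard sits BEFORE any side effect,
// so a spawning:false subtree creates no rows and recurses nowhere.
func createWorkspaceTree(n wsNode, created *[]string) {
	if n.Spawning != nil && !*n.Spawning {
		return
	}
	*created = append(*created, n.Name) // stands in for DB row + provision
	for _, c := range n.Children {
		createWorkspaceTree(c, created)
	}
}

// countWorkspaces still counts skipped subtrees, so /org/templates
// numbers reflect the full structure.
func countWorkspaces(n wsNode) int {
	total := 1
	for _, c := range n.Children {
		total += countWorkspaces(c)
	}
	return total
}

func main() {
	f := false
	tree := wsNode{Name: "Dev Lead", Children: []wsNode{
		{Name: "QA Lead", Spawning: &f, Children: []wsNode{{Name: "QA Eng"}}},
		{Name: "Core Lead"},
	}}
	var created []string
	createWorkspaceTree(tree, &created)
	fmt.Println(created, countWorkspaces(tree))
}
```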

## Stage A — verified

Local dev-only template that wraps teams/dev.yaml (Dev Lead) with
children:[] cleared on the 5 sub-team yaml files, plus 3 floater
personas (Release Manager / Integration Tester / Fullstack Engineer).
/org/import returned 9 workspaces. Drop-in: same result via
`spawning: false` on each sub-tree root in the future.

## Stage B — N/A

Pure additive feature on the org-template handler. No SaaS deploy chain
implications.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 14:20:14 -07:00
claude-ceo-assistant
c3596d6271 fix(org-import): use ws.FilesDir as persona-dir lookup, add docker-cli-buildx to dev image
## org_import.go — persona env injection root-cause fix

The Phase-3 fix from earlier today (`feedback/per-agent-gitea-identity-default`)
introduced loadPersonaEnvFile to inject persona-specific creds into
workspace_secrets on /org/import. It passed `ws.Role` as the persona-dir
lookup key, but in our dev-tree org.yaml shape `role:` carries the
multi-line descriptive text the agent reads from its prompt
("Engineering planning and team coordination — leads Core Platform,
Controlplane, ..."), while `files_dir:` holds the short slug
(`core-lead`, `dev-lead`, etc.) matching
`~/.molecule-ai/personas/<files_dir>/env`.

isSafeRoleName silently rejected the multi-word role text → no persona
env loaded → every imported workspace booted with zero
workspace_secrets rows → no ANTHROPIC / CLAUDE_CODE / MINIMAX auth in
the container env → claude_agent_sdk wedged on `query.initialize()`
with a 60s control-request timeout.
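The key swap itself is small; a sketch with hypothetical struct and helper shapes:

```go
package main

import (
	"fmt"
	"path/filepath"
	"regexp"
)

type orgWorkspace struct {
	Role     string // multi-line descriptive prompt text in the dev-tree shape
	FilesDir string // short slug, e.g. "core-lead"
}

var safeRole = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

func isSafeRoleName(s string) bool { return safeRole.MatchString(s) }

// personaEnvPath sketches the fix: key the persona-dir lookup off
// FilesDir (was: Role, which isSafeRoleName silently rejected).
func personaEnvPath(root string, ws orgWorkspace) (string, bool) {
	key := ws.FilesDir
	if !isSafeRoleName(key) {
		return "", false
	}
	return filepath.Join(root, key, "env"), true
}

func main() {
	p, ok := personaEnvPath("/etc/molecule-bootstrap/personas", orgWorkspace{
		Role:     "Engineering planning and team coordination",
		FilesDir: "core-lead",
	})
	fmt.Println(p, ok)
}
```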

After the fix, /org/import on the dev tree (27 personas) populates
8 workspace_secrets per workspace (Gitea identity + MODEL/MODEL_PROVIDER
+ provider-specific token), 5 of 6 leads boot online, and the
remaining wedges trace to a separate runtime-template-repo bug
(workspace-template-claude-code's claude_sdk_executor.py doesn't
dispatch on MODEL_PROVIDER=minimax — filed separately).

## Dockerfile.dev — docker-cli + docker-cli-buildx

Without these, every claude-code/tier-2 workspace POST fails-fast:
- docker-cli alone produces `exec: "docker": executable file not found`
- docker-cli alone (no buildx) fails on `docker build` with
  `ERROR: BuildKit is enabled but the buildx component is missing or broken`

Both packages are now installed in the dev image; verified with
`docker exec molecule-core-platform-1 docker buildx version`.

## Stage A verified

Local /org/import dev-only path: 27 workspaces created, all 27 receive
persona env injection (8 secrets each — Gitea identity + provider creds).
Lead workspaces (claude-code-OAuth tier) boot online.

## Stage B — N/A

Local-dev-only path (docker-compose.dev.yml + dev image). Tenant EC2
provisioning uses Dockerfile.tenant (untouched).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:50:46 -07:00
claude-ceo-assistant
7eda8f510f feat(local-dev): containerize platform + canvas stack via docker-compose (closes #126)
Replaces the legacy nohup `go run ./cmd/server` setup with a fully
containerized local stack: postgres + redis + platform + canvas, all
with `restart: unless-stopped` so they survive Mac sleep/wake and
Docker Desktop daemon restarts.

## Changes

- **docker-compose.yml**
  - `restart: unless-stopped` on platform/postgres/redis
  - `BIND_ADDR=0.0.0.0` for platform — the dev-mode-fail-open default
    of 127.0.0.1 (PR #7) made the host unable to reach the container
    even with port mapping. Container netns is already isolated, so
    binding all interfaces inside is safe.
  - Healthchecks switched from `wget --spider` (HEAD → 404 forever
    because /health is GET-only) to `wget -qO /dev/null` (GET).
    Same regression existed on canvas; fixed both.

- **workspace-server/Dockerfile.dev**
  - `CGO_ENABLED=1` → `0` to match prod Dockerfile + Dockerfile.tenant.
    Without this, the alpine dev image fails with "gcc: not found"
    because workspace-server has no actual cgo deps but the env was
    forcing the cgo build path. Closes a divergence introduced in
    9d50a6da (today's air hot-reload PR).

- **canvas/Dockerfile**
  - `npm install` → `npm ci --include=optional` for lockfile-exact
    installs that include platform-specific @tailwindcss/oxide native
    binaries. Without these, `next build` fails with "Cannot read
    properties of undefined (reading 'All')" on the
    `@import "tailwindcss"` directive.

- **canvas/.dockerignore** (new)
  - Excludes `node_modules` and `.next` so the Dockerfile's
    `COPY . .` step doesn't clobber the freshly-installed container
    node_modules with the host's (potentially stale or wrong-arch)
    copy. This was the actual root cause of the canvas build break.

- **workspace-server/.gitignore**
  - Adds `/tmp/` for air's live-reload build cache.

## Stage A verified

```
container          status                    restart
postgres-1         Up (healthy)              unless-stopped
redis-1            Up (healthy)              unless-stopped
platform-1         Up (healthy, air-mode)    unless-stopped
canvas-1           Up (healthy)              unless-stopped

GET :8080/health  → 200
GET :3000/        → 200
DB preserved:     407 workspace rows + 5 named personas
Persona mount:    28 dirs at /etc/molecule-bootstrap/personas
```

## Stage B — N/A

This is local-dev infrastructure only. None of these files ship to
SaaS tenants — production EC2s use `Dockerfile.tenant` + `ec2.go`
user-data, not docker-compose.

## Out of scope

- The decorative-but-broken `wget --spider` healthcheck has presumably
  also been silently 404'ing on prod tenants. Ship a follow-up to
  audit + fix the prod path; not done here to keep the PR scoped.
- Docker Desktop "Start at login" is a per-machine GUI setting that
  must be toggled manually (Settings → General).
- The legacy heartbeat-all.sh that pinged 5 persona workspaces from
  the host has been deleted (~/.molecule-ai/heartbeat-all.sh).
  Per Hongming: each workspace is responsible for its own heartbeat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 10:53:39 -07:00
claude-ceo-assistant
120b3a25aa feat(workspaces): update_tier column for canary vs production fan-out
Closes core#115 partial. Schema-only change; the apply-endpoint filter
logic that reads this column lands with core#123 (drift detector +
queue + apply endpoint, the deferred follow-up of core#113).

Default 'production' so existing customers (Reno-Stars + any future
tenant) are default-safe. Synthetic dogfooding workspaces opt INTO
'canary' explicitly.

CHECK constraint pins the closed value set ('canary' | 'production') —
the apply endpoint's filter relies on the database to reject anything
else, so a future operator typo in PATCH /workspaces/:id ({update_tier:
'canery'}) returns a constraint violation, not silent fan-out to
nobody.

Partial index on canary rows since the apply-endpoint query path
('apply this update only to canary tier first') hits canary much more
often than production, and the production set is the much larger
default.

WHAT THIS DOES NOT DO (lands with core#123)
  - PATCH endpoint to flip a workspace to canary
  - The apply endpoint that consults the column
  - Tests that exercise canary-vs-production fan-out

Schema-only foundation; same pattern as core#113 (workspace_plugins).

PHASE 4 SELF-REVIEW
  Correctness: No finding — IF NOT EXISTS guards, DEFAULT clause means
    existing rows get 'production' on migration apply.
  Readability: No finding — comment block documents the tier semantics
    + the deferral to core#123.
  Architecture: No finding — additive ALTER, partial index for the
    expected access pattern.
  Security: No finding — no code path; column constraint reduces blast
    radius of bad PATCH input.
  Performance: No finding — partial index minimizes write amplification
    on the production-default rows.

REFS
  core#115 — this issue
  core#123 — apply endpoint follow-up (will exercise this column)
  core#113 — version subscription DB foundation (sibling pattern)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:55:19 -07:00
claude-ceo-assistant
72b0d4b1ab feat(plugins): workspace_plugins tracking table — version-subscription foundation
Closes core#113 partial. Adds the DB foundation for the
version-subscription model. Drift detection + queue + admin apply
endpoint are follow-up scope (separate PR; filed as a new issue).

WHY THIS PR ONLY GETS US PART-WAY
  Plugin install state today is filesystem-only — '/configs/plugins/<name>/'
  inside the container. There's no DB record of 'plugin X installed at
  workspace W from source S, tracking ref T'. That makes drift detection
  impossible: nothing to compare upstream tags against.

  This PR adds the table + the install-endpoint hook that writes to it.
  With baseline tags now on every plugin (post internal#92), the table
  starts collecting tracked-ref values immediately on the next install.
  The actual drift-check job + queue + apply endpoint layer on top.

WHAT THIS ADDS
  workspace_plugins table:
    workspace_id   FK → workspaces(id) ON DELETE CASCADE
    plugin_name    canonical name from plugin.yaml
    source_raw     full source URL the install used
    tracked_ref    'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'
    installed_at, updated_at

  installRequest gains optional 'track' field (defaults to 'none').
  Install handler upserts the workspace_plugins row after delivery
  succeeds. DB write failure is logged but doesn't fail the install
  (the plugin IS in the container; surfacing 500 misleads the caller).

  validateTrackedRef enforces the closed set of accepted shapes:
    'none' | 'tag:<non-empty>' | 'sha:<non-empty>'
  Bare values like 'latest' / 'main' / version-strings without
  prefix are rejected — the drift detector keys on prefix to know
  what kind of resolution to do.
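A sketch of the closed-set validator (signature hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// validateTrackedRef accepts only 'none' | 'tag:<non-empty>' | 'sha:<non-empty>'.
// Bare values ("latest", "main", "v1.2.3") are rejected so the drift
// detector can key its resolution strategy off the prefix alone.
func validateTrackedRef(ref string) bool {
	switch {
	case ref == "none":
		return true
	case strings.HasPrefix(ref, "tag:"):
		return len(ref) > len("tag:")
	case strings.HasPrefix(ref, "sha:"):
		return len(ref) > len("sha:")
	default:
		return false
	}
}

func main() {
	for _, ref := range []string{"none", "tag:v1.2.3", "latest", "tag:"} {
		fmt.Println(ref, validateTrackedRef(ref))
	}
}
```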

WHAT THIS DOES NOT ADD (filed separately)
  - Drift detector job (cron / on-demand) that scans
    'WHERE tracked_ref != none' rows and queues updates on upstream drift
  - plugin_update_queue table (separate migration once detector lands)
  - GET /admin/plugin-updates-pending and POST .../apply endpoints
  - Tier-aware apply (core#115 — composes here)

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — install endpoint behavior unchanged for
    callers that don't pass 'track'. DB write is best-effort + logged
    on failure. validateTrackedRef rejects ambiguous bare strings.
  Readability: No finding — separate file plugins_tracking.go isolates
    the new concern; install handler delta is a single 4-line block.
  Architecture: No finding — additive table; existing schema untouched.
    Migration 20260508160000_* uses the timestamp-prefixed convention.
  Security: No finding — INSERT params via parameterized placeholders (no string
    interpolation). validateTrackedRef rejects unexpected shapes before
    the column constraint would.
  Performance: No finding — one extra ExecContext per install. Install
    is already seconds-scale (network fetch + tar + docker exec); rounds
    to noise.

TESTS (1 new, all green)
  TestValidateTrackedRef — pin closed set + structural validators

REFS
  core#113 — this issue (foundation only; drift+queue+apply = follow-up)
  internal#92, internal#93 — plugin/template baseline tags (now exists for tracking)
  core#114 — atomic install (this PR composes — no atomicity regression)
  core#115 — canary tier filter (will key off the same DB foundation)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:52:35 -07:00
claude-ceo-assistant
249e760fbd feat(plugins): hot-reload classifier — skip restart on SKILL-content-only updates
Closes molecule-core#112. Composes with #114 (atomic install).

Before issuing restartFunc, classify the diff between staged and live:
  - skill-content-only: only **/SKILL.md content changed
                        → skip restart (Claude Code re-reads SKILL.md on
                          each Skill invocation; no in-memory cache)
  - cold: anything else
                        → restartFunc as before
                          (hooks/settings load at session start;
                          plugin.yaml is structural; added/removed files
                          require a fresh load)

DETECTION
  - Hash every regular file in staged tree (host filesystem, sha256)
  - Hash every regular file in live tree (in-container via docker exec
    sh -c 'cd <livePath> && find . -type f -print0 | xargs -0 sha256sum')
  - .complete marker dropped from comparison (mtime varies install-to-
    install; including it would force-cold every reinstall)
  - File added/removed → cold
  - File content differs but isn't SKILL.md → cold
  - All differences are SKILL.md basenames → skill-content-only

DEFAULTS COLD
  - First install (no live tree) → cold
  - Live tree read failure → cold (conservative; never hot-reload speculatively)
  - Symlinks skipped during hash (same posture as tar walker)
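The decision table collapses into a small pure function over two path-to-sha256 maps. A sketch with hypothetical shapes (the real classifier also drops the .complete marker from both maps before comparing):

```go
package main

import (
	"fmt"
	"path/filepath"
)

// classify compares staged vs live file hashes. Every uncertain or
// structural difference defaults to "cold"; only pure SKILL.md content
// changes earn the restart-skipping "skill-content-only" verdict.
func classify(staged, live map[string]string) string {
	if len(live) == 0 {
		return "cold" // first install, or live-tree read failure
	}
	for path := range staged {
		if _, ok := live[path]; !ok {
			return "cold" // file added
		}
	}
	for path, liveSum := range live {
		stagedSum, ok := staged[path]
		if !ok {
			return "cold" // file removed
		}
		if stagedSum != liveSum && filepath.Base(path) != "SKILL.md" {
			return "cold" // non-SKILL content change
		}
	}
	return "skill-content-only"
}

func main() {
	live := map[string]string{"skills/x/SKILL.md": "aaa", "plugin.yaml": "p1"}
	fmt.Println(classify(map[string]string{"skills/x/SKILL.md": "bbb", "plugin.yaml": "p1"}, live))
}
```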

PHASE 4 SELF-REVIEW
  Correctness: No finding — all error paths default to cold; never
    falsely classify as skill-content-only. The .complete drop is
    a deliberate exception (the marker is bookkeeping, not content).
  Readability: No finding — single-purpose helpers (hashLocalTree,
    hashContainerTree, isSkillMarkdown, shQuote) each do one thing.
    The classifier itself reads as 'compare set, then walk diff with
    isSkillMarkdown gate.'
  Architecture: No finding — composes existing execAsRoot primitive;
    new helpers in plugins_classifier.go don't touch any other
    handler. Old behavior unchanged when live read fails.
  Security: No finding — shQuote single-quotes any non-trivial path,
    pluginName comes from validatePluginName-validated source, and
    the docker exec command takes the path as a single arg (xargs -0
    handles binary-safe path delimiting). Symlinks skipped.
  Performance: No finding — adds two tree walks (host + container)
    per install. Container walk is one docker exec call returning
    sha256 lines; for typical plugins (~10-50 files) round-trip is
    ~100ms. Versus the saved ~5-10s of restart on a hot-reloadable
    update, this is a clear win.

TESTS (4 new, all green; full handler suite green)
  TestIsSkillMarkdown        — basename match, case-sensitive
  TestHashLocalTree_StableHash — re-hash same dir = same map
  TestHashLocalTree_SymlinkSkipped — hostile link doesn't poison classifier
  TestShQuote                — quoting boundary for shell injection safety

REFS
  molecule-core#112 — this issue
  molecule-core#114 — atomic install (.complete marker added there)
  Reno-Stars iteration safety (Hongming 2026-05-08)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:26:05 -07:00
3e96184d6f Merge pull request 'feat(plugins): atomic install — stage→snapshot→swap→marker (docker path)' (#120) from feat/plugin-atomic-install into main 2026-05-08 15:23:31 +00:00
claude-ceo-assistant
7fbb8cb6e9 feat(plugins): atomic install — stage→snapshot→swap→marker (docker path)
Closes molecule-core#114 for the docker (local-OSS) path.
EIC (SaaS) path tracked as a follow-up — same shape, different
exec primitives (ssh vs docker exec); shipping both in one PR
doubles the test surface.

THE FOUR-STEP DANCE
  1. STAGE     — docker.CopyToContainer extracts tar into
                 /configs/plugins/.staging/<name>.<ts>/
  2. SNAPSHOT  — if /configs/plugins/<name>/ exists, mv to
                 /configs/plugins/.previous/<name>.<ts>/
  3. SWAP      — atomic mv staging → live (single rename(2))
  4. MARKER    — touch /configs/plugins/<name>/.complete

Workspace-side plugin loaders should refuse to load any plugin dir
without .complete (separate small change, not in this PR — the marker
write is the necessary precursor; consumer side is a follow-up so
existing-content plugins don't break before they're re-installed).

ROLLBACK
  - Stage failure: rm -rf staging dir; live untouched
  - Snapshot failure: rm -rf staging dir; live untouched (no rename happened)
  - Swap failure with snapshot present: mv previous back to live
  - Swap failure (no snapshot): rm -rf staging; live (which never
    existed) stays absent
  - Marker failure: content already in place, log loudly with manual
    recovery hint (touch <plugin>/.complete) — don't roll back since
    the new content is what we wanted, just unmarked

GC
  Best-effort delete of previous-version snapshot after successful
  marker write. Failures non-fatal — next install or a separate
  sweeper reclaims. Sweeper for stale .previous/* across reboots is
  follow-up scope.

CONCURRENCY
  Each install gets a unique stamp (UTC second precision), so two
  concurrent reinstalls land in distinct staging dirs and the second
  swap simply overwrites the first's live result. The atomicity is
  per-install, not cross-install — by design (the platform serializes
  POST /workspaces/:id/plugins via Go-side semaphore upstream of
  this code, so cross-install collisions don't reach here).

CHANGES
  + plugins_atomic.go        — installVersion + atomicCopyToContainer
  + plugins_atomic_tar.go    — tarWalk/tarHostDirWithPrefix helpers
  + plugins_atomic_test.go   — 5 unit tests (paths, stamp shape,
                               tar happy path, symlink-skip, prefix
                               normalization). All green.
  ~ plugins_install_pipeline.go::deliverToContainer — swap
    copyPluginToContainer call to atomicCopyToContainer

Old copyPluginToContainer is retained (still called by Download()) so
this PR is purely additive on the install path; no public API change.

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: Required (addressed) — swap-failure rollback writes mv
    of previous back to live before returning the error; if rollback
    itself fails, we wrap both errors and surface the combined fault.
    Marker-write failure is treated as content-landed-but-unmarked
    (LOG, don't roll back the new content).
  Readability: No finding — installVersion path methods make the
    /staging/.previous/live/marker layout obvious from one struct.
    tarWalk extracted from the inline filepath.Walk in
    plugins_install_pipeline.go for testability.
  Architecture: No finding — atomicCopyToContainer composes existing
    execAsRoot / docker.CopyToContainer primitives; no new dependencies.
    Old copyPluginToContainer kept for Download() — single responsibility
    per function.
  Security: No finding — symlinks still skipped during tar walk
    (defense vs hostile plugin escaping its own dir). Marker writes
    use composeable path.Join, no user input touches the path.
  Performance: No finding — adds ~4 docker exec calls per install
    (mkdir, mv-snapshot, mv-swap, touch) on top of the
    one CopyToContainer. Each exec ~50-100ms in practice; install
    end-to-end was already seconds-scale, this rounds to noise.

REFS
  molecule-core#114 — this issue
  Companion: molecule-core#112 (hot-reload classifier — depends on .complete marker)
  Companion: molecule-core#113 (version subscription — uses install machinery)
  EIC follow-up: separate issue to be filed for SaaS path parity

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:22:52 -07:00
c3686a4bb3 Merge branch 'main' into fix/pendinguploads-test-isolation 2026-05-08 15:20:36 +00:00
e37a289eb6 Merge pull request 'feat(org-import): inject per-role persona env from operator-host bootstrap dir' (#110) from feat/persona-env-injection into main 2026-05-08 15:17:17 +00:00
claude-ceo-assistant
9d50a6dae4 feat(local-dev): air-based hot-reload for workspace-server
Closes core#116. Brings local-dev iteration parity with the canvas's
Turbopack HMR — edit a Go file, see the platform restart in <5s
instead of running 'docker compose up --build' (~30s) per change.

USAGE
  make dev   # docker compose with air-driven live reload
  make up    # production-shape stack (no air, normal Dockerfile)

WHAT THIS ADDS
  workspace-server/.air.toml      — air watch config
  workspace-server/Dockerfile.dev — air-on-golang:1.25-alpine, dev-only
  docker-compose.dev.yml          — overlay swapping platform service
                                    to Dockerfile.dev + bind-mounting
                                    workspace-server/ source
  Makefile                        — make {dev,up,down,logs,build,test}

WHAT THIS DOES NOT TOUCH
  workspace-server/Dockerfile (production multi-stage build)
  docker-compose.yml          (prod-shape stack)
  CI workflows                (build prod image directly)
  Tenant deployment / SaaS    (image swap stays the model)

Pure additive. Existing 'docker compose up' path unchanged; production
stays on the static binary. Air install pinned via go install at image
build time so the dev image is reproducible-enough for local use (we
don't pin air to a SHA — the dev image is rebuilt locally and updates
opportunistically).

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — additive change, no existing path modified.
    .air.toml watches .go + .yaml under workspace-server/, excludes
    _test.go and tests dir so test edits don't trigger rebuild.
    Dockerfile.dev mirrors prod's 'go mod download' so first rebuild
    is fast.
  Readability: No finding — three small files plus a Makefile, each
    with header comments explaining the WHY, not just the WHAT. The
    Makefile uses the standard ## help-target pattern.
  Architecture: No finding — overlay pattern (docker-compose.dev.yml
    on top of docker-compose.yml) is the standard compose convention
    for env-specific overrides. Doesn't fork the prod path.
  Security: No finding because no production code path; dev-only image
    isn't built in CI and isn't published to ECR.
  Performance: No finding — air debounce=500ms, exclude_unchanged=true
    so a save that doesn't change content is a no-op rebuild.

REFS
  core#116 — this issue
  Companion: core#117 (workspace-side config-watcher for hot-reload of
  config.yaml) — different scope; this issue is platform-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 08:10:50 -07:00
dev-lead
9e18ab4620 fix(pendinguploads): wait for error metric before test exit
TestStartSweeper_TransientErrorDoesNotCrashLoop leaks an in-flight
metric write across the test boundary: cycleDone fires inside the
fake's Sweep defer (before Sweep returns), waitForCycle returns
immediately after, cancel() lands, but the goroutine still has
metrics.PendingUploadsSweepError() to execute. Whether that write
happens before or after the next test's metricDelta() baseline read
is a coin-flip on slow CI hosts.

Outcome: TestStartSweeper_RecordsMetricsOnSuccess fails with
"error counter delta = 1, want 0" — looks like a real bug, isn't.
Instrumented analysis (per the file's existing waitForMetricDelta
docstring covering the same shape) confirms the metric IS getting
recorded, just AFTER the next test reads its baseline.

The Records* tests already use waitForMetricDelta to close this race
on their own assertions. This change extends the same shape to
TransientErrorDoesNotCrashLoop so it doesn't poison subsequent tests'
baselines.

Verified by running `go test -race -count=20 ./internal/pendinguploads/...`
locally — passes deterministically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 07:37:45 -07:00
claude-ceo-assistant
43b33bcaa5 feat(org-import): inject per-role persona env from operator-host bootstrap dir
Wires the 28 dev-tree persona credentials minted 2026-05-08 into the
workspace-secrets path used by org_import. When a workspace.yaml carries
`role: <name>`, the importer now reads
$MOLECULE_PERSONA_ROOT/<role>/env (default
/etc/molecule-bootstrap/personas/<role>/env, populated by the bootstrap
kit on the tenant host) and merges the role's GITEA_USER /
GITEA_TOKEN / GITEA_TOKEN_SCOPES / GITEA_USER_EMAIL /
GITEA_SSH_KEY_PATH into the same envVars map that already feeds
workspace_secrets via parseEnvFile + crypto.Encrypt + INSERT.

PRECEDENCE
  Persona env is the LOWEST layer:
    0. Persona env (per-role)
    1. Org root .env (shared)
    2. Workspace .env (per-workspace)
  Each later layer overrides the previous, so a workspace .env can
  pin a different GITEA_TOKEN if it ever needs to (testing, override).

WHY THIS LAYERING
  Workspaces should boot with the role's identity by default. .env
  files stay the explicit-override mechanism for the (rare) case where
  a workspace needs to deviate. No new behavior for workspaces with no
  role: persona load is silent no-op when ws.Role is empty or unsafe.

SECURITY
  isSafeRoleName accepts only [A-Za-z0-9_-]+ (no '..', '/', or other
  separators) — admin-only construct, but defense-in-depth keeps the
  persona dir shape invariant. Test
  TestLoadPersonaEnvFile_RejectsTraversal pins the rejection set against
  a planted target file.

OPERATOR-HOST CONTRACT
  The 28 persona env files live at /etc/molecule-bootstrap/personas/<role>/env
  (mode 600, owner root:root) with the per-role token-scope tailoring
  Hongming approved 2026-05-08 (D5). Synced via task #241. Override via
  MOLECULE_PERSONA_ROOT for tests + non-prod hosts.

TESTS (7 new, all green)
  TestLoadPersonaEnvFile_HappyPath        — typical persona-env shape
  TestLoadPersonaEnvFile_MissingDir       — silent no-op when file absent
  TestLoadPersonaEnvFile_EmptyRole        — silent no-op when role empty
  TestLoadPersonaEnvFile_RejectsTraversal — planted file unreachable
                                            via '../../etc/passwd' etc.
  TestLoadPersonaEnvFile_DefaultRoot      — falls back to /etc/...
  TestLoadPersonaEnvFile_OverwritesEmptyMap
  TestIsSafeRoleName_Acceptance           — positive + negative role names

PHASE 4 SELF-REVIEW (FIVE-AXIS)
  Correctness: No finding — additive change, silent no-op on the ws.Role==''
    path covers every existing workspace; tests cover happy path + each
    rejection mode + missing-dir.
  Readability: No finding — helper sits next to parseEnvFile in
    org_helpers.go with a comment block explaining WHY persona is
    lowest precedence.
  Architecture: No finding — fits the existing 'merge .env into envVars
    then INSERT INTO workspace_secrets' pattern that's been in place
    since the .env-driven workspace secrets feature; no new dependencies,
    no new tables.
  Security: Required (addressed) — path traversal blocked by
    isSafeRoleName. No finding beyond that since persona files are
    admin-managed and the helper does not log token values.
  Performance: No finding — one extra os.ReadFile per workspace at
    import time; amortized over workspace lifetime, cost is negligible.

REFS
  internal#85 — RFC for SOP Phase 4 + structured Five-Axis (parent context)
  Saved memories: feedback_per_agent_gitea_identity_default,
                  feedback_unified_credentials_file
  Task #241 — operator-host sync (already DONE; populated 28 dirs)
  Task #242 — this PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 07:09:40 -07:00
claude-ceo-assistant
c72d0a5383 harden(org-external): token via http.extraHeader, .complete cache marker, ref '..' deny, naming cleanup
Self-review of molecule-core PR #105 + #106 (the !external resolver
chain) surfaced 3 real correctness/security gaps and 2 readability
chain) surfaced 3 real correctness/security gaps and 2 readability
nits. Fixes all of them in one PR since they're the same file's hardening.

(1) TOKEN LEAKAGE — fixed
  Before: gitFetcher built clone URLs with auth in userinfo
    (https://oauth2:TOKEN@host/repo.git). Two leak paths:
      a. Token persisted in cloned repo's .git/config
      b. Token could appear in clone error output captured via
         cmd.CombinedOutput()
  After: clone URL has no userinfo (https://host/repo.git). Auth is
    layered on via -c http.extraHeader=Authorization: token ...
    which sends the header per-request without persisting. Plus a
    redactToken() pass over any error string before it surfaces in
    fmt.Errorf, as belt-and-braces.
  Tradeoff: token now visible in 'ps aux' for the duration of the
    git child process (same as before via env var), but no longer
    in any persistent state.

(2) CACHE-VALIDITY FOOTGUN — fixed
  Before: cache-hit was 'cacheDir/.git exists'. A clone interrupted
    after .git was created but before content finished writing would
    leave a partially-written cache that subsequent imports treated
    as a hit, returning stale/incomplete content forever (no self-heal).
  After: cache-hit also requires a .complete marker file written
    only AFTER successful clone+rename. Partially-written cache is
    treated as cache-miss and re-fetched cleanly (after RemoveAll
    on the partial dir to avoid blocking the new clone's mkdir).

(3) REF '..' DENY — fixed
  Before: safeRefPattern '^[a-zA-Z0-9_./-]+$' allowed '..' as a
    substring. Git itself rejects most refs containing '..', but
    defense-in-depth says don't depend on the downstream tool's
    validation when sanitizing input at the boundary.
  After: explicit strings.Contains(ref.Ref, '..') check.

(4) NAMING CLEANUP — fixed
  Before: rewriteFilesDirAndIncludes() — name claims to rewrite
    !include scalars but doesn't (we removed that during PR-A
    development; double-prefix bug). Misleading for readers.
  After: rewriteFilesDir(). Docstring updated to explicitly explain
    why !include paths are NOT rewritten (relative to subDir, naturally
    inside cache).
  Also: removed unused buildAuthedURL() (replaced by
    buildExternalCloneURL + authConfigArgs split), removed unused
    shortHash() helper (replaced by os.MkdirTemp), removed unused
    crypto/sha1 + encoding/hex + fmt imports, removed stray
    '_ = fmt.Sprint' line in integration test.

NEW TESTS
  - TestGitFetcher_RejectsRefWithDoubleDot (defense-in-depth on ref input)
  - TestGitFetcher_CacheValidatedByCompleteMarker (partial cache → re-fetch)

VERIFIED LOCALLY 2026-05-08
  Full ./internal/handlers/ suite: ok (7.8s, 14 external-resolver tests
  + all existing tests). Two new tests cover the two new behaviors.

Refs:
  internal#77 — extraction RFC
  molecule-core#105 (resolver), #106 (tests) — original implementation
  Hongming code-review-and-quality skill invocation 2026-05-08 + 'fix all'
2026-05-08 05:54:54 -07:00
claude-ceo-assistant
89c5567d79 test(org-external): integration test against local bare-git + e2e against live Gitea (PR-B + PR-C)
PR-B (local bare-git integration, task #233):
  workspace-server/internal/handlers/org_external_integration_test.go
  Three tests using git's GIT_CONFIG_COUNT/KEY/VALUE env-var-injected
  insteadOf URL rewrite — process-scoped, no ~/.gitconfig pollution:

  - TestGitFetcher_RealClone_LocalRedirect: full resolver chain end-to-
    end with REAL git clone against a local bare-repo, asserts cache
    population + content materialization + path rewrite + cache-hit on
    second invocation.
  - TestGitFetcher_RealClone_BadRefFails: nonexistent ref surfaces
    git's error cleanly through the ls-remote step.
  - TestGitFetcher_DirectFetch_CacheHit: gitFetcher.Fetch direct
    invocation (no resolver wrapping); verifies cache-hit returns
    same dir + same SHA, no clobber.

  Production code untouched — insteadOf rewrite makes the production
  gitFetcher think it's cloning from Gitea, but git rewrites at clone
  time to file://<barePath>. Tests the real shell-out + parsing.

PR-C (live Gitea e2e, task #234):
  workspace-server/internal/handlers/local_e2e_dev_dept_test.go
  TestLocalE2E_ExternalDevDepartment — minimal parent template that
  uses !external against the LIVE molecule-ai/molecule-dev-department
  repo. No symlink, no /tmp/local-e2e-deploy fixture. Composition
  resolves over network at import time.

  Asserts:
    - 28+ dev-tree workspaces resolve through the fetched cache
      (matches the count from TestLocalE2E_DevDepartmentExtraction)
    - Q1 placement: 'Documentation Specialist' present (under app-lead)
    - Q2 placement: 'Triage Operator' present (under dev-lead)
    - Every workspace's files_dir is cache-prefixed (proves rewrite ran)
    - Every workspace's resolveInsideRoot+Stat succeeds
      (would fail provisioning if not)

  Skipped if Gitea unreachable (TCP probe to git.moleculesai.app:443)
  or git binary absent — won't false-fail offline runners.

VERIFIED LOCALLY 2026-05-08:
  --- PASS: TestGitFetcher_RealClone_LocalRedirect (0.26s)
  --- PASS: TestGitFetcher_RealClone_BadRefFails    (0.15s)
  --- PASS: TestGitFetcher_DirectFetch_CacheHit     (0.23s)
  --- PASS: TestLocalE2E_ExternalDevDepartment      (0.55s)
  workspaces resolved through !external: 28
  Full ./internal/handlers/ test suite: ok (no regressions)

Together with PR-A's unit tests (#105), the !external resolver is now
covered at three layers:
  - unit (fakeFetcher injection): allowlist, validation, path rewrite
  - integration (real git, local bare-repo): clone, cache, ls-remote
  - e2e (real git, live Gitea, live dev-department): full chain

Refs:
  internal#77 — extraction RFC (Phase 3a phasing in comment 1995)
  task #233 (PR-B), task #234 (PR-C)
  Hongming GO 2026-05-08 ('do PR-B/C/D')
2026-05-08 05:30:04 -07:00
claude-ceo-assistant
257d6c1b5a feat(org-import): !external cross-repo subtree resolver (Phase 3a, internal#77 / task #222)
Adds gitops-style cross-repo subtree composition to the platform's
org-template importer. Replaces (eventually) the operator-side
filesystem symlink approach shipped in PR #5.

DESIGN
  See internal#77 comment 1995 for the full design doc + decision points
  agreed with Hongming 2026-05-08.

  Schema: a `!external`-tagged mapping anywhere a workspace entry is
  allowed (workspaces:, roots:, children:):

    - !external
      repo: molecule-ai/molecule-dev-department
      ref: main
      path: dev-lead/workspace.yaml
      url: git.moleculesai.app    # optional; default = MOLECULE_EXTERNAL_GITEA_URL or git.moleculesai.app

  At resolve time the platform fetches the repo at ref into a content-
  addressable cache under <orgBaseDir>/.external-cache/<repo>/<sha>/,
  loads <cacheDir>/<path>, recursively resolves nested !include /
  !external in the loaded subtree, then rewrites every files_dir scalar
  in the fully-resolved subtree to be cache-prefixed. Downstream
  pipeline (resolveInsideRoot, plugin merge, CopyTemplateToContainer)
  sees ordinary in-tree paths.

IMPLEMENTATION
  - org_external.go: ExternalRef type, fetcher interface (gitFetcher
    production + injectable for tests), resolveExternalMapping resolver,
    rewriteFilesDirAndIncludes path-rewrite walker, allowlistedHostPath
    + safeRefPattern + safeRepoCacheDir validation helpers.
  - org_include.go: 4-line hook in expandNode dispatching MappingNode
    with Tag=="!external" to resolveExternalMapping.
  - org_external_test.go: 8 unit tests with fakeFetcher injection
    (no network):
      * happy path (top + nested workspace files_dir cache-prefixed)
      * allowlist rejection (github.com/foo/bar)
      * path-traversal rejection (../../etc/passwd)
      * malformed ref rejection ("main; rm -rf /")
      * missing required fields (repo / ref / path)
      * rewriteFilesDirAndIncludes basic + idempotent
      * allowlistedHostPath env-override + glob

  Path rewrite ONLY rewrites files_dir scalars. !include scalars are
  NOT rewritten — they resolve relative to their containing file's
  directory, which post-fetch is naturally inside the cache, so
  relative !includes Just Work without modification.

ALLOWLIST + AUTH
  - Default allowlist: git.moleculesai.app/molecule-ai/.
  - Override: MOLECULE_EXTERNAL_REPO_ALLOWLIST (comma-separated
    prefixes; trailing /* or / supported).
  - Auth: MOLECULE_GITEA_TOKEN env var injected into clone URL.
    Optional — falls back to unauthenticated for public repos.
  - Reject: malformed refs, path-traversal, non-allowlisted hosts.

CACHE
  - Location: <orgBaseDir>/.external-cache/<safe-repo>/<sha>/.
    Operators add to .gitignore.
  - Content-addressable: same (repo, sha) reuses cache, no overwrite.
  - Atomic clone via tmp-then-rename.
  - Concurrency: race-tolerant — last-writer-wins on same SHA.
    GC out of scope for v1 (filed as parked follow-up).

SECURITY (per SOP Phase 2)
  Untrusted yaml input — all validated:
    repo: allowlist (default molecule-ai/* on Gitea host)
    ref:  ^[a-zA-Z0-9_./-]+$ regex (rejects shell injection)
    path: relative-and-down-only (rejects ../escape)
  Auth: read-only token scoped to allowed orgs.
  Recursion: maxExternalDepth=4 (vs maxIncludeDepth=16) to limit
    network fan-out cost.
  Cache poisoning: per-(repo, sha) content-addressable; can't poison
    across SHAs.
  Trust boundary: cloned content treated identically to a sibling-
    cloned subtree (same model as current symlink approach).

VERSIONING / BACKWARDS COMPAT
  Pure additive. Existing !include and inline workspaces unchanged.
  Existing dev-lead symlink (parent template PR #5) keeps working.
  Migration of parent template to !external is a separate PR-D.
  No DB schema change. No public API change.

VERIFIED LOCALLY
  go test ./internal/handlers/ → ok (5.2s, all 8 new tests + existing)

  Stub fetcher injection lets unit tests cover the resolver +
  path-rewrite logic without network. PR-B (follow-up) adds an
  integration test against a local bare-git repo. PR-C adds the
  real-Gitea e2e test against the live dev-department repo.

Refs:
  internal#77 — extraction RFC (comment 1995 = Phase 1+2 design)
  task #222 — this PR is Phase 3a (PR-A in the design's phasing)
  Hongming GO 2026-05-08 ('go' on 4 decision points + design)
2026-05-08 05:17:55 -07:00
claude-ceo-assistant
3dcc7230f9 fix(provisioner)+test: EvalSymlinks templatePath; stage-2 e2e for files_dir consumption
Two changes that fall out of one root cause discovered while preparing
the local platform spin-up for the dev-department extraction (internal#77):

PROBLEM
  CopyTemplateToContainer's filepath.Walk is called with templatePath
  set to the workspace's resolved files_dir. With the cross-repo
  symlink composition shipped in PR #5 (parent template's
  dev-lead → ../molecule-dev-department/dev-lead/), the Dev Lead
  workspace's files_dir is literally 'dev-lead' — i.e. the symlink
  itself, not a path THROUGH the symlink.

  filepath.Walk does not descend into a symlink leaf — it Lstats the
  root, sees a symlink (mode bit set, not a directory), emits exactly
  one entry, and returns. Result: the workspace's /configs/ tar would
  ship empty. The other 38 workspaces are fine because their files_dir
  paths merely TRAVERSE the symlink (the OS follows intermediate symlinks
  during path resolution; only the final component is Lstat'd); only the
  leaf-is-symlink case breaks.

FIX
  workspace-server/internal/provisioner/provisioner.go:
    Call filepath.EvalSymlinks on templatePath before filepath.Walk.
    Resolves the leaf-symlink case for ALL templates, not just dev-dept.
    Security: templatePath has already passed resolveInsideRoot's
    path-string check at the call site; the trust boundary is the
    operator-side /org-templates/ filesystem layout, not this
    resolution step.

TEST
  workspace-server/internal/handlers/local_e2e_dev_dept_test.go:
    New TestLocalE2E_FilesDirConsumption — stage-2 of the local e2e.
    For every workspace in the resolved OrgTemplate, asserts:
      1. resolveInsideRoot(orgBaseDir, ws.FilesDir) succeeds.
      2. os.Stat on the result returns a directory.
      3. filepath.Walk after EvalSymlinks (mirroring the platform fix)
         emits at least one file.
      4. At least one workspace marker exists (workspace.yaml,
         system-prompt.md, or initial-prompt.md).
    Exercises the SECOND half of POST /org/import that
    TestLocalE2E_DevDepartmentExtraction (PR #103) didn't cover.

VERIFIED LOCALLY (2026-05-08, against post-extraction Gitea state):
  --- PASS: TestLocalE2E_FilesDirConsumption (0.05s)
  checked 39 workspaces with files_dir
  All 39 walk paths emit non-empty file sets with valid workspace markers.

REGRESSION GUARD
  Without the EvalSymlinks fix, this test fails on Dev Lead with:
    files_dir 'dev-lead' at '/.../molecule-dev/dev-lead' is empty —
    CopyTemplateToContainer would produce empty /configs/

Refs:
  internal#77 — extraction RFC
  molecule-core#102 (resolver symlink contract test)
  molecule-core#103 (stage-1 e2e: include resolution)
  Hongming GO 2026-05-08 ('go' on the 3 pre-spin-up optimizations)
2026-05-08 04:46:33 -07:00
claude-ceo-assistant
3adbbacf2e test(local-e2e): verify dev-department extraction end-to-end via real resolveYAMLIncludes
Phase 4 (local-only) of internal#77 (dev-department extraction).

Adds TestLocalE2E_DevDepartmentExtraction that exercises the FULL platform
import path against the real molecule-ai-org-template-molecule-dev (post-slim)
and molecule-ai/molecule-dev-department (post-atomize) repos cloned as siblings
under /tmp/local-e2e-deploy/.

What it proves end-to-end:
  - The dev-lead symlink at parent's template root is followed by
    resolveYAMLIncludes (filepath.Abs/Rel-style security check passes,
    os.ReadFile follows the link).
  - Recursive !include chain through the symlinked subtree resolves:
    parent's org.yaml → !include dev-lead/workspace.yaml (symlinked)
    → !include ./core-lead/workspace.yaml → !include ./core-be/workspace.yaml
    (atomized children: paths, no '..').
  - 39 workspaces enumerate after resolution: 5 PM-tree + 6 Marketing-tree
    + 28 dev-tree (Dev Lead + 5 sub-team leads + 18 leaf workspaces +
    3 floaters + 1 triage-operator).
  - Q1+Q2 placements verified by sentinel name check: 'Documentation
    Specialist' is reachable (under app-lead via app-docs sub-team),
    'Triage Operator' is reachable (direct child of Dev Lead).

Test skips with t.Skipf if the local-e2e fixture isn't present on the
host — won't block CI on hosts that haven't set it up. To set up locally:

  TESTROOT=/tmp/local-e2e-deploy
  mkdir -p $TESTROOT && cd $TESTROOT
  git clone https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev.git molecule-dev
  git clone https://git.moleculesai.app/molecule-ai/molecule-dev-department.git
  cd /Users/<you>/molecule-core/workspace-server
  go test -v -run TestLocalE2E_DevDepartmentExtraction ./internal/handlers/

Verified locally 2026-05-08:
  --- PASS: TestLocalE2E_DevDepartmentExtraction (0.01s)
  total workspaces (recursive): 39

Refs:
  internal#77 — extraction RFC
  molecule-core PR #102 — symlink-resolution contract test
  molecule-ai/molecule-dev-department PRs #1, #2, #3 (scaffold + extract + atomize)
  molecule-ai/molecule-ai-org-template-molecule-dev PR #5 (parent slim + symlink wire)
  Hongming GO 2026-05-08 ('lets not go for staging right now, we do local test first')
  SOP Phase 4 (local) — task #226
2026-05-08 04:24:47 -07:00
claude-ceo-assistant
78c4b9b74f test(org-include): pin symlink-based subtree composition contract
Two new tests in workspace-server/internal/handlers/org_include_test.go:

- TestResolveYAMLIncludes_FollowsDirectorySymlink: parent template's
  org.yaml `!include`s into a sibling-repo subtree via a relative
  directory symlink. The resolver's filepath.Abs/Rel security check
  operates on path strings (passes), and os.ReadFile follows the
  symlink at OS layer (file content delivered). Recursive nested
  `!include`s within the symlinked subtree resolve correctly because
  filepath.Dir(absTarget) keeps the literal symlink path as currentDir.

- TestResolveYAMLIncludes_RejectsSymlinkEscapingRoot: companion test
  pinning current behavior where a symlink target outside the parent
  root is followed (resolveInsideRoot doesn't EvalSymlinks). Asserted
  as 'should resolve' so future hardening (if filepath.EvalSymlinks
  is added) flips the test red and forces a coordinated update to the
  dev-department subtree-composition pattern.

Why now: internal#77 RFC (dev-department extraction) selects symlink-
based composition over a future platform-level external: ref. These
tests pin the contract before the operator-side symlink convention
gets shipped, so a refactor or hardening of the resolver can't
silently break the production org-import path.

No production code changes. Pure additive test coverage.

Refs: internal#77 (Phase 3b verification — task #223)
2026-05-07 20:42:38 -07:00
7f61206a18 Merge branch 'staging' into fix/saas-plugin-install-eic 2026-05-08 00:21:10 +00:00
6d7554d282 chore: sync main → staging (auto, d84d88ad) 2026-05-07 23:38:08 +00:00
6bb272360d Merge branch 'main' into feat/issue-63-local-build-from-gitea-v2 2026-05-07 23:33:03 +00:00
b664691051 fix(eic-tunnel-pool): capture poolJanitorInterval at pool construction
Closes the chronic -race flake on TestPooledWithEICTunnel_PanicPoisonsEntry
and the handlers package as a whole (CI / Platform (Go) was intermittent
on staging, ~50% red on workspace-server-touching commits since 2026-04).

The race: tests swap the package-level poolJanitorInterval via t.Cleanup
(eic_tunnel_pool_test.go:61) AFTER an earlier test caused the global pool's
janitor goroutine to start. The janitor constructs time.NewTicker(poolJanitorInterval)
on every loop iteration, so the cleanup write races the goroutine's read for the rest
of the process. Caught locally + on PR #84's CI run on Gitea.

Fix: capture the interval as a field on eicTunnelPool at newEICTunnelPool().
The janitor now reads p.janitorInterval, which never changes after construction.
Tests that override poolJanitorInterval before freshPool() still get the new
value (they set the package var before construction). The global pool's
janitor — created lazily once via sync.Once on first getEICTunnelPool() —
is now immune to t.Cleanup-driven swaps from later tests.

Surfaced while verifying #84 (SaaS plugin install via EIC SSH); folded
into this PR per the "fix root not symptom" rule rather than merging
around a chronic-red CI signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:01:11 -07:00
7e2cca7fad chore: sync main → staging (auto, e7660618) 2026-05-07 23:00:21 +00:00
16868c4ec1 fix(plugins): SaaS (EC2-per-workspace) install/uninstall via EIC SSH
Closes the 🔴 docker-only row in docs/architecture/backends.md. Plugin
install on every SaaS tenant currently 503s with "workspace container
not running" because the handler is hardcoded to Docker exec but SaaS
workspaces live on per-workspace EC2s. Caught on hongming.moleculesai.app
when canvas POST /workspaces/<id>/plugins surfaced the error.

Mirrors the Files API PR #1702 pattern: dispatch on workspaces.instance_id
in deliverToContainer (and Uninstall). When set, push the staged plugin
tarball to the EC2 over the existing withEICTunnel primitive
(template_files_eic.go) and unpack into the runtime's bind-mounted config
dir (/configs for claude-code, /home/ubuntu/.hermes for hermes — see
workspaceFilePathPrefix). chown 1000:1000 to match the docker path's
agent-uid contract; restart via the existing dispatcher.

Direct host write rather than docker-cp via SSH because the runtime's
config dir is already bind-mounted into the workspace container — the
runtime sees the files on next start with no additional plumbing.

Adds InstanceIDLookup (parallel to RuntimeLookup) so unit tests don't
need a DB; production wires it in router.go like templates.go does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:42:51 -07:00
d9e380c5bc feat(workspace-server): local-dev provisioner builds from Gitea source when MOLECULE_IMAGE_REGISTRY is unset (#63, Task #194)
OSS contributors who clone molecule-core and `go run ./workspace-server/cmd/server`
now get a working end-to-end provision without authenticating to GHCR or AWS ECR.

Pre-fix: with MOLECULE_IMAGE_REGISTRY unset, the provisioner attempted to pull
ghcr.io/molecule-ai/workspace-template-<runtime>:latest, which has been
returning 403 since the 2026-05-06 GitHub-org suspension.

Post-fix: when MOLECULE_IMAGE_REGISTRY is unset, the provisioner switches to
local-build mode — looks up the workspace-template-<runtime> repo's HEAD sha
on Gitea via a single API call, shallow-clones into ~/.cache/molecule/, and
runs `docker build --platform=linux/amd64`. SHA-pinned cache key skips the
clone+build entirely on subsequent provisions.

Production tenants are unaffected: every prod tenant sets the var to its
private ECR mirror, so the SaaS pull path is byte-for-byte identical.

SSOT for mode detection lives in Resolve() (registry_mode.go) returning a
discriminated RegistrySource{Mode, Prefix} so call sites that branch on
mode get a compile-time check instead of a string-equality footgun.

Coverage:
* registry_mode.go            — new SSOT (Resolve, RegistryMode, IsKnownRuntime)
* registry_mode_test.go       — 8 tests pinning mode-decision contract
* localbuild.go               — clone+build pipeline (570 LOC, fully unit-tested)
* localbuild_test.go          — 22 tests covering happy/sad paths, fail-closed
* provisioner.go              — Start() inserts ensureLocalImageHook in local mode
* docs/adr/ADR-002            — design rationale + alternatives + security review
* docs/development/local-development.md — local-build flow + env overrides

Security:
* Allowlist-only runtime names (knownRuntimes) gate the clone path.
* Repo prefix hardcoded to git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-;
  forks via opt-in MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX.
* MOLECULE_GITEA_TOKEN masked in every log line via maskTokenInURL/maskTokenInString.
* Fail-closed: Gitea unreachable / runtime not mirrored → clear error, never
  silently fall back to GHCR/ECR.
* docker build invocation passes no --build-arg from external input.
* HTTP body cap 64KB on Gitea API responses (defense vs malicious upstream).

Closes #63 / Task #194.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:16:51 -07:00
security-auditor
5b7b669b4c docs(ratelimit): tighten dev-mode comment after keyFor refactor
The previous comment said "all share one IP bucket" — accurate before
the keyFor refactor, slightly stale after it. The dev-mode rationale
(bucket fills fast, blanks the page on a single-user dev box) is
unchanged; only the bucket-key flavour text needed updating.

Doc-only follow-up from #60's hostile self-review #3. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:57:21 -07:00
security-auditor
9dda84d671 fix(ratelimit): tenant-aware bucket keying — close canvas 429 storm
Closes #59.

Symptom: /workspaces/:id/activity returns 429 with rate-limit-exceeded
on hongming.moleculesai.app whenever multiple workspaces are visible
in the canvas. Single-tab, single-user, well within the documented
600 req/min budget — but every request collapsed into one bucket.

Root cause: workspace-server's RateLimiter keyed buckets on
c.ClientIP(). After issue #179 turned off proxy-header trust
(SetTrustedProxies(nil), correctly closing the XFF spoofing hole),
c.ClientIP() returns the TCP RemoteAddr — which in production is the
upstream proxy (Caddy on per-tenant EC2; CP/Vercel on the SaaS plane).
Every browser tab + every canvas consumer + every poll loop for every
tenant collapsed into one bucket.

Fix: bucket key derivation moves into a single keyFor helper that
mirrors the SSOT pattern of:
  - molecule-controlplane/internal/middleware/ratelimit.go (org > user > IP)
  - this package's own MCPRateLimiter (token-hash via tokenKey)

Priority: X-Molecule-Org-Id header → SHA-256(Authorization Bearer)
→ ClientIP. Token values are kept hashed in the bucket map so the
in-memory state can't become a token dump.

Tests:
  - TestKeyFor_OrgIdHeaderTrumpsBearerAndIP — priority order
  - TestKeyFor_BearerTokenWhenNoOrgId — middle tier + raw-token leak pin
  - TestKeyFor_IPFallbackWhenNoOrgIdNoBearer — anon probe path
  - TestRateLimit_TwoOrgsSameIP_IndependentBuckets — load-bearing
    regression (issue #59) — two tenants behind same upstream proxy
    must not share a bucket
  - TestRateLimit_TwoTokensSameIP_IndependentBuckets — same shape
    for the per-tenant Caddy box
  - TestRateLimit_SameOrgDifferentTokens_SharedBucket — counter-pin:
    rotating tokens within one org must NOT bypass the org's quota
  - TestRateLimit_Middleware_RoutesThroughKeyFor — AST gate, mirrors
    the SSOT gates established in #36/#10/#12

Mutation-tested:
  - strip org-id branch in keyFor → 3 tests fail
  - strip bearer-token branch → 2 tests fail
  - reintroduce direct c.ClientIP() in Middleware → 3 tests fail
    (including the AST gate)

Existing tests pass unchanged: dev-mode fail-open, X-RateLimit-*
headers (#105), Retry-After on 429 (#105), XFF anti-spoofing (#179).

No schema/API change. 429 response body and X-RateLimit-* headers
unchanged. RATE_LIMIT env var semantics unchanged.

Hostile self-review (three weakest spots) is in the issue body:
  1. one-shot Docker-inspect cost is now bucket-key derivation cost
     (string compare + SHA-256 of bearer); single-digit microseconds.
  2. X-Molecule-Org-Id is unvalidated at the rate-limiter layer —
     spoofing is closed by tenant SG + CP front; documented in
     keyFor's docstring with the conditions under which to revisit.
  3. cpProv-style SaaS surface is out of scope; CP's own limiter
     handles that hop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:51:08 -07:00
25fb696965 chore: reconcile main → staging post-suspension divergence
Refs Task #165 (Class D AUTO_SYNC_TOKEN plumbing).

main and staging diverged after the 2026-05-06 GitHub-org suspension
because Class D / Class G / feature work landed on staging while
unrelated CI fixes (#34-47, ECR auth-inline, buildx→docker, pre-clone
manifest deps) landed straight on main. Both branches edited the
same workflow files, so every push to main triggered an Auto-sync
run that aborted at `git merge --no-ff origin/main` with 7 content
conflicts:

  - .github/workflows/canary-verify.yml      (URL: github.com → Gitea)
  - .github/workflows/ci.yml                 (3 URL refs)
  - .github/workflows/publish-runtime.yml    (cascade: HTTP repo-dispatch
                                              → Gitea push)
  - .github/workflows/publish-workspace-server-image.yml
                                             (drop AWS-action steps;
                                              ECR auth is inline)
  - .github/workflows/retarget-main-to-staging.yml (URL)
  - manifest.json                            (lowercase org slug + add
                                              mock-bigorg from main)
  - scripts/clone-manifest.sh                (keep main's MOLECULE_GITEA_TOKEN
                                              auth path + drop awk-tolower
                                              since manifest is now lowercase)

Resolution: union — staging's post-suspension Gitea/ECR migrations win
on URL/policy edits; main's additive work (mock-bigorg manifest entry,
inline ECR auth, MOLECULE_GITEA_TOKEN basic-auth) is preserved on top.

After this lands, staging is a strict superset of main, so the next
auto-sync run on a push to main will be a clean fast-forward / no-op.
The auto-sync workflow on main also picks up staging's AUTO_SYNC_TOKEN
swap (Class D #26) for free, fixing the latent layer-2 push-auth issue.

Verified locally:
  - bash -n scripts/clone-manifest.sh
  - python -c 'yaml.safe_load(...)' on each touched workflow
  - python -c 'json.load(open(manifest.json))' (21 plugins, 9 templates,
    7 org_templates)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:24:37 -07:00
devops-engineer
e16d7eaa08 fix(ci): apply pre-clone fix to platform Dockerfile too (followup #173)
The first PR (#38) only patched Dockerfile.tenant — but the workflow
also builds the platform image from workspace-server/Dockerfile, which
had the SAME in-image `git clone` stage. Build run #794 caught this:
"process clone-manifest.sh ... exit code 128" on the platform image.

Apply the same pre-clone shape to the platform Dockerfile: drop the
`templates` stage, COPY from .tenant-bundle-deps/ instead. The
workflow's existing "Pre-clone manifest deps" step (added in #38)
already populates .tenant-bundle-deps/ before either build runs, so no
workflow change needed.

Self-review note: the missed-platform-Dockerfile is a Phase 1 quality
miss — I read both files but only registered the tenant one as
in-scope. Saved memory `feedback_orchestrator_must_verify_before_declaring_fixed`
applies: should have grepped the whole workspace-server/ for "templates"
stages before claiming Task #173 done. CI run #794 caught it within
~6 minutes; net cost: one followup commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:13:13 -07:00
Hongming Wang
17f1f30b3f fix(test): drain coalesceRestart goroutines before t.Cleanup (Class H, #170)
TestPooledWithEICTunnel_PreservesFnErr (and any sqlmock-using neighbour
test) was at risk of inheriting stale INSERT calls from a previous
test's coalesceRestart goroutine that survived its t.Cleanup boundary.

The production callsite shape is `go h.RestartByID(...)` from
a2a_proxy.go, a2a_proxy_helpers.go and main.go. When that goroutine's
runRestartCycle panics, coalesceRestart's deferred recover swallows it
to keep the platform process alive — but in tests, nothing waits for
the goroutine to fully exit. If it's still draining LogActivity-shaped
work after the test returns, those INSERTs land in the next test's
sqlmock connection as kind=DELEGATION_FAILED /
kind=WORKSPACE_PROVISION_FAILED, surfacing as "INSERT-not-expected".

Fix: introduce drainCoalesceGoroutine(t, wsID, cycle) test helper that
spawns coalesceRestart on a goroutine (matching production) and
registers a t.Cleanup with sync.WaitGroup.Wait so the test can't
declare itself done while a goroutine is still alive.

Convert TestCoalesceRestart_PanicInCycleClearsState to use the helper
(previously it called coalesceRestart synchronously, which never
exercised the production goroutine-survival contract).

Add TestCoalesceRestart_DrainHelperWaitsForGoroutineExit as the
regression guard: cycle blocks 150ms then panics; the test asserts
t.Run elapsed >= 150ms (proving the Wait barrier engaged) AND the
deferred close ran (proving the panic-recovery defer chain executed)
AND state.running was cleared. Verified the assertion is real by
mutation-testing: removing t.Cleanup(wg.Wait) makes this test FAIL
deterministically with elapsed <300µs.

Per saved memory feedback_assert_exact_not_substring: the regression
test asserts an exact-shape contract (elapsed >= blockFor) rather than
a substring-in-output, so it discriminates between "drain works" and
"drain skipped".

Per Phase 3: 10/10 race-detector runs pass for all TestCoalesceRestart_*
tests. Full ./internal/handlers/... suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:13:13 -07:00
55689e0b10 fix(post-suspension): migrate github.com/Molecule-AI refs to git.moleculesai.app (Class G #168)
The GitHub org Molecule-AI was suspended on 2026-05-06; canonical SCM
is now Gitea at https://git.moleculesai.app/molecule-ai/. Stale
github.com/Molecule-AI/... URLs return 404 and break tooling that
clones / pip-installs / curls them.

This bundles all non-Go-module URL fixes for this repo into a single PR.
Go module path references (in *.go, go.mod, go.sum) are out of scope
here -- tracked separately under Task #140.

Token-auth clone URLs also flip ${GITHUB_TOKEN} -> ${GITEA_TOKEN} since
the GitHub token does not auth against Gitea.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 13:08:15 -07:00
devops-engineer
a6d67b4c68 fix(ci): pre-clone manifest deps in workflow, drop in-image clone (closes #173)
publish-workspace-server-image.yml could not run on Gitea Actions because
Dockerfile.tenant's stage 3 ran `git clone` against private Gitea repos
from inside the Docker build context, where no auth path exists. Every
workspace-server rebuild required a manual operator-host push.

Move cloning to the trusted CI context (where AUTO_SYNC_TOKEN — the
devops-engineer persona PAT — is naturally available). Dockerfile.tenant
now COPYs from .tenant-bundle-deps/, populated by the workflow's new
"Pre-clone manifest deps" step. The Gitea token never enters the image.

- scripts/clone-manifest.sh: optional MOLECULE_GITEA_TOKEN env embeds
  basic-auth in the clone URL; redacted in log output. Anonymous fallback
  preserved for future public-repo path.
- .github/workflows/publish-workspace-server-image.yml: new pre-clone
  step before docker build; injects AUTO_SYNC_TOKEN. Fail-fast if the
  secret is empty.
- workspace-server/Dockerfile.tenant: drop stage 3 (templates), COPY
  from .tenant-bundle-deps/ instead. Header documents the prereq.
- .gitignore: ignore /.tenant-bundle-deps/ so a local build can't
  accidentally commit cloned repos.

Verified locally: clone-manifest.sh with the devops-engineer persona
token cloned all 37 repos (9 ws + 7 org + 21 plugins, 4.9MB after
.git strip).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 12:59:46 -07:00
be5fbb5ad3 fix(workspace-server): a2a-proxy preflight container check (closes #36)
Same SSOT-divergence shape as #10 / fixed in #12, but on the a2a-proxy
code path. The plugin handler was routed through `provisioner.RunningContainerName`;
a2a-proxy was forwarding optimistically and only catching missing containers
REACTIVELY via `maybeMarkContainerDead` after the network call timed out.

Result on tenants whose agent containers had been recycled (e.g. post-EC2
replace from molecule-controlplane#20): canvas waits 2-30s for the network
forward to fail before getting a 503, and the workspace-server logs only
"ProxyA2A forward error" without the "container is dead" signal.

This PR adds a proactive `Provisioner.IsRunning` check in `proxyA2ARequest`
between `resolveAgentURL` and `dispatchA2A`, gated on the conditions where
we know we're talking to a sibling Docker container we own (`h.provisioner
!= nil` AND `platformInDocker` AND the URL was rewritten to Docker-DNS form).

Three outcomes via the SSOT helper:
  (true,  nil) → forward as today
  (false, nil) → fast-503 with `error="workspace container not running —
                 restart triggered"`, `restarting=true`, `preflight=true`,
                 plus the same offline-flip + WORKSPACE_OFFLINE broadcast +
                 async restart that `maybeMarkContainerDead` produces
  (true,  err) → fall through to optimistic forward (matches IsRunning's
                 "fail-soft as alive" contract — flaky daemon must not
                 trigger a restart cascade)

The `preflight=true` flag in the response distinguishes the proactive
short-circuit from the reactive `maybeMarkContainerDead` path so canvas
or downstream callers can render distinct messages later.

* `internal/handlers/a2a_proxy.go` — preflight call site between
  resolveAgentURL and dispatchA2A; gated on `h.provisioner != nil &&
  platformInDocker && url == http://<ContainerName(id)>:port`.
* `internal/handlers/a2a_proxy_helpers.go` — `preflightContainerHealth`
  helper. Routes through `h.provisioner.IsRunning` (which itself wraps
  `RunningContainerName`). Identical offline-flip side-effects as
  `maybeMarkContainerDead` for the dead-container case.
* `internal/handlers/a2a_proxy_preflight_test.go` — 4 tests: running →
  nil; not-running → structured 503 + sqlmock expectations on the
  offline-flip + structure_events insert; transient error → nil
  (fail-soft); AST gate pinning the SSOT routing (mirror of #12's gate).

Mutation-tested: removing the `if running { return nil }` guard makes
the production code fail to compile (unused var). A subtler mutation
(replacing the !running branch with `return nil`) would make
TestPreflight_ContainerNotRunning_StructuredFastFail fail at runtime
with sqlmock's "expected DB call did not occur."

Refs: molecule-core#36. Companion to #12 (issue #10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 11:15:08 -07:00
Hongming Wang
d64641904f feat(workspace-server): mock runtime + mock-bigorg org template
Adds a 'mock' runtime: virtual workspaces with no container, no EC2,
no LLM. Every A2A reply is synthesised from a small canned-variant
pool ('On it!', 'Got it, on it now.', etc.) deterministically seeded
by (workspace_id, request_id).

Built for funding-demo "200-workspace mock org" — renders an
enterprise-scale org chart on the canvas (CEO/VPs/Managers/ICs)
without burning real LLM credits or provisioning 200 EC2 instances.
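The deterministic seeding can be sketched like this — pool contents and the hashing scheme are assumptions, not the shipped implementation; the invariant is only that the same (workspace_id, request_id) pair always yields the same canned reply:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Canned-variant pool; the real pool is larger.
var pool = []string{"On it!", "Got it, on it now.", "Working on that."}

// pickVariant seeds the choice by (workspace_id, request_id) so replays
// and retries of the same request see an identical reply.
func pickVariant(workspaceID, requestID string) string {
	h := fnv.New32a()
	h.Write([]byte(workspaceID + "|" + requestID))
	return pool[h.Sum32()%uint32(len(pool))]
}

func main() {
	a := pickVariant("ws-1", "req-1")
	b := pickVariant("ws-1", "req-1")
	fmt.Println(a == b) // deterministic: same inputs, same variant
}
```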

Surfaces:
  - workspace-server/internal/handlers/mock_runtime.go: A2A proxy
    short-circuit, canned-reply pool, deterministic variant pick.
  - workspace-server/internal/handlers/a2a_proxy.go: gate the
    short-circuit before resolveAgentURL (mock has no URL).
  - workspace-server/internal/handlers/org_import.go: skip Docker
    provisioning for mock workspaces, set status='online' directly,
    drop the per-sibling 2s pacing for mock children (collapses
    a 200-workspace import from ~7min → ~1s).
  - workspace-server/internal/handlers/runtime_registry.go: register
    'mock' in the runtime allowlist (manifest + fallback set).
  - workspace-server/internal/registry/healthsweep.go +
    orphan_sweeper.go: skip mock workspaces in container-health and
    stale-token sweeps (no container by design).
  - workspace-server/internal/handlers/workspace_restart.go: mirror
    the 'external' Restart no-op for mock.
  - manifest.json: register the new
    Molecule-AI/molecule-ai-org-template-mock-bigorg repo.

Tests: 5 new in mock_runtime_test.go covering happy-path, non-mock
regression guard, determinism, IsMockRuntime trim/case, JSON-RPC
id echo. All existing handler + registry tests still pass.

Local-verified: imported the 200-workspace template against a fresh
postgres+redis, confirmed all 200 land in 'online' and stay there
through the 30s health-sweep window, exercised A2A on CEO + VPs +
Managers + ICs and saw the variant pool rotate.

Org template lives at
Molecule-AI/molecule-ai-org-template-mock-bigorg (created today)
and is imported via the existing /org/import flow on the canvas
Template Palette.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 08:40:37 -07:00
devops-engineer
10e510f50c chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)
Two coupled cleanups for the post-2026-05-06 stack:

#157 — drop molecule-ai-plugin-github-app-auth
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
  `if os.Getenv("GITHUB_APP_ID") != ""` block that called
  BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
  sibling repo + the `replace github.com/Molecule-AI/molecule-ai-
  plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
  (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
    - .github/workflows/codeql.yml (Go matrix path)
    - .github/workflows/harness-replays.yml
    - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

#161 — swap GHCR→ECR for publish-workspace-server-image
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-
  credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the
  standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
  bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
2026-05-07 07:48:51 -07:00
devops-engineer
1d8c101c94 chore: drop github-app-auth + swap GHCR→ECR (closes #157, #161)
Two coupled cleanups for the post-2026-05-06 stack:

#157 — drop molecule-ai-plugin-github-app-auth
============================================
The plugin injected GITHUB_TOKEN/GH_TOKEN via the App's
installation-access flow (~hourly rotation). Per-agent Gitea
identities replaced this approach after the 2026-05-06 suspension —
workspaces now provision with a per-persona Gitea PAT from .env
instead of an App-rotated token. The plugin code itself lived on
github.com/Molecule-AI/molecule-ai-plugin-github-app-auth which is
also unreachable post-suspension; checking it out at CI build time
was already failing.

Removed:
- workspace-server/cmd/server/main.go: githubappauth import + the
  `if os.Getenv("GITHUB_APP_ID") != ""` block that called
  BuildRegistry. gh-identity remains as the active mutator.
- workspace-server/Dockerfile + Dockerfile.tenant: COPY of the
  sibling repo + the `replace github.com/Molecule-AI/molecule-ai-
  plugin-github-app-auth => /plugin` directive injection.
- workspace-server/go.mod + go.sum: github-app-auth dep entry
  (cleaned up by `go mod tidy`).
- 3 workflows: actions/checkout steps for the sibling plugin repo:
    - .github/workflows/codeql.yml (Go matrix path)
    - .github/workflows/harness-replays.yml
    - .github/workflows/publish-workspace-server-image.yml

Verified `go build ./cmd/server` + `go vet ./...` pass post-removal.

#161 — swap GHCR→ECR for publish-workspace-server-image
=======================================================
Same workflow used to push to ghcr.io/molecule-ai/platform +
platform-tenant. ghcr.io/molecule-ai is gone post-suspension. The
operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
molecule-ai/) already hosts platform-tenant + workspace-template-*
+ runner-base images and is the post-suspension SSOT for container
images. This PR aligns publish-workspace-server-image with that
stack.

- env.IMAGE_NAME + env.TENANT_IMAGE_NAME repointed to ECR URL.
- docker/login-action swapped for aws-actions/configure-aws-
  credentials@v4 + aws-actions/amazon-ecr-login@v2 chain (the
  standard ECR auth pattern; uses AWS_ACCESS_KEY_ID/SECRET secrets
  bound to the molecule-cp IAM user).

The :staging-<sha> + :staging-latest tag policy is unchanged —
staging-CP's TENANT_IMAGE pin still points at :staging-latest, just
with the new registry prefix.

Refs molecule-core#157, #161; parallel to org-wide CI-green sweep.
2026-05-07 05:12:06 -07:00
6a7dcd287c Merge pull request 'feat(canvas/chat-server): canvas consumes /chat-history + server-side row-aware reverse (RFC #2945 PR-C-2)' (#4) from feat/rfc-2945-pr-c-2-canvas-chat-history into staging 2026-05-07 11:38:54 +00:00
b49bdde997 Merge pull request 'fix(workspace-server): CP orphan sweeper closes deprovision split-write race (#2989)' (#2) from fix/cp-orphan-sweeper-2989 into staging 2026-05-07 11:38:48 +00:00
f51722411b Merge branch 'main' into fix/issue10-runtime-aware-plugin-install 2026-05-07 11:26:14 +00:00
claude-ceo-assistant
624ef4d06d perf(workspace-server,canvas): EIC tunnel pool + canvas Promise.all (closes core#11)
## Symptom
Canvas detail-panel "config + filesystem load" took ~20s. Reported on
production hongming tenant, workspace c7c28c0b-... (Claude Code Agent T2).

## Two stacked latency sources

### 1. Server-side: per-call EIC tunnel setup (~80% of the win)

`workspace-server/internal/handlers/template_files_eic.go::realWithEICTunnel`
performed ssh-keygen + SendSSHPublicKey + open-tunnel + waitForPort PER call.
4 callers (read/write/list/delete) each paid the full ~3-5s setup cost even
when fired back-to-back on the same workspace EC2.

Fix: refcounted pool keyed on instanceID with TTL ≤ 50s (under the 60s
SendSSHPublicKey grant). One tunnel serves N file ops; concurrent acquires
for the same instance share the slot via a pendingSetups gate; LRU eviction
caps simultaneous tracked instances at 32. Poisons entries on tunnel-fatal
errors (connection refused, broken pipe, auth failed) so the next acquire
builds fresh. Cleanup on panic via defer-release pattern (added after
self-review caught a refcount-leak hazard).

Public API unchanged — `var withEICTunnel` rebinds to `pooledWithEICTunnel`
at package init, so all 4 callers inherit pooling for free.

10 unit tests pin: 4-ops-amortise (1 setup), different-instances-do-not-share,
TTL eviction, poison invalidates, concurrent-acquire-single-setup,
TTL=0 escape hatch, LRU eviction at cap, error classification heuristic,
refcount blocks expired eviction, panic poisons entry. All green.

### 2. Canvas-side: serial fan-out + duplicate fetch (~20% of the win)

`canvas/src/components/tabs/ConfigTab.tsx::loadConfig` awaited 3 independent
metadata GETs (`/workspaces/{id}`, `/model`, `/provider`) serially.
`AgentCardSection` fired a SECOND `/workspaces/{id}` from its own useEffect.

Fix: Promise.all over the 3 metadata GETs (each leg keeps its existing
.catch fallback semantics). AgentCardSection now reads `agentCard` from
the canvas store (`useCanvasStore`) instead of refetching — the canvas
already hydrates `node.data.agentCard` from the platform event stream.
Defensive selector handles test mocks without a `nodes` array.

## Verification

- `go test ./internal/handlers/` 5.07s green (full handlers package, including
  10 new pool tests)
- `go vet ./internal/handlers/` clean
- `npx vitest run` — 1380/1380 canvas unit tests pass (2 test FILES fail on
  a pre-existing xyflow CSS-load issue in vitest config, unrelated to this
  change)
- `npx tsc --noEmit` clean

Live wall-time verification deferred to Phase 4 / E2E (canvas browser session
required; external probe blocked by 403 since the canvas auth chain is
session-cookie + Origin header, not a bearer token I can fabricate).

## Backwards compatibility

API surface unchanged. All 4 EIC handler callers use the rebound var; no
caller migration. Pool defaults to enabled (TTL=50s); tests can disable by
setting poolTTL=0 or by overwriting withEICTunnel directly (existing stub
pattern in template_files_eic_dispatch_test.go preserved).

## Hostile self-review (3 weakest spots)

1. `fnErrIndicatesTunnelFault` is a substring grep on err.Error() — the
   marker list is hand-curated and ssh client error formats vary across
   OpenSSH versions. A future ssh that reports a tunnel failure via a
   phrasing not in the list would NOT poison the entry → next callers reuse
   a dead tunnel until TTL evicts. Acceptable: TTL bounds the impact (≤50s
   of bad reuse), and the heuristic covers every tunnel-error shape that
   appears in the existing test fixtures and known incidents.

2. `acquire`'s for-loop has unbounded retry potential under pathological
   churn (signal closed → new acquirer → setup fails → repeat). No bounded
   retry counter. Today there is no test exercise for "flaky setup that
   succeeds-then-fails-then-succeeds"; if observability ever shows this
   shape, add a max-retry guard. Filed as a known limitation, not blocking.

3. The substring assertion `strings.Contains` style I used for tunnel-fault
   classification could false-positive on app-level error messages that
   happen to contain "permission denied" or "broken pipe" verbatim. The
   classification test covers the discriminator but only against the
   error shapes we know today. Acceptable: poisoning errs on the side of
   building fresh, which is correct-but-slightly-slow rather than incorrect.

## Phase 4 / E2E plan

- Live timing of the canvas detail-panel open against a real workspace
  (browser session, not external probe).
- Target: perceived latency under 2s on warm pool. Cold open still pays
  one tunnel setup (~3-5s) — the pool buys you the SECOND through Nth
  panel-open within the TTL window.
- Memory `feedback_chase_verification_to_staging` applies — will not
  declare done at PR-merge; will follow through to user-visible behavior
  on staging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 23:17:58 -07:00
security-auditor
c1de2287fd fix(workspace-server): SSOT-route container check + 422 on external runtimes
Two coupled fixes for molecule-core#10 (plugin install 503 vs
status=online split-state):

1. SSOT for "is this workspace's container running" — `findRunningContainer`
   in plugins.go used to carry its own copy of `cli.ContainerInspect`, which
   collapsed transient daemon errors into the same `""` return as a
   genuinely-stopped container. Healthsweep's `Provisioner.IsRunning`
   handled the same input correctly (defensive). Promote the inspect logic
   to `provisioner.RunningContainerName`, route both consumers through it.
   Transient errors get a distinct log line on the plugins side so triage
   doesn't confuse a flaky daemon with a stopped container.

2. Runtime-aware Install/Uninstall — `runtime='external'` workspaces have
   no local container; push-install via docker exec is meaningless. They
   pull plugins via the download endpoint instead (Phase 30.3). Without a
   guard they fell through to `findRunningContainer` and 503'd with a
   misleading "container not running." Add an early 422 with a hint
   pointing at the download endpoint.

The two fixes are independent: (1) preserves correctness when the SSOT
helper is later modified; (2) eliminates the persistent split-state on
the 5 external persona-agent workspaces in this DB (and on tenant
deployments hitting the same shape).

* `internal/provisioner/provisioner.go` — new `RunningContainerName(ctx,
  cli, id) (string, error)` with three documented outcomes (running /
  stopped / transient). `Provisioner.IsRunning` now wraps it; behavior
  preserved.
* `internal/handlers/plugins.go` — `findRunningContainer` shimmed onto
  `RunningContainerName`; new `isExternalRuntime(id)` predicate.
* `internal/handlers/plugins_install.go` — Install + Uninstall reject
  external runtimes with 422 + hint, before the source-fetch step.
* `internal/handlers/plugins_install_external_test.go` — 5 cases:
  external→422, uninstall-external→422, container-backed-falls-through,
  no-runtime-lookup-fails-open, lookup-error-fails-open.
* `internal/handlers/plugins_findrunning_ssot_test.go` — two AST gates
  pin the SSOT routing so future PRs can't silently re-introduce the
  parallel impl. Mutation-tested: reverting either consumer to a direct
  `ContainerInspect` makes the gate fail.

Refs: molecule-core#10

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 22:58:20 -07:00
security-auditor
f3187ea0c1 fix(workspace-server): default-bind to 127.0.0.1 in dev-mode fail-open
In dev mode (`MOLECULE_ENV=dev|development`, `ADMIN_TOKEN` unset) the
AdminAuth chain fails open by design so canvas at :3000 can call
workspace-server at :8080 without a bearer token. Combined with the
existing wildcard bind on `:8080`, that exposed unauthenticated
`POST /workspaces` to any same-LAN peer (S-8 in the audit RFC v1).

Couple the bind narrowness to the same signal that drives the auth
fail-open: when `middleware.IsDevModeFailOpen()` returns true, default
the listener to `127.0.0.1`. Production (`ADMIN_TOKEN` set) keeps
binding to all interfaces — its auth chain is doing the work. Operators
who need LAN exposure set `BIND_ADDR=<host>` explicitly.

* `cmd/server/main.go` — `resolveBindHost()` precedence: BIND_ADDR
  explicit > IsDevModeFailOpen() loopback > "" (all interfaces).
  Startup log line now includes the resolved bind + dev-mode-fail-open
  state for post-deploy auditing.
* `cmd/server/bind_test.go` — 8 t.Setenv table cases covering
  precedence, explicit overrides, dev/prod env words. Mutation-tested:
  removing the `IsDevModeFailOpen()` branch makes the dev-mode cases
  fail with "" vs "127.0.0.1".

Refs: molecule-core#7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 22:29:24 -07:00
75a72bf5a2 feat(canvas/chat-server): canvas consumes /chat-history + server-side row-aware reverse (RFC #2945 PR-C-2)
Closes the SSOT story shipped in PR-C/D: canvas now consumes the typed
/chat-history endpoint instead of /activity?type=a2a_receive, and the
server emits messages in display-ready chronological order so the
client doesn't have to re-order them.

## Canvas (consumer migration)

- loadMessagesFromDB swaps from /activity to /chat-history.
- Drops type=a2a_receive + source=canvas params (server applies the
  filter centrally now).
- Drops [...activities].reverse() — wire is already display-ready.
- Drops the local INTERNAL_SELF_MESSAGE_PREFIXES constant +
  isInternalSelfMessage helper. Server-side IsInternalSelfMessage
  applies the same predicate before emitting rows.
- Drops the activityRowToMessages + ActivityRowForHydration imports
  from historyHydration.ts. The TS parser stays in tree because
  message-parser.ts is still load-bearing for live A2A WebSocket
  messages (ChatTab.tsx:805, AgentCommsPanel.tsx, canvas-events.ts).

## Server (row-aware wire-order fix)

The pre-PR-C-2 client did `[...activities].reverse()` over ROWS, then
flattened each row into [user, agent] messages. The reversal was
ROW-aware. After PR-C/D, the server returned a flat ChatMessage slice
in `ORDER BY created_at DESC` order, with [user, agent] within each
row. A naive client-side flat reverse would FLIP each pair (agent
before user at same timestamp).

Two ways to fix it:

  A) Server emits oldest-first within page; canvas does NOT reverse.
  B) Canvas does row-aware reversal (group by timestamp, reverse).

Option A is cleaner — server owns the wire-order responsibility, every
client trusts `for m of messages` to render chronologically. Server
adds reverseRowChunks() that:

  1. Groups consecutive same-Timestamp messages into row chunks
     (1-2 messages per row).
  2. Reverses the chunk order (newest-row-first → oldest-row-first).
  3. Flattens. Within-chunk [user, agent] order is preserved.

Single-message rows (agent reply not yet recorded, attachments-only
user upload) collapse to 1-element chunks and reverse correctly too.
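The three steps above can be sketched like this. The real ChatMessage type is richer; a minimal stand-in with just timestamp and text shows the chunk-then-reverse mechanics:

```go
package main

import "fmt"

type msg struct {
	ts   string
	text string
}

// reverseRowChunks: group consecutive same-timestamp messages into row
// chunks (1-2 per row), reverse the chunk order, flatten. Within-chunk
// [user, agent] order is preserved.
func reverseRowChunks(in []msg) []msg {
	var chunks [][]msg
	for _, m := range in {
		if n := len(chunks); n > 0 && chunks[n-1][0].ts == m.ts {
			chunks[n-1] = append(chunks[n-1], m) // same row: keep pair order
			continue
		}
		chunks = append(chunks, []msg{m}) // new row (or single-message row)
	}
	out := make([]msg, 0, len(in))
	for i := len(chunks) - 1; i >= 0; i-- { // newest-row-first → oldest-row-first
		out = append(out, chunks[i]...)
	}
	return out
}

func main() {
	wire := []msg{ // ORDER BY created_at DESC: [user, agent] within each row
		{"t2", "user-2"}, {"t2", "agent-2"},
		{"t1", "user-1"}, {"t1", "agent-1"},
	}
	for _, m := range reverseRowChunks(wire) {
		fmt.Println(m.text)
	}
}
```

A naive flat reverse of the same wire slice would emit agent-1 before user-1; the row-aware version keeps each pair intact.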

## Tests

Server: 3 new unit tests on reverseRowChunks (paired across rows,
single-message rows, empty input) + 1 sqlmock integration test on
List() that drives the full SQL → reverse → wire path. Mutation-tested:
removed `messages = reverseRowChunks(messages)` from List(), confirmed
the integration test fires red with all 4 misordered indices flagged.
Restored, all 25 messagestore tests + 9 chat-history handler tests
green.

Canvas: 8 lazyHistory pagination tests refactored to mock
/chat-history (not /activity) and assert against the new wire shape
({messages, reached_end} not raw activity rows). All 1389/1389 vitest
tests green; tsc --noEmit clean.

## Three weakest spots (hostile-reviewer self-pass)

1. reverseRowChunks groups by Timestamp string equality. If two
   distinct rows had the SAME timestamp (legitimately possible at sub-
   millisecond granularity), the algorithm would treat them as one
   chunk and not reverse them relative to each other. Mitigated:
   activity_logs.created_at uses microsecond resolution; concurrent
   inserts at exact-same microsecond are vanishingly rare. If a
   collision happens, the within-chunk order is whatever the SQL
   returned — both rows render at the same timestamp, no user-visible
   misordering.

2. The pre-existing TS parser files (historyHydration.ts +
   message-parser.ts) stay in tree. historyHydration.ts is now dead
   code (no consumers post-migration); deletion is parked as a follow-
   up after a one-week observation window confirms no live-message
   consumer reaches it.

3. canvas's loadMessagesFromDB returns `resp.messages ?? []`. If the
   server were ever to return `null` instead of `[]` (it currently
   doesn't — handler defensively coerces nil to []), the nullish coalesce
   keeps the canvas from crashing. A stricter wire schema would assert
   the never-null invariant; for today's pragmatic safety, the ?? is
   enough.

## Security review

- Untrusted input? Same as PR-C — agent JSON parsed defensively in
  the messagestore parser. No new exposure.
- Trust boundary? Same. Canvas → /chat-history → wsAuth → messagestore.
- Output sanitization? Plain text + opaque attachment URIs as before.

No security-relevant changes beyond what /chat-history already
exposes via PR-C. Considered, not skipped.

## Versioning / backwards compat

- /activity endpoint unchanged.
- /chat-history endpoint shape unchanged (still {messages, reached_end});
  only the wire ORDER within a page changed (newest-first row → oldest-
  first row). Canvas is the only consumer in tree; no API consumers
  depend on the previous order.
- canvas's loadMessagesFromDB call signature unchanged — internal
  refactor.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
2026-05-06 16:55:00 -07:00