molecule-core

Author	SHA1	Message	Date
Molecule AI Core-DevOps	252f8d0c47	tech-debt: rename molecule-monorepo-net -> molecule-core-net Renames Docker network across all code, configs, scripts, and docs. Per issue #93: the network was named molecule-monorepo-net as a holdover from when the repo was called molecule-monorepo. The canonical repo name is now molecule-core, so the network should be molecule-core-net. Files changed: - docker-compose.yml, docker-compose.infra.yml: network definition - infra/scripts/setup.sh: docker network create - scripts/nuke-and-rebuild.sh: docker network rm - workspace-server/internal/provisioner/provisioner.go: DefaultNetwork - All comments/docs: updated wording Acceptance: grep -rn 'molecule-monorepo-net' returns zero matches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-09 20:51:48 +00:00
claude-ceo-assistant	3dcc7230f9	fix(provisioner)+test: EvalSymlinks templatePath; stage-2 e2e for files_dir consumption Two changes that fall out of one root cause discovered while preparing the local platform spin-up for the dev-department extraction 
(internal#77): PROBLEM CopyTemplateToContainer's filepath.Walk is called with templatePath set to the workspace's resolved files_dir. With the cross-repo symlink composition shipped in PR #5 (parent template's dev-lead → ../molecule-dev-department/dev-lead/), the Dev Lead workspace's files_dir is literally 'dev-lead' — i.e. the symlink itself, not a path THROUGH the symlink. filepath.Walk does not descend into a symlink leaf — it Lstats the root, sees a symlink (mode bit set, not a directory), emits exactly one entry, and returns. Result: the workspace's /configs/ tar would ship empty. Other 38 workspaces are fine because their files_dir paths just TRAVERSE the symlink (path resolution handles intermediate symlinks via Lstat traversal); only the leaf-is-symlink case breaks. FIX workspace-server/internal/provisioner/provisioner.go: Call filepath.EvalSymlinks on templatePath before filepath.Walk. Resolves the leaf-symlink case for ALL templates, not just dev-dept. Security: templatePath has already passed resolveInsideRoot's path-string check at the call site; the trust boundary is the operator-side /org-templates/ filesystem layout, not this resolution step. TEST workspace-server/internal/handlers/local_e2e_dev_dept_test.go: New TestLocalE2E_FilesDirConsumption — stage-2 of the local e2e. For every workspace in the resolved OrgTemplate, asserts: 1. resolveInsideRoot(orgBaseDir, ws.FilesDir) succeeds. 2. os.Stat on the result returns a directory. 3. filepath.Walk after EvalSymlinks (mirroring the platform fix) emits at least one file. 4. At least one workspace marker exists (workspace.yaml, system-prompt.md, or initial-prompt.md). Exercises the SECOND half of POST /org/import that TestLocalE2E_DevDepartmentExtraction (PR #103) didn't cover. VERIFIED LOCALLY (2026-05-08, against post-extraction Gitea state): --- PASS: TestLocalE2E_FilesDirConsumption (0.05s) checked 39 workspaces with files_dir All 39 walk paths emit non-empty file sets with valid workspace markers. 
REGRESSION GUARD Without the EvalSymlinks fix, this test fails on Dev Lead with: files_dir 'dev-lead' at '/.../molecule-dev/dev-lead' is empty — CopyTemplateToContainer would produce empty /configs/ Refs: internal#77 — extraction RFC molecule-core#102 (resolver symlink contract test) molecule-core#103 (stage-1 e2e: include resolution) Hongming GO 2026-05-08 ('go' on the 3 pre-spin-up optimizations)	2026-05-08 04:46:33 -07:00
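The leaf-symlink behavior described in this commit can be reproduced with the standard library alone. The following is a minimal sketch, not the provisioner's code: `countFiles` and `demo` are hypothetical helpers, and the directory layout (a `target` directory plus a `dev-lead` symlink pointing at it) is an assumption chosen to mirror the commit message.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// countFiles walks root and returns the number of regular files found.
// When resolve is true, root is passed through filepath.EvalSymlinks first,
// which is the shape of the fix described in the commit message.
func countFiles(root string, resolve bool) (int, error) {
	if resolve {
		resolved, err := filepath.EvalSymlinks(root)
		if err != nil {
			return 0, err
		}
		root = resolved
	}
	n := 0
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			n++
		}
		return nil
	})
	return n, err
}

// demo builds target/workspace.yaml plus a symlink "dev-lead" -> target
// (hypothetical layout), then counts files through the symlink with and
// without resolving it first.
func demo() (unresolved, resolved int, err error) {
	base, err := os.MkdirTemp("", "walk-demo")
	if err != nil {
		return
	}
	defer os.RemoveAll(base)

	target := filepath.Join(base, "target")
	if err = os.Mkdir(target, 0o755); err != nil {
		return
	}
	marker := filepath.Join(target, "workspace.yaml")
	if err = os.WriteFile(marker, []byte("name: demo\n"), 0o644); err != nil {
		return
	}
	link := filepath.Join(base, "dev-lead")
	if err = os.Symlink(target, link); err != nil {
		return
	}
	if unresolved, err = countFiles(link, false); err != nil {
		return
	}
	resolved, err = countFiles(link, true)
	return
}

func main() {
	unresolved, resolved, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("files seen without EvalSymlinks:", unresolved)
	fmt.Println("files seen with EvalSymlinks:", resolved)
}
```

Because `filepath.Walk` Lstats its root and does not follow a symlink there, the unresolved walk visits only the link itself, while the resolved walk reaches the file behind it.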
claude-ceo-assistant	d9e380c5bc	feat(workspace-server): local-dev provisioner builds from Gitea source when MOLECULE_IMAGE_REGISTRY is unset (#63, Task #194) OSS 
contributors who clone molecule-core and `go run ./workspace-server/cmd/server` now get a working end-to-end provision without authenticating to GHCR or AWS ECR. Pre-fix: with MOLECULE_IMAGE_REGISTRY unset, the provisioner attempted to pull ghcr.io/molecule-ai/workspace-template-<runtime>:latest, which has been returning 403 since the 2026-05-06 GitHub-org suspension. Post-fix: when MOLECULE_IMAGE_REGISTRY is unset, the provisioner switches to local-build mode — looks up the workspace-template-<runtime> repo's HEAD sha on Gitea via a single API call, shallow-clones into ~/.cache/molecule/, and runs `docker build --platform=linux/amd64`. SHA-pinned cache key skips the clone+build entirely on subsequent provisions. Production tenants are unaffected: every prod tenant sets the var to its private ECR mirror, so the SaaS pull path is byte-for-byte identical. SSOT for mode detection lives in Resolve() (registry_mode.go) returning a discriminated RegistrySource{Mode, Prefix} so call sites that branch on mode get a compile-time push instead of a string-equality footgun. Coverage: * registry_mode.go — new SSOT (Resolve, RegistryMode, IsKnownRuntime) * registry_mode_test.go — 8 tests pinning mode-decision contract * localbuild.go — clone+build pipeline (570 LOC, fully unit-tested) * localbuild_test.go — 22 tests covering happy/sad paths, fail-closed * provisioner.go — Start() inserts ensureLocalImageHook in local mode * docs/adr/ADR-002 — design rationale + alternatives + security review * docs/development/local-development.md — local-build flow + env overrides Security: * Allowlist-only runtime names (knownRuntimes) gate the clone path. * Repo prefix hardcoded to git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-; forks via opt-in MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX. * MOLECULE_GITEA_TOKEN masked in every log line via maskTokenInURL/maskTokenInString. * Fail-closed: Gitea unreachable / runtime not mirrored → clear error, never silently fall back to GHCR/ECR. 
* docker build invocation passes no --build-arg from external input. * HTTP body cap 64KB on Gitea API responses (defence vs malicious upstream). Closes #63 / Task #194. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 15:16:51 -07:00
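A discriminated mode value of the kind this commit describes might look roughly like the sketch below. Only the names `Resolve`, `RegistryMode`, `RegistrySource{Mode, Prefix}`, and `MOLECULE_IMAGE_REGISTRY` come from the commit message; the constant names, the `resolveFrom` helper, and the whitespace and trailing-slash trimming are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// RegistryMode discriminates how an image is obtained.
// The constant names below are hypothetical.
type RegistryMode int

const (
	ModeLocalBuild   RegistryMode = iota // env var unset: clone and build locally
	ModeRegistryPull                     // env var set: pull from that registry
)

// RegistrySource pairs the mode with the registry prefix (empty in local mode).
type RegistrySource struct {
	Mode   RegistryMode
	Prefix string
}

// resolveFrom is a hypothetical pure helper so the decision can be tested
// without touching the process environment. Trimming is an assumption.
func resolveFrom(value string) RegistrySource {
	prefix := strings.TrimSuffix(strings.TrimSpace(value), "/")
	if prefix == "" {
		return RegistrySource{Mode: ModeLocalBuild}
	}
	return RegistrySource{Mode: ModeRegistryPull, Prefix: prefix}
}

// Resolve reads MOLECULE_IMAGE_REGISTRY once and returns the mode decision.
func Resolve() RegistrySource {
	return resolveFrom(os.Getenv("MOLECULE_IMAGE_REGISTRY"))
}

func main() {
	src := Resolve()
	switch src.Mode {
	case ModeLocalBuild:
		fmt.Println("local-build mode")
	case ModeRegistryPull:
		fmt.Println("registry mode, prefix:", src.Prefix)
	}
}
```

Call sites that switch on `src.Mode` rather than comparing strings are nudged by the type to handle each mode explicitly.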
security-auditor	c1de2287fd	fix(workspace-server): SSOT-route container check + 422 on external runtimes Two coupled fixes for molecule-core#10 (plugin 
install 503 vs status=online split-state): 1. SSOT for "is this workspace's container running" — `findRunningContainer` in plugins.go used to carry its own copy of `cli.ContainerInspect`, which collapsed transient daemon errors into the same `""` return as a genuinely-stopped container. Healthsweep's `Provisioner.IsRunning` handled the same input correctly (defensive). Promote the inspect logic to `provisioner.RunningContainerName`, route both consumers through it. Transient errors get a distinct log line on the plugins side so triage doesn't confuse a flaky daemon with a stopped container. 2. Runtime-aware Install/Uninstall — `runtime='external'` workspaces have no local container; push-install via docker exec is meaningless. They pull plugins via the download endpoint instead (Phase 30.3). Without a guard they fell through to `findRunningContainer` and 503'd with a misleading "container not running." Add an early 422 with a hint pointing at the download endpoint. The two fixes are independent: (1) preserves correctness when the SSOT helper is later modified; (2) eliminates the persistent split-state on the 5 external persona-agent workspaces in this DB (and on tenant deployments hitting the same shape). * `internal/provisioner/provisioner.go` — new `RunningContainerName(ctx, cli, id) (string, error)` with three documented outcomes (running / stopped / transient). `Provisioner.IsRunning` now wraps it; behavior preserved. * `internal/handlers/plugins.go` — `findRunningContainer` shimmed onto `RunningContainerName`; new `isExternalRuntime(id)` predicate. * `internal/handlers/plugins_install.go` — Install + Uninstall reject external runtimes with 422 + hint, before the source-fetch step. * `internal/handlers/plugins_install_external_test.go` — 5 cases: external→422, uninstall-external→422, container-backed-falls-through, no-runtime-lookup-fails-open, lookup-error-fails-open. 
* `internal/handlers/plugins_findrunning_ssot_test.go` — two AST gates pin the SSOT routing so future PRs can't silently re-introduce the parallel impl. Mutation-tested: reverting either consumer to a direct `ContainerInspect` makes the gate fail. Refs: molecule-core#10 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 22:58:20 -07:00
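The three documented outcomes (running, stopped, transient) can be sketched without the Docker SDK by injecting the inspect call. Everything here is an assumption for illustration: the `inspectFunc` type, the `errNotFound` and `errDaemon` sentinels, and the `ws-` name construction stand in for whatever the real `RunningContainerName` does with `cli.ContainerInspect`.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for daemon responses.
var (
	errNotFound = errors.New("no such container")
	errDaemon   = errors.New("daemon unavailable")
)

// inspectFunc is a hypothetical stand-in for a container inspect call.
type inspectFunc func(name string) (running bool, err error)

// runningContainerName distinguishes three outcomes:
//   - (name, nil): the container is running
//   - ("", nil):   the container is stopped or absent
//   - ("", err):   a transient failure; the caller must not read it as stopped
func runningContainerName(inspect inspectFunc, workspaceID string) (string, error) {
	name := "ws-" + workspaceID // naming scheme is an assumption
	running, err := inspect(name)
	switch {
	case errors.Is(err, errNotFound):
		return "", nil
	case err != nil:
		return "", fmt.Errorf("inspect %s: %w", name, err)
	case !running:
		return "", nil
	}
	return name, nil
}

func main() {
	flaky := func(string) (bool, error) { return false, errDaemon }
	name, err := runningContainerName(flaky, "abc")
	fmt.Printf("name=%q err=%v\n", name, err)
}
```

Keeping the transient case as a non-nil error is what lets a caller log it separately instead of collapsing it into the empty-name result.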
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook)	4b074f631b	feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6) Add MOLECULE_IMAGE_REGISTRY env var to override the registry prefix used by all workspace-template image references. Defaults to ghcr.io/molecule-ai (unchanged for OSS users); set to an ECR URI in production tenants when mirroring to AWS. Why this matters: GitHub suspended the Molecule-AI org on 2026-05-06 with no warning. Production tenants kept running because they had images cached locally, but any tenant restart (AWS health event, redeploy, OS reboot) would have failed at `docker pull ghcr.io/molecule-ai/...` because GHCR returned 401. This change introduces the seam needed to point new pulls at a registry we control (AWS ECR) by flipping a single env var on Railway. Design (RFC: molecule-ai/internal#6): - New `RegistryPrefix()` function in `provisioner/registry.go` reads MOLECULE_IMAGE_REGISTRY, falls back to "ghcr.io/molecule-ai". - New `RuntimeImage(runtime)` returns the canonical ref using the prefix. - `RuntimeImages` map computed at init via `computeRuntimeImages()` so existing callers that range over it still work. - `DefaultImage` likewise computed via `RuntimeImage(defaultRuntime)`. - `handlers.TemplateImageRef()` switched from hardcoded format string to `provisioner.RegistryPrefix()`. - `runtime_image_pin.go::resolveRuntimeImage()` automatically inherits the prefix change because it reads from `provisioner.RuntimeImages[]` and only re-formats the tag suffix to a digest pin. Alternatives rejected (see RFC): - Multi-registry fallback chain (try ECR, fall back to GHCR): GHCR is locked from outbound for our org, so the fallback never works for us. Adds code complexity for no benefit. - Hardcoded ECR-only switch: couples production code to a specific deployment environment. OSS users self-hosting Molecule would need the upstream GHCR. - Self-hosted Harbor / registry-on-Hetzner: adds a component to operate. Not justified at 3-tenant scale; AWS ECR is mature and IAM-integrated. 
Auth — deliberately NOT changed in this commit: - For GHCR, the existing `ghcrAuthHeader()` reads GHCR_USER/GHCR_TOKEN. - For ECR, EC2 user-data installs `amazon-ecr-credential-helper` and adds a `credHelpers` entry in `~/.docker/config.json` so the daemon resolves ECR credentials via the EC2 instance role on every pull. The Go code needs no auth change. This keeps the diff minimal. Backwards compatibility: - Additive: env unset → identical behavior to today (GHCR). - Existing tests reference literal `ghcr.io/molecule-ai/...` strings; they continue to pass under the default prefix. - `RuntimeImages` map preserved for callers that iterate it. - No interface, schema, API, or migration version bump needed. Security review: - No untrusted input: MOLECULE_IMAGE_REGISTRY is set at deploy time (Railway env, EC2 user-data), not by users. - No expanded data collection or logging changes. - No new permissions: ECR pull permission is a future user-data + IAM role change, separate from this code change. - Worst-case: an attacker who already compromises Railway can swap the registry prefix to a malicious URI — same blast radius as compromising Railway today, no expansion. Tests: - 9 new unit tests in `registry_test.go` covering: default fallback, env override, empty env, all 9 known runtimes, unknown runtime, override-applies-to-all, computeRuntimeImages map population, env reflection, alphabetical ordering pin. - All existing provisioner + handlers tests continue to pass. - Mutation-tested mentally: deleting `if v := os.Getenv(...)` makes TestRegistryPrefix_RespectsEnv fail. Deleting `for _, r := range knownRuntimes` makes TestRuntimeImage_AllKnownRuntimes fail. The test suite would catch a regression of the original failure mode. Rollout plan: this PR is safe to merge with no env change. Production cutover happens by setting MOLECULE_IMAGE_REGISTRY on Railway after the AWS ECR mirror is populated (separate ops change, tracked in issue #6 phases 3b–3f). 
Tracking: - RFC: molecule-ai/internal#6 - Tasks: #97 (ECR setup), #98 (CP fallback) - Tech debt: runbooks/hetzner-rollout-tech-debt-2026-05-06.md item 7 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 14:23:01 -07:00
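The env-with-fallback seam described in the Design section can be sketched as follows. `RegistryPrefix`, `RuntimeImage`, the default `ghcr.io/molecule-ai`, and the `workspace-template-<runtime>:latest` shape appear in the commit messages; the `registryPrefixFrom` and `runtimeImageWith` helpers are hypothetical and exist only so the logic can be exercised without mutating the environment.

```go
package main

import (
	"fmt"
	"os"
)

const defaultRegistry = "ghcr.io/molecule-ai"

// registryPrefixFrom is a hypothetical pure helper: an empty value falls
// back to the default registry, any other value is used as-is.
func registryPrefixFrom(value string) string {
	if value != "" {
		return value
	}
	return defaultRegistry
}

// RegistryPrefix reads MOLECULE_IMAGE_REGISTRY with the GHCR fallback.
func RegistryPrefix() string {
	return registryPrefixFrom(os.Getenv("MOLECULE_IMAGE_REGISTRY"))
}

// runtimeImageWith builds the image reference for a runtime under a prefix.
func runtimeImageWith(prefix, runtime string) string {
	return prefix + "/workspace-template-" + runtime + ":latest"
}

// RuntimeImage returns the reference for a runtime under the active prefix.
func RuntimeImage(runtime string) string {
	return runtimeImageWith(RegistryPrefix(), runtime)
}

func main() {
	// "hermes" is used purely as an example runtime name.
	fmt.Println(RuntimeImage("hermes"))
}
```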
Hongming Wang	1bff419833	feat(provisioner): digest-pin workspace images via runtime_image_pins (#2272 layer 1) Layer 1 of the runtime-rollout plan. Decouples publish from promotion by giving operators a `runtime_image_pins` table the provisioner consults at container-create time. No row = legacy `:latest` behavior; row present = provisioner pulls `<base>@sha256:<digest>`. One bad publish no longer breaks every workspace simultaneously. Mechanics: - Migration 047: `runtime_image_pins` (template_name PK + sha256 digest + audit columns) and `workspaces.runtime_image_digest` (nullable, with partial index) for "show me workspaces still on the old digest" queries. - `resolveRuntimeImage` (handlers/runtime_image_pin.go): looks up the pin, returns `<base>@sha256:<digest>` on hit, "" on miss/error so the provisioner falls through to the legacy tag map. Availability over pinning — any DB error logs and returns "" rather than blocking the provision. `WORKSPACE_IMAGE_LOCAL_OVERRIDE=1` short-circuits the lookup so devs rebuilding template images locally see their fresh build. - `WorkspaceConfig.Image` carries the resolved value into the provisioner. `selectImage` honors it ahead of the runtime→tag map and falls back to DefaultImage on unknown runtime. - The existing `imageTagIsMoving` predicate (#215) already returns false on `@sha256:` form, so digest pins skip the force-pull path naturally. Tests: - Handler-side (sqlmock): no-pin/db-error/with-pin/empty/unknown/local- override paths cover every branch of `resolveRuntimeImage`. - Provisioner-side: `selectImage` table covers explicit-image preference, runtime-map fallback, unknown-runtime → default, empty-config → default. Plus a struct-literal compile-time pin on `Image` so a future refactor can't silently drop the field. Layer 2 (per-ring routing via `workspaces.runtime_image_digest`) and the admin promote/rollback endpoint ride on top of this and ship separately.	2026-05-03 02:30:00 -07:00
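The pin-then-fallback order in this commit can be illustrated with two small pure functions. This is a sketch under assumptions: `pinnedRef` is a hypothetical helper, the digest is assumed to be stored as bare hex without the `sha256:` prefix, and `selectImage` here takes its inputs as parameters rather than reading provisioner state.

```go
package main

import "fmt"

// pinnedRef returns "<base>@sha256:<digest>" when a pin exists and ""
// otherwise, so the caller can fall through to the tag map.
// Assumes digest is bare hex (no "sha256:" prefix).
func pinnedRef(base, digest string) string {
	if digest == "" {
		return ""
	}
	return base + "@sha256:" + digest
}

// selectImage prefers an explicit image, then the runtime map,
// then the default image for unknown runtimes or empty configs.
func selectImage(explicit, runtime string, runtimeImages map[string]string, defaultImage string) string {
	if explicit != "" {
		return explicit
	}
	if img, ok := runtimeImages[runtime]; ok {
		return img
	}
	return defaultImage
}

func main() {
	images := map[string]string{"hermes": "example/workspace-template-hermes:latest"}
	pin := pinnedRef("example/workspace-template-hermes", "abc123")
	fmt.Println(selectImage(pin, "hermes", images, "example/default:latest"))
	fmt.Println(selectImage("", "unknown", images, "example/default:latest"))
}
```

Returning an empty string on a miss keeps the pin lookup optional: the provisioner path stays identical to the legacy behavior whenever no pin row exists.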
Hongming Wang	552602e462	fix(provisioner): force re-pull of moving image tags on workspace start Previously Start() only pulled when the image was missing locally (imgErr != nil). Once a tenant's Docker daemon had `:latest` cached, it stuck on that snapshot forever even after publish-runtime pushed a newer image with the same tag — the same image-cache class that sibling task #232 closed on the controlplane redeploy path. Now Start() additionally re-pulls when the tag is "moving" (`:latest`, no tag, `:staging`, `:main`, `:dev`, `:edge`, `:nightly`, `:rolling`). Pinned tags (semver, sha-prefixed, date-stamped, build-id) and digest-pinned references (`@sha256:...`) skip the pull because their contents are by definition immutable. The classifier (imageTagIsMoving) is deliberately conservative on the "moving" side — only the well-known moving tags trip it. Misclassifying a pinned tag as moving wastes bandwidth on every provision; misclassifying moving as pinned silently bricks the fleet on stale snapshots, which is exactly the bug class this fix closes. Edge cases handled: - Registry hostname with port (`localhost:5000/foo`) — the `:5000` is not mistaken for a tag. - Digest pinning (`image@sha256:...`) — never re-pulled even if a moving-looking tag is also present. - Legacy local-build tags (`workspace-template:hermes`) — treated as pinned (no registry to move from). Test coverage: 22 cases across all classifier shapes. No changes to the pull-failure path (still best-effort, ContainerCreate still surfaces the actionable "image not found" error if the pull failed and the cache is also empty). Task: #215. Companion to #232.	2026-05-02 23:56:32 -07:00
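A classifier with the edge cases listed above might be written along these lines. The function name `imageTagIsMoving` and the list of moving tags come from the commit message; the parsing strategy (look for a digest first, then take the tag from the last path segment) is an assumption and may differ from the real implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// movingTags lists the tags the commit message names as "moving".
var movingTags = map[string]bool{
	"latest": true, "staging": true, "main": true, "dev": true,
	"edge": true, "nightly": true, "rolling": true,
}

// imageTagIsMoving reports whether ref should be re-pulled on start.
func imageTagIsMoving(ref string) bool {
	// Digest-pinned references are immutable, even with a tag present.
	if strings.Contains(ref, "@sha256:") {
		return false
	}
	// Only the last path segment can carry a tag, so a registry port
	// such as "localhost:5000" is never mistaken for one.
	name := ref[strings.LastIndex(ref, "/")+1:]
	colon := strings.LastIndex(name, ":")
	if colon < 0 {
		return true // no tag means the implicit ":latest"
	}
	return movingTags[name[colon+1:]]
}

func main() {
	for _, ref := range []string{
		"ghcr.io/molecule-ai/workspace-template-hermes:latest",
		"localhost:5000/foo",
		"localhost:5000/foo:v1.2.3",
		"image:latest@sha256:deadbeef",
		"workspace-template:hermes",
	} {
		fmt.Printf("%-55s moving=%v\n", ref, imageTagIsMoving(ref))
	}
}
```

Any tag outside the allowlist is treated as pinned, which matches the stated preference for being conservative on the "moving" side.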
Hongming Wang	e081c8335f	refactor(handlers): widen WorkspaceHandler.provisioner to LocalProvisionerAPI interface (#2369 ) Symmetric with the existing CPProvisionerAPI interface. Closes the asymmetry where the SaaS provisioner field was an interface (mockable in tests) but the Docker provisioner field was a concrete pointer (not). ## Changes - New ``provisioner.LocalProvisionerAPI`` interface — the 7 methods WorkspaceHandler / TeamHandler call on h.provisioner today: Start, Stop, IsRunning, ExecRead, RemoveVolume, VolumeHasFile, WriteAuthTokenToVolume. Compile-time assertion confirms Provisioner satisfies it. Mirror of cp_provisioner.go's CPProvisionerAPI block. - ``WorkspaceHandler.provisioner`` and ``TeamHandler.provisioner`` re-typed from ``provisioner.Provisioner`` to ``provisioner.LocalProvisionerAPI``. Constructor parameter type is unchanged — the assignment widens to the interface, so the 200+ callers of ``NewWorkspaceHandler`` / ``NewTeamHandler`` are unaffected. - Constructors gain a ``if p != nil`` guard before assigning to the interface field. Without this, ``NewWorkspaceHandler(..., nil, ...)`` (the test fixture pattern across 200+ tests) yields a typed-nil interface value where ``h.provisioner != nil`` evaluates true, and the SaaS-vs-Docker fork incorrectly routes nil-fixture tests into the Docker code path. Documented inline with reference to the Go FAQ. - Hardened the 5 Provisioner methods that lacked nil-receiver guards (Start, ExecRead, WriteAuthTokenToVolume, RemoveVolume, VolumeHasFile) — return ErrNoBackend on nil receiver instead of panicking on p.cli dereference. Symmetric with Stop/IsRunning (already hardened in #1813). Defensive cleanup so a future caller that bypasses the constructor's nil-elision still degrades cleanly. - Extended TestZeroValuedBackends_NoPanic with 5 new sub-tests covering the newly-hardened nil-receiver paths. Defense-in-depth: a future refactor that drops one of the nil-checks fails red here before reaching production. 
## Why now - Provisioner orchestration has been touched in #2366 / #2368 — the interface symmetry is the natural follow-up captured in #2369. - Future work (CP fleet redeploy endpoint, multi-backend provisioners) wants this in place. Memory note ``project_provisioner_abstraction.md`` calls out pluggable backends as a north-star. - Memory note ``feedback_long_term_robust_automated.md`` — compile-time gates + ErrNoBackend symmetry > runtime panics. ## Verification - ``go build ./...`` clean. - ``go test ./...`` clean — 1300+ tests pass, including the previously-flaky Create-with-nil-provisioner paths that now exercise the constructor's nil-elision correctly. - ``go test ./internal/provisioner/ -run TestZeroValuedBackends_NoPanic -v`` — all 11 nil-receiver subtests green (was 6, +5 for the newly-hardened methods). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 09:18:16 -07:00
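The typed-nil pitfall that motivates the constructor guard can be shown in isolation. The types below are simplified stand-ins, not the real `WorkspaceHandler` or `provisioner` package: the interface has a single method and the constructors are hypothetical.

```go
package main

import "fmt"

// LocalProvisionerAPI is reduced to one method for illustration.
type LocalProvisionerAPI interface {
	IsRunning() bool
}

// Provisioner is a stand-in for the concrete Docker provisioner.
type Provisioner struct{}

// IsRunning is safe on a nil receiver in this sketch.
func (p *Provisioner) IsRunning() bool { return p != nil }

type handler struct {
	provisioner LocalProvisionerAPI
}

// newHandlerUnguarded assigns the pointer directly. A nil *Provisioner
// becomes a non-nil interface value holding a nil pointer.
func newHandlerUnguarded(p *Provisioner) *handler {
	return &handler{provisioner: p}
}

// newHandlerGuarded leaves the interface field nil when p is nil,
// mirroring the `if p != nil` guard described in the commit.
func newHandlerGuarded(p *Provisioner) *handler {
	h := &handler{}
	if p != nil {
		h.provisioner = p
	}
	return h
}

func main() {
	fmt.Println("unguarded field != nil:", newHandlerUnguarded(nil).provisioner != nil)
	fmt.Println("guarded field != nil:  ", newHandlerGuarded(nil).provisioner != nil)
}
```

An interface value compares equal to nil only when both its dynamic type and value are unset, which is why the unguarded constructor defeats an `h.provisioner != nil` check.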
Hongming Wang	92d99d96fe	fix(provisioner): treat "removal already in progress" as no-op success Cascade-deleting a 7-workspace org returned 500 with "workspace marked removed, but 2 stop call(s) failed — please retry: stop eeb99b5d-...: force-remove ws-eeb99b5d-607: Error response from daemon: removal of container ws-eeb99b5d-607 is already in progress" even though the DB-side post-condition succeeded (removed_count=7) and the containers WERE removed shortly after. The fanout fired Stop() on every workspace concurrently and the orphan sweeper happened to reap two of them at the same instant, so Docker rejected the second ContainerRemove with "removal already in progress" — a race-condition ack, not a real failure. Retrying just races the same in-flight removal. The post-condition we care about (the container WILL be gone) is identical to a successful removal, so Stop() should treat it the same way it already treats "No such container" — a no-op return nil that lets the caller proceed with volume cleanup. Real daemon failures (timeout, EOF, ctx cancel) still surface as errors. Two pieces: - New isRemovalInProgress() predicate using the same string-match approach as isContainerNotFound (docker/docker has no typed errdef for this; the CLI itself relies on the message). - Stop() now treats the predicate as success, with a log line distinct from the not-found path so debugging can tell which race fired. Both substrings ("removal of container" + "already in progress") must match — "already in progress" alone would false-positive on unrelated operations like image pulls. Truth table pinned in 7 new test cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 13:25:32 -07:00
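The two-substring predicate can be sketched as below. The name `isRemovalInProgress` and the two required substrings come from the commit message; the case-sensitive match and the `check` helper are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isRemovalInProgress reports whether err looks like Docker's
// "removal of container ... is already in progress" response.
// Both substrings must match; the comparison is case-sensitive here,
// which is an assumption.
func isRemovalInProgress(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "removal of container") &&
		strings.Contains(msg, "already in progress")
}

// check is a hypothetical convenience wrapper for string inputs.
func check(msg string) bool {
	return isRemovalInProgress(errors.New(msg))
}

func main() {
	fmt.Println(check("Error response from daemon: removal of container ws-eeb99b5d-607 is already in progress"))
	fmt.Println(check("pull of image foo is already in progress"))
}
```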
Hongming Wang	4915d1d59e	fix(orphan-sweeper): reap labeled containers with no DB row (wiped-DB) The existing sweeper only reaps ws-* containers whose workspace row has status='removed'. That misses the entire wiped-DB case: an operator does `docker compose down -v` (kills the postgres volume), the previous platform's ws-* containers keep running, the new platform boots into an empty workspaces table — first pass finds zero candidates and those containers leak forever. Symptom users hit today: 7 ws-* containers from 11h ago, no rows in DB, no visibility in Canvas, eating CPU + memory. Fix shape: 1. Provisioner stamps every ws-* container + volume with `molecule.platform.managed=true`. Without a label, the sweeper would have to assume any unlabeled ws-* container might belong to a sibling platform stack on a shared Docker daemon. 2. Provisioner exposes ListManagedContainerIDPrefixes — a label-filter counterpart to the existing name-filter. 3. Sweeper splits sweepOnce into two independent passes: - sweepRemovedRows (unchanged behavior; status='removed' only) - sweepLabeledOrphansWithoutRows (new; labeled containers whose workspace_id has no row in the table at all) Each pass has its own short-circuit so an empty result or transient error in one doesn't block the other — load-bearing because the wiped-DB pass exists precisely for cases where the removed-row pass finds nothing. Safe under multi-platform-on-shared-daemon: only containers carrying our label get reaped, sibling stacks' containers are invisible to this pass. (For now the label is a constant string; a future per-instance UUID layer can refine "ours" further if a real shared-daemon scenario emerges.) Migration: existing platforms running pre-PR builds have UNLABELED ws-* containers. After this lands they continue to NOT be reaped by the new path (no label = invisible). They'll only be cleaned via manual intervention or once the operator recreates them — same as today. No regression. 
Tests cover all five branches of the new pass: happy-path reap, no-reap when row exists, mixed reap-some-keep-some, Docker error short-circuits cleanly, non-UUID prefixes get filtered before the SQL query. Pairs with PR #2122 (script-level fix). Together they close the orphan-leak path for both `bash scripts/nuke-and-rebuild.sh` users (handled by the script) AND `docker compose down -v` users (handled by the runtime). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 14:33:41 -07:00
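The core of the wiped-DB pass is a set difference: labeled container ID prefixes that match no workspace row. The sketch below keeps that logic in memory; the `orphans` function, the character-class filter for "looks like a UUID prefix", and the prefix-matching rule are all assumptions, since the commit message does not show the real query.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidPrefix accepts only lowercase hex digits and dashes.
// This filter is an assumption standing in for the real validation.
var uuidPrefix = regexp.MustCompile(`^[0-9a-f-]+$`)

// orphans returns the container ID prefixes that look like UUID prefixes
// but match no known workspace ID. Non-UUID prefixes are dropped before
// any lookup, mirroring the "filtered before the SQL query" step.
func orphans(containerPrefixes, workspaceIDs []string) []string {
	var out []string
	for _, p := range containerPrefixes {
		if !uuidPrefix.MatchString(p) {
			continue
		}
		known := false
		for _, id := range workspaceIDs {
			if strings.HasPrefix(id, p) {
				known = true
				break
			}
		}
		if !known {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	prefixes := []string{"eeb99b5d-607", "aaaaaaaa-111", "not_a_uuid"}
	rows := []string{"eeb99b5d-6070-4000-8000-000000000000"} // illustrative ID
	fmt.Println(orphans(prefixes, rows))
}
```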
Hongming Wang	d0f198b24f	merge: resolve staging conflicts (a2a_proxy + workspace_crud) Three files conflicted with staging changes that landed while this PR sat open. Resolved each by combining both intents (not picking one side): - a2a_proxy.go: keep the branch's idle-timeout signature (workspaceID parameter + comment) AND apply staging's #1483 SSRF defense-in-depth check at the top of dispatchA2A. Type-assert h.broadcaster (now an EventEmitter interface per staging) back to Broadcaster for applyIdleTimeout's SubscribeSSE call; falls through to no-op when the assertion fails (test-mock case). - a2a_proxy_test.go: keep both new test suites — branch's TestApplyIdleTimeout_ (3 cases for the idle-timeout helper) AND staging's TestDispatchA2A_RejectsUnsafeURL (#1483 regression). Updated the staging test's dispatchA2A call to pass the workspaceID arg introduced by the branch's signature change. - workspace_crud.go: combine both Delete-cleanup intents: * Branch's cleanupCtx detachment (WithoutCancel + 30s) so canvas hang-up doesn't cancel mid-Docker-call (the container-leak fix) * Branch's stopAndRemove helper that skips RemoveVolume when Stop fails (orphan sweeper handles) * Staging's #1843 stopErrs aggregation so Stop failures bubble up as 500 to the client (the EC2 orphan-instance prevention) Both concerns satisfied: cleanup runs to completion past canvas hangup AND failed Stop calls surface to caller. Build clean, all platform tests pass. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:43:22 -07:00
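The cleanup-context detachment mentioned for workspace_crud.go (WithoutCancel plus a 30s bound) can be sketched with the standard library, assuming Go 1.21 or later for `context.WithoutCancel`. The `cleanupContext` and `survivesParentCancel` helpers are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cleanupContext detaches from the parent's cancellation while keeping its
// values, then bounds the cleanup with its own 30-second timeout.
func cleanupContext(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), 30*time.Second)
}

// survivesParentCancel reports whether the cleanup context is still live
// after the request context it was derived from has been cancelled.
func survivesParentCancel() bool {
	parent, cancelParent := context.WithCancel(context.Background())
	cleanup, cancelCleanup := cleanupContext(parent)
	defer cancelCleanup()

	cancelParent() // simulates the client hanging up mid-request
	return parent.Err() != nil && cleanup.Err() == nil
}

func main() {
	fmt.Println("cleanup context survives parent cancel:", survivesParentCancel())
}
```

The separate timeout matters: without it, a detached context would let a stuck Docker call run indefinitely.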
Hongming Wang	0de67cd379	feat(platform/admin): /admin/workspace-images/refresh + Docker SDK + GHCR auth The production-side end of the runtime CD chain. Operators (or the post- publish CI workflow) hit this after a runtime release to pull the latest workspace-template-* images from GHCR and recreate any running ws-* containers so they adopt the new image. Without this, freshly-published runtime sat in the registry but containers kept the old image until naturally cycled. Implementation notes: - Uses Docker SDK ImagePull rather than shelling out to docker CLI — the alpine platform container has no docker CLI installed. - ghcrAuthHeader() reads GHCR_USER + GHCR_TOKEN env, builds the base64- encoded JSON payload Docker engine expects in PullOptions.RegistryAuth. Both empty → public/cached images only; both set → private GHCR pulls. - Container matching uses ContainerInspect (NOT ContainerList) because ContainerList returns the resolved digest in .Image, not the human tag. Inspect surfaces .Config.Image which is what we need. - Provisioner.DefaultImagePlatform() exported so admin handler picks the same Apple-Silicon-needs-amd64 platform as the provisioner — single source of truth for the multi-arch override. Local-dev companion: scripts/refresh-workspace-images.sh runs on the host and inherits the host's docker keychain auth — alternate path for when GHCR_USER/TOKEN aren't set in the platform env. 🤖 Generated with [Claude Code](https://claude.com/claude-code)	2026-04-26 10:17:21 -07:00
Hongming Wang	48b494def3	fix(provisioner): nil guards on Stop/IsRunning, unblock contract tests (closes #1813) Both backends panicked when called on a zero-valued or nil receiver: Provisioner.{Stop,IsRunning} dereferenced p.cli; CPProvisioner.{Stop, IsRunning} dereferenced p.httpClient. The orphan sweeper and shutdown paths can call these speculatively where the receiver isn't fully wired — the panic crashed the goroutine instead of the caller seeing a clean error. Three changes: 1. Add ErrNoBackend (typed sentinel) and nil-guard the four methods. - Provisioner.{Stop,IsRunning}: guard p == nil || p.cli == nil at the top. - CPProvisioner.Stop: guard p == nil up top, then httpClient nil AFTER resolveInstanceID + empty-instance check (the empty instance_id path doesn't need HTTP and stays a no-op success even on zero-valued receivers — preserved historical contract from TestIsRunning_EmptyInstanceIDReturnsFalse). - CPProvisioner.IsRunning: same shape — empty instance_id stays (false, nil); httpClient-nil with non-empty instance_id returns ErrNoBackend. 2. Flip the t.Skip on TestDockerBackend_Contract + TestCPProvisionerBackend_Contract — both contract tests run now that the panics are gone. Skipped scenarios were the regression guard for this fix. 3. Add TestZeroValuedBackends_NoPanic — explicit assertion that zero-valued and nil receivers return cleanly (no panic). Docker backend always returns ErrNoBackend on zero-valued; CPProvisioner may return (false, nil) when the DB-lookup layer absorbs the case (no instance to query → no HTTP needed). Both are acceptable per the issue's contract — the gate is no-panic. Tests: - 6 sub-cases across the new TestZeroValuedBackends_NoPanic - TestDockerBackend_Contract + TestCPProvisionerBackend_Contract now run their 2 scenarios (4 sub-cases each) - All existing provisioner tests still green - go build ./... + go vet ./... + go test ./... clean Closes drift-risk #6 in docs/architecture/backends.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 02:17:51 -07:00
Hongming Wang	cb12601414	fix(platform): make Provisioner.Stop return real errors so cleanup gates fire Review caught a critical issue with `12c49183`: the headline "skip RemoveVolume when Stop fails" guarantee was dead code. `Provisioner.Stop` unconditionally `return nil`'d after logging the underlying ContainerRemove error, so the new `if err := h.provisioner.Stop(...); err != nil { skip volume }` guard in workspace_crud.go AND the same guard in the orphan sweeper could never fire. RemoveVolume always ran, predictably failing with "volume in use" when Stop hadn't actually killed the container — which is the exact production bug the commit claimed to fix. Now Stop: - returns nil on successful remove (no change) - returns nil when the container is already gone (uses the existing isContainerNotFound helper — that's the cleanup post-condition, not a failure) - returns the wrapped Docker error otherwise (daemon timeout, ctx cancellation, socket EOF — anything that means the container might still be alive) Audited every Provisioner.Stop caller in the tree (team.go, workspace_restart.go ×4, workspace.go) — all of them already discard the return value, so the widened error surface is purely opt-in for the new cleanup paths and breaks no existing behaviour. Other review-driven fixes in this commit: - workspace_crud.go: detached `broadcaster.RecordAndBroadcast` from the request ctx too. RecordAndBroadcast does INSERT INTO structure_events + Redis Publish; if the canvas hangs up, a request-ctx-bound INSERT can be cancelled mid-write and the WORKSPACE_REMOVED event never lands, leaving other WS clients ignorant of the cascade. - orphan_sweeper.go: added isLikelyWorkspaceID guard before turning Docker container prefixes into SQL LIKE patterns. 
The Docker name filter is a SUBSTRING match (not prefix), so non-workspace containers like `my-ws-tool` slip through; the in-loop HasPrefix in provisioner trims most, but the in-sweeper alphabet check (hex + dashes only) is the second line of defence and also blocks SQL LIKE wildcards (`_`, `%`) from reaching the query. Two new tests pin this — TestSweepOnce_FiltersNonWorkspacePrefixes and TestIsLikelyWorkspaceID with 10 alphabet cases. - provisioner.go: comment added to ListWorkspaceContainerIDPrefixes flagging the substring/HasPrefix relationship as load-bearing. Verified: full Go test suite passes; all 8 sweeper tests pass (2 new for the LIKE-pattern guard); existing dispatch / delete / provisioner tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 23:32:48 -07:00
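The alphabet check described in commit cb12601414 (hex digits and dashes only) might look like the sketch below. The function name matches the commit message, but the body is an assumption. In particular, rejecting the empty string and uppercase hex are choices made for this example, not confirmed behaviour.

```go
package main

import "fmt"

// isLikelyWorkspaceID reports whether prefix consists solely of lowercase
// hex digits and dashes. Anything else, including the SQL LIKE wildcards
// '_' and '%', is rejected before it can reach a query pattern.
// Assumption: empty input and uppercase hex are treated as invalid.
func isLikelyWorkspaceID(prefix string) bool {
	if prefix == "" {
		return false
	}
	for _, r := range prefix {
		switch {
		case r >= '0' && r <= '9':
		case r >= 'a' && r <= 'f':
		case r == '-':
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, candidate := range []string{"7aa08951-00d", "my-ws-tool", "abc%", "a_b"} {
		fmt.Printf("%q -> %v\n", candidate, isLikelyWorkspaceID(candidate))
	}
}
```

Because the allowed alphabet contains neither `_` nor `%`, a prefix that passes this check can be concatenated into a `LIKE` pattern without acting as a wildcard itself.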
Hongming Wang	12c4918318	fix(platform): stop leaking workspace containers on delete Symptom: deleting workspaces from the canvas marked DB rows status='removed' but left Docker containers running indefinitely. After a session of org imports + cancellations, we counted 10 running ws-* containers all backed by 'removed' DB rows, eating ~1100% CPU on the Docker VM. Two compounding bugs in handlers/workspace_crud.go's delete cascade: 1. The cleanup loop used `c.Request.Context()` for the Docker stop/remove calls. When the canvas's `api.del` resolved on the platform's 200, gin cancelled the request ctx — and any in-flight Docker call cancelled with `context canceled`, leaving the container alive. Old logs: "Delete descendant <id> volume removal warning: ... context canceled" 2. `provisioner.Stop`'s error return was discarded and `RemoveVolume` ran unconditionally afterward. When Stop didn't actually kill the container (transient daemon error, ctx cancellation as in #1), the volume removal would predictably fail with "volume in use" and the container kept running with the volume mounted. Old logs: "Delete descendant <id> volume removal warning: Error response from daemon: remove ... volume is in use" Fix layered in two parts: - workspace_crud.go: detach cleanup with `context.WithoutCancel(ctx)` + a 30s bounded timeout. Stop's error is now checked and on failure we skip RemoveVolume entirely (the orphan sweeper below catches what we deferred). - New registry/orphan_sweeper.go: periodic reconcile pass (every 60s, initial run on boot). Lists running ws-* containers via Docker name filter, intersects with DB rows where status='removed', stops + removes volumes for the leaks. Defence in depth — even a brand-new Stop failure mode heals on the next sweep instead of leaking forever. 
Provisioner gains a tiny ListWorkspaceContainerIDPrefixes helper that wraps ContainerList with the `name=ws-` filter; the sweeper takes an OrphanReaper interface (matches the ContainerChecker pattern in healthsweep.go) so unit tests don't need a real Docker daemon. main.go wires the sweeper alongside the existing liveness + health-sweep + provisioning-timeout monitors, all under supervised.RunWithRecover so a panic restarts the goroutine. 6 new sweeper tests cover the reconcile path, the no-running-containers short-circuit, the daemon-error skip, the Stop-failure-leaves-volume invariant (the same trap that motivated this fix), the volume-remove-error-is-non-fatal continuation, and the nil-reaper no-op. Verified: full Go test suite passes; manually purged the 10 leaked containers + their orphan volumes from the dev host with `docker rm -f` + `docker volume rm` (one-off cleanup; the sweeper would have caught them on the next cycle once deployed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-25 12:36:22 -07:00
Molecule AI Core-BE	b5e2142c46	fix(#1877): close token-rotation race on restart — Option A+Option B combined Platform side (Option B): - provisioner.go: add WriteAuthTokenToVolume() — writes .auth_token to the Docker named volume BEFORE ContainerStart using a throwaway alpine container, eliminating the race window where a restarted container could read a stale token before WriteFilesToContainer writes the new one. - workspace_provision.go: call WriteAuthTokenToVolume() in issueAndInjectToken as a best-effort pre-write before the container starts. Runtime side (Option A): - heartbeat.py: on HTTPStatusError 401 from /registry/heartbeat, call refresh_cache() to force re-read of /configs/.auth_token from disk, then retry the heartbeat once. Fall through to normal failure tracking if the retry also fails. - platform_auth.py: add refresh_cache() which discards the in-process _cached_token and calls get_token() to re-read from disk. Together these eliminate the >1 consecutive 401 window described in issue #1877. Pre-write (B) is the primary fix; runtime retry (A) is the self-healing fallback for any residual race. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-23 17:47:18 -07:00
Hongming Wang	539e3483e4	fix(provisioner): force linux/amd64 pull + create on Apple Silicon hosts (#1875) On an Apple Silicon dev box, every `POST /workspaces` failed immediately with: no matching manifest for linux/arm64/v8 in the manifest list entries: no match for platform in manifest: not found because the GHCR workspace-template-* images ship only a linux/amd64 manifest today. `ImagePull` and `ContainerCreate` asked for the daemon's native arch and missed. The Canvas surfaced this as docker image "ghcr.io/molecule-ai/workspace-template-autogen:latest" not found after pull attempt — verify GHCR visibility for autogen — confusing because the image IS visible, just not for linux/arm64. ### Fix Add an auto-detect helper `defaultImagePlatform()` in `internal/provisioner/provisioner.go` that returns `"linux/amd64"` on Apple Silicon hosts and `""` (no preference) everywhere else, with an env override `MOLECULE_IMAGE_PLATFORM` for operators who want to pin or disable explicitly. The result is passed to both `ImagePull` (`PullOptions.Platform`) and `ContainerCreate` (4th arg `*ocispec.Platform`) so the pulled amd64 manifest matches the create-time platform spec. Docker Desktop transparently runs it under QEMU emulation on M-series Macs — slow (2–5× native) but functional. SaaS production (linux/amd64 EC2, `MOLECULE_ENV=production`) never hits the `runtime.GOARCH == "arm64"` branch, so the current behaviour on real tenants is byte-for-byte unchanged. Opt-in escape hatch for operators who want it off: export MOLECULE_IMAGE_PLATFORM="" # disable auto-force export MOLECULE_IMAGE_PLATFORM=linux/arm64 # pin alternate `ocispec` is `github.com/opencontainers/image-spec/specs-go/v1` — already in go.sum v1.1.1 as a transitive dependency of `github.com/docker/docker`, not a new import.
### Tests `internal/provisioner/platform_test.go` exercises every branch: - `TestDefaultImagePlatform_EnvOverride_ExplicitValue` — env wins - `TestDefaultImagePlatform_EnvOverride_EmptyValue` — empty string disables the auto-force (operator escape hatch) - `TestDefaultImagePlatform_AutoDetect` — linux/amd64 on arm64 Mac, "" on every other host - `TestParseOCIPlatform` — 7 table-driven cases covering well-formed platforms, malformed inputs, and nil handling ### End-to-end verification Before this commit, `POST /workspaces` on my Apple Silicon box: workspace status transitioned: provisioning → failed (~1s) log: image pull for ... failed: no matching manifest for linux/arm64/v8 After this commit, fresh DB + fresh platform: workspace status transitioned: provisioning → online (~25s) log: attempting pull (platform=linux/amd64) pulled ghcr.io/molecule-ai/workspace-template-langgraph:latest docker ps: ws-7aa08951-00d Up 27 seconds The existing provisioner race-tested test suite (`go test -race ./internal/provisioner/`) still passes — the platform pointer defaults to nil on linux/amd64 hosts, so the CI-resolved test expectations don't change. Closes #1875 (arm64 image blocker). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:55:34 -07:00
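The override-then-autodetect logic from commit 539e3483e4 can be sketched as a pure function plus a thin wrapper. `resolveImagePlatform` is a hypothetical helper introduced so the branches are testable without touching the process environment. The sketch keys only on `GOARCH == "arm64"`, which is the branch the commit message quotes. Whether the real helper also checks the operating system is not stated, so treat that as an open assumption.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// platformEnv is the override variable named in the commit message.
const platformEnv = "MOLECULE_IMAGE_PLATFORM"

// resolveImagePlatform (hypothetical) applies the precedence rules:
// an explicitly set variable always wins, even when empty, and an
// empty value disables the auto-force.
func resolveImagePlatform(envValue string, envSet bool, goarch string) string {
	if envSet {
		return envValue
	}
	if goarch == "arm64" {
		return "linux/amd64" // force the only published manifest
	}
	return "" // no preference: let the daemon choose
}

// defaultImagePlatform wires the pure helper to the real environment.
// os.LookupEnv distinguishes "unset" from "set to empty string".
func defaultImagePlatform() string {
	value, set := os.LookupEnv(platformEnv)
	return resolveImagePlatform(value, set, runtime.GOARCH)
}

func main() {
	fmt.Printf("platform=%q\n", defaultImagePlatform())
}
```

Using `os.LookupEnv` rather than `os.Getenv` is what makes `export MOLECULE_IMAGE_PLATFORM=""` behave as an escape hatch, since `Getenv` cannot tell an empty value from an unset one.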
Hongming Wang	9df3159c59	feat(provisioner): pull workspace-template images from GHCR Every standalone workspace-template repo now publishes to ghcr.io/molecule-ai/workspace-template-<runtime>:latest via the reusable publish-template-image workflow in molecule-ci (landed today — one caller per template repo). This PR makes the provisioner actually use those images: - RuntimeImages map + DefaultImage switched from bare local tags (workspace-template:<runtime>) to their GHCR equivalents. - New ensureImageLocal step before ContainerCreate: if the image isn't present locally, attempt `docker pull` and drain the progress stream to completion. Best-effort — if the pull fails (network, auth, rate limit) the subsequent ContainerCreate still surfaces the actionable "No such image" error, now with a GHCR-appropriate hint instead of the defunct `bash workspace/build-all.sh <runtime>` advice. - runtimeTagFromImage now handles both forms: legacy `workspace-template:<runtime>` (local dev via build-all.sh / rebuild-runtime-images.sh) and the current GHCR shape. Keeps error hints sensible in both worlds. - Tests cover the GHCR path for tag extraction and the new error message shape. Legacy local tags still recognised. Local dev path unchanged — scripts/build-images.sh and workspace/rebuild-runtime-images.sh still produce locally-tagged `workspace-template:<runtime>` images, and Docker's image resolver matches them before any pull is attempted. So contributors can keep iterating on a template repo without round-tripping through GHCR. Follow-on impact: - hongmingwang.moleculesai.app (and any other tenant EC2) will auto-pull `ghcr.io/molecule-ai/workspace-template-hermes:latest` on the next hermes workspace provision — picking up the real Nous hermes-agent behind the A2A bridge (template-hermes v2.1.0) without any tenant-side rebuild step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:39:56 -07:00
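The two image-name shapes that commit 9df3159c59 says `runtimeTagFromImage` handles can be illustrated with a short parser. This is a sketch under assumptions, not the repository's implementation: the return value for unrecognised images (an empty string) and the exact prefix constants are choices made for the example. `strings.CutPrefix` requires Go 1.20 or later.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	legacyPrefix = "workspace-template:"                     // local form: workspace-template:<runtime>
	ghcrPrefix   = "ghcr.io/molecule-ai/workspace-template-" // GHCR form: ...-<runtime>:<tag>
)

// runtimeTagFromImage extracts the runtime name from either image form.
// Assumption: unrecognised images yield "".
func runtimeTagFromImage(image string) string {
	if runtimeName, ok := strings.CutPrefix(image, legacyPrefix); ok {
		return runtimeName
	}
	if rest, ok := strings.CutPrefix(image, ghcrPrefix); ok {
		// Drop the ":latest" (or any other) tag suffix if present.
		runtimeName, _, _ := strings.Cut(rest, ":")
		return runtimeName
	}
	return ""
}

func main() {
	for _, image := range []string{
		"workspace-template:langgraph",
		"ghcr.io/molecule-ai/workspace-template-hermes:latest",
		"nginx:latest",
	} {
		fmt.Printf("%s -> %q\n", image, runtimeTagFromImage(image))
	}
}
```

Recognising both shapes lets an error hint name the runtime regardless of whether the image came from a local build or a registry pull.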
Hongming Wang	39074cc4ae	chore: final open-source cleanup — binary, stale paths, private refs - Remove compiled workspace-server/server binary from git - Fix .gitignore, .gitattributes, .githooks/pre-commit for renamed dirs - Fix CI workflow path filters (workspace-template → workspace) - Replace real EC2 IP and personal slug in test_saas_tenant.sh - Scrub molecule-controlplane references in docs - Fix stale workspace-template/ paths in provisioner, handlers, tests - Clean tracked Python cache files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:38:55 -07:00
Hongming Wang	d8026347e5	chore: open-source restructure — rename dirs, remove internal files, scrub secrets Renames: - platform/ → workspace-server/ (Go module path stays as "platform" for external dep compat — will update after plugin module republish) - workspace-template/ → workspace/ Removed (moved to separate repos or deleted): - PLAN.md — internal roadmap (move to private project board) - HANDOFF.md, AGENTS.md — one-time internal session docs - .claude/ — gitignored entirely (local agent config) - infra/cloudflare-worker/ → Molecule-AI/molecule-tenant-proxy - org-templates/molecule-dev/ → standalone template repo - .mcp-eval/ → molecule-mcp-server repo - test-results/ — ephemeral, gitignored Security scrubbing: - Cloudflare account/zone/KV IDs → placeholders - Real EC2 IPs → <EC2_IP> in all docs - CF token prefix, Neon project ID, Fly app names → redacted - Langfuse dev credentials → parameterized - Personal runner username/machine name → generic Community files: - CONTRIBUTING.md — build, test, branch conventions - CODE_OF_CONDUCT.md — Contributor Covenant 2.1 All Dockerfiles, CI workflows, docker-compose, railway.toml, render.yaml, README, CLAUDE.md updated for new directory names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 00:24:44 -07:00

20 Commits