feat(workspace-server): local-dev provisioner builds from Gitea source when MOLECULE_IMAGE_REGISTRY is unset (Task #194) #63
Phase 1 — Investigation
Root cause
Provisioner image resolution treats GHCR as the OSS default (`ghcr.io/molecule-ai/workspace-template-<runtime>:latest` via `RegistryPrefix()` in `workspace-server/internal/provisioner/registry.go`). Since 2026-05-06 the `Molecule-AI` GitHub org has been suspended, and GHCR now returns 403 for every `workspace-template-*` manifest. OSS contributors who clone `molecule-core` and `go run ./workspace-server/cmd/server` cannot provision a workspace; the first provision fails with:

Prod tenants are unaffected because every prod tenant sets `MOLECULE_IMAGE_REGISTRY` to the AWS ECR mirror via Railway env + EC2 user-data.

Reproduction (verified 2026-05-07):
Affected surfaces (all in `workspace-server`):

- `internal/provisioner/registry.go`: `RegistryPrefix()` is the SSOT for mode; `RuntimeImage()` and `computeRuntimeImages()` produce image refs.
- `internal/provisioner/provisioner.go`: `Start()` calls `selectImage(cfg)` → `RuntimeImages[runtime]` and pulls via `pullImageAndDrain`. Hardcodes the `linux/amd64` platform on Apple Silicon (existing emulation behavior, unchanged here).
- `internal/provisioner/cp_provisioner.go`: SaaS path; calls the control plane HTTP API. Does NOT consult `RuntimeImages` directly. Untouched by this change.
- `internal/handlers/admin_workspace_images.go`: `TemplateImageRef()` mirrors the registry decision for the manual `/admin/workspace-images/refresh` route. Must stay aligned.
- `internal/imagewatch/watch.go`: auto-refresh polls `https://ghcr.io/v2/molecule-ai/workspace-template-<rt>/manifests/latest`. Hardcoded GHCR. Gated behind `IMAGE_AUTO_REFRESH=true` (off by default in local dev). Out of scope: the watcher should NOT run in local-build mode, and the gate already covers that since OSS contributors don't set the env var.
- `internal/provisioner/registry_test.go`: pins existing behavior; needs extension for local-build mode.
- `docs/development/local-development.md`: the current doc says `docker compose up` boots everything. After this change, the first provision triggers a clone+build that takes 5–10 min on Apple Silicon. Must call this out.
- `~/Documents/GitHub/molecule-ai-workspace-template-claude-code/runbooks/local-dev-setup.md` and `known-issues.md` §5: currently document a wrong-shaped retag workaround. Must be replaced (separate PR in the template repo).

Other registry-deciding code paths searched
`grep -rn "MOLECULE_IMAGE_REGISTRY\|RegistryPrefix\|workspace-template-" workspace-server` confirms the env var is consulted in exactly one place (`registry.go`); every other call site reads via `RegistryPrefix()`/`RuntimeImage()`. Q2 reading holds: extending `RegistryPrefix` semantics propagates everywhere needed.

Gitea template-repo coverage
Verified via the Gitea API that 4 of 9 runtimes have their template repos mirrored to Gitea today:
Local-build mode succeeds for the 4 mirrored runtimes today; for the 5 unmirrored ones, it fails loudly with an actionable error message naming the missing repo. Mirroring those repos is out of scope (separate task).
Architecture mismatch
Provisioner hardcodes `linux/amd64` on `ContainerCreate` (with QEMU emulation on Apple Silicon; see `defaultImagePlatform()`). Two design choices for local-build:

1. `docker buildx build --platform=linux/amd64`: mimics prod, slow (10–25 min cold on Apple Silicon).

Decision in the design section below.
Prior art surveyed
- `docker_build`
- `skaffold dev`: watches sources, rebuilds on change, optional `--cache-artifacts`
- `kind load docker-image`
- `k3d image import`
- `image-build`
- `run`: resolves + caches deps; subsequent runs hit cache
- `cargo build`: resolves crates; transparent, cached, only re-fetches on change

Best fit: a hybrid of devcontainer.json's transparent build-from-source + Cargo's content-hash cache. We do not need a watch loop or DSL.
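The Cargo-style content-hash cache can be sketched as below, assuming the key is simply the template HEAD sha joined with a SHA-256 of the Dockerfile (the exact key encoding is not specified in this description) and using an in-memory map as a hypothetical stand-in for the on-disk cache directory. `buildCacheKey` and `getOrBuild` are illustrative names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// buildCacheKey joins the two inputs of the cache key: the template repo HEAD
// sha and a SHA-256 of the Dockerfile bytes. Changing either yields a new key.
func buildCacheKey(headSHA string, dockerfile []byte) string {
	sum := sha256.Sum256(dockerfile)
	return headSHA + "-" + hex.EncodeToString(sum[:])
}

// getOrBuild returns the cached image for key and reports a hit; on a miss it
// calls build once and records the result. The map stands in for the on-disk
// cache directory.
func getOrBuild(cache map[string]string, key string, build func() (string, error)) (string, bool, error) {
	if img, ok := cache[key]; ok {
		return img, true, nil
	}
	img, err := build()
	if err != nil {
		return "", false, err
	}
	cache[key] = img
	return img, false, nil
}

func main() {
	cache := map[string]string{}
	builds := 0
	build := func() (string, error) {
		builds++
		return "molecule-local/workspace-template-claude-code:0123456789ab", nil
	}
	key := buildCacheKey("0123456789abcdef0123456789abcdef01234567", []byte("FROM scratch\n"))
	for i := 0; i < 2; i++ {
		img, hit, err := getOrBuild(cache, key, build)
		fmt.Println(img, "hit:", hit, "err:", err)
	}
	fmt.Println("builds run:", builds)
}
```

As with Cargo, a hit skips all work, and a failed build is not cached, so the next provision retries.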
Phase 2 — Design
Mode detection (SSOT)
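The discriminated value this section describes for `RegistryPrefix()` might look like the following Go sketch. `RegistryMode`, `RegistrySelection`, `resolveRegistryMode` and `registryPrefixShim` are hypothetical names; the shim illustrates the panic-in-local-mode back-compat option. Treating an empty or whitespace-only `MOLECULE_IMAGE_REGISTRY` as unset is an assumption, as is injecting the env lookup for testability.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// RegistryMode discriminates the two provisioning modes.
type RegistryMode int

const (
	ModeSaaS       RegistryMode = iota // pull prebuilt images from MOLECULE_IMAGE_REGISTRY
	ModeLocalBuild                     // build images from Gitea template source
)

// RegistrySelection is the discriminated result. Registry is only meaningful
// when Mode == ModeSaaS, so callers have to switch on Mode first.
type RegistrySelection struct {
	Mode     RegistryMode
	Registry string
}

// resolveRegistryMode reads MOLECULE_IMAGE_REGISTRY via lookup (for example
// os.LookupEnv). Unset, empty or whitespace-only values select local-build.
func resolveRegistryMode(lookup func(string) (string, bool)) RegistrySelection {
	v, ok := lookup("MOLECULE_IMAGE_REGISTRY")
	v = strings.TrimSpace(v)
	if !ok || v == "" {
		return RegistrySelection{Mode: ModeLocalBuild}
	}
	return RegistrySelection{Mode: ModeSaaS, Registry: v}
}

// registryPrefixShim sketches the back-compat option: the old string API keeps
// working in SaaS mode and panics in local-build mode instead of returning "".
func registryPrefixShim(sel RegistrySelection) string {
	if sel.Mode != ModeSaaS {
		panic("RegistryPrefix: no registry in local-build mode; switch on RegistrySelection.Mode")
	}
	return sel.Registry
}

func main() {
	sel := resolveRegistryMode(os.LookupEnv)
	switch sel.Mode {
	case ModeSaaS:
		fmt.Println("SaaS mode, registry prefix:", registryPrefixShim(sel))
	case ModeLocalBuild:
		fmt.Println("local-build mode: images are built from Gitea template source")
	}
}
```

Because `Registry` is never read without first checking `Mode`, an empty string can no longer be mistaken for a valid registry prefix.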
`RegistryPrefix()` returns a discriminated value, not a bare string:

Reason: a discriminated value forces every call site to acknowledge the two modes. A bare string return type would let a future caller silently treat `""` as a registry prefix (the exact bug class that originally landed in the OSS-default-vs-ECR-mirror flap).

Back-compat: keep `RegistryPrefix() string` as a thin shim that returns `Registry` in SaaS mode, or panics in local mode (callers that ignore the mode must opt into the explicit migration). Easier: one big rename, with every call site updated in the same PR.

Local-mode codepath
- `${HOME}/.cache/molecule/workspace-template-build/<runtime>/<head-sha>/` (XDG-compliant; `MOLECULE_LOCAL_BUILD_CACHE` env override).
- Cache key: (template-repo HEAD sha, Dockerfile content hash). HEAD comes from a shallow `git ls-remote https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-<runtime>` (single HTTP call, no clone). When the key matches, we skip clone + build entirely.
- `docker build --platform=linux/amd64 -t molecule-local/workspace-template-<runtime>:<sha> -f Dockerfile .` from the cloned dir. Choose direction (1), amd64 emulation, to honor `feedback_local_must_mimic_production`. The tradeoff is build time, which is accepted; we mitigate it via the SHA cache (subsequent runs are a <1s lookup + 0s build).
- `molecule-local/workspace-template-<runtime>:<head-sha-12>` plus a `:latest` floating tag for human inspection. The provisioner consumes the SHA-pinned tag (immutable).
- `"local-build mode: Gitea unreachable at https://git.moleculesai.app — verify network or set MOLECULE_IMAGE_REGISTRY to a reachable registry"`. NEVER fall back to GHCR/ECR (that would be a silent prod-cred-leak hazard if an OSS user happened to have ECR creds in their docker config).

Architecture direction: amd64-emulated
Chosen to honor `feedback_local_must_mimic_production`. Tradeoff: 5–10 min first provision on Apple Silicon. Mitigated by:

- `docker buildx build`, so the layer cache works for incremental changes.

Alternatives rejected:

- Building `linux/arm64` images, which the provisioner explicitly rejects (`defaultImagePlatform()` forces amd64). Forking the platform decision in local mode would make debug behavior diverge from prod, violating `feedback_local_must_mimic_production`.

Progress UX
Workspace stays in `provisioning` for the duration. We emit structured log lines at every step (`local-build: cloning <url>`, `local-build: clone complete (<sha>)`, `local-build: docker build start`, `local-build: docker build done (<duration>)`). The platform server's existing log surface is sufficient for OSS contributor UX; no new HTTP/WebSocket events.

Default for OSS contributor
`git clone https://git.moleculesai.app/molecule-ai/molecule-core && cd molecule-core && go run ./workspace-server/cmd/server` boots end-to-end. The first workspace-create takes 5–10 min while the first runtime image builds. Subsequent provisions reuse the cached image. Zero env vars required.

Alternatives rejected
- `MOLECULE_LOCAL_BUILD=1`: requires OSS contributors to know it exists. Violates the zero-config requirement.

Security review
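A sketch of the token handling from the review, assuming hypothetical helpers `withGiteaToken` and `redactURL` built on Go's `net/url` (the PR's own `redactURL` implementation is not shown in this description). Masking keeps the username, so log lines still show that a credential was present without revealing it.

```go
package main

import (
	"fmt"
	"net/url"
)

// withGiteaToken adds oauth2:<token> userinfo when token is non-empty,
// producing the https://oauth2:<token>@… form from the review.
func withGiteaToken(repoURL, token string) (string, error) {
	u, err := url.Parse(repoURL)
	if err != nil {
		return "", err
	}
	if token != "" {
		u.User = url.UserPassword("oauth2", token)
	}
	return u.String(), nil
}

// redactURL masks credentials before a URL reaches a log line. A parse error
// yields a placeholder, never the raw input, since that may hold the token.
func redactURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable URL>"
	}
	if u.User != nil {
		if _, hasPassword := u.User.Password(); hasPassword {
			u.User = url.UserPassword(u.User.Username(), "xxxxx")
		} else {
			u.User = url.User("xxxxx") // a bare token in the username slot
		}
	}
	return u.String()
}

func main() {
	authed, err := withGiteaToken("https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-claude-code", "example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Only the redacted form is ever logged.
	fmt.Println("local-build: cloning", redactURL(authed))
}
```

`url.URL.Redacted()` from the standard library would cover the password case on its own; the explicit branch here also masks a token placed in the username slot.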
- Clone only from the fixed prefix `https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-` and ONLY accept `<runtime>` from the known-runtimes list (allowlist). Forks via env override `MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX` (default off; opt-in for forks).
- If `MOLECULE_GITEA_TOKEN` is set, pass it via `https://oauth2:<token>@…`. The token NEVER appears in log lines (masked via the `redactURL` helper).
- A contributor running `go run` already trusts the `molecule-ai/molecule-ai-workspace-template-*` repos (the same trust that would apply to the published GHCR images).
- The `docker build` invocation passes NO `--build-arg` from external input. The Dockerfile is consumed as-is.
- The build cache lives under the user's home (`$HOME/.cache`), so cross-user attacks aren't relevant in single-user dev mode.

Versioning + back-compat
- `MOLECULE_IMAGE_REGISTRY=<ECR url>` → unchanged behavior.

Phase 3+4 — Implementation + Verification
Follows in PR. Tests cover: mode detection (registry set/unset/empty/garbage), local-mode clone success, local-mode clone failure (network/auth/missing-repo/missing-ref), local-mode build success/failure, SaaS-mode untouched.
Ref: Task #194. Closes the OSS contributor onboarding gap.
Filed per Phase 1→4 SOP — investigation + design locked before any code change. See PR for implementation.