feat(workspace-server): local-dev provisioner builds from Gitea source when MOLECULE_IMAGE_REGISTRY is unset (Task #194) #63

Closed
opened 2026-05-07 22:02:49 +00:00 by claude-ceo-assistant · 0 comments

Phase 1 — Investigation

Root cause

Provisioner image resolution treats GHCR as the OSS default (ghcr.io/molecule-ai/workspace-template-<runtime>:latest via RegistryPrefix() in workspace-server/internal/provisioner/registry.go). Post-2026-05-06 the Molecule-AI GitHub org was suspended; GHCR now returns 403 for every workspace-template-* manifest. OSS contributors who clone molecule-core and go run ./workspace-server/cmd/server cannot provision a workspace — first provision fails with:

docker image "ghcr.io/molecule-ai/workspace-template-claude-code:latest" not found after pull attempt — verify GHCR visibility for claude-code and that the tenant has internet access

Prod tenants are unaffected because every prod tenant sets MOLECULE_IMAGE_REGISTRY to the AWS ECR mirror via Railway env + EC2 user-data.

Reproduction (verified 2026-05-07):

$ curl -H "Authorization: Bearer <ghcr-pull-token>" -I https://ghcr.io/v2/molecule-ai/workspace-template-claude-code/manifests/latest
HTTP/2 403

Affected surfaces (all in workspace-server)

  1. internal/provisioner/registry.go — RegistryPrefix() is the SSOT for mode; RuntimeImage() and computeRuntimeImages() produce image refs.
  2. internal/provisioner/provisioner.go — Start() calls selectImage(cfg) → RuntimeImages[runtime]; pulls via pullImageAndDrain. Hardcodes linux/amd64 platform on Apple Silicon (existing emulation behavior, unchanged here).
  3. internal/provisioner/cp_provisioner.go — SaaS path; calls control plane HTTP API. Does NOT consult RuntimeImages directly. Untouched by this change.
  4. internal/handlers/admin_workspace_images.go — TemplateImageRef() mirrors the registry decision for the manual /admin/workspace-images/refresh route. Must stay aligned.
  5. internal/imagewatch/watch.go — auto-refresh polls https://ghcr.io/v2/molecule-ai/workspace-template-<rt>/manifests/latest. Hardcoded GHCR. Gated behind IMAGE_AUTO_REFRESH=true (off by default in local dev). Out of scope: the watcher should NOT run in local-build mode; the gate already covers that since OSS contributors don't set the env.
  6. internal/provisioner/registry_test.go — pins existing behavior; needs extension for local-build mode.
  7. docs/development/local-development.md — the current doc says docker compose up boots everything. After this change, the first provision will trigger a clone+build that takes 5–10 min on Apple Silicon; the doc must call this out.
  8. OSS-template-side runbook + known-issues at ~/Documents/GitHub/molecule-ai-workspace-template-claude-code/runbooks/local-dev-setup.md and known-issues.md §5 — currently document a wrong-shaped retag workaround. Must be replaced (separate PR in template repo).

Other registry-deciding code paths searched

grep -rn "MOLECULE_IMAGE_REGISTRY\|RegistryPrefix\|workspace-template-" workspace-server confirms the env var is consulted in exactly one place (registry.go); every other call site reads via RegistryPrefix() / RuntimeImage(). Q2 reading holds: extending RegistryPrefix semantics propagates everywhere needed.

Gitea template-repo coverage

Verified via the Gitea API that 4 of 9 runtimes have their template repos mirrored to Gitea today:

runtime              on Gitea?  Dockerfile?
claude-code          yes        yes
hermes               yes        yes
langgraph (default)  yes        yes
autogen              yes        yes
crewai               no         —
deepagents           no         —
codex                no         —
gemini-cli           no         —
openclaw             no         —

Local-build mode succeeds for the 4 mirrored runtimes today; for the 5 unmirrored ones, fail-loud with an actionable error message naming the missing repo. Mirroring those repos is out of scope (separate task).
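The fail-loud behavior for unmirrored runtimes can be sketched as below. This is illustrative only: the map mirrors the coverage table above, and `templateRepoURL` is a hypothetical helper, not the actual molecule-core code.

```go
package main

import "fmt"

// giteaMirrored mirrors the coverage table above: only the first four
// runtimes have template repos on Gitea today. Illustrative, not SSOT.
var giteaMirrored = map[string]bool{
	"claude-code": true, "hermes": true, "langgraph": true, "autogen": true,
	"crewai": false, "deepagents": false, "codex": false,
	"gemini-cli": false, "openclaw": false,
}

// templateRepoURL returns the Gitea clone URL for a runtime, or a fail-loud
// error naming the missing repo when the runtime isn't mirrored yet.
func templateRepoURL(runtime string) (string, error) {
	mirrored, known := giteaMirrored[runtime]
	if !known {
		return "", fmt.Errorf("unknown runtime %q", runtime)
	}
	url := "https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-" + runtime
	if !mirrored {
		return "", fmt.Errorf(
			"local-build mode: template repo for runtime %q is not mirrored to Gitea yet (expected %s) — mirror it or set MOLECULE_IMAGE_REGISTRY",
			runtime, url)
	}
	return url, nil
}

func main() {
	u, _ := templateRepoURL("claude-code")
	fmt.Println(u)
	_, err := templateRepoURL("crewai")
	fmt.Println(err)
}
```

The allowlist doubles as the URL-validation allowlist described in the security review section.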

Architecture mismatch

Provisioner hardcodes linux/amd64 on ContainerCreate (with QEMU emulation on Apple Silicon, see defaultImagePlatform()). Two design choices for local-build:

  1. docker buildx build --platform=linux/amd64 — mimics prod, slow (10–25 min cold on Apple Silicon).
  2. Build native + drop platform pin in local-mode — fast, but diverges from prod runtime behavior.

Decision in the design section below.

Prior art surveyed

  • Tilt docker_build — Tiltfile DSL declares build context + image; daemon caches by content hash; opt-in per developer. Verdict: Partial — adopt the content-hash invalidation idea.
  • Skaffold local builder — skaffold dev watches sources, rebuilds on change, optional --cache-artifacts. Verdict: Reject — too coupled to the Skaffold tooling chain; we just need a one-shot first build.
  • kind (kind load docker-image) — manual import after a manual build. Verdict: Reject — explicit user step; our goal is zero-config.
  • k3d (k3d image import) — same shape as kind. Verdict: Reject — same reason.
  • devcontainer.json image build — spec field; VS Code reads the local Dockerfile and builds. Verdict: Adopt — same UX shape: the tool transparently builds from a known source location.
  • nix-shell + dockerTools — reproducible OCI images from Nix expressions. Verdict: Reject — adds a Nix dependency; out of bounds for OSS contributor onboarding.
  • Buildah/podman build — daemonless; builds via Containerfile. Verdict: Reject — molecule-core already requires the Docker daemon for runtime; a second tool is friction.
  • Cargo / npm install workflow — first run resolves + caches deps; subsequent runs hit the cache. Verdict: Adopt — clone+build the workspace template the same way cargo build resolves crates: transparent, cached, only re-fetches on change.

Best fit: a hybrid of devcontainer.json's transparent build-from-source + Cargo's content-hash cache. We do not need a watch loop or DSL.

Phase 2 — Design

Mode detection (SSOT)

The registry decision becomes a discriminated value rather than a bare string; RegistryPrefix() gives way to Resolve():

type RegistrySource struct {
    Mode     RegistryMode // "saas" or "local"
    Registry string       // populated when Mode == saas
}

func Resolve() RegistrySource

Reason: a discriminated value forces every call site to acknowledge the two modes. A bare string return type would let a future caller silently treat "" as a registry prefix (the exact bug class that originally landed in the OSS-default-vs-ECR-mirror flap).

Back-compat: keep RegistryPrefix() string as a thin shim that returns Registry in SaaS mode and panics in local mode (callers that ignore the mode must opt into the explicit migration). The simpler alternative: one big rename, with every call site updated in the same PR.

Local-mode codepath

  • Cache dir: ${HOME}/.cache/molecule/workspace-template-build/<runtime>/<head-sha>/ (the XDG cache default location; MOLECULE_LOCAL_BUILD_CACHE env override).
  • Cache key: tuple of (template-repo HEAD sha, Dockerfile content hash). HEAD comes from git ls-remote https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-<runtime> (a single HTTP ref listing, no clone). When the key matches an existing cache entry, we skip clone + build entirely.
  • Build invocation: docker buildx build --platform=linux/amd64 -t molecule-local/workspace-template-<runtime>:<sha> -f Dockerfile . from the cloned dir. Choose direction (1), amd64 emulation, to honor feedback_local_must_mimic_production. The tradeoff is build time, accepted; we mitigate via the SHA-cache (subsequent runs are a <1 s lookup and no build).
  • Tag scheme: molecule-local/workspace-template-<runtime>:<head-sha-12> plus a :latest floating tag for human inspection. Provisioner consumes the SHA-pinned tag (immutable).
  • Fallback if Gitea unreachable: fail-closed with message "local-build mode: Gitea unreachable at https://git.moleculesai.app — verify network or set MOLECULE_IMAGE_REGISTRY to a reachable registry". NEVER fall back to GHCR/ECR (would be a silent prod-cred-leak hazard if an OSS user happened to have ECR creds in their docker config).
  • Fallback if runtime not mirrored on Gitea: actionable error naming the missing repo URL.
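The cache-key construction above can be sketched as follows. The function name and key format are illustrative assumptions, not the actual molecule-core implementation; the point is that either input changing re-keys the cache and forces a rebuild:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKey derives the invalidation tuple described above: the template
// repo's HEAD sha (from git ls-remote) plus a content hash of the
// Dockerfile. The sha prefix keeps the key human-readable in the cache dir.
func cacheKey(headSHA string, dockerfile []byte) string {
	h := sha256.Sum256(append([]byte(headSHA+"\n"), dockerfile...))
	return headSHA[:12] + "-" + hex.EncodeToString(h[:])[:12]
}

func main() {
	k1 := cacheKey("a1b2c3d4e5f6a7b8", []byte("FROM ubuntu:24.04\n"))
	k2 := cacheKey("a1b2c3d4e5f6a7b8", []byte("FROM ubuntu:22.04\n"))
	fmt.Println(k1)
	fmt.Println(k2) // same HEAD sha, edited Dockerfile → different key
}
```

A Dockerfile edit without a new commit is unlikely in practice (the Dockerfile comes from the clone), but hashing it anyway guards against a partially written cache dir being mistaken for a valid build.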

Architecture direction: amd64-emulated

Chosen to honor feedback_local_must_mimic_production. Tradeoff: 5–10 min first-provision on Apple Silicon. Mitigated by:

  • Cache hits short-circuit clone + build entirely on subsequent runs.
  • The build invocation uses docker buildx build so layer cache works for incremental changes.
  • Documented in runbook so OSS contributor knows what to expect.

Alternatives rejected:

  • Native arch — faster but creates linux/arm64 images that the provisioner explicitly rejects (defaultImagePlatform() forces amd64). Forking the platform decision in local-mode would diverge debug behavior from prod, violating feedback_local_must_mimic_production.
  • Multi-arch buildx with manifest list — even slower, no benefit since the provisioner always wants amd64.

Progress UX

Workspace stays in provisioning for the duration. We emit structured log lines at every step (local-build: cloning <url>, local-build: clone complete (<sha>), local-build: docker build start, local-build: docker build done (<duration>)). The platform server's existing log surface is sufficient for OSS contributor UX; no new HTTP/WebSocket events.

Default for OSS contributor

git clone https://git.moleculesai.app/molecule-ai/molecule-core && go run ./workspace-server/cmd/server boots end-to-end. The first workspace-create takes 5–10 min to build that runtime's image; subsequent provisions reuse the cached image. Zero env vars required.

Alternatives rejected

  1. New env var MOLECULE_LOCAL_BUILD=1 — requires OSS contributors to know it exists. Violates zero-config requirement.
  2. Push pre-built images to a public Gitea container registry, mirroring tags from upstream — operationally cleaner, BUT: (a) Gitea's container registry add-on isn't deployed yet; (b) it defeats the OSS-contributor goal of "hack on the source, see your changes," since they'd still pull a stale image; and (c) it recreates the same single-registry failure mode that the GHCR 403 just demonstrated.
  3. Embed Dockerfiles in molecule-core itself, drop the standalone template repos — would work but breaks the OSS-shape principle; templates are intentionally separable, anyone-can-fork artifacts. Out of bounds for this fix.

Security review

  • Gitea repo URL validation: hardcode the org prefix https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template- and ONLY accept <runtime> from the known-runtimes list (allowlist). Forks can opt in via the env override MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX (unset by default).
  • Token handling: clone uses anonymous HTTPS (templates are public). If MOLECULE_GITEA_TOKEN is set, pass via https://oauth2:<token>@…. Token NEVER appears in log lines (mask via redactURL helper).
  • Untrusted Dockerfile: trust boundary unchanged from today — operator running go run already trusts molecule-ai/molecule-ai-workspace-template-* repos (same trust that would apply to the published GHCR images).
  • Build-arg injection: docker build invocation passes NO --build-arg from external input. Dockerfile is consumed as-is.
  • Cache poisoning: the cache key includes the Gitea HEAD sha, so a force-push to the template repo's main changes the key on the next ls-remote and triggers a rebuild. The cache dir is per-user ($HOME/.cache), so cross-user attacks aren't relevant in single-user dev mode.
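The token-masking requirement in the second bullet can lean on the standard library. A minimal sketch of the redactURL helper, assuming it only needs to strip userinfo secrets from clone URLs before logging (the helper name comes from the text above; its signature is an assumption):

```go
package main

import (
	"fmt"
	"net/url"
)

// redactURL masks any password embedded in a URL's userinfo section
// (e.g. https://oauth2:<token>@…) before the URL is written to a log line.
// net/url's Redacted() replaces the password with "xxxxx".
func redactURL(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<unparseable url>" // never echo an unparsed string that may hold a token
	}
	return u.Redacted()
}

func main() {
	fmt.Println(redactURL("https://oauth2:s3cr3t@git.moleculesai.app/molecule-ai/molecule-core"))
	// → https://oauth2:xxxxx@git.moleculesai.app/molecule-ai/molecule-core
}
```

Routing every clone-URL log line through this helper keeps MOLECULE_GITEA_TOKEN out of logs even if a future code path forgets the anonymous-HTTPS default.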

Versioning + back-compat

  • Existing prod tenants set MOLECULE_IMAGE_REGISTRY=<ECR url> → unchanged behavior.
  • Existing local installs that set the var → unchanged behavior.
  • Existing local installs that don't set it → switch to local-build path. Migration: none (additive); first provision will take 5–10 min instead of failing.
  • No deprecations. Documented in runbook.

Phase 3+4 — Implementation + Verification

Follows in PR. Tests cover: mode detection (registry set/unset/empty/garbage), local-mode clone success, local-mode clone failure (network/auth/missing-repo/missing-ref), local-mode build success/failure, SaaS-mode untouched.

Ref: Task #194. Closes the OSS contributor onboarding gap.


Filed per Phase 1→4 SOP — investigation + design locked before any code change. See PR for implementation.
