molecule-core/workspace-server/internal/provisioner
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook) 4b074f631b feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6)
Add MOLECULE_IMAGE_REGISTRY env var to override the registry prefix used
by all workspace-template image references. Defaults to ghcr.io/molecule-ai
(unchanged for OSS users); set to an ECR URI in production tenants when
mirroring to AWS.

Why this matters: GitHub suspended the Molecule-AI org on 2026-05-06 with
no warning. Production tenants kept running because they had images cached
locally, but any tenant restart (AWS health event, redeploy, OS reboot)
would have failed at `docker pull ghcr.io/molecule-ai/...` because GHCR
returned 401. This change introduces the seam needed to point new pulls at
a registry we control (AWS ECR) by flipping a single env var on Railway.

Design (RFC: molecule-ai/internal#6):

- New `RegistryPrefix()` function in `provisioner/registry.go` reads
  MOLECULE_IMAGE_REGISTRY, falls back to "ghcr.io/molecule-ai".
- New `RuntimeImage(runtime)` returns the canonical ref using the prefix.
- `RuntimeImages` map computed at init via `computeRuntimeImages()` so
  existing callers that range over it still work.
- `DefaultImage` likewise computed via `RuntimeImage(defaultRuntime)`.
- `handlers.TemplateImageRef()` switched from hardcoded format string to
  `provisioner.RegistryPrefix()`.
- `runtime_image_pin.go::resolveRuntimeImage()` automatically inherits
  the prefix change because it reads from `provisioner.RuntimeImages[]`
  and only re-formats the tag suffix to a digest pin.

Alternatives rejected (see RFC):

- Multi-registry fallback chain (try ECR, fall back to GHCR): GHCR is
  locked from outbound for our org, so the fallback never works for us.
  Adds code complexity for no benefit.
- Hardcoded ECR-only switch: couples production code to a specific
  deployment environment. OSS users self-hosting Molecule would need
  the upstream GHCR.
- Self-hosted Harbor / registry-on-Hetzner: adds a component to operate.
  Not justified at 3-tenant scale; AWS ECR is mature and IAM-integrated.

Auth — deliberately NOT changed in this commit:

- For GHCR, the existing `ghcrAuthHeader()` reads GHCR_USER/GHCR_TOKEN.
- For ECR, EC2 user-data installs `amazon-ecr-credential-helper` and adds
  a `credHelpers` entry in `~/.docker/config.json` so the daemon resolves
  ECR credentials via the EC2 instance role on every pull. The Go code
  needs no auth change. This keeps the diff minimal.

Backwards compatibility:

- Additive: env unset → identical behavior to today (GHCR).
- Existing tests reference literal `ghcr.io/molecule-ai/...` strings;
  they continue to pass under the default prefix.
- `RuntimeImages` map preserved for callers that iterate it.
- No interface, schema, API, or migration version bump needed.

Security review:

- No untrusted input: MOLECULE_IMAGE_REGISTRY is set at deploy time
  (Railway env, EC2 user-data), not by users.
- No expanded data collection or logging changes.
- No new permissions: ECR pull permission is a future user-data + IAM
  role change, separate from this code change.
- Worst-case: an attacker who already compromises Railway can swap the
  registry prefix to a malicious URI — same blast radius as compromising
  Railway today, no expansion.

Tests:

- 9 new unit tests in `registry_test.go` covering: default fallback,
  env override, empty env, all 9 known runtimes, unknown runtime,
  override-applies-to-all, computeRuntimeImages map population, env
  reflection, alphabetical ordering pin.
- All existing provisioner + handlers tests continue to pass.
- Mutation-tested mentally: deleting `if v := os.Getenv(...)` makes
  TestRegistryPrefix_RespectsEnv fail. Deleting `for _, r := range
  knownRuntimes` makes TestRuntimeImage_AllKnownRuntimes fail. The test
  suite would catch a regression of the original failure mode.

Rollout plan: this PR is safe to merge with no env change. Production
cutover happens by setting MOLECULE_IMAGE_REGISTRY on Railway after
the AWS ECR mirror is populated (separate ops change, tracked in
issue #6 phases 3b–3f).

Tracking:
- RFC: molecule-ai/internal#6
- Tasks: #97 (ECR setup), #98 (CP fallback)
- Tech debt: runbooks/hetzner-rollout-tech-debt-2026-05-06.md item 7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:23:01 -07:00
..
architecture_test.go test(arch): codify 4 module boundaries as architecture tests (#2344) 2026-04-29 22:12:58 -07:00
backend_contract_test.go refactor(handlers): widen WorkspaceHandler.provisioner to LocalProvisionerAPI interface (#2369) 2026-04-30 09:18:16 -07:00
cp_provisioner_instance_id_test.go test: regression guard for #1738 — cp-provisioner uses real instance_id 2026-04-23 17:45:13 -07:00
cp_provisioner_test.go fix(cp-provisioner): surface CP non-2xx on Stop to plug EC2 leak 2026-05-01 22:59:01 -07:00
cp_provisioner.go feat(workspace-server): structured logging at provisioning boundaries 2026-05-05 12:30:11 -07:00
isrunning_test.go fix(provisioner): treat "removal already in progress" as no-op success 2026-04-27 13:25:32 -07:00
local_provisioner_api.go refactor(handlers): widen WorkspaceHandler.provisioner to LocalProvisionerAPI interface (#2369) 2026-04-30 09:18:16 -07:00
platform_test.go fix(provisioner): force linux/amd64 pull + create on Apple Silicon hosts (#1875) 2026-04-23 14:55:34 -07:00
provisioner_test.go feat(provisioner): digest-pin workspace images via runtime_image_pins (#2272 layer 1) 2026-05-03 02:30:00 -07:00
provisioner.go feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6) 2026-05-06 14:23:01 -07:00
registry_test.go feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6) 2026-05-06 14:23:01 -07:00
registry.go feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6) 2026-05-06 14:23:01 -07:00