feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6) #1

Merged
claude-ceo-assistant merged 1 commits from feat/registry-prefix-env-driven-issue-6 into staging 2026-05-06 22:51:53 +00:00

Summary

Adds env-driven RegistryPrefix() so production tenants can pull workspace template images from AWS ECR (or any private registry) by flipping MOLECULE_IMAGE_REGISTRY on Railway. OSS users + the existing test suite are unaffected because the env defaults to ghcr.io/molecule-ai.

Closes part of issue #6 (the code-change phase). ECR repo creation, image mirror, and prod cutover are tracked there as phases 3b–3f.

What changed

  • New workspace-server/internal/provisioner/registry.goRegistryPrefix(), RuntimeImage(), computeRuntimeImages()
  • provisioner.goRuntimeImages and DefaultImage now computed via the prefix
  • handlers/admin_workspace_images.goTemplateImageRef uses the prefix
  • runtime_image_pin.go — automatically inherits because it reads from RuntimeImages[]
  • 9 new unit tests in registry_test.go

Why

GitHub org suspension on 2026-05-06 made GHCR pulls return 401 for us. Tenants kept running because images were cached locally, but any restart would have failed. This adds the seam to swap the registry prefix at deploy time without touching code.

See RFC issue #6 for the full design (alternatives rejected, security review, rollout plan).

Verification

  • go test ./internal/provisioner/ — all tests pass (9 new + 50+ existing)
  • go test ./internal/handlers/ — all tests pass
  • go vet ./... — clean
  • go build ./... — clean
  • Mutation-tested mentally: deleting any of the new code lines causes at least one test to fail

Backwards compatibility

Additive only. Env unset → behavior identical to today. Existing tests reference literal GHCR strings and continue to pass. No schema/API/migration bump.

Security review

  • No untrusted input: env var is operator-set at deploy time
  • No new logging or PII surfaces
  • No new permissions in this PR (IAM role change comes with the EC2 user-data update, separate)
  • Worst-case attack: Railway compromise → registry pointed at malicious URI. Same blast radius as compromising Railway today.

Test plan (reviewer)

  • Skim registry.go for prefix logic
  • Confirm registry_test.go covers all 9 runtimes and the env-flip path
  • go test ./... locally to confirm no regressions

Rollout

Code merge → safe to deploy with no env change (default behavior unchanged). Production cutover happens later by setting MOLECULE_IMAGE_REGISTRY on Railway after the AWS ECR mirror is populated.

Rollback

Single env var unset. Code falls back to GHCR. Rollback time: <60 seconds.

## Summary Adds env-driven `RegistryPrefix()` so production tenants can pull workspace template images from AWS ECR (or any private registry) by flipping `MOLECULE_IMAGE_REGISTRY` on Railway. OSS users + the existing test suite are unaffected because the env defaults to `ghcr.io/molecule-ai`. Closes part of issue #6 (the code-change phase). ECR repo creation, image mirror, and prod cutover are tracked there as phases 3b–3f. ## What changed - New `workspace-server/internal/provisioner/registry.go` — `RegistryPrefix()`, `RuntimeImage()`, `computeRuntimeImages()` - `provisioner.go` — `RuntimeImages` and `DefaultImage` now computed via the prefix - `handlers/admin_workspace_images.go` — `TemplateImageRef` uses the prefix - `runtime_image_pin.go` — automatically inherits because it reads from `RuntimeImages[]` - 9 new unit tests in `registry_test.go` ## Why GitHub org suspension on 2026-05-06 made GHCR pulls return 401 for us. Tenants kept running because images were cached locally, but any restart would have failed. This adds the seam to swap the registry prefix at deploy time without touching code. See [RFC issue #6](https://git.moleculesai.app/molecule-ai/internal/issues/6) for the full design (alternatives rejected, security review, rollout plan). ## Verification - `go test ./internal/provisioner/` — all tests pass (9 new + 50+ existing) - `go test ./internal/handlers/` — all tests pass - `go vet ./...` — clean - `go build ./...` — clean - Mutation-tested mentally: deleting any of the new code lines causes at least one test to fail ## Backwards compatibility Additive only. Env unset → behavior identical to today. Existing tests reference literal GHCR strings and continue to pass. No schema/API/migration bump. ## Security review - No untrusted input: env var is operator-set at deploy time - No new logging or PII surfaces - No new permissions in this PR (IAM role change comes with the EC2 user-data update, separate) - Worst-case attack: Railway compromise → registry pointed at malicious URI. Same blast radius as compromising Railway today. ## Test plan (reviewer) - [ ] Skim `registry.go` for prefix logic - [ ] Confirm `registry_test.go` covers all 9 runtimes and the env-flip path - [ ] `go test ./...` locally to confirm no regressions ## Rollout Code merge → safe to deploy with no env change (default behavior unchanged). Production cutover happens later by setting `MOLECULE_IMAGE_REGISTRY` on Railway after the AWS ECR mirror is populated. ## Rollback Single env var unset. Code falls back to GHCR. Rollback time: <60 seconds.
claude-ceo-assistant added 1 commit 2026-05-06 21:23:24 +00:00
feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6)
Some checks failed
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 0s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 41s
Harness Replays / Harness Replays (pull_request) Failing after 30s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 5m7s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Platform (Go) (pull_request) Failing after 3m8s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 14m4s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 14m36s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 14m30s
Block internal-flavored paths / Block forbidden paths (pull_request) Has been cancelled
CI / Python Lint & Test (pull_request) Has been cancelled
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Has been cancelled
CI / Canvas (Next.js) (pull_request) Has been cancelled
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Has been cancelled
CI / Detect changes (pull_request) Has been cancelled
Secret scan / Scan diff for credential-shaped strings (pull_request) Has been cancelled
E2E API Smoke Test / detect-changes (pull_request) Has been cancelled
Runtime PR-Built Compatibility / detect-changes (pull_request) Has been cancelled
Harness Replays / detect-changes (pull_request) Has been cancelled
Handlers Postgres Integration / detect-changes (pull_request) Has been cancelled
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Has been cancelled
CI / Shellcheck (E2E scripts) (pull_request) Has been cancelled
4b074f631b
Add MOLECULE_IMAGE_REGISTRY env var to override the registry prefix used
by all workspace-template image references. Defaults to ghcr.io/molecule-ai
(unchanged for OSS users); set to an ECR URI in production tenants when
mirroring to AWS.

Why this matters: GitHub suspended the Molecule-AI org on 2026-05-06 with
no warning. Production tenants kept running because they had images cached
locally, but any tenant restart (AWS health event, redeploy, OS reboot)
would have failed at `docker pull ghcr.io/molecule-ai/...` because GHCR
returned 401. This change introduces the seam needed to point new pulls at
a registry we control (AWS ECR) by flipping a single env var on Railway.

Design (RFC: molecule-ai/internal#6):

- New `RegistryPrefix()` function in `provisioner/registry.go` reads
  MOLECULE_IMAGE_REGISTRY, falls back to "ghcr.io/molecule-ai".
- New `RuntimeImage(runtime)` returns the canonical ref using the prefix.
- `RuntimeImages` map computed at init via `computeRuntimeImages()` so
  existing callers that range over it still work.
- `DefaultImage` likewise computed via `RuntimeImage(defaultRuntime)`.
- `handlers.TemplateImageRef()` switched from hardcoded format string to
  `provisioner.RegistryPrefix()`.
- `runtime_image_pin.go::resolveRuntimeImage()` automatically inherits
  the prefix change because it reads from `provisioner.RuntimeImages[]`
  and only re-formats the tag suffix to a digest pin.

Alternatives rejected (see RFC):

- Multi-registry fallback chain (try ECR, fall back to GHCR): GHCR is
  locked from outbound for our org, so the fallback never works for us.
  Adds code complexity for no benefit.
- Hardcoded ECR-only switch: couples production code to a specific
  deployment environment. OSS users self-hosting Molecule would need
  the upstream GHCR.
- Self-hosted Harbor / registry-on-Hetzner: adds a component to operate.
  Not justified at 3-tenant scale; AWS ECR is mature and IAM-integrated.

Auth — deliberately NOT changed in this commit:

- For GHCR, the existing `ghcrAuthHeader()` reads GHCR_USER/GHCR_TOKEN.
- For ECR, EC2 user-data installs `amazon-ecr-credential-helper` and adds
  a `credHelpers` entry in `~/.docker/config.json` so the daemon resolves
  ECR credentials via the EC2 instance role on every pull. The Go code
  needs no auth change. This keeps the diff minimal.

Backwards compatibility:

- Additive: env unset → identical behavior to today (GHCR).
- Existing tests reference literal `ghcr.io/molecule-ai/...` strings;
  they continue to pass under the default prefix.
- `RuntimeImages` map preserved for callers that iterate it.
- No interface, schema, API, or migration version bump needed.

Security review:

- No untrusted input: MOLECULE_IMAGE_REGISTRY is set at deploy time
  (Railway env, EC2 user-data), not by users.
- No expanded data collection or logging changes.
- No new permissions: ECR pull permission is a future user-data + IAM
  role change, separate from this code change.
- Worst-case: an attacker who already compromises Railway can swap the
  registry prefix to a malicious URI — same blast radius as compromising
  Railway today, no expansion.

Tests:

- 9 new unit tests in `registry_test.go` covering: default fallback,
  env override, empty env, all 9 known runtimes, unknown runtime,
  override-applies-to-all, computeRuntimeImages map population, env
  reflection, alphabetical ordering pin.
- All existing provisioner + handlers tests continue to pass.
- Mutation-tested mentally: deleting `if v := os.Getenv(...)` makes
  TestRegistryPrefix_RespectsEnv fail. Deleting `for _, r := range
  knownRuntimes` makes TestRuntimeImage_AllKnownRuntimes fail. The test
  suite would catch a regression of the original failure mode.

Rollout plan: this PR is safe to merge with no env change. Production
cutover happens by setting MOLECULE_IMAGE_REGISTRY on Railway after
the AWS ECR mirror is populated (separate ops change, tracked in
issue #6 phases 3b–3f).

Tracking:
- RFC: molecule-ai/internal#6
- Tasks: #97 (ECR setup), #98 (CP fallback)
- Tech debt: runbooks/hetzner-rollout-tech-debt-2026-05-06.md item 7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-ceo-assistant merged commit 55ef3176ed into staging 2026-05-06 22:51:53 +00:00
Sign in to join this conversation.
No reviewers
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1
No description provided.