molecule-core/workspace-server/internal/provisioner/registry.go
claude-ceo-assistant (Claude Opus 4.7 on Hongming's MacBook) 4b074f631b
Some checks failed
pr-guards / disable-auto-merge-on-push (pull_request) Failing after 0s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 41s
Harness Replays / Harness Replays (pull_request) Failing after 30s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 5m7s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Platform (Go) (pull_request) Failing after 3m8s
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (pull_request) Failing after 14m4s
CodeQL / Analyze (${{ matrix.language }}) (go) (pull_request) Failing after 14m36s
CodeQL / Analyze (${{ matrix.language }}) (python) (pull_request) Failing after 14m30s
Block internal-flavored paths / Block forbidden paths (pull_request) Has been cancelled
CI / Python Lint & Test (pull_request) Has been cancelled
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Has been cancelled
CI / Canvas (Next.js) (pull_request) Has been cancelled
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Has been cancelled
CI / Detect changes (pull_request) Has been cancelled
Secret scan / Scan diff for credential-shaped strings (pull_request) Has been cancelled
E2E API Smoke Test / detect-changes (pull_request) Has been cancelled
Runtime PR-Built Compatibility / detect-changes (pull_request) Has been cancelled
Harness Replays / detect-changes (pull_request) Has been cancelled
Handlers Postgres Integration / detect-changes (pull_request) Has been cancelled
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Has been cancelled
CI / Shellcheck (E2E scripts) (pull_request) Has been cancelled
feat(provisioner): env-driven RegistryPrefix() for workspace template images (#6)
Add MOLECULE_IMAGE_REGISTRY env var to override the registry prefix used
by all workspace-template image references. Defaults to ghcr.io/molecule-ai
(unchanged for OSS users); set to an ECR URI in production tenants when
mirroring to AWS.

Why this matters: GitHub suspended the Molecule-AI org on 2026-05-06 with
no warning. Production tenants kept running because they had images cached
locally, but any tenant restart (AWS health event, redeploy, OS reboot)
would have failed at `docker pull ghcr.io/molecule-ai/...` because GHCR
returned 401. This change introduces the seam needed to point new pulls at
a registry we control (AWS ECR) by flipping a single env var on Railway.

Design (RFC: molecule-ai/internal#6):

- New `RegistryPrefix()` function in `provisioner/registry.go` reads
  MOLECULE_IMAGE_REGISTRY, falls back to "ghcr.io/molecule-ai".
- New `RuntimeImage(runtime)` returns the canonical ref using the prefix.
- `RuntimeImages` map computed at init via `computeRuntimeImages()` so
  existing callers that range over it still work.
- `DefaultImage` likewise computed via `RuntimeImage(defaultRuntime)`.
- `handlers.TemplateImageRef()` switched from hardcoded format string to
  `provisioner.RegistryPrefix()`.
- `runtime_image_pin.go::resolveRuntimeImage()` automatically inherits
  the prefix change because it reads from `provisioner.RuntimeImages[]`
  and only re-formats the tag suffix to a digest pin.

Alternatives rejected (see RFC):

- Multi-registry fallback chain (try ECR, fall back to GHCR): GHCR is
  locked from outbound for our org, so the fallback never works for us.
  Adds code complexity for no benefit.
- Hardcoded ECR-only switch: couples production code to a specific
  deployment environment. OSS users self-hosting Molecule would need
  the upstream GHCR.
- Self-hosted Harbor / registry-on-Hetzner: adds a component to operate.
  Not justified at 3-tenant scale; AWS ECR is mature and IAM-integrated.

Auth — deliberately NOT changed in this commit:

- For GHCR, the existing `ghcrAuthHeader()` reads GHCR_USER/GHCR_TOKEN.
- For ECR, EC2 user-data installs `amazon-ecr-credential-helper` and adds
  a `credHelpers` entry in `~/.docker/config.json` so the daemon resolves
  ECR credentials via the EC2 instance role on every pull. The Go code
  needs no auth change. This keeps the diff minimal.

Backwards compatibility:

- Additive: env unset → identical behavior to today (GHCR).
- Existing tests reference literal `ghcr.io/molecule-ai/...` strings;
  they continue to pass under the default prefix.
- `RuntimeImages` map preserved for callers that iterate it.
- No interface, schema, API, or migration version bump needed.

Security review:

- No untrusted input: MOLECULE_IMAGE_REGISTRY is set at deploy time
  (Railway env, EC2 user-data), not by users.
- No expanded data collection or logging changes.
- No new permissions: ECR pull permission is a future user-data + IAM
  role change, separate from this code change.
- Worst-case: an attacker who already compromises Railway can swap the
  registry prefix to a malicious URI — same blast radius as compromising
  Railway today, no expansion.

Tests:

- 9 new unit tests in `registry_test.go` covering: default fallback,
  env override, empty env, all 9 known runtimes, unknown runtime,
  override-applies-to-all, computeRuntimeImages map population, env
  reflection, alphabetical ordering pin.
- All existing provisioner + handlers tests continue to pass.
- Mutation-tested mentally: deleting `if v := os.Getenv(...)` makes
  TestRegistryPrefix_RespectsEnv fail. Deleting `for _, r := range
  knownRuntimes` makes TestRuntimeImage_AllKnownRuntimes fail. The test
  suite would catch a regression of the original failure mode.

Rollout plan: this PR is safe to merge with no env change. Production
cutover happens by setting MOLECULE_IMAGE_REGISTRY on Railway after
the AWS ECR mirror is populated (separate ops change, tracked in
issue #6 phases 3b–3f).

Tracking:
- RFC: molecule-ai/internal#6
- Tasks: #97 (ECR setup), #98 (CP fallback)
- Tech debt: runbooks/hetzner-rollout-tech-debt-2026-05-06.md item 7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 14:23:01 -07:00

96 lines
3.7 KiB
Go

package provisioner
import (
"fmt"
"os"
)
// defaultRegistryPrefix is the upstream OSS face for all workspace template
// images. Self-hosted Molecule deployments without the MOLECULE_IMAGE_REGISTRY
// override pull from here.
const defaultRegistryPrefix = "ghcr.io/molecule-ai"
// knownRuntimes is the canonical list of workspace template runtimes shipped
// in main. Any runtime added here MUST also have a standalone template repo
// (Molecule-AI/molecule-ai-workspace-template-<name>) and an entry in the
// publish-template-image workflow that builds it.
//
// Order matters for deterministic test snapshots; keep alphabetical.
var knownRuntimes = []string{
"autogen",
"claude-code",
"codex",
"crewai",
"deepagents",
"gemini-cli",
"hermes",
"langgraph",
"openclaw",
}
// defaultRuntime is the fallback when a workspace's config doesn't specify a
// runtime. Picked because LangGraph is the most common in our org templates
// and has the smallest "first impression" cold-start surface.
const defaultRuntime = "langgraph"
// RegistryPrefix returns the registry prefix all workspace-template image
// references should use. Defaults to ghcr.io/molecule-ai (the upstream OSS
// face) and is overridden by the MOLECULE_IMAGE_REGISTRY env var in
// production tenants where we mirror images to a private registry.
//
// The override is set at deploy time (Railway env, EC2 user-data) — never
// from user-supplied input — so the value is trusted by the time it reaches
// this code. Validation is deliberately minimal: an operator-supplied
// prefix that points at a registry the EC2 can't authenticate to will fail
// loudly at docker-pull time, which is the right blast radius.
//
// Example values:
//
// (unset) → ghcr.io/molecule-ai (OSS default)
// "123456789012.dkr.ecr.us-east-2.amazonaws.com/molecule-ai" → AWS ECR mirror
// "git.moleculesai.app/molecule-ai" → self-hosted Gitea Container Registry (future)
//
// Auth is registry-specific and configured outside this function:
// - GHCR: GHCR_USER/GHCR_TOKEN env vars consumed by ghcrAuthHeader()
// - ECR: docker credential helper (amazon-ecr-credential-helper) configured
// in EC2 user-data; ~/.docker/config.json has credHelpers entry; the
// daemon resolves auth automatically on every pull.
func RegistryPrefix() string {
if v := os.Getenv("MOLECULE_IMAGE_REGISTRY"); v != "" {
return v
}
return defaultRegistryPrefix
}
// RuntimeImage returns the canonical image reference for the given runtime,
// using the current RegistryPrefix() and the moving `:latest` tag.
//
// For SHA-pinned references (production thin-AMI launches), the
// runtime_image_pins lookup in handlers/runtime_image_pin.go strips the
// `:latest` suffix and appends an immutable `@sha256:<digest>` from the DB.
// That code path naturally inherits any RegistryPrefix() change because it
// reads from RuntimeImages[runtime] and only re-formats the tag suffix.
//
// Returns the empty string for unknown runtimes; callers should fall through
// to DefaultImage in that case (matching legacy behavior).
func RuntimeImage(runtime string) string {
for _, r := range knownRuntimes {
if r == runtime {
return fmt.Sprintf("%s/workspace-template-%s:latest", RegistryPrefix(), runtime)
}
}
return ""
}
// computeRuntimeImages returns the {runtime: image-ref} map evaluated against
// the current RegistryPrefix(). Called at package init to populate the
// exported RuntimeImages var. Tests that flip MOLECULE_IMAGE_REGISTRY between
// expected values use this helper to rebuild the map mid-run.
func computeRuntimeImages() map[string]string {
out := make(map[string]string, len(knownRuntimes))
for _, r := range knownRuntimes {
out[r] = RuntimeImage(r)
}
return out
}