Add MOLECULE_IMAGE_REGISTRY env var to override the registry prefix used by all workspace-template image references. Defaults to ghcr.io/molecule-ai (unchanged for OSS users); set to an ECR URI in production tenants when mirroring to AWS.

Why this matters: GitHub suspended the Molecule-AI org on 2026-05-06 with no warning. Production tenants kept running because they had images cached locally, but any tenant restart (AWS health event, redeploy, OS reboot) would have failed at `docker pull ghcr.io/molecule-ai/...` because GHCR returned 401. This change introduces the seam needed to point new pulls at a registry we control (AWS ECR) by flipping a single env var on Railway.

Design (RFC: molecule-ai/internal#6):
- New `RegistryPrefix()` function in `provisioner/registry.go` reads MOLECULE_IMAGE_REGISTRY, falls back to "ghcr.io/molecule-ai".
- New `RuntimeImage(runtime)` returns the canonical ref using the prefix.
- `RuntimeImages` map computed at init via `computeRuntimeImages()` so existing callers that range over it still work.
- `DefaultImage` likewise computed via `RuntimeImage(defaultRuntime)`.
- `handlers.TemplateImageRef()` switched from hardcoded format string to `provisioner.RegistryPrefix()`.
- `runtime_image_pin.go::resolveRuntimeImage()` automatically inherits the prefix change because it reads from `provisioner.RuntimeImages[]` and only re-formats the tag suffix to a digest pin.

Alternatives rejected (see RFC):
- Multi-registry fallback chain (try ECR, fall back to GHCR): GHCR is locked from outbound for our org, so the fallback never works for us. Adds code complexity for no benefit.
- Hardcoded ECR-only switch: couples production code to a specific deployment environment. OSS users self-hosting Molecule would need the upstream GHCR.
- Self-hosted Harbor / registry-on-Hetzner: adds a component to operate. Not justified at 3-tenant scale; AWS ECR is mature and IAM-integrated.

Auth (deliberately NOT changed in this commit):
- For GHCR, the existing `ghcrAuthHeader()` reads GHCR_USER/GHCR_TOKEN.
- For ECR, EC2 user-data installs `amazon-ecr-credential-helper` and adds a `credHelpers` entry in `~/.docker/config.json` so the daemon resolves ECR credentials via the EC2 instance role on every pull.

The Go code needs no auth change. This keeps the diff minimal.

Backwards compatibility:
- Additive: env unset → identical behavior to today (GHCR).
- Existing tests reference literal `ghcr.io/molecule-ai/...` strings; they continue to pass under the default prefix.
- `RuntimeImages` map preserved for callers that iterate it.
- No interface, schema, API, or migration version bump needed.

Security review:
- No untrusted input: MOLECULE_IMAGE_REGISTRY is set at deploy time (Railway env, EC2 user-data), not by users.
- No expanded data collection or logging changes.
- No new permissions: ECR pull permission is a future user-data + IAM role change, separate from this code change.
- Worst-case: an attacker who already compromises Railway can swap the registry prefix to a malicious URI. That is the same blast radius as compromising Railway today, no expansion.

Tests:
- 9 new unit tests in `registry_test.go` covering: default fallback, env override, empty env, all 9 known runtimes, unknown runtime, override-applies-to-all, computeRuntimeImages map population, env reflection, alphabetical ordering pin.
- All existing provisioner + handlers tests continue to pass.
- Mutation-tested mentally: deleting `if v := os.Getenv(...)` makes TestRegistryPrefix_RespectsEnv fail. Deleting `for _, r := range knownRuntimes` makes TestRuntimeImage_AllKnownRuntimes fail. The test suite would catch a regression of the original failure mode.

Rollout plan: this PR is safe to merge with no env change. Production cutover happens by setting MOLECULE_IMAGE_REGISTRY on Railway after the AWS ECR mirror is populated (separate ops change, tracked in issue #6 phases 3b–3f).

Tracking:
- RFC: molecule-ai/internal#6
- Tasks: #97 (ECR setup), #98 (CP fallback)
- Tech debt: runbooks/hetzner-rollout-tech-debt-2026-05-06.md item 7

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
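The design notes say the digest-pin path only re-formats the tag suffix and therefore inherits whatever registry prefix is in effect. The sketch below illustrates that idea under stated assumptions: `pinToDigest` is a hypothetical helper, not the actual `resolveRuntimeImage()`, and the digest is a placeholder rather than a real image digest.

```go
package main

import (
	"fmt"
	"strings"
)

// pinToDigest is a hypothetical helper (NOT the real resolveRuntimeImage()).
// It drops a trailing tag such as ":latest" from ref and appends an
// immutable "@sha256:<digest>" suffix. The registry prefix is never touched,
// so any prefix override carries through unchanged.
func pinToDigest(ref, digest string) string {
	// Only a colon after the last slash is a tag separator; a colon before
	// it belongs to a registry host:port (e.g. "localhost:5000/img").
	lastSlash := strings.LastIndex(ref, "/")
	if i := strings.LastIndex(ref, ":"); i > lastSlash {
		ref = ref[:i]
	}
	return ref + "@sha256:" + digest
}

func main() {
	digest := strings.Repeat("ab", 32) // placeholder 64-hex-char digest

	// Same function, two different prefixes: only the suffix changes.
	fmt.Println(pinToDigest("ghcr.io/molecule-ai/workspace-template-langgraph:latest", digest))
	fmt.Println(pinToDigest("123456789012.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/workspace-template-langgraph:latest", digest))
}
```

Because the helper only rewrites what follows the last slash, it needs no knowledge of which registry is in use, which is presumably why the pin path required no change in this commit.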
96 lines
3.7 KiB
Go
package provisioner

import (
	"fmt"
	"os"
)

// defaultRegistryPrefix is the upstream OSS face for all workspace template
// images. Self-hosted Molecule deployments without the MOLECULE_IMAGE_REGISTRY
// override pull from here.
const defaultRegistryPrefix = "ghcr.io/molecule-ai"

// knownRuntimes is the canonical list of workspace template runtimes shipped
// in main. Any runtime added here MUST also have a standalone template repo
// (Molecule-AI/molecule-ai-workspace-template-<name>) and an entry in the
// publish-template-image workflow that builds it.
//
// Order matters for deterministic test snapshots; keep alphabetical.
var knownRuntimes = []string{
	"autogen",
	"claude-code",
	"codex",
	"crewai",
	"deepagents",
	"gemini-cli",
	"hermes",
	"langgraph",
	"openclaw",
}

// defaultRuntime is the fallback when a workspace's config doesn't specify a
// runtime. Picked because LangGraph is the most common in our org templates
// and has the smallest "first impression" cold-start surface.
const defaultRuntime = "langgraph"

// RegistryPrefix returns the registry prefix all workspace-template image
// references should use. Defaults to ghcr.io/molecule-ai (the upstream OSS
// face) and is overridden by the MOLECULE_IMAGE_REGISTRY env var in
// production tenants where we mirror images to a private registry.
//
// The override is set at deploy time (Railway env, EC2 user-data), never
// from user-supplied input, so the value is trusted by the time it reaches
// this code. Validation is deliberately minimal: an operator-supplied
// prefix that points at a registry the EC2 can't authenticate to will fail
// loudly at docker-pull time, which is the right blast radius.
//
// Example values:
//
//	(unset)                                                     → ghcr.io/molecule-ai (OSS default)
//	"123456789012.dkr.ecr.us-east-2.amazonaws.com/molecule-ai" → AWS ECR mirror
//	"git.moleculesai.app/molecule-ai"                           → self-hosted Gitea Container Registry (future)
//
// Auth is registry-specific and configured outside this function:
//   - GHCR: GHCR_USER/GHCR_TOKEN env vars consumed by ghcrAuthHeader()
//   - ECR: docker credential helper (amazon-ecr-credential-helper) configured
//     in EC2 user-data; ~/.docker/config.json has credHelpers entry; the
//     daemon resolves auth automatically on every pull.
func RegistryPrefix() string {
	if v := os.Getenv("MOLECULE_IMAGE_REGISTRY"); v != "" {
		return v
	}
	return defaultRegistryPrefix
}

// RuntimeImage returns the canonical image reference for the given runtime,
// using the current RegistryPrefix() and the moving `:latest` tag.
//
// For SHA-pinned references (production thin-AMI launches), the
// runtime_image_pins lookup in handlers/runtime_image_pin.go strips the
// `:latest` suffix and appends an immutable `@sha256:<digest>` from the DB.
// That code path naturally inherits any RegistryPrefix() change because it
// reads from RuntimeImages[runtime] and only re-formats the tag suffix.
//
// Returns the empty string for unknown runtimes; callers should fall through
// to DefaultImage in that case (matching legacy behavior).
func RuntimeImage(runtime string) string {
	for _, r := range knownRuntimes {
		if r == runtime {
			return fmt.Sprintf("%s/workspace-template-%s:latest", RegistryPrefix(), runtime)
		}
	}
	return ""
}

// computeRuntimeImages returns the {runtime: image-ref} map evaluated against
// the current RegistryPrefix(). Called at package init to populate the
// exported RuntimeImages var. Tests that flip MOLECULE_IMAGE_REGISTRY between
// expected values use this helper to rebuild the map mid-run.
func computeRuntimeImages() map[string]string {
	out := make(map[string]string, len(knownRuntimes))
	for _, r := range knownRuntimes {
		out[r] = RuntimeImage(r)
	}
	return out
}
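The comment on `computeRuntimeImages()` notes that tests rebuild the map after flipping the env var. The standalone sketch below shows why that might be necessary: a map built once keeps the prefix that was in effect when it was computed. It is a simplified illustration, not the repository's test code. The functions are lowercased copies of the ones above, the runtime list is shortened, and `withRegistry` and `registry.example.com` are hypothetical names introduced only for this example.

```go
package main

import (
	"fmt"
	"os"
)

const defaultRegistryPrefix = "ghcr.io/molecule-ai"

// Shortened list for the sketch; the real file carries nine runtimes.
var knownRuntimes = []string{"claude-code", "langgraph"}

func registryPrefix() string {
	if v := os.Getenv("MOLECULE_IMAGE_REGISTRY"); v != "" {
		return v
	}
	return defaultRegistryPrefix
}

func runtimeImage(runtime string) string {
	for _, r := range knownRuntimes {
		if r == runtime {
			return fmt.Sprintf("%s/workspace-template-%s:latest", registryPrefix(), runtime)
		}
	}
	return "" // unknown runtime
}

func computeRuntimeImages() map[string]string {
	out := make(map[string]string, len(knownRuntimes))
	for _, r := range knownRuntimes {
		out[r] = runtimeImage(r)
	}
	return out
}

// withRegistry is a hypothetical helper: it sets MOLECULE_IMAGE_REGISTRY for
// the duration of fn and restores the previous state afterwards.
func withRegistry(value string, fn func()) {
	old, had := os.LookupEnv("MOLECULE_IMAGE_REGISTRY")
	os.Setenv("MOLECULE_IMAGE_REGISTRY", value)
	defer func() {
		if had {
			os.Setenv("MOLECULE_IMAGE_REGISTRY", old)
		} else {
			os.Unsetenv("MOLECULE_IMAGE_REGISTRY")
		}
	}()
	fn()
}

func main() {
	// Stands in for the map computed once at package init.
	snapshot := computeRuntimeImages()

	withRegistry("registry.example.com/molecule-ai", func() {
		// The snapshot still carries the prefix from when it was built...
		fmt.Println("snapshot:", snapshot["langgraph"])
		// ...while a rebuilt map reflects the override.
		fmt.Println("rebuilt: ", computeRuntimeImages()["langgraph"])
	})
}
```

In production this distinction should rarely matter, since the variable is set before the process starts; it mainly affects tests that change the environment mid-run.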