diff --git a/docs/design/rfc-2948-phase1-template-engine-decoupling.md b/docs/design/rfc-2948-phase1-template-engine-decoupling.md new file mode 100644 index 000000000..ff608c598 --- /dev/null +++ b/docs/design/rfc-2948-phase1-template-engine-decoupling.md @@ -0,0 +1,299 @@ +# RFC #2948 Phase 1 — Decouple workspace `template` from `runtime` + +**Status:** Design draft for driver + CTO sign-off +**Author:** Dev Engineer A (Kimi) +**Related:** RFC #2948 (marketplace foundation), RFC #2948 comment 103870 (Researcher pre-design risk surface), #2957/#2958 (PATCH-runtime rejection + seo-agent manifest mapping), #2959 (manifest pins must be merged commits), #2929 (restart settle-window), #2970 (platform-agent identity restoration), RCA #2831/#2948 (config/skill delivery), JRS `28f97a7f` + +## 1. Summary + +Today a workspace has one field — `runtime` — that is overloaded: it names both the **engine adapter** (`claude-code`, `codex`, `hermes`, …) and the **template** that supplies config/prompts/skills. The merged #2958 kept this overload by mapping the `seo-agent` *template* to the `claude-code` *runtime* in `manifest.json`. That unblocks the rejected `PATCH runtime=seo-agent` path, but it still leaves the platform unable to say "this workspace is running the claude-code engine **with** the seo-agent template". + +This design adds a distinct `template` column to the workspace model, changes the RFC #2843 template-asset resolver to a single chokepoint `resolveTemplateAssets(template, runtime, workspaceID)`, threads `template` through the control-plane provision request, and backfills existing SEO workspaces (JRS first) so the next reconcile fetches the full `seo-all` skill bundle instead of the 218-byte stub. + +Phase 1 is **platform-owned templates only** and intentionally reuses the existing #833 platform-token fetch path as a temporary backend. The architecture is shaped so that Phase 2 can drop a marketplace entitlement broker behind the same chokepoint without re-plumbing call sites. No per-seller encryption, no third-party seller credentials, and no standing god-credential inside the workspace are in scope here. + +## 2. Goals / non-goals + +**Goals** +- A workspace stores **one** installed template (`template`) separate from its engine (`runtime`). +- `template` is a **manifest-allowlist value**, validated at the write boundary (create/PATCH) and again at the fetch boundary (defense-in-depth). +- Template-asset resolution goes through a single server-side chokepoint that becomes the Phase 2 broker seam. +- Existing SEO workspaces get `template=seo-agent` via a safe, idempotent, auditable backfill. +- The change is deployable before the CP marketplace broker exists. +- JRS `28f97a7f` boots with `agent_card.skills > 0` after it is tagged `template=seo-agent` and reconciled. + +**Non-goals** +- Marketplace entitlement, per-seller encryption, third-party seller onboarding, or brokered signed-URL delivery (Phase 2). +- Changing `tenant//bootstrap` Secrets Manager contents. +- Removing the existing `TemplateAssetFetcher` channel — we repurpose it and centralize its caller. +- Allowing `PATCH /workspaces/:id` to set `runtime=seo-agent` (still rejected per #2958). +- Allowing arbitrary private repos or free-string template names in Phase 1. + +## 3. Background + +### 3.1 Current model +- `workspaces.runtime` is a `TEXT` column with default `claude-code` (migration `011_workspace_runtime.sql`). +- `models.CreateWorkspacePayload` already carries a `Template` field, used at create time to resolve `runtime` and `model` from `config.yaml` and to pick a local template path. +- The workspace row, however, does **not** persist `template`; the insert SQL only stores `runtime` (`workspace.go:683`). +- `runtime_registry.go` builds `templateRepoByName` from `manifest.json` `workspace_templates[]`, keyed by the template name with `-default` stripped. +- `templateIdentityForRuntime(runtime)` looks up `templateRepoByName[runtime]`. For `runtime=claude-code` this returns the `claude-code-default` repo; for `runtime=seo-agent` it would return the `seo-agent` repo, but `seo-agent` is not a valid runtime after #2958. + +### 3.2 Why the status quo breaks JRS +- JRS workspaces were created as `runtime=claude-code` with no persisted `template`. +- The RFC #2843 fetcher therefore resolves `templateIdentityForRuntime("claude-code")` → `claude-code-default`, never `seo-agent`. +- Even after #2958 added the `seo-agent` manifest entry, there is no workspace field that tells the fetcher to use it. + +## 4. Security model (Phase 1) + +This section folds in the Researcher pre-design risk surface from RFC #2948 comment 103870. + +- **`template` is an allowlist, never a free string.** It keys into the manifest registry (the same SSOT that #2959 pins to immutable, merged commits). A value not in the manifest is rejected at the WRITE boundary (create/PATCH) **and** at the fetch boundary. It never falls through to a constructed path or a repo name. +- **Platform-owned templates only.** The Phase 1 allowlist is the set of platform-owned manifest entries (open templates + our private templates like `seo-agent`). No third-party or arbitrary private repo may be named. +- **Single chokepoint: `resolveTemplateAssets(template, runtime, workspaceID)`.** All asset resolution for a `template` value goes through this function. In Phase 1 it returns the #833 platform-token fetch identity; in Phase 2 the same chokepoint swaps to brokered entitlement + signed URLs. No other call site holds the platform token or constructs a template fetch URL. +- **No standing god-credential in the workspace.** The #833 platform token is held server-side by the chokepoint/fetcher, scoped read-only to platform-owned template repos, and never exposed to the box. The workspace receives only the final assets (or, in Phase 2, a short-lived signed URL). +- **Tenant isolation.** The fetch uses only the template-scoped read-only token; it must never escalate to the requesting workspace's tenant secrets and must never let one tenant's `template` value read another tenant's data. +- **SSRF guard.** Because `template` influences the fetch URL only via the manifest registry, the HTTP path must apply the #2132 posture: dial-time IP guard, no redirects, explicit allowlist. The existing Gitea fetcher constructs URLs only from manifest `repo`/`ref` values, not from user input. + +## 5. Design + +### 5.1 Workspace model — add nullable `template` + +New migration (idempotent, additive): + +```sql +ALTER TABLE workspaces +ADD COLUMN IF NOT EXISTS template TEXT; +``` + +- `NULL` means "no installed template — use runtime fallback". This is the steady state for every existing workspace and for bare `{"name":...}` creates. +- `models.Workspace` gains `Template *string `json:"template,omitempty" db:"template"`` (or `sql.NullString`), with helpers `templateIsSet()` / `templateValue()`. +- `models.CreateWorkspacePayload.Template` already exists; we now persist it when non-empty. +- The create insert SQL becomes: + ```sql + INSERT INTO workspaces (id, name, role, tier, runtime, template, status, ...) + VALUES ($1, $2, $3, $4, $5, NULLIF($6, ''), 'provisioning', ...) + ``` +- The write boundary validates `template` against `templateRepoByName` before insert. Unknown values fail with a clear 400, matching the create-boundary fail-closed posture (#188). + +### 5.2 Single asset-resolution chokepoint + +Introduce a single resolver in `runtime_registry.go` (or a new `internal/handlers/template_assets.go` if the broker grows): + +```go +// TemplateAssetResolution is the only thing callers of the asset +// channel need. In Phase 1 it carries a Gitea identity; in Phase 2 +// it can carry a broker-signed URL or an entitlement-bound fetcher. +type TemplateAssetResolution struct { + Identity string // "/@" (Phase 1) or signed URL (Phase 2) +} + +// resolveTemplateAssets maps a workspace's template/runtime to the +// manifest-registered asset source. It is the ONLY place that: +// 1. looks up templateRepoByName, +// 2. validates the allowlist, +// 3. decides whether to use the #833 platform-token path (Phase 1) +// or a brokered entitlement (Phase 2). +// Callers outside this chokepoint must never construct a fetch URL +// or hold the platform token. +func resolveTemplateAssets(ctx context.Context, template, runtime, workspaceID string) (TemplateAssetResolution, error) { + if template != "" { + rr, ok := templateRepoByName[template] + if !ok { + return TemplateAssetResolution{}, fmt.Errorf("template %q is not in the manifest allowlist", template) + } + return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil + } + rr, ok := templateRepoByName[runtime] + if !ok { + // external / kimi / kimi-cli / mock: no template assets. + return TemplateAssetResolution{}, nil + } + return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil +} +``` + +Rules: +1. If `template` is non-empty and exists in `templateRepoByName`, use it. +2. If `template` is non-empty and **not** in the manifest, fail-closed with an error. +3. If `template` is empty/null, fall back to the current `runtime` lookup for backward compatibility. +4. The resolver is authoritative for assets; `runtime` is authoritative for the engine. Precedence is acyclic: `runtime` must never re-derive or override an explicitly-set `template`. + +Call sites to update: +- `workspace_provision.go` `buildProvisionerConfig`: replace `templateIdentityForRuntimeOrEmpty(payload.Runtime)` with `resolveTemplateAssets(ctx, payload.Template, payload.Runtime, workspaceID).Identity`. +- Any restart/reconcile path that rebuilds a `CreateWorkspacePayload` must populate `payload.Template` from the DB (see §5.4). + +### 5.3 Manifest-pin gate + +`templateRepoByName` inherits the #2959 ancestor-of-default-branch gate: an unmerged or orphaned SHA in `manifest.json` must be rejected at service boot. This prevents a template pointer from resolving to an untrusted commit, regardless of how the pointer is reached. + +### 5.4 Create path + +No API shape change. `workspace.go:Create` already accepts `template` in the payload. The only changes are: +- Validate `payload.Template` against `templateRepoByName` at the write boundary. +- Persist `payload.Template` into the new DB column (NULL when empty). + +The existing runtime/model resolution from `config.yaml` stays unchanged — a caller can still pass `template=seo-agent` and have `runtime` default to `claude-code` from the manifest. + +#### Forward constraint: `model` is NOT template-delivered at provision time + +The `MISSING_MODEL` gate runs at **provision time** and reads the workspace row's `model` column. At that point the template's `config.yaml` has **not** been fetched into the workspace container yet, so `model` cannot be derived from the template at provision time. Phase 1 therefore keeps the current create-time contract: + +- `model` is resolved **once**, at workspace **create** time, from the caller-supplied value or from the chosen template's `config.yaml` manifest/metadata (the same path that already exists today). +- The resolved `model` is persisted in the `workspaces` row and passed to the CP in `cpProvisionRequest`. +- The `MISSING_MODEL` provision-time gate reads this persisted value only; it never waits on a template fetch. +- Template-delivered assets in Phase 1 are limited to **prompts, MCP wiring, and identity files** (`config.yaml` content, `mcp_servers.yaml`, `prompts/concierge.md`, `identity-fallback.sh`). The model identifier is out of scope for template delivery. + +This preserves the #2966 fix and Researcher 103961 posture: the provision-time model gate remains fail-closed and independent of the template-asset fetch channel. + +### 5.5 Restart / reconcile path + +`workspace_restart.go` builds a `CreateWorkspacePayload` from the DB row (line ~444): + +```go +payload := withStoredCompute(ctx, id, models.CreateWorkspacePayload{ + Name: wsName, + Tier: tier, + Runtime: containerRuntime, +}) +``` + +It must also read `template` from `workspaces` and set `payload.Template`. The simplest change is a new helper `withStoredTemplate(ctx, id, payload)` or an inline query; either way, restart provisions use the persisted template. + +### 5.6 CP provision wire + +Add `Template` to the wire shapes so the CP can record it for its own reconciliation and future marketplace use: + +- `provisioner.WorkspaceConfig` gains `Template string` (or `*string`). +- `cp_provisioner.go` `cpProvisionRequest` gains `Template string `json:"template,omitempty"``. +- `molecule-controlplane` `wsProvisionRequest` gains `Template string `json:"template"`. +- The CP stores `template` in the workspace record / `tenant_resources.metadata` / a new lightweight table (CP-side decision). For Phase 1 the only hard requirement is that the CP does not drop the field and can echo it back in status/reconcile responses. + +The CP image-selection path does **not** change: `runtime` still drives `workspace-template-` / `runtime_image_pins.template_name`. The `seo-agent` template uses the `claude-code` image via the existing manifest `"runtime": "claude-code"` mapping. + +### 5.7 Setting `template` on an existing workspace + +The supported assignment path is a new **admin/CP-gated** endpoint: + +``` +PATCH /workspaces/:id/template +{ "template": "seo-agent" } +``` + +Behavior: +- Validates that `template` is a known manifest entry (fail-closed; same allowlist as create). +- Updates `workspaces.template`. +- Returns `{ "status": "updated", "needs_restart": true }`, exactly like `PATCH /workspaces/:id runtime`. +- Does **not** change `runtime`; the engine stays `claude-code`. +- Rejects a template whose manifest `runtime` differs from the workspace's current `runtime`. Phase 1 intentionally does not support cross-engine retargeting; same-engine template variants (e.g. different claude-code personas) are allowed. + +This endpoint is what the JRS support flow / CP automation calls to tag `template=seo-agent`. + +### 5.8 Backfill migration + +Two-part, fully idempotent backfill: + +1. **Explicit data-driven backfill** — workspaces that already have a template recorded in `workspace_config.data`: + ```sql + UPDATE workspaces w + SET template = NULLIF(TRIM(c.data->>'template'), '') + FROM workspace_config c + WHERE c.workspace_id = w.id + AND w.template IS NULL + AND NULLIF(TRIM(c.data->>'template'), '') IS NOT NULL + AND EXISTS ( + SELECT 1 FROM manifest_allowed_templates m + WHERE m.name = NULLIF(TRIM(c.data->>'template'), '') + ); + ``` + (In practice the allowlist check is the same Go helper used at the write boundary.) + +2. **SEO explicit-allowlist backfill** — run a one-off idempotent script for known SEO workspace IDs, starting with JRS `28f97a7f`. If additional SEO workspaces cannot be identified by `workspace_config.data`, use an **explicit workspace-ID allowlist** supplied by ops/PM, or the existing `seo-agent` manifest mapping. **Never** a loose string match on workspace name, env values, or role. + +Safety properties: +- **Idempotency:** gate on `WHERE template IS NULL` so re-runs are no-ops and manually-set templates are never clobbered. +- **Tight predicate:** exact workspace-ID allowlist or exact `workspace_config.data->>template` signal. +- **Canary first:** run JRS `28f97a7f` first, verify behavior, then fleet. +- **Reversible + resumable:** record the changed set; apply per-workspace transactionally; produce a coverage report (tagged/total). A mid-run failure leaves a clean mixed state that a re-run completes. +- **Rollback:** a companion script can reset `template = NULL` for the exact changed set if a catastrophic issue is found. + +### 5.9 Readiness gate and mid-flight changes + +- **Probe-verified readiness.** The workspace is not considered successfully provisioned until the template's assets are fetched and present at the probe path (`/configs/system-prompt.md` per the #2955 lesson, plus `config.yaml`). If assets are missing, fail-closed with a `MISSING_ASSETS` provision abort and retry on the next reconcile (same backstop pattern as `MISSING_MODEL`, core#2594). +- **Fill-absent-only.** Asset delivery never overwrites files that are already present in `/configs/*` (#141 / #833). +- **Template change mid-flight.** Changing `template` triggers a controlled re-fetch + restart using the existing #2929 `restartSettleWindow` pattern. The fetch is idempotent and keyed on the CURRENT record value, so concurrent changes converge to the last writer without half-applied states. + +### 5.10 JRS verification + +After the backfill + a restart/re-provision of JRS `28f97a7f`: +- `resolveTemplateAssets("seo-agent", "claude-code", "28f97a7f")` returns `molecule-ai/molecule-ai-workspace-template-seo-agent@`. +- `giteaTemplateAssetFetcher.Load` returns `agent-skills/seo-all/**`. +- The workspace boots with a non-stub `/configs/config.yaml` and `agent_card.skills` contains the SEO skill package. +- A simple smoke check: `agent_card.skills | length > 0` and the `/seo-*` slash commands are registered. + +## 6. Build split + +- **Dev Engineer A (Kimi) / molecule-core** + - DB migration adding nullable `workspaces.template`. + - `models.Workspace` + create insert SQL + NULL normalization. + - `resolveTemplateAssets` chokepoint + update fetcher call sites. + - Restart path reads `template` from DB. + - `PATCH /workspaces/:id/template` handler + allowlist validation + tests. + - Idempotent backfill migration / script + canary/rollback runbook. + - Core-side unit + integration tests (allowlist, chokepoint, backfill, readiness). + +- **Dev Engineer B (MiniMax) / molecule-controlplane** + - Accept `template` in `wsProvisionRequest`. + - Store `template` in CP workspace metadata. + - CP migration/backfill to mirror core's `template` values where CP keeps its own record. + - JRS tagging automation / endpoint if needed. + - CP-side tests. + +## 7. Test plan + +**Unit / integration (molecule-core)** +- `TestResolveTemplateAssets`: + - template set, known → returns repo@ref. + - template set, unknown → error (fail-closed). + - template empty, runtime known → returns runtime's default template. + - template empty, runtime external/kimi → empty (skip fetcher). +- `TestCreateWorkspace_PersistsTemplate`: create with `template=seo-agent` stores `template=seo-agent`, `runtime=claude-code`; unknown template rejected. +- `TestRestartWorkspace_UsesStoredTemplate`: restart reads `template` from DB and passes it to the resolver. +- `TestPatchTemplate`: admin endpoint rejects unknown template, updates known template, returns `needs_restart`, rejects cross-engine template. +- Migration test: existing workspace with `workspace_config.data->>template="seo-agent"` gets backfilled; manually-set templates are not clobbered. +- Readiness test: missing `/configs/system-prompt.md` after fetch aborts with `MISSING_ASSETS`. + +**E2E** +- Staging SEO workspace created with `template=seo-agent` boots with skills. +- JRS `28f97a7f` after tagging + restart: `agent_card.skills > 0`. +- Existing plain `claude-code` workspace without `template` continues to use `claude-code-default` after the change. + +## 8. Rollout + +1. Land molecule-core PR: model + migration + resolver + restart + `PATCH /template` + backfill + tests. +2. Land molecule-controlplane PR: accept/store `template`. +3. Run backfill in prod (canary JRS `28f97a7f` first); confirm row has `template=seo-agent`. +4. Trigger restart/re-provision for JRS; verify skills. +5. Tag remaining SEO workspaces from explicit allowlist and repeat verification. +6. Update RFC #2948 issue to mark Phase 1 complete and link Phase 2 design. + +## 9. Open questions + +1. **CP storage:** Should CP add a dedicated `workspace_templates` column/table or just mirror into `tenant_resources.metadata`? MiniMax should pick the cheapest option that supports Phase 2 marketplace queries. +2. **Same-engine variant policy:** Should `PATCH /template` allow a template whose manifest `runtime` matches the workspace runtime but whose `config.yaml` declares a different seeded runtime? The runtime-seed mismatch abort (#2027) will catch contradictions at provision time; the endpoint itself can stay runtime-agnostic. + +## 10. Decision log + +- **Single `template` field is enough for now** — CTO directive. No need for `template_version` or `template_source` until Phase 2. +- **Nullable `template` with `NULL` = unset** — enables the idempotent backfill predicate `WHERE template IS NULL` and makes unset behavior identical to today's runtime-fallback. +- **Template-first resolution, runtime fallback** — preserves all existing workspaces; no flag day. +- **New `PATCH /workspaces/:id/template`, not `PATCH runtime=seo-agent`** — closes the rejected overload path permanently. +- **`template` is manifest-allowlist only, validated at write and fetch** — closes the free-string / path-traversal / SSRF surface flagged by Researcher. +- **Single `resolveTemplateAssets` chokepoint** — the Phase 2 broker seam; the only place that knows about #833 token / fetch identity. +- **Tight SEO backfill predicate** — exact workspace-ID allowlist or `workspace_config.data->>template`, never a loose string match. +- **Probe-verified readiness + MISSING_ASSETS fail-closed** — extends the #2955 / core#2594 pattern to template assets. + +## 11. Top-3 decisions before coding + +1. **The broker chokepoint:** `resolveTemplateAssets(ctx, template, runtime, workspaceID) (TemplateAssetResolution, error)` lives in `runtime_registry.go` (or a new `internal/handlers/template_assets.go`). It is the sole caller of `templateRepoByName`, the sole place that knows about the #833 platform-token path, and the only seam the Phase 2 entitlement broker needs to wrap. +2. **The SEO backfill predicate:** idempotent `WHERE template IS NULL`, exact workspace-ID allowlist (JRS `28f97a7f` first) or exact `workspace_config.data->>template` signal, canary → fleet, resumable and reversible with a recorded changed-set and coverage report. +3. **The readiness gate:** assets are not considered delivered until the probe path (`/configs/system-prompt.md` / `config.yaml`) verifies them; missing assets abort with `MISSING_ASSETS` and retry on next reconcile. A mid-flight `template` change triggers a controlled re-fetch + restart inside the #2929 settle window.