docs(rfc#2948): Phase 1 design — decouple workspace template from runtime #2988

Closed
agent-dev-a wants to merge 3 commits from docs/rfc2948-phase1-template-engine-decoupling into main
@@ -0,0 +1,299 @@
# RFC #2948 Phase 1 — Decouple workspace `template` from `runtime`
**Status:** Design draft for driver + CTO sign-off
**Author:** Dev Engineer A (Kimi)
**Related:** RFC #2948 (marketplace foundation), RFC #2948 comment 103870 (Researcher pre-design risk surface), #2957/#2958 (PATCH-runtime rejection + seo-agent manifest mapping), #2959 (manifest pins must be merged commits), #2929 (restart settle-window), #2970 (platform-agent identity restoration), RCA #2831/#2948 (config/skill delivery), JRS `28f97a7f`
## 1. Summary
Today a workspace has one field — `runtime` — that is overloaded: it names both the **engine adapter** (`claude-code`, `codex`, `hermes`, …) and the **template** that supplies config/prompts/skills. The merged #2958 kept this overload by mapping the `seo-agent` *template* to the `claude-code` *runtime* in `manifest.json`. That unblocks the rejected `PATCH runtime=seo-agent` path, but it still leaves the platform unable to say "this workspace is running the claude-code engine **with** the seo-agent template".
This design adds a distinct `template` column to the workspace model, changes the RFC #2843 template-asset resolver to a single chokepoint `resolveTemplateAssets(template, runtime, workspaceID)`, threads `template` through the control-plane provision request, and backfills existing SEO workspaces (JRS first) so the next reconcile fetches the full `seo-all` skill bundle instead of the 218-byte stub.
Phase 1 is **platform-owned templates only** and intentionally reuses the existing #833 platform-token fetch path as a temporary backend. The architecture is shaped so that Phase 2 can drop a marketplace entitlement broker behind the same chokepoint without re-plumbing call sites. No per-seller encryption, no third-party seller credentials, and no standing god-credential inside the workspace are in scope here.
## 2. Goals / non-goals
**Goals**
- A workspace stores **one** installed template (`template`) separate from its engine (`runtime`).
- `template` is a **manifest-allowlist value**, validated at the write boundary (create/PATCH) and again at the fetch boundary (defense-in-depth).
- Template-asset resolution goes through a single server-side chokepoint that becomes the Phase 2 broker seam.
- Existing SEO workspaces get `template=seo-agent` via a safe, idempotent, auditable backfill.
- The change is deployable before the CP marketplace broker exists.
- JRS `28f97a7f` boots with `agent_card.skills > 0` after it is tagged `template=seo-agent` and reconciled.
**Non-goals**
- Marketplace entitlement, per-seller encryption, third-party seller onboarding, or brokered signed-URL delivery (Phase 2).
- Changing `tenant/<id>/bootstrap` Secrets Manager contents.
- Removing the existing `TemplateAssetFetcher` channel — we repurpose it and centralize its caller.
- Allowing `PATCH /workspaces/:id` to set `runtime=seo-agent` (still rejected per #2958).
- Allowing arbitrary private repos or free-string template names in Phase 1.
## 3. Background
### 3.1 Current model
- `workspaces.runtime` is a `TEXT` column with default `claude-code` (migration `011_workspace_runtime.sql`).
- `models.CreateWorkspacePayload` already carries a `Template` field, used at create time to resolve `runtime` and `model` from `config.yaml` and to pick a local template path.
- The workspace row, however, does **not** persist `template`; the insert SQL only stores `runtime` (`workspace.go:683`).
- `runtime_registry.go` builds `templateRepoByName` from `manifest.json` `workspace_templates[]`, keyed by the template name with `-default` stripped.
- `templateIdentityForRuntime(runtime)` looks up `templateRepoByName[runtime]`. For `runtime=claude-code` this returns the `claude-code-default` repo; for `runtime=seo-agent` it would return the `seo-agent` repo, but `seo-agent` is not a valid runtime after #2958.
### 3.2 Why the status quo breaks JRS
- JRS workspaces were created as `runtime=claude-code` with no persisted `template`.
- The RFC #2843 fetcher therefore resolves `templateIdentityForRuntime("claude-code")``claude-code-default`, never `seo-agent`.
- Even after #2958 added the `seo-agent` manifest entry, there is no workspace field that tells the fetcher to use it.
## 4. Security model (Phase 1)
This section folds in the Researcher pre-design risk surface from RFC #2948 comment 103870.
- **`template` is an allowlist, never a free string.** It keys into the manifest registry (the same SSOT that #2959 pins to immutable, merged commits). A value not in the manifest is rejected at the WRITE boundary (create/PATCH) **and** at the fetch boundary. It never falls through to a constructed path or a repo name.
- **Platform-owned templates only.** The Phase 1 allowlist is the set of platform-owned manifest entries (open templates + our private templates like `seo-agent`). No third-party or arbitrary private repo may be named.
- **Single chokepoint: `resolveTemplateAssets(template, runtime, workspaceID)`.** All asset resolution for a `template` value goes through this function. In Phase 1 it returns the #833 platform-token fetch identity; in Phase 2 the same chokepoint swaps to brokered entitlement + signed URLs. No other call site holds the platform token or constructs a template fetch URL.
- **No standing god-credential in the workspace.** The #833 platform token is held server-side by the chokepoint/fetcher, scoped read-only to platform-owned template repos, and never exposed to the box. The workspace receives only the final assets (or, in Phase 2, a short-lived signed URL).
- **Tenant isolation.** The fetch uses only the template-scoped read-only token; it must never escalate to the requesting workspace's tenant secrets and must never let one tenant's `template` value read another tenant's data.
- **SSRF guard.** Because `template` influences the fetch URL only via the manifest registry, the HTTP path must apply the #2132 posture: dial-time IP guard, no redirects, explicit allowlist. The existing Gitea fetcher constructs URLs only from manifest `repo`/`ref` values, not from user input.
## 5. Design
### 5.1 Workspace model — add nullable `template`
New migration (idempotent, additive):
```sql
ALTER TABLE workspaces
ADD COLUMN IF NOT EXISTS template TEXT;
```
- `NULL` means "no installed template — use runtime fallback". This is the steady state for every existing workspace and for bare `{"name":...}` creates.
- `models.Workspace` gains `Template *string `json:"template,omitempty" db:"template"`` (or `sql.NullString`), with helpers `templateIsSet()` / `templateValue()`.
- `models.CreateWorkspacePayload.Template` already exists; we now persist it when non-empty.
- The create insert SQL becomes:
```sql
INSERT INTO workspaces (id, name, role, tier, runtime, template, status, ...)
VALUES ($1, $2, $3, $4, $5, NULLIF($6, ''), 'provisioning', ...)
```
- The write boundary validates `template` against `templateRepoByName` before insert. Unknown values fail with a clear 400, matching the create-boundary fail-closed posture (#188).
### 5.2 Single asset-resolution chokepoint
Introduce a single resolver in `runtime_registry.go` (or a new `internal/handlers/template_assets.go` if the broker grows):
```go
// TemplateAssetResolution is the only thing callers of the asset
// channel need. In Phase 1 it carries a Gitea identity; in Phase 2
// it can carry a broker-signed URL or an entitlement-bound fetcher.
type TemplateAssetResolution struct {
Identity string // "<owner>/<repo>@<ref>" (Phase 1) or signed URL (Phase 2)
}
// resolveTemplateAssets maps a workspace's template/runtime to the
// manifest-registered asset source. It is the ONLY place that:
// 1. looks up templateRepoByName,
// 2. validates the allowlist,
// 3. decides whether to use the #833 platform-token path (Phase 1)
// or a brokered entitlement (Phase 2).
// Callers outside this chokepoint must never construct a fetch URL
// or hold the platform token.
func resolveTemplateAssets(ctx context.Context, template, runtime, workspaceID string) (TemplateAssetResolution, error) {
if template != "" {
rr, ok := templateRepoByName[template]
if !ok {
return TemplateAssetResolution{}, fmt.Errorf("template %q is not in the manifest allowlist", template)
}
return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil
}
rr, ok := templateRepoByName[runtime]
if !ok {
// external / kimi / kimi-cli / mock: no template assets.
return TemplateAssetResolution{}, nil
}
return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil
}
```
Rules:
1. If `template` is non-empty and exists in `templateRepoByName`, use it.
2. If `template` is non-empty and **not** in the manifest, fail-closed with an error.
3. If `template` is empty/null, fall back to the current `runtime` lookup for backward compatibility.
4. The resolver is authoritative for assets; `runtime` is authoritative for the engine. Precedence is acyclic: `runtime` must never re-derive or override an explicitly-set `template`.
Call sites to update:
- `workspace_provision.go` `buildProvisionerConfig`: replace `templateIdentityForRuntimeOrEmpty(payload.Runtime)` with `resolveTemplateAssets(ctx, payload.Template, payload.Runtime, workspaceID).Identity`.
- Any restart/reconcile path that rebuilds a `CreateWorkspacePayload` must populate `payload.Template` from the DB (see §5.4).
### 5.3 Manifest-pin gate
`templateRepoByName` inherits the #2959 ancestor-of-default-branch gate: an unmerged or orphaned SHA in `manifest.json` must be rejected at service boot. This prevents a template pointer from resolving to an untrusted commit, regardless of how the pointer is reached.
### 5.4 Create path
No API shape change. `workspace.go:Create` already accepts `template` in the payload. The only changes are:
- Validate `payload.Template` against `templateRepoByName` at the write boundary.
- Persist `payload.Template` into the new DB column (NULL when empty).
The existing runtime/model resolution from `config.yaml` stays unchanged — a caller can still pass `template=seo-agent` and have `runtime` default to `claude-code` from the manifest.
#### Forward constraint: `model` is NOT template-delivered at provision time
The `MISSING_MODEL` gate runs at **provision time** and reads the workspace row's `model` column. At that point the template's `config.yaml` has **not** been fetched into the workspace container yet, so `model` cannot be derived from the template at provision time. Phase 1 therefore keeps the current create-time contract:
- `model` is resolved **once**, at workspace **create** time, from the caller-supplied value or from the chosen template's `config.yaml` manifest/metadata (the same path that already exists today).
- The resolved `model` is persisted in the `workspaces` row and passed to the CP in `cpProvisionRequest`.
- The `MISSING_MODEL` provision-time gate reads this persisted value only; it never waits on a template fetch.
- Template-delivered assets in Phase 1 are limited to **prompts, MCP wiring, and identity files** (`config.yaml` content, `mcp_servers.yaml`, `prompts/concierge.md`, `identity-fallback.sh`). The model identifier is out of scope for template delivery.
This preserves the #2966 fix and Researcher 103961 posture: the provision-time model gate remains fail-closed and independent of the template-asset fetch channel.
### 5.5 Restart / reconcile path
`workspace_restart.go` builds a `CreateWorkspacePayload` from the DB row (line ~444):
```go
payload := withStoredCompute(ctx, id, models.CreateWorkspacePayload{
Name: wsName,
Tier: tier,
Runtime: containerRuntime,
})
```
It must also read `template` from `workspaces` and set `payload.Template`. The simplest change is a new helper `withStoredTemplate(ctx, id, payload)` or an inline query; either way, restart provisions use the persisted template.
### 5.6 CP provision wire
Add `Template` to the wire shapes so the CP can record it for its own reconciliation and future marketplace use:
- `provisioner.WorkspaceConfig` gains `Template string` (or `*string`).
- `cp_provisioner.go` `cpProvisionRequest` gains `Template string `json:"template,omitempty"``.
- `molecule-controlplane` `wsProvisionRequest` gains `Template string `json:"template"`.
- The CP stores `template` in the workspace record / `tenant_resources.metadata` / a new lightweight table (CP-side decision). For Phase 1 the only hard requirement is that the CP does not drop the field and can echo it back in status/reconcile responses.
The CP image-selection path does **not** change: `runtime` still drives `workspace-template-<runtime>` / `runtime_image_pins.template_name`. The `seo-agent` template uses the `claude-code` image via the existing manifest `"runtime": "claude-code"` mapping.
### 5.7 Setting `template` on an existing workspace
The supported assignment path is a new **admin/CP-gated** endpoint:
```
PATCH /workspaces/:id/template
{ "template": "seo-agent" }
```
Behavior:
- Validates that `template` is a known manifest entry (fail-closed; same allowlist as create).
- Updates `workspaces.template`.
- Returns `{ "status": "updated", "needs_restart": true }`, exactly like `PATCH /workspaces/:id runtime`.
- Does **not** change `runtime`; the engine stays `claude-code`.
- Rejects a template whose manifest `runtime` differs from the workspace's current `runtime`. Phase 1 intentionally does not support cross-engine retargeting; same-engine template variants (e.g. different claude-code personas) are allowed.
This endpoint is what the JRS support flow / CP automation calls to tag `template=seo-agent`.
### 5.8 Backfill migration
Two-part, fully idempotent backfill:
1. **Explicit data-driven backfill** — workspaces that already have a template recorded in `workspace_config.data`:
```sql
UPDATE workspaces w
SET template = NULLIF(TRIM(c.data->>'template'), '')
FROM workspace_config c
WHERE c.workspace_id = w.id
AND w.template IS NULL
AND NULLIF(TRIM(c.data->>'template'), '') IS NOT NULL
AND EXISTS (
SELECT 1 FROM manifest_allowed_templates m
WHERE m.name = NULLIF(TRIM(c.data->>'template'), '')
);
```
(In practice the allowlist check is the same Go helper used at the write boundary.)
2. **SEO explicit-allowlist backfill** — run a one-off idempotent script for known SEO workspace IDs, starting with JRS `28f97a7f`. If additional SEO workspaces cannot be identified by `workspace_config.data`, use an **explicit workspace-ID allowlist** supplied by ops/PM, or the existing `seo-agent` manifest mapping. **Never** a loose string match on workspace name, env values, or role.
Safety properties:
- **Idempotency:** gate on `WHERE template IS NULL` so re-runs are no-ops and manually-set templates are never clobbered.
- **Tight predicate:** exact workspace-ID allowlist or exact `workspace_config.data->>template` signal.
- **Canary first:** run JRS `28f97a7f` first, verify behavior, then fleet.
- **Reversible + resumable:** record the changed set; apply per-workspace transactionally; produce a coverage report (tagged/total). A mid-run failure leaves a clean mixed state that a re-run completes.
- **Rollback:** a companion script can reset `template = NULL` for the exact changed set if a catastrophic issue is found.
### 5.9 Readiness gate and mid-flight changes
- **Probe-verified readiness.** The workspace is not considered successfully provisioned until the template's assets are fetched and present at the probe path (`/configs/system-prompt.md` per the #2955 lesson, plus `config.yaml`). If assets are missing, fail-closed with a `MISSING_ASSETS` provision abort and retry on the next reconcile (same backstop pattern as `MISSING_MODEL`, core#2594).
- **Fill-absent-only.** Asset delivery never overwrites files that are already present in `/configs/*` (#141 / #833).
- **Template change mid-flight.** Changing `template` triggers a controlled re-fetch + restart using the existing #2929 `restartSettleWindow` pattern. The fetch is idempotent and keyed on the CURRENT record value, so concurrent changes converge to the last writer without half-applied states.
### 5.10 JRS verification
After the backfill + a restart/re-provision of JRS `28f97a7f`:
- `resolveTemplateAssets("seo-agent", "claude-code", "28f97a7f")` returns `molecule-ai/molecule-ai-workspace-template-seo-agent@<pin>`.
- `giteaTemplateAssetFetcher.Load` returns `agent-skills/seo-all/**`.
- The workspace boots with a non-stub `/configs/config.yaml` and `agent_card.skills` contains the SEO skill package.
- A simple smoke check: `agent_card.skills | length > 0` and the `/seo-*` slash commands are registered.
## 6. Build split
- **Dev Engineer A (Kimi) / molecule-core**
- DB migration adding nullable `workspaces.template`.
- `models.Workspace` + create insert SQL + NULL normalization.
- `resolveTemplateAssets` chokepoint + update fetcher call sites.
- Restart path reads `template` from DB.
- `PATCH /workspaces/:id/template` handler + allowlist validation + tests.
- Idempotent backfill migration / script + canary/rollback runbook.
- Core-side unit + integration tests (allowlist, chokepoint, backfill, readiness).
- **Dev Engineer B (MiniMax) / molecule-controlplane**
- Accept `template` in `wsProvisionRequest`.
- Store `template` in CP workspace metadata.
- CP migration/backfill to mirror core's `template` values where CP keeps its own record.
- JRS tagging automation / endpoint if needed.
- CP-side tests.
## 7. Test plan
**Unit / integration (molecule-core)**
- `TestResolveTemplateAssets`:
- template set, known → returns repo@ref.
- template set, unknown → error (fail-closed).
- template empty, runtime known → returns runtime's default template.
- template empty, runtime external/kimi → empty (skip fetcher).
- `TestCreateWorkspace_PersistsTemplate`: create with `template=seo-agent` stores `template=seo-agent`, `runtime=claude-code`; unknown template rejected.
- `TestRestartWorkspace_UsesStoredTemplate`: restart reads `template` from DB and passes it to the resolver.
- `TestPatchTemplate`: admin endpoint rejects unknown template, updates known template, returns `needs_restart`, rejects cross-engine template.
- Migration test: existing workspace with `workspace_config.data->>template="seo-agent"` gets backfilled; manually-set templates are not clobbered.
- Readiness test: missing `/configs/system-prompt.md` after fetch aborts with `MISSING_ASSETS`.
**E2E**
- Staging SEO workspace created with `template=seo-agent` boots with skills.
- JRS `28f97a7f` after tagging + restart: `agent_card.skills > 0`.
- Existing plain `claude-code` workspace without `template` continues to use `claude-code-default` after the change.
## 8. Rollout
1. Land molecule-core PR: model + migration + resolver + restart + `PATCH /template` + backfill + tests.
2. Land molecule-controlplane PR: accept/store `template`.
3. Run backfill in prod (canary JRS `28f97a7f` first); confirm row has `template=seo-agent`.
4. Trigger restart/re-provision for JRS; verify skills.
5. Tag remaining SEO workspaces from explicit allowlist and repeat verification.
6. Update RFC #2948 issue to mark Phase 1 complete and link Phase 2 design.
## 9. Open questions
1. **CP storage:** Should CP add a dedicated `workspace_templates` column/table or just mirror into `tenant_resources.metadata`? MiniMax should pick the cheapest option that supports Phase 2 marketplace queries.
2. **Same-engine variant policy:** Should `PATCH /template` allow a template whose manifest `runtime` matches the workspace runtime but whose `config.yaml` declares a different seeded runtime? The runtime-seed mismatch abort (#2027) will catch contradictions at provision time; the endpoint itself can stay runtime-agnostic.
## 10. Decision log
- **Single `template` field is enough for now** — CTO directive. No need for `template_version` or `template_source` until Phase 2.
- **Nullable `template` with `NULL` = unset** — enables the idempotent backfill predicate `WHERE template IS NULL` and makes unset behavior identical to today's runtime-fallback.
- **Template-first resolution, runtime fallback** — preserves all existing workspaces; no flag day.
- **New `PATCH /workspaces/:id/template`, not `PATCH runtime=seo-agent`** — closes the rejected overload path permanently.
- **`template` is manifest-allowlist only, validated at write and fetch** — closes the free-string / path-traversal / SSRF surface flagged by Researcher.
- **Single `resolveTemplateAssets` chokepoint** — the Phase 2 broker seam; the only place that knows about #833 token / fetch identity.
- **Tight SEO backfill predicate** — exact workspace-ID allowlist or `workspace_config.data->>template`, never a loose string match.
- **Probe-verified readiness + MISSING_ASSETS fail-closed** — extends the #2955 / core#2594 pattern to template assets.
## 11. Top-3 decisions before coding
1. **The broker chokepoint:** `resolveTemplateAssets(ctx, template, runtime, workspaceID) (TemplateAssetResolution, error)` lives in `runtime_registry.go` (or a new `internal/handlers/template_assets.go`). It is the sole caller of `templateRepoByName`, the sole place that knows about the #833 platform-token path, and the only seam the Phase 2 entitlement broker needs to wrap.
2. **The SEO backfill predicate:** idempotent `WHERE template IS NULL`, exact workspace-ID allowlist (JRS `28f97a7f` first) or exact `workspace_config.data->>template` signal, canary → fleet, resumable and reversible with a recorded changed-set and coverage report.
3. **The readiness gate:** assets are not considered delivered until the probe path (`/configs/system-prompt.md` / `config.yaml`) verifies them; missing assets abort with `MISSING_ASSETS` and retry on next reconcile. A mid-flight `template` change triggers a controlled re-fetch + restart inside the #2929 settle window.