From be72246a6eed323b2be131c2425d3a76f6bdb3e5 Mon Sep 17 00:00:00 2001 From: "Molecule AI Dev Engineer A (Kimi)" Date: Tue, 16 Jun 2026 00:35:55 +0000 Subject: [PATCH 1/3] =?UTF-8?q?docs(design):=20RFC=20#2948=20Phase=201=20?= =?UTF-8?q?=E2=80=94=20decouple=20workspace=20template=20from=20runtime?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the design doc for driver+CTO sign-off before any build PR. Covers: workspace model template field, fetch-by-template with runtime fallback, CP provision threading, PATCH /workspaces/:id/template, backfill migration, and JRS verification plan. Co-Authored-By: Claude --- ...-2948-phase1-template-engine-decoupling.md | 220 ++++++++++++++++++ 1 file changed, 220 insertions(+) create mode 100644 docs/design/rfc-2948-phase1-template-engine-decoupling.md diff --git a/docs/design/rfc-2948-phase1-template-engine-decoupling.md b/docs/design/rfc-2948-phase1-template-engine-decoupling.md new file mode 100644 index 000000000..c7568f8fb --- /dev/null +++ b/docs/design/rfc-2948-phase1-template-engine-decoupling.md @@ -0,0 +1,220 @@ +# RFC #2948 Phase 1 — Decouple workspace `template` from `runtime` + +**Status:** Design draft for driver + CTO sign-off +**Author:** Dev Engineer A (Kimi) +**Related:** RFC #2948 (marketplace foundation), #2957/#2958 (PATCH-runtime rejection + seo-agent manifest mapping), #2970 (platform-agent identity restoration), RCA #2831/#2948 (config/skill delivery), JRS `28f97a7f` + +## 1. Summary + +Today a workspace has one field — `runtime` — that is overloaded: it names both the **engine adapter** (`claude-code`, `codex`, `hermes`, …) and the **template** that supplies config/prompts/skills. The merged #2958 kept this overload by mapping the `seo-agent` *template* to the `claude-code` *runtime* in `manifest.json`. That unblocks the rejected `PATCH runtime=seo-agent` path, but it still leaves the platform unable to say "this workspace is running the claude-code engine **with** the seo-agent template". + +This design adds a distinct `template` column to the workspace model, changes the RFC #2843 template-asset fetcher to resolve from `template` when set and fall back to `runtime` when unset, threads `template` through the control-plane provision request, and backfills existing SEO workspaces (JRS first) so the next reconcile fetches the full `seo-all` skill bundle instead of the 218-byte stub. + +No marketplace broker or per-seller encryption is in scope here — that is RFC #2948 Phase 2. Phase 1 is the CTO-decided JRS-unblock + the foundation that later phases build on. + +## 2. Goals / non-goals + +**Goals** +- A workspace stores **one** installed template (`template`) separate from its engine (`runtime`). +- Template-asset resolution is template-first with a runtime fallback, preserving every existing workspace. +- Existing SEO workspaces get `template=seo-agent` via a safe backfill + migration. +- The change is deployable before the CP marketplace broker exists. +- JRS `28f97a7f` boots with `agent_card.skills > 0` after it is tagged `template=seo-agent` and reconciled. + +**Non-goals** +- Marketplace entitlement, per-seller encryption, or brokered delivery (Phase 2). +- Changing `tenant//bootstrap` Secrets Manager contents. +- Removing the existing `TemplateIdentity` / `TemplateAssetFetcher` channel — we *repurpose* it. +- Allowing `PATCH /workspaces/:id` to set `runtime=seo-agent` (still rejected per #2958). + +## 3. Background + +### 3.1 Current model +- `workspaces.runtime` is a `TEXT` column with default `claude-code` (migration `011_workspace_runtime.sql`). +- `models.CreateWorkspacePayload` already carries a `Template` field, used at create time to resolve `runtime` and `model` from `config.yaml` and to pick a local template path. +- The workspace row, however, does **not** persist `template`; the insert SQL only stores `runtime` (`workspace.go:683`). +- `runtime_registry.go` builds `templateRepoByName` from `manifest.json` `workspace_templates[]`, keyed by the template name with `-default` stripped. For the `seo-agent` entry, the key is `seo-agent` and the manifest declares `"runtime": "claude-code"` for the runtime allowlist. +- `templateIdentityForRuntime(runtime)` looks up `templateRepoByName[runtime]`. For `runtime=claude-code` this returns the `claude-code-default` repo; for `runtime=seo-agent` it would return the `seo-agent` repo, but `seo-agent` is not a valid runtime after #2958. + +### 3.2 Why the status quo breaks JRS +- JRS workspaces were created as `runtime=claude-code` with no persisted `template`. +- The RFC #2843 fetcher therefore resolves `templateIdentityForRuntime("claude-code")` → `claude-code-default`, never `seo-agent`. +- Even after #2958 added the `seo-agent` manifest entry, there is no workspace field that tells the fetcher to use it. + +## 4. Design + +### 4.1 Workspace model — add `template` + +New migration (idempotent, additive): + +```sql +ALTER TABLE workspaces +ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT ''; +``` + +- `''` means "no installed template — use runtime fallback". This is the steady state for every existing workspace and for bare `{"name":...}` creates. +- `models.Workspace` gains `Template string `json:"template" db:"template"``. +- `models.CreateWorkspacePayload.Template` already exists; we now persist it. +- The create insert SQL becomes: + ```sql + INSERT INTO workspaces (id, name, role, tier, runtime, template, status, ...) + VALUES ($1, $2, $3, $4, $5, $6, 'provisioning', ...) + ``` + +### 4.2 Fetch-by-template with runtime fallback + +Introduce a single resolver in `runtime_registry.go`: + +```go +func resolveTemplateIdentity(template, runtime string) (string, bool) { + if template != "" { + if id, ok := templateRepoByName[template]; ok { + return id.Repo + "@" + id.Ref, true + } + return "", false + } + return templateIdentityForRuntime(runtime) +} +``` + +Rules: +1. If `template` is non-empty and exists in `templateRepoByName`, use it. +2. If `template` is non-empty and **not** in the manifest, fail-closed (`false`). This matches the create-boundary fail-closed posture (#188). +3. If `template` is empty, fall back to the current `runtime` lookup for backward compatibility. + +Call sites to update: +- `workspace_provision.go` `prepareProvisionContext` / `provisionWorkspaceCP`: replace `templateIdentityForRuntimeOrEmpty(payload.Runtime)` with `resolveTemplateIdentity(payload.Template, payload.Runtime)`. +- Any restart/reconcile path that rebuilds a `CreateWorkspacePayload` must populate `payload.Template` from the DB (see §4.4). + +### 4.3 Create path + +No API shape change. `workspace.go:Create` already accepts `template` in the payload. The only change is to persist `payload.Template` into the new DB column. + +The existing runtime/model resolution from `config.yaml` stays unchanged — a caller can still pass `template=seo-agent` and have `runtime` default to `claude-code` from the manifest. + +### 4.4 Restart / reconcile path + +`workspace_restart.go` builds a `CreateWorkspacePayload` from the DB row (line ~444): + +```go +payload := withStoredCompute(ctx, id, models.CreateWorkspacePayload{ + Name: wsName, + Tier: tier, + Runtime: containerRuntime, +}) +``` + +It must also read `template` from `workspaces` and set `payload.Template`. The simplest change is a new helper `withStoredTemplate(ctx, id, payload)` or an inline query; either way, restart provisions use the persisted template. + +### 4.5 CP provision wire + +Add `Template` to the wire shapes so the CP can record it for its own reconciliation and future marketplace use: + +- `provisioner.WorkspaceConfig` gains `Template string`. +- `cp_provisioner.go` `cpProvisionRequest` gains `Template string `json:"template,omitempty"``. +- `molecule-controlplane` `wsProvisionRequest` gains `Template string `json:"template"``. +- The CP stores `template` in the workspace record / `tenant_resources.metadata` / a new lightweight table (CP-side decision). For Phase 1 the only hard requirement is that the CP does not drop the field and can echo it back in status/reconcile responses. + +The CP image-selection path does **not** change: `runtime` still drives `workspace-template-` / `runtime_image_pins.template_name`. The `seo-agent` template uses the `claude-code` image via the existing manifest `"runtime": "claude-code"` mapping. + +### 4.6 Setting `template` on an existing workspace + +The supported assignment path is a new **admin/CP-gated** endpoint: + +``` +PATCH /workspaces/:id/template +{ "template": "seo-agent" } +``` + +Behavior: +- Validates that `template` is a known manifest entry. +- Updates `workspaces.template`. +- Returns `{ "status": "updated", "needs_restart": true }`, exactly like `PATCH /workspaces/:id runtime`. +- Does **not** change `runtime`; the engine stays `claude-code`. +- Rejects attempts to set a template whose base runtime differs from the workspace's current runtime? **Decision point:** either reject or warn. Recommendation: allow but warn, because a future migration may intentionally retarget a workspace to a different template variant of the same runtime. + +This endpoint is what the JRS support flow / CP automation calls to tag `template=seo-agent`. + +### 4.7 Backfill migration + +Two-part backfill: + +1. **Data-driven backfill** — workspaces that already have a template recorded in `workspace_config.data`: + ```sql + UPDATE workspaces w + SET template = COALESCE(NULLIF(c.data->>'template',''), w.template) + FROM workspace_config c + WHERE c.workspace_id = w.id + AND w.template = '' + AND c.data->>'template' IS NOT NULL; + ``` + +2. **Explicit SEO backfill** — run a one-off idempotent script for known SEO workspace IDs, starting with JRS `28f97a7f`. If additional SEO workspaces cannot be identified by `workspace_config.data`, use a conservative heuristic (runtime=`claude-code` + `workspace_config.data` contains SEO-specific env like `SERVICES_REPO_WEBSITE` or the role matches `/[Ss][Ee][Oo]/`) and log every row for human audit. + +The migration file runs in the normal migration runner and is safe to re-apply. + +### 4.8 JRS verification + +After the backfill + a restart/re-provision of JRS `28f97a7f`: +- `provisioner.TemplateAssetFetcher.Load("molecule-ai/molecule-ai-workspace-template-seo-agent@")` returns `agent-skills/seo-all/**`. +- The workspace boots with a non-stub `/configs/config.yaml` and `agent_card.skills` contains the SEO skill package. +- A simple smoke check: `agent_card.skills | length > 0` and the `/seo-*` slash commands are registered. + +## 5. Build split + +- **Dev Engineer A (Kimi) / molecule-core** + - DB migration adding `workspaces.template`. + - `models.Workspace` + create insert SQL. + - `resolveTemplateIdentity` + update fetcher call sites. + - Restart path reads `template` from DB. + - `PATCH /workspaces/:id/template` handler + tests. + - Backfill migration / script. + - Core-side unit + integration tests. + +- **Dev Engineer B (MiniMax) / molecule-controlplane** + - Accept `template` in `wsProvisionRequest`. + - Store `template` in CP workspace metadata. + - CP migration/backfill to mirror core's `template` values where CP keeps its own record. + - JRS tagging automation / endpoint if needed. + - CP-side tests. + +## 6. Test plan + +**Unit / integration (molecule-core)** +- `TestResolveTemplateIdentity`: + - template set, known → returns repo@ref. + - template set, unknown → false. + - template empty, runtime known → returns runtime's default template. + - template empty, runtime external/kimi → empty (skip fetcher). +- `TestCreateWorkspace_PersistsTemplate`: create with `template=seo-agent` stores `template=seo-agent`, `runtime=claude-code`. +- `TestRestartWorkspace_UsesStoredTemplate`: restart reads `template` from DB and passes it to the provisioner. +- `TestPatchTemplate`: admin endpoint updates template and returns `needs_restart`. +- Migration test: existing workspace with `workspace_config.data.template="seo-agent"` gets backfilled. + +**E2E** +- Staging SEO workspace created with `template=seo-agent` boots with skills. +- JRS `28f97a7f` after tagging + restart: `agent_card.skills > 0`. +- Existing plain `claude-code` workspace without `template` continues to use `claude-code-default` after the change. + +## 7. Rollout + +1. Land molecule-core PR: model + migration + resolver + restart + `PATCH /template` + backfill + tests. +2. Land molecule-controlplane PR: accept/store `template`. +3. Run backfill in prod; confirm JRS `28f97a7f` row has `template=seo-agent`. +4. Trigger restart/re-provision for JRS; verify skills. +5. Tag remaining SEO workspaces and repeat verification. +6. Update RFC #2948 issue to mark Phase 1 complete and link Phase 2 design. + +## 8. Open questions + +1. **Default value:** `DEFAULT ''` vs `DEFAULT NULL`? `''` keeps the Go model simple (`string` not `sql.NullString`) and makes the empty-template check identical to the empty-runtime check already in the codebase. +2. **CP storage:** Should CP add a dedicated `workspace_templates` column/table or just mirror into `tenant_resources.metadata`? MiniMax should pick the cheapest option that supports Phase 2 marketplace queries. +3. **Cross-runtime template changes:** Should `PATCH /template` reject a template whose manifest `runtime` differs from the workspace's current `runtime`? Recommend allowing it with a warning for same-runtime variants, rejecting for incompatible engines. +4. **Backfill scope:** Do we have a canonical list of all SEO workspaces beyond JRS? The migration will use `workspace_config.data->>template` first; the explicit list can be supplied by ops/PM. + +## 9. Decision log + +- **Single `template` field is enough for now** — CTO directive. No need for `template_version` or `template_source` until Phase 2. +- **Template-first resolution, runtime fallback** — preserves all existing workspaces; no flag day. +- **New `PATCH /workspaces/:id/template`, not `PATCH runtime=seo-agent`** — closes the rejected overload path permanently. +- **Backfill from `workspace_config.data` + explicit JRS list** — safest, most auditable way to recover the template that was already recorded at create time. -- 2.52.0 From 89047f7631fd0ad49b56ae8ec5f7cb909c285506 Mon Sep 17 00:00:00 2001 From: "Molecule AI Dev Engineer A (Kimi)" Date: Tue, 16 Jun 2026 02:05:50 +0000 Subject: [PATCH 2/3] docs(rfc#2948): fold Researcher Phase 1 risk surface into design draft - Makes template a manifest-allowlist value, validated at write + fetch. - Replaces resolveTemplateIdentity with a single resolveTemplateAssets chokepoint that becomes the Phase 2 entitlement-broker seam. - Adopts nullable template (NULL = runtime fallback) for idempotent backfill semantics. - Tightens SEO backfill predicate: exact workspace-ID allowlist or workspace_config.data->>template; canary JRS 28f97a7f first. - Adds probe-verified readiness gate + MISSING_ASSETS fail-closed retry and #2929 settle-window handling for mid-flight template changes. - Records the top-3 CTO decisions before coding starts. Refs RFC #2948 comment 103870, #2959, #2929, #2955, #2958. --- ...-2948-phase1-template-engine-decoupling.md | 206 ++++++++++++------ 1 file changed, 137 insertions(+), 69 deletions(-) diff --git a/docs/design/rfc-2948-phase1-template-engine-decoupling.md b/docs/design/rfc-2948-phase1-template-engine-decoupling.md index c7568f8fb..7119d5829 100644 --- a/docs/design/rfc-2948-phase1-template-engine-decoupling.md +++ b/docs/design/rfc-2948-phase1-template-engine-decoupling.md @@ -2,30 +2,32 @@ **Status:** Design draft for driver + CTO sign-off **Author:** Dev Engineer A (Kimi) -**Related:** RFC #2948 (marketplace foundation), #2957/#2958 (PATCH-runtime rejection + seo-agent manifest mapping), #2970 (platform-agent identity restoration), RCA #2831/#2948 (config/skill delivery), JRS `28f97a7f` +**Related:** RFC #2948 (marketplace foundation), RFC #2948 comment 103870 (Researcher pre-design risk surface), #2957/#2958 (PATCH-runtime rejection + seo-agent manifest mapping), #2959 (manifest pins must be merged commits), #2929 (restart settle-window), #2970 (platform-agent identity restoration), RCA #2831/#2948 (config/skill delivery), JRS `28f97a7f` ## 1. Summary Today a workspace has one field — `runtime` — that is overloaded: it names both the **engine adapter** (`claude-code`, `codex`, `hermes`, …) and the **template** that supplies config/prompts/skills. The merged #2958 kept this overload by mapping the `seo-agent` *template* to the `claude-code` *runtime* in `manifest.json`. That unblocks the rejected `PATCH runtime=seo-agent` path, but it still leaves the platform unable to say "this workspace is running the claude-code engine **with** the seo-agent template". -This design adds a distinct `template` column to the workspace model, changes the RFC #2843 template-asset fetcher to resolve from `template` when set and fall back to `runtime` when unset, threads `template` through the control-plane provision request, and backfills existing SEO workspaces (JRS first) so the next reconcile fetches the full `seo-all` skill bundle instead of the 218-byte stub. +This design adds a distinct `template` column to the workspace model, changes the RFC #2843 template-asset resolver to a single chokepoint `resolveTemplateAssets(template, runtime, workspaceID)`, threads `template` through the control-plane provision request, and backfills existing SEO workspaces (JRS first) so the next reconcile fetches the full `seo-all` skill bundle instead of the 218-byte stub. -No marketplace broker or per-seller encryption is in scope here — that is RFC #2948 Phase 2. Phase 1 is the CTO-decided JRS-unblock + the foundation that later phases build on. +Phase 1 is **platform-owned templates only** and intentionally reuses the existing #833 platform-token fetch path as a temporary backend. The architecture is shaped so that Phase 2 can drop a marketplace entitlement broker behind the same chokepoint without re-plumbing call sites. No per-seller encryption, no third-party seller credentials, and no standing god-credential inside the workspace are in scope here. ## 2. Goals / non-goals **Goals** - A workspace stores **one** installed template (`template`) separate from its engine (`runtime`). -- Template-asset resolution is template-first with a runtime fallback, preserving every existing workspace. -- Existing SEO workspaces get `template=seo-agent` via a safe backfill + migration. +- `template` is a **manifest-allowlist value**, validated at the write boundary (create/PATCH) and again at the fetch boundary (defense-in-depth). +- Template-asset resolution goes through a single server-side chokepoint that becomes the Phase 2 broker seam. +- Existing SEO workspaces get `template=seo-agent` via a safe, idempotent, auditable backfill. - The change is deployable before the CP marketplace broker exists. - JRS `28f97a7f` boots with `agent_card.skills > 0` after it is tagged `template=seo-agent` and reconciled. **Non-goals** -- Marketplace entitlement, per-seller encryption, or brokered delivery (Phase 2). +- Marketplace entitlement, per-seller encryption, third-party seller onboarding, or brokered signed-URL delivery (Phase 2). - Changing `tenant//bootstrap` Secrets Manager contents. -- Removing the existing `TemplateIdentity` / `TemplateAssetFetcher` channel — we *repurpose* it. +- Removing the existing `TemplateAssetFetcher` channel — we repurpose it and centralize its caller. - Allowing `PATCH /workspaces/:id` to set `runtime=seo-agent` (still rejected per #2958). +- Allowing arbitrary private repos or free-string template names in Phase 1. ## 3. Background @@ -33,7 +35,7 @@ No marketplace broker or per-seller encryption is in scope here — that is RFC - `workspaces.runtime` is a `TEXT` column with default `claude-code` (migration `011_workspace_runtime.sql`). - `models.CreateWorkspacePayload` already carries a `Template` field, used at create time to resolve `runtime` and `model` from `config.yaml` and to pick a local template path. - The workspace row, however, does **not** persist `template`; the insert SQL only stores `runtime` (`workspace.go:683`). -- `runtime_registry.go` builds `templateRepoByName` from `manifest.json` `workspace_templates[]`, keyed by the template name with `-default` stripped. For the `seo-agent` entry, the key is `seo-agent` and the manifest declares `"runtime": "claude-code"` for the runtime allowlist. +- `runtime_registry.go` builds `templateRepoByName` from `manifest.json` `workspace_templates[]`, keyed by the template name with `-default` stripped. - `templateIdentityForRuntime(runtime)` looks up `templateRepoByName[runtime]`. For `runtime=claude-code` this returns the `claude-code-default` repo; for `runtime=seo-agent` it would return the `seo-agent` repo, but `seo-agent` is not a valid runtime after #2958. ### 3.2 Why the status quo breaks JRS @@ -41,58 +43,98 @@ No marketplace broker or per-seller encryption is in scope here — that is RFC - The RFC #2843 fetcher therefore resolves `templateIdentityForRuntime("claude-code")` → `claude-code-default`, never `seo-agent`. - Even after #2958 added the `seo-agent` manifest entry, there is no workspace field that tells the fetcher to use it. -## 4. Design +## 4. Security model (Phase 1) -### 4.1 Workspace model — add `template` +This section folds in the Researcher pre-design risk surface from RFC #2948 comment 103870. + +- **`template` is an allowlist, never a free string.** It keys into the manifest registry (the same SSOT that #2959 pins to immutable, merged commits). A value not in the manifest is rejected at the WRITE boundary (create/PATCH) **and** at the fetch boundary. It never falls through to a constructed path or a repo name. +- **Platform-owned templates only.** The Phase 1 allowlist is the set of platform-owned manifest entries (open templates + our private templates like `seo-agent`). No third-party or arbitrary private repo may be named. +- **Single chokepoint: `resolveTemplateAssets(template, runtime, workspaceID)`.** All asset resolution for a `template` value goes through this function. In Phase 1 it returns the #833 platform-token fetch identity; in Phase 2 the same chokepoint swaps to brokered entitlement + signed URLs. No other call site holds the platform token or constructs a template fetch URL. +- **No standing god-credential in the workspace.** The #833 platform token is held server-side by the chokepoint/fetcher, scoped read-only to platform-owned template repos, and never exposed to the box. The workspace receives only the final assets (or, in Phase 2, a short-lived signed URL). +- **Tenant isolation.** The fetch uses only the template-scoped read-only token; it must never escalate to the requesting workspace's tenant secrets and must never let one tenant's `template` value read another tenant's data. +- **SSRF guard.** Because `template` influences the fetch URL only via the manifest registry, the HTTP path must apply the #2132 posture: dial-time IP guard, no redirects, explicit allowlist. The existing Gitea fetcher constructs URLs only from manifest `repo`/`ref` values, not from user input. + +## 5. Design + +### 5.1 Workspace model — add nullable `template` New migration (idempotent, additive): ```sql ALTER TABLE workspaces -ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT ''; +ADD COLUMN IF NOT EXISTS template TEXT; ``` -- `''` means "no installed template — use runtime fallback". This is the steady state for every existing workspace and for bare `{"name":...}` creates. -- `models.Workspace` gains `Template string `json:"template" db:"template"``. -- `models.CreateWorkspacePayload.Template` already exists; we now persist it. +- `NULL` means "no installed template — use runtime fallback". This is the steady state for every existing workspace and for bare `{"name":...}` creates. +- `models.Workspace` gains `Template *string `json:"template,omitempty" db:"template"`` (or `sql.NullString`), with helpers `templateIsSet()` / `templateValue()`. +- `models.CreateWorkspacePayload.Template` already exists; we now persist it when non-empty. - The create insert SQL becomes: ```sql INSERT INTO workspaces (id, name, role, tier, runtime, template, status, ...) - VALUES ($1, $2, $3, $4, $5, $6, 'provisioning', ...) + VALUES ($1, $2, $3, $4, $5, NULLIF($6, ''), 'provisioning', ...) ``` +- The write boundary validates `template` against `templateRepoByName` before insert. Unknown values fail with a clear 400, matching the create-boundary fail-closed posture (#188). -### 4.2 Fetch-by-template with runtime fallback +### 5.2 Single asset-resolution chokepoint -Introduce a single resolver in `runtime_registry.go`: +Introduce a single resolver in `runtime_registry.go` (or a new `internal/handlers/template_assets.go` if the broker grows): ```go -func resolveTemplateIdentity(template, runtime string) (string, bool) { +// TemplateAssetResolution is the only thing callers of the asset +// channel need. In Phase 1 it carries a Gitea identity; in Phase 2 +// it can carry a broker-signed URL or an entitlement-bound fetcher. +type TemplateAssetResolution struct { + Identity string // "/@" (Phase 1) or signed URL (Phase 2) +} + +// resolveTemplateAssets maps a workspace's template/runtime to the +// manifest-registered asset source. It is the ONLY place that: +// 1. looks up templateRepoByName, +// 2. validates the allowlist, +// 3. decides whether to use the #833 platform-token path (Phase 1) +// or a brokered entitlement (Phase 2). +// Callers outside this chokepoint must never construct a fetch URL +// or hold the platform token. +func resolveTemplateAssets(ctx context.Context, template, runtime, workspaceID string) (TemplateAssetResolution, error) { if template != "" { - if id, ok := templateRepoByName[template]; ok { - return id.Repo + "@" + id.Ref, true + rr, ok := templateRepoByName[template] + if !ok { + return TemplateAssetResolution{}, fmt.Errorf("template %q is not in the manifest allowlist", template) } - return "", false + return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil } - return templateIdentityForRuntime(runtime) + rr, ok := templateRepoByName[runtime] + if !ok { + // external / kimi / kimi-cli / mock: no template assets. + return TemplateAssetResolution{}, nil + } + return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil } ``` Rules: 1. If `template` is non-empty and exists in `templateRepoByName`, use it. -2. If `template` is non-empty and **not** in the manifest, fail-closed (`false`). This matches the create-boundary fail-closed posture (#188). -3. If `template` is empty, fall back to the current `runtime` lookup for backward compatibility. +2. If `template` is non-empty and **not** in the manifest, fail-closed with an error. +3. If `template` is empty/null, fall back to the current `runtime` lookup for backward compatibility. +4. The resolver is authoritative for assets; `runtime` is authoritative for the engine. Precedence is acyclic: `runtime` must never re-derive or override an explicitly-set `template`. Call sites to update: -- `workspace_provision.go` `prepareProvisionContext` / `provisionWorkspaceCP`: replace `templateIdentityForRuntimeOrEmpty(payload.Runtime)` with `resolveTemplateIdentity(payload.Template, payload.Runtime)`. -- Any restart/reconcile path that rebuilds a `CreateWorkspacePayload` must populate `payload.Template` from the DB (see §4.4). +- `workspace_provision.go` `buildProvisionerConfig`: replace `templateIdentityForRuntimeOrEmpty(payload.Runtime)` with `resolveTemplateAssets(ctx, payload.Template, payload.Runtime, workspaceID).Identity`. +- Any restart/reconcile path that rebuilds a `CreateWorkspacePayload` must populate `payload.Template` from the DB (see §5.4). -### 4.3 Create path +### 5.3 Manifest-pin gate -No API shape change. `workspace.go:Create` already accepts `template` in the payload. The only change is to persist `payload.Template` into the new DB column. +`templateRepoByName` inherits the #2959 ancestor-of-default-branch gate: an unmerged or orphaned SHA in `manifest.json` must be rejected at service boot. This prevents a template pointer from resolving to an untrusted commit, regardless of how the pointer is reached. + +### 5.4 Create path + +No API shape change. `workspace.go:Create` already accepts `template` in the payload. The only changes are: +- Validate `payload.Template` against `templateRepoByName` at the write boundary. +- Persist `payload.Template` into the new DB column (NULL when empty). The existing runtime/model resolution from `config.yaml` stays unchanged — a caller can still pass `template=seo-agent` and have `runtime` default to `claude-code` from the manifest. -### 4.4 Restart / reconcile path +### 5.5 Restart / reconcile path `workspace_restart.go` builds a `CreateWorkspacePayload` from the DB row (line ~444): @@ -106,18 +148,18 @@ payload := withStoredCompute(ctx, id, models.CreateWorkspacePayload{ It must also read `template` from `workspaces` and set `payload.Template`. The simplest change is a new helper `withStoredTemplate(ctx, id, payload)` or an inline query; either way, restart provisions use the persisted template. -### 4.5 CP provision wire +### 5.6 CP provision wire Add `Template` to the wire shapes so the CP can record it for its own reconciliation and future marketplace use: -- `provisioner.WorkspaceConfig` gains `Template string`. +- `provisioner.WorkspaceConfig` gains `Template string` (or `*string`). - `cp_provisioner.go` `cpProvisionRequest` gains `Template string `json:"template,omitempty"``. -- `molecule-controlplane` `wsProvisionRequest` gains `Template string `json:"template"``. +- `molecule-controlplane` `wsProvisionRequest` gains `Template string `json:"template"`. - The CP stores `template` in the workspace record / `tenant_resources.metadata` / a new lightweight table (CP-side decision). For Phase 1 the only hard requirement is that the CP does not drop the field and can echo it back in status/reconcile responses. The CP image-selection path does **not** change: `runtime` still drives `workspace-template-` / `runtime_image_pins.template_name`. The `seo-agent` template uses the `claude-code` image via the existing manifest `"runtime": "claude-code"` mapping. -### 4.6 Setting `template` on an existing workspace +### 5.7 Setting `template` on an existing workspace The supported assignment path is a new **admin/CP-gated** endpoint: @@ -127,49 +169,66 @@ PATCH /workspaces/:id/template ``` Behavior: -- Validates that `template` is a known manifest entry. +- Validates that `template` is a known manifest entry (fail-closed; same allowlist as create). - Updates `workspaces.template`. - Returns `{ "status": "updated", "needs_restart": true }`, exactly like `PATCH /workspaces/:id runtime`. - Does **not** change `runtime`; the engine stays `claude-code`. -- Rejects attempts to set a template whose base runtime differs from the workspace's current runtime? **Decision point:** either reject or warn. Recommendation: allow but warn, because a future migration may intentionally retarget a workspace to a different template variant of the same runtime. +- Rejects a template whose manifest `runtime` differs from the workspace's current `runtime`. Phase 1 intentionally does not support cross-engine retargeting; same-engine template variants (e.g. different claude-code personas) are allowed. This endpoint is what the JRS support flow / CP automation calls to tag `template=seo-agent`. -### 4.7 Backfill migration +### 5.8 Backfill migration -Two-part backfill: +Two-part, fully idempotent backfill: -1. **Data-driven backfill** — workspaces that already have a template recorded in `workspace_config.data`: +1. **Explicit data-driven backfill** — workspaces that already have a template recorded in `workspace_config.data`: ```sql UPDATE workspaces w - SET template = COALESCE(NULLIF(c.data->>'template',''), w.template) + SET template = NULLIF(TRIM(c.data->>'template'), '') FROM workspace_config c WHERE c.workspace_id = w.id - AND w.template = '' - AND c.data->>'template' IS NOT NULL; + AND w.template IS NULL + AND NULLIF(TRIM(c.data->>'template'), '') IS NOT NULL + AND EXISTS ( + SELECT 1 FROM manifest_allowed_templates m + WHERE m.name = NULLIF(TRIM(c.data->>'template'), '') + ); ``` + (In practice the allowlist check is the same Go helper used at the write boundary.) -2. **Explicit SEO backfill** — run a one-off idempotent script for known SEO workspace IDs, starting with JRS `28f97a7f`. If additional SEO workspaces cannot be identified by `workspace_config.data`, use a conservative heuristic (runtime=`claude-code` + `workspace_config.data` contains SEO-specific env like `SERVICES_REPO_WEBSITE` or the role matches `/[Ss][Ee][Oo]/`) and log every row for human audit. +2. **SEO explicit-allowlist backfill** — run a one-off idempotent script for known SEO workspace IDs, starting with JRS `28f97a7f`. If additional SEO workspaces cannot be identified by `workspace_config.data`, use an **explicit workspace-ID allowlist** supplied by ops/PM, or the existing `seo-agent` manifest mapping. **Never** a loose string match on workspace name, env values, or role. -The migration file runs in the normal migration runner and is safe to re-apply. +Safety properties: +- **Idempotency:** gate on `WHERE template IS NULL` so re-runs are no-ops and manually-set templates are never clobbered. +- **Tight predicate:** exact workspace-ID allowlist or exact `workspace_config.data->>template` signal. +- **Canary first:** run JRS `28f97a7f` first, verify behavior, then fleet. +- **Reversible + resumable:** record the changed set; apply per-workspace transactionally; produce a coverage report (tagged/total). A mid-run failure leaves a clean mixed state that a re-run completes. +- **Rollback:** a companion script can reset `template = NULL` for the exact changed set if a catastrophic issue is found. -### 4.8 JRS verification +### 5.9 Readiness gate and mid-flight changes + +- **Probe-verified readiness.** The workspace is not considered successfully provisioned until the template's assets are fetched and present at the probe path (`/configs/system-prompt.md` per the #2955 lesson, plus `config.yaml`). If assets are missing, fail-closed with a `MISSING_ASSETS` provision abort and retry on the next reconcile (same backstop pattern as `MISSING_MODEL`, core#2594). +- **Fill-absent-only.** Asset delivery never overwrites files that are already present in `/configs/*` (#141 / #833). +- **Template change mid-flight.** Changing `template` triggers a controlled re-fetch + restart using the existing #2929 `restartSettleWindow` pattern. The fetch is idempotent and keyed on the CURRENT record value, so concurrent changes converge to the last writer without half-applied states. + +### 5.10 JRS verification After the backfill + a restart/re-provision of JRS `28f97a7f`: -- `provisioner.TemplateAssetFetcher.Load("molecule-ai/molecule-ai-workspace-template-seo-agent@")` returns `agent-skills/seo-all/**`. +- `resolveTemplateAssets("seo-agent", "claude-code", "28f97a7f")` returns `molecule-ai/molecule-ai-workspace-template-seo-agent@`. +- `giteaTemplateAssetFetcher.Load` returns `agent-skills/seo-all/**`. - The workspace boots with a non-stub `/configs/config.yaml` and `agent_card.skills` contains the SEO skill package. - A simple smoke check: `agent_card.skills | length > 0` and the `/seo-*` slash commands are registered. -## 5. Build split +## 6. Build split - **Dev Engineer A (Kimi) / molecule-core** - - DB migration adding `workspaces.template`. - - `models.Workspace` + create insert SQL. - - `resolveTemplateIdentity` + update fetcher call sites. + - DB migration adding nullable `workspaces.template`. + - `models.Workspace` + create insert SQL + NULL normalization. + - `resolveTemplateAssets` chokepoint + update fetcher call sites. - Restart path reads `template` from DB. - - `PATCH /workspaces/:id/template` handler + tests. - - Backfill migration / script. - - Core-side unit + integration tests. + - `PATCH /workspaces/:id/template` handler + allowlist validation + tests. + - Idempotent backfill migration / script + canary/rollback runbook. + - Core-side unit + integration tests (allowlist, chokepoint, backfill, readiness). - **Dev Engineer B (MiniMax) / molecule-controlplane** - Accept `template` in `wsProvisionRequest`. @@ -178,43 +237,52 @@ After the backfill + a restart/re-provision of JRS `28f97a7f`: - JRS tagging automation / endpoint if needed. - CP-side tests. -## 6. Test plan +## 7. Test plan **Unit / integration (molecule-core)** -- `TestResolveTemplateIdentity`: +- `TestResolveTemplateAssets`: - template set, known → returns repo@ref. - - template set, unknown → false. + - template set, unknown → error (fail-closed). - template empty, runtime known → returns runtime's default template. - template empty, runtime external/kimi → empty (skip fetcher). -- `TestCreateWorkspace_PersistsTemplate`: create with `template=seo-agent` stores `template=seo-agent`, `runtime=claude-code`. -- `TestRestartWorkspace_UsesStoredTemplate`: restart reads `template` from DB and passes it to the provisioner. -- `TestPatchTemplate`: admin endpoint updates template and returns `needs_restart`. -- Migration test: existing workspace with `workspace_config.data.template="seo-agent"` gets backfilled. +- `TestCreateWorkspace_PersistsTemplate`: create with `template=seo-agent` stores `template=seo-agent`, `runtime=claude-code`; unknown template rejected. +- `TestRestartWorkspace_UsesStoredTemplate`: restart reads `template` from DB and passes it to the resolver. +- `TestPatchTemplate`: admin endpoint rejects unknown template, updates known template, returns `needs_restart`, rejects cross-engine template. +- Migration test: existing workspace with `workspace_config.data->>template="seo-agent"` gets backfilled; manually-set templates are not clobbered. +- Readiness test: missing `/configs/system-prompt.md` after fetch aborts with `MISSING_ASSETS`. **E2E** - Staging SEO workspace created with `template=seo-agent` boots with skills. - JRS `28f97a7f` after tagging + restart: `agent_card.skills > 0`. - Existing plain `claude-code` workspace without `template` continues to use `claude-code-default` after the change. -## 7. Rollout +## 8. Rollout 1. Land molecule-core PR: model + migration + resolver + restart + `PATCH /template` + backfill + tests. 2. Land molecule-controlplane PR: accept/store `template`. -3. Run backfill in prod; confirm JRS `28f97a7f` row has `template=seo-agent`. +3. Run backfill in prod (canary JRS `28f97a7f` first); confirm row has `template=seo-agent`. 4. Trigger restart/re-provision for JRS; verify skills. -5. Tag remaining SEO workspaces and repeat verification. +5. Tag remaining SEO workspaces from explicit allowlist and repeat verification. 6. Update RFC #2948 issue to mark Phase 1 complete and link Phase 2 design. -## 8. Open questions +## 9. Open questions -1. **Default value:** `DEFAULT ''` vs `DEFAULT NULL`? `''` keeps the Go model simple (`string` not `sql.NullString`) and makes the empty-template check identical to the empty-runtime check already in the codebase. -2. **CP storage:** Should CP add a dedicated `workspace_templates` column/table or just mirror into `tenant_resources.metadata`? MiniMax should pick the cheapest option that supports Phase 2 marketplace queries. -3. **Cross-runtime template changes:** Should `PATCH /template` reject a template whose manifest `runtime` differs from the workspace's current `runtime`? Recommend allowing it with a warning for same-runtime variants, rejecting for incompatible engines. -4. **Backfill scope:** Do we have a canonical list of all SEO workspaces beyond JRS? The migration will use `workspace_config.data->>template` first; the explicit list can be supplied by ops/PM. +1. **CP storage:** Should CP add a dedicated `workspace_templates` column/table or just mirror into `tenant_resources.metadata`? MiniMax should pick the cheapest option that supports Phase 2 marketplace queries. +2. **Same-engine variant policy:** Should `PATCH /template` allow a template whose manifest `runtime` matches the workspace runtime but whose `config.yaml` declares a different seeded runtime? The runtime-seed mismatch abort (#2027) will catch contradictions at provision time; the endpoint itself can stay runtime-agnostic. -## 9. Decision log +## 10. Decision log - **Single `template` field is enough for now** — CTO directive. No need for `template_version` or `template_source` until Phase 2. +- **Nullable `template` with `NULL` = unset** — enables the idempotent backfill predicate `WHERE template IS NULL` and makes unset behavior identical to today's runtime-fallback. - **Template-first resolution, runtime fallback** — preserves all existing workspaces; no flag day. - **New `PATCH /workspaces/:id/template`, not `PATCH runtime=seo-agent`** — closes the rejected overload path permanently. -- **Backfill from `workspace_config.data` + explicit JRS list** — safest, most auditable way to recover the template that was already recorded at create time. +- **`template` is manifest-allowlist only, validated at write and fetch** — closes the free-string / path-traversal / SSRF surface flagged by Researcher. +- **Single `resolveTemplateAssets` chokepoint** — the Phase 2 broker seam; the only place that knows about #833 token / fetch identity. +- **Tight SEO backfill predicate** — exact workspace-ID allowlist or `workspace_config.data->>template`, never a loose string match. +- **Probe-verified readiness + MISSING_ASSETS fail-closed** — extends the #2955 / core#2594 pattern to template assets. + +## 11. Top-3 decisions before coding + +1. **The broker chokepoint:** `resolveTemplateAssets(ctx, template, runtime, workspaceID) (TemplateAssetResolution, error)` lives in `runtime_registry.go` (or a new `internal/handlers/template_assets.go`). It is the sole caller of `templateRepoByName`, the sole place that knows about the #833 platform-token path, and the only seam the Phase 2 entitlement broker needs to wrap. +2. **The SEO backfill predicate:** idempotent `WHERE template IS NULL`, exact workspace-ID allowlist (JRS `28f97a7f` first) or exact `workspace_config.data->>template` signal, canary → fleet, resumable and reversible with a recorded changed-set and coverage report. +3. **The readiness gate:** assets are not considered delivered until the probe path (`/configs/system-prompt.md` / `config.yaml`) verifies them; missing assets abort with `MISSING_ASSETS` and retry on next reconcile. A mid-flight `template` change triggers a controlled re-fetch + restart inside the #2929 settle window. -- 2.52.0 From f90fe2f3933f109b3ed322f7a498eeed65dc44ea Mon Sep 17 00:00:00 2001 From: "Molecule AI Dev Engineer A (Kimi)" Date: Tue, 16 Jun 2026 04:11:31 +0000 Subject: [PATCH 3/3] docs(rfc#2948): add forward constraint note on model resolution at provision time Clarifies that the MISSING_MODEL gate reads the persisted workspace.model value at provision time, before the template config.yaml is fetched into the container. Therefore model cannot be template-delivered in Phase 1; template-delivered assets are limited to prompts, MCP wiring, and identity files. Aligns with #2966 and Researcher 103961. --- .../rfc-2948-phase1-template-engine-decoupling.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/docs/design/rfc-2948-phase1-template-engine-decoupling.md b/docs/design/rfc-2948-phase1-template-engine-decoupling.md index 7119d5829..ff608c598 100644 --- a/docs/design/rfc-2948-phase1-template-engine-decoupling.md +++ b/docs/design/rfc-2948-phase1-template-engine-decoupling.md @@ -134,6 +134,17 @@ No API shape change. `workspace.go:Create` already accepts `template` in the pay The existing runtime/model resolution from `config.yaml` stays unchanged — a caller can still pass `template=seo-agent` and have `runtime` default to `claude-code` from the manifest. +#### Forward constraint: `model` is NOT template-delivered at provision time + +The `MISSING_MODEL` gate runs at **provision time** and reads the workspace row's `model` column. At that point the template's `config.yaml` has **not** been fetched into the workspace container yet, so `model` cannot be derived from the template at provision time. Phase 1 therefore keeps the current create-time contract: + +- `model` is resolved **once**, at workspace **create** time, from the caller-supplied value or from the chosen template's `config.yaml` manifest/metadata (the same path that already exists today). +- The resolved `model` is persisted in the `workspaces` row and passed to the CP in `cpProvisionRequest`. +- The `MISSING_MODEL` provision-time gate reads this persisted value only; it never waits on a template fetch. +- Template-delivered assets in Phase 1 are limited to **prompts, MCP wiring, and identity files** (`config.yaml` content, `mcp_servers.yaml`, `prompts/concierge.md`, `identity-fallback.sh`). The model identifier is out of scope for template delivery. + +This preserves the #2966 fix and Researcher 103961 posture: the provision-time model gate remains fail-closed and independent of the template-asset fetch channel. + ### 5.5 Restart / reconcile path `workspace_restart.go` builds a `CreateWorkspacePayload` from the DB row (line ~444): -- 2.52.0