docs(rfc): marketplace template/plugin delivery (entitlement-brokered, encrypted, automatic) #2948

Open
core-devops wants to merge 3 commits from docs/rfc-marketplace-delivery into main
# RFC: Marketplace template/plugin delivery (entitlement-brokered, encrypted, automatic)
**Status:** Phase 1 design draft for CTO/driver sign-off
**Author:** CEO Assistant (on CTO direction, 2026-06-15) + Dev Engineer A (Kimi) (Phase 1 buildable spec)
**Related:** RFC #2843 (decouple config/skill delivery from Secrets Manager — the
public/token fetch this generalizes), CP #828 (the interim platform-token path),
template repos (claude-code/hermes/codex public; seo-agent private — our dogfood)
## 1. Summary
Molecule will run a **marketplace**: third-party developers publish **templates
and plugins** (via repo) that other orgs install into their workspaces. Sellers
must be able to keep their work **private and IP-protected** ("private for some
people"), with **encrypted** storage + delivery, and buyers must receive what
they're **entitled to**, **automatically**, at scale (design target: **~10K
published plugins, high daily install volume**).
The delivery path we have today does **not** meet this. RFC #2843 added two modes
to the runtime fetcher: **public-fetch** (tokenless, for our OPEN templates) and
**token-fetch** (a single platform-wide `MOLECULE_TEMPLATE_REPO_TOKEN`, CP #828,
for our OWN private templates). The platform token is **legitimate only because
the platform is the sole "seller" of its own templates**. As a *marketplace*
primitive it fails on every axis:
- **No per-seller isolation** — one token reads *all* private repos; a single
leak exposes every seller's IP. Sellers won't publish.
- **No entitlement gating** — a fetch succeeds because the token exists, not
because the org **licensed/purchased** the plugin.
- **No artifact encryption** — IP sits readable to anything holding the token.
- **Manual + O(plugins)** — minting/scoping per template is human work; it does
not survive 10K plugins.
**Proposal:** a **delivery broker** + **entitlement service** + **encrypted
artifact store**, with **no standing god-credential** in workspaces and
**automatic** purchase→entitlement→delivery. The platform's own templates become
a *special case* of the same system (we are seller #0).
## 2. Goals / non-goals
**Goals**
- Per-seller IP isolation; a compromise of one tenant/box never exposes other
sellers' artifacts.
- **Entitlement-gated** delivery: an org receives a plugin/template **iff** it
holds a current entitlement (purchase / subscription / free-grant).
- **Encrypted** artifacts at rest and in delivery; sellers' source is never
readable by infra operators by default.
- **Automatic** end-to-end: publish → buy → entitlement → delivered on next
provision/restart. Zero per-plugin manual ops.
- **Revocation + versioning**: unpublish/refund/expiry → next fetch denied;
buyers pin a version; sellers ship updates.
- **Scale**: ~10K plugins, high install volume — horizontal, cache/CDN-friendly,
no per-install human step.
**Non-goals (this RFC)**
- Billing/payments mechanics (separate; this RFC consumes an entitlement signal).
- The marketplace UI/discovery.
- Replacing the **public-fetch** path for our OPEN templates (it stays).
## 3. Design
### 3.1 Components
| Component | Responsibility |
|---|---|
| **Entitlement service** | Source of truth: `(org_id, plugin_id, version) → entitled?` (purchase / subscription / free-grant), with expiry + revocation. |
| **Delivery broker** | Authenticates the requesting **workspace's own identity** (its workspace token / org identity), checks entitlement, returns a **short-lived, scoped, signed artifact URL** (or streams the decrypted bytes). Stateless, with cached entitlement reads. |
| **Encrypted artifact store** | Published artifacts stored encrypted (envelope encryption; per-seller or per-artifact data keys wrapped by a KMS CMK). Object store + CDN for signed-URL delivery. |
| **Publish pipeline** | Seller repo → CI packages the template/plugin → encrypts → registers `(plugin_id, version, seller, checksum)` → uploads to the artifact store. |
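The entitlement lookup in the table can be sketched as a fail-closed check. This is a minimal illustration, not the service design: the `Entitlement` fields, the in-memory `Store`, and the `demo` helper are all hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"time"
)

// EntitlementKey mirrors the (org_id, plugin_id, version) tuple from the table.
type EntitlementKey struct {
	OrgID, PluginID, Version string
}

// Entitlement is a hypothetical record shape; field names are illustrative.
type Entitlement struct {
	Kind      string    // "purchase", "subscription", or "free-grant"
	ExpiresAt time.Time // zero value means no expiry
	Revoked   bool
}

// Store is an in-memory stand-in for the entitlement service's backing store.
type Store map[EntitlementKey]Entitlement

// Entitled reports whether the org holds a current entitlement.
// Missing, revoked, and expired records all deny (fail closed).
func (s Store) Entitled(k EntitlementKey, now time.Time) bool {
	e, ok := s[k]
	if !ok || e.Revoked {
		return false
	}
	if !e.ExpiresAt.IsZero() && !now.Before(e.ExpiresAt) {
		return false
	}
	return true
}

// demo builds a small store with one free grant and one expired subscription.
func demo() (Store, time.Time) {
	now := time.Date(2026, 6, 15, 0, 0, 0, 0, time.UTC)
	return Store{
		{"org-1", "seo-agent", "1.0.0"}: {Kind: "free-grant"},
		{"org-1", "plugin-x", "2.1.0"}:  {Kind: "subscription", ExpiresAt: now.Add(-time.Hour)},
	}, now
}

func main() {
	s, now := demo()
	fmt.Println(s.Entitled(EntitlementKey{"org-1", "seo-agent", "1.0.0"}, now))
	fmt.Println(s.Entitled(EntitlementKey{"org-1", "plugin-x", "2.1.0"}, now))
}
```

Passing `now` explicitly keeps expiry handling testable and makes the deny-by-default behavior easy to audit.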
### 3.2 Delivery flow (provision/restart)
1. Workspace provisions/reconciles → asks the broker: *"deliver the assets org X
is entitled to for this workspace."*
2. Broker authenticates the workspace's **own** identity (not a shared token),
resolves the org's entitlements, and for each entitled `(plugin, version)`
returns a **short-lived signed URL** (minutes TTL, scoped to that artifact).
3. Workspace fetches via the signed URL (CDN); artifact is decrypted for the
entitled fetch (broker-side, or per-buyer envelope key).
4. No long-lived, broadly-scoped credential ever lives in the box.
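Step 2's short-lived, artifact-scoped URL could be built with an HMAC over the artifact name and expiry. The sketch below assumes HMAC-SHA256, `exp`/`sig` query parameters, and a placeholder CDN host; none of these are specified by this RFC, and a real deployment would more likely use the object store's or CDN's native URL signing.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// sign returns the hex HMAC-SHA256 of "artifact\nexpiry" under key.
func sign(key []byte, artifact string, exp int64) string {
	m := hmac.New(sha256.New, key)
	m.Write([]byte(artifact + "\n" + strconv.FormatInt(exp, 10)))
	return hex.EncodeToString(m.Sum(nil))
}

// SignedURL builds a URL scoped to one artifact that expires after ttl.
func SignedURL(key []byte, base, artifact string, now time.Time, ttl time.Duration) string {
	exp := now.Add(ttl).Unix()
	q := url.Values{}
	q.Set("exp", strconv.FormatInt(exp, 10))
	q.Set("sig", sign(key, artifact, exp))
	return base + "/" + url.PathEscape(artifact) + "?" + q.Encode()
}

// Verify checks the expiry and the signature for the requested artifact.
func Verify(key []byte, artifact, rawURL string, now time.Time) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	exp, err := strconv.ParseInt(u.Query().Get("exp"), 10, 64)
	if err != nil || now.Unix() > exp {
		return false
	}
	want := sign(key, artifact, exp)
	return hmac.Equal([]byte(want), []byte(u.Query().Get("sig")))
}

// demo signs one URL and verifies it fresh, expired, and for another artifact.
func demo() (fresh, expired, wrongArtifact bool) {
	key := []byte("demo-signing-key") // illustrative only
	now := time.Unix(1_800_000_000, 0)
	u := SignedURL(key, "https://cdn.example.invalid", "seo-agent@1.0.0", now, 5*time.Minute)
	fresh = Verify(key, "seo-agent@1.0.0", u, now.Add(time.Minute))
	expired = Verify(key, "seo-agent@1.0.0", u, now.Add(10*time.Minute))
	wrongArtifact = Verify(key, "other-plugin@1.0.0", u, now.Add(time.Minute))
	return
}

func main() {
	fmt.Println(demo())
}
```

Because the artifact name is part of the signed message, a URL issued for one artifact cannot be replayed against another, which is the "scoped" property the flow relies on.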
### 3.3 The platform as "seller #0"
Our own templates are modeled as entitlements every org holds (free-grant for
the open ones; platform-internal for private like seo-agent). This means:
- The **public-fetch** path (RFC #2843) remains for our OPEN templates — cheapest
path, no broker needed.
- Our OWN private templates migrate from the **#828 platform token** to the
broker (as a free platform-internal entitlement) once the broker exists.
- We **dogfood** the marketplace with our own seo-agent before any third party.
### 3.4 Revocation, versioning, integrity
- Entitlement revoke (unpublish / refund / expiry) → broker denies next fetch;
already-issued signed URLs expire within their TTL (minutes), so residual
access is bounded by that TTL.
- Buyers pin a version; sellers publish new versions; reconcile-on-boot
(RFC #2843) picks up the entitled version.
- Artifact checksum verified post-fetch; signed manifests prevent tampering.
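The post-fetch checksum step might look like the sketch below. It assumes the registered checksum is a hex-encoded SHA-256 digest, which this RFC does not specify; `VerifyChecksum` and `ErrChecksumMismatch` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
)

// ErrChecksumMismatch signals that fetched bytes differ from the registered digest.
var ErrChecksumMismatch = errors.New("artifact checksum mismatch")

// VerifyChecksum compares the SHA-256 of artifact with the hex digest that the
// publish pipeline registered (digest algorithm is an assumption of this sketch).
func VerifyChecksum(artifact []byte, wantHex string) error {
	want, err := hex.DecodeString(wantHex)
	if err != nil {
		return fmt.Errorf("bad checksum encoding: %w", err)
	}
	got := sha256.Sum256(artifact)
	if subtle.ConstantTimeCompare(got[:], want) != 1 {
		return ErrChecksumMismatch
	}
	return nil
}

func main() {
	// Well-known SHA-256 test vector for the ASCII string "abc".
	const abc = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(VerifyChecksum([]byte("abc"), abc))
	fmt.Println(VerifyChecksum([]byte("abd"), abc))
}
```

A checksum alone only detects corruption relative to the registry entry; the signed manifest mentioned above is what would protect the registry entry itself.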
## 4. Phase 1: `template` field decoupling (platform-owned templates only)
Before the full broker exists, we will decouple a workspace's **runtime engine**
from its **template identity/assets** by adding an explicit `template` field.
This unblocks e.g. `runtime=claude-code` with `template=seo-agent`.
Phase 1 is intentionally **platform-owned templates only**; it uses the existing
#833 platform-token path as a temporary backend, but structures the code so the
broker can replace it without re-plumbing call sites.
**This section is the concrete buildable design the CTO must approve before
coding starts.** Implementation tracking: molecule-core#2980,
molecule-controlplane#846; detailed sub-RFC in molecule-core#2977.
### 4.1 What changes (Phase 1 buildable spec)
| Area | Change |
|---|---|
| DB | Add nullable `workspaces.template` column; `NULL` = runtime fallback. |
| Model | `Workspace.Template *string`; persist `CreateWorkspacePayload.Template`. |
| Resolver | Single `resolveTemplateAssets(ctx, template, runtime, workspaceID)` chokepoint in `runtime_registry.go`. |
| Write boundary | Validate `template` against manifest allowlist at create + `PATCH /workspaces/:id/template`. |
| Fetch boundary | Resolver allowlist check; unknown template fails closed. |
| CP wire | Forward `Template` and `TemplateAssets` in `cpProvisionRequest`. |
| Backfill | Idempotent `WHERE template IS NULL`; exact workspace-ID allowlist or `workspace_config.data->>'template'`; JRS `28f97a7f` canary. |
| Readiness | Probe `/configs/system-prompt.md` + `config.yaml`; `MISSING_ASSETS` fail-closed retry. |
### 4.2 Security model
- **`template` is an allowlist, never a free string.** It keys into the
**manifest registry** (the same SSOT that #2959 pins to immutable commits).
A value not in the manifest is rejected at the WRITE boundary (create/PATCH)
**and** at the fetch boundary (defense-in-depth). It never falls through to a
constructed path.
- **Platform-owned templates only.** The allowlist for Phase 1 is the set of
platform-owned manifest entries (open templates + our private templates like
seo-agent). No third-party or arbitrary private repo may be named.
- **Single chokepoint: `resolveTemplateAssets(ctx, template, runtime, workspaceID)`.**
All asset resolution for a `template` value goes through this function. In
Phase 1 it returns the #833 platform-token fetch identity; in Phase 2 the same
chokepoint swaps to brokered entitlement + signed URLs. No other call site
holds the platform token or constructs a template fetch URL.
- **No standing god-credential in the workspace.** The platform token is held
server-side by the chokepoint, scoped read-only to platform-owned template
repos, and never exposed to the box. The workspace receives only the final
assets (or a short-lived signed URL once the broker lands).
- **Tenant isolation.** The fetch uses only the template-scoped read-only token;
it must never escalate to the requesting workspace's tenant secrets and must
never let one tenant's `template` value read another tenant's data.
- **SSRF guard.** If `template` ever influences a fetch URL, the HTTP path must
apply the #2132 posture: dial-time IP guard, no redirects, explicit allowlist.
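One reading of the SSRF posture above (dial-time IP guard, no redirects, explicit allowlist) is sketched below using only the Go standard library. The details of #2132 are not reproduced in this RFC, so the blocked ranges, the HTTPS-only rule, and the names `allowDial`, `hostAllowed`, and `newGuardedClient` are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/url"
	"syscall"
	"time"
)

var errBlockedAddr = errors.New("dial blocked: address is not publicly routable")

// allowDial runs at dial time, after DNS resolution, so it sees the real IP
// and cannot be bypassed by a hostname that resolves to an internal address.
func allowDial(_, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	if ip == nil || ip.IsLoopback() || ip.IsPrivate() || ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || ip.IsMulticast() {
		return errBlockedAddr
	}
	return nil
}

// hostAllowed enforces the explicit host allowlist before a request is built.
func hostAllowed(rawURL string, allow map[string]bool) bool {
	u, err := url.Parse(rawURL)
	return err == nil && u.Scheme == "https" && allow[u.Hostname()]
}

// newGuardedClient wires the dial guard in and refuses to follow redirects.
func newGuardedClient() *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second, Control: allowDial}
	return &http.Client{
		Timeout:   30 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
		// Hand 3xx responses back to the caller instead of following them.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
}

func main() {
	_ = newGuardedClient()
	fmt.Println(allowDial("tcp", "169.254.169.254:80", nil))
	fmt.Println(hostAllowed("https://git.example.invalid/org/repo",
		map[string]bool{"git.example.invalid": true}))
}
```

Checking the IP in `Dialer.Control` rather than before the request closes the gap where a hostname passes the allowlist but later resolves to a metadata or private address.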
### 4.3 Workspace model and migration
Migration (idempotent, additive):
```sql
ALTER TABLE workspaces
ADD COLUMN IF NOT EXISTS template TEXT;
```
- `NULL` means "no installed template — use runtime fallback". This is the
steady state for every existing workspace and for bare `{"name":...}` creates.
- `models.Workspace` gains ``Template *string `json:"template,omitempty" db:"template"` ``
(or `sql.NullString`).
- `models.CreateWorkspacePayload.Template` already exists; persist it when
non-empty.
- Create insert SQL becomes:
```sql
INSERT INTO workspaces (..., runtime, template, status, ...)
VALUES (..., $5, NULLIF($6, ''), 'provisioning', ...)
```
### 4.4 Single asset-resolution chokepoint
```go
// TemplateAssetResolution is the only thing callers of the asset channel need.
// In Phase 1 it carries a Gitea identity; in Phase 2 it can carry a broker-signed
// URL or an entitlement-bound fetcher.
type TemplateAssetResolution struct {
	Identity string // "<owner>/<repo>@<ref>" (Phase 1) or signed URL (Phase 2)
}

// resolveTemplateAssets maps a workspace's template/runtime to the manifest-
// registered asset source. It is the ONLY place that:
//  1. looks up templateRepoByName,
//  2. validates the allowlist,
//  3. decides whether to use the #833 platform-token path (Phase 1) or a
//     brokered entitlement (Phase 2).
func resolveTemplateAssets(
	ctx context.Context,
	template, runtime, workspaceID string,
) (TemplateAssetResolution, error) {
	if template != "" {
		rr, ok := templateRepoByName[template]
		if !ok {
			return TemplateAssetResolution{},
				fmt.Errorf("template %q is not in the manifest allowlist", template)
		}
		return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil
	}
	rr, ok := templateRepoByName[runtime]
	if !ok {
		// external / kimi / kimi-cli / mock: no template assets.
		return TemplateAssetResolution{}, nil
	}
	return TemplateAssetResolution{Identity: rr.Repo + "@" + rr.Ref}, nil
}
```
Rules:
1. If `template` is set and known, use it.
2. If `template` is set and unknown, fail closed.
3. If `template` is unset, fall back to the current `runtime` lookup.
4. `runtime` is authoritative for the engine; `template` is authoritative for
assets. Precedence is acyclic.
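The rules can be exercised against a stub registry, as in the sketch below. `resolve` is a condensed copy of the resolver that drops `ctx` and `workspaceID` for brevity, and the repo names and pins in the stub map are placeholders, not real manifest entries.

```go
package main

import "fmt"

type repoRef struct{ Repo, Ref string }

// Stub registry; the real templateRepoByName is built from the manifest.
var templateRepoByName = map[string]repoRef{
	"claude-code": {"example-org/template-claude-code", "pin-a"},
	"seo-agent":   {"example-org/template-seo-agent", "pin-b"},
}

// resolve is a condensed copy of resolveTemplateAssets returning only the identity.
func resolve(template, runtime string) (string, error) {
	name := runtime
	if template != "" {
		if _, ok := templateRepoByName[template]; !ok {
			// Rule 2: a set-but-unknown template never falls back to runtime.
			return "", fmt.Errorf("template %q is not in the manifest allowlist", template)
		}
		name = template // Rule 1
	}
	rr, ok := templateRepoByName[name]
	if !ok {
		return "", nil // runtime without template assets (e.g. external)
	}
	return rr.Repo + "@" + rr.Ref, nil // Rule 1 or Rule 3
}

func main() {
	cases := []struct{ template, runtime string }{
		{"seo-agent", "claude-code"},  // rule 1: template wins for assets
		{"not-listed", "claude-code"}, // rule 2: fail closed
		{"", "claude-code"},           // rule 3: runtime fallback
		{"", "external"},              // no assets, no error
	}
	for _, c := range cases {
		id, err := resolve(c.template, c.runtime)
		fmt.Printf("template=%q runtime=%q identity=%q err=%v\n", c.template, c.runtime, id, err)
	}
}
```

The second case is the important one: an unknown `template` returns an error even though the `runtime` would have resolved on its own.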
Call sites:
- `workspace_provision.go` `buildProvisionerConfig` sets `cfg.TemplateIdentity`
from the `Identity` field returned by `resolveTemplateAssets(...)` and
propagates any error (fail closed).
- Restart/reconcile paths populate `payload.Template` from the DB row.
### 4.5 Create, restart, and PATCH paths
**Create path:** `workspace.go:Create` already accepts `template`. Validate it
against `templateRepoByName` at the write boundary and persist it (NULL when
empty). Runtime/model resolution from `config.yaml` stays unchanged.
**Restart path:** `workspace_restart.go` reads the stored `template` from the DB
and sets `payload.Template` when rebuilding `CreateWorkspacePayload`.
**PATCH /workspaces/:id/template:**
```
PATCH /workspaces/:id/template
{ "template": "seo-agent" }
```
- Validates `template` is a known manifest entry (fail-closed).
- Updates `workspaces.template`.
- Returns `{ "status": "updated", "needs_restart": true }`.
- Does **not** change `runtime`.
- Rejects cross-engine template changes in Phase 1, i.e. a template whose
manifest `runtime` differs from the workspace's current `runtime`.
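A handler enforcing these bullets might look like the sketch below. The status codes (422 for an unknown template, 409 for a cross-engine change), the `manifestEntry` shape, and the in-memory `workspace` are assumptions made for illustration; "cross-engine" is interpreted here as the template's manifest runtime differing from the workspace's runtime.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

type manifestEntry struct{ Runtime string } // hypothetical manifest shape

type workspace struct {
	Runtime  string
	Template *string
}

type patchResponse struct {
	Status       string `json:"status"`
	NeedsRestart bool   `json:"needs_restart"`
}

// Stub manifest allowlist; entries are illustrative.
var manifest = map[string]manifestEntry{
	"seo-agent":   {Runtime: "claude-code"},
	"claude-code": {Runtime: "claude-code"},
	"hermes":      {Runtime: "hermes"},
}

// patchTemplate validates the requested template and updates only ws.Template.
func patchTemplate(ws *workspace) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var body struct {
			Template string `json:"template"`
		}
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, "invalid JSON body", http.StatusBadRequest)
			return
		}
		entry, ok := manifest[body.Template]
		if !ok { // fail closed: not in the allowlist
			http.Error(w, "template is not in the manifest allowlist", http.StatusUnprocessableEntity)
			return
		}
		if entry.Runtime != ws.Runtime { // runtime is never changed by this endpoint
			http.Error(w, "cross-engine template change rejected", http.StatusConflict)
			return
		}
		ws.Template = &body.Template
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(patchResponse{Status: "updated", NeedsRestart: true})
	}
}

// demo sends one PATCH for a claude-code workspace and reports the outcome.
func demo(template string) (code int, stored string) {
	ws := &workspace{Runtime: "claude-code"}
	req := httptest.NewRequest(http.MethodPatch, "/workspaces/28f97a7f/template",
		strings.NewReader(`{"template":"`+template+`"}`))
	rec := httptest.NewRecorder()
	patchTemplate(ws)(rec, req)
	if ws.Template != nil {
		stored = *ws.Template
	}
	return rec.Code, stored
}

func main() {
	fmt.Println(demo("seo-agent"))
	fmt.Println(demo("not-listed"))
	fmt.Println(demo("hermes"))
}
```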
### 4.6 Control-plane provision wire
- `provisioner.WorkspaceConfig` gains `Template string`.
- `cp_provisioner.go` `cpProvisionRequest` gains
``Template string `json:"template,omitempty"` `` and forwards existing
`TemplateAssets`.
- `molecule-controlplane` `wsProvisionRequest` gains `Template string`.
- The CP stores `template` in its workspace record/metadata and echoes it back
in status/reconcile responses.
- CP image selection still uses `runtime` (seo-agent uses the claude-code image
via the manifest `"runtime": "claude-code"` mapping).
### 4.7 Backfill migration (SEO workspaces)
Two-part, fully idempotent backfill:
1. **Data-driven backfill** — workspaces that already recorded a template in
`workspace_config.data`:
```sql
UPDATE workspaces w
   SET template = NULLIF(TRIM(c.data->>'template'), '')
  FROM workspace_config c
 WHERE c.workspace_id = w.id
   AND w.template IS NULL
   AND NULLIF(TRIM(c.data->>'template'), '') IS NOT NULL
   AND EXISTS (
         SELECT 1 FROM manifest_allowed_templates m
          WHERE m.name = NULLIF(TRIM(c.data->>'template'), '')
       );
```
2. **SEO explicit-allowlist backfill** — one-off idempotent script for known SEO
workspace IDs, starting with JRS `28f97a7f`. Never a loose string match on
name/env/role.
Safety properties:
- **Idempotency:** gate on `WHERE template IS NULL`.
- **Tight predicate:** exact workspace-ID allowlist or exact
`workspace_config.data->>'template'` signal.
- **Canary first:** JRS `28f97a7f`, verify, then fleet.
- **Reversible:** record changed set; companion script can reset `template = NULL`
if needed.
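The safety properties above can be modeled in memory to show how they interact. This is a sketch of the logic only, not the backfill script: `wsRow`, `applyBackfill`, and `revert` are hypothetical names, and a `nil` pointer stands in for SQL `NULL`.

```go
package main

import "fmt"

type wsRow struct {
	ID       string
	Template *string // nil models SQL NULL
}

// applyBackfill tags only rows that are in the exact workspace-ID allowlist,
// are still NULL, and map to a manifest-listed template. It returns the
// changed set so a companion step can revert.
func applyBackfill(rows []wsRow, byID map[string]string, manifest map[string]bool) []string {
	var changed []string
	for i := range rows {
		tmpl, listed := byID[rows[i].ID]
		if !listed || rows[i].Template != nil || !manifest[tmpl] {
			continue
		}
		t := tmpl
		rows[i].Template = &t
		changed = append(changed, rows[i].ID)
	}
	return changed
}

// revert resets template to NULL for exactly the recorded changed set.
func revert(rows []wsRow, changed []string) {
	set := make(map[string]bool, len(changed))
	for _, id := range changed {
		set[id] = true
	}
	for i := range rows {
		if set[rows[i].ID] {
			rows[i].Template = nil
		}
	}
}

// demo runs the canary twice, then reverts; it reports how many rows each
// run changed and whether the manually set row and the canary ended as expected.
func demo() (first, second int, manualKept, canaryReset bool) {
	manual := "claude-code"
	rows := []wsRow{{ID: "28f97a7f"}, {ID: "ws-manual", Template: &manual}, {ID: "ws-other"}}
	canary := map[string]string{"28f97a7f": "seo-agent", "ws-manual": "seo-agent"}
	manifest := map[string]bool{"seo-agent": true}

	changed := applyBackfill(rows, canary, manifest)
	first = len(changed)
	second = len(applyBackfill(rows, canary, manifest)) // idempotent re-run
	manualKept = *rows[1].Template == "claude-code"
	revert(rows, changed)
	canaryReset = rows[0].Template == nil
	return
}

func main() {
	fmt.Println(demo())
}
```

Running the canary first simply means calling the backfill with an allowlist that contains only the canary ID, then widening the allowlist after verification.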
### 4.8 Readiness gate and mid-flight changes
- **Probe-verified readiness.** Assets must be present at `/configs/system-prompt.md`
(the #2955 lesson) and `config.yaml`. If missing, abort with `MISSING_ASSETS`
and retry on next reconcile (same pattern as `MISSING_MODEL`, core#2594).
- **Fill-absent-only.** Asset delivery never overwrites files already present in
`/configs/*` (#141 / #833).
- **Template change mid-flight** triggers a controlled re-fetch + restart inside
the existing #2929 settle window. Fetch is idempotent and keyed on the CURRENT
record value.
- **Manifest pins must be merged commits.** `template`→manifest resolution
inherits the #2959 ancestor-of-default-branch gate.
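The probe and fill-absent-only behaviors can be sketched with the standard library. The sketch assumes both probe files live under the same config directory and treats `MISSING_ASSETS` as a sentinel error; `probeAssets`, `fillAbsent`, and `ErrMissingAssets` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// ErrMissingAssets models the MISSING_ASSETS abort reason.
var ErrMissingAssets = errors.New("MISSING_ASSETS")

// probeAssets fails closed unless both probe files exist as regular files.
func probeAssets(configDir string) error {
	for _, name := range []string{"system-prompt.md", "config.yaml"} {
		info, err := os.Stat(filepath.Join(configDir, name))
		if err != nil || !info.Mode().IsRegular() {
			return fmt.Errorf("%w: %s", ErrMissingAssets, name)
		}
	}
	return nil
}

// fillAbsent writes data only when path does not exist yet. O_EXCL makes the
// check-and-create atomic. It reports whether the file was written.
func fillAbsent(path string, data []byte) (bool, error) {
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if errors.Is(err, fs.ErrExist) {
		return false, nil // already present: never overwrite
	}
	if err != nil {
		return false, err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err == nil, err
}

// demo probes an empty directory, delivers assets, probes again, and then
// attempts a second delivery over an existing file.
func demo() (missingBefore, readyAfter, overwrote bool) {
	dir, err := os.MkdirTemp("", "configs")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	missingBefore = errors.Is(probeAssets(dir), ErrMissingAssets)
	fillAbsent(filepath.Join(dir, "system-prompt.md"), []byte("prompt"))
	fillAbsent(filepath.Join(dir, "config.yaml"), []byte("runtime: claude-code\n"))
	readyAfter = probeAssets(dir) == nil
	overwrote, _ = fillAbsent(filepath.Join(dir, "config.yaml"), []byte("stub"))
	return
}

func main() {
	fmt.Println(demo())
}
```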
### 4.9 JRS verification
After backfill + restart/re-provision of JRS `28f97a7f`:
- `resolveTemplateAssets(ctx, "seo-agent", "claude-code", "28f97a7f")` resolves to
`molecule-ai/molecule-ai-workspace-template-seo-agent@<pin>`.
- Template asset fetcher returns `agent-skills/seo-all/**`.
- Workspace boots with non-stub `/configs/config.yaml` and `agent_card.skills > 0`.
- Smoke check: `/seo-*` slash commands are registered.
### 4.10 Test plan
**Unit / integration (molecule-core)**
- `TestResolveTemplateAssets`: template set known, template set unknown fails
closed, template empty runtime known, template empty external/kimi returns empty.
- `TestCreateWorkspace_PersistsTemplate`: create with `template=seo-agent` stores
`template=seo-agent`, `runtime=claude-code`; unknown template rejected.
- `TestRestartWorkspace_UsesStoredTemplate`: restart reads `template` from DB.
- `TestPatchTemplate`: rejects unknown, updates known, returns `needs_restart`,
rejects cross-engine.
- Migration test: backfill from `workspace_config.data->>'template'` works and
does not clobber manually-set rows.
- Readiness test: missing probe path aborts with `MISSING_ASSETS`.
**E2E**
- Staging SEO workspace created with `template=seo-agent` boots with skills.
- JRS `28f97a7f` after tagging + restart: `agent_card.skills > 0`.
- Existing plain `claude-code` workspace without `template` continues to use
`claude-code-default`.
### 4.11 Rollout
1. Land molecule-core PR: model + migration + resolver + restart + `PATCH /template`
+ backfill + tests.
2. Land molecule-controlplane PR: accept/store `template`.
3. Run backfill in prod (canary JRS `28f97a7f` first).
4. Trigger restart/re-provision for JRS; verify skills.
5. Tag remaining SEO workspaces from explicit allowlist and repeat verification.
6. Update RFC #2948 issue to mark Phase 1 complete and link Phase 2 design.
### 4.12 Top-3 decisions before coding
1. **The broker chokepoint:** `resolveTemplateAssets(ctx, template, runtime, workspaceID)`
lives in `runtime_registry.go`. It is the sole caller of `templateRepoByName`,
the sole place that knows about the #833 platform-token path, and the only
seam the Phase 2 entitlement broker needs to wrap.
2. **The SEO backfill predicate:** idempotent `WHERE template IS NULL`, exact
workspace-ID allowlist (JRS `28f97a7f` first) or exact
`workspace_config.data->>'template'` signal, canary → fleet, resumable and
reversible with a recorded changed-set.
3. **The readiness gate:** probe-verified assets at `/configs/system-prompt.md` /
`config.yaml`; `MISSING_ASSETS` fail-closed + retry; mid-flight `template`
changes use the #2929 settle window.
## 5. Relationship to RFC #2843 / #828
- **Public-fetch** (open templates): unchanged, keep.
- **#828 platform token** (our own private templates): **interim**. Legitimate
today (we are sole seller), but **must not** become the marketplace mechanism.
Superseded by the broker (our private templates → platform-internal
entitlements) once it lands.
- The runtime fetcher already abstracts the source; adding a **broker fetch
mode** alongside public/token is the runtime change.
## 6. Security
- **No standing god-credential** in workspaces — per-fetch authz, short-lived
scoped signed URLs only.
- **Encryption at rest** (KMS-wrapped per-artifact data keys); operators can't
read seller source by default; audit every decrypt/deliver.
- Per-seller blast-radius isolation; key compromise scoped to one seller.
- Entitlement checks are server-side; the workspace cannot self-assert
entitlement.
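The "KMS-wrapped per-artifact data keys" bullet describes envelope encryption. The sketch below illustrates the pattern with AES-256-GCM for both layers and a local master key standing in for the KMS CMK; in a real deployment the wrap and unwrap calls would go to the KMS, and the cipher choice is still an open question in this RFC.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext under key and returns nonce || ciphertext.
func seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal and fails if the key is wrong or the data was modified.
func open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	n := gcm.NonceSize()
	if len(sealed) < n {
		return nil, errors.New("sealed blob too short")
	}
	return gcm.Open(nil, sealed[:n], sealed[n:], nil)
}

// Envelope is what the artifact store would hold: no plaintext, no raw data key.
type Envelope struct {
	WrappedKey []byte // data key encrypted under the master key (KMS stand-in)
	Ciphertext []byte // artifact encrypted under the data key
}

// Encrypt generates a fresh per-artifact data key and wraps it.
func Encrypt(masterKey, artifact []byte) (Envelope, error) {
	dataKey := make([]byte, 32)
	if _, err := rand.Read(dataKey); err != nil {
		return Envelope{}, err
	}
	ct, err := seal(dataKey, artifact)
	if err != nil {
		return Envelope{}, err
	}
	wk, err := seal(masterKey, dataKey)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{WrappedKey: wk, Ciphertext: ct}, nil
}

// Decrypt unwraps the data key, then decrypts the artifact.
func Decrypt(masterKey []byte, e Envelope) ([]byte, error) {
	dataKey, err := open(masterKey, e.WrappedKey)
	if err != nil {
		return nil, fmt.Errorf("unwrap data key: %w", err)
	}
	return open(dataKey, e.Ciphertext)
}

func demo() (roundTrip, wrongKeyRejected bool) {
	master := bytes.Repeat([]byte{0x01}, 32) // illustrative keys only
	other := bytes.Repeat([]byte{0x02}, 32)
	env, err := Encrypt(master, []byte("plugin bytes"))
	if err != nil {
		return false, false
	}
	pt, err := Decrypt(master, env)
	roundTrip = err == nil && bytes.Equal(pt, []byte("plugin bytes"))
	_, err = Decrypt(other, env)
	wrongKeyRejected = err != nil
	return
}

func main() {
	fmt.Println(demo())
}
```

With one master key per seller, compromising a single wrapped key or master key exposes only that seller's artifacts, which is the blast-radius property listed above.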
## 7. Scale (~10K plugins, high install volume)
- Broker is stateless + horizontally scaled; entitlement reads cached.
- Delivery via signed-URL + CDN — bytes don't flow through the broker.
- Publish pipeline is per-seller-CI (parallel); no central manual step.
- Zero per-plugin human ops by construction (the failure mode this RFC exists
to prevent).
## 8. Rollout (phased)
1. **Phase 0 (now, parallel):** ship #828 to deliver our OWN private templates
(seo-agent → JRS) — interim, our-own-templates only. Unblocks the customer.
2. **Phase 1:** add the `template` field decoupling described in §4; keep using
the #833 platform-token path behind the `resolveTemplateAssets` chokepoint;
backfill SEO workspaces; dogfood with seo-agent. This is the design section
the CTO must approve before coding starts.
3. **Phase 2:** entitlement service + broker + encrypted store; migrate our own
private templates onto it; deprecate the #828 platform token for private
delivery.
4. **Phase 3:** third-party publish pipeline + per-seller encryption keys +
billing/entitlement integration + marketplace UI.
## 9. Alternatives considered
- **Per-seller long-lived tokens** injected per workspace: O(sellers) credentials,
still no entitlement gating, still no encryption, still manual provisioning —
rejected.
- **Keep the single platform token, add ACLs on the repo host:** no encryption,
no entitlement semantics, repo-host-specific, doesn't scale to per-buyer —
rejected.
- **Bake plugins into images:** breaks "seller owns/updates their plugin",
no per-buyer entitlement — rejected.
## 10. Open questions
- Encryption model: per-seller data keys vs per-buyer envelope (re-encrypt per
install)? KMS choice + key rotation.
- Entitlement SoT: new service vs extend CP; how billing emits the entitlement.
- Broker placement: CP endpoint vs dedicated service; CDN/object-store choice.
- Plugin vs template: same delivery primitive, or plugin-system-specific install?
- Trust/quality: seller verification, malware scanning, sandboxing of 3rd-party
plugin code at install.