From 6d787ca7adfff86dea449b65737e923035e4bba6 Mon Sep 17 00:00:00 2001 From: Claude Fable 5 Date: Sun, 14 Jun 2026 10:16:42 +0000 Subject: [PATCH 1/2] docs(rfc): decouple workspace config + skill delivery from Secrets Manager --- .../rfc-decouple-config-skill-delivery.md | 115 ++++++++++++++++++ 1 file changed, 115 insertions(+) create mode 100644 docs/design/rfc-decouple-config-skill-delivery.md diff --git a/docs/design/rfc-decouple-config-skill-delivery.md b/docs/design/rfc-decouple-config-skill-delivery.md new file mode 100644 index 000000000..168295cfa --- /dev/null +++ b/docs/design/rfc-decouple-config-skill-delivery.md @@ -0,0 +1,115 @@ +# RFC: Decouple workspace config + skill delivery from Secrets Manager + +**Status:** Draft +**Author:** CEO Assistant (on CTO direction) +**Related:** RCA #2831 (SaaS agents lose config/skills/memory), #2832 (credentials in auto-memory), #2838 (provisioner reconciliation — partial), merged runtime fix #125/#134 (memory re-inject on auto-heal + persistence discipline), seo-template #16 (slash-command format) + +## 1. Summary + +Workspace **config, prompts, and skills** are currently delivered to SaaS tenants through the **AWS Secrets Manager** "config bundle" path. That couples non-sensitive, potentially-large template assets to a store sized and scoped for *secrets*. The result is the recurring class of failures in #2831: + +- Skills (e.g. the 716 KiB `seo-all` package) don't fit the ≤256 KiB secrets bundle → they are silently dropped → `agent_card.skills = []`. +- The "fix" attempted in #2838 was a per-template patch (`EnableSEOSkillPackage` / `SEOSkillPackageFiles()`) that hardcodes one template's files into core — non-scalable, a layering violation, and it ships disabled. +- A workspace whose config secret is absent or stale silently falls back to a stub `/configs/config.yaml` and never self-repairs. + +**Proposal:** Secrets Manager keeps **only secrets**. Config + prompts + skills move to a **generic, non-secret asset channel** — the workspace fetches its template assets (incl. `agent-skills/`) from the **template repo** onto the **persisted data volume** at provision and reconciles them on every boot. This removes the size cap, deletes the per-template patch, and makes stub/missing-config workspaces self-repair. + +## 2. Problem & evidence + +### 2.1 Live Secrets-Manager inventory (operator acct `004947743811`, `us-east-2`) +SM holds exactly two kinds of secret: + +| Secret | Contents | Verdict | +|---|---|---| +| `molecule/tenant//bootstrap` | `db_password`, `admin_token`, `secrets_encryption_key`, `tunnel_token`, `ghcr_token`, `cp_admin_api_token`, `shared_secret`, `display_session_signing_secret` | **Real secrets — correct to be in SM** | +| `molecule/workspace//config` (~97+) | `{ "config.yaml": "" }` — sampled `88cc3af2…` = **240 bytes, config.yaml only** (no `prompts/`, no skills) | **Non-secret config in a secrets store** | + +- **Zero `*skill*` secrets exist.** Skills never reach SaaS via SM. +- **JRS SEO `28f97a7f` has no `workspace/.../config` secret at all** → root cause of its 218-byte stub config + `skills:[]`: it never got a bundle and fell back to the user-data baseline stub. +- Even the bundles that exist are tiny (config.yaml only) — prompts/skills largely don't make it through for anyone. + +### 2.2 Mechanism (from code, per #2831) +- Core `collectCPConfigFiles()` collects only `config.yaml` + `prompts/` from the template; the SEO skill package lives under `source/seo-agent-template-main/agent-skills/seo-all/` (716 KiB), outside the allowlist. +- CP stages that bundle to SM `molecule/workspace//config` (cp#329 moved config off the 16 KB user-data limit into SM, cap raised to 256 KiB). +- Normal restart/auto-heal calls `RestartWorkspaceAutoOpts(templatePath="", configFiles=nil)` → **no reconciliation**: a stub `/configs` is reused forever. + +### 2.3 Root cause +**Transport coupling.** Non-sensitive assets (config, prompts, skills) are forced through a secrets-sized, secrets-scoped channel. This (a) caps skills out, (b) invites per-template patches, (c) silently no-ops when a config secret is missing. + +## 3. Goals / non-goals + +**Goals** +- Secrets and non-secret assets ride **separate channels**. +- **Generic** template-asset (config + prompts + skills) delivery for *any* template/runtime — no per-template code in core. +- **Reconcile/self-repair** on every provision, restart, restore, auto-heal — a stub/missing config heals from the template. +- No size cap on assets; skills always delivered. +- Comprehensive docs + unit tests + e2e. + +**Non-goals** +- Changing what's in `tenant//bootstrap` (it stays; it's correct). +- The runtime memory-persistence fix (already merged: #125/#134) — complementary, not in scope here. +- Changing how *workspace_secrets* (per-workspace env in the tenant DB) work. + +## 4. Proposed design + +### 4.1 Secrets ↔ assets boundary (the core principle) +| Channel | Carries | Store | +|---|---|---| +| **Secrets** | tenant bootstrap secrets (db/admin/encryption/tunnel/ghcr/cp-admin/shared) | AWS Secrets Manager `tenant//bootstrap` (unchanged) | +| **Assets** | `config.yaml`, `prompts/*`, `agent-skills/**` | Template repo → fetched to the persisted data volume `/configs` (+ skills dir) | + +Secrets Manager is no longer a config/asset transport. + +### 4.2 Asset delivery — generic template fetch +At provision and on every boot, the workspace materializes its template assets from the **template repo** (Gitea), pinned to the template's `.runtime-version`/ref: + +1. Provisioner records the workspace's **template identity** (repo + ref) in the workspace record (it already knows the template at create time). +2. The boot/entrypoint performs a shallow fetch of the template repo's asset paths (`config.yaml`, `prompts/`, `agent-skills/`) into `/configs` on the data volume (the volume already persists `/configs` + `/workspace` + `~/.claude`). +3. The skill loader reads skills from the delivered `agent-skills/` dir; `config.yaml: skills:` just names which to load. +4. This is **template-agnostic** — every template ships these conventional asset paths; core has no template-specific knowledge. + +> Transport options for the fetch (decide in impl): (a) `git` shallow clone of the template repo with a read-scoped token delivered via the bootstrap secret (the only secret needed); (b) a tarball from the Gitea archive API; (c) an object-store (R2/S3) artifact the provisioner publishes per template-version. (a)/(b) reuse the template repo as SSOT and need no extra publish step; preferred unless artifact immutability is required. + +### 4.3 Reconcile / self-repair on boot +Replace the "reuse existing config volume, templatePath=\"\"" path: on every provision/restart/restore/auto-heal, re-materialize (or verify-and-repair) the template assets. A workspace with a missing/stub/partial `/configs` heals to the template's current assets. Keep #2838's good half: the `isCPTemplateConfigFile` allowlist + abort-on-provider-error + nil-safety. + +### 4.4 Deletions (remove the patch) +Delete from core: `EnableSEOSkillPackage`, `SEOSkillPackageFiles()`, `SEOSkillConfigBlock()`, `workspace-server/internal/provisioner/seo_skill_package.go`, and all references. Core carries **zero** template-specific skill knowledge. + +## 5. Migration +- **Existing stub workspaces (e.g. JRS `28f97a7f`):** self-repair on next boot once §4.3 ships — no manual per-workspace fix needed. (Until then, a one-off re-provision pulls the assets.) +- **Existing `workspace//config` SM secrets:** become vestigial. Stop writing them; delete on a sweep after the asset channel is verified live (don't delete before, to allow rollback). + +## 6. Security considerations +- The only secret the asset fetch needs is a **read-scoped template-repo token**, delivered via the existing `bootstrap` secret — no broadening of secret exposure. +- Assets are public, non-sensitive by definition; no credential material rides the asset channel. +- Complements #2832: secrets a user asks to persist go to `workspace_secrets` (durable env), not into config/memory; auto-memory redaction is tracked there. + +## 7. Test plan +**Unit (core, `workspace-server/internal/provisioner`)** +- Provisioning a template with skills produces an asset set including `agent-skills/**` (fails if skills are dropped). +- Reconcile path with `templatePath=""` (restart/auto-heal) still materializes/repairs `/configs` from the template (no stub regression). +- Missing/stub `/configs` → self-repaired to the template assets. +- No SEO/template-specific symbols remain (grep gate). +- Secrets boundary: provisioning writes secrets only to `bootstrap`; no config/skill bytes land in any SM secret. + +**E2E (staging)** +- Provision a real SEO-template workspace → assert `agent_card.skills` is non-empty and the `/seo-*` commands are present on the data volume. +- Restart the workspace → assert `/configs` + skills survive (or are re-materialized), config is not a stub. +- Corrupt `/configs` to a stub → boot → assert self-repair restores config + skills. + +## 8. Rollout +1. Land the asset channel + reconciliation behind the generic path (auto-bump templates/runtime per repo default). +2. Verify on a staging SEO workspace (e2e above). +3. Re-provision/boot existing SaaS SEO workspaces (incl. JRS) to pick up assets; confirm `skills` non-empty. +4. Sweep + delete vestigial `workspace//config` SM secrets after a soak. + +## 9. Alternatives considered +- **Keep SM, raise the cap / chunk skills across secrets:** still misuses a secrets store for public assets; per-secret limits + cost; rejected. +- **Per-template flags (`EnableSEOSkillPackage`, status quo of #2838):** O(templates) patches, layering violation, ships disabled; rejected (this RFC deletes it). +- **Bake skills into the runtime image:** couples per-template content to image builds, breaks the "template owns its skills" SSOT; rejected. + +## 10. What we keep +- The merged runtime memory fix (#125/#134) — orthogonal. +- #2838's reconciliation scaffold + allowlist (the generic half) — retained; only the SEO-specific injection is removed. +- `tenant//bootstrap` in SM — unchanged. -- 2.52.0 From 8e97a39dcf930e6fa97476252023925d20b69b5b Mon Sep 17 00:00:00 2001 From: Claude Fable 5 Date: Sun, 14 Jun 2026 10:25:14 +0000 Subject: [PATCH 2/2] =?UTF-8?q?rfc:=20add=20concierge-identity=20sibling?= =?UTF-8?q?=20anti-pattern=20(hardcoded=20in=20core=20=E2=86=92=20should?= =?UTF-8?q?=20be=20a=20template)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/design/rfc-decouple-config-skill-delivery.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/docs/design/rfc-decouple-config-skill-delivery.md b/docs/design/rfc-decouple-config-skill-delivery.md index 168295cfa..1ccbf1256 100644 --- a/docs/design/rfc-decouple-config-skill-delivery.md +++ b/docs/design/rfc-decouple-config-skill-delivery.md @@ -109,6 +109,18 @@ Delete from core: `EnableSEOSkillPackage`, `SEOSkillPackageFiles()`, `SEOSkillCo - **Per-template flags (`EnableSEOSkillPackage`, status quo of #2838):** O(templates) patches, layering violation, ships disabled; rejected (this RFC deletes it). - **Bake skills into the runtime image:** couples per-template content to image builds, breaks the "template owns its skills" SSOT; rejected. +## 10a. Sibling anti-pattern found in the audit: the concierge identity is hardcoded in core + +The same "should be a template, not a patch" smell exists for the **org concierge / platform agent**. `workspace-server/internal/handlers/platform_agent.go` hardcodes its entire identity in Go: +- `conciergeSystemPromptTmpl` — the full "You are … the Org Concierge / org orchestrator" system prompt (string literal). +- `conciergeMCPServersBlock` — its `mcp_servers:` config block. +- `conciergeDeclaredModel = "moonshot/kimi-k2.6"`, `conciergeRuntime = "claude-code"`. +- `conciergeIdentityFiles()` overlays these as `system-prompt.md` + `config.yaml` at provision. + +The concierge has an image (`Dockerfile.platform-agent`) but **no template home for its identity** — so its prompt/config/model live as core string literals, exactly like the SEO skill files did. The fix is the same abstraction: make the concierge a **platform-agent template** (prompt/config/model in template files) delivered via this RFC's generic asset channel, and delete the `conciergeSystemPromptTmpl`/`conciergeMCPServersBlock`/`conciergeIdentityFiles` literals from core. The asset channel introduced here is the enabler for removing **both** the SEO patch **and** the concierge hardcoding. + +**Audit scope notes:** per-runtime branches in core (e.g. `if runtime == "hermes"` for provision-timeout/config paths) are adapter/registry concerns, not per-template patches — lower priority, candidates for data-driven cleanup but not in this RFC. No plugin-behavior was found hardcoded in core (the plugin system is used for extensions). The two clear "should be a template" patches are: (1) SEO skill package, (2) concierge identity. + ## 10. What we keep - The merged runtime memory fix (#125/#134) — orthogonal. - #2838's reconciliation scaffold + allowlist (the generic half) — retained; only the SEO-specific injection is removed. -- 2.52.0