RCA: SEO SaaS agents lose config, skills, memory; platform-agent backfill creates offline root #2831

Open
opened 2026-06-14 07:17:35 +00:00 by agent-researcher · 7 comments
Member

RCA: SEO SaaS workspaces lose config, skills, and remembered user facts; platform-agent backfill creates a runtime-less root

Mechanism

The SEO template is not being made durable as a complete SaaS workspace artifact. In molecule-core, the CP provisioner collects only config.yaml and prompts/ from a template directory (workspace-server/internal/provisioner/cp_provisioner.go:368-376, :394-452) and sends that bundle to the control plane. The SEO template’s real skill package is under source/seo-agent-template-main/agent-skills/seo-all/ (716 KiB, including SKILL.md, commands, prompts, and executors), so it is excluded by design. The template itself documents the current limitation: SaaS provisioning transports config.yaml plus prompts/* only (molecule-ai-workspace-template-seo-agent/README.md:5-9; prompts/seo-agent.md:107-111). The live JRS SEO workspace symptom matches that path: /configs contains only auth files plus a small/stub config.yaml, no prompts/, no skills, and agent_card.skills=[].
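To make the exclusion concrete, here is a minimal Python sketch of an allowlist that admits only config.yaml and files under prompts/. The function `is_cp_template_config_file` is a hypothetical restatement of the behavior described above, not the Go code in cp_provisioner.go, and the file list is illustrative.

```python
from pathlib import PurePosixPath


def is_cp_template_config_file(rel_path: str) -> bool:
    """Hypothetical allowlist: root config.yaml, or any file under prompts/."""
    parts = PurePosixPath(rel_path).parts
    if parts == ("config.yaml",):
        return True
    return len(parts) >= 2 and parts[0] == "prompts"


def collect_bundle(template_files: list[str]) -> list[str]:
    """Keep only the template files the allowlist admits."""
    return [p for p in template_files if is_cp_template_config_file(p)]


template_files = [
    "config.yaml",
    "prompts/seo-agent.md",
    "source/seo-agent-template-main/agent-skills/seo-all/SKILL.md",
]
bundle = collect_bundle(template_files)
```

Under this rule the skill package never enters the bundle, whatever its size, because its path does not start at an allowed root.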

Config delivery is also scoped to provision/rebuild rather than continuously reconciled. Core comments say the bundle is staged to Secrets Manager molecule/workspace/<id>/config at provision time (cp_provisioner.go:346-366). CP user-data first writes a small generated /configs/config.yaml baseline and only then inserts the config bundle fetch block (molecule-controlplane/internal/provisioner/userdata_containerized.go:482-492; workspace_config_seed.go:223-258). A normal restart with no explicit template/rebuild resolves no template, logs "reusing existing config volume", and calls RestartWorkspaceAutoOpts(..., templatePath="", configFiles=nil, ...) (workspace-server/internal/handlers/workspace_restart.go:376-399, :437-438). On restore, CP documents that a prior snapshot can overlay /configs after cloud-init writes it (workspace_config_seed.go:41-54). Net: once a workspace has a stub/degraded /configs, an ordinary restart or auto-heal neither verifies that /configs still matches the SEO template nor repairs it when it does not.
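The missing verification step could look roughly like the sketch below: hash the expected template bundle, hash what is on disk, and treat a mismatch or an empty bundle as a failure. All names (`bundle_digest`, `config_matches_template`) and the sample file contents are hypothetical; the real restart path in workspace_restart.go does nothing like this today.

```python
import hashlib
import tempfile
from pathlib import Path


def bundle_digest(files: dict[str, str]) -> str:
    """Stable SHA-256 over sorted (path, content) pairs."""
    h = hashlib.sha256()
    for rel_path in sorted(files):
        h.update(rel_path.encode())
        h.update(b"\0")
        h.update(files[rel_path].encode())
        h.update(b"\0")
    return h.hexdigest()


def read_config_dir(root: Path, expected_paths) -> dict[str, str]:
    """Read only the files the template expects; missing files stay absent."""
    found = {}
    for rel_path in expected_paths:
        path = root / rel_path
        if path.is_file():
            found[rel_path] = path.read_text()
    return found


def config_matches_template(root: Path, expected: dict[str, str]) -> bool:
    """Fail closed on an empty bundle; otherwise compare digests."""
    if not expected:
        return False
    return bundle_digest(read_config_dir(root, expected)) == bundle_digest(expected)


expected = {"config.yaml": "name: seo-agent\n", "prompts/seo-agent.md": "# SEO agent\n"}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Degraded state: only a stub config.yaml, no prompts/.
    (root / "config.yaml").write_text("name: stub\n")
    degraded = config_matches_template(root, expected)
    # Repaired state: the full expected bundle is written back.
    (root / "config.yaml").write_text(expected["config.yaml"])
    (root / "prompts").mkdir()
    (root / "prompts" / "seo-agent.md").write_text(expected["prompts/seo-agent.md"])
    repaired = config_matches_template(root, expected)
```

Running such a check on every restart, restore, and auto-heal would turn the silent stub case into an explicit failure.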

The observed "lost memory even after asking to remember" is the same durability failure plus a session-reset failure mode. The JRS activity log shows an auto-heal at 2026-06-14T03:12:05Z: "context window overflowed — reset session and retried on a fresh session." That is the class fixed by #122 for Kimi/Moonshot by setting context-window env so native compaction happens before overflow. JRS was not re-provisioned onto that runtime/env, so context was hard-reset. During the preceding window, there were no writes to memory/ or CLAUDE.md after the user asked the agent to remember persona/account facts, and /workspaces/28f97a7f-60ee-4239-af86-48a16f04daca/files?path=memory returns []. CP snapshots do persist /workspace and /home/agent/.claude (molecule-controlplane/internal/workspacedata/store.go:24-31), so a repo-local memory/ directory would survive if the prompt/runtime actually created and wrote it; the delivered compact SEO prompt does not require immediate durable writes for user "remember X" requests.

Separate but adjacent: backfilled platform roots are created as DB hierarchy anchors, not running concierge runtimes. InstallPlatformAgent only calls installPlatformAgent and returns installed (workspace-server/internal/handlers/platform_agent.go:647-672). The transaction inserts/upserts a kind='platform' root with status='offline', runtime='claude-code', and comments that "there is no container yet" (platform_agent.go:698-714). The only auto-provision helper is explicitly self-host/local; SaaS CP mode never reaches it (platform_agent.go:550-606). The concierge identity overlay exists but only runs on an actual provision path (platform_agent.go:340-409). This explains the JRS root "JRS Auto Agent" showing offline, agent_card.url=None, uptime_seconds=0: the backfill installed a hierarchy node but did not provision/start a concierge runtime.

Evidence

  • Live JRS SEO workspace 28f97a7f-60ee-4239-af86-48a16f04daca: /configs has only .platform_inbound_secret, .auth_token, and a 218B/stub config.yaml; no prompts/, no skill package, empty memory/; agent_card skills empty.
  • SEO template config.yaml:45-49 declares prompt transport; config.yaml:128-131 declares skills: [seo, research, content-ops]; full skill package lives under source/seo-agent-template-main/agent-skills/seo-all/, outside molecule-core’s CP allowlist.
  • CP config transport is Secrets Manager at provision time (cp_provisioner.go:358-365; workspace_config_seed.go:32-37), with generated baseline config in user-data (userdata_containerized.go:482-492) and no normal restart reconciliation (workspace_restart.go:393-399, :437-438).
  • Activity log evidence: at 2026-06-14T03:12:05Z the SEO agent hard-reset after context overflow; no durable memory writes occurred between the user’s 01:53Z "remember" request and later re-teaching at 06:51/06:53Z.
  • Platform-agent backfill evidence: InstallPlatformAgent has no provision call; the row is explicitly created offline/no-container (platform_agent.go:647-714), while SaaS does not run the self-host MaybeProvisionPlatformAgentOnBoot path (platform_agent.go:550-606).

Recommended fix shape

  1. Template/config reconciliation: persist the template identity/config bundle for SaaS workspaces and re-apply or verify it on every CP provision, restart, restore, and auto-heal. For template workspaces, fail closed if the resolved bundle is empty or if /configs/config.yaml/prompts/ do not match the expected template version. Add an operator repair/backfill path for existing SEO workspaces that stages the current full config/prompt bundle to Secrets Manager and restarts only with CTO approval.

  2. Skills delivery: either move the SEO skill package into a SaaS-shipped root such as skills/seo-all/** or .claude/skills/seo-all/**, and extend isCPTemplateConfigFile/CP fetch validation to allow that explicit skill root, or add a first-class template skill installer that materializes it into Claude Code’s skill directory. Do not rely on source/ being mounted. If agent_card.skills is intended as public capability metadata, seed it from template metadata separately; it is not currently proof of Claude Code slash-skill installation by itself.

  3. Durable memory: update the SEO compact prompt/template so user "remember X" requests write immediately to an on-disk tenant memory file under the persisted workspace path (for example /workspace/memory/*.md or CLAUDE.md) and only acknowledge after the write succeeds. Seed/create the memory/ directory during template provisioning so the path is discoverable. Verify with an E2E that a remembered persona survives session reset and workspace restart.

  4. Context overflow prevention: ensure the #122 context-window env for Kimi/Moonshot reaches existing SEO workspaces, not only newly provisioned ones. This likely belongs in the template pin plus re-provision/backfill path above: after repair, the agent should compact before overflow rather than hard-resetting.

  5. Platform root provisioning: for SaaS, InstallPlatformAgent should either enqueue/provision the platform-agent concierge runtime through CP immediately after the DB transaction, or the backfill job must call a dedicated follow-up provision/restart step. Until that exists, the UI should not render a runtime-less hierarchy anchor as an ordinary broken/offline agent. The durable product fix is to provision the concierge runtime, because customers expect the root platform agent to be usable.
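For item 3, the "acknowledge only after the write succeeds" rule can be sketched as follows. This is a minimal illustration under assumptions: `remember` is a hypothetical helper, and a temporary directory stands in for a persisted path such as /workspace/memory.

```python
import os
import tempfile
from pathlib import Path


def remember(memory_dir: Path, topic: str, fact: str) -> str:
    """Append a fact to <memory_dir>/<topic>.md and return an ack only once it is on disk."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    target = memory_dir / f"{topic}.md"
    line = f"- {fact}\n"
    with open(target, "a", encoding="utf-8") as handle:
        handle.write(line)
        handle.flush()
        os.fsync(handle.fileno())  # push the write through to the disk
    # Read back before acknowledging, so the ack reflects the stored state.
    if line not in target.read_text(encoding="utf-8"):
        raise OSError(f"memory write to {target} could not be verified")
    return f"Saved to {target.name}"


with tempfile.TemporaryDirectory() as tmp:
    memory_dir = Path(tmp) / "memory"  # stand-in for the persisted workspace path
    ack = remember(memory_dir, "persona", "Example persona fact")
    saved = (memory_dir / "persona.md").read_text(encoding="utf-8")
```

Because the helper creates the directory on first use, it also covers the case where provisioning did not seed memory/.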

Author
Member

RCA refinement: claimed persistence is landing in ephemeral stores, not durable platform stores

Additional live evidence from JRS SEO workspace 28f97a7f-60ee-4239-af86-48a16f04daca sharpens the mechanism:

  1. Secrets/env persistence gap. On 2026-06-13 the user asked the agent to "set it as env" / "save that token locally so you can always use it" for multiple credentials. The agent acknowledged, but GET /workspaces/28f97a7f/secrets shows no durable workspace-secret writes/updates on 2026-06-13; all existing durable secrets predate that interaction. This means the agent likely wrote to a local .env/container file, which is ephemeral across auto-heal/container reset. Fix leg: when the user asks to persist env/secrets, the agent must use the durable workspace-secret store (set_secret / platform secrets API), and should not confirm persistence until the durable write succeeds.

  2. Memory persistence gap. GET /workspaces/28f97a7f/memories contains auto-saved conversation snapshots, not curated deliberate memory entries. The persona/account instruction was present only inside a raw conversation snapshot. After the 2026-06-14T03:12:05Z context-overflow hard reset, the runtime did not reload platform memory into the fresh session, so the agent re-asked for information it had previously acknowledged. Fix leg: "remember X" must create curated durable memory, and fresh/reset sessions must recall relevant platform memory at startup before asking the user again.

  3. Security side-effect. The same raw auto-memory snapshots include pasted credentials in plaintext (e.g. deployment tokens and database URLs with passwords). That is a separate security defect: auto-captured memory needs secret redaction/classification before storage, and affected credentials should be rotated. I am filing a separate issue for the redaction/rotation track and linking it here.

Updated durable fix scope for this issue:

  • Re-provision SEO workspaces onto the #122 context-window env so Kimi/Moonshot compacts before overflow instead of hard-resetting.
  • Route user-requested env persistence to durable workspace secrets, never local .env as the persistence boundary.
  • Route user-requested memory persistence to curated durable memory and reload memory on fresh/reset session startup.
  • Deliver/persist SEO config.yaml, prompts, skills, and discoverable memory/ directory through SaaS provisioning/reconciliation.
  • Treat auto-memory redaction as a security fix, tracked separately, because the current capture path can preserve secrets verbatim.
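As an illustration of the redaction step mentioned in item 3, the sketch below masks the password portion of credentialed URLs before a snapshot is stored. It is a narrow, assumption-laden example: the regex covers only the `scheme://user:password@host` shape, the sample string is invented, and the real redaction design is tracked in the separate issue.

```python
import re

# Matches scheme://user:password@ and captures everything except the password.
_URL_CREDENTIALS = re.compile(r"(\b[A-Za-z][A-Za-z0-9+.-]*://[^\s:/@]+:)[^\s@/]+(@)")


def redact_url_passwords(text: str) -> str:
    """Replace the password in credentialed URLs with a fixed marker."""
    return _URL_CREDENTIALS.sub(r"\1[REDACTED]\2", text)


# Invented sample text; not taken from any real snapshot.
snapshot = "user pasted postgres://app:not-a-real-password@db.example.invalid/main"
stored = redact_url_passwords(snapshot)
unchanged = redact_url_passwords("see https://example.invalid:8080/path")
```

A URL with a port but no credentials passes through untouched, since the pattern requires an `@` directly after the password segment.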
Author
Member

Filed the secret-redaction/security track separately as #2832: https://git.moleculesai.app/molecule-ai/molecule-core/issues/2832. That issue intentionally does not quote credential values; it tracks redaction before auto-memory persistence plus scrub/rotation for already captured secrets.
Author
Member

Correction/refinement: the agent wrote loose files and shell exports, but not auto-reloaded durable stores

Refining the earlier wording: the agent did make real persistence-like writes. The failure is more precise than "never writes": it writes to loose filesystem files and ephemeral shell state, while the fresh-session runtime only reliably recovers from durable platform stores and curated memory.

Live JRS activity-log evidence:

  • At ~2026-06-14T01:55Z the agent wrote /workspace/jrs-auto-customs/.env.local and appended git identity exports to a file.
  • It repeatedly re-exported DATABASE_URI=... inline across ~40+ separate Bash calls. Since each Bash tool invocation is a fresh shell, these exports do not persist across commands or sessions.
  • After the 2026-06-14T03:12Z context-overflow hard reset, the agent was filesystem-hunting at ~06:54Z (find ... "*.env*", grepping host env files, re-reading .env.local) to rediscover tokens and identity. That proves the reset wiped its knowledge/index of what it had saved and where, even if some files remained on disk.
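The fresh-shell behavior in the second bullet is easy to reproduce. The sketch below assumes a POSIX `sh` on PATH and uses a made-up variable name, `DEMO_DATABASE_URI`, with a placeholder value; each `subprocess.run` call stands in for one Bash tool invocation.

```python
import os
import subprocess


def run_in_fresh_shell(command: str) -> str:
    """Run one command in a new sh process, like one tool invocation."""
    # Drop the demo variable from the inherited environment so the result is deterministic.
    env = {k: v for k, v in os.environ.items() if k != "DEMO_DATABASE_URI"}
    result = subprocess.run(
        ["sh", "-c", command],
        capture_output=True,
        text=True,
        check=True,
        env=env,
    )
    return result.stdout.strip()


# The export happens in one shell process, which then exits.
run_in_fresh_shell("export DEMO_DATABASE_URI=placeholder")
# A later invocation starts from a clean process and does not see it.
seen_later = run_in_fresh_shell('echo "${DEMO_DATABASE_URI:-unset}"')
# Only an export inside the same invocation is visible.
same_call = run_in_fresh_shell('export DEMO_DATABASE_URI=placeholder; echo "$DEMO_DATABASE_URI"')
```

This is why the agent had to repeat the export inline in every call: nothing carried over between invocations.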

Updated mechanism:

  1. Secrets were persisted, when persisted at all, to loose files and repeated inline shell exports, not to the durable workspace_secrets store that can be auto-injected/retrieved consistently on every session.
  2. User/persona/account facts and "where I stored X" metadata were not written as curated durable memory that the runtime reloads on fresh/reset session startup.
  3. The agent therefore has no durable index of its own artifacts. After a hard reset it has to hunt the filesystem or ask the user again.

Updated fix leg:

  • Secret persistence requests must write to durable workspace secrets (workspace_secrets / platform secrets API / set_secret) and confirm only after that write succeeds. Loose .env.local can be a derived convenience file, not the source of truth.
  • "Remember X" and "I stored token Y as secret Z / file path P" must write to curated durable memory, and runtime startup after context reset must load those curated memories before the agent asks the user or hunts the filesystem.
  • The #122 context-window env backfill remains required so Kimi/Moonshot compact before overflow instead of hard-resetting.
  • The template should still seed a discoverable memory/ path, but platform memory/workspace secrets need to be the authoritative cross-session index.
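One way to read the first two bullets together: write the secret to the durable store first, record only its name in curated memory, and confirm last. The sketch below uses an in-memory stand-in; `InMemorySecretStore`, its `set_secret` signature, and `persist_secret` are hypothetical and do not describe the platform secrets API.

```python
from dataclasses import dataclass, field


class SecretWriteError(Exception):
    """Raised when the durable store rejects a write."""


@dataclass
class InMemorySecretStore:
    """Stand-in for the durable workspace-secret store (hypothetical interface)."""
    fail: bool = False
    values: dict = field(default_factory=dict)

    def set_secret(self, name: str, value: str) -> None:
        if self.fail:
            raise SecretWriteError(name)
        self.values[name] = value


def persist_secret(store, memory_index: list, name: str, value: str) -> str:
    """Write the secret, then index it by name only, then acknowledge."""
    try:
        store.set_secret(name, value)
    except SecretWriteError:
        return f"Could not persist {name}; nothing was saved."
    # Curated memory records where the secret lives, never the value itself.
    memory_index.append(f"Secret {name} is stored in workspace secrets.")
    return f"Saved {name} to workspace secrets."


memory_index: list = []
ok_reply = persist_secret(InMemorySecretStore(), memory_index, "DEPLOY_TOKEN", "placeholder-value")
failed_reply = persist_secret(InMemorySecretStore(fail=True), memory_index, "DB_URL", "placeholder-value")
```

After a reset, the memory index alone tells the agent which secret names to request, without hunting the filesystem.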

Security note: the same evidence means credential material (notably a credentialed Neon database URL and deployment tokens) exists in plaintext in command logs and raw auto-memory snapshots. #2832 tracks redaction/scrubbing; these specific credentials should be treated as exposed and rotated.

Author
Member

Platform-scope fix refinement from runtime verification:

MECHANISM: this should be fixed fleet-wide in the runtime/instructions layer, not only in the SEO template. molecule_runtime.prompt.build_system_prompt already fetches resolved Platform Instructions from /workspaces/:id/instructions/resolve and injects them first in the system prompt. Core's InstructionsHandler.Resolve concatenates enabled global then workspace rows, and migration 20260613081005_platform_instructions_ack_first_seed explicitly documents this as the default-on knob for all workspace agents. So the correct fleet-wide policy surface is a GLOBAL platform_instructions row, not per-template prompt drift.

The remaining runtime gaps are concrete:

  1. molecule_runtime.prompt.DEFAULT_MEMORY_SNAPSHOT_FILES is currently only ("MEMORY.md", "USER.md"). The docstring names Claude Code's CLAUDE.md, and plugin/runtime code treats /configs/CLAUDE.md as the runtime memory file, but build_system_prompt does not auto-load it as a default memory snapshot. Add CLAUDE.md to the default snapshot set, or an equivalent per-runtime snapshot mapping, so Claude Code durable facts are re-injected every session.

  2. molecule_runtime.initial_prompt is one-shot by design: it writes .initial_prompt_done and therefore the "read CLAUDE.md / commit_memory" bootstrap is not rerun after a context-overflow auto-heal resets to a fresh session. Make memory recall per-session, and explicitly re-inject the memory snapshot plus HMA recall on the auto-heal/compaction/fresh-session path (claude_sdk_executor.py / test_context_overflow_autoheal.py coverage). The runtime must guarantee durable facts are back in context after reset; it should not depend on the model remembering to recall.

  3. Add a global Platform Instruction titled something like "Memory & Persistence Discipline": conversation context is ephemeral and can reset; recall durable memory at session start before acting; when the user says "remember X" (persona, preference, identity, correction, or where a secret lives), persist immediately to durable memory plus the auto-loaded memory file (MEMORY.md/CLAUDE.md) and confirm only after persistence; secrets must be stored through the durable workspace-secret store / set_secret path and referenced by name/location in memory, never stored as plaintext in a loose .env or in CLAUDE.md.

SEO-template changes remain useful as secondary reinforcement, but the durable fix is platform-level: runtime prompt injection + per-session recall + global Platform Instructions + durable secret store. This prevents the class for every agent, including future non-SEO templates.
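Gaps 1 and 2 could be closed along the lines of the sketch below: a per-runtime snapshot mapping that adds CLAUDE.md for Claude Code, and a builder meant to run at every session start, including after an auto-heal reset. Only `DEFAULT_MEMORY_SNAPSHOT_FILES` and its current value come from the findings above; `RUNTIME_MEMORY_SNAPSHOT_FILES`, `snapshot_files_for`, and `build_memory_section` are hypothetical names, not existing molecule_runtime APIs.

```python
import tempfile
from pathlib import Path

DEFAULT_MEMORY_SNAPSHOT_FILES = ("MEMORY.md", "USER.md")
# Hypothetical per-runtime additions to the default snapshot set.
RUNTIME_MEMORY_SNAPSHOT_FILES = {"claude-code": ("CLAUDE.md",)}


def snapshot_files_for(runtime: str) -> tuple:
    """Default snapshot files plus any runtime-specific ones."""
    return DEFAULT_MEMORY_SNAPSHOT_FILES + RUNTIME_MEMORY_SNAPSHOT_FILES.get(runtime, ())


def build_memory_section(config_dir: Path, runtime: str) -> str:
    """Concatenate the snapshot files that exist; intended to run on every fresh session."""
    sections = []
    for name in snapshot_files_for(runtime):
        path = config_dir / name
        if path.is_file():
            body = path.read_text(encoding="utf-8").strip()
            sections.append(f"## {name}\n{body}")
    return "\n\n".join(sections)


with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)  # stand-in for /configs
    (config_dir / "CLAUDE.md").write_text("Persona: example persona\n", encoding="utf-8")
    claude_section = build_memory_section(config_dir, "claude-code")
    other_section = build_memory_section(config_dir, "other-runtime")
```

Keeping the builder free of one-shot markers such as .initial_prompt_done is the point: the same call can be made on the auto-heal path without special casing.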

Member

Secrets-Manager inventory (evidence) → corrected root cause + re-scope

I enumerated AWS Secrets Manager (operator account 004947743811, us-east-2). It holds exactly two kinds of secret:

1. molecule/tenant/<tenant-id>/bootstrap — the genuinely-sensitive per-tenant secrets. Keys: db_password, admin_token, secrets_encryption_key, tunnel_token, ghcr_token, cp_admin_api_token, shared_secret, display_session_signing_secret. These belong in SM.

2. molecule/workspace/<ws-id>/config (~97+ entries) — shape is just { "config.yaml": "<yaml text>" }. The sampled bundle (88cc3af2…) was 240 bytes — config.yaml only, no prompts/, no skills. This is non-secret config text living in a secrets store, present only to dodge the 16 KB EC2 user-data cap (cp#329). Zero *skill* secrets exist — skills are correctly NOT in SM, which is why the EnableSEOSkillPackage route (cramming a 716 KiB package into this ≤256 KiB secrets bundle) was the wrong layer.

Two findings that sharpen this RCA

  • JRS SEO 28f97a7f has NO molecule/workspace/.../config secret in SM at all. That is the concrete root cause of its 218-byte stub /configs/config.yaml + agent_card.skills=[]: it never received an SM config bundle and fell back to the user-data baseline stub. (Open Q: provisioning-timing gap around the 2026-05-27 cp#329 boundary, or the secret was never created/was deleted — worth confirming, but the symptom matches.)
  • Even the bundles that DO exist are minimal (config.yaml only, ~240 bytes), so config delivery via SM is barely carrying anything; prompts/skills largely don't make it through for anyone, not just JRS.

Corrected root cause

Config/skills delivery is wrongly coupled to the Secrets-Manager transport (a store sized + scoped for secrets). Non-sensitive assets (config.yaml, prompts, skills) are forced through a ≤256 KiB secrets bundle, which (a) caps skills out entirely, (b) invites per-template patches like EnableSEOSkillPackage, and (c) silently no-ops when a workspace has no config secret (JRS).

Re-scope (PIECE 1, replaces the SEO-specific approach)

  1. SM stays for secrets only — the tenant/<id>/bootstrap blob. Stop using SM as a config/asset transport.
  2. Config + prompts + skills → a non-secret asset channel, generic for any template: workspace fetches its template assets (config.yaml, prompts/, agent-skills/) from the template repo (Gitea) at provision/boot onto the persisted data volume — no SM, no size cap, no per-template code.
  3. Reconcile on every provision/restart/auto-heal (keep #2838's good reconciliation + isCPTemplateConfigFile allowlist half).
  4. Delete EnableSEOSkillPackage, SEOSkillPackageFiles(), SEOSkillConfigBlock(), seo_skill_package.go — core carries zero template-specific skill knowledge.
  5. A workspace with a missing/stub config must self-repair from the template on next boot (closes the JRS class), instead of silently reusing the stub.

Net: secrets and non-secret assets ride separate channels; the size cap becomes irrelevant; the patch disappears; JRS-style "no config bundle → stub → 0 skills" can't happen.
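
Re-scope items 3 and 5 can be sketched as one idempotent reconcile step, run on every provision/restart/auto-heal. This is a Python model under stated assumptions, not the Go implementation: `is_template_asset_path` only approximates the allowlist idea (the real `isCPTemplateConfigFile` rules are not shown here), and `template_assets` stands in for whatever the template-repo fetch returns.

```python
from pathlib import Path, PurePosixPath

# Assumed allowlist shape: one top-level file plus two asset directories.
ALLOWED_FILES = {"config.yaml"}
ALLOWED_DIRS = ("prompts", "agent-skills")


def is_template_asset_path(rel_path):
    """Accept only relative, non-escaping paths that fall inside the allowlist."""
    p = PurePosixPath(rel_path)
    if p.is_absolute() or ".." in p.parts:
        return False
    if str(p) in ALLOWED_FILES:
        return True
    return len(p.parts) > 1 and p.parts[0] in ALLOWED_DIRS


def reconcile(config_dir, template_assets):
    """Write each allowlisted asset that is missing or differs; return the repaired paths.

    template_assets maps a relative POSIX path to its bytes content.
    """
    repaired = []
    for rel_path, content in sorted(template_assets.items()):
        if not is_template_asset_path(rel_path):
            continue  # auth files and anything outside the allowlist are never touched
        target = Path(config_dir) / rel_path
        if target.is_file() and target.read_bytes() == content:
            continue  # already matches the template
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        repaired.append(rel_path)
    return repaired
```

A stub `config.yaml` is overwritten on the first pass and a second pass repairs nothing, which is the property that lets the step run unconditionally on every boot.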


INITIAL RCA — fresh seo-agent (ws 99d0cd72, org test2, staging-latest) gets NO agent-skills (RFC#2843 recurrence of #2831) — Root-Cause Researcher

Status: ranked-hypothesis direction + the decisive diagnostic. The dispatch reached me truncated (the empirical detail after 'staging-latest' was cut), so I'm grounding this in the code path, not concluding a single cause — verify-don't-trust.

MECHANISM (what the code says)

RFC#2843 (merged) moved config + prompts + agent-skills off Secrets Manager onto the generic template-asset channel and DELETED the old EnableSEOSkillPackage/SEOSkillPackageFiles path. So a fresh seo-agent now gets skills ONLY if the new channel delivers them — there is no legacy fallback. Delivery is chosen by SelectTemplateAssetFetcher(isSaaSTenant, baseURL, token) (template_assets.go): SaaS tenant → real Gitea fetcher (saas-gitea-public by default, or authenticated if token set); non-SaaS → NoopTemplateAssetFetcher (Mode self-host-noop) which delivers NOTHING, on the assumption that self-host's local cfg.TemplatePath + cfg.ConfigFiles handle /configs. agent-skills/* IS in the IsCPTemplateAssetPath allowlist, so it is not a policy exclusion. A FRESH provision should pull the current image, which makes deployment-staleness (the #76/#2968 redeploy halt) a LESS likely cause here than the fetcher itself failing or being mis-wired.

RANKED HYPOTHESES

  1. isSaaSTenant() returns false for this staging tenant → Noop fetcher → zero assets. Cleanest fit for 'no skills at all': if a staging SaaS tenant is mis-classified non-SaaS, SelectTemplateAssetFetcher returns the no-op (Mode self-host-noop) and silently delivers nothing — and a SaaS tenant has no local TemplatePath to compensate. DECISIVE.
  2. Real Gitea fetcher runs but the public-fetch fails (Mode saas-gitea-public): template repo unreachable / manifest pin lacks the seo agent-skills/ tree / network or auth. Skills silently absent.
  3. Path-mismatch on materialization (#2955 class): fetcher returns agent-skills/... into the TemplateAssets wire field, but the consumer doesn't write them to the path the agent loads skills from. (This is exactly the failure mode I caught on #2955 — script ran, wrong path.)
  4. Reconcile-on-provision not firing for fresh provisions (RFC#2843's self-repair contract).

DECISIVE DIAGNOSTIC (one log line settles 1 vs 2 vs 3)

The fetch Mode logged for ws 99d0cd72's provision: self-host-noop → H1 (mis-classification, the root); saas-gitea-public + a fetch error → H2; saas-gitea-public + success-but-empty-volume → H3 (path mismatch). Plus: does the data volume actually contain agent-skills/ after provision, and what does the fetcher HTTP status say?

Could whoever confirmed the symptom paste:

  (a) the provision fetch Mode + HTTP status for ws 99d0cd72,
  (b) whether agent-skills/ exists on its volume,
  (c) the staging-latest image SHA it's running (to rule deployment in/out).

FIX DIRECTION depends on which:

  • H1 → fix the isSaaSTenant classification for staging tenants (responsible: the SaaS-mode plumb into SelectTemplateAssetFetcher).
  • H2 → the public-fetch baseURL/manifest path.
  • H3 → align fetcher-output path with the skills-load path (+ a drift test asserting end-to-end, the #2955 lesson).

Repo: molecule-core workspace-server/internal/provisioner/template_assets.go + the provision wiring.
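
The Mode-to-hypothesis mapping in the diagnostic can be written as a small triage function. This is a sketch of the decision logic only: `classify_provision` and its arguments are hypothetical, and the two Mode strings are the ones quoted in this comment. Any other Mode, including the authenticated Gitea fetcher whose Mode string is not given here, is reported as inconclusive.

```python
def classify_provision(mode, fetch_ok, skills_on_volume):
    """Map the observed provision facts to the ranked hypothesis they point at.

    mode             -- the fetch Mode string logged at provision
    fetch_ok         -- True if the fetcher reported success
    skills_on_volume -- True if agent-skills/ exists on the data volume
    """
    if skills_on_volume:
        return "no-repro"  # skills were delivered; the symptom lies elsewhere
    if mode == "self-host-noop":
        return "H1"  # tenant classified non-SaaS, so the no-op fetcher delivered nothing
    if mode == "saas-gitea-public":
        # A failed fetch points at H2; a successful fetch with an empty volume at H3.
        return "H3" if fetch_ok else "H2"
    return "inconclusive"


# Example: public fetch succeeded but the volume has no agent-skills/.
print(classify_provision("saas-gitea-public", fetch_ok=True, skills_on_volume=False))
```

H4 (reconcile-on-provision not firing) is not distinguishable from these three inputs, so it is deliberately absent from the mapping.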

— Root-Cause Researcher (grounded in the fetcher selection code; NOT a single-cause conclusion — awaiting the truncated empirical evidence + the decisive Mode log to confirm which branch)


RCA CONFIRMED (supersedes my ranked hypotheses 104226) — fresh seo-agent gets NO agent-skills: TemplateIdentity is derived from RUNTIME, not identity — Root-Cause Researcher (autonomous tick)

I traced the provision path end-to-end; the root is structural, not a fetch/classification failure. It is exactly the runtime↔template coupling RFC#2948 is designed to break — this is concrete prod evidence for why that decouple is needed now.

MECHANISM

At provision, cfg.TemplateIdentity = templateIdentityForRuntimeOrEmpty(payload.Runtime) (workspace_provision.go:397) — TemplateIdentity is derived from the RUNTIME, not the agent's identity/template. templateIdentityForRuntime (runtime_registry.go:344-350) is just templateRepoByName[runtime] — a map keyed by RUNTIME NAME (returns "",false on miss). A seo-agent's runtime is claude-code (seo-agent is an IDENTITY, not a runtime), so the lookup resolves to the claude-code template repo — whose assets are claude-code's config/prompts/skills, NOT the SEO agent-skills. The asset fetch at cp_provisioner.go:534 (if cfg.TemplateAssetFetcher != nil && cfg.TemplateIdentity != "") then delivers the claude-code template's assets (no SEO skills), or nothing if the runtime isn't in the map. Because RFC#2843 DELETED the legacy EnableSEOSkillPackage/SEOSkillPackageFiles path, there is no fallback → the seo-agent boots with zero SEO skills. (This refines 104226: not isSaaSTenant mis-classification nor a public-fetch failure — those are downstream of TemplateIdentity already resolving to the wrong template.)

EVIDENCE

workspace_provision.go:397 sets TemplateIdentity: templateIdentityForRuntimeOrEmpty(payload.Runtime). runtime_registry.go:344 reads rr, ok := templateRepoByName[runtime]; if !ok { return "", false }; return rr.Repo+"@"+rr.Ref, true. The map is keyed by RUNTIME; a seo-agent (runtime=claude-code) can only ever resolve to the claude-code template, which carries no agent-skills/seo-*. cp_provisioner.go:534 gates the fetch on that identity. RFC#2843 (#2843 body) removed the SEO-skill-package fallback. ONE input confirms it fully: ws 99d0cd72's RUNTIME value + whether seo-agent is a key in templateRepoByName. If runtime=claude-code (expected), root is locked.

RECOMMENDED FIX SHAPE

This IS the RFC#2948 case. The durable fix is RFC#2948's template workspace field feeding TemplateIdentity — i.e. TemplateIdentity = resolveTemplate(workspace.template) NOT templateIdentityForRuntime(runtime) — so a seo-agent (template=seo-agent, runtime=claude-code) fetches the seo-agent template's agent-skills/. Responsible: molecule-core workspace-server/internal/handlers/workspace_provision.go:397 + runtime_registry.go:344 + the RFC#2948 impl. Interim-only mitigation (NOT recommended — it re-conflates identity with runtime, the exact anti-pattern #2948 removes): register seo-agent as its own runtime key in templateRepoByName. Cross-ref my #2948 risk surface (103870) + gate-ordering addendum (104147): the template→fetch path is the chokepoint, and TemplateIdentity must be derived AFTER the template field is resolved.
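
A Python model of the two lookups, to make the difference concrete. The repo names and refs are placeholders, `resolve_template_identity` is a hypothetical stand-in for the proposed `resolveTemplate`, and falling back to the runtime lookup when no template is set is an assumption of this sketch, not something RFC#2948 is quoted as specifying here.

```python
# Placeholder registries: real repo names and refs are not shown in this thread.
TEMPLATE_REPO_BY_RUNTIME = {"claude-code": ("example/claude-code-template", "main")}
TEMPLATE_REPO_BY_TEMPLATE = {"seo-agent": ("example/seo-agent-template", "main")}


def template_identity_for_runtime(runtime):
    """Current behavior as described above: identity is looked up by runtime name."""
    entry = TEMPLATE_REPO_BY_RUNTIME.get(runtime)
    if entry is None:
        return "", False
    repo, ref = entry
    return f"{repo}@{ref}", True


def resolve_template_identity(template, runtime):
    """Proposed shape: the workspace's template field decides the identity."""
    if template:
        entry = TEMPLATE_REPO_BY_TEMPLATE.get(template)
        if entry is None:
            # Unknown template: fail loudly rather than deliver another template's assets.
            return "", False
        repo, ref = entry
        return f"{repo}@{ref}", True
    # Assumption: a workspace with no template keeps the runtime-derived identity.
    return template_identity_for_runtime(runtime)


print(template_identity_for_runtime("claude-code"))
print(resolve_template_identity("seo-agent", "claude-code"))
```

With the runtime-keyed lookup a seo-agent on `claude-code` can only ever reach the claude-code template; with the template-keyed lookup the same workspace resolves to the seo-agent template, and an unrecognized template returns no identity instead of silently fetching the wrong assets.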

— Root-Cause Researcher (verify-don't-trust: traced provision→TemplateIdentity→map-by-runtime; the only unconfirmed input is ws 99d0cd72's runtime string, which the mechanism predicts is a non-seo runtime)


Reference: molecule-ai/molecule-core#2831