fix(provision): SaaS auto-restart re-delivers template config.yaml + prompts #3055

Merged
devops-engineer merged 1 commit from fix/saas-restart-template-redelivery into main 2026-06-19 03:36:04 +00:00
Member

Bug

A SaaS workspace's auto-restart re-provisions a fresh EC2 (empty /configs), but the restart lands a 218-byte stub config.yaml (no template config, no prompts/) — losing the agent's identity on every restart.

Root cause (traced live via SSM on a staging tenant)

A fresh seo-agent came online with real config, then the post-online plugin-reconcile (installing seo-all) triggered Auto-restart → fresh EC2 → config.yaml = 218B stub, even though workspaces.template='seo-agent' was correctly persisted.

config.yaml/prompts reach a SaaS box only via the config_files channel: the control-plane /cp/workspaces/provision consumes config_files (staged to SM molecule/workspace/<id>/config, fetched at boot) — the fetcher's TemplateAssets wire field is not consumed by the deployed CP at all. config_files is populated by collectCPConfigFiles walking cfg.TemplatePath.

The auto-restart called provisionWorkspaceAutoSyncLocked(id, "", nil, payload) with templatePath="" → config routed only into the dropped TemplateAssets → stub. The prior RFC#2843 #33 fix set payload.Template on restart, but that only drives the (CP-dropped) TemplateAssets path, so it never actually re-delivered.
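The failure mode above can be sketched as follows. This is a minimal illustration, not the repository's code: `collectConfigFiles` and `exampleTemplates` are hypothetical stand-ins (the real function is `collectCPConfigFiles`, whose signature is not shown in this PR), the in-memory template tree is invented for the example, and the allowlist filter is omitted for brevity.

```go
package main

import (
	"io/fs"
	"strings"
	"testing/fstest"
)

// collectConfigFiles is a hypothetical stand-in for collectCPConfigFiles: it
// walks templatePath inside templates and returns relative path -> contents.
// An empty templatePath yields an empty map, which models the restart bug.
func collectConfigFiles(templates fs.FS, templatePath string) (map[string][]byte, error) {
	files := map[string][]byte{}
	if templatePath == "" {
		return files, nil // nothing to walk, so nothing is staged for the box
	}
	err := fs.WalkDir(templates, templatePath, func(p string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.IsDir() {
			return nil
		}
		data, readErr := fs.ReadFile(templates, p)
		if readErr != nil {
			return readErr
		}
		// Store paths relative to the template root, e.g. "prompts/system.md".
		files[strings.TrimPrefix(p, templatePath+"/")] = data
		return nil
	})
	return files, err
}

// exampleTemplates builds an in-memory template tree for illustration only.
func exampleTemplates() fs.FS {
	return fstest.MapFS{
		"seo-agent/config.yaml":       {Data: []byte("name: seo-agent\n")},
		"seo-agent/prompts/system.md": {Data: []byte("You are an SEO agent.\n")},
	}
}
```

Calling `collectConfigFiles(exampleTemplates(), "")` returns an empty map, while passing `"seo-agent"` returns both template files, which mirrors the difference between the broken restart and a first provision.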

Impact

  • Prod reliability: any SaaS workspace that auto-restarts (plugin install, crash-recovery) loses its template config.yaml/prompts → degraded identity.
  • CI: the template-delivery-e2e required gate flaps green/red — it races the config assertion vs the post-online plugin-reconcile restart.

Fix

On the SaaS restart, resolve the LOCAL template dir the same way first-provision does (resolveWorkspaceTemplatePath) and pass it, so the re-provision re-delivers config.yaml + prompts via config_files. Re-applying touches only template-owned, allowlisted files (config.yaml, prompts/**); user/agent paths (CLAUDE.md, MEMORY.md, .claude/**, /workspace) are excluded by IsCPTemplateAssetPath, so a persisted /configs is not clobbered. Docker keeps templatePath="" (preserve its persistent config volume).
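A rough sketch of the restart-side decision, under stated assumptions: `restartTemplatePath` and `resolveFunc` are hypothetical names introduced here, the resolver stands in for `resolveWorkspaceTemplatePath` (whose real signature is not shown in this PR), and the boolean models the SaaS versus Docker branch.

```go
package main

// resolveFunc is a hypothetical stand-in for resolveWorkspaceTemplatePath: it
// maps a persisted template name to a local directory, or reports failure.
type resolveFunc func(template string) (dir string, ok bool)

// restartTemplatePath picks the templatePath to pass on an auto-restart.
func restartTemplatePath(isSaaS bool, template string, resolve resolveFunc) string {
	if !isSaaS {
		return "" // Docker: leave the persistent config volume untouched
	}
	if template == "" {
		return "" // no persisted template, nothing to re-deliver
	}
	dir, ok := resolve(template)
	if !ok {
		return "" // fall back to the pre-fix behaviour instead of failing
	}
	return dir // SaaS: re-deliver config.yaml + prompts via config_files
}
```

With a resolver that knows `seo-agent`, the SaaS branch returns the resolved directory; the Docker branch and an unknown template both return the empty string.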

Verify after merge (deploys to staging)

template-delivery-e2e goes deterministically green (the restart re-stub race is removed). This then unblocks #3049 (concierge create_workspace), whose own gate run was blocked by this intermittent failure.

SOP checklist

  • Comprehensive testing performed: go build ./... clean; go test ./internal/handlers -run 'Restart|Provision' green; gofmt-clean. Root cause verified empirically (SSM: workspaces.template=seo-agent, box /configs=218B stub, provision.start template field present on both provisions, CP request struct has no template_assets).
  • Local-postgres E2E run: Handlers Postgres Integration on head; pure handler change, no schema.
  • Staging-smoke verified or pending: scheduled post-merge — the fix takes effect once deployed (template-delivery-e2e is a black-box test of deployed code).
  • Root-cause not symptom: fixes the actual delivery channel (config_files via templatePath on restart), not the symptom.
  • Five-Axis review walked: correctness/readability/architecture/security/performance.
  • No backwards-compat shim / dead code added: no shim; Docker path unchanged; SaaS path corrected. (Separately: the CP TemplateAssets field is dead; its cleanup is tracked but is not part of this PR.)
  • Memory/saved-feedback consulted: feedback_no_such_thing_as_flakes (named the race mechanism), feedback_follow_dev_sop_phase1_evidence_first (SSM ground truth).

Refs RFC#2843 #33, #3049 (unblocks).

🤖 Generated with Claude Code

core-devops added 1 commit 2026-06-19 02:53:06 +00:00
fix(provision): SaaS auto-restart re-delivers template config.yaml + prompts
E2E Workspace Lifecycle (staginge2e) / E2E Workspace Lifecycle (staging) (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 6s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
sop-checklist / review-refire (pull_request_target) Has been skipped
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s
Harness Replays / detect-changes (pull_request) Successful in 11s
E2E Workspace Lifecycle (staginge2e) / E2E Workspace Lifecycle (compile+skip) (pull_request) Successful in 14s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 9s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s
PR Diff Guard / PR diff guard (pull_request) Successful in 15s
sop-checklist / all-items-acked (pull_request_target) Successful in 10s
E2E Chat / detect-changes (pull_request) Successful in 22s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 17s
gate-check-v3 / gate-check (pull_request_target) Successful in 15s
CI / Detect changes (pull_request) Successful in 26s
template-delivery-e2e / detect-changes (pull_request) Successful in 19s
E2E Chat / E2E Chat (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 25s
template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E API Smoke Test / detect-changes (pull_request) Successful in 30s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
CI / Canvas Deploy Status (pull_request) Successful in 1s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 40s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 35s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 36s
Harness Replays / Harness Replays (pull_request) Successful in 1m20s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m21s
CI / Platform (Go) (pull_request) Successful in 4m1s
CI / all-required (pull_request) Successful in 4s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m34s
sop-checklist / na-declarations (pull_request) N/A: (none)
reserved-path-review / reserved-path-review (pull_request_review) Successful in 9s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 11s
qa-review / approved (pull_request_review) Successful in 11s
audit-force-merge / audit (pull_request_target) Successful in 7s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
2ba0662da6
The SaaS auto-restart re-provisions a fresh EC2 (empty /configs), but the
restart dispatched provisionWorkspaceAutoSyncLocked(id, "", nil, payload) with
templatePath="". config.yaml/prompts reach a SaaS box ONLY via the config_files
channel (CP /cp/workspaces/provision consumes `config_files`; the fetcher's
TemplateAssets wire field is NOT consumed by the deployed control-plane), and
config_files is populated by collectCPConfigFiles walking cfg.TemplatePath. With
templatePath="", config routed only into the dropped TemplateAssets → every SaaS
restart landed a 218-byte stub config.yaml (lost identity/prompts).

The prior RFC#2843 #33 fix set payload.Template on restart, but that only drives
the (CP-dropped) TemplateAssets path — it never re-delivered. This resolves the
LOCAL template dir the same way first-provision does
(resolveWorkspaceTemplatePath) and passes it, so the restart re-provision
re-delivers config.yaml + prompts via config_files. Re-applying touches only
template-owned, allowlisted files (config.yaml, prompts/**); user/agent paths
(CLAUDE.md, MEMORY.md, .claude/**, /workspace) are excluded by
IsCPTemplateAssetPath, so a persisted /configs is not clobbered. Docker keeps
templatePath="" (preserve its persistent config volume).

Root-caused live via SSM on a staging tenant: seo-agent online with real config,
then post-online plugin-reconcile triggered an auto-restart → fresh EC2 →
config.yaml = 218B stub (workspaces.template was correctly 'seo-agent'). This is
the template-delivery-e2e intermittent (races the assertion vs the restart) and a
prod identity-loss-on-restart bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
core-security approved these changes 2026-06-19 02:59:13 +00:00
core-security left a comment
Member

Security/correctness review: APPROVE. SaaS auto-restart re-stub fix. Verified independently:
(1) Docker safety: resolution is gated to h.cpProv != nil; self-host keeps templatePath="" (no clobber of persistent /configs).
(2) Clobber safety: collectCPConfigFiles filters the template-dir walk through isCPTemplateConfigFile (config.yaml + prompts/** only); user/agent paths (CLAUDE.md, MEMORY.md, .claude/**, /workspace) are excluded, covered by TestStart_CollectsConfigFiles.
(3) Fail-safe: absent template dir → restartTemplatePath="" (no worse than today), no nil/panic.
(4) Path flows to cfg.TemplatePath (dispatch→provisionWorkspaceCP→buildProvisionerConfig→Start→collect).
Build + Restart/Provision tests green.

molecule-code-reviewer approved these changes 2026-06-19 02:59:15 +00:00
molecule-code-reviewer left a comment
Member

QA review — APPROVE. Root cause confirmed: CP consumes config_files only (no template_assets field), so config reaches a SaaS box via config_files (TemplatePath walk); the restart passed templatePath="" → 218B stub. Fix mirrors first-provision resolution on the SaaS branch. Follow-up (non-blocking): add a unit test asserting the SaaS restart passes the resolved dir.

Member

Acking after independent review (security + QA pass) of the SaaS restart re-stub fix.
/sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted

agent-reviewer-cr2 approved these changes 2026-06-19 03:34:52 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED. Five-axis review complete on current head 2ba0662d.

Correctness: the restart path now restores the persisted template and resolves a local template dir only on the SaaS/cpProv branch, then passes that path into provisionWorkspaceAutoSyncLocked so cfg.TemplatePath can feed collectCPConfigFiles. Docker/self-host behavior remains templatePath="".

Robustness/security: resolveWorkspaceTemplatePath keeps resolution inside configured roots, and collectCPConfigFiles/isCPTemplateConfigFile/IsCPTemplateAssetPath constrain delivery to config.yaml and prompts/** with traversal/symlink/size checks, so this does not reintroduce user/agent state clobbering. Failure to resolve logs and preserves the prior fallback rather than panicking. Performance/readability: one stat plus existing template walk only on SaaS restart; comments are long but accurately explain the production invariant. CI was reported green; no blockers found.
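The allowlist described above might look roughly like this. It is a sketch under assumptions: `isTemplateAssetPath` is a hypothetical name (the real helpers are `isCPTemplateConfigFile` and `IsCPTemplateAssetPath`, whose bodies are not shown here), and it only covers the lexical traversal check, not the symlink or size checks.

```go
package main

import (
	"path"
	"strings"
)

// isTemplateAssetPath is a hypothetical sketch of the template allowlist: it
// accepts only config.yaml and files under prompts/, given a slash-separated
// path relative to the template root.
func isTemplateAssetPath(rel string) bool {
	if rel == "" || strings.HasPrefix(rel, "/") || strings.Contains(rel, "\\") {
		return false // absolute or non-canonical separators are rejected
	}
	if path.Clean(rel) != rel {
		return false // rejects "..", ".", doubled or trailing slashes
	}
	if rel == "config.yaml" {
		return true
	}
	// Anything else must live under prompts/; user and agent state such as
	// CLAUDE.md, MEMORY.md, or .claude/** never matches.
	return strings.HasPrefix(rel, "prompts/")
}
```

Under this sketch `prompts/system.md` and `config.yaml` pass, while `CLAUDE.md`, `.claude/settings.json`, and `prompts/../CLAUDE.md` are rejected.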

devops-engineer merged commit f9f718f95a into main 2026-06-19 03:36:04 +00:00
Reference: molecule-ai/molecule-core#3055