fix(provision): SaaS auto-restart re-delivers template config.yaml + prompts #3055
Bug
A SaaS workspace's auto-restart re-provisions a fresh EC2 (empty /configs), but the restart lands a 218-byte stub config.yaml (no template config, no prompts/), losing the agent's identity on every restart.
Root cause (traced live via SSM on a staging tenant)
A fresh seo-agent came online with real config, then the post-online plugin-reconcile (installing seo-all) triggered Auto-restart → fresh EC2 → config.yaml = 218B stub, even though workspaces.template='seo-agent' was correctly persisted.
config.yaml/prompts reach a SaaS box only via the config_files channel: the control-plane /cp/workspaces/provision consumes config_files (staged to SM molecule/workspace/<id>/config, fetched at boot). The fetcher's TemplateAssets wire field is not consumed by the deployed CP at all. config_files is populated by collectCPConfigFiles walking cfg.TemplatePath.
The auto-restart called provisionWorkspaceAutoSyncLocked(id, "", nil, payload) with templatePath="" → config routed only into the dropped TemplateAssets → stub. The prior RFC#2843 #33 fix set payload.Template on restart, but that only drives the (CP-dropped) TemplateAssets path, so it never actually re-delivered.
Impact
config.yaml/prompts are lost on every SaaS auto-restart → degraded identity.
The template-delivery-e2e required gate flaps green/red: it races the config assertion against the post-online plugin-reconcile restart.
Fix
On the SaaS restart, resolve the LOCAL template dir the same way first-provision does (resolveWorkspaceTemplatePath) and pass it, so the re-provision re-delivers config.yaml + prompts via config_files. Re-applying touches only template-owned, allowlisted files (config.yaml, prompts/**); user/agent paths (CLAUDE.md, MEMORY.md, .claude/**, /workspace) are excluded by IsCPTemplateAssetPath, so a persisted /configs is not clobbered. Docker keeps templatePath="" (to preserve its persistent config volume).
Verify after merge (deploys to staging)
template-delivery-e2e goes deterministically green (the restart re-stub race is removed).
This then unblocks #3049 (concierge create_workspace), whose own gate run was blocked by this intermittent failure.
SOP checklist
go build ./... clean; go test ./internal/handlers -run 'Restart|Provision' green; gofmt-clean.
Root cause verified empirically (SSM: workspaces.template=seo-agent, box /configs=218B stub, provision.start template field present on both provisions, CP request struct has no template_assets).
(TemplateAssets field is dead; cleanup tracked, not in this PR.)
Refs RFC#2843 #33, #3049 (unblocks).
🤖 Generated with Claude Code
Security/correctness review: APPROVE. SaaS auto-restart re-stub fix. Verified independently: (1) Docker safety: resolution is gated to h.cpProv != nil; self-host keeps templatePath="" (no clobber of persistent /configs). (2) Clobber safety: collectCPConfigFiles filters the template-dir walk through isCPTemplateConfigFile (config.yaml + prompts/** only); user/agent paths (CLAUDE.md, MEMORY.md, .claude/**, /workspace) are excluded, covered by TestStart_CollectsConfigFiles. (3) Fail-safe: absent template dir → restartTemplatePath="" (no worse than today), no nil/panic. (4) Path flows to cfg.TemplatePath (dispatch → provisionWorkspaceCP → buildProvisionerConfig → Start → collect). build + Restart/Provision tests green.
QA review: APPROVE. Root cause confirmed: CP consumes config_files only (no template_assets field), so config reaches a SaaS box via config_files (TemplatePath walk); the restart passed templatePath="" → 218B stub. Fix mirrors first-provision resolution on the SaaS branch. Follow-up (non-blocking): add a unit test asserting the SaaS restart passes the resolved dir.
Acking after independent review (security + QA pass) of the SaaS restart re-stub fix.
/sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted
APPROVED. Five-axis review complete on current head 2ba0662d.
Correctness: the restart path now restores the persisted template and resolves a local template dir only on the SaaS/cpProv branch, then passes that path into provisionWorkspaceAutoSyncLocked so cfg.TemplatePath can feed collectCPConfigFiles. Docker/self-host behavior remains templatePath="".
Robustness/security: resolveWorkspaceTemplatePath keeps resolution inside configured roots, and collectCPConfigFiles/isCPTemplateConfigFile/IsCPTemplateAssetPath constrain delivery to config.yaml and prompts/** with traversal/symlink/size checks, so this does not reintroduce user/agent state clobbering. Failure to resolve logs and preserves the prior fallback rather than panicking.
Performance/readability: one stat plus existing template walk only on SaaS restart; comments are long but accurately explain the production invariant.
CI was reported green; no blockers found.