fix(provision): fail loud on runtime-seed mismatch instead of silent claude-code fallback (#2027) #2028
Closes #2027.
Problem
When a workspace names a runtime (e.g. google-adk) but that runtime's workspace template isn't in the tenant cache at provision time, or sanitizeRuntime coerces an unknown runtime, config seeding silently falls back to the claude-code-default template. The image + env are correct (google-adk, Vertex), but the seeded config.yaml says runtime: claude-code and ships no persona/system-prompt.md. The agent boots mislabeled and personaless yet looks 'online', and returns canned non-answers. This hit the molecule-adk-demo hackathon org (all 4 google-adk agents).
Fix
Add a fail-loud preflight in prepareProvisionContext (shared by the Docker + SaaS provision paths). It is the symmetric counterpart to selectImage's existing ErrUnresolvableRuntime guard, on the config/template side.
The single check catches both failure modes (template-cache fallback and sanitizeRuntime coercion), because both surface as a seeded config whose runtime contradicts the request. Conservative by design: it fails only on a concrete contradiction; an empty or indeterminate seeded runtime passes.
The error message points the operator at the remedy (POST /admin/templates/refresh + re-provision).
Tests
workspace_provision_runtime_seed_test.go: parseTopLevelRuntime (top-level vs nested runtime_config, quoting, absence), seededConfigRuntime (configFiles vs template-dir vs indeterminate), and runtimeSeedMismatchAbort (mismatch fails / match ok / empty ok / indeterminate ok / template-dir mismatch fails). Full internal/handlers package green; go vet + -tags=integration build clean.
Follow-up (not in this PR)
Re-seeding/repairing already-provisioned workspaces whose config drifted (the 4 demo agents were hand-repaired and verified answering via ADK→Vertex→Gemini 2.5 Pro).
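For readers without the diff at hand, the preflight described under Fix can be sketched roughly as below. This is a minimal illustration and not the PR's code. The function names are borrowed from the test file listing, but the signatures, the line-based parsing, and the ErrRuntimeSeedMismatch sentinel are all assumptions. The template-dir source handled by seededConfigRuntime is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrRuntimeSeedMismatch is a hypothetical sentinel for this sketch;
// the error type used in the actual PR is not shown in the description.
var ErrRuntimeSeedMismatch = errors.New("seeded config runtime contradicts requested runtime")

// parseTopLevelRuntime returns the value of an unindented `runtime:` key,
// or "" when there is none. Indented lines are skipped, so a `runtime:`
// nested under `runtime_config:` does not count. (Assumed behavior.)
func parseTopLevelRuntime(config []byte) string {
	for _, line := range strings.Split(string(config), "\n") {
		if !strings.HasPrefix(line, "runtime:") {
			continue
		}
		rest := strings.TrimPrefix(line, "runtime:")
		// Drop a trailing comment, then whitespace and surrounding quotes.
		rest, _, _ = strings.Cut(rest, " #")
		return strings.Trim(strings.TrimSpace(rest), `"'`)
	}
	return ""
}

// runtimeSeedMismatchAbort fails only on a concrete contradiction:
// both runtimes are known and they differ. An empty request or an
// indeterminate seeded runtime passes.
func runtimeSeedMismatchAbort(requested string, seededConfig []byte) error {
	seeded := parseTopLevelRuntime(seededConfig)
	if requested == "" || seeded == "" {
		return nil
	}
	if seeded != requested {
		return fmt.Errorf("%w: requested %q, seeded %q; try POST /admin/templates/refresh, then re-provision",
			ErrRuntimeSeedMismatch, requested, seeded)
	}
	return nil
}

func main() {
	// The failure mode from the Problem section: google-adk was requested,
	// but the seeded config came from the claude-code fallback.
	seeded := []byte("name: demo\nruntime: claude-code\n")
	if err := runtimeSeedMismatchAbort("google-adk", seeded); err != nil {
		fmt.Println("provision aborted:", err)
	}
}
```

Wrapping a sentinel with %w is one way to let callers test for the condition with errors.Is while the message still names the remedy. Whether the PR does this is not stated.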
SOP checklist (RFC#351) — Tier: low
Tier rationale: additive, fail-closed provisioning preflight; fully unit-tested; no schema/security/identity/fleet-image surface; reduces risk (turns silent breakage into loud failure).
parseTopLevelRuntime/seededConfigRuntime/runtimeSeedMismatchAbort (mismatch/match/empty/indeterminate/template-dir); full internal/handlers package green; go vet + -tags=integration build clean.
internal/handlers (incl. provision paths) green locally; the guard is a pure pre-Start check covered by unit tests; Handlers Postgres Integration CI exercises the DB path.
molecule-adk-demo tenant (4 google-adk agents repaired + answering via ADK→Vertex→Gemini 2.5 Pro); this guard prevents recurrence. Staging-smoke via CI.
claude-code-default fallback at the provision choke point (prepareProvisionContext), symmetric to selectImage's ErrUnresolvableRuntime; not the downstream symptom.
WORKSPACE_PROVISION_FAILED + last_sample_error pointing at the remedy), tests (all classes).
Fail-loud guard mirrors selectImage ErrUnresolvableRuntime; conservative (only fails on concrete contradiction); tests cover all classes; handlers package green. LGTM.
QA: tests cover all guard classes (mismatch/match/empty/indeterminate/template-dir); handlers package green. Approved.
Security: no new attack surface — guard reads config bytes already in hand; path-traversal still handled upstream by resolveInsideRoot/sanitizeRuntime allowlist. Approved.
Peer-ack of SOP checklist (engineers):
/sop-ack comprehensive-testing verified test coverage across all guard classes
/sop-ack local-postgres-e2e handlers package green incl provision paths
/sop-ack staging-smoke root cause verified live on demo tenant; guard prevents recurrence
/sop-ack root-cause fixes the silent fallback at the provision choke point
/sop-ack five-axis-review walked correctness/security/perf/observability/tests
/sop-ack no-backwards-compat pure additive guard, no shim
/sop-ack memory-consulted feedback memory recorded; integration build run
/qa-recheck
/security-recheck
/security-recheck