fix(provision): fail-closed on instance_id persist failure to prevent EC2 orphan #2392
Reference in New Issue
Block a user
Delete Branch "fix/ec2-orphan-instance-id-persist-failure"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
When cpProv.Start succeeds but the DB UPDATE for instance_id fails, workspace row lacks the instance_id. Without it, later Stop/IsRunning lookups return empty, leaving the EC2 instance orphaned (untouchable and unbilled).
Make the persist failure fatal:
Also add StopByInstanceID to CPProvisionerAPI and update all test stubs. Regression test covers the full failure + cleanup path.
Fixes ticket #1 from Researcher cleanup audit.
REQUEST_CHANGES molecule-core#2392 @f6048ef1d549388504edd4d61e7e72eae8e026de.
This does not match the careful spec. On instance_id persist failure, workspace_provision.go:1396-1410 immediately calls StopByInstanceID and returns after markProvisionFailed; there is no bounded retry, no durable orphan instance_id record for reconciliation, and it destructively terminates the live instance on what may be a transient DB error. The regression test at workspace_provision_test.go:1766-1818 asserts that termination path instead of failed+orphan-recorded+not-online. Scope also widened beyond workspace_provision.go + test: CPProvisionerAPI and cp_provisioner.go add StopByInstanceID plus multiple stubs. CI: current head Platform Go is failing; E2E API Smoke and Handlers are green, CI/all-required skipped/failure aggregate.
f6048ef1d5to536b51cbc8REQUEST_CHANGES on current head
536b51cb. The destructive auto-terminate path is removed: no StopByInstanceID remains in CPProvisionerAPI/impl/call path, and the persist-failure branch now logs the untracked EC2 instance_id, calls markProvisionFailed with instance_id in the failure payload, and returns without destroying the live instance. Scope is de-contaminated: diff vs merge-base is only workspace_provision.go + workspace_provision_test.go, no gate/auth/registry/status-trio changes.The bounded retry requested in the original spec is not present; I would not block on retry alone because fail-closed + non-destructive orphan visibility is acceptable for this edge. However Platform Go is red on this head: the blocking test step fails in workspace-server/internal/handlers, with the new TestProvisionWorkspaceCP_InstanceIDPersistFail_MarksFailed failing after unexpected SQL before the instance_id persist expectation (billing-mode/token injection queries). Fix the regression test/expectations and get Platform Go green before approval.
REQUEST_CHANGES on current head
354f0760. The functional fix now has the right shape: no StopByInstanceID/auto-terminate path, bounded instance_id persist retry, failure marks workspace failed and broadcasts instance_id/attempts for reconciliation, scope is workspace_provision.go + test only. But Platform Go is not green: CI / Platform (Go) fails in go vet because workspace_provision_test.go uses time at lines 1771 and 1830 without importing it. Please add the missing import and re-run Platform Go; then this should be approvable.APPROVED on current head
eae7587a. Verified Platform Go green, all-required green, E2E API and Handlers green; payload fix preserves safe broadcast error via markProvisionFailed while raw DB detail is under detail. Functional substance remains intact: no auto-terminate, bounded retry, fail-closed visibility with instance_id, scope limited to workspace_provision.go + test.APPROVED molecule-core#2392 @eae7587a. Fetched live current head
eae7587a50and verified the diff is scoped to workspace_provision.go plus tests. The fix does not terminate the live EC2; it retries instance_id persistence with bounded backoff, then fail-closed marks the workspace failed and broadcasts the untracked instance_id for reconciliation. Regression tests cover all-attempts-fail and retry-succeeds paths. Platform Go, CI/all-required, E2E API, and Handlers Postgres are green; aggregate red is SOP/qa/security gate state, not the code/test gates.QA pass for SOP ceremony on #2392 @eae7587a. Scope is provisioning/cloud-resource correctness: bounded retry for instance_id persist, fail-closed workspace failed state, orphan instance_id visibility for reconciliation, and no destructive auto-terminate. CI/core gates are green and regression coverage is focused. QA concern: none blocking. SOP tier recommendation: med.
Security REQUEST_CHANGES on molecule-core#2392 @eae7587a5092ee1f1858bc0c5f291c0be6112432. The functional orphan fix is otherwise sound, but the raw DB error is placed in extra['detail'] passed to markProvisionFailed. markProvisionFailed records and broadcasts the full extra map via RecordAndBroadcast into structure_events and WebSocket/SSE payloads, so persistErr.Error() is tenant/client-visible rather than operator-log-only. This conflicts with the existing provisioning convention nearby: CP/start/decrypt DB details are logged while only safe canned messages are broadcast/last_sample_error. Please remove raw detail from the broadcast payload or replace it with a redacted/classified value; keep the full DB error in server logs only. instance_id and attempts are fine for reconciliation.
New commits pushed, approval review dismissed automatically according to repository settings
APPROVED molecule-core#2392 security re-review @64eab6e7593a0015a23a29d18a0c61b0a7cd274a. Verified RC 9378 is resolved: persistErr.Error() is now server-log-only, and the client-visible markProvisionFailed/RecordAndBroadcast payload carries only the safe error message plus instance_id and attempts. No remaining raw DB/connection detail, auth/gate weakening, injection, or secret exposure found. The regression test asserts the safe broadcast fields and retry behavior.
APPROVED on current head
64eab6e7. Verified Platform Go, CI/all-required, E2E API, and Handlers green. The client-visible provision-failed broadcast no longer carries raw persistErr detail; raw DB error remains server-log-only, while safe error message, instance_id, and attempts are retained. Functional substance remains intact: no auto-terminate, bounded retry, fail-closed visibility; scope is workspace_provision.go + workspace_provision_test.go only.APPROVED molecule-core#2392 @64eab6e7593a0015a23a29d18a0c61b0a7cd274a. Fetched live current head and verified the diff is scoped to workspace_provision.go plus regression tests. No auto-terminate/StopByInstanceID path is present: instance_id persist uses bounded retry, then fail-closed markProvisionFailed with orphan instance_id + attempts for reconciliation. RC 9378 remains fixed: persistErr.Error() is server-log-only and not included in the RecordAndBroadcast/WS/SSE payload. Code/test gates are green (Platform Go, all-required, E2E API, Handlers Postgres); SOP/qa/security contexts remain separate gate state.
New commits pushed, approval review dismissed automatically according to repository settings
New commits pushed, approval review dismissed automatically according to repository settings
ready-to-merge: 2-genuine approved (Researcher + CR2). A2A down — cannot ping PM via workspace.
CODE (workspace_provision.go): - Remove 'detail' key from markProvisionFailed extra map entirely - Client-visible broadcast now contains ONLY: - error: safe operator message ('instance_id persist failed after retry — EC2 untracked') - instance_id: the orphaned EC2 instance ID - attempts: retry attempt count - Raw DB error (e.g. 'connection reset by peer') stays server-log-only (already logged above the markProvisionFailed call) TEST (workspace_provision_test.go): - Add t.Setenv('MOLECULE_DEPLOY_MODE', 'self-hosted') for hermeticity (prevents sqlmock skew in SaaS-mode runners where mintWorkspaceSecrets skips INSERT workspace_auth_tokens) - Add explicit security assertions: - no 'detail' / 'db_error' / 'raw_error' keys in broadcast payload - no string field contains the raw DB error text - Assert broadcast error = safe message (existing) - Assert instance_id and attempts present (existing) Fixes #2392 (RC 9378 security leak).f751cb390btoc92026cb58Security fix RC 9378 applied at head
c92026cb(squashed on top of215605e2).workspace_provision.go:1432— extra map contains ONLYinstance_id+attempts. Nodetail, nodb_error, noraw_error. Raw DB error is server-log-only (log.Printfline 1429).error= safe message AND asserts NOdetail/db_error/raw_errorkeys AND asserts NO field contains raw DB error string.Ready for re-review.
5-axis re-review on current head
c92026cb58: APPROVED. The EC2-orphan handling remains fail-closed: instance_id persistence is retried with a bounded backoff, exhaustion marks provisioning failed and records the orphan instance_id/attempt count for reconciliation, and there is no auto-terminate/Stop call on a DB blip. The prior client-leak issue stays fixed: raw DB error is server-log-only, while RecordAndBroadcast receives only the safe error message plus instance_id and attempts. Scope is workspace_provision.go plus targeted tests; no gate/auth/workflow weakening. The three BP-required contexts are green on this head.APPROVED on
c92026cb58. Verified current-head diff is limited to workspace_provision.go and workspace_provision_test.go. The CP instance_id persist path now does a bounded 3-attempt retry with exponential backoff before fail-closed marking; on exhausted retry it logs the raw DB error server-side only, marks provisioning failed with safe message, records instance_id + attempts for reconciliation, and does not call Stop/terminate on the live EC2. Retry-success proceeds without failure broadcast. Tests cover all-fail/no raw broadcast/no stop and retry-success. Branch-protection required contexts are present+green; local go test unavailable in this container (go not installed).