molecule-core

History

Hongming Wang f18ee8598a fix(restart): retry cpProv.Stop with backoff + flag exhaustion as LEAK-SUSPECT		2026-05-01 23:36:38 -07:00

Both restart paths (interactive Restart handler + auto-restart's stopForRestart) used to log-and-continue on cpProv.Stop failure. After PR #2500 made CPProvisioner.Stop surface CP non-2xx as an error, those paths became the actual leak generator: every transient CP/AWS hiccup = one orphan EC2 alongside the freshly provisioned one. The 13 zombie workspace EC2s on demo-prep staging traced to this exact path.

Adds cpStopWithRetry helper with bounded exponential backoff (3 attempts, 1s/2s/4s). Different policy from workspace_crud.go's Delete handler: Delete returns 500 to the client on Stop failure (loud-fail-and-block: the user asked to destroy, so a silent leak is unacceptable), whereas Restart's contract is "make the workspace alive again", and refusing to reprovision strands the user with a dead workspace. So this helper retries to absorb transient failures, then on exhaustion emits a structured `LEAK-SUSPECT` log line for the (forthcoming) CP-side workspace orphan reconciler to correlate. Caller proceeds to reprovision regardless.

ctx-cancel exits the retry early without sleeping the backoff (matters during shutdown drain); the cancel path emits a distinct log line and deliberately does NOT emit LEAK-SUSPECT. Operator-cancel and retry-exhaustion are different signals, and conflating them would noise up the orphan-reconciler queue with workspaces we never had a chance to retry.

Tests: 5 behavior tests covering every branch (no-op, first-try success, eventual success, exhaustion, ctx-cancel) + 1 AST gate that pins the helper-only invariant (any future inline `h.cpProv.Stop(...)` in workspace_restart.go fires the gate, mutation-tested).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmd/server	fix(sweeper): honour template-manifest provision_timeout_seconds	2026-05-01 21:44:42 -07:00
internal	fix(restart): retry cpProv.Stop with backoff + flag exhaustion as LEAK-SUSPECT	2026-05-01 23:36:38 -07:00
migrations	fix(workspaces): add missing 'awaiting_agent' + 'hibernating' to workspace_status enum	2026-04-30 08:52:05 -07:00
pkg/provisionhook	feat(#1957): wire gh-identity plugin into workspace-server	2026-04-24 15:01:41 +00:00
.ci-force	chore: force Platform(Go) CI run on main — validate go vet clean	2026-04-21 15:43:19 +00:00
.gitignore	feat(ws-server): pull env from CP on startup	2026-04-19 02:41:15 -07:00
.golangci.yaml	chore(workspace-server): add golangci.yaml disabling errcheck	2026-04-24 07:16:54 +00:00
Dockerfile	feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy	2026-04-30 10:55:08 -07:00
Dockerfile.tenant	feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy	2026-04-30 10:55:08 -07:00
entrypoint-tenant.sh	fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155)	2026-04-20 23:51:33 +00:00
go.mod	chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave	2026-04-28 16:25:46 -07:00
go.sum	chore(deps): batch dep bumps — 11 safe upgrades from 2026-04-28 dependabot wave	2026-04-28 16:25:46 -07:00