fix(e2e): teardown leaked ws- workspace containers + standing orphan-sweeper (#2883) #2885
The leak (#2883)
.gitea/workflows/local-provision-e2e.yml provisions stub/real ws-<uuid> workspace sibling containers (image molecule-local/workspace-template-*/stub-runtime) from the host platform binary via the runner's docker.sock. Its always() teardown only docker rm -f'd the Postgres/Redis sidecars; it never removed the ws-<uuid> workspace container(s) the run created.
The in-test cleanup() EXIT trap in tests/e2e/test_local_provision_lifecycle_e2e.sh does delete the workspace it created, but only when the trap actually fires. On a cancelled / timed-out job act_runner SIGKILLs the job container, so neither the bash trap nor the workflow teardown completes; a platform crash mid-provision can likewise orphan a half-started container. The leaked ws-<uuid> then runs forever and pegs CPU on the shared docker-host runner.
Evidence: 13 orphans on the retired ded-1 box (2-3 days old), 11+3 on the production robots, all cleaned manually. This recurs on every failed/timed-out run, and accumulated orphans exhausting the runner host is the likely root cause of the advisory-lane intermittent reds (#2693 / #2680 / #2739).
Fix - two layers
1. Per-job run-scoped teardown (both lifecycle-stub + lifecycle-real)
- At job start, snapshot the ws-* container IDs that already exist (docker ps -aq --filter name=^ws- to a per-run_id/run_attempt tmp file).
- An if: always() teardown step (right after the existing sidecar teardown) removes only the ws-* containers not in that baseline - i.e. the ones this run created.
- Run-scoped: a concurrent run's in-flight workspace on the shared docker-host is never touched. (lifecycle-real is already serialized behind lifecycle-stub via needs: lifecycle-stub; cross-SHA concurrent runs are the remaining case the baseline-diff protects.)
- Follows the existing Stop service containers style and the script's own "scoped teardown, never a blanket sweep" comment.
2. Standing orphan-sweeper - .gitea/workflows/sweep-stale-ws-orphans.yml
Belt-and-braces second layer for the case where the per-job always() step itself can't run (runner container SIGKILLed before it executes):
- Runs on docker-host (where the orphans live and where docker.sock / molecule-core-net are - same substrate constraint as the e2e lane and handlers-postgres-integration.yml; a bare ubuntu-latest Windows act_runner would inspect the wrong daemon).
- Hourly cron (17 * * * *) + workflow_dispatch.
- Removes ws-* / molecule-local/* workspace containers older than WS_MAX_AGE_HOURS (2h) - well beyond the ~30 min max run (lifecycle-real timeout-minutes: 30), so an in-flight run is never touched. Uses StartedAt (falling back to Created for never-started containers).
- Safety cap (SAFETY_CAP=100, bail loud on a suspiciously large batch), DRY_RUN escape hatch, fail-loud notify step - all mirroring sweep-stale-e2e-orgs.yml. Pinned permissions: contents: read, timeout-minutes: 10, queued concurrency.
Verification
- yaml.safe_load clean.
- docker ps --filter name=^ws- anchor behavior verified on the operator daemon (matches ws-..., ignores the container's leading /; safer than the substring form).
- […] uses:.
Notes / flags
lifecycle-stub runs gating-locally (continue-on-error: false) but is not yet wired into branch protection (# bp-required: pending #2409); lifecycle-real is advisory.
Closes #2883. Likely mitigates #2693 / #2680 / #2739.
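
The baseline diff behind layer 1 can be sketched roughly as below. This is an illustration under stated assumptions, not the workflow's actual step: the function names, the temp-file handling, and the use of GNU grep for the set difference are all hypothetical; only the docker ps -aq --filter name=^ws- listing comes from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: function and file names are hypothetical, not taken from the PR.
set -eu

# Print the IDs listed in "$2" (current) that are absent from "$1" (baseline).
new_since_baseline() {
  # -v invert, -x whole-line match, -F fixed strings, -f read patterns from file.
  # grep exits 1 when no line is selected; treat that as success.
  # Assumes GNU grep, where an empty pattern file matches nothing.
  grep -vxF -f "$1" "$2" || [ "$?" -eq 1 ]
}

# Job start: record every ws-* container that already exists.
snapshot_ws_baseline() {
  docker ps -aq --filter 'name=^ws-' > "$1"
}

# always() step: remove only the ws-* containers this run created.
teardown_new_ws() {
  current="$(mktemp)"
  docker ps -aq --filter 'name=^ws-' > "$current"
  new_since_baseline "$1" "$current" | while read -r id; do
    if [ -n "$id" ]; then docker rm -f "$id"; fi
  done
  rm -f "$current"
}
```

A job would call snapshot_ws_baseline with a per-run file path before provisioning and teardown_new_ws with the same path in its always() step. Keeping the diff in its own function means the "only what this run created" rule can be checked without a Docker daemon.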
🤖 Generated with Claude Code
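
The sweeper's age guard (layer 2) might look something like the following. It is a hedged sketch, not the contents of sweep-stale-ws-orphans.yml: the helper names are invented for illustration, to_epoch assumes GNU date, and only WS_MAX_AGE_HOURS, SAFETY_CAP, and the StartedAt-then-Created fallback are taken from the description.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical. WS_MAX_AGE_HOURS and SAFETY_CAP
# mirror the values quoted in the PR description.
set -eu

WS_MAX_AGE_HOURS="${WS_MAX_AGE_HOURS:-2}"
SAFETY_CAP="${SAFETY_CAP:-100}"

# Choose StartedAt ("$1"), falling back to Created ("$2") for never-started
# containers, which Docker reports with the zero time 0001-01-01T00:00:00Z.
pick_timestamp() {
  case "$1" in
    ''|0001-01-01T*) printf '%s\n' "$2" ;;
    *)               printf '%s\n' "$1" ;;
  esac
}

# Convert an RFC 3339 timestamp to epoch seconds (assumes GNU date).
# Prints nothing if the timestamp cannot be parsed.
to_epoch() {
  date -u -d "$1" +%s 2>/dev/null || true
}

is_uint() {
  case "$1" in ''|*[!0-9]*) return 1 ;; *) return 0 ;; esac
}

# Succeed only if the container is strictly older than the age floor.
# Unparseable epochs count as "not stale", so such rows are skipped.
is_stale() { # args: now_epoch started_epoch max_age_hours
  is_uint "$1" && is_uint "$2" || return 1
  [ $(( $1 - $2 )) -gt $(( $3 * 3600 )) ]
}

# Refuse to act on a suspiciously large batch.
within_safety_cap() { # args: candidate_count
  [ "$1" -le "$SAFETY_CAP" ]
}
```

A sweep loop would feed each container's timestamps (for example from docker inspect --format '{{.State.StartedAt}} {{.Created}}') through pick_timestamp and to_epoch, collect the IDs for which is_stale succeeds, and abort before any removal if within_safety_cap fails.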
local-provision-e2e.yml provisions ws-<uuid> workspace containers via the host platform binary's docker.sock, but its always() teardown only removed the Postgres/Redis sidecars, never the ws-<uuid> container(s) it created. On a cancelled/timed-out job (act_runner SIGKILLs the job container so the e2e script's EXIT trap never fires) or a platform crash mid-provision, the ws-<uuid> container leaks and runs forever, pegging CPU on the shared docker-host runner.
13 orphans were found on the retired ded-1 box (2-3 days old) and 11+3 on the prod robots (cleaned manually). Accumulated orphans exhausting the runner host is the likely root cause of the advisory-lane intermittent reds.
Two layers:
1. Per-job run-scoped teardown (both lifecycle-stub + lifecycle-real): snapshot pre-existing ws-* containers at job start, then in an always() step remove ONLY ws-* containers NOT in that baseline (i.e. the ones this run created). Run-scoped so a concurrent run's in-flight workspace on the same shared host is never disrupted.
2. Standing janitor sweep-stale-ws-orphans.yml on docker-host, hourly cron + workflow_dispatch: age-guarded (>2h, well beyond the ~30m max run) removal of ws-* / molecule-local workspace containers. Belt-and-braces for the case where the runner container itself is SIGKILLed before the per-job always() step can run. Safety-capped, fail-loud (mirrors sweep-stale-e2e-orgs.yml).
Closes #2883. Likely mitigates #2693 / #2680 / #2739.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5-axis review: APPROVE. head 0d7cd8f (#2883)
Reviewed carefully because it adds a standing destructive janitor (docker rm -f hourly on the shared docker-host). The guardrails are sound.
- Two layers: (1) a per-job always() teardown that snapshots a pre-job ws-* baseline and removes only containers NOT in it; (2) a standing hourly sweeper for what a SIGKILLed/cancelled job leaks. Addresses the real #2883 leak (forever-running ws-<uuid> pegging CPU).
- name=^ws- + ancestor=molecule-local/* filter (won't touch pg/redis sidecars, platform, or unrelated containers); 2h age floor vs the 30-min max run (in-flight runs never caught); StartedAt→Created fallback; unparseable-timestamp rows skipped (fail-safe); a 100-container SAFETY_CAP that bails on clock-skew/enumeration anomalies; DRY_RUN hatch; daemon-reachability precheck; concurrency group; pinned to docker-host; permissions: contents: read; fail-loud (continue-on-error: false + failure notice). This is exactly the defensive shape this kind of janitor needs.
Non-blocking note (per-run teardown concurrency caveat): the baseline-diff guarantees "a concurrent run's workspace is untouched" only for containers that existed before this job's baseline snapshot. The workflow's concurrency group is per-SHA (local-provision-e2e-${{ head.sha }}), so two different PRs can run this lane concurrently on the same docker-host; a ws- container that a concurrent run creates after this job's baseline but before its teardown would be removed by this job. This is reachable only if the runner parallelizes these jobs (the two in-workflow jobs also each take independent baselines). Worst case is a rare flaky e2e red, not infra damage, and it's strictly better than the current forever-leak. If you want to close it fully, scope the per-run teardown by a run-id container label (label ws- containers at provision time, delete by that label) instead of a timing baseline. The standing sweeper is unaffected by this (age-guarded).
Routine CI-hygiene fix with well-guarded destructive paths, so approving. CI is green except the qa-review/approved ceremony (this review) + a pending E2E context.
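
The label-scoped alternative suggested in the non-blocking note could be sketched as follows. Everything specific here is an assumption for illustration: the label key e2e.run, the function names, and the DOCKER override variable are hypothetical, and the platform binary would have to attach the label when it creates each ws- container (for example via docker run --label). Only the docker ps label filter and docker rm -f are standard CLI behavior.

```shell
#!/usr/bin/env bash
# Sketch only: the e2e.run label key and these function names are hypothetical.
# DOCKER can be overridden so the logic is testable without a daemon.
DOCKER="${DOCKER:-docker}"

# Build the label that ties a container to one run attempt.
run_label() { # args: run_id run_attempt
  printf 'e2e.run=%s-%s\n' "$1" "$2"
}

# Remove every container carrying this run's label, and nothing else.
teardown_by_label() { # args: run_id run_attempt
  label="$(run_label "$1" "$2")"
  ids="$("$DOCKER" ps -aq --filter "label=$label")"
  if [ -z "$ids" ]; then
    return 0
  fi
  # Word splitting is intended: one argument per container ID.
  # shellcheck disable=SC2086
  "$DOCKER" rm -f $ids
}
```

Because selection is by label rather than by timing, a container created by a concurrent run after this job's start is never matched, which is the gap the baseline diff leaves open.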