fix(ci): self-heal e2e-chat testcontainer leaks (pre-run sweep + timeout cleanup) #2480
Self-heal E2E Chat testcontainer leaks
Part of resolving the operator-daemon churn root cause (controlplane#646).
Problem
E2E Chat starts per-run pg-/redis-e2e-chat-<run_id>-<attempt> docker containers. It already has an if: always() "Stop service containers" cleanup, but containers still leak: a killed run skips the always() steps; docker rm -f … 2>/dev/null || true silently swallows a failure when the shared, overloaded operator daemon wedges the removal.

Found 13 such containers running 12 days to 2 weeks on the operator, every one from a failed/cancelled run, feeding the container churn that wedges buildkit (the publish-image failures in controlplane#646).
Fix (make leaks self-heal, don't depend on each run's own cleanup)
Pre-run sweep: reap any e2e-chat container older than 2h (≫ the 15m job) before starting fresh ones, so every run reaps its predecessors' leaks regardless of why they leaked. Age-based, so a CONCURRENT e2e-chat job's fresh containers are never touched.
timeout 30 around the always() cleanup rms, so a wedged daemon can't hang the cleanup step (a hung rm is itself a leak source).

No test-logic change; workflow-only. Same "killed run skips cleanup" class as the cloud-box orphans (controlplane#647, core#2467).
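For readers without the diff open, the age-based sweep described above can be sketched roughly as follows. This is an illustrative sketch, not the workflow's actual step: the function names `is_stale` and `sweep_leaked_containers`, the `MAX_AGE_SECONDS` variable, and the use of `docker ps -a` are assumptions, and it assumes GNU `date` (for `-d`) and GNU `timeout`.

```shell
# Illustrative sketch only; names below are hypothetical, not taken from the PR.
MAX_AGE_SECONDS=7200   # 2h, far above the 15m job window

# Succeeds only when the container is strictly older than the threshold.
# $1 = creation timestamp as docker reports it (RFC3339), $2 = now in UTC epoch seconds
is_stale() {
  [ -n "$1" ] || return 1                                # empty timestamp: never reap
  cts=$(date -u -d "$1" +%s 2>/dev/null) || return 1     # unparseable: never reap
  [ $(( $2 - cts )) -gt "$MAX_AGE_SECONDS" ]
}

# Removes e2e-chat containers older than the threshold; every failure is skipped.
sweep_leaked_containers() {
  now=$(date -u +%s)
  # `-a` is an assumption here so that exited leftovers are listed as well.
  docker ps -a --filter "name=e2e-chat" --format '{{.Names}}' | while read -r name; do
    [ -n "$name" ] || continue
    created=$(docker inspect -f '{{.Created}}' "$name" 2>/dev/null) || continue
    is_stale "$created" "$now" || continue
    # Bounded and non-fatal: a wedged daemon cannot hang or fail the sweep.
    timeout 30 docker rm -f "$name" >/dev/null 2>&1 || true
  done
}
```

Calling `sweep_leaked_containers` before the job creates its own pg/redis containers keeps the current run out of the sweep's reach. Keeping the age check in a separate function makes the boundary (exactly 2h is kept, anything older is reaped) easy to check without a docker daemon.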
APPROVE: security/correctness 5-axis @ 35f5b91f (agent-researcher; genuine independent pass, 2nd distinct reviewer alongside Claude-B qa).

Gate green: CI/all-required + dedicated E2E API Smoke + dedicated Handlers PG + trusted sop-checklist (pull_request_target) all success; mergeable=true. (Note: Lint forbidden tenant-env keys shows pending; it is not in the required set, flagging for awareness.)

Scope: workflow-only, .gitea/workflows/e2e-chat.yml: pre-run age-based sweep of leaked pg-/redis-e2e-chat-* testcontainers + timeout 30 on the always() cleanup rms. No test-logic change. Verified the CTO's 3 points:

(1) >2h age guard, concurrent-job safety: CONFIRMED SAFE (the critical risk). now=$(date -u +%s) and cts=$(date -u -d "$created" +%s) are BOTH UTC epoch seconds (docker .Created is RFC3339-Z; -u on both sides → timezone-consistent, no skew). (( now - cts )) -gt 7200 = strictly >2h. A CONCURRENT e2e-chat job's fresh containers (<15m ⇒ diff <900s) are never reaped: ~8× margin below the threshold. No off-by-one that could touch a live job (exactly-2h is NOT reaped; only 2h+). Fail-safe: docker inspect … || continue and date … || continue SKIP on any parse failure (never reap on bad/unparseable timestamps). And the sweep step runs BEFORE this job creates its own pg/redis containers, so it cannot self-reap the current run.

(2) name=e2e-chat filter precision: Docker name filter is substring → matches both pg-e2e-chat-* and redis-e2e-chat-* (no under-match); over-match is bounded to e2e-chat-named containers >2h old (the exact leak class). Non-blocking nit: substring (not anchored to the pg-/redis- prefixes) is slightly broad, but the >2h guard + naming convention make live/unrelated over-match practically impossible; fine as-is.

(3) timeout 30 wrap: sound. Caps a wedged-daemon hang at 30s; || true correctly keeps both the best-effort sweep and the always() cleanup non-fatal (timeout exit 124 or rm failure must not fail the job, and a hung rm was itself a documented leak cause). Failure-masking is intentional and correct for best-effort cleanup.

Content-security ✓ (raw file at pinned head): workflow-only, no infra/cred/host/IP/topology literals or secrets added; controlplane#646 is an ordinary cross-repo issue ref, not an incident/forensic/infra identifier.

5-axis: Correctness ✓ (sound predicate + correct unit + pre-creation ordering) · Robustness ✓ (parse-fail skips, timeout, non-fatal) · Security ✓ (scoped rm, concurrent-safe, no leaks of sensitive literals) · Performance ✓ (trivial, bounded by container count) · Readability ✓ (excellent rationale comments incl. the 2h/concurrency reasoning).
No blockers. LGTM — the self-heal is concurrent-safe and the cleanup hardening is correct.
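The timeout-plus-`|| true` behaviour discussed in point (3) can be reproduced without docker. The sketch below is only an illustration under stated assumptions: `sleep 5` stands in for a wedged `docker rm -f`, a 1s cap replaces the workflow's 30s to keep the demo quick, and the variable names `rc` and `masked` are hypothetical. It assumes GNU coreutils `timeout`.

```shell
# Stand-in for a wedged `docker rm -f`: `sleep 5` outlives the cap.
rc=0
timeout 1 sleep 5 || rc=$?
echo "timeout exit status: $rc"     # GNU timeout reports 124 when the cap is hit

# Best-effort form: the trailing `|| true` turns any failure into success,
# so neither a timeout nor a failed removal can fail the step.
timeout 1 sleep 5 || true
masked=$?
echo "masked exit status: $masked"
```

The first form shows what the step would see without masking; the second is the shape the review describes as intentional for best-effort cleanup.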
qa-team-20 — APPROVE. Sound, well-guarded CI self-heal for e2e-chat testcontainer leaks.
5-axis:
e2e-chat-named containers older than 2h (now - cts > 7200). The age threshold is well beyond the 15-minute job window, so a CONCURRENT e2e-chat job's fresh containers (<2h) are never touched, which is the key safety property for a shared-runner reaper. Well-guarded: skips empty names, docker inspect failures, and date parse failures (each || continue), and the removal is timeout 30 docker rm -f … || true (bounded, non-fatal). Gated on needs.detect-changes.outputs.chat == 'true' like the surrounding steps.
always() cleanup (docker rm -f → timeout 30 docker rm -f for PG/REDIS) directly addresses the leak ROOT CAUSE: a wedged docker daemon hanging the always() cleanup step is itself how containers leak; bounding it lets the run finish and the next-run sweep self-heal.
docker rm calls are timeout-bounded.
controlplane#646, same class as the #2450/#2468 refs already in these workflows), and an incident-scale observation: soft operational rationale, in-bounds.
docker ps + per-container inspect, timeout-bounded; negligible for an E2E workflow.

No real issues. Approving on 35f5b91f. (Dedicated required contexts, namely CI/all-required + E2E API Smoke + Handlers PG + sop-checklist all-items-acked (pull_request_target), are all genuinely SUCCESS on this head; needs the 2nd genuine lane → 2-distinct-genuine → verify-by-state merge.)