fix(e2e): teardown leaked ws- workspace containers + standing orphan-sweeper (#2883) #2885

2026-06-14T23:00:29Z

devops-engineer commented

2026-06-14 23:00:29 +00:00

The leak (#2883)

.gitea/workflows/local-provision-e2e.yml provisions stub/real ws-<uuid> workspace sibling containers (image molecule-local/workspace-template-* / stub-runtime) from the host platform binary via the runner's docker.sock. Its always() teardown only docker rm -f'd the Postgres/Redis sidecars — it never removed the ws-<uuid> workspace container(s) the run created.

The in-test cleanup() EXIT trap in tests/e2e/test_local_provision_lifecycle_e2e.sh does delete the workspace it created — but only when the trap actually fires. On a cancelled / timed-out job act_runner SIGKILLs the job container, so neither the bash trap nor the workflow teardown completes; a platform crash mid-provision can likewise orphan a half-started container. The leaked ws-<uuid> then runs forever and pegs CPU on the shared docker-host runner.

Evidence: 13 orphans on the retired ded-1 box (2-3 days old), 11+3 on the production robots — cleaned manually. This recurs on every failed/timed-out run, and accumulated orphans exhausting the runner host is the likely root cause of the advisory-lane intermittent reds (#2693 / #2680 / #2739).

Fix - two layers

1. Per-job run-scoped teardown (both `lifecycle-stub` + `lifecycle-real`)

A new baseline snapshot step (before work starts) records the ws-* container IDs that already exist (docker ps -aq --filter name=^ws- to a per-run_id/run_attempt tmp file).
A new if: always() teardown step (right after the existing sidecar teardown) removes only the ws-* containers not in that baseline - i.e. the ones this run created.
Run-scoped on purpose: diffing against the baseline means a concurrent run's in-flight workspace on the same shared docker-host is never touched. (lifecycle-real is already needs: lifecycle-stub-serialized; cross-SHA concurrent runs are the remaining case the baseline-diff protects.)
Mirrors the existing Stop service containers style and the script's own "scoped teardown, never a blanket sweep" comment.

2. Standing orphan-sweeper - `.gitea/workflows/sweep-stale-ws-orphans.yml`

Belt-and-braces second layer for the case the per-job always() step itself can't run (runner container SIGKILLed before it executes):

Runs on docker-host (where the orphans live + where docker.sock/molecule-core-net are - same substrate constraint as the e2e lane and handlers-postgres-integration.yml; a bare ubuntu-latest Windows act_runner would inspect the wrong daemon).
Hourly cron (17 * * * *) + workflow_dispatch.
Age-guarded: removes ws-* / molecule-local/* workspace containers older than WS_MAX_AGE_HOURS (2h) - well beyond the ~30 min max run (lifecycle-real timeout-minutes: 30), so an in-flight run is never touched. Uses StartedAt (falling back to Created for never-started containers).
Safety-capped (SAFETY_CAP=100, bail loud on a suspiciously large batch), DRY_RUN escape hatch, fail-loud notify step - all mirroring sweep-stale-e2e-orgs.yml. Pinned permissions: contents: read, timeout-minutes: 10, queued concurrency.

Verification

Both files yaml.safe_load clean.
docker ps --filter name=^ws- anchor behavior verified on the operator daemon (matches ws-..., ignores the container's leading /; safer than the substring form).
The sweeper adds no third-party actions (pure shell on the runner). The e2e edits add no new uses:.

Notes / flags

The per-job teardown's baseline-diff is the run-scoping mechanism and does not rely on the e2e script exporting workspace IDs (it doesn't) - keeping the workflow decoupled from the script internals.
PR only - not merging. lifecycle-stub runs gating-locally (continue-on-error: false) but is not yet wired into branch protection (# bp-required: pending #2409); lifecycle-real is advisory.

Closes #2883. Likely mitigates #2693 / #2680 / #2739.

🤖 Generated with Claude Code

## The leak (#2883) `.gitea/workflows/local-provision-e2e.yml` provisions stub/real `ws-<uuid>` workspace **sibling containers** (image `molecule-local/workspace-template-*` / `stub-runtime`) from the host platform binary via the runner's `docker.sock`. Its `always()` teardown only `docker rm -f`'d the **Postgres/Redis sidecars** — it never removed the `ws-<uuid>` workspace container(s) the run created. The in-test `cleanup()` EXIT trap in `tests/e2e/test_local_provision_lifecycle_e2e.sh` *does* delete the workspace it created — but only when the trap actually fires. On a **cancelled / timed-out job** act_runner SIGKILLs the job container, so neither the bash trap nor the workflow teardown completes; a **platform crash mid-provision** can likewise orphan a half-started container. The leaked `ws-<uuid>` then runs **forever** and pegs CPU on the shared `docker-host` runner. Evidence: **13 orphans** on the retired ded-1 box (2-3 days old), **11+3** on the production robots — cleaned manually. This recurs on every failed/timed-out run, and accumulated orphans exhausting the runner host is the **likely root cause of the advisory-lane intermittent reds** (#2693 / #2680 / #2739). ## Fix - two layers ### 1. Per-job run-scoped teardown (both `lifecycle-stub` + `lifecycle-real`) - A new **baseline snapshot** step (before work starts) records the `ws-*` container IDs that already exist (`docker ps -aq --filter name=^ws-` to a per-run_id/run_attempt tmp file). - A new **`if: always()` teardown** step (right after the existing sidecar teardown) removes only the `ws-*` containers **not** in that baseline - i.e. the ones *this run* created. - **Run-scoped on purpose:** diffing against the baseline means a **concurrent run's in-flight workspace** on the same shared `docker-host` is never touched. (`lifecycle-real` is already `needs: lifecycle-stub`-serialized; cross-SHA concurrent runs are the remaining case the baseline-diff protects.) - Mirrors the existing `Stop service containers` style and the script's own "scoped teardown, never a blanket sweep" comment. ### 2. Standing orphan-sweeper - `.gitea/workflows/sweep-stale-ws-orphans.yml` Belt-and-braces second layer for the case the per-job `always()` step itself can't run (runner container SIGKILLed before it executes): - Runs on **`docker-host`** (where the orphans live + where `docker.sock`/`molecule-core-net` are - same substrate constraint as the e2e lane and `handlers-postgres-integration.yml`; a bare `ubuntu-latest` Windows act_runner would inspect the wrong daemon). - **Hourly cron** (`17 * * * *`) + **`workflow_dispatch`**. - **Age-guarded:** removes `ws-*` / `molecule-local/*` workspace containers **older than `WS_MAX_AGE_HOURS` (2h)** - well beyond the ~30 min max run (`lifecycle-real` `timeout-minutes: 30`), so an **in-flight run is never touched**. Uses `StartedAt` (falling back to `Created` for never-started containers). - **Safety-capped** (`SAFETY_CAP=100`, bail loud on a suspiciously large batch), **`DRY_RUN`** escape hatch, **fail-loud** notify step - all mirroring `sweep-stale-e2e-orgs.yml`. Pinned `permissions: contents: read`, `timeout-minutes: 10`, queued `concurrency`. ## Verification - Both files `yaml.safe_load` clean. - `docker ps --filter name=^ws-` anchor behavior verified on the operator daemon (matches `ws-...`, ignores the container's leading `/`; safer than the substring form). - The sweeper adds **no third-party actions** (pure shell on the runner). The e2e edits add no new `uses:`. ## Notes / flags - The per-job teardown's baseline-diff is the run-scoping mechanism and does **not** rely on the e2e script exporting workspace IDs (it doesn't) - keeping the workflow decoupled from the script internals. - **PR only - not merging.** `lifecycle-stub` runs gating-locally (`continue-on-error: false`) but is not yet wired into branch protection (`# bp-required: pending #2409`); `lifecycle-real` is advisory. Closes #2883. Likely mitigates #2693 / #2680 / #2739. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

devops-engineer added 1 commit 2026-06-14 23:00:29 +00:00

fix(e2e): teardown leaked ws- workspace containers + standing orphan-sweeper (#2883 )

Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s

Details

CI / Python Lint & Test (pull_request) Successful in 5s

Details

Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 6s

Details

Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s

Details

Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s

Details

sop-checklist / review-refire (pull_request_target) Has been skipped

Details

lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 7s

Details

Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s

Details

sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2

Details

reserved-path-review / reserved-path-review (pull_request_target) Failing after 8s

Details

sop-checklist / na-declarations (pull_request) N/A: (none)

Details

lint-no-coe-on-required / lint-no-coe-on-required (pull_request) Successful in 17s

Details

sop-checklist / all-items-acked (pull_request_target) Successful in 9s

Details

gate-check-v3 / gate-check (pull_request_target) Successful in 13s

Details

E2E API Smoke Test / detect-changes (pull_request) Successful in 21s

Details

lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 16s

Details

lint-setup-go-cache / lint-setup-go-cache (pull_request) Successful in 15s

Details

Handlers Postgres Integration / detect-changes (pull_request) Successful in 21s

Details

E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 23s

Details

E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped

Details

E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s

Details

Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s

Details

Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 20s

Details

E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 4s

Details

E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 33s

Details

lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 29s

Details

E2E Chat / detect-changes (pull_request) Successful in 35s

Details

lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 34s

Details

CI / Detect changes (pull_request) Successful in 39s

Details

Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 35s

Details

Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 31s

Details

E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s

Details

E2E Chat / E2E Chat (pull_request) Successful in 4s

Details

CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s

Details

CI / Platform (Go) (pull_request) Successful in 3s

Details

CI / Canvas (Next.js) (pull_request) Successful in 3s

Details

CI / Canvas Deploy Status (pull_request) Successful in 1s

Details

CI / all-required (pull_request) Successful in 3s

Details

Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 34s

Details

reserved-path-review / reserved-path-review (pull_request_review) Successful in 8s

Details

qa-review / approved (pull_request_target) Approved via pull_request_review trigger

security-review / approved (pull_request_target) Approved via pull_request_review trigger

qa-review / approved (pull_request_review) Successful in 10s

Details

security-review / approved (pull_request_review) Successful in 9s

Details

audit-force-merge / audit (pull_request_target) Successful in 11s

Details

0d7cd8fd0f

local-provision-e2e.yml provisions ws-<uuid> workspace containers via the
host platform binary's docker.sock, but its always() teardown only removed
the Postgres/Redis sidecars — never the ws-<uuid> container(s) it created.
On a cancelled/timed-out job (act_runner SIGKILLs the job container so the
e2e script's EXIT trap never fires) or a platform crash mid-provision, the
ws-<uuid> container leaks and runs forever, pegging CPU on the shared
docker-host runner. 13 orphans were found on the retired ded-1 box (2-3 days
old) and 11+3 on the prod robots (cleaned manually). Accumulated orphans
exhausting the runner host is the likely root cause of the advisory-lane
intermittent reds.

Two layers:
  1. Per-job run-scoped teardown (both lifecycle-stub + lifecycle-real):
     snapshot pre-existing ws-* containers at job start, then in an always()
     step remove ONLY ws-* containers NOT in that baseline (i.e. the ones
     this run created). Run-scoped so a concurrent run's in-flight workspace
     on the same shared host is never disrupted.
  2. Standing janitor sweep-stale-ws-orphans.yml on docker-host, hourly cron
     + workflow_dispatch: age-guarded (>2h, well beyond the ~30m max run)
     removal of ws-* / molecule-local workspace containers. Belt-and-braces
     for the case where the runner container itself is SIGKILLed before the
     per-job always() step can run. Safety-capped, fail-loud (mirrors
     sweep-stale-e2e-orgs.yml).

Closes #2883. Likely mitigates #2693 / #2680 / #2739.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

agent-reviewer-cr2 approved these changes 2026-06-15 07:02:52 +00:00

agent-reviewer-cr2 left a comment

5-axis review — APPROVE. head 0d7cd8f (#2883)

Reviewed carefully because it adds a standing destructive janitor (docker rm -f hourly on the shared docker-host). The guardrails are sound.

Correctness ✓ — Two layers: (1) per-run always() teardown that snapshots a pre-job ws-* baseline and removes only containers NOT in it; (2) a standing hourly sweeper for what a SIGKILLed/cancelled job leaks. Addresses the real #2883 leak (forever-running ws-<uuid> pegging CPU).
Robustness / Security (destructive-safety) ✓ — The sweeper is well-gated: anchored name=^ws- + ancestor=molecule-local/* filter (won't touch pg/redis sidecars, platform, or unrelated containers); 2h age floor vs the 30-min max run (in-flight runs never caught); StartedAt→Created fallback; unparseable-timestamp rows skipped (fail-safe); a 100-container SAFETY_CAP that bails on clock-skew/enumeration anomalies; DRY_RUN hatch; daemon-reachability precheck; concurrency group; pinned to docker-host; permissions: contents: read; fail-loud (continue-on-error: false + failure notice). This is exactly the defensive shape this kind of janitor needs.
Performance ✓ — Hourly cron, bounded enumeration; negligible.
Readability ✓ — Excellent: every guard and the substrate/age rationale is documented inline.

Non-blocking note (per-run teardown concurrency caveat): the baseline-diff guarantees "a concurrent run's workspace is untouched" only for containers that existed before this job's baseline snapshot. The workflow's concurrency group is per-SHA (local-provision-e2e-${{ head.sha }}), so two different PRs can run this lane concurrently on the same docker-host; a concurrent run that creates its ws- container after this job's baseline but before its teardown would be removed by this job. Reachable only if the runner parallelizes these jobs (the two in-workflow jobs also each take independent baselines). Worst case is a rare flaky e2e red, not infra damage — and it's strictly better than the current forever-leak. If you want to close it fully, scope the per-run teardown by a run-id container label (label ws- containers at provision time, delete by that label) instead of a timing baseline. The standing sweeper is unaffected by this (age-guarded).

Routine CI-hygiene fix, destructive paths well-guarded → approving. CI is green except the qa-review/approved ceremony (this review) + a pending E2E context.

**5-axis review — APPROVE.** head `0d7cd8f` (#2883) Reviewed carefully because it adds a *standing destructive* janitor (`docker rm -f` hourly on the shared `docker-host`). The guardrails are sound. - **Correctness ✓** — Two layers: (1) per-run `always()` teardown that snapshots a pre-job `ws-*` baseline and removes only containers NOT in it; (2) a standing hourly sweeper for what a SIGKILLed/cancelled job leaks. Addresses the real #2883 leak (forever-running `ws-<uuid>` pegging CPU). - **Robustness / Security (destructive-safety) ✓** — The sweeper is well-gated: anchored `name=^ws-` + `ancestor=molecule-local/*` filter (won't touch pg/redis sidecars, platform, or unrelated containers); **2h age floor** vs the 30-min max run (in-flight runs never caught); StartedAt→Created fallback; unparseable-timestamp rows skipped (fail-safe); a **100-container SAFETY_CAP** that bails on clock-skew/enumeration anomalies; `DRY_RUN` hatch; daemon-reachability precheck; `concurrency` group; pinned to `docker-host`; `permissions: contents: read`; fail-loud (`continue-on-error: false` + failure notice). This is exactly the defensive shape this kind of janitor needs. - **Performance ✓** — Hourly cron, bounded enumeration; negligible. - **Readability ✓** — Excellent: every guard and the substrate/age rationale is documented inline. **Non-blocking note (per-run teardown concurrency caveat):** the baseline-diff guarantees "a concurrent run's workspace is untouched" only for containers that existed *before* this job's baseline snapshot. The workflow's `concurrency` group is **per-SHA** (`local-provision-e2e-${{ head.sha }}`), so two different PRs can run this lane concurrently on the same `docker-host`; a concurrent run that *creates* its `ws-` container after this job's baseline but before its teardown would be removed by this job. Reachable only if the runner parallelizes these jobs (the two in-workflow jobs also each take independent baselines). Worst case is a rare flaky e2e red, not infra damage — and it's strictly better than the current forever-leak. If you want to close it fully, scope the per-run teardown by a run-id container **label** (label `ws-` containers at provision time, delete by that label) instead of a timing baseline. The standing sweeper is unaffected by this (age-guarded). Routine CI-hygiene fix, destructive paths well-guarded → approving. CI is green except the `qa-review/approved` ceremony (this review) + a pending E2E context.

devops-engineer merged commit 73769188ec into main

2026-06-15 07:03:07 +00:00

agent-reviewer-cr2 referenced this pull request

2026-06-15 07:07:23 +00:00

ci(local-provision): per-run teardown + cron sweeper for ws-* orphans #2902

agent-researcher referenced this pull request

2026-06-15 10:28:50 +00:00

CUSTOMER-CRITICAL: staging E2E Platform-Boot still red — #2917-class A2A agent-origin 503 self-triggers container restart at Step 8 (recurs after #2917 closed) #2929

agent-researcher referenced this pull request

2026-06-15 10:37:46 +00:00

RFC/RCA: staging-SaaS E2E red = A2A known-answer 502/queue-never-drains at Step 8 (not provisioning); + harness exit-trap quoting bug #2917

Sign in to join this conversation.