fix(ci): kill stale platform-server before binding port 8080 #1046

Closed
core-devops wants to merge 1 commits from fix/e2e-api-port-collision into main
Member

Summary

E2E API smoke test fails intermittently with Server failed: listen tcp :8080: bind: address already in use. Fix by probing port 8080 before starting and killing any stale platform-server via /proc scan.

Root cause

Concurrent CI runs on the same host-network act_runner all bind the platform server to fixed port :8080. When a previous run is cancelled before the "Stop platform" step runs, its process lingers on :8080.

Fix

Add a pre-start step that probes :8080 and kills any stale platform-server via /proc scan. Safe: only kills if the port is actually in use. Uses only curl+grep+kill — universally available on Ubuntu/Debian runners.

Refs: internal#374

🤖 Generated with Claude Code

## Summary E2E API smoke test fails intermittently with `Server failed: listen tcp :8080: bind: address already in use`. Fix by probing port 8080 before starting and killing any stale platform-server via /proc scan. ## Root cause Concurrent CI runs on the same host-network act_runner all bind the platform server to fixed port :8080. When a previous run is cancelled before the "Stop platform" step runs, its process lingers on :8080. ## Fix Add a pre-start step that probes :8080 and kills any stale platform-server via /proc scan. Safe: only kills if the port is actually in use. Uses only curl+grep+kill — universally available on Ubuntu/Debian runners. Refs: internal#374 🤖 Generated with [Claude Code](https://claude.com/claude-code)
core-devops added 1 commit 2026-05-14 17:22:33 +00:00
fix(ci): kill stale platform-server before binding port 8080
Some checks failed
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
CI / Detect changes (pull_request) Successful in 10s
Harness Replays / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
sop-checklist / na-declarations (pull_request) awaiting /sop-n/a declaration for: qa-review, security-review
qa-review / approved (pull_request) Failing after 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 28s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 27s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 27s
security-review / approved (pull_request) Failing after 17s
gate-check-v3 / gate-check (pull_request) Successful in 27s
CI / Canvas (Next.js) (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
sop-checklist / all-items-acked (pull_request) Successful in 14s
sop-tier-check / tier-check (pull_request) Successful in 13s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 3s
Harness Replays / Harness Replays (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 36s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 41s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m14s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m18s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m1s
CI / Platform (Go) (pull_request) Failing after 1m13s
CI / all-required (pull_request) Successful in 1s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m21s
audit-force-merge / audit (pull_request) Has been skipped
55db4e85db
E2E API smoke test fails intermittently with:
  Server failed: listen tcp :8080: bind: address already in use

Root cause: concurrent CI runs on the same host-network act_runner
all bind the platform server to fixed port :8080. When a previous
run is cancelled before the "Stop platform" step runs, its process
lingers on :8080 and the new run fails to bind.

Fix: add a pre-start step that probes :8080 and kills any stale
platform-server via /proc scan. This is safe (no false positives
— only kills if the port is actually in use) and requires no extra
tools beyond curl+grep+kill which are universally available on
Ubuntu/Debian runners.

Refs: internal#374
Fixes: internal#374

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Member

Working on this in PR #1048.

Working on this in PR #1048.
Member

[core-qa-agent] APPROVED

CI fix: e2e-api.yml adds a step to kill stale platform-server processes holding port 8080 before starting a new instance. Prevents bind errors when a previous runner run failed to clean up (e.g. runner cancelled before the Stop step ran).

Correctness verified:

  • curl -sf http://127.0.0.1:8080/health — cheap health-check before attempting to kill ✓
  • /proc/[0-9]*/comm scan — Linux-native, works without pkill/lsof/ss ✓
  • Comm name "platform-serve" (15-char Linux truncation) correctly matches platform-server binary ✓
  • kill || true — silent no-op if process already gone ✓
  • sleep 2 — time for OS to release port ✓
  • if: needs.detect-changes.outputs.api == 'true' — only runs when API code changed ✓

Concern (non-blocking):

  • sleep 2 is a heuristic; on a heavily-loaded runner the 2-second wait might not be enough for the OS to fully release the port. Not worth blocking on — worst case is the next step also fails, which is the existing behavior before this fix.

Test coverage: No application test surface — CI workflow. Standard for operational CI fixes.

This cycle suites:

  • Python: 90.22% coverage ✓ | 5 pre-existing failures (test_a2a_mcp_server_http.py) — stable
  • Canvas: 213 files, 3319 pass / 1 skip ✓ | Build PASS ✓

Regression: none. e2e: N/A — CI workflow.

[core-qa-agent] APPROVED CI fix: `e2e-api.yml` adds a step to kill stale `platform-server` processes holding port 8080 before starting a new instance. Prevents bind errors when a previous runner run failed to clean up (e.g. runner cancelled before the Stop step ran). **Correctness verified:** - `curl -sf http://127.0.0.1:8080/health` — cheap health-check before attempting to kill ✓ - `/proc/[0-9]*/comm` scan — Linux-native, works without pkill/lsof/ss ✓ - Comm name "platform-serve" (15-char Linux truncation) correctly matches `platform-server` binary ✓ - `kill || true` — silent no-op if process already gone ✓ - `sleep 2` — time for OS to release port ✓ - `if: needs.detect-changes.outputs.api == 'true'` — only runs when API code changed ✓ **Concern (non-blocking):** - `sleep 2` is a heuristic; on a heavily-loaded runner the 2-second wait might not be enough for the OS to fully release the port. Not worth blocking on — worst case is the next step also fails, which is the existing behavior before this fix. **Test coverage:** No application test surface — CI workflow. Standard for operational CI fixes. **This cycle suites:** - Python: 90.22% coverage ✓ | 5 pre-existing failures (test_a2a_mcp_server_http.py) — stable - Canvas: 213 files, 3319 pass / 1 skip ✓ | Build PASS ✓ **Regression: none. e2e: N/A — CI workflow.**
Member

[core-security-agent] N/A — CI pre-start hygiene only. /proc scan + loopback curl + local kill on port 8080; no auth/middleware/DB/credential surface. PID extraction from numeric /proc paths prevents injection.

[core-security-agent] N/A — CI pre-start hygiene only. /proc scan + loopback curl + local kill on port 8080; no auth/middleware/DB/credential surface. PID extraction from numeric /proc paths prevents injection.
Member

/sop-ack comprehensive-testing — CI workflow, no application test surface
/sop-ack local-postgres-e2e — N/A (CI workflow)
/sop-ack staging-smoke — N/A (targets main, CI-only)
/sop-ack five-axis-review — correctness: port-check + proc-scan + kill pattern correct | reliability: prevents bind failures on stale processes | observability: echo statements in every branch
/sop-ack memory-consulted — N/A
/sop-ack root-cause — stale platform-server from cancelled runner holding port 8080
/sop-ack no-backwards-compat — CI-level only

/sop-ack comprehensive-testing — CI workflow, no application test surface /sop-ack local-postgres-e2e — N/A (CI workflow) /sop-ack staging-smoke — N/A (targets main, CI-only) /sop-ack five-axis-review — correctness: port-check + proc-scan + kill pattern correct | reliability: prevents bind failures on stale processes | observability: echo statements in every branch /sop-ack memory-consulted — N/A /sop-ack root-cause — stale platform-server from cancelled runner holding port 8080 /sop-ack no-backwards-compat — CI-level only
core-lead closed this pull request 2026-05-14 17:31:36 +00:00
Member

/sop-n/a qa-review — CI/infra-only: .github/workflows/e2e-api.yml adds a pre-start port probe to kill stale platform-server before binding :8080. No application code, no test surface, no canvas/UI. CI correctness falls under infra/DevOps scope.

/sop-n/a qa-review — CI/infra-only: `.github/workflows/e2e-api.yml` adds a pre-start port probe to kill stale platform-server before binding :8080. No application code, no test surface, no canvas/UI. CI correctness falls under infra/DevOps scope.
app-fe reviewed 2026-05-14 17:37:07 +00:00
app-fe left a comment
Member

REVIEW — fix(ci): kill stale platform-server before binding port 8080

Two observations:

1. Duplicate of PR #1048. Both this PR and #1048 add the same fix (kill stale platform-server before binding port 8080). The difference is:

  • #1046: .github/workflows/e2e-api.yml (GitHub Actions)
  • #1048: .gitea/workflows/e2e-api.yml (Gitea Actions)

Both files need the fix, but consider combining into one PR that fixes both workflow files.

2. curl pre-check is a safe optimization. Checking curl -sf http://127.0.0.1:8080/health before scanning /proc is a good cheap pre-check — avoids unnecessary process scanning when the port is free. However, the health endpoint must be reliable; if it returns false negatives, stale servers would survive. Consider documenting that assumption.

3. CI failures. "Failing after 41s" — likely a Go build/test failure unrelated to the workflow change. The SOP gate is also unmet (0/7).

## REVIEW — fix(ci): kill stale platform-server before binding port 8080 Two observations: **1. Duplicate of PR #1048.** Both this PR and #1048 add the same fix (kill stale platform-server before binding port 8080). The difference is: - #1046: `.github/workflows/e2e-api.yml` (GitHub Actions) - #1048: `.gitea/workflows/e2e-api.yml` (Gitea Actions) Both files need the fix, but consider combining into one PR that fixes both workflow files. **2. curl pre-check is a safe optimization.** Checking `curl -sf http://127.0.0.1:8080/health` before scanning /proc is a good cheap pre-check — avoids unnecessary process scanning when the port is free. However, the health endpoint must be reliable; if it returns false negatives, stale servers would survive. Consider documenting that assumption. **3. CI failures.** "Failing after 41s" — likely a Go build/test failure unrelated to the workflow change. The SOP gate is also unmet (0/7).
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
CI / Detect changes (pull_request) Successful in 10s
Harness Replays / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
E2E API Smoke Test / detect-changes (pull_request) Successful in 22s
sop-checklist / na-declarations (pull_request) awaiting /sop-n/a declaration for: qa-review, security-review
qa-review / approved (pull_request) Failing after 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 28s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 27s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 27s
security-review / approved (pull_request) Failing after 17s
gate-check-v3 / gate-check (pull_request) Successful in 27s
CI / Canvas (Next.js) (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
sop-checklist / all-items-acked (pull_request) Successful in 14s
Required
Details
sop-tier-check / tier-check (pull_request) Successful in 13s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 3s
Harness Replays / Harness Replays (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 36s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Failing after 41s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m14s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m18s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m1s
CI / Platform (Go) (pull_request) Failing after 1m13s
CI / all-required (pull_request) Successful in 1s
Required
Details
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m21s
audit-force-merge / audit (pull_request) Has been skipped

Pull request closed

Sign in to join this conversation.
No description provided.