fix(ci): kill stale platform-server before binding port

Cancelling or timing out a workflow run leaves the platform-server
process alive, because the "Stop platform" step (line 335) is skipped.
On the next run that leftover process can still hold a port (for
example the default :8080), even though the port is picked with
socket.bind(("", 0)), and platform-server then fails with:

    listen tcp :8080: bind: address already in use

Fix: add a pre-start step that scans /proc for any stale
platform-server processes and kills them before the port probe or
start. A brief 1s sleep lets the kernel release sockets.
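
The scan described above can be sketched as two small helpers. This is
an illustration under assumptions, not the code in this commit:
list_pids_by_comm and cmdline_of are hypothetical names, the proc_root
parameter exists only so the scan can be tried on a fake tree, and the
comm comparison is an exact match (stricter than the grep substring
match used in the workflow step).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of the workflow.

# Print the PIDs under proc_root whose comm (executable name) equals name.
# "platform-server" is 15 characters, so it fits the kernel's comm limit.
list_pids_by_comm() {
  local proc_root="${1:-/proc}" name="${2:-platform-server}"
  local comm_file dir
  for comm_file in "$proc_root"/[0-9]*/comm; do
    [ -r "$comm_file" ] || continue   # process exited, or glob did not match
    if [ "$(cat "$comm_file" 2>/dev/null)" = "$name" ]; then
      dir="${comm_file%/comm}"        # strip the trailing /comm
      echo "${dir##*/}"               # keep only the PID component
    fi
  done
}

# Print a process's command line with NUL separators turned into spaces.
cmdline_of() {
  local pid="$1" proc_root="${2:-/proc}"
  [ -r "$proc_root/$pid/cmdline" ] || return 1
  tr '\0' ' ' < "$proc_root/$pid/cmdline"
}

# Example (dry run, prints instead of killing):
#   for pid in $(list_pids_by_comm); do
#     echo "would kill ${pid}: $(cmdline_of "$pid")"
#   done
```

Keeping the scan separate from the kill makes a dry run possible before
anything is signalled.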

Uses only standard tools (grep, cut, cat, tr, kill) that are available
on any Ubuntu runner without extra packages. Safe: only targets
platform-server processes and ignores other Go processes.
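
As a possible alternative to the fixed 1s pause mentioned above, the
cleanup could poll until the killed process has actually disappeared.
The sketch below is not what this commit does: wait_for_gone is a
hypothetical helper, the 0.1s poll interval is an arbitrary choice, and
it assumes that once the /proc entry is gone the process no longer holds
its listening socket.

```shell
#!/usr/bin/env bash
# wait_for_gone is a hypothetical helper, not part of the workflow.
# Poll until proc_root/pid disappears, giving up after max_tries checks
# spaced 0.1s apart. Returns 0 if the process is gone, 1 on timeout.
wait_for_gone() {
  local pid="$1" max_tries="${2:-10}" proc_root="${3:-/proc}"
  local i=0
  while [ -d "$proc_root/$pid" ]; do
    i=$((i + 1))
    if [ "$i" -ge "$max_tries" ]; then
      return 1
    fi
    sleep 0.1   # fractional sleep assumes GNU coreutils, as on Ubuntu
  done
  return 0
}

# Example:
#   kill "$pid" 2>/dev/null || true
#   wait_for_gone "$pid" 10 || echo "pid ${pid} still alive after ~1s"
```

A bounded poll returns as soon as the process exits and reports the
case where it did not, which a fixed sleep cannot do.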

Refs: internal#374, issue #1046

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Molecule AI · infra-sre 2026-05-14 17:27:31 +00:00
parent 5738f53ee8
commit 446236e6db

@@ -69,6 +69,13 @@ name: E2E API Smoke Test
# 2318) shows Postgres ready in 3s, Redis in 1s, Platform in 1s when
# they DO come up. Timeouts are not the bottleneck; not bumped.
#
# Item #1046 (fixed 2026-05-14): A stale platform-server from a cancelled
# run lingers on :8080 when the "Stop platform" step is skipped (the
# workflow is cancelled before reaching line 335). Added a pre-start
# "Kill stale platform-server" step (line 286) that scans /proc for
# leftover platform-server processes and kills them before the port probe
# or bind. This makes the ephemeral port probe + start sequence deterministic.
#
# Item explicitly NOT fixed here: failing test `Status back online`
# fails because the platform's langgraph workspace template image
# (ghcr.io/molecule-ai/workspace-template-langgraph:latest) returns
@@ -283,6 +290,33 @@ jobs:
          echo "PORT=${PLATFORM_PORT}" >> "$GITHUB_ENV"
          echo "BASE=http://127.0.0.1:${PLATFORM_PORT}" >> "$GITHUB_ENV"
          echo "Platform host port: ${PLATFORM_PORT}"
      - name: Kill stale platform-server before start (issue #1046)
        if: needs.detect-changes.outputs.api == 'true'
        run: |
          # Concurrent runs on the same host-network act_runner can leave a
          # stale platform-server behind from a cancelled or timed-out run.
          # Cancelled runs never reach the "Stop platform" step (line 335),
          # so the old process lingers. Before picking a port or starting,
          # scan /proc for any platform-server process and kill it.
          #
          # "Pick platform port" uses socket.bind(("", 0)), which SHOULD get
          # a free port from the OS, but the OS may hand out a port still
          # in TIME_WAIT from a stale process. Killing stale processes first
          # makes both the port probe AND the bind deterministic.
          #
          # Safe: only kills platform-server; ignores other Go binaries.
          # Uses only standard tools (grep, cut, cat, tr, kill, sleep), all
          # available on any Ubuntu runner without extra packages.
          for pid in $(grep -l 'platform-server' /proc/*/comm 2>/dev/null | cut -d/ -f3); do
            cmdline=$(cat "/proc/${pid}/cmdline" 2>/dev/null | tr '\0' ' ')
            if echo "$cmdline" | grep -q 'platform-server'; then
              echo "Killing stale platform-server pid ${pid}: ${cmdline}"
              kill "$pid" 2>/dev/null || true
            fi
          done
          # Brief pause to let the kernel release sockets
          sleep 1
          echo "Stale platform-server cleanup complete."
      - name: Start platform (background)
        if: needs.detect-changes.outputs.api == 'true'
        working-directory: workspace-server