ci(local-provision-e2e): extend platform boot DEADLINE to 480s on slow runners #2858
Fixes #2520.
GCP-class runners (fleet-gcp-1, e2-standard-8) boot the platform ~33% slower than dedicated AMD runners, causing the 300s health deadline to fire on otherwise-healthy runs.

Changes:
- Set PLATFORM_BOOT_DEADLINE: 480 at the workflow level.
- Both Wait for /health steps read it via ${PLATFORM_BOOT_DEADLINE:-300}, so the default applies consistently without scattering the constant.

Test plan:
- Local Provision Lifecycle E2E (stub) job should still pass on fast runners (deadline only increases).
- Local Provision Lifecycle E2E (real image + MiniMax LLM) job should stop going red on GCP runners solely due to the 300s boot timeout.

APPROVED: I reviewed molecule-core #2858 at head dd0ed103da.

Judgment call: the 480s deadline looks like a legitimate slow-runner accommodation, not a blanket paper-over of a current platform regression. I could not read issue #2520 via the API (403), but the public issue page metadata is specific: it cites two fleet-gcp runs failing at the platform health deadline while the same suite passes on dedicated/AMD runners, and identifies fleet-gcp as e2-standard-8/shared Intel, ~33% slower. Moving 300s to 480s is therefore not arbitrary: 300 * 1.33 is ~400s, and 480s gives a bounded ~20% buffer above the ~400s expected for that slow-runner class.
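The sizing argument can be checked with a few lines of integer shell arithmetic. This is only a sketch of the reasoning in this review, not something taken from the workflow; the variable names are hypothetical, and 133 stands in for the ~33% slowdown cited for fleet-gcp.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope check of the 480s figure (illustrative names only).
base_deadline=300   # original /health deadline, in seconds
slowdown_pct=133    # slow-runner boot time as a percentage of fast-runner time (assumed from "~33% slower")
new_deadline=480    # deadline proposed in this PR

# Deadline a slow runner would need if it scaled exactly with the slowdown.
scaled=$(( base_deadline * slowdown_pct / 100 ))

# Headroom the new deadline leaves above that scaled value (integer division truncates).
buffer_pct=$(( (new_deadline - scaled) * 100 / scaled ))

echo "scaled=${scaled}s buffer=${buffer_pct}%"
# prints: scaled=399s buffer=20%
```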
Scope/code review:
PLATFORM_BOOT_DEADLINE: 480, plus both Wait for /health steps using ${PLATFORM_BOOT_DEADLINE:-300}.

CI state: exact-head required core gates are green (CI / all-required, API Smoke, Peer Visibility, etc.). The stub Local Provision E2E passed. The real-image Local Provision advisory job is still red, but the current failure is not a 300s platform boot timeout; its log exits in ~26s with name-resolution heartbeat errors. That should remain tracked separately and does not invalidate this deadline-only fix.

APPROVE on dd0ed103da.

Judgment pass: this looks like a CI-runner deadline adjustment, not a code/boot regression mask. The diff is limited to local-provision-e2e.yml, introduces one workflow-level PLATFORM_BOOT_DEADLINE=480, and applies it to both existing /health wait loops. Those loops still check that the platform PID is alive before trusting /health and still dump logs on timeout, so a crashed or wedged platform is not silently treated as healthy.
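For readers without the diff open, a wait loop with the properties described above might look roughly like the following. This is a hedged sketch, not the code in local-provision-e2e.yml: the function name, its arguments, and POLL_INTERVAL are hypothetical, and only PLATFORM_BOOT_DEADLINE with its 300s fallback comes from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a /health wait loop; not copied from the workflow.

# wait_for_health PID LOGFILE PROBE_CMD...
# Returns 0 once PROBE_CMD succeeds; returns 1 if PID dies or the deadline passes.
wait_for_health() {
  local pid="$1" logfile="$2"
  shift 2
  # Fall back to 300s when the workflow-level variable is unset or empty.
  local deadline="${PLATFORM_BOOT_DEADLINE:-300}"
  local interval="${POLL_INTERVAL:-2}"   # illustrative knob, assumed >= 1
  local elapsed=0

  while [ "$elapsed" -lt "$deadline" ]; do
    # Check liveness first: a dead process fails the step regardless of the probe.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "platform process $pid exited before becoming healthy" >&2
      [ -f "$logfile" ] && cat "$logfile" >&2
      return 1
    fi
    if "$@" >/dev/null 2>&1; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  # Deadline reached: dump logs so the timeout is diagnosable.
  echo "platform not healthy within ${deadline}s" >&2
  [ -f "$logfile" ] && cat "$logfile" >&2
  return 1
}
```

A step could then call it as `wait_for_health "$PLATFORM_PID" platform.log curl -fsS http://localhost:8080/health`, where the PID variable, log path, and URL are placeholders. With `PLATFORM_BOOT_DEADLINE: 480` set in the workflow `env`, the loop runs up to 480s; anywhere the variable is absent it keeps the previous 300s behavior.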
The required stub Local Provision Lifecycle E2E is green on the head. The remaining red is the known real-image MiniMax advisory (#2851) plus review/ceremony gates; I am not treating those as blockers for this CI-only change. I could not read #2520 directly with this token (403), but the PR evidence and scope are consistent with slow-runner false timeout mitigation, and 480s is bounded rather than open-ended.