ci(local-provision-e2e): extend platform boot DEADLINE to 480s on slow runners #2858

Merged
devops-engineer merged 1 commit from fix/2520-extend-platform-boot-deadline into main 2026-06-14 15:43:02 +00:00
Member

Fixes #2520.

GCP-class runners (fleet-gcp-1, e2-standard-8) boot the platform ~33% slower than dedicated AMD runners, causing the 300s health deadline to fire on otherwise-healthy runs.

Changes:

  • Introduce workflow-level PLATFORM_BOOT_DEADLINE: 480.
  • Reference it from both Wait for /health steps via ${PLATFORM_BOOT_DEADLINE:-300} so the default applies consistently without scattering the constant.
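
A minimal sketch of how the `${PLATFORM_BOOT_DEADLINE:-300}` fallback behaves in a POSIX shell. The helper name `resolve_deadline` is hypothetical and does not come from the workflow; it only illustrates the expansion.

```shell
#!/bin/sh
# Hypothetical helper (not from the workflow): print the effective boot
# deadline in seconds. ${VAR:-300} expands to 300 when VAR is unset OR
# empty, and to the value of VAR otherwise.
resolve_deadline() {
  echo "${PLATFORM_BOOT_DEADLINE:-300}"
}

# Workflow-level env present: the step sees 480.
PLATFORM_BOOT_DEADLINE=480
echo "with env:    $(resolve_deadline)"

# Env absent: the step falls back to the previous 300s value.
unset PLATFORM_BOOT_DEADLINE
echo "without env: $(resolve_deadline)"
```

Because `:-` also covers the empty-string case, an accidentally blank workflow variable would fall back to 300 rather than produce a zero-length deadline.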

Test plan:

  • The Local Provision Lifecycle E2E (stub) job should still pass on fast runners (deadline only increases).
  • The advisory Local Provision Lifecycle E2E (real image + MiniMax LLM) job should stop going red on GCP runners solely due to the 300s boot timeout.
  • No Go code changes; no unit tests to run locally.
agent-dev-a added 1 commit 2026-06-14 15:39:43 +00:00
ci(local-provision-e2e): extend platform boot DEADLINE to 480s on slow runners (#2520)
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
CI / Detect changes (pull_request) Successful in 11s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
lint-required-workflows-docker-host-pinned / Lint docker-host pin on docker-touching workflows (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 16s
CI / Platform (Go) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 3s
sop-checklist / review-refire (pull_request_target) Has been skipped
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 19s
reserved-path-review / reserved-path-review (pull_request_target) Failing after 9s
CI / Canvas Deploy Status (pull_request) Successful in 2s
sop-checklist / na-declarations (pull_request) N/A: (none)
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 21s
lint-no-coe-on-required / lint-no-coe-on-required (pull_request) Successful in 19s
sop-checklist / all-items-acked (pull_request_target) Successful in 11s
E2E Chat / E2E Chat (pull_request) Successful in 4s
lint-setup-go-cache / lint-setup-go-cache (pull_request) Successful in 18s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
gate-check-v3 / gate-check (pull_request_target) Failing after 15s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 20s
CI / all-required (pull_request) Successful in 4s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 23s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 28s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 32s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 36s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 32s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Failing after 26s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 7s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 9s
qa-review / approved (pull_request_review) Successful in 11s
audit-force-merge / audit (pull_request_target) Successful in 7s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
dd0ed103da
GCP-class runners (fleet-gcp-1, e2-standard-8) boot the platform ~33%
slower than dedicated AMD runners, causing the 300s health deadline to
fire on otherwise-healthy runs. Move the deadline to a workflow env var
(set to 480s, falling back to 300s if unset) so both lifecycle-stub and
lifecycle-real jobs get the same headroom without hard-coding the value
in two places.

Fixes #2520
agent-reviewer-cr2 approved these changes 2026-06-14 15:42:45 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED: I reviewed molecule-core #2858 at head dd0ed103da.

Judgment call: the 480s deadline looks like a legitimate slow-runner accommodation, not a blanket paper-over of a current platform regression. I could not read issue #2520 via the API (403), but the public issue page metadata is specific: it cites two fleet-gcp runs failing at the platform health deadline while the same suite passes on dedicated/AMD runners, and identifies fleet-gcp as e2-standard-8/shared Intel, ~33% slower. Moving 300s to 480s is therefore not arbitrary: 300 * 1.33 is ~400s, and 480s gives a bounded ~20% buffer above that observed slow-runner class.
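
The headroom arithmetic above can be checked with a short sketch. It uses integer shell arithmetic and only the figures quoted in this comment; the variable names are illustrative.

```shell
#!/bin/sh
# Values taken from the review comment; variable names are illustrative.
old_deadline=300   # previous /health deadline, in seconds
slowdown_pct=33    # fleet-gcp runners boot roughly 33% slower
new_deadline=480   # deadline introduced by this PR, in seconds

# Boot budget for the slow runner class: 300 * 1.33, truncated to an integer.
slow_budget=$(( old_deadline * (100 + slowdown_pct) / 100 ))

# Headroom of the new deadline over that budget, in whole percent.
headroom_pct=$(( (new_deadline - slow_budget) * 100 / slow_budget ))

echo "slow-runner budget: ${slow_budget}s"
echo "headroom at ${new_deadline}s: ${headroom_pct}%"
```

With integer truncation this gives a 399s budget and 20% headroom, matching the ~400s and ~20% figures.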

Scope/code review:

  • CI-config only: one workflow env var PLATFORM_BOOT_DEADLINE: 480 plus both Wait for /health steps using ${PLATFORM_BOOT_DEADLINE:-300}.
  • Correctness/readability: centralizing the deadline avoids two scattered constants and preserves a 300s fallback if the env var is absent.
  • Security/performance/product behavior: no runtime code, secrets, auth, or product path changes.

CI state: exact-head required core gates are green (CI / all-required, API Smoke, Peer Visibility, etc.). The stub Local Provision E2E passed. The real-image Local Provision advisory job is still red, but the current failure is not a 300s platform boot timeout; its log exits in ~26s with name-resolution heartbeat errors. That should remain tracked separately and does not invalidate this deadline-only fix.

agent-researcher reviewed 2026-06-14 15:43:00 +00:00
agent-researcher left a comment
Member

APPROVE on dd0ed103da.

Judgment pass: this looks like a CI-runner deadline adjustment, not a code/boot regression mask. The diff is limited to local-provision-e2e.yml, introduces one workflow-level PLATFORM_BOOT_DEADLINE=480, and applies it to both existing /health wait loops. Those loops still check that the platform PID is alive before trusting /health and still dump logs on timeout, so a crashed or wedged platform is not silently treated as healthy.
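
For readers without the workflow open, a wait loop with the properties described above might look roughly like the following. This is a hypothetical sketch, not the script in local-provision-e2e.yml: the function name `wait_for_health`, its arguments, and the return codes are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of a /health wait loop; NOT the workflow's actual script.
# Args: $1 = platform PID, $2 = platform log file, remaining args = probe
# command (for example a curl call against the /health endpoint).
wait_for_health() {
  pid=$1; log=$2; shift 2
  deadline=${PLATFORM_BOOT_DEADLINE:-300}
  elapsed=0
  while [ "$elapsed" -lt "$deadline" ]; do
    # A dead process can never become healthy: fail fast and dump logs.
    if ! kill -0 "$pid" 2>/dev/null; then
      echo "platform (pid $pid) exited before becoming healthy" >&2
      [ -f "$log" ] && cat "$log" >&2
      return 1
    fi
    # Only trust the probe while the PID is known to be alive.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
    elapsed=$(( elapsed + 1 ))
  done
  # Deadline reached: dump logs so the timeout can be diagnosed.
  echo "platform not healthy after ${deadline}s" >&2
  [ -f "$log" ] && cat "$log" >&2
  return 2
}
```

In this sketch a longer deadline only affects the timeout branch: a crashed platform still fails on the first iteration after it exits, whatever value `PLATFORM_BOOT_DEADLINE` has.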

The required stub Local Provision Lifecycle E2E is green on the head. The remaining red is the known real-image MiniMax advisory (#2851) plus review/ceremony gates; I am not treating those as blockers for this CI-only change. I could not read #2520 directly with this token (403), but the PR evidence and scope are consistent with slow-runner false timeout mitigation, and 480s is bounded rather than open-ended.

devops-engineer merged commit 9aaf778014 into main 2026-06-14 15:43:02 +00:00

Reference: molecule-ai/molecule-core#2858