From ff1003e5f6bc9ffc998f831fad64e1f5df4b64a4 Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Mon, 4 May 2026 07:02:12 -0700
Subject: [PATCH] =?UTF-8?q?ci(canary):=20bump=20timeout-minutes=2012=20?=
 =?UTF-8?q?=E2=86=92=2020=20to=20absorb=20apt=20tail=20latency?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 /
25322499952) were all killed by the workflow timeout even though the
underlying tenant boot completed successfully (PR
molecule-controlplane#455 fix verified: boot events all reach
`boot_script_finished/ok`).

Why the budget was wrong:

The tenant user-data install phase runs apt-get update + install of
docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on
every tenant boot; none of it is pre-baked into the tenant AMI
(EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04).

Empirical fetch_secrets/ok timing across today's canaries:

   51s  debug-mm-1777888039  (09:47Z)
   82s  25319625186          (12:42Z)
  143s  25320942822          (13:11Z)
  625s  25322499952          (13:43Z)

Same EC2_AMI, same instance type (t3.small), same user-data install
sequence; the variance is entirely apt-mirror tail latency.

A 12-min job budget leaves only ~2 min for the workspace on slow-apt
days; the workspace itself needs ~3.5 min for claude-code cold boot, so
the budget is structurally too tight whenever apt is slow.

20 min absorbs even the 10+ min worst-case boot and still leaves the
workspace its full ~7 min budget. The cap stays well under the runner's
6-hour ubuntu-latest job ceiling.

Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot
phase becomes a no-op on cached packages (will file controlplane#TBD as
follow-up; packer/install-base.sh today only bakes the WORKSPACE thin
AMI, not the tenant AMI, so tenants always boot from raw Ubuntu).
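
For reviewers who want to check the budget arithmetic, here is a rough
sketch using only the timings quoted above. The names are hypothetical
and not part of the workflow, and it ignores every other step in the
job (checkout, setup, teardown), so real headroom is somewhat smaller.

```python
# Illustrative only: these names are hypothetical, not from the repo.
WORKSPACE_BUDGET_S = 7 * 60  # the "~7 min" workspace budget quoted above

# fetch_secrets/ok timings (seconds) from today's canaries
boot_times_s = [51, 82, 143, 625]


def workspace_headroom_s(timeout_minutes: int, boot_s: int) -> int:
    """Seconds left for the workspace once tenant boot has finished."""
    return timeout_minutes * 60 - boot_s


for timeout in (12, 20):
    worst = workspace_headroom_s(timeout, max(boot_times_s))
    fits = worst >= WORKSPACE_BUDGET_S
    print(f"timeout={timeout}min worst-case headroom={worst}s fits={fits}")
```

With the slowest observed boot (625s), a 12-min timeout leaves 95s and
a 20-min timeout leaves 575s, which matches the "~2 min" and "full
~7 min" figures above.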

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/continuous-synth-e2e.yml | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/.github/workflows/continuous-synth-e2e.yml b/.github/workflows/continuous-synth-e2e.yml
index dff3dfaa..0fc4a20c 100644
--- a/.github/workflows/continuous-synth-e2e.yml
+++ b/.github/workflows/continuous-synth-e2e.yml
@@ -93,7 +93,18 @@ jobs:
   synth:
     name: Synthetic E2E against staging
     runs-on: ubuntu-latest
-    timeout-minutes: 12
+    # Bumped from 12 → 20 (2026-05-04). Tenant user-data install phase
+    # (apt-get update + install docker.io/jq/awscli/caddy + snap install
+    # ssm-agent) runs from raw Ubuntu on every boot; none of it is
+    # pre-baked into the tenant AMI. Empirical fetch_secrets/ok timing
+    # across today's canaries: 51s → 82s → 143s → 625s. apt-mirror tail
+    # latency drives the boot-to-fetch_secrets phase from ~1min to >10min.
+    # A 12min budget leaves only ~2min for the workspace (which needs
+    # ~3.5min for claude-code cold boot) on slow-apt days, blowing the
+    # budget. 20min absorbs the worst tenant tail so the workspace probe
+    # gets the full ~7min it needs even on a slow apt day. Real fix:
+    # pre-bake caddy + ssm-agent into the tenant AMI (controlplane#TBD).
+    timeout-minutes: 20
     env:
       # claude-code default: cold-start ~5 min (comparable to langgraph),
       # but uses MiniMax-M2.7-highspeed via the template's third-party-