ci(canary): bump timeout-minutes 12 → 20 to absorb apt tail latency

Today's 4 cancelled canaries (25319625186 / 25320942822 / 25321618230 /
25322499952) were all blown by the workflow timeout despite the
underlying tenant boot completing successfully (PR
molecule-controlplane#455 fix verified: boot events all reach
`boot_script_finished/ok`).

Why the budget was wrong:

The tenant user-data install phase runs apt-get update + install of
docker.io / jq / awscli / caddy / amazon-ssm-agent FROM RAW UBUNTU on
every tenant boot; none of it is pre-baked into the tenant AMI
(EC2_AMI=ami-0ea3c35c5c3284d82, raw Jammy 22.04).

Empirical fetch_secrets/ok timing across today's canaries:

     51s  debug-mm-1777888039  (09:47Z)
     82s  25319625186          (12:42Z)
    143s  25320942822          (13:11Z)
    625s  25322499952          (13:43Z)

Same EC2_AMI, same instance type (t3.small), same user-data install
sequence; variance is entirely apt-mirror tail latency.

A 12-min job budget leaves only ~2 min for the workspace on slow-apt
days; the workspace itself needs ~3.5 min for claude-code cold boot, so
the budget is structurally too tight whenever apt is slow.

20 min absorbs even the 10+ min boot worst-case and still leaves the
workspace its full ~7 min budget. Cap stays well under the runner's
6-hour ubuntu-latest job ceiling.

Real fix: pre-bake caddy + ssm-agent into the tenant AMI so the boot
phase is no-ops on cached pkgs (will file controlplane#TBD as follow-up;
packer/install-base.sh today only bakes the WORKSPACE thin AMI, not the
tenant AMI, so tenants always boot from raw Ubuntu).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
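
The headroom arithmetic above can be checked with a minimal sketch. It makes one simplifying assumption that is not taken from the workflow: the time to `fetch_secrets/ok` stands in for the whole tenant boot cost. It ignores runner setup and every other job phase, so it only reproduces the slow-apt case and does not by itself explain each cancelled run. The names `headroom_s` and `fits` are hypothetical helpers.

```python
# Headroom left for the workspace after tenant boot, per observed canary.
# Assumption (not from the workflow): time-to-fetch_secrets/ok is used as
# the whole tenant boot cost; other job phases are ignored.

FETCH_SECRETS_OK_S = {
    "debug-mm-1777888039": 51,
    "25319625186": 82,
    "25320942822": 143,
    "25322499952": 625,
}
WORKSPACE_COLD_BOOT_S = 210  # ~3.5 min claude-code cold boot


def headroom_s(timeout_minutes: int, boot_s: int) -> int:
    """Seconds left in the job budget once the tenant boot has finished."""
    return timeout_minutes * 60 - boot_s


def fits(timeout_minutes: int, boot_s: int) -> bool:
    """True if the remaining budget covers a workspace cold boot."""
    return headroom_s(timeout_minutes, boot_s) >= WORKSPACE_COLD_BOOT_S


if __name__ == "__main__":
    for run, boot_s in FETCH_SECRETS_OK_S.items():
        for budget in (12, 20):
            left = headroom_s(budget, boot_s)
            print(f"{run}: {budget} min budget -> {left / 60:.1f} min left, "
                  f"fits={fits(budget, boot_s)}")
```

For the 625s run, a 12-min budget leaves 95s (about 1.6 min), which is below the ~3.5 min cold boot; a 20-min budget leaves 575s (about 9.6 min).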
2026-05-04 07:02:12 -07:00
commit ff1003e5f6
parent f52de74b7b
1 changed file with 12 additions and 1 deletion
--- a/.github/workflows/continuous-synth-e2e.yml
+++ b/.github/workflows/continuous-synth-e2e.yml
@@ -93,7 +93,18 @@ jobs:
  synth:
    name: Synthetic E2E against staging
    runs-on: ubuntu-latest
-    timeout-minutes: 12
+    # Bumped from 12 → 20 (2026-05-04). Tenant user-data install phase
+    # (apt-get update + install docker.io/jq/awscli/caddy + snap install
+    # ssm-agent) runs from raw Ubuntu on every boot — none of it is
+    # pre-baked into the tenant AMI. Empirical fetch_secrets/ok timing
+    # across today's canaries: 51s → 82s → 143s → 625s. apt-mirror tail
+    # latency drives the boot-to-fetch_secrets phase from ~1min to >10min.
+    # A 12min budget leaves only ~2min for the workspace (which needs
+    # ~3.5min for claude-code cold boot) on slow-apt days, blowing the
+    # budget. 20min absorbs the worst tenant tail so the workspace probe
+    # gets the full ~7min it needs even on a slow apt day. Real fix:
+    # pre-bake caddy + ssm-agent into the tenant AMI (controlplane#TBD).
+    timeout-minutes: 20
    env:
      # claude-code default: cold-start ~5 min (comparable to langgraph),
      # but uses MiniMax-M2.7-highspeed via the template's third-party-