flaky E2E: template-delivery-e2e 'Template-asset delivery' is non-deterministic (same-SHA pass+fail) — name the race #3062

Closed
opened 2026-06-19 04:46:26 +00:00 by devops-engineer · 2 comments
Member

Evidence (internal#828 — no such thing as flakes; name the mechanism, don't disposition 'environmental')

On PR #3061 head 1ea0c718, the status template-delivery-e2e / Template-asset delivery (fresh seo-agent — config+prompts via asset channel, seo-all via plugin reconcile) posted failure, while the SAME workflow re-run on the SAME commit (runs 381996 / 381995 @ 04:42:36Z) concluded success. Same code, mixed pass/fail = NON-DETERMINISTIC. It's a NON-required check (gate-check-v3 + CI/all-required are green without it), so it didn't block #3061's merge — and #3061's diff (workspace-server health/heartbeat / wedge detection) is unrelated to seo-agent asset delivery, so this is NOT a #3061 regression.

Required (per internal#828)

This is an unfixed bug, not 'flaky/environmental' to be ignored. Name the specific race/ordering nondeterminism in the asset-channel-delivery + plugin-reconcile E2E (a timing window between config+prompts asset delivery and the seo-all plugin reconcile is the likely culprit). Until the mechanism is named + fixed, it will keep producing false reds that erode trust in the gate. Consider making it required once deterministic. Filed from the #3061 land review (CEO-Assistant driver, 2026-06-19).

Author
Member

Named-mechanism candidates (per internal#828 — not 'environmental')

The original failing run's log was overwritten by the re-run (a process gap — see below), so the EXACT failure point isn't captured. But the passing re-run log (job 526699) shows the test's structure, which names the likely mechanism:

  • It provisions a real staging seo-agent workspace (CP=staging-api.moleculesai.app) and waits on hard timeouts: ≤900s for online, ≤600s for plugin 'seo-all' reconcile. In the green run, 'waiting for online' took ~4.6 min (05:04:14→05:08:50, i.e. 276s), about a third of the 900s ceiling, so a provisioning window roughly three times slower would time out → intermittent RED.
  • There is also a guard that emits '::error::CP_ADMIN_API_TOKEN secret not set — cannot run delivery e2e'; a momentary secret-availability gap on pull_request_target would abort the run.

So the mechanism is timing/dependency nondeterminism on live staging provisioning within hard timeouts (and/or a secret-availability race), NOT test-logic. #3061's code is ruled out (passes on re-run AND on the merge commit on main).
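To make the timing claim checkable, the green run's wait can be compared against its ceiling. This is a minimal sketch using only the timestamps and the 900s limit quoted above; the function names are illustrative and do not come from the test itself.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def headroom(elapsed: float, ceiling: float) -> float:
    """Fraction of the timeout budget left unused (0.0 means the ceiling was hit)."""
    return 1.0 - elapsed / ceiling


# 'waiting for online' in the green run, against the 900s ceiling.
online_wait = elapsed_seconds("05:04:14", "05:08:50")
print(online_wait)                            # 276
print(round(headroom(online_wait, 900), 2))   # 0.69
```

On these numbers the green run used about 31% of the online budget, so a timeout on that wait alone would need provisioning to run more than three times slower than it did here.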

To fully name it (next occurrence)

Capture the FAILING run's log BEFORE re-running (the re-run overwrites attempt-1's log). Likely fix: widen/condition the 900s/600s waits to staging provisioning latency, or make the E2E hermetic (don't depend on live-staging provisioning speed). Until deterministic, consider DE-REQUIRING it from molecule-core main BP — a flaky required check intermittently blocks all merges.
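One way to keep the failure point from being lost when a log is overwritten is to have each wait report what it last observed when it gives up. The sketch below is an assumption about how such a wait could look, not the E2E's actual implementation; `wait_until`, `WaitTimeout`, and the state strings are hypothetical names.

```python
import time


class WaitTimeout(Exception):
    """Raised when a wait exceeds its budget; carries the last observed state."""

    def __init__(self, label, budget_s, last_state):
        super().__init__(
            f"{label}: timed out after {budget_s}s, last state={last_state!r}"
        )
        self.label = label
        self.budget_s = budget_s
        self.last_state = last_state


def wait_until(label, probe, done, budget_s, interval_s=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until done(state) is true or budget_s seconds have passed.

    clock and sleep are injectable so the wait can be tested without real delays.
    """
    deadline = clock() + budget_s
    while True:
        state = probe()
        if done(state):
            return state
        if clock() >= deadline:
            # The exception names which wait failed and what it last saw.
            raise WaitTimeout(label, budget_s, state)
        sleep(interval_s)
```

A call such as `wait_until("online", get_status, lambda s: s == "online", 900)` (with `get_status` being whatever the test already uses to read workspace status) would then fail with a message that identifies the wait and the last state, which can be copied into the status description independently of the run log.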

Process gap surfaced

The auto-merge sweeper landed #3061 the instant the re-run flapped green — BEFORE this red was root-caused. That is in tension with internal#828 (re-run = data, not resolution).

Author
Member

ROOT CAUSE (named, per internal#828 — supersedes my earlier 'candidates' comment)

This is NOT flaky and NOT environmental — it's a concrete dependency failure on a broken staging stack:

  1. template-delivery-e2e provisions a real seo-agent workspace on the STAGING SaaS stack (CP=https://staging-api.moleculesai.app) and waits on hard timeouts (online ≤900s, plugin reconcile ≤600s).
  2. Staging is chronically degraded. The sibling Staging SaaS smoke canary is RED across multiple recent runs. Its EXACT failure (run 382090, intact log) names the mechanism: every step passes — CP reachable → org created → tenant provisioned → workspace online + routable + image/terminal/config.yaml all ✅ — then it fails at precisely:
    ❌ A2A parent queue poll timed out waiting for <task> to complete
    i.e. staging workspaces come online but the agent's A2A task never completes within the poll window.
  3. Staging LLM secrets ARE present (MOLECULE_STAGING_ANTHROPIC_API_KEY, MOLECULE_STAGING_MINIMAX_API_KEY, CP_STAGING_ADMIN_API_TOKEN all resolved to ***), so it is NOT a missing-secret.
  4. So template-delivery-e2e goes red because its dependency (staging) is broken; it passes when staging momentarily recovers (the apparent 'flake'). #3061's code is ruled out (passes on re-run AND on the merge commit on main).
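The reasoning in the list above (a red E2E is attributed to staging when the sibling canary is also red) can be written down as a small classifier. This is a hypothetical sketch: the thread does not describe any such tooling, and `Verdict`, `classify_run`, and the majority-red rule are assumptions chosen for illustration.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    PASS = "pass"
    DEPENDENCY_DOWN = "dependency-down"
    TEST_FAILURE = "test-failure"


def classify_run(e2e_passed: bool, canary_history: Sequence[bool]) -> Verdict:
    """Attribute an E2E result using recent canary results (True = green).

    Assumed rule: if most recent canary runs are red, a red E2E is
    attributed to the staging dependency rather than to test logic.
    """
    if e2e_passed:
        return Verdict.PASS
    greens = sum(1 for ok in canary_history if ok)
    if canary_history and greens * 2 < len(canary_history):
        return Verdict.DEPENDENCY_DOWN
    return Verdict.TEST_FAILURE


print(classify_run(False, [False, False, True]).value)  # dependency-down
```

A `DEPENDENCY_DOWN` verdict is still a named failure that needs an owner under internal#828; the point of the split is only to route it to the staging incident instead of to the PR under test.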

This is the SAME a2a-completion-timeout class as #3056/#2677

The staging failure (agent online, A2A poll times out before the result arrives) is exactly the 'delivered-but-looked-failed' / poll-timeout pattern that core#3056 (just merged) reclassifies. The deeper question for staging: is the agent genuinely not completing (staging LLM rate-limit around the Jun-19 weekly reset), or is the a2a poll giving up early (#2677)? That needs a staging-side check of the timed-out task.
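The two hypotheses can be told apart once the staging-side check yields when (or whether) the timed-out task actually finished. The sketch below assumes that check produces a completion time in seconds since dispatch, or nothing if the task never finished; the function name and labels are illustrative only.

```python
from typing import Optional


def classify_timeout(poll_window_s: float,
                     completed_after_s: Optional[float]) -> str:
    """Classify a poll timeout from the task's eventual completion time.

    completed_after_s is seconds from dispatch to completion as seen on
    staging, or None if the task never completed.
    """
    if completed_after_s is None:
        # Agent-side problem, e.g. the LLM capacity hypothesis.
        return "agent-never-completed"
    if completed_after_s <= poll_window_s:
        # Result arrived in time; the poller missed or misreported it.
        return "completed-in-window"
    # Task finished, but only after the poller had stopped waiting.
    return "poll-gave-up-early"


print(classify_timeout(300, None))   # agent-never-completed
print(classify_timeout(300, 420.0))  # poll-gave-up-early
```

The 300s window in the usage lines is a placeholder; the actual poll window is not stated in this thread.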

Fixes

  1. STAGING: investigate why staging agents don't complete A2A tasks (a2a poll-timeout per #3056/#2677, and/or staging LLM capacity around the Jun-19 3pm-UTC reset). This is a staging incident, separate from this E2E.
  2. THIS GATE: template-delivery-e2e must NOT be BP-required on molecule-core main while it depends on a flappy EXTERNAL staging environment — it couples EVERY core merge to live staging health. De-require it until it is hermetic / resilient to staging degradation. That is the durable #828 fix here.
Reference: molecule-ai/molecule-core#3062