From 6e0eb2ddc9d6a23babe03c6485ad87070e985fc8 Mon Sep 17 00:00:00 2001 From: Hongming Wang Date: Sat, 2 May 2026 02:17:36 -0700 Subject: [PATCH] fix(redeploy-staging): tolerate e2e-* teardown race in fleet HTTP 500 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Recurring failure pattern in redeploy-tenants-on-staging: ##[error]redeploy-fleet returned HTTP 500 ##[error]Process completed with exit code 1. with the per-tenant breakdown in the response body showing the failures were on ephemeral e2e-* tenants (saas/canvas/ext) whose parent E2E run tore them down mid-redeploy — SSM exit=2 because the EC2 was already terminating, or healthz timeout because the CF tunnel was already gone. The actual operator-facing tenants (dryrun-98407, demo-prep, etc) all rolled fine in the same call. This shape repeats every staging push that overlaps an active E2E run. The downstream `Verify each staging tenant /buildinfo matches published SHA` step ALREADY distinguishes STALE vs UNREACHABLE for exactly this reason (per #2402); only the top-level `if HTTP_CODE != 200; exit 1` gate misclassifies the race. Filter: HTTP 500 + every failed slug matches `^e2e-` → soft-warn and fall through to verify. Any non-e2e-* failure or non-500 HTTP remains a hard fail, with the failed non-e2e slugs surfaced in the error so the operator doesn't have to dig the response body out of CI. Verified the gate logic with 6 synthetic CP responses (happy / e2e-only race / mixed real+e2e fail / non-200 / 200+ok=false / all-real-fail) — all behave correctly. prod's redeploy-tenants-on-main is intentionally NOT touched: prod CP serves no e2e-* tenants, so the race can't occur there and the strict gate is the right behavior. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../workflows/redeploy-tenants-on-staging.yml | 40 +++++++++++++++++-- 1 file changed, 36 insertions(+), 4 deletions(-) 
diff --git a/.github/workflows/redeploy-tenants-on-staging.yml b/.github/workflows/redeploy-tenants-on-staging.yml
index 7f191e8d..caaeb56e 100644
--- a/.github/workflows/redeploy-tenants-on-staging.yml
+++ b/.github/workflows/redeploy-tenants-on-staging.yml
@@ -172,12 +172,44 @@ jobs:
 jq -r '.results[]? | "| \(.slug) | \(.phase) | \(.ssm_status // "-") | \(.ssm_exit_code) | \(.healthz_ok) | \(.error // "-") |"' "$HTTP_RESPONSE" || true
 } >> "$GITHUB_STEP_SUMMARY"
- if [ "$HTTP_CODE" != "200" ]; then
+ # Distinguish "real fleet failure" from "E2E teardown race".
+ #
+ # CP returns HTTP 500 + ok=false whenever ANY tenant in the
+ # fleet failed SSM or healthz. In practice the recurring source
+ # of these is ephemeral e2e-* tenants (saas/canvas/ext) being
+ # torn down by their parent E2E run mid-redeploy: the EC2 dies →
+ # SSM exit=2 or healthz timeout → CP marks the fleet failed →
+ # this workflow goes red even though every operator-facing
+ # tenant rolled fine.
+ #
+ # Filter: if HTTP=500/ok=false AND every failed slug matches
+ # ^e2e-, treat as soft-warn and let the verify step downstream
+ # handle the unreachable-vs-stale distinction (it already knows
+ # the difference per #2402). Any non-e2e-* failure or a non-500
+ # HTTP response remains a hard failure.
+ OK=$(jq -r '.ok // "false"' "$HTTP_RESPONSE")
+ FAILED_SLUGS=$(jq -r '
+ .results[]?
+ | select((.healthz_ok != true) or (.ssm_status != "Success"))
+ | .slug' "$HTTP_RESPONSE" 2>/dev/null || true)
+ NON_E2E_FAILED=$(printf '%s\n' "$FAILED_SLUGS" | grep -v '^$' | grep -v '^e2e-' || true)
+
+ if [ "$HTTP_CODE" = "200" ] && [ "$OK" = "true" ]; then
+ : # happy path — fall through to verification
+ elif [ "$HTTP_CODE" = "500" ] && [ -z "$NON_E2E_FAILED" ] && [ -n "$FAILED_SLUGS" ]; then
+ COUNT=$(printf '%s\n' "$FAILED_SLUGS" | grep -c '^e2e-' || true)
+ echo "::warning::redeploy-fleet returned HTTP 500 but every failed tenant ($COUNT) is e2e-* ephemeral — treating as teardown race, soft-warning."
+ printf '%s\n' "$FAILED_SLUGS" | sed 's/^/::warning:: failed: /'
+ elif [ "$HTTP_CODE" != "200" ]; then
 echo "::error::redeploy-fleet returned HTTP $HTTP_CODE"
+ if [ -n "$NON_E2E_FAILED" ]; then
+ echo "::error::non-e2e tenant(s) failed:"
+ printf '%s\n' "$NON_E2E_FAILED" | sed 's/^/::error:: /'
+ fi
 exit 1
- fi
- OK=$(jq -r '.ok' "$HTTP_RESPONSE")
- if [ "$OK" != "true" ]; then
+ else
+ # HTTP=200 but ok=false (shouldn't happen with current CP
+ # but keep the gate for completeness).
 echo "::error::redeploy-fleet reported ok=false (see summary for which tenant halted the rollout)"
 exit 1
 fi
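Review bot: `OK=$(jq -r '.ok // "false"' "$HTTP_RESPONSE")` now runs before the HTTP-code check, so a non-JSON body (for example an HTML error page on a 502/504) makes jq fail and, under the Actions default `bash -e`, aborts the step on the jq line instead of reaching the `::error::redeploy-fleet returned HTTP $HTTP_CODE` gate, whereas the removed code only parsed `.ok` after the 200 check and the sibling `FAILED_SLUGS` query in this hunk already guards with `2>/dev/null || true`. Suggest `OK=$(jq -r '.ok // "false"' "$HTTP_RESPONSE" 2>/dev/null || echo false)` so a malformed body still lands in the hard-fail branch with the HTTP code surfaced. Separately, the soft-warn branch trusts the `^e2e-` slug prefix taken from the response body as the only signal that a failed tenant is disposable, and if the CP halts the fleet roll on the first failure (the existing "which tenant halted the rollout" message hints at that) a real tenant it never reached would not appear in `FAILED_SLUGS` at all, leaving the downstream verify step as the sole backstop. A comment next to the `NON_E2E_FAILED` line stating both assumptions would help, e.g. `# assumes the CP reserves the e2e- prefix for ephemeral tenants and that every fleet tenant appears in .results; the verify step is the backstop`. The enclosing run block starts before this hunk.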