fix(ci): hard-fail when >50% of fleet unreachable post-redeploy

Belt-and-suspenders sanity floor on top of the unreachable-soft-warn
introduced earlier in this PR. Addresses the residual gap noted in
review: if a new image crashes on startup, every tenant ends up
unreachable, and the soft-warn alone would let that ship as a green
deploy. Canary-verify catches it on the canary tenant first, but this
guard is a fallback for canary-skip dispatches and same-batch races.

Threshold is 50% of healthz_ok-snapshotted tenants — comfortably above
the typical e2e-* teardown rate (5-10/hour, ~1 ephemeral tenant per
batch) but below any plausible real-outage scenario.

Mirrored across staging.yml + main.yml for shape parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Hongming Wang 2026-04-30 11:35:56 -07:00
parent d45241cae7
commit ec39fecda2
2 changed files with 22 additions and 0 deletions

View File

@ -306,6 +306,16 @@ jobs:
if [ $UNREACHABLE_COUNT -gt 0 ]; then
echo "::warning::$UNREACHABLE_COUNT tenant(s) unreachable post-redeploy. Likely benign teardown race — CP healthz monitor catches real outages."
fi
# Belt-and-suspenders sanity floor: same logic as the staging
# variant — if MORE than half the prod fleet is unreachable,
# this is a real outage, not a teardown race. Hard-fail.
TOTAL_VERIFIED=${#SLUGS[@]}
if [ $UNREACHABLE_COUNT -gt $((TOTAL_VERIFIED / 2)) ]; then
echo "::error::$UNREACHABLE_COUNT of $TOTAL_VERIFIED tenant(s) unreachable — exceeds 50% threshold. Likely real outage, not teardown race."
exit 1
fi
if [ $STALE_COUNT -gt 0 ]; then
echo "::error::$STALE_COUNT tenant(s) returned a stale SHA. ssm_status=Success was misleading — see job summary."
exit 1

View File

@ -283,6 +283,18 @@ jobs:
if [ $UNREACHABLE_COUNT -gt 0 ]; then
echo "::warning::$UNREACHABLE_COUNT staging tenant(s) unreachable post-redeploy. Likely benign teardown race — CP healthz monitor catches real outages."
fi
# Belt-and-suspenders sanity floor: if MORE than half the fleet is
# unreachable, this isn't a teardown race — it's a real outage
# (e.g. the new image crashes on startup). Hard-fail. Canary-verify
# would catch this on the canary tenant first; this guard is a
# fallback for canary-skip dispatches and same-batch races.
TOTAL_VERIFIED=${#SLUGS[@]}
if [ $UNREACHABLE_COUNT -gt $((TOTAL_VERIFIED / 2)) ]; then
echo "::error::$UNREACHABLE_COUNT of $TOTAL_VERIFIED staging tenant(s) unreachable — exceeds 50% threshold. Likely real outage, not teardown race."
exit 1
fi
if [ $STALE_COUNT -gt 0 ]; then
echo "::error::$STALE_COUNT staging tenant(s) returned a stale SHA. ssm_status=Success was misleading — see job summary."
exit 1