forked from molecule-ai/molecule-core
The /buildinfo verify step (PR #2398) was treating "no /buildinfo response" the same as "tenant returned wrong SHA" — both bumped MISMATCH_COUNT and hard-failed the workflow. First post-merge run on staging caught a real edge case: ephemeral E2E tenants (slug e2e-20260430-...) get torn down by the E2E teardown trap between CP's healthz_ok snapshot and the verify step running, so the verify step would dial into DNS that no longer resolves and hard-fail on a benign condition. The bug class we actually care about is STALE (tenant up + serving old code, the #2395 root). UNREACHABLE post-redeploy is almost always a benign teardown race; real "tenant up but unreachable" is caught by CP's own healthz monitor + the alert pipeline, so double-counting it here was making this workflow flaky on every staging push that overlapped E2E. Wire: - Split MISMATCH_COUNT into STALE_COUNT + UNREACHABLE_COUNT. - STALE → hard-fail the workflow (the bug class we're guarding). - UNREACHABLE → :⚠️:, don't fail. Reachable-mismatch still hard-fails. - Job summary surfaces both lists separately so on-call can tell at a glance which class fired. Mirror in redeploy-tenants-on-main.yml for shape parity (prod has fewer ephemeral tenants but identical asymmetry would be a gratuitous fork). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| scripts | ||
| workflows | ||
| CODEOWNERS | ||
| dependabot.yml | ||