molecule-core

Hongming Wang 6e0eb2ddc9 fix(redeploy-staging): tolerate e2e-* teardown race in fleet HTTP 500		2026-05-02 02:17:36 -07:00

Recurring failure pattern in redeploy-tenants-on-staging:

  ##[error]redeploy-fleet returned HTTP 500
  ##[error]Process completed with exit code 1.

with the per-tenant breakdown in the response body showing the failures were on ephemeral e2e-* tenants (saas/canvas/ext) whose parent E2E run tore them down mid-redeploy: SSM exit=2 because the EC2 was already terminating, or healthz timeout because the CF tunnel was already gone. The actual operator-facing tenants (dryrun-98407, demo-prep, etc) all rolled fine in the same call.

This shape repeats on every staging push that overlaps an active E2E run. The downstream `Verify each staging tenant /buildinfo matches published SHA` step ALREADY distinguishes STALE vs UNREACHABLE for exactly this reason (per #2402); only the top-level `if HTTP_CODE != 200; exit 1` gate misclassifies the race.

Filter: HTTP 500 + every failed slug matches `^e2e-` → soft-warn and fall through to verify. Any non-e2e-* failure or non-500 HTTP remains a hard fail, with the failed non-e2e slugs surfaced in the error so the operator doesn't have to dig the response body out of CI.

Verified the gate logic with 6 synthetic CP responses (happy / e2e-only race / mixed real+e2e fail / non-200 / 200+ok=false / all-real-fail); all behave correctly.

prod's redeploy-tenants-on-main is intentionally NOT touched: prod CP serves no e2e-* tenants, so the race can't occur there and the strict gate is the right behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
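
The gate described in this commit message can be sketched as a small classifier. This is an illustration only, not the workflow's actual code: the response shape (a top-level `ok` flag and a `results` list of `{"slug", "ok"}` entries), the function name `classify_redeploy`, and the verdict strings are all assumptions made for the example. Treating HTTP 200 with `ok=false` as a hard fail is likewise an assumption, since the message only says that case "behaves correctly".

```python
import re
from typing import Any

# Slug pattern taken from the commit message: ephemeral E2E tenants start with "e2e-".
E2E_SLUG = re.compile(r"^e2e-")


def classify_redeploy(http_code: int, body: dict[str, Any]) -> tuple[str, list[str]]:
    """Return (verdict, slugs) where verdict is 'pass', 'soft-warn' or 'hard-fail'.

    The body layout is hypothetical: {"ok": bool, "results": [{"slug": str, "ok": bool}]}.
    """
    results = body.get("results", [])
    failed = [r["slug"] for r in results if not r.get("ok", False)]
    real_failed = [s for s in failed if not E2E_SLUG.match(s)]

    # Happy path: 200 and the control plane reports overall success.
    if http_code == 200 and body.get("ok", False):
        return "pass", []

    # Teardown race: 500 where every failed slug is an ephemeral e2e-* tenant.
    # Requires at least one failed slug so an unparseable body is never waved through.
    if http_code == 500 and failed and not real_failed:
        return "soft-warn", failed

    # Everything else stays strict; surface the non-e2e slugs for the operator.
    return "hard-fail", real_failed


if __name__ == "__main__":
    race = {"ok": False, "results": [
        {"slug": "demo-prep", "ok": True},
        {"slug": "e2e-saas-1", "ok": False},
    ]}
    print(classify_redeploy(500, race))
```

A caller would fall through to the /buildinfo verification step on `pass` or `soft-warn` and exit non-zero on `hard-fail`, printing the returned slugs in the error message.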
..
auto-promote-on-e2e.yml	fix(ci): handle empty E2E lookup in auto-promote-on-e2e gate	2026-04-30 10:07:52 -07:00
auto-promote-staging.yml	ci(auto-sync): App-token dispatch + ubuntu-latest + workflow_dispatch	2026-05-01 22:28:35 -07:00
auto-sync-main-to-staging.yml	ci(auto-sync): App-token dispatch + ubuntu-latest + workflow_dispatch	2026-05-01 22:28:35 -07:00
auto-tag-runtime.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
block-internal-paths.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
canary-staging.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
canary-verify.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
check-merge-group-trigger.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
check-migration-collisions.yml	fix(ci): drop --depth=1 from migration collision check fetch	2026-04-30 05:28:03 -07:00
ci.yml	ci: collapse all 4 path-filtered required checks to single-job-with-conditional-steps	2026-04-29 16:09:22 -07:00
codeql.yml	chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)	2026-04-28 17:44:55 -07:00
continuous-synth-e2e.yml	ci: continuous synthetic E2E against staging (#2342)	2026-04-29 22:04:57 -07:00
e2e-api.yml	test(e2e): poll-mode + since_id cursor round-trip (#2339 PR 4)	2026-04-29 23:07:10 -07:00
e2e-staging-canvas.yml	fix(e2e-canvas): kill teardown race that poisons concurrent runs	2026-04-29 19:23:56 -07:00
e2e-staging-external.yml	test(e2e): live staging regression for external-runtime awaiting_agent transitions	2026-04-30 09:36:18 -07:00
e2e-staging-saas.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
e2e-staging-sanity.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
harness-replays.yml	harness(phase-2): multi-tenant compose + cross-tenant isolation replays	2026-05-01 21:36:40 -07:00
pr-guards.yml	ci: add pr-guards caller that disables auto-merge on push	2026-04-27 06:39:31 -07:00
promote-latest.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
publish-canvas-image.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
publish-runtime.yml	refactor(ci): extract wheel smoke into shared script	2026-04-30 11:52:07 -07:00
publish-workspace-server-image.yml	feat(deploy): verify each tenant /buildinfo matches published SHA after redeploy	2026-04-30 10:55:08 -07:00
railway-pin-audit.yml	ci: daily Railway pin-audit cron + issue-on-failure (#2169)	2026-04-29 17:43:01 -07:00
redeploy-tenants-on-main.yml	fix(redeploy-main): pull staging-<head_sha> instead of stale :latest	2026-05-01 23:17:59 -07:00
redeploy-tenants-on-staging.yml	fix(redeploy-staging): tolerate e2e-* teardown race in fleet HTTP 500	2026-05-02 02:17:36 -07:00
retarget-main-to-staging.yml	ci(retarget): handle 422 'duplicate PR' by closing redundant main-PR (closes #1884)	2026-04-26 00:53:55 -07:00
runtime-pin-compat.yml	chore(deps): batch dep bumps — 6 safe upgrades (4 actions majors + 2 npm dev deps)	2026-04-28 17:44:55 -07:00
runtime-prbuild-compat.yml	ci(wheel-smoke): always-run with per-step if-gates for required-check eligibility	2026-04-30 20:40:05 -07:00
secret-pattern-drift.yml	secret-scan: align local pre-commit + extend drift lint (closes #1569 root)	2026-05-01 23:47:56 -07:00
secret-scan.yml	chore(security): pin Actions to SHAs + enable Dependabot auto-bumps	2026-04-28 15:37:06 -07:00
sweep-cf-orphans.yml	Merge pull request #2248 from Molecule-AI/fix/sweep-cf-orphans-hard-fail-on-schedule	2026-04-29 01:16:22 +00:00
sweep-cf-tunnels.yml	feat(ops): add sweep-cf-tunnels janitor — orphan Cloudflare Tunnels accumulate	2026-04-29 19:42:47 -07:00
sweep-stale-e2e-orgs.yml	ci: hourly sweep of stale e2e-* orgs on staging	2026-04-24 23:07:57 -07:00
test-ops-scripts.yml	docs(ci): correct test-ops-scripts.yml header — discover does NOT recurse	2026-04-30 20:52:58 -07:00