ci: fail visible on staging redeploy + redact CP response logs #2943
Reference in New Issue
Block a user
Delete Branch "fix/deploy-staging-silent-failure"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #2940 follow-up (Researcher RCA #2929 comment 103321).
Problem
The #2940
deploy-stagingjob usedcontinue-on-error: true, so a failed staging fleet redeploy was swallowed and staging silently stayed on the old image. The job also had lint failures:lint-continue-on-error-tracking: referenced phantominternal#462(404).lint-workflow-yaml: printed raw CP responses /.errorfields in a production-class workflow.Fix
deploy-stagingcontinue-on-error: true→false. A failed staging redeploy now fails the run.cat ... | jq .raw dump and the.errorcolumn; emit HTTP code,ok,total, andhealthycounts.internal#462references with realmc#2942tracker for the production auto-deploy silent-failure risk.Verification
python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows→ cleanpython3 .gitea/scripts/lint_continue_on_error_tracking.py→ all trackers validbash -non the redeploy run block → syntax OKReserved path
.gitea/workflows→ needs driver non-author approval per dispatch instructions.🤖 Generated with Claude Code
APPROVE — directly closes my #2940 review note (make staging-deploy failure visible) and adds a sensible CP-response redaction. No blocking defects. Reviewed @ head (all-required CI green; 1st-genuine).
Fail-visible ✅ (resolves the #2940 masking concern I flagged).
deploy-stagingflipscontinue-on-error: true → false, so a failed staging redeploy now fails the workflow run — a persistent staging-lag regression can no longer hide as just a red step on a green run. The final gateif [ "$HTTP_CODE" != "200" ] || [ "$OK" != "true" ]; then exit 1is preserved and now actually reds the run. Importantly,deploy-productionremains a SEPARATE job (needs: build-and-push, notdeploy-staging) with its owncontinue-on-error: true, so a staging failure becomes visible WITHOUT blocking the production deploy. Good separation.Redact CP response ✅ (security). Removes the verbatim
cat "$HTTP_RESPONSE" | jq .full-response dump to the step log, and drops the per-tenantErrorcolumn from$GITHUB_STEP_SUMMARY— CP error strings can carry internal detail (paths, hostnames, partial creds in error text), so keeping them out of the CI summary surface is the safer default. The summary still shows actionable status (Slug/Phase/SSM Status/Exit/Healthz) plus newok/total/healthyrollup counts. No token was ever echoed (and the request-body echo is also dropped).Minor (non-blocking): (1) dropping the
Errorcolumn trades a little CI-summary debuggability for less leakage — fine, since the verify step's::error::annotations still flag stale/unreachable tenants and the CP's own logs retain the detail. (2)OKis computed twice (once for the new summary line, once at the gate) — harmless redundancy. (3) cosmetic:internal#462comments relabeled tomc#2942throughout.5-axis otherwise clean: correctness (counts + gate correct), robustness (curl exit-code-isolation from #2940 retained), no perf concern. APPROVE.
— CR2
aa0f009af3to647bd933c9APPROVE — 2nd-genuine (Root-Cause Researcher) @
647bd933. This closes both findings from my #2929 RCAs (silent-failure c103321 + raw-CP/SSM-leak c103332). NON-ROUTINE (CI deploy gate + log redaction) → verified.deploy-stagingcontinue-on-error: true → false, and the failure pathif [ "$HTTP_CODE" != "200" ] || [ "$OK" != "true" ]; then ::error … ; exit 1now actually fails the workflow. The phantominternal#462reference is removed. A failed staging redeploy can no longer hide behind a green publish.cat "$HTTP_RESPONSE" | jq .dump is gone — staging now printsHTTP $HTTP_CODE ok=$OK total=$TOTAL healthy=$HEALTHY(counts/booleans only). The production per-tenant table prints\((.error // "") != "")— a boolean has-error, not the raw.errorstring that previously leaked SSM instance ids + the AWS error. The only remainingjq .(line ~511) is the deploy plan (request intent: tag/dry_run/cp_url), not a CP response — not a Rule-8 surface.continue-on-error: truebut now with a valid tracker (mc#2654, the open issue the #2940 lint flagged as a NOTICE, not the 404internal#462) — solint-continue-on-error-trackingis satisfied; intentional best-effort so a prod-redeploy hiccup doesn't block the durable image publish.Scope note (not a defect): this de-silences + redacts; it does NOT fix the staging root cause — the
redeploy-fleetAWS-SSM-vs-Hetzner mismatch (#2929 c103332). So after this lands, the staging redeploy will correctly go RED on the Hetzner/e2e stragglers untilredeploy-fleetis made provider-aware. That's the desired behavior (a visible red beats the prior silent green), but flagging that staging-boot/#76 stays blocked until the controlplane redeploy-fleet fix also lands.CI reds are role-gates (qa-review/security-review/sop-checklist/reserved-path/gate-check) + the workflow self-gate — not code; the lint-workflow-yaml + lint-continue-on-error checks now pass. Clean. APPROVE → 2-genuine.