Merge pull request 'fix(ci): canary alerting — drop Gitea-incompatible actions API call' (#130) from fix/canary-staging-gitea-compat-alerting into main
All checks were successful
CodeQL / Analyze (${{ matrix.language }}) (javascript-typescript) (push) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (go) (push) Successful in 0s
CodeQL / Analyze (${{ matrix.language }}) (python) (push) Successful in 2s
Block internal-flavored paths / Block forbidden paths (push) Successful in 5s
Check merge_group trigger on required workflows / Required workflows have merge_group trigger (push) Successful in 5s
CI / Detect changes (push) Successful in 7s
E2E API Smoke Test / detect-changes (push) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 6s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (push) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (push) Successful in 7s
Handlers Postgres Integration / detect-changes (push) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (push) Successful in 8s
CI / Shellcheck (E2E scripts) (push) Successful in 2s
CI / Platform (Go) (push) Successful in 4s
CI / Canvas (Next.js) (push) Successful in 4s
E2E API Smoke Test / E2E API Smoke Test (push) Successful in 3s
CI / Python Lint & Test (push) Successful in 3s
CI / Canvas Deploy Reminder (push) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (push) Successful in 4s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (push) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (push) Successful in 26s
commit 44bb35f2a8
--- a/.github/workflows/canary-staging.yml
+++ b/.github/workflows/canary-staging.yml
@@ -137,27 +137,28 @@ jobs:
         id: canary
         run: bash tests/e2e/test_staging_full_saas.sh
 
-      # Alerting: open an issue only after THREE consecutive failures so
-      # transient flakes (Cloudflare DNS hiccup, AWS API blip) don't spam
-      # the issue list. If an issue is already open, we still comment on
-      # every failure so ops sees the streak. Auto-close on next green.
+      # Alerting: open a sticky issue on the FIRST failure; comment on
+      # subsequent failures; auto-close on next green. Comment-on-existing
+      # de-duplicates so a single open issue accumulates the streak —
+      # ops sees one issue with N comments rather than N issues.
       #
-      # Threshold rationale: canary fires every 30 min, so 3 failures =
-      # ~90 min of consecutive red — well past any single-run flake but
-      # still tight enough that a real outage gets surfaced before the
-      # next deploy window.
+      # Why no consecutive-failures threshold (e.g., wait 3 runs before
+      # filing): the prior threshold check used
+      # `github.rest.actions.listWorkflowRuns()` which Gitea 1.22.6 does
+      # not expose (returns 404). On Gitea Actions the threshold call
+      # ALWAYS failed, breaking the entire alerting step and going days
+      # silent on real regressions (38h+ chronic red on 2026-05-07/08
+      # before this fix; tracked in molecule-core#129). Filing on first
+      # failure is also better UX — we want to know about the first red,
+      # not wait 90 min for it to "count." Real flakes get one issue +
+      # a quick close-on-green; persistent reds accumulate comments.
       - name: Open issue on failure
         if: failure()
         uses: actions/github-script@3a2844b7e9c422d3c10d287c895573f7108da1b3 # v9.0.0
-        env:
-          # Inject the workflow path explicitly — context.workflow is
-          # the *name*, not the file path the actions API needs.
-          WORKFLOW_PATH: '.github/workflows/canary-staging.yml'
-          CONSECUTIVE_THRESHOLD: '3'
         with:
           script: |
             const title = '🔴 Canary failing: staging SaaS smoke';
-            const runURL = `https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
+            const runURL = `${context.serverUrl}/${context.repo.owner}/${context.repo.repo}/actions/runs/${context.runId}`;
 
             // Find an existing open canary issue (stable title match).
             // If one exists, this isn't a "first failure" — comment and exit.
@@ -177,32 +178,12 @@ jobs:
               return;
             }
 
-            // No open issue yet — check the last N-1 runs' conclusions.
-            // We open the issue only if the last (THRESHOLD-1) runs ALSO
-            // failed (so this is the 3rd consecutive red).
-            const threshold = parseInt(process.env.CONSECUTIVE_THRESHOLD, 10);
-            const { data: runs } = await github.rest.actions.listWorkflowRuns({
-              owner: context.repo.owner, repo: context.repo.repo,
-              workflow_id: process.env.WORKFLOW_PATH,
-              status: 'completed',
-              per_page: threshold,
-              // Skip the current in-progress run; it isn't 'completed' yet.
-            });
-            // listWorkflowRuns returns recent first. We need (threshold-1)
-            // prior failures (current run is the threshold-th).
-            const priorFailures = (runs.workflow_runs || [])
-              .slice(0, threshold - 1)
-              .filter(r => r.id !== context.runId)
-              .filter(r => r.conclusion === 'failure')
-              .length;
-            if (priorFailures < threshold - 1) {
-              core.info(`Below threshold: ${priorFailures + 1}/${threshold} consecutive failures — not filing yet`);
-              return;
-            }
-
+            // No open issue yet — file one on this first failure. The
+            // comment-on-existing branch above means subsequent failures
+            // accumulate as comments on this same issue, so we don't
+            // spam new issues per run.
             const body =
-              `Canary run failed at ${new Date().toISOString()}, ` +
-              `${threshold} consecutive runs red.\n\n` +
+              `Canary run failed at ${new Date().toISOString()}.\n\n` +
               `Run: ${runURL}\n\n` +
               `This issue auto-closes on the next green canary run. ` +
               `Consecutive failures add a comment here rather than a new issue.`;
@@ -211,7 +192,7 @@ jobs:
               title, body,
               labels: ['canary-staging', 'bug'],
             });
-            core.info(`Opened canary failure issue (${threshold} consecutive reds)`);
+            core.info('Opened canary failure issue (first red)');
 
       - name: Auto-close canary issue on success
         if: success()
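
Note on the new alerting flow: the decision the workflow comments describe (open on first red, comment on later reds, close on green) can be sketched as a pure function, which makes the de-duplication easy to reason about without calling any API. This is only an illustration, not code from the commit. `decideAlertAction` and `buildRunURL` are hypothetical names, and the sketch assumes the auto-close step finds the issue by the same title match as the failure step, which this diff does not show.

```javascript
// Illustrative sketch only; these helpers are hypothetical and not part of the commit.
const TITLE = '🔴 Canary failing: staging SaaS smoke';

// Build the run link from the server URL instead of hard-coding github.com,
// so the same template works on Gitea and on GitHub.
function buildRunURL(serverUrl, owner, repo, runId) {
  return `${serverUrl}/${owner}/${repo}/actions/runs/${runId}`;
}

// Decide what the alerting steps should do for a single canary run.
// openIssues: array of { number, title } for the currently open issues.
// Assumption: the auto-close step matches on the same stable title.
function decideAlertAction(runSucceeded, openIssues) {
  const existing = openIssues.find((issue) => issue.title === TITLE);
  if (runSucceeded) {
    // Green run: close the sticky issue if one is open, otherwise do nothing.
    return existing ? { action: 'close', issue: existing.number } : { action: 'none' };
  }
  // Red run: comment on the sticky issue if it exists, otherwise open it.
  return existing ? { action: 'comment', issue: existing.number } : { action: 'open' };
}

// Usage: a first red with no open issue files one; a second red comments on it.
console.log(decideAlertAction(false, []).action);
console.log(decideAlertAction(false, [{ number: 7, title: TITLE }]).action);
```

Because the function needs no run history, it avoids the `listWorkflowRuns` call entirely, which is the point of the change: the only state it depends on is the list of open issues.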