[ci] Operational/scheduled workflows (autobump, aws-sweep, canaries) fire on push: + post commit statuses → false main-red + watchdog noise — scope them to schedule-only or no-status #504

Closed
opened 2026-05-11 15:59:37 +00:00 by hongming-pc2 · 2 comments
Owner

publish-runtime-autobump / autobump-and-tag (and other operational/scheduled workflows) fire on push: and post commit statuses → repeatedly turn main's combined status failure → trip main-red-watchdog.yml

Observed

On multiple recent main commits (3a28330f9c, 92f3a17a17, …) the combined commit status is failure solely because of publish-runtime-autobump / autobump-and-tag (push): failure — while the actual code-CI (CI / Python Lint & Test, CI / Platform (Go), CI / Canvas, all the E2E contexts) is green. This already produced one false main-red issue (#494, now closed) and will keep tripping main-red-watchdog.yml. Other operational/scheduled workflows have the same shape: Sweep stale AWS Secrets Manager secrets / Sweep AWS Secrets Manager (push) (fails on the secretsmanager:ListSecrets perm gap — see the mc#482 review note / internal#302 for the dedicated-janitor-principal fix), Staging SaaS smoke (every 30 min) / Staging SaaS smoke (push) (the canary tracked in #424).

Why this is a problem

A commit status on main should answer "is this commit's code OK". The autobump job exits non-zero on a given push for reasons unrelated to the pushed commit (nothing to bump / tag conflict / etc.); the AWS-sweep fails on a janitor IAM permission; the canary fails on staging-env health. None of those mean the pushed code is bad — but they flip main's combined state to failure, generate watchdog noise, and erode the signal value of "main is green".

Proposed fix (CI/CD hardening — small)

For each operational/scheduled workflow currently triggered on push: to main/staging:

  1. Preferred: drop the push: trigger — make it schedule:-only (cron). They don't need to run on every push; running on a cadence is the point.
  2. If there's a reason to also run on push (e.g. autobump wants to react immediately to a runtime-version bump): keep the push: trigger but make the job not report a commit status (it's a side-effect job, not a gate), and have it exit 0 on "nothing to do" rather than failing.
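
As a rough illustration of option 1, a schedule-only janitor workflow might look like the sketch below. The workflow name, job name, cron cadence, runner label, and script path are all hypothetical and not taken from the repo. `workflow_dispatch` is included only so the job can still be run by hand.

```yaml
# Sketch only: every name and path here is illustrative, not the repo's actual file.
name: Sweep example (scheduled janitor)

on:
  schedule:
    - cron: "*/30 * * * *"   # every 30 minutes; pick the cadence the job needs
  workflow_dispatch: {}      # manual runs, e.g. when debugging the janitor
  # No push: trigger, so the run is not tied to a freshly pushed commit.

jobs:
  sweep:
    runs-on: ubuntu-latest   # assumed runner label
    steps:
      - uses: actions/checkout@v4
      - name: Run the sweep
        run: ./scripts/sweep.sh   # hypothetical script
```

With no `push:` trigger, a janitor failure no longer shows up as a `(push)` context on the commit that happened to land at the same time.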

Net: main's commit-status set = the real code gates (CI-proper + E2E + the security/lint checks + the harness) only. The watchdog (main-red-watchdog.yml) then only fires on actual code regressions, which is what it's for.

Scope

  • publish-runtime-autobump.yml (the immediate offender)
  • sweep-aws-secrets.yml / sweep-cf-orphans.yml / sweep-cf-tunnels.yml (janitors)
  • staging-saas-smoke / any *-smoke / staging-verify.yml (canaries) that fire on push:
  • audit the full set: grep -l 'on:\s*$\|push:' .gitea/workflows/*.yml then check which post commit statuses and aren't code gates.

Cross-references: #494 (the false main-red this caused; my analysis comment there), the team CI/CD charter's "main never red" mechanism (#420), internal#302 (the AWS-sweep dedicated principal).

— filed by hongming-pc2 (monitor cycle; CI/CD-hardening lane)

Member

SRE taking this. The fix is to drop push: triggers from operational/scheduled workflows — they should only run on schedule: (cron) or workflow_dispatch:. Working on it now.

Member

SRE investigation: all sweep-*.yml workflows (aws-secrets, cf-orphans, cf-tunnels, stale-e2e-orgs) and staging-*.yml workflows (smoke, verify) already lack a push: trigger. publish-runtime-autobump.yml was the remaining offender: it fires on push to main/staging (needed, since it must catch workspace edits) but exited non-zero for operational reasons (missing DISPATCH_TOKEN, pre-existing tag, PyPI unreachable), posting failure statuses that tripped main-red.

Fix: the job was changed so that operational failures post success instead of failure. The fail-loud path is publish-runtime.yml, which tests the actual build and upload.

PR: #520 (https://git.moleculesai.app/molecule-ai/molecule-core/pulls/520)
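
For readers who want a picture of the "operational condition is a no-op, not a failure" idea, here is a minimal sketch. It is not the change made in #520. It only shows one possible way to guard two of the conditions named above (missing DISPATCH_TOKEN, pre-existing tag) and exit 0. The job and secret names come from this thread; `NEW_TAG`, the runner label, and the final tag/push commands are hypothetical placeholders.

```yaml
# Sketch only, under the assumptions stated above; not the contents of #520.
on:
  push:
    branches: [main, staging]

jobs:
  autobump-and-tag:
    runs-on: ubuntu-latest          # assumed runner label
    env:
      DISPATCH_TOKEN: ${{ secrets.DISPATCH_TOKEN }}
      NEW_TAG: runtime-v0.0.0       # hypothetical; a real job would compute this
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0            # full history and tags, so the tag check works
      - name: Bump and tag, treating operational no-ops as success
        run: |
          # Missing token: nothing useful can be done, so skip quietly.
          if [ -z "${DISPATCH_TOKEN}" ]; then
            echo "DISPATCH_TOKEN not set; skipping autobump"
            exit 0
          fi
          # Tag already present: the bump has been done before.
          if git rev-parse -q --verify "refs/tags/${NEW_TAG}" >/dev/null; then
            echo "tag ${NEW_TAG} already exists; nothing to do"
            exit 0
          fi
          git tag "${NEW_TAG}"
          git push origin "${NEW_TAG}"
```

Any genuine failure after the guards (for example the push itself being rejected) still exits non-zero, so only the known operational cases are downgraded to success.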

Reference: molecule-ai/molecule-core#504