fix(ci): harden Hermes runner gates
Some checks failed
Contributor Attribution Check / check-attribution (pull_request) Successful in 2m22s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 2m16s
Tests / e2e (pull_request) Successful in 4m0s
Nix / nix (ubuntu-latest) (pull_request) Failing after 21m40s
Tests / test (pull_request) Failing after 24m57s
parent 148811a020
commit 1263836d2f
.github/actions/nix-setup/action.yml (vendored, 10 lines changed)
@@ -1,6 +1,16 @@
 name: 'Setup Nix'
 description: 'Install Nix and configure Cachix binary cache'
 
+# Hermes validates its Nix flake in CI so packaging and NixOS-module drift are
+# caught before merge. This action is intentionally CI-only: regular Hermes
+# runtime installs do not require Nix.
+#
+# The Molecule Gitea runners are Linux VMs without Nix preinstalled, so CI uses
+# a pinned Determinate Systems installer revision. The action is mirrored into
+# git.moleculesai.app for availability; update the mirror and this pin together.
+# Cachix is only a performance cache. Cache outages must not hide correctness
+# failures, so that step remains best-effort and the flake/build steps below
+# decide pass/fail.
 inputs:
   cachix-auth-token:
     description: 'Cachix auth token (enables push). Omit for read-only.'
.github/workflows/nix.yml (vendored, 11 lines changed)
@@ -15,6 +15,15 @@ concurrency:
 
 jobs:
   nix:
+    # This gate protects Hermes' reproducible packaging surface: flake
+    # evaluation, the Python package build, the NixOS module wiring, and the
+    # lockfile hash diagnostics used by release/packaging maintainers.
+    #
+    # Nix is not a runtime dependency for Hermes. The Gitea runner image does
+    # not ship Nix, so the repo-local setup action installs it using the pinned
+    # Determinate Systems installer and then configures Cachix as a best-effort
+    # cache. Cold-cache runners can legitimately spend more than 30 minutes
+    # compiling this graph, so keep the timeout above the normal cold path.
     strategy:
       matrix:
         # The Molecule Gitea runner pool currently exposes Linux runners only.
@@ -22,7 +31,7 @@ jobs:
         # branch status on an unavailable macOS label.
         os: [ubuntu-latest]
     runs-on: ${{ matrix.os }}
-    timeout-minutes: 30
+    timeout-minutes: 60
     steps:
       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
       - uses: ./.github/actions/nix-setup
.github/workflows/tests.yml (vendored, 13 lines changed)
@@ -28,8 +28,17 @@ jobs:
       - name: Checkout code
         uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
 
-      - name: Install system dependencies
-        run: sudo apt-get update && sudo apt-get install -y ripgrep
+      - name: Install optional system dependencies
+        timeout-minutes: 3
+        continue-on-error: true
+        run: |
+          if command -v rg >/dev/null 2>&1; then
+            rg --version
+            exit 0
+          fi
+
+          sudo apt-get update -o Acquire::Retries=3
+          sudo apt-get install -y --no-install-recommends ripgrep
 
       - name: Install uv
         # Pin uv version explicitly so setup-uv constructs the release
docs/ci-nix.md (normal file, new, 59 lines added)

# Hermes Nix CI Gate

Hermes keeps a Nix gate in CI to validate the packaging surface that is easy to
break accidentally:

- `flake.nix` evaluation
- the Hermes package build
- the NixOS module and config roundtrip checks
- npm lockfile hash drift diagnostics for the bundled web/TUI packages

Nix is not required to run Hermes. It is a CI and packaging tool for people who
consume Hermes through Nix or maintain the release packaging.

## Runner Contract

The Molecule Gitea runner pool currently exposes Linux runners only. The Nix
workflow therefore runs on `ubuntu-latest`; do not add a macOS required context
unless a live macOS Gitea runner exists and is protected by the same branch gate.

The runner image does not include Nix. CI installs it through the pinned
`DeterminateSystems/nix-installer-action` revision in
`.github/actions/nix-setup/action.yml`. That action must also exist in the
Gitea action mirror so CI does not depend on GitHub availability.

Cachix is configured as a best-effort cache. A cache outage can make the job
slower, but it must not decide pass/fail. The required checks are the flake and
package build steps.
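
The split between a best-effort cache step and required build steps can be sketched in plain shell. This is only an illustration under stated assumptions: `best_effort` is a hypothetical helper that does not exist in the Hermes repo, `false` stands in for a failing cache setup, and the real workflow may express the same idea with a step-level `continue-on-error` setting instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Hermes repo): run a command whose
# failure should be logged but must never decide pass/fail.
best_effort() {
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "WARNING: best-effort step failed (exit $status): $*" >&2
  fi
  # Always report success so a cache outage cannot fail the job.
  return 0
}

# A failing cache setup (simulated with `false`) only produces a warning.
best_effort false
echo "after best-effort step: exit=$?"

# A required step keeps its real exit status and would fail the job.
if ! false; then
  echo "required step failed: the job should stop here"
fi
```

The point of the design is that only the required steps carry an exit status the job acts on; the cache step can slow the job down but cannot turn it red.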

## Timeout Policy

Cold Gitea runners may need to build the Nix graph without a populated cache.
The workflow timeout is intentionally set to 60 minutes so cold-cache builds can
finish while still bounding stuck jobs.

If the Nix job times out, check the log tail first:

- active build output near the end usually means a cold-cache timeout; raise the
  cache hit rate or split the check before changing product code
- a completed build followed by `nix run .#fix-lockfiles -- --check` failure
  usually means committed npm lockfile hashes are stale
- installer or mirror failures point at runner bootstrap or action mirror drift
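
The triage order above could be scripted roughly as follows. This is a sketch, not a tool that ships with Hermes: `classify_nix_log_tail` and its class names are hypothetical, and the grep patterns are assumptions about what the job log contains, so adjust them to the real output before relying on the result.

```shell
#!/bin/sh
# Hypothetical triage helper: classify a failed Nix job from its log tail.
# The patterns below are assumptions about the log text, not verified output.
classify_nix_log_tail() {
  log_tail=$(tail -n 50 "$1")
  if printf '%s\n' "$log_tail" | grep -q 'fix-lockfiles'; then
    # The lockfile check ran, so the build itself had already completed.
    echo "stale-lockfile-hashes"
  elif printf '%s\n' "$log_tail" | grep -qi 'nix-installer'; then
    # The job never got past installing Nix on the runner.
    echo "runner-bootstrap"
  elif printf '%s\n' "$log_tail" | grep -q "building '/nix/store/"; then
    # Still compiling when the job ended: likely a cold-cache timeout.
    echo "cold-cache-timeout"
  else
    echo "unclassified"
  fi
}

# Example: a log that ends in active build output.
sample_log=$(mktemp)
printf '%s\n' "building '/nix/store/example-hermes.drv'..." > "$sample_log"
classify_nix_log_tail "$sample_log"
rm -f "$sample_log"
```

The lockfile check is tested first because a log that reaches it also contains earlier build lines; checking for build output first would misclassify that case as a cold-cache timeout.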

## Debugging and Observability

When a Nix CI failure is not self-explanatory from the Gitea job log, use the
central observability stack before SSH-grepping individual runners. Runner,
operator, and tenant logs are shipped to Molecule Loki/Grafana. Useful failure
classes to search for:

- action mirror fetch failures
- Nix installer failures
- Cachix connectivity or auth failures
- runner job cancellation or timeout events
- disk pressure during Nix store builds

The workflow should keep emitting enough log context to classify those failures
without needing a rerun. If a future fix touches the runner bootstrap, add
diagnostic output there as part of the same change so the next red main has a
clear owner and root cause.
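
As one possible shape for that diagnostic output, a bootstrap step could print a few greppable lines before the build starts. The sketch below is an assumption, not the workflow's current behavior: `emit_ci_diagnostics` and the `diag:` key names are hypothetical, and `/nix/store` is used only as the conventional default store location.

```shell
#!/bin/sh
# Hypothetical diagnostic step: print one greppable "diag:" line per fact so
# installer and disk-pressure failures can be classified from the log alone.
emit_ci_diagnostics() {
  store_dir=${1:-/nix/store}
  echo "diag: time_utc=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  if command -v nix >/dev/null 2>&1; then
    echo "diag: nix=$(nix --version)"
  else
    echo "diag: nix=missing"
  fi
  if [ -d "$store_dir" ]; then
    # df -P uses the POSIX column layout; column 5 of row 2 is the capacity.
    echo "diag: store_disk_used=$(df -P "$store_dir" | awk 'NR==2 {print $5}')"
  else
    echo "diag: store_disk_used=unavailable ($store_dir not found)"
  fi
}

emit_ci_diagnostics
```

Keeping each fact on its own prefixed line makes the output easy to search for in Loki/Grafana without parsing the surrounding job log.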