fix(ci): harden Hermes runner gates
Some checks failed
Contributor Attribution Check / check-attribution (pull_request) Successful in 2m22s
Supply Chain Audit / Scan PR for critical supply chain risks (pull_request) Successful in 2m16s
Tests / e2e (pull_request) Successful in 4m0s
Nix / nix (ubuntu-latest) (pull_request) Failing after 21m40s
Tests / test (pull_request) Failing after 24m57s

hongming-codex-laptop 2026-05-12 23:52:09 -07:00
parent 148811a020
commit 1263836d2f
4 changed files with 90 additions and 3 deletions


@@ -1,6 +1,16 @@
 name: 'Setup Nix'
 description: 'Install Nix and configure Cachix binary cache'
+# Hermes validates its Nix flake in CI so packaging and NixOS-module drift are
+# caught before merge. This action is intentionally CI-only: regular Hermes
+# runtime installs do not require Nix.
+#
+# The Molecule Gitea runners are Linux VMs without Nix preinstalled, so CI uses
+# a pinned Determinate Systems installer revision. The action is mirrored into
+# git.moleculesai.app for availability; update the mirror and this pin together.
+# Cachix is only a performance cache. Cache outages must not hide correctness
+# failures, so that step remains best-effort and the flake/build steps below
+# decide pass/fail.
 inputs:
   cachix-auth-token:
     description: 'Cachix auth token (enables push). Omit for read-only.'


@@ -15,6 +15,15 @@ concurrency:
 jobs:
   nix:
+    # This gate protects Hermes' reproducible packaging surface: flake
+    # evaluation, the Python package build, the NixOS module wiring, and the
+    # lockfile hash diagnostics used by release/packaging maintainers.
+    #
+    # Nix is not a runtime dependency for Hermes. The Gitea runner image does
+    # not ship Nix, so the repo-local setup action installs it using the pinned
+    # Determinate Systems installer and then configures Cachix as a best-effort
+    # cache. Cold-cache runners can legitimately spend more than 30 minutes
+    # compiling this graph, so keep the timeout above the normal cold path.
     strategy:
       matrix:
         # The Molecule Gitea runner pool currently exposes Linux runners only.
@@ -22,7 +31,7 @@ jobs:
         # branch status on an unavailable macOS label.
         os: [ubuntu-latest]
     runs-on: ${{ matrix.os }}
-    timeout-minutes: 30
+    timeout-minutes: 60
     steps:
       - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
       - uses: ./.github/actions/nix-setup


@@ -28,8 +28,17 @@ jobs:
       - name: Checkout code
         uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
-      - name: Install system dependencies
-        run: sudo apt-get update && sudo apt-get install -y ripgrep
+      - name: Install optional system dependencies
+        timeout-minutes: 3
+        continue-on-error: true
+        run: |
+          if command -v rg >/dev/null 2>&1; then
+            rg --version
+            exit 0
+          fi
+          sudo apt-get update -o Acquire::Retries=3
+          sudo apt-get install -y --no-install-recommends ripgrep
       - name: Install uv
         # Pin uv version explicitly so setup-uv constructs the release

docs/ci-nix.md

@@ -0,0 +1,59 @@
# Hermes Nix CI Gate

Hermes keeps a Nix gate in CI to validate the packaging surface that is easy to
break accidentally:

- `flake.nix` evaluation
- the Hermes package build
- the NixOS module and config roundtrip checks
- npm lockfile hash drift diagnostics for the bundled web/TUI packages

Nix is not required to run Hermes. It is a CI and packaging tool for people who
consume Hermes through Nix or maintain the release packaging.

## Runner Contract

The Molecule Gitea runner pool currently exposes Linux runners only, so the Nix
workflow runs on `ubuntu-latest`. Do not add a macOS required context unless a
live macOS Gitea runner exists to pick it up and that context is enforced by the
same branch protection rule. Otherwise the macOS job waits for a runner that
never arrives and the branch status stays pending.

The runner image does not include Nix. CI installs it through the pinned
`DeterminateSystems/nix-installer-action` revision in
`.github/actions/nix-setup/action.yml`. That action must also exist in the
Gitea action mirror so CI does not depend on GitHub availability.

Cachix is configured as a best-effort cache. A cache outage can make the job
slower, but it must not decide pass/fail. The required checks are the flake and
package build steps.
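
The runner contract can be checked before merge. Compare the workflow's `matrix.os` labels against the labels the runner pool actually serves, and flag any label that no runner would pick up. The sketch below assumes this approach. The `unserved_labels` helper and the label set are hypothetical; a real check would read the live labels from the Gitea runner registry.

```python
# Hypothetical guard for the runner contract: every matrix OS label must be
# served by a live runner, otherwise its required check never reports.
from typing import Iterable, List, Set


def unserved_labels(matrix_os: Iterable[str], runner_labels: Set[str]) -> List[str]:
    """Return matrix labels (in order, deduplicated) that no live runner advertises."""
    missing: List[str] = []
    for label in matrix_os:
        if label not in runner_labels and label not in missing:
            missing.append(label)
    return missing


if __name__ == "__main__":
    # Assumed label set: the doc only says the pool exposes Linux runners.
    live = {"ubuntu-latest"}
    print(unserved_labels(["ubuntu-latest"], live))                  # []
    print(unserved_labels(["ubuntu-latest", "macos-latest"], live))  # ['macos-latest']
```

Returning the offending labels rather than a bare boolean keeps the failure message specific enough to act on from the job log alone.
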

## Timeout Policy

Cold Gitea runners may need to build the Nix graph without a populated cache.
The workflow timeout is intentionally set to 60 minutes so cold-cache builds can
finish while still bounding stuck jobs.

If the Nix job times out or fails, check the log tail first:

- active build output near the end usually means a cold-cache timeout; raise the
  cache hit rate or split the check before changing product code
- a completed build followed by a failing `nix run .#fix-lockfiles -- --check`
  usually means the committed npm lockfile hashes are stale
- installer or mirror failures point at runner bootstrap or action mirror drift
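
The triage rules above can be sketched as a small classifier. It scans the log tail from the end and reports the last recognizable phase the job reached. The `classify_log_tail` helper and every marker string are assumptions for illustration. They are not documented output from Nix, the Determinate Systems installer, or the Gitea runner. The only grounded string is the `fix-lockfiles` command named in this doc.

```python
# Hypothetical log-tail triage for a timed-out or failed Nix job.
# Every marker string below is an assumption for illustration, not output
# guaranteed by Nix, the Determinate Systems installer, or the Gitea runner.
from typing import Sequence

# Command named in docs/ci-nix.md; reaching it implies the build finished.
LOCKFILE_CHECK = "nix run .#fix-lockfiles -- --check"
# Assumed bootstrap markers; extend with your runner's mirror-fetch error text.
BOOTSTRAP_MARKERS = ("nix-installer",)
# Assumed prefix of Nix build-progress lines.
BUILD_MARKERS = ("building '/nix/store/",)


def classify_log_tail(tail: Sequence[str]) -> str:
    """Classify a job by the last recognizable phase in its log tail."""
    for line in reversed(tail):
        if LOCKFILE_CHECK in line:
            return "stale-lockfile-hashes"
        if any(marker in line for marker in BOOTSTRAP_MARKERS):
            return "runner-bootstrap"
        if any(marker in line for marker in BUILD_MARKERS):
            return "cold-cache-timeout"
    return "unclassified"


if __name__ == "__main__":
    # Made-up sample tail, not real CI output.
    sample = [
        "building '/nix/store/abc-hermes.drv'...",
        "$ nix run .#fix-lockfiles -- --check",
        "error: npm lockfile hash mismatch",
    ]
    print(classify_log_tail(sample))  # stale-lockfile-hashes
```

Scanning backwards mirrors the advice to read the log tail first: the last phase that printed anything is usually where the job died.
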

## Debugging and Observability

When a Nix CI failure is not self-explanatory from the Gitea job log, use the
central observability stack before SSH-grepping individual runners. Runner,
operator, and tenant logs are shipped to Molecule Loki/Grafana. Useful failure
classes to search for:

- action mirror fetch failures
- Nix installer failures
- Cachix connectivity or auth failures
- runner job cancellation or timeout events
- disk pressure during Nix store builds

The workflow should keep emitting enough log context to classify those failures
without needing a rerun. If a future fix touches the runner bootstrap, add
diagnostic output there as part of the same change so the next red build on
`main` has a clear owner and root cause.
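
Here is a minimal sketch of turning the failure classes listed above into Loki queries for Grafana Explore. It assumes runner logs carry a `job="gitea-runner"` stream label. That label, the `logql_for` helper, and every regex are hypothetical. Only the LogQL shape is standard: a stream selector followed by a `|~` regex line filter, with the regex in a backtick raw string.

```python
# Hypothetical helper that turns the failure classes above into LogQL queries.
# The stream selector and the regexes are assumptions; replace them with the
# labels and messages Molecule's Loki actually indexes.
from typing import Dict

# Assumed stream selector for Gitea runner logs.
RUNNER_SELECTOR = '{job="gitea-runner"}'

# One case-insensitive regex per failure class (all patterns are assumptions).
FAILURE_PATTERNS: Dict[str, str] = {
    "action-mirror-fetch": "(?i)(clone|fetch).*action",
    "nix-installer": "(?i)nix-installer",
    "cachix": "(?i)cachix",
    "cancel-or-timeout": "(?i)(cancel|timed? ?out)",
    "disk-pressure": "(?i)no space left on device",
}


def logql_for(failure_class: str, selector: str = RUNNER_SELECTOR) -> str:
    """Build a LogQL query with a regex line filter for one failure class."""
    pattern = FAILURE_PATTERNS[failure_class]
    # Backticks make the regex a raw string, so it needs no escaping in LogQL.
    return f"{selector} |~ `{pattern}`"


if __name__ == "__main__":
    for name in FAILURE_PATTERNS:
        print(f"{name}: {logql_for(name)}")
```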