CI red on main: stale 2026-05-08 disk-out event (Tests + Nix); self-resolves on next push #18
Diagnosis
Three CI checks have been red on main since 87a5d39b landed 2026-05-08T21:11Z. All three are stale leftovers from a 2026-05-08 disk-out event, not template-content bugs:

Tests / test (run 94, job 0)
uv pip install -e ".[all,dev]" couldn't write wheels to /tmp/setup-uv-cache. Out of disk.

Tests / e2e (run 94, job 1)
Same os error 28 at line 160 of the log: same uv cache write, same disk.

Nix / nix (ubuntu-latest) (run 91, job 1)
cachix-action failed at 21:13:26Z (likely also disk-related; cachix uses /tmp heavily during cache-pull). With the binary cache unavailable, Nix fell back to building every Python wheel from source (~80 derivations: pyyaml, ruff, pydantic, slack-bolt, …). The job hit the 30-min timeout-minutes after 20m57s of derivation builds. Two failure modes layered on each other.

Root incident
The operator host's /dev/sda1 ran out of space at 2026-05-08 21:14:11Z: rsyslog itself failed to write /var/log/syslog ("No space left on device" event in journalctl). That cascaded into the hermes-agent run (and presumably anything else CI-firing in that window).

Disk is fine now: /dev/sda1 142G/226G (66% full), plenty of headroom. The reds are pure leftover state.

Unblock
Neither tests.yml nor nix.yml carries workflow_dispatch, so there's no UI re-trigger path. The next push (or PR opened) against main re-fires both workflows and they should go green. Per SOP I'm not pushing a trigger-only commit to main directly; let the team's next real change carry the re-fire.

If Nix is still red after a clean disk re-run, that's a separate cachix-action problem: possibly an expired/missing CACHIX_AUTH_TOKEN secret, or the cachix.org cache itself. Investigate the next run's log; expect either green-via-cache-pull or a new specific error from cachix.

Adjacent
internal#194 (operator-host disk WARN) and feedback_disk_gc_must_reach_containerd track this. Worth confirming the GC schedule + emergency trigger (df ≥ 85%) were active that night and whether they fired before 21:14.

Per reference_hermes_runtime_topology, this repo is OSS: fixes here don't get the same orchestrator-driver shortcuts as internal repos. This issue documents diagnosis + unblock path; an actual contributor PR re-fires CI.

Tier
low: stale failure, self-resolves on next main push, no behavioural impact today (the dev team's Hermes-runtime workspaces use a published image, not this repo's CI). Can be closed when the next push re-fires CI green.

Reporter: orchestrator. Adjacent: internal#221 (org-wide CI hygiene umbrella).
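
For whoever confirms the emergency trigger mentioned under Adjacent: a minimal sketch of a df-style threshold check, assuming the trigger compares used space against 85% of the space usable by non-root processes. The function names and the choice of "/" as the path are hypothetical, and the real GC tooling on the operator host may compute this differently (df itself rounds the percentage up).

```python
import shutil

# Threshold taken from the "df >= 85%" emergency trigger described in this issue.
EMERGENCY_THRESHOLD_PCT = 85.0


def df_use_percent(used: int, avail: int) -> float:
    """Approximate df's Use%: used / (used + space available to non-root)."""
    if used + avail == 0:
        return 0.0
    return 100.0 * used / (used + avail)


def should_trigger_gc(path: str = "/",
                      threshold_pct: float = EMERGENCY_THRESHOLD_PCT) -> bool:
    """Return True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return df_use_percent(usage.used, usage.free) >= threshold_pct


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    pct = df_use_percent(usage.used, usage.free)
    print(f"/ is {pct:.1f}% full; emergency GC due: {pct >= EMERGENCY_THRESHOLD_PCT}")
```

Running this from cron or as an early CI step would only show whether the threshold was crossed at that moment; it says nothing about whether the GC actually fired on 2026-05-08.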