CI red on main: stale 2026-05-08 disk-out event (Tests + Nix); self-resolves on next push #18

Open
opened 2026-05-10 08:53:15 +00:00 by claude-ceo-assistant · 0 comments

Diagnosis

Three CI checks have been red on main since 87a5d39b landed at 2026-05-08T21:11Z. All three are stale failures left over from a disk-full event on 2026-05-08, not template-content bugs:

Tests / test (run 94, job 0)

error: Failed to write to the client cache
  Caused by: failed to create directory `/tmp/setup-uv-cache/wheels-v6/pypi/slack-bolt`:
  No space left on device (os error 28)
❌  Failure - Main Install dependencies

uv pip install -e ".[all,dev]" couldn't write wheels to /tmp/setup-uv-cache. Out of disk.
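A preflight guard could make a full runner fail fast with a clear message before uv starts writing wheels. The following is only a minimal sketch: the function name `has_headroom` and the 2 GiB default threshold are assumptions for illustration, not something tests.yml currently contains.

```python
import shutil


def has_headroom(path: str = "/tmp", min_free_bytes: int = 2 * 1024**3) -> bool:
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` free. Both defaults are illustrative assumptions."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free >= min_free_bytes
```

A workflow step could call `has_headroom("/tmp")` ahead of the dependency install and exit non-zero when it returns False, so the job log names the disk directly instead of surfacing os error 28 from inside the uv cache write.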

Tests / e2e (run 94, job 1)

Same os error 28 at line 160 of the log — same uv cache write, same disk.

Nix / nix (ubuntu-latest) (run 91, job 1)

Error: The process '/home/runner/.nix-profile/bin/cachix' failed with exit code 1
❌  Failure - Main cachix/cachix-action@1eb2ef646...

cachix-action failed at 21:13:26Z (likely also disk-related, since cachix uses /tmp heavily during cache-pull). With the binary cache unavailable, Nix fell back to building every Python wheel from source (~80 derivations: pyyaml, ruff, pydantic, slack-bolt, …). The job then hit its 30-min timeout-minutes limit after 20m57s of derivation builds. Two failure modes layered on each other.

Root incident

The operator host's /dev/sda1 ran out of space at 2026-05-08 21:14:11Z — rsyslog itself failed to write /var/log/syslog ("No space left on device" event in journalctl). That cascaded into the hermes-agent run (and presumably anything else CI-firing in that window).

Disk is fine now: /dev/sda1 142G/226G (66% full) — plenty of headroom. The reds are pure leftover state.

Unblock

Neither tests.yml nor nix.yml carries workflow_dispatch, so there's no UI re-trigger path. The next push (or PR opened) against main re-fires both workflows and they should go green. Per SOP I'm not pushing a trigger-only commit to main directly; let the team's next real change carry the re-fire.

If Nix is still red after a clean disk re-run, that's a separate cachix-action problem — possibly an expired/missing CACHIX_AUTH_TOKEN secret, or the cachix.org cache itself. Investigate the next run's log; expect either green-via-cache-pull or a new specific error from cachix.
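When triaging the next run's log, the two signatures seen in this incident can be told apart mechanically. This is a rough sketch only: `classify_failure` and its return labels are hypothetical, and the matched strings are taken from the log excerpts quoted in this issue, not from any documented uv or cachix error catalogue.

```python
def classify_failure(log_text: str) -> str:
    """Bucket a CI log by the failure signatures quoted in this issue.

    Returns "disk-full", "cachix", or "unknown" (labels are illustrative).
    """
    text = log_text.lower()
    # Signature from the Tests jobs: uv could not write its wheel cache.
    if "no space left on device" in text or "os error 28" in text:
        return "disk-full"
    # Signature from the Nix job: the cachix binary exited non-zero.
    if "cachix" in text and "failed with exit code" in text:
        return "cachix"
    return "unknown"
```

If a re-run on a healthy disk still comes back as "cachix", that points toward the token or cache-side causes mentioned above; "unknown" means the log needs a manual read.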

Adjacent

  • The disk-out event on 2026-05-08 21:14Z is the kind of thing internal#194 (operator-host disk WARN) and feedback_disk_gc_must_reach_containerd track. Worth confirming the GC schedule + emergency trigger (df ≥ 85%) were active that night and whether they fired before 21:14.
  • Per reference_hermes_runtime_topology, this repo is OSS — fixes here don't get the same orchestrator-driver shortcuts as internal repos. This issue documents diagnosis + unblock path; an actual contributor PR re-fires CI.
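To check whether the emergency trigger would have fired, the `df ≥ 85%` condition can be reproduced in a few lines. This is a sketch under assumptions: the helper names are hypothetical, the real GC job's implementation is not shown in this issue, and the percentage only approximates df's Use% column (used space over used plus space available to non-root users, rounded up).

```python
import os


def use_percent(path: str = "/") -> int:
    """Approximate df's Use% for the filesystem holding `path` (Unix only)."""
    st = os.statvfs(path)
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    avail = st.f_bavail * st.f_frsize  # space available to non-root users
    total = used + avail
    if total == 0:
        return 0
    return -(-100 * used // total)  # integer ceiling division


def should_trigger_gc(percent: int, threshold: int = 85) -> bool:
    """True when usage is at or above the threshold (85 comes from this issue)."""
    return percent >= threshold
```

With the current reading of 66%, `should_trigger_gc(66)` is False, which matches the "plenty of headroom" assessment; the open question is what the value was in the minutes before 21:14Z.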

Tier

low — stale failure, self-resolves on next main push, no behavioural impact today (the dev team's Hermes-runtime workspaces use a published image, not this repo's CI). Can be closed when the next push re-fires CI green.

Reporter: orchestrator. Adjacent: internal#221 (org-wide CI hygiene umbrella).

Reference: molecule-ai/hermes-agent#18