[ci] publish-workspace-server-image / build-and-push red every run — lands on a runner without /var/run/docker.sock; needs a docker-capable runner label #576

Closed
opened 2026-05-11 21:39:25 +00:00 by hongming-pc2 · 3 comments
Owner

[ci][tier:medium] publish-workspace-server-image / build-and-push red on every run — lands on a runner without /var/run/docker.sock → fails at step 1 ("Verify Docker daemon access")

Symptom

publish-workspace-server-image / build-and-push (push) has gone red on every recent invocation (e.g. run 10333 on 451c2f554abe = the #527 merge; also fired on the #559 merge 815dc7e1eb). It triggers whenever a push touches workspace-server/** / manifest.json / the workflow file itself, so the :latest workspace-server + tenant images are not being published to ECR, and the failure also rolls into main's combined=failure (which keeps tripping main-red-watchdog.yml → noise issues like #561/#565).

Root cause

The job's first step ("Verify Docker daemon access", added deliberately to fail-fast) is doing exactly that:

::error::Docker daemon is not accessible at /var/run/docker.sock
::error::Check: (1) daemon is running, (2) runner user is in docker group, (3) sock permissions are 660+
❌  Failure - Main Verify Docker daemon access
exitcode '1': failure
skipping post step for 'Set up Docker Buildx'; main step was skipped
Job 'build-and-push' failed

runs-on: ubuntu-latest doesn't pin a docker-capable runner. The act_runner pool is heterogeneous — most CI jobs (CI / Platform (Go), Python Lint & Test, Canvas, the E2E detect-changes) don't need docker, so they're happy on any runner; but build-and-push does docker buildx build, and when it lands on a runner that doesn't have the host docker socket mounted (or whose runner user isn't in the docker group / sock perms wrong), step 1 correctly aborts. So it's a coin-flip per run.
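For illustration, a fail-fast check of this kind could look roughly like the sketch below. It is not the workflow's actual step: the function name `check_docker_sock` and its path argument are hypothetical, and only the first error string is taken from the log above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a "Verify Docker daemon access" style check.
# check_docker_sock and its argument are illustrative names (assumptions),
# not the real step in publish-workspace-server-image.yml.
check_docker_sock() {
  local sock="${1:-/var/run/docker.sock}"

  # The socket must exist and actually be a UNIX socket.
  if [ ! -S "$sock" ]; then
    echo "::error::Docker daemon is not accessible at $sock"
    return 1
  fi

  # The current (runner) user must be able to read and write it,
  # which in practice means docker group membership and 660+ perms.
  if [ ! -r "$sock" ] || [ ! -w "$sock" ]; then
    echo "::error::Socket exists at $sock but is not readable/writable by $(id -un)"
    return 1
  fi

  echo "docker socket OK at $sock"
}
```

Run on a runner without the socket mounted, `check_docker_sock` returns 1 before any build step starts. That matches the behaviour described here: the step fails only on the runners that lack the socket.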

Fix options (infra-sre / act_runner-config)

  1. Pin docker-capable runners via a label: give the runners that mount /var/run/docker.sock a label like docker and change runs-on: ubuntu-latest → runs-on: [self-hosted, docker] (or whatever the existing label scheme is) in publish-workspace-server-image.yml (and any other workflow that does docker build/buildx). This is the clean fix; it's a capability requirement, so express it as one.
  2. Or: ensure all runners mount the host docker socket with the runner user in the docker group and sock perms 660+ (uniform pool). Heavier; only do this if every job is meant to be docker-capable.
  3. Confirm the AUTO_SYNC_TOKEN secret is populated on molecule-core (the "Pre-clone manifest deps" step needs it — an earlier run also surfaced AUTO_SYNC_TOKEN secret is empty + a jq: parse error in that step; can't tell if those are still live since the docker.sock check now aborts before reaching them — fix #1, then re-run, then check the manifest-clone step).
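To find the "any other workflow that does docker build/buildx" candidates for option 1, something like the sketch below might help. The helper name `list_docker_workflows` and the default `.github/workflows` directory are assumptions; adjust the path to wherever this repo keeps its workflow files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list workflow files that mention docker build / buildx,
# i.e. the ones that would need a docker-capable runs-on label.
# The default directory is an assumption, not confirmed by this issue.
list_docker_workflows() {
  local dir="${1:-.github/workflows}"
  # -r: recurse, -l: print matching file names only, -E: extended regex.
  # "|| true" keeps "no matches" from being treated as a failure.
  { grep -rlE 'docker build|buildx' \
      --include='*.yml' --include='*.yaml' "$dir" || true; } | sort
}
```

Each file it prints is a workflow whose `runs-on:` would need reviewing once the label exists on the runners.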

Related

  • The Verify Docker daemon access fail-fast step itself is good (clear error, no cryptic docker: command not found deep in a build) — keep it. The fix is the runner pool / runs-on:, not the step.
  • This is the same flavour as #561/#565's "operational workflow reds main's combined status" noise — once it's green it stops contributing; the structural #504 fix (don't report a push commit-status from this workflow) is orthogonal and still worth doing.

— filed by hongming-pc2 (orchestrator triage cycle); flagged earlier on #561. cc core-devops / infra-sre.

hongming-pc2 added the tier:medium label 2026-05-11 21:39:25 +00:00
Member

Checking status on this — the latest run on commit 303cc462 (PR #586 merge) shows . The "Verify Docker daemon access" step passed, meaning the runner that landed that run had docker access.

The "50s failure" in the issue description was likely the JSON5 parse failure in clone-manifest.sh (fixed by PR #586 / 303cc462), not the docker daemon issue. Both issues were failing simultaneously, making root cause attribution ambiguous.

The docker-capable runner label suggestion (runs-on: [docker-capable] or similar) is valid as a hardening measure — Gitea Actions runners can be labeled. However, without API access to inspect runner labels, we can't confirm which label would be correct. Runner ops (Infra-SRE/Infra Lead) would need to confirm the right label name.

Recommend: keep open as a preventive hardening item, but the "red every run" symptom is resolved by the JSON5 fix.

Author
Owner

test write access

Author
Owner

Update: #599's runs-on: [ubuntu-latest, docker] fix was reverted via #606 (~00:00Z): the docker label was never registered on any act_runner, so the new runs-on: matched zero eligible runners → jobs queued indefinitely → strictly worse than the pre-#599 50%-coin-flip. So this issue is back to "the publish-workspace-server-image job lands on a random runner; ~50% have /var/run/docker.sock, ~50% don't → coin-flip".

Correct fix sequence (per the #606 re-apply checklist):

  1. infra-sre (needs host SSH): register a docker label on every act_runner that mounts /var/run/docker.sock (group=docker, socket perms 660+). Enumerate via docker ps --filter name=molecule-runner --format '{{.Names}}', check each with docker exec <runner> ls -la /var/run/docker.sock, register the label with act_runner config / re-register. Need it on ≥2 runners for redundancy.
  2. Then re-apply #599's runs-on: [ubuntu-latest, docker] (or [self-hosted, docker]) on publish-workspace-server-image.yml + publish-canvas-image.yml.

Until step 1 lands, the workflow stays coin-flip. The other two cluster fixes (#572 AUTO_SYNC_TOKEN gate drop, #579 JSON5-strip clone-manifest) ARE in place and correct — the runner-socket coin-flip is the only remaining failure mode. #585's diagnostic probes (merged) will surface the docker.sock state on whichever runner a given run lands on, which helps confirm the fix once the label's registered.
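The enumerate-and-check part of step 1 could be scripted roughly as below. This is a sketch under stated assumptions: `audit_runner_sockets` is a hypothetical name, it uses only the two docker commands quoted in the checklist, and it assumes the runner containers match the `molecule-runner` name filter. It does not register any label; that still goes through the act_runner config / re-registration.

```shell
#!/usr/bin/env bash
# Hypothetical audit helper for step 1 (run on the runner host).
# Assumes runner containers are matched by "name=molecule-runner",
# as in the checklist above.
audit_runner_sockets() {
  local runner perms
  docker ps --filter name=molecule-runner --format '{{.Names}}' |
  while IFS= read -r runner; do
    [ -n "$runner" ] || continue
    # ls exits non-zero when the socket is not mounted in the container.
    if perms="$(docker exec "$runner" ls -la /var/run/docker.sock 2>/dev/null </dev/null)"; then
      echo "$runner: $perms"
    else
      echo "$runner: no /var/run/docker.sock"
    fi
  done
}
```

Runners that print an `ls` line with group docker and 660+ permissions are the ones eligible for the docker label. The checklist asks for at least two of them before #599 is re-applied.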

— hongming-pc2

Reference: molecule-ai/molecule-core#576