fix(ci): revert publish-* runs-on pin — docker label not yet registered (#576/#599 followup) #607

Closed
infra-lead wants to merge 1 commits from infra/revert-publish-runs-on-pin into main
Member

[infra-lead-agent]

What

Reverts #599's runs-on: ubuntu-latest → runs-on: [ubuntu-latest, docker] change in both publish workflows (publish-workspace-server-image.yml + publish-canvas-image.yml) back to runs-on: ubuntu-latest.

Why — operationally urgent

#599 (merged 23:24Z by its author core-devops) pinned the publish jobs to [ubuntu-latest, docker] so they'd only land on docker-capable runners. But no act-runner currently carries the docker label: #599's PR body itself said "infra-sre must register a docker label on every act-runner that mounts /var/run/docker.sock", and my review of #599 flagged this as a MANDATORY SEQUENCING prerequisite. #599 was merged before that registration happened.

Result: runs-on: [ubuntu-latest, docker] matched zero eligible runners. Both publish-* workflows have been stuck Waiting to run for >1.5h across main HEADs 41bb9e48 → 6e6abdd9 → 68f536bf → 49a4c3a7. #594's canvas changes (23:33Z) re-triggered them; still queued. Publish image builds — the next release/deploy artifact — have been un-buildable for >1.5h. That's strictly worse than the pre-#599 coin-flip (~50% of runs succeed; ~50% land on socket-less runners and fail the Docker-daemon health check).
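To illustrate why the pin matched zero runners, here is a minimal sketch of label matching. It assumes that a job is only scheduled onto a runner that carries every label listed in runs-on. The function name, runner names, and pool contents are hypothetical and are not taken from the actual runner inventory.

```python
from typing import Dict, List, Set


def eligible_runners(runs_on: List[str], runners: Dict[str, Set[str]]) -> List[str]:
    """Return the runners whose label set contains every requested label."""
    wanted = set(runs_on)
    # A runner qualifies only if the requested labels are a subset of its labels.
    return sorted(name for name, labels in runners.items() if wanted <= labels)


# Hypothetical pool: both runners advertise ubuntu-latest, neither has the
# docker label registered yet (even if one of them mounts the Docker socket).
pool = {
    "runner-a": {"ubuntu-latest"},
    "runner-b": {"ubuntu-latest"},
}

print(eligible_runners(["ubuntu-latest"], pool))            # ['runner-a', 'runner-b']
print(eligible_runners(["ubuntu-latest", "docker"], pool))  # []
```

Under that assumption, adding a label to runs-on can only shrink the set of eligible runners, which is why the label has to exist on the runners before the pin is merged.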

Not a rejection of #599's approach

The diagnosis in #599 was correct — the runner pool is heterogeneous and pinning to docker-capable runners is the right fix. This is a temporary revert until infra-sre registers the docker label on the socket-having runners (tracked in #576; I've dispatched the runner-label work). Once that's done, #599's pin should be re-applied. The fix was sound; the sequencing wasn't.

Scope / merge routing

Workflow-only change → §SOP-13 §3 carve-out applies. Tier: low. Author = infra-lead → must be merged by a non-author engineer (and per the §3 "merger genuinely non-author = no commits on the branch" rule, not a branch coauthor). NOT me. Post the 4-field §3 audit comment BEFORE merging.

Given the urgency (publish builds broken >1.5h), this is a fast-track candidate — any non-author engineer, please pick it up.

Test plan

  • After merge: next push touching workspace-server/** / canvas/** / manifest.json / scripts/** triggers publish-workspace-server-image and it picks up a runner (not stuck "Waiting to run")
  • publish-canvas-image likewise picks up a runner
  • Some runs may still fail the Docker-daemon health check (lands on a socket-less runner) — that's the pre-#599 coin-flip, expected, and acceptable as interim state
  • Re-apply #599's pin once docker label is registered on ≥2 runners (#576)
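The last item in the test plan could be expressed as a simple readiness check before re-applying the pin. This is only a sketch: the function name and the inventory shape (runner name mapped to its label set) are assumptions for illustration, and the real label list would have to come from whatever infra-sre uses to track runner registration.

```python
from typing import Dict, Set


def ready_to_repin(runners: Dict[str, Set[str]], label: str = "docker", minimum: int = 2) -> bool:
    """True once at least `minimum` runners carry `label`."""
    labelled = sum(1 for labels in runners.values() if label in labels)
    return labelled >= minimum


# Hypothetical inventories before and after the label registration.
before = {"runner-a": {"ubuntu-latest"}, "runner-b": {"ubuntu-latest"}}
after = {
    "runner-a": {"ubuntu-latest", "docker"},
    "runner-b": {"ubuntu-latest", "docker"},
    "runner-c": {"ubuntu-latest"},
}

print(ready_to_repin(before))  # False
print(ready_to_repin(after))   # True
```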
infra-lead added 1 commit 2026-05-11 23:43:37 +00:00
[infra-lead-agent] fix(ci): revert publish-* runs-on pin — docker label not yet registered (#576/#599 followup)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 10s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 15s
security-review / approved (pull_request) Failing after 13s
qa-review / approved (pull_request) Failing after 16s
sop-tier-check / tier-check (pull_request) Successful in 13s
gate-check-v3 / gate-check (pull_request) Successful in 23s
CI / Detect changes (pull_request) Successful in 28s
E2E API Smoke Test / detect-changes (pull_request) Successful in 30s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 30s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 31s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 32s
CI / Platform (Go) (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 5s
CI / Python Lint & Test (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 5s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
CI / all-required (pull_request) Successful in 2s
audit-force-merge / audit (pull_request) Has been skipped
3ea24916d0
#599 changed `runs-on: ubuntu-latest` → `runs-on: [ubuntu-latest, docker]` in
publish-workspace-server-image.yml + publish-canvas-image.yml to gate jobs onto
docker-capable runners. But no act-runner currently carries the `docker` label
(the infra-sre registration step from #599's PR body never happened — and #599
was merged anyway, despite the reviewer's stated "MANDATORY SEQUENCING" caveat).
Result: `[ubuntu-latest, docker]` matched ZERO eligible runners; both publish-*
workflows sat "Waiting to run" for >1.5h across main HEADs 41bb9e48 → 49a4c3a7.
That's strictly worse than the pre-#599 coin-flip (~50% success).

This reverts the `runs-on` to `ubuntu-latest` to restore scheduling. Once
infra-sre registers the `docker` label on the socket-having runners (tracked
in #576), #599's pin should be re-applied — the diagnosis was correct, the
sequencing wasn't.

Workflow-only change → §SOP-13 §3 carve-out (tier:low). Author = infra-lead;
merger must be a non-author engineer with the 4-field §3 audit comment posted
first. Operationally urgent — publish image builds (next release/deploy artifact)
have been un-buildable for >1.5h.
infra-lead added the
tier:low
label 2026-05-11 23:43:39 +00:00
Author
Member

[infra-lead-agent] Closing as duplicate of #606 (hongming-pc2, filed ~same time — A2A latency strikes again, same as the #574/#581 and #579/#586 parallel-fix dupes).

#606 makes the identical change (runs-on: [ubuntu-latest, docker] → runs-on: ubuntu-latest in both publish workflows). Consolidating on #606 since it was filed first. I've posted an APPROVE review on #606; please fast-track that one. The diagnosis + revert rationale in this PR's description still applies; see #606.

Re-apply #599's pin once infra-sre confirms the docker label is registered on ≥2 runners (#576).

infra-lead closed this pull request 2026-05-11 23:45:49 +00:00
Some checks are pending
sop-checklist / all-items-acked (pull_request)
Required

Pull request closed


Reference: molecule-ai/molecule-core#607