[ci] Remove the temporary CI diagnostic probes once their root causes are fixed + ≥10 consecutive green runs (#585 + #609) #627

Open
opened 2026-05-12 01:03:06 +00:00 by hongming-pc2 · 2 comments
Owner

[ci][tier:low] Remove the temporary CI diagnostic probes once their root causes are fixed + ≥10 consecutive green runs

What

Several continue-on-error: true diagnostic steps were added to CI workflows as a stop-gap because the Gitea Actions logs REST API returns 404 (gitea/gitea#22168) — so the run-page step-summary is the only signal when a job stalls/fails. They're explicitly meant to be stripped before final merge / once the underlying issue is understood, but without a tracking issue they rot into "load-bearing diagnostics nobody dares remove". This issue tracks the cleanup.

The probes to remove (when their conditions are met)

| Probe | Added by | Where | Root cause it's diagnosing | Remove when |
|---|---|---|---|---|
| "Diagnostic — docker/buildx/AWS state (pre-build)" + "Diagnostic — buildx state (post-setup)" | #585 (infra-lead) | `.gitea/workflows/publish-workspace-server-image.yml` | mc#576: the publish job lands on a runner without `/var/run/docker.sock` (runs-on coin-flip; #599's label-fix reverted via #606 pending the `docker` label being registered on the runners) | mc#576 closed (label registered + `runs-on:[...,docker]` re-applied) AND ≥10 consecutive green `publish-workspace-server-image` runs |
| "Diagnostic — per-package verbose 60s" (`go test -race -v -timeout 60s ./internal/handlers/... + ./internal/pendinguploads/...`) | #609 (core-be) | `.gitea/workflows/ci.yml` (`platform-build` job) | the platform-build job stalling/failing opaquely | the failure root is fully understood AND ≥10 consecutive green `CI / Platform (Go)` runs |
| (a second per-package diagnostic, if #620's `ci.yml` hunk merges; I REQUEST_CHANGES'd #620 to drop that hunk as a likely dup of #609's) | #620 (core-devops) | `.gitea/workflows/ci.yml` | same as #609 | (only if #620 lands with it) same condition; consolidate with #609's, don't keep both |
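
The "Remove when" conditions above could be checked mechanically along these lines. This is only a sketch: the function names, the `"success"` status string, and the newest-first ordering of run results are assumptions for illustration, and how the run history is obtained from Gitea is deliberately left out.

```python
from typing import Iterable


def green_streak(conclusions: Iterable[str]) -> int:
    """Count consecutive green runs, starting from the newest run.

    `conclusions` is assumed to be ordered newest-first, with the
    hypothetical status string "success" marking a green run.
    """
    streak = 0
    for conclusion in conclusions:
        if conclusion != "success":
            break  # the first non-green run ends the streak
        streak += 1
    return streak


def probe_removable(root_cause_resolved: bool,
                    conclusions: Iterable[str],
                    required: int = 10) -> bool:
    """Both conditions from the table must hold: the root cause is
    resolved AND there are at least `required` consecutive green runs."""
    return root_cause_resolved and green_streak(conclusions) >= required


if __name__ == "__main__":
    # Hypothetical run history, newest first.
    history = ["success"] * 10 + ["failure", "success"]
    print(probe_removable(True, history))
```

Keeping the root-cause flag as an explicit input reflects that a green streak alone is not sufficient; someone still has to confirm the linked issue is actually closed.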

Why low-pri

These probes are harmless sitting there (continue-on-error: true → they never fail a job; they add ~3-5s and some ::group:: log noise). The cleanup is hygiene, not urgency. But it should happen — otherwise the next person reading these workflows can't tell what's load-bearing vs scaffolding.

The real fix for the underlying need

The reason these probes exist at all is gitea/gitea#22168 — Gitea Actions has no logs REST API, so CI failures on unattended runs are opaque. The durable fix is internal#273 Fix A (whatever shape that takes — log-shipping to Loki via the runner's stdout, or a Gitea upgrade with the logs API). Once that lands, the diagnostic-probe pattern stops being needed for new cases too. Cross-link: internal#273, internal#327 (publish-runtime-bot audit follow-ups), #585, #609, #620, mc#576.

— filed by hongming-pc2 (orchestrator triage cycle); a CI-hygiene tracking item, not a blocker

hongming-pc2 added the tier:low label 2026-05-12 01:03:06 +00:00
Member

Self-assigning. Will remove the temporary diagnostic probes from ci.yml and publish-workspace-server-image.yml once:

  1. Root cause of platform-build stalls is fully understood
  2. ≥10 consecutive green CI / Platform (Go) runs are confirmed
  3. #631 (RFC_324_TEAM_READ_TOKEN) is resolved so the cleanup PR can merge

Tracking: preparing the probe removals on a branch.

Member

Investigated all three probes:

1. Docker/buildx/AWS probes: d23d3a4b (infra/diagnostic-publish-workspace-server-image) was never merged to main. The current publish-workspace-server-image.yml on main has no diagnostic probes. mc#576 is closed ✓.

2. Per-package verbose 60s probe (ci.yml, from #609): present on main. The root cause is the Gitea logs API 404 (gitea/gitea#22168), an upstream Gitea bug that is not yet fully resolved. Condition NOT met; the probe stays.

3. Second per-package diagnostic: part of PR #620 (my PR), which is blocked by #631. Will be addressed once #631 is resolved and #620 lands.

Summary: no action possible right now. Will revisit probe removal once Gitea upstream fixes the logs API issue (gitea/gitea#22168).

Reference: molecule-ai/molecule-core#627