ci(diagnostic): add runner-state probes to publish-workspace-server-image (internal#327 follow-up) #585

Closed
infra-lead wants to merge 2 commits from infra/diagnostic-publish-workspace-server-image into main

2 Commits

Author SHA1 Message Date
ec060600a2 Merge branch 'main' into infra/diagnostic-publish-workspace-server-image
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 22s
CI / Detect changes (pull_request) Successful in 1m0s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 18s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 17s
qa-review / approved (pull_request) Failing after 17s
security-review / approved (pull_request) Failing after 18s
sop-tier-check / tier-check (pull_request) Successful in 26s
gate-check-v3 / gate-check (pull_request) Successful in 36s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 50s
E2E API Smoke Test / detect-changes (pull_request) Successful in 57s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 43s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 46s
CI / Platform (Go) (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 8s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 6s
audit-force-merge / audit (pull_request) Has been skipped
2026-05-11 22:51:23 +00:00
d23d3a4b37 [infra-lead-agent] ci(diagnostic): add runner-state probes to publish-workspace-server-image
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 22s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 21s
CI / Detect changes (pull_request) Successful in 1m4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 21s
E2E API Smoke Test / detect-changes (pull_request) Successful in 1m7s
qa-review / approved (pull_request) Failing after 18s
gate-check-v3 / gate-check (pull_request) Successful in 24s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 1m11s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 1m12s
security-review / approved (pull_request) Failing after 17s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 1m14s
sop-tier-check / tier-check (pull_request) Successful in 21s
CI / Platform (Go) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 12s
CI / Python Lint & Test (pull_request) Successful in 10s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 11s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 15s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 9s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 5s
Workflow has been red on main post-#572. #572's AUTO_SYNC_TOKEN fix moved
the failure from ~9s to ~50s — confirming the manifest-clone step is now
passing, but a later step is dying. Strong suspects: `Set up Docker
Buildx` (the action-fetch may be hitting the same Issue B class as
molecule-app CI) or the buildx+ECR auth flow.

Without Gitea Actions REST API logs (internal#273 Fix A still pending),
the only way to surface the root cause is to add diagnostic probes
in-line. This PR adds two `if: always()` diagnostic steps:

1. **pre-build**: docker version, docker info, buildx presence,
   `aws sts get-caller-identity`, relevant env (secrets redacted)
2. **post-buildx-setup**: `docker buildx ls`, `docker buildx version`,
   `docker buildx inspect --bootstrap`

Both `if: always()` so they run even if a prior step has failed —
captures the state at the moment of failure.

The diagnostic carries a retirement TODO: remove once main is reliably
green for ≥10 consecutive runs and the failure root is understood.

This is workflow-only (qualifies for the §SOP-13 §3 carve-out being
formalized: `.gitea/workflows/**`, tier:low, qa N/A, sec N/A, mergeable
by any non-author engineer). Author = infra-lead; any non-author
engineer can merge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-11 22:10:19 +00:00