RCA: finish post-#1766 path-aware all-required migration across CI tooling #1859

Closed
opened 2026-05-26 01:53:32 +00:00 by agent-researcher · 3 comments
Member

RCA — root cause

Post-#1766, .gitea/workflows/ci.yml intentionally moved CI / all-required to a no-needs ci-meta polling sentinel, but several downstream CI tools still assume the old all-required.needs contract or encode stale static required-context lists. This leaves cleanup work spread across #1776, #1738, #1739, and the related cutover-style #1828 incident instead of one engineer-owned migration completion task.

Evidence

  • .gitea/workflows/ci.yml:474-483 — documents that all-required deliberately has no needs: and polls path-relevant statuses because Gitea 1.22/act_runner can leave a needs-based sentinel skipped/pending.
  • .gitea/workflows/ci.yml:503-543 — the sentinel now runs detect-changes.py and builds required contexts dynamically from REQUIRE_PLATFORM, REQUIRE_CANVAS, and REQUIRE_SCRIPTS.
  • .gitea/scripts/ci-required-drift.py:204-218 — still describes F1 as jobs missing from sentinel needs, which is stale after #1766.
  • .gitea/scripts/ci-required-drift.py:506-509 — rendered remediation still tells engineers to add jobs back to all-required.needs, which is the opposite of the new contract.
  • .gitea/scripts/prod-auto-deploy.py:23-26 and :127-131 — production wait-ci uses a static default required-context list and does not reuse the path-aware sentinel's own decision model; #1776 showed this can wait on contexts the new sentinel no longer treats as universally required.
  • .gitea/workflows/publish-workspace-server-image.yml:293-297 — production auto-deploy still gates side effects through that helper.
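Here is a minimal Python sketch of the split contract described above. It assumes detect-changes.py exposes REQUIRE_PLATFORM, REQUIRE_CANVAS, and REQUIRE_SCRIPTS as "true"/"false" strings. Several pieces are hypothetical and only illustrate the shape of the problem: the context names, the flag-to-context mapping (including REQUIRE_SCRIPTS mapping to shellcheck), and the helpers path_aware_contexts and stale_waits. The real lists live in ci.yml and prod-auto-deploy.py.

```python
from typing import Mapping

# Hypothetical context names; the real ones live in ci.yml / prod-auto-deploy.py.
ALWAYS_REQUIRED = ("CI / all-required (push)",)
PATH_GATED = {
    "REQUIRE_PLATFORM": "CI / platform (push)",
    "REQUIRE_CANVAS": "CI / canvas (push)",
    "REQUIRE_SCRIPTS": "CI / shellcheck (push)",  # assumed mapping
}
# Illustrative shape of a static default list: waits on everything, every push.
STATIC_DEFAULT = ALWAYS_REQUIRED + tuple(PATH_GATED.values())


def path_aware_contexts(flags: Mapping[str, str]) -> set[str]:
    """Contexts a deploy gate should wait on for one push, given REQUIRE_* flags."""
    required = set(ALWAYS_REQUIRED)
    for flag, context in PATH_GATED.items():
        # Flags are assumed to be "true"/"false" strings; a missing flag means not required.
        if flags.get(flag, "false").lower() == "true":
            required.add(context)
    return required


def stale_waits(flags: Mapping[str, str]) -> set[str]:
    """Contexts the static list would wait on that the path-aware model does not require."""
    return set(STATIC_DEFAULT) - path_aware_contexts(flags)


if __name__ == "__main__":
    # A docs-only push: detect-changes marks nothing path-gated as required.
    docs_only = {"REQUIRE_PLATFORM": "false", "REQUIRE_CANVAS": "false", "REQUIRE_SCRIPTS": "false"}
    print(sorted(stale_waits(docs_only)))
```

For a docs-only push, stale_waits returns all three path-gated contexts. A static wait list keeps blocking on statuses that the sentinel does not treat as required for that push, which is the shape of the problem #1776 surfaced.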

Suggested fix

Create one cleanup PR with these concrete pieces:

  1. Update .gitea/scripts/ci-required-drift.py so F1 understands the post-#1766 contract: if all-required has no needs: but contains the status-polling path-aware sentinel, do not file “missing needs” findings. Keep F1b, but only for genuine typos in a needs block when one exists.
  2. Rewrite the ci-required-drift.py issue body/resolution text so it no longer instructs engineers to add jobs back to all-required.needs.
  3. Update the wait-ci path of .gitea/scripts/prod-auto-deploy.py so it waits only on CI / all-required (push) plus the required external/static contexts. Alternatively, have it import/reuse the sentinel's path-aware context derivation instead of maintaining a second, static context contract.
  4. Add tests under .gitea/scripts/tests/ covering: no-needs path-aware sentinel is valid; old needs typo still fails; prod wait-ci does not wait on platform/canvas/shellcheck when detect-changes says they are not required.
  5. After merge, close or update #1776, #1738, and #1739 as resolved by the migration-completion PR. #1828 is a separate product cutover, but it teaches the same operational lesson: compatibility shims and their downstream callers need explicit completion tracking.
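The F1/F1b split in fix 1 could be checked roughly like this. The sketch operates on an already-parsed ci.yml mapping, such as the output of yaml.safe_load. Several pieces are hypothetical, and ci-required-drift.py's real structure may differ: f1_findings, is_polling_sentinel, the finding strings, and the heuristic that identifies the polling sentinel by a step that runs detect-changes.py.

```python
from typing import Any

SENTINEL_JOB = "all-required"


def is_polling_sentinel(job: dict[str, Any]) -> bool:
    """Hypothetical heuristic: the job is the path-aware sentinel if a step runs detect-changes.py."""
    return any("detect-changes.py" in str(step.get("run", "")) for step in job.get("steps", []))


def f1_findings(workflow: dict[str, Any]) -> list[str]:
    """Hypothetical post-#1766 F1/F1b check over a parsed ci.yml mapping."""
    jobs: dict[str, Any] = workflow.get("jobs", {})
    sentinel = jobs.get(SENTINEL_JOB)
    if sentinel is None:
        return [f"F1: job {SENTINEL_JOB!r} is missing"]
    needs = sentinel.get("needs")
    if needs is None:
        # Post-#1766 contract: a no-needs sentinel that polls statuses is valid,
        # so no "missing needs" findings are filed for it.
        if is_polling_sentinel(sentinel):
            return []
        return [f"F1: {SENTINEL_JOB!r} has no needs and no status-polling step"]
    if isinstance(needs, str):
        needs = [needs]  # YAML allows `needs: job` as a bare string
    # F1b: a needs block still exists, so every entry must name a real job.
    return [f"F1b: {SENTINEL_JOB}.needs references unknown job {n!r}" for n in needs if n not in jobs]
```

These are the first two test cases from fix 4. A no-needs sentinel that runs detect-changes.py yields no findings. A needs entry like "platfrom" still produces an F1b finding.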

Confidence

High — the cited files directly encode both sides of the split contract, and the prior RCA comments on #1776/#1738/#1739 show the same mechanism recurring.

Author
Member

Routing note from Researcher: this is still open, unassigned, and has no follow-up comments as of 2026-05-26T02:25Z.

@agent-dev-a this looks like a good SWARM pickup because the PR shape is already concrete and bounded: update .gitea/scripts/ci-required-drift.py to understand the post-#1766 no-needs sentinel, update .gitea/scripts/prod-auto-deploy.py wait-ci to depend on CI / all-required (push) or the same path-aware context model, then add focused tests under .gitea/scripts/tests/. @hongming likely owns final GO because this touches production deploy gating and the CI-required contract.
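For the wait-ci half, one option is a pure verdict function that the polling loop calls on each iteration. With it, the gate blocks only on CI / all-required (push) plus explicitly listed external/static contexts. wait_ci_verdict and its "pass"/"wait"/"fail" return values are hypothetical. The state strings assume Gitea commit-status states such as success, pending, failure, and error.

```python
from typing import Iterable, Mapping


def wait_ci_verdict(statuses: Mapping[str, str], required: Iterable[str]) -> str:
    """Hypothetical deploy-gate verdict: 'pass', 'wait', or 'fail'.

    statuses maps context name -> latest state for the commit.
    required is e.g. {"CI / all-required (push)"} plus any external/static contexts.
    Contexts not listed in required are ignored entirely.
    """
    verdict = "pass"
    for ctx in required:
        state = statuses.get(ctx)
        if state in ("failure", "error"):
            return "fail"  # any hard failure stops the deploy immediately
        if state != "success":
            verdict = "wait"  # missing, pending, or other non-success state: keep polling
    return verdict
```

Keeping the verdict separate from the HTTP polling turns the "prod wait-ci does not wait on platform/canvas/shellcheck" case into a plain unit test with no Gitea API calls.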

Related status: #1828’s active repair PR #1852 is still open/mergeable but blocked on qa-review / approved, security-review / approved, and pending sop-checklist / na-declarations, so the memory outage repair has not landed yet.

Author
Member

Status ping: still open and unassigned; no engineer pickup after c49135. @agent-dev-a can take the bounded CI tooling cleanup PR shape here; @hongming final GO still needed because this touches production deploy / required-check semantics.

Author
Member

Follow-up status ping: no engineer pickup comment is visible yet after PM dispatch; @hongming remains the best secondary routing target if @agent-dev-a/Kimi does not pick up the post-#1766 CI migration cleanup.

Reference: molecule-ai/molecule-core#1859