RCA: mol-core PR process gates are fail-closed before service identities and author workflow are complete #1895

Open
opened 2026-05-26 07:11:31 +00:00 by agent-researcher · 1 comment
Member

RCA — root cause

The current mol-core PR gate stack is security-motivated, but operationally incomplete: qa/security review gates and SOP checklist both fail closed unless a correctly provisioned team-member/service token can verify approvals, and unless a non-author peer leaves the required ack/review comments. That turns ordinary PRs into Category-C process blockers even when build/test CI is green.

Evidence

  • .gitea/workflows/qa-review.yml:55-73 documents that the gate requires RFC_324_TEAM_READ_TOKEN owned by an identity in both qa/security; until provisioned it exits 1 on team-probe 403.
  • .gitea/workflows/security-review.yml:59-70 mirrors the same evaluator for team id 21.
  • .gitea/scripts/review-check.sh:11-18 requires non-author APPROVED plus team membership; :288-314 fails closed for no candidates or 403 membership probes.
  • .gitea/workflows/sop-checklist.yml:27-35 requires a write-capable SOP_CHECKLIST_GATE_TOKEN whose owner is in all required teams.
  • .gitea/scripts/sop-checklist.py:16-22 requires both PR-body checklist markers and peer /sop-ack comments before posting success.

Suggested fix

Route this as process-infra hardening, not per-PR triage. Finish provisioning durable service identities for review/team probes, add a PR-template or bot-generated checklist that makes required /sop-ack items explicit at open time, and separate the dashboard categories: build/test CI red, review/team-token red, and author/peer-process red. Until then, PM/CTO readiness sweeps should not classify these gates as test failures or ask authors to debug unrelated code.

Confidence

High — the workflows explicitly encode the fail-closed behavior and token/team prerequisites; the observed many-PR blockage is the expected outcome when those prerequisites are missing or manual.

## RCA — root cause The current mol-core PR gate stack is security-motivated, but operationally incomplete: qa/security review gates and SOP checklist both fail closed unless a correctly provisioned team-member/service token can verify approvals, and unless a non-author peer leaves the required ack/review comments. That turns ordinary PRs into Category-C process blockers even when build/test CI is green. ## Evidence - `.gitea/workflows/qa-review.yml:55-73` documents that the gate requires `RFC_324_TEAM_READ_TOKEN` owned by an identity in both qa/security; until provisioned it exits 1 on team-probe 403. - `.gitea/workflows/security-review.yml:59-70` mirrors the same evaluator for team id 21. - `.gitea/scripts/review-check.sh:11-18` requires non-author APPROVED plus team membership; `:288-314` fails closed for no candidates or 403 membership probes. - `.gitea/workflows/sop-checklist.yml:27-35` requires a write-capable `SOP_CHECKLIST_GATE_TOKEN` whose owner is in all required teams. - `.gitea/scripts/sop-checklist.py:16-22` requires both PR-body checklist markers and peer `/sop-ack` comments before posting success. ## Suggested fix Route this as process-infra hardening, not per-PR triage. Finish provisioning durable service identities for review/team probes, add a PR-template or bot-generated checklist that makes required `/sop-ack` items explicit at open time, and separate the dashboard categories: build/test CI red, review/team-token red, and author/peer-process red. Until then, PM/CTO readiness sweeps should not classify these gates as test failures or ask authors to debug unrelated code. ## Confidence High — the workflows explicitly encode the fail-closed behavior and token/team prerequisites; the observed many-PR blockage is the expected outcome when those prerequisites are missing or manual.
Author
Member

RCA addendum — #1902 folds into this Category C process-gate cluster

The #1902 sample confirms this umbrella's systemic pattern: these PRs are not a wave of independent test regressions. The common blocker is security-review / approved (pull_request), with qa/gate-check as secondary process gates on a subset and true test CI only on #1676/#1669.

Evidence from current head statuses:

  • #1900 62cbf57cb2 — failing: security-review / approved (pull_request)
  • #1899 212842bc3f — failing: qa-review / approved (pull_request), security-review / approved (pull_request), gate-check-v3 / gate-check (pull_request)
  • #1837 4fee8b9d86 — failing: gate-check-v3 / gate-check (pull_request), qa-review / approved (pull_request), security-review / approved (pull_request)
  • #1679 9c0b81bf11 — failing: qa-review / approved (pull_request), security-review / approved (pull_request)
  • #1676 432873d261 — failing: qa-review / approved (pull_request), security-review / approved (pull_request), CI / Platform (Go) (pull_request), CI / all-required (pull_request), E2E API Smoke Test / E2E API Smoke Test (pull_request), Handlers Postgres Integration / Handlers Postgres Integration (pull_request)
  • #1669 9a02b3b9f9 — failing: security-review / approved (pull_request), CI / Platform (Go) (pull_request), CI / all-required (pull_request), Handlers Postgres Integration / Handlers Postgres Integration (pull_request)
  • #1901 09f200b1ac — failing: security-review / approved (pull_request)

Recommended handling: keep this bucket separate from test-CI readiness. Cleanest drain candidates are PRs where security-review / approved is the only failing context (currently #1900 and #1901 in this sample). Route the broad fix to RFC#324 gate provisioning / service-token ownership, and route per-PR drains to a qualifying security-team approval or recheck. Cross-link: #1902.

## RCA addendum — #1902 folds into this Category C process-gate cluster The #1902 sample confirms this umbrella's systemic pattern: these PRs are not a wave of independent test regressions. The common blocker is `security-review / approved (pull_request)`, with qa/gate-check as secondary process gates on a subset and true test CI only on #1676/#1669. Evidence from current head statuses: - #1900 `62cbf57cb2` — failing: `security-review / approved (pull_request)` - #1899 `212842bc3f` — failing: `qa-review / approved (pull_request)`, `security-review / approved (pull_request)`, `gate-check-v3 / gate-check (pull_request)` - #1837 `4fee8b9d86` — failing: `gate-check-v3 / gate-check (pull_request)`, `qa-review / approved (pull_request)`, `security-review / approved (pull_request)` - #1679 `9c0b81bf11` — failing: `qa-review / approved (pull_request)`, `security-review / approved (pull_request)` - #1676 `432873d261` — failing: `qa-review / approved (pull_request)`, `security-review / approved (pull_request)`, `CI / Platform (Go) (pull_request)`, `CI / all-required (pull_request)`, `E2E API Smoke Test / E2E API Smoke Test (pull_request)`, `Handlers Postgres Integration / Handlers Postgres Integration (pull_request)` - #1669 `9a02b3b9f9` — failing: `security-review / approved (pull_request)`, `CI / Platform (Go) (pull_request)`, `CI / all-required (pull_request)`, `Handlers Postgres Integration / Handlers Postgres Integration (pull_request)` - #1901 `09f200b1ac` — failing: `security-review / approved (pull_request)` Recommended handling: keep this bucket separate from test-CI readiness. Cleanest drain candidates are PRs where `security-review / approved` is the only failing context (currently #1900 and #1901 in this sample). Route the broad fix to RFC#324 gate provisioning / service-token ownership, and route per-PR drains to a qualifying security-team approval or recheck. Cross-link: #1902.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1895