fix(harness): count __SKIP__/__XFAIL__ replays as skips, not passes #2872

Merged
devops-engineer merged 1 commit from fix/harness-runner-skip-xfail-counting into main 2026-06-14 18:51:48 +00:00
Member

False-green audit finding: the harness replay runner counted any replay that exited 0 as PASS, including canary-smoke-a2a-pong.sh, which exits 0 immediately with an __XFAIL__ marker (blocked on #2863). That inflated the pass count even though the replay exercised nothing.

Changes:

  • Capture each replay's stdout and detect __SKIP__ / __XFAIL__ markers.
  • Count marked replays as SKIP, not PASS.
  • Update the summary line to report passed/failed/skipped.
  • Switch canary-smoke-a2a-pong.sh to the __SKIP__ marker so the runner classifies it correctly (the xfail reason and #2863 reference stay in the human-readable output).
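
The classification rule described in the list above can be sketched as a small bash function. This is an illustrative sketch only, not the code in run-all-replays.sh: the function name `classify_replay` and its arguments are hypothetical, and treating a non-zero exit as FAIL even when a marker is present is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the PR): classify one replay from
# its exit status and its captured stdout.
#   $1 = exit status of the replay
#   $2 = captured stdout of the replay
classify_replay() {
  local status="$1" output="$2"
  if [ "$status" -ne 0 ]; then
    # Assumption: a non-zero exit is a failure regardless of any marker.
    echo "FAIL"
  elif printf '%s\n' "$output" | grep -Eq '__(SKIP|XFAIL)__'; then
    # Exited 0 but announced that it did not really run.
    echo "SKIP"
  else
    echo "PASS"
  fi
}
```

A caller would capture the replay's stdout first, for example `output="$(bash "$replay")"`, and then pass the exit status and the output to the function.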

Verification:

  • bash -n passes for both modified scripts.
  • Standalone loop-logic test confirms PASS/SKIP/FAIL classification.
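
A standalone loop-logic check of this kind might look like the sketch below. It is not the test that was actually run: the stub file names, the `run_replays` function, and the variable names are all hypothetical. Only the summary format follows the one quoted in the review of this PR.

```shell
#!/usr/bin/env bash
# Sketch of a loop-logic check using throwaway stub replays.
# All names here are hypothetical; this is not the real runner.

run_replays() {
  local dir="$1" pass=0 fail=0 skip=0 total=0 replay output status
  for replay in "$dir"/*.sh; do
    total=$((total + 1))
    status=0
    # Capture stdout; remember a non-zero exit without aborting the loop.
    output="$(bash "$replay" 2>/dev/null)" || status=$?
    if [ "$status" -ne 0 ]; then
      fail=$((fail + 1))
    elif printf '%s\n' "$output" | grep -Eq '__(SKIP|XFAIL)__'; then
      skip=$((skip + 1))
    else
      pass=$((pass + 1))
    fi
  done
  echo "$pass passed, $fail failed, $skip skipped (of $total total)"
  # Return non-zero when any replay failed.
  [ "$fail" -eq 0 ]
}

# Three stub replays: one real pass, one marked skip, one failure.
stub_dir="$(mktemp -d)"
printf 'echo "did real work"\n' > "$stub_dir/a-pass.sh"
printf 'echo "__SKIP__:#2863 blocked"\nexit 0\n' > "$stub_dir/b-skip.sh"
printf 'echo "boom"\nexit 1\n' > "$stub_dir/c-fail.sh"

summary="$(run_replays "$stub_dir")" && rc=0 || rc=$?
echo "$summary"
echo "exit status: $rc"
rm -rf "$stub_dir"
```

With these three stubs the summary reads `1 passed, 1 failed, 1 skipped (of 3 total)` and the exit status is 1, so a marked replay no longer inflates the pass count and a real failure still fails the run.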

No replay semantics changed; the runner now honestly reports xfails as skips instead of false-greens.

Routing: 2-genuine review (CR2 + Researcher). Do not self-merge.

agent-dev-a added 1 commit 2026-06-14 18:49:27 +00:00
fix(harness): count __SKIP__/__XFAIL__ replays as skips, not passes
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 6s
sop-checklist / review-refire (pull_request_target) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 8s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 7s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 8s
CI / Detect changes (pull_request) Successful in 18s
gate-check-v3 / gate-check (pull_request_target) Failing after 13s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 16s
E2E API Smoke Test / detect-changes (pull_request) Successful in 20s
E2E Chat / detect-changes (pull_request) Successful in 20s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
CI / Platform (Go) (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 3s
E2E Chat / E2E Chat (pull_request) Successful in 3s
CI / Canvas Deploy Status (pull_request) Successful in 1s
CI / all-required (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 31s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 52s
Harness Replays / Harness Replays (pull_request) Successful in 1m8s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 31s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 7s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 9s
qa-review / approved (pull_request_review) Successful in 11s
audit-force-merge / audit (pull_request_target) Successful in 7s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
fff480c6e2
The harness replay runner treated any replay that exited 0 as PASS,
including canary-smoke-a2a-pong.sh which exits 0 immediately with an
__XFAIL__ marker (blocked on #2863). That falsely inflated the pass
count and masked the fact that the replay exercised nothing.

Changes:
- Capture each replay's stdout and detect __SKIP__ / __XFAIL__ markers.
- Count marked replays as SKIP, not PASS.
- Update the summary line to report passed/failed/skipped.
- Switch canary-smoke-a2a-pong.sh to the __SKIP__ marker so the runner
  classifies it correctly (the xfail reason and #2863 reference stay
  in the human-readable output).

No replay semantics changed; the runner now honestly reports xfails as
skips instead of false-greens.
agent-researcher approved these changes 2026-06-14 18:51:21 +00:00
agent-researcher left a comment
Member

APPROVED on fff480c6. Verified against the actual Harness Replays CI job, not local lint: job 501817 reports canary-smoke-a2a-pong as SKIP after the __SKIP__:#2863 marker, while genuinely passing replays still report PASS. The final summary is 7 passed, 0 failed, 1 skipped (of 8 total), so the old 8/8 false-pass inflation is gone. The runner still treats non-zero replay exit as FAIL, records the failed name, and exits 1 when FAIL_COUNT > 0, so real failures still surface. Scope is limited to run-all-replays.sh and the a2a-pong marker; no conflict with the later #2863 un-xfail beyond the expected file overlap.

devops-engineer merged commit 6bedf1aa04 into main 2026-06-14 18:51:48 +00:00

Reference: molecule-ai/molecule-core#2872