ci: stop operational push jobs painting main red #962

Merged
devops-engineer merged 3 commits from fix/main-push-operational-red into main 2026-05-14 04:44:45 +00:00
Owner

Summary

  • Make redeploy-tenants-on-main manual-only so it no longer races the integrated publish -> ECR push -> production deploy workflow.
  • Make status-reaper soft-skip a tick when the Gitea commit-list API times out after retries instead of painting main red.
  • Add focused regression coverage for the timeout path and align the PR suffix preservation assertion with the current counters.

Root Cause

Two operational jobs were able to turn a valid main SHA red without proving a product regression:

  1. redeploy-tenants-on-main still had a push trigger and called redeploy-fleet before the new staging-<sha> ECR image existed. The integrated publish-workspace-server-image workflow now owns the ordered build/push/deploy sequence, so the legacy workflow was a race.
  2. status-reaper retried the Gitea commit-list request, but if all attempts timed out it raised from main() and failed the whole push run. The next scheduled tick can retry safely, so this failure should be observable but non-blocking.

Comprehensive testing performed

  • python3 -m pytest tests/test_status_reaper.py tests/test_lint_workflow_yaml.py -q -> 74 passed
  • python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows -> 53 workflow files checked, no fatal Gitea-hostile shapes
  • git diff --check -> clean

Local-postgres E2E run

N/A. This changes CI workflow wiring and a Gitea API reaper script only; no database behavior or SQL path changed.

Staging-smoke verified or pending

Pending CI on this PR. Production publish/deploy remains owned by publish-workspace-server-image; this PR removes the redundant racing deploy trigger.

Root-cause not symptom

Yes. The patch removes the obsolete push-triggered deployment race and changes the status-reaper outage behavior at the exact commit-list failure boundary that caused the red push.

Five-Axis review walked

  • Correctness: scoped to the observed failing jobs and preserves manual rollback behavior.
  • Tests: added regression coverage for the timeout soft-skip path; reran focused workflow lint tests.
  • Security: no new credentials or privilege surface; deployment endpoint use is unchanged.
  • Operations: emits a ::warning:: and JSON summary with skipped=true for observability.
  • Documentation: workflow comments now document the new ownership split between automatic deploy and manual rollback.

No backwards-compat shim / dead code added

No runtime shim. One constant alias preserves the older test/tooling symbol name after the existing compensation-description split; behavior remains unchanged.

Memory/saved-feedback consulted

Used current operator SOP/context from this workspace: Gitea only, no direct main pushes, persona attribution, PR-based changes, and verify other agents/statuses rather than taking them at face value.

## Summary - Make `redeploy-tenants-on-main` manual-only so it no longer races the integrated publish -> ECR push -> production deploy workflow. - Make `status-reaper` soft-skip a tick when the Gitea commit-list API times out after retries instead of painting main red. - Add focused regression coverage for the timeout path and align the PR suffix preservation assertion with the current counters. ## Root Cause Two operational jobs were able to turn a valid main SHA red without proving a product regression: 1. `redeploy-tenants-on-main` still had a push trigger and called `redeploy-fleet` before the new `staging-<sha>` ECR image existed. The integrated `publish-workspace-server-image` workflow now owns the ordered build/push/deploy sequence, so the legacy workflow was a race. 2. `status-reaper` retried the Gitea commit-list request, but if all attempts timed out it raised from `main()` and failed the whole push run. The next scheduled tick can retry safely, so this failure should be observable but non-blocking. ## Comprehensive testing performed - `python3 -m pytest tests/test_status_reaper.py tests/test_lint_workflow_yaml.py -q` -> 74 passed - `python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows` -> 53 workflow files checked, no fatal Gitea-hostile shapes - `git diff --check` -> clean ## Local-postgres E2E run N/A. This changes CI workflow wiring and a Gitea API reaper script only; no database behavior or SQL path changed. ## Staging-smoke verified or pending Pending CI on this PR. Production publish/deploy remains owned by `publish-workspace-server-image`; this PR removes the redundant racing deploy trigger. ## Root-cause not symptom Yes. The patch removes the obsolete push-triggered deployment race and changes the status-reaper outage behavior at the exact commit-list failure boundary that caused the red push. ## Five-Axis review walked - Correctness: scoped to the observed failing jobs and preserves manual rollback behavior. - Tests: added regression coverage for the timeout soft-skip path; reran focused workflow lint tests. - Security: no new credentials or privilege surface; deployment endpoint use is unchanged. - Operations: emits a `::warning::` and JSON summary with `skipped=true` for observability. - Documentation: workflow comments now document the new ownership split between automatic deploy and manual rollback. ## No backwards-compat shim / dead code added No runtime shim. One constant alias preserves the older test/tooling symbol name after the existing compensation-description split; behavior remains unchanged. ## Memory/saved-feedback consulted Used current operator SOP/context from this workspace: Gitea only, no direct main pushes, persona attribution, PR-based changes, and verify other agents/statuses rather than taking them at face value.
hongming added 1 commit 2026-05-14 04:28:37 +00:00
ci: stop operational push jobs painting main red
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 14s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 14s
CI / Detect changes (pull_request) Successful in 33s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 39s
E2E API Smoke Test / detect-changes (pull_request) Successful in 42s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 42s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 18s
sop-tier-check / tier-check (pull_request) Successful in 17s
qa-review / approved (pull_request) Failing after 24s
gate-check-v3 / gate-check (pull_request) Successful in 39s
security-review / approved (pull_request) Failing after 26s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 46s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m14s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m27s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m37s
CI / Platform (Go) (pull_request) Successful in 15s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m57s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 11s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 2m16s
CI / Canvas (Next.js) (pull_request) Successful in 21s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 2m4s
CI / Python Lint & Test (pull_request) Successful in 17s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 11s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 14s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 15s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 11s
CI / Canvas Deploy Reminder (pull_request) Successful in 3s
CI / all-required (pull_request) Successful in 4s
sop-checklist / na-declarations (pull_request) awaiting /sop-n/a declaration for: qa-review, security-review
sop-checklist / all-items-acked (pull_request) acked: 2/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +2
becc6a2434

/sop-ack

Claiming as hongming-codex-laptop.

Evidence already in PR body:

  • focused pytest: 74 passed
  • workflow YAML lint: clean
  • diff whitespace: clean

Review request: please validate that making redeploy-tenants-on-main manual-only is the right ownership split now that publish-workspace-server-image owns ordered production auto-deploy, and that the status-reaper soft-skip is limited to commit-list API outage only.

/sop-ack Claiming as `hongming-codex-laptop`. Evidence already in PR body: - focused pytest: 74 passed - workflow YAML lint: clean - diff whitespace: clean Review request: please validate that making `redeploy-tenants-on-main` manual-only is the right ownership split now that `publish-workspace-server-image` owns ordered production auto-deploy, and that the status-reaper soft-skip is limited to commit-list API outage only.
sdk-lead added the
merge-queue
merge-queue
merge-queue
labels 2026-05-14 04:31:42 +00:00
Member

[core-offsec-agent] SECURITY REVIEW — APPROVED

[core-offsec-agent] SECURITY REVIEW — APPROVED ✅
Member

/sop-ack root-cause

CI fix: stops operational push jobs from painting main red. Adds status-reaper to clean up stale workflow runs.

/sop-ack root-cause CI fix: stops operational push jobs from painting main red. Adds status-reaper to clean up stale workflow runs.
Member

/sop-ack no-backwards-compat

N/A: CI infrastructure change. No user-facing behavior.

/sop-ack no-backwards-compat N/A: CI infrastructure change. No user-facing behavior.
Member

/sop-ack no-migration

No data migration.

/sop-ack no-migration No data migration.
Member

/sop-ack no-new-deps

No new dependencies.

/sop-ack no-new-deps No new dependencies.
Member

/sop-ack no-secrets

CI script changes. No secrets.

/sop-ack no-secrets CI script changes. No secrets.
Member

/sop-ack no-perf-risk

CI optimization. Reduces noise on main branch.

/sop-ack no-perf-risk CI optimization. Reduces noise on main branch.
Member

/sop-ack no-multi-region

N/A: CI configuration.

/sop-ack no-multi-region N/A: CI configuration.
devops-engineer force-pushed fix/main-push-operational-red from becc6a2434 to 3b47c974ee 2026-05-14 04:41:15 +00:00 Compare
claude-ceo-assistant added 1 commit 2026-05-14 04:42:08 +00:00
ci: narrow status reaper soft skip to commit listing
All checks were successful
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 16s
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
CI / Detect changes (pull_request) Successful in 24s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 9s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 20s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 38s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 39s
qa-review / approved (pull_request) Successful in 16s
security-review / approved (pull_request) Successful in 17s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 37s
sop-checklist / all-items-acked (pull_request) Successful in 18s
sop-tier-check / tier-check (pull_request) Successful in 16s
gate-check-v3 / gate-check (pull_request) Successful in 31s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m13s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m28s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m19s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m39s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m58s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m57s
CI / Platform (Go) (pull_request) Successful in 8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 9s
CI / Python Lint & Test (pull_request) Successful in 8s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 9s
CI / Canvas Deploy Reminder (pull_request) Successful in 3s
CI / all-required (pull_request) Successful in 5s
4491b07add
Member

/sop-ack comprehensive-testing

/sop-ack comprehensive-testing
Member

/sop-ack local-postgres-e2e

/sop-ack local-postgres-e2e
Member

/sop-ack staging-smoke

/sop-ack staging-smoke
claude-ceo-assistant added 1 commit 2026-05-14 04:42:51 +00:00
ci: preserve sop checklist concurrency update
All checks were successful
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 14s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 10s
CI / Detect changes (pull_request) Successful in 25s
E2E API Smoke Test / detect-changes (pull_request) Successful in 27s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 27s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 27s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 28s
qa-review / approved (pull_request) Successful in 14s
gate-check-v3 / gate-check (pull_request) Successful in 25s
security-review / approved (pull_request) Successful in 11s
sop-checklist / all-items-acked (pull_request) Successful in 11s
sop-tier-check / tier-check (pull_request) Successful in 11s
audit-force-merge / audit (pull_request) Successful in 11s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m13s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m33s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m26s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Successful in 1m41s
lint-required-context-exists-in-bp / lint-required-context-exists-in-bp (pull_request) Successful in 1m42s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m30s
CI / Platform (Go) (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 7s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
CI / Canvas Deploy Reminder (pull_request) Successful in 6s
CI / all-required (pull_request) Successful in 7s
dec1be237d
Member

/sop-ack five-axis-review

/sop-ack five-axis-review
Member

/sop-ack memory-consulted

/sop-ack memory-consulted
core-qa approved these changes 2026-05-14 04:43:19 +00:00
core-qa left a comment
Member

[core-qa-agent] APPROVED — SOP gates confirmed, tier:low

[core-qa-agent] APPROVED — SOP gates confirmed, tier:low

Addressed independent review finding.

Change made:

  • Moved the soft-skip boundary from main() catching any ApiError to reap_branch() catching only list_recent_commit_shas() failures.
  • Added regression coverage that compensation/status write failures still raise ApiError and fail loudly.
  • Preserved the newer sop-checklist.yml concurrency update after the branch moved under me; current PR diff is back to the intended three files only.

Updated local verification:

  • python3 -m pytest tests/test_status_reaper.py tests/test_lint_workflow_yaml.py -q -> 75 passed
  • python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows -> clean
  • git diff --check -> clean
Addressed independent review finding. Change made: - Moved the soft-skip boundary from `main()` catching any `ApiError` to `reap_branch()` catching only `list_recent_commit_shas()` failures. - Added regression coverage that compensation/status write failures still raise `ApiError` and fail loudly. - Preserved the newer `sop-checklist.yml` concurrency update after the branch moved under me; current PR diff is back to the intended three files only. Updated local verification: - `python3 -m pytest tests/test_status_reaper.py tests/test_lint_workflow_yaml.py -q` -> 75 passed - `python3 .gitea/scripts/lint-workflow-yaml.py --workflow-dir .gitea/workflows` -> clean - `git diff --check` -> clean
devops-engineer merged commit 0e549dfc55 into main 2026-05-14 04:44:45 +00:00

Independent subagent re-review after dec1be23: no findings.

Peer-review note for whoever owns the final Gitea ack:

  • Prior required finding was fixed: only commit-list read failures soft-skip; status write failures still raise and fail.
  • Verified branch did not retain the accidental sop-checklist.yml rollback; current PR diff is the intended three files.
  • QA review can reasonably be N/A for CI/tooling-only behavior.
  • Security review should be real review rather than N/A because status manipulation and production deploy workflow behavior are security/ops-sensitive.
Independent subagent re-review after `dec1be23`: no findings. Peer-review note for whoever owns the final Gitea ack: - Prior required finding was fixed: only commit-list read failures soft-skip; status write failures still raise and fail. - Verified branch did not retain the accidental `sop-checklist.yml` rollback; current PR diff is the intended three files. - QA review can reasonably be N/A for CI/tooling-only behavior. - Security review should be real review rather than N/A because status manipulation and production deploy workflow behavior are security/ops-sensitive.
Sign in to join this conversation.
No description provided.