[RCA] Production auto-deploy times out before required push CI drains #1775

Closed
opened 2026-05-24 05:11:23 +00:00 by agent-researcher · 1 comment

MECHANISM: The failing main status is the post-image production deploy lane, not the required PR gate. On molecule-core main `50720fb84aa416d6bddb9f8246790fa7ea098c0f` (merge of PR #1766), `.gitea/workflows/publish-workspace-server-image.yml:232` defines the `deploy-production` / `Production auto-deploy` job, with `timeout-minutes: 75` at line 242 and a CI-wait step at line 297. That step runs `.gitea/scripts/prod-auto-deploy.py`, whose `wait_for_ci_context` loop at `.gitea/scripts/prod-auto-deploy.py:189` waits for every context in `DEFAULT_REQUIRED_CONTEXTS` (lines 23-30), but only for `CI_STATUS_TIMEOUT_SECONDS`, which defaults to 1800 seconds (lines 216-227). The deploy job therefore fails after about 30 minutes, well inside its 75-minute job timeout, whenever push CI is still pending, even though the branch-protection contexts turn green shortly afterward.
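The failure mode described above can be reproduced with a minimal polling loop. This is a sketch under stated assumptions, not the contents of `prod-auto-deploy.py`: the function name `wait_for_contexts`, the injected `fetch_statuses` callable, the 15-second poll interval, and the state strings are all hypothetical and chosen only so the loop can be run without a Gitea server.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_for_contexts(
    fetch_statuses: Callable[[], Mapping[str, str]],
    required: Iterable[str],
    timeout_seconds: float = 1800,   # mirrors the 1800 s default cited in the RCA
    poll_seconds: float = 15,        # assumed interval, not taken from the script
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until every required context is green, one fails, or time runs out.

    Returns "success", "failure", or "timeout".
    """
    required = list(required)
    deadline = clock() + timeout_seconds
    while True:
        statuses = fetch_statuses()
        # A context that has not reported yet is treated as still pending.
        states = [statuses.get(name, "pending") for name in required]
        if any(state in ("failure", "error") for state in states):
            return "failure"
        if all(state == "success" for state in states):
            return "success"
        if clock() >= deadline:
            # Contexts are still pending: this is the case seen on 50720fb8.
            return "timeout"
        sleep(poll_seconds)
```

With a simulated clock, a context that turns green a few minutes after the 1800-second deadline yields `"timeout"`, while the same context with a longer deadline yields `"success"`. That is the timing interaction this RCA describes: the wait gives up while CI is still healthy but unfinished.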

EVIDENCE: The Gitea commit status for `50720fb8` showed `publish-workspace-server-image / Production auto-deploy (push)` as failure, updated `2026-05-24T05:00:22Z`, with description "Failing after 30m14s". The same status aggregation showed the required contexts turning green later: `CI / Platform (Go) (push)` at `05:03:51Z`, `CI / Canvas (Next.js) (push)` at `05:03:54Z`, and `CI / Shellcheck (E2E scripts) (push)` at `05:03:56Z`, all after the deploy wait had already expired. The affected commit is `50720fb8` / PR #1766, which changed `.gitea/workflows/ci.yml` and `.gitea/scripts/tests/test_ci_workflow_bookkeeping.py`; the deploy failure is a timing interaction with that push-CI drain, not evidence that #1766 broke the required gate.
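The size of the miss can be computed directly from the timestamps above. The sketch below assumes the three green timestamps fall on the same date as the deploy failure (2026-05-24), since the evidence gives only their time of day; the helper name `parse_utc` is illustrative.

```python
from datetime import datetime, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO-8601 timestamp with a trailing Z as an aware UTC datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


deploy_failed_at = parse_utc("2026-05-24T05:00:22Z")

# Date portion is assumed; the evidence lists only HH:MM:SSZ for these contexts.
green_at = {
    "CI / Platform (Go) (push)": parse_utc("2026-05-24T05:03:51Z"),
    "CI / Canvas (Next.js) (push)": parse_utc("2026-05-24T05:03:54Z"),
    "CI / Shellcheck (E2E scripts) (push)": parse_utc("2026-05-24T05:03:56Z"),
}

# Seconds between the deploy job failing and each required context going green.
gaps = {
    name: int((stamp - deploy_failed_at).total_seconds())
    for name, stamp in green_at.items()
}
longest_gap = max(gaps.values())

for name, gap in gaps.items():
    print(f"{name}: green {gap}s after deploy failure")
print(f"longest gap: {longest_gap}s")
```

The last required context went green 214 seconds (3m34s) after the deploy job was marked failed. Note that "30m14s" is the duration of the whole job, not of the wait step alone, so this figure bounds how late CI was relative to the failure rather than giving an exact timeout to configure.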

RECOMMENDED FIX SHAPE: In `molecule-core`, align the production auto-deploy wait policy with current push-CI duration. The responsible files are `.gitea/workflows/publish-workspace-server-image.yml` and `.gitea/scripts/prod-auto-deploy.py`: either raise `CI_STATUS_TIMEOUT_SECONDS` above observed push-CI drain time, wait on the `CI / all-required (push)` sentinel if PR #1766 makes that sentinel authoritative, or decouple production auto-deploy reporting from red-main health so a post-deploy timeout does not look like required CI failure. No code patch from Researcher; this RCA is for engineer dispatch.
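For engineer dispatch, the first two options can be pictured as a single wait-policy decision. The following is an illustrative sketch only, not a proposed patch: `WaitPolicy`, `resolve_wait_policy`, and the `use_sentinel` flag are hypothetical names, and `PER_JOB_CONTEXTS` lists only the three contexts named in the evidence, not the full `DEFAULT_REQUIRED_CONTEXTS` from the script.

```python
from typing import Mapping, NamedTuple, Tuple

# Sentinel context named in the RCA; whether it is authoritative depends on PR #1766.
SENTINEL_CONTEXT = "CI / all-required (push)"

# Only the contexts quoted in the evidence; the real list may be longer.
PER_JOB_CONTEXTS: Tuple[str, ...] = (
    "CI / Platform (Go) (push)",
    "CI / Canvas (Next.js) (push)",
    "CI / Shellcheck (E2E scripts) (push)",
)


class WaitPolicy(NamedTuple):
    contexts: Tuple[str, ...]
    timeout_seconds: int


def resolve_wait_policy(
    env: Mapping[str, str],
    use_sentinel: bool = False,
    default_timeout: int = 1800,
) -> WaitPolicy:
    """Build the wait policy from the environment.

    Option 1: override CI_STATUS_TIMEOUT_SECONDS to exceed push-CI drain time.
    Option 2: wait on the single sentinel context instead of each job context.
    """
    raw = env.get("CI_STATUS_TIMEOUT_SECONDS", "").strip()
    timeout = int(raw) if raw else default_timeout
    if timeout <= 0:
        raise ValueError("CI_STATUS_TIMEOUT_SECONDS must be a positive integer")
    contexts = (SENTINEL_CONTEXT,) if use_sentinel else PER_JOB_CONTEXTS
    return WaitPolicy(contexts, timeout)
```

In a real script the `env` argument would typically be `os.environ`. The third option, decoupling deploy reporting from red-main health, is a status-reporting change rather than a wait-policy change and is not covered by this sketch.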


## Unified CI-failure hypothesis check 2026-05-24 (Researcher)

Verdict: **split, not unified**. I spot-checked 5 PRs across templates, molecule-core, and hermes-agent. **0/5 match RCA #1775**. RCA #1775 is a `molecule-core` main/post-push `publish-workspace-server-image / Production auto-deploy` timeout at ~30m while push CI drains; the sampled failures are PR-head workflow failures or review-gate failures.

| PR | Failing job(s) | RCA #1775 match? | Classification |
|---|---|---:|---|
| gemini-cli #14 | `CI / Template validation (runtime)` on push and pull_request, failing after 7s / 1s | No | Template runtime validation issue, separate from production deploy timeout |
| crewai #7 | `CI / Template validation (runtime)` on push and pull_request, failing after 7s / 1s | No | Same template runtime validation class as gemini-cli #14 |
| molecule-core #1768 | `qa-review / approved`, `security-review / approved`, `CI / Platform (Go)`, `CI / all-required`, `Handlers Postgres Integration` | No | Mixed approval-gate + real core CI/test failures |
| molecule-core #1770 | `qa-review / approved`, `security-review / approved` | No | Review-gate failure, not test/deploy timeout |
| hermes-agent #26 | `Tests / test`, failing after 10m34s | No | Hermes test failure, independent |

Recommendation: do **not** treat RCA #1775 as the single unblocker for these PRs. The template PRs likely share a template-runtime validation class; molecule-core #1768 needs its own CI/test failure read; molecule-core #1770 is review-gate state; hermes-agent #26 is an independent test failure.

Reference: molecule-ai/molecule-core#1775