From 3d8a0a58fa35829443d48afb36fdeb8159d3c70f Mon Sep 17 00:00:00 2001
From: Hongming Wang
Date: Fri, 1 May 2026 22:28:35 -0700
Subject: [PATCH] ci(auto-sync): App-token dispatch + ubuntu-latest + workflow_dispatch
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

auto-sync-main-to-staging.yml hasn't fired since 2026-04-29 despite
multiple staging→main promotes. The promote PR #2442 (Phase 2) has been
wedged on `mergeStateStatus: BEHIND` for hours because staging is
missing the merge commit from PR #2437.

Three compounding bugs, all fixed here:

1. **GitHub no-recursion suppresses the `on: push` trigger.** When the
   merge queue lands a staging→main promote, the resulting push to main
   is "by GITHUB_TOKEN", and per
   https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow
   that push event does NOT fire any downstream workflows. Verified
   empirically against SHA 76c604fb (PR #2437): exactly ONE workflow
   fired on that push, `publish-workspace-server-image`, dispatched
   explicitly by auto-promote-staging.yml's polling tail with an App
   token (the documented #2357 workaround). Every other `on: push`
   workflow on main, including auto-sync, was silently suppressed.

   Same fix extended here: auto-promote-staging.yml's polling tail now
   ALSO dispatches `auto-sync-main-to-staging.yml --ref main` via the
   App token after the merge lands. App-initiated dispatch propagates
   `workflow_run` cascades, which is what the publish tail relies on
   too. On failure, the step emits `::error::` with the recovery
   command; the operator runs it once and the next promote self-heals.

   auto-sync-main-to-staging.yml gains `workflow_dispatch:` so it can
   be invoked by the dispatch above, or manually if a future promote
   also misses (defense in depth).

2. **`runs-on: [self-hosted, macos, arm64]` was wrong for this repo.**
   The comment claimed "matches the rest of this repo's workflows",
   which is false: this is the ONLY workflow in
   molecule-core/.github/workflows/ with a non-ubuntu runs-on. It is a
   copy-paste artefact from molecule-controlplane (which IS private and
   has a Mac runner). molecule-core has no Mac runner registered, so
   even when the trigger DID fire (the 3 historic manual-UI merges),
   the job would have sat unassigned if the runner were offline.
   Switched to `ubuntu-latest` to match every other workflow in this
   repo.

3. **The `on: push` trigger remains** as a defense-in-depth path for
   the rare case of a manual UI merge by a real user (which uses their
   PAT and DOES fire downstream workflows; confirmed via the 2026-04-29
   d35a2420 run with `triggering_actor=HongmingWang-Rabbit` that fired
   16 workflows including auto-sync). Belt-and-suspenders.

Long-term: switching auto-promote's `gh pr merge --auto` call to use
the App token (instead of GITHUB_TOKEN) would let `on: push` triggers
fire naturally and obviate the need for the explicit dispatches in the
polling tail. Tracked in #2357; out of scope here.

Operator recovery for the current Phase 2 wedge: after this lands on
staging, dispatch auto-sync once via
`gh workflow run auto-sync-main-to-staging.yml --ref main` to backfill
the missed sync from 76c604fb. PR #2442 will go from BEHIND → CLEAN and
auto-merge.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .github/workflows/auto-promote-staging.yml    | 18 ++++++++++++
 .../workflows/auto-sync-main-to-staging.yml   | 28 +++++++++++++++++--
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/auto-promote-staging.yml b/.github/workflows/auto-promote-staging.yml
index a62010f2..de6ce46a 100644
--- a/.github/workflows/auto-promote-staging.yml
+++ b/.github/workflows/auto-promote-staging.yml
@@ -364,3 +364,21 @@ jobs:
           else
             echo "::error::Failed to dispatch publish-workspace-server-image. Run manually: gh workflow run publish-workspace-server-image.yml --ref main"
           fi
+
+          # ALSO dispatch auto-sync-main-to-staging.yml. Same root cause as
+          # publish above (issue #2357): the merge-queue-initiated push to
+          # main is by GITHUB_TOKEN → no `on: push` triggers fire downstream.
+          # Without this dispatch, every staging→main promote leaves staging
+          # one merge commit BEHIND main, which silently dead-locks the NEXT
+          # promote PR as `mergeStateStatus: BEHIND` because main's
+          # branch-protection has `strict: true`. Verified empirically on
+          # 2026-05-02 against PR #2442 (Phase 2 promote): only the explicit
+          # publish-workspace-server-image dispatch fired on the previous
+          # promote SHA 76c604fb, while auto-sync silently no-op'd, leaving
+          # staging behind for ~24h until manually bridged.
+          if gh workflow run auto-sync-main-to-staging.yml \
+              --repo "$REPO" --ref main 2>&1; then
+            echo "::notice::Dispatched auto-sync-main-to-staging on ref=main as molecule-ai App — staging will absorb the new main merge commit via PR + merge queue."
+          else
+            echo "::error::Failed to dispatch auto-sync-main-to-staging. Run manually: gh workflow run auto-sync-main-to-staging.yml --ref main"
+          fi
diff --git a/.github/workflows/auto-sync-main-to-staging.yml b/.github/workflows/auto-sync-main-to-staging.yml
index 36ab63f7..9a0140d7 100644
--- a/.github/workflows/auto-sync-main-to-staging.yml
+++ b/.github/workflows/auto-sync-main-to-staging.yml
@@ -60,6 +60,24 @@ name: Auto-sync main → staging
 on:
   push:
     branches: [main]
+  # workflow_dispatch lets:
+  #   1. Operators manually backfill a missed sync (e.g. after a manual
+  #      UI merge that the runner missed).
+  #   2. auto-promote-staging.yml's polling tail explicitly invoke us
+  #      after the promote PR lands. This is load-bearing: when the
+  #      merge queue lands a promote-PR merge, the resulting push to
+  #      `main` is "by GITHUB_TOKEN", and per GitHub's no-recursion
+  #      rule (https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
+  #      that push event does NOT fire any downstream workflows. The
+  #      `on: push` trigger above is silently dead for the very pattern
+  #      we exist to handle. Verified empirically 2026-05-02 against
+  #      SHA 76c604fb (PR #2437 staging→main): only ONE workflow fired
+  #      (publish-workspace-server-image, dispatched explicitly by
+  #      auto-promote's polling tail with an App token). Every other
+  #      `on: push: branches: [main]` workflow — including this one —
+  #      was suppressed. Until the underlying merge call moves to an
+  #      App token, an explicit dispatch is the only reliable path.
+  workflow_dispatch:
 
 permissions:
   contents: write
@@ -71,8 +89,14 @@ concurrency:
 
 jobs:
   sync-staging:
-    # Self-hosted Mac mini matches the rest of this repo's workflows.
-    runs-on: [self-hosted, macos, arm64]
+    # ubuntu-latest matches every other workflow in this repo. The
+    # earlier `[self-hosted, macos, arm64]` was a copy-paste artefact
+    # from the molecule-controlplane repo (which IS private and uses a
+    # Mac runner) — molecule-core has no Mac runner registered, so the
+    # job sat unassigned whenever the trigger fired. Verified 2026-05-02:
+    # this is the ONLY workflow in molecule-core/.github/workflows/ with
+    # a non-ubuntu runs-on.
+    runs-on: ubuntu-latest
     steps:
       - name: Checkout staging
         uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4