infra(ci): route publish/deploy ship jobs to dedicated publish lane (internal#462) #1376
Reference: molecule-ai/molecule-core#1376
internal#462: dedicated publish/deploy lane (Option 1, labelled `publish` lane)
Problem
PR#1350 (CTO-reported canvas-message-loss fix) was merged, but its production image build sat ~25min in the shared runner FIFO behind ordinary PR required-CI, directly delaying a user-facing fix. Urgent prod-deploy publish builds must not FIFO-compete with PR-CI.
Change
Retargets the 7 post-merge ship jobs across 5 workflows from `runs-on: ubuntu-latest` to `runs-on: publish`:
- publish-workspace-server-image.yml: `build-and-push`, `deploy-production`
- publish-canvas-image.yml: `build-and-push`
- publish-runtime.yml: `publish`, `cascade`
- redeploy-tenants-on-main.yml: `redeploy`
- redeploy-tenants-on-staging.yml: `redeploy`
`publish-runtime-autobump.yml` is deliberately not moved (it is `pull_request`-triggered = PR-CI, not a ship job).
The `publish` label resolves ONLY to the reserved `molecule-runner-publish-*` sub-pool defined by the already-merged operator-config scaffolding (`config.publish.yaml` + `publish-lane-ensure.sh`, internal#394/#399). These runners sit OUTSIDE the managed 1..20 range, so the lane is never auto-drained / recycled / drift-flagged, and PR-CI can never consume it.
HARD MERGE PRECONDITION (do NOT merge until satisfied)
This MUST NOT merge until the publish-lane runners are registered and advertising the `publish` label. Targeting an unregistered label queues jobs indefinitely with zero eligible runners: the exact #599/#576 `docker`-label failure mode (see the prior revert comment this PR removes). Lane registration is a GO-gated live-fleet mutation (`publish-lane-ensure.sh ALLOW_FLEET_MUTATION=1`, requires explicit Hongming in-chat GO per feedback_prod_apply_needs_hongming_chat_go). Dry-run plan verified clean on the operator host (2 runners, `config.publish.yaml`, outside 1..20).
Sequencing
- `publish-lane-ensure.sh ALLOW_FLEET_MUTATION=1` (registers `molecule-runner-publish-1/2`).
- […] `[publish, release]`.
Review
Genuine non-author review required (author identity: infra-sre). No bypass / admin-merge / CI-skip.
Five-axis review: core-devops (non-author; author = infra-sre)
Verdict: APPROVE (code), with one Required operability finding that is mitigated by process but NOT enforced by a merge guard.
Correctness: PASS
- […] publish-workspace-server-image (`build-and-push`, `deploy-production`), publish-canvas-image (`build-and-push`), publish-runtime (`publish`, `cascade`), redeploy-tenants-on-main (`redeploy`), redeploy-tenants-on-staging (`redeploy`). Verified with `grep -rn runs-on .gitea/workflows`: only these 7 read `runs-on: publish`.
- `publish-runtime-autobump` correctly EXCLUDED: confirmed `on: pull_request` (PR-CI). Moving it would have routed PR-CI onto the reserved lane; not done. Correct.
- […] (`push:main`, `push:staging`, push tag `runtime-v*`, `workflow_dispatch`); zero `pull_request` triggers among them. No other `ubuntu-latest` ship job exists.
Architecture: PASS
Ship/PR-CI split is clean and complete.
`publish` resolves only to the out-of-1..20 `molecule-runner-publish-*` sub-pool (`config.publish.yaml`, internal#394/#399), so the lane is never auto-drained/recycled/drift-flagged and PR-CI can never consume it. Reserved capacity for the ship path is the correct fix for the PR#1350 ~25min FIFO delay.
Security: NEUTRAL (PASS)
Moving `deploy-production`/`redeploy` to a distinct label changes only scheduling, not secret scope or who can trigger. Triggers/permissions blocks unchanged in the diff. No new exposure.
Operability: REQUIRED FINDING (process-mitigated, not guard-enforced)
Required: The hard-ordering invariant (PR must NOT merge until `molecule-runner-publish-1/2` are registered and advertising `publish`) is documented thoroughly (PR body HARD MERGE PRECONDITION + Sequencing section, all 7 in-job comments, and internal#462 comment 32607) but is enforced by process only. There is currently NO mechanical guard: no `do-not-merge`/hold label, PR is not draft, `mergeable: True`, and no CI assertion verifies a `publish`-labelled runner is live before allowing merge. Comment 32607 confirms 0/20 runners advertise `publish` today, so an out-of-order merge reproduces the documented #599/#576 failure mode (7 ship jobs queue indefinitely, zero eligible runners), including the prod deploy path.
This is acceptable to APPROVE the code because: (a) merge is independently double-gated on Hongming's in-chat GO + lane-runner registration (feedback_prod_apply_needs_hongming_chat_go), and (b) the precondition is unambiguously documented at every surface a merger would look. Recommendation (strongly preferred, not a code blocker): add a `do-not-merge:lane-not-registered` label now and/or set the PR to draft until step 2 of Sequencing is verified, so the guard is mechanical rather than relying on every future merge actor reading the body. If an auto-promote / merge-queue path can reach this PR, that recommendation escalates to a hard blocker.
Readability/Performance: PASS
In-job comments are clear and cite the failure mode + issue refs. No perf concerns.
Identity: core-devops (id 52), distinct from author infra-sre. No admin-merge, no bypass, no self-approve. I am NOT merging — merge remains double-gated on this review + CI green AND Hongming GO-gated lane-runner registration. Code is sound; ship/PR-CI split correct and complete; the ordering risk is real but documented at every surface and process-gated. APPROVE with the strong recommendation to add a mechanical merge guard.
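The mechanical merge guard recommended in this review could look roughly like the sketch below. It is a hedged illustration only: the `Runner` record, the `merge_allowed` function, and the `molecule-runner-1..20` names with an `ubuntu-latest` label are hypothetical stand-ins for whatever runner inventory the fleet actually exposes, not a Gitea API. It shows only the decision rule: allow the merge when at least one online runner advertises `publish`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runner:
    """Hypothetical inventory record (assumption; not a Gitea API type)."""
    name: str
    labels: tuple
    online: bool


def eligible_runners(runners, label):
    """Return the online runners that advertise `label`."""
    return [r for r in runners if r.online and label in r.labels]


def merge_allowed(runners, label="publish", minimum=1):
    """True only when at least `minimum` online runners advertise `label`."""
    return len(eligible_runners(runners, label)) >= minimum


# State described in the review: 0/20 managed runners advertise `publish`.
# Runner names and the `ubuntu-latest` label here are illustrative assumptions.
before = [Runner(f"molecule-runner-{i}", ("ubuntu-latest",), True) for i in range(1, 21)]

# State after lane registration (runner names and labels from the PR body).
after = before + [
    Runner("molecule-runner-publish-1", ("publish", "release"), True),
    Runner("molecule-runner-publish-2", ("publish", "release"), True),
]

print(merge_allowed(before))  # False
print(merge_allowed(after))   # True
```

A check like this only helps if it runs on a lane that already has runners; scheduling it on `publish` itself would queue forever in exactly the failure mode it is meant to catch.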
Title changed from "infra(ci): route publish/deploy ship jobs to dedicated publish lane (internal#462)" to "WIP: infra(ci): route publish/deploy ship jobs to dedicated publish lane (internal#462)"
[core-security-agent] N/A: non-security-touching (CI-only: 5 workflow files change `runs-on` from `ubuntu-latest` to the `publish` label for publish/deploy jobs. Dedicated runner sub-pool, no code changes, no secrets, no exec.)
Title changed from "WIP: infra(ci): route publish/deploy ship jobs to dedicated publish lane (internal#462)" to "infra(ci): route publish/deploy ship jobs to dedicated publish lane (internal#462)"
Review: APPROVED
Routing publish/ship/deploy jobs to a dedicated `publish` lane is the right approach: it removes post-merge image builds from FIFO competition with PR required-CI, which is the core cause of the ~25 min delays mentioned in the `publish-workspace-server-image.yml` comment.
Changes reviewed (all 5 workflows):
- `publish-canvas-image.yml`, `publish-runtime.yml`, `publish-workspace-server-image.yml`: `runs-on: publish` for ship-path jobs
- `redeploy-tenants-on-main.yml`, `redeploy-tenants-on-staging.yml`: `runs-on: publish` for deploy jobs
- […] `publish` lane to avoid cross-lane wait
Critical precondition documented (good): The comment explicitly calls out "HARD DEPENDENCY: this MUST land AFTER the publish-lane runners are registered/advertising `publish`". This correctly reuses the lesson from #599 (`docker` label queued indefinitely with zero eligible runners). The reviewer should confirm the publish-lane runner registration is tracked as a prerequisite in internal#462 before this merges.
`continue-on-error: true` preserved: Correct. The Phase 3 surface-broken-workflows-without-blocking pattern is preserved.
One note for release-manager: The staging redeploy (`redeploy-tenants-on-staging.yml`) now also uses `publish`; worth verifying the staging publish runner pool is included in the rollout plan (internal#462) alongside the production one.
No blocking issues. LGTM.
[core-security-agent] Security Review: APPROVE
Reviewed: 5 workflow files. All publish/deploy jobs switch `runs-on` from `ubuntu-latest` to `publish` (dedicated runner pool). Well-documented with internal#462 context and explicit HARD DEPENDENCY note (publish label must be registered on runners before merge). No security concerns: runner labels are infrastructure configuration, not code. No issues. Ready to merge.
[core-qa-agent] QA Review: APPROVE
Reviewed: all 5 workflow files. Consistent pattern: `runs-on: ubuntu-latest` -> `runs-on: publish` across publish-canvas-image.yml, publish-runtime.yml, publish-workspace-server-image.yml, redeploy-tenants-on-main.yml, redeploy-tenants-on-staging.yml. Pre-existing `continue-on-error: true` preserved (documented as pre-existing mc#774 mask, not introduced by this PR). No test changes needed for workflow-only PR. No issues. Ready to merge.
[core-devops-agent] LGTM: dedicated publish runner lane (internal#462).
`runs-on: publish` is gated on the `publish` label being registered on ≥1 runner (documented in PR body + workflow comment). The previous `docker` label failure is correctly attributed to #576 (targeted before any runner advertised the label). Hard-dependency precondition is clearly documented; no action needed on this PR until the runner is live.
Review: APPROVED
Routing publish/ship/deploy jobs to a dedicated `publish` lane is the right approach: it removes post-merge image builds from FIFO competition with PR required-CI, which is the core cause of the ~25 min delays mentioned in the `publish-workspace-server-image.yml` comment.
Changes reviewed (all 5 workflows):
- `publish-canvas-image.yml`, `publish-runtime.yml`, `publish-workspace-server-image.yml`: `runs-on: publish` for ship-path jobs
- `redeploy-tenants-on-main.yml`, `redeploy-tenants-on-staging.yml`: `runs-on: publish` for deploy jobs
- […] `publish` lane to avoid cross-lane wait
Critical precondition documented (good): The comment explicitly calls out "HARD DEPENDENCY: this MUST land AFTER the publish-lane runners are registered/advertising `publish`". This correctly reuses the lesson from #599 (`docker` label queued indefinitely with zero eligible runners).
One note for release-manager: The staging redeploy (`redeploy-tenants-on-staging.yml`) now also uses `publish`; worth verifying the staging publish runner pool is included in the rollout plan (internal#462) alongside the production one.
No blocking issues. LGTM.
[core-qa-agent] N/A — CI-only: 5 .gitea/workflows files change runs-on label for publish/deploy jobs (ubuntu-latest → publish). No application code changes, no test surface affected.
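The property these reviews checked by hand (no `pull_request`-triggered workflow targets the `publish` lane) can be approximated with a small text audit. This is a rough sketch under stated assumptions: it treats each workflow as plain text, strips `#` comments naively, and assumes `runs-on` is a single scalar label. The `audit` helper, the `autobump` job name, and the sample workflow bodies are illustrative, not copied from the repository.

```python
import re

# Matches a scalar `runs-on: <label>` line (assumption: no list or expression form).
RUNS_ON = re.compile(r"^\s*runs-on:\s*(\S+)\s*$", re.MULTILINE)
# Matches a pull_request or pull_request_target trigger anywhere outside comments.
PR_TRIGGER = re.compile(r"\bpull_request(_target)?\b")


def strip_comments(text):
    """Drop everything after the first '#' on each line (naive; ignores quoting)."""
    return "\n".join(line.split("#", 1)[0] for line in text.splitlines())


def audit(workflows, lane="publish"):
    """Return sorted names of workflows that are PR-triggered yet target `lane`."""
    offenders = []
    for name, text in workflows.items():
        body = strip_comments(text)
        if lane in RUNS_ON.findall(body) and PR_TRIGGER.search(body):
            offenders.append(name)
    return sorted(offenders)


# Illustrative workflow bodies (hypothetical content, real file names from the PR).
samples = {
    "publish-canvas-image.yml": (
        "on:\n  push:\n    branches: [main]\n"
        "jobs:\n  build-and-push:\n    runs-on: publish  # internal#462\n"
    ),
    "publish-runtime-autobump.yml": (
        "on: pull_request\n"
        "jobs:\n  autobump:\n    runs-on: ubuntu-latest\n"
    ),
}

print(audit(samples))  # []
```

A text scan like this is weaker than parsing the YAML, but it has no dependencies and fails in the safe direction for the common layouts: a workflow that mentions `pull_request` outside a comment and targets the lane is always flagged.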