ci(e2e-staging): promote E2E Staging Platform Boot to merge-blocking (fail-closed) — #48 #3116
What & why (RCA #878 → #885)
A prod onboarding outage (06:04–08:09 UTC 2026-06-21) was caused by molecule-controlplane PR #878: it rendered the tenant `docker run` env block with a blank line that broke shell `\`-continuation → the image arg was orphaned → `docker run` exit=127 → no tenant container → onboarding down. Fixed by CP PR #885 (deployed 08:09). It escaped pre-merge testing partly because the real-boot e2e (`E2E Staging Platform Boot`) is advisory (`continue-on-error: true`) and never ran on PRs (`if:` push/dispatch/schedule guard).

Task #48: promote `E2E Staging Platform Boot` to merge-blocking (fail-closed).

What changed
Mirrors the in-file gating exemplar `e2e-staging-concierge-creates-workspace` (core#3081 / CR2 #12653) exactly.

`.gitea/workflows/e2e-staging-saas.yml`, `e2e-staging-platform-boot` job:
- Removed `continue-on-error: true` (was the mc#2654 mask: Gitea Quirk #10 makes a failed step roll up to `success` under CoE, which is precisely how a broken boot would false-green).
- Removed the `if: push || workflow_dispatch || schedule` guard so the job runs on `pull_request`. A required context that never fires on PR degrades the merge gate to a silent indefinite `pending` (the failure mode `lint-required-no-paths` / `feedback_path_filtered_workflow_cant_be_required` exist to prevent).
- Sets `E2E_REQUIRE_LIVE: ${{ github.event_name == 'pull_request' && '0' || '1' }}`, which is false-green-proof:
  - `pull_request` → `0`: PRs carry no staging creds; the harness runs a `bash -n` PR-mode self-check and does `exit 0`.
  - push/dispatch/schedule → `1`: the real staging boot runs and HARD FAILs (exit 5) if it proves no live `provisioned → tenant_online → workspace_online → a2a_roundtrip` lifecycle.
- The `Verify admin token present` step is PR-mode-aware: it skips cleanly when `E2E_REQUIRE_LIVE=0` + no token, and still hard-errors on a real run.
- The `Teardown safety net` (`if: always()`) step is unchanged.

`tests/e2e/test_staging_full_saas.sh` (shared harness):
- Added a PR-mode early exit (`REQUIRE_LIVE=0 && no admin token` → `bash -n` self-check → `exit 0`), mirroring `test_staging_concierge_creates_workspace_e2e.sh`.
- Changed `${MOLECULE_ADMIN_TOKEN:?...}` to `${MOLECULE_ADMIN_TOKEN:-}` so the PR lane no longer hard-dies before the self-check; a non-PR run with no token is still a HARD FAIL just past the PR-mode block.
- The sibling `e2e-staging-saas` job is unaffected: it keeps its `if:` push/dispatch/schedule guard and `E2E_REQUIRE_LIVE: '1'`, so it never reaches the PR-mode branch.

`.gitea/required-contexts.txt`: added `E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot` (SSOT), in the same PR that removes CoE, per `lint-no-coe-on-required` (CoE forbidden on any listed context).

Cost tradeoff (please weigh)
This does not add a per-PR staging provision. The platform-boot job runs on `pull_request` with `E2E_REQUIRE_LIVE=0` and no staging creds, so on a PR it does a `bash -n` self-check and exits; it does NOT provision a tenant. The real provision happens on push-to-main / dispatch / cron (as before, but now blocking deploy-to-main rather than advisory). So the incremental cost vs. today is: the platform boot is no longer maskable, and a red platform-boot on main now blocks (it was advisory).

Alternative the reviewer/owner may prefer: rely on CP PR #885's merged unit test (it pins the exact `docker run` env-block rendering bug) and keep `E2E Staging Platform Boot` fail-loud post-merge (CoE removed, but not added to branch protection). That gets the masking fix and the post-merge blocking on main without the gate ceremony. This PR takes the stronger position (full merge-gate parity with the concierge job) because the #878 class was a rendering bug that a unit test on one repo (CP) cannot guarantee stays covered as the boot path evolves in core.

REMAINING OWNER ACTION (branch protection)
This PR does not touch branch protection. After this PR merges, the owner must add the required status context for this job to core `main`. (It uses the same Gitea format as existing required contexts, e.g. `CI / all-required (pull_request)`. The `(pull_request)` event suffix is the live-BP form; `required-contexts.txt` stores the event-stripped form.)

Order (per the lint gates):
1. Merge this PR (removes `continue-on-error`, adds the context to `required-contexts.txt`). After merge, the allowlist lists the context but BP does not yet. This is lint-clean: `lint-no-coe-on-required` only fails on `live(BP) − allowlist` drift (BP has something the allowlist lacks), never the reverse.
2. Update `branch_protections/main.status_check_contexts` to add that context.

Doing it in this order is mandatory: if BP listed the context before CoE was removed, the next `lint-no-coe-on-required` run would fail (CoE on a now-required context). Merging this PR removes CoE first, so adding the context to BP afterward is always clean.

Lint gates that constrained the approach
- `lint-no-coe-on-required`: forbids `continue-on-error: true` on any job emitting a `required-contexts.txt` context, and fails on `live(BP) − allowlist` drift. ⇒ CoE removed in the same PR that adds the context; allowlist-before-BP is the safe order. Verified locally: `OK: no continue-on-error on any of the 8 required contexts.`
- `lint-required-no-paths`: forbids `paths:` on the `on:` block of any required workflow. ⇒ The workflow `on:` block already has no `paths:` (cleaned by core#3081); left untouched.
- `lint-pre-flip-continue-on-error`: blocks a CoE `true→false` flip without run-log proof of recent green on main, EXCEPT graceful-degrade (no recent runs / log-404 → warn, allow). ⇒ Satisfied via that exemption or by the platform-boot job's recent green push-runs.
- `lint-required-context-exists-in-bp` (Tier 2g): requires a directive only for a NEW emitter. ⇒ The platform-boot context string is byte-identical before/after (only CoE + `if:` changed), so this is not a new emitter; directive updated to `bp-required: now required` for the post-merge state anyway.
- `lint-required-workflows-docker-host-pinned`: only applies to workflows running docker. ⇒ This workflow runs `curl` + a bash harness on `ubuntu-latest`; no docker. N/A.
- `lint-continue-on-error-tracking`: every CoE:true needs a fresh (<14d, open) tracker. ⇒ Removing the platform-boot CoE eliminates a tracked directive; other CoE directives untouched. N/A.

Validation run locally: workflow YAML parses, `bash -n` on the harness passes, `shellcheck -S error` is clean, `lint_no_coe_on_required.py` exits 0.

🤖 Generated with Claude Code
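
For readers checking the merge-then-BP ordering, here is a small shell sketch of the one-directional drift rule as this PR describes it. It is not the logic of `lint_no_coe_on_required.py`: the function name `bp_minus_allowlist` and the sample lists are hypothetical, and contexts are assumed to be compared in their event-stripped form.

```shell
# Hypothetical sketch: print contexts that live branch protection (BP) lists
# but the allowlist lacks. Any output would count as drift.
# $1 = BP contexts, one per line; $2 = allowlist contexts, one per line
bp_minus_allowlist() {
  printf '%s\n' "$1" | while IFS= read -r ctx; do
    [ -n "$ctx" ] || continue
    # -F fixed string, -x whole-line match, -q quiet
    printf '%s\n' "$2" | grep -Fxq -- "$ctx" || printf '%s\n' "$ctx"
  done
}

allowlist='CI / all-required
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot'
old_allowlist='CI / all-required'

# Right after merge: allowlist has the new context, BP does not yet.
drift_before="$(bp_minus_allowlist 'CI / all-required' "$allowlist")"

# After the owner adds the context to BP.
drift_after="$(bp_minus_allowlist "$allowlist" "$allowlist")"

# Reversed order: BP updated while the allowlist still lacks the context.
drift_reversed="$(bp_minus_allowlist "$allowlist" "$old_allowlist")"

printf 'before=[%s]\nafter=[%s]\nreversed=[%s]\n' \
  "$drift_before" "$drift_after" "$drift_reversed"
```

Only the reversed case yields a non-empty difference, which is why the allowlist has to be updated before branch protection.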
Reviewed against the in-file gating exemplar (e2e-staging-concierge-creates-workspace): REQUIRE_LIVE=${{ pull_request && '0' || '1' }} matches (PRs lack staging creds → bash-clean self-check; real boot runs post-merge with =1, no longer masked by continue-on-error). Confirmed the platform-boot step invokes the patched tests/e2e/test_staging_full_saas.sh, and the sibling e2e-staging-saas keeps its push-only if-guard so the shared-harness PR-mode early-exit never affects it. SSOT required-contexts.txt updated in the same PR (lint-no-coe-on-required order). Sound, convention-following. LGTM.
Security: PRs get no staging creds (REQUIRE_LIVE=0 self-check only) — no secret exposure on the PR lane; real run is push/dispatch/cron. continue-on-error removal makes a real boot regression fail loud post-merge (was silently masked). No new secret surfaces. LGTM.
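
A minimal sketch of the PR-mode gate both reviews refer to, assuming the behavior described in the PR body. It is not the actual content of `tests/e2e/test_staging_full_saas.sh`: the helper name `decide_mode` and the lane labels it prints are hypothetical, and the exit code used for the missing-token hard fail is not stated in the PR, so the sketch only names that lane.

```shell
# Hypothetical helper: report which lane the harness would take.
# $1 = E2E_REQUIRE_LIVE ("0" on pull_request, "1" otherwise)
# $2 = admin token (may be empty, as with ${MOLECULE_ADMIN_TOKEN:-})
decide_mode() {
  require_live="${1:-1}"   # unset or empty falls back to the strict lane
  token="${2:-}"
  if [ "$require_live" = "0" ] && [ -z "$token" ]; then
    echo "self-check"      # PR lane: bash -n syntax check, then exit 0
  elif [ -z "$token" ]; then
    echo "hard-fail"       # real run without a token must not pass
  else
    echo "live-boot"       # token present: proceed to the real staging boot
  fi
}

decide_mode 0 ""         # PR lane
decide_mode 1 ""         # push/dispatch/schedule with a missing token
decide_mode 1 "example"  # push/dispatch/schedule with a token
```

The early exit requires both conditions, so a run with `E2E_REQUIRE_LIVE=1` can never take the self-check lane, which matches the fail-closed intent.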