fix(ci): sweep-stale-e2e-orgs reference + drop continue-on-error
All checks were successful
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 17s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 11s
CI / Detect changes (pull_request) Successful in 48s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 18s
sop-tier-check / tier-check (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 37s
E2E API Smoke Test / detect-changes (pull_request) Successful in 42s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 39s
CI / Platform (Go) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
CI / Canvas (Next.js) (pull_request) Successful in 9s
CI / Python Lint & Test (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 10s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 7s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 10s
CI / Canvas Deploy Reminder (pull_request) Has been skipped

The janitor was non-functional because env ADMIN_TOKEN referenced
`secrets.MOLECULE_STAGING_ADMIN_TOKEN`, which does not exist in the
org secret store. Canonical name (per #430 Class-E rename) is
`CP_STAGING_ADMIN_API_TOKEN`. Workflow exited 2 every 15-min tick,
job-level `continue-on-error: true` masked the failure, staging
tenants kept leaking.

Hongming observed 15 leaked EC2 in molecule-canary (004947743811)
us-east-2 at 11:05Z 2026-05-11.

Changes:
- Rename `secrets.MOLECULE_STAGING_ADMIN_TOKEN` → `secrets.CP_STAGING_ADMIN_API_TOKEN`
  in env block + diagnostic error message.
- Remove `continue-on-error: true` from the sweep job. Per
  `feedback_strict_root_only_after_class_a` the RFC #219 §1
  "surface without blocking" rationale was applied wrongly here:
  silent-fail on the janitor IS the meta-bug. Critical janitors
  must fail loud.
- Add `if: failure()` notify step that emits a tagged ::error::
  line on any prior-step failure, so log-tail consumers (Loki
  SOPRefireRule, orchestrator triage loop) can grep for it.

Other workflows in this repo still reference the old name
(e2e-staging-saas/sanity/external/canvas, canary-staging,
tests/e2e/STAGING_SAAS_E2E.md). Deferred to a follow-up PR
per scope guidance.
This commit is contained in:
claude-ceo-assistant 2026-05-11 04:14:40 -07:00
parent 00f0a1066f
commit 548889ac96

View File

@ -63,12 +63,21 @@ jobs:
sweep:
name: Sweep e2e orgs
runs-on: ubuntu-latest
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
continue-on-error: true
# NOTE: Phase 3 (RFC #219 §1) `continue-on-error: true` removed
# 2026-05-11. The "surface broken workflows without blocking"
# rationale was correctly applied to advisory/lint workflows but
# wrong for this janitor — silent failure here masks real-money
# tenant leaks. Hongming observed 15 leaked EC2 in molecule-canary
# (004947743811) us-east-2 at 11:05Z 2026-05-11 because the sweep
# had been exiting 2 every tick and the failure was swallowed.
# See `feedback_strict_root_only_after_class_a` — critical janitors
# must fail loud. A follow-up `notify-failure` step below also
# surfaces breakage to ops even if branch-protection wiring is
# adjusted to keep this off the required-checks list.
timeout-minutes: 15
env:
MOLECULE_CP_URL: https://staging-api.moleculesai.app
ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
MAX_AGE_MINUTES: ${{ github.event.inputs.max_age_minutes || '30' }}
DRY_RUN: ${{ github.event.inputs.dry_run || 'false' }}
# Refuse to delete more than this many orgs in one tick. If the
@ -81,7 +90,7 @@ jobs:
- name: Verify admin token present
run: |
if [ -z "$ADMIN_TOKEN" ]; then
echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set"
echo "::error::CP_STAGING_ADMIN_API_TOKEN not set"
exit 2
fi
echo "Admin token present ✓"
@ -241,3 +250,17 @@ jobs:
if: env.DRY_RUN == 'true'
run: |
echo "DRY RUN — would have deleted ${{ steps.identify.outputs.count }} org(s) AND triggered orphan-tunnels cleanup. Re-run with dry_run=false to actually delete."
- name: Notify on sweep failure
# Fail-loud companion to dropping `continue-on-error: true`.
# If any prior step failed (missing token, CP 5xx, safety-cap
# tripped, etc.) emit a clearly-tagged ::error:: line so the
# Gitea runs UI + any log-tail consumer (Loki SOPRefireRule)
# flags this. Without this step, an early `exit 2` shows as a
# red run but the message can scroll past in busy log windows;
# the explicit tag here is greppable from the orchestrator
# triage loop.
if: failure()
run: |
echo "::error::sweep-stale-e2e-orgs FAILED — staging tenants are LEAKING. See prior step logs. Common causes: (a) CP_STAGING_ADMIN_API_TOKEN secret missing/rotated, (b) staging-api.moleculesai.app 5xx, (c) safety-cap tripped (CP admin API returning malformed orgs). Manual cleanup of leaked EC2 + DNS may be required while this is broken."
exit 1