forked from molecule-ai/molecule-core
ci: port 7 deploy/publish/janitors to .gitea/workflows/ (RFC internal#219 §1, Category C-3)
Sweep companion to PR#372 (ci.yml), PR#378 (Cat A), PR#379 (Cat B), PR#383 (Cat C-1), PR#386 (Cat C-2). Final port batch.

Ports 7 deploy/publish/janitor workflows from .github/workflows/ to .gitea/workflows/. Each port applies the four-surface audit pattern; every job has `continue-on-error: true` (RFC §1 contract).

Files ported:

- publish-canvas-image.yml — canvas Docker image build/push. IMPORTANT OPEN QUESTION (flagged in file header): this workflow pushes to ghcr.io. GHCR was retired during the 2026-05-06 Gitea migration in favor of ECR. The pushed image may not be consumable post-migration. Review needs to decide: retarget to ECR (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/canvas) or retire entirely and route canvas deploys via operator-host.
- redeploy-tenants-on-main.yml — prod tenant SSM redeploy on new workspace-server image. workflow_run trigger retained (same Gitea support caveat as canary-verify.yml — flagged in header). Simplified the job `if:` condition by dropping the `workflow_dispatch` branch.
- redeploy-tenants-on-staging.yml — staging mirror of the above. Same workflow_run caveat and the same `if:` simplification.
- sweep-aws-secrets.yml — hourly AWS Secrets Manager tenant-secret janitor. Dropped workflow_dispatch.inputs (dry_run/max_delete_pct/grace_hours); cron triggers run with the script defaults instead. if-step gates conditional on github.event_name=='workflow_dispatch' are dead code post-port but harmless.
- sweep-cf-orphans.yml — hourly CF DNS janitor. Same shape.
- sweep-cf-tunnels.yml — hourly CF Tunnels janitor. Same shape.
- sweep-stale-e2e-orgs.yml — every-15-min staging tenant cleanup. Same shape.

Open questions for review:

1. workflow_run on redeploy-tenants-on-* — same caveat as canary-verify.yml (Cat C-2). If Gitea ignores the event, the follow-up triage PR replaces it with a push-with-paths-filter on .gitea/workflows/publish-workspace-server-image.yml.
2. publish-canvas-image GHCR target — decide retarget-to-ECR vs retire-entirely with reviewer.
3. workflow_dispatch.inputs replacements — the four janitor sweeps lost their operator-facing dry_run/cap-override knobs. If a manual override is needed today, edit the cron envs in the file directly. A follow-up could add a "manual override commit" pattern where the cron reads overrides from a checked-in JSON.

DO NOT MERGE without orchestrator-dispatched Five-Axis review + @hongmingwang chat-go.

Cross-links:
- RFC: molecule-ai/internal#219
- Companions: PR#372, PR#378, PR#379, PR#383, PR#386

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
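If Gitea never delivers the workflow_run event (open question 1), the fallback trigger described above would look roughly like this. This is a sketch of the triage plan only, not the follow-up PR itself, and it assumes Gitea's push `paths` filter behaves like GitHub's:

```yaml
# Hypothetical fallback for redeploy-tenants-on-*.yml if workflow_run
# never fires on Gitea 1.22.6: trigger on pushes to main that touch the
# publisher workflow file, instead of on the publisher's completion.
on:
  push:
    branches: [main]
    paths:
      - '.gitea/workflows/publish-workspace-server-image.yml'
```

Note the trade-off: this fires on edits to the publisher workflow file, not on every image publish, so it is a coarser signal than workflow_run.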
parent: 108b9a54d9
commit: 7351d7766f
135 .gitea/workflows/publish-canvas-image.yml (Normal file)
@@ -0,0 +1,135 @@
name: publish-canvas-image

# Ported from .github/workflows/publish-canvas-image.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
#   per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
#   feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
# - **Open question for review**: this workflow pushes the canvas
#   image to `ghcr.io`. GHCR was retired during the 2026-05-06
#   Gitea migration in favor of ECR (per canary-verify.yml header
#   notes). The image may not be consumable post-migration. Two
#   options for follow-up: (a) retarget to
#   `153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/canvas`,
#   or (b) retire this workflow entirely and route canvas deploys
#   via the operator-host build path. tier:low + continue-on-error
#   means failed pushes do not block PRs.
#

# Builds and pushes the canvas Docker image to GHCR whenever a commit lands
# on main that touches canvas code. Previously canvas changes were visible in
# CI (npm run build passed) but the live container was never updated —
# operators had to manually run `docker compose build canvas` each time.
#
# Mirror of publish-platform-image.yml, adapted for the Next.js canvas layer.
# See that workflow for inline notes on macOS Keychain isolation and QEMU.

on:
  push:
    branches: [main]
    paths:
      # Only rebuild when canvas source changes — saves GHA minutes on
      # platform-only / docs-only / MCP-only merges.
      - 'canvas/**'
      - '.gitea/workflows/publish-canvas-image.yml'
  # Manual trigger: use after a non-canvas merge that still needs a fresh
  # image (e.g. a Dockerfile change lives outside the canvas/ tree).

permissions:
  contents: read
  packages: write # required to push to ghcr.io/${{ github.repository_owner }}/*

env:
  IMAGE_NAME: ghcr.io/molecule-ai/canvas
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  build-and-push:
    name: Build & push canvas image
    runs-on: ubuntu-latest
    # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
    continue-on-error: true
    steps:
      - name: Checkout
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

      - name: Log in to GHCR
        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9 # v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0

      # Health check: verify Docker daemon is accessible before attempting any
      # build steps. This fails loudly at step 1 when the runner's docker.sock
      # is inaccessible rather than silently continuing to the build step
      # where docker build fails deep in ECR auth with a cryptic error.
      - name: Verify Docker daemon access
        run: |
          set -euo pipefail
          echo "::group::Docker daemon health check"
          docker info 2>&1 | head -5 || {
            echo "::error::Docker daemon is not accessible at /var/run/docker.sock"
            echo "::error::Check: (1) daemon running, (2) runner user in docker group, (3) sock perms 660+"
            exit 1
          }
          echo "Docker daemon OK"
          echo "::endgroup::"

      - name: Compute tags
        id: tags
        shell: bash
        run: |
          echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"

      - name: Resolve build args
        id: build_args
        # Priority: workflow_dispatch input > repo secret > hardcoded default.
        # NEXT_PUBLIC_* env vars are baked into the JS bundle at build time by
        # Next.js — they cannot be changed at runtime without a full rebuild.
        # For local docker-compose deployments the defaults (localhost:8080)
        # work as-is; production deployments should set CANVAS_PLATFORM_URL
        # and CANVAS_WS_URL as repository secrets.
        #
        # Inputs are passed via env vars (not direct ${{ }} interpolation) to
        # prevent shell injection from workflow_dispatch string inputs.
        shell: bash
        env:
          INPUT_PLATFORM_URL: ${{ github.event.inputs.platform_url }}
          SECRET_PLATFORM_URL: ${{ secrets.CANVAS_PLATFORM_URL }}
          INPUT_WS_URL: ${{ github.event.inputs.ws_url }}
          SECRET_WS_URL: ${{ secrets.CANVAS_WS_URL }}
        run: |
          PLATFORM_URL="${INPUT_PLATFORM_URL:-${SECRET_PLATFORM_URL:-http://localhost:8080}}"
          WS_URL="${INPUT_WS_URL:-${SECRET_WS_URL:-ws://localhost:8080/ws}}"

          echo "platform_url=${PLATFORM_URL}" >> "$GITHUB_OUTPUT"
          echo "ws_url=${WS_URL}" >> "$GITHUB_OUTPUT"

      - name: Build & push canvas image to GHCR
        uses: docker/build-push-action@bcafcacb16a39f128d818304e6c9c0c18556b85f # v7.1.0
        with:
          context: ./canvas
          file: ./canvas/Dockerfile
          platforms: linux/amd64
          push: true
          build-args: |
            NEXT_PUBLIC_PLATFORM_URL=${{ steps.build_args.outputs.platform_url }}
            NEXT_PUBLIC_WS_URL=${{ steps.build_args.outputs.ws_url }}
          tags: |
            ${{ env.IMAGE_NAME }}:latest
            ${{ env.IMAGE_NAME }}:sha-${{ steps.tags.outputs.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          labels: |
            org.opencontainers.image.source=https://github.com/${{ github.repository }}
            org.opencontainers.image.revision=${{ github.sha }}
            org.opencontainers.image.description=Molecule AI canvas (Next.js 15 + React Flow)
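The `Resolve build args` step above is a plain POSIX fallback chain: explicit input, else repo secret, else hardcoded default. A standalone sketch of that pattern (`resolve_url` is an illustrative helper, not workflow code):

```shell
#!/bin/sh
# Illustrative only: mirrors the workflow's ${INPUT:-${SECRET:-default}}
# fallback chain. An empty string counts as "unset" because :- is used.
resolve_url() {
  input="$1"; secret="$2"; default="$3"
  # First non-empty value wins: explicit input, then secret, then default.
  echo "${input:-${secret:-$default}}"
}

resolve_url "" "" "http://localhost:8080"                 # default wins
resolve_url "" "https://api.example.test" "http://localhost:8080"  # secret wins
resolve_url "https://override.test" "https://api.example.test" "http://localhost:8080"  # input wins
```

This is also why the workflow passes inputs through `env:` rather than interpolating `${{ }}` directly into the script: the expansion happens in the shell, on data, not in the workflow template.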
375 .gitea/workflows/redeploy-tenants-on-main.yml (Normal file)
@@ -0,0 +1,375 @@
name: redeploy-tenants-on-main

# Ported from .github/workflows/redeploy-tenants-on-main.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
#   per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
#   feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
# - **Gitea workflow_run trigger limitation**: Gitea 1.22.6's support
#   for the `workflow_run` event is partial. If this never fires on a
#   real publish-workspace-server-image completion, the follow-up
#   triage PR should replace the trigger with a push-with-paths-filter
#   on .gitea/workflows/publish-workspace-server-image.yml. Until
#   then continue-on-error+dead-workflow doesn't break anything.
#

# Auto-refresh prod tenant EC2s after every main merge.
#
# Why this workflow exists: publish-workspace-server-image builds and
# pushes a new platform-tenant :<sha> to ECR on every merge to main,
# but running tenants pulled their image once at boot and never re-pull.
# Users see stale code indefinitely.
#
# This workflow closes the gap by calling the control-plane admin
# endpoint that performs a canary-first, batched, health-gated rolling
# redeploy across every live tenant. Implemented in molecule-ai/
# molecule-controlplane as POST /cp/admin/tenants/redeploy-fleet
# (feat/tenant-auto-redeploy, landing alongside this workflow).
#
# Registry: ECR (153263036946.dkr.ecr.us-east-2.amazonaws.com/
# molecule-ai/platform-tenant). GHCR was retired 2026-05-07 during the
# Gitea suspension migration. The canary-verify.yml promote step now
# uses the same redeploy-fleet endpoint (fixes the silent-GHCR gap).
#
# Runtime ordering:
#   1. publish-workspace-server-image completes → new :staging-<sha> in ECR.
#   2. This workflow fires via workflow_run, calls redeploy-fleet with
#      target_tag=staging-<sha>. No CDN propagation wait needed —
#      ECR image manifest is consistent immediately after push.
#   3. Calls redeploy-fleet with canary_slug (if set) and a soak
#      period. Canary proves the image boots; batches follow.
#   4. Any failure aborts the rollout and leaves older tenants on the
#      prior image — safer default than half-and-half state.
#
# Rollback path: re-run this workflow with a specific SHA pinned via
# the workflow_dispatch input. That calls redeploy-fleet with
# target_tag=<sha>, re-pulling the older image on every tenant.

on:
  workflow_run:
    workflows: ['publish-workspace-server-image']
    types: [completed]
    branches: [main]

permissions:
  contents: read
  # No write scopes needed — the workflow hits an external CP endpoint,
  # not the GitHub API.

# Serialize redeploys so two rapid main pushes' redeploys don't overlap
# and cause confusing per-tenant SSM state. Without this, GitHub's
# implicit workflow_run queueing would *probably* serialize them, but
# the explicit block makes the invariant defensible. Mirrors the
# concurrency block on redeploy-tenants-on-staging.yml for shape parity.
#
# cancel-in-progress: false → aborting a half-rolled-out fleet would
# leave tenants stuck on whatever image they happened to be on when
# cancelled. Better to finish the in-flight rollout before starting
# the next one.
concurrency:
  group: redeploy-tenants-on-main
  cancel-in-progress: false

env:
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  redeploy:
    # Skip the auto-trigger if publish-workspace-server-image didn't
    # actually succeed. workflow_run fires on any completion state; we
    # don't want to redeploy against a half-built image.
    # NOTE (Gitea port): workflow_dispatch trigger dropped; only the
    # workflow_run path remains.
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
    continue-on-error: true
    timeout-minutes: 25
    steps:
      - name: Note on ECR propagation
        # ECR image manifests are consistent immediately after push — no
        # CDN cache to wait for. The old GHCR-based workflow had a 30s
        # sleep to avoid race conditions; ECR makes that unnecessary.
        run: echo "ECR image available immediately after push — proceeding."

      - name: Compute target tag
        id: tag
        # Resolution order:
        #   1. Operator-supplied input (workflow_dispatch with explicit
        #      tag) → used verbatim. Lets ops pin `latest` for emergency
        #      rollback to the last canary-verified digest, or pin a specific
        #      `staging-<sha>` to roll back to a known-good build.
        #   2. Default → `staging-<short_head_sha>`, the just-published
        #      digest. Bypasses the `:latest` retag path that's currently
        #      dead (canary-verify soft-skips without a canary fleet, so
        #      the only thing retagging `:latest` today is the manual
        #      promote-latest.yml — last run 2026-04-28). Auto-trigger
        #      from workflow_run uses workflow_run.head_sha; manual
        #      dispatch with no input falls through to github.sha.
        env:
          INPUT_TAG: ${{ inputs.target_tag }}
          HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
        run: |
          set -euo pipefail
          if [ -n "${INPUT_TAG:-}" ]; then
            echo "target_tag=$INPUT_TAG" >> "$GITHUB_OUTPUT"
            echo "Using operator-pinned tag: $INPUT_TAG"
          else
            SHORT="${HEAD_SHA:0:7}"
            echo "target_tag=staging-$SHORT" >> "$GITHUB_OUTPUT"
            echo "Using auto tag: staging-$SHORT (head_sha=$HEAD_SHA)"
          fi

      - name: Call CP redeploy-fleet
        # CP_ADMIN_API_TOKEN must be set as a repo/org secret on
        # molecule-ai/molecule-core, matching the staging/prod CP's
        # CP_ADMIN_API_TOKEN env. Stored in Railway, mirrored to this
        # repo's secrets for CI.
        env:
          CP_URL: ${{ vars.CP_URL || 'https://api.moleculesai.app' }}
          CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}
          TARGET_TAG: ${{ steps.tag.outputs.target_tag }}
          CANARY_SLUG: ${{ inputs.canary_slug || 'hongming' }}
          SOAK_SECONDS: ${{ inputs.soak_seconds || '60' }}
          BATCH_SIZE: ${{ inputs.batch_size || '3' }}
          DRY_RUN: ${{ inputs.dry_run || false }}
        run: |
          set -euo pipefail

          if [ -z "${CP_ADMIN_API_TOKEN:-}" ]; then
            echo "::error::CP_ADMIN_API_TOKEN secret not set — skipping redeploy"
            echo "::notice::Set CP_ADMIN_API_TOKEN in repo secrets to enable auto-redeploy."
            exit 1
          fi

          BODY=$(jq -nc \
            --arg tag "$TARGET_TAG" \
            --arg canary "$CANARY_SLUG" \
            --argjson soak "$SOAK_SECONDS" \
            --argjson batch "$BATCH_SIZE" \
            --argjson dry "$DRY_RUN" \
            '{
              target_tag: $tag,
              canary_slug: $canary,
              soak_seconds: $soak,
              batch_size: $batch,
              dry_run: $dry
            }')

          echo "POST $CP_URL/cp/admin/tenants/redeploy-fleet"
          echo "  body: $BODY"

          HTTP_RESPONSE=$(mktemp)
          HTTP_CODE_FILE=$(mktemp)
          # Route -w into its own tempfile so curl's exit code (e.g. 56
          # on connection-reset, 22 on --fail-with-body 4xx/5xx) can't
          # pollute the captured stdout. The previous inline-substitution
          # shape produced "000000" on connection reset (curl wrote
          # "000" via -w, then the inline echo-fallback appended another
          # "000") — caught on the 2026-05-04 redeploy of sha 2b862f6.
          # set +e/-e keeps the non-zero curl exit from tripping the
          # outer pipeline. See lint-curl-status-capture.yml for the
          # CI gate that pins this fix shape.
          set +e
          curl -sS -o "$HTTP_RESPONSE" -w '%{http_code}' \
            -m 1200 \
            -H "Authorization: Bearer $CP_ADMIN_API_TOKEN" \
            -H "Content-Type: application/json" \
            -X POST "$CP_URL/cp/admin/tenants/redeploy-fleet" \
            -d "$BODY" >"$HTTP_CODE_FILE"
          set -e
          # Stderr from curl (e.g. dial errors with -sS) goes to the runner
          # log so operators can see WHY a connection failed. Stdout is
          # captured to $HTTP_CODE_FILE because that's where -w writes.
          HTTP_CODE=$(cat "$HTTP_CODE_FILE" 2>/dev/null || echo "000")
          [ -z "$HTTP_CODE" ] && HTTP_CODE="000"

          echo "HTTP $HTTP_CODE"
          cat "$HTTP_RESPONSE" | jq . || cat "$HTTP_RESPONSE"

          # Pretty-print per-tenant results in the job summary so
          # ops can see which tenants were redeployed without drilling
          # into the raw response.
          {
            echo "## Tenant redeploy fleet"
            echo ""
            echo "**Target tag:** \`$TARGET_TAG\`"
            echo "**Canary:** \`$CANARY_SLUG\` (soak ${SOAK_SECONDS}s)"
            echo "**Batch size:** $BATCH_SIZE"
            echo "**Dry run:** $DRY_RUN"
            echo "**HTTP:** $HTTP_CODE"
            echo ""
            echo "### Per-tenant result"
            echo ""
            echo '| Slug | Phase | SSM Status | Exit | Healthz | Error |'
            echo '|------|-------|------------|------|---------|-------|'
            jq -r '.results[]? | "| \(.slug) | \(.phase) | \(.ssm_status // "-") | \(.ssm_exit_code) | \(.healthz_ok) | \(.error // "-") |"' "$HTTP_RESPONSE" || true
          } >> "$GITHUB_STEP_SUMMARY"

          if [ "$HTTP_CODE" != "200" ]; then
            echo "::error::redeploy-fleet returned HTTP $HTTP_CODE"
            exit 1
          fi
          OK=$(jq -r '.ok' "$HTTP_RESPONSE")
          if [ "$OK" != "true" ]; then
            echo "::error::redeploy-fleet reported ok=false (see summary for which tenant halted the rollout)"
            exit 1
          fi
          echo "::notice::Tenant fleet redeploy reported ssm_status=Success — verifying actual image roll on each tenant..."

          # Stash the response for the verify step. $RUNNER_TEMP outlasts
          # the step boundary; $HTTP_RESPONSE doesn't.
          cp "$HTTP_RESPONSE" "$RUNNER_TEMP/redeploy-response.json"

      - name: Verify each tenant /buildinfo matches published SHA
        # ROOT FIX FOR #2395.
        #
        # `redeploy-fleet`'s `ssm_status=Success` means "the SSM RPC
        # didn't error" — NOT "the new image is running on the tenant."
        # `:latest` lives in the local Docker daemon's image cache; if
        # the SSM document does `docker compose up -d` without an
        # explicit `docker pull`, the daemon serves the previously-
        # cached digest and the container restarts on stale code.
        # 2026-04-30 incident: hongmingwang's tenant reported
        # ssm_status=Success at 17:00:53Z but kept serving pre-501a42d7
        # chat_files for 30+ min — the lazy-heal fix never reached the
        # user despite green deploy + green redeploy.
        #
        # This step closes the gap by curling each tenant's /buildinfo
        # endpoint (added in workspace-server/internal/buildinfo +
        # /Dockerfile* GIT_SHA build-arg, this PR) and comparing the
        # returned git_sha to the SHA the workflow expects. Mismatches
        # fail the workflow, which is what `ok=true` should have
        # guaranteed all along.
        #
        # When the redeploy was triggered by workflow_dispatch with a
        # specific tag (target_tag != "latest"), the expected SHA may
        # not equal ${{ github.sha }} — in that case we resolve via
        # GHCR's manifest. For workflow_run (default :latest) the
        # workflow_run.head_sha is the SHA that just published.
        env:
          EXPECTED_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
          TARGET_TAG: ${{ steps.tag.outputs.target_tag }}
          # Tenant subdomain template — slugs from the response are
          # appended. Production CP issues `<slug>.moleculesai.app`;
          # staging CP issues `<slug>.staging.moleculesai.app`. This
          # workflow runs on main → prod CP → no `staging.` infix.
          TENANT_DOMAIN: 'moleculesai.app'
        run: |
          set -euo pipefail

          EXPECTED_SHORT="${EXPECTED_SHA:0:7}"
          if [ "$TARGET_TAG" != "latest" ] \
            && [ "$TARGET_TAG" != "$EXPECTED_SHA" ] \
            && [ "$TARGET_TAG" != "staging-$EXPECTED_SHORT" ]; then
            # workflow_dispatch with a pinned tag that isn't the head
            # SHA — operator is rolling back / pinning. Skip the
            # verification because we don't have the expected SHA in
            # this context (would need to crane-inspect the GHCR
            # manifest, which is a follow-up). Failing-open here is
            # safe: the operator chose the tag deliberately.
            #
            # `staging-<short_head_sha>` IS verified — it's the new
            # auto-trigger default (see Compute target tag step) and
            # the digest under that tag SHOULD match EXPECTED_SHA.
            echo "::notice::target_tag=$TARGET_TAG (operator-pinned) — skipping per-tenant SHA verification."
            exit 0
          fi

          RESP="$RUNNER_TEMP/redeploy-response.json"
          if [ ! -s "$RESP" ]; then
            echo "::error::redeploy-response.json missing or empty — verify step ran without a response to read"
            exit 1
          fi

          # Pull only successfully-redeployed tenants. Any tenant that
          # halted the rollout already failed the previous step, so we
          # don't double-count them here.
          mapfile -t SLUGS < <(jq -r '.results[]? | select(.healthz_ok == true) | .slug' "$RESP")
          if [ ${#SLUGS[@]} -eq 0 ]; then
            echo "::warning::No tenants reported healthz_ok — nothing to verify"
            exit 0
          fi

          echo "Verifying ${#SLUGS[@]} tenant(s) against EXPECTED_SHA=${EXPECTED_SHA:0:7}..."

          # Two distinct failure modes — STALE (the #2395 bug class, hard-fail)
          # vs UNREACHABLE (teardown race, soft-warn). See the staging variant's
          # comment for the full rationale; same logic applies on prod even
          # though prod has fewer ephemeral tenants — the asymmetry would be a
          # gratuitous fork.
          STALE_COUNT=0
          UNREACHABLE_COUNT=0
          STALE_LINES=()
          UNREACHABLE_LINES=()
          for slug in "${SLUGS[@]}"; do
            URL="https://${slug}.${TENANT_DOMAIN}/buildinfo"
            # 30s total: tenant just SSM-restarted, may still be coming
            # up. Retry-on-empty rather than retry-on-status — we want
            # to fail fast on "responded with wrong SHA", not "still
            # warming up".
            BODY=$(curl -sS --max-time 30 --retry 3 --retry-delay 5 --retry-connrefused "$URL" || true)
            ACTUAL_SHA=$(echo "$BODY" | jq -r '.git_sha // ""' 2>/dev/null || echo "")
            if [ -z "$ACTUAL_SHA" ]; then
              UNREACHABLE_COUNT=$((UNREACHABLE_COUNT + 1))
              UNREACHABLE_LINES+=("| $slug | (no /buildinfo response) | ${EXPECTED_SHA:0:7} | ⚠ unreachable (likely teardown race) |")
              continue
            fi
            if [ "$ACTUAL_SHA" = "$EXPECTED_SHA" ]; then
              echo "  $slug: ${ACTUAL_SHA:0:7} ✓"
            else
              STALE_COUNT=$((STALE_COUNT + 1))
              STALE_LINES+=("| $slug | ${ACTUAL_SHA:0:7} | ${EXPECTED_SHA:0:7} | ❌ stale |")
            fi
          done

          {
            echo ""
            echo "### Per-tenant /buildinfo verification"
            echo ""
            echo "Expected SHA: \`${EXPECTED_SHA:0:7}\`"
            echo ""
            if [ $STALE_COUNT -gt 0 ]; then
              echo "**${STALE_COUNT} STALE tenant(s) — these did NOT pick up the new image despite ssm_status=Success:**"
              echo ""
              echo "| Slug | Actual /buildinfo SHA | Expected | Status |"
              echo "|------|----------------------|----------|--------|"
              for line in "${STALE_LINES[@]}"; do echo "$line"; done
              echo ""
            fi
            if [ $UNREACHABLE_COUNT -gt 0 ]; then
              echo "**${UNREACHABLE_COUNT} unreachable tenant(s) — likely teardown race (soft-warn, not failing):**"
              echo ""
              echo "| Slug | Actual /buildinfo SHA | Expected | Status |"
              echo "|------|----------------------|----------|--------|"
              for line in "${UNREACHABLE_LINES[@]}"; do echo "$line"; done
              echo ""
            fi
            if [ $STALE_COUNT -eq 0 ] && [ $UNREACHABLE_COUNT -eq 0 ]; then
              echo "All ${#SLUGS[@]} tenants returned matching SHA. ✓"
            fi
          } >> "$GITHUB_STEP_SUMMARY"

          if [ $UNREACHABLE_COUNT -gt 0 ]; then
            echo "::warning::$UNREACHABLE_COUNT tenant(s) unreachable post-redeploy. Likely benign teardown race — CP healthz monitor catches real outages."
          fi

          # Belt-and-suspenders sanity floor: same logic as the staging
          # variant — see that file's comment for the full rationale.
          # Floor only applies when fleet >= 4; below that, canary-verify
          # is the actual gate.
          TOTAL_VERIFIED=${#SLUGS[@]}
          if [ $TOTAL_VERIFIED -ge 4 ] && [ $UNREACHABLE_COUNT -gt $((TOTAL_VERIFIED / 2)) ]; then
            echo "::error::$UNREACHABLE_COUNT of $TOTAL_VERIFIED tenant(s) unreachable — exceeds 50% threshold on a fleet large enough that this signals a real outage, not teardown race."
            exit 1
          fi

          if [ $STALE_COUNT -gt 0 ]; then
            echo "::error::$STALE_COUNT tenant(s) returned a stale SHA. ssm_status=Success was misleading — see job summary."
            exit 1
          fi

          echo "::notice::Tenant fleet redeploy complete — all reachable tenants on ${EXPECTED_SHA:0:7} (${UNREACHABLE_COUNT} unreachable, soft-warned)."
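The `Compute target tag` resolution above reduces to a small function: an operator-pinned tag is used verbatim, otherwise the tag defaults to `staging-` plus the first 7 characters of the head SHA. A standalone sketch (`resolve_tag` is a hypothetical helper, not workflow code):

```shell
#!/bin/sh
# Sketch of the Compute-target-tag logic. An empty input tag falls
# through to the staging-<short sha> default, mirroring the workflow's
# if [ -n "${INPUT_TAG:-}" ] branch.
resolve_tag() {
  input_tag="$1"; head_sha="$2"
  if [ -n "${input_tag:-}" ]; then
    echo "$input_tag"
  else
    echo "staging-$(printf '%s' "$head_sha" | cut -c1-7)"
  fi
}

resolve_tag "" "2b862f6e4500dabe"        # prints staging-2b862f6
resolve_tag "latest" "2b862f6e4500dabe"  # prints latest
```

The same shape covers the rollback path: an operator pinning `staging-<sha>` of a known-good build takes the first branch.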
356 .gitea/workflows/redeploy-tenants-on-staging.yml (Normal file)
@@ -0,0 +1,356 @@
|
||||
name: redeploy-tenants-on-staging
|
||||
|
||||
# Ported from .github/workflows/redeploy-tenants-on-staging.yml on 2026-05-11 per RFC
|
||||
# internal#219 §1 sweep. Differences from the GitHub version:
|
||||
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
|
||||
# per feedback_gitea_workflow_dispatch_inputs_unsupported).
|
||||
# - Dropped `merge_group:` (no Gitea merge queue).
|
||||
# - Dropped `environment:` blocks (Gitea has no environments).
|
||||
# - Workflow-level env.GITHUB_SERVER_URL pinned per
|
||||
# feedback_act_runner_github_server_url.
|
||||
# - `continue-on-error: true` on each job (RFC §1 contract).
|
||||
# - **Gitea workflow_run trigger limitation**: Gitea 1.22.6's support
|
||||
# for the `workflow_run` event is partial. If this never fires on a
|
||||
# real publish-workspace-server-image completion, the follow-up
|
||||
# triage PR should replace the trigger with a push-with-paths-filter
|
||||
# on .gitea/workflows/publish-workspace-server-image.yml. Until
|
||||
# then continue-on-error+dead-workflow doesn't break anything.
|
||||
#
|
||||
|
||||
# Auto-refresh staging tenant EC2s after every staging-branch merge.
|
||||
#
|
||||
# Mirror of redeploy-tenants-on-main.yml, with the staging-CP host and
|
||||
# the :staging-latest tag. Sister workflow exists for prod (rolls
|
||||
# :latest after canary-verify). Both share the same shape — just
|
||||
# different CP_URL + target_tag + admin token secret.
|
||||
#
|
||||
# Why this workflow exists: publish-workspace-server-image now builds
|
||||
# on every staging-branch push (PR #2335), pushing
|
||||
# platform-tenant:staging-latest to GHCR. Existing tenants pulled
|
||||
# their image once at boot and never re-pull, so the new image just
|
||||
# sits unused until the tenant is reprovisioned.
|
||||
#
|
||||
# This workflow closes the gap by calling staging-CP's
|
||||
# /cp/admin/tenants/redeploy-fleet, which performs a canary-first,
|
||||
# batched, health-gated SSM redeploy across every live staging tenant.
|
||||
# Same endpoint shape as prod CP — only the host differs.
|
||||
#
|
||||
# Runtime ordering:
|
||||
# 1. publish-workspace-server-image completes on staging branch →
|
||||
# new :staging-latest in GHCR.
|
||||
# 2. This workflow fires via workflow_run, waits 30s for GHCR's CDN
|
||||
# to propagate the new tag.
|
||||
# 3. Calls redeploy-fleet with no canary (staging IS canary; we don't
|
||||
# need a sub-canary inside it). Soak still applies to the first
|
||||
# tenant in case of bad-deploy detection.
|
||||
# 4. Any failure aborts the rollout and leaves older tenants on the
|
||||
# prior image — safer default than half-and-half state.
|
||||
#
|
||||
# Rollback path: re-run with workflow_dispatch + target_tag=staging-<sha>
|
||||
# of a known-good build.
|
||||
|
||||
on:
|
||||
workflow_run:
|
||||
workflows: ['publish-workspace-server-image']
|
||||
types: [completed]
|
||||
branches: [main]
|
||||
permissions:
|
||||
contents: read
|
||||
# No write scopes needed — the workflow hits an external CP endpoint,
|
||||
# not the GitHub API.
|
||||
|
||||
# Serialize per-branch so two rapid staging pushes' redeploys don't
|
||||
# overlap and cause confusing per-tenant SSM state. cancel-in-progress
|
||||
# is false because aborting a half-rolled-out fleet leaves tenants
|
||||
# stuck on whatever image they happened to be on when cancelled.
|
||||
concurrency:
|
||||
group: redeploy-tenants-on-staging
|
||||
cancel-in-progress: false
|
||||
|
||||
env:
|
||||
GITHUB_SERVER_URL: https://git.moleculesai.app
|
||||
|
||||
jobs:
|
||||
redeploy:
|
||||
# Skip the auto-trigger if publish-workspace-server-image didn't
|
||||
# actually succeed. workflow_run fires on any completion state; we
|
||||
# don't want to redeploy against a half-built image.
|
||||
# NOTE (Gitea port): workflow_dispatch trigger dropped; only the
|
||||
# workflow_run path remains.
|
||||
if: ${{ github.event.workflow_run.conclusion == 'success' }}
|
||||
runs-on: ubuntu-latest
|
||||
# Phase 3 (RFC #219 §1): surface broken workflows without blocking.
|
||||
continue-on-error: true
|
||||
timeout-minutes: 25
|
||||
steps:
|
||||
- name: Wait for GHCR tag propagation
|
||||
# GHCR's edge cache takes ~15-30s to consistently serve the new
|
||||
# :staging-latest manifest after the registry accepts the push.
|
||||
# Same rationale as redeploy-tenants-on-main.yml.
|
||||
run: sleep 30
|
||||
|
||||
      - name: Call staging-CP redeploy-fleet
        # CP_STAGING_ADMIN_API_TOKEN must be set as a repo/org secret
        # on molecule-ai/molecule-core, matching staging-CP's
        # CP_ADMIN_API_TOKEN env var (visible in Railway controlplane
        # / staging environment). Stored separately from the prod
        # CP_ADMIN_API_TOKEN so a leak of one doesn't auth the other.
        env:
          CP_URL: ${{ vars.STAGING_CP_URL || 'https://staging-api.moleculesai.app' }}
          CP_STAGING_ADMIN_API_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
          TARGET_TAG: ${{ inputs.target_tag || 'staging-latest' }}
          CANARY_SLUG: ${{ inputs.canary_slug || '' }}
          SOAK_SECONDS: ${{ inputs.soak_seconds || '60' }}
          BATCH_SIZE: ${{ inputs.batch_size || '3' }}
          DRY_RUN: ${{ inputs.dry_run || false }}
        run: |
          set -euo pipefail

          # Schedule-vs-dispatch hardening (mirrors sweep-cf-orphans
          # and sweep-cf-tunnels): hard-fail on auto-trigger when the
          # secret is missing so a misconfigured repo doesn't silently
          # serve stale staging tenants. Soft-skip on operator dispatch.
          if [ -z "${CP_STAGING_ADMIN_API_TOKEN:-}" ]; then
            if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
              echo "::warning::CP_STAGING_ADMIN_API_TOKEN secret not set — skipping redeploy"
              echo "::warning::Set CP_STAGING_ADMIN_API_TOKEN in repo secrets to enable auto-redeploy."
              echo "::notice::Pull the value from staging-CP's CP_ADMIN_API_TOKEN env in Railway."
              exit 0
            fi
            echo "::error::staging redeploy cannot run — CP_STAGING_ADMIN_API_TOKEN secret missing"
            echo "::error::set it at Settings → Secrets and Variables → Actions; pull from staging-CP's CP_ADMIN_API_TOKEN env in Railway."
            exit 1
          fi

          BODY=$(jq -nc \
            --arg tag "$TARGET_TAG" \
            --arg canary "$CANARY_SLUG" \
            --argjson soak "$SOAK_SECONDS" \
            --argjson batch "$BATCH_SIZE" \
            --argjson dry "$DRY_RUN" \
            '{
              target_tag: $tag,
              canary_slug: $canary,
              soak_seconds: $soak,
              batch_size: $batch,
              dry_run: $dry
            }')

          echo "POST $CP_URL/cp/admin/tenants/redeploy-fleet"
          echo " body: $BODY"

          HTTP_RESPONSE=$(mktemp)
          HTTP_CODE_FILE=$(mktemp)
          # Route -w into its own tempfile so curl's exit code (e.g. 56
          # on connection-reset) can't pollute the captured stdout. The
          # previous inline-substitution shape produced "000000" on
          # connection reset — caught on the main variant 2026-05-04
          # redeploying sha 2b862f6. Same fix shape as the synth-E2E
          # §9c gate (PR #2797). See lint-curl-status-capture.yml for
          # the CI gate that pins this fix shape.
          set +e
          curl -sS -o "$HTTP_RESPONSE" -w '%{http_code}' \
            -m 1200 \
            -H "Authorization: Bearer $CP_STAGING_ADMIN_API_TOKEN" \
            -H "Content-Type: application/json" \
            -X POST "$CP_URL/cp/admin/tenants/redeploy-fleet" \
            -d "$BODY" >"$HTTP_CODE_FILE"
          set -e
          # Stderr from curl (-sS shows dial errors etc.) goes to the
          # runner log so operators can see WHY a connection failed.
          HTTP_CODE=$(cat "$HTTP_CODE_FILE" 2>/dev/null || echo "000")
          [ -z "$HTTP_CODE" ] && HTTP_CODE="000"

          echo "HTTP $HTTP_CODE"
          jq . "$HTTP_RESPONSE" || cat "$HTTP_RESPONSE"

          {
            echo "## Staging tenant redeploy fleet"
            echo ""
            echo "**Target tag:** \`$TARGET_TAG\`"
            echo "**Canary:** \`${CANARY_SLUG:-(none — staging is itself the canary)}\` (soak ${SOAK_SECONDS}s)"
            echo "**Batch size:** $BATCH_SIZE"
            echo "**Dry run:** $DRY_RUN"
            echo "**HTTP:** $HTTP_CODE"
            echo ""
            echo "### Per-tenant result"
            echo ""
            echo '| Slug | Phase | SSM Status | Exit | Healthz | Error |'
            echo '|------|-------|------------|------|---------|-------|'
            jq -r '.results[]? | "| \(.slug) | \(.phase) | \(.ssm_status // "-") | \(.ssm_exit_code) | \(.healthz_ok) | \(.error // "-") |"' "$HTTP_RESPONSE" || true
          } >> "$GITHUB_STEP_SUMMARY"

          # Distinguish "real fleet failure" from "E2E teardown race".
          #
          # CP returns HTTP 500 + ok=false whenever ANY tenant in the
          # fleet failed SSM or healthz. In practice the recurring source
          # of these is ephemeral test tenants being torn down by their
          # parent E2E run mid-redeploy: the EC2 dies → SSM exit=2 or
          # healthz timeout → CP marks the fleet failed → this workflow
          # goes red even though every operator-facing tenant rolled fine.
          #
          # Ephemeral slug prefixes (kept in sync with sweep-stale-e2e-orgs.yml
          # — see that file for the source-of-truth list and rationale):
          #   - e2e-*    — canvas/saas/ext E2E suites
          #   - rt-e2e-* — runtime-test harness fixtures (RFC #2251)
          # Long-lived prefixes that are NOT ephemeral and MUST hard-fail:
          # demo-prep, dryrun-*, dryrun2-*, plus all human tenant slugs.
          #
          # Filter: if HTTP=500/ok=false AND every failed slug matches an
          # ephemeral prefix, treat as soft-warn and let the verify step
          # downstream handle unreachable-vs-stale (#2402). Any non-ephemeral
          # failure or a non-500 HTTP response remains a hard failure.
          OK=$(jq -r '.ok // "false"' "$HTTP_RESPONSE")
          FAILED_SLUGS=$(jq -r '
            .results[]?
            | select((.healthz_ok != true) or (.ssm_status != "Success"))
            | .slug' "$HTTP_RESPONSE" 2>/dev/null || true)
          EPHEMERAL_PREFIX_RE='^(e2e-|rt-e2e-)'
          NON_EPHEMERAL_FAILED=$(printf '%s\n' "$FAILED_SLUGS" | grep -v '^$' | grep -Ev "$EPHEMERAL_PREFIX_RE" || true)

          if [ "$HTTP_CODE" = "200" ] && [ "$OK" = "true" ]; then
            : # happy path — fall through to verification
          elif [ "$HTTP_CODE" = "500" ] && [ -z "$NON_EPHEMERAL_FAILED" ] && [ -n "$FAILED_SLUGS" ]; then
            COUNT=$(printf '%s\n' "$FAILED_SLUGS" | grep -Ec "$EPHEMERAL_PREFIX_RE" || true)
            echo "::warning::redeploy-fleet returned HTTP 500 but every failed tenant ($COUNT) is ephemeral (e2e-*/rt-e2e-*) — treating as teardown race, soft-warning."
            printf '%s\n' "$FAILED_SLUGS" | sed 's/^/::warning:: failed: /'
          elif [ "$HTTP_CODE" != "200" ]; then
            echo "::error::redeploy-fleet returned HTTP $HTTP_CODE"
            if [ -n "$NON_EPHEMERAL_FAILED" ]; then
              echo "::error::non-ephemeral tenant(s) failed:"
              printf '%s\n' "$NON_EPHEMERAL_FAILED" | sed 's/^/::error:: /'
            fi
            exit 1
          else
            # HTTP=200 but ok=false (shouldn't happen with current CP,
            # but keep the gate for completeness).
            echo "::error::redeploy-fleet reported ok=false (see summary for which tenant halted the rollout)"
            exit 1
          fi
          echo "::notice::Staging tenant fleet redeploy reported ssm_status=Success — verifying actual image roll on each tenant..."

          cp "$HTTP_RESPONSE" "$RUNNER_TEMP/redeploy-response.json"

      - name: Verify each staging tenant /buildinfo matches published SHA
        # Mirror of the verify step in redeploy-tenants-on-main.yml — see
        # there for the rationale (#2395 root fix). Staging has the same
        # ssm_status-success-but-stale-image hazard and benefits from the
        # same gate. Diff: TENANT_DOMAIN includes the `staging.` infix.
        env:
          EXPECTED_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
          TARGET_TAG: ${{ inputs.target_tag || 'staging-latest' }}
          TENANT_DOMAIN: 'staging.moleculesai.app'
        run: |
          set -euo pipefail

          # staging-latest is the staging-side moving tag; treat it the
          # same way main treats `latest`. Operator-pinned SHAs skip
          # verification (see main variant for why).
          if [ "$TARGET_TAG" != "staging-latest" ] && [ "$TARGET_TAG" != "latest" ] && [ "$TARGET_TAG" != "$EXPECTED_SHA" ]; then
            echo "::notice::target_tag=$TARGET_TAG (operator-pinned) — skipping per-tenant SHA verification."
            exit 0
          fi

          RESP="$RUNNER_TEMP/redeploy-response.json"
          if [ ! -s "$RESP" ]; then
            echo "::error::redeploy-response.json missing or empty"
            exit 1
          fi

          mapfile -t SLUGS < <(jq -r '.results[]? | select(.healthz_ok == true) | .slug' "$RESP")
          if [ ${#SLUGS[@]} -eq 0 ]; then
            echo "::warning::No staging tenants reported healthz_ok — nothing to verify"
            exit 0
          fi

          echo "Verifying ${#SLUGS[@]} staging tenant(s) against EXPECTED_SHA=${EXPECTED_SHA:0:7}..."

          # Two distinct failure modes here:
          # STALE_COUNT — tenant returned a SHA that doesn't match. THIS is
          #               the #2395 bug class: tenant up + serving old code.
          #               Always hard-fail the workflow.
          # UNREACHABLE_COUNT — tenant didn't respond. Almost always a benign
          #               teardown race: redeploy-fleet snapshot says
          #               healthz_ok=true, then the E2E suite tears the
          #               ephemeral tenant down before this step runs (the
          #               e2e-* fixtures churn 5-10/hour on staging). Soft-
          #               warn so we don't block staging→main on cleanup.
          #               Real "tenant up but unreachable" is caught by CP's
          #               own healthz monitor + the post-redeploy alert; we
          #               don't need to double-count it here.
          STALE_COUNT=0
          UNREACHABLE_COUNT=0
          STALE_LINES=()
          UNREACHABLE_LINES=()
          for slug in "${SLUGS[@]}"; do
            URL="https://${slug}.${TENANT_DOMAIN}/buildinfo"
            BODY=$(curl -sS --max-time 30 --retry 3 --retry-delay 5 --retry-connrefused "$URL" || true)
            ACTUAL_SHA=$(echo "$BODY" | jq -r '.git_sha // ""' 2>/dev/null || echo "")
            if [ -z "$ACTUAL_SHA" ]; then
              UNREACHABLE_COUNT=$((UNREACHABLE_COUNT + 1))
              UNREACHABLE_LINES+=("| $slug | (no /buildinfo response) | ${EXPECTED_SHA:0:7} | ⚠ unreachable (likely teardown race) |")
              continue
            fi
            if [ "$ACTUAL_SHA" = "$EXPECTED_SHA" ]; then
              echo " $slug: ${ACTUAL_SHA:0:7} ✓"
            else
              STALE_COUNT=$((STALE_COUNT + 1))
              STALE_LINES+=("| $slug | ${ACTUAL_SHA:0:7} | ${EXPECTED_SHA:0:7} | ❌ stale |")
            fi
          done

          {
            echo ""
            echo "### Per-tenant /buildinfo verification (staging)"
            echo ""
            echo "Expected SHA: \`${EXPECTED_SHA:0:7}\`"
            echo ""
            if [ $STALE_COUNT -gt 0 ]; then
              echo "**${STALE_COUNT} STALE tenant(s) — these did NOT pick up the new image despite ssm_status=Success:**"
              echo ""
              echo "| Slug | Actual /buildinfo SHA | Expected | Status |"
              echo "|------|----------------------|----------|--------|"
              for line in "${STALE_LINES[@]}"; do echo "$line"; done
              echo ""
            fi
            if [ $UNREACHABLE_COUNT -gt 0 ]; then
              echo "**${UNREACHABLE_COUNT} unreachable tenant(s) — likely E2E teardown race (soft-warn, not failing):**"
              echo ""
              echo "| Slug | Actual /buildinfo SHA | Expected | Status |"
              echo "|------|----------------------|----------|--------|"
              for line in "${UNREACHABLE_LINES[@]}"; do echo "$line"; done
              echo ""
            fi
            if [ $STALE_COUNT -eq 0 ] && [ $UNREACHABLE_COUNT -eq 0 ]; then
              echo "All ${#SLUGS[@]} staging tenants returned matching SHA. ✓"
            fi
          } >> "$GITHUB_STEP_SUMMARY"

          if [ $UNREACHABLE_COUNT -gt 0 ]; then
            echo "::warning::$UNREACHABLE_COUNT staging tenant(s) unreachable post-redeploy. Likely benign teardown race — CP healthz monitor catches real outages."
          fi

          # Belt-and-suspenders sanity floor: if MORE than half the fleet is
          # unreachable AND the fleet is large enough that "half down" is
          # statistically meaningful, this is a real outage (e.g. new image
          # crashes on startup), not a teardown race. Hard-fail.
          #
          # Floor only applies when TOTAL_VERIFIED >= 4 — below that, the
          # canary-verify step is the actual gate for "all tenants down"
          # detection (it runs against the canary first and aborts the
          # rollout if the canary fails to come up). Without the >=4 gate,
          # a 1-tenant fleet (e.g. a single ephemeral e2e-* tenant on a
          # quiet staging push) would re-flake on the exact teardown-race
          # condition #2402 fixed: 1 of 1 unreachable = 100% > 50% → fail.
          TOTAL_VERIFIED=${#SLUGS[@]}
          if [ $TOTAL_VERIFIED -ge 4 ] && [ $UNREACHABLE_COUNT -gt $((TOTAL_VERIFIED / 2)) ]; then
            echo "::error::$UNREACHABLE_COUNT of $TOTAL_VERIFIED staging tenant(s) unreachable — exceeds 50% threshold on a fleet large enough that this signals a real outage, not teardown race."
            exit 1
          fi

          if [ $STALE_COUNT -gt 0 ]; then
            echo "::error::$STALE_COUNT staging tenant(s) returned a stale SHA. ssm_status=Success was misleading — see job summary."
            exit 1
          fi

          echo "::notice::Staging tenant fleet redeploy complete — all reachable tenants on ${EXPECTED_SHA:0:7} (${UNREACHABLE_COUNT} unreachable, soft-warned)."
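The curl status-capture fix shape pinned above (and gated by lint-curl-status-capture.yml) can be reduced to a small sketch. The function name, URL, and body here are illustrative stand-ins, not the workflow's real call:

```shell
# Minimal sketch of the tempfile status-capture shape; names are
# hypothetical, not the workflow's actual interface.
http_post_with_code() {
  local url=$1 body=$2
  local resp code_file code
  resp=$(mktemp); code_file=$(mktemp)
  # -w goes to its own file: with HTTP_CODE=$(curl ... -w '%{http_code}')
  # a mid-transfer failure (e.g. curl exit 56 on connection reset) can
  # leave garbage like "000000" in the command substitution.
  curl -sS -o "$resp" -w '%{http_code}' -X POST "$url" -d "$body" >"$code_file" || true
  code=$(cat "$code_file" 2>/dev/null || echo "000")
  [ -z "$code" ] && code="000"
  echo "$code"
}
```

A refused or reset connection then degrades to the sentinel `000` instead of a garbled code, so a downstream `[ "$HTTP_CODE" = "200" ]` check stays meaningful.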
.gitea/workflows/sweep-aws-secrets.yml (new file, 129 lines)
name: Sweep stale AWS Secrets Manager secrets

# Ported from .github/workflows/sweep-aws-secrets.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
#   per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
#   feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#

# Janitor for per-tenant AWS Secrets Manager secrets
# (`molecule/tenant/<org_id>/bootstrap`) whose backing tenant no
# longer exists. Parallel-shape to sweep-cf-tunnels.yml and
# sweep-cf-orphans.yml — different cloud, same justification.
#
# Why this exists separately from a long-term reconciler integration:
# - molecule-controlplane's tenant_resources audit table (mig 024)
#   currently tracks four resource kinds: CloudflareTunnel,
#   CloudflareDNS, EC2Instance, SecurityGroup. SecretsManager is
#   not in the list, so the existing reconciler doesn't catch
#   orphan secrets.
# - At ~$0.40/secret/month the cost grew to ~$19/month before this
#   sweeper was written, indicating ~45+ orphan secrets from
#   crashed provisions and incomplete deprovision flows.
# - The proper fix (KindSecretsManagerSecret + recorder hook +
#   reconciler enumerator) is filed as a separate controlplane
#   issue. This sweeper is the immediate cost-relief stopgap.
#
# IAM principal: AWS_JANITOR_ACCESS_KEY_ID / AWS_JANITOR_SECRET_ACCESS_KEY.
# This is a DEDICATED principal — the production `molecule-cp` IAM
# user lacks `secretsmanager:ListSecrets` (it only has
# Get/Create/Update/Delete on specific resources, scoped to its
# operational needs). The janitor needs ListSecrets across the
# `molecule/tenant/*` prefix, which warrants a separate principal so
# we don't broaden the prod-CP policy.
#
# Safety: the script's MAX_DELETE_PCT gate (default 50%, mirroring
# sweep-cf-orphans.yml — tenant secrets are durable by design, unlike
# the mostly-orphan tunnels) refuses to nuke past the threshold.

on:
  schedule:
    # Hourly at :30 — offset from sweep-cf-orphans (:15) and
    # sweep-cf-tunnels (:45) so the three janitors don't burst the
    # CP admin endpoints at the same minute.
    - cron: '30 * * * *'

# Don't let two sweeps race the same AWS account.
concurrency:
  group: sweep-aws-secrets
  cancel-in-progress: false

permissions:
  contents: read

env:
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  sweep:
    name: Sweep AWS Secrets Manager
    runs-on: ubuntu-latest
    # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
    continue-on-error: true
    # 30 min cap, mirroring the other janitors. AWS DeleteSecret is
    # fast (~0.3s/call) so even a 100+ backlog drains in seconds
    # under the 8-way xargs parallelism, but the cap is set generously
    # to leave headroom for any actual API hang.
    timeout-minutes: 30
    env:
      AWS_REGION: ${{ secrets.AWS_REGION || 'us-east-1' }}
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_JANITOR_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_JANITOR_SECRET_ACCESS_KEY }}
      CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }}
      CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }}
      MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '50' }}
      GRACE_HOURS: ${{ github.event.inputs.grace_hours || '24' }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

      - name: Verify required secrets present
        id: verify
        # Schedule-vs-dispatch behaviour split mirrors sweep-cf-orphans
        # and sweep-cf-tunnels (hardened 2026-04-28). Same principle:
        # - schedule → exit 1 on missing secrets (red CI surfaces it)
        # - workflow_dispatch → exit 0 with warning (operator-driven,
        #   they already accepted the repo state)
        run: |
          missing=()
          for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN; do
            if [ -z "${!var:-}" ]; then
              missing+=("$var")
            fi
          done
          if [ ${#missing[@]} -gt 0 ]; then
            if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
              echo "::warning::skipping sweep — secrets not configured: ${missing[*]}"
              echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun."
              echo "::warning::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/* (the prod molecule-cp principal lacks ListSecrets)."
              echo "skip=true" >> "$GITHUB_OUTPUT"
              exit 0
            fi
            echo "::error::sweep cannot run — required secrets missing: ${missing[*]}"
            echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow."
            echo "::error::AWS_JANITOR_* must belong to a principal with secretsmanager:ListSecrets and secretsmanager:DeleteSecret on molecule/tenant/*."
            exit 1
          fi
          echo "All required secrets present ✓"
          echo "skip=false" >> "$GITHUB_OUTPUT"

      - name: Run sweep
        if: steps.verify.outputs.skip != 'true'
        # Schedule-vs-dispatch dry-run asymmetry mirrors sweep-cf-tunnels:
        # - Scheduled: input empty → "false" → --execute (the whole
        #   point of an hourly janitor).
        # - Manual workflow_dispatch: input default true → dry-run;
        #   operator must flip it to actually delete.
        run: |
          set -euo pipefail
          if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then
            echo "Running in dry-run mode — no deletions"
            bash scripts/ops/sweep-aws-secrets.sh
          else
            echo "Running with --execute — will delete identified orphans"
            bash scripts/ops/sweep-aws-secrets.sh --execute
          fi
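The MAX_DELETE_PCT safety gate lives in scripts/ops/sweep-aws-secrets.sh itself; its decision shape is roughly the following sketch (the function name and argument order are hypothetical, not the script's real interface):

```shell
# Illustrative MAX_DELETE_PCT-style gate; not the actual script code.
gate_delete_count() {
  local total=$1 to_delete=$2 max_pct=${3:-50}
  # Refuse the whole sweep when deleting to_delete of total would
  # exceed max_pct percent: a mass-orphan verdict usually means the
  # liveness source (CP admin API) returned bad data, not that the
  # fleet really is mostly orphans.
  if [ "$total" -gt 0 ] && [ $((to_delete * 100 / total)) -gt "$max_pct" ]; then
    echo "refuse"
    return 1
  fi
  echo "proceed"
}
```

With the default threshold, deleting 40 of 100 secrets proceeds, while 60 of 100 halts the run before any damage is done.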
.gitea/workflows/sweep-cf-orphans.yml (new file, 151 lines)
name: Sweep stale Cloudflare DNS records

# Ported from .github/workflows/sweep-cf-orphans.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
#   per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
#   feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#

# Janitor for Cloudflare DNS records whose backing tenant/workspace no
# longer exists. Without this loop, every short-lived E2E or canary
# leaves a CF record on the moleculesai.app zone — the zone has a
# 200-record quota (controlplane#239 hit it 2026-04-23+) and provisions
# start failing with code 81045 once exhausted.
#
# Why a separate workflow vs sweep-stale-e2e-orgs.yml:
# - That workflow operates at the CP layer (DELETE /cp/admin/tenants/:slug
#   drives the cascade). It assumes CP has the org row to drive the
#   deprovision from. It doesn't catch records left behind when CP
#   itself never knew about the tenant (canary scratch, manual ops
#   experiments) or when the cascade's CF-delete branch failed.
# - sweep-cf-orphans.sh enumerates the CF zone directly and matches
#   each record against live CP slugs + AWS EC2 names. It catches
#   leaks the CP-driven sweep can't.
#
# Safety: the script's own MAX_DELETE_PCT gate refuses to nuke more
# than 50% of records in a single run. If something has gone weird
# (CP admin endpoint returns no orgs → every tenant looks orphan) the
# gate halts before damage. Decision-function unit tests in
# scripts/ops/test_sweep_cf_decide.py (#2027) cover the rule
# classifier.

on:
  schedule:
    # Hourly. Mirrors sweep-stale-e2e-orgs cadence so the two janitors
    # converge on the same tick. CF API rate budget is generous (1200
    # req/5min); a single sweep makes ~1 list + N deletes (N<=quota/2).
    - cron: '15 * * * *' # offset from sweep-stale-e2e-orgs (top of hour)
  # No `merge_group:` trigger on purpose. This is a janitor — it doesn't
  # need to gate merges, and including it as written before #2088 fired
  # the full sweep job (or its secret-check) on every PR going through
  # the merge queue, generating one red CI run per merge-queue eval. If
  # this workflow is ever wired up as a required check, re-add
  #   merge_group: { types: [checks_requested] }
  # AND gate the sweep step with `if: github.event_name != 'merge_group'`
  # so merge-queue evals report success without actually running.

# Don't let two sweeps race the same zone. workflow_dispatch during a
# scheduled run would otherwise issue duplicate DELETE calls.
concurrency:
  group: sweep-cf-orphans
  cancel-in-progress: false

permissions:
  contents: read

env:
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  sweep:
    name: Sweep CF orphans
    runs-on: ubuntu-latest
    # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
    continue-on-error: true
    # 3 min surfaces hangs (CF API stall, AWS describe-instances stuck)
    # within one cron interval instead of burning a full tick. Realistic
    # worst case is ~2 min: 4 sequential curls + 1 aws + N×CF-DELETE,
    # each individually capped at 10s by the script's curl -m flag.
    timeout-minutes: 3
    env:
      CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }}
      CF_ZONE_ID: ${{ secrets.CF_ZONE_ID }}
      CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }}
      CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }}
      AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
      AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      AWS_DEFAULT_REGION: us-east-2
      MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '50' }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

      - name: Verify required secrets present
        id: verify
        # Schedule-vs-dispatch behaviour split (hardened 2026-04-28
        # after the silent-no-op incident below):
        #
        # The earlier soft-skip-on-schedule policy hid a real leak. All
        # six secrets were unset on this repo for an unknown duration;
        # every hourly run printed a yellow ::warning:: and exited 0,
        # so the workflow registered as "passing" while doing nothing.
        # CF orphans accumulated to 152/200 (~76% of the zone quota
        # gone) before a manual `dig`-driven audit caught it. Anything
        # that runs as a janitor and reports green while idle is
        # indistinguishable from "the janitor is healthy" — so we now
        # treat schedule (and any future workflow_run/push triggers)
        # as a hard-fail when secrets are missing.
        #
        # - schedule / workflow_run / push → exit 1 (red CI run
        #   surfaces the misconfiguration the next tick)
        # - workflow_dispatch → exit 0 with a warning
        #   (an operator ran this ad-hoc; they already accepted the
        #   state of the repo and want the workflow to short-circuit
        #   so they can rerun after fixing the secret)
        run: |
          missing=()
          for var in CF_API_TOKEN CF_ZONE_ID CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
            if [ -z "${!var:-}" ]; then
              missing+=("$var")
            fi
          done
          if [ ${#missing[@]} -gt 0 ]; then
            if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
              echo "::warning::skipping sweep — secrets not configured: ${missing[*]}"
              echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun."
              echo "skip=true" >> "$GITHUB_OUTPUT"
              exit 0
            fi
            echo "::error::sweep cannot run — required secrets missing: ${missing[*]}"
            echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow."
            echo "::error::a silent skip masked an active CF DNS leak (152/200 zone records) caught only by a manual audit on 2026-04-28; this gate exists to make the gap visible."
            exit 1
          fi
          echo "All required secrets present ✓"
          echo "skip=false" >> "$GITHUB_OUTPUT"

      - name: Run sweep
        if: steps.verify.outputs.skip != 'true'
        # Schedule-vs-dispatch dry-run asymmetry (intentional):
        # - Scheduled runs: github.event.inputs.dry_run is empty →
        #   defaults to "false" below → script runs with --execute
        #   (the whole point of an hourly janitor).
        # - Manual workflow_dispatch: input default is true in the
        #   GitHub-side workflow, so an ad-hoc operator-triggered run
        #   is dry-run by default; they have to flip the toggle to
        #   actually delete.
        # The script's MAX_DELETE_PCT gate (default 50%) is the second
        # line of defense regardless of mode.
        run: |
          set -euo pipefail
          if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then
            echo "Running in dry-run mode — no deletions"
            bash scripts/ops/sweep-cf-orphans.sh
          else
            echo "Running with --execute — will delete identified orphans"
            bash scripts/ops/sweep-cf-orphans.sh --execute
          fi
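The decision function exercised by scripts/ops/test_sweep_cf_decide.py reduces to record-vs-liveness matching. The sketch below is an assumed simplification: the real classifier also consults AWS EC2 names and sits behind the MAX_DELETE_PCT gate, and the function name here is hypothetical:

```shell
# Assumed core of the orphan classifier: a zone record is a delete
# candidate only if its leading label matches no live CP slug.
# (The real script also matches against live EC2 instance names.)
is_orphan() {
  local record=$1 live_slugs=$2   # live_slugs: newline-separated list
  local label=${record%%.*}       # "acme.moleculesai.app" -> "acme"
  if printf '%s\n' "$live_slugs" | grep -qxF "$label"; then
    echo "keep"
  else
    echo "delete"
  fi
}
```

Because the match is exact (`grep -xF`), a live slug that happens to be a prefix of another record's label cannot shield that record from deletion.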
.gitea/workflows/sweep-cf-tunnels.yml (new file, 128 lines)
name: Sweep stale Cloudflare Tunnels

# Ported from .github/workflows/sweep-cf-tunnels.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
#   per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
#   feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#

# Janitor for Cloudflare Tunnels whose backing tenant no longer
# exists. Parallel-shape to sweep-cf-orphans.yml (which sweeps DNS
# records); same justification, different CF resource.
#
# Why this exists separately from sweep-cf-orphans:
# - DNS records live on the zone (`/zones/<id>/dns_records`).
# - Tunnels live on the account (`/accounts/<id>/cfd_tunnel`).
# - Different CF API surface, different scopes; the existing CF
#   token might not have `account:cloudflare_tunnel:edit`. Splitting
#   the workflows keeps each one's secret-presence gate independent
#   so neither silent-skips when the other's secret is missing.
# - Cleaner blast radius — operators can disable one without the
#   other if a regression surfaces.
#
# Safety: the script's MAX_DELETE_PCT gate (default 90% — higher than
# the DNS sweep's 50% because tenant-shaped tunnels are mostly
# orphans by design) refuses to nuke past the threshold.

on:
  schedule:
    # Hourly at :45 — offset from sweep-cf-orphans (:15) so the two
    # janitors don't issue parallel CF API bursts at the same minute.
    - cron: '45 * * * *'

# Don't let two sweeps race the same account.
concurrency:
  group: sweep-cf-tunnels
  cancel-in-progress: false

permissions:
  contents: read

env:
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  sweep:
    name: Sweep CF tunnels
    runs-on: ubuntu-latest
    # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
    continue-on-error: true
    # 30 min cap. Was 5 min on the theory that the only thing that
    # could take >5min is a CF-API hang — but on 2026-05-02 a backlog
    # of 672 stale tunnels accumulated (large staging E2E run + delayed
    # sweep) and the serial `curl -X DELETE` loop (~0.7s/tunnel) needed
    # ~7-8min to drain. The 5-min cap killed the run mid-sweep
    # (cancelled at 424/672, see run 25248788312); a manual rerun
    # finished the remainder fine.
    #
    # The fix is two-part: parallelize the delete loop (8-way xargs in
    # the script — see scripts/ops/sweep-cf-tunnels.sh), AND raise the
    # cap so a one-off backlog doesn't trip a hangs-detector that
    # turned out to be a real-job-too-slow detector. With 8-way
    # parallelism, 600+ tunnels drains in ~60s; 30 min is generous
    # headroom for actual hangs to still surface (and is in line with
    # the sweep-aws-secrets companion job).
    timeout-minutes: 30
    env:
      CF_API_TOKEN: ${{ secrets.CF_API_TOKEN }}
      CF_ACCOUNT_ID: ${{ secrets.CF_ACCOUNT_ID }}
      CP_PROD_ADMIN_TOKEN: ${{ secrets.CP_PROD_ADMIN_TOKEN }}
      CP_STAGING_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_TOKEN }}
      MAX_DELETE_PCT: ${{ github.event.inputs.max_delete_pct || '90' }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

      - name: Verify required secrets present
        id: verify
        # Schedule-vs-dispatch behaviour split mirrors sweep-cf-orphans
        # (hardened 2026-04-28 after the silent-no-op incident: the
        # janitor reported green while doing nothing because secrets
        # were unset, masking a 152/200 zone-record leak). Same
        # principle applies here:
        # - schedule → exit 1 on missing secrets (red CI surfaces it)
        # - workflow_dispatch → exit 0 with warning (operator-driven,
        #   they already accepted the repo state)
        run: |
          missing=()
          for var in CF_API_TOKEN CF_ACCOUNT_ID CP_PROD_ADMIN_TOKEN CP_STAGING_ADMIN_TOKEN; do
            if [ -z "${!var:-}" ]; then
              missing+=("$var")
            fi
          done
          if [ ${#missing[@]} -gt 0 ]; then
            if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
              echo "::warning::skipping sweep — secrets not configured: ${missing[*]}"
              echo "::warning::set them at Settings → Secrets and Variables → Actions, then rerun."
              echo "::warning::CF_API_TOKEN must include account:cloudflare_tunnel:edit scope (separate from the zone:dns:edit scope used by sweep-cf-orphans)."
              echo "skip=true" >> "$GITHUB_OUTPUT"
              exit 0
            fi
            echo "::error::sweep cannot run — required secrets missing: ${missing[*]}"
            echo "::error::set them at Settings → Secrets and Variables → Actions, or disable this workflow."
            echo "::error::CF_API_TOKEN must include account:cloudflare_tunnel:edit scope."
            exit 1
          fi
          echo "All required secrets present ✓"
          echo "skip=false" >> "$GITHUB_OUTPUT"

||||
- name: Run sweep
|
||||
if: steps.verify.outputs.skip != 'true'
|
||||
# Schedule-vs-dispatch dry-run asymmetry mirrors sweep-cf-orphans:
|
||||
# - Scheduled: input empty → "false" → --execute (the whole
|
||||
# point of an hourly janitor).
|
||||
# - Manual workflow_dispatch: input default true → dry-run;
|
||||
# operator must flip it to actually delete.
|
||||
run: |
|
||||
set -euo pipefail
|
||||
if [ "${{ github.event.inputs.dry_run || 'false' }}" = "true" ]; then
|
||||
echo "Running in dry-run mode — no deletions"
|
||||
bash scripts/ops/sweep-cf-tunnels.sh
|
||||
else
|
||||
echo "Running with --execute — will delete identified orphans"
|
||||
bash scripts/ops/sweep-cf-tunnels.sh --execute
|
||||
fi
|
||||
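The 8-way xargs drain described in the comments above lives in scripts/ops/sweep-cf-tunnels.sh, which is outside this diff. A minimal sketch of the fan-out pattern, assuming tunnel IDs arrive one per line; `delete_tunnel` is a hypothetical stand-in for the script's real Cloudflare DELETE call, not its actual internals:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the real per-tunnel DELETE (the actual call
# in the script would be roughly:
#   curl -sS -X DELETE \
#     "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/cfd_tunnel/$1" \
#     -H "Authorization: Bearer $CF_API_TOKEN")
delete_tunnel() {
  echo "deleted $1"
}
export -f delete_tunnel   # make the function visible to xargs' subshells

# 8-way fan-out: xargs keeps 8 workers busy, each handling one ID.
printf 'tun-%d\n' 1 2 3 4 5 6 7 8 9 10 \
  | xargs -P 8 -n 1 -I{} bash -c 'delete_tunnel "$1"' _ {}
```

At ~0.7 s per serial DELETE, 600 tunnels take ~7 min; with 8 workers the same batch completes in roughly a minute, well inside the 30-min cap.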
.gitea/workflows/sweep-stale-e2e-orgs.yml (new file, 243 lines)
@@ -0,0 +1,243 @@
name: Sweep stale e2e-* orgs (staging)

# Ported from .github/workflows/sweep-stale-e2e-orgs.yml on 2026-05-11 per RFC
# internal#219 §1 sweep. Differences from the GitHub version:
# - Dropped `workflow_dispatch.inputs` (the Gitea 1.22.6 parser rejects them
#   per feedback_gitea_workflow_dispatch_inputs_unsupported).
# - Dropped `merge_group:` (no Gitea merge queue).
# - Dropped `environment:` blocks (Gitea has no environments).
# - Workflow-level env.GITHUB_SERVER_URL pinned per
#   feedback_act_runner_github_server_url.
# - `continue-on-error: true` on each job (RFC §1 contract).
#
# Janitor for staging tenants left behind when E2E cleanup didn't run:
# CI cancellations, runner crashes, transient AWS errors mid-cascade,
# a bash trap missed (signal 9), etc. Without this loop, every failed
# teardown leaks an EC2 + DNS + DB row until manual ops cleanup —
# 2026-04-23 staging hit the 64-vCPU AWS quota from ~27 such orphans.
#
# Why not rely on per-test-run teardown:
# - Per-run teardown is best-effort by definition. Any process death
#   after the test starts but before the trap fires leaves debris.
# - GH Actions cancellation kills the runner without a grace period.
#   The workflow's `if: always()` step usually catches this, but it
#   too can fail (CP transient 5xx, runner network issue at the
#   wrong moment).
# - Even when teardown runs, the CP cascade is best-effort in places
#   (cascadeTerminateWorkspaces logs and continues; DNS deletion same).
# - This sweep is the catch-all that converges staging back to clean
#   regardless of which specific path leaked.
#
# The PROPER fix is making CP cleanup transactional + verify-after-terminate
# (filed separately as cleanup-correctness work). This workflow is the
# safety net that catches everything else AND any future leak source we
# haven't yet identified.

on:
  schedule:
    # Every 15 min. E2E orgs are short-lived (~8-25 min wall clock from
    # create to teardown — canary is ~8 min, full SaaS ~25 min). The
    # previous hourly cadence + 120-min stale threshold meant a leaked
    # tenant could keep an EC2 alive for hours, eating ~2 vCPU per
    # leak. Tightening the cadence + threshold reduces the worst-case
    # leak window to ~45 min (15-min sweep cadence + 30-min threshold)
    # without risk of catching in-progress runs (the longest e2e run is
    # the ~25-min full SaaS run, still under the 30-min threshold).
    # See molecule-controlplane#420 for the leak-class accounting that
    # motivated this tightening.
    - cron: '*/15 * * * *'

# Don't let two sweeps fight. Cron + workflow_dispatch could overlap
# on a manual trigger; queue rather than parallel-delete.
concurrency:
  group: sweep-stale-e2e-orgs
  cancel-in-progress: false

permissions:
  contents: read

env:
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  sweep:
    name: Sweep e2e orgs
    runs-on: ubuntu-latest
    # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
    continue-on-error: true
    timeout-minutes: 15
    env:
      MOLECULE_CP_URL: https://staging-api.moleculesai.app
      ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
      MAX_AGE_MINUTES: ${{ github.event.inputs.max_age_minutes || '30' }}
      DRY_RUN: ${{ github.event.inputs.dry_run || 'false' }}
      # Refuse to delete more than this many orgs in one tick. If the
      # CP DB is briefly empty (or the admin endpoint goes weird and
      # returns no created_at), every e2e- org would look stale.
      # Bailing out protects against runaway nukes.
      SAFETY_CAP: 50

    steps:
      - name: Verify admin token present
        run: |
          if [ -z "$ADMIN_TOKEN" ]; then
            echo "::error::MOLECULE_STAGING_ADMIN_TOKEN not set"
            exit 2
          fi
          echo "Admin token present ✓"

      - name: Identify stale e2e orgs
        id: identify
        run: |
          set -euo pipefail
          # Fetch into a file the python step reads from disk —
          # cleaner than embedding $(curl ...) into a heredoc.
          curl -sS --fail-with-body --max-time 30 \
            "$MOLECULE_CP_URL/cp/admin/orgs?limit=500" \
            -H "Authorization: Bearer $ADMIN_TOKEN" \
            > orgs.json

          # Filter:
          # 1. slug starts with one of the ephemeral test prefixes:
          #    - 'e2e-'    — covers e2e-canary-*, e2e-canvas-*, etc.
          #    - 'rt-e2e-' — runtime-test harness fixtures (RFC #2251);
          #      missing this prefix left two such tenants orphaned for
          #      8 h on staging (2026-05-03), which then hard-failed
          #      redeploy-tenants-on-staging and broke the staging→main
          #      auto-promote chain. Kept in sync with the
          #      EPHEMERAL_PREFIX_RE regex in redeploy-tenants-on-staging.yml.
          # 2. created_at is older than MAX_AGE_MINUTES ago.
          # Output one slug per line to a file the next step reads.
          python3 > stale_slugs.txt <<'PY'
          import json, os
          from datetime import datetime, timezone, timedelta

          # SSOT for this list lives in the controlplane Go code:
          # molecule-controlplane/internal/slugs/ephemeral.go
          # (var EphemeralPrefixes). The redeploy-fleet auto-rollout
          # also reads from there to SKIP these slugs — without that
          # filter, fleet redeploy SSM-failed in-flight E2E tenants
          # whose containers were still booting, breaking the test
          # that just spun them up (molecule-controlplane#493).
          # Update both files together.
          EPHEMERAL_PREFIXES = ("e2e-", "rt-e2e-")

          with open("orgs.json") as f:
              data = json.load(f)

          max_age = int(os.environ["MAX_AGE_MINUTES"])
          cutoff = datetime.now(timezone.utc) - timedelta(minutes=max_age)

          for o in data.get("orgs", []):
              slug = o.get("slug", "")
              if not slug.startswith(EPHEMERAL_PREFIXES):
                  continue
              created = o.get("created_at")
              if not created:
                  # Defensively skip rows without created_at — better
                  # to leave one orphan than nuke a brand-new row
                  # whose timestamp didn't render.
                  continue
              # Python 3.11+ handles RFC 3339 'Z' directly via
              # fromisoformat; older runners need the trailing-Z swap.
              created_dt = datetime.fromisoformat(created.replace("Z", "+00:00"))
              if created_dt < cutoff:
                  print(slug)
          PY

          count=$(wc -l < stale_slugs.txt | tr -d ' ')
          echo "Found $count stale e2e org(s) older than ${MAX_AGE_MINUTES}m"
          if [ "$count" -gt 0 ]; then
            echo "First 20:"
            head -20 stale_slugs.txt | sed 's/^/  /'
          fi
          echo "count=$count" >> "$GITHUB_OUTPUT"

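The prefix-and-age filter above can be exercised offline against synthetic rows. A self-contained sketch (the slugs and timestamps below are made up; real input comes from the /cp/admin/orgs response):

```shell
#!/usr/bin/env bash
set -euo pipefail
export MAX_AGE_MINUTES=30

# Synthetic org rows: two old ephemeral, one old non-ephemeral, one fresh.
cat > orgs.json <<'JSON'
{"orgs": [
  {"slug": "e2e-canary-old",  "created_at": "2020-01-01T00:00:00Z"},
  {"slug": "rt-e2e-old",      "created_at": "2020-01-01T00:00:00Z"},
  {"slug": "prod-tenant-old", "created_at": "2020-01-01T00:00:00Z"},
  {"slug": "e2e-fresh",       "created_at": "2999-01-01T00:00:00Z"}
]}
JSON

# Same filter shape as the workflow step: ephemeral prefix AND older
# than the cutoff; rows without created_at would be skipped defensively.
python3 > stale_slugs.txt <<'PY'
import json, os
from datetime import datetime, timezone, timedelta
EPHEMERAL_PREFIXES = ("e2e-", "rt-e2e-")
cutoff = datetime.now(timezone.utc) - timedelta(minutes=int(os.environ["MAX_AGE_MINUTES"]))
for o in json.load(open("orgs.json")).get("orgs", []):
    slug = o.get("slug", "")
    created = o.get("created_at")
    if not slug.startswith(EPHEMERAL_PREFIXES) or not created:
        continue
    if datetime.fromisoformat(created.replace("Z", "+00:00")) < cutoff:
        print(slug)
PY

cat stale_slugs.txt   # only the two old ephemeral slugs survive the filter
```

`prod-tenant-old` is excluded by prefix and `e2e-fresh` by age, so the output is exactly `e2e-canary-old` and `rt-e2e-old`.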
      - name: Safety gate
        if: steps.identify.outputs.count != '0'
        run: |
          count="${{ steps.identify.outputs.count }}"
          if [ "$count" -gt "$SAFETY_CAP" ]; then
            echo "::error::Refusing to delete $count orgs in one sweep (cap=$SAFETY_CAP). Investigate manually — this usually means the CP admin API returned no created_at or returned a degraded result. Re-run with workflow_dispatch + max_age_minutes if intentional."
            exit 1
          fi
          echo "Within safety cap ($count ≤ $SAFETY_CAP) ✓"

      - name: Delete stale orgs
        if: steps.identify.outputs.count != '0' && env.DRY_RUN != 'true'
        run: |
          set -uo pipefail
          deleted=0
          failed=0
          while IFS= read -r slug; do
            [ -z "$slug" ] && continue
            # The DELETE handler requires {"confirm": "<slug>"} matching
            # the URL slug — a fat-finger guard. Idempotent: re-issuing
            # resumes via org_purges.last_step.
            # Tempfile-routed -w + the set +e/set -e bracket prevents
            # curl's exit code from polluting the captured status
            # (lint-curl-status-capture.yml).
            set +e
            curl -sS -o /tmp/del_resp -w "%{http_code}" \
              --max-time 60 \
              -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
              -H "Authorization: Bearer $ADMIN_TOKEN" \
              -H "Content-Type: application/json" \
              -d "{\"confirm\":\"$slug\"}" >/tmp/del_code
            set -e
            # Stderr from curl (-sS shows dial errors etc.) goes to the runner log.
            http_code=$(cat /tmp/del_code 2>/dev/null || echo "000")
            if [ "$http_code" = "200" ] || [ "$http_code" = "204" ]; then
              deleted=$((deleted+1))
              echo "  deleted: $slug"
            else
              failed=$((failed+1))
              echo "  FAILED ($http_code): $slug — $(head -c 200 /tmp/del_resp 2>/dev/null)"
            fi
          done < stale_slugs.txt
          echo ""
          echo "Sweep summary: deleted=$deleted failed=$failed"
          # Don't fail the workflow on per-org delete errors — the
          # sweeper is best-effort, and the next 15-min tick re-attempts.
          # We only fail loud at the safety-cap gate above.

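The set +e / tempfile bracket above is the pattern lint-curl-status-capture.yml enforces. A standalone sketch of why it matters, assuming nothing is listening on 127.0.0.1:1 so the request fails at connect:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Under set -e, a bare `code=$(curl ... -w "%{http_code}")` aborts the
# whole script on any curl failure, losing per-item error accounting.
# Bracketing with set +e and routing the -w output to a tempfile keeps
# the failure observable instead of fatal.
set +e
curl -sS -o /tmp/del_resp -w "%{http_code}" \
  --max-time 5 http://127.0.0.1:1/ >/tmp/del_code 2>/dev/null
set -e

http_code=$(cat /tmp/del_code 2>/dev/null)
http_code=${http_code:-000}   # empty capture also maps to "000"
echo "captured status: $http_code"
```

A connect failure surfaces as status `000` and the script keeps running, which is exactly the behaviour the delete loop relies on.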
      - name: Sweep orphan tunnels
        # Stale-org cleanup deletes the org (which cascades to a tunnel
        # delete inside the CP). But when that cascade fails partway —
        # CP transient 5xx after the org row is deleted but before the
        # CF tunnel delete completes — the tunnel persists with no
        # matching org row. The reconciler in internal/sweep flags this
        # as `cf_tunnel kind=orphan`, but nothing automatically reaps it.
        #
        # `/cp/admin/orphan-tunnels/cleanup` is the operator-triggered
        # reaper. Calling it here at the end of every sweep tick
        # converges the staging CF account to clean even when CP
        # cascades half-fail.
        #
        # PR #492 made the underlying DeleteTunnel actually check the
        # status — pre-fix it silently succeeded on CF code 1022
        # ("active connections"), so this step would have been a no-op
        # against stuck connectors. Post-fix the cleanup invokes
        # CleanupTunnelConnections + retry, which actually clears the
        # 1022 case. (#2987)
        #
        # Best-effort. Failure here doesn't fail the workflow — the next
        # tick re-attempts. Errors flow to step output for ops review.
        if: env.DRY_RUN != 'true'
        run: |
          set +e
          curl -sS -o /tmp/cleanup_resp -w "%{http_code}" \
            --max-time 60 \
            -X POST "$MOLECULE_CP_URL/cp/admin/orphan-tunnels/cleanup" \
            -H "Authorization: Bearer $ADMIN_TOKEN" >/tmp/cleanup_code
          set -e
          http_code=$(cat /tmp/cleanup_code 2>/dev/null || echo "000")
          body=$(head -c 500 /tmp/cleanup_resp 2>/dev/null)
          if [ "$http_code" = "200" ]; then
            count=$(echo "$body" | python3 -c "import sys,json; d=json.loads(sys.stdin.read() or '{}'); print(d.get('deleted_count', 0))" 2>/dev/null || echo "0")
            failed_n=$(echo "$body" | python3 -c "import sys,json; d=json.loads(sys.stdin.read() or '{}'); print(len(d.get('failed') or {}))" 2>/dev/null || echo "0")
            echo "Orphan-tunnel sweep: deleted=$count failed=$failed_n"
          else
            echo "::warning::orphan-tunnels cleanup returned HTTP $http_code — body: $body"
          fi

      - name: Dry-run summary
        if: env.DRY_RUN == 'true'
        run: |
          echo "DRY RUN — would have deleted ${{ steps.identify.outputs.count }} org(s) AND triggered orphan-tunnels cleanup. Re-run with dry_run=false to actually delete."