Merge branch 'main' into fix/plugin-uninstall-exec-errors

Merge PR #2362 via Gitea merge queue
Serialized merge by gitea-merge-queue after current-main, genuine approvals, and required CI checks were green.
2026-06-06 13:30:27 +00:00 · 2026-06-06 12:50:47 +00:00 · 2026-06-06 11:05:42 +00:00 · 2026-06-06 10:50:18 +00:00 · 2026-06-06 10:13:50 +00:00 · 2026-06-06 03:07:06 -07:00
30 changed files with 3208 additions and 179 deletions
@@ -1,16 +1,77 @@
 #!/usr/bin/env python3
 """gitea-merge-queue — conservative serialized merge bot for Gitea.

-Gitea 1.22.6 has auto-merge (`pull_auto_merge`) but no GitHub-style merge
+Gitea 1.22.6+ has auto-merge (`pull_auto_merge`) but no GitHub-style merge
 queue. This script provides the missing serialized policy in user space:

-1. Pick the oldest open PR carrying QUEUE_LABEL.
-2. Refuse to act unless main is green.
+1. Scan open same-repo PRs that are NOT opted out (auto-discovery, see below),
+   oldest-first, skipping drafts, until an ACTIONABLE one is found. A non-ready
+   candidate (REQUEST_CHANGES, mergeable!=True, insufficient genuine approvals,
+   or red required CI) is SKIPPED so it cannot head-of-line block newer ready
+   PRs; the scan continues to the next candidate.
+2. Refuse to act unless main's BP-required contexts are green.
 3. Refuse fork PRs; the queue may only mutate same-repo branches.
 4. If the PR branch does not contain current main, call Gitea's
   /pulls/{n}/update endpoint and stop. CI must rerun on the updated head.
-5. If the updated PR head has all required contexts green, merge with the
-   non-bypass merge actor token.
+5. Merge ONLY when, on the PR's CURRENT head sha:
+     - >= REQUIRED_APPROVALS distinct GENUINE official APPROVED reviews from
+       the recognised reviewer set (not stale, not dismissed, commit_id ==
+       current head), AND
+     - no open official REQUEST_CHANGES on the current head, AND
+     - every BP-required status context is green, AND
+     - the PR is mergeable.
+
+Authoritative gates (fail-closed):
+  - The REQUIRED status contexts come from BRANCH PROTECTION
+    (`status_check_contexts`), not a hand-maintained env list. If branch
+    protection cannot be enumerated, the queue HOLDS (does not merge blindly).
+  - NON-required reds (qa-review, security-review, sop-tier, sop-checklist
+    when not branch-required, E2E Chat, Staging SaaS, ci-arm64-advisory, any
+    continue-on-error job) MUST NOT block. They are reported, never gating.
+  - `force_merge=true` is used ONLY when the merge is blocked *solely* by
+    missing-but-non-required governance contexts (required are green + genuine
+    approvals present). It is NEVER used to bypass a failing REQUIRED context
+    or missing approvals.
+
+Auto-discovery (opt-OUT, label-optional):
+  The queue is SELF-SUSTAINING — a ready PR does NOT need a human (or an agent)
+  to add the `merge-queue` label first. When AUTO_DISCOVER is on (default), the
+  queue enumerates ALL open same-repo PRs and considers any that meets the full
+  merge bar (genuine approvals on current head + BP-required green + mergeable +
+  no open REQUEST_CHANGES). The merge bar above is UNCHANGED; auto-discovery only
+  changes WHICH PRs are considered, not whether they are mergeable.
+
+  This deliberately removes the historical dependency on an agent adding the
+  `merge-queue` label — agent Gitea tokens lack `write:issue` (labels are
+  issue-scoped), so they could never self-label and the queue stalled. The label
+  is now OPTIONAL metadata, not a gate.
+
+  SAFETY is preserved as opt-OUT: any PR carrying an opt-out label
+  (OPT_OUT_LABELS — `merge-queue-hold`, `do-not-auto-merge`, `wip`, `draft` by
+  default) is skipped (never auto-considered, never merged). Draft PRs
+  (draft=true STATE) are also skipped; the literal `draft` LABEL is an
+  additional explicit opt-out a human can apply without converting to a draft.
+  A human who wants to keep a PR out of autonomous merging just adds one of
+  those labels. Setting AUTO_DISCOVER=0 restores the legacy opt-IN behaviour
+  (only PRs already carrying QUEUE_LABEL are considered).
+
+Head-of-line (HOL) safety has two complementary layers:
+  (a) The queue SCANS THROUGH the FIFO candidate list and skips any non-ready
+      PR (REQUEST_CHANGES, mergeable!=True, insufficient genuine approvals, or
+      red required CI) instead of locking on the oldest and waiting, so a PR
+      that can never become ready without human action does not block newer
+      ready PRs.
+  (b) For the candidate the scan acts on, two permanent failure modes HOLD the
+      PR (apply HOLD_LABEL) and let the scan CONTINUE to the next candidate
+      rather than re-selecting the same wedged PR every tick:
+        - a permanent permission/4xx merge error (403/404/405), and
+        - a persistent branch-update conflict (the /update endpoint returns
+          HTTP 409 because the PR branch cannot be merged with main without a
+          manual rebase). A conflict will not self-resolve, so retrying it
+          every tick would HOL-block every ready PR behind it (issue #2352).
+
+Status-fetch is fail-closed: if the combined status for a sha cannot be
+fetched, the PR is skipped this tick (never treated as green).

 The script is intentionally one-PR-per-run. Workflow/cron concurrency should
 serialize invocations so two green PRs cannot merge against the same main.
@@ -40,6 +101,33 @@ WATCH_BRANCH = _env("WATCH_BRANCH", default="main")
 QUEUE_LABEL = _env("QUEUE_LABEL", default="merge-queue")
 HOLD_LABEL = _env("HOLD_LABEL", default="merge-queue-hold")
 UPDATE_STYLE = _env("UPDATE_STYLE", default="merge")
+# Auto-discovery (opt-OUT). When truthy (default), the queue considers ALL open
+# same-repo PRs that meet the merge bar, not only PRs already carrying
+# QUEUE_LABEL — so the queue is self-sustaining without any human/agent labeling
+# (agent tokens lack write:issue and cannot self-label). Set AUTO_DISCOVER=0 to
+# restore the legacy opt-IN behaviour (QUEUE_LABEL required to be considered).
+AUTO_DISCOVER = _env("AUTO_DISCOVER", default="1").strip().lower() not in {
+    "0",
+    "false",
+    "no",
+    "off",
+    "",
+}
+# Opt-OUT labels. A PR carrying ANY of these is skipped (never auto-considered,
+# never merged) — the human escape hatch from autonomous merging. HOLD_LABEL is
+# always included so the existing hold semantics keep working. `do-not-auto-merge`
+# and `wip` let a human keep a PR out of the auto-merge path without removing it.
+# `draft` is included as a literal label too: Gitea draft STATE (draft=true) is
+# already skipped via _issue_is_draft, but a "draft" LABEL is an additional,
+# explicit opt-out signal a human can apply without converting the PR to a draft.
+OPT_OUT_LABELS = {
+    name.strip()
+    for name in _env(
+        "OPT_OUT_LABELS",
+        default="do-not-auto-merge,wip,draft",
+    ).split(",")
+    if name.strip()
+} | ({HOLD_LABEL} if HOLD_LABEL else set())
 REQUIRED_CONTEXTS_RAW = _env(
    "REQUIRED_CONTEXTS",
    default=(
@@ -57,6 +145,24 @@ PUSH_REQUIRED_CONTEXTS_RAW = _env(
    default="CI / all-required (push)",
 )

+# Recognised official-reviewer set. A merge requires this many DISTINCT genuine
+# approvals (not stale/dismissed, on the current head sha) from accounts in
+# this set. The set is the real agents-team reviewer roster; founder/CTO-agent
+# accounts are intentionally excluded so the queue cannot be satisfied by a
+# human/owner approval alone — it must be a genuine peer review.
+REVIEWER_SET = {
+    name.strip()
+    for name in _env(
+        "REVIEWER_SET",
+        default="agent-reviewer,agent-researcher,agent-reviewer-cr2",
+    ).split(",")
+    if name.strip()
+}
+# Default mirrors molecule-core branch protection (required_approvals: 2). The
+# authoritative value is read from branch protection at runtime; this is only
+# the fallback when BP does not specify one.
+REQUIRED_APPROVALS_DEFAULT = int(_env("REQUIRED_APPROVALS", default="2") or "2")
+
 OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
 API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""

@@ -67,7 +173,27 @@ class ApiError(RuntimeError):

 class MergePermissionError(ApiError):
    """Merge failed with a permanent permission error (403/404/405).
-    The queue should skip this PR and move to the next one."""
+    The queue should HOLD this PR and move to the next one."""
+
+
+class BranchUpdateConflictError(ApiError):
+    """Updating the PR branch with the base hit a merge-conflict (HTTP 409).
+
+    A true merge-conflict is NOT transient: the branch cannot be auto-updated
+    until a human/agent rebases it. The queue should HOLD this PR (apply
+    HOLD_LABEL) and advance to the next candidate, exactly like the permission
+    path — otherwise the conflicted PR sits at the queue head and is retried
+    every tick forever, head-of-line-blocking every ready PR behind it.
+
+    NOTE: distinct from mergeable=None, which is Gitea STILL COMPUTING conflict
+    state — that case is handled as a transient WAIT (no hold). This error is
+    only raised on an explicit 409 returned by the /update endpoint."""
+
+
+class BranchProtectionUnavailable(ApiError):
+    """Branch protection (the authoritative required-context source) could not
+    be enumerated. The queue must HOLD rather than merge with an unverified
+    required-context set (fail-closed, no fail-open)."""


@dataclasses.dataclass(frozen=True)
@@ -75,6 +201,20 @@ class MergeDecision:
    ready: bool
    action: str
    reason: str
+    # When ready is True, force indicates the merge is blocked SOLELY by
+    # missing-but-non-required governance contexts (required are green +
+    # genuine approvals present), so force_merge=true is justified to bypass
+    # ONLY those non-required contexts. Defaults False.
+    force: bool = False
+
+
+@dataclasses.dataclass(frozen=True)
+class BranchProtection:
+    """The subset of branch protection the queue depends on."""
+
+    required_contexts: list[str]
+    required_approvals: int
+    block_on_rejected_reviews: bool


 def _require_runtime_env() -> None:
@@ -191,6 +331,117 @@ def required_contexts_green(
    return not missing_or_bad, missing_or_bad


+def parse_branch_protection(body: Any) -> BranchProtection:
+    """Extract the queue-relevant fields from a branch_protections payload.
+
+    Fail-closed: raises BranchProtectionUnavailable when status checks are
+    expected but the required-context list cannot be enumerated. We never fall
+    back to a hand-maintained env list as the authoritative required set —
+    doing so risks merging when a real required context is red/missing.
+    """
+    if not isinstance(body, dict):
+        raise BranchProtectionUnavailable("branch protection response not an object")
+    enable = bool(body.get("enable_status_check"))
+    contexts_raw = body.get("status_check_contexts")
+    if not enable:
+        # Status checks not enforced by BP at all. With no required contexts
+        # the queue would gate on approvals only — acceptable, but make it
+        # explicit and let the caller decide.
+        contexts: list[str] = []
+    else:
+        if not isinstance(contexts_raw, list):
+            raise BranchProtectionUnavailable(
+                "enable_status_check is true but status_check_contexts is not a list"
+            )
+        contexts = [c for c in contexts_raw if isinstance(c, str) and c.strip()]
+        if not contexts:
+            raise BranchProtectionUnavailable(
+                "enable_status_check is true but status_check_contexts is empty"
+            )
+    approvals = body.get("required_approvals")
+    required_approvals = (
+        int(approvals) if isinstance(approvals, int) else REQUIRED_APPROVALS_DEFAULT
+    )
+    return BranchProtection(
+        required_contexts=contexts,
+        required_approvals=required_approvals,
+        block_on_rejected_reviews=bool(body.get("block_on_rejected_reviews")),
+    )
+
+
+def get_branch_protection(branch: str) -> BranchProtection:
+    """Fetch branch protection for `branch`; fail-closed if unavailable."""
+    try:
+        _, body = api("GET", f"/repos/{OWNER}/{NAME}/branch_protections/{branch}")
+    except ApiError as exc:
+        raise BranchProtectionUnavailable(
+            f"could not fetch branch protection for {branch}: {exc}"
+        ) from exc
+    return parse_branch_protection(body)
+
+
+def genuine_approvals(
+    reviews: list[dict],
+    *,
+    head_sha: str,
+    reviewer_set: set[str],
+) -> tuple[set[str], list[str]]:
+    """Reduce a PR's reviews to genuine official approvals on the CURRENT head.
+
+    Returns (approvers, request_changes) where:
+      - approvers is the set of distinct logins (in reviewer_set) whose LATEST
+        review on the current head is an official, non-stale, non-dismissed
+        APPROVED, and
+      - request_changes is the list of logins (in reviewer_set) whose latest
+        official review on the current head is REQUEST_CHANGES.
+
+    "Current head" is enforced two ways, because Gitea exposes both signals:
+    a review must be `official` and NOT `stale`/`dismissed`, AND when the
+    review carries a commit_id it must equal head_sha. A review with no
+    commit_id but stale=False/dismissed=False is accepted (older Gitea rows).
+    We take each reviewer's LATEST submission (reviews arrive oldest-first), so
+    a later REQUEST_CHANGES correctly supersedes an earlier APPROVED and vice
+    versa.
+    """
+    latest_by_user: dict[str, dict] = {}
+    for review in reviews:
+        if not isinstance(review, dict):
+            continue
+        user = (review.get("user") or {}).get("login")
+        if not isinstance(user, str) or user not in reviewer_set:
+            continue
+        state = str(review.get("state") or "").upper()
+        if state not in {"APPROVED", "REQUEST_CHANGES"}:
+            continue  # ignore COMMENT/PENDING/DISMISSED-state rows
+        # reviews are returned oldest-first; later entries overwrite → latest wins
+        latest_by_user[user] = review
+
+    approvers: set[str] = set()
+    request_changes: list[str] = []
+    for user, review in latest_by_user.items():
+        if not review.get("official"):
+            continue
+        if review.get("stale") or review.get("dismissed"):
+            continue
+        commit_id = review.get("commit_id")
+        if isinstance(commit_id, str) and commit_id and head_sha:
+            if commit_id != head_sha:
+                continue  # review was on a previous head
+        state = str(review.get("state") or "").upper()
+        if state == "APPROVED":
+            approvers.add(user)
+        elif state == "REQUEST_CHANGES":
+            request_changes.append(user)
+    return approvers, request_changes
+
+
+def get_pull_reviews(pr_number: int) -> list[dict]:
+    _, body = api("GET", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/reviews")
+    if not isinstance(body, list):
+        raise ApiError(f"PR #{pr_number} reviews response not list")
+    return body
+
+
 def label_names(issue: dict) -> set[str]:
    return {
        label["name"]
@@ -219,6 +470,85 @@ def choose_next_queued_issue(
    return candidates[0] if candidates else None


+def _issue_is_draft(issue: dict) -> bool:
+    """True if the issue/PR is a draft.
+
+    The /issues listing exposes draft state under the `pull_request` sub-object
+    (`{"draft": true}`); some Gitea versions also surface a top-level `draft`.
+    Either is honoured. Drafts are never auto-considered for merging.
+    """
+    pr = issue.get("pull_request")
+    if isinstance(pr, dict) and pr.get("draft") is True:
+        return True
+    return issue.get("draft") is True
+
+
+def choose_candidate_issues(
+    issues: list[dict],
+    *,
+    queue_label: str,
+    opt_out_labels: set[str],
+    auto_discover: bool,
+) -> list[dict]:
+    """All open PRs eligible for a merge attempt this tick, oldest-first.
+
+    This is the auto-discovery selector. It does NOT change the merge bar — it
+    only changes WHICH PRs are considered:
+
+      - auto_discover=True (default): every open same-repo PR is a candidate,
+        EXCEPT those carrying an opt-out label or marked draft. The QUEUE_LABEL
+        is optional metadata, not a gate, so a ready PR reaches the queue with no
+        human/agent labeling (the write:issue gap is removed).
+      - auto_discover=False: legacy opt-IN — only PRs carrying queue_label are
+        candidates (still skipping opt-out labels and drafts).
+
+    Opt-out is the safety escape hatch: any opt_out_labels member present skips
+    the PR entirely (never considered, never merged). Ordering is oldest-first
+    (created_at, then number) to preserve the serialized FIFO ordering.
+
+    Returns the FULL ordered list (not just the head) so process_once can SCAN
+    THROUGH non-ready candidates instead of locking on the oldest. A non-ready
+    auto-discovered PR (e.g. one with REQUEST_CHANGES or mergeable=false, which
+    can never become ready without human action) must NOT head-of-line block the
+    newer ready PRs behind it — the readiness check happens per-candidate in
+    process_once, and a `wait` candidate is skipped to the next one.
+    """
+    candidates = []
+    for issue in issues:
+        if "pull_request" not in issue:
+            continue
+        labels = label_names(issue)
+        if opt_out_labels & labels:
+            continue  # opt-out: human kept this PR out of autonomous merging
+        if _issue_is_draft(issue):
+            continue  # drafts are never auto-merged
+        if not auto_discover and queue_label not in labels:
+            continue  # legacy opt-IN: require the queue label
+        candidates.append(issue)
+    candidates.sort(key=lambda issue: (issue.get("created_at") or "", int(issue["number"])))
+    return candidates
+
+
+def choose_next_candidate_issue(
+    issues: list[dict],
+    *,
+    queue_label: str,
+    opt_out_labels: set[str],
+    auto_discover: bool,
+) -> dict | None:
+    """The oldest eligible candidate, or None. Thin head-of-list wrapper around
+    choose_candidate_issues; retained for callers/tests that only want the head.
+    process_once uses the full list (choose_candidate_issues) so it can scan past
+    non-ready PRs rather than HOL-block on the oldest."""
+    candidates = choose_candidate_issues(
+        issues,
+        queue_label=queue_label,
+        opt_out_labels=opt_out_labels,
+        auto_discover=auto_discover,
+    )
+    return candidates[0] if candidates else None
+
+
 def pr_contains_base_sha(commits: list[dict], base_sha: str) -> bool:
    for commit in commits:
        sha = commit.get("sha") or commit.get("id")
@@ -233,36 +563,87 @@ def pr_has_current_base(pr: dict, commits: list[dict], main_sha: str) -> bool:
    return pr_contains_base_sha(commits, main_sha)


+def _non_required_red_present(
+    latest: dict[str, dict],
+    required_contexts: list[str],
+) -> bool:
+    """True if any NON-required context is non-success.
+
+    Such reds are the governance/SOP/advisory checks Gitea may still treat as
+    "missing required context" at merge time even though branch protection does
+    not require them. Their presence is what justifies force_merge=true (we
+    have already verified every REQUIRED context is green and approvals are
+    genuine, so force only bypasses these non-required reds).
+    """
+    required = set(required_contexts)
+    for context, status in latest.items():
+        if context in required:
+            continue
+        if status_state(status) != "success":
+            return True
+    return False
+
+
 def evaluate_merge_readiness(
    *,
    main_status: dict,
    pr_status: dict,
    required_contexts: list[str],
+    required_approvals: int,
+    approvers: set[str],
+    request_changes: list[str],
    pr_has_current_base: bool,
+    mergeable: bool,
    pr_labels: set[str] | None = None,
 ) -> MergeDecision:
-    # Check push-required contexts explicitly instead of combined state.
-    # Combined state can be "failure" due to non-blocking jobs
-    # (continue-on-error: true) that don't actually gate merges.
-    # CI / all-required (push) is the authoritative gate — it respects
-    # continue-on-error and correctly aggregates all blocking failures.
+    # 1) Main's push-required contexts must be green. Combined state can be
+    #    "failure" due to non-blocking jobs (continue-on-error: true) that do
+    #    not gate merges, so check the explicit required set, not combined.
    main_latest = latest_statuses_by_context(main_status.get("statuses") or [])
    main_ok, main_bad = required_contexts_green(main_latest, push_required_contexts())
    if not main_ok:
        return MergeDecision(False, "pause", "main required contexts not green: " + ", ".join(main_bad))
+
+    # 2) PR head must contain current main.
    if not pr_has_current_base:
        return MergeDecision(False, "update", "PR head does not contain current main")

-    # Check explicit required contexts instead of combined state. Combined state
-    # can be "failure" due to non-blocking jobs with continue-on-error: true
-    # (e.g. publish-runtime-autobump/pr-validate, qa-review on stale tokens).
-    # The required_contexts list is the authoritative gate — it includes only
-    # the checks that actually block merges.
+    # 3) No open official REQUEST_CHANGES on the current head.
+    if request_changes:
+        return MergeDecision(
+            False, "wait",
+            "open REQUEST_CHANGES on current head from: " + ", ".join(sorted(request_changes)),
+        )
+
+    # 4) Enough distinct genuine official approvals on the current head.
+    if len(approvers) < required_approvals:
+        return MergeDecision(
+            False, "wait",
+            f"insufficient genuine approvals on current head: have "
+            f"{len(approvers)} ({', '.join(sorted(approvers)) or 'none'}), "
+            f"need {required_approvals}",
+        )
+
+    # 5) Every BRANCH-PROTECTION-REQUIRED status context must be green. This is
+    #    the authoritative status gate — NON-required reds (qa-review,
+    #    security-review, sop-tier/sop-checklist when not BP-required, E2E Chat,
+    #    Staging SaaS, ci-arm64-advisory, continue-on-error jobs) are NOT
+    #    consulted here and must not block.
    latest = latest_statuses_by_context(pr_status.get("statuses") or [])
    ok, missing_or_bad = required_contexts_green(latest, required_contexts, pr_labels)
    if not ok:
        return MergeDecision(False, "wait", "required contexts not green: " + ", ".join(missing_or_bad))
-    return MergeDecision(True, "merge", "ready")
+
+    # 6) Gitea must consider the PR mergeable (no conflicts).
+    if not mergeable:
+        return MergeDecision(False, "wait", "PR is not mergeable (conflicts)")
+
+    # Ready. Use force_merge ONLY if the merge would otherwise be blocked by
+    # missing-but-non-required governance contexts. Required are green and
+    # approvals are genuine, so force only bypasses non-required reds — never a
+    # failing required context or missing approval.
+    force = _non_required_red_present(latest, required_contexts)
+    return MergeDecision(True, "merge", "ready", force=force)


 def get_branch_head(branch: str) -> str:
@@ -280,6 +661,12 @@ def get_combined_status(sha: str) -> dict:
    The /status endpoint caps the `statuses` array at 30 entries (Gitea
    default page size), so we fetch the full list via /statuses with a
    higher limit. The combined `state` still comes from /status.
+
+    Fail-closed: the PRIMARY /status fetch must succeed. If it raises, the
+    error propagates so the caller skips this PR this tick (we never treat a
+    failed status fetch as green — dev-sop "no fail-open"). Only the SECONDARY
+    /statuses enrichment (which merely extends the per-context list beyond the
+    30-entry cap) is best-effort; if it fails we still have the combined set.
    """
    _, combined = api("GET", f"/repos/{OWNER}/{NAME}/commits/{sha}/status")
    if not isinstance(combined, dict):
@@ -329,6 +716,31 @@ def list_queued_issues() -> list[dict]:
    return body


+def list_candidate_issues(*, auto_discover: bool) -> list[dict]:
+    """Open PR issues eligible for consideration this tick.
+
+    With auto_discover=True (default) this enumerates ALL open PRs (no label
+    filter) so the queue is self-sustaining — a ready PR is considered without
+    any human/agent first adding QUEUE_LABEL. With auto_discover=False it falls
+    back to the legacy label-filtered listing (opt-IN). Opt-out filtering and
+    draft-skipping happen in choose_next_candidate_issue, not here.
+    """
+    if not auto_discover:
+        return list_queued_issues()
+    _, body = api(
+        "GET",
+        f"/repos/{OWNER}/{NAME}/issues",
+        query={
+            "state": "open",
+            "type": "pulls",
+            "limit": "50",
+        },
+    )
+    if not isinstance(body, list):
+        raise ApiError("candidate issues response not list")
+    return body
+
+
 def get_pull(pr_number: int) -> dict:
    _, body = api("GET", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}")
    if not isinstance(body, dict):
@@ -354,30 +766,97 @@ def update_pull(pr_number: int, *, dry_run: bool) -> None:
    print(f"::notice::updating PR #{pr_number} with base branch via style={UPDATE_STYLE}")
    if dry_run:
        return
+    try:
+        api(
+            "POST",
+            f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/update",
+            query={"style": UPDATE_STYLE},
+            expect_json=False,
+        )
+    except ApiError as exc:
+        # Gitea returns HTTP 409 when the base cannot be merged into the PR
+        # branch because of a real conflict. The queue cannot auto-resolve a
+        # conflict, so re-raise as BranchUpdateConflictError; process_once HOLDs
+        # the PR and advances (HOL guard) instead of retrying it forever.
+        # Match the HTTP STATUS token ("-> HTTP 409") specifically, not a bare
+        # "409" substring — the PR number or path can itself contain "409"
+        # (e.g. /pulls/1409/update) and must not be misread as a conflict.
+        if "-> HTTP 409" in str(exc):
+            raise BranchUpdateConflictError(str(exc)) from exc
+        raise  # re-raise other ApiErrors unchanged
+
+
+def add_label_by_name(pr_number: int, label_name: str, *, dry_run: bool) -> None:
+    """Apply an existing repo label (by name) to a PR/issue.
+
+    Used to HOLD a wedged PR so the queue advances. Resolves the label id from
+    the repo label set; if the label does not exist, raises ApiError (the
+    caller decides whether that is fatal).
+    """
+    print(f"::notice::applying label '{label_name}' to PR #{pr_number}")
+    if dry_run:
+        return
+    _, labels = api("GET", f"/repos/{OWNER}/{NAME}/labels", query={"limit": "100"})
+    label_id = None
+    if isinstance(labels, list):
+        for label in labels:
+            if isinstance(label, dict) and label.get("name") == label_name:
+                label_id = label.get("id")
+                break
+    if label_id is None:
+        raise ApiError(f"label '{label_name}' not found in repo {OWNER}/{NAME}")
    api(
        "POST",
-        f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/update",
-        query={"style": UPDATE_STYLE},
-        expect_json=False,
+        f"/repos/{OWNER}/{NAME}/issues/{pr_number}/labels",
+        body={"labels": [label_id]},
    )


-def merge_pull(pr_number: int, *, dry_run: bool) -> None:
-    payload = {
+def hold_pr(pr_number: int, hold_note: str, *, dry_run: bool) -> None:
+    """Apply HOLD_LABEL to a wedged PR so the queue advances past it.
+
+    choose_next_queued_issue skips HOLD_LABEL-bearing PRs, so this is the HOL
+    guard: a PR the queue cannot make progress on (permanent permission error
+    or unresolvable branch-update conflict) is held and a human/agent fixes it,
+    rather than the queue re-selecting it every tick forever. If the label
+    cannot be applied we still post the explanatory comment so the wedge is at
+    least visible — but we never loop on the PR.
+    """
+    try:
+        add_label_by_name(pr_number, HOLD_LABEL, dry_run=dry_run)
+    except ApiError as label_exc:
+        sys.stderr.write(
+            f"::error::could not apply HOLD_LABEL to PR #{pr_number}: {label_exc}\n"
+        )
+        hold_note += (
+            f"\n\n(NOTE: could not apply the hold label automatically: "
+            f"{label_exc}. Please add `{HOLD_LABEL}` manually.)"
+        )
+    post_comment(pr_number, hold_note, dry_run=dry_run)
+
+
+def merge_pull(pr_number: int, *, dry_run: bool, force: bool = False) -> None:
+    payload: dict[str, Any] = {
        "Do": "merge",
        "MergeTitleField": f"Merge PR #{pr_number} via Gitea merge queue",
        "MergeMessageField": (
            "Serialized merge by gitea-merge-queue after current-main, "
-            "SOP, and required CI checks were green."
+            "genuine approvals, and required CI checks were green."
        ),
    }
-    print(f"::notice::merging PR #{pr_number}")
+    if force:
+        # force_merge bypasses ONLY missing-but-non-required governance
+        # contexts. The caller has already verified required contexts are green
+        # and genuine approvals are present, so this never bypasses a failing
+        # required context or an approval shortfall.
+        payload["force_merge"] = True
+    print(f"::notice::merging PR #{pr_number}{' (force_merge: non-required reds)' if force else ''}")
    if dry_run:
        return
    try:
        api("POST", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/merge", body=payload, expect_json=False)
    except ApiError as exc:
-        # Re-raise permission-like errors so process_once can skip this PR.
+        # Re-raise permission-like errors so process_once can HOLD this PR.
        # 403 = no push access, 404 = repo/pr not found, 405 = not allowed.
        msg = str(exc)
        for code in ("403", "404", "405"):
@@ -387,7 +866,25 @@ def merge_pull(pr_number: int, *, dry_run: bool) -> None:


 def process_once(*, dry_run: bool = False) -> int:
-    contexts = required_contexts(REQUIRED_CONTEXTS_RAW)
+    # Required status contexts come from BRANCH PROTECTION, not a hand-kept env
+    # list. Fail-closed: if BP cannot be enumerated, HOLD the whole tick rather
+    # than merge against an unverified required set.
+    try:
+        bp = get_branch_protection(WATCH_BRANCH)
+    except BranchProtectionUnavailable as exc:
+        sys.stderr.write(
+            f"::error::queue held: branch protection for {WATCH_BRANCH} "
+            f"unavailable (fail-closed): {exc}\n"
+        )
+        return 0
+    contexts = bp.required_contexts
+    required_approvals = bp.required_approvals
+    print(
+        f"::notice::queue policy from branch protection: "
+        f"required_approvals={required_approvals} "
+        f"required_contexts={contexts or '[none]'}"
+    )
+
    main_sha = get_branch_head(WATCH_BRANCH)
    main_status = get_combined_status(main_sha)
    # Check push-required contexts explicitly instead of combined state.
@@ -398,83 +895,199 @@ def process_once(*, dry_run: bool = False) -> int:
        print(f"::notice::queue paused: {WATCH_BRANCH}@{main_sha[:8]} required contexts not green: {', '.join(main_bad)}")
        return 0

-    issue = choose_next_queued_issue(
-        list_queued_issues(),
+    candidates = choose_candidate_issues(
+        list_candidate_issues(auto_discover=AUTO_DISCOVER),
        queue_label=QUEUE_LABEL,
-        hold_label=HOLD_LABEL,
+        opt_out_labels=OPT_OUT_LABELS,
+        auto_discover=AUTO_DISCOVER,
    )
-    if not issue:
-        print("::notice::merge queue empty")
+    if not candidates:
+        print(
+            "::notice::no merge candidates "
+            f"(auto_discover={'on' if AUTO_DISCOVER else 'off'})"
+        )
        return 0

+    # HOL fix: SCAN THROUGH the FIFO candidate list until a PR we can ACT on is
+    # found, instead of locking on the oldest and waiting. A non-ready candidate
+    # (decision.action == "wait": REQUEST_CHANGES, mergeable!=True, insufficient
+    # genuine approvals, or red required CI) is SKIPPED — it must NOT head-of-line
+    # block the newer ready PRs behind it. The merge bar is unchanged: a skipped
+    # PR is never merged, and the first ACTIONABLE candidate (an "update" that
+    # advances a stale branch, or a fully-ready "merge") terminates the scan.
+    #
+    # `update` is treated as actionable, not skippable: a PR whose head merely
+    # lacks current main is in a legitimate in-progress state (updating it +
+    # rerunning CI moves it toward ready), unlike a PR that can never become
+    # ready without a human (RC / conflict), which is a `wait` and gets skipped.
+    for issue in candidates:
+        decision, ctx = _evaluate_candidate(
+            issue,
+            main_sha=main_sha,
+            main_status=main_status,
+            required_contexts=contexts,
+            required_approvals=required_approvals,
+            dry_run=dry_run,
+        )
+        if decision is None:
+            continue  # not merge-eligible (not-open / opted-out / fork / wrong base)
+        pr_number = ctx["pr_number"]
+        print(f"::notice::PR #{pr_number} decision={decision.action}: {decision.reason}")
+        if decision.action == "wait":
+            # Non-ready: skip to the next candidate (no HOL block, no merge).
+            continue
+        if decision.action == "update":
+            try:
+                update_pull(pr_number, dry_run=dry_run)
+            except BranchUpdateConflictError as exc:
+                # The branch cannot be updated with main because of a real
+                # conflict (HTTP 409 from /update). This is the #2352 HOL guard:
+                # a conflict will not self-resolve without a human/agent rebase,
+                # so re-attempting the update every tick would head-of-line block
+                # every ready PR behind it. HOLD this PR (apply HOLD_LABEL, which
+                # is an opt-out label so later ticks skip it) and CONTINUE the
+                # scan so a newer ready PR can still merge this tick. Fail-closed:
+                # a held PR is skipped, never merged.
+                sys.stderr.write(
+                    f"::error::branch-update conflict for PR #{pr_number}: {exc}\n"
+                )
+                hold_note = (
+                    "merge-queue: could not update this branch with "
+                    f"`{WATCH_BRANCH}` — the update returned a merge conflict "
+                    f"(HTTP 409) that the queue cannot auto-resolve ({exc}). "
+                    f"Applied `{HOLD_LABEL}` to unblock the queue (HOL guard). "
+                    f"Fix: rebase/merge `{WATCH_BRANCH}` into this branch and "
+                    f"resolve the conflicts, then remove `{HOLD_LABEL}` to requeue."
+                )
+                hold_pr(pr_number, hold_note, dry_run=dry_run)
+                continue  # held — keep scanning for a mergeable candidate
+            post_comment(
+                pr_number,
+                (
+                    f"merge-queue: updated this branch with `{WATCH_BRANCH}` at "
+                    f"`{main_sha[:12]}`. Waiting for CI on the refreshed head."
+                ),
+                dry_run=dry_run,
+            )
+            return 0
+        if decision.ready:
+            latest_main_sha = get_branch_head(WATCH_BRANCH)
+            if latest_main_sha != main_sha:
+                print(
+                    f"::notice::main moved {main_sha[:8]} -> {latest_main_sha[:8]}; "
+                    "deferring to next tick"
+                )
+                return 0
+            try:
+                merge_pull(pr_number, dry_run=dry_run, force=decision.force)
+            except MergePermissionError as exc:
+                # Permanent merge failure (HTTP 403/404/405). HOLD this PR by
+                # applying HOLD_LABEL (it becomes an opt-out label, so subsequent
+                # ticks skip it) and CONTINUE scanning so the queue still advances
+                # to the next ready PR this tick rather than stalling.
+                sys.stderr.write(f"::error::merge permission error for PR #{pr_number}: {exc}\n")
+                hold_note = (
+                    "merge-queue: merge failed with a permanent permission error "
+                    f"({exc}). No available token has Can-merge permission for this "
+                    f"PR. Applied `{HOLD_LABEL}` to unblock the queue (HOL guard). "
+                    f"Fix: grant Can-merge to the queue token, then remove "
+                    f"`{HOLD_LABEL}` to requeue."
+                )
+                try:
+                    add_label_by_name(pr_number, HOLD_LABEL, dry_run=dry_run)
+                except ApiError as label_exc:
+                    # If we cannot even apply the hold label, fall back to a comment
+                    # so the wedge is at least visible; do NOT loop on this PR.
+                    sys.stderr.write(
+                        f"::error::could not apply HOLD_LABEL to PR #{pr_number}: {label_exc}\n"
+                    )
+                    hold_note += (
+                        f"\n\n(NOTE: could not apply the hold label automatically: "
+                        f"{label_exc}. Please add `{HOLD_LABEL}` manually.)"
+                    )
+                post_comment(pr_number, hold_note, dry_run=dry_run)
+                continue  # held — keep scanning for a mergeable candidate
+            return 0
+    return 0
+
+
+def _evaluate_candidate(
+    issue: dict,
+    *,
+    main_sha: str,
+    main_status: dict,
+    required_contexts: list[str],
+    required_approvals: int,
+    dry_run: bool,
+) -> tuple[MergeDecision | None, dict]:
+    """Evaluate a single auto-discovered candidate against the full merge bar.
+
+    Returns (decision, ctx) where ctx carries {"pr_number"}. A None decision
+    means the PR is not merge-eligible at all (not open / opted-out / draft /
+    fork / wrong base) and the caller should skip to the next candidate; for
+    fork / wrong-base the explanatory comment is posted here before returning.
+
+    The merge bar is UNCHANGED from the single-PR path — this only factors the
+    per-PR evaluation out so process_once can scan multiple candidates. A failed
+    status fetch still raises (fail-closed): it propagates to the caller so the
+    PR is never treated as green.
+    """
    pr_number = int(issue["number"])
+    ctx = {"pr_number": pr_number}
    pr = get_pull(pr_number)
    if pr.get("state") != "open":
        print(f"::notice::PR #{pr_number} is not open; skipping")
-        return 0
+        return None, ctx
+    # Defensive opt-out/draft re-check on the authoritative pull payload: the
+    # /issues listing's label/draft view can lag, but the merge bar must respect
+    # the live pull state. (choose_candidate_issues already filtered on the
+    # listing; this guards against a stale listing racing a just-added opt-out.)
+    if OPT_OUT_LABELS & label_names(pr):
+        print(f"::notice::PR #{pr_number} carries an opt-out label; skipping")
+        return None, ctx
+    if pr.get("draft") is True:
+        print(f"::notice::PR #{pr_number} is a draft; skipping")
+        return None, ctx
    if pr.get("base", {}).get("ref") != WATCH_BRANCH:
        post_comment(pr_number, f"merge-queue: skipped; base branch is not `{WATCH_BRANCH}`.", dry_run=dry_run)
-        return 0
+        return None, ctx
    if pr.get("head", {}).get("repo_id") != pr.get("base", {}).get("repo_id"):
        post_comment(pr_number, "merge-queue: skipped; fork PRs are not supported by the serialized queue.", dry_run=dry_run)
-        return 0
+        return None, ctx

    head_sha = pr.get("head", {}).get("sha")
    if not isinstance(head_sha, str) or len(head_sha) < 7:
        raise ApiError(f"PR #{pr_number} missing head sha")
    commits = get_pull_commits(pr_number)
    current_base = pr_has_current_base(pr, commits, main_sha)
+    # Fail-closed: a failed status fetch raises here and propagates (the PR is
+    # never treated as green).
    pr_status = get_combined_status(head_sha)
    pr_labels = label_names(pr)
+    # FAIL-CLOSED: Gitea returns mergeable=None (or omits the field) while it is
+    # still COMPUTING conflict state. Only the literal True is decisive proof the
+    # PR is conflict-free; None and False both mean "not (yet) mergeable". We must
+    # NOT autonomously merge on an unknown — treat anything but True as not-yet-
+    # mergeable so evaluate_merge_readiness returns a "wait" decision.
+    mergeable = pr.get("mergeable") is True
+
+    reviews = get_pull_reviews(pr_number)
+    approvers, request_changes = genuine_approvals(
+        reviews, head_sha=head_sha, reviewer_set=REVIEWER_SET
+    )
+
    decision = evaluate_merge_readiness(
        main_status=main_status,
        pr_status=pr_status,
-        required_contexts=contexts,
+        required_contexts=required_contexts,
+        required_approvals=required_approvals,
+        approvers=approvers,
+        request_changes=request_changes,
        pr_has_current_base=current_base,
+        mergeable=mergeable,
        pr_labels=pr_labels,
    )
-
-    print(f"::notice::PR #{pr_number} decision={decision.action}: {decision.reason}")
-    if decision.action == "update":
-        update_pull(pr_number, dry_run=dry_run)
-        post_comment(
-            pr_number,
-            (
-                f"merge-queue: updated this branch with `{WATCH_BRANCH}` at "
-                f"`{main_sha[:12]}`. Waiting for CI on the refreshed head."
-            ),
-            dry_run=dry_run,
-        )
-        return 0
-    if decision.ready:
-        latest_main_sha = get_branch_head(WATCH_BRANCH)
-        if latest_main_sha != main_sha:
-            print(
-                f"::notice::main moved {main_sha[:8]} -> {latest_main_sha[:8]}; "
-                "deferring to next tick"
-            )
-            return 0
-        try:
-            merge_pull(pr_number, dry_run=dry_run)
-        except MergePermissionError as exc:
-            # Permanent merge failure (HTTP 403/404/405). Post a comment so
-            # maintainers know why, then return 0 so this tick is done.
-            # The PR stays in the queue; future ticks can retry after the
-            # permission issue is resolved.
-            sys.stderr.write(f"::error::merge permission error for PR #{pr_number}: {exc}\n")
-            post_comment(
-                pr_number,
-                (
-                    "merge-queue: merge failed with HTTP 405 'User not allowed to merge PR'. "
-                    "No available token has Can-merge permission on this repo. "
-                    "Fix: grant Can-merge to a token, or add a maintain/admin collaborator. "
-                    "Skipping to next queued PR on next tick."
-                ),
-                dry_run=dry_run,
-            )
-            return 0
-        return 0
-    return 0
+    return decision, ctx


 def main() -> int:
@@ -174,6 +174,16 @@ def parse_directives(
        if not parts:
            continue
        first = parts[0]
+        # Em-dash (U+2014) is a common visual separator in user-written
+        # notes, e.g.  /sop-ack Five-Axis — five-axis-review
+        # If raw_slug contains an em-dash, split on the first one so
+        # the part before becomes the slug and the rest becomes the note.
+        note_from_slug = ""
+        slug_source = raw_slug
+        emdash_idx = raw_slug.find("—")
+        if emdash_idx != -1:
+            slug_source = raw_slug[:emdash_idx].strip()
+            note_from_slug = raw_slug[emdash_idx + 1 :].strip()
        # If the slug-capture greedily matched multiple words (e.g.
        # "comprehensive testing"), preserve normalize behavior: join
        # the WHOLE first-word-token only; trailing words get appended to
@@ -186,13 +196,19 @@ def parse_directives(
            # as slug and "testing extra-note" as note. We defer the
            # disambiguation to the caller via the returned canonical
            # slug. For simplicity: try the WHOLE captured string first.
-            canonical = normalize_slug(raw_slug, numeric_aliases)
+            canonical = normalize_slug(slug_source, numeric_aliases)
        else:
-            canonical = normalize_slug(first, numeric_aliases)
+            canonical = normalize_slug(slug_source, numeric_aliases)
        note_from_group = (m.group(3) or "").strip()
-        # If we collapsed multi-word slug into kebab and there's a
-        # trailing-text group too, append it.
-        entry = (kind, canonical, note_from_group)
+        # The em-dash (U+2014) is a visual separator; the regex puts it
+        # in group(3) because it is outside the slug character class.
+        # Strip it so "/sop-ack slug — note" yields just "note".
+        if note_from_group.startswith("—"):
+            note_from_group = note_from_group[1:].strip()
+        # Combine note_from_slug (em-dash split) with note_from_group
+        # (trailing text after the slug captured by the regex group).
+        combined_note = (note_from_slug + " " + note_from_group).strip()
+        entry = (kind, canonical, combined_note)
        if kind == "sop-n/a":
            na_directives.append(entry)
        else:
@@ -48,7 +48,6 @@ set -euo pipefail
 # workflow-level jq install can fail on runners with network restrictions
 # (GitHub releases not reachable from some runner networks — infra#241
 # follow-up). This fallback is idempotent — no-op when jq is already on PATH.
-# SOP_FAIL_OPEN=1 makes this always exit 0 so CI never blocks on jq absence.
 if ! command -v jq >/dev/null 2>&1; then
  echo "::notice::jq not found on PATH — attempting install..."
  _jq_installed="no"
@@ -67,12 +66,6 @@ if ! command -v jq >/dev/null 2>&1; then
  if ! command -v jq >/dev/null 2>&1; then
    echo "::error::jq installation failed — apt-get and GitHub binary both failed."
    echo "::error::sop-tier-check requires jq for all JSON API parsing."
-    # SOP_FAIL_OPEN=1 is set in the workflow step's env — makes script always
-    # exit 0 so CI never blocks. The SOP-6 tier review gate remains enforced.
-    if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-      echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-      exit 0
-    fi
    exit 1
  fi
 fi
@@ -101,15 +94,10 @@ echo "::notice::tier-check start: repo=$OWNER/$NAME pr=$PR_NUMBER author=$PR_AUT
 # cause the script to exit prematurely when the token is empty/invalid — the
 # if check below handles that case gracefully. Without || true, a 401 from an
 # empty/invalid token causes jq to exit 1, triggering set -e and exiting the
-# entire script before SOP_FAIL_OPEN can be evaluated (the check is in the jq-
-# install block; if jq is already on PATH, that block is skipped entirely).
+# entire script before the error can be logged.
 WHOAMI=$(curl -sS -H "$AUTH" "${API}/user" | jq -r '.login // ""') || true
 if [ -z "$WHOAMI" ]; then
  echo "::error::GITEA_TOKEN cannot resolve a user via /api/v1/user — check the token scope and that the secret is wired correctly."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi
 echo "::notice::token resolves to user: $WHOAMI"
@@ -119,10 +107,6 @@ echo "::notice::token resolves to user: $WHOAMI"
 HEAD_SHA=$(curl -sS -H "$AUTH" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}" | jq -r '.head.sha // ""') || true
 if [ -z "$HEAD_SHA" ]; then
  echo "::error::Failed to fetch PR head SHA — token may be invalid."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi
 debug "pr-head-sha=$HEAD_SHA"
@@ -215,10 +199,6 @@ if [ "${SOP_DEBUG:-}" = "1" ]; then
 fi
 if [ "$_HTTP_EXIT" -ne 0 ] || [ "$HTTP_CODE" != "200" ]; then
  echo "::error::GET /orgs/${OWNER}/teams failed (curl exit=$_HTTP_EXIT HTTP=$HTTP_CODE) — token may lack read:org scope or be invalid."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi

@@ -265,17 +245,13 @@ done

 # 5. Read approving reviewers. set +e disables set -e temporarily so that curl
 # failures (e.g. empty/invalid token → HTTP 401) do not abort the script before
-# SOP_FAIL_OPEN is evaluated. set -e is restored immediately after.
+# set -e is restored immediately after.
 set +e
 REVIEWS=$(curl -sS -H "$AUTH" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}/reviews")
 _REVIEWS_EXIT=$?
 set -e
 if [ $_REVIEWS_EXIT -ne 0 ] || [ -z "$REVIEWS" ]; then
  echo "::error::Failed to fetch reviews (curl exit=$_REVIEWS_EXIT) — token may be invalid or unreachable."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi
 APPROVERS=$(echo "$REVIEWS" | jq -r --arg head_sha "$HEAD_SHA" '[.[] | select(.state=="APPROVED" and .commit_id == $head_sha) | .user.login] | unique | .[]') || true
@@ -208,6 +208,22 @@ class TestParseDirectives(unittest.TestCase):
        d = self.parse_ack_revoke("/sop-ack Comprehensive_Testing")
        self.assertEqual(d[0][1], "comprehensive-testing")

+    def test_emdash_separator_parsed_correctly(self):
+        # Em-dash (U+2014) between slug and note is common in practice.
+        # /sop-ack Five-Axis — five-axis-review
+        # → slug = five-axis, note = — five-axis-review
+        d = self.parse_ack_revoke("/sop-ack Five-Axis — five-axis-review")
+        self.assertEqual(len(d), 1)
+        self.assertEqual(d[0][1], "five-axis")
+        self.assertIn("five-axis-review", d[0][2])
+
+    def test_emdash_no_note(self):
+        # Em-dash at end of slug: only slug, no note content
+        d = self.parse_ack_revoke("/sop-ack Five-Axis —")
+        self.assertEqual(len(d), 1)
+        self.assertEqual(d[0][1], "five-axis")
+        self.assertEqual(d[0][2], "")  # em-dash is separator-only → empty note
+

 # ---------------------------------------------------------------------------
 # section_marker_present
@@ -205,5 +205,5 @@ n/a_gates:
    required_teams: [security, managers, ceo]
    description: >-
      Security review N/A when this change has no security surface
-      (docs-only, pure-frontend, dependency-only). A security/owners
+      (docs-only, pure-frontend, dependency-only). A security/managers/ceo
      member must post /sop-n/a security-review to activate.
@@ -34,11 +34,6 @@ jobs:
  check:
    name: Block forbidden paths
    runs-on: ubuntu-latest
-    # Phase 3 (RFC #219 §1): surface broken workflows without blocking
-    # the PR. Follow-up PR flips this off after surfaced defects are
-    # triaged.
-    # mc#1982: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
-    continue-on-error: true
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
@@ -290,6 +290,15 @@ jobs:
          echo "ADMIN_TOKEN=${E2E_ADMIN_TOKEN}" >> "$GITHUB_ENV"
          echo "MOLECULE_ADMIN_TOKEN=${E2E_ADMIN_TOKEN}" >> "$GITHUB_ENV"
          echo "Admin token configured for the e2e platform (ADMIN_TOKEN + MOLECULE_ADMIN_TOKEN)."
+          # Channels e2e test seam (core#2332 P1.10). These env-gated overrides
+          # let the LIVE Slack-webhook send path + Telegram discover path target
+          # the local mock upstreams that tests/e2e/test_channels_e2e.sh binds,
+          # so the outbound serialize+POST is provable in CI (was unit-mock-only).
+          # Inert in prod/staging — those deploys never set these. The fixed
+          # loopback ports MUST match the script's E2E_CHANNELS_*_PORT defaults.
+          echo "MOLECULE_CHANNELS_TEST_WEBHOOK_BASE=http://127.0.0.1:18099/" >> "$GITHUB_ENV"
+          echo "MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE=http://127.0.0.1:18098" >> "$GITHUB_ENV"
+          echo "Channels test seam configured (webhook+telegram mock bases on fixed loopback ports)."
      - name: Build platform
        if: needs.detect-changes.outputs.api == 'true'
        working-directory: workspace-server
@@ -430,6 +439,20 @@ jobs:
      - name: Run notify-with-attachments E2E
        if: needs.detect-changes.outputs.api == 'true'
        run: bash tests/e2e/test_notify_attachments_e2e.sh
+      - name: "Run channels + data-prune E2E (REQUIRE-LIVE: mock upstream proves send+discover, purge proves prune)"
+        # core#2332 P1.10. Stands up a local mock upstream, points the LIVE
+        # Slack-webhook send + Telegram discover paths at it via the
+        # production-inert test seam configured above, and asserts the mock
+        # RECEIVED the serialized payload (send) + round-tripped the bot/chat
+        # (discover). Then exercises the RFC #734 data-prune: DELETE
+        # ?purge=true removes the target's durable child data while a sibling
+        # survives. E2E_REQUIRE_LIVE=1 ⇒ a missing/regressed seam is RED, not a
+        # silent skip. The platform inherits the MOLECULE_CHANNELS_TEST_* bases
+        # from $GITHUB_ENV; the script's mock ports match them (18099/18098).
+        if: needs.detect-changes.outputs.api == 'true'
+        env:
+          E2E_REQUIRE_LIVE: '1'
+        run: bash tests/e2e/test_channels_e2e.sh
      - name: "Run priority-runtimes E2E (REQUIRE-LIVE: mock validates the runtime plumbing end-to-end)"
        # E2E_REQUIRE_LIVE=1 is ON: the run MUST validate >=1 runtime end-to-end
        # or it exits NON-zero (RED). This is now SAFE because the `mock` arm can
@@ -7,10 +7,13 @@ name: gitea-merge-queue
 # the user-space queue bot, one PR per tick, using the non-bypass merge actor.
 #
 # Queue contract:
-#   - add label `merge-queue` to an open same-repo PR
+#   - auto-discovery (default): any open same-repo PR is considered — no
+#     `merge-queue` label required (the label is optional metadata now)
 #   - bot updates stale PR heads with current main, then waits for CI
-#   - bot merges only when current main is green and required PR contexts pass
-#   - add `merge-queue-hold` to pause a queued PR without removing it
+#   - bot merges only when current main is green, genuine approvals are present
+#     on the current head, required PR contexts pass, and the PR is mergeable
+#   - add `merge-queue-hold`, `do-not-auto-merge`, or `wip` to keep a PR OUT of
+#     autonomous merging; draft PRs are also skipped

 on:
  # Schedule moved to operator-config:
@@ -48,10 +51,34 @@ jobs:
          WATCH_BRANCH: ${{ github.event.repository.default_branch }}
          QUEUE_LABEL: merge-queue
          HOLD_LABEL: merge-queue-hold
+          # Auto-discovery (opt-OUT). When on (default), the queue considers ALL
+          # open same-repo PRs that meet the merge bar — it does NOT wait for a
+          # human/agent to add `merge-queue`. Agent Gitea tokens lack
+          # write:issue (labels are issue-scoped) and could never self-label,
+          # which stalled the queue; the label is now OPTIONAL metadata. The
+          # merge bar is UNCHANGED — only candidate selection widens. Set
+          # AUTO_DISCOVER=0 to restore legacy opt-IN (require the merge-queue
+          # label to be considered).
+          AUTO_DISCOVER: "1"
+          # Opt-OUT labels: any of these on a PR keeps it OUT of autonomous
+          # merging (the human escape hatch). HOLD_LABEL is always also honoured.
+          # A human who wants a PR held just adds one of these labels.
+          OPT_OUT_LABELS: do-not-auto-merge,wip
          UPDATE_STYLE: merge
-          REQUIRED_CONTEXTS: >-
-            CI / all-required (pull_request),
-            sop-checklist / all-items-acked (pull_request)
+          # Recognised official-reviewer set. A merge needs >= required_approvals
+          # DISTINCT genuine official approvals from these accounts on the
+          # CURRENT head sha (not stale/dismissed). The required_approvals count
+          # itself is read from branch protection at runtime.
+          REVIEWER_SET: agent-reviewer,agent-researcher,agent-reviewer-cr2
+          # NOTE: REQUIRED_CONTEXTS is no longer the authoritative PR gate. The
+          # queue now reads the required status contexts from BRANCH PROTECTION
+          # (status_check_contexts) so non-required governance reds (qa-review,
+          # security-review, sop-tier, sop-checklist when not branch-required,
+          # E2E Chat, Staging SaaS, ci-arm64-advisory) cannot block a merge.
+          # If branch protection cannot be enumerated the queue HOLDS
+          # (fail-closed). REQUIRED_APPROVALS below is only a fallback used when
+          # branch protection does not specify required_approvals.
+          REQUIRED_APPROVALS: "2"
          # Push-side required contexts. Checking CI / all-required (push)
          # explicitly instead of the combined state avoids false-pause when
          # non-blocking jobs (continue-on-error: true) have failed — those
@@ -99,7 +99,7 @@ jobs:
    # all violate this lint at first — intentional. Flip to false
    # follow-up after main is clean for 3 days. mc#1982.
    # mc#1982: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
-    continue-on-error: true  # mc#1982 Phase 3 mask — 14d forced-renewal cadence
+    continue-on-error: true  # internal#837 Phase 3 mask — 14d forced-renewal cadence
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
      - uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065  # v5.6.0
@@ -90,7 +90,13 @@ jobs:
          # checked-in artifact; exit 1 (RED) on any drift. This is the
          # single source of the gate's verdict — the same code path
          # `go test ./cmd/gen-providers` exercises.
-          go run ./cmd/gen-providers -check
+          if ! go run ./cmd/gen-providers -check; then
+            echo "::error::workspace-server/internal/providers/gen/registry_gen.go is stale (drifted from providers.yaml)."
+            echo "Regenerate and commit it (run from repo root):"
+            echo "  make gen          # native (needs a local Go toolchain)"
+            echo "  make gen-docker   # Docker only — no local Go needed"
+            exit 1
+          fi

      - name: Belt-and-braces — regenerate in place and assert clean tree
        run: |
@@ -101,7 +107,9 @@ jobs:
          go generate ./...
          if ! git diff --quiet -- internal/providers/gen/registry_gen.go; then
            echo "::error::workspace-server/internal/providers/gen/registry_gen.go drifted from providers.yaml."
-            echo "Run 'go generate ./...' (or 'go run ./cmd/gen-providers') in workspace-server/ and commit the result."
+            echo "Regenerate and commit it. No local Go? Use Docker (run from repo root):"
+            echo "  make gen          # native (needs a local Go toolchain)"
+            echo "  make gen-docker   # Docker only — no local Go needed"
            git --no-pager diff -- internal/providers/gen/registry_gen.go | head -80
            exit 1
          fi
@@ -4,7 +4,27 @@
 # use this Makefile; CI calls docker compose / go test directly so the
 # Makefile can evolve without breaking the build.

-.PHONY: help dev up down logs build test e2e-peer-visibility openapi-spec openapi-spec-check
+.PHONY: help dev up down logs build test e2e-peer-visibility openapi-spec openapi-spec-check gen gen-docker gen-check gen-check-docker
+
+# ─── Provider-registry SSOT codegen (internal#718) ─────────────────────
+# The Go module lives in workspace-server/. The checked-in artifact
+# workspace-server/internal/providers/gen/registry_gen.go is a gofmt'd
+# projection of providers.yaml, drift-gated by
+# .gitea/workflows/verify-providers-gen.yml. `make gen-docker` runs the SAME
+# generator inside the pinned golang image so a toolchain-less env (an agent
+# without Go) can regenerate without a local Go install (core#2332 follow-up).
+#
+# BYTE-EQUIVALENCE: gen-docker is byte-identical to native only while
+# GO_VERSION below matches the `go` directive in workspace-server/go.mod.
+# NOTE: the CI verify workflow pins setup-go go-version: 'stable' (not '1.25');
+# that is a latent hazard — a future Go minor could reformat the artifact in CI
+# vs a 1.25 local. Pin CI to '1.25' to close it (tracked alongside this change).
+GO_VERSION ?= 1.25
+GO_IMAGE   ?= golang:$(GO_VERSION)
+DOCKER     ?= docker
+# Mount the Go module (workspace-server) read-write; Go's default -mod=readonly
+# keeps go.mod/go.sum untouched — only the artifact is written in-place.
+DOCKER_RUN_WS = $(DOCKER) run --rm -v "$(CURDIR)/workspace-server":/src -w /src $(GO_IMAGE)

 help: ## Show this help.
 	@grep -E '^[a-zA-Z0-9_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-22s\033[0m %s\n", $$1, $$2}'
@@ -56,3 +76,16 @@ openapi-spec: ## Regenerate OpenAPI spec from workspace-server handler annotatio
 openapi-spec-check: openapi-spec ## CI gate — fail if openapi-spec produces a diff vs the committed file.
 	@git diff --exit-code -- workspace-server/docs/openapi/ \
 	  || (echo "openapi-spec is stale — run 'make openapi-spec' and commit the result" && exit 1)
+
+# ─── Provider-registry codegen targets ────────────────────────────────
+gen: ## Regenerate the providers registry artifact natively (needs local Go).
+	cd workspace-server && go generate ./...
+
+gen-docker: ## Same, inside the pinned $(GO_IMAGE) — Docker only, no local Go.
+	$(DOCKER_RUN_WS) go generate ./...
+
+gen-check: ## Drift gate (native): exit 1 if the artifact is stale.
+	cd workspace-server && go run ./cmd/gen-providers -check
+
+gen-check-docker: ## Drift gate inside the pinned $(GO_IMAGE) — Docker only.
+	$(DOCKER_RUN_WS) go run ./cmd/gen-providers -check
@@ -93,12 +93,12 @@ For "do we have any backend?", use `HasProvisioner()`, never bare `h.provisioner
 3. **Restart divergence on runtime changes.** Docker re-reads `/configs/config.yaml` from the container before stop, so a changed `runtime:` survives a restart even if the DB isn't synced. EC2 trusts the DB only. If you change the runtime via the Config tab and the handler races the restart, Docker will land on the new runtime, EC2 will land on the old one. **Fix path:** make the Config-tab save explicitly flush to DB before kicking off a restart, not deferred.
 4. **Console-output asymmetry.** Users debugging a stuck workspace on Docker see `docker logs`; on EC2 they see `GetConsoleOutput`. The two outputs look nothing alike. **Fix path:** expose a unified `GET /workspaces/:id/boot-log` that proxies to whichever backend serves the data. Already partly there via `cp_provisioner.Console`.
 5. **Template script drift.** `install.sh` and `start.sh` in each template repo do the same high-level work (install hermes-agent, write .env, write config.yaml, start gateway) but must be kept byte-level consistent on the provider-key forwarding block. Easy to forget. Enforced now by `tools/check-template-parity.sh` (see below) — run it in each template repo's CI.
-6. **Both backends panic when underlying client is nil.** Discovered by the contract-test scaffold landing in this PR: `Provisioner.{Stop,IsRunning}` nil-dereferences the Docker client, and `CPProvisioner.{Stop,IsRunning}` nil-dereferences `httpClient`. The real code always sets these, so this is theoretical in prod — but it means the contract runner can't execute scenarios against zero-value backends. **Fix path:** guard each method with `if p.docker == nil { return false, errNoBackend }` (and equivalent for CP), then flip the `t.Skip` in the contract tests to `t.Run`.
+6. **Both backends panic when underlying client is nil.** ✅ **Resolved** (`fix/provisioner-nil-guards-1813`). `Provisioner.{Stop,IsRunning}` and `CPProvisioner.{Stop,IsRunning}` now guard against nil clients with `ErrNoBackend`, so the contract-test runner executes scenarios against zero-valued backends without panic.

 ## Enforcement

 - **`tools/check-template-parity.sh`** (this repo) — ensures `install.sh` and `start.sh` in a template repo forward identical sets of provider keys. Wire into each template repo's CI as `bash $MONOREPO/tools/check-template-parity.sh install.sh start.sh`.
- **Contract tests** (stub) — `workspace-server/internal/provisioner/backend_contract_test.go` defines the behaviors every `provisioner.Provisioner` implementation must satisfy. Fails compile when a method drifts between `Docker` and `CPProvisioner`. Scenario-level runs are `t.Skip`'d today pending drift risk #6 (see above) — compile-time assertions still catch method drift.
+- **Contract tests** — `workspace-server/internal/provisioner/backend_contract_test.go` defines the behaviors every `provisioner.Provisioner` implementation must satisfy. Fails compile when a method drifts between `Docker` and `CPProvisioner`. Scenario-level runs execute against zero-valued backends since drift risk #6 was resolved (`fix/provisioner-nil-guards-1813`).
 - **Source-level dispatcher pins** — `workspace_provision_auto_test.go` enforces the SoT pattern documented above:
  - `TestNoCallSiteCallsDirectProvisionerExceptAuto` — no handler calls `.provisionWorkspace(` or `.provisionWorkspaceCP(` directly outside the dispatcher's allowlist.
  - `TestNoCallSiteCallsBareStop` — no handler calls `.provisioner.Stop(` or `.cpProv.Stop(` directly outside the dispatcher's allowlist (strips Go comments before substring match so archaeology in code comments doesn't trip the gate).
@@ -8,26 +8,39 @@ against the latest `main`.

 ## Queue Contract

-Add the `merge-queue` label to an open PR when it is ready to merge.
+**Auto-discovery (opt-OUT, default).** You do NOT need to label a PR. The bot
+auto-discovers every open same-repo PR and merges any that meets the bar. The
+`merge-queue` label is now optional metadata, not a gate. This removed the
+historical autonomy gap: agent Gitea tokens lack `write:issue` (labels are
+issue-scoped), so agents could never self-label and ready PRs stalled.
+
+To keep a PR OUT of autonomous merging, add an opt-OUT label:
+`merge-queue-hold`, `do-not-auto-merge`, or `wip`. Draft PRs are also skipped.

 The bot processes one PR per tick:

-1. Confirms `main` is green.
-2. Selects the oldest open PR carrying `merge-queue`.
-3. Skips PRs with `merge-queue-hold`.
-4. Rejects fork PRs because the queue may only update same-repo branches.
-5. If the PR head does not contain current `main`, calls Gitea's
+1. Confirms `main`'s branch-protection-required push contexts are green.
+2. Selects the oldest open same-repo PR that is NOT opt-out-labeled and NOT a
+   draft (auto-discovery). With `AUTO_DISCOVER=0` it falls back to legacy
+   opt-IN: only PRs carrying `merge-queue` are considered.
+3. Rejects fork PRs because the queue may only update same-repo branches.
+4. If the PR head does not contain current `main`, calls Gitea's
   `/pulls/{n}/update?style=merge` endpoint and waits for CI on the new head.
-6. Merges only after the current PR head has required contexts green:
-   - `CI / all-required (pull_request)`
-   - `sop-checklist / all-items-acked (pull_request)`
+5. Merges only when, on the PR's CURRENT head sha:
+   - `>= required_approvals` distinct genuine official `APPROVED` reviews from
+     the recognised reviewer set (read from branch protection; default 2),
+   - no open official `REQUEST_CHANGES`,
+   - every branch-protection-required status context is green, and
+   - the PR is `mergeable` (Gitea returns `True`; `None`/`False` = wait).

-The workflow is serialized with `concurrency`, so two queued PRs cannot be
+The merge bar is unchanged by auto-discovery — only WHICH PRs are considered
+changes. The workflow is serialized with `concurrency`, so two PRs cannot be
 merged against the same observed `main`.

 ## Operator Commands

-Queue a PR:
+Queue a PR (optional — auto-discovery already considers every ready PR; the
+label is just visible metadata):

 ```bash
 curl -fsS -X POST \
@@ -37,7 +50,8 @@ curl -fsS -X POST \
  -d '{"labels":["merge-queue"]}'
 ```

-Temporarily hold a queued PR:
+Keep a PR OUT of autonomous merging (opt-OUT — use `merge-queue-hold`,
+`do-not-auto-merge`, or `wip`):

 ```bash
 curl -fsS -X POST \
@@ -56,9 +70,11 @@ REPO=molecule-ai/molecule-core \
 WATCH_BRANCH=main \
 QUEUE_LABEL=merge-queue \
 HOLD_LABEL=merge-queue-hold \
+AUTO_DISCOVER=1 \
+OPT_OUT_LABELS=do-not-auto-merge,wip \
+REVIEWER_SET=agent-reviewer,agent-researcher,agent-reviewer-cr2 \
 UPDATE_STYLE=merge \
-REQUIRED_CONTEXTS='CI / all-required (pull_request),sop-checklist / all-items-acked (pull_request)' \
-python3 .gitea/scripts/gitea-merge-queue.py
+python3 .gitea/scripts/gitea-merge-queue.py --dry-run
 ```

 Dry run:
@@ -0,0 +1,468 @@
+#!/usr/bin/env bash
+# GATING E2E for the social-channels outbound + discover + data-prune paths
+# (core#2332 P1.10). Closes two coverage gaps that were previously only
+# unit-mocked, so a regression in any of them goes RED in the required
+# `E2E API Smoke Test` lane instead of slipping through:
+#
+#  (1) Channel SEND end-to-end. Every adapter's SendMessage was only ever
+#      asserted by unit tests that reconstruct the payload by hand and POST
+#      it themselves (see internal/channels/lark_test.go's "we can't change
+#      the prefix const" comment) — nothing proved that a message submitted
+#      through the LIVE platform API actually serializes and POSTs to a
+#      provider endpoint. Here we stand up a local mock-upstream, point a
+#      Slack Incoming-Webhook channel at it, send via
+#      POST /channels/:id/send, and assert the MOCK RECEIVED the correctly
+#      serialized {"text":"..."} body. Real serialize+POST, real HTTP stack,
+#      no real Slack account.
+#
+#  (2) Channel DISCOVER (POST /channels/discover). Had no test at all. We
+#      point the Telegram discover path at a mock Bot API that serves
+#      getMe + getUpdates and assert the discovered bot username + chat
+#      round-trip back through the handler.
+#
+#  (3) Workspace data-prune (RFC #734). The user-requested permanent delete
+#      with ?purge=true prunes a workspace's durable child data (channels,
+#      secrets, config, …). We create prunable data on a target workspace
+#      AND a sibling, purge the target, then assert the target's child rows
+#      are GONE while the sibling's SURVIVE.
+#
+# ── Test seam (production-inert) ────────────────────────────────────────
+# Adapters pin their outbound host to the real vendor (hooks.slack.com /
+# api.telegram.org). Two env-gated overrides — set ONLY by this lane, never
+# in any prod/staging deploy — let the live send/discover path target a
+# local mock so the round-trip is provable in CI:
+#
+#   MOLECULE_CHANNELS_TEST_WEBHOOK_BASE       (Slack webhook accept-prefix)
+#   MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE  (Telegram Bot API base)
+#
+# These must be present in the PLATFORM process env (the workflow exports
+# them via $GITHUB_ENV before "Start platform"), pointing at the fixed
+# loopback ports this script binds its mocks on. If they are absent the
+# platform rejects the mock URLs; under E2E_REQUIRE_LIVE=1 that is a hard
+# RED (the seam regressed / the workflow wiring broke), otherwise a LOUD
+# SKIP for ad-hoc local runs that didn't export them.
+#
+# NEVER fail-open: a missing assertion target fails the script.
+#
+# Required env (defaults shown):
+#   BASE                       http://127.0.0.1:8080
+#   MOLECULE_ADMIN_TOKEN       (admin bearer; matches the platform's ADMIN_TOKEN)
+#   E2E_CHANNELS_WEBHOOK_PORT  18099   (mock Slack webhook upstream)
+#   E2E_CHANNELS_TELEGRAM_PORT 18098   (mock Telegram Bot API upstream)
+#   E2E_REQUIRE_LIVE           0        (1 = seam-absent is RED, not skip)
+
+set -uo pipefail
+
+# shellcheck disable=SC1091
+source "$(dirname "$0")/_lib.sh"   # sets BASE default + admin/token helpers
+
+WEBHOOK_PORT="${E2E_CHANNELS_WEBHOOK_PORT:-18099}"
+TELEGRAM_PORT="${E2E_CHANNELS_TELEGRAM_PORT:-18098}"
+REQUIRE_LIVE="${E2E_REQUIRE_LIVE:-0}"
+
+# The base prefixes the PLATFORM must have been started with. We assert the
+# adapter accepted a URL under these — proving the platform's env matches.
+WEBHOOK_BASE="http://127.0.0.1:${WEBHOOK_PORT}/"
+TELEGRAM_BASE="http://127.0.0.1:${TELEGRAM_PORT}"
+
+PASS=0
+FAIL=0
+WORK_DIR="$(mktemp -d)"
+WS_TARGET=""
+WS_SIBLING=""
+WS_TARGET_TOK=""
+WS_SIBLING_TOK=""
+MOCK_PID=""
+
+ADMIN_BEARER="${MOLECULE_ADMIN_TOKEN:-${ADMIN_TOKEN:-}}"
+ADMIN_AUTH=()
+[ -n "$ADMIN_BEARER" ] && ADMIN_AUTH=(-H "Authorization: Bearer $ADMIN_BEARER")
+
+pass() { echo "PASS: $1"; PASS=$((PASS + 1)); }
+fail() { echo "FAIL: $1"; [ -n "${2:-}" ] && echo "  $2"; FAIL=$((FAIL + 1)); }
+
+# loud_skip records a SKIP and exits according to E2E_REQUIRE_LIVE. NEVER
+# silently passes — it either hard-fails (require-live) or exits 0 with a
+# loud banner (ad-hoc local). Mirrors the require-live gate pattern used by
+# test_priority_runtimes_e2e.sh.
+loud_skip() {
+  local reason="$1"
+  echo
+  echo "============================================================"
+  if [ "$REQUIRE_LIVE" = "1" ]; then
+    echo "E2E_REQUIRE_LIVE=1 but channels e2e seam is unavailable:"
+    echo "  $reason"
+    echo "This is a HARD FAILURE — the platform was not started with the"
+    echo "channels test seam env (MOLECULE_CHANNELS_TEST_WEBHOOK_BASE /"
+    echo "MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE) on the fixed loopback"
+    echo "ports, or the seam regressed. Fix the workflow wiring or the seam."
+    echo "============================================================"
+    cleanup
+    exit 1
+  fi
+  echo "SKIP (loud): $reason"
+  echo "Set MOLECULE_CHANNELS_TEST_WEBHOOK_BASE=$WEBHOOK_BASE and"
+  echo "MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE=$TELEGRAM_BASE in the"
+  echo "PLATFORM env before starting it, then re-run. (CI sets these.)"
+  echo "============================================================"
+  cleanup
+  exit 0
+}
+
+cleanup() {
+  set +e
+  if [ -n "$MOCK_PID" ]; then
+    kill "$MOCK_PID" 2>/dev/null
+    wait "$MOCK_PID" 2>/dev/null
+  fi
+  # Hard-purge any workspaces we created so repeat runs are deterministic.
+  for pair in "$WS_TARGET|$WS_TARGET_TOK|e2e-chan-target" \
+              "$WS_SIBLING|$WS_SIBLING_TOK|e2e-chan-sibling"; do
+    local wid tok name
+    wid="${pair%%|*}"; pair="${pair#*|}"
+    tok="${pair%%|*}"; name="${pair#*|}"
+    [ -z "$wid" ] && continue
+    local auth=("${ADMIN_AUTH[@]}")
+    [ -n "$tok" ] && auth=(-H "Authorization: Bearer $tok")
+    curl -s -X DELETE "$BASE/workspaces/$wid?confirm=true&purge=true" \
+      -H "X-Confirm-Name: $name" "${auth[@]}" >/dev/null 2>&1
+  done
+  rm -rf "$WORK_DIR" 2>/dev/null
+}
+trap cleanup EXIT INT TERM
+
+# ── mock upstream ───────────────────────────────────────────────────────
+# One Python process serves BOTH mocks (different ports). It records the
+# Slack webhook request body to $WORK_DIR/slack_body.json and answers the
+# Telegram getMe/getUpdates calls with a deterministic bot+chat fixture.
+start_mock() {
+  cat > "$WORK_DIR/mock.py" <<'PY'
+import json
+import os
+import sys
+import threading
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+
+WORK_DIR = os.environ["MOCK_WORK_DIR"]
+WEBHOOK_PORT = int(os.environ["MOCK_WEBHOOK_PORT"])
+TELEGRAM_PORT = int(os.environ["MOCK_TELEGRAM_PORT"])
+
+BOT_USERNAME = "e2e_mock_bot"
+CHAT_ID = -1009876543210
+CHAT_NAME = "E2E Mock Group"
+
+
+class SlackHandler(BaseHTTPRequestHandler):
+    def log_message(self, *a):  # silence
+        pass
+
+    def do_POST(self):
+        n = int(self.headers.get("Content-Length", "0") or "0")
+        body = self.rfile.read(n)
+        # Persist EXACTLY what the live Slack send path POSTed so the bash
+        # side can assert the serialized payload.
+        with open(os.path.join(WORK_DIR, "slack_body.json"), "wb") as f:
+            f.write(body)
+        with open(os.path.join(WORK_DIR, "slack_meta.json"), "w") as f:
+            json.dump({"path": self.path,
+                       "content_type": self.headers.get("Content-Type", "")}, f)
+        # Real Slack Incoming Webhooks reply 200 "ok".
+        self.send_response(200)
+        self.end_headers()
+        self.wfile.write(b"ok")
+
+
+class TelegramHandler(BaseHTTPRequestHandler):
+    def log_message(self, *a):
+        pass
+
+    def _send(self, obj):
+        payload = json.dumps(obj).encode()
+        self.send_response(200)
+        self.send_header("Content-Type", "application/json")
+        self.send_header("Content-Length", str(len(payload)))
+        self.end_headers()
+        self.wfile.write(payload)
+
+    def _route(self):
+        # tgbotapi calls <base>/bot<token>/<method>
+        method = self.path.rsplit("/", 1)[-1]
+        if method == "getMe":
+            return self._send({"ok": True, "result": {
+                "id": 4242, "is_bot": True, "first_name": "E2E Mock",
+                "username": BOT_USERNAME, "can_read_all_group_messages": True}})
+        if method == "setMyCommands":
+            return self._send({"ok": True, "result": True})
+        if method == "deleteWebhook":
+            return self._send({"ok": True, "result": True})
+        if method == "getUpdates":
+            # One my_chat_member update so the bot "discovers" a group.
+            return self._send({"ok": True, "result": [{
+                "update_id": 1,
+                "my_chat_member": {
+                    "chat": {"id": CHAT_ID, "title": CHAT_NAME, "type": "supergroup"},
+                    "from": {"id": 1, "is_bot": False, "first_name": "Op"},
+                    "date": 0,
+                    "old_chat_member": {"user": {"id": 4242, "is_bot": True,
+                                                 "first_name": "E2E Mock"},
+                                        "status": "left"},
+                    "new_chat_member": {"user": {"id": 4242, "is_bot": True,
+                                                 "first_name": "E2E Mock"},
+                                        "status": "member"},
+                }}]})
+        # Default OK for any other bot method tgbotapi may probe.
+        return self._send({"ok": True, "result": True})
+
+    def do_POST(self):
+        n = int(self.headers.get("Content-Length", "0") or "0")
+        if n:
+            self.rfile.read(n)
+        self._route()
+
+    def do_GET(self):
+        self._route()
+
+
+def serve(port, handler):
+    ThreadingHTTPServer(("127.0.0.1", port), handler).serve_forever()
+
+
+t = threading.Thread(target=serve, args=(TELEGRAM_PORT, TelegramHandler), daemon=True)
+t.start()
+serve(WEBHOOK_PORT, SlackHandler)
+PY
+  MOCK_WORK_DIR="$WORK_DIR" MOCK_WEBHOOK_PORT="$WEBHOOK_PORT" \
+    MOCK_TELEGRAM_PORT="$TELEGRAM_PORT" \
+    python3 "$WORK_DIR/mock.py" &
+  MOCK_PID=$!
+  # Wait for both ports to accept connections (fail loudly if they never do).
+  local up=0
+  for _ in $(seq 1 50); do
+    if curl -s -o /dev/null "http://127.0.0.1:${WEBHOOK_PORT}/" \
+       && curl -s -o /dev/null "http://127.0.0.1:${TELEGRAM_PORT}/botX/getMe"; then
+      up=1; break
+    fi
+    sleep 0.1
+  done
+  if [ "$up" != "1" ]; then
+    echo "FATAL: mock upstream did not come up on ports $WEBHOOK_PORT/$TELEGRAM_PORT" >&2
+    cleanup
+    exit 2
+  fi
+}
+
+json_field() { python3 -c "import sys,json; print(json.load(sys.stdin).get('$1',''))"; }
+
+create_external_ws() {
+  local name="$1" resp wid
+  resp=$(curl -s -X POST "$BASE/workspaces" "${ADMIN_AUTH[@]}" \
+    -H "Content-Type: application/json" \
+    -d "{\"name\":\"$name\",\"runtime\":\"external\",\"external\":true,\"tier\":1}")
+  wid=$(printf '%s' "$resp" | json_field id)
+  if [ -z "$wid" ]; then
+    echo "FATAL: could not create workspace $name: $resp" >&2
+    cleanup
+    exit 1
+  fi
+  local tok
+  tok=$(printf '%s' "$resp" | e2e_extract_token)
+  [ -z "$tok" ] && tok=$(e2e_mint_workspace_token "$wid" 2>/dev/null || true)
+  printf '%s\t%s\n' "$wid" "$tok"
+}
+
+# ════════════════════════════════════════════════════════════════════════
+echo "=== Channels + data-prune E2E (core#2332 P1.10) ==="
+echo "BASE=$BASE  webhook_mock=$WEBHOOK_BASE  telegram_mock=$TELEGRAM_BASE"
+
+if ! curl -sf "$BASE/health" >/dev/null 2>&1; then
+  echo "FATAL: platform not reachable at $BASE/health" >&2
+  exit 2
+fi
+
+start_mock
+
+# ── workspaces ──────────────────────────────────────────────────────────
+IFS=$'\t' read -r WS_TARGET WS_TARGET_TOK < <(create_external_ws "e2e-chan-target-$$")
+IFS=$'\t' read -r WS_SIBLING WS_SIBLING_TOK < <(create_external_ws "e2e-chan-sibling-$$")
+echo "target=$WS_TARGET sibling=$WS_SIBLING"
+
+WS_AUTH=("${ADMIN_AUTH[@]}")
+[ -n "$WS_TARGET_TOK" ] && WS_AUTH=(-H "Authorization: Bearer $WS_TARGET_TOK")
+SIB_AUTH=("${ADMIN_AUTH[@]}")
+[ -n "$WS_SIBLING_TOK" ] && SIB_AUTH=(-H "Authorization: Bearer $WS_SIBLING_TOK")
+
+# ── (1) SEND end-to-end via a Slack Incoming-Webhook channel ────────────
+echo
+echo "--- (1) channel SEND → mock upstream receives serialized payload ---"
+
+# Create a slack channel whose webhook_url points at our mock. If the
+# platform wasn't started with the webhook test-base, ValidateConfig
+# rejects this URL → loud_skip / RED. chat_id is required by SendOutbound.
+SLACK_CFG=$(python3 -c "import json,sys; print(json.dumps({
+  'webhook_url': sys.argv[1] + 'services/T000/B000/e2e',
+  'chat_id': 'mock-chat'}))" "$WEBHOOK_BASE")
+CREATE=$(curl -s -X POST "$BASE/workspaces/$WS_TARGET/channels" "${WS_AUTH[@]}" \
+  -H "Content-Type: application/json" \
+  -d "{\"channel_type\":\"slack\",\"config\":$SLACK_CFG,\"enabled\":true}")
+CH_ID=$(printf '%s' "$CREATE" | json_field id)
+if [ -z "$CH_ID" ]; then
+  case "$CREATE" in
+    *"invalid channel config"*)
+      loud_skip "platform rejected mock webhook_url (MOLECULE_CHANNELS_TEST_WEBHOOK_BASE not set on platform): $CREATE" ;;
+    *)
+      fail "create slack channel" "$CREATE" ;;
+  esac
+else
+  pass "create slack channel pointed at mock upstream (id=$CH_ID)"
+
+  SEND_TEXT="hello from e2e $$"
+  # Send route: wsAuth.POST /workspaces/:id/channels/:channelId/send (the
+  # handler keys off :channelId; :id scopes the workspace bearer).
+  SEND=$(curl -s -w $'\n%{http_code}' -X POST \
+    "$BASE/workspaces/$WS_TARGET/channels/$CH_ID/send" "${WS_AUTH[@]}" \
+    -H "Content-Type: application/json" \
+    -d "{\"text\":\"$SEND_TEXT\"}")
+  SEND_CODE=$(printf '%s' "$SEND" | tail -n1)
+  if [ "$SEND_CODE" = "200" ]; then
+    pass "POST /channels/:id/send returned 200"
+  else
+    fail "POST /channels/:id/send" "code=$SEND_CODE body=$(printf '%s' "$SEND" | sed '$d')"
+  fi
+
+  # Give the async-free SendOutbound a beat to land at the mock.
+  RECEIVED=""
+  for _ in $(seq 1 30); do
+    if [ -s "$WORK_DIR/slack_body.json" ]; then RECEIVED=1; break; fi
+    sleep 0.1
+  done
+  if [ -n "$RECEIVED" ]; then
+    pass "mock upstream RECEIVED an outbound POST"
+    GOT_TEXT=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('text',''))" \
+      "$WORK_DIR/slack_body.json" 2>/dev/null || true)
+    if [ "$GOT_TEXT" = "$SEND_TEXT" ]; then
+      pass "mock received correctly-serialized {\"text\":...} payload (text matches end-to-end)"
+    else
+      fail "serialized payload mismatch" "want=[$SEND_TEXT] got=[$GOT_TEXT] raw=$(cat "$WORK_DIR/slack_body.json")"
+    fi
+  else
+    fail "mock upstream never received the outbound POST" "send path did not serialize+POST to the configured endpoint"
+  fi
+fi
+
+# ── (2) DISCOVER via the Telegram mock Bot API ──────────────────────────
+echo
+echo "--- (2) POST /channels/discover (telegram) → mock Bot API ---"
+# A token matching the telegramTokenRegex (\d+:[A-Za-z0-9_-]{30,}).
+DISC_TOKEN="424242:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
+DISC=$(curl -s -w $'\n%{http_code}' -X POST "$BASE/channels/discover" \
+  "${ADMIN_AUTH[@]}" -H "Content-Type: application/json" \
+  -d "{\"channel_type\":\"telegram\",\"bot_token\":\"$DISC_TOKEN\",\"workspace_id\":\"$WS_TARGET\"}")
+DISC_CODE=$(printf '%s' "$DISC" | tail -n1)
+DISC_BODY=$(printf '%s' "$DISC" | sed '$d')
+if [ "$DISC_CODE" = "200" ]; then
+  pass "POST /channels/discover returned 200"
+  if printf '%s' "$DISC_BODY" | grep -qF '"bot_username":"e2e_mock_bot"'; then
+    pass "discover round-tripped the mock bot username"
+  else
+    fail "discover bot_username" "$DISC_BODY"
+  fi
+  if printf '%s' "$DISC_BODY" | grep -qF '"chat_id":"-1009876543210"'; then
+    pass "discover round-tripped the mock chat id"
+  else
+    fail "discover chat list" "$DISC_BODY"
+  fi
+else
+  case "$DISC_BODY" in
+    *"Cannot reach Telegram"*|*"Invalid bot token"*|*"Failed to connect"*)
+      # Platform reached the REAL api.telegram.org (seam not set) → can't prove.
+      loud_skip "discover hit real Telegram, not the mock (MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE not set on platform): code=$DISC_CODE $DISC_BODY" ;;
+    *)
+      fail "POST /channels/discover" "code=$DISC_CODE body=$DISC_BODY" ;;
+  esac
+fi
+
+# ── (3) Data-prune (RFC #734): purge removes prunable data, sibling survives
+echo
+echo "--- (3) data-prune: purge target's child data, sibling survives ---"
+
+# Seed prunable child data on BOTH workspaces: a channel (already on target)
+# + a secret on each. We assert via GET /channels which lists workspace_channels.
+seed_secret() {
+  local wid="$1"; shift
+  curl -s -o /dev/null -X POST "$BASE/workspaces/$wid/secrets" "$@" \
+    -H "Content-Type: application/json" \
+    -d '{"key":"E2E_PRUNE_PROBE","value":"v"}'
+}
+seed_secret "$WS_TARGET" "${WS_AUTH[@]}"
+# Sibling gets its OWN channel so we can prove its rows survive the target purge.
+SIB_SLACK_CFG=$(python3 -c "import json,sys; print(json.dumps({
+  'webhook_url': sys.argv[1] + 'services/T111/B111/sib',
+  'chat_id': 'sib-chat'}))" "$WEBHOOK_BASE")
+SIB_CH=$(curl -s -X POST "$BASE/workspaces/$WS_SIBLING/channels" "${SIB_AUTH[@]}" \
+  -H "Content-Type: application/json" \
+  -d "{\"channel_type\":\"slack\",\"config\":$SIB_SLACK_CFG,\"enabled\":true}")
+SIB_CH_ID=$(printf '%s' "$SIB_CH" | json_field id)
+
+# Pre-purge: confirm both workspaces have >=1 channel row.
+TGT_CH_PRE=$(curl -s "$BASE/workspaces/$WS_TARGET/channels" "${WS_AUTH[@]}")
+SIB_CH_PRE=$(curl -s "$BASE/workspaces/$WS_SIBLING/channels" "${SIB_AUTH[@]}")
+TGT_PRE_N=$(printf '%s' "$TGT_CH_PRE" | python3 -c "import sys,json; print(len(json.load(sys.stdin)))" 2>/dev/null || echo 0)
+SIB_PRE_N=$(printf '%s' "$SIB_CH_PRE" | python3 -c "import sys,json; print(len(json.load(sys.stdin)))" 2>/dev/null || echo 0)
+if [ "${TGT_PRE_N:-0}" -ge 1 ] && [ "${SIB_PRE_N:-0}" -ge 1 ]; then
+  pass "pre-purge: target ($TGT_PRE_N) and sibling ($SIB_PRE_N) both have channel data"
+else
+  fail "pre-purge seed" "target=$TGT_PRE_N sibling=$SIB_PRE_N (need >=1 each)"
+fi
+
+# Permanent delete WITH purge — the RFC #734 prune of durable child data.
+# DELETE /workspaces/:id is AdminAuth-gated (router.go:167); Tier-2b rejects a
+# workspace bearer when ADMIN_TOKEN is set, so this MUST use the admin bearer.
+# X-Confirm-Name must equal the workspace name (the destructive-delete guard).
+PURGE_AUTH=("${ADMIN_AUTH[@]}")
+[ ${#PURGE_AUTH[@]} -eq 0 ] && [ -n "$WS_TARGET_TOK" ] && PURGE_AUTH=(-H "Authorization: Bearer $WS_TARGET_TOK")
+PURGE=$(curl -s -w $'\n%{http_code}' -X DELETE \
+  "$BASE/workspaces/$WS_TARGET?confirm=true&purge=true" \
+  -H "X-Confirm-Name: e2e-chan-target-$$" "${PURGE_AUTH[@]}")
+PURGE_CODE=$(printf '%s' "$PURGE" | tail -n1)
+PURGE_BODY=$(printf '%s' "$PURGE" | sed '$d')
+if [ "$PURGE_CODE" = "200" ] && printf '%s' "$PURGE_BODY" | grep -qF '"status":"purged"'; then
+  pass "DELETE ?purge=true returned purged"
+else
+  fail "DELETE ?purge=true" "code=$PURGE_CODE body=$PURGE_BODY"
+fi
+# Target was purged → its token is revoked; query its channels with admin
+# bearer. The purge hard-deletes workspace_channels rows for the target.
+TGT_CH_POST=$(curl -s "$BASE/workspaces/$WS_TARGET/channels" "${ADMIN_AUTH[@]}")
+TGT_POST_N=$(printf '%s' "$TGT_CH_POST" | python3 -c "import sys,json
+try:
+  d=json.load(sys.stdin); print(len(d) if isinstance(d,list) else -1)
+except Exception:
+  print(-1)" 2>/dev/null || echo -1)
+if [ "${TGT_POST_N:-1}" = "0" ]; then
+  pass "post-purge: target's prunable channel data is GONE (0 rows)"
+else
+  fail "prune did not remove target channel data" "post-purge target rows=$TGT_POST_N body=$(printf '%s' "$TGT_CH_POST" | head -c 200)"
+fi
+WS_TARGET=""  # purged; don't re-delete in cleanup
+
+# Sibling (NON-prunable relative to the target purge) must be untouched.
+SIB_CH_POST=$(curl -s "$BASE/workspaces/$WS_SIBLING/channels" "${SIB_AUTH[@]}")
+SIB_POST_N=$(printf '%s' "$SIB_CH_POST" | python3 -c "import sys,json; print(len(json.load(sys.stdin)))" 2>/dev/null || echo -1)
+if [ "${SIB_POST_N:-0}" -ge 1 ] && printf '%s' "$SIB_CH_POST" | grep -qF "$SIB_CH_ID"; then
+  pass "post-purge: sibling's non-prunable data SURVIVED ($SIB_POST_N rows, channel $SIB_CH_ID intact)"
+else
+  fail "purge over-reached: sibling data did not survive" "sibling rows=$SIB_POST_N body=$(printf '%s' "$SIB_CH_POST" | head -c 200)"
+fi
+
+# ── verdict ─────────────────────────────────────────────────────────────
+echo
+echo "=== channels+prune e2e: $PASS passed, $FAIL failed ==="
+if [ "$FAIL" -ne 0 ]; then
+  exit 1
+fi
+# Guard against a vacuous green: every section must have produced asserts.
+if [ "$PASS" -lt 9 ]; then
+  echo "FATAL: only $PASS assertions ran — expected >=9 (send + discover + prune). Refusing to report green." >&2
+  exit 1
+fi
+echo "ALL CHANNELS + PRUNE E2E CHECKS PASSED"
@@ -1004,6 +1004,12 @@ for wid in "${WS_TO_CHECK[@]}"; do
  else
    DIAG_FAIL=$(echo "$DIAG_JSON" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('first_failure','unknown'))" 2>/dev/null || echo "unknown")
    DIAG_DETAIL=$(echo "$DIAG_JSON" | python3 -c "import json,sys; d=json.load(sys.stdin); s=[x for x in d.get('steps',[]) if not x.get('ok')]; step=s[0] if s else {}; print(' — '.join(x for x in [step.get('error',''), step.get('detail','')] if x))" 2>/dev/null || echo "")
+    # #767: always emit the full diagnose JSON so operators see every step's
+    # Detail field even when the Python extraction above fails or the shape
+    # drifts. The burst is bracketed like steps 2 and 4 for grep-friendly CI.
+    log "── DIAGNOSTIC BURST (step 7b — terminal diagnose for $wid) ──"
+    echo "$DIAG_JSON" | python3 -m json.tool 2>/dev/null || echo "$DIAG_JSON"
+    log "── END DIAGNOSTIC ──"
    fail "Workspace $wid terminal diagnose failed at step '$DIAG_FAIL': $DIAG_DETAIL — check tenant SG has tcp/22 from the configured EIC endpoint SG, MOLECULE_EIC_ENDPOINT_SG_ID is set in Railway, and EIC endpoint health"
  fi
 done
@@ -21,6 +21,27 @@ const (

 var slackHTTPClient = &http.Client{Timeout: slackHTTPTimeout}

+// slackWebhookAccepted reports whether a Slack Incoming Webhook URL is allowed
+// as a send destination. Production accepts only the real hooks.slack.com host.
+//
+// TEST SEAM (gating e2e): when MOLECULE_CHANNELS_TEST_WEBHOOK_BASE is set, a
+// URL with that prefix is ALSO accepted so tests/e2e/test_channels_e2e.sh can
+// point the live Slack send path at a local mock-upstream and assert the mock
+// actually received the serialized {"text":...} payload end-to-end (the unit
+// tests can only assert the body shape — see lark_test.go's prefix-gate
+// workaround comment). The env var is NEVER set in any production/staging
+// deploy; channelsTestWebhookBase() returns "" there and only the real
+// hooks.slack.com prefix passes, so this changes no production behaviour.
+func slackWebhookAccepted(u string) bool {
+	if strings.HasPrefix(u, slackWebhookPrefix) {
+		return true
+	}
+	if base := channelsTestWebhookBase(); base != "" && strings.HasPrefix(u, base) {
+		return true
+	}
+	return false
+}
+
 // SlackAdapter implements ChannelAdapter for Slack Incoming Webhooks.
 //
 // Outbound messages are sent via Slack Incoming Webhooks (the simple,
@@ -98,7 +119,7 @@ func (s *SlackAdapter) ValidateConfig(config map[string]interface{}) error {
 			return fmt.Errorf("bot_token mode requires channel_id")
 		}
 	}
-	if webhookURL != "" && !strings.HasPrefix(webhookURL, slackWebhookPrefix) {
+	if webhookURL != "" && !slackWebhookAccepted(webhookURL) {
 		return fmt.Errorf("invalid Slack webhook URL")
 	}
 	return nil
@@ -197,7 +218,7 @@ func (s *SlackAdapter) sendWebhookMessage(ctx context.Context, config map[string
 	if webhookURL == "" {
 		return fmt.Errorf("webhook_url not configured")
 	}
-	if !strings.HasPrefix(webhookURL, slackWebhookPrefix) {
+	if !slackWebhookAccepted(webhookURL) {
 		return fmt.Errorf("invalid Slack webhook URL")
 	}

@@ -148,7 +148,18 @@ func (t *TelegramAdapter) DiscoverChats(ctx context.Context, botToken string) (*
 		return nil, errors.New("invalid bot token format")
 	}

-	bot, err := tgbotapi.NewBotAPI(botToken)
+	// TEST SEAM: when MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE is set (only in
+	// the gating channels e2e — never in prod/staging), build the bot client
+	// against a local mock API base instead of api.telegram.org so
+	// POST /channels/discover can be proven end-to-end. The format string is
+	// "<base>/bot%s/%s" (token, method), matching tgbotapi.APIEndpoint.
+	var bot *tgbotapi.BotAPI
+	var err error
+	if apiBase := channelsTestTelegramAPIBase(); apiBase != "" {
+		bot, err = tgbotapi.NewBotAPIWithAPIEndpoint(botToken, apiBase+"/bot%s/%s")
+	} else {
+		bot, err = tgbotapi.NewBotAPI(botToken)
+	}
 	if err != nil {
 		return nil, fmt.Errorf("invalid bot token: %w", err)
 	}
@@ -0,0 +1,47 @@
+package channels
+
+import "os"
+
+// Test seams for the GATING channels e2e (tests/e2e/test_channels_e2e.sh).
+//
+// Every adapter pins its outbound destination to the real vendor host
+// (hooks.slack.com, discord.com, api.telegram.org) in both ValidateConfig and
+// SendMessage. That host pin is correct for production, but it means a real
+// end-to-end test cannot point the LIVE send/discover path at a local mock
+// upstream — so today the outbound serialize+POST is only ever asserted by
+// unit tests that reconstruct the payload by hand (see lark_test.go's
+// "we can't change the prefix const" comment) and never proven through the
+// running platform.
+//
+// These two env-gated overrides close that gap WITHOUT changing any
+// production behaviour:
+//
+//   - MOLECULE_CHANNELS_TEST_WEBHOOK_BASE — when set, Slack Incoming Webhook
+//     URLs with this prefix are accepted as send destinations (in addition to
+//     the real hooks.slack.com host). Lets the e2e create a slack channel whose
+//     webhook_url points at a local httptest mock and assert the mock RECEIVED
+//     the serialized {"text":...} payload.
+//
+//   - MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE — when set, TelegramAdapter.
+//     DiscoverChats builds its bot client against this API base instead of
+//     api.telegram.org, so POST /channels/discover can be exercised against a
+//     mock that serves getMe/getUpdates and the e2e can assert the discovered
+//     chats round-trip.
+//
+// Both vars are NEVER set in any production or staging deploy. The helpers
+// return "" there, so the real vendor-host pins are the only thing that
+// passes — production behaviour is byte-for-byte unchanged. Reading os.Getenv
+// on each call (not caching) keeps the seam honest: a process that never sets
+// the var can never accidentally enable it.
+
+// channelsTestWebhookBase returns the test-only accepted webhook base prefix,
+// or "" in production. See package doc above.
+func channelsTestWebhookBase() string {
+	return os.Getenv("MOLECULE_CHANNELS_TEST_WEBHOOK_BASE")
+}
+
+// channelsTestTelegramAPIBase returns the test-only Telegram Bot API base
+// (a printf format string "<base>/bot%s/%s"), or "" in production.
+func channelsTestTelegramAPIBase() string {
+	return os.Getenv("MOLECULE_CHANNELS_TEST_TELEGRAM_API_BASE")
+}
@@ -9,6 +9,7 @@ import (
 	"log"
 	"net/http"
 	"os"
+	"sort"
 	"strings"
 	"time"

@@ -18,6 +19,7 @@ import (
 	dockerclient "github.com/docker/docker/client"
 	"github.com/gin-gonic/gin"

+	"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/providers"
 	"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/provisioner"
 )

@@ -41,10 +43,53 @@ func NewWorkspaceImageService(docker *dockerclient.Client) *WorkspaceImageServic
 	return &WorkspaceImageService{docker: docker}
 }

-// AllRuntimes is the canonical list mirroring docs/workspace-runtime-package.md.
-// Update both when a new template is added.
-var AllRuntimes = []string{
-	"claude-code", "codex", "hermes", "openclaw",
+// AllRuntimes is the canonical set of workspace runtimes this tenant will
+// pull/recreate template images for. It is DERIVED from the same providers
+// manifest SSOT (internal/providers/providers.yaml `runtimes:` block, mirrored
+// from CP's providers.yaml) that the rest of the platform routes against —
+// NOT a second hand-maintained list.
+//
+// Why derive instead of hardcode (controlplane#578): the old hardcoded slice
+// here ({claude-code, codex, hermes, openclaw}) silently DRIFTED from CP, which
+// already accepts `google-adk` for pin-promote/redeploy. A google-adk pin would
+// be accepted CP-side, then this tenant's POST /admin/workspace-images/refresh
+// ?runtime=google-adk rejected it 400 ("unknown runtime"), so google-adk image
+// fixes never deployed. Deriving from the manifest makes the tenant allowlist
+// and the CP allowlist provably the same set — they can't drift again.
+//
+// imageRefreshFallbackRuntimes is used ONLY if the embedded providers manifest
+// fails to load (which would be a build/CI failure caught by the providers
+// package's own tests, never a healthy prod). It preserves the historical
+// behavior — plus google-adk — so a manifest regression can never take the
+// refresh endpoint fully offline. Kept in lockstep with the providers.yaml
+// `runtimes:` keys; the drift guard in admin_workspace_images_test.go asserts
+// the two match.
+var imageRefreshFallbackRuntimes = []string{
+	"claude-code", "codex", "google-adk", "hermes", "openclaw",
+}
+
+// AllRuntimes is computed once at package init from the providers SSOT.
+var AllRuntimes = loadImageRefreshRuntimes()
+
+// loadImageRefreshRuntimes returns the sorted runtime names declared in the
+// providers manifest, falling back to imageRefreshFallbackRuntimes if the
+// manifest can't be loaded.
+func loadImageRefreshRuntimes() []string {
+	m, err := providers.LoadManifest()
+	if err != nil || len(m.Runtimes) == 0 {
+		if err != nil {
+			log.Printf("workspace-images: providers.LoadManifest failed (%v); falling back to static runtime allowlist", err)
+		}
+		out := append([]string(nil), imageRefreshFallbackRuntimes...)
+		sort.Strings(out)
+		return out
+	}
+	out := make([]string, 0, len(m.Runtimes))
+	for rt := range m.Runtimes {
+		out = append(out, rt)
+	}
+	sort.Strings(out)
+	return out
 }

 // RefreshResult is the per-call outcome surfaced to HTTP callers AND logged
@@ -197,7 +242,7 @@ func (s *WorkspaceImageService) Refresh(ctx context.Context, runtimes []string,

 // AdminWorkspaceImagesHandler serves POST /admin/workspace-images/refresh.
 //
-//	?runtime=claude-code   (optional; default = all 8 templates)
+//	?runtime=claude-code   (optional; default = all runtimes in AllRuntimes)
 //	&recreate=true|false   (default true; false = pull only)
 //
 // Returns JSON {pulled: [...], failed: [...], recreated: [...]}
@@ -3,7 +3,14 @@ package handlers
 import (
 	"encoding/base64"
 	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+	"sort"
 	"testing"
+
+	"github.com/gin-gonic/gin"
+
+	"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/providers"
 )

 func TestGHCRAuthHeader_NoEnvReturnsEmpty(t *testing.T) {
@@ -92,6 +99,119 @@ func TestGHCRAuthHeader_RespectsRegistryEnv(t *testing.T) {
 	}
 }

+// runtimeListContains is a tiny membership helper for the runtime-allowlist tests.
+func runtimeListContains(s []string, v string) bool {
+	for _, x := range s {
+		if x == v {
+			return true
+		}
+	}
+	return false
+}
+
+// TestAllRuntimes_IncludesGoogleADK is the direct regression for
+// controlplane#578: a google-adk pin promote/redeploy is accepted CP-side, so
+// the tenant image-refresh allowlist MUST also accept google-adk or the image
+// fix never deploys (tenant returned 400 "unknown runtime"). google-adk lives
+// in the providers SSOT, so the derived AllRuntimes must contain it.
+func TestAllRuntimes_IncludesGoogleADK(t *testing.T) {
+	if !runtimeListContains(AllRuntimes, "google-adk") {
+		t.Fatalf("AllRuntimes must include google-adk (controlplane#578 drift); got %v", AllRuntimes)
+	}
+}
+
+// TestAllRuntimes_MatchesProvidersSSOT is the drift guard. AllRuntimes is
+// derived from providers.LoadManifest().Runtimes — assert it equals exactly the
+// runtime keys the providers manifest (mirrored from CP's providers.yaml)
+// declares. If CP adds/removes a runtime, this test fails RED until the tenant
+// re-derives, so the tenant image-refresh allowlist can never silently drift
+// from the CP pin-promote allowlist again.
+func TestAllRuntimes_MatchesProvidersSSOT(t *testing.T) {
+	m, err := providers.LoadManifest()
+	if err != nil {
+		t.Fatalf("providers.LoadManifest: %v", err)
+	}
+	want := make([]string, 0, len(m.Runtimes))
+	for rt := range m.Runtimes {
+		want = append(want, rt)
+	}
+	sort.Strings(want)
+
+	got := append([]string(nil), AllRuntimes...)
+	sort.Strings(got)
+
+	if len(got) != len(want) {
+		t.Fatalf("AllRuntimes drift: got %v, want %v (providers SSOT)", got, want)
+	}
+	for i := range want {
+		if got[i] != want[i] {
+			t.Fatalf("AllRuntimes drift at %d: got %v, want %v (providers SSOT)", i, got, want)
+		}
+	}
+}
+
+// TestImageRefreshFallbackMatchesSSOT pins the static fallback (used only when
+// the embedded manifest fails to load) to the providers SSOT. If a runtime is
+// added to providers.yaml but not to imageRefreshFallbackRuntimes, this fails
+// RED — so a manifest-load failure can't silently drop a supported runtime.
+func TestImageRefreshFallbackMatchesSSOT(t *testing.T) {
+	m, err := providers.LoadManifest()
+	if err != nil {
+		t.Fatalf("providers.LoadManifest: %v", err)
+	}
+	want := make([]string, 0, len(m.Runtimes))
+	for rt := range m.Runtimes {
+		want = append(want, rt)
+	}
+	sort.Strings(want)
+
+	got := append([]string(nil), imageRefreshFallbackRuntimes...)
+	sort.Strings(got)
+
+	if len(got) != len(want) {
+		t.Fatalf("fallback drift: got %v, want %v (providers SSOT)", got, want)
+	}
+	for i := range want {
+		if got[i] != want[i] {
+			t.Fatalf("fallback drift at %d: got %v, want %v (providers SSOT)", i, got, want)
+		}
+	}
+}
+
+// TestRefresh_RejectsUnknownRuntime asserts a genuinely unknown runtime still
+// 400s (the guard isn't removed) AND that the 400 body lists google-adk in
+// known_runtimes (proving the allowlist now advertises it). This exercises the
+// gin handler's reject branch, which runs entirely before any Docker call.
+func TestRefresh_RejectsUnknownRuntime(t *testing.T) {
+	gin.SetMode(gin.TestMode)
+
+	// nil docker client is safe: the unknown-runtime branch returns 400
+	// before svc.Refresh (which is the only path that touches Docker).
+	h := &AdminWorkspaceImagesHandler{svc: &WorkspaceImageService{}}
+
+	r := gin.New()
+	r.POST("/admin/workspace-images/refresh", h.Refresh)
+
+	req := httptest.NewRequest(http.MethodPost, "/admin/workspace-images/refresh?runtime=not-a-real-runtime", nil)
+	rec := httptest.NewRecorder()
+	r.ServeHTTP(rec, req)
+
+	if rec.Code != http.StatusBadRequest {
+		t.Fatalf("unknown runtime: got status %d, want 400; body=%s", rec.Code, rec.Body.String())
+	}
+
+	var body struct {
+		Error         string   `json:"error"`
+		KnownRuntimes []string `json:"known_runtimes"`
+	}
+	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
+		t.Fatalf("decode 400 body: %v (raw=%s)", err, rec.Body.String())
+	}
+	if !runtimeListContains(body.KnownRuntimes, "google-adk") {
+		t.Errorf("400 known_runtimes must advertise google-adk (controlplane#578); got %v", body.KnownRuntimes)
+	}
+}
+
 func TestGHCRAuthHeader_TrimsWhitespace(t *testing.T) {
 	t.Setenv("MOLECULE_IMAGE_REGISTRY", "")
 	// .env lines often have trailing newlines or accidental spaces. Without
@@ -73,6 +73,7 @@ func (h *ChannelHandler) List(c *gin.Context) {
 		var config map[string]interface{}
 		if err := json.Unmarshal(configJSON, &config); err != nil {
 			log.Printf("Channels: unmarshal config for channel %s: %v", id, err)
+			config = map[string]interface{}{}
 		}
 		// #319: decrypt sensitive fields first so the mask operates on
 		// plaintext (first-4 / last-4 of the real token, not the ciphertext
@@ -94,6 +95,7 @@ func (h *ChannelHandler) List(c *gin.Context) {
 		var allowed []string
 		if err := json.Unmarshal(allowedJSON, &allowed); err != nil {
 			log.Printf("Channels: unmarshal allowed_users for channel %s: %v", id, err)
+			allowed = []string{}
 		}

 		entry := map[string]interface{}{
@@ -540,9 +542,11 @@ func (h *ChannelHandler) Webhook(c *gin.Context) {
 		}
 		if err := json.Unmarshal(configJSON, &row.Config); err != nil {
 			log.Printf("Channels: unmarshal config for webhook row %s: %v", row.ID, err)
+			row.Config = map[string]interface{}{}
 		}
 		if err := json.Unmarshal(allowedJSON, &row.AllowedUsers); err != nil {
 			log.Printf("Channels: unmarshal allowed_users for webhook row %s: %v", row.ID, err)
+			row.AllowedUsers = []string{}
 		}
 		if err := channels.DecryptSensitiveFields(row.Config); err != nil {
 			log.Printf("Channels: decrypt webhook row %s: %v", row.ID, err)
@@ -116,6 +116,56 @@ func TestChannelHandler_List(t *testing.T) {
 	}
 }

+func TestChannelHandler_List_InvalidJSON_FallsBack(t *testing.T) {
+	mock := setupTestDB(t)
+	handler := NewChannelHandler(newTestChannelManager())
+
+	rows := sqlmock.NewRows([]string{
+		"id", "workspace_id", "channel_type", "channel_config", "enabled",
+		"allowed_users", "last_message_at", "message_count", "created_at", "updated_at",
+	}).AddRow(
+		"ch-bad", "ws-1", "telegram",
+		[]byte(`{not valid json`),
+		true, []byte(`[also not json`), nil, 0, nil, nil,
+	)
+	mock.ExpectQuery("SELECT .* FROM workspace_channels WHERE workspace_id").
+		WithArgs("ws-1").
+		WillReturnRows(rows)
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Request, _ = http.NewRequest("GET", "/workspaces/ws-1/channels", nil)
+	c.Params = gin.Params{{Key: "id", Value: "ws-1"}}
+
+	handler.List(c)
+
+	if w.Code != 200 {
+		t.Errorf("expected 200, got %d", w.Code)
+	}
+
+	var result []map[string]interface{}
+	json.Unmarshal(w.Body.Bytes(), &result)
+	if len(result) != 1 {
+		t.Fatalf("expected 1 channel, got %d", len(result))
+	}
+
+	config, ok := result[0]["config"].(map[string]interface{})
+	if !ok {
+		t.Fatalf("expected config to be a map, got %T", result[0]["config"])
+	}
+	if len(config) != 0 {
+		t.Errorf("expected empty config after unmarshal fallback, got %v", config)
+	}
+
+	allowed, ok := result[0]["allowed_users"].([]interface{})
+	if !ok {
+		t.Fatalf("expected allowed_users to be a slice, got %T", result[0]["allowed_users"])
+	}
+	if len(allowed) != 0 {
+		t.Errorf("expected empty allowed_users after unmarshal fallback, got %v", allowed)
+	}
+}
+
 // ==================== Create ====================

 func TestChannelHandler_Create_Success(t *testing.T) {
@@ -546,6 +596,41 @@ func TestChannelHandler_Webhook_UnknownType(t *testing.T) {
 	}
 }

+// TestChannelHandler_Webhook_InvalidJSON_FallsBack verifies that when the DB
+// row contains invalid JSON for channel_config or allowed_users, the webhook
+// handler logs the error and falls back to an empty map/slice rather than
+// leaving the fields nil (which would panic on downstream code that expects
+// concrete values). With empty config there is no chat_id match, so the
+// handler returns {"status":"no_channel"}.
+func TestChannelHandler_Webhook_InvalidJSON_FallsBack(t *testing.T) {
+	mock := setupTestDB(t)
+	handler := NewChannelHandler(newTestChannelManager())
+
+	mock.ExpectQuery(`SELECT id, workspace_id, channel_type, channel_config, enabled, allowed_users FROM workspace_channels WHERE channel_type = .* AND enabled = true`).
+		WithArgs("telegram").
+		WillReturnRows(sqlmock.NewRows([]string{
+			"id", "workspace_id", "channel_type", "channel_config", "enabled", "allowed_users",
+		}).AddRow("ch-bad", "ws-1", "telegram", []byte(`{bad json`), true, []byte(`[bad json`)))
+
+	body := `{"update_id":1,"message":{"message_id":1,"from":{"id":111,"is_bot":false,"first_name":"Test","username":"testuser"},"chat":{"id":-100123,"title":"Test Group","type":"supergroup"},"date":1700000000,"text":"hello"}}`
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Request = httptest.NewRequest(http.MethodPost, "/webhooks/telegram", strings.NewReader(body))
+	c.Request.Header.Set("Content-Type", "application/json")
+	c.Params = gin.Params{{Key: "type", Value: "telegram"}}
+
+	handler.Webhook(c)
+
+	if w.Code != 200 {
+		t.Errorf("expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	var resp map[string]interface{}
+	json.Unmarshal(w.Body.Bytes(), &resp)
+	if resp["status"] != "no_channel" {
+		t.Errorf("expected status 'no_channel', got %v", resp["status"])
+	}
+}
+
 // ==================== Discover ====================

 func TestChannelHandler_Discover_MissingToken(t *testing.T) {
@@ -161,7 +161,7 @@ func (h *PluginsHandler) uninstallViaDocker(ctx context.Context, c *gin.Context,
 	// 1. Strip plugin's rule/fragment markers from CLAUDE.md (mirrors
 	//    AgentskillsAdaptor.uninstall lines 184-188). Best-effort: if
 	//    the user edited CLAUDE.md, our marker stays untouched.
-	h.stripPluginMarkersFromMemory(ctx, containerName, pluginName)
+	h.stripPluginMarkersFromMemory(ctx, workspaceID, containerName, pluginName)

 	// 2. Remove copied skill dirs declared in the plugin's plugin.yaml.
 	for _, skill := range skillNames {
@@ -171,9 +171,11 @@ func (h *PluginsHandler) uninstallViaDocker(ctx context.Context, c *gin.Context,
 			log.Printf("Plugin uninstall: skipping invalid skill name %q in %s: %v", skill, pluginName, err)
 			continue
 		}
-		_, _ = h.execAsRoot(ctx, containerName, []string{
+		if _, rmErr := h.execAsRoot(ctx, containerName, []string{
 			"rm", "-rf", "/configs/skills/" + skill,
-		})
+		}); rmErr != nil {
+			log.Printf("Plugin uninstall: failed to remove skill %s from %s: %v", skill, workspaceID, rmErr)
+		}
 	}

 	// 3. Delete the plugin directory itself (as root to handle file ownership).
@@ -393,7 +393,7 @@ func (h *PluginsHandler) readPluginSkillsFromContainer(ctx context.Context, cont
 // `# Plugin: <name> /` — mirrors AgentskillsAdaptor.uninstall's stripping
 // logic so install/uninstall are symmetric. Best-effort: silent on read or
 // write failure, since the rest of uninstall must still succeed.
-func (h *PluginsHandler) stripPluginMarkersFromMemory(ctx context.Context, containerName, pluginName string) {
+func (h *PluginsHandler) stripPluginMarkersFromMemory(ctx context.Context, workspaceID, containerName, pluginName string) {
 	// Use sed via bash -c for atomic in-place delete: drop the marker line
 	// and the blank line that follows it (install adds a leading blank line
 	// before the marker via append_to_memory). Three sed passes mirror the
@@ -417,7 +417,9 @@ func (h *PluginsHandler) stripPluginMarkersFromMemory(ctx context.Context, conta
 		`awk 'BEGIN{skip=0; blanks=0} /^%s/{skip=1; blanks=0; next} skip==1 && /^[[:space:]]*$/{blanks++; if(blanks>=2){skip=0; print; next} next} /^# Plugin: /{if(skip==1)skip=0} skip==1{next} {print}' /configs/CLAUDE.md > /tmp/claude.new && mv /tmp/claude.new /configs/CLAUDE.md`,
 		regexpEscapeForAwk(marker),
 	)
-	_, _ = h.execAsRoot(ctx, containerName, []string{"bash", "-c", script})
+	if _, awkErr := h.execAsRoot(ctx, containerName, []string{"bash", "-c", script}); awkErr != nil {
+		log.Printf("Plugin uninstall: failed to strip markers from CLAUDE.md for %s in %s: %v", pluginName, workspaceID, awkErr)
+	}
 }

 // regexpEscapeForAwk escapes characters that have special meaning inside an
@@ -332,6 +332,7 @@ func (h *WorkspaceHandler) buildProvisionerConfig(
 		InstanceType:    payload.Compute.InstanceType,
 		DiskGB:          int32(payload.Compute.Volume.RootGB),
 		DataPersistence: payload.Compute.DataPersistence,
+		Provider:        payload.Compute.Provider,
 		Display: provisioner.WorkspaceDisplayConfig{
 			Mode:     payload.Compute.Display.Mode,
 			Width:    payload.Compute.Display.Width,
@@ -174,6 +174,11 @@ type WorkspaceCompute struct {
 	// disk (wiped each recreate — privacy); "" = auto (desktop-control persists,
 	// others follow the org flag). Forwarded verbatim to CP's data_persistence.
 	DataPersistence string `json:"data_persistence,omitempty"`
+	// Provider is the CLOUD/compute backend for this workspace box (multi-provider
+	// RFC, per-workspace): ""/"aws" = default EC2; "hetzner"/"gcp" route to the
+	// CP WorkspaceProvisioner. Distinct from the LLM/model provider. Forwarded to
+	// CP /cp/workspaces/provision `provider`.
+	Provider string `json:"provider,omitempty"`
 }

 type CreateWorkspacePayload struct {
@@ -161,6 +161,9 @@ type cpProvisionRequest struct {
 	Tier         int    `json:"tier"`
 	InstanceType string `json:"instance_type,omitempty"`
 	DiskGB       int32  `json:"disk_gb,omitempty"`
+	// Provider routes the CP to the compute backend for this workspace box
+	// (multi-provider RFC, per-workspace). Distinct from the LLM/model provider.
+	Provider string `json:"provider,omitempty"`
 	// DataPersistence is the per-workspace durable-data choice (internal#734);
 	// CP validates the enum at its provision edge and resolves the data volume
 	// from it. Empty = auto (omitted on the wire).
@@ -257,6 +260,7 @@ func (p *CPProvisioner) Start(ctx context.Context, cfg WorkspaceConfig) (string,
 		InstanceType:    cfg.InstanceType,
 		DiskGB:          cfg.DiskGB,
 		DataPersistence: cfg.DataPersistence,
+		Provider:        cfg.Provider,
 		Display:         cfg.Display,
 		PlatformURL:     cfg.PlatformURL,
 		Env:             env,
@@ -100,6 +100,7 @@ type WorkspaceConfig struct {
 	InstanceType    string // Optional CP EC2 instance type override (SaaS only)
 	DiskGB          int32  // Optional CP root volume size override in GiB (SaaS only)
 	DataPersistence string // internal#734: "persist"|"ephemeral"|"" — durable-data choice forwarded to CP (SaaS only)
+	Provider        string // multi-provider RFC: ""/"aws"|"hetzner"|"gcp" compute backend for the workspace box (per-workspace; distinct from LLM/model provider). Forwarded to CP.
 	Display         WorkspaceDisplayConfig
 	EnvVars         map[string]string // Additional env vars (API keys, etc.)
 	PlatformURL     string
@@ -0,0 +1,322 @@
+//go:build staging_e2e
+
+package staginge2e
+
+import (
+	"fmt"
+	"net/http"
+	"testing"
+	"time"
+)
+
+// TestDataVolumeSurvivesRecreate_Staging closes the data-persistence coverage
+// gap flagged in core#2332 (P0.5): "data-volume survives recreate" and
+// "snapshot-before-container-swap (/home/agent not wiped)" had NO e2e, and both
+// map to a real past incident — feedback_workspace_container_swap_wipes_home_agent:
+// on a container swap, only the /configs + /workspace binds (the durable data
+// volume, cp#326) survive; the container's own $HOME (/home/agent) is ephemeral
+// and is WIPED unless a snapshot is taken BEFORE docker stop+rm+run.
+//
+// This is the FORWARD half of that incident: prove the durable-data invariant
+// holds across a recreate so a future regression that drops the data-volume
+// reattach (or that flips a "persist" workspace to ephemeral) fails LOUD here
+// instead of silently eating a customer's /workspace state.
+//
+// What it does, end-to-end, against a real staging tenant:
+//  0. Provision a throwaway org + tenant via the CP admin API and acquire the
+//     tenant admin token (shared harness — mirrors workspace_lifecycle_test.go).
+//  1. Create a workspace with compute.data_persistence="persist" (the durable
+//     data-volume choice, internal#734) and wait for it to come ONLINE.
+//  2. Write a unique sentinel into /workspace (?root=/workspace) — the data
+//     volume per cp#326 — via the tenant Files API.
+//  3. Probe the /home/agent (container-$HOME) surface to encode the documented
+//     contract for the ephemeral side (see assertAgentHomeContract).
+//  4. Trigger a recreate / container-swap on the SAME data volume via
+//     POST /workspaces/:id/restart, and wait for ONLINE again.
+//  5. Assert the /workspace sentinel SURVIVES (data volume reattached +
+//     persisted). This is the load-bearing assertion — a wipe here is the
+//     regression we are gating.
+//
+// Guarded by the staging_e2e build tag and STAGING_E2E=1 env gate. Teardown is
+// t.Cleanup-driven (admin DELETE /cp/admin/tenants + DELETE /workspaces/:id).
+// Promote-to-required is a CTO call (infra-bound; see doc.go).
+func TestDataVolumeSurvivesRecreate_Staging(t *testing.T) {
+	cfg := requireStagingEnv(t)
+
+	// Unique-per-run sentinel so a stale prior run can never make a wiped
+	// volume look "survived" (we compare exact content, not mere existence).
+	stamp := time.Now().UnixNano()
+	relPath := fmt.Sprintf("e2e-persist/%d.sentinel", stamp)
+
+	slug := fmt.Sprintf("e2e-persist-%d", time.Now().Unix()%100000000)
+	t.Logf("data-persistence: slug=%s", slug)
+
+	// --- Step 0: provision org + tenant, acquire token + wait TLS ready ---
+	orgID := adminCreateOrg(t, cfg, slug)
+	t.Cleanup(func() { adminDeleteTenant(t, cfg, slug) })
+	t.Logf("org created: org_id=%s", orgID)
+
+	token := tenantAdminToken(t, cfg, slug)
+	tenantHost := slug + "." + cfg.subdomainSuffix
+	waitForHTTP(t, tenantHost, http.StatusOK, 10*time.Minute, "tenant /health ready")
+	t.Logf("tenant TLS ready: %s", tenantHost)
+
+	sentinel := fmt.Sprintf("data-volume-survives-recreate stamp=%d host=%s", stamp, tenantHost)
+
+	// --- Step 1: create workspace with durable data persistence ---
+	wsID := createPersistWorkspace(t, tenantHost, token, orgID, stamp)
+	t.Cleanup(func() { deletePersistWorkspace(t, tenantHost, token, orgID, wsID) })
+	t.Logf("workspace created: id=%s (data_persistence=persist)", wsID)
+
+	waitForWorkspaceOnline(t, tenantHost, token, orgID, wsID, 20*time.Minute)
+	t.Logf("workspace %s ONLINE", wsID)
+
+	// --- Step 2: write the /workspace sentinel (data volume, cp#326) ---
+	writeWorkspaceFile(t, tenantHost, token, orgID, wsID, "/workspace", relPath, sentinel)
+	t.Logf("wrote /workspace sentinel: root=/workspace path=%s", relPath)
+
+	// Read it straight back so a write that silently no-op'd can't masquerade
+	// as a survived-recreate later. This also confirms the EIC write landed on
+	// the host data volume before we swap the container out from under it.
+	if got := readWorkspaceFile(t, tenantHost, token, orgID, wsID, "/workspace", relPath); got != sentinel {
+		t.Fatalf("pre-recreate readback mismatch: wrote %q, read %q", sentinel, got)
+	}
+	t.Logf("pre-recreate readback OK")
+
+	// --- Step 3: encode the /home/agent (ephemeral container-$HOME) contract ---
+	assertAgentHomeContract(t, tenantHost, token, orgID, wsID, stamp)
+
+	// A successful Files write to a SaaS workspace can itself debounce-trigger
+	// an auto-restart (internal#624). Settle that window first so our explicit
+	// recreate below is the swap we actually measure, not a coalesced one that
+	// races our readback.
+	settleAutoRestart(t, tenantHost, token, orgID, wsID)
+
+	// --- Step 4: recreate / container-swap on the SAME data volume ---
+	// POST /restart is the recreate path: Stop (prune=false ALWAYS for restart,
+	// so the data volume is NEVER erased) -> re-provision on the same volume,
+	// templates NOT re-applied. See workspace_restart.go runRestartCycle.
+	triggerRecreate(t, tenantHost, token, orgID, wsID)
+	t.Logf("recreate (container swap) triggered via POST /restart")
+
+	// The swap flips status to 'provisioning'; wait for it to come back ONLINE.
+	waitForRecreateThenOnline(t, tenantHost, token, orgID, wsID, 20*time.Minute)
+	t.Logf("workspace %s back ONLINE after recreate", wsID)
+
+	// --- Step 5: LOAD-BEARING — the /workspace sentinel must SURVIVE ---
+	got := readWorkspaceFile(t, tenantHost, token, orgID, wsID, "/workspace", relPath)
+	if got != sentinel {
+		t.Fatalf("DATA-VOLUME REGRESSION: /workspace sentinel did NOT survive recreate.\n"+
+			"  wrote: %q\n  read:  %q\n"+
+			"  This is the cp#326 durable-data-volume invariant: a 'persist' workspace's\n"+
+			"  /workspace MUST survive a container swap. A wipe here means the data volume\n"+
+			"  was not reattached (or a persist→ephemeral regression). See\n"+
+			"  feedback_workspace_container_swap_wipes_home_agent.", sentinel, got)
+	}
+	t.Logf("PASS: /workspace sentinel SURVIVED recreate — data-volume invariant holds (cp#326)")
+}
+
+// assertAgentHomeContract encodes the CORRECT, documented expectation for the
+// /home/agent (container-$HOME) side of the incident.
+//
+// The Files API exposes the container's own $HOME via ?root=/agent-home (the
+// docker-exec backend, internal#425 RFC). That backend is intentionally STUBBED
+// today: every verb returns 501 Not Implemented. So there is NO supported
+// platform write path into the container's /home/agent — which is precisely
+// because that directory is EPHEMERAL: it lives inside the container, not on the
+// durable data volume, and is WIPED on every container swap unless a snapshot is
+// taken first (the incident's snapshot-before-stop+rm+run rule, which is a
+// CP-side provisioner concern, not a tenant ws-server file-API surface).
+//
+// This assertion is the regression tripwire for that contract: if a future
+// change wires /agent-home to a path WITHOUT also making it data-volume-backed,
+// this 501 flips to 200 and the test fails LOUD — forcing whoever lit up the
+// surface to first answer "is /home/agent now durable, and was the snapshot
+// hook added?" rather than silently shipping a wipe-on-recreate surface.
+//
+// We do NOT write-then-recreate-then-expect-wipe on /home/agent: asserting a
+// WIPE as a pass would be fail-open (a no-op write would also "pass"). Pinning
+// the 501 contract is the fail-closed encoding.
+func assertAgentHomeContract(t *testing.T, host, token, orgID, wsID string, stamp int64) {
+	t.Helper()
+	rel := fmt.Sprintf("e2e-persist/%d.home.sentinel", stamp)
+	url := fmt.Sprintf("https://%s/workspaces/%s/files/%s?root=%s",
+		host, wsID, rel, "/agent-home")
+	status, body := doTenantJSON(t, "PUT", url, token, orgID, fmt.Sprintf(`{"content":%q}`, "x"))
+
+	switch status {
+	case http.StatusNotImplemented:
+		// Documented contract: container-$HOME browse/write is stubbed BECAUSE
+		// it is ephemeral. No durable surface to assert survival on. Good.
+		t.Logf("/home/agent contract OK: /agent-home is 501 (ephemeral container-$HOME, no durable write surface — snapshot-before-swap is a CP-side concern)")
+	case http.StatusOK:
+		// The stub was lit up. This is a contract change that MUST be paired
+		// with data-volume backing + a snapshot-before-swap hook; until this
+		// test is extended to prove BOTH, treat the bare flip as a regression
+		// of the documented ephemeral contract.
+		t.Fatalf("CONTRACT DRIFT: PUT ?root=/agent-home returned 200 — the container-$HOME surface was wired up.\n"+
+			"  Per feedback_workspace_container_swap_wipes_home_agent, /home/agent is EPHEMERAL and wiped on\n"+
+			"  container swap unless snapshotted first. If this surface is now durable, EXTEND this test to\n"+
+			"  write→recreate→assert-survival on /home/agent AND assert the snapshot-before-swap hook fired.\n"+
+			"  Do not leave a write-able-but-ephemeral surface uncovered. body=%s", body)
+	default:
+		// 4xx other than 501 (e.g. 400/404) is acceptable — still "not a
+		// durable write surface". Anything 5xx that ISN'T 501 is a real bug.
+		if status >= 500 {
+			t.Fatalf("/home/agent contract probe: unexpected %d (want 501 or a 4xx): %s", status, body)
+		}
+		t.Logf("/home/agent contract: ?root=/agent-home returned %d (non-durable surface) — acceptable", status)
+	}
+}
+
+// --- workspace lifecycle over the tenant API ------------------------------
+
+// createPersistWorkspace creates a throwaway workspace with the durable
+// data-volume choice (compute.data_persistence="persist", internal#734). The
+// "persist" choice is what makes /workspace survive a recreate; we set it
+// explicitly rather than relying on the auto/org-flag default so the invariant
+// under test is unambiguous.
+func createPersistWorkspace(t *testing.T, host, token, orgID string, stamp int64) string {
+	t.Helper()
+	url := "https://" + host + "/workspaces"
+	body := fmt.Sprintf(
+		`{"name":%q,"runtime":%q,"tier":%d,"compute":{"data_persistence":%q}}`,
+		fmt.Sprintf("e2e-persist-%d", stamp%100000000), "claude-code", 1, "persist",
+	)
+	status, resp := doTenantJSON(t, "POST", url, token, orgID, body)
+	if status != http.StatusCreated && status != http.StatusOK {
+		t.Fatalf("create workspace: HTTP %d: %s", status, resp)
+	}
+	id := jsonField(resp, "id")
+	if id == "" {
+		t.Fatalf("create workspace: no id in response: %s", resp)
+	}
+	return id
+}
+
+// deletePersistWorkspace is the t.Cleanup teardown — best-effort, never fails
+// the test. DELETE without prune so a hung delete doesn't strand the test;
+// staging sweep reclaims any leftover compute. (The org/tenant itself is torn
+// down separately via adminDeleteTenant.)
+func deletePersistWorkspace(t *testing.T, host, token, orgID, wsID string) {
+	t.Helper()
+	url := "https://" + host + "/workspaces/" + wsID
+	status, resp := doTenantJSON(t, "DELETE", url, token, orgID, "")
+	if status != http.StatusOK && status != http.StatusAccepted && status != http.StatusNoContent && status != http.StatusNotFound {
+		t.Logf("WARNING: teardown DELETE workspace %s returned HTTP %d: %s (manual cleanup may be needed)", wsID, status, resp)
+		return
+	}
+	t.Logf("teardown: deleted workspace %s (HTTP %d)", wsID, status)
+}
+
+// waitForWorkspaceOnline polls GET /workspaces/:id until .status == "online".
+func waitForWorkspaceOnline(t *testing.T, host, token, orgID, wsID string, timeout time.Duration) {
+	t.Helper()
+	url := "https://" + host + "/workspaces/" + wsID
+	deadline := time.Now().Add(timeout)
+	var last string
+	for time.Now().Before(deadline) {
+		status, body := doTenantJSON(t, "GET", url, token, orgID, "")
+		if status == http.StatusOK {
+			last = jsonField(body, "status")
+			if last == "online" {
+				return
+			}
+		}
+		time.Sleep(10 * time.Second)
+	}
+	t.Fatalf("workspace %s did not reach status=online within %s (last=%q)", wsID, timeout, last)
+}
+
+// triggerRecreate POSTs /restart, the recreate / container-swap path. The
+// handler tears down the container and re-provisions on the SAME data volume
+// (Stop is called with prune=false for restart — see workspace_restart.go's
+// cpStopWithRetryErr — so a recreate can NEVER erase the data volume).
+func triggerRecreate(t *testing.T, host, token, orgID, wsID string) {
+	t.Helper()
+	url := "https://" + host + "/workspaces/" + wsID + "/restart"
+	status, body := doTenantJSON(t, "POST", url, token, orgID, "")
+	if status != http.StatusOK && status != http.StatusAccepted {
+		t.Fatalf("trigger recreate (POST /restart): HTTP %d: %s", status, body)
+	}
+}
+
+// waitForRecreateThenOnline waits out the swap. The recreate flips status to
+// 'provisioning'; we first observe it LEAVE online (so we don't read a stale
+// "still online" before the swap starts), then wait for it to return to online.
+// If we never catch the provisioning dip (fast swap), the subsequent online
+// poll still proves liveness — the load-bearing assertion is the sentinel read,
+// not the transient state machine.
+func waitForRecreateThenOnline(t *testing.T, host, token, orgID, wsID string, timeout time.Duration) {
+	t.Helper()
+	url := "https://" + host + "/workspaces/" + wsID
+	deadline := time.Now().Add(timeout)
+
+	// Brief window to catch the provisioning dip (best-effort; not required).
+	dipDeadline := time.Now().Add(90 * time.Second)
+	for time.Now().Before(dipDeadline) {
+		status, body := doTenantJSON(t, "GET", url, token, orgID, "")
+		if status == http.StatusOK && jsonField(body, "status") != "online" {
+			break
+		}
+		time.Sleep(3 * time.Second)
+	}
+
+	var last string
+	for time.Now().Before(deadline) {
+		status, body := doTenantJSON(t, "GET", url, token, orgID, "")
+		if status == http.StatusOK {
+			last = jsonField(body, "status")
+			if last == "online" {
+				return
+			}
+		}
+		time.Sleep(10 * time.Second)
+	}
+	t.Fatalf("workspace %s did not return to status=online after recreate within %s (last=%q)", wsID, timeout, last)
+}
+
+// settleAutoRestart absorbs the internal#624 file-write→restart debounce so the
+// explicit recreate we measure isn't coalesced with an implicit one. The
+// debounce window is 15s + a restart cycle; we poll back to a stable online.
+func settleAutoRestart(t *testing.T, host, token, orgID, wsID string) {
+	t.Helper()
+	// Give the debounce window time to fire (or not) ...
+	time.Sleep(20 * time.Second)
+	// ... then ensure we're back to a stable online before the measured swap.
+	waitForWorkspaceOnline(t, host, token, orgID, wsID, 10*time.Minute)
+}
+
+// --- tenant Files API ------------------------------------------------------
+
+// writeWorkspaceFile PUTs a file via the tenant Files API into the given root.
+// root="/workspace" is the literal data-volume path (cp#326).
+func writeWorkspaceFile(t *testing.T, host, token, orgID, wsID, root, relPath, content string) {
+	t.Helper()
+	url := fmt.Sprintf("https://%s/workspaces/%s/files/%s?root=%s",
+		host, wsID, relPath, root)
+	status, body := doTenantJSON(t, "PUT", url, token, orgID, fmt.Sprintf(`{"content":%q}`, content))
+	if status != http.StatusOK {
+		t.Fatalf("write %s%s: HTTP %d: %s", root, relPath, status, body)
+	}
+}
+
+// readWorkspaceFile GETs a file via the tenant Files API and returns its
+// content. Fails the test on any non-200 (a not-found after a recreate is the
+// wipe we are gating, so the caller compares content and emits the regression
+// message — but a transport/auth failure should still fail loud here).
+func readWorkspaceFile(t *testing.T, host, token, orgID, wsID, root, relPath string) string {
+	t.Helper()
+	url := fmt.Sprintf("https://%s/workspaces/%s/files/%s?root=%s",
+		host, wsID, relPath, root)
+	status, body := doTenantJSON(t, "GET", url, token, orgID, "")
+	if status == http.StatusNotFound {
+		// Surface the not-found as empty content; the caller's exact-content
+		// compare turns this into the DATA-VOLUME REGRESSION message.
+		return ""
+	}
+	if status != http.StatusOK {
+		t.Fatalf("read %s%s: HTTP %d: %s", root, relPath, status, body)
+	}
+	return jsonField(body, "content")
+}