Merge branch 'main' into fix/plugin-uninstall-exec-errors

Merge PR #2362 via Gitea merge queue
Serialized merge by gitea-merge-queue after current-main, genuine approvals, and required CI checks were green.
2026-06-06 13:30:27 +00:00 · 2026-06-06 12:50:47 +00:00 · 2026-06-06 11:05:42 +00:00 · 2026-06-06 10:50:18 +00:00 · 2026-06-06 10:13:50 +00:00 · 2026-06-06 03:07:06 -07:00
21 changed files with 2268 additions and 172 deletions
@@ -1,16 +1,77 @@
 #!/usr/bin/env python3
 """gitea-merge-queue — conservative serialized merge bot for Gitea.

-Gitea 1.22.6 has auto-merge (`pull_auto_merge`) but no GitHub-style merge
+Gitea 1.22.6+ has auto-merge (`pull_auto_merge`) but no GitHub-style merge
 queue. This script provides the missing serialized policy in user space:

-1. Pick the oldest open PR carrying QUEUE_LABEL.
-2. Refuse to act unless main is green.
+1. Scan open same-repo PRs that are NOT opted out (auto-discovery, see below),
+   oldest-first, skipping drafts, until an ACTIONABLE one is found. A non-ready
+   candidate (REQUEST_CHANGES, mergeable!=True, insufficient genuine approvals,
+   or red required CI) is SKIPPED so it cannot head-of-line block newer ready
+   PRs; the scan continues to the next candidate.
+2. Refuse to act unless main's BP-required contexts are green.
 3. Refuse fork PRs; the queue may only mutate same-repo branches.
 4. If the PR branch does not contain current main, call Gitea's
   /pulls/{n}/update endpoint and stop. CI must rerun on the updated head.
-5. If the updated PR head has all required contexts green, merge with the
-   non-bypass merge actor token.
+5. Merge ONLY when, on the PR's CURRENT head sha:
+     - >= REQUIRED_APPROVALS distinct GENUINE official APPROVED reviews from
+       the recognised reviewer set (not stale, not dismissed, commit_id ==
+       current head), AND
+     - no open official REQUEST_CHANGES on the current head, AND
+     - every BP-required status context is green, AND
+     - the PR is mergeable.
+
+Authoritative gates (fail-closed):
+  - The REQUIRED status contexts come from BRANCH PROTECTION
+    (`status_check_contexts`), not a hand-maintained env list. If branch
+    protection cannot be enumerated, the queue HOLDS (does not merge blindly).
+  - NON-required reds (qa-review, security-review, sop-tier, sop-checklist
+    when not branch-required, E2E Chat, Staging SaaS, ci-arm64-advisory, any
+    continue-on-error job) MUST NOT block. They are reported, never gating.
+  - `force_merge=true` is used ONLY when the merge is blocked *solely* by
+    missing-but-non-required governance contexts (required are green + genuine
+    approvals present). It is NEVER used to bypass a failing REQUIRED context
+    or missing approvals.
+
+Auto-discovery (opt-OUT, label-optional):
+  The queue is SELF-SUSTAINING — a ready PR does NOT need a human (or an agent)
+  to add the `merge-queue` label first. When AUTO_DISCOVER is on (default), the
+  queue enumerates ALL open same-repo PRs and considers any that meets the full
+  merge bar (genuine approvals on current head + BP-required green + mergeable +
+  no open REQUEST_CHANGES). The merge bar above is UNCHANGED; auto-discovery only
+  changes WHICH PRs are considered, not whether they are mergeable.
+
+  This deliberately removes the historical dependency on an agent adding the
+  `merge-queue` label — agent Gitea tokens lack `write:issue` (labels are
+  issue-scoped), so they could never self-label and the queue stalled. The label
+  is now OPTIONAL metadata, not a gate.
+
+  SAFETY is preserved as opt-OUT: any PR carrying an opt-out label
+  (OPT_OUT_LABELS — `merge-queue-hold`, `do-not-auto-merge`, `wip`, `draft` by
+  default) is skipped (never auto-considered, never merged). Draft PRs
+  (draft=true STATE) are also skipped; the literal `draft` LABEL is an
+  additional explicit opt-out a human can apply without converting to a draft.
+  A human who wants to keep a PR out of autonomous merging just adds one of
+  those labels. Setting AUTO_DISCOVER=0 restores the legacy opt-IN behaviour
+  (only PRs already carrying QUEUE_LABEL are considered).
+
+Head-of-line (HOL) safety has two complementary layers:
+  (a) The queue SCANS THROUGH the FIFO candidate list and skips any non-ready
+      PR (REQUEST_CHANGES, mergeable!=True, insufficient genuine approvals, or
+      red required CI) instead of locking on the oldest and waiting, so a PR
+      that can never become ready without human action does not block newer
+      ready PRs.
+  (b) For the candidate the scan acts on, two permanent failure modes HOLD the
+      PR (apply HOLD_LABEL) and let the scan CONTINUE to the next candidate
+      rather than re-selecting the same wedged PR every tick:
+        - a permanent permission/4xx merge error (403/404/405), and
+        - a persistent branch-update conflict (the /update endpoint returns
+          HTTP 409 because the PR branch cannot be merged with main without a
+          manual rebase). A conflict will not self-resolve, so retrying it
+          every tick would HOL-block every ready PR behind it (issue #2352).
+
+Status-fetch is fail-closed: if the combined status for a sha cannot be
+fetched, the PR is skipped this tick (never treated as green).

 The script is intentionally one-PR-per-run. Workflow/cron concurrency should
 serialize invocations so two green PRs cannot merge against the same main.
@@ -40,6 +101,33 @@ WATCH_BRANCH = _env("WATCH_BRANCH", default="main")
 QUEUE_LABEL = _env("QUEUE_LABEL", default="merge-queue")
 HOLD_LABEL = _env("HOLD_LABEL", default="merge-queue-hold")
 UPDATE_STYLE = _env("UPDATE_STYLE", default="merge")
+# Auto-discovery (opt-OUT). When truthy (default), the queue considers ALL open
+# same-repo PRs that meet the merge bar, not only PRs already carrying
+# QUEUE_LABEL — so the queue is self-sustaining without any human/agent labeling
+# (agent tokens lack write:issue and cannot self-label). Set AUTO_DISCOVER=0 to
+# restore the legacy opt-IN behaviour (QUEUE_LABEL required to be considered).
+AUTO_DISCOVER = _env("AUTO_DISCOVER", default="1").strip().lower() not in {
+    "0",
+    "false",
+    "no",
+    "off",
+    "",
+}
+# Opt-OUT labels. A PR carrying ANY of these is skipped (never auto-considered,
+# never merged) — the human escape hatch from autonomous merging. HOLD_LABEL is
+# always included so the existing hold semantics keep working. `do-not-auto-merge`
+# and `wip` let a human keep a PR out of the auto-merge path without removing it.
+# `draft` is included as a literal label too: Gitea draft STATE (draft=true) is
+# already skipped via _issue_is_draft, but a "draft" LABEL is an additional,
+# explicit opt-out signal a human can apply without converting the PR to a draft.
+OPT_OUT_LABELS = {
+    name.strip()
+    for name in _env(
+        "OPT_OUT_LABELS",
+        default="do-not-auto-merge,wip,draft",
+    ).split(",")
+    if name.strip()
+} | ({HOLD_LABEL} if HOLD_LABEL else set())
 REQUIRED_CONTEXTS_RAW = _env(
    "REQUIRED_CONTEXTS",
    default=(
@@ -57,6 +145,24 @@ PUSH_REQUIRED_CONTEXTS_RAW = _env(
    default="CI / all-required (push)",
 )

+# Recognised official-reviewer set. A merge requires this many DISTINCT genuine
+# approvals (not stale/dismissed, on the current head sha) from accounts in
+# this set. The set is the real agents-team reviewer roster; founder/CTO-agent
+# accounts are intentionally excluded so the queue cannot be satisfied by a
+# human/owner approval alone — it must be a genuine peer review.
+REVIEWER_SET = {
+    name.strip()
+    for name in _env(
+        "REVIEWER_SET",
+        default="agent-reviewer,agent-researcher,agent-reviewer-cr2",
+    ).split(",")
+    if name.strip()
+}
+# Default mirrors molecule-core branch protection (required_approvals: 2). The
+# authoritative value is read from branch protection at runtime; this is only
+# the fallback when BP does not specify one.
+REQUIRED_APPROVALS_DEFAULT = int(_env("REQUIRED_APPROVALS", default="2") or "2")
+
 OWNER, NAME = (REPO.split("/", 1) + [""])[:2] if REPO else ("", "")
 API = f"https://{GITEA_HOST}/api/v1" if GITEA_HOST else ""

@@ -67,7 +173,27 @@ class ApiError(RuntimeError):

 class MergePermissionError(ApiError):
    """Merge failed with a permanent permission error (403/404/405).
-    The queue should skip this PR and move to the next one."""
+    The queue should HOLD this PR and move to the next one."""
+
+
+class BranchUpdateConflictError(ApiError):
+    """Updating the PR branch with the base hit a merge-conflict (HTTP 409).
+
+    A true merge-conflict is NOT transient: the branch cannot be auto-updated
+    until a human/agent rebases it. The queue should HOLD this PR (apply
+    HOLD_LABEL) and advance to the next candidate, exactly like the permission
+    path — otherwise the conflicted PR sits at the queue head and is retried
+    every tick forever, head-of-line-blocking every ready PR behind it.
+
+    NOTE: distinct from mergeable=None, which is Gitea STILL COMPUTING conflict
+    state — that case is handled as a transient WAIT (no hold). This error is
+    only raised on an explicit 409 returned by the /update endpoint."""
+
+
+class BranchProtectionUnavailable(ApiError):
+    """Branch protection (the authoritative required-context source) could not
+    be enumerated. The queue must HOLD rather than merge with an unverified
+    required-context set (fail-closed, no fail-open)."""


@dataclasses.dataclass(frozen=True)
@@ -75,6 +201,20 @@ class MergeDecision:
    ready: bool
    action: str
    reason: str
+    # When ready is True, force indicates the merge is blocked SOLELY by
+    # missing-but-non-required governance contexts (required are green +
+    # genuine approvals present), so force_merge=true is justified to bypass
+    # ONLY those non-required contexts. Defaults False.
+    force: bool = False
+
+
+@dataclasses.dataclass(frozen=True)
+class BranchProtection:
+    """The subset of branch protection the queue depends on."""
+
+    required_contexts: list[str]
+    required_approvals: int
+    block_on_rejected_reviews: bool


 def _require_runtime_env() -> None:
@@ -191,6 +331,117 @@ def required_contexts_green(
    return not missing_or_bad, missing_or_bad


+def parse_branch_protection(body: Any) -> BranchProtection:
+    """Extract the queue-relevant fields from a branch_protections payload.
+
+    Fail-closed: raises BranchProtectionUnavailable when status checks are
+    expected but the required-context list cannot be enumerated. We never fall
+    back to a hand-maintained env list as the authoritative required set —
+    doing so risks merging when a real required context is red/missing.
+    """
+    if not isinstance(body, dict):
+        raise BranchProtectionUnavailable("branch protection response not an object")
+    enable = bool(body.get("enable_status_check"))
+    contexts_raw = body.get("status_check_contexts")
+    if not enable:
+        # Status checks not enforced by BP at all. With no required contexts
+        # the queue would gate on approvals only — acceptable, but make it
+        # explicit and let the caller decide.
+        contexts: list[str] = []
+    else:
+        if not isinstance(contexts_raw, list):
+            raise BranchProtectionUnavailable(
+                "enable_status_check is true but status_check_contexts is not a list"
+            )
+        contexts = [c for c in contexts_raw if isinstance(c, str) and c.strip()]
+        if not contexts:
+            raise BranchProtectionUnavailable(
+                "enable_status_check is true but status_check_contexts is empty"
+            )
+    approvals = body.get("required_approvals")
+    required_approvals = (
+        int(approvals) if isinstance(approvals, int) else REQUIRED_APPROVALS_DEFAULT
+    )
+    return BranchProtection(
+        required_contexts=contexts,
+        required_approvals=required_approvals,
+        block_on_rejected_reviews=bool(body.get("block_on_rejected_reviews")),
+    )
+
+
+def get_branch_protection(branch: str) -> BranchProtection:
+    """Fetch branch protection for `branch`; fail-closed if unavailable."""
+    try:
+        _, body = api("GET", f"/repos/{OWNER}/{NAME}/branch_protections/{branch}")
+    except ApiError as exc:
+        raise BranchProtectionUnavailable(
+            f"could not fetch branch protection for {branch}: {exc}"
+        ) from exc
+    return parse_branch_protection(body)
+
+
+def genuine_approvals(
+    reviews: list[dict],
+    *,
+    head_sha: str,
+    reviewer_set: set[str],
+) -> tuple[set[str], list[str]]:
+    """Reduce a PR's reviews to genuine official approvals on the CURRENT head.
+
+    Returns (approvers, request_changes) where:
+      - approvers is the set of distinct logins (in reviewer_set) whose LATEST
+        review on the current head is an official, non-stale, non-dismissed
+        APPROVED, and
+      - request_changes is the list of logins (in reviewer_set) whose latest
+        official review on the current head is REQUEST_CHANGES.
+
+    "Current head" is enforced two ways, because Gitea exposes both signals:
+    a review must be `official` and NOT `stale`/`dismissed`, AND when the
+    review carries a commit_id it must equal head_sha. A review with no
+    commit_id but stale=False/dismissed=False is accepted (older Gitea rows).
+    We take each reviewer's LATEST submission (reviews arrive oldest-first), so
+    a later REQUEST_CHANGES correctly supersedes an earlier APPROVED and vice
+    versa.
+    """
+    latest_by_user: dict[str, dict] = {}
+    for review in reviews:
+        if not isinstance(review, dict):
+            continue
+        user = (review.get("user") or {}).get("login")
+        if not isinstance(user, str) or user not in reviewer_set:
+            continue
+        state = str(review.get("state") or "").upper()
+        if state not in {"APPROVED", "REQUEST_CHANGES"}:
+            continue  # ignore COMMENT/PENDING/DISMISSED-state rows
+        # reviews are returned oldest-first; later entries overwrite → latest wins
+        latest_by_user[user] = review
+
+    approvers: set[str] = set()
+    request_changes: list[str] = []
+    for user, review in latest_by_user.items():
+        if not review.get("official"):
+            continue
+        if review.get("stale") or review.get("dismissed"):
+            continue
+        commit_id = review.get("commit_id")
+        if isinstance(commit_id, str) and commit_id and head_sha:
+            if commit_id != head_sha:
+                continue  # review was on a previous head
+        state = str(review.get("state") or "").upper()
+        if state == "APPROVED":
+            approvers.add(user)
+        elif state == "REQUEST_CHANGES":
+            request_changes.append(user)
+    return approvers, request_changes
+
+
+def get_pull_reviews(pr_number: int) -> list[dict]:
+    _, body = api("GET", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/reviews")
+    if not isinstance(body, list):
+        raise ApiError(f"PR #{pr_number} reviews response not list")
+    return body
+
+
 def label_names(issue: dict) -> set[str]:
    return {
        label["name"]
@@ -219,6 +470,85 @@ def choose_next_queued_issue(
    return candidates[0] if candidates else None


+def _issue_is_draft(issue: dict) -> bool:
+    """True if the issue/PR is a draft.
+
+    The /issues listing exposes draft state under the `pull_request` sub-object
+    (`{"draft": true}`); some Gitea versions also surface a top-level `draft`.
+    Either is honoured. Drafts are never auto-considered for merging.
+    """
+    pr = issue.get("pull_request")
+    if isinstance(pr, dict) and pr.get("draft") is True:
+        return True
+    return issue.get("draft") is True
+
+
+def choose_candidate_issues(
+    issues: list[dict],
+    *,
+    queue_label: str,
+    opt_out_labels: set[str],
+    auto_discover: bool,
+) -> list[dict]:
+    """All open PRs eligible for a merge attempt this tick, oldest-first.
+
+    This is the auto-discovery selector. It does NOT change the merge bar — it
+    only changes WHICH PRs are considered:
+
+      - auto_discover=True (default): every open same-repo PR is a candidate,
+        EXCEPT those carrying an opt-out label or marked draft. The QUEUE_LABEL
+        is optional metadata, not a gate, so a ready PR reaches the queue with no
+        human/agent labeling (the write:issue gap is removed).
+      - auto_discover=False: legacy opt-IN — only PRs carrying queue_label are
+        candidates (still skipping opt-out labels and drafts).
+
+    Opt-out is the safety escape hatch: any opt_out_labels member present skips
+    the PR entirely (never considered, never merged). Ordering is oldest-first
+    (created_at, then number) to preserve the serialized FIFO ordering.
+
+    Returns the FULL ordered list (not just the head) so process_once can SCAN
+    THROUGH non-ready candidates instead of locking on the oldest. A non-ready
+    auto-discovered PR (e.g. one with REQUEST_CHANGES or mergeable=false, which
+    can never become ready without human action) must NOT head-of-line block the
+    newer ready PRs behind it — the readiness check happens per-candidate in
+    process_once, and a `wait` candidate is skipped to the next one.
+    """
+    candidates = []
+    for issue in issues:
+        if "pull_request" not in issue:
+            continue
+        labels = label_names(issue)
+        if opt_out_labels & labels:
+            continue  # opt-out: human kept this PR out of autonomous merging
+        if _issue_is_draft(issue):
+            continue  # drafts are never auto-merged
+        if not auto_discover and queue_label not in labels:
+            continue  # legacy opt-IN: require the queue label
+        candidates.append(issue)
+    candidates.sort(key=lambda issue: (issue.get("created_at") or "", int(issue["number"])))
+    return candidates
+
+
+def choose_next_candidate_issue(
+    issues: list[dict],
+    *,
+    queue_label: str,
+    opt_out_labels: set[str],
+    auto_discover: bool,
+) -> dict | None:
+    """The oldest eligible candidate, or None. Thin head-of-list wrapper around
+    choose_candidate_issues; retained for callers/tests that only want the head.
+    process_once uses the full list (choose_candidate_issues) so it can scan past
+    non-ready PRs rather than HOL-block on the oldest."""
+    candidates = choose_candidate_issues(
+        issues,
+        queue_label=queue_label,
+        opt_out_labels=opt_out_labels,
+        auto_discover=auto_discover,
+    )
+    return candidates[0] if candidates else None
+
+
 def pr_contains_base_sha(commits: list[dict], base_sha: str) -> bool:
    for commit in commits:
        sha = commit.get("sha") or commit.get("id")
@@ -233,36 +563,87 @@ def pr_has_current_base(pr: dict, commits: list[dict], main_sha: str) -> bool:
    return pr_contains_base_sha(commits, main_sha)


+def _non_required_red_present(
+    latest: dict[str, dict],
+    required_contexts: list[str],
+) -> bool:
+    """True if any NON-required context is non-success.
+
+    Such reds are the governance/SOP/advisory checks Gitea may still treat as
+    "missing required context" at merge time even though branch protection does
+    not require them. Their presence is what justifies force_merge=true (we
+    have already verified every REQUIRED context is green and approvals are
+    genuine, so force only bypasses these non-required reds).
+    """
+    required = set(required_contexts)
+    for context, status in latest.items():
+        if context in required:
+            continue
+        if status_state(status) != "success":
+            return True
+    return False
+
+
 def evaluate_merge_readiness(
    *,
    main_status: dict,
    pr_status: dict,
    required_contexts: list[str],
+    required_approvals: int,
+    approvers: set[str],
+    request_changes: list[str],
    pr_has_current_base: bool,
+    mergeable: bool,
    pr_labels: set[str] | None = None,
 ) -> MergeDecision:
-    # Check push-required contexts explicitly instead of combined state.
-    # Combined state can be "failure" due to non-blocking jobs
-    # (continue-on-error: true) that don't actually gate merges.
-    # CI / all-required (push) is the authoritative gate — it respects
-    # continue-on-error and correctly aggregates all blocking failures.
+    # 1) Main's push-required contexts must be green. Combined state can be
+    #    "failure" due to non-blocking jobs (continue-on-error: true) that do
+    #    not gate merges, so check the explicit required set, not combined.
    main_latest = latest_statuses_by_context(main_status.get("statuses") or [])
    main_ok, main_bad = required_contexts_green(main_latest, push_required_contexts())
    if not main_ok:
        return MergeDecision(False, "pause", "main required contexts not green: " + ", ".join(main_bad))
+
+    # 2) PR head must contain current main.
    if not pr_has_current_base:
        return MergeDecision(False, "update", "PR head does not contain current main")

-    # Check explicit required contexts instead of combined state. Combined state
-    # can be "failure" due to non-blocking jobs with continue-on-error: true
-    # (e.g. publish-runtime-autobump/pr-validate, qa-review on stale tokens).
-    # The required_contexts list is the authoritative gate — it includes only
-    # the checks that actually block merges.
+    # 3) No open official REQUEST_CHANGES on the current head.
+    if request_changes:
+        return MergeDecision(
+            False, "wait",
+            "open REQUEST_CHANGES on current head from: " + ", ".join(sorted(request_changes)),
+        )
+
+    # 4) Enough distinct genuine official approvals on the current head.
+    if len(approvers) < required_approvals:
+        return MergeDecision(
+            False, "wait",
+            f"insufficient genuine approvals on current head: have "
+            f"{len(approvers)} ({', '.join(sorted(approvers)) or 'none'}), "
+            f"need {required_approvals}",
+        )
+
+    # 5) Every BRANCH-PROTECTION-REQUIRED status context must be green. This is
+    #    the authoritative status gate — NON-required reds (qa-review,
+    #    security-review, sop-tier/sop-checklist when not BP-required, E2E Chat,
+    #    Staging SaaS, ci-arm64-advisory, continue-on-error jobs) are NOT
+    #    consulted here and must not block.
    latest = latest_statuses_by_context(pr_status.get("statuses") or [])
    ok, missing_or_bad = required_contexts_green(latest, required_contexts, pr_labels)
    if not ok:
        return MergeDecision(False, "wait", "required contexts not green: " + ", ".join(missing_or_bad))
-    return MergeDecision(True, "merge", "ready")
+
+    # 6) Gitea must consider the PR mergeable (no conflicts).
+    if not mergeable:
+        return MergeDecision(False, "wait", "PR is not mergeable (conflicts)")
+
+    # Ready. Use force_merge ONLY if the merge would otherwise be blocked by
+    # missing-but-non-required governance contexts. Required are green and
+    # approvals are genuine, so force only bypasses non-required reds — never a
+    # failing required context or missing approval.
+    force = _non_required_red_present(latest, required_contexts)
+    return MergeDecision(True, "merge", "ready", force=force)


 def get_branch_head(branch: str) -> str:
@@ -280,6 +661,12 @@ def get_combined_status(sha: str) -> dict:
    The /status endpoint caps the `statuses` array at 30 entries (Gitea
    default page size), so we fetch the full list via /statuses with a
    higher limit. The combined `state` still comes from /status.
+
+    Fail-closed: the PRIMARY /status fetch must succeed. If it raises, the
+    error propagates so the caller skips this PR this tick (we never treat a
+    failed status fetch as green — dev-sop "no fail-open"). Only the SECONDARY
+    /statuses enrichment (which merely extends the per-context list beyond the
+    30-entry cap) is best-effort; if it fails we still have the combined set.
    """
    _, combined = api("GET", f"/repos/{OWNER}/{NAME}/commits/{sha}/status")
    if not isinstance(combined, dict):
@@ -329,6 +716,31 @@ def list_queued_issues() -> list[dict]:
    return body


+def list_candidate_issues(*, auto_discover: bool) -> list[dict]:
+    """Open PR issues eligible for consideration this tick.
+
+    With auto_discover=True (default) this enumerates ALL open PRs (no label
+    filter) so the queue is self-sustaining — a ready PR is considered without
+    any human/agent first adding QUEUE_LABEL. With auto_discover=False it falls
+    back to the legacy label-filtered listing (opt-IN). Opt-out filtering and
+    draft-skipping happen in choose_next_candidate_issue, not here.
+    """
+    if not auto_discover:
+        return list_queued_issues()
+    _, body = api(
+        "GET",
+        f"/repos/{OWNER}/{NAME}/issues",
+        query={
+            "state": "open",
+            "type": "pulls",
+            "limit": "50",
+        },
+    )
+    if not isinstance(body, list):
+        raise ApiError("candidate issues response not list")
+    return body
+
+
 def get_pull(pr_number: int) -> dict:
    _, body = api("GET", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}")
    if not isinstance(body, dict):
@@ -354,30 +766,97 @@ def update_pull(pr_number: int, *, dry_run: bool) -> None:
    print(f"::notice::updating PR #{pr_number} with base branch via style={UPDATE_STYLE}")
    if dry_run:
        return
+    try:
+        api(
+            "POST",
+            f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/update",
+            query={"style": UPDATE_STYLE},
+            expect_json=False,
+        )
+    except ApiError as exc:
+        # Gitea returns HTTP 409 when the base cannot be merged into the PR
+        # branch because of a real conflict. The queue cannot auto-resolve a
+        # conflict, so re-raise as BranchUpdateConflictError; process_once HOLDs
+        # the PR and advances (HOL guard) instead of retrying it forever.
+        # Match the HTTP STATUS token ("-> HTTP 409") specifically, not a bare
+        # "409" substring — the PR number or path can itself contain "409"
+        # (e.g. /pulls/1409/update) and must not be misread as a conflict.
+        if "-> HTTP 409" in str(exc):
+            raise BranchUpdateConflictError(str(exc)) from exc
+        raise  # re-raise other ApiErrors unchanged
+
+
+def add_label_by_name(pr_number: int, label_name: str, *, dry_run: bool) -> None:
+    """Apply an existing repo label (by name) to a PR/issue.
+
+    Used to HOLD a wedged PR so the queue advances. Resolves the label id from
+    the repo label set; if the label does not exist, raises ApiError (the
+    caller decides whether that is fatal).
+    """
+    print(f"::notice::applying label '{label_name}' to PR #{pr_number}")
+    if dry_run:
+        return
+    _, labels = api("GET", f"/repos/{OWNER}/{NAME}/labels", query={"limit": "100"})
+    label_id = None
+    if isinstance(labels, list):
+        for label in labels:
+            if isinstance(label, dict) and label.get("name") == label_name:
+                label_id = label.get("id")
+                break
+    if label_id is None:
+        raise ApiError(f"label '{label_name}' not found in repo {OWNER}/{NAME}")
    api(
        "POST",
-        f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/update",
-        query={"style": UPDATE_STYLE},
-        expect_json=False,
+        f"/repos/{OWNER}/{NAME}/issues/{pr_number}/labels",
+        body={"labels": [label_id]},
    )


-def merge_pull(pr_number: int, *, dry_run: bool) -> None:
-    payload = {
+def hold_pr(pr_number: int, hold_note: str, *, dry_run: bool) -> None:
+    """Apply HOLD_LABEL to a wedged PR so the queue advances past it.
+
+    choose_next_queued_issue skips HOLD_LABEL-bearing PRs, so this is the HOL
+    guard: a PR the queue cannot make progress on (permanent permission error
+    or unresolvable branch-update conflict) is held and a human/agent fixes it,
+    rather than the queue re-selecting it every tick forever. If the label
+    cannot be applied we still post the explanatory comment so the wedge is at
+    least visible — but we never loop on the PR.
+    """
+    try:
+        add_label_by_name(pr_number, HOLD_LABEL, dry_run=dry_run)
+    except ApiError as label_exc:
+        sys.stderr.write(
+            f"::error::could not apply HOLD_LABEL to PR #{pr_number}: {label_exc}\n"
+        )
+        hold_note += (
+            f"\n\n(NOTE: could not apply the hold label automatically: "
+            f"{label_exc}. Please add `{HOLD_LABEL}` manually.)"
+        )
+    post_comment(pr_number, hold_note, dry_run=dry_run)
+
+
+def merge_pull(pr_number: int, *, dry_run: bool, force: bool = False) -> None:
+    payload: dict[str, Any] = {
        "Do": "merge",
        "MergeTitleField": f"Merge PR #{pr_number} via Gitea merge queue",
        "MergeMessageField": (
            "Serialized merge by gitea-merge-queue after current-main, "
-            "SOP, and required CI checks were green."
+            "genuine approvals, and required CI checks were green."
        ),
    }
-    print(f"::notice::merging PR #{pr_number}")
+    if force:
+        # force_merge bypasses ONLY missing-but-non-required governance
+        # contexts. The caller has already verified required contexts are green
+        # and genuine approvals are present, so this never bypasses a failing
+        # required context or an approval shortfall.
+        payload["force_merge"] = True
+    print(f"::notice::merging PR #{pr_number}{' (force_merge: non-required reds)' if force else ''}")
    if dry_run:
        return
    try:
        api("POST", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/merge", body=payload, expect_json=False)
    except ApiError as exc:
-        # Re-raise permission-like errors so process_once can skip this PR.
+        # Re-raise permission-like errors so process_once can HOLD this PR.
        # 403 = no push access, 404 = repo/pr not found, 405 = not allowed.
        msg = str(exc)
        for code in ("403", "404", "405"):
@@ -387,7 +866,25 @@ def merge_pull(pr_number: int, *, dry_run: bool) -> None:


 def process_once(*, dry_run: bool = False) -> int:
-    contexts = required_contexts(REQUIRED_CONTEXTS_RAW)
+    # Required status contexts come from BRANCH PROTECTION, not a hand-kept env
+    # list. Fail-closed: if BP cannot be enumerated, HOLD the whole tick rather
+    # than merge against an unverified required set.
+    try:
+        bp = get_branch_protection(WATCH_BRANCH)
+    except BranchProtectionUnavailable as exc:
+        sys.stderr.write(
+            f"::error::queue held: branch protection for {WATCH_BRANCH} "
+            f"unavailable (fail-closed): {exc}\n"
+        )
+        return 0
+    contexts = bp.required_contexts
+    required_approvals = bp.required_approvals
+    print(
+        f"::notice::queue policy from branch protection: "
+        f"required_approvals={required_approvals} "
+        f"required_contexts={contexts or '[none]'}"
+    )
+
    main_sha = get_branch_head(WATCH_BRANCH)
    main_status = get_combined_status(main_sha)
    # Check push-required contexts explicitly instead of combined state.
@@ -398,83 +895,199 @@ def process_once(*, dry_run: bool = False) -> int:
        print(f"::notice::queue paused: {WATCH_BRANCH}@{main_sha[:8]} required contexts not green: {', '.join(main_bad)}")
        return 0

-    issue = choose_next_queued_issue(
-        list_queued_issues(),
+    candidates = choose_candidate_issues(
+        list_candidate_issues(auto_discover=AUTO_DISCOVER),
        queue_label=QUEUE_LABEL,
-        hold_label=HOLD_LABEL,
+        opt_out_labels=OPT_OUT_LABELS,
+        auto_discover=AUTO_DISCOVER,
    )
-    if not issue:
-        print("::notice::merge queue empty")
+    if not candidates:
+        print(
+            "::notice::no merge candidates "
+            f"(auto_discover={'on' if AUTO_DISCOVER else 'off'})"
+        )
        return 0

+    # HOL fix: SCAN THROUGH the FIFO candidate list until a PR we can ACT on is
+    # found, instead of locking on the oldest and waiting. A non-ready candidate
+    # (decision.action == "wait": REQUEST_CHANGES, mergeable!=True, insufficient
+    # genuine approvals, or red required CI) is SKIPPED — it must NOT head-of-line
+    # block the newer ready PRs behind it. The merge bar is unchanged: a skipped
+    # PR is never merged, and the first ACTIONABLE candidate (an "update" that
+    # advances a stale branch, or a fully-ready "merge") terminates the scan.
+    #
+    # `update` is treated as actionable, not skippable: a PR whose head merely
+    # lacks current main is in a legitimate in-progress state (updating it +
+    # rerunning CI moves it toward ready), unlike a PR that can never become
+    # ready without a human (RC / conflict), which is a `wait` and gets skipped.
+    for issue in candidates:
+        decision, ctx = _evaluate_candidate(
+            issue,
+            main_sha=main_sha,
+            main_status=main_status,
+            required_contexts=contexts,
+            required_approvals=required_approvals,
+            dry_run=dry_run,
+        )
+        if decision is None:
+            continue  # not merge-eligible (not-open / opted-out / fork / wrong base)
+        pr_number = ctx["pr_number"]
+        print(f"::notice::PR #{pr_number} decision={decision.action}: {decision.reason}")
+        if decision.action == "wait":
+            # Non-ready: skip to the next candidate (no HOL block, no merge).
+            continue
+        if decision.action == "update":
+            try:
+                update_pull(pr_number, dry_run=dry_run)
+            except BranchUpdateConflictError as exc:
+                # The branch cannot be updated with main because of a real
+                # conflict (HTTP 409 from /update). This is the #2352 HOL guard:
+                # a conflict will not self-resolve without a human/agent rebase,
+                # so re-attempting the update every tick would head-of-line block
+                # every ready PR behind it. HOLD this PR (apply HOLD_LABEL, which
+                # is an opt-out label so later ticks skip it) and CONTINUE the
+                # scan so a newer ready PR can still merge this tick. Fail-closed:
+                # a held PR is skipped, never merged.
+                sys.stderr.write(
+                    f"::error::branch-update conflict for PR #{pr_number}: {exc}\n"
+                )
+                hold_note = (
+                    "merge-queue: could not update this branch with "
+                    f"`{WATCH_BRANCH}` — the update returned a merge conflict "
+                    f"(HTTP 409) that the queue cannot auto-resolve ({exc}). "
+                    f"Applied `{HOLD_LABEL}` to unblock the queue (HOL guard). "
+                    f"Fix: rebase/merge `{WATCH_BRANCH}` into this branch and "
+                    f"resolve the conflicts, then remove `{HOLD_LABEL}` to requeue."
+                )
+                hold_pr(pr_number, hold_note, dry_run=dry_run)
+                continue  # held — keep scanning for a mergeable candidate
+            post_comment(
+                pr_number,
+                (
+                    f"merge-queue: updated this branch with `{WATCH_BRANCH}` at "
+                    f"`{main_sha[:12]}`. Waiting for CI on the refreshed head."
+                ),
+                dry_run=dry_run,
+            )
+            return 0
+        if decision.ready:
+            latest_main_sha = get_branch_head(WATCH_BRANCH)
+            if latest_main_sha != main_sha:
+                print(
+                    f"::notice::main moved {main_sha[:8]} -> {latest_main_sha[:8]}; "
+                    "deferring to next tick"
+                )
+                return 0
+            try:
+                merge_pull(pr_number, dry_run=dry_run, force=decision.force)
+            except MergePermissionError as exc:
+                # Permanent merge failure (HTTP 403/404/405). HOLD this PR by
+                # applying HOLD_LABEL (it becomes an opt-out label, so subsequent
+                # ticks skip it) and CONTINUE scanning so the queue still advances
+                # to the next ready PR this tick rather than stalling.
+                sys.stderr.write(f"::error::merge permission error for PR #{pr_number}: {exc}\n")
+                hold_note = (
+                    "merge-queue: merge failed with a permanent permission error "
+                    f"({exc}). No available token has Can-merge permission for this "
+                    f"PR. Applied `{HOLD_LABEL}` to unblock the queue (HOL guard). "
+                    f"Fix: grant Can-merge to the queue token, then remove "
+                    f"`{HOLD_LABEL}` to requeue."
+                )
+                try:
+                    add_label_by_name(pr_number, HOLD_LABEL, dry_run=dry_run)
+                except ApiError as label_exc:
+                    # If we cannot even apply the hold label, fall back to a comment
+                    # so the wedge is at least visible; do NOT loop on this PR.
+                    sys.stderr.write(
+                        f"::error::could not apply HOLD_LABEL to PR #{pr_number}: {label_exc}\n"
+                    )
+                    hold_note += (
+                        f"\n\n(NOTE: could not apply the hold label automatically: "
+                        f"{label_exc}. Please add `{HOLD_LABEL}` manually.)"
+                    )
+                post_comment(pr_number, hold_note, dry_run=dry_run)
+                continue  # held — keep scanning for a mergeable candidate
+            return 0
+    return 0
+
+
+def _evaluate_candidate(
+    issue: dict,
+    *,
+    main_sha: str,
+    main_status: dict,
+    required_contexts: list[str],
+    required_approvals: int,
+    dry_run: bool,
+) -> tuple[MergeDecision | None, dict]:
+    """Evaluate a single auto-discovered candidate against the full merge bar.
+
+    Returns (decision, ctx) where ctx carries {"pr_number"}. A None decision
+    means the PR is not merge-eligible at all (not open / opted-out / draft /
+    fork / wrong base) and the caller should skip to the next candidate; for
+    fork / wrong-base the explanatory comment is posted here before returning.
+
+    The merge bar is UNCHANGED from the single-PR path — this only factors the
+    per-PR evaluation out so process_once can scan multiple candidates. A failed
+    status fetch still raises (fail-closed): it propagates to the caller so the
+    PR is never treated as green.
+    """
    pr_number = int(issue["number"])
+    ctx = {"pr_number": pr_number}
    pr = get_pull(pr_number)
    if pr.get("state") != "open":
        print(f"::notice::PR #{pr_number} is not open; skipping")
-        return 0
+        return None, ctx
+    # Defensive opt-out/draft re-check on the authoritative pull payload: the
+    # /issues listing's label/draft view can lag, but the merge bar must respect
+    # the live pull state. (choose_candidate_issues already filtered on the
+    # listing; this guards against a stale listing racing a just-added opt-out.)
+    if OPT_OUT_LABELS & label_names(pr):
+        print(f"::notice::PR #{pr_number} carries an opt-out label; skipping")
+        return None, ctx
+    if pr.get("draft") is True:
+        print(f"::notice::PR #{pr_number} is a draft; skipping")
+        return None, ctx
    if pr.get("base", {}).get("ref") != WATCH_BRANCH:
        post_comment(pr_number, f"merge-queue: skipped; base branch is not `{WATCH_BRANCH}`.", dry_run=dry_run)
-        return 0
+        return None, ctx
    if pr.get("head", {}).get("repo_id") != pr.get("base", {}).get("repo_id"):
        post_comment(pr_number, "merge-queue: skipped; fork PRs are not supported by the serialized queue.", dry_run=dry_run)
-        return 0
+        return None, ctx

    head_sha = pr.get("head", {}).get("sha")
    if not isinstance(head_sha, str) or len(head_sha) < 7:
        raise ApiError(f"PR #{pr_number} missing head sha")
    commits = get_pull_commits(pr_number)
    current_base = pr_has_current_base(pr, commits, main_sha)
+    # Fail-closed: a failed status fetch raises here and propagates (the PR is
+    # never treated as green).
    pr_status = get_combined_status(head_sha)
    pr_labels = label_names(pr)
+    # FAIL-CLOSED: Gitea returns mergeable=None (or omits the field) while it is
+    # still COMPUTING conflict state. Only the literal True is decisive proof the
+    # PR is conflict-free; None and False both mean "not (yet) mergeable". We must
+    # NOT autonomously merge on an unknown — treat anything but True as not-yet-
+    # mergeable so evaluate_merge_readiness returns a "wait" decision.
+    mergeable = pr.get("mergeable") is True
+
+    reviews = get_pull_reviews(pr_number)
+    approvers, request_changes = genuine_approvals(
+        reviews, head_sha=head_sha, reviewer_set=REVIEWER_SET
+    )
+
    decision = evaluate_merge_readiness(
        main_status=main_status,
        pr_status=pr_status,
-        required_contexts=contexts,
+        required_contexts=required_contexts,
+        required_approvals=required_approvals,
+        approvers=approvers,
+        request_changes=request_changes,
        pr_has_current_base=current_base,
+        mergeable=mergeable,
        pr_labels=pr_labels,
    )
-
-    print(f"::notice::PR #{pr_number} decision={decision.action}: {decision.reason}")
-    if decision.action == "update":
-        update_pull(pr_number, dry_run=dry_run)
-        post_comment(
-            pr_number,
-            (
-                f"merge-queue: updated this branch with `{WATCH_BRANCH}` at "
-                f"`{main_sha[:12]}`. Waiting for CI on the refreshed head."
-            ),
-            dry_run=dry_run,
-        )
-        return 0
-    if decision.ready:
-        latest_main_sha = get_branch_head(WATCH_BRANCH)
-        if latest_main_sha != main_sha:
-            print(
-                f"::notice::main moved {main_sha[:8]} -> {latest_main_sha[:8]}; "
-                "deferring to next tick"
-            )
-            return 0
-        try:
-            merge_pull(pr_number, dry_run=dry_run)
-        except MergePermissionError as exc:
-            # Permanent merge failure (HTTP 403/404/405). Post a comment so
-            # maintainers know why, then return 0 so this tick is done.
-            # The PR stays in the queue; future ticks can retry after the
-            # permission issue is resolved.
-            sys.stderr.write(f"::error::merge permission error for PR #{pr_number}: {exc}\n")
-            post_comment(
-                pr_number,
-                (
-                    "merge-queue: merge failed with HTTP 405 'User not allowed to merge PR'. "
-                    "No available token has Can-merge permission on this repo. "
-                    "Fix: grant Can-merge to a token, or add a maintain/admin collaborator. "
-                    "Skipping to next queued PR on next tick."
-                ),
-                dry_run=dry_run,
-            )
-            return 0
-        return 0
-    return 0
+    return decision, ctx


 def main() -> int:
@@ -174,6 +174,16 @@ def parse_directives(
        if not parts:
            continue
        first = parts[0]
+        # Em-dash (U+2014) is a common visual separator in user-written
+        # notes, e.g.  /sop-ack Five-Axis — five-axis-review
+        # If raw_slug contains an em-dash, split on the first one so
+        # the part before becomes the slug and the rest becomes the note.
+        note_from_slug = ""
+        slug_source = raw_slug
+        emdash_idx = raw_slug.find("—")
+        if emdash_idx != -1:
+            slug_source = raw_slug[:emdash_idx].strip()
+            note_from_slug = raw_slug[emdash_idx + 1 :].strip()
        # If the slug-capture greedily matched multiple words (e.g.
        # "comprehensive testing"), preserve normalize behavior: join
        # the WHOLE first-word-token only; trailing words get appended to
@@ -186,13 +196,19 @@ def parse_directives(
            # as slug and "testing extra-note" as note. We defer the
            # disambiguation to the caller via the returned canonical
            # slug. For simplicity: try the WHOLE captured string first.
-            canonical = normalize_slug(raw_slug, numeric_aliases)
+            canonical = normalize_slug(slug_source, numeric_aliases)
        else:
-            canonical = normalize_slug(first, numeric_aliases)
+            canonical = normalize_slug(slug_source, numeric_aliases)
        note_from_group = (m.group(3) or "").strip()
-        # If we collapsed multi-word slug into kebab and there's a
-        # trailing-text group too, append it.
-        entry = (kind, canonical, note_from_group)
+        # The em-dash (U+2014) is a visual separator; the regex puts it
+        # in group(3) because it is outside the slug character class.
+        # Strip it so "/sop-ack slug — note" yields just "note".
+        if note_from_group.startswith("—"):
+            note_from_group = note_from_group[1:].strip()
+        # Combine note_from_slug (em-dash split) with note_from_group
+        # (trailing text after the slug captured by the regex group).
+        combined_note = (note_from_slug + " " + note_from_group).strip()
+        entry = (kind, canonical, combined_note)
        if kind == "sop-n/a":
            na_directives.append(entry)
        else:
@@ -48,7 +48,6 @@ set -euo pipefail
 # workflow-level jq install can fail on runners with network restrictions
 # (GitHub releases not reachable from some runner networks — infra#241
 # follow-up). This fallback is idempotent — no-op when jq is already on PATH.
-# SOP_FAIL_OPEN=1 makes this always exit 0 so CI never blocks on jq absence.
 if ! command -v jq >/dev/null 2>&1; then
  echo "::notice::jq not found on PATH — attempting install..."
  _jq_installed="no"
@@ -67,12 +66,6 @@ if ! command -v jq >/dev/null 2>&1; then
  if ! command -v jq >/dev/null 2>&1; then
    echo "::error::jq installation failed — apt-get and GitHub binary both failed."
    echo "::error::sop-tier-check requires jq for all JSON API parsing."
-    # SOP_FAIL_OPEN=1 is set in the workflow step's env — makes script always
-    # exit 0 so CI never blocks. The SOP-6 tier review gate remains enforced.
-    if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-      echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-      exit 0
-    fi
    exit 1
  fi
 fi
@@ -101,15 +94,10 @@ echo "::notice::tier-check start: repo=$OWNER/$NAME pr=$PR_NUMBER author=$PR_AUT
 # cause the script to exit prematurely when the token is empty/invalid — the
 # if check below handles that case gracefully. Without || true, a 401 from an
 # empty/invalid token causes jq to exit 1, triggering set -e and exiting the
-# entire script before SOP_FAIL_OPEN can be evaluated (the check is in the jq-
-# install block; if jq is already on PATH, that block is skipped entirely).
+# entire script before the error can be logged.
 WHOAMI=$(curl -sS -H "$AUTH" "${API}/user" | jq -r '.login // ""') || true
 if [ -z "$WHOAMI" ]; then
  echo "::error::GITEA_TOKEN cannot resolve a user via /api/v1/user — check the token scope and that the secret is wired correctly."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi
 echo "::notice::token resolves to user: $WHOAMI"
@@ -119,10 +107,6 @@ echo "::notice::token resolves to user: $WHOAMI"
 HEAD_SHA=$(curl -sS -H "$AUTH" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}" | jq -r '.head.sha // ""') || true
 if [ -z "$HEAD_SHA" ]; then
  echo "::error::Failed to fetch PR head SHA — token may be invalid."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi
 debug "pr-head-sha=$HEAD_SHA"
@@ -215,10 +199,6 @@ if [ "${SOP_DEBUG:-}" = "1" ]; then
 fi
 if [ "$_HTTP_EXIT" -ne 0 ] || [ "$HTTP_CODE" != "200" ]; then
  echo "::error::GET /orgs/${OWNER}/teams failed (curl exit=$_HTTP_EXIT HTTP=$HTTP_CODE) — token may lack read:org scope or be invalid."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi

@@ -265,17 +245,13 @@ done

 # 5. Read approving reviewers. set +e disables set -e temporarily so that curl
 # failures (e.g. empty/invalid token → HTTP 401) do not abort the script before
-# SOP_FAIL_OPEN is evaluated. set -e is restored immediately after.
+# set -e is restored immediately after.
 set +e
 REVIEWS=$(curl -sS -H "$AUTH" "${API}/repos/${OWNER}/${NAME}/pulls/${PR_NUMBER}/reviews")
 _REVIEWS_EXIT=$?
 set -e
 if [ $_REVIEWS_EXIT -ne 0 ] || [ -z "$REVIEWS" ]; then
  echo "::error::Failed to fetch reviews (curl exit=$_REVIEWS_EXIT) — token may be invalid or unreachable."
-  if [ "${SOP_FAIL_OPEN:-}" = "1" ]; then
-    echo "::warning::SOP_FAIL_OPEN=1 — exiting 0 so CI does not block."
-    exit 0
-  fi
  exit 1
 fi
 APPROVERS=$(echo "$REVIEWS" | jq -r --arg head_sha "$HEAD_SHA" '[.[] | select(.state=="APPROVED" and .commit_id == $head_sha) | .user.login] | unique | .[]') || true
@@ -208,6 +208,22 @@ class TestParseDirectives(unittest.TestCase):
        d = self.parse_ack_revoke("/sop-ack Comprehensive_Testing")
        self.assertEqual(d[0][1], "comprehensive-testing")

+    def test_emdash_separator_parsed_correctly(self):
+        # Em-dash (U+2014) between slug and note is common in practice.
+        # /sop-ack Five-Axis — five-axis-review
+        # → slug = five-axis, note = — five-axis-review
+        d = self.parse_ack_revoke("/sop-ack Five-Axis — five-axis-review")
+        self.assertEqual(len(d), 1)
+        self.assertEqual(d[0][1], "five-axis")
+        self.assertIn("five-axis-review", d[0][2])
+
+    def test_emdash_no_note(self):
+        # Em-dash at end of slug: only slug, no note content
+        d = self.parse_ack_revoke("/sop-ack Five-Axis —")
+        self.assertEqual(len(d), 1)
+        self.assertEqual(d[0][1], "five-axis")
+        self.assertEqual(d[0][2], "")  # em-dash is separator-only → empty note
+

 # ---------------------------------------------------------------------------
 # section_marker_present
@@ -205,5 +205,5 @@ n/a_gates:
    required_teams: [security, managers, ceo]
    description: >-
      Security review N/A when this change has no security surface
-      (docs-only, pure-frontend, dependency-only). A security/owners
+      (docs-only, pure-frontend, dependency-only). A security/managers/ceo
      member must post /sop-n/a security-review to activate.
@@ -34,11 +34,6 @@ jobs:
  check:
    name: Block forbidden paths
    runs-on: ubuntu-latest
-    # Phase 3 (RFC #219 §1): surface broken workflows without blocking
-    # the PR. Follow-up PR flips this off after surfaced defects are
-    # triaged.
-    # mc#1982: pre-existing continue-on-error mask; root-fix and remove, do not renew silently.
-    continue-on-error: true
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
@@ -7,10 +7,13 @@ name: gitea-merge-queue
 # the user-space queue bot, one PR per tick, using the non-bypass merge actor.
 #
 # Queue contract:
-#   - add label `merge-queue` to an open same-repo PR
+#   - auto-discovery (default): any open same-repo PR is considered — no
+#     `merge-queue` label required (the label is optional metadata now)
 #   - bot updates stale PR heads with current main, then waits for CI
-#   - bot merges only when current main is green and required PR contexts pass
-#   - add `merge-queue-hold` to pause a queued PR without removing it
+#   - bot merges only when current main is green, genuine approvals are present
+#     on the current head, required PR contexts pass, and the PR is mergeable
+#   - add `merge-queue-hold`, `do-not-auto-merge`, or `wip` to keep a PR OUT of
+#     autonomous merging; draft PRs are also skipped

 on:
  # Schedule moved to operator-config:
@@ -48,10 +51,34 @@ jobs:
          WATCH_BRANCH: ${{ github.event.repository.default_branch }}
          QUEUE_LABEL: merge-queue
          HOLD_LABEL: merge-queue-hold
+          # Auto-discovery (opt-OUT). When on (default), the queue considers ALL
+          # open same-repo PRs that meet the merge bar — it does NOT wait for a
+          # human/agent to add `merge-queue`. Agent Gitea tokens lack
+          # write:issue (labels are issue-scoped) and could never self-label,
+          # which stalled the queue; the label is now OPTIONAL metadata. The
+          # merge bar is UNCHANGED — only candidate selection widens. Set
+          # AUTO_DISCOVER=0 to restore legacy opt-IN (require the merge-queue
+          # label to be considered).
+          AUTO_DISCOVER: "1"
+          # Opt-OUT labels: any of these on a PR keeps it OUT of autonomous
+          # merging (the human escape hatch). HOLD_LABEL is always also honoured.
+          # A human who wants a PR held just adds one of these labels.
+          OPT_OUT_LABELS: do-not-auto-merge,wip
          UPDATE_STYLE: merge
-          REQUIRED_CONTEXTS: >-
-            CI / all-required (pull_request),
-            sop-checklist / all-items-acked (pull_request)
+          # Recognised official-reviewer set. A merge needs >= required_approvals
+          # DISTINCT genuine official approvals from these accounts on the
+          # CURRENT head sha (not stale/dismissed). The required_approvals count
+          # itself is read from branch protection at runtime.
+          REVIEWER_SET: agent-reviewer,agent-researcher,agent-reviewer-cr2
+          # NOTE: REQUIRED_CONTEXTS is no longer the authoritative PR gate. The
+          # queue now reads the required status contexts from BRANCH PROTECTION
+          # (status_check_contexts) so non-required governance reds (qa-review,
+          # security-review, sop-tier, sop-checklist when not branch-required,
+          # E2E Chat, Staging SaaS, ci-arm64-advisory) cannot block a merge.
+          # If branch protection cannot be enumerated the queue HOLDS
+          # (fail-closed). REQUIRED_APPROVALS below is only a fallback used when
+          # branch protection does not specify required_approvals.
+          REQUIRED_APPROVALS: "2"
          # Push-side required contexts. Checking CI / all-required (push)
          # explicitly instead of the combined state avoids false-pause when
          # non-blocking jobs (continue-on-error: true) have failed — those
@@ -93,12 +93,12 @@ For "do we have any backend?", use `HasProvisioner()`, never bare `h.provisioner
 3. **Restart divergence on runtime changes.** Docker re-reads `/configs/config.yaml` from the container before stop, so a changed `runtime:` survives a restart even if the DB isn't synced. EC2 trusts the DB only. If you change the runtime via the Config tab and the handler races the restart, Docker will land on the new runtime, EC2 will land on the old one. **Fix path:** make the Config-tab save explicitly flush to DB before kicking off a restart, not deferred.
 4. **Console-output asymmetry.** Users debugging a stuck workspace on Docker see `docker logs`; on EC2 they see `GetConsoleOutput`. The two outputs look nothing alike. **Fix path:** expose a unified `GET /workspaces/:id/boot-log` that proxies to whichever backend serves the data. Already partly there via `cp_provisioner.Console`.
 5. **Template script drift.** `install.sh` and `start.sh` in each template repo do the same high-level work (install hermes-agent, write .env, write config.yaml, start gateway) but must be kept byte-level consistent on the provider-key forwarding block. Easy to forget. Enforced now by `tools/check-template-parity.sh` (see below) — run it in each template repo's CI.
-6. **Both backends panic when underlying client is nil.** Discovered by the contract-test scaffold landing in this PR: `Provisioner.{Stop,IsRunning}` nil-dereferences the Docker client, and `CPProvisioner.{Stop,IsRunning}` nil-dereferences `httpClient`. The real code always sets these, so this is theoretical in prod — but it means the contract runner can't execute scenarios against zero-value backends. **Fix path:** guard each method with `if p.docker == nil { return false, errNoBackend }` (and equivalent for CP), then flip the `t.Skip` in the contract tests to `t.Run`.
+6. **Both backends panic when underlying client is nil.** ✅ **Resolved** (`fix/provisioner-nil-guards-1813`). `Provisioner.{Stop,IsRunning}` and `CPProvisioner.{Stop,IsRunning}` now guard against nil clients with `ErrNoBackend`, so the contract-test runner executes scenarios against zero-valued backends without panic.

 ## Enforcement

 - **`tools/check-template-parity.sh`** (this repo) — ensures `install.sh` and `start.sh` in a template repo forward identical sets of provider keys. Wire into each template repo's CI as `bash $MONOREPO/tools/check-template-parity.sh install.sh start.sh`.
- **Contract tests** (stub) — `workspace-server/internal/provisioner/backend_contract_test.go` defines the behaviors every `provisioner.Provisioner` implementation must satisfy. Fails compile when a method drifts between `Docker` and `CPProvisioner`. Scenario-level runs are `t.Skip`'d today pending drift risk #6 (see above) — compile-time assertions still catch method drift.
+- **Contract tests** — `workspace-server/internal/provisioner/backend_contract_test.go` defines the behaviors every `provisioner.Provisioner` implementation must satisfy. Fails compile when a method drifts between `Docker` and `CPProvisioner`. Scenario-level runs execute against zero-valued backends since drift risk #6 was resolved (`fix/provisioner-nil-guards-1813`).
 - **Source-level dispatcher pins** — `workspace_provision_auto_test.go` enforces the SoT pattern documented above:
  - `TestNoCallSiteCallsDirectProvisionerExceptAuto` — no handler calls `.provisionWorkspace(` or `.provisionWorkspaceCP(` directly outside the dispatcher's allowlist.
  - `TestNoCallSiteCallsBareStop` — no handler calls `.provisioner.Stop(` or `.cpProv.Stop(` directly outside the dispatcher's allowlist (strips Go comments before substring match so archaeology in code comments doesn't trip the gate).
@@ -8,26 +8,39 @@ against the latest `main`.

 ## Queue Contract

-Add the `merge-queue` label to an open PR when it is ready to merge.
+**Auto-discovery (opt-OUT, default).** You do NOT need to label a PR. The bot
+auto-discovers every open same-repo PR and merges any that meets the bar. The
+`merge-queue` label is now optional metadata, not a gate. This removed the
+historical autonomy gap: agent Gitea tokens lack `write:issue` (labels are
+issue-scoped), so agents could never self-label and ready PRs stalled.
+
+To keep a PR OUT of autonomous merging, add an opt-OUT label:
+`merge-queue-hold`, `do-not-auto-merge`, or `wip`. Draft PRs are also skipped.

 The bot processes one PR per tick:

-1. Confirms `main` is green.
-2. Selects the oldest open PR carrying `merge-queue`.
-3. Skips PRs with `merge-queue-hold`.
-4. Rejects fork PRs because the queue may only update same-repo branches.
-5. If the PR head does not contain current `main`, calls Gitea's
+1. Confirms `main`'s branch-protection-required push contexts are green.
+2. Selects the oldest open same-repo PR that is NOT opt-out-labeled and NOT a
+   draft (auto-discovery). With `AUTO_DISCOVER=0` it falls back to legacy
+   opt-IN: only PRs carrying `merge-queue` are considered.
+3. Rejects fork PRs because the queue may only update same-repo branches.
+4. If the PR head does not contain current `main`, calls Gitea's
   `/pulls/{n}/update?style=merge` endpoint and waits for CI on the new head.
-6. Merges only after the current PR head has required contexts green:
-   - `CI / all-required (pull_request)`
-   - `sop-checklist / all-items-acked (pull_request)`
+5. Merges only when, on the PR's CURRENT head sha:
+   - `>= required_approvals` distinct genuine official `APPROVED` reviews from
+     the recognised reviewer set (read from branch protection; default 2),
+   - no open official `REQUEST_CHANGES`,
+   - every branch-protection-required status context is green, and
+   - the PR is `mergeable` (Gitea returns `True`; `None`/`False` = wait).

-The workflow is serialized with `concurrency`, so two queued PRs cannot be
+The merge bar is unchanged by auto-discovery — only WHICH PRs are considered
+changes. The workflow is serialized with `concurrency`, so two PRs cannot be
 merged against the same observed `main`.

 ## Operator Commands

-Queue a PR:
+Queue a PR (optional — auto-discovery already considers every ready PR; the
+label is just visible metadata):

 ```bash
 curl -fsS -X POST \
@@ -37,7 +50,8 @@ curl -fsS -X POST \
  -d '{"labels":["merge-queue"]}'
 ```

-Temporarily hold a queued PR:
+Keep a PR OUT of autonomous merging (opt-OUT — use `merge-queue-hold`,
+`do-not-auto-merge`, or `wip`):

 ```bash
 curl -fsS -X POST \
@@ -56,9 +70,11 @@ REPO=molecule-ai/molecule-core \
 WATCH_BRANCH=main \
 QUEUE_LABEL=merge-queue \
 HOLD_LABEL=merge-queue-hold \
+AUTO_DISCOVER=1 \
+OPT_OUT_LABELS=do-not-auto-merge,wip \
+REVIEWER_SET=agent-reviewer,agent-researcher,agent-reviewer-cr2 \
 UPDATE_STYLE=merge \
-REQUIRED_CONTEXTS='CI / all-required (pull_request),sop-checklist / all-items-acked (pull_request)' \
-python3 .gitea/scripts/gitea-merge-queue.py
+python3 .gitea/scripts/gitea-merge-queue.py --dry-run
 ```

 Dry run:
@@ -1004,6 +1004,12 @@ for wid in "${WS_TO_CHECK[@]}"; do
  else
    DIAG_FAIL=$(echo "$DIAG_JSON" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d.get('first_failure','unknown'))" 2>/dev/null || echo "unknown")
    DIAG_DETAIL=$(echo "$DIAG_JSON" | python3 -c "import json,sys; d=json.load(sys.stdin); s=[x for x in d.get('steps',[]) if not x.get('ok')]; step=s[0] if s else {}; print(' — '.join(x for x in [step.get('error',''), step.get('detail','')] if x))" 2>/dev/null || echo "")
+    # #767: always emit the full diagnose JSON so operators see every step's
+    # Detail field even when the Python extraction above fails or the shape
+    # drifts. The burst is bracketed like steps 2 and 4 for grep-friendly CI.
+    log "── DIAGNOSTIC BURST (step 7b — terminal diagnose for $wid) ──"
+    echo "$DIAG_JSON" | python3 -m json.tool 2>/dev/null || echo "$DIAG_JSON"
+    log "── END DIAGNOSTIC ──"
    fail "Workspace $wid terminal diagnose failed at step '$DIAG_FAIL': $DIAG_DETAIL — check tenant SG has tcp/22 from the configured EIC endpoint SG, MOLECULE_EIC_ENDPOINT_SG_ID is set in Railway, and EIC endpoint health"
  fi
 done
@@ -9,6 +9,7 @@ import (
 	"log"
 	"net/http"
 	"os"
+	"sort"
 	"strings"
 	"time"

@@ -18,6 +19,7 @@ import (
 	dockerclient "github.com/docker/docker/client"
 	"github.com/gin-gonic/gin"

+	"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/providers"
 	"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/provisioner"
 )

@@ -41,10 +43,53 @@ func NewWorkspaceImageService(docker *dockerclient.Client) *WorkspaceImageServic
 	return &WorkspaceImageService{docker: docker}
 }

-// AllRuntimes is the canonical list mirroring docs/workspace-runtime-package.md.
-// Update both when a new template is added.
-var AllRuntimes = []string{
-	"claude-code", "codex", "hermes", "openclaw",
+// AllRuntimes is the canonical set of workspace runtimes this tenant will
+// pull/recreate template images for. It is DERIVED from the same providers
+// manifest SSOT (internal/providers/providers.yaml `runtimes:` block, mirrored
+// from CP's providers.yaml) that the rest of the platform routes against —
+// NOT a second hand-maintained list.
+//
+// Why derive instead of hardcode (controlplane#578): the old hardcoded slice
+// here ({claude-code, codex, hermes, openclaw}) silently DRIFTED from CP, which
+// already accepts `google-adk` for pin-promote/redeploy. A google-adk pin would
+// be accepted CP-side, then this tenant's POST /admin/workspace-images/refresh
+// ?runtime=google-adk rejected it 400 ("unknown runtime"), so google-adk image
+// fixes never deployed. Deriving from the manifest makes the tenant allowlist
+// and the CP allowlist provably the same set — they can't drift again.
+//
+// imageRefreshFallbackRuntimes is used ONLY if the embedded providers manifest
+// fails to load (which would be a build/CI failure caught by the providers
+// package's own tests, never a healthy prod). It preserves the historical
+// behavior — plus google-adk — so a manifest regression can never take the
+// refresh endpoint fully offline. Kept in lockstep with the providers.yaml
+// `runtimes:` keys; the drift guard in admin_workspace_images_test.go asserts
+// the two match.
+var imageRefreshFallbackRuntimes = []string{
+	"claude-code", "codex", "google-adk", "hermes", "openclaw",
+}
+
+// AllRuntimes is computed once at package init from the providers SSOT.
+var AllRuntimes = loadImageRefreshRuntimes()
+
+// loadImageRefreshRuntimes returns the sorted runtime names declared in the
+// providers manifest, falling back to imageRefreshFallbackRuntimes if the
+// manifest can't be loaded.
+func loadImageRefreshRuntimes() []string {
+	m, err := providers.LoadManifest()
+	if err != nil || len(m.Runtimes) == 0 {
+		if err != nil {
+			log.Printf("workspace-images: providers.LoadManifest failed (%v); falling back to static runtime allowlist", err)
+		}
+		out := append([]string(nil), imageRefreshFallbackRuntimes...)
+		sort.Strings(out)
+		return out
+	}
+	out := make([]string, 0, len(m.Runtimes))
+	for rt := range m.Runtimes {
+		out = append(out, rt)
+	}
+	sort.Strings(out)
+	return out
 }

 // RefreshResult is the per-call outcome surfaced to HTTP callers AND logged
@@ -197,7 +242,7 @@ func (s *WorkspaceImageService) Refresh(ctx context.Context, runtimes []string,

 // AdminWorkspaceImagesHandler serves POST /admin/workspace-images/refresh.
 //
-//	?runtime=claude-code   (optional; default = all 8 templates)
+//	?runtime=claude-code   (optional; default = all runtimes in AllRuntimes)
 //	&recreate=true|false   (default true; false = pull only)
 //
 // Returns JSON {pulled: [...], failed: [...], recreated: [...]}
@@ -3,7 +3,14 @@ package handlers
 import (
 	"encoding/base64"
 	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+	"sort"
 	"testing"
+
+	"github.com/gin-gonic/gin"
+
+	"git.moleculesai.app/molecule-ai/molecule-core/workspace-server/internal/providers"
 )

 func TestGHCRAuthHeader_NoEnvReturnsEmpty(t *testing.T) {
@@ -92,6 +99,119 @@ func TestGHCRAuthHeader_RespectsRegistryEnv(t *testing.T) {
 	}
 }

+// runtimeListContains is a tiny membership helper for the runtime-allowlist tests.
+func runtimeListContains(s []string, v string) bool {
+	for _, x := range s {
+		if x == v {
+			return true
+		}
+	}
+	return false
+}
+
+// TestAllRuntimes_IncludesGoogleADK is the direct regression for
+// controlplane#578: a google-adk pin promote/redeploy is accepted CP-side, so
+// the tenant image-refresh allowlist MUST also accept google-adk or the image
+// fix never deploys (tenant returned 400 "unknown runtime"). google-adk lives
+// in the providers SSOT, so the derived AllRuntimes must contain it.
+func TestAllRuntimes_IncludesGoogleADK(t *testing.T) {
+	if !runtimeListContains(AllRuntimes, "google-adk") {
+		t.Fatalf("AllRuntimes must include google-adk (controlplane#578 drift); got %v", AllRuntimes)
+	}
+}
+
+// TestAllRuntimes_MatchesProvidersSSOT is the drift guard. AllRuntimes is
+// derived from providers.LoadManifest().Runtimes — assert it equals exactly the
+// runtime keys the providers manifest (mirrored from CP's providers.yaml)
+// declares. If CP adds/removes a runtime, this test fails RED until the tenant
+// re-derives, so the tenant image-refresh allowlist can never silently drift
+// from the CP pin-promote allowlist again.
+func TestAllRuntimes_MatchesProvidersSSOT(t *testing.T) {
+	m, err := providers.LoadManifest()
+	if err != nil {
+		t.Fatalf("providers.LoadManifest: %v", err)
+	}
+	want := make([]string, 0, len(m.Runtimes))
+	for rt := range m.Runtimes {
+		want = append(want, rt)
+	}
+	sort.Strings(want)
+
+	got := append([]string(nil), AllRuntimes...)
+	sort.Strings(got)
+
+	if len(got) != len(want) {
+		t.Fatalf("AllRuntimes drift: got %v, want %v (providers SSOT)", got, want)
+	}
+	for i := range want {
+		if got[i] != want[i] {
+			t.Fatalf("AllRuntimes drift at %d: got %v, want %v (providers SSOT)", i, got, want)
+		}
+	}
+}
+
+// TestImageRefreshFallbackMatchesSSOT pins the static fallback (used only when
+// the embedded manifest fails to load) to the providers SSOT. If a runtime is
+// added to providers.yaml but not to imageRefreshFallbackRuntimes, this fails
+// RED — so a manifest-load failure can't silently drop a supported runtime.
+func TestImageRefreshFallbackMatchesSSOT(t *testing.T) {
+	m, err := providers.LoadManifest()
+	if err != nil {
+		t.Fatalf("providers.LoadManifest: %v", err)
+	}
+	want := make([]string, 0, len(m.Runtimes))
+	for rt := range m.Runtimes {
+		want = append(want, rt)
+	}
+	sort.Strings(want)
+
+	got := append([]string(nil), imageRefreshFallbackRuntimes...)
+	sort.Strings(got)
+
+	if len(got) != len(want) {
+		t.Fatalf("fallback drift: got %v, want %v (providers SSOT)", got, want)
+	}
+	for i := range want {
+		if got[i] != want[i] {
+			t.Fatalf("fallback drift at %d: got %v, want %v (providers SSOT)", i, got, want)
+		}
+	}
+}
+
+// TestRefresh_RejectsUnknownRuntime asserts a genuinely unknown runtime still
+// 400s (the guard isn't removed) AND that the 400 body lists google-adk in
+// known_runtimes (proving the allowlist now advertises it). This exercises the
+// gin handler's reject branch, which runs entirely before any Docker call.
+func TestRefresh_RejectsUnknownRuntime(t *testing.T) {
+	gin.SetMode(gin.TestMode)
+
+	// nil docker client is safe: the unknown-runtime branch returns 400
+	// before svc.Refresh (which is the only path that touches Docker).
+	h := &AdminWorkspaceImagesHandler{svc: &WorkspaceImageService{}}
+
+	r := gin.New()
+	r.POST("/admin/workspace-images/refresh", h.Refresh)
+
+	req := httptest.NewRequest(http.MethodPost, "/admin/workspace-images/refresh?runtime=not-a-real-runtime", nil)
+	rec := httptest.NewRecorder()
+	r.ServeHTTP(rec, req)
+
+	if rec.Code != http.StatusBadRequest {
+		t.Fatalf("unknown runtime: got status %d, want 400; body=%s", rec.Code, rec.Body.String())
+	}
+
+	var body struct {
+		Error         string   `json:"error"`
+		KnownRuntimes []string `json:"known_runtimes"`
+	}
+	if err := json.Unmarshal(rec.Body.Bytes(), &body); err != nil {
+		t.Fatalf("decode 400 body: %v (raw=%s)", err, rec.Body.String())
+	}
+	if !runtimeListContains(body.KnownRuntimes, "google-adk") {
+		t.Errorf("400 known_runtimes must advertise google-adk (controlplane#578); got %v", body.KnownRuntimes)
+	}
+}
+
 func TestGHCRAuthHeader_TrimsWhitespace(t *testing.T) {
 	t.Setenv("MOLECULE_IMAGE_REGISTRY", "")
 	// .env lines often have trailing newlines or accidental spaces. Without
@@ -73,6 +73,7 @@ func (h *ChannelHandler) List(c *gin.Context) {
 		var config map[string]interface{}
 		if err := json.Unmarshal(configJSON, &config); err != nil {
 			log.Printf("Channels: unmarshal config for channel %s: %v", id, err)
+			config = map[string]interface{}{}
 		}
 		// #319: decrypt sensitive fields first so the mask operates on
 		// plaintext (first-4 / last-4 of the real token, not the ciphertext
@@ -94,6 +95,7 @@ func (h *ChannelHandler) List(c *gin.Context) {
 		var allowed []string
 		if err := json.Unmarshal(allowedJSON, &allowed); err != nil {
 			log.Printf("Channels: unmarshal allowed_users for channel %s: %v", id, err)
+			allowed = []string{}
 		}

 		entry := map[string]interface{}{
@@ -540,9 +542,11 @@ func (h *ChannelHandler) Webhook(c *gin.Context) {
 		}
 		if err := json.Unmarshal(configJSON, &row.Config); err != nil {
 			log.Printf("Channels: unmarshal config for webhook row %s: %v", row.ID, err)
+			row.Config = map[string]interface{}{}
 		}
 		if err := json.Unmarshal(allowedJSON, &row.AllowedUsers); err != nil {
 			log.Printf("Channels: unmarshal allowed_users for webhook row %s: %v", row.ID, err)
+			row.AllowedUsers = []string{}
 		}
 		if err := channels.DecryptSensitiveFields(row.Config); err != nil {
 			log.Printf("Channels: decrypt webhook row %s: %v", row.ID, err)
@@ -116,6 +116,56 @@ func TestChannelHandler_List(t *testing.T) {
 	}
 }

+func TestChannelHandler_List_InvalidJSON_FallsBack(t *testing.T) {
+	mock := setupTestDB(t)
+	handler := NewChannelHandler(newTestChannelManager())
+
+	rows := sqlmock.NewRows([]string{
+		"id", "workspace_id", "channel_type", "channel_config", "enabled",
+		"allowed_users", "last_message_at", "message_count", "created_at", "updated_at",
+	}).AddRow(
+		"ch-bad", "ws-1", "telegram",
+		[]byte(`{not valid json`),
+		true, []byte(`[also not json`), nil, 0, nil, nil,
+	)
+	mock.ExpectQuery("SELECT .* FROM workspace_channels WHERE workspace_id").
+		WithArgs("ws-1").
+		WillReturnRows(rows)
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Request, _ = http.NewRequest("GET", "/workspaces/ws-1/channels", nil)
+	c.Params = gin.Params{{Key: "id", Value: "ws-1"}}
+
+	handler.List(c)
+
+	if w.Code != 200 {
+		t.Errorf("expected 200, got %d", w.Code)
+	}
+
+	var result []map[string]interface{}
+	json.Unmarshal(w.Body.Bytes(), &result)
+	if len(result) != 1 {
+		t.Fatalf("expected 1 channel, got %d", len(result))
+	}
+
+	config, ok := result[0]["config"].(map[string]interface{})
+	if !ok {
+		t.Fatalf("expected config to be a map, got %T", result[0]["config"])
+	}
+	if len(config) != 0 {
+		t.Errorf("expected empty config after unmarshal fallback, got %v", config)
+	}
+
+	allowed, ok := result[0]["allowed_users"].([]interface{})
+	if !ok {
+		t.Fatalf("expected allowed_users to be a slice, got %T", result[0]["allowed_users"])
+	}
+	if len(allowed) != 0 {
+		t.Errorf("expected empty allowed_users after unmarshal fallback, got %v", allowed)
+	}
+}
+
 // ==================== Create ====================

 func TestChannelHandler_Create_Success(t *testing.T) {
@@ -546,6 +596,41 @@ func TestChannelHandler_Webhook_UnknownType(t *testing.T) {
 	}
 }

+// TestChannelHandler_Webhook_InvalidJSON_FallsBack verifies that when the DB
+// row contains invalid JSON for channel_config or allowed_users, the webhook
+// handler logs the error and falls back to an empty map/slice rather than
+// leaving the fields nil (which would panic on downstream code that expects
+// concrete values). With empty config there is no chat_id match, so the
+// handler returns {"status":"no_channel"}.
+func TestChannelHandler_Webhook_InvalidJSON_FallsBack(t *testing.T) {
+	mock := setupTestDB(t)
+	handler := NewChannelHandler(newTestChannelManager())
+
+	mock.ExpectQuery(`SELECT id, workspace_id, channel_type, channel_config, enabled, allowed_users FROM workspace_channels WHERE channel_type = .* AND enabled = true`).
+		WithArgs("telegram").
+		WillReturnRows(sqlmock.NewRows([]string{
+			"id", "workspace_id", "channel_type", "channel_config", "enabled", "allowed_users",
+		}).AddRow("ch-bad", "ws-1", "telegram", []byte(`{bad json`), true, []byte(`[bad json`)))
+
+	body := `{"update_id":1,"message":{"message_id":1,"from":{"id":111,"is_bot":false,"first_name":"Test","username":"testuser"},"chat":{"id":-100123,"title":"Test Group","type":"supergroup"},"date":1700000000,"text":"hello"}}`
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Request = httptest.NewRequest(http.MethodPost, "/webhooks/telegram", strings.NewReader(body))
+	c.Request.Header.Set("Content-Type", "application/json")
+	c.Params = gin.Params{{Key: "type", Value: "telegram"}}
+
+	handler.Webhook(c)
+
+	if w.Code != 200 {
+		t.Errorf("expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	var resp map[string]interface{}
+	json.Unmarshal(w.Body.Bytes(), &resp)
+	if resp["status"] != "no_channel" {
+		t.Errorf("expected status 'no_channel', got %v", resp["status"])
+	}
+}
+
 // ==================== Discover ====================

 func TestChannelHandler_Discover_MissingToken(t *testing.T) {
@@ -161,7 +161,7 @@ func (h *PluginsHandler) uninstallViaDocker(ctx context.Context, c *gin.Context,
 	// 1. Strip plugin's rule/fragment markers from CLAUDE.md (mirrors
 	//    AgentskillsAdaptor.uninstall lines 184-188). Best-effort: if
 	//    the user edited CLAUDE.md, our marker stays untouched.
-	h.stripPluginMarkersFromMemory(ctx, containerName, pluginName)
+	h.stripPluginMarkersFromMemory(ctx, workspaceID, containerName, pluginName)

 	// 2. Remove copied skill dirs declared in the plugin's plugin.yaml.
 	for _, skill := range skillNames {
@@ -171,9 +171,11 @@ func (h *PluginsHandler) uninstallViaDocker(ctx context.Context, c *gin.Context,
 			log.Printf("Plugin uninstall: skipping invalid skill name %q in %s: %v", skill, pluginName, err)
 			continue
 		}
-		_, _ = h.execAsRoot(ctx, containerName, []string{
+		if _, rmErr := h.execAsRoot(ctx, containerName, []string{
 			"rm", "-rf", "/configs/skills/" + skill,
-		})
+		}); rmErr != nil {
+			log.Printf("Plugin uninstall: failed to remove skill %s from %s: %v", skill, workspaceID, rmErr)
+		}
 	}

 	// 3. Delete the plugin directory itself (as root to handle file ownership).
@@ -393,7 +393,7 @@ func (h *PluginsHandler) readPluginSkillsFromContainer(ctx context.Context, cont
 // `# Plugin: <name> /` — mirrors AgentskillsAdaptor.uninstall's stripping
 // logic so install/uninstall are symmetric. Best-effort: silent on read or
 // write failure, since the rest of uninstall must still succeed.
-func (h *PluginsHandler) stripPluginMarkersFromMemory(ctx context.Context, containerName, pluginName string) {
+func (h *PluginsHandler) stripPluginMarkersFromMemory(ctx context.Context, workspaceID, containerName, pluginName string) {
 	// Use sed via bash -c for atomic in-place delete: drop the marker line
 	// and the blank line that follows it (install adds a leading blank line
 	// before the marker via append_to_memory). Three sed passes mirror the
@@ -417,7 +417,9 @@ func (h *PluginsHandler) stripPluginMarkersFromMemory(ctx context.Context, conta
 		`awk 'BEGIN{skip=0; blanks=0} /^%s/{skip=1; blanks=0; next} skip==1 && /^[[:space:]]*$/{blanks++; if(blanks>=2){skip=0; print; next} next} /^# Plugin: /{if(skip==1)skip=0} skip==1{next} {print}' /configs/CLAUDE.md > /tmp/claude.new && mv /tmp/claude.new /configs/CLAUDE.md`,
 		regexpEscapeForAwk(marker),
 	)
-	_, _ = h.execAsRoot(ctx, containerName, []string{"bash", "-c", script})
+	if _, awkErr := h.execAsRoot(ctx, containerName, []string{"bash", "-c", script}); awkErr != nil {
+		log.Printf("Plugin uninstall: failed to strip markers from CLAUDE.md for %s in %s: %v", pluginName, workspaceID, awkErr)
+	}
 }

 // regexpEscapeForAwk escapes characters that have special meaning inside an
@@ -332,6 +332,7 @@ func (h *WorkspaceHandler) buildProvisionerConfig(
 		InstanceType:    payload.Compute.InstanceType,
 		DiskGB:          int32(payload.Compute.Volume.RootGB),
 		DataPersistence: payload.Compute.DataPersistence,
+		Provider:        payload.Compute.Provider,
 		Display: provisioner.WorkspaceDisplayConfig{
 			Mode:     payload.Compute.Display.Mode,
 			Width:    payload.Compute.Display.Width,
@@ -174,6 +174,11 @@ type WorkspaceCompute struct {
 	// disk (wiped each recreate — privacy); "" = auto (desktop-control persists,
 	// others follow the org flag). Forwarded verbatim to CP's data_persistence.
 	DataPersistence string `json:"data_persistence,omitempty"`
+	// Provider is the CLOUD/compute backend for this workspace box (multi-provider
+	// RFC, per-workspace): ""/"aws" = default EC2; "hetzner"/"gcp" route to the
+	// CP WorkspaceProvisioner. Distinct from the LLM/model provider. Forwarded to
+	// CP /cp/workspaces/provision `provider`.
+	Provider string `json:"provider,omitempty"`
 }

 type CreateWorkspacePayload struct {
@@ -161,6 +161,9 @@ type cpProvisionRequest struct {
 	Tier         int    `json:"tier"`
 	InstanceType string `json:"instance_type,omitempty"`
 	DiskGB       int32  `json:"disk_gb,omitempty"`
+	// Provider routes the CP to the compute backend for this workspace box
+	// (multi-provider RFC, per-workspace). Distinct from the LLM/model provider.
+	Provider string `json:"provider,omitempty"`
 	// DataPersistence is the per-workspace durable-data choice (internal#734);
 	// CP validates the enum at its provision edge and resolves the data volume
 	// from it. Empty = auto (omitted on the wire).
@@ -257,6 +260,7 @@ func (p *CPProvisioner) Start(ctx context.Context, cfg WorkspaceConfig) (string,
 		InstanceType:    cfg.InstanceType,
 		DiskGB:          cfg.DiskGB,
 		DataPersistence: cfg.DataPersistence,
+		Provider:        cfg.Provider,
 		Display:         cfg.Display,
 		PlatformURL:     cfg.PlatformURL,
 		Env:             env,
@@ -100,6 +100,7 @@ type WorkspaceConfig struct {
 	InstanceType    string // Optional CP EC2 instance type override (SaaS only)
 	DiskGB          int32  // Optional CP root volume size override in GiB (SaaS only)
 	DataPersistence string // internal#734: "persist"|"ephemeral"|"" — durable-data choice forwarded to CP (SaaS only)
+	Provider        string // multi-provider RFC: ""/"aws"|"hetzner"|"gcp" compute backend for the workspace box (per-workspace; distinct from LLM/model provider). Forwarded to CP.
 	Display         WorkspaceDisplayConfig
 	EnvVars         map[string]string // Additional env vars (API keys, etc.)
 	PlatformURL     string