feat(plugins): atomic install — stage→snapshot→swap→marker (docker path)

Closes molecule-core#114 for the docker (local-OSS) path. The EIC (SaaS)
path is tracked as a follow-up: same shape, different exec primitives
(ssh vs docker exec); shipping both in one PR doubles the test surface.

THE FOUR-STEP DANCE

1. STAGE: docker.CopyToContainer extracts tar into
   /configs/plugins/.staging/<name>.<ts>/
2. SNAPSHOT: if /configs/plugins/<name>/ exists, mv to
   /configs/plugins/.previous/<name>.<ts>/
3. SWAP: atomic mv staging → live (single rename(2))
4. MARKER: touch /configs/plugins/<name>/.complete

Workspace-side plugin loaders should refuse to load any plugin dir
without .complete (separate small change, not in this PR; the marker
write is the necessary precursor, and the consumer side is a follow-up so
existing-content plugins don't break before they're re-installed).

ROLLBACK

- Stage failure: rm -rf staging dir; live untouched
- Snapshot failure: rm -rf staging dir; live untouched (no rename
  happened)
- Swap failure with snapshot present: mv previous back to live
- Swap failure (no snapshot): rm -rf staging; live (which never existed)
  stays absent
- Marker failure: content already in place, log loudly with manual
  recovery hint (touch <plugin>/.complete); don't roll back, since the
  new content is what we wanted, just unmarked

GC

Best-effort delete of previous-version snapshot after successful marker
write. Failures are non-fatal; the next install or a separate sweeper
reclaims. Sweeper for stale .previous/* across reboots is follow-up scope.

CONCURRENCY

Each install gets a unique stamp (UTC second precision), so two
concurrent reinstalls land in distinct staging dirs and the second swap
simply overwrites the first's live result. The atomicity is per-install,
not cross-install, by design (the platform serializes
POST /workspaces/:id/plugins via a Go-side semaphore upstream of this
code, so cross-install collisions don't reach here).
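
ILLUSTRATIVE SKETCH (not part of this PR)

The four steps and the rollback rules above can be sketched on a plain
local filesystem. This is a minimal sketch under stated assumptions, not
the Go implementation: the function name atomic_install, its argument
order, and the use of cp in place of docker.CopyToContainer are all
hypothetical, and returning success after a marker failure is an
assumption the commit message does not confirm.

```shell
# Hypothetical local-filesystem sketch of stage -> snapshot -> swap -> marker.
# The real code runs equivalent steps inside a container; names are illustrative.
atomic_install() {
  # $1 = source dir, $2 = plugins root, $3 = plugin name, $4 = stamp
  local src="$1" root="$2" name="$3" stamp="$4"
  local staging="$root/.staging/$name.$stamp"
  local previous="$root/.previous/$name.$stamp"
  local live="$root/$name"
  local had_snapshot=0

  mkdir -p "$root/.staging" "$root/.previous" || return 1

  # 1. STAGE: copy new content into a private staging dir (cp stands in
  #    for the tar extraction). On failure, live is untouched.
  if ! cp -R "$src" "$staging"; then
    rm -rf "$staging"
    return 1
  fi

  # 2. SNAPSHOT: move any existing live dir aside. On failure, drop staging.
  if [ -d "$live" ]; then
    if ! mv "$live" "$previous"; then
      rm -rf "$staging"
      return 1
    fi
    had_snapshot=1
  fi

  # 3. SWAP: a single rename of staging into the live path.
  if ! mv "$staging" "$live"; then
    # Roll back: restore the snapshot if one was taken, then drop staging.
    if [ "$had_snapshot" -eq 1 ]; then
      mv "$previous" "$live"
    fi
    rm -rf "$staging"
    return 1
  fi

  # 4. MARKER: content is already in place, so log instead of rolling back.
  if ! touch "$live/.complete"; then
    echo "marker write failed; recover with: touch $live/.complete" >&2
    return 0  # assumption: treated as landed-but-unmarked, not as an error
  fi

  # GC: best-effort removal of the previous-version snapshot.
  rm -rf "$previous" || true
  return 0
}
```

Calling atomic_install twice for the same plugin name with different
stamps leaves only the second version live, with .complete present and no
leftover staging or snapshot directory for that stamp.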

CHANGES

+ plugins_atomic.go: installVersion + atomicCopyToContainer
+ plugins_atomic_tar.go: tarWalk/tarHostDirWithPrefix helpers
+ plugins_atomic_test.go: 5 unit tests (paths, stamp shape, tar happy
  path, symlink-skip, prefix normalization). All green.
~ plugins_install_pipeline.go::deliverToContainer: swap the
  copyPluginToContainer call to atomicCopyToContainer

Old copyPluginToContainer is retained (still called by Download()), so
this PR is purely additive on the install path; no public API change.

PHASE 4 SELF-REVIEW (FIVE-AXIS)

Correctness: Required (addressed). Swap-failure rollback moves previous
back to live before returning the error; if rollback itself fails, we
wrap both errors and surface the combined fault. Marker-write failure is
treated as content-landed-but-unmarked (LOG, don't roll back the new
content).

Readability: No finding. installVersion path methods make the
/staging/.previous/live/marker layout obvious from one struct. tarWalk
extracted from the inline filepath.Walk in plugins_install_pipeline.go
for testability.

Architecture: No finding. atomicCopyToContainer composes existing
execAsRoot / docker.CopyToContainer primitives; no new dependencies. Old
copyPluginToContainer kept for Download(); single responsibility per
function.

Security: No finding. Symlinks are still skipped during tar walk (defense
vs a hostile plugin escaping its own dir). Marker writes use composable
path.Join; no user input touches the path.

Performance: No finding. Adds ~4 docker exec calls per install (mkdir,
mv-snapshot, mv-swap, touch) on top of the one CopyToContainer. Each exec
takes ~50-100ms in practice; install end-to-end was already
seconds-scale, so this rounds to noise.

REFS

molecule-core#114 (this issue)
Companion: molecule-core#112 (hot-reload classifier; depends on .complete marker)
Companion: molecule-core#113 (version subscription; uses install machinery)
EIC follow-up: separate issue to be filed for SaaS path parity

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request 'feat(org-import): inject per-role persona env from operator-host bootstrap dir' (#110) from feat/persona-env-injection into main
117 changed files with 11492 additions and 1123 deletions
@@ -154,30 +154,71 @@ jobs:
            exit 0
          fi

-          # Upstream is publish-workspace-server-image. Check E2E state.
-          # The jq filter must defend against TWO empty cases that gh
-          # CLI emits indistinguishably:
-          #   1. gh exits non-zero (network blip, auth issue) → handled
-          #      by the `|| echo "none/none"` fallback below.
-          #   2. gh exits zero but returns `[]` (no E2E run on this
-          #      main SHA — the common case for canvas-only / cmd-only
-          #      / sweep-only changes whose paths don't trigger E2E).
-          #      Without `(.[0] // {})`, jq sees `null` and emits
-          #      "null/none" — which the case statement below has no
-          #      branch for, so it falls into *) → exit 1.
-          # Surfaced 2026-04-30 the first time the App-token chain
-          # (#2389) actually fired auto-promote-on-e2e from a publish
-          # upstream — every prior run was E2E-upstream which
-          # short-circuits before this gate.
-          RESULT=$(gh run list \
-            --repo "$REPO" \
-            --workflow e2e-staging-saas.yml \
-            --branch main \
-            --commit "$SHA" \
-            --limit 1 \
-            --json status,conclusion \
-            --jq '(.[0] // {}) | "\(.status // "none")/\(.conclusion // "none")"' \
-            2>/dev/null || echo "none/none")
+          # Upstream is publish-workspace-server-image. Check E2E state
+          # for the same SHA via Gitea's commit-status API.
+          #
+          # GitHub-era this was `gh run list --workflow=X --commit=SHA
+          # --json status,conclusion` returning either `[]` (no run on
+          # this SHA) or `[{status, conclusion}]` (the run's state).
+          # Gitea has NO workflow-runs API at all — `/api/v1/repos/.../
+          # actions/runs` returns 404 (verified 2026-05-07, issue #75).
+          # However Gitea Actions DOES emit a commit status per workflow
+          # job, with `context = "<Workflow Name> / <Job Name> (<event>)"`,
+          # which is exactly what we need: each E2E run leg becomes one
+          # status row on the SHA, and the aggregate state encodes the
+          # run's outcome.
+          #
+          # Mapping:
+          #   0 matched contexts          → "none/none"      (E2E paths-
+          #                                                    filtered
+          #                                                    out — same
+          #                                                    semantic
+          #                                                    as before)
+          #   any context = pending       → "in_progress/none" (defer)
+          #   any context = error|failure → "completed/failure" (abort)
+          #   all contexts = success      → "completed/success" (proceed)
+          #
+          # The "completed/cancelled" and "completed/timed_out" buckets
+          # don't have direct Gitea analogs (Gitea statuses are
+          # success / failure / error / pending / warning). Per-SHA
+          # concurrency cancellation surfaces as `error` on Gitea, which
+          # we map to "completed/failure" rather than "completed/cancelled"
+          # — losing the soft-defer semantic of the cancelled bucket on
+          # this fleet. Tradeoff: the staleness alarm (auto-promote-stale-
+          # alarm.yml) still catches a stuck :latest within 4h, and a
+          # legitimate cancel is rare enough that aborting + manual
+          # re-dispatch is acceptable. If we measure cancel frequency
+          # > 1/week, revisit by reading the run-step-summary text via
+          # a follow-up script.
+          #
+          # Network or auth blips collapse to "none/none": the curl
+          # `|| echo "[]"` fallback yields an empty list, which matches
+          # zero contexts, mirroring the pre-Gitea behaviour where an
+          # empty list also degenerated to none/none.
+          GITEA_API_URL="${GITHUB_SERVER_URL:-https://git.moleculesai.app}/api/v1"
+          STATUSES_JSON=$(curl --fail-with-body -sS \
+            -H "Authorization: token ${GH_TOKEN}" \
+            -H "Accept: application/json" \
+            "${GITEA_API_URL}/repos/${REPO}/commits/${SHA}/statuses?limit=100" \
+            2>/dev/null || echo "[]")
+          RESULT=$(printf '%s' "$STATUSES_JSON" | jq -r '
+            # Filter to E2E Staging SaaS (full lifecycle) statuses.
+            # Match by leading workflow-name prefix so the "<job>
+            # (<event>)" tail is irrelevant. Gitea emits the workflow
+            # name verbatim from the YAML `name:` field.
+            [.[] | select(.context | startswith("E2E Staging SaaS (full lifecycle) /"))] as $rows
+            | if ($rows | length) == 0 then
+                "none/none"
+              elif any($rows[]; .status == "pending") then
+                "in_progress/none"
+              elif any($rows[]; .status == "failure" or .status == "error") then
+                "completed/failure"
+              elif all($rows[]; .status == "success") then
+                "completed/success"
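+              else
+                # Mixed / unknown (e.g. success + warning): report the
+                # first non-success status so this branch can never read
+                # as "completed/success"; falls through to *) below.
+                "completed/" + ((first($rows[] | select(.status != "success")) | .status) // "unknown")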
+              end
+          ' 2>/dev/null || echo "none/none")

          echo "E2E Staging SaaS for ${SHA:0:7}: $RESULT"
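
The mapping table in the comment above can be restated as a small
standalone function. This is a minimal sketch, assuming the caller has
already filtered the statuses down to the E2E contexts and passes each
status string as one argument; the name map_e2e_statuses is hypothetical.
In the mixed case it reports the first non-success status, a choice of
this sketch so that a stray `warning` can never be read as success.

```shell
# Hypothetical helper: map already-filtered Gitea commit-status values to
# the legacy "status/conclusion" pair used by the case statement.
map_e2e_statuses() {
  local s
  # Zero matched contexts: E2E was paths-filtered out.
  if [ "$#" -eq 0 ]; then
    echo "none/none"
    return 0
  fi
  # Any pending context wins first (defer).
  for s in "$@"; do
    if [ "$s" = "pending" ]; then
      echo "in_progress/none"
      return 0
    fi
  done
  # Then any failure or error (abort).
  for s in "$@"; do
    case "$s" in
      failure|error)
        echo "completed/failure"
        return 0
        ;;
    esac
  done
  # Then look for anything that is not success (mixed / unknown).
  for s in "$@"; do
    if [ "$s" != "success" ]; then
      echo "completed/$s"
      return 0
    fi
  done
  # Every context is success (proceed).
  echo "completed/success"
}
```

For example, `map_e2e_statuses success pending failure` prints
`in_progress/none`, because the pending check runs before the failure
check, matching the order of the branches in the jq program.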

@@ -199,16 +240,13 @@ jobs:
              exit 1
              ;;
            completed/cancelled)
-              # cancelled ≠ failure. Per-SHA concurrency cancels older E2E
-              # runs when a newer push lands (memory:
-              # feedback_concurrency_group_per_sha) — the newer SHA will
-              # have its own E2E + promote chain. Treat the same as
-              # in_progress: defer without aborting, let the next E2E run
-              # promote when it lands.
-              #
-              # Caught 2026-05-05 02:03 on sha 31f9a5e — auto-promote
-              # blocked the whole chain because this case fell through to
-              # exit 1 instead of clean defer.
+              # GitHub-era only: cancelled ≠ failure. Gitea statuses
+              # don't expose a "cancelled" state — a per-SHA concurrency
+              # cancellation surfaces as `failure` or `error` on Gitea
+              # and is now handled by the failure branch above. This
+              # arm is kept for backwards compatibility / dual-host
+              # operation (if we ever add a non-Gitea fallback) but
+              # under the post-#75 flow it's unreachable.
              echo "proceed=false" >> "$GITHUB_OUTPUT"
              {
                echo "## ⏭ Auto-promote deferred — E2E Staging SaaS was cancelled"
@@ -2,61 +2,148 @@ name: Auto-promote staging → main

 # Fires after any of the staging-branch quality gates complete. When ALL
 # required gates are green on the same staging SHA, opens (or re-uses)
-# a PR `staging → main` and enables auto-merge so the merge queue lands
-# it. Closes the gap that historically let features sit on staging for
-# weeks waiting for a bulk promotion PR (see molecule-core#1496 for the
-# 1172-commit example).
+# a PR `staging → main` and schedules Gitea auto-merge so the PR lands
+# automatically once approval + status checks are satisfied.
 #
-# 2026-04-28 rewrite (PR #142): the previous version did a direct
-# `git merge --ff-only origin staging && git push origin main`. That
-# breaks against main's branch-protection ruleset, which requires
-# status checks "set by the expected GitHub apps" — direct pushes
-# can't satisfy that condition (only PR merges through the queue can).
-# The workflow was failing every tick with:
-#   remote: error: GH006: Protected branch update failed for refs/heads/main.
-#   remote: - Required status checks ... were not set by the expected GitHub apps.
-# Fix: mirror the PR-based pattern from auto-sync-main-to-staging.yml
-# (the reverse-direction sync, fixed in #2234 for the same reason).
-# Both directions now use the same merge-queue path that humans use,
-# no special-case bypass.
+# ============================================================
+# What this workflow does
+# ============================================================
 #
-# Safety model:
-# - Runs ONLY on workflow_run events for the staging branch.
-# - Requires EVERY named gate workflow to have the same head_sha and
-#   all be `conclusion == success`. If any of them is red, skipped,
-#   cancelled, or pending, we abort (stay on the current main).
-# - The PR base=main head=staging path lets GitHub itself enforce
-#   branch protection. If main has diverged from staging or required
-#   checks aren't satisfied, the merge queue declines the PR — no
-#   need for a manual ff-only ancestry check here.
-# - Loop safety: the auto-sync-main-to-staging workflow fires when
-#   main lands the auto-promote PR, but its merge into staging is by
-#   GITHUB_TOKEN which doesn't trigger downstream workflow_run events
-#   (GitHub Actions safety). So this workflow doesn't re-fire from
-#   its own promote landing.
+# 1. On a workflow_run completion event for one of the staging gate
+#    workflows (CI, E2E Staging Canvas, E2E API Smoke, CodeQL),
+#    checks if the combined status on the staging head SHA is green.
+# 2. If green, opens (or re-uses) a PR `head: staging → base: main`
+#    via Gitea REST `POST /api/v1/repos/.../pulls`.
+# 3. Schedules auto-merge via `POST /api/v1/repos/.../pulls/{index}/merge`
+#    with `merge_when_checks_succeed: true`. Gitea waits for the
+#    approval requirement on `main` (`required_approvals: 1`) and
+#    the status-check gates, then merges.
+# 4. The merge commit lands on `main` and fires
+#    `publish-workspace-server-image.yml` naturally via its
+#    `on: push: branches: [main]` trigger — no explicit dispatch
+#    needed (see "Why no workflow_dispatch tail" below).
 #
-# Toggle via repo variable AUTO_PROMOTE_ENABLED (true/unset). When
-# unset, the workflow logs what it would have done but doesn't open
-# the PR — useful for dry-running the gate logic without surfacing
-# a noisy PR while staging CI is still flaky.
+# `auto-sync-main-to-staging.yml` is the reverse-direction
+# counterpart (main → staging, fast-forward push). Together they
+# keep the staging-superset-of-main invariant tight.
 #
-# **One-time repo setting (load-bearing):** this workflow opens the
-# staging→main PR via `gh pr create` using the default GITHUB_TOKEN.
-# Since GitHub's 2022 default change, that token cannot create or
-# approve PRs unless the repo opts in. The toggle is at:
+# ============================================================
+# Why Gitea REST (and not `gh pr create`)
+# ============================================================
 #
-#   Settings → Actions → General → Workflow permissions
-#   → ✅ Allow GitHub Actions to create and approve pull requests
+# Pre-2026-05-06 this workflow used `gh pr create`, `gh pr merge --auto`,
+# `gh run list`, and `gh workflow run` against GitHub. After the
+# GitHub→Gitea cutover those calls fail because:
 #
-# Without it, every workflow_run fails with:
+#   - `gh pr create / merge / view / list` route to GitHub GraphQL
+#     (`/api/graphql`). Gitea does not expose a GraphQL endpoint;
+#     every call returns `HTTP 405 Method Not Allowed` — same root
+#     cause as #65 (auto-sync) which PR #66 fixed by dropping `gh`
+#     entirely.
+#   - `gh run list --workflow=...` is GitHub-shaped; Gitea instead
+#     has the simpler `GET /repos/.../commits/{ref}/status`
+#     combined-status endpoint.
+#   - `gh workflow run X.yml` calls `POST /repos/.../actions/workflows/{id}/dispatches`,
+#     which does NOT exist on Gitea 1.22.6 (verified via swagger.v1.json).
 #
-#   pull request create failed: GraphQL: GitHub Actions is not
-#   permitted to create or approve pull requests (createPullRequest)
+# So this workflow uses direct `curl` calls to Gitea REST. No `gh`
+# CLI dependency, no GraphQL, no missing-endpoint footgun.
 #
-# Observed 2026-04-29 01:43 UTC blocking promotion of fcd87b9 (PRs
-# #2248 + #2249); manually bridged via PR #2252. Re-check this
-# setting if auto-promote starts failing with createPullRequest
-# errors after a repo or org admin change.
+# ============================================================
+# Why no workflow_dispatch tail (was load-bearing on GitHub, dead on Gitea)
+# ============================================================
+#
+# The GitHub-era version had a 60-line polling step that waited for
+# the promote PR to merge, then explicitly dispatched
+# `publish-workspace-server-image.yml` on `--ref main`. That step
+# existed because GitHub's GITHUB_TOKEN-initiated merges suppress
+# downstream `on: push` workflows (the documented "no recursion" rule
+# — https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
+# The explicit dispatch was the workaround.
+#
+# Gitea Actions does NOT have this no-recursion rule. PR #66's auto-
+# sync merge to main fired `auto-promote-staging` on the next push
+# trigger naturally. So the cascade fires on the natural push event;
+# the explicit dispatch is dead code. (And even if we wanted to
+# preserve it, Gitea has no `workflow_dispatch` REST endpoint.)
+#
+# Removed in this rewrite. If we ever observe the cascade misfire,
+# an operator can push an empty commit to `main` to wake it.
+#
+# ============================================================
+# Why open a PR (and not direct push)
+# ============================================================
+#
+# `main` branch protection has `enable_push: false` with NO
+# `push_whitelist_usernames`. Direct push is impossible for any
+# persona, including admins. PR-mediated merge is the only path,
+# which is intentional: prod state mutations (and staging→main IS a
+# prod mutation, since the next deploy fans out to tenants) require
+# Hongming's approval per `feedback_prod_apply_needs_hongming_chat_go`.
+#
+# The auto-merge schedule preserves this gate: `merge_when_checks_succeed`
+# does NOT bypass `required_approvals: 1`. Gitea waits for BOTH
+# approval AND green checks before merging. Hongming reviews via the
+# canvas/chat-handle of the PR notification, approves, and Gitea
+# auto-merges within seconds.
+#
+# ============================================================
+# Identity + token (anti-bot-ring per saved-memory
+# `feedback_per_agent_gitea_identity_default`)
+# ============================================================
+#
+# This workflow uses `secrets.AUTO_SYNC_TOKEN` — a personal access
+# token issued to the `devops-engineer` Gitea persona. NOT the
+# founder PAT. The bot-ring fingerprint that triggered the GitHub
+# org suspension on 2026-05-06 was characterised by founder PAT
+# acting as CI at machine speed.
+#
+# Token scope: `push: true` (read+write) on this repo. The persona
+# can: open PRs, comment on PRs, schedule auto-merge. The persona
+# CANNOT bypass main's branch protection (`required_approvals: 1`
+# still applies — only Hongming's review unblocks merge).
+#
+# Authorship: the PR is opened by `devops-engineer`; the merge
+# commit credits Hongming-as-approver and `devops-engineer` as
+# the merger.
+#
+# ============================================================
+# Failure modes & operational notes
+# ============================================================
+#
+# A — staging gates not all green at trigger time:
+#     - The combined-status check returns `state: pending|failure`.
+#       Workflow exits 0 with a step-summary "not all green; staying
+#       on current main". Re-fires on the next gate completion.
+#
+# B — Gitea PR-create returns non-201 (e.g. 422 already-exists):
+#     - Idempotent: the workflow first GETs the existing open
+#       staging→main PR. If found, reuse it; if not, POST a new one.
+#       422 should never surface; if it does (race), step summary
+#       captures the body and the next workflow_run picks up.
+#
+# C — `merge_when_checks_succeed` schedule fails:
+#     - 422 with "Pull request is not mergeable" if there are
+#       conflicts or stale base. Step summary surfaces it; operator
+#       (or `auto-sync-main-to-staging`) needs to bring staging up
+#       to date with main first. Workflow exits 1 to surface red.
+#
+# D — `AUTO_SYNC_TOKEN` rotated / wrong scope:
+#     - 401/403 on first REST call. Step summary surfaces it.
+#       Re-issue the token from `~/.molecule-ai/personas/` on the
+#       operator host and update the repo Actions secret.
+#
+# ============================================================
+# Loop safety
+# ============================================================
+#
+# When the promote PR merges to main, `auto-sync-main-to-staging.yml`
+# fires (on:push:main) and pushes the merge commit back to staging.
+# That push to staging is by `devops-engineer`, NOT this workflow's
+# token, and triggers the staging gate workflows. When they all
+# complete, we end up back here — but the tree-diff guard catches
+# it: staging tree == main tree (the merge commit changes nothing),
+# so we skip and the cycle terminates.

 on:
  workflow_run:
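
The two Gitea REST calls described in the header comment (open the PR,
then schedule the merge) take small JSON bodies. The sketch below only
builds those bodies; it is illustrative, not the workflow's actual step.
The function names are hypothetical, the PR title is a placeholder, and
the inputs are assumed to contain no quotes or backslashes because no
JSON escaping is done.

```shell
# Hypothetical payload builders for the Gitea REST calls named above.
# Assumption: arguments contain no quotes or backslashes (no escaping here).
pr_create_payload() {
  # $1 = head branch, $2 = base branch, $3 = PR title
  printf '{"head":"%s","base":"%s","title":"%s"}\n' "$1" "$2" "$3"
}

merge_schedule_payload() {
  # "Do":"merge" asks for a merge commit; merge_when_checks_succeed defers
  # the merge until the approval and status-check requirements are met.
  printf '{"Do":"merge","merge_when_checks_succeed":true}\n'
}

# Example use (not executed here; host, repo, and PR index are placeholders):
#   curl -sS -X POST \
#     -H "Authorization: token ${GITEA_TOKEN}" \
#     -H "Content-Type: application/json" \
#     -d "$(merge_schedule_payload)" \
#     "${GITEA_HOST}/api/v1/repos/${REPO}/pulls/${PR_INDEX}/merge"
```

Keeping the bodies in functions makes the request shape easy to check
without a network call.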
@@ -74,26 +161,16 @@ on:
        default: "false"

 permissions:
-  contents: write
+  contents: read
  pull-requests: write
-  # actions: write is needed by the post-merge dispatch tail step
-  # (#2358 / #2357) — `gh workflow run publish-workspace-server-image.yml`
-  # POSTs to /actions/workflows/.../dispatches which requires this scope.
-  # Without it the call 403s and the publish/canary/redeploy chain still
-  # doesn't run on staging→main promotions, undoing #2358.
-  actions: write

 # Serialize auto-promote runs. Multiple staging gate completions can land
 # in quick succession (CI + E2E + CodeQL all finish within seconds of
 # each other on a green PR) — without this, two parallel runs both:
-#   1. Open / re-use the same promote PR.
-#   2. Both call `gh pr merge --auto` (idempotent — fine).
-#   3. Both poll for the same mergedAt and both `gh workflow run` publish
-#      → 2× redundant publish builds racing for the same `:staging-latest`
-#      retag, and 2× canary-verify chains.
-# cancel-in-progress: false because we don't want a brand-new run to kill
-# a polling-tail that's about to dispatch — the polling tail's 30 min cap
-# is the right backstop, not workflow-level cancel.
+#   1. Would race the GET-or-POST PR step.
+#   2. Would both call merge-schedule (idempotent — fine on Gitea).
+# cancel-in-progress: false because the second run on a fresh staging
+# tip should NOT kill the first which has already opened the PR.
 concurrency:
  group: auto-promote-staging
  cancel-in-progress: false
@@ -111,126 +188,112 @@ jobs:
      all_green: ${{ steps.gates.outputs.all_green }}
      head_sha: ${{ steps.gates.outputs.head_sha }}
    steps:
-      # Skip empty-tree promotes (the perpetual auto-promote↔auto-sync cycle
-      # observed 2026-05-03). Sequence: auto-promote merges via the staging
-      # merge-queue's MERGE strategy, creating a merge commit on main that
-      # staging doesn't have. auto-sync then merges main back into staging
-      # via another merge commit (the queue's MERGE strategy applies on
-      # the staging side too, even when the workflow's local FF would
-      # have sufficed). Now staging has a new merge-commit SHA whose
-      # tree == main's tree — but auto-promote sees "staging ahead of
-      # main by 1" and opens YET another empty promote PR. Each round
-      # costs ~30-40 min wallclock, ~2 manual approvals, and burns a
-      # full CodeQL Go run (~15 min). Without this guard the cycle
-      # repeats indefinitely.
-      #
-      # Long-term fix is to switch the merge_queue ruleset's
-      # `merge_method` away from MERGE so FF-able PRs land cleanly,
-      # but that's a broader change affecting every staging PR's
-      # commit shape. This guard is the one-line surgical fix that
-      # breaks the cycle without touching merge-queue config.
-      #
-      # Fail-open: if `git diff` errors for any reason, fall through
-      # to the gate check (preserve existing behavior). Only skip
-      # when the diff is DEFINITIVELY empty.
+      # Skip empty-tree promotes (the perpetual auto-promote↔auto-sync
+      # cycle observed pre-cutover on GitHub). On Gitea the cycle shape
+      # is different (auto-sync uses fast-forward, no merge commit),
+      # but the tree-diff guard is cheap insurance and protects against
+      # any future merge-style regression.
      - name: Checkout for tree-diff check
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          fetch-depth: 0
          ref: staging
-      - name: Skip if staging tree == main tree (perpetual-cycle break)
+
+      - name: Skip if staging tree == main tree (cycle-break safety)
        id: tree-diff
        env:
          HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
        run: |
          set -eu
          git fetch origin main --depth=50 || { echo "::warning::git fetch main failed — proceeding (fail-open)"; exit 0; }
-          # Compare staging tip's tree against main's tree. `git diff
-          # --quiet` exits 0 if no differences, 1 if there are.
          if git diff --quiet origin/main "$HEAD_SHA" -- 2>/dev/null; then
            {
-              echo "## ⏭ Skipped — no code to promote"
+              echo "## Skipped — no code to promote"
              echo
              echo "staging tip (\`${HEAD_SHA:0:8}\`) and \`main\` have identical trees."
-              echo "This is the auto-promote↔auto-sync merge-commit cycle: staging has a"
-              echo "new SHA (a sync-back merge commit) but the underlying file tree is"
-              echo "already on main, so there's no real code to ship."
-              echo
-              echo "Skipping to avoid opening an empty promote PR. Cycle terminates here."
+              echo "Skipping to avoid opening an empty promote PR."
            } >> "$GITHUB_STEP_SUMMARY"
            echo "::notice::auto-promote: staging tree == main tree — no code to promote, skipping"
            echo "skip=true" >> "$GITHUB_OUTPUT"
          else
            echo "skip=false" >> "$GITHUB_OUTPUT"
          fi
-      - name: Check all required gates on this SHA
+
+      - name: Check combined status on staging head
        if: steps.tree-diff.outputs.skip != 'true'
        id: gates
        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
          HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
          REPO: ${{ github.repository }}
+          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
        run: |
          set -euo pipefail

-          # Required gate workflow files. Use file paths (relative to
-          # .github/workflows/) rather than display names because:
+          # Gitea-native combined-status endpoint aggregates every
+          # check context attached to a SHA. This is structurally
+          # cleaner than the GitHub-era per-workflow `gh run list`
+          # loop because:
          #
-          #   1. `gh run list --workflow=<name>` is ambiguous when two
-          #      workflows have the same `name:` — observed 2026-04-28
-          #      with "CodeQL" matching both `codeql.yml` (explicit) and
-          #      GitHub's UI-configured Code-quality default setup
-          #      (internal "codeql"). gh CLI returns "could not resolve
-          #      to a unique workflow" → empty result → gate evaluated
-          #      as missing/none → auto-promote dead-locked despite all
-          #      checks actually passing.
+          #   1. There's no risk of "workflow name collision" (the
+          #      GitHub-era code had to switch from `--workflow=NAME`
+          #      to `--workflow=FILE.YML` to disambiguate "CodeQL"
+          #      between the explicit workflow and GitHub's UI-
+          #      configured default setup; Gitea has no such
+          #      duplicate-name surface).
+          #   2. Gitea's combined state already encodes the AND
+          #      across all contexts: success only if EVERY context
+          #      is success. Pending or failure on any context
+          #      produces non-success state.
          #
-          #   2. File paths are the unique identifier for workflows;
-          #      `name:` is just a display string and can collide.
-          #
-          # When adding/removing a gate, update this list AND the
-          # branch-protection required-checks list (which uses check-run
-          # display names, not workflow names; the two are decoupled and
-          # should be kept in sync manually).
-          GATES=(
-            "ci.yml"
-            "e2e-staging-canvas.yml"
-            "e2e-api.yml"
-            "codeql.yml"
-          )
+          # See https://docs.gitea.com/api/1.22 for the schema —
+          # `state` is one of: success, pending, failure, error.

          echo "head_sha=${HEAD_SHA}" >> "$GITHUB_OUTPUT"
-          echo "Checking gates on SHA ${HEAD_SHA}"
+          echo "Checking combined status on SHA ${HEAD_SHA}"

-          ALL_GREEN=true
-          for gate in "${GATES[@]}"; do
-            # Query the most recent run of this workflow on this SHA.
-            # event=push to avoid picking up PR runs. branch=staging to
-            # guard against someone dispatching the gate on a non-staging
-            # branch at the same SHA.
-            RESULT=$(gh run list \
-              --repo "$REPO" \
-              --workflow "$gate" \
-              --branch staging \
-              --event push \
-              --commit "$HEAD_SHA" \
-              --limit 1 \
-              --json status,conclusion \
-              --jq '.[0] | "\(.status)/\(.conclusion // "none")"' \
-              2>/dev/null || echo "missing/none")
+          # `set +e` around the http-code capture so a curl failure is
+          # recorded in CURL_RC instead of aborting the step; `set -e` is
+          # restored immediately. Pattern hardened per
+          # `feedback_curl_status_capture_pollution`.
+          BODY_FILE=$(mktemp)
+          set +e
+          STATUS=$(curl -sS \
+            -H "Authorization: token ${GITEA_TOKEN}" \
+            -H "Accept: application/json" \
+            -o "${BODY_FILE}" \
+            -w "%{http_code}" \
+            "${GITEA_HOST}/api/v1/repos/${REPO}/commits/${HEAD_SHA}/status")
+          CURL_RC=$?
+          set -e

-            echo "  $gate → $RESULT"
+          if [ "${CURL_RC}" -ne 0 ] || [ "${STATUS}" != "200" ]; then
+            echo "::error::combined-status fetch failed: curl=${CURL_RC} http=${STATUS}"
+            head -c 500 "${BODY_FILE}" || true
+            rm -f "${BODY_FILE}"
+            echo "all_green=false" >> "$GITHUB_OUTPUT"
+            exit 0
+          fi

-            # Only completed/success counts. completed/failure or
-            # in_progress/anything or no record at all = abort.
-            if [ "$RESULT" != "completed/success" ]; then
-              ALL_GREEN=false
-            fi
-          done
+          STATE=$(jq -r '.state // "missing"' < "${BODY_FILE}")
+          TOTAL=$(jq -r '.total_count // 0' < "${BODY_FILE}")
+          rm -f "${BODY_FILE}"

-          echo "all_green=${ALL_GREEN}" >> "$GITHUB_OUTPUT"
-          if [ "$ALL_GREEN" != "true" ]; then
-            echo "::notice::auto-promote: not all gates are green on ${HEAD_SHA} — staying on current main"
+          echo "Combined status: state=${STATE} total_count=${TOTAL}"
+
+          if [ "${STATE}" = "success" ] && [ "${TOTAL}" -gt 0 ]; then
+            echo "all_green=true" >> "$GITHUB_OUTPUT"
+            echo "::notice::All gates green on ${HEAD_SHA} (${TOTAL} contexts)"
+          else
+            echo "all_green=false" >> "$GITHUB_OUTPUT"
+            {
+              echo "## Not promoting — combined status not green"
+              echo
+              echo "- SHA: \`${HEAD_SHA:0:8}\`"
+              echo "- Combined state: \`${STATE}\`"
+              echo "- Context count: ${TOTAL}"
+              echo
+              echo "Will re-fire on the next gate completion. Investigate any red gate via the Actions UI."
+            } >> "$GITHUB_STEP_SUMMARY"
+            echo "::notice::auto-promote: combined status is ${STATE} on ${HEAD_SHA} — staying on current main"
          fi

  promote:
@@ -247,188 +310,183 @@ jobs:
          # Repo variable AUTO_PROMOTE_ENABLED=true flips this on. While
          # it's unset, the workflow dry-runs (logs what it would have
          # done) but doesn't open the promote PR. Set the variable in
-          # Settings → Secrets and variables → Actions → Variables.
+          # Settings → Actions → Variables.
          if [ "${AUTO_PROMOTE_ENABLED:-}" != "true" ] && [ "${FORCE_INPUT:-false}" != "true" ]; then
            {
-              echo "## ⏸ Auto-promote disabled"
+              echo "## Auto-promote disabled"
              echo
              echo "Repo variable \`AUTO_PROMOTE_ENABLED\` is not set to \`true\`."
              echo "All gates are green on staging; would have opened a promote PR to \`main\`."
              echo
-              echo "To enable: Settings → Secrets and variables → Actions → Variables → \`AUTO_PROMOTE_ENABLED=true\`."
+              echo "To enable: Settings → Actions → Variables → \`AUTO_PROMOTE_ENABLED=true\`."
              echo "To test once manually: workflow_dispatch with \`force=true\`."
            } >> "$GITHUB_STEP_SUMMARY"
            echo "::notice::auto-promote disabled — dry run only"
            exit 0
          fi

-      # Mint the App token BEFORE the promote-PR step so the auto-merge
-      # call can use it. GITHUB_TOKEN-initiated merges suppress the
-      # downstream `push` event on main, breaking the
-      # publish-workspace-server-image → canary-verify → redeploy-tenants
-      # chain (issue #2357). Using the App token here means the
-      # merge-queue-landed merge IS able to fire the cascade naturally;
-      # the polling tail below stays as defense-in-depth.
-      - name: Mint App token for promote-PR + downstream dispatch
-        if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
-        id: app-token
-        uses: actions/create-github-app-token@1b10c78c7865c340bc4f6099eb2f838309f1e8c3 # v3.1.1
-        with:
-          app-id: ${{ secrets.MOLECULE_AI_APP_ID }}
-          private-key: ${{ secrets.MOLECULE_AI_APP_PRIVATE_KEY }}
-
-      - name: Open (or reuse) staging → main promote PR + enable auto-merge
+      - name: Open or reuse promote PR + schedule auto-merge
        if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
        env:
-          GH_TOKEN: ${{ steps.app-token.outputs.token }}
+          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
          REPO: ${{ github.repository }}
          TARGET_SHA: ${{ needs.check-all-gates-green.outputs.head_sha }}
+          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
        run: |
          set -euo pipefail

-          # Look for an existing open promote PR (idempotent on re-run
-          # of the workflow). The PR's head IS the staging branch — the
-          # whole point is "advance main to staging's tip", so we don't
-          # need a per-SHA branch like auto-sync-main-to-staging uses.
-          PR_NUM=$(gh pr list --repo "$REPO" \
-            --base main --head staging --state open \
-            --json number --jq '.[0].number // ""')
+          API="${GITEA_HOST}/api/v1/repos/${REPO}"
+          AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")

-          if [ -z "$PR_NUM" ]; then
+          # http_get BODY_FILE URL
+          # Writes the response body to BODY_FILE and echoes the HTTP
+          # status code on stdout (from -w). Curl status capture pattern
+          # per `feedback_curl_status_capture_pollution`: the set +e/-e
+          # bracket keeps a curl failure from aborting the step early.
+          http_get() {
+            local body_file="$1"; shift
+            local url="$1"; shift
+            set +e
+            local code
+            code=$(curl -sS "${AUTH[@]}" -o "${body_file}" -w "%{http_code}" "${url}")
+            local rc=$?
+            set -e
+            if [ "${rc}" -ne 0 ]; then
+              echo "::error::curl GET failed (rc=${rc}) on ${url}" >&2
+              return 99
+            fi
+            echo "${code}"
+          }
+          http_post_json() {
+            local body_file="$1"; shift
+            local data="$1"; shift
+            local url="$1"; shift
+            set +e
+            local code
+            code=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
+              -X POST -d "${data}" -o "${body_file}" -w "%{http_code}" "${url}")
+            local rc=$?
+            set -e
+            if [ "${rc}" -ne 0 ]; then
+              echo "::error::curl POST failed (rc=${rc}) on ${url}" >&2
+              return 99
+            fi
+            echo "${code}"
+          }
+
+          # Step 1: look for an existing open staging→main promote PR
+          # (idempotent on workflow re-run). Gitea doesn't have a
+          # head/base filter on the list endpoint that's as ergonomic
+          # as gh's, but the dedicated `/pulls/{base}/{head}` lookup
+          # works.
+          BODY=$(mktemp)
+          STATUS=$(http_get "${BODY}" "${API}/pulls/main/staging") || true
+
+          PR_NUM=""
+          if [ "${STATUS}" = "200" ]; then
+            STATE=$(jq -r '.state // "missing"' < "${BODY}")
+            if [ "${STATE}" = "open" ]; then
+              PR_NUM=$(jq -r '.number // ""' < "${BODY}")
+              echo "::notice::Re-using existing open promote PR #${PR_NUM}"
+            fi
+          fi
+          rm -f "${BODY}"
+
+          # Step 2: if no open PR, create one.
+          if [ -z "${PR_NUM}" ]; then
            TITLE="staging → main: auto-promote ${TARGET_SHA:0:7}"
-            BODY_FILE=$(mktemp)
-            cat > "$BODY_FILE" <<EOFBODY
-          Automated promotion of \`staging\` (\`${TARGET_SHA:0:8}\`) to \`main\`. All required staging gates green at this SHA: CI, E2E Staging Canvas, E2E API Smoke, CodeQL.
+            BODY_TEXT=$(cat <<EOFBODY
+          Automated promotion of \`staging\` (\`${TARGET_SHA:0:8}\`) to \`main\`. All required staging gates are green at this SHA (combined status reported success).

-          This PR is auto-generated by \`.github/workflows/auto-promote-staging.yml\` whenever every required gate completes green on the same staging SHA. It exists because main's branch protection requires status checks "set by the expected GitHub apps" — direct \`git push\` from a workflow can't satisfy that, only PR merges through the queue can.
+          This PR is auto-generated by \`.github/workflows/auto-promote-staging.yml\` whenever every required gate completes green on the same staging SHA.

-          Merge queue lands this; no human action needed unless gates fail. Reverse-direction sync (the merge commit on main → staging) is handled by \`auto-sync-main-to-staging.yml\`.
+          **Approval gate:** \`main\` branch protection requires 1 approval before this can land. Once approved, Gitea will auto-merge (the workflow scheduled \`merge_when_checks_succeed: true\` immediately after open).
+
+          The reverse-direction sync (the merge commit on \`main\` → \`staging\`) is handled automatically by \`auto-sync-main-to-staging.yml\` after this PR lands.
+
+          ---
+          - Source: staging at \`${TARGET_SHA}\`
+          - Opened by: \`devops-engineer\` persona (anti-bot-ring; never founder PAT)
+          - Refs: #65, #73, #195
          EOFBODY
-            PR_URL=$(gh pr create --repo "$REPO" \
-              --base main --head staging \
-              --title "$TITLE" \
-              --body-file "$BODY_FILE")
-            PR_NUM=$(echo "$PR_URL" | grep -oE '[0-9]+$' | tail -1)
-            rm -f "$BODY_FILE"
-            echo "::notice::Opened PR #${PR_NUM}"
-          else
-            echo "::notice::Re-using existing promote PR #${PR_NUM}"
+          )
+            REQ=$(jq -n \
+              --arg title "${TITLE}" \
+              --arg body "${BODY_TEXT}" \
+              --arg base "main" \
+              --arg head "staging" \
+              '{title:$title, body:$body, base:$base, head:$head}')
+
+            BODY=$(mktemp)
+            STATUS=$(http_post_json "${BODY}" "${REQ}" "${API}/pulls")
+
+            if [ "${STATUS}" = "201" ]; then
+              PR_NUM=$(jq -r '.number // ""' < "${BODY}")
+              echo "::notice::Opened promote PR #${PR_NUM}"
+            else
+              echo "::error::Failed to create promote PR: HTTP ${STATUS}"
+              jq -r '.message // .' < "${BODY}" | head -c 500
+              rm -f "${BODY}"
+              exit 1
+            fi
+            rm -f "${BODY}"
          fi

-          # Enable auto-merge — the merge queue picks it up once
-          # required gates are green on the merge_group ref.
-          if ! gh pr merge "$PR_NUM" --repo "$REPO" --auto --merge 2>&1; then
-            echo "::warning::Failed to enable auto-merge on PR #${PR_NUM} — operator may need to merge manually."
-          fi
+          # Step 3: schedule auto-merge. merge_when_checks_succeed
+          # tells Gitea to wait for both:
+          #   - all required status checks to pass
+          #   - the required-approvals gate (1 approval on main)
+          # before merging. On approval+green, Gitea merges within
+          # seconds. On any check failing or approval being denied,
+          # the schedule stays armed but doesn't fire.
+          #
+          # Idempotent: re-arming on an already-armed PR is a no-op.
+          REQ=$(jq -n '{Do:"merge", merge_when_checks_succeed:true}')
+          BODY=$(mktemp)
+          STATUS=$(http_post_json "${BODY}" "${REQ}" "${API}/pulls/${PR_NUM}/merge")
+
+          # Gitea returns:
+          #   - 200/204 on successful immediate merge (gates already green AND approved)
+          #   - 405 "Please try again later" when scheduled successfully but waiting
+          #   - 422 on "Pull request is not mergeable" (conflict, stale base, etc.)
+          #
+          # 405 here is benign — Gitea's way of saying "scheduled, not merging now".
+          # We treat 200/204/405 as success, anything else as failure.
+          case "${STATUS}" in
+            200|204)
+              MERGE_OUTCOME="merged-immediately"
+              echo "::notice::Promote PR #${PR_NUM} merged immediately (gates+approval already green)"
+              ;;
+            405)
+              MERGE_OUTCOME="auto-merge-scheduled"
+              echo "::notice::Promote PR #${PR_NUM}: auto-merge scheduled (Gitea will land on approval+green)"
+              ;;
+            422)
+              MERGE_OUTCOME="not-mergeable"
+              echo "::warning::Promote PR #${PR_NUM}: not mergeable (conflict, stale base, or already merging)."
+              jq -r '.message // .' < "${BODY}" | head -c 500 || true
+              ;;
+            *)
+              echo "::error::Unexpected status ${STATUS} on merge schedule"
+              jq -r '.message // .' < "${BODY}" | head -c 500
+              rm -f "${BODY}"
+              exit 1
+              ;;
+          esac
+          rm -f "${BODY}"

          {
-            echo "## ✅ Auto-promote PR opened"
+            echo "## Auto-promote PR opened"
            echo
            echo "- Source: staging at \`${TARGET_SHA:0:8}\`"
            echo "- PR: #${PR_NUM}"
+            echo "- Outcome: \`${MERGE_OUTCOME}\`"
            echo
-            echo "Merge queue lands the PR once required gates are green; no human action needed unless gates fail."
+            if [ "${MERGE_OUTCOME}" = "auto-merge-scheduled" ]; then
+              echo "Gitea will auto-merge once Hongming approves and all checks are green. No human action needed beyond approval."
+            elif [ "${MERGE_OUTCOME}" = "merged-immediately" ]; then
+              echo "Merged immediately. \`publish-workspace-server-image.yml\` will fire naturally on the resulting \`main\` push."
+            else
+              echo "PR is not auto-merging. Operator may need to bring staging up to date with main, then re-trigger this workflow via workflow_dispatch."
+            fi
          } >> "$GITHUB_STEP_SUMMARY"
-
-          # Hand the PR number to the next step so we can dispatch the
-          # tenant-redeploy chain after the merge queue lands the merge.
-          echo "promote_pr_num=${PR_NUM}" >> "$GITHUB_OUTPUT"
-        id: promote_pr
-
-      # The App token minted above (before the promote-PR step) is
-      # also used by the polling tail below. Defense-in-depth: with
-      # the merge-queue-landed merge now using the App token, the
-      # main-branch push event SHOULD fire the publish/canary/redeploy
-      # cascade naturally — but if for any reason it doesn't (e.g. an
-      # unrelated event-suppression edge case), the explicit dispatches
-      # below still wake the chain.
-      - name: Wait for promote merge, then dispatch publish + redeploy (#2357)
-        # Defense-in-depth dispatch. With the auto-merge call above
-        # now using the App token (this commit), the merge-queue-landed
-        # merge SHOULD fire publish-workspace-server-image naturally
-        # via on:push:[main] — App-token-initiated pushes DO trigger
-        # workflow_run cascades, unlike GITHUB_TOKEN-initiated ones
-        # (the documented "no recursion" rule —
-        # https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
-        #
-        # This explicit dispatch stays as belt-and-suspenders for any
-        # edge case where the natural cascade misfires. If it never
-        # observably fires after this token swap (i.e. the publish
-        # workflow has already started by the time we get here), the
-        # second dispatch is a harmless no-op (publish-workspace-server-image
-        # has its own concurrency group that dedupes).
-        #
-        # See PR for #2357: pre-fix the merge action was via
-        # GITHUB_TOKEN, suppressing the cascade and forcing this tail
-        # to be the SOLE chain trigger. With the auto-merge token swap
-        # the tail becomes redundant in the happy path; keep until
-        # we've observed >=10 successful natural cascades, then drop.
-        if: steps.promote_pr.outputs.promote_pr_num != ''
-        env:
-          GH_TOKEN: ${{ steps.app-token.outputs.token }}
-          REPO: ${{ github.repository }}
-          PR_NUM: ${{ steps.promote_pr.outputs.promote_pr_num }}
-        run: |
-          # Poll for merge — max 30 min (60 × 30s). The merge queue
-          # typically lands within 5-10 min when gates are green. Break
-          # early if the PR is closed without merging (operator action,
-          # gates flipped red post-approval, branch-protection rejection)
-          # so we don't tie up a runner for the full 30 min on a dead PR.
-          MERGED=""
-          STATE=""
-          for _ in $(seq 1 60); do
-            VIEW=$(gh pr view "$PR_NUM" --repo "$REPO" --json mergedAt,state)
-            MERGED=$(echo "$VIEW" | jq -r '.mergedAt // ""')
-            STATE=$(echo "$VIEW" | jq -r '.state // ""')
-            if [ -n "$MERGED" ] && [ "$MERGED" != "null" ]; then
-              echo "::notice::Promote PR #${PR_NUM} merged at ${MERGED}"
-              break
-            fi
-            if [ "$STATE" = "CLOSED" ]; then
-              echo "::warning::Promote PR #${PR_NUM} was closed without merging — skipping deploy dispatch."
-              exit 0
-            fi
-            sleep 30
-          done
-
-          if [ -z "$MERGED" ] || [ "$MERGED" = "null" ]; then
-            echo "::warning::Promote PR #${PR_NUM} didn't merge within 30min — skipping deploy dispatch (manually run \`gh workflow run publish-workspace-server-image.yml --ref main\` once it lands)."
-            exit 0
-          fi
-
-          # Dispatch publish on main using the App token. App-initiated
-          # workflow_dispatch DOES propagate the workflow_run cascade,
-          # unlike GITHUB_TOKEN-initiated dispatch.
-          # publish completes → canary-verify chains via workflow_run →
-          # redeploy-tenants-on-main chains via workflow_run + branches:[main].
-          if gh workflow run publish-workspace-server-image.yml \
-              --repo "$REPO" --ref main 2>&1; then
-            echo "::notice::Dispatched publish-workspace-server-image on ref=main as molecule-ai App — canary-verify and redeploy-tenants-on-main will chain via workflow_run."
-            {
-              echo "## 🚀 Tenant redeploy chain dispatched"
-              echo
-              echo "- publish-workspace-server-image (workflow_dispatch on \`main\`, actor: \`molecule-ai[bot]\`)"
-              echo "- canary-verify will chain on completion"
-              echo "- redeploy-tenants-on-main will chain on canary green"
-            } >> "$GITHUB_STEP_SUMMARY"
-          else
-            echo "::error::Failed to dispatch publish-workspace-server-image. Run manually: gh workflow run publish-workspace-server-image.yml --ref main"
-          fi
-
-          # ALSO dispatch auto-sync-main-to-staging.yml. Same root cause as
-          # publish above (issue #2357): the merge-queue-initiated push to
-          # main is by GITHUB_TOKEN → no `on: push` triggers fire downstream.
-          # Without this dispatch, every staging→main promote leaves staging
-          # one merge commit BEHIND main, which silently dead-locks the NEXT
-          # promote PR as `mergeStateStatus: BEHIND` because main's
-          # branch-protection has `strict: true`. Verified empirically on
-          # 2026-05-02 against PR #2442 (Phase 2 promote): only the explicit
-          # publish-workspace-server-image dispatch fired on the previous
-          # promote SHA 76c604fb, while auto-sync silently no-op'd, leaving
-          # staging behind for ~24h until manually bridged.
-          if gh workflow run auto-sync-main-to-staging.yml \
-              --repo "$REPO" --ref main 2>&1; then
-            echo "::notice::Dispatched auto-sync-main-to-staging on ref=main as molecule-ai App — staging will absorb the new main merge commit via PR + merge queue."
-          else
-            echo "::error::Failed to dispatch auto-sync-main-to-staging. Run manually: gh workflow run auto-sync-main-to-staging.yml --ref main"
-          fi
@@ -0,0 +1,404 @@
+name: Auto-sync canary — AUTO_SYNC_TOKEN rotation drift
+
+# Synthetic health check for the AUTO_SYNC_TOKEN secret consumed by
+# auto-sync-main-to-staging.yml (PR #66) and publish-workspace-server-image.yml.
+#
+# ============================================================
+# Why this workflow exists
+# ============================================================
+#
+# PR #66 fixed auto-sync (replaced GitHub-era `gh pr create` — which
+# 405s on Gitea's GraphQL endpoint — with a direct git push from the
+# `devops-engineer` persona's `AUTO_SYNC_TOKEN`). Hostile self-review
+# weakest spot #3 of that PR:
+#
+#   "Token rotation silently breaks auto-sync. If AUTO_SYNC_TOKEN is
+#    rotated without updating the repo secret, every push to main
+#    fails red on the auto-sync push step. The workflow surfaces the
+#    failure mode in the step summary (failure mode B in the header),
+#    but there's no proactive monitoring."
+#
+# Detection latency under the status quo: rotation is only caught on
+# the next push to `main`. During quiet periods (no main push for
+# many hours) the staging-superset-of-main invariant silently breaks.
+#
+# This workflow closes the gap: every 6 hours, it fires the auth
+# surface that auto-sync depends on and emits a red workflow status
+# if AUTO_SYNC_TOKEN has drifted out of validity.
+#
+# ============================================================
+# What this checks (Option B — read-only verify)
+# ============================================================
+#
+# 1. `GET /api/v1/user` against Gitea with the token → validates the
+#    token authenticates AND resolves to `devops-engineer` (catches
+#    the case where the token was regenerated under a different
+#    persona by mistake).
+# 2. `GET /api/v1/repos/molecule-ai/molecule-core` with the token →
+#    validates the token has `read:repository` scope on this repo
+#    (the v2 scope contract — see saved memory
+#    `reference_persona_token_v2_scope`).
+# 3. `git push --dry-run` of the current staging SHA back to
+#    `refs/heads/staging` via `https://oauth2:<token>@<gitea>/...`
+#    → validates the EXACT HTTPS basic-auth path that
+#    `actions/checkout` + `git push origin staging` use inside
+#    auto-sync-main-to-staging.yml. NOP by construction (push the
+#    current tip to itself = "Everything up-to-date"); auth is
+#    checked at the smart-protocol handshake BEFORE the empty-diff
+#    computation, so bad token → exit 128 with "Authentication
+#    failed". `git ls-remote` is NOT used here because Gitea
+#    falls back to anonymous read on public repos and would
+#    silently green-light a rotated token.
+#
+# Each step exits non-zero with an actionable error message if it
+# fails. The workflow status itself is the operator-facing surface.
+#
+# ============================================================
+# What this does NOT check (intentional)
+# ============================================================
+#
+# - **Branch-protection authz** (failure mode C in auto-sync header):
+#   would require an actual write to staging. Already monitored by
+#   `branch-protection-drift.yml` daily. Don't duplicate.
+# - **Conflict resolution** (failure mode A): a real conflict is data-
+#   driven, not auth-driven; can't synthesise it without polluting
+#   staging. Already surfaces immediately on the next main push.
+# - **Concurrency** (failure mode D): handled by workflow concurrency
+#   group on auto-sync, not a credential issue.
+#
+# ============================================================
+# Why Option B (read-only) and not the alternatives
+# ============================================================
+#
+# Considered + rejected (see issue #72 for full write-up):
+#
+# - **Option A — full auto-sync on schedule**: every run creates a
+#   no-op merge commit on staging when main hasn't advanced: 4 noise
+#   commits/day. It also races the real `push:` trigger when main has
+#   advanced. Rejected.
+#
+# - **Option C — push to dedicated `auto-sync-canary` branch**: would
+#   exercise authz too, but adds branch noise on Gitea AND requires
+#   maintaining a second branch protection (or expanding staging's
+#   whitelist to a junk branch). Authz already covered by
+#   `branch-protection-drift.yml`. Rejected.
+#
+# Prior art for the chosen Option B shape:
+#   - Cloudflare's `/user/tokens/verify` endpoint (read-only auth
+#     probe explicitly designed for credential canaries).
+#   - AWS Secrets Manager rotation Lambda's `testSecret` step (auth
+#     probe before promoting AWSPENDING → AWSCURRENT).
+#   - HashiCorp Vault's `vault token lookup` for renewal canaries.
+#
+# ============================================================
+# Operator runbook — what to do when this workflow goes RED
+# ============================================================
+#
+# 1. **Identify which step failed**:
+#    - Step "Verify token authenticates as devops-engineer" red →
+#      token is invalid OR resolves to wrong persona.
+#    - Step "Verify token has repo read scope" red → token valid but
+#      stripped of `read:repository` scope (or repo perms changed).
+#    - Step "Verify git HTTPS auth path via no-op dry-run push to
+#      staging" red → token rotated/revoked OR Gitea git-HTTPS
+#      surface is broken (rare). Auth check happens on the
+#      smart-protocol handshake, separate from the API path.
+#
+# 2. **Re-issue the token** on the operator host:
+#    ```
+#    ssh root@5.78.80.188 'docker exec --user git molecule-gitea-1 \
+#      gitea admin user generate-access-token \
+#      --username devops-engineer \
+#      --token-name persona-devops-engineer-vN \
+#      --scopes "read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc"'
+#    ```
+#    Update `/etc/molecule-bootstrap/agent-secrets.env` in place
+#    (per `feedback_unified_credentials_file`). The previous token
+#    file lands at `.bak.<date>`.
+#
+# 3. **Update the repo Actions secret** at:
+#    Settings → Actions → Secrets → AUTO_SYNC_TOKEN
+#    Paste the new token. (Don't echo it in chat — but per
+#    `feedback_passwords_in_chat_are_burned`, a paste in a 1:1
+#    Claude session is within trust boundary.)
+#
+# 4. **Re-run this canary** via workflow_dispatch. Confirm GREEN.
+#
+# 5. **Backfill any missed main → staging syncs** by re-running
+#    `auto-sync-main-to-staging.yml` from its workflow_dispatch
+#    surface, OR by pushing an empty commit to main (if you'd
+#    rather force a real trigger).
+#
+# ============================================================
+# Security notes
+# ============================================================
+#
+# - Token usage: read-only (`GET /api/v1/user`, `GET /api/v1/repos/...`,
+#   `git push --dry-run`). No write paths. Same blast-radius profile as
+#   `actions/checkout` on a public repo.
+# - The token NEVER appears in logs: every `curl` uses a header
+#   variable, never inline; the `git push --dry-run` URL builds the
+#   `oauth2:$TOKEN@host` form into a single env var that's not
+#   echoed. GitHub Actions secret-masking covers anything that does
+#   slip through.
+# - No new token introduced — same `AUTO_SYNC_TOKEN` the workflow
+#   under monitor uses. Per least-privilege we deliberately do NOT
+#   broaden scope for the canary.
+
+on:
+  schedule:
+    # Every 6 hours at :17 (offsets the cron herd at :00). Justification
+    # from issue #72: cheap to run (~5s wall-clock, no quota), 3h average
+    # detection latency, 6h max. 1h would be 6× the runs for marginal
+    # benefit; daily would be 4× longer latency and worse than status
+    # quo on a quiet-main day.
+    - cron: '17 */6 * * *'
+  workflow_dispatch:
+
+# No concurrency group needed — the canary is read-only and idempotent.
+# Two parallel runs (e.g. operator dispatch during a scheduled tick) are
+# harmless: same result, doubled HTTPS calls, no shared state.
+
+permissions:
+  contents: read
+
+jobs:
+  verify-token:
+    name: Verify AUTO_SYNC_TOKEN validity
+    runs-on: ubuntu-latest
+    # 2 min surfaces hangs (Gitea API stall, DNS issue) within one
+    # cron interval. Realistic worst case is ~10s: 2 curls + 1 git push
+    # --dry-run, each capped by the explicit timeouts below.
+    timeout-minutes: 2
+
+    env:
+      # Pinned in env so individual steps can read it without
+      # repeating the secret reference. GitHub masks the value in
+      # logs automatically.
+      AUTO_SYNC_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
+      # MUST stay in sync with auto-sync-main-to-staging.yml's
+      # `git config user.name "devops-engineer"` line. Renaming the
+      # devops-engineer persona requires updating both files (and
+      # the staging branch protection's `push_whitelist_usernames`).
+      EXPECTED_PERSONA: devops-engineer
+      GITEA_HOST: git.moleculesai.app
+      REPO_PATH: molecule-ai/molecule-core
+
+    steps:
+      - name: Verify AUTO_SYNC_TOKEN secret is configured
+        # Schedule-vs-dispatch behaviour split, per
+        # `feedback_schedule_vs_dispatch_secrets_hardening`:
+        #
+        #   - schedule: hard-fail when the secret is missing. The
+        #     whole point of the canary is to surface drift; soft-
+        #     skipping on missing-secret would make the canary
+        #     itself drift-invisible (sweep-cf-orphans #2088 lesson).
+        #   - workflow_dispatch: hard-fail too — there's no scenario
+        #     where an operator wants this canary to silently no-op.
+        #     The workflow has no other ad-hoc utility; if you ran
+        #     it, you wanted the answer.
+        run: |
+          if [ -z "${AUTO_SYNC_TOKEN}" ]; then
+            echo "::error::AUTO_SYNC_TOKEN secret is not set on this repo." >&2
+            echo "::error::Set it at Settings → Actions → Secrets." >&2
+            echo "::error::Without it, auto-sync-main-to-staging.yml will fail every push to main." >&2
+            exit 1
+          fi
+          echo "AUTO_SYNC_TOKEN is configured (value masked)."
+
+      - name: Verify token authenticates as ${{ env.EXPECTED_PERSONA }}
+        # Calls Gitea's `/api/v1/user` — the canonical
+        # auth-probe-with-no-side-effects endpoint (mirrors
+        # Cloudflare's /user/tokens/verify).
+        #
+        # Failure surfaces:
+        #   - HTTP 401: token invalid (rotated, revoked, or never
+        #     correctly registered).
+        #   - HTTP 200 but username != devops-engineer: token was
+        #     regenerated under the wrong persona — this would let
+        #     auth pass but commit attribution would be wrong, and
+        #     branch-protection authz would fail because only
+        #     `devops-engineer` is whitelisted.
+        run: |
+          set -euo pipefail
+          response_file="$(mktemp)"
+          code_file="$(mktemp)"
+          # `--max-time 30`: full call ceiling. `--connect-timeout 10`:
+          # DNS + TCP. `-w "%{http_code}"` routed to a tempfile so curl's
+          # exit code can't pollute the captured status — see
+          # feedback_curl_status_capture_pollution + the
+          # `lint-curl-status-capture.yml` gate that rejects the unsafe
+          # `$(curl ... || echo "000")` shape.
+          set +e
+          curl -sS -o "$response_file" \
+            --max-time 30 --connect-timeout 10 \
+            -w "%{http_code}" \
+            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
+            -H "Accept: application/json" \
+            "https://${GITEA_HOST}/api/v1/user" >"$code_file" 2>/dev/null
+          set -e
+          status=$(cat "$code_file" 2>/dev/null || true)
+          [ -z "$status" ] && status="000"
+
+          if [ "$status" != "200" ]; then
+            echo "::error::Token rotation suspected: GET /api/v1/user returned HTTP $status (expected 200)." >&2
+            echo "::error::Likely cause: AUTO_SYNC_TOKEN has been rotated/revoked on Gitea but the repo Actions secret was not updated." >&2
+            echo "::error::Runbook: see header comment of this workflow file." >&2
+            # Print response body but redact anything that looks like a token.
+            sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
+            exit 1
+          fi
+
+          username=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('login',''))" "$response_file")
+          if [ "$username" != "${EXPECTED_PERSONA}" ]; then
+            echo "::error::Token resolves to user '$username', expected '${EXPECTED_PERSONA}'." >&2
+            echo "::error::AUTO_SYNC_TOKEN must be the devops-engineer persona PAT (not founder PAT, not another persona)." >&2
+            echo "::error::Auto-sync push will fail because only 'devops-engineer' is whitelisted on staging branch protection." >&2
+            exit 1
+          fi
+          echo "Token authenticates as: $username ✓"
+
+      - name: Verify token has repo read scope
+        # `GET /api/v1/repos/<owner>/<repo>` requires `read:repository`
+        # on the persona's v2 scope contract. If the scope was
+        # narrowed/dropped on rotation we catch it here, before the
+        # next main push reveals it via a checkout failure.
+        run: |
+          set -euo pipefail
+          response_file="$(mktemp)"
+          code_file="$(mktemp)"
+          # See first probe step for the rationale on the tempfile-routed
+          # `-w "%{http_code}"` pattern — the unsafe `|| echo "000"` shape
+          # is rejected by lint-curl-status-capture.yml.
+          set +e
+          curl -sS -o "$response_file" \
+            --max-time 30 --connect-timeout 10 \
+            -w "%{http_code}" \
+            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
+            -H "Accept: application/json" \
+            "https://${GITEA_HOST}/api/v1/repos/${REPO_PATH}" >"$code_file" 2>/dev/null
+          set -e
+          status=$(cat "$code_file" 2>/dev/null || true)
+          [ -z "$status" ] && status="000"
+
+          if [ "$status" != "200" ]; then
+            echo "::error::Token lacks read:repository scope on ${REPO_PATH}: HTTP $status." >&2
+            echo "::error::Auto-sync's actions/checkout step will fail with this token." >&2
+            echo "::error::Re-issue with v2 scope contract: read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc" >&2
+            sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
+            exit 1
+          fi
+          echo "Token has read:repository on ${REPO_PATH} ✓"
+
+      - name: Verify git HTTPS auth path via no-op dry-run push to staging
+        # Final probe: exercise the EXACT auth path that
+        # `actions/checkout` + `git push origin staging` use in
+        # auto-sync-main-to-staging.yml. Gitea's API and git-HTTPS
+        # surfaces share the token-lookup code path internally but
+        # the wire-level error shapes differ — historically (#173)
+        # the API path was healthy while git-HTTPS rejected, so
+        # checking only the API would have given false-green.
+        #
+        # IMPORTANT: `git ls-remote` on a public repo (which
+        # molecule-core is) succeeds even with a junk token because
+        # Gitea falls back to anonymous-read. `ls-remote` therefore
+        # CANNOT validate auth on this surface. We use
+        # `git push --dry-run` instead — push is auth-gated even on
+        # public repos.
+        #
+        # NOP shape: read the current staging SHA via authenticated
+        # ls-remote (the SHA itself is public; auth is incidental
+        # here, used only to colocate the discovery in one step), then
+        # `git push --dry-run <SHA>:refs/heads/staging`. Pushing the
+        # current tip back to itself is "Everything up-to-date" with
+        # exit 0 when auth succeeds. With a bad token Gitea returns
+        # HTTP 401 in the smart-protocol handshake and git exits 128
+        # with "Authentication failed".
+        #
+        # The dry-run never reaches Gitea's pre-receive hook (which
+        # is where branch-protection authz runs), so this probe does
+        # not validate failure mode C. That's intentional —
+        # branch-protection-drift.yml owns authz monitoring; this
+        # canary owns auth.
+        env:
+          # Don't hang waiting for password prompt if auth fails on a
+          # terminal-attached run. (In Actions there's no terminal,
+          # but the env-var hardens against an interactive runner
+          # config.)
+          GIT_TERMINAL_PROMPT: "0"
+        run: |
+          set -euo pipefail
+          # Token is in $AUTO_SYNC_TOKEN (job-level env). Compose the
+          # URL as a local var that's never echoed.
+          url="https://oauth2:${AUTO_SYNC_TOKEN}@${GITEA_HOST}/${REPO_PATH}"
+
+          # Step a: read current staging SHA. ~1KB; auth-gated only
+          # on private repos but always works on public — used here
+          # only to discover the SHA, not to validate auth.
+          staging_ref=$(timeout 30s git ls-remote --refs "$url" refs/heads/staging 2>&1) || {
+            redacted=$(echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
+            echo "::error::ls-remote against staging failed (network/DNS issue):" >&2
+            echo "$redacted" >&2
+            exit 1
+          }
+          if ! echo "$staging_ref" | grep -qE '^[0-9a-f]{40}[[:space:]]+refs/heads/staging$'; then
+            echo "::error::ls-remote returned unexpected shape:" >&2
+            echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g" >&2
+            exit 1
+          fi
+          staging_sha=$(echo "$staging_ref" | awk '{print $1}')
+
+          # Step b: spin up an ephemeral local repo. `git push` always
+          # requires a local repo even when pushing a remote SHA that
+          # isn't in the local object DB (the protocol negotiates and
+          # discovers we don't need to send any objects). We don't use
+          # `actions/checkout` for this — it would clone the whole
+          # repo (~hundreds of MB) for what's essentially `git init`.
+          tmp_repo="$(mktemp -d)"
+          trap 'rm -rf "$tmp_repo"' EXIT
+          git -C "$tmp_repo" init -q
+          # Author config is only needed by commit-creating commands; set it
+          # defensively. Values are arbitrary: nothing gets committed here.
+          git -C "$tmp_repo" config user.email canary@auto-sync.local
+          git -C "$tmp_repo" config user.name auto-sync-canary
+
+          # Step c: dry-run push the current staging SHA back to
+          # staging. NOP by construction — the remote tip equals the
+          # SHA we're pushing, so "Everything up-to-date" is the
+          # success path.
+          #
+          # Authentication is checked at the smart-protocol handshake,
+          # BEFORE the dry-run can compute an empty diff. Bad token
+          # → "Authentication failed", exit 128. Good token → exit 0.
+          set +e
+          push_out=$(timeout 30s git -C "$tmp_repo" push --dry-run "$url" "${staging_sha}:refs/heads/staging" 2>&1)
+          push_rc=$?
+          set -e
+
+          if [ "$push_rc" -ne 0 ]; then
+            redacted=$(echo "$push_out" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
+            echo "::error::Token rotation suspected: git push --dry-run against staging failed via the AUTO_SYNC_TOKEN HTTPS auth path (exit $push_rc)." >&2
+            echo "::error::This is the EXACT auth path that actions/checkout + git push use in auto-sync-main-to-staging.yml." >&2
+            echo "::error::Likely cause: AUTO_SYNC_TOKEN was rotated/revoked on Gitea but the repo Actions secret was not updated. Runbook: see header." >&2
+            echo "$redacted" >&2
+            exit 1
+          fi
+
+          echo "git HTTPS auth path: NOP push --dry-run to staging → ${staging_sha:0:8} ✓"
+
+      - name: Summarise canary result
+        # Everything passed — surface a green summary. (Failures
+        # already wrote ::error:: lines and exited above; if we got
+        # here, all three probes passed.)
+        run: |
+          {
+            echo "## Auto-sync canary: GREEN"
+            echo ""
+            echo "AUTO_SYNC_TOKEN is healthy:"
+            echo "- Authenticates as \`${EXPECTED_PERSONA}\` ✓"
+            echo "- Has \`read:repository\` scope on \`${REPO_PATH}\` ✓"
+            echo "- Git HTTPS auth path: no-op dry-run push to \`refs/heads/staging\` succeeds ✓"
+            echo ""
+            echo "Auto-sync main → staging will succeed on the next push to main."
+            echo "If this canary ever goes RED, see the runbook in this workflow's header."
+          } >> "$GITHUB_STEP_SUMMARY"
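A minimal offline sketch of three details the canary steps above rely on: why the tempfile-routed status capture avoids the pollution of the `$(curl ... || echo "000")` shape, how the credential redaction filter behaves, and how the `ls-remote` shape check extracts the SHA. `fake_curl_failure` is a hypothetical stand-in for a failing `curl -w "%{http_code}"` call so the sketch runs without network access; all function names here are illustrative and do not appear in the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a failing `curl -w "%{http_code}"` call:
# it writes "000" without a trailing newline and exits non-zero.
fake_curl_failure() {
  printf '000'
  return 7
}

# Unsafe shape: the fallback echo is appended to what the command
# already wrote, so the captured status becomes "000000".
unsafe_capture() {
  local status
  status=$(fake_curl_failure || echo "000")
  printf '%s' "$status"
}

# Tempfile-routed shape: stdout goes to a file, the exit code is
# ignored, and the fallback applies only when the file is empty.
safe_capture() {
  local code_file status
  code_file="$(mktemp)"
  fake_curl_failure >"$code_file" 2>/dev/null || true
  status=$(cat "$code_file" 2>/dev/null || true)
  rm -f "$code_file"
  if [ -z "$status" ]; then status="000"; fi
  printf '%s' "$status"
}

# Mask credentials embedded in an https://oauth2:<token>@host URL.
redact_url_creds() {
  sed -E 's|oauth2:[^@]+@|oauth2:<redacted>@|g'
}

# Print the SHA from one `git ls-remote` line for refs/heads/staging,
# or fail if the line does not have the expected shape.
parse_staging_sha() {
  local line="$1"
  if ! printf '%s\n' "$line" | grep -qE '^[0-9a-f]{40}[[:space:]]+refs/heads/staging$'; then
    return 1
  fi
  printf '%s\n' "$line" | awk '{print $1}'
}
```

With the stub, `unsafe_capture` yields `000000` and `safe_capture` yields `000`, which is the difference the lint gate is meant to enforce.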
@@ -3,85 +3,138 @@ name: Auto-sync main → staging
 # Reflects every push to `main` back onto `staging` so the
 # staging-as-superset-of-main invariant holds.
 #
-# Background:
+# ============================================================
+# What this workflow does
+# ============================================================
 #
-# `auto-promote-staging.yml` advances main via `git merge --ff-only`
-# + `git push origin main` — that's a clean fast-forward, no merge
-# commit. But manual merges of `staging → main` PRs through the
-# GitHub UI / API create a merge commit on main that staging
-# doesn't have. The next `staging → main` PR then evaluates as
-# "BEHIND" because staging is missing that merge commit, requiring
-# a manual `gh pr update-branch` round-trip.
+# On every push to `main`:
+#   1. Checks if staging already contains main → no-op.
+#   2. Fetches both branches, merges main into staging in the
+#      runner workspace (fast-forward if possible, else
+#      `--no-ff` merge commit).
+#   3. Pushes staging directly to origin via the
+#      `devops-engineer` persona's `AUTO_SYNC_TOKEN`.
 #
-# This happened twice on 2026-04-28 (PRs #2202, #2205, both manual
-# bridges). Each time the bridge needed update-branch + a re-CI
-# round before merging. Operationally annoying and avoidable.
+# Authoritative path: a single `git push origin staging` from
+# inside this workflow is the SSOT for advancing staging after
+# a main push. No PR, no merge queue, no human approval —
+# staging is mechanically maintained as a superset of main.
 #
-# Architecture:
+# `auto-promote-staging.yml` is the reverse-direction
+# counterpart (staging → main, gated on green CI). Together
+# they keep the staging-superset-of-main invariant tight.
 #
-# This repo's `staging` branch is protected by a `merge_queue`
-# ruleset (id 15500102) that blocks ALL direct pushes — no bypass
-# even for org admins or the GitHub Actions integration. Direct
-# `git push origin staging` returns GH013. So instead of pushing
-# directly, this workflow:
+# ============================================================
+# Why direct push (and not "open a PR")
+# ============================================================
 #
-#   1. Checks if main is already in staging's ancestry → no-op.
-#   2. Creates an `auto-sync/main-<sha>` branch from staging.
-#   3. Tries `git merge --ff-only origin/main` → if staging hasn't
-#      diverged this is a clean ff.
-#   4. Otherwise `git merge --no-ff origin/main` to absorb main's
-#      tip while keeping staging's history.
-#   5. Pushes the auto-sync branch.
-#   6. Opens a PR (base=staging, head=auto-sync/main-<sha>) and
-#      enables auto-merge so the merge queue lands it.
+# Pre-2026-05-06 the canonical SCM was GitHub.com, where:
+#   - The `staging` branch had a `merge_queue` ruleset that
+#     blocked ALL direct pushes (no bypass even for org
+#     admins or the GitHub Actions integration).
+#   - Therefore this workflow opened a PR via `gh pr create`
+#     and let auto-merge land it through the queue.
 #
-# This mirrors the path human PRs take through staging — same
-# rules, same gates, no special-case bypass.
+# Post-2026-05-06 the canonical SCM is Gitea
+# (`git.moleculesai.app/molecule-ai/molecule-core`). Gitea:
+#   - Has no `merge_queue` concept.
+#   - Allows direct push to protected branches via per-user
+#     `push_whitelist_usernames` on the branch protection.
+#   - Does not expose a GraphQL endpoint, so `gh pr create`
+#     returns `HTTP 405 Method Not Allowed
+#     (https://git.moleculesai.app/api/graphql)` — the
+#     pre-suspension architecture cannot work on Gitea.
 #
-# Loop safety:
+# The molecule-ai/molecule-core staging branch protection
+# (verified via `GET /api/v1/repos/.../branch_protections`)
+# whitelists `devops-engineer` for direct push. So the
+# correct Gitea-shape architecture is: authenticate as
+# `devops-engineer`, merge locally, push staging directly.
 #
-# `GITHUB_TOKEN`-authored merges (including the merge queue's land
-# of the auto-sync PR) do NOT trigger downstream workflow runs
-# (GitHub Actions safety). So when the auto-sync PR lands on
-# staging, `auto-promote-staging.yml` is NOT triggered by that
-# push. The next developer push to staging triggers auto-promote
-# normally. No loop possible.
+# This is structurally simpler than the GitHub-era PR dance
+# and removes the dependence on `gh` CLI / GraphQL entirely.
 #
-# Concurrency:
+# ============================================================
+# Identity + token (anti-bot-ring per saved-memory
+# `feedback_per_agent_gitea_identity_default`)
+# ============================================================
 #
-# Two pushes to main in quick succession (e.g., manual UI merge
-# immediately followed by auto-promote-staging's ff-merge) could
-# otherwise open two overlapping auto-sync PRs. The concurrency
-# group serializes runs; the second waits for the first to exit.
-# (The first run exits after opening + auto-merge-queueing the PR,
-# not after the merge actually completes — so multiple PRs can be
-# open simultaneously, but the merge queue handles them serially.)
+# This workflow uses `secrets.AUTO_SYNC_TOKEN`, which is a
+# personal access token issued to the `devops-engineer`
+# persona on Gitea — NOT the founder PAT. The bot-ring
+# fingerprint that triggered the GitHub org suspension on
+# 2026-05-06 was characterised by founder PAT acting as CI
+# at machine speed; per-persona identities split the
+# attribution honestly.
+#
+# Token scope on Gitea: repo write. Push target restricted
+# to `staging` (this workflow is the only writer; main is
+# untouched). Compromise blast radius: bounded to staging
+# branch + this repo's read surface.
+#
+# Commits are authored by the persona email
+# `devops-engineer@agents.moleculesai.app` so commit history
+# reflects which automation produced the merge.
+#
+# ============================================================
+# Failure modes & operational notes
+# ============================================================
+#
+# A — staging has commits main doesn't, and the merge
+#     conflicts:
+#     - The `--no-ff` merge step exits non-zero. Workflow
+#       fails red. Operator (devops-engineer or human)
+#       resolves manually:
+#         git fetch origin
+#         git checkout staging
+#         git merge --no-ff origin/main
+#         # resolve conflicts
+#         git push origin staging
+#     - Step summary surfaces the conflict so the failed run
+#       is self-explanatory.
+#
+# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
+#     - `git push` step exits non-zero with `HTTP 401` /
+#       `403`. Step summary surfaces the failed push.
+#     - Re-issue the token from `~/.molecule-ai/personas/`
+#       on the operator host and update the repo Actions
+#       secret. Re-run the workflow.
+#
+# C — staging branch protection no longer whitelists
+#     `devops-engineer`:
+#     - `git push` exits non-zero with a Gitea protected-
+#       branch rejection. Step summary surfaces it.
+#     - Re-add `devops-engineer` to
+#       `push_whitelist_usernames` on the staging
+#       protection (Settings → Branches → staging).
+#
+# D — concurrent push to main while a sync is in flight:
+#     - The `concurrency` group below serialises runs.
+#       The second waits for the first; if main advances
+#       again while we're syncing, the second run picks
+#       up the new tip on its own fetch.
+#
+# ============================================================
+# Loop safety
+# ============================================================
+#
+# The push to staging from this workflow does NOT itself
+# fire a `push: branches: [main]` event (different branch),
+# so there's no risk of self-recursion. `auto-promote-staging.yml`
+# fires on `workflow_run` of CI etc. — it sees the new
+# staging tip on its next gate-completion event, NOT on this
+# push directly. No loop.

 on:
  push:
    branches: [main]
-  # workflow_dispatch lets:
-  #   1. Operators manually backfill a missed sync (e.g. after a manual
-  #      UI merge that the runner missed).
-  #   2. auto-promote-staging.yml's polling tail explicitly invoke us
-  #      after the promote PR lands. This is load-bearing: when the
-  #      merge queue lands a promote-PR merge, the resulting push to
-  #      `main` is "by GITHUB_TOKEN", and per GitHub's no-recursion
-  #      rule (https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow)
-  #      that push event does NOT fire any downstream workflows. The
-  #      `on: push` trigger above is silently dead for the very pattern
-  #      we exist to handle. Verified empirically 2026-05-02 against
-  #      SHA 76c604fb (PR #2437 staging→main): only ONE workflow fired
-  #      (publish-workspace-server-image, dispatched explicitly by
-  #      auto-promote's polling tail with an App token). Every other
-  #      `on: push: branches: [main]` workflow — including this one —
-  #      was suppressed. Until the underlying merge call moves to an
-  #      App token, an explicit dispatch is the only reliable path.
+  # workflow_dispatch lets operators manually backfill a
+  # missed sync (e.g. if AUTO_SYNC_TOKEN was rotated and a
+  # main push slipped through while the secret was stale).
  workflow_dispatch:

 permissions:
  contents: write
-  pull-requests: write

 concurrency:
  group: auto-sync-main-to-staging
@@ -89,26 +142,25 @@ concurrency:

 jobs:
  sync-staging:
-    # ubuntu-latest matches every other workflow in this repo. The
-    # earlier `[self-hosted, macos, arm64]` was a copy-paste artefact
-    # from the molecule-controlplane repo (which IS private and uses a
-    # Mac runner) — molecule-core has no Mac runner registered, so the
-    # job sat unassigned whenever the trigger fired. Verified 2026-05-02:
-    # this is the ONLY workflow in molecule-core/.github/workflows/ with
-    # a non-ubuntu runs-on.
    runs-on: ubuntu-latest
    steps:
-      - name: Checkout staging
+      - name: Checkout staging (with devops-engineer push token)
        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          fetch-depth: 0
          ref: staging
+          # AUTO_SYNC_TOKEN authenticates as the
+          # `devops-engineer` Gitea persona — the only
+          # identity whitelisted for direct push to
+          # staging. See header comment for context.
          token: ${{ secrets.AUTO_SYNC_TOKEN }}

      - name: Configure git author
        run: |
-          git config user.name "github-actions[bot]"
-          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
+          # Per-persona identity, NOT founder PAT.
+          # `feedback_per_agent_gitea_identity_default`.
+          git config user.name "devops-engineer"
+          git config user.email "devops-engineer@agents.moleculesai.app"

      - name: Check if staging already contains main
        id: check
@@ -118,7 +170,7 @@ jobs:
          if git merge-base --is-ancestor origin/main HEAD; then
            echo "needs_sync=false" >> "$GITHUB_OUTPUT"
            {
-              echo "## ✅ No-op"
+              echo "## No-op"
              echo
              echo "staging already contains \`origin/main\` ($(git rev-parse --short=8 origin/main))."
            } >> "$GITHUB_STEP_SUMMARY"
@@ -126,112 +178,78 @@ jobs:
            echo "needs_sync=true" >> "$GITHUB_OUTPUT"
            MAIN_SHORT=$(git rev-parse --short=8 origin/main)
            echo "main_short=${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
-            echo "branch=auto-sync/main-${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
-            echo "::notice::staging is missing main's tip (${MAIN_SHORT}) — opening sync PR"
+            echo "::notice::staging is missing main's tip (${MAIN_SHORT}) — merging in-runner and pushing"
          fi

-      - name: Create auto-sync branch + merge main
+      - name: Merge main into staging (in-runner)
        if: steps.check.outputs.needs_sync == 'true'
-        id: prep
+        id: merge
        run: |
          set -euo pipefail
-          BRANCH="${{ steps.check.outputs.branch }}"
-
-          # If a previous auto-sync run already opened a branch for the
-          # same main sha, prefer reusing it (idempotent behavior on
-          # workflow restart). Force-update from latest staging anyway
-          # so it absorbs any staging-side commits that landed since.
-          git checkout -B "$BRANCH"
-
+          # Already on staging from checkout. Try fast-forward
+          # first (cleanest history); fall back to merge commit
+          # if staging has commits main doesn't.
          if git merge --ff-only origin/main; then
            echo "did_ff=true" >> "$GITHUB_OUTPUT"
-            echo "::notice::Fast-forwarded ${BRANCH} to origin/main"
+            echo "::notice::Fast-forwarded staging to origin/main"
          else
            echo "did_ff=false" >> "$GITHUB_OUTPUT"
-            if ! git merge --no-ff origin/main -m "chore: sync main → staging (auto)"; then
+            if ! git merge --no-ff origin/main \
+                -m "chore: sync main → staging (auto, ${{ steps.check.outputs.main_short }})"; then
              # Hygiene: leave the work tree clean before failing.
              git merge --abort || true
              {
-                echo "## ❌ Conflict"
+                echo "## Conflict"
                echo
                echo "Auto-merge \`main → staging\` failed with conflicts."
-                echo "A human needs to resolve manually."
+                echo "A human (or devops-engineer persona) needs to resolve manually:"
+                echo
+                echo '```'
+                echo "git fetch origin"
+                echo "git checkout staging"
+                echo "git merge --no-ff origin/main"
+                echo "# resolve conflicts"
+                echo "git push origin staging"
+                echo '```'
              } >> "$GITHUB_STEP_SUMMARY"
              exit 1
            fi
          fi

-      - name: Push auto-sync branch
+      - name: Push staging to origin
        if: steps.check.outputs.needs_sync == 'true'
        run: |
          set -euo pipefail
-          # Force-with-lease so a concurrent auto-sync run can't
-          # silently clobber an in-flight branch we just updated. If a
-          # different writer touched the branch, we abort and the next
-          # run picks up the latest state.
-          git push --force-with-lease origin "${{ steps.check.outputs.branch }}"
-
-      - name: Open auto-sync PR + enable auto-merge
-        if: steps.check.outputs.needs_sync == 'true'
-        env:
-          GH_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
-          BRANCH: ${{ steps.check.outputs.branch }}
-          MAIN_SHORT: ${{ steps.check.outputs.main_short }}
-          DID_FF: ${{ steps.prep.outputs.did_ff }}
-        run: |
-          set -euo pipefail
-
-          # Find existing PR for this branch (idempotent on workflow
-          # restart) before creating a new one.
-          PR_NUM=$(gh pr list --head "$BRANCH" --base staging --state open --json number --jq '.[0].number // ""')
-
-          if [ -z "$PR_NUM" ]; then
-            # Body lives in a temp file to keep the multi-line content
-            # out of the YAML block scalar (un-indented newlines inside
-            # an inline shell string break YAML parsing).
-            BODY_FILE=$(mktemp)
-            if [ "$DID_FF" = "true" ]; then
-              TITLE="chore: sync main → staging (auto, ff to ${MAIN_SHORT})"
-              cat > "$BODY_FILE" <<EOFBODY
-          Automated fast-forward of \`staging\` to \`origin/main\` (\`${MAIN_SHORT}\`). Staging has no in-flight commits that diverge from main. Merge queue lands this; no human action needed.
-
-          This PR is auto-generated by \`.github/workflows/auto-sync-main-to-staging.yml\` on every push to \`main\`. It exists because this repo's \`staging\` branch has a \`merge_queue\` ruleset that blocks direct pushes — even from the GitHub Actions integration.
-          EOFBODY
-            else
-              TITLE="chore: sync main → staging (auto, merge ${MAIN_SHORT})"
-              cat > "$BODY_FILE" <<EOFBODY
-          Automated merge of \`origin/main\` (\`${MAIN_SHORT}\`) into \`staging\`. Staging has commits main doesn't, so this is a non-ff merge that absorbs main's tip. Merge queue lands this.
-
-          This PR is auto-generated by \`.github/workflows/auto-sync-main-to-staging.yml\` on every push to \`main\`.
-          EOFBODY
-            fi
-
-            # gh pr create prints the URL on stdout; extract the PR number.
-            PR_URL=$(gh pr create \
-              --base staging \
-              --head "$BRANCH" \
-              --title "$TITLE" \
-              --body-file "$BODY_FILE")
-            PR_NUM=$(echo "$PR_URL" | grep -oE '[0-9]+$' | tail -1)
-            rm -f "$BODY_FILE"
-            echo "::notice::Opened PR #${PR_NUM}"
-          else
-            echo "::notice::Re-using existing PR #${PR_NUM} for ${BRANCH}"
-          fi
-
-          # Enable auto-merge — the merge queue picks it up once
-          # required gates are green. Use --merge for merge commits
-          # (matches the rest of this repo's PR convention).
-          if ! gh pr merge "$PR_NUM" --auto --merge 2>&1; then
-            echo "::warning::Failed to enable auto-merge on PR #${PR_NUM} — operator may need to merge manually."
+          # Direct push to staging. devops-engineer persona is
+          # whitelisted for direct push on the staging branch
+          # protection (Settings → Branches → staging).
+          #
+          # No --force / --force-with-lease: a fast-forward or
+          # legitimate merge commit on top of current staging
+          # is the only thing we'd ever push. If origin/staging
+          # advanced under us (concurrent merge), the push
+          # legitimately rejects and the next run picks up the
+          # new state.
+          if ! git push origin staging; then
+            {
+              echo "## Push rejected"
+              echo
+              echo "Direct push to \`staging\` failed. Likely causes:"
+              echo "- \`AUTO_SYNC_TOKEN\` rotated / wrong scope (HTTP 401/403)"
+              echo "- \`devops-engineer\` no longer in"
+              echo "  \`push_whitelist_usernames\` on the staging"
+              echo "  branch protection (push rejected by pre-receive hook)"
+              echo "- staging advanced concurrently — re-running this"
+              echo "  workflow on the new main tip will pick it up"
+            } >> "$GITHUB_STEP_SUMMARY"
+            exit 1
          fi

          {
-            echo "## ✅ Auto-sync PR opened"
+            echo "## Auto-sync succeeded"
            echo
-            echo "- Branch: \`$BRANCH\`"
-            echo "- PR: #$PR_NUM"
-            echo "- Strategy: $([ "$DID_FF" = "true" ] && echo "ff" || echo "merge commit")"
-            echo
-            echo "Merge queue lands the PR once required gates are green; no human action needed unless gates fail."
+            echo "- staging advanced to: \`$(git rev-parse --short=8 HEAD)\`"
+            echo "- main tip: \`${{ steps.check.outputs.main_short }}\`"
+            echo "- Strategy: $([ "${{ steps.merge.outputs.did_ff }}" = "true" ] && echo "fast-forward" || echo "merge commit")"
+            echo "- Pushed by: \`devops-engineer\` (per-agent persona, anti-bot-ring)"
          } >> "$GITHUB_STEP_SUMMARY"
@@ -57,17 +57,42 @@ jobs:
        id: bump
        if: steps.skip.outputs.skip != 'true'
        env:
-          GH_TOKEN: ${{ github.token }}
+          # Gitea-shape token (act_runner forwards GITHUB_TOKEN as a
+          # short-lived per-run secret with read access to this repo).
+          # We hit `/api/v1/repos/.../pulls?state=closed` directly
+          # because `gh pr list` calls Gitea's GraphQL endpoint, which
+          # returns HTTP 405 (issue #75 / post-#66 sweep).
+          GITEA_TOKEN: ${{ github.token }}
+          REPO: ${{ github.repository }}
+          GITEA_API_URL: ${{ github.server_url }}/api/v1
+          PUSH_SHA: ${{ github.sha }}
        run: |
-          # The merged PR for this push commit. `gh pr list --search` finds
-          # closed PRs whose merge commit matches; we take the first.
-          PR=$(gh pr list --state merged --search "${{ github.sha }}" --json number,labels --jq '.[0]' 2>/dev/null || echo "")
+          # Find the merged PR whose merge_commit_sha matches this push.
+          # Gitea's `/repos/{owner}/{repo}/pulls?state=closed` returns
+          # PRs sorted newest-first; we fetch a single page of up to 50
+          # and jq-filter on `merge_commit_sha == PUSH_SHA`. No further
+          # pagination is needed: auto-tag fires per push to main, so the
+          # matching PR should be among the most recent closures. 50 is
+          # comfortably more than the ~10-20 staging→main promotes that
+          # close in any reasonable window.
+          set -euo pipefail
+          # Fallback assigned outside `$(...)`: replaces curl's stdout, never appends.
+          PRS_JSON=$(curl --fail-with-body -sS \
+            -H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json" \
+            "${GITEA_API_URL}/repos/${REPO}/pulls?state=closed&sort=newest&limit=50" \
+            2>/dev/null) || PRS_JSON="[]"
+          PR=$(printf '%s' "$PRS_JSON" \
+            | jq -c --arg sha "$PUSH_SHA" \
+                '[.[] | select(.merged_at != null and .merge_commit_sha == $sha)] | .[0] // empty')
          if [ -z "$PR" ] || [ "$PR" = "null" ]; then
-            echo "No merged PR found for ${{ github.sha }} — defaulting to patch bump."
+            echo "No merged PR found for ${PUSH_SHA} — defaulting to patch bump."
            echo "kind=patch" >> "$GITHUB_OUTPUT"
            exit 0
          fi
-          LABELS=$(echo "$PR" | jq -r '.labels[].name')
+          # Gitea returns labels under `.labels[].name`, same shape as
+          # GitHub's REST. The previous `gh pr list --json number,labels`
+          # output was identical; jq filter unchanged.
+          LABELS=$(printf '%s' "$PR" | jq -r '.labels[]?.name // empty')
          if echo "$LABELS" | grep -qx 'release:major'; then
            echo "kind=major" >> "$GITHUB_OUTPUT"
          elif echo "$LABELS" | grep -qx 'release:minor'; then
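A hedged sketch of the lookup-and-classify logic in the bump step above. `find_merged_pr` reuses the jq filter from the diff and reads the pulls JSON on stdin; `bump_kind` mirrors the `grep -qx` label checks and reads one label per line. Both function names are hypothetical, and the `patch` fallback for a PR that carries no release label is an assumption, since the hunk is cut off before that branch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: from a JSON array of closed PRs on stdin, print
# the first merged PR whose merge_commit_sha equals the given SHA.
# Prints nothing when no PR matches. Requires jq.
find_merged_pr() {
  local sha="$1"
  jq -c --arg sha "$sha" \
    '[.[] | select(.merged_at != null and .merge_commit_sha == $sha)] | .[0] // empty'
}

# Hypothetical helper: map label names (one per line on stdin) to a
# bump kind. `grep -qx` matches whole lines only, so a label such as
# "release:minority" does not count as "release:minor".
bump_kind() {
  local labels
  labels="$(cat)"
  if printf '%s\n' "$labels" | grep -qx 'release:major'; then
    echo "major"
  elif printf '%s\n' "$labels" | grep -qx 'release:minor'; then
    echo "minor"
  else
    # Assumed default; the original hunk ends before this branch.
    echo "patch"
  fi
}
```

The two helpers compose in the same way as the step itself: the PR object printed by `find_merged_pr` is passed through `jq -r '.labels[]?.name // empty'`, and the resulting lines are piped into `bump_kind`.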
@@ -1,7 +1,7 @@
 name: Block internal-flavored paths

 # Hard CI gate. Internal content (positioning, competitive briefs, sales
-# playbooks, PMM/press drip, draft campaigns) lives in Molecule-AI/internal —
+# playbooks, PMM/press drip, draft campaigns) lives in molecule-ai/internal —
 # this public monorepo must never re-acquire those paths. CEO directive
 # 2026-04-23 after a fleet-wide audit found 79 internal files leaked here.
 #
@@ -135,7 +135,7 @@ jobs:
            echo "::error::Forbidden internal-flavored paths detected:"
            printf "$OFFENDING"
            echo ""
-            echo "These paths belong in Molecule-AI/internal, not this public repo."
+            echo "These paths belong in molecule-ai/internal, not this public repo."
            echo "See docs/internal-content-policy.md for canonical locations."
            echo ""
            echo "If your file is genuinely public-facing (e.g. a blog post"
@@ -19,6 +19,7 @@ on:
    branches: [staging, main]
    paths:
      - 'tools/branch-protection/**'
+      - '.github/workflows/**'
      - '.github/workflows/branch-protection-drift.yml'

 permissions:
@@ -79,3 +80,32 @@ jobs:
          # Repo-admin scope, needed for /branches/:b/protection.
          GH_TOKEN: ${{ secrets.GH_TOKEN_FOR_ADMIN_API }}
        run: bash tools/branch-protection/drift_check.sh
+
+      # Self-test the parity script before running it on the real
+      # workflows — pins the script's classification logic against
+      # synthetic safe/unsafe/missing/unsafe-mix/matrix fixtures so a
+      # regression in the script can't false-pass on the production
+      # workflow audit. Cheap (~0.5s); always runs.
+      - name: Self-test check-name parity script
+        run: bash tools/branch-protection/test_check_name_parity.sh
+
+      # Check-name parity gate (#144 / saved memory
+      # feedback_branch_protection_check_name_parity).
+      #
+      # drift_check.sh asserts the live branch protection matches what
+      # apply.sh would set; check_name_parity.sh closes the orthogonal
+      # gap: it asserts every required check name in apply.sh maps to a
+      # workflow job whose "always emits this status" shape is intact.
+      #
+      # The two checks fail in different scenarios:
+      #
+      #   - drift_check fails → live state was rewritten out-of-band
+      #     (UI click, manual PATCH).
+      #   - check_name_parity fails → an apply.sh required name has no
+      #     emitter, OR the emitting workflow has a top-level paths:
+      #     filter without per-step if-gates (the silent-block shape).
+      #
+      # Cheap (~1s); runs without the admin token because it only reads
+      # apply.sh + .github/workflows/ from the checkout.
+      - name: Run check-name parity gate
+        run: bash tools/branch-protection/check_name_parity.sh
@@ -108,7 +108,7 @@ jobs:
              echo
              echo "One or more canary secrets are unset (\`CANARY_TENANT_URLS\`, \`CANARY_ADMIN_TOKENS\`, \`CANARY_CP_SHARED_SECRET\`)."
              echo "Phase 2 canary fleet has not been stood up yet —"
-              echo "see [canary-tenants.md](https://github.com/Molecule-AI/molecule-controlplane/blob/main/docs/canary-tenants.md)."
+              echo "see [canary-tenants.md](https://git.moleculesai.app/molecule-ai/molecule-controlplane/blob/main/docs/canary-tenants.md)."
              echo
              echo "**Skipped — promote-to-latest will NOT auto-fire.** Dispatch \`promote-latest.yml\` manually when ready."
            } >> "$GITHUB_STEP_SUMMARY"
@@ -87,7 +87,7 @@ jobs:
        run: go mod download
      - if: needs.changes.outputs.platform == 'true'
        run: go build ./cmd/server
-      # CLI (molecli) moved to standalone repo: github.com/Molecule-AI/molecule-cli
+      # CLI (molecli) moved to standalone repo: github.com/molecule-ai/molecule-cli
      - if: needs.changes.outputs.platform == 'true'
        run: go vet ./... || true
      - if: needs.changes.outputs.platform == 'true'
@@ -165,7 +165,7 @@ jobs:
              # Strip the package-import prefix so we can match .coverage-allowlist.txt
              # entries written as paths relative to workspace-server/.
              # Handle both module paths: platform/workspace-server/... and platform/...
-              rel=$(echo "$file" | sed 's|^github.com/Molecule-AI/molecule-monorepo/platform/workspace-server/||; s|^github.com/Molecule-AI/molecule-monorepo/platform/||')
+              rel=$(echo "$file" | sed 's|^github.com/molecule-ai/molecule-monorepo/platform/workspace-server/||; s|^github.com/molecule-ai/molecule-monorepo/platform/||')

              if echo "$ALLOWLIST" | grep -qxF "$rel"; then
                echo "::warning file=workspace-server/$rel::Critical file at ${pct}% coverage (allowlisted, #1823) — fix before expiry."
@@ -235,7 +235,13 @@ jobs:
        run: npx vitest run --coverage
      - name: Upload coverage summary as artifact
        if: needs.changes.outputs.canvas == 'true' && always()
-        uses: actions/upload-artifact@v3 # pinned to v3 for Gitea act_runner v0.6 compatibility (internal#46)
+        # Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses
+        # the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT
+        # implement, surfacing as `GHESNotSupportedError: @actions/artifact
+        # v2.0.0+, upload-artifact@v4+ and download-artifact@v4+ are not
+        # currently supported on GHES`. Drop this pin when Gitea ships
+        # the v4 protocol (tracked: post-Gitea-1.23 followup).
+        uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
        with:
          name: canvas-coverage-${{ github.run_id }}
          path: canvas/coverage/
@@ -243,8 +249,8 @@ jobs:
          if-no-files-found: warn

  # MCP Server + SDK removed from CI — now in standalone repos:
-  # - github.com/Molecule-AI/molecule-mcp-server (npm CI)
-  # - github.com/Molecule-AI/molecule-sdk-python (PyPI CI)
+  # - github.com/molecule-ai/molecule-mcp-server (npm CI)
+  # - github.com/molecule-ai/molecule-sdk-python (PyPI CI)

  # e2e-api job moved to .github/workflows/e2e-api.yml (issue #458).
  # It now has workflow-level concurrency (cancel-in-progress: false) so
@@ -434,5 +440,5 @@ jobs:
          fi

      # SDK + plugin validation moved to standalone repo:
-      # github.com/Molecule-AI/molecule-sdk-python
+      # github.com/molecule-ai/molecule-sdk-python

@@ -1,36 +1,92 @@
 name: CodeQL

-# Controls CodeQL scan triggers for this repo.
+# Stub workflow — CodeQL Action is structurally incompatible with Gitea
+# Actions (post-2026-05-06 SCM migration off GitHub).
 #
-# GitHub's "Code quality" default setup (the UI-configured one) is
-# hardcoded to only scan the default branch — on this repo that's
-# `staging`, so PRs promoting staging→main would otherwise never be
-# scanned. This workflow fills that gap by explicitly scanning both
-# branches on push and PR.
+# Why this is a stub, not a real CodeQL run:
 #
-# Runs on ubuntu-latest (GHA-hosted — public repo, free). GHAS is NOT
-# enabled on this repo, so results are not uploaded to the Security
-# tab — the scan fails the PR check on findings, and the SARIF is
-# kept as a workflow artifact for triage.
+# 1. github/codeql-action/init@v4 hits api.github.com endpoints
+#    (CodeQL CLI bundle download + query-pack registry + telemetry)
+#    that Gitea 1.22.x does NOT proxy. The act_runner has
+#    GITHUB_SERVER_URL=https://git.moleculesai.app correctly set
+#    (per saved memory feedback_act_runner_github_server_url and
+#    /config.yaml on the operator host), but the Gitea API surface
+#    simply does not implement the codeql-action bundle endpoints.
+#    Observed in run 1d/3101 (2026-05-07): "::error::404 page not
+#    found" inside the Initialize CodeQL step, before any analysis.
+#
+# 2. PR #35 attempted to mark `continue-on-error: true` at the JOB
+#    level (correct YAML structure). Gitea 1.22.6 does NOT propagate
+#    job-level continue-on-error to the commit-status API — every
+#    matrix leg still posts `failure` to the status surface, which
+#    keeps OVERALL=failure on every push to main + staging and
+#    blocks visual auto-promote signals (#156).
+#
+# 3. Hongming policy decision (2026-05-07, task #156): CodeQL is
+#    ADVISORY, not blocking, on Gitea Actions. We do not block PR
+#    merge or staging→main promotion on CodeQL findings until we
+#    have a Gitea-compatible static-analysis pipeline.
+#
+# What this stub preserves:
+#
+# - Workflow name `CodeQL` (referenced by auto-promote-staging.yml
+#   line 67 as a workflow_run gate — must stay stable).
+# - Job name template `Analyze (${{ matrix.language }})` and the
+#   3-leg matrix (go, javascript-typescript, python). Branch
+#   protection / required-check parity (#144) keys on these
+#   exact context names.
+# - merge_group + push + pull_request + schedule triggers, so the
+#   merge-queue check name still resolves (per saved memory
+#   feedback_branch_protection_check_name_parity).
+#
+# Re-enabling real analysis (future work):
+#
+# - Option A: self-hosted Semgrep / OpenGrep via a custom action
+#   that doesn't hit api.github.com. Tracked behind #156 follow-up.
+# - Option B: Sonatype Nexus IQ or similar, called from a step
+#   that uses the Gitea-issued token only.
+# - Option C: re-host this workflow on a small GitHub mirror used
+#   ONLY for SAST (push-mirrored from Gitea). Acceptable trade-off
+#   if/when payment is restored on a non-suspended GitHub org —
+#   but per saved memory feedback_no_single_source_of_truth, we
+#   should design for multi-vendor backup, not GitHub-only SAST.
+#
+# Until one of those lands, this stub keeps commit-status green so
+# the auto-promote chain isn't permanently red on a tool we cannot
+# actually run.
+#
+# Security policy: ADVISORY. We accept the residual risk of un-scanned
+# pushes during this window. Compensating controls in place:
+#   - secret-scan.yml runs on every push (active, blocks on hits)
+#   - block-internal-paths.yml blocks forbidden file paths
+#   - lint-curl-status-capture.yml catches one specific class of bug
+#   - branch-protection-drift.yml + the merge_group required-checks
+#     parity keep the gate surface stable
+# These are not equivalent to CodeQL coverage. Status of the
+# replacement plan is tracked in #156.

 on:
  push:
    branches: [main, staging]
  pull_request:
    branches: [main, staging]
-  # GitHub merge queue fires `merge_group` for the queue's pre-merge CI run.
-  # Required so CodeQL Analyze checks get a real result on the queued
-  # commit instead of a false-green. Event only fires once merge queue is
-  # enabled on the target branch — safe to add unconditionally.
+  # Required so the matrix legs emit a real result on the queued
+  # commit instead of a false-green when merge queue is enabled.
+  # Per saved memory feedback_branch_protection_check_name_parity:
+  # path-filtered / matrix workflows MUST emit the protected name
+  # via a job that always runs.
  merge_group:
    types: [checks_requested]
  schedule:
-    # Weekly run picks up findings in code that hasn't been touched.
+    # Weekly heartbeat. Cheap on a stub (the no-op job is ~5s) but
+    # keeps the workflow visible in Gitea's Actions UI so the next
+    # operator notices it's a stub instead of a missing surface.
    - cron: '30 1 * * 0'

-# Workflow-level concurrency: only one CodeQL run per branch/PR at a time.
-# `cancel-in-progress: false` queues new runs so a quick follow-up push
-# doesn't nuke a 45-min analysis mid-flight.
+# Workflow-level concurrency: only one stub run per branch/PR at a
+# time. cancel-in-progress: false because a quick follow-up push
+# shouldn't kill an in-flight run — even though the stub is fast,
+# the contract should match a real CodeQL run for when we re-enable.
 concurrency:
  group: codeql-${{ github.ref }}
  cancel-in-progress: false
@@ -38,13 +94,17 @@ concurrency:
 permissions:
  actions: read
  contents: read
-  # No security-events: write — we don't call the upload API.
+  # No security-events: write. We don't call the upload API, and
+  # GHAS isn't available on Gitea.

 jobs:
  analyze:
+    # Job NAME shape is load-bearing — auto-promote-staging.yml +
+    # branch protection both key on `Analyze (${{ matrix.language }})`.
+    # Do NOT rename without coordinating both surfaces.
    name: Analyze (${{ matrix.language }})
    runs-on: ubuntu-latest
-    timeout-minutes: 45
+    timeout-minutes: 5

    strategy:
      fail-fast: false
@@ -52,68 +112,25 @@ jobs:
        language: [go, javascript-typescript, python]

    steps:
-      - name: Checkout
-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
-
-      # github-app-auth sibling-checkout removed 2026-05-07 (#157):
-      # plugin was dropped + the Dockerfile no longer needs it.
-      # jq is pre-installed on ubuntu-latest — no setup step needed.
-
-      - name: Initialize CodeQL
-        uses: github/codeql-action/init@95e58e9a2cdfd71adc6e0353d5c52f41a045d225 # v4.35.2
-        with:
-          languages: ${{ matrix.language }}
-          # security-extended widens past the default to include the
-          # full security-query set for a public SaaS surface.
-          queries: security-extended
-
-      - name: Autobuild
-        uses: github/codeql-action/autobuild@95e58e9a2cdfd71adc6e0353d5c52f41a045d225 # v4.35.2
-
-      - name: Perform CodeQL Analysis
-        id: analyze
-        uses: github/codeql-action/analyze@95e58e9a2cdfd71adc6e0353d5c52f41a045d225 # v4.35.2
-        with:
-          category: "/language:${{ matrix.language }}"
-          # upload: never — GHAS isn't enabled on this repo, so the
-          # upload API 403s. Write SARIF locally instead.
-          upload: never
-          output: sarif-results/${{ matrix.language }}
-
-      - name: Parse SARIF + fail on findings
-        # The analyze step writes <database>.sarif into the output
-        # directory — database name is the short CodeQL lang id, not
-        # the matrix value (e.g. "javascript-typescript" →
-        # javascript.sarif), so glob rather than hardcode.
-        # Filter to error/warning severity: security-extended emits
-        # "note" rows for informational findings we don't want to fail
-        # the build over.
+      # Single-step stub: log the policy decision + emit success.
+      # The script's last command is an `echo`, so the step exits 0 and
+      # the commit-status API records `success` for all three matrix legs.
+      - name: CodeQL stub (advisory, non-blocking on Gitea)
        shell: bash
        run: |
          set -euo pipefail
-          dir="sarif-results/${{ matrix.language }}"
-          sarif=$(ls "$dir"/*.sarif 2>/dev/null | head -1 || true)
-          if [ -z "$sarif" ] || [ ! -f "$sarif" ]; then
-            echo "::error::No SARIF file found under $dir"
-            ls -la "$dir" 2>/dev/null || true
-            exit 1
-          fi
-          echo "Parsing $sarif"
-          count=$(jq '[.runs[].results[] | select(.level == "error" or .level == "warning")] | length' "$sarif")
-          echo "CodeQL findings (error+warning) for ${{ matrix.language }}: $count"
-          if [ "$count" -gt 0 ]; then
-            echo "::error::CodeQL found $count issues. Details below; full SARIF in the artifact."
-            jq -r '.runs[].results[] | select(.level == "error" or .level == "warning") | "  - [\(.level)] \(.ruleId // "?"): \(.message.text // "(no message)") @ \(.locations[0].physicalLocation.artifactLocation.uri // "?"):\(.locations[0].physicalLocation.region.startLine // "?")"' "$sarif"
-            exit 1
-          fi
-
-      - name: Upload SARIF artifact
-        # Keep SARIF around on success + failure so triagers can diff.
-        # 14-day retention — longer than default 3, short enough not
-        # to bloat quota.
-        if: always()
-        uses: actions/upload-artifact@v3 # pinned to v3 for Gitea act_runner v0.6 compatibility (internal#46)
-        with:
-          name: codeql-sarif-${{ matrix.language }}
-          path: sarif-results/${{ matrix.language }}/
-          retention-days: 14
+          cat <<EOF
+          CodeQL is currently ADVISORY on Gitea Actions (post-2026-05-06).
+          Language matrix leg: ${{ matrix.language }}
+          Reason: github/codeql-action/init@v4 calls api.github.com
+                  bundle endpoints that Gitea 1.22.x does not implement.
+                  Observed: "::error::404 page not found" in the Init
+                  CodeQL step on every prior run.
+          Policy: per Hongming decision 2026-05-07 (#156), CodeQL is
+                  non-blocking until a Gitea-compatible SAST pipeline
+                  lands. See workflow file header for replacement
+                  options + compensating controls.
+          Status: emitting success so auto-promote isn't permanently
+                  red on a tool we cannot actually run today.
+          EOF
+          echo "::notice::CodeQL ${{ matrix.language }} — advisory stub, success."
@@ -12,6 +12,59 @@ name: E2E API Smoke Test
 # spending CI cycles. See the in-job comment on the `e2e-api` job for
 # why this is one job (not two-jobs-sharing-name) and the 2026-04-29
 # PR #2264 incident that drove the consolidation.
+#
+# Parallel-safety (Class B Hongming-owned CICD red sweep, 2026-05-08)
+# -------------------------------------------------------------------
+# Same substrate hazard as PR #98 (handlers-postgres-integration). Our
+# Gitea act_runner runs with `container.network: host` (operator host
+# `/opt/molecule/runners/config.yaml`), which means:
+#
+#   * Two concurrent runs both try to bind their `-p 15432:5432` /
+#     `-p 16379:6379` host ports — the second postgres/redis FATALs
+#     with `Address in use` and `docker run` returns exit 125 with
+#     `Conflict. The container name "/molecule-ci-postgres" is already
+#     in use by container ...`. Verified in run a7/2727 on 2026-05-07.
+#   * The fixed container names `molecule-ci-postgres` / `-redis` (the
+#     pre-fix shape) collide on name AS WELL AS port. The cleanup-with-
+#     `docker rm -f` at the start of the second job KILLS the first
+#     job's still-running postgres/redis.
+#
+# Fix shape (mirrors PR #98's bridge-net pattern, adapted because
+# platform-server is a Go binary on the host, not a containerised
+# step):
+#
+#   1. Unique container names per run:
+#         pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
+#         redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
+#      `${RUN_ID}-${RUN_ATTEMPT}` is unique even across reruns of the
+#      same run_id.
+#   2. Ephemeral host port per run (`-p 0:5432`), then read the actual
+#      bound port via `docker port` and export DATABASE_URL/REDIS_URL
+#      pointing at it. No fixed host-port → no port collision.
+#   3. `127.0.0.1` (NOT `localhost`) in URLs: resolving `localhost` to
+#      IPv6 first was the original flake fixed in #92, and the script
+#      is still IPv6-enabled.
+#   4. `if: always()` cleanup so containers don't leak when test steps
+#      fail.
+#
+# Issue #94 items #2 + #3 (also fixed here):
+#   * Pre-pull `alpine:latest` so the platform-server's provisioner
+#     (`internal/handlers/container_files.go`) can stand up its
+#     ephemeral token-write helper without a registry round-trip.
+#   * Create `molecule-monorepo-net` bridge network if missing so the
+#     provisioner's container.HostConfig {NetworkMode: ...} attach
+#     succeeds.
+# Item #1 (timeouts): evidence on recent runs (77/3191, ae/4270,
+# 0e/2318) shows Postgres ready in 3s, Redis in 1s, Platform in 1s
+# when they DO come up. Timeouts are not the bottleneck; not bumped.
+#
+# Item explicitly NOT fixed here: the `Status back online` test
+# fails because the platform's langgraph workspace template image
+# (ghcr.io/molecule-ai/workspace-template-langgraph:latest) returns
+# 403 Forbidden post-2026-05-06 GitHub org suspension. That is a
+# template-registry resolution issue (ADR-002 / local-build mode) and
+# belongs in a separate change that touches workspace-server, not
+# this workflow file.

 on:
  push:
@@ -78,11 +131,14 @@ jobs:
    runs-on: ubuntu-latest
    timeout-minutes: 15
    env:
-      DATABASE_URL: postgres://dev:dev@localhost:15432/molecule?sslmode=disable
-      REDIS_URL: redis://localhost:16379
+      # Unique per-run container names so concurrent runs on the host-
+      # network act_runner don't collide on name OR port.
+      # `${RUN_ID}-${RUN_ATTEMPT}` stays unique across reruns of the
+      # same run_id. PG_PORT / REDIS_PORT are exported later (after the
+      # `docker port` lookup) since Docker assigns ephemeral host ports.
+      PG_CONTAINER: pg-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
+      REDIS_CONTAINER: redis-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
      PORT: "8080"
-      PG_CONTAINER: molecule-ci-postgres
-      REDIS_CONTAINER: molecule-ci-redis
    steps:
      - name: No-op pass (paths filter excluded this commit)
        if: needs.detect-changes.outputs.api != 'true'
@@ -97,11 +153,53 @@ jobs:
          go-version: 'stable'
          cache: true
          cache-dependency-path: workspace-server/go.sum
+      - name: Pre-pull alpine + ensure provisioner network (Issue #94 items #2 + #3)
+        if: needs.detect-changes.outputs.api == 'true'
+        run: |
+          # Provisioner uses alpine:latest for ephemeral token-write
+          # containers (workspace-server/internal/handlers/container_files.go).
+          # Pre-pull so the first provision in test_api.sh doesn't race
+          # the daemon's pull cache. Idempotent: when the local image is
+          # current, `docker pull` only re-checks the digest.
+          docker pull alpine:latest >/dev/null
+          # Provisioner attaches workspace containers to
+          # molecule-monorepo-net (workspace-server/internal/provisioner/
+          # provisioner.go::DefaultNetwork). The bridge already exists on
+          # the operator host's docker daemon, so `network create` fails
+          # with "already exists"; `|| true` turns that into a no-op.
+          docker network create molecule-monorepo-net >/dev/null 2>&1 || true
+          echo "alpine:latest pre-pulled; molecule-monorepo-net ensured."
      - name: Start Postgres (docker)
        if: needs.detect-changes.outputs.api == 'true'
        run: |
+          # Defensive cleanup — only matches THIS run's container name,
+          # so it cannot kill a sibling run's postgres. (Pre-fix the
+          # name was static and this rm hit other runs' containers.)
          docker rm -f "$PG_CONTAINER" 2>/dev/null || true
-          docker run -d --name "$PG_CONTAINER" -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule -p 15432:5432 postgres:16
+          # `-p 0:5432` requests an ephemeral host port; we read it back
+          # below and export DATABASE_URL.
+          docker run -d --name "$PG_CONTAINER" \
+            -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \
+            -p 0:5432 postgres:16 >/dev/null
+          # Resolve the host-side port assignment. `docker port` prints
+          # `0.0.0.0:NNNN` (and on host-net runners may also print an
+          # IPv6 line — take the first IPv4 line).
+          PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
+          if [ -z "$PG_PORT" ]; then
+            # Fallback: any first line. Some Docker versions print only
+            # one line.
+            PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}')
+          fi
+          if [ -z "$PG_PORT" ]; then
+            echo "::error::Could not resolve host port for $PG_CONTAINER"
+            docker port "$PG_CONTAINER" 5432/tcp || true
+            docker logs "$PG_CONTAINER" || true
+            exit 1
+          fi
+          # 127.0.0.1 (NOT localhost) — IPv6 first-resolve flake (#92).
+          echo "PG_PORT=${PG_PORT}" >> "$GITHUB_ENV"
+          echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
+          echo "Postgres host port: ${PG_PORT}"
          for i in $(seq 1 30); do
            if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then
              echo "Postgres ready after ${i}s"
@@ -116,7 +214,20 @@ jobs:
        if: needs.detect-changes.outputs.api == 'true'
        run: |
          docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
-          docker run -d --name "$REDIS_CONTAINER" -p 16379:6379 redis:7
+          docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null
+          REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
+          if [ -z "$REDIS_PORT" ]; then
+            REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}')
+          fi
+          if [ -z "$REDIS_PORT" ]; then
+            echo "::error::Could not resolve host port for $REDIS_CONTAINER"
+            docker port "$REDIS_CONTAINER" 6379/tcp || true
+            docker logs "$REDIS_CONTAINER" || true
+            exit 1
+          fi
+          echo "REDIS_PORT=${REDIS_PORT}" >> "$GITHUB_ENV"
+          echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV"
+          echo "Redis host port: ${REDIS_PORT}"
          for i in $(seq 1 15); do
            if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then
              echo "Redis ready after ${i}s"
@@ -135,13 +246,15 @@ jobs:
        if: needs.detect-changes.outputs.api == 'true'
        working-directory: workspace-server
        run: |
+          # DATABASE_URL + REDIS_URL exported by the start-postgres /
+          # start-redis steps point at this run's per-run host ports.
          ./platform-server > platform.log 2>&1 &
          echo $! > platform.pid
      - name: Wait for /health
        if: needs.detect-changes.outputs.api == 'true'
        run: |
          for i in $(seq 1 30); do
-            if curl -sf http://localhost:8080/health > /dev/null; then
+            if curl -sf http://127.0.0.1:8080/health > /dev/null; then
              echo "Platform up after ${i}s"
              exit 0
            fi
@@ -185,6 +298,9 @@ jobs:
            kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
          fi
      - name: Stop service containers
+        # always() so containers don't leak when test steps fail. The
+        # cleanup is best-effort: if the container is already gone
+        # (e.g. concurrent rerun race), don't fail the job.
        if: always() && needs.detect-changes.outputs.api == 'true'
        run: |
          docker rm -f "$PG_CONTAINER" 2>/dev/null || true
@@ -22,9 +22,9 @@ on:
  # spending CI cycles. See e2e-api.yml for the rationale on why this
  # is a single job rather than two-jobs-sharing-name.
  push:
-    branches: [main, staging]
+    branches: [main]
  pull_request:
-    branches: [main, staging]
+    branches: [main]
  workflow_dispatch:
  schedule:
    # Weekly on Sunday 08:00 UTC — catches Chrome / Playwright / Next.js
@@ -139,7 +139,11 @@ jobs:

      - name: Upload Playwright report on failure
        if: failure() && needs.detect-changes.outputs.canvas == 'true'
-        uses: actions/upload-artifact@v3 # pinned to v3 for Gitea act_runner v0.6 compatibility (internal#46)
+        # Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses
+        # the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT
+        # implement (see ci.yml upload step for the canonical error
+        # cite). Drop this pin when Gitea ships the v4 protocol.
+        uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
        with:
          name: playwright-report-staging
          path: canvas/playwright-report-staging/
@@ -147,7 +151,8 @@ jobs:

      - name: Upload screenshots on failure
        if: failure() && needs.detect-changes.outputs.canvas == 'true'
-        uses: actions/upload-artifact@v3 # pinned to v3 for Gitea act_runner v0.6 compatibility (internal#46)
+        # Pinned to v3 for Gitea act_runner v0.6 compatibility (see above).
+        uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
        with:
          name: playwright-screenshots
          path: canvas/test-results/
@@ -32,7 +32,7 @@ name: E2E Staging External Runtime

 on:
  push:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/workspace.go'
      - 'workspace-server/internal/handlers/registry.go'
@@ -44,7 +44,7 @@ on:
      - 'tests/e2e/test_staging_external_runtime.sh'
      - '.github/workflows/e2e-staging-external.yml'
  pull_request:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/workspace.go'
      - 'workspace-server/internal/handlers/registry.go'
@@ -20,13 +20,12 @@ name: E2E Staging SaaS (full lifecycle)
 #     via the same paths watcher that e2e-api.yml uses)

 on:
-  # Fire on staging push too — previously this only ran on main, which
-  # meant the most thorough end-to-end test caught regressions AFTER
-  # they shipped to staging (and then to the auto-promote PR). Running
-  # on staging push catches them BEFORE the staging→main promotion
-  # opens, so a green canary into auto-promote is more meaningful.
+  # Trunk-based (Phase 3 of internal#81): main is the only branch.
+  # Previously this fired on staging push too because staging was a
+  # superset of main and ran the gate ahead of auto-promote; with no
+  # staging branch, main is where E2E gates the deploy.
  push:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/registry.go'
      - 'workspace-server/internal/handlers/workspace_provision.go'
@@ -36,7 +35,7 @@ on:
      - 'tests/e2e/test_staging_full_saas.sh'
      - '.github/workflows/e2e-staging-saas.yml'
  pull_request:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/registry.go'
      - 'workspace-server/internal/handlers/workspace_provision.go'
@@ -14,12 +14,42 @@ name: Handlers Postgres Integration
 # self-review caught it took 2 minutes to set up and would have caught
 # the bug at PR-time.
 #
-# This job spins a Postgres service container, applies the migration,
-# and runs `go test -tags=integration` against a live DB. Required
-# check on staging branch protection — backend handler PRs cannot
-# merge without a real-DB regression gate.
+# Why this workflow does NOT use `services: postgres:` (Class B fix)
+# ------------------------------------------------------------------
+# Our act_runner config has `container.network: host` (operator host
+# /opt/molecule/runners/config.yaml), which act_runner applies to BOTH
+# the job container AND every service container. With host-net, two
+# concurrent runs of this workflow both try to bind 0.0.0.0:5432 — the
+# second postgres FATALs with `could not create any TCP/IP sockets:
+# Address in use`, and Docker auto-removes it (act_runner sets
+# AutoRemove:true on service containers). By the time the migrations
+# step runs `psql`, the postgres container is gone, hence
+# `Connection refused` then `failed to remove container: No such
+# container` at cleanup time.
 #
-# Cost: ~30s job (postgres pull from GH cache + go build + 4 tests).
+# A per-job `container.network` override is ignored by act_runner
+# without failing the job; the only trace is the runner-log line
+# `--network and --net in the options will be ignored.` Documented constraint.
+#
+# So we sidestep `services:` entirely. The job container still uses
+# host-net (inherited from runner config; required for cache server
+# discovery on the bridge IP 172.18.0.17:42631). We launch a sibling
+# postgres on the existing `molecule-monorepo-net` bridge with a
+# UNIQUE name per run — `pg-handlers-${RUN_ID}-${RUN_ATTEMPT}` — and
+# read its bridge IP via `docker inspect`. A host-net job container
+# can reach a bridge-net container directly via the bridge IP (verified
+# manually on operator host 2026-05-08).
+#
+# Trade-offs vs. the original `services:` shape:
+#   + No host-port collision; N parallel runs share the bridge cleanly
+#   + `if: always()` cleanup runs even on test-step failure
+#   - Extra steps in the workflow (start + cleanup of the container)
+#   - Requires `molecule-monorepo-net` to exist on the operator host
+#     (it does; declared in docker-compose.yml + docker-compose.infra.yml)
+#
+# Class B Hongming-owned CICD red sweep, 2026-05-08.
+#
+# Cost: ~30s job (postgres pull from cache + go build + 4 tests).

 on:
  push:
@@ -59,20 +89,14 @@ jobs:
    name: Handlers Postgres Integration
    needs: detect-changes
    runs-on: ubuntu-latest
-    services:
-      postgres:
-        image: postgres:15-alpine
-        env:
-          POSTGRES_PASSWORD: test
-          POSTGRES_DB: molecule
-        ports:
-          - 5432:5432
-        # GHA spins this with --health-cmd built in for postgres images.
-        options: >-
-          --health-cmd pg_isready
-          --health-interval 5s
-          --health-timeout 5s
-          --health-retries 10
+    env:
+      # Unique name per run so concurrent jobs don't collide on the
+      # bridge network. ${RUN_ID}-${RUN_ATTEMPT} is unique even across
+      # workflow_dispatch reruns of the same run_id.
+      PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
+      # Bridge network already exists on the operator host (declared
+      # in docker-compose.yml + docker-compose.infra.yml).
+      PG_NETWORK: molecule-monorepo-net
    defaults:
      run:
        working-directory: workspace-server
@@ -89,16 +113,57 @@ jobs:
        with:
          go-version: 'stable'

+      - if: needs.detect-changes.outputs.handlers == 'true'
+        name: Start sibling Postgres on bridge network
+        working-directory: .
+        run: |
+          # Sanity: the bridge network must exist on the operator host.
+          # Hard-fail loud if it doesn't — easier to spot than a silent
+          # auto-create that diverges from the rest of the stack.
+          if ! docker network inspect "${PG_NETWORK}" >/dev/null 2>&1; then
+            echo "::error::Bridge network '${PG_NETWORK}' missing on operator host. Re-run docker-compose.infra.yml or check ops handbook."
+            exit 1
+          fi
+
+          # Defensive: PG_NAME is unique per run_id + run_attempt, but if
+          # a container with this name is somehow left over, wipe it first.
+          docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
+
+          docker run -d \
+            --name "${PG_NAME}" \
+            --network "${PG_NETWORK}" \
+            --health-cmd "pg_isready -U postgres" \
+            --health-interval 5s \
+            --health-timeout 5s \
+            --health-retries 10 \
+            -e POSTGRES_PASSWORD=test \
+            -e POSTGRES_DB=molecule \
+            postgres:15-alpine >/dev/null
+
+          # Read back the bridge IP. Always present immediately after
+          # `docker run -d` for bridge networks.
+          PG_HOST=$(docker inspect "${PG_NAME}" \
+            --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
+          if [ -z "${PG_HOST}" ]; then
+            echo "::error::Could not resolve PG_HOST for ${PG_NAME} on ${PG_NETWORK}"
+            docker logs "${PG_NAME}" || true
+            exit 1
+          fi
+          echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
+          echo "INTEGRATION_DB_URL=postgres://postgres:test@${PG_HOST}:5432/molecule?sslmode=disable" >> "$GITHUB_ENV"
+          echo "Started ${PG_NAME} at ${PG_HOST}:5432"
+
      - if: needs.detect-changes.outputs.handlers == 'true'
        name: Apply migrations to Postgres service
        env:
          PGPASSWORD: test
        run: |
-          # Wait for postgres to actually accept connections (the
-          # GHA --health-cmd is best-effort but psql can still race).
+          # Wait for postgres to actually accept connections. Docker's
+          # health-cmd handles container-side readiness, but the wire
+          # to the bridge IP is best-tested with pg_isready directly.
          for i in {1..15}; do
-            if pg_isready -h localhost -p 5432 -U postgres -q; then break; fi
-            echo "waiting for postgres..."; sleep 2
+            if pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q; then break; fi
+            echo "waiting for postgres at ${PG_HOST}:5432..."; sleep 2
          done

          # Apply every .up.sql in lexicographic order with
@@ -131,7 +196,7 @@ jobs:
          # not fine once a cross-table atomicity test came in.
          set +e
          for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do
-            if psql -h localhost -U postgres -d molecule -v ON_ERROR_STOP=1 \
+            if psql -h "${PG_HOST}" -U postgres -d molecule -v ON_ERROR_STOP=1 \
                  -f "$migration" >/dev/null 2>&1; then
              echo "✓ $(basename "$migration")"
            else
@@ -145,7 +210,7 @@ jobs:
          # fail if any didn't land — that would be a real regression we
          # want loud.
          for tbl in delegations workspaces activity_logs pending_uploads; do
-            if ! psql -h localhost -U postgres -d molecule -tA \
+            if ! psql -h "${PG_HOST}" -U postgres -d molecule -tA \
                -c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \
                | grep -q 1; then
              echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless"
@@ -156,16 +221,32 @@ jobs:

      - if: needs.detect-changes.outputs.handlers == 'true'
        name: Run integration tests
-        env:
-          INTEGRATION_DB_URL: postgres://postgres:test@localhost:5432/molecule?sslmode=disable
        run: |
+          # INTEGRATION_DB_URL is exported by the start-postgres step;
+          # points at the per-run bridge IP, not 127.0.0.1, so concurrent
+          # workflow runs don't fight over a host-net 5432 port.
          go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_"

-      - if: needs.detect-changes.outputs.handlers == 'true' && failure()
+      - if: failure() && needs.detect-changes.outputs.handlers == 'true'
        name: Diagnostic dump on failure
        env:
          PGPASSWORD: test
        run: |
-          echo "::group::delegations table state"
-          psql -h localhost -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
+          echo "::group::postgres container status"
+          docker ps -a --filter "name=${PG_NAME}" --format '{{.Status}} {{.Names}}' || true
+          docker logs "${PG_NAME}" 2>&1 | tail -50 || true
          echo "::endgroup::"
+          echo "::group::delegations table state"
+          psql -h "${PG_HOST}" -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
+          echo "::endgroup::"
+
+      - if: always() && needs.detect-changes.outputs.handlers == 'true'
+        name: Stop sibling Postgres
+        working-directory: .
+        run: |
+          # always() so containers don't leak when migrations or tests
+          # fail. The cleanup is best-effort: if the container is
+          # already gone (e.g. concurrent rerun race), don't fail the job.
+          docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
+          echo "Cleaned up ${PG_NAME}"
+
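Note on the readiness loop in the migration step: it breaks out on the first successful `pg_isready`, but if all 15 attempts fail it falls through to the migrations anyway. A minimal sketch of a bounded wait that returns non-zero on exhaustion, assuming a hypothetical helper name `wait_until` (not part of this PR):

```shell
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or ATTEMPTS is
# exhausted. Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
  wu_attempts="$1"
  wu_delay="$2"
  shift 2
  wu_i=1
  while [ "$wu_i" -le "$wu_attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${wu_i}/${wu_attempts} failed: $*" >&2
    # No point sleeping after the final attempt.
    if [ "$wu_i" -lt "$wu_attempts" ]; then
      sleep "$wu_delay"
    fi
    wu_i=$((wu_i + 1))
  done
  return 1
}

# Possible use in the migration step (PG_HOST comes from the start step):
#   wait_until 15 2 pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q \
#     || { echo "::error::postgres never became ready"; exit 1; }
```

With this shape a Postgres that never comes up fails the step at the wait, rather than surfacing later as a run of migration errors.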
@@ -98,6 +98,55 @@ jobs:
      # github-app-auth sibling-checkout removed 2026-05-07 (#157):
      # the plugin was dropped + Dockerfile.tenant no longer COPYs it.

+      # Pre-clone manifest deps before docker compose builds the tenant
+      # image (Task #173 followup — same pattern as
+      # publish-workspace-server-image.yml's "Pre-clone manifest deps"
+      # step).
+      #
+      # Why pre-clone here too: tests/harness/compose.yml builds tenant-alpha
+      # and tenant-beta from workspace-server/Dockerfile.tenant with
+      # context=../.. (repo root). That Dockerfile expects
+      # .tenant-bundle-deps/{workspace-configs-templates,org-templates,plugins}
+      # to be present at build context root (post-#173 it COPYs from there
+      # instead of running an in-image clone — the in-image clone failed
+      # with "could not read Username for https://git.moleculesai.app"
+      # because there's no auth path inside the build sandbox).
+      #
+      # Without this step harness-replays fails before any replay runs,
+      # with `failed to calculate checksum of ref ...
+      # "/.tenant-bundle-deps/plugins": not found`. Caught by run #892
+      # (main, 2026-05-07T20:28:53Z) and run #964 (staging — same
+      # symptom, different root cause: staging still has the in-image
+      # clone path, hits the auth error directly).
+      #
+      # Token shape matches publish-workspace-server-image.yml: AUTO_SYNC_TOKEN
+      # is the devops-engineer persona PAT, NOT the founder PAT (per
+      # `feedback_per_agent_gitea_identity_default`). clone-manifest.sh
+      # embeds it as basic-auth for the duration of the clones and strips
+      # .git directories — the token never enters the resulting image.
+      - name: Pre-clone manifest deps
+        if: needs.detect-changes.outputs.run == 'true'
+        env:
+          MOLECULE_GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
+        run: |
+          set -euo pipefail
+          if [ -z "${MOLECULE_GITEA_TOKEN}" ]; then
+            echo "::error::AUTO_SYNC_TOKEN secret is empty — register the devops-engineer persona PAT in repo Actions secrets"
+            exit 1
+          fi
+          mkdir -p .tenant-bundle-deps
+          bash scripts/clone-manifest.sh \
+            manifest.json \
+            .tenant-bundle-deps/workspace-configs-templates \
+            .tenant-bundle-deps/org-templates \
+            .tenant-bundle-deps/plugins
+          # Log per-directory counts so a partial clone is visible in the
+          # job log. This only reports; it does not fail on low counts.
+          ws_count=$(find .tenant-bundle-deps/workspace-configs-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
+          org_count=$(find .tenant-bundle-deps/org-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
+          plugins_count=$(find .tenant-bundle-deps/plugins -mindepth 1 -maxdepth 1 -type d | wc -l)
+          echo "Cloned: ws=$ws_count org=$org_count plugins=$plugins_count"
+
      - name: Install Python deps for replays
        # peer-discovery-404 (and future replays) eval Python against the
        # running tenant — importing workspace/a2a_client.py pulls in
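Note on the count check in "Pre-clone manifest deps": the `find | wc -l` lines only echo the counts. If a hard failure on an empty dependency directory is wanted at this level too, one possible sketch follows, assuming a hypothetical helper `require_dirs` (not part of this PR; clone-manifest.sh may already cover this):

```shell
# require_dirs DIR
# Hypothetical helper: prints the number of immediate subdirectories of
# DIR and returns non-zero if DIR is missing or has none.
require_dirs() {
  rd_dir="$1"
  if [ ! -d "$rd_dir" ]; then
    echo "missing: $rd_dir" >&2
    return 1
  fi
  # tr strips the padding some wc implementations add around the number.
  rd_count=$(find "$rd_dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
  if [ "$rd_count" -eq 0 ]; then
    echo "empty: $rd_dir" >&2
    return 1
  fi
  echo "$rd_count"
}

# Possible use after clone-manifest.sh:
#   plugins_count=$(require_dirs .tenant-bundle-deps/plugins) || exit 1
```

This only guards against a completely empty directory; comparing against the expected counts from manifest.json stays with the clone script.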
@@ -1,14 +1,25 @@
 name: pr-guards

-# Thin caller that delegates to the molecule-ci reusable guard. Today
-# the guard is just "disable auto-merge when a new commit is pushed
-# after auto-merge was enabled" — added 2026-04-27 after PR #2174
-# auto-merged with only its first commit because the second commit
-# was pushed after the merge queue had locked the PR's SHA.
+# PR-time guards. Today the only guard is "disable auto-merge when a
+# new commit is pushed after auto-merge was enabled" — added 2026-04-27
+# after PR #2174 auto-merged with only its first commit because the
+# second commit was pushed after the merge queue had locked the PR's
+# SHA.
 #
-# When more PR-time guards land in molecule-ci, add them here as
-# additional jobs that share the same pull_request:synchronize
-# trigger.
+# Why this is inlined (not delegated to molecule-ci's reusable
+# workflow): the reusable workflow uses `gh pr merge --disable-auto`,
+# which calls GitHub's GraphQL API. Gitea has no GraphQL endpoint and
+# returns HTTP 405 on /api/graphql, so the job failed on every Gitea
+# PR push since the 2026-05-06 migration. Gitea also has no `--auto`
+# merge primitive that this job could be acting on, so the right
+# behaviour on Gitea is "no-op + green status" — not a 405.
+#
+# Inlining (vs. an `if:` on the `uses:` line) keeps the job ALWAYS
+# running, which matters for branch protection: required-check names
+# need a job that emits SUCCESS terminal state, not SKIPPED. See
+# `feedback_branch_protection_check_name_parity` and `feedback_pr_merge_safety_guards`.
+#
+# Issue #88 item 1.

 on:
  pull_request:
@@ -19,4 +30,34 @@ permissions:

 jobs:
  disable-auto-merge-on-push:
-    uses: Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
+    runs-on: ubuntu-latest
+    steps:
+      # Detect Gitea Actions. act_runner sets GITEA_ACTIONS=true in the
+      # step env on every job. Belt-and-suspenders: also check the repo
+      # url's host, which is independent of any runner-side env config
+      # (covers a future Gitea host where the env var is forgotten).
+      - name: Detect runner host
+        id: host
+        run: |
+          if [[ "${GITEA_ACTIONS:-}" == "true" ]] || [[ "${{ github.server_url }}" == *moleculesai.app* ]] || [[ "${{ github.event.repository.html_url }}" == *moleculesai.app* ]]; then
+            echo "is_gitea=true" >> "$GITHUB_OUTPUT"
+            echo "::notice::Gitea Actions detected — auto-merge gating is not applicable here (Gitea has no --auto merge primitive). Job will no-op."
+          else
+            echo "is_gitea=false" >> "$GITHUB_OUTPUT"
+          fi
+
+      - name: Disable auto-merge (GitHub only)
+        if: steps.host.outputs.is_gitea != 'true'
+        env:
+          GH_TOKEN: ${{ github.token }}
+          PR: ${{ github.event.pull_request.number }}
+          REPO: ${{ github.repository }}
+          NEW_SHA: ${{ github.event.pull_request.head.sha }}
+        run: |
+          set -eu
+          gh pr merge "$PR" --disable-auto -R "$REPO" || true
+          gh pr comment "$PR" -R "$REPO" --body "🔒 Auto-merge disabled — new commit (\`${NEW_SHA:0:7}\`) pushed after auto-merge was enabled. The merge queue locks SHAs at entry, so subsequent pushes can race. Verify the new commit and re-enable with \`gh pr merge --auto\`."
+
+      - name: Gitea no-op
+        if: steps.host.outputs.is_gitea == 'true'
+        run: echo "Gitea Actions — auto-merge gating not applicable; no-op (job intentionally green so branch protection's required-check name lands SUCCESS)."
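Note on the "Detect runner host" step: the three-way check (runner env var, server URL, repository URL) can be read as one predicate. A small sketch with the inputs passed as arguments so it can be exercised outside a runner; the function name `is_gitea` is hypothetical, and the `moleculesai.app` host match is taken from the step above:

```shell
# is_gitea GITEA_ACTIONS_VALUE SERVER_URL REPO_HTML_URL
# Hypothetical helper: returns 0 when any one of the three signals
# indicates a Gitea host, 1 otherwise.
is_gitea() {
  # Signal 1: act_runner exports GITEA_ACTIONS=true.
  if [ "$1" = "true" ]; then
    return 0
  fi
  # Signals 2 and 3: host match on the server or repository URL.
  case "$2" in
    *moleculesai.app*) return 0 ;;
  esac
  case "$3" in
    *moleculesai.app*) return 0 ;;
  esac
  return 1
}

# Possible use inside the step:
#   if is_gitea "${GITEA_ACTIONS:-}" "$SERVER_URL" "$REPO_URL"; then
#     echo "is_gitea=true" >> "$GITHUB_OUTPUT"
#   fi
```

Passing the URLs through `env:` into shell variables (here the assumed `SERVER_URL` and `REPO_URL`) also keeps expression interpolation out of the script body.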
@@ -25,7 +25,7 @@ name: publish-runtime
 #   3. Publishes to PyPI via the PyPA Trusted Publisher action (OIDC).
 #      No static API token is stored — PyPI verifies the workflow's
 #      OIDC claim against the trusted-publisher config registered for
-#      molecule-ai-workspace-runtime (Molecule-AI/molecule-core,
+#      molecule-ai-workspace-runtime (molecule-ai/molecule-core,
 #      publish-runtime.yml, environment pypi-publish).
 #
 # After publish: the 8 template repos pick up the new version on their
@@ -166,7 +166,7 @@ jobs:

      - name: Publish to PyPI (Trusted Publisher / OIDC)
        # PyPI side is configured: project molecule-ai-workspace-runtime →
-        # publisher Molecule-AI/molecule-core, workflow publish-runtime.yml,
+        # publisher molecule-ai/molecule-core, workflow publish-runtime.yml,
        # environment pypi-publish. The action mints a short-lived OIDC
        # token and exchanges it for a PyPI upload credential — no static
        # API token in this repo's secrets.
@@ -37,6 +37,7 @@ on:
      - 'workspace-server/**'
      - 'canvas/**'
      - 'manifest.json'
+      - 'scripts/**'
      - '.github/workflows/publish-workspace-server-image.yml'
  workflow_dispatch:

@@ -74,33 +75,87 @@ jobs:
      # plugin was dropped + workspace-server/Dockerfile no longer
      # COPYs it.

-      - name: Configure AWS credentials for ECR
-        # GHCR was the pre-suspension target; the molecule-ai org on
-        # GitHub got swept 2026-05-06 and ghcr.io/molecule-ai/* is no
-        # longer reachable. Post-suspension target is the operator's
-        # ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
-        # molecule-ai/*), which already hosts platform-tenant +
-        # workspace-template-* + runner-base images. AWS creds come
-        # from the AWS_ACCESS_KEY_ID/SECRET secrets bound to the
-        # molecule-cp IAM user. Closes #161.
-        uses: aws-actions/configure-aws-credentials@v4
-        with:
-          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
-          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
-          aws-region: us-east-2
-
-      - name: Log in to ECR
-        id: ecr-login
-        uses: aws-actions/amazon-ecr-login@v2
-
-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0
+      # ECR auth + buildx setup are now inline in each build step
+      # below (Task #173, 2026-05-07).
+      #
+      # Why moved inline: aws-actions/configure-aws-credentials@v4 +
+      # aws-actions/amazon-ecr-login@v2 + docker/setup-buildx-action
+      # all left auth state in places that the actual `docker push`
+      # couldn't see on Gitea Actions:
+      #   - The actions wrote to a step-scoped DOCKER_CONFIG path
+      #     that didn't survive into subsequent shell steps.
+      #   - Buildx couldn't bridge the runner container ↔
+      #     operator-host docker daemon auth gap (401 on the
+      #     docker-container driver, "no basic auth credentials"
+      #     with the action-driven login).
+      #
+      # Doing AWS+ECR auth inline (`aws ecr get-login-password |
+      # docker login`) in the same shell step as `docker build` +
+      # `docker push` is the operator-host manual approach, mapped
+      # 1:1 into CI. Auth state is guaranteed to live in the env that
+      # `docker push` actually runs from.
+      #
+      # Post-suspension target is the operator's ECR org
+      # (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/*),
+      # which already hosts platform-tenant + workspace-template-* +
+      # runner-base images. AWS creds come from the
+      # AWS_ACCESS_KEY_ID/SECRET secrets bound to the molecule-cp
+      # IAM user. Closes #161.

      - name: Compute tags
        id: tags
        run: |
          echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"

+      # Pre-clone manifest deps before docker build (Task #173 fix).
+      #
+      # Why pre-clone: post-2026-05-06, every workspace-template-* repo on
+      # Gitea (codex, crewai, deepagents, gemini-cli, langgraph) plus all
+      # 7 org-template-* repos are private. The pre-fix Dockerfile.tenant
+      # ran `git clone` inside an in-image stage, which had no auth path
+      # — every CI build failed with "fatal: could not read Username for
+      # https://git.moleculesai.app". For weeks, every workspace-server
+      # rebuild required a manual operator-host push. Now we clone in the
+      # trusted CI context (where AUTO_SYNC_TOKEN is naturally available)
+      # and Dockerfile.tenant just COPYs from .tenant-bundle-deps/.
+      #
+      # Token shape: AUTO_SYNC_TOKEN is the devops-engineer persona PAT
+      # (see /etc/molecule-bootstrap/agent-secrets.env). Per saved memory
+      # `feedback_per_agent_gitea_identity_default`, every CI surface uses
+      # a per-persona token, never the founder PAT. clone-manifest.sh
+      # embeds it as basic-auth (oauth2:<token>) for the duration of the
+      # clones, then strips .git directories — the token never enters
+      # the resulting image.
+      #
+      # Idempotent: if a re-run finds populated dirs, clone-manifest.sh
+      # skips them; safe to retrigger via path-filter or workflow_dispatch.
+      - name: Pre-clone manifest deps
+        env:
+          MOLECULE_GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
+        run: |
+          set -euo pipefail
+          if [ -z "${MOLECULE_GITEA_TOKEN}" ]; then
+            echo "::error::AUTO_SYNC_TOKEN secret is empty — register the devops-engineer persona PAT in repo Actions secrets"
+            exit 1
+          fi
+          mkdir -p .tenant-bundle-deps
+          bash scripts/clone-manifest.sh \
+            manifest.json \
+            .tenant-bundle-deps/workspace-configs-templates \
+            .tenant-bundle-deps/org-templates \
+            .tenant-bundle-deps/plugins
+          # Log per-directory counts so a partial clone is visible in the
+          # job log. This only reports; it does not fail on low counts.
+          ws_count=$(find .tenant-bundle-deps/workspace-configs-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
+          org_count=$(find .tenant-bundle-deps/org-templates -mindepth 1 -maxdepth 1 -type d | wc -l)
+          plugins_count=$(find .tenant-bundle-deps/plugins -mindepth 1 -maxdepth 1 -type d | wc -l)
+          echo "Cloned: ws=$ws_count org=$org_count plugins=$plugins_count"
+          # Expected counts follow manifest.json (9 ws / 7 org / 21
+          # plugins as of 2026-05-07). The find counts above only report
+          # what actually landed on disk; the authoritative fail-fast is
+          # clone-manifest.sh's own EXPECTED vs CLONED check (line ~95),
+          # which catches a clone step that regresses silently.
+
      # Canary-gated release flow:
      #   - This step always publishes :staging-<sha> + :staging-latest.
      #   - On staging push, staging-CP picks up :staging-latest immediately
@@ -126,58 +181,82 @@ jobs:
      # were running pre-RFC code. Adding the staging trigger above closes
      # that gap. Earlier 2026-04-24 incident: a static :staging-<sha> pin
      # drifted 10 days behind staging — same class of bug, different
-      # mechanism.
-      - name: Build & push platform image to GHCR (staging-<sha> + staging-latest)
-        uses: docker/build-push-action@bcafcacb16a39f128d818304e6c9c0c18556b85f # v7.1.0
-        with:
-          context: .
-          file: ./workspace-server/Dockerfile
-          platforms: linux/amd64
-          push: true
-          tags: |
-            ${{ env.IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
-            ${{ env.IMAGE_NAME }}:staging-latest
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
-          # GIT_SHA bakes into the Go binary via -ldflags so /buildinfo
-          # returns it at runtime — see Dockerfile + buildinfo/buildinfo.go.
-          # This is the same value as the OCI revision label below; passing
-          # it twice is intentional, the OCI label is for registry tooling
-          # while /buildinfo is for the redeploy verification step.
-          build-args: |
-            GIT_SHA=${{ github.sha }}
-          labels: |
-            org.opencontainers.image.source=https://github.com/${{ github.repository }}
-            org.opencontainers.image.revision=${{ github.sha }}
-            org.opencontainers.image.description=Molecule AI platform (Go API server) — pending canary verify
+      # mechanism. ECR repo molecule-ai/platform created 2026-05-07.
+      # Build + push platform image with plain `docker` (no buildx).
+      # GIT_SHA bakes into the Go binary via -ldflags so /buildinfo
+      # returns it at runtime — see Dockerfile + buildinfo/buildinfo.go.
+      # The OCI revision label below carries the same value for registry
+      # tooling; the duplication is intentional.
+      - name: Build & push platform image to ECR (staging-<sha> + staging-latest)
+        env:
+          IMAGE_NAME: ${{ env.IMAGE_NAME }}
+          TAG_SHA: staging-${{ steps.tags.outputs.sha }}
+          TAG_LATEST: staging-latest
+          GIT_SHA: ${{ github.sha }}
+          REPO: ${{ github.repository }}
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          AWS_DEFAULT_REGION: us-east-2
+        run: |
+          set -euo pipefail
+          # ECR auth in-step so config.json is populated in the same
+          # shell env that runs `docker push`. ECR get-login-password
+          # tokens last 12h, plenty for a single-step build+push.
+          ECR_REGISTRY="${IMAGE_NAME%%/*}"
+          aws ecr get-login-password --region us-east-2 | \
+            docker login --username AWS --password-stdin "${ECR_REGISTRY}"
+          docker build \
+            --file ./workspace-server/Dockerfile \
+            --build-arg GIT_SHA="${GIT_SHA}" \
+            --label "org.opencontainers.image.source=https://github.com/${REPO}" \
+            --label "org.opencontainers.image.revision=${GIT_SHA}" \
+            --label "org.opencontainers.image.description=Molecule AI platform (Go API server) — pending canary verify" \
+            --tag "${IMAGE_NAME}:${TAG_SHA}" \
+            --tag "${IMAGE_NAME}:${TAG_LATEST}" \
+            .
+          docker push "${IMAGE_NAME}:${TAG_SHA}"
+          docker push "${IMAGE_NAME}:${TAG_LATEST}"
+
+      # Canvas uses same-origin fetches. The tenant Go platform
+      # reverse-proxies /cp/* to the SaaS CP via its CP_UPSTREAM_URL
+      # env; the tenant's /canvas/viewport, /approvals/pending,
+      # /org/templates etc. live on the tenant platform itself.
+      # Both legs share one origin (the tenant subdomain) so
+      # PLATFORM_URL="" forces canvas to fetch paths as relative,
+      # which land same-origin.
+      #
+      # Self-hosted / private-label deployments override this at
+      # build time with a specific backend (e.g. local dev:
+      # NEXT_PUBLIC_PLATFORM_URL=http://localhost:8080).
+      - name: Build & push tenant image to ECR (staging-<sha> + staging-latest)
+        env:
+          TENANT_IMAGE_NAME: ${{ env.TENANT_IMAGE_NAME }}
+          TAG_SHA: staging-${{ steps.tags.outputs.sha }}
+          TAG_LATEST: staging-latest
+          GIT_SHA: ${{ github.sha }}
+          REPO: ${{ github.repository }}
+          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          AWS_DEFAULT_REGION: us-east-2
+        run: |
+          set -euo pipefail
+          # Re-login: the platform-image step's docker login wrote to
+          # the same config.json, so this is technically redundant — but
+          # making each push step self-contained keeps the workflow
+          # robust to step reordering / future extraction.
+          ECR_REGISTRY="${TENANT_IMAGE_NAME%%/*}"
+          aws ecr get-login-password --region us-east-2 | \
+            docker login --username AWS --password-stdin "${ECR_REGISTRY}"
+          docker build \
+            --file ./workspace-server/Dockerfile.tenant \
+            --build-arg NEXT_PUBLIC_PLATFORM_URL= \
+            --build-arg GIT_SHA="${GIT_SHA}" \
+            --label "org.opencontainers.image.source=https://github.com/${REPO}" \
+            --label "org.opencontainers.image.revision=${GIT_SHA}" \
+            --label "org.opencontainers.image.description=Molecule AI tenant platform + canvas — pending canary verify" \
+            --tag "${TENANT_IMAGE_NAME}:${TAG_SHA}" \
+            --tag "${TENANT_IMAGE_NAME}:${TAG_LATEST}" \
+            .
+          docker push "${TENANT_IMAGE_NAME}:${TAG_SHA}"
+          docker push "${TENANT_IMAGE_NAME}:${TAG_LATEST}"

-      - name: Build & push tenant image to GHCR (staging-<sha> + staging-latest)
-        uses: docker/build-push-action@bcafcacb16a39f128d818304e6c9c0c18556b85f # v7.1.0
-        with:
-          context: .
-          file: ./workspace-server/Dockerfile.tenant
-          platforms: linux/amd64
-          push: true
-          tags: |
-            ${{ env.TENANT_IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
-            ${{ env.TENANT_IMAGE_NAME }}:staging-latest
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
-          # Canvas uses same-origin fetches. The tenant Go platform
-          # reverse-proxies /cp/* to the SaaS CP via its CP_UPSTREAM_URL
-          # env; the tenant's /canvas/viewport, /approvals/pending,
-          # /org/templates etc. live on the tenant platform itself.
-          # Both legs share one origin (the tenant subdomain) so
-          # PLATFORM_URL="" forces canvas to fetch paths as relative,
-          # which land same-origin.
-          #
-          # Self-hosted / private-label deployments override this at
-          # build time with a specific backend (e.g. local dev:
-          # NEXT_PUBLIC_PLATFORM_URL=http://localhost:8080).
-          build-args: |
-            NEXT_PUBLIC_PLATFORM_URL=
-            GIT_SHA=${{ github.sha }}
-          labels: |
-            org.opencontainers.image.source=https://github.com/${{ github.repository }}
-            org.opencontainers.image.revision=${{ github.sha }}
-            org.opencontainers.image.description=Molecule AI tenant platform + canvas — pending canary verify
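Note on `ECR_REGISTRY="${IMAGE_NAME%%/*}"` in both build steps: the expansion strips everything from the first `/` onward, which yields the registry host only when the image name actually carries one. A sketch that makes the assumption explicit, using a hypothetical helper `registry_of` (not part of this PR):

```shell
# registry_of IMAGE_NAME
# Hypothetical helper: prints the registry host of a fully qualified
# image name. Returns 1 when the name has no "/" at all, because
# "${name%%/*}" would then echo the whole name back unchanged.
registry_of() {
  case "$1" in
    */*) ;;
    *)
      echo "no registry prefix in: $1" >&2
      return 1
      ;;
  esac
  # %%/* removes the longest suffix starting at the first slash.
  printf '%s\n' "${1%%/*}"
}

# Possible use before docker login:
#   ECR_REGISTRY=$(registry_of "${IMAGE_NAME}") || exit 1
```

For the ECR names used in this workflow the first path segment is the registry host, so the inline expansion and this helper agree.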
@@ -9,7 +9,7 @@ name: redeploy-tenants-on-main
 #
 # This workflow closes the gap by calling the control-plane admin
 # endpoint that performs a canary-first, batched, health-gated rolling
-# redeploy across every live tenant. Implemented in Molecule-AI/
+# redeploy across every live tenant. Implemented in molecule-ai/
 # molecule-controlplane as POST /cp/admin/tenants/redeploy-fleet
 # (feat/tenant-auto-redeploy, landing alongside this workflow).
 #
@@ -146,7 +146,7 @@ jobs:

      - name: Call CP redeploy-fleet
        # CP_ADMIN_API_TOKEN must be set as a repo/org secret on
-        # Molecule-AI/molecule-core, matching the staging/prod CP's
+        # molecule-ai/molecule-core, matching the staging/prod CP's
        # CP_ADMIN_API_TOKEN env. Stored in Railway, mirrored to this
        # repo's secrets for CI.
        env:
@@ -36,7 +36,7 @@ on:
  workflow_run:
    workflows: ['publish-workspace-server-image']
    types: [completed]
-    branches: [staging]
+    branches: [main]
  workflow_dispatch:
    inputs:
      target_tag:
@@ -97,7 +97,7 @@ jobs:

      - name: Call staging-CP redeploy-fleet
        # CP_STAGING_ADMIN_API_TOKEN must be set as a repo/org secret
-        # on Molecule-AI/molecule-core, matching staging-CP's
+        # on molecule-ai/molecule-core, matching staging-CP's
        # CP_ADMIN_API_TOKEN env var (visible in Railway controlplane
        # / staging environment). Stored separately from the prod
        # CP_ADMIN_API_TOKEN so a leak of one doesn't auth the other.
@@ -1,16 +1,99 @@
 name: Retarget main PRs to staging

-# Mechanical enforcement of SHARED_RULES rule 8 ("Staging-first workflow, no
-# exceptions"). When a bot opens a PR against main, retarget it to staging
-# automatically and leave an explanatory comment. Human CEO-authored PRs (the
-# staging→main promotion PR, etc.) are left alone — they're the authorised
-# exception to the rule.
+# Mechanical enforcement of SHARED_RULES rule 8 ("Staging-first
+# workflow, no exceptions"). When a bot opens a PR against `main`,
+# retarget it to `staging` automatically and leave an explanatory
+# comment. Human / CEO-authored PRs (the staging→main promotion
+# PRs, etc.) are left alone — they're the authorised exception
+# to the rule.
 #
-# Why an Action instead of only a prompt rule: prompt rules depend on every
-# role's system-prompt.md staying in sync. Today 5 of 8 engineer roles
-# (core-be, core-fe, app-fe, app-qa, devops-engineer) don't have the
-# staging-first section — the bot keeps opening PRs to main. An Action
-# enforces the invariant regardless of prompt drift.
+# ============================================================
+# What this workflow does
+# ============================================================
+#
+# On `pull_request_target` opened/reopened against `main`:
+#   1. If the PR head is `staging`, skip (the auto-promote PRs
+#      MUST stay base=main).
+#   2. If the PR author is a bot, retarget the PR base to
+#      `staging` via Gitea REST `PATCH /pulls/{N}` body
+#      `{"base":"staging"}`.
+#   3. If the retarget returns 422 "pull request already exists
+#      for base branch 'staging'" (issue #1884 case: another PR
+#      on the same head already targets staging), close the
+#      now-redundant main-PR via Gitea REST instead of failing
+#      red.
+#   4. Post an explainer comment on the retargeted PR via
+#      Gitea REST `POST /issues/{N}/comments`.
+#
+# ============================================================
+# Why Gitea REST (and not `gh api / gh pr close / gh pr comment`)
+# ============================================================
+#
+# Pre-2026-05-06 this workflow used `gh api -X PATCH "repos/{owner}/{repo}/pulls/{N}" -f base=staging`
+# plus `gh pr close` and `gh pr comment`. After the GitHub→Gitea
+# cutover those calls fail because:
+#
+#   - `gh` CLI defaults to `api.github.com`. Even with `GH_HOST`
+#     pointing at Gitea, `gh pr close / comment` route through
+#     GraphQL (`/api/graphql`) which Gitea does not expose.
+#     Empirical: every `gh pr *` call returns
+#     `HTTP 405 Method Not Allowed (https://git.moleculesai.app/api/graphql)`
+#     — same root cause as #65 (auto-sync, fixed in PR #66) and
+#     #73/#195 (auto-promote, fixed in PR #78).
+#   - `gh api -X PATCH /pulls/{N}` happens to use a REST path
+#     that Gitea also has, but the `gh` host-resolution layer
+#     and pagination/retry logic don't always hit Gitea cleanly,
+#     and the cost of switching to direct `curl` is one extra
+#     line of code.
+#
+# So this workflow uses direct `curl` calls to Gitea REST. No
+# `gh` CLI dependency, no GraphQL, no flaky host-resolution.
+#
+# ============================================================
+# Identity + token (anti-bot-ring per saved-memory
+# `feedback_per_agent_gitea_identity_default`)
+# ============================================================
+#
+# Pre-fix this workflow used the per-job ephemeral
+# `secrets.GITHUB_TOKEN`. On Gitea Actions that token has
+# narrow scope and unpredictable cross-PR write capability.
+#
+# Post-fix: `secrets.AUTO_SYNC_TOKEN` (the `devops-engineer`
+# Gitea persona). Same persona used by `auto-sync-main-to-staging.yml`
+# (PR #66) and `auto-promote-staging.yml` (PR #78). Token scope:
+# `push: true` repo write, sufficient for PR-edit + close + comment.
+#
+# Why this token does NOT need branch-protection bypass:
+# patching a PR's base ref is a PR-level operation that does not
+# require push perms on either branch (the PR's own commits stay
+# put; only the metadata changes).
+#
+# ============================================================
+# Failure modes & operational notes
+# ============================================================
+#
+# A — PATCH base→staging returns 422 "pull request already exists"
+#     (issue #1884 case):
+#     - Detected by HTTP 422 plus a string-match on the response
+#       body. Workflow posts an explanation comment, then closes
+#       the now-redundant main-PR (Gitea REST `PATCH /pulls/{N}`
+#       with `state: closed`). Step summary surfaces the outcome.
+#
+# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
+#     - First REST call returns 401/403. Step summary surfaces.
+#       Re-issue token from `~/.molecule-ai/personas/` on the
+#       operator host and update repo Actions secret.
+#
+# C — PR was deleted between trigger and run:
+#     - REST call returns 404. Workflow exits 0 with a notice
+#       (the rule was already enforced or the PR is gone).
+#
+# D — author is not actually a bot but the filter mis-fires:
+#     - Filter is conservative: only triggers on
+#       `user.type == 'Bot'`, a `login` ending in `[bot]`, or a
+#       known bot login (`molecule-ai[bot]`, `app/molecule-ai`,
+#       `devops-engineer`). Human PRs pass through unaffected. If
+#       a NEW bot login starts shipping main-PRs, add it to the filter.

 on:
  pull_request_target:
@@ -24,16 +107,16 @@ jobs:
  retarget:
    name: Retarget to staging
    runs-on: ubuntu-latest
-    # Only fire for bot-authored PRs. Human CEO PRs (staging→main promotion)
-    # are intentional and pass through.
+    # Only fire for bot-authored PRs. Human CEO PRs (staging→main
+    # promotion) are intentional and pass through.
    #
-    # Head-ref guard: never retarget a PR whose head IS `staging` — those
-    # are the auto-promote staging→main PRs (opened by molecule-ai[bot]
-    # since #2586 switched to an App token, which now passes the bot
-    # filter below). Retargeting head=staging onto base=staging fails
-    # with HTTP 422 "no new commits between base 'staging' and head
-    # 'staging'", which used to surface as a noisy red workflow run on
-    # every auto-promote (caught 2026-05-03 on PR #2588).
+    # Head-ref guard: never retarget a PR whose head IS `staging`
+    # — those are the auto-promote staging→main PRs (opened by
+    # `devops-engineer` since PR #78 / #195 fix). Retargeting
+    # head=staging onto base=staging fails with HTTP 422 "no new
+    # commits between base 'staging' and head 'staging'", which
+    # would surface as a noisy red workflow run on every
+    # auto-promote (caught 2026-05-03 on the GitHub-era PR #2588).
    if: >-
      github.event.pull_request.head.ref != 'staging'
      && (
@@ -41,65 +124,153 @@ jobs:
        || endsWith(github.event.pull_request.user.login, '[bot]')
        || github.event.pull_request.user.login == 'app/molecule-ai'
        || github.event.pull_request.user.login == 'molecule-ai[bot]'
+        || github.event.pull_request.user.login == 'devops-engineer'
      )
    steps:
-      - name: Retarget PR base to staging
+      - name: Retarget PR base to staging via Gitea REST
        id: retarget
        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
+          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
+          REPO: ${{ github.repository }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          PR_AUTHOR: ${{ github.event.pull_request.user.login }}
-        # Issue #1884: when the bot opens a PR against main and there's
-        # already another PR on the same head branch targeting staging,
-        # GitHub's PATCH /pulls returns 422 with
-        # "A pull request already exists for base branch 'staging' …".
-        # The retarget can't proceed — but the right response is to
-        # close the now-redundant main-PR, not to fail the workflow
-        # noisily. Detect that specific 422 and close instead.
+        # Issue #1884 case: when the bot opens a PR against main
+        # and there's already another PR on the same head branch
+        # targeting staging, Gitea's PATCH returns 422 with a
+        # body mentioning "pull request already exists for base
+        # branch 'staging'" (the Gitea message wording is
+        # slightly different from GitHub's; the substring match
+        # below covers both for forward/back compat).
+        # The retarget can't proceed — but the right response is
+        # to close the now-redundant main-PR, not to fail the
+        # workflow noisily. Detect that specific 422 and close
+        # instead.
        run: |
-          set +e
+          set -euo pipefail
+
+          API="${GITEA_HOST}/api/v1/repos/${REPO}"
+          AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
+
          echo "Retargeting PR #${PR_NUMBER} (author: ${PR_AUTHOR}) from main → staging"
-          PATCH_OUTPUT=$(gh api -X PATCH \
-            "repos/${{ github.repository }}/pulls/${PR_NUMBER}" \
-            -f base=staging \
-            --jq '.base.ref' 2>&1)
-          PATCH_EXIT=$?
+
+          # Curl-status-capture pattern per `feedback_curl_status_capture_pollution`:
+          # http_code via -w to its own scalar, body to a tempfile, set +e/-e
+          # bracket so curl's non-zero-on-4xx doesn't pollute the script's exit chain.
+          BODY_FILE=$(mktemp)
+          REQ='{"base":"staging"}'
+
+          set +e
+          STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
+            -X PATCH -d "${REQ}" \
+            -o "${BODY_FILE}" -w "%{http_code}" \
+            "${API}/pulls/${PR_NUMBER}")
+          CURL_RC=$?
          set -e
-          if [ "$PATCH_EXIT" -eq 0 ]; then
-            echo "::notice::Retargeted PR #${PR_NUMBER} → staging"
-            echo "outcome=retargeted" >> "$GITHUB_OUTPUT"
-            exit 0
+
+          if [ "${CURL_RC}" -ne 0 ]; then
+            echo "::error::curl PATCH failed (rc=${CURL_RC})"
+            rm -f "${BODY_FILE}"
+            exit 1
          fi
+
+          if [ "${STATUS}" = "201" ] || [ "${STATUS}" = "200" ]; then
+            NEW_BASE=$(jq -r '.base.ref // "?"' < "${BODY_FILE}")
+            rm -f "${BODY_FILE}"
+            if [ "${NEW_BASE}" = "staging" ]; then
+              echo "::notice::Retargeted PR #${PR_NUMBER} → staging"
+              echo "outcome=retargeted" >> "$GITHUB_OUTPUT"
+              exit 0
+            fi
+            echo "::error::PATCH returned ${STATUS} but base.ref is '${NEW_BASE}', not 'staging'"
+            exit 1
+          fi
+
          # Specifically match the 422 duplicate-base/head error so
          # any OTHER PATCH failure (auth, deleted PR, etc.) still
          # surfaces as a real workflow failure.
-          if echo "$PATCH_OUTPUT" | grep -q "pull request already exists for base branch 'staging'"; then
+          BODY=$(cat "${BODY_FILE}" || true)
+          rm -f "${BODY_FILE}"
+
+          if [ "${STATUS}" = "422" ] && echo "${BODY}" | grep -qE "(pull request already exists for base branch 'staging'|already exists.*base.*staging)"; then
            echo "::notice::PR #${PR_NUMBER}: duplicate target-staging PR exists on same head — closing this main-PR as redundant."
-            gh pr close "$PR_NUMBER" \
-              --repo "${{ github.repository }}" \
-              --comment "[retarget-bot] Closing — another PR on the same head branch already targets \`staging\`. This PR is redundant. See issue #1884 for the rationale."
-            echo "outcome=closed-as-duplicate" >> "$GITHUB_OUTPUT"
-            exit 0
+
+            # Close the now-redundant main-PR via Gitea REST
+            # (PATCH state=closed). Post comment explaining
+            # rationale BEFORE close so the comment lands on the
+            # PR (commenting on a closed PR works on Gitea, but
+            # historically caused notification ordering surprises).
+
+            CLOSE_BODY_FILE=$(mktemp)
+            CMT_REQ=$(jq -n '{body:"[retarget-bot] Closing — another PR on the same head branch already targets `staging`. This PR is redundant. See issue #1884 for the rationale."}')
+            set +e
+            CMT_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
+              -X POST -d "${CMT_REQ}" \
+              -o "${CLOSE_BODY_FILE}" -w "%{http_code}" \
+              "${API}/issues/${PR_NUMBER}/comments")
+            set -e
+            if [ "${CMT_STATUS}" != "201" ]; then
+              echo "::warning::dup-close comment POST returned ${CMT_STATUS}; continuing to close anyway"
+              head -c 300 "${CLOSE_BODY_FILE}" || true
+            fi
+            rm -f "${CLOSE_BODY_FILE}"
+
+            CLOSE_REQ='{"state":"closed"}'
+            CLOSE_RESP=$(mktemp)
+            set +e
+            CL_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
+              -X PATCH -d "${CLOSE_REQ}" \
+              -o "${CLOSE_RESP}" -w "%{http_code}" \
+              "${API}/pulls/${PR_NUMBER}")
+            set -e
+            if [ "${CL_STATUS}" = "201" ] || [ "${CL_STATUS}" = "200" ]; then
+              echo "::notice::Closed PR #${PR_NUMBER} as redundant"
+              echo "outcome=closed-as-duplicate" >> "$GITHUB_OUTPUT"
+              rm -f "${CLOSE_RESP}"
+              exit 0
+            fi
+            echo "::error::Failed to close redundant PR: HTTP ${CL_STATUS}"
+            head -c 300 "${CLOSE_RESP}" || true
+            rm -f "${CLOSE_RESP}"
+            exit 1
          fi
-          echo "::error::Retarget PATCH failed and was NOT a duplicate-base error:"
-          echo "$PATCH_OUTPUT" >&2
+
+          echo "::error::Retarget PATCH failed and was NOT a duplicate-base error: HTTP ${STATUS}"
+          echo "${BODY}" | head -c 500 >&2
          exit 1

      - name: Post explainer comment
        if: steps.retarget.outputs.outcome == 'retargeted'
        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
+          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
+          REPO: ${{ github.repository }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
        run: |
-          gh pr comment "$PR_NUMBER" \
-            --repo "${{ github.repository }}" \
-            --body "$(cat <<'BODY'
-          [retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.
+          set -euo pipefail

-          **Why:** per [SHARED_RULES rule 8](https://github.com/Molecule-AI/molecule-ai-org-template-molecule-dev/blob/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.
+          API="${GITEA_HOST}/api/v1/repos/${REPO}"
+          AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")

-          **What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.
+          # PR comments live on the issue endpoint in Gitea
+          # (PRs ARE issues — same endpoint, different sub-resources
+          # for diffs/files/etc.). The body uses jq to safely
+          # encode the multi-line markdown without shell-quote
+          # nightmares.
+          REQ=$(jq -n '{body:"[retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.\n\n**Why:** per [SHARED_RULES rule 8](https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev/src/branch/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.\n\n**What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.\n\n**If this PR is the CEO\u0027s staging→main promotion:** the Action skipped you (only bot-authored PRs are retargeted, head=staging is also exempted). If you see this comment on your CEO PR, that\u0027s a bug — please tag @hongmingwang."}')

-          **If this PR is the CEO's staging→main promotion:** the Action skipped you (only bot-authored PRs are retargeted). If you see this comment on your CEO PR, that's a bug — please tag @HongmingWang-Rabbit.
-          BODY
-          )"
+          BODY_FILE=$(mktemp)
+          set +e
+          STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
+            -X POST -d "${REQ}" \
+            -o "${BODY_FILE}" -w "%{http_code}" \
+            "${API}/issues/${PR_NUMBER}/comments")
+          set -e
+
+          if [ "${STATUS}" = "201" ]; then
+            echo "::notice::Posted explainer comment on PR #${PR_NUMBER}"
+          else
+            echo "::warning::Failed to post explainer (HTTP ${STATUS}) — retarget itself succeeded"
+            head -c 300 "${BODY_FILE}" || true
+          fi
+          rm -f "${BODY_FILE}"
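
Reviewer note: the branching in the retarget step above (success, duplicate-PR 422, everything else) can be read as a pure classification of the HTTP status and response body. The sketch below restates that decision table in TypeScript for illustration only. The names `RetargetOutcome`, `DUPLICATE_RE`, and `classifyRetarget` are hypothetical and do not exist in the workflow, which implements the same logic in bash with `curl`, `jq`, and `grep -qE`.

```typescript
// Hypothetical restatement of the workflow's decision table; not part of the PR.
type RetargetOutcome = "retargeted" | "close-as-duplicate" | "fail";

// Same alternation the workflow passes to `grep -qE`.
const DUPLICATE_RE =
  /(pull request already exists for base branch 'staging'|already exists.*base.*staging)/;

function classifyRetarget(status: number, body: string): RetargetOutcome {
  if (status === 200 || status === 201) {
    // Success is only trusted once the response confirms the new base ref.
    let baseRef: unknown;
    try {
      baseRef = JSON.parse(body)?.base?.ref;
    } catch {
      return "fail";
    }
    return baseRef === "staging" ? "retargeted" : "fail";
  }
  // Only the duplicate-PR 422 is downgraded to comment-then-close.
  if (status === 422 && DUPLICATE_RE.test(body)) return "close-as-duplicate";
  // Auth errors, deleted PRs, and any other 4xx/5xx stay real failures.
  return "fail";
}
```

Keeping the 422 match narrow is the point of the design: any other PATCH failure still turns the run red instead of being silently closed.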
@@ -12,7 +12,7 @@ name: Secret scan
 #
 #   jobs:
 #     secret-scan:
-#       uses: Molecule-AI/molecule-core/.github/workflows/secret-scan.yml@staging
+#       uses: molecule-ai/molecule-core/.github/workflows/secret-scan.yml@staging
 #
 # Pin to @staging not @main — staging is the active default branch,
 # main lags via the staging-promotion workflow. Updates ride along
@@ -131,6 +131,13 @@ backups/
 # Cloned by publish-workspace-server-image.yml so the Dockerfile's
 # replace-directive path resolves. Lives in its own repo.
 /molecule-ai-plugin-github-app-auth/
+# Tenant-image build context — populated by the workflow's
+# "Pre-clone manifest deps" step. Mirrors the public manifest, holds the
+# same content as the three /<>/ dirs above but namespaced under one
+# parent so the Docker build context is a single COPY-friendly tree.
+# Each entry is a transient working-dir, never source-of-truth, never
+# committed.
+/.tenant-bundle-deps/

 # Internal-flavored content lives in Molecule-AI/internal — NEVER in this
 # public monorepo. Migrated 2026-04-23 (CEO directive). The CI workflow
@@ -0,0 +1,28 @@
+# Top-level Makefile — convenience wrappers around docker compose.
+#
+# Most molecule-core dev work happens via these shortcuts. CI doesn't
+# use this Makefile; CI calls docker compose / go test directly so the
+# Makefile can evolve without breaking the build.
+
+.PHONY: help dev up down logs build test
+
+help: ## Show this help.
+	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-12s\033[0m %s\n", $$1, $$2}'
+
+dev: ## Start the full stack with air hot-reload for the platform service.
+	docker compose -f docker-compose.yml -f docker-compose.dev.yml up
+
+up: ## Start the full stack in production-shape mode (no air, normal Dockerfile).
+	docker compose up
+
+down: ## Stop the stack and remove containers (volumes preserved).
+	docker compose down
+
+logs: ## Tail logs from all services (Ctrl-C to detach).
+	docker compose logs -f
+
+build: ## Force a fresh build of the platform image (no cache).
+	docker compose build --no-cache platform
+
+test: ## Run Go unit tests in workspace-server/.
+	cd workspace-server && go test -race ./...
@@ -225,14 +225,14 @@ The result is not just “an agent that learns.” It is **an organization that
 - runtime tiers
 - direct workspace inspection through terminal and files

-### SaaS (via [`molecule-controlplane`](https://github.com/Molecule-AI/molecule-controlplane))
+### SaaS (via [`molecule-controlplane`](https://git.moleculesai.app/molecule-ai/molecule-controlplane))

 - multi-tenant on AWS EC2 + Neon (per-tenant Postgres branch) + Cloudflare Tunnels (per-tenant, no public ports)
 - WorkOS AuthKit + Stripe Checkout + Customer Portal
 - AWS KMS envelope encryption (DB / Redis connection strings); AWS Secrets Manager for tenant bootstrap
 - `tenant_resources` audit table + 30-min boot-event-aware reconciler — every CF / AWS lifecycle event recorded, claim vs live state diffed

-### Bring your own Claude Code session (via [`molecule-mcp-claude-channel`](https://github.com/Molecule-AI/molecule-mcp-claude-channel))
+### Bring your own Claude Code session (via [`molecule-mcp-claude-channel`](https://git.moleculesai.app/molecule-ai/molecule-mcp-claude-channel))

 - Claude Code plugin that bridges Molecule A2A traffic into a local Claude Code session via MCP
 - subscribe to one or more workspaces; peer messages surface as conversation turns; replies route back through Molecule's A2A
@@ -330,7 +330,7 @@ Then open `http://localhost:3000`:

 The current `main` branch ships the core platform, Canvas v4 (warm-paper themed), Memory v2 (pgvector semantic recall), the typed-SSOT A2A response path (RFC #2967), **eight production adapters** (Claude Code, Hermes, Gemini CLI, LangGraph, DeepAgents, CrewAI, AutoGen, OpenClaw), skill lifecycle, and operational surfaces.

-The companion private repo [`molecule-controlplane`](https://github.com/Molecule-AI/molecule-controlplane) provides the SaaS surface — multi-tenant orchestration on EC2 + Neon + Cloudflare Tunnels, KMS envelope encryption, WorkOS auth, Stripe billing, and a `tenant_resources` audit table with a 30-min reconciler.
+The companion private repo [`molecule-controlplane`](https://git.moleculesai.app/molecule-ai/molecule-controlplane) provides the SaaS surface — multi-tenant orchestration on EC2 + Neon + Cloudflare Tunnels, KMS envelope encryption, WorkOS auth, Stripe billing, and a `tenant_resources` audit table with a 30-min reconciler.

 Adjacent runtime work such as **NemoClaw** remains branch-level until merged, and this README keeps that distinction explicit on purpose.

@@ -224,14 +224,14 @@ Molecule AI 并不是要替代下面这些 framework，而是把它们纳入更
 - runtime tiers
 - 终端与文件层面的 workspace 直接排障

-### SaaS（由 [`molecule-controlplane`](https://github.com/Molecule-AI/molecule-controlplane) 提供）
+### SaaS（由 [`molecule-controlplane`](https://git.moleculesai.app/molecule-ai/molecule-controlplane) 提供）

 - 多租户运行在 AWS EC2 + Neon（每租户一个 Postgres branch）+ Cloudflare Tunnels（每租户一条隧道，对外不开任何端口）
 - WorkOS AuthKit + Stripe Checkout + Customer Portal
 - AWS KMS 信封加密（DB / Redis 连接串）；AWS Secrets Manager 负责租户 bootstrap
 - `tenant_resources` 审计表 + 30 分钟 boot-event-aware reconciler —— 每个 CF / AWS lifecycle 事件都有记录，每 30 分钟比对 claim 与实际状态

-### 在 Claude Code 里直接接入（由 [`molecule-mcp-claude-channel`](https://github.com/Molecule-AI/molecule-mcp-claude-channel) 提供）
+### 在 Claude Code 里直接接入（由 [`molecule-mcp-claude-channel`](https://git.moleculesai.app/molecule-ai/molecule-mcp-claude-channel) 提供）

 - 把 Molecule A2A 流量桥接到本地 Claude Code 会话的 MCP 插件
 - 订阅一个或多个 workspace；peer 的消息会以 user-turn 出现，回复会经 Molecule A2A 路由出去
@@ -323,7 +323,7 @@ npm run dev

 当前 `main` 已经包含核心平台、Canvas v4（warm-paper 主题）、Memory v2（pgvector 语义召回）、typed-SSOT A2A 响应路径（RFC #2967）、**8 个正式 adapter**（Claude Code、Hermes、Gemini CLI、LangGraph、DeepAgents、CrewAI、AutoGen、OpenClaw）、skill lifecycle，以及主要运维面。

-配套的私有仓库 [`molecule-controlplane`](https://github.com/Molecule-AI/molecule-controlplane) 提供 SaaS 层 —— 多租户编排（EC2 + Neon + Cloudflare Tunnels）、KMS 信封加密、WorkOS 鉴权、Stripe 计费，以及 `tenant_resources` 审计表加 30 分钟 reconciler。
+配套的私有仓库 [`molecule-controlplane`](https://git.moleculesai.app/molecule-ai/molecule-controlplane) 提供 SaaS 层 —— 多租户编排（EC2 + Neon + Cloudflare Tunnels）、KMS 信封加密、WorkOS 鉴权、Stripe 计费，以及 `tenant_resources` 审计表加 30 分钟 reconciler。

 像 **NemoClaw** 这样的相邻 runtime 路线仍然属于分支级工作，只有合并后才会进入正式支持列表，这里会明确区分。

@@ -3,6 +3,7 @@ import { cookies, headers } from "next/headers";
 import "./globals.css";
 import { AuthGate } from "@/components/AuthGate";
 import { CookieConsent } from "@/components/CookieConsent";
+import { PurchaseSuccessModal } from "@/components/PurchaseSuccessModal";
 import { ThemeProvider } from "@/lib/theme-provider";
 import {
  THEME_COOKIE,
@@ -86,6 +87,12 @@ export default async function RootLayout({
              vercel preview URL, apex) pass through unchanged. */}
          <AuthGate>{children}</AuthGate>
          <CookieConsent />
+          {/* Demo Mock #1: post-purchase success toast. Mounted at the
+              layout level so it persists across page state transitions
+              (loading → hydrated → error) without being unmounted and
+              losing its open-state. Reads ?purchase_success=1 from the
+              URL on first paint, then strips the param. */}
+          <PurchaseSuccessModal />
        </ThemeProvider>
      </body>
    </html>
@@ -41,7 +41,7 @@ export default function PricingPage() {
        <p className="mt-2 text-ink-mid">
          We publish the{" "}
          <a
-            href="https://github.com/Molecule-AI/molecule-monorepo"
+            href="https://git.moleculesai.app/molecule-ai/molecule-monorepo"
            className="text-accent underline hover:text-accent"
          >
            full source on GitHub
@@ -1,9 +1,10 @@
 'use client';

-import { useEffect, useMemo, useCallback } from "react";
+import { useEffect, useMemo, useCallback, useRef } from "react";
 import { type Edge, MarkerType } from "@xyflow/react";
 import { api } from "@/lib/api";
 import { useCanvasStore } from "@/store/canvas";
+import { useSocketEvent } from "@/hooks/useSocketEvent";
 import type { ActivityEntry } from "@/types/activity";

 // ── Constants ─────────────────────────────────────────────────────────────────
@@ -11,9 +12,6 @@ import type { ActivityEntry } from "@/types/activity";
 /** 60-minute look-back window for delegation activity */
 export const A2A_WINDOW_MS = 60 * 60 * 1000;

-/** Polling interval — refresh edges every 60 seconds */
-export const A2A_POLL_MS = 60 * 1_000;
-
 /** Threshold for "hot" edges: < 5 minutes → animated + violet stroke */
 export const A2A_HOT_MS = 5 * 60 * 1_000;

@@ -131,6 +129,20 @@ export function buildA2AEdges(
 * `a2aEdges`. Canvas.tsx merges these with topology edges and passes the
 * combined list to ReactFlow.
 *
+ * Update shape (issue #61 Stage 2, replaces the 60s polling loop):
+ *  - On mount (when showA2AEdges): one HTTP fan-out per visible workspace
+ *    (delegation rows, 60-min window). Bootstraps the local row buffer.
+ *  - Steady state: subscribes to ACTIVITY_LOGGED via useSocketEvent.
+ *    Each delegation event from a visible workspace is appended to the
+ *    buffer; edges are re-derived via the existing buildA2AEdges helper.
+ *  - showA2AEdges toggle off: clears edges + buffer.
+ *  - Visible-ID-set change: re-bootstraps so a freshly-shown workspace
+ *    backfills its 60-min history (existing visibleIdsKey selector
+ *    behaviour preserved — that's the 2026-05-04 render-loop fix).
+ *
+ * No interval poll. The singleton ReconnectingSocket already owns
+ * reconnect / backoff / health-check; useSocketEvent inherits those.
+ *
 * Mount this inside CanvasInner (no ReactFlow hook dependency).
 */
 export function A2ATopologyOverlay() {
@@ -157,7 +169,9 @@ export function A2ATopologyOverlay() {
  // the symptom of this re-render storm.
  //
  // The fix is purely the dependency-stability change here; the fetch
-  // logic is unchanged.
+  // logic is unchanged. Post-#61 the polling-driven fetch is gone, but
+  // the visibleIdsKey gate is still required so a peer-discovery write
+  // doesn't trigger a wasteful re-bootstrap.
  const visibleIdsKey = useCanvasStore((s) =>
    s.nodes
      .filter((n) => !n.hidden)
@@ -171,16 +185,42 @@ export function A2ATopologyOverlay() {
    [visibleIdsKey]
  );

-  // Fetch delegation activity for all visible workspaces and rebuild overlay edges.
-  const fetchAndUpdate = useCallback(async () => {
+  // Local rolling buffer of delegation rows. Pruned by A2A_WINDOW_MS on
+  // each rebuild so a long-lived session doesn't accumulate unbounded
+  // history. The buffer's high-water mark is approximately:
+  //    visibleIds.length × bootstrap-fetch-limit (500) + WS arrivals
+  // Real-world ceiling: ~3000 entries at the 60-min boundary, all of
+  // which buildA2AEdges aggregates into at most N² edges.
+  const bufferRef = useRef<ActivityEntry[]>([]);
+  // visibleIdsRef gives the WS handler the latest visible-ID set without
+  // re-subscribing on every render. The bus listener is registered
+  // exactly once per mount; subscriber-side filtering reads from this ref.
+  const visibleIdsRef = useRef(visibleIds);
+  visibleIdsRef.current = visibleIds;
+
+  // Re-derive overlay edges from the current buffer + push to store.
+  // Prunes by A2A_WINDOW_MS first so memory stays bounded across long
+  // sessions and the aggregation cost stays O(window-size).
+  const recomputeAndPush = useCallback(() => {
+    const cutoff = Date.now() - A2A_WINDOW_MS;
+    bufferRef.current = bufferRef.current.filter(
+      (r) => new Date(r.created_at).getTime() > cutoff
+    );
+    setA2AEdges(buildA2AEdges(bufferRef.current));
+  }, [setA2AEdges]);
+
+  // Bootstrap fan-out — one HTTP per visible workspace. Replaces the
+  // 60s polling loop entirely. Race-aware: any WS arrivals that landed
+  // in the buffer DURING the fetch (between the await and resume) are
+  // preserved by id-dedup-with-fetched-first ordering.
+  const bootstrap = useCallback(async () => {
    if (visibleIds.length === 0) {
+      bufferRef.current = [];
      setA2AEdges([]);
      return;
    }
    try {
-      // Fan-out — one request per visible workspace.
-      // Per-request failures are swallowed so one broken workspace doesn't blank the overlay.
-      const allRows = (
+      const fetchedRows = (
        await Promise.all(
          visibleIds.map((id) =>
            api
@@ -192,24 +232,76 @@ export function A2ATopologyOverlay() {
        )
      ).flat();

-      setA2AEdges(buildA2AEdges(allRows));
+      // Merge: fetched rows first, then any in-flight WS arrivals that
+      // accumulated during the await. Dedup by id so rows that appear
+      // in both paths are not double-counted in the aggregation.
+      const merged = [...fetchedRows, ...bufferRef.current];
+      const seen = new Set<string>();
+      bufferRef.current = merged.filter((r) => {
+        if (seen.has(r.id)) return false;
+        seen.add(r.id);
+        return true;
+      });
+      recomputeAndPush();
    } catch {
      // Overlay failure is non-critical — canvas remains functional
    }
-  }, [visibleIds, setA2AEdges]);
+  }, [visibleIds, setA2AEdges, recomputeAndPush]);

  useEffect(() => {
    if (!showA2AEdges) {
-      // Clear edges immediately when toggled off
+      // Clear edges + buffer immediately when toggled off
+      bufferRef.current = [];
      setA2AEdges([]);
      return;
    }
+    void bootstrap();
+  }, [showA2AEdges, bootstrap, setA2AEdges]);

-    // Initial fetch, then poll every 60 s
-    void fetchAndUpdate();
-    const timer = setInterval(() => void fetchAndUpdate(), A2A_POLL_MS);
-    return () => clearInterval(timer);
-  }, [showA2AEdges, fetchAndUpdate, setA2AEdges]);
+  // Live-update path. Filters server-side ACTIVITY_LOGGED events down
+  // to delegation initiations from visible workspaces and appends each
+  // into the rolling buffer, re-deriving edges via buildA2AEdges.
+  //
+  // Only `method === "delegate"` rows count — the same filter
+  // buildA2AEdges applies — so delegate_result rows arriving over the
+  // wire don't double-count.
+  useSocketEvent((msg) => {
+    if (!showA2AEdges) return;
+    if (msg.event !== "ACTIVITY_LOGGED") return;
+
+    const p = (msg.payload || {}) as Record<string, unknown>;
+    if (p.activity_type !== "delegation") return;
+    if (p.method !== "delegate") return;
+
+    const wsId = msg.workspace_id;
+    if (!visibleIdsRef.current.includes(wsId)) return;
+
+    // Synthesise an ActivityEntry from the WS payload so buildA2AEdges
+    // (which the bootstrap path also feeds) handles it identically.
+    const entry: ActivityEntry = {
+      id:
+        (p.id as string) ||
+        `ws-push-${msg.timestamp || Date.now()}-${wsId}`,
+      workspace_id: wsId,
+      activity_type: "delegation",
+      source_id: (p.source_id as string | null) ?? null,
+      target_id: (p.target_id as string | null) ?? null,
+      method: "delegate",
+      summary: (p.summary as string | null) ?? null,
+      request_body: null,
+      response_body: null,
+      duration_ms: (p.duration_ms as number | null) ?? null,
+      status: (p.status as string) || "ok",
+      error_detail: null,
+      created_at:
+        (p.created_at as string) ||
+        msg.timestamp ||
+        new Date().toISOString(),
+    };
+
+    bufferRef.current = [...bufferRef.current, entry];
+    recomputeAndPush();
+  });

  // Pure side-effect — renders nothing
  return null;
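
Reviewer note: the bootstrap merge and the window pruning described above can be checked in isolation. The sketch below is a minimal standalone version, assuming rows only need an `id` and an ISO `created_at`. `Row`, `WINDOW_MS`, and `mergeAndPrune` are hypothetical names for illustration; the component itself does this across `bootstrap` and `recomputeAndPush` using `bufferRef` and `A2A_WINDOW_MS`.

```typescript
// Hypothetical standalone sketch; the component splits this across two callbacks.
interface Row {
  id: string;
  created_at: string; // ISO-8601 timestamp
}

const WINDOW_MS = 60 * 60 * 1000; // mirrors the 60-minute look-back window

function mergeAndPrune<T extends Row>(fetched: T[], buffered: T[], now: number): T[] {
  const cutoff = now - WINDOW_MS;
  const seen = new Set<string>();
  // Fetched rows come first, so on an id collision the HTTP copy wins
  // and the WebSocket copy that arrived during the fetch is dropped.
  return [...fetched, ...buffered].filter((r) => {
    if (seen.has(r.id)) return false;
    seen.add(r.id);
    // Rows older than the window are discarded to keep the buffer bounded.
    return new Date(r.created_at).getTime() > cutoff;
  });
}
```

Passing `now` as a parameter rather than calling `Date.now()` inside keeps the function deterministic, which makes the dedup and pruning behaviour straightforward to unit test.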
@@ -3,6 +3,7 @@
 import { useState, useEffect, useCallback, useRef } from "react";
 import { useCanvasStore } from "@/store/canvas";
 import { api } from "@/lib/api";
+import { useSocketEvent } from "@/hooks/useSocketEvent";
 import { COMM_TYPE_LABELS } from "@/lib/design-tokens";

 interface Communication {
@@ -18,32 +19,71 @@ interface Communication {
  durationMs: number | null;
 }

+/** Workspace-server `ACTIVITY_LOGGED` payload shape. Pulled out so the
+ *  WS handler below has a typed view of the same fields the HTTP
+ *  bootstrap consumes — drift between the two paths is a class of bug
+ *  AgentCommsPanel hit historically. */
+interface ActivityLoggedPayload {
+  id?: string;
+  activity_type?: string;
+  source_id?: string | null;
+  target_id?: string | null;
+  workspace_id?: string;
+  summary?: string | null;
+  status?: string;
+  duration_ms?: number | null;
+  created_at?: string;
+}
+
+/** Fan-out cap for the bootstrap HTTP fetch on mount / on visibility
+ *  re-open. Kept at 3 (carried over from the 2026-05-04 fix) so a
+ *  freshly-mounted overlay on a 15-workspace tenant only spends 3
+ *  round-trips bootstrapping. Live updates after that arrive via the
+ *  WS subscription below — no polling, no fan-out to maintain. */
+const BOOTSTRAP_FAN_OUT_CAP = 3;
+
+/** Cap on the rendered list. Bootstrap and every WS push prepend
+ *  entries; the list is sliced to this size after each update.
+ *  Mirrors the prior polling-loop behaviour. */
+const COMMS_RENDER_CAP = 20;
+
 /**
 * Overlay showing recent A2A communications between workspaces.
- * Renders as a floating log panel that auto-updates.
+ *
+ * Update shape (issue #61 Stage 1, replaces the 30s polling loop):
+ *  - On mount (when visible): one HTTP bootstrap per online workspace,
+ *    capped at BOOTSTRAP_FAN_OUT_CAP. Yields the initial recent-comms
+ *    window without waiting for live events.
+ *  - Steady state: subscribes to ACTIVITY_LOGGED via useSocketEvent.
+ *    Each event with a matching activity_type from a visible online
+ *    workspace gets synthesised into a Communication and prepended.
+ *  - Visibility re-open: re-bootstraps so the user sees the freshest
+ *    window even if WS was idle while collapsed.
+ *
+ * No interval poll. The singleton ReconnectingSocket in `store/socket.ts`
+ * already owns reconnect/backoff/health-check, and `useSocketEvent`
+ * inherits those guarantees. If WS is genuinely unhealthy, the overlay
+ * shows the bootstrap snapshot until the next visibility re-open or
+ * the next WS reconnect (which fires its own rehydrate burst).
 */
 export function CommunicationOverlay() {
  const [comms, setComms] = useState<Communication[]>([]);
  const [visible, setVisible] = useState(true);
  const selectedNodeId = useCanvasStore((s) => s.selectedNodeId);
  const nodes = useCanvasStore((s) => s.nodes);
+  // nodesRef gives the WS handler current node-name resolution without
+  // re-subscribing on every node-list change. The bus listener is
+  // registered exactly once per mount; subscriber-side filtering reads
+  // the latest value via this ref.
  const nodesRef = useRef(nodes);
  nodesRef.current = nodes;

-  const fetchComms = useCallback(async () => {
+  const bootstrapComms = useCallback(async () => {
    try {
-      // Fan-out cap: each polled workspace = 1 round-trip. The platform
-      // rate limits at 600 req/min/IP; combined with heartbeats + other
-      // canvas polling, every workspace polled here costs ~6 req/min
-      // (1 every 30s × 1 per workspace). Capping at 3 keeps this
-      // overlay's footprint at 18 req/min worst case — well under
-      // budget even with 8+ workspaces visible. Caught 2026-05-04 when
-      // a user with 8+ workspaces (Design Director + 6 sub-agents +
-      // 3 standalones) saw sustained 429s in canvas console.
      const onlineNodes = nodesRef.current.filter((n) => n.data.status === "online");
      const allComms: Communication[] = [];

-      for (const node of onlineNodes.slice(0, 3)) {
+      for (const node of onlineNodes.slice(0, BOOTSTRAP_FAN_OUT_CAP)) {
        try {
          const activities = await api.get<Array<{
            id: string;
@@ -59,8 +99,8 @@ export function CommunicationOverlay() {

          for (const a of activities) {
            if (a.activity_type === "a2a_send" || a.activity_type === "a2a_receive") {
-              const sourceNode = nodes.find((n) => n.id === (a.source_id || a.workspace_id));
-              const targetNode = nodes.find((n) => n.id === (a.target_id || ""));
+              const sourceNode = nodesRef.current.find((n) => n.id === (a.source_id || a.workspace_id));
+              const targetNode = nodesRef.current.find((n) => n.id === (a.target_id || ""));
              allComms.push({
                id: a.id,
                sourceId: a.source_id || a.workspace_id,
@@ -76,11 +116,12 @@ export function CommunicationOverlay() {
            }
          }
        } catch {
-          // Skip workspaces that fail
+          // Per-workspace failures must not blank the panel — the same
+          // robustness the polling version had.
        }
      }

-      // Sort by timestamp, newest first, dedupe
+      // Newest-first with id-dedup, capped at COMMS_RENDER_CAP.
      const seen = new Set<string>();
      const sorted = allComms
        .sort((a, b) => b.timestamp.localeCompare(a.timestamp))
@@ -89,29 +130,78 @@ export function CommunicationOverlay() {
          seen.add(c.id);
          return true;
        })
-        .slice(0, 20);
+        .slice(0, COMMS_RENDER_CAP);

      setComms(sorted);
    } catch {
-      // Silently handle API errors
+      // Bootstrap failure is non-blocking — the WS subscription below
+      // will populate the panel as live events arrive.
    }
  }, []);

+  // Bootstrap once on mount + every time the user re-opens after a
+  // collapse. Closed-panel state intentionally drops live updates so
+  // the panel doesn't churn invisible state — the next open reloads.
  useEffect(() => {
-    // Gate polling on visibility — when the user collapses the overlay
-    // the data isn't being read, so the per-workspace fan-out becomes
-    // pure rate-limit overhead. Pre-fix this overlay polled regardless
-    // of whether the panel was shown, costing ~36 req/min from a
-    // hidden surface.
    if (!visible) return;
-    fetchComms();
-    // 30s cadence (was 10s). At 3-workspace fan-out that's 6 req/min
-    // worst case from this overlay. Combined with heartbeats (~30/min)
-    // and other canvas polling, leaves ample headroom under the 600/
-    // min/IP server-side rate limit even at 8+ workspace tenants.
-    const interval = setInterval(fetchComms, 30000);
-    return () => clearInterval(interval);
-  }, [fetchComms, visible]);
+    bootstrapComms();
+  }, [bootstrapComms, visible]);
+
+  // Live-update path. Filters server-side ACTIVITY_LOGGED events down
+  // to the comm-overlay-relevant subset and prepends each into the
+  // rendered list with the same dedup the bootstrap path uses.
+  //
+  // Scope guard: ignore events for workspaces not in the visible online
+  // set, so a user collapsing one workspace doesn't see its comms
+  // continue to scroll in. Same shape the bootstrap path applies.
+  useSocketEvent((msg) => {
+    if (!visible) return;
+    if (msg.event !== "ACTIVITY_LOGGED") return;
+
+    const p = (msg.payload || {}) as ActivityLoggedPayload;
+    const type = p.activity_type;
+    if (type !== "a2a_send" && type !== "a2a_receive" && type !== "task_update") return;
+
+    const wsId = msg.workspace_id;
+    const onlineSet = new Set(
+      nodesRef.current.filter((n) => n.data.status === "online").map((n) => n.id),
+    );
+    if (!onlineSet.has(wsId)) return;
+
+    const sourceId = p.source_id || wsId;
+    const targetId = p.target_id || "";
+    const sourceNode = nodesRef.current.find((n) => n.id === sourceId);
+    const targetNode = nodesRef.current.find((n) => n.id === targetId);
+
+    const incoming: Communication = {
+      id: p.id || `${msg.timestamp || Date.now()}:${sourceId}:${targetId}`,
+      sourceId,
+      targetId,
+      sourceName: sourceNode?.data.name || "Unknown",
+      targetName: targetNode?.data.name || "Unknown",
+      type: type as Communication["type"],
+      summary: p.summary || "",
+      status: p.status || "ok",
+      timestamp: p.created_at || msg.timestamp || new Date().toISOString(),
+      durationMs: p.duration_ms ?? null,
+    };
+
+    setComms((prev) => {
+      // Prepend, dedup by id, re-cap. Functional setState is necessary
+      // because two ACTIVITY_LOGGED events arriving in the same React
+      // batch would otherwise read a stale `comms` from the closure.
+      const seen = new Set<string>();
+      const merged = [incoming, ...prev]
+        .sort((a, b) => b.timestamp.localeCompare(a.timestamp))
+        .filter((c) => {
+          if (seen.has(c.id)) return false;
+          seen.add(c.id);
+          return true;
+        })
+        .slice(0, COMMS_RENDER_CAP);
+      return merged;
+    });
+  });

  if (!visible || comms.length === 0) {
    return (
@@ -0,0 +1,175 @@
+"use client";
+
+/**
+ * PurchaseSuccessModal — demo-only post-purchase confirmation.
+ *
+ * Mounted on the canvas root (`app/page.tsx`). On first paint it inspects
+ * `?purchase_success=1[&item=<name>]` on the current URL. If present, it
+ * renders a centred modal styled after `ConfirmDialog`, schedules a 5s
+ * auto-dismiss, and rewrites the URL via `history.replaceState` to drop
+ * the params so a refresh after dismiss does NOT re-show the modal.
+ *
+ * Mock for the funding demo — there is no real billing surface behind
+ * this. The marketplace "Purchase" button on the landing page redirects
+ * here with the params; this modal is the only thing the user sees of
+ * the "transaction".
+ *
+ * Styling matches the warm-paper @theme tokens (surface-sunken / line /
+ * ink / good) so it tracks light + dark without per-mode overrides.
+ */
+
+import { useEffect, useRef, useState } from "react";
+import { createPortal } from "react-dom";
+
+const AUTO_DISMISS_MS = 5000;
+
+function readPurchaseParams(): { open: boolean; item: string | null } {
+  if (typeof window === "undefined") return { open: false, item: null };
+  const sp = new URLSearchParams(window.location.search);
+  const flag = sp.get("purchase_success");
+  if (flag !== "1" && flag !== "true") return { open: false, item: null };
+  return { open: true, item: sp.get("item") };
+}
+
+function stripPurchaseParams() {
+  if (typeof window === "undefined") return;
+  const url = new URL(window.location.href);
+  url.searchParams.delete("purchase_success");
+  url.searchParams.delete("item");
+  // replaceState (not pushState) so back-button doesn't return to the
+  // pre-strip URL and re-trigger the modal.
+  window.history.replaceState({}, "", url.toString());
+}
+
+export function PurchaseSuccessModal() {
+  const [open, setOpen] = useState(false);
+  const [item, setItem] = useState<string | null>(null);
+  const [mounted, setMounted] = useState(false);
+  const dialogRef = useRef<HTMLDivElement>(null);
+
+  // Read the URL params once on mount. We don't subscribe to navigation —
+  // this modal is a one-shot for the demo redirect, not a persistent
+  // listener.
+  useEffect(() => {
+    setMounted(true);
+    const { open: shouldOpen, item: itemName } = readPurchaseParams();
+    if (shouldOpen) {
+      setOpen(true);
+      setItem(itemName);
+      // Clean the URL immediately so a refresh after the modal is closed
+      // (or even while it's still open) does NOT re-trigger it.
+      stripPurchaseParams();
+    }
+  }, []);
+
+  // Auto-dismiss timer + Escape handler.
+  useEffect(() => {
+    if (!open) return;
+    const t = window.setTimeout(() => setOpen(false), AUTO_DISMISS_MS);
+    const onKey = (e: KeyboardEvent) => {
+      if (e.key === "Escape") setOpen(false);
+    };
+    window.addEventListener("keydown", onKey);
+    // Focus the close button so keyboard users land on it after redirect.
+    const raf = requestAnimationFrame(() => {
+      dialogRef.current?.querySelector<HTMLButtonElement>("button")?.focus();
+    });
+    return () => {
+      window.clearTimeout(t);
+      window.removeEventListener("keydown", onKey);
+      cancelAnimationFrame(raf);
+    };
+  }, [open]);
+
+  if (!open || !mounted) return null;
+
+  const itemLabel = item || "Your new agent"; // URLSearchParams.get() already percent-decodes; decoding again can throw URIError on a literal "%"
+
+  return createPortal(
+    <div
+      className="fixed inset-0 z-[9999] flex items-center justify-center"
+      data-testid="purchase-success-modal"
+    >
+      {/* Backdrop — click closes, matches ConfirmDialog backdrop. */}
+      <div
+        className="absolute inset-0 bg-black/60 backdrop-blur-sm"
+        onClick={() => setOpen(false)}
+        aria-hidden="true"
+      />
+
+      <div
+        ref={dialogRef}
+        role="dialog"
+        aria-modal="true"
+        aria-labelledby="purchase-success-title"
+        className="relative bg-surface-sunken border border-line rounded-xl shadow-2xl shadow-black/50 max-w-[420px] w-full mx-4 overflow-hidden"
+      >
+        <div className="px-6 pt-6 pb-4">
+          <div className="flex items-start gap-4">
+            {/* Success glyph — uses --color-good so it tracks the theme.
+                Inline SVG over an emoji so it stays readable + on-brand
+                in both light and dark. */}
+            <div
+              className="flex h-10 w-10 flex-shrink-0 items-center justify-center rounded-full"
+              style={{
+                background:
+                  "color-mix(in srgb, var(--color-good) 15%, transparent)",
+                color: "var(--color-good)",
+              }}
+            >
+              <svg
+                width="22"
+                height="22"
+                viewBox="0 0 24 24"
+                fill="none"
+                aria-hidden="true"
+              >
+                <circle
+                  cx="12"
+                  cy="12"
+                  r="10"
+                  stroke="currentColor"
+                  strokeWidth="1.5"
+                />
+                <path
+                  d="M7.5 12.5L10.5 15.5L16.5 9.5"
+                  stroke="currentColor"
+                  strokeWidth="1.8"
+                  strokeLinecap="round"
+                  strokeLinejoin="round"
+                />
+              </svg>
+            </div>
+            <div className="flex-1">
+              <h3
+                id="purchase-success-title"
+                className="text-base font-semibold text-ink"
+              >
+                Purchase successful
+              </h3>
+              <p className="mt-1.5 text-[13px] leading-relaxed text-ink-mid">
+                <span className="font-medium text-ink">{itemLabel}</span> has
+                been added to your workspace. Provisioning starts in the
+                background — you can keep working while it spins up.
+              </p>
+            </div>
+          </div>
+        </div>
+
+        <div className="flex items-center justify-between gap-3 px-6 py-3 border-t border-line bg-surface/50">
+          <span className="font-mono text-[10.5px] uppercase tracking-[0.12em] text-ink-soft">
+            auto-dismiss · {AUTO_DISMISS_MS / 1000}s
+          </span>
+          <button
+            type="button"
+            onClick={() => setOpen(false)}
+            className="px-3.5 py-1.5 text-[13px] rounded-lg bg-accent hover:bg-accent-strong text-white transition-colors focus:outline-none focus-visible:ring-2 focus-visible:ring-offset-2 focus-visible:ring-offset-surface-sunken focus-visible:ring-accent/60"
+          >
+            Close
+          </button>
+        </div>
+      </div>
+    </div>,
+    document.body,
+  );
+}
@@ -41,6 +41,10 @@ vi.mock("@/store/canvas", () => ({
 // ── Imports (after mocks) ─────────────────────────────────────────────────────

 import { api } from "@/lib/api";
+import {
+  emitSocketEvent,
+  _resetSocketEventListenersForTests,
+} from "@/store/socket-events";
 import {
  buildA2AEdges,
  formatA2ARelativeTime,
@@ -342,6 +346,151 @@ describe("A2ATopologyOverlay component", () => {
    expect(mockGet.mock.calls.length).toBe(callsAfterMount);
  });

+  // ── #61 Stage 2: ACTIVITY_LOGGED subscription tests ────────────────────────
+  //
+  // Pin the post-#61 behaviour: WS push for delegation contributes to
+  // the overlay's edge buffer with NO additional HTTP fetch. Same shape
+  // as Stage 1 (CommunicationOverlay).
+
+  describe("#61 stage 2 — ACTIVITY_LOGGED subscription", () => {
+    beforeEach(() => {
+      _resetSocketEventListenersForTests();
+    });
+    afterEach(() => {
+      _resetSocketEventListenersForTests();
+    });
+
+    function emitDelegation(overrides: {
+      workspaceId?: string;
+      sourceId?: string;
+      targetId?: string;
+      method?: string;
+      activityType?: string;
+    } = {}) {
+      // Use Date.now() (frozen by the fake timers) rather than the
+      // hardcoded NOW constant. buildA2AEdges prunes by Date.now() -
+      // A2A_WINDOW_MS, so a row dated against the wrong epoch silently
+      // falls outside the window and the test fails for a confusing
+      // reason ("edges array empty" vs "filter dropped my row").
+      const realNow = Date.now();
+      emitSocketEvent({
+        event: "ACTIVITY_LOGGED",
+        workspace_id: overrides.workspaceId ?? "ws-a",
+        timestamp: new Date(realNow).toISOString(),
+        payload: {
+          id: `act-${Math.random().toString(36).slice(2)}`,
+          activity_type: overrides.activityType ?? "delegation",
+          method: overrides.method ?? "delegate",
+          source_id: overrides.sourceId ?? "ws-a",
+          target_id: overrides.targetId ?? "ws-b",
+          status: "ok",
+          created_at: new Date(realNow - 30_000).toISOString(),
+        },
+      });
+    }
+
+    it("does NOT poll on a 60s interval after bootstrap (post-#61)", async () => {
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      mockGet.mockResolvedValue([] as any);
+      render(<A2ATopologyOverlay />);
+      await act(async () => { await Promise.resolve(); });
+      const callsAfterBootstrap = mockGet.mock.calls.length;
+      expect(callsAfterBootstrap).toBe(2); // ws-a + ws-b
+
+      // Pre-#61: each 60s tick fired a fresh fan-out (2 calls), so the
+      // 120s advance below would add 4. Post-#61: no interval, no calls.
+      await act(async () => {
+        vi.advanceTimersByTime(120_000);
+      });
+      expect(mockGet.mock.calls.length).toBe(callsAfterBootstrap);
+    });
+
+    it("WS push for a delegation event from a visible workspace updates edges with NO HTTP call", async () => {
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      mockGet.mockResolvedValue([] as any);
+      render(<A2ATopologyOverlay />);
+      await act(async () => { await Promise.resolve(); await Promise.resolve(); });
+      mockGet.mockClear();
+      mockStoreState.setA2AEdges.mockClear();
+
+      await act(async () => {
+        emitDelegation({ sourceId: "ws-a", targetId: "ws-b" });
+      });
+
+      // Edges-set called with at least one a2a edge for the new push.
+      const calls = mockStoreState.setA2AEdges.mock.calls;
+      expect(calls.length).toBeGreaterThanOrEqual(1);
+      const lastCall = calls[calls.length - 1][0] as Array<{ id: string }>;
+      expect(lastCall.some((e) => e.id === "a2a-ws-a-ws-b")).toBe(true);
+
+      // Critical: no HTTP fetch fired during the WS path.
+      expect(mockGet).not.toHaveBeenCalled();
+    });
+
+    it("WS push for a non-delegation activity_type is ignored", async () => {
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      mockGet.mockResolvedValue([] as any);
+      render(<A2ATopologyOverlay />);
+      await act(async () => { await Promise.resolve(); });
+      mockStoreState.setA2AEdges.mockClear();
+
+      await act(async () => {
+        emitDelegation({ activityType: "a2a_send" });
+      });
+
+      // setA2AEdges must not be called by the WS handler — the only
+      // setA2AEdges calls in this test came from the initial bootstrap.
+      expect(mockStoreState.setA2AEdges).not.toHaveBeenCalled();
+    });
+
+    it("WS push for a delegate_result row is ignored (mirrors buildA2AEdges filter)", async () => {
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      mockGet.mockResolvedValue([] as any);
+      render(<A2ATopologyOverlay />);
+      await act(async () => { await Promise.resolve(); });
+      mockStoreState.setA2AEdges.mockClear();
+
+      await act(async () => {
+        emitDelegation({ method: "delegate_result" });
+      });
+
+      // delegate_result rows do not contribute to the edge count — they
+      // are completion signals, not initiations.
+      expect(mockStoreState.setA2AEdges).not.toHaveBeenCalled();
+    });
+
+    it("WS push from a hidden workspace is ignored", async () => {
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      mockGet.mockResolvedValue([] as any);
+      render(<A2ATopologyOverlay />);
+      await act(async () => { await Promise.resolve(); });
+      mockStoreState.setA2AEdges.mockClear();
+
+      await act(async () => {
+        emitDelegation({ workspaceId: "ws-hidden" });
+      });
+
+      expect(mockStoreState.setA2AEdges).not.toHaveBeenCalled();
+    });
+
+    it("WS push while showA2AEdges is false is ignored", async () => {
+      mockStoreState.showA2AEdges = false;
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      mockGet.mockResolvedValue([] as any);
+      render(<A2ATopologyOverlay />);
+      // The mount path with showA2AEdges=false calls setA2AEdges([])
+      // once — clear that to isolate the WS path.
+      mockStoreState.setA2AEdges.mockClear();
+
+      await act(async () => {
+        emitDelegation();
+      });
+
+      expect(mockStoreState.setA2AEdges).not.toHaveBeenCalled();
+      expect(mockGet).not.toHaveBeenCalled();
+    });
+  });
+
  it("re-fetches when the visible ID set actually changes", async () => {
    // eslint-disable-next-line @typescript-eslint/no-explicit-any
    mockGet.mockResolvedValue([] as any);
@@ -36,6 +36,10 @@ vi.mock("@/hooks/useWorkspaceName", () => ({
  useWorkspaceName: () => () => "Test WS",
 }));

+import {
+  emitSocketEvent,
+  _resetSocketEventListenersForTests,
+} from "@/store/socket-events";
 import { ActivityTab } from "../tabs/ActivityTab";

 // ── Fixtures ──────────────────────────────────────────────────────────────────
@@ -358,6 +362,191 @@ describe("ActivityTab — refresh button", () => {
  });
 });

+// ── Suite 6.5: ACTIVITY_LOGGED subscription (#61 stage 3) ─────────────────────
+//
+// Pin the post-#61 behaviour: WS push extends the rendered list with NO
+// additional HTTP fetch. The 5s polling loop is gone; live updates
+// arrive over the WebSocket bus.
+
+describe("ActivityTab — #61 stage 3: ACTIVITY_LOGGED subscription", () => {
+  beforeEach(() => {
+    vi.clearAllMocks();
+    mockGet.mockResolvedValue([]);
+    _resetSocketEventListenersForTests();
+  });
+  afterEach(() => {
+    cleanup();
+    _resetSocketEventListenersForTests();
+  });
+
+  function emitActivity(overrides: {
+    workspaceId?: string;
+    activityType?: string;
+    summary?: string;
+    id?: string;
+  } = {}) {
+    const realNow = Date.now();
+    emitSocketEvent({
+      event: "ACTIVITY_LOGGED",
+      workspace_id: overrides.workspaceId ?? "ws-1",
+      timestamp: new Date(realNow).toISOString(),
+      payload: {
+        id: overrides.id ?? `act-${Math.random().toString(36).slice(2)}`,
+        activity_type: overrides.activityType ?? "agent_log",
+        source_id: null,
+        target_id: null,
+        method: null,
+        summary: overrides.summary ?? "live-pushed",
+        status: "ok",
+        created_at: new Date(realNow - 5_000).toISOString(),
+      },
+    });
+  }
+
+  it("WS push for matching workspace prepends to the list with NO HTTP call", async () => {
+    render(<ActivityTab workspaceId="ws-1" />);
+    await waitFor(() => {
+      expect(screen.getByText(/0 activities|no activity/i)).toBeTruthy();
+    });
+    mockGet.mockClear();
+
+    await act(async () => {
+      emitActivity({ summary: "live-row-from-bus" });
+    });
+
+    await waitFor(() => {
+      expect(screen.getByText(/live-row-from-bus/)).toBeTruthy();
+    });
+    expect(mockGet).not.toHaveBeenCalled();
+  });
+
+  it("WS push for a different workspace is ignored", async () => {
+    render(<ActivityTab workspaceId="ws-1" />);
+    await waitFor(() => screen.getByText(/no activity/i));
+
+    await act(async () => {
+      emitActivity({
+        workspaceId: "ws-other",
+        summary: "should-not-render-other-ws",
+      });
+    });
+
+    expect(screen.queryByText(/should-not-render-other-ws/)).toBeNull();
+  });
+
+  it("WS push respects the active filter — non-matching activity_type is ignored", async () => {
+    render(<ActivityTab workspaceId="ws-1" />);
+    await waitFor(() => screen.getByText(/no activity/i));
+
+    // Apply "Tasks" filter.
+    clickButton(/tasks/i);
+    await waitFor(() => {
+      expect(
+        screen.getByRole("button", { name: /tasks/i }).getAttribute("aria-pressed"),
+      ).toBe("true");
+    });
+
+    // Push an a2a_send (does NOT match task_update filter).
+    await act(async () => {
+      emitActivity({
+        activityType: "a2a_send",
+        summary: "should-not-render-filter-mismatch",
+      });
+    });
+
+    expect(
+      screen.queryByText(/should-not-render-filter-mismatch/),
+    ).toBeNull();
+  });
+
+  it("WS push respects the active filter — matching activity_type is rendered", async () => {
+    render(<ActivityTab workspaceId="ws-1" />);
+    await waitFor(() => screen.getByText(/no activity/i));
+
+    clickButton(/tasks/i);
+    await waitFor(() => {
+      expect(
+        screen.getByRole("button", { name: /tasks/i }).getAttribute("aria-pressed"),
+      ).toBe("true");
+    });
+
+    await act(async () => {
+      emitActivity({
+        activityType: "task_update",
+        summary: "task-filter-match",
+      });
+    });
+
+    await waitFor(() => {
+      expect(screen.getByText(/task-filter-match/)).toBeTruthy();
+    });
+  });
+
+  it("WS push while autoRefresh is paused is ignored", async () => {
+    render(<ActivityTab workspaceId="ws-1" />);
+    await waitFor(() => screen.getByText(/no activity/i));
+
+    // Toggle Live → Paused.
+    clickButton(/live/i);
+    await waitFor(() => {
+      expect(screen.getByText(/Paused/)).toBeTruthy();
+    });
+
+    await act(async () => {
+      emitActivity({ summary: "should-not-render-paused" });
+    });
+
+    expect(screen.queryByText(/should-not-render-paused/)).toBeNull();
+  });
+
+  it("WS push for a row already in the list is deduped (no double-render)", async () => {
+    // Bootstrap with one row — same id as the WS push to trigger dedup.
+    mockGet.mockResolvedValueOnce([
+      makeEntry({ id: "shared-id", summary: "bootstrap-summary" }),
+    ]);
+    render(<ActivityTab workspaceId="ws-1" />);
+    await waitFor(() => {
+      expect(screen.getByText(/bootstrap-summary/)).toBeTruthy();
+    });
+    mockGet.mockClear();
+
+    // Push a row with the SAME id but a different summary — must not
+    // render the new summary; original row stays.
+    await act(async () => {
+      emitActivity({
+        id: "shared-id",
+        summary: "should-not-replace-existing",
+      });
+    });
+
+    expect(screen.queryByText(/should-not-replace-existing/)).toBeNull();
+    // Also verify count didn't grow.
+    expect(screen.getByText(/1 activities/)).toBeTruthy();
+  });
+
+  it("does NOT poll on a 5s interval after mount (post-#61)", async () => {
+    vi.useFakeTimers();
+    try {
+      render(<ActivityTab workspaceId="ws-1" />);
+      // Drain the mount-time bootstrap promise.
+      await act(async () => {
+        await Promise.resolve();
+        await Promise.resolve();
+      });
+      const callsAfterBootstrap = mockGet.mock.calls.length;
+      expect(callsAfterBootstrap).toBeGreaterThanOrEqual(1);
+
+      // Pre-#61: a 30s clock advance fires 6 more polls. Post-#61: 0.
+      await act(async () => {
+        vi.advanceTimersByTime(30_000);
+      });
+      expect(mockGet.mock.calls.length).toBe(callsAfterBootstrap);
+    } finally {
+      vi.useRealTimers();
+    }
+  });
+});
+
 // ── Suite 7: Activity count ───────────────────────────────────────────────────

 describe("ActivityTab — activity count", () => {
@@ -1,18 +1,28 @@
 // @vitest-environment jsdom
 /**
- * CommunicationOverlay tests — pin the rate-limit fix shipped 2026-05-04.
+ * CommunicationOverlay tests — pin both the 2026-05-04 fan-out cap fix
+ * AND the 2026-05-07 polling → ACTIVITY_LOGGED-subscriber refactor
+ * (issue #61 stage 1).
 *
- * The overlay polls /workspaces/:id/activity?limit=5 for each online
- * workspace. Pre-fix it (a) polled regardless of visibility and (b)
- * fanned out to 6 workspaces every 10s. With 8+ workspaces a user
- * triggered sustained 429s (server-side rate limit is 600 req/min/IP).
+ * The overlay used to poll /workspaces/:id/activity?limit=5 on a 30s
+ * interval per online workspace (capped at 3). Post-#61: it bootstraps
+ * once on mount via the same HTTP path (cap of 3 retained), then
+ * subscribes to ACTIVITY_LOGGED via the global socket bus for live
+ * updates. No interval poll.
 *
 * These tests pin:
- *  1. Fan-out cap of 3 — even with 6 online nodes, only 3 fetches
- *  2. Visibility gate — when collapsed, no polling
+ *  1. Bootstrap fan-out cap of 3 — even with 6 online nodes, only 3
+ *     HTTP fetches on mount.
+ *  2. Visibility gate — when collapsed, no HTTP fetches; re-open
+ *     re-bootstraps.
+ *  3. NO interval polling — advancing the clock past 30s does not fire
+ *     additional HTTP calls.
+ *  4. WS push extends the rendered list without firing any HTTP call.
+ *  5. WS push for an offline workspace is ignored.
+ *  6. WS push for a non-comm activity_type is ignored.
 *
- * If a future refactor pushes either dial back up, CI fails before
- * the regression hits a paying tenant.
+ * If a future refactor regresses any of these, CI fails before the
+ * regression hits a paying tenant.
 */
 import { describe, it, expect, vi, beforeEach, afterEach } from "vitest";
 import { render, cleanup, act, fireEvent } from "@testing-library/react";
@@ -23,7 +33,7 @@ vi.mock("@/lib/api", () => ({
  api: { get: vi.fn() },
 }));

-// Six online nodes — enough to verify the cap of 3.
+// Six online nodes — enough to verify the bootstrap cap of 3.
 const mockStoreState = {
  selectedNodeId: null as string | null,
  nodes: [
@@ -56,6 +66,10 @@ vi.mock("@/lib/design-tokens", () => ({
 // ── Imports (after mocks) ─────────────────────────────────────────────────────

 import { api } from "@/lib/api";
+import {
+  emitSocketEvent,
+  _resetSocketEventListenersForTests,
+} from "@/store/socket-events";
 import { CommunicationOverlay } from "../CommunicationOverlay";

 const mockGet = vi.mocked(api.get);
@@ -66,30 +80,34 @@ beforeEach(() => {
  vi.useFakeTimers();
  mockGet.mockReset();
  mockGet.mockResolvedValue([]);
+  // Drop any subscribers the previous test left on the singleton bus —
+  // each render adds one via useSocketEvent.
+  _resetSocketEventListenersForTests();
 });

 afterEach(() => {
  cleanup();
  vi.useRealTimers();
+  _resetSocketEventListenersForTests();
 });

 // ── Tests ─────────────────────────────────────────────────────────────────────

-describe("CommunicationOverlay — fan-out cap", () => {
-  it("polls at most 3 of 6 online workspaces (rate-limit floor)", async () => {
+describe("CommunicationOverlay — bootstrap fan-out cap", () => {
+  it("bootstraps at most 3 of 6 online workspaces (rate-limit floor preserved post-#61)", async () => {
    await act(async () => {
      render(<CommunicationOverlay />);
    });
-    // Mount fires the first poll synchronously (no interval tick yet).
-    // Pre-fix: 6 calls. Post-fix: 3.
+    // Mount fires the bootstrap synchronously — pre-#61 this was the
+    // first poll cycle; post-#61 it's the only HTTP fetch (live updates
+    // arrive via WS push). 6 nodes → 3 fetches.
    expect(mockGet).toHaveBeenCalledTimes(3);
-    // Verify the calls are for the FIRST 3 online nodes (slice order).
    expect(mockGet).toHaveBeenCalledWith("/workspaces/ws-1/activity?limit=5");
    expect(mockGet).toHaveBeenCalledWith("/workspaces/ws-2/activity?limit=5");
    expect(mockGet).toHaveBeenCalledWith("/workspaces/ws-3/activity?limit=5");
  });

-  it("never polls offline workspaces", async () => {
+  it("never bootstraps offline workspaces", async () => {
    await act(async () => {
      render(<CommunicationOverlay />);
    });
@@ -99,40 +117,39 @@ describe("CommunicationOverlay — fan-out cap", () => {
  });
 });

-describe("CommunicationOverlay — cadence", () => {
-  it("uses 30s interval cadence (was 10s pre-fix)", async () => {
+describe("CommunicationOverlay — no interval polling (post-#61)", () => {
+  // The pre-#61 implementation re-fetched every 30s per workspace.
+  // Post-#61 the only HTTP path is the bootstrap on mount + on
+  // visibility-toggle. This test pins the absence of any interval
+  // poll: a 60s clock advance must not produce a second round of
+  // fetches.
+  it("does NOT poll on a 30s interval after bootstrap", async () => {
    await act(async () => {
      render(<CommunicationOverlay />);
    });
-    expect(mockGet).toHaveBeenCalledTimes(3); // initial mount poll
+    expect(mockGet).toHaveBeenCalledTimes(3); // initial bootstrap
+    mockGet.mockClear();

-    // Advance 10s — pre-fix this would fire another poll. Post-fix: silent.
+    // Advance 60s — well past any plausible cadence the prior version
+    // could have used.
    await act(async () => {
-      vi.advanceTimersByTime(10_000);
+      vi.advanceTimersByTime(60_000);
    });
-    expect(mockGet).toHaveBeenCalledTimes(3);
-
-    // Advance to 30s — interval fires.
-    await act(async () => {
-      vi.advanceTimersByTime(20_000);
-    });
-    expect(mockGet).toHaveBeenCalledTimes(6); // +3 from second tick
+    expect(mockGet).not.toHaveBeenCalled();
  });
 });

 describe("CommunicationOverlay — visibility gate", () => {
-  // The visibility gate is the dial that drops collapsed-panel polling
-  // to ZERO. The cadence test above can't catch its removal — if a
-  // refactor dropped `if (!visible) return`, the cadence test would
-  // still pass because the effect would still fire every 30s.
+  // The visibility gate now does two things post-#61:
+  //   - while closed, the WS handler short-circuits (no setComms churn)
+  //   - re-opening triggers a fresh bootstrap so the list reflects
+  //     anything that happened while the panel was collapsed
  //
  // Direct probe: render with comms-returning mock so the panel
  // actually renders (close button only exists in the expanded panel,
  // not the collapsed button-state). Click close, advance the clock,
  // assert no further fetches.
-  it("stops polling after the user collapses the panel", async () => {
-    // Mock returns one a2a_send so comms.length > 0 → panel renders →
-    // close button accessible.
+  it("stops fetching while collapsed and re-bootstraps on re-open", async () => {
    mockGet.mockResolvedValue([
      {
        id: "act-1",
@@ -150,29 +167,202 @@ describe("CommunicationOverlay — visibility gate", () => {
    const { getByLabelText } = await act(async () => {
      return render(<CommunicationOverlay />);
    });
-    // Drain pending microtasks (resolves the await in fetchComms) so
-    // setComms lands and the panel renders. Don't advance time — that
-    // would fire the next interval tick and pollute the assertion.
+    // Drain pending microtasks (resolves the await in bootstrap) so
+    // setComms lands and the panel renders. Don't advance time — it's
+    // not load-bearing for the gate test, but matches the pattern used
+    // pre-#61 for stability.
    await act(async () => {
      await Promise.resolve();
      await Promise.resolve();
      await Promise.resolve();
    });
-    // Initial mount polled 3 workspaces.
-    expect(mockGet).toHaveBeenCalledTimes(3);
+    expect(mockGet).toHaveBeenCalledTimes(3); // initial bootstrap
    mockGet.mockClear();

-    // Click the close button. Synchronous getByLabelText avoids
-    // findBy's internal setTimeout (deadlocks under useFakeTimers).
+    // Click close. While closed, no fetches and no WS-driven updates.
+    const closeBtn = getByLabelText("Close communications panel");
+    await act(async () => {
+      fireEvent.click(closeBtn);
+    });
+    await act(async () => {
+      vi.advanceTimersByTime(60_000);
+    });
+    expect(mockGet).not.toHaveBeenCalled();
+
+    // Re-open via the collapsed button. Must trigger a fresh bootstrap.
+    const openBtn = getByLabelText("Show communications panel");
+    await act(async () => {
+      fireEvent.click(openBtn);
+    });
+    await act(async () => {
+      await Promise.resolve();
+      await Promise.resolve();
+    });
+    expect(mockGet).toHaveBeenCalledTimes(3); // re-bootstrap on re-open
+  });
+});
+
+describe("CommunicationOverlay — WS subscription (#61 stage 1 core)", () => {
+  // The load-bearing post-#61 behaviour. Tests in this block pin
+  // (a) that an accepted WS push DID update the rendered comms list
+  // and a filtered one did not, and (b) that NO additional HTTP call
+  // fired: the refactor exists to remove polling-driven HTTP traffic.
+  function emitActivityLogged(overrides: Partial<{
+    workspaceId: string;
+    payload: Record<string, unknown>;
+  }> = {}) {
+    emitSocketEvent({
+      event: "ACTIVITY_LOGGED",
+      workspace_id: overrides.workspaceId ?? "ws-1",
+      timestamp: new Date().toISOString(),
+      payload: {
+        id: `act-${Math.random().toString(36).slice(2)}`,
+        activity_type: "a2a_send",
+        source_id: "ws-1",
+        target_id: "ws-2",
+        summary: "live push",
+        status: "ok",
+        duration_ms: 42,
+        created_at: new Date().toISOString(),
+        ...overrides.payload,
+      },
+    });
+  }
+
+  it("WS push for a comm activity_type extends the rendered list with NO additional HTTP call", async () => {
+    const { container } = await act(async () => {
+      return render(<CommunicationOverlay />);
+    });
+    expect(mockGet).toHaveBeenCalledTimes(3); // bootstrap
+    mockGet.mockClear();
+
+    await act(async () => {
+      emitActivityLogged({ payload: { summary: "hello" } });
+    });
+    await act(async () => {
+      await Promise.resolve();
+    });
+
+    // Two pins:
+    //   1. comms list reflects the live push (look for the summary text)
+    //   2. zero HTTP fetches fired during the WS path
+    expect(container.textContent).toContain("hello");
+    expect(mockGet).not.toHaveBeenCalled();
+  });
+
+  it("WS push for an offline workspace is ignored", async () => {
+    const { container } = await act(async () => {
+      return render(<CommunicationOverlay />);
+    });
+    mockGet.mockClear();
+
+    await act(async () => {
+      emitActivityLogged({
+        workspaceId: "ws-offline",
+        payload: { source_id: "ws-offline", summary: "should-not-render" },
+      });
+    });
+    await act(async () => {
+      await Promise.resolve();
+    });
+
+    expect(container.textContent).not.toContain("should-not-render");
+    expect(mockGet).not.toHaveBeenCalled();
+  });
+
+  it("WS push for a non-comm activity_type is ignored (e.g. delegation)", async () => {
+    const { container } = await act(async () => {
+      return render(<CommunicationOverlay />);
+    });
+    mockGet.mockClear();
+
+    await act(async () => {
+      emitActivityLogged({
+        payload: {
+          activity_type: "delegation",
+          summary: "should-not-render-delegation",
+        },
+      });
+    });
+    await act(async () => {
+      await Promise.resolve();
+    });
+
+    expect(container.textContent).not.toContain("should-not-render-delegation");
+    expect(mockGet).not.toHaveBeenCalled();
+  });
+
+  it("WS push while the panel is collapsed is ignored (no churn on hidden state)", async () => {
+    // Bootstrap with one comm so the panel renders → close button
+    // accessible. Then collapse and emit a WS push: while closed, the
+    // rendered DOM must not show any push-derived text (the collapsed
+    // button shows only the count, not summaries). Re-open behaviour
+    // is pinned by the visibility-gate test, not asserted here.
+    mockGet.mockResolvedValue([
+      {
+        id: "act-bootstrap",
+        workspace_id: "ws-1",
+        activity_type: "a2a_send",
+        source_id: "ws-1",
+        target_id: "ws-2",
+        summary: "bootstrap-summary",
+        status: "ok",
+        duration_ms: 1,
+        created_at: new Date().toISOString(),
+      },
+    ]);
+    const { getByLabelText, container } = await act(async () => {
+      return render(<CommunicationOverlay />);
+    });
+    await act(async () => {
+      await Promise.resolve();
+      await Promise.resolve();
+    });
+
+    // Collapse.
    const closeBtn = getByLabelText("Close communications panel");
    await act(async () => {
      fireEvent.click(closeBtn);
    });

-    // Advance well past the 30s cadence — gate should suppress the tick.
+    // Reset the mock to return nothing from here on, so any text that
+    // shows up below can only have come from a WS push leaking through
+    // the gate, never from a refetch refilling the list.
+    mockGet.mockReset();
+    mockGet.mockResolvedValue([]);
+
    await act(async () => {
-      vi.advanceTimersByTime(60_000);
+      emitActivityLogged({
+        payload: { summary: "leaked-while-closed" },
+      });
    });
+    await act(async () => {
+      await Promise.resolve();
+    });
+
+    // Closed state: rendered DOM must not show any push-derived text.
+    expect(container.textContent).not.toContain("leaked-while-closed");
+  });
+
+  it("non-ACTIVITY_LOGGED events are ignored (e.g. WORKSPACE_OFFLINE)", async () => {
+    const { container } = await act(async () => {
+      return render(<CommunicationOverlay />);
+    });
+    mockGet.mockClear();
+
+    await act(async () => {
+      emitSocketEvent({
+        event: "WORKSPACE_OFFLINE",
+        workspace_id: "ws-1",
+        timestamp: new Date().toISOString(),
+        payload: { summary: "should-not-render-event" },
+      });
+    });
+    await act(async () => {
+      await Promise.resolve();
+    });
+
+    expect(container.textContent).not.toContain("should-not-render-event");
    expect(mockGet).not.toHaveBeenCalled();
  });
 });
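The WS-subscription tests above imply a four-part gate on incoming events. The component source is not part of this hunk, so the following is only a sketch of that gate as a pure predicate. The names `SocketEvent`, `COMM_TYPES`, and `shouldRenderPush` are hypothetical, and the contents of `COMM_TYPES` are an assumption: the tests confirm only that `a2a_send` is accepted and `delegation` is rejected.

```typescript
// Hypothetical event shape, modelled on the payloads the tests emit.
interface SocketEvent {
  event: string;
  workspace_id: string;
  payload?: Record<string, unknown>;
}

// Assumption: the set of "comm" activity types. Only "a2a_send" is
// confirmed by the tests; the real component may accept more.
const COMM_TYPES = new Set<string>(["a2a_send"]);

function shouldRenderPush(
  msg: SocketEvent,
  panelVisible: boolean,
  onlineWorkspaceIds: ReadonlySet<string>,
): boolean {
  // Collapsed panel: drop the push so hidden state does not churn.
  if (!panelVisible) return false;
  // Only activity-log events carry comm rows.
  if (msg.event !== "ACTIVITY_LOGGED") return false;
  // Pushes from workspaces outside the online set are ignored.
  if (!onlineWorkspaceIds.has(msg.workspace_id)) return false;
  // Finally, the activity type must be a comm type.
  const activityType = msg.payload?.activity_type;
  return typeof activityType === "string" && COMM_TYPES.has(activityType);
}
```

Keeping the gate as a pure function would let each rejection reason be unit-tested without rendering the overlay, though the real handler may well inline these checks.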
@@ -1,8 +1,9 @@
 "use client";

-import { useState, useEffect, useCallback } from "react";
+import { useState, useEffect, useCallback, useRef } from "react";
 import { api } from "@/lib/api";
 import { ConversationTraceModal } from "@/components/ConversationTraceModal";
+import { useSocketEvent } from "@/hooks/useSocketEvent";
 import { type ActivityEntry } from "@/types/activity";
 import { useWorkspaceName } from "@/hooks/useWorkspaceName";
 import { inferA2AErrorHint } from "./chat/a2aErrorHint";
@@ -48,6 +49,15 @@ export function ActivityTab({ workspaceId }: Props) {
  const [traceOpen, setTraceOpen] = useState(false);
  const resolveName = useWorkspaceName();

+  // Refs let the WS handler read the latest filter / autoRefresh
+  // selection without re-subscribing on every state change. The bus
+  // listener is registered exactly once per mount via useSocketEvent's
+  // ref-internal pattern; subscriber-side filtering reads from these.
+  const filterRef = useRef(filter);
+  filterRef.current = filter;
+  const autoRefreshRef = useRef(autoRefresh);
+  autoRefreshRef.current = autoRefresh;
+
  const loadActivities = useCallback(async () => {
    try {
      const typeParam = filter !== "all" ? `?type=${filter}` : "";
@@ -66,11 +76,58 @@ export function ActivityTab({ workspaceId }: Props) {
    loadActivities();
  }, [loadActivities]);

-  useEffect(() => {
-    if (!autoRefresh) return;
-    const interval = setInterval(loadActivities, 5000);
-    return () => clearInterval(interval);
-  }, [loadActivities, autoRefresh]);
+  // Live-update path (issue #61 stage 3, replaces the 5s setInterval).
+  // ACTIVITY_LOGGED events from this workspace prepend to the rendered
+  // list — dedup by id so a server-side update + a poll reply don't
+  // double-render the same row.
+  //
+  // Honours the user's autoRefresh toggle: when paused, live updates
+  // are dropped until the user re-enables Live (or hits Refresh, which
+  // re-bootstraps via loadActivities).
+  //
+  // Filter awareness: matches the server-side `?type=<filter>`
+  // semantics so the panel doesn't show rows the user excluded.
+  useSocketEvent((msg) => {
+    if (!autoRefreshRef.current) return;
+    if (msg.event !== "ACTIVITY_LOGGED") return;
+    if (msg.workspace_id !== workspaceId) return;
+
+    const p = (msg.payload || {}) as Record<string, unknown>;
+    const activityType = (p.activity_type as string) || "";
+
+    const f = filterRef.current;
+    if (f !== "all" && activityType !== f) return;
+
+    const entry: ActivityEntry = {
+      id:
+        (p.id as string) ||
+        `ws-push-${msg.timestamp || Date.now()}-${msg.workspace_id}`,
+      workspace_id: msg.workspace_id,
+      activity_type: activityType,
+      source_id: (p.source_id as string | null) ?? null,
+      target_id: (p.target_id as string | null) ?? null,
+      method: (p.method as string | null) ?? null,
+      summary: (p.summary as string | null) ?? null,
+      request_body: (p.request_body as Record<string, unknown> | null) ?? null,
+      response_body:
+        (p.response_body as Record<string, unknown> | null) ?? null,
+      duration_ms: (p.duration_ms as number | null) ?? null,
+      status: (p.status as string) || "ok",
+      error_detail: (p.error_detail as string | null) ?? null,
+      created_at:
+        (p.created_at as string) ||
+        msg.timestamp ||
+        new Date().toISOString(),
+    };
+
+    setActivities((prev) => {
+      // Dedup by id — a row that arrived via the bootstrap fetch and
+      // also fires ACTIVITY_LOGGED from a delayed server-side hook
+      // must render exactly once.
+      if (prev.some((e) => e.id === entry.id)) return prev;
+      return [entry, ...prev];
+    });
+  });

  return (
    <div className="flex flex-col h-full">
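The dedup-and-prepend step inside the `setActivities` updater in this hunk can be isolated as a small pure helper. This is a sketch only; `prependUnique` is a hypothetical name, not a function in this PR.

```typescript
// Prepend `entry` unless a row with the same id is already present.
// Returning `prev` itself (same reference) on a duplicate lets a React
// state setter bail out without scheduling a re-render.
function prependUnique<T extends { id: string }>(
  prev: readonly T[],
  entry: T,
): readonly T[] {
  if (prev.some((e) => e.id === entry.id)) return prev;
  return [entry, ...prev];
}

// Usage: a bootstrap row followed by a WS push for the same id, then a new one.
const bootstrap = [{ id: "act-1", summary: "from fetch" }];
const afterDuplicate = prependUnique(bootstrap, { id: "act-1", summary: "from WS" });
const afterNew = prependUnique(bootstrap, { id: "act-2", summary: "live push" });
console.log(afterDuplicate === bootstrap, afterNew.map((r) => r.id));
```

The scan is linear in the list length per push, which seems acceptable for a panel-sized list; a `Set` of seen ids would be one alternative if the list were to grow large.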
@@ -7,6 +7,32 @@ export default defineConfig({
  test: {
    environment: 'node',
    exclude: ['e2e/**', 'node_modules/**', '**/dist/**'],
+    // CI-conditional test timeout (issue #96).
+    //
+    // Vitest's 5000ms default is too tight for the first test in any
+    // file under our CI shape: `npx vitest run --coverage` on the
+    // self-hosted Gitea Actions Docker runner. The cold-start cost
+    // (v8 coverage instrumentation init + JSDOM bootstrap + module-
+    // graph import for @/components/* and @/lib/* + first React
+    // render) consistently consumes 5-7 seconds for the first
+    // synchronous test in heavyweight component files
+    // (ActivityTab.test.tsx, CreateWorkspaceDialog.test.tsx,
+    // ConfigTab.provider.test.tsx) — even though every subsequent
+    // test in the same file completes in 100-1500ms.
+    //
+    // Empirically the worst observed first-test was 6453ms in a
+    // single file (CreateWorkspaceDialog). 30000ms gives ~5x
+    // headroom over that on CI; we still keep 5000ms locally so
+    // genuine waitFor races / hung promises stay sensitive in dev.
+    //
+    // Same vitest pattern documented at:
+    //   https://vitest.dev/config/testtimeout
+    //   https://vitest.dev/guide/coverage#profiling-test-performance
+    //
+    // Per-test duration is still emitted to the CI log; if a test
+    // ever silently approaches 25-30s under this raised ceiling that
+    // will surface as a duration regression and we revisit.
+    testTimeout: process.env.CI ? 30000 : 5000,
    // Coverage is instrumented but NOT yet a CI gate — first land
    // observability so we can see the baseline, then dial in
    // thresholds + a hard gate in a follow-up PR (#1815). Today's
@@ -0,0 +1,43 @@
+# docker-compose.dev.yml — overlay over docker-compose.yml for local dev
+# with air-driven live reload of the platform (workspace-server) service.
+#
+# Usage:
+#   docker compose -f docker-compose.yml -f docker-compose.dev.yml up
+#   (or `make dev` shorthand from repo root)
+#
+# What this overlay changes vs docker-compose.yml alone:
+#   - Platform service uses workspace-server/Dockerfile.dev (air on top of
+#     golang:1.25-alpine) instead of the multi-stage prod Dockerfile.
+#   - Platform service bind-mounts the host's workspace-server/ source
+#     into /app/workspace-server so air sees source edits live.
+#   - Other services (postgres, redis, langfuse, etc.) inherit unchanged
+#     from docker-compose.yml.
+#
+# What stays the same:
+#   - All env vars, volumes, depends_on, healthchecks from docker-compose.yml.
+#   - Network topology + ports.
+#   - Postgres/Redis as service containers (no in-process replacements).
+
+services:
+  platform:
+    build:
+      context: .
+      dockerfile: workspace-server/Dockerfile.dev
+    # Rebind source: edits under host's workspace-server/ propagate live.
+    # The named volume on go-build-cache speeds up first build per container.
+    volumes:
+      - ./workspace-server:/app/workspace-server
+      - go-build-cache:/root/.cache/go-build
+      - go-mod-cache:/go/pkg/mod
+    # Air signals the running binary on rebuild; ensure shell stops cleanly.
+    init: true
+    # Mark the service as dev-mode so the platform can short-circuit any
+    # behavior that's incompatible with hot-reload (e.g. background
+    # cron-style watchers that don't survive process restart). No-op
+    # today; reserved for future flag use.
+    environment:
+      MOLECULE_DEV_HOT_RELOAD: "1"
+
+volumes:
+  go-build-cache:
+  go-mod-cache:
@@ -0,0 +1,74 @@
+# ADR-002: Local-build mode signalled by `MOLECULE_IMAGE_REGISTRY` presence
+
+* Status: Accepted (2026-05-07)
+* Issue: #63 (closes Task #194)
+* Decision: Hongming (CTO) + Claude Opus 4.7 (implementation)
+
+## Context
+
+Before 2026-05-06, Molecule deployments pulled `workspace-template-*` container images from `ghcr.io/molecule-ai/` by default. Production tenants overrode that default by setting `MOLECULE_IMAGE_REGISTRY` to an AWS ECR mirror via Railway env / EC2 user-data; OSS contributor laptops left it unset and pulled from the upstream GHCR org.
+
+On 2026-05-06 the `Molecule-AI` GitHub org was suspended (saved memory: `feedback_github_botring_fingerprint`). GHCR now returns **403 Forbidden** for every `molecule-ai/workspace-template-*` manifest. OSS contributors who clone `molecule-core` and run `go run ./workspace-server/cmd/server` cannot provision a workspace — every first provision fails with:
+
+```
+docker image "ghcr.io/molecule-ai/workspace-template-claude-code:latest" not found after pull attempt
+```
+
+Production tenants are unaffected (their `MOLECULE_IMAGE_REGISTRY` points at ECR, which we still control), but OSS onboarding is broken. Workspace template repos are intentionally separate from `molecule-core` (each runtime is OSS-shape and forkable), and they are mirrored to Gitea (`https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-<runtime>`) — but the provisioner has no path that consumes Gitea source directly.
+
+## Decision
+
+When `MOLECULE_IMAGE_REGISTRY` is **unset** (or empty), the provisioner switches to a **local-build mode** that:
+
+1. Looks up the workspace-template repo's HEAD sha on Gitea via a single API call.
+2. Checks whether a SHA-pinned local image (`molecule-local/workspace-template-<runtime>:<sha12>`) already exists; if so, reuses it.
+3. Otherwise shallow-clones the repo into `~/.cache/molecule/workspace-template-build/<runtime>/<sha12>/` and runs `docker build --platform=linux/amd64 -t <tag> .`.
+4. Hands the SHA-pinned tag to Docker for ContainerCreate, bypassing the registry-pull path entirely.
+
+When `MOLECULE_IMAGE_REGISTRY` is **set**, behavior is unchanged: pull the image from that registry. Existing prod tenants and self-hosters who mirror to a private registry are not affected.
+
+## Consequences
+
+### Positive
+
+* **Zero-config OSS onboarding** — `git clone molecule-core && go run ./workspace-server/cmd/server` boots end-to-end without any registry credentials.
+* **Production tenants protected** — same env var, same semantics in SaaS-mode. Migration is a no-op.
+* **No new env var** — extending an existing var's semantics ("where to pull, OR build locally if absent") rather than introducing `MOLECULE_LOCAL_BUILD=1` keeps the surface small.
+* **SHA-pinned cache** — repeat builds are O(API-call); only template-repo HEAD changes invalidate.
+* **Production-parity image** — amd64 emulation on Apple Silicon honours `feedback_local_must_mimic_production`. The provisioner's existing `defaultImagePlatform()` already forces amd64 for parity; building amd64 locally lets that decision stay consistent.
+
+### Negative
+
+* **Conflates two concerns** — `MOLECULE_IMAGE_REGISTRY` now signals BOTH "where to pull" AND "build locally if absent." A future operator who unsets it expecting a hard error will instead get a slow first-provision. Documented in the runbook.
+* **First-provision is slow on Apple Silicon** — 5–10 min via QEMU emulation on the cold path. Mitigated by SHA-cache (subsequent runs are <1s lookup + 0s build).
+* **Coverage gap** — only 4 of 9 runtimes are mirrored to Gitea today (`claude-code`, `hermes`, `langgraph`, `autogen`). The other 5 fail with an actionable "not mirrored" error. Mirroring those repos is a separate task.
+* **Implicit trust boundary** — operator running `go run` implicitly trusts `molecule-ai/molecule-ai-workspace-template-*` repos on Gitea. This is the same trust they would extend to the GHCR images today; not a new attack surface.
+
+## Alternatives considered
+
+1. **New env var `MOLECULE_LOCAL_BUILD=1`** — explicit, but requires OSS contributors to know it exists. Violates the zero-config goal.
+2. **Push pre-built images to a Gitea container registry, mirror tag from upstream** — operationally cleaner but: (a) Gitea's container-registry add-on isn't deployed on the operator host, (b) defeats the OSS-contributor goal of "hack on the source, see your changes," since they'd still pull a stale image.
+3. **Embed Dockerfiles in molecule-core itself, drop the standalone template repos** — would work but breaks the OSS-shape principle; templates are intentionally separable, anyone-can-fork artifacts.
+4. **Build native arch on Apple Silicon (arm64) and drop the platform pin in local-mode** — fast, but creates `linux/arm64` images that diverge from the amd64-only prod runtime. Local-vs-prod debug behavior would diverge. Rejected per `feedback_local_must_mimic_production`.
+
+## Security review
+
+* **Gitea repo URL allowlist** — runtime name must be in the `knownRuntimes` allowlist (defence-in-depth against a future code path that lets cfg.Runtime carry untrusted input). Repo prefix is hardcoded to `https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-`; forks can override via `MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX` (opt-in, default off).
+* **Token handling** — clones are anonymous over HTTPS by default (templates are public). `MOLECULE_GITEA_TOKEN`, if set, is passed via URL userinfo for the clone and as `Authorization: token` for the API call. The token is **masked in every log line** via `maskTokenInURL` / `maskTokenInString` and never appears in the cache dir path.
+* **No silent fallback** — if Gitea is unreachable or the runtime isn't mirrored, we return a clear error mentioning the repo URL and the missing runtime. We **never** fall back to GHCR/ECR (that would be a confusing bug for an OSS contributor who happened to have stale ECR creds in their docker config).
+* **Build-arg injection** — `docker build` is invoked with NO `--build-arg` from external input. Dockerfile is consumed as-is.
+* **Cache poisoning** — cache key is the Gitea HEAD sha + Dockerfile content; a force-push to the template repo's main branch regenerates the key on next run. Cache dir is per-user (`$HOME/.cache`), so cross-user attacks aren't relevant in single-user dev mode.
+
+## Versioning + back-compat
+
+* Existing prod tenants set `MOLECULE_IMAGE_REGISTRY=<ECR url>` → unchanged behavior.
+* Existing local installs that set the var → unchanged behavior.
+* Existing local installs that don't set it → switch to local-build path. Migration: none required (additive); first provision will take 5–10 min instead of failing.
+* No deprecations.
+
+## References
+
+* Issue #63 — feat(workspace-server): local-dev provisioner builds from Gitea source
+* Saved memory `feedback_local_must_mimic_production` — local docker must mimic prod, no bypasses
+* Saved memory `reference_post_suspension_pipeline` — full post-2026-05-06 stack shape
+* Saved memory `feedback_github_botring_fingerprint` — what got the org suspended
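The mode switch described in ADR-002's Decision section can be sketched as a pure function. The real provisioner is Go and is not shown in this diff, so this TypeScript version is illustrative only. `resolveImageSource`, `ImageSource`, and `MIRRORED_RUNTIMES` are hypothetical names, the error wording is invented for the example, and treating `<sha12>` as the first 12 characters of the HEAD sha is an assumption.

```typescript
// Runtimes the ADR lists as mirrored to Gitea today.
const MIRRORED_RUNTIMES = new Set<string>(["claude-code", "hermes", "langgraph", "autogen"]);

type ImageSource =
  | { mode: "registry-pull"; registry: string }
  | { mode: "local-build"; tag: string };

function resolveImageSource(
  registryEnv: string | undefined,
  runtime: string,
  headSha: string,
): ImageSource {
  // Set and non-empty: unchanged behaviour, pull from that registry.
  if (registryEnv !== undefined && registryEnv !== "") {
    return { mode: "registry-pull", registry: registryEnv };
  }
  // Unset or empty: local-build mode. No silent fallback to a registry.
  if (!MIRRORED_RUNTIMES.has(runtime)) {
    throw new Error(`runtime "${runtime}" is not mirrored to Gitea`);
  }
  // Assumption: <sha12> is the first 12 characters of the commit sha.
  const sha12 = headSha.slice(0, 12);
  return {
    mode: "local-build",
    tag: `molecule-local/workspace-template-${runtime}:${sha12}`,
  };
}
```

Because the tag embeds the sha, checking whether that tag already exists locally doubles as the cache lookup: a new HEAD yields a new tag and therefore a cache miss.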
@@ -2,7 +2,7 @@

 **Status:** living document — update when you ship a feature that touches one backend.
 **Owner:** workspace-server + controlplane teams.
-**Last audit:** 2026-05-05 (Claude agent — `provisionWorkspaceAuto` / `StopWorkspaceAuto` / `HasProvisioner` SoT pattern landed in PRs #2811 + #2824).
+**Last audit:** 2026-05-07 (plugin install/uninstall closed for EC2 backend via EIC SSH push to the bind-mounted `/configs/plugins/<name>/`, mirroring the Files API PR #1702 pattern).

 ## Why this exists

@@ -54,7 +54,7 @@ For "do we have any backend?", use `HasProvisioner()`, never bare `h.provisioner
 | **Files API** | | | | |
 | List / Read / Write / Replace / Delete | `container_files.go`, `template_import.go` | `docker exec` + tar `CopyToContainer` | SSH via EIC tunnel (PR #1702) | ✅ parity as of 2026-04-22 (previously docker-only) |
 | **Plugins** | | | | |
-| Install / uninstall / list | `plugins_install.go` | `deliverToContainer()` + volume rm | **gap — no live plugin delivery** | 🔴 **docker-only** |
+| Install / uninstall / list | `plugins_install.go` + `plugins_install_eic.go` | `deliverToContainer()` → exec+`CopyToContainer` on local container | `instance_id` set → EIC SSH push of the staged tarball into the EC2's bind-mounted `/configs/plugins/<name>/` (per `workspaceFilePathPrefix`), `chown 1000:1000`, restart | ✅ parity |
 | **Terminal (WebSocket)** | | | | |
 | Dispatch | `terminal.go:90-105` | `instance_id=""` → `handleLocalConnect` → `docker attach` | `instance_id` set → `handleRemoteConnect` → EIC SSH + `docker exec` | ✅ parity (different implementations, same UX) |
 | **A2A proxy** | | | | |
@@ -10,7 +10,7 @@ tags: [platform, fly.io, deployment, infrastructure]

 Your infrastructure choice just got decoupled from your agent platform choice. Molecule AI now ships three production-ready workspace backends — `docker`, `flyio`, and `controlplane` — and switching between them takes a single environment variable. Your agent code, model choices, and workspace topology stay exactly the same.

-This post covers what shipped in [PR #501](https://github.com/Molecule-AI/molecule-core/pull/501) (Fly Machines provisioner) and [PR #503](https://github.com/Molecule-AI/molecule-core/pull/503) (control plane provisioner), and which backend fits your situation.
+This post covers what shipped in [PR #501](https://git.moleculesai.app/molecule-ai/molecule-core/pull/501) (Fly Machines provisioner) and [PR #503](https://git.moleculesai.app/molecule-ai/molecule-core/pull/503) (control plane provisioner), and which backend fits your situation.

 ## Before: One Deployment Model for Every Use Case

@@ -107,4 +107,4 @@ No changes to agent code, tool definitions, or orchestration logic. Swap `CONTAI

 ---

-*[PR #501](https://github.com/Molecule-AI/molecule-core/pull/501) (Fly Machines provisioner) and [PR #503](https://github.com/Molecule-AI/molecule-core/pull/503) (control plane provisioner) are both merged to `main`. Molecule AI is open source — contributions welcome.*
+*[PR #501](https://git.moleculesai.app/molecule-ai/molecule-core/pull/501) (Fly Machines provisioner) and [PR #503](https://git.moleculesai.app/molecule-ai/molecule-core/pull/503) (control plane provisioner) are both merged to `main`. Molecule AI is open source — contributions welcome.*
@@ -27,7 +27,7 @@ The biggest user-facing change: every Molecule AI org can now mint named, revoca

 → [User guide: Organization API Keys](/docs/guides/org-api-keys.md)
 → [Architecture: Org API Keys](/docs/architecture/org-api-keys.md)
-→ PRs: [#1105](https://github.com/Molecule-AI/molecule-core/pull/1105), [#1107](https://github.com/Molecule-AI/molecule-core/pull/1107), [#1109](https://github.com/Molecule-AI/molecule-core/pull/1109), [#1110](https://github.com/Molecule-AI/molecule-core/pull/1110)
+→ PRs: [#1105](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1105), [#1107](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1107), [#1109](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1109), [#1110](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1110)

 ---

@@ -48,7 +48,7 @@ AdminAuth now accepts a session-verification tier that runs **before** the beare
 **Self-hosted / local dev:** `CP_UPSTREAM_URL` is unset → this feature is disabled, behaviour is unchanged.

 → [Guide: Same-Origin Canvas Fetches & Session Auth](/docs/guides/same-origin-canvas-fetches.md)
-→ PRs: [#1099](https://github.com/Molecule-AI/molecule-core/pull/1099), [#1100](https://github.com/Molecule-AI/molecule-core/pull/1100)
+→ PRs: [#1099](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1099), [#1100](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1100)

 ---

@@ -87,7 +87,7 @@ The proxy is **fail-closed**: only an explicit allowlist of paths (`/cp/auth/`,
 This is also the structural fix for the lateral-movement risk that session auth introduced: without the allowlist, a tenant-authed browser user could have proxied `/cp/admin/*` requests upstream and exploited the fact that those endpoints accept WorkOS session cookies. The allowlist makes that impossible by construction.

 → [Guide: Same-Origin Canvas Fetches & Session Auth](/docs/guides/same-origin-canvas-fetches.md)
-→ PR: [#1095](https://github.com/Molecule-AI/molecule-core/pull/1095)
+→ PR: [#1095](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1095)

 ---

@@ -99,7 +99,7 @@ The waitlist itself is a Canvas-administered list with email hashing in audit lo

 This is the operational surface that makes the above security work matter: the beta is invitation-only, credentials are scoped, and every admin action is auditable.

-→ Control plane PRs [#145](https://github.com/Molecule-AI/molecule-controlplane/pull/145), [#148](https://github.com/Molecule-AI/molecule-controlplane/pull/148), [#150](https://github.com/Molecule-AI/molecule-controlplane/pull/150)
+→ Control plane PRs [#145](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pull/145), [#148](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pull/148), [#150](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pull/150)

 ---

@@ -12,7 +12,7 @@ Your team is in Discord. Your AI agents are in Molecule AI. Until today, those t

 That's now one webhook URL.

-Molecule AI workspaces can now connect to Discord. Here's what shipped in [PR #656](https://github.com/Molecule-AI/molecule-core/pull/656).
+Molecule AI workspaces can now connect to Discord. Here's what shipped in [PR #656](https://git.moleculesai.app/molecule-ai/molecule-core/pull/656).

 ---

@@ -70,7 +70,7 @@ For inbound slash commands, point your Discord app's **Interactions Endpoint URL

 ## Security: Webhook Tokens Don't Appear in Logs

-Webhook URLs contain a token (`/webhooks/{id}/{token}`). If that token leaks into server logs, it's a rotation event. The Discord adapter is explicit about this: HTTP request errors are logged without the URL, and the adapter returns a generic error message. This was hardened in [PR #659](https://github.com/Molecule-AI/molecule-core/pull/659).
+Webhook URLs contain a token (`/webhooks/{id}/{token}`). If that token leaks into server logs, it's a rotation event. The Discord adapter is explicit about this: HTTP request errors are logged without the URL, and the adapter returns a generic error message. This was hardened in [PR #659](https://git.moleculesai.app/molecule-ai/molecule-core/pull/659).

 ---

@@ -97,4 +97,4 @@ Documentation: [Social Channels guide](/docs/agent-runtime/social-channels#disco

 ---

-*Discord adapter shipped in [PR #656](https://github.com/Molecule-AI/molecule-core/pull/656). Security hardening in [PR #659](https://github.com/Molecule-AI/molecule-core/pull/659). Molecule AI is open source — contributions welcome.*
+*Discord adapter shipped in [PR #656](https://git.moleculesai.app/molecule-ai/molecule-core/pull/656). Security hardening in [PR #659](https://git.moleculesai.app/molecule-ai/molecule-core/pull/659). Molecule AI is open source — contributions welcome.*
@@ -1,5 +1,41 @@
 # Local Development

+## Workspace Template Images: Local-Build Mode (Issue #63)
+
+OSS contributors who run `molecule-core` locally do **not** need to authenticate to GHCR or AWS ECR. When the `MOLECULE_IMAGE_REGISTRY` env var is **unset**, the platform automatically:
+
+1. Looks up the HEAD sha of `https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-<runtime>` (single API call, no clone).
+2. If a local image tagged `molecule-local/workspace-template-<runtime>:<sha12>` already exists, reuses it (cache hit).
+3. Otherwise, shallow-clones the repo into `~/.cache/molecule/workspace-template-build/<runtime>/<sha12>/` and runs `docker build --platform=linux/amd64 -t <tag> .`.
+4. Hands the SHA-pinned tag to Docker for `ContainerCreate`.
+
+**First-provision build time:** 5–10 min on Apple Silicon (amd64 emulation). Subsequent provisions hit the cache and start in seconds. Cache is invalidated automatically when the template repo's HEAD moves.
+
+**Currently mirrored on Gitea:** `claude-code`, `hermes`, `langgraph`, `autogen`. Other runtimes (`crewai`, `deepagents`, `codex`, `gemini-cli`, `openclaw`) fail with an actionable "not mirrored to Gitea" error pointing at the missing repo.
+
+**Production tenants are unaffected** — every prod tenant sets `MOLECULE_IMAGE_REGISTRY` to its private ECR mirror via Railway env / EC2 user-data, so the SaaS pull path stays identical.
+
+### Environment overrides
+
+| Var | Default | Use case |
+|-----|---------|----------|
+| `MOLECULE_IMAGE_REGISTRY` | (unset) | Set to a real registry URL to switch from local-build to SaaS-pull mode. |
+| `MOLECULE_LOCAL_BUILD_CACHE` | `~/.cache/molecule/workspace-template-build` | Override cache directory. |
+| `MOLECULE_LOCAL_TEMPLATE_REPO_PREFIX` | `https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-` | Point at a fork. |
+| `MOLECULE_GITEA_TOKEN` | (unset) | Required only if your fork has private template repos. |
+
+### Verifying a switch from the GHCR-retag stopgap
+
+Pre-fix, OSS contributors worked around the suspended GHCR org by manually retagging a `:latest` image. After this change, that workaround is **redundant**: simply unset `MOLECULE_IMAGE_REGISTRY` (or leave it unset), boot the platform, and provision a workspace. Logs will show:
+
+```
+Provisioner: local-build mode → using locally-built image molecule-local/workspace-template-claude-code:<sha12> for runtime claude-code
+local-build: cloning https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-claude-code → ...
+local-build: docker build done in <duration>
+```
+
+If you still see `ghcr.io/molecule-ai/...` in the boot log, double-check `env | grep MOLECULE_IMAGE_REGISTRY` — a stale shell export from the pre-fix workaround could keep SaaS-mode active.
+
 ## Starting the Stack

 ```bash
@@ -3,8 +3,8 @@
 **Date:** 2026-04-23
 **Severity:** High — every new SaaS tenant blocked
 **Detection path:** E2E Staging SaaS run 24848425822 failed at "tenant provisioning"; investigation of CP Railway logs surfaced the auth mismatch.
-**Status:** Fix pushed on [molecule-controlplane#238](https://github.com/Molecule-AI/molecule-controlplane/pull/238).
-**Related:** [issue #239](https://github.com/Molecule-AI/molecule-controlplane/issues/239) (Cloudflare DNS record quota), [testing-strategy.md](../engineering/testing-strategy.md)
+**Status:** Fix pushed on [molecule-controlplane#238](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pull/238).
+**Related:** [issue #239](https://git.moleculesai.app/molecule-ai/molecule-controlplane/issues/239) (Cloudflare DNS record quota), [testing-strategy.md](../engineering/testing-strategy.md)

 ## Summary

@@ -35,7 +35,7 @@ The flow was:

 ### The commit that introduced the bug

-[molecule-controlplane#235](https://github.com/Molecule-AI/molecule-controlplane/pull/235) — "fix(provision): wait for tenant boot-event before falling back to canary". Merged 2026-04-22.
+[molecule-controlplane#235](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pull/235) — "fix(provision): wait for tenant boot-event before falling back to canary". Merged 2026-04-22.

 Before #235, readiness was determined via a canary probe through Cloudflare's edge — which didn't need CP-side auth, so the INSERT ordering didn't matter. #235 made boot-events the primary readiness signal but didn't move the INSERT earlier. The race was latent before but became load-bearing after.

@@ -90,7 +90,7 @@ bootReady, _ := provisioner.WaitForTenantReady(ctx, h.db, org.ID, 4*time.Minute)
 h.db.ExecContext(ctx, `UPDATE org_instances SET status = 'running' WHERE org_id = $1`, org.ID)
 ```

-See [molecule-controlplane#238](https://github.com/Molecule-AI/molecule-controlplane/pull/238) for the full diff.
+See [molecule-controlplane#238](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pull/238) for the full diff.

 ## Lessons

@@ -122,9 +122,9 @@ Early investigation blamed the hermes provider 401 bug (a separate, known issue

 ## Follow-ups

-- [ ] Land [molecule-controlplane#238](https://github.com/Molecule-AI/molecule-controlplane/pull/238)
+- [ ] Land [molecule-controlplane#238](https://git.moleculesai.app/molecule-ai/molecule-controlplane/pull/238)
 - [ ] Redeploy staging-api, verify E2E goes green
 - [ ] Add CP integration test suite (see lesson #2)
 - [ ] Wire E2E failure → notification (see lesson #3)
 - [ ] Add invariant comment in `provisionTenant` (see lesson #4)
-- [ ] Cloudflare DNS quota cleanup — [molecule-controlplane#239](https://github.com/Molecule-AI/molecule-controlplane/issues/239)
+- [ ] Cloudflare DNS quota cleanup — [molecule-controlplane#239](https://git.moleculesai.app/molecule-ai/molecule-controlplane/issues/239)
@@ -138,5 +138,5 @@ If you see any of these, don't try to "clean it up in place" — **cherry-pick o

 ## Related

-- [Issue #1822](https://github.com/Molecule-AI/molecule-core/issues/1822) — backend parity drift tracker (example of docs that have to stay current)
+- [Issue #1822](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1822) — backend parity drift tracker (example of docs that have to stay current)
 - [Postmortem: CP boot-event 401](./postmortem-2026-04-23-boot-event-401.md) — caught before shipping because a reviewer could read the diff
@@ -0,0 +1,147 @@
+# Rate-limit observability runbook
+
+> Companion to issue #64 ("RATE_LIMIT default re-tune analysis"). After
+> #60 deployed the per-tenant `keyFor` keying, the right RATE_LIMIT
+> default became data-dependent. This runbook documents the metrics +
+> queries an operator should run to confirm whether the current 600
+> req/min/key default is correct, too tight, or too loose.
+
+## What's already exposed
+
+The workspace-server's existing Prometheus middleware
+(`workspace-server/internal/metrics/metrics.go`) tracks every request
+on every path:
+
+```
+molecule_http_requests_total{method, path, status}      counter
+molecule_http_request_duration_seconds_total{method,path,status}  counter
+```
+
+Path is the matched route pattern (`/workspaces/:id/activity` etc), so
+high-cardinality workspace UUIDs do not explode the label space.
+
+The rate limiter middleware (#60, `workspace-server/internal/middleware/ratelimit.go`)
+also stamps every response with `X-RateLimit-Limit`, `X-RateLimit-Remaining`,
+and `X-RateLimit-Reset`. Operators with browser-side or proxy-side
+header capture can read per-request bucket state directly.
+
+No new instrumentation is needed for #64's acceptance criteria. The
+metric surface is sufficient — this runbook just collects the queries.
+
+## Queries to run after #60 deploys
+
+### 1. Is the bucket actually firing 429s?
+
+```promql
+sum(rate(molecule_http_requests_total{status="429"}[5m]))
+```
+
+If this is zero on a given tenant, the bucket isn't being hit. If it's
+sustained > 1/min, dig in.
+
+### 2. Which routes attract 429s?
+
+```promql
+topk(
+  10,
+  sum by (path) (
+    rate(molecule_http_requests_total{status="429"}[5m])
+  )
+)
+```
+
+Expected shape post-#60:
+- `/workspaces/:id/activity` should be near zero — the canvas no longer
+  polls it on a 30s/60s/5s cadence (PRs #69 / #71 / #76).
+- Probe / health / heartbeat paths should be ~0 (those routes have a
+  separate IP-fallback bucket).
+
+If `/workspaces/:id/activity` 429s persist after PRs #69, #71, and #76 are
+deployed, the canvas isn't running the WS-subscriber path; investigate
+WS health on that tenant.
+
+### 3. Per-bucket-key inference (no direct exposure today)
+
+The bucket map itself is in-memory only; we deliberately do **not**
+expose `org:<uuid>` ↔ remaining-tokens because that map can include
+SHA-256 hashes of bearer tokens. A tenant that wants per-key visibility
+should rely on response headers (`X-RateLimit-Remaining` on every
+response from a given session is the bucket's view of that session).
+
+If you genuinely need server-side per-bucket counts for triage,
+file a follow-up — the proper shape is a `/internal/ratelimit-stats`
+endpoint that emits **counts per key prefix only** (e.g. `org:`, `tok:`,
+`ip:`), never the key payloads. Don't roll that ad-hoc; it's a security
+review surface.
+
+## Decision tree for the re-tune
+
+After 14 days of production traffic on a tenant, look at the queries
+above and walk this tree:
+
+```
+Q1: Is the 429 rate sustained > 0.1/sec on any tenant?
+  ├─ NO  → The 600 default has comfortable headroom. Either keep it,
+  │        or lower it carefully (300) ONLY if you have a documented
+  │        reason (e.g. a misbehaving client we want to throttle harder).
+  │        Default to "no change" — see #64 for the math.
+  └─ YES → Q2.
+
+Q2: Is the 429 rate concentrated on ONE tenant or spread across many?
+  ├─ ONE tenant → Operator override: set RATE_LIMIT=1200 or 1800 on that
+  │               tenant's box. Document in the tenant's ops note. The
+  │               default does not need to change.
+  └─ MANY tenants → Q3.
+
+Q3: Are the 429s on a route that polls (e.g. /activity / /peers)?
+  ├─ YES → Confirm PRs #69, #71, #76 have actually deployed to those
+  │         tenants. If they have and 429s persist, the canvas may have
+  │         a regression — do not raise RATE_LIMIT. File a canvas issue.
+  └─ NO  → 429s on mutating routes mean genuine load. Raise the default
+            to 1200 in `workspace-server/internal/router/router.go:54`.
+            Same PR should attach: the metric chart, the time window,
+            and a paragraph explaining what changed in our traffic shape.
+```
+
+## Alert rule template (drop-in for Prometheus)
+
+```yaml
+# Sustained 429s: this rule is the SLO trip-wire. If it fires, walk the
+# decision tree above. NB: the issue #64 acceptance criterion is "two
+# weeks of metrics"; this alert is the inverse: it tells you something
+# changed before the two weeks are up.
+groups:
+  - name: workspace-server-ratelimit
+    rules:
+      - alert: WorkspaceServerRateLimit429Sustained
+        expr: |
+          sum by (instance) (
+            rate(molecule_http_requests_total{status="429"}[10m])
+          ) > 0.1
+        for: 30m
+        labels:
+          severity: warning
+          owner: workspace-server
+        annotations:
+          summary: "{{ $labels.instance }} sustained 429s — see ratelimit-observability runbook"
+          runbook: "https://git.moleculesai.app/molecule-ai/molecule-core/blob/main/docs/engineering/ratelimit-observability.md"
+```
+
+Threshold rationale: 0.1 req/s = 6/min sustained over 10min. Below
+that, a 429 is almost certainly a transient burst that the canvas's
+retry-once handler at `canvas/src/lib/api.ts:55` already absorbs. The
+30m `for:` keeps the alert from chattering on a brief blip.
+
+## Companion probe script
+
+For one-off triage when an operator can reproduce the problem in their
+own browser, `scripts/edge-429-probe.sh` (#62) reproduces a
+canvas-sized burst against a tenant subdomain and dumps each 429's response
+shape so the operator can distinguish workspace-server bucket overflow
+from CF/Vercel edge rate-limiting without dashboard access.
+
+```sh
+./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
+```
+
+The script's report header explains how to read the output.
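Reviewer sketch (not part of the patch): the Q1 to Q3 decision tree in the rate-limit runbook above can be read as a small pure function. The Go below is a hypothetical illustration only. The `Observation` type, the `Recommend` function, and the reading of "many tenants" as "more than one" are assumptions made for this sketch, not code from `workspace-server`. The 0.1 req/s threshold and the 1200 value are taken from the runbook text.

```go
package main

import "fmt"

// sustainedThreshold is the runbook's Q1 cut-off: 0.1 req/s, i.e. 6 per minute.
const sustainedThreshold = 0.1

// Observation is a hypothetical summary of the 429 metrics for one review window.
type Observation struct {
	// PerTenant429 maps a tenant name to its sustained 429 rate in req/s.
	PerTenant429 map[string]float64
	// OnPollingRoute is true when the 429s concentrate on a polling route
	// such as /activity or /peers.
	OnPollingRoute bool
}

// Recommend walks Q1 -> Q2 -> Q3 and returns the suggested action.
func Recommend(o Observation) string {
	hot := 0 // tenants above the sustained threshold
	for _, rate := range o.PerTenant429 {
		if rate > sustainedThreshold {
			hot++
		}
	}
	switch {
	case hot == 0: // Q1: NO
		return "keep default"
	case hot == 1: // Q2: ONE tenant
		return "override RATE_LIMIT on that tenant"
	case o.OnPollingRoute: // Q3: YES
		return "file a canvas issue"
	default: // Q3: NO
		return "raise default to 1200"
	}
}

func main() {
	obs := Observation{
		PerTenant429:   map[string]float64{"a": 0.02, "b": 0.30},
		OnPollingRoute: false,
	}
	fmt.Println(Recommend(obs))
}
```

Keeping the tree as a pure function of the observed rates makes each branch checkable without a live Prometheus; the real decision still needs the attached chart and time window the runbook asks for.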
@@ -103,9 +103,9 @@ A bad test:

 ## Related

-- [Issue #1821](https://github.com/Molecule-AI/molecule-core/issues/1821) — policy tracking issue
-- [Issue #1815](https://github.com/Molecule-AI/molecule-core/issues/1815) — Canvas coverage instrumentation
-- [Issue #1818](https://github.com/Molecule-AI/molecule-core/issues/1818) — Python pytest-cov
-- [Issue #1814](https://github.com/Molecule-AI/molecule-core/issues/1814) — workspace_provision_test.go unblock
-- [Issue #1816](https://github.com/Molecule-AI/molecule-core/issues/1816) — tokens.go coverage
-- [Issue #1819](https://github.com/Molecule-AI/molecule-core/issues/1819) — wsauth_middleware coverage
+- [Issue #1821](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1821) — policy tracking issue
+- [Issue #1815](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1815) — Canvas coverage instrumentation
+- [Issue #1818](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1818) — Python pytest-cov
+- [Issue #1814](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1814) — workspace_provision_test.go unblock
+- [Issue #1816](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1816) — tokens.go coverage
+- [Issue #1819](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1819) — wsauth_middleware coverage
@@ -153,7 +153,7 @@ The `id` field is your workspace ID — remember it.
 |---|---|
 | "Failed to send message — agent may be unreachable" | The tenant couldn't POST to your URL. Verify `curl https://<your-tunnel>/health` returns 200 from another machine. |
 | Response takes > 30s | Canvas times out around 30s. Keep initial implementations simple. For long-running work, return a placeholder and use [polling mode](#next-step-polling-mode-preview) (once available). |
-| Agent duplicated in chat | Known canvas bug where WebSocket + HTTP responses both render. Fixed in [PR #1517](https://github.com/Molecule-AI/molecule-core/pull/1517). |
+| Agent duplicated in chat | Known canvas bug where WebSocket + HTTP responses both render. Fixed in [PR #1517](https://git.moleculesai.app/molecule-ai/molecule-core/pull/1517). |
 | Agent replies but canvas shows "Agent unreachable" | Check the tenant can reach your URL. Cloudflare quick tunnels rotate — the URL in your canvas may point at a dead tunnel after restart. |
 | Getting 404 when POSTing to tenant | Add `X-Molecule-Org-Id` header. The tenant's security layer 404s unmatched origin requests by design. |

@@ -255,7 +255,7 @@ If all four pass and canvas still shows your agent as unreachable, see the [remo
 ## Feedback

 This is a new path. Tell us what broke:
-- Open an issue: https://github.com/Molecule-AI/molecule-core/issues/new?labels=external-workspace
+- Open an issue: https://git.moleculesai.app/molecule-ai/molecule-core/issues/new?labels=external-workspace
 - Join #external-workspaces on our Slack
 - Submit a PR improving this doc if something tripped you up — the faster we can make the quickstart, the more developers we bring in

@@ -58,8 +58,11 @@ green — proves wire shape end-to-end against a real `hermes gateway run`
 subprocess + stub OpenAI-compat LLM. Caught + fixed a real `KeyError`
 in upstream `hermes_cli/tools_config.py` (PLATFORMS dict lookup
 crashed on plugin platforms) — fix on the patched fork branch
-(`HongmingWang-Rabbit/hermes-agent` `feat/platform-adapter-plugins`,
-commit `18e4849e`). Upstream PR #18775 OPEN; CONFLICTING with main.
+(`molecule-ai/hermes-agent` `feat/platform-adapter-plugins`, commit
+`18e4849e`, hosted on Gitea at
+`https://git.moleculesai.app/molecule-ai/hermes-agent` — moved from the
+suspended `github.com/HongmingWang-Rabbit/hermes-agent`, see
+`molecule-ai/internal#72`). Upstream PR #18775 OPEN; CONFLICTING with main.
 Not on critical path for our platform — patched fork is what the
 workspace image installs.

@@ -99,7 +102,7 @@ fork needed in production.
 - **Plugin package**: [Molecule-AI/hermes-platform-molecule-a2a](https://git.moleculesai.app/molecule-ai/hermes-platform-molecule-a2a)
  v0.1.0 — public, MIT-licensed. 11 unit tests + 8 in-process E2E
  + 4 real-subprocess E2E checkpoints all green.
-- **Workspace template patch**: [Molecule-AI/molecule-ai-workspace-template-hermes#32](https://github.com/Molecule-AI/molecule-ai-workspace-template-hermes/pull/32)
+- **Workspace template patch**: [Molecule-AI/molecule-ai-workspace-template-hermes#32](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-hermes/pull/32)
  — Dockerfile installs the patched fork + plugin into the hermes
  installer's venv; start.sh seeds `platforms.molecule-a2a` config
  stanza. Pre-demo deliberately install-only; adapter.py rewrite to
@@ -156,7 +159,7 @@ intermediate shim earns its complexity.
 **Status:** Template SHIPPED. Repo live at
 [`Molecule-AI/molecule-ai-workspace-template-codex`](https://git.moleculesai.app/molecule-ai/molecule-ai-workspace-template-codex)
 (14 files, 1411 LOC, 12/12 tests). molecule-core registration in
-[PR #2512](https://github.com/Molecule-AI/molecule-core/pull/2512).
+[PR #2512](https://git.moleculesai.app/molecule-ai/molecule-core/pull/2512).
 E2E with real A2A traffic remains.

 **Path:** Persistent `codex app-server` stdio JSON-RPC client
@@ -101,7 +101,7 @@ incident-shaped.
 ## [v1.0.0] — initial release (RFC #2728, PRs #2729-#2742)

 Initial plugin contract + 11-PR rollout. See
-[issue #2728](https://github.com/Molecule-AI/molecule-core/issues/2728)
+[issue #2728](https://git.moleculesai.app/molecule-ai/molecule-core/issues/2728)
 for the full RFC.

 Endpoints: `/v1/health`, `/v1/namespaces/{name}` (PUT/PATCH/DELETE),
@@ -160,11 +160,11 @@ not expose.
 | `molecule-skill-update-docs` | `[claude_code]` | `[claude_code, hermes]` |

 Companion PRs:
-- [molecule-ai-plugin-ecc#2](https://github.com/Molecule-AI/molecule-ai-plugin-ecc/pull/2)
-- [molecule-ai-plugin-superpowers#2](https://github.com/Molecule-AI/molecule-ai-plugin-superpowers/pull/2)
-- [molecule-ai-plugin-molecule-dev#2](https://github.com/Molecule-AI/molecule-ai-plugin-molecule-dev/pull/2)
-- [molecule-ai-plugin-molecule-skill-cron-learnings#2](https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-cron-learnings/pull/2)
-- [molecule-ai-plugin-molecule-skill-update-docs#2](https://github.com/Molecule-AI/molecule-ai-plugin-molecule-skill-update-docs/pull/2)
+- [molecule-ai-plugin-ecc#2](https://git.moleculesai.app/molecule-ai/molecule-ai-plugin-ecc/pull/2)
+- [molecule-ai-plugin-superpowers#2](https://git.moleculesai.app/molecule-ai/molecule-ai-plugin-superpowers/pull/2)
+- [molecule-ai-plugin-molecule-dev#2](https://git.moleculesai.app/molecule-ai/molecule-ai-plugin-molecule-dev/pull/2)
+- [molecule-ai-plugin-molecule-skill-cron-learnings#2](https://git.moleculesai.app/molecule-ai/molecule-ai-plugin-molecule-skill-cron-learnings/pull/2)
+- [molecule-ai-plugin-molecule-skill-update-docs#2](https://git.moleculesai.app/molecule-ai/molecule-ai-plugin-molecule-skill-update-docs/pull/2)

 Security note: Security Auditor was offline at time of change. Self-assessed
 as non-security-impacting — adding `hermes` to a string list in `plugin.yaml`
@@ -0,0 +1,137 @@
+# Runbook — Handlers Postgres Integration port-collision substrate
+
+**Status:** Resolved 2026-05-08 (PR for class B Hongming-owned CICD red sweep).
+
+## Symptom
+
+`Handlers Postgres Integration` workflow fails on staging push and PRs.
+Step `Apply migrations to Postgres service` shows:
+
+```
+psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
+```
+
+Job-cleanup step further down logs:
+
+```
+Cleaning up services for job Handlers Postgres Integration
+failed to remove container: Error response from daemon: No such container: <id>
+```
+
+…confirming the postgres service container was already gone before
+cleanup ran.
+
+## Root cause
+
+Our Gitea act_runner (operator host `5.78.80.188`,
+`/opt/molecule/runners/config.yaml`) sets:
+
+```yaml
+container:
+  network: host
+```
+
+…which act_runner applies to BOTH the job container AND every
+`services:` container in a workflow. Multiple workflow instances
+running concurrently across the 16 parallel runners each try to bind
+postgres on `0.0.0.0:5432`. The first wins; subsequent instances exit
+immediately with:
+
+```
+LOG:  could not bind IPv4 address "0.0.0.0": Address in use
+HINT: Is another postmaster already running on port 5432?
+FATAL: could not create any TCP/IP sockets
+```
+
+act_runner sets `AutoRemove:true` on service containers, so Docker
+garbage-collects them as soon as they exit. By the time the migrations
+step runs `pg_isready` / `psql`, the container is gone and the
+connection is refused.
+
+Reproduction (operator host):
+
+```bash
+docker run --rm -d --name pg-A --network host \
+  -e POSTGRES_PASSWORD=test postgres:15-alpine
+docker run -d --name pg-B --network host \
+  -e POSTGRES_PASSWORD=test postgres:15-alpine
+docker logs pg-B   # FATAL: could not create any TCP/IP sockets
+```
+
+## Why per-job override doesn't work
+
+The natural fix — per-job `container.network` override — is silently
+ignored by act_runner. The runner log emits:
+
+```
+--network and --net in the options will be ignored.
+```
+
+This is a documented act_runner constraint: container network is a
+runner-wide setting, not per-job. Source: gitea/act_runner config docs
+and vegardit/docker-gitea-act-runner issue #7.
+
+Flipping the global `container.network` to `bridge` would break every
+other workflow in the repo (cache server discovery,
+`molecule-monorepo-net` peer access during integration tests, etc.) —
+unacceptable blast radius for a per-test bug.
+
+## Fix shape
+
+`handlers-postgres-integration.yml` no longer uses `services: postgres:`.
+It launches a sibling postgres container manually on the existing
+`molecule-monorepo-net` bridge network with a per-run unique name:
+
+```yaml
+env:
+  PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
+  PG_NETWORK: molecule-monorepo-net
+
+steps:
+  - name: Start sibling Postgres on bridge network
+    run: |
+      docker run -d --name "${PG_NAME}" --network "${PG_NETWORK}" \
+        ...
+        postgres:15-alpine
+      PG_HOST=$(docker inspect "${PG_NAME}" \
+        --format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
+      echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
+
+  # … migrations + tests use ${PG_HOST}, not 127.0.0.1 …
+
+  - if: always() && …
+    name: Stop sibling Postgres
+    run: docker rm -f "${PG_NAME}" || true
+```
+
+The host-net job container can reach a bridge-net container via the
+bridge IP directly (verified manually, 2026-05-08). Two parallel runs
+use different names + different bridge IPs — no collision.
+
+## Future-proofing
+
+Other workflows that hit the same shape (any `services:` with a
+fixed-port image) will exhibit the same failure mode under
+host-network runner config. Translate using this same pattern:
+
+1. Drop the `services:` block.
+2. Use `${{ github.run_id }}-${{ github.run_attempt }}` for unique
+   container name.
+3. Launch on `molecule-monorepo-net` (already trusted bridge in
+   `docker-compose.infra.yml`).
+4. Read back the bridge IP via `docker inspect` and export it through `$GITHUB_ENV`.
+5. `if: always()` cleanup step at the end.
+
+If the count of such workflows grows, factor into a composite action
+(`./.github/actions/sibling-postgres`) so the substrate logic lives
+in one place.
+
+## Related
+
+- Issue #88 (closed by #92): localhost → 127.0.0.1 fix that unmasked
+  this collision; the IPv6 fix is correct, port collision is the new
+  layer.
+- Issue #94 created `molecule-monorepo-net` + `alpine:latest` as
+  prereqs.
+- Saved memory `feedback_act_runner_github_server_url` documents
+  another act_runner-vs-GHA divergence (server URL).
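Reviewer sketch (not part of the patch): the root cause described in the runbook above is ordinary port exclusivity on a shared network namespace. The Go below is a minimal, hypothetical reproduction that needs no Docker: it binds an ephemeral loopback port instead of `0.0.0.0:5432`, and the `SecondBindFails` helper is a name invented for this sketch. It only shows that a second listener on an already-bound address is rejected, which is the same condition the second postgres container hits under `network: host`.

```go
package main

import (
	"fmt"
	"net"
)

// SecondBindFails opens a TCP listener on an ephemeral loopback port, then
// tries to bind the same address again while the first listener is still open.
// It returns true when the second bind is rejected.
func SecondBindFails() (bool, error) {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err // could not set up the first listener at all
	}
	defer first.Close()

	// Same address and port as the first listener, as with two services
	// sharing the host network namespace.
	second, err := net.Listen("tcp", first.Addr().String())
	if err != nil {
		return true, nil // the second bind was rejected (address in use)
	}
	second.Close()
	return false, nil
}

func main() {
	failed, err := SecondBindFails()
	if err != nil {
		fmt.Println("setup error:", err)
		return
	}
	fmt.Println("second bind rejected:", failed)
}
```

Giving each run its own container on the bridge network sidesteps this because every container gets a separate network namespace and IP, so the fixed port no longer has to be unique per host.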
@@ -198,7 +198,7 @@ Lighthouse audit against staging.yourapp.com:
  FCP: 2.4s | LCP: 5.2s | CLS: 0.18 | TBT: 620ms

 Performance regression detected — opening GitHub issue.
-Issue: https://github.com/Molecule-AI/molecule-core/issues/1527
+Issue: https://git.moleculesai.app/molecule-ai/molecule-core/issues/1527
 Label: performance-regression | Assignees: @your-team
 ```

@@ -85,8 +85,8 @@ Fly Machines start in milliseconds and run in 35+ regions. Provisioning agent wo

 ## Related

-- PR #501: [feat(platform): Fly Machines provisioner](https://github.com/Molecule-AI/molecule-core/pull/501)
-- PR #481: [feat(ci): deploy to Fly after image push](https://github.com/Molecule-AI/molecule-core/pull/481)
+- PR #501: [feat(platform): Fly Machines provisioner](https://git.moleculesai.app/molecule-ai/molecule-core/pull/501)
+- PR #481: [feat(ci): deploy to Fly after image push](https://git.moleculesai.app/molecule-ai/molecule-core/pull/481)
 - [Fly Machines API docs](https://fly.io/docs/machines/api/)
 - [Platform API reference](../api-reference.md)
-- Issue [#525](https://github.com/Molecule-AI/molecule-core/issues/525)
+- Issue [#525](https://git.moleculesai.app/molecule-ai/molecule-core/issues/525)
@@ -61,6 +61,6 @@ The real power surfaces when you mix runtimes on the same Molecule AI tenant. Yo

 ## Related

-- PR #379: [feat(adapters): add gemini-cli runtime adapter](https://github.com/Molecule-AI/molecule-core/pull/379)
+- PR #379: [feat(adapters): add gemini-cli runtime adapter](https://git.moleculesai.app/molecule-ai/molecule-core/pull/379)
 - [Multi-provider Hermes docs](../architecture/hermes.md)
 - [Workspace runtimes reference](../reference/runtimes.md)
@@ -68,7 +68,7 @@ ADK workspaces participate in the same A2A network as Claude Code, Gemini CLI, H

 ## Related

-- PR #550: [feat(adapters): add google-adk runtime adapter](https://github.com/Molecule-AI/molecule-core/pull/550)
+- PR #550: [feat(adapters): add google-adk runtime adapter](https://git.moleculesai.app/molecule-ai/molecule-core/pull/550)
 - [Google ADK (adk-python)](https://github.com/google/adk-python)
 - [Gemini CLI runtime tutorial](./gemini-cli-runtime.md)
 - [Platform API reference](../api-reference.md)
@@ -176,9 +176,9 @@ What is on the roadmap for Phase 2d (not yet shipped):

 ## Related

-- PR #240: [Phase 2a — native Anthropic dispatch](https://github.com/Molecule-AI/molecule-core/pull/240)
-- PR #255: [Phase 2b — native Gemini dispatch](https://github.com/Molecule-AI/molecule-core/pull/255)
-- PR #267: [Phase 2c — multi-turn history on all paths](https://github.com/Molecule-AI/molecule-core/pull/267)
+- PR #240: [Phase 2a — native Anthropic dispatch](https://git.moleculesai.app/molecule-ai/molecule-core/pull/240)
+- PR #255: [Phase 2b — native Gemini dispatch](https://git.moleculesai.app/molecule-ai/molecule-core/pull/255)
+- PR #267: [Phase 2c — multi-turn history on all paths](https://git.moleculesai.app/molecule-ai/molecule-core/pull/267)
 - [Hermes adapter design](../adapters/hermes-adapter-design.md)
 - [Platform API reference](../api-reference.md)
-- Issue [#513](https://github.com/Molecule-AI/molecule-core/issues/513)
+- Issue [#513](https://git.moleculesai.app/molecule-ai/molecule-core/issues/513)
@@ -90,6 +90,6 @@ Molecule AI canvas without code changes.

 ## Related

-- PR #480: [feat(channels): Lark / Feishu channel adapter](https://github.com/Molecule-AI/molecule-core/pull/480)
+- PR #480: [feat(channels): Lark / Feishu channel adapter](https://git.moleculesai.app/molecule-ai/molecule-core/pull/480)
 - [Social channels architecture](../agent-runtime/social-channels.md)
 - [Channel adapter reference](../api-reference.md#channels)
@@ -2,45 +2,46 @@
  "_comment": "Pin refs to release tags for reproducible builds. 'main' is OK while all repos are internal.",
  "version": 1,
  "plugins": [
-    {"name": "browser-automation", "repo": "Molecule-AI/molecule-ai-plugin-browser-automation", "ref": "main"},
-    {"name": "ecc", "repo": "Molecule-AI/molecule-ai-plugin-ecc", "ref": "main"},
-    {"name": "gh-identity", "repo": "Molecule-AI/molecule-ai-plugin-gh-identity", "ref": "main"},
-    {"name": "molecule-audit", "repo": "Molecule-AI/molecule-ai-plugin-molecule-audit", "ref": "main"},
-    {"name": "molecule-audit-trail", "repo": "Molecule-AI/molecule-ai-plugin-molecule-audit-trail", "ref": "main"},
-    {"name": "molecule-careful-bash", "repo": "Molecule-AI/molecule-ai-plugin-molecule-careful-bash", "ref": "main"},
-    {"name": "molecule-compliance", "repo": "Molecule-AI/molecule-ai-plugin-molecule-compliance", "ref": "main"},
-    {"name": "molecule-dev", "repo": "Molecule-AI/molecule-ai-plugin-molecule-dev", "ref": "main"},
-    {"name": "molecule-freeze-scope", "repo": "Molecule-AI/molecule-ai-plugin-molecule-freeze-scope", "ref": "main"},
-    {"name": "molecule-hitl", "repo": "Molecule-AI/molecule-ai-plugin-molecule-hitl", "ref": "main"},
-    {"name": "molecule-prompt-watchdog", "repo": "Molecule-AI/molecule-ai-plugin-molecule-prompt-watchdog", "ref": "main"},
-    {"name": "molecule-security-scan", "repo": "Molecule-AI/molecule-ai-plugin-molecule-security-scan", "ref": "main"},
-    {"name": "molecule-session-context", "repo": "Molecule-AI/molecule-ai-plugin-molecule-session-context", "ref": "main"},
-    {"name": "molecule-skill-code-review", "repo": "Molecule-AI/molecule-ai-plugin-molecule-skill-code-review", "ref": "main"},
-    {"name": "molecule-skill-cron-learnings", "repo": "Molecule-AI/molecule-ai-plugin-molecule-skill-cron-learnings", "ref": "main"},
-    {"name": "molecule-skill-cross-vendor-review", "repo": "Molecule-AI/molecule-ai-plugin-molecule-skill-cross-vendor-review", "ref": "main"},
-    {"name": "molecule-skill-llm-judge", "repo": "Molecule-AI/molecule-ai-plugin-molecule-skill-llm-judge", "ref": "main"},
-    {"name": "molecule-skill-update-docs", "repo": "Molecule-AI/molecule-ai-plugin-molecule-skill-update-docs", "ref": "main"},
-    {"name": "molecule-workflow-retro", "repo": "Molecule-AI/molecule-ai-plugin-molecule-workflow-retro", "ref": "main"},
-    {"name": "molecule-workflow-triage", "repo": "Molecule-AI/molecule-ai-plugin-molecule-workflow-triage", "ref": "main"},
-    {"name": "superpowers", "repo": "Molecule-AI/molecule-ai-plugin-superpowers", "ref": "main"}
+    {"name": "browser-automation", "repo": "molecule-ai/molecule-ai-plugin-browser-automation", "ref": "main"},
+    {"name": "ecc", "repo": "molecule-ai/molecule-ai-plugin-ecc", "ref": "main"},
+    {"name": "gh-identity", "repo": "molecule-ai/molecule-ai-plugin-gh-identity", "ref": "main"},
+    {"name": "molecule-audit", "repo": "molecule-ai/molecule-ai-plugin-molecule-audit", "ref": "main"},
+    {"name": "molecule-audit-trail", "repo": "molecule-ai/molecule-ai-plugin-molecule-audit-trail", "ref": "main"},
+    {"name": "molecule-careful-bash", "repo": "molecule-ai/molecule-ai-plugin-molecule-careful-bash", "ref": "main"},
+    {"name": "molecule-compliance", "repo": "molecule-ai/molecule-ai-plugin-molecule-compliance", "ref": "main"},
+    {"name": "molecule-dev", "repo": "molecule-ai/molecule-ai-plugin-molecule-dev", "ref": "main"},
+    {"name": "molecule-freeze-scope", "repo": "molecule-ai/molecule-ai-plugin-molecule-freeze-scope", "ref": "main"},
+    {"name": "molecule-hitl", "repo": "molecule-ai/molecule-ai-plugin-molecule-hitl", "ref": "main"},
+    {"name": "molecule-prompt-watchdog", "repo": "molecule-ai/molecule-ai-plugin-molecule-prompt-watchdog", "ref": "main"},
+    {"name": "molecule-security-scan", "repo": "molecule-ai/molecule-ai-plugin-molecule-security-scan", "ref": "main"},
+    {"name": "molecule-session-context", "repo": "molecule-ai/molecule-ai-plugin-molecule-session-context", "ref": "main"},
+    {"name": "molecule-skill-code-review", "repo": "molecule-ai/molecule-ai-plugin-molecule-skill-code-review", "ref": "main"},
+    {"name": "molecule-skill-cron-learnings", "repo": "molecule-ai/molecule-ai-plugin-molecule-skill-cron-learnings", "ref": "main"},
+    {"name": "molecule-skill-cross-vendor-review", "repo": "molecule-ai/molecule-ai-plugin-molecule-skill-cross-vendor-review", "ref": "main"},
+    {"name": "molecule-skill-llm-judge", "repo": "molecule-ai/molecule-ai-plugin-molecule-skill-llm-judge", "ref": "main"},
+    {"name": "molecule-skill-update-docs", "repo": "molecule-ai/molecule-ai-plugin-molecule-skill-update-docs", "ref": "main"},
+    {"name": "molecule-workflow-retro", "repo": "molecule-ai/molecule-ai-plugin-molecule-workflow-retro", "ref": "main"},
+    {"name": "molecule-workflow-triage", "repo": "molecule-ai/molecule-ai-plugin-molecule-workflow-triage", "ref": "main"},
+    {"name": "superpowers", "repo": "molecule-ai/molecule-ai-plugin-superpowers", "ref": "main"}
  ],
  "workspace_templates": [
-    {"name": "claude-code-default", "repo": "Molecule-AI/molecule-ai-workspace-template-claude-code", "ref": "main"},
-    {"name": "hermes", "repo": "Molecule-AI/molecule-ai-workspace-template-hermes", "ref": "main"},
-    {"name": "openclaw", "repo": "Molecule-AI/molecule-ai-workspace-template-openclaw", "ref": "main"},
-    {"name": "codex", "repo": "Molecule-AI/molecule-ai-workspace-template-codex", "ref": "main"},
-    {"name": "langgraph", "repo": "Molecule-AI/molecule-ai-workspace-template-langgraph", "ref": "main"},
-    {"name": "crewai", "repo": "Molecule-AI/molecule-ai-workspace-template-crewai", "ref": "main"},
-    {"name": "autogen", "repo": "Molecule-AI/molecule-ai-workspace-template-autogen", "ref": "main"},
-    {"name": "deepagents", "repo": "Molecule-AI/molecule-ai-workspace-template-deepagents", "ref": "main"},
-    {"name": "gemini-cli", "repo": "Molecule-AI/molecule-ai-workspace-template-gemini-cli", "ref": "main"}
+    {"name": "claude-code-default", "repo": "molecule-ai/molecule-ai-workspace-template-claude-code", "ref": "main"},
+    {"name": "hermes", "repo": "molecule-ai/molecule-ai-workspace-template-hermes", "ref": "main"},
+    {"name": "openclaw", "repo": "molecule-ai/molecule-ai-workspace-template-openclaw", "ref": "main"},
+    {"name": "codex", "repo": "molecule-ai/molecule-ai-workspace-template-codex", "ref": "main"},
+    {"name": "langgraph", "repo": "molecule-ai/molecule-ai-workspace-template-langgraph", "ref": "main"},
+    {"name": "crewai", "repo": "molecule-ai/molecule-ai-workspace-template-crewai", "ref": "main"},
+    {"name": "autogen", "repo": "molecule-ai/molecule-ai-workspace-template-autogen", "ref": "main"},
+    {"name": "deepagents", "repo": "molecule-ai/molecule-ai-workspace-template-deepagents", "ref": "main"},
+    {"name": "gemini-cli", "repo": "molecule-ai/molecule-ai-workspace-template-gemini-cli", "ref": "main"}
  ],
  "org_templates": [
-    {"name": "molecule-dev", "repo": "Molecule-AI/molecule-ai-org-template-molecule-dev", "ref": "main"},
-    {"name": "free-beats-all", "repo": "Molecule-AI/molecule-ai-org-template-free-beats-all", "ref": "main"},
-    {"name": "medo-smoke", "repo": "Molecule-AI/molecule-ai-org-template-medo-smoke", "ref": "main"},
-    {"name": "molecule-worker-gemini", "repo": "Molecule-AI/molecule-ai-org-template-molecule-worker-gemini", "ref": "main"},
-    {"name": "reno-stars", "repo": "Molecule-AI/molecule-ai-org-template-reno-stars", "ref": "main"},
-    {"name": "ux-ab-lab", "repo": "Molecule-AI/molecule-ai-org-template-ux-ab-lab", "ref": "main"}
+    {"name": "molecule-dev", "repo": "molecule-ai/molecule-ai-org-template-molecule-dev", "ref": "main"},
+    {"name": "free-beats-all", "repo": "molecule-ai/molecule-ai-org-template-free-beats-all", "ref": "main"},
+    {"name": "medo-smoke", "repo": "molecule-ai/molecule-ai-org-template-medo-smoke", "ref": "main"},
+    {"name": "molecule-worker-gemini", "repo": "molecule-ai/molecule-ai-org-template-molecule-worker-gemini", "ref": "main"},
+    {"name": "reno-stars", "repo": "molecule-ai/molecule-ai-org-template-reno-stars", "ref": "main"},
+    {"name": "ux-ab-lab", "repo": "molecule-ai/molecule-ai-org-template-ux-ab-lab", "ref": "main"},
+    {"name": "mock-bigorg", "repo": "molecule-ai/molecule-ai-org-template-mock-bigorg", "ref": "main"}
  ]
 }
@@ -376,7 +376,7 @@ hold:
   non-plugin-sourced server, which Claude Code rejects with
   `channel_enable requires a marketplace plugin`. Until the
   official `moleculesai/claude-code-plugin` marketplace lands
-   (tracking [#2936](https://github.com/Molecule-AI/molecule-core/issues/2936)),
+   (tracking [#2936](https://git.moleculesai.app/molecule-ai/molecule-core/issues/2936)),
   operators who want push must scaffold their own local marketplace
   under
   `~/.claude/marketplaces/molecule-local/` containing a
@@ -389,7 +389,7 @@ hold:
 Symptom of any condition failing: messages arrive but only via the
 poll path (every ~1–60s), not real-time. There's currently no
 diagnostic surfaced — `molecule-mcp doctor` (tracking
-[#2937](https://github.com/Molecule-AI/molecule-core/issues/2937)) is
+[#2937](https://git.moleculesai.app/molecule-ai/molecule-core/issues/2937)) is
 planned.

 If you don't need real-time push, the default poll path works
@@ -17,12 +17,23 @@
 #
 # Used by .github/workflows/auto-promote-stale-alarm.yml. Logic lives
 # here (not inline in the workflow YAML) so we can:
-#   - Unit-test it with a stubbed `gh` (see test-check-stale-promote-pr.sh)
+#   - Unit-test it with a fixture (see test-check-stale-promote-pr.sh)
 #   - Run it ad-hoc by an operator: `scripts/check-stale-promote-pr.sh`
 #   - Reuse the same surface in any sibling workflow that needs the same
 #     check (SSOT — one detector, many callers).
 #
-# Requires: `gh` CLI, `jq`. `GH_TOKEN` env in the workflow context.
+# Requires: `curl`, `jq`. `GITEA_TOKEN` (or `GITHUB_TOKEN` / `GH_TOKEN`
+# for back-compat) in the workflow context. Reads `GITHUB_SERVER_URL`
+# / `GITEA_API_URL` for the Gitea base, defaulting to
+# https://git.moleculesai.app/api/v1.
+#
+# Post-2026-05-06 (Gitea migration, issue #75): the previous version
+# called `gh pr list/view/comment`, all of which hit GitHub.com's
+# GraphQL or /api/v3 REST shapes. Gitea exposes /api/v1/ only (no
+# GraphQL → 405, no /api/v3 → 404). So this script now talks to the
+# Gitea v1 API directly via curl. The fixture-driven unit tests are
+# unchanged — they bypass the live fetch via PR_FIXTURE and still pass
+# the historical (GitHub-shape) JSON which `detect_stale` consumes.

 set -euo pipefail

@@ -36,14 +47,15 @@ set -euo pipefail
 # alarming. Override via env for tests + edge ops.
 STALE_HOURS="${STALE_HOURS:-4}"

-# Repo defaults to the current `gh` context. Tests pass --repo explicitly.
+# Repo defaults to GITHUB_REPOSITORY (act_runner sets this in workflow
+# context). Tests pass --repo explicitly.
 REPO="${GITHUB_REPOSITORY:-}"

 # Whether to post a comment to the PR. Off by default to avoid noise on
 # manual ad-hoc runs; the cron workflow turns it on.
 POST_COMMENT="${POST_COMMENT:-false}"

-# Where to read the open-PR JSON from. Empty = call `gh` live. Tests
+# Where to read the open-PR JSON from. Empty = call Gitea live. Tests
 # point this at a fixture file.
 PR_FIXTURE="${PR_FIXTURE:-}"

@@ -51,6 +63,17 @@ PR_FIXTURE="${PR_FIXTURE:-}"
 # the staleness math is deterministic.
 NOW_OVERRIDE="${NOW_OVERRIDE:-}"

+# Gitea API base. act_runner forwards github.server_url as
+# GITHUB_SERVER_URL; for the molecule-ai fleet that's
+# https://git.moleculesai.app. Append /api/v1 to get the REST root.
+# Override directly via GITEA_API_URL for tests / non-default hosts.
+GITEA_API_URL="${GITEA_API_URL:-${GITHUB_SERVER_URL:-https://git.moleculesai.app}/api/v1}"
+
+# Token. Workflow context sets GITHUB_TOKEN; we accept GITEA_TOKEN as
+# the explicit name and GH_TOKEN for back-compat with operator habits
+# from the GitHub era. First non-empty wins.
+GITEA_TOKEN="${GITEA_TOKEN:-${GITHUB_TOKEN:-${GH_TOKEN:-}}}"
+
 while [ $# -gt 0 ]; do
  case "$1" in
    --repo) REPO="$2"; shift 2 ;;
@@ -83,7 +106,7 @@ now_epoch() {
  fi
 }

-# Parse RFC3339 timestamps the way GitHub emits them (e.g.
+# Parse RFC3339 timestamps the way Gitea / GitHub emit them (e.g.
 # "2026-05-05T23:15:00Z"). gnu-date uses -d, bsd-date uses -j -f. Cover
 # both because the workflow runs on ubuntu-latest (gnu) but operators
 # may run this script on macOS (bsd).
@@ -106,14 +129,100 @@ to_epoch() {
 # Fetch open auto-promote PRs
 # -----------------------------------------------------------------------------

+# Gitea v1 returns PRs with the canonical Gitea shape (number, title,
+# created_at, html_url, mergeable, state). The previous GitHub-CLI
+# version returned a derived `mergeStateStatus` / `reviewDecision`
+# pair which only GitHub computes — Gitea doesn't expose them
+# natively. Rebuild equivalents:
+#
+#   mergeStateStatus = BLOCKED  ↔ Gitea: state==open AND mergeable==true
+#                                  AND no APPROVED review yet
+#                                  (i.e. branch protection is gating
+#                                  the auto-merge pending an approval)
+#   reviewDecision   = REVIEW_REQUIRED  ↔ Gitea: 0 APPROVED reviews
+#
+# This mirrors the SAME silent-block failure mode the GitHub version
+# detected: auto-merge armed, branch protection requires 1 review,
+# nobody's approved yet.
+#
+# Implementation: pull the open PR list base=main, then for each PR
+# pull /pulls/{n}/reviews and synthesize the GitHub-shape JSON the
+# rest of the script + the test fixtures consume.
 fetch_prs() {
  if [ -n "$PR_FIXTURE" ]; then
    cat "$PR_FIXTURE"
    return 0
  fi
-  gh pr list --repo "$REPO" \
-    --base main --head staging --state open \
-    --json number,title,createdAt,mergeStateStatus,reviewDecision,url
+  if [ -z "$GITEA_TOKEN" ]; then
+    echo "::error::GITEA_TOKEN / GITHUB_TOKEN unset — cannot fetch PRs from $GITEA_API_URL" >&2
+    return 1
+  fi
+  local prs_json
+  prs_json="$(curl --fail-with-body -sS \
+    -H "Authorization: token ${GITEA_TOKEN}" \
+    -H "Accept: application/json" \
+    "${GITEA_API_URL}/repos/${REPO}/pulls?state=open&base=main&limit=50" \
+    2>/dev/null)" || {
+    echo "::error::Failed to fetch PRs from ${GITEA_API_URL}/repos/${REPO}/pulls" >&2
+    return 1
+  }
+
+  # Filter to head=staging (the auto-promote shape) and synthesize
+  # mergeStateStatus + reviewDecision per PR. Approval count via
+  # /pulls/{n}/reviews. Errors fall through to 0-approvals (treated
+  # as REVIEW_REQUIRED) preserving the existing "fail-safe — alarm if
+  # uncertain" semantic.
+  local synthesized="[]"
+  while IFS= read -r pr; do
+    [ -z "$pr" ] && continue
+    [ "$pr" = "null" ] && continue
+    local num
+    num="$(printf '%s' "$pr" | jq -r '.number')"
+    [ -z "$num" ] && continue
+    [ "$num" = "null" ] && continue
+    local approved_count
+    approved_count="$(curl --fail-with-body -sS \
+      -H "Authorization: token ${GITEA_TOKEN}" \
+      -H "Accept: application/json" \
+      "${GITEA_API_URL}/repos/${REPO}/pulls/${num}/reviews" 2>/dev/null \
+      | jq '[.[] | select(.state == "APPROVED" and (.dismissed // false) == false)] | length' \
+      2>/dev/null || echo 0)"
+    local mergeable
+    mergeable="$(printf '%s' "$pr" | jq -r '.mergeable')"
+    local merge_state="UNKNOWN"
+    local review_decision="REVIEW_REQUIRED"
+    if [ "$mergeable" = "true" ]; then
+      if [ "$approved_count" -ge 1 ]; then
+        merge_state="CLEAN"
+        review_decision="APPROVED"
+      else
+        # mergeable but no approving review — exactly the wedge state
+        # the alarm targets.
+        merge_state="BLOCKED"
+        review_decision="REVIEW_REQUIRED"
+      fi
+    else
+      # not mergeable (conflicts, behind, failed checks) — different
+      # failure mode, the author owns the fix; the alarm doesn't fire.
+      merge_state="DIRTY"
+      review_decision="REVIEW_REQUIRED"
+    fi
+    synthesized="$(printf '%s' "$synthesized" \
+      | jq -c --argjson pr "$pr" \
+              --arg ms "$merge_state" \
+              --arg rd "$review_decision" \
+              '. + [{
+                 number: $pr.number,
+                 title: $pr.title,
+                 createdAt: $pr.created_at,
+                 mergeStateStatus: $ms,
+                 reviewDecision: $rd,
+                 url: $pr.html_url
+              }]')"
+  done < <(printf '%s' "$prs_json" \
+    | jq -c '.[] | select(.head.ref == "staging")' 2>/dev/null)
+
+  printf '%s\n' "$synthesized"
 }
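Reviewer note: the synthesis rules in `fetch_prs` can be restated as two small pure functions, which may make the mapping easier to check against the fixtures. This is a minimal Python sketch, not part of the script. `count_approvals` and `synthesize_pr_state` are hypothetical names, and the review objects are assumed to carry the `state` and `dismissed` fields that the jq filter above reads.

```python
from __future__ import annotations


def count_approvals(reviews: list[dict]) -> int:
    """Count APPROVED reviews that have not been dismissed."""
    return sum(
        1
        for review in reviews
        if review.get("state") == "APPROVED" and not review.get("dismissed")
    )


def synthesize_pr_state(mergeable: bool | None, approved_count: int) -> tuple[str, str]:
    """Return (mergeStateStatus, reviewDecision) in the GitHub-shaped vocabulary."""
    if not mergeable:
        # Conflicts, behind, or failed checks: the author owns the fix.
        return ("DIRTY", "REVIEW_REQUIRED")
    if approved_count >= 1:
        return ("CLEAN", "APPROVED")
    # Mergeable but unapproved: the wedge state the alarm targets.
    return ("BLOCKED", "REVIEW_REQUIRED")
```

Only the `("BLOCKED", "REVIEW_REQUIRED")` pair corresponds to the silent-block state described in the comment block above `fetch_prs`.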

 # -----------------------------------------------------------------------------
@@ -171,18 +280,40 @@ post_comment() {
  if [ "$POST_COMMENT" != "true" ]; then
    return 0
  fi
+  if [ -z "$GITEA_TOKEN" ]; then
+    echo "::warning::GITEA_TOKEN unset — cannot post stale-alarm comment on PR #$pr_num" >&2
+    return 0
+  fi
  # Idempotency: only one alarm comment per PR. Look for the marker
-  # string in existing comments before posting a new one.
+  # string in existing comments before posting a new one. Gitea's
+  # /repos/{owner}/{repo}/issues/{n}/comments returns the same shape
+  # for issues + PRs (PRs are issues internally on Gitea, same as
+  # GitHub's REST).
  local existing
-  existing="$(gh pr view "$pr_num" --repo "$REPO" --json comments \
-    --jq '.comments[] | select(.body | test("scripts/check-stale-promote-pr.sh per issue #2975")) | .databaseId' \
+  existing="$(curl --fail-with-body -sS \
+    -H "Authorization: token ${GITEA_TOKEN}" \
+    -H "Accept: application/json" \
+    "${GITEA_API_URL}/repos/${REPO}/issues/${pr_num}/comments?limit=50" 2>/dev/null \
+    | jq -r '.[] | select(.body | test("scripts/check-stale-promote-pr.sh per issue #2975")) | .id' \
    | head -n1)"
  if [ -n "$existing" ]; then
    echo "::notice::PR #$pr_num already has a stale-alarm comment ($existing) — not re-posting"
    return 0
  fi
-  comment_body "$age_h" | gh pr comment "$pr_num" --repo "$REPO" --body-file -
-  echo "::notice::Posted stale-alarm comment on PR #$pr_num (age=${age_h}h)"
+  local body
+  body="$(comment_body "$age_h")"
+  if curl --fail-with-body -sS \
+      -X POST \
+      -H "Authorization: token ${GITEA_TOKEN}" \
+      -H "Accept: application/json" \
+      -H "Content-Type: application/json" \
+      "${GITEA_API_URL}/repos/${REPO}/issues/${pr_num}/comments" \
+      -d "$(jq -nc --arg b "$body" '{body: $b}')" \
+      >/dev/null 2>&1; then
+    echo "::notice::Posted stale-alarm comment on PR #$pr_num (age=${age_h}h)"
+  else
+    echo "::warning::Failed to POST stale-alarm comment on PR #$pr_num" >&2
+  fi
 }

 # -----------------------------------------------------------------------------
@@ -6,6 +6,29 @@
 #   ./scripts/clone-manifest.sh <manifest.json> <ws-templates-dir> <org-templates-dir> <plugins-dir>
 #
 # Requires: git, jq (lighter than python3 — ~2MB vs ~50MB in Alpine)
+#
+# Auth (optional):
+#   When MOLECULE_GITEA_TOKEN is set, embed it as the basic-auth password so
+#   private Gitea repos clone successfully. When unset, clone anonymously
+#   (works only for repos that are public on git.moleculesai.app).
+#
+#   This is the path the publish-workspace-server-image.yml workflow uses:
+#   it injects AUTO_SYNC_TOKEN (devops-engineer persona PAT, repo:read on
+#   the molecule-ai org) so the in-CI pre-clone step succeeds for ALL
+#   manifest entries — including the 5 private workspace-template-* repos
+#   (codex, crewai, deepagents, gemini-cli, langgraph) and all 7
+#   org-template-* repos.
+#
+#   The token never enters the Docker image: this script runs in the
+#   trusted CI context BEFORE `docker buildx build`, populates
+#   .tenant-bundle-deps/, then `Dockerfile.tenant` COPYs from there with
+#   the .git directories already stripped (see line ~67 below).
+#
+#   For backward compatibility — and so a fresh clone works without
+#   secrets when (eventually) the workspace-template-* repos flip public —
+#   the unset path remains a plain anonymous HTTPS clone. That path will
+#   FAIL with "could not read Username" on private repos today; CI MUST
+#   set MOLECULE_GITEA_TOKEN.

 set -euo pipefail

@@ -45,11 +68,27 @@ clone_category() {
            continue
        fi

-        echo "  cloning $repo -> $target_dir/$name (ref=$ref)"
-        if [ "$ref" = "main" ]; then
-            git clone --depth=1 -q "https://github.com/${repo}.git" "$target_dir/$name"
+        # Build the clone URL. When MOLECULE_GITEA_TOKEN is set (CI path)
+        # embed it as basic-auth so private repos succeed. The username
+        # part ("oauth2") is conventional and ignored by Gitea — only the
+        # token-as-password is verified.
+        #
+        # manifest.json was migrated to lowercase org slugs on
+        # 2026-05-07 (post-suspension reconciliation), so we use $repo
+        # verbatim — no on-the-fly tolower transform needed.
+        if [ -n "${MOLECULE_GITEA_TOKEN:-}" ]; then
+            clone_url="https://oauth2:${MOLECULE_GITEA_TOKEN}@git.moleculesai.app/${repo}.git"
+            display_url="https://oauth2:***@git.moleculesai.app/${repo}.git"
        else
-            git clone --depth=1 -q --branch "$ref" "https://github.com/${repo}.git" "$target_dir/$name"
+            clone_url="https://git.moleculesai.app/${repo}.git"
+            display_url="$clone_url"
+        fi
+
+        echo "  cloning $display_url -> $target_dir/$name (ref=$ref)"
+        if [ "$ref" = "main" ]; then
+            git clone --depth=1 -q "$clone_url" "$target_dir/$name"
+        else
+            git clone --depth=1 -q --branch "$ref" "$clone_url" "$target_dir/$name"
        fi
        CLONED=$((CLONED + 1))
        i=$((i + 1))
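Reviewer note: the clone-URL branch above keeps two strings, one that carries the token and one that is safe to log. A minimal Python sketch of the same idea follows. `build_clone_urls` is a hypothetical helper, and the default host is taken from the script.

```python
from __future__ import annotations


def build_clone_urls(
    repo: str, token: str = "", host: str = "git.moleculesai.app"
) -> tuple[str, str]:
    """Return (clone_url, display_url); display_url never contains the token."""
    if token:
        clone_url = f"https://oauth2:{token}@{host}/{repo}.git"
        display_url = f"https://oauth2:***@{host}/{repo}.git"
        return (clone_url, display_url)
    # Anonymous path: nothing to mask, so both URLs are identical.
    anonymous = f"https://{host}/{repo}.git"
    return (anonymous, anonymous)
```

Only the second element of the pair should ever reach `echo` or a CI log.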
@@ -0,0 +1,155 @@
+#!/usr/bin/env bash
+# edge-429-probe.sh — capture 429 origin (workspace-server vs CF/Vercel edge)
+# during a simulated canvas-burst against a tenant subdomain.
+#
+# Issue molecule-core#62. The post-#60 verification step asks an
+# operator with CF/Vercel dashboard access to confirm whether the
+# layout-chunk 429s observed in DevTools were:
+#   (a) workspace-server bucket overflow (closes once #60 deploys), or
+#   (b) actual edge-layer rate-limiting (CF or Vercel).
+#
+# This script doesn't need dashboard access. It reproduces the burst
+# pattern locally and dumps every 429's response shape so the operator
+# can distinguish (a) from (b) by inspection: workspace-server emits a
+# JSON body, CF emits HTML, Vercel emits a different HTML. Headers tell
+# the same story (cf-ray vs x-vercel-*).
+#
+# Usage:
+#   ./scripts/edge-429-probe.sh <tenant-host> [--burst N] [--waves N] [--pause SECS] [--out file]
+#
+# Example:
+#   ./scripts/edge-429-probe.sh hongming.moleculesai.app --burst 80 --out /tmp/edge.txt
+#
+# The script is read-only against the target — it only issues GETs to
+# public-by-design endpoints. No mutating requests, no credential use.
+
+set -euo pipefail
+
+# ── Help / usage handling first, before positional capture ────────────────────
+case "${1:-}" in
+  -h|--help|"")
+    sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
+    exit 0
+    ;;
+esac
+
+HOST="$1"; shift
+BURST=80
+WAVES=3
+WAVE_PAUSE=2
+OUT=""
+
+while [ "${1:-}" != "" ]; do
+  case "$1" in
+    --burst) BURST="$2"; shift 2 ;;
+    --waves) WAVES="$2"; shift 2 ;;
+    --pause) WAVE_PAUSE="$2"; shift 2 ;;
+    --out)   OUT="$2";   shift 2 ;;
+    -h|--help)
+      sed -n '/^# edge-429-probe.sh/,/^$/p' "$0" | sed 's/^# \{0,1\}//'
+      exit 0
+      ;;
+    *) echo "unknown arg: $1" >&2; exit 2 ;;
+  esac
+done
+
+# ── Endpoint discovery ────────────────────────────────────────────────────────
+echo "→ Discovering a layout-chunk URL from canvas root..." >&2
+ROOT_BODY=$(curl -fsSL --max-time 10 "https://${HOST}/" 2>/dev/null || true)
+LAYOUT_PATH=$(echo "$ROOT_BODY" \
+  | grep -oE '/_next/static/chunks/layout-[A-Za-z0-9_-]+\.js' \
+  | head -1 || true)
+if [ -z "$LAYOUT_PATH" ]; then
+  LAYOUT_PATH="/_next/static/chunks/layout-probe-not-found.js"
+  echo "  (no layout chunk discovered — using sentinel path; 404 on this is expected)" >&2
+else
+  echo "  layout chunk: $LAYOUT_PATH" >&2
+fi
+
+# Probe URL: a generic activity endpoint. The rate-limiter middleware
+# runs BEFORE workspace-id validation, so unauth/invalid-id requests
+# still hit the bucket.
+ACTIVITY_PATH="/workspaces/00000000-0000-0000-0000-000000000000/activity?probe=edge-429"
+
+# ── Fire one curl, append a single-line key=value record to $TMP_RESULTS ─────
+# Requests run as background jobs inside run_burst (`&` + `wait`) rather
+# than through xargs, so the function-export pitfalls (some shells lose
+# `export -f` across xargs) don't apply. Each output line is a parseable
+# record; failed curls emit a curl_err record so request volume is preserved.
+TMP_RESULTS="$(mktemp -t edge-429-probe.XXXXXX)"
+trap 'rm -f "$TMP_RESULTS"' EXIT
+
+run_burst() {
+  # $1 = path; $2 = label; $3 = wave_id
+  local path="$1" label="$2" wave="$3"
+  local i
+  for i in $(seq 1 "$BURST"); do
+    {
+      out=$(curl -sS --max-time 10 -o /dev/null \
+        -w 'status=%{http_code} size=%{size_download} time=%{time_total} server=%header{server} cf_ray=%header{cf-ray} x_vercel=%header{x-vercel-id} retry_after=%header{retry-after} content_type=%header{content-type} x_ratelimit_limit=%header{x-ratelimit-limit} x_ratelimit_remaining=%header{x-ratelimit-remaining} x_ratelimit_reset=%header{x-ratelimit-reset}\n' \
+        "https://${HOST}${path}" 2>/dev/null) || out="status=curl_err"
+      printf 'label=%s-%s-%s %s\n' "$label" "$wave" "$i" "$out" >> "$TMP_RESULTS"
+    } &
+  done
+  wait
+}
+
+emit() {
+  if [ -n "$OUT" ]; then
+    printf '%s\n' "$*" >> "$OUT"
+  else
+    printf '%s\n' "$*"
+  fi
+}
+
+if [ -n "$OUT" ]; then : > "$OUT"; fi
+
+emit "# edge-429-probe report"
+emit "# host=$HOST burst=$BURST waves=$WAVES pause=${WAVE_PAUSE}s"
+emit "# layout_path=$LAYOUT_PATH"
+emit "# activity_path=$ACTIVITY_PATH"
+emit "# generated=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+emit ""
+
+for wave in $(seq 1 "$WAVES"); do
+  emit "## wave $wave"
+  : > "$TMP_RESULTS"
+  run_burst "$LAYOUT_PATH" "layout" "$wave"
+  run_burst "$ACTIVITY_PATH" "activity" "$wave"
+  while read -r line; do
+    emit "  $line"
+  done < "$TMP_RESULTS"
+  if [ "$wave" -lt "$WAVES" ]; then
+    sleep "$WAVE_PAUSE"
+  fi
+done
+
+emit ""
+emit "## summary — how to read the report"
+emit "#   status=429 + content_type starts with application/json + x_ratelimit_limit set"
+emit "#     => workspace-server bucket overflow. Closes when #60 deploys."
+emit "#   status=429 + cf_ray set + content_type=text/html"
+emit "#     => Cloudflare WAF / rate-limit. Audit dashboard rules per #62."
+emit "#   status=429 + x_vercel set + content_type=text/html"
+emit "#     => Vercel edge rate-limit / firewall. Audit Vercel project per #62."
+emit "#   status=429 with no server/cf_ray/x_vercel"
+emit "#     => corporate proxy or VPN. Not actionable in this repo."
+
+if [ -n "$OUT" ]; then
+  echo "→ Report written to $OUT" >&2
+  # Match only data lines (begin with two-space indent + "label="),
+  # not the summary's reference text which also mentions "status=429".
+  # grep -c outputs "0" + exits 1 when zero matches; `|| true` masks
+  # the exit status so set -e doesn't trip without losing the count.
+  total=$(grep -c '^  label=' "$OUT" 2>/dev/null || true)
+  total429=$(grep -c '^  label=.*status=429' "$OUT" 2>/dev/null || true)
+  total=${total:-0}
+  total429=${total429:-0}
+  echo "→ Totals: ${total429} of ${total} requests returned 429" >&2
+  if [ "${total429}" -gt 0 ]; then
+    echo "→ Per-label 429 counts:" >&2
+    grep '^  label=.*status=429' "$OUT" \
+      | sed -E 's/^  label=([^-]+).*/  \1/' \
+      | sort | uniq -c >&2
+  fi
+fi
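Reviewer note: the "how to read the report" rules emitted by the script can also be applied mechanically. The sketch below is a hedged Python restatement of those rules, not part of the probe. `parse_record` and `classify_429` are hypothetical names, and the parser assumes the whitespace-separated `key=value` record format that `run_burst` writes. Header values that contain spaces (for example `text/html; charset=UTF-8`) are split at the space, so only the leading part of `content_type` is compared.

```python
from __future__ import annotations


def parse_record(line: str) -> dict[str, str]:
    """Split one report line into a key -> value mapping."""
    fields: dict[str, str] = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def classify_429(fields: dict[str, str]) -> str:
    """Apply the report's summary rules to one parsed record."""
    if fields.get("status") != "429":
        return "not-429"
    content_type = fields.get("content_type", "")
    if content_type.startswith("application/json") and fields.get("x_ratelimit_limit"):
        return "workspace-server"
    if fields.get("cf_ray") and content_type.startswith("text/html"):
        return "cloudflare"
    if fields.get("x_vercel") and content_type.startswith("text/html"):
        return "vercel"
    if not (fields.get("server") or fields.get("cf_ray") or fields.get("x_vercel")):
        return "proxy-or-vpn"
    return "unclassified"
```

Records that match none of the four documented shapes fall into `unclassified` rather than being forced into a bucket, so an operator can inspect them by hand.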
@@ -19,9 +19,15 @@ Exit codes:
    0  — no collisions
    1  — collision detected; output names the conflicting PR(s) for the author

-Designed to run from a GitHub Actions PR check. Reads PR metadata via the
-GitHub CLI (gh) which is preinstalled on ubuntu-latest runners. Runs in
-under 10s against a typical PR.
+Designed to run from a Gitea Actions PR check. Reads PR metadata via direct
+HTTP calls to Gitea's REST API (`/api/v1/`), which on the molecule-ai fleet
+lives at https://git.moleculesai.app. Runs in under 10s against a typical PR.
+
+Post-2026-05-06 (Gitea migration, issue #75): the previous version called
+the GitHub CLI (``gh pr list``, ``gh pr diff``). On Gitea those calls hit
+either the GraphQL endpoint (HTTP 405) or /api/v3 (HTTP 404). This module
+now talks to /api/v1 directly via urllib so it works against any Gitea
+host without a `gh` install or extra dependencies.
 """

 from __future__ import annotations
@@ -31,12 +37,70 @@ import os
 import re
 import subprocess
 import sys
+import urllib.error
+import urllib.parse
+import urllib.request
 from pathlib import Path

 MIGRATIONS_DIR = "workspace-server/migrations"
 MIGRATION_FILE_RE = re.compile(r"^(\d+)_[^/]+\.(up|down)\.sql$")


+def _gitea_api_url() -> str:
+    """Resolve the Gitea API base URL.
+
+    act_runner forwards github.server_url as GITHUB_SERVER_URL; for the
+    molecule-ai fleet that's https://git.moleculesai.app. Append /api/v1
+    to get the REST root. Override directly via GITEA_API_URL for tests
+    or non-default hosts.
+    """
+    env_override = os.environ.get("GITEA_API_URL", "").rstrip("/")
+    if env_override:
+        return env_override
+    server = os.environ.get("GITHUB_SERVER_URL", "https://git.moleculesai.app").rstrip("/")
+    return f"{server}/api/v1"
+
+
+def _gitea_token() -> str:
+    """Resolve the Gitea token from env. GITEA_TOKEN wins; falls back
+    to GITHUB_TOKEN (set by act_runner) and GH_TOKEN (operator habit
+    from the GitHub era)."""
+    return (
+        os.environ.get("GITEA_TOKEN")
+        or os.environ.get("GITHUB_TOKEN")
+        or os.environ.get("GH_TOKEN")
+        or ""
+    )
+
+
+def _gitea_get(path: str, params: dict[str, str] | None = None) -> bytes | None:
+    """GET against /api/v1; returns response body or None on HTTP error.
+
+    Errors return None (not raise) because callers handle missing data
+    by emitting an actionable workflow message rather than crashing the
+    PR check on a transient API blip.
+    """
+    base = _gitea_api_url()
+    qs = ""
+    if params:
+        qs = "?" + urllib.parse.urlencode(params)
+    url = f"{base}/{path.lstrip('/')}{qs}"
+    req = urllib.request.Request(url)
+    token = _gitea_token()
+    if token:
+        req.add_header("Authorization", f"token {token}")
+    req.add_header("Accept", "application/json")
+    try:
+        with urllib.request.urlopen(req, timeout=20) as resp:  # noqa: S310
+            return resp.read()
+    except urllib.error.HTTPError as e:
+        sys.stderr.write(f"Gitea API HTTP {e.code} on {path}: {e.reason}\n")
+        return None
+    except (urllib.error.URLError, TimeoutError) as e:
+        sys.stderr.write(f"Gitea API network error on {path}: {e}\n")
+        return None
+
+
 def run(cmd: list[str], check: bool = True) -> str:
    """Run a subprocess and return stdout. Raise on non-zero when check=True."""
    result = subprocess.run(cmd, capture_output=True, text=True)
@@ -96,32 +160,49 @@ def open_prs_with_migration_prefix(
    repo: str, prefix: int, exclude_pr: int
 ) -> list[dict]:
    """Return open PRs (other than `exclude_pr`) that add a migration with
-    `prefix`. Uses `gh pr diff` per PR — we only need to walk PRs that are
-    actually in flight, so the cost is bounded by open-PR count.
+    `prefix`. Walks open PRs via Gitea's `/repos/{owner}/{repo}/pulls` and
+    fetches each one's changed-file list via `/pulls/{n}/files`. Only the
+    first page (limit=50) of open PRs is examined, with no pagination, so
+    the cost is bounded. The return shape mimics the GitHub CLI's
+    `--json number,headRefName`: ``[{"number": int, "headRefName": str}, ...]``.
    """
-    out = run([
-        "gh", "pr", "list", "--repo", repo, "--state", "open",
-        "--json", "number,headRefName", "--limit", "100",
-    ])
-    prs = json.loads(out)
+    body = _gitea_get(
+        f"repos/{repo}/pulls",
+        {"state": "open", "limit": "50"},
+    )
+    if body is None:
+        # Best-effort: a transient Gitea blip shouldn't fail the PR
+        # check (the base-branch collision check runs locally and is
+        # the more common failure mode).
+        return []
+    prs = json.loads(body)
    matches: list[dict] = []
    for pr in prs:
        num = pr["number"]
        if num == exclude_pr:
            continue
-        try:
-            files = run([
-                "gh", "pr", "diff", str(num), "--repo", repo, "--name-only",
-            ], check=False)
-        except Exception:  # noqa: BLE001
+        # Gitea returns the head ref under .head.ref (REST shape);
+        # GitHub CLI's --json headRefName flattens it. Normalize on
+        # the way out so callers see the historical shape.
+        head_ref_name = (pr.get("head") or {}).get("ref", "")
+        files_body = _gitea_get(f"repos/{repo}/pulls/{num}/files", {"limit": "100"})
+        if files_body is None:
            continue
-        for raw in files.splitlines():
+        try:
+            files = json.loads(files_body)
+        except json.JSONDecodeError:
+            continue
+        for f in files:
+            # Gitea's /pulls/{n}/files returns objects with `.filename`
+            # (same as GitHub's REST). Older Gitea versions emit
+            # `.name` instead — handle both.
+            raw = f.get("filename") or f.get("name") or ""
            path = Path(raw.strip())
            if not path.name:
                continue
            m = MIGRATION_FILE_RE.match(path.name)
            if m and int(m.group(1)) == prefix:
-                matches.append(pr)
+                matches.append({"number": num, "headRefName": head_ref_name})
                break
    return matches
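Reviewer note: the matching step in `open_prs_with_migration_prefix` can be exercised without any network access once the per-PR file lists are in hand. The following is a minimal Python sketch under that assumption. `prs_adding_prefix` and the `files_by_pr` mapping are hypothetical, and the regular expression is copied from `MIGRATION_FILE_RE` above. Like the original, it matches on the file's base name only and does not check the directory.

```python
from __future__ import annotations

import re
from pathlib import PurePosixPath

MIGRATION_FILE_RE = re.compile(r"^(\d+)_[^/]+\.(up|down)\.sql$")


def prs_adding_prefix(
    files_by_pr: dict[int, list[str]], prefix: int, exclude_pr: int
) -> list[int]:
    """Return PR numbers (other than exclude_pr) that touch a migration with `prefix`."""
    matches: list[int] = []
    for number, filenames in files_by_pr.items():
        if number == exclude_pr:
            continue
        for raw in filenames:
            m = MIGRATION_FILE_RE.match(PurePosixPath(raw.strip()).name)
            if m and int(m.group(1)) == prefix:
                matches.append(number)
                break  # one hit per PR is enough
    return matches
```

Because `int(m.group(1))` is compared numerically, zero-padded prefixes such as `0042` and `42` count as the same migration number.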

@@ -138,7 +219,10 @@ def main() -> int:
    pr_number = int(pr_number_env)
    base_ref = os.environ.get("BASE_REF", "origin/staging")
    head_ref = os.environ.get("HEAD_REF", "HEAD")
-    repo = os.environ.get("GITHUB_REPOSITORY", "Molecule-AI/molecule-core")
+    # Default kept lowercase to match the Gitea-canonical org name
+    # (post-2026-05-06 migration). Tests + workflow context override
+    # via GITHUB_REPOSITORY which act_runner sets per-run.
+    repo = os.environ.get("GITHUB_REPOSITORY", "molecule-ai/molecule-core")

    added = migrations_in_diff(base_ref, head_ref)
    if not added:
@@ -105,5 +105,5 @@ Hard per-workflow timeouts (15–40 min) cap runaway cost. Three teardown layers

 ## Known gaps (tracked elsewhere)

-- [#1369](https://github.com/Molecule-AI/molecule-core/issues/1369): SaaS canvas Files / Terminal / Peers tabs — architecturally broken; whitelisted in the spec
+- [#1369](https://git.moleculesai.app/molecule-ai/molecule-core/issues/1369): SaaS canvas Files / Terminal / Peers tabs — architecturally broken; whitelisted in the spec
 - LLM-driven delegation (autonomous `delegate_task` tool use) — probabilistic, not in v1; proxy mechanics covered
@@ -1,5 +1,7 @@
 # Production-shape local harness

+<!-- Retrigger Harness Replays after Class G #168 + clone-manifest fix (#42). -->
+
 The harness brings up the SaaS tenant topology on localhost using the
 same `Dockerfile.tenant` image that ships to production. Tests target
 the cf-proxy on `http://localhost:8080` and pass the tenant identity
@@ -0,0 +1,14 @@
+# cf-proxy harness image — nginx + the harness's tenant-routing config baked
+# in at build time.
+#
+# Why bake (not bind-mount): on Gitea Actions / act_runner, the runner is a
+# container talking to the OUTER docker daemon over the host socket; runc
+# resolves bind-mount source paths on the outer host filesystem, where the
+# repo at `/workspace/.../tests/harness/cf-proxy/nginx.conf` is invisible.
+# Compose `configs:` (with `file:`) falls back to bind mounts when swarm is
+# not active, so it hits the same gap. A build-time COPY uploads the file
+# as part of the docker build context — the daemon receives the tarball
+# directly and never bind-mounts. See issue #88 item 2.
+FROM nginx:1.27-alpine
+
+COPY nginx.conf /etc/nginx/nginx.conf
@@ -167,15 +167,26 @@ services:
  # Production shape: same single CF tunnel front-doors every tenant
  # subdomain — the Host header carries the tenant identity, not the
  # routing destination. Local cf-proxy mirrors this exactly.
+  #
+  # nginx.conf delivery: built into a custom image via cf-proxy/Dockerfile
+  # (a thin nginx:1.27-alpine + COPY). NOT a bind mount and NOT a
+  # compose `configs:` block, both of which break under Gitea's
+  # act_runner: the runner talks to the OUTER docker daemon over the
+  # host socket, and runc resolves bind sources on the outer host
+  # filesystem, where `/workspace/.../tests/harness/cf-proxy/nginx.conf`
+  # is invisible. Compose `configs:` falls back to bind mounts without
+  # swarm, so it hits the same gap. A build context, by contrast, is
+  # uploaded to the daemon as a tarball at build time — no bind. See
+  # issue #88 item 2.
  cf-proxy:
-    image: nginx:1.27-alpine
+    build:
+      context: ./cf-proxy
+      dockerfile: Dockerfile
    depends_on:
      tenant-alpha:
        condition: service_healthy
      tenant-beta:
        condition: service_healthy
-    volumes:
-      - ./cf-proxy/nginx.conf:/etc/nginx/nginx.conf:ro
    # Bind to 127.0.0.1 only — hardcoded ADMIN_TOKENs make 0.0.0.0
    # exposure unsafe even on a local network.
    ports:
@@ -0,0 +1,252 @@
+#!/usr/bin/env bash
+# tools/branch-protection/check_name_parity.sh — assert every required-
+# check name listed in apply.sh maps to a workflow job whose "always
+# emits this status" shape is intact.
+#
+# Closes #144 / encodes the saved memory
+# feedback_branch_protection_check_name_parity:
+#
+#   "Path filters (e.g., detect-changes → conditional skip) silently
+#    break branch protection because no job emits the protected
+#    sentinel status when path-filter returns false."
+#
+# Two safe shapes for a required-check job:
+#
+#   1. Single-job-with-per-step-if (path-filter case):
+#      The workflow has NO top-level `paths:` filter; the always-running
+#      job has steps gated on `if: needs.<gate>.outputs.<flag> == 'true'`
+#      so the no-op step alone fires when paths exclude the commit.
+#      Used by ci.yml's Platform/Canvas/Python/Shellcheck and by
+#      e2e-api.yml / e2e-staging-canvas.yml / runtime-prbuild-compat.yml.
+#
+#   2. Aggregator-with-needs+always() (matrix-refactor case):
+#      An aggregator job named after the protected check `needs:` the
+#      matrix children + uses `if: always()` + checks each child's
+#      result. (Not currently in this repo but supported.)
+#
+# Unsafe shape this script catches:
+#   - Workflow has top-level `paths:` filter AND the protected check
+#     name is on a single job. When paths-filter excludes a commit, the
+#     workflow doesn't fire — branch protection waits forever.
+#
+# Exit codes:
+#   0 — every required check name has at least one safe-shape match
+#   1 — a required name has no match OR matches an unsafe shape
+#   2 — script-internal error (apply.sh missing, awk failure, etc.)
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+WORKFLOWS_DIR="$REPO_ROOT/.github/workflows"
+APPLY_SH="$SCRIPT_DIR/apply.sh"
+
+if [[ ! -f "$APPLY_SH" ]]; then
+  echo "check_name_parity: missing apply.sh at $APPLY_SH" >&2
+  exit 2
+fi
+if [[ ! -d "$WORKFLOWS_DIR" ]]; then
+  echo "check_name_parity: missing .github/workflows at $WORKFLOWS_DIR" >&2
+  exit 2
+fi
+
+# ─── Extract the union of required check names from apply.sh ──────
+# apply.sh has STAGING_CHECKS and MAIN_CHECKS heredocs; union them so
+# we audit any name that gates EITHER branch. Filters out blank lines
+# and the heredoc end marker. Sorted + uniq so the audit output is stable.
+#
+# Captures the heredoc end-marker dynamically from the `<<'MARKER'`
+# token on the opening line — the token can be `EOF` (production
+# apply.sh), `EOF2` (test fixtures with nested heredocs), or any other
+# bash-legal identifier. Without dynamic extraction, test fixtures
+# with nested heredocs would either skip-capture (wrong end marker)
+# or capture the inner end marker as a stray check name.
+#
+# Two-step approach to keep awk-portable across BSD awk (macOS) and
+# gawk (Linux): grep finds the heredoc-opening lines, sed extracts the
+# marker, then awk does the capture. Pure-awk attempts hit BSD-vs-GNU
+# regex/variable-init differences that regress silently — this shape
+# stays in POSIX-portable territory.
+extract_heredoc_block() {
+  local file="$1"
+  local marker="$2"
+  awk -v marker="$marker" '
+    $0 ~ ("<<.?" marker "([^A-Za-z0-9_]|$)") { capture=1; next }
+    $0 == marker && capture { capture=0; next }
+    capture && NF { print }
+  ' "$file"
+}
+
+# Find every heredoc-end marker used in apply.sh (typically just EOF
+# in the production script, but EOF2 / TAG / ABC are all valid in
+# fixtures or future expansions). Each marker maps to one or more
+# heredoc blocks; we union all of them.
+markers=$(grep -E "<<['\"]?[A-Za-z0-9_]+['\"]?[[:space:]]*\\|\\|" "$APPLY_SH" \
+  | sed -E "s/.*<<['\"]?([A-Za-z0-9_]+)['\"]?.*/\\1/" \
+  | sort -u)
+
+required_names=""
+while IFS= read -r marker; do
+  [[ -z "$marker" ]] && continue
+  block=$(extract_heredoc_block "$APPLY_SH" "$marker")
+  if [[ -n "$block" ]]; then
+    required_names+="$block"$'\n'
+  fi
+done <<< "$markers"
+
+required_names=$(printf '%s' "$required_names" | sort -u | sed '/^$/d')
+
+if [[ -z "$required_names" ]]; then
+  echo "check_name_parity: failed to extract required check names from apply.sh" >&2
+  exit 2
+fi
+
+# ─── For each required name, find the workflow file that owns it ──
+# A workflow "owns" a name if any `name:` line in the file equals the
+# required name. We look at job-level names AND the workflow-level
+# `name:` (the latter prefixes "Analyze" jobs in codeql.yml).
+#
+# Then we check whether the owning workflow has a top-level `paths:`
+# filter. The unsafe shape is:
+#   - top-level paths: filter present
+#   - AND the named job is gated only at the workflow level (no per-
+#     step `if:` gates)
+#
+# Distinguishing "no `paths:` filter" from "paths: filter + per-step
+# gating" requires parsing the YAML semantics. We do it heuristically:
+#
+#   - "no top-level paths:"     → safe by construction (workflow always
+#                                  fires)
+#   - "paths: present"          → unsafe either way. If the file also
+#                                  has an `if: needs.<x>.outputs`
+#                                  step gate, report UNSAFE-MIX;
+#                                  otherwise report
+#                                  UNSAFE-PATH-FILTER.
+#
+# Heuristic so it stays a portable bash + awk + grep tool — full YAML
+# parsing would need yq which isn't a dependency. The known unsafe
+# shape (workflow-level paths: AND no per-step if-gates) is what we're
+# trying to catch.
+
+failed=0
+declare -a unsafe_findings=()
+
+while IFS= read -r name; do
+  [[ -z "$name" ]] && continue
+  # Find every workflow file that contains a job with `name: <name>` or
+  # whose top-level workflow `name:` plus matrix substitution would
+  # produce <name>. Need to be careful about quoting — YAML allows
+  # `name: Foo`, `name: "Foo"`, `name: 'Foo'`. Strip quotes.
+  matches=()
+  while IFS= read -r f; do
+    # Look for an exact `name:` match (anywhere in the file). The
+    # workflow-level name line is at column 0; job-level names are
+    # indented. Either is acceptable for parity — what matters is
+    # whether the EMITTED check-run name is the one we required.
+    # Strip surrounding quotes/whitespace before comparing.
+    if awk -v want="$name" '
+      /^[[:space:]]*name:[[:space:]]*/ {
+        line = $0
+        sub(/^[[:space:]]*name:[[:space:]]*/, "", line)
+        # Strip trailing comment + whitespace first, so the closing
+        # quote ends the line; then strip surrounding " or '\''
+        sub(/[[:space:]]*#.*$/, "", line)
+        sub(/[[:space:]]+$/, "", line)
+        gsub(/^["\047]|["\047]$/, "", line)
+        if (line == want) found = 1
+      }
+      END { exit !found }
+    ' "$f"; then
+      matches+=("$f")
+    fi
+  done < <(find "$WORKFLOWS_DIR" -name '*.yml' -o -name '*.yaml')
+
+  if [[ ${#matches[@]} -eq 0 ]]; then
+    # Special case — Analyze (go/javascript-typescript/python) is
+    # generated by codeql.yml's matrix expansion of `Analyze (${{
+    # matrix.language }})`. Don't flag those as missing if codeql.yml
+    # exists with the expected base name.
+    case "$name" in
+      "Analyze (go)"|"Analyze (javascript-typescript)"|"Analyze (python)")
+        # shellcheck disable=SC2016
+        # The literal `${{ matrix.language }}` is the GHA template
+        # syntax we're searching FOR — not a shell expansion. SC2016
+        # would have us add quotes that defeat the search.
+        if [[ -f "$WORKFLOWS_DIR/codeql.yml" ]] && \
+           grep -q 'name: Analyze (${{[[:space:]]*matrix.language[[:space:]]*}})' "$WORKFLOWS_DIR/codeql.yml"; then
+          matches=("$WORKFLOWS_DIR/codeql.yml")
+        fi
+        ;;
+    esac
+  fi
+
+  if [[ ${#matches[@]} -eq 0 ]]; then
+    unsafe_findings+=("MISSING: required check name '$name' has no matching workflow job")
+    failed=1
+    continue
+  fi
+
+  # For each owning workflow, classify safe vs unsafe.
+  for f in "${matches[@]}"; do
+    rel="${f#"$REPO_ROOT"/}"
+    # Heuristic: does the workflow have a top-level `paths:` filter?
+    # Top-level here means under the `on:` key, not under jobs.<x>.if.
+    # Workflow-level paths filters appear at indent depth 4 (under
+    # `push:` or `pull_request:`). Job-level `if:` paths-filter doesn't
+    # block the workflow from firing.
+    has_top_paths=0
+    if awk '
+      # Track whether we are inside the `on:` block. The `on:` block
+      # starts at column 0 (`on:` key) and ends when the next column-0
+      # key appears.
+      /^on:[[:space:]]*$/ { in_on = 1; next }
+      /^[a-zA-Z]/ && in_on { in_on = 0 }
+      in_on && /^[[:space:]]+paths:[[:space:]]*$/ { print "yes"; exit }
+      in_on && /^[[:space:]]+paths:[[:space:]]*\[/ { print "yes"; exit }
+    ' "$f" | grep -q yes; then
+      has_top_paths=1
+    fi
+
+    if [[ "$has_top_paths" -eq 0 ]]; then
+      # Safe: workflow always fires. If there are inner per-step if-
+      # gates (single-job-with-per-step-if pattern), the no-op step
+      # produces SUCCESS for the protected name — branch-protection-clean.
+      continue
+    fi
+
+    # Top-level paths: is present, so the workflow is unsafe either
+    # way; the grep below only picks which finding to report. It looks
+    # for any `if:` that references a `needs.<job>.outputs.<flag>`
+    # value anywhere in the file (a whole-file heuristic, not scoped to
+    # the owning job). A hit means UNSAFE-MIX, no hit UNSAFE-PATH-FILTER.
+    #
+    # The regex is intentionally anchored loosely: actual workflow
+    # YAML writes per-step if-gates as `      - if: needs.X.outputs.Y`
+    # (with the `-` step-marker between the leading spaces and the
+    # `if`). Anchoring on `^[[:space:]]+if:` would miss those.
+    if grep -qE "if:[[:space:]]+needs\.[a-zA-Z_-]+\.outputs\." "$f"; then
+      # Per-step if-gates exist. Combined with top-level paths: this
+      # would be a buggy mix (the workflow might still skip entirely
+      # when paths exclude). Flag as unsafe — the safe pattern omits
+      # the top-level paths: filter altogether and gates per-step.
+      unsafe_findings+=("UNSAFE-MIX: $rel has top-level paths: AND per-step if-gates — when paths exclude the commit, the workflow doesn't fire and the required check '$name' is silently absent. Drop the top-level paths: filter; keep the per-step if-gates.")
+      failed=1
+    else
+      # Top-level paths: with no per-step if-gates: the canonical
+      # check-name parity bug.
+      unsafe_findings+=("UNSAFE-PATH-FILTER: $rel has top-level paths: filter and no per-step if-gates. When paths exclude the commit, no job emits the required check '$name' — branch protection waits forever. Either drop the paths: filter and add per-step if-gates against a detect-changes output, or add an aggregator-with-needs+always() job that emits '$name'.")
+      failed=1
+    fi
+  done
+done <<< "$required_names"
+
+if [[ "$failed" -eq 0 ]]; then
+  echo "check_name_parity: OK — every required check name maps to a safe workflow shape."
+  exit 0
+fi
+
+echo "check_name_parity: FOUND ${#unsafe_findings[@]} issue(s):" >&2
+for finding in "${unsafe_findings[@]}"; do
+  echo "  - $finding" >&2
+done
+exit 1
@@ -0,0 +1,285 @@
+#!/usr/bin/env bash
+# tools/branch-protection/test_check_name_parity.sh — unit tests for
+# check_name_parity.sh.
+#
+# Builds synthetic apply.sh + workflow files in a tmpdir for each case,
+# invokes the script with REPO_ROOT pointing at the tmpdir, and asserts
+# on exit code + stderr. Per feedback_assert_exact_not_substring we
+# pin the EXACT exit code AND a substring of the stderr that names the
+# offending workflow + name combo — so a "false-pass that prints the
+# wrong message" still fails the test.
+#
+# Run locally: bash tools/branch-protection/test_check_name_parity.sh
+# Run in CI:  same — added to ci.yml's shellcheck job's "E2E bash unit
+#             tests" step alongside test_model_slug.sh.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SCRIPT_UNDER_TEST="$SCRIPT_DIR/check_name_parity.sh"
+
+if [[ ! -x "$SCRIPT_UNDER_TEST" ]]; then
+  echo "test_check_name_parity: script under test missing or not executable: $SCRIPT_UNDER_TEST" >&2
+  exit 2
+fi
+
+PASSED=0
+FAILED=0
+
+# Tracks the active tmpdir for the running case so the trap can clean
+# up even when assertions abort the case mid-flight.
+TMPDIR_FOR_CASE=""
+trap '[[ -n "$TMPDIR_FOR_CASE" && -d "$TMPDIR_FOR_CASE" ]] && rm -rf "$TMPDIR_FOR_CASE"' EXIT
+
+# Build a synthetic repo at $1 whose stub apply.sh lists $2 (one name
+# per line) as both the staging and the main required set. The caller
+# (run_case) then writes the .github/workflows/* file for the case.
+make_fake_repo() {
+  local root="$1"
+  local checks="$2"
+  mkdir -p "$root/tools/branch-protection"
+  mkdir -p "$root/.github/workflows"
+  cat > "$root/tools/branch-protection/apply.sh" <<EOF
+#!/usr/bin/env bash
+# Stub apply.sh — only the heredoc-shaped check lists matter for the
+# parity script. Other functions intentionally absent.
+
+read -r -d '' STAGING_CHECKS <<'EOF2' || true
+$checks
+EOF2
+
+read -r -d '' MAIN_CHECKS <<'EOF2' || true
+$checks
+EOF2
+EOF
+  chmod +x "$root/tools/branch-protection/apply.sh"
+  # Place the script-under-test alongside its sibling apply.sh so the
+  # script's REPO_ROOT walk finds the synthetic .github/workflows/.
+  cp "$SCRIPT_UNDER_TEST" "$root/tools/branch-protection/check_name_parity.sh"
+}
+
+run_case() {
+  local desc="$1"
+  local checks="$2"
+  local workflow_yaml="$3"   # contents to write
+  local workflow_filename="$4"
+  local expected_exit="$5"
+  local expected_stderr_substring="$6"
+  TMPDIR_FOR_CASE=$(mktemp -d)
+  make_fake_repo "$TMPDIR_FOR_CASE" "$checks"
+  printf '%s' "$workflow_yaml" > "$TMPDIR_FOR_CASE/.github/workflows/$workflow_filename"
+  local stderr_file
+  stderr_file=$(mktemp)
+  local actual_exit=0
+  bash "$TMPDIR_FOR_CASE/tools/branch-protection/check_name_parity.sh" 2>"$stderr_file" >/dev/null || actual_exit=$?
+  local stderr_content
+  stderr_content=$(cat "$stderr_file")
+  rm "$stderr_file"
+  if [[ "$actual_exit" -ne "$expected_exit" ]]; then
+    echo "FAIL: $desc"
+    echo "  expected exit: $expected_exit, got: $actual_exit"
+    echo "  stderr: $stderr_content"
+    FAILED=$((FAILED+1))
+    rm -rf "$TMPDIR_FOR_CASE"; TMPDIR_FOR_CASE=""
+    return
+  fi
+  # Empty expected substring → no assertion on stderr (used for the
+  # passing case where stderr should be empty / not interesting).
+  if [[ -n "$expected_stderr_substring" ]]; then
+    if ! grep -qF "$expected_stderr_substring" <<< "$stderr_content"; then
+      echo "FAIL: $desc"
+      echo "  expected stderr to contain: '$expected_stderr_substring'"
+      echo "  actual stderr: $stderr_content"
+      FAILED=$((FAILED+1))
+      rm -rf "$TMPDIR_FOR_CASE"; TMPDIR_FOR_CASE=""
+      return
+    fi
+  fi
+  echo "PASS: $desc"
+  PASSED=$((PASSED+1))
+  rm -rf "$TMPDIR_FOR_CASE"; TMPDIR_FOR_CASE=""
+}
+
+# Case 1: safe workflow — no top-level paths: filter, single job
+# emitting the required name. Should exit 0.
+run_case "safe: no paths filter, job emits required name" \
+  "Foo Build" \
+  "$(cat <<'EOF'
+name: Foo
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  foo:
+    name: Foo Build
+    runs-on: ubuntu-latest
+    steps:
+      - run: echo ok
+EOF
+)" \
+  "foo.yml" \
+  0 \
+  ""
+
+# Case 2: unsafe — top-level paths: filter AND no per-step if-gates.
+# This is the silent-block shape from the saved memory.
+run_case "unsafe: top-level paths: filter without per-step if-gates" \
+  "Bar Build" \
+  "$(cat <<'EOF'
+name: Bar
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'bar/**'
+  pull_request:
+    paths:
+      - 'bar/**'
+
+jobs:
+  bar:
+    name: Bar Build
+    runs-on: ubuntu-latest
+    steps:
+      - run: echo ok
+EOF
+)" \
+  "bar.yml" \
+  1 \
+  "UNSAFE-PATH-FILTER"
+
+# Case 3: required name has no emitter at all.
+run_case "missing: required name not in any workflow" \
+  "Nonexistent Job" \
+  "$(cat <<'EOF'
+name: Other
+
+on:
+  pull_request:
+
+jobs:
+  other:
+    name: Other Job
+    runs-on: ubuntu-latest
+    steps:
+      - run: echo ok
+EOF
+)" \
+  "other.yml" \
+  1 \
+  "MISSING: required check name 'Nonexistent Job'"
+
+# Case 4: safe — top-level paths: filter is absent BUT per-step if-
+# gates are present (single-job-with-per-step-if pattern, what
+# ci.yml + e2e-api.yml use). Should exit 0.
+run_case "safe: per-step if-gates without top-level paths" \
+  "Baz Build" \
+  "$(cat <<'EOF'
+name: Baz
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  changes:
+    name: Detect changes
+    runs-on: ubuntu-latest
+    outputs:
+      baz: ${{ steps.check.outputs.baz }}
+    steps:
+      - id: check
+        run: echo "baz=true" >> "$GITHUB_OUTPUT"
+
+  baz:
+    needs: changes
+    name: Baz Build
+    runs-on: ubuntu-latest
+    steps:
+      - if: needs.changes.outputs.baz != 'true'
+        run: echo no-op
+      - if: needs.changes.outputs.baz == 'true'
+        run: echo real work
+EOF
+)" \
+  "baz.yml" \
+  0 \
+  ""
+
+# Case 5: unsafe-mix — top-level paths: AND per-step if-gates. The
+# script flags this distinctly because the workflow may STILL skip
+# entirely when paths exclude the commit (the per-step gates only
+# matter if the workflow actually fires).
+run_case "unsafe-mix: top-level paths: AND per-step if-gates" \
+  "Qux Build" \
+  "$(cat <<'EOF'
+name: Qux
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - 'qux/**'
+  pull_request:
+    paths:
+      - 'qux/**'
+
+jobs:
+  changes:
+    name: Detect changes
+    runs-on: ubuntu-latest
+    outputs:
+      qux: ${{ steps.check.outputs.qux }}
+    steps:
+      - id: check
+        run: echo "qux=true" >> "$GITHUB_OUTPUT"
+
+  qux:
+    needs: changes
+    name: Qux Build
+    runs-on: ubuntu-latest
+    steps:
+      - if: needs.changes.outputs.qux == 'true'
+        run: echo build
+EOF
+)" \
+  "qux.yml" \
+  1 \
+  "UNSAFE-MIX"
+
+# Case 6: codeql.yml matrix — required names like "Analyze (go)" are
+# generated by `Analyze (${{ matrix.language }})`. Script must
+# special-case match this pattern.
+run_case "matrix: codeql Analyze (go) is recognised via matrix expansion" \
+  "$(printf 'Analyze (go)\nAnalyze (javascript-typescript)\nAnalyze (python)')" \
+  "$(cat <<'EOF'
+name: CodeQL
+
+on:
+  pull_request:
+
+jobs:
+  analyze:
+    name: Analyze (${{ matrix.language }})
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        language: [go, javascript-typescript, python]
+    steps:
+      - run: echo analyse
+EOF
+)" \
+  "codeql.yml" \
+  0 \
+  ""
+
+echo ""
+echo "================================================"
+echo "test_check_name_parity: $PASSED passed, $FAILED failed"
+echo "================================================"
+exit "$FAILED"
@@ -0,0 +1,49 @@
+# .air.toml: live-reload config for local docker-compose dev mode.
+#
+# Active when the platform service runs from workspace-server/Dockerfile.dev
+# (selected via docker-compose.dev.yml overlay). In production, the regular
+# Dockerfile builds a static binary; air is dev-only.
+#
+# Reference: https://github.com/air-verse/air
+
+root = "."
+testdata_dir = "testdata"
+tmp_dir = "tmp"
+
+[build]
+  # Same build invocation as Dockerfile's builder stage minus the
+  # CGO_ENABLED=0 toggle (CGO ok in dev for richer race detector output).
+  cmd = "go build -o ./tmp/server ./cmd/server"
+  bin = "tmp/server"
+  full_bin = ""
+  args_bin = []
+  # Watch every .go, .yaml, and .tmpl file under workspace-server/.
+  include_ext = ["go", "yaml", "tmpl"]
+  # Don't watch tests, build artifacts, vendored deps, or migration .sql
+  # (migrations need a clean DB anyway — handled by docker-compose down/up).
+  exclude_dir = ["assets", "tmp", "vendor", "testdata", "node_modules"]
+  exclude_file = []
+  # _test.go and *_mock.go shouldn't trigger a rebuild — saves cycles.
+  exclude_regex = ["_test\\.go$", "_mock\\.go$"]
+  exclude_unchanged = true
+  follow_symlink = false
+  log = "build-errors.log"
+  # After sending the interrupt, wait 1s before killing the old binary.
+  kill_delay = "1s"
+  send_interrupt = true
+  stop_on_error = true
+  # Debounce: wait this long (ms) after last change before triggering rebuild.
+  delay = 500
+
+[log]
+  time = false
+
+[color]
+  main = "magenta"
+  watcher = "cyan"
+  build = "yellow"
+  runner = "green"
+
+[misc]
+  # Don't keep the tmp/ dir around between runs.
+  clean_on_exit = true
@@ -1,7 +1,15 @@
-# Platform-only image (no canvas). Used by publish-platform-image workflow
-# for GHCR + Fly registry. Tenant image uses Dockerfile.tenant instead.
+# Platform-only image (no canvas). Used by publish-workspace-server-image
+# workflow for ECR. Tenant image uses Dockerfile.tenant instead.
 #
-# Build context: repo root.
+# Templates + plugins are pre-cloned by scripts/clone-manifest.sh (in CI
+# or on the operator host) into .tenant-bundle-deps/ — same pattern as
+# Dockerfile.tenant. See that file's header for the full rationale; the
+# short version is that post-2026-05-06 every workspace-template-* and
+# org-template-* repo on Gitea is private, so an in-image `git clone`
+# has no auth path that doesn't leak the Gitea token into a layer.
+#
+# Build context: repo root, with `.tenant-bundle-deps/` populated by the
+# workflow's "Pre-clone manifest deps" step (Task #173).

 FROM golang:1.25-alpine AS builder
 WORKDIR /app
@@ -26,21 +34,18 @@ RUN CGO_ENABLED=0 GOOS=linux go build \
    -ldflags "-X github.com/Molecule-AI/molecule-monorepo/platform/internal/buildinfo.GitSHA=${GIT_SHA}" \
    -o /memory-plugin ./cmd/memory-plugin-postgres

-# Clone templates + plugins at build time from manifest.json
-FROM alpine:3.20 AS templates
-RUN apk add --no-cache git jq
-COPY manifest.json /manifest.json
-COPY scripts/clone-manifest.sh /scripts/clone-manifest.sh
-RUN chmod +x /scripts/clone-manifest.sh && /scripts/clone-manifest.sh /manifest.json /workspace-configs-templates /org-templates /plugins
-
 FROM alpine:3.20
 RUN apk add --no-cache ca-certificates git tzdata wget
 COPY --from=builder /platform /platform
 COPY --from=builder /memory-plugin /memory-plugin
 COPY workspace-server/migrations /migrations
-COPY --from=templates /workspace-configs-templates /workspace-configs-templates
-COPY --from=templates /org-templates /org-templates
-COPY --from=templates /plugins /plugins
+# Templates + plugins (pre-cloned by scripts/clone-manifest.sh in the
+# trusted CI / operator-host context, .git already stripped). The Gitea
+# token used to clone them never enters this image — same shape as
+# Dockerfile.tenant.
+COPY .tenant-bundle-deps/workspace-configs-templates /workspace-configs-templates
+COPY .tenant-bundle-deps/org-templates /org-templates
+COPY .tenant-bundle-deps/plugins /plugins
 # Non-root runtime with Docker socket access for workspace provisioning.
 RUN addgroup -g 1000 platform && adduser -u 1000 -G platform -s /bin/sh -D platform
 EXPOSE 8080
@@ -0,0 +1,38 @@
+# Dockerfile.dev — local-development image with air-driven live reload.
+#
+# Selected by docker-compose.dev.yml (overlay over docker-compose.yml).
+# Production stays on workspace-server/Dockerfile (static binary, no air).
+#
+# Workflow:
+#   1. docker compose -f docker-compose.yml -f docker-compose.dev.yml up
+#   2. Edit any .go file under workspace-server/
+#   3. air detects, rebuilds, kills old binary, starts new one (~3-5s)
+#   4. No `docker compose up --build` needed
+#
+# Templates + plugins are NOT pre-cloned here — air-mode assumes the
+# developer's filesystem has the workspace-configs-templates/ + plugins/
+# dirs available, mounted at runtime via docker-compose.dev.yml.
+
+FROM golang:1.25-alpine
+
+# air, git (go mod), ca-certs (TLS), tzdata, wget, gcc + musl-dev (cgo).
+RUN apk add --no-cache git ca-certificates tzdata wget gcc musl-dev \
+ && go install github.com/air-verse/air@latest
+
+WORKDIR /app/workspace-server
+
+# Pre-fetch deps so the first `air` rebuild on a fresh container is fast.
+# These are bind-mount-overridden at runtime, so the COPY here is just
+# to warm the module cache.
+COPY workspace-server/go.mod workspace-server/go.sum ./
+RUN go mod download
+
+# Source is bind-mounted at runtime (see docker-compose.dev.yml volumes
+# block) so the Dockerfile doesn't need to COPY it. air watches the
+# bind-mounted dir for changes.
+
+ENV CGO_ENABLED=1
+ENV GOFLAGS="-buildvcs=false"
+
+# Run air with the .air.toml in the bind-mounted source dir.
+CMD ["air", "-c", ".air.toml"]
@@ -3,14 +3,34 @@
 # Serves both the API (Go on :8080) and the UI (Node.js on :3000) in a
 # single container. Go reverse-proxies unknown routes to canvas.
 #
-# Templates are cloned from standalone GitHub repos at build time so the
-# monorepo doesn't need to carry them. The repos are public; no auth.
+# Templates + plugins are NOT cloned at build time. They are pre-cloned
+# in the trusted CI context (or operator host) by
+# `scripts/clone-manifest.sh` into `.tenant-bundle-deps/` and COPYed in.
+# The reason: post-2026-05-06, every workspace-template-* repo on Gitea
+# (codex, crewai, deepagents, gemini-cli, langgraph) plus all 7
+# org-template-* repos are private, so the Docker build can't `git clone`
+# from inside the build context — there's no auth path that doesn't leak
+# the Gitea token into an image layer. Pre-cloning keeps the token in
+# the CI environment only; the resulting image carries the cloned trees
+# with `.git` already stripped (see clone-manifest.sh).
 #
-# Build context: repo root.
+# Build context: repo root, with `.tenant-bundle-deps/` populated by:
+#
+#     MOLECULE_GITEA_TOKEN=<persona-PAT> scripts/clone-manifest.sh \
+#       manifest.json \
+#       .tenant-bundle-deps/workspace-configs-templates \
+#       .tenant-bundle-deps/org-templates \
+#       .tenant-bundle-deps/plugins
+#
+# In CI this happens in publish-workspace-server-image.yml's "Pre-clone
+# manifest deps" step (uses AUTO_SYNC_TOKEN = devops-engineer persona).
+# For a manual operator-host build, source the same token from
+# /etc/molecule-bootstrap/agent-secrets.env first.
 #
 #   docker buildx build --platform linux/amd64 \
 #     -f workspace-server/Dockerfile.tenant \
-#     -t registry.fly.io/molecule-tenant:latest \
+#     -t <ECR>/molecule-ai/platform-tenant:latest \
+#     --build-arg GIT_SHA=<sha> --build-arg NEXT_PUBLIC_PLATFORM_URL= \
 #     --push .

 # ── Stage 1: Go platform binary ──────────────────────────────────────
@@ -55,14 +75,7 @@ ENV NEXT_PUBLIC_PLATFORM_URL=$NEXT_PUBLIC_PLATFORM_URL
 ENV NEXT_PUBLIC_WS_URL=$NEXT_PUBLIC_WS_URL
 RUN npm run build

-# ── Stage 3: Clone templates + plugins from manifest.json ─────────────
-FROM alpine:3.20 AS templates
-RUN apk add --no-cache git jq
-COPY manifest.json /manifest.json
-COPY scripts/clone-manifest.sh /scripts/clone-manifest.sh
-RUN chmod +x /scripts/clone-manifest.sh && /scripts/clone-manifest.sh /manifest.json /workspace-configs-templates /org-templates /plugins
-
-# ── Stage 4: Runtime ──────────────────────────────────────────────────
+# ── Stage 3: Runtime ──────────────────────────────────────────────────
 FROM node:20-alpine
 RUN apk add --no-cache ca-certificates git tzdata openssh-client aws-cli

@@ -87,10 +100,13 @@ COPY --from=go-builder /platform /platform
 COPY --from=go-builder /memory-plugin /memory-plugin
 COPY workspace-server/migrations /migrations

-# Templates + plugins (cloned from GitHub in stage 3)
-COPY --from=templates /workspace-configs-templates /workspace-configs-templates
-COPY --from=templates /org-templates /org-templates
-COPY --from=templates /plugins /plugins
+# Templates + plugins (pre-cloned by scripts/clone-manifest.sh in the
+# trusted CI / operator-host context, .git already stripped — see
+# .tenant-bundle-deps/ in the build context). The Gitea token used to
+# clone them never enters this image.
+COPY .tenant-bundle-deps/workspace-configs-templates /workspace-configs-templates
+COPY .tenant-bundle-deps/org-templates /org-templates
+COPY .tenant-bundle-deps/plugins /plugins

 # Canvas standalone
 WORKDIR /canvas
@@ -0,0 +1,89 @@
+package main
+
+import "testing"
+
+// TestResolveBindHost pins the precedence: BIND_ADDR explicit > dev-mode
+// fail-open default of 127.0.0.1 > production-shape empty (all interfaces).
+//
+// Mutation-test invariant: removing the IsDevModeFailOpen() branch makes
+// "no_bindaddr_devmode_unset_admin" fail (returns "" instead of "127.0.0.1").
+// Removing the BIND_ADDR branch makes "explicit_bindaddr_*" cases fail.
+func TestResolveBindHost(t *testing.T) {
+	cases := []struct {
+		name       string
+		bindAddr   string
+		adminToken string
+		molEnv     string
+		want       string
+	}{
+		{
+			name:       "no_bindaddr_devmode_unset_admin",
+			bindAddr:   "",
+			adminToken: "",
+			molEnv:     "dev",
+			want:       "127.0.0.1",
+		},
+		{
+			name:       "no_bindaddr_devmode_unset_admin_full_word",
+			bindAddr:   "",
+			adminToken: "",
+			molEnv:     "development",
+			want:       "127.0.0.1",
+		},
+		{
+			name:       "no_bindaddr_admin_set_in_dev_env",
+			bindAddr:   "",
+			adminToken: "secret",
+			molEnv:     "dev",
+			want:       "", // ADMIN_TOKEN flips IsDevModeFailOpen to false → all interfaces
+		},
+		{
+			name:       "no_bindaddr_production_env",
+			bindAddr:   "",
+			adminToken: "",
+			molEnv:     "production",
+			want:       "", // production is not a dev value → all interfaces
+		},
+		{
+			name:       "no_bindaddr_unset_env",
+			bindAddr:   "",
+			adminToken: "",
+			molEnv:     "",
+			want:       "", // unset MOLECULE_ENV → not dev → all interfaces
+		},
+		{
+			name:       "explicit_bindaddr_loopback_overrides_devmode",
+			bindAddr:   "127.0.0.1",
+			adminToken: "",
+			molEnv:     "dev",
+			want:       "127.0.0.1",
+		},
+		{
+			name:       "explicit_bindaddr_wildcard_overrides_devmode_default",
+			bindAddr:   "0.0.0.0",
+			adminToken: "",
+			molEnv:     "dev",
+			want:       "0.0.0.0",
+		},
+		{
+			name:       "explicit_bindaddr_in_production",
+			bindAddr:   "10.0.5.7",
+			adminToken: "secret",
+			molEnv:     "production",
+			want:       "10.0.5.7",
+		},
+	}
+
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			t.Setenv("BIND_ADDR", tc.bindAddr)
+			t.Setenv("ADMIN_TOKEN", tc.adminToken)
+			t.Setenv("MOLECULE_ENV", tc.molEnv)
+			got := resolveBindHost()
+			if got != tc.want {
+				t.Errorf("resolveBindHost() = %q, want %q (BIND_ADDR=%q ADMIN_TOKEN=%q MOLECULE_ENV=%q)",
+					got, tc.want, tc.bindAddr, tc.adminToken, tc.molEnv)
+			}
+		})
+	}
+}
@@ -19,6 +19,7 @@ import (
 	"github.com/Molecule-AI/molecule-monorepo/platform/internal/handlers"
 	"github.com/Molecule-AI/molecule-monorepo/platform/internal/imagewatch"
 	memwiring "github.com/Molecule-AI/molecule-monorepo/platform/internal/memory/wiring"
+	"github.com/Molecule-AI/molecule-monorepo/platform/internal/middleware"
 	"github.com/Molecule-AI/molecule-monorepo/platform/internal/pendinguploads"
 	"github.com/Molecule-AI/molecule-monorepo/platform/internal/provisioner"
 	"github.com/Molecule-AI/molecule-monorepo/platform/internal/registry"
@@ -332,15 +333,23 @@ func main() {
 	// Router
 	r := router.Setup(hub, broadcaster, prov, platformURL, configsDir, wh, channelMgr, memBundle)

-	// HTTP server with graceful shutdown
+	// HTTP server with graceful shutdown.
+	//
+	// Bind host: in dev-mode (no ADMIN_TOKEN, MOLECULE_ENV=dev|development)
+	// the AdminAuth chain fails open by design; pairing that with a wildcard
+	// bind would expose unauthenticated /workspaces to any same-LAN peer. Default to
+	// loopback when fail-open is active. Operators who need LAN exposure set
+	// BIND_ADDR=0.0.0.0 explicitly. Production (ADMIN_TOKEN set) is unchanged.
+	// See molecule-core#7.
+	bindHost := resolveBindHost()
 	srv := &http.Server{
-		Addr:    fmt.Sprintf(":%s", port),
+		Addr:    fmt.Sprintf("%s:%s", bindHost, port),
 		Handler: r,
 	}

 	// Start server in goroutine
 	go func() {
-		log.Printf("Platform starting on :%s", port)
+		log.Printf("Platform starting on %s:%s (dev-mode-fail-open=%v)", bindHost, port, middleware.IsDevModeFailOpen())
 		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
 			log.Fatalf("Server failed: %v", err)
 		}
@@ -375,6 +384,29 @@ func envOr(key, fallback string) string {
 	return fallback
 }

+// resolveBindHost picks the listener interface for the HTTP server.
+//
+// Precedence:
+//  1. BIND_ADDR — explicit operator override (any value, including "0.0.0.0").
+//  2. dev-mode fail-open active → "127.0.0.1" (loopback only).
+//  3. otherwise → "" (Go binds every interface; existing prod/self-host shape).
+//
+// Coupling the loopback default to middleware.IsDevModeFailOpen() means the
+// two safety levers — bind narrowness and auth strength — move together. A
+// production deploy (ADMIN_TOKEN set) keeps binding to all interfaces because
+// the auth chain is doing its job; a dev Mac (no ADMIN_TOKEN, MOLECULE_ENV=dev)
+// is reachable only via loopback because the auth chain is fail-open. See
+// molecule-core#7 for the original LAN exposure finding.
+func resolveBindHost() string {
+	if v := os.Getenv("BIND_ADDR"); v != "" {
+		return v
+	}
+	if middleware.IsDevModeFailOpen() {
+		return "127.0.0.1"
+	}
+	return ""
+}
+
 func findConfigsDir() string {
 	candidates := []string{
 		"workspace-configs-templates",
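Reviewer sketch (not part of the patch): the three-step precedence documented on resolveBindHost can be exercised as a pure function. The names pickBindHost and listenAddr are hypothetical, and the devFailOpen flag stands in for middleware.IsDevModeFailOpen(). The sketch builds the listen address with net.JoinHostPort rather than fmt.Sprintf("%s:%s", ...), on the assumption that an operator might set BIND_ADDR to an IPv6 literal such as ::1, which needs brackets in a listen address.

```go
package main

import (
	"fmt"
	"net"
)

// pickBindHost mirrors the documented precedence (hypothetical helper):
//  1. an explicit BIND_ADDR value wins,
//  2. otherwise loopback when dev-mode fail-open is active,
//  3. otherwise "" so Go binds every interface.
func pickBindHost(bindAddr string, devFailOpen bool) string {
	if bindAddr != "" {
		return bindAddr
	}
	if devFailOpen {
		return "127.0.0.1"
	}
	return ""
}

// listenAddr joins host and port; JoinHostPort brackets IPv6 literals.
func listenAddr(host, port string) string {
	return net.JoinHostPort(host, port)
}

func main() {
	fmt.Println(listenAddr(pickBindHost("", true), "8080"))    // 127.0.0.1:8080
	fmt.Println(listenAddr(pickBindHost("", false), "8080"))   // :8080
	fmt.Println(listenAddr(pickBindHost("::1", true), "8080")) // [::1]:8080
}
```

With an empty host the result is ":8080", which is the same wildcard shape the server used before this change.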
@@ -413,11 +413,56 @@ func (h *WorkspaceHandler) proxyA2ARequest(ctx context.Context, workspaceID stri
 		return http.StatusOK, respBody, nil
 	}

+	// Mock-runtime short-circuit. Workspaces with runtime='mock' have
+	// no container, no EC2, no URL — every reply is synthesised here
+	// from a small canned-variant pool. Built for the "200-workspace
+	// mock org" demo: a CEO/VPs/Managers/ICs hierarchy that renders
+	// at scale on the canvas without burning real LLM credits or
+	// provisioning 200 EC2 instances. See mock_runtime.go for the
+	// full rationale + reply shape contract.
+	//
+	// Position: AFTER poll-mode (mock isn't a delivery mode, it's a
+	// runtime; treating poll-set-on-mock as poll matches operator
+	// intent if anyone ever does that), BEFORE resolveAgentURL (mock
+	// has no URL — going through resolveAgentURL would 404 on the
+	// SELECT url since the row is provisioned as NULL).
+	if status, respBody, handled := h.handleMockA2A(ctx, workspaceID, callerID, body, a2aMethod, logActivity); handled {
+		return status, respBody, nil
+	}
+
 	agentURL, proxyErr := h.resolveAgentURL(ctx, workspaceID)
 	if proxyErr != nil {
 		return 0, nil, proxyErr
 	}

+	// Pre-flight container-health check (#36). The dispatchA2A path below
+	// does Docker-DNS forwarding to `ws-<wsShort>:8000` and only catches a
+	// missing/dead container REACTIVELY via maybeMarkContainerDead in
+	// handleA2ADispatchError. That works but costs the caller a full
+	// network-timeout (2-30s) before the structured 503 surfaces.
+	//
+	// When we KNOW the workspace is container-backed (h.provisioner != nil,
+	// platformInDocker, and agentURL already in Docker-DNS form), do a single
+	// proactive Provisioner.IsRunning check (it wraps RunningContainerName).
+	// If the container is genuinely missing, short-circuit with the same
+	// structured 503 + async restart that maybeMarkContainerDead would
+	// produce, but immediately, without the network round-trip.
+	//
+	// Three outcomes of provisioner.RunningContainerName(ctx, h.docker, id):
+	//   ("ws-<id>", nil) → forward as today.
+	//   ("",        nil) → container is genuinely not running. Fast-503.
+	//   ("",        err) → transient daemon error. Fall through to optimistic
+	//                       forward — matches Provisioner.IsRunning's
+	//                       (true, err) "fail-soft as alive" contract.
+	//
+	// Same SSOT as findRunningContainer (#10/#12). See AST gate
+	// TestProxyA2A_Preflight_RoutesThroughProvisionerSSOT.
+	if h.provisioner != nil && platformInDocker && strings.HasPrefix(agentURL, "http://"+provisioner.ContainerName(workspaceID)+":") {
+		if proxyErr := h.preflightContainerHealth(ctx, workspaceID); proxyErr != nil {
+			return 0, nil, proxyErr
+		}
+	}
+
 	startTime := time.Now()
 	resp, cancelFwd, err := h.dispatchA2A(ctx, workspaceID, agentURL, body, callerID)
 	if cancelFwd != nil {
@@ -198,6 +198,60 @@ func (h *WorkspaceHandler) maybeMarkContainerDead(ctx context.Context, workspace
 	return true
 }

+// preflightContainerHealth runs a proactive Provisioner.IsRunning check
+// (#36) before dispatching the a2a forward. Routed through provisioner's
+// SSOT IsRunning, which itself wraps RunningContainerName — same source
+// as findRunningContainer in the plugins handler (#10/#12).
+//
+// Returns nil when the forward should proceed:
+//   - container is running, OR
+//   - daemon errored transiently (matches IsRunning's (true, err)
+//     "fail-soft as alive" contract — let the optimistic forward run
+//     and reactive maybeMarkContainerDead catch a real failure).
+//
+// Returns a structured 503 + triggers the same async restart that
+// maybeMarkContainerDead would produce, when:
+//   - container is genuinely not running (NotFound / Exited / Created…).
+//
+// The point of running this BEFORE the forward is to save the caller
+// 2-30s of network-timeout cost when the container is missing — a common
+// shape post-EC2-replace (see molecule-controlplane#20 incident
+// 2026-05-07) where the reconciler hasn't respawned the agent yet.
+func (h *WorkspaceHandler) preflightContainerHealth(ctx context.Context, workspaceID string) *proxyA2AError {
+	running, err := h.provisioner.IsRunning(ctx, workspaceID)
+	if err != nil {
+		// Transient daemon error. Provisioner.IsRunning returns (true, err)
+		// in this case — fall through to the optimistic forward, reactive
+		// maybeMarkContainerDead handles a real failure later.
+		log.Printf("ProxyA2A preflight: IsRunning transient error for %s: %v (proceeding with forward)", workspaceID, err)
+		return nil
+	}
+	if running {
+		// Container is running — forward as today.
+		return nil
+	}
+	// Container is genuinely not running. Mark offline + trigger restart
+	// (same effect as maybeMarkContainerDead's branch), and return the
+	// structured 503 immediately so the caller skips the forward.
+	log.Printf("ProxyA2A preflight: container for %s is not running — marking offline and triggering restart (#36)", workspaceID)
+	if _, dbErr := db.DB.ExecContext(ctx,
+		`UPDATE workspaces SET status = $1, updated_at = now() WHERE id = $2 AND status NOT IN ('removed', 'provisioning')`,
+		models.StatusOffline, workspaceID); dbErr != nil {
+		log.Printf("ProxyA2A preflight: failed to mark workspace %s offline: %v", workspaceID, dbErr)
+	}
+	db.ClearWorkspaceKeys(ctx, workspaceID)
+	h.broadcaster.RecordAndBroadcast(ctx, string(events.EventWorkspaceOffline), workspaceID, map[string]interface{}{})
+	go h.RestartByID(workspaceID)
+	return &proxyA2AError{
+		Status: http.StatusServiceUnavailable,
+		Response: gin.H{
+			"error":      "workspace container not running — restart triggered",
+			"restarting": true,
+			"preflight":  true, // distinguishes from reactive containerDead path
+		},
+	}
+}
+
 // logA2AFailure records a failed A2A attempt to activity_logs in a detached
 // goroutine (the request context may already be done by the time it runs).
 func (h *WorkspaceHandler) logA2AFailure(ctx context.Context, workspaceID, callerID string, body []byte, a2aMethod string, err error, durationMs int) {
@@ -0,0 +1,194 @@
+package handlers
+
+import (
+	"context"
+	"errors"
+	"go/ast"
+	"go/parser"
+	"go/token"
+	"testing"
+
+	"github.com/DATA-DOG/go-sqlmock"
+	"github.com/Molecule-AI/molecule-monorepo/platform/internal/models"
+	"github.com/Molecule-AI/molecule-monorepo/platform/internal/provisioner"
+)
+
+// preflightLocalProv is a controllable LocalProvisionerAPI stub for the
+// preflight tests (#36). Other API methods panic to guard against tests
+// that should be using a different stub.
+type preflightLocalProv struct {
+	running    bool
+	err        error
+	calls      int
+	calledWith []string
+}
+
+func (p *preflightLocalProv) IsRunning(_ context.Context, workspaceID string) (bool, error) {
+	p.calls++
+	p.calledWith = append(p.calledWith, workspaceID)
+	return p.running, p.err
+}
+func (p *preflightLocalProv) Start(_ context.Context, _ provisioner.WorkspaceConfig) (string, error) {
+	panic("preflightLocalProv: Start not implemented")
+}
+func (p *preflightLocalProv) Stop(_ context.Context, _ string) error {
+	panic("preflightLocalProv: Stop not implemented")
+}
+func (p *preflightLocalProv) ExecRead(_ context.Context, _, _ string) ([]byte, error) {
+	panic("preflightLocalProv: ExecRead not implemented")
+}
+func (p *preflightLocalProv) RemoveVolume(_ context.Context, _ string) error {
+	panic("preflightLocalProv: RemoveVolume not implemented")
+}
+func (p *preflightLocalProv) VolumeHasFile(_ context.Context, _, _ string) (bool, error) {
+	panic("preflightLocalProv: VolumeHasFile not implemented")
+}
+func (p *preflightLocalProv) WriteAuthTokenToVolume(_ context.Context, _, _ string) error {
+	panic("preflightLocalProv: WriteAuthTokenToVolume not implemented")
+}
+
+// TestPreflight_ContainerRunning_ReturnsNil — IsRunning(true,nil): forward
+// proceeds. preflight returns nil → caller continues to dispatchA2A.
+func TestPreflight_ContainerRunning_ReturnsNil(t *testing.T) {
+	_ = setupTestDB(t)
+	stub := &preflightLocalProv{running: true, err: nil}
+	h := NewWorkspaceHandler(newTestBroadcaster(), nil, "http://localhost:8080", t.TempDir())
+	h.provisioner = stub
+
+	if err := h.preflightContainerHealth(context.Background(), "ws-running-123"); err != nil {
+		t.Fatalf("preflight should return nil when container running, got %+v", err)
+	}
+	if stub.calls != 1 {
+		t.Errorf("IsRunning should be called exactly once, got %d", stub.calls)
+	}
+	if len(stub.calledWith) != 1 || stub.calledWith[0] != "ws-running-123" {
+		t.Errorf("IsRunning should be called with workspace id, got %v", stub.calledWith)
+	}
+}
+
+// TestPreflight_ContainerNotRunning_StructuredFastFail — IsRunning(false,nil):
+// preflight returns structured 503 with restarting=true + preflight=true, AND
+// triggers the offline-flip + WORKSPACE_OFFLINE broadcast + async restart.
+// This is the load-bearing case — saves the caller 2-30s of network timeout.
+func TestPreflight_ContainerNotRunning_StructuredFastFail(t *testing.T) {
+	mock := setupTestDB(t)
+	_ = setupTestRedis(t)
+	stub := &preflightLocalProv{running: false, err: nil}
+	h := NewWorkspaceHandler(newTestBroadcaster(), nil, "http://localhost:8080", t.TempDir())
+	h.provisioner = stub
+
+	// Expect the offline-flip UPDATE.
+	mock.ExpectExec(`UPDATE workspaces SET status =`).
+		WithArgs(models.StatusOffline, "ws-dead-456").
+		WillReturnResult(sqlmock.NewResult(0, 1))
+	// Broadcaster's INSERT INTO structure_events fires too — best-effort
+	// log entry for the WORKSPACE_OFFLINE event. Match permissively.
+	mock.ExpectExec(`INSERT INTO structure_events`).
+		WillReturnResult(sqlmock.NewResult(0, 1))
+
+	proxyErr := h.preflightContainerHealth(context.Background(), "ws-dead-456")
+	if proxyErr == nil {
+		t.Fatal("preflight should return *proxyA2AError when container not running")
+	}
+	if proxyErr.Status != 503 {
+		t.Errorf("expected 503, got %d", proxyErr.Status)
+	}
+	if got := proxyErr.Response["restarting"]; got != true {
+		t.Errorf("response should mark restarting=true, got %v", got)
+	}
+	if got := proxyErr.Response["preflight"]; got != true {
+		t.Errorf("response should mark preflight=true so callers can distinguish from reactive containerDead, got %v", got)
+	}
+	if got := proxyErr.Response["error"]; got != "workspace container not running — restart triggered" {
+		t.Errorf("error message mismatch, got %q", got)
+	}
+
+	// Note: broadcaster firing is exercised by the production path's
+	// h.broadcaster.RecordAndBroadcast call but not asserted here — the
+	// real *events.Broadcaster doesn't expose received events for inspection.
+	// The DB UPDATE expectation is sufficient to pin the offline-flip path.
+}
+
+// TestPreflight_TransientError_FailsSoftAsAlive — IsRunning(true,err): the
+// (true, err) "fail-soft" contract — preflight returns nil so the optimistic
+// forward runs; reactive maybeMarkContainerDead handles a real failure later.
+// This pin is critical: a flaky daemon must NOT trigger a restart cascade.
+func TestPreflight_TransientError_FailsSoftAsAlive(t *testing.T) {
+	_ = setupTestDB(t)
+	stub := &preflightLocalProv{running: true, err: errors.New("docker daemon EOF")}
+	h := NewWorkspaceHandler(newTestBroadcaster(), nil, "http://localhost:8080", t.TempDir())
+	h.provisioner = stub
+
+	if err := h.preflightContainerHealth(context.Background(), "ws-flaky-789"); err != nil {
+		t.Fatalf("preflight should return nil on transient error (fail-soft), got %+v", err)
+	}
+	// No DB UPDATE expected — sqlmock would complain about unexpected calls
+	// at test cleanup if the offline-flip path fired.
+}
+
+// TestProxyA2A_Preflight_RoutesThroughProvisionerSSOT — AST gate (#36 mirror
+// of #12's gate). Pins the invariant that preflightContainerHealth uses the
+// SSOT Provisioner.IsRunning helper, NOT a parallel docker.ContainerInspect
+// of its own.
+//
+// Mutation invariant: if a future PR replaces h.provisioner.IsRunning with
+// a direct cli.ContainerInspect call, this test fails. That's the signal to
+// either (a) extend Provisioner.IsRunning's contract OR (b) document why
+// this call site needs to differ. Either way, the drift gets a reviewer's
+// attention instead of shipping silently.
+func TestProxyA2A_Preflight_RoutesThroughProvisionerSSOT(t *testing.T) {
+	fset := token.NewFileSet()
+	file, err := parser.ParseFile(fset, "a2a_proxy_helpers.go", nil, parser.ParseComments)
+	if err != nil {
+		t.Fatalf("parse a2a_proxy_helpers.go: %v", err)
+	}
+
+	var fn *ast.FuncDecl
+	ast.Inspect(file, func(n ast.Node) bool {
+		f, ok := n.(*ast.FuncDecl)
+		if !ok || f.Name.Name != "preflightContainerHealth" {
+			return true
+		}
+		fn = f
+		return false
+	})
+	if fn == nil {
+		t.Fatal("preflightContainerHealth not found — was it renamed? update this gate or the SSOT routing assumption")
+	}
+
+	var (
+		callsIsRunning                  bool
+		callsContainerInspectRaw        bool
+		callsRunningContainerNameDirect bool
+	)
+	ast.Inspect(fn.Body, func(n ast.Node) bool {
+		call, ok := n.(*ast.CallExpr)
+		if !ok {
+			return true
+		}
+		sel, ok := call.Fun.(*ast.SelectorExpr)
+		if !ok {
+			return true
+		}
+		switch sel.Sel.Name {
+		case "IsRunning":
+			callsIsRunning = true
+		case "ContainerInspect":
+			callsContainerInspectRaw = true
+		case "RunningContainerName":
+			// Direct RunningContainerName is also acceptable SSOT — but
+			// preferring IsRunning keeps the (bool, error) contract that
+			// already exists in the helper API surface.
+			callsRunningContainerNameDirect = true
+		}
+		return true
+	})
+
+	if !callsIsRunning && !callsRunningContainerNameDirect {
+		t.Errorf("preflightContainerHealth must call provisioner.IsRunning OR provisioner.RunningContainerName for the SSOT health check — see molecule-core#36. Found neither.")
+	}
+	if callsContainerInspectRaw {
+		t.Errorf("preflightContainerHealth carries a direct ContainerInspect call. This is the parallel-impl drift molecule-core#36 fixed. " +
+			"Either route through provisioner.IsRunning OR — if a new use case truly needs a different inspect — extend the helper's contract first and update this gate to allow the specific delta.")
+	}
+}
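Reviewer sketch (not part of the patch): the AST-gate technique used by the test above can be tried in isolation against an in-memory source string. The helper selectorCallsIn and the sample source are hypothetical. The parser does not type-check, so the undefined types in the sample are fine. Like the gate in the patch, this matches on selector names only, so it would not see a call made through a local alias or a function value.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// selectorCallsIn parses src and returns the set of selector names
// (x.Name(...)) called inside the body of the function named funcName.
func selectorCallsIn(src, funcName string) (map[string]bool, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "sketch.go", src, 0)
	if err != nil {
		return nil, err
	}
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Name.Name != funcName || fn.Body == nil {
			continue
		}
		calls := map[string]bool{}
		ast.Inspect(fn.Body, func(n ast.Node) bool {
			if call, ok := n.(*ast.CallExpr); ok {
				if sel, ok := call.Fun.(*ast.SelectorExpr); ok {
					calls[sel.Sel.Name] = true
				}
			}
			return true
		})
		return calls, nil
	}
	return nil, fmt.Errorf("function %q not found", funcName)
}

// sampleSrc is a made-up file with one SSOT-style call and one raw call.
const sampleSrc = `package sample

func preflight(p prov, id string) bool {
	ok, _ := p.IsRunning(id)
	return ok
}

func other(c cli, id string) {
	c.ContainerInspect(id)
}
`

func main() {
	calls, err := selectorCallsIn(sampleSrc, "preflight")
	if err != nil {
		panic(err)
	}
	fmt.Println(calls["IsRunning"], calls["ContainerInspect"]) // true false
}
```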
@@ -108,6 +108,18 @@ type eicTunnelPool struct {
 	// First acquirer takes the slot; later ones wait on the channel.
 	pendingSetups map[string]chan struct{}
 	stopJanitor   chan struct{}
+	// janitorInterval is captured at pool construction from the
+	// package-level poolJanitorInterval var. Captured (not re-read on
+	// every tick) so a test that swaps the package var via t.Cleanup
+	// after a global pool's janitor is already running can't race
+	// with that goroutine's ticker read. The global pool is created
+	// lazily once per process via sync.Once; before this capture
+	// landed, every test that touched poolJanitorInterval after the
+	// global pool's first-touch raced the janitor (caught by -race
+	// on staging tip 249dbc6a — TestPooledWithEICTunnel_PanicPoisonsEntry).
+	// Tests still get the new value on a freshPool() because they
+	// set the package var BEFORE calling newEICTunnelPool().
+	janitorInterval time.Duration
 }

 var (
@@ -127,11 +139,16 @@ func getEICTunnelPool() *eicTunnelPool {

 // newEICTunnelPool constructs an empty pool. Exported so tests can
 // build isolated pools without sharing the singleton.
+//
+// Captures poolJanitorInterval at construction time so the janitor
+// goroutine doesn't race with t.Cleanup-driven swaps of the package
+// var. See the janitorInterval field comment for the failure mode.
 func newEICTunnelPool() *eicTunnelPool {
 	return &eicTunnelPool{
-		entries:       map[string]*pooledTunnel{},
-		pendingSetups: map[string]chan struct{}{},
-		stopJanitor:   make(chan struct{}),
+		entries:         map[string]*pooledTunnel{},
+		pendingSetups:   map[string]chan struct{}{},
+		stopJanitor:     make(chan struct{}),
+		janitorInterval: poolJanitorInterval,
 	}
 }

@@ -290,8 +307,11 @@ func (p *eicTunnelPool) evictLRUIfFullLocked(skipInstance string) {
 // janitor periodically scans for entries that are idle AND expired,
 // closing their tunnels. Runs forever (per pool lifetime); cancelled
 // by close(p.stopJanitor) for tests that build short-lived pools.
+//
+// Reads p.janitorInterval (captured at construction) instead of the
+// package-level poolJanitorInterval — see janitorInterval field comment.
 func (p *eicTunnelPool) janitor() {
-	t := time.NewTicker(poolJanitorInterval)
+	t := time.NewTicker(p.janitorInterval)
 	defer t.Stop()
 	for {
 		select {
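Reviewer sketch (not part of the patch): the capture-at-construction pattern described in the janitorInterval comment, reduced to a standalone program. The names janitorEvery, pool, and newPool are hypothetical stand-ins for poolJanitorInterval, eicTunnelPool, and newEICTunnelPool. The janitor goroutine reads only the field, so a later write to the package variable has no concurrent reader.

```go
package main

import (
	"fmt"
	"time"
)

// janitorEvery stands in for the package-level interval (hypothetical name).
var janitorEvery = 20 * time.Millisecond

type pool struct {
	interval time.Duration // captured once in newPool, never re-read from the var
	stop     chan struct{}
	done     chan struct{}
}

func newPool() *pool {
	return &pool{
		interval: janitorEvery,
		stop:     make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// janitor ticks on the captured interval until stop is closed.
func (p *pool) janitor() {
	defer close(p.done)
	t := time.NewTicker(p.interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			// an eviction scan would run here
		case <-p.stop:
			return
		}
	}
}

func main() {
	p := newPool()
	go p.janitor()
	janitorEvery = time.Hour // later swap; the running janitor never reads this var
	fmt.Println(p.interval)  // 20ms
	close(p.stop)
	<-p.done
}
```

A pool built after the swap would pick up the new value, which matches the note that tests set the package variable before constructing a fresh pool.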
@@ -0,0 +1,375 @@
+package handlers
+
+import (
+	"archive/tar"
+	"bytes"
+	"net"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strings"
+	"testing"
+	"time"
+
+	"gopkg.in/yaml.v3"
+)
+
+// Local E2E for the dev-department extraction (RFC internal#77).
+//
+// Pre-conditions: both repos cloned as siblings under
+// /tmp/local-e2e-deploy/{molecule-dev, molecule-dev-department}.
+// (Set up by the orchestrator before running this test.)
+//
+// What this proves end-to-end through real platform code:
+//   1. resolveYAMLIncludes follows the dev-lead symlink at the parent's
+//      template root and pulls in the dev-department subtree.
+//   2. Recursive !include's inside the symlinked subtree resolve
+//      correctly via the chain dev-lead/workspace.yaml →
+//      ./core-lead/workspace.yaml → ./core-be/workspace.yaml etc.
+//   3. The resolved YAML unmarshals into a complete OrgTemplate with the
+//      expected count of workspaces (parent's PM+Marketing+Research +
+//      dev-department's atomized 28 workspaces).
+//
+// Skipped if the local-e2e-deploy fixture isn't present — won't block
+// CI on hosts that haven't set it up.
+func TestLocalE2E_DevDepartmentExtraction(t *testing.T) {
+	parent := "/tmp/local-e2e-deploy/molecule-dev"
+	if _, err := os.Stat(filepath.Join(parent, "org.yaml")); err != nil {
+		t.Skipf("local-e2e fixture not present at %s: %v", parent, err)
+	}
+
+	orgYAML, err := os.ReadFile(filepath.Join(parent, "org.yaml"))
+	if err != nil {
+		t.Fatalf("read org.yaml: %v", err)
+	}
+
+	expanded, err := resolveYAMLIncludes(orgYAML, parent)
+	if err != nil {
+		t.Fatalf("resolveYAMLIncludes failed: %v", err)
+	}
+
+	var tmpl OrgTemplate
+	if err := yaml.Unmarshal(expanded, &tmpl); err != nil {
+		t.Fatalf("unmarshal expanded OrgTemplate: %v", err)
+	}
+
+	// Walk the full workspace tree, collect names.
+	names := []string{}
+	var walk func([]OrgWorkspace)
+	walk = func(ws []OrgWorkspace) {
+		for _, w := range ws {
+			names = append(names, w.Name)
+			walk(w.Children)
+		}
+	}
+	walk(tmpl.Workspaces)
+
+	t.Logf("org name: %q", tmpl.Name)
+	t.Logf("total workspaces (recursive): %d", len(names))
+	for _, n := range names {
+		t.Logf("  - %q", n)
+	}
+
+	// Expected: PM + Marketing Lead + Dev Lead at top level, plus the
+	// full sub-trees under each. After atomization, we expect:
+	//   - PM tree: PM + Research Lead + 3 research roles = 5
+	//   - Marketing tree: Marketing Lead + 5 marketing roles = 6
+	//   - Dev Lead tree: Dev Lead + (5 sub-team leads × ~6 each) +
+	//     3 floaters + Triage Operator = ~32
+	// Roughly ~43 total. Be liberal; just assert a floor.
+	if len(names) < 30 {
+		t.Errorf("workspace count too low (%d) — expected ~40+ (PM+Marketing+Dev tree)", len(names))
+	}
+
+	// Specific sentinel names we expect to find:
+	expected := []string{
+		"PM",
+		"Marketing Lead",
+		"Dev Lead",
+		"Core Platform Lead",
+		"Controlplane Lead",
+		"App & Docs Lead",
+		"Infra Lead",
+		"SDK Lead",
+		"Documentation Specialist", // Q1 — should be under app-lead
+		"Triage Operator",          // Q2 — should be under dev-lead
+	}
+	found := map[string]bool{}
+	for _, n := range names {
+		found[n] = true
+	}
+	for _, want := range expected {
+		if !found[want] {
+			t.Errorf("missing expected workspace %q", want)
+		}
+	}
+}
+
+// Stage-2 of the local e2e: prove every resolved workspace's `files_dir`
+// path actually consumes correctly through the rest of the import chain.
+// resolveYAMLIncludes returning a populated OrgTemplate is necessary but
+// not sufficient — `POST /org/import` then does:
+//
+//   1. resolveInsideRoot(orgBaseDir, ws.FilesDir) → must return a path
+//      that exists and stat-resolves to a directory (org_import.go:313-317).
+//   2. CopyTemplateToContainer(ctx, containerID, templatePath) → walks
+//      the dir with filepath.Walk and tars its contents into the
+//      workspace's /configs/ mount (provisioner.go:766-820).
+//
+// This stage-2 test exercises both #1 and #2 against every workspace in
+// the resolved tree, mimicking what the platform does post-include-
+// resolution. Catches: files_dir paths that don't resolve through the
+// symlink, paths that exist but are empty (silently produces empty
+// /configs/), or filepath.Walk failing to descend through cross-repo
+// symlink boundaries.
+func TestLocalE2E_FilesDirConsumption(t *testing.T) {
+	parent := "/tmp/local-e2e-deploy/molecule-dev"
+	if _, err := os.Stat(filepath.Join(parent, "org.yaml")); err != nil {
+		t.Skipf("local-e2e fixture not present at %s: %v", parent, err)
+	}
+
+	orgYAML, err := os.ReadFile(filepath.Join(parent, "org.yaml"))
+	if err != nil {
+		t.Fatalf("read org.yaml: %v", err)
+	}
+	expanded, err := resolveYAMLIncludes(orgYAML, parent)
+	if err != nil {
+		t.Fatalf("resolveYAMLIncludes: %v", err)
+	}
+	var tmpl OrgTemplate
+	if err := yaml.Unmarshal(expanded, &tmpl); err != nil {
+		t.Fatalf("unmarshal: %v", err)
+	}
+
+	// Flatten every workspace — including children, grandchildren, etc.
+	flat := []OrgWorkspace{}
+	var walk func([]OrgWorkspace)
+	walk = func(ws []OrgWorkspace) {
+		for _, w := range ws {
+			flat = append(flat, w)
+			walk(w.Children)
+		}
+	}
+	walk(tmpl.Workspaces)
+
+	checked := 0
+	for _, w := range flat {
+		if w.FilesDir == "" {
+			continue // workspace declared inline (no files_dir) — skip
+		}
+		checked++
+		t.Run(w.Name+"/"+w.FilesDir, func(t *testing.T) {
+			// Step 1: resolveInsideRoot returns a path that's-inside-root.
+			abs, err := resolveInsideRoot(parent, w.FilesDir)
+			if err != nil {
+				t.Fatalf("resolveInsideRoot(%q, %q): %v", parent, w.FilesDir, err)
+			}
+			info, err := os.Stat(abs)
+			if err != nil {
+				t.Fatalf("stat %q (resolved from files_dir %q): %v", abs, w.FilesDir, err)
+			}
+			if !info.IsDir() {
+				t.Fatalf("files_dir %q resolved to %q which is not a directory", w.FilesDir, abs)
+			}
+
+			// Step 2: walk the dir like CopyTemplateToContainer does.
+			// Mirror the platform's symlink-resolution at the root —
+			// filepath.Walk doesn't descend into a symlink leaf, so
+			// CopyTemplateToContainer (provisioner.go) calls
+			// EvalSymlinks on templatePath first. Replicate exactly.
+			if resolved, err := filepath.EvalSymlinks(abs); err == nil {
+				abs = resolved
+			}
+			var buf bytes.Buffer
+			tw := tar.NewWriter(&buf)
+			fileCount := 0
+			fileNames := []string{}
+			err = filepath.Walk(abs, func(path string, info os.FileInfo, err error) error {
+				if err != nil {
+					return err
+				}
+				rel, err := filepath.Rel(abs, path)
+				if err != nil {
+					return err
+				}
+				if rel == "." {
+					return nil
+				}
+				header, _ := tar.FileInfoHeader(info, "")
+				header.Name = rel
+				if err := tw.WriteHeader(header); err != nil {
+					return err
+				}
+				if !info.IsDir() {
+					fileCount++
+					fileNames = append(fileNames, rel)
+					data, err := os.ReadFile(path)
+					if err != nil {
+						return err
+					}
+					// header.Size was set by FileInfoHeader; only counts are asserted.
+					_, _ = tw.Write(data)
+				}
+				return nil
+			})
+			if err != nil {
+				t.Fatalf("filepath.Walk %q (mimics CopyTemplateToContainer): %v", abs, err)
+			}
+			tw.Close()
+
+			if fileCount == 0 {
+				t.Errorf("files_dir %q at %q is empty — CopyTemplateToContainer would produce empty /configs/",
+					w.FilesDir, abs)
+			}
+
+			// Sanity: every workspace folder should have AT LEAST one of
+			// {workspace.yaml, system-prompt.md, initial-prompt.md} —
+			// these are the markers a workspace folder is recognizable
+			// as a workspace (mirrors validator's WORKSPACE_FOLDER_MARKERS).
+			markers := []string{"workspace.yaml", "system-prompt.md", "initial-prompt.md"}
+			hasMarker := false
+			for _, name := range fileNames {
+				for _, m := range markers {
+					if name == m || strings.HasSuffix(name, "/"+m) {
+						hasMarker = true
+						break
+					}
+				}
+				if hasMarker {
+					break
+				}
+			}
+			if !hasMarker {
+				t.Errorf("files_dir %q at %q has %d files but none of the workspace markers %v — found: %v",
+					w.FilesDir, abs, fileCount, markers, fileNames)
+			}
+		})
+	}
+	t.Logf("checked %d workspaces with files_dir", checked)
+	if checked < 25 {
+		t.Errorf("expected ~28 workspaces with files_dir (post-atomization); only saw %d", checked)
+	}
+}
+
+// PR-C from the Phase 3a phasing (task #234): real-Gitea e2e for the
+// !external resolver against the LIVE molecule-ai/molecule-dev-department
+// repo. Verifies the production gitFetcher fetches the dev tree and the
+// resolver grafts it correctly into a parent template that has NO
+// symlink — composition is purely platform-side.
+//
+// Skipped if Gitea isn't reachable (offline / firewall / CI without
+// network). Requires `git` binary on PATH.
+func TestLocalE2E_ExternalDevDepartment(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skipf("git binary not found: %v", err)
+	}
+
+	// Skip if Gitea host isn't reachable (TCP probe). Avoids network-
+	// dependent tests failing on offline runners.
+	conn, err := net.DialTimeout("tcp", "git.moleculesai.app:443", 3*time.Second)
+	if err != nil {
+		t.Skipf("git.moleculesai.app:443 unreachable: %v", err)
+	}
+	conn.Close()
+
+	// Build a minimal parent template inline — no need for the
+	// /tmp/local-e2e-deploy/ symlinked fixture. The whole point of
+	// !external is that the parent template is self-contained;
+	// composition resolves over the network at import time.
+	parent := t.TempDir()
+
+	orgYAML := []byte(`name: External-Only Test Parent
+description: Parent template that pulls the entire dev tree via !external.
+defaults:
+  runtime: claude-code
+  tier: 2
+workspaces:
+  - !external
+    repo: molecule-ai/molecule-dev-department
+    ref: main
+    path: dev-lead/workspace.yaml
+`)
+	if err := os.WriteFile(filepath.Join(parent, "org.yaml"), orgYAML, 0o644); err != nil {
+		t.Fatalf("write org.yaml: %v", err)
+	}
+
+	out, err := resolveYAMLIncludes(orgYAML, parent)
+	if err != nil {
+		t.Fatalf("resolveYAMLIncludes (!external against live Gitea): %v", err)
+	}
+
+	var tmpl OrgTemplate
+	if err := yaml.Unmarshal(out, &tmpl); err != nil {
+		t.Fatalf("unmarshal: %v", err)
+	}
+
+	// Walk the workspace tree, collect names + check files_dir paths.
+	flat := []OrgWorkspace{}
+	var walk func([]OrgWorkspace)
+	walk = func(ws []OrgWorkspace) {
+		for _, w := range ws {
+			flat = append(flat, w)
+			walk(w.Children)
+		}
+	}
+	walk(tmpl.Workspaces)
+
+	t.Logf("workspaces resolved through !external: %d", len(flat))
+	if len(flat) < 25 {
+		t.Errorf("expected ~28 dev-tree workspaces via !external; got %d", len(flat))
+	}
+
+	// Sentinel checks — same as TestLocalE2E_DevDepartmentExtraction
+	// (Q1+Q2 placements verified).
+	expected := []string{
+		"Dev Lead",
+		"Core Platform Lead",
+		"Controlplane Lead",
+		"App & Docs Lead",
+		"Documentation Specialist", // Q1
+		"Triage Operator",          // Q2
+	}
+	found := map[string]bool{}
+	for _, w := range flat {
+		found[w.Name] = true
+	}
+	for _, want := range expected {
+		if !found[want] {
+			t.Errorf("missing expected workspace %q", want)
+		}
+	}
+
+	// Every workspace's files_dir must be cache-prefixed (proves the
+	// path-rewrite ran end-to-end).
+	cachePrefix := ".external-cache"
+	for _, w := range flat {
+		if w.FilesDir == "" {
+			continue
+		}
+		if !strings.HasPrefix(w.FilesDir, cachePrefix) {
+			t.Errorf("workspace %q files_dir %q missing cache prefix %q", w.Name, w.FilesDir, cachePrefix)
+		}
+	}
+
+	// Verify the fetched cache exists and resolveInsideRoot accepts
+	// every workspace's files_dir (would cause provisioning to fail
+	// if not).
+	for _, w := range flat {
+		if w.FilesDir == "" {
+			continue
+		}
+		abs, err := resolveInsideRoot(parent, w.FilesDir)
+		if err != nil {
+			t.Errorf("workspace %q files_dir %q: resolveInsideRoot: %v", w.Name, w.FilesDir, err)
+			continue
+		}
+		info, err := os.Stat(abs)
+		if err != nil {
+			t.Errorf("workspace %q: stat %q: %v", w.Name, abs, err)
+			continue
+		}
+		if !info.IsDir() {
+			t.Errorf("workspace %q files_dir %q is not a directory", w.Name, w.FilesDir)
+		}
+	}
+}
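Reviewer sketch (not part of the patch): a stricter version of the walk-and-tar loop that TestLocalE2E_FilesDirConsumption mimics. It relies on two archive/tar facts: the size must be in the header before WriteHeader is called, and Write returns an error if the body exceeds that size. The helpers tarDir and demoTree are hypothetical, and skipping symlinks and other special files is a choice made for this sketch, not a statement about what CopyTemplateToContainer does.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// tarDir archives directories and regular files under root and returns
// the archive plus the number of regular files written.
func tarDir(root string) (*bytes.Buffer, int, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	files := 0
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil || rel == "." {
			return err
		}
		if !info.IsDir() && !info.Mode().IsRegular() {
			return nil // sketch choice: skip symlinks and special files
		}
		hdr, err := tar.FileInfoHeader(info, "") // sets Size for regular files
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel)
		if err := tw.WriteHeader(hdr); err != nil || info.IsDir() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		if _, err := io.Copy(tw, f); err != nil {
			return err
		}
		files++
		return nil
	})
	if err != nil {
		return nil, 0, err
	}
	if err := tw.Close(); err != nil {
		return nil, 0, err
	}
	return &buf, files, nil
}

// demoTree builds a throwaway directory with two files, one nested.
func demoTree() (string, error) {
	root, err := os.MkdirTemp("", "tar-sketch-")
	if err != nil {
		return "", err
	}
	if err := os.MkdirAll(filepath.Join(root, "sub"), 0o755); err != nil {
		return "", err
	}
	for _, name := range []string{"workspace.yaml", filepath.Join("sub", "system-prompt.md")} {
		if err := os.WriteFile(filepath.Join(root, name), []byte("demo\n"), 0o644); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := demoTree()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	buf, files, err := tarDir(root)
	if err != nil {
		panic(err)
	}
	entries := 0
	for tr := tar.NewReader(buf); ; entries++ {
		if _, err := tr.Next(); err == io.EOF {
			break
		} else if err != nil {
			panic(err)
		}
	}
	fmt.Println(files, entries) // 2 files, 3 entries (one directory)
}
```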
@@ -0,0 +1,223 @@
+package handlers
+
+// mock_runtime.go — "mock" runtime: a virtual workspace that has no
+// container, no EC2, no LLM, just hardcoded canned A2A replies. Built
+// for the funding-demo "200-workspace mock org" so hongming can show
+// investors a CEO/VPs/Managers/ICs hierarchy at scale without burning
+// 200 EC2 instances or 200 Anthropic keys.
+//
+// Wire model:
+//   - org template declares `runtime: mock` on every workspace
+//   - createWorkspaceTree skips provisioning, sets status='online'
+//     directly (mirrors the `external` short-circuit, minus the URL +
+//     awaiting_agent dance)
+//   - proxyA2ARequest short-circuits on a mock-runtime target and
+//     returns a canned JSON-RPC reply; never calls resolveAgentURL,
+//     never opens an HTTP connection, never touches Docker/EC2
+//
+// The reply is JSON-RPC 2.0 + a2a-sdk v0.3 shape so the canvas's
+// extractAgentText / extractTextsFromParts read it without any
+// special-casing. We rotate over a small variant pool so a screen
+// full of replies doesn't all read identical — gives the demo a bit
+// of life without pretending to be a real agent.
+
+import (
+	"context"
+	"crypto/sha1"
+	"database/sql"
+	"encoding/binary"
+	"encoding/json"
+	"errors"
+	"fmt"
+	"log"
+	"net/http"
+	"strings"
+	"time"
+
+	"github.com/Molecule-AI/molecule-monorepo/platform/internal/db"
+	"github.com/gin-gonic/gin"
+	"github.com/google/uuid"
+)
+
+// MockRuntimeName is the canonical runtime string a workspace row
+// carries to opt into the canned-reply short-circuit. Kept as a const
+// so the proxy's runtime-check + the org-import skip-block reference
+// the same literal.
+const MockRuntimeName = "mock"
+
+// mockReplyVariants is the pool of canned strings the mock runtime
+// rotates through. Picked to read like a busy-but-short reply from a
+// real human in a hierarchy — a CEO would NOT respond with "On it!",
+// but for the demo every node is shown to be reachable, so we lean
+// into the variety. Variant selection is deterministic per
+// (workspaceID, request-id) pair so a screen recording replays the
+// same reply for the same input.
+var mockReplyVariants = []string{
+	"On it!",
+	"Got it, on it now.",
+	"On it, boss.",
+	"Working on it.",
+	"Acknowledged — on it.",
+	"On it, will report back.",
+	"Roger that, on it.",
+	"Copy that. On it.",
+	"On it — ETA shortly.",
+	"On it. Standby for update.",
+}
+
+// pickMockReply returns a canned reply for the given workspaceID +
+// requestID. Deterministic so the same (workspace, message-id) pair
+// always picks the same variant — useful for screen recordings and
+// flake-free e2e snapshots. Falls back to variant[0] only when both
+// inputs are empty.
+func pickMockReply(workspaceID, requestID string) string {
+	if len(mockReplyVariants) == 0 {
+		return "On it!"
+	}
+	if workspaceID == "" && requestID == "" {
+		return mockReplyVariants[0]
+	}
+	h := sha1.Sum([]byte(workspaceID + ":" + requestID))
+	idx := int(binary.BigEndian.Uint32(h[0:4]) % uint32(len(mockReplyVariants)))
+	return mockReplyVariants[idx]
+}
+
+// lookupRuntime returns the workspace's runtime string. Empty when the
+// row is missing / DB hiccup so callers fall through to the existing
+// dispatch path (which will then 404 / 502 normally). Fail-open here
+// because a transient DB error must not silently flip a real workspace
+// into mock-mode and start handing out canned replies in place of
+// genuine agent traffic.
+func lookupRuntime(ctx context.Context, workspaceID string) string {
+	var runtime sql.NullString
+	err := db.DB.QueryRowContext(ctx,
+		`SELECT runtime FROM workspaces WHERE id = $1`, workspaceID,
+	).Scan(&runtime)
+	if err != nil {
+		if !errors.Is(err, sql.ErrNoRows) {
+			log.Printf("ProxyA2A: lookupRuntime(%s) failed (%v) — falling through to dispatch path", workspaceID, err)
+		}
+		return ""
+	}
+	if !runtime.Valid {
+		return ""
+	}
+	return runtime.String
+}
+
+// buildMockA2AResponse synthesises a JSON-RPC 2.0 success envelope that
+// matches the a2a-sdk v0.3 reply shape the canvas's extractAgentText
+// already understands: `{result: {parts: [{kind: "text", text: ...}]}}`.
+// `requestID` is the JSON-RPC `id` of the inbound request — A2A
+// implementations echo it on the reply so callers can correlate. We
+// extract it from the normalized payload in the caller and pass it in
+// here so this function stays JSON-only (no payload parsing).
+//
+// Returns marshalled bytes ready to write straight to the HTTP body.
+// Marshal failure is logged + a tiny fallback envelope returned, since
+// failing the whole request because of a JSON encoding hiccup on a
+// constant-shaped payload would defeat the "mock always works" guarantee.
+func buildMockA2AResponse(workspaceID, requestID, replyText string) []byte {
+	if requestID == "" {
+		requestID = uuid.New().String()
+	}
+	envelope := map[string]any{
+		"jsonrpc": "2.0",
+		"id":      requestID,
+		"result": map[string]any{
+			"parts": []map[string]any{
+				{"kind": "text", "text": replyText},
+			},
+		},
+	}
+	out, err := json.Marshal(envelope)
+	if err != nil {
+		log.Printf("ProxyA2A: mock-runtime response marshal failed for %s: %v — emitting fallback", workspaceID, err)
+		// Hand-rolled minimal envelope. %q emits Go string-literal quoting,
+		// which is valid JSON for ordinary printable text; requestID is
+		// caller-supplied, so exotic bytes could still yield Go-only escapes.
+		fallback := fmt.Sprintf(
+			`{"jsonrpc":"2.0","id":%q,"result":{"parts":[{"kind":"text","text":%q}]}}`,
+			requestID, replyText,
+		)
+		return []byte(fallback)
+	}
+	return out
+}
+
+// extractRequestID pulls the JSON-RPC `id` out of an already-normalized
+// A2A payload. Returns "" when the field is absent or not a string —
+// caller substitutes a fresh UUID. Tolerant of every shape
+// normalizeA2APayload could produce.
+func extractRequestID(body []byte) string {
+	var top map[string]json.RawMessage
+	if err := json.Unmarshal(body, &top); err != nil {
+		return ""
+	}
+	raw, ok := top["id"]
+	if !ok {
+		return ""
+	}
+	var s string
+	if json.Unmarshal(raw, &s) == nil {
+		return s
+	}
+	// JSON-RPC permits numeric IDs too; canvas issues UUIDs but be
+	// defensive against alternative SDKs. Note the ID is echoed as a string.
+	var n json.Number
+	if json.Unmarshal(raw, &n) == nil {
+		return n.String()
+	}
+	return ""
+}
+
+// handleMockA2A is the proxy short-circuit for mock-runtime workspaces.
+// Returns (status, body, true) when the target is mock — caller writes
+// the response and returns. Returns (_, _, false) when the target is
+// not mock — caller continues to the real dispatch path.
+//
+// Side-effects: writes a synthetic activity_logs row via logA2ASuccess
+// when logActivity is true so the canvas's "Agent Comms" tab shows the
+// mock reply in the trace alongside real-agent traffic. Without this
+// the demo would render messages on the canvas chat panel but a peer
+// node clicking through to its activity tab would see an empty list.
+func (h *WorkspaceHandler) handleMockA2A(ctx context.Context, workspaceID, callerID string, body []byte, a2aMethod string, logActivity bool) (int, []byte, bool) {
+	if lookupRuntime(ctx, workspaceID) != MockRuntimeName {
+		return 0, nil, false
+	}
+	requestID := extractRequestID(body)
+	replyText := pickMockReply(workspaceID, requestID)
+	respBody := buildMockA2AResponse(workspaceID, requestID, replyText)
+
+	// Tiny artificial delay so the canvas chat UI has time to render
+	// the user's outgoing bubble before the agent reply appears.
+	// Without it the reply lands the same animation frame and feels
+	// robotic. 80ms is too fast to look "real" but masks the React
+	// double-render race that drops the user bubble entirely on slow
+	// machines (observed locally on M1 Air, 2026-05-07). Below 200ms
+	// keeps a 200-node demo snappy when investors fan out 30 messages
+	// at once.
+	time.Sleep(80 * time.Millisecond)
+
+	if logActivity {
+		// Reuse the existing success-logger so the activity feed shape
+		// is identical to a real agent reply. Status 200 + duration 0
+		// is the "synthesised reply" marker; activity_logs.duration_ms
+		// being 0 is harmless (real fast paths can hit 0 too).
+		h.logA2ASuccess(ctx, workspaceID, callerID, body, respBody, a2aMethod, http.StatusOK, 0)
+	}
+	return http.StatusOK, respBody, true
+}
+
+// IsMockRuntime is a small public helper for callers (tests, the org
+// importer) that need to ask the question without re-implementing the
+// comparison against MockRuntimeName. Trims whitespace and compares
+// case-insensitively, so a typoed YAML cell like "  Mock " still resolves.
+func IsMockRuntime(runtime string) bool {
+	return strings.EqualFold(strings.TrimSpace(runtime), MockRuntimeName)
+}
+
+// gin import is unused at file scope but kept as a tag so a future
+// addition of a thin HTTP handler (e.g. POST /workspaces/:id/mock/replies
+// for an admin-set custom reply pool) doesn't need an import re-order.
+var _ = gin.H{}
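Review note on request-ID handling: the decode order in extractRequestID (string first, then number, otherwise empty) can be exercised in isolation. The sketch below is a minimal stand-alone copy for illustration only; `idOf` is a hypothetical name, not part of this PR, and it assumes nothing beyond the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// idOf is a hypothetical stand-in that follows the same decode order as
// extractRequestID: try a JSON string, then a JSON number, else "".
func idOf(body []byte) string {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(body, &top); err != nil {
		return "" // body is not a JSON object
	}
	raw, ok := top["id"]
	if !ok {
		return "" // no id member at all
	}
	var s string
	if json.Unmarshal(raw, &s) == nil {
		return s // string id; a JSON null also lands here and leaves s empty
	}
	var n json.Number
	if json.Unmarshal(raw, &n) == nil {
		return n.String() // numeric id, returned in its textual form
	}
	return "" // objects, arrays, booleans
}

func main() {
	for _, body := range []string{
		`{"id":"req-1"}`,
		`{"id":42}`,
		`{"id":null}`,
		`{"id":{"nested":true}}`,
		`not json`,
	} {
		fmt.Printf("%-24s -> %q\n", body, idOf([]byte(body)))
	}
}
```

One consequence worth keeping in mind: a numeric request id such as 42 is carried forward as the string "42", and buildMockA2AResponse places that string in the envelope, so the reply echoes "42" rather than 42. A strictly typed JSON-RPC client could treat that as a mismatch.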
@@ -0,0 +1,266 @@
+package handlers
+
+// mock_runtime_test.go — locks the contract for the mock-runtime
+// short-circuit added for the funding-demo "200-workspace mock org"
+// template. Three invariants:
+//
+//   1. ProxyA2A on a workspace with runtime='mock' must return 200
+//      with a JSON-RPC reply containing one text part. NO HTTP
+//      dispatch, NO resolveAgentURL DB read (mock workspaces have
+//      no URL — that read would 404 and break the demo).
+//
+//   2. The reply text must be one of the canned variants and must be
+//      deterministic for a given (workspace_id, request_id) pair so
+//      screen recordings replay identically.
+//
+//   3. Workspaces with runtime != 'mock' must NOT be affected — the
+//      mock check fails fast and falls through to the existing
+//      dispatch path. Same kind of regression guard the poll-mode
+//      tests carry.
+
+import (
+	"bytes"
+	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+	"testing"
+	"time"
+
+	"github.com/DATA-DOG/go-sqlmock"
+	"github.com/gin-gonic/gin"
+)
+
+// TestProxyA2A_MockRuntime_ReturnsCannedReply is the happy-path
+// contract. A workspace flagged runtime='mock' must:
+//   - return 200 with JSON-RPC envelope {result:{parts:[{kind:text,text:...}]}}
+//   - not dispatch HTTP (no SELECT url SQL expected)
+//   - reply with text drawn from mockReplyVariants
+func TestProxyA2A_MockRuntime_ReturnsCannedReply(t *testing.T) {
+	mock := setupTestDB(t)
+	setupTestRedis(t)
+	broadcaster := newTestBroadcaster()
+	handler := NewWorkspaceHandler(broadcaster, nil, "http://localhost:8080", t.TempDir())
+
+	const wsID = "ws-mock-canned"
+
+	// Budget check fires before runtime lookup (same as the poll-mode
+	// short-circuit) — keeps mock workspaces honest if a tenant ever
+	// sets a budget on one. Unlikely on a demo, but the guard stays
+	// uniform so future "monthly_spend on mock = 0" assertions don't
+	// drift.
+	expectBudgetCheck(mock, wsID)
+
+	// lookupDeliveryMode runs next, ahead of the runtime lookup: return
+	// push so the poll short-circuit doesn't fire and we hit the mock check.
+	mock.ExpectQuery("SELECT delivery_mode FROM workspaces WHERE id").
+		WithArgs(wsID).
+		WillReturnRows(sqlmock.NewRows([]string{"delivery_mode"}).AddRow("push"))
+
+	// lookupRuntime SELECT — returns 'mock', triggering the canned-reply
+	// short-circuit. CRITICAL: NO ExpectQuery for `SELECT url, status
+	// FROM workspaces` (resolveAgentURL's query). If the short-circuit
+	// fails to fire, sqlmock will surface "unexpected query" on the URL
+	// SELECT and the test fails loudly — that's the dispatch-leak detector.
+	mock.ExpectQuery("SELECT runtime FROM workspaces WHERE id").
+		WithArgs(wsID).
+		WillReturnRows(sqlmock.NewRows([]string{"runtime"}).AddRow("mock"))
+
+	// Activity log: logA2ASuccess writes the synthetic reply to
+	// activity_logs so the canvas's Agent Comms tab shows it alongside
+	// real-agent traffic.
+	mock.ExpectExec("INSERT INTO activity_logs").
+		WillReturnResult(sqlmock.NewResult(0, 1))
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{{Key: "id", Value: wsID}}
+
+	body := `{"jsonrpc":"2.0","id":"req-mock-1","method":"message/send","params":{"message":{"role":"user","parts":[{"kind":"text","text":"hello mock"}]}}}`
+	c.Request = httptest.NewRequest("POST", "/workspaces/"+wsID+"/a2a", bytes.NewBufferString(body))
+	c.Request.Header.Set("Content-Type", "application/json")
+
+	handler.ProxyA2A(c)
+
+	// logA2ASuccess fires async — give it a moment to settle so
+	// ExpectationsWereMet doesn't flake.
+	time.Sleep(200 * time.Millisecond)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	var resp map[string]interface{}
+	if err := json.Unmarshal(w.Body.Bytes(), &resp); err != nil {
+		t.Fatalf("response is not valid JSON: %v", err)
+	}
+	if resp["jsonrpc"] != "2.0" {
+		t.Errorf("response.jsonrpc = %v, want 2.0", resp["jsonrpc"])
+	}
+	if resp["id"] != "req-mock-1" {
+		t.Errorf("response.id = %v, want %q (echoed from request)", resp["id"], "req-mock-1")
+	}
+	result, _ := resp["result"].(map[string]interface{})
+	if result == nil {
+		t.Fatalf("response.result missing or wrong type: %v", resp["result"])
+	}
+	parts, _ := result["parts"].([]interface{})
+	if len(parts) != 1 {
+		t.Fatalf("expected exactly one part, got %d: %v", len(parts), parts)
+	}
+	part, _ := parts[0].(map[string]interface{})
+	if part["kind"] != "text" {
+		t.Errorf("part.kind = %v, want text", part["kind"])
+	}
+	text, _ := part["text"].(string)
+	if text == "" {
+		t.Error("part.text is empty — canned reply not populated")
+	}
+	// Reply must be one of the variants.
+	matched := false
+	for _, v := range mockReplyVariants {
+		if v == text {
+			matched = true
+			break
+		}
+	}
+	if !matched {
+		t.Errorf("reply text %q is not in mockReplyVariants", text)
+	}
+
+	if err := mock.ExpectationsWereMet(); err != nil {
+		t.Errorf("unmet sqlmock expectations: %v", err)
+	}
+}
+
+// TestProxyA2A_NonMockRuntime_NoShortCircuit verifies the symmetric
+// contract: a workspace with a real runtime (claude-code, hermes, etc.)
+// must NOT be affected by the mock check — it falls through to the
+// real dispatch path. Without this guard, a regression in
+// lookupRuntime could silently flip every workspace into mock-mode
+// and start handing out canned replies in place of real-agent traffic.
+func TestProxyA2A_NonMockRuntime_NoShortCircuit(t *testing.T) {
+	mock := setupTestDB(t)
+	mr := setupTestRedis(t)
+	allowLoopbackForTest(t)
+	broadcaster := newTestBroadcaster()
+	handler := NewWorkspaceHandler(broadcaster, nil, "http://localhost:8080", t.TempDir())
+
+	const wsID = "ws-real-runtime"
+
+	dispatched := false
+	agentServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		dispatched = true
+		w.Header().Set("Content-Type", "application/json")
+		w.Write([]byte(`{"jsonrpc":"2.0","id":"1","result":{"status":"ok"}}`))
+	}))
+	defer agentServer.Close()
+	mr.Set("ws:"+wsID+":url", agentServer.URL)
+
+	expectBudgetCheck(mock, wsID)
+
+	// poll-mode SELECT — return push so we proceed past the poll
+	// short-circuit.
+	mock.ExpectQuery("SELECT delivery_mode FROM workspaces WHERE id").
+		WithArgs(wsID).
+		WillReturnRows(sqlmock.NewRows([]string{"delivery_mode"}).AddRow("push"))
+
+	// runtime SELECT — return claude-code so the mock check falls
+	// through.
+	mock.ExpectQuery("SELECT runtime FROM workspaces WHERE id").
+		WithArgs(wsID).
+		WillReturnRows(sqlmock.NewRows([]string{"runtime"}).AddRow("claude-code"))
+
+	mock.ExpectExec("INSERT INTO activity_logs").
+		WillReturnResult(sqlmock.NewResult(0, 1))
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{{Key: "id", Value: wsID}}
+	body := `{"jsonrpc":"2.0","id":"real-1","method":"message/send","params":{"message":{"role":"user","parts":[{"kind":"text","text":"hi"}]}}}`
+	c.Request = httptest.NewRequest("POST", "/workspaces/"+wsID+"/a2a", bytes.NewBufferString(body))
+	c.Request.Header.Set("Content-Type", "application/json")
+
+	handler.ProxyA2A(c)
+
+	time.Sleep(50 * time.Millisecond)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	if !dispatched {
+		t.Error("non-mock runtime: expected the agent server to receive the request, but it did not — mock short-circuit may be over-firing")
+	}
+	if err := mock.ExpectationsWereMet(); err != nil {
+		t.Errorf("unmet sqlmock expectations: %v", err)
+	}
+}
+
+// TestPickMockReply_Deterministic locks the determinism contract:
+// the same (workspaceID, requestID) input must yield the same variant
+// every call. Required for screen recordings + flake-free e2e
+// snapshots.
+func TestPickMockReply_Deterministic(t *testing.T) {
+	cases := []struct {
+		ws, req string
+	}{
+		{"ws-1", "req-A"},
+		{"ws-1", "req-B"},
+		{"ws-2", "req-A"},
+		{"", ""},
+	}
+	for _, tc := range cases {
+		first := pickMockReply(tc.ws, tc.req)
+		for i := 0; i < 10; i++ {
+			next := pickMockReply(tc.ws, tc.req)
+			if next != first {
+				t.Errorf("pickMockReply(%q,%q) is not deterministic: got %q then %q",
+					tc.ws, tc.req, first, next)
+			}
+		}
+	}
+}
+
+// TestIsMockRuntime_TrimsAndCaseInsensitive — typos and stray
+// whitespace in YAML must still resolve to mock so a single
+// runtime: " Mock " entry doesn't silently get dispatched.
+func TestIsMockRuntime_TrimsAndCaseInsensitive(t *testing.T) {
+	cases := map[string]bool{
+		"mock":        true,
+		"MOCK":        true,
+		"  Mock  ":    true,
+		"mocky":       false,
+		"":            false,
+		"external":    false,
+		"claude-code": false,
+	}
+	for in, want := range cases {
+		if got := IsMockRuntime(in); got != want {
+			t.Errorf("IsMockRuntime(%q) = %v, want %v", in, got, want)
+		}
+	}
+}
+
+// TestBuildMockA2AResponse_EchoesRequestID — JSON-RPC requires the
+// reply id to match the request id so callers can correlate. Mock
+// must hold this contract or canvas's correlation logic breaks.
+func TestBuildMockA2AResponse_EchoesRequestID(t *testing.T) {
+	out := buildMockA2AResponse("ws-x", "req-echo-7", "On it!")
+	var resp map[string]interface{}
+	if err := json.Unmarshal(out, &resp); err != nil {
+		t.Fatalf("response is not valid JSON: %v", err)
+	}
+	if resp["id"] != "req-echo-7" {
+		t.Errorf("id = %v, want req-echo-7", resp["id"])
+	}
+	if resp["jsonrpc"] != "2.0" {
+		t.Errorf("jsonrpc = %v, want 2.0", resp["jsonrpc"])
+	}
+	result, _ := resp["result"].(map[string]interface{})
+	parts, _ := result["parts"].([]interface{})
+	if len(parts) != 1 {
+		t.Fatalf("expected 1 part, got %d", len(parts))
+	}
+	p, _ := parts[0].(map[string]interface{})
+	if p["text"] != "On it!" {
+		t.Errorf("part.text = %v, want On it!", p["text"])
+	}
+}
@@ -0,0 +1,439 @@
+package handlers
+
+import (
+	"context"
+	"fmt"
+	"net/url"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"regexp"
+	"strings"
+	"time"
+
+	"gopkg.in/yaml.v3"
+)
+
+// External-ref resolver — gitops-style cross-repo subtree composition.
+// Internal#77 RFC, Phase 3a (task #222). Prior art: Helm subcharts +
+// dependency cache, Kustomize remote bases, Terraform module sources.
+//
+// Schema (a `!external`-tagged mapping anywhere a workspace entry is
+// allowed — workspaces:, roots:, children:):
+//
+//   - !external
+//     repo: molecule-ai/molecule-dev-department
+//     ref: main
+//     path: dev-lead/workspace.yaml
+//
+// At resolve time, the platform fetches the repo at ref into a
+// content-addressable cache under <rootDir>/.external-cache/<repo>/<sha>/,
+// loads the yaml at <cacheDir>/<path>, cache-prefixes every files_dir
+// (relative !include paths already resolve inside the cache), then grafts
+// the result in place of the !external node. Downstream pipeline
+// (resolveInsideRoot, plugin merge, CopyTemplateToContainer) sees ordinary in-tree paths.
+
+// ExternalRef is the deserialized form of an `!external`-tagged mapping.
+type ExternalRef struct {
+	Repo string `yaml:"repo"`
+	Ref  string `yaml:"ref"`
+	Path string `yaml:"path"`
+
+	// URL overrides the default Gitea host. Optional; defaults to
+	// MOLECULE_EXTERNAL_GITEA_URL env or git.moleculesai.app.
+	URL string `yaml:"url,omitempty"`
+}
+
+const (
+	// maxExternalDepth caps recursion through nested `!external`s. Lower
+	// than maxIncludeDepth (16) because each level may issue a network
+	// fetch. Composition that genuinely needs >4 layers is a smell.
+	maxExternalDepth = 4
+
+	// externalCacheDirName is the per-template cache subdir under rootDir.
+	// Content-addressable: keyed by (repo, sha). Operators add this to
+	// .gitignore — cache is platform-mutated, not source-tracked.
+	externalCacheDirName = ".external-cache"
+
+	// gitFetchTimeout caps a single clone operation. Conservative —
+	// org template fetches are typically <100KB.
+	gitFetchTimeout = 60 * time.Second
+)
+
+// safeRefPattern restricts `ref` to characters git accepts in branch / tag
+// names and forbids a leading '-' so git never parses a ref as an option.
+var safeRefPattern = regexp.MustCompile(`^[a-zA-Z0-9_./][a-zA-Z0-9_./-]*$`)
+
+// allowlistedHostPath returns true if `<host>/<repo>` matches the
+// configured allowlist. Default allowlist: git.moleculesai.app/molecule-ai/.
+// Override via MOLECULE_EXTERNAL_REPO_ALLOWLIST env var (comma-separated
+// patterns). Patterns are matched as prefixes (with trailing-slash
+// semantics) or as exact matches. Trailing /* is treated as "any
+// descendants of this prefix".
+//
+// Examples:
+//   - "git.moleculesai.app/molecule-ai/" → matches molecule-ai/* (any repo)
+//   - "git.moleculesai.app/molecule-ai/*" → same; trailing /* normalized to /
+//   - "git.moleculesai.app/molecule-ai/molecule-dev-department" → exact
+//   - "git.moleculesai.app/" → matches everything on that host
+func allowlistedHostPath(host, repoPath string) bool {
+	allow := os.Getenv("MOLECULE_EXTERNAL_REPO_ALLOWLIST")
+	if allow == "" {
+		allow = "git.moleculesai.app/molecule-ai/"
+	}
+	hp := host + "/" + repoPath
+	for _, pat := range strings.Split(allow, ",") {
+		pat = strings.TrimSpace(pat)
+		if pat == "" {
+			continue
+		}
+		// Normalize trailing /* → /
+		pat = strings.TrimSuffix(pat, "*")
+		if pat == hp {
+			return true
+		}
+		if strings.HasSuffix(pat, "/") && strings.HasPrefix(hp+"/", pat) {
+			return true
+		}
+	}
+	return false
+}
+
+// externalFetcher abstracts the git-clone-into-cache step. Production
+// uses gitFetcher (shells out to git); tests inject a fake that
+// pre-stages content in a temp dir.
+type externalFetcher interface {
+	// Fetch ensures rootDir/.external-cache/<safe-repo>/<sha>/ contains
+	// the repo content at the given ref. Returns the absolute cache
+	// dir + the resolved SHA. Cache hit = no network. Cache miss =
+	// clone.
+	Fetch(ctx context.Context, rootDir, host, repoPath, ref string) (cacheDir, sha string, err error)
+}
+
+// defaultExternalFetcher is the package-level fetcher injection point.
+// Production code uses the git-shell fetcher; tests override via
+// SetExternalFetcherForTest.
+var defaultExternalFetcher externalFetcher = &gitFetcher{}
+
+// SetExternalFetcherForTest swaps the fetcher for testing. Returns a
+// cleanup func that restores the previous fetcher.
+func SetExternalFetcherForTest(f externalFetcher) func() {
+	prev := defaultExternalFetcher
+	defaultExternalFetcher = f
+	return func() { defaultExternalFetcher = prev }
+}
+
+// resolveExternalMapping replaces an `!external`-tagged mapping node
+// with the loaded + path-rewritten yaml content from the fetched repo.
+//
+// `currentDir` and `rootDir` are inherited from expandNode's resolve
+// frame. `visited` tracks (repo, sha, path) tuples for cycle detection
+// across nested externals.
+func resolveExternalMapping(n *yaml.Node, currentDir, rootDir string, visited map[string]bool, depth int) error {
+	if depth > maxExternalDepth {
+		return fmt.Errorf("!external: max depth %d exceeded (possible cycle)", maxExternalDepth)
+	}
+	if rootDir == "" {
+		return fmt.Errorf("!external at line %d requires a dir-based org template (no rootDir in inline-template mode)", n.Line)
+	}
+
+	var ref ExternalRef
+	if err := n.Decode(&ref); err != nil {
+		return fmt.Errorf("!external at line %d: decode: %w", n.Line, err)
+	}
+	if ref.Repo == "" || ref.Ref == "" || ref.Path == "" {
+		return fmt.Errorf("!external at line %d: repo, ref, path are all required (got %+v)", n.Line, ref)
+	}
+	if !safeRefPattern.MatchString(ref.Ref) {
+		return fmt.Errorf("!external at line %d: ref %q contains disallowed characters", n.Line, ref.Ref)
+	}
+	// Defense-in-depth: even though git itself rejects refs containing
+	// `..`, the regex above currently allows them. Reject explicitly.
+	if strings.Contains(ref.Ref, "..") {
+		return fmt.Errorf("!external at line %d: ref %q must not contain '..'", n.Line, ref.Ref)
+	}
+	if strings.Contains(ref.Path, "..") || strings.HasPrefix(ref.Path, "/") {
+		return fmt.Errorf("!external at line %d: path %q must be relative-and-down-only", n.Line, ref.Path)
+	}
+
+	host := ref.URL
+	if host == "" {
+		host = os.Getenv("MOLECULE_EXTERNAL_GITEA_URL")
+	}
+	if host == "" {
+		host = "git.moleculesai.app"
+	}
+	host = strings.TrimPrefix(strings.TrimPrefix(host, "https://"), "http://")
+	host = strings.TrimSuffix(host, "/")
+
+	if !allowlistedHostPath(host, ref.Repo) {
+		return fmt.Errorf("!external at line %d: %s/%s not in MOLECULE_EXTERNAL_REPO_ALLOWLIST", n.Line, host, ref.Repo)
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), gitFetchTimeout)
+	defer cancel()
+
+	cacheDir, sha, err := defaultExternalFetcher.Fetch(ctx, rootDir, host, ref.Repo, ref.Ref)
+	if err != nil {
+		return fmt.Errorf("!external at line %d: fetch %s/%s@%s: %w", n.Line, host, ref.Repo, ref.Ref, err)
+	}
+
+	// Cycle key: (repo, sha, path) — same external content reachable
+	// via two paths is fine, but a self-referential cycle isn't.
+	cycleKey := fmt.Sprintf("%s/%s@%s/%s", host, ref.Repo, sha, ref.Path)
+	if visited[cycleKey] {
+		return fmt.Errorf("!external cycle detected at %q (line %d)", cycleKey, n.Line)
+	}
+
+	// Validate path resolves inside the cache dir (anti-traversal).
+	yamlPathAbs, err := resolveInsideRoot(cacheDir, ref.Path)
+	if err != nil {
+		return fmt.Errorf("!external at line %d: path %q: %w", n.Line, ref.Path, err)
+	}
+	if _, err := os.Stat(yamlPathAbs); err != nil {
+		return fmt.Errorf("!external at line %d: %s/%s@%s does not contain %q: %w", n.Line, host, ref.Repo, sha, ref.Path, err)
+	}
+
+	data, err := os.ReadFile(yamlPathAbs)
+	if err != nil {
+		return fmt.Errorf("!external at line %d: read %q: %w", n.Line, yamlPathAbs, err)
+	}
+
+	var sub yaml.Node
+	if err := yaml.Unmarshal(data, &sub); err != nil {
+		return fmt.Errorf("!external at line %d: parse %q: %w", n.Line, yamlPathAbs, err)
+	}
+	root := &sub
+	if root.Kind == yaml.DocumentNode && len(root.Content) == 1 {
+		root = root.Content[0]
+	}
+
+	// Recurse FIRST: load all nested !include / !external content into
+	// the tree. Then rewrite ALL files_dir scalars in the fully-resolved
+	// tree (top + nested) with the cache prefix in one pass. Doing
+	// rewrite-before-recurse would leave nested-loaded files_dir paths
+	// unprefixed.
+	visited[cycleKey] = true
+	defer delete(visited, cycleKey)
+
+	subDir := filepath.Dir(yamlPathAbs)
+	if err := expandNode(root, subDir, rootDir, visited, depth+1); err != nil {
+		return err
+	}
+
+	// Path rewrite: prefix every files_dir scalar in the fully-resolved
+	// content with the cache-relative-from-rootDir prefix. After this
+	// pass, fetched workspaces look like ordinary in-tree workspaces.
+	cachePrefix, err := filepath.Rel(rootDir, cacheDir)
+	if err != nil {
+		return fmt.Errorf("!external at line %d: cannot compute cache prefix: %w", n.Line, err)
+	}
+	rewriteFilesDir(root, cachePrefix)
+
+	// Replace the !external mapping with the resolved content in-place.
+	*n = *root
+	if n.Tag == "!external" {
+		n.Tag = ""
+	}
+	return nil
+}
+
+// rewriteFilesDir walks the yaml node tree and prepends cachePrefix to
+// every files_dir scalar value. Idempotent: if a files_dir value already
+// starts with the prefix, no-op.
+//
+// !include paths are intentionally NOT rewritten. They resolve relative
+// to their containing file's directory (subDir in expandNode), and after
+// fetch that directory IS inside the cache, so relative !include paths
+// Just Work without any rewrite. Rewriting them would double-prefix on
+// recursive resolution.
+//
+// files_dir DOES need rewriting because it's consumed at workspace-
+// provisioning time relative to orgBaseDir (the parent template's root),
+// not relative to the workspace.yaml's containing dir.
+func rewriteFilesDir(n *yaml.Node, cachePrefix string) {
+	if n == nil {
+		return
+	}
+	if n.Kind == yaml.MappingNode {
+		for i := 0; i+1 < len(n.Content); i += 2 {
+			key, value := n.Content[i], n.Content[i+1]
+			if key.Kind == yaml.ScalarNode && key.Value == "files_dir" && value.Kind == yaml.ScalarNode {
+				if !strings.HasPrefix(value.Value, cachePrefix+string(filepath.Separator)) && value.Value != cachePrefix {
+					value.Value = filepath.Join(cachePrefix, value.Value)
+				}
+			}
+		}
+	}
+	for _, child := range n.Content {
+		rewriteFilesDir(child, cachePrefix)
+	}
+}
+
+// safeRepoCacheDir flattens host + repo path (e.g. "git.moleculesai.app" +
+// "molecule-ai/foo") into one filesystem-safe segment,
+// "git.moleculesai.app__molecule-ai__foo". Avoids nesting cache dirs (which complicates cleanup).
+func safeRepoCacheDir(host, repoPath string) string {
+	hp := host + "/" + repoPath
+	hp = strings.ReplaceAll(hp, "/", "__")
+	hp = strings.ReplaceAll(hp, ":", "_")
+	return hp
+}
+
+// gitFetcher is the production externalFetcher: shells out to `git` to
+// clone the repo at ref into the cache dir. Cache key includes the
+// resolved SHA, so different SHAs of the same ref get different cache
+// dirs (no overwrite).
+//
+// Token handling (important for security): the auth token never enters
+// the clone URL (and therefore never lands in the cloned repo's
+// .git/config) and never appears in returned errors. We use git's
+// `http.extraHeader` config option (passed via `-c`), which sends an
+// Authorization header per-request without persisting it. The token is
+// briefly visible in the `git` process's argv, which on a default Linux
+// host any local user can read via `ps`; that is wider exposure than the
+// env var that supplied it (readable only by the same uid or root).
+//
+// Cache validity uses a `.complete` marker written after a successful
+// clone+rename. Cache-hit checks for the marker, not just the dir
+// existence — a partially-written cache (clone failed mid-way, or a
+// concurrent caller wrote a half-baked cache dir) is treated as cache
+// miss and re-fetched cleanly.
+type gitFetcher struct{}
+
+// cacheCompleteMarker is the filename written after a successful clone.
+// Cache-hit requires this marker; without it, the cache dir is treated
+// as partially-written and re-fetched.
+const cacheCompleteMarker = ".complete"
+
+// Fetch resolves ref → SHA via `git ls-remote`, then `git clone --depth=1 -b ref`
+// if the cache dir is missing or incomplete, so ref must name a branch or tag.
+// Auth via MOLECULE_GITEA_TOKEN injected via http.extraHeader (never via URL).
+func (g *gitFetcher) Fetch(ctx context.Context, rootDir, host, repoPath, ref string) (string, string, error) {
+	cacheRoot := filepath.Join(rootDir, externalCacheDirName, safeRepoCacheDir(host, repoPath))
+	if err := os.MkdirAll(cacheRoot, 0o755); err != nil {
+		return "", "", fmt.Errorf("mkdir cache root: %w", err)
+	}
+
+	cloneURL := buildExternalCloneURL(host, repoPath)
+	gitArgs := func(extra ...string) []string {
+		args := authConfigArgs()
+		return append(args, extra...)
+	}
+
+	// 1. Resolve ref → SHA (so cache dir is content-addressable).
+	sha, err := g.resolveRefToSHA(ctx, cloneURL, ref, gitArgs)
+	if err != nil {
+		return "", "", fmt.Errorf("ls-remote: %s", redactToken(err.Error()))
+	}
+
+	cacheDir := filepath.Join(cacheRoot, sha)
+	// Cache-hit requires the .complete marker AND the .git dir.
+	// Without the marker, cache is partially-written → treat as miss.
+	if isCacheComplete(cacheDir) {
+		return cacheDir, sha, nil
+	}
+
+	// Cache miss or partially-written — clean any stale cacheDir before
+	// cloning (a previous broken attempt would otherwise block rename).
+	os.RemoveAll(cacheDir)
+
+	// 2. Clone into a sibling tmp dir; atomic rename on success.
+	tmpDir, err := os.MkdirTemp(cacheRoot, sha+".tmp.")
+	if err != nil {
+		return "", "", fmt.Errorf("mkdir tmp: %w", err)
+	}
+	// MkdirTemp is used only to reserve a unique name. Remove the empty
+	// dir again so git clone creates the destination itself.
+	os.RemoveAll(tmpDir)
+	cloneAndConfig := gitArgs("clone", "--quiet", "--depth=1", "-b", ref, cloneURL, tmpDir)
+	cmd := exec.CommandContext(ctx, "git", cloneAndConfig...)
+	cmd.Env = append(os.Environ(), "GIT_TERMINAL_PROMPT=0")
+	if out, err := cmd.CombinedOutput(); err != nil {
+		os.RemoveAll(tmpDir)
+		return "", "", fmt.Errorf("git clone: %w: %s", err, redactToken(strings.TrimSpace(string(out))))
+	}
+
+	// Write the .complete marker BEFORE the rename. If rename succeeds,
+	// the marker is in place. If rename loses the race (concurrent
+	// fetcher won), our tmp gets cleaned up and we trust the winner.
+	if err := os.WriteFile(filepath.Join(tmpDir, cacheCompleteMarker), []byte(time.Now().UTC().Format(time.RFC3339)), 0o644); err != nil {
+		os.RemoveAll(tmpDir)
+		return "", "", fmt.Errorf("write complete marker: %w", err)
+	}
+
+	if err := os.Rename(tmpDir, cacheDir); err != nil {
+		// Race: another import beat us. Validate THEIR cache, accept it.
+		os.RemoveAll(tmpDir)
+		if isCacheComplete(cacheDir) {
+			return cacheDir, sha, nil
+		}
+		return "", "", fmt.Errorf("rename clone to cache (and winner's cache is incomplete): %w", err)
+	}
+	return cacheDir, sha, nil
+}
+
+// isCacheComplete reports whether cacheDir contains both the cloned
+// repo (.git) and the .complete marker. Treats partial state as miss.
+func isCacheComplete(cacheDir string) bool {
+	if _, err := os.Stat(filepath.Join(cacheDir, ".git")); err != nil {
+		return false
+	}
+	if _, err := os.Stat(filepath.Join(cacheDir, cacheCompleteMarker)); err != nil {
+		return false
+	}
+	return true
+}
+
+func (g *gitFetcher) resolveRefToSHA(ctx context.Context, cloneURL, ref string, gitArgs func(...string) []string) (string, error) {
+	args := gitArgs("ls-remote", cloneURL, ref)
+	cmd := exec.CommandContext(ctx, "git", args...)
+	cmd.Env = append(os.Environ(), "GIT_TERMINAL_PROMPT=0")
+	out, err := cmd.Output()
+	if err != nil {
+		return "", err
+	}
+	line := strings.TrimSpace(string(out))
+	if line == "" {
+		return "", fmt.Errorf("ref %q not found", ref)
+	}
+	// First whitespace-separated field is the SHA.
+	for i, ch := range line {
+		if ch == ' ' || ch == '\t' {
+			return line[:i], nil
+		}
+	}
+	return line, nil
+}
+
+// buildExternalCloneURL constructs the clone URL WITHOUT auth in userinfo.
+// Auth is layered on via authConfigArgs's http.extraHeader.
+func buildExternalCloneURL(host, repoPath string) string {
+	u := url.URL{Scheme: "https", Host: host, Path: "/" + repoPath + ".git"}
+	return u.String()
+}
+
+// authConfigArgs returns the `-c http.extraHeader=Authorization: token X`
+// args to pass to git, OR an empty slice if no token is set. The token
+// goes into the request headers (not the URL or .git/config), so it
+// doesn't persist on disk and doesn't appear in clone error output.
+func authConfigArgs() []string {
+	token := os.Getenv("MOLECULE_GITEA_TOKEN")
+	if token == "" {
+		return nil
+	}
+	return []string{"-c", "http.extraHeader=Authorization: token " + token}
+}
+
+// redactToken scrubs the auth token from a string before it's logged
+// or returned in an error. Belt-and-braces: with the http.extraHeader
+// approach the token shouldn't appear in git's output, but if some
+// future git version or libcurl debug mode emits it, this catches it.
+func redactToken(s string) string {
+	token := os.Getenv("MOLECULE_GITEA_TOKEN")
+	if len(token) < 8 {
+		return s
+	}
+	return strings.ReplaceAll(s, token, "<redacted-token>")
+}
+
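The cache-population sequence in the fetcher above (check marker, clone into a sibling temp dir, write the marker, rename into place, accept a complete winner if the rename loses) can be sketched without git. This is a minimal illustration, not the fetcher itself: `populate`, `isComplete`, `demo`, and the literal marker name `.complete` are hypothetical names introduced here, and `fill` stands in for the clone step.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// completeMarker plays the role of cacheCompleteMarker in the diff; the
// literal file name used here is an assumption for illustration.
const completeMarker = ".complete"

// isComplete reports whether dir holds the marker file.
func isComplete(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, completeMarker))
	return err == nil
}

// populate fills cacheRoot/key through a sibling temp dir: fill, write the
// marker, then rename into place. A cache hit skips fill entirely.
func populate(cacheRoot, key string, fill func(dir string) error) (string, error) {
	final := filepath.Join(cacheRoot, key)
	if isComplete(final) {
		return final, nil
	}
	// Clear any partial dir left behind by an interrupted attempt.
	if err := os.RemoveAll(final); err != nil {
		return "", err
	}
	tmp, err := os.MkdirTemp(cacheRoot, key+".tmp.")
	if err != nil {
		return "", err
	}
	if err := fill(tmp); err != nil {
		os.RemoveAll(tmp)
		return "", err
	}
	if err := os.WriteFile(filepath.Join(tmp, completeMarker), []byte("ok"), 0o644); err != nil {
		os.RemoveAll(tmp)
		return "", err
	}
	if err := os.Rename(tmp, final); err != nil {
		os.RemoveAll(tmp)
		// A concurrent writer may have won; accept its result if complete.
		if isComplete(final) {
			return final, nil
		}
		return "", err
	}
	return final, nil
}

// demo runs populate twice and checks that the second call is a cache hit.
func demo() error {
	root, err := os.MkdirTemp("", "cache-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(root)

	calls := 0
	fill := func(dir string) error {
		calls++
		return os.WriteFile(filepath.Join(dir, "data.txt"), []byte("hello"), 0o644)
	}
	first, err := populate(root, "abc123", fill)
	if err != nil {
		return err
	}
	second, err := populate(root, "abc123", fill)
	if err != nil {
		return err
	}
	if first != second || calls != 1 {
		return errors.New("second call should have been a cache hit")
	}
	return nil
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("demo failed:", err)
		os.Exit(1)
	}
	fmt.Println("second call reused the cache")
}
```

Writing the marker before the rename means a reader never observes the final path without the marker, which is the property the `.complete` check relies on.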
@@ -0,0 +1,379 @@
+package handlers
+
+import (
+	"context"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"runtime"
+	"strings"
+	"testing"
+
+	"gopkg.in/yaml.v3"
+)
+
+// PR-B integration test: exercises the REAL gitFetcher (no fakeFetcher
+// injection) against a local bare-git repo. Uses git's `insteadOf`
+// config to rewrite the configured Gitea URL to the local bare path
+// at clone time, so the fetcher's URL-building, ls-remote, clone,
+// atomic-rename, and cache-hit paths all run against real git
+// without requiring network or modifying production code.
+//
+// Internal#77 task #233 (PR-B from the design's phasing).
+
+// TestGitFetcher_RealClone_LocalRedirect proves the production
+// gitFetcher round-trips correctly against a real git repository.
+// Steps:
+//   1. Set up a local bare-git repo with workspace content.
+//   2. Configure git's `insteadOf` to rewrite the gitea URL → local path
+//      via GIT_CONFIG_COUNT/KEY/VALUE env vars (process-scoped).
+//   3. Run resolveYAMLIncludes with !external pointing at the gitea URL.
+//   4. Assert: cache dir populated; content materialized; path rewrite
+//      applied; second invocation hits cache (no second clone).
+func TestGitFetcher_RealClone_LocalRedirect(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skipf("git binary not found: %v", err)
+	}
+
+	if runtime.GOOS == "windows" {
+		t.Skip("path-based git URLs behave differently on Windows; skipping")
+	}
+
+	// Step 1: create a local bare-git repo at <fixtures>/test-dev-dept.git
+	// with workspace content. Use a working clone to add content, then
+	// push to the bare.
+	fixtures := t.TempDir()
+	barePath := filepath.Join(fixtures, "test-dev-dept.git")
+	workPath := filepath.Join(fixtures, "work")
+
+	mustGit(t, "", "init", "--bare", "-b", "main", barePath)
+	mustGit(t, "", "clone", barePath, workPath)
+	mustGit(t, workPath, "config", "user.email", "test@example.com")
+	mustGit(t, workPath, "config", "user.name", "Integration Test")
+
+	mustWriteFile(t, filepath.Join(workPath, "dev-lead/workspace.yaml"), `name: Dev Lead
+files_dir: dev-lead
+children:
+  - !include ./core-be/workspace.yaml
+`)
+	mustWriteFile(t, filepath.Join(workPath, "dev-lead/system-prompt.md"), "Dev Lead persona body.\n")
+	mustWriteFile(t, filepath.Join(workPath, "dev-lead/core-be/workspace.yaml"), `name: Core BE
+files_dir: dev-lead/core-be
+`)
+	mustWriteFile(t, filepath.Join(workPath, "dev-lead/core-be/system-prompt.md"), "Core BE persona body.\n")
+
+	mustGit(t, workPath, "add", ".")
+	mustGit(t, workPath, "commit", "-m", "seed dev tree")
+	mustGit(t, workPath, "push", "origin", "main")
+
+	// Step 2: configure git's insteadOf rewrite. The fetcher will try
+	// to clone https://git.moleculesai.app/molecule-ai/test-dev-dept.git;
+	// git rewrites to file://<barePath>.
+	//
+	// GIT_CONFIG_COUNT/KEY/VALUE injects config without touching
+	// ~/.gitconfig — process-scoped, no test pollution.
+	giteaURL := "https://git.moleculesai.app/molecule-ai/test-dev-dept.git"
+	t.Setenv("GIT_CONFIG_COUNT", "1")
+	t.Setenv("GIT_CONFIG_KEY_0", "url."+barePath+".insteadOf")
+	t.Setenv("GIT_CONFIG_VALUE_0", giteaURL)
+
+	// Step 3: run resolveYAMLIncludes with !external pointing at the
+	// gitea URL. Allowlist is the default (molecule-ai/* on Gitea host).
+	rootDir := t.TempDir()
+	src := []byte(`workspaces:
+  - !external
+    repo: molecule-ai/test-dev-dept
+    ref: main
+    path: dev-lead/workspace.yaml
+`)
+
+	out, err := resolveYAMLIncludes(src, rootDir)
+	if err != nil {
+		t.Fatalf("resolveYAMLIncludes: %v", err)
+	}
+
+	var tmpl OrgTemplate
+	if err := yaml.Unmarshal(out, &tmpl); err != nil {
+		t.Fatalf("unmarshal: %v", err)
+	}
+	if len(tmpl.Workspaces) != 1 {
+		t.Fatalf("workspaces: %+v", tmpl.Workspaces)
+	}
+	dev := tmpl.Workspaces[0]
+	if dev.Name != "Dev Lead" {
+		t.Errorf("dev.Name = %q; want Dev Lead", dev.Name)
+	}
+	if !strings.Contains(dev.FilesDir, ".external-cache") {
+		t.Errorf("dev.FilesDir = %q; want cache prefix", dev.FilesDir)
+	}
+	if !strings.HasSuffix(dev.FilesDir, "dev-lead") {
+		t.Errorf("dev.FilesDir = %q; want suffix dev-lead", dev.FilesDir)
+	}
+	if len(dev.Children) != 1 {
+		t.Fatalf("expected nested core-be child; got %+v", dev.Children)
+	}
+	core := dev.Children[0]
+	if core.Name != "Core BE" {
+		t.Errorf("core.Name = %q; want Core BE", core.Name)
+	}
+	if !strings.HasSuffix(core.FilesDir, filepath.Join("dev-lead", "core-be")) {
+		t.Errorf("core.FilesDir = %q; want suffix dev-lead/core-be", core.FilesDir)
+	}
+
+	// Step 4: verify the cache dir actually exists and contains the
+	// materialized files (CopyTemplateToContainer would tar these).
+	cacheRoot := filepath.Join(rootDir, ".external-cache")
+	entries, err := os.ReadDir(cacheRoot)
+	if err != nil {
+		t.Fatalf("read cache root: %v", err)
+	}
+	if len(entries) != 1 {
+		t.Fatalf("expected 1 cached repo, got %d: %v", len(entries), entries)
+	}
+	repoDir := filepath.Join(cacheRoot, entries[0].Name())
+	shaDirs, _ := os.ReadDir(repoDir)
+	if len(shaDirs) != 1 {
+		t.Fatalf("expected 1 SHA cache dir, got %d", len(shaDirs))
+	}
+	cacheDir := filepath.Join(repoDir, shaDirs[0].Name())
+	if _, err := os.Stat(filepath.Join(cacheDir, "dev-lead/system-prompt.md")); err != nil {
+		t.Errorf("expected dev-lead/system-prompt.md in cache: %v", err)
+	}
+	if _, err := os.Stat(filepath.Join(cacheDir, "dev-lead/core-be/system-prompt.md")); err != nil {
+		t.Errorf("expected dev-lead/core-be/system-prompt.md in cache: %v", err)
+	}
+
+	// Step 5: re-run; verify cache hit (no second clone). Set a
+	// "marker" file in the cache that a second clone would clobber.
+	marker := filepath.Join(cacheDir, ".cache-hit-marker")
+	if err := os.WriteFile(marker, []byte("hit"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	out2, err := resolveYAMLIncludes(src, rootDir)
+	if err != nil {
+		t.Fatalf("resolveYAMLIncludes second call: %v", err)
+	}
+	if string(out) != string(out2) {
+		t.Errorf("cached output differs from initial — non-deterministic resolve")
+	}
+	if _, err := os.Stat(marker); err != nil {
+		t.Errorf("cache hit not honored — marker file disappeared: %v", err)
+	}
+}
+
+// TestGitFetcher_RealClone_BadRefFails: pointing at a ref that doesn't
+// exist in the bare-repo surfaces git's error cleanly.
+func TestGitFetcher_RealClone_BadRefFails(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skipf("git binary not found: %v", err)
+	}
+	if runtime.GOOS == "windows" {
+		t.Skip("skipping on windows")
+	}
+
+	fixtures := t.TempDir()
+	barePath := filepath.Join(fixtures, "empty-repo.git")
+	workPath := filepath.Join(fixtures, "work")
+	mustGit(t, "", "init", "--bare", "-b", "main", barePath)
+	mustGit(t, "", "clone", barePath, workPath)
+	mustGit(t, workPath, "config", "user.email", "test@example.com")
+	mustGit(t, workPath, "config", "user.name", "Test")
+	mustWriteFile(t, filepath.Join(workPath, "README.md"), "x")
+	mustGit(t, workPath, "add", ".")
+	mustGit(t, workPath, "commit", "-m", "seed")
+	mustGit(t, workPath, "push", "origin", "main")
+
+	t.Setenv("GIT_CONFIG_COUNT", "1")
+	t.Setenv("GIT_CONFIG_KEY_0", "url."+barePath+".insteadOf")
+	t.Setenv("GIT_CONFIG_VALUE_0", "https://git.moleculesai.app/molecule-ai/empty-repo.git")
+
+	rootDir := t.TempDir()
+	src := []byte(`workspaces:
+  - !external
+    repo: molecule-ai/empty-repo
+    ref: nonexistent-branch
+    path: anything.yaml
+`)
+	_, err := resolveYAMLIncludes(src, rootDir)
+	if err == nil {
+		t.Fatalf("expected error for nonexistent ref; got nil")
+	}
+	if !strings.Contains(err.Error(), "ref") && !strings.Contains(err.Error(), "ls-remote") && !strings.Contains(err.Error(), "not found") {
+		t.Errorf("error doesn't mention ref/ls-remote/not found: %v", err)
+	}
+}
+
+// ---------- helpers ----------
+
+func mustGit(t *testing.T, cwd string, args ...string) {
+	t.Helper()
+	cmd := exec.Command("git", args...)
+	if cwd != "" {
+		cmd.Dir = cwd
+	}
+	// Set author/committer identity via env so commits work without any global git config.
+	cmd.Env = append(os.Environ(),
+		"GIT_AUTHOR_EMAIL=test@example.com",
+		"GIT_AUTHOR_NAME=Integration Test",
+		"GIT_COMMITTER_EMAIL=test@example.com",
+		"GIT_COMMITTER_NAME=Integration Test",
+	)
+	if out, err := cmd.CombinedOutput(); err != nil {
+		t.Fatalf("git %s: %v\n%s", strings.Join(args, " "), err, string(out))
+	}
+}
+
+func mustWriteFile(t *testing.T, path, content string) {
+	t.Helper()
+	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(path, []byte(content), 0o644); err != nil {
+		t.Fatal(err)
+	}
+}
+
+// Verify gitFetcher.Fetch direct invocation (no resolver wrapping) for
+// the cache-hit path, exercising the bare API against a local bare-repo.
+func TestGitFetcher_DirectFetch_CacheHit(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skipf("git binary not found: %v", err)
+	}
+	if runtime.GOOS == "windows" {
+		t.Skip("skipping on windows")
+	}
+
+	fixtures := t.TempDir()
+	barePath := filepath.Join(fixtures, "direct.git")
+	workPath := filepath.Join(fixtures, "w")
+	mustGit(t, "", "init", "--bare", "-b", "main", barePath)
+	mustGit(t, "", "clone", barePath, workPath)
+	mustGit(t, workPath, "config", "user.email", "t@e")
+	mustGit(t, workPath, "config", "user.name", "T")
+	mustWriteFile(t, filepath.Join(workPath, "marker.txt"), "hello")
+	mustGit(t, workPath, "add", ".")
+	mustGit(t, workPath, "commit", "-m", "seed")
+	mustGit(t, workPath, "push", "origin", "main")
+
+	t.Setenv("GIT_CONFIG_COUNT", "1")
+	t.Setenv("GIT_CONFIG_KEY_0", "url."+barePath+".insteadOf")
+	t.Setenv("GIT_CONFIG_VALUE_0", "https://git.moleculesai.app/molecule-ai/direct.git")
+
+	rootDir := t.TempDir()
+	g := &gitFetcher{}
+	ctx := context.Background()
+
+	cacheDir1, sha1, err := g.Fetch(ctx, rootDir, "git.moleculesai.app", "molecule-ai/direct", "main")
+	if err != nil {
+		t.Fatalf("first Fetch: %v", err)
+	}
+	if len(sha1) < 7 {
+		t.Errorf("expected SHA-like string, got %q", sha1)
+	}
+	if _, err := os.Stat(filepath.Join(cacheDir1, "marker.txt")); err != nil {
+		t.Errorf("first fetch missing marker.txt: %v", err)
+	}
+
+	// Second call: cache hit, returns same dir + sha, no re-clone.
+	stamp := filepath.Join(cacheDir1, ".not-clobbered-by-second-fetch")
+	if err := os.WriteFile(stamp, []byte("x"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	cacheDir2, sha2, err := g.Fetch(ctx, rootDir, "git.moleculesai.app", "molecule-ai/direct", "main")
+	if err != nil {
+		t.Fatalf("second Fetch: %v", err)
+	}
+	if cacheDir2 != cacheDir1 || sha2 != sha1 {
+		t.Errorf("cache miss on second call: %q/%q vs %q/%q", cacheDir1, sha1, cacheDir2, sha2)
+	}
+	if _, err := os.Stat(stamp); err != nil {
+		t.Errorf("cache hit not honored — stamp file disappeared: %v", err)
+	}
+}
+
+// TestGitFetcher_RejectsRefWithDoubleDot: defense-in-depth on ref input.
+// safeRefPattern permits a literal '.', so ".." would pass the pattern
+// without an explicit deny. Verify it's rejected even though git itself
+// would also reject the resulting clone.
+func TestGitFetcher_RejectsRefWithDoubleDot(t *testing.T) {
+	rootDir := t.TempDir()
+	src := []byte(`workspaces:
+  - !external
+    repo: molecule-ai/x
+    ref: foo..bar
+    path: x.yaml
+`)
+	_, err := resolveYAMLIncludes(src, rootDir)
+	if err == nil {
+		t.Fatalf("expected '..' rejection")
+	}
+	if !strings.Contains(err.Error(), "..") {
+		t.Errorf("expected '..' in error; got %v", err)
+	}
+}
+
+// TestGitFetcher_CacheValidatedByCompleteMarker: a partially-written
+// cache (the .git dir exists but no .complete marker) is treated as
+// cache-miss and re-fetched. Catches the broken-cache-permanence bug.
+func TestGitFetcher_CacheValidatedByCompleteMarker(t *testing.T) {
+	if _, err := exec.LookPath("git"); err != nil {
+		t.Skipf("git not found: %v", err)
+	}
+	if runtime.GOOS == "windows" {
+		t.Skip("skipping on windows")
+	}
+
+	fixtures := t.TempDir()
+	barePath := filepath.Join(fixtures, "test.git")
+	workPath := filepath.Join(fixtures, "w")
+	mustGit(t, "", "init", "--bare", "-b", "main", barePath)
+	mustGit(t, "", "clone", barePath, workPath)
+	mustGit(t, workPath, "config", "user.email", "t@e")
+	mustGit(t, workPath, "config", "user.name", "T")
+	mustWriteFile(t, filepath.Join(workPath, "good.txt"), "from-network")
+	mustGit(t, workPath, "add", ".")
+	mustGit(t, workPath, "commit", "-m", "seed")
+	mustGit(t, workPath, "push", "origin", "main")
+	t.Setenv("GIT_CONFIG_COUNT", "1")
+	t.Setenv("GIT_CONFIG_KEY_0", "url."+barePath+".insteadOf")
+	t.Setenv("GIT_CONFIG_VALUE_0", "https://git.moleculesai.app/molecule-ai/marker-test.git")
+
+	rootDir := t.TempDir()
+	g := &gitFetcher{}
+
+	// First fetch — populates the cache (creates .complete marker).
+	cacheDir1, _, err := g.Fetch(context.Background(), rootDir, "git.moleculesai.app", "molecule-ai/marker-test", "main")
+	if err != nil {
+		t.Fatalf("first Fetch: %v", err)
+	}
+	marker := filepath.Join(cacheDir1, cacheCompleteMarker)
+	if _, err := os.Stat(marker); err != nil {
+		t.Fatalf("first fetch should have written .complete marker: %v", err)
+	}
+
+	// Now simulate a partial cache: delete the marker but leave .git
+	// in place. The next Fetch should treat this as cache-miss and
+	// re-fetch (NOT silently use the partial cache).
+	if err := os.Remove(marker); err != nil {
+		t.Fatal(err)
+	}
+	// Drop a sentinel file the second fetch will clobber if it re-fetches.
+	sentinel := filepath.Join(cacheDir1, "_should_be_clobbered")
+	if err := os.WriteFile(sentinel, []byte("partial"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	cacheDir2, _, err := g.Fetch(context.Background(), rootDir, "git.moleculesai.app", "molecule-ai/marker-test", "main")
+	if err != nil {
+		t.Fatalf("second Fetch: %v", err)
+	}
+	if cacheDir1 != cacheDir2 {
+		t.Errorf("cache dirs differ across fetches: %q vs %q", cacheDir1, cacheDir2)
+	}
+	if _, err := os.Stat(filepath.Join(cacheDir2, cacheCompleteMarker)); err != nil {
+		t.Errorf("re-fetch should have re-written .complete marker: %v", err)
+	}
+	if _, err := os.Stat(sentinel); err == nil {
+		t.Errorf("sentinel still present — re-fetch did NOT clobber partial cache")
+	}
+}
@@ -0,0 +1,331 @@
+package handlers
+
+import (
+	"context"
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+
+	"gopkg.in/yaml.v3"
+)
+
+// fakeFetcher pre-stages a "fetched" repo at a fixed path inside the
+// rootDir's .external-cache, bypassing the real git clone. Tests
+// inject this via SetExternalFetcherForTest to exercise the resolver
+// + path-rewrite logic without network.
+type fakeFetcher struct {
+	// content maps "<host>/<repo>@<ref>" → a function that materializes
+	// repo content under cacheDir. Returns the fake SHA to use.
+	content map[string]func(cacheDir string) (sha string, err error)
+}
+
+func (f *fakeFetcher) Fetch(ctx context.Context, rootDir, host, repoPath, ref string) (string, string, error) {
+	key := host + "/" + repoPath + "@" + ref
+	stage, ok := f.content[key]
+	if !ok {
+		return "", "", &fakeNotFoundError{key: key}
+	}
+	// Use a stable SHA for the test so cache dir is deterministic.
+	cacheDir := filepath.Join(rootDir, ".external-cache", safeRepoCacheDir(host, repoPath), "deadbeef")
+	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
+		return "", "", err
+	}
+	sha, err := stage(cacheDir)
+	if err != nil {
+		return "", "", err
+	}
+	return cacheDir, sha, nil
+}
+
+type fakeNotFoundError struct{ key string }
+
+func (e *fakeNotFoundError) Error() string {
+	return "fake fetcher: no content registered for " + e.key
+}
+
+// stageFiles writes a map of relative-path → content into cacheDir and
+// creates an empty .git dir. Helper for fakeFetcher closures.
+func stageFiles(cacheDir string, files map[string]string) error {
+	if err := os.MkdirAll(filepath.Join(cacheDir, ".git"), 0o755); err != nil {
+		return err
+	}
+	for path, content := range files {
+		full := filepath.Join(cacheDir, path)
+		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
+			return err
+		}
+		if err := os.WriteFile(full, []byte(content), 0o644); err != nil {
+			return err
+		}
+	}
+	return nil
+}
+
+// TestResolveExternalMapping_HappyPath: a parent template with an
+// !external entry resolves cleanly into the fetched workspace + path-
+// rewrites files_dir + relative !include refs into the cache prefix.
+func TestResolveExternalMapping_HappyPath(t *testing.T) {
+	tmp := t.TempDir()
+
+	// Stub fetcher: "fetched" content has a workspace.yaml that uses
+	// files_dir + nested !include relative to the fetched repo's root.
+	fake := &fakeFetcher{
+		content: map[string]func(string) (string, error){
+			"git.moleculesai.app/molecule-ai/molecule-dev-department@main": func(cacheDir string) (string, error) {
+				return "deadbeef", stageFiles(cacheDir, map[string]string{
+					"dev-lead/workspace.yaml": `name: Dev Lead
+files_dir: dev-lead
+children:
+  - !include ./core-lead/workspace.yaml
+`,
+					"dev-lead/core-lead/workspace.yaml": `name: Core Platform Lead
+files_dir: dev-lead/core-lead
+`,
+				})
+			},
+		},
+	}
+	cleanup := SetExternalFetcherForTest(fake)
+	defer cleanup()
+
+	src := []byte(`name: Parent
+workspaces:
+  - !external
+    repo: molecule-ai/molecule-dev-department
+    ref: main
+    path: dev-lead/workspace.yaml
+`)
+
+	out, err := resolveYAMLIncludes(src, tmp)
+	if err != nil {
+		t.Fatalf("resolveYAMLIncludes: %v", err)
+	}
+
+	var tmpl OrgTemplate
+	if err := yaml.Unmarshal(out, &tmpl); err != nil {
+		t.Fatalf("unmarshal: %v", err)
+	}
+	if len(tmpl.Workspaces) != 1 {
+		t.Fatalf("workspaces: %+v", tmpl.Workspaces)
+	}
+	dev := tmpl.Workspaces[0]
+	if dev.Name != "Dev Lead" {
+		t.Errorf("dev.Name = %q; want Dev Lead", dev.Name)
+	}
+	// files_dir should be cache-prefixed.
+	wantPrefix := filepath.Join(".external-cache", "git.moleculesai.app__molecule-ai__molecule-dev-department", "deadbeef")
+	if !strings.HasPrefix(dev.FilesDir, wantPrefix) {
+		t.Errorf("dev.FilesDir = %q; want prefix %q", dev.FilesDir, wantPrefix)
+	}
+	if !strings.HasSuffix(dev.FilesDir, "dev-lead") {
+		t.Errorf("dev.FilesDir = %q; want suffix dev-lead", dev.FilesDir)
+	}
+	// Nested child: files_dir cache-prefixed, name Core Platform Lead.
+	if len(dev.Children) != 1 {
+		t.Fatalf("dev.Children: %+v", dev.Children)
+	}
+	core := dev.Children[0]
+	if core.Name != "Core Platform Lead" {
+		t.Errorf("core.Name = %q; want Core Platform Lead", core.Name)
+	}
+	if !strings.HasPrefix(core.FilesDir, wantPrefix) {
+		t.Errorf("core.FilesDir = %q; want prefix %q", core.FilesDir, wantPrefix)
+	}
+	if !strings.HasSuffix(core.FilesDir, filepath.Join("dev-lead", "core-lead")) {
+		t.Errorf("core.FilesDir = %q; want suffix dev-lead/core-lead", core.FilesDir)
+	}
+}
+
+// TestResolveExternalMapping_AllowlistRejection: hostile yaml pointing
+// at a non-allowlisted repo gets rejected.
+func TestResolveExternalMapping_AllowlistRejection(t *testing.T) {
+	tmp := t.TempDir()
+	fake := &fakeFetcher{content: map[string]func(string) (string, error){}}
+	cleanup := SetExternalFetcherForTest(fake)
+	defer cleanup()
+
+	// Default allowlist is git.moleculesai.app/molecule-ai/*.
+	// github.com/foo/bar is NOT in it.
+	src := []byte(`workspaces:
+  - !external
+    repo: foo/bar
+    ref: main
+    path: x.yaml
+    url: github.com
+`)
+	_, err := resolveYAMLIncludes(src, tmp)
+	if err == nil {
+		t.Fatalf("expected allowlist rejection, got nil")
+	}
+	if !strings.Contains(err.Error(), "MOLECULE_EXTERNAL_REPO_ALLOWLIST") {
+		t.Errorf("expected allowlist error; got %v", err)
+	}
+}
+
+// TestResolveExternalMapping_PathTraversalRejection: hostile yaml
+// with `path: ../../etc/passwd` gets rejected before fetch.
+func TestResolveExternalMapping_PathTraversalRejection(t *testing.T) {
+	tmp := t.TempDir()
+	fake := &fakeFetcher{content: map[string]func(string) (string, error){}}
+	cleanup := SetExternalFetcherForTest(fake)
+	defer cleanup()
+
+	src := []byte(`workspaces:
+  - !external
+    repo: molecule-ai/dev-department
+    ref: main
+    path: ../../etc/passwd
+`)
+	_, err := resolveYAMLIncludes(src, tmp)
+	if err == nil {
+		t.Fatalf("expected path traversal rejection, got nil")
+	}
+	if !strings.Contains(err.Error(), "relative-and-down-only") {
+		t.Errorf("expected path traversal error; got %v", err)
+	}
+}
+
+// TestResolveExternalMapping_BadRefRejection: non-allowlisted ref chars.
+func TestResolveExternalMapping_BadRefRejection(t *testing.T) {
+	tmp := t.TempDir()
+	fake := &fakeFetcher{content: map[string]func(string) (string, error){}}
+	cleanup := SetExternalFetcherForTest(fake)
+	defer cleanup()
+
+	src := []byte(`workspaces:
+  - !external
+    repo: molecule-ai/dev-department
+    ref: "main; rm -rf /"
+    path: foo.yaml
+`)
+	_, err := resolveYAMLIncludes(src, tmp)
+	if err == nil || !strings.Contains(err.Error(), "disallowed characters") {
+		t.Errorf("expected ref-validation error; got %v", err)
+	}
+}
+
+// TestResolveExternalMapping_MissingRequiredFields: repo / ref / path
+// are all required.
+func TestResolveExternalMapping_MissingRequiredFields(t *testing.T) {
+	tmp := t.TempDir()
+	fake := &fakeFetcher{content: map[string]func(string) (string, error){}}
+	cleanup := SetExternalFetcherForTest(fake)
+	defer cleanup()
+
+	cases := []string{
+		// missing repo
+		`workspaces:
+  - !external
+    ref: main
+    path: x.yaml
+`,
+		// missing ref
+		`workspaces:
+  - !external
+    repo: molecule-ai/x
+    path: x.yaml
+`,
+		// missing path
+		`workspaces:
+  - !external
+    repo: molecule-ai/x
+    ref: main
+`,
+	}
+	for i, src := range cases {
+		_, err := resolveYAMLIncludes([]byte(src), tmp)
+		if err == nil {
+			t.Errorf("case %d: expected required-field error, got nil", i)
+		} else if !strings.Contains(err.Error(), "required") {
+			t.Errorf("case %d: want 'required' in error; got %v", i, err)
+		}
+	}
+}
+
+// TestRewriteFilesDir: verify the path-rewrite walker
+// prefixes files_dir scalars. !include scalars are NOT rewritten —
+// they resolve relative to their containing file's dir, which post-
+// fetch is naturally inside the cache.
+func TestRewriteFilesDir(t *testing.T) {
+	src := `name: Foo
+files_dir: dev-lead
+children:
+  - !include ./bar/workspace.yaml
+  - !include other-team.yaml
+inner:
+  files_dir: dev-lead/sub
+`
+	var n yaml.Node
+	if err := yaml.Unmarshal([]byte(src), &n); err != nil {
+		t.Fatal(err)
+	}
+	rewriteFilesDir(&n, ".external-cache/foo/bar")
+
+	out, err := yaml.Marshal(&n)
+	if err != nil {
+		t.Fatal(err)
+	}
+	got := string(out)
+	for _, want := range []string{
+		"files_dir: .external-cache/foo/bar/dev-lead",
+		"files_dir: .external-cache/foo/bar/dev-lead/sub",
+		// !include preserved as-is; resolves naturally via subDir.
+		"!include ./bar/workspace.yaml",
+		"!include other-team.yaml",
+	} {
+		if !strings.Contains(got, want) {
+			t.Errorf("missing %q in:\n%s", want, got)
+		}
+	}
+}
+
+// TestRewriteFilesDir_Idempotent: re-running the rewriter
+// on already-prefixed files_dir doesn't double-prefix.
+func TestRewriteFilesDir_Idempotent(t *testing.T) {
+	src := `files_dir: .external-cache/foo/bar/dev-lead
+inner:
+  files_dir: .external-cache/foo/bar/dev-lead/sub
+`
+	var n yaml.Node
+	if err := yaml.Unmarshal([]byte(src), &n); err != nil {
+		t.Fatal(err)
+	}
+	rewriteFilesDir(&n, ".external-cache/foo/bar")
+
+	out, _ := yaml.Marshal(&n)
+	got := string(out)
+	if strings.Contains(got, ".external-cache/foo/bar/.external-cache") {
+		t.Errorf("double-prefix detected:\n%s", got)
+	}
+	// Should still be valid (single-prefixed) afterwards.
+	for _, want := range []string{
+		"files_dir: .external-cache/foo/bar/dev-lead",
+		"files_dir: .external-cache/foo/bar/dev-lead/sub",
+	} {
+		if !strings.Contains(got, want) {
+			t.Errorf("expected unchanged %q in:\n%s", want, got)
+		}
+	}
+}
+
+// TestAllowlistedHostPath: env-var override + glob matching.
+func TestAllowlistedHostPath(t *testing.T) {
+	t.Setenv("MOLECULE_EXTERNAL_REPO_ALLOWLIST", "")
+	if !allowlistedHostPath("git.moleculesai.app", "molecule-ai/foo") {
+		t.Error("default allowlist should accept molecule-ai/*")
+	}
+	if allowlistedHostPath("github.com", "molecule-ai/foo") {
+		t.Error("default allowlist should reject github.com")
+	}
+	t.Setenv("MOLECULE_EXTERNAL_REPO_ALLOWLIST", "github.com/me/*,git.moleculesai.app/*")
+	if !allowlistedHostPath("github.com", "me/x") {
+		t.Error("override should accept github.com/me/*")
+	}
+	if !allowlistedHostPath("git.moleculesai.app", "any/repo") {
+		t.Error("override should accept git.moleculesai.app/*")
+	}
+	if allowlistedHostPath("github.com", "evil/x") {
+		t.Error("override should reject github.com/evil/*")
+	}
+}
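The behavior that TestAllowlistedHostPath pins down (a default of `git.moleculesai.app/molecule-ai/*`, a comma-separated env override, and a trailing `*` that also covers nested paths such as `any/repo`) could be met by a prefix match like the one below. The real `allowlistedHostPath` is not part of this hunk, so this is only a sketch under that assumption; `allowed` and `defaultAllowlist` are hypothetical names. Note that `path.Match` would not fit here, since its `*` does not cross a `/`.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultAllowlist matches the default described in the tests above.
const defaultAllowlist = "git.moleculesai.app/molecule-ai/*"

// allowed reports whether host/repo matches any comma-separated entry.
// Assumption: a trailing "*" means "any suffix", including further slashes.
func allowed(allowlist, host, repo string) bool {
	if strings.TrimSpace(allowlist) == "" {
		allowlist = defaultAllowlist
	}
	target := host + "/" + repo
	for _, pat := range strings.Split(allowlist, ",") {
		pat = strings.TrimSpace(pat)
		switch {
		case pat == "":
			continue
		case strings.HasSuffix(pat, "*"):
			if strings.HasPrefix(target, strings.TrimSuffix(pat, "*")) {
				return true
			}
		case pat == target:
			return true
		}
	}
	return false
}

func main() {
	override := "github.com/me/*,git.moleculesai.app/*"
	fmt.Println(allowed("", "git.moleculesai.app", "molecule-ai/foo"))
	fmt.Println(allowed("", "github.com", "molecule-ai/foo"))
	fmt.Println(allowed(override, "github.com", "evil/x"))
}
```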
@@ -6,6 +6,7 @@ package handlers

 import (
 	"fmt"
+	"log"
 	"os"
 	"path/filepath"
 	"regexp"
@@ -102,6 +103,56 @@ func loadWorkspaceEnv(orgBaseDir, filesDir string) map[string]string {
 	return envVars
 }

+// loadPersonaEnvFile merges per-role persona credentials into out. The file
+// lives at $MOLECULE_PERSONA_ROOT/<role>/env (default
+// /etc/molecule-bootstrap/personas) and is populated by the operator-host
+// bootstrap kit — one persona per dev-tree role, each carrying the role's
+// Gitea identity (GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES,
+// GITEA_USER_EMAIL, GITEA_SSH_KEY_PATH).
+//
+// Lower precedence than the org and workspace .env files: callers should
+// invoke this BEFORE parseEnvFile on those, so a workspace .env can
+// override a persona-default value when needed.
+//
+// Silent no-op when role is empty, when the role name fails the safe-segment
+// check, or when the env file does not exist (workspaces without a role —
+// or running on hosts that don't ship the bootstrap dir — keep their old
+// behavior).
+func loadPersonaEnvFile(role string, out map[string]string) {
+	if !isSafeRoleName(role) {
+		if role != "" {
+			log.Printf("Org import: refusing persona env load for unsafe role name %q", role)
+		}
+		return
+	}
+	root := os.Getenv("MOLECULE_PERSONA_ROOT")
+	if root == "" {
+		root = "/etc/molecule-bootstrap/personas"
+	}
+	parseEnvFile(filepath.Join(root, role, "env"), out)
+}
+
+// isSafeRoleName accepts a single path segment of [A-Za-z0-9_-]+. Rejects
+// empty, ".", "..", and anything containing a path separator — even though
+// the construct is admin-only, defense-in-depth keeps the persona dir
+// shape invariant: one flat directory per role, no climbing out.
+func isSafeRoleName(s string) bool {
+	if s == "" || s == "." || s == ".." {
+		return false
+	}
+	for _, c := range s {
+		switch {
+		case c >= 'a' && c <= 'z':
+		case c >= 'A' && c <= 'Z':
+		case c >= '0' && c <= '9':
+		case c == '-' || c == '_':
+		default:
+			return false
+		}
+	}
+	return true
+}
+
 // parseEnvFile reads a .env file and adds KEY=VALUE pairs to the map.
 // Skips comments (#) and empty lines. Values can be quoted.
 func parseEnvFile(path string, out map[string]string) {
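To see which role names the safe-segment check in the hunk above lets through, the sketch below copies `isSafeRoleName` from the diff and wraps it in a hypothetical `personaEnvPath` helper (not in the PR) that builds the env file path only for accepted roles. The root directory passed in `main` is the default named in the diff's comment.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// isSafeRoleName is copied from the diff: one path segment of
// [A-Za-z0-9_-]+, rejecting empty, ".", "..", and separators.
func isSafeRoleName(s string) bool {
	if s == "" || s == "." || s == ".." {
		return false
	}
	for _, c := range s {
		switch {
		case c >= 'a' && c <= 'z':
		case c >= 'A' && c <= 'Z':
		case c >= '0' && c <= '9':
		case c == '-' || c == '_':
		default:
			return false
		}
	}
	return true
}

// personaEnvPath is a hypothetical helper for illustration: it returns the
// persona env file path and true, or "" and false for an unsafe role.
func personaEnvPath(root, role string) (string, bool) {
	if !isSafeRoleName(role) {
		return "", false
	}
	return filepath.Join(root, role, "env"), true
}

func main() {
	root := "/etc/molecule-bootstrap/personas"
	for _, role := range []string{"dev-lead", "core_be", "", "..", "a/b", "dev lead"} {
		p, ok := personaEnvPath(root, role)
		fmt.Printf("role=%q ok=%v path=%q\n", role, ok, p)
	}
}
```

Because the character whitelist already excludes `.` and `/`, the explicit `.` and `..` checks are redundant but make the intent obvious to a reader.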
@@ -250,6 +250,21 @@ func (h *OrgHandler) createWorkspaceTree(ws OrgWorkspace, parentID *string, absX
 		h.broadcaster.RecordAndBroadcast(ctx, string(events.EventWorkspaceOnline), id, map[string]interface{}{
 			"name": ws.Name, "external": true,
 		})
+	} else if IsMockRuntime(runtime) {
+		// Mock-runtime workspaces have no container, no EC2, no URL —
+		// the proxyA2ARequest short-circuit synthesises every reply
+		// from a canned variant pool (see mock_runtime.go). Status
+		// goes straight to 'online' so the canvas renders the node
+		// as reachable + the chat tab's send button is enabled. No
+		// URL is set; the proxy never tries to resolve one for mock
+		// runtimes. Built for the funding-demo "200-workspace mock
+		// org" template — visual scale without real backend cost.
+		if _, err := db.DB.ExecContext(ctx, `UPDATE workspaces SET status = $1 WHERE id = $2`, models.StatusOnline, id); err != nil {
+			log.Printf("Org import: mock workspace status update failed for %s: %v", ws.Name, err)
+		}
+		h.broadcaster.RecordAndBroadcast(ctx, string(events.EventWorkspaceOnline), id, map[string]interface{}{
+			"name": ws.Name, "mock": true, "runtime": runtime,
+		})
 	} else if h.workspace.HasProvisioner() {
 		// Provision container — either backend (CP for SaaS, local Docker
 		// for self-hosted) is fine. Pre-2026-05-05 this gate was
@@ -428,10 +443,18 @@ func (h *OrgHandler) createWorkspaceTree(ws OrgWorkspace, parentID *string, absX
 			configFiles["system-prompt.md"] = []byte(ws.SystemPrompt)
 		}

-		// Inject secrets from .env files as workspace secrets.
-		// Resolution: workspace .env → org root .env (workspace overrides org root).
+		// Inject secrets from persona env + .env files as workspace secrets.
+		// Resolution (later overrides earlier):
+		//   0. Persona env (per-role bootstrap creds; only when ws.Role is set
+		//      and the operator-host bootstrap dir ships a matching file)
+		//   1. Org root .env (shared defaults)
+		//   2. Workspace-specific .env (per-workspace overrides)
 		// Each line: KEY=VALUE → stored as encrypted workspace secret.
 		envVars := map[string]string{}
+		// 0. Persona env (lowest precedence; injects the role's Gitea identity:
+		//    GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES, GITEA_USER_EMAIL,
+		//    GITEA_SSH_KEY_PATH). Workspace and org .env can override.
+		loadPersonaEnvFile(ws.Role, envVars)
 		if orgBaseDir != "" {
 			// 1. Org root .env (shared defaults)
 			parseEnvFile(filepath.Join(orgBaseDir, ".env"), envVars)
@@ -675,7 +698,23 @@ func (h *OrgHandler) recurseChildrenForImport(ws OrgWorkspace, parentID string,
 		if err := h.createWorkspaceTree(child, &parentID, childAbsX, childAbsY, slotX, slotY, defaults, orgBaseDir, results, provisionSem); err != nil {
 			return err
 		}
-		time.Sleep(workspaceCreatePacingMs * time.Millisecond)
+		// Pacing exists to throttle Docker container-spawn thundering
+		// during a self-hosted import. Mock-runtime children spawn no
+		// container — no Docker pressure, no LLM bursts, just DB
+		// inserts + a broadcast. Skipping the 2s sleep collapses a
+		// 200-workspace mock-org import from ~7min → ~5s, which is
+		// the difference between a snappy demo and a "did it freeze?"
+		// staring contest. Real (containerful) runtimes still pace.
+		// Inheritance: if the child itself doesn't declare a runtime,
+		// fall back to defaults.runtime — the org template sets
+		// runtime: mock once at the org level, not on every IC node.
+		childRuntime := child.Runtime
+		if childRuntime == "" {
+			childRuntime = defaults.Runtime
+		}
+		if !IsMockRuntime(childRuntime) {
+			time.Sleep(workspaceCreatePacingMs * time.Millisecond)
+		}
 	}
 	return nil
 }
@@ -76,6 +76,12 @@ func expandNode(n *yaml.Node, currentDir, rootDir string, visited map[string]boo
 		return resolveIncludeScalar(n, currentDir, rootDir, visited, depth)
 	}

+	// `!external`-tagged mapping: gitops cross-repo subtree composition.
+	// See org_external.go (internal#77 / task #222).
+	if n.Kind == yaml.MappingNode && n.Tag == "!external" {
+		return resolveExternalMapping(n, currentDir, rootDir, visited, depth)
+	}
+
 	for _, child := range n.Content {
 		if err := expandNode(child, currentDir, rootDir, visited, depth); err != nil {
 			return err
@@ -0,0 +1,136 @@
+package handlers
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+
+	"gopkg.in/yaml.v3"
+)
+
+// Phase 5 (RFC internal#77 dev-department extraction):
+// Proves a parent org template can compose a subtree from a sibling repo
+// via a directory symlink. Pattern that gets shipped:
+//
+//   /org-templates/parent-template/                  ← imported by POST /org/import
+//     org.yaml                                       (workspaces: !include dev/dev-lead/workspace.yaml)
+//     dev → /org-templates/molecule-dev-department/  (symlink)
+//   /org-templates/molecule-dev-department/          (sibling repo)
+//     dev-lead/
+//       workspace.yaml                               (children: !include ./core-platform/workspace.yaml)
+//       core-platform/
+//         workspace.yaml
+//
+// resolveYAMLIncludes resolves paths via filepath.Abs/Rel (no symlink
+// following at the path-string layer), so the security check passes. The
+// actual file open uses os.ReadFile, which DOES follow symlinks — so the
+// content from the sibling repo gets inlined. This test pins that contract.
+func TestResolveYAMLIncludes_FollowsDirectorySymlink(t *testing.T) {
+	tmp := t.TempDir()
+
+	// Subtree repo: dev-department/dev-lead/...
+	devDept := filepath.Join(tmp, "molecule-dev-department")
+	devLead := filepath.Join(devDept, "dev-lead")
+	corePlatform := filepath.Join(devLead, "core-platform")
+	if err := os.MkdirAll(corePlatform, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	// dev-lead/workspace.yaml — uses `./core-platform/workspace.yaml` (relative
+	// to its own dir, which once the symlink is followed is dev-department/dev-lead/).
+	devLeadYAML := []byte(`name: Dev Lead
+tier: 3
+children:
+  - !include ./core-platform/workspace.yaml
+`)
+	if err := os.WriteFile(filepath.Join(devLead, "workspace.yaml"), devLeadYAML, 0o644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(corePlatform, "workspace.yaml"), []byte("name: Core Platform\ntier: 3\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	// Parent template: parent/, with `dev` symlink → ../molecule-dev-department/
+	parent := filepath.Join(tmp, "parent-template")
+	if err := os.MkdirAll(parent, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	// Symlink TARGET is a relative path (matches operator-side deploy
+	// convention where both repos are cloned as siblings under a shared
+	// /org-templates/ dir).
+	if err := os.Symlink("../molecule-dev-department", filepath.Join(parent, "dev")); err != nil {
+		t.Skipf("symlinks unsupported on this fs: %v", err)
+	}
+
+	// Parent's org.yaml: !include into the symlinked subtree.
+	src := []byte(`name: Parent
+workspaces:
+  - !include dev/dev-lead/workspace.yaml
+`)
+
+	out, err := resolveYAMLIncludes(src, parent)
+	if err != nil {
+		t.Fatalf("resolveYAMLIncludes through symlink failed: %v", err)
+	}
+
+	var tmpl OrgTemplate
+	if err := yaml.Unmarshal(out, &tmpl); err != nil {
+		t.Fatalf("unmarshal: %v", err)
+	}
+	if len(tmpl.Workspaces) != 1 {
+		t.Fatalf("expected 1 workspace, got %d", len(tmpl.Workspaces))
+	}
+	if tmpl.Workspaces[0].Name != "Dev Lead" {
+		t.Fatalf("workspace[0].Name = %q; want Dev Lead", tmpl.Workspaces[0].Name)
+	}
+	kids := tmpl.Workspaces[0].Children
+	if len(kids) != 1 {
+		t.Fatalf("expected 1 child workspace, got %d", len(kids))
+	}
+	if kids[0].Name != "Core Platform" {
+		t.Fatalf("child[0].Name = %q; want Core Platform — symlink-aware nested !include broken", kids[0].Name)
+	}
+}
+
+// Companion: pin the CURRENT behavior when the symlink target is OUTSIDE
+// the parent template's root (the "hostile symlink" case). Despite the
+// test name, the resolver follows the link today; see the note below.
+func TestResolveYAMLIncludes_RejectsSymlinkEscapingRoot(t *testing.T) {
+	tmp := t.TempDir()
+	parent := filepath.Join(tmp, "parent-template")
+	outside := filepath.Join(tmp, "outside")
+	if err := os.MkdirAll(parent, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.MkdirAll(outside, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(outside, "evil.yaml"), []byte("name: Evil\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	// Symlink whose target escapes the parent root. The path STRING
+	// `evil.yaml` resolves to parent/evil.yaml, which passes the rel2
+	// check. filepath.Abs doesn't follow symlinks, but os.ReadFile DOES,
+	// so the read lands on outside/evil.yaml. This is the trade-off the
+	// symlink approach accepts: the security boundary is a deployment-layer
+	// invariant, not a code-layer one. Documented in dev-department/README.
+	if err := os.Symlink(filepath.Join(outside, "evil.yaml"), filepath.Join(parent, "evil.yaml")); err != nil {
+		t.Skipf("symlinks unsupported on this fs: %v", err)
+	}
+	src := []byte("workspaces:\n  - !include evil.yaml\n")
+	out, err := resolveYAMLIncludes(src, parent)
+	if err != nil {
+		// If the resolver is later hardened to refuse symlink targets
+		// outside the root (e.g. via filepath.EvalSymlinks), this test
+		// will start failing — and the dev-department symlink approach
+		// would need to be updated accordingly.
+		t.Fatalf("expected symlink to resolve under current resolver; got error: %v", err)
+	}
+	var tmpl OrgTemplate
+	if err := yaml.Unmarshal(out, &tmpl); err != nil {
+		t.Fatalf("unmarshal: %v", err)
+	}
+	if len(tmpl.Workspaces) != 1 || tmpl.Workspaces[0].Name != "Evil" {
+		t.Fatalf("expected Evil workspace via symlink; got %+v", tmpl.Workspaces)
+	}
+}
@@ -0,0 +1,171 @@
+package handlers
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+)
+
+// TestLoadPersonaEnvFile_HappyPath: the standard case — a persona-shaped
+// env file exists at <root>/<role>/env and its KEY=VALUE pairs land in
+// the out map. Mirrors what the operator-host bootstrap kit ships:
+// GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES, GITEA_USER_EMAIL,
+// GITEA_SSH_KEY_PATH.
+func TestLoadPersonaEnvFile_HappyPath(t *testing.T) {
+	root := t.TempDir()
+	roleDir := filepath.Join(root, "dev-lead")
+	if err := os.MkdirAll(roleDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	envBody := `# Persona env file — mode 600
+GITEA_USER=dev-lead
+GITEA_USER_EMAIL=dev-lead@agents.moleculesai.app
+GITEA_TOKEN=abc123
+GITEA_TOKEN_SCOPES=write:repository,write:issue,read:user
+GITEA_SSH_KEY_PATH=/etc/molecule-bootstrap/personas/dev-lead/ssh_priv
+`
+	if err := os.WriteFile(filepath.Join(roleDir, "env"), []byte(envBody), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("MOLECULE_PERSONA_ROOT", root)
+
+	out := map[string]string{}
+	loadPersonaEnvFile("dev-lead", out)
+
+	want := map[string]string{
+		"GITEA_USER":           "dev-lead",
+		"GITEA_USER_EMAIL":     "dev-lead@agents.moleculesai.app",
+		"GITEA_TOKEN":          "abc123",
+		"GITEA_TOKEN_SCOPES":   "write:repository,write:issue,read:user",
+		"GITEA_SSH_KEY_PATH":   "/etc/molecule-bootstrap/personas/dev-lead/ssh_priv",
+	}
+	if len(out) != len(want) {
+		t.Fatalf("got %d keys, want %d: %#v", len(out), len(want), out)
+	}
+	for k, v := range want {
+		if out[k] != v {
+			t.Errorf("out[%q] = %q; want %q", k, out[k], v)
+		}
+	}
+}
+
+// TestLoadPersonaEnvFile_MissingDir: when the persona dir doesn't exist
+// (e.g. dev-only host without the bootstrap kit, or a workspace whose
+// role isn't a known persona), it's a silent no-op — out stays empty,
+// no panic, no log noise that would break callers.
+func TestLoadPersonaEnvFile_MissingDir(t *testing.T) {
+	t.Setenv("MOLECULE_PERSONA_ROOT", t.TempDir()) // empty dir
+	out := map[string]string{}
+	loadPersonaEnvFile("nonexistent-role", out)
+	if len(out) != 0 {
+		t.Errorf("expected empty out, got %#v", out)
+	}
+}
+
+// TestLoadPersonaEnvFile_EmptyRole: empty role string is the common case
+// for non-dev workspaces (research/marketing/etc.). Skip silently.
+func TestLoadPersonaEnvFile_EmptyRole(t *testing.T) {
+	t.Setenv("MOLECULE_PERSONA_ROOT", t.TempDir())
+	out := map[string]string{}
+	loadPersonaEnvFile("", out)
+	if len(out) != 0 {
+		t.Errorf("empty role should produce empty out; got %#v", out)
+	}
+}
+
+// TestLoadPersonaEnvFile_RejectsTraversal: even though role names come
+// from server-side admin-only org templates, defense-in-depth — refuse
+// any role string with path separators or "..". Verifies that a maliciously
+// crafted template can't read /etc/passwd by setting role: "../../etc".
+func TestLoadPersonaEnvFile_RejectsTraversal(t *testing.T) {
+	root := t.TempDir()
+	// Plant a file at /tmp/.../env so a bad traversal would reach it
+	if err := os.WriteFile(filepath.Join(root, "env"), []byte("STOLEN=yes\n"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("MOLECULE_PERSONA_ROOT", filepath.Join(root, "personas"))
+
+	for _, bad := range []string{"..", "../personas", "../etc/passwd", "/abs", "with/slash", "dot.in.middle", "with space", "back\\slash", ".", ""} {
+		out := map[string]string{}
+		loadPersonaEnvFile(bad, out)
+		if len(out) != 0 {
+			t.Errorf("role %q should have been rejected; got %#v", bad, out)
+		}
+	}
+}
+
+// TestLoadPersonaEnvFile_DefaultRoot: when MOLECULE_PERSONA_ROOT is unset,
+// the helper falls back to /etc/molecule-bootstrap/personas. We don't
+// touch real /etc — just verify the function doesn't panic and produces
+// empty out (since the test box isn't expected to ship that path).
+func TestLoadPersonaEnvFile_DefaultRoot(t *testing.T) {
+	t.Setenv("MOLECULE_PERSONA_ROOT", "") // explicit empty
+	out := map[string]string{}
+	loadPersonaEnvFile("dev-lead", out)
+	// Don't assert content — production CI might or might not have the
+	// /etc dir mounted. Just verify the call returns cleanly.
+	_ = out
+}
+
+// TestLoadPersonaEnvFile_OverwritesEmptyMap: the contract is "lower
+// precedence than later .env files." The helper itself does not protect
+// pre-existing keys: like parseEnvFile, it overwrites whatever is already
+// in out. Precedence is therefore enforced by caller-side ordering
+// (persona → org → workspace), with persona loaded first into a fresh
+// map so later layers can override. This test pins the overwrite
+// semantics by calling the helper on a pre-populated map and checking
+// that the persona value wins.
+func TestLoadPersonaEnvFile_OverwritesEmptyMap(t *testing.T) {
+	root := t.TempDir()
+	roleDir := filepath.Join(root, "core-be")
+	if err := os.MkdirAll(roleDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(roleDir, "env"),
+		[]byte("GITEA_TOKEN=persona-value\n"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("MOLECULE_PERSONA_ROOT", root)
+
+	out := map[string]string{"GITEA_TOKEN": "preset"}
+	loadPersonaEnvFile("core-be", out)
+
+	// Persona helper is meant to populate a FRESH map first in the
+	// caller's flow; calling it on a pre-populated map and seeing the
+	// value get overwritten is consistent with parseEnvFile semantics.
+	if out["GITEA_TOKEN"] != "persona-value" {
+		t.Errorf("loadPersonaEnvFile did not write into existing map; got %q", out["GITEA_TOKEN"])
+	}
+}
+
+// TestIsSafeRoleName_Acceptance: positive + negative cases for the
+// validator. Pinned because every dev-tree role name must pass.
+func TestIsSafeRoleName_Acceptance(t *testing.T) {
+	good := []string{
+		"dev-lead", "core-be", "cp-security", "infra-runtime-be",
+		"sdk-dev", "plugin-dev", "documentation-specialist",
+		"triage-operator", "fullstack-engineer", "release-manager",
+		"core_underscore_ok", "X", "a1", "Z9-0",
+	}
+	for _, s := range good {
+		if !isSafeRoleName(s) {
+			t.Errorf("isSafeRoleName(%q) = false; want true", s)
+		}
+	}
+	// A trailing hyphen (e.g. "trailing-") is allowed by the validator,
+	// so it is deliberately absent from the rejected list.
+	bad := []string{
+		"", ".", "..",
+		"with/slash", "/abs",
+		"dot.in.middle",
+		"with space",
+		"back\\slash",
+		"with$dollar", "with?question",
+		"newline\nsplit",
+	}
+	for _, s := range bad {
+		if isSafeRoleName(s) {
+			t.Errorf("isSafeRoleName(%q) = true; want false", s)
+		}
+	}
+}
@@ -4,6 +4,7 @@ import (
 	"bytes"
 	"context"
 	"io"
+	"log"
 	"os"
 	"path/filepath"
 	"strings"
@@ -22,6 +23,16 @@ import (
 // workspace-scoped filtering (handler falls back to unfiltered list).
 type RuntimeLookup func(workspaceID string) (string, error)

+// InstanceIDLookup resolves a workspace's EC2 instance_id by ID. Empty
+// string means the workspace is not on the SaaS (EC2-per-workspace)
+// backend — i.e. either local-Docker or pre-provision. The handler uses
+// this to dispatch plugin install/uninstall to the EIC SSH path
+// (template_files_eic.go primitive) when a workspace runs on its own EC2
+// and there's no local Docker container to exec into. A nil lookup keeps
+// the handler on the local-Docker code path only — same shape as the
+// pre-fix behaviour.
+type InstanceIDLookup func(workspaceID string) (string, error)
+
 // pluginSources is the contract PluginsHandler uses to talk to the
 // plugin source registry. Extracted as an interface (#1814) so tests can
 // substitute a stub without standing up the real *plugins.Registry +
@@ -45,10 +56,11 @@ var _ pluginSources = (*plugins.Registry)(nil)

 // PluginsHandler manages the plugin registry and per-workspace plugin installation.
 type PluginsHandler struct {
-	pluginsDir    string         // host path to plugins/ registry
-	docker        *client.Client // Docker client for container operations
-	restartFunc   func(string)   // auto-restart workspace after install/uninstall
-	runtimeLookup RuntimeLookup  // workspace_id → runtime (optional)
+	pluginsDir       string           // host path to plugins/ registry
+	docker           *client.Client   // Docker client for container operations
+	restartFunc      func(string)     // auto-restart workspace after install/uninstall
+	runtimeLookup    RuntimeLookup    // workspace_id → runtime (optional)
+	instanceIDLookup InstanceIDLookup // workspace_id → EC2 instance_id (optional)
 	// sources narrowed from `*plugins.Registry` to the pluginSources
 	// interface (#1814) so tests can substitute a stub. Production
 	// callers still pass *plugins.Registry, which satisfies the
@@ -89,6 +101,15 @@ func (h *PluginsHandler) WithRuntimeLookup(lookup RuntimeLookup) *PluginsHandler
 	return h
 }

+// WithInstanceIDLookup installs a workspace → EC2 instance_id resolver.
+// Wired by the router so production hits a real DB; tests stub it. The
+// install/uninstall pipeline uses this to dispatch to the EIC SSH path
+// for SaaS workspaces (no local Docker container to exec into).
+func (h *PluginsHandler) WithInstanceIDLookup(lookup InstanceIDLookup) *PluginsHandler {
+	h.instanceIDLookup = lookup
+	return h
+}
+
 // pluginInfo is the API response for a plugin.
 type pluginInfo struct {
 	Name        string   `json:"name"`
@@ -177,16 +198,42 @@ func strDefault(m map[string]interface{}, key, fallback string) string {
 	return fallback
 }

+// findRunningContainer returns the live container name for workspaceID, or ""
+// when the container is genuinely not running OR the daemon errored
+// transiently. Routed through provisioner.RunningContainerName as the SSOT
+// (molecule-core#10) so this handler agrees with healthsweep on the same
+// inputs. Transient daemon errors are logged distinctly so triage doesn't
+// confuse a flaky daemon with a stopped container.
 func (h *PluginsHandler) findRunningContainer(ctx context.Context, workspaceID string) string {
-	if h.docker == nil {
+	name, err := provisioner.RunningContainerName(ctx, h.docker, workspaceID)
+	if err != nil {
+		log.Printf("plugins: docker inspect transient error for %s: %v (treating as not-running for this request)", workspaceID, err)
 		return ""
 	}
-	name := provisioner.ContainerName(workspaceID)
-	info, err := h.docker.ContainerInspect(ctx, name)
-	if err == nil && info.State.Running {
-		return name
+	return name
+}
+
+// isExternalRuntime reports whether the workspace's runtime is the
+// `external` (remote-pull) shape introduced in Phase 30. External
+// workspaces have no local container — `POST /plugins` (push-install via
+// docker exec) doesn't apply to them; they pull via the download endpoint
+// instead. Returns false (allow-install) if the lookup is unwired or
+// errors — failing open here is safe because the downstream
+// findRunningContainer step still gates on a real container being there.
+//
+// Background — molecule-core#10: without this check, external workspaces
+// fall through to findRunningContainer's NotFound path and return a
+// misleading 503 "container not running" instead of a clear "use the
+// pull endpoint" message.
+func (h *PluginsHandler) isExternalRuntime(workspaceID string) bool {
+	if h.runtimeLookup == nil {
+		return false
 	}
-	return ""
+	runtime, err := h.runtimeLookup(workspaceID)
+	if err != nil {
+		return false
+	}
+	return runtime == "external"
 }

 func (h *PluginsHandler) execAsRoot(ctx context.Context, containerName string, cmd []string) (string, error) {
@@ -0,0 +1,207 @@
+package handlers
+
+// plugins_atomic.go — atomic install pattern for plugin delivery into a
+// running workspace container. Closes molecule-core#114.
+//
+// Replaces the prior "tar + docker.CopyToContainer to /configs/plugins/<name>"
+// single-step write (no atomicity, no marker, no rollback) with a 4-step
+// dance:
+//
+//   1. STAGE     — extract tar into /configs/plugins/.staging/<name>.<ts>/
+//   2. SNAPSHOT  — if /configs/plugins/<name>/ exists, mv to .previous/<name>.<ts>/
+//   3. SWAP      — mv /configs/plugins/.staging/<name>.<ts>/ → /configs/plugins/<name>/
+//   4. MARKER    — touch /configs/plugins/<name>/.complete
+//
+// On a swap failure we attempt a best-effort rollback by mv-ing the
+// previous snapshot back into place; a marker failure leaves the new
+// content live. The .complete marker is the canonical "fully landed"
+// signal: workspace-side loaders should refuse a plugin dir without it.
+//
+// Scope: docker path only (workspace running as a local container). The
+// SaaS path (deliverViaEIC, SSH-into-EC2) is unchanged in this PR; tracked
+// as a follow-up. The same stage-then-swap shape applies but the exec
+// primitives differ (ssh vs docker exec), and shipping both paths in one
+// PR doubles the test surface.
+
+import (
+	"bytes"
+	"context"
+	"fmt"
+	"path"
+	"strings"
+	"time"
+
+	"github.com/docker/docker/api/types/container"
+)
+
+const (
+	pluginsRoot       = "/configs/plugins"
+	pluginsStagingDir = "/configs/plugins/.staging"
+	pluginsPrevDir    = "/configs/plugins/.previous"
+	completeMarker    = ".complete"
+)
+
+// installVersion identifies one install attempt: the plugin name plus a
+// UTC timestamp suffix (second precision). It namespaces the staging dir
+// and any snapshot of the previous version. Two installs of one plugin in
+// the same second share a stamp, so callers must serialize installs.
+type installVersion struct {
+	plugin string
+	stamp  string // e.g. 20260508T141530Z
+}
+
+func newInstallVersion(plugin string) installVersion {
+	return installVersion{
+		plugin: plugin,
+		stamp:  time.Now().UTC().Format("20060102T150405Z"),
+	}
+}
+
+// stagedPath is the container path where the new content lands during fetch.
+// e.g. /configs/plugins/.staging/molecule-skill-foo.20260508T141530Z
+func (v installVersion) stagedPath() string {
+	return path.Join(pluginsStagingDir, v.plugin+"."+v.stamp)
+}
+
+// previousPath is where the prior live version is moved before swap.
+// e.g. /configs/plugins/.previous/molecule-skill-foo.20260508T141530Z
+func (v installVersion) previousPath() string {
+	return path.Join(pluginsPrevDir, v.plugin+"."+v.stamp)
+}
+
+// livePath is the destination after swap.
+// e.g. /configs/plugins/molecule-skill-foo
+func (v installVersion) livePath() string {
+	return path.Join(pluginsRoot, v.plugin)
+}
+
+// markerPath is the .complete file inside the live dir written last.
+func (v installVersion) markerPath() string {
+	return path.Join(v.livePath(), completeMarker)
+}
+
+// atomicCopyToContainer does a stage→snapshot→swap→marker install of a
+// host-side staged plugin tree into a running container's
+// /configs/plugins/<name>/. Returns nil on success.
+//
+// On swap failure, best-effort rollback restores the previous snapshot
+// to the live path and the swap error is returned wrapped; if rollback
+// also fails, both errors are reported. A marker-write failure does not
+// roll back: the new content stays live and the error says how to recover.
+func (h *PluginsHandler) atomicCopyToContainer(
+	ctx context.Context, containerName, hostDir, pluginName string,
+) error {
+	v := newInstallVersion(pluginName)
+
+	// Step 0a: ensure staging + previous root dirs exist (idempotent).
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"mkdir", "-p", pluginsStagingDir, pluginsPrevDir,
+	}); err != nil {
+		return fmt.Errorf("atomic install: mkdir staging/previous: %w", err)
+	}
+
+	// Step 0b: tar the host content with a path prefix that lands it in the
+	// staging dir — NOT directly into the live name. The prefix has no
+	// leading "/" because docker.CopyToContainer extracts paths relative
+	// to the dstPath argument we pass below.
+	stagedRel := strings.TrimPrefix(v.stagedPath(), "/")
+	tarBuf, err := tarHostDirWithPrefix(hostDir, stagedRel)
+	if err != nil {
+		return fmt.Errorf("atomic install: tar host dir: %w", err)
+	}
+
+	// Step 1: STAGE — extract tar into /configs/plugins/.staging/<name>.<ts>/
+	if err := h.docker.CopyToContainer(ctx, containerName, "/", &tarBuf,
+		container.CopyToContainerOptions{}); err != nil {
+		// Best-effort: clean up any partial staging extract before returning.
+		_, _ = h.execAsRoot(ctx, containerName, []string{
+			"rm", "-rf", v.stagedPath(),
+		})
+		return fmt.Errorf("atomic install: copy to container: %w", err)
+	}
+
+	// Step 2: SNAPSHOT — if a live version exists, move it aside.
+	// `test -d` exits 0 if the dir exists, non-zero otherwise; the helper
+	// returns a non-nil error in the non-zero case which we treat as
+	// "no previous version" rather than a real failure.
+	snapshotted := false
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"test", "-d", v.livePath(),
+	}); err == nil {
+		if _, err := h.execAsRoot(ctx, containerName, []string{
+			"mv", v.livePath(), v.previousPath(),
+		}); err != nil {
+			// Snapshot failure: roll back the staged extract before failing.
+			_, _ = h.execAsRoot(ctx, containerName, []string{
+				"rm", "-rf", v.stagedPath(),
+			})
+			return fmt.Errorf("atomic install: snapshot previous version: %w", err)
+		}
+		snapshotted = true
+	}
+
+	// Step 3: SWAP — atomic rename of the staged dir into the live name.
+	// `mv` on the same filesystem is a single rename(2), atomic at the FS level.
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"mv", v.stagedPath(), v.livePath(),
+	}); err != nil {
+		// Swap failure: roll back if we had a snapshot.
+		if snapshotted {
+			if _, rbErr := h.execAsRoot(ctx, containerName, []string{
+				"mv", v.previousPath(), v.livePath(),
+			}); rbErr != nil {
+				return fmt.Errorf("atomic install: swap failed AND rollback failed: swap=%w, rollback=%v", err, rbErr)
+			}
+		}
+		// Best-effort cleanup of the still-staged dir.
+		_, _ = h.execAsRoot(ctx, containerName, []string{
+			"rm", "-rf", v.stagedPath(),
+		})
+		return fmt.Errorf("atomic install: swap to live path: %w", err)
+	}
+
+	// Step 4: MARKER — touch .complete inside the live dir as the last write.
+	// Workspace-side plugin loaders are expected to treat a plugin dir
+	// without this marker as half-installed and skip it or surface a clear
+	// error (consumer-side enforcement is a follow-up, not in this PR).
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"touch", v.markerPath(),
+	}); err != nil {
+		// Marker write failure with the new content already in place is an
+		// odd state: the content is fine on disk, but a marker-aware loader
+		// will refuse to use it. Return a loud error; do NOT roll back, since
+		// the content is the latest, just unmarked. Operator can manually
+		// `touch <plugin>/.complete` to recover.
+		return fmt.Errorf("atomic install: write .complete marker (content landed but unmarked, manual recovery: touch %s): %w", v.markerPath(), err)
+	}
+
+	// Step 5: GC — best-effort delete the previous snapshot. Failures here
+	// just leave a directory; not load-bearing for correctness, the next
+	// install or a separate sweeper will reclaim the space.
+	if snapshotted {
+		_, _ = h.execAsRoot(ctx, containerName, []string{
+			"rm", "-rf", v.previousPath(),
+		})
+	}
+
+	return nil
+}
+
+// tarHostDirWithPrefix walks hostDir and writes a tar to a buffer with
+// every entry's name prefixed by `prefix`. Mirrors the prior streaming
+// shape used in copyPluginToContainer, which hardcoded "plugins/<name>/";
+// here the prefix is the full staging path, so the extracted layout is
+// the staging dir directly.
+// Symlinks are skipped (same posture as streamDirAsTar in
+// plugins_install_pipeline.go) so a hostile plugin cannot inject a
+// symlink that, post-extract, points outside the plugin's own dir.
+func tarHostDirWithPrefix(hostDir, prefix string) (bytes.Buffer, error) {
+	var buf bytes.Buffer
+	tw := newTarWriter(&buf)
+	err := tarWalk(hostDir, prefix, tw)
+	// Close before returning; a deferred Close would run after buf is copied.
+	if cerr := tw.Close(); err == nil {
+		err = cerr
+	}
+	return buf, err
+}
@@ -0,0 +1,70 @@
+package handlers
+
+// plugins_atomic_tar.go — tar-walk helpers split out so the main atomic
+// install flow stays readable. The prefix argument lets the caller
+// arrange where the tar's contents land at extract time.
+
+import (
+	"archive/tar"
+	"io"
+	"os"
+	"path/filepath"
+)
+
+// newTarWriter is a thin wrapper so plugins_atomic_test.go can swap the
+// writer destination if it needs to.
+func newTarWriter(w io.Writer) *tar.Writer {
+	return tar.NewWriter(w)
+}
+
+// tarWalk walks hostDir and writes every regular file + dir to the tar
+// writer with paths of the form `<prefix>/<relative>`. Symlinks are
+// skipped — same posture as streamDirAsTar in plugins_install_pipeline.go.
+//
+// The trailing-slash on prefix is normalized away: prefix "foo" and
+// prefix "foo/" produce identical archives.
+func tarWalk(hostDir, prefix string, tw *tar.Writer) error {
+	prefix = filepath.Clean(prefix)
+	return filepath.Walk(hostDir, func(p string, info os.FileInfo, err error) error {
+		if err != nil {
+			return err
+		}
+		if info.Mode()&os.ModeSymlink != 0 {
+			return nil // skip symlinks; see doc above
+		}
+		rel, err := filepath.Rel(hostDir, p)
+		if err != nil {
+			return err
+		}
+		if rel == "." {
+			// Emit the prefix dir itself once, with the source dir's mode.
+			hdr, err := tar.FileInfoHeader(info, "")
+			if err != nil {
+				return err
+			}
+			hdr.Name = prefix + "/"
+			return tw.WriteHeader(hdr)
+		}
+		hdr, err := tar.FileInfoHeader(info, "")
+		if err != nil {
+			return err
+		}
+		hdr.Name = filepath.Join(prefix, rel)
+		if info.IsDir() {
+			hdr.Name += "/"
+		}
+		if err := tw.WriteHeader(hdr); err != nil {
+			return err
+		}
+		if !info.Mode().IsRegular() {
+			return nil
+		}
+		f, err := os.Open(p)
+		if err != nil {
+			return err
+		}
+		defer f.Close()
+		_, err = io.Copy(tw, f)
+		return err
+	})
+}
@@ -0,0 +1,193 @@
+package handlers
+
+import (
+	"archive/tar"
+	"bytes"
+	"io"
+	"os"
+	"path/filepath"
+	"sort"
+	"strings"
+	"testing"
+	"time"
+)
+
+// TestInstallVersion_Paths: the path helpers must produce a stable shape
+// the in-container exec calls depend on. Pinning the layout here
+// catches a future refactor that accidentally changes where staging /
+// previous / live dirs live, which would break the swap atomicity.
+func TestInstallVersion_Paths(t *testing.T) {
+	v := installVersion{plugin: "molecule-skill-foo", stamp: "20260508T141530Z"}
+
+	if got, want := v.stagedPath(), "/configs/plugins/.staging/molecule-skill-foo.20260508T141530Z"; got != want {
+		t.Errorf("stagedPath = %q; want %q", got, want)
+	}
+	if got, want := v.previousPath(), "/configs/plugins/.previous/molecule-skill-foo.20260508T141530Z"; got != want {
+		t.Errorf("previousPath = %q; want %q", got, want)
+	}
+	if got, want := v.livePath(), "/configs/plugins/molecule-skill-foo"; got != want {
+		t.Errorf("livePath = %q; want %q", got, want)
+	}
+	if got, want := v.markerPath(), "/configs/plugins/molecule-skill-foo/.complete"; got != want {
+		t.Errorf("markerPath = %q; want %q", got, want)
+	}
+}
+
+// TestInstallVersion_StampShape: two newInstallVersion calls within the
+// same second produce the same stamp (we use second precision). The
+// caller relies on the mv-rename being atomic, so collision-free stamping
+// is NOT a correctness requirement. But a regression that changes the
+// stamp shape (e.g. RFC3339 with colons) would break the exec calls:
+// path.Join treats a colon as a regular char but ssh + docker exec
+// generally don't. Pin the no-colon shape.
+func TestInstallVersion_StampShape(t *testing.T) {
+	v := newInstallVersion("anything")
+	if strings.Contains(v.stamp, ":") {
+		t.Errorf("stamp must not contain colons (breaks shell-quoting in exec): %q", v.stamp)
+	}
+	if strings.Contains(v.stamp, " ") {
+		t.Errorf("stamp must not contain spaces: %q", v.stamp)
+	}
+	// Sanity: stamp parses as the documented format.
+	if _, err := time.Parse("20060102T150405Z", v.stamp); err != nil {
+		t.Errorf("stamp %q does not parse as 20060102T150405Z: %v", v.stamp, err)
+	}
+}
+
+// TestTarHostDirWithPrefix_HappyPath: walks a host dir, builds a tar with
+// the configured prefix, verifies every entry's name is rooted under
+// the prefix, and the file contents survive round-trip.
+func TestTarHostDirWithPrefix_HappyPath(t *testing.T) {
+	hostDir := t.TempDir()
+
+	// Plant: <host>/plugin.yaml + <host>/skills/foo/SKILL.md + <host>/.complete
+	files := map[string]string{
+		"plugin.yaml":             "name: foo\nversion: 1.0.0\n",
+		"skills/foo/SKILL.md":     "# Foo skill\n",
+		".complete":                "", // upstream may already have a marker
+	}
+	for rel, body := range files {
+		full := filepath.Join(hostDir, rel)
+		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
+			t.Fatal(err)
+		}
+		if err := os.WriteFile(full, []byte(body), 0o644); err != nil {
+			t.Fatal(err)
+		}
+	}
+
+	prefix := "configs/plugins/.staging/foo.20260508T141530Z"
+	buf, err := tarHostDirWithPrefix(hostDir, prefix)
+	if err != nil {
+		t.Fatalf("tar: %v", err)
+	}
+
+	// Read back the tar; collect names + body for regular files.
+	got := map[string]string{}
+	tr := tar.NewReader(&buf)
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			t.Fatalf("tar reader: %v", err)
+		}
+		// Every entry must start with the prefix
+		if !strings.HasPrefix(hdr.Name, prefix) {
+			t.Errorf("entry %q does not start with prefix %q", hdr.Name, prefix)
+		}
+		if hdr.Typeflag == tar.TypeReg {
+			body, err := io.ReadAll(tr)
+			if err != nil {
+				t.Fatal(err)
+			}
+			rel := strings.TrimPrefix(hdr.Name, prefix+"/")
+			got[rel] = string(body)
+		}
+	}
+
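+	for rel, want := range files {
+		body, ok := got[rel]
+		if !ok {
+			// Without this check an empty want ("") would match a
+			// missing entry, since a map miss also yields "".
+			t.Errorf("entry %q missing from tar", rel)
+			continue
+		}
+		if body != want {
+			t.Errorf("body[%q] = %q; want %q", rel, body, want)
+		}
+	}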
+}
+
+// TestTarHostDirWithPrefix_SkipsSymlinks: a hostile plugin shouldn't be
+// able to ship a symlink that, post-extract, points outside its own
+// dir. The walker silently skips symlinks (same posture as
+// streamDirAsTar). Verify a planted symlink doesn't appear in the tar.
+func TestTarHostDirWithPrefix_SkipsSymlinks(t *testing.T) {
+	hostDir := t.TempDir()
+	// Plant a real file + a symlink pointing outside hostDir.
+	if err := os.WriteFile(filepath.Join(hostDir, "real.txt"), []byte("ok"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	target := filepath.Join(t.TempDir(), "outside")
+	if err := os.WriteFile(target, []byte("SHOULD NOT APPEAR"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.Symlink(target, filepath.Join(hostDir, "evil")); err != nil {
+		t.Fatal(err)
+	}
+
+	buf, err := tarHostDirWithPrefix(hostDir, "p")
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	names := []string{}
+	tr := tar.NewReader(&buf)
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			t.Fatal(err)
+		}
+		names = append(names, hdr.Name)
+	}
+	sort.Strings(names)
+
+	for _, n := range names {
+		if strings.Contains(n, "evil") {
+			t.Errorf("symlink leaked into tar: %q", n)
+		}
+	}
+	// real.txt should be present
+	found := false
+	for _, n := range names {
+		if strings.HasSuffix(n, "real.txt") {
+			found = true
+			break
+		}
+	}
+	if !found {
+		t.Errorf("real.txt missing from tar; got names: %v", names)
+	}
+}
+
+// TestTarHostDirWithPrefix_PrefixNormalization: trailing slash on prefix
+// should not change the archive shape. Pinning this so a future caller
+// passing "foo/" instead of "foo" doesn't double-slash entry names.
+func TestTarHostDirWithPrefix_PrefixNormalization(t *testing.T) {
+	hostDir := t.TempDir()
+	if err := os.WriteFile(filepath.Join(hostDir, "x"), []byte("y"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	a, err := tarHostDirWithPrefix(hostDir, "foo")
+	if err != nil {
+		t.Fatal(err)
+	}
+	b, err := tarHostDirWithPrefix(hostDir, "foo/")
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	if !bytes.Equal(a.Bytes(), b.Bytes()) {
+		t.Errorf("trailing-slash on prefix changed archive shape; tarHostDirWithPrefix should be slash-insensitive")
+	}
+}
@@ -0,0 +1,176 @@
+package handlers
+
+import (
+	"go/ast"
+	"go/parser"
+	"go/token"
+	"strings"
+	"testing"
+)
+
+// TestFindRunningContainer_RoutesThroughProvisionerSSOT is a behavior-based
+// AST gate: it pins the invariant that PluginsHandler.findRunningContainer
+// MUST go through provisioner.RunningContainerName for its is-running check,
+// instead of carrying its own copy of cli.ContainerInspect logic.
+//
+// Background — molecule-core#10: a parallel impl of "is the workspace's
+// container running" used to live in plugins.go. It drifted from the
+// canonical impl in healthsweep (which goes through Provisioner.IsRunning
+// → RunningContainerName) on edge cases like "transient daemon error" —
+// the duplicate would 503 with a misleading message while healthsweep
+// correctly stayed defensive. Consolidating onto RunningContainerName as
+// the SSOT prevents any future copy from re-introducing that drift.
+//
+// Mutation invariant: if a future PR replaces the provisioner call with
+// `h.docker.ContainerInspect(...)` directly, this test fails. That's the
+// signal to either (a) extend RunningContainerName's contract OR (b)
+// document why this call site needs to differ. Either way: the drift
+// gets a reviewer's attention instead of shipping silently.
+func TestFindRunningContainer_RoutesThroughProvisionerSSOT(t *testing.T) {
+	fset := token.NewFileSet()
+	file, err := parser.ParseFile(fset, "plugins.go", nil, parser.ParseComments)
+	if err != nil {
+		t.Fatalf("parse plugins.go: %v", err)
+	}
+
+	var fn *ast.FuncDecl
+	ast.Inspect(file, func(n ast.Node) bool {
+		f, ok := n.(*ast.FuncDecl)
+		if !ok || f.Name.Name != "findRunningContainer" {
+			return true
+		}
+		// Confirm the receiver is *PluginsHandler so we don't pick up an
+		// unrelated helper of the same name. receiverIs unwraps the pointer
+		// and also rejects plain functions (f.Recv == nil).
+		if !receiverIs(f, "PluginsHandler") {
+			return true
+		}
+		fn = f
+		return false
+	})
+
+	if fn == nil {
+		t.Fatal("findRunningContainer not found in plugins.go — was it renamed? update this test or the SSOT routing assumption")
+	}
+
+	var (
+		callsRunningContainerName bool
+		callsContainerInspectRaw  bool
+	)
+	ast.Inspect(fn.Body, func(n ast.Node) bool {
+		call, ok := n.(*ast.CallExpr)
+		if !ok {
+			return true
+		}
+		sel, ok := call.Fun.(*ast.SelectorExpr)
+		if !ok {
+			return true
+		}
+		// Pkg.Func form: provisioner.RunningContainerName(...)
+		if pkgIdent, ok := sel.X.(*ast.Ident); ok {
+			if pkgIdent.Name == "provisioner" && sel.Sel.Name == "RunningContainerName" {
+				callsRunningContainerName = true
+			}
+		}
+		// Receiver-then-method form: h.docker.ContainerInspect(...) /
+		// p.cli.ContainerInspect(...) — anything ending in
+		// .ContainerInspect that's NOT routed through provisioner.
+		if sel.Sel.Name == "ContainerInspect" {
+			callsContainerInspectRaw = true
+		}
+		return true
+	})
+
+	if !callsRunningContainerName {
+		t.Errorf(
+			"findRunningContainer must call provisioner.RunningContainerName for the SSOT inspect — see molecule-core#10. Found no such call.",
+		)
+	}
+	if callsContainerInspectRaw {
+		t.Errorf(
+			"findRunningContainer carries a direct ContainerInspect call. This is the parallel-impl drift molecule-core#10 fixed. " +
+				"Either route through provisioner.RunningContainerName OR — if a new use case truly needs a different inspect — extend RunningContainerName's contract first and update this gate to allow the specific delta.",
+		)
+	}
+}
+
+// TestProvisionerIsRunning_RoutesThroughRunningContainerName mirrors the
+// gate above but for the OTHER consumer of the SSOT — Provisioner.IsRunning
+// (called by healthsweep). If a future refactor makes IsRunning carry its
+// own ContainerInspect again, the two consumers' edge-case behaviors will
+// silently drift. Keep them yoked.
+func TestProvisionerIsRunning_RoutesThroughRunningContainerName(t *testing.T) {
+	fset := token.NewFileSet()
+	file, err := parser.ParseFile(fset, "../provisioner/provisioner.go", nil, parser.ParseComments)
+	if err != nil {
+		t.Fatalf("parse provisioner.go: %v", err)
+	}
+
+	var fn *ast.FuncDecl
+	ast.Inspect(file, func(n ast.Node) bool {
+		f, ok := n.(*ast.FuncDecl)
+		if !ok || f.Name.Name != "IsRunning" || f.Recv == nil {
+			return true
+		}
+		// The receiver type must be *Provisioner specifically. CPProvisioner
+		// has its own IsRunning that talks HTTP to the controlplane and is
+		// out of scope for this gate.
+		if !receiverIs(f, "Provisioner") {
+			return true
+		}
+		fn = f
+		return false
+	})
+	if fn == nil {
+		t.Fatal("Provisioner.IsRunning not found — was it renamed? update this test")
+	}
+
+	var (
+		callsRunningContainerName bool
+		callsContainerInspectRaw  bool
+	)
+	ast.Inspect(fn.Body, func(n ast.Node) bool {
+		call, ok := n.(*ast.CallExpr)
+		if !ok {
+			return true
+		}
+		// Same-package call: bare identifier (e.g. RunningContainerName(...)).
+		if id, ok := call.Fun.(*ast.Ident); ok && id.Name == "RunningContainerName" {
+			callsRunningContainerName = true
+			return true
+		}
+		// Selector call: pkg.Func (e.g. provisioner.RunningContainerName)
+		// OR recv.Method (e.g. p.cli.ContainerInspect).
+		sel, ok := call.Fun.(*ast.SelectorExpr)
+		if !ok {
+			return true
+		}
+		switch sel.Sel.Name {
+		case "RunningContainerName":
+			callsRunningContainerName = true
+		case "ContainerInspect":
+			callsContainerInspectRaw = true
+		}
+		return true
+	})
+
+	if !callsRunningContainerName {
+		t.Errorf("Provisioner.IsRunning must call RunningContainerName for the SSOT inspect — see molecule-core#10")
+	}
+	if callsContainerInspectRaw {
+		t.Errorf("Provisioner.IsRunning carries a direct ContainerInspect call; route through RunningContainerName instead")
+	}
+}
+
+// receiverIs reports whether fn's receiver is `*<typeName>` or `<typeName>`.
+// The type name is compared case-insensitively (strings.EqualFold).
+func receiverIs(fn *ast.FuncDecl, typeName string) bool {
+	if fn.Recv == nil || len(fn.Recv.List) == 0 {
+		return false
+	}
+	expr := fn.Recv.List[0].Type
+	if star, ok := expr.(*ast.StarExpr); ok {
+		expr = star.X
+	}
+	id, ok := expr.(*ast.Ident)
+	return ok && strings.EqualFold(id.Name, typeName)
+}
@@ -32,6 +32,18 @@ import (
 // inside the workspace at startup.
 func (h *PluginsHandler) Install(c *gin.Context) {
 	workspaceID := c.Param("id")
+	// External-runtime guard (molecule-core#10): push-install via docker
+	// exec is meaningless for `runtime='external'` workspaces — they have
+	// no local container. Reject early with a hint pointing at the
+	// pull-mode endpoint, instead of falling through to a misleading
+	// "container not running" 503 from findRunningContainer.
+	if h.isExternalRuntime(workspaceID) {
+		c.JSON(http.StatusUnprocessableEntity, gin.H{
+			"error": "plugin install via push is not supported for external runtimes",
+			"hint":  "external workspaces pull plugins via GET /workspaces/:id/plugins/:name/download",
+		})
+		return
+	}
 	// Cap the JSON body so a pathological POST can't exhaust parser memory.
 	bodyMax := envx.Int64("PLUGIN_INSTALL_BODY_MAX_BYTES", defaultInstallBodyMaxBytes)
 	c.Request.Body = http.MaxBytesReader(c.Writer, c.Request.Body, bodyMax)
@@ -88,22 +100,51 @@ func (h *PluginsHandler) Install(c *gin.Context) {
 }

 // Uninstall handles DELETE /workspaces/:id/plugins/:name — removes a plugin.
+//
+// Dispatch order, as implemented below (backend order mirrors Install):
+//
+//  1. External runtime → 422 (caller manages its own plugin dir).
+//  2. Local Docker container up → exec rm -rf via existing helpers.
+//  3. SaaS workspace (instance_id set) → ssh sudo rm -rf via EIC.
+//  4. Neither → 503.
 func (h *PluginsHandler) Uninstall(c *gin.Context) {
 	workspaceID := c.Param("id")
 	pluginName := c.Param("name")
 	ctx := c.Request.Context()

+	// Mirror Install's external-runtime guard (molecule-core#10) so the
+	// two endpoints reject with the same status and response shape.
+	if h.isExternalRuntime(workspaceID) {
+		c.JSON(http.StatusUnprocessableEntity, gin.H{
+			"error": "plugin uninstall via docker exec is not supported for external runtimes",
+			"hint":  "external workspaces manage their own plugin directory; remove it locally",
+		})
+		return
+	}
+
 	if err := validatePluginName(pluginName); err != nil {
 		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid plugin name"})
 		return
 	}

-	containerName := h.findRunningContainer(ctx, workspaceID)
-	if containerName == "" {
-		c.JSON(http.StatusServiceUnavailable, gin.H{"error": "workspace container not running"})
+	if containerName := h.findRunningContainer(ctx, workspaceID); containerName != "" {
+		h.uninstallViaDocker(ctx, c, workspaceID, pluginName, containerName)
 		return
 	}

+	if instanceID, runtime := h.lookupSaaSDispatch(workspaceID); instanceID != "" {
+		h.uninstallViaEIC(ctx, c, workspaceID, pluginName, instanceID, runtime)
+		return
+	}
+
+	c.JSON(http.StatusServiceUnavailable, gin.H{"error": "workspace container not running"})
+}
+
+// uninstallViaDocker holds the historical Docker-exec uninstall flow.
+// Extracted out of Uninstall so the new SaaS dispatch reads cleanly and
+// the two backend bodies are visibly symmetric (same steps, different
+// transport).
+func (h *PluginsHandler) uninstallViaDocker(ctx context.Context, c *gin.Context, workspaceID, pluginName, containerName string) {
 	// Read the plugin's manifest BEFORE deletion to learn which skill dirs
 	// it owns, so we can clean them out of /configs/skills/ and avoid the
 	// auto-restart re-mounting them. Issue #106.
@@ -155,6 +196,61 @@ func (h *PluginsHandler) Uninstall(c *gin.Context) {
 	})
 }

+// uninstallViaEIC removes a plugin from a SaaS workspace EC2 over SSH.
+// Symmetric with uninstallViaDocker:
+//
+//   - Read manifest (best-effort, missing plugin.yaml = no skills to clean).
+//   - Skip CLAUDE.md awk-strip for now: that file lives at
+//     <runtime-config-prefix>/CLAUDE.md on the host and the same awk script
+//     would work over ssh, but the file is rewritten on workspace restart
+//     by the runtime adapter anyway, so the marker either stays harmless
+//     or gets dropped on the next install/restart cycle. Tracked as
+//     follow-up; not a regression vs the docker path's semantics here.
+//   - rm -rf the plugin dir.
+//   - Trigger restart.
+//
+// We intentionally don't try to remove /configs/skills/<skill> entries
+// over ssh because the same /configs is bind-mounted into the runtime
+// container; the agent's own start-up adapter rewrites that tree from
+// the live plugin set, so a stale skill dir for an uninstalled plugin
+// is cleaned up at restart. The docker path removes them eagerly only
+// because docker-exec is cheap. We can mirror that later if a real bug
+// surfaces, but adding two extra ssh round-trips per uninstall today
+// would be churn for no behavioural win.
+func (h *PluginsHandler) uninstallViaEIC(ctx context.Context, c *gin.Context, workspaceID, pluginName, instanceID, runtime string) {
+	// Read manifest first (best-effort) — we don't currently use the
+	// skills list on the SaaS path (see comment above), but reading it
+	// keeps the parsing path warm and lets log lines distinguish "we
+	// deleted a real plugin" from "user asked us to delete something
+	// that wasn't there." Errors here are swallowed: missing manifest
+	// must not block uninstall.
+	if data, err := readPluginManifestViaEIC(ctx, instanceID, runtime, pluginName); err == nil && len(data) > 0 {
+		info := parseManifestYAML(pluginName, data)
+		if len(info.Skills) > 0 {
+			log.Printf("Plugin uninstall: %s declared skills=%v (left to runtime restart to clean)", pluginName, info.Skills)
+		}
+	}
+
+	if err := uninstallPluginViaEIC(ctx, instanceID, runtime, pluginName); err != nil {
+		log.Printf("Plugin uninstall: EIC rm failed for %s on %s: %v", pluginName, workspaceID, err)
+		c.JSON(http.StatusBadGateway, gin.H{"error": "failed to remove plugin from workspace EC2"})
+		return
+	}
+
+	if h.restartFunc != nil {
+		go func() {
+			time.Sleep(2 * time.Second)
+			h.restartFunc(workspaceID)
+		}()
+	}
+
+	log.Printf("Plugin uninstall: %s from workspace %s (restarting via SaaS path)", pluginName, workspaceID)
+	c.JSON(http.StatusOK, gin.H{
+		"status": "uninstalled",
+		"plugin": pluginName,
+	})
+}
+
 // Download handles GET /workspaces/:id/plugins/:name/download?source=<scheme://spec>
 //
 // Phase 30.3 — stream the named plugin as a gzipped tarball so remote
@@ -0,0 +1,249 @@
+package handlers
+
+// plugins_install_eic.go — SaaS (EC2-per-workspace) plugin install + uninstall
+// over the EIC SSH primitive that template_files_eic.go already plumbs. Pairs
+// with the local-Docker path in plugins_install.go / plugins_install_pipeline.go,
+// closing the 🔴 docker-only row in docs/architecture/backends.md.
+//
+// Architecture note: every operation goes through `withEICTunnel` (ephemeral
+// keypair → AWS push → tunnel → ssh). This file owns the plugin-shaped
+// remote commands; the tunnel mechanics live in template_files_eic.go so a
+// fix to the dance lands in one place.
+//
+// Why direct host write (not docker cp via SSH): on the workspace EC2, the
+// runtime's managed-config dir (/configs for claude-code, /home/ubuntu/.hermes
+// for hermes — see workspaceFilePathPrefix) is bind-mounted into the
+// runtime's container by cloud-init. Writing into <prefix>/plugins/<name>/
+// on the host is exactly what the runtime sees on the next start. No
+// docker-cp needed, and we avoid coupling to any specific container layout
+// inside the workspace EC2.
+
+import (
+	"archive/tar"
+	"bytes"
+	"compress/gzip"
+	"context"
+	"fmt"
+	"log"
+	"os"
+	"os/exec"
+	"path/filepath"
+	"strings"
+	"time"
+)
+
+// eicPluginOpTimeout bounds the whole EIC-tunnel + ssh + tar-pipe dance
+// for a plugin install or uninstall. Larger than eicFileOpTimeout (30s)
+// because plugin trees can carry skill markdown, MCP server binaries,
+// and config files — easily a few MB through ssh + sudo on a fresh
+// tunnel. 2 min gives headroom on a cold tunnel; the install pipeline's
+// PLUGIN_INSTALL_FETCH_TIMEOUT (5 min default) still bounds the outer
+// request.
+const eicPluginOpTimeout = 2 * time.Minute
+
+// hostPluginPath returns the absolute directory on the workspace EC2
+// where /configs/plugins/<name>/ lives for a given runtime. Keeps the
+// per-runtime indirection in one place (mirrors resolveWorkspaceRootPath
+// in template_files_eic.go) so future runtimes only edit
+// workspaceFilePathPrefix.
+//
+// The plugin name is shellQuote-wrapped at the call site, not here,
+// because a couple of callers want the unquoted form for log lines.
+func hostPluginPath(runtime, pluginName string) string {
+	base := resolveWorkspaceRootPath(runtime, "/configs")
+	return filepath.Join(base, "plugins", pluginName)
+}
+
+// buildPluginInstallShell returns the remote command for receiving a tar.gz
+// stream on stdin and unpacking it into <hostPluginDir>/, owned by the agent
+// user (uid 1000 — matches the local-Docker path's chown 1000:1000).
+//
+// The script is a single `sudo sh -c '...'` so the tar-receive + chown run
+// under one privileged invocation; ssh-as-ubuntu has passwordless sudo on
+// the standard tenant AMI.
+//
+//   - rm -rf clears any prior install of the same plugin (idempotent
+//     reinstall — the user re-clicked Install or version-bumped the source).
+//   - mkdir -p makes the parent dir (host /configs is root-owned + always
+//     present; the per-plugin dir is what we're creating).
+//   - tar -xzf - reads stdin (the gzipped tar). --no-same-owner keeps the
+//     archive's tar-recorded uid/gid out of the picture; the chown -R
+//     after is the canonical owner.
+//   - chown -R 1000:1000 matches the local-Docker handler's exec at
+//     plugins_install_pipeline.go:273 — agent user inside the runtime
+//     container is uid 1000 on every workspace-template image we ship.
+//
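+// Safety rests on the path being composed from a runtime allowlist
+// (workspaceFilePathPrefix) + validated plugin name, so traversal is
+// already blocked. shellQuote adds little here: its single quotes sit
+// inside the single-quoted sh -c script, so they close and reopen the
+// outer quoting instead of nesting.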
+func buildPluginInstallShell(hostPluginDir string) string {
+	q := shellQuote(hostPluginDir)
+	return fmt.Sprintf(
+		"sudo -n sh -c 'rm -rf %s && mkdir -p %s && tar -xzf - --no-same-owner -C %s && chown -R 1000:1000 %s'",
+		q, q, q, q,
+	)
+}
+
+// buildPluginUninstallShell returns the remote command for `sudo -n rm -rf
+// <hostPluginDir>`. -rf (vs -f) is intentional here, unlike buildRmShell:
+// uninstall really does need to remove the plugin's whole subtree.
+func buildPluginUninstallShell(hostPluginDir string) string {
+	return fmt.Sprintf("sudo -n rm -rf %s", shellQuote(hostPluginDir))
+}
+
+// buildPluginManifestReadShell returns the remote command for reading the
+// plugin's manifest (plugin.yaml). Mirrors buildCatShell — swallows the
+// missing-file stderr so the missing-manifest case lands as empty stdout
+// + non-zero exit, which uninstall translates to "no skills to clean".
+func buildPluginManifestReadShell(hostPluginDir string) string {
+	return fmt.Sprintf("sudo -n cat %s/plugin.yaml 2>/dev/null", shellQuote(hostPluginDir))
+}
+
+// installPluginViaEIC pushes a staged plugin directory to a SaaS workspace
+// EC2 via the EIC SSH tunnel. On success the plugin lives at
+// <runtime-config-prefix>/plugins/<name>/ on the host, owned by 1000:1000,
+// ready for the next workspace restart to pick up.
+//
+// The caller (deliverToContainer SaaS branch) owns:
+//   - the staged dir (created + cleaned up by resolveAndStage)
+//   - the workspace restart trigger after install
+//
+// Errors here are wrapped with the instance + runtime so triage can tell
+// "tunnel failed" from "tar payload corrupt" without grep-ing the EC2's
+// auth.log.
+var installPluginViaEIC = realInstallPluginViaEIC
+
+func realInstallPluginViaEIC(ctx context.Context, instanceID, runtime, pluginName, stagedDir string) error {
+	if instanceID == "" {
+		return fmt.Errorf("installPluginViaEIC: empty instance_id")
+	}
+	if err := validatePluginName(pluginName); err != nil {
+		return fmt.Errorf("installPluginViaEIC: %w", err)
+	}
+
+	// Build the tar.gz payload up-front so a tar-walk failure is surfaced
+	// before we open the EIC tunnel — saves a 1-2s tunnel setup on every
+	// "broken plugin tree" case.
+	var payload bytes.Buffer
+	gz := gzip.NewWriter(&payload)
+	tw := tar.NewWriter(gz)
+	if err := streamDirAsTar(stagedDir, tw); err != nil {
+		return fmt.Errorf("installPluginViaEIC: tar pack: %w", err)
+	}
+	if err := tw.Close(); err != nil {
+		return fmt.Errorf("installPluginViaEIC: tar close: %w", err)
+	}
+	if err := gz.Close(); err != nil {
+		return fmt.Errorf("installPluginViaEIC: gzip close: %w", err)
+	}
+
+	hostDir := hostPluginPath(runtime, pluginName)
+	cmd := buildPluginInstallShell(hostDir)
+
+	ctx, cancel := context.WithTimeout(ctx, eicPluginOpTimeout)
+	defer cancel()
+
+	return withEICTunnel(ctx, instanceID, func(s eicSSHSession) error {
+		sshCmd := exec.CommandContext(ctx, "ssh", s.sshArgs(cmd)...)
+		sshCmd.Env = os.Environ()
+		sshCmd.Stdin = bytes.NewReader(payload.Bytes())
+		var stderr bytes.Buffer
+		sshCmd.Stderr = &stderr
+		if err := sshCmd.Run(); err != nil {
+			return fmt.Errorf(
+				"ssh install: %w (instance=%s runtime=%s plugin=%s payload=%dB stderr=%s)",
+				err, instanceID, runtime, pluginName, payload.Len(),
+				strings.TrimSpace(stderr.String()),
+			)
+		}
+		log.Printf(
+			"installPluginViaEIC: ws instance=%s runtime=%s plugin=%s payload=%dB → %s",
+			instanceID, runtime, pluginName, payload.Len(), hostDir,
+		)
+		return nil
+	})
+}
+
+// uninstallPluginViaEIC removes the plugin's directory from the workspace
+// EC2 via SSH. Symmetric with installPluginViaEIC but no payload — the
+// remote command is a single `rm -rf`.
+//
+// Best-effort by design: the local-Docker path also doesn't fail
+// uninstall on a missing directory (the pre-existing exec returns 0 when
+// the dir is absent), so we mirror that here. Real ssh-layer failures
+// (tunnel down, sudo denied) still propagate.
+var uninstallPluginViaEIC = realUninstallPluginViaEIC
+
+func realUninstallPluginViaEIC(ctx context.Context, instanceID, runtime, pluginName string) error {
+	if instanceID == "" {
+		return fmt.Errorf("uninstallPluginViaEIC: empty instance_id")
+	}
+	if err := validatePluginName(pluginName); err != nil {
+		return fmt.Errorf("uninstallPluginViaEIC: %w", err)
+	}
+
+	hostDir := hostPluginPath(runtime, pluginName)
+	cmd := buildPluginUninstallShell(hostDir)
+
+	ctx, cancel := context.WithTimeout(ctx, eicPluginOpTimeout)
+	defer cancel()
+
+	return withEICTunnel(ctx, instanceID, func(s eicSSHSession) error {
+		sshCmd := exec.CommandContext(ctx, "ssh", s.sshArgs(cmd)...)
+		sshCmd.Env = os.Environ()
+		var stderr bytes.Buffer
+		sshCmd.Stderr = &stderr
+		if err := sshCmd.Run(); err != nil {
+			return fmt.Errorf(
+				"ssh rm: %w (instance=%s runtime=%s plugin=%s stderr=%s)",
+				err, instanceID, runtime, pluginName,
+				strings.TrimSpace(stderr.String()),
+			)
+		}
+		log.Printf(
+			"uninstallPluginViaEIC: ws instance=%s runtime=%s plugin=%s → removed %s",
+			instanceID, runtime, pluginName, hostDir,
+		)
+		return nil
+	})
+}
+
+// readPluginManifestViaEIC reads the plugin's plugin.yaml from the
+// workspace EC2 so uninstall can learn the skills list to clean up.
+// Returns an empty slice and a nil error when the manifest doesn't exist
+// (best-effort: the local-Docker path treats a missing manifest as "no
+// skills to remove", not a failure).
+var readPluginManifestViaEIC = realReadPluginManifestViaEIC
+
+func realReadPluginManifestViaEIC(ctx context.Context, instanceID, runtime, pluginName string) ([]byte, error) {
+	if instanceID == "" {
+		return nil, fmt.Errorf("readPluginManifestViaEIC: empty instance_id")
+	}
+	if err := validatePluginName(pluginName); err != nil {
+		return nil, fmt.Errorf("readPluginManifestViaEIC: %w", err)
+	}
+
+	hostDir := hostPluginPath(runtime, pluginName)
+	cmd := buildPluginManifestReadShell(hostDir)
+
+	ctx, cancel := context.WithTimeout(ctx, eicPluginOpTimeout)
+	defer cancel()
+
+	var out []byte
+	runErr := withEICTunnel(ctx, instanceID, func(s eicSSHSession) error {
+		sshCmd := exec.CommandContext(ctx, "ssh", s.sshArgs(cmd)...)
+		sshCmd.Env = os.Environ()
+		var stdout, stderr bytes.Buffer
+		sshCmd.Stdout = &stdout
+		sshCmd.Stderr = &stderr
+		// Don't fail on non-zero exit: missing-manifest case returns 1
+		// from cat with empty stdout, which is the "no skills" signal.
+		_ = sshCmd.Run()
+		out = stdout.Bytes()
+		return nil
+	})
+	if runErr != nil {
+		return nil, runErr
+	}
+	return out, nil
+}
@@ -0,0 +1,505 @@
+package handlers
+
+import (
+	"archive/tar"
+	"bytes"
+	"compress/gzip"
+	"context"
+	"errors"
+	"io"
+	"net/http"
+	"net/http/httptest"
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+
+	"github.com/DATA-DOG/go-sqlmock"
+	"github.com/gin-gonic/gin"
+)
+
+// expectAllowlistAllowAll programs the package-shared withMockDB sqlmock
+// so the org-allowlist gate (org_plugin_allowlist.go) returns "allow-all"
+// for the duration of one Install call. The gate fires three queries —
+// resolveOrgID, allowlist EXISTS, allowlist COUNT — and we satisfy each
+// with the empty/zero shape that means "no allowlist configured."
+//
+// Without this, tests that exercise the full Install flow panic on a
+// nil DB. The handlers package already ships withMockDB in
+// tokens_sqlmock_test.go; we just layer the allowlist-specific
+// expectations on top.
+func expectAllowlistAllowAll(mock sqlmock.Sqlmock) {
+	mock.MatchExpectationsInOrder(false)
+	mock.ExpectQuery(`SELECT parent_id FROM workspaces WHERE id`).
+		WillReturnRows(sqlmock.NewRows([]string{"parent_id"}).AddRow(nil))
+	mock.ExpectQuery(`SELECT EXISTS`).
+		WillReturnRows(sqlmock.NewRows([]string{"exists"}).AddRow(false))
+	mock.ExpectQuery(`SELECT COUNT\(\*\) FROM org_plugin_allowlist`).
+		WillReturnRows(sqlmock.NewRows([]string{"count"}).AddRow(0))
+}
+
+// stagePluginRegistry creates a single-plugin registry under dir so the
+// install handler's local resolver can find it. Returns the path to the
+// plugin dir for any caller that wants to assert tar contents.
+//
+// Centralised so a future tweak to the registry shape (e.g. plugin.yaml
+// schema bump) only updates one place. Tests use the source spec
+// `local://<name>` which the local resolver maps to <dir>/<name>/.
+func stagePluginRegistry(t *testing.T, dir, name string) string {
+	t.Helper()
+	pluginDir := filepath.Join(dir, name)
+	if err := os.Mkdir(pluginDir, 0755); err != nil {
+		t.Fatalf("mkdir plugin dir: %v", err)
+	}
+	manifest := "name: " + name + "\nversion: \"1.0.0\"\ndescription: SaaS dispatch test plugin\n"
+	if err := os.WriteFile(filepath.Join(pluginDir, "plugin.yaml"), []byte(manifest), 0644); err != nil {
+		t.Fatalf("write plugin.yaml: %v", err)
+	}
+	if err := os.WriteFile(filepath.Join(pluginDir, "rule.md"), []byte("# rule\n"), 0644); err != nil {
+		t.Fatalf("write rule.md: %v", err)
+	}
+	return pluginDir
+}
+
+// stubInstallPluginViaEIC swaps the package-level installPluginViaEIC for
+// the duration of the test; restored by t.Cleanup. Mirrors the existing
+// withEICTunnel stub pattern (template_files_eic_dispatch_test.go).
+func stubInstallPluginViaEIC(t *testing.T, fn func(ctx context.Context, instanceID, runtime, pluginName, stagedDir string) error) {
+	t.Helper()
+	prev := installPluginViaEIC
+	installPluginViaEIC = fn
+	t.Cleanup(func() { installPluginViaEIC = prev })
+}
+
+func stubUninstallPluginViaEIC(t *testing.T, fn func(ctx context.Context, instanceID, runtime, pluginName string) error) {
+	t.Helper()
+	prev := uninstallPluginViaEIC
+	uninstallPluginViaEIC = fn
+	t.Cleanup(func() { uninstallPluginViaEIC = prev })
+}
+
+func stubReadPluginManifestViaEIC(t *testing.T, fn func(ctx context.Context, instanceID, runtime, pluginName string) ([]byte, error)) {
+	t.Helper()
+	prev := readPluginManifestViaEIC
+	readPluginManifestViaEIC = fn
+	t.Cleanup(func() { readPluginManifestViaEIC = prev })
+}
+
+// ---------- pure-function shell shape ----------
+
+func TestBuildPluginInstallShell_QuotesPath(t *testing.T) {
+	got := buildPluginInstallShell("/configs/plugins/my-plugin")
+	want := "sudo -n sh -c 'rm -rf '/configs/plugins/my-plugin' && mkdir -p '/configs/plugins/my-plugin' && tar -xzf - --no-same-owner -C '/configs/plugins/my-plugin' && chown -R 1000:1000 '/configs/plugins/my-plugin''"
+	if got != want {
+		t.Errorf("buildPluginInstallShell mismatch:\n got %q\nwant %q", got, want)
+	}
+}
+
+func TestBuildPluginUninstallShell_QuotesPath(t *testing.T) {
+	got := buildPluginUninstallShell("/configs/plugins/my-plugin")
+	want := "sudo -n rm -rf '/configs/plugins/my-plugin'"
+	if got != want {
+		t.Errorf("buildPluginUninstallShell mismatch:\n got %q\nwant %q", got, want)
+	}
+}
+
+func TestBuildPluginManifestReadShell_QuotesPath(t *testing.T) {
+	got := buildPluginManifestReadShell("/configs/plugins/my-plugin")
+	want := "sudo -n cat '/configs/plugins/my-plugin'/plugin.yaml 2>/dev/null"
+	if got != want {
+		t.Errorf("buildPluginManifestReadShell mismatch:\n got %q\nwant %q", got, want)
+	}
+}
+
+func TestHostPluginPath_PerRuntime(t *testing.T) {
+	cases := []struct {
+		runtime string
+		plugin  string
+		want    string
+	}{
+		{"claude-code", "browser-automation", "/configs/plugins/browser-automation"},
+		{"hermes", "browser-automation", "/home/ubuntu/.hermes/plugins/browser-automation"},
+		{"langgraph", "browser-automation", "/opt/configs/plugins/browser-automation"},
+		// Unknown / empty runtime falls back to /configs (containerized
+		// user-data layout) so a future runtime added to workspaces table
+		// without a workspaceFilePathPrefix entry doesn't blow up the
+		// install path silently.
+		{"", "browser-automation", "/configs/plugins/browser-automation"},
+		{"some-future-runtime", "x", "/configs/plugins/x"},
+	}
+	for _, c := range cases {
+		t.Run(c.runtime+"/"+c.plugin, func(t *testing.T) {
+			got := hostPluginPath(c.runtime, c.plugin)
+			if got != c.want {
+				t.Errorf("hostPluginPath(%q, %q) = %q, want %q", c.runtime, c.plugin, got, c.want)
+			}
+		})
+	}
+}
+
+// ---------- dispatch: install ----------
+
+// TestPluginInstall_SaaS_DispatchesToEIC: the most load-bearing test in
+// this file. With h.docker == nil and instanceIDLookup returning a real
+// instance_id, Install MUST push the staged plugin to the EC2 over EIC
+// (not 503). Asserts the EIC stub is called with the right (instance,
+// runtime, plugin) tuple AND that the staged dir has the manifest +
+// rule files we put there — proves the staging side wasn't bypassed.
+func TestPluginInstall_SaaS_DispatchesToEIC(t *testing.T) {
+	registry := t.TempDir()
+	stagePluginRegistry(t, registry, "browser-automation")
+
+	type capture struct {
+		called      bool
+		instanceID  string
+		runtime     string
+		pluginName  string
+		stagedFiles []string
+	}
+	var got capture
+
+	stubInstallPluginViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName, stagedDir string) error {
+		got.called = true
+		got.instanceID = instanceID
+		got.runtime = runtime
+		got.pluginName = pluginName
+		entries, err := os.ReadDir(stagedDir)
+		if err != nil {
+			t.Fatalf("read staged dir: %v", err)
+		}
+		for _, e := range entries {
+			got.stagedFiles = append(got.stagedFiles, e.Name())
+		}
+		return nil
+	})
+
+	mock, cleanup := withMockDB(t)
+	defer cleanup()
+	expectAllowlistAllowAll(mock)
+
+	h := NewPluginsHandler(registry, nil, nil).
+		WithRuntimeLookup(func(string) (string, error) { return "claude-code", nil }).
+		WithInstanceIDLookup(func(string) (string, error) { return "i-0e0951a3cfd9bbf75", nil })
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{{Key: "id", Value: "c7244ed9-f623-4cba-8873-020e5c9fe104"}}
+	c.Request = httptest.NewRequest(
+		"POST",
+		"/workspaces/c7244ed9-f623-4cba-8873-020e5c9fe104/plugins",
+		bytes.NewBufferString(`{"source":"local://browser-automation"}`),
+	)
+	c.Request.Header.Set("Content-Type", "application/json")
+
+	h.Install(c)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	if !got.called {
+		t.Fatalf("installPluginViaEIC was not called")
+	}
+	if got.instanceID != "i-0e0951a3cfd9bbf75" {
+		t.Errorf("instanceID = %q, want i-0e0951a3cfd9bbf75", got.instanceID)
+	}
+	if got.runtime != "claude-code" {
+		t.Errorf("runtime = %q, want claude-code", got.runtime)
+	}
+	if got.pluginName != "browser-automation" {
+		t.Errorf("pluginName = %q, want browser-automation", got.pluginName)
+	}
+	// Staged dir must carry the resolver's actual fetch — manifest + rule.
+	// Anything missing here means the stage step was bypassed.
+	hasManifest, hasRule := false, false
+	for _, f := range got.stagedFiles {
+		if f == "plugin.yaml" {
+			hasManifest = true
+		}
+		if f == "rule.md" {
+			hasRule = true
+		}
+	}
+	if !hasManifest || !hasRule {
+		t.Errorf("staged dir missing files: %v (want plugin.yaml + rule.md)", got.stagedFiles)
+	}
+}
+
+// TestPluginInstall_SaaS_PropagatesEICError — when the EIC push fails
+// (tunnel down, sudo denied), Install MUST surface 502 rather than swallow
+// the error and report 200. 502 is the right status for "we tried, the
+// remote side wasn't there" — distinct from 503 ("nothing wired") and
+// 500 ("our bug"). The body deliberately doesn't echo the underlying
+// error string (would leak ssh stderr / instance metadata).
+func TestPluginInstall_SaaS_PropagatesEICError(t *testing.T) {
+	registry := t.TempDir()
+	stagePluginRegistry(t, registry, "browser-automation")
+
+	mock, cleanup := withMockDB(t)
+	defer cleanup()
+	expectAllowlistAllowAll(mock)
+
+	stubInstallPluginViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName, stagedDir string) error {
+		return errors.New("ssh: tunnel exited 255")
+	})
+
+	h := NewPluginsHandler(registry, nil, nil).
+		WithRuntimeLookup(func(string) (string, error) { return "claude-code", nil }).
+		WithInstanceIDLookup(func(string) (string, error) { return "i-aaaa", nil })
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{{Key: "id", Value: "ws-1"}}
+	c.Request = httptest.NewRequest(
+		"POST",
+		"/workspaces/ws-1/plugins",
+		bytes.NewBufferString(`{"source":"local://browser-automation"}`),
+	)
+	c.Request.Header.Set("Content-Type", "application/json")
+
+	h.Install(c)
+
+	if w.Code != http.StatusBadGateway {
+		t.Errorf("expected 502 for EIC failure, got %d: %s", w.Code, w.Body.String())
+	}
+	if strings.Contains(w.Body.String(), "tunnel exited") {
+		t.Errorf("response body must not echo raw EIC error: %s", w.Body.String())
+	}
+}
+
+// TestPluginInstall_NoBackends_Returns503 — lookup is wired but returns
+// empty instance_id (e.g. workspace pre-provision, or local-Docker
+// deploy without a running container). The handler MUST 503, not silently
+// dispatch to EIC with an empty instance_id.
+func TestPluginInstall_NoBackends_Returns503(t *testing.T) {
+	registry := t.TempDir()
+	stagePluginRegistry(t, registry, "browser-automation")
+
+	mock, cleanup := withMockDB(t)
+	defer cleanup()
+	expectAllowlistAllowAll(mock)
+
+	stubInstallPluginViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName, stagedDir string) error {
+		t.Errorf("EIC must not be called when instance_id is empty")
+		return nil
+	})
+
+	h := NewPluginsHandler(registry, nil, nil).
+		WithRuntimeLookup(func(string) (string, error) { return "claude-code", nil }).
+		WithInstanceIDLookup(func(string) (string, error) { return "", nil }) // empty
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{{Key: "id", Value: "ws-1"}}
+	c.Request = httptest.NewRequest(
+		"POST",
+		"/workspaces/ws-1/plugins",
+		bytes.NewBufferString(`{"source":"local://browser-automation"}`),
+	)
+	c.Request.Header.Set("Content-Type", "application/json")
+
+	h.Install(c)
+
+	if w.Code != http.StatusServiceUnavailable {
+		t.Errorf("expected 503, got %d: %s", w.Code, w.Body.String())
+	}
+}
+
+// TestPluginInstall_InstanceLookupError_Returns503 — a DB hiccup on the
+// instance_id lookup must NOT crash or 502; the handler logs and falls
+// through to 503. Same fail-open shape h.runtimeLookup uses (see
+// TestPluginInstall_NoRuntimeLookup_FailsOpen). Pinning this prevents a
+// future "tighten error handling" refactor from quietly converting a DB
+// blip into a five-minute outage on the install endpoint.
+func TestPluginInstall_InstanceLookupError_Returns503(t *testing.T) {
+	registry := t.TempDir()
+	stagePluginRegistry(t, registry, "browser-automation")
+
+	mock, cleanup := withMockDB(t)
+	defer cleanup()
+	expectAllowlistAllowAll(mock)
+
+	h := NewPluginsHandler(registry, nil, nil).
+		WithRuntimeLookup(func(string) (string, error) { return "claude-code", nil }).
+		WithInstanceIDLookup(func(string) (string, error) { return "", errors.New("db: connection refused") })
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{{Key: "id", Value: "ws-1"}}
+	c.Request = httptest.NewRequest(
+		"POST",
+		"/workspaces/ws-1/plugins",
+		bytes.NewBufferString(`{"source":"local://browser-automation"}`),
+	)
+	c.Request.Header.Set("Content-Type", "application/json")
+
+	h.Install(c)
+
+	if w.Code != http.StatusServiceUnavailable {
+		t.Errorf("expected 503 on instance-id lookup error, got %d: %s", w.Code, w.Body.String())
+	}
+}
+
+// ---------- dispatch: uninstall ----------
+
+func TestPluginUninstall_SaaS_DispatchesToEIC(t *testing.T) {
+	stubReadPluginManifestViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName string) ([]byte, error) {
+		return []byte("name: browser-automation\nskills:\n  - browse\n"), nil
+	})
+
+	type capture struct {
+		called     bool
+		instanceID string
+		runtime    string
+		pluginName string
+	}
+	var got capture
+	stubUninstallPluginViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName string) error {
+		got.called = true
+		got.instanceID = instanceID
+		got.runtime = runtime
+		got.pluginName = pluginName
+		return nil
+	})
+
+	h := NewPluginsHandler(t.TempDir(), nil, nil).
+		WithRuntimeLookup(func(string) (string, error) { return "claude-code", nil }).
+		WithInstanceIDLookup(func(string) (string, error) { return "i-bbbb", nil })
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{
+		{Key: "id", Value: "ws-1"},
+		{Key: "name", Value: "browser-automation"},
+	}
+	c.Request = httptest.NewRequest("DELETE", "/workspaces/ws-1/plugins/browser-automation", nil)
+
+	h.Uninstall(c)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("expected 200, got %d: %s", w.Code, w.Body.String())
+	}
+	if !got.called {
+		t.Fatalf("uninstallPluginViaEIC was not called")
+	}
+	if got.instanceID != "i-bbbb" || got.runtime != "claude-code" || got.pluginName != "browser-automation" {
+		t.Errorf("dispatch args wrong: %+v", got)
+	}
+}
+
+func TestPluginUninstall_SaaS_PropagatesEICError(t *testing.T) {
+	stubReadPluginManifestViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName string) ([]byte, error) {
+		return nil, nil
+	})
+	stubUninstallPluginViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName string) error {
+		return errors.New("ssh: connection refused")
+	})
+
+	h := NewPluginsHandler(t.TempDir(), nil, nil).
+		WithRuntimeLookup(func(string) (string, error) { return "claude-code", nil }).
+		WithInstanceIDLookup(func(string) (string, error) { return "i-cccc", nil })
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{
+		{Key: "id", Value: "ws-1"},
+		{Key: "name", Value: "browser-automation"},
+	}
+	c.Request = httptest.NewRequest("DELETE", "/workspaces/ws-1/plugins/browser-automation", nil)
+
+	h.Uninstall(c)
+
+	if w.Code != http.StatusBadGateway {
+		t.Errorf("expected 502, got %d: %s", w.Code, w.Body.String())
+	}
+}
+
+func TestPluginUninstall_NoBackends_Returns503(t *testing.T) {
+	stubUninstallPluginViaEIC(t, func(ctx context.Context, instanceID, runtime, pluginName string) error {
+		t.Errorf("EIC uninstall must not be called with empty instance_id")
+		return nil
+	})
+
+	h := NewPluginsHandler(t.TempDir(), nil, nil).
+		WithRuntimeLookup(func(string) (string, error) { return "claude-code", nil }).
+		WithInstanceIDLookup(func(string) (string, error) { return "", nil })
+
+	w := httptest.NewRecorder()
+	c, _ := gin.CreateTestContext(w)
+	c.Params = gin.Params{
+		{Key: "id", Value: "ws-1"},
+		{Key: "name", Value: "browser-automation"},
+	}
+	c.Request = httptest.NewRequest("DELETE", "/workspaces/ws-1/plugins/browser-automation", nil)
+
+	h.Uninstall(c)
+
+	if w.Code != http.StatusServiceUnavailable {
+		t.Errorf("expected 503, got %d: %s", w.Code, w.Body.String())
+	}
+}
+
+// ---------- tarball shape ----------
+
+// TestRealInstallPluginViaEIC_TarPayloadShape: the production
+// installPluginViaEIC packs the staged dir as a gzipped tar. Ideally we
+// would stub withEICTunnel and run the real installPluginViaEIC body,
+// capturing the ssh stdin via a fake exec.Command, but Go's exec is hard
+// to fake without hijacking $PATH. Instead we exercise the tar packer
+// directly: streamDirAsTar's behaviour is what we actually depend on, and
+// a regression in either streamDirAsTar or the gzip wrapping will be
+// visible here.
+func TestRealInstallPluginViaEIC_TarPayloadShape(t *testing.T) {
+	staged := t.TempDir()
+	if err := os.WriteFile(filepath.Join(staged, "plugin.yaml"), []byte("name: x\n"), 0644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.MkdirAll(filepath.Join(staged, "skills", "browse"), 0755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(staged, "skills", "browse", "instructions.md"), []byte("step 1\n"), 0644); err != nil {
+		t.Fatal(err)
+	}
+
+	var buf bytes.Buffer
+	gz := gzip.NewWriter(&buf)
+	tw := tar.NewWriter(gz)
+	if err := streamDirAsTar(staged, tw); err != nil {
+		t.Fatalf("streamDirAsTar: %v", err)
+	}
+	if err := tw.Close(); err != nil {
+		t.Fatalf("tw close: %v", err)
+	}
+	if err := gz.Close(); err != nil {
+		t.Fatalf("gz close: %v", err)
+	}
+
+	// Round-trip: the same payload the production flow would pipe into
+	// `tar -xzf -` on the remote should unpack to plugin.yaml +
+	// skills/browse/instructions.md.
+	gr, err := gzip.NewReader(&buf)
+	if err != nil {
+		t.Fatalf("gzip reader: %v", err)
+	}
+	tr := tar.NewReader(gr)
+	seen := map[string]bool{}
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			t.Fatalf("tar next: %v", err)
+		}
+		seen[hdr.Name] = true
+	}
+	for _, want := range []string{"plugin.yaml", "skills/browse/instructions.md"} {
+		// Tar entry names normally use forward slashes regardless of the
+		// host path separator; accept both forms so a Windows test runner
+		// doesn't go red on a path-separator difference. Production
+		// always runs on Linux (CI + tenant EC2).
+		alt := filepath.FromSlash(want)
+		if !seen[want] && !seen[alt] {
+			t.Errorf("tar payload missing %q (saw %v)", want, seen)
+		}
+	}
+}