feat(plugins): workspace_plugins tracking table — version-subscription foundation

Closes core#113 partial. Adds the DB foundation for the version-subscription model. Drift detection + queue + admin apply endpoint are follow-up scope (separate PR; filed as a new issue).

WHY THIS PR ONLY GETS US PART-WAY

Plugin install state today is filesystem-only: '/configs/plugins/<name>/' inside the container. There's no DB record of 'plugin X installed at workspace W from source S, tracking ref T'. That makes drift detection impossible: nothing to compare upstream tags against.

This PR adds the table + the install-endpoint hook that writes to it. With baseline tags now on every plugin (post internal#92), the table starts collecting tracked-ref values immediately on the next install. The actual drift-check job + queue + apply endpoint layer on top.

WHAT THIS ADDS

workspace_plugins table:
  workspace_id   FK → workspaces(id) ON DELETE CASCADE
  plugin_name    canonical name from plugin.yaml
  source_raw     full source URL the install used
  tracked_ref    'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'
  installed_at, updated_at

installRequest gains optional 'track' field (defaults to 'none').

Install handler upserts the workspace_plugins row after delivery succeeds. DB write failure is logged but doesn't fail the install (the plugin IS in the container; surfacing 500 misleads the caller).

validateTrackedRef enforces the closed set of accepted shapes:
  'none' | 'tag:<non-empty>' | 'sha:<non-empty>'
Bare values like 'latest' / 'main' / version-strings without prefix are rejected: the drift detector keys on prefix to know what kind of resolution to do.
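The closed set of tracked-ref shapes described above can be illustrated with a minimal Go sketch. This is not the code from plugins_tracking.go: the function body, the error wording, and the choice to reject an empty string (on the assumption that the 'none' default is applied before validation) are all illustrative guesses based only on the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// validateTrackedRef is a hypothetical reconstruction of the validator named
// in the commit message. It accepts exactly three shapes:
//   "none", "tag:<non-empty>", "sha:<non-empty>"
// Everything else, including bare values such as "latest" or "main", is
// rejected so that a consumer can dispatch on the prefix alone.
func validateTrackedRef(ref string) error {
	if ref == "none" {
		return nil
	}
	for _, prefix := range []string{"tag:", "sha:"} {
		if strings.HasPrefix(ref, prefix) {
			if strings.TrimPrefix(ref, prefix) == "" {
				return fmt.Errorf("tracked ref %q: empty value after %q", ref, prefix)
			}
			return nil
		}
	}
	return fmt.Errorf("tracked ref %q: want 'none', 'tag:<value>', or 'sha:<value>'", ref)
}

func main() {
	// Sample inputs: the first four match an accepted shape, the rest do not.
	samples := []string{"none", "tag:v1.2.3", "tag:latest", "sha:abc123", "latest", "main", "v1.2.3", "tag:", ""}
	for _, ref := range samples {
		fmt.Printf("%q valid=%v\n", ref, validateTrackedRef(ref) == nil)
	}
}
```

Keeping the check purely structural means the validator never needs to know whether a tag or SHA exists upstream; that resolution is left to the follow-up drift detector.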

WHAT THIS DOES NOT ADD (filed separately)

- Drift detector job (cron / on-demand) that scans 'WHERE tracked_ref != none' rows and queues updates on upstream drift
- plugin_update_queue table (separate migration once detector lands)
- GET /admin/plugin-updates-pending and POST .../apply endpoints
- Tier-aware apply (core#115; composes here)

PHASE 4 SELF-REVIEW (FIVE-AXIS)

Correctness: No finding. Install endpoint behavior unchanged for callers that don't pass 'track'. DB write is best-effort + logged on failure. validateTrackedRef rejects ambiguous bare strings.

Readability: No finding. Separate file plugins_tracking.go isolates the new concern; install handler delta is a single 4-line block.

Architecture: No finding. Additive table; existing schema untouched. Migration 20260508160000_* uses the timestamp-prefixed convention.

Security: No finding. INSERT params via placeholders (no string interpolation). validateTrackedRef rejects unexpected shapes before the column constraint would.

Performance: No finding. One extra ExecContext per install. Install is already seconds-scale (network fetch + tar + docker exec); rounds to noise.

TESTS (1 new, all green)

TestValidateTrackedRef: pin closed set + structural validators

REFS

core#113: this issue (foundation only; drift+queue+apply = follow-up)
internal#92, internal#93: plugin/template baseline tags (now exists for tracking)
core#114: atomic install (this PR composes; no atomicity regression)
core#115: canary tier filter (will key off the same DB foundation)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge pull request 'feat(plugins): hot-reload classifier — skip restart on SKILL-content-only updates' (#121 ) from feat/plugin-hot-reload-classifier into main
2026-05-08 08:52:35 -07:00 · 2026-05-08 15:26:32 +00:00 · 2026-05-08 08:26:05 -07:00 · 2026-05-08 15:26:00 +00:00 · 2026-05-08 15:24:55 +00:00 · 2026-05-08 15:23:31 +00:00
29 changed files with 1433 additions and 2003 deletions
@@ -1,467 +0,0 @@
-name: Auto-promote :latest after main image build
-
-# Retags `ghcr.io/molecule-ai/{platform,platform-tenant}:staging-<sha>`
-# → `:latest` after either the image build or E2E completes on a `main`
-# push, gated on E2E Staging SaaS not being red for that SHA.
-#
-# Why two triggers:
-#
-#   `publish-workspace-server-image` and `e2e-staging-saas` are both
-#   paths-filtered, but with DIFFERENT path sets:
-#
-#     publish-workspace-server-image:
-#       workspace-server/**, canvas/**, manifest.json
-#
-#     e2e-staging-saas (full lifecycle):
-#       workspace-server/internal/handlers/{registry,workspace_provision,
-#       a2a_proxy}.go, workspace-server/internal/middleware/**,
-#       workspace-server/internal/provisioner/**, tests/e2e/test_staging_full_saas.sh
-#
-#   The E2E set is a strict SUBSET of the publish set. So:
-#     - canvas/** changes → publish fires, E2E does not
-#     - workspace-server/cmd/** changes → publish fires, E2E does not
-#     - workspace-server/internal/sweep/** → publish fires, E2E does not
-#
-#   The previous version triggered ONLY on E2E completion, which meant
-#   non-E2E-path changes (canvas, cmd, sweep, etc.) rebuilt the image
-#   but never advanced `:latest`. Result: as of 2026-04-28 this workflow
-#   had run zero times since merge despite eight main pushes — `:latest`
-#   was ~7 hours / 9 PRs behind main with no human realising. See
-#   `molecule-core` Slack discussion 2026-04-28.
-#
-#   Adding `publish-workspace-server-image` as a second trigger closes
-#   the gap: any image rebuild on main eligibly advances `:latest`.
-#
-# Why E2E remains a kill-switch (not the trigger):
-#
-#   When E2E DID run for this SHA and ended red, we abort — `:latest`
-#   stays on the prior known-good digest. When E2E didn't run (paths
-#   filtered out), we proceed: pre-merge gates already validated this
-#   SHA on staging via auto-promote-staging requiring CI + E2E Canvas +
-#   E2E API + CodeQL all green. Image content for non-E2E-paths
-#   (canvas, cmd, sweep) is exercised by those staging gates.
-#
-# Why `main` only:
-#
-#   `:latest` is what prod tenants pull. We only want SHAs that have
-#   reached main (via auto-promote-staging) to advance `:latest`.
-#   Triggering on staging would let a staging-only revert advance
-#   `:latest` to a SHA that never reaches main, breaking the "production
-#   runs what's on main" invariant.
-#
-# Idempotency:
-#
-#   When a SHA touches paths that match BOTH publish and E2E, both
-#   workflows fire and complete. Both trigger this workflow on
-#   completion → two runs race. Both retag `:staging-<sha>` →
-#   `:latest`. crane tag is idempotent (re-tagging the same digest is a
-#   no-op), so the second run is harmless. concurrency group serializes
-#   them anyway.
-
-on:
-  workflow_run:
-    workflows:
-      - 'E2E Staging SaaS (full lifecycle)'
-      - 'publish-workspace-server-image'
-    types: [completed]
-    branches: [main]
-  workflow_dispatch:
-    inputs:
-      sha:
-        description: 'Short sha to promote (override; defaults to upstream workflow_run head_sha)'
-        required: false
-        type: string
-
-permissions:
-  contents: read
-  packages: write
-
-concurrency:
-  # Serialize promotes per-SHA so the publish+E2E both-fired race lands
-  # cleanly. Different SHAs can promote in parallel.
-  group: auto-promote-latest-${{ github.event.workflow_run.head_sha || github.event.inputs.sha || github.sha }}
-  cancel-in-progress: false
-
-env:
-  IMAGE_NAME: ghcr.io/molecule-ai/platform
-  TENANT_IMAGE_NAME: ghcr.io/molecule-ai/platform-tenant
-
-jobs:
-  promote:
-    # Proceed if upstream succeeded OR manual dispatch. Upstream-failure
-    # paths are filtered here; the E2E-was-red kill-switch lives in the
-    # gate-check step below (covers the case where upstream is publish
-    # success but E2E for the same SHA failed).
-    if: |
-      github.event_name == 'workflow_dispatch' ||
-      (github.event_name == 'workflow_run' && github.event.workflow_run.conclusion == 'success')
-    runs-on: ubuntu-latest
-    steps:
-      - name: Compute short sha
-        id: sha
-        run: |
-          set -euo pipefail
-          if [ -n "${{ github.event.inputs.sha }}" ]; then
-            FULL="${{ github.event.inputs.sha }}"
-          else
-            FULL="${{ github.event.workflow_run.head_sha }}"
-          fi
-          echo "short=${FULL:0:7}" >> "$GITHUB_OUTPUT"
-          echo "full=${FULL}" >> "$GITHUB_OUTPUT"
-
-      - name: Gate — E2E Staging SaaS state for this SHA
-        # When upstream IS E2E success, we know it's green (filtered by
-        # the job-level `if` already). When upstream is publish, look up
-        # E2E state for the same SHA. Four buckets:
-        #
-        #   - completed/success: E2E confirmed safe → proceed
-        #   - completed/failure|cancelled|timed_out: E2E found a
-        #     regression → ABORT (exit 1), `:latest` stays put
-        #   - in_progress|queued|requested: E2E is RACING with publish
-        #     for a runtime-touching SHA. publish typically completes
-        #     ~5-10min before E2E (~10-15min). If we promote on the
-        #     publish signal here, a later E2E failure can't roll back
-        #     `:latest` — it'd already be wrongly advanced. So we DEFER:
-        #     skip subsequent steps (proceed=false) and let E2E's own
-        #     completion event re-fire this workflow, which then takes
-        #     the upstream-is-E2E path. exit 0 so the run shows as
-        #     success rather than a noisy fake-failure.
-        #   - none/none: E2E was paths-filtered out for this SHA (the
-        #     change touched canvas/cmd/sweep/etc. — paths covered by
-        #     publish but not by E2E). pre-merge gates on staging
-        #     already validated this SHA → proceed.
-        #
-        # Manual dispatch skips this check — operator override.
-        id: gate
-        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          REPO: ${{ github.repository }}
-          SHA: ${{ steps.sha.outputs.full }}
-          UPSTREAM_NAME: ${{ github.event.workflow_run.name }}
-          EVENT_NAME: ${{ github.event_name }}
-        run: |
-          set -euo pipefail
-
-          if [ "$EVENT_NAME" = "workflow_dispatch" ]; then
-            echo "proceed=true" >> "$GITHUB_OUTPUT"
-            echo "::notice::Manual dispatch — skipping E2E gate (operator override)"
-            exit 0
-          fi
-
-          if [ "$UPSTREAM_NAME" = "E2E Staging SaaS (full lifecycle)" ]; then
-            echo "proceed=true" >> "$GITHUB_OUTPUT"
-            echo "::notice::Upstream is E2E itself (success per job-level if) — gate trivially satisfied"
-            exit 0
-          fi
-
-          # Upstream is publish-workspace-server-image. Check E2E state
-          # for the same SHA via Gitea's commit-status API.
-          #
-          # GitHub-era this was `gh run list --workflow=X --commit=SHA
-          # --json status,conclusion` returning either `[]` (no run on
-          # this SHA) or `[{status, conclusion}]` (the run's state).
-          # Gitea has NO workflow-runs API at all — `/api/v1/repos/.../
-          # actions/runs` returns 404 (verified 2026-05-07, issue #75).
-          # However Gitea Actions DOES emit a commit status per workflow
-          # job, with `context = "<Workflow Name> / <Job Name> (<event>)"`,
-          # which is exactly what we need: each E2E run leg becomes one
-          # status row on the SHA, and the aggregate state encodes the
-          # run's outcome.
-          #
-          # Mapping:
-          #   0 matched contexts          → "none/none"      (E2E paths-
-          #                                                    filtered
-          #                                                    out — same
-          #                                                    semantic
-          #                                                    as before)
-          #   any context = pending       → "in_progress/none" (defer)
-          #   any context = error|failure → "completed/failure" (abort)
-          #   all contexts = success      → "completed/success" (proceed)
-          #
-          # The "completed/cancelled" and "completed/timed_out" buckets
-          # don't have direct Gitea analogs (Gitea statuses are
-          # success / failure / error / pending / warning). Per-SHA
-          # concurrency cancellation surfaces as `error` on Gitea, which
-          # we map to "completed/failure" rather than "completed/cancelled"
-          # — losing the soft-defer semantic of the cancelled bucket on
-          # this fleet. Tradeoff: the staleness alarm (auto-promote-stale-
-          # alarm.yml) still catches a stuck :latest within 4h, and a
-          # legitimate cancel is rare enough that aborting + manual
-          # re-dispatch is acceptable. If we measure cancel frequency
-          # > 1/week, revisit by reading the run-step-summary text via
-          # a follow-up script.
-          #
-          # Network or auth blips collapse to "none/none" via the curl
-          # `|| true` fallback, matching the pre-Gitea behaviour where
-          # an empty list also degenerated to none/none.
-          GITEA_API_URL="${GITHUB_SERVER_URL:-https://git.moleculesai.app}/api/v1"
-          STATUSES_JSON=$(curl --fail-with-body -sS \
-            -H "Authorization: token ${GH_TOKEN}" \
-            -H "Accept: application/json" \
-            "${GITEA_API_URL}/repos/${REPO}/commits/${SHA}/statuses?limit=100" \
-            2>/dev/null || echo "[]")
-          RESULT=$(printf '%s' "$STATUSES_JSON" | jq -r '
-            # Filter to E2E Staging SaaS (full lifecycle) statuses.
-            # Match by leading workflow-name prefix so the "<job>
-            # (<event>)" tail is irrelevant. Gitea emits the workflow
-            # name verbatim from the YAML `name:` field.
-            [.[] | select(.context | startswith("E2E Staging SaaS (full lifecycle) /"))] as $rows
-            | if ($rows | length) == 0 then
-                "none/none"
-              elif any($rows[]; .status == "pending") then
-                "in_progress/none"
-              elif any($rows[]; .status == "failure" or .status == "error") then
-                "completed/failure"
-              elif all($rows[]; .status == "success") then
-                "completed/success"
-              else
-                # Mixed / unknown — fall through to *) bucket below.
-                "completed/" + ($rows[0].status // "unknown")
-              end
-          ' 2>/dev/null || echo "none/none")
-
-          echo "E2E Staging SaaS for ${SHA:0:7}: $RESULT"
-
-          case "$RESULT" in
-            completed/success)
-              echo "proceed=true" >> "$GITHUB_OUTPUT"
-              echo "::notice::E2E green for this SHA — proceeding with promote"
-              ;;
-            completed/failure|completed/timed_out)
-              echo "proceed=false" >> "$GITHUB_OUTPUT"
-              {
-                echo "## ❌ Auto-promote aborted — E2E Staging SaaS failed"
-                echo
-                echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\`"
-                echo "\`:latest\` stays on the prior known-good digest."
-                echo
-                echo "If the failure was a flake, manually dispatch this workflow with the same sha to override."
-              } >> "$GITHUB_STEP_SUMMARY"
-              exit 1
-              ;;
-            completed/cancelled)
-              # GitHub-era only: cancelled ≠ failure. Gitea statuses
-              # don't expose a "cancelled" state — a per-SHA concurrency
-              # cancellation surfaces as `failure` or `error` on Gitea
-              # and is now handled by the failure branch above. This
-              # arm is kept for backwards compatibility / dual-host
-              # operation (if we ever add a non-Gitea fallback) but
-              # under the post-#75 flow it's unreachable.
-              echo "proceed=false" >> "$GITHUB_OUTPUT"
-              {
-                echo "## ⏭ Auto-promote deferred — E2E Staging SaaS was cancelled"
-                echo
-                echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\`"
-                echo "Likely per-SHA concurrency (newer push superseded this E2E run)."
-                echo "The newer SHA's E2E will fire its own promote when it lands."
-                echo "If you need this specific SHA promoted, manually dispatch."
-              } >> "$GITHUB_STEP_SUMMARY"
-              ;;
-            in_progress/*|queued/*|requested/*|waiting/*|pending/*)
-              echo "proceed=false" >> "$GITHUB_OUTPUT"
-              {
-                echo "## ⏳ Auto-promote deferred — E2E Staging SaaS still running"
-                echo
-                echo "Publish completed before E2E for \`${SHA:0:7}\` (state: \`$RESULT\`)."
-                echo "Skipping retag here — E2E's own completion event will re-fire this workflow."
-                echo "If E2E ends green, that run promotes \`:latest\`. If red, it aborts."
-              } >> "$GITHUB_STEP_SUMMARY"
-              ;;
-            none/none)
-              echo "proceed=true" >> "$GITHUB_OUTPUT"
-              echo "::notice::E2E paths-filtered out for this SHA — pre-merge staging gates carry"
-              ;;
-            *)
-              echo "proceed=false" >> "$GITHUB_OUTPUT"
-              {
-                echo "## ❓ Auto-promote aborted — unexpected E2E state"
-                echo
-                echo "E2E Staging SaaS for \`${SHA:0:7}\`: \`$RESULT\` (unhandled)"
-                echo "Manual investigation needed; re-dispatch with the same sha once resolved."
-              } >> "$GITHUB_STEP_SUMMARY"
-              exit 1
-              ;;
-          esac
-
-      - if: steps.gate.outputs.proceed == 'true'
-        uses: imjasonh/setup-crane@6da1ae018866400525525ce74ff892880c099987 # v0.5
-
-      - name: GHCR login
-        if: steps.gate.outputs.proceed == 'true'
-        run: |
-          echo "${{ secrets.GITHUB_TOKEN }}" | \
-            crane auth login ghcr.io -u "${{ github.actor }}" --password-stdin
-
-      - name: Verify :staging-<sha> exists for both images
-        # Better to fail fast with a clear message than to half-tag
-        # (platform retagged but platform-tenant missing → tenants pull
-        # a stale image).
-        if: steps.gate.outputs.proceed == 'true'
-        run: |
-          set -euo pipefail
-          for img in "${IMAGE_NAME}" "${TENANT_IMAGE_NAME}"; do
-            tag="${img}:staging-${{ steps.sha.outputs.short }}"
-            if ! crane manifest "$tag" >/dev/null 2>&1; then
-              echo "::error::Missing tag: $tag"
-              echo "::error::publish-workspace-server-image must complete on this SHA before auto-promote can retag :latest."
-              exit 1
-            fi
-            echo "  ok: $tag exists"
-          done
-
-      - name: Ancestry check — refuse to promote :latest backwards
-        # #2244: workflow_run completions arrive in arbitrary order. If
-        # SHA-A and SHA-B both reach main within ~10 min and SHA-B's E2E
-        # completes before SHA-A's, this workflow can fire for SHA-A
-        # AFTER it already promoted SHA-B → :latest goes backwards. The
-        # orphan-reconciler "next run corrects it" doesn't apply: there's
-        # no auto-corrective re-promote, :latest stays wrong until the
-        # next main push lands.
-        #
-        # Detection: read current :latest's `org.opencontainers.image.revision`
-        # label (set by publish-workspace-server-image.yml at build time)
-        # and ask the GitHub compare API whether the candidate SHA is
-        # ahead-of / identical-to / behind / diverged-from current.
-        # Hard-fail on `behind` and `diverged` per the approved design —
-        # silent-bypass is the class we're moving away from. Workflow
-        # goes red, oncall sees it, operator decides how to recover
-        # (manual dispatch with the right SHA, force-promote, etc.).
-        #
-        # Manual dispatch skips this check — operator override semantics
-        # match the gate-check step above.
-        #
-        # Backward-compat: when current :latest carries no revision
-        # label (legacy image pre-publish-with-label), skip-with-warning.
-        # All :latest images on main are post-label as of 2026-04-29, so
-        # this branch will be dead within 90 days; remove then.
-        if: steps.gate.outputs.proceed == 'true' && github.event_name != 'workflow_dispatch'
-        id: ancestry
-        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          REPO: ${{ github.repository }}
-          TARGET_SHA: ${{ steps.sha.outputs.full }}
-        run: |
-          set -euo pipefail
-
-          # Read the current :latest config and pull the revision label.
-          # `crane config` returns the OCI image config blob (not the manifest);
-          # labels live under `.config.Labels`. `// empty` makes jq return ""
-          # rather than the literal "null" so the test below works.
-          CURRENT_REVISION=$(crane config "${IMAGE_NAME}:latest" 2>/dev/null \
-            | jq -r '.config.Labels["org.opencontainers.image.revision"] // empty' \
-            || true)
-
-          if [ -z "$CURRENT_REVISION" ]; then
-            echo "decision=skip-no-label" >> "$GITHUB_OUTPUT"
-            {
-              echo "## ⚠ Ancestry check skipped — current :latest has no revision label"
-              echo
-              echo "Likely a legacy image built before \`org.opencontainers.image.revision\` was set."
-              echo "Falling through to retag. After all \`:latest\` images are post-label (TODO 90 days), this branch is dead and should be removed."
-            } >> "$GITHUB_STEP_SUMMARY"
-            echo "::warning::Current :latest carries no revision label — skipping ancestry check (legacy image)"
-            exit 0
-          fi
-
-          if [ "$CURRENT_REVISION" = "$TARGET_SHA" ]; then
-            echo "decision=identical" >> "$GITHUB_OUTPUT"
-            echo "::notice:::latest already at ${TARGET_SHA:0:7} — retag will be a no-op"
-            exit 0
-          fi
-
-          # Ask GitHub which side of the merge graph TARGET_SHA sits on
-          # relative to CURRENT_REVISION. Returns one of: ahead | identical
-          # | behind | diverged. Network or auth errors collapse to "error"
-          # via the explicit fallback so the case below always matches.
-          STATUS=$(gh api \
-            "repos/${REPO}/compare/${CURRENT_REVISION}...${TARGET_SHA}" \
-            --jq '.status' 2>/dev/null || echo "error")
-
-          echo "ancestry compare ${CURRENT_REVISION:0:7} → ${TARGET_SHA:0:7}: $STATUS"
-
-          case "$STATUS" in
-            ahead)
-              echo "decision=ahead" >> "$GITHUB_OUTPUT"
-              echo "::notice::Target ${TARGET_SHA:0:7} is ahead of current :latest (${CURRENT_REVISION:0:7}) — proceeding with retag"
-              ;;
-            identical)
-              echo "decision=identical" >> "$GITHUB_OUTPUT"
-              echo "::notice::Target identical to :latest — retag will be a no-op"
-              ;;
-            behind)
-              echo "decision=behind" >> "$GITHUB_OUTPUT"
-              {
-                echo "## ❌ Auto-promote refused — target is BEHIND current :latest"
-                echo
-                echo "| Field | Value |"
-                echo "|---|---|"
-                echo "| Target SHA | \`$TARGET_SHA\` |"
-                echo "| Current :latest revision | \`$CURRENT_REVISION\` |"
-                echo "| GitHub compare status | \`behind\` |"
-                echo
-                echo "This guard catches the workflow_run-completion-order race (#2244):"
-                echo "two rapid main pushes whose E2Es complete out-of-order can otherwise"
-                echo "promote \`:latest\` backwards. \`:latest\` stays on \`${CURRENT_REVISION:0:7}\`."
-                echo
-                echo "**Recovery:** if this is a legitimate revert that should land on \`:latest\`,"
-                echo "manually dispatch this workflow with the target sha as input — the manual-dispatch"
-                echo "path skips the ancestry check (operator override)."
-              } >> "$GITHUB_STEP_SUMMARY"
-              exit 1
-              ;;
-            diverged)
-              echo "decision=diverged" >> "$GITHUB_OUTPUT"
-              {
-                echo "## ❓ Auto-promote refused — history diverged"
-                echo
-                echo "| Field | Value |"
-                echo "|---|---|"
-                echo "| Target SHA | \`$TARGET_SHA\` |"
-                echo "| Current :latest revision | \`$CURRENT_REVISION\` |"
-                echo "| GitHub compare status | \`diverged\` |"
-                echo
-                echo "Likely cause: force-push rewrote main's history, leaving the previous"
-                echo "\`:latest\` revision orphaned. Needs human review before \`:latest\` advances."
-              } >> "$GITHUB_STEP_SUMMARY"
-              exit 1
-              ;;
-            error|*)
-              echo "decision=error" >> "$GITHUB_OUTPUT"
-              {
-                echo "## ❌ Auto-promote aborted — ancestry-check API error"
-                echo
-                echo "\`gh api repos/${REPO}/compare/${CURRENT_REVISION}...${TARGET_SHA}\` returned unexpected status: \`$STATUS\`"
-                echo
-                echo "Manual dispatch with the target sha bypasses this check."
-              } >> "$GITHUB_STEP_SUMMARY"
-              exit 1
-              ;;
-          esac
-
-      - name: Retag platform :staging-<sha> → :latest
-        if: steps.gate.outputs.proceed == 'true'
-        run: |
-          crane tag "${IMAGE_NAME}:staging-${{ steps.sha.outputs.short }}" latest
-
-      - name: Retag tenant :staging-<sha> → :latest
-        if: steps.gate.outputs.proceed == 'true'
-        run: |
-          crane tag "${TENANT_IMAGE_NAME}:staging-${{ steps.sha.outputs.short }}" latest
-
-      - name: Summary
-        if: steps.gate.outputs.proceed == 'true'
-        run: |
-          {
-            echo "## :latest promoted to ${{ steps.sha.outputs.short }}"
-            echo
-            if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
-              echo "- Trigger: manual dispatch"
-            else
-              echo "- Upstream: \`${{ github.event.workflow_run.name }}\` ([run](${{ github.event.workflow_run.html_url }}))"
-            fi
-            echo "- platform:staging-${{ steps.sha.outputs.short }} → :latest"
-            echo "- platform-tenant:staging-${{ steps.sha.outputs.short }} → :latest"
-            echo
-            echo "Tenant fleet auto-pulls within 5 min via IMAGE_AUTO_REFRESH=true."
-            echo "Force immediate fanout: dispatch redeploy-tenants-on-main.yml."
-          } >> "$GITHUB_STEP_SUMMARY"
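
The gate step in the removed workflow above collapses per-job commit statuses into one of a few buckets using a jq filter. As a reading aid, the same mapping can be restated as a small Go sketch. The type and function names here (`commitStatus`, `classifyE2E`, `shouldProceed`) are hypothetical and do not appear in the repository; only the context prefix, the status strings, and the bucket names are taken from the workflow.

```go
package main

import (
	"fmt"
	"strings"
)

// commitStatus is a hypothetical stand-in for one row of the commit-status
// list the workflow fetches; only the two fields the filter reads are kept.
type commitStatus struct {
	Context string
	Status  string
}

// e2ePrefix is the workflow-name prefix the jq filter matches on.
const e2ePrefix = "E2E Staging SaaS (full lifecycle) /"

// classifyE2E mirrors the branch order of the jq filter: no matching rows,
// then any pending, then any failure or error, then all success, and finally
// a fallback that reports the first matching row's status.
func classifyE2E(statuses []commitStatus) string {
	var rows []commitStatus
	for _, s := range statuses {
		if strings.HasPrefix(s.Context, e2ePrefix) {
			rows = append(rows, s)
		}
	}
	if len(rows) == 0 {
		return "none/none"
	}
	for _, r := range rows {
		if r.Status == "pending" {
			return "in_progress/none"
		}
	}
	for _, r := range rows {
		if r.Status == "failure" || r.Status == "error" {
			return "completed/failure"
		}
	}
	allSuccess := true
	for _, r := range rows {
		if r.Status != "success" {
			allSuccess = false
		}
	}
	if allSuccess {
		return "completed/success"
	}
	first := rows[0].Status
	if first == "" {
		first = "unknown"
	}
	return "completed/" + first
}

// shouldProceed reflects the two case arms that set proceed=true.
func shouldProceed(result string) bool {
	return result == "completed/success" || result == "none/none"
}

func main() {
	pending := []commitStatus{
		{Context: e2ePrefix + " e2e (push)", Status: "success"},
		{Context: e2ePrefix + " teardown (push)", Status: "pending"},
	}
	unrelated := []commitStatus{{Context: "CI / build (push)", Status: "failure"}}
	for _, in := range [][]commitStatus{pending, unrelated, nil} {
		r := classifyE2E(in)
		fmt.Printf("%s proceed=%v\n", r, shouldProceed(r))
	}
}
```

One property worth noticing when reading the original filter: in the fallback branch the reported bucket depends on row order, so a mix such as `success` followed by `warning` yields `completed/success`.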
@@ -1,492 +0,0 @@
-name: Auto-promote staging → main
-
-# Fires after any of the staging-branch quality gates complete. When ALL
-# required gates are green on the same staging SHA, opens (or re-uses)
-# a PR `staging → main` and schedules Gitea auto-merge so the PR lands
-# automatically once approval + status checks are satisfied.
-#
-# ============================================================
-# What this workflow does
-# ============================================================
-#
-# 1. On a workflow_run completion event for one of the staging gate
-#    workflows (CI, E2E Staging Canvas, E2E API Smoke, CodeQL),
-#    checks if the combined status on the staging head SHA is green.
-# 2. If green, opens (or re-uses) a PR `head: staging → base: main`
-#    via Gitea REST `POST /api/v1/repos/.../pulls`.
-# 3. Schedules auto-merge via `POST /api/v1/repos/.../pulls/{index}/merge`
-#    with `merge_when_checks_succeed: true`. Gitea waits for the
-#    approval requirement on `main` (`required_approvals: 1`) and
-#    the status-check gates, then merges.
-# 4. The merge commit lands on `main` and fires
-#    `publish-workspace-server-image.yml` naturally via its
-#    `on: push: branches: [main]` trigger — no explicit dispatch
-#    needed (see "Why no workflow_dispatch tail" below).
-#
-# `auto-sync-main-to-staging.yml` is the reverse-direction
-# counterpart (main → staging, fast-forward push). Together they
-# keep the staging-superset-of-main invariant tight.
-#
-# ============================================================
-# Why Gitea REST (and not `gh pr create`)
-# ============================================================
-#
-# Pre-2026-05-06 this workflow used `gh pr create`, `gh pr merge --auto`,
-# `gh run list`, and `gh workflow run` against GitHub. After the
-# GitHub→Gitea cutover those calls fail because:
-#
-#   - `gh pr create / merge / view / list` route to GitHub GraphQL
-#     (`/api/graphql`). Gitea does not expose a GraphQL endpoint;
-#     every call returns `HTTP 405 Method Not Allowed` — same root
-#     cause as #65 (auto-sync) which PR #66 fixed by dropping `gh`
-#     entirely.
-#   - `gh run list --workflow=...` GitHub-shape; Gitea has the
-#     simpler `GET /repos/.../commits/{ref}/status` combined-status
-#     endpoint instead.
-#   - `gh workflow run X.yml` calls `POST /repos/.../actions/workflows/{id}/dispatches`,
-#     which does NOT exist on Gitea 1.22.6 (verified via swagger.v1.json).
-#
-# So this workflow uses direct `curl` calls to Gitea REST. No `gh`
-# CLI dependency, no GraphQL, no missing-endpoint footgun.
-#
-# ============================================================
-# Why no workflow_dispatch tail (was load-bearing on GitHub, dead on Gitea)
-# ============================================================
-#
-# The GitHub-era version had a 60-line polling step that waited for
-# the promote PR to merge, then explicitly dispatched
-# `publish-workspace-server-image.yml` on `--ref main`. That step
-# existed because GitHub's GITHUB_TOKEN-initiated merges suppress
-# downstream `on: push` workflows (the documented "no recursion" rule
-# — https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
-# The explicit dispatch was the workaround.
-#
-# Gitea Actions does NOT have this no-recursion rule. PR #66's auto-
-# sync merge to main fired `auto-promote-staging` on the next push
-# trigger naturally. So the cascade fires on the natural push event;
-# the explicit dispatch is dead code. (And even if we wanted to
-# preserve it, Gitea has no `workflow_dispatch` REST endpoint.)
-#
-# Removed in this rewrite. If we ever observe the cascade misfire,
-# operator can push an empty commit to `main` to wake it.
-#
-# ============================================================
-# Why open a PR (and not direct push)
-# ============================================================
-#
-# `main` branch protection has `enable_push: false` with NO
-# `push_whitelist_usernames`. Direct push is impossible for any
-# persona, including admins. PR-mediated merge is the only path,
-# which is intentional: prod state mutations (and staging→main IS a
-# prod mutation, since the next deploy fans out to tenants) require
-# Hongming's approval per `feedback_prod_apply_needs_hongming_chat_go`.
-#
-# The auto-merge schedule preserves this gate: `merge_when_checks_succeed`
-# does NOT bypass `required_approvals: 1`. Gitea waits for BOTH
-# approval AND green checks before merging. Hongming reviews via the
-# canvas/chat-handle of the PR notification, approves, and Gitea
-# auto-merges within seconds.
-#
-# ============================================================
-# Identity + token (anti-bot-ring per saved-memory
-# `feedback_per_agent_gitea_identity_default`)
-# ============================================================
-#
-# This workflow uses `secrets.AUTO_SYNC_TOKEN` — a personal access
-# token issued to the `devops-engineer` Gitea persona. NOT the
-# founder PAT. The bot-ring fingerprint that triggered the GitHub
-# org suspension on 2026-05-06 was characterised by founder PAT
-# acting as CI at machine speed.
-#
-# Token scope: `push: true` (read+write) on this repo. The persona
-# can: open PRs, comment on PRs, schedule auto-merge. The persona
-# CANNOT bypass main's branch protection (`required_approvals: 1`
-# still applies — only Hongming's review unblocks merge).
-#
-# Authorship: the PR is opened by `devops-engineer`; the merge
-# commit credits Hongming-as-approver and `devops-engineer` as
-# the merger.
-#
-# ============================================================
-# Failure modes & operational notes
-# ============================================================
-#
-# A — staging gates not all green at trigger time:
-#     - The combined-status check returns `state: pending|failure`.
-#       Workflow exits 0 with a step-summary "not all green; staying
-#       on current main". Re-fires on the next gate completion.
-#
-# B — Gitea PR-create returns non-201 (e.g. 422 already-exists):
-#     - Idempotent: the workflow first GETs the existing open
-#       staging→main PR. If found, reuse it; if not, POST a new one.
-#       422 should never surface; if it does (race), step summary
-#       captures the body and the next workflow_run picks up.
-#
-# C — `merge_when_checks_succeed` schedule fails:
-#     - 422 with "Pull request is not mergeable" if there are
-#       conflicts or stale base. Step summary surfaces it; operator
-#       (or `auto-sync-main-to-staging`) needs to bring staging up
-#       to date with main first. Workflow exits 1 to surface red.
-#
-# D — `AUTO_SYNC_TOKEN` rotated / wrong scope:
-#     - 401/403 on first REST call. Step summary surfaces it.
-#       Re-issue the token from `~/.molecule-ai/personas/` on the
-#       operator host and update the repo Actions secret.
-#
-# ============================================================
-# Loop safety
-# ============================================================
-#
-# When the promote PR merges to main, `auto-sync-main-to-staging.yml`
-# fires (on:push:main) and pushes the merge commit back to staging.
-# That push to staging is by `devops-engineer`, NOT this workflow's
-# token, and triggers the staging gate workflows. When they all
-# complete, we end up back here — but the tree-diff guard catches
-# it: staging tree == main tree (the merge commit changes nothing),
-# so we skip and the cycle terminates.
-
-on:
-  workflow_run:
-    workflows:
-      - CI
-      - E2E Staging Canvas (Playwright)
-      - E2E API Smoke Test
-      - CodeQL
-    types: [completed]
-  workflow_dispatch:
-    inputs:
-      force:
-        description: "Force promote even when AUTO_PROMOTE_ENABLED is unset (manual override)"
-        required: false
-        default: "false"
-
-permissions:
-  contents: read
-  pull-requests: write
-
-# Serialize auto-promote runs. Multiple staging gate completions can land
-# in quick succession (CI + E2E + CodeQL all finish within seconds of
-# each other on a green PR). Without this, two parallel runs would:
-#   1. Race the GET-or-POST PR step.
-#   2. Both call merge-schedule (idempotent, so fine on Gitea).
-# cancel-in-progress: false because the second run on a fresh staging
-# tip should NOT kill the first, which has already opened the PR.
-concurrency:
-  group: auto-promote-staging
-  cancel-in-progress: false
-
-jobs:
-  check-all-gates-green:
-    # Only consider staging pushes. PRs into staging don't promote.
-    if: >
-      (github.event_name == 'workflow_run' &&
-       github.event.workflow_run.head_branch == 'staging' &&
-       github.event.workflow_run.event == 'push')
-      || github.event_name == 'workflow_dispatch'
-    runs-on: ubuntu-latest
-    outputs:
-      all_green: ${{ steps.gates.outputs.all_green }}
-      head_sha: ${{ steps.gates.outputs.head_sha }}
-    steps:
-      # Skip empty-tree promotes (the perpetual auto-promote↔auto-sync
-      # cycle observed pre-cutover on GitHub). On Gitea the cycle shape
-      # is different (auto-sync uses fast-forward, no merge commit),
-      # but the tree-diff guard is cheap insurance and protects against
-      # any future merge-style regression.
-      - name: Checkout for tree-diff check
-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
-        with:
-          fetch-depth: 0
-          ref: staging
-
-      - name: Skip if staging tree == main tree (cycle-break safety)
-        id: tree-diff
-        env:
-          HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
-        run: |
-          set -eu
-          git fetch origin main --depth=50 || { echo "::warning::git fetch main failed — proceeding (fail-open)"; exit 0; }
-          if git diff --quiet origin/main "$HEAD_SHA" -- 2>/dev/null; then
-            {
-              echo "## Skipped — no code to promote"
-              echo
-              echo "staging tip (\`${HEAD_SHA:0:8}\`) and \`main\` have identical trees."
-              echo "Skipping to avoid opening an empty promote PR."
-            } >> "$GITHUB_STEP_SUMMARY"
-            echo "::notice::auto-promote: staging tree == main tree — no code to promote, skipping"
-            echo "skip=true" >> "$GITHUB_OUTPUT"
-          else
-            echo "skip=false" >> "$GITHUB_OUTPUT"
-          fi
-
-      - name: Check combined status on staging head
-        if: steps.tree-diff.outputs.skip != 'true'
-        id: gates
-        env:
-          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
-          HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
-          REPO: ${{ github.repository }}
-          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
-        run: |
-          set -euo pipefail
-
-          # Gitea-native combined-status endpoint aggregates every
-          # check context attached to a SHA. This is structurally
-          # cleaner than the GitHub-era per-workflow `gh run list`
-          # loop because:
-          #
-          #   1. There's no risk of "workflow name collision" (the
-          #      GitHub-era code had to switch from `--workflow=NAME`
-          #      to `--workflow=FILE.YML` to disambiguate "CodeQL"
-          #      between the explicit workflow and GitHub's UI-
-          #      configured default setup; Gitea has no such
-          #      duplicate-name surface).
-          #   2. Gitea's combined state already encodes the AND
-          #      across all contexts: success only if EVERY context
-          #      is success. Pending or failure on any context
-          #      produces non-success state.
-          #
-          # See https://docs.gitea.com/api/1.22 for the schema —
-          # `state` is one of: success, pending, failure, error.
-
-          echo "head_sha=${HEAD_SHA}" >> "$GITHUB_OUTPUT"
-          echo "Checking combined status on SHA ${HEAD_SHA}"
-
-          # `set +e` for the http-code capture pattern; restore (`set -e`)
-          # immediately. Pattern hardened per `feedback_curl_status_capture_pollution`.
-          BODY_FILE=$(mktemp)
-          set +e
-          STATUS=$(curl -sS \
-            -H "Authorization: token ${GITEA_TOKEN}" \
-            -H "Accept: application/json" \
-            -o "${BODY_FILE}" \
-            -w "%{http_code}" \
-            "${GITEA_HOST}/api/v1/repos/${REPO}/commits/${HEAD_SHA}/status")
-          CURL_RC=$?
-          set -e
-
-          if [ "${CURL_RC}" -ne 0 ] || [ "${STATUS}" != "200" ]; then
-            echo "::error::combined-status fetch failed: curl=${CURL_RC} http=${STATUS}"
-            cat "${BODY_FILE}" | head -c 500 || true
-            rm -f "${BODY_FILE}"
-            echo "all_green=false" >> "$GITHUB_OUTPUT"
-            exit 0
-          fi
-
-          STATE=$(jq -r '.state // "missing"' < "${BODY_FILE}")
-          TOTAL=$(jq -r '.total_count // 0' < "${BODY_FILE}")
-          rm -f "${BODY_FILE}"
-
-          echo "Combined status: state=${STATE} total_count=${TOTAL}"
-
-          if [ "${STATE}" = "success" ] && [ "${TOTAL}" -gt 0 ]; then
-            echo "all_green=true" >> "$GITHUB_OUTPUT"
-            echo "::notice::All gates green on ${HEAD_SHA} (${TOTAL} contexts)"
-          else
-            echo "all_green=false" >> "$GITHUB_OUTPUT"
-            {
-              echo "## Not promoting — combined status not green"
-              echo
-              echo "- SHA: \`${HEAD_SHA:0:8}\`"
-              echo "- Combined state: \`${STATE}\`"
-              echo "- Context count: ${TOTAL}"
-              echo
-              echo "Will re-fire on the next gate completion. Investigate any red gate via the Actions UI."
-            } >> "$GITHUB_STEP_SUMMARY"
-            echo "::notice::auto-promote: combined status is ${STATE} on ${HEAD_SHA} — staying on current main"
-          fi
-
-  promote:
-    needs: check-all-gates-green
-    if: needs.check-all-gates-green.outputs.all_green == 'true'
-    runs-on: ubuntu-latest
-    steps:
-      - name: Check rollout gate
-        env:
-          AUTO_PROMOTE_ENABLED: ${{ vars.AUTO_PROMOTE_ENABLED }}
-          FORCE_INPUT: ${{ github.event.inputs.force }}
-        run: |
-          set -eu
-          # Repo variable AUTO_PROMOTE_ENABLED=true flips this on. While
-          # it's unset, the workflow dry-runs (logs what it would have
-          # done) but doesn't open the promote PR. Set the variable in
-          # Settings → Actions → Variables.
-          if [ "${AUTO_PROMOTE_ENABLED:-}" != "true" ] && [ "${FORCE_INPUT:-false}" != "true" ]; then
-            {
-              echo "## Auto-promote disabled"
-              echo
-              echo "Repo variable \`AUTO_PROMOTE_ENABLED\` is not set to \`true\`."
-              echo "All gates are green on staging; would have opened a promote PR to \`main\`."
-              echo
-              echo "To enable: Settings → Actions → Variables → \`AUTO_PROMOTE_ENABLED=true\`."
-              echo "To test once manually: workflow_dispatch with \`force=true\`."
-            } >> "$GITHUB_STEP_SUMMARY"
-            echo "::notice::auto-promote disabled — dry run only"
-            exit 0
-          fi
-
-      - name: Open or reuse promote PR + schedule auto-merge
-        if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
-        env:
-          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
-          REPO: ${{ github.repository }}
-          TARGET_SHA: ${{ needs.check-all-gates-green.outputs.head_sha }}
-          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
-        run: |
-          set -euo pipefail
-
-          API="${GITEA_HOST}/api/v1/repos/${REPO}"
-          AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
-
-          # http_get BODY_FILE URL  (http_post_json BODY_FILE DATA URL is analogous)
-          # Writes the response body to BODY_FILE and echoes the HTTP
-          # status code (-w) on stdout; returns 99 if curl itself fails.
-          # Capture pattern per `feedback_curl_status_capture_pollution`;
-          # the set +e/-e bracket protects pipeline state.
-          http_get() {
-            local body_file="$1"; shift
-            local url="$1"; shift
-            set +e
-            local code
-            code=$(curl -sS "${AUTH[@]}" -o "${body_file}" -w "%{http_code}" "${url}")
-            local rc=$?
-            set -e
-            if [ "${rc}" -ne 0 ]; then
-              echo "::error::curl GET failed (rc=${rc}) on ${url}"
-              return 99
-            fi
-            echo "${code}"
-          }
-          http_post_json() {
-            local body_file="$1"; shift
-            local data="$1"; shift
-            local url="$1"; shift
-            set +e
-            local code
-            code=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-              -X POST -d "${data}" -o "${body_file}" -w "%{http_code}" "${url}")
-            local rc=$?
-            set -e
-            if [ "${rc}" -ne 0 ]; then
-              echo "::error::curl POST failed (rc=${rc}) on ${url}"
-              return 99
-            fi
-            echo "${code}"
-          }
-
-          # Step 1: look for an existing open staging→main promote PR
-          # (idempotent on workflow re-run). Gitea doesn't have a
-          # head/base filter on the list endpoint that's as ergonomic
-          # as gh's, but the dedicated `/pulls/{base}/{head}` lookup
-          # works.
-          BODY=$(mktemp)
-          STATUS=$(http_get "${BODY}" "${API}/pulls/main/staging") || true
-
-          PR_NUM=""
-          if [ "${STATUS}" = "200" ]; then
-            STATE=$(jq -r '.state // "missing"' < "${BODY}")
-            if [ "${STATE}" = "open" ]; then
-              PR_NUM=$(jq -r '.number // ""' < "${BODY}")
-              echo "::notice::Re-using existing open promote PR #${PR_NUM}"
-            fi
-          fi
-          rm -f "${BODY}"
-
-          # Step 2: if no open PR, create one.
-          if [ -z "${PR_NUM}" ]; then
-            TITLE="staging → main: auto-promote ${TARGET_SHA:0:7}"
-            BODY_TEXT=$(cat <<EOFBODY
-          Automated promotion of \`staging\` (\`${TARGET_SHA:0:8}\`) to \`main\`. All required staging gates are green at this SHA (combined status reported success).
-
-          This PR is auto-generated by \`.github/workflows/auto-promote-staging.yml\` whenever every required gate completes green on the same staging SHA.
-
-          **Approval gate:** \`main\` branch protection requires 1 approval before this can land. Once approved, Gitea will auto-merge (the workflow scheduled \`merge_when_checks_succeed: true\` immediately after open).
-
-          The reverse-direction sync (the merge commit on \`main\` → \`staging\`) is handled automatically by \`auto-sync-main-to-staging.yml\` after this PR lands.
-
-          ---
-          - Source: staging at \`${TARGET_SHA}\`
-          - Opened by: \`devops-engineer\` persona (anti-bot-ring; never founder PAT)
-          - Refs: #65, #73, #195
-          EOFBODY
-          )
-            REQ=$(jq -n \
-              --arg title "${TITLE}" \
-              --arg body "${BODY_TEXT}" \
-              --arg base "main" \
-              --arg head "staging" \
-              '{title:$title, body:$body, base:$base, head:$head}')
-
-            BODY=$(mktemp)
-            STATUS=$(http_post_json "${BODY}" "${REQ}" "${API}/pulls")
-
-            if [ "${STATUS}" = "201" ]; then
-              PR_NUM=$(jq -r '.number // ""' < "${BODY}")
-              echo "::notice::Opened promote PR #${PR_NUM}"
-            else
-              echo "::error::Failed to create promote PR: HTTP ${STATUS}"
-              jq -r '.message // .' < "${BODY}" | head -c 500
-              rm -f "${BODY}"
-              exit 1
-            fi
-            rm -f "${BODY}"
-          fi
-
-          # Step 3: schedule auto-merge. merge_when_checks_succeed
-          # tells Gitea to wait for both:
-          #   - all required status checks to pass
-          #   - the required-approvals gate (1 approval on main)
-          # before merging. On approval+green, Gitea merges within
-          # seconds. On any check failing or approval being denied,
-          # the schedule stays armed but doesn't fire.
-          #
-          # Idempotent: re-arming on an already-armed PR is a no-op.
-          REQ=$(jq -n '{Do:"merge", merge_when_checks_succeed:true}')
-          BODY=$(mktemp)
-          STATUS=$(http_post_json "${BODY}" "${REQ}" "${API}/pulls/${PR_NUM}/merge")
-
-          # Gitea returns:
-          #   - 200/204 on successful immediate merge (gates already green AND approved)
-          #   - 405 "Please try again later" when scheduled successfully but waiting
-          #   - 422 on "Pull request is not mergeable" (conflict, stale base, etc.)
-          #
-          # 405 here is benign — Gitea's way of saying "scheduled, not merging now".
-          # We treat 200/204/405 as success, anything else as failure.
-          case "${STATUS}" in
-            200|204)
-              MERGE_OUTCOME="merged-immediately"
-              echo "::notice::Promote PR #${PR_NUM} merged immediately (gates+approval already green)"
-              ;;
-            405)
-              MERGE_OUTCOME="auto-merge-scheduled"
-              echo "::notice::Promote PR #${PR_NUM}: auto-merge scheduled (Gitea will land on approval+green)"
-              ;;
-            422)
-              MERGE_OUTCOME="not-mergeable"
-              echo "::warning::Promote PR #${PR_NUM}: not mergeable (conflict, stale base, or already merging)."
-              jq -r '.message // .' < "${BODY}" | head -c 500
-              ;;
-            *)
-              echo "::error::Unexpected status ${STATUS} on merge schedule"
-              jq -r '.message // .' < "${BODY}" | head -c 500
-              rm -f "${BODY}"
-              exit 1
-              ;;
-          esac
-          rm -f "${BODY}"
-
-          {
-            echo "## Auto-promote PR opened"
-            echo
-            echo "- Source: staging at \`${TARGET_SHA:0:8}\`"
-            echo "- PR: #${PR_NUM}"
-            echo "- Outcome: \`${MERGE_OUTCOME}\`"
-            echo
-            if [ "${MERGE_OUTCOME}" = "auto-merge-scheduled" ]; then
-              echo "Gitea will auto-merge once Hongming approves and all checks are green. No human action needed beyond approval."
-            elif [ "${MERGE_OUTCOME}" = "merged-immediately" ]; then
-              echo "Merged immediately. \`publish-workspace-server-image.yml\` will fire naturally on the resulting \`main\` push."
-            else
-              echo "PR is not auto-merging. Operator may need to bring staging up to date with main, then re-trigger this workflow via workflow_dispatch."
-            fi
-          } >> "$GITHUB_STEP_SUMMARY"
@@ -1,83 +0,0 @@
-name: auto-promote-stale-alarm
-
-# Hourly cron + on-demand alarm for the silent-block failure mode that
-# motivated issue #2975:
-#   - The auto-promote-staging.yml workflow opened a PR + armed
-#     auto-merge, but main's branch protection requires a human review
-#     (reviewDecision=REVIEW_REQUIRED). The PR sat BLOCKED with no
-#     surface-up-the-stack for 12+ hours, holding 25 commits hostage
-#     including the Memory v2 redesign and a reno-stars data-loss fix.
-#
-# This workflow runs `scripts/check-stale-promote-pr.sh` against the
-# repo's open auto-promote PRs (base=main head=staging). When a PR has
-# been BLOCKED on REVIEW_REQUIRED for >4h, it:
-#   1. Emits a workflow-level warning (visible in run summary + the
-#      Actions UI feed).
-#   2. Posts a comment on the PR (idempotent — one alarm per PR).
-#
-# The detection logic lives in scripts/check-stale-promote-pr.sh so
-# it's unit-testable with stubbed `gh` (see test-check-stale-promote-pr.sh).
-# This file is the schedule + invocation surface only — SSOT for the
-# detector itself.
-
-on:
-  schedule:
-    # Hourly. Cheap (one `gh pr list` + jq), and 1h granularity is
-    # plenty for a 4h staleness threshold — operators see the alarm
-    # within at most 1h of crossing the threshold.
-    - cron: "27 * * * *"  # at :27 to dodge the cron herd at :00
-  workflow_dispatch:
-    inputs:
-      stale_hours:
-        description: "Hours after which a BLOCKED+REVIEW_REQUIRED PR is stale (default 4)"
-        required: false
-        default: "4"
-      post_comment:
-        description: "Post a comment on stale PRs (default true)"
-        required: false
-        default: "true"
-
-permissions:
-  contents: read
-  pull-requests: write  # post comments on stale PRs
-
-# Serialize so the on-demand and scheduled runs don't double-comment
-# the same PR. cancel-in-progress=false because the script is idempotent
-# (existing comment marker prevents dupes): a scheduled run queued
-# behind a manual one would just re-list the same PR set.
-concurrency:
-  group: auto-promote-stale-alarm
-  cancel-in-progress: false
-
-jobs:
-  scan:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout (need scripts/ only)
-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
-        with:
-          sparse-checkout: |
-            scripts/check-stale-promote-pr.sh
-          sparse-checkout-cone-mode: false
-      - name: Run stale-PR detector
-        env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-          GITHUB_REPOSITORY: ${{ github.repository }}
-          STALE_HOURS: ${{ inputs.stale_hours || '4' }}
-          POST_COMMENT: ${{ inputs.post_comment || 'true' }}
-        run: |
-          # The script's exit code reflects the count of stale PRs.
-          # We don't want a stale finding to fail the workflow run —
-          # the warning + comment are the signal, the green/red is
-          # noise. So convert any non-zero exit to a workflow notice
-          # and exit 0.
-          set +e
-          bash scripts/check-stale-promote-pr.sh
-          rc=$?
-          set -e
-          if [ "$rc" -ne 0 ]; then
-            echo "::notice::Stale PR detector found $rc PR(s) needing attention. See warnings above + comments on the PRs."
-          fi
-          # Always succeed — operator-facing surface is the warning,
-          # not the workflow status.
-          exit 0
@@ -1,404 +0,0 @@
-name: Auto-sync canary — AUTO_SYNC_TOKEN rotation drift
-
-# Synthetic health check for the AUTO_SYNC_TOKEN secret consumed by
-# auto-sync-main-to-staging.yml (PR #66) and publish-workspace-server-image.yml.
-#
-# ============================================================
-# Why this workflow exists
-# ============================================================
-#
-# PR #66 fixed auto-sync (replaced GitHub-era `gh pr create` — which
-# 405s on Gitea's GraphQL endpoint — with a direct git push from the
-# `devops-engineer` persona's `AUTO_SYNC_TOKEN`). Hostile self-review
-# weakest spot #3 of that PR:
-#
-#   "Token rotation silently breaks auto-sync. If AUTO_SYNC_TOKEN is
-#    rotated without updating the repo secret, every push to main
-#    fails red on the auto-sync push step. The workflow surfaces the
-#    failure mode in the step summary (failure mode B in the header),
-#    but there's no proactive monitoring."
-#
-# Detection latency under the status quo: rotation is only caught on
-# the next push to `main`. During quiet periods (no main push for
-# many hours) the staging-superset-of-main invariant silently breaks.
-#
-# This workflow closes the gap: every 6 hours, it fires the auth
-# surface that auto-sync depends on and emits a red workflow status
-# if AUTO_SYNC_TOKEN has drifted out of validity.
-#
-# ============================================================
-# What this checks (Option B — read-only verify)
-# ============================================================
-#
-# 1. `GET /api/v1/user` against Gitea with the token → validates the
-#    token authenticates AND resolves to `devops-engineer` (catches
-#    the case where the token was regenerated under a different
-#    persona by mistake).
-# 2. `GET /api/v1/repos/molecule-ai/molecule-core` with the token →
-#    validates the token has `read:repository` scope on this repo
-#    (the v2 scope contract — see saved memory
-#    `reference_persona_token_v2_scope`).
-# 3. `git push --dry-run` of the current staging SHA back to
-#    `refs/heads/staging` via `https://oauth2:<token>@<gitea>/...`
-#    → validates the EXACT HTTPS basic-auth path that
-#    `actions/checkout` + `git push origin staging` use inside
-#    auto-sync-main-to-staging.yml. NOP by construction (push the
-#    current tip to itself = "Everything up-to-date"); auth is
-#    checked at the smart-protocol handshake BEFORE the empty-diff
-#    computation, so bad token → exit 128 with "Authentication
-#    failed". `git ls-remote` is NOT used here because Gitea
-#    falls back to anonymous read on public repos and would
-#    silently green-light a rotated token.
-#
-# Each step exits non-zero with an actionable error message if it
-# fails. The workflow status itself is the operator-facing surface.
-#
-# ============================================================
-# What this does NOT check (intentional)
-# ============================================================
-#
-# - **Branch-protection authz** (failure mode C in auto-sync header):
-#   would require an actual write to staging. Already monitored by
-#   `branch-protection-drift.yml` daily. Don't duplicate.
-# - **Conflict resolution** (failure mode A): a real conflict is data-
-#   driven, not auth-driven; can't synthesise it without polluting
-#   staging. Already surfaces immediately on the next main push.
-# - **Concurrency** (failure mode D): handled by workflow concurrency
-#   group on auto-sync, not a credential issue.
-#
-# ============================================================
-# Why Option B (read-only) and not the alternatives
-# ============================================================
-#
-# Considered + rejected (see issue #72 for full write-up):
-#
-# - **Option A — full auto-sync on schedule**: every run creates a
-#   no-op merge commit on staging when main hasn't advanced. 4 noise
-#   commits/day. And races the real `push:` trigger when main has
-#   advanced. Rejected.
-#
-# - **Option C — push to dedicated `auto-sync-canary` branch**: would
-#   exercise authz too, but adds branch noise on Gitea AND requires
-#   maintaining a second branch protection (or expanding staging's
-#   whitelist to a junk branch). Authz already covered by
-#   `branch-protection-drift.yml`. Rejected.
-#
-# Prior art for the chosen Option B shape:
-#   - Cloudflare's `/user/tokens/verify` endpoint (read-only auth
-#     probe explicitly designed for credential canaries).
-#   - AWS Secrets Manager rotation Lambda's `testSecret` step (auth
-#     probe before promoting AWSPENDING → AWSCURRENT).
-#   - HashiCorp Vault's `vault token lookup` for renewal canaries.
-#
-# ============================================================
-# Operator runbook — what to do when this workflow goes RED
-# ============================================================
-#
-# 1. **Identify which step failed**:
-#    - Step "Verify token authenticates as devops-engineer" red →
-#      token is invalid OR resolves to wrong persona.
-#    - Step "Verify token has repo read scope" red → token valid but
-#      stripped of `read:repository` scope (or repo perms changed).
-#    - Step "Verify git HTTPS auth path via no-op dry-run push to
-#      staging" red → token rotated/revoked OR Gitea git-HTTPS
-#      surface is broken (rare). Auth check happens on the
-#      smart-protocol handshake, separate from the API path.
-#
-# 2. **Re-issue the token** on the operator host:
-#    ```
-#    ssh root@5.78.80.188 'docker exec --user git molecule-gitea-1 \
-#      gitea admin user generate-access-token \
-#      --username devops-engineer \
-#      --token-name persona-devops-engineer-vN \
-#      --scopes "read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc"'
-#    ```
-#    Update `/etc/molecule-bootstrap/agent-secrets.env` in place
-#    (per `feedback_unified_credentials_file`). The previous token
-#    file lands at `.bak.<date>`.
-#
-# 3. **Update the repo Actions secret** at:
-#    Settings → Secrets and variables → Actions → AUTO_SYNC_TOKEN
-#    Paste the new token. (Don't echo it in chat — but per
-#    `feedback_passwords_in_chat_are_burned`, a paste in a 1:1
-#    Claude session is within trust boundary.)
-#
-# 4. **Re-run this canary** via workflow_dispatch. Confirm GREEN.
-#
-# 5. **Backfill any missed main → staging syncs** by re-running
-#    `auto-sync-main-to-staging.yml` from its workflow_dispatch
-#    surface, OR by pushing an empty commit to main (if you'd
-#    rather force a real trigger).
-#
-# ============================================================
-# Security notes
-# ============================================================
-#
-# - Token usage: read-only (`GET /api/v1/user`, `GET /api/v1/repos/...`,
-#   `git push --dry-run`). No writes land. Same blast-radius profile as
-#   `actions/checkout` on a public repo.
-# - The token NEVER appears in logs: every `curl` uses a header
-#   variable, never inline; the `git push --dry-run` URL builds the
-#   `oauth2:$TOKEN@host` form into a single env var that's not
-#   echoed. GitHub Actions secret-masking covers anything that does
-#   slip through.
-# - No new token introduced — same `AUTO_SYNC_TOKEN` the workflow
-#   under monitor uses. Per least-privilege we deliberately do NOT
-#   broaden scope for the canary.
-
-on:
-  schedule:
-    # Every 6 hours at :17 (offsets the cron herd at :00). Justification
-    # from issue #72: cheap to run (~5s wall-clock, no quota), 3h average
-    # detection latency, 6h max. 1h would be 6× the runs for marginal
-    # benefit; daily would be 4× longer latency and worse than status
-    # quo on a quiet-main day.
-    - cron: '17 */6 * * *'
-  workflow_dispatch:
-
-# No concurrency group needed — the canary is read-only and idempotent.
-# Two parallel runs (e.g. operator dispatch during a scheduled tick) are
-# harmless: same result, doubled HTTPS calls, no shared state.
-
-permissions:
-  contents: read
-
-jobs:
-  verify-token:
-    name: Verify AUTO_SYNC_TOKEN validity
-    runs-on: ubuntu-latest
-    # 2 min surfaces hangs (Gitea API stall, DNS issue) within one
-    # cron interval. Realistic worst case is ~10s: 2 curls + 1 git
-    # push --dry-run, each capped by the explicit timeouts below.
-    timeout-minutes: 2
-
-    env:
-      # Pinned in env so individual steps can read it without
-      # repeating the secret reference. GitHub masks the value in
-      # logs automatically.
-      AUTO_SYNC_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
-      # MUST stay in sync with auto-sync-main-to-staging.yml's
-      # `git config user.name "devops-engineer"` line. Renaming the
-      # devops-engineer persona requires updating both files (and
-      # the staging branch protection's `push_whitelist_usernames`).
-      EXPECTED_PERSONA: devops-engineer
-      GITEA_HOST: git.moleculesai.app
-      REPO_PATH: molecule-ai/molecule-core
-
-    steps:
-      - name: Verify AUTO_SYNC_TOKEN secret is configured
-        # Schedule-vs-dispatch behaviour split, per
-        # `feedback_schedule_vs_dispatch_secrets_hardening`:
-        #
-        #   - schedule: hard-fail when the secret is missing. The
-        #     whole point of the canary is to surface drift; soft-
-        #     skipping on missing-secret would make the canary
-        #     itself drift-invisible (sweep-cf-orphans #2088 lesson).
-        #   - workflow_dispatch: hard-fail too — there's no scenario
-        #     where an operator wants this canary to silently no-op.
-        #     The workflow has no other ad-hoc utility; if you ran
-        #     it, you wanted the answer.
-        run: |
-          if [ -z "${AUTO_SYNC_TOKEN}" ]; then
-            echo "::error::AUTO_SYNC_TOKEN secret is not set on this repo." >&2
-            echo "::error::Set it at Settings → Secrets and variables → Actions." >&2
-            echo "::error::Without it, auto-sync-main-to-staging.yml will fail every push to main." >&2
-            exit 1
-          fi
-          echo "AUTO_SYNC_TOKEN is configured (value masked)."
-
-      - name: Verify token authenticates as ${{ env.EXPECTED_PERSONA }}
-        # Calls Gitea's `/api/v1/user` — the canonical
-        # auth-probe-with-no-side-effects endpoint (mirrors
-        # Cloudflare's /user/tokens/verify).
-        #
-        # Failure surfaces:
-        #   - HTTP 401: token invalid (rotated, revoked, or never
-        #     correctly registered).
-        #   - HTTP 200 but username != devops-engineer: token was
-        #     regenerated under the wrong persona — this would let
-        #     auth pass but commit attribution would be wrong, and
-        #     branch-protection authz would fail because only
-        #     `devops-engineer` is whitelisted.
-        run: |
-          set -euo pipefail
-          response_file="$(mktemp)"
-          code_file="$(mktemp)"
-          # `--max-time 30`: full call ceiling. `--connect-timeout 10`:
-          # DNS + TCP. `-w "%{http_code}"` routed to a tempfile so curl's
-          # exit code can't pollute the captured status — see
-          # feedback_curl_status_capture_pollution + the
-          # `lint-curl-status-capture.yml` gate that rejects the unsafe
-          # `$(curl ... || echo "000")` shape.
-          set +e
-          curl -sS -o "$response_file" \
-            --max-time 30 --connect-timeout 10 \
-            -w "%{http_code}" \
-            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
-            -H "Accept: application/json" \
-            "https://${GITEA_HOST}/api/v1/user" >"$code_file" 2>/dev/null
-          set -e
-          status=$(cat "$code_file" 2>/dev/null || true)
-          [ -z "$status" ] && status="000"
-
-          if [ "$status" != "200" ]; then
-            echo "::error::Token rotation suspected: GET /api/v1/user returned HTTP $status (expected 200)." >&2
-            echo "::error::Likely cause: AUTO_SYNC_TOKEN has been rotated/revoked on Gitea but the repo Actions secret was not updated." >&2
-            echo "::error::Runbook: see header comment of this workflow file." >&2
-            # Print response body but redact anything that looks like a token.
-            sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
-            exit 1
-          fi
-
-          username=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('login',''))" "$response_file")
-          if [ "$username" != "${EXPECTED_PERSONA}" ]; then
-            echo "::error::Token resolves to user '$username', expected '${EXPECTED_PERSONA}'." >&2
-            echo "::error::AUTO_SYNC_TOKEN must be the devops-engineer persona PAT (not founder PAT, not another persona)." >&2
-            echo "::error::Auto-sync push will fail because only 'devops-engineer' is whitelisted on staging branch protection." >&2
-            exit 1
-          fi
-          echo "Token authenticates as: $username ✓"
-
-      - name: Verify token has repo read scope
-        # `GET /api/v1/repos/<owner>/<repo>` requires `read:repository`
-        # on the persona's v2 scope contract. If the scope was
-        # narrowed/dropped on rotation we catch it here, before the
-        # next main push reveals it via a checkout failure.
-        run: |
-          set -euo pipefail
-          response_file="$(mktemp)"
-          code_file="$(mktemp)"
-          # See first probe step for the rationale on the tempfile-routed
-          # `-w "%{http_code}"` pattern — the unsafe `|| echo "000"` shape
-          # is rejected by lint-curl-status-capture.yml.
-          set +e
-          curl -sS -o "$response_file" \
-            --max-time 30 --connect-timeout 10 \
-            -w "%{http_code}" \
-            -H "Authorization: token ${AUTO_SYNC_TOKEN}" \
-            -H "Accept: application/json" \
-            "https://${GITEA_HOST}/api/v1/repos/${REPO_PATH}" >"$code_file" 2>/dev/null
-          set -e
-          status=$(cat "$code_file" 2>/dev/null || true)
-          [ -z "$status" ] && status="000"
-
-          if [ "$status" != "200" ]; then
-            echo "::error::Token lacks read:repository scope on ${REPO_PATH}: HTTP $status." >&2
-            echo "::error::Auto-sync's actions/checkout step will fail with this token." >&2
-            echo "::error::Re-issue with v2 scope contract: read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc" >&2
-            sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
-            exit 1
-          fi
-          echo "Token has read:repository on ${REPO_PATH} ✓"
-
-      - name: Verify git HTTPS auth path via no-op dry-run push to staging
-        # Final probe: exercise the EXACT auth path that
-        # `actions/checkout` + `git push origin staging` use in
-        # auto-sync-main-to-staging.yml. Gitea's API and git-HTTPS
-        # surfaces share the token-lookup code path internally but
-        # the wire-level error shapes differ — historically (#173)
-        # the API path was healthy while git-HTTPS rejected, so
-        # checking only the API would have given false-green.
-        #
-        # IMPORTANT: `git ls-remote` on a public repo (which
-        # molecule-core is) succeeds even with a junk token because
-        # Gitea falls back to anonymous-read. `ls-remote` therefore
-        # CANNOT validate auth on this surface. We use
-        # `git push --dry-run` instead — push is auth-gated even on
-        # public repos.
-        #
-        # NOP shape: read the current staging SHA via authenticated
-        # ls-remote (the SHA itself is public; auth is incidental
-        # here, used only to colocate the discovery in one step), then
-        # `git push --dry-run <SHA>:refs/heads/staging`. Pushing the
-        # current tip back to itself is "Everything up-to-date" with
-        # exit 0 when auth succeeds. With a bad token Gitea returns
-        # HTTP 401 in the smart-protocol handshake and git exits 128
-        # with "Authentication failed".
-        #
-        # The dry-run never reaches Gitea's pre-receive hook (which
-        # is where branch-protection authz runs), so this probe does
-        # not validate failure mode C. That's intentional —
-        # branch-protection-drift.yml owns authz monitoring; this
-        # canary owns auth.
-        env:
-          # Don't hang waiting for password prompt if auth fails on a
-          # terminal-attached run. (In Actions there's no terminal,
-          # but the env-var hardens against an interactive runner
-          # config.)
-          GIT_TERMINAL_PROMPT: "0"
-        run: |
-          set -euo pipefail
-          # Token is in $AUTO_SYNC_TOKEN (job-level env). Compose the
-          # URL as a local var that's never echoed.
-          url="https://oauth2:${AUTO_SYNC_TOKEN}@${GITEA_HOST}/${REPO_PATH}"
-
-          # Step a: read current staging SHA. ~1KB; auth-gated only
-          # on private repos but always works on public — used here
-          # only to discover the SHA, not to validate auth.
-          staging_ref=$(timeout 30s git ls-remote --refs "$url" refs/heads/staging 2>&1) || {
-            redacted=$(echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
-            echo "::error::ls-remote against staging failed (network/DNS issue):" >&2
-            echo "$redacted" >&2
-            exit 1
-          }
-          if ! echo "$staging_ref" | grep -qE '^[0-9a-f]{40}[[:space:]]+refs/heads/staging$'; then
-            echo "::error::ls-remote returned unexpected shape:" >&2
-            echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g" >&2
-            exit 1
-          fi
-          staging_sha=$(echo "$staging_ref" | awk '{print $1}')
-
-          # Step b: spin up an ephemeral local repo. `git push` always
-          # requires a local repo even when pushing a remote SHA that
-          # isn't in the local object DB (the protocol negotiates and
-          # discovers we don't need to send any objects). We don't use
-          # `actions/checkout` for this — it would clone the whole
-          # repo (~hundreds of MB) for what's essentially `git init`.
-          tmp_repo="$(mktemp -d)"
-          trap 'rm -rf "$tmp_repo"' EXIT
-          git -C "$tmp_repo" init -q
-          # Author config required for any git operation; values are
-          # arbitrary because nothing gets committed here.
-          git -C "$tmp_repo" config user.email canary@auto-sync.local
-          git -C "$tmp_repo" config user.name auto-sync-canary
-
-          # Step c: dry-run push the current staging SHA back to
-          # staging. NOP by construction — the remote tip equals the
-          # SHA we're pushing, so "Everything up-to-date" is the
-          # success path.
-          #
-          # Authentication is checked at the smart-protocol handshake,
-          # BEFORE the dry-run can compute an empty diff. Bad token
-          # → "Authentication failed", exit 128. Good token → exit 0.
-          set +e
-          push_out=$(timeout 30s git -C "$tmp_repo" push --dry-run "$url" "${staging_sha}:refs/heads/staging" 2>&1)
-          push_rc=$?
-          set -e
-
-          if [ "$push_rc" -ne 0 ]; then
-            redacted=$(echo "$push_out" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
-            echo "::error::Token rotation suspected: git push --dry-run against staging failed via the AUTO_SYNC_TOKEN HTTPS auth path (exit $push_rc)." >&2
-            echo "::error::This is the EXACT auth path that actions/checkout + git push use in auto-sync-main-to-staging.yml." >&2
-            echo "::error::Likely cause: AUTO_SYNC_TOKEN was rotated/revoked on Gitea but the repo Actions secret was not updated. Runbook: see header." >&2
-            echo "$redacted" >&2
-            exit 1
-          fi
-
-          echo "git HTTPS auth path: NOP push --dry-run to staging → ${staging_sha:0:8} ✓"
-
-      - name: Summarise canary result
-        # Everything passed — surface a green summary. (Failures
-        # already wrote ::error:: lines and exited above; if we got
-        # here, all three probes passed.)
-        run: |
-          {
-            echo "## Auto-sync canary: GREEN"
-            echo ""
-            echo "AUTO_SYNC_TOKEN is healthy:"
-            echo "- Authenticates as \`${EXPECTED_PERSONA}\` ✓"
-            echo "- Has \`read:repository\` scope on \`${REPO_PATH}\` ✓"
-            echo "- Git HTTPS auth path: no-op dry-run push to \`refs/heads/staging\` succeeds ✓"
-            echo ""
-            echo "Auto-sync main → staging will succeed on the next push to main."
-            echo "If this canary ever goes RED, see the runbook in this workflow's header."
-          } >> "$GITHUB_STEP_SUMMARY"
@@ -1,255 +0,0 @@
-name: Auto-sync main → staging
-
-# Reflects every push to `main` back onto `staging` so the
-# staging-as-superset-of-main invariant holds.
-#
-# ============================================================
-# What this workflow does
-# ============================================================
-#
-# On every push to `main`:
-#   1. Checks if staging already contains main → no-op.
-#   2. Fetches both branches, merges main into staging in the
-#      runner workspace (fast-forward if possible, else
-#      `--no-ff` merge commit).
-#   3. Pushes staging directly to origin via the
-#      `devops-engineer` persona's `AUTO_SYNC_TOKEN`.
-#
-# Authoritative path: a single `git push origin staging` from
-# inside this workflow is the SSOT for advancing staging after
-# a main push. No PR, no merge queue, no human approval —
-# staging is mechanically maintained as a superset of main.
-#
-# `auto-promote-staging.yml` is the reverse-direction
-# counterpart (staging → main, gated on green CI). Together
-# they keep the staging-superset-of-main invariant tight.
-#
-# ============================================================
-# Why direct push (and not "open a PR")
-# ============================================================
-#
-# Pre-2026-05-06 the canonical SCM was GitHub.com, where:
-#   - The `staging` branch had a `merge_queue` ruleset that
-#     blocked ALL direct pushes (no bypass even for org
-#     admins or the GitHub Actions integration).
-#   - Therefore this workflow opened a PR via `gh pr create`
-#     and let auto-merge land it through the queue.
-#
-# Post-2026-05-06 the canonical SCM is Gitea
-# (`git.moleculesai.app/molecule-ai/molecule-core`). Gitea:
-#   - Has no `merge_queue` concept.
-#   - Allows direct push to protected branches via per-user
-#     `push_whitelist_usernames` on the branch protection.
-#   - Does not expose a GraphQL endpoint, so `gh pr create`
-#     returns `HTTP 405 Method Not Allowed
-#     (https://git.moleculesai.app/api/graphql)` — the
-#     pre-suspension architecture cannot work on Gitea.
-#
-# The molecule-ai/molecule-core staging branch protection
-# (verified via `GET /api/v1/repos/.../branch_protections`)
-# whitelists `devops-engineer` for direct push. So the
-# correct Gitea-shape architecture is: authenticate as
-# `devops-engineer`, merge locally, push staging directly.
-#
-# This is structurally simpler than the GitHub-era PR dance
-# and removes the dependence on `gh` CLI / GraphQL entirely.
-#
-# ============================================================
-# Identity + token (anti-bot-ring per saved-memory
-# `feedback_per_agent_gitea_identity_default`)
-# ============================================================
-#
-# This workflow uses `secrets.AUTO_SYNC_TOKEN`, which is a
-# personal access token issued to the `devops-engineer`
-# persona on Gitea — NOT the founder PAT. The bot-ring
-# fingerprint that triggered the GitHub org suspension on
-# 2026-05-06 was characterised by founder PAT acting as CI
-# at machine speed; per-persona identities split the
-# attribution honestly.
-#
-# Token scope on Gitea: repo write. Push target restricted
-# to `staging` (this workflow is the only writer; main is
-# untouched). Compromise blast radius: bounded to staging
-# branch + this repo's read surface.
-#
-# Commits are authored by the persona email
-# `devops-engineer@agents.moleculesai.app` so commit history
-# reflects which automation produced the merge.
-#
-# ============================================================
-# Failure modes & operational notes
-# ============================================================
-#
-# A — staging has commits main doesn't, and the merge
-#     conflicts:
-#     - The `--no-ff` merge step exits non-zero. Workflow
-#       fails red. Operator (devops-engineer or human)
-#       resolves manually:
-#         git fetch origin
-#         git checkout staging
-#         git merge --no-ff origin/main
-#         # resolve conflicts
-#         git push origin staging
-#     - Step summary surfaces the conflict so the failed run
-#       is self-explanatory.
-#
-# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
-#     - `git push` step exits non-zero with `HTTP 401` /
-#       `403`. Step summary surfaces the failed push.
-#     - Re-issue the token from `~/.molecule-ai/personas/`
-#       on the operator host and update the repo Actions
-#       secret. Re-run the workflow.
-#
-# C — staging branch protection no longer whitelists
-#     `devops-engineer`:
-#     - `git push` exits non-zero with a Gitea protected-
-#       branch rejection. Step summary surfaces it.
-#     - Re-add `devops-engineer` to
-#       `push_whitelist_usernames` on the staging
-#       protection (Settings → Branches → staging).
-#
-# D — concurrent push to main while a sync is in flight:
-#     - The `concurrency` group below serialises runs.
-#       The second waits for the first; if main advances
-#       again while we're syncing, the second run picks
-#       up the new tip on its own fetch.
-#
-# ============================================================
-# Loop safety
-# ============================================================
-#
-# The push to staging from this workflow does NOT itself
-# fire a `push: branches: [main]` event (different branch),
-# so there's no risk of self-recursion. `auto-promote-staging.yml`
-# fires on `workflow_run` of CI etc. — it sees the new
-# staging tip on its next gate-completion event, NOT on this
-# push directly. No loop.
-
-on:
-  push:
-    branches: [main]
-  # workflow_dispatch lets operators manually backfill a
-  # missed sync (e.g. if AUTO_SYNC_TOKEN was rotated and a
-  # main push slipped through while the secret was stale).
-  workflow_dispatch:
-
-permissions:
-  contents: write
-
-concurrency:
-  group: auto-sync-main-to-staging
-  cancel-in-progress: false
-
-jobs:
-  sync-staging:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout staging (with devops-engineer push token)
-        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
-        with:
-          fetch-depth: 0
-          ref: staging
-          # AUTO_SYNC_TOKEN authenticates as the
-          # `devops-engineer` Gitea persona — the only
-          # identity whitelisted for direct push to
-          # staging. See header comment for context.
-          token: ${{ secrets.AUTO_SYNC_TOKEN }}
-
-      - name: Configure git author
-        run: |
-          # Per-persona identity, NOT founder PAT.
-          # `feedback_per_agent_gitea_identity_default`.
-          git config user.name "devops-engineer"
-          git config user.email "devops-engineer@agents.moleculesai.app"
-
-      - name: Check if staging already contains main
-        id: check
-        run: |
-          set -euo pipefail
-          git fetch origin main
-          if git merge-base --is-ancestor origin/main HEAD; then
-            echo "needs_sync=false" >> "$GITHUB_OUTPUT"
-            {
-              echo "## No-op"
-              echo
-              echo "staging already contains \`origin/main\` ($(git rev-parse --short=8 origin/main))."
-            } >> "$GITHUB_STEP_SUMMARY"
-          else
-            echo "needs_sync=true" >> "$GITHUB_OUTPUT"
-            MAIN_SHORT=$(git rev-parse --short=8 origin/main)
-            echo "main_short=${MAIN_SHORT}" >> "$GITHUB_OUTPUT"
-            echo "::notice::staging is missing main's tip (${MAIN_SHORT}) — merging in-runner and pushing"
-          fi
-
-      - name: Merge main into staging (in-runner)
-        if: steps.check.outputs.needs_sync == 'true'
-        id: merge
-        run: |
-          set -euo pipefail
-          # Already on staging from checkout. Try fast-forward
-          # first (cleanest history); fall back to merge commit
-          # if staging has commits main doesn't.
-          if git merge --ff-only origin/main; then
-            echo "did_ff=true" >> "$GITHUB_OUTPUT"
-            echo "::notice::Fast-forwarded staging to origin/main"
-          else
-            echo "did_ff=false" >> "$GITHUB_OUTPUT"
-            if ! git merge --no-ff origin/main \
-                -m "chore: sync main → staging (auto, ${{ steps.check.outputs.main_short }})"; then
-              # Hygiene: leave the work tree clean before failing.
-              git merge --abort || true
-              {
-                echo "## Conflict"
-                echo
-                echo "Auto-merge \`main → staging\` failed with conflicts."
-                echo "A human (or devops-engineer persona) needs to resolve manually:"
-                echo
-                echo '```'
-                echo "git fetch origin"
-                echo "git checkout staging"
-                echo "git merge --no-ff origin/main"
-                echo "# resolve conflicts"
-                echo "git push origin staging"
-                echo '```'
-              } >> "$GITHUB_STEP_SUMMARY"
-              exit 1
-            fi
-          fi
-
-      - name: Push staging to origin
-        if: steps.check.outputs.needs_sync == 'true'
-        run: |
-          set -euo pipefail
-          # Direct push to staging. devops-engineer persona is
-          # whitelisted for direct push on the staging branch
-          # protection (Settings → Branches → staging).
-          #
-          # No --force / --force-with-lease: a fast-forward or
-          # legitimate merge commit on top of current staging
-          # is the only thing we'd ever push. If origin/staging
-          # advanced under us (concurrent merge), the push
-          # legitimately rejects and the next run picks up the
-          # new state.
-          if ! git push origin staging; then
-            {
-              echo "## Push rejected"
-              echo
-              echo "Direct push to \`staging\` failed. Likely causes:"
-              echo "- \`AUTO_SYNC_TOKEN\` rotated / wrong scope (HTTP 401/403)"
-              echo "- \`devops-engineer\` no longer in"
-              echo "  \`push_whitelist_usernames\` on the staging"
-              echo "  branch protection (HTTP 422)"
-              echo "- staging advanced concurrently — re-running this"
-              echo "  workflow on the new main tip will pick it up"
-            } >> "$GITHUB_STEP_SUMMARY"
-            exit 1
-          fi
-
-          {
-            echo "## Auto-sync succeeded"
-            echo
-            echo "- staging advanced to: \`$(git rev-parse --short=8 HEAD)\`"
-            echo "- main tip: \`${{ steps.check.outputs.main_short }}\`"
-            echo "- Strategy: $([ "${{ steps.merge.outputs.did_ff }}" = "true" ] && echo "fast-forward" || echo "merge commit")"
-            echo "- Pushed by: \`devops-engineer\` (per-agent persona, anti-bot-ring)"
-          } >> "$GITHUB_STEP_SUMMARY"
@@ -22,9 +22,9 @@ on:
  # spending CI cycles. See e2e-api.yml for the rationale on why this
  # is a single job rather than two-jobs-sharing-name.
  push:
-    branches: [main, staging]
+    branches: [main]
  pull_request:
-    branches: [main, staging]
+    branches: [main]
  workflow_dispatch:
  schedule:
    # Weekly on Sunday 08:00 UTC — catches Chrome / Playwright / Next.js
@@ -32,7 +32,7 @@ name: E2E Staging External Runtime

 on:
  push:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/workspace.go'
      - 'workspace-server/internal/handlers/registry.go'
@@ -44,7 +44,7 @@ on:
      - 'tests/e2e/test_staging_external_runtime.sh'
      - '.github/workflows/e2e-staging-external.yml'
  pull_request:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/workspace.go'
      - 'workspace-server/internal/handlers/registry.go'
@@ -20,13 +20,12 @@ name: E2E Staging SaaS (full lifecycle)
 #     via the same paths watcher that e2e-api.yml uses)

 on:
-  # Fire on staging push too — previously this only ran on main, which
-  # meant the most thorough end-to-end test caught regressions AFTER
-  # they shipped to staging (and then to the auto-promote PR). Running
-  # on staging push catches them BEFORE the staging→main promotion
-  # opens, so a green canary into auto-promote is more meaningful.
+  # Trunk-based (Phase 3 of internal#81): main is the only branch.
+  # Previously this also fired on staging pushes: staging was a
+  # superset of main, so the gate ran ahead of auto-promote. With no
+  # staging branch, main is where E2E gates the deploy.
  push:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/registry.go'
      - 'workspace-server/internal/handlers/workspace_provision.go'
@@ -36,7 +35,7 @@ on:
      - 'tests/e2e/test_staging_full_saas.sh'
      - '.github/workflows/e2e-staging-saas.yml'
  pull_request:
-    branches: [staging, main]
+    branches: [main]
    paths:
      - 'workspace-server/internal/handlers/registry.go'
      - 'workspace-server/internal/handlers/workspace_provision.go'
@@ -36,7 +36,7 @@ on:
  workflow_run:
    workflows: ['publish-workspace-server-image']
    types: [completed]
-    branches: [staging]
+    branches: [main]
  workflow_dispatch:
    inputs:
      target_tag:
@@ -1,276 +0,0 @@
-name: Retarget main PRs to staging
-
-# Mechanical enforcement of SHARED_RULES rule 8 ("Staging-first
-# workflow, no exceptions"). When a bot opens a PR against `main`,
-# retarget it to `staging` automatically and leave an explanatory
-# comment. Human / CEO-authored PRs (the staging→main promotion
-# PRs, etc.) are left alone — they're the authorised exception
-# to the rule.
-#
-# ============================================================
-# What this workflow does
-# ============================================================
-#
-# On `pull_request_target` opened/reopened against `main`:
-#   1. If the PR head is `staging`, skip (the auto-promote PRs
-#      MUST stay base=main).
-#   2. If the PR author is a bot, retarget the PR base to
-#      `staging` via Gitea REST `PATCH /pulls/{N}` body
-#      `{"base":"staging"}`.
-#   3. If the retarget returns 422 "pull request already exists
-#      for base branch 'staging'" (issue #1884 case: another PR
-#      on the same head already targets staging), close the
-#      now-redundant main-PR via Gitea REST instead of failing
-#      red.
-#   4. Post an explainer comment on the retargeted PR via
-#      Gitea REST `POST /issues/{N}/comments`.
-#
-# ============================================================
-# Why Gitea REST (and not `gh api / gh pr close / gh pr comment`)
-# ============================================================
-#
-# Pre-2026-05-06 this workflow used `gh api -X PATCH "repos/{owner}/{repo}/pulls/{N}" -f base=staging`
-# plus `gh pr close` and `gh pr comment`. After the GitHub→Gitea
-# cutover those calls fail because:
-#
-#   - `gh` CLI defaults to `api.github.com`. Even with `GH_HOST`
-#     pointing at Gitea, `gh pr close / comment` route through
-#     GraphQL (`/api/graphql`) which Gitea does not expose.
-#     Empirical: every `gh pr *` call returns
-#     `HTTP 405 Method Not Allowed (https://git.moleculesai.app/api/graphql)`
-#     — same root cause as #65 (auto-sync, fixed in PR #66) and
-#     #73/#195 (auto-promote, fixed in PR #78).
-#   - `gh api -X PATCH /pulls/{N}` happens to use a REST path
-#     that Gitea also has, but the `gh` host-resolution layer
-#     and pagination/retry logic don't always hit Gitea cleanly,
-#     and the cost of switching to direct `curl` is one extra
-#     line of code.
-#
-# So this workflow uses direct `curl` calls to Gitea REST. No
-# `gh` CLI dependency, no GraphQL, no flaky host-resolution.
-#
-# ============================================================
-# Identity + token (anti-bot-ring per saved-memory
-# `feedback_per_agent_gitea_identity_default`)
-# ============================================================
-#
-# Pre-fix this workflow used the per-job ephemeral
-# `secrets.GITHUB_TOKEN`. On Gitea Actions that token has
-# narrow scope and unpredictable cross-PR write capability.
-#
-# Post-fix: `secrets.AUTO_SYNC_TOKEN` (the `devops-engineer`
-# Gitea persona). Same persona used by `auto-sync-main-to-staging.yml`
-# (PR #66) and `auto-promote-staging.yml` (PR #78). Token scope:
-# `push: true` repo write, sufficient for PR-edit + close + comment.
-#
-# Why this token does NOT need branch-protection bypass:
-# patching a PR's base ref is a PR-level operation that does not
-# require push perms on either branch (the PR's own commits stay
-# put; only the metadata changes).
-#
-# ============================================================
-# Failure modes & operational notes
-# ============================================================
-#
-# A — PATCH base→staging returns 422 "pull request already exists"
-#     (issue #1884 case):
-#     - Detected by string-match on response body. Workflow
-#       falls through to closing the now-redundant main-PR
-#       (Gitea REST `PATCH /pulls/{N}` with `state: closed`)
-#       and posts an explanation comment. Step summary surfaces.
-#
-# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
-#     - First REST call returns 401/403. Step summary surfaces.
-#       Re-issue token from `~/.molecule-ai/personas/` on the
-#       operator host and update repo Actions secret.
-#
-# C — PR was deleted between trigger and run:
-#     - REST call returns 404. Workflow exits 0 with a notice
-#       (the rule was already enforced or the PR is gone).
-#
-# D — author is not actually a bot but the filter mis-fires:
-#     - Filter is conservative: only triggers on
-#       `user.type == 'Bot'`, `login` ends with `[bot]`, or
-#       known bot logins (`molecule-ai[bot]`, `app/molecule-ai`).
-#       Human PRs slip through unaffected. If a NEW bot login
-#       starts shipping main-PRs, add it to the filter.
-
-on:
-  pull_request_target:
-    types: [opened, reopened]
-    branches: [main]
-
-permissions:
-  pull-requests: write
-
-jobs:
-  retarget:
-    name: Retarget to staging
-    runs-on: ubuntu-latest
-    # Only fire for bot-authored PRs. Human CEO PRs (staging→main
-    # promotion) are intentional and pass through.
-    #
-    # Head-ref guard: never retarget a PR whose head IS `staging`
-    # — those are the auto-promote staging→main PRs (opened by
-    # `devops-engineer` since PR #78 / #195 fix). Retargeting
-    # head=staging onto base=staging fails with HTTP 422 "no new
-    # commits between base 'staging' and head 'staging'", which
-    # would surface as a noisy red workflow run on every
-    # auto-promote (caught 2026-05-03 on the GitHub-era PR #2588).
-    if: >-
-      github.event.pull_request.head.ref != 'staging'
-      && (
-        github.event.pull_request.user.type == 'Bot'
-        || endsWith(github.event.pull_request.user.login, '[bot]')
-        || github.event.pull_request.user.login == 'app/molecule-ai'
-        || github.event.pull_request.user.login == 'molecule-ai[bot]'
-        || github.event.pull_request.user.login == 'devops-engineer'
-      )
-    steps:
-      - name: Retarget PR base to staging via Gitea REST
-        id: retarget
-        env:
-          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
-          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
-          REPO: ${{ github.repository }}
-          PR_NUMBER: ${{ github.event.pull_request.number }}
-          PR_AUTHOR: ${{ github.event.pull_request.user.login }}
-        # Issue #1884 case: when the bot opens a PR against main
-        # and there's already another PR on the same head branch
-        # targeting staging, Gitea's PATCH returns 422 with a
-        # body mentioning "pull request already exists for base
-        # branch 'staging'" (the Gitea message wording is
-        # slightly different from GitHub's; the substring match
-        # below covers both for forward/back compat).
-        # The retarget can't proceed — but the right response is
-        # to close the now-redundant main-PR, not to fail the
-        # workflow noisily. Detect that specific 422 and close
-        # instead.
-        run: |
-          set -euo pipefail
-
-          API="${GITEA_HOST}/api/v1/repos/${REPO}"
-          AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
-
-          echo "Retargeting PR #${PR_NUMBER} (author: ${PR_AUTHOR}) from main → staging"
-
-          # Curl-status-capture pattern per `feedback_curl_status_capture_pollution`:
-          # http_code via -w to its own scalar, body to a tempfile, set +e/-e
-          # bracket so curl's non-zero-on-4xx doesn't pollute the script's exit chain.
-          BODY_FILE=$(mktemp)
-          REQ='{"base":"staging"}'
-
-          set +e
-          STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-            -X PATCH -d "${REQ}" \
-            -o "${BODY_FILE}" -w "%{http_code}" \
-            "${API}/pulls/${PR_NUMBER}")
-          CURL_RC=$?
-          set -e
-
-          if [ "${CURL_RC}" -ne 0 ]; then
-            echo "::error::curl PATCH failed (rc=${CURL_RC})"
-            rm -f "${BODY_FILE}"
-            exit 1
-          fi
-
-          if [ "${STATUS}" = "201" ] || [ "${STATUS}" = "200" ]; then
-            NEW_BASE=$(jq -r '.base.ref // "?"' < "${BODY_FILE}")
-            rm -f "${BODY_FILE}"
-            if [ "${NEW_BASE}" = "staging" ]; then
-              echo "::notice::Retargeted PR #${PR_NUMBER} → staging"
-              echo "outcome=retargeted" >> "$GITHUB_OUTPUT"
-              exit 0
-            fi
-            echo "::error::PATCH returned ${STATUS} but base.ref is '${NEW_BASE}', not 'staging'"
-            exit 1
-          fi
-
-          # Specifically match the 422 duplicate-base/head error so
-          # any OTHER PATCH failure (auth, deleted PR, etc.) still
-          # surfaces as a real workflow failure.
-          BODY=$(cat "${BODY_FILE}" || true)
-          rm -f "${BODY_FILE}"
-
-          if [ "${STATUS}" = "422" ] && echo "${BODY}" | grep -qE "(pull request already exists for base branch 'staging'|already exists.*base.*staging)"; then
-            echo "::notice::PR #${PR_NUMBER}: duplicate target-staging PR exists on same head — closing this main-PR as redundant."
-
-            # Close the now-redundant main-PR via Gitea REST
-            # (PATCH state=closed). Post comment explaining
-            # rationale BEFORE close so the comment lands on the
-            # PR (commenting on a closed PR works on Gitea, but
-            # historically caused notification ordering surprises).
-
-            CLOSE_BODY_FILE=$(mktemp)
-            CMT_REQ=$(jq -n '{body:"[retarget-bot] Closing — another PR on the same head branch already targets `staging`. This PR is redundant. See issue #1884 for the rationale."}')
-            set +e
-            CMT_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-              -X POST -d "${CMT_REQ}" \
-              -o "${CLOSE_BODY_FILE}" -w "%{http_code}" \
-              "${API}/issues/${PR_NUMBER}/comments")
-            set -e
-            if [ "${CMT_STATUS}" != "201" ]; then
-              echo "::warning::dup-close comment POST returned ${CMT_STATUS}; continuing to close anyway"
-              cat "${CLOSE_BODY_FILE}" | head -c 300 || true
-            fi
-            rm -f "${CLOSE_BODY_FILE}"
-
-            CLOSE_REQ='{"state":"closed"}'
-            CLOSE_RESP=$(mktemp)
-            set +e
-            CL_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-              -X PATCH -d "${CLOSE_REQ}" \
-              -o "${CLOSE_RESP}" -w "%{http_code}" \
-              "${API}/pulls/${PR_NUMBER}")
-            set -e
-            if [ "${CL_STATUS}" = "201" ] || [ "${CL_STATUS}" = "200" ]; then
-              echo "::notice::Closed PR #${PR_NUMBER} as redundant"
-              echo "outcome=closed-as-duplicate" >> "$GITHUB_OUTPUT"
-              rm -f "${CLOSE_RESP}"
-              exit 0
-            fi
-            echo "::error::Failed to close redundant PR: HTTP ${CL_STATUS}"
-            cat "${CLOSE_RESP}" | head -c 300 || true
-            rm -f "${CLOSE_RESP}"
-            exit 1
-          fi
-
-          echo "::error::Retarget PATCH failed and was NOT a duplicate-base error: HTTP ${STATUS}"
-          echo "${BODY}" | head -c 500 >&2
-          exit 1
-
-      - name: Post explainer comment
-        if: steps.retarget.outputs.outcome == 'retargeted'
-        env:
-          GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
-          GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
-          REPO: ${{ github.repository }}
-          PR_NUMBER: ${{ github.event.pull_request.number }}
-        run: |
-          set -euo pipefail
-
-          API="${GITEA_HOST}/api/v1/repos/${REPO}"
-          AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
-
-          # PR comments live on the issue endpoint in Gitea
-          # (PRs ARE issues — same endpoint, different sub-resources
-          # for diffs/files/etc.). The body uses jq to safely
-          # encode the multi-line markdown without shell-quote
-          # nightmares.
-          REQ=$(jq -n '{body:"[retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.\n\n**Why:** per [SHARED_RULES rule 8](https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev/src/branch/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.\n\n**What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.\n\n**If this PR is the CEO`s staging→main promotion:** the Action skipped you (only bot-authored PRs are retargeted, head=staging is also exempted). If you see this comment on your CEO PR, that`s a bug — please tag @hongmingwang."}')
-
-          BODY_FILE=$(mktemp)
-          set +e
-          STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-            -X POST -d "${REQ}" \
-            -o "${BODY_FILE}" -w "%{http_code}" \
-            "${API}/issues/${PR_NUMBER}/comments")
-          set -e
-
-          if [ "${STATUS}" = "201" ]; then
-            echo "::notice::Posted explainer comment on PR #${PR_NUMBER}"
-          else
-            echo "::warning::Failed to post explainer (HTTP ${STATUS}) — retarget itself succeeded"
-            cat "${BODY_FILE}" | head -c 300 || true
-          fi
-          rm -f "${BODY_FILE}"
@@ -0,0 +1,28 @@
+# Top-level Makefile — convenience wrappers around docker compose.
+#
+# Most molecule-core dev work happens via these shortcuts. CI doesn't
+# use this Makefile; CI calls docker compose / go test directly so the
+# Makefile can evolve without breaking the build.
+
+.PHONY: help dev up down logs build test
+
+help: ## Show this help.
+	@grep -E '^[a-zA-Z_-]+:.*?## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-12s\033[0m %s\n", $$1, $$2}'
+
+dev: ## Start the full stack with air hot-reload for the platform service.
+	docker compose -f docker-compose.yml -f docker-compose.dev.yml up
+
+up: ## Start the full stack in production-shape mode (no air, normal Dockerfile).
+	docker compose up
+
+down: ## Stop the stack and remove containers (volumes preserved).
+	docker compose down
+
+logs: ## Tail logs from all services (Ctrl-C to detach).
+	docker compose logs -f
+
+build: ## Force a fresh build of the platform image (no cache).
+	docker compose build --no-cache platform
+
+test: ## Run Go unit tests in workspace-server/.
+	cd workspace-server && go test -race ./...
@@ -0,0 +1,43 @@
+# docker-compose.dev.yml — overlay over docker-compose.yml for local dev
+# with air-driven live reload of the platform (workspace-server) service.
+#
+# Usage:
+#   docker compose -f docker-compose.yml -f docker-compose.dev.yml up
+#   (or `make dev` shorthand from repo root)
+#
+# What this overlay changes vs docker-compose.yml alone:
+#   - Platform service uses workspace-server/Dockerfile.dev (air on top of
+#     golang:1.25-alpine) instead of the multi-stage prod Dockerfile.
+#   - Platform service bind-mounts the host's workspace-server/ source
+#     into /app/workspace-server so air sees source edits live.
+#   - Other services (postgres, redis, langfuse, etc.) inherit unchanged
+#     from docker-compose.yml.
+#
+# What stays the same:
+#   - All env vars, volumes, depends_on, healthchecks from docker-compose.yml.
+#   - Network topology + ports.
+#   - Postgres/Redis as service containers (no in-process replacements).
+
+services:
+  platform:
+    build:
+      context: .
+      dockerfile: workspace-server/Dockerfile.dev
+    # Rebind source: edits under host's workspace-server/ propagate live.
+    # Named volumes persist the Go build + module caches across container recreations.
+    volumes:
+      - ./workspace-server:/app/workspace-server
+      - go-build-cache:/root/.cache/go-build
+      - go-mod-cache:/go/pkg/mod
+    # Air restarts the binary on rebuild; init as PID 1 forwards signals and reaps children.
+    init: true
+    # Mark the service as dev-mode so the platform can short-circuit any
+    # behavior that's incompatible with hot-reload (e.g. background
+    # cron-style watchers that don't survive process restart). No-op
+    # today; reserved for future flag use.
+    environment:
+      MOLECULE_DEV_HOT_RELOAD: "1"
+
+volumes:
+  go-build-cache:
+  go-mod-cache:
@@ -0,0 +1,49 @@
+# .air.toml: live-reload config for local docker-compose dev mode.
+#
+# Active when the platform service runs from workspace-server/Dockerfile.dev
+# (selected via docker-compose.dev.yml overlay). In production, the regular
+# Dockerfile builds a static binary; air is dev-only.
+#
+# Reference: https://github.com/air-verse/air
+
+root = "."
+testdata_dir = "testdata"
+tmp_dir = "tmp"
+
+[build]
+  # Same build invocation as Dockerfile's builder stage minus the
+  # CGO_ENABLED=0 toggle (CGO ok in dev for richer race detector output).
+  cmd = "go build -o ./tmp/server ./cmd/server"
+  bin = "tmp/server"
+  full_bin = ""
+  args_bin = []
+  # Watch .go, .yaml, and .tmpl files under workspace-server/.
+  include_ext = ["go", "yaml", "tmpl"]
+  # Don't watch test fixtures, build artifacts, or vendored deps. Migration .sql
+  # is not in include_ext (needs a clean DB anyway; use docker-compose down/up).
+  exclude_dir = ["assets", "tmp", "vendor", "testdata", "node_modules"]
+  exclude_file = []
+  # _test.go and *_mock.go shouldn't trigger a rebuild — saves cycles.
+  exclude_regex = ["_test\\.go$", "_mock\\.go$"]
+  exclude_unchanged = true
+  follow_symlink = false
+  log = "build-errors.log"
+  # After sending the interrupt, wait 1s before killing the running binary.
+  kill_delay = "1s"
+  send_interrupt = true
+  stop_on_error = true
+  # Debounce: wait this long (milliseconds) after the last change before rebuilding.
+  delay = 500
+
+[log]
+  time = false
+
+[color]
+  main = "magenta"
+  watcher = "cyan"
+  build = "yellow"
+  runner = "green"
+
+[misc]
+  # Don't keep the tmp/ dir around between runs.
+  clean_on_exit = true
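The `exclude_regex` entries above can be sanity-checked against sample file names. The sketch below is illustrative only: `shouldSkip` and the sample paths are hypothetical, and it assumes the patterns are evaluated with Go `regexp` semantics against the file path, which is not confirmed by this diff.

```go
package main

import (
	"fmt"
	"regexp"
)

// excludePatterns mirrors the exclude_regex entries from the config above.
// In TOML the backslash is doubled ("_test\\.go$"); the regex itself is _test\.go$.
var excludePatterns = []*regexp.Regexp{
	regexp.MustCompile(`_test\.go$`),
	regexp.MustCompile(`_mock\.go$`),
}

// shouldSkip is a hypothetical helper (not part of air): it reports whether
// any exclude pattern matches the given path.
func shouldSkip(path string) bool {
	for _, re := range excludePatterns {
		if re.MatchString(path) {
			return true
		}
	}
	return false
}

func main() {
	// Sample paths are made up for illustration.
	for _, p := range []string{
		"internal/handlers/plugins_atomic.go",
		"internal/handlers/plugins_atomic_test.go",
		"internal/store/db_mock.go",
	} {
		fmt.Printf("%-45s skip=%v\n", p, shouldSkip(p))
	}
}
```

Both patterns are anchored with `$`, so a name such as `mock_helpers.go` would still trigger a rebuild.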
@@ -0,0 +1,38 @@
+# Dockerfile.dev — local-development image with air-driven live reload.
+#
+# Selected by docker-compose.dev.yml (overlay over docker-compose.yml).
+# Production stays on workspace-server/Dockerfile (static binary, no air).
+#
+# Workflow:
+#   1. docker compose -f docker-compose.yml -f docker-compose.dev.yml up
+#   2. Edit any .go file under workspace-server/
+#   3. air detects, rebuilds, kills old binary, starts new one (~3-5s)
+#   4. No `docker compose up --build` needed
+#
+# Templates + plugins are NOT pre-cloned here — air-mode assumes the
+# developer's filesystem has the workspace-configs-templates/ + plugins/
+# dirs available, mounted at runtime via docker-compose.dev.yml.
+
+FROM golang:1.25-alpine
+
+# air + git (for go mod) + ca-certs (for TLS) + tzdata (for time-zone DB)
+# + wget + gcc/musl-dev (C toolchain; CGO_ENABLED=1 below needs one).
+RUN apk add --no-cache git ca-certificates tzdata wget gcc musl-dev \
+ && go install github.com/air-verse/air@latest
+
+WORKDIR /app/workspace-server
+
+# Pre-fetch deps so the first `air` rebuild on a fresh container is fast.
+# These are bind-mount-overridden at runtime, so the COPY here is just
+# to warm the module cache.
+COPY workspace-server/go.mod workspace-server/go.sum ./
+RUN go mod download
+
+# Source is bind-mounted at runtime (see docker-compose.dev.yml volumes
+# block) so the Dockerfile doesn't need to COPY it. air watches the
+# bind-mounted dir for changes.
+
+ENV CGO_ENABLED=1
+ENV GOFLAGS="-buildvcs=false"
+
+# Run air with the .air.toml in the bind-mounted source dir.
+CMD ["air", "-c", ".air.toml"]
@@ -6,6 +6,7 @@ package handlers

 import (
 	"fmt"
+	"log"
 	"os"
 	"path/filepath"
 	"regexp"
@@ -102,6 +103,56 @@ func loadWorkspaceEnv(orgBaseDir, filesDir string) map[string]string {
 	return envVars
 }

+// loadPersonaEnvFile merges per-role persona credentials into out. The file
+// lives at $MOLECULE_PERSONA_ROOT/<role>/env (default
+// /etc/molecule-bootstrap/personas) and is populated by the operator-host
+// bootstrap kit — one persona per dev-tree role, each carrying the role's
+// Gitea identity (GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES,
+// GITEA_USER_EMAIL, GITEA_SSH_KEY_PATH).
+//
+// Lower precedence than the org and workspace .env files: callers should
+// invoke this BEFORE parseEnvFile on those, so a workspace .env can
+// override a persona-default value when needed.
+//
+// Silent no-op when role is empty, when the role name fails the safe-segment
+// check, or when the env file does not exist (workspaces without a role —
+// or running on hosts that don't ship the bootstrap dir — keep their old
+// behavior).
+func loadPersonaEnvFile(role string, out map[string]string) {
+	if !isSafeRoleName(role) {
+		if role != "" {
+			log.Printf("Org import: refusing persona env load for unsafe role name %q", role)
+		}
+		return
+	}
+	root := os.Getenv("MOLECULE_PERSONA_ROOT")
+	if root == "" {
+		root = "/etc/molecule-bootstrap/personas"
+	}
+	parseEnvFile(filepath.Join(root, role, "env"), out)
+}
+
+// isSafeRoleName accepts a single path segment of [A-Za-z0-9_-]+. Rejects
+// empty, ".", "..", and anything containing a path separator — even though
+// the construct is admin-only, defense-in-depth keeps the persona dir
+// shape invariant: one flat directory per role, no climbing out.
+func isSafeRoleName(s string) bool {
+	if s == "" || s == "." || s == ".." {
+		return false
+	}
+	for _, c := range s {
+		switch {
+		case c >= 'a' && c <= 'z':
+		case c >= 'A' && c <= 'Z':
+		case c >= '0' && c <= '9':
+		case c == '-' || c == '_':
+		default:
+			return false
+		}
+	}
+	return true
+}
+
 // parseEnvFile reads a .env file and adds KEY=VALUE pairs to the map.
 // Skips comments (#) and empty lines. Values can be quoted.
 func parseEnvFile(path string, out map[string]string) {
@@ -443,10 +443,18 @@ func (h *OrgHandler) createWorkspaceTree(ws OrgWorkspace, parentID *string, absX
 			configFiles["system-prompt.md"] = []byte(ws.SystemPrompt)
 		}

-		// Inject secrets from .env files as workspace secrets.
-		// Resolution: workspace .env → org root .env (workspace overrides org root).
+		// Inject secrets from persona env + .env files as workspace secrets.
+		// Resolution (later overrides earlier):
+		//   0. Persona env (per-role bootstrap creds; only when ws.Role is set
+		//      and the operator-host bootstrap dir ships a matching file)
+		//   1. Org root .env (shared defaults)
+		//   2. Workspace-specific .env (per-workspace overrides)
 		// Each line: KEY=VALUE → stored as encrypted workspace secret.
 		envVars := map[string]string{}
+		// 0. Persona env (lowest precedence; injects the role's Gitea identity:
+		//    GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES, GITEA_USER_EMAIL,
+		//    GITEA_SSH_KEY_PATH). Workspace and org .env can override.
+		loadPersonaEnvFile(ws.Role, envVars)
 		if orgBaseDir != "" {
 			// 1. Org root .env (shared defaults)
 			parseEnvFile(filepath.Join(orgBaseDir, ".env"), envVars)
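The resolution order described in this hunk (persona, then org root, then workspace, with later layers overriding earlier ones) can be sketched as a plain map merge. This is a minimal illustration, not the handler's code: `mergeLayers` and the sample keys and values are hypothetical.

```go
package main

import "fmt"

// mergeLayers is a hypothetical helper: it applies the layers in order, so a
// key present in a later layer overwrites the value from an earlier one.
func mergeLayers(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Sample values are made up; only the ordering matters.
	persona := map[string]string{"GITEA_USER": "dev-lead", "GITEA_TOKEN": "persona-token"}
	org := map[string]string{"LOG_LEVEL": "info"}
	workspace := map[string]string{"GITEA_TOKEN": "workspace-token"}

	env := mergeLayers(persona, org, workspace)
	fmt.Println(env["GITEA_USER"], env["GITEA_TOKEN"], env["LOG_LEVEL"])
}
```

Because precedence comes purely from call order, loading the persona layer after the workspace layer would silently invert it.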
@@ -0,0 +1,171 @@
+package handlers
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+)
+
+// TestLoadPersonaEnvFile_HappyPath: the standard case — a persona-shaped
+// env file exists at <root>/<role>/env and its KEY=VALUE pairs land in
+// the out map. Mirrors what the operator-host bootstrap kit ships:
+// GITEA_USER, GITEA_TOKEN, GITEA_TOKEN_SCOPES, GITEA_USER_EMAIL,
+// GITEA_SSH_KEY_PATH.
+func TestLoadPersonaEnvFile_HappyPath(t *testing.T) {
+	root := t.TempDir()
+	roleDir := filepath.Join(root, "dev-lead")
+	if err := os.MkdirAll(roleDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	envBody := `# Persona env file — mode 600
+GITEA_USER=dev-lead
+GITEA_USER_EMAIL=dev-lead@agents.moleculesai.app
+GITEA_TOKEN=abc123
+GITEA_TOKEN_SCOPES=write:repository,write:issue,read:user
+GITEA_SSH_KEY_PATH=/etc/molecule-bootstrap/personas/dev-lead/ssh_priv
+`
+	if err := os.WriteFile(filepath.Join(roleDir, "env"), []byte(envBody), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("MOLECULE_PERSONA_ROOT", root)
+
+	out := map[string]string{}
+	loadPersonaEnvFile("dev-lead", out)
+
+	want := map[string]string{
+		"GITEA_USER":         "dev-lead",
+		"GITEA_USER_EMAIL":   "dev-lead@agents.moleculesai.app",
+		"GITEA_TOKEN":        "abc123",
+		"GITEA_TOKEN_SCOPES": "write:repository,write:issue,read:user",
+		"GITEA_SSH_KEY_PATH": "/etc/molecule-bootstrap/personas/dev-lead/ssh_priv",
+	}
+	if len(out) != len(want) {
+		t.Fatalf("got %d keys, want %d: %#v", len(out), len(want), out)
+	}
+	for k, v := range want {
+		if out[k] != v {
+			t.Errorf("out[%q] = %q; want %q", k, out[k], v)
+		}
+	}
+}
+
+// TestLoadPersonaEnvFile_MissingDir: when the persona dir doesn't exist
+// (e.g. dev-only host without the bootstrap kit, or a workspace whose
+// role isn't a known persona), it's a silent no-op — out stays empty,
+// no panic, no log noise that would break callers.
+func TestLoadPersonaEnvFile_MissingDir(t *testing.T) {
+	t.Setenv("MOLECULE_PERSONA_ROOT", t.TempDir()) // empty dir
+	out := map[string]string{}
+	loadPersonaEnvFile("nonexistent-role", out)
+	if len(out) != 0 {
+		t.Errorf("expected empty out, got %#v", out)
+	}
+}
+
+// TestLoadPersonaEnvFile_EmptyRole: empty role string is the common case
+// for non-dev workspaces (research/marketing/etc.). Skip silently.
+func TestLoadPersonaEnvFile_EmptyRole(t *testing.T) {
+	t.Setenv("MOLECULE_PERSONA_ROOT", t.TempDir())
+	out := map[string]string{}
+	loadPersonaEnvFile("", out)
+	if len(out) != 0 {
+		t.Errorf("empty role should produce empty out; got %#v", out)
+	}
+}
+
+// TestLoadPersonaEnvFile_RejectsTraversal: even though role names come
+// from server-side admin-only org templates, defense-in-depth — refuse
+// any role string with path separators or "..". Verifies that a maliciously
+// crafted template can't read /etc/passwd by setting role: "../../etc".
+func TestLoadPersonaEnvFile_RejectsTraversal(t *testing.T) {
+	root := t.TempDir()
+	// Plant a file at /tmp/.../env so a bad traversal would reach it
+	if err := os.WriteFile(filepath.Join(root, "env"), []byte("STOLEN=yes\n"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("MOLECULE_PERSONA_ROOT", filepath.Join(root, "personas"))
+
+	for _, bad := range []string{"..", "../personas", "../etc/passwd", "/abs", "with/slash", "dot.in.middle", "with space", "back\\slash", ".", ""} {
+		out := map[string]string{}
+		loadPersonaEnvFile(bad, out)
+		if len(out) != 0 {
+			t.Errorf("role %q should have been rejected; got %#v", bad, out)
+		}
+	}
+}
+
+// TestLoadPersonaEnvFile_DefaultRoot: when MOLECULE_PERSONA_ROOT is unset,
+// the helper falls back to /etc/molecule-bootstrap/personas. We don't
+// touch real /etc — just verify the function doesn't panic and produces
+// empty out (since the test box isn't expected to ship that path).
+func TestLoadPersonaEnvFile_DefaultRoot(t *testing.T) {
+	t.Setenv("MOLECULE_PERSONA_ROOT", "") // explicit empty
+	out := map[string]string{}
+	loadPersonaEnvFile("dev-lead", out)
+	// Don't assert content — production CI might or might not have the
+	// /etc dir mounted. Just verify the call returns cleanly.
+	_ = out
+}
+
+// TestLoadPersonaEnvFile_OverwritesExistingKeys: the contract is "lower
+// precedence than later .env files", but the helper itself does not
+// enforce it. parseEnvFile overwrites existing keys, so calling
+// loadPersonaEnvFile on a pre-populated map clobbers matching entries.
+// Precedence comes from the caller-side ordering (persona → org →
+// workspace): persona is loaded into a fresh map first, then later layers
+// override it. This test pins the overwrite behavior that ordering
+// relies on.
+func TestLoadPersonaEnvFile_OverwritesExistingKeys(t *testing.T) {
+	root := t.TempDir()
+	roleDir := filepath.Join(root, "core-be")
+	if err := os.MkdirAll(roleDir, 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(roleDir, "env"),
+		[]byte("GITEA_TOKEN=persona-value\n"), 0o600); err != nil {
+		t.Fatal(err)
+	}
+	t.Setenv("MOLECULE_PERSONA_ROOT", root)
+
+	out := map[string]string{"GITEA_TOKEN": "preset"}
+	loadPersonaEnvFile("core-be", out)
+
+	// Persona helper is meant to populate a FRESH map first in the
+	// caller's flow; calling it on a pre-populated map and seeing the
+	// value get overwritten is consistent with parseEnvFile semantics.
+	if out["GITEA_TOKEN"] != "persona-value" {
+		t.Errorf("loadPersonaEnvFile did not write into existing map; got %q", out["GITEA_TOKEN"])
+	}
+}
+
+// TestIsSafeRoleName_Acceptance: positive + negative cases for the
+// validator. Pinned because every dev-tree role name must pass.
+func TestIsSafeRoleName_Acceptance(t *testing.T) {
+	good := []string{
+		"dev-lead", "core-be", "cp-security", "infra-runtime-be",
+		"sdk-dev", "plugin-dev", "documentation-specialist",
+		"triage-operator", "fullstack-engineer", "release-manager",
+		"core_underscore_ok", "X", "a1", "Z9-0",
+	}
+	for _, s := range good {
+		if !isSafeRoleName(s) {
+			t.Errorf("isSafeRoleName(%q) = false; want true", s)
+		}
+	}
+	// A trailing hyphen ("trailing-") is allowed, so it is not listed here.
+	bad := []string{
+		"", ".", "..", "with/slash", "/abs", "dot.in.middle",
+		"with space", "back\\slash", "with$dollar", "with?question",
+		"newline\nsplit",
+	}
+	for _, s := range bad {
+		if isSafeRoleName(s) {
+			t.Errorf("isSafeRoleName(%q) = true; want false", s)
+		}
+	}
+}
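To see why the role name is validated before it reaches `filepath.Join`, the sketch below joins unvalidated role strings onto the default persona root and shows where the resulting path lands. It is illustrative only: `personaEnvPath` and `escapesRoot` are hypothetical helpers, and the printed paths assume a Unix-style path separator.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

const personaRoot = "/etc/molecule-bootstrap/personas"

// personaEnvPath is a hypothetical, deliberately unvalidated join of
// root + role + "env". filepath.Join cleans the result, so ".." segments
// are resolved lexically.
func personaEnvPath(role string) string {
	return filepath.Join(personaRoot, role, "env")
}

// escapesRoot reports whether the joined path lands outside personaRoot.
func escapesRoot(role string) bool {
	return !strings.HasPrefix(personaEnvPath(role), personaRoot+"/")
}

func main() {
	for _, role := range []string{"dev-lead", "..", "../../passwd"} {
		fmt.Printf("%-14q -> %-50s escapes=%v\n", role, personaEnvPath(role), escapesRoot(role))
	}
}
```

An allow-list on the role name, as `isSafeRoleName` does, rejects these inputs before any path is built, which is simpler to reason about than checking the joined path afterwards.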
@@ -0,0 +1,207 @@
+package handlers
+
+// plugins_atomic.go — atomic install pattern for plugin delivery into a
+// running workspace container. Closes molecule-core#114.
+//
+// Replaces the prior "tar + docker.CopyToContainer to /configs/plugins/<name>"
+// single-step write (no atomicity, no marker, no rollback) with a 4-step
+// dance:
+//
+//   1. STAGE     — extract tar into /configs/plugins/.staging/<name>.<ts>/
+//   2. SNAPSHOT  — if /configs/plugins/<name>/ exists, mv to .previous/<name>.<ts>/
+//   3. SWAP      — mv /configs/plugins/.staging/<name>.<ts>/ → /configs/plugins/<name>/
+//   4. MARKER    — touch /configs/plugins/<name>/.complete
+//
+// If the swap fails after a snapshot was taken, we attempt a best-effort
+// rollback by mv-ing the previous snapshot back into place. A marker-write
+// failure is NOT rolled back (the new content is already live). The
+// .complete marker is the canonical "this install is fully landed" signal:
+// workspace-side plugin loaders should refuse to load a plugin dir without it.
+//
+// Scope: docker path only (workspace running as a local container). The
+// SaaS path (deliverViaEIC, SSH-into-EC2) is unchanged in this PR; tracked
+// as a follow-up. The same stage-then-swap shape applies but the exec
+// primitives differ (ssh vs docker exec), and shipping both paths in one
+// PR doubles the test surface.
+
+import (
+	"bytes"
+	"context"
+	"fmt"
+	"path"
+	"strings"
+	"time"
+
+	"github.com/docker/docker/api/types/container"
+)
+
+const (
+	pluginsRoot       = "/configs/plugins"
+	pluginsStagingDir = "/configs/plugins/.staging"
+	pluginsPrevDir    = "/configs/plugins/.previous"
+	completeMarker    = ".complete"
+)
+
+// installVersion identifies one install attempt: the plugin name plus a
+// UTC timestamp suffix with second precision. It namespaces the staging
+// dir and any snapshot of the previous version, so installs started in
+// different seconds never share paths. Two installs of the same plugin
+// started within the same second do share them.
+type installVersion struct {
+	plugin string
+	stamp  string // e.g. 20260508T141530Z
+}
+
+func newInstallVersion(plugin string) installVersion {
+	return installVersion{
+		plugin: plugin,
+		stamp:  time.Now().UTC().Format("20060102T150405Z"),
+	}
+}
+
+// stagedPath is the container path where the new content lands during fetch.
+// e.g. /configs/plugins/.staging/molecule-skill-foo.20260508T141530Z
+func (v installVersion) stagedPath() string {
+	return path.Join(pluginsStagingDir, v.plugin+"."+v.stamp)
+}
+
+// previousPath is where the prior live version is moved before swap.
+// e.g. /configs/plugins/.previous/molecule-skill-foo.20260508T141530Z
+func (v installVersion) previousPath() string {
+	return path.Join(pluginsPrevDir, v.plugin+"."+v.stamp)
+}
+
+// livePath is the destination after swap.
+// e.g. /configs/plugins/molecule-skill-foo
+func (v installVersion) livePath() string {
+	return path.Join(pluginsRoot, v.plugin)
+}
+
+// markerPath is the .complete file inside the live dir; it is written last.
+func (v installVersion) markerPath() string {
+	return path.Join(v.livePath(), completeMarker)
+}
+
+// atomicCopyToContainer does a stage→snapshot→swap→marker install of a
+// host-side staged plugin tree into a running container's
+// /configs/plugins/<name>/. Returns nil on success.
+//
+// If the swap fails after a snapshot was taken, a best-effort rollback
+// restores the previous snapshot to the live path. A marker-write failure
+// is not rolled back. In both cases the original error is returned
+// wrapped, and the caller should surface it.
+func (h *PluginsHandler) atomicCopyToContainer(
+	ctx context.Context, containerName, hostDir, pluginName string,
+) error {
+	v := newInstallVersion(pluginName)
+
+	// Step 0a: ensure staging + previous root dirs exist (idempotent).
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"mkdir", "-p", pluginsStagingDir, pluginsPrevDir,
+	}); err != nil {
+		return fmt.Errorf("atomic install: mkdir staging/previous: %w", err)
+	}
+
+	// Step 0b: tar the host content with a path prefix that lands it in the
+	// staging dir — NOT directly into the live name. The prefix has no
+	// leading "/" because docker.CopyToContainer extracts paths relative
+	// to the dstPath argument we pass below.
+	stagedRel := strings.TrimPrefix(v.stagedPath(), "/")
+	tarBuf, err := tarHostDirWithPrefix(hostDir, stagedRel)
+	if err != nil {
+		return fmt.Errorf("atomic install: tar host dir: %w", err)
+	}
+
+	// Step 1: STAGE — extract tar into /configs/plugins/.staging/<name>.<ts>/
+	if err := h.docker.CopyToContainer(ctx, containerName, "/", &tarBuf,
+		container.CopyToContainerOptions{}); err != nil {
+		// Best-effort: clean up any partial staging extract before returning.
+		_, _ = h.execAsRoot(ctx, containerName, []string{
+			"rm", "-rf", v.stagedPath(),
+		})
+		return fmt.Errorf("atomic install: copy to container: %w", err)
+	}
+
+	// Step 2: SNAPSHOT — if a live version exists, move it aside.
+	// `test -d` exits 0 if the dir exists, non-zero otherwise; the helper
+	// returns a non-nil error in the non-zero case which we treat as
+	// "no previous version" rather than a real failure.
+	snapshotted := false
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"test", "-d", v.livePath(),
+	}); err == nil {
+		if _, err := h.execAsRoot(ctx, containerName, []string{
+			"mv", v.livePath(), v.previousPath(),
+		}); err != nil {
+			// Snapshot failure: roll back the staged extract before failing.
+			_, _ = h.execAsRoot(ctx, containerName, []string{
+				"rm", "-rf", v.stagedPath(),
+			})
+			return fmt.Errorf("atomic install: snapshot previous version: %w", err)
+		}
+		snapshotted = true
+	}
+
+	// Step 3: SWAP — atomic rename of the staged dir into the live name.
+	// `mv` on the same filesystem is a single rename(2), atomic at the FS level.
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"mv", v.stagedPath(), v.livePath(),
+	}); err != nil {
+		// Swap failure: roll back if we had a snapshot.
+		if snapshotted {
+			if _, rbErr := h.execAsRoot(ctx, containerName, []string{
+				"mv", v.previousPath(), v.livePath(),
+			}); rbErr != nil {
+				return fmt.Errorf("atomic install: swap failed AND rollback failed: swap=%w, rollback=%v", err, rbErr)
+			}
+		}
+		// Best-effort cleanup of the still-staged dir.
+		_, _ = h.execAsRoot(ctx, containerName, []string{
+			"rm", "-rf", v.stagedPath(),
+		})
+		return fmt.Errorf("atomic install: swap to live path: %w", err)
+	}
+
+	// Step 4: MARKER — touch .complete inside the live dir as the last write.
+	// Workspace-side plugin loaders treat a plugin dir without this marker
+	// as half-installed and skip it (or surface a clear error to the
+	// operator instead of loading a possibly-partial tree).
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"touch", v.markerPath(),
+	}); err != nil {
+		// Marker write failure with the new content already in place is a
+		// weird state: content is fine on disk, but the plugin loader
+		// will refuse to use it. Do NOT roll back, since the content is
+		// the latest, just unmarked; return an error that names the fix.
+		// Operator can manually `touch <plugin>/.complete` to recover.
+		return fmt.Errorf("atomic install: write .complete marker (content landed but unmarked, manual recovery: touch %s): %w", v.markerPath(), err)
+	}
+
+	// Step 5: GC — best-effort delete the previous snapshot. Failures here
+	// just leave a directory; not load-bearing for correctness, the next
+	// install or a separate sweeper will reclaim the space.
+	if snapshotted {
+		_, _ = h.execAsRoot(ctx, containerName, []string{
+			"rm", "-rf", v.previousPath(),
+		})
+	}
+
+	return nil
+}
+
+// tarHostDirWithPrefix walks hostDir and writes a tar to a buffer with
+// every entry's name prefixed by `prefix`. Mirrors the prior streaming
+// shape used in copyPluginToContainer but with a configurable prefix
+// (the prior version hardcoded "plugins/<name>/"; we use a full
+// staging path so the extracted layout is the staging dir directly).
+//
+// Symlinks are skipped, the same posture as streamDirAsTar in
+// plugins_install_pipeline.go. Skipping prevents a hostile plugin from
+// injecting a symlink that, post-extract, points outside the plugin's own dir.
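+func tarHostDirWithPrefix(hostDir, prefix string) (bytes.Buffer, error) {
+	var buf bytes.Buffer
+	tw := newTarWriter(&buf)
+	if err := tarWalk(hostDir, prefix, tw); err != nil {
+		_ = tw.Close()
+		return bytes.Buffer{}, err
+	}
+	// Close before returning buf: a deferred Close would run after the
+	// return value is copied, leaving the tar footer out of the result.
+	if err := tw.Close(); err != nil {
+		return bytes.Buffer{}, err
+	}
+	return buf, nil
+}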
@@ -0,0 +1,70 @@
+package handlers
+
+// plugins_atomic_tar.go — tar-walk helpers split out so the main atomic
+// install flow stays readable. The prefix argument lets the caller
+// arrange where the tar's contents land at extract time.
+
+import (
+	"archive/tar"
+	"io"
+	"os"
+	"path/filepath"
+)
+
+// newTarWriter is a thin wrapper so atomic_test.go can swap the writer
+// destination if it needs to.
+func newTarWriter(w io.Writer) *tar.Writer {
+	return tar.NewWriter(w)
+}
+
+// tarWalk walks hostDir and writes every regular file + dir to the tar
+// writer with paths of the form `<prefix>/<relative>`. Symlinks are
+// skipped — same posture as streamDirAsTar in plugins_install_pipeline.go.
+//
+// The trailing-slash on prefix is normalized away: prefix "foo" and
+// prefix "foo/" produce identical archives.
+func tarWalk(hostDir, prefix string, tw *tar.Writer) error {
+	prefix = filepath.Clean(prefix)
+	return filepath.Walk(hostDir, func(p string, info os.FileInfo, err error) error {
+		if err != nil {
+			return err
+		}
+		if info.Mode()&os.ModeSymlink != 0 {
+			return nil // skip symlinks; see doc above
+		}
+		rel, err := filepath.Rel(hostDir, p)
+		if err != nil {
+			return err
+		}
+		if rel == "." {
+			// Emit the prefix dir itself once, with the source dir's mode.
+			hdr, err := tar.FileInfoHeader(info, "")
+			if err != nil {
+				return err
+			}
+			hdr.Name = prefix + "/"
+			return tw.WriteHeader(hdr)
+		}
+		hdr, err := tar.FileInfoHeader(info, "")
+		if err != nil {
+			return err
+		}
+		hdr.Name = filepath.Join(prefix, rel)
+		if info.IsDir() {
+			hdr.Name += "/"
+		}
+		if err := tw.WriteHeader(hdr); err != nil {
+			return err
+		}
+		if !info.Mode().IsRegular() {
+			return nil
+		}
+		f, err := os.Open(p)
+		if err != nil {
+			return err
+		}
+		defer f.Close()
+		_, err = io.Copy(tw, f)
+		return err
+	})
+}
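The effect of the prefix argument is easiest to see by building a small archive in memory and reading the entry names back. The sketch below is a simplified stand-in for `tarWalk`, not a copy of it: `buildPrefixedTar` and `listNames` are hypothetical helpers that work on an in-memory file map instead of walking a host directory.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"sort"
)

// buildPrefixedTar is a hypothetical helper: it writes each file into a tar
// archive with its name rooted under prefix.
func buildPrefixedTar(prefix string, files map[string]string) (*bytes.Buffer, error) {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic entry order

	buf := &bytes.Buffer{}
	tw := tar.NewWriter(buf)
	for _, name := range names {
		body := files[name]
		hdr := &tar.Header{
			Typeflag: tar.TypeReg,
			Name:     prefix + "/" + name,
			Mode:     0o644,
			Size:     int64(len(body)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	// Close explicitly so the end-of-archive footer is in buf before returning.
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf, nil
}

// listNames reads the archive back and returns the entry names in order.
func listNames(r io.Reader) ([]string, error) {
	tr := tar.NewReader(r)
	var names []string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

func main() {
	buf, err := buildPrefixedTar("configs/plugins/.staging/foo.20260508T141530Z", map[string]string{
		"plugin.yaml":         "name: foo\n",
		"skills/foo/SKILL.md": "# Foo skill\n",
	})
	if err != nil {
		fmt.Println("build:", err)
		return
	}
	names, err := listNames(buf)
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

Extracting such an archive at `/` places every entry under the staging directory, which is the layout the STAGE step relies on.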
@@ -0,0 +1,193 @@
+package handlers
+
+import (
+	"archive/tar"
+	"bytes"
+	"io"
+	"os"
+	"path/filepath"
+	"sort"
+	"strings"
+	"testing"
+	"time"
+)
+
+// TestInstallVersion_Paths: the path helpers must produce a stable shape
+// the in-container exec calls depend on. Pinning the layout here
+// catches a future refactor that accidentally changes where staging /
+// previous / live dirs live, which would break the swap atomicity.
+func TestInstallVersion_Paths(t *testing.T) {
+	v := installVersion{plugin: "molecule-skill-foo", stamp: "20260508T141530Z"}
+
+	if got, want := v.stagedPath(), "/configs/plugins/.staging/molecule-skill-foo.20260508T141530Z"; got != want {
+		t.Errorf("stagedPath = %q; want %q", got, want)
+	}
+	if got, want := v.previousPath(), "/configs/plugins/.previous/molecule-skill-foo.20260508T141530Z"; got != want {
+		t.Errorf("previousPath = %q; want %q", got, want)
+	}
+	if got, want := v.livePath(), "/configs/plugins/molecule-skill-foo"; got != want {
+		t.Errorf("livePath = %q; want %q", got, want)
+	}
+	if got, want := v.markerPath(), "/configs/plugins/molecule-skill-foo/.complete"; got != want {
+		t.Errorf("markerPath = %q; want %q", got, want)
+	}
+}
+
+// TestInstallVersion_StampShape: two newInstallVersion calls within the
+// same second produce the same stamp (we use second precision); the
+// caller relies on the mv-rename being atomic, so collision-free
+// stamping is NOT a correctness requirement. A regression that changes
+// the stamp shape (e.g. RFC3339 with colons) would still break the exec
+// calls: path.Join treats a colon as a regular char, but ssh + docker
+// exec generally don't. Pin the no-colon, no-space shape.
+func TestInstallVersion_StampShape(t *testing.T) {
+	v := newInstallVersion("anything")
+	if strings.Contains(v.stamp, ":") {
+		t.Errorf("stamp must not contain colons (breaks shell-quoting in exec): %q", v.stamp)
+	}
+	if strings.Contains(v.stamp, " ") {
+		t.Errorf("stamp must not contain spaces: %q", v.stamp)
+	}
+	// Sanity: stamp parses as the documented format.
+	if _, err := time.Parse("20060102T150405Z", v.stamp); err != nil {
+		t.Errorf("stamp %q does not parse as 20060102T150405Z: %v", v.stamp, err)
+	}
+}
+
+// TestTarHostDirWithPrefix_HappyPath: walks a host dir, builds a tar with
+// the configured prefix, verifies every entry's name is rooted under
+// the prefix, and the file contents survive round-trip.
+func TestTarHostDirWithPrefix_HappyPath(t *testing.T) {
+	hostDir := t.TempDir()
+
+	// Plant: <host>/plugin.yaml + <host>/skills/foo/SKILL.md + <host>/.complete
+	files := map[string]string{
+		"plugin.yaml":         "name: foo\nversion: 1.0.0\n",
+		"skills/foo/SKILL.md": "# Foo skill\n",
+		".complete":           "", // upstream may already have a marker
+	}
+	for rel, body := range files {
+		full := filepath.Join(hostDir, rel)
+		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
+			t.Fatal(err)
+		}
+		if err := os.WriteFile(full, []byte(body), 0o644); err != nil {
+			t.Fatal(err)
+		}
+	}
+
+	prefix := "configs/plugins/.staging/foo.20260508T141530Z"
+	buf, err := tarHostDirWithPrefix(hostDir, prefix)
+	if err != nil {
+		t.Fatalf("tar: %v", err)
+	}
+
+	// Read back the tar; collect names + body for regular files.
+	got := map[string]string{}
+	tr := tar.NewReader(&buf)
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			t.Fatalf("tar reader: %v", err)
+		}
+		// Every entry must start with the prefix
+		if !strings.HasPrefix(hdr.Name, prefix) {
+			t.Errorf("entry %q does not start with prefix %q", hdr.Name, prefix)
+		}
+		if hdr.Typeflag == tar.TypeReg {
+			body, err := io.ReadAll(tr)
+			if err != nil {
+				t.Fatal(err)
+			}
+			rel := strings.TrimPrefix(hdr.Name, prefix+"/")
+			got[rel] = string(body)
+		}
+	}
+
+	for rel, want := range files {
+		if body, ok := got[rel]; !ok || body != want { // !ok catches a missing empty file such as .complete
+			t.Errorf("body[%q] = %q (present: %v); want %q", rel, body, ok, want)
+		}
+	}
+}
+
+// TestTarHostDirWithPrefix_SkipsSymlinks: a hostile plugin shouldn't be
+// able to ship a symlink that, post-extract, points outside its own
+// dir. The walker silently skips symlinks (same posture as
+// streamDirAsTar). Verify a planted symlink doesn't appear in the tar.
+func TestTarHostDirWithPrefix_SkipsSymlinks(t *testing.T) {
+	hostDir := t.TempDir()
+	// Plant a real file + a symlink pointing outside hostDir.
+	if err := os.WriteFile(filepath.Join(hostDir, "real.txt"), []byte("ok"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	target := filepath.Join(t.TempDir(), "outside")
+	if err := os.WriteFile(target, []byte("SHOULD NOT APPEAR"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.Symlink(target, filepath.Join(hostDir, "evil")); err != nil {
+		t.Fatal(err)
+	}
+
+	buf, err := tarHostDirWithPrefix(hostDir, "p")
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	names := []string{}
+	tr := tar.NewReader(&buf)
+	for {
+		hdr, err := tr.Next()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			t.Fatal(err)
+		}
+		names = append(names, hdr.Name)
+	}
+	sort.Strings(names)
+
+	for _, n := range names {
+		if strings.Contains(n, "evil") {
+			t.Errorf("symlink leaked into tar: %q", n)
+		}
+	}
+	// real.txt should be present
+	found := false
+	for _, n := range names {
+		if strings.HasSuffix(n, "real.txt") {
+			found = true
+			break
+		}
+	}
+	if !found {
+		t.Errorf("real.txt missing from tar; got names: %v", names)
+	}
+}
+
+// TestTarHostDirWithPrefix_PrefixNormalization: trailing slash on prefix
+// should not change the archive shape. Pinning this so a future caller
+// passing "foo/" instead of "foo" doesn't double-slash entry names.
+func TestTarHostDirWithPrefix_PrefixNormalization(t *testing.T) {
+	hostDir := t.TempDir()
+	if err := os.WriteFile(filepath.Join(hostDir, "x"), []byte("y"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	a, err := tarHostDirWithPrefix(hostDir, "foo")
+	if err != nil {
+		t.Fatal(err)
+	}
+	b, err := tarHostDirWithPrefix(hostDir, "foo/")
+	if err != nil {
+		t.Fatal(err)
+	}
+
+	if !bytes.Equal(a.Bytes(), b.Bytes()) {
+		t.Errorf("trailing-slash on prefix changed archive shape; tarHostDirWithPrefix should be slash-insensitive")
+	}
+}
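The prefix-normalization contract pinned by the test above can be illustrated in isolation. This is a minimal sketch, not the PR's code: `entryName` is a hypothetical helper, and it uses the `path` package (always slash-separated) where the walker in this diff uses `filepath`.

```go
package main

import (
	"fmt"
	"path"
)

// entryName is a hypothetical helper (not part of this PR) that builds a
// tar entry name from a prefix and a path relative to the source dir.
// path.Join cleans its result, so "foo" and "foo/" yield the same name.
func entryName(prefix, rel string, isDir bool) string {
	name := path.Join(prefix, rel)
	if isDir {
		name += "/" // directory entries carry a trailing slash
	}
	return name
}

func main() {
	fmt.Println(entryName("foo", "x", false))
	fmt.Println(entryName("foo/", "x", false))
	fmt.Println(entryName("foo/", "skills", true))
}
```

Both spellings of the prefix give identical entry names, which is the property the byte-for-byte archive comparison relies on.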
@@ -0,0 +1,214 @@
+package handlers
+
+// plugins_classifier.go — diff classifier for plugin updates.
+//
+// Closes molecule-core#112. Composes with #114 (atomic install) so the
+// platform can decide *before* triggering restartFunc whether the
+// update is content-only (SKILL.md text changed; agent re-reads at next
+// Skill invocation) or structural (hooks/settings/plugin.yaml/file added
+// or removed; agent must restart to pick up the new state).
+//
+// SKILL.md content is hot-reloadable because Claude Code reads the file
+// on each Skill invocation — no in-memory cache. Hooks and settings.json
+// are loaded at session start and need a session restart. plugin.yaml
+// changes are structural by definition (manifest controls everything
+// else).
+//
+// CLASSIFICATION RULE
+//   classify(staged, live) → "skill-content-only" if and only if
+//     every file present in either tree is one of:
+//       - identical between staged and live, OR
+//       - a **/SKILL.md file with content change (text body modified)
+//     AND no files were added or removed.
+//   Anything else → "cold" (the safe default).
+//
+// The classifier reads live-tree files from inside the container via
+// `docker exec cat`. Comparison is by SHA-256 over file content, not
+// mtime — mtime changes on every install regardless of content.
+
+import (
+	"context"
+	"crypto/sha256"
+	"encoding/hex"
+	"fmt"
+	"io/fs"
+	"os"
+	"path/filepath"
+	"strings"
+)
+
+const (
+	// classifyKindSkillContentOnly: install can skip restartFunc; the
+	// only changes are SKILL.md body text.
+	classifyKindSkillContentOnly = "skill-content-only"
+	// classifyKindCold: must restart the workspace container; structural
+	// or hook/settings change.
+	classifyKindCold = "cold"
+)
+
+// classifyInstallChanges compares the staged plugin tree (host filesystem)
+// against the currently-live plugin tree inside the container. Returns
+// classifyKindSkillContentOnly when the only diff is SKILL.md content
+// changes, classifyKindCold otherwise (added/removed files, hooks/
+// settings.json edits, plugin.yaml edits, anything else).
+//
+// When /configs/plugins/<name> doesn't exist in the container (first
+// install of this plugin), the result is classifyKindCold: there is no
+// live state to hot-reload into.
+func (h *PluginsHandler) classifyInstallChanges(
+	ctx context.Context, containerName, hostStagedDir, pluginName string,
+) (string, error) {
+	livePath := "/configs/plugins/" + pluginName
+
+	// Probe: does live exist? If not, this is a first install — cold.
+	if _, err := h.execAsRoot(ctx, containerName, []string{
+		"test", "-d", livePath,
+	}); err != nil {
+		return classifyKindCold, nil
+	}
+
+	// Build hash maps for both trees.
+	stagedHashes, err := hashLocalTree(hostStagedDir)
+	if err != nil {
+		return classifyKindCold, fmt.Errorf("classifier: hash staged: %w", err)
+	}
+	liveHashes, err := h.hashContainerTree(ctx, containerName, livePath)
+	if err != nil {
+		// Live tree read failure: be conservative, cold-restart.
+		return classifyKindCold, nil
+	}
+
+	// Drop the .complete marker from comparison. The install step writes
+	// it into the live tree last, so a staged tree that lacks it would
+	// otherwise count as "file removed" and force-cold every reinstall.
+	delete(stagedHashes, ".complete")
+	delete(liveHashes, ".complete")
+
+	// Set difference: any file in one but not the other → cold.
+	for rel := range stagedHashes {
+		if _, ok := liveHashes[rel]; !ok {
+			return classifyKindCold, nil // file added
+		}
+	}
+	for rel := range liveHashes {
+		if _, ok := stagedHashes[rel]; !ok {
+			return classifyKindCold, nil // file removed
+		}
+	}
+
+	// Same set of files. Walk the diff.
+	for rel, stagedHash := range stagedHashes {
+		liveHash := liveHashes[rel]
+		if stagedHash == liveHash {
+			continue
+		}
+		// Content differs. Allow if and only if it's a SKILL.md.
+		if !isSkillMarkdown(rel) {
+			return classifyKindCold, nil
+		}
+	}
+	return classifyKindSkillContentOnly, nil
+}
+
+// isSkillMarkdown returns true for any path whose basename is SKILL.md
+// (case-sensitive, matches Claude Code's skill discovery rule).
+func isSkillMarkdown(rel string) bool {
+	return filepath.Base(rel) == "SKILL.md"
+}
+
+// hashLocalTree walks a host directory and returns rel-path → sha256-hex.
+// Symlinks are skipped (same posture as the tar walker).
+func hashLocalTree(root string) (map[string]string, error) {
+	out := map[string]string{}
+	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
+		if err != nil {
+			return err
+		}
+		if d.IsDir() {
+			return nil
+		}
+		info, err := d.Info()
+		if err != nil {
+			return err
+		}
+		if info.Mode()&os.ModeSymlink != 0 {
+			return nil
+		}
+		if !info.Mode().IsRegular() {
+			return nil
+		}
+		rel, err := filepath.Rel(root, p)
+		if err != nil {
+			return err
+		}
+		body, err := os.ReadFile(p)
+		if err != nil {
+			return err
+		}
+		sum := sha256.Sum256(body)
+		out[filepath.ToSlash(rel)] = hex.EncodeToString(sum[:])
+		return nil
+	})
+	if err != nil {
+		return nil, err
+	}
+	return out, nil
+}
+
+// hashContainerTree hashes every regular file under livePath by running
+// `cd <livePath> && find . -type f -print0 | xargs -0 -r sha256sum` via docker exec.
+//
+// The output is parsed line-by-line into rel-path → sha256-hex.
+func (h *PluginsHandler) hashContainerTree(
+	ctx context.Context, containerName, livePath string,
+) (map[string]string, error) {
+	out, err := h.execAsRoot(ctx, containerName, []string{
+		"sh", "-c",
+		// Find regular files, hash each, output `<hex>  ./<relpath>`.
+		// `cd` then `find .` keeps paths relative to livePath.
+		fmt.Sprintf("cd %s && find . -type f -print0 | xargs -0 -r sha256sum 2>/dev/null", shQuote(livePath)),
+	})
+	if err != nil {
+		return nil, fmt.Errorf("hash container tree: %w", err)
+	}
+	hashes := map[string]string{}
+	for _, line := range strings.Split(out, "\n") {
+		line = strings.TrimSpace(line)
+		if line == "" {
+			continue
+		}
+		// sha256sum output: "<hex>  <path>" (two spaces). Path starts with "./".
+		parts := strings.SplitN(line, "  ", 2)
+		if len(parts) != 2 {
+			continue
+		}
+		hash := parts[0]
+		rel := strings.TrimPrefix(parts[1], "./")
+		hashes[rel] = hash
+	}
+	return hashes, nil
+}
+
+// shQuote single-quotes a string for safe insertion into a shell command.
+// Returns the input unchanged if it's already shell-safe (alphanumeric +
+// /._-). Otherwise wraps in single quotes and escapes inner '.
+func shQuote(s string) string {
+	safe := true
+	for _, c := range s {
+		switch {
+		case c >= 'a' && c <= 'z':
+		case c >= 'A' && c <= 'Z':
+		case c >= '0' && c <= '9':
+		case c == '/' || c == '.' || c == '_' || c == '-':
+		default:
+			safe = false
+		}
+		if !safe {
+			break
+		}
+	}
+	if safe && s != "" { // "" must still be quoted, or the argument vanishes from the command
+		return s
+	}
+	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
+}
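The classification rule documented at the top of this file can be restated as a pure function over two hash maps. The sketch below is illustrative only: `classifyHashes` is a hypothetical name, and it assumes both maps are rel-path to sha256-hex with the `.complete` marker already removed.

```go
package main

import (
	"fmt"
	"path"
)

// classifyHashes is a hypothetical, side-effect-free restatement of the
// rule: same file set on both sides, and content differences allowed
// only in files whose basename is SKILL.md.
func classifyHashes(staged, live map[string]string) string {
	if len(staged) != len(live) {
		return "cold" // a file was added or removed
	}
	for rel, stagedHash := range staged {
		liveHash, ok := live[rel]
		if !ok {
			return "cold" // same count, different file set
		}
		if stagedHash != liveHash && path.Base(rel) != "SKILL.md" {
			return "cold" // non-skill content changed
		}
	}
	return "skill-content-only"
}

func main() {
	live := map[string]string{"plugin.yaml": "aa", "skills/foo/SKILL.md": "bb"}
	skillEdit := map[string]string{"plugin.yaml": "aa", "skills/foo/SKILL.md": "cc"}
	manifestEdit := map[string]string{"plugin.yaml": "zz", "skills/foo/SKILL.md": "bb"}
	fmt.Println(classifyHashes(skillEdit, live))
	fmt.Println(classifyHashes(manifestEdit, live))
}
```

Note that an unchanged tree also yields "skill-content-only" under this rule, which matches the handler in this diff: a reinstall with identical content skips the restart.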
@@ -0,0 +1,121 @@
+package handlers
+
+import (
+	"os"
+	"path/filepath"
+	"testing"
+)
+
+// TestIsSkillMarkdown: pin which paths the classifier considers
+// hot-reloadable. SKILL.md by basename only — case-sensitive.
+func TestIsSkillMarkdown(t *testing.T) {
+	yes := []string{
+		"SKILL.md",
+		"skills/foo/SKILL.md",
+		"deeply/nested/SKILL.md",
+	}
+	no := []string{
+		"plugin.yaml",
+		"hooks.json",
+		"settings.json",
+		"README.md",
+		"skill.md",  // case-sensitive
+		"SKILLS.md", // not a skill file
+		"skills/foo/extra.md",
+	}
+	for _, s := range yes {
+		if !isSkillMarkdown(s) {
+			t.Errorf("isSkillMarkdown(%q) = false; want true", s)
+		}
+	}
+	for _, s := range no {
+		if isSkillMarkdown(s) {
+			t.Errorf("isSkillMarkdown(%q) = true; want false", s)
+		}
+	}
+}
+
+// TestHashLocalTree_StableHash: hashing the same content twice must
+// produce identical maps. Pinned because if hashLocalTree ever picks up
+// mtime/inode (e.g. via a refactor to use os.Lstat metadata), every
+// install would classify as cold and we'd lose the hot-reload.
+func TestHashLocalTree_StableHash(t *testing.T) {
+	dir := t.TempDir()
+	if err := os.MkdirAll(filepath.Join(dir, "skills/foo"), 0o755); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(dir, "plugin.yaml"), []byte("name: foo\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.WriteFile(filepath.Join(dir, "skills/foo/SKILL.md"), []byte("# Foo\n"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+
+	h1, err := hashLocalTree(dir)
+	if err != nil {
+		t.Fatal(err)
+	}
+	h2, err := hashLocalTree(dir)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if len(h1) != len(h2) {
+		t.Fatalf("hash count differs: %d vs %d", len(h1), len(h2))
+	}
+	for k, v := range h1 {
+		if h2[k] != v {
+			t.Errorf("hash[%q] differs: %q vs %q", k, v, h2[k])
+		}
+	}
+}
+
+// TestHashLocalTree_SymlinkSkipped: symlinks should not appear in the
+// hash map — same posture as the tar walker. Otherwise a hostile plugin
+// could include a symlink whose hash changes when its target changes,
+// silently flipping classification.
+func TestHashLocalTree_SymlinkSkipped(t *testing.T) {
+	dir := t.TempDir()
+	if err := os.WriteFile(filepath.Join(dir, "real.txt"), []byte("ok"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	target := filepath.Join(t.TempDir(), "target")
+	if err := os.WriteFile(target, []byte("outside"), 0o644); err != nil {
+		t.Fatal(err)
+	}
+	if err := os.Symlink(target, filepath.Join(dir, "link")); err != nil {
+		t.Fatal(err)
+	}
+
+	h, err := hashLocalTree(dir)
+	if err != nil {
+		t.Fatal(err)
+	}
+	if _, exists := h["link"]; exists {
+		t.Errorf("symlink leaked into hash map: %v", h)
+	}
+	if _, exists := h["real.txt"]; !exists {
+		t.Errorf("real.txt missing from hash map: %v", h)
+	}
+}
+
+// TestShQuote: the classifier injects livePath into a shell command via
+// docker exec. Path must be quoted to handle pluginName entries with
+// hyphens (which are safe but exercised here) and any future special-
+// character edge case. Pin the safe-vs-quoted boundary.
+func TestShQuote(t *testing.T) {
+	cases := []struct {
+		in, want string
+	}{
+		{"foo", "foo"},
+		{"/configs/plugins/foo-bar", "/configs/plugins/foo-bar"},
+		{"with space", "'with space'"},
+		{"with'quote", "'with'\\''quote'"},
+		{"$envvar", "'$envvar'"},
+		{"path/with/dots.txt", "path/with/dots.txt"},
+	}
+	for _, tc := range cases {
+		if got := shQuote(tc.in); got != tc.want {
+			t.Errorf("shQuote(%q) = %q; want %q", tc.in, got, tc.want)
+		}
+	}
+}
@@ -91,6 +91,14 @@ func (h *PluginsHandler) Install(c *gin.Context) {
 		return
 	}

+	// Record the install in workspace_plugins (core#113, version-subscription
+	// foundation). Best-effort: a DB write failure or an invalid req.Track is
+	// logged but doesn't fail the install. The plugin IS in the container, so
+	// surfacing a 500 here would mislead the caller about the install state.
+	if err := recordWorkspacePluginInstall(ctx, workspaceID, result.PluginName, result.Source.Raw(), req.Track); err != nil {
+		log.Printf("Plugin install: failed to record %s for %s in workspace_plugins: %v (install succeeded; tracking row missing)", result.PluginName, workspaceID, err)
+	}
+
 	log.Printf("Plugin install: %s via %s → workspace %s (restarting)", result.PluginName, result.Source.Scheme, workspaceID)
 	c.JSON(http.StatusOK, gin.H{
 		"status": "installed",
@@ -114,6 +114,15 @@ type installRequest struct {
 	// When present, resolveAndStage verifies the fetched content matches
 	// before allowing the install to proceed (SAFE-T1102 supply-chain hardening).
 	SHA256 string `json:"sha256,omitempty"`
+	// Track is the version-subscription mode for this install (core#113):
+	//   "none"        — no auto-update tracking (default)
+	//   "tag:vX.Y.Z"  — track a specific version tag
+	//   "tag:latest"  — track latest tag, drift on every new tag
+	//   "sha:<full>"  — pinned, no drift ever
+	// The drift detector (separate component, follow-up) reads
+	// workspace_plugins rows where tracked_ref != 'none' and queues
+	// updates when upstream resolves to a different SHA.
+	Track string `json:"track,omitempty"`
 }

 // stageResult bundles the outputs of resolveAndStage for the caller.
@@ -276,7 +285,22 @@ func (h *PluginsHandler) resolveAndStage(ctx context.Context, req installRequest
 // using NewPluginsHandler without a DB; production wires it in router.go.
 func (h *PluginsHandler) deliverToContainer(ctx context.Context, workspaceID string, r *stageResult) error {
 	if containerName := h.findRunningContainer(ctx, workspaceID); containerName != "" {
-		if err := h.copyPluginToContainer(ctx, containerName, r.StagedDir, r.PluginName); err != nil {
+		// Hot-reload classifier (molecule-core#112) — decide BEFORE the
+		// install whether this update can skip restartFunc. SKILL.md
+		// content changes are filesystem-visible to Claude Code on the
+		// next Skill invocation; hooks / settings.json / plugin.yaml /
+		// added-or-removed files need a container restart.
+		// Classifier reads live tree from container; on any error it
+		// returns classifyKindCold, so we never hot-reload speculatively.
+		kind, _ := h.classifyInstallChanges(ctx, containerName, r.StagedDir, r.PluginName)
+
+		// Atomic stage→snapshot→swap→marker (molecule-core#114).
+		// Replaces the prior single docker.CopyToContainer write that
+		// left a partially-extracted tree on mid-install failure with
+		// no rollback path. atomicCopyToContainer writes a .complete
+		// marker as the last step; workspace-side plugin loaders should
+		// refuse to load a plugin dir without it.
+		if err := h.atomicCopyToContainer(ctx, containerName, r.StagedDir, r.PluginName); err != nil {
 			log.Printf("Plugin install: failed to copy %s to %s: %v", r.PluginName, workspaceID, err)
 			return newHTTPErr(http.StatusInternalServerError, gin.H{"error": "failed to copy plugin to container"})
 		}
@@ -284,7 +308,11 @@ func (h *PluginsHandler) deliverToContainer(ctx context.Context, workspaceID str
 			"chown", "-R", "1000:1000", "/configs/plugins/" + r.PluginName,
 		})
 		if h.restartFunc != nil {
-			go h.restartFunc(workspaceID)
+			if kind == classifyKindSkillContentOnly {
+				log.Printf("Plugin install: %s → workspace %s — SKILL-content-only update, SKIPPING restart", r.PluginName, workspaceID)
+			} else {
+				go h.restartFunc(workspaceID)
+			}
 		}
 		return nil
 	}
@@ -0,0 +1,78 @@
+package handlers
+
+// plugins_tracking.go — workspace_plugins DB tracking for the
+// version-subscription model (core#113).
+//
+// Schema lives in migration 20260508160000_workspace_plugins_tracking.up.sql.
+// This file is the Go-side write surface used at install time to record
+// each plugin's install record. Drift detection / queue / apply are
+// follow-up scope (filed as a separate issue once this lands).
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"strings"
+
+	"github.com/Molecule-AI/molecule-monorepo/platform/internal/db"
+)
+
+// trackedRefValues is the closed set of bare-string values the
+// workspace_plugins.tracked_ref column accepts. Prefixed values
+// ("tag:..." / "sha:...") are validated structurally below.
+var trackedRefValues = map[string]bool{
+	"none": true,
+}
+
+// validateTrackedRef canonicalizes a track string ("" → "none"). Only the shape
+// is checked (any non-empty value after "tag:" / "sha:"); anything else errors.
+//
+// Accepted shapes:
+//
+//	""                — defaults to "none"
+//	"none"            — no tracking
+//	"tag:vX.Y.Z"      — track a specific tag
+//	"tag:latest"      — track latest tag, drift on every new tag
+//	"sha:<full-sha>"  — pinned to commit SHA
+func validateTrackedRef(s string) (string, error) {
+	s = strings.TrimSpace(s)
+	if s == "" {
+		return "none", nil
+	}
+	if trackedRefValues[s] {
+		return s, nil
+	}
+	if strings.HasPrefix(s, "tag:") && len(s) > 4 {
+		return s, nil
+	}
+	if strings.HasPrefix(s, "sha:") && len(s) > 4 {
+		return s, nil
+	}
+	return "", fmt.Errorf("invalid track value %q: expected 'none' | 'tag:vX.Y.Z' | 'tag:latest' | 'sha:<full>'", s)
+}
+
+// recordWorkspacePluginInstall upserts the workspace_plugins row for a
+// plugin install. ON CONFLICT (workspace_id, plugin_name) DO UPDATE so
+// reinstalling the same plugin name (with a possibly-different source or
+// track value) updates the existing row rather than failing.
+func recordWorkspacePluginInstall(
+	ctx context.Context, workspaceID, pluginName, sourceRaw, track string,
+) error {
+	if workspaceID == "" || pluginName == "" || sourceRaw == "" {
+		return errors.New("recordWorkspacePluginInstall: missing required field")
+	}
+	canonicalTrack, err := validateTrackedRef(track)
+	if err != nil {
+		return err
+	}
+	_, err = db.DB.ExecContext(ctx, `
+		INSERT INTO workspace_plugins (workspace_id, plugin_name, source_raw, tracked_ref)
+		VALUES ($1, $2, $3, $4)
+		ON CONFLICT (workspace_id, plugin_name)
+		DO UPDATE SET
+			source_raw  = EXCLUDED.source_raw,
+			tracked_ref = EXCLUDED.tracked_ref,
+			updated_at  = NOW()
+	`, workspaceID, pluginName, sourceRaw, canonicalTrack)
+	return err
+}
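The PR description notes that the follow-up drift detector keys on the `tracked_ref` prefix to decide what kind of resolution to do. A minimal sketch of that split might look like the following; the detector does not exist yet, so `parseTrackedRef` and the `refKind` type are assumptions made purely for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// refKind is a hypothetical enum for the three tracked_ref families.
type refKind string

const (
	refNone refKind = "none"
	refTag  refKind = "tag"
	refSHA  refKind = "sha"
)

// parseTrackedRef (hypothetical) splits a stored tracked_ref into its
// kind and value, mirroring the shapes validateTrackedRef accepts.
func parseTrackedRef(s string) (refKind, string, error) {
	if s == "none" {
		return refNone, "", nil
	}
	kind, value, ok := strings.Cut(s, ":")
	if !ok || value == "" {
		return "", "", fmt.Errorf("malformed tracked_ref %q", s)
	}
	switch refKind(kind) {
	case refTag, refSHA:
		return refKind(kind), value, nil
	}
	return "", "", fmt.Errorf("unknown tracked_ref prefix in %q", s)
}

func main() {
	for _, s := range []string{"none", "tag:latest", "sha:abc123", "latest"} {
		kind, value, err := parseTrackedRef(s)
		fmt.Printf("%q: kind=%q value=%q err=%v\n", s, kind, value, err)
	}
}
```

Because validation at install time rejects bare values such as `latest`, a reader of the column under this scheme only has to handle three cases.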
@@ -0,0 +1,54 @@
+package handlers
+
+import "testing"
+
+// TestValidateTrackedRef: pin the exact set of accepted track values
+// the install endpoint stores. Drift detector reads this column; any
+// value that slips through here without structural validation would
+// silently fail at drift-check time.
+func TestValidateTrackedRef(t *testing.T) {
+	cases := []struct {
+		in   string
+		want string
+		err  bool
+	}{
+		// Defaults
+		{"", "none", false},
+		{"   ", "none", false},
+		{"none", "none", false},
+
+		// Tag shape
+		{"tag:v1.0.0", "tag:v1.0.0", false},
+		{"tag:v0.4.0-gitea.1", "tag:v0.4.0-gitea.1", false},
+		{"tag:latest", "tag:latest", false},
+
+		// SHA shape
+		{"sha:abc123", "sha:abc123", false},
+		{"sha:0123456789abcdef0123456789abcdef01234567", "sha:0123456789abcdef0123456789abcdef01234567", false},
+
+		// Reject malformed
+		{"tag:", "", true},      // empty after prefix
+		{"sha:", "", true},      // empty after prefix
+		{"latest", "", true},    // bare 'latest' is ambiguous (tag? branch?)
+		{"main", "", true},      // bare branch name not allowed
+		{"v1.0.0", "", true},    // missing tag: prefix
+		{"random", "", true},    // not in allowlist
+		{"tag", "", true},       // prefix without separator
+	}
+	for _, tc := range cases {
+		got, err := validateTrackedRef(tc.in)
+		if tc.err {
+			if err == nil {
+				t.Errorf("validateTrackedRef(%q) = (%q, nil); want error", tc.in, got)
+			}
+			continue
+		}
+		if err != nil {
+			t.Errorf("validateTrackedRef(%q) error: %v", tc.in, err)
+			continue
+		}
+		if got != tc.want {
+			t.Errorf("validateTrackedRef(%q) = %q; want %q", tc.in, got, tc.want)
+		}
+	}
+}
@@ -207,20 +207,25 @@ func TestStartSweeper_TransientErrorDoesNotCrashLoop(t *testing.T) {
 	ctx, cancel := context.WithCancel(context.Background())
 	defer cancel()

-	// 50ms ticker so the second cycle fires quickly enough for the test.
-	// We re-export SweepInterval as a const, but tests use the public
-	// StartSweeper that takes its own interval — wait, the public
-	// StartSweeper signature uses the package-level SweepInterval. Hmm,
-	// this means the test takes ~5 minutes. Let me reconsider.
-	//
-	// (We patch the test below to just look at the immediate-sweep call
-	// + an error path, since the immediate call is enough to prove the
-	// "error doesn't crash" contract — the loop continues afterward
-	// regardless of timing.)
+	// Capture metric baseline so we can wait for the error counter to
+	// settle before returning — otherwise this test's leaked metric
+	// write races with the next test's metricDelta() baseline read and
+	// causes a non-deterministic +1 leak (manifests as
+	// TestStartSweeper_RecordsMetricsOnSuccess: "error counter delta=1,
+	// want 0"). cycleDone fires inside the fake's Sweep defer, BEFORE
+	// sweepOnce records the error metric — so cancel() right after
+	// waitForCycle is too early.
+	_, _, deltaError := metricDelta(t)
+
 	go pendinguploads.StartSweeper(ctx, store, time.Hour)

 	// Wait for the first (errored) cycle.
 	store.waitForCycle(t, 1, 2*time.Second)
+	// Wait for the goroutine to record the error metric. After this
+	// returns, sweepOnce has fully completed and a subsequent cancel()
+	// stops the loop on the next select pass with no in-flight metric
+	// writes outstanding.
+	waitForMetricDelta(t, deltaError, 1, 2*time.Second)
 	// Cancel — the goroutine returns cleanly, proving the error path
 	// didn't crash the loop. Without this fix the goroutine would have
 	// either panicked (process abort visible at exit) or stuck (this
@@ -0,0 +1,3 @@
+DROP INDEX IF EXISTS workspace_plugins_tracked_not_none;
+DROP INDEX IF EXISTS workspace_plugins_workspace_name;
+DROP TABLE IF EXISTS workspace_plugins;
@@ -0,0 +1,39 @@
+-- workspace_plugins: per-workspace record of installed plugins, with the
+-- tracked-ref needed for the version-subscription model (core#113).
+--
+-- Today plugin install state is filesystem-only: `/configs/plugins/<name>/`
+-- inside the workspace container. There's no DB record of "what's installed
+-- where, from what source, pinned to what." Drift detection needs one, since
+-- it compares an upstream tag's resolved SHA against the installed one.
+-- This table is that foundation.
+--
+-- This migration is purely additive: existing install paths keep working;
+-- they'll write to this table on next install. Workspaces with plugins
+-- already installed before this migration won't have rows until they're
+-- re-installed (acceptable — the tracking is forward-looking).
+--
+-- tracked_ref values:
+--   'none'         — no auto-update tracking (default)
+--   'tag:vX.Y.Z'   — track a specific version tag
+--   'tag:latest'   — track the latest tag (drift on every new tag)
+--   'sha:<full>'   — pinned to a specific commit SHA (no drift ever)
+--
+-- A subsequent migration adds the plugin_update_queue table once drift
+-- detection lands.
+
+CREATE TABLE IF NOT EXISTS workspace_plugins (
+  id              UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
+  workspace_id    UUID        NOT NULL REFERENCES workspaces(id) ON DELETE CASCADE,
+  plugin_name     TEXT        NOT NULL,
+  source_raw      TEXT        NOT NULL,
+  tracked_ref     TEXT        NOT NULL DEFAULT 'none',
+  installed_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
+  updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
+);
+
+CREATE UNIQUE INDEX IF NOT EXISTS workspace_plugins_workspace_name
+  ON workspace_plugins(workspace_id, plugin_name);
+
+-- Partial index for the drift detector: only scan rows opted into tracking.
+CREATE INDEX IF NOT EXISTS workspace_plugins_tracked_not_none
+  ON workspace_plugins(tracked_ref) WHERE tracked_ref != 'none';