feat(registry): admin endpoint to revoke a workspace's auth tokens (cross-cloud migration fix) #2738

Merged
devops-engineer merged 1 commit from fix/migrate-revoke-stale-auth-token into main 2026-06-13 09:13:14 +00:00
Member

Problem (verified root cause)

Cross-cloud workspace migration (CP migrate-provider + CP#672) leaves a stale workspace_auth_tokens row so the migrated container's /registry/register 401s forever on SaaS tenants — the workspace serves its agent-card but never re-registers, so its advertised URL never flips to the new box. The migration's health-check only checks "card serves 200", so it falsely reports completed and retires the source → self-heal re-provisions on the original cloud.

Chain: source registers → live token in tenant DB. Migration provisions a fresh container with empty /configs (CP#672 persists only /workspace + /home/agent/.claude, not /configs/.auth_token). Migrated container registers with no bearer → requireWorkspaceToken sees the source's still-live token → 401 (C18 ownership guard, registry.go:413). Nothing revokes it: sweepStaleTokensWithoutContainer only runs in single-tenant Docker mode (orphan_sweeper.go safety filter #1), and the CP migrator bypasses the restart pipeline that would revoke (workspace_restart.go → issueAndInjectToken → wsauth.RevokeAllForWorkspace).
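
The 401 in the chain above can be modeled as a small decision function. This is a sketch under stated assumptions, not the guard at registry.go:413: `registerStatus` and its parameters are hypothetical names, and the branch that accepts a valid bearer is assumed rather than shown in this PR.

```go
package main

import "fmt"

// registerStatus is an illustrative reduction of the register-time check.
//
//	liveTokens:  number of unrevoked workspace_auth_tokens rows for the workspace
//	bearerValid: whether the request carried a bearer matching a live token
//
// Both parameters are hypothetical; the real guard reads them from the
// tenant DB and the request.
func registerStatus(liveTokens int, bearerValid bool) int {
	switch {
	case liveTokens == 0:
		// No live token: the register call is bootstrap-allowed.
		return 200
	case bearerValid:
		// Assumed: a caller presenting a live token is accepted.
		return 200
	default:
		// A live token exists and the caller did not present it.
		return 401
	}
}

func main() {
	fmt.Println("source, first register:      ", registerStatus(0, false))
	fmt.Println("migrated, stale token live:  ", registerStatus(1, false))
	fmt.Println("migrated, after revoke:      ", registerStatus(0, false))
}
```

The middle case is the wedge: the migrated container has an empty /configs, so it can never present the source's token, and nothing moves `liveTokens` back to zero on SaaS tenants.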

(Explains why single-tenant molecules-prod migrations work — the sweeper runs there — while SaaS-tenant migrations wedge.)

Change

POST /admin/workspaces/:id/revoke-auth-tokens (AdminAuth-gated, in the wsAdmin group) → wsauth.RevokeAllForWorkspace. Exposes the same revoke the restart pipeline already does, so the CP migrator (which provisions the target out-of-band) can trigger it during cutover. Idempotent: no live tokens → 200 no-op, so the migrator calls it unconditionally.
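
A minimal sketch of the handler contract described above, using only the standard library. The PR does not show the router framework, the AdminAuth middleware, the signature of wsauth.RevokeAllForWorkspace, or the response body, so `revokeFunc`, `workspaceID`, `revokeOK`, `revokeFail`, `statusFor`, and the JSON fields are all assumptions made for illustration. AdminAuth gating is omitted.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// revokeFunc stands in for wsauth.RevokeAllForWorkspace (assumed signature).
type revokeFunc func(ctx context.Context, workspaceID string) error

// Hypothetical doubles: one succeeds (also covers "no live tokens"), one fails.
var (
	revokeOK   revokeFunc = func(context.Context, string) error { return nil }
	revokeFail revokeFunc = func(context.Context, string) error { return errors.New("db unavailable") }
)

const (
	routePrefix = "/admin/workspaces/"
	routeSuffix = "/revoke-auth-tokens"
)

// workspaceID extracts the :id segment; it returns "" when the segment is
// missing, empty, or the path does not match the route.
func workspaceID(path string) string {
	if len(path) < len(routePrefix)+len(routeSuffix) ||
		!strings.HasPrefix(path, routePrefix) || !strings.HasSuffix(path, routeSuffix) {
		return ""
	}
	id := strings.TrimSuffix(strings.TrimPrefix(path, routePrefix), routeSuffix)
	if strings.Contains(id, "/") {
		return ""
	}
	return strings.TrimSpace(id)
}

func writeJSON(w http.ResponseWriter, code int, body map[string]string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	_ = json.NewEncoder(w).Encode(body)
}

// revokeAuthTokens maps the outcomes to 400 (empty id), 500 (revoke error),
// and 200 (revoked, or nothing to revoke).
func revokeAuthTokens(revoke revokeFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := workspaceID(r.URL.Path)
		if id == "" {
			writeJSON(w, http.StatusBadRequest, map[string]string{"error": "workspace id required"})
			return
		}
		if err := revoke(r.Context(), id); err != nil {
			// Generic error only: no token material or raw DB error in the body.
			writeJSON(w, http.StatusInternalServerError, map[string]string{"workspace_id": id, "error": "revoke failed"})
			return
		}
		writeJSON(w, http.StatusOK, map[string]string{"workspace_id": id, "status": "revoked"})
	}
}

// statusFor sends one POST through the handler and returns the status code.
func statusFor(path string, revoke revokeFunc) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, path, nil)
	revokeAuthTokens(revoke)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("/admin/workspaces/ws-1/revoke-auth-tokens", revokeOK))
	fmt.Println(statusFor("/admin/workspaces//revoke-auth-tokens", revokeOK))
	fmt.Println(statusFor("/admin/workspaces/ws-1/revoke-auth-tokens", revokeFail))
}
```

Returning 200 when there is nothing to revoke is what lets the caller skip any "does a token exist?" pre-check, and returning 500 on a revoke error gives the caller a signal to abort instead of proceeding against a still-wedged target.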

Tests

4 unit tests (happy path, idempotent no-op, empty-id 400, db-error 500). go build ./..., go vet, gofmt clean.

Pairs with

CP-side PR: migrator calls this endpoint during cutover + hardens the migration health-gate to require the real URL-flip before retiring the source.
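
For orientation, the CP-side flow might look roughly like the sketch below. The CP code is not part of this PR, so everything here is hypothetical: the function names, the bearer-style admin header, the polling parameters, and the fake transport used to run the example without a network.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// revokeTokens calls the admin endpoint; any non-200 fails the cutover.
// The "Authorization: Bearer" header is an assumed AdminAuth scheme.
func revokeTokens(ctx context.Context, c *http.Client, baseURL, workspaceID, adminToken string) error {
	url := baseURL + "/admin/workspaces/" + workspaceID + "/revoke-auth-tokens"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+adminToken)
	resp, err := c.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("revoke-auth-tokens: unexpected status %d", resp.StatusCode)
	}
	return nil
}

// waitForURLFlip polls until the advertised URL equals targetURL, giving up
// after the given number of attempts. "Card serves 200" is not enough here.
func waitForURLFlip(ctx context.Context, advertised func(context.Context) (string, error),
	targetURL string, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		if got, err := advertised(ctx); err == nil && got == targetURL {
			return nil
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
	return errors.New("advertised URL did not flip to target")
}

// fakeTransport answers every request with a fixed status (no network).
type fakeTransport struct{ status int }

func (f fakeTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	return &http.Response{StatusCode: f.status, Header: make(http.Header), Body: http.NoBody, Request: r}, nil
}

// simulateCutover revokes, then gates on the URL flip. The fake platform
// starts advertising the new box after flipAfter polls.
func simulateCutover(revokeStatus, flipAfter int) error {
	ctx := context.Background()
	client := &http.Client{Transport: fakeTransport{status: revokeStatus}}
	if err := revokeTokens(ctx, client, "http://tenant.example", "ws-1", "admin-token"); err != nil {
		return err // do not retire the source
	}
	polls := 0
	advertised := func(context.Context) (string, error) {
		polls++
		if polls > flipAfter {
			return "https://new-box", nil
		}
		return "https://old-box", nil
	}
	return waitForURLFlip(ctx, advertised, "https://new-box", 5, time.Millisecond)
}

func main() {
	fmt.Println("revoke ok, flips on 3rd poll:", simulateCutover(200, 2))
	fmt.Println("revoke fails with 500:       ", simulateCutover(500, 0))
	fmt.Println("revoke ok, URL never flips:  ", simulateCutover(200, 99))
}
```

Only the first case returns nil; in the other two the source would stay up, which is the behavior the hardened health-gate is meant to guarantee.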

🤖 Generated with Claude Code

devops-engineer added 1 commit 2026-06-13 09:07:09 +00:00
feat(registry): admin endpoint to revoke a workspace's auth tokens (cross-cloud migration fix)
CI / Python Lint & Test (pull_request) Successful in 4s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 7s
sop-checklist / review-refire (pull_request_target) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
Harness Replays / Harness Replays (pull_request) Successful in 2s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 12s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 4s
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
gate-check-v3 / gate-check (pull_request_target) Failing after 12s
CI / Detect changes (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E API Smoke Test / detect-changes (pull_request) Successful in 23s
CI / Canvas (Next.js) (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 19s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
CI / Canvas Deploy Status (pull_request) Successful in 1s
E2E Chat / detect-changes (pull_request) Successful in 28s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 29s
E2E Chat / E2E Chat (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 34s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 35s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m18s
CI / Platform (Go) (pull_request) Successful in 2m35s
CI / all-required (pull_request) Successful in 4s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 7s
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_review) Successful in 10s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
security-review / approved (pull_request_review) Successful in 9s
audit-force-merge / audit (pull_request_target) Successful in 8s
3bbc846e64
POST /admin/workspaces/:id/revoke-auth-tokens (AdminAuth) revokes every live
workspace_auth_tokens row so the workspace's NEXT /registry/register is
bootstrap-allowed.

Why: cross-cloud migration (CP migrate-provider + CP#672) provisions a FRESH
container that boots with an empty /configs (CP#672 persists only /workspace +
/home/agent/.claude, NOT /configs/.auth_token). The SOURCE box's token is still
live, so the migrated container's /registry/register 401s (C18 ownership guard)
and the workspace wedges — it serves its agent-card but never re-registers, so
its advertised URL never flips to the new box. The single-tenant Docker
deployment self-heals via sweepStaleTokensWithoutContainer, but that sweeper
does not run in CP/SaaS mode, so per-tenant SaaS platforms 401-wedge forever.

The platform's own restart pipeline already revokes correctly
(workspace_restart.go → issueAndInjectToken → wsauth.RevokeAllForWorkspace);
this endpoint exposes the SAME revoke so the CP migrator — which provisions the
target out-of-band, bypassing the restart pipeline — can trigger it during a
cutover. Idempotent (no live tokens → 200 no-op) so the migrator calls it
unconditionally.

Pairs with a CP-side change (migrator calls this endpoint + hardens the
migration health-gate to require the real URL-flip before retiring the source).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-13 09:12:33 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVED: reviewed #2738 at head 3bbc846e.

Correctness/robustness: the new AdminAuth-gated endpoint is wired under the existing wsAdmin group and calls the same wsauth.RevokeAllForWorkspace primitive used by restart token issuance. It is idempotent for zero live rows, returns 400 for an empty id, and surfaces DB failure as 500 so the CP migrator can fail cutover instead of retiring the source into a 401-wedged target.

Security: the route is admin-only and does not expose token material; responses/logs include only the workspace id and generic revoke state/error. SQL is exercised through the existing parameterized revoke helper. Performance/readability: single UPDATE path, small handler, clear tests for happy/no-op/400/500. Required CI is green. /sop-ack

Member

/sop-ack

devops-engineer merged commit 0ebfb5d27e into main 2026-06-13 09:13:14 +00:00
Member

APPROVED (post-merge verification; PR was already merged when I fetched it). Head 3bbc846e64f729b56e8d536ba79a1a7914dd1284.

5-axis review: scope is tight: one new admin handler, route wiring under the existing wsAdmin group, and focused tests. Behavior is correct for the cross-cloud migration wedge: it calls the existing wsauth.RevokeAllForWorkspace helper so the next /registry/register can bootstrap when no live token remains, while keeping the operation idempotent for already-revoked/never-registered workspaces. Security boundary is appropriate: the route is admin-auth gated, not exposed through workspace bearer auth; failures return 500 so migrators do not silently retire a source against a still-wedged target. Tests cover happy path, no-live-token idempotency, empty id 400, and DB error 500. No unrelated production surface changed.

Fresh status on the merged head showed only stale gate/SOP contexts; the PR itself is already merged.

/sop-ack

Reference: molecule-ai/molecule-core#2738