molecule-core/workspace-server/internal/registry
Hongming Wang 317196463a fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) Stops the workspace container synchronously
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:

  1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
     entry → workspace becomes a stale-token candidate.
  2. issueAndInjectToken runs in the goroutine: revokes old tokens,
     issues a fresh one, writes it to /configs/.auth_token.
  3. If the sweeper's predicate-only UPDATE
     `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
     IssueToken commits but is racing the SELECT-then-UPDATE window,
     it revokes the freshly-issued token alongside the old ones.
  4. Container starts with a now-revoked token → 401 forever.

The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.

Also addresses other findings from the same review:

  - Add `status NOT IN ('removed', 'provisioning')` to the SELECT
    (R2 + first-line C1 defence). 'provisioning' is set synchronously
    in workspace_restart.go before the async re-provision begins, so
    it's a reliable in-flight signal that narrows the candidate set.

  - Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
    that helper revokes EVERY live token unconditionally; the sweeper
    needs "every STALE live token" which is a different (safer)
    operation. Inline the UPDATE so we own the predicate end-to-end.
    Drop the wsauth import (no longer needed in this package).

  - Tighten expectStaleTokenSweepNoOp regex to anchor at start and
    require the status filter, so a future query whose first line
    coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
    silently absorb the helper's expectation (R3).

  - Defensive `if reaper == nil { return }` at top of
    sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
    already short-circuits on nil, a future refactor that wires this
    pass directly without checking would otherwise mass-revoke in
    CP/SaaS mode (F2).

  - Comment in the function explaining why empty likes is intentionally
    NOT a short-circuit (asymmetry with the first two passes is the
    whole point — "no containers running" is the load-bearing case).

  - Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
    asserts the UPDATE shape (predicate present, grace bound). A
    real-Postgres integration test would prove the race resolution
    end-to-end; this catches the regression where someone simplifies
    the UPDATE back to predicate-only.

  - Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.

Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).

Full Go suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:28:50 -07:00
..
access_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
access.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
healthsweep_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
healthsweep.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
hibernation_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
hibernation.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
liveness_test.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
liveness.go chore: open-source restructure — rename dirs, remove internal files, scrub secrets 2026-04-18 00:24:44 -07:00
orphan_sweeper_test.go fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart 2026-04-27 17:28:50 -07:00
orphan_sweeper.go fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart 2026-04-27 17:28:50 -07:00
provisiontimeout_test.go fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min 2026-04-26 01:44:09 -07:00
provisiontimeout.go fix(registry): runtime-aware provision-timeout sweep — give hermes 30 min 2026-04-26 01:44:09 -07:00