molecule-core/workspace-server
Hongming Wang 317196463a fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) Stops the workspace container synchronously
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:

  1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
     entry → workspace becomes a stale-token candidate.
  2. issueAndInjectToken runs in the goroutine: revokes old tokens,
     issues a fresh one, writes it to /configs/.auth_token.
  3. If the sweeper's predicate-only UPDATE
     `WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
     IssueToken commits but is racing the SELECT-then-UPDATE window,
     it revokes the freshly-issued token alongside the old ones.
  4. Container starts with a now-revoked token → 401 forever.

The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.

Also addresses other findings from the same review:

  - Add `status NOT IN ('removed', 'provisioning')` to the SELECT
    (R2 + first-line C1 defence). 'provisioning' is set synchronously
    in workspace_restart.go before the async re-provision begins, so
    it's a reliable in-flight signal that narrows the candidate set.

  - Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
    that helper revokes EVERY live token unconditionally; the sweeper
    needs "every STALE live token" which is a different (safer)
    operation. Inline the UPDATE so we own the predicate end-to-end.
    Drop the wsauth import (no longer needed in this package).

  - Tighten expectStaleTokenSweepNoOp regex to anchor at start and
    require the status filter, so a future query whose first line
    coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
    silently absorb the helper's expectation (R3).

  - Defensive `if reaper == nil { return }` at top of
    sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
    already short-circuits on nil, a future refactor that wires this
    pass directly without checking would otherwise mass-revoke in
    CP/SaaS mode (F2).

  - Comment in the function explaining why empty likes is intentionally
    NOT a short-circuit (asymmetry with the first two passes is the
    whole point — "no containers running" is the load-bearing case).

  - Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
    asserts the UPDATE shape (predicate present, grace bound). A
    real-Postgres integration test would prove the race resolution
    end-to-end; this catches the regression where someone simplifies
    the UPDATE back to predicate-only.

  - Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.

Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).

Full Go suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:28:50 -07:00
..
cmd/server feat(runtime): native_scheduler skip — primitive #3 of 6 2026-04-26 22:47:00 -07:00
internal fix(orphan-sweeper): close TOCTOU race with issueAndInjectToken on restart 2026-04-27 17:28:50 -07:00
migrations chore: second-pass review polish — symmetry + clearer test fixtures 2026-04-25 08:48:30 -07:00
pkg/provisionhook feat(#1957): wire gh-identity plugin into workspace-server 2026-04-24 15:01:41 +00:00
.ci-force chore: force Platform(Go) CI run on main — validate go vet clean 2026-04-21 15:43:19 +00:00
.gitignore feat(ws-server): pull env from CP on startup 2026-04-19 02:41:15 -07:00
.golangci.yaml chore(workspace-server): add golangci.yaml disabling errcheck 2026-04-24 07:16:54 +00:00
Dockerfile chore: extract ContextMenu Zustand fix + a2a_proxy local-docker SSRF bypass + workspace-server Dockerfile GID entrypoint 2026-04-22 20:00:16 -07:00
Dockerfile.tenant feat(terminal): remote path via aws ec2-instance-connect + pty 2026-04-21 18:13:29 -07:00
entrypoint-tenant.sh fix(security): add USER directive before ENTRYPOINT in all tenant images (#1155) 2026-04-20 23:51:33 +00:00
go.mod deps(redis): bump go-redis/v9 v9.7.0 → v9.7.3 (GHSA-92cp-5422-2mw7) 2026-04-27 06:54:13 -07:00
go.sum deps(redis): bump go-redis/v9 v9.7.0 → v9.7.3 (GHSA-92cp-5422-2mw7) 2026-04-27 06:54:13 -07:00