forked from molecule-ai/molecule-core
Independent code review caught a real bug in the previous commit's
stale-token revoke pass. The platform's restart endpoint
(workspace_restart.go:104) Stops the workspace container synchronously
then dispatches re-provisioning to a goroutine (line 173). For a
workspace that's been idle past the 5-minute grace window — extremely
common: user comes back to a long-idle workspace and clicks Restart —
this opens a race window:
1. Container stopped → ListWorkspaceContainerIDPrefixes returns no
entry → workspace becomes a stale-token candidate.
2. issueAndInjectToken runs in the goroutine: revokes old tokens,
issues a fresh one, writes it to /configs/.auth_token.
3. If the sweeper's predicate-only UPDATE
`WHERE workspace_id = $1 AND revoked_at IS NULL` runs AFTER
IssueToken commits but is racing the SELECT-then-UPDATE window,
it revokes the freshly-issued token alongside the old ones.
4. Container starts with a now-revoked token → 401 forever.
The fix carries the SAME staleness predicate from the SELECT into the
per-workspace UPDATE: a token created within the grace window can't
match `< now() - grace` and is automatically excluded. The operation
is now idempotent against fresh inserts.
Also addresses other findings from the same review:
- Add `status NOT IN ('removed', 'provisioning')` to the SELECT
(R2 + first-line C1 defence). 'provisioning' is set synchronously
in workspace_restart.go before the async re-provision begins, so
it's a reliable in-flight signal that narrows the candidate set.
- Stop calling wsauth.RevokeAllForWorkspace from the sweeper —
that helper revokes EVERY live token unconditionally; the sweeper
needs "every STALE live token" which is a different (safer)
operation. Inline the UPDATE so we own the predicate end-to-end.
Drop the wsauth import (no longer needed in this package).
- Tighten expectStaleTokenSweepNoOp regex to anchor at start and
require the status filter, so a future query whose first line
coincidentally starts with "SELECT DISTINCT t.workspace_id" can't
silently absorb the helper's expectation (R3).
- Defensive `if reaper == nil { return }` at top of
sweepStaleTokensWithoutContainer — even though StartOrphanSweeper
already short-circuits on nil, a future refactor that wires this
pass directly without checking would otherwise mass-revoke in
CP/SaaS mode (F2).
- Comment in the function explaining why empty likes is intentionally
NOT a short-circuit (asymmetry with the first two passes is the
whole point — "no containers running" is the load-bearing case).
- Add TestSweepOnce_StaleTokenRevokeUsesStalenessPredicate that
asserts the UPDATE shape (predicate present, grace bound). A
real-Postgres integration test would prove the race resolution
end-to-end; this catches the regression where someone simplifies
the UPDATE back to predicate-only.
- Add TestSweepStaleTokens_NilReaperEarlyExit pinning the F2 guard.
Existing tests updated to match the new query/UPDATE shape with tight
regexes that pin all the safety guards (status filter, staleness
predicate in both SELECT and UPDATE).
Full Go suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| artifacts | ||
| bundle | ||
| channels | ||
| crypto | ||
| db | ||
| envx | ||
| events | ||
| handlers | ||
| imagewatch | ||
| metrics | ||
| middleware | ||
| models | ||
| orgtoken | ||
| plugins | ||
| provisioner | ||
| registry | ||
| router | ||
| scheduler | ||
| supervised | ||
| ws | ||
| wsauth | ||