fix(workspace-server): central codex OAuth refresher (single-owner, anti-burn) #2023
Central codex OAuth refresher (single-owner, anti-burn)
Pairs with template PR fix/codex-oauth-resync. Together they stop the codex token-burn.
Root-cause not symptom
N codex workspaces share ONE ChatGPT-Pro OAuth token (global_secrets CODEX_AUTH_JSON). OpenAI's refresh_token is single-use; each per-agent codex app-server refreshing on its own 401 burned the shared seed within seconds (token_invalidated + "refresh token already used"). Fix: a single platform-side owner of the refresh; workspaces only GET the current token (template side), never rotate it.
Comprehensive testing performed
internal/codexauth/refresher_test.go: structural single-flight; rotate+write-back then re-skip-when-fresh; skip-when-fresh (no POST); no-secret-inert; permanent-fail (invalid_grant) → no write-back + no retry-storm; JWT exp parse. go build ./... && go vet ./... && go test ./internal/codexauth/... all green; full go test ./... + -tags=integration green in dev.
Local-postgres E2E run
N/A — refresher uses an injected db/http in tests; no schema/migration change.
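For reviewers who want the skip-when-fresh and JWT exp checks in concrete form, here is a minimal sketch of how a freshness decision could be derived from the access token's exp claim. It is not the code in internal/codexauth: the names jwtExpiry, needsRefresh and fakeJWT are hypothetical, the signature is deliberately not verified (only the expiry is read), and the five-minute margin is an assumed value.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// jwtExpiry reads the "exp" claim from a JWT payload.
// It does NOT verify the signature; it only decodes the middle segment.
func jwtExpiry(token string) (time.Time, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, fmt.Errorf("decode payload: %w", err)
	}
	var claims struct {
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, fmt.Errorf("parse claims: %w", err)
	}
	if claims.Exp == 0 {
		return time.Time{}, errors.New("missing exp claim")
	}
	return time.Unix(claims.Exp, 0), nil
}

// needsRefresh reports whether now is inside the safety margin before exp
// (or already past exp). Outside the margin the refresher would skip the POST.
func needsRefresh(exp, now time.Time, margin time.Duration) bool {
	return !now.Before(exp.Add(-margin))
}

// fakeJWT builds an unsigned token for illustration and tests only.
func fakeJWT(exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	header := enc([]byte(`{"alg":"none"}`))
	payload := enc([]byte(fmt.Sprintf(`{"exp":%d}`, exp)))
	return header + "." + payload + ".sig"
}

func main() {
	now := time.Now()
	margin := 5 * time.Minute // assumed safety margin

	for _, ttl := range []time.Duration{time.Hour, 2 * time.Minute} {
		exp, err := jwtExpiry(fakeJWT(now.Add(ttl).Unix()))
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("expires in %v, refresh due: %v\n", ttl, needsRefresh(exp, now, margin))
	}
}
```

Keeping the decision a pure function of (exp, now, margin) is what makes it testable with an injected clock, in the same spirit as the injected db/http mentioned above.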
Staging-smoke verified or pending
Pending post-merge: fleet-rollout to agents-team prod tenant, then verify both codex agents complete a real turn on the user's OAuth (rolled to one agent first).
Five-Axis review walked
Correctness: one goroutine + package mutex = structural single-flight; refreshes only within expiry safety-margin. Readability: mirrors existing Start* sweep wiring. Architecture: co-located with background sweeps under supervised.RunWithRecover. Security: no token values logged; encrypted write-back via existing crypto. Performance: at most one POST per due cycle; inert when secret absent.
No backwards-compat shim / dead code added
No shim: additive goroutine. (Template side gates the legacy per-agent watchdog behind CODEX_AUTH_REFRESH_OWNER=1, reserved for a single owner, not a compat shim.)
Memory/saved-feedback consulted
project_codex_billing_mode_byok_default_wedge, project_codex_provider_ssot_split (the auth_env/providers re-sync is a SEPARATE follow-up PR, deliberately decoupled from this low-risk burn-fix to keep SSOT derivation changes out of the critical-path deploy).
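As an illustration of the single-owner design described in this PR, the following sketch shows one way a mutex-guarded refresher with a permanent-failure latch could be structured. It is an assumption-laden outline, not the merged implementation: Refresher, Token, Tick, ErrPermanent and the injected rotate/save functions are all hypothetical names, with rotate standing in for the token POST and save for the encrypted write-back.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrPermanent marks a failure that retrying cannot fix (for example invalid_grant).
var ErrPermanent = errors.New("permanent refresh failure")

// Token is a hypothetical in-memory view of the shared credential.
type Token struct {
	Refresh string
	Expiry  time.Time
}

// Refresher is the only component allowed to spend the refresh token.
type Refresher struct {
	mu     sync.Mutex
	tok    Token
	dead   bool          // latched after a permanent failure
	margin time.Duration // refresh only when expiry is this close
	rotate func(refresh string) (Token, error) // injected stand-in for the token POST
	save   func(Token) error                   // injected stand-in for encrypted write-back
}

// Tick runs one refresh cycle. The mutex serializes callers, so a token that
// was just rotated is seen as fresh by everyone queued behind the lock.
func (r *Refresher) Tick(now time.Time) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.dead {
		return ErrPermanent // latched: never re-POST a dead token
	}
	if now.Before(r.tok.Expiry.Add(-r.margin)) {
		return nil // still fresh: no POST
	}
	next, err := r.rotate(r.tok.Refresh)
	if err != nil {
		if errors.Is(err, ErrPermanent) {
			r.dead = true
		}
		return err // no write-back on failure
	}
	r.tok = next // the old refresh token is spent, so keep the new one in memory
	return r.save(next)
}

// concurrentRotations fires n concurrent ticks at an expired token and
// returns how many of them reached the rotate call.
func concurrentRotations(n int) int {
	now := time.Unix(1700000000, 0)
	calls := 0
	r := &Refresher{margin: 5 * time.Minute, save: func(Token) error { return nil }}
	r.rotate = func(string) (Token, error) {
		calls++ // safe: only called while r.mu is held
		return Token{Refresh: "next", Expiry: now.Add(time.Hour)}, nil
	}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); _ = r.Tick(now) }()
	}
	wg.Wait()
	return calls
}

// permanentFailureRotations ticks n times against a rotate that always
// fails permanently and returns how many POST attempts were made.
func permanentFailureRotations(n int) int {
	calls := 0
	r := &Refresher{save: func(Token) error { return nil }}
	r.rotate = func(string) (Token, error) { calls++; return Token{}, ErrPermanent }
	for i := 0; i < n; i++ {
		_ = r.Tick(time.Unix(1700000000, 0))
	}
	return calls
}

func main() {
	fmt.Println("rotations under contention:", concurrentRotations(8))
	fmt.Println("rotations after permanent failure:", permanentFailureRotations(3))
}
```

In this sketch both counts come out as 1: the first caller through the lock rotates and everyone behind it sees a fresh token, and once the latch is set no further POST is attempted. One caveat of the outline is that a failed write-back is reported but not retried on later ticks, since the in-memory token already looks fresh.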
/sop-ack comprehensive-testing
/sop-ack local-postgres-e2e
/sop-ack staging-smoke
/sop-ack root-cause
/sop-ack five-axis-review
/sop-ack no-backwards-compat
/sop-ack memory-consulted
qa: refresher_test.go covers single-flight/rotate-writeback/skip-fresh/permanent-fail/JWT-exp; template test_auth_sync proves GET-only. Approving.
security: verified no token values logged (only error %v + value-less success), single-flight mutex prevents double-POST of the single-use refresh_token, permanent-failure latch prevents hot-looping a dead token, encrypted write-back, bearer in header not URL. Agents are GET-only. Approving.
/qa-recheck
/security-recheck