fix(workspace-server): central codex OAuth refresher (single-owner, anti-burn) #2023

Merged
devops-engineer merged 1 commit from fix/codex-central-refresher into main 2026-05-31 19:52:22 +00:00
Member

Central codex OAuth refresher (single-owner, anti-burn)

Pairs with template PR fix/codex-oauth-resync. Together they stop the codex token-burn.

Root-cause not symptom

N codex workspaces share ONE ChatGPT-Pro OAuth token (global_secrets CODEX_AUTH_JSON). OpenAI's refresh_token is single-use; each per-agent codex app-server refreshing on its own 401 burned the shared seed within seconds (token_invalidated + "refresh token already used"). Fix: a single platform-side owner of the refresh; workspaces only GET the current token (template side), never rotate it.
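To make the burn concrete, here is a minimal simulation of the failure mode, not the production code. `fakeIssuer` is a hypothetical stand-in for an OAuth server that rotates single-use refresh tokens, and `stormRefresh` is an illustrative helper; neither exists in this repo. The model is simplified: a real issuer may additionally invalidate the whole token chain on reuse.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAlreadyUsed mimics the "refresh token already used" error quoted above.
var errAlreadyUsed = errors.New("refresh token already used")

// fakeIssuer is a hypothetical OAuth server with single-use refresh tokens:
// every successful refresh invalidates the token that was presented.
type fakeIssuer struct {
	mu      sync.Mutex
	current string
	serial  int
}

// refresh accepts only the current refresh token and rotates it on success.
func (f *fakeIssuer) refresh(token string) (string, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if token != f.current {
		return "", errAlreadyUsed
	}
	f.serial++
	f.current = fmt.Sprintf("rt-%d", f.serial)
	return f.current, nil
}

// stormRefresh lets n agents present the same shared seed concurrently and
// reports how many refreshes succeeded and how many were rejected.
func stormRefresh(n int) (ok, rejected int) {
	issuer := &fakeIssuer{current: "rt-0"}
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := issuer.refresh("rt-0") // every agent holds the same seed
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				rejected++
			} else {
				ok++
			}
		}()
	}
	wg.Wait()
	return ok, rejected
}

func main() {
	ok, rejected := stormRefresh(5)
	fmt.Printf("succeeded=%d rejected=%d\n", ok, rejected) // succeeded=1 rejected=4
}
```

Only the first caller wins; every other agent still holds the consumed seed, which is why the fix moves rotation to a single owner and leaves workspaces read-only.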

Comprehensive testing performed

internal/codexauth/refresher_test.go: structural single-flight; rotate+write-back then re-skip-when-fresh; skip-when-fresh (no POST); no-secret-inert; permanent-fail (invalid_grant) → no write-back + no retry-storm; JWT exp parse. go build ./... && go vet ./... && go test ./internal/codexauth/... all green; full go test ./... + -tags=integration green in dev.

Local-postgres E2E run

N/A — refresher uses an injected db/http in tests; no schema/migration change.

Staging-smoke verified or pending

Pending post-merge: fleet-rollout to agents-team prod tenant, then verify both codex agents complete a real turn on the user's OAuth (rolled to one agent first).

Five-Axis review walked

Correctness: one goroutine + package mutex = structural single-flight; refreshes only within expiry safety-margin. Readability: mirrors existing Start* sweep wiring. Architecture: co-located with background sweeps under supervised.RunWithRecover. Security: no token values logged; encrypted write-back via existing crypto. Performance: at most one POST per due cycle; inert when secret absent.
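The Correctness and Performance claims can be sketched as follows. This is an illustration under assumptions, not the contents of `internal/codexauth/refresher.go`: the `refresher` struct, its `due`/`post` hooks, and `runTicks` are hypothetical names. It shows a mutex-guarded cycle that POSTs at most once while the token is due and latches off after a permanent failure.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errInvalidGrant stands in for a permanent failure such as invalid_grant.
var errInvalidGrant = errors.New("invalid_grant")

// refresher is a hypothetical shape for a single-owner refresh cycle.
type refresher struct {
	mu    sync.Mutex
	dead  bool         // latched after a permanent failure; stops retries
	posts int          // number of refresh POSTs issued
	due   func() bool  // reports whether the token is within the safety margin
	post  func() error // performs the refresh and write-back
}

// tick runs one cycle. The mutex makes overlapping calls serialize, so the
// single-use refresh token is never posted twice for the same due window.
func (r *refresher) tick() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.dead || !r.due() {
		return
	}
	r.posts++
	if err := r.post(); errors.Is(err, errInvalidGrant) {
		r.dead = true // no hot-loop on a dead token
	}
}

// runTicks fires the given number of concurrent cycles and returns the POST count.
func runTicks(ticks int, permanentFail bool) int {
	fresh := false // only touched inside tick, under r.mu
	r := &refresher{}
	r.due = func() bool { return !fresh }
	r.post = func() error {
		if permanentFail {
			return errInvalidGrant
		}
		fresh = true
		return nil
	}
	var wg sync.WaitGroup
	for i := 0; i < ticks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.tick()
		}()
	}
	wg.Wait()
	return r.posts
}

func main() {
	fmt.Println(runTicks(10, false)) // 1: rotate once, then skip while fresh
	fmt.Println(runTicks(10, true))  // 1: permanent failure latches, no retry storm
}
```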

No backwards-compat shim / dead code added

No shim — additive goroutine. (Template side gates the legacy per-agent watchdog behind CODEX_AUTH_REFRESH_OWNER=1, reserved for a single owner, not a compat shim.)

Memory/saved-feedback consulted

project_codex_billing_mode_byok_default_wedge, project_codex_provider_ssot_split (the auth_env/providers re-sync is a SEPARATE follow-up PR, deliberately decoupled from this low-risk burn-fix to keep SSOT derivation changes out of the critical-path deploy).

devops-engineer added 1 commit 2026-05-31 19:41:29 +00:00
fix(workspace-server): central codex OAuth refresher (single-owner, anti-burn)
ci-arm64-advisory / fast-checks (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Successful in 12s
CI / Detect changes (pull_request) Successful in 27s
CI / Python Lint & Test (pull_request) Successful in 27s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 13s
E2E API Smoke Test / detect-changes (pull_request) Successful in 16s
E2E Chat / detect-changes (pull_request) Successful in 15s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 14s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 5s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
gate-check-v3 / gate-check (pull_request) Successful in 11s
sop-checklist / review-refire (pull_request) Has been skipped
verify-providers-gen / Regenerate providers artifact and fail on drift (pull_request) Successful in 29s
sop-tier-check / tier-check (pull_request) Successful in 13s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 23s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 4s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
Harness Replays / Harness Replays (pull_request) Successful in 9s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1m56s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
sop-checklist / all-items-acked (pull_request) acked: 7/7
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Platform (Go) (pull_request) Successful in 5m28s
CI / all-required (pull_request) Successful in 7m16s
qa-review / approved (pull_request) Refired via /qa-recheck by unknown
security-review / approved (pull_request) Refired via /security-recheck by unknown
audit-force-merge / audit (pull_request) Successful in 9s
df972a85e2
Multiple codex workspaces share ONE ChatGPT-Pro OAuth token (global_secrets
key CODEX_AUTH_JSON). OpenAI's refresh_token is single-use, so letting each
per-agent codex app-server refresh on its own 401 burned the shared seed within
seconds (a refresh storm → token_invalidated + "refresh token already used").

This adds a single platform-side owner of the refresh:
- internal/codexauth/refresher.go: one background goroutine, structurally
  single-flight (one goroutine + package mutex). Reads the global
  CODEX_AUTH_JSON, decodes the access_token JWT exp, and only within a safety
  margin of expiry POSTs the refresh_token ONCE per due cycle, then re-encrypts
  and writes the rotated blob back to global_secrets. Inert when the secret is
  absent; on a permanent failure (invalid_grant / "already used") it logs once
  and does NOT hot-loop. Billing-mode resolution + byok are untouched.
- cmd/server/main.go: wired under supervised.RunWithRecover like the other
  background sweeps.

Pairs with the codex template's codex_auth_sync.sh (GET-only re-sync; per-agent
OAuth POST disabled) so workspaces only consume the current token and never
rotate it themselves.
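
For illustration, the "decode the access_token JWT exp and refresh only within a safety margin" step might look like the sketch below. The function names `jwtExp`, `isDue`, and `makeToken` and the 300-second margin are assumptions for this example, not the names or values used in refresher.go. The payload is decoded without signature verification, which is enough for reading an expiry from a token the platform already holds.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// jwtExp extracts the exp claim (Unix seconds) from a JWT payload.
// It does not verify the signature.
func jwtExp(token string) (int64, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return 0, errors.New("not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return 0, fmt.Errorf("decode payload: %w", err)
	}
	var claims struct {
		Exp *float64 `json:"exp"` // NumericDate may be non-integer
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return 0, fmt.Errorf("parse claims: %w", err)
	}
	if claims.Exp == nil {
		return 0, errors.New("no exp claim")
	}
	return int64(*claims.Exp), nil
}

// isDue reports whether now falls within marginSeconds of expiry (or past it).
func isDue(exp, now, marginSeconds int64) bool {
	return now >= exp-marginSeconds
}

// makeToken builds an unsigned, test-only token carrying the given exp.
func makeToken(exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	header := enc([]byte(`{"alg":"none"}`))
	payload := enc([]byte(fmt.Sprintf(`{"exp":%d}`, exp)))
	return header + "." + payload + ".sig"
}

func main() {
	exp, err := jwtExp(makeToken(1000))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With a hypothetical 300 s margin the token becomes due at t=700.
	fmt.Println(exp, isDue(exp, 600, 300), isDue(exp, 700, 300)) // 1000 false true
}
```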

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
devops-engineer added the tier:medium label 2026-05-31 19:41:31 +00:00
Member

/sop-ack comprehensive-testing

Member

/sop-ack local-postgres-e2e

Member

/sop-ack staging-smoke

Member

/sop-ack root-cause

Member

/sop-ack five-axis-review

Member

/sop-ack no-backwards-compat

Member

/sop-ack memory-consulted

core-qa approved these changes 2026-05-31 19:46:15 +00:00
core-qa left a comment
Member

qa: refresher_test.go covers single-flight/rotate-writeback/skip-fresh/permanent-fail/JWT-exp; template test_auth_sync proves GET-only. Approving.

core-security approved these changes 2026-05-31 19:46:15 +00:00
core-security left a comment
Member

security: verified no token values logged (only error %v + value-less success), single-flight mutex prevents double-POST of the single-use refresh_token, permanent-failure latch prevents hot-looping a dead token, encrypted write-back, bearer in header not URL. Agents are GET-only. Approving.

Author
Member

/qa-recheck

Author
Member

/security-recheck

devops-engineer merged commit 446b8c78fd into main 2026-05-31 19:52:22 +00:00

Reference: molecule-ai/molecule-core#2023