Files
molecule-ai-workspace-templ…/codex_auth_refresh.sh
hongming eb68b7b025
CI / validate (push) Blocked by required conditions
CI / validate (pull_request) Blocked by required conditions
CI / Template validation (static) (push) Successful in 1m20s
CI / Adapter unit tests (push) Successful in 1m9s
CI / Template validation (static) (pull_request) Successful in 1m25s
CI / Adapter unit tests (pull_request) Successful in 1m2s
CI / T4 tier-4 conformance (live) (push) Failing after 52s
CI / Template validation (runtime) (push) Failing after 2m51s
CI / T4 tier-4 conformance (live) (pull_request) Failing after 51s
CI / Template validation (runtime) (pull_request) Failing after 2m40s
fix(auth): codex_auth_refresh.sh portable python3 path
PR#19 (internal#569) shipped the OAuth refresh watchdog hardcoded to
`/opt/molecule-venv/bin/python3`. The codex image is built FROM
python:3.11-slim — python3 lives at /usr/local/bin/python3 and no
/opt/molecule-venv ever exists. Every helper invocation therefore
exited 127 → OAuth refresh never fired → id_token expired silently →
Researcher wedged upstream of stdout (ae2c3012 diagnosis: 55h past
expiration, access_token still valid ~7.7d but id_token rejected).

Fix:
- Resolve python3 portably via `command -v python3` at script start
  (CODEX_PYTHON env override for test/dev rigs); fail-fast with rc=127
  if no python3 found, so the symptom is loud not silent.
- Replace all 6 hardcoded `/opt/molecule-venv/bin/python3` call sites
  with "$PYTHON_BIN".
- Drop the test harness's `_patch_script_python` hack — the test file
  used to know about the broken path and patch a copy of the script,
  which masked the production breakage. Tests now exec the real
  shipped script with CODEX_PYTHON pointing at sys.executable.
- Add `test_script_does_not_exit_127_with_portable_python_path` as a
  regression-pin: runs the script WITHOUT CODEX_PYTHON so the
  `command -v python3` resolver is genuinely exercised, asserts the
  script never exits 127 and produces the expected skip path.
- Add an image-build smoke check in the Dockerfile: runs
  `codex_auth_refresh.sh --once` against an absent CODEX_HOME at
  build time, fails the build if rc=127 or rc≠1. This makes the
  watchdog a hard image-build invariant; a future regression of the
  python path cannot ship.

Verified locally:
- bash -n codex_auth_refresh.sh → OK
- shellcheck -S error codex_auth_refresh.sh + boot helpers → 0
- pytest tests/test_auth_refresh_watchdog.py → 8 passed
- pytest tests/ (full) → 75 passed
- Direct script smoke with empty CODEX_HOME → rc=1
  "absent or empty" (NOT rc=127)

Cross-links:
- ae2c3012 diagnosis (Researcher wedge, id_token expiry, access_token
  remaining ~198h)
- PR#19 (internal#569) — the auto-refresh feature this fix unblocks
- feedback_image_promote_is_not_user_live — cascade is image-build →
  ECR push → CP pin promote → workspace recreate before user-live

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 03:01:10 -07:00

471 lines
17 KiB
Bash

#!/usr/bin/env bash
# codex_auth_refresh.sh — proactive OAuth refresh watchdog for the
# ChatGPT/Codex-subscription credential.
#
# Why this exists (RFC internal#569):
# Codex CLI 0.130.0 ONLY refreshes auth.json on the demand path
# `AuthManager::auth().await` (codex-rs/login/src/auth/manager.rs:1411).
# In our prod-Reviewer / prod-Researcher topology the workspace can
# sit idle (or wedge upstream in executor.py) for >8 days between
# turns, so `auth()` is never invoked end-to-end and refresh is never
# armed. id_token expires within hours; access_token within ~198h;
# eventually refresh_token rolls past its own window and the workspace
# becomes hard-broken (no recovery path, requires a CTO re-login).
#
# Strategy:
# Background bash loop, started by start.sh as `gosu agent`. Every
# $REFRESH_INTERVAL seconds (default 6h, RFC §"Proposed fix" item 2)
# it inspects $CODEX_HOME/auth.json. If either:
# - tokens.access_token JWT `exp` is within $SAFETY_MARGIN seconds
# of now (default 4h), OR
# - last_refresh ISO timestamp is older than $STALE_AFTER seconds
# (default 7 days — one day below the CLI's TOKEN_REFRESH_INTERVAL
# of 8d so we refresh before the CLI would mark it stale)
# then it POSTs the refresh_token to
# `https://auth.openai.com/oauth/token` (the endpoint baked into the
# CLI — codex-rs/login/src/auth/manager.rs:94, const REFRESH_TOKEN_URL)
# with the same CLIENT_ID (`app_EMoamEEZ73f0CkXaXp7hrann`,
# codex-rs/login/src/auth/manager.rs:928 / v0.130.0:921) and
# grant_type=refresh_token, and atomically rewrites auth.json in place
# (write to .tmp + chmod + rename). The CLI tolerates an externally
# refreshed file: `reload()` is called inside `auth()` whenever the
# file mtime changes.
#
# Vendor contract source — all field names, endpoint URL, client id,
# response shape verified against `openai/codex@rust-v0.130.0`
# (the exact CLI version pinned in the Dockerfile, line 166):
# - REFRESH_TOKEN_URL manager.rs:94
# - CLIENT_ID manager.rs:921
# - request shape: { client_id, grant_type:"refresh_token",
# refresh_token } manager.rs:913-918
# - response shape: { id_token?, access_token?,
# refresh_token? } manager.rs:920-925
# - persist semantics: update tokens.{id_token,access_token,
# refresh_token} only when present, set
# last_refresh = now (manager.rs:787-810)
# - stale predicate: access_token exp <= now OR
# last_refresh < now - 8d
# manager.rs:1786-1808
#
# Constraints honored (RFC §"Constraints honored"):
# - NEVER echo token values. Filenames + JWT field NAMES + status
# codes + ISO timestamps only.
# - No mutation of auth.json outside the atomic-rename path.
# - Inert when $CODEX_HOME/auth.json is absent OR auth_mode != chatgpt
# OR refresh_token is empty (the API-key / MiniMax paths don't have
# refresh tokens; this watchdog is subscription-only by design).
#
# Health sidecar (RFC §"Surface a health metric"):
# On every loop iteration, write the parsed `last_refresh` ISO and
# age-in-seconds to $CODEX_HOME/auth_refresh_status.json (mode 0600,
# agent-owned). The molecule-runtime obs sidecar can scrape this
# file instead of the auth.json itself — keeps tokens out of the
# metrics path entirely.
set -uo pipefail
CODEX_HOME="${CODEX_HOME:-/home/agent/.codex}"
AUTH_JSON="${CODEX_HOME}/auth.json"
STATUS_FILE="${CODEX_HOME}/auth_refresh_status.json"
# Resolve python3 portably. PR#19 (internal#569) hardcoded
# /opt/molecule-venv/bin/python3, but the codex image is built FROM
# python:3.11-slim with python3 at /usr/local/bin/python3 — no
# /opt/molecule-venv exists. Every helper invocation exited 127 →
# OAuth refresh never fired → id_token expired silently and Researcher
# wedged upstream of stdout (ae2c3012 diagnosis). Override via
# CODEX_PYTHON for test/dev rigs that need a custom interpreter.
PYTHON_BIN="${CODEX_PYTHON:-$(command -v python3 || true)}"
if [ -z "$PYTHON_BIN" ] || [ ! -x "$PYTHON_BIN" ]; then
printf '[codex_auth_refresh %s] FATAL: python3 not found on PATH (CODEX_PYTHON=%s); the watchdog cannot run.\n' \
"$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${CODEX_PYTHON:-<unset>}" >&2
exit 127
fi
# Tunable via env. Defaults come from RFC §"Proposed fix" + CLI source:
# - REFRESH_INTERVAL: how often the loop wakes (6h)
# - SAFETY_MARGIN: refresh if access_token exp - now < this (4h)
# - STALE_AFTER: refresh if last_refresh older than this (7d,
# chosen to fire before the CLI's 8d cliff)
REFRESH_INTERVAL="${CODEX_AUTH_REFRESH_INTERVAL_SECONDS:-21600}"
SAFETY_MARGIN="${CODEX_AUTH_SAFETY_MARGIN_SECONDS:-14400}"
STALE_AFTER="${CODEX_AUTH_STALE_AFTER_SECONDS:-604800}"
# Refresh endpoint — keep the CLI's env override honored so a test rig
# can point us at a mock server.
REFRESH_URL="${CODEX_REFRESH_TOKEN_URL_OVERRIDE:-https://auth.openai.com/oauth/token}"
CLIENT_ID="app_EMoamEEZ73f0CkXaXp7hrann"
log() {
# Stamped, prefixed; never includes token contents.
printf '[codex_auth_refresh %s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2
}
# Run a one-shot python helper to inspect auth.json and decide whether
# to refresh. Python gives us reliable JSON + JWT decoding without
# pulling in jq (which is not in the image). We deliberately keep the
# helper inline so the script is a single self-contained file.
needs_refresh_and_payload() {
# Emits one of:
# SKIP <reason>
# READY <refresh_token> <age_seconds> <iso_last_refresh>
# Note: <refresh_token> is on stdout for the caller to consume in a
# variable. Caller MUST NOT echo it. The wrapper below redacts it
# before any logging.
"$PYTHON_BIN" - "$AUTH_JSON" "$SAFETY_MARGIN" "$STALE_AFTER" <<'PY'
import base64, json, sys, time
from datetime import datetime, timezone
path = sys.argv[1]
safety_margin = int(sys.argv[2])
stale_after = int(sys.argv[3])
def emit(line):
print(line, flush=True)
try:
with open(path, "r", encoding="utf-8") as f:
blob = json.load(f)
except FileNotFoundError:
emit("SKIP no_auth_json")
sys.exit(0)
except (OSError, json.JSONDecodeError) as exc:
emit(f"SKIP unreadable:{type(exc).__name__}")
sys.exit(0)
# Subscription-only: API-key / MiniMax paths have no refresh token.
auth_mode = (blob.get("auth_mode") or "").lower()
if auth_mode not in ("chatgpt", "chatgpt_auth_tokens"):
emit(f"SKIP auth_mode={auth_mode or 'none'}")
sys.exit(0)
tokens = blob.get("tokens") or {}
refresh_token = tokens.get("refresh_token") or ""
access_token = tokens.get("access_token") or ""
if not refresh_token:
emit("SKIP no_refresh_token")
sys.exit(0)
# Parse last_refresh ISO. Tolerate trailing Z (codex emits +00:00 in
# newer versions, Z in 0.130.0 era).
last_refresh_iso = blob.get("last_refresh") or ""
now = datetime.now(timezone.utc)
age_seconds = -1
if last_refresh_iso:
try:
iso = last_refresh_iso.replace("Z", "+00:00")
last_refresh_dt = datetime.fromisoformat(iso)
if last_refresh_dt.tzinfo is None:
last_refresh_dt = last_refresh_dt.replace(tzinfo=timezone.utc)
age_seconds = int((now - last_refresh_dt).total_seconds())
except ValueError:
age_seconds = -1
# Decode the access_token JWT `exp` claim (RFC matches the CLI:
# parse_jwt_expiration in codex-rs/login/src/auth/manager.rs).
exp_seconds_remaining = None
if access_token and access_token.count(".") >= 2:
try:
payload_b64 = access_token.split(".")[1]
padded = payload_b64 + "=" * (-len(payload_b64) % 4)
claims = json.loads(base64.urlsafe_b64decode(padded.encode()))
exp = claims.get("exp")
if isinstance(exp, (int, float)):
exp_seconds_remaining = int(exp - time.time())
except (ValueError, json.JSONDecodeError):
exp_seconds_remaining = None
needs = False
reasons = []
if exp_seconds_remaining is not None and exp_seconds_remaining <= safety_margin:
needs = True
reasons.append(f"exp_in={exp_seconds_remaining}s<={safety_margin}s")
if age_seconds >= 0 and age_seconds >= stale_after:
needs = True
reasons.append(f"last_refresh_age={age_seconds}s>={stale_after}s")
if exp_seconds_remaining is None and age_seconds == -1:
# Defensive: if we can't parse anything, refresh anyway. The CLI's
# own staleness path would do the same.
needs = True
reasons.append("unparseable_metadata")
if not needs:
emit(
"SKIP fresh exp_in="
+ (str(exp_seconds_remaining) if exp_seconds_remaining is not None else "unknown")
+ "s age="
+ (str(age_seconds) if age_seconds >= 0 else "unknown")
+ "s"
)
sys.exit(0)
# Caller consumes refresh_token from the next line. Reasons echoed
# after for logging.
emit(f"READY {refresh_token}")
print(f"REASONS {','.join(reasons)}", file=sys.stderr)
PY
}
# Apply a refresh response (file path passed as $2) to auth.json
# atomically. We pass the response as a FILE PATH (not stdin) because
# the python helper's body is itself a heredoc on stdin — using stdin
# for the response would collide with the heredoc.
apply_refresh_response() {
local response_file="$1"
"$PYTHON_BIN" - "$AUTH_JSON" "$response_file" <<'PY'
import json, os, stat, sys, tempfile
from datetime import datetime, timezone
path = sys.argv[1]
response_path = sys.argv[2]
try:
with open(path, "r", encoding="utf-8") as f:
blob = json.load(f)
except (OSError, json.JSONDecodeError) as exc:
print(f"FAIL load_auth:{type(exc).__name__}", file=sys.stderr)
sys.exit(2)
try:
with open(response_path, "r", encoding="utf-8") as f:
response = json.load(f)
except (OSError, json.JSONDecodeError) as exc:
print(f"FAIL bad_response_json:{exc.__class__.__name__}", file=sys.stderr)
sys.exit(2)
if not isinstance(response, dict):
print("FAIL response_not_object", file=sys.stderr)
sys.exit(2)
tokens = blob.get("tokens") or {}
# Per manager.rs:787-810: persist only what the response contains.
new_id = response.get("id_token")
new_access = response.get("access_token")
new_refresh = response.get("refresh_token")
if new_id:
tokens["id_token"] = new_id
if new_access:
tokens["access_token"] = new_access
if new_refresh:
tokens["refresh_token"] = new_refresh
blob["tokens"] = tokens
blob["last_refresh"] = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
# Atomic rename, preserve mode 0600.
dir_path = os.path.dirname(path) or "."
fd, tmp_path = tempfile.mkstemp(prefix=".auth.", suffix=".tmp", dir=dir_path)
try:
with os.fdopen(fd, "w", encoding="utf-8") as f:
json.dump(blob, f, indent=2)
f.flush()
os.fsync(f.fileno())
os.chmod(tmp_path, stat.S_IRUSR | stat.S_IWUSR) # 0600
os.replace(tmp_path, path)
except Exception:
try:
os.unlink(tmp_path)
except OSError:
pass
raise
# Summary line — token VALUES are not printed, only WHICH fields were
# rotated.
fields = []
if new_id: fields.append("id_token")
if new_access: fields.append("access_token")
if new_refresh: fields.append("refresh_token")
print("OK rotated=" + ",".join(fields) if fields else "OK rotated=<none>")
PY
}
# Write the status sidecar (mode 0600). Never includes token contents.
write_status() {
local last_iso="$1" age="$2" outcome="$3"
"$PYTHON_BIN" - "$STATUS_FILE" "$last_iso" "$age" "$outcome" <<'PY'
import json, os, stat, sys
path, last_iso, age, outcome = sys.argv[1:5]
payload = {
"last_refresh": last_iso,
"last_refresh_age_seconds": int(age) if age.lstrip("-").isdigit() else None,
"watchdog_last_outcome": outcome,
"watchdog_last_run_iso": __import__("datetime").datetime.now(
__import__("datetime").timezone.utc
).isoformat().replace("+00:00", "Z"),
}
with open(path + ".tmp", "w", encoding="utf-8") as f:
json.dump(payload, f, indent=2)
os.chmod(path + ".tmp", stat.S_IRUSR | stat.S_IWUSR)
os.replace(path + ".tmp", path)
PY
}
# One refresh attempt. Returns 0 on refreshed, 1 on skipped, 2 on
# transient failure (will retry next loop), 3 on permanent failure
# (e.g. refresh_token_expired — operator intervention needed).
attempt_refresh_once() {
if [ ! -s "$AUTH_JSON" ]; then
log "skip: $AUTH_JSON absent or empty (no chatgpt-subscription auth in this workspace)"
write_status "" -1 "skip:no_auth_json"
return 1
fi
local decision rc=0
# Use a temp file so the multi-line decision output is fully captured.
local decision_file
decision_file="$(mktemp)"
needs_refresh_and_payload >"$decision_file" 2>>/tmp/codex_auth_refresh.errlog || rc=$?
if [ "$rc" -ne 0 ]; then
log "decision: helper exited $rc; see /tmp/codex_auth_refresh.errlog"
rm -f "$decision_file"
write_status "" -1 "skip:helper_error"
return 2
fi
local verdict
verdict="$(head -n1 "$decision_file")"
case "$verdict" in
SKIP*)
log "no-op: ${verdict#SKIP }"
rm -f "$decision_file"
# Re-parse just to refresh the sidecar — separate, doesn't see
# the token because no refresh body is emitted.
local last_iso age
read -r last_iso age <<<"$("$PYTHON_BIN" - "$AUTH_JSON" <<'PY'
import json, sys
from datetime import datetime, timezone
try:
blob = json.load(open(sys.argv[1]))
except Exception:
print(' -1'); sys.exit(0)
iso = (blob.get('last_refresh') or '').replace('Z', '+00:00')
age = -1
if iso:
try:
dt = datetime.fromisoformat(iso)
if dt.tzinfo is None: dt = dt.replace(tzinfo=timezone.utc)
age = int((datetime.now(timezone.utc) - dt).total_seconds())
except ValueError:
pass
print((blob.get('last_refresh') or ''), age)
PY
)"
write_status "${last_iso:-}" "${age:--1}" "skip"
return 1
;;
READY*)
:
;;
*)
log "decision: unrecognized verdict line (length=${#verdict}); aborting attempt"
rm -f "$decision_file"
write_status "" -1 "skip:bad_verdict"
return 2
;;
esac
# READY <refresh_token>
local refresh_token
refresh_token="${verdict#READY }"
rm -f "$decision_file"
if [ -z "$refresh_token" ]; then
log "decision READY but refresh_token empty; aborting"
write_status "" -1 "skip:empty_refresh_token"
return 2
fi
log "refresh: starting (endpoint=${REFRESH_URL%%\?*})"
# Build the request body via python to avoid quoting the token in
# the shell — token value never appears in argv or env. Send body
# to curl on stdin via @- ; the shell never sees the token in
# process arguments.
local response_file http_code
response_file="$(mktemp)"
http_code="$(
"$PYTHON_BIN" -c "import json,sys; sys.stdout.write(json.dumps({'client_id': '$CLIENT_ID', 'grant_type': 'refresh_token', 'refresh_token': sys.argv[1]}))" "$refresh_token" \
| curl -sS --max-time 30 \
-H "Content-Type: application/json" \
-X POST -d @- \
-o "$response_file" \
-w "%{http_code}" \
"$REFRESH_URL" \
|| echo "000"
)"
unset refresh_token
case "$http_code" in
2*)
if apply_refresh_response "$response_file" >>/tmp/codex_auth_refresh.errlog 2>&1; then
rm -f "$response_file"
log "refresh: ok (http=$http_code)"
# Re-read the freshly written file for the sidecar.
local last_iso age
read -r last_iso age <<<"$("$PYTHON_BIN" - "$AUTH_JSON" <<'PY'
import json, sys
from datetime import datetime, timezone
blob = json.load(open(sys.argv[1]))
iso = (blob.get('last_refresh') or '').replace('Z', '+00:00')
age = -1
if iso:
dt = datetime.fromisoformat(iso)
if dt.tzinfo is None: dt = dt.replace(tzinfo=timezone.utc)
age = int((datetime.now(timezone.utc) - dt).total_seconds())
print((blob.get('last_refresh') or ''), age)
PY
)"
write_status "${last_iso:-}" "${age:--1}" "refreshed"
return 0
else
log "refresh: apply_refresh_response failed (see /tmp/codex_auth_refresh.errlog)"
rm -f "$response_file"
write_status "" -1 "fail:apply_response"
return 2
fi
;;
401)
# Per manager.rs:846-848 — 401 is a PERMANENT failure (refresh
# token expired / reused / revoked). Operator must re-login on
# the host running the CTO ChatGPT account.
log "refresh: PERMANENT FAILURE (http=401) — refresh_token rejected. Operator must re-login the CTO ChatGPT subscription and re-inject CODEX_AUTH_JSON."
rm -f "$response_file"
write_status "" -1 "fail:permanent_401"
return 3
;;
*)
# Anything else is treated as transient (network glitch, 5xx).
# Per manager.rs:849-854, the CLI does the same: keep the old
# tokens and retry later.
log "refresh: transient failure (http=$http_code); will retry next loop"
rm -f "$response_file"
write_status "" -1 "fail:transient_$http_code"
return 2
;;
esac
}
main_loop() {
log "watchdog: starting (interval=${REFRESH_INTERVAL}s safety_margin=${SAFETY_MARGIN}s stale_after=${STALE_AFTER}s codex_home=${CODEX_HOME})"
while true; do
attempt_refresh_once || true
sleep "$REFRESH_INTERVAL"
done
}
# Allow `--once` for manual probe + the boot-time priming probe in
# start.sh. Default is the long-running loop.
case "${1:-}" in
--once)
attempt_refresh_once
exit $?
;;
--help|-h)
grep '^#' "$0" | sed 's/^# \{0,1\}//'
exit 0
;;
*)
main_loop
;;
esac