feat(executor): emit incident.codex_wedge JSONL on SSE wedge #18

Open
core-devops wants to merge 1 commit from feat/codex-wedge-obs-emit into main
Member

Summary

  • Adds a structured JSONL incident line at the existing 90s SSE-wedge watchdog site so the tenant Vector pipeline ships it to Loki under {service="molecule-tenant"}. Pairs with the codex-wedge.yml Loki ruler in operator-config (separate PR).
  • INTENTIONALLY does NOT bump @openai/codex@0.130.0 — see investigation below.

codex-cli upstream investigation (VENDOR-DOC CHECK)

  • 0.131.0 (2026-05-18) release notes: no mention of SSE / chatgpt-subscription / no-events / app-server stream fixes. Focus is TUI / plugin / remote / Python SDK.
  • Open issues openai/codex#23061 and #22793 (stream disconnected before completion) track related streaming instability against subscription auth — neither has a verified upstream fix.
  • Conclusion: bumping to 0.131 would be a blind upgrade against the same hypothesised upstream bug. Hold the pin at 0.130.0 until upstream lands a verified fix AND we reproduce the wedge against a 0.131 image.

Why not switch to auth_mode=openai_api (CTO billing touchpoint)

The wedge is correlated with chatgpt_subscription + gpt-5.5, so api-key fallback is a viable workaround — but it routes traffic through a DIFFERENT OpenAI account. CTO-only decision; this PR adds the obs signal so the call can be made on data.

Schema (frozen — Loki ruler depends on it)

event_type            = "incident.codex_wedge"
workspace_id          (from $WORKSPACE_ID)
turn_id
deltas_at_wedge       (count seen pre-wedge — 0 in the textbook case)
wedge_duration_seconds
codex_cli_version     (currently "0.130.0")
model                 (e.g. "gpt-5.5")
auth_mode             (chatgpt_subscription / openai_api / custom_anthropic_compat / unknown)
ts                    (RFC-3339 UTC)
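
A minimal sketch of how a line with this schema might be assembled and serialized, assuming Python (the repo's tests live in `tests/test_executor.py`). The helper names `build_wedge_incident`, `render_wedge_line`, and `emit_wedge_incident`, and the choice of stdout as the sink, are illustrative assumptions and not the PR's actual code.

```python
import json
import os
import sys
from datetime import datetime, timezone


def build_wedge_incident(turn_id, deltas_at_wedge, wedge_duration_seconds,
                         codex_cli_version, model, auth_mode, now=None):
    """Assemble the nine schema fields (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    return {
        "event_type": "incident.codex_wedge",
        "workspace_id": os.environ.get("WORKSPACE_ID", ""),
        "turn_id": turn_id,
        "deltas_at_wedge": deltas_at_wedge,
        "wedge_duration_seconds": wedge_duration_seconds,
        "codex_cli_version": codex_cli_version,
        "model": model,
        "auth_mode": auth_mode,
        # RFC 3339 timestamp, normalized to UTC with a trailing "Z".
        "ts": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def render_wedge_line(record):
    """Serialize to one compact JSON object terminated by a newline (JSONL)."""
    return json.dumps(record, separators=(",", ":")) + "\n"


def emit_wedge_incident(record, stream=None):
    """Write the line; a logging failure must not mask the wedge itself."""
    try:
        out = stream if stream is not None else sys.stdout
        out.write(render_wedge_line(record))
        out.flush()
    except Exception:
        pass
```

Keeping serialization separate from the write makes the shape easy to assert on in a unit test without capturing stdout.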

Loki query example (after both PRs ship)

{service="molecule-tenant"}
  | json
  | event_type = "incident.codex_wedge"
  | line_format "{{.ts}} ws={{.workspace_id}} model={{.model}} auth={{.auth_mode}} deltas={{.deltas_at_wedge}} dur={{.wedge_duration_seconds}}s"
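
For checking a captured line by hand before the ruler ships, the filter and `line_format` above can be approximated locally. This is only a Python imitation under the assumption that fields render as their plain string form; it does not call Loki, and `render_like_line_format` is a hypothetical name.

```python
import json


def render_like_line_format(raw_line):
    """Return the formatted string for a wedge line, or None if filtered out."""
    try:
        rec = json.loads(raw_line)
    except ValueError:
        return None  # not JSON, so the event_type filter cannot match
    if not isinstance(rec, dict) or rec.get("event_type") != "incident.codex_wedge":
        return None
    return (
        f"{rec.get('ts', '')} ws={rec.get('workspace_id', '')} "
        f"model={rec.get('model', '')} auth={rec.get('auth_mode', '')} "
        f"deltas={rec.get('deltas_at_wedge', '')} "
        f"dur={rec.get('wedge_duration_seconds', '')}s"
    )
```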

Test plan

  • tests/test_executor.py::test_wedge_emits_incident_jsonl — JSON shape matches ruler
  • Full tests/test_executor.py suite passes locally (10/10)
  • Image build green in CI
  • Post-merge: verify a synthetic wedge in a staging workspace emits the expected JSONL line under service=molecule-tenant in Loki
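
The shape check in the first item could look roughly like the sketch below. It is not the contents of `test_wedge_emits_incident_jsonl`; `check_wedge_line` is a hypothetical helper, and the exact constraints the ruler enforces (for example whether extra fields are tolerated) are assumptions.

```python
import json

EXPECTED_FIELDS = {
    "event_type", "workspace_id", "turn_id", "deltas_at_wedge",
    "wedge_duration_seconds", "codex_cli_version", "model", "auth_mode", "ts",
}
AUTH_MODES = {
    "chatgpt_subscription", "openai_api", "custom_anthropic_compat", "unknown",
}


def check_wedge_line(line):
    """Return a list of problems; an empty list means the shape looks right."""
    rec = json.loads(line)
    problems = []
    missing = EXPECTED_FIELDS - set(rec)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if rec.get("event_type") != "incident.codex_wedge":
        problems.append("unexpected event_type")
    if rec.get("auth_mode") not in AUTH_MODES:
        problems.append("auth_mode outside the documented set")
    deltas = rec.get("deltas_at_wedge")
    if not isinstance(deltas, int) or deltas < 0:
        problems.append("deltas_at_wedge must be a non-negative integer")
    return problems
```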
core-devops added 1 commit 2026-05-19 19:56:18 +00:00
feat(executor): emit incident.codex_wedge JSONL on SSE wedge
CI / Adapter unit tests (pull_request) Successful in 29s
CI / Adapter unit tests (push) Successful in 1m12s
CI / Template validation (static) (push) Successful in 1m41s
CI / Template validation (static) (pull_request) Successful in 1m36s
CI / Template validation (runtime) (push) Failing after 45s
CI / Template validation (runtime) (pull_request) Failing after 46s
CI / T4 tier-4 conformance (live) (push) Successful in 1m44s
CI / T4 tier-4 conformance (live) (pull_request) Successful in 42s
CI / validate (push) Failing after 11s
CI / validate (pull_request) Failing after 34s
89f664ebc7
Surfaces the 2026-05-18-class wedge (codex turn emits zero events for
90s) as a structured log line the tenant Vector pipeline ships to
Loki under {service="molecule-tenant"}. Pairs with the codex-wedge
Loki ruler in operator-config (separate PR).

Schema (frozen by the matching ruler):
  event_type, workspace_id, turn_id, deltas_at_wedge,
  wedge_duration_seconds, codex_cli_version, model, auth_mode, ts

Notes on the broader investigation:

- Upstream codex-cli 0.131.0 release notes (May 2026) do NOT mention
  any SSE / chatgpt-subscription / no-events fix; open issues #23061
  and #22793 (stream-disconnect-before-completion) track related
  instability with NO verified fix. Therefore this PR INTENTIONALLY
  does NOT bump the @openai/codex@0.130.0 pin — that would be a
  blind upgrade against the same hypothesised upstream bug. Bump only
  after upstream lands a verified SSE / app-server-stream fix and we
  reproduce the wedge in a 0.131 image.

- We do NOT switch the production prod-team auth_mode to openai_api
  (api-key) — that's a CTO billing touchpoint (different OpenAI
  account); the obs signal lets us decide that with data instead of
  hypothesis.

- The wedge-detection logic itself was already added by PR#14
  (the 2026-05-18 deadlock fix); this PR only adds the
  structured-log emission at the same site, plus a regression test.
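
For context, an inactivity watchdog of the kind described (no events for 90s) can be sketched with `asyncio.wait_for` around each read of the event stream. This is an illustration only: the actual logic from PR#14 is not shown here, and `consume_with_watchdog` and `TurnWedged` are hypothetical names.

```python
import asyncio


class TurnWedged(Exception):
    """Raised when no event arrives within the inactivity window."""

    def __init__(self, deltas_at_wedge, wedge_duration_seconds):
        super().__init__(f"no events for {wedge_duration_seconds}s")
        self.deltas_at_wedge = deltas_at_wedge
        self.wedge_duration_seconds = wedge_duration_seconds


async def consume_with_watchdog(events, on_event, timeout_s=90.0):
    """Drain an async event stream, failing if it goes silent for timeout_s."""
    iterator = events.__aiter__()
    deltas = 0
    while True:
        try:
            # The timer restarts on every event, so only silence trips it.
            event = await asyncio.wait_for(iterator.__anext__(), timeout=timeout_s)
        except StopAsyncIteration:
            return deltas  # stream completed normally
        except asyncio.TimeoutError:
            # This is the site where the structured incident line would be emitted.
            raise TurnWedged(deltas, timeout_s) from None
        deltas += 1
        on_event(event)
```

The exception carries the delta count and window length, which correspond to `deltas_at_wedge` and `wedge_duration_seconds` in the schema.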

Auth-mode label derived from credential-env presence (mirrors
provider_config._BUILTIN_PROVIDERS selection order) so the line is
emitted even if the process wedged before render_provider_toml.py
finished writing ~/.codex/config.toml.
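
A minimal sketch of deriving the label from credential-env presence alone, so it works without reading `~/.codex/config.toml`. The precedence order is an assumption, `CODEX_CHATGPT_AUTH_JSON` is a hypothetical placeholder for whatever signals subscription auth, and `OPENAI_API_KEY` is assumed as the api-key variable; the real helper may differ.

```python
import os

# Ordered (label, env var names). Order and names are assumptions for
# illustration; CODEX_CHATGPT_AUTH_JSON in particular is a made-up placeholder.
_AUTH_MODE_BY_ENV = (
    ("chatgpt_subscription", ("CODEX_CHATGPT_AUTH_JSON",)),
    ("openai_api", ("OPENAI_API_KEY",)),
    ("custom_anthropic_compat", ("ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_API_KEY")),
)


def derive_auth_mode_label(env=None):
    """Return a categorical label only; credential values are never returned."""
    env = os.environ if env is None else env
    for label, names in _AUTH_MODE_BY_ENV:
        # An empty string counts as "not set".
        if any(env.get(name) for name in names):
            return label
    return "unknown"
```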

Test:
  tests/test_executor.py::test_wedge_emits_incident_jsonl validates
  the JSON shape against the ruler's expected fields.
agent-reviewer requested changes 2026-05-24 00:10:19 +00:00
agent-reviewer left a comment
Member

REQUEST_CHANGES after 5-axis review of 89f664e.

Correctness: The wedge incident emission is placed at the existing inactivity timeout site and the test covers the JSON payload for the ChatGPT subscription path. However, _derive_auth_mode_label() does not actually mirror provider_config selection for the MiniMax route: the repo's configured third-party provider is driven by MINIMAX_API_KEY, but this code only checks ANTHROPIC_AUTH_TOKEN/ANTHROPIC_API_KEY before falling back to unknown. A MiniMax workspace wedge would therefore emit the wrong auth_mode, undercutting the Loki grouping this PR adds. Please include the active compat-provider env names, at least MINIMAX_API_KEY, or derive from the same provider registry used at boot.

Robustness: The log emission is exception-contained and emits once per timeout, which is good. Current PR CI is also red on runtime validation / validate, so this needs a green rerun before merge.

Security: No secrets are emitted; the auth label is categorical only.

Performance: One small JSON serialization on the terminal wedge path is negligible.

Readability: The new helper is readable, but its comments overstate parity with provider_config while missing the MiniMax auth path.
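
One way to read the second suggestion (derive from the same registry used at boot) is to have both provider selection and the wedge label walk a single table, so the two cannot drift apart. The sketch below is hypothetical: the real shape of `provider_config._BUILTIN_PROVIDERS` is not shown in this PR, `CODEX_CHATGPT_AUTH_JSON` is a placeholder name, and mapping MiniMax to `custom_anthropic_compat` is an assumption.

```python
from typing import Mapping, NamedTuple, Optional, Tuple


class Provider(NamedTuple):
    name: str
    auth_mode: str
    env_vars: Tuple[str, ...]


# Hypothetical stand-in for the boot-time provider registry.
REGISTRY = (
    Provider("chatgpt", "chatgpt_subscription", ("CODEX_CHATGPT_AUTH_JSON",)),
    Provider("openai", "openai_api", ("OPENAI_API_KEY",)),
    Provider("minimax", "custom_anthropic_compat", ("MINIMAX_API_KEY",)),
    Provider("anthropic-compat", "custom_anthropic_compat",
             ("ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_API_KEY")),
)


def select_provider(env: Mapping[str, str]) -> Optional[Provider]:
    """First registry entry with any credential env var set wins."""
    for provider in REGISTRY:
        if any(env.get(var) for var in provider.env_vars):
            return provider
    return None


def auth_mode_label(env: Mapping[str, str]) -> str:
    """Label for the incident line, derived from the same selection logic."""
    provider = select_provider(env)
    return provider.auth_mode if provider else "unknown"
```

With this layout a MiniMax workspace is labeled through the same path that configured it, rather than falling through to `unknown`.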

Some required checks failed

Reference: molecule-ai/molecule-ai-workspace-template-codex#18