RFC: cross-template vendor-CLI wrapper schema-drift guards — apply codex lessons to claude-code, hermes, openclaw #1691

Open
opened 2026-05-22 22:48:05 +00:00 by hongming · 0 comments
Owner

Why

2026-05-22 codex outage root cause: vendor CLI (openai-codex) changed wire protocol between 0.72 and 0.130 without bumping protocol version. Executor matched only the 0.72 method names. ~7h prod outage before identified.

The same risk exists for EVERY workspace template that wraps a vendor CLI as a subprocess:

  • claude-code template wraps claude CLI
  • hermes template wraps hermes CLI
  • openclaw template wraps openclaw CLI
  • (And future templates as we add runtimes.)

Each has its own subprocess-IO + notification-parsing layer that could silently break on vendor upgrade.

Proposal

Apply ALL four codex-side guards to EVERY vendor-CLI-wrapping template:

1. Pin vendor CLI version in Dockerfile (build-time grep -F check)

See codex template issue #34. Same pattern: ARG + npm install @version + grep verification.

2. Notification-schema contract test in CI

See codex template issue #31. Spawn the real vendor CLI, drive a real turn, assert each method name is recognized by the adapter's notification handlers. Plus an offline replay-fixture variant for credential-less CI.

3. Self-diagnosing wedge log

See codex template issue #32. Whenever an adapter's wedge-detector fires, include the actual notification methods seen + their counts. Never just say (deltas=0) without naming what WAS received.

4. CODEX_DEBUG=1-style instrumentation switch

See codex template issue #33. Bake stderr-trace hooks gated on an env var per template (CLAUDE_CODE_DEBUG, HERMES_DEBUG, OPENCLAW_DEBUG). Then prod debugging never requires docker cp + hot-patches.

Acceptance

  • Each vendor-CLI-wrapping template has all 4 guards merged
  • Cross-template audit: list every *_app_server*.py / equivalent + verify it has the patterns
  • README in each template links the troubleshooting flow (CLI_DEBUG=1 → check wedge log method tally → find vendor schema-drift early)

Touchpoints

  • molecule-ai-workspace-template-codex (already has the immediate fix via PR #30; the followups are #31-34)
  • molecule-ai-workspace-template-claude-code
  • molecule-ai-workspace-template-hermes
  • molecule-ai-workspace-template-openclaw
  • (audit other workspace-template-* repos and file per-template tickets as needed)

Owner

TBD. Could be parallelized 1 owner per template. ~1d each, ~1w total.

## Why 2026-05-22 codex outage root cause: vendor CLI (openai-codex) changed wire protocol between 0.72 and 0.130 without bumping protocol version. Executor matched only the 0.72 method names. ~7h prod outage before identified. The same risk exists for EVERY workspace template that wraps a vendor CLI as a subprocess: - claude-code template wraps `claude` CLI - hermes template wraps `hermes` CLI - openclaw template wraps `openclaw` CLI - (And future templates as we add runtimes.) Each has its own subprocess-IO + notification-parsing layer that could silently break on vendor upgrade. ## Proposal Apply ALL four codex-side guards to EVERY vendor-CLI-wrapping template: ### 1. Pin vendor CLI version in Dockerfile (build-time grep -F check) See codex template issue #34. Same pattern: ARG + npm install @version + grep verification. ### 2. Notification-schema contract test in CI See codex template issue #31. Spawn the real vendor CLI, drive a real turn, assert each method name is recognized by the adapter's notification handlers. Plus an offline replay-fixture variant for credential-less CI. ### 3. Self-diagnosing wedge log See codex template issue #32. Whenever an adapter's wedge-detector fires, include the actual notification methods seen + their counts. Never just say `(deltas=0)` without naming what WAS received. ### 4. CODEX_DEBUG=1-style instrumentation switch See codex template issue #33. Bake stderr-trace hooks gated on an env var per template (CLAUDE_CODE_DEBUG, HERMES_DEBUG, OPENCLAW_DEBUG). Then prod debugging never requires docker cp + hot-patches. ## Acceptance - Each vendor-CLI-wrapping template has all 4 guards merged - Cross-template audit: list every `*_app_server*.py` / equivalent + verify it has the patterns - README in each template links the troubleshooting flow (CLI_DEBUG=1 → check wedge log method tally → find vendor schema-drift early) ## Touchpoints - molecule-ai-workspace-template-codex (already has the immediate fix via PR #30; the followups are #31-34) - molecule-ai-workspace-template-claude-code - molecule-ai-workspace-template-hermes - molecule-ai-workspace-template-openclaw - (audit other workspace-template-* repos and file per-template tickets as needed) ## Owner TBD. Could be parallelized 1 owner per template. ~1d each, ~1w total.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1691