refactor(wedge): extract claude_sdk_executor wedge state into runtime_wedge module

Prerequisite for the universal-runtime refactor (task #87) to move
claude_sdk_executor.py out of molecule-runtime into the claude-code
template repo. heartbeat.py had a hard import:

    from claude_sdk_executor import is_wedged, wedge_reason

which would break the moment the executor moves out of the runtime
package — the heartbeat would lose access to the wedge state used to
flip workspace status to degraded.

Extract the wedge state to a runtime-side module that the heartbeat
can keep importing regardless of which adapter executor is wedged:

  - workspace/runtime_wedge.py — single-flag state + mark_wedged /
    clear_wedge / is_wedged / wedge_reason / reset_for_test. Same
    semantics as the original claude_sdk_executor implementation
    (sticky first-write-wins, auto-clear on observed success). 100
    LOC of pure stateless helpers; lock-free ok because there's one
    executor per workspace process today.

  - workspace/claude_sdk_executor.py — drops the in-file definitions;
    re-exports the same names from runtime_wedge as a backwards-compat
    shim. Any third-party adapter that imported is_wedged / wedge_reason
    / _mark_sdk_wedged from claude_sdk_executor keeps working for one
    release cycle while they migrate to runtime_wedge.

  - workspace/heartbeat.py — _runtime_state_payload() now imports
    from runtime_wedge instead of claude_sdk_executor. Lazy-import
    pattern preserved; the docstring updated to explain the new
    cross-cutting source-of-truth.

Tests (10 new in test_runtime_wedge.py):
  - Default state (unwedged), mark sets flag, first-write-wins,
    clear restores healthy, clear-when-not-wedged is no-op,
    re-marking after clear is allowed
  - Re-export shim: each old name in claude_sdk_executor IS the
    runtime_wedge function (identity check), state is shared
    (marking via the executor shim is observable via runtime_wedge
    and vice versa)

Verification:
  - 1251/1251 workspace pytest pass (was 1241 after orphan deletion;
    +10 = exactly the new test_runtime_wedge.py cases)
  - All existing test_claude_sdk_executor.py cases (which call
    _mark_sdk_wedged via the shim) still pass

After this lands + the claude-code template image rebuilds with the
local claude_sdk_executor.py copy (template PR #13), the molecule-
core deletion of workspace/claude_sdk_executor.py becomes safe (the
shim deletion comes alongside the file deletion, since runtime_wedge
is the new public API).

See project memory `project_runtime_native_pluggable.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Hongming Wang 2026-04-27 00:08:53 -07:00
parent c1e9aa7461
commit 1d231ed295
4 changed files with 237 additions and 71 deletions

View File

@ -87,71 +87,26 @@ _RETRYABLE_PATTERNS = (
"try again",
)
# Module-level SDK-wedge flag. When claude_agent_sdk's `query.initialize()`
# raises `Control request timeout: initialize`, the SDK's internal client-
# process state is corrupted for the rest of the Python process — every
# subsequent `_run_query()` call hits the same wedge and re-throws. The
# executor itself can't auto-recover (the underlying CLI subprocess and
# its read pipe are in an unrecoverable state); only a workspace restart
# clears it.
# SDK-wedge state lives in the runtime-side module (runtime_wedge) so
# heartbeat.py and any future cross-cutting consumer can read it without
# importing this adapter-specific executor. Decoupling was the prerequisite
# for moving claude_sdk_executor out of molecule-runtime into the
# claude-code template repo (task #87 — universal-runtime refactor).
#
# The heartbeat task reads these helpers and reports
# `runtime_state="wedged"` to the platform, which flips the workspace to
# `degraded` so the canvas surfaces a Restart hint instead of leaving
# the user staring at a green dot while every chat hangs.
#
# Module scope (not instance scope) is deliberate: the wedge is a
# property of the Python process, not the executor. A future per-org
# multi-executor design could move this to a shared registry, but with
# one executor per workspace process today the simplest lock-free
# read+write fits.
_sdk_wedged_reason: str | None = None
def is_wedged() -> bool:
"""True if the Claude SDK has hit a non-recoverable init wedge in
this process. Sticky until process restart."""
return _sdk_wedged_reason is not None
def wedge_reason() -> str:
"""Human-readable description of the wedge cause, or empty string
when not wedged. Surfaced to the canvas via heartbeat sample_error."""
return _sdk_wedged_reason or ""
def _mark_sdk_wedged(reason: str) -> None:
"""Internal — flag the SDK as wedged. Only the first call wins
(subsequent identical wedges shouldn't overwrite a more specific
reason). Tests use `_reset_sdk_wedge_for_test()` to clear."""
global _sdk_wedged_reason
if _sdk_wedged_reason is None:
_sdk_wedged_reason = reason
logger.error("SDK wedge detected: %s — workspace will report degraded until a successful query clears it", reason)
def _clear_sdk_wedge_on_success() -> None:
"""Auto-recovery — called from _run_query after a successful
completion. The original wedge could be transient (a single network
blip during the SDK's first-message handshake), and a sticky-only
flag would lock the workspace into degraded forever even after the
SDK started working again. Clearing on observed success means the
next heartbeat after a working query reports `runtime_state` empty
and the platform flips status back to online.
No-op when not wedged (the common case)."""
global _sdk_wedged_reason
if _sdk_wedged_reason is not None:
logger.info("SDK wedge cleared after successful query — workspace will recover to online on next heartbeat")
_sdk_wedged_reason = None
def _reset_sdk_wedge_for_test() -> None:
"""Test-only escape hatch. Production code clears the wedge via
`_clear_sdk_wedge_on_success` when a query succeeds; this helper
is for unit tests that need to reset between cases."""
global _sdk_wedged_reason
_sdk_wedged_reason = None
# Local re-exports keep the in-file call sites (_run_query etc.) terse
# and preserve the historical names so the behavior is identical to
# the pre-extraction version. is_wedged/wedge_reason are also re-exported
# so any external consumer that imported them from this module keeps
# working — heartbeat.py has been updated to import from runtime_wedge
# directly, but a third-party adapter copying our wedge convention may
# still expect them here.
from runtime_wedge import ( # noqa: E402
clear_wedge as _clear_sdk_wedge_on_success,
is_wedged,
mark_wedged as _mark_sdk_wedged,
reset_for_test as _reset_sdk_wedge_for_test,
wedge_reason,
)
# Per-tool-use summarizers. Reads the most-useful argument from each

View File

@ -22,16 +22,25 @@ from platform_auth import auth_headers, refresh_cache, self_source_headers
def _runtime_state_payload() -> dict:
"""Build the {runtime_state, sample_error} portion of the heartbeat
body when the Claude SDK has hit a wedge. Returns an empty dict
when the runtime is healthy so the heartbeat payload doesn't grow
fields the platform doesn't need.
body when SOME adapter executor has marked itself wedged. Returns
an empty dict when the runtime is healthy so the heartbeat payload
doesn't grow fields the platform doesn't need.
Imported lazily so workspaces running non-Claude runtimes (where
`claude_sdk_executor` may not be importable at all) keep working
a missing import means "no Claude wedge possible here, healthy."
Source of truth is runtime_wedge (lives in molecule-runtime,
independent of any specific adapter). Pre task #87 this imported
from claude_sdk_executor that worked because the executor was
bundled into molecule-runtime, but blocked moving it to the
claude-code template repo. The runtime_wedge module is now the
cross-cutting wedge-state holder; adapters mark/clear via it,
heartbeat reads it.
Imported lazily so a workspace whose runtime image somehow ships
without runtime_wedge (corrupt install, mid-rolling-deploy state)
keeps heartbeating a missing import means "no wedge info; assume
healthy."
"""
try:
from claude_sdk_executor import is_wedged, wedge_reason
from runtime_wedge import is_wedged, wedge_reason
except Exception:
return {}
if not is_wedged():

View File

@ -0,0 +1,99 @@
"""Per-process runtime-wedge state.
Adapter executors that hit a non-recoverable wedge (e.g. claude-agent-sdk's
`Control request timeout: initialize` corrupting the client process's
internal state) call mark_wedged(reason). The heartbeat task reads
is_wedged() / wedge_reason() and forwards them in the heartbeat payload's
runtime_state field the platform then flips workspace status to
`degraded` so the canvas surfaces a Restart hint instead of leaving the
user staring at a green dot while every chat hangs.
Module scope (not instance scope) is deliberate: the wedge is a property
of the Python process, not any particular executor. With one executor
per workspace process today this is the simplest lock-free
read+write fit. A future per-org multi-executor design could move this
to a shared registry.
This module lives in molecule-runtime (NOT in any adapter / template
repo) because:
1. workspace/heartbeat.py reads it on every heartbeat cross-cutting
concern, runtime owns it.
2. Multiple adapter executors can mark themselves wedged with their
own reason; the runtime aggregates one flag for the platform.
3. Decoupling from claude_sdk_executor is the prerequisite for the
universal-runtime refactor (molecule-core task #87) — without
this extraction, claude_sdk_executor.py couldn't move to its
template repo because heartbeat would lose access to the wedge
state.
Public API: mark_wedged(reason), clear_wedge(), is_wedged(),
wedge_reason(). The reset_for_test() helper is for unit tests only.
"""
from __future__ import annotations
import logging
logger = logging.getLogger(__name__)
# Single-flag state. None = healthy; non-empty string = wedged with that
# human-readable reason. Surfaced verbatim as the canvas's degraded-card
# banner text via heartbeat.sample_error.
_wedged_reason: str | None = None
def is_wedged() -> bool:
"""True if some adapter executor in this process has marked itself
wedged. Sticky until the same executor calls clear_wedge() on
observed recovery (or the process restarts)."""
return _wedged_reason is not None
def wedge_reason() -> str:
"""Human-readable description of the wedge cause, or empty string
when not wedged. Surfaced to the canvas via heartbeat sample_error."""
return _wedged_reason or ""
def mark_wedged(reason: str) -> None:
"""Flag the runtime as wedged. Only the FIRST call wins so a
subsequent identical-class wedge can't overwrite a more specific
initial reason the operator-visible banner stays stable.
Adapters call this from their executor's exception path when the
SDK has hit a non-recoverable error class. Safe to call multiple
times; the no-op when already wedged is intentional.
"""
global _wedged_reason
if _wedged_reason is None:
_wedged_reason = reason
logger.error(
"runtime wedge detected: %s — workspace will report degraded until cleared",
reason,
)
def clear_wedge() -> None:
"""Auto-recovery: adapter calls this after an observed successful
operation. The original wedge could be transient (single network
blip during the SDK's first-message handshake), and a sticky-only
flag would lock the workspace into degraded forever even after the
SDK started working again. Clearing on observed success means the
next heartbeat after a working query reports runtime_state empty
and the platform flips status back to online.
No-op when not wedged (the common case)."""
global _wedged_reason
if _wedged_reason is not None:
logger.info("runtime wedge cleared after successful operation — workspace will recover to online on next heartbeat")
_wedged_reason = None
def reset_for_test() -> None:
"""Test-only escape hatch. Production code clears the wedge via
clear_wedge() on observed success; this helper is for unit tests
that need to reset between cases without going through the full
SDK round-trip."""
global _wedged_reason
_wedged_reason = None

View File

@ -0,0 +1,103 @@
"""Tests for runtime_wedge — the runtime-side wedge-state module that
heartbeat reads + adapter executors write. Extracted from claude_sdk_
executor (task #87 universal-runtime refactor) so the executor can move
to its template repo without breaking heartbeat.
The behavior is identical to the prior in-executor implementation; tests
pin the contract so the re-export shim in claude_sdk_executor.py can
later be deleted without surprise."""
import pytest
import runtime_wedge
@pytest.fixture(autouse=True)
def _reset():
"""Each test starts with a clean wedge state — production wedges are
sticky-per-process, but cross-test bleed would couple unrelated cases."""
runtime_wedge.reset_for_test()
yield
runtime_wedge.reset_for_test()
class TestRuntimeWedge:
def test_starts_unwedged(self):
assert runtime_wedge.is_wedged() is False
assert runtime_wedge.wedge_reason() == ""
def test_mark_wedged_sets_flag_and_reason(self):
runtime_wedge.mark_wedged("SDK init timeout")
assert runtime_wedge.is_wedged() is True
assert runtime_wedge.wedge_reason() == "SDK init timeout"
def test_first_mark_wins(self):
# Stable banner text is more important than the most-recent
# cause. A second wedge while already wedged should NOT
# overwrite — operator sees the original (more diagnosable)
# reason, not whatever the SDK said next.
runtime_wedge.mark_wedged("SDK init timeout")
runtime_wedge.mark_wedged("Subsequent identical-class wedge")
assert runtime_wedge.wedge_reason() == "SDK init timeout"
def test_clear_wedge_restores_healthy(self):
# Auto-recovery: when the SDK starts working again, the next
# heartbeat must report empty runtime_state so the platform
# flips status from degraded back to online.
runtime_wedge.mark_wedged("transient blip")
runtime_wedge.clear_wedge()
assert runtime_wedge.is_wedged() is False
assert runtime_wedge.wedge_reason() == ""
def test_clear_wedge_when_not_wedged_is_noop(self):
# No-op safety — production calls clear_wedge() on every
# successful query (~thousands of times per session); throwing
# or logging when not wedged would spam.
runtime_wedge.clear_wedge()
runtime_wedge.clear_wedge() # still safe twice in a row
assert runtime_wedge.is_wedged() is False
def test_re_marking_after_clear_is_allowed(self):
# Real production path: SDK wedges, recovers, wedges again.
# Each cycle should land cleanly (not silently drop).
runtime_wedge.mark_wedged("first wedge")
runtime_wedge.clear_wedge()
runtime_wedge.mark_wedged("second wedge — different reason")
assert runtime_wedge.is_wedged() is True
assert runtime_wedge.wedge_reason() == "second wedge — different reason"
class TestClaudeSdkExecutorReExportShim:
"""claude_sdk_executor.py keeps re-exporting the old names for one
release cycle so any third-party adapter copying our wedge convention
has time to migrate. These tests pin the shim when removed, the
test file goes too."""
def test_is_wedged_re_exported(self):
from claude_sdk_executor import is_wedged
assert is_wedged is runtime_wedge.is_wedged
def test_wedge_reason_re_exported(self):
from claude_sdk_executor import wedge_reason
assert wedge_reason is runtime_wedge.wedge_reason
def test_internal_helpers_re_exported(self):
# Keep the underscore names too — claude_sdk_executor's own
# _run_query calls _mark_sdk_wedged / _clear_sdk_wedge_on_success
# via these re-exports.
from claude_sdk_executor import (
_mark_sdk_wedged,
_clear_sdk_wedge_on_success,
_reset_sdk_wedge_for_test,
)
assert _mark_sdk_wedged is runtime_wedge.mark_wedged
assert _clear_sdk_wedge_on_success is runtime_wedge.clear_wedge
assert _reset_sdk_wedge_for_test is runtime_wedge.reset_for_test
def test_re_export_state_is_shared(self):
# The shim isn't a copy — both names refer to the same module
# state. Marking via the executor name must be observable via
# the runtime_wedge name (and vice versa).
from claude_sdk_executor import _mark_sdk_wedged
_mark_sdk_wedged("via executor shim")
assert runtime_wedge.is_wedged() is True
assert runtime_wedge.wedge_reason() == "via executor shim"