Prerequisite for the universal-runtime refactor (task #87) to move claude_sdk_executor.py out of molecule-runtime into the claude-code template repo.

heartbeat.py had a hard import:

    from claude_sdk_executor import is_wedged, wedge_reason

which would break the moment the executor moves out of the runtime package: the heartbeat would lose access to the wedge state used to flip workspace status to degraded.

Extract the wedge state to a runtime-side module that the heartbeat can keep importing regardless of which adapter executor is wedged:

- workspace/runtime_wedge.py: single-flag state + mark_wedged / clear_wedge / is_wedged / wedge_reason / reset_for_test. Same semantics as the original claude_sdk_executor implementation (sticky first-write-wins, auto-clear on observed success). 100 LOC of small helpers around one module-level flag; lock-free is OK because there's one executor per workspace process today.
- workspace/claude_sdk_executor.py: drops the in-file definitions; re-exports the same names from runtime_wedge as a backwards-compat shim. Any third-party adapter that imported is_wedged / wedge_reason / _mark_sdk_wedged from claude_sdk_executor keeps working for one release cycle while it migrates to runtime_wedge.
- workspace/heartbeat.py: _runtime_state_payload() now imports from runtime_wedge instead of claude_sdk_executor. Lazy-import pattern preserved; the docstring is updated to explain the new cross-cutting source of truth.

Tests (10 new in test_runtime_wedge.py):

- Default state (unwedged), mark sets flag, first-write-wins, clear restores healthy, clear-when-not-wedged is a no-op, re-marking after clear is allowed
- Re-export shim: each old name in claude_sdk_executor IS the runtime_wedge function (identity check), and state is shared (marking via the executor shim is observable via runtime_wedge and vice versa)

Verification:

- 1251/1251 workspace pytest pass (was 1241 after orphan deletion; +10 = exactly the new test_runtime_wedge.py cases)
- All existing test_claude_sdk_executor.py cases (which call _mark_sdk_wedged via the shim) still pass

After this lands and the claude-code template image rebuilds with the local claude_sdk_executor.py copy (template PR #13), the molecule-core deletion of workspace/claude_sdk_executor.py becomes safe (the shim deletion comes alongside the file deletion, since runtime_wedge is the new public API).

See project memory `project_runtime_native_pluggable.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
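The re-export shim checks described above (identity of the old names, state shared in both directions) could be sketched roughly as follows. This is not the contents of test_runtime_wedge.py: both modules are hypothetical in-memory stand-ins built with `types.ModuleType`, and the mapping of `_mark_sdk_wedged` onto `mark_wedged` is an assumption.

```python
import types

# Hypothetical stand-in for workspace/runtime_wedge.py, reduced to what the
# demo needs. A dict holds the flag here; the real module uses a global.
_state = {"reason": None}


def mark_wedged(reason):
    if _state["reason"] is None:  # sticky: the first reason wins
        _state["reason"] = reason


def clear_wedge():
    _state["reason"] = None


def is_wedged():
    return _state["reason"] is not None


def wedge_reason():
    return _state["reason"] or ""


runtime_wedge = types.ModuleType("runtime_wedge")
runtime_wedge.mark_wedged = mark_wedged
runtime_wedge.clear_wedge = clear_wedge
runtime_wedge.is_wedged = is_wedged
runtime_wedge.wedge_reason = wedge_reason

# Hypothetical stand-in for the shim. In a real module this would be a plain
# `from runtime_wedge import ...`; aliasing _mark_sdk_wedged to mark_wedged
# is an assumption made for this sketch.
claude_sdk_executor = types.ModuleType("claude_sdk_executor")
claude_sdk_executor.is_wedged = runtime_wedge.is_wedged
claude_sdk_executor.wedge_reason = runtime_wedge.wedge_reason
claude_sdk_executor._mark_sdk_wedged = runtime_wedge.mark_wedged


def test_shim_names_are_the_same_objects():
    assert claude_sdk_executor.is_wedged is runtime_wedge.is_wedged
    assert claude_sdk_executor.wedge_reason is runtime_wedge.wedge_reason
    assert claude_sdk_executor._mark_sdk_wedged is runtime_wedge.mark_wedged


def test_state_is_shared_in_both_directions():
    runtime_wedge.clear_wedge()
    claude_sdk_executor._mark_sdk_wedged("marked via shim")
    assert runtime_wedge.is_wedged()
    assert runtime_wedge.wedge_reason() == "marked via shim"

    runtime_wedge.clear_wedge()
    runtime_wedge.mark_wedged("marked via runtime_wedge")
    assert claude_sdk_executor.wedge_reason() == "marked via runtime_wedge"
    runtime_wedge.clear_wedge()
```

Because the shim rebinds names rather than wrapping them, an `is` comparison is enough to show that both import paths reach the same function and therefore the same flag.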
100 lines
4.0 KiB
Python
"""Per-process runtime-wedge state.

Adapter executors that hit a non-recoverable wedge (e.g. claude-agent-sdk's
`Control request timeout: initialize` corrupting the client process's
internal state) call mark_wedged(reason). The heartbeat task reads
is_wedged() / wedge_reason() and forwards them in the heartbeat payload's
runtime_state field — the platform then flips workspace status to
`degraded` so the canvas surfaces a Restart hint instead of leaving the
user staring at a green dot while every chat hangs.

Module scope (not instance scope) is deliberate: the wedge is a property
of the Python process, not any particular executor. With one executor
per workspace process today this is the simplest lock-free
read+write fit. A future per-org multi-executor design could move this
to a shared registry.

This module lives in molecule-runtime (NOT in any adapter / template
repo) because:

  1. workspace/heartbeat.py reads it on every heartbeat — cross-cutting
     concern, runtime owns it.
  2. Multiple adapter executors can mark themselves wedged with their
     own reason; the runtime aggregates one flag for the platform.
  3. Decoupling from claude_sdk_executor is the prerequisite for the
     universal-runtime refactor (molecule-core task #87) — without
     this extraction, claude_sdk_executor.py couldn't move to its
     template repo because heartbeat would lose access to the wedge
     state.

Public API: mark_wedged(reason), clear_wedge(), is_wedged(),
wedge_reason(). The reset_for_test() helper is for unit tests only.
"""
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


# Single-flag state. None = healthy; non-empty string = wedged with that
# human-readable reason. Surfaced verbatim as the canvas's degraded-card
# banner text via heartbeat.sample_error.
_wedged_reason: str | None = None


def is_wedged() -> bool:
    """True if some adapter executor in this process has marked itself
    wedged. Sticky until the same executor calls clear_wedge() on
    observed recovery (or the process restarts)."""
    return _wedged_reason is not None


def wedge_reason() -> str:
    """Human-readable description of the wedge cause, or empty string
    when not wedged. Surfaced to the canvas via heartbeat sample_error."""
    return _wedged_reason or ""


def mark_wedged(reason: str) -> None:
    """Flag the runtime as wedged. Only the FIRST call wins so a
    subsequent identical-class wedge can't overwrite a more specific
    initial reason — the operator-visible banner stays stable.

    Adapters call this from their executor's exception path when the
    SDK has hit a non-recoverable error class. Safe to call multiple
    times; the no-op when already wedged is intentional.
    """
    global _wedged_reason
    if _wedged_reason is None:
        _wedged_reason = reason
        logger.error(
            "runtime wedge detected: %s — workspace will report degraded until cleared",
            reason,
        )


def clear_wedge() -> None:
    """Auto-recovery: adapter calls this after an observed successful
    operation. The original wedge could be transient (single network
    blip during the SDK's first-message handshake), and a sticky-only
    flag would lock the workspace into degraded forever even after the
    SDK started working again. Clearing on observed success means the
    next heartbeat after a working query reports runtime_state empty
    and the platform flips status back to online.

    No-op when not wedged (the common case)."""
    global _wedged_reason
    if _wedged_reason is not None:
        logger.info("runtime wedge cleared after successful operation — workspace will recover to online on next heartbeat")
        _wedged_reason = None


def reset_for_test() -> None:
    """Test-only escape hatch. Production code clears the wedge via
    clear_wedge() on observed success; this helper is for unit tests
    that need to reset between cases without going through the full
    SDK round-trip."""
    global _wedged_reason
    _wedged_reason = None
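A minimal sketch of how an adapter and the heartbeat might use this API, assuming a hypothetical `execute` entry point, a hypothetical `NonRecoverableSDKError` class, and a guessed payload shape (the real heartbeat fields may differ). The wedge helpers are re-declared as a tiny stand-in so the example runs without the runtime package.

```python
from typing import Callable

# Stand-in for the runtime_wedge helpers so the sketch is self-contained.
# A dict holds the flag here; the real module uses a module-level global.
_flag = {"reason": None}


def mark_wedged(reason: str) -> None:
    if _flag["reason"] is None:  # first write wins
        _flag["reason"] = reason


def clear_wedge() -> None:
    _flag["reason"] = None


def is_wedged() -> bool:
    return _flag["reason"] is not None


def wedge_reason() -> str:
    return _flag["reason"] or ""


class NonRecoverableSDKError(Exception):
    """Hypothetical error class that an adapter treats as a wedge."""


def execute(send: Callable[[str], str], prompt: str) -> str:
    """Hypothetical adapter entry point: mark on failure, clear on success."""
    try:
        reply = send(prompt)
    except NonRecoverableSDKError as exc:
        mark_wedged(str(exc))
        raise
    clear_wedge()  # observed success, so any earlier wedge is lifted
    return reply


def runtime_state_payload() -> dict:
    """Hypothetical payload shape, for illustration only."""
    if not is_wedged():
        return {"runtime_state": ""}
    return {"runtime_state": "wedged", "sample_error": wedge_reason()}
```

A failing `send` leaves the flag set until a later call succeeds, which is the sticky-then-auto-clear behaviour the docstrings describe.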