molecule-core/workspace/runtime_wedge.py
Hongming Wang 4b5ac2ebc2 chore(workspace): drop claude_sdk_executor — Phase 2 of #87
Phase 2 of the universal-runtime refactor (task #87). Now that the
claude-code template repo ships its own claude_sdk_executor.py
(template PR #13 merged + image rebuilt at 07:36 UTC) the
molecule-runtime no longer needs to ship the file.

Deletes:
  - workspace/claude_sdk_executor.py (704 LOC)
  - workspace/tests/test_claude_sdk_executor.py (~1.6K LOC)

Updates:
  - workspace/runtime_wedge.py — drops the "Compatibility shim" docstring
    section. The shim was time-bounded ("removed once #87 Phase 2 lands");
    this is that PR.
  - workspace/tests/test_runtime_wedge.py — drops the
    TestClaudeSdkExecutorReExportShim test class (the shim doesn't
    exist anymore so the identity assertions would fail at import).
  - workspace/tests/conftest.py — drops the claude_agent_sdk stub.
    Its only consumer was test_claude_sdk_executor.py which is gone;
    no other test imports the SDK.
  - workspace/cli_executor.py — comment refresh: claude-code template
    repo (not workspace/) is now the home for ClaudeSDKExecutor.

Verified-safe-to-delete:
  - heartbeat.py: migrated to runtime_wedge in PR #2154 (no longer
    imports from claude_sdk_executor)
  - cli_executor.py: only comments referenced claude_sdk_executor;
    its line-117 ValueError defends against accidental routing
  - tests: only test_claude_sdk_executor.py + test_runtime_wedge.py's
    shim class consumed the deleted module; both removed in this PR

Verification:
  - 1182/1182 workspace pytest pass (was 1251; -69 = exactly the
    deleted test cases — zero unexpected regressions)
  - No live import of claude_sdk_executor anywhere in molecule-core
    after deletion (grep verified)

Closes #87 for the claude-code adapter. Hermes is already template-only.
The remaining adapter-specific code in workspace/ is cli_executor.py
(codex/ollama/gemini-cli) tracked by task #122. preflight.py's
SUPPORTED_RUNTIMES static list is tracked by task #123 (PR #2155 in
flight).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 00:52:55 -07:00

195 lines
8.1 KiB
Python

"""Per-process runtime-wedge state.
Adapter executors that hit a non-recoverable wedge (e.g. claude-agent-sdk's
`Control request timeout: initialize` corrupting the client process's
internal state) call mark_wedged(reason). The heartbeat task reads
is_wedged() / wedge_reason() and forwards them in the heartbeat payload's
runtime_state field — the platform then flips workspace status to
`degraded` so the canvas surfaces a Restart hint instead of leaving the
user staring at a green dot while every chat hangs.
Module scope (not instance scope) is deliberate: the wedge is a property
of the Python process, not any particular executor. With one executor
per workspace process today this is the simplest lock-free
read+write fit. A future per-org multi-executor design could move this
to a shared registry.
This module lives in molecule-runtime (NOT in any adapter / template
repo) because:
1. workspace/heartbeat.py reads it on every heartbeat — cross-cutting
concern, runtime owns it.
2. Multiple adapter executors can mark themselves wedged with their
own reason; the runtime aggregates one flag for the platform.
3. Decoupling from claude_sdk_executor is the prerequisite for the
universal-runtime refactor (molecule-core task #87) — without
this extraction, claude_sdk_executor.py couldn't move to its
template repo because heartbeat would lose access to the wedge
state.
Public API: mark_wedged(reason), clear_wedge(), is_wedged(),
wedge_reason(). The reset_for_test() helper is for unit tests only.
How to use from a NEW adapter (template repo)
---------------------------------------------
Hermes, Codex, LangGraph, or any future adapter that wants the same
"flip-to-degraded-on-fatal-wedge" UX should call mark_wedged + clear_wedge
from its executor. The runtime imports + heartbeat plumbing are already
in place — adapters do not change anything in molecule-runtime.
Minimum integration (~6 LOC inside the executor):
# Import path:
# - In a TEMPLATE repo (the common case for new adapters), the
# runtime is installed via PyPI as `molecule-ai-workspace-runtime`,
# so the import is `from molecule_runtime.runtime_wedge import …`.
# - In molecule-core itself (when editing this repo's own
# workspace/ tree), the module is at the top level — import as
# `from runtime_wedge import …`.
from molecule_runtime.runtime_wedge import mark_wedged, clear_wedge
async def execute(self, ctx, queue):
try:
result = await self._run_query(ctx)
except SomeFatalSdkError as e:
# Pick a short, operator-actionable reason. This becomes the
# banner text on the canvas's degraded card — keep it under
# ~80 chars and name the recovery action when possible.
mark_wedged(f"hermes init timeout — restart workspace ({e})")
raise
clear_wedge() # observed-success → next heartbeat reports healthy
return result
What you get for free:
- Heartbeat payload sets runtime_state="wedged" + sample_error=<reason>
on the next 30s tick.
- registry.go's evaluateStatus flips the workspace to `degraded` and
broadcasts WORKSPACE_DEGRADED so the canvas card turns yellow with
your reason as the subtitle.
- clear_wedge() on the next successful turn flips the workspace back
to `online` automatically — no manual operator action.
What NOT to do:
- Don't store wedge state in your adapter module. The platform-side
consumer (heartbeat) imports from runtime_wedge by name; an adapter-
local copy won't be observed.
- Don't call mark_wedged for transient errors (rate limits, single
failed network call). The whole point is "the SDK process is in a
state that can only be cleared by restart" — false positives
train operators to ignore the degraded banner.
- Don't write your own clear logic. clear_wedge() is the only path
the heartbeat watches; a custom flag won't propagate.
When wedge is the WRONG primitive: if the failure is per-request (the
SDK works for some inputs but not others), surface as a normal A2A
error response, not a wedge. Wedge means "every subsequent request in
this process will fail until restart."
"""
from __future__ import annotations
import logging
logger = logging.getLogger(__name__)
class _WedgeState:
"""Internal carrier for the wedge flag. Exposed only via the module-
level helpers below; adapters never see this class.
Wrapping the state in a class (instead of a bare module-level global)
is forward-cover for the day a runtime hosts multiple executors per
process — a future per-scope variant can hand out keyed instances
without changing the public mark_wedged / clear_wedge / is_wedged /
wedge_reason API. Today there's exactly one instance (_DEFAULT).
"""
def __init__(self) -> None:
# None = healthy; non-empty string = wedged with that human-
# readable reason. Surfaced verbatim as the canvas's degraded-
# card banner text via heartbeat.sample_error.
self._reason: str | None = None
def is_wedged(self) -> bool:
return self._reason is not None
def reason(self) -> str:
return self._reason or ""
def mark(self, reason: str) -> None:
# First-write-wins: a subsequent identical-class wedge can't
# overwrite a more specific initial reason so the operator-
# visible banner stays stable.
if self._reason is None:
self._reason = reason
logger.error(
"runtime wedge detected: %s — workspace will report degraded until cleared",
reason,
)
def clear(self) -> None:
# No-op when not wedged (the common case — adapters call this
# on every successful query).
if self._reason is not None:
logger.info(
"runtime wedge cleared after successful operation — workspace will recover to online on next heartbeat",
)
self._reason = None
def reset(self) -> None:
# Unconditional clear — for test fixtures only. Skips the
# info-level log line the production clear() path emits.
self._reason = None
# Single shared instance backing the module-level helpers. Today there's
# one executor per workspace process so this fits perfectly; the class
# wrap above is the seam for any future per-scope variant.
_DEFAULT = _WedgeState()
def is_wedged() -> bool:
"""True if some adapter executor in this process has marked itself
wedged. Sticky until the same executor calls clear_wedge() on
observed recovery (or the process restarts)."""
return _DEFAULT.is_wedged()
def wedge_reason() -> str:
"""Human-readable description of the wedge cause, or empty string
when not wedged. Surfaced to the canvas via heartbeat sample_error."""
return _DEFAULT.reason()
def mark_wedged(reason: str) -> None:
"""Flag the runtime as wedged. Only the FIRST call wins so a
subsequent identical-class wedge can't overwrite a more specific
initial reason — the operator-visible banner stays stable.
Adapters call this from their executor's exception path when the
SDK has hit a non-recoverable error class. Safe to call multiple
times; the no-op when already wedged is intentional.
"""
_DEFAULT.mark(reason)
def clear_wedge() -> None:
"""Auto-recovery: adapter calls this after an observed successful
operation. The original wedge could be transient (single network
blip during the SDK's first-message handshake), and a sticky-only
flag would lock the workspace into degraded forever even after the
SDK started working again. Clearing on observed success means the
next heartbeat after a working query reports runtime_state empty
and the platform flips status back to online.
No-op when not wedged (the common case)."""
_DEFAULT.clear()
def reset_for_test() -> None:
"""Test-only escape hatch. Production code clears the wedge via
clear_wedge() on observed success; this helper is for unit tests
that need to reset between cases without going through the full
SDK round-trip."""
_DEFAULT.reset()