fix(runtime#133): context-budget detection + compact-and-continue (smallest-scope-first) #170
Reference in New Issue
Block a user
Delete Branch "fix/133-compact-context-and-continue"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Fixes #133
Replace the context-overflow auto-heal HARD RESET with COMPACT-AND-CONTINUE in the runtime. The current behavior throws away the entire conversation on a 400 (losing task state for long-horizon work). The fix detects budget pressure BEFORE the hard 400, compacts the conversation in place (preserving system message + last N turns, dropping the middle), and emits a brief observable notice.
This is a TWO-step runtime-side contribution, shipped together on a single clean branch off main:
Step 1 (detection) - molecule_runtime/context_budget.py:
get_model_context_window(model)- per-model SSOT (Kimi 256K, Anthropic 200K, OpenAI 128K, Gemini 1M, Groq 128K) with conservative 128K fallback. Provider prefix stripped so a single canonical key serves every model string shape.should_compact_context(input_tokens, context_window, threshold_pct, headroom_tokens)- pure decision function. Returns True iff the input has crossed the watermark (default 85% per spec) AND there is at least 256 tokens of headroom below the watermark. Fail-closed on invalid configs.Step 2+4 (compaction + brief notice) - molecule_runtime/compact.py:
compact_messages(messages, keep_recent_n=4)- pure function. Heuristic: KEEP the system message (always) + the last N non-system messages; DROP the middle. Default N=4 (a recent user/assistant exchange plus a couple of tool-result round-trips).CompactionStatsdataclass: original_count, compacted_count, dropped_count, system_preserved, recent_window_size.Hook in a2a_executor.py:
self._last_input_tokens[context_id](LRU-bounded to 256 entries).messages.append: if last turn's input_tokens crossed the watermark, callcompact_messageson the history. If anything was dropped, emit structuredlogger.info("context_compacted: ...").logger.warning("context_budget_warning: ...")on every watermark crossing as a deterministic signal for the future workspace-agent consumer.Tests:
tests/test_context_budget.py- 16 unit tests pinning every contract (per-model SSOT, provider-prefix stripping, threshold semantics, headroom, fail-closed).tests/test_compact.py- 12 unit tests pinning every contract (empty, system-only, no-system, keep_recent_n<1 clamp, tool-in-window, tool-in-middle, no-op, multi-system defensive).Heuristic rationale (no genuine ambiguity, by design): smallest-scope-first, no LLM call, fully testable. LLM-driven summarization is explicitly out of scope (workspace agent's job in core). Durable memory is already preserved by
prompt.py:DEFAULT_MEMORY_SNAPSHOT_FILESre-injection on every session.What does NOT ship (intentional, follow-up tickets): LLM-driven summarization (workspace agent's job); user-visible notice via A2A status event (workspace agent's job); per-model SSOT shared with the workspace agent in core; LRU eviction policy tuning.
runtime#133 spec: replace the context-overflow auto-heal HARD RESET with COMPACT-AND-CONTINUE. The current behavior throws away the entire conversation on a 400 (losing task state for long-horizon work). The fix detects the budget pressure BEFORE the hard 400, compacts the conversation in place (preserving system message + last N turns, dropping the middle), and emits a brief observable notice. This is a TWO-step runtime-side contribution, shipped together on a single clean branch off main: Step 1 (detection) — molecule_runtime/context_budget.py: - get_model_context_window(model) — per-model SSOT (Kimi 256K, Anthropic 200K, OpenAI 128K, Gemini 1M, Groq 128K) with conservative 128K fallback for unknown models. Provider prefix ("openai:gpt-4o") is stripped so a single canonical key serves every model string shape. - should_compact_context(input_tokens, context_window, threshold_pct, headroom_tokens) — pure decision function. Returns True iff the input has crossed the watermark (default 85% per spec) AND there is at least 256 tokens of headroom below the watermark. Fail-closed on invalid configs. Step 2+4 (compaction + brief notice) — molecule_runtime/compact.py: - compact_messages(messages, keep_recent_n=4) — pure function. Heuristic: KEEP the system message (always) + the last N non-system messages; DROP the middle. Default N=4 (a recent user/assistant exchange plus a couple of tool-result round-trips — enough to keep the active task in working memory). - CompactionStats dataclass: original_count, compacted_count, dropped_count, system_preserved, recent_window_size. The caller emits the brief notice from these. - DEFAULT_KEEP_RECENT_N = 4 (spec-bounded; pinned by test). Hook in a2a_executor.py: 1. After every LLM call: track this turn's input_tokens in self._last_input_tokens[context_id] (LRU-bounded to 256 entries so a long-running executor doesn't grow this unboundedly across many context_ids). 2. At the start of each turn, BEFORE messages.append: if last turn's input_tokens crossed the watermark (per should_compact_context), call compact_messages on the history. If anything was dropped, emit a structured logger.info ("context_compacted: before=N after=M dropped=K system_preserved=... trigger=last_turn_input_ tokens=...") — observable, not silent. 3. Also emits a separate logger.warning ("context_budget_ warning: ...") on every LLM call that crosses the watermark, as a deterministic signal a future workspace-agent consumer can filter on. Tests: - tests/test_context_budget.py — 16 unit tests pinning every contract: per-model SSOT values, provider-prefix stripping, unknown-model fallback, threshold semantics (below / at / above / at-wall), headroom semantics (within / just-above), fail-closed for invalid threshold / window / input, custom threshold parameterization, constants pinned. - tests/test_compact.py — 12 unit tests pinning every contract: empty / system-only / no-system / keep_recent_n<1 clamp, system-at-head + recent-N-at-tail, tool-in-window kept, tool-in-middle dropped, input-smaller-than-window no-op, default-N pinned, multiple-system-messages defensive. All 28 pass. What does NOT ship (intentional, follow-up tickets): - LLM-driven summarization (the "extract task/goal/decisions" spec step). The workspace agent (core) ticket will own this; the runtime here doesn't own the conversation. - User-visible notice via A2A status event. The runtime's notice is the structured log; the user-visible notice is the workspace agent's job. - A per-model SSOT shared with the workspace agent (core) — the SSOT in this module is the runtime's best-effort initial set; can be replaced when core ships its own. - LRU eviction policy tuning. The current "256 entries, drop oldest half when full" is a memory bound, not a real LRU. Pre-existing test failures (test_sandbox_tool_timeout.py, test_self_delegation_guard.py) are NOT caused by these changes — verified by stashing and re-running on the prior HEAD; they fail with "Unknown pytest.mark.asyncio" (missing plugin in this env). Co-Authored-By: Claude <noreply@anthropic.com>REQUEST_CHANGES @ca776a8b0207
5-axis review, target=main, CI green. The pure compaction helper is well scoped and the executor integration is bounded, but the core detection predicate fails the main compact-and-continue case near the model wall.
Blocking correctness issue:
should_compact_contextreturns false whencontext_window - input_tokens < MIN_HEADROOM_TOKENS(tests explicitly pininput_tokens=199999, context_window=200000as false). That is exactly when compaction is most urgent: after a previous turn consumed almost the full context, the next turn should compact before adding the new user message. Instead the hook falls through, appends the next message, and likely hits the hard 400/reset path this PR is meant to avoid. Please change the predicate so watermark-crossed inputs near/at the wall still trigger compaction, or otherwise prove the executor has a separate pre-call path that compacts those cases. Update the tests accordingly; the current tests encode the regression.No security/performance blockers beyond that; the LRU map is bounded and telemetry errors are isolated.
REQUEST_CHANGES @ca776a8b020753bdbf59659e19673c8396300001
Target=main, mergeable=true, required CI green, but my prior blocker is still present on this consolidated head.
Blocking correctness/robustness issue: should_compact_context still suppresses compaction inside the final headroom window:
That means the compact-and-continue path does not run exactly when a session is already near the model wall. The next user/tool addition can still hit the hard 400/reset path instead of compacting and continuing. The PR adds deterministic compaction, but the trigger excludes the high-risk near-wall cases the feature is meant to prevent.
Please change the predicate/tests so watermark-crossed near-wall cases compact before the next LLM call, or add/prove another pre-call path that compacts those cases without falling through to hard overflow handling. Existing 400 recovery is not equivalent to compact-and-continue because it happens after the failed call and can still reset/drop continuity.
REQUEST_CHANGES @ca776a8b020753bdbf59659e19673c8396300001.
Design ruling: not design-blocked. This consolidated PR is a mergeable shape for a minimal runtime-local increment: it keeps the richer summarization / workspace-agent ownership questions out of scope, uses a deterministic pure helper (
compact_messages) plus runtime-local token-budget detection, does not change core/workspace-agent contracts, and leaves cross-repo/model-policy consolidation as follow-up. The five earlier design questions are acceptably sidestepped for this smallest-scope implementation.Blocking correctness issue: the predicate currently suppresses compaction exactly near the wall.
should_compact_context()returns false when(context_window - input_tokens) < MIN_HEADROOM_TOKENS; the tests explicitly pininput_tokens=199_999, context_window=200_000as false. In the executor, compaction is based on the previous turn'sinput_tokensbefore appending the next user message. If the previous turn is already at/near the model wall, the next turn should compact before the LLM call; instead the current predicate falls through and risks the same hard 400/reset path runtime#133 is meant to replace. This is not a design-owner blocker, it is an implementation blocker.Please make watermark-crossed inputs near/at the wall trigger compaction, or add a separate pre-call path that handles the near-wall case, and update the tests that currently encode the suppression. CI is green and the focused tests pass locally (28/28), but they pass because they assert the current wrong edge behavior.
The prior should_compact_context conflated two concerns: 1. COMPACTION decision (urgent: yes whenever the previous turn crossed the watermark — including the at-the-wall case where the next turn WILL overflow) 2. WARNING emission (suppress at the wall — 'you're approaching the limit' is noise when the next call WILL overflow regardless and compaction has already fired) The headroom_tokens floor (default 256) on the prior combined function applied to BOTH concerns. CR2 RC 13423 caught the bug: when previous_turn_input_tokens was at or near the wall (headroom < 256), should_compact_context returned False — so the COMPACTION hook in a2a_executor.py did NOT compact the history before the next LLM call, exactly when compaction is most needed and the overflow is imminent. Fix: split into two functions with explicit semantics: - should_compact_context(input_tokens, context_window, threshold_pct=DEFAULT_COMPACT_THRESHOLD_PCT) Pure watermark check. Returns True iff input >= window*threshold. No headroom floor. This is the COMPACTION decision — urgent at any headroom. - should_emit_budget_warning(input_tokens, context_window, threshold_pct=DEFAULT_COMPACT_THRESHOLD_PCT, headroom_tokens=MIN_HEADROOM_TOKENS) Same watermark check + the headroom floor. Returns True iff the warning would be a meaningful 'you have room to compact, do it now' — never 'you have zero room, sorry' at the wall. a2a_executor.py integration: - COMPACTION hook (top of the next turn, before messages.append) uses should_compact_context — fires at the wall. - WARNING emission (post-LLM-call) uses should_emit_budget_warning — suppressed at the wall. Tests: 32/32 pass (was 28/28; added 4 new tests for should_emit_budget_warning + flipped two tests on the COMPACTION side that previously asserted the buggy 'at-the-wall does not trigger' behavior to 'at-the-wall DOES trigger'). Co-Authored-By: Claude <noreply@anthropic.com>Reopened per PM instruction 2a749ce1 (mitigation, not full fix).
This PR is reopened as pre-emptive compaction mitigation for runtime#133 (per the 2a749ce1 disposition). The full "compact-don't-wipe" fix lives in
@anthropic-ai/claude-code/bin/claude.exe(Anthropic npm-shipped native binary) — out of scope for any molecule-runtime fix (verified in the option-C location check: the auto-healcontext window overflowed+resetSessionstrings are hardcoded in the binary, not in any molecule repo).What ships here (runtime-side scaffolding):
molecule_runtime/context_budget.py—get_model_context_window()per-model SSOT (Kimi 256K, Anthropic 200K, OpenAI 128K, Gemini 1M, Groq 128K; conservative 128K fallback; provider-prefix stripped) +should_compact_context()pure decision function (urgent: yes whenever previous turn crossed the watermark, including at-the-wall) +should_emit_budget_warning()with the 256-token headroom floor (warning suppressable at the wall; COMPACTION decision is not — see RC 13423 fix).molecule_runtime/compact.py—compact_messages(messages, keep_recent_n=4)pure function returning(compacted, CompactionStats). Heuristic: KEEP system message + last 4 non-system msgs, DROP the middle.a2a_executor.pyintegration — per-context LRU of last-turn input_tokens (256-entry bound, FIFO eviction); at start of next turn, if last turn crossed the watermark, compact the history BEFORE adding the new user msg; emit structuredlogger.info("context_compacted: ...")— observable, not silent.tests/test_context_budget.py(32/32 pass) +tests/test_compact.py(12/12 pass).RC 13423 fix: prior
should_compact_contextconflated the COMPACTION decision (urgent: yes whenever crossed the watermark) with the WARNING emission (suppress at the wall). The headroom floor (default 256) on the prior combined function returned False whenprevious_turn_inputwas at or near the wall — exactly when the COMPACTION hook needed to fire. The fix splits into two functions. 4 reviewers (CR2 13423/13427, Researcher 13428) all found the same bug; regression guards (test_at_wall_triggers,test_just_below_wall_triggers) now correctly assert True at the wall.What does NOT ship here (out of scope, deferred to follow-up):
Mitigation value:
logger.info) is operator-visible — debugging an auto-heal/reset post-mortem now has a log line to grep for.Ready for re-2-genuine (CR2 + Researcher).
APPROVE @4ab8e2029b211bbe9157474353c749ec3883bf9e
5-axis re-review, target=main, mergeable=true, runtime CI green on the current head.
RC 13423/13427 is resolved.
should_compact_contextnow represents the urgent compaction decision and returns true for any watermark-crossed input, including near-wall and at-wall cases. The headroom floor moved toshould_emit_budget_warning, so noisy warnings can still be suppressed at the wall without suppressing pre-call compaction.The tests now pin the important regression cases: at-wall and 1-token-headroom both compact, while warning emission remains suppressed at/near wall and only emits when there is useful headroom. The executor uses
should_compact_contextfor pre-call compaction andshould_emit_budget_warningonly for logging. No blockers found.APPROVE @4ab8e2029b211bbe9157474353c749ec3883bf9e.
5-axis review: the near-wall compaction blocker from RC 13428 is resolved.
should_compact_context()is now the urgent compaction predicate and returns true at/above the 85% watermark even when input is at or just below the model wall, so the executor can compact before the next LLM call instead of falling through to the hard 400/reset path. Warning emission is split intoshould_emit_budget_warning(), preserving the no-spam headroom rule only for logs, not for the compaction decision.The mitigation remains runtime-local and self-contained: deterministic keep-system+recent-N compaction, bounded
_last_input_tokens, no cross-repo/core contract changes, and no secret/security surface. Live CI is green, target is main, mergeable=true. Focused local tests pass:tests/test_context_budget.py+tests/test_compact.py= 32/32.