Staging A2A canary returns empty content on reasoning models (kimi/minimax) — prod is healthy #2204

Closed
opened 2026-06-04 04:22:12 +00:00 by core-devops · 4 comments
Member

`E2E Staging SaaS (full lifecycle)` and the scheduled staging synthetic E2E have had 0 successful A2A round-trips for 8h+. Both fail at step 8 with `message contained no text content`, deterministically, on BOTH BYOK (MiniMax-M2) and platform (`moonshot/kimi-k2.6`).

**PROD is healthy** (verified): `moonshot/kimi-k2.6` with an adequate `max_tokens` returns real `content`. The model emits a separate `reasoning_content` field, and a tiny budget spends the whole allocation on reasoning, which leaves `content` empty. MiniMax returns real content. So this is **staging-only**, not a prod serving regression.

**Likely root cause:** the staging A2A path / canary either (a) does not read `reasoning_content` when `content` is empty, or (b) uses a `max_tokens` too small for a reasoning model, so `content` never materializes within budget. Investigate the A2A text extraction and the canary request budget on staging.
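To tell hypotheses (a) and (b) apart on a captured staging response, a small classifier along these lines may help. This is a minimal sketch, not code from any of the repos: the function name and the returned labels are hypothetical, and it assumes the staging proxy passes through the OpenAI-style `finish_reason` field (`"length"` when generation stops at `max_tokens`).

```python
from typing import Any


def classify_empty_completion(response: dict[str, Any]) -> str:
    """Label a chat-completions response against hypotheses (a) and (b).

    Hypothetical helper for triage; labels are illustrative only.
    """
    choice = response["choices"][0]
    message = choice.get("message") or {}
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    # Assumption: the proxy forwards finish_reason unchanged.
    truncated = choice.get("finish_reason") == "length"

    if content:
        return "ok"
    if reasoning and truncated:
        # Budget ran out while the model was still reasoning: hypothesis (b).
        return "budget-exhausted-in-reasoning"
    if reasoning:
        # Text exists, but only in reasoning_content: hypothesis (a).
        return "reasoning-only-turn"
    return "truly-empty"
```

Running it over the raw responses behind the failing step-8 calls would show whether the empty turns are mostly truncated (budget) or complete but reasoning-only (extraction).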

Diagnostic precision added in #2203 (names the empty-completion class). This issue tracks the real fix. Non-blocking for required gates.

Member

Recurrence evidence on molecule-core main `0ad52852fd38`: `E2E Staging Platform Boot` job `274530` now fails with the explicit #2203 diagnostic. Workspace boot reaches `status → running`, then A2A fails: `EMPTY COMPLETION`. The raw runtime error is `message contained no text content` for `MODEL_SLUG=moonshot/kimi-k2.6`. This confirms the issue class is empty assistant text from the staging model/proxy after a successful boot, not CP health, auth, quota, or workspace-server startup.

Author
Member

## Root cause: verdict (a) extraction (+ (b) budget as a secondary trigger)

Confirmed from operator runner logs (`docker logs`, last 8h): staging canaries on `MODEL_SLUG=MiniMax-M2` and `moonshot/kimi-k2.6` deterministically surface `Error: message contained no text content.` on the parent's first cold turn. Prod is healthy with the same models.

**Why:** these are reasoning models. On the OpenAI-compatible `/v1/chat/completions` endpoint they put the assistant turn in `reasoning_content` and leave `content` empty when the whole budget goes to the thinking preamble. The runtime executor's text extraction read only `message.content`, found it empty, and emitted `message contained no text content`.

This is a **product-quality bug, not just a canary trigger**: any real agent on a reasoning model would have its reasoning-only turns dropped as empty.

### Fixes (3 pieces, none merged; all need review)

- **hermes runtime** (this one was not yet covered): `molecule-ai-workspace-template-hermes` PR #75. `_extract_assistant_text` falls back to `reasoning_content` when `content` is empty; when both are empty it still flags the sentinel (we do NOT mask empty content blindly). Includes deterministic unit tests and an e2e chat_completions test.
- **claude-code runtime**: `molecule-ai-workspace-template-claude-code` PR #85. Equivalent thinking-block extraction (already in flight).
- **E2E budget trigger**: `molecule-core` PR #2209. Raises the liveness-probe `max_tokens` from 4 to 32 (already in flight). Note that this covers step 8d's tiny-budget probe; the extraction fixes cover the step-8 PONG round-trip and real agents regardless of budget.
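The fallback behaviour described for the hermes fix can be sketched as follows. This is an illustration under assumptions, not the code from PR #75: the function names and the use of `ValueError` are hypothetical, and the message is taken to be a plain dict in the OpenAI-compatible shape.

```python
from typing import Any, Optional

# Error text quoted in this issue; raising it as a ValueError is an assumption.
EMPTY_SENTINEL = "message contained no text content"


def extract_assistant_text(message: dict[str, Any]) -> Optional[str]:
    """Prefer content; fall back to reasoning_content; None if both are empty."""
    content = message.get("content")
    if isinstance(content, str) and content.strip():
        return content
    reasoning = message.get("reasoning_content")
    if isinstance(reasoning, str) and reasoning.strip():
        return reasoning
    return None


def text_or_raise(message: dict[str, Any]) -> str:
    """Keep the empty-completion signal: only a both-empty turn is an error."""
    text = extract_assistant_text(message)
    if text is None:
        raise ValueError(EMPTY_SENTINEL)
    return text
```

Keeping the both-empty case as an error is the point of the design: the fallback recovers reasoning-only turns without hiding a genuinely empty completion.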

### Note on `workspace-configs-templates/` in core

The hermes/claude-code adapters under `molecule-core/workspace-configs-templates/` are gitignored clones (populated from `manifest.json`), NOT the SSOT, so the fix belongs in the template repos above, not in a core commit.

The extraction fixes are the robust root-cause fix; #2209 is the cheap budget fix. Both layers are needed for a fully green canary.

Member

MECHANISM: recurrence on core main `aa7bc922d72e`: E2E Staging SaaS reaches an online `claude-code` workspace, then fails at `tests/e2e/test_staging_full_saas.sh:615-626`. The harness posts A2A, extracts only `result.parts[0].text`, and treats empty text as fatal. The workflow pins the default model at `.gitea/workflows/e2e-staging-saas.yml:149-156`; this run reports `MODEL_SLUG=moonshot/kimi-k2.6`, then the runtime surfaces `message contained no text content`.

EVIDENCE: run `207383`, jobs `276374` / `276375`, head `aa7bc922d72e`. Log sequence: tenant `running`, workspace `online`, then `A2A — EMPTY COMPLETION`. The failure line says the model returned a 2xx completion with no text part; teardown was clean and no EC2 leak remained.

RECOMMENDED FIX SHAPE: keep this in `#2204` as a staging LLM/proxy/canary extraction issue, not a workspace boot or teardown issue. The responsible surfaces are the text extraction and request budget in `tests/e2e/test_staging_full_saas.sh`, and the staging LLM proxy/model config for `moonshot/kimi-k2.6`. Engineers should verify whether reasoning-only content needs extraction or whether the canary token budget/model selection must change.
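If the harness-side extraction turns out to be part of the problem, one option is to collect text from every part instead of only `result.parts[0].text`. The sketch below is in Python for illustration (the harness itself is a shell script), the function name is hypothetical, and the only response layout it assumes is a `result.parts` list whose entries may carry a string `text` field.

```python
from typing import Any


def collect_part_text(a2a_response: dict[str, Any]) -> str:
    """Join the non-empty text of all parts; '' means no text part at all."""
    result = a2a_response.get("result") or {}
    parts = result.get("parts") or []
    texts = [
        part["text"]
        for part in parts
        if isinstance(part, dict)
        and isinstance(part.get("text"), str)
        and part["text"].strip()
    ]
    return "\n".join(texts)
```

An empty return value would still be treated as fatal by the caller, so this widens what counts as text without masking a truly empty reply.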

Author
Member

## Investigation update: slug is valid; the gap is MiniMax-M2 reasoning-format coverage in the runtime

Rigorous dig (verify-by-state, did NOT mask):

- **`MiniMax-M2` is a REGISTERED claude-code model**: see `providers.yaml:743` (core + CP byte-synced) and `registry_gen.go:83`. `validateRegisteredModelForRuntime("claude-code","MiniMax-M2")` and `validateDerivedProviderInRegistry(...)` both return `ok=true` (registered since 2026-05-27). So the model-validation work (#2179/#2220) is NOT rejecting it, and the earlier "step-5 provision 400" was a **stale 2026-05-30 transient** that self-resolved by 05-31. This is NOT a slug/registry problem; do not change the canary's `MODEL_SLUG` (that would mask the real red and silently change coverage).
- The **persistent red (05-31 → now) is the step-8 A2A `message contained no text content`** on the parent's first cold turn, i.e. the reasoning-only empty completion this issue tracks. The canary reaches step 8 fine (provision is green).
- The reasoning-content fix lives in the **claude-code RUNTIME repo** (#85), now in the staging pin `12acfb1` (promoted 07:08, plus durable auto-promote via template #87/#76). Neither the error string nor the fix is in molecule-core.

**OPEN QUESTION (the real remaining gap):** #85 extracts Anthropic-style thinking blocks; MiniMax-M2's reasoning-only turn may use a DIFFERENT shape (e.g. `<think>` in `content` vs a separate reasoning field) that #85 doesn't cover. A CLEAN post-07:08 canary run (pin `12acfb1`) is needed to confirm. If MiniMax-M2 still returns empty content with the fix deployed, a claude-code-runtime follow-up is needed to handle MiniMax's reasoning format specifically. If it passes, this is cleared. Tracking that verification.
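When the clean canary run is available, a helper like the one below could report which reasoning shape a captured message actually uses. It is a sketch under assumptions: the function name and labels are hypothetical, the `<think>` form is only the example shape mentioned above (not a confirmed MiniMax-M2 format), and the block form assumes Anthropic-style content lists whose entries carry `"type": "thinking"`.

```python
import re
from typing import Any

# Matches closed <think> blocks and a trailing unclosed one (truncated output).
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def describe_reasoning_shape(message: dict[str, Any]) -> str:
    """Report where a message carries its reasoning, if anywhere."""
    content = message.get("content")
    if isinstance(content, list):
        kinds = {block.get("type") for block in content if isinstance(block, dict)}
        if "thinking" in kinds:
            return "thinking-blocks"
    if isinstance(content, str) and "<think>" in content:
        visible = THINK_BLOCK.sub("", content).strip()
        # "-only" means nothing user-visible remains once reasoning is removed.
        return "inline-think-tags" if visible else "inline-think-tags-only"
    if (message.get("reasoning_content") or "").strip():
        return "separate-reasoning-field"
    return "no-reasoning-detected"
```

A result of `inline-think-tags-only` or `separate-reasoning-field` on a failing MiniMax-M2 turn would point to a shape that block-based extraction alone does not handle.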

Non-blocking: this is a non-required (continue-on-error/soak) staging gate; core main is green. Prod LLM serving verified healthy.

Reference: molecule-ai/molecule-core#2204