forked from molecule-ai/molecule-core
Canary started flaking 2026-05-01 22:11 with model-refusal replies: - "I'm unable to do that." - "I'm unable to fulfill that request. Can I assist you with anything else?" - "I'm unable to reply with responses that don't allow me to fulfill tasks…" 3 fails / 10 recent runs ≈ 30% flake. Trigger: 2026-04-30's Platform Capabilities preamble (#2332) added the directive "Use them proactively" to the top of every system prompt. Combined with the heavy A2A + HMA tool docs further down, the model reads the contrived bare-echo prompt ("Reply with exactly: PONG") as out-of-role and intermittently refuses. Real user prompts don't hit this — only the synthetic smoke prompt does, so the right fix is in the canary's prompt phrasing, not the platform's system prompt (which is correctly priming agents toward tool use). New phrasing explicitly tells the model "this is a smoke test" and "no tools or memory are needed" so it has permission to comply. Also updates the child workspace's CHILD_PONG prompt with the same framing — same failure mode would have hit it once full-mode runs again. No code change to system prompt, no test infra change. Just two prompt strings + a load-bearing comment so future readers don't trim back to the brittle phrasing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|---|---|---|
| .. | ||
| _extract_token.py | ||
| _lib.sh | ||
| STAGING_SAAS_E2E.md | ||
| test_2307_peer_visibility_staging.sh | ||
| test_a2a_e2e.sh | ||
| test_activity_e2e.sh | ||
| test_api.sh | ||
| test_chat_attachments_e2e.sh | ||
| test_chat_attachments_multiruntime_e2e.sh | ||
| test_chat_upload_e2e.sh | ||
| test_claude_code_e2e.sh | ||
| test_comprehensive_e2e.sh | ||
| test_dev_mode.sh | ||
| test_harness_rc_normalization.sh | ||
| test_notify_attachments_e2e.sh | ||
| test_poll_mode_e2e.sh | ||
| test_priority_runtimes_e2e.sh | ||
| test_saas_tenant.sh | ||
| test_staging_external_runtime.sh | ||
| test_staging_full_saas.sh | ||