fix(canary): reframe smoke prompt to give GPT-4o explicit permission to echo

Canary started flaking 2026-05-01 22:11 with model-refusal replies:

  - "I'm unable to do that."
  - "I'm unable to fulfill that request. Can I assist you with anything else?"
  - "I'm unable to reply with responses that don't allow me to fulfill tasks…"

3 fails / 10 recent runs ≈ 30% flake.

Trigger: 2026-04-30's Platform Capabilities preamble (#2332) added the directive "Use them proactively" to the top of every system prompt. Combined with the heavy A2A + HMA tool docs further down, the model reads the contrived bare-echo prompt ("Reply with exactly: PONG") as out-of-role and intermittently refuses.

Real user prompts don't hit this; only the synthetic smoke prompt does, so the right fix is in the canary's prompt phrasing, not the platform's system prompt (which is correctly priming agents toward tool use).

New phrasing explicitly tells the model "this is a smoke test" and "no tools or memory are needed" so it has permission to comply.

Also updates the child workspace's CHILD_PONG prompt with the same framing: same failure mode would have hit it once full-mode runs again.

No code change to system prompt, no test infra change. Just two prompt strings + a load-bearing comment so future readers don't trim back to the brittle phrasing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:53:24 -07:00
commit fa9e29f2f5
parent 435e13e57e
1 changed file with 15 additions and 2 deletions
--- a/tests/e2e/test_staging_full_saas.sh
+++ b/tests/e2e/test_staging_full_saas.sh
@@ -433,6 +433,19 @@ done

 # ─── 8. A2A round-trip on parent ───────────────────────────────────────
 log "8/11 Sending A2A message to parent — expecting agent response..."
+# Smoke prompt phrasing — DO NOT trim back to the bare "Reply with exactly: PONG"
+# version that ran here pre-2026-05-02. After the Platform Capabilities preamble
+# (#2332, 2026-04-30) landed in the system prompt, GPT-4o began intermittently
+# refusing the bare echo prompt with messages like:
+#   - "I'm unable to do that."
+#   - "I'm unable to fulfill that request. Can I assist you with anything else?"
+#   - "I'm unable to reply with responses that don't allow me to fulfill tasks…"
+# 3 fails / 10 runs ≈ 30% flake. Root cause: the preamble primes the model
+# ("Use them proactively") to expect tool use, then a zero-tool echo request
+# reads as out-of-role. Real user prompts (which is what hits prod) don't
+# trigger this — only this contrived smoke prompt does, so the right fix is
+# in the prompt phrasing, not in the platform's system prompt. Keep the
+# explicit "no tools needed" framing so the model has permission to comply.
 A2A_PAYLOAD=$(python3 -c "
 import json, uuid
 print(json.dumps({
@@ -443,7 +456,7 @@ print(json.dumps({
        'message': {
            'role': 'user',
            'messageId': f'e2e-{uuid.uuid4().hex[:8]}',
-            'parts': [{'kind': 'text', 'text': 'Reply with exactly: PONG'}]
+            'parts': [{'kind': 'text', 'text': 'This is the platform smoke test verifying agent wiring. No tools or memory are needed — please respond with exactly the single token: PONG'}]
        }
    }
 }))
@@ -559,7 +572,7 @@ print(json.dumps({
        'message': {
            'role': 'user',
            'messageId': f'e2e-deleg-{uuid.uuid4().hex[:8]}',
-            'parts': [{'kind': 'text', 'text': 'Reply with exactly: CHILD_PONG'}]
+            'parts': [{'kind': 'text', 'text': 'This is the platform smoke test verifying child workspace wiring. No tools or memory are needed — please respond with exactly the single token: CHILD_PONG'}]
        }
    }
 }))
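
For readers who want to re-measure the flake rate described in the commit message, here is a minimal sketch of how replies could be sorted into passes and refusals. It is not part of this commit, and the diff does not show how the script itself checks the agent's reply. The names `classify_reply`, `flake_rate`, and `REFUSAL_PREFIXES` are hypothetical, the strict single-token match is an assumption, and the refusal heuristic is derived only from the three refusal messages quoted above.

```python
# Hypothetical helpers, not taken from test_staging_full_saas.sh.
# Assumption: a reply passes only if it is exactly the expected token
# once surrounding whitespace is stripped.

# Assumption: every refusal quoted in the commit message opens this way.
REFUSAL_PREFIXES = ("i'm unable", "i am unable")


def classify_reply(reply: str, token: str = "PONG") -> str:
    """Return 'pass', 'refusal', or 'other' for one smoke-test reply."""
    text = reply.strip()
    if text == token:
        return "pass"
    # Normalize curly apostrophes so both spellings of "I'm" are caught.
    lowered = text.lower().replace("\u2019", "'")
    if lowered.startswith(REFUSAL_PREFIXES):
        return "refusal"
    return "other"


def flake_rate(replies, token: str = "PONG") -> float:
    """Fraction of replies that did not pass."""
    failures = sum(1 for r in replies if classify_reply(r, token) != "pass")
    return failures / len(replies)


# Illustrative sample shaped like the reported 3 fails / 10 runs.
sample = ["PONG"] * 7 + [
    "I'm unable to do that.",
    "I'm unable to fulfill that request. Can I assist you with anything else?",
    "I'm unable to reply with responses that don't allow me to fulfill tasks…",
]

print(f"{flake_rate(sample):.0%}")  # prints 30%
```

Passing `token="CHILD_PONG"` would cover the child workspace prompt in the same way. Keeping 'refusal' separate from 'other' makes it easier to tell this particular failure mode apart from unrelated wiring breakage.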