molecule-core/tests/e2e
Hongming Wang fa9e29f2f5 fix(canary): reframe smoke prompt to give GPT-4o explicit permission to echo
Canary started flaking 2026-05-01 22:11 with model-refusal replies:
  - "I'm unable to do that."
  - "I'm unable to fulfill that request. Can I assist you with anything else?"
  - "I'm unable to reply with responses that don't allow me to fulfill tasks…"
3 failures in the last 10 runs = 30% flake rate.

Trigger: 2026-04-30's Platform Capabilities preamble (#2332) added the
directive "Use them proactively" to the top of every system prompt.
Combined with the heavy A2A + HMA tool docs further down, the model
reads the contrived bare-echo prompt ("Reply with exactly: PONG") as
out-of-role and intermittently refuses.

Real user prompts don't hit this — only the synthetic smoke prompt does,
so the right fix is in the canary's prompt phrasing, not the platform's
system prompt (which is correctly priming agents toward tool use). New
phrasing explicitly tells the model "this is a smoke test" and "no
tools or memory are needed" so it has permission to comply.
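
A minimal sketch of the reframed prompt and the reply check, under stated
assumptions: the helper names build_smoke_prompt and reply_has_token are
hypothetical, and the prompt wording is illustrative only. The actual strings
in test_staging_full_saas.sh are not shown in this commit message and may
differ.

```shell
# Hypothetical helper: build a smoke prompt that states its purpose and
# says no tools or memory are needed before asking for the bare echo.
# $1 = token the model should echo (e.g. PONG or CHILD_PONG).
build_smoke_prompt() {
  printf '%s' "This is a smoke test. No tools or memory are needed. Reply with exactly: $1"
}

# Hypothetical helper: succeed (exit 0) only when the reply contains the token.
# $1 = model reply, $2 = expected token.
reply_has_token() {
  case "$1" in
    *"$2"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage: the reply would come from the agent under test; it is stubbed here.
prompt="$(build_smoke_prompt PONG)"
reply="PONG"
if reply_has_token "$reply" PONG; then
  echo "smoke ok"
else
  echo "smoke failed: $reply" >&2
fi
```

A refusal such as "I'm unable to do that." contains no token, so the check
fails on it and a refusal cannot be mistaken for a pass.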

Also updates the child workspace's CHILD_PONG prompt with the same
framing — same failure mode would have hit it once full-mode runs again.

No code change to system prompt, no test infra change. Just two prompt
strings + a load-bearing comment so future readers don't trim back to
the brittle phrasing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 23:53:24 -07:00
_extract_token.py chore: apply round-7 review nits 2026-04-13 17:08:45 -07:00
_lib.sh feat(platform): GET /admin/workspaces/:id/test-token for E2E (#6) 2026-04-14 09:35:26 -07:00
STAGING_SAAS_E2E.md feat(e2e): pivot to admin-bearer-only auth + add sanity self-check workflow 2026-04-21 04:34:11 -07:00
test_2307_peer_visibility_staging.sh test(e2e): add staging peer-visibility harness for #2307 2026-04-29 13:26:24 -07:00
test_a2a_e2e.sh initial commit — Molecule AI platform 2026-04-13 11:55:37 -07:00
test_activity_e2e.sh chore: apply code-review round-6 suggestions 2026-04-13 17:08:45 -07:00
test_api.sh fix(e2e): stop asserting current_task on public workspace GET (#966) 2026-04-19 02:19:15 -07:00
test_chat_attachments_e2e.sh feat(canvas+platform): chat attachments, model selection, deploy/delete UX 2026-04-24 13:27:51 -07:00
test_chat_attachments_multiruntime_e2e.sh feat(canvas+platform): chat attachments, model selection, deploy/delete UX 2026-04-24 13:27:51 -07:00
test_chat_upload_e2e.sh feat(chat_files): rewrite Upload as HTTP-forward to workspace (RFC #2312, PR-C) 2026-04-29 14:26:37 -07:00
test_claude_code_e2e.sh chore: final open-source cleanup — binary, stale paths, private refs 2026-04-18 00:38:55 -07:00
test_comprehensive_e2e.sh fix(e2e): make provisioning-status assertions robust to CI environment 2026-04-13 17:31:07 -07:00
test_dev_mode.sh fix(quickstart): hotfixes discovered during live testing session 2026-04-23 14:57:18 -07:00
test_harness_rc_normalization.sh fix(e2e-sanity): normalize unexpected curl exit codes in cleanup trap (#2159) 2026-04-27 02:55:44 -07:00
test_notify_attachments_e2e.sh test(notify): pre-sweep prior workspaces so interrupted runs don't pile up 2026-04-26 20:55:13 -07:00
test_poll_mode_e2e.sh fix(e2e): use real UUIDs for poll-mode test workspace ids 2026-04-29 23:10:36 -07:00
test_priority_runtimes_e2e.sh feat(e2e): extend priority-runtimes test to cover all 8 templates 2026-04-27 05:57:59 -07:00
test_saas_tenant.sh chore: final open-source cleanup — binary, stale paths, private refs 2026-04-18 00:38:55 -07:00
test_staging_external_runtime.sh test(e2e): read delivery_mode from register response, not GET 2026-04-30 10:35:21 -07:00
test_staging_full_saas.sh fix(canary): reframe smoke prompt to give GPT-4o explicit permission to echo 2026-05-01 23:53:24 -07:00