Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 6ba9424196 | |
| | 531d98efea | |
| | 0b17567891 | |
| | 59d699b61c | |
@@ -0,0 +1,104 @@
# local-e2e: session-continuity canary harness

Self-contained Docker-Compose harness that gates RFC#600-class template
changes (session continuity, file-only messages, multimodal prompts,
cross-session memory) **before** they reach customer canary.

Per CTO standing directive "fully tested + separate CI": this is a
dedicated, *fast* (target <3 min), *small-surface* harness that uses a
Python tenant-CP simulator (not the full `workspace-server` Go service)
to exercise the runtime image end-to-end against canonical canary turns.

See [`feedback_no_single_source_of_truth`]: the harness IS the canonical
session-continuity validator. Per-runtime unit tests still cover their
own guard logic; the harness covers the live conversational behaviour
that those unit tests cannot prove.

See [`feedback_image_promote_is_not_user_live`]: every assertion reads
state back from the *running container*, never from a publish-pipeline
ack.

## What it tests (the 4 canaries)

| # | Scenario | Asserts |
|---|----------|---------|
| 1 | 2-turn name canary | turn 2 reply contains "Hongming" → SessionStore continuity |
| 2 | File-only message (no caption) | NOT "(empty prompt — nothing to do)" + reply references filename or asks for clarification |
| 3 | File + caption ("summarize this") | reply addresses attachment + caption |
| 4 | Cross-session memory recall | new session pulls "blue" via memory tool |

Each scenario re-uses the same A2A wire-shape that the production
`workspace-server` POSTs to runtime `:8000` (canvas-thread-id semantics
via `context_id`).

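
To make the `context_id` semantics concrete, here is a toy sketch. It is not the runtime's actual SessionStore: `ToySessionStore` and `replay` are hypothetical names invented for illustration. It only shows why history keyed on `context_id` (per conversation) is visible on turn 2, while history keyed on a per-turn task id is not.

```python
from collections import defaultdict


class ToySessionStore:
    """Hypothetical in-memory store; stands in for the runtime's real one."""

    def __init__(self) -> None:
        self._history: dict[str, list[str]] = defaultdict(list)

    def append(self, key: str, text: str) -> None:
        self._history[key].append(text)

    def history(self, key: str) -> list[str]:
        # .get() does not create missing keys on a defaultdict.
        return list(self._history.get(key, []))


def replay(store: ToySessionStore, turns: list[dict[str, str]], key_field: str) -> list[str]:
    """Store each turn under turn[key_field]; return the history the last turn could see."""
    seen: list[str] = []
    for turn in turns:
        seen = store.history(turn[key_field])  # history visible before this turn
        store.append(turn[key_field], turn["text"])
    return seen


turns = [
    {"context_id": "canary-ctx-1", "task_id": "task-a", "text": "Hi, my name is Hongming."},
    {"context_id": "canary-ctx-1", "task_id": "task-b", "text": "What's my name?"},
]

by_context = replay(ToySessionStore(), turns, "context_id")  # turn 2 sees turn 1
by_task = replay(ToySessionStore(), turns, "task_id")        # turn 2 sees nothing
print(by_context, by_task)
```

Keying on the per-turn id reproduces the amnesia that canary 1 is meant to catch, which is why every scenario pins a single `context_id` per conversation.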

## Architecture

```
local-e2e/
  docker-compose.yml               # runtime under test + cp_sim
  cp_sim/                          # ≈200 LoC Python A2A poster + file uploader
    cp_sim.py
    Dockerfile
    requirements.txt
    canary/
      conftest.py
      test_session_continuity.py   # 4 canary scenarios
      test_layer_diagnostics.py    # layer probes (agent card, continuation, memory read)
  scripts/
    run-canary.sh                  # one-shot orchestration entrypoint
    onboard-template.sh            # wires the gate into a template repo
```

The CP simulator emits the **exact** JSON-RPC `message/send` envelope
that `workspace-server` produces (verified against
`tests/e2e/test_chat_attachments_e2e.sh`). No Go service is in the loop,
which keeps the harness lean per the CTO directive.

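
As a rough illustration of that envelope, the sketch below builds a `message/send` payload with the same field names `cp_sim.py` uses in this change. `build_message_send` is a hypothetical helper name, and the authoritative shape remains whatever `workspace-server` actually emits.

```python
import base64
import json
import uuid
from typing import Any, Optional


def build_message_send(
    context_id: str,
    text: Optional[str] = None,
    file_name: Optional[str] = None,
    file_bytes: Optional[bytes] = None,
    mime_type: str = "text/plain",
) -> dict[str, Any]:
    """Build a JSON-RPC `message/send` envelope with optional text and file parts."""
    parts: list[dict[str, Any]] = []
    if text:
        parts.append({"kind": "text", "text": text})
    if file_name is not None and file_bytes is not None:
        parts.append(
            {
                "kind": "file",
                "file": {
                    "name": file_name,
                    "mimeType": mime_type,
                    # Inline form: the file content travels base64-encoded.
                    "bytes": base64.b64encode(file_bytes).decode("ascii"),
                },
            }
        )
    msg_id = f"canary-{uuid.uuid4().hex[:12]}"
    return {
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "messageId": msg_id,
                "kind": "message",
                "contextId": context_id,
                "taskId": None,
                "parts": parts,
            },
            "configuration": {"acceptedOutputModes": ["text/plain"], "blocking": True},
        },
    }


# File-only message (canary 2): no text part at all.
envelope = build_message_send(
    "canary-ctx-demo", file_name="status.txt", file_bytes=b"Project status: nominal."
)
print(json.dumps(envelope, indent=2))
```

The file-only case produces a `parts` list with a single `file` entry and no `text` entry, which is the input canary 2 relies on.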

## Run locally

```bash
# from molecule-core repo root:
export TEMPLATE_IMAGE=ghcr.io/molecule-ai/workspace-template-hermes:latest
./local-e2e/scripts/run-canary.sh
```

Exit code 0 = all 4 canaries pass. Exit code 1 = at least one canary failed
and the harness dumped SessionStore state + last 200 log lines from the
runtime container into `./local-e2e/artifacts/`. Exit code 2 = harness
infrastructure failure (image pull, compose, etc.).

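
A wrapper that needs to tell a canary regression apart from an infrastructure fault could interpret the exit code along these lines. This is a minimal sketch: `describe_exit` and `run_canary` are hypothetical helpers, and the code meanings are taken from the header comment of `run-canary.sh`.

```python
import os
import subprocess

# Meanings as documented in the run-canary.sh header comment.
EXIT_MEANINGS = {
    0: "all 4 canaries passed",
    1: "at least one canary failed (see ./local-e2e/artifacts/)",
    2: "harness infrastructure failure (image pull / compose / etc.)",
}


def describe_exit(code: int) -> str:
    """Translate a run-canary.sh exit code into a human-readable verdict."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_canary(template_image: str) -> int:
    """Run the harness from the repo root and return its exit code (not raised)."""
    proc = subprocess.run(
        ["./local-e2e/scripts/run-canary.sh"],
        env={**os.environ, "TEMPLATE_IMAGE": template_image},
        check=False,  # we want the code itself, not an exception
    )
    return proc.returncode
```

A caller would typically treat code 1 as a regression to investigate in the artifacts and code 2 as something to retry or escalate to whoever owns the CI runner.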

## How it integrates into CI

Each template repo's `.gitea/workflows/session-continuity-e2e.yml` calls
`run-canary.sh` with its own freshly-built `TEMPLATE_IMAGE`. The
template repo's Gitea branch-protection lists
`session-continuity-e2e (pull_request)` as a required context.

Rollout order (deliberate: per `feedback_image_promote_is_not_user_live`
we bake before we cascade):

1. `molecule-ai-workspace-template-hermes`: highest-traffic + most
   recent RFC#600-class fixes; REQUIRED gate
2. Bake for 5 business days
3. Cascade to claude-code, langgraph, autogen, openclaw, smolagents,
   google-adk (one PR per template; see `scripts/onboard-template.sh`)

## Future extensions (out of scope for the initial PR)

- Multi-session memory consistency (3+ sessions deep)
- Tool-use canary (workspace seeded with skills/, agent must invoke)
- Streaming-cancellation canary (mid-stream client disconnect)
- Cross-runtime A2A peer call (currently covered by `e2e-peer-visibility`)

## Why a thin Python simulator and not the real `workspace-server`?

`workspace-server` is a 60+ MB Go binary that requires Postgres, Redis,
admin-token wiring, registry plumbing, and a 30+ second cold-boot. None
of that touches session-continuity behaviour, which is fully owned by
the runtime container's `a2a_executor.py`. Per CTO directive "separate
CI as possible" + the <3 min target, we excise the platform-tenant Go
service from the loop and emit identical wire-shape envelopes from a
single Python file.

If the simulator diverges from `workspace-server` wire shape, the gate
goes red; fix the simulator to match production. The wire shape is
asserted in `tests/e2e/test_chat_attachments_e2e.sh` and the runtime's
`workspace/a2a_executor.py:_core_execute`.

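
One way to spot such divergence early is to compare the structure (keys and value types, not values) of a captured production envelope against the simulator's. The sketch below is illustrative only: `shape_diff` is a hypothetical helper that is not part of the harness, and the sample envelopes are made up.

```python
from typing import Any


def shape_diff(expected: Any, actual: Any, path: str = "$") -> list[str]:
    """List the paths where two JSON values differ in structure (values are ignored)."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs: list[str] = []
        for key in sorted(set(expected) | set(actual)):
            if key not in actual:
                diffs.append(f"{path}.{key}: missing in actual")
            elif key not in expected:
                diffs.append(f"{path}.{key}: unexpected in actual")
            else:
                diffs.extend(shape_diff(expected[key], actual[key], f"{path}.{key}"))
        return diffs
    if isinstance(expected, list) and isinstance(actual, list):
        diffs = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            diffs.extend(shape_diff(e, a, f"{path}[{i}]"))
        if len(expected) != len(actual):
            diffs.append(f"{path}: length {len(expected)} != {len(actual)}")
        return diffs
    if type(expected) is not type(actual):
        return [f"{path}: {type(expected).__name__} != {type(actual).__name__}"]
    return []


# Made-up samples: the "simulated" envelope has lost its contextId.
production = {
    "jsonrpc": "2.0",
    "method": "message/send",
    "params": {"message": {"contextId": "a", "parts": [{"kind": "text", "text": "hi"}]}},
}
simulated = {
    "jsonrpc": "2.0",
    "method": "message/send",
    "params": {"message": {"parts": [{"kind": "text", "text": "hello"}]}},
}

drift = shape_diff(production, simulated)
print(drift)
```

Because only structure is compared, differing message ids or text do not register as drift, while a dropped or renamed field does.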
@@ -0,0 +1,19 @@
# Python tenant-CP simulator + canary test driver.
# Single image: pytest + httpx + the canary tests baked in.
FROM python:3.11-slim@sha256:e78299e55776ca065dcb769f80161f48465ad352014240eb5fe4712e22505e9b

WORKDIR /harness

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Test files are bind-mounted by docker-compose at run time so a `pytest -x`
# rerun loop doesn't require a rebuild. The COPY here is for the
# self-contained image used by Gitea Actions (where bind mounts are awkward).
COPY cp_sim.py /harness/cp_sim.py
COPY canary /harness/canary

ENV PYTHONUNBUFFERED=1

# Default: run everything under canary/ (the 4 canaries plus the layer
# diagnostics) with verbose output + JUnit XML for CI.
CMD ["pytest", "-v", "--tb=short", "--junitxml=/harness/artifacts/junit.xml", "canary/"]
@@ -0,0 +1,31 @@
"""Shared pytest fixtures for the canary suite."""

from __future__ import annotations

import os
import sys
import uuid

# cp_sim.py lives one dir up; make it importable without packaging.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import pytest  # noqa: E402

from cp_sim import CPSim, CPSimConfig  # noqa: E402


@pytest.fixture
def sim() -> CPSim:
    """Fresh CPSim per test: cheap, isolates connection state."""
    return CPSim(
        cfg=CPSimConfig(
            runtime_url=os.environ.get("RUNTIME_URL", "http://localhost:18000"),
        )
    )


@pytest.fixture
def context_id() -> str:
    """A unique canvas-thread-id per test. Guarantees SessionStore isolation
    between scenarios so a failing canary doesn't poison the next one."""
    return f"canary-ctx-{uuid.uuid4().hex[:12]}"
@@ -0,0 +1,80 @@
"""Layer-isolation diagnostics; runs alongside the 4 canaries.

These probes are not strict pass/fail gates by themselves; they exist so
when a canary fails, the artifacts include enough state to tell whether
the regression is in the wire-shape layer, the SessionStore layer, or
the memory layer. Probes of optional surfaces (currently the MCP memory
probe) pass by returning early when the surface is unavailable on the
runtime under test, because different templates expose different debug
endpoints.

Cross-refs:
  - feedback_verify_actual_endstate_not_ack_follow_sop: we read state
    back, not the side-effect ack.
  - feedback_image_promote_is_not_user_live: the verification is at
    the running-container layer.
"""

from __future__ import annotations

import os
import uuid

import httpx

from cp_sim import CPSim


def test_diag_agent_card_advertises_a2a(sim: CPSim) -> None:
    """The runtime's /agent-card must be served as a JSON object.

    If this fails, the canaries' transport assumption (POST /a2a) is
    already broken; diagnose the runtime image, not the canary.
    """
    url = f"{sim.cfg.runtime_url}/agent-card"
    r = httpx.get(url, timeout=10.0)
    assert r.status_code == 200, (
        f"/agent-card returned {r.status_code}: {r.text[:300]!r}"
    )
    body = r.json()
    # The card must at least be a JSON object.
    assert isinstance(body, dict), f"/agent-card body not an object: {body!r}"
    # We don't require any specific capability flag, since different
    # templates advertise different sets. The point of this diag is "is the
    # card there at all", which signals the runtime booted past entrypoint.


def test_diag_context_id_required_for_continuity(sim: CPSim) -> None:
    """Same context_id in two turns must not crash the runtime.

    Pure smoke probe: proves the executor accepts a continuation
    message without 5xx-ing. The substantive assertion is canary 1; this
    one just guarantees the path is reachable.
    """
    ctx = f"diag-{uuid.uuid4().hex[:8]}"
    r1 = sim.send_text("ping", context_id=ctx)
    # `result` may be absent or null on an error envelope; guard both cases.
    task_id = (r1.get("result") or {}).get("id")
    r2 = sim.send_text("ping again", context_id=ctx, task_id=task_id)
    # Neither reply may carry a JSON-RPC error.
    for label, env in (("turn1", r1), ("turn2", r2)):
        assert "error" not in env, f"{label} returned JSON-RPC error: {env['error']}"


def test_diag_memory_root_writable_in_canary_mode(sim: CPSim) -> None:
    """When MOLECULE_CANARY_MODE=1, the memory read path must not crash.

    Probes via the recall_memory MCP tool. If /mcp is not exposed, the
    test returns early (skip-style; we still pass because some templates
    proxy MCP elsewhere).
    """
    # We can't write directly here; we only confirm the read path doesn't
    # 500 on a missing key. A real write happens in canary 4.
    key = f"canary-probe-{uuid.uuid4().hex[:8]}"
    try:
        val = sim.probe_memory(key)
    except Exception:
        # /mcp may not be exposed on this template; canary 4 will
        # surface the real defect if memory is actually broken.
        if os.environ.get("CANARY_STRICT_MCP") == "1":
            raise
        return
    # Unknown key → None is fine. The point is the call didn't crash.
    assert val is None or isinstance(val, str)
@@ -0,0 +1,204 @@
"""The 4 canonical session-continuity canaries (task #342, RFC#600 class).

These tests speak A2A directly to the runtime under test. They are the
authoritative gate that the runtime preserves conversation continuity,
handles file-only messages without dropping to the empty-prompt error,
addresses multimodal prompts, and persists memory across sessions.

Wire-shape source of truth: see ../cp_sim.py docstring.
"""

from __future__ import annotations

import re
import uuid

from cp_sim import CPSim


# ---------- canary 1: 2-turn name continuity -------------------------------


def test_canary_1_two_turn_name_continuity(sim: CPSim, context_id: str) -> None:
    """SessionStore continuity: turn 2 must recall the name from turn 1.

    Empirically tests:
      - ``a2a_executor._core_execute`` injects prior-turn history via
        ``_extract_history(context)`` (workspace/a2a_executor.py:313).
      - The runtime's session store is keyed on ``context_id`` (canvas
        thread id) NOT ``task_id``: task_id is per-turn, context_id is
        per-conversation. Regressions to that key derivation were the
        root cause of the 2026-05 multi-turn-amnesia incidents
        (#a60623344 diagnosis).
    """
    # Turn 1: establish the fact.
    r1 = sim.send_text(
        "Hi, my name is Hongming.",
        context_id=context_id,
    )
    reply1 = sim.extract_text_parts(r1)
    assert reply1, f"Turn 1 produced empty reply. envelope={r1!r}"

    # Turn 2: ask back. Same context_id → same SessionStore key.
    r2 = sim.send_text(
        "What's my name?",
        context_id=context_id,
    )
    reply2 = sim.extract_text_parts(r2)
    assert reply2, f"Turn 2 produced empty reply. envelope={r2!r}"

    # Whole-word match, case-insensitive. Agents may reply
    # "Your name is Hongming." or "It's Hongming!" or similar.
    assert re.search(r"\bhongming\b", reply2, flags=re.IGNORECASE), (
        f"Turn 2 reply does not contain 'Hongming': SessionStore "
        f"continuity regression suspected. context_id={context_id} "
        f"turn1_reply={reply1[:200]!r} turn2_reply={reply2[:400]!r}"
    )


# ---------- canary 2: file-only message (no caption) -----------------------


_DROPPED_TURN_MARKERS = (
    "(empty prompt — nothing to do)",
    "empty prompt",
    "message contained no text content",
    "no text content",
)


def test_canary_2_file_only_message(sim: CPSim, context_id: str) -> None:
    """File-attached A2A message with NO text part must not be dropped.

    Root cause this guards against: a long-standing executor bug where
    ``extract_message_text`` returned "" for file-only messages and the
    executor short-circuited with the "Error: message contained no text
    content." reply, even though the attached file was the entire point
    of the turn.

    Hard assertions:
      - Reply is non-empty AND not the dropped-turn marker.
      - Reply references the file by name OR asks an actionable
        clarifying question (NOT a flat error).
    """
    file_name = f"canary-{uuid.uuid4().hex[:8]}.txt"
    file_body = b"Project status: nominal. Lighthouse score 98."

    r = sim.send_with_file(
        context_id=context_id,
        text=None,  # ← THE CANARY: no caption.
        file_name=file_name,
        file_bytes=file_body,
        mime_type="text/plain",
    )
    reply = sim.extract_text_parts(r)
    assert reply, f"File-only message produced empty reply. envelope={r!r}"

    low = reply.lower()
    for marker in _DROPPED_TURN_MARKERS:
        assert marker.lower() not in low, (
            f"File-only message was dropped: reply contains "
            f"{marker!r}. Full reply: {reply[:500]!r}"
        )

    # Engagement assertion: the reply must engage with the file (reference
    # its name, or mention a file/attachment) OR ask a question. The check
    # is a deliberately loose heuristic: any reply containing "?" passes,
    # so it will NOT catch a generic "Hello! How can I help?" reply.
    name_referenced = file_name.lower() in low or "file" in low or "attach" in low
    asks_clarification = (
        "what" in low or "would you like" in low or "?" in reply
    )
    assert name_referenced or asks_clarification, (
        f"File-only reply neither references the file nor asks a "
        f"clarifying question. Reply: {reply[:500]!r}"
    )


# ---------- canary 3: file + prompt (multimodal) ---------------------------


def test_canary_3_file_with_prompt(sim: CPSim, context_id: str) -> None:
    """File-attached A2A message WITH a caption: multimodal happy path.

    Lower bar than canary 2: assert the agent acknowledges the file was
    received and tries to address the caption. We deliberately don't
    require a perfect summary because canary mode replies are canned;
    the goal is to prove the executor's multimodal code path doesn't
    drop EITHER the file OR the caption.
    """
    file_name = f"canary-doc-{uuid.uuid4().hex[:8]}.txt"
    file_body = (
        b"Quarterly review. Revenue up 14%. Churn down 3%. "
        b"Team headcount steady. Action: ship RFC#600 by end of week."
    )
    r = sim.send_with_file(
        context_id=context_id,
        text="summarize this",
        file_name=file_name,
        file_bytes=file_body,
        mime_type="text/plain",
    )
    reply = sim.extract_text_parts(r)
    assert reply, f"File+prompt produced empty reply. envelope={r!r}"

    low = reply.lower()
    for marker in _DROPPED_TURN_MARKERS:
        assert marker.lower() not in low, (
            f"File+prompt was dropped: reply contains {marker!r}. "
            f"Full reply: {reply[:500]!r}"
        )

    # At minimum: the reply must mention file/attach/summary semantics,
    # demonstrating the executor accepted both parts.
    engaged = any(
        kw in low for kw in ("file", "attach", "summary", "summarize", "content", file_name.lower())
    )
    assert engaged, (
        f"Multimodal reply doesn't engage with attached file or caption. "
        f"Reply: {reply[:500]!r}"
    )


# ---------- canary 4: cross-session memory recall --------------------------


def test_canary_4_cross_session_memory_recall(sim: CPSim) -> None:
    """Memory persists across distinct context_ids → memory layer (NOT
    SessionStore) is the storage.

    Two distinct context_ids in this test, so SessionStore CANNOT bridge
    them. The bridge is the runtime's persistent memory (MOLECULE_MEMORY_ROOT
    in canary mode). If the recall returns "blue" in session 2, the
    memory layer is wired correctly.

    Note: we ask the agent to commit the memory explicitly in session 1
    so that the canary doesn't depend on memory auto-extraction
    heuristics (which vary by runtime). The commit goes through the
    same MCP tool the canvas would invoke.
    """
    ctx_a = f"canary-ctx-{uuid.uuid4().hex[:12]}"
    ctx_b = f"canary-ctx-{uuid.uuid4().hex[:12]}"

    # Session 1: commit a fact via the memory tool. Use the explicit
    # "remember" verb so canary-mode agents (which short-circuit to a
    # deterministic tool-call) reliably invoke `commit_memory`.
    r1 = sim.send_text(
        "Please use the memory tool to remember: my favorite color is blue.",
        context_id=ctx_a,
    )
    reply1 = sim.extract_text_parts(r1)
    assert reply1, f"Session 1 produced empty reply. envelope={r1!r}"

    # Session 2: different context_id. Same workspace, same memory.
    r2 = sim.send_text(
        "Use the memory tool to recall my favorite color, then tell me what it is.",
        context_id=ctx_b,
    )
    reply2 = sim.extract_text_parts(r2)
    assert reply2, f"Session 2 produced empty reply. envelope={r2!r}"

    assert re.search(r"\bblue\b", reply2, flags=re.IGNORECASE), (
        f"Session 2 reply does not contain 'blue': cross-session memory "
        f"recall regression suspected. ctx_a={ctx_a} ctx_b={ctx_b} "
        f"session1_reply={reply1[:200]!r} session2_reply={reply2[:400]!r}"
    )
@@ -0,0 +1,214 @@
"""Tenant control-plane simulator.

Emits the byte-identical JSON-RPC `message/send` wire shape that the
production `workspace-server` POSTs to the runtime's :8000. See
``workspace-server/internal/handlers/a2a.go`` and the canonical sample
in ``tests/e2e/test_chat_attachments_e2e.sh``.

This file is purposefully small (~200 LoC). It is NOT a re-implementation
of `workspace-server`; it is just the minimum surface required to drive
the 4 session-continuity canaries.

If the runtime asserts on a header / envelope field that the production
platform sets but this simulator omits, FIX THE SIMULATOR. Never weaken
the runtime to accept divergent wire shapes. The simulator is the
canonical contract emitter for canary purposes
(``feedback_no_single_source_of_truth``).
"""

from __future__ import annotations

import base64
import json
import os
import uuid
from dataclasses import dataclass
from typing import Any

import httpx


@dataclass
class CPSimConfig:
    runtime_url: str
    """Base URL of the runtime under test (e.g. http://runtime:8000)."""
    request_timeout_s: float = 60.0
    """Per-A2A-call timeout. Generous: canary mode replies are fast,
    but a real Provider-backed runtime under cold cache can take 30+s."""


class CPSim:
    """Thin client matching workspace-server's wire shape."""

    def __init__(self, cfg: CPSimConfig | None = None) -> None:
        self.cfg = cfg or CPSimConfig(
            runtime_url=os.environ.get("RUNTIME_URL", "http://localhost:18000"),
        )
        self._client = httpx.Client(timeout=self.cfg.request_timeout_s)

    # ------------------------------------------------------------------ A2A

    def send_text(
        self,
        text: str,
        *,
        context_id: str,
        task_id: str | None = None,
    ) -> dict[str, Any]:
        """POST a text-only A2A message. Returns the JSON-RPC envelope."""
        msg_id = f"canary-{uuid.uuid4().hex[:12]}"
        payload = {
            "jsonrpc": "2.0",
            "id": msg_id,
            "method": "message/send",
            "params": {
                "message": {
                    "role": "user",
                    "messageId": msg_id,
                    "kind": "message",
                    "contextId": context_id,
                    "taskId": task_id,
                    "parts": [{"kind": "text", "text": text}],
                },
                "configuration": {
                    "acceptedOutputModes": ["text/plain"],
                    "blocking": True,
                },
            },
        }
        return self._post(payload)

    def send_with_file(
        self,
        *,
        context_id: str,
        text: str | None,
        file_name: str,
        file_bytes: bytes,
        mime_type: str = "text/plain",
        task_id: str | None = None,
    ) -> dict[str, Any]:
        """POST an A2A message with an inline file part.

        Uses the inline `bytes` form of A2A file parts (RFC#600: the
        no-URI variant added precisely so canary tests don't need a
        `/chat/uploads` round-trip). Each runtime's executor calls
        ``extract_attached_files`` which handles both forms (verified
        in ``workspace/executor_helpers.py:903``).
        """
        msg_id = f"canary-{uuid.uuid4().hex[:12]}"
        parts: list[dict[str, Any]] = []
        if text:
            parts.append({"kind": "text", "text": text})
        parts.append(
            {
                "kind": "file",
                "file": {
                    "name": file_name,
                    "mimeType": mime_type,
                    "bytes": base64.b64encode(file_bytes).decode("ascii"),
                },
            }
        )
        payload = {
            "jsonrpc": "2.0",
            "id": msg_id,
            "method": "message/send",
            "params": {
                "message": {
                    "role": "user",
                    "messageId": msg_id,
                    "kind": "message",
                    "contextId": context_id,
                    "taskId": task_id,
                    "parts": parts,
                },
                "configuration": {
                    "acceptedOutputModes": ["text/plain"],
                    "blocking": True,
                },
            },
        }
        return self._post(payload)

    # ------------------------------------------------------------ helpers

    def _post(self, payload: dict[str, Any]) -> dict[str, Any]:
        url = f"{self.cfg.runtime_url}/a2a"
        try:
            r = self._client.post(url, json=payload)
        except httpx.HTTPError as e:
            raise CPSimError(f"A2A POST failed: {e}") from e
        if r.status_code != 200:
            raise CPSimError(
                f"A2A non-200: status={r.status_code} body={r.text[:500]}"
            )
        try:
            return r.json()
        except json.JSONDecodeError as e:
            raise CPSimError(f"A2A body not JSON: {r.text[:500]}") from e

    @staticmethod
    def extract_text_parts(envelope: dict[str, Any]) -> str:
        """Return concatenated text from all text parts of a reply.

        Handles both top-level `result.parts` (the canonical shape) and
        `result.artifacts[*].parts` (which some runtimes emit when the
        reply was streamed as artifact chunks). Matches the extractor in
        ``tests/e2e/test_chat_attachments_e2e.sh``.
        """
        result = envelope.get("result") or {}
        chunks: list[str] = []
        for p in result.get("parts", []) or []:
            if p.get("kind") == "text":
                chunks.append(p.get("text", ""))
        for art in result.get("artifacts", []) or []:
            for p in art.get("parts", []) or []:
                if p.get("kind") == "text":
                    chunks.append(p.get("text", ""))
        # Some runtimes return a status.message instead of/in addition to parts.
        status = result.get("status") or {}
        status_msg = status.get("message") or {}
        for p in status_msg.get("parts", []) or []:
            if p.get("kind") == "text":
                chunks.append(p.get("text", ""))
        return "\n".join(chunks).strip()

    # ----------------------------------------------------- memory probe

    def probe_memory(self, key: str) -> str | None:
        """Read a memory value via the runtime's MCP memory tool.

        Uses the same MCP transport the canvas uses
        (``POST /workspaces/:id/mcp``-shaped JSON-RPC over /mcp). Returns
        the recalled string, or None if the key is missing or /mcp
        answers with a non-200 status.
        """
        payload = {
            "jsonrpc": "2.0",
            "id": f"canary-mem-{uuid.uuid4().hex[:8]}",
            "method": "tools/call",
            "params": {"name": "recall_memory", "arguments": {"key": key}},
        }
        try:
            r = self._client.post(f"{self.cfg.runtime_url}/mcp", json=payload)
        except httpx.HTTPError as e:
            raise CPSimError(f"MCP POST failed: {e}") from e
        if r.status_code != 200:
            return None
        body = r.json()
        result = body.get("result") or {}
        # MCP responses wrap the tool output in result.content[*].text per
        # the JSON-RPC tools/call contract.
        for c in result.get("content", []) or []:
            if c.get("type") == "text":
                return c.get("text")
        return None


class CPSimError(RuntimeError):
    """Raised on transport / envelope failures (NOT canary assertion failures).

    Distinct from AssertionError so pytest reports them as ERROR not
    FAILED. A transport-layer fault should be debugged differently from
    a real session-continuity regression.
    """
@@ -0,0 +1,5 @@
# Pinned (not floating) so the harness is reproducible across CI runs.
# These versions match what tests/e2e/_lib.sh and tests/e2e/conftest.py use.
httpx==0.27.2
pytest==8.3.3
pytest-asyncio==0.24.0
@@ -0,0 +1,58 @@
# local-e2e/docker-compose.yml: minimal harness stack.
#
# Two services:
#   runtime - the template image under test (TEMPLATE_IMAGE env var).
#             Exposes :8000 for A2A traffic. The simulator POSTs to it.
#   cp_sim  - thin Python tenant-CP simulator. Drives the canary turns.
#
# Deliberately NO postgres, NO redis, NO platform Go service. SessionStore
# continuity is a runtime-internal concern (a2a_executor + executor_helpers);
# we test it without dragging the platform-tenant Go binary into the loop.
# See README.md "Why a thin Python simulator" for rationale.

services:
  runtime:
    image: ${TEMPLATE_IMAGE:?TEMPLATE_IMAGE env required, e.g. ghcr.io/molecule-ai/workspace-template-hermes:latest}
    # The runtime entrypoint (workspace/entrypoint.sh) refuses to start when
    # any operator-scope env var is present. We deliberately set no creds,
    # since the canary doesn't invoke a real LLM provider (see
    # MOLECULE_CANARY_MODE below).
    environment:
      # Disable provider calls during canary: the runtime returns canned
      # echo-style replies so the harness can assert continuity / file-handling
      # behaviour without burning provider quota. The template image must
      # honour MOLECULE_CANARY_MODE=1 (added in molecule-ai-workspace-runtime
      # PR #46; see molecule_runtime/a2a_executor.py canary short-circuit).
      MOLECULE_CANARY_MODE: "1"
      # Anonymous workspace identity so RBAC paths exercise the same code
      # they would in tenant production.
      WORKSPACE_ID: "canary-${CANARY_RUN_ID:-local}"
      # Memory tool requires a writable scope; point at /tmp inside the
      # container so cross-session canary (#4) works without bind mounts.
      MOLECULE_MEMORY_ROOT: "/tmp/canary-memory"
      # The provisioner's forbidden-env guard exits non-zero when any
      # operator-scope literal is present; the canary intentionally sets
      # zero of them. Leave guard ON (do NOT set MOLECULE_TENANT_GUARD_DISABLE)
      # so we exercise the prod entrypoint code path verbatim.
    ports:
      - "${RUNTIME_PORT:-18000}:8000"
    healthcheck:
      # /agent-card is the universal A2A discovery endpoint; every template
      # exposes it. /health varies per template.
      test: ["CMD-SHELL", "wget -qO /dev/null --tries=1 http://localhost:8000/agent-card || exit 1"]
      interval: 3s
      timeout: 3s
      retries: 20
      start_period: 30s

  cp_sim:
    build:
      context: ./cp_sim
    depends_on:
      runtime:
        condition: service_healthy
    environment:
      RUNTIME_URL: "http://runtime:8000"
      CANARY_RUN_ID: "${CANARY_RUN_ID:-local}"
    # cp_sim doesn't expose a port; it's a one-shot driver invoked by
    # run-canary.sh via `docker compose run cp_sim pytest ...`.
    profiles: ["driver"]
Executable
+68
@@ -0,0 +1,68 @@
|
||||
#!/usr/bin/env bash
|
||||
# onboard-template.sh — gitops helper to wire local-e2e into a new template.
|
||||
#
|
||||
# Drops .gitea/workflows/session-continuity-e2e.yml into the target template
|
||||
# repo (a thin shim that clones molecule-core's local-e2e harness, then runs
|
||||
# run-canary.sh against the locally-built template image). Opens a PR.
|
||||
#
|
||||
# Usage:
|
||||
# ./local-e2e/scripts/onboard-template.sh molecule-ai-workspace-template-claude-code
|
||||
#
|
||||
# Per task #342 sequencing: do NOT run this for every template at once.
|
||||
# Bake the gate on hermes for ≥5 business days first; expand only after
|
||||
# the canary is empirically stable.
|
||||
#
|
||||
# Cross-refs:
|
||||
# feedback_no_single_source_of_truth — the workflow content is identical
|
||||
# across templates; this helper guarantees it.
|
||||
# feedback_image_promote_is_not_user_live — we wire the gate at the
|
||||
# CI layer; flipping it to REQUIRED in branch_protection is a
|
||||
# separate step (see README.md).
|
||||
|
||||
set -euo pipefail

REPO="${1:?usage: onboard-template.sh <template-repo-name>}"
HARNESS_ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"

# Sanity: ensure the template-side workflow file exists in this repo.
TEMPLATE_WORKFLOW="$HARNESS_ROOT/templates/session-continuity-e2e.yml"
[ -f "$TEMPLATE_WORKFLOW" ] || {
  # Errors go to stderr so callers capturing stdout are not polluted.
  echo "ERROR: $TEMPLATE_WORKFLOW not found in this harness checkout" >&2
  exit 1
}

WORK_DIR=$(mktemp -d -t e2e-onboard-XXXXXX)
trap 'rm -rf "$WORK_DIR"' EXIT

cd "$WORK_DIR"

# Use mol_clone — preserves the persona credential model.
# shellcheck disable=SC1090
source "$HOME/.molecule-ai/ops.sh"
mol_clone "$REPO"
cd "$REPO"

git checkout -b "task342/session-continuity-e2e-gate"

mkdir -p .gitea/workflows
cp "$TEMPLATE_WORKFLOW" .gitea/workflows/session-continuity-e2e.yml

git add .gitea/workflows/session-continuity-e2e.yml
git commit -m "ci: add local-e2e session-continuity canary gate (task #342)

Wires this template into the cross-template session-continuity harness
in molecule-ai/molecule-core/local-e2e/. The gate boots THIS repo's
locally-built image, drives 4 canonical canaries (2-turn name continuity,
file-only message, file+prompt, cross-session memory recall), and fails
PRs that regress any of them.

Per CTO directive: required-context flip in branch_protection is a
SEPARATE step after 5 business days of bake."

# Push branch; do not auto-open PR — leave that to the operator so the
# review-relay routing follows the same rules as a normal change.
git push -u origin "task342/session-continuity-e2e-gate"

echo
echo "DONE. Branch pushed to $REPO. Open PR manually:"
echo "  https://git.moleculesai.app/molecule-ai/$REPO/compare/main...task342/session-continuity-e2e-gate"
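The header cross-reference says this helper guarantees identical workflow content across templates. One way to verify that after onboarding is a byte-for-byte comparison between the canonical copy and a template's copy. The sketch below assumes both files are checked out locally; `workflow_in_sync` is a hypothetical helper name and is not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical drift check (not part of onboard-template.sh).
# Returns 0 when the template-side workflow is byte-identical to the
# canonical copy, 1 when it differs, 2 when either file is missing.
workflow_in_sync() {
  local canonical="$1" template_copy="$2"
  if [ ! -f "$canonical" ] || [ ! -f "$template_copy" ]; then
    echo "missing file: $canonical or $template_copy" >&2
    return 2
  fi
  # cmp -s prints nothing and exits 1 as soon as the files differ.
  cmp -s "$canonical" "$template_copy"
}
```

A possible use: run it against `local-e2e/templates/session-continuity-e2e.yml` and a template checkout's `.gitea/workflows/session-continuity-e2e.yml` before deciding whether the helper needs to be re-run.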
Executable
+105
@@ -0,0 +1,105 @@
#!/usr/bin/env bash
# run-canary.sh — one-shot orchestration for the local-e2e session-continuity
# canary harness. Used by both interactive local runs and the per-template
# .gitea/workflows/session-continuity-e2e.yml.
#
# Usage:
#   TEMPLATE_IMAGE=ghcr.io/molecule-ai/workspace-template-hermes:latest \
#     ./local-e2e/scripts/run-canary.sh
#
# Optional env:
#   CANARY_RUN_ID   — disambiguator for parallel CI runs (default: random)
#   RUNTIME_PORT    — host port for runtime :8000 (default: 18000)
#   KEEP_RUNNING    — set =1 to leave containers up for post-mortem
#
# Exit codes:
#   0 — all 4 canaries passed
#   1 — at least one canary failed (artifacts/ has the dump)
#   2 — harness infrastructure failure (image pull / compose / etc.)
#
# Cross-refs:
#   feedback_image_promote_is_not_user_live — we verify at the running
#     container layer, NOT at the pipeline-green layer.
#   feedback_verify_actual_endstate_not_ack_follow_sop — every assert
#     reads state back; no side-effect-ack claims success.

set -euo pipefail

: "${TEMPLATE_IMAGE:?TEMPLATE_IMAGE env required (the runtime image under test)}"

# ----------------------------------------------------------------- paths
HARNESS_ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"
ARTIFACTS_DIR="$HARNESS_ROOT/artifacts"
mkdir -p "$ARTIFACTS_DIR"

# The `|| date +%s` fallback only fires because pipefail is set: without it,
# a missing uuidgen would be masked by the exit status of `cut`.
export CANARY_RUN_ID="${CANARY_RUN_ID:-$(uuidgen 2>/dev/null | tr 'A-Z' 'a-z' | tr -d - | cut -c1-12 || date +%s)}"
export RUNTIME_PORT="${RUNTIME_PORT:-18000}"
export TEMPLATE_IMAGE
COMPOSE_PROJECT="canary-${CANARY_RUN_ID}"
COMPOSE_FILE="$HARNESS_ROOT/docker-compose.yml"

log() { printf "\n=== [%s] %s ===\n" "$(date +%H:%M:%S)" "$*"; }

# ----------------------------------------------------------- cleanup hook
cleanup() {
  local rc=$?
  if [ "${KEEP_RUNNING:-0}" = "1" ]; then
    log "KEEP_RUNNING=1 — leaving containers up (project=$COMPOSE_PROJECT)"
    return "$rc"
  fi
  log "Tearing down compose project $COMPOSE_PROJECT"
  # On non-zero exit, capture logs FIRST. Per feedback_image_promote_is_
  # not_user_live: dump state from the actually-running container, not
  # an inferred pipeline state.
  if [ "$rc" -ne 0 ]; then
    log "Canary FAILED — dumping artifacts to $ARTIFACTS_DIR"
    docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" logs \
      --no-color --tail=200 runtime \
      > "$ARTIFACTS_DIR/runtime.log" 2>&1 || true
    # SessionStore state probe: read session state straight from the
    # container's filesystem (the canary memory dir plus any session*.json
    # under /tmp). If neither exists, the output file is left empty. The
    # runtime's /admin/session-store endpoint (canary mode) is not queried here.
    docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" exec -T runtime \
      sh -c 'ls -la /tmp/canary-memory 2>/dev/null; find /tmp -name "session*.json" -exec cat {} \; 2>/dev/null' \
      > "$ARTIFACTS_DIR/session-store.txt" 2>&1 || true
  fi
  docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" down --volumes --remove-orphans >/dev/null 2>&1 || true
  return "$rc"
}
trap cleanup EXIT

# ------------------------------------------------------ stack bring-up
# Build and start failures are infrastructure failures, so they exit 2
# (see "Exit codes" in the header) rather than whatever docker returns.
log "Building cp_sim image"
docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" build cp_sim || exit 2

# The pull is best-effort (`|| true`): an image built locally, as in the CI
# workflow, has no registry copy to pull. A genuinely missing image surfaces
# at `up` below.
log "Pulling runtime image: $TEMPLATE_IMAGE"
docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" pull runtime 2>&1 \
  | tail -5 || true

log "Starting runtime (host port $RUNTIME_PORT)"
docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" up -d runtime || exit 2

# Wait for healthcheck — docker-compose `--wait` is the canonical mechanism
# (introduced in v2.1.1 in 2021, available on every supported runner pool).
log "Waiting for runtime healthcheck"
if ! docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" up -d --wait runtime; then
  log "Runtime never went healthy — dumping logs"
  docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" logs --no-color --tail=200 runtime \
    > "$ARTIFACTS_DIR/runtime-boot-failure.log" 2>&1 || true
  exit 2
fi

# -------------------------------------------------------------- run tests
log "Running canary suite"
# Run cp_sim under the same compose project so DNS (runtime hostname)
# resolves on the molecule-core-net bridge. --rm cleans the driver container
# after pytest exits; volume bind mounts pytest's junit-xml back to host.
if docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" --profile driver run \
    --rm \
    -v "$ARTIFACTS_DIR:/harness/artifacts" \
    cp_sim; then
  log "All canaries PASSED"
  exit 0
else
  log "At least one canary FAILED — see $ARTIFACTS_DIR/junit.xml"
  exit 1
fi
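The 0 / 1 / 2 exit-code contract depends on the cleanup hook not masking the status the script was already exiting with. A minimal sketch of that pattern follows, with the Docker teardown replaced by an `echo` so it runs anywhere. `run_with_cleanup` is a hypothetical wrapper used only to make the behaviour observable; it is not part of the harness.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the shape of cleanup() in run-canary.sh, with the
# log capture and `compose down` replaced by an echo.
run_with_cleanup() (
  # The parenthesised body runs in a subshell, so the EXIT trap and the
  # exit call below do not affect the calling shell.
  cleanup() {
    local rc=$?              # status the subshell is exiting with
    echo "cleanup rc=$rc"    # stand-in for log capture and teardown
    return "$rc"
  }
  trap cleanup EXIT
  exit "$1"
)
```

Calling `run_with_cleanup 2` runs the trap and still reports status 2 to the caller, which is the property the CI step relies on to tell a canary failure from an infrastructure failure.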
@@ -0,0 +1,85 @@
name: session-continuity-e2e

# Per-template wrapper for the molecule-core/local-e2e canary harness.
# DO NOT EDIT THIS FILE IN A TEMPLATE REPO — the canonical copy lives at
# molecule-ai/molecule-core:local-e2e/templates/session-continuity-e2e.yml
# (feedback_no_single_source_of_truth). The onboard-template.sh script
# copies it verbatim into each template; future fixes propagate via that
# helper, not by editing the template-side copy.
#
# What this workflow does:
#   1. Build THIS template's runtime image locally on the docker-host runner.
#   2. Clone molecule-core (canonical harness source).
#   3. Invoke local-e2e/scripts/run-canary.sh with TEMPLATE_IMAGE set to
#      the just-built local image.
#   4. Upload artifacts/ on failure for post-mortem.
#
# Required-context flip:
#   This workflow posts a status under the literal context name
#   "session-continuity-e2e (pull_request)" — Gitea's standard
#   <workflow-name> (<event>) format. To make it REQUIRED, add that
#   exact string to the template repo's branch_protection
#   status_check_contexts list. See README.md for the bake-period rule.
#
# Gitea 1.22.6 / act_runner notes (cross-refs to known footguns):
#   - No cross-repo `uses:` (feedback_gitea_cross_repo_uses_blocked) —
#     we clone molecule-core via plain git instead.
#   - Per-SHA concurrency (feedback_concurrency_group_per_sha).
#   - Workflow-level GITHUB_SERVER_URL pinned to the Gitea host
#     (feedback_act_runner_github_server_url).
#   - Runs on docker-host pool — NOT the heavy CI pool — per CTO
#     directive "separate CI as possible" and the <3 min target.

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

concurrency:
  group: session-continuity-e2e-${{ github.workflow }}-${{ github.event_name }}-${{ github.event.pull_request.head.sha || github.sha }}
  cancel-in-progress: true

env:
  GITHUB_SERVER_URL: https://git.moleculesai.app

jobs:
  session-continuity-e2e:
    runs-on: docker-host
    timeout-minutes: 8
    steps:
      - name: Checkout template
        uses: actions/checkout@v4
        with:
          path: template

      - name: Build template image
        id: build
        working-directory: template
        run: |
          IMAGE_TAG="local-e2e-${GITHUB_SHA::12}"
          docker build -t "molecule-ai/template-under-test:${IMAGE_TAG}" .
          echo "image=molecule-ai/template-under-test:${IMAGE_TAG}" >> "$GITHUB_OUTPUT"

      - name: Clone harness from molecule-core
        run: |
          # Anonymous clone — molecule-core is internal-readable. NEVER bake
          # an auth token into the URL (feedback_credentials_in_git_url).
          git clone --depth 1 "${GITHUB_SERVER_URL}/molecule-ai/molecule-core.git" harness

      - name: Run canary
        env:
          TEMPLATE_IMAGE: ${{ steps.build.outputs.image }}
          CANARY_RUN_ID: ${{ github.run_id }}-${{ github.run_attempt }}
        run: |
          cd harness
          ./local-e2e/scripts/run-canary.sh

      - name: Upload artifacts on failure
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: session-continuity-canary-${{ github.run_id }}
          path: harness/local-e2e/artifacts/
          if-no-files-found: warn
          retention-days: 7
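Steps 1 to 3 of the workflow can be reproduced outside CI. The sketch below only prints the commands rather than running them, so it can be read or dry-run without Docker. `image_ref` and `print_local_plan` are hypothetical helper names that do not exist in this PR; the tag scheme and clone path are the ones the workflow itself uses, and the printed plan is a simplification (the workflow builds inside the `template` checkout and runs the canary from inside `harness`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the PR.

# Image reference the "Build template image" step produces for a commit SHA
# (first 12 characters of the SHA, as in ${GITHUB_SHA::12}).
image_ref() {
  local sha="$1"
  printf 'molecule-ai/template-under-test:local-e2e-%s\n' "${sha:0:12}"
}

# Print (do not execute) the commands the workflow runs, in order.
print_local_plan() {
  local sha="$1" server_url="${2:-https://git.moleculesai.app}"
  local ref
  ref="$(image_ref "$sha")"
  printf 'docker build -t %s .\n' "$ref"
  printf 'git clone --depth 1 %s/molecule-ai/molecule-core.git harness\n' "$server_url"
  printf 'TEMPLATE_IMAGE=%s ./harness/local-e2e/scripts/run-canary.sh\n' "$ref"
}
```

For example, `print_local_plan "$(git rev-parse HEAD)"` from a template checkout lists the three commands to run by hand when debugging a red gate locally.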