diff --git a/local-e2e/README.md b/local-e2e/README.md
new file mode 100644
index 000000000..624a41831
--- /dev/null
+++ b/local-e2e/README.md
@@ -0,0 +1,104 @@
+# local-e2e — session-continuity canary harness
+
+Self-contained Docker-Compose harness that gates RFC#600-class template
+changes (session continuity, file-only messages, multimodal prompts,
+cross-session memory) **before** they reach customer canary.
+
+Per CTO standing directive "fully tested + separate CI": this is a
+dedicated, *fast* (target <3 min), *small-surface* harness that uses a
+Python tenant-CP simulator (not the full `workspace-server` Go service)
+to exercise the runtime image end-to-end against canonical canary turns.
+
+See [`feedback_no_single_source_of_truth`] — the harness IS the canonical
+session-continuity validator. Per-runtime unit tests still cover their
+own guard logic; the harness covers the live conversational behaviour
+that those unit tests cannot prove.
+
+See [`feedback_image_promote_is_not_user_live`] — every assertion reads
+state back from the *running container*, never from a publish-pipeline
+ack.
+
+## What it tests (the 4 canaries)
+
+| # | Scenario | Asserts |
+|---|----------|---------|
+| 1 | 2-turn name canary | turn 2 reply contains "Hongming" → SessionStore continuity |
+| 2 | File-only message (no caption) | NOT "(empty prompt — nothing to do)" + reply references filename or asks for clarification |
+| 3 | File + caption ("summarize this") | reply addresses attachment + caption |
+| 4 | Cross-session memory recall | new session pulls "blue" via memory tool |
+
+Each scenario re-uses the same A2A wire-shape that the production
+`workspace-server` POSTs to runtime `:8000` (canvas-thread-id semantics
+via `context_id`).
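
As a rough illustration of that wire shape, the sketch below builds a text-only `message/send` envelope with the same fields that `CPSim.send_text` in this PR emits. The helper name `build_text_envelope` and the context id `canary-ctx-demo` are hypothetical, and the `taskId` field (which the simulator sends as null when unset) is left out for brevity, so treat it as a sketch rather than the exact production payload.

```python
import json
import uuid


def build_text_envelope(text: str, context_id: str) -> dict:
    """Hypothetical helper mirroring the payload built by CPSim.send_text."""
    msg_id = f"canary-{uuid.uuid4().hex[:12]}"
    return {
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "messageId": msg_id,  # fresh per turn
                "kind": "message",
                "contextId": context_id,  # shared across turns of one conversation
                "parts": [{"kind": "text", "text": text}],
            },
            "configuration": {
                "acceptedOutputModes": ["text/plain"],
                "blocking": True,
            },
        },
    }


# Two turns of canary 1: same contextId, different messageId.
turn1 = build_text_envelope("Hi, my name is Hongming.", "canary-ctx-demo")
turn2 = build_text_envelope("What's my name?", "canary-ctx-demo")
print(json.dumps(turn1, indent=2))
```

Reusing one `contextId` across both turns is what lets the runtime's session store link them; canary 4 deliberately uses two different values so that only the memory layer can bridge the sessions.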
+
+## Architecture
+
+```
+local-e2e/
+  docker-compose.yml              # runtime under test + cp_sim
+  cp_sim/                         # ≈300 LoC Python A2A poster + file uploader
+    cp_sim.py
+    Dockerfile
+    requirements.txt
+    canary/
+      conftest.py
+      test_session_continuity.py  # 4 canary scenarios
+      test_layer_diagnostics.py   # SessionStore state probe + key derivation
+  scripts/
+    run-canary.sh                 # one-shot orchestration entrypoint
+```
+
+The CP simulator emits the **exact** JSON-RPC `message/send` envelope
+that `workspace-server` produces (verified against
+`tests/e2e/test_chat_attachments_e2e.sh`). No Go service is in the loop —
+this keeps the harness lean per the CTO directive.
+
+## Run locally
+
+```bash
+# from molecule-core repo root:
+export TEMPLATE_IMAGE=ghcr.io/molecule-ai/workspace-template-hermes:latest
+./local-e2e/scripts/run-canary.sh
+```
+
+Exit code 0 = all 4 canaries pass. Non-zero = at least one canary failed
+and the harness dumped SessionStore state + last 200 log lines from the
+runtime container into `./local-e2e/artifacts/`.
+
+## How it integrates into CI
+
+Each template repo's `.gitea/workflows/session-continuity-e2e.yml` calls
+`run-canary.sh` with its own freshly-built `TEMPLATE_IMAGE`. The
+template repo's Gitea branch-protection lists
+`session-continuity-e2e (pull_request)` as a required context.
+
+Rollout order (deliberate — per `feedback_image_promote_is_not_user_live`
+we bake before we cascade):
+
+1. `molecule-ai-workspace-template-hermes` — highest-traffic + most
+   recent RFC#600-class fixes — REQUIRED gate
+2. Bake for 5 business days
+3. Cascade to claude-code, langgraph, autogen, openclaw, smolagents,
+   google-adk (one PR per template — see `scripts/onboard-template.sh`)
+
+## Future extensions (out of scope for the initial PR)
+
+- Multi-session memory consistency (3+ sessions deep)
+- Tool-use canary (workspace seeded with skills/, agent must invoke)
+- Streaming-cancellation canary (mid-stream client disconnect)
+- Cross-runtime A2A peer call (currently covered by `e2e-peer-visibility`)
+
+## Why a thin Python simulator and not the real `workspace-server`?
+
+`workspace-server` is a 60+ MB Go binary that requires Postgres, Redis,
+admin-token wiring, registry plumbing, and a 30+ second cold-boot. None
+of that touches session-continuity behaviour, which is fully owned by
+the runtime container's `a2a_executor.py`. Per CTO directive "separate
+CI as possible" + the <3 min target, we excise the platform-tenant Go
+service from the loop and emit identical wire-shape envelopes from a
+single Python file.
+
+If the simulator diverges from `workspace-server` wire shape, the gate
+goes red — fix the simulator to match production. The wire shape is
+asserted in `tests/e2e/test_chat_attachments_e2e.sh` and the runtime's
+`workspace/a2a_executor.py:_core_execute`.
diff --git a/local-e2e/cp_sim/Dockerfile b/local-e2e/cp_sim/Dockerfile
new file mode 100644
index 000000000..09888e34f
--- /dev/null
+++ b/local-e2e/cp_sim/Dockerfile
@@ -0,0 +1,19 @@
+# Python tenant-CP simulator + canary test driver.
+# Single image — pytest + httpx + the canary tests baked in.
+FROM python:3.11-slim@sha256:e78299e55776ca065dcb769f80161f48465ad352014240eb5fe4712e22505e9b
+
+WORKDIR /harness
+
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Test files are bind-mounted by docker-compose at run time so a `pytest -x`
+# rerun loop doesn't require a rebuild. The COPY here is for the
+# self-contained image used by Gitea Actions (where bind mounts are awkward).
+COPY cp_sim.py /harness/cp_sim.py
+COPY canary /harness/canary
+
+ENV PYTHONUNBUFFERED=1
+
+# Default: run the 4 canaries with verbose output + JUnit XML for CI.
+CMD ["pytest", "-v", "--tb=short", "--junitxml=/harness/artifacts/junit.xml", "canary/"]
diff --git a/local-e2e/cp_sim/canary/__init__.py b/local-e2e/cp_sim/canary/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/local-e2e/cp_sim/canary/conftest.py b/local-e2e/cp_sim/canary/conftest.py
new file mode 100644
index 000000000..fc9562c6a
--- /dev/null
+++ b/local-e2e/cp_sim/canary/conftest.py
@@ -0,0 +1,31 @@
+"""Shared pytest fixtures for the canary suite."""
+
+from __future__ import annotations
+
+import os
+import sys
+import uuid
+
+# cp_sim.py lives one dir up — make it importable without packaging.
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import pytest  # noqa: E402
+
+from cp_sim import CPSim, CPSimConfig  # noqa: E402
+
+
+@pytest.fixture
+def sim() -> CPSim:
+    """Fresh CPSim per test — cheap, isolates connection state."""
+    return CPSim(
+        cfg=CPSimConfig(
+            runtime_url=os.environ.get("RUNTIME_URL", "http://localhost:18000"),
+        )
+    )
+
+
+@pytest.fixture
+def context_id() -> str:
+    """A unique canvas-thread-id per test — guarantees SessionStore isolation
+    between scenarios so a failing canary doesn't poison the next one."""
+    return f"canary-ctx-{uuid.uuid4().hex[:12]}"
diff --git a/local-e2e/cp_sim/canary/test_layer_diagnostics.py b/local-e2e/cp_sim/canary/test_layer_diagnostics.py
new file mode 100644
index 000000000..b6cc46d74
--- /dev/null
+++ b/local-e2e/cp_sim/canary/test_layer_diagnostics.py
@@ -0,0 +1,80 @@
+"""Layer-isolation diagnostics — runs alongside the 4 canaries.
+
+These probes are not strict pass/fail gates by themselves; they exist so
+when a canary fails, the artifacts include enough state to tell whether
+the regression is in the wire-shape layer, the SessionStore layer, or
+the memory layer. Each test always passes (returns early) when the
+underlying surface is unavailable on the runtime under test — different
+templates expose different debug endpoints.
+
+Cross-refs:
+  - feedback_verify_actual_endstate_not_ack_follow_sop — we read state
+    back, not the side-effect ack.
+  - feedback_image_promote_is_not_user_live — the verification is at
+    the running-container layer.
+"""
+
+from __future__ import annotations
+
+import os
+import uuid
+
+import httpx
+
+from cp_sim import CPSim
+
+
+def test_diag_agent_card_advertises_a2a(sim: CPSim) -> None:
+    """The runtime's /agent-card must advertise A2A capabilities.
+
+    If this fails, the canaries' transport assumption (POST /a2a) is
+    already broken — diagnose the runtime image, not the canary.
+    """
+    url = f"{sim.cfg.runtime_url}/agent-card"
+    r = httpx.get(url, timeout=10.0)
+    assert r.status_code == 200, (
+        f"/agent-card returned {r.status_code}: {r.text[:300]!r}"
+    )
+    body = r.json()
+    # Minimal shape check: the card body must be a JSON object (no field is asserted).
+    assert isinstance(body, dict), f"/agent-card body not an object: {body!r}"
+    # We don't require any specific capability flag — different templates
+    # advertise different sets. The point of this diag is "is the card
+    # there at all", which signals the runtime booted past entrypoint.
+
+
+def test_diag_context_id_required_for_continuity(sim: CPSim) -> None:
+    """Same context_id in two turns must not crash the runtime.
+
+    Pure smoke probe — proves the executor accepts a continuation
+    message without 5xx-ing. The substantive assertion is canary 1; this
+    one just guarantees the path is reachable.
+    """
+    ctx = f"diag-{uuid.uuid4().hex[:8]}"
+    r1 = sim.send_text("ping", context_id=ctx)
+    r2 = sim.send_text("ping again", context_id=ctx, task_id=r1.get("result", {}).get("id"))
+    # Both replies must parse — non-empty envelope, no JSON-RPC error.
+    for label, env in (("turn1", r1), ("turn2", r2)):
+        assert "error" not in env, f"{label} returned JSON-RPC error: {env['error']}"
+
+
+def test_diag_memory_root_writable_in_canary_mode(sim: CPSim) -> None:
+    """When MOLECULE_CANARY_MODE=1, the memory root must accept writes.
+
+    Probes via the recall_memory MCP tool — if /mcp is not exposed,
+    returns early (skip-style; we still pass because some templates
+    proxy MCP elsewhere).
+    """
+    # We can't write directly here — only confirm the read path doesn't
+    # 500 on a missing key. A real write happens in canary 4.
+    key = f"canary-probe-{uuid.uuid4().hex[:8]}"
+    try:
+        val = sim.probe_memory(key)
+    except Exception:
+        # /mcp may not be exposed on this template — canary 4 will
+        # surface the real defect if memory is actually broken.
+        if os.environ.get("CANARY_STRICT_MCP") == "1":
+            raise
+        return
+    # Unknown key → None is fine. The point is the call didn't crash.
+    assert val is None or isinstance(val, str)
diff --git a/local-e2e/cp_sim/canary/test_session_continuity.py b/local-e2e/cp_sim/canary/test_session_continuity.py
new file mode 100644
index 000000000..0c95a2971
--- /dev/null
+++ b/local-e2e/cp_sim/canary/test_session_continuity.py
@@ -0,0 +1,204 @@
+"""The 4 canonical session-continuity canaries (task #342, RFC#600 class).
+
+These tests speak A2A directly to the runtime under test. They are the
+authoritative gate that the runtime preserves conversation continuity,
+handles file-only messages without dropping to the empty-prompt error,
+addresses multimodal prompts, and persists memory across sessions.
+
+Wire-shape source of truth: see ../cp_sim.py docstring.
+"""
+
+from __future__ import annotations
+
+import re
+import uuid
+
+from cp_sim import CPSim
+
+
+# ---------- canary 1: 2-turn name continuity -------------------------------
+
+
+def test_canary_1_two_turn_name_continuity(sim: CPSim, context_id: str) -> None:
+    """SessionStore continuity — turn 2 must recall the name from turn 1.
+
+    Empirically tests:
+      - ``a2a_executor._core_execute`` injects prior-turn history via
+        ``_extract_history(context)`` (workspace/a2a_executor.py:313).
+      - The runtime's session store is keyed on ``context_id`` (canvas
+        thread id) NOT ``task_id`` — task_id is per-turn, context_id is
+        per-conversation. Regressions to that key derivation were the
+        root cause of the 2026-05 multi-turn-amnesia incidents
+        (#a60623344 diagnosis).
+    """
+    # Turn 1 — establish the fact.
+    r1 = sim.send_text(
+        "Hi, my name is Hongming.",
+        context_id=context_id,
+    )
+    reply1 = sim.extract_text_parts(r1)
+    assert reply1, f"Turn 1 produced empty reply. envelope={r1!r}"
+
+    # Turn 2 — ask back. Same context_id → same SessionStore key.
+    r2 = sim.send_text(
+        "What's my name?",
+        context_id=context_id,
+    )
+    reply2 = sim.extract_text_parts(r2)
+    assert reply2, f"Turn 2 produced empty reply. envelope={r2!r}"
+
+    # Whole-word match, case-insensitive — agents may reply
+    # "Your name is Hongming." or "It's Hongming!" or similar.
+    assert re.search(r"\bhongming\b", reply2, flags=re.IGNORECASE), (
+        f"Turn 2 reply does not contain 'Hongming' — SessionStore "
+        f"continuity regression suspected. context_id={context_id} "
+        f"turn1_reply={reply1[:200]!r} turn2_reply={reply2[:400]!r}"
+    )
+
+
+# ---------- canary 2: file-only message (no caption) -----------------------
+
+
+_DROPPED_TURN_MARKERS = (
+    "(empty prompt — nothing to do)",
+    "empty prompt",
+    "message contained no text content",
+    "no text content",
+)
+
+
+def test_canary_2_file_only_message(sim: CPSim, context_id: str) -> None:
+    """File-attached A2A message with NO text part must not be dropped.
+
+    Root cause this guards against: a long-standing executor bug where
+    ``extract_message_text`` returned "" for file-only messages and the
+    executor short-circuited with the "Error: message contained no text
+    content." reply, even though the attached file was the entire point
+    of the turn.
+
+    Hard assertions:
+      - Reply is non-empty AND not the dropped-turn marker.
+      - Reply references the file by name OR asks an actionable
+        clarifying question (NOT a flat error).
+    """
+    file_name = f"canary-{uuid.uuid4().hex[:8]}.txt"
+    file_body = b"Project status: nominal. Lighthouse score 98."
+
+    r = sim.send_with_file(
+        context_id=context_id,
+        text=None,  # ← THE CANARY: no caption.
+        file_name=file_name,
+        file_bytes=file_body,
+        mime_type="text/plain",
+    )
+    reply = sim.extract_text_parts(r)
+    assert reply, f"File-only message produced empty reply. envelope={r!r}"
+
+    low = reply.lower()
+    for marker in _DROPPED_TURN_MARKERS:
+        assert marker.lower() not in low, (
+            f"File-only message was dropped — reply contains "
+            f"{marker!r}. Full reply: {reply[:500]!r}"
+        )
+
+    # Engagement check (enforced): reply must engage with the file (reference its
+    # name) OR ask an actionable clarification. We require ONE of those —
+    # a generic "Hello! How can I help?" reply is also a drop.
+    name_referenced = file_name.lower() in low or "file" in low or "attach" in low
+    asks_clarification = (
+        "what" in low or "would you like" in low or "?" in reply
+    )
+    assert name_referenced or asks_clarification, (
+        f"File-only reply neither references the file nor asks a "
+        f"clarifying question. Reply: {reply[:500]!r}"
+    )
+
+
+# ---------- canary 3: file + prompt (multimodal) ---------------------------
+
+
+def test_canary_3_file_with_prompt(sim: CPSim, context_id: str) -> None:
+    """File-attached A2A message WITH a caption — multimodal happy path.
+
+    Lower bar than canary 2: assert the agent acknowledges the file was
+    received and tries to address the caption. We deliberately don't
+    require a perfect summary because canary mode replies are canned —
+    the goal is to prove the executor's multimodal code path doesn't
+    drop EITHER the file OR the caption.
+    """
+    file_name = f"canary-doc-{uuid.uuid4().hex[:8]}.txt"
+    file_body = (
+        b"Quarterly review. Revenue up 14%. Churn down 3%. "
+        b"Team headcount steady. Action: ship RFC#600 by end of week."
+    )
+    r = sim.send_with_file(
+        context_id=context_id,
+        text="summarize this",
+        file_name=file_name,
+        file_bytes=file_body,
+        mime_type="text/plain",
+    )
+    reply = sim.extract_text_parts(r)
+    assert reply, f"File+prompt produced empty reply. envelope={r!r}"
+
+    low = reply.lower()
+    for marker in _DROPPED_TURN_MARKERS:
+        assert marker.lower() not in low, (
+            f"File+prompt was dropped — reply contains {marker!r}. "
+            f"Full reply: {reply[:500]!r}"
+        )
+
+    # At minimum: the reply must mention file/attach/summary semantics,
+    # demonstrating the executor accepted both parts.
+    engaged = any(
+        kw in low for kw in ("file", "attach", "summary", "summarize", "content", file_name.lower())
+    )
+    assert engaged, (
+        f"Multimodal reply doesn't engage with attached file or caption. "
+        f"Reply: {reply[:500]!r}"
+    )
+
+
+# ---------- canary 4: cross-session memory recall --------------------------
+
+
+def test_canary_4_cross_session_memory_recall(sim: CPSim) -> None:
+    """Memory persists across distinct context_ids → memory layer (NOT
+    SessionStore) is the storage.
+
+    Two distinct context_ids in this test — SessionStore CANNOT bridge
+    them. The bridge is the runtime's persistent memory (MOLECULE_MEMORY_ROOT
+    in canary mode). If the recall returns "blue" in session 2, the
+    memory layer is wired correctly.
+
+    Note: we ask the agent to commit the memory explicitly in session 1
+    so that the canary doesn't depend on memory auto-extraction
+    heuristics (which vary by runtime). The commit goes through the
+    same MCP tool the canvas would invoke.
+    """
+    ctx_a = f"canary-ctx-{uuid.uuid4().hex[:12]}"
+    ctx_b = f"canary-ctx-{uuid.uuid4().hex[:12]}"
+
+    # Session 1 — commit a fact via the memory tool. Use the explicit
+    # "remember" verb so canary-mode agents (which short-circuit to a
+    # deterministic tool-call) reliably invoke `commit_memory`.
+    r1 = sim.send_text(
+        "Please use the memory tool to remember: my favorite color is blue.",
+        context_id=ctx_a,
+    )
+    reply1 = sim.extract_text_parts(r1)
+    assert reply1, f"Session 1 produced empty reply. envelope={r1!r}"
+
+    # Session 2 — different context_id. Same workspace, same memory.
+    r2 = sim.send_text(
+        "Use the memory tool to recall my favorite color, then tell me what it is.",
+        context_id=ctx_b,
+    )
+    reply2 = sim.extract_text_parts(r2)
+    assert reply2, f"Session 2 produced empty reply. envelope={r2!r}"
+
+    assert re.search(r"\bblue\b", reply2, flags=re.IGNORECASE), (
+        f"Session 2 reply does not contain 'blue' — cross-session memory "
+        f"recall regression suspected. ctx_a={ctx_a} ctx_b={ctx_b} "
+        f"session1_reply={reply1[:200]!r} session2_reply={reply2[:400]!r}"
+    )
diff --git a/local-e2e/cp_sim/cp_sim.py b/local-e2e/cp_sim/cp_sim.py
new file mode 100644
index 000000000..48d735c97
--- /dev/null
+++ b/local-e2e/cp_sim/cp_sim.py
@@ -0,0 +1,214 @@
+"""Tenant control-plane simulator.
+
+Emits the byte-identical JSON-RPC `message/send` wire shape that the
+production `workspace-server` POSTs to the runtime's :8000 — see
+``workspace-server/internal/handlers/a2a.go`` and the canonical sample
+in ``tests/e2e/test_chat_attachments_e2e.sh``.
+
+This file is purposefully small (~250 LoC). It is NOT a re-implementation
+of `workspace-server`; it is just the minimum surface required to drive
+the 4 session-continuity canaries.
+
+If the runtime asserts on a header / envelope field that the production
+platform sets but this simulator omits, FIX THE SIMULATOR — never weaken
+the runtime to accept divergent wire shapes. The simulator is the
+canonical contract emitter for canary purposes
+(``feedback_no_single_source_of_truth``).
+"""
+
+from __future__ import annotations
+
+import base64
+import json
+import os
+import uuid
+from dataclasses import dataclass
+from typing import Any
+
+import httpx
+
+
+@dataclass
+class CPSimConfig:
+    runtime_url: str
+    """Base URL of the runtime under test (e.g. http://runtime:8000)."""
+    request_timeout_s: float = 60.0
+    """Per-A2A-call timeout. Generous — canary mode replies are fast,
+    but a real Provider-backed runtime under cold cache can take 30+s."""
+
+
+class CPSim:
+    """Thin client matching workspace-server's wire shape."""
+
+    def __init__(self, cfg: CPSimConfig | None = None) -> None:
+        self.cfg = cfg or CPSimConfig(
+            runtime_url=os.environ.get("RUNTIME_URL", "http://localhost:18000"),
+        )
+        self._client = httpx.Client(timeout=self.cfg.request_timeout_s)
+
+    # ------------------------------------------------------------------ A2A
+
+    def send_text(
+        self,
+        text: str,
+        *,
+        context_id: str,
+        task_id: str | None = None,
+    ) -> dict[str, Any]:
+        """POST a text-only A2A message. Returns the JSON-RPC envelope."""
+        msg_id = f"canary-{uuid.uuid4().hex[:12]}"
+        payload = {
+            "jsonrpc": "2.0",
+            "id": msg_id,
+            "method": "message/send",
+            "params": {
+                "message": {
+                    "role": "user",
+                    "messageId": msg_id,
+                    "kind": "message",
+                    "contextId": context_id,
+                    "taskId": task_id,
+                    "parts": [{"kind": "text", "text": text}],
+                },
+                "configuration": {
+                    "acceptedOutputModes": ["text/plain"],
+                    "blocking": True,
+                },
+            },
+        }
+        return self._post(payload)
+
+    def send_with_file(
+        self,
+        *,
+        context_id: str,
+        text: str | None,
+        file_name: str,
+        file_bytes: bytes,
+        mime_type: str = "text/plain",
+        task_id: str | None = None,
+    ) -> dict[str, Any]:
+        """POST an A2A message with an inline file part.
+
+        Uses the inline `bytes` form of A2A file parts (RFC#600 — the
+        no-URI variant added precisely so canary tests don't need a
+        `/chat/uploads` round-trip). Each runtime's executor calls
+        ``extract_attached_files`` which handles both forms — verified
+        in ``workspace/executor_helpers.py:903``.
+        """
+        msg_id = f"canary-{uuid.uuid4().hex[:12]}"
+        parts: list[dict[str, Any]] = []
+        if text:
+            parts.append({"kind": "text", "text": text})
+        parts.append(
+            {
+                "kind": "file",
+                "file": {
+                    "name": file_name,
+                    "mimeType": mime_type,
+                    "bytes": base64.b64encode(file_bytes).decode("ascii"),
+                },
+            }
+        )
+        payload = {
+            "jsonrpc": "2.0",
+            "id": msg_id,
+            "method": "message/send",
+            "params": {
+                "message": {
+                    "role": "user",
+                    "messageId": msg_id,
+                    "kind": "message",
+                    "contextId": context_id,
+                    "taskId": task_id,
+                    "parts": parts,
+                },
+                "configuration": {
+                    "acceptedOutputModes": ["text/plain"],
+                    "blocking": True,
+                },
+            },
+        }
+        return self._post(payload)
+
+    # ------------------------------------------------------------ helpers
+
+    def _post(self, payload: dict[str, Any]) -> dict[str, Any]:
+        url = f"{self.cfg.runtime_url}/a2a"
+        try:
+            r = self._client.post(url, json=payload)
+        except httpx.HTTPError as e:
+            raise CPSimError(f"A2A POST failed: {e}") from e
+        if r.status_code != 200:
+            raise CPSimError(
+                f"A2A non-200: status={r.status_code} body={r.text[:500]}"
+            )
+        try:
+            return r.json()
+        except json.JSONDecodeError as e:
+            raise CPSimError(f"A2A body not JSON: {r.text[:500]}") from e
+
+    @staticmethod
+    def extract_text_parts(envelope: dict[str, Any]) -> str:
+        """Return concatenated text from all text parts of a reply.
+
+        Handles both top-level `result.parts` (the canonical shape) and
+        `result.artifacts[*].parts` (which some runtimes emit when the
+        reply was streamed as artifact chunks). Matches the extractor in
+        ``tests/e2e/test_chat_attachments_e2e.sh``.
+        """
+        result = envelope.get("result") or {}
+        chunks: list[str] = []
+        for p in result.get("parts", []) or []:
+            if p.get("kind") == "text":
+                chunks.append(p.get("text", ""))
+        for art in result.get("artifacts", []) or []:
+            for p in art.get("parts", []) or []:
+                if p.get("kind") == "text":
+                    chunks.append(p.get("text", ""))
+        # Some runtimes return a status.message instead of/in addition to parts.
+        status = result.get("status") or {}
+        status_msg = status.get("message") or {}
+        for p in status_msg.get("parts", []) or []:
+            if p.get("kind") == "text":
+                chunks.append(p.get("text", ""))
+        return "\n".join(chunks).strip()
+
+    # ----------------------------------------------------- memory probe
+
+    def probe_memory(self, key: str) -> str | None:
+        """Read a memory value via the runtime's MCP memory tool.
+
+        Uses the same MCP transport the canvas uses
+        (``POST /workspaces/:id/mcp``-shaped JSON-RPC over /mcp). Returns
+        the recalled string or None if the key is missing.
+        """
+        payload = {
+            "jsonrpc": "2.0",
+            "id": f"canary-mem-{uuid.uuid4().hex[:8]}",
+            "method": "tools/call",
+            "params": {"name": "recall_memory", "arguments": {"key": key}},
+        }
+        try:
+            r = self._client.post(f"{self.cfg.runtime_url}/mcp", json=payload)
+        except httpx.HTTPError as e:
+            raise CPSimError(f"MCP POST failed: {e}") from e
+        if r.status_code != 200:
+            return None
+        body = r.json()
+        result = body.get("result") or {}
+        # MCP responses wrap the tool output in result.content[*].text per
+        # the MCP tools/call contract.
+        for c in result.get("content", []) or []:
+            if c.get("type") == "text":
+                return c.get("text")
+        return None
+
+
+class CPSimError(RuntimeError):
+    """Raised on transport / envelope failures (NOT canary assertion failures).
+
+    Distinct from AssertionError so a transport-layer fault is easy to tell
+    apart from a real session-continuity regression in the traceback (pytest
+    still marks the test FAILED, since the exception is raised in the test body).
+    """
diff --git a/local-e2e/cp_sim/requirements.txt b/local-e2e/cp_sim/requirements.txt
new file mode 100644
index 000000000..b8bb5cd3e
--- /dev/null
+++ b/local-e2e/cp_sim/requirements.txt
@@ -0,0 +1,5 @@
+# Pinned (not floating) so the harness is reproducible across CI runs.
+# These versions match what tests/e2e/_lib.sh and tests/e2e/conftest.py use.
+httpx==0.27.2
+pytest==8.3.3
+pytest-asyncio==0.24.0
diff --git a/local-e2e/docker-compose.yml b/local-e2e/docker-compose.yml
new file mode 100644
index 000000000..e306b31ea
--- /dev/null
+++ b/local-e2e/docker-compose.yml
@@ -0,0 +1,58 @@
+# local-e2e/docker-compose.yml — minimal harness stack.
+#
+# Two services:
+#   runtime — the template image under test (TEMPLATE_IMAGE env var).
+#             Exposes :8000 for A2A traffic. The simulator POSTs to it.
+#   cp_sim  — thin Python tenant-CP simulator. Drives the canary turns.
+#
+# Deliberately NO postgres, NO redis, NO platform Go service. SessionStore
+# continuity is a runtime-internal concern (a2a_executor + executor_helpers);
+# we test it without dragging the platform-tenant Go binary into the loop.
+# See README.md "Why a thin Python simulator" for rationale.
+
+services:
+  runtime:
+    image: ${TEMPLATE_IMAGE:?TEMPLATE_IMAGE env required, e.g. ghcr.io/molecule-ai/workspace-template-hermes:latest}
+    # The runtime entrypoint (workspace/entrypoint.sh) refuses to start when
+    # any operator-scope env var is present. We deliberately set no creds —
+    # the canary doesn't invoke a real LLM provider (see MOLECULE_CANARY_MODE below).
+    environment:
+      # Disable provider calls during canary — the runtime returns canned
+      # echo-style replies so the harness can assert continuity / file-handling
+      # behaviour without burning provider quota. The template image must
+      # honour MOLECULE_CANARY_MODE=1 (added in molecule-ai-workspace-runtime
+      # PR #46 — see molecule_runtime/a2a_executor.py canary short-circuit).
+      MOLECULE_CANARY_MODE: "1"
+      # Anonymous workspace identity so RBAC paths exercise the same code
+      # they would in tenant production.
+      WORKSPACE_ID: "canary-${CANARY_RUN_ID:-local}"
+      # Memory tool requires a writable scope; point at /tmp inside the
+      # container so cross-session canary (#4) works without bind mounts.
+      MOLECULE_MEMORY_ROOT: "/tmp/canary-memory"
+      # The provisioner's forbidden-env guard exits non-zero when any
+      # operator-scope literal is present; the canary intentionally sets
+      # zero of them. Leave guard ON (do NOT set MOLECULE_TENANT_GUARD_DISABLE)
+      # so we exercise the prod entrypoint code path verbatim.
+    ports:
+      - "${RUNTIME_PORT:-18000}:8000"
+    healthcheck:
+      # /agent-card is the universal A2A discovery endpoint — every template
+      # exposes it. /health varies per template.
+      test: ["CMD-SHELL", "wget -qO /dev/null --tries=1 http://localhost:8000/agent-card || exit 1"]
+      interval: 3s
+      timeout: 3s
+      retries: 20
+      start_period: 30s
+
+  cp_sim:
+    build:
+      context: ./cp_sim
+    depends_on:
+      runtime:
+        condition: service_healthy
+    environment:
+      RUNTIME_URL: "http://runtime:8000"
+      CANARY_RUN_ID: "${CANARY_RUN_ID:-local}"
+    # cp_sim doesn't expose a port — it's a one-shot driver invoked by
+    # run-canary.sh via `docker compose run cp_sim pytest ...`.
+    profiles: ["driver"]
diff --git a/local-e2e/scripts/onboard-template.sh b/local-e2e/scripts/onboard-template.sh
new file mode 100755
index 000000000..79b3c8dfc
--- /dev/null
+++ b/local-e2e/scripts/onboard-template.sh
@@ -0,0 +1,68 @@
+#!/usr/bin/env bash
+# onboard-template.sh — gitops helper to wire local-e2e into a new template.
+#
+# Drops .gitea/workflows/session-continuity-e2e.yml into the target template
+# repo (a thin shim that clones molecule-core's local-e2e harness, then runs
+# run-canary.sh against the locally-built image). Pushes a branch; the PR is opened manually.
+#
+# Usage:
+#   ./local-e2e/scripts/onboard-template.sh molecule-ai-workspace-template-claude-code
+#
+# Per task #342 sequencing: do NOT run this for every template at once.
+# Bake the gate on hermes for ≥5 business days first; expand only after
+# the canary is empirically stable.
+#
+# Cross-refs:
+#   feedback_no_single_source_of_truth — the workflow content is identical
+#     across templates; this helper guarantees it.
+#   feedback_image_promote_is_not_user_live — we wire the gate at the
+#     CI layer; flipping it to REQUIRED in branch_protection is a
+#     separate step (see README.md).
+
+set -euo pipefail
+
+REPO="${1:?usage: onboard-template.sh }"
+HARNESS_ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"
+
+# Sanity: ensure the template-side workflow file exists in this repo.
+TEMPLATE_WORKFLOW="$HARNESS_ROOT/templates/session-continuity-e2e.yml"
+[ -f "$TEMPLATE_WORKFLOW" ] || {
+  echo "ERROR: $TEMPLATE_WORKFLOW not found in this harness checkout"
+  exit 1
+}
+
+WORK_DIR=$(mktemp -d -t e2e-onboard-XXXXXX)
+trap 'rm -rf "$WORK_DIR"' EXIT
+
+cd "$WORK_DIR"
+
+# Use mol_clone — preserves the persona credential model.
+# shellcheck disable=SC1090
+source "$HOME/.molecule-ai/ops.sh"
+mol_clone "$REPO"
+cd "$REPO"
+
+git checkout -b "task342/session-continuity-e2e-gate"
+
+mkdir -p .gitea/workflows
+cp "$TEMPLATE_WORKFLOW" .gitea/workflows/session-continuity-e2e.yml
+
+git add .gitea/workflows/session-continuity-e2e.yml
+git commit -m "ci: add local-e2e session-continuity canary gate (task #342)
+
+Wires this template into the cross-template session-continuity harness
+in molecule-ai/molecule-core/local-e2e/. The gate boots THIS repo's
+locally-built image, drives 4 canonical canaries (2-turn name continuity,
+file-only message, file+prompt, cross-session memory recall), and fails
+PRs that regress any of them.
+
+Per CTO directive: required-context flip in branch_protection is a
+SEPARATE step after 5 business days of bake."
+
+# Push branch; do not auto-open PR — leave that to the operator so the
+# review-relay routing follows the same rules as a normal change.
+git push -u origin "task342/session-continuity-e2e-gate"
+
+echo
+echo "DONE. Branch pushed to $REPO. Open PR manually:"
+echo "  https://git.moleculesai.app/molecule-ai/$REPO/compare/main...task342/session-continuity-e2e-gate"
diff --git a/local-e2e/scripts/run-canary.sh b/local-e2e/scripts/run-canary.sh
new file mode 100755
index 000000000..ae9c98c9e
--- /dev/null
+++ b/local-e2e/scripts/run-canary.sh
@@ -0,0 +1,105 @@
+#!/usr/bin/env bash
+# run-canary.sh — one-shot orchestration for the local-e2e session-continuity
+# canary harness. Used by both interactive local runs and the per-template
+# .gitea/workflows/session-continuity-e2e.yml.
+#
+# Usage:
+#   TEMPLATE_IMAGE=ghcr.io/molecule-ai/workspace-template-hermes:latest \
+#     ./local-e2e/scripts/run-canary.sh
+#
+# Optional env:
+#   CANARY_RUN_ID — disambiguator for parallel CI runs (default: random)
+#   RUNTIME_PORT  — host port for runtime :8000 (default: 18000)
+#   KEEP_RUNNING  — set =1 to leave containers up for post-mortem
+#
+# Exit codes:
+#   0 — all 4 canaries passed
+#   1 — at least one canary failed (artifacts/ has the dump)
+#   2 — harness infrastructure failure (image pull / compose / etc.)
+#
+# Cross-refs:
+#   feedback_image_promote_is_not_user_live — we verify at the running
+#     container layer, NOT at the pipeline-green layer.
+#   feedback_verify_actual_endstate_not_ack_follow_sop — every assert
+#     reads state back; no side-effect-ack claims success.
+
+set -euo pipefail
+
+: "${TEMPLATE_IMAGE:?TEMPLATE_IMAGE env required (the runtime image under test)}"
+
+# ----------------------------------------------------------------- paths
+HARNESS_ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )/.." && pwd )"
+ARTIFACTS_DIR="$HARNESS_ROOT/artifacts"
+mkdir -p "$ARTIFACTS_DIR"
+
+export CANARY_RUN_ID="${CANARY_RUN_ID:-$(uuidgen 2>/dev/null | tr A-Z a-z | tr -d - | cut -c1-12 || date +%s)}"
+export RUNTIME_PORT="${RUNTIME_PORT:-18000}"
+export TEMPLATE_IMAGE
+COMPOSE_PROJECT="canary-${CANARY_RUN_ID}"
+COMPOSE_FILE="$HARNESS_ROOT/docker-compose.yml"
+
+log() { printf "\n=== [%s] %s ===\n" "$(date +%H:%M:%S)" "$*"; }
+
+# ----------------------------------------------------------- cleanup hook
+cleanup() {
+  local rc=$?
+  if [ "${KEEP_RUNNING:-0}" = "1" ]; then
+    log "KEEP_RUNNING=1 — leaving containers up (project=$COMPOSE_PROJECT)"
+    return $rc
+  fi
+  log "Tearing down compose project $COMPOSE_PROJECT"
+  # On non-zero exit, capture logs FIRST. Per feedback_image_promote_is_
+  # not_user_live: dump state from the actually-running container, not
+  # an inferred pipeline state.
+  if [ $rc -ne 0 ]; then
+    log "Canary FAILED — dumping artifacts to $ARTIFACTS_DIR"
+    docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" logs \
+      --no-color --tail=200 runtime \
+      > "$ARTIFACTS_DIR/runtime.log" 2>&1 || true
+    # SessionStore state probe: best-effort listing of the memory root plus
+    # any session*.json files under /tmp; the file may be empty if none exist.
+ docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" exec -T runtime \ + sh -c 'ls -la /tmp/canary-memory 2>/dev/null; find /tmp -name "session*.json" -exec cat {} \; 2>/dev/null' \ + > "$ARTIFACTS_DIR/session-store.txt" 2>&1 || true + fi + docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" down --volumes --remove-orphans >/dev/null 2>&1 || true + return $rc +} +trap cleanup EXIT + +# ------------------------------------------------------ stack bring-up +log "Building cp_sim image" +docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" build cp_sim + +log "Pulling runtime image: $TEMPLATE_IMAGE" +docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" pull runtime 2>&1 \ + | tail -5 || true + +log "Starting runtime (host port $RUNTIME_PORT)" +docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" up -d runtime + +# Wait for healthcheck — docker-compose `--wait` is the canonical mechanism +# (introduced in v2.1.1 in 2021, available on every supported runner pool). +log "Waiting for runtime healthcheck" +if ! docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" up -d --wait runtime; then + log "Runtime never went healthy — dumping logs" + docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" logs --no-color --tail=200 runtime \ + > "$ARTIFACTS_DIR/runtime-boot-failure.log" 2>&1 || true + exit 2 +fi + +# -------------------------------------------------------------- run tests +log "Running canary suite" +# Run cp_sim under the same compose project so DNS (runtime hostname) +# resolves on the molecule-core-net bridge. --rm cleans the driver container +# after pytest exits; volume bind mounts pytest's junit-xml back to host. 
+if docker compose -p "$COMPOSE_PROJECT" -f "$COMPOSE_FILE" --profile driver run \ + --rm \ + -v "$ARTIFACTS_DIR:/harness/artifacts" \ + cp_sim; then + log "All canaries PASSED" + exit 0 +else + log "At least one canary FAILED — see $ARTIFACTS_DIR/junit.xml" + exit 1 +fi diff --git a/local-e2e/templates/session-continuity-e2e.yml b/local-e2e/templates/session-continuity-e2e.yml new file mode 100644 index 000000000..901816f4d --- /dev/null +++ b/local-e2e/templates/session-continuity-e2e.yml @@ -0,0 +1,85 @@ +name: session-continuity-e2e + +# Per-template wrapper for the molecule-core/local-e2e canary harness. +# DO NOT EDIT THIS FILE IN A TEMPLATE REPO — the canonical copy lives at +# molecule-ai/molecule-core:local-e2e/templates/session-continuity-e2e.yml +# (feedback_no_single_source_of_truth). The onboard-template.sh script +# copies it verbatim into each template; future fixes propagate via that +# helper, not by editing the template-side copy. +# +# What this workflow does: +# 1. Build THIS template's runtime image locally on the docker-host runner. +# 2. Clone molecule-core (canonical harness source). +# 3. Invoke local-e2e/scripts/run-canary.sh with TEMPLATE_IMAGE set to +# the just-built local image. +# 4. Upload artifacts/ on failure for post-mortem. +# +# Required-context flip: +# This workflow posts a status under the literal context name +# "session-continuity-e2e (pull_request)", which follows Gitea's +# standard context format (event name in parentheses). To make it +# REQUIRED, add that exact string to the template repo's branch_protection +# status_check_contexts list. See README.md for the bake-period rule. +# +# Gitea 1.22.6 / act_runner notes (cross-refs to known footguns): +# - No cross-repo `uses:` (feedback_gitea_cross_repo_uses_blocked) — +# we clone molecule-core via plain git instead. +# - Per-SHA concurrency (feedback_concurrency_group_per_sha). +# - Workflow-level GITHUB_SERVER_URL pinned to the Gitea host +# (feedback_act_runner_github_server_url). 
+# - Runs on docker-host pool — NOT the heavy CI pool — per CTO +# directive "separate CI as possible" and the <3 min target. + +on: + pull_request: + branches: [main] + push: + branches: [main] + +concurrency: + group: session-continuity-e2e-${{ github.workflow }}-${{ github.event_name }}-${{ github.event.pull_request.head.sha || github.sha }} + cancel-in-progress: true + +env: + GITHUB_SERVER_URL: https://git.moleculesai.app + +jobs: + session-continuity-e2e: + runs-on: docker-host + timeout-minutes: 8 + steps: + - name: Checkout template + uses: actions/checkout@v4 + with: + path: template + + - name: Build template image + id: build + working-directory: template + run: | + IMAGE_TAG="local-e2e-${GITHUB_SHA::12}" + docker build -t "molecule-ai/template-under-test:${IMAGE_TAG}" . + echo "image=molecule-ai/template-under-test:${IMAGE_TAG}" >> "$GITHUB_OUTPUT" + + - name: Clone harness from molecule-core + run: | + # Anonymous clone — molecule-core is internal-readable. NEVER bake + # an auth token into the URL (feedback_credentials_in_git_url). + git clone --depth 1 "${GITHUB_SERVER_URL}/molecule-ai/molecule-core.git" harness + + - name: Run canary + env: + TEMPLATE_IMAGE: ${{ steps.build.outputs.image }} + CANARY_RUN_ID: ${{ github.run_id }}-${{ github.run_attempt }} + run: | + cd harness + ./local-e2e/scripts/run-canary.sh + + - name: Upload artifacts on failure + if: failure() + uses: actions/upload-artifact@v4 + with: + name: session-continuity-canary-${{ github.run_id }} + path: harness/local-e2e/artifacts/ + if-no-files-found: warn + retention-days: 7
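The `Run canary` step above fails the job on any non-zero exit, while the header of `run-canary.sh` distinguishes exit code 1 (at least one canary failed) from exit code 2 (harness infrastructure failure). A minimal sketch of how a wrapper could surface that distinction in the job log follows. The function name `describe_canary_exit` and its message strings are hypothetical and not part of this diff; only the 0/1/2 meanings come from the script header.

```bash
#!/usr/bin/env bash
# Hypothetical helper (NOT part of the diff): maps the exit codes documented
# in run-canary.sh's header to a short verdict string for the job log.
describe_canary_exit() {
  case "$1" in
    0) echo "pass: all canaries passed" ;;
    1) echo "fail: at least one canary failed (see artifacts/)" ;;
    2) echo "infra: harness infrastructure failure" ;;
    *) echo "unknown: unexpected exit code $1" ;;
  esac
}

# Possible usage inside a workflow step (left commented out so the sketch
# has no side effects when sourced):
#   rc=0
#   ./local-e2e/scripts/run-canary.sh || rc=$?
#   describe_canary_exit "$rc"
#   exit "$rc"
```

Keeping the original exit code as the step's final status would leave the required-context behaviour unchanged; the helper only adds a readable line so an operator can tell a flaky runner (2) from a genuine session-continuity regression (1).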