Adds a self-contained docker-compose harness in local-e2e/ that gates
RFC#600-class template changes BEFORE customer canary. Implements the 4
canonical canaries:
1. 2-turn name continuity — SessionStore key derivation
2. File-only message — no caption drop-to-empty-prompt regress
3. File + prompt (multimodal) — multimodal happy path
4. Cross-session memory — explicit memory tool, distinct context_ids
Architecture is deliberately lean per CTO "separate CI as possible":
local-e2e/
docker-compose.yml # runtime + cp_sim ONLY (no platform Go, no pg)
cp_sim/ # ~250 LoC Python A2A wire-shape emitter
cp_sim/canary/ # 4 canary scenarios + layer-isolation probes
scripts/run-canary.sh # one-shot orchestration (target <3 min)
scripts/onboard-template.sh # gitops helper for cascade
templates/session-continuity-e2e.yml # canonical workflow shim
Rationale for a Python tenant-CP simulator (not the real workspace-server):
SessionStore behaviour is fully owned by workspace/a2a_executor.py +
executor_helpers.py — the Go platform service doesn't touch session
continuity. Excising it gets the harness to <3 min cold-boot on
docker-host runners and keeps the surface small enough to debug fast.
The simulator emits the byte-identical JSON-RPC message/send envelope
that workspace-server POSTs (cross-checked against
tests/e2e/test_chat_attachments_e2e.sh and workspace/a2a_executor.py
:_core_execute).
Per feedback_no_single_source_of_truth: the harness IS the canonical
session-continuity validator across templates. Per-template unit tests
keep covering their own guard logic.
Per feedback_image_promote_is_not_user_live + feedback_verify_actual_
endstate_not_ack_follow_sop: every canary asserts at the running-
container layer; artifacts dump SessionStore state + runtime logs on
failure for post-mortem.
Rollout (deliberate sequencing, per task #342):
1. THIS PR — lands harness in molecule-core. NOT yet wired to any
template repo.
2. Companion PR in molecule-ai-workspace-template-hermes — adds
.gitea/workflows/session-continuity-e2e.yml. NOT required yet.
3. Bake on hermes for ≥5 business days.
4. Cascade to remaining 6 templates via onboard-template.sh.
5. Per-template BP flip — add "session-continuity-e2e (pull_request)"
to status_check_contexts on each repo, hermes first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4.3 KiB
local-e2e — session-continuity canary harness
Self-contained Docker-Compose harness that gates RFC#600-class template changes (session continuity, file-only messages, multimodal prompts, cross-session memory) before they reach customer canary.
Per CTO standing directive "fully tested + separate CI": this is a
dedicated, fast (target <3 min), small-surface harness that uses a
Python tenant-CP simulator (not the full workspace-server Go service)
to exercise the runtime image end-to-end against canonical canary turns.
See [feedback_no_single_source_of_truth] — the harness IS the canonical
session-continuity validator. Per-runtime unit tests still cover their
own guard logic; the harness covers the live conversational behaviour
that those unit tests cannot prove.
See [feedback_image_promote_is_not_user_live] — every assertion reads
state back from the running container, never from a publish-pipeline
ack.
What it tests (the 4 canaries)
| # | Scenario | Asserts |
|---|---|---|
| 1 | 2-turn name canary | turn 2 reply contains "Hongming" → SessionStore continuity |
| 2 | File-only message (no caption) | NOT "(empty prompt — nothing to do)" + reply references filename or asks for clarification |
| 3 | File + caption ("summarize this") | reply addresses attachment + caption |
| 4 | Cross-session memory recall | new session pulls "blue" via memory tool |
Each scenario re-uses the same A2A wire-shape that the production
workspace-server POSTs to runtime :8000 (canvas-thread-id semantics
via context_id).
Architecture
local-e2e/
docker-compose.yml # runtime under test + cp_sim
cp_sim/ # ≈300 LoC Python A2A poster + file uploader
cp_sim.py
Dockerfile
requirements.txt
canary/
conftest.py
test_session_continuity.py # 4 canary scenarios
test_layer_diagnostics.py # SessionStore state probe + key derivation
scripts/
run-canary.sh # one-shot orchestration entrypoint
The CP simulator emits the exact JSON-RPC message/send envelope
that workspace-server produces (verified against
tests/e2e/test_chat_attachments_e2e.sh). No Go service is in the loop —
this keeps the harness lean per the CTO directive.
Run locally
# from molecule-core repo root:
export TEMPLATE_IMAGE=ghcr.io/molecule-ai/workspace-template-hermes:latest
./local-e2e/scripts/run-canary.sh
Exit code 0 = all 4 canaries pass. Non-zero = at least one canary failed
and the harness dumped SessionStore state + last 200 log lines from the
runtime container into ./local-e2e/artifacts/.
How it integrates into CI
Each template repo's .gitea/workflows/session-continuity-e2e.yml calls
run-canary.sh with its own freshly-built TEMPLATE_IMAGE. The
template repo's Gitea branch-protection lists
session-continuity-e2e (pull_request) as a required context.
Rollout order (deliberate — per feedback_image_promote_is_not_user_live
we bake before we cascade):
molecule-ai-workspace-template-hermes— highest-traffic + most recent RFC#600-class fixes — REQUIRED gate- Bake for 5 business days
- Cascade to claude-code, langgraph, autogen, openclaw, smolagents,
google-adk (one PR per template — see
scripts/onboard-template.sh)
Future extensions (out of scope for the initial PR)
- Multi-session memory consistency (3+ sessions deep)
- Tool-use canary (workspace seeded with skills/, agent must invoke)
- Streaming-cancellation canary (mid-stream client disconnect)
- Cross-runtime A2A peer call (currently covered by
e2e-peer-visibility)
Why a thin Python simulator and not the real workspace-server?
workspace-server is a 60+ MB Go binary that requires Postgres, Redis,
admin-token wiring, registry plumbing, and a 30+ second cold-boot. None
of that touches session-continuity behaviour, which is fully owned by
the runtime container's a2a_executor.py. Per CTO directive "separate
CI as possible" + the <3 min target, we excise the platform-tenant Go
service from the loop and emit identical wire-shape envelopes from a
single Python file.
If the simulator diverges from workspace-server wire shape, the gate
goes red — fix the simulator to match production. The wire shape is
asserted in tests/e2e/test_chat_attachments_e2e.sh and the runtime's
workspace/a2a_executor.py:_core_execute.