Files
molecule-core/local-e2e/README.md
claude-ceo-assistant 59d699b61c
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 24s
E2E API Smoke Test / detect-changes (pull_request) Successful in 14s
E2E Chat / detect-changes (pull_request) Successful in 11s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 9s
gate-check-v3 / gate-check (pull_request) Successful in 7s
qa-review / approved (pull_request) Failing after 7s
security-review / approved (pull_request) Failing after 6s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request) Has been skipped
sop-tier-check / tier-check (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m3s
CI / Platform (Go) (pull_request) Successful in 5m45s
CI / Python Lint & Test (pull_request) Successful in 7m0s
CI / Canvas (Next.js) (pull_request) Successful in 7m34s
CI / all-required (pull_request) Successful in 7m14s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 5s
E2E Chat / E2E Chat (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
feat(local-e2e): session-continuity canary harness (task #342, RFC#600 gate)
Adds a self-contained docker-compose harness in local-e2e/ that gates
RFC#600-class template changes BEFORE customer canary. Implements the 4
canonical canaries:

  1. 2-turn name continuity   — SessionStore key derivation
  2. File-only message        — no caption drop-to-empty-prompt regress
  3. File + prompt (multimodal) — multimodal happy path
  4. Cross-session memory     — explicit memory tool, distinct context_ids

Architecture is deliberately lean per CTO "separate CI as possible":

  local-e2e/
    docker-compose.yml       # runtime + cp_sim ONLY (no platform Go, no pg)
    cp_sim/                  # ~250 LoC Python A2A wire-shape emitter
    cp_sim/canary/           # 4 canary scenarios + layer-isolation probes
    scripts/run-canary.sh    # one-shot orchestration (target <3 min)
    scripts/onboard-template.sh  # gitops helper for cascade
    templates/session-continuity-e2e.yml  # canonical workflow shim

Rationale for a Python tenant-CP simulator (not the real workspace-server):
SessionStore behaviour is fully owned by workspace/a2a_executor.py +
executor_helpers.py — the Go platform service doesn't touch session
continuity. Excising it gets the harness to <3 min cold-boot on
docker-host runners and keeps the surface small enough to debug fast.

The simulator emits the byte-identical JSON-RPC message/send envelope
that workspace-server POSTs (cross-checked against
tests/e2e/test_chat_attachments_e2e.sh and workspace/a2a_executor.py
:_core_execute).

Per feedback_no_single_source_of_truth: the harness IS the canonical
session-continuity validator across templates. Per-template unit tests
keep covering their own guard logic.

Per feedback_image_promote_is_not_user_live + feedback_verify_actual_
endstate_not_ack_follow_sop: every canary asserts at the running-
container layer; artifacts dump SessionStore state + runtime logs on
failure for post-mortem.

Rollout (deliberate sequencing, per task #342):
  1. THIS PR — lands harness in molecule-core. NOT yet wired to any
     template repo.
  2. Companion PR in molecule-ai-workspace-template-hermes — adds
     .gitea/workflows/session-continuity-e2e.yml. NOT required yet.
  3. Bake on hermes for ≥5 business days.
  4. Cascade to remaining 6 templates via onboard-template.sh.
  5. Per-template BP flip — add "session-continuity-e2e (pull_request)"
     to status_check_contexts on each repo, hermes first.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 02:39:30 -07:00

4.3 KiB

local-e2e — session-continuity canary harness

Self-contained Docker-Compose harness that gates RFC#600-class template changes (session continuity, file-only messages, multimodal prompts, cross-session memory) before they reach customer canary.

Per CTO standing directive "fully tested + separate CI": this is a dedicated, fast (target <3 min), small-surface harness that uses a Python tenant-CP simulator (not the full workspace-server Go service) to exercise the runtime image end-to-end against canonical canary turns.

See [feedback_no_single_source_of_truth] — the harness IS the canonical session-continuity validator. Per-runtime unit tests still cover their own guard logic; the harness covers the live conversational behaviour that those unit tests cannot prove.

See [feedback_image_promote_is_not_user_live] — every assertion reads state back from the running container, never from a publish-pipeline ack.

What it tests (the 4 canaries)

# Scenario Asserts
1 2-turn name canary turn 2 reply contains "Hongming" → SessionStore continuity
2 File-only message (no caption) NOT "(empty prompt — nothing to do)" + reply references filename or asks for clarification
3 File + caption ("summarize this") reply addresses attachment + caption
4 Cross-session memory recall new session pulls "blue" via memory tool

Each scenario re-uses the same A2A wire-shape that the production workspace-server POSTs to runtime :8000 (canvas-thread-id semantics via context_id).

Architecture

local-e2e/
  docker-compose.yml           # runtime under test + cp_sim
  cp_sim/                      # ≈300 LoC Python A2A poster + file uploader
    cp_sim.py
    Dockerfile
    requirements.txt
  canary/
    conftest.py
    test_session_continuity.py # 4 canary scenarios
    test_layer_diagnostics.py  # SessionStore state probe + key derivation
  scripts/
    run-canary.sh              # one-shot orchestration entrypoint

The CP simulator emits the exact JSON-RPC message/send envelope that workspace-server produces (verified against tests/e2e/test_chat_attachments_e2e.sh). No Go service is in the loop — this keeps the harness lean per the CTO directive.

Run locally

# from molecule-core repo root:
export TEMPLATE_IMAGE=ghcr.io/molecule-ai/workspace-template-hermes:latest
./local-e2e/scripts/run-canary.sh

Exit code 0 = all 4 canaries pass. Non-zero = at least one canary failed and the harness dumped SessionStore state + last 200 log lines from the runtime container into ./local-e2e/artifacts/.

How it integrates into CI

Each template repo's .gitea/workflows/session-continuity-e2e.yml calls run-canary.sh with its own freshly-built TEMPLATE_IMAGE. The template repo's Gitea branch-protection lists session-continuity-e2e (pull_request) as a required context.

Rollout order (deliberate — per feedback_image_promote_is_not_user_live we bake before we cascade):

  1. molecule-ai-workspace-template-hermes — highest-traffic + most recent RFC#600-class fixes — REQUIRED gate
  2. Bake for 5 business days
  3. Cascade to claude-code, langgraph, autogen, openclaw, smolagents, google-adk (one PR per template — see scripts/onboard-template.sh)

Future extensions (out of scope for the initial PR)

  • Multi-session memory consistency (3+ sessions deep)
  • Tool-use canary (workspace seeded with skills/, agent must invoke)
  • Streaming-cancellation canary (mid-stream client disconnect)
  • Cross-runtime A2A peer call (currently covered by e2e-peer-visibility)

Why a thin Python simulator and not the real workspace-server?

workspace-server is a 60+ MB Go binary that requires Postgres, Redis, admin-token wiring, registry plumbing, and a 30+ second cold-boot. None of that touches session-continuity behaviour, which is fully owned by the runtime container's a2a_executor.py. Per CTO directive "separate CI as possible" + the <3 min target, we excise the platform-tenant Go service from the loop and emit identical wire-shape envelopes from a single Python file.

If the simulator diverges from workspace-server wire shape, the gate goes red — fix the simulator to match production. The wire shape is asserted in tests/e2e/test_chat_attachments_e2e.sh and the runtime's workspace/a2a_executor.py:_core_execute.