59d699b61c
Lint shellcheck (arm64 pilot) / shellcheck-arm64 (pilot) (pull_request) Waiting to run
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 7s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 24s
E2E API Smoke Test / detect-changes (pull_request) Successful in 14s
E2E Chat / detect-changes (pull_request) Successful in 11s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 9s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 5s
Lint no tenant GITEA or GITHUB token write / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 12s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 9s
gate-check-v3 / gate-check (pull_request) Successful in 7s
qa-review / approved (pull_request) Failing after 7s
security-review / approved (pull_request) Failing after 6s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 5s
sop-checklist / review-refire (pull_request) Has been skipped
sop-tier-check / tier-check (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m3s
CI / Platform (Go) (pull_request) Successful in 5m45s
CI / Python Lint & Test (pull_request) Successful in 7m0s
CI / Canvas (Next.js) (pull_request) Successful in 7m34s
CI / all-required (pull_request) Successful in 7m14s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 5s
E2E Chat / E2E Chat (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 6s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Adds a self-contained docker-compose harness in local-e2e/ that gates
RFC#600-class template changes BEFORE customer canary. Implements the 4
canonical canaries:
1. 2-turn name continuity — SessionStore key derivation
2. File-only message — no caption drop-to-empty-prompt regress
3. File + prompt (multimodal) — multimodal happy path
4. Cross-session memory — explicit memory tool, distinct context_ids
Architecture is deliberately lean per CTO "separate CI as possible":
local-e2e/
docker-compose.yml # runtime + cp_sim ONLY (no platform Go, no pg)
cp_sim/ # ~250 LoC Python A2A wire-shape emitter
cp_sim/canary/ # 4 canary scenarios + layer-isolation probes
scripts/run-canary.sh # one-shot orchestration (target <3 min)
scripts/onboard-template.sh # gitops helper for cascade
templates/session-continuity-e2e.yml # canonical workflow shim
Rationale for a Python tenant-CP simulator (not the real workspace-server):
SessionStore behaviour is fully owned by workspace/a2a_executor.py +
executor_helpers.py — the Go platform service doesn't touch session
continuity. Excising it gets the harness to <3 min cold-boot on
docker-host runners and keeps the surface small enough to debug fast.
The simulator emits the byte-identical JSON-RPC message/send envelope
that workspace-server POSTs (cross-checked against
tests/e2e/test_chat_attachments_e2e.sh and workspace/a2a_executor.py
:_core_execute).
Per feedback_no_single_source_of_truth: the harness IS the canonical
session-continuity validator across templates. Per-template unit tests
keep covering their own guard logic.
Per feedback_image_promote_is_not_user_live + feedback_verify_actual_
endstate_not_ack_follow_sop: every canary asserts at the running-
container layer; artifacts dump SessionStore state + runtime logs on
failure for post-mortem.
Rollout (deliberate sequencing, per task #342):
1. THIS PR — lands harness in molecule-core. NOT yet wired to any
template repo.
2. Companion PR in molecule-ai-workspace-template-hermes — adds
.gitea/workflows/session-continuity-e2e.yml. NOT required yet.
3. Bake on hermes for ≥5 business days.
4. Cascade to remaining 6 templates via onboard-template.sh.
5. Per-template BP flip — add "session-continuity-e2e (pull_request)"
to status_check_contexts on each repo, hermes first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
105 lines
4.3 KiB
Markdown
105 lines
4.3 KiB
Markdown
# local-e2e — session-continuity canary harness
|
|
|
|
Self-contained Docker-Compose harness that gates RFC#600-class template
|
|
changes (session continuity, file-only messages, multimodal prompts,
|
|
cross-session memory) **before** they reach customer canary.
|
|
|
|
Per CTO standing directive "fully tested + separate CI": this is a
|
|
dedicated, *fast* (target <3 min), *small-surface* harness that uses a
|
|
Python tenant-CP simulator (not the full `workspace-server` Go service)
|
|
to exercise the runtime image end-to-end against canonical canary turns.
|
|
|
|
See [`feedback_no_single_source_of_truth`] — the harness IS the canonical
|
|
session-continuity validator. Per-runtime unit tests still cover their
|
|
own guard logic; the harness covers the live conversational behaviour
|
|
that those unit tests cannot prove.
|
|
|
|
See [`feedback_image_promote_is_not_user_live`] — every assertion reads
|
|
state back from the *running container*, never from a publish-pipeline
|
|
ack.
|
|
|
|
## What it tests (the 4 canaries)
|
|
|
|
| # | Scenario | Asserts |
|
|
|---|----------|---------|
|
|
| 1 | 2-turn name canary | turn 2 reply contains "Hongming" → SessionStore continuity |
|
|
| 2 | File-only message (no caption) | NOT "(empty prompt — nothing to do)" + reply references filename or asks for clarification |
|
|
| 3 | File + caption ("summarize this") | reply addresses attachment + caption |
|
|
| 4 | Cross-session memory recall | new session pulls "blue" via memory tool |
|
|
|
|
Each scenario re-uses the same A2A wire-shape that the production
|
|
`workspace-server` POSTs to runtime `:8000` (canvas-thread-id semantics
|
|
via `context_id`).
|
|
|
|
## Architecture
|
|
|
|
```
|
|
local-e2e/
|
|
docker-compose.yml # runtime under test + cp_sim
|
|
cp_sim/ # ≈300 LoC Python A2A poster + file uploader
|
|
cp_sim.py
|
|
Dockerfile
|
|
requirements.txt
|
|
canary/
|
|
conftest.py
|
|
test_session_continuity.py # 4 canary scenarios
|
|
test_layer_diagnostics.py # SessionStore state probe + key derivation
|
|
scripts/
|
|
run-canary.sh # one-shot orchestration entrypoint
|
|
```
|
|
|
|
The CP simulator emits the **exact** JSON-RPC `message/send` envelope
|
|
that `workspace-server` produces (verified against
|
|
`tests/e2e/test_chat_attachments_e2e.sh`). No Go service is in the loop —
|
|
this keeps the harness lean per the CTO directive.
|
|
|
|
## Run locally
|
|
|
|
```bash
|
|
# from molecule-core repo root:
|
|
export TEMPLATE_IMAGE=ghcr.io/molecule-ai/workspace-template-hermes:latest
|
|
./local-e2e/scripts/run-canary.sh
|
|
```
|
|
|
|
Exit code 0 = all 4 canaries pass. Non-zero = at least one canary failed
|
|
and the harness dumped SessionStore state + last 200 log lines from the
|
|
runtime container into `./local-e2e/artifacts/`.
|
|
|
|
## How it integrates into CI
|
|
|
|
Each template repo's `.gitea/workflows/session-continuity-e2e.yml` calls
|
|
`run-canary.sh` with its own freshly-built `TEMPLATE_IMAGE`. The
|
|
template repo's Gitea branch-protection lists
|
|
`session-continuity-e2e (pull_request)` as a required context.
|
|
|
|
Rollout order (deliberate — per `feedback_image_promote_is_not_user_live`
|
|
we bake before we cascade):
|
|
|
|
1. `molecule-ai-workspace-template-hermes` — highest-traffic + most
|
|
recent RFC#600-class fixes — REQUIRED gate
|
|
2. Bake for 5 business days
|
|
3. Cascade to claude-code, langgraph, autogen, openclaw, smolagents,
|
|
google-adk (one PR per template — see `scripts/onboard-template.sh`)
|
|
|
|
## Future extensions (out of scope for the initial PR)
|
|
|
|
- Multi-session memory consistency (3+ sessions deep)
|
|
- Tool-use canary (workspace seeded with skills/, agent must invoke)
|
|
- Streaming-cancellation canary (mid-stream client disconnect)
|
|
- Cross-runtime A2A peer call (currently covered by `e2e-peer-visibility`)
|
|
|
|
## Why a thin Python simulator and not the real `workspace-server`?
|
|
|
|
`workspace-server` is a 60+ MB Go binary that requires Postgres, Redis,
|
|
admin-token wiring, registry plumbing, and a 30+ second cold-boot. None
|
|
of that touches session-continuity behaviour, which is fully owned by
|
|
the runtime container's `a2a_executor.py`. Per CTO directive "separate
|
|
CI as possible" + the <3 min target, we excise the platform-tenant Go
|
|
service from the loop and emit identical wire-shape envelopes from a
|
|
single Python file.
|
|
|
|
If the simulator diverges from `workspace-server` wire shape, the gate
|
|
goes red — fix the simulator to match production. The wire shape is
|
|
asserted in `tests/e2e/test_chat_attachments_e2e.sh` and the runtime's
|
|
`workspace/a2a_executor.py:_core_execute`.
|