From db5d11ffca548bd4893f9f04f5f55794e8fbbabe Mon Sep 17 00:00:00 2001
From: Hongming Wang <hongmingwangalt@gmail.com>
Date: Wed, 29 Apr 2026 22:04:57 -0700
Subject: [PATCH] ci: continuous synthetic E2E against staging (#2342)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hard gate Tier 2 item 2 of 4. Cron-driven full-lifecycle E2E that
catches regressions visible only at runtime — schema drift,
deployment-pipeline gaps, vendor outages, env-var rotations,
DNS / CF / Railway side-effects.

Empirical motivation from today:
  - #2345 (A2A v0.2 silent drop) — passed unit tests, broke at JSON-RPC
    parse layer between sender + receiver. Visible only when a sender
    exercises the full path. Now-fixed by PR #2349, but a continuous
    E2E would have surfaced it within 20 min of the regression.
  - RFC #2312 chat upload — landed staging-branch but never reached
    staging tenants because publish-workspace-server-image was main-
    only. Caught by manual dogfooding hours after deploy. Same pattern.

Both classes are invisible to PR-time CI. The continuous gate fires
every 20 min against a real staging tenant and surfaces regressions
within minutes.

Cadence: cron `0,20,40 * * * *` (3x/hour). Offsets the existing
sweep-cf-orphans (:15) and sweep-cf-tunnels (:45) so the three ops
don't burst CF/AWS APIs at the same minute. Concurrency group
prevents overlapping runs if one hangs.

Cost: ~$0.50-1/day GHA + pennies of staging tenant lifecycle.

Reuses existing tests/e2e/test_staging_full_saas.sh — no new harness
to maintain. Bounded at 10 min wall-clock (vs 15 min default) so
stuck runs fail fast rather than holding up the next firing.
Defaults to E2E_RUNTIME=langgraph (fastest cold start; the regression
classes this gate catches don't need hermes-specific paths). Operators
can dispatch with runtime=hermes when they want SDK-native coverage.

Schedule-vs-dispatch hardening: hard-fail on missing
CP_STAGING_ADMIN_API_TOKEN for cron firing (silent-skip would mask
real outages); soft-skip for operator dispatch.

Refs:
  - #2342 hard-gates Tier 2 item 2
  - #2345 (A2A v0.2 fix that this gate would have caught earlier)
  - #2335 / #2337 (deployment-pipeline gaps that this gate also catches)
---
 .github/workflows/continuous-synth-e2e.yml | 160 +++++++++++++++++++++
 1 file changed, 160 insertions(+)
 create mode 100644 .github/workflows/continuous-synth-e2e.yml

diff --git a/.github/workflows/continuous-synth-e2e.yml b/.github/workflows/continuous-synth-e2e.yml
new file mode 100644
index 00000000..e477214a
--- /dev/null
+++ b/.github/workflows/continuous-synth-e2e.yml
@@ -0,0 +1,160 @@
+name: Continuous synthetic E2E (staging)
+
+# Hard gate (#2342): cron-driven full-lifecycle E2E that catches
+# regressions visible only at runtime — schema drift, deployment-pipeline
+# gaps, vendor outages, env-var rotations, DNS / CF / Railway side-effects.
+#
+# Why this gate exists:
+#   PR-time CI catches code-level regressions but not deployment-time or
+#   integration-time ones. Today's empirical data:
+#     • #2345 (A2A v0.2 silent drop) — passed all unit tests, broke at
+#       JSON-RPC parse layer between sender and receiver. Visible only
+#       to a sender exercising the full path.
+#     • RFC #2312 chat upload — landed on staging-branch but never
+#       reached staging tenants because publish-workspace-server-image
+#       was main-only. Caught by manual dogfooding hours after deploy.
+#   Both would have surfaced within 15-20 min of regression if a
+#   continuous synth-E2E was running.
+#
+# Cadence: every 20 min (3x/hour). The script is conservatively
+# bounded at 10 min wall-clock; even on degraded staging it should
+# finish before the next firing. cron-overlap is guarded by the
+# concurrency group below.
+#
+# Cost: ~3 runs/hour × 5-10 min × $0.008/min GHA = ~$0.50-$1/day.
+# Plus a fresh tenant provisioned + torn down each run (Railway +
+# AWS pennies). Negligible.
+#
+# Failure handling: when the run fails, the workflow exits non-zero
+# and GitHub's standard email/notification path fires. Operators
+# can subscribe to this workflow's failure channel for paging-grade
+# alerting.
+
+on:
+  schedule:
+    # Every 20 minutes, on the :00 :20 :40. Offsets the existing :15
+    # sweep-cf-orphans and :45 sweep-cf-tunnels so the three
+    # operations don't all hit Cloudflare/AWS at the same minute.
+    - cron: '0,20,40 * * * *'
+  workflow_dispatch:
+    inputs:
+      runtime:
+        description: "Runtime to provision (langgraph = fastest, default; hermes = slower but covers SDK-native path; claude-code = needs OAUTH token in tenant env)"
+        required: false
+        default: "langgraph"
+        type: string
+      keep_org:
+        description: "Skip teardown for post-mortem debugging (only manual dispatch — never set this for cron runs)"
+        required: false
+        default: false
+        type: boolean
+
+permissions:
+  contents: read
+  # No issue-write here — failures surface as red runs in the workflow
+  # history. If you want auto-issue-on-fail, add a follow-up step that
+  # uses gh issue create gated on `if: failure()`. Keeping the surface
+  # minimal until that's actually wanted.
+
+# Serialize so two firings can never overlap. Cron firing every 20 min
+# but scripts conservatively bounded at 10 min — overlap shouldn't
+# happen in steady state, but if a run hangs we don't want N more
+# stacking up.
+concurrency:
+  group: continuous-synth-e2e
+  cancel-in-progress: false
+
+jobs:
+  synth:
+    name: Synthetic E2E against staging
+    runs-on: ubuntu-latest
+    timeout-minutes: 12
+    env:
+      # langgraph default keeps cold-start under 5 min on staging EC2.
+      # hermes is slower (~7-10 min) and isn't needed for the
+      # regression class this gate exists to catch (deployment-pipeline
+      # + schema-drift + integration). Operators can pick hermes via
+      # workflow_dispatch when they need to exercise the SDK-native
+      # session path.
+      E2E_RUNTIME: ${{ github.event.inputs.runtime || 'langgraph' }}
+      # Bound to 10 min so a stuck provision fails the run instead of
+      # holding up the next cron firing. 15-min default in the script
+      # is for the on-PR full lifecycle where we have more headroom.
+      E2E_PROVISION_TIMEOUT_SECS: '600'
+      # Slug suffix — namespaced "synth-" so these runs are
+      # distinguishable from PR-driven runs in CP admin.
+      E2E_RUN_ID: synth-${{ github.run_id }}
+      # Forced false for cron; respected for manual dispatch
+      E2E_KEEP_ORG: ${{ github.event.inputs.keep_org == 'true' && '1' || '' }}
+      MOLECULE_CP_URL: ${{ vars.STAGING_CP_URL || 'https://staging-api.moleculesai.app' }}
+      MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
+    steps:
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
+
+      - name: Verify required secret present
+        run: |
+          # Schedule-vs-dispatch hardening (mirrors the sweep-cf-* and
+          # redeploy-tenants-on-* workflows): hard-fail on missing secret
+          # for cron firing so a misconfigured-repo doesn't silently
+          # report green while doing nothing. Soft-skip on operator
+          # dispatch — operators can dispatch ad-hoc to verify a fix
+          # without setting up the secret first.
+          if [ -z "${MOLECULE_ADMIN_TOKEN:-}" ]; then
+            if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
+              echo "::warning::CP_STAGING_ADMIN_API_TOKEN not set — synth E2E cannot run"
+              echo "::warning::Set it at Settings → Secrets and Variables → Actions"
+              exit 0
+            fi
+            echo "::error::CP_STAGING_ADMIN_API_TOKEN secret missing — synth E2E cannot run"
+            echo "::error::Set it at Settings → Secrets and Variables → Actions; pull from staging-CP's CP_ADMIN_API_TOKEN env in Railway."
+            exit 1
+          fi
+
+      - name: Install required tools
+        run: |
+          # The script depends on jq + curl (already on ubuntu-latest)
+          # and python3 (likewise). Verify they're all present so we
+          # fail fast on a runner image regression rather than mid-script.
+          for cmd in jq curl python3; do
+            command -v "$cmd" >/dev/null 2>&1 || {
+              echo "::error::required tool '$cmd' not on PATH — runner image regression?"
+              exit 1
+            }
+          done
+
+      - name: Run synthetic E2E
+        # The script handles its own teardown via EXIT trap; even on
+        # failure (timeout, assertion), the org is deprovisioned and
+        # leaks are reported. Exit code propagates from the script.
+        run: |
+          bash tests/e2e/test_staging_full_saas.sh
+
+      - name: Failure summary
+        # Runs only on failure. Adds a job summary so the workflow run
+        # page shows a quick "what happened" instead of forcing readers
+        # to scroll through script output.
+        if: failure()
+        run: |
+          {
+            echo "## Continuous synth E2E failed"
+            echo ""
+            echo "**Run ID:** ${{ github.run_id }}"
+            echo "**Trigger:** ${{ github.event_name }}"
+            echo "**Runtime:** ${E2E_RUNTIME}"
+            echo "**Slug:** synth-${{ github.run_id }}"
+            echo ""
+            echo "### What this means"
+            echo ""
+            echo "Staging just regressed on a path that previously worked. Likely classes:"
+            echo "- Schema mismatch between sender and receiver (#2345 class)"
+            echo "- Deployment-pipeline gap (RFC #2312 / staging-tenant-image-stale class)"
+            echo "- Vendor outage (Cloudflare, Railway, AWS, GHCR)"
+            echo "- Staging-CP env var rotation"
+            echo ""
+            echo "### Next steps"
+            echo ""
+            echo "1. Check the script output above for the assertion that failed"
+            echo "2. If it's a vendor outage, no action needed — next firing in ~20 min"
+            echo "3. If it's a code regression, find the causing PR via \`git log\` against last green run and revert/fix"
+            echo "4. Keep an eye on the next 1-2 firings — flake vs persistent fail differs in priority"
+          } >> "$GITHUB_STEP_SUMMARY"