Merge remote-tracking branch 'origin/main' into canvas-followup
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 17s
Harness Replays / detect-changes (pull_request) Successful in 14s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 13s
sop-tier-check / tier-check (pull_request) Successful in 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 42s
E2E API Smoke Test / detect-changes (pull_request) Successful in 47s
CI / Detect changes (pull_request) Successful in 50s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 48s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 44s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 10s
CI / Platform (Go) (pull_request) Successful in 9s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 9s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 9s
Harness Replays / Harness Replays (pull_request) Failing after 1m48s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 12s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 9m0s
CI / Canvas (Next.js) (pull_request) Failing after 10m53s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
audit-force-merge / audit (pull_request) Has been skipped

Molecule AI · core-fe 2026-05-11 11:25:36 +00:00
commit 40ecd0cbab
15 changed files with 183 additions and 133 deletions

View File

@@ -56,7 +56,7 @@ on:
 # 2. Avoid colliding with the existing :15 sweep-cf-orphans
 # and :45 sweep-cf-tunnels — both hit the CF API and we
 # don't want to fight for rate-limit tokens.
-# 3. Avoid the :30 heavy slot (canary-staging /30, sweep-aws-
+# 3. Avoid the :30 heavy slot (staging-smoke /30, sweep-aws-
 # secrets, sweep-stale-e2e-orgs every :15) — multiple
 # overlapping cron registrations on the same minute is part
 # of what GH drops under load.
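For orientation, the minute-slot map this comment is steering around, rendered as cron lines. The minute assignments are read off the comment above; everything else about those workflows, and the slot this workflow itself finally picked, is outside the hunk:

# */15 * * * *   → sweep-stale-e2e-orgs (fires :00/:15/:30/:45)
# 15 * * * *     → sweep-cf-orphans (CF API rate-limit tokens)
# 45 * * * *     → sweep-cf-tunnels (CF API rate-limit tokens)
# */30 * * * *   → staging-smoke (lands on the :30 heavy slot)
# 30 * * * *     → sweep-aws-secrets (also the :30 heavy slot)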

View File

@@ -95,7 +95,7 @@ jobs:
 # ANTHROPIC_BASE_URL to api.minimax.io/anthropic and reads
 # MINIMAX_API_KEY at boot — separate billing account so an
 # OpenAI quota collapse no longer wedges the gate. Mirrors the
-# canary-staging.yml + continuous-synth-e2e.yml migrations.
+# staging-smoke.yml + continuous-synth-e2e.yml migrations.
 E2E_MINIMAX_API_KEY: ${{ secrets.MOLECULE_STAGING_MINIMAX_API_KEY }}
 # Direct-Anthropic alternative for operators who don't want to
 # set up a MiniMax account (priority below MiniMax — first
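A minimal sketch of what the `minimax` provider path described above amounts to at boot. Only the base URL and the key name are stated in the comment; how the template exports them internally is an assumption here:

# Sketch, assuming the provider simply re-points the Anthropic-compatible
# client at MiniMax and picks up the key the workflow plumbed through:
export ANTHROPIC_BASE_URL="https://api.minimax.io/anthropic"
export MINIMAX_API_KEY="${E2E_MINIMAX_API_KEY:?MiniMax key not plumbed through}"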

View File

@@ -11,11 +11,11 @@ name: E2E Staging Sanity (leak-detection self-check)
 # - `continue-on-error: true` on the job (RFC §1 contract).
 #
 # Periodic assertion that the teardown safety nets in e2e-staging-saas
-# and canary-staging actually work. Runs the E2E harness with
-# E2E_INTENTIONAL_FAILURE=1, which poisons the tenant admin token after
-# the org is provisioned. The workspace-provision step then fails, the
-# script exits non-zero, and the EXIT trap + workflow always()-step
-# must still tear down cleanly.
+# and staging-smoke (formerly canary-staging) actually work. Runs the
+# E2E harness with E2E_INTENTIONAL_FAILURE=1, which poisons the tenant
+# admin token after the org is provisioned. The workspace-provision
+# step then fails, the script exits non-zero, and the EXIT trap +
+# workflow always()-step must still tear down cleanly.
 on:
 schedule:
@@ -43,7 +43,7 @@ jobs:
 env:
 MOLECULE_CP_URL: https://staging-api.moleculesai.app
 MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
-E2E_MODE: canary
+E2E_MODE: smoke
 E2E_RUNTIME: hermes
 E2E_RUN_ID: "sanity-${{ github.run_id }}"
 E2E_INTENTIONAL_FAILURE: "1"
@@ -127,8 +127,14 @@ jobs:
 import json, sys
 d = json.load(sys.stdin)
 today = __import__('datetime').date.today().strftime('%Y%m%d')
+# Match both the new e2e-smoke- prefix (post-2026-05-11 rename)
+# and the legacy e2e-canary- prefix for one rollout cycle so
+# any in-flight org provisioned under the old prefix on an
+# older runner checkout still gets cleaned up. Remove the
+# canary fallback after one week of no-old-prefix observations.
+prefixes = (f'e2e-smoke-{today}-sanity-', f'e2e-canary-{today}-sanity-')
 candidates = [o['slug'] for o in d.get('orgs', [])
-if o.get('slug','').startswith(f'e2e-canary-{today}-sanity-')
+if any(o.get('slug','').startswith(p) for p in prefixes)
 and o.get('status') not in ('purged',)]
 print('\n'.join(candidates))
 " 2>/dev/null)

View File

@@ -11,7 +11,7 @@ name: publish-canvas-image
 # - `continue-on-error: true` on each job (RFC §1 contract).
 # - **Open question for review**: this workflow pushes the canvas
 # image to `ghcr.io`. GHCR was retired during the 2026-05-06
-# Gitea migration in favor of ECR (per canary-verify.yml header
+# Gitea migration in favor of ECR (per staging-verify.yml header
 # notes). The image may not be consumable post-migration. Two
 # options for follow-up: (a) retarget to
 # `153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/canvas`,

View File

@@ -32,7 +32,7 @@ name: redeploy-tenants-on-main
 #
 # Registry: ECR (153263036946.dkr.ecr.us-east-2.amazonaws.com/
 # molecule-ai/platform-tenant). GHCR was retired 2026-05-07 during the
-# Gitea suspension migration. The canary-verify.yml promote step now
+# Gitea suspension migration. The staging-verify.yml promote step now
 # uses the same redeploy-fleet endpoint (fixes the silent-GHCR gap).
 #
 # Runtime ordering:
@@ -104,7 +104,7 @@ jobs:
 # `staging-<sha>` to roll back to a known-good build.
 # 2. Default → `staging-<short_head_sha>`. The just-published
 # digest. Bypasses the `:latest` retag path that's currently
-# dead (canary-verify soft-skips without canary fleet, so
+# dead (staging-verify soft-skips without canary fleet, so
 # the only thing retagging `:latest` today is the manual
 # promote-latest.yml — last run 2026-04-28). Auto-trigger
 # from workflow_run uses workflow_run.head_sha; manual
@@ -359,7 +359,7 @@ jobs:
 # Belt-and-suspenders sanity floor: same logic as the staging
 # variant — see that file's comment for the full rationale.
-# Floor only applies when fleet >= 4; below that, canary-verify
+# Floor only applies when fleet >= 4; below that, staging-verify
 # is the actual gate.
 TOTAL_VERIFIED=${#SLUGS[@]}
 if [ $TOTAL_VERIFIED -ge 4 ] && [ $UNREACHABLE_COUNT -gt $((TOTAL_VERIFIED / 2)) ]; then

View File

@@ -21,7 +21,7 @@ name: redeploy-tenants-on-staging
 #
 # Mirror of redeploy-tenants-on-main.yml, with the staging-CP host and
 # the :staging-latest tag. Sister workflow exists for prod (rolls
-# :latest after canary-verify). Both share the same shape — just
+# :latest after staging-verify). Both share the same shape — just
 # different CP_URL + target_tag + admin token secret.
 #
 # Why this workflow exists: publish-workspace-server-image now builds
@@ -336,7 +336,7 @@ jobs:
 # crashes on startup), not a teardown race. Hard-fail.
 #
 # Floor only applies when TOTAL_VERIFIED >= 4 — below that, the
-# canary-verify step is the actual gate for "all tenants down"
+# staging-verify step is the actual gate for "all tenants down"
 # detection (it runs against the canary first and aborts the
 # rollout if the canary fails to come up). Without the >=4 gate,
 # a 1-tenant fleet (e.g. a single ephemeral e2e-* tenant on a

View File

@@ -1,6 +1,8 @@
-name: Canary — staging SaaS smoke (every 30 min)
-# Ported from .github/workflows/canary-staging.yml on 2026-05-11 per RFC
+name: Staging SaaS smoke (every 30 min)
+# Renamed from canary-staging.yml on 2026-05-11 per Hongming directive
+# ("canary naming changed to staging for all"). Originally ported from
+# .github/workflows/canary-staging.yml on 2026-05-11 per RFC
 # internal#219 §1 sweep. Differences from the GitHub version:
 # - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
 # per feedback_gitea_workflow_dispatch_inputs_unsupported).
@@ -21,21 +23,21 @@ name: Canary — staging SaaS smoke (every 30 min)
 # catches drift in the 30-min window between those runs (AMI health, CF
 # cert rotation, WorkOS session stability, etc.).
 #
-# Lean mode: E2E_MODE=canary skips the child workspace + HMA memory +
+# Lean mode: E2E_MODE=smoke skips the child workspace + HMA memory +
 # peers/activity checks. One parent workspace + one A2A turn is enough
 # to signal "SaaS stack end-to-end is alive."
 on:
 schedule:
 # Every 30 min. Cron on GitHub-hosted runners has a known drift of
-# a few minutes under load — that's fine for a canary.
+# a few minutes under load — that's fine for a smoke check.
 - cron: '*/30 * * * *'
 # Serialise with the full-SaaS workflow so they don't contend for the
 # same org-create quota on staging. Different group key from
-# e2e-staging-saas since we don't mind queueing canaries behind one
-# full run, but two canaries SHOULD queue against each other.
+# e2e-staging-saas since we don't mind queueing smoke runs behind one
+# full run, but two smoke runs SHOULD queue against each other.
 concurrency:
-group: canary-staging
+group: staging-smoke
 cancel-in-progress: false
 permissions:
@@ -47,8 +49,8 @@ env:
 GITHUB_SERVER_URL: https://git.moleculesai.app
 jobs:
-canary:
-name: Canary smoke
+smoke:
+name: Staging SaaS smoke
 runs-on: ubuntu-latest
 # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
 continue-on-error: true
@@ -56,23 +58,23 @@ jobs:
 # tests/e2e/test_staging_full_saas.sh (#2107). Without the buffer
 # the job is killed at the wall-clock 15:00 mark BEFORE the bash
 # `fail` + diagnostic burst can fire, leaving every cancellation
-# silent. Sibling staging E2E jobs run at 20-45 min — keeping
-# canary tighter than them so a true wedge still surfaces here
+# silent. Sibling staging E2E jobs run at 20-45 min — keeping the
+# smoke tighter than them so a true wedge still surfaces here
 # first.
 timeout-minutes: 25
 env:
 MOLECULE_CP_URL: https://staging-api.moleculesai.app
 MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}
-# MiniMax is the canary's PRIMARY LLM auth path post-2026-05-04.
+# MiniMax is the smoke's PRIMARY LLM auth path post-2026-05-04.
 # Switched from hermes+OpenAI after #2578 (the staging OpenAI key
 # account went over quota and stayed dead for 36+ hours, taking
-# the canary red the entire time). claude-code template's
+# the smoke red the entire time). claude-code template's
 # `minimax` provider routes ANTHROPIC_BASE_URL to
 # api.minimax.io/anthropic and reads MINIMAX_API_KEY at boot —
 # ~5-10x cheaper per token than gpt-4.1-mini AND on a separate
 # billing account, so OpenAI quota collapse no longer wedges the
-# canary. Mirrors the migration continuous-synth-e2e.yml made on
+# smoke. Mirrors the migration continuous-synth-e2e.yml made on
 # 2026-05-03 (#265) for the same reason. tests/e2e/test_staging_
 # full_saas.sh branches SECRETS_JSON on which key is present —
 # MiniMax wins when set.
@@ -86,16 +88,16 @@ jobs:
 # E2E_RUNTIME=hermes overridden via workflow_dispatch can still
 # exercise the OpenAI path without re-editing the workflow.
 E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_API_KEY }}
-E2E_MODE: canary
+E2E_MODE: smoke
 E2E_RUNTIME: claude-code
-# Pin the canary to a specific MiniMax model rather than relying
+# Pin the smoke to a specific MiniMax model rather than relying
 # on the per-runtime default (which could resolve to "sonnet" →
 # direct Anthropic and defeat the cost saving). M2.7-highspeed
 # is "Token Plan only" but cheap-per-token and fast.
 E2E_MODEL_SLUG: MiniMax-M2.7-highspeed
-E2E_RUN_ID: "canary-${{ github.run_id }}"
+E2E_RUN_ID: "smoke-${{ github.run_id }}"
 # Debug-only: when an operator dispatches with keep_on_failure=true,
-# the canary script's E2E_KEEP_ORG=1 path skips teardown so the
+# the smoke script's E2E_KEEP_ORG=1 path skips teardown so the
 # tenant org + EC2 stay alive for SSM-based log capture. Cron runs
 # never set this (the input only exists on workflow_dispatch) so
 # unattended cron always tears down. See molecule-core#129
@@ -119,7 +121,7 @@ jobs:
 # langgraph (operator-dispatched only) use OpenAI. Hard-fail
 # rather than soft-skip per the lesson from synth E2E #2578:
 # an empty key silently falls through to the wrong
-# SECRETS_JSON branch and the canary fails 5 min later with
+# SECRETS_JSON branch and the smoke fails 5 min later with
 # a confusing auth error instead of the clean "secret
 # missing" message at the top.
 case "${E2E_RUNTIME}" in
@@ -155,8 +157,8 @@ jobs:
 fi
 echo "LLM key present ✓ (runtime=${E2E_RUNTIME}, key=${required_secret_name}, len=${#required_secret_value})"
-- name: Canary run
-id: canary
+- name: Smoke run
+id: smoke
 run: bash tests/e2e/test_staging_full_saas.sh
 # Alerting: open a sticky issue on the FIRST failure; comment on
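The lookup that populates EXISTING in the next hunk sits outside the diff; a hedged sketch of one way it could work, assuming Gitea's issue-list endpoint with a `q=` title search (the endpoint and parameters exist in the Gitea API; the workflow's actual lookup may differ):

EXISTING=$(curl -fsS -H "Authorization: token $GITEA_TOKEN" \
  "${API}/repos/${REPO}/issues?state=open&type=issues&q=$(jq -rn --arg t "$TITLE" '$t|@uri')" \
  | jq -r --arg t "$TITLE" '[.[] | select(.title == $t)][0].number // empty')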
@@ -184,6 +186,9 @@ jobs:
 run: |
 set -euo pipefail
 API="${SERVER_URL%/}/api/v1"
+# Title kept stable across the canary-staging.yml → staging-smoke.yml
+# rename (2026-05-11) so any open alert issue from the old name
+# still title-matches and auto-closes on the next green run.
 TITLE="Canary failing: staging SaaS smoke"
 RUN_URL="${SERVER_URL}/${REPO}/actions/runs/${RUN_ID}"
@@ -194,18 +199,18 @@ jobs:
 if [ -n "$EXISTING" ]; then
 curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
 "${API}/repos/${REPO}/issues/${EXISTING}/comments" \
--d "$(jq -nc --arg run "$RUN_URL" '{body: ("Canary still failing. " + $run)}')" >/dev/null
+-d "$(jq -nc --arg run "$RUN_URL" '{body: ("Smoke still failing. " + $run)}')" >/dev/null
 echo "Commented on existing issue #${EXISTING}"
 else
 NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
 BODY=$(jq -nc --arg t "$TITLE" --arg now "$NOW" --arg run "$RUN_URL" \
-'{title: $t, body: ("Canary run failed at " + $now + ".\n\nRun: " + $run + "\n\nThis issue auto-closes on the next green canary run. Consecutive failures add a comment here rather than a new issue.")}')
+'{title: $t, body: ("Smoke run failed at " + $now + ".\n\nRun: " + $run + "\n\nThis issue auto-closes on the next green smoke run. Consecutive failures add a comment here rather than a new issue.")}')
 curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
 "${API}/repos/${REPO}/issues" -d "$BODY" >/dev/null
-echo "Opened canary failure issue (first red)"
+echo "Opened smoke failure issue (first red)"
 fi
-- name: Auto-close canary issue on success (Gitea API)
+- name: Auto-close smoke issue on success (Gitea API)
 if: success()
 env:
 GITEA_TOKEN: ${{ secrets.GITHUB_TOKEN }}
@@ -215,6 +220,8 @@ jobs:
 run: |
 set -euo pipefail
 API="${SERVER_URL%/}/api/v1"
+# Title kept stable across the canary-staging.yml → staging-smoke.yml
+# rename so open alert issues from the old name still match.
 TITLE="Canary failing: staging SaaS smoke"
 NUMS=$(curl -fsS -H "Authorization: token $GITEA_TOKEN" \
@@ -225,10 +232,10 @@ jobs:
 for N in $NUMS; do
 curl -fsS -X POST -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
 "${API}/repos/${REPO}/issues/${N}/comments" \
--d "$(jq -nc --arg now "$NOW" '{body: ("Canary recovered at " + $now + ". Closing.")}')" >/dev/null
+-d "$(jq -nc --arg now "$NOW" '{body: ("Smoke recovered at " + $now + ". Closing.")}')" >/dev/null
 curl -fsS -X PATCH -H "Authorization: token $GITEA_TOKEN" -H "Content-Type: application/json" \
 "${API}/repos/${REPO}/issues/${N}" -d '{"state":"closed"}' >/dev/null
-echo "Closed recovered canary issue #${N}"
+echo "Closed recovered smoke issue #${N}"
 done
 - name: Teardown safety net
@@ -238,24 +245,23 @@ jobs:
 run: |
 set +e
 # Slug prefix matches what test_staging_full_saas.sh emits
-# in canary mode:
-# SLUG="e2e-canary-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
-# Earlier this was `e2e-{today}-canary-` — that was the
-# full-mode pattern (date FIRST, mode SECOND); canary slugs
-# have mode FIRST, date SECOND. The mismatch silently
-# never matched, leaving every cancelled-canary EC2 alive
-# until the once-an-hour sweep eventually caught it
-# (incident 2026-04-26 21:03Z: 1h25m EC2 leak before manual
-# cleanup; same gap on three earlier cancellations today).
+# in smoke mode:
+# SLUG="e2e-smoke-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
+# Earlier (pre-2026-05-11 canary→staging rename) the prefix was
+# `e2e-canary-`; both prefixes are matched here for one
+# release cycle so cleanup still catches any in-flight org
+# provisioned under the old prefix on an older runner that
+# hasn't picked up the renamed script. Remove the canary
+# fallback after one week of no-old-prefix observations.
 orgs=$(curl -sS "$MOLECULE_CP_URL/cp/admin/orgs" \
 -H "Authorization: Bearer $ADMIN_TOKEN" 2>/dev/null \
 | python3 -c "
 import json, sys, os, datetime
 run_id = os.environ.get('GITHUB_RUN_ID', '')
 d = json.load(sys.stdin)
-# Scope to slugs from THIS canary run when GITHUB_RUN_ID is
-# available; the canary workflow sets E2E_RUN_ID='canary-\${run_id}'
-# so the slug suffix is '-canary-\${run_id}-...'. Mirrors the
+# Scope to slugs from THIS smoke run when GITHUB_RUN_ID is
+# available; the smoke workflow sets E2E_RUN_ID='smoke-\${run_id}'
+# so the slug suffix is '-smoke-\${run_id}-...'. Mirrors the
 # full-mode safety net's per-run scoping (e2e-staging-saas.yml)
 # added after the 2026-04-21 cross-run cleanup incident.
 # Sweep both today AND yesterday's UTC dates so a run that
@@ -265,9 +271,11 @@ jobs:
 yesterday = today - datetime.timedelta(days=1)
 dates = (today.strftime('%Y%m%d'), yesterday.strftime('%Y%m%d'))
 if run_id:
-prefixes = tuple(f'e2e-canary-{d}-canary-{run_id}' for d in dates)
+prefixes = tuple(f'e2e-smoke-{d}-smoke-{run_id}' for d in dates) \
++ tuple(f'e2e-canary-{d}-canary-{run_id}' for d in dates)
 else:
-prefixes = tuple(f'e2e-canary-{d}-' for d in dates)
+prefixes = tuple(f'e2e-smoke-{d}-' for d in dates) \
++ tuple(f'e2e-canary-{d}-' for d in dates)
 candidates = [o['slug'] for o in d.get('orgs', [])
 if any(o.get('slug','').startswith(p) for p in prefixes)
 and o.get('status') not in ('purged',)]
@@ -280,8 +288,8 @@ jobs:
 # stale sweep caught it (up to 2h later). Now we capture the
 # response code and surface non-2xx as a workflow warning, so
 # the run page shows which slug leaked. We still don't `exit 1`
-# on cleanup failure — a single-canary cleanup miss shouldn't
-# fail-flag the canary itself when the actual smoke check
+# on cleanup failure — a single-smoke cleanup miss shouldn't
+# fail-flag the smoke itself when the actual smoke check
 # passed. The sweep-stale-e2e-orgs cron (now every 15 min,
 # 30-min threshold) is the safety net for whatever slips past.
 # See molecule-controlplane#420.
@@ -290,21 +298,21 @@ jobs:
 # Tempfile-routed -w + set +e/-e prevents curl-exit-code
 # pollution of the captured status (lint-curl-status-capture.yml).
 set +e
-curl -sS -o /tmp/canary-cleanup.out -w "%{http_code}" \
+curl -sS -o /tmp/smoke-cleanup.out -w "%{http_code}" \
 -X DELETE "$MOLECULE_CP_URL/cp/admin/tenants/$slug" \
 -H "Authorization: Bearer $ADMIN_TOKEN" \
 -H "Content-Type: application/json" \
--d "{\"confirm\":\"$slug\"}" >/tmp/canary-cleanup.code
+-d "{\"confirm\":\"$slug\"}" >/tmp/smoke-cleanup.code
 set -e
-code=$(cat /tmp/canary-cleanup.code 2>/dev/null || echo "000")
+code=$(cat /tmp/smoke-cleanup.code 2>/dev/null || echo "000")
 if [ "$code" = "200" ] || [ "$code" = "204" ]; then
 echo "[teardown] deleted $slug (HTTP $code)"
 else
-echo "::warning::canary teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/canary-cleanup.out 2>/dev/null)"
+echo "::warning::smoke teardown for $slug returned HTTP $code — sweep-stale-e2e-orgs will catch it within ~45 min. Body: $(head -c 300 /tmp/smoke-cleanup.out 2>/dev/null)"
 leaks+=("$slug")
 fi
 done
 if [ ${#leaks[@]} -gt 0 ]; then
-echo "::warning::canary teardown left ${#leaks[@]} leak(s): ${leaks[*]}"
+echo "::warning::smoke teardown left ${#leaks[@]} leak(s): ${leaks[*]}"
 fi
 exit 0
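The tempfile-routed status-capture pattern above, distilled to its minimal shape ($URL is a stand-in):

set +e
# Body and status code land in separate files, so a curl transport
# failure can't splice error text into the captured status.
curl -sS -o /tmp/resp.body -w "%{http_code}" "$URL" >/tmp/resp.code
set -e
code=$(cat /tmp/resp.code 2>/dev/null || echo "000")
# $code now holds the HTTP status (or "000" if curl never connected);
# diagnostics live in /tmp/resp.body, never mixed into $code.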

View File

@@ -1,6 +1,8 @@
-name: canary-verify
-# Ported from .github/workflows/canary-verify.yml on 2026-05-11 per RFC
+name: Staging verify
+# Renamed from canary-verify.yml on 2026-05-11 per Hongming directive
+# ("canary naming changed to staging for all"). Originally ported from
+# .github/workflows/canary-verify.yml on 2026-05-11 per RFC
 # internal#219 §1 sweep. Differences from the GitHub version:
 # - Dropped `workflow_dispatch.inputs` (Gitea 1.22.6 parser rejects them
 # per feedback_gitea_workflow_dispatch_inputs_unsupported).
@@ -23,13 +25,22 @@ name: canary-verify
 # digest. On red, :latest stays on the prior known-good digest and
 # prod is untouched.
 #
+# Terminology note (2026-05-11): The deployment STRATEGY here is still
+# called "canary release" (a small subset of tenants gets the new image
+# first, the rest follow on green). The "canary" word stays for the
+# pre-fan-out cohort concept (see docs/architecture/canary-release.md
+# and CANARY_SLUG in redeploy-tenants-on-*.yml). What changed is the
+# FILE NAME and the SECRETS feeding this workflow — both are renamed
+# to drop the redundant "canary-" prefix that conflated workflow
+# identity with deployment strategy.
+#
 # Registry note (2026-05-10): This workflow previously used GHCR
 # (ghcr.io/molecule-ai/platform-tenant) — that registry was retired
 # during the 2026-05-06 Gitea suspension migration when publish-
 # workspace-server-image.yml switched to the operator's ECR org
 # (153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/
 # platform-tenant). The GHCR → ECR migration was never applied to
-# this file, so canary-verify was silently smoke-testing the stale
+# this file, so this workflow was silently smoke-testing the stale
 # GHCR image while the actual staging/prod tenants ran the ECR image.
 # Result: smoke tests could not catch a broken ECR build. Fix:
 # - Wait step: reads SHA from running canary /health (tenant-
@@ -43,8 +54,9 @@ name: canary-verify
 # to ECR on staging and main merges.
 # - Canary tenants are configured to pull :staging-<sha> from ECR
 # (TENANT_IMAGE env set to the ECR :staging-<sha> tag).
-# - Repo secrets CANARY_TENANT_URLS / CANARY_ADMIN_TOKENS /
-# CANARY_CP_SHARED_SECRET are populated.
+# - Repo secrets MOLECULE_STAGING_TENANT_URLS /
+# MOLECULE_STAGING_ADMIN_TOKENS / MOLECULE_STAGING_CP_SHARED_SECRET
+# are populated.
 on:
 workflow_run:
@@ -65,7 +77,7 @@ env:
 GITHUB_SERVER_URL: https://git.moleculesai.app
 jobs:
-canary-smoke:
+staging-smoke:
 # Skip when the upstream workflow failed — no image to test against.
 # workflow_dispatch trigger dropped in this Gitea port; only the
 # workflow_run path remains.
@@ -97,15 +109,15 @@ jobs:
 # other registry — the canary is telling us what it's actually
 # running, which is the ground truth for smoke testing.
 env:
-CANARY_TENANT_URLS: ${{ secrets.CANARY_TENANT_URLS }}
+MOLECULE_STAGING_TENANT_URLS: ${{ secrets.MOLECULE_STAGING_TENANT_URLS }}
 EXPECTED_SHA: ${{ steps.compute.outputs.sha }}
 run: |
-if [ -z "$CANARY_TENANT_URLS" ]; then
+if [ -z "$MOLECULE_STAGING_TENANT_URLS" ]; then
 echo "No canary URLs configured — falling back to 60s wait"
 sleep 60
 exit 0
 fi
-IFS=',' read -ra URLS <<< "$CANARY_TENANT_URLS"
+IFS=',' read -ra URLS <<< "$MOLECULE_STAGING_TENANT_URLS"
 MAX_WAIT=420 # 7 minutes
 INTERVAL=30
 ELAPSED=0
@@ -129,7 +141,7 @@ jobs:
 done
 echo "Timeout after ${MAX_WAIT}s — proceeding anyway (smoke suite will validate)"
-- name: Run canary smoke suite
+- name: Run staging smoke suite
 id: smoke
 # Graceful-skip when no canary fleet is configured (Phase 2 not yet
 # stood up — see molecule-controlplane/docs/canary-tenants.md).
@@ -138,29 +150,29 @@ jobs:
 # promote-latest.yml is the release gate while canary is absent.
 # Once the fleet is real: delete the early-exit branch.
 env:
-CANARY_TENANT_URLS: ${{ secrets.CANARY_TENANT_URLS }}
-CANARY_ADMIN_TOKENS: ${{ secrets.CANARY_ADMIN_TOKENS }}
-CANARY_CP_BASE_URL: https://staging-api.moleculesai.app
-CANARY_CP_SHARED_SECRET: ${{ secrets.CANARY_CP_SHARED_SECRET }}
+MOLECULE_STAGING_TENANT_URLS: ${{ secrets.MOLECULE_STAGING_TENANT_URLS }}
+MOLECULE_STAGING_ADMIN_TOKENS: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKENS }}
+MOLECULE_STAGING_CP_BASE_URL: https://staging-api.moleculesai.app
+MOLECULE_STAGING_CP_SHARED_SECRET: ${{ secrets.MOLECULE_STAGING_CP_SHARED_SECRET }}
 run: |
 set -euo pipefail
-if [ -z "${CANARY_TENANT_URLS:-}" ] \
-|| [ -z "${CANARY_ADMIN_TOKENS:-}" ] \
-|| [ -z "${CANARY_CP_SHARED_SECRET:-}" ]; then
+if [ -z "${MOLECULE_STAGING_TENANT_URLS:-}" ] \
+|| [ -z "${MOLECULE_STAGING_ADMIN_TOKENS:-}" ] \
+|| [ -z "${MOLECULE_STAGING_CP_SHARED_SECRET:-}" ]; then
 {
-echo "## ⚠️ canary-verify skipped"
+echo "## ⚠️ staging-verify skipped"
 echo
-echo "One or more canary secrets are unset (\`CANARY_TENANT_URLS\`, \`CANARY_ADMIN_TOKENS\`, \`CANARY_CP_SHARED_SECRET\`)."
+echo "One or more canary secrets are unset (\`MOLECULE_STAGING_TENANT_URLS\`, \`MOLECULE_STAGING_ADMIN_TOKENS\`, \`MOLECULE_STAGING_CP_SHARED_SECRET\`)."
 echo "Phase 2 canary fleet has not been stood up yet —"
 echo "see [canary-tenants.md](https://git.moleculesai.app/molecule-ai/molecule-controlplane/blob/main/docs/canary-tenants.md)."
 echo
 echo "**Skipped — promote-to-latest will NOT auto-fire.** Dispatch \`promote-latest.yml\` manually when ready."
 } >> "$GITHUB_STEP_SUMMARY"
 echo "ran=false" >> "$GITHUB_OUTPUT"
-echo "::notice::canary-verify: skipped — no canary fleet configured"
+echo "::notice::staging-verify: skipped — no canary fleet configured"
 exit 0
 fi
-bash scripts/canary-smoke.sh
+bash scripts/staging-smoke.sh
 echo "ran=true" >> "$GITHUB_OUTPUT"
 - name: Summary on failure
@@ -173,7 +185,7 @@ jobs:
 echo ":latest stays pinned to the prior good digest — prod is untouched."
 echo
 echo "Fix forward and merge again, or investigate the specific failed"
-echo "assertions in the canary-smoke step log above."
+echo "assertions in the staging-smoke step log above."
 } >> "$GITHUB_STEP_SUMMARY"
 promote-to-latest:
@@ -188,13 +200,13 @@ jobs:
 # silently promoting a stale GHCR image while actual prod tenants
 # pulled from ECR. Canary smoke tests were GHCR-targeted and could
 # not catch a broken ECR build.
-needs: canary-smoke
-if: ${{ needs.canary-smoke.result == 'success' && needs.canary-smoke.outputs.smoke_ran == 'true' }}
+needs: staging-smoke
+if: ${{ needs.staging-smoke.result == 'success' && needs.staging-smoke.outputs.smoke_ran == 'true' }}
 runs-on: ubuntu-latest
 # Phase 3 (RFC #219 §1): surface broken workflows without blocking.
 continue-on-error: true
 env:
-SHA: ${{ needs.canary-smoke.outputs.sha }}
+SHA: ${{ needs.staging-smoke.outputs.sha }}
 CP_URL: ${{ vars.CP_URL || 'https://staging-api.moleculesai.app' }}
 # CP_ADMIN_API_TOKEN gates write access to the redeploy endpoint.
 # Stored at the repo level so all workflows pick it up automatically.
@@ -264,9 +276,9 @@ jobs:
 - name: Summary
 run: |
 {
-echo "## Canary verified — :latest promoted via CP redeploy-fleet"
+echo "## Staging verified — :latest promoted via CP redeploy-fleet"
 echo ""
-echo "- **Target tag:** \`staging-${{ needs.canary-smoke.outputs.sha }}\`"
+echo "- **Target tag:** \`staging-${{ needs.staging-smoke.outputs.sha }}\`"
 echo "- **Registry:** ECR (\`${TENANT_IMAGE_NAME}\`)"
 echo "- **Canary slug:** \`${CANARY_SLUG:-<none>}\` (soak ${SOAK_SECONDS}s)"
 echo "- **Batch size:** ${BATCH_SIZE:-3}"

View File

@@ -99,7 +99,8 @@ jobs:
 # Filter:
 # 1. slug starts with one of the ephemeral test prefixes:
-# - 'e2e-' — covers e2e-canary-, e2e-canvas-*, etc.
+# - 'e2e-' — covers e2e-smoke- (formerly e2e-canary-),
+# e2e-canvas-*, etc.
 # - 'rt-e2e-' — runtime-test harness fixtures (RFC #2251);
 # missing this prefix left two such tenants
 # orphaned 8h on staging (2026-05-03), then

View File

@@ -2,7 +2,7 @@
 How a workspace-server code change reaches the prod tenant fleet — and how to stop it if something's wrong.
-> **⚠️ State note (2026-04-22):** this doc describes the **intended design**. As of this write, the canary fleet described below is **not actually running** — no canary tenants are provisioned, `CANARY_TENANT_URLS` / `CANARY_ADMIN_TOKENS` / `CANARY_CP_SHARED_SECRET` are empty in repo secrets, and `canary-verify.yml` fails every run.
+> **⚠️ State note (2026-04-22, secret names refreshed 2026-05-11):** this doc describes the **intended design**. As of this write, the canary fleet described below is **not actually running** — no canary tenants are provisioned, `MOLECULE_STAGING_TENANT_URLS` / `MOLECULE_STAGING_ADMIN_TOKENS` / `MOLECULE_STAGING_CP_SHARED_SECRET` are empty in repo secrets, and `staging-verify.yml` (formerly `canary-verify.yml`) fails every run.
 >
 > Current merges gate on manual `promote-latest.yml` dispatches, not canary. See [molecule-controlplane/docs/canary-tenants.md](https://git.moleculesai.app/molecule-ai/molecule-controlplane/src/branch/main/docs/canary-tenants.md) for the Phase 1 code work that's already shipped + the Phase 2 plan for actually standing up the fleet + a "should we even do this now?" decision framework.
 >
@@ -22,7 +22,7 @@ publish-workspace-server-image.yml ← pushes :staging-<sha> ONLY
 Canary tenants auto-update to :staging-<sha>
 │ (5-min auto-updater cycle on each canary EC2)
-canary-verify.yml waits 6 min, runs scripts/canary-smoke.sh
+staging-verify.yml waits 6 min, runs scripts/staging-smoke.sh
 ├─► GREEN → crane tag :staging-<sha> → :latest
 │ │
@@ -42,7 +42,7 @@ Canary tenants are configured to pull `:staging-<sha>` (not `:latest`) via `TENA
 ## Smoke suite
-`scripts/canary-smoke.sh` hits each canary tenant (URL + ADMIN_TOKEN pair) and asserts:
+`scripts/staging-smoke.sh` hits each canary tenant (URL + ADMIN_TOKEN pair) and asserts:
 - `/admin/liveness` returns a subsystems map (tenant booted, AdminAuth reachable)
 - `/workspaces` returns a JSON array (wsAuth + DB healthy)
@@ -59,8 +59,8 @@ Expand by editing the script — each `check "name" "expected" "$response"` call
 3. Re-trigger provision (or delete + recreate if the org was already provisioned into staging) — the fresh EC2 lands in the canary AWS account (see internal runbook for the specific ID)
 Then set repo secrets:
-- `CANARY_TENANT_URLS` — append the new tenant's URL
-- `CANARY_ADMIN_TOKENS` — append its ADMIN_TOKEN in the same position
+- `MOLECULE_STAGING_TENANT_URLS` — append the new tenant's URL
+- `MOLECULE_STAGING_ADMIN_TOKENS` — append its ADMIN_TOKEN in the same position
 ## Rolling back `:latest`

View File

@@ -50,7 +50,7 @@ pipeline.
 | `check-merge-group-trigger.yml` | The workflow's own header (lines 18-23) documents that it's vacuously satisfied on Gitea — Gitea has no merge queue, no `merge_group:` event type, no `gh-readonly-queue/...` refs. Nothing to lint. |
 | `codeql.yml` | The workflow's own header (lines 3-67) documents that `github/codeql-action/init@v4` hits api.github.com bundle endpoints not implemented by Gitea (observed: `::error::404 page not found` in Initialize CodeQL step). Per Hongming decision 2026-05-07 (task #156): CodeQL is ADVISORY/non-blocking until a Gitea-compatible SAST pipeline lands. Replacement options (Semgrep self-host, Sonatype, GitHub-mirror-for-SAST) tracked in #156. |
 | `pr-guards.yml` | The workflow's own header documents that Gitea has no `gh pr merge --auto` primitive — the guard is a structural no-op on Gitea. Branch protection on `main` does NOT reference any `pr-guards` check name; deletion is safe. |
-| `promote-latest.yml` | Uses `imjasonh/setup-crane` against `ghcr.io/molecule-ai/platform` — the GHCR registry was retired during the 2026-05-06 Gitea migration (per `canary-verify.yml` header notes, the canonical tenant image moved to ECR `153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant`). The workflow can no longer find any image to retag. Follow-up issue suggested if an ECR-based retag promote is desired. |
+| `promote-latest.yml` | Uses `imjasonh/setup-crane` against `ghcr.io/molecule-ai/platform` — the GHCR registry was retired during the 2026-05-06 Gitea migration (per `staging-verify.yml` header notes — file was renamed from `canary-verify.yml` on 2026-05-11; the canonical tenant image moved to ECR `153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant`). The workflow can no longer find any image to retag. Follow-up issue suggested if an ECR-based retag promote is desired. |
 ## Category C — ported to .gitea/

View File

@@ -43,7 +43,7 @@ endpoint handler for the supported range.
 - `cleanup-rogue-workspaces.sh` — emergency teardown for leaked
 workspaces. Prompts for confirmation. Pair with the harnesses if a
 cleanup trap fails (see `cleanup_*_failed` events).
-- `canary-smoke.sh` — quick smoke test for canary releases.
+- `staging-smoke.sh` — quick smoke test for the staging canary fleet (formerly `canary-smoke.sh`).
 - `dev-start.sh` — local-dev platform bring-up.
 The rest are self-documenting in their header comments.

View File

@@ -1,29 +1,40 @@
 #!/bin/bash
-# canary-smoke.sh — runs the post-deploy smoke suite against the
-# staging canary tenant fleet. Called by the canary-verify.yml GitHub
+# staging-smoke.sh — runs the post-deploy smoke suite against the
+# staging canary tenant fleet. Called by the staging-verify.yml Gitea
 # Actions workflow after a new workspace-server image lands in ECR;
 # exits non-zero on any failure so the workflow can block the
 # redeploy-fleet promotion that would otherwise release broken code
 # to the prod tenant fleet.
 #
+# Naming note (2026-05-11): The script (and its input env vars) were
+# renamed from canary-smoke.sh / CANARY_* to staging-smoke.sh /
+# MOLECULE_STAGING_* per Hongming directive. The tested COHORT is still
+# called the "canary fleet" (a small subset of staging tenants that
+# ingest :staging-<sha> before the rest of the fleet); that strategy
+# concept is unchanged.
+#
 # Registry note: GHCR was retired 2026-05-06. Images are now pushed
 # to the operator's ECR org (153263036946.dkr.ecr.us-east-2.amazonaws.com/
 # molecule-ai/platform-tenant). The registry URL is a runtime concern for
 # the CI push step; this script tests the running tenant directly.
 #
 # Environment:
-# CANARY_TENANT_URLS space-sep list of canary tenant base URLs
-# (e.g. "https://canary-pm.staging.moleculesai.app
-# https://canary-mcp.staging.moleculesai.app")
-# CANARY_ADMIN_TOKENS space-sep list of ADMIN_TOKENs, positionally
-# matched to CANARY_TENANT_URLS. Canary tenants
-# are provisioned with known ADMIN_TOKENs so CI
-# can hit their admin-gated endpoints.
-# CANARY_CP_BASE_URL CP base URL the canaries call back to
-# (https://staging-api.moleculesai.app)
-# CANARY_CP_SHARED_SECRET matches CP's PROVISION_SHARED_SECRET so this
-# script can also exercise /cp/workspaces/* via
-# the canary's own CPProvisioner identity.
+# MOLECULE_STAGING_TENANT_URLS space-sep list of canary tenant base
+# URLs (e.g. "https://canary-pm.staging.
+# moleculesai.app https://canary-mcp.
+# staging.moleculesai.app")
+# MOLECULE_STAGING_ADMIN_TOKENS space-sep list of ADMIN_TOKENs,
+# positionally matched to
+# MOLECULE_STAGING_TENANT_URLS.
+# Canary tenants are provisioned with
+# known ADMIN_TOKENs so CI can hit
+# their admin-gated endpoints.
+# MOLECULE_STAGING_CP_BASE_URL CP base URL the canaries call back to
+# (https://staging-api.moleculesai.app)
+# MOLECULE_STAGING_CP_SHARED_SECRET matches CP's PROVISION_SHARED_SECRET
+# so this script can also exercise
+# /cp/workspaces/* via the canary's
+# own CPProvisioner identity.
 #
 # Exit codes: 0 = all green, 1 = assertion failure, 2 = setup/env problem.
@@ -31,12 +42,12 @@ set -euo pipefail
 # ── Setup ────────────────────────────────────────────────────────────────
-: "${CANARY_TENANT_URLS:?space-sep list of canary base URLs required}"
-: "${CANARY_ADMIN_TOKENS:?space-sep list of ADMIN_TOKENs required, same order as URLs}"
-: "${CANARY_CP_BASE_URL:?CP base URL required}"
+: "${MOLECULE_STAGING_TENANT_URLS:?space-sep list of canary base URLs required}"
+: "${MOLECULE_STAGING_ADMIN_TOKENS:?space-sep list of ADMIN_TOKENs required, same order as URLs}"
+: "${MOLECULE_STAGING_CP_BASE_URL:?CP base URL required}"
-read -r -a URLS <<< "$CANARY_TENANT_URLS"
-read -r -a TOKENS <<< "$CANARY_ADMIN_TOKENS"
+read -r -a URLS <<< "$MOLECULE_STAGING_TENANT_URLS"
+read -r -a TOKENS <<< "$MOLECULE_STAGING_ADMIN_TOKENS"
 if [ "${#URLS[@]}" -ne "${#TOKENS[@]}" ]; then
 echo "ERROR: URLS(${#URLS[@]}) and TOKENS(${#TOKENS[@]}) length mismatch" >&2
@@ -69,7 +80,7 @@ check() {
 # tenant never gets the wrong token.
 acurl() {
 local base="$1" token="$2"; shift 2
-curl -sS --max-time 20 -H "Authorization: Bearer $token" "$@" -- "$base${CANARY_ACURL_PATH:-}"
+curl -sS --max-time 20 -H "Authorization: Bearer $token" "$@" -- "$base${ACURL_PATH:-}"
 }
 # ── Checks (run per canary tenant) ───────────────────────────────────────
@@ -80,7 +91,7 @@ for i in "${!URLS[@]}"; do
 printf "\n── %s ──\n" "$base"
 # 1. Liveness — the tenant is up and responding to admin auth.
-CANARY_ACURL_PATH="/admin/liveness" resp=$(acurl "$base" "$token" || true)
+ACURL_PATH="/admin/liveness" resp=$(acurl "$base" "$token" || true)
 check "liveness returns a subsystems map" '"subsystems"' "$resp"
 # 2. CP env refresh — the workspace-server fetched MOLECULE_CP_SHARED_SECRET
@@ -89,25 +100,25 @@ for i in "${!URLS[@]}"; do
 # booted without crashing on the refresh call. A startup failure in
 # refreshEnvFromCP logs but still boots (best-effort semantics), so
 # this is a sanity check, not a proof.
-CANARY_ACURL_PATH="/workspaces" resp=$(acurl "$base" "$token" || true)
+ACURL_PATH="/workspaces" resp=$(acurl "$base" "$token" || true)
 check "workspace list is JSON array" "[" "$resp"
 # 3. Memory commit round-trip — scope=LOCAL so test data stays on this
 # tenant. Verifies encryption + scrubber + retrieval end-to-end.
 probe_id="canary-smoke-$(date +%s)-$i"
 body=$(printf '{"scope":"LOCAL","namespace":"canary-smoke","content":"probe-%s"}' "$probe_id")
-CANARY_ACURL_PATH="/memories/commit" resp=$(curl -sS --max-time 20 \
+ACURL_PATH="/memories/commit" resp=$(curl -sS --max-time 20 \
 -X POST -H "Content-Type: application/json" -H "Authorization: Bearer $token" \
 --data "$body" "$base/memories/commit" || true)
 check "memory commit accepted" '"id"' "$resp"
-CANARY_ACURL_PATH="/memories/search?query=probe-${probe_id}" \
+ACURL_PATH="/memories/search?query=probe-${probe_id}" \
 resp=$(curl -sS --max-time 20 -H "Authorization: Bearer $token" \
 "$base/memories/search?query=probe-${probe_id}" || true)
 check "memory search finds the probe" "probe-${probe_id}" "$resp"
 # 4. Events admin read — AdminAuth path (C4 fail-closed proof on SaaS).
-CANARY_ACURL_PATH="/events" resp=$(acurl "$base" "$token" || true)
+ACURL_PATH="/events" resp=$(acurl "$base" "$token" || true)
 check "events endpoint returns JSON" "[" "$resp"
 # 5. Negative: unauth'd admin call must 401 (C4 regression gate).
@@ -117,7 +128,7 @@ for i in "${!URLS[@]}"; do
 # 6. POST /org/import unauth → 401. Proves the route is compiled in
 # and AdminAuth is enforced. A missing route returns 404 (the failure
 # mode caught by issue #213). Regression guard for the silent-GHCR-
-# migration gap: canary-verify was testing a stale GHCR image while
+# migration gap: staging-verify (formerly canary-verify) was testing a stale GHCR image while
 # actual tenants ran ECR — this test would have caught a missing-route
 # binary before it reached prod.
 unauth_code=$(curl -sS -o /dev/null -w '%{http_code}' \
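The check() helper these hunks call is defined outside the diff; a hedged sketch consistent with its call sites (check "name" "expected-substring" "$response") and the documented exit code 1 for assertion failures — the real helper may aggregate failures instead of exiting on the first one:

check() {
  local name="$1" expected="$2" response="$3"
  # Plain substring assertion against the captured response body.
  if [[ "$response" == *"$expected"* ]]; then
    printf 'ok: %s\n' "$name"
  else
    printf 'FAIL: %s (missing substring: %s)\n' "$name" "$expected" >&2
    exit 1
  fi
}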

View File

@@ -7,11 +7,11 @@ Four workflows + a shared bash harness that together cover the SaaS stack end to
 | Workflow | Cadence | Wall time | Scope |
 |---|---|---|---|
 | `e2e-staging-saas.yml` | push + nightly 07:00 UTC | ~20 min | Full API: org → tenant → 2 workspaces → A2A → HMA → delegation → leak check |
-| `canary-staging.yml` | every 30 min | ~8 min | Minimum smoke + self-managed alert issue |
+| `staging-smoke.yml` | every 30 min | ~8 min | Minimum smoke + self-managed alert issue |
 | `e2e-staging-canvas.yml` | push + weekly Sunday 08:00 | ~25 min | All 13 canvas workspace-panel tabs via Playwright |
 | `e2e-staging-sanity.yml` | weekly Monday 06:00 | ~10 min | Intentional-failure: teardown safety-net self-check |
-`tests/e2e/test_staging_full_saas.sh` is the shared harness all workflows invoke (with `E2E_MODE={full|canary}` and `E2E_INTENTIONAL_FAILURE={0|1}` toggles).
+`tests/e2e/test_staging_full_saas.sh` is the shared harness all workflows invoke (with `E2E_MODE={full|smoke}` and `E2E_INTENTIONAL_FAILURE={0|1}` toggles).
 ### Full-SaaS checklist (sections)
@@ -82,7 +82,7 @@ bash tests/e2e/test_staging_full_saas.sh
 ## Cost
 - Full run: ~20 min, ~$0.007
-- Canary (48/day): ~$0.06/day
+- Smoke (48/day): ~$0.06/day
 - Canvas (few/week): ~$0.01/day
 - Sanity (weekly): ~$0.002/week
 - **Total staging burn: < $0.15/day** at expected CI load

View File

@@ -27,7 +27,11 @@
 # E2E_PROVISION_TIMEOUT_SECS default 900 (15 min cold EC2 budget)
 # E2E_KEEP_ORG 1 → skip teardown (debugging only)
 # E2E_RUN_ID Slug suffix; CI: ${GITHUB_RUN_ID}
-# E2E_MODE full (default) | canary
+# E2E_MODE full (default) | smoke
+# (legacy alias `canary` still accepted —
+# mapped to `smoke` for back-compat with
+# any in-flight runner picking up an older
+# workflow checkout)
 # E2E_INTENTIONAL_FAILURE 1 → poison tenant token mid-run so the
 # script fails; the EXIT trap MUST still
 # tear down cleanly (and exit 4 on leak).
@@ -49,15 +53,23 @@ RUNTIME="${E2E_RUNTIME:-hermes}"
 PROVISION_TIMEOUT_SECS="${E2E_PROVISION_TIMEOUT_SECS:-900}"
 RUN_ID_SUFFIX="${E2E_RUN_ID:-$(date +%H%M%S)-$$}"
 MODE="${E2E_MODE:-full}"
+# `canary` is a legacy alias for `smoke` retained for back-compat with
+# any in-flight runner picking up an older workflow checkout during the
+# 2026-05-11 canary→staging rename rollout. Both map to the same slug
+# prefix below. Remove the `canary` alias after one week of no-old-mode
+# observations.
+if [ "$MODE" = "canary" ]; then
+MODE="smoke"
+fi
 case "$MODE" in
-full|canary) ;;
-*) echo "E2E_MODE must be 'full' or 'canary' (got: $MODE)" >&2; exit 2 ;;
+full|smoke) ;;
+*) echo "E2E_MODE must be 'full' or 'smoke' (got: $MODE)" >&2; exit 2 ;;
 esac
-# Canary runs get a distinct prefix so their safety-net sweeper only
+# Smoke runs get a distinct slug prefix so their safety-net sweeper only
 # touches their own runs, not in-flight full runs.
-if [ "$MODE" = "canary" ]; then
-SLUG="e2e-canary-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
+if [ "$MODE" = "smoke" ]; then
+SLUG="e2e-smoke-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
 else
 SLUG="e2e-$(date +%Y%m%d)-${RUN_ID_SUFFIX}"
 fi
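An example invocation of the harness in the renamed mode, mirroring the env staging-smoke.yml sets (admin token elided; per the alias above, E2E_MODE=canary would behave identically):

MOLECULE_CP_URL=https://staging-api.moleculesai.app \
MOLECULE_ADMIN_TOKEN=... \
E2E_MODE=smoke \
E2E_RUNTIME=claude-code \
E2E_RUN_ID="smoke-local-$$" \
bash tests/e2e/test_staging_full_saas.sh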