molecule-core/.github/workflows/canary-verify.yml
Hongming Wang 9662590360 feat(canary): smoke harness + GHA verification workflow (Phase 2)
Post-deploy verification for staging tenant images. Runs against the
canary fleet after each publish-workspace-server-image build — catches
auto-update breakage (a la today's E2E current_task drift) before it
propagates to the prod tenant fleet that auto-pulls :latest every 5 min.

scripts/canary-smoke.sh iterates a space-sep list of canary base URLs
(paired with their ADMIN_TOKENs) and checks:
- /admin/liveness reachable with admin bearer (tenant boot OK)
- /workspaces list responds (wsAuth + DB path OK)
- /memories/commit + /memories/search round-trip (encryption + scrubber)
- /events admin read (AdminAuth C4 path)
- /admin/liveness without bearer returns 401 (C4 fail-closed regression)

.github/workflows/canary-verify.yml runs after publish succeeds:
- 6-min sleep (tenant auto-updater pulls every 5 min)
- bash scripts/canary-smoke.sh with secrets pulled from repo settings
- on failure: writes a Step Summary flagging that :latest should be
  rolled back to prior known-good digest

Phase 3 follow-up will split the publish workflow so only
:staging-<sha> ships initially, and canary-verify's green gate is
what promotes :staging-<sha> → :latest. This commit lays the test
gate alone so we have something running against tenants immediately.

Secrets to set in GitHub repo settings before this workflow can run:
- CANARY_TENANT_URLS (space-sep list)
- CANARY_ADMIN_TOKENS (same order as URLs)
- CANARY_CP_SHARED_SECRET (matches staging CP PROVISION_SHARED_SECRET)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 03:30:19 -07:00

57 lines
2.1 KiB
YAML

name: canary-verify
# Runs the canary smoke suite against the staging canary tenant fleet
# after a new workspace-server image lands on :latest. On failure,
# alerts via a GitHub Actions summary — follow-up PR will add:
# - :staging-<sha> intermediate tag published BY publish workflow
# - retag :staging-<sha> → :latest ONLY when this workflow is green
# - Telegram/Slack notifier on red
# For now this exists as the test gate itself so we can catch
# auto-update breakage before it reaches the prod tenant fleet
# (which auto-pulls :latest every 5 min).
on:
workflow_run:
workflows: ["publish-workspace-server-image"]
types: [completed]
workflow_dispatch:
permissions:
contents: read
actions: read
jobs:
canary-smoke:
# Skip when the upstream workflow failed — no image to test against.
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'workflow_dispatch' }}
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Wait for canary tenants to pick up new image
# Tenant auto-updater runs every 5 min. Sleep 6 min to give every
# canary time to pull + restart. Cheaper than polling.
run: sleep 360
- name: Run canary smoke suite
env:
CANARY_TENANT_URLS: ${{ secrets.CANARY_TENANT_URLS }}
CANARY_ADMIN_TOKENS: ${{ secrets.CANARY_ADMIN_TOKENS }}
CANARY_CP_BASE_URL: https://staging-api.moleculesai.app
CANARY_CP_SHARED_SECRET: ${{ secrets.CANARY_CP_SHARED_SECRET }}
run: bash scripts/canary-smoke.sh
- name: Summary on failure
if: ${{ failure() }}
run: |
{
echo "## Canary smoke FAILED"
echo
echo "The staging canary tenant fleet failed its post-deploy smoke suite."
echo "The :latest tag on ghcr.io/molecule-ai/platform should be rolled back"
echo "to the prior known-good digest until this is resolved."
echo
echo "See job log above for the specific failed assertions."
} >> "$GITHUB_STEP_SUMMARY"