feat(canary): gate :latest tag promotion on canary verify green (Phase 3)

Completes the canary release train. Before this, publish-workspace-
server-image.yml pushed both :staging-<sha> and :latest on every
main merge — meaning the prod tenant fleet auto-pulled every image
immediately, before any post-deploy smoke test. A broken image
(think: this morning's E2E current_task drift, but shipped at 3am
instead of caught in CI) would have fanned out to every running
tenant within 5 min.

Now:
- publish workflow pushes :staging-<sha> ONLY
- canary tenants are configured to track :staging-<sha>; they pick
  up the new image on their next auto-update cycle
- canary-verify.yml runs the smoke suite (Phase 2) after the sleep
- on green: a new promote-to-latest job uses crane to remotely
  retag :staging-<sha> → :latest for both platform and tenant images
- prod tenants auto-update to the newly-retagged :latest within
  their usual 5-min window
- on red: :latest stays frozen on prior good digest; prod is untouched

crane is pulled onto the runner (~4 MB, GitHub release) rather than
docker-daemon retag so the workflow doesn't need a privileged runner.

Rollback: if canary passed but something surfaces post-promotion,
operator runs "crane tag ghcr.io/molecule-ai/platform:<prior-good-sha>
latest" manually. A follow-up can wrap that in a Phase 4 admin
endpoint / script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Hongming Wang 2026-04-19 03:33:04 -07:00
parent fdcc715e97
commit 7e3a6fd756
2 changed files with 86 additions and 21 deletions

View File

@ -1,14 +1,19 @@
name: canary-verify
# Runs the canary smoke suite against the staging canary tenant fleet
# after a new workspace-server image lands on :latest. On failure,
# alerts via a GitHub Actions summary — follow-up PR will add:
# - :staging-<sha> intermediate tag published BY publish workflow
# - retag :staging-<sha> → :latest ONLY when this workflow is green
# - Telegram/Slack notifier on red
# For now this exists as the test gate itself so we can catch
# auto-update breakage before it reaches the prod tenant fleet
# (which auto-pulls :latest every 5 min).
# after a new :staging-<sha> image lands in GHCR. On green, promotes
# :staging-<sha> → :latest so the prod tenant fleet's 5-minute
# auto-updater picks up the verified digest. On red, :latest stays
# on the prior known-good digest and prod is untouched.
#
# Dependencies:
# - publish-workspace-server-image.yml publishes :staging-<sha>
# (NOT :latest) on main merge
# - canary tenants are configured to pull :staging-<sha> as their
# tenant image (set TENANT_IMAGE=ghcr.io/…:staging-<sha> on the
# canary provisioner code path OR rotate via an admin endpoint)
# - Repo secrets CANARY_TENANT_URLS / CANARY_ADMIN_TOKENS /
# CANARY_CP_SHARED_SECRET are populated
on:
workflow_run:
@ -18,18 +23,29 @@ on:
permissions:
contents: read
packages: write
actions: read
env:
IMAGE_NAME: ghcr.io/molecule-ai/platform
TENANT_IMAGE_NAME: ghcr.io/molecule-ai/platform-tenant
jobs:
canary-smoke:
# Skip when the upstream workflow failed — no image to test against.
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'workflow_dispatch' }}
runs-on: ubuntu-latest
outputs:
sha: ${{ steps.compute.outputs.sha }}
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Wait for canary tenants to pick up new image
- name: Compute sha
id: compute
run: echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
- name: Wait for canary tenants to pick up :staging-<sha>
# Tenant auto-updater runs every 5 min. Sleep 6 min to give every
# canary time to pull + restart. Cheaper than polling.
run: sleep 360
@ -48,9 +64,50 @@ jobs:
{
echo "## Canary smoke FAILED"
echo
echo "The staging canary tenant fleet failed its post-deploy smoke suite."
echo "The :latest tag on ghcr.io/molecule-ai/platform should be rolled back"
echo "to the prior known-good digest until this is resolved."
echo "Canary tenants rejected image \`staging-${{ steps.compute.outputs.sha }}\`."
echo ":latest stays pinned to the prior good digest — prod is untouched."
echo
echo "See job log above for the specific failed assertions."
echo "Fix forward and merge again, or investigate the specific failed"
echo "assertions in the canary-smoke step log above."
} >> "$GITHUB_STEP_SUMMARY"
promote-to-latest:
# On green, retag :staging-<sha> → :latest for BOTH images.
# crane is a lightweight registry client (no Docker daemon needed on
# the runner) that can retag remotely with a single API call each.
needs: canary-smoke
if: ${{ needs.canary-smoke.result == 'success' }}
runs-on: ubuntu-latest
steps:
- name: Install crane
run: |
curl -fsSL https://github.com/google/go-containerregistry/releases/download/v0.20.2/go-containerregistry_Linux_x86_64.tar.gz | \
tar xz -C /usr/local/bin crane
- name: GHCR login
run: |
echo "${{ secrets.GITHUB_TOKEN }}" | \
crane auth login ghcr.io -u "${{ github.actor }}" --password-stdin
- name: Retag platform :staging-<sha> → :latest
run: |
crane tag \
"${IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}" \
latest
- name: Retag tenant :staging-<sha> → :latest
run: |
crane tag \
"${TENANT_IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}" \
latest
- name: Summary
run: |
{
echo "## Canary verified — :latest promoted"
echo
echo "- \`${IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}\` → \`${IMAGE_NAME}:latest\`"
echo "- \`${TENANT_IMAGE_NAME}:staging-${{ needs.canary-smoke.outputs.sha }}\` → \`${TENANT_IMAGE_NAME}:latest\`"
echo
echo "Prod tenant fleet will pick up the new digest on its next 5-min auto-update cycle."
} >> "$GITHUB_STEP_SUMMARY"

View File

@ -55,7 +55,17 @@ jobs:
run: |
echo "sha=${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
- name: Build & push platform image to GHCR
# Canary-gated release: we publish :staging-<sha> ONLY here. The
# :latest tag (which existing prod tenants auto-pull every 5 min)
# is promoted by .github/workflows/canary-verify.yml after the
# staging canary fleet green-lights this digest.
# That means:
# - Every main merge produces a :staging-<sha> image
# - Canary tenants (configured to pull :staging-<sha>) pick it up
# - canary-verify.yml runs smoke tests against them
# - On green → canary-verify retags :staging-<sha> → :latest
# - On red → :latest stays on the prior good digest, prod is safe
- name: Build & push platform image to GHCR (staging-<sha> only)
uses: docker/build-push-action@v6
with:
context: .
@ -63,16 +73,15 @@ jobs:
platforms: linux/amd64
push: true
tags: |
${{ env.IMAGE_NAME }}:latest
${{ env.IMAGE_NAME }}:sha-${{ steps.tags.outputs.sha }}
${{ env.IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
labels: |
org.opencontainers.image.source=https://github.com/${{ github.repository }}
org.opencontainers.image.revision=${{ github.sha }}
org.opencontainers.image.description=Molecule AI platform (Go API server)
org.opencontainers.image.description=Molecule AI platform (Go API server) — pending canary verify
- name: Build & push tenant image to GHCR
- name: Build & push tenant image to GHCR (staging-<sha> only)
uses: docker/build-push-action@v6
with:
context: .
@ -80,11 +89,10 @@ jobs:
platforms: linux/amd64
push: true
tags: |
${{ env.TENANT_IMAGE_NAME }}:latest
${{ env.TENANT_IMAGE_NAME }}:sha-${{ steps.tags.outputs.sha }}
${{ env.TENANT_IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
labels: |
org.opencontainers.image.source=https://github.com/${{ github.repository }}
org.opencontainers.image.revision=${{ github.sha }}
org.opencontainers.image.description=Molecule AI tenant platform + canvas (one EC2 instance per org)
org.opencontainers.image.description=Molecule AI tenant platform + canvas — pending canary verify