feat(scripts): codify ECR :staging-latest → :latest promote + tenant redeploy (closes #660) #672

Open
hongming wants to merge 1 commits from infra/660-codify-promote-tenant-image into main

1 Commits

Author SHA1 Message Date
hongming
ac20b17f85 feat(scripts): codify ECR :staging-latest → :latest promote + tenant redeploy (closes #660)
Some checks failed
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 8s
CI / Detect changes (pull_request) Successful in 15s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 12s
E2E API Smoke Test / detect-changes (pull_request) Successful in 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 18s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 19s
qa-review / approved (pull_request) Failing after 14s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 22s
security-review / approved (pull_request) Failing after 13s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 8s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 16s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 44s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 5s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
lint-continue-on-error-tracking / lint-continue-on-error-tracking (pull_request) Failing after 1m9s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m11s
Lint workflow YAML (Gitea-1.22.6-hostile shapes) / Lint workflow YAML for Gitea-1.22.6-hostile shapes (pull_request) Successful in 1m25s
Lint pre-flip continue-on-error / Verify continue-on-error flips have run-log proof (pull_request) Successful in 1m26s
CI / Platform (Go) (pull_request) Failing after 4m50s
CI / Canvas (Next.js) (pull_request) Successful in 5m23s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 7m8s
CI / all-required (pull_request) Failing after 2s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4
sop-checklist-gate / gate (pull_request) Successful in 5s
gate-check-v3 / gate-check (pull_request) Successful in 6s
sop-tier-check / tier-check (pull_request) Successful in 6s
lint-mask-pr-atomicity / lint-mask-pr-atomicity (pull_request) Successful in 1m6s
Replaces the manual 4-step runbook in
`reference_manual_ecr_promote_procedure.md` with a single self-contained
script + 40 mock-driven e2e tests + a CI gate.

## What's in this change

### `scripts/promote-tenant-image.sh`

The script does the full chain end-to-end:
1. **PREFLIGHT** — AWS auth ok, source-tag exists, CP base reachable.
   Exits 1 with no mutations if anything's wrong.
2. **SNAPSHOT** — saves the current dest-tag manifest as
   `<dest>-prev-YYYYMMDD`. Idempotent: same UTC day re-runs are no-ops.
3. **PROMOTE** — copies `<source-tag>` manifest → `<dest-tag>` via
   `aws ecr put-image` with the OCI image-index media type (preserves
   inner child-manifest digest per `reference_ecr_cross_account_digest_exact_mirror`).
4. **REDEPLOY** — per-tenant POST `/cp/admin/tenants/<slug>/redeploy`.
   On HTTP 403 (stale tenant docker ECR auth — `feedback_ec2_ecr_auth_12h_stale`)
   it SSM-refreshes the EC2's docker login and retries once.
5. **VERIFY** — per-tenant `/buildinfo` + `/health` probes. Failure
   here triggers auto-rollback.
6. **ROLLBACK** (on failure) — re-promotes the rollback tag back to
   `<dest-tag>` and redeploys the fleet. Exits 3 if rollback OK, 4 if not.

Every external call (aws/curl/ssm) is wrapped in a function with a
`--mock-dir` injection point so the tests can drive every branch
without touching real infrastructure.

### `scripts/test-promote-tenant-image.sh`

40 cases across 11 test groups:
- happy path (5 assertions on call counts + exit code)
- preflight failures with no mutations
- snapshot idempotency
- `--dry-run` skips all mutations
- 403 → SSM-refresh → retry path
- redeploy fail with vs without rollback (exit 3 vs 4)
- argument validation (missing/conflicting/unknown flags)
- date override for rollback tag naming
- empty source manifest detection
- verify-failure triggers rollback

Runs `bash scripts/test-promote-tenant-image.sh`. No live infra touched.

### `.gitea/workflows/ci.yml`

Two new steps in the existing `Shellcheck (E2E scripts)` job (a
required check on `main`), gated by the existing `scripts` change
filter (`scripts/`, `tests/e2e/`, `infra/scripts/`, or this workflow
file itself):

1. Run `scripts/test-promote-tenant-image.sh` — fails CI if any of
   the 40 cases regresses.
2. Run `shellcheck --severity=warning` on the two files. The bulk
   shellcheck step intentionally excludes `scripts/` for legacy
   SC3040/SC3043 reasons; explicit invocation here catches new
   regressions in the promote script without unblocking the bulk
   cleanup.

## Validated locally

```
$ bash scripts/test-promote-tenant-image.sh
...
All 40 tests passed.

$ shellcheck --severity=warning scripts/promote-tenant-image.sh scripts/test-promote-tenant-image.sh
(clean)
```

## Closes

- core#660 — "Codify manual ECR promote operation as
  `scripts/promote-tenant-image.sh`" (tier:medium, core-devops)

## Cross-links

- core#658 — proper fix for the 12h-stale tenant ECR auth (this script
  ships the SSM-refresh workaround pending the credential-helper
  rollout).
- `reference_manual_ecr_promote_procedure.md` (memory) — the manual
  procedure this script replaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 08:55:02 +00:00