Mass-sed across 17 files / 38 active refs in molecule-core .md docs (README + CONTRIBUTING + docs/architecture/ + docs/blog/ + docs/guides/ + docs/integrations/ + docs/quickstart.md + scripts/README.md).

Driver: /tmp/sweep_core.py, with the same pattern set as the internal-marketing bulk-sed (PR #50). 4 url-substitution patterns + SKIP_PATTERN preserves /pull/<n> /issues/<n> /commit/<sha> /releases/... historical refs.

Files NOT touched in this PR:
- docs/workspace-runtime-package.md: owned by molecule-core#15 (workspace-runtime source-edit per #41). Reverted my bulk-sed of that file to avoid merge conflict.
- 2 Go-import-path refs in docs/memory-plugins/testing-your-plugin.md (github.com/Molecule-AI/molecule-monorepo/platform/internal/...): Q5 cross-repo Go-module migration territory.
- 1 GitHub Gist link in docs/guides/external-workspace-quickstart.md (gist.github.com/molecule-ai/...): no Gitea equivalent; consistent with the same handling in docs#1.

Manual fixes (2):
- docs/blog/2026-04-20-chrome-devtools-mcp-seo/index.md:306: GitHub Discussions (no Gitea equivalent) → issue tracker link
- docs/guides/external-workspace-quickstart.md:218: tracking-issue ?q= query-string url (regex didn't catch) → reformulated text + Gitea search-by-query approach

Pattern matches my docs#1 (public docs site) PR + internal#50 (internal/marketing bulk-sed).

Standard substitutions:
- https://github.com/Molecule-AI/<repo> → https://git.moleculesai.app/molecule-ai/<repo>
- /blob/<branch>/ + /tree/<branch>/ → /src/branch/<branch>/

Refs: molecule-ai/internal#37, molecule-ai/internal#38
88 lines
4.9 KiB
Markdown
# Canary release pipeline

How a workspace-server code change reaches the prod tenant fleet, and how to stop it if something's wrong.

> **⚠️ State note (2026-04-22):** this doc describes the **intended design**. As of this writing, the canary fleet described below is **not actually running**: no canary tenants are provisioned, `CANARY_TENANT_URLS` / `CANARY_ADMIN_TOKENS` / `CANARY_CP_SHARED_SECRET` are empty in repo secrets, and `canary-verify.yml` fails every run.
>
> Current merges gate on manual `promote-latest.yml` dispatches, not canary. See [molecule-controlplane/docs/canary-tenants.md](https://git.moleculesai.app/molecule-ai/molecule-controlplane/src/branch/main/docs/canary-tenants.md) for the Phase 1 code work that has already shipped, the Phase 2 plan for actually standing up the fleet, and a "should we even do this now?" decision framework.
>
> **Account-specific identifiers (AWS account ID, IAM role name) referenced below in the original design have been redacted from this public doc.** The actual values, if they exist, are in `Molecule-AI/internal/runbooks/canary-fleet.md`. If you're implementing Phase 2, start there.
>
> When Phase 2 lands, delete this note and reconcile the two docs.

## The loop

```
PR merged to staging → main
        │
        ▼
publish-workspace-server-image.yml   ← pushes :staging-<sha> ONLY
        │                              (NOT :latest; prod is untouched)
        ▼
Canary tenants auto-update to :staging-<sha>
        │   (5-min auto-updater cycle on each canary EC2)
        ▼
canary-verify.yml waits 6 min, runs scripts/canary-smoke.sh
        │
        ├─► GREEN → crane tag :staging-<sha> → :latest
        │              │
        │              ▼
        │        Prod tenants auto-update within 5 min
        │
        └─► RED → :latest stays on prior good digest
                  GitHub Step Summary flags the rejected sha
                  Ops fixes forward OR rolls back manually
```
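
The GREEN/RED branch at the bottom of the loop can be sketched as a small shell function. This is a minimal illustration under stated assumptions, not the contents of `canary-verify.yml`: the function name `promote_if_green` and the `IMAGE`, `SMOKE_CMD`, and `WAIT_SECONDS` variables are hypothetical, and it assumes `crane` is on the `PATH` and already authenticated to the registry.

```bash
# Hypothetical sketch of the canary gate; all names here are assumptions.
#   IMAGE         registry repository to promote (placeholder, set by the caller)
#   SMOKE_CMD     command that exits 0 when the canary fleet is healthy
#   WAIT_SECONDS  delay before the smoke run (defaults to the 6 min in the diagram)

promote_if_green() {
  sha=$1
  # Give the canaries time to pull :staging-<sha> (5-min updater cycle plus margin).
  sleep "${WAIT_SECONDS:-360}"
  if "${SMOKE_CMD:-scripts/canary-smoke.sh}"; then
    # GREEN: point :latest at the build the canaries just verified.
    crane tag "$IMAGE:staging-$sha" latest || return 1
    echo "promoted staging-$sha to latest"
  else
    # RED: do nothing, so :latest stays on the prior good digest.
    echo "rejected staging-$sha; latest unchanged" >&2
    return 1
  fi
}
```

The RED path deliberately has no registry call at all: leaving `:latest` untouched is what keeps prod on the prior good digest.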

## Canary fleet

The canary fleet lives in a separate AWS account, reached via an assumed role. The CP's `is_canary` org flag routes provisioning there; every other org goes to the default account. The specific account ID and role name are tracked in the internal runbook (`Molecule-AI/internal/runbooks/canary-fleet.md`) rather than here, so rotating them doesn't require rewriting public git history.

Canary tenants are configured to pull `:staging-<sha>` (not `:latest`) via `TENANT_IMAGE` on their provisioner, so they ingest each new build before prod does.

## Smoke suite

`scripts/canary-smoke.sh` hits each canary tenant (URL + ADMIN_TOKEN pair) and asserts:

- `/admin/liveness` returns a subsystems map (tenant booted, AdminAuth reachable)
- `/workspaces` returns a JSON array (wsAuth + DB healthy)
- `/memories/commit` + `/memories/search` round-trip (encryption + scrubber)
- `/events` admin read (C4 fail-closed proof)
- `/admin/liveness` without bearer → 401 (C4 regression gate)

To extend the suite, edit the script: each `check "name" "expected" "$response"` call is one line.
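
To make the one-line-per-check shape concrete, here is a minimal sketch of what a `check` helper with that call signature could look like. The real helper in `scripts/canary-smoke.sh` may differ: the substring-matching rule, the `failures` counter, and the sample responses below are all assumptions made for illustration.

```bash
# Hypothetical sketch of a check helper; not the actual canary-smoke.sh code.
failures=0

# check NAME EXPECTED RESPONSE
# Passes when RESPONSE contains EXPECTED as a substring (an assumed matching rule).
check() {
  name=$1
  expected=$2
  response=$3
  case "$response" in
    *"$expected"*)
      echo "PASS $name" ;;
    *)
      echo "FAIL $name (expected to find: $expected)"
      failures=$((failures + 1)) ;;
  esac
}

# Sample responses are made up; a real run would capture them from each tenant.
check "liveness returns subsystems map" '"subsystems"' '{"subsystems":{"db":"ok"}}'
check "liveness without bearer is rejected" "401" "401"
echo "failures: $failures"
```

With a helper of this shape, adding coverage for a new endpoint is a single extra `check` line, and the script can exit non-zero whenever `failures` is greater than zero.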

## Adding a canary tenant

1. `POST /cp/orgs`: create the org normally (`is_canary` defaults to false)
2. `POST /cp/admin/orgs/<slug>/canary` with `{"is_canary": true}`: admin only, refuses to flip if already provisioned
3. Re-trigger provisioning (or delete + recreate if the org was already provisioned into staging); the fresh EC2 lands in the canary AWS account (see the internal runbook for the specific ID)

Then set repo secrets:

- `CANARY_TENANT_URLS`: append the new tenant's URL
- `CANARY_ADMIN_TOKENS`: append its ADMIN_TOKEN in the same position
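
Because the two secrets are matched by position, a length mismatch silently pairs a tenant with the wrong token or with none. The sketch below shows one way to walk both lists in lockstep and fail on a mismatch. It assumes the values are comma-separated, which this doc does not specify, and `pair_canaries` is a hypothetical name.

```bash
# Hypothetical sketch: pair URLs and tokens by position.
# Assumes comma-separated lists; the real secret format may differ.

# pair_canaries URLS TOKENS
# Prints one "url token" line per tenant; returns 1 if the lists differ in length.
pair_canaries() {
  urls=$1
  tokens=$2
  while [ -n "$urls" ] || [ -n "$tokens" ]; do
    if [ -z "$urls" ] || [ -z "$tokens" ]; then
      echo "error: URL and token lists differ in length" >&2
      return 1
    fi
    # Take the head of each list (everything before the first comma).
    url=${urls%%,*}
    token=${tokens%%,*}
    printf '%s %s\n' "$url" "$token"
    # Drop the consumed head; an entry with no comma left is the last one.
    case "$urls" in *,*) urls=${urls#*,} ;; *) urls= ;; esac
    case "$tokens" in *,*) tokens=${tokens#*,} ;; *) tokens= ;; esac
  done
}
```

Calling `pair_canaries "$CANARY_TENANT_URLS" "$CANARY_ADMIN_TOKENS"` after editing the secrets is a cheap way to confirm the new tenant landed in the same slot in both lists.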

## Rolling back `:latest`

When canary was green but something surfaces post-promotion, retag `:latest` to a prior digest:

```bash
export GITHUB_TOKEN=ghp_...          # write:packages
scripts/rollback-latest.sh 4c1d56e   # retags both platform + tenant images
```

`scripts/rollback-latest.sh` pre-checks that `:staging-<sha>` exists before moving `:latest`, and verifies the digest after the move. Prod tenants pick up the rolled-back image on their next 5-min auto-update.
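
The pre-check, retag, and verify sequence described above might look roughly like the following for a single image. This is a sketch under stated assumptions rather than the script's actual contents: `rollback_latest` and `IMAGE` are hypothetical names, `crane` is assumed to be installed and authenticated, and the real script handles both the platform and tenant images. Resolving the target digest first means a mistyped sha fails before `:latest` is touched.

```bash
# Hypothetical sketch of pre-check / retag / verify for one image.
# IMAGE is a placeholder for the registry repository, set by the caller.
rollback_latest() {
  sha=$1
  # Pre-check: refuse to touch :latest unless the target tag resolves.
  want=$(crane digest "$IMAGE:staging-$sha") || {
    echo "error: $IMAGE:staging-$sha not found; latest left unchanged" >&2
    return 1
  }
  # Move :latest onto the known-good build.
  crane tag "$IMAGE:staging-$sha" latest || return 1
  # Verify: :latest must now resolve to the digest we intended.
  got=$(crane digest "$IMAGE:latest") || return 1
  if [ "$got" != "$want" ]; then
    echo "error: latest is $got, expected $want" >&2
    return 1
  fi
  echo "latest now points at $want"
}
```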

A post-mortem should always include:

- the commit sha that broke
- why canary didn't catch it (new code path the smoke suite doesn't exercise?)
- whether the smoke suite should grow a new check to prevent the same class of bug

## What this gate doesn't catch

- Bugs that only surface under prod-only data (customer workloads with scale or shape canary doesn't produce). Canary uses real traffic shapes but can't simulate weeks of accumulated state.
- Config drift between canary and prod (different env-var values, different feature flags). Keep canary's config deltas minimal and documented.
- Cross-tenant interactions: canary tenants run in their own AWS account, so a bug that only appears when two tenants compete for a shared resource won't reproduce here.

When these miss, `rollback-latest.sh` is the escape hatch.