molecule-ai/molecule-core

refactor(ci): drop "canary-" prefix → staging-smoke/staging-verify (Hongming directive 2026-05-11) (#443 )

Co-authored-by: claude-ceo-assistant <claude-ceo-assistant@agents.moleculesai.app>
Co-committed-by: claude-ceo-assistant <claude-ceo-assistant@agents.moleculesai.app>

2026-05-11 11:25:29 +00:00


Canary release pipeline

How a workspace-server code change reaches the prod tenant fleet — and how to stop it if something's wrong.

⚠️ State note (2026-04-22, secret names refreshed 2026-05-11): this doc describes the intended design. As of this writing, the canary fleet described below is not actually running: no canary tenants are provisioned, MOLECULE_STAGING_TENANT_URLS / MOLECULE_STAGING_ADMIN_TOKENS / MOLECULE_STAGING_CP_SHARED_SECRET are empty in repo secrets, and staging-verify.yml (formerly canary-verify.yml) fails every run.

Merges currently gate on manual promote-latest.yml dispatches, not on canary. See molecule-controlplane/docs/canary-tenants.md for the Phase 1 code work that has already shipped, the Phase 2 plan for actually standing up the fleet, and a "should we even do this now?" decision framework.

Account-specific identifiers (AWS account ID, IAM role name) referenced below in the original design have been redacted from this public doc. The actual values — if they exist — are in Molecule-AI/internal/runbooks/canary-fleet.md. If you're implementing Phase 2, start there.

When Phase 2 lands, delete this note and reconcile the two docs.

The loop

PR merged to staging → main
      │
      ▼
publish-workspace-server-image.yml   ← pushes :staging-<sha> ONLY
      │                                (NOT :latest — prod is untouched)
      ▼
Canary tenants auto-update to :staging-<sha>
      │   (5-min auto-updater cycle on each canary EC2)
      ▼
staging-verify.yml waits 6 min, runs scripts/staging-smoke.sh
      │
      ├─► GREEN → crane tag :staging-<sha> → :latest
      │                                       │
      │                                       ▼
      │                           Prod tenants auto-update within 5 min
      │
      └─► RED   → :latest stays on prior good digest
                  GitHub Step Summary flags the rejected sha
                  Ops fixes forward OR rolls back manually
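
The promote-or-hold decision at the bottom of the loop could be sketched roughly as follows. This is an illustration only, not the contents of staging-verify.yml: the `gate` function and the `IMAGE` registry path are hypothetical, and the only external calls used are `crane tag` and the runner-provided `GITHUB_STEP_SUMMARY` file.

```bash
# Sketch only. IMAGE is a hypothetical registry path; the real one is not given in this doc.
IMAGE="${IMAGE:-registry.example.com/workspace-server}"

# gate <sha> <smoke_exit_code>
# Green (exit code 0): retag :staging-<sha> as :latest.
# Red (anything else): leave :latest alone and record the rejected sha.
gate() {
  gate_sha="$1"
  gate_rc="$2"
  if [ "$gate_rc" -eq 0 ]; then
    crane tag "${IMAGE}:staging-${gate_sha}" latest || return 1
    echo "promoted staging-${gate_sha} to latest"
  else
    # GITHUB_STEP_SUMMARY is set by the Actions runner; fall back to /dev/null elsewhere.
    echo "Rejected sha ${gate_sha}: smoke suite failed" >> "${GITHUB_STEP_SUMMARY:-/dev/null}"
    echo "held latest at prior digest"
    return 1
  fi
}
```

Under these assumptions a workflow step would run the smoke script, capture its exit code, and call `gate "$SHA" "$rc"`; the non-zero return on RED keeps the job itself red.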

Canary fleet

Lives in a separate AWS account via an assumed role. The CP's is_canary org flag routes provisioning there; every other org goes to the default account. Specific account ID and role name are tracked in the internal runbook (Molecule-AI/internal/runbooks/canary-fleet.md) rather than here, so rotating them doesn't require rewriting public git history.

Canary tenants are configured to pull :staging-<sha> (not :latest) via TENANT_IMAGE on their provisioner, so they ingest each new build before prod does.

Smoke suite

scripts/staging-smoke.sh hits each canary tenant (URL + ADMIN_TOKEN pair) and asserts:

/admin/liveness returns a subsystems map (tenant booted, AdminAuth reachable)
/workspaces returns a JSON array (wsAuth + DB healthy)
/memories/commit + /memories/search round-trip (encryption + scrubber)
/events admin read (C4 fail-closed proof)
/admin/liveness without bearer → 401 (C4 regression gate)

To extend the suite, edit the script: each check is a single check "name" "expected" "$response" call.
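
For readers who have not opened the script, a helper with that three-argument shape might look like the sketch below. It is a guess at the pattern, not a copy of scripts/staging-smoke.sh: the substring-match semantics and the `FAILURES` counter are assumptions.

```bash
# Hypothetical sketch of a check helper; the real one in scripts/staging-smoke.sh may differ.
FAILURES=0

# check <name> <expected> <response>
# Passes when <response> contains <expected> as a literal substring.
check() {
  case "$3" in
    *"$2"*) echo "PASS $1" ;;
    *)      echo "FAIL $1 (expected to find: $2)"
            FAILURES=$((FAILURES + 1)) ;;
  esac
}

# Example call (TENANT_URL is a placeholder), asserting the unauthenticated 401:
#   response=$(curl -s -o /dev/null -w '%{http_code}' "$TENANT_URL/admin/liveness")
#   check "liveness without bearer" "401" "$response"
```

With this shape, the script can finish with `[ "$FAILURES" -eq 0 ]` so that any failed check turns the run red while still reporting every check.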

Adding a canary tenant

POST /cp/orgs — create the org normally (is_canary defaults to false)
POST /cp/admin/orgs/<slug>/canary with {"is_canary": true} — admin only, refuses to flip if already provisioned
Re-trigger provision (or delete + recreate if the org was already provisioned into staging) — the fresh EC2 lands in the canary AWS account (see internal runbook for the specific ID)
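
The flag flip in the second step could be scripted along these lines. The endpoint path and JSON body come from the steps above; `CP_URL`, `CP_ADMIN_TOKEN`, and the bearer-token header are assumptions, since this doc does not say how the control plane authenticates admin calls.

```bash
# Sketch only. CP_URL and CP_ADMIN_TOKEN are hypothetical names, and bearer auth is assumed.
CP_URL="${CP_URL:-https://cp.example.com}"

# mark_canary <slug>: set is_canary to true on an existing, not-yet-provisioned org.
mark_canary() {
  curl -fsS -X POST "${CP_URL}/cp/admin/orgs/$1/canary" \
    -H "Authorization: Bearer ${CP_ADMIN_TOKEN}" \
    -H "Content-Type: application/json" \
    -d '{"is_canary": true}'
}
```

Because of `-f`, curl exits non-zero on an HTTP error, which is how the "refuses to flip if already provisioned" case would surface to a calling script.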

Then set repo secrets:

MOLECULE_STAGING_TENANT_URLS — append the new tenant's URL
MOLECULE_STAGING_ADMIN_TOKENS — append its ADMIN_TOKEN in the same position
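
Because the two secrets are matched by position, a length mismatch silently pairs tenants with the wrong token. A consumer could guard against that as sketched below. The comma-separated format and the `each_pair` helper are assumptions; the doc does not specify how the lists are delimited.

```bash
# Sketch only: assumes both secrets are comma-separated lists in matching order.

# each_pair <urls_csv> <tokens_csv>
# Prints "url token" per line for piping into `while read url token`; not meant for logs.
# Returns 1 if the two lists differ in length.
each_pair() {
  (
    urls="$1"; tokens="$2"
    while [ -n "$urls" ] || [ -n "$tokens" ]; do
      if [ -z "$urls" ] || [ -z "$tokens" ]; then
        echo "length mismatch between URL and token lists" >&2
        exit 1
      fi
      url="${urls%%,*}"; token="${tokens%%,*}"
      echo "$url $token"
      # Drop the consumed head; when no comma remains, the list is exhausted.
      case "$urls" in *,*) urls="${urls#*,}" ;; *) urls="" ;; esac
      case "$tokens" in *,*) tokens="${tokens#*,}" ;; *) tokens="" ;; esac
    done
  )
}
```

The body runs in a subshell so the loop variables do not leak into the caller, and the mismatch check fires as soon as one list runs out before the other.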

Rolling back `:latest`

When canary was green but a problem surfaces after promotion, retag :latest to a prior digest:

export GITHUB_TOKEN=ghp_...    # write:packages
scripts/rollback-latest.sh 4c1d56e  # retags both platform + tenant images

scripts/rollback-latest.sh pre-checks that :staging-<sha> exists before moving :latest, and verifies the digest after the move. Prod tenants pick up the rolled-back image on their next 5-min auto-update.
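
The pre-check, move, and verify sequence described above can be sketched with `crane digest` and `crane tag`. This is an approximation of the behavior, not the script itself: `rollback_latest` and the image path passed to it are hypothetical, and the real script handles both the platform and tenant images.

```bash
# Sketch only: one image per call; the image path is supplied by the caller.

# rollback_latest <image> <sha>
rollback_latest() {
  rb_src="$1:staging-$2"
  # Pre-check: the source tag must resolve to a digest before :latest is touched.
  rb_want=$(crane digest "$rb_src") || { echo "missing $rb_src" >&2; return 1; }
  crane tag "$rb_src" latest || return 1
  # Verify: :latest must now resolve to the same digest.
  rb_got=$(crane digest "$1:latest") || return 1
  [ "$rb_got" = "$rb_want" ] || { echo "digest mismatch on $1:latest" >&2; return 1; }
  echo "$1:latest -> $rb_want"
}
```

Resolving the digest before the retag means a mistyped sha fails fast and leaves :latest where it was.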

A post-mortem should always include:

the commit sha that broke
why canary didn't catch it (new code path the smoke suite doesn't exercise?)
whether the smoke suite should grow a new check to prevent the same class of bug

What this gate doesn't catch

Bugs that only surface under prod-only data (customer workloads with scale or shape canary doesn't produce). Canary uses real traffic shapes but can't simulate weeks of accumulated state.
Config drift between canary and prod (different env-var values, different feature flags). Keep canary's config deltas minimal and documented.
Cross-tenant interactions — canary tenants run in their own AWS account, so a bug that only appears when two tenants compete for a shared resource won't reproduce here.

When these miss, rollback-latest.sh is the escape hatch.
