molecule-core/runbooks/production-auto-deploy.md
hongming-codex-laptop 782eaf2e80
ci: auto deploy production tenants after green main
2026-05-13 04:51:43 -07:00


Production Auto-Deploy

molecule-core deploys production tenant code automatically from Gitea Actions.

This runbook is an implementation-specific companion to runbooks/sop-production-cicd.md.

Default Flow

On a push to main that touches deployable code, .gitea/workflows/publish-workspace-server-image.yml:

  1. Builds and pushes platform and tenant ECR images tagged staging-<sha> and staging-latest.
  2. Self-tests the production deploy helper and workflow-YAML linter.
  3. Waits for the strict required push contexts on the same commit to report success.
  4. Calls production control-plane POST /cp/admin/tenants/redeploy-fleet with target_tag=staging-<sha>.
  5. Verifies every redeploy result is healthy and every tenant returns the same Git SHA from /buildinfo.
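Steps 4 and 5 can be sketched as two small pure functions. This is a minimal illustration, not the workflow's actual helper: only the endpoint and the target_tag field come from this runbook, while the TenantResult shape (slug, healthy, build_sha) and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantResult:
    """Hypothetical per-tenant record assembled from the redeploy response and /buildinfo."""
    slug: str        # assumed tenant identifier
    healthy: bool    # assumed health flag from the redeploy result
    build_sha: str   # Git SHA the tenant reports from /buildinfo


def build_redeploy_payload(sha: str) -> dict:
    """Body for POST /cp/admin/tenants/redeploy-fleet; only target_tag is documented."""
    return {"target_tag": f"staging-{sha}"}


def verify_fleet(results: list[TenantResult], expected_sha: str) -> list[str]:
    """Return a list of problems; an empty list means every tenant verified."""
    problems = []
    if not results:
        # An empty response is treated as a failure rather than a silent pass.
        problems.append("no tenants in redeploy response")
    for r in results:
        if not r.healthy:
            problems.append(f"{r.slug}: unhealthy")
        elif r.build_sha != expected_sha:
            problems.append(f"{r.slug}: buildinfo {r.build_sha} != {expected_sha}")
    return problems
```

Returning a list of problems instead of a boolean lets the job log name every failing tenant before it exits non-zero.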

The deploy workflow intentionally does not use Gitea concurrency because Gitea 1.22.6 can cancel queued runs even when cancel-in-progress: false is set.

Kill Switch

Set either a repository variable or a secret:

PROD_AUTO_DEPLOY_DISABLED=true

The image publish still runs, but the production redeploy step exits successfully without touching tenants. Immediately before the production POST, the workflow re-checks the live Gitea repo variable when PROD_AUTO_DEPLOY_CONTROL_TOKEN can read Actions variables. If that token is not configured, the job-start value is still honored.
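The two-stage check can be sketched as follows. This is an illustration under stated assumptions, not the workflow's code: the function names are hypothetical, the case-insensitive match on "true" is an assumption, and so is the rule that either value being true disables the deploy.

```python
from typing import Optional


def is_truthy(value: Optional[str]) -> bool:
    """Assumption: only the string 'true' (any case, surrounding spaces ignored) enables the switch."""
    return value is not None and value.strip().lower() == "true"


def deploy_disabled(job_start_value: Optional[str],
                    live_value: Optional[str] = None) -> bool:
    """Decide whether the production POST must be skipped.

    job_start_value: PROD_AUTO_DEPLOY_DISABLED as seen when the job started.
    live_value: the variable re-read just before the POST, or None when
                PROD_AUTO_DEPLOY_CONTROL_TOKEN is not configured.
    """
    # Either source can stop the deploy; without a live value,
    # the job-start value alone decides.
    return is_truthy(job_start_value) or is_truthy(live_value)
```

For example, deploy_disabled("false", "true") returns True, which models an operator flipping the switch after the job has already started.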

Tunables

Repository variables:

PROD_CP_URL=https://api.moleculesai.app
PROD_AUTO_DEPLOY_CANARY_SLUG=hongming
PROD_AUTO_DEPLOY_SOAK_SECONDS=60
PROD_AUTO_DEPLOY_BATCH_SIZE=3
PROD_AUTO_DEPLOY_DRY_RUN=false
PROD_MANUAL_REDEPLOY_TARGET_TAG=staging-<known-good-sha>
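A deploy helper might load these variables along the following lines. This is a sketch only: the DeployTunables class and load_tunables function are hypothetical, and treating the values listed above as fallbacks when a variable is unset is an assumption, not documented behavior.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DeployTunables:
    cp_url: str
    canary_slug: str
    soak_seconds: int
    batch_size: int
    dry_run: bool


def load_tunables(env: Mapping[str, str]) -> DeployTunables:
    """Read the PROD_* variables from a mapping such as os.environ."""
    soak = int(env.get("PROD_AUTO_DEPLOY_SOAK_SECONDS", "60"))
    batch = int(env.get("PROD_AUTO_DEPLOY_BATCH_SIZE", "3"))
    if soak < 0:
        raise ValueError("PROD_AUTO_DEPLOY_SOAK_SECONDS must be >= 0")
    if batch < 1:
        raise ValueError("PROD_AUTO_DEPLOY_BATCH_SIZE must be >= 1")
    return DeployTunables(
        # Strip a trailing slash so later path joins do not produce '//'.
        cp_url=env.get("PROD_CP_URL", "https://api.moleculesai.app").rstrip("/"),
        canary_slug=env.get("PROD_AUTO_DEPLOY_CANARY_SLUG", "hongming"),
        soak_seconds=soak,
        batch_size=batch,
        dry_run=env.get("PROD_AUTO_DEPLOY_DRY_RUN", "false").strip().lower() == "true",
    )
```

Validating the numeric values up front makes a mistyped variable fail the job before any tenant is touched.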

Secrets required:

CP_ADMIN_API_TOKEN
AUTO_SYNC_TOKEN
PROD_AUTO_DEPLOY_CONTROL_TOKEN
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

AUTO_SYNC_TOKEN is only used to read Gitea commit statuses while waiting for required push contexts. PROD_AUTO_DEPLOY_CONTROL_TOKEN is optional but recommended so the pre-POST kill-switch check can read the live PROD_AUTO_DEPLOY_DISABLED Actions variable.
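The gate that AUTO_SYNC_TOKEN supports can be reduced to one decision function. The sketch below assumes the caller has already fetched the commit statuses and collapsed them into a mapping from context name to latest state. The function name, the context names in the usage note, and the state strings other than success are assumptions.

```python
from typing import Iterable, Mapping


def evaluate_required_contexts(required: Iterable[str],
                               latest: Mapping[str, str]) -> str:
    """Return 'success', 'pending', or 'failure' for a commit.

    required: context names that must be green before deploying.
    latest:   latest known state per context for the same commit.
    """
    pending = False
    for context in required:
        state = latest.get(context)
        if state == "success":
            continue
        if state is None or state == "pending":
            # A required context that has not reported yet keeps the gate closed.
            pending = True
        else:
            # Any other state (for example failure or error) blocks the deploy.
            return "failure"
    return "pending" if pending else "success"
```

A polling loop would call this after each status fetch, sleep while it returns pending, and proceed to the production POST only on success. Contexts that are not in the required list never affect the outcome.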

Manual Fallback

Use .gitea/workflows/redeploy-tenants-on-main.yml when the automatic path needs to be rerun or rolled back. Gitea 1.22.6 does not support reliable workflow_dispatch inputs, so rollback uses a repo variable:

  1. Set PROD_MANUAL_REDEPLOY_TARGET_TAG=staging-<known-good-sha>.
  2. Dispatch manual-redeploy-tenants-on-main.
  3. Clear PROD_MANUAL_REDEPLOY_TARGET_TAG after the rollback finishes.

With no variable set, the fallback redeploys staging-<current-main-sha>.
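The tag selection described above can be sketched as below. The function name is hypothetical, and rejecting values without the staging- prefix is an added safety assumption rather than documented behavior.

```python
from typing import Optional


def resolve_target_tag(manual_tag: Optional[str], current_main_sha: str) -> str:
    """Pick the image tag for the manual fallback workflow.

    manual_tag: value of PROD_MANUAL_REDEPLOY_TARGET_TAG, or None/empty when unset.
    current_main_sha: commit SHA at the tip of main.
    """
    tag = (manual_tag or "").strip()
    if not tag:
        # No override: redeploy the current main commit.
        return f"staging-{current_main_sha}"
    if not tag.startswith("staging-") or tag == "staging-":
        # Assumption: production only deploys staging-<sha> style tags.
        raise ValueError(f"unexpected target tag: {tag!r}")
    return tag
```

Treating an empty string the same as an unset variable matters for step 3: an operator who clears the variable to an empty value still gets the default behavior.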