forked from molecule-ai/molecule-core
Addresses PR #82 code review: 🟡×3 + 🔵×5. - Fly registry login username: 'x' → 'molecule-ai' + explanatory comment. - Build & push split into two steps (GHCR / Fly registry) so a single- registry outage can't fail the other. Second step uses 'if: always()' to ensure Fly mirror runs even if GHCR push flakes. - docs/runbooks/saas-secrets.md: full secret map + rotation procedures for every SaaS credential, with danger-case callouts. Documents the coupled FLY_API_TOKEN (lives in GHA secret AND fly secrets — must be rotated in both). - CLAUDE.md: new 'SaaS ops' section linking to the runbook.
103 lines
4.4 KiB
Markdown
103 lines
4.4 KiB
Markdown
# SaaS secret rotation — runbook
|
|
|
|
Where each secret lives, why, and the **full rotation procedure** so a partial
|
|
update doesn't silently break production.
|
|
|
|
## Secret map
|
|
|
|
| Secret | Location(s) | Purpose |
|
|
|---|---|---|
|
|
| `FLY_API_TOKEN` | **(a)** `molecule-monorepo` GitHub Actions secret (push image to `registry.fly.io/molecule-tenant`) + **(b)** `fly secrets` on `molecule-cp` app (control plane creates + deletes tenant Fly Machines) | Any Fly Machines API call |
|
|
| `NEON_API_KEY` | `fly secrets` on `molecule-cp` | Create + delete tenant Neon branches |
|
|
| `DATABASE_URL` | `fly secrets` on `molecule-cp` | Control-plane Postgres connection (Neon `cool-sea-89357706`) |
|
|
| `TENANT_REDIS_URL` | `fly secrets` on `molecule-cp` | Injected into every tenant container as `REDIS_URL` |
|
|
| `SECRETS_ENCRYPTION_KEY` | `fly secrets` on `molecule-cp` | AES-256 key wrapping tenant DB/Redis URLs in `org_instances` (provisioner + tenant use this) |
|
|
| `GITHUB_TOKEN` | Built-in GitHub Actions token | GHCR push; rotated automatically |
|
|
|
|
## Coupled secrets — MUST rotate together
|
|
|
|
`FLY_API_TOKEN` is the one secret duplicated across systems. Rotating **only
|
|
one** will cause **silent** breakage:
|
|
|
|
- Rotating **only (a) GHA** → image publish workflow fails, but no alert; control plane keeps provisioning from the stale `latest` tag.
|
|
- Rotating **only (b) Fly secrets** → control plane's Fly API calls start erroring (`401`), tenant provisioning fails, but image publishes keep succeeding so everything *looks* fine on the build side.
|
|
|
|
## Rotation procedure — FLY_API_TOKEN
|
|
|
|
1. Generate new token:
|
|
```
|
|
flyctl tokens create deploy --name molecule-cp-rotation-$(date +%Y%m%d)
|
|
```
|
|
2. Update **both** locations (order matters — Fly secrets first, then GHA):
|
|
```
|
|
# (b) Fly secrets — triggers zero-downtime redeploy
|
|
flyctl secrets set --app molecule-cp FLY_API_TOKEN='FlyV1 fm2_...'
|
|
|
|
# (a) GitHub Actions secret — next workflow run uses new token
|
|
echo 'FlyV1 fm2_...' | gh secret set FLY_API_TOKEN --repo Molecule-AI/molecule-monorepo
|
|
```
|
|
3. Verify:
|
|
```
|
|
# Control plane can reach Fly API:
|
|
curl https://molecule-cp.fly.dev/health
|
|
# Trigger image publish (dispatches workflow, pushes to both registries):
|
|
gh workflow run publish-platform-image.yml --repo Molecule-AI/molecule-monorepo
|
|
gh run list --repo Molecule-AI/molecule-monorepo --workflow publish-platform-image --limit 1
|
|
```
|
|
4. Revoke the old token:
|
|
```
|
|
flyctl tokens list
|
|
flyctl tokens revoke <id-of-old-token>
|
|
```
|
|
|
|
## Rotation procedure — NEON_API_KEY
|
|
|
|
1. Create replacement key in Neon console → Account Settings → API Keys.
|
|
2. Update Fly secrets:
|
|
```
|
|
flyctl secrets set --app molecule-cp NEON_API_KEY='napi_...'
|
|
```
|
|
3. Trigger a test provision (dry run — create + delete):
|
|
```
|
|
curl -X POST https://molecule-cp.fly.dev/cp/orgs \
|
|
-H 'Content-Type: application/json' \
|
|
-d '{"slug":"keytest-'$(date +%s)'","name":"Rotation test"}'
|
|
# Wait 60s, inspect logs:
|
|
flyctl logs --app molecule-cp --no-tail | tail -30
|
|
# Clean up the test org via DELETE once live
|
|
```
|
|
4. Revoke old key in Neon console.
|
|
|
|
## Rotation procedure — SECRETS_ENCRYPTION_KEY
|
|
|
|
**DANGEROUS**: rotating this key will invalidate every encrypted row in
|
|
`org_instances.database_url_encrypted` + `redis_url_encrypted`. Every tenant
|
|
becomes unreachable until re-provisioned.
|
|
|
|
Mitigation: we intentionally defer real KMS + key-rotation to Phase H. Until
|
|
then, **do not rotate this key unless compromised.** If compromise, procedure is:
|
|
|
|
1. Generate new key: `openssl rand -hex 32`
|
|
2. Set new key on `molecule-cp`.
|
|
3. For every row in `org_instances`: re-provision the tenant (creates fresh
|
|
Neon branch + Fly machine). The old encrypted URLs are un-decryptable but
|
|
irrelevant — we mint fresh ones.
|
|
4. Migration to rotate encrypted columns in-place (decrypt-with-old → encrypt-
|
|
with-new) is Phase H work and requires envelope encryption with KMS.
|
|
|
|
## Rotation procedure — DATABASE_URL (control plane)
|
|
|
|
The Neon `molecule-cp` project has a stable primary endpoint. Rotate only if:
|
|
- Neon forces a migration
|
|
- The connection-URI password is leaked
|
|
|
|
Procedure: regenerate URI via Neon API → `flyctl secrets set DATABASE_URL=...`.
|
|
Zero-downtime (Fly applies secret via rolling restart).
|
|
|
|
## Emergency contacts
|
|
|
|
- **Fly**: billing dashboard at fly.io → Support
|
|
- **Neon**: console.neon.tech → Support
|
|
- **Upstash**: upstash.com → Support
|
|
- **GHCR**: github.com/orgs/Molecule-AI (org admins)
|