# SaaS secret rotation — runbook

Where each secret lives, why, and the **full rotation procedure** so a partial
update doesn't silently break production.

## Secret map

| Secret | Location(s) | Purpose |
|---|---|---|
| `FLY_API_TOKEN` | **(a)** `molecule-monorepo` GitHub Actions secret (push image to `registry.fly.io/molecule-tenant`) + **(b)** `fly secrets` on `<fly-app-name>` app (control plane creates + deletes tenant Fly Machines) | Any Fly Machines API call |
| `NEON_API_KEY` | `fly secrets` on `<fly-app-name>` | Create + delete tenant Neon branches |
| `DATABASE_URL` | `fly secrets` on `<fly-app-name>` | Control-plane Postgres connection (Neon `<neon-project-id>`) |
| `TENANT_REDIS_URL` | `fly secrets` on `<fly-app-name>` | Injected into every tenant container as `REDIS_URL` |
| `SECRETS_ENCRYPTION_KEY` | `fly secrets` on `<fly-app-name>` | AES-256 key wrapping tenant DB/Redis URLs in `org_instances` (provisioner + tenant use this) |
| `RESEND_API_KEY` | `fly secrets` on `<fly-app-name>` | Resend REST API token used by `internal/email.ResendProvider` — GDPR erasure confirmation today; welcome + plan-change emails later. Empty → `DisabledProvider` silently no-ops all sends |
| `RESEND_FROM_EMAIL` | `fly secrets` on `<fly-app-name>` | RFC-5322 From line, typically `"Molecule AI <noreply@moleculesai.app>"`. Must resolve to a Resend-verified domain or sends fail with `403 domain not verified` |
| `STRIPE_API_KEY` | `fly secrets` on `<fly-app-name>` | `sk_live_…` secret key used by `internal/billing.StripeProvider` for customer/subscription/checkout mutations + GDPR Art. 17 cascade |
| `STRIPE_WEBHOOK_SECRET` | `fly secrets` on `<fly-app-name>` | `whsec_…` used by `internal/billing.verifySignature` to reject forged webhook calls. Rotated independently from the API key — Stripe treats them as separate secrets |
| `GITHUB_TOKEN` | Built-in GitHub Actions token | GHCR push; rotated automatically |
| `ANTHROPIC_API_KEY` | **Global secret** via `PUT /settings/secrets` on each tenant platform instance | Default LLM provider (`MODEL_PROVIDER=anthropic`). Must be set as a **global** secret so it propagates to all workspace containers — workspace-level-only is not sufficient for SDK-direct workspaces (e.g. molecule-hitl). See [rotation procedure below](#anthropic_api_key). |

## Coupled secrets — MUST rotate together

`FLY_API_TOKEN` is the one secret duplicated across systems. Rotating **only
one** will cause **silent** breakage:

- Updating **only (a) GHA** → once the old token is revoked, the control plane's Fly API calls start erroring (`401`) and tenant provisioning fails, but image publishes keep succeeding so everything *looks* fine on the build side.
- Updating **only (b) Fly secrets** → once the old token is revoked, the image publish workflow fails with no alert, and the control plane keeps provisioning from the stale `latest` tag.

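One way to catch a half-finished rotation is to compare when each copy of the token was last updated. The following is a minimal sketch, assuming you have already collected both update times as Unix epoch seconds; the `secrets_in_sync` name and the 600-second default tolerance are illustrative assumptions, not part of the existing tooling.

```bash
# Hypothetical helper: succeeds when two update times are within a tolerance.
# Arguments: <epoch-seconds-a> <epoch-seconds-b> [max-drift-seconds]
secrets_in_sync() {
  local a="$1" b="$2" max="${3:-600}"
  # Absolute difference between the two timestamps, in seconds.
  local diff=$(( a > b ? a - b : b - a ))
  [ "$diff" -le "$max" ]
}
```

For example, `secrets_in_sync "$gha_updated" "$fly_updated" || echo "FLY_API_TOKEN copies have drifted"` would flag a rotation that reached only one location.
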
## Rotation procedure — FLY_API_TOKEN

1. Generate new token:
   ```
   flyctl tokens create deploy --name <fly-app-name>-rotation-$(date +%Y%m%d)
   ```
2. Update **both** locations (order matters — Fly secrets first, then GHA):
   ```
   # (b) Fly secrets — triggers zero-downtime redeploy
   flyctl secrets set --app <fly-app-name> FLY_API_TOKEN='FlyV1 fm2_...'

   # (a) GitHub Actions secret — next workflow run uses new token
   echo 'FlyV1 fm2_...' | gh secret set FLY_API_TOKEN --repo Molecule-AI/molecule-monorepo
   ```
3. Verify:
   ```
   # Control plane can reach Fly API:
   curl https://<fly-app-name>.fly.dev/health
   # Trigger image publish (dispatches workflow, pushes to both registries):
   gh workflow run publish-platform-image.yml --repo Molecule-AI/molecule-monorepo
   gh run list --repo Molecule-AI/molecule-monorepo --workflow publish-platform-image --limit 1
   ```
4. Revoke the old token:
   ```
   flyctl tokens list
   flyctl tokens revoke <id-of-old-token>
   ```

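Because both locations must receive the same value in a fixed order, step 2 can be wrapped in one function. This is a sketch under stated assumptions: the function name, its arguments, and the `DRY_RUN` switch are hypothetical; only the `flyctl secrets set` and `gh secret set` invocations come from the procedure above.

```bash
# Sketch: apply one token value to both locations, Fly secrets first, then GHA.
# DRY_RUN=1 (the default here) only prints the plan; set DRY_RUN=0 to execute.
# Arguments: <fly-app> <owner/repo> <token>
rotate_fly_api_token() {
  local app="$1" repo="$2" token="$3"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Never echo the token itself, even in dry-run mode.
    echo "flyctl secrets set --app $app FLY_API_TOKEN=<redacted>"
    echo "gh secret set FLY_API_TOKEN --repo $repo"
    return 0
  fi
  # (b) Fly secrets first; stop if this fails so GHA is not updated alone.
  flyctl secrets set --app "$app" "FLY_API_TOKEN=$token" || return 1
  # (a) GitHub Actions secret second; gh reads the value from stdin.
  printf '%s' "$token" | gh secret set FLY_API_TOKEN --repo "$repo"
}
```

Running it once with the default `DRY_RUN=1` shows the order of operations before anything is changed.
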
## Rotation procedure — NEON_API_KEY

1. Create replacement key in Neon console → Account Settings → API Keys.
2. Update Fly secrets:
   ```
   flyctl secrets set --app <fly-app-name> NEON_API_KEY='napi_...'
   ```
3. Trigger a test provision (dry run — create + delete):
   ```
   curl -X POST https://<fly-app-name>.fly.dev/cp/orgs \
     -H 'Content-Type: application/json' \
     -d '{"slug":"keytest-'$(date +%s)'","name":"Rotation test"}'
   # Wait 60s, inspect logs:
   flyctl logs --app <fly-app-name> --no-tail | tail -30
   # Clean up the test org via DELETE once live
   ```
4. Revoke old key in Neon console.

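The inline quoting in the step 3 `curl` body is easy to get wrong when edited by hand. A small helper can build the same payload; this is a sketch, and the `keytest_payload` name and its numeric-suffix check are illustrative assumptions.

```bash
# Hypothetical helper: build the JSON body for a throwaway test org.
# Argument: a numeric suffix (defaults to the current Unix time).
keytest_payload() {
  local suffix="${1:-$(date +%s)}"
  # Reject anything that is not purely digits so the JSON stays well-formed.
  case "$suffix" in
    ''|*[!0-9]*) echo "suffix must be numeric" >&2; return 1 ;;
  esac
  printf '{"slug":"keytest-%s","name":"Rotation test"}\n' "$suffix"
}
```

It can then be passed to the existing request as `-d "$(keytest_payload)"`.
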
## Rotation procedure — SECRETS_ENCRYPTION_KEY

**DANGEROUS**: rotating this key will invalidate every encrypted row in
`org_instances.database_url_encrypted` + `redis_url_encrypted`. Every tenant
becomes unreachable until re-provisioned.

Mitigation: we intentionally defer real KMS + key-rotation to Phase H. Until
then, **do not rotate this key unless compromised.** If it is compromised, the
procedure is:

1. Generate new key: `openssl rand -hex 32`
2. Set new key on `<fly-app-name>`.
3. For every row in `org_instances`: re-provision the tenant (creates fresh
   Neon branch + Fly machine). The old encrypted URLs are un-decryptable but
   irrelevant — we mint fresh ones.
4. Migration to rotate encrypted columns in-place (decrypt-with-old →
   encrypt-with-new) is Phase H work and requires envelope encryption with KMS.

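Before setting a freshly generated key, it can help to confirm it has the expected shape. A minimal sketch, assuming the application consumes the hex string exactly as `openssl rand -hex 32` prints it (64 lowercase hex characters, i.e. 32 bytes); the function name is hypothetical.

```bash
# Hypothetical check: accept only a 64-character lowercase hex string,
# which is what `openssl rand -hex 32` emits (32 bytes for AES-256).
is_valid_encryption_key() {
  local key="$1"
  [ "${#key}" -eq 64 ] || return 1
  case "$key" in
    *[!0-9a-f]*) return 1 ;;   # any non-hex character disqualifies the key
  esac
  return 0
}
```

A possible use is `key="$(openssl rand -hex 32)"; is_valid_encryption_key "$key" || echo "unexpected key format"`.
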
## Rotation procedure — DATABASE_URL (control plane)

The Neon `<fly-app-name>` project has a stable primary endpoint. Rotate only if:
- Neon forces a migration
- The connection-URI password is leaked

Procedure: regenerate URI via Neon API → `flyctl secrets set DATABASE_URL=...`.
Zero-downtime (Fly applies secret via rolling restart).

## Rotation procedure — RESEND_API_KEY

Low-blast-radius rotation — the only consumer is the transactional-email
path and sends fail loudly (the cascade logs `purge confirmation email
failed`) without breaking user-facing flows.

1. In Resend dashboard → API Keys → create a new key scoped to
   "`<fly-app-name>` production", e.g. name
   `<fly-app-name>-rotation-$(date +%Y%m%d)`.
2. Stage the replacement on Fly (not immediately live):
   ```
   flyctl secrets set --app <fly-app-name> \
     --stage RESEND_API_KEY='re_...'
   ```
   `--stage` holds the secret for the next deploy instead of restarting
   machines immediately. Skip `--stage` if you want a rolling restart
   right now.
3. Redeploy (or wait for the next image publish) — machines pick up the
   new key.
4. Trigger a real send to verify: delete a disposable test org via
   `DELETE /cp/orgs/test-rotate` and confirm the Resend dashboard shows
   the event in Emails → Logs within a minute.
5. Revoke the old key in the Resend dashboard.

### Blast-radius note

The GDPR Art. 17 cascade sends a best-effort confirmation email after
purge succeeds; a failed send is logged but does **not** flip the 204
response (purge data is already gone). This means a broken
`RESEND_API_KEY` skips confirmation emails with no user-visible error,
so monitor the `purge confirmation email failed` log line after any
rotation.

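Monitoring that log line after a rotation can be as simple as counting matches in recent log output. A minimal sketch, assuming the message appears verbatim in the log text; the helper name is hypothetical.

```bash
# Hypothetical helper: count failed purge-confirmation sends in log text on stdin.
count_purge_email_failures() {
  # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
  grep -c 'purge confirmation email failed' || true
}
```

For instance, `flyctl logs --app <fly-app-name> --no-tail | count_purge_email_failures` should print `0` after a healthy rotation and a verification send.
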
### Domain verification

`RESEND_FROM_EMAIL` must come from a Resend-verified domain or every
send returns `403 domain not verified`. Domain verification lives in
Resend dashboard → Domains → Add Domain; Resend gives you 3 DNS records
(SPF, DKIM, DMARC) to add to the DNS provider for `moleculesai.app`.
**Do not rotate the From address without confirming the new domain is
verified** — there's no server-side check at deploy time.

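Since nothing checks the From address at deploy time, a pre-flight comparison against the domain you know to be verified can be done locally. This sketch only parses the From line; the `from_domain` name is an assumption, and whether the domain is actually verified still has to be confirmed in the Resend dashboard.

```bash
# Hypothetical helper: extract the domain from a From line such as
# "Molecule AI <noreply@moleculesai.app>" or from a bare address.
from_domain() {
  local from="$1" addr
  addr="${from##*<}"   # drop everything up to the last '<', if present
  addr="${addr%%>*}"   # drop the closing '>' and anything after it
  case "$addr" in
    *@*) printf '%s\n' "${addr##*@}" ;;
    *)   return 1 ;;   # no '@', so this is not an address
  esac
}
```

A possible check before setting the secret: `[ "$(from_domain "$NEW_FROM")" = "moleculesai.app" ] || echo "From domain differs from the verified one"`, where `NEW_FROM` is a hypothetical variable holding the candidate value.
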
## Rotation procedure — STRIPE_API_KEY + STRIPE_WEBHOOK_SECRET

These are independent Stripe secrets. Rotating one does **not** affect
the other — they can be rotated on separate schedules.

1. Stripe dashboard → Developers → API keys → **Roll key** on the live
   secret key. Stripe gives you a new `sk_live_…`.
2. Stage on Fly:
   ```
   flyctl secrets set --app <fly-app-name> \
     --stage STRIPE_API_KEY='sk_live_...'
   ```
3. Redeploy, then verify: hit
   `https://<fly-app-name>.fly.dev/cp/billing/checkout` from an authenticated
   test session and confirm the returned checkout URL redirects to a
   valid Stripe-hosted page.
4. Stripe revokes the old key automatically once the expiration chosen
   during the roll elapses, so there is no manual revoke step.

For `STRIPE_WEBHOOK_SECRET`:

1. Stripe dashboard → Developers → Webhooks → the `<fly-app-name>` endpoint →
   **Roll secret**.
2. Stripe shows you BOTH old and new secret for a 24-hour overlap window.
   Copy the new `whsec_…`.
3. Stage + deploy on Fly as above.
4. Inside the overlap window, send a Stripe CLI test event:
   ```
   # Stripe delivers the triggered event to the webhook endpoints registered
   # for the account and mode the CLI is logged in to.
   stripe trigger customer.subscription.updated
   ```
   If the signature-verification layer accepts it (no `400 invalid
   signature` in Fly logs), the new secret is live.
5. Wait for the overlap window to expire or click "Delete old secret"
   in Stripe dashboard.

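Because the two Stripe secrets are rotated separately, it is easy to paste one into the other's slot. A prefix check before `flyctl secrets set` can catch that. This is a sketch; the function name and output labels are hypothetical, and the `sk_test_` case is included only to flag a test-mode key being staged for production.

```bash
# Hypothetical helper: classify a Stripe secret by its prefix.
stripe_secret_kind() {
  case "$1" in
    sk_live_*) echo "live-api-key" ;;
    sk_test_*) echo "test-api-key" ;;
    whsec_*)   echo "webhook-secret" ;;
    *)         echo "unknown"; return 1 ;;
  esac
}
```

For example, `[ "$(stripe_secret_kind "$NEW_VALUE")" = "webhook-secret" ]` could guard the `STRIPE_WEBHOOK_SECRET` update, with `NEW_VALUE` standing in for wherever the operator holds the new secret.
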
## Rotation procedure — ANTHROPIC_API_KEY

This key is set as a **platform global secret** (not a Fly secret). It propagates
automatically to every non-paused workspace container via the Phase 15 global-secrets
fan-out (`PUT /settings/secrets` triggers auto-restart of all affected workspaces).

Per-workspace overrides (e.g. a workspace with its own `ANTHROPIC_API_KEY` secret)
shadow the global value — the per-workspace value takes precedence.

1. Generate a new key at [console.anthropic.com](https://console.anthropic.com) →
   API Keys → Create key. Name it `molecule-<env>-rotation-$(date +%Y%m%d)`.

2. Set the new key as a global secret on each platform instance:
   ```bash
   # Self-hosted (local/staging)
   curl -X PUT http://localhost:8080/settings/secrets \
     -H "Authorization: Bearer $ADMIN_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"key":"ANTHROPIC_API_KEY","value":"sk-ant-api03-..."}'

   # SaaS control plane — set on the tenant platform via control-plane API
   # (details TBD when <fly-app-name> exposes a /cp/orgs/:id/secrets endpoint)
   ```
   The platform auto-restarts every non-paused workspace on set.

3. Verify: restart one workspace and confirm it starts up without 401 errors:
   ```bash
   curl -X POST http://localhost:8080/workspaces/$WORKSPACE_ID/restart \
     -H "Authorization: Bearer $ADMIN_TOKEN"
   # Watch logs — no "401 unauthorized" from Anthropic SDK should appear
   ```

4. Revoke the old key in the Anthropic console once all workspaces have restarted.

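To avoid typing the key into the JSON body by hand in step 2, the payload can be assembled from a variable. A minimal sketch, assuming keys start with `sk-ant-` (as in the example above) and contain only letters, digits, `-`, and `_`; the helper name is hypothetical.

```bash
# Hypothetical helper: build the PUT /settings/secrets body for ANTHROPIC_API_KEY.
anthropic_secret_payload() {
  local value="$1"
  case "$value" in
    sk-ant-*) ;;
    *) echo "value does not look like an Anthropic key" >&2; return 1 ;;
  esac
  # Refuse characters that would need JSON escaping.
  case "$value" in
    *[!A-Za-z0-9_-]*) echo "unexpected character in key" >&2; return 1 ;;
  esac
  printf '{"key":"ANTHROPIC_API_KEY","value":"%s"}\n' "$value"
}
```

The result can be passed to the existing request as `-d "$(anthropic_secret_payload "$NEW_KEY")"`, where `NEW_KEY` is a hypothetical variable holding the new key.
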
### Blast-radius note

Rotating `ANTHROPIC_API_KEY` restarts **every non-paused workspace** on the
instance. Schedule rotation during low-traffic windows. Paused workspaces pick
up the new key when they are next resumed (secrets are injected at container
start, not from the running container env).

## Emergency contacts

- **Fly**: billing dashboard at fly.io → Support
- **Neon**: console.neon.tech → Support
- **Upstash**: upstash.com → Support
- **Resend**: resend.com/dashboard → Help (email-only support, ~24h turnaround)
- **Stripe**: stripe.com/support → live chat
- **GHCR**: github.com/orgs/Molecule-AI (org admins)