diff --git a/CLAUDE.md b/CLAUDE.md index e14c0ee7..550cf637 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -23,6 +23,13 @@ secrets` on `molecule-cp`), the correct rotation order, and danger cases — notably `SECRETS_ENCRYPTION_KEY`, which cannot be rotated without a data migration until Phase H lands KMS envelope encryption. +When handling a GDPR erasure request (user asks "delete my org and all +my data"), read **`docs/runbooks/gdpr-erasure.md`** first. It explains the +4-step cascade in `molecule-controlplane` (Stripe → Redis → Infra → DB +rows), how to read the `org_purges` audit table, how to resume a failed +purge, and what the cascade deliberately does NOT cover (WorkOS users, +LLM provider history, Langfuse traces). + ## Agent operating rules (auto-loaded — read first) The following are project-level rules that override default behavior. They diff --git a/docs/runbooks/gdpr-erasure.md b/docs/runbooks/gdpr-erasure.md new file mode 100644 index 00000000..ea1f090b --- /dev/null +++ b/docs/runbooks/gdpr-erasure.md @@ -0,0 +1,106 @@ +# GDPR Art. 17 hard-delete cascade + +Operational reference for the "delete my org" flow in `molecule-controlplane`. +Skim this before replying to an erasure request, answering a DPA (Data +Processing Addendum) audit, or debugging a failed purge. + +## What Art. 17 actually requires + +The EU General Data Protection Regulation, Article 17 ("right to erasure" / +"right to be forgotten") says: when a user asks us to delete their personal +data, we must do so within 30 days and destroy **every copy** we control — +including copies held by our sub-processors (Stripe, Fly, Neon, Upstash, +Vercel, WorkOS). Soft-delete is not compliant. A database row with +`deleted_at IS NOT NULL` still counts as "data we process" under the GDPR +definition. + +## What the cascade actually does + +`DELETE /cp/orgs/:slug` triggers `handlers.executeOrgPurge`, which walks +four steps in order: + +| # | Step | Action | Idempotent? | +|---|------|--------|-------------| +| 1 | `stripe` | List every subscription on `cus_*`, DELETE each, then DELETE the customer record. Stripe retains deleted customers for ~30 days per their internal policy — we do not control that window | Yes (404 → success) | +| 2 | `redis` | Upstash REST `scan` + `del` against pattern `:*` | Yes (empty scan = no-op) | +| 3 | `infra` | `Provisioner.DeprovisionInstance` → Fly Machine destroy + Neon branch delete + Vercel subdomain removal | Yes at each sub-step | +| 4 | `db_rows` | One transaction: `DELETE FROM org_instances` → `DELETE FROM org_members` → `DELETE FROM organizations` | Atomic | + +Each successful step writes `org_purges.last_step`. The orchestrator reads +`last_step` on entry and **skips every step at or before it** — so a retried +DELETE resumes from the first unfinished step instead of repeating Stripe +cancellations or Redis scans. + +The `org_purges` audit row outlives the deleted org on purpose — `org_id` is +NOT a foreign key. Auditors or support staff can still answer "when was +acme.moleculesai.app deleted and did it succeed?" three months later. + +## When a purge fails mid-cascade + +The API returns `500` with a JSON body: + +```json +{ + "error": "purge cascade failed; retry the request to resume", + "purge_id": "" +} +``` + +What to do: + +1. **Inspect the audit row** — `SELECT status, last_step, last_error, attempts + FROM org_purges WHERE id = ''`. That tells you which step blew + up and why. +2. **Fix the underlying cause** if it's ours (Stripe API key rotation, + Upstash network blip, Fly API 500). +3. **Re-issue the DELETE** — the handler picks up from `last_step + 1`. No + manual DB surgery is needed in the happy path. +4. **If the step that failed is `db_rows`** — the transaction rolled back, so + the org is still fully intact. Retry is safe. +5. **If the step that failed is `infra`** — check Fly + Neon + Vercel + dashboards before retrying. A half-destroyed Fly Machine won't block the + retry (DeprovisionInstance is idempotent), but it's worth confirming the + resource actually went away. + +## 30-day deadline + +GDPR gives us one calendar month to complete erasure from the request date. +The cascade runs synchronously and typically finishes in <15 seconds, so +latency is not the concern — **unattended failure** is. If an `org_purges` +row sits in `status='failed'` for more than 24h, that's the operator's cue +to intervene. A future Phase H task will add a cron that pings Slack when +any purge row is older than 48h without hitting `completed`. + +## What this cascade does NOT do + +- **It does not delete WorkOS user records.** WorkOS Users are org-scoped + (a user can belong to multiple orgs), and we don't own enough lifecycle + signal to decide when to purge the underlying user account. When the last + org containing a user is erased, the WorkOS user will be orphaned. Phase + H.2 adds a sweep to reconcile. +- **It does not delete LLM provider history.** Agent conversations that + used OpenAI / Anthropic / OpenRouter may still appear in the provider's + own retention window. Our DPAs with those vendors cap that at 30 days; we + do not expose a hook to accelerate it. +- **It does not delete Langfuse traces** for self-hosted Langfuse. In + production we forward traces to Langfuse Cloud which has its own + retention policy — check `LANGFUSE_HOST` in the env before claiming + compliance. + +## Testing the cascade + +See the test plan in [PR #29](https://github.com/Molecule-AI/molecule-controlplane/pull/29) +for the staging checklist. The unit tests cover the orchestrator logic +(happy path, resume-from-step, Stripe failure, no-customer); end-to-end +proof requires a real Stripe test-mode customer + provisioned Fly Machine +because the failure modes that matter are transport errors, not logic. + +## Related + +- `docs/runbooks/saas-secrets.md` — if a cascade fails with "invalid API + key" the relevant secret probably rotated +- `docs/runbooks/admin-auth.md` — `DELETE /cp/orgs/:slug` is behind + session-cookie auth in controlplane, not the workspace bearer-token + middleware documented there +- `molecule-controlplane/internal/handlers/purge.go` — the orchestrator +- `molecule-controlplane/migrations/006_org_purges.*.sql` — audit schema