Merge pull request #242 from Molecule-AI/docs/gdpr-erasure-runbook

docs: GDPR Art. 17 erasure runbook
2026-04-15 13:49:28 -07:00 · 2026-04-15 13:49:28 -07:00 · b76f9dbcdb
commit b76f9dbcdb
parent 5d7deb9363 9b82bce7ef
2 changed files with 113 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@ -23,6 +23,13 @@ secrets` on `molecule-cp`), the correct rotation order, and danger cases —
 notably `SECRETS_ENCRYPTION_KEY`, which cannot be rotated without a data
 migration until Phase H lands KMS envelope encryption.

+When handling a GDPR erasure request (user asks "delete my org and all
+my data"), read **`docs/runbooks/gdpr-erasure.md`** first. It explains the
+4-step cascade in `molecule-controlplane` (Stripe → Redis → Infra → DB
+rows), how to read the `org_purges` audit table, how to resume a failed
+purge, and what the cascade deliberately does NOT cover (WorkOS users,
+LLM provider history, Langfuse traces).
+
 ## Agent operating rules (auto-loaded — read first)

 The following are project-level rules that override default behavior. They
--- a/docs/runbooks/gdpr-erasure.md
+++ b/docs/runbooks/gdpr-erasure.md
@ -0,0 +1,106 @@
+# GDPR Art. 17 hard-delete cascade
+
+Operational reference for the "delete my org" flow in `molecule-controlplane`.
+Skim this before replying to an erasure request, answering a DPA (Data
+Processing Addendum) audit, or debugging a failed purge.
+
+## What Art. 17 actually requires
+
+The EU General Data Protection Regulation, Article 17 ("right to erasure" /
+"right to be forgotten") says: when a user asks us to delete their personal
+data, we must do so within 30 days and destroy **every copy** we control —
+including copies held by our sub-processors (Stripe, Fly, Neon, Upstash,
+Vercel, WorkOS). Soft-delete is not compliant. A database row with
+`deleted_at IS NOT NULL` still counts as "data we process" under the GDPR
+definition.
+
+## What the cascade actually does
+
+`DELETE /cp/orgs/:slug` triggers `handlers.executeOrgPurge`, which walks
+four steps in order:
+
+| # | Step | Action | Idempotent? |
+|---|------|--------|-------------|
+| 1 | `stripe` | List every subscription on `cus_*`, DELETE each, then DELETE the customer record. Stripe retains deleted customers for ~30 days per their internal policy — we do not control that window | Yes (404 → success) |
+| 2 | `redis` | Upstash REST `scan` + `del` against pattern `<org_slug>:*` | Yes (empty scan = no-op) |
+| 3 | `infra` | `Provisioner.DeprovisionInstance` → Fly Machine destroy + Neon branch delete + Vercel subdomain removal | Yes at each sub-step |
+| 4 | `db_rows` | One transaction: `DELETE FROM org_instances` → `DELETE FROM org_members` → `DELETE FROM organizations` | Atomic |
+
+Each successful step writes `org_purges.last_step`. The orchestrator reads
+`last_step` on entry and **skips every step at or before it** — so a retried
+DELETE resumes from the first unfinished step instead of repeating Stripe
+cancellations or Redis scans.
+
+The `org_purges` audit row outlives the deleted org on purpose — `org_id` is
+NOT a foreign key. Auditors or support staff can still answer "when was
+acme.moleculesai.app deleted and did it succeed?" three months later.
+
+## When a purge fails mid-cascade
+
+The API returns `500` with a JSON body:
+
+```json
+{
+  "error":    "purge cascade failed; retry the request to resume",
+  "purge_id": "<uuid>"
+}
+```
+
+What to do:
+
+1. **Inspect the audit row** — `SELECT status, last_step, last_error, attempts
+   FROM org_purges WHERE id = '<purge_id>'`. That tells you which step blew
+   up and why.
+2. **Fix the underlying cause** if it's ours (Stripe API key rotation,
+   Upstash network blip, Fly API 500).
+3. **Re-issue the DELETE** — the handler picks up from `last_step + 1`. No
+   manual DB surgery is needed in the happy path.
+4. **If the step that failed is `db_rows`** — the transaction rolled back, so
+   the org is still fully intact. Retry is safe.
+5. **If the step that failed is `infra`** — check Fly + Neon + Vercel
+   dashboards before retrying. A half-destroyed Fly Machine won't block the
+   retry (DeprovisionInstance is idempotent), but it's worth confirming the
+   resource actually went away.
+
+## 30-day deadline
+
+GDPR gives us one calendar month to complete erasure from the request date.
+The cascade runs synchronously and typically finishes in <15 seconds, so
+latency is not the concern — **unattended failure** is. If an `org_purges`
+row sits in `status='failed'` for more than 24h, that's the operator's cue
+to intervene. A future Phase H task will add a cron that pings Slack when
+any purge row is older than 48h without hitting `completed`.
+
+## What this cascade does NOT do
+
+- **It does not delete WorkOS user records.** WorkOS Users are org-scoped
+  (a user can belong to multiple orgs), and we don't own enough lifecycle
+  signal to decide when to purge the underlying user account. When the last
+  org containing a user is erased, the WorkOS user will be orphaned. Phase
+  H.2 adds a sweep to reconcile.
+- **It does not delete LLM provider history.** Agent conversations that
+  used OpenAI / Anthropic / OpenRouter may still appear in the provider's
+  own retention window. Our DPAs with those vendors cap that at 30 days; we
+  do not expose a hook to accelerate it.
+- **It does not delete Langfuse traces** for self-hosted Langfuse. In
+  production we forward traces to Langfuse Cloud which has its own
+  retention policy — check `LANGFUSE_HOST` in the env before claiming
+  compliance.
+
+## Testing the cascade
+
+See the test plan in [PR #29](https://github.com/Molecule-AI/molecule-controlplane/pull/29)
+for the staging checklist. The unit tests cover the orchestrator logic
+(happy path, resume-from-step, Stripe failure, no-customer); end-to-end
+proof requires a real Stripe test-mode customer + provisioned Fly Machine
+because the failure modes that matter are transport errors, not logic.
+
+## Related
+
+- `docs/runbooks/saas-secrets.md` — if a cascade fails with "invalid API
+  key" the relevant secret probably rotated
+- `docs/runbooks/admin-auth.md` — `DELETE /cp/orgs/:slug` is behind
+  session-cookie auth in controlplane, not the workspace bearer-token
+  middleware documented there
+- `molecule-controlplane/internal/handlers/purge.go` — the orchestrator
+- `molecule-controlplane/migrations/006_org_purges.*.sql` — audit schema