API safety: DELETE /workspaces/{id} needs confirmation header + restorable soft-delete (incident: accidentally wiped own workspace via probe loop) #1823

Closed
opened 2026-05-25 00:52:03 +00:00 by RenoStarsAI-production-client · 0 comments

Summary

DELETE /workspaces/{id} accepts any authenticated ORG-key request with zero confirmation and immediately removes the workspace. I just accidentally hit it while loop-probing HTTP verbs against /workspaces/{id}/<endpoint> looking for a "kill stuck native_session" endpoint — destroyed a production workspace (3fe84b89-eb65-42fc-ad1f-5c93582ca3e7, "SEO Agent") at 2026-05-25T00:40:43Z. Schedules were orphaned; memories returned 404; only the SSOT central-repo content survived. This is a real foot-gun that bites both humans and LLMs.

This issue proposes four layered safety improvements. Fix 1 would have blocked the delete and Fix 2 would have made it reversible; Fix 3 would have removed the need to probe at all, and Fix 4 is audit hygiene.

What happened (incident report)

Trying to recover the workspace from the #1684 native_session deadlock (pre-PR-#1685 build, never restarted to pick up the fix), I ran a probe loop to discover any "terminate / abort / restart" endpoint:

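for verb_ep in "POST:restart" "POST:terminate" "POST:abort" "DELETE:" "PATCH:" ...; do
  verb="${verb_ep%%:*}"   # text before the first colon, e.g. POST
  ep="${verb_ep#*:}"      # text after it, e.g. restart (empty for "DELETE:" and "PATCH:")
  curl -X "$verb" "$PLATFORM/workspaces/$SEO_ID/${ep}" ...
done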

Two of those calls did real things:

POST   /workspaces/{id}/restart   → 200  {"status":"provisioning","reset_session":false}    ← what I wanted
DELETE /workspaces/{id}           → 200  {"cascade_deleted":0,"status":"removed"}            ← what I got

Both succeeded without any confirmation. The DELETE ran before I could read the restart response and realise I'd already found the right endpoint.

The error responses on subsequent GET /workspaces/{id} calls are helpful (410 + "Regenerate workspace + token from the canvas → Tokens tab"), confirming the platform already implements a soft-delete tombstone. But no restore endpoint (for example POST /workspaces/{id}/restore) is exposed, so a human still has to recreate the workspace from the canvas UI.

Why this matters now

ORG keys are used by automation (cron tasks, agent fleets, third-party scripts). When destructive verbs require zero confirmation:

  1. A buggy for loop in a probe script wipes a workspace
  2. An LLM tool-calling agent that's exploring "what endpoints exist?" deletes things
  3. A fuzzer (or worse, a security-scanner with stolen ORG creds) cleans out the org in one pass
  4. Even a careful operator running xargs -I{} curl -X DELETE ... on the wrong file destroys live workspaces

The current design relies entirely on the client not making mistakes. That's not a defence — that's hope.

Proposed fixes (ranked by ROI)

Fix 1 (highest impact, low effort): name-confirmation header on destructive verbs

Pattern: GitHub repo delete, Stripe live-mode confirmations.

First DELETE /workspaces/{id}    → 400
{
  "error": "destructive_action_requires_confirmation",
  "hint": "Re-send the same request with header X-Confirm-Name: <workspace.name>",
  "workspace_name": "SEO Agent",
  "active_tasks": 1,
  "child_count": 0,
  "schedule_count": 11,
  "memory_count": 8
}

Second DELETE with X-Confirm-Name: SEO Agent  → 200 removed

Why this kills the probe attack: I had no idea what the workspace was called, and my probe loop was generic. Requiring the workspace's display name in a confirmation header means a blind, single-request probe cannot delete anything: the caller has to read the 400 response and deliberately re-send the request with the name. This is one middleware function, roughly 20 lines.

Apply to: DELETE /workspaces/{id}, DELETE /workspaces/{id}/memories/{mid} (for bulk memory wipes), PATCH /workspaces/{id} when it includes a destructive field like disabled: true.
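A minimal sketch of the check Fix 1 describes, written in Python purely for illustration (the platform's actual language and framework are not stated here). The `Workspace` record and the `check_delete_confirmation` function are hypothetical names; only the header name and the 400 body fields come from the proposal above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workspace:
    """Hypothetical stand-in for the platform's workspace record."""
    id: str
    name: str
    active_tasks: int = 0
    child_count: int = 0
    schedule_count: int = 0
    memory_count: int = 0


def check_delete_confirmation(ws: Workspace, headers: dict) -> Optional[tuple]:
    """Return (status, body) to reject the request, or None to let it proceed."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    supplied = {k.lower(): v for k, v in headers.items()}.get("x-confirm-name")
    if supplied == ws.name:
        return None  # exact match on the display name: the caller meant it
    # Missing or wrong name: refuse, and describe what would be destroyed.
    return 400, {
        "error": "destructive_action_requires_confirmation",
        "hint": "Re-send the same request with header X-Confirm-Name: <workspace.name>",
        "workspace_name": ws.name,
        "active_tasks": ws.active_tasks,
        "child_count": ws.child_count,
        "schedule_count": ws.schedule_count,
        "memory_count": ws.memory_count,
    }
```

The comparison is deliberately exact (case-sensitive, no trimming), so a generic probe that sends no header, or a guessed one, always gets the 400. Whether the real check should be looser is a design choice left open here.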

Fix 2 (highest user value, moderate effort): soft delete with 24-72h restore window

The platform already implements a soft delete (the 410 response is a tombstone, and schedules are orphaned rather than cascaded). Just expose the restore path:

DELETE /workspaces/{id}                                  → 200 {"status":"removed","restorable_until":"<now+72h>"}
POST   /workspaces/{id}/restore                          → 200 {"status":"online"}  (within window)
POST   /workspaces/{id}/restore  (after window)          → 410 {"error":"hard_purged_at: ..."}

A nightly cron job hard-purges anything past the restore window.
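The delete / restore / purge lifecycle could look roughly like the sketch below. It is Python for illustration only, uses a hypothetical in-memory `WorkspaceStore` instead of the platform's real storage, assumes the 72h end of the suggested window, and simplifies the post-window error body to `hard_purged`.

```python
from datetime import datetime, timedelta, timezone

RESTORE_WINDOW = timedelta(hours=72)  # assumption: the upper end of the 24-72h range


class WorkspaceStore:
    """Hypothetical in-memory store; a real one would be a database table."""

    def __init__(self):
        self.live = {}        # workspace id -> workspace data
        self.tombstones = {}  # workspace id -> (workspace data, restorable_until)

    def delete(self, ws_id, now):
        # Soft delete: move the record aside instead of destroying it.
        # (Raises KeyError for an unknown id; error handling is omitted.)
        data = self.live.pop(ws_id)
        until = now + RESTORE_WINDOW
        self.tombstones[ws_id] = (data, until)
        return 200, {"status": "removed", "restorable_until": until.isoformat()}

    def restore(self, ws_id, now):
        entry = self.tombstones.get(ws_id)
        if entry is None or now > entry[1]:
            # Past the window, or already purged: nothing left to restore.
            return 410, {"error": "hard_purged"}
        self.live[ws_id] = entry[0]
        del self.tombstones[ws_id]
        return 200, {"status": "online"}

    def purge_expired(self, now):
        """What the nightly job would run: drop tombstones past their window."""
        expired = [i for i, (_, until) in self.tombstones.items() if now > until]
        for ws_id in expired:
            del self.tombstones[ws_id]
        return expired
```

Passing `now` in explicitly keeps the window logic easy to test; the nightly job would call `purge_expired(datetime.now(timezone.utc))`.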

Why this matters: Even with Fix 1, mistakes still happen (mid-day cleanup that hits the wrong ID, etc.). A 72h window means the operator's "oh no" moment has a one-click recovery instead of "regenerate from canvas, lose all in-flight state, reattach schedules, reseed memories" (what I'm doing now).

Bonus: this matches what users think delete does. Almost nobody expects a single DELETE without confirmation to be irreversible; they expect a Trash / Recycle Bin abstraction. AWS, GitHub, GCS, Linear and Notion all offer some form of this.

Fix 3 (UX win, low effort): make restart discoverable from stuck-state warning

When GET /workspaces/{id} returns a workspace whose active_tasks has stayed above 0 for longer than some threshold (say 30 min), include a self-heal hint in the response:

{
  "id": "...",
  "active_tasks": 1,
  "uptime_seconds": 254873,
  "warnings": [{
    "code": "session_likely_stuck",
    "severity": "warn",
    "first_observed_at": "2026-05-22T07:38:00Z",
    "duration_seconds": 254873,
    "hint": "Cron dispatch is being blocked by the stuck native_session. POST /workspaces/{id}/restart to reset session without losing config or schedules. See issue #1684 for context."
  }]
}

Why this matters: I'd have found /restart immediately if GET /workspaces/{id} had hinted at it. Instead I probed verbs because the obvious endpoint was not discoverable. A warning surfaced on the resource itself is the kindest UX: no docs to find, no Stack Overflow to search.
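As a rough illustration of how the warning in Fix 3 might be computed, here is a Python sketch. The function name `stuck_warnings` and its `busy_since` argument (the time at which `active_tasks` first went above 0, assumed to be a UTC datetime) are hypothetical; the JSON field names and the 30-minute threshold are taken from the proposal.

```python
from datetime import datetime, timedelta, timezone

STUCK_THRESHOLD = timedelta(minutes=30)  # "say 30 min" in the proposal


def stuck_warnings(active_tasks, busy_since, now):
    """Return the warnings list to embed in a GET /workspaces/{id} response."""
    if active_tasks <= 0 or busy_since is None:
        return []  # idle workspace: nothing to warn about
    duration = now - busy_since
    if duration <= STUCK_THRESHOLD:
        return []  # busy, but not for long enough to look stuck
    return [{
        "code": "session_likely_stuck",
        "severity": "warn",
        # The trailing "Z" is only correct because busy_since is assumed to be UTC.
        "first_observed_at": busy_since.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "duration_seconds": int(duration.total_seconds()),
        "hint": "POST /workspaces/{id}/restart to reset session "
                "without losing config or schedules.",
    }]
```

Returning an empty list for healthy workspaces lets the handler always emit a `warnings` key, which keeps the response shape stable for clients.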

Fix 4 (audit hygiene, low effort): require reason on destructive ORG-key calls

For any destructive verb (DELETE on workspace, bulk DELETE on memories, etc.), require a JSON body with a free-text reason:

DELETE /workspaces/{id}  with body {}   → 400 {"error":"reason_required","hint":"Send body {\"reason\": \"...\"} explaining the deletion. Logged for audit."}
DELETE /workspaces/{id}  with body {"reason":"superseded by workspace abc"}   → proceeds (after name confirmation per Fix 1)

The reasons go to an audit log queryable by tenant admin. Two benefits:

  • LLMs and ad-hoc probes don't write reasons → don't delete
  • Real operators get a "why did I delete this?" trail when reviewing the past 30 days
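The reason check and audit write from Fix 4 can be sketched as follows, again in Python for illustration. `require_reason`, the `actor` and `target` parameters, and the in-memory `AUDIT_LOG` list are all hypothetical; a real audit log would need to be durable and queryable per tenant.

```python
import json
from datetime import datetime, timezone

AUDIT_LOG = []  # hypothetical sink standing in for a durable audit table


def require_reason(raw_body, actor, target, now=None):
    """Return (status, body) to reject, or None after recording the reason."""
    try:
        payload = json.loads(raw_body) if raw_body else {}
    except ValueError:  # covers malformed JSON and undecodable bytes
        payload = {}
    reason = payload.get("reason") if isinstance(payload, dict) else None
    if not isinstance(reason, str) or not reason.strip():
        return 400, {
            "error": "reason_required",
            "hint": 'Send body {"reason": "..."} explaining the deletion. Logged for audit.',
        }
    # Record who deleted what, when, and why, before the delete proceeds.
    AUDIT_LOG.append({
        "at": (now or datetime.now(timezone.utc)).isoformat(),
        "actor": actor,
        "target": target,
        "reason": reason.strip(),
    })
    return None
```

In a handler this would run after the name confirmation from Fix 1, so a request that fails confirmation never writes an audit entry for a deletion that did not happen.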

Suggested rollout order

  1. Fix 1 (name confirmation) — single-PR middleware, ships in a day, immediately prevents the probe-discovery attack class
  2. Fix 3 (warning surface on stuck workspaces) — almost free, surfaces the right answer (/restart) to anyone in my situation
  3. Fix 2 (soft delete + restore endpoint) — bigger lift but transforms the recovery story
  4. Fix 4 (reason required + audit log) — nice-to-have hygiene; do it when convenient

Reproduction

# As an ORG-key holder against any tenant:
curl -X DELETE "$PLATFORM/workspaces/<some-existing-workspace-id>" \
  -H "Authorization: Bearer $ORG_KEY" \
  -H "Origin: $PLATFORM"
# → 200 OK, workspace gone, no second-factor of any kind required

Followed by:

curl "$PLATFORM/workspaces/<deleted-id>"
# → 410 with hint "Regenerate workspace + token from the canvas → Tokens tab"
# (so the soft-delete IS there, it's just one-way)

Related context

  • This bit me one day after PR #1685 (issue #1684 native_session deadlock fix) merged. My workspace was the very thing #1685 fixes — but the workspace had to be restarted to pick up the new platform code, and I was probing for "how do I restart" when I hit DELETE. So there's a real causal chain: stuck session → operator looks for restart → finds DELETE first → wipes the workspace they were trying to save.
  • The fix is also a defence against the security case: ORG keys with full write scope are currently a single curl command away from cleaning out an org. A compromised cron PAT shouldn't be able to do that without sending X-Confirm-Name headers that a real operator would notice.

— Hongming Wang (airenostars@gmail.com)
— Tenant: reno-stars.moleculesai.app
— Workspace owner: d76977b1-f17e-4a4c-9f74-bf6315238620
— Self-inflicted incident timestamp: 2026-05-25T00:40:43Z

Reference: molecule-ai/molecule-core#1823