N+1 query audit 2026-05-19 — TOP-5 opportunities for Neon compute reduction #1569

Open
opened 2026-05-19 18:16:11 +00:00 by cp-be · 0 comments
Member

Summary

Static audit of workspace-server/internal/handlers/ + workspace-server/internal/registry/ + workspace-server/internal/channels/ for DB N+1 anti-patterns. Five concrete offenders identified. Filing as RFC-stage per audit charter (no fix PRs in this issue) so the core-be / infra-runtime-be team can scope remediation.

Audit method: grep for for ... range ... { ... .Query|Exec|QueryRow ... } with awk-based loop+60-line-window analysis (mc handlers tend to have deeper nesting than CP), then human triage to drop intentionally-serialised polling loops and per-iteration retries.

Vendor-doc check: same as the cp-issue N+1 audit — multi-row INSERT via unnest() for bulk writes; = ANY($1) with pq.Array for IN-list reads. mc already uses both idioms in internal/memory/pgplugin/store.go:224, :229 and internal/registry/orphan_sweeper.go:172, :256, :435, so the pattern is in-house.

TOP-5 N+1 offenders (molecule-core / workspace-server)

# File:line Loop context N (realistic) Suggested fix
1 internal/handlers/admin_memories.go:183 (AdminMemoriesHandler.Import legacy path) for _, entry := range entries { ... QueryRowContext(workspace lookup) ... QueryRowContext(dup check) ... ExecContext(INSERT) ... }three DB round-trips per imported entry 3 × batch size — backups are typically 200-2000 memories per workspace, so 600-6000 queries per import call Pre-resolve all distinct workspace names with one SELECT name, id FROM workspaces WHERE name = ANY($1::text[]), build a name→id map in-process. Then issue a single INSERT INTO agent_memories (workspace_id, content, scope, namespace, created_at) SELECT * FROM unnest($1::uuid[], $2::text[], $3::text[], $4::text[], $5::timestamptz[]) ON CONFLICT (workspace_id, content, scope) DO NOTHING — the existing dedup check SELECT EXISTS(...) becomes a unique-constraint ON CONFLICT, eliminating both the dedup query AND the per-row INSERT. (Pre-flight: confirm an index agent_memories(workspace_id, content, scope) exists; if not, add it before the refactor — the dedup query already implies this scan pattern). Side note: the importViaPlugin path at :446 has the same shape; fix both.
2 internal/registry/healthsweep.go:97 (sweepOnlineWorkspaces) and :163 (sweepStaleRemoteWorkspaces) After SELECTing the candidate workspace IDs, loop: for _, id := range ids { checker.IsRunning(...); db.DB.ExecContext(UPDATE workspaces SET status=... WHERE id=$2) } online-workspace count per sweep tick — currently ~30-100 per tenant, sweep runs every 60s. Multiply by tenant count for fleet-wide compute. Batch the UPDATEs after the container-check completes: UPDATE workspaces SET status=$2, updated_at=now() FROM (VALUES ($3::uuid, $4::text), ($5, $6), ...) AS v(id, _) WHERE workspaces.id = v.id AND status NOT IN ('removed','provisioning'). The Docker IsRunning call is the unavoidable per-row cost, but the DB write is collapsible. Cross-tenant impact: this runs in every tenant container — multiplier on Neon compute scales with tenant count, not workspace count.
3 internal/channels/manager.go:468 (BroadcastToWorkspaceChannels) for rows.Next() { ... m.SendOutbound(ctx, channelID, text) ... } — and SendOutbound itself calls m.loadChannel(channelID) (QueryRowContext) + ExecContext(UPDATE workspace_channels SET last_message_at=now(), message_count=...) per channel 2 round-trips × enabled-channel count per workspace — typically 1-3 Slack channels per workspace; spikes during cron-completion fan-outs Replace the row-iterate with a SELECT that already returns the full channel config (avoid the round-trip in loadChannel), and after all sends complete, batch the UPDATE: UPDATE workspace_channels SET last_message_at=now(), message_count=message_count+1, updated_at=now() WHERE id = ANY($1::uuid[]).
4 internal/handlers/memories.go:118 Loop body issues QueryRowContext(SELECT parent_id FROM workspaces WHERE id = $1) per iteration — walks the ancestor chain row-by-row workspace tree depth — typically 3-5, can be 10+ for deeply-nested org imports; per request of GET /workspaces/:id/memories Replace with one recursive CTE: WITH RECURSIVE ancestors AS (SELECT id, parent_id FROM workspaces WHERE id = $1 UNION ALL SELECT w.id, w.parent_id FROM workspaces w JOIN ancestors a ON w.id = a.parent_id) SELECT id FROM ancestors. One round-trip vs depth-N. (Pattern already used at workspace_crud.go:427.)
5 internal/handlers/workspace_restart.go:754, :810 Restart handler: inner loop iterates retry candidates with per-iteration QueryRowContext for status checks retry count × queries — typically 5-10 attempts × 2 queries = 10-20 round-trips per restart Hoist the status read to a single batched lookup at the top of the retry, or convert to a polling loop with backoff that reads once per tick instead of per-candidate. Lower priority — restart paths are not high-throughput, but the pattern repeats across restart_context.go siblings.

Top-2 concrete refactor proposals

Refactor 1 — admin_memories.go:183 + :446 (worst-multiplier on bulk imports)

Pre-resolve workspaces in one query:

names := make([]string, 0, len(entries))
seen := map[string]struct{}{}
for _, e := range entries { if _, ok := seen[e.WorkspaceName]; !ok { names = append(names, e.WorkspaceName); seen[e.WorkspaceName] = struct{}{} } }

rows, err := db.DB.QueryContext(ctx,
    `SELECT name, id::text FROM workspaces WHERE name = ANY($1::text[])`, pq.Array(names))
// ... build name→id map; missing names → "skipped"

Then single batch INSERT (assuming (workspace_id, content, scope) unique index — add migration if absent):

wsIDs := make([]string, 0, len(entries))
contents := make([]string, 0, len(entries))
scopes := make([]string, 0, len(entries))
nss := make([]string, 0, len(entries))
createdAts := make([]sql.NullTime, 0, len(entries))
// populate from entries, redacting content first per SAFE-T1201

_, err = db.DB.ExecContext(ctx, `
    INSERT INTO agent_memories (workspace_id, content, scope, namespace, created_at)
    SELECT * FROM unnest($1::uuid[], $2::text[], $3::text[], $4::text[], $5::timestamptz[])
    ON CONFLICT (workspace_id, content, scope) DO NOTHING
`, pq.Array(wsIDs), pq.Array(contents), pq.Array(scopes), pq.Array(nss), pq.Array(createdAts))

Estimated saving: 2000-entry import: 6000 queries → 2 queries (+1 migration). At ~3ms/query p50 to Neon, that's 18s → ~50ms of sync DB time per import. Critical for core-be's memory-import SLA work.

Migration prerequisite: CREATE UNIQUE INDEX CONCURRENTLY agent_memories_dedup_idx ON agent_memories (workspace_id, content, scope) — must precede the refactor PR or the ON CONFLICT clause will explode. The existing SELECT EXISTS semantics imply this uniqueness is already a logical invariant, so the index is a documented safety-net not a behavior change.

Refactor 2 — healthsweep.go:97 + :163 (worst fleet-multiplier — runs in every tenant)

After all IsRunning calls complete (collect offline IDs into a slice), one batched UPDATE:

offlineIDs := []string{}
for _, id := range ids {
    if running, err := checker.IsRunning(ctx, id); err == nil && !running {
        offlineIDs = append(offlineIDs, id)
    }
}
if len(offlineIDs) > 0 {
    _, err := db.DB.ExecContext(ctx, `
        UPDATE workspaces SET status = $1, updated_at = now()
        WHERE id = ANY($2::uuid[]) AND status NOT IN ('removed', 'provisioning')
    `, models.StatusOffline, pq.Array(offlineIDs))
}
// then per-id callbacks (ClearWorkspaceKeys, onOffline) in a second pass — cheap, in-process

Estimated saving: at 50 online workspaces, 50 UPDATE round-trips → 1 per sweep tick. Sweep runs every 60s → 50× per minute reduction in per-tenant Neon write traffic. Fleet-wide: at 30 tenants × 50 workspaces × 60 sweeps/hour = ~5.4M needless single-row UPDATEs/day → ~108k batched UPDATEs/day across the fleet.

The :163 sweepStaleRemoteWorkspaces variant has the same shape but writes awaiting_agent instead of offline — fix both in the same PR.

Conservative exclusions

Triaged out:

  • restart_context.go:108, :155, :169, :202 — polling loops on state transitions; serialisation is the point.
  • org.go:844 — slice-build loop, the actual DB query is outside it.
  • workspace_crud.go:362 purge loop — already uses ANY($1::uuid[]) per-table; iteration is over the in-process table-name list.
  • pendinguploads/storage.go:284 — transactional, single-iteration in practice.
  • messagestore/postgres_store.go:156 — switch-style branch, not iterative N+1.

Assigned reviewer personas

  • core-be — workspace-server handlers + registry ownership (this issue filed under cp-be temporarily because persona-core-be PAT is not currently active on the local agent — see reference_prod_team_infisical_identities; re-assign on this issue's metadata once core-be identity is online)
  • infra-runtime-be — healthsweep + workspace_restart paths cross the template-runtime boundary (per reference_runtime_provider_creds_and_template_id_footgun — refactor 2's batched UPDATE must preserve the status NOT IN ('removed','provisioning') guard so provisioning races don't get clobbered)

Scope

  • RFC-stage only. No fix PR. Refactor 1 = ~120 LoC + integration test + index migration; refactor 2 = ~80 LoC + integration test.
  • File fix-PRs against staging (per reference_molecule_core_platform_no_staging_to_main_autopromote mc has a manual staging→main flip — promote the whole delta together once both refactors are green).
  • Out of scope: internal/messagestore/, internal/memory/ (already audited as good per the = ANY($1) audit-grep), internal/channels/ adapters (per-channel SDK calls, not DB).

🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

## Summary Static audit of `workspace-server/internal/handlers/` + `workspace-server/internal/registry/` + `workspace-server/internal/channels/` for DB N+1 anti-patterns. Five concrete offenders identified. Filing as **RFC-stage** per audit charter (no fix PRs in this issue) so the core-be / infra-runtime-be team can scope remediation. **Audit method**: grep for `for ... range ... { ... .Query|Exec|QueryRow ... }` with awk-based loop+60-line-window analysis (mc handlers tend to have deeper nesting than CP), then human triage to drop intentionally-serialised polling loops and per-iteration retries. **Vendor-doc check**: same as the cp-issue N+1 audit — multi-row INSERT via `unnest()` for bulk writes; `= ANY($1)` with `pq.Array` for IN-list reads. mc already uses both idioms in `internal/memory/pgplugin/store.go:224, :229` and `internal/registry/orphan_sweeper.go:172, :256, :435`, so the pattern is in-house. ## TOP-5 N+1 offenders (molecule-core / workspace-server) | # | File:line | Loop context | N (realistic) | Suggested fix | |---|-----------|--------------|---------------|---------------| | 1 | `internal/handlers/admin_memories.go:183` (`AdminMemoriesHandler.Import` legacy path) | `for _, entry := range entries { ... QueryRowContext(workspace lookup) ... QueryRowContext(dup check) ... ExecContext(INSERT) ... }` — **three** DB round-trips per imported entry | **3 × batch size** — backups are typically 200-2000 memories per workspace, so 600-6000 queries per import call | Pre-resolve all distinct workspace names with one `SELECT name, id FROM workspaces WHERE name = ANY($1::text[])`, build a name→id map in-process. Then issue a single `INSERT INTO agent_memories (workspace_id, content, scope, namespace, created_at) SELECT * FROM unnest($1::uuid[], $2::text[], $3::text[], $4::text[], $5::timestamptz[]) ON CONFLICT (workspace_id, content, scope) DO NOTHING` — the existing dedup check `SELECT EXISTS(...)` becomes a unique-constraint ON CONFLICT, eliminating both the dedup query AND the per-row INSERT. (Pre-flight: confirm an index `agent_memories(workspace_id, content, scope)` exists; if not, add it before the refactor — the dedup query already implies this scan pattern). Side note: the `importViaPlugin` path at `:446` has the same shape; fix both. | | 2 | `internal/registry/healthsweep.go:97` (`sweepOnlineWorkspaces`) and `:163` (`sweepStaleRemoteWorkspaces`) | After SELECTing the candidate workspace IDs, loop: `for _, id := range ids { checker.IsRunning(...); db.DB.ExecContext(UPDATE workspaces SET status=... WHERE id=$2) }` | **online-workspace count per sweep tick** — currently ~30-100 per tenant, sweep runs every 60s. Multiply by tenant count for fleet-wide compute. | Batch the UPDATEs after the container-check completes: `UPDATE workspaces SET status=$2, updated_at=now() FROM (VALUES ($3::uuid, $4::text), ($5, $6), ...) AS v(id, _) WHERE workspaces.id = v.id AND status NOT IN ('removed','provisioning')`. The Docker `IsRunning` call is the unavoidable per-row cost, but the DB write is collapsible. **Cross-tenant impact**: this runs in *every* tenant container — multiplier on Neon compute scales with tenant count, not workspace count. | | 3 | `internal/channels/manager.go:468` (`BroadcastToWorkspaceChannels`) | `for rows.Next() { ... m.SendOutbound(ctx, channelID, text) ... }` — and **`SendOutbound`** itself calls `m.loadChannel(channelID)` (`QueryRowContext`) + `ExecContext(UPDATE workspace_channels SET last_message_at=now(), message_count=...)` per channel | **2 round-trips × enabled-channel count per workspace** — typically 1-3 Slack channels per workspace; spikes during cron-completion fan-outs | Replace the row-iterate with a SELECT that already returns the full channel config (avoid the round-trip in `loadChannel`), and after all sends complete, batch the UPDATE: `UPDATE workspace_channels SET last_message_at=now(), message_count=message_count+1, updated_at=now() WHERE id = ANY($1::uuid[])`. | | 4 | `internal/handlers/memories.go:118` | Loop body issues `QueryRowContext(SELECT parent_id FROM workspaces WHERE id = $1)` per iteration — walks the ancestor chain row-by-row | **workspace tree depth** — typically 3-5, can be 10+ for deeply-nested org imports; **per request** of `GET /workspaces/:id/memories` | Replace with one recursive CTE: `WITH RECURSIVE ancestors AS (SELECT id, parent_id FROM workspaces WHERE id = $1 UNION ALL SELECT w.id, w.parent_id FROM workspaces w JOIN ancestors a ON w.id = a.parent_id) SELECT id FROM ancestors`. One round-trip vs depth-N. (Pattern already used at `workspace_crud.go:427`.) | | 5 | `internal/handlers/workspace_restart.go:754, :810` | Restart handler: inner loop iterates retry candidates with per-iteration `QueryRowContext` for status checks | **retry count × queries** — typically 5-10 attempts × 2 queries = 10-20 round-trips per restart | Hoist the status read to a single batched lookup at the top of the retry, or convert to a polling loop with backoff that reads once per tick instead of per-candidate. Lower priority — restart paths are not high-throughput, but the pattern repeats across `restart_context.go` siblings. | ## Top-2 concrete refactor proposals ### Refactor 1 — `admin_memories.go:183` + `:446` (worst-multiplier on bulk imports) Pre-resolve workspaces in one query: ```go names := make([]string, 0, len(entries)) seen := map[string]struct{}{} for _, e := range entries { if _, ok := seen[e.WorkspaceName]; !ok { names = append(names, e.WorkspaceName); seen[e.WorkspaceName] = struct{}{} } } rows, err := db.DB.QueryContext(ctx, `SELECT name, id::text FROM workspaces WHERE name = ANY($1::text[])`, pq.Array(names)) // ... build name→id map; missing names → "skipped" ``` Then single batch INSERT (assuming `(workspace_id, content, scope)` unique index — add migration if absent): ```go wsIDs := make([]string, 0, len(entries)) contents := make([]string, 0, len(entries)) scopes := make([]string, 0, len(entries)) nss := make([]string, 0, len(entries)) createdAts := make([]sql.NullTime, 0, len(entries)) // populate from entries, redacting content first per SAFE-T1201 _, err = db.DB.ExecContext(ctx, ` INSERT INTO agent_memories (workspace_id, content, scope, namespace, created_at) SELECT * FROM unnest($1::uuid[], $2::text[], $3::text[], $4::text[], $5::timestamptz[]) ON CONFLICT (workspace_id, content, scope) DO NOTHING `, pq.Array(wsIDs), pq.Array(contents), pq.Array(scopes), pq.Array(nss), pq.Array(createdAts)) ``` Estimated saving: 2000-entry import: 6000 queries → 2 queries (+1 migration). At ~3ms/query p50 to Neon, that's 18s → ~50ms of sync DB time per import. Critical for `core-be`'s memory-import SLA work. **Migration prerequisite**: `CREATE UNIQUE INDEX CONCURRENTLY agent_memories_dedup_idx ON agent_memories (workspace_id, content, scope)` — must precede the refactor PR or the ON CONFLICT clause will explode. The existing `SELECT EXISTS` semantics imply this uniqueness is already a logical invariant, so the index is a documented safety-net not a behavior change. ### Refactor 2 — `healthsweep.go:97 + :163` (worst fleet-multiplier — runs in every tenant) After all `IsRunning` calls complete (collect offline IDs into a slice), one batched UPDATE: ```go offlineIDs := []string{} for _, id := range ids { if running, err := checker.IsRunning(ctx, id); err == nil && !running { offlineIDs = append(offlineIDs, id) } } if len(offlineIDs) > 0 { _, err := db.DB.ExecContext(ctx, ` UPDATE workspaces SET status = $1, updated_at = now() WHERE id = ANY($2::uuid[]) AND status NOT IN ('removed', 'provisioning') `, models.StatusOffline, pq.Array(offlineIDs)) } // then per-id callbacks (ClearWorkspaceKeys, onOffline) in a second pass — cheap, in-process ``` Estimated saving: at 50 online workspaces, 50 UPDATE round-trips → 1 per sweep tick. Sweep runs every 60s → **50× per minute reduction** in per-tenant Neon write traffic. Fleet-wide: at 30 tenants × 50 workspaces × 60 sweeps/hour = ~5.4M needless single-row UPDATEs/day → ~108k batched UPDATEs/day across the fleet. The `:163` `sweepStaleRemoteWorkspaces` variant has the same shape but writes `awaiting_agent` instead of `offline` — fix both in the same PR. ## Conservative exclusions Triaged out: - `restart_context.go:108, :155, :169, :202` — polling loops on state transitions; serialisation is the point. - `org.go:844` — slice-build loop, the actual DB query is outside it. - `workspace_crud.go:362` purge loop — already uses `ANY($1::uuid[])` per-table; iteration is over the in-process table-name list. - `pendinguploads/storage.go:284` — transactional, single-iteration in practice. - `messagestore/postgres_store.go:156` — switch-style branch, not iterative N+1. ## Assigned reviewer personas - **core-be** — workspace-server handlers + registry ownership (this issue filed under cp-be temporarily because `persona-core-be` PAT is not currently active on the local agent — see `reference_prod_team_infisical_identities`; re-assign on this issue's metadata once core-be identity is online) - **infra-runtime-be** — healthsweep + workspace_restart paths cross the template-runtime boundary (per `reference_runtime_provider_creds_and_template_id_footgun` — refactor 2's batched UPDATE must preserve the `status NOT IN ('removed','provisioning')` guard so provisioning races don't get clobbered) ## Scope - **RFC-stage only.** No fix PR. Refactor 1 = ~120 LoC + integration test + index migration; refactor 2 = ~80 LoC + integration test. - File fix-PRs against `staging` (per `reference_molecule_core_platform_no_staging_to_main_autopromote` mc has a manual staging→main flip — promote the whole delta together once both refactors are green). - **Out of scope**: `internal/messagestore/`, `internal/memory/` (already audited as good per the `= ANY($1)` audit-grep), `internal/channels/` adapters (per-channel SDK calls, not DB). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#1569