molecule-core/workspace-server/internal/textutil/truncate.go
Hongming Wang 656a02fae4 fix(textutil): SSOT for rune-safe string truncation, fix 3 audit-gap bugs
Closes #2962.

## Why

Six per-package `truncate` helpers had drifted into independent
re-implementations of the same idea. Three of them (delegation.go,
memory/client/client.go, memory-backfill/verify.go) used the
`s[:max] + "…"` byte-slice form, which produces invalid UTF-8 whenever
byte `max` falls inside a multi-byte codepoint → Postgres `text`/`jsonb`
rejects the INSERT silently → `delegation` / `activity_logs` row never
lands → audit gap.
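
The failure mode can be reproduced in isolation. The sketch below is illustrative only: `naiveTruncate` is a hypothetical stand-in for the byte-slice form, not a copy of any of the removed helpers.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// naiveTruncate is a hypothetical stand-in for the byte-slice form
// described above: it cuts at a byte offset without checking whether
// that offset falls inside a multi-byte rune.
func naiveTruncate(s string, max int) string {
	if len(s) <= max {
		return s
	}
	return s[:max] + "…"
}

func main() {
	s := "你好世界"               // 4 runes, 3 bytes each = 12 bytes
	out := naiveTruncate(s, 4) // byte 4 falls inside "好"

	fmt.Println(utf8.ValidString(s))   // true
	fmt.Println(utf8.ValidString(out)) // false
}
```

Cutting at byte 3 or 6 would have been safe here; any other offset below 12 splits a rune.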

Three other helpers (delegation_ledger.go #2962, agent_message_writer.go
#2959, scheduler.go #2026) had each been fixed in isolation with three
slightly different rune-safe shapes — confirming this is a class of
bug, not a single instance.

## What

New package `internal/textutil` with three rune-safe functions:

- `TruncateBytes(s, maxBytes)` — byte-cap, "…" marker. Used by 5
  callers writing into byte-bounded columns / log lines.
- `TruncateBytesNoMarker(s, maxBytes)` — byte-cap, no marker. Used by
  delegation_ledger.go where the storage already conveys "preview"
  and an extra ellipsis would push the result over the column cap.
- `TruncateRunes(s, maxRunes)` — rune-cap, "…" marker. Used by
  agent_message_writer.go where the cap is in display chars (UI
  summary), not bytes.

All three guarantee `utf8.ValidString(out)` for any `utf8.ValidString(in)`.
Inputs already invalid go through `sanitizeUTF8` at the call site
boundary (scheduler.go preserved this defense-in-depth).
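
The body of `sanitizeUTF8` is not part of this file. As a minimal sketch, assuming it does something equivalent to the standard library's `strings.ToValidUTF8`, boundary sanitization could look like this (`sanitizeUTF8Sketch` is a hypothetical name):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8Sketch is a hypothetical stand-in for the call-site
// sanitizer. It replaces each run of invalid bytes with U+FFFD so that
// downstream truncation starts from valid UTF-8.
func sanitizeUTF8Sketch(s string) string {
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	raw := "ok\xffdone" // 0xff can never appear in valid UTF-8
	clean := sanitizeUTF8Sketch(raw)

	fmt.Println(utf8.ValidString(raw))   // false
	fmt.Println(utf8.ValidString(clean)) // true
}
```

Sanitizing first means the truncation helpers only ever see input for which their validity guarantee applies.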

## Migration map

| Old | New | Behavior change |
|---|---|---|
| `delegation_ledger.truncatePreview` | `textutil.TruncateBytesNoMarker(s, 4096)` | none |
| `agent_message_writer.truncatePreviewRunes` | `textutil.TruncateRunes(s, n)` | none |
| `scheduler.truncate` | `textutil.TruncateBytes(s, n)` | "..." → "…" (3 bytes either way; single-glyph display) |
| `delegation.truncate` | `textutil.TruncateBytes(s, n)` | bug fix + ellipsis swap |
| `memory/client.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |
| `memory-backfill.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |

Five separate `truncate*` helpers + their per-package tests removed.
Net: 12 files / +427 / -255.

## Tests

- `internal/textutil/truncate_test.go` — 27 table-test cases + 145
  fuzz-invariant cases asserting `utf8.ValidString` and byte-cap
  invariants on every output.
- `delegation_ledger_test.go TestLedgerInsert_TruncatesOversizedPreview`
  strengthened with `capValidUTF8Matcher` so the SQL-write argument
  is asserted to be valid UTF-8 + within cap (not just `AnyArg()`).
  Mutation-tested: replacing the SSOT call with byte-slice form makes
  this test fail loud.
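
The real `capValidUTF8Matcher` lives in the test file and is not shown here. A rough sketch of the idea, assuming the mock library accepts any value with a `Match(driver.Value) bool` method (the shape go-sqlmock uses for argument matchers), might be:

```go
package main

import (
	"database/sql/driver"
	"fmt"
	"unicode/utf8"
)

// capValidUTF8Sketch is a hypothetical matcher: it accepts a SQL
// argument only if it is a string that is valid UTF-8 and no longer
// than maxBytes.
type capValidUTF8Sketch struct {
	maxBytes int
}

// Match follows the assumed matcher method shape of the mock library.
func (m capValidUTF8Sketch) Match(v driver.Value) bool {
	s, ok := v.(string)
	if !ok {
		return false
	}
	return len(s) <= m.maxBytes && utf8.ValidString(s)
}

func main() {
	m := capValidUTF8Sketch{maxBytes: 8}

	fmt.Println(m.Match("你好"))       // true: 6 bytes, valid
	fmt.Println(m.Match("你\xe5"))    // false: rune cut in half
	fmt.Println(m.Match("你好世界"))     // false: 12 bytes exceeds the cap
}
```

Unlike `AnyArg()`, a matcher of this kind fails the test when a byte-slice truncation reaches the SQL layer.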

## Compatibility

- All callers internal; no external API surface change.
- Ellipsis swap "..." → "…": same byte budget (3 bytes), single-glyph
  display. No alerting/grep on either marker in this codebase
  (verified). Canvas renders both correctly.
- DB column widths unchanged (4096 / 80 / 200 / 256 / 300 — all
  preserved in the migrations).
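
The byte-budget claim for the two markers is easy to check. In this small illustration, `markerCost` is a hypothetical helper and not part of the package:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// markerCost reports how many bytes and how many runes a truncation
// marker occupies.
func markerCost(marker string) (nBytes, nRunes int) {
	return len(marker), utf8.RuneCountInString(marker)
}

func main() {
	for _, m := range []string{"...", "…"} {
		b, r := markerCost(m)
		fmt.Printf("%q: %d bytes, %d runes\n", m, b, r)
	}
}
```

Both markers cost 3 bytes against a byte cap; only the rune count (3 versus 1) differs.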

## Security

Fixes a silent INSERT-failure mode that hid `activity_logs` /
`delegations` rows containing peer-controlled text. The class of input
that triggered it (CJK, emoji, accented Latin) is normal user content,
not malicious — but the symptom (audit gap) makes incident
reconstruction harder. Helper is pure-function over `string`; no
secrets / PII / auth handling involved. Untrusted input is handled
identically to before, just rune-aligned now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:01:21 -07:00


// Package textutil provides string-handling helpers that respect UTF-8
// rune boundaries.
//
// Why this package exists
// -----------------------
// `s[:max]` truncates by BYTES; for any string with a multi-byte
// codepoint at byte `max` (CJK, emoji, accented Latin), the slice
// produces invalid UTF-8. Postgres `text` and `jsonb` columns reject
// invalid UTF-8 with `invalid byte sequence for encoding "UTF8"`,
// which silently fails the INSERT and holds the surrounding tx open
// — a class of audit-gap that has bitten this codebase three times
// (scheduler.go #2026, agent_message_writer.go #2959,
// delegation_ledger.go #2962). Six per-package helpers had
// independently re-implemented this logic with varying correctness;
// this package is the single source of truth.
//
// Use sites
// ---------
//   - DB writes whose column is byte-bounded (jsonb preview field,
//     varchar(N)): TruncateBytes / TruncateBytesNoMarker.
//   - UI summaries whose cap is in display chars, not bytes:
//     TruncateRunes.
//
// All functions guarantee `utf8.ValidString(out) == true` for any
// `s` where `utf8.ValidString(s) == true`. Inputs that are already
// invalid UTF-8 should be sanitized at the trust boundary (e.g. via
// `strings.ToValidUTF8`); this package does not silently fix
// upstream invalid input.
package textutil

import "unicode/utf8"

// ellipsis is the truncation marker. U+2026 HORIZONTAL ELLIPSIS —
// 3 bytes in UTF-8, 1 rune, 1 display column. Standardized across
// the codebase to avoid the "..." (3 ASCII chars) vs "…" (1 char)
// inconsistency the per-package helpers had drifted into.
const ellipsis = "…"

// TruncateBytes returns s if `len(s) <= maxBytes`, otherwise returns
// the longest rune-aligned prefix of s that fits in `maxBytes - 3`
// bytes followed by the ellipsis marker. The returned string is
// always at most `maxBytes` bytes long.
//
// Example: TruncateBytes("你好世界你好", 10) returns "你好…" (9 bytes).
// Three runes "你好世" (9 bytes) plus "…" (3 bytes) would be 12 bytes,
// so the cut starts at byte 7 (inside "世") and walks back to byte 6:
// "你好" (6 bytes) + "…" (3 bytes) = 9.
//
// Edge cases:
//   - maxBytes <= 0: returns "" (no room for input or marker).
//   - 0 < maxBytes < len(ellipsis) and s does not fit: returns "" (the
//     marker cannot be added without exceeding the cap, and this
//     function never returns a marker-less truncation; callers that
//     want one should use TruncateBytesNoMarker). An s that already
//     fits is returned unchanged.
//   - s contains invalid UTF-8: continuation bytes are walked over
//     the same as in valid runes; the result preserves the (invalid)
//     input bytes up to the truncation point. The caller is
//     responsible for pre-sanitizing if Postgres validity is required.
func TruncateBytes(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	if maxBytes < len(ellipsis) {
		return ""
	}
	// Reserve room for the marker, then walk back to the nearest
	// rune boundary at or below the cut point.
	cut := maxBytes - len(ellipsis)
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut] + ellipsis
}

// TruncateBytesNoMarker returns s if `len(s) <= maxBytes`, otherwise
// returns the longest rune-aligned prefix of s that fits in
// `maxBytes` bytes. No marker is appended — useful when the caller's
// storage already conveys "preview" / "snippet" semantics and an
// extra ellipsis would push the result over a hard column cap.
//
// Example: TruncateBytesNoMarker("hello world", 5) returns "hello".
//
// Edge case: maxBytes <= 0 returns "".
func TruncateBytesNoMarker(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	if maxBytes <= 0 {
		return ""
	}
	// len(s) > maxBytes here, so s[cut] is always in range.
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// TruncateRunes returns s if it has at most maxRunes runes, otherwise
// returns the first maxRunes runes followed by the ellipsis marker.
// Use this when the cap is in user-visible characters (UI summary,
// activity feed line) rather than bytes (DB column).
//
// Example: TruncateRunes("你好世界你好", 3) returns "你好世…" — three
// runes plus the marker, regardless of the resulting byte count.
//
// Edge case: maxRunes <= 0 returns "" (caller asked for no content).
func TruncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	// Fast path: every rune occupies at least one byte, so len(s) is an
	// upper bound on the rune count. If len(s) <= maxRunes the input
	// fits and the rune walk is skipped (the common short-ASCII case).
	if len(s) <= maxRunes {
		return s
	}
	// Walk by rune boundaries; reaching the start of the (maxRunes+1)-th
	// rune proves truncation is needed and gives the cut point.
	count := 0
	for i := range s {
		if count == maxRunes {
			return s[:i] + ellipsis
		}
		count++
	}
	// Reachable when len(s) exceeds maxRunes but the rune count does
	// not, i.e. runeCount(s) <= maxRunes < len(s). That requires at
	// least one multi-byte rune (e.g. "你好" with maxRunes = 3: 6 bytes,
	// 2 runes), so no truncation is needed.
	return s
}