Closes #2962.

## Why

Six per-package `truncate` helpers had drifted into independent re-implementations of the same idea. Three of them (delegation.go, memory/client/client.go, memory-backfill/verify.go) used `s[:max] + "…"` byte-slice form, which on a multi-byte codepoint at byte `max` produces invalid UTF-8 → Postgres `text`/`jsonb` rejects the INSERT silently → `delegation` / `activity_logs` row never lands → audit gap.

Three other helpers (delegation_ledger.go #2962, agent_message_writer.go #2959, scheduler.go #2026) had each been fixed in isolation with three slightly different rune-safe shapes, confirming this is a class of bug, not a single instance.

## What

New package `internal/textutil` with three rune-safe functions:

- `TruncateBytes(s, maxBytes)`: byte-cap, "…" marker. Used by 5 callers writing into byte-bounded columns / log lines.
- `TruncateBytesNoMarker(s, maxBytes)`: byte-cap, no marker. Used by delegation_ledger.go where the storage already conveys "preview" and an extra ellipsis would push the result over the column cap.
- `TruncateRunes(s, maxRunes)`: rune-cap, "…" marker. Used by agent_message_writer.go where the cap is in display chars (UI summary), not bytes.

All three guarantee `utf8.ValidString(out)` for any `utf8.ValidString(in)`. Inputs already invalid go through `sanitizeUTF8` at the call site boundary (scheduler.go preserved this defense-in-depth).

## Migration map

| Old | New | Behavior change |
|---|---|---|
| `delegation_ledger.truncatePreview` | `textutil.TruncateBytesNoMarker(s, 4096)` | none |
| `agent_message_writer.truncatePreviewRunes` | `textutil.TruncateRunes(s, n)` | none |
| `scheduler.truncate` | `textutil.TruncateBytes(s, n)` | "..." → "…" (3 bytes either way; single-glyph display) |
| `delegation.truncate` | `textutil.TruncateBytes(s, n)` | bug fix + ellipsis swap |
| `memory/client.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |
| `memory-backfill.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |

Five separate `truncate*` helpers + their per-package tests removed. Net: 12 files / +427 / -255.

## Tests

- `internal/textutil/truncate_test.go`: 27 table-test cases + 145 fuzz-invariant cases asserting `utf8.ValidString` and byte-cap invariants on every output.
- `delegation_ledger_test.go TestLedgerInsert_TruncatesOversizedPreview` strengthened with `capValidUTF8Matcher` so the SQL-write argument is asserted to be valid UTF-8 + within cap (not just `AnyArg()`). Mutation-tested: replacing the SSOT call with byte-slice form makes this test fail loud.

## Compatibility

- All callers internal; no external API surface change.
- Ellipsis swap "..." → "…": same byte budget (3 bytes), single-glyph display. No alerting/grep on either marker in this codebase (verified). Canvas renders both correctly.
- DB column widths unchanged (4096 / 80 / 200 / 256 / 300, all preserved in the migrations).

## Security

Fixes a silent INSERT-failure mode that hid `activity_logs` / `delegations` rows containing peer-controlled text. The class of input that triggered it (CJK, emoji, accented Latin) is normal user content, not malicious, but the symptom (audit gap) makes incident reconstruction harder.

Helper is pure-function over `string`; no secrets / PII / auth handling involved. Untrusted input is handled identically to before, just rune-aligned now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
// Package textutil provides string-handling helpers that respect UTF-8
// rune boundaries.
//
// Why this package exists
// -----------------------
// `s[:max]` truncates by BYTES; for any string with a multi-byte
// codepoint at byte `max` (CJK, emoji, accented Latin), the slice
// produces invalid UTF-8. Postgres `text` and `jsonb` columns reject
// invalid UTF-8 with `invalid byte sequence for encoding "UTF8"`,
// which silently fails the INSERT and holds the surrounding tx open:
// a class of audit-gap that has bitten this codebase three times
// (scheduler.go #2026, agent_message_writer.go #2959,
// delegation_ledger.go #2962). Six per-package helpers had
// independently re-implemented this logic with varying correctness;
// this package is the single source of truth.
//
// Use sites
// ---------
//   - DB writes whose column is bytes-bounded (jsonb preview field,
//     varchar(N)): TruncateBytes / TruncateBytesNoMarker.
//   - UI summaries whose cap is in display chars, not bytes:
//     TruncateRunes.
//
// All functions guarantee `utf8.ValidString(out) == true` for any
// `s` where `utf8.ValidString(s) == true`. Inputs that are already
// invalid UTF-8 should be sanitized at the trust boundary (e.g. via
// `strings.ToValidUTF8`); this package does not silently fix
// upstream invalid input.
package textutil

import "unicode/utf8"

// ellipsis is the truncation marker, U+2026 HORIZONTAL ELLIPSIS:
// 3 bytes in UTF-8, 1 rune, 1 display column. Standardized across
// the codebase to avoid the "..." (3 ASCII chars) vs "…" (1 char)
// inconsistency the per-package helpers had drifted into.
const ellipsis = "…"

// TruncateBytes returns s if `len(s) <= maxBytes`, otherwise returns
// the longest rune-aligned prefix of s that fits in `maxBytes - 3`
// bytes followed by the ellipsis marker. The returned string is
// always at most `maxBytes` bytes long.
//
// Example: TruncateBytes("你好世界你好", 10) returns "你好…" (9 bytes).
// Three runes "你好世" (3 bytes each = 9 bytes) plus "…" (3 bytes)
// would be 12 bytes, so we walk back to "你好" (6 bytes) + "…" (3) = 9.
//
// Edge cases:
//   - maxBytes <= 0: returns "" (no room even for input or marker)
//   - maxBytes < len(ellipsis): returns "" when s does not fit (can't
//     add marker without exceeding cap, and we won't return a
//     marker-less truncation here; caller wanted a marker, use
//     TruncateBytesNoMarker if they don't)
//   - s contains invalid UTF-8: continuation bytes are walked over
//     same as valid runes; the result preserves the (invalid) input
//     bytes up to the truncation point. Caller is responsible for
//     pre-sanitizing if Postgres validity is required.
func TruncateBytes(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	if maxBytes < len(ellipsis) {
		return ""
	}
	// Reserve room for the marker, then walk back to the nearest
	// rune boundary at or below the cut point.
	cut := maxBytes - len(ellipsis)
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut] + ellipsis
}

// TruncateBytesNoMarker returns s if `len(s) <= maxBytes`, otherwise
// returns the longest rune-aligned prefix of s that fits in
// `maxBytes` bytes. No marker is appended, which is useful when the
// caller's storage already conveys "preview" / "snippet" semantics
// and an extra ellipsis would push the result over a hard column cap.
//
// Example: TruncateBytesNoMarker("hello world", 5) returns "hello".
//
// Edge case: maxBytes <= 0 returns "".
func TruncateBytesNoMarker(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	if maxBytes <= 0 {
		return ""
	}
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// TruncateRunes returns s if it has at most maxRunes runes, otherwise
// returns the first maxRunes runes followed by the ellipsis marker.
// Use this when the cap is in user-visible characters (UI summary,
// activity feed line) rather than bytes (DB column).
//
// Example: TruncateRunes("你好世界你好", 3) returns "你好世…": three
// runes plus the marker, regardless of the resulting byte count.
//
// Edge case: maxRunes <= 0 returns "" (caller asked for no content).
func TruncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	// Fast path: every rune occupies at least one byte, so the byte
	// length upper-bounds the rune count. If len(s) <= maxRunes the
	// input already fits, and we skip the rune walk entirely (the
	// common short-ASCII case).
	if len(s) <= maxRunes {
		return s
	}
	// Walk by rune boundaries; stop at the (maxRunes+1)-th rune so we
	// know the cut point and that truncation is needed.
	count := 0
	for i := range s {
		if count == maxRunes {
			return s[:i] + ellipsis
		}
		count++
	}
	// Reachable when the byte count exceeded maxRunes but the actual
	// rune count didn't (e.g. multi-byte input such as "你好": 6 bytes
	// but only 2 runes). The fast path catches len(s) <= maxRunes;
	// this catches runeCount(s) <= maxRunes < len(s).
	return s
}
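
As a usage sketch for the file above, the invariants that the PR description says the tests assert (valid UTF-8 out, result within the byte cap) can be checked with a small sweep. This is an illustration under assumptions, not the PR's `truncate_test.go`: `truncateBytes` is a local copy of `TruncateBytes` so the block runs standalone, and `checkInvariants` and the sample strings are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const ellipsis = "…"

// truncateBytes is a local copy of TruncateBytes from the file above,
// duplicated only so this sketch compiles on its own.
func truncateBytes(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	if maxBytes < len(ellipsis) {
		return ""
	}
	cut := maxBytes - len(ellipsis)
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut] + ellipsis
}

// checkInvariants (hypothetical helper) tries every cap from -1 to
// len(s)+1 and returns the first violation of either invariant.
// It assumes s itself is valid UTF-8.
func checkInvariants(s string) error {
	for limit := -1; limit <= len(s)+1; limit++ {
		out := truncateBytes(s, limit)
		if !utf8.ValidString(out) {
			return fmt.Errorf("cap %d: invalid UTF-8 in %q", limit, out)
		}
		// A negative cap is treated as zero for the size check.
		budget := limit
		if budget < 0 {
			budget = 0
		}
		if len(out) > budget {
			return fmt.Errorf("cap %d: output is %d bytes", limit, len(out))
		}
	}
	return nil
}

func main() {
	samples := []string{"", "hello world", "你好世界你好", "naïve café", "👍👍👍"}
	for _, s := range samples {
		if err := checkInvariants(s); err != nil {
			fmt.Println("FAIL:", err)
			return
		}
	}
	fmt.Println("all invariants hold")
	fmt.Println(truncateBytes("你好世界你好", 10)) // 你好…
}
```

Sweeping every cap, including the values below `len(ellipsis)`, exercises the edge cases listed in the `TruncateBytes` doc comment as well as cut points that land mid-rune.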