Closes #2962.

## Why

Six per-package `truncate` helpers had drifted into independent re-implementations of the same idea. Three of them (delegation.go, memory/client/client.go, memory-backfill/verify.go) used the `s[:max] + "…"` byte-slice form, which produces invalid UTF-8 when byte `max` falls inside a multi-byte codepoint. Postgres `text`/`jsonb` then rejects the INSERT silently, the `delegation` / `activity_logs` row never lands, and the result is an audit gap.

Three other helpers (delegation_ledger.go #2962, agent_message_writer.go #2959, scheduler.go #2026) had each been fixed in isolation with three slightly different rune-safe shapes, confirming this is a class of bug, not a single instance.

## What

New package `internal/textutil` with three rune-safe functions:

- `TruncateBytes(s, maxBytes)`: byte-cap, "…" marker. Used by 5 callers writing into byte-bounded columns / log lines.
- `TruncateBytesNoMarker(s, maxBytes)`: byte-cap, no marker. Used by delegation_ledger.go where the storage already conveys "preview" and an extra ellipsis would push the result over the column cap.
- `TruncateRunes(s, maxRunes)`: rune-cap, "…" marker. Used by agent_message_writer.go where the cap is in display chars (UI summary), not bytes.

All three guarantee `utf8.ValidString(out)` for any `utf8.ValidString(in)`. Inputs already invalid go through `sanitizeUTF8` at the call site boundary (scheduler.go preserved this defense-in-depth).

## Migration map

| Old | New | Behavior change |
|---|---|---|
| `delegation_ledger.truncatePreview` | `textutil.TruncateBytesNoMarker(s, 4096)` | none |
| `agent_message_writer.truncatePreviewRunes` | `textutil.TruncateRunes(s, n)` | none |
| `scheduler.truncate` | `textutil.TruncateBytes(s, n)` | "..." → "…" (3 bytes either way; single-glyph display) |
| `delegation.truncate` | `textutil.TruncateBytes(s, n)` | bug fix + ellipsis swap |
| `memory/client.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |
| `memory-backfill.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |

Five separate `truncate*` helpers + their per-package tests removed. Net: 12 files / +427 / -255.

## Tests

- `internal/textutil/truncate_test.go`: 27 table-test cases + 145 fuzz-invariant cases asserting `utf8.ValidString` and byte-cap invariants on every output.
- `delegation_ledger_test.go` `TestLedgerInsert_TruncatesOversizedPreview` strengthened with `capValidUTF8Matcher` so the SQL-write argument is asserted to be valid UTF-8 + within cap (not just `AnyArg()`). Mutation-tested: replacing the SSOT call with the byte-slice form makes this test fail loudly.

## Compatibility

- All callers internal; no external API surface change.
- Ellipsis swap "..." → "…": same byte budget (3 bytes), single-glyph display. No alerting/grep on either marker in this codebase (verified). Canvas renders both correctly.
- DB column widths unchanged (4096 / 80 / 200 / 256 / 300, all preserved in the migrations).

## Security

Fixes a silent INSERT-failure mode that hid `activity_logs` / `delegations` rows containing peer-controlled text. The class of input that triggered it (CJK, emoji, accented Latin) is normal user content, not malicious, but the symptom (audit gap) makes incident reconstruction harder.

Helper is pure-function over `string`; no secrets / PII / auth handling involved. Untrusted input is handled identically to before, just rune-aligned now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
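To make the walk-back idea behind the three functions concrete, here is a minimal sketch. It is not the code from `internal/textutil` (which is not shown here); the function names match the description, but the bodies, the `runeStart` helper, and the `ellipsis` constant are assumptions reconstructed from the documented behavior and the table tests.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ellipsis is the truncation marker; U+2026 encodes to 3 bytes in UTF-8.
const ellipsis = "…"

// runeStart (hypothetical helper) walks i backwards until it sits on the
// first byte of a rune or reaches 0, so s[:i] never ends mid-codepoint.
func runeStart(s string, i int) int {
	for i > 0 && i < len(s) && !utf8.RuneStart(s[i]) {
		i--
	}
	return i
}

// TruncateBytesNoMarker caps s at maxBytes bytes without appending a marker.
func TruncateBytesNoMarker(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	return s[:runeStart(s, maxBytes)]
}

// TruncateBytes caps s at maxBytes bytes, with the marker counted in the cap.
func TruncateBytes(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	if maxBytes < len(ellipsis) {
		return "" // not even the marker fits
	}
	return s[:runeStart(s, maxBytes-len(ellipsis))] + ellipsis
}

// TruncateRunes keeps at most maxRunes runes and appends the marker if
// anything was dropped. The byte length of the result is not bounded.
func TruncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	count := 0
	for i := range s { // i is the byte offset of each rune start
		if count == maxRunes {
			return s[:i] + ellipsis
		}
		count++
	}
	return s
}

func main() {
	s := "你好世界" // 4 runes, 3 bytes each

	// The pre-fix byte-slice form cuts through the second rune.
	naive := s[:4] + ellipsis
	fmt.Println(utf8.ValidString(naive)) // false

	fmt.Printf("%q\n", TruncateBytes(s, 7))         // "你…"
	fmt.Printf("%q\n", TruncateBytesNoMarker(s, 4)) // "你"
	fmt.Printf("%q\n", TruncateRunes(s, 2))         // "你好…"
}
```

The byte-cap variants subtract the marker length before walking back, which is why `TruncateBytes(s, 7)` keeps only one CJK rune: the content budget is 4 bytes, and the nearest rune boundary at or below that is byte 3.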
package textutil

import (
	"testing"
	"unicode/utf8"
)

// TestTruncateBytes_RuneBoundary pins the byte-cap, marker-bearing
// truncation path. Every case asserts both:
//  1. the exact expected output (so a refactor that flips ellipsis or
//     drops a rune is caught), and
//  2. utf8.ValidString on the output (the invariant that the bug class
//     in #2026/#2959/#2962 violated by slicing mid-codepoint).
//
// Per memory feedback_assert_exact_not_substring.md, asserts are exact
// equality, not substring matches.
func TestTruncateBytes_RuneBoundary(t *testing.T) {
	cases := []struct {
		name     string
		in       string
		maxBytes int
		want     string
	}{
		// Under-cap: returns input verbatim.
		{"empty", "", 10, ""},
		{"under-cap ASCII", "hi", 10, "hi"},
		{"exactly-at-cap ASCII", "hello", 5, "hello"},
		{"under-cap CJK", "你好", 10, "你好"}, // 6 bytes
		{"exactly-at-cap CJK", "你好", 6, "你好"},

		// Over-cap ASCII: trims to (maxBytes - 3) bytes + "…".
		{"over-cap ASCII", "abcdefghij", 6, "abc…"},

		// Over-cap CJK where the cut would land mid-codepoint (the
		// pre-fix bug shape). Layout of "你好世界": 你=0..2, 好=3..5,
		// 世=6..8, 界=9..11. Cut = 7 - 3 = 4, which is mid-好, so walk
		// back to byte 3 (start of 好): s[:3]="你" + "…" = "你…"
		// (3+3=6 bytes ≤ 7).
		{"over-cap CJK lands mid-codepoint", "你好世界", 7, "你…"},

		// Over-cap CJK where cut lands exactly on rune boundary.
		// 9 - 3 = 6, byte 6 is start of 世. Walk-back is no-op.
		// s[:6]="你好" + "…" = "你好…" (9 bytes).
		{"over-cap CJK rune-aligned", "你好世界", 9, "你好…"},

		// Emoji: 😀 is 4 bytes (U+1F600). 7 - 3 = 4, byte 4 is start
		// of second 😀, so walk-back is a no-op. s[:4]="😀" + "…" = "😀…".
		{"over-cap emoji", "😀😀😀", 7, "😀…"},

		// Mixed ASCII + CJK. "ab你好世界": a(1) b(1) 你(3) 好(3) 世(3) 界(3) = 14 bytes.
		// Layout: a=0, b=1, 你=2..4, 好=5..7, 世=8..10, 界=11..13.
		// maxBytes=8, cut = 8-3 = 5, which is the start of 好, so the
		// walk-back is a no-op. s[:5]="ab你" + "…" = "ab你…" (8 bytes).
		{"mixed prefix ASCII over-cap CJK", "ab你好世界", 8, "ab你…"},

		// Pathological: maxBytes too small to even fit the marker.
		{"cap below ellipsis len", "hello", 2, ""},
		{"cap zero", "hello", 0, ""},
		{"cap negative", "hello", -1, ""},

		// Cap exactly == ellipsis len: no room for content, but
		// the marker fits. This returns "…" (cut = 0, s[:0] = "").
		{"cap equals ellipsis len", "hello", 3, "…"},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := TruncateBytes(c.in, c.maxBytes)
			if got != c.want {
				t.Errorf("TruncateBytes(%q, %d) = %q, want %q", c.in, c.maxBytes, got, c.want)
			}
			if !utf8.ValidString(got) {
				t.Errorf("TruncateBytes(%q, %d) returned invalid UTF-8: %q", c.in, c.maxBytes, got)
			}
			// Output never exceeds the byte cap (when one is set).
			if c.maxBytes > 0 && len(got) > c.maxBytes {
				t.Errorf("TruncateBytes(%q, %d) overflowed cap: len(out)=%d > %d",
					c.in, c.maxBytes, len(got), c.maxBytes)
			}
		})
	}
}

// TestTruncateBytesNoMarker pins the marker-less variant. Same
// boundary handling as TruncateBytes but no ellipsis cost: the cut
// happens at maxBytes itself, walking back only if that lands
// mid-codepoint.
func TestTruncateBytesNoMarker(t *testing.T) {
	cases := []struct {
		name     string
		in       string
		maxBytes int
		want     string
	}{
		{"empty", "", 10, ""},
		{"under-cap ASCII", "hi", 10, "hi"},
		{"exactly-at-cap ASCII", "hello", 5, "hello"},
		{"over-cap ASCII", "abcdefghij", 5, "abcde"},

		// Over-cap CJK rune-aligned: "你好世界", maxBytes=6, byte 6 is start of 世.
		// s[:6]="你好", a clean cut.
		{"over-cap CJK rune-aligned", "你好世界", 6, "你好"},

		// Over-cap CJK mid-codepoint: maxBytes=4, byte 4 is mid-好.
		// Walk back to byte 3 (start of 好), s[:3]="你".
		{"over-cap CJK mid-codepoint", "你好世界", 4, "你"},

		// Emoji: maxBytes=5, "😀😀" is bytes 0..3 then 4..7. byte 5 is mid-second-😀.
		// Walk back to byte 4 (start of second 😀), s[:4]="😀".
		{"over-cap emoji", "😀😀", 5, "😀"},

		// Edge: cap zero or negative → "".
		{"cap zero", "hello", 0, ""},
		{"cap negative", "hello", -1, ""},

		// Cap = 1 and first rune is multi-byte: walk-back to 0, return "".
		{"cap one with leading CJK", "你hello", 1, ""},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := TruncateBytesNoMarker(c.in, c.maxBytes)
			if got != c.want {
				t.Errorf("TruncateBytesNoMarker(%q, %d) = %q, want %q", c.in, c.maxBytes, got, c.want)
			}
			if !utf8.ValidString(got) {
				t.Errorf("TruncateBytesNoMarker(%q, %d) returned invalid UTF-8: %q", c.in, c.maxBytes, got)
			}
			if c.maxBytes > 0 && len(got) > c.maxBytes {
				t.Errorf("TruncateBytesNoMarker(%q, %d) overflowed cap: len(out)=%d > %d",
					c.in, c.maxBytes, len(got), c.maxBytes)
			}
		})
	}
}

// TestTruncateRunes pins the rune-cap variant. The key contract is
// that maxRunes counts Go runes (Unicode code points, which
// approximate user-visible characters), not bytes, so "你好世界" with
// maxRunes=2 returns "你好…", regardless of the resulting byte count.
func TestTruncateRunes(t *testing.T) {
	cases := []struct {
		name     string
		in       string
		maxRunes int
		want     string
	}{
		{"empty", "", 5, ""},
		{"under-cap ASCII", "hi", 5, "hi"},
		{"exactly-at-cap ASCII", "hello", 5, "hello"},
		{"over-cap ASCII", "abcdefghij", 5, "abcde…"},

		{"under-cap CJK", "你好", 5, "你好"},
		{"exactly-at-cap CJK", "你好", 2, "你好"},

		// Over-cap CJK: maxRunes=3, expect first 3 runes + marker.
		{"over-cap CJK", "你好世界你好", 3, "你好世…"},

		// Emoji is one rune per glyph in Go (no ZWJ here).
		{"over-cap emoji", "😀😀😀😀😀", 2, "😀😀…"},

		// Mixed: maxRunes=3 of "ab你好世界" → "ab你…".
		{"mixed prefix", "ab你好世界", 3, "ab你…"},

		// Edge: maxRunes 0 / negative → "".
		{"cap zero", "hello", 0, ""},
		{"cap negative", "hello", -1, ""},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := TruncateRunes(c.in, c.maxRunes)
			if got != c.want {
				t.Errorf("TruncateRunes(%q, %d) = %q, want %q", c.in, c.maxRunes, got, c.want)
			}
			if !utf8.ValidString(got) {
				t.Errorf("TruncateRunes(%q, %d) returned invalid UTF-8: %q", c.in, c.maxRunes, got)
			}
		})
	}
}

// TestTruncate_FuzzInvariants stays as a property-style sanity check:
// for any rune-valid input and any cap, the output is rune-valid and
// (for byte-cap variants) within the cap. This catches off-by-one
// regressions in cuts that slip past the table-test cases above.
func TestTruncate_FuzzInvariants(t *testing.T) {
	inputs := []string{
		"",
		"a",
		"hello world",
		"你好世界",
		"😀😀😀",
		"ab你c好d世e界",
		"日本語の文字列",
		"🇺🇸🇯🇵", // flags: each is 2 codepoints (regional indicators)
	}
	for _, in := range inputs {
		// limit (not cap) avoids shadowing the builtin.
		for limit := -1; limit <= len(in)+5; limit++ {
			t.Run("", func(t *testing.T) {
				gotB := TruncateBytes(in, limit)
				if !utf8.ValidString(gotB) {
					t.Errorf("TruncateBytes(%q, %d) invalid UTF-8: %q", in, limit, gotB)
				}
				if limit > 0 && len(gotB) > limit {
					t.Errorf("TruncateBytes(%q, %d) overflowed: %q (%d bytes)", in, limit, gotB, len(gotB))
				}

				gotN := TruncateBytesNoMarker(in, limit)
				if !utf8.ValidString(gotN) {
					t.Errorf("TruncateBytesNoMarker(%q, %d) invalid UTF-8: %q", in, limit, gotN)
				}
				if limit > 0 && len(gotN) > limit {
					t.Errorf("TruncateBytesNoMarker(%q, %d) overflowed: %q (%d bytes)", in, limit, gotN, len(gotN))
				}

				gotR := TruncateRunes(in, limit)
				if !utf8.ValidString(gotR) {
					t.Errorf("TruncateRunes(%q, %d) invalid UTF-8: %q", in, limit, gotR)
				}
			})
		}
	}
}
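
The cap sweep in `TestTruncate_FuzzInvariants` can also be read as a reusable property check. The standalone sketch below runs the same two invariants (valid UTF-8 out, length within the cap) against a rune-aligned truncation and against the pre-fix byte-slice shape, to illustrate why the sweep catches the regression. `runeSafe`, `byteSlice`, `firstViolation`, and `sampleInputs` are hypothetical names for illustration only; `runeSafe` is a stand-in, not the real `TruncateBytesNoMarker`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeSafe is a hypothetical stand-in for a marker-less, rune-aligned byte cap.
func runeSafe(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes // cut < len(s) here, so s[cut] is in range
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// byteSlice is the pre-fix shape: it cuts at maxBytes regardless of
// rune boundaries.
func byteSlice(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	return s[:maxBytes]
}

// firstViolation sweeps every cap from -1 to len(in)+5 for each input and
// returns the first broken invariant, or nil if none is found.
func firstViolation(truncate func(string, int) string, inputs []string) error {
	for _, in := range inputs {
		for limit := -1; limit <= len(in)+5; limit++ {
			out := truncate(in, limit)
			if !utf8.ValidString(out) {
				return fmt.Errorf("truncate(%q, %d) returned invalid UTF-8: %q", in, limit, out)
			}
			if limit > 0 && len(out) > limit {
				return fmt.Errorf("truncate(%q, %d) overflowed: %d bytes", in, limit, len(out))
			}
		}
	}
	return nil
}

// sampleInputs are all valid UTF-8, matching the precondition of the invariant.
var sampleInputs = []string{"", "a", "hello world", "你好世界", "😀😀😀", "ab你c好d世e界"}

func main() {
	fmt.Println("rune-safe: ", firstViolation(runeSafe, sampleInputs))
	fmt.Println("byte-slice:", firstViolation(byteSlice, sampleInputs))
}
```

The same checker could in principle be driven from a native Go fuzz target (`testing.F` with `f.Fuzz`), provided the target skips inputs that are not valid UTF-8, since the invariant is only promised for valid input.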