molecule-core/workspace-server/internal/textutil/truncate_test.go
Hongming Wang 656a02fae4 fix(textutil): SSOT for rune-safe string truncation, fix 3 audit-gap bugs
Closes #2962.

## Why

Six per-package `truncate` helpers had drifted into independent
re-implementations of the same idea. Three of them (delegation.go,
memory/client/client.go, memory-backfill/verify.go) used
`s[:max] + "…"` byte-slice form, which on a multi-byte codepoint at
byte `max` produces invalid UTF-8 → Postgres `text`/`jsonb` rejects
the INSERT silently → `delegation` / `activity_logs` row never lands
→ audit gap.
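
The failure mode can be reproduced in isolation. The sketch below is illustrative only: `naiveTruncate` is a hypothetical stand-in for the byte-slice shape described above, not a copy of any of the removed helpers.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// naiveTruncate is a hypothetical stand-in for the pre-fix shape:
// it cuts at byte max without checking rune boundaries.
func naiveTruncate(s string, max int) string {
	if len(s) <= max {
		return s
	}
	return s[:max] + "…"
}

func main() {
	// "你好世界" is 12 bytes (3 per rune); byte 4 falls inside 好.
	out := naiveTruncate("你好世界", 4)
	fmt.Println(utf8.ValidString(out)) // false
}
```

The output keeps the first byte of 好 with no continuation bytes behind it, which is the kind of sequence a UTF-8 validating store will refuse.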

Three other helpers (delegation_ledger.go #2962, agent_message_writer.go
#2959, scheduler.go #2026) had each been fixed in isolation with three
slightly different rune-safe shapes — confirming this is a class of
bug, not a single instance.

## What

New package `internal/textutil` with three rune-safe functions:

- `TruncateBytes(s, maxBytes)` — byte-cap, "…" marker. Used by 5
  callers writing into byte-bounded columns / log lines.
- `TruncateBytesNoMarker(s, maxBytes)` — byte-cap, no marker. Used by
  delegation_ledger.go where the storage already conveys "preview"
  and an extra ellipsis would push the result over the column cap.
- `TruncateRunes(s, maxRunes)` — rune-cap, "…" marker. Used by
  agent_message_writer.go where the cap is in display chars (UI
  summary), not bytes.

All three guarantee `utf8.ValidString(out)` for any `utf8.ValidString(in)`.
Inputs already invalid go through `sanitizeUTF8` at the call site
boundary (scheduler.go preserved this defense-in-depth).
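
For orientation, here is a minimal sketch of how three functions with these contracts could be written. It is an assumption reconstructed from the behavior pinned in `truncate_test.go`, not the actual `internal/textutil` source; the `ellipsis` constant and the walk-back loop are illustrative choices.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const ellipsis = "…" // U+2026, 3 bytes in UTF-8

// TruncateBytesNoMarker returns the longest rune-aligned prefix of s
// that fits in maxBytes. Assumes s is valid UTF-8.
func TruncateBytesNoMarker(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// Walk back until the cut sits on the first byte of a rune.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// TruncateBytes caps s at maxBytes including the ellipsis marker.
func TruncateBytes(s string, maxBytes int) string {
	if len(s) <= maxBytes {
		return s
	}
	if maxBytes < len(ellipsis) {
		return "" // not even the marker fits
	}
	return TruncateBytesNoMarker(s, maxBytes-len(ellipsis)) + ellipsis
}

// TruncateRunes caps s at maxRunes code points, then appends the marker.
func TruncateRunes(s string, maxRunes int) string {
	if maxRunes <= 0 {
		return ""
	}
	count := 0
	for i := range s { // i is the byte offset of each rune start
		if count == maxRunes {
			return s[:i] + ellipsis
		}
		count++
	}
	return s
}

func main() {
	fmt.Printf("%q\n", TruncateBytes("你好世界", 7))         // "你…"
	fmt.Printf("%q\n", TruncateBytesNoMarker("😀😀", 5)) // "😀"
	fmt.Printf("%q\n", TruncateRunes("ab你好世界", 3))       // "ab你…"
}
```

Building `TruncateBytes` on top of `TruncateBytesNoMarker` keeps the rune-boundary logic in one place, which matches the single-source-of-truth intent; the real package may be structured differently.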

## Migration map

| Old | New | Behavior change |
|---|---|---|
| `delegation_ledger.truncatePreview` | `textutil.TruncateBytesNoMarker(s, 4096)` | none |
| `agent_message_writer.truncatePreviewRunes` | `textutil.TruncateRunes(s, n)` | none |
| `scheduler.truncate` | `textutil.TruncateBytes(s, n)` | "..." → "…" (3 bytes either way; single-glyph display) |
| `delegation.truncate` | `textutil.TruncateBytes(s, n)` | bug fix + ellipsis swap |
| `memory/client.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |
| `memory-backfill.truncate` | `textutil.TruncateBytes(s, n)` | bug fix |

Five separate `truncate*` helpers + their per-package tests removed.
Net: 12 files / +427 / -255.

## Tests

- `internal/textutil/truncate_test.go` — 27 table-test cases + 145
  fuzz-invariant cases asserting `utf8.ValidString` and byte-cap
  invariants on every output.
- `delegation_ledger_test.go TestLedgerInsert_TruncatesOversizedPreview`
  strengthened with `capValidUTF8Matcher` so the SQL-write argument
  is asserted to be valid UTF-8 + within cap (not just `AnyArg()`).
  Mutation-tested: replacing the SSOT call with byte-slice form makes
  this test fail loud.
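
A matcher of this kind might look like the sketch below. Only the name `capValidUTF8Matcher` comes from the test; the `maxBytes` field and the `Match(driver.Value) bool` method shape (the argument-matcher interface used by go-sqlmock style mocks) are assumptions.

```go
package main

import (
	"database/sql/driver"
	"fmt"
	"unicode/utf8"
)

// capValidUTF8Matcher is a hypothetical reconstruction: it accepts a
// SQL argument only if it is a string, valid UTF-8, and within the cap.
type capValidUTF8Matcher struct {
	maxBytes int // assumed field name
}

// Match reports whether v satisfies both invariants.
func (m capValidUTF8Matcher) Match(v driver.Value) bool {
	s, ok := v.(string)
	if !ok {
		return false
	}
	return utf8.ValidString(s) && len(s) <= m.maxBytes
}

func main() {
	m := capValidUTF8Matcher{maxBytes: 8}
	fmt.Println(m.Match("你好"))        // true: valid, 6 bytes
	fmt.Println(m.Match("你\xe5"))     // false: cut mid-codepoint
	fmt.Println(m.Match("abcdefghi")) // false: 9 bytes exceeds the cap
}
```

Unlike `AnyArg()`, a matcher like this rejects the argument when the truncation call regresses to byte slicing, which is what makes the mutation test fail.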

## Compatibility

- All callers internal; no external API surface change.
- Ellipsis swap "..." → "…": same byte budget (3 bytes), single-glyph
  display. No alerting/grep on either marker in this codebase
  (verified). Canvas renders both correctly.
- DB column widths unchanged (4096 / 80 / 200 / 256 / 300 — all
  preserved in the migrations).
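
The byte-budget claim for the marker swap is easy to check. In this small sketch, `markerCost` is a hypothetical helper used only for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// markerCost returns the byte length and rune count of a marker.
func markerCost(marker string) (byteLen, runeLen int) {
	return len(marker), utf8.RuneCountInString(marker)
}

func main() {
	fmt.Println(markerCost("...")) // 3 3
	fmt.Println(markerCost("…"))   // 3 1
}
```

Both markers cost 3 bytes against a byte cap, so existing column budgets hold; only the rune count differs.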

## Security

Fixes a silent INSERT-failure mode that hid `activity_logs` /
`delegations` rows containing peer-controlled text. The class of input
that triggered it (CJK, emoji, accented Latin) is normal user content,
not malicious — but the symptom (audit gap) makes incident
reconstruction harder. Helper is pure-function over `string`; no
secrets / PII / auth handling involved. Untrusted input is handled
identically to before, just rune-aligned now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 23:01:21 -07:00


package textutil

import (
	"testing"
	"unicode/utf8"
)

// TestTruncateBytes_RuneBoundary pins the byte-cap, marker-bearing
// truncation path. Every case asserts both:
//  1. the exact expected output (so a refactor that flips ellipsis or
//     drops a rune is caught), and
//  2. utf8.ValidString on the output (the invariant that the bug class
//     in #2026/#2959/#2962 violated by slicing mid-codepoint).
//
// Per memory feedback_assert_exact_not_substring.md, asserts are exact
// equality, not substring matches.
func TestTruncateBytes_RuneBoundary(t *testing.T) {
	cases := []struct {
		name     string
		in       string
		maxBytes int
		want     string
	}{
		// Under-cap: returns input verbatim.
		{"empty", "", 10, ""},
		{"under-cap ASCII", "hi", 10, "hi"},
		{"exactly-at-cap ASCII", "hello", 5, "hello"},
		{"under-cap CJK", "你好", 10, "你好"}, // 6 bytes
		{"exactly-at-cap CJK", "你好", 6, "你好"},
		// Over-cap ASCII: trims to (maxBytes - 3) bytes + "…".
		{"over-cap ASCII", "abcdefghij", 6, "abc…"},
		// Over-cap CJK where the cut would land mid-codepoint (the
		// pre-fix bug shape). Byte layout of "你好世界": 你=0..2,
		// 好=3..5, 世=6..8, 界=9..11. Cut = 7 - 3 = 4, which is
		// mid-好, so walk back to byte 3 (start of 好).
		// s[:3]="你" + "…" = "你…" (3+3=6 bytes ≤ 7).
		{"over-cap CJK lands mid-codepoint", "你好世界", 7, "你…"},
		// Over-cap CJK where cut lands exactly on rune boundary.
		// 9 - 3 = 6, byte 6 is start of 世. Walk-back is no-op.
		// s[:6]="你好" + "…" = "你好…" (9 bytes).
		{"over-cap CJK rune-aligned", "你好世界", 9, "你好…"},
		// Emoji: 😀 is 4 bytes (U+1F600). 7 - 3 = 4, byte 4 is start
		// of second 😀, so walk-back is a no-op. s[:4]="😀" + "…" = "😀…".
		{"over-cap emoji", "😀😀😀", 7, "😀…"},
		// Mixed ASCII + CJK. "ab你好世界": a(1) b(1) 你(3) 好(3) 世(3) 界(3) = 14 bytes,
		// laid out as a=0, b=1, 你=2..4, 好=5..7, 世=8..10, 界=11..13.
		// maxBytes=8, 8-3=5, and byte 5 is the start of 好, so the
		// walk-back is a no-op. s[:5]="ab你" + "…" = "ab你…" (8 bytes).
		{"mixed prefix ASCII over-cap CJK", "ab你好世界", 8, "ab你…"},
		// Pathological: maxBytes too small to even fit the marker.
		{"cap below ellipsis len", "hello", 2, ""},
		{"cap zero", "hello", 0, ""},
		{"cap negative", "hello", -1, ""},
		// Cap exactly == ellipsis len: no room for content, but the
		// marker fits. This returns "…" alone (cut = 0, s[:0] = "").
		{"cap equals ellipsis len", "hello", 3, "…"},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := TruncateBytes(c.in, c.maxBytes)
			if got != c.want {
				t.Errorf("TruncateBytes(%q, %d) = %q, want %q", c.in, c.maxBytes, got, c.want)
			}
			if !utf8.ValidString(got) {
				t.Errorf("TruncateBytes(%q, %d) returned invalid UTF-8: %q", c.in, c.maxBytes, got)
			}
			// Output never exceeds the byte cap (when one is set).
			if c.maxBytes > 0 && len(got) > c.maxBytes {
				t.Errorf("TruncateBytes(%q, %d) overflowed cap: len(out)=%d > %d",
					c.in, c.maxBytes, len(got), c.maxBytes)
			}
		})
	}
}

// TestTruncateBytesNoMarker pins the marker-less variant. Same
// boundary handling as TruncateBytes but no ellipsis cost: the cut
// happens at maxBytes itself, walking back only if that lands
// mid-codepoint.
func TestTruncateBytesNoMarker(t *testing.T) {
	cases := []struct {
		name     string
		in       string
		maxBytes int
		want     string
	}{
		{"empty", "", 10, ""},
		{"under-cap ASCII", "hi", 10, "hi"},
		{"exactly-at-cap ASCII", "hello", 5, "hello"},
		{"over-cap ASCII", "abcdefghij", 5, "abcde"},
		// Over-cap CJK rune-aligned: "你好世界", maxBytes=6, byte 6 is start of 世.
		// s[:6]="你好", a clean cut.
		{"over-cap CJK rune-aligned", "你好世界", 6, "你好"},
		// Over-cap CJK mid-codepoint: maxBytes=4, byte 4 is mid-好.
		// Walk back to byte 3 (start of 好), s[:3]="你".
		{"over-cap CJK mid-codepoint", "你好世界", 4, "你"},
		// Emoji: maxBytes=5, "😀😀" is bytes 0..3 then 4..7. byte 5 is mid-second-😀.
		// Walk back to byte 4 (start of second 😀), s[:4]="😀".
		{"over-cap emoji", "😀😀", 5, "😀"},
		// Edge: cap zero or negative → "".
		{"cap zero", "hello", 0, ""},
		{"cap negative", "hello", -1, ""},
		// Cap = 1 and first rune is multi-byte: walk-back to 0, return "".
		{"cap one with leading CJK", "你hello", 1, ""},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := TruncateBytesNoMarker(c.in, c.maxBytes)
			if got != c.want {
				t.Errorf("TruncateBytesNoMarker(%q, %d) = %q, want %q", c.in, c.maxBytes, got, c.want)
			}
			if !utf8.ValidString(got) {
				t.Errorf("TruncateBytesNoMarker(%q, %d) returned invalid UTF-8: %q", c.in, c.maxBytes, got)
			}
			if c.maxBytes > 0 && len(got) > c.maxBytes {
				t.Errorf("TruncateBytesNoMarker(%q, %d) overflowed cap: len(out)=%d > %d",
					c.in, c.maxBytes, len(got), c.maxBytes)
			}
		})
	}
}

// TestTruncateRunes pins the rune-cap variant. The key contract is
// that maxRunes counts Go runes (Unicode code points), not bytes, so
// "你好世界" with maxRunes=2 returns "你好…" regardless of the
// resulting byte count. Glyphs built from several code points (flags,
// ZWJ sequences) count as several runes.
func TestTruncateRunes(t *testing.T) {
	cases := []struct {
		name     string
		in       string
		maxRunes int
		want     string
	}{
		{"empty", "", 5, ""},
		{"under-cap ASCII", "hi", 5, "hi"},
		{"exactly-at-cap ASCII", "hello", 5, "hello"},
		{"over-cap ASCII", "abcdefghij", 5, "abcde…"},
		{"under-cap CJK", "你好", 5, "你好"},
		{"exactly-at-cap CJK", "你好", 2, "你好"},
		// Over-cap CJK: maxRunes=3, expect first 3 runes + marker.
		{"over-cap CJK", "你好世界你好", 3, "你好世…"},
		// Emoji is one rune per glyph in Go (no ZWJ here).
		{"over-cap emoji", "😀😀😀😀😀", 2, "😀😀…"},
		// Mixed: maxRunes=3 of "ab你好世界" → "ab你…".
		{"mixed prefix", "ab你好世界", 3, "ab你…"},
		// Edge: maxRunes 0 / negative → "".
		{"cap zero", "hello", 0, ""},
		{"cap negative", "hello", -1, ""},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := TruncateRunes(c.in, c.maxRunes)
			if got != c.want {
				t.Errorf("TruncateRunes(%q, %d) = %q, want %q", c.in, c.maxRunes, got, c.want)
			}
			if !utf8.ValidString(got) {
				t.Errorf("TruncateRunes(%q, %d) returned invalid UTF-8: %q", c.in, c.maxRunes, got)
			}
		})
	}
}

// TestTruncate_FuzzInvariants stays as a property-style sanity check:
// for any rune-valid input and any cap, the output is rune-valid and
// (for byte-cap variants) within the cap. This catches off-by-one
// regressions in cuts that slip past the table-test cases above.
func TestTruncate_FuzzInvariants(t *testing.T) {
	inputs := []string{
		"",
		"a",
		"hello world",
		"你好世界",
		"😀😀😀",
		"ab你c好d世e界",
		"日本語の文字列",
		"🇺🇸🇯🇵", // flags: each is 2 codepoints (regional indicators)
	}
	for _, in := range inputs {
		// limit, not cap, so the builtin cap() is not shadowed.
		for limit := -1; limit <= len(in)+5; limit++ {
			t.Run("", func(t *testing.T) {
				gotB := TruncateBytes(in, limit)
				if !utf8.ValidString(gotB) {
					t.Errorf("TruncateBytes(%q, %d) invalid UTF-8: %q", in, limit, gotB)
				}
				if limit > 0 && len(gotB) > limit {
					t.Errorf("TruncateBytes(%q, %d) overflowed: %q (%d bytes)", in, limit, gotB, len(gotB))
				}
				gotN := TruncateBytesNoMarker(in, limit)
				if !utf8.ValidString(gotN) {
					t.Errorf("TruncateBytesNoMarker(%q, %d) invalid UTF-8: %q", in, limit, gotN)
				}
				if limit > 0 && len(gotN) > limit {
					t.Errorf("TruncateBytesNoMarker(%q, %d) overflowed: %q (%d bytes)", in, limit, gotN, len(gotN))
				}
				gotR := TruncateRunes(in, limit)
				if !utf8.ValidString(gotR) {
					t.Errorf("TruncateRunes(%q, %d) invalid UTF-8: %q", in, limit, gotR)
				}
			})
		}
	}
}