forked from molecule-ai/molecule-core
Phase 3 of the poll-mode chat upload rollout. Stacked atop Phase 2.
The platform's pending_uploads table gains one row per uploaded file
and has no built-in cleanup. Phase 1's hard TTL (expires_at default 24h)
makes expired rows un-fetchable but doesn't actually delete them; Phase
1's ack stamps acked_at but leaves the row in place indefinitely.
Without a sweep the table grows without bound under normal traffic.
This PR adds:
- `Storage.Sweep(ctx, ackRetention)` — a single round-trip CTE that
deletes acked rows past their retention window plus unacked rows
past expires_at. Returns `(acked, expired)` deletion counts so
Phase 3 dashboards can spot the stuck-fetch pattern (high expired,
low acked) vs healthy churn.
- `pendinguploads.StartSweeper(ctx, storage, ackRetention)` —
background goroutine that calls Sweep every 5 minutes (default).
Runs once immediately on startup so a platform restart cleans up
any rows that became eligible while we were down.
- Prometheus counter `molecule_pending_uploads_swept_total` with
`outcome={acked,expired,error}` labels. Exposed on the existing
`/metrics` endpoint.
- The sweeper is started from cmd/server/main.go via
supervised.RunWithRecover, so one transient panic in the sweeper
doesn't take the platform down with it.
Defaults:
- SweepInterval = 5m (matches the dashboard refresh cadence)
- DefaultAckRetention = 1h (gives the workspace at-least-once retry
headroom in case it processed but failed to write the file before
crashing)
Test coverage: 100% on storage_test.go (extended with sweepSQL pin +
six Sweep test cases including negative-retention clamp + zero-retention
immediate-delete + DB error wrapping) and sweeper_test.go (ticker-driven
+ ctx-cancel + nil-storage + transient-error-doesn't-crash + metric
counter assertions).
Closes the third of four phases tracked on the parent RFC; Phase 4 is
the staging E2E test.