Adds the `delegations` table and the DelegationLedger writer that PRs #2-#4 of RFC #2829 build on. Schema-only foundation — no behavior change in this PR. PR-2 wires the ledger into the existing handlers and ships the result- push-to-inbox cutover behind a feature flag. Why a dedicated table when activity_logs already records every delegation event: Today, "what is currently in flight for this workspace" is reconstructed by GROUPing activity_logs by delegation_id and ORDER BY created_at DESC. PR-3's stuck-task sweeper needs the join SELECT delegation_id FROM delegations WHERE status = 'in_progress' AND last_heartbeat < now() - interval '10 minutes' which is impossible to express against the event stream without a window over every (delegation_id, latest event) pair — a planner-killing query at scale. The dedicated table makes the sweeper an indexed scan. Same posture as tenant_resources (PR #2343, memory `reference_tenant_resources_audit`): activity_logs remains the audit- grade source of truth, delegations is the queryable view for dashboards + sweeper joins. Symmetric writes — both tables are written, neither blocks orchestration on the other's failure. Schema highlights: - delegation_id PRIMARY KEY (caller-chosen, idempotent retry on restart is a no-op via ON CONFLICT DO NOTHING) - caller_id / callee_id NOT FK — workspace delete must NOT cascade- delete delegation history (audit retention) - status CHECK constraint enforces the lifecycle (queued|dispatched|in_progress|completed|failed|stuck) - last_heartbeat NULL-able; PR-3 sweeper compares to NOW() - deadline default now()+6h matches longest-observed legit delegation (memory-namespace migrations) — protects against forever-heartbeating wedged agents - Partial index `idx_delegations_inflight_heartbeat` keeps the sweeper hot path tiny (only non-terminal rows) - UNIQUE(caller_id, idempotency_key) WHERE NOT NULL — natural collision becomes ON CONFLICT no-op without colliding across callers DelegationLedger.SetStatus enforces forward-only on terminal states (completed/failed/stuck cannot be revised) as defense-in-depth on the schema CHECK. Same-status replay is a no-op. Missing-row SetStatus is a no-op (transient inconsistency the next agent retry will heal). Heartbeat updates only in-flight rows — terminal-state delegations are silently skipped. Coverage: - 17 unit tests against sqlmock-backed *sql.DB (Insert happy path, missing-required guards, truncation, lifecycle transitions, terminal forward-only protection, replay no-op, missing-row no-op, empty-input rejection, heartbeat semantics, transition table shape) - Migration roundtrip verified on a real Postgres 15 instance: up creates the expected schema with all 4 indexes + CHECK, down drops everything cleanly. Refs RFC #2829.
100 lines
4.8 KiB
SQL
100 lines
4.8 KiB
SQL
-- RFC #2829 PR-1: durable delegations ledger.
|
|
--
|
|
-- Today, delegation state is reconstructed by GROUPing activity_logs rows
|
|
-- by delegation_id and ORDER BY created_at DESC. Three problems:
|
|
--
|
|
-- 1. No queryable "what is currently in flight for this workspace" — every
|
|
-- caller has to fold the event stream itself.
|
|
-- 2. No place to durably stamp last_heartbeat / deadline on a per-task
|
|
-- basis, so a stuck-task sweeper has nothing to scan.
|
|
-- 3. The 600s message/send proxy timeout (the user's 2026-05-05 home-hermes
|
|
-- iteration-14/90 incident) leaves the in-flight HTTP connection holding
|
|
-- all the state — caller restart, callee restart, proxy timeout all kill
|
|
-- the delegation. activity_logs can replay the *intent* but not the
|
|
-- *current state* without the row that says "yes this is still alive".
|
|
--
|
|
-- This table is the durable ledger that PRs #2-#4 build on:
|
|
-- PR-2 — push result to caller's inbox + use this row to track readiness
|
|
-- PR-3 — sweeper joins on (status='in_progress', last_heartbeat<now-N)
|
|
-- PR-4 — operator dashboard reads SELECT * WHERE status NOT IN ('completed','failed')
|
|
--
|
|
-- Delegation lifecycle:
|
|
-- queued — caller recorded intent, target unreachable / busy queue
|
|
-- dispatched — A2A request sent to target's HTTP server
|
|
-- in_progress — target acknowledged + started work
|
|
-- completed — terminal: result delivered to caller
|
|
-- failed — terminal: gave up after retries
|
|
-- stuck — terminal-ish: sweeper couldn't reach target for >threshold;
|
|
-- operator can transition to failed via dashboard (PR-4)
|
|
|
|
CREATE TABLE IF NOT EXISTS delegations (
|
|
-- delegation_id chosen by the caller so callee + caller agree on the key
|
|
-- without a database round-trip. UUID, but stored as TEXT to match the
|
|
-- existing agent-side string contract (delegation.py uses str(uuid4())).
|
|
delegation_id text PRIMARY KEY,
|
|
|
|
-- Caller is the workspace that initiated the delegation. Callee is the
|
|
-- target. Both reference workspaces, but we don't FK them — workspace
|
|
-- delete should NOT cascade-delete delegations history (audit retention).
|
|
-- Same posture as tenant_resources (PR #2343).
|
|
caller_id uuid NOT NULL,
|
|
callee_id uuid NOT NULL,
|
|
|
|
-- Truncated at insertion so a 50KB prompt doesn't bloat the ledger; the
|
|
-- full prompt lives in activity_logs.request_body for forensic replay.
|
|
task_preview text NOT NULL,
|
|
|
|
status text NOT NULL DEFAULT 'queued'
|
|
CHECK (status IN ('queued','dispatched','in_progress','completed','failed','stuck')),
|
|
|
|
-- Stamped by callee heartbeats (PR-3 sweeper compares to NOW()). NULL
|
|
-- before any heartbeat — sweeper treats NULL same as last_heartbeat
|
|
-- < (created_at) for stuckness purposes.
|
|
last_heartbeat timestamptz,
|
|
|
|
-- Hard deadline. Beyond this, sweeper marks `failed` regardless of
|
|
-- heartbeat liveness — protects against agents that heartbeat forever
|
|
-- without making progress. Default 6h matches the longest-observed legit
|
|
-- delegation in production (memory-namespace migration runs).
|
|
deadline timestamptz NOT NULL DEFAULT (now() + interval '6 hours'),
|
|
|
|
-- Truncated result preview (full result in activity_logs response_body).
|
|
-- Set on terminal completed transition.
|
|
result_preview text,
|
|
|
|
-- Set on failed/stuck terminal transition.
|
|
error_detail text,
|
|
|
|
-- For PR-3 retry policy. Not used in PR-1 — declared so PR-3 doesn't
|
|
-- need a follow-on migration.
|
|
retry_count integer NOT NULL DEFAULT 0,
|
|
|
|
created_at timestamptz NOT NULL DEFAULT now(),
|
|
updated_at timestamptz NOT NULL DEFAULT now(),
|
|
|
|
-- Idempotency: the agent-side delegate_task call accepts an idempotency
|
|
-- key. Two records of the same key collapse to one row. Indexed UNIQUE
|
|
-- where non-null so the natural collision becomes an INSERT … ON
|
|
-- CONFLICT no-op.
|
|
idempotency_key text
|
|
);
|
|
|
|
-- Sweeper hot path (PR-3): list everything that's in_progress and overdue
|
|
-- for a heartbeat. Partial index on non-terminal status keeps this small.
|
|
CREATE INDEX IF NOT EXISTS idx_delegations_inflight_heartbeat
|
|
ON delegations (last_heartbeat NULLS FIRST)
|
|
WHERE status IN ('queued','dispatched','in_progress');
|
|
|
|
-- Operator dashboard (PR-4): per-workspace recent delegations.
|
|
CREATE INDEX IF NOT EXISTS idx_delegations_caller_created
|
|
ON delegations (caller_id, created_at DESC);
|
|
|
|
CREATE INDEX IF NOT EXISTS idx_delegations_callee_created
|
|
ON delegations (callee_id, created_at DESC);
|
|
|
|
-- Idempotency dedupe: composite (caller_id, idempotency_key) so two
|
|
-- different callers can use the same key without colliding.
|
|
CREATE UNIQUE INDEX IF NOT EXISTS idx_delegations_idempotency
|
|
ON delegations (caller_id, idempotency_key)
|
|
WHERE idempotency_key IS NOT NULL;
|