fix(a2a-proxy): normalize system-caller source_id to NULL (core#2680 wedge recovery) #2701
Reference in New Issue
Block a user
Delete Branch "fix/restart-context-callerid-normalize"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Problem
The post-restart wedge (core#2680) wedges a workspace in
degradedbecause the syntheticsystem:restart-contextcaller ID gets persisted verbatim intoactivity_logs.source_id(UUID-typed). Two compounding effects:source_idcolumn accepts the raw string at write time (TEXT-coerced via the JSONBrequest_body), but every downstream JOIN (e.g.activity_logs.source_id = workspaces.id) silently misses — the row is unreadable for the canvas/activityfeed filter that keys onsource_id IS NULL.a2a_queue.caller_id) that lets a workspace recover from the wedge depends onLogActivity'srequest_bodysurviving a re-read bymessageId. The poisonedsource_idalso breaks the partial-unique index lookup used by the recovery path.Net: the wedge detector (registry.go:950-955) fires on
hasRecentRegisterFailure→ workspace flips online→degraded within seconds of a restart → the recovery path can't find the row → workspace stays degraded pastRESTART_TIMEOUT(240s, post-#2688).Fix
Extend
nilIfEmpty(a2a_proxy_helpers.go:441) to also returnnilfor any system-caller prefix (matchingisSystemCallerin a2a_proxy.go:85). All 5 call sites —LogActivityinvocations at lines 319, 347, 420, 835, 937 — pick up the new behavior without a per-site change. Thesystem callersemantic is preserved viasource_id IS NULL(the existing canvas/activity?source=canvasfilter already keys on that).Mirrors the queue-side fix in #2696 (
a2a_queue.caller_id): BOTH persisted-caller columns follow the sameisSystemCaller()→ NULL normalization. Single source of truth in a2a_proxy.go:82-91. No drift surface.Why the prior #2694 attempt closed
The prior #2694 had the same fix-shape but also changed
nilIfEmptyin a way that accidentally collapsed real workspace UUIDs to NULL in a sub-case. This PR carries an explicit regression-guard test (TestNilIfEmpty_RealWorkspaceUUIDStillPreserved, 4 cases) so the same bug can't ship twice.Cross-reference to #2680
This fix unblocks the queue-fallback recovery path. Concretely:
system:restart-context(string), queue-fallback SELECT by source_id returns 0 rows → recovery stalls → workspace stays degraded.activity_logs.message_id(the partial-unique index) returns the durable row → recovery completes → workspace returns to online.The wedge detector (the
hasRecentRegisterFailureflip in registry.go:950-955) is the trigger of the wedge; this fix is the enabler of recovery. Both halves are needed for the workspace to fully clear the degraded state, but this PR is sufficient to make the recovery path functional. The trigger side (#2530, auth-token rotation on container re-create) is tracked separately and not in scope here.Tests
TestNilIfEmpty_SystemCallerPrefixes(4 cases):system:restart-context,webhook:github,test:lifecycle-1,channel:slack:C0123— all normalize tonil.TestNilIfEmpty_RealWorkspaceUUIDStillPreserved(4 cases):ws-1, full UUID,agent-dev-b,canvas_user— all preserved (regression guard for the bug that closed the prior #2694).TestNilIfEmpty_EmptyStringandTestNilIfEmpty_NonEmptyStringstill pass.Verified
go build ./...clean.go vet ./...clean.TestNilIfEmpty_*all pass (6 cases).Out of scope (deferred follow-ups)
a2a_queue.caller_id) — already open, pending Researcher approval. After this PR merges, the two columns follow the same normalization.Refs: #2680, #2693, #2696.
REQUEST_CHANGES: this closes the intended
activity_logs.source_idgap in principle, but the implementation changes the genericnilIfEmptyhelper, which has non-source-id call sites and can silently drop valid non-empty values.At head
e6c78e328a90284f390f83fb0d0a5f9f46633a32,nilIfEmptynow returns nil for anyisSystemCaller(s)prefix (a2a_proxy_helpers.go:469-471). That does makeSourceID: nilIfEmpty(callerID)normalizesystem:restart-contextto NULL, and the tests prove system prefixes nil plus normal callers preserved. HowevernilIfEmptyis also used for non-ID fields:TargetID,Method,Summary,ErrorDetailinactivity.go:931-938,MessageIdinactivity.go:1054, andworkspace_dirinworkspace.go:1062. With this patch, a legitimate summary/error/method/message/workspace_dir value beginning withsystem:,test:,webhook:, orchannel:is now treated as absent. That is outside the requested source_id fix and is not covered by tests.Fix shape: keep
nilIfEmptyas empty-string-only for generic optional text fields, add a source/caller-specific helper such asnilIfEmptyOrSystemCaller(ornilCallerIDIfSystem) and use it only at theSourceID: ... callerIDLogActivity call sites. Add regression coverage forsystem:restart-contextsource_id -> NULL, real workspace UUID -> preserved, and genericnilIfEmpty("system:literal")still preserved for non-ID text. Then I can approve the source_id half.APPROVED: #2701 head
20b0c1d0069988d23d815c20afd795662c4d8d98resolves my RC #11295.The source-id side is now scoped correctly: the five
SourceID: callerIDactivity-log bind sites usecallerIDToSourceID, which maps empty/system callers such assystem:restart-contextto NULL and preserves real caller IDs. The genericnilIfEmptyhelper is back to its narrow contract, empty-string -> nil only, so non-ID fields (TargetID,Method,Summary,ErrorDetail,MessageId,workspace_dir) are no longer affected by system-caller normalization.The tests cover the contract: system prefixes -> nil, real workspace/op/canvas caller values preserved, empty caller nil, and
nilIfEmpty("system:foo")remains preserved for non-ID text. Paired with already-approved #2696 (a2a_queue.caller_id), this fully normalizes both persisted caller UUID columns without the prior collateral gap./sop-ack