fix(registry): add register-400 diagnostics + restart-context callerID regression guard (#2680 residual) #2710
Reference in New Issue
Block a user
Delete Branch "fix/restart-context-register-400-diagnostics"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Two-part fix for the #2680 residual (per Researcher RCA comment #99166, run 355924/job 482768):
(a) SYNTHETIC-CALLER NORMALIZATION — already in main via #2701 (the
callerIDToSourceIDscoped helper at a2a_proxy_helpers.go:500, applied at the 5 SourceID=callerID call sites inpersistUserMessageAtIngest+logA2ASuccess). The restart-context production path atrestart_context.go:296callsProxyA2ARequest(ctx, workspaceID, body, "system:restart-context", false)→ flows topersistUserMessageAtIngest→callerIDToSourceID("system:restart-context")→ returns nil (becauseisSystemCaller("system:restart-context")is true). The INSERT usesSourceID=NULL; the UUID cast is skipped; the post-restart queue-fallback path can recover.This PR adds the regression-guard test (
TestRestartContext_SystemCallerDoesNotPoisonSourceID) that pins this contract for all 4systemCallerPrefixes. If the helper is later removed OR weakened OR a call site is reverted, the test fails and the #2680 wedge class re-emerges.(b) REGISTER-400 DIAGNOSTICS — the new work. When a recreated container's first
/registry/registercall returns 400, the operator currently has NO actionable signal. The deferredboot_register_failedlog at the end of the function only fires AFTER validation has already returned, so by the time it logs we have the status code but not the reason (URL missing? URL in a private range? delivery_mode? kind? invalid JSON? SSRF blocklist?).This PR adds:
fetchExistingWorkspaceStateForDiagnostics(registry.go): reads the workspace's current row state (url, kind, delivery_mode) for the diagnostic log line. Best-effort; never fails the handler.registerDiagnosticsstruct: the row-state container; empty means "unavailable" / "new".logRegister400Reason: emits a single grep-able line for every 400 path. Stable shape:registry_register_400 workspace=<id> reason=<key> payload_url=... payload_card_url=... payload_kind=... payload_delivery_mode=... existing_url=... existing_kind=... existing_delivery_mode=... detail=.... Thereasonkey is one of:invalid_json|invalid_delivery_mode|invalid_kind|url_required_for_push|url_validate_failed.invalid_json(parse fail),invalid_delivery_mode,invalid_kind,url_required_for_push,url_validate_failed(SSRF blocklist). The pre-ctx paths pass an emptyregisterDiagnostics{}so the existing row state defaults to'(new)'in the log line (the right framing for a fresh-register parse failure / pre-ctx early validation). The post-ctx paths use the fetched row state.Tests
TestRegister_400_LogsDiagnosticsReason(registry_test.go): 4 subtests coveringinvalid_delivery_mode,invalid_kind,url_required_for_push,url_validate_failed(link-local URLhttp://169.254.169.254:8000which is always-blocked in any deploy mode). Each asserts the diagnostic log line shape.TestRegister_400_LogsExistingRowState: asserts the log line captures the workspace's existing row state (delivery_mode, runtime) at the time of the 400, so the operator can compare the rejected payload against the row to identify the drift source.TestRestartContext_SystemCallerDoesNotPoisonSourceID(a2a_proxy_helpers_test.go): 4 subtests covering thesystemCallerPrefixes(system:,webhook:,test:,channel:) for the scoped helper. Pins the contract from #2701.Verified
go build ./...clean.go vet ./...clean.TestRegister_BadJSON,TestRegister_MissingRequiredFields,TestRegister_DBError,TestRegister_Non200_LogsStatusCode,TestRegister_200_DoesNotLogFailureall still pass.Register-400 root cause: NOT found from the static analysis
The 5 documented 400 sources are: invalid JSON body, invalid delivery_mode, invalid kind, URL required for push-mode, URL fails SSRF validation. The most likely recreate-path 400 candidate is the URL-validate path (private range in non-SaaS mode OR a metadata-endpoint IP in any mode), but the specific cause requires a live restart to surface. The diagnostics in this PR will reveal the actual cause on the next run. If the cause is a specific bad URL pattern, a follow-up PR can hard-fix the validation OR relax the SSRF blocklist for the specific case; this PR is the prerequisite observability.
SOP Checklist
sop-checklist-gate,required_status_checks,required_approvalsare all unchanged.fetchExistingWorkspaceStateForDiagnosticsis best-effort: a DB error logs+continues (the row state defaults to(new)in the log line). The diagnostic MUST NOT introduce a new failure path (a second 500 on top of the original 400 would mask the cause).TestRestartContext_SystemCallerDoesNotPoisonSourceIDpins thecallerIDToSourceIDcontract for all 4 systemCallerPrefixes. Catches any future removal/weakening/revert.invalid_jsonpath uses a different test path because the row state isn't fetched pre-ctx; the log line shape is verified by the existingTestRegister_BadJSONfailure-mode test).Out of scope (intentional)
Links
callerIDToSourceIDhelper this PR's regression-guard pins)a2a_queue.caller_idnormalization sibling)Reply with PR link + whether the register-400 root cause was found.
Register-400 root cause: NOT found from the static analysis (the diagnostics in this PR are the prerequisite observability for the next live restart to surface it). The most likely candidate is the URL-validate path on a private range / metadata-endpoint IP in non-SaaS mode; a follow-up PR can hard-fix once the cause is concrete.
REQUEST_CHANGES on head
daf59e474e. The restart-context callerID regression guard is fine, but the new register-400 diagnostics need fixes before this is safe to merge.Sensitive logging risk:
logRegister400Reasonwrites rawpayload.URL, rawagent_card.url, raw existing row URL, and detail strings containingurl=/payload.url=/agent_card.url=. Register URLs can contain userinfo or token-like query parameters, and the URL-SSRF path is specifically expected to log private/internal URLs. Please sanitize/redact URLs before logging: at minimum strip userinfo and credential-like query params/fragments, and avoid dumping raw URLs indetail. The diagnostic can keep host/classification/reason without leaking secrets.Existing-row-state coverage is not correct.
TestRegister_400_LogsExistingRowStateuses an invaliddelivery_mode, but that code path returns beforefetchExistingWorkspaceStateForDiagnosticsruns, so it cannot assert existing URL/kind/delivery_mode. The test also does not assertexisting_url,existing_kind, orexisting_delivery_mode; it only checks reason/workspace. Please add a real row-state test on a URL-required or URL-validation 400 path, mock the newSELECT url, kind, delivery_mode, and assert the log contains the expected existing row fields.The URL-path subtests likely need their SQL expectations updated for the new best-effort row-state fetch, which now runs before the C18/token/resolution queries. Please ensure the 5-path test suite actually passes and covers all five reasons plus row-state where applicable.
Once the logs are redacted and the tests genuinely cover row-state, this should be low-risk diagnostics-only.
REQUEST_CHANGES on head
b55c0a93dc. RC item 1 is addressed: production diagnostics now log URL presence (present/absent) instead of rawpayload_url,payload_card_url, orexisting_url, and the URL-required / URL-validation details no longer include the submitted raw URL.RC item 2 is still unaddressed.
TestRegister_400_LogsExistingRowStatestill usesdelivery_mode=foo, which returns in the invalid_delivery_mode pre-ctx path beforeexistingState := h.fetchExistingWorkspaceStateForDiagnostics(...)runs. It still mocksSELECT delivery_mode, runtime FROM workspaces WHERE id, not the newSELECT url, kind, delivery_mode FROM workspaces WHERE id = $1, and it still does not assertexisting_url,existing_kind, orexisting_delivery_mode. So the test does not prove row-state diagnostics work.Please add/replace with a real URL-required or URL-validation 400 test that reaches the post-ctx diagnostics fetch, mocks
SELECT url, kind, delivery_mode, and asserts the emitted log has the expectedexisting_url=presentplus expectedexisting_kindandexisting_delivery_modefields. CI is green otherwise, but this test coverage was an explicit RC condition.MECHANISM: The real #2710 red is
E2E Staging Platform Boot, not the qa/security/SOP governance gates. The platform-boot harness reaches a live tenant and an online/routable workspace, then the first parent A2A call intests/e2e/test_staging_full_saas.sh:1130-1175accepts any 2xx as final and immediately extractsresult.parts[0].text. In run357324job485280, that 2xx body was a queue response (queued:true,queue_id:64a398e3-...), so the text extractor sees no result text and fails. This is the #2437 queued-response boundary, but on the initial platform-boot A2A path rather than the later known-answer poll path.EVIDENCE: #2710 head
b55c0a93is not based on merged #2706 (c31f7eb5is not an ancestor), so it does not have the queue-status 401/403/404-retryable endpoint available on this branch. Job log excerpt:A2A returned no text. The raw response in the same log is{"queued":true,"queue_id":"64a398e3-e28d-4a9c-b62a-f7db1a300012"...}. The same script does have queue polling later for the known-answer path aroundtests/e2e/test_staging_full_saas.sh:1289-1378, which reinforces that the initial parent A2A path is the uncovered branch.RECOMMENDED FIX SHAPE: Rebase #2710 onto current main so #2706's
GetA2AQueueStatusbehavior is present, then extend the queue-poll handling to the initial parent A2A block intests/e2e/test_staging_full_saas.sh:1130-1175instead of treating202/2xx queuedas final. Responsible file:tests/e2e/test_staging_full_saas.sh; no evidence points toregistry.goregister-400 diagnostics as the direct cause of this red.APPROVED on head
a1a1192f06. The prior RC is addressed. URL diagnostics log presence rather than raw payload/card/existing URLs, and the URL-validation detail uses the friendly blocked-address label instead of the submitted address. The row-state regression test now exercises a post-ctx URL-validation 400 path, mocks the actualSELECT url, kind, delivery_mode FROM workspaces WHERE id = $1, assertsexisting_url=present,existing_kind="workspace", andexisting_delivery_mode="push", and verifies raw existing/payload URLs do not leak. The system-caller SourceID regression guard still covers the four system caller prefixes. CI/all-required and relevant Go/E2E checks are green; PR is mergeable./sop-ack