fix(core#2693): normalize system:restart-context callerID + register-400 diagnostics #2722

Closed
agent-dev-b wants to merge 1 commits from fix/core2693-restart-context-callerid-register400 into main
Member

Two-part production-code fix for #2680 / #2530 / #2693

(a) Synthetic callerID normalization (the production-code half of #2701)

The restart-context path (restart_context.go:296) issues ProxyA2ARequest with callerID="system:restart-context". Pre-fix that literal string reached activity_logs.source_id (UUID-typed, see migration 009) via the logA2A* helpers in a2a_proxy_helpers.go, either failing the UUID INSERT or poisoning the column.

Fix: new callerIDToSourceID helper in a2a_proxy.go (next to isSystemCaller) returns nil for:

  • empty callerID
  • system-prefixed callers (system:, webhook:, test:, channel:)
  • any non-UUID caller (defense in depth — future caller paths that forget the system: prefix still get a nil, not a 500 from a UUID parse error)

Wired into all 5 SourceID: sites in a2a_proxy_helpers.go (logA2AFailure, logA2ABusyQueued, logA2ASuccess, logA2AReceiveQueued, persistUserMessageAtIngest). The restart-context request body, summary, and system-caller trace are still preserved in the activity row; only source_id is now nil.

(b) Register-400 diagnostics (the recreate-path observability)

Pre-fix boot_register_failed reported status=400, which was the same line for FIVE distinct causes (invalid body / invalid delivery_mode / invalid kind / url-required for push / url fails SSRF). Operators triaging the restart-context "Secret with name bootstrap not found" / register-400-on-recreate class of issues had no way to tell WHY register returned 400.

Fix: new deferred registry_register_400 reason=<key> detail=<...> log line in registry.go:Register, with the reason captured at each of the 5 400 paths:

  • invalid_request_body (c.ShouldBindJSON failure)
  • invalid_delivery_mode (offending value)
  • invalid_kind (offending value)
  • url_required_for_push (no detail — payload state is the signal)
  • url_validate_failed (friendly CIDR label from validateAgentURL, NOT the raw URL — PII/secret-leak guard; the raw URL stays in the response body where the agent sent it)

The same reason is also folded into the existing boot_register_failed line on the four 400 paths that return LATE enough for the existing defer to fire (invalid_request_body returns EARLY, before the defer is registered, so it only gets the new registry_register_400 line — pre-existing limitation, the new log fills the gap).

Test coverage (all green)

  • TestCallerIDToSourceID — 10 subtests: empty, 4 system prefixes, valid UUID, 3 non-UUID shapes (incl. malformed UUID) → all return nil except the valid-UUID case which returns &callerID.
  • TestRegister_400_*_LogsReason — 5 tests, one per 400 path, asserting the new registry_register_400 reason= line fires.
  • TestRegister_200_DoesNotEmit400ReasonLog — pins the contract that 200 responses do NOT emit a registry_register_400 line (deferred log guards on register400Reason != "").
  • TestRegister_400_URLValidateFailed_LogsReasonWithoutRawURL — pins the PII/secret-leak guard: log line must NOT contain the raw URL, must contain the SSRF/link-local classifier from validateAgentURL.

Verification

$ go build ./...                    ✓
$ go vet ./...                      ✓
$ golangci-lint run --timeout 5m    0 issues
$ go test -run 'TestCallerIDToSourceID|TestRegister_400|TestRegister_200_DoesNotEmit' -v
=== RUN   TestCallerIDToSourceID                                    --- PASS (10 subtests)
=== RUN   TestRegister_400_InvalidRequestBody_LogsReason             --- PASS
=== RUN   TestRegister_400_InvalidDeliveryMode_LogsReason           --- PASS
=== RUN   TestRegister_400_InvalidKind_LogsReason                   --- PASS
=== RUN   TestRegister_400_URLRequiredForPush_LogsReason            --- PASS
=== RUN   TestRegister_400_URLValidateFailed_LogsReasonWithoutRawURL --- PASS
=== RUN   TestRegister_200_DoesNotEmit400ReasonLog                  --- PASS
PASS
ok      .../internal/handlers    0.088s

(Pre-existing TestApplyPlatformManagedLLMEnv_* failures on main are unrelated — verified by stashing my changes + re-running on main; they fail identically.)

Diff

workspace-server/internal/handlers/a2a_proxy.go         |  41 ++++
workspace-server/internal/handlers/a2a_proxy_helpers.go  |  10 +-
workspace-server/internal/handlers/a2a_proxy_test.go    |  49 +++++
workspace-server/internal/handlers/registry.go          |  42 +++-
workspace-server/internal/handlers/registry_test.go      | 216 +++++++ (new tests)
5 files changed, 352 insertions(+), 6 deletions(-)

Refs

  • #2693 (this PR — production-code fix)
  • #2680 (RCA parent — the bug this fixes the production code for)
  • #2530 (auth-token loss on container re-create, the upstream root cause)
  • #2701 (in-main helper, whose presence unblocked this PR — the regression-guard tests in #2710 already pinned the contract; this PR provides the production helper they tested against)
  • #2710 (the prior diagnostics+regression-guard PR for #2680 — APPROVED, awaiting admin-merge, unrelated to this PR)

What's NOT in this PR (deliberate scope)

  • Auth-token survival on container re-create (#2530 root cause) — that's a spec-level design choice (option a vs b vs c in the #2693 body), needs CR2 + human signoff, not a single-tick fix. This PR's callerID normalization removes the UUID-poisoning symptom; the re-registration contract itself is the next ticket.
  • PR #2710's regression-guard tests — already in main, not touched here. They guard the call sites; this PR provides the production helper they were guarding.
## Two-part production-code fix for #2680 / #2530 / #2693 ### (a) Synthetic callerID normalization (the production-code half of #2701) The restart-context path (`restart_context.go:296`) issues `ProxyA2ARequest` with `callerID="system:restart-context"`. Pre-fix that literal string reached `activity_logs.source_id` (UUID-typed, see migration 009) via the `logA2A*` helpers in `a2a_proxy_helpers.go`, either failing the UUID INSERT or poisoning the column. **Fix:** new `callerIDToSourceID` helper in `a2a_proxy.go` (next to `isSystemCaller`) returns `nil` for: - empty callerID - system-prefixed callers (`system:`, `webhook:`, `test:`, `channel:`) - any non-UUID caller (defense in depth — future caller paths that forget the system: prefix still get a nil, not a 500 from a UUID parse error) Wired into all 5 `SourceID:` sites in `a2a_proxy_helpers.go` (`logA2AFailure`, `logA2ABusyQueued`, `logA2ASuccess`, `logA2AReceiveQueued`, `persistUserMessageAtIngest`). The restart-context request body, summary, and system-caller trace are still preserved in the activity row; only `source_id` is now nil. ### (b) Register-400 diagnostics (the recreate-path observability) Pre-fix `boot_register_failed` reported `status=400`, which was the same line for FIVE distinct causes (invalid body / invalid delivery_mode / invalid kind / url-required for push / url fails SSRF). Operators triaging the restart-context "Secret with name bootstrap not found" / register-400-on-recreate class of issues had no way to tell WHY register returned 400. **Fix:** new deferred `registry_register_400 reason=<key> detail=<...>` log line in `registry.go:Register`, with the reason captured at each of the 5 400 paths: - `invalid_request_body` (c.ShouldBindJSON failure) - `invalid_delivery_mode` (offending value) - `invalid_kind` (offending value) - `url_required_for_push` (no detail — payload state is the signal) - `url_validate_failed` (friendly CIDR label from validateAgentURL, **NOT the raw URL** — PII/secret-leak guard; the raw URL stays in the response body where the agent sent it) The same reason is also folded into the existing `boot_register_failed` line on the four 400 paths that return LATE enough for the existing defer to fire (invalid_request_body returns EARLY, before the defer is registered, so it only gets the new `registry_register_400` line — pre-existing limitation, the new log fills the gap). ## Test coverage (all green) - **TestCallerIDToSourceID** — 10 subtests: empty, 4 system prefixes, valid UUID, 3 non-UUID shapes (incl. malformed UUID) → all return nil except the valid-UUID case which returns `&callerID`. - **TestRegister_400_*_LogsReason** — 5 tests, one per 400 path, asserting the new `registry_register_400 reason=` line fires. - **TestRegister_200_DoesNotEmit400ReasonLog** — pins the contract that 200 responses do NOT emit a `registry_register_400` line (deferred log guards on `register400Reason != ""`). - **TestRegister_400_URLValidateFailed_LogsReasonWithoutRawURL** — pins the PII/secret-leak guard: log line must NOT contain the raw URL, must contain the SSRF/link-local classifier from `validateAgentURL`. ## Verification ``` $ go build ./... ✓ $ go vet ./... ✓ $ golangci-lint run --timeout 5m 0 issues $ go test -run 'TestCallerIDToSourceID|TestRegister_400|TestRegister_200_DoesNotEmit' -v === RUN TestCallerIDToSourceID --- PASS (10 subtests) === RUN TestRegister_400_InvalidRequestBody_LogsReason --- PASS === RUN TestRegister_400_InvalidDeliveryMode_LogsReason --- PASS === RUN TestRegister_400_InvalidKind_LogsReason --- PASS === RUN TestRegister_400_URLRequiredForPush_LogsReason --- PASS === RUN TestRegister_400_URLValidateFailed_LogsReasonWithoutRawURL --- PASS === RUN TestRegister_200_DoesNotEmit400ReasonLog --- PASS PASS ok .../internal/handlers 0.088s ``` (Pre-existing `TestApplyPlatformManagedLLMEnv_*` failures on main are unrelated — verified by stashing my changes + re-running on main; they fail identically.) ## Diff ``` workspace-server/internal/handlers/a2a_proxy.go | 41 ++++ workspace-server/internal/handlers/a2a_proxy_helpers.go | 10 +- workspace-server/internal/handlers/a2a_proxy_test.go | 49 +++++ workspace-server/internal/handlers/registry.go | 42 +++- workspace-server/internal/handlers/registry_test.go | 216 +++++++ (new tests) 5 files changed, 352 insertions(+), 6 deletions(-) ``` ## Refs - #2693 (this PR — production-code fix) - #2680 (RCA parent — the bug this fixes the production code for) - #2530 (auth-token loss on container re-create, the upstream root cause) - #2701 (in-main helper, whose presence unblocked this PR — the regression-guard tests in #2710 already pinned the contract; this PR provides the production helper they tested against) - #2710 (the prior diagnostics+regression-guard PR for #2680 — APPROVED, awaiting admin-merge, unrelated to this PR) ## What's NOT in this PR (deliberate scope) - **Auth-token survival on container re-create** (#2530 root cause) — that's a spec-level design choice (option a vs b vs c in the #2693 body), needs CR2 + human signoff, not a single-tick fix. This PR's callerID normalization removes the UUID-poisoning symptom; the re-registration contract itself is the next ticket. - **PR #2710's regression-guard tests** — already in main, not touched here. They guard the call sites; this PR provides the production helper they were guarding.
agent-dev-b added 1 commit 2026-06-13 07:25:46 +00:00
fix(core#2693): normalize system:restart-context callerID + register-400 diagnostics
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
sop-checklist / review-refire (pull_request_target) Has been skipped
Harness Replays / Harness Replays (pull_request) Successful in 2s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 14s
security-review / approved (pull_request_target) Failing after 8s
qa-review / approved (pull_request_target) Failing after 8s
reserved-path-review / reserved-path-review (pull_request_target) Failing after 8s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
CI / Detect changes (pull_request) Successful in 18s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 15s
gate-check-v3 / gate-check (pull_request_target) Failing after 13s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 19s
CI / Canvas (Next.js) (pull_request) Successful in 2s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 21s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
CI / Canvas Deploy Status (pull_request) Successful in 0s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 28s
E2E Chat / detect-changes (pull_request) Successful in 42s
E2E Chat / E2E Chat (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 53s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 35s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m12s
CI / Platform (Go) (pull_request) Successful in 2m28s
CI / all-required (pull_request) Successful in 3s
audit-force-merge / audit (pull_request_target) Has been skipped
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Waiting to run
e37ff145c0
(a) Synthetic callerID normalization. The restart-context path
(restart_context.go:296) issues ProxyA2ARequest with callerID=
"system:restart-context". Pre-fix that literal string reached
activity_logs.source_id (UUID-typed, see migration 009) via the
logA2A* helpers in a2a_proxy_helpers.go, either failing the UUID
INSERT or poisoning the column. New callerIDToSourceID helper
returns nil for system-prefixed callers ("system:", "webhook:",
"test:", "channel:"), empty callers, and any non-UUID caller
(defense in depth for future caller paths that forget the system:
prefix). Wired into all 5 SourceID sites in a2a_proxy_helpers.go
(logA2AFailure, logA2ABusyQueued, logA2ASuccess,
logA2AReceiveQueued, persistUserMessageAtIngest). The restart-
context request body, summary, and system-caller trace are still
preserved in the activity row; only source_id is now nil.

(b) Register-400 diagnostics. Pre-fix boot_register_failed reported
status=400, which was the same line for five distinct causes
(invalid body / invalid delivery_mode / invalid kind / url-
required for push / url fails SSRF). New deferred
registry_register_400 reason= line + reason folded into the
existing boot_register_failed line on the four paths that return
LATE enough for the existing defer to fire. PII/secret-leak guard:
the url_validate_failed detail carries the friendly CIDR label
from validateAgentURL, not the raw URL.

Refs #2680, #2530, #2693.

Test coverage:
- TestCallerIDToSourceID: 10 subtests (empty, 4 system prefixes,
  valid UUID, 3 non-UUID shapes incl. malformed UUID)
- TestRegister_400_*_LogsReason: 5 tests (one per 400 path)
- TestRegister_200_DoesNotEmit400ReasonLog: pin the contract
  that 200 responses do NOT emit a registry_register_400 line
- TestRegister_400_URLValidateFailed_LogsReasonWithoutRawURL:
  pin the PII/secret-leak guard (no raw URL in the log)
agent-dev-b closed this pull request 2026-06-13 07:29:09 +00:00
Author
Member

Closing as duplicate. The production-code fix for #2693 / #2680-residual was already merged via #2710 (head a1a1192f, included in main at 714ed320). Verified by inspecting origin/main: callerIDToSourceID helper is in workspace-server/internal/handlers/a2a_proxy_helpers.go; logRegister400Reason is in workspace-server/internal/handlers/registry.go with all 5 reasons (invalid_json / invalid_delivery_mode / invalid_kind / url_required_for_push / url_validate_failed), URL redaction via urlPresence() (RC #11335), and all 5 SourceID sites already use callerIDToSourceID. Nothing for me to add here — #2722 was opened before I checked origin/main, which was the verify-first step I should have run first. Apologies for the noise.

Closing as duplicate. The production-code fix for #2693 / #2680-residual was already merged via #2710 (head a1a1192f, included in main at 714ed320). Verified by inspecting origin/main: callerIDToSourceID helper is in workspace-server/internal/handlers/a2a_proxy_helpers.go; logRegister400Reason is in workspace-server/internal/handlers/registry.go with all 5 reasons (invalid_json / invalid_delivery_mode / invalid_kind / url_required_for_push / url_validate_failed), URL redaction via urlPresence() (RC #11335), and all 5 SourceID sites already use callerIDToSourceID. Nothing for me to add here — #2722 was opened before I checked origin/main, which was the verify-first step I should have run first. Apologies for the noise.
Some optional checks failed
CI / Python Lint & Test (pull_request) Successful in 5s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 7s
Harness Replays / detect-changes (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 7s
sop-checklist / review-refire (pull_request_target) Has been skipped
Harness Replays / Harness Replays (pull_request) Successful in 2s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 14s
security-review / approved (pull_request_target) Failing after 8s
qa-review / approved (pull_request_target) Failing after 8s
reserved-path-review / reserved-path-review (pull_request_target) Failing after 8s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
E2E API Smoke Test / detect-changes (pull_request) Successful in 18s
CI / Detect changes (pull_request) Successful in 18s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request_target) Successful in 9s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 15s
gate-check-v3 / gate-check (pull_request_target) Failing after 13s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 19s
CI / Canvas (Next.js) (pull_request) Successful in 2s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Successful in 21s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 6s
Required
Details
CI / Canvas Deploy Status (pull_request) Successful in 0s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 28s
E2E Chat / detect-changes (pull_request) Successful in 42s
E2E Chat / E2E Chat (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 53s
Required
Details
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 35s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m12s
Required
Details
CI / Platform (Go) (pull_request) Successful in 2m28s
CI / all-required (pull_request) Successful in 3s
Required
Details
audit-force-merge / audit (pull_request_target) Has been skipped
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Platform Boot (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge user_tasks (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Workspace Requests (core#2606) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Creates Workspace (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge (compile+skip) (pull_request) Waiting to run
E2E Staging SaaS (full lifecycle) / E2E Staging Concierge Platform Agent (pull_request) Waiting to run

Pull request closed

Sign in to join this conversation.
No Reviewers
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2722