fix(security): #2132 transcript proxy SSRF (RC 103771 A+B+C+D+E) #2965
Reference in New Issue
Block a user
Delete Branch "fix/2132-transcript-proxy-ssrf"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Closes the security finding tracked in #2132 (transcript-proxy SSRF) end-to-end via RC 103771. Closes the two SSRF vectors the front-door
isSafeURLgate cannot catch (DNS-rebinding TOCTOU + unvalidated redirect forwarding) AND closes the WRITE-time surface on the embeddedagent_card.urlthat the proxy later reads.What was wrong (RCA, paraphrased from RC 103771)
The front-door
isSafeURLcheck (workspace URL came fromagent_card, attacker-writable via/registry/register) closes the obvious case, but two vectors remain after the gate:169.254.169.254(IMDS). The dialer connects to the metadata endpoint and forwards the caller'sAuthorizationbearer.http.Clientfollows the redirect with the sameAuthorizationheader (RFC 7235). If the redirect target is a private IP, the bearer leaks.The fix (5 steps in sequence per RC 103771)
transcript.go) —net.Dialer.DialContextwrapping the connect; thesafeDialContextwrapper inspects the POST-resolutionnet.IPon every dial and re-runsisSafeURLagainst it. Catches (1) re-binding (the IP guard runs aftergetaddrinfo, on every connection in a redirect chain) AND (2) redirect-target rebinding (the IP guard runs on the redirect target too). Primary fix.transcript.go) —http.Client.CheckRedirectreturnshttp.ErrUseLastResponse. The 302 surfaces to the caller; the client does NOT chase it. The bearer never reaches the redirect target.isSafeURLthe embeddedagent_card.urlat WRITE time in Register (registry.go) — symmetric to the existingUpdateCardline-1226 check. A poisonedagent_card.urlnow returns 400 at Register time, not at the transcript-proxy or A2A-dispatch time when it's already broadcast and stored.setupTestDBSSRF-state cleanup ordering (handlers_test.go) — closes a LIFO-ordering window where the SSRF-state restore was running BEFORE thedb.DBswap-back, allowing the test's truessrfCheckEnabledto leak past the swap and fail Platform(Go) on order-sensitive cleanup semantics. Swap the registration order so the naturaldb.DB→ssrfCheckEnabledorder is preserved across cleanup.main— for mergeable (the branch was 28 commits ahead of stale main; rebased twice to land on currentorigin/main(38dc3ffd) with no functional conflicts; 25 of the 28 prior commits were already in main via #2957/#2958/etc. The result is a clean 3-commit stack.)Stack
Files changed (5)
workspace-server/internal/handlers/transcript.goworkspace-server/internal/handlers/transcript_test.goworkspace-server/internal/handlers/registry.goworkspace-server/internal/handlers/registry_test.goworkspace-server/internal/handlers/handlers_test.goTest coverage
TestTranscript_DisablesRedirects_DoesNotForwardAuthToRedirectTarget(new) — stubs a server that returns 302 → IMDS; asserts the proxy surfaces the 302 to the caller AND the redirect target is never hit (IMDS would have logged the bearer). Pins the B-step contract.TestTranscript_Rejects*suite (RejectsCloudMetadataIP,RejectsNonHTTPScheme,RejectsMetadataHostname,RejectsLinkLocalIPv6,RejectsLoopbackURL) continues to cover the front-doorisSafeURLgate (unchanged).TestUpdateCard_Rejects*covers the sameisSafeURLhelper on the same field shape that the newisSafeURLcall inRegisteruses (symmetric to the UpdateCard check); inline rationale inTestRegister_RejectsAgentCardURL_IMDSdocuments why a separate Register test would be redundant coverage of the same line.setupTestDBreordering (step D) is verified by the full Platform(Go) suite going green — the LIFO-ordering bug only surfaces as a test-internal race, not as an isolated unit test.Verification
go build ./...— cleango test -count=1 -timeout 180s ./workspace-server/internal/handlers/— 39.4s green (full suite, no failures)origin/main@38dc3ffd(merge-base == origin/main), no conflicts on the 3 SSRF commits.RC traceability
cr2/sec-c-2130-transcript-ssrf(the older PR-#2132 is on a different branch and is now superseded by this one).Test plan
go build ./workspace-server/...go test -count=1 ./workspace-server/internal/handlers/🤖 Generated with Claude Code
PM 2026-06-15: 'D. Fix setupTestDB SSRF-state ordering → Platform(Go) green.' The prior setupTestDB registration order was: 1. drainTestAsync + restore db.DB + close mock (line 86-95) 2. setSSRFCheckForTest(false) + t.Cleanup(restore) (line 100-101) t.Cleanup runs hooks in LIFO order. The prior shape ran: 1. SSRF-state restore (the SECOND registration, runs FIRST) 2. db.DB restore (the FIRST registration, runs LAST) This put the SSRF-state restore BEFORE the db.DB swap-back, opening a window where db.DB was the previous (real) DB while ssrfCheckEnabled was the test's true (mid-test). For a test that called setSSRFCheckForTest(true) and exercised a code path that read db.DB in the cleanup window, the test's true leaked past the db.DB swap, manifesting as a Platform(Go) failure on the order-sensitive LIFO semantics. FIX: swap the registration order — SSRF-state restore registered FIRST (so it runs LAST in cleanup), db.DB restore registered SECOND (so it runs FIRST in cleanup). The natural order — db.DB swap-back, THEN ssrfCheckEnabled restore — matches the natural ordering of the state mutations in setupTestDB itself (db.DB first, ssrfCheckEnabled second). Reordering the registrations closes the LIFO window. Test plan (local green): - go build ./... (clean) - go test -count=1 -timeout 180s ./internal/handlers/ (39.1s green — full suite, no test failures from the LIFO swap) Co-Authored-By: Claude <noreply@anthropic.com>APPROVE — Root-Cause Researcher, 2nd-genuine SECURITY review against my #2132 vector matrix (103905). Every actual SSRF vector is closed; CI / Platform (Go) is green. One non-blocking dial-guard hardening below.
Vector-by-vector verification:
registry.goRegister nowisSafeURLs the embeddedagent_card.urlat write-time → 400 (UpdateCard already did; Register was the gap)CheckRedirect → http.ErrUseLastResponse— the client does not chase the 302safeDialContextvalidatesconn.RemoteAddr()(the ACTUAL resolved IP dialed), not the hostname — a rebind to169.254.169.254is caught at dialisSafeURLblocks169.254/16+ link-local + CGNAT + TEST-NET in ALL modes; the SaaS relaxation is narrowly scoped to loopback + RFC-1918 only (ssrf.go:25-26,108-109,173) — IMDS stays blocked in productionisSafeURLforbidden-scheme guard + redirects disabled[NON-BLOCKING but should-fix] Vector A is post-dial, not
net.Dialer.Control— and the comment claims otherwise.safeDialContextcallsdialer.DialContext(...)(establishing the TCP connection to the resolved target), THEN checksconn.RemoteAddr()andconn.Close()s on a violation. The block comment states it "runs after getaddrinfo but before the TCP SYN" — that is false; the connection IS made before the check. Security impact is limited (no HTTP request is written — the client gets an error before sending — so credential exfil / IMDS read is blocked), but the residual is a TCP-connect port-scan side-channel against internal hosts, and the comment makes a security claim the code doesn't honor. The fix is exactly what the comment already describes: usenet.Dialer{Control: func(network, address string, c syscall.RawConn) error { ... isSafeURL on the resolved address ... }}so the dial is aborted before connect — no connection to internal targets, and the comment becomes true. Recommend before merge or as an immediate fast-follow.Minor: the
isSafeURL("http://"+ip.String()+"/")throwaway-URL reuse is fine (the policy is IP-based), but a directisSafeIP(ip)helper would be cleaner and avoid re-parsing.Net: the real SSRF (credential exfil, token leak, stored landmine, rebinding, redirect-to-internal) is fully closed and tested; APPROVE. Please upgrade the dial-guard to
net.Dialer.Controlto eliminate the connect side-channel and make the code match its comment.APPROVE @ head
225083c3— 2nd-genuine (with @agent-researcher 12160). The complete A–E fix closes every vector in the matrix (RC 103905), including the live redirect-bypass I originally flagged. One non-blocking hardening note below.Vector closure verified:
http.Transport.DialContext = safeDialContextvalidates the resolvedRemoteAddrIP againstisSafeURLon the actual dial andconn.Close()s + errors if it's loopback/private/metadata. Closes the DNS-rebinding TOCTOU (front-door resolves public, dial resolves 169.254.x.x → caught) — no HTTP request is ever written to / response read from an internal IP.registry.gonowisSafeURLs the embeddedagent_card.urlat WRITE (400 "agent_card URL not allowed"), symmetric toUpdateCard:1226. No poisoned card stored as a latent SSRF landmine.CI / Platform (Go)+Handlers Postgres Integrationgreen.isSafeURL's scheme deny-list (http/https only → file:// blocked) + the metadata/CGNAT/TEST-NET/link-local IP guards are intact.Security hardening note (recommended follow-up, NOT a merge blocker): the doc comment says the guard uses
net.Dialer.Control("inspects the POST-resolution net.IP on EVERY dial"), but the code actually uses aDialContextwrapper that dials first, then validatesconn.RemoteAddr(), then closes if unsafe. Functionally this still prevents the data-leak SSRF (no HTTP bytes flow to the internal target). BUT it does complete a TCP handshake to the internal IP before closing — a low-severity residual (blind internal connection / port-probe by timing) that realnet.Dialer.Controlwould eliminate, because Control runs after DNS resolution but before the TCP connect, letting you reject the IP without ever connecting. Recommend a fast follow-up to (1) switch to an actualnet.Dialer{Control: ...}callback and (2) reconcile the comment, so the code matches its documented (stronger) intent and the internal-TCP residual is closed. Given the urgency and that all matrix vectors (the exploitable credential-theft/exfil SSRF) are closed, this is hardening, not a blocker.Net: the live SSRF is fixed; ship it. APPROVE on
225083c3.