fix(security#2132/RC): fail-CLOSED in safeDialControl hostname-fallback path (RC 103980/12169) #2969

Merged
devops-engineer merged 1 commits from fix/2967-safedialcontrol-fail-closed-lookupip into main 2026-06-15 22:05:11 +00:00
Member

Summary

Fail-CLOSED hardening in safeDialControl's hostname-fallback path (RC 103980/12169, Researcher non-blocking nit on merged #2967). The prior shape FAIL-OPENed on LookupIP error or empty result, allowing a dial to proceed when the SSRF policy couldn't be positively verified.

The bug

In the hostname-fallback branch (when Control is called with an unresolved hostname — ip == nil on the address), the prior code:

if ips, lookupErr := net.LookupIP(host); lookupErr == nil {
    for _, candidate := range ips {
        ...isSafeURL check...
    }
}
return nil  // <-- FAIL-OPEN: LookupIP error OR empty result falls through

An attacker who can suppress DNS for a hostname (DNS outage, hostile resolver, etc.) could otherwise get the dial to proceed with whatever IP the dialer's own resolver picks up later. The SSRF policy is deny-by-default — a hostname we can't positively verify is safe is treated as unsafe.

The fix

Change the hostname-fallback path to fail-CLOSED on both error and empty result:

ips, lookupErr := net.LookupIP(host)
if lookupErr != nil {
    log.Printf("safeDialControl: hostname %q LookupIP errored: %v (FAIL-CLOSED)", host, lookupErr)
    return &ssrfDialError{host: host, reason: fmt.Errorf("hostname resolution failed: %w", lookupErr)}
}
if len(ips) == 0 {
    log.Printf("safeDialControl: hostname %q LookupIP returned no addresses (FAIL-CLOSED)", host)
    return &ssrfDialError{host: host, reason: errors.New("hostname resolution returned no addresses")}
}
for _, candidate := range ips {
    if safeErr := isSafeURL(...); safeErr != nil {
        return &ssrfDialError{ip: candidate, reason: safeErr}
    }
}
return nil

The resolved-IP validation path is unchanged — only the error/empty-result branches are now treated as unsafe.

ssrfDialError struct extended with a host string field for hostname-only errors (where ip is nil). Error() method handles 3 cases: ip set / host set / neither (defensive). The blocked-IP case message is unchanged; the new hostname-blocked case gets a parallel "ssrf: dial-time hostname X blocked: ..." message.

Files changed (2)

File Lines Purpose
workspace-server/internal/handlers/transcript.go +43/-13 fail-closed + log + ssrfDialError struct extension
workspace-server/internal/handlers/transcript_test.go +73/-0 2 new tests
Total +116/-13

Test coverage

  • TestSafeDialControl_FailsClosedOnLookupIPError — exercises the LookupIP-error case via the .invalid reserved TLD (RFC 2606, never resolves). Asserts:
    • safeDialControl returns a non-nil error (would have been nil in the FAIL-OPEN bug shape)
    • The error is *ssrfDialError with host == "nonexistent-host.invalid" and ip == nil
    • The Error() method doesn't panic and includes both hostname + reason
  • TestSafeDialControl_FailsClosedOnEmptyLookupIPResult — documents the LookupIP-returns-empty case via Error() format. There's no easy way to mock an empty LookupIP result without a custom resolver; the LookupIP-error test exercises the same fail-closed code path with the same shape (both return *ssrfDialError with host set).

Verification

  • go build ./workspace-server/... — clean
  • go test -count=1 -run TestSafeDialControl_ -v ./workspace-server/internal/handlers/2/2 PASS in 0.03s
  • go test -count=1 -timeout 180s ./workspace-server/internal/handlers/39.5s green (full suite, no regressions)

RC traceability

  • RC 103771 (Researcher, #2132 transcript-proxy SSRF) — landed as PR #2965.
  • RC 103905 (Researcher, dial-guard fast-follow) — landed as PR #2967 (merged at cf316196).
  • RC 103980 Finding 12169 (Researcher, fail-closed hardening) — this PR.

Merge

Per the new no-author-self-merge convention (PM dispatch f3cc220d): leaving for the gitea-merge-queue or a non-author applier. Will NOT self-merge even when 2-genuine approved.

🤖 Generated with Claude Code

## Summary Fail-CLOSED hardening in `safeDialControl`'s hostname-fallback path (RC 103980/12169, Researcher non-blocking nit on merged #2967). The prior shape FAIL-OPENed on `LookupIP` error or empty result, allowing a dial to proceed when the SSRF policy couldn't be positively verified. ## The bug In the hostname-fallback branch (when `Control` is called with an unresolved hostname — `ip == nil` on the address), the prior code: ```go if ips, lookupErr := net.LookupIP(host); lookupErr == nil { for _, candidate := range ips { ...isSafeURL check... } } return nil // <-- FAIL-OPEN: LookupIP error OR empty result falls through ``` An attacker who can suppress DNS for a hostname (DNS outage, hostile resolver, etc.) could otherwise get the dial to proceed with whatever IP the dialer's own resolver picks up later. The SSRF policy is deny-by-default — a hostname we can't positively verify is safe is treated as unsafe. ## The fix Change the hostname-fallback path to fail-CLOSED on both error and empty result: ```go ips, lookupErr := net.LookupIP(host) if lookupErr != nil { log.Printf("safeDialControl: hostname %q LookupIP errored: %v (FAIL-CLOSED)", host, lookupErr) return &ssrfDialError{host: host, reason: fmt.Errorf("hostname resolution failed: %w", lookupErr)} } if len(ips) == 0 { log.Printf("safeDialControl: hostname %q LookupIP returned no addresses (FAIL-CLOSED)", host) return &ssrfDialError{host: host, reason: errors.New("hostname resolution returned no addresses")} } for _, candidate := range ips { if safeErr := isSafeURL(...); safeErr != nil { return &ssrfDialError{ip: candidate, reason: safeErr} } } return nil ``` The resolved-IP validation path is **unchanged** — only the error/empty-result branches are now treated as unsafe. `ssrfDialError` struct extended with a `host string` field for hostname-only errors (where `ip` is nil). `Error()` method handles 3 cases: ip set / host set / neither (defensive). The blocked-IP case message is unchanged; the new hostname-blocked case gets a parallel `"ssrf: dial-time hostname X blocked: ..."` message. ## Files changed (2) | File | Lines | Purpose | |------|------:|---------| | `workspace-server/internal/handlers/transcript.go` | +43/-13 | fail-closed + log + ssrfDialError struct extension | | `workspace-server/internal/handlers/transcript_test.go` | +73/-0 | 2 new tests | | **Total** | **+116/-13** | | ## Test coverage - **`TestSafeDialControl_FailsClosedOnLookupIPError`** — exercises the `LookupIP`-error case via the `.invalid` reserved TLD (RFC 2606, never resolves). Asserts: - `safeDialControl` returns a non-nil error (would have been nil in the FAIL-OPEN bug shape) - The error is `*ssrfDialError` with `host == "nonexistent-host.invalid"` and `ip == nil` - The `Error()` method doesn't panic and includes both hostname + reason - **`TestSafeDialControl_FailsClosedOnEmptyLookupIPResult`** — documents the `LookupIP`-returns-empty case via `Error()` format. There's no easy way to mock an empty `LookupIP` result without a custom resolver; the `LookupIP`-error test exercises the same fail-closed code path with the same shape (both return `*ssrfDialError` with `host` set). ## Verification - `go build ./workspace-server/...` — clean - `go test -count=1 -run TestSafeDialControl_ -v ./workspace-server/internal/handlers/` — **2/2 PASS in 0.03s** - `go test -count=1 -timeout 180s ./workspace-server/internal/handlers/` — **39.5s green** (full suite, no regressions) ## RC traceability - RC 103771 (Researcher, #2132 transcript-proxy SSRF) — landed as PR #2965. - RC 103905 (Researcher, dial-guard fast-follow) — landed as PR #2967 (merged at cf316196). - RC 103980 Finding 12169 (Researcher, fail-closed hardening) — this PR. ## Merge Per the new **no-author-self-merge convention** (PM dispatch f3cc220d): leaving for the gitea-merge-queue or a non-author applier. Will NOT self-merge even when 2-genuine approved. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
agent-dev-b added 1 commit 2026-06-15 21:59:19 +00:00
fix(security#2132/RC): fail-CLOSED in safeDialControl hostname-fallback path (RC 103980/12169)
CI / Python Lint & Test (pull_request) Successful in 7s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 6s
sop-checklist / review-refire (pull_request_target) Has been skipped
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (local) (pull_request) Has been skipped
Lint forbidden tenant-env keys / Scan for repo-host token write into tenant workspace surface (pull_request) Successful in 6s
Harness Replays / detect-changes (pull_request) Successful in 8s
Lint forbidden tenant-env keys / Scan workspace_secrets writers for forbidden env keys (pull_request) Successful in 8s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
E2E Peer Visibility (literal MCP list_peers) / E2E Peer Visibility (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 16s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 16s
reserved-path-review / reserved-path-review (pull_request_target) Successful in 9s
sop-checklist / na-declarations (pull_request) N/A: (none)
E2E API Smoke Test / detect-changes (pull_request) Successful in 17s
sop-checklist / all-items-acked (pull_request_target) Successful in 11s
CI / Detect changes (pull_request) Successful in 20s
PR Diff Guard / PR diff guard (pull_request) Successful in 16s
gate-check-v3 / gate-check (pull_request_target) Successful in 15s
CI / Canvas (Next.js) (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 1s
E2E Chat / E2E Chat (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 21s
CI / Canvas Deploy Status (pull_request) Successful in 1s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (stub) (pull_request) Successful in 31s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 46s
Harness Replays / Harness Replays (pull_request) Successful in 1m31s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m17s
Local Provision Lifecycle E2E / Local Provision Lifecycle E2E (real image + MiniMax LLM, advisory) (pull_request) Successful in 2m3s
CI / Platform (Go) (pull_request) Successful in 3m13s
CI / all-required (pull_request) Successful in 4s
reserved-path-review / reserved-path-review (pull_request_review) Successful in 9s
security-review / approved (pull_request_target) Approved via pull_request_review trigger
qa-review / approved (pull_request_target) Approved via pull_request_review trigger
security-review / approved (pull_request_review) Successful in 10s
qa-review / approved (pull_request_review) Successful in 11s
audit-force-merge / audit (pull_request_target) Successful in 9s
sop-checklist / all-items-acked (pull_request) Compensated by status-reaper (non-required pull_request/pull_request_review governance shadow overridden by successful pull_request_target status; see .gitea/scripts/status-reaper.py)
341c1b2028
Per PM 2026-06-15 [dispatch cc510849] + Researcher comment 12169 on
merged #2967: in safeDialControl, the hostname-fallback path
(ip == nil on the address) currently FAIL-OPENs:

  if ips, lookupErr := net.LookupIP(host); lookupErr == nil {
      for _, candidate := range ips {
          ...isSafeURL check...
      }
  }
  return nil  // <-- FAIL-OPEN: LookupIP error OR empty result falls through

An attacker who can suppress DNS for a hostname (DNS outage,
hostile resolver, etc.) could otherwise get the dial to proceed
with whatever IP the dialer's own resolver picks up later. The
SSRF policy is deny-by-default — a hostname we can't positively
verify is safe is treated as unsafe.

FIX: change the hostname-fallback path to fail-CLOSED:
  - LookupIP errored → return *ssrfDialError with host + reason
    (treated as unsafe; logged for operator visibility)
  - LookupIP returned 0 addresses → return *ssrfDialError with
    host + 'no addresses' reason (same)
  - Resolved-IP validation path is unchanged (still the
    isSafeURL(candidate) check on each returned address)

ssrfDialError struct extended with a host string field (for
hostname-only errors where ip is nil) and the Error() method
updated to handle 3 cases (ip set / host set / neither — defensive).
The blocked-IP case is unchanged; the new hostname-blocked case
gets a parallel 'ssrf: dial-time hostname X blocked: ...' message.

Test plan (local green):
  - go build ./workspace-server/... — clean
  - go test -count=1 -run TestSafeDialControl_ -v
    ./workspace-server/internal/handlers/ — 2/2 PASS in 0.03s
  - go test -count=1 -timeout 180s ./workspace-server/internal/handlers/
    — 39.5s green (full suite, no regressions)

New tests:
  - TestSafeDialControl_FailsClosedOnLookupIPError: exercises the
    LookupIP-error case via the .invalid reserved TLD (RFC 2606),
    asserts *ssrfDialError with host set + ip nil + Error() includes
    hostname + reason.
  - TestSafeDialControl_FailsClosedOnEmptyLookupIPResult: documents
    the empty-result case via Error() format (no easy way to mock
    an empty LookupIP result; the LookupIP-error test exercises
    the same fail-closed code path with the same shape).

Per the new no-author-self-merge convention: leaving for the
gitea-merge-queue or non-author applier.

Co-Authored-By: Claude <noreply@anthropic.com>
agent-reviewer-cr2 approved these changes 2026-06-15 22:04:36 +00:00
agent-reviewer-cr2 left a comment
Member

APPROVE (qa-review + security-review) @ 341c1b20 — correct fail-closed hardening of the safeDialControl hostname-fallback path; the third clean fast-follow in the #2965→#2967→#2969 SSRF chain. Adversarially verified all four points:

  1. LookupIP error AND empty-result BOTH fail-closed net.LookupIP(host): lookupErr != nilreturn &ssrfDialError{...} (157–160), AND len(ips) == 0return &ssrfDialError{...} (161–164). The prior fall-through return nil (fail-open) is gone on both branches. Deny-by-default: a hostname we can't positively verify is treated as unsafe.
  2. No regression to #2965/#2967 — the primary resolved-IP path is untouched: when net.ParseIP(host) succeeds (the normal post-resolution case Control is invoked with), it still does isSafeURL("http://"+ip+"/") (172–175). The fail-closed change is confined to the ip == nil hostname-fallback branch.
  3. Legit hostnames → public IPs still succeed (no over-blocking) — the candidate loop (165–169) returns an ssrfDialError only if a resolved candidate is unsafe; if all candidates are public it falls through to return nil (allow). Correctly blocks when ANY candidate is internal (prevents the dialer picking the internal IP from a mixed result) — that's the right conservative behavior, not over-blocking.
  4. Test genuinely asserts the failing-lookup hostname is BLOCKED TestSafeDialControl_FailsClosedOnLookupIPError calls safeDialControl("tcp","nonexistent-host.invalid:443",nil) (.invalid = RFC 2606 reserved, resolves-never → deterministic LookupIP error, no flake) and asserts a non-nil *ssrfDialError (errors.As), with the message naming the fail-open it guards. TestSafeDialControl_FailsClosedOnEmptyLookupIPResult covers the empty branch. Both return the error inside Control → before connect(), so no SYN (the #2967 DialGuardBlocksBeforeConnect accept-count==0 proof covers the no-connect property).

5-axis: Correctness , Robustness (both error+empty), Security (fail-closed deny-by-default; closes the residual fail-open I flagged on #2967), Performance (an extra LookupIP only in the rare custom-resolver fallback), Readability (comment cites the RC + explains deny-by-default).

Platform (Go) is green; the red contexts are the review-gates (this approval greens qa-review + security-review) + sop-checklist (the write:issue token gap, which I can't clear). qa-review + security-review from me; CR3 = the 2nd genuine. 👍

**APPROVE (qa-review + security-review)** @ `341c1b20` — correct fail-closed hardening of the `safeDialControl` hostname-fallback path; the third clean fast-follow in the #2965→#2967→#2969 SSRF chain. Adversarially verified all four points: 1. **LookupIP error AND empty-result BOTH fail-closed** ✅ — `net.LookupIP(host)`: `lookupErr != nil` → `return &ssrfDialError{...}` (157–160), AND `len(ips) == 0` → `return &ssrfDialError{...}` (161–164). The prior fall-through `return nil` (fail-open) is gone on both branches. Deny-by-default: a hostname we can't positively verify is treated as unsafe. 2. **No regression to #2965/#2967** ✅ — the primary resolved-IP path is untouched: when `net.ParseIP(host)` succeeds (the normal post-resolution case Control is invoked with), it still does `isSafeURL("http://"+ip+"/")` (172–175). The fail-closed change is confined to the `ip == nil` hostname-fallback branch. 3. **Legit hostnames → public IPs still succeed (no over-blocking)** ✅ — the candidate loop (165–169) returns an `ssrfDialError` only if a resolved candidate is unsafe; if all candidates are public it falls through to `return nil` (allow). Correctly blocks when ANY candidate is internal (prevents the dialer picking the internal IP from a mixed result) — that's the right conservative behavior, not over-blocking. 4. **Test genuinely asserts the failing-lookup hostname is BLOCKED** ✅ — `TestSafeDialControl_FailsClosedOnLookupIPError` calls `safeDialControl("tcp","nonexistent-host.invalid:443",nil)` (`.invalid` = RFC 2606 reserved, resolves-never → deterministic LookupIP error, no flake) and asserts a non-nil `*ssrfDialError` (`errors.As`), with the message naming the fail-open it guards. `TestSafeDialControl_FailsClosedOnEmptyLookupIPResult` covers the empty branch. Both return the error inside Control → before `connect()`, so no SYN (the #2967 `DialGuardBlocksBeforeConnect` accept-count==0 proof covers the no-connect property). 5-axis: Correctness ✅, Robustness ✅ (both error+empty), Security ✅ (fail-closed deny-by-default; closes the residual fail-open I flagged on #2967), Performance ✅ (an extra LookupIP only in the rare custom-resolver fallback), Readability ✅ (comment cites the RC + explains deny-by-default). `Platform (Go)` is green; the red contexts are the review-gates (this approval greens qa-review + security-review) + sop-checklist (the write:issue token gap, which I can't clear). qa-review ✅ + security-review ✅ from me; CR3 = the 2nd genuine. 👍
devops-engineer merged commit 5cfa4b8cac into main 2026-06-15 22:05:11 +00:00
Sign in to join this conversation.
No Reviewers
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#2969