fix(desktop): pin computer-use to 1:1 screenshot/click coords (core#2200) #89

Merged
hongming merged 3 commits from fix/2200-desktop-coord-1to1 into main 2026-06-04 03:54:14 +00:00
Owner

Resolves the runtime half of core#2200 — computer-use desktop agents SEE a target in the screenshot but MISS the click.

Root cause

tool_desktop_screenshot captures the display at native pixels (scrot, no resize) and tool_desktop_click clicks at native pixels (xdotool mousemove). Those two spaces are identical only if the model reasons over the screenshot at those same pixels. Claude's vision silently downscales any image above ~1.15 MP / 1568px long edge before the model sees it, so the 1920x1080 (2.07 MP) desktop desyncs screenshot-space from click-space — the model reads a coordinate off the downscaled image, xdotool clicks it in full-res space, and the click lands elsewhere. (The agent in the incident even guessed 2:1 / 4:1 scale factors and still missed.)
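
The desync can be illustrated with a little arithmetic. The sketch below is not code from this PR: the function names are hypothetical, the two bounds are the approximate figures quoted above, and the real resize rule inside the vision pipeline is assumed (not confirmed) to be an aspect-preserving shrink that satisfies both bounds.

```python
import math

# Approximate bounds quoted in the root-cause text (assumed, not exact).
MAX_LONG_EDGE = 1568      # px on the longer side
MAX_PIXELS = 1_150_000    # ~1.15 MP total

def downscale_factor(width: int, height: int) -> float:
    """Factor (<= 1.0) an aspect-preserving shrink needs to meet both bounds."""
    edge_factor = MAX_LONG_EDGE / max(width, height)
    area_factor = math.sqrt(MAX_PIXELS / (width * height))
    return min(1.0, edge_factor, area_factor)

def seen_to_native(x: int, y: int, width: int, height: int) -> tuple:
    """Map a coordinate read off the downscaled image back to native pixels."""
    factor = downscale_factor(width, height)
    return round(x / factor), round(y / factor)

if __name__ == "__main__":
    for size in ((1920, 1080), (1280, 800)):
        print(size, "factor:", round(downscale_factor(*size), 3))
    # A coordinate the model reads at (745, 400) on a downscaled 1920x1080
    # capture belongs to a different native pixel than (745, 400).
    print(seen_to_native(745, 400, 1920, 1080))
```

Under these assumptions a 1920x1080 capture shrinks by a factor of about 0.74, so passing the model's coordinate straight to xdotool lands well to the upper left of the target, while 1280x800 keeps a factor of 1.0 and needs no correction.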

Fix (this PR = runtime side)

  • tool_desktop_screenshot now returns the exact pixel space (width, height, vision_safe) by parsing the PNG IHDR chunk (no image library). The agent never has to infer a scale factor: the coordinates it reads off the screenshot are the coordinates tool_desktop_click consumes.
  • If a display ever exceeds the vision-safe bound, the screenshot result carries a loud warning instead of letting clicks silently miss.
  • _size_browser_window sizes to 1280x800 (WXGA), matching the resolution the provisioner pins.

The companion control-plane PR pins the Xvfb :99 display to 1280x800 (Anthropic's recommended computer-use resolution: 1.02 MP, 1280<1568 → no downscale → screenshot(x,y) == click(x,y) 1:1).
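
For readers unfamiliar with the IHDR approach, here is a minimal sketch of reading PNG dimensions without an image library and deriving a vision-safe flag. It is an illustration, not the code merged here: `png_dimensions`, `is_vision_safe`, and `describe_screenshot` are hypothetical names, the bounds are the approximate figures from the root-cause section, treating the bounds as inclusive is an assumption, and the result dict only mirrors the fields named in this PR (`ok`, `path`, `width`, `height`, `vision_safe`, `warning`).

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MAX_LONG_EDGE = 1568      # assumed bound, px
MAX_PIXELS = 1_150_000    # assumed bound, ~1.15 MP

def png_dimensions(path: str) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the IHDR chunk, or None if unreadable."""
    try:
        with open(path, "rb") as fh:
            header = fh.read(24)
    except OSError:
        return None  # missing or unreadable file
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then IHDR data starting with big-endian width and height.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE:
        return None
    if header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height

def is_vision_safe(width: int, height: int) -> bool:
    """Two clauses: long edge within bound AND total pixels within bound."""
    return max(width, height) <= MAX_LONG_EDGE and width * height <= MAX_PIXELS

def describe_screenshot(path: str) -> dict:
    """Build a result dict shaped like the fields this PR describes."""
    result = {"ok": True, "path": path}
    dims = png_dimensions(path)
    if dims is None:
        return result  # dimensions unknown; leave the base contract intact
    width, height = dims
    result.update(width=width, height=height,
                  vision_safe=is_vision_safe(width, height))
    if not result["vision_safe"]:
        result["warning"] = (
            f"{width}x{height} exceeds the vision-safe bound; "
            "screenshot and click coordinates may not match 1:1"
        )
    return result
```

Reading only the first 24 bytes keeps the check cheap and dependency-free, which is presumably why the PR avoids an image library.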

Tests

  • _png_dimensions parses IHDR / rejects non-PNG / missing file
  • screenshot reports width/height + vision_safe: true at 1280x800 (no warning)
  • screenshot flags vision_safe: false + warning at 1920x1080
  • existing browser-window assertion updated to 1280x800

14 passed.

Follow-up (noted, not blocking): full auto downscale-to-safe + click scale-back for arbitrary display sizes (needs ImageMagick on the host).
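
The coordinate half of that follow-up could look roughly like the sketch below. Nothing here exists in the repo: `ScaledView` and `plan_safe_view` are hypothetical names, the bounds are the approximate figures quoted earlier, and the actual image resize (the part that would need ImageMagick) is deliberately left out.

```python
import math
from dataclasses import dataclass

MAX_LONG_EDGE = 1568      # assumed bound, px
MAX_PIXELS = 1_150_000    # assumed bound, ~1.15 MP

@dataclass(frozen=True)
class ScaledView:
    native_w: int
    native_h: int
    view_w: int   # size of the image the model would be shown
    view_h: int

    def to_native(self, x: int, y: int) -> tuple:
        """Scale a click read off the view back into native display pixels."""
        # Map the centre of the view pixel, then clamp to the display.
        nx = int((x + 0.5) * self.native_w / self.view_w)
        ny = int((y + 0.5) * self.native_h / self.view_h)
        return min(nx, self.native_w - 1), min(ny, self.native_h - 1)

def plan_safe_view(native_w: int, native_h: int) -> ScaledView:
    """Pick an aspect-preserving view size that satisfies both bounds."""
    factor = min(
        1.0,
        MAX_LONG_EDGE / max(native_w, native_h),
        math.sqrt(MAX_PIXELS / (native_w * native_h)),
    )
    # Floor so rounding can never push the view back over a bound.
    view_w = max(1, math.floor(native_w * factor))
    view_h = max(1, math.floor(native_h * factor))
    return ScaledView(native_w, native_h, view_w, view_h)

if __name__ == "__main__":
    view = plan_safe_view(1920, 1080)
    print(view)
    print(view.to_native(view.view_w // 2, view.view_h // 2))
```

Because the runtime would perform the downscale itself, it would know the exact factor and could undo it on every click, rather than relying on the model to guess one.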

hongming added 2 commits 2026-06-04 03:40:51 +00:00
Claude's vision downscales any image > ~1.15 MP / 1568px before the
model reasons over it, so a 1920x1080 capture desyncs screenshot pixels
from xdotool click coords and clicks miss small targets. Surface the
exact pixel space (width/height + vision_safe) from tool_desktop_screenshot
so the agent never infers a scale, warn loudly above the bound, and size
the browser window to the WXGA 1280x800 the provisioner now pins.
test(desktop): cover screenshot dims + vision-safe guard (core#2200)
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 3s
ci / unit-tests (pull_request) Successful in 45s
ci / lint (pull_request) Successful in 1m7s
ci / smoke-install (pull_request) Successful in 1m10s
ci / build (pull_request) Successful in 1m20s
9434763ac2
hongming added 1 commit 2026-06-04 03:48:39 +00:00
test(desktop): boundary cases for each vision_safe clause (core#2200)
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 2s
ci / lint (pull_request) Successful in 26s
ci / unit-tests (pull_request) Successful in 49s
ci / build (pull_request) Successful in 1m1s
ci / smoke-install (pull_request) Successful in 1m3s
c4fea47332
cp-lead approved these changes 2026-06-04 03:54:03 +00:00
cp-lead left a comment
Member

Approve. RCA is correct — 1920x1080 screenshots exceed Claude vision downscale bound, desyncing pixel/click space. Surfacing width/height/vision_safe from tool_desktop_screenshot is the right fix; PNG IHDR parse + guard are sound. Tests cover both clauses incl boundary.
cp-qa approved these changes 2026-06-04 03:54:04 +00:00
cp-qa left a comment
Member

QA approve. 16 unit tests pass; boundary cases for each vision_safe clause added; browser-window 1280x800 assertion updated; JSON-contract (ok:true + path) preserved. Pairs with CP#516 display pin.
hongming merged commit 9f3d056281 into main 2026-06-04 03:54:14 +00:00

Reference: molecule-ai/molecule-ai-workspace-runtime#89